JP2013205890A

JP2013205890A - Machine learning system and machine learning method

Info

Publication number: JP2013205890A
Application number: JP2012071205A
Authority: JP
Inventors: Toshiyuki Yasuda; 俊行保田; Kazuhiro Okura; 和博大倉
Original assignee: Hiroshima University NUC
Current assignee: Hiroshima University NUC
Priority date: 2012-03-27
Filing date: 2012-03-27
Publication date: 2013-10-07
Anticipated expiration: 2032-03-27
Also published as: JP5916466B2

Abstract

PROBLEM TO BE SOLVED: To adaptively select between a parametrically expressed state space and a non-parametrically expressed state space. SOLUTION: A machine learning system (1) comprises: knowledge acquisition means (12) for generating a parametrically expressed class set through reinforcement learning based on a reward or penalty given for an input and the output produced in response to that input; knowledge reconstruction means (14) for generating a non-parametrically expressed class set based on the learned inputs used to generate the parametrically expressed class set; and knowledge utilization means (16) for performing class discrimination to determine which of the non-parametrically expressed classes an unknown input belongs to, and producing an output corresponding to the result of the discrimination. The knowledge reconstruction means (14) generates the non-parametrically expressed class set when the number of learned inputs is greater than a predetermined number and the variance of each of the parametrically expressed classes is less than a predetermined value.

Description

The present invention relates to machine learning, and more particularly to improving the robustness of knowledge acquisition by reinforcement learning.

When controlling a system, a top-down approach based on modeling is generally taken. However, control may become difficult owing to factors such as an increase in the scale of the system. In a bottom-up approach, by contrast, making the components of the system intelligent allows the system as a whole to acquire purposeful input/output relationships. Reinforcement learning is one such approach. Because it is easy to implement, in that an input/output sequence leading to a target state can be constructed autonomously simply by specifying that target state, reinforcement learning is expected to be applied to a wide variety of systems.

The conventional framework of reinforcement learning targets the construction of mappings in discrete state and action spaces. Learning performance is strongly affected by how the state and action spaces are discretized, yet no design guideline for this currently exists. This is a serious problem for the many real systems that operate in continuous spaces. As an approach to this state/action space design problem, the present inventors have researched and developed Bayesian-discrimination-function-based Reinforcement Learning (BRL), a functional extension of reinforcement learning. BRL has the ability to partition continuous state and action spaces autonomously. Furthermore, whereas convergence of conventional reinforcement learning is guaranteed only in Markov environments, BRL can adaptively update the partition during learning and can therefore also learn in dynamic environments. So far, the present inventors have taken robots, in particular multi-robot systems (MRS) composed of a plurality of robots, as real systems, and have demonstrated the effectiveness of BRL in cooperation problems involving groups of autonomous mobile robots and groups of arm-type robots.

In subsequent additional experiments, however, it was observed that when BRL continues learning after a behavior has been acquired, its robustness may gradually deteriorate. The cause is that rules that do not contribute to task achievement are deleted, while only the contributing rules are reinforced and remain in the rule set. In other words, the rule set in BRL becomes specialized to the environment, and the resulting overlearning makes the system unstable. The present inventors therefore recently focused on the high discrimination performance of the Support Vector Machine (SVM), a pattern recognition technique, and showed that rule discrimination by SVM is effective in suppressing overlearning in BRL (see, for example, Non-Patent Document 1).

J. Sakanoue, T. Yasuda, and K. Ohkura, "Preservation and Application of Acquired Knowledge Using Instance-Based Reinforcement Learning," Joint 5th International Conference on Soft Computing and Intelligent Systems and 10th International Symposium on advanced Intelligent Systems, 2010, pp. 576-581

Although it has been demonstrated that SVM is effective in suppressing overlearning in BRL, no concrete proposal has yet been made as to how SVM should be used for rule discrimination in BRL. In view of this problem, an object of the present invention is to provide a technique for adaptively selecting, in a machine learning system, between a parametrically expressed state space and a non-parametrically expressed state space.

A machine learning system according to one aspect of the present invention performs class discrimination to determine which class in a state space an input belongs to, produces an output corresponding to the discrimination result, and acquires knowledge adapted to an environment by repeating input and output. The system comprises: knowledge acquisition means for performing reinforcement learning based on a reward or penalty given for an input and the output to that input, thereby generating a parametrically expressed class set; knowledge reconstruction means for generating a non-parametrically expressed class set based on the learned inputs used to generate the parametrically expressed class set; and knowledge utilization means for performing class discrimination to determine which of the non-parametrically expressed classes an unknown input belongs to and producing an output corresponding to the discrimination result. The knowledge reconstruction means generates the non-parametrically expressed class set when the number of learned inputs is greater than a predetermined number and the variance of each parametrically expressed class is smaller than a predetermined value.

With this configuration, once reinforcement learning by the knowledge acquisition means has progressed sufficiently, the parametrically expressed class set generated by the knowledge acquisition means is reconstructed by the knowledge reconstruction means into a non-parametrically expressed class set, and the knowledge utilization means uses that non-parametrically expressed class set to perform class discrimination on unknown inputs.

Specifically, each parametrically expressed class is a multivariate normal probability distribution, and the knowledge acquisition means determines, in accordance with the Bayes discriminant method, which of the parametrically expressed classes an input belongs to.

Also specifically, the knowledge reconstruction means linearly separates the learned inputs using an SVM to generate the non-parametrically expressed class set.

According to the present invention, a machine learning system adaptively selects between a parametrically expressed state space and a non-parametrically expressed state space, which improves the robustness of the machine learning system. As a result, the machine learning system can respond flexibly to changes in the environment.

FIG. 1 is a functional block diagram of a machine learning system according to an embodiment of the present invention.
FIG. 2 is a flowchart of knowledge acquisition and use by the machine learning system of FIG. 1.
FIG. 3 is a schematic diagram showing the experimental environment of the computer experiments.
FIG. 4 is a graph showing the task achievement rate for each SVM.
FIG. 5 is a graph showing the number of episodes required for knowledge acquisition.

Embodiments for carrying out the present invention are described below with reference to the drawings. The present invention is not limited to the following embodiments.

(Embodiment of machine learning system)
FIG. 1 is a functional block diagram of a machine learning system according to an embodiment of the present invention. The machine learning system 1 according to the present embodiment performs class discrimination to determine which class in the state space an input belongs to, produces an output corresponding to the discrimination result, and acquires knowledge adapted to the environment 100 by repeating input and output. The machine learning system 1 can be applied, for example, to each robot constituting an MRS.

The machine learning system 1 includes knowledge acquisition means 12, knowledge reconstruction means 14, and knowledge utilization means 16. Each of these means may be implemented as hardware such as an electronic device, or as a software module executed by a computer. Each means is described in detail below.

(Detailed description of the knowledge acquisition means 12)
The knowledge acquisition means 12 performs reinforcement learning based on a reward or penalty given for an input from the environment 100 and the output to that input, and generates a parametrically expressed class set. For example, the knowledge acquisition means 12 performs reinforcement learning by BRL. That is, each parametrically expressed class is a multivariate normal probability distribution, and the knowledge acquisition means 12 determines, in accordance with the Bayes discriminant method, which of the parametrically expressed classes an input belongs to.

≪BRL≫
BRL uses the Bayes discriminant method, which classifies patterns statistically, to identify which class C_k (k = 1, ..., K) an input x is classified into. Given the set of classes to be identified, C = {C_k} (k = 1, ..., K), and, for each class, a known prior probability P(C_k) and probability distribution p(x | C_k), the Bayes discriminant method obtains the posterior probability P(C_k | x) of each class C_k from Bayes' formula when an input x is observed, and assigns the input to the class with the maximum posterior probability (see Expression (1)). BRL updates a probabilistic model of the environment from observed data in real time by (1) adding and deleting classes and (2) updating the parameters of the probability distribution models, and thereby partitions the state space.
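Expression (1) itself is not reproduced in this text. As a reading aid, the standard form of Bayes' formula and the maximum-posterior decision rule described above can be written as follows; the symbol \hat{k} for the index of the selected class is introduced here for illustration only.

```latex
P(C_k \mid x) = \frac{P(C_k)\, p(x \mid C_k)}{\sum_{j=1}^{K} P(C_j)\, p(x \mid C_j)},
\qquad
\hat{k} = \arg\max_{k} P(C_k \mid x)
```

Because the denominator is the same for every class, comparing the numerators P(C_k) p(x | C_k) is sufficient to find the maximizing class.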

≪Rule structure≫
Each class is represented by a Gaussian distribution, and the parameters of each class's probability distribution, together with the corresponding output, are stored in the knowledge acquisition means 12 as a rule written in if-then form. From here on, "class" and "rule" are treated as synonymous. The rule set R consists of rules rl ∈ R, and each rule is described by the following expression.

Each rule rl consists of a feature vector v = {v_1, ..., v_nd}^T, a covariance matrix Σ, a class prior probability f, an effectiveness u representing the reliability of the class, the set of sensor inputs observed in the class Φ = {φ_1, ..., φ_ns}^T, and an action a = {a_1, ..., a_na}^T. Here n_d is the number of dimensions of the input space, n_a is the number of dimensions of the output space, and n_s is the number of sample data stored in the class. At the start of learning the state space contains no classes; the machine learning system 1 adds classes to the state space based on the inputs and outputs it actually observes, gradually covering the state space with Gaussian distributions.
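The components of a rule listed above map naturally onto a small record type. The following is a minimal sketch, not the inventors' implementation: the field names follow the symbols in the text, while the class name `Rule`, the use of plain Python lists, and the `n_s` property are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    """One BRL rule (class): Gaussian parameters plus the associated action."""
    v: List[float]            # feature vector (mean of the class), length n_d
    sigma: List[List[float]]  # covariance matrix, n_d x n_d
    f: float                  # class prior probability
    u: float                  # effectiveness (reliability of the class)
    a: List[float]            # action, length n_a
    # Sensor inputs observed in this class (assumed to be stored as raw vectors).
    phi: List[List[float]] = field(default_factory=list)

    @property
    def n_s(self) -> int:
        """Number of sample inputs currently stored in this class."""
        return len(self.phi)


# The state space starts empty; rules are added as inputs are observed.
rule_set: List[Rule] = []
rule = Rule(v=[0.0, 0.0], sigma=[[1.0, 0.0], [0.0, 1.0]], f=0.5, u=1.0, a=[0.2, 0.3])
rule.phi.append([0.1, -0.1])
rule_set.append(rule)
```

Using `default_factory=list` gives every rule its own sample list, so samples appended to one rule never leak into another.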

≪Action selection≫
An outline of action selection in BRL is given below.

・ The class to which the perceived input belongs is determined by the Bayes discriminant method.

・ If the input does not belong to any existing rule, a random action is output, and a new rule is created unless a penalty is received.

・ If the input belongs to an existing rule, the action of that rule is output.

The posterior probability of each rule for the input is obtained from Bayes' formula, and the output described in the rule with the maximum posterior probability is executed. Here, the negative logarithm of the posterior probability is taken first, and the rule for which g_i, the probability of erroneous identification, is smallest is taken as the winner rule rl_w.

Because selecting a rule whose posterior probability is very small is considered inappropriate, a threshold P_th is set on the posterior probability. Whether to execute the action of rl_w is then decided using the threshold g_th = −log{f_0 · P_th} computed from it. Specifically, if g_w < g_th, the action A_w of rl_w is executed; if g_w ≥ g_th, an action is executed at random. Here f_0 and P_th are constants.
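The winner-rule selection and threshold test can be sketched as follows. The expression for g_i is not reproduced in this text, so the sketch assumes g_i = −log(f_i · p(x | C_i)) by analogy with g_th = −log{f_0 · P_th}, and further assumes a diagonal covariance to keep the example short. The dictionary layout and the function names `neg_log_score` and `select_action` are hypothetical.

```python
import math
import random


def neg_log_score(x, mean, var, prior):
    """g = -log(prior * N(x; mean, diag(var))), an assumed form of g_i."""
    log_p = 0.0
    for xj, mj, sj in zip(x, mean, var):
        log_p += -0.5 * math.log(2.0 * math.pi * sj) - (xj - mj) ** 2 / (2.0 * sj)
    return -(math.log(prior) + log_p)


def select_action(x, rules, f0, p_th, random_action):
    """Return (action, winner_index); winner_index is None for a random action."""
    g_th = -math.log(f0 * p_th)
    if not rules:
        return random_action(), None
    scores = [neg_log_score(x, r["v"], r["var"], r["f"]) for r in rules]
    w = min(range(len(rules)), key=scores.__getitem__)  # winner rule rl_w
    if scores[w] < g_th:
        return rules[w]["a"], w       # g_w < g_th: execute the rule's action
    return random_action(), None      # g_w >= g_th: act at random


rules = [
    {"v": [0.0, 0.0], "var": [1.0, 1.0], "f": 0.5, "a": [1.0, 1.0]},
    {"v": [5.0, 5.0], "var": [1.0, 1.0], "f": 0.5, "a": [-1.0, 1.0]},
]


def rand_action():
    return [random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)]


action, winner = select_action([0.1, -0.2], rules, f0=0.5, p_th=1e-3, random_action=rand_action)
far_action, far_winner = select_action([50.0, 50.0], rules, f0=0.5, p_th=1e-3, random_action=rand_action)
```

An input close to the first rule's mean selects that rule, while an input far from every rule exceeds g_th and falls back to a random action.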

≪Updating effectiveness≫
Rewards are propagated back into the past using strategies in the manner of Profit Sharing and Bucket Brigade. In addition, a cost is imposed on the selected rule to prevent looping behavior, and a dissipation is applied to all rules when the task is achieved, so that rules that do not contribute to reward acquisition are deleted and memory consumption is kept down.

≪Updating parameters≫
Each rule updates the parameters of its probability distribution online based on the inputs. Updating in real time allows a rapid response to changes in the environment or the system. On the other hand, it is susceptible to noise and to temporary bias in the inputs, so some countermeasure is required. BRL addresses this problem by updating parameters with an interval estimation method. Interval estimation guarantees that the probability of a distribution parameter falling within a given interval is at least a specified probability, and its estimation accuracy improves as the amount of sample data grows. More reliable parameter estimates can therefore be expected as the observed data increase.

Here, v is the mean of rl_w, σ^2 is the variance of rl_w, α and β are constants in the interval [0, 1], j is each dimension of the input vector, x̄ is the mean of the sample inputs, s^2 is the variance of the sample inputs, and P is the reward.

≪Adaptive search of the action space based on the parameters of existing rules≫
Because always searching the action space at random is inefficient, once knowledge has been acquired to some extent it is considered more effective to search the neighborhood of existing rules and adjust the action than to search the action space broadly. Therefore, in addition to the threshold P_th for random search, a new threshold P'_th is set (P'_th < P_th), and when g_th ≤ g_w < g'_th, the parameters of a new rule are determined with reference to the parameters of the rules that fall in that range. That is, action selection is changed as follows.

・ If g_w < g_th, the action A_w of rl_w is executed.

・ If g_th ≤ g_w < g'_th, an action is generated based on the rules that fall in this range.

・ If g_w ≥ g'_th, an action is executed at random.
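The three cases above amount to a comparison of g_w against two thresholds. The sketch below assumes that g'_th is derived from P'_th in the same way that g_th is derived from P_th, that is, g'_th = −log{f_0 · P'_th}; the text does not state this explicitly. The function name `selection_mode` and the returned labels are hypothetical.

```python
import math


def selection_mode(g_w, f0, p_th, p_th_prime):
    """Classify the winner score g_w into one of the three action-selection cases."""
    if not 0.0 < p_th_prime < p_th:
        raise ValueError("expected 0 < P'_th < P_th")
    g_th = -math.log(f0 * p_th)
    g_th_prime = -math.log(f0 * p_th_prime)  # assumed form; larger than g_th
    if g_w < g_th:
        return "exploit"        # execute the action A_w of rl_w
    if g_w < g_th_prime:
        return "neighborhood"   # generate an action from the rules in this range
    return "random"             # execute a random action


modes = [selection_mode(g, f0=0.5, p_th=1e-2, p_th_prime=1e-4) for g in (3.0, 7.0, 12.0)]
```

Since P'_th < P_th implies g'_th > g_th, the three ranges are disjoint and cover every possible value of g_w.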

≪Action when g_th ≤ g_w < g'_th≫
Besides rl_w, this range may contain several other rules. Although these rules do not differ greatly in their selection probability in the current situation, their effectiveness differs according to how much each has contributed to task achievement in the learning process so far. The action A' of the new rule is therefore obtained as a weighted average based on the effectiveness of the rules included in this range.

n_r is the number of rules included in this range, and N(0, σ) is noise drawn from a normal distribution with mean 0 and standard deviation σ. Adding noise makes it possible to search the neighborhood of A_w even when the range contains no rule other than rl_w.
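The expression for A' is not reproduced in this text. A minimal sketch consistent with the description, assuming the weights are the effectiveness values u of the n_r rules in the range and that independent noise N(0, σ) is added to each action component, might look like this. The function name `neighborhood_action` and its argument layout are hypothetical.

```python
import random


def neighborhood_action(actions, effectiveness, sigma, rng=random):
    """Effectiveness-weighted average of rule actions plus Gaussian noise.

    actions:       list of n_r action vectors, each of length n_a
    effectiveness: list of n_r non-negative weights u
    sigma:         standard deviation of the exploration noise
    """
    total = sum(effectiveness)
    if total <= 0.0:
        raise ValueError("effectiveness values must sum to a positive number")
    n_a = len(actions[0])
    blended = [
        sum(u * a[k] for u, a in zip(effectiveness, actions)) / total
        for k in range(n_a)
    ]
    return [b + rng.gauss(0.0, sigma) for b in blended]


# With sigma = 0 the result is the plain weighted average.
a_prime = neighborhood_action([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0], sigma=0.0)
# With a single rule, noise alone explores the neighborhood of A_w.
a_single = neighborhood_action([[0.5, 0.5]], [1.0], sigma=0.1)
```

The single-rule case shows why the noise term matters: without it, the generated action would simply equal A_w.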

(Detailed description of the knowledge reconstruction means 14 and the knowledge utilization means 16)
The knowledge reconstruction means 14 generates a non-parametrically expressed class set based on the learned inputs that the knowledge acquisition means 12 used to generate the parametrically expressed class set. For example, the knowledge reconstruction means 14 uses an SVM to linearly separate the learned inputs of the knowledge acquisition means 12 and thereby generates the non-parametrically expressed class set. The knowledge utilization means 16 determines which of the non-parametrically expressed classes an unknown input from the environment 100 belongs to, and produces an output corresponding to the discrimination result.

≪Knowledge reconstruction and use by SVM≫
SVM has been shown in a variety of fields to have high discrimination performance. Discriminating the data of the knowledge acquired by BRL with an SVM can therefore be expected to enable more accurate action decisions.

The present inventors have previously shown that using knowledge through SVM is not effective while BRL has not yet acquired rules effective for achieving the task and the sample data are insufficient (see Non-Patent Document 1). An index is therefore provided so that the knowledge reconstruction means 14 is operated, and the acquired knowledge reconstructed, at the point when learning in the knowledge acquisition means 12 has progressed sufficiently and the behavior of the machine learning system 1 has begun to stabilize.

First, as a precondition for using knowledge through SVM, the sample data must be sufficiently numerous; that is, the number of learned inputs used by the knowledge acquisition means 12 to generate the parametrically expressed class set must be sufficiently large. The first index is therefore that the number of learned inputs exceeds a predetermined number.

Even when the number of sample data is sufficiently large, SVM may fail to achieve adequate discrimination accuracy if the parametrically expressed classes include a class with a large variance. The second index is therefore that the variance of each parametrically expressed class (rule) is smaller than a predetermined value. This index is set, for example, using the covariance matrix Σ that is a component of each rule. The value of Σ represents the extent of the rule in the state space, and as the behavior begins to converge, the value of Σ converges as well. Thus, for example, the knowledge reconstruction means 14 computes for each episode the average Σ_eps of |Σ| over the rules and, when its difference from the average Σ_{eps−1} for the preceding episode is at or below a threshold, performs knowledge reconstruction by SVM to generate the non-parametrically expressed class set. The knowledge utilization means 16 then uses that non-parametrically expressed class set to perform class discrimination, that is, rule selection by SVM.
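The two indices can be combined into a single switching test, sketched below under stated assumptions: |Σ| is read as the determinant of the covariance matrix, the per-episode value is the plain average over rules, and the minimum sample count `min_inputs` is a hypothetical parameter whose value the text does not give. The helper names are likewise hypothetical.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def mean_det(covariances):
    """Average of |Sigma| over all rules in one episode (Sigma_eps)."""
    return sum(determinant(s) for s in covariances) / len(covariances)


def should_reconstruct(n_learned, min_inputs, sigma_eps, sigma_prev, threshold):
    """True when both indices hold: enough samples and a settled |Sigma| average."""
    return n_learned > min_inputs and abs(sigma_eps - sigma_prev) <= threshold


prev = mean_det([[[0.4, 0.0], [0.0, 0.3]], [[0.5, 0.0], [0.0, 0.24]]])
curr = mean_det([[[0.4, 0.0], [0.0, 0.25]], [[0.5, 0.0], [0.0, 0.24]]])
switch = should_reconstruct(500, 100, curr, prev, threshold=0.02)
too_few = should_reconstruct(50, 100, curr, prev, threshold=0.02)
```

Until the test returns true, learning would continue with BRL alone; once it does, the stored inputs would be handed to the SVM for reconstruction.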

In addition, a new threshold P_s (P_s > P_th) is set in order to make the range in which a rule is judged applicable wider than in BRL. Widening the threshold is expected to suppress the generation of new rules and prevent the behavior from becoming unstable. An outline of action selection in the machine learning system 1 is given below, and FIG. 2 outlines knowledge acquisition and use by the machine learning system 1.

・ Based on the perceived input, whether to perform knowledge exploration with BRL or knowledge utilization with SVM is decided by comparing the value of Σ'_eps with a threshold.

・ When knowledge exploration is performed with BRL, a random action is output, and a new rule is created unless a penalty is received.

・ When knowledge utilization is performed with SVM, the state space is re-partitioned by SVM, and the action of the rule identified in the re-partitioned state space is output.

Expression (9) is one example of the index, and the present invention is not limited to it. For example, a moving average or weighted average of Σ'_eps may be used.

≪Multi-class classification SVMs≫
SVM is fundamentally formulated for two-class discrimination problems, but multi-class classification becomes possible by combining two-class discrimination models. Two ways of combining them are considered here: One-versus-All and One-versus-One. In One-versus-All, a discriminating hyperplane is created for every class that separates that class from all the others, and the class whose hyperplane returns the highest discriminant value is output. For an n-class problem, the number of hyperplanes is n. In One-versus-One, a discriminating hyperplane is created for each pair of classes, and the output is decided by majority vote. The number of hyperplanes is n(n−1)/2. These two multi-class schemes are introduced into the SVM used by the knowledge utilization means 16.
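The difference between the two schemes can be illustrated without any SVM library. In the sketch below the binary classifiers are abstracted away: `scores` stands for the discriminant values of the n One-versus-All hyperplanes, and `pair_winner` is a hypothetical callback returning the winning class for a given pair. Ties are broken in favor of the lowest class index, which is an arbitrary choice made for this example.

```python
from itertools import combinations


def n_planes_one_vs_all(n_classes):
    """One hyperplane per class."""
    return n_classes


def n_planes_one_vs_one(n_classes):
    """One hyperplane per unordered pair of classes: n(n-1)/2."""
    return n_classes * (n_classes - 1) // 2


def predict_one_vs_all(scores):
    """scores[k] is the discriminant value of the 'class k vs. rest' hyperplane."""
    return max(range(len(scores)), key=scores.__getitem__)


def predict_one_vs_one(n_classes, pair_winner):
    """Majority vote over all pairwise classifiers."""
    votes = [0] * n_classes
    for i, j in combinations(range(n_classes), 2):
        votes[pair_winner(i, j)] += 1
    return max(range(n_classes), key=votes.__getitem__)


ova_class = predict_one_vs_all([0.1, 0.9, -0.3])
ovo_class = predict_one_vs_one(3, lambda i, j: max(i, j))
planes_for_100_rules = (n_planes_one_vs_all(100), n_planes_one_vs_one(100))
```

For a rule set of 100 classes, One-versus-All needs 100 hyperplanes whereas One-versus-One needs 4950, which is the main practical cost difference between the two.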

≪Kernel functions used for SVM≫
The SVM used by the knowledge reconstruction means 14 employs the kernel trick, which enables nonlinear separation by mapping into a high-dimensional space. The kernels used are the linear kernel K_line, the polynomial kernel K_poly, the RBF kernel K_RBF, and the sigmoid kernel K_sig shown below, where u is a support vector and v is the feature vector to be classified.
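The kernel expressions themselves are not reproduced in this text. The sketch below uses the forms of these four kernels as defined in LIBSVM, the library the experiments build on; the parameter names `gamma`, `coef0`, and `degree` follow LIBSVM's conventions, and the default values chosen here are placeholders rather than the tuned values used in the experiments.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def k_line(u, v):
    """Linear kernel: u . v"""
    return dot(u, v)


def k_poly(u, v, gamma=1.0, coef0=0.0, degree=3):
    """Polynomial kernel: (gamma * u . v + coef0) ** degree"""
    return (gamma * dot(u, v) + coef0) ** degree


def k_rbf(u, v, gamma=1.0):
    """RBF kernel: exp(-gamma * ||u - v||^2)"""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq_dist)


def k_sig(u, v, gamma=1.0, coef0=0.0):
    """Sigmoid kernel: tanh(gamma * u . v + coef0)"""
    return math.tanh(gamma * dot(u, v) + coef0)


u = [1.0, 2.0]   # support vector
v = [3.0, 4.0]   # feature vector to classify
values = {
    "line": k_line(u, v),
    "poly": k_poly(u, v, gamma=1.0, coef0=1.0, degree=2),
    "rbf": k_rbf(u, u, gamma=0.5),
    "sig": k_sig([0.0, 0.0], v),
}
```

Each function takes the support vector and the feature vector in the same order as the text, so any of them can be passed to a decision function as the kernel argument.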

(Computer experiment)
As a knowledge acquisition task for BRL, a piano-carrying problem with two robots is taken up and computer experiments are performed. The piano-carrying problem is one in which robots cooperate to carry to a goal a long object that a single robot cannot carry alone. To pass through a narrow passage, the two robots must cooperate and adopt a formation.

The experimental environment is shown in FIG. 3(a). The field is enclosed by walls on all four sides, and the task is achieved when the robots move from the initial position to the goal line. The robots are of the differential-drive type with two drive wheels. Each robot perceives the state of the other robot with an omnidirectional camera, and the robots do not communicate with each other. One decision by a robot counts as one step; an episode ends when the task is achieved or 400 steps have elapsed, whereupon the robots are returned to their initial positions. Learning is defined as successful when the task is achieved in 20 consecutive episodes. One trial runs for up to 500 episodes, and 100 trials are performed. The simulation environment was built with the open-source three-dimensional physics engine ODE (Open Dynamics Engine).
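The learning-success criterion (the task achieved in 20 consecutive episodes within a 500-episode trial) can be checked with a simple streak counter. This is an illustrative sketch; the function name `first_success_episode` and the representation of a trial as a list of booleans are assumptions.

```python
def first_success_episode(outcomes, required=20):
    """Return the 1-based episode at which `required` consecutive successes
    are first completed, or None if that never happens in the trial."""
    streak = 0
    for episode, achieved in enumerate(outcomes, start=1):
        streak = streak + 1 if achieved else 0
        if streak >= required:
            return episode
    return None


# Five failed episodes followed by twenty successes: learning succeeds at episode 25.
learned_at = first_success_episode([False] * 5 + [True] * 20)
# A failure in the middle resets the streak, so this trial never qualifies.
never = first_success_episode([True] * 19 + [False] + [True] * 19)
```

A trial for which the function returns None after 500 episodes would be counted as a learning failure.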

≪Experiment 1≫
Learning is performed using BRL alone. For each trial in which learning succeeded, the environment is changed as shown in FIG. 3(b), and action selection is performed with SVM alone in order to verify the effect of knowledge utilization by SVM. The environmental change is a narrowing of the passage; because this raises the difficulty of the task, the acquired knowledge must be discriminated accurately according to the state. The multi-class SVMs used here comprise eight patterns, formed by combining the two multi-class schemes with the four kernel functions. The combinations are denoted by the symbols A to H shown in Table 1. Among the trials in which learning succeeded, the proportion that also achieved the task after the environmental change is observed.

≪Experiment 2≫
For the combination patterns with high success rates in Experiment 1, learning is performed using the machine learning system 1. The number of episodes required for learning is examined to verify the effectiveness of the machine learning system 1 for behavior acquisition. Through parameter tuning, the threshold on Σ'_eps for switching to SVM was set to 0.02.

≪Settings of machine learning system 1≫
The input is the 11-dimensional vector I = {r₀, cos θ₀, sin θ₀, r₁, cos θ₁, sin θ₁, r₂, cos θ₂, sin θ₂, cos θ₃, sin θ₃}. Here r and θ are the distance and angle to an object. Subscripts 0, 1, and 2 denote the goal line, the nearest wall, and the second-nearest wall, respectively, and subscript 3 denotes the neighboring robot. The output of the BRL is the two-dimensional vector O = {m₀, m₁} of the robot's left and right motor rotation speeds. The SVM is implemented with a modified version of the LIBSVM library (see http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html), and each parameter was set by tuning.
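The 11-dimensional input can be assembled from polar observations as in the sketch below. The function name `build_input`, the tuple layout, and the use of radians are assumptions made for illustration; the source specifies only the ordering of the components.

```python
import math
from typing import List, Tuple

Polar = Tuple[float, float]  # (distance r, angle theta in radians), assumed layout


def build_input(goal: Polar, wall_1: Polar, wall_2: Polar,
                robot_angle: float) -> List[float]:
    """Build I = {r0, cos0, sin0, r1, cos1, sin1, r2, cos2, sin2, cos3, sin3}.

    Subscripts 0, 1, 2 are the goal line, nearest wall, and second-nearest
    wall; subscript 3 is the neighboring robot, for which only the angle
    is used.
    """
    features: List[float] = []
    for r, theta in (goal, wall_1, wall_2):
        features += [r, math.cos(theta), math.sin(theta)]
    features += [math.cos(robot_angle), math.sin(robot_angle)]
    return features
```

Representing each angle by its cosine and sine keeps the encoding continuous where the raw angle would wrap around.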

≪Experimental result 1≫
Learning succeeded in 76 of the 100 trials. For these 76 trials, actions were selected by each of the SVMs A to H, and FIG. 4 shows the proportion of the 76 trials in which the task was accomplished. "BRL" in FIG. 4 denotes the case in which no random actions are taken and actions are determined by the Bayes discrimination method using only the knowledge held at the time learning succeeded. Patterns C, D, E, and F gave the highest task success rates: 78%, 71%, 78%, and 71%, respectively. Because BRL alone accomplishes the task after the environmental change in only 51% of the trials, discrimination performance is high when the kernel function is K_poly or K_RBF. The task success rate is higher than that of BRL for all of the patterns A to H, so the discrimination performance of the SVM can be said to be comparatively high. Comparing One-versus-All with One-versus-One for K_poly and K_RBF, One-versus-All has a slightly higher success rate. One-versus-All is generally said to be effective when the number of classes is large, for example 10,000, but in BRL, where the number of classes (rules) is at most 100, there is no difference in effectiveness.
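As background for the comparison of the two multi-class schemes, the number of binary SVMs each scheme trains can be computed directly: One-versus-All trains one classifier per class, and One-versus-One trains one per pair of classes. The helper name `num_binary_classifiers` below is hypothetical.

```python
def num_binary_classifiers(num_classes: int, scheme: str) -> int:
    """Count the binary classifiers needed for a multi-class problem."""
    if num_classes < 2:
        raise ValueError("at least two classes are required")
    if scheme == "one-versus-all":
        return num_classes  # one classifier per class
    if scheme == "one-versus-one":
        return num_classes * (num_classes - 1) // 2  # one per class pair
    raise ValueError(f"unknown scheme: {scheme}")
```

For the upper bound of 100 classes mentioned above, this gives 100 classifiers for One-versus-All and 4,950 for One-versus-One; at 10,000 classes the counts are 10,000 and 49,995,000.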

≪Experimental result 2≫
Based on the results of Experiment 1, Experiment 2 applied the machine learning system 1 to the four patterns C, D, E, and F. With 100 trials each, learning succeeded in 78, 79, 75, and 79 trials for C, D, E, and F, respectively. Compared with the 76 successful trials for BRL alone, the difference does not appear large. FIG. 5 shows the mean and standard deviation of the number of episodes required until learning succeeded. A t-test on these results showed a difference from BRL alone at the 1% significance level for patterns D and F. That is, with D or F, learning converges faster.
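The source does not state which form of t-test was applied. As one possible reading, the sketch below computes Welch's unequal-variance t statistic and its approximate degrees of freedom for two samples of episode counts; the function name `welch_t` is hypothetical, and comparing the statistic against a critical value for the 1% level is left to a statistics table or library.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Assumption: the two samples are independent and each has at least two
    values; the choice of Welch's form is not confirmed by the source.
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

A negative t for (SVM-assisted, BRL-only) episode counts would indicate that the SVM-assisted system needed fewer episodes on average.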

Judging from the results of Experiments 1 and 2 together, using the SVM of pattern D or F is effective for action acquisition and provides high discrimination accuracy.

The machine learning system and machine learning method according to the present invention are highly robust and can cope with environmental changes, and are therefore useful for multi-robot systems and the like. They are also useful beyond robots, for reinforcement learning in pattern recognition.

DESCRIPTION OF SYMBOLS
1 Machine learning system
12 Knowledge acquisition means
14 Knowledge reconstruction means
16 Knowledge utilization means
100 Environment

Claims

A machine learning system that performs class discrimination to determine which class in a state space an input belongs to, produces an output according to the discrimination result, and acquires knowledge adapted to an environment by repeating input and output, the system comprising:
knowledge acquisition means for performing reinforcement learning based on an input and a reward or punishment given to an output for the input, to generate a parametrically expressed class set;
knowledge reconstruction means for generating a non-parametrically expressed class set based on the learned inputs used to generate the parametrically expressed class set; and
knowledge utilization means for performing class discrimination to determine which of the non-parametrically expressed classes an unknown input belongs to, and producing an output according to the discrimination result,
wherein the knowledge reconstruction means generates the non-parametrically expressed class set when the number of learned inputs is larger than a predetermined number and the variance of each parametrically expressed class is smaller than a predetermined value.

The machine learning system according to claim 1, wherein
each parametrically expressed class is a multivariate normal probability distribution, and
the knowledge acquisition means performs class discrimination, according to a Bayes discrimination method, to determine which of the parametrically expressed classes the input belongs to.

The machine learning system according to claim 1 or 2, wherein
the knowledge reconstruction means linearly separates the learned inputs using a support vector machine to generate the non-parametrically expressed class set.

A machine learning method that performs class discrimination to determine which class in a state space an input belongs to, produces an output according to the discrimination result, and acquires knowledge adapted to an environment by repeating input and output, the method comprising:
a first step of performing reinforcement learning based on an input and a reward or punishment given to an output for the input, to generate a parametrically expressed class set;
a second step of generating a non-parametrically expressed class set based on the learned inputs used to generate the parametrically expressed class set; and
a third step of performing class discrimination to determine which of the non-parametrically expressed classes an unknown input belongs to, and producing an output according to the discrimination result,
wherein, in the second step, the non-parametrically expressed class set is generated when the number of learned inputs is larger than a predetermined number and the variance of each parametrically expressed class is smaller than a predetermined value.

The machine learning method according to claim 4, wherein
each parametrically expressed class is a multivariate normal probability distribution, and
in the first step, class discrimination is performed, according to a Bayes discrimination method, to determine which of the parametrically expressed classes the input belongs to.

The machine learning method according to claim 4 or 5, wherein,
in the second step, the learned inputs are linearly separated using a support vector machine to generate the non-parametrically expressed class set.