JP2007164406A

JP2007164406A - Decision making system with learning mechanism

Info

Publication number: JP2007164406A
Application number: JP2005358823A
Authority: JP
Inventors: Naomichi Sueda; 直道末田; Sasuke Shimoyama; 佐助下山
Original assignee: Oita University
Current assignee: Oita University
Priority date: 2005-12-13
Filing date: 2005-12-13
Publication date: 2007-06-28

Abstract

PROBLEM TO BE SOLVED: To develop a decision making system with a learning mechanism for increasing learning accuracy and learning efficiency when a space for selecting action is very large.
SOLUTION: When action decision branches are enormous, the decision making system for understanding situation and deciding an action appropriate for the situation comprises a learning section for learning rules for decision making and an action clustering section for hierarchically clustering enormous action decision branches, and dynamically executes learning of the learning section and clustering of the action clustering section.
COPYRIGHT: (C)2007, JPO&INPIT

Description

In recent years, expectations for intelligent software (software agents) have been rising. Such software makes autonomous judgments in a variety of situations and acts on the outside world. It is widely used in robots and in decision support systems that intelligently assist people.

To build intelligent software (hereinafter called an agent), the agent must judge the situation accurately and efficiently decide on the optimal action for it. At the scale of real problems, it is close to impossible to define, at system design time, the action to select for every possible situation. The agent therefore needs the ability to learn from experience and gradually adapt to its situation. In recent years, reinforcement learning has attracted attention as a method for realizing this ability.
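The passage that follows refers to "Equation 1" and to a discount rate γ and a learning rate α, but the equation itself does not appear in this text. Assuming it is the standard one-step Q-learning update of [Watkins 92], it would read as follows, where r_{t+1} (a symbol introduced here for illustration) is the reward received after taking action a_t in state s_t:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
```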

Here, γ (0 ≤ γ ≤ 1) is the discount rate and α (0 ≤ α ≤ 1) is the learning rate. This method guarantees that the state-action values converge to the optimal solution when the environment satisfies the Markov property.
However, the state S is basically expressed as a combination of discrete-valued parameters and is defined over a finite state space. Continuous-valued parameters therefore need some form of quantization (discretization), and various methods have been proposed for this. The space of the action values Q(s_t, a_t) is the combination of the state space and the set of actions selectable in each state. When the action parameters are continuous (for example, speed or pressure as a control quantity), various quantization schemes have been tried, as for the state space. However, no method has addressed the case where the actions available in a state S are discrete but number in the hundreds or thousands.
[Watkins 92] Watkins, C.J.H. & Dayan, P.: Technical Note: Q-Learning, Machine Learning, vol. 8, pp. 55-68 (1992)
[RL 98] Richard S. Sutton, Andrew G. Barto: Reinforcement Learning, MIT Press (1998)
[Kohonen 96] T. Kohonen, translated by Heizo Tokutaka, Satoru Kishida, Kikuo Fujimura: Self-Organizing Maps, Springer-Verlag Tokyo (1996)
[Yoshioka 03] Nobukazu Yoshioka, Yasuyuki Tahara, Shinichi Honiden: Active Content: Realizing Flexible Content Distribution with Mobile Agents, Transactions of the Information Processing Society of Japan (2003)

As described above, the object of the present invention is to improve learning accuracy and learning efficiency when the space of selectable actions is very large.

(1) A decision making system with a learning mechanism that understands a situation and decides on an action suited to that situation, the system comprising, for the case where the number of action options is enormous, a learning unit that learns rules for decision making and an action clustering unit that performs hierarchical clustering on the enormous set of action options, wherein the learning of the learning unit and the clustering of the action clustering unit are performed dynamically.
(2) The decision making system with a learning mechanism according to (1), wherein the action clustering unit can use both continuous and discrete values as attributes for clustering actions, and has a quantization unit for discretizing the continuous values.
(3) The decision making system with a learning mechanism according to (1), wherein the learning unit uses a reinforcement learning mechanism that also reinforces the state-action values of not-yet-experienced actions in the same class, and the action clustering unit performs clustering using a similarity computed from these state-action values.
(4) The decision making system with a learning mechanism according to (1), wherein the action clustering unit does not re-cluster after every action but updates the clustering at fixed intervals.

Experiments with the model described above yielded accurate learning results with very high learning efficiency, showing that the method is effective for large-scale problems of this kind.
That is, learning often fails to converge with the conventional method, whereas convergence is markedly improved with the method of the present invention. In addition, learning accuracy improves further when hierarchical clustering is used, combining conventional clustering with clustering based on the similarity of state-action values, than when conventional clustering is used alone.

The operation of the system of the present invention is described with reference to FIG. 1.
As the decision making system of the present invention, agent 1 makes an optimal decision based on information (state information) obtained from environment 2 and acts on it. The action changes the environment; the agent then perceives the new situation, makes an optimal decision, and acts again, repeating this cycle. Along the way the agent receives a reward (or punishment) when it reaches a desirable (or undesirable) state.
Specifically, the state recognition unit 11 perceives state information from environment 2 and recognizes which region of the state space the current state belongs to. When a state-space parameter is continuous, a quantization mechanism is applied at this point. Based on the recognized state s_t, the action selection unit 13 selects an action according to the state-action values Q(s_t, a_t) for that state. Performing the action changes environment 2. The state-action value Q(s_t, a_t) is then updated according to Equation 1, based on the reward (or punishment) from the environment.
The present invention assumes that the number of action options is very large. To reduce the action space, the action cluster unit 14 hierarchically classifies the many actions into clusters, and using the resulting abstract classes makes action selection and learning more efficient and more accurate.
One feature of the present invention is that clustering is performed based on the similarity of state-action values. This method is described with reference to FIG. 2.
At 21, all actions are stored in the list AL. At 22, the action a' at the top of the list is taken out, and the processing up to the loop end 211 is repeated until AL is empty. First, at 23, it is checked whether a' has ever been selected; if it has never been selected, process 212 (described later) is performed. If a' has been selected at least once, then at 24 the similarity of state-action values between a' and every action remaining in AL is examined. In the similarity calculation (25), the similarity between actions a_i and a_jn is obtained by the following Equation 2.

Among the actions in AL, the action a'' that is most similar to a' (smallest σ) is selected (27). At 28, if a' already belongs to a class, a'' is placed in the same class as a' (29). If a' does not yet belong to any class, a new class is created and both a' and a'' are assigned to it. Step 212 in FIG. 2 handles the case where action a' has never been selected: a new class is created for a', which remains the sole member of that class.
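The flow of steps 21 to 212 can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation. Equation 2 is not reproduced in this text, so the dissimilarity `sigma` below (root-mean-square difference of Q-values over all states) is an assumption. The names `sigma`, `similarity_clustering`, `q` and `visited` are hypothetical. Two further assumptions fill points the description leaves open: only actions that have been selected at least once are candidates for a'', and an action that already has a class keeps it.

```python
import math


def sigma(q, a, b):
    """Assumed dissimilarity between actions a and b (smaller = more similar):
    RMS difference of their Q-values over all states. Equation 2 is not shown
    in the source, so this formula is an assumption."""
    return math.sqrt(sum((row[a] - row[b]) ** 2 for row in q) / len(q))


def similarity_clustering(q, visited):
    """q[s][a] holds state-action values; visited[a] is True if action a
    has been selected at least once. Returns a dict: action -> class id."""
    al = list(range(len(visited)))        # step 21: put all actions in AL
    cls = {}
    next_id = 0
    while al:                             # step 22: loop until AL is empty
        a1 = al.pop(0)                    # take the action at the top of AL
        if not visited[a1]:               # step 23 -> 212: never selected
            cls[a1] = next_id             # singleton class
            next_id += 1
            continue
        if a1 not in cls:                 # step 28: a' has no class yet
            cls[a1] = next_id
            next_id += 1
        # Assumption: only experienced actions can be chosen as a''.
        candidates = [b for b in al if visited[b]]
        if not candidates:
            continue
        # steps 24-27: most similar remaining action (smallest sigma)
        a2 = min(candidates, key=lambda b: sigma(q, a1, b))
        # step 29: a'' joins the class of a' (assumption: unless it has one)
        cls.setdefault(a2, cls[a1])
    return cls


if __name__ == "__main__":
    q_table = [[1.0, 5.0, 1.1, 5.2, 0.0]]          # one state, five actions
    seen = [True, True, True, True, False]
    print(similarity_clustering(q_table, seen))
```

In the small example, actions 0 and 2 have nearly equal values and end up in one class, actions 1 and 3 in another, and the never-selected action 4 in a class of its own.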

Clustering based on the similarity of state-action values has been described above, but it must work together with the learning unit 12 in FIG. 1. With a very large number of actions, only a limited number can actually be selected and executed, so there will be many actions whose state-action values cannot be learned directly. For this reason, when an action is performed and its state-action value is updated, the learning unit also updates, with a certain degree of influence, the state-action values of the other actions in the same class. This allows even not-yet-experienced actions to be learned to some extent.
The update formula used in learning is given by Equation 3 below.

Here, ζ (0 ≤ ζ ≤ 1) represents the degree of influence of the selected action. The degree of influence is the extent to which an executed action affects the other actions belonging to the same class; the higher this value, the faster learning proceeds. The reasoning behind this form of learning is that actions within the same class are similar, so propagating the reward to actions other than the selected one should improve learning efficiency. In the early stage of learning, however, it is difficult to judge whether actions in the same class are really similar, so ζ should be set as low as possible.
As described above, clustering by similarity can improve learning efficiency. However, when the action space is very large, there is no guarantee that this clustering alone makes the action space sufficiently small. Hierarchical clustering is therefore introduced.
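A minimal Python sketch of such a class-propagated update is given below. Equation 3 is not reproduced in this text, so the exact form is an assumption: the selected action receives the ordinary Q-learning update, and every other action in the same class receives the same update scaled by ζ. The function name and parameter names are hypothetical.

```python
def update_with_class_propagation(q, cls, s, a, r, s_next,
                                  alpha=0.1, gamma=0.9, zeta=0.2):
    """Update q[s][a] after taking action a in state s and observing reward r
    and next state s_next. Actions sharing a's class are updated too, with
    their step scaled by zeta (assumed form; Equation 3 is not in the source).
    q is a list of lists; cls maps action -> class id."""
    target = r + gamma * max(q[s_next])           # one-step Q-learning target
    for b in range(len(q[s])):
        if b == a:
            q[s][b] += alpha * (target - q[s][b])          # selected action
        elif cls[b] == cls[a]:
            q[s][b] += zeta * alpha * (target - q[s][b])   # same-class action
    return q


if __name__ == "__main__":
    table = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    classes = {0: 0, 1: 0, 2: 1}
    update_with_class_propagation(table, classes, s=0, a=0, r=1.0, s_next=1,
                                  alpha=0.5, gamma=0.9, zeta=0.2)
    print(table[0])
```

With ζ = 0 this reduces to plain Q-learning, which matches the advice to keep ζ low while class membership is still unreliable.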

In FIG. 3, 3-1 shows clustering by similarity. In FIG. 4, 3-2 shows hierarchical clustering. The lower classes in 3-2 are created in advance by clustering on action attributes. Any existing clustering method can be used for this attribute-based clustering, such as clustering by ordinary multivariate analysis or the k-means method. The lower classes C1, C2, ... in 3-2 are classes produced by this ordinary clustering; these classes C1, C2, ... are then used as the action list at 21 in FIG. 2, and similarity clustering is applied to them to obtain C'1, C'2. For learning, the state-action values of the lower layer are calculated according to Equation 1. The state-action value of an upper class can be calculated as the average of the state-action values of the lower classes belonging to it.

As described above, learning can be performed efficiently even for problems with a very large action space.
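The averaging rule for upper classes can be illustrated with a short Python sketch. The function name and the dictionary layout (lower class id mapped to its value for the current state, and to its parent upper class) are assumptions made for this example.

```python
def upper_class_values(lower_q, parent):
    """lower_q: lower class id -> state-action value in the current state.
    parent:  lower class id -> upper class id.
    Returns upper class id -> mean value of its lower classes."""
    sums, counts = {}, {}
    for c, value in lower_q.items():
        p = parent[c]
        sums[p] = sums.get(p, 0.0) + value
        counts[p] = counts.get(p, 0) + 1
    return {p: sums[p] / counts[p] for p in sums}


if __name__ == "__main__":
    lower = {"C1": 1.0, "C2": 3.0, "C3": 5.0}
    tree = {"C1": "C'1", "C2": "C'1", "C3": "C'2"}
    print(upper_class_values(lower, tree))
```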

The action cluster unit 14 of FIG. 1, which is the core of the present invention, is described in detail using the following embodiment.
Consider a software agent. The agent holds several kinds of content (for example, video content) and visits customers to sell it. With a large number of customers, the agent learns which customers to visit in order to sell efficiently. On conventional networks, this could be done simply by broadcasting to all customers. In the future, however, such an approach is expected to become difficult. Having many agents visit every customer in a broadcast manner is a nuisance in itself and also raises security problems, so it is assumed that visits will become subject to restrictions and costs. The agent therefore needs to visit customers according to some strategy rather than blindly. It is also assumed that content cannot be copied and sold without limit; only the number of copies permitted by the seller may be copied and sold.

Under these assumptions, the operation of the agent is described. An agent that sells content is called a sales agent. Its operation is described with reference to FIG. 5.
At 41 in FIG. 5, the sales agent receives five kinds of products (stores them in its product list) and departs from the sales base.
At 42 in FIG. 5, the agent selects a customer to sell to. The selection proceeds top-down. From the upper layer (the clusters created by similarity clustering), the agent selects a class with a high state-action value Q(s_t, a). Next, among the action classes belonging to that class (the clusters created by ordinary clustering), it selects a class with a high state-action value. Finally, it selects one customer at random from the customer set (the leaves of the tree) contained in that class. Here the state s_t is the agent's current product assortment, and the action a is the destination, that is, the class to which the customer belongs. After the selection, the agent moves to the customer, carrying its products (content).
At 43 in FIG. 5, the agent negotiates with the visited customer and sells products. When a product is sold to the customer, it is removed from the product list (if the agent holds several of the same kind, the count is reduced by one). This changes the state s_t.
At 44 in FIG. 5, the agent receives a reward for the result of the negotiation. If no product was sold to the customer, the reward is 0. If products were sold, the reward is "number of products sold × 0.2". In addition, the agent receives a reward of −0.1 for each negotiation with a customer. In this way the sales agent learns a negotiation strategy.
At 45 in FIG. 5, the agent decides whether to visit another customer. In this embodiment, when the product list becomes empty or the agent has visited a fixed number of customers, it stops visiting, returns to the sales base, restocks its product assortment, and repeats the above process. One cycle of leaving the sales base and returning to it is called one episode.
At 46 in FIG. 5, the agent checks whether a fixed number of episodes has been reached, and if so, learning ends.
Because running this experiment in the real world would be very costly, the following customer model was created on a computer to demonstrate the usefulness of the present invention.
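Steps 42 and 44 can be sketched in Python as below. This is a hedged illustration: the text says only that a class "with a high" value is selected, so the purely greedy choice here is an assumption, and the function names and dictionary layout (`upper_q`, `lower_q`, `children`, `members`) are hypothetical. The values passed in are understood to be those for the current state s_t.

```python
import random


def negotiation_reward(num_sold):
    """Step 44: 0.2 per product sold, plus -0.1 for the negotiation itself."""
    return 0.2 * num_sold - 0.1


def select_customer(upper_q, lower_q, children, members, rng):
    """Step 42: pick the best upper class, then the best lower class inside
    it (greedy choice is an assumption), then a random customer (leaf)."""
    top = max(upper_q, key=upper_q.get)
    low = max(children[top], key=lambda c: lower_q[c])
    return low, rng.choice(members[low])


if __name__ == "__main__":
    upper = {"U1": 0.5, "U2": 0.1}
    lower = {"C1": 0.2, "C2": 0.8, "C3": 0.1}
    tree = {"U1": ["C1", "C2"], "U2": ["C3"]}
    customers = {"C1": [1, 2], "C2": [3, 4], "C3": [5]}
    chosen_class, customer = select_customer(upper, lower, tree, customers,
                                             random.Random(0))
    print(chosen_class, customer, negotiation_reward(2))
```

In practice some exploration (for example, occasionally choosing a non-greedy class) would be needed so that the values of unvisited classes can be learned; the source does not say which exploration scheme was used.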

<Customer model>
1. Gender {male, female} (discrete value)
2. Age {15-65} (continuous value)
3. Annual income {100-1000} (continuous value)
4. Occupation {student, company employee, unemployed} (discrete value)
5. Product preference {interested, neutral, not interested} (discrete value)
There is one product preference per product. This experiment uses data for 10 products, so each customer has 10 product preferences. The products are rock, classical, jazz, blues, metal, enka, healing, J-POP, trad, and R&B. The sales agent can freely view the parameters of the customer model. The one parameter it cannot view is the purchase probability. The purchase probability is an integer from 5 to 95 (%) determined by the degree of preference, and there is one for each product.
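One way to represent this customer model in code is sketched below in Python. The attribute ranges and value sets come from the list above. Everything else is an assumption made for illustration: the class and function names, the equal-width `quantize` helper for the continuous attributes, and in particular the mapping from preference level to a purchase-probability range, which the source does not specify.

```python
import random
from dataclasses import dataclass

GENRES = ["rock", "classical", "jazz", "blues", "metal",
          "enka", "healing", "J-POP", "trad", "R&B"]

# Assumed mapping from preference to purchase probability range (percent).
PROB_RANGE = {"interested": (65, 95), "neutral": (35, 64),
              "not interested": (5, 34)}


@dataclass
class Customer:
    gender: str
    age: int
    income: int
    occupation: str
    preference: dict      # genre -> preference level (visible to the agent)
    purchase_prob: dict   # genre -> percent, hidden from the agent


def quantize(value, low, high, bins):
    """Map a continuous value in [low, high] to an equal-width bin index."""
    idx = int((value - low) * bins / (high - low))
    return min(max(idx, 0), bins - 1)


def make_customer(rng):
    """Draw a random customer within the ranges given in the customer model."""
    pref = {g: rng.choice(list(PROB_RANGE)) for g in GENRES}
    prob = {g: rng.randint(*PROB_RANGE[pref[g]]) for g in GENRES}
    return Customer(rng.choice(["male", "female"]),
                    rng.randint(15, 65),
                    rng.randint(100, 1000),
                    rng.choice(["student", "company employee", "unemployed"]),
                    pref, prob)


if __name__ == "__main__":
    c = make_customer(random.Random(1))
    print(c.gender, c.age, quantize(c.age, 15, 65, 5))
```

The quantized attributes, together with the discrete ones, would then serve as the input to the attribute-based (lower-layer) clustering.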

The present invention can be used in large-scale content distribution systems as a sales function for various kinds of content on the Internet.

FIG. 1 is a functional block diagram of the present invention.
FIG. 2 is the processing flow of similarity clustering.
FIG. 3 shows one-stage clustering.
FIG. 4 shows two-level hierarchical clustering.
FIG. 5 is the operation flow of the sales agent.

Explanation of symbols

1 Agent
2 Environment
11 State recognition unit
12 Learning unit
13 Action selection unit
14 Action cluster unit
15 Quantization unit
3-1 Similarity cluster unit
3-2 Hierarchical cluster unit

Claims

1. A decision making system with a learning mechanism that understands a situation and decides on an action suited to that situation, the system comprising, for the case where the number of action options is enormous, a learning unit that learns rules for decision making and an action clustering unit that performs hierarchical clustering on the enormous set of action options, wherein the learning of the learning unit and the clustering of the action clustering unit are performed dynamically.

2. The decision making system with a learning mechanism according to claim 1, wherein the action clustering unit can use both continuous and discrete values as attributes for clustering actions, and has a quantization unit for discretizing the continuous values.

3. The decision making system with a learning mechanism according to claim 1, wherein the learning unit uses a reinforcement learning mechanism that also reinforces the state-action values of not-yet-experienced actions in the same class, and the action clustering unit performs clustering using a similarity computed from these state-action values.

4. The decision making system with a learning mechanism according to claim 1, wherein the action clustering unit does not re-cluster after every action but updates the clustering at fixed intervals.