JP7338858B2

JP7338858B2 - Behavior learning device, behavior learning method, behavior determination device, and behavior determination method

Info

Publication number: JP7338858B2
Application number: JP2019144121A
Authority: JP
Inventors: 由仁宮内; 安規男宇田; 恭聖山本
Original assignee: NEC Solution Innovators Ltd
Current assignee: NEC Solution Innovators Ltd
Priority date: 2019-08-06
Filing date: 2019-08-06
Publication date: 2023-09-05
Anticipated expiration: 2039-08-06
Also published as: WO2021025094A1; JP2021026518A

Description

The present invention relates to an action learning device, an action learning method, an action determination device, and an action determination method.

In recent years, deep learning using multilayer neural networks has attracted attention as a machine learning technique. Deep learning uses a computation called backpropagation to calculate the output error obtained when a large amount of training data is input to a multilayer neural network, and trains the network so as to minimize that error.

Patent Documents 1 to 3 disclose a neural network processing device that makes it possible to construct a neural network with little labor and a small amount of computation by defining a large-scale neural network as a combination of a plurality of sub-networks. Patent Document 4 discloses a structure optimization device that optimizes a neural network.

Patent Documents: JP-A-2001-051968; JP-A-2002-251601; JP-A-2003-317073; JP-A-09-091263

However, deep learning requires a large amount of high-quality training data and a long training time. Patent Documents 1 to 4 propose techniques for reducing the labor and the amount of computation needed to construct a neural network, but an action learning device that can learn actions with a simpler algorithm has been desired in order to further reduce the system load and the like.

An object of the present invention is to provide an action learning device and an action determination device that can learn and select actions according to the environment and the device's own situation with a simpler algorithm.

According to one aspect of the present invention, there is provided an action learning device comprising: an action selection unit that selects, based on situation information data representing the environment and the device's own situation, an action candidate to be executed on the environment; an evaluation acquisition unit that acquires a user's evaluation of the action candidate selected by the action selection unit, the evaluation indicating, together with a reason, a judgment to execute or not to execute the action candidate in the situation indicated by the situation information data; a slot generation unit that generates, based on the reason in the evaluation, a slot indicating a point of interest in the situation information data; and a user learning model generation unit that generates a user learning model in which the situation information data, the slot, and the judgment in the evaluation are linked to the action candidate.

According to another aspect of the present invention, there is provided an action determination device comprising: a storage unit that holds user learning models in each of which, for one of a plurality of action candidates, situation information data representing the environment and the device's own situation, a slot indicating a point of interest in the situation information data, and a judgment to execute or not to execute the action candidate in the situation indicated by the situation information data and the slot are linked; an action selection unit that selects, based on current situation information data representing the current environment and the device's own current situation, an action candidate to be executed on the environment; a user learning model extraction unit that extracts from the storage unit, from among the user learning models linked to the action candidate selected by the action selection unit, the user learning model whose situation information data has the highest suitability for the current situation information data; and an action determination unit that determines whether or not to execute the action candidate selected by the action selection unit, based on the relationship between the current situation information data and the slot of the extracted user learning model.

According to still another aspect of the present invention, there is provided an action learning method comprising the steps of: selecting, based on situation information data representing the environment and the device's own situation, an action candidate to be executed on the environment; acquiring a user's evaluation of the action candidate selected in the selecting step, the evaluation indicating, together with a reason, a judgment to execute or not to execute the action candidate in the situation indicated by the situation information data; generating, based on the reason in the evaluation, a slot indicating a point of interest in the situation information data; and generating a user learning model in which the situation information data, the slot, and the judgment in the evaluation are linked to the action candidate.

According to still another aspect of the present invention, there is provided an action determination method comprising the steps of: selecting, based on current situation information data representing the current environment and the device's own current situation, an action candidate to be executed on the environment; extracting, from among user learning models in each of which, for one of a plurality of action candidates, situation information data representing the environment and the device's own situation, a slot indicating a point of interest in the situation information data, and a judgment to execute or not to execute the action candidate in the situation indicated by the situation information data and the slot are linked, the user learning model that is linked to the action candidate selected in the selecting step and whose situation information data has the highest suitability for the current situation information data; and determining whether or not to execute the action candidate selected in the selecting step, based on the relationship between the current situation information data and the slot of the extracted user learning model.

According to the present invention, learning and selection of actions according to the environment and the device's own situation can be realized with a simpler algorithm. In addition, user comments on actions selected according to the situation information can be accumulated and used as know-how, enabling selection of more appropriate actions.

Brief description of the drawings

A schematic diagram showing a configuration example of the action learning device according to the first embodiment of the present invention.
A schematic diagram showing a configuration example of the situation learning unit in the action learning device according to the first embodiment of the present invention.
A schematic diagram showing a configuration example of the score acquisition unit in the action learning device according to the first embodiment of the present invention.
A schematic diagram showing a configuration example of the neural network unit in the action learning device according to the first embodiment of the present invention.
A schematic diagram showing a configuration example of a learning cell in the action learning device according to the first embodiment of the present invention.
A schematic diagram showing a configuration example of the usage learning unit in the action learning device according to the first embodiment of the present invention.
A flowchart showing the learning method of the situation learning unit in the action learning device according to the first embodiment of the present invention.
A diagram showing an example of situation information data generated by the situation information generation unit.
A diagram showing an example of situation information data generated by the situation information generation unit and its element values.
A flowchart showing the learning method of the usage learning unit in the action learning device according to the first embodiment of the present invention.
A diagram showing an example of situation information data generated from situation information by the situation information generation unit.
A diagram showing a display example of the situation information and of information on the action selected by the action selection unit, together with an example of a user episode.
A diagram showing an example of a method of generating a slot indicating a point of interest in situation information data.
A flowchart showing the action determination method in the action learning device according to the first embodiment of the present invention.
A diagram showing an example of a method of evaluating the suitability of a user learning model for situation information.
A schematic diagram showing a hardware configuration example of the action learning device according to the first embodiment of the present invention.
A flowchart showing the learning method of the situation learning unit in the action learning device according to the second embodiment of the present invention.
A schematic diagram showing a configuration example of the action learning device according to the third embodiment of the present invention.
A schematic diagram showing a configuration example of the action determination device according to the fourth embodiment of the present invention.

[First embodiment]
A schematic configuration of an action learning device according to a first embodiment of the present invention will be described with reference to FIGS. 1 to 6. FIG. 1 is a schematic diagram showing a configuration example of the action learning device according to this embodiment. FIG. 2 is a schematic diagram showing a configuration example of the situation learning unit in the action learning device according to this embodiment. FIG. 3 is a schematic diagram showing a configuration example of the score acquisition unit in the action learning device according to this embodiment. FIG. 4 is a schematic diagram showing a configuration example of the neural network unit in the action learning device according to this embodiment. FIG. 5 is a schematic diagram showing a configuration example of a learning cell in the action learning device according to this embodiment. FIG. 6 is a schematic diagram showing a configuration example of the usage learning unit in the action learning device according to this embodiment.

As shown in FIG. 1, the action learning device 100 according to this embodiment has a situation learning unit 110 and a usage learning unit 120. The situation learning unit 110 performs learning (situation learning) based on information received from the environment 200 and its own situation, and selects an action to be executed on the environment 200. The usage learning unit 120 receives the user's evaluation (advice) of the action selected by the situation learning unit 110 and generates a user learning model that associates the selected action with the user's evaluation (usage learning). The usage learning unit 120 also determines the action to be executed on the environment 200 based on the action selected by the situation learning unit 110 and the user learning model. The action learning device 100, together with the environment 200, constitutes an action learning system 400.

As shown in FIG. 2, for example, the situation learning unit 110 may be composed of an action candidate acquisition unit 10, a situation information generation unit 20, a score acquisition unit 30, an action selection unit 70, and a score adjustment unit 80.

The action candidate acquisition unit 10 has a function of extracting the actions that can be taken in the current situation (action candidates) based on the information received from the environment 200 and the situation of the agent itself. Here, the agent is the entity that learns and selects actions, and the environment is the target on which the agent acts.

The situation information generation unit 20 has a function of generating situation information data, which represents information relevant to actions, from the information received from the environment 200 and the agent's own situation (situation information). The information included in the situation information data is not particularly limited as long as it is relevant to actions; it may include, for example, environment information, time, number of turns, the agent's own state, and past actions.

The score acquisition unit 30 has a function of acquiring, for each of the action candidates extracted by the action candidate acquisition unit 10, a score for the situation information data generated by the situation information generation unit 20. Here, a score is a variable used as an index of the effect expected from the result of an action. For example, the score is large when the result of the action is expected to be rated highly, and small when the result is expected to be rated poorly.

The action selection unit 70 selects, from among the action candidates extracted by the action candidate acquisition unit 10, the action candidate with the largest score acquired by the score acquisition unit 30. The action selection unit 70 also has a function of executing the selected action on the environment 200 or of notifying the usage learning unit 120 of the selected action.

The score adjustment unit 80 has a function of adjusting the value of the score linked to the selected action according to the result that the action selected by the action selection unit 70 produced in the environment 200. For example, the score is raised when the result of the action is rated highly and lowered when the result is rated poorly.
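The interplay between the action selection unit 70 and the score adjustment unit 80 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class `ScoredAction`, the function names, the step size of 10, and the clamping range are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class ScoredAction:
    name: str     # label of the action candidate
    score: float  # expected effect of taking the action


def select_action(candidates):
    """Return the candidate with the largest score (role of the action selection unit)."""
    return max(candidates, key=lambda c: c.score)


def adjust_score(action, outcome_good, step=10.0, lo=-100.0, hi=100.0):
    """Raise the score after a well-rated outcome, lower it otherwise.

    The step size and the clamping range are assumptions made for this sketch.
    """
    delta = step if outcome_good else -step
    action.score = max(lo, min(hi, action.score + delta))
    return action.score


candidates = [
    ScoredAction("play 5", 20.0),
    ScoredAction("play 9", 35.0),
    ScoredAction("pass", 0.0),
]
chosen = select_action(candidates)        # highest-scored candidate
adjust_score(chosen, outcome_good=False)  # the result was rated poorly
```

Keeping the adjustment separate from the selection mirrors the division of roles in the text: selection only reads scores, while adjustment happens after the environment has responded.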

As shown in FIG. 3, for example, the score acquisition unit 30 may include a neural network unit 40, a determination unit 50, and a learning unit 60. The learning unit 60 may include a weight correction unit 62 and a learning cell generation unit 64.

As shown in FIG. 4, for example, the neural network unit 40 may be configured as a two-layer artificial neural network including an input layer and an output layer. The input layer has as many cells (neurons) 42 as the number of element values extracted from one piece of situation information data. For example, if one piece of situation information data includes M element values, the input layer includes at least M cells 42_1, 42_2, ..., 42_i, ..., 42_M. The output layer has at least as many cells (neurons) 44 as the number of actions that can be taken. For example, the output layer includes N cells 44_1, 44_2, ..., 44_j, ..., 44_N. Each of the cells 44 constituting the output layer is linked to one of the actions that can be taken, and a predetermined score is set for each cell 44.

The M element values I_1, I_2, ..., I_i, ..., I_M of the situation information data are input to the input-layer cells 42_1, 42_2, ..., 42_i, ..., 42_M, respectively. Each of the cells 42_1, 42_2, ..., 42_i, ..., 42_M outputs the element value I it received to each of the cells 44_1, 44_2, ..., 44_j, ..., 44_N.

A weighting coefficient ω for applying a predetermined weight to the element value I is set for each branch (axon) connecting a cell 42 and a cell 44. For example, as shown in FIG. 5, weighting coefficients ω_1j, ω_2j, ..., ω_ij, ..., ω_Mj are set for the branches connecting the cells 42_1, 42_2, ..., 42_i, ..., 42_M to the cell 44_j. The cell 44_j thereby performs the calculation shown in equation (1) and outputs an output value O_j.
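Equation (1) does not appear in this text. As an illustration only, and assuming the cell 44_j simply sums its weighted inputs without any further activation function (an assumption made here, not the original equation), a form consistent with the symbols defined above would be:

```latex
O_j = \sum_{i=1}^{M} \omega_{ij} \, I_i
```

Here I_i is the i-th element value, ω_ij is the weighting coefficient on the branch from cell 42_i to cell 44_j, and M is the number of element values.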

In this specification, one cell 44, the branches (input nodes) that input the element values I_1 to I_M to that cell 44, and the branch (output node) that outputs the output value O from that cell 44 may be collectively referred to as a learning cell 46.

The determination unit 50 compares a correlation value between the plurality of element values extracted from the situation information data and the output value of a learning cell with a predetermined threshold, and determines whether the correlation value is greater than or equal to the threshold or less than the threshold. One example of the correlation value is the likelihood with respect to the output value of the learning cell. The function of the determination unit 50 may instead be provided in each of the learning cells 46.

The learning unit 60 is a functional block that trains the neural network unit 40 according to the determination result of the determination unit 50. When the correlation value is greater than or equal to the predetermined threshold, the weight correction unit 62 updates the weighting coefficients ω set for the input nodes of the learning cell 46. When the correlation value is less than the predetermined threshold, the learning cell generation unit 64 adds a new learning cell 46 to the neural network unit 40.
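The branch between weight correction and learning-cell generation can be sketched as below. This is a hedged illustration: the correlation function `overlap`, the update rule (moving each weight halfway toward the input), and the choice to compare only the best-matching cell are assumptions of this sketch, since this passage does not fix them.

```python
def learn_step(cells, elements, threshold, correlation):
    """Update the best-matching learning cell or add a new one.

    cells: list of weight vectors, one per learning cell.
    correlation: function (weights, elements) -> float; its definition is
    left open here because this passage does not fix it.
    """
    if cells:
        best = max(cells, key=lambda w: correlation(w, elements))
        if correlation(best, elements) >= threshold:
            # Weight correction: nudge weights toward the input (assumed rule).
            for i, x in enumerate(elements):
                best[i] += 0.5 * (x - best[i])
            return "updated"
    # Learning-cell generation: the input itself seeds the new weights (assumed).
    cells.append([float(x) for x in elements])
    return "added"


def overlap(weights, elements):
    """Hypothetical correlation: weighted sum normalized by the active inputs."""
    active = sum(elements)
    return sum(w * x for w, x in zip(weights, elements)) / active if active else 0.0


cells = []
first = learn_step(cells, [1, 0, 1, 0], threshold=0.6, correlation=overlap)
second = learn_step(cells, [1, 0, 1, 1], threshold=0.6, correlation=overlap)
third = learn_step(cells, [0, 1, 0, 0], threshold=0.6, correlation=overlap)
```

In this run the first input creates a cell, the second is similar enough to update it, and the third is dissimilar and creates a second cell, so the network grows only when no existing cell matches.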

As shown in FIG. 6, for example, the usage learning unit 120 may be composed of a selected action acquisition unit 130, an evaluation acquisition unit 140, an action evaluation unit 150, an action determination unit 160, and a storage unit 170.

The selected action acquisition unit 130 has a function of acquiring, from the situation learning unit 110, information on the action selected by the action selection unit 70. The evaluation acquisition unit 140 has a function of acquiring a user's (advisor's) evaluation of the information on the action selected by the action selection unit 70. This evaluation indicates, together with a reason, a judgment to execute or not to execute the action selected by the action selection unit 70 in the situation indicated by the situation information data.

The action evaluation unit 150 may include a slot generation unit, a user learning model generation unit, and a user learning model extraction unit. The slot generation unit has a function of generating, based on the reason in the user's evaluation, a slot indicating a point of interest in the situation information data. The user learning model generation unit has a function of generating a user learning model in which the situation information data, the slot, and the judgment in the evaluation are linked to the action selected by the action selection unit 70, and of storing it in the storage unit 170. The user learning model extraction unit has a function of extracting from the storage unit 170, from among the user learning models linked to the action selected by the action selection unit 70, the user learning model whose situation information data has the highest suitability for the current situation information data.

The action determination unit 160 has a function of determining whether or not to execute the action selected by the action selection unit 70 on the environment 200, based on the relationship between the current situation information data and the slot of the user learning model extracted by the user learning model extraction unit.

In other words, the usage learning unit 120 learns the user's evaluation (advice) of the action selected by the action selection unit 70 and determines the action to be executed on the environment 200 based on the result of that learning.
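One way to picture the user learning model and its use is the sketch below. It is an illustration under stated assumptions: the field layout of `UserLearningModel`, the suitability measure (number of matching element values), and the decision rule (follow the stored judgment only when the slot positions agree) are hypothetical choices, not details given in this passage.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class UserLearningModel:
    action: str           # action candidate the model is linked to
    situation: List[int]  # situation information data (element values)
    slot: Set[int]        # indices marking the point of interest
    execute: bool         # user's judgment: execute (True) or not (False)


def suitability(model, current):
    """Hypothetical suitability measure: number of matching element values."""
    return sum(1 for a, b in zip(model.situation, current) if a == b)


def extract_model(models, action, current) -> Optional[UserLearningModel]:
    """Pick the most suitable stored model linked to the selected action."""
    linked = [m for m in models if m.action == action]
    return max(linked, key=lambda m: suitability(m, current), default=None)


def decide(model, current) -> Optional[bool]:
    """Follow the stored judgment only if the slot positions agree (assumed rule)."""
    if model is None:
        return None
    if all(model.situation[i] == current[i] for i in model.slot):
        return model.execute
    return None  # no applicable advice for this situation


store = [
    UserLearningModel("play 2", [1, 0, 0, 1], {3}, False),
    UserLearningModel("play 2", [0, 1, 1, 0], {1}, True),
    UserLearningModel("pass", [1, 0, 0, 1], {0}, True),
]
current = [1, 0, 1, 1]
best = extract_model(store, "play 2", current)
verdict = decide(best, current)
```

Returning `None` when no advice applies leaves the caller free to fall back on the action chosen by score alone.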

Next, an action learning method using the action learning device 100 according to this embodiment will be described with reference to FIGS. 7 to 15. To make the description easier to follow, the actions of a player in the card game "Millionaire" (Daifugo) are used as an example where appropriate. The action learning device 100 according to this embodiment is, however, widely applicable to various uses in which an action is selected according to the situation of the environment 200.

First, the learning method in the situation learning unit 110 of the action learning device 100 according to this embodiment will be described with reference to FIGS. 7 to 9. FIG. 7 is a flowchart showing the learning method in the situation learning unit of the action learning device according to this embodiment. FIG. 8 is a diagram showing an example of situation information data. FIG. 9 is a diagram showing an example of situation information data and its element values.

The action candidate acquisition unit 10 extracts the actions that can be taken in the current situation (action candidates) based on the information received from the environment 200 and the agent's own situation (step S101). The method of extracting action candidates is not particularly limited; for example, they can be extracted using a rule-based program.

In the case of "Millionaire" (Daifugo), the information received from the environment 200 includes, for example, the type of the cards on the table (for example, a single card or a set of several cards), their strength, and whether other players have passed. The agent's own situation includes, for example, the cards in its hand, the cards it has played so far, and the current round number. The action candidate acquisition unit 10 extracts, in accordance with the rules of "Millionaire", all actions (action candidates) that can be taken under the environment 200 and the agent's own situation. For example, if the hand contains several cards of the same type as, and stronger than, the cards on the table, each action of playing one of those cards is an action candidate. Passing one's turn is also an action candidate.
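A rule-based extraction of action candidates could look like the following sketch. The rule set is deliberately simplified and is an assumption of this example (cards are reduced to integer strengths, a play must match the set size on the table and beat its strength, and special rules are ignored); the function name `action_candidates` and the tuple encoding are hypothetical.

```python
from collections import Counter


def action_candidates(hand, field):
    """List the plays allowed under a simplified rule set, plus passing.

    hand:  list of card strengths held by the player.
    field: (strength, count) of the cards on the table, or None if empty.
    """
    held = Counter(hand)
    candidates = []
    for strength in sorted(held):
        if field is None:
            # Empty table: any set size up to the number held may be led.
            sizes = range(1, held[strength] + 1)
        else:
            field_strength, field_count = field
            # Must match the set size on the table and beat its strength.
            ok = strength > field_strength and held[strength] >= field_count
            sizes = [field_count] if ok else []
        candidates.extend(("play", strength, n) for n in sizes)
    candidates.append(("pass",))  # passing is always an action candidate
    return candidates


# A pair of strength 6 is on the table; the hand holds a pair of 7s and three 9s.
options = action_candidates([3, 7, 7, 9, 9, 9, 12], field=(6, 2))
```

With this hand, the candidates are a pair of 7s, a pair of 9s, and a pass; the single 12 cannot answer a pair.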

Next, it is checked whether each of the action candidates extracted by the action candidate acquisition unit 10 is linked to at least one learning cell 46 included in the neural network unit 40 of the score acquisition unit 30. If there is an action candidate that is not linked to any learning cell 46, a learning cell 46 linked to that action candidate is newly added to the neural network unit 40. If all possible actions are known in advance, learning cells 46 linked to each of the expected actions may be set in the neural network unit 40 beforehand.

As described above, a predetermined score is set for each of the learning cells 46. When a learning cell 46 is added, an arbitrary value is set as the initial value of its score. For example, when scores are set in the numerical range of -100 to +100, the initial value of the score can be set to 0.
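The check-and-add step for learning cells can be sketched as follows. The dictionary layout, the function name `ensure_learning_cells`, and the zero-initialized weights are assumptions of this sketch; only the initial score of 0 comes from the example in the text.

```python
INITIAL_SCORE = 0  # example initial value for a score range of -100 to +100


def ensure_learning_cells(cells_by_action, candidates, n_inputs):
    """Add a learning cell for every candidate that has none yet.

    cells_by_action maps an action label to a list of learning cells; each
    cell is a dict with a score and input weights (layout assumed here).
    Returns the labels for which a cell was newly added.
    """
    added = []
    for action in candidates:
        if not cells_by_action.get(action):
            cell = {"score": INITIAL_SCORE, "weights": [0.0] * n_inputs}
            cells_by_action.setdefault(action, []).append(cell)
            added.append(action)
    return added


network = {"pass": [{"score": 5, "weights": [0.0, 0.0, 0.0]}]}
new_actions = ensure_learning_cells(network, ["pass", "play 7"], n_inputs=3)
```

Existing cells are left untouched, so scores already learned for known actions are preserved.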

Next, the situation information generation unit 20 generates situation information data, which maps the information relevant to actions, based on the information received from the environment 200 and the agent's own situation (step S102). The situation information data is not particularly limited; for example, it can be generated by representing information based on the environment and the agent's own situation as bitmap-like image data. The situation information data may be generated before step S101 or in parallel with step S101.

FIG. 8 is a diagram showing an example of situation information data in which, of the information indicating the environment 200 and the agent's own situation, the cards on the table, the turn count, the hand, and past information are represented as bitmap images. In the images labeled "cards on the table", "hand", and "past information", the "number" on the horizontal axis represents the strength of a card: a smaller "number" indicates a weaker card and a larger "number" indicates a stronger card. In the same images, the "pair" on the vertical axis represents the number of cards in a set. For example, for a combination made up of a single rank, the "pair" value increases in the order of one card, two cards (a pair), three cards (three of a kind), and four cards (four of a kind). The "turn count" image represents two-dimensionally, along the horizontal axis, which stage of a game, from start to end, the current turn is in. The boundaries of the points in the illustrated plots are blurred with the aim of improving generalization performance, but they do not necessarily have to be blurred.
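A bitmap of the kind described for the hand could be built as in the sketch below. The encoding is an assumption of this example: 13 strength columns, four set-size rows, a strength held n times marking the rows for set sizes 1 through n, and no blurring of pixel boundaries. The function name `hand_bitmap` is hypothetical.

```python
from collections import Counter


def hand_bitmap(hand, n_strengths=13, max_set=4):
    """Return a max_set x n_strengths grid of 0/1 pixels for a hand.

    Column = card strength (0 is weakest), row = set size minus one.
    A strength held n times marks rows for set sizes 1..n (assumed encoding).
    """
    grid = [[0] * n_strengths for _ in range(max_set)]
    for strength, count in Counter(hand).items():
        for size in range(1, min(count, max_set) + 1):
            grid[size - 1][strength] = 1
    return grid


# One card of strength 0, a pair of strength 4, three of a kind of strength 12.
grid = hand_bitmap([0, 4, 4, 12, 12, 12])
```

The same layout could be reused for the cards on the table and for past information, which would keep the axes of the images aligned.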

状況情報の写像について、処理時間の短縮、学習セルの量の削減、行動選択の精度を良くするなどの目的で、情報の一部を切り出しながら段階的に処理を行う階層化、情報の変換、情報の組み合わせなどの処理を行ってもよい。 Regarding the mapping of the situation information, processing such as hierarchization (processing in stages while cutting out part of the information), conversion of information, and combination of information may be performed in order to shorten the processing time, reduce the number of learning cells, or improve the accuracy of action selection.

図９は、図８に示した状況情報データの「手札」の部分を抜き出したものである。この状況情報データに対しては、例えば右側の拡大図に示すように、１つの画素を１つの要素値に対応づけることができる。そして、白の画素に対応する要素値を０、黒の画素に対応する要素値を１と定義することができる。例えば、図９の例では、ｐ番目の画素に対応する要素値Ｉ_ｐは１となり、ｑ番目の画素に対応する要素値Ｉ_ｑは０となる。１つの状況情報データに対応する要素値が、要素値Ｉ_１～Ｉ_Ｍである。 FIG. 9 shows the "hand" portion extracted from the situation information data shown in FIG. 8. For this situation information data, one pixel can be associated with one element value, as shown for example in the enlarged view on the right side. The element value corresponding to a white pixel can be defined as 0, and the element value corresponding to a black pixel can be defined as 1. For example, in the example of FIG. 9, the element value I_p corresponding to the p-th pixel is 1, and the element value I_q corresponding to the q-th pixel is 0. The element values corresponding to one piece of situation information data are the element values I_1 to I_M.
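The pixel-to-element mapping described above can be sketched in Python as follows. This is only an illustration: the 3x4 bitmap, the row-by-row ordering of the pixels, and the name `to_element_values` are assumptions, since the description does not fix the image size or the order in which pixels are numbered.

```python
# Hypothetical 3x4 bitmap: 1 = black pixel, 0 = white pixel.
bitmap = [
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]


def to_element_values(image):
    """Flatten a 2-D bitmap row by row into the element values I_1..I_M.

    Row-major ordering is an assumption made for this sketch.
    """
    return [pixel for row in image for pixel in row]


elements = to_element_values(bitmap)
M = len(elements)  # number of element values, one per pixel
```

Any fixed ordering would serve equally well, provided the same ordering is used for every piece of situation information data fed to the learning cells.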

次いで、状況情報生成部２０で生成した状況情報データの要素値Ｉ_１～Ｉ_Ｍを、ニューラルネットワーク部４０に入力する（ステップＳ１０３）。ニューラルネットワーク部４０に入力された要素値Ｉ_１～Ｉ_Ｍは、セル４２_１～４２_Ｍを介して、行動候補取得部１０により抽出された行動候補に紐付けられた学習セル４６の各々に入力される。要素値Ｉ_１～Ｉ_Ｍが入力された学習セル４６の各々は、式（１）に基づいて出力値Ｏを出力する。こうして、要素値Ｉ_１～Ｉ_Ｍに対する学習セル４６からの出力値Ｏを取得する（ステップＳ１０４）。 Next, the element values I_1 to I_M of the situation information data generated by the situation information generation unit 20 are input to the neural network unit 40 (step S103). The element values I_1 to I_M input to the neural network unit 40 are input, via the cells 42_1 to 42_M, to each of the learning cells 46 linked to the action candidates extracted by the action candidate acquisition unit 10. Each learning cell 46 that receives the element values I_1 to I_M outputs an output value O based on equation (1). In this way, the output values O of the learning cells 46 for the element values I_1 to I_M are obtained (step S104).

学習セル４６が、各入力ノードに重み付け係数ωが設定されていない状態、すなわち一度も学習を行っていない初期状態である場合には、入力された要素値Ｉ_１～Ｉ_Ｍの値を、当該学習セル４６の入力ノードの重み付け係数ωの初期値として設定する。例えば、図９の例では、学習セル４６_ｊのｐ番目の画素に対応する入力ノードの重み付け係数ω_ｐｊは１となり、学習セル４６_ｊのｑ番目の画素に対応する入力ノードの重み付け係数ω_ｑｊは０となる。この場合の出力値Ｏは、初期値として設定した重み付け係数ωを用いて算出される。 When the learning cell 46 is in a state in which no weighting coefficient ω has been set for its input nodes, that is, in an initial state in which it has never been trained, the input element values I_1 to I_M are set as the initial values of the weighting coefficients ω of the input nodes of that learning cell 46. For example, in the example of FIG. 9, the weighting coefficient ω_pj of the input node of the learning cell 46_j corresponding to the p-th pixel is 1, and the weighting coefficient ω_qj of the input node of the learning cell 46_j corresponding to the q-th pixel is 0. The output value O in this case is calculated using the weighting coefficients ω set as the initial values.

次いで、判定部５０において、要素値Ｉ_１～Ｉ_Ｍと学習セル４６からの出力値Ｏとの間の相関値（ここでは、学習セルの出力値に関する尤度Ｐとする）を取得する（ステップＳ１０５）。尤度Ｐの算出方法は、特に限定されるものではない。例えば、学習セル４６_ｊの尤度Ｐ_ｊは、以下の式（２）に基づいて算出することができる。

Next, the determining unit 50 acquires the correlation value (here, the likelihood P regarding the output value of the learning cell) between the element values I ₁ to I _M and the output value O from the learning cell 46 (step S105). A method for calculating the likelihood P is not particularly limited. For example, the likelihood P _j of the learning cell 46 _j can be calculated based on Equation (2) below.

式（２）は、尤度Ｐ_ｊが、学習セル４６_ｊの複数の入力ノードの重み付け係数ω_ｉｊの累積値に対する学習セル４６_ｊの出力値Ｏ_ｊの比率で表されることを示している。或いは、尤度Ｐ_ｊが、複数の入力ノードの重み付け係数ω_ｉｊに基づく学習セル４６_ｊの出力の最大値に対する、複数の要素値を入力したときの学習セル４６_ｊの出力値の比率で表されることを示している。 Equation (2) indicates that the likelihood P_j is expressed as the ratio of the output value O_j of the learning cell 46_j to the cumulative value of the weighting coefficients ω_ij of the input nodes of the learning cell 46_j. Put another way, it indicates that the likelihood P_j is expressed as the ratio of the output value of the learning cell 46_j when the element values are input to the maximum output of the learning cell 46_j that is possible under the weighting coefficients ω_ij of its input nodes.
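As a minimal sketch of the ratio described above, the likelihood can be written as P_j = O_j / Σ_i ω_ij. The code below additionally assumes that equation (1), which is not reproduced in this part of the description, is the weighted sum O_j = Σ_i I_i·ω_ij. The function names and the handling of an all-zero cell are illustrative choices, not taken from the source.

```python
def cell_output(inputs, weights):
    """Assumed form of equation (1): weighted sum of the element values."""
    return sum(i * w for i, w in zip(inputs, weights))


def likelihood(inputs, weights):
    """Equation (2) as described: cell output divided by the sum of weights."""
    total = sum(weights)
    if total == 0:
        return 0.0  # assumption: a cell with all-zero weights matches nothing
    return cell_output(inputs, weights) / total


weights = [1.0, 0.5, 0.0, 0.5]
same = likelihood([1, 1, 0, 1], weights)     # every weighted pixel is black
partial = likelihood([1, 0, 0, 0], weights)  # only the first pixel is black
```

With binary inputs the denominator is the largest output the cell can produce, so under these assumptions the likelihood stays between 0 and 1.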

次いで、判定部５０において、取得した尤度Ｐの値と所定の閾値とを比較し、尤度Ｐの値が閾値以上であるか否かを判定する（ステップＳ１０６）。 Next, the determination unit 50 compares the obtained value of likelihood P with a predetermined threshold value, and determines whether the value of likelihood P is equal to or greater than the threshold value (step S106).

行動候補の各々において、当該行動候補に紐付けられた学習セル４６のうち、尤度Ｐの値が閾値以上である学習セル４６が１つ以上存在した場合（ステップＳ１０６の「Ｙｅｓ」）には、ステップＳ１０７へと移行する。ステップＳ１０７では、当該行動候補に紐付けられた学習セル４６のうち尤度Ｐの値が最も大きい学習セル４６の入力ノードの重み付け係数ωを更新する。学習セル４６_ｊの入力ノードの重み付け係数ω_ｉｊは、例えば以下の式（３）に基づいて修正することができる。 For each action candidate, if one or more of the learning cells 46 linked to that action candidate have a likelihood P equal to or greater than the threshold ("Yes" in step S106), the process proceeds to step S107. In step S107, the weighting coefficients ω of the input nodes of the learning cell 46 with the largest likelihood P among the learning cells 46 linked to that action candidate are updated. The weighting coefficients ω_ij of the input nodes of the learning cell 46_j can be modified, for example, based on equation (3) below.
ω_ｉｊ＝（ｉ番目の画素における黒の出現回数）／（学習回数） …（３）
ω_ij = (number of appearances of black in the i-th pixel) / (number of times of learning) …(3)

式（３）は、学習セル４６の複数の入力ノードの各々の重み付け係数ωが、対応する入力ノードから入力された要素値Ｉの累積平均値により決定されることを示している。このようにして、尤度Ｐの値が所定の閾値以上である状況情報データの情報を各入力ノードの重み付け係数ωに累積していくことにより、黒（１）の出現回数の多い画素に対応する入力ノードほど、重み付け係数ωの値が大きくなる。このような学習セル４６の学習アルゴリズムは、人の脳の学習原理として知られるヘブ則に近似したものである。 Equation (3) indicates that the weighting coefficient ω of each input node of the learning cell 46 is determined by the cumulative average of the element values I input from the corresponding input node. By accumulating, in the weighting coefficient ω of each input node, the information of situation information data whose likelihood P is equal to or greater than the predetermined threshold, the weighting coefficient ω becomes larger for input nodes corresponding to pixels in which black (1) appears more often. This learning algorithm of the learning cell 46 approximates Hebb's rule, which is known as a learning principle of the human brain.
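The cumulative-average update of equation (3) can be illustrated with the following sketch. The class name `LearningCell` and its attributes are hypothetical, and the sketch assumes that setting the initial weights from the first input counts as the first learning pass, which is consistent with the initial weights being equal to the first element values.

```python
class LearningCell:
    """Illustrative learning cell that keeps per-pixel black counts."""

    def __init__(self, inputs):
        # Initial state: the first input becomes the initial weights,
        # i.e. black counts divided by one learning pass (assumption).
        self.black_counts = list(inputs)
        self.times_learned = 1

    @property
    def weights(self):
        # Equation (3): appearances of black in pixel i / times of learning.
        return [count / self.times_learned for count in self.black_counts]

    def learn(self, inputs):
        """Accumulate one more piece of situation information data."""
        self.black_counts = [c + i for c, i in zip(self.black_counts, inputs)]
        self.times_learned += 1


cell = LearningCell([1, 0, 1, 0])
cell.learn([1, 1, 0, 0])
cell.learn([1, 0, 0, 0])
```

After these three passes the first pixel, black every time, has weight 1, while pixels that were black only once have weight 1/3. No gradient or error signal is involved, which is why a single pass over each sample suffices.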

一方、行動候補の各々において、当該行動候補に紐付けられた学習セル４６の中に尤度Ｐの値が閾値以上である学習セル４６が１つも存在しない場合（ステップＳ１０６の「Ｎｏ」）には、ステップＳ１０８へと移行する。ステップＳ１０８では、当該行動候補に紐付けられた新たな学習セル４６を生成する。新たに生成した学習セル４６の各入力ノードには、学習セル４６が初期状態であった場合と同様、要素値Ｉ_１～Ｉ_Ｍの値を重み付け係数ωの初期値として設定する。また、追加する学習セル４６には、スコアの初期値として任意の値を設定する。このようにして、同じ行動候補に紐付けられた学習セル４６を追加することにより、同じ行動候補に属する様々な態様の状況情報データを学習することが可能となり、より適切な行動を選択することが可能となる。 On the other hand, in each action candidate, if there is not even one learning cell 46 whose likelihood P value is equal to or greater than the threshold among the learning cells 46 linked to the action candidate (“No” in step S106), moves to step S108. In step S108, a new learning cell 46 linked to the action candidate is generated. For each input node of the newly generated learning cell 46, the element values I ₁ to I _M are set as the initial values of the weighting coefficients ω in the same way as when the learning cell 46 was in the initial state. An arbitrary value is set as the initial value of the score in the learning cell 46 to be added. In this way, by adding the learning cells 46 linked to the same action candidate, it becomes possible to learn various forms of situation information data belonging to the same action candidate, and to select a more appropriate action. becomes possible.

なお、学習セル４６の追加は、尤度Ｐの値が閾値以上である学習セル４６がいずれかの行動候補において１つも存在しない場合に、常に行う必要はない。例えば、尤度Ｐの値が閾値以上である学習セル４６が総ての行動候補において１つも存在しない場合にのみ、学習セル４６を追加するようにしてもよい。この場合、追加する学習セル４６は、複数の行動候補の中からランダムに選択したいずれかの行動候補に紐付けることができる。 It should be noted that the addition of the learning cell 46 need not always be performed when there is not even one learning cell 46 whose likelihood P value is equal to or greater than the threshold in any of the action candidates. For example, a learning cell 46 may be added only when there is not even one learning cell 46 with a value of likelihood P equal to or greater than a threshold among all action candidates. In this case, the learning cell 46 to be added can be associated with one of the action candidates randomly selected from among the plurality of action candidates.

尤度Ｐの判定に用いる閾値は、その値が大きいほど、状況情報データに対する適合性は高くなるが、学習セル４６の数も多くなり学習に時間を要する。逆に、閾値は、その値が小さいほど、状況情報データに対する適合性は低くなるが、学習セル４６の数は少なくなり学習に要する時間は短くなる。閾値の設定値は、状況情報データの種類や形態等に応じて、所望の適合率や学習時間が得られるように、適宜設定することが望ましい。 As the threshold value used to determine the likelihood P increases, the adaptability to the situation information data increases, but the number of learning cells 46 also increases, requiring more time for learning. Conversely, the smaller the threshold value, the lower the adaptability to the situation information data, but the smaller the number of learning cells 46 and the shorter the time required for learning. It is desirable that the set value of the threshold is appropriately set according to the type and form of the situation information data so that the desired relevance rate and learning time can be obtained.
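The branch between steps S107 and S108 for a single action candidate might look like the sketch below. The threshold value 0.6, the dictionary layout of a cell, and the function names are assumptions for illustration; the likelihood again assumes that equation (1) is a weighted sum.

```python
THRESHOLD = 0.6  # hypothetical value; the description leaves it to be tuned


def likelihood(inputs, weights):
    """Output / sum of weights, assuming equation (1) is a weighted sum."""
    total = sum(weights)
    return sum(i * w for i, w in zip(inputs, weights)) / total if total else 0.0


def learn_for_candidate(cells, inputs, threshold=THRESHOLD, initial_score=0):
    """Update the best-matching cell (S107) or add a new cell (S108)."""
    best = max(cells, key=lambda c: likelihood(inputs, c["weights"]), default=None)
    if best is not None and likelihood(inputs, best["weights"]) >= threshold:
        n = best["count"]
        # Incremental form of the cumulative average in equation (3).
        best["weights"] = [(w * n + i) / (n + 1)
                           for w, i in zip(best["weights"], inputs)]
        best["count"] = n + 1
        return "updated"
    cells.append({"weights": [float(i) for i in inputs],
                  "count": 1, "score": initial_score})
    return "added"


cells = []  # learning cells linked to one action candidate
first = learn_for_candidate(cells, [1, 1, 0, 0])   # no cell exists yet
second = learn_for_candidate(cells, [1, 1, 1, 0])  # likelihood 1.0, so update
third = learn_for_candidate(cells, [0, 0, 0, 1])   # likelihood 0.0, so add
```

Raising `THRESHOLD` in this sketch makes the "added" branch more frequent, which mirrors the trade-off stated above between fit to the data and the number of learning cells.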

次いで、行動候補の各々において、当該行動候補に紐付けられた学習セル４６の中から、状況情報データに対する相関（尤度Ｐ）が最も高い学習セル４６を抽出する（ステップＳ１０９）。 Next, for each action candidate, the learning cell 46 with the highest correlation (likelihood P) to the situation information data is extracted from the learning cells 46 linked to the action candidate (step S109).

次いで、ステップＳ１０９において抽出した学習セル４６の中から、最もスコアの高い学習セル４６を抽出する（ステップＳ１１０）。 Next, the learning cell 46 with the highest score is extracted from the learning cells 46 extracted in step S109 (step S110).

次いで、行動選択部７０において、最もスコアの高い学習セル４６に紐付けられた行動候補を選択し、環境２００に対して実行する（ステップＳ１１１）。これにより、行動した結果の評価が最も高いと見込まれる行動を、環境２００に対して実行することができる。 Next, the action selection unit 70 selects the action candidate linked to the learning cell 46 with the highest score, and executes it on the environment 200 (step S111). As a result, an action that is expected to have the highest evaluation as a result of the action can be executed with respect to the environment 200 .
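Steps S109 to S111 can be sketched as a two-stage selection: first the best-matching cell per action candidate, then the highest score among those cells. The data layout, the action labels, and the function names below are hypothetical, and the likelihood assumes that equation (1) is a weighted sum.

```python
def likelihood(inputs, weights):
    """Output / sum of weights, assuming equation (1) is a weighted sum."""
    total = sum(weights)
    return sum(i * w for i, w in zip(inputs, weights)) / total if total else 0.0


def select_action(cells_by_action, inputs):
    """Pick the action whose best-matching cell has the highest score."""
    best_per_action = {}
    for action, cells in cells_by_action.items():
        if cells:
            # S109: the cell with the highest likelihood for this candidate.
            best_per_action[action] = max(
                cells, key=lambda c: likelihood(inputs, c["weights"]))
    # S110 and S111: among those cells, take the highest score.
    return max(best_per_action, key=lambda a: best_per_action[a]["score"])


cells_by_action = {
    "play pair of queens": [
        {"weights": [1, 1, 0, 0], "score": 5},
        {"weights": [0, 0, 1, 1], "score": 40},
    ],
    "pass": [
        {"weights": [1, 0, 0, 0], "score": 10},
    ],
}
chosen = select_action(cells_by_action, [1, 1, 0, 0])
```

In this example the cell with score 40 is never considered, because it does not match the current situation; the comparison is between the scores 5 and 10 of the best-matching cells, so "pass" is chosen.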

次いで、スコア調整部８０により、行動選択部７０により選択された行動を環境２００に対して実行した結果の評価に基づき、最もスコアの高い学習セル４６として抽出された学習セル４６のスコアを調整する（ステップＳ１１２）。例えば、行動した結果の評価が高い場合はスコアを上げ、行動した結果の評価が低い場合ステップＳ１１２はスコアを下げる。このようにして学習セル４６のスコアを調整することで、環境２００に対して実行した結果の評価が高いと見込まれる学習セル４６ほどスコアが高くなるように、ニューラルネットワーク部４０は学習を進めることができる。 Next, the score adjustment unit 80 adjusts the score of the learning cell 46 that was extracted as the learning cell 46 with the highest score, based on the evaluation of the result of executing the action selected by the action selection unit 70 on the environment 200 (step S112). For example, the score is raised when the result of the action is evaluated highly and lowered when it is evaluated poorly. By adjusting the scores of the learning cells 46 in this way, the neural network unit 40 can advance learning so that learning cells 46 expected to yield a higher evaluation when executed on the environment 200 have higher scores.

「大富豪」の場合、１ゲーム中における１回の行動によってその結果を評価することは困難であるため、１ゲームが終了したときの順位に基づいて学習セル４６のスコアを調整することができる。例えば、１位で上がった場合には、そのゲーム中の各ターンにおいて最もスコアの高い学習セル４６として抽出された学習セル４６のスコアをそれぞれ１０増やす。２位で上がった場合には、そのゲーム中の各ターンにおいて最もスコアの高い学習セル４６として抽出された学習セル４６のスコアをそれぞれ５増やす。３位で上がった場合には、スコアの調整は行わない。４位で上がった場合には、そのゲーム中の各ターンにおいて最もスコアの高い学習セル４６として抽出された学習セル４６のスコアをそれぞれ５減らす。５位で上がった場合には、そのゲーム中の各ターンにおいて最もスコアの高い学習セル４６として抽出された学習セル４６のスコアをそれぞれ１０減らす。 In the case of "millionaire", it is difficult to evaluate a result from a single action within a game, so the scores of the learning cells 46 can be adjusted based on the ranking when the game ends. For example, if the player finishes first, the score of each learning cell 46 that was extracted as the highest-scoring learning cell 46 in a turn of that game is increased by 10. If the player finishes second, each of those scores is increased by 5. If the player finishes third, the scores are not adjusted. If the player finishes fourth, each of those scores is decreased by 5. If the player finishes fifth, each of those scores is decreased by 10.
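The rank-based adjustment described above reduces to a small lookup table. In the sketch below, the table values come from the text, while the names `RANK_DELTA` and `adjust_scores` and the dictionary representation of a cell are illustrative assumptions.

```python
# Score change per final ranking, as given in the description.
RANK_DELTA = {1: 10, 2: 5, 3: 0, 4: -5, 5: -10}


def adjust_scores(used_cells, rank):
    """Apply the end-of-game adjustment to every cell selected during the game.

    `used_cells` holds one entry per turn; a cell chosen on several turns is
    adjusted once per turn in this sketch (an assumption, since the text does
    not say how repeated selections are counted).
    """
    delta = RANK_DELTA[rank]
    for cell in used_cells:
        cell["score"] += delta


cell_a = {"score": 0}
cell_b = {"score": 20}
adjust_scores([cell_a, cell_b], rank=1)  # first place: +10 each
adjust_scores([cell_a], rank=5)          # fifth place: -10
```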

このように構成することで、状況情報データに基づいてニューラルネットワーク部４０を学習することができる。また、学習の進んだニューラルネットワーク部４０に状況情報データを入力することで、複数の行動候補の中から環境２００に対して実行した結果の評価が高いと見込まれる行動を選択することができる。 With this configuration, the neural network section 40 can learn based on the situation information data. Further, by inputting the situation information data to the neural network unit 40 that has advanced in learning, it is possible to select an action expected to be highly evaluated as a result of execution on the environment 200 from among a plurality of action candidates.

状況学習部１１０におけるニューラルネットワーク部４０の学習方法は、深層学習などにおいて用いられている誤差逆伝播法（バック・プロパゲーション）を適用するものではなく、１パスでの学習が可能である。このため、ニューラルネットワーク部４０の学習処理を簡略化することができる。また、各々の学習セル４６は独立しているため、データの追加、削除、更新が容易である。また、どのような情報であってもマップ化して処理することが可能であり、汎用性が高い。また、状況学習部１１０は、いわゆるダイナミック学習を行うことが可能であり、状況情報データを用いた追加の学習処理を容易に行うことができる。 The learning method of the neural network unit 40 in the situation learning unit 110 does not apply the error back propagation method (back propagation) used in deep learning and the like, and learning in one pass is possible. Therefore, the learning process of the neural network section 40 can be simplified. Moreover, since each learning cell 46 is independent, addition, deletion, and update of data are easy. Moreover, any information can be mapped and processed, and the versatility is high. In addition, the situation learning unit 110 can perform so-called dynamic learning, and can easily perform additional learning processing using the situation information data.

次に、本実施形態による行動学習装置１００の用法学習部１２０における学習方法について、図１０乃至図１３を用いて説明する。図１０は、本実施形態による行動学習装置の用法学習部における学習方法を示すフローチャートである。図１１は、状況情報生成部が状況情報から生成した状況情報データの一例を示す図である。図１２は、状況情報及び行動選択部により選択された行動に関する情報の表示例とユーザエピソードの例を示す図である。図１３は、状況情報データの注目箇所を示すスロットの生成方法の一例を示す図である。 Next, the learning method in the usage learning unit 120 of the action learning device 100 according to this embodiment will be described with reference to FIGS. 10 to 13. FIG. 10 is a flowchart showing the learning method in the usage learning unit of the action learning device according to this embodiment. FIG. 11 is a diagram showing an example of situation information data generated from situation information by the situation information generation unit. FIG. 12 is a diagram showing a display example of the situation information and of the information on the action selected by the action selection unit, together with an example of a user episode. FIG. 13 is a diagram showing an example of a method of generating a slot that indicates a point of interest in the situation information data.

用法学習部１２０の学習には、上述の手順により学習を行った後の状況学習部１１０が用いられる。 For the learning of the usage learning section 120, the situation learning section 110 after learning according to the above procedure is used.

まず、選択行動取得部１３０は、状況情報に基づいて行動選択部７０が選択した行動に関する情報を、状況学習部１１０から取得する（ステップＳ２０１）。状況学習部１１０から取得する情報には、状況情報生成部２０が生成した状況情報データと、行動選択部７０により選択された行動と、が含まれる。 First, the selected action acquisition unit 130 acquires information about the action selected by the action selection unit 70 based on the situation information from the situation learning unit 110 (step S201). The information acquired from the situation learning section 110 includes situation information data generated by the situation information generating section 20 and actions selected by the action selecting section 70 .

図１１は、状況情報生成部２０によって状況情報から生成された状況情報データの一例を示す図である。図１１には、「大富豪」の例における状況情報のうち、「場の札」、「手札」、「ターン数」、「前回出し札」をビットマップイメージとして表した状況情報データの一例を示している。図中、「場の札」、「手札」及び「前回出し札」において、縦軸はスートを表し、横軸は札の強さを表している。また、「ターン数」は、現在のターンが１ゲームの開始から終了までのどの段階にあるかを横軸方向に２次元的に表したものである。 FIG. 11 is a diagram showing an example of situation information data generated from situation information by the situation information generation unit 20. FIG. 11 shows an example of situation information data in which, of the situation information in the "millionaire" example, the "cards on the table", the "hand", the "number of turns", and the "previously played cards" are represented as bitmap images. In the figure, for the "cards on the table", the "hand", and the "previously played cards", the vertical axis represents the suit and the horizontal axis represents the strength of the cards. The "number of turns" represents, two-dimensionally along the horizontal axis, which stage between the start and the end of one game the current turn is at.

なお、図１１では各種情報を単純なビットマップイメージで表しているが、図８と同様、各点の境界をぼかし、汎化性能を向上するように構成してもよい。また、ここでは状況情報データを視覚的にイメージしやすいようにビットマップイメージで表しているが、状況情報データの形態はビットマップイメージに限定されるものではない。例えば、状況情報データは、要素値の値を並べた数字列として表すこともできる。 Although FIG. 11 shows various kinds of information as simple bitmap images, as in FIG. 8, the boundaries between points may be blurred to improve the generalization performance. Also, here, the situation information data is represented by a bitmap image so that it can be easily visualized, but the form of the situation information data is not limited to the bitmap image. For example, the status information data can also be expressed as a numeric string in which element values are arranged.

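ここでは一例として、選択行動取得部１３０が、図１１に示す状況情報データと、行動選択部７０により選択された行動として「スペードのＱとハートのＱのペアを出す」という行動に関する情報と、を状況学習部１１０から受信した場合を想定する。 Here, as an example, assume that the selected action acquisition unit 130 has received, from the situation learning unit 110, the situation information data shown in FIG. 11 and information on the action "play the pair consisting of the queen of spades and the queen of hearts" as the action selected by the action selection unit 70.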

次いで、用法学習部１２０は、現在の環境や自己の状況に基づく情報（状況情報）と行動選択部７０が選択した行動とを、表示装置などを介してユーザ（アドバイザ）に提示する。評価取得部１４０は、現在の状況に対して行動選択部７０が選択した行動に関するユーザの評価を、入力装置などを介して取得する（ステップＳ２０２）。 Next, the usage learning unit 120 presents information (situation information) based on the current environment and one's own situation and the action selected by the action selection unit 70 to the user (advisor) via a display device or the like. The evaluation acquisition unit 140 acquires the user's evaluation of the action selected by the action selection unit 70 for the current situation via an input device or the like (step S202).

例えば、用法学習部１２０は、図１２に示すように、表示装置１４２に、状況情報及び行動選択部７０により選択された行動に関する情報１４４を表示する。ユーザは、これらの情報を検討し、行動選択部７０により選択された行動の評価を戦略的に解説するユーザエピソード１４６を入力する。 For example, as shown in FIG. 12, the usage learning unit 120 displays, on the display device 142, the situation information and the information 144 on the action selected by the action selection unit 70. The user reviews this information and enters a user episode 146 that strategically explains the evaluation of the action selected by the action selection unit 70.

ここで、ユーザエピソードとは、行動選択部７０が選択した行動に対して、それを行うか行わないかの判断を、理由とともに説明するものである。例えば、「大富豪」の例の場合、「対象」、「理由」、「出す／出さない」の三語分で構成されるユーザエピソードを設定することができる。ここで、「対象」としては、手札、場の札、ターン数（例えば、序盤、中盤、終盤）、前回出し札などが挙げられる。「理由」としては、強い、弱い、などが挙げられる。例えば、「『場の札』が『弱い』ので、状況学習が選択した手を『出す』」、「『手札』が『弱い』ので、状況学習が選択した手は『出さない』」などのユーザエピソードが想定され得る。なお、ここでは理解を容易にするために三語分で構成される簡単なユーザエピソードを想定しているが、状況情報の次元等に応じてより複雑なユーザエピソードを設定するようにしてもよい。 Here, a user episode explains, together with a reason, the judgment as to whether or not to perform the action selected by the action selection unit 70. For example, in the "millionaire" example, a user episode can be composed of three terms: "target", "reason", and "play / do not play". Examples of the "target" include the hand, the cards on the table, the number of turns (for example, early, middle, or late game), and the previously played cards. Examples of the "reason" include strong and weak. User episodes such as "the 'cards on the table' are 'weak', so 'play' the move selected by situation learning" and "the 'hand' is 'weak', so 'do not play' the move selected by situation learning" can be assumed. A simple three-term user episode is assumed here to facilitate understanding, but a more complex user episode may be set according to, for example, the dimensionality of the situation information.

次いで、行動評価部１５０のスロット生成部は、ユーザが入力したユーザエピソードから、状況情報データに対応するビットマップイメージの注目箇所を示すスロットを生成する（ステップＳ２０３）。例えば、「大富豪」の例の場合、「対象」を示すビットマップイメージ（図１３（ａ））と「理由」を示すビットマップイメージ（図１３（ｂ））とを２次元行列と見なし、これら行列の対応する要素値同士を掛け合わせる（要素毎の積を取る）。これにより、ユーザエピソードの注目箇所を示すスロット（図１３（ｃ））を生成することができる。言わば、行動評価部１５０は、ユーザが入力したユーザエピソードを文法解釈し、その意味を表すマップを生成するのである。 Next, the slot generation unit of the action evaluation unit 150 generates, from the user episode input by the user, a slot indicating a point of interest in the bitmap image corresponding to the situation information data (step S203). For example, in the "millionaire" example, the bitmap image indicating the "target" (FIG. 13(a)) and the bitmap image indicating the "reason" (FIG. 13(b)) are regarded as two-dimensional matrices, and the corresponding element values of these matrices are multiplied together (the element-wise product is taken). As a result, a slot indicating the point of interest of the user episode (FIG. 13(c)) can be generated. In other words, the action evaluation unit 150 parses the user episode input by the user and generates a map that represents its meaning.
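The element-wise product used to form a slot can be sketched as follows. The two 3x4 maps are made-up stand-ins for the "target" and "reason" bitmaps of FIG. 13, whose actual contents are not reproduced in the text, and the function name is hypothetical.

```python
def elementwise_product(a, b):
    """Multiply two equally sized 2-D matrices element by element."""
    return [[x * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


# Hypothetical "target" map: rows of the image that belong to the hand.
target_map = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
# Hypothetical "reason" map: columns that correspond to weak cards.
reason_map = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]

# The slot keeps only positions that are 1 in both maps ("weak hand").
slot = elementwise_product(target_map, reason_map)
```

With 0/1 maps the product acts as a logical AND, so the slot marks exactly the region that both terms of the user episode refer to.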

なお、ユーザエピソードの「対象」及び「理由」に応じて要素値“１”を与えるビットマップイメージ上の位置は、常識データとして事前に保存しておけばよい。 Note that the positions on the bitmap image that give the element value "1" according to the "target" and "reason" of the user episode may be stored in advance as common sense data.

次いで、行動評価部１５０のユーザ学習モデル生成部は、行動選択部７０が選択した行動に、状況情報データ、スロット及びユーザエピソードにおける「行う／行わない」の評価を紐付けてなるユーザ学習モデルを生成する。そして、生成したユーザ学習モデルを記憶部１７０に保存する（ステップＳ２０４）。 Next, the user learning model generation unit of the behavior evaluation unit 150 generates a user learning model by linking the behavior selected by the behavior selection unit 70 with the situation information data, the slot, and the evaluation of “do/not do” in the user episode. Generate. Then, the generated user learning model is stored in the storage unit 170 (step S204).

用法学習部１２０に対し、ステップＳ２０１からステップＳ２０４までの処理を繰り返し行うことで、記憶部１７０には、状況学習部１１０が選択した行動に対するユーザの評価を示すユーザ学習モデルが蓄積されていく。つまり、状況情報とユーザのコメント（言葉）とを結びつけ、状況情報に応じたユーザの戦略を学習することができる。用法学習部１２０が行う動作は、言わば、状況情報とそれに応じたユーザのコメントを収集してノウハウを生成することである。 By repeating the processing from step S201 to step S204 for the usage learning unit 120, user learning models indicating the user's evaluations of the actions selected by the situation learning unit 110 accumulate in the storage unit 170. In other words, the situation information is linked with the user's comments (words), and the user's strategy for a given situation can be learned. The operation performed by the usage learning unit 120 is, so to speak, to collect situation information and the corresponding user comments and to generate know-how from them.

次に、本実施形態による行動学習装置１００を用いた行動決定方法について、図１４及び図１５を用いて説明する。図１４は、本実施形態による行動学習装置における行動決定方法を示すフローチャートである。図１５は、状況情報に対するユーザ学習モデルの適合性を評価する方法の一例を示す図である。 Next, an action determination method using the action learning device 100 according to this embodiment will be described with reference to FIGS. 14 and 15. FIG. 14 is a flowchart showing the action determination method in the action learning device according to this embodiment. FIG. 15 is a diagram showing an example of a method of evaluating the suitability of a user learning model for the situation information.

まず、選択行動取得部１３０は、状況情報に基づいて行動選択部７０が選択した行動に関する情報を、状況学習部１１０から取得する（ステップＳ３０１）。状況学習部１１０から取得する情報には、状況情報生成部２０が生成した状況情報データと、行動選択部７０により選択された行動と、が含まれる。 First, the selected action acquisition unit 130 acquires information about the action selected by the action selection unit 70 based on the situation information from the situation learning unit 110 (step S301). The information acquired from the situation learning section 110 includes situation information data generated by the situation information generating section 20 and actions selected by the action selecting section 70 .

次いで、行動評価部１５０のユーザ学習モデル抽出部は、記憶部１７０に保存されているユーザ学習モデルの中から、行動選択部７０が選択した行動に紐付けられているユーザ学習モデルを検索する（ステップＳ３０２）。 Next, the user learning model extraction unit of the behavior evaluation unit 150 searches for user learning models linked to the behavior selected by the behavior selection unit 70 from among the user learning models stored in the storage unit 170 ( step S302).

検索の結果、記憶部１７０に保存されているユーザ学習モデルの中に行動選択部７０が選択した行動に紐付けられているユーザ学習モデルが少なくとも１つ存在する場合（ステップＳ３０３における「Ｙｅｓ」）には、ステップＳ３０４ヘと移行する。一方、記憶部１７０に保存されているユーザ学習モデルの中に、行動選択部７０が選択した行動に紐付けられているユーザ学習モデルが１つも存在しない場合（ステップＳ３０３における「Ｎｏ」）には、ステップＳ３０７ヘと移行する。 If, as a result of the search, at least one of the user learning models stored in the storage unit 170 is linked to the action selected by the action selection unit 70 ("Yes" in step S303), the process proceeds to step S304. On the other hand, if none of the user learning models stored in the storage unit 170 is linked to the action selected by the action selection unit 70 ("No" in step S303), the process proceeds to step S307.

次いで、行動評価部１５０のユーザ学習モデル抽出部は、行動選択部７０が選択した行動に紐付けられているユーザ学習モデルの中から、現在の状況情報データに対して状況情報データの適合性が最も高いユーザ学習モデルを抽出する（ステップＳ３０４）。 Next, the user learning model extraction unit of the action evaluation unit 150 extracts, from among the user learning models linked to the action selected by the action selection unit 70, the user learning model whose situation information data best fits the current situation information data (step S304).

例えば、現在の状況情報データと選択した行動に紐付けられたユーザ学習モデルの状況情報データとを２次元ベクトルと見なし、これらベクトルの内積値を算出する。そして、現在の状況情報データと選択した行動に紐付けられたユーザ学習モデルの状況情報データとの組み合わせのうち、内積値が最も大きい組み合わせにおける状況情報データを含むユーザ学習モデルを、適合性が最も高いユーザ学習モデルとして抽出する。 For example, the current situation information data and the situation information data of each user learning model linked to the selected action are regarded as two-dimensional vectors, and the inner product of these vectors is calculated. Then, among the combinations of the current situation information data and the situation information data of the user learning models linked to the selected action, the user learning model containing the situation information data of the combination with the largest inner product is extracted as the most suitable user learning model.
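A minimal sketch of the inner-product matching in step S304 is shown below. The situation information data is flattened to a list of element values, and the dictionary keys (`name`, `situation`, `verdict`) are hypothetical labels for the parts of a user learning model.

```python
def inner_product(a, b):
    """Inner product of two flattened situation-information vectors."""
    return sum(x * y for x, y in zip(a, b))


def most_suitable_model(current, models):
    """Return the user learning model whose stored situation data has the
    largest inner product with the current situation data (step S304)."""
    return max(models, key=lambda m: inner_product(current, m["situation"]))


# Hypothetical user learning models linked to the selected action.
models = [
    {"name": "A", "situation": [1, 0, 0, 1], "verdict": "do"},
    {"name": "B", "situation": [0, 1, 1, 1], "verdict": "do not"},
]
best = most_suitable_model([0, 1, 1, 0], models)
```

Here model A shares no black pixel with the current data (inner product 0), while model B shares two, so B is extracted. Note that a raw inner product favors stored data with many black pixels; whether any normalization is applied is not stated in the text.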

或いは、状況学習部１１０のスコア取得部３０と同様のアルゴリズムを用い、尤度やスコアを基準として適合性が最も高いユーザ学習モデルを抽出するようにしてもよい。 Alternatively, an algorithm similar to that of the score acquisition section 30 of the situation learning section 110 may be used to extract the user learning model with the highest suitability based on likelihood and score.

次いで、行動評価部１５０は、抽出したユーザ学習モデルのスロットが、現在の状況情報データに適合するかどうかの判定を行う（ステップＳ３０５）。具体的には、ステップＳ３０４で抽出したユーザ学習モデルのスロットと現在の状況情報データとの間に一致する情報があるかどうかをチェックする。例えば、ステップＳ３０４で抽出したユーザ学習モデルのスロットが図１５（ａ）に示すビットマップイメージで表され、現在の状況情報データが図１５（ｂ）に示すビットマップイメージで表されたものとする。これらビットマップイメージを２次元行列と見なして要素値毎の積を取ると、図１５（ｃ）に示すビットマップイメージが得られる。この場合、スロットと状況情報データとに一致する情報が存在するため、抽出したスロットが現在の状況情報データに該当すると判定する。 Next, the action evaluation unit 150 determines whether the slot of the extracted user learning model corresponds to the current situation information data (step S305). Specifically, it checks whether there is matching information between the slot of the user learning model extracted in step S304 and the current situation information data. For example, assume that the slot of the user learning model extracted in step S304 is represented by the bitmap image shown in FIG. 15(a) and that the current situation information data is represented by the bitmap image shown in FIG. 15(b). Regarding these bitmap images as two-dimensional matrices and taking the element-wise product yields the bitmap image shown in FIG. 15(c). In this case, matching information exists between the slot and the situation information data, so the extracted slot is determined to correspond to the current situation information data.
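The check in step S305 can be sketched as testing whether the element-wise product of the slot and the current situation information data has any nonzero element. The bitmaps below are invented examples, not the contents of FIG. 15, and `slot_matches` is a hypothetical name; treating "matching information exists" as "at least one nonzero product" is an interpretation of the text.

```python
def slot_matches(slot, situation):
    """True if slot and situation share at least one nonzero position,
    i.e. their element-wise product is not all zeros."""
    return any(s * x
               for row_s, row_x in zip(slot, situation)
               for s, x in zip(row_s, row_x))


slot = [
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]
situation_weak_hand = [
    [0, 0, 1, 0],
    [0, 1, 0, 1],
]
situation_strong_hand = [
    [0, 0, 1, 0],
    [0, 0, 1, 1],
]

applies = slot_matches(slot, situation_weak_hand)
does_not_apply = slot_matches(slot, situation_strong_hand)
```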

或いは、状況学習部１１０のスコア取得部３０と同様のアルゴリズムを用い、尤度やスコアを基準として、抽出したスロットが現在の状況情報データに該当するか否かを判定するようにしてもよい。 Alternatively, an algorithm similar to that of the score acquisition section 30 of the situation learning section 110 may be used to determine whether or not the extracted slot corresponds to the current situation information data based on the likelihood and score.

判定の結果、抽出したスロットが現在の状況情報データに該当する場合（ステップＳ３０５における「Ｙｅｓ」）には、ステップＳ３０６ヘと移行する。一方、ステップ３０５における判定の結果、抽出したユーザ学習モデルのスロットが現在の状況情報データに該当しない場合（ステップＳ３０５における「Ｎｏ」）には、ステップＳ３０７ヘと移行する。 As a result of the determination, if the extracted slot corresponds to the current situation information data ("Yes" in step S305), the process proceeds to step S306. On the other hand, if the extracted slot of the user learning model does not correspond to the current situation information data as a result of determination in step 305 ("No" in step S305), the process proceeds to step S307.

ステップＳ３０６において、行動評価部１５０は、抽出したユーザ学習モデルのユーザエピソードにおける判断が、行動選択部７０の選択した行動を「行う」とするものであるか否かを判定する。 In step S306, the action evaluation unit 150 determines whether or not the judgment in the user episode of the extracted user learning model is to "perform" the action selected by the action selection unit 70.

ユーザエピソードの判断が、行動選択部７０が選択した行動を「行う」とするものである場合（ステップＳ３０６における「Ｙｅｓ」）には、ステップＳ３０７ヘと移行する。一方、ユーザエピソードの判断が、行動選択部７０の選択した行動を「行わない」とするものである場合（ステップＳ３０６における「Ｎｏ」）、行動候補取得部１０において候補に挙がった行動の中から次に尤度の高い行動を選択する。そして、上述したステップＳ３０１～ステップＳ３０７の処理を繰り返す。 If the judgment in the user episode is to "perform" the action selected by the action selection unit 70 ("Yes" in step S306), the process proceeds to step S307. On the other hand, if the judgment in the user episode is "do not perform" the action selected by the action selection unit 70 ("No" in step S306), the action with the next highest likelihood is selected from among the actions listed as candidates by the action candidate acquisition unit 10. Then, the processing of steps S301 to S307 described above is repeated.

ステップＳ３０７において、行動決定部１６０は、行動選択部７０が選択した行動を実行する。行動選択部７０が選択した行動が実行されるのは、その行動がユーザ学習モデルに合致している場合、その行動に紐付けられているユーザ学習モデルが１つも存在しない場合、抽出したユーザ学習モデルのスロットが状況情報データに該当しない場合である。行動決定部１６０は、ユーザ学習モデルの中に、状況情報に応じて行動選択部７０が選択した行動に反するものがある場合には、行動選択部７０が選択した行動を実行しないように構成されている。 In step S307, the action determination unit 160 executes the action selected by the action selection unit 70. The action selected by the action selection unit 70 is executed in three cases: when the action agrees with the user learning model, when no user learning model is linked to the action, and when the slot of the extracted user learning model does not correspond to the situation information data. The action determination unit 160 is configured not to execute the action selected by the action selection unit 70 when one of the user learning models contradicts the action that the action selection unit 70 selected for the situation information.
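The overall flow of steps S301 to S307 can be summarized in the following sketch. It is written under several assumptions: situation data and slots are flattened lists, `ranked_actions` stands in for the order in which the situation learning unit would propose actions, the dictionary keys are hypothetical, and the fallback when every action is rejected is not specified in the description.

```python
def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def decide(ranked_actions, models_by_action, current):
    """Return the first proposed action that no applicable model rejects."""
    for action in ranked_actions:                   # S301, retried on rejection
        models = models_by_action.get(action, [])   # S302
        if not models:                              # S303 "No"
            return action                           # S307
        model = max(models,                         # S304
                    key=lambda m: inner_product(current, m["situation"]))
        if inner_product(current, model["slot"]) == 0:  # S305 "No"
            return action                           # S307
        if model["do"]:                             # S306 "Yes"
            return action                           # S307
    return None  # assumption: behavior when all actions are rejected is unspecified


current = [1, 1, 0, 0]
models_by_action = {
    "play pair of queens": [
        {"situation": [1, 1, 0, 0], "slot": [1, 0, 0, 0], "do": False},
    ],
}
result = decide(["play pair of queens", "pass"], models_by_action, current)
```

In this example the first proposal is rejected because its best-matching user learning model applies to the current situation and says "do not perform", so the next proposal, which has no linked model, is executed.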

このように、本実施形態による行動学習装置においては、状況情報に応じた行動の学習及び選択を、より簡単なアルゴリズムで実現することができる。また、状況情報に応じて選択した行動に対するユーザのコメントを蓄積してノウハウとして利用することができ、より適切な行動の選択を実現することができる。 As described above, in the behavior learning device according to the present embodiment, it is possible to realize learning and selection of behavior according to situation information with a simpler algorithm. In addition, users' comments on actions selected according to situation information can be accumulated and used as know-how, and more appropriate actions can be selected.

次に、本実施形態による行動学習装置１００のハードウェア構成例について、図１６を用いて説明する。図１６は、本実施形態による行動学習装置のハードウェア構成例を示す概略図である。 Next, a hardware configuration example of the action learning device 100 according to this embodiment will be described with reference to FIG. 16 . FIG. 16 is a schematic diagram showing a hardware configuration example of the action learning device according to this embodiment.

行動学習装置１００は、例えば図１６に示すように、一般的な情報処理装置と同様のハードウェア構成によって実現することが可能である。例えば、行動学習装置１００は、ＣＰＵ（Central Processing Unit）３００、主記憶部３０２、通信部３０４、入出力インターフェース部３０６を備え得る。 The action learning device 100 can be realized by a hardware configuration similar to that of a general information processing device, as shown in FIG. 16, for example. For example, the action learning device 100 can include a CPU (Central Processing Unit) 300 , a main memory section 302 , a communication section 304 and an input/output interface section 306 .

The CPU 300 is a control and arithmetic device that governs the overall control and arithmetic processing of the action learning device 100. The main storage unit 302 is a storage unit used as a work area and a temporary save area for data, and can be configured by a memory such as a RAM (Random Access Memory). The communication unit 304 is an interface for transmitting and receiving data via a network. The input/output interface unit 306 is an interface for connecting to an external output device 310, input device 312, storage device 314, and the like to transmit and receive data. The CPU 300, the main storage unit 302, the communication unit 304, and the input/output interface unit 306 are interconnected by a system bus 308. The storage device 314 can be configured by, for example, a hard disk device or the like composed of a non-volatile memory such as a ROM (Read Only Memory), a magnetic disk, or a semiconductor memory.

The main storage unit 302 can be used as a work area for constructing the neural network unit 40 including a plurality of learning cells 46 and executing computations. The CPU 300 functions as a control unit that controls arithmetic processing in the neural network unit 40 constructed in the main storage unit 302. The storage device 314 can store learning cell information, including information on trained learning cells 46. Further, by reading the learning cell information stored in the storage device 314 and constructing the neural network unit 40 in the main storage unit 302, a learning environment for various situation information data can be constructed. The storage unit 170 can also be configured by the storage device 314. The CPU 300 is desirably configured to execute the arithmetic processing in the plurality of learning cells 46 of the neural network unit 40 constructed in the main storage unit 302 in parallel.

The communication unit 304 is a communication interface based on standards such as Ethernet (registered trademark) and Wi-Fi (registered trademark), and is a module for communicating with other devices. The learning cell information may be received from another device via the communication unit 304. For example, frequently used learning cell information can be stored in the storage device 314, while infrequently used learning cell information is read from another device.

The output device 310 includes a display such as a liquid crystal display device. The output device 310 can be used as a display device for presenting the situation information and information on the action selected by the action selection unit to the user during learning by the usage learning unit 120. Learning results and action decisions can also be reported to the user via the output device 310. The input device 312 is a keyboard, a mouse, a touch panel, or the like, and is used by the user to input predetermined information to the action learning device 100, for example a user episode during learning by the usage learning unit 120.

The situation information data can also be read from another device via the communication unit 304. Alternatively, the input device 312 can be used as a means for inputting the situation information data.

The functions of each unit of the action learning device 100 according to this embodiment can be realized in hardware by mounting circuit components, that is, hardware components such as an LSI (Large Scale Integration) incorporating a program. Alternatively, the functions can be realized in software by storing a program that provides them in the storage device 314, loading the program into the main storage unit 302, and executing it on the CPU 300.

As described above, according to this embodiment, learning and selection of actions according to situation information can be realized with a simpler algorithm. In addition, users' comments on actions selected according to the situation information can be accumulated and used as know-how, so that more appropriate actions can be selected.

[Second embodiment]
An action learning device and an action learning method according to a second embodiment of the present invention will be described with reference to FIG. 17. Components similar to those of the action learning device according to the first embodiment are denoted by the same reference numerals, and their description is omitted or simplified.

The basic configuration of the action learning device according to this embodiment is the same as that of the action learning device according to the first embodiment shown in FIG. 1. The action learning device according to this embodiment differs from that of the first embodiment in that the score acquisition unit 30 is configured by a database. The action learning device according to this embodiment will be described below with reference to FIG. 1, focusing on the differences from the first embodiment.

The situation information generation unit 20 has a function of generating, based on the information received from the environment 200 and its own situation, situation information data that serves as a key for searching the database. Unlike in the first embodiment, the situation information data does not need to be mapped; the information received from the environment 200 and the device's own situation can be used as they are. In the "Millionaire" example, the cards on the table, the number of times, the cards in hand, the past information, and the like described above can be used as keys for executing the search.

The score acquisition unit 30 includes a database that gives a score for a specific action, using the situation information data as a key. The database of the score acquisition unit 30 holds scores for all conceivable actions for every combination of situation information data. By searching the database of the score acquisition unit 30 with the situation information data generated by the situation information generation unit 20 as a key, a score can be acquired for each of the action candidates extracted by the action candidate acquisition unit 10.

The score adjustment unit 80 has a function of adjusting the score values registered in the database of the score acquisition unit 30 according to the result that the action selected by the action selection unit 70 produced in the environment 200. With this configuration, the database of the score acquisition unit 30 can be trained based on the results of actions.

Next, an action learning method using the action learning device according to this embodiment will be described with reference to FIG. 17.

First, based on the information received from the environment 200 and its own situation, the action candidate acquisition unit 10 extracts the actions that can be taken in that situation (action candidates) (step S401). The method of extracting action candidates is not particularly limited; for example, extraction can be based on rules registered in a rule base.

Next, based on the information received from the environment 200 and its own situation, the situation information generation unit 20 generates situation information data representing information related to the action (step S402). The situation information data may be generated before step S401 or in parallel with step S401.

Next, the situation information data generated by the situation information generation unit 20 is input to the score acquisition unit 30 (step S403). The score acquisition unit 30 searches the database using the input situation information data as a key and acquires a score for each of the action candidates extracted by the action candidate acquisition unit 10 (step S404).

Next, the action selection unit 70 selects, from among the action candidates extracted by the action candidate acquisition unit 10, the action candidate with the highest score acquired by the score acquisition unit 30 (step S405) and executes it on the environment 200 (step S406). In this way, the action whose result is expected to be evaluated most highly can be executed on the environment 200.

Next, the score adjustment unit 80 adjusts the score values registered in the database of the score acquisition unit 30 based on the evaluation of the result of executing, on the environment 200, the action selected by the action selection unit 70 (step S407). For example, the score is raised when the result of the action is evaluated highly and lowered when it is evaluated poorly. By adjusting the scores in the database in this way, the database of the score acquisition unit 30 can be trained based on the results of actions.
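
Steps S403 to S407 can be sketched with an in-memory dictionary standing in for the database. This is a minimal Python illustration only: the `(situation_key, action)` key layout, the fixed adjustment `step`, the signed `evaluation` value, and the `default_score` fallback are assumptions introduced here, and the description itself states that the database holds a score for every combination.

```python
from typing import Dict, Hashable, Iterable, Tuple

# Assumed layout: (situation information data used as key, action) -> score.
ScoreTable = Dict[Tuple[Hashable, str], float]


def select_action(table: ScoreTable, situation_key: Hashable,
                  candidates: Iterable[str], default_score: float = 0.0) -> str:
    """Steps S403 to S405: look up each candidate's score and pick the highest."""
    return max(candidates,
               key=lambda action: table.get((situation_key, action), default_score))


def adjust_score(table: ScoreTable, situation_key: Hashable, action: str,
                 evaluation: float, step: float = 1.0,
                 default_score: float = 0.0) -> None:
    """Step S407: raise the score after a good result, lower it after a bad one."""
    current = table.get((situation_key, action), default_score)
    if evaluation > 0:
        delta = step
    elif evaluation < 0:
        delta = -step
    else:
        delta = 0.0
    table[(situation_key, action)] = current + delta
```

Repeating the select, execute, adjust cycle gradually shifts the stored scores toward actions whose results were evaluated well in a given situation.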

As described above, according to this embodiment, even when the score acquisition unit 30 is configured by a database, learning and selection of actions according to the environment and the device's own situation can be realized with a simpler algorithm, as in the first embodiment. In addition, users' comments on actions selected according to the situation information can be accumulated and used as know-how, so that more appropriate actions can be selected.

[Third embodiment]
An action learning device according to a third embodiment of the present invention will be described with reference to FIG. 18. Components similar to those of the action learning device according to the first or second embodiment are denoted by the same reference numerals, and their description is omitted or simplified. FIG. 18 is a schematic diagram showing a configuration example of the action learning device according to this embodiment.

As shown in FIG. 18, the action learning device 100 according to this embodiment has an action selection unit 70, an evaluation acquisition unit 140, a slot generation unit 152, and a user learning model generation unit 154.

The action selection unit 70 has a function of selecting an action candidate to be executed on the environment based on situation information data representing the environment and the device's own situation. The evaluation acquisition unit 140 has a function of acquiring the user's evaluation of the action candidate selected by the action selection unit 70, the evaluation indicating, together with a reason, a judgment to execute or not to execute the action candidate in the situation indicated by the situation information data. The slot generation unit 152 has a function of generating, based on the reason given in the evaluation, a slot indicating a point of interest in the situation information data. The user learning model generation unit 154 has a function of generating a user learning model in which the situation information data, the slot, and the judgment in the evaluation are linked to the action candidate.
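
The data flow among these four units can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `Evaluation` and `UserLearningModel` records and their field names are hypothetical, and the slot rule used here (keep the situation items that the user's reason refers to) is only one possible way to derive a point of interest from a reason.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Evaluation:
    """Hypothetical form of the user's evaluation."""
    execute: bool            # judgment: execute or do not execute
    reason_keys: List[str]   # assumed: situation items named in the user's reason


@dataclass
class UserLearningModel:
    """Hypothetical record linking an action candidate to the user's evaluation."""
    action: str
    situation: Dict[str, object]
    slot: Dict[str, object]
    execute: bool


def generate_slot(situation: Dict[str, object],
                  evaluation: Evaluation) -> Dict[str, object]:
    # Assumed rule: the slot keeps only the items the reason points at.
    return {key: situation[key] for key in evaluation.reason_keys if key in situation}


def generate_user_learning_model(action: str, situation: Dict[str, object],
                                 evaluation: Evaluation) -> UserLearningModel:
    slot = generate_slot(situation, evaluation)
    # Copy the situation so that later changes do not alter the stored model.
    return UserLearningModel(action, dict(situation), slot, evaluation.execute)
```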

As described above, according to this embodiment, users' comments on actions selected according to the situation information can be accumulated and used as know-how, so that more appropriate actions can be learned.

[Fourth embodiment]
An action determination device according to a fourth embodiment of the present invention will be described with reference to FIG. 19. Components similar to those of the action learning device according to the first or second embodiment are denoted by the same reference numerals, and their description is omitted or simplified. FIG. 19 is a schematic diagram showing a configuration example of the action determination device according to this embodiment.

As shown in FIG. 19, the action determination device 500 according to this embodiment has an action selection unit 70, a user learning model extraction unit 156, an action determination unit 160, and a storage unit 170.

The action selection unit 70 has a function of selecting an action candidate to be executed on the environment based on current situation information data representing the current environment and the device's own current situation. The storage unit 170 holds user learning models in which, for each of a plurality of action candidates, situation information data representing the environment and the device's own situation, a slot indicating a point of interest in the situation information data, and a judgment to execute or not to execute the action candidate in the situation indicated by the situation information data and the slot are linked. The user learning model extraction unit 156 has a function of extracting from the storage unit 170, among the user learning models linked to the action candidate selected by the action selection unit 70, the user learning model whose situation information data has the highest compatibility with the current situation information data. The action determination unit 160 has a function of determining whether to execute the action candidate selected by the action selection unit 70 based on the relationship between the current situation information data and the slot of the user learning model extracted by the user learning model extraction unit 156.
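
The extraction performed by the user learning model extraction unit 156 can be sketched as follows. This is a minimal Python illustration: the `UserLearningModel` record is hypothetical, and the compatibility measure (the share of stored situation items whose value equals the current value) is an assumption introduced here, since the passage above does not fix a particular measure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class UserLearningModel:
    """Hypothetical record; the field names are assumptions for illustration."""
    action: str
    situation: Dict[str, object]
    slot: Dict[str, object]
    execute: bool


def compatibility(stored: Dict[str, object], current: Dict[str, object]) -> float:
    """Assumed measure: share of stored items whose value matches the current one."""
    if not stored:
        return 0.0
    hits = sum(1 for key, value in stored.items() if current.get(key) == value)
    return hits / len(stored)


def extract_model(models: List[UserLearningModel], action: str,
                  current: Dict[str, object]) -> Optional[UserLearningModel]:
    """Return the model linked to `action` that best fits the current situation."""
    linked = [m for m in models if m.action == action]
    if not linked:
        return None  # no user learning model is linked to this action
    return max(linked, key=lambda m: compatibility(m.situation, current))
```

The extracted model, or `None` when nothing is linked to the action, is what the action determination unit 160 would then compare against the current situation information data.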

As described above, according to this embodiment, users' comments on actions selected according to the situation information can be accumulated and used as know-how, so that more appropriate actions can be selected.

[Modified embodiment]
The present invention is not limited to the above embodiments, and various modifications are possible.
For example, an example in which part of the configuration of one embodiment is added to another embodiment, or is substituted for part of the configuration of another embodiment, is also an embodiment of the present invention.

Further, although the above embodiments were described using a player's actions in the card game "Millionaire" as an application example, the present invention can be widely applied to the learning and selection of actions in any case where actions are taken based on the environment and one's own situation.

The scope of each embodiment also includes a processing method in which a program that operates the configuration of the embodiment so as to realize the functions described above is recorded on a recording medium, read out from the recording medium as code, and executed on a computer. That is, a computer-readable recording medium is also included in the scope of each embodiment. Not only the recording medium on which the program is recorded but also the program itself is included in each embodiment.

As the recording medium, for example, a floppy (registered trademark) disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a magnetic tape, a non-volatile memory card, or a ROM can be used. The scope of each embodiment is not limited to a program that executes the processing by itself as recorded on the recording medium; it also includes a program that runs on an OS and executes the processing in cooperation with other software or the functions of an expansion board.

Each of the above embodiments is merely a concrete example of carrying out the present invention, and the technical scope of the present invention must not be construed as limited by them. That is, the present invention can be carried out in various forms without departing from its technical idea or its main features.

Some or all of the above embodiments can also be described as in the following supplementary notes, but are not limited to them.

(Supplementary Note 1)
An action learning device comprising:
an action selection unit that selects an action candidate to be executed on an environment based on situation information data representing the environment and its own situation;
an evaluation acquisition unit that acquires a user's evaluation of the action candidate selected by the action selection unit, the evaluation indicating, together with a reason, a judgment to execute or not to execute the action candidate in the situation indicated by the situation information data;
a slot generation unit that generates, based on the reason in the evaluation, a slot indicating a point of interest in the situation information data; and
a user learning model generation unit that generates a user learning model in which the situation information data, the slot, and the judgment in the evaluation are linked to the action candidate.

(Supplementary Note 2)
The action learning device according to Supplementary Note 1, further comprising:
an action candidate acquisition unit that extracts, based on the situation information data, a plurality of action candidates that can be taken with respect to the environment; and
a score acquisition unit that acquires, for each of the plurality of action candidates, a score that is an index representing the effect expected from the result of the action,
wherein the action selection unit selects, from among the plurality of action candidates, the action candidate with the highest score as the action candidate.

(Supplementary Note 3)
The action learning device according to Supplementary Note 2, further comprising a score adjustment unit that adjusts the value of the score linked to the selected action candidate based on a result of executing the selected action candidate on the environment.

(Supplementary Note 4)
The action learning device according to Supplementary Note 3, wherein
the score acquisition unit has a neural network unit having a plurality of learning cells, each including a plurality of input nodes that apply a predetermined weight to each of a plurality of element values based on the situation information data, and an output node that adds the weighted element values and outputs the sum,
each of the plurality of learning cells has a predetermined score and is linked to one of the plurality of action candidates,
the score acquisition unit sets, as the score of each action candidate, the score of the learning cell that has the largest correlation value between the plurality of element values and the output value of the learning cell among the learning cells linked to that action candidate,
the action selection unit selects the action candidate with the highest score from among the plurality of action candidates and executes it on the environment, and
the score adjustment unit adjusts the score of the learning cell linked to the selected action candidate based on a result of executing the selected action candidate.

(Supplementary Note 5)
The action learning device according to Supplementary Note 4, wherein
the score acquisition unit further has a learning unit that trains the neural network unit, and
the learning unit, according to the output value of the learning cell, updates the weighting coefficients of the plurality of input nodes of the learning cell or adds a new learning cell to the neural network unit.

(Supplementary Note 6)
The action learning device according to Supplementary Note 5, wherein the learning unit adds the new learning cell when the correlation value between the plurality of element values and the output value of the learning cell is less than a predetermined threshold.

(Supplementary Note 7)
The action learning device according to Supplementary Note 5, wherein the learning unit updates the weighting coefficients of the plurality of input nodes of the learning cell when the correlation value between the plurality of element values and the output value of the learning cell is equal to or greater than a predetermined threshold.
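
The threshold rule of Supplementary Notes 5 to 7 can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `LearningCell` record, the `correlate` callable supplied by the caller, the initialization of a new cell's weights from the element values, and the update rule that accumulates the element values into the existing weights are all hypothetical choices, since the notes state only when to add a cell and when to update one.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class LearningCell:
    """Hypothetical learning cell: one weight per input node, linked to an action."""
    weights: List[float]
    action: str
    score: float = 0.0


def learn(cells: List[LearningCell], element_values: Sequence[float], action: str,
          correlate: Callable[[LearningCell, Sequence[float]], float],
          threshold: float, initial_score: float = 0.0) -> LearningCell:
    """Update the best-matching cell, or add a new cell when none matches well."""
    scored = [(correlate(cell, element_values), cell)
              for cell in cells if cell.action == action]
    if scored:
        best_value, best = max(scored, key=lambda pair: pair[0])
        if best_value >= threshold:
            # Assumed update rule: accumulate the element values into the weights.
            best.weights = [w + x for w, x in zip(best.weights, element_values)]
            return best
    # Correlation below the threshold (or no cell yet): add a new learning cell.
    new_cell = LearningCell(list(element_values), action, initial_score)
    cells.append(new_cell)
    return new_cell
```

Under this sketch, situations that resemble an existing cell reinforce it, while unfamiliar situations grow the network by one cell at a time.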

(Supplementary Note 8)
The action learning device according to any one of Supplementary Notes 4 to 7, wherein the correlation value is a likelihood relating to the output value of the learning cell.

(Supplementary Note 9)
The action learning device according to Supplementary Note 8, wherein the likelihood is the ratio of the output value of the learning cell when the plurality of element values are input to the maximum value of the output of the learning cell determined by the weighting coefficients set for the plurality of input nodes.
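
The likelihood of Supplementary Note 9 and the score acquisition of Supplementary Note 4 can be sketched together. This is a minimal Python illustration: the `LearningCell` record is hypothetical, and the maximum output is taken to be the sum of the weights, which holds only under the assumption that the element values lie between 0 and 1 and the weights are non-negative.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Sequence


@dataclass
class LearningCell:
    """Hypothetical learning cell: one weight per input node, linked to an action."""
    weights: List[float]
    action: str
    score: float


def cell_output(cell: LearningCell, element_values: Sequence[float]) -> float:
    # Output node: sum of the weighted element values.
    return sum(w * x for w, x in zip(cell.weights, element_values))


def likelihood(cell: LearningCell, element_values: Sequence[float]) -> float:
    # Assumption: element values in [0, 1] and non-negative weights, so the
    # largest possible output is the sum of the weights.
    max_output = sum(cell.weights)
    if max_output == 0:
        return 0.0
    return cell_output(cell, element_values) / max_output


def acquire_scores(cells: List[LearningCell], element_values: Sequence[float],
                   candidates: Iterable[str]) -> Dict[str, float]:
    """For each candidate, take the score of its most likely linked cell."""
    scores: Dict[str, float] = {}
    for action in candidates:
        linked = [c for c in cells if c.action == action]
        if linked:
            best = max(linked, key=lambda c: likelihood(c, element_values))
            scores[action] = best.score
    return scores
```

Candidates with no linked learning cell are simply left out of the result in this sketch; how such candidates should be scored is not covered here.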

(Supplementary Note 10)
The action learning device according to any one of Supplementary Notes 4 to 9, further comprising a situation information generation unit that generates, based on the environment and the own situation, the situation information data onto which information related to the action is mapped.

(Supplementary Note 11)
The action learning device according to Supplementary Note 2 or 3, wherein the score acquisition unit has a database that gives the score for each of the plurality of action candidates using the situation information data as a key.

(Supplementary Note 12)
An action determination device comprising:
a storage unit that holds user learning models in which, for each of a plurality of action candidates, situation information data representing an environment and its own situation, a slot indicating a point of interest in the situation information data, and a judgment to execute or not to execute the action candidate in the situation indicated by the situation information data and the slot are linked;
an action selection unit that selects an action candidate to be executed on the environment based on current situation information data representing the current environment and its own current situation;
a user learning model extraction unit that extracts from the storage unit, among the user learning models linked to the action candidate selected by the action selection unit, the user learning model having the situation information data with the highest compatibility with the current situation information data; and
an action determination unit that determines whether to execute the action candidate selected by the action selection unit based on the relationship between the current situation information data and the slot of the extracted user learning model.

(Supplementary Note 13)
The action determination device according to Supplementary Note 12, further comprising:
an action candidate acquisition unit that extracts, based on the situation information data, a plurality of action candidates that can be taken with respect to the environment; and
a score acquisition unit that acquires, for each of the plurality of action candidates, a score that is an index representing the effect expected from the result of the action,
wherein the action selection unit selects, from among the plurality of action candidates, the action candidate with the highest score as the action candidate.

(Supplementary Note 14)
The action determination device according to Supplementary Note 12 or 13, wherein the action determination unit determines to execute the action candidate selected by the action selection unit when the slot of the extracted user learning model matches the current situation information data and a judgment to execute the action candidate is linked to the user learning model.

(Supplementary Note 15)
The action determination device according to Supplementary Note 12 or 13, wherein the action determination unit determines to execute the action candidate selected by the action selection unit when the slot of the extracted user learning model does not match the current situation information data.

(Supplementary Note 16)
The action determination device according to Supplementary Note 13, wherein the action selection unit selects the action candidate with the next highest score as the action candidate when the slot of the extracted user learning model matches the current situation information data and a judgment not to execute the action candidate is linked to the user learning model.
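
The fallback of Supplementary Note 16 can be sketched as a walk down the candidates in descending score order. This is a minimal Python illustration: the `is_vetoed` callable, which stands for "the slot matches and a judgment not to execute is linked", and the `None` return value when every candidate is vetoed are assumptions introduced here.

```python
from typing import Callable, Dict, Optional


def choose_action(scores: Dict[str, float],
                  is_vetoed: Callable[[str], bool]) -> Optional[str]:
    """Try candidates from highest to lowest score, skipping vetoed ones."""
    for action in sorted(scores, key=scores.get, reverse=True):
        if not is_vetoed(action):
            return action
    # Assumed behaviour: report that no candidate remains.
    return None
```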

(Supplementary Note 17)
The action determination device according to Supplementary Note 13, wherein
the score acquisition unit has a neural network unit having a plurality of learning cells, each including a plurality of input nodes that apply a predetermined weight to each of a plurality of element values based on the situation information data, and an output node that adds the weighted element values and outputs the sum,
each of the plurality of learning cells has a predetermined score and is linked to one of the plurality of action candidates,
the score acquisition unit sets, as the score of each action candidate, the score of the learning cell that has the largest correlation value between the plurality of element values and the output value of the learning cell among the learning cells linked to that action candidate, and
the action selection unit selects the action candidate with the highest score from among the plurality of action candidates.

（付記１８）
前記相関値は、前記学習セルの前記出力値に関する尤度である
ことを特徴とする付記１７記載の行動決定装置。 (Appendix 18)
The action determination device according to Appendix 17, wherein the correlation value is a likelihood regarding the output value of the learning cell.

（付記１９）
前記尤度は、前記複数の入力ノードの各々に設定されている重み付け係数に応じた前記学習セルの出力の最大値に対する前記複数の要素値を入力したときの前記学習セルの前記出力値の比率である
ことを特徴とする付記１８記載の行動決定装置。 (Appendix 19)
The action determination device according to Appendix 18, wherein the likelihood is the ratio of the output value of the learning cell when the plurality of element values are input to the maximum output value of the learning cell determined by the weighting coefficients set for the plurality of input nodes.
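Appendices 17 to 19 describe how a score is obtained from the learning cells. The following is a minimal sketch under stated assumptions, not the patented implementation. It assumes element values in the range 0 to 1 and non-negative weighting coefficients, so that the maximum output of a cell equals the sum of its weights. The names `LearningCell`, `likelihood`, `acquire_scores`, and `select_action` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class LearningCell:
    weights: Sequence[float]  # one weighting coefficient per input node
    score: float              # predetermined score held by the cell
    action: str               # action candidate the cell is linked to


def cell_output(cell: LearningCell, elements: Sequence[float]) -> float:
    """Output node: sum of the weighted element values."""
    return sum(w * x for w, x in zip(cell.weights, elements))


def likelihood(cell: LearningCell, elements: Sequence[float]) -> float:
    """Ratio of the actual output to the maximum output for these weights."""
    max_output = sum(cell.weights)  # assumption: elements in [0, 1], weights >= 0
    if max_output == 0:
        return 0.0
    return cell_output(cell, elements) / max_output


def acquire_scores(cells: List[LearningCell],
                   elements: Sequence[float]) -> Dict[str, float]:
    """For each action, take the score of its linked cell with the largest likelihood."""
    best: Dict[str, Tuple[float, float]] = {}  # action -> (likelihood, score)
    for cell in cells:
        value = likelihood(cell, elements)
        if cell.action not in best or value > best[cell.action][0]:
            best[cell.action] = (value, cell.score)
    return {action: score for action, (_, score) in best.items()}


def select_action(scores: Dict[str, float]) -> str:
    """Select the action candidate with the highest score."""
    return max(scores, key=scores.get)
```

Note that the likelihood only decides which cell represents each action candidate. The comparison between action candidates is made on the scores of those cells, so a candidate whose best cell fits the situation less closely can still be selected if that cell carries a higher score.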

（付記２０）
前記スコア取得部は、前記状況情報データをキーとして前記複数の行動候補の各々に対する前記スコアを与えるデータベースを有する
ことを特徴とする付記１３記載の行動決定装置。 (Appendix 20)
The action determination device according to Appendix 13, wherein the score acquisition unit has a database that provides the score for each of the plurality of action candidates using the situation information data as a key.

（付記２１）
環境及び自己の状況を表す状況情報データに基づいて、前記環境に対して実行する行動候補を選択するステップと、
前記選択するステップにおいて選択された前記行動候補に対するユーザの評価であって、前記状況情報データが示す状況において前記行動候補を実行する又は実行しないとの判断を理由とともに示す評価を取得するステップと、
前記評価における前記理由に基づき、前記状況情報データの注目箇所を示すスロットを生成するステップと、
前記行動候補に、前記状況情報データ、前記スロット及び前記評価における前記判断が紐付けられているユーザ学習モデルを生成するステップと
を有することを特徴とする行動学習方法。 (Appendix 21)
An action learning method comprising:
a step of selecting an action candidate to be executed with respect to the environment based on situation information data representing the environment and one's own situation;
a step of acquiring a user's evaluation of the action candidate selected in the selecting step, the evaluation indicating a judgment to execute or not to execute the action candidate in the situation indicated by the situation information data, together with a reason;
a step of generating a slot indicating a point of interest in the situation information data based on the reason in the evaluation; and
a step of generating a user learning model in which the situation information data, the slot, and the judgment in the evaluation are linked to the action candidate.
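The learning steps of Appendix 21 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the situation information data is modeled as a dictionary, and the user's reason is assumed to arrive as the names of the situation items the user points to (`reason_keys`). The names `Evaluation`, `generate_slot`, and `generate_user_learning_model` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Evaluation:
    execute: bool                 # judgment: execute or do not execute
    reason_keys: Tuple[str, ...]  # hypothetical: situation items cited as the reason


@dataclass
class UserLearningModel:
    action: str
    situation: Dict[str, str]
    slot: Dict[str, str]
    execute: bool


def generate_slot(situation: Dict[str, str], evaluation: Evaluation) -> Dict[str, str]:
    """Keep only the parts of the situation that the user's reason points to."""
    return {key: situation[key] for key in evaluation.reason_keys if key in situation}


def generate_user_learning_model(action: str,
                                 situation: Dict[str, str],
                                 evaluation: Evaluation) -> UserLearningModel:
    """Link the situation, the slot, and the judgment to the action candidate."""
    slot = generate_slot(situation, evaluation)
    return UserLearningModel(action, dict(situation), slot, evaluation.execute)
```

For example, if the user rejects `"play_card"` because of the opponent's hand, the resulting model stores the full situation but its slot holds only the `"opponent_hand"` item.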

（付記２２）
環境及び自己の状況を表す状況情報データに基づいて、取り得る複数の行動候補を抽出するステップと、
前記複数の行動候補の各々について、行動した結果に対して見込まれる効果を表す指標であるスコアを取得するステップと、を更に有し、
前記選択するステップでは、前記複数の行動候補の中から、前記スコアが最も大きい行動候補を選択する
ことを特徴とする付記２１記載の行動学習方法。 (Appendix 22)
The action learning method according to Appendix 21, further comprising:
a step of extracting a plurality of action candidates that can be taken based on situation information data representing the environment and one's own situation; and
a step of acquiring, for each of the plurality of action candidates, a score that is an index representing the effect expected from the result of taking the action,
wherein in the selecting step, the action candidate with the highest score is selected from among the plurality of action candidates.

（付記２３）
選択した前記行動候補を前記環境に対して実行した結果に基づいて、選択した前記行動候補に紐付けられている前記スコアの値を調整するステップを更に有する
ことを特徴とする付記２２記載の行動学習方法。 (Appendix 23)
The action learning method according to Appendix 22, further comprising a step of adjusting the value of the score linked to the selected action candidate based on a result of executing the selected action candidate on the environment.
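The selection in Appendix 22 and the adjustment in Appendix 23 form a simple loop: pick the highest-scoring candidate, execute it, then adjust its score according to the result. The sketch below is illustrative only. This section does not state the adjustment rule, so the additive update and the `step` parameter are assumptions, as are the function names.

```python
from typing import Dict


def select_action(scores: Dict[str, float]) -> str:
    """Select the action candidate with the highest score."""
    return max(scores, key=scores.get)


def adjust_score(scores: Dict[str, float], action: str,
                 result: float, step: float = 1.0) -> None:
    """Assumed rule: raise the score after a good result, lower it after a bad one."""
    scores[action] += step * result
```

Under this assumed rule, a candidate that keeps producing poor results loses its lead, and a different candidate is selected the next time the same scores are consulted.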

（付記２４）
前記取得するステップでは、前記状況情報データに基づく複数の要素値の各々に所定の重み付けをする複数の入力ノードと、重み付けをした前記複数の要素値を加算して出力する出力ノードと、を各々が含む複数の学習セルを有し、前記複数の学習セルの各々が、所定のスコアを有し、前記複数の行動候補のうちのいずれかに紐付けられているニューラルネットワーク部において、前記複数の行動候補の各々に紐付けられた前記学習セルのうち、前記複数の要素値と前記学習セルの出力値との間の相関値が最も大きい前記学習セルの前記スコアを、対応する前記行動候補のスコアに設定し、
前記選択するステップでは、前記複数の行動候補のうち、前記スコアが最も大きい前記行動候補を選択し、
前記調整するステップでは、選択した前記行動候補を実行した結果に基づいて、選択した前記行動候補に紐付けられている前記学習セルの前記スコアを調整する
ことを特徴とする付記２３記載の行動学習方法。 (Appendix 24)
The action learning method according to Appendix 23, wherein
in the acquiring step, in a neural network unit having a plurality of learning cells, each including a plurality of input nodes that apply a predetermined weight to each of a plurality of element values based on the situation information data and an output node that adds the weighted element values and outputs the sum, each of the learning cells having a predetermined score and being linked to one of the plurality of action candidates, the score of the learning cell that has the largest correlation value between the plurality of element values and the output value of the learning cell, among the learning cells linked to each action candidate, is set as the score of that action candidate,
in the selecting step, the action candidate with the highest score is selected from among the plurality of action candidates, and
in the adjusting step, the score of the learning cell linked to the selected action candidate is adjusted based on a result of executing the selected action candidate.

（付記２５）
現在の環境及び自己の状況を表す現在の状況情報データに基づいて、前記環境に対して実行する行動候補を選択するステップと、
複数の行動候補の各々に対して、環境及び自己の状況を表す状況情報データと、前記状況情報データの注目箇所を示すスロットと、前記状況情報データ及び前記スロットが示す状況において前記行動候補を実行する又は実行しないとの判断と、が紐付けられているユーザ学習モデルの中から、前記選択するステップにおいて選択された前記行動候補に紐付けられた前記ユーザ学習モデルであって、前記現在の状況情報データに対する適合性が最も高い前記状況情報データを有する前記ユーザ学習モデルを抽出するステップと、
前記現在の状況情報データと抽出した前記ユーザ学習モデルの前記スロットとの関係に基づいて、前記選択するステップにおいて選択された前記行動候補を実行するか否かを判断するステップと
を有することを特徴とする行動決定方法。 (Appendix 25)
An action determination method comprising:
a step of selecting an action candidate to be executed with respect to the environment based on current situation information data representing the current environment and one's own situation;
a step of extracting, from among user learning models in which, for each of a plurality of action candidates, situation information data representing the environment and one's own situation, a slot indicating a point of interest in the situation information data, and a judgment to execute or not to execute the action candidate in the situation indicated by the situation information data and the slot are linked, the user learning model that is linked to the action candidate selected in the selecting step and that has the situation information data with the highest compatibility with the current situation information data; and
a step of determining whether or not to execute the action candidate selected in the selecting step based on the relationship between the current situation information data and the slot of the extracted user learning model.
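The extraction step of Appendix 25 can be sketched as a nearest-match lookup over the stored user learning models. This is a minimal illustration under stated assumptions: this section does not define how compatibility is measured, so counting identical items between two dictionaries is a stand-in, and the names `UserLearningModel`, `compatibility`, and `extract_model` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class UserLearningModel:
    action: str
    situation: Dict[str, str]
    slot: Dict[str, str]
    execute: bool


def compatibility(stored: Dict[str, str], current: Dict[str, str]) -> int:
    """Assumed measure: number of stored items that also hold in the current situation."""
    return sum(1 for key, value in stored.items() if current.get(key) == value)


def extract_model(models: List[UserLearningModel], action: str,
                  current: Dict[str, str]) -> Optional[UserLearningModel]:
    """Among models linked to the selected action, return the best-matching one."""
    linked = [m for m in models if m.action == action]
    if not linked:
        return None  # assumption: no model has been learned for this action yet
    return max(linked, key=lambda m: compatibility(m.situation, current))
```

The extracted model's slot is then compared with the current situation information data to decide whether the selected action candidate is executed.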

（付記２６）
前記状況情報データに基づいて、前記環境に対して取り得る複数の行動候補を抽出するステップと、
前記複数の行動候補の各々について、行動した結果に対して見込まれる効果を表す指標であるスコアを取得するステップと、を更に有し、
前記選択するステップでは、前記複数の行動候補の中から、前記スコアが最も大きい行動候補を前記行動候補として選択する
ことを特徴とする付記２５記載の行動決定方法。 (Appendix 26)
The action determination method according to Appendix 25, further comprising:
a step of extracting a plurality of action candidates that can be taken with respect to the environment based on the situation information data; and
a step of acquiring, for each of the plurality of action candidates, a score that is an index representing the effect expected from the result of taking the action,
wherein in the selecting step, the action candidate with the highest score is selected from among the plurality of action candidates as the action candidate.

（付記２７）
前記判断するステップでは、抽出した前記ユーザ学習モデルの前記スロットが前記現在の状況情報データに適合し、且つ、前記ユーザ学習モデルに前記行動候補を実行するとの判断が紐付けられている場合には、前記選択するステップにおいて選択された前記行動候補を実行することを決定する
ことを特徴とする付記２５又は２６記載の行動決定方法。 (Appendix 27)
The action determination method according to Appendix 25 or 26, wherein in the determining step, it is determined to execute the action candidate selected in the selecting step when the slot of the extracted user learning model matches the current situation information data and a judgment to execute the action candidate is linked to the user learning model.

（付記２８）
前記判断するステップでは、抽出した前記ユーザ学習モデルの前記スロットが前記現在の状況情報データに適合しない場合には、前記選択するステップにおいて選択された前記行動候補を実行することを決定する
ことを特徴とする付記２５又は２６記載の行動決定方法。 (Appendix 28)
The action determination method according to Appendix 25 or 26, wherein in the determining step, it is determined to execute the action candidate selected in the selecting step when the slot of the extracted user learning model does not match the current situation information data.

（付記２９）
前記判断するステップでは、抽出した前記ユーザ学習モデルの前記スロットが前記現在の状況情報データに適合し、且つ、前記ユーザ学習モデルに前記行動候補を実行しないとの判断が紐付けられている場合には、前記スコアが次に大きい行動候補を前記行動候補として選択する
ことを特徴とする付記２６記載の行動決定方法。 (Appendix 29)
The action determination method according to Appendix 26, wherein in the determining step, the action candidate with the next highest score is selected as the action candidate when the slot of the extracted user learning model matches the current situation information data and a judgment not to execute the action candidate is linked to the user learning model.

（付記３０）
コンピュータを、
環境及び自己の状況を表す状況情報データに基づいて、前記環境に対して実行する行動候補を選択する手段、
前記選択する手段により選択された前記行動候補に対するユーザの評価であって、前記状況情報データが示す状況において前記行動候補を実行する又は実行しないとの判断を理由とともに示す評価を取得する手段、
前記評価における前記理由に基づき、前記状況情報データの注目箇所を示すスロットを生成する手段、
前記行動候補に、前記状況情報データ、前記スロット及び前記評価における前記判断が紐付けられているユーザ学習モデルを生成する手段
として機能させるプログラム。 (Appendix 30)
A program that causes a computer to function as:
means for selecting an action candidate to be executed with respect to the environment based on situation information data representing the environment and one's own situation;
means for acquiring a user's evaluation of the action candidate selected by the selecting means, the evaluation indicating a judgment to execute or not to execute the action candidate in the situation indicated by the situation information data, together with a reason;
means for generating a slot indicating a point of interest in the situation information data based on the reason in the evaluation; and
means for generating a user learning model in which the situation information data, the slot, and the judgment in the evaluation are linked to the action candidate.

（付記３１）
コンピュータを、
現在の環境及び自己の状況を表す現在の状況情報データに基づいて、前記環境に対して実行する行動候補を選択する手段、
複数の行動候補の各々に対して、環境及び自己の状況を表す状況情報データと、前記状況情報データの注目箇所を示すスロットと、前記状況情報データ及び前記スロットが示す状況において前記行動候補を実行する又は実行しないとの判断と、が紐付けられているユーザ学習モデルの中から、前記選択する手段により選択された前記行動候補に紐付けられた前記ユーザ学習モデルであって、前記現在の状況情報データに対する適合性が最も高い前記状況情報データを有する前記ユーザ学習モデルを抽出する手段、
前記現在の状況情報データと抽出した前記ユーザ学習モデルの前記スロットとの関係に基づいて、前記選択するステップにおいて選択された前記行動候補を実行するか否かを判断する手段
として機能させるプログラム。 (Appendix 31)
A program that causes a computer to function as:
means for selecting an action candidate to be executed with respect to the environment based on current situation information data representing the current environment and one's own situation;
means for extracting, from among user learning models in which, for each of a plurality of action candidates, situation information data representing the environment and one's own situation, a slot indicating a point of interest in the situation information data, and a judgment to execute or not to execute the action candidate in the situation indicated by the situation information data and the slot are linked, the user learning model that is linked to the action candidate selected by the selecting means and that has the situation information data with the highest compatibility with the current situation information data; and
means for determining whether or not to execute the action candidate selected in the selecting step based on the relationship between the current situation information data and the slot of the extracted user learning model.

（付記３２）
付記３０又は３１記載のプログラムを記録したコンピュータが読み取り可能な記録媒体。 (Appendix 32)
A computer-readable recording medium recording the program according to appendix 30 or 31.

（付記３３）
付記１乃至１１のいずれか１項に記載の行動学習装置と、
前記行動学習装置が働きかける対象である環境と
を有することを特徴とする行動学習システム。 (Appendix 33)
An action learning system comprising:
the action learning device according to any one of Appendices 1 to 11; and
an environment that the action learning device acts upon.

１０…行動候補取得部
２０…状況情報生成部
３０…スコア取得部
４０…ニューラルネットワーク部
４２，４４…セル
４６…学習セル
５０…判定部
６０…学習部
６２…重み修正部
６４…学習セル生成部
７０…行動選択部
８０…スコア調整部
１００…行動学習装置
１１０…状況学習部
１２０…用法学習部
１３０…選択行動取得部
１４０…評価取得部
１５０…行動評価部
１５２…スロット生成部
１５４…ユーザ学習モデル生成部
１５６…ユーザ学習モデル抽出部
１６０…行動決定部
１７０…記憶部
２００…環境
３００…ＣＰＵ
３０２…主記憶部
３０４…通信部
３０６…入出力インターフェース部
３０８…システムバス
３１０…出力装置
３１２…入力装置
３１４…記憶装置
４００…行動学習システム
DESCRIPTION OF SYMBOLS
10... Action candidate acquisition unit
20... Situation information generation unit
30... Score acquisition unit
40... Neural network unit
42, 44... Cell
46... Learning cell
50... Determination unit
60... Learning unit
62... Weight correction unit
64... Learning cell generation unit
70... Action selection unit
80... Score adjustment unit
100... Action learning device
110... Situation learning unit
120... Usage learning unit
130... Selected action acquisition unit
140... Evaluation acquisition unit
150... Action evaluation unit
152... Slot generation unit
154... User learning model generation unit
156... User learning model extraction unit
160... Action determination unit
170... Storage unit
200... Environment
300... CPU
302... Main storage unit
304... Communication unit
306... Input/output interface unit
308... System bus
310... Output device
312... Input device
314... Storage device
400... Action learning system

Claims

An action learning device comprising:
an action selection unit that selects an action candidate to be executed with respect to the environment based on situation information data representing the environment and one's own situation;
an evaluation acquisition unit that acquires a user's evaluation of the action candidate selected by the action selection unit, the evaluation indicating a judgment to execute or not to execute the action candidate in the situation indicated by the situation information data, together with a reason;
a slot generation unit that generates a slot indicating a point of interest in the situation information data based on the reason in the evaluation; and
a user learning model generation unit that generates a user learning model in which the situation information data, the slot, and the judgment in the evaluation are linked to the action candidate.

The action learning device according to claim 1, further comprising:
an action candidate acquisition unit that extracts a plurality of action candidates that can be taken with respect to the environment based on the situation information data; and
a score acquisition unit that acquires, for each of the plurality of action candidates, a score that is an index representing the effect expected from the result of taking the action,
wherein the action selection unit selects, from among the plurality of action candidates, the action candidate with the highest score as the action candidate.

The action learning device according to claim 2, further comprising a score adjustment unit that adjusts the value of the score linked to the selected action candidate based on a result of executing the selected action candidate on the environment.

The action learning device according to claim 3, wherein
the score acquisition unit has a neural network unit having a plurality of learning cells, each including a plurality of input nodes that apply a predetermined weight to each of a plurality of element values based on the situation information data, and an output node that adds the weighted element values and outputs the sum,
each of the plurality of learning cells has a predetermined score and is linked to one of the plurality of action candidates,
the score acquisition unit sets, as the score of each action candidate, the score of the learning cell that has the largest correlation value between the plurality of element values and the output value of the learning cell among the learning cells linked to that action candidate,
the action selection unit selects the action candidate with the highest score from among the plurality of action candidates and executes the action candidate on the environment, and
the score adjustment unit adjusts the score of the learning cell linked to the selected action candidate based on a result of executing the selected action candidate.

An action determination device comprising:
a storage unit that holds user learning models in which, for each of a plurality of action candidates, situation information data representing the environment and one's own situation, a slot indicating a point of interest in the situation information data, and a judgment to execute or not to execute the action candidate in the situation indicated by the situation information data and the slot are linked;
an action selection unit that selects an action candidate to be executed with respect to the environment based on current situation information data representing the current environment and one's own situation;
a user learning model extraction unit that extracts, from the storage unit, the user learning model that has the situation information data with the highest compatibility with the current situation information data among the user learning models linked to the action candidate selected by the action selection unit; and
an action determination unit that determines whether or not to execute the action candidate selected by the action selection unit based on the relationship between the current situation information data and the slot of the extracted user learning model.

The action determination device according to claim 5, further comprising:
an action candidate acquisition unit that extracts a plurality of action candidates that can be taken with respect to the environment based on the situation information data; and
a score acquisition unit that acquires, for each of the plurality of action candidates, a score that is an index representing the effect expected from the result of taking the action,
wherein the action selection unit selects, from among the plurality of action candidates, the action candidate with the highest score as the action candidate.

The action determination device according to claim 5, wherein the action determination unit determines to execute the action candidate selected by the action selection unit when the slot of the extracted user learning model matches the current situation information data and a judgment to execute the action candidate is linked to the user learning model.

The action determination device according to claim 5 or 6, wherein the action determination unit determines to execute the action candidate selected by the action selection unit when the slot of the extracted user learning model does not match the current situation information data.

If the slot of the extracted user learning model matches the current situation information data and the user learning model is associated with a determination not to execute the candidate action, the action selection unit selects the action candidate with the next highest score as the action candidate.

The action determination device according to claim 6, wherein
the score acquisition unit has a neural network unit having a plurality of learning cells, each including a plurality of input nodes that apply a predetermined weight to each of a plurality of element values based on the situation information data, and an output node that adds the weighted element values and outputs the sum,
each of the plurality of learning cells has a predetermined score and is linked to one of the plurality of action candidates,
the score acquisition unit sets, as the score of each action candidate, the score of the learning cell that has the largest correlation value between the plurality of element values and the output value of the learning cell among the learning cells linked to that action candidate, and
the action selection unit selects the action candidate with the highest score from among the plurality of action candidates.

An action learning method comprising:
a step of selecting an action candidate to be executed with respect to the environment based on situation information data representing the environment and one's own situation;
a step of acquiring a user's evaluation of the action candidate selected in the selecting step, the evaluation indicating a judgment to execute or not to execute the action candidate in the situation indicated by the situation information data, together with a reason;
a step of generating a slot indicating a point of interest in the situation information data based on the reason in the evaluation; and
a step of generating a user learning model in which the situation information data, the slot, and the judgment in the evaluation are linked to the action candidate.

An action determination method comprising:
a step of selecting an action candidate to be executed with respect to the environment based on current situation information data representing the current environment and one's own situation;
a step of extracting, from among user learning models in which, for each of a plurality of action candidates, situation information data representing the environment and one's own situation, a slot indicating a point of interest in the situation information data, and a judgment to execute or not to execute the action candidate in the situation indicated by the situation information data and the slot are linked, the user learning model that is linked to the action candidate selected in the selecting step and that has the situation information data with the highest compatibility with the current situation information data; and
a step of determining whether or not to execute the action candidate selected in the selecting step based on the relationship between the current situation information data and the slot of the extracted user learning model.