JP5361615B2

JP5361615B2 - Behavior control learning method, behavior control learning device, behavior control learning program

Info

Publication number: JP5361615B2
Application number: JP2009199376A
Authority: JP
Inventors: 泰浩南; 啓森; 豊美目黒; 竜一郎東中; 浩二堂坂; 英作前田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-08-31
Filing date: 2009-08-31
Publication date: 2013-12-04
Anticipated expiration: 2029-08-31
Also published as: JP2011053735A

Description

The present invention relates to a behavior control learning method, a behavior control learning device, and a behavior control learning program for determining what action a system takes next in response to a user's action, in a system where the system and the user take turns interacting (such as a dialogue system).

Non-Patent Documents 1, 2, and 3 are known as behavior control techniques that use a partially observable Markov decision process (hereinafter "POMDP").

Non-Patent Document 1 targets the task of buying tickets between six cities. Non-Patent Document 2 targets the task of troubleshooting DSL (Digital Subscriber Line) connections. In these behavior control techniques, both the task type (the kinds of actions that can be taken) and the state transitions (the order in which actions are taken) are known in advance. Non-Patent Document 3 determines system actions from a large amount of data, but when the POMDP is obtained, the task is known in advance, as in the other non-patent documents.

Non-Patent Document 1: J. Williams, P. Poupart, S. Young, "Partially Observable Markov Decision Processes with Continuous Observations for Dialogue Management", Recent Trends in Discourse and Dialogue, Springer Netherlands, 2008, Volume 39, p. 191-217
Non-Patent Document 2: Jason D. Williams, "Applying POMDPs to Dialog Systems in the Troubleshooting Domain", Bridging the Gap: Academic and Industrial Research in Dialog Technologies, 2007.4, p. 1-8
Non-Patent Document 3: K. Kim, C. Lee, S. Jung, G. G. Lee, "A Frame-Based Probabilistic Framework for Spoken Dialog Management Using Dialog Examples", Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, 2008.6, p. 120-127

However, every one of these conventional techniques targets tasks whose types and state transitions are known. In a setting such as conversation, the task types (greeting, handshake, pleasant conversation, small talk, and so on) and the ways the task state can transition are varied, and the system designer cannot anticipate them in advance. The conventional techniques therefore cannot perform behavior control for such systems.

The behavior control learning device of the present invention generates learning data for carrying out person-to-person behavior as person-to-system behavior. In data representing person-to-person behavior, one person is assigned the role of the user and the other the role of the system. The user's behavior is treated as an observation, the system's behavior as an action, and the result of evaluating whether a sequence of observations and actions was a desired behavior sequence as an evaluation value. The device includes a behavior data recording unit that stores the observations, actions, and evaluation values, a DBN generation unit, a DBN-POMDP conversion unit, and a reinforcement learning unit.
The DBN generation unit uses the observations, actions, and evaluation values to generate a dynamic Bayesian network (hereinafter "DBN") and estimates three probabilities: the probability P(r_t | s_t, a_t) of the reward when action a_t is executed in state s_t; the probability P(s_{t+1} | s_t, a_t) that action a_t changes the state from s_t to s_{t+1}; and the probability P(o_{t+1} | s_{t+1}, a_t) that observation o_{t+1} is observed in state s_{t+1} as a result of action a_t. The DBN-POMDP conversion unit uses P(r_t | s_t, a_t), P(s_{t+1} | s_t, a_t), and P(o_{t+1} | s_{t+1}, a_t) to generate the probability P(s' | s, a) that action a changes the state from s to s', the probability P(o' | s', a_t) that observation o' is output in state s' as a result of action a, and the reward r(s, a) for executing action a in state s. The reinforcement learning unit uses P(s' | s, a), P(o' | s', a_t), and r(s, a) to generate a function that takes the probability distribution over the current state as its argument and outputs the single action the system should take.

Further, in the present invention, the state s_t is split into a pair s_t = (s_o, s_a), where s_o represents the internal state of the observation and s_a represents the internal state of the action (the subscript t is omitted in the notation s_o, s_a). The DBN generation unit generates the DBN with P(a | s_a) = 1 only when a = s_a. The DBN-POMDP conversion unit defines a reward ^r((*, s_a), a), where * denotes any s_o, that takes the value 1 when a = s_a and 0 otherwise. The final objective function V_t is then obtained from an expression in which the reward is replaced by the linear sum αr + β^r of the reward r for the desired behavior sequence and the reward ^r for the statistical behavior sequence.

According to the behavior control learning device of the present invention, behavior other than the desired behavior sequences is also modeled when the function that determines actions is generated. A system that uses the function generated by the device can therefore behave in a statistically natural way even in response to user behavior that falls outside the desired behavior sequences.

Diagram showing a configuration example of the behavior control learning device 100 of Embodiment 1.
Diagram showing an example of the data stored in the behavior data storage unit.
Diagram showing the relationship between s_a, a, and P(a | s_a).
Diagram showing the structure and variables of the POMDP.
Diagram showing the simulation results.
Block diagram illustrating the hardware configuration of the behavior control learning device 100.

Embodiments of the present invention are described in detail below.

[Behavior control learning device 100]
The behavior control learning device 100 generates learning data for carrying out person-to-person behavior as person-to-system behavior. FIG. 1 shows a configuration example of the behavior control learning device 100 according to Embodiment 1, which is described below with reference to FIG. 1.

The behavior control learning device 100 includes a behavior data storage unit 101, a dynamic Bayesian network (hereinafter "DBN") generation unit 103, a DBN probability table storage unit 105, a DBN-POMDP conversion unit 107, a POMDP probability/reward table storage unit 109, a reinforcement learning unit 111, a POMDP policy storage unit 113, a state distribution update unit 115, a state probability table storage unit 117, and an action determination unit 119.

[Behavior data storage unit 101]
In data representing person-to-person behavior (for example, audio or image data recording a conversation), one person is assigned the role of the user and the other the role of the system. The user's behavior is the observation o, the system's behavior is the action a, and the result of evaluating whether a sequence of observations and actions was a desired behavior sequence is the evaluation value r.
The behavior data storage unit 101 stores the observations o, actions a, and evaluation values r. FIG. 2 shows an example of the data stored in the behavior data storage unit.

For example, eight action labels are prepared (handshake, greeting, laughter, movement, chatter, nod, head shake, and no action), and each label is assigned a number from 0 to 7. The number corresponding to each behavior is stored in the behavior data storage unit 101 as behavior data, both for observations and for actions. In this embodiment, an observation and an action are stored together as a pair. In addition, each behavior sequence (one or more pairs of user and system behavior) is evaluated as to whether it was a desired behavior sequence, and the evaluation value is stored as 1 if it was and 0 if it was not.
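The storage scheme just described can be sketched in a few lines of Python. This is only an illustration: the text says the labels map to 0 through 7 but does not say which label gets which number, so the ordering below, the English label names, and the helper `encode_sequence` are all assumptions.

```python
# Hypothetical ordering of the eight action labels; the source only states
# that each label corresponds to a number from 0 to 7.
ACTION_LABELS = ["handshake", "greeting", "laughter", "movement",
                 "chatter", "nod", "head_shake", "no_action"]
LABEL_TO_ID = {name: i for i, name in enumerate(ACTION_LABELS)}


def encode_sequence(pairs, desired):
    """Encode (user_label, system_label) pairs as (o, a) integer pairs.

    One evaluation value is attached to the whole sequence:
    1 if it was a desired behavior sequence, 0 otherwise.
    """
    encoded = [(LABEL_TO_ID[u], LABEL_TO_ID[s]) for u, s in pairs]
    return {"pairs": encoded, "evaluation": 1 if desired else 0}


record = encode_sequence(
    [("handshake", "handshake"), ("greeting", "greeting")], desired=True
)
```

Keeping the observation and action as one tuple mirrors the statement that they are stored as a pair.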

Examples of criteria for a desired behavior sequence are: Did the user enjoy it? Is it a typical behavior sequence? Was it useful to the user? An example of a typical behavior sequence is: "the two parties shake hands, greet each other, then randomly repeat laughter, chatter, and nods several times, and finally greet each other and shake hands again."

One evaluation is assigned per behavior sequence. To learn this value statistically, it is distributed over the individual time steps using one of the following methods.
(Distribution means 1) If the evaluation of the observed behavior sequence is 1, all values are set to 1. If the evaluation is 0, all values are set to 0.
(Distribution means 2) Only a part of the observed behavior sequence is evaluated. If the evaluation of that part is 1, only the values between the start and end of that part are set to 1. All other values are set to 0.
(Distribution means 3) When the start and end are known as in (Distribution means 2), 1 is assigned only to the last data item of that part, and all other values are set to 0.
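The three ways of spreading a sequence-level evaluation over time steps can be sketched as follows. The function names and the convention that `start` and `end` are inclusive, zero-based indices are assumptions made for this illustration.

```python
def distribute_all(length, evaluation):
    """Means 1: copy the sequence-level evaluation to every time step."""
    return [evaluation] * length


def distribute_segment(length, start, end, evaluation=1):
    """Means 2: the evaluated part [start, end] gets its evaluation, the rest 0."""
    return [evaluation if start <= t <= end else 0 for t in range(length)]


def distribute_segment_end(length, end, evaluation=1):
    """Means 3: only the last step of the evaluated part gets its evaluation."""
    return [evaluation if t == end else 0 for t in range(length)]


per_step = distribute_segment(5, start=1, end=3)
```

Means 3 concentrates the reward at the point where the desired sequence completes, while means 2 rewards every step inside it.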

The evaluation value need not be binary (0 or 1); it may take multiple discrete values or be continuous. The description here proceeds with the evaluation of a single person, but the average over many evaluators may be used instead. Several desired behavior sequences may be prepared, and a behavior sequence label may be provided for each. An evaluation may be given for each behavior sequence, and the behavior sequence label and its evaluation may be stored together. The person-to-person behavior data need not be one-to-one; it may be collected from several people, in which case there are multiple users and multiple systems.

Action labels may be assigned manually, or automatically by using speech recognition or image recognition software to recognize which behavior is being performed. Whether the evaluation is assigned manually or automatically may be chosen according to what is being evaluated. For example, a judgment such as "Was it useful to the user?" is difficult to make with speech or image recognition software and is therefore assigned manually, whereas whether a typical behavior sequence was performed may be assigned automatically. The behavior control learning device 100 may also include a recognition unit and a labeling unit, so that it takes the conversation or video data itself as input and generates the observations, actions, and evaluation values internally.

[DBN generation unit 103 and DBN probability table storage unit 105]
The DBN generation unit 103 uses the observations o, actions a, and evaluation values r to generate a DBN, and estimates the probability P(r_t | s_t, a_t) of the reward when action a_t is executed in state s_t, the probability P(s_{t+1} | s_t, a_t) that action a_t changes the state from s_t to s_{t+1}, and the probability P(o_{t+1} | s_{t+1}, a_t) that observation o_{t+1} is observed in state s_{t+1} as a result of action a_t. Here, s is the hidden state between the user and the system (hereinafter "state"), and it consists of the pair of the hidden state s_o of the user and system and the hidden state s_a for action generation. t denotes time, and the evaluation value r is treated as a reward r, which is a random variable. The symbol t is used only to make the relative timing of the variables clear and does not refer to a specific time. That is, the probabilities given here, and the computations that use them, do not depend on time.

For example, the DBN generation unit 103 uses the time series of observations o_t, actions a_t, and evaluation values r_t to maximize the likelihood with the EM algorithm, the junction tree algorithm, a sampling method, or the like, and thereby learns and generates a DBN for the action generation model. The internal state of the system and user is separated, as s = (s_o, s_a), into the internal state of the system and a state corresponding to the action. To put s_a and a in one-to-one correspondence, the DBN is created with P(a | s_a) = 1 only when a = s_a. FIG. 3 shows the relationship between s_a, a, and P(a | s_a).
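The one-to-one constraint between s_a and a amounts to fixing P(a | s_a) to an identity table. A minimal sketch, assuming states and actions are indexed by integers (the function names are hypothetical, and the sizes 16 and 8 are taken from the simulation section):

```python
def action_emission_table(num_actions):
    """Return table[s_a][a] = P(a | s_a): 1 when a == s_a, 0 otherwise."""
    return [[1.0 if a == s_a else 0.0 for a in range(num_actions)]
            for s_a in range(num_actions)]


def joint_states(num_so, num_sa):
    """Enumerate the composite hidden states s = (s_o, s_a)."""
    return [(s_o, s_a) for s_o in range(num_so) for s_a in range(num_sa)]


table = action_emission_table(8)   # 8 system actions
states = joint_states(16, 8)       # 16 values of s_o, 8 values of s_a
```

Because this table is fixed rather than learned, likelihood maximization only has to fit the remaining DBN parameters.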

The probabilities estimated by the DBN generation unit 103 are stored in the DBN probability table storage unit 105.
[DBN-POMDP conversion unit 107 and POMDP probability/reward table storage unit 109]
The DBN-POMDP conversion unit 107 uses the probabilities P(r_t | s_t, a_t), P(s_{t+1} | s_t, a_t), and P(o_{t+1} | s_{t+1}, a_t) to generate the probability that action a changes the state from s to s' (the state transition probability) P(s' | s, a), the probability that observation o' is output in state s' as a result of action a (the output probability) P(o' | s', a_t), and the reward r(s, a) for executing action a in state s.

The probabilistic model called POMDP is now described. Action generation is realized by this POMDP. FIG. 4 shows the structure and variables of the POMDP. The model defines a state s that describes the state of the system and the psychological state of the user. s is expressed as a combination of several states, s = (s_1, s_2, s_3, ..., s_N). o denotes an observation, and a denotes an action by which the system acts on the user. The probabilities P(s' | s, a) and P(o' | s', a_t) and the reward r(s, a) are defined between these variables.

The DBN-POMDP conversion unit 107 converts the probabilities estimated by the DBN generation unit 103 into POMDP probabilities and rewards as follows. Symbols with the same definitions are assumed to be used for observations and actions.

Because the DBN and the POMDP have almost the same structure, the state transition probability P(s' | s, a) and the output probability P(o' | s', a_t) are obtained simply by copying the values of the corresponding DBN probabilities. The reward, however, is treated as a random variable in the DBN, so it is converted to a real number by averaging the random variable obtained from the DBN. For example, it is computed from the probability distribution of r by expression (1). This way of setting the reward is a technique unique to the present invention and is not found in the prior art.
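Expression (1) is not reproduced in this text. Reading "averaging the random variable" as taking the expectation of r under P(r | s, a), which is an assumption, the conversion could be sketched like this (the dictionary layout and function names are illustrative only):

```python
def expected_reward(reward_dist):
    """Return sum over r of r * P(r | s, a) for one (s, a) pair.

    reward_dist maps each possible reward value r to its probability.
    """
    return sum(r * p for r, p in reward_dist.items())


def to_reward_table(dbn_reward):
    """Convert DBN reward distributions into real-valued rewards r(s, a)."""
    return {key: expected_reward(dist) for key, dist in dbn_reward.items()}


# Toy DBN output: keys are (s, a), values are distributions over r in {0, 1}.
dbn_reward = {(0, 0): {0: 0.25, 1: 0.75}, (0, 1): {0: 1.0, 1: 0.0}}
rewards = to_reward_table(dbn_reward)
```

With a binary evaluation value, the expected reward is simply the probability that the evaluation is 1.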

The POMDP probability/reward table storage unit 109 stores the probabilities P(s' | s, a) and P(o' | s', a_t) and the reward r(s, a) that were converted and computed by the DBN-POMDP conversion unit 107.

[Reinforcement learning unit 111 and POMDP policy storage unit 113]
The reinforcement learning unit 111 uses the probabilities P(s' | s, a) and P(o' | s', a_t) and the reward r(s, a) to generate, by reinforcement learning, a function (hereinafter "policy") that takes the probability distribution over the current state as its argument and outputs the single action the system should take.

The POMDP policy storage unit 113 stores the policy generated by the reinforcement learning unit 111.

Next, the method of computing the policy is described. First, expression (4) gives the reward that can be obtained in the future when the action sequence a_{τ+t} is known.
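The expression itself is not reproduced in this text. A form consistent with the definitions given for it (a state distribution b_{τ+t}(s), a discount constant γ < 1, and the reward r(s, a)) is sketched below. It is an assumed reconstruction, not the patent's verbatim formula, and the summation range over τ is likewise an assumption.

```latex
V_t = \sum_{\tau=0}^{\infty} \gamma^{\tau} \sum_{s} b_{\tau+t}(s)\, r(s, a_{\tau+t}) \tag{4}
```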

Here, b_{τ+t}(s) is the distribution over states at time τ+t. The positive constant γ (< 1) reduces the contribution of future rewards. The policy is a function that computes, from the current state distribution b_t(s), the current action a that maximizes expression (4).
[Method of selecting actions according to statistics appearing in the data]
First, the following expression is obtained for the probability distribution b_t(s) of the current state from its definition.

This represents the probability that the state is s_a after the past sequence o_1, a_1, ..., a_{t-1}, o_t, that is, the history of observations and actions of the user and the system, has been executed.
Because P(a | s_a) = 1 when a_t = s_a, the following expression is obtained when a_t = s_a.

This represents the probability that action a_t occurs next when the past o_1, a_1, ..., a_{t-1}, o_t have been observed. In other words, it is a probability that expresses how natural a_t is given the data so far. Hence, if the POMDP reward is chosen so as to maximize expression (7), the policy will generate natural actions. To achieve this, the reward is set as follows, subject to the condition a = s_a: the reward ^r((*, s_a), a) is defined to take the value 1 when a = s_a and 0 otherwise.

Here, * denotes any s_o. Replacing r with ^r using this value yields natural interaction. In order to also realize the conventional desired behavior sequences, a linear sum with the reward of the conventional method is taken. To do this, the final objective function V_t is obtained from expression (10), in which r in expression (4) is replaced by αr + β^r.
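Expression (10) is not reproduced in this text. Assuming expression (4) is the discounted expected reward, that is, the sum over τ of γ^τ times the sum over s of b_{τ+t}(s) r(s, a_{τ+t}), the substitution described above would give the following. This is a reconstruction under that assumption, with \hat{r} standing for ^r.

```latex
V_t = \sum_{\tau=0}^{\infty} \gamma^{\tau} \sum_{s} b_{\tau+t}(s)
      \left[ \alpha\, r(s, a_{\tau+t}) + \beta\, \hat{r}(s, a_{\tau+t}) \right] \tag{10}
```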

Here, α and β are arbitrary real numbers. Varying α and β weights the priority between realizing the desired behavior (large α) and favoring statistical behavior (large β). Either α or β may also be set to 0.
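The mixed reward αr + β^r can be sketched directly from these definitions. The representation of r as a dictionary keyed by (state, action), the default of 0 for missing entries, and the function names are assumptions for this illustration.

```python
def statistical_reward(state, action):
    """^r((*, s_a), a): 1 when the action equals the s_a component, else 0."""
    _s_o, s_a = state  # the s_o component is ignored (the '*' in the text)
    return 1.0 if action == s_a else 0.0


def combined_reward(r, state, action, alpha, beta):
    """Return alpha * r(s, a) + beta * ^r(s, a).

    r is a dict keyed by (state, action); missing entries count as 0.
    """
    base = r.get((state, action), 0.0)
    return alpha * base + beta * statistical_reward(state, action)


# Toy reward table: only state (s_o=0, s_a=2) with action 2 is "desired".
r = {((0, 2), 2): 1.0}
value = combined_reward(r, (0, 2), 2, alpha=1.0, beta=0.5)
```

Setting beta to 0 recovers a POMDP that rewards only the desired sequences, and setting alpha to 0 leaves only the statistical term.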

Normally, when action generation is learned with a POMDP for the target desired behavior sequences, the system tries to realize only the desired behavior sequences. As a result, even though records of person-to-person behavior contain many sequences besides the desired ones, the system stops selecting any behavior outside the desired behavior sequences. To build a system that reproduces the exchanges between people while drawing the user into a desired behavior sequence, learning only the desired behavior sequences is therefore insufficient. According to the present invention, the system's behavior control is learned with the statistics of these other behavior sequences included, so the system can perform natural behavior control while still drawing the user toward the desired behavior sequence.

[Behavior control using the policy]
The method of controlling behavior using the policy is described below.
[State distribution update unit 115 and state probability table storage unit 117]
The state probability table storage unit 117 stores the probability distribution b_{t-1} of the previous state. When an observation o_t' is input, the state distribution update unit 115 queries the POMDP probability/reward table storage unit 109 with the action a that the system performed immediately before, and obtains the state transition probability P(s' | s, a) from the stored statistics. It likewise queries the POMDP probability/reward table storage unit 109 with the observation o_t' and obtains the output probability P(o' | s', a) from the stored statistics. It also queries the state probability table storage unit 117, receives the probability distribution b_{t-1} of the previous state, and computes the probability distribution b_t of the current state by the following expression.

Here, η is a normalization constant that makes the total sum equal to 1. The resulting probability distribution b_t of the current state is stored in the state probability table storage unit 117 and output to the action determination unit 119.
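The update expression is not reproduced in this text. The quantities named above match the standard POMDP belief update, b_t(s') = η P(o' | s', a) Σ_s P(s' | s, a) b_{t-1}(s), so the sketch below implements that form as an assumption. The nested-list table layout and the function name are illustrative.

```python
def update_belief(prev_belief, action, observation, trans, emit):
    """Standard POMDP belief update (assumed form of the lost expression).

    trans[a][s][s2] = P(s2 | s, a); emit[a][s2][o] = P(o | s2, a).
    """
    n = len(prev_belief)
    unnormalized = []
    for s2 in range(n):
        # Predict: probability of reaching s2 from the previous belief.
        predicted = sum(trans[action][s][s2] * prev_belief[s] for s in range(n))
        # Correct: weight by how likely the new observation is in s2.
        unnormalized.append(emit[action][s2][observation] * predicted)
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [v / total for v in unnormalized]  # eta = 1 / total


# Toy model with two states, one action, two observations.
trans = [[[0.9, 0.1], [0.2, 0.8]]]
emit = [[[0.8, 0.2], [0.3, 0.7]]]
belief = update_belief([0.5, 0.5], action=0, observation=0, trans=trans, emit=emit)
```

The returned list plays the role of b_t and would be written back to the state probability table before the next observation arrives.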

[Action determination unit 119]
The action determination unit 119 acquires the policy from the POMDP policy storage unit 113 before behavior control begins and stores it. When it receives the probability distribution b_t of the current state, it passes b_t as the argument of the policy f(), and determines and outputs the action a_t that the system should take.
With this configuration, behavior other than the desired behavior sequences is also modeled when the function that determines actions is generated. A system that uses the function generated by the behavior control learning device of the present invention can therefore behave in a statistically natural way even in response to user behavior that falls outside the desired behavior sequences.
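The text does not say how the policy f() is represented internally. One common representation in POMDP solvers is a set of alpha vectors, each tagged with an action, where the chosen action is the tag of the vector with the largest inner product with the belief. The sketch below assumes that representation purely for illustration; it is not stated to be the patent's method.

```python
def choose_action(belief, alpha_vectors):
    """Pick the action whose alpha vector scores highest on the belief.

    alpha_vectors is a list of (action, values) pairs, where values has
    one entry per state.
    """
    best_action, best_value = None, float("-inf")
    for action, values in alpha_vectors:
        score = sum(b * v for b, v in zip(belief, values))
        if score > best_value:
            best_action, best_value = action, score
    return best_action


# Toy policy over two states: action 0 pays off in state 0, action 1 in state 1.
policy = [(0, [1.0, 0.0]), (1, [0.0, 1.0])]
chosen = choose_action([0.3, 0.7], policy)
```

Whatever the representation, the interface is the one described above: the belief b_t goes in and a single action a_t comes out.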

Although the behavior control learning device 100 includes the state distribution update unit 115, the state probability table storage unit 117, and the action determination unit 119, these may instead be configured as a separate device, with the behavior control learning device 100 outputting the state transition probabilities, output probabilities, and policy in response to queries from that device.

［シミュレーション結果］
[Simulation results]
An experiment was conducted by simulating behavior control, assuming data that records one-on-one interactions. FIG. 5 shows the simulation results. Eight types of actions were prepared: handshake, greeting, laughing, moving, chatting, nodding, head shaking, and no action. The observed values were likewise these eight types. Observed values are generally assumed to contain recognition errors, but here they were treated as exact. The user's intention, however, cannot be observed, and it was therefore modeled as a hidden state. The number of these hidden states s_o is 16. Separately, hidden states s_a corresponding one-to-one to the system's actions were set, and their number was 8.

Two types of sequences were prepared as the desired action sequences to be labeled. The sequences were labeled automatically by computer, and those judged to be a desired action sequence were assigned 1. In the first desired sequence, the two parties shake hands, greet each other, then repeat laughing, chatting, and nodding at random several times, and finally greet each other and shake hands. In the second, one party moves while the other takes no action, after which they greet each other, repeat laughing, chatting, and nodding at random several times, greet each other again, and finally one party does nothing while the other moves. Rewards for these action sequences were given using (distributing means 2). That is, the start time and end time of each action sequence were assumed to be known, and a reward of 1 was added at each time from the start time to the end time. These action sequences made up one tenth of the total learning data. The remaining data were sampled so that the following pairs of user observed value and system action occur with statistically high probability: handshake-handshake, greeting-greeting, laughing-laughing, moving-moving, chatting-chatting, nodding-chatting, head shaking-chatting, and no action-no action.

If the user wants a desired action sequence, the system is trained to act so as to approach that sequence. If the user has no such intention, the system is trained to act in a way that follows the statistical behavior of the remaining samples. A total of 10,000 samples were created as learning data. From this data, a dynamic Bayesian network was created using the proposed method and converted into a POMDP probability/reward table, and a policy, which is the method of selecting actions, was obtained by reinforcement learning. For comparison, a POMDP method that rewards only the desired sequences was used. For evaluation, 2,000 samples were used. Only the user's observed values were generated, following the procedure that generated the learning data for the desired sequences and the procedure that generated the learning data for the other sequences. The experiment examined whether the system performs the actions of a desired sequence when the user wants that sequence, and whether it selects actions according to the statistics of the data otherwise.
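The reward assignment described above (a reward of 1 at every time step from the known start time to the known end time of a desired action sequence) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `label_rewards`, the zero-based time indices, and the use of 0 for unlabeled steps are assumptions introduced here.

```python
from typing import List, Tuple


def label_rewards(num_steps: int, desired_spans: List[Tuple[int, int]]) -> List[int]:
    """Return one reward per time step: 1 inside any desired span, else 0.

    Each span is (start, end) with both ends inclusive, mirroring
    "each time from the start time to the end time".
    The value 0 for steps outside every span is an assumption.
    """
    rewards = [0] * num_steps
    for start, end in desired_spans:
        if not (0 <= start <= end < num_steps):
            raise ValueError(f"span {(start, end)} is outside 0..{num_steps - 1}")
        for t in range(start, end + 1):
            rewards[t] = 1
    return rewards


# A record of 6 time steps in which steps 1..3 form a desired sequence.
example = label_rewards(6, [(1, 3)])
```

With this labeling, every step of a desired sequence contributes to the reward statistics, rather than only its final step.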

The method that rewards only the desired sequences generated the correct action for all 200 samples of the desired sequences. The proposed method also produced the correct actions for all of the desired sequences. This confirms that both methods generate correct actions for the desired sequences.

FIG. 5 shows the main frequencies of three sets of observed value/action pairs: those observed and generated by the method that rewards only the desired action sequences, those observed and generated by the proposed method, and those present in the learning data. As the figure shows, the POMDP that rewards only the desired action sequences selects actions with high frequency for the observed value/action pairs contained in the learned desired sequences. However, its pattern is far from the statistical pattern of observed value/action pairs in the learning data, shown at the far right. This is because a method that rewards only the desired action sequences chooses actions so as to produce a desired sequence regardless of what is observed. In contrast, the proposed method, which introduces the occurrence probability of actions into the reward, also approaches the statistics of the learning data outside the desired sequences, which is nine times as large as the desired-sequence data.
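A comparison of this kind can be sketched by counting observed value/action pairs and measuring how far one frequency table is from another. The text does not state which measure, if any, was used to judge closeness; the total variation distance below is an assumption chosen for illustration, and the function names and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (user observed value, system action)


def pair_frequencies(pairs: Iterable[Pair]) -> Dict[Pair, float]:
    """Relative frequency of each observed value/action pair."""
    counts = Counter(pairs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


def total_variation(p: Dict[Pair, float], q: Dict[Pair, float]) -> float:
    """Total variation distance (assumed measure): 0 = identical, 1 = disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical data: pairs seen in the learning data vs. pairs produced by a policy.
training = [("nod", "chat"), ("nod", "chat"), ("laugh", "laugh"), ("move", "move")]
generated = [("nod", "handshake")] * 4
distance = total_variation(pair_frequencies(training), pair_frequencies(generated))
```

A policy that always steers toward a desired sequence would give a large distance from the learning data, while a policy that follows the data statistics would give a small one.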

With a method that rewards only the desired action sequences, actions for the desired action sequence are generated even when the user does not perform the desired behavior. This is because the system has learned only the desired behavior. In a system whose task is fixed from the outset, such behavior is natural. However, when building a system that reproduces person-to-person interaction while also drawing the user into a desired action sequence, learning only the desired action sequence is insufficient. The present invention therefore has the effect that, when the user does not perform the desired behavior, the system acts according to the statistics of the learning data, so that it can be controlled to behave naturally even in that case.
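The two reward quantities named in the claims can be illustrated numerically. The claims state that the reward r(s, a) is the sum of products of r_t and P(r_t | s_t, a_t), and that the reinforcement learning may use αr(s, a) + β\hat{r}((*, s_a), a) in its place. The sketch below assumes the reward distribution for one state/action pair is available as a dictionary, and it treats the value of \hat{r} as a given input; the function names and all numeric values are hypothetical.

```python
from typing import Dict


def expected_reward(reward_probs: Dict[float, float]) -> float:
    """r(s, a) = sum over reward values r of r * P(r | s, a).

    reward_probs maps each possible reward value to its probability
    for one fixed state/action pair.
    """
    return sum(r * p for r, p in reward_probs.items())


def combined_reward(r_sa: float, r_hat: float, alpha: float, beta: float) -> float:
    """alpha * r(s, a) + beta * r_hat((*, s_a), a), with r_hat supplied by the caller."""
    return alpha * r_sa + beta * r_hat


# Hypothetical numbers: reward 1 is observed with probability 0.25 for this (s, a).
r_sa = expected_reward({0.0: 0.75, 1.0: 0.25})
total = combined_reward(r_sa, r_hat=0.5, alpha=1.0, beta=2.0)
```

Setting β to 0 recovers a reward based only on the desired-sequence labels, which corresponds to the comparison method; a positive β lets the second term influence which action is preferred.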

<Hardware configuration>
FIG. 6 is a block diagram illustrating the hardware configuration of the behavior control learning device 100 according to the present embodiment. As illustrated in FIG. 6, the behavior control learning device 100 of this example includes a CPU (Central Processing Unit) 11, an input unit 12, an output unit 13, an auxiliary storage device 14, a ROM (Read Only Memory) 15, a RAM (Random Access Memory) 16, and a bus 17.

The CPU 11 in this example includes a control unit 11a, a calculation unit 11b, and a register 11c, and executes various calculation processes according to the programs read into the register 11c. The input unit 12 is an input interface through which data is input, a keyboard, a mouse, or the like, and the output unit 13 is an output interface through which data is output, or the like. The auxiliary storage device 14 is, for example, a hard disk or a semiconductor memory, and stores the program for causing a computer to function as the behavior control learning device 100 as well as various data. The above program and data are loaded into the RAM 16 and used by the CPU 11 and other components. The bus 17 connects the CPU 11, the input unit 12, the output unit 13, the auxiliary storage device 14, the ROM 15, and the RAM 16 so that they can communicate with one another. Specific examples of such hardware include personal computers, server devices, and workstations.

<Program configuration>
As described above, the auxiliary storage device 14 stores the programs for executing each process of the behavior control learning device 100 according to the present embodiment. The programs constituting the license management program may be written as a single program sequence, or at least some of them may be stored in a library as separate modules.
<Cooperation between hardware and program>
The CPU 11 loads the above-described program and various data stored in the auxiliary storage device 14 into the RAM 16 in accordance with the OS program that has been read. The addresses on the RAM 16 at which the program and data are written are stored in the register 11c of the CPU 11. The control unit 11a of the CPU 11 sequentially reads the addresses stored in the register 11c, reads the program and data from the areas on the RAM 16 indicated by those addresses, causes the calculation unit 11b to sequentially execute the operations indicated by the program, and stores the calculation results in the register 11c.

FIG. 1 is a block diagram illustrating the functional configuration of the behavior control learning device 100 that is formed when the above-described program is read into and executed by the CPU 11 in this way.

Here, the behavior data storage unit 101, the DBN probability table storage unit 105, the POMDP probability/reward table storage unit 109, the POMDP policy storage unit 113, and the state probability table storage unit 117 each correspond to the auxiliary storage device 14, the RAM 16, the register 11c, another buffer memory or cache memory, or a storage area that uses these in combination. The DBN generation unit 103, the DBN-POMDP conversion unit 107, the reinforcement learning unit 111, the state distribution update unit 115, and the action determination unit 119 are formed by causing the CPU 11 to execute the license management program.
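To make the role of the state distribution update unit 115 concrete, the following sketch shows the standard POMDP belief update, b'(s') ∝ P(o' | s', a) Σ_s P(s' | s, a) b(s), using the two probability tables produced by the DBN-POMDP conversion. It assumes that unit 115 performs this standard update; the function name, the dictionary layout of the tables, and the two-state example are hypothetical and introduced only for illustration.

```python
from typing import Dict, Hashable, Tuple

State = Hashable
Action = Hashable
Obs = Hashable


def update_belief(
    belief: Dict[State, float],
    action: Action,
    observation: Obs,
    trans: Dict[Tuple[State, Action], Dict[State, float]],  # P(s' | s, a)
    obs: Dict[Tuple[State, Action], Dict[Obs, float]],      # P(o' | s', a)
) -> Dict[State, float]:
    """Standard POMDP belief update (assumed behavior of unit 115)."""
    unnormalized = {}
    for s_next in belief:
        # Probability of reaching s_next before seeing the observation.
        predicted = sum(
            trans[(s, action)].get(s_next, 0.0) * p for s, p in belief.items()
        )
        unnormalized[s_next] = obs[(s_next, action)].get(observation, 0.0) * predicted
    norm = sum(unnormalized.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / norm for s, p in unnormalized.items()}


# Hypothetical two-state model with a single system action "greet".
trans = {("A", "greet"): {"A": 0.5, "B": 0.5}, ("B", "greet"): {"A": 0.5, "B": 0.5}}
obs = {("A", "greet"): {"x": 0.9, "y": 0.1}, ("B", "greet"): {"x": 0.1, "y": 0.9}}
new_belief = update_belief({"A": 0.5, "B": 0.5}, "greet", "x", trans, obs)
```

The resulting distribution is the argument that a learned policy would take when the action determination unit 119 selects the next system action.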

100 behavior control learning device
101 behavior data storage unit
103 DBN generation unit
105 DBN probability table storage unit
107 DBN-POMDP conversion unit
109 POMDP probability/reward table storage unit
111 reinforcement learning unit
113 POMDP policy storage unit
115 state distribution update unit
117 state probability table storage unit
119 action determination unit

Claims

A behavior control learning device that generates learning data for performing person-to-person behavior in a person-to-system setting,
wherein, in data representing person-to-person behavior, one person is assigned as the user and the other person as the system, the user's behavior is taken as an observed value and the system's behavior as an action, and an evaluation value r is obtained by evaluating whether an action sequence consisting of observed values o and actions a is a desired action sequence, the device comprising:
a behavior data recording unit that stores the observed value o, the action a, and the evaluation value r;
a DBN generation unit that, with t denoting time, generates a dynamic Bayesian network (hereinafter referred to as "DBN") using the observed value o, the action a, and the evaluation value r, and estimates the probability P(r_t | s_t, a_t) of the reward r_t when action a_t is executed in state s_t, the probability P(s_{t+1} | s_t, a_t) that action a_t changes the state from s_t to s_{t+1}, and the probability P(o_{t+1} | s_{t+1}, a_t) that observed value o_{t+1} is observed in state s_{t+1} as a result of action a_t;
a DBN-POMDP conversion unit that, using the probabilities P(r_t | s_t, a_t), P(s_{t+1} | s_t, a_t), and P(o_{t+1} | s_{t+1}, a_t), generates the probability P(s' | s, a) that action a changes the state from s to s', the probability P(o' | s', a_t) that observed value o' is output in state s' as a result of action a, and the reward r(s, a) for action a in state s; and
a reinforcement learning unit that, using the probabilities P(s' | s, a) and P(o' | s', a_t) and the reward r(s, a), generates a function that takes the probability distribution of the current state as its argument and outputs the one action the system should take,
wherein the reward r(s, a) is the sum of products of r_t and P(r_t | s_t, a_t).

The behavior control learning device according to claim 1,
wherein the state s_t is the pair s_t = (s_o, s_a) of s_o, which represents the internal state of the observed value, and s_a, which represents the internal state of the action,
the DBN generation unit generates a DBN with P(a | s_a) = 1 only when a = s_a,
the DBN-POMDP conversion unit obtains the reward \hat{r}((*, s_a), a) [where * represents any s_o] as

[equation not reproduced in this text]

and the reinforcement learning unit uses αr(s, a) + β\hat{r}((*, s_a), a) [where α and β are arbitrary real numbers] in place of the reward r(s, a) to generate the function that takes the probability distribution of the current state as its argument and outputs the one action the system should take.

A behavior control learning method for generating learning data for performing a person-to-person action in a person-to-system,
wherein, in data representing person-to-person behavior, one person is assigned as the user and the other person as the system, the user's behavior is taken as an observed value and the system's behavior as an action, and an evaluation value r is obtained by evaluating whether an action sequence consisting of observed values o and actions a is a desired action sequence, the method comprising:
a DBN generation step of, with t denoting time, generating a dynamic Bayesian network (hereinafter referred to as "DBN") using the observed value o, the action a, and the evaluation value r, and estimating the probability P(r_t | s_t, a_t) of the reward r_t when action a_t is executed in state s_t, the probability P(s_{t+1} | s_t, a_t) that action a_t changes the state from s_t to s_{t+1}, and the probability P(o_{t+1} | s_{t+1}, a_t) that observed value o_{t+1} is observed in state s_{t+1} as a result of action a_t;
a DBN-POMDP conversion step of generating, using the probabilities P(r_t | s_t, a_t), P(s_{t+1} | s_t, a_t), and P(o_{t+1} | s_{t+1}, a_t), the probability P(s' | s, a) that action a changes the state from s to s', the probability P(o' | s', a_t) that observed value o' is output in state s' as a result of action a, and the reward r(s, a) for action a in state s; and
a reinforcement learning step of generating, using the probabilities P(s' | s, a) and P(o' | s', a_t) and the reward r(s, a), a function that takes the probability distribution of the current state as its argument and outputs the one action the system should take,
wherein the reward r(s, a) is the sum of products of r_t and P(r_t | s_t, a_t).

The behavior control learning method according to claim 3,
wherein the state s_t is the pair s_t = (s_o, s_a) of s_o, which represents the internal state of the observed value, and s_a, which represents the internal state of the action,
the DBN generation step generates a DBN with P(a | s_a) = 1 only when a = s_a,
the DBN-POMDP conversion step obtains the reward \hat{r}((*, s_a), a) [where * represents any s_o] as

[equation not reproduced in this text]

and the reinforcement learning step uses αr(s, a) + β\hat{r}((*, s_a), a) [where α and β are arbitrary real numbers] in place of the reward r(s, a) to generate the function that takes the probability distribution of the current state as its argument and outputs the one action the system should take.

A program for causing a computer to function as the behavior control learning device according to claim 1.