JP6114679B2 - Control policy determination device, control policy determination method, control policy determination program, and control system

Info

Publication number: JP6114679B2
Application number: JP2013235415A
Authority: JP
Inventors: 塚原　裕史; 裕史塚原; 満安倍; 真人大林
Original assignee: Denso IT Laboratory Inc
Current assignee: Denso IT Laboratory Inc
Priority date: 2013-02-15
Filing date: 2013-11-13
Publication date: 2017-04-12
Anticipated expiration: 2033-11-13
Also published as: JP2014179064A

Description

The present invention relates to a control policy determination device that determines a control policy based on environment sensing information containing uncertainty such as noise, for use in dialogue systems, autonomous mobile robots, vehicles, and the like, and to a control system including the device and a control policy determination method.

Techniques have long been known for automatically acquiring, within a reinforcement learning framework, a value function that determines the optimal control policy of a system on the basis of a partially observable Markov decision process (POMDP).

When the numbers of states and actions are finite (discrete), the value function of a POMDP is known to be a piecewise linear function on the belief space. The belief space is the region of a hyperplane, in a Euclidean space whose dimension equals the number of states, in which every coordinate is non-negative and the coordinates sum to 1. Each coordinate of a point in the belief space is the probability that the system is in the corresponding state. Each linear function corresponds to one of the actions.

This piecewise linear value function is given as the upper bound of the lower half-spaces defined by a plurality of linear functions. The number of these linear functions grows exponentially with the number of look-ahead steps used to determine the control policy. Determining which of so many linear functions give the upper bound requires an enormous amount of computation. The POMDP formulation as it stands is therefore difficult to apply to practical problems.
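
As an illustration of the upper bound described above, the following minimal sketch evaluates a piecewise linear value function as the largest of several linear functions at a given belief. The coefficient vectors in `alphas` are hypothetical values chosen for a two-state problem; they do not come from this specification.

```python
def value(belief, alphas):
    """Upper bound of the linear functions: the largest inner product alpha . b."""
    return max(sum(a * p for a, p in zip(alpha, belief)) for alpha in alphas)


# Hypothetical coefficient vectors, one per linear function (two states).
alphas = [(10.0, -20.0), (-20.0, 10.0), (-1.0, -1.0)]

for p1 in (0.0, 0.5, 1.0):
    belief = (p1, 1.0 - p1)   # coordinates are probabilities that sum to 1
    print(belief, value(belief, alphas))
```

Evaluation is cheap once the linear functions are known. The cost discussed in the text lies in deciding which of the many candidate functions ever attain the maximum.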

For this reason, methods of computing the value function approximately have been proposed. In particular, the method called PBVI (Point-Based Value Iteration) is easy to implement and is widely used. Various extensions of PBVI have also been made (see, for example, Non-Patent Document 1).

G. Shani, J. Pineau and R. Kaplow, "A Survey of Point-Based POMDP solvers", Auton Agent Multi-Agent Syst (published online, 08 June 2012), DOI 10.1007/s10458-012-9200-2.
C. Thurau, K. Kersting and C. Bauckhage, "Convex Non-Negative Matrix Factorization in the Wild", ICDM 2009, pp. 523-532, DOI 10.1109/ICDM.2009.55.

However, although the PBVI method greatly reduces the amount of computation compared with the exact POMDP solution, it has the following problem. The approximation accuracy of PBVI depends strongly on how the points used for the approximation are chosen, yet there is no clear theoretical criterion for which points should be selected. Many points must therefore be used to raise the approximation accuracy, and as a result large computational resources are still required. The number of points usually grows exponentially with the number of states, which remains a major constraint when the method is applied to practical problems.

The present invention has been made in view of the above problem. Its object is to reduce the computational cost of modeling system control as a partially observable Markov decision process (POMDP) and automatically acquiring the value function by reinforcement learning, so that such a system control model can be applied to practical problems.

To solve the above problem, the control policy determination device of the present invention is a control policy determination device that determines a control policy based on environment sensing information containing uncertainty, and comprises: a linear function generation unit that generates, based on the environment sensing information, a candidate set of linear functions that give the linear elements of a value function on a belief space; a dual transformation unit that transforms the candidate set on the belief space into a plurality of points on a dual space; a convex hull approximation calculation unit that calculates an approximate convex hull approximating the convex hull of the plurality of points; a membership determination unit that determines a membership function of the vertices of the approximate convex hull; a convex hull upper side extraction unit that extracts the upper side of the approximate convex hull; an inverse dual transformation unit that inversely transforms the vertices belonging to the upper side into linear functions on the belief space; and a linear function update unit that updates the linear functions in accordance with the number of backup steps, based on the linear functions obtained by the inverse transformation. The dual transformation unit further transforms the linear functions updated in accordance with the number of backup steps, as the candidate set, into a plurality of points on the dual space. The control policy determination device further comprises: a value function determination unit that obtains a plurality of linear elements of an approximate value function based on the linear functions obtained by the inverse transformation after the linear functions have been updated for the number of backup steps; and a policy determination unit that assigns an action to each of the plurality of linear elements of the approximate value function in accordance with the membership function.

With this configuration, the candidate set on the belief space is mapped to the dual space, and the vertices on the upper side of an approximate convex hull approximating its convex hull are obtained together with their membership functions. Consequently, even when the number of backup steps (the value iteration horizon) becomes large, the approximate value function can be computed quickly and actions can be assigned to the elements of the candidate set.

In the above control policy determination device, the membership determination unit may determine, as the membership function of the vertices of the approximate convex hull, a membership function that yields the correct membership when applied to the plurality of points on the dual space.

With this configuration, it is possible to obtain a membership function satisfying the condition that it reduces to the correct membership when the approximate convex hull coincides with the convex hull of the plurality of points on the dual space.

In the above control policy determination device, the membership determination unit may determine, as the membership function of the vertices of the approximate convex hull, the result of transforming the membership function of the plurality of points on the dual space by the matrix that transforms those points into the approximate convex hull.

In the above control policy determination device, the linear function update unit may update the linear functions separately for each observation value in each backup step.

The control system of the present invention comprises the above control policy determination device, an environment sensing information input unit that inputs the environment sensing information, and an output unit that outputs a control command for executing the action assigned by the control policy determination device.

With this configuration as well, the candidate set on the belief space is mapped to the dual space, and the vertices on the upper side of an approximate convex hull approximating its convex hull are obtained together with their membership functions. Consequently, even when the value iteration horizon (the number of backup steps) becomes large, the approximate value function can be computed quickly and actions can be assigned to the elements of the candidate set.

The control policy determination method of the present invention is a control policy determination method that acquires a control policy based on environment sensing information containing uncertainty, and includes: an environment sensing information input step of inputting the environment sensing information; a linear function generation step of generating, based on the environment sensing information, a candidate set of linear functions that give the linear elements of a value function on a belief space; a dual transformation step of transforming the candidate set on the belief space into a plurality of points on a dual space; an approximate convex hull calculation step of calculating an approximate convex hull approximating the convex hull of the plurality of points; a membership determination step of determining a membership function of the vertices of the approximate convex hull; a convex hull upper side extraction step of extracting the upper side of the approximate convex hull; an inverse dual transformation step of inversely transforming the vertices belonging to the upper side into linear functions on the belief space; and a linear function update step of updating the linear functions in accordance with the number of backup steps, based on the linear functions obtained by the inverse transformation. The dual transformation step further transforms the linear functions updated in accordance with the number of backup steps, as the candidate set, into a plurality of points on the dual space. The control policy determination method further includes: a value function determination step of obtaining a plurality of linear elements of an approximate value function based on the linear functions obtained by the inverse transformation after the linear functions have been updated for the number of backup steps; and a policy determination step of assigning an action to each of the plurality of linear elements of the approximate value function in accordance with the membership function.

Yet another aspect of the present invention is a control policy determination program for causing a computer to function as the above control policy determination device.

According to the present invention, value iteration in a POMDP can be carried out approximately at high speed. This makes it possible to apply the POMDP framework to real problems with a large number of states, such as dialogue control in a spoken dialogue system that is robust against speech recognition errors, and the control of autonomous robots and vehicles in the real world.

Block diagram showing the configuration of the dialogue system in the embodiment of the present invention
Block diagram showing the configuration of the value function calculation unit in the embodiment of the present invention
Graph showing an example of the candidate set (T = 1) in the embodiment of the present invention
(a) Graph showing an example of the candidate set in the embodiment of the present invention; (b) graph showing an example of the value function in the embodiment of the present invention
Graph showing an example of the candidate set (T = 2) in the embodiment of the present invention
Graph showing an example of the candidate set (T = 3) in the embodiment of the present invention
(a) Graph showing an example of the candidate set (T = 1) in the embodiment of the present invention; (b) graph showing an example of the dual-transformed point set in the embodiment of the present invention
(a) Graph showing an example of the candidate set (T = 1) in the embodiment of the present invention; (b) graph showing an example of the dual-transformed point set in the embodiment of the present invention
Graph showing an example of the approximate convex hull in the embodiment of the present invention
(a) Graph showing an example of the approximate convex hull in the embodiment of the present invention; (b) graph showing an example of the inverse-dual-transformed linear functions in the embodiment of the present invention
(a) Graph showing an example of the candidate set (T = 2) in the embodiment of the present invention; (b) graph showing an example of the dual-transformed point set in the embodiment of the present invention
(a) Graph showing an example of the candidate set (T = 3) in the embodiment of the present invention; (b) graph showing an example of the dual-transformed point set in the embodiment of the present invention
(a) Graph showing an example of the candidate set (T = 2) in the embodiment of the present invention; (b) graph showing an example of the approximate convex hull in the embodiment of the present invention
(a) Graph showing an example of the candidate set (T = 3) in the embodiment of the present invention; (b) graph showing an example of the approximate convex hull in the embodiment of the present invention
Graph showing an example of the approximate value function (T = 3) in the embodiment of the present invention
(a) Graph showing an example of the approximate value function in the embodiment of the present invention; (b) graph showing an example of the exact value function in the embodiment of the present invention
Graph showing an example of the value function and the optimal action policy in the embodiment of the present invention
Operation flow diagram of the control policy determination device in the embodiment of the present invention
Operation flow diagram of the control policy determination device in a modification of the embodiment of the present invention

First, an outline is given of the principle by which the present invention solves the above problem in learning the value function of a POMDP.

In one aspect of the present invention, each linear function on the belief space is first associated with a point in another Euclidean space (hereinafter, the "dual space"). That is, a set of linear functions on the belief space is treated as a set of points in the dual space. The dimension of the dual space equals the number of states.

Next, the convex hull of the set of points in the dual space is obtained. The ordering of value function values on the belief space induces an ordering in the dual space, and the upper half of the convex hull is determined with respect to that ordering. The linear functions on the belief space that correspond to the points on this upper half form part of the value function. The desired value function is therefore obtained by inversely transforming the points on the upper half of the convex hull in the dual space into linear functions on the belief space.
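
The correspondence between the upper part of the hull and the value function can be sketched for two states as follows. The sketch assumes that the dual point of a linear function is simply its coefficient vector (a1, a2), so that its value at belief b = (p1, 1 - p1) is a1*p1 + a2*(1 - p1); the specification does not state the transformation at this point, so this is an illustrative choice. The hull is computed exactly with Andrew's monotone chain, not with the approximation described later, and the sample points are hypothetical.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Exact 2-D convex hull (Andrew's monotone chain), counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(reversed(pts))


def support_interval(v, others):
    """Range of p1 in [0, 1] on which v scores highest, or None if there is none."""
    lo, hi = 0.0, 1.0
    for w in others:
        d1, d2 = v[0] - w[0], v[1] - w[1]
        slope = d1 - d2              # v beats w where d2 + p1 * slope >= 0
        if slope > 0:
            lo = max(lo, -d2 / slope)
        elif slope < 0:
            hi = min(hi, -d2 / slope)
        elif d2 < 0:
            return None              # w is better for every belief
    return (lo, hi) if lo <= hi else None


def upper_part(points):
    """Hull vertices that are maximal for at least one belief."""
    hull = convex_hull(points)
    return sorted(v for v in hull if support_interval(v, hull) is not None)


# Hypothetical dual points (coefficient vectors of five linear functions).
points = [(10, -10), (-10, 10), (2, 2), (-5, -5), (1, -3)]
print(upper_part(points))  # [(-10, 10), (2, 2), (10, -10)]
```

Under this assumed transformation the inverse step is trivial: each surviving point is read back as the linear function with those coefficients. The point (-5, -5) is a hull vertex but lies on the lower part, and (1, -3) is interior, so neither contributes to the value function.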

In general, the amount of computation needed to obtain the exact convex hull in a high-dimensional space grows exponentially with the number of dimensions. To address this, in one aspect of the present invention the problem of finding the convex hull is relaxed into a continuous problem: a minimization problem of finding, among the convex sets that contain the entire point set, the one with the smallest volume. An approximate solution to this minimization problem is then constructed at high speed. When this approximate solution is obtained, the number of vertices can also be made smaller than that of the actual convex hull, in which case the amount of computation can be reduced further.

The vertices of the approximately obtained convex hull (the approximate convex hull) generally do not coincide with the points mapped into the dual space from the linear functions on the belief space. When such a vertex is transformed back to the belief space, it is therefore not obvious which action it should correspond to. In one aspect of the present invention, a membership function is defined that gives, for each of these vertices, the membership of each point of the point set mapped from the linear functions on the belief space. Such a mapping is not necessarily unique, and one is selected according to an appropriate criterion. The action corresponding to a vertex of the approximate convex hull is then defined as follows: the point set mapped from the linear functions on the belief space is divided into subsets of points having the same action, the membership values of the points in each subset with respect to the vertex are summed, and the action of the subset with the largest sum is taken.
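
The action assignment rule described above can be sketched as follows. The membership values are taken as given input here: `membership[k][i]` is a hypothetical weight of original point `i` with respect to approximate-hull vertex `k`, and `point_actions[i]` is the action attached to point `i`. How the membership function itself is computed is not covered by this sketch.

```python
from collections import defaultdict


def assign_actions(membership, point_actions):
    """For each vertex, sum memberships per action and keep the largest sum."""
    assigned = []
    for weights in membership:
        totals = defaultdict(float)
        for weight, action in zip(weights, point_actions):
            totals[action] += weight          # group points sharing an action
        assigned.append(max(totals, key=totals.get))
    return assigned


# Hypothetical data: four dual points and three approximate-hull vertices.
point_actions = ["u1", "u1", "u2", "u3"]
membership = [
    [0.6, 0.3, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
    [0.2, 0.1, 0.5, 0.2],
]
print(assign_actions(membership, point_actions))  # ['u1', 'u3', 'u2']
```

If two actions tie, `max` returns the one encountered first. The text does not specify a tie-breaking rule, so that behaviour is only a property of this sketch.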

By determining an action for each vertex of the approximate convex hull in this way, returning the vertices to the belief space by the inverse transformation, and taking the upper bound of the resulting lower half-spaces, an approximate value function of the POMDP can be constructed by value iteration.
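
Once each linear function carries an action, the policy at a belief can be read off as the action of the linear function that attains the upper bound there. The following is a minimal sketch under that reading; the coefficient vectors and action labels are hypothetical.

```python
def best_action(belief, labelled_alphas):
    """Return (action, value) of the linear function that is largest at belief."""
    def score(item):
        alpha, _action = item
        return sum(a * p for a, p in zip(alpha, belief))

    winner = max(labelled_alphas, key=score)
    return winner[1], score(winner)


# Hypothetical linear elements of an approximate value function (two states).
labelled_alphas = [
    ((10.0, -20.0), "u1"),
    ((-20.0, 10.0), "u2"),
    ((-1.0, -1.0), "u3"),
]

for belief in [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]:
    print(belief, best_action(belief, labelled_alphas))
```

With these numbers a confident belief selects `u1` or `u2`, while an uncertain belief such as (0.5, 0.5) selects `u3`.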

Hereinafter, a system control device according to an embodiment of the present invention will be described with reference to the drawings. The embodiment described below is one example of carrying out the present invention and does not limit the present invention to the specific configuration described. In carrying out the present invention, a specific configuration suited to the embodiment may be adopted as appropriate.

For simplicity, the case of two states is described below. The formulas of the embodiment nevertheless apply as they are to any number of states N, as long as the states are discrete and finite. The figures of the embodiment likewise show the two-state case for ease of explanation, but for any discrete and finite number of states they can be understood as extended to a higher-dimensional space.

A concrete problem setting is not strictly necessary for describing the embodiment, but to aid understanding the following dialogue system control is considered as an example. In this dialogue system the user is assumed to have one of two intentions, "I want to eat" or "I want to shop", and the system determines the user's intention through spoken dialogue.

Hereinafter, these user intentions are called "states"; the state "I want to eat" is denoted x₁ and the state "I want to shop" is denoted x₂. In response to an inquiry from the dialogue system about the user's intention, the user answers by saying "meal" or "shopping". These responses are called observations; the observation "meal" is denoted o₁ and the observation "shopping" is denoted o₂. In general, M discrete observation values are obtained. It is assumed that, in state x, observation o is obtained with probability p(o|x). This probability p(o|x) is called the perceptual model.

FIG. 1 is a block diagram showing the configuration of the dialogue system of the present embodiment. As shown in FIG. 1, the dialogue system 100 includes a voice input unit 10, a voice recognition unit 20, a control policy determination device 30, and an output unit 40. The voice input unit 10, the voice recognition unit 20, and the output unit 40 constitute an environment E, which inputs environment sensing information to the control policy determination device 30 and produces output in accordance with the policy determined by the control policy determination device 30. Each function of the control policy determination device 30 described below is realized by a computer, equipped with an arithmetic processing unit, memory, a storage device, an input unit, an output unit, and the like, executing a predetermined program.

The voice input unit 10 receives voice input and converts it into a voice signal. The voice recognition unit 20 performs speech recognition processing on the voice signal generated by the voice input unit 10 to recognize its content, and inputs the recognition result to the control policy determination device 30 as environment sensing information. The combination of the voice input unit 10 and the voice recognition unit 20 senses the environment and supplies environment sensing information to the control policy determination device 30, and corresponds to the "environment sensing information input unit" of the present invention.

The control policy determination device 30 determines a control policy based on the speech content recognized by the voice recognition unit 20. The output unit 40 outputs, to the controlled object, a control command for executing the control policy determined by the control policy determination device 30. The control policy determination device 30 includes a belief calculation unit 310, a system model unit 320, an environment model unit 330, a value function calculation unit 340, and a policy determination unit 350. The belief calculation unit 310 includes an initial belief generation unit 311 and a belief update unit 312. The system model unit 320 consists of an action model unit 321, a perceptual model unit 322, and a policy model unit 323. The environment model unit 330 includes a reward function model unit 331.

The belief calculation unit 310 generates an initial belief and updates the belief. The system model unit 320 stores the action model, the perceptual model, and the policy model. The environment model unit 330 stores the reward function model. The value function calculation unit 340 calculates the value function. The policy determination unit 350 uses the value function calculated by the value function calculation unit 340 to determine a control policy in accordance with the policy model.

In the dialogue system 100 of the present embodiment, the voice recognition unit 20 performs speech recognition on an utterance and recognizes it as "meal" or "shopping". A general problem in speech recognition is misrecognition: even when the user says "meal", the utterance may be recognized as "shopping". The speech recognition result therefore does not identify the user's state; it can only be regarded as an observation of a stochastic phenomenon that reflects the user's state. This is why the control policy determination device 30 handles a probabilistic perceptual model p(o|x). The perceptual model is stored in the perceptual model unit 322.

Because of the probabilistic nature of this perceptual model, the user's state can only be known probabilistically, even after the system has made an observation. Let p₁ = p(x₁) be the probability that the state is x₁ and p₂ = p(x₂) the probability that the state is x₂. The pair of probabilities b = (p₁, p₂), where p₁ + p₂ = 1, is called the belief. The set of all possible values of (p₁, p₂) is called the belief space. In this system, the belief space is the line segment satisfying p₁ + p₂ = 1 within the square region [0, 1] × [0, 1] of (p₁, p₂).
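
The following sketch shows how a belief moves along this line segment when a noisy observation arrives, using the standard Bayes update b'(x) ∝ p(o|x) b(x). It assumes that the user's intention does not change between observations, and the 80% recognition accuracy in `P_OBS` is a hypothetical figure, not one given in this specification.

```python
def update_belief(belief, observation, p_obs):
    """Bayes update of the belief after one observation (state assumed unchanged)."""
    joint = [p_obs[x][observation] * p for x, p in enumerate(belief)]
    total = sum(joint)                      # probability of the observation
    return tuple(j / total for j in joint)  # renormalise so the sum is 1


# Hypothetical perceptual model p(o|x): rows are states x1, x2; columns o1, o2.
P_OBS = [
    [0.8, 0.2],
    [0.2, 0.8],
]

belief = (0.5, 0.5)                          # no prior knowledge of the intention
belief = update_belief(belief, 0, P_OBS)     # "meal" recognised once
print(belief)
belief = update_belief(belief, 0, P_OBS)     # "meal" recognised again
print(belief)
```

Each consistent recognition result moves the belief closer to (1, 0) without ever reaching it, which reflects the point that the state is only known probabilistically.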

Thus, although the state of the system cannot be determined definitively as x₁ or x₂, the belief b can be regarded as giving the state of the system in an extended sense. The conventional state space consisting of the two states x₁ and x₂ corresponds to the case in which the belief space consists only of (1, 0) and (0, 1). In this sense, a belief may also be called a state in the following description.

Depending on the value of the belief, the dialogue system 100 can execute one of three processes: "present information about meals", "present information about shopping", or "ask the user again". These system processes are called "actions"; the action "present information about meals" is denoted u₁, the action "present information about shopping" is denoted u₂, and the action "ask again" is denoted u₃. In general, L discrete actions are available. The policy determination unit 350 determines which of these actions to take.

In general, actions fall into two broad categories: "actions that observe the state", such as "ask again", and "actions that make a final decision", such as u₁ and u₂. When an action in the "final decision" category is selected, the system control problem is complete at that point. As long as actions in the "observe the state" category are selected, the system control problem continues until an action in the "final decision" category is selected.
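
The control flow implied by these two categories can be sketched as a loop that keeps observing until a final-decision action is chosen. The threshold policy, the 0.9 threshold, and the 80% recognition accuracy are hypothetical stand-ins used only to make the loop runnable; the policy actually used by the device is derived from the value function.

```python
FINAL_DECISIONS = {"u1", "u2"}   # present meal / shopping information


def bayes_update(belief, obs, p_obs):
    """Belief update after one observation (intention assumed unchanged)."""
    joint = [p_obs[x][obs] * p for x, p in enumerate(belief)]
    total = sum(joint)
    return tuple(j / total for j in joint)


def threshold_policy(belief, threshold=0.9):
    """Hypothetical policy: decide once one state is likely enough, else ask."""
    if belief[0] >= threshold:
        return "u1"
    if belief[1] >= threshold:
        return "u2"
    return "u3"


def run_dialogue(belief, observations, p_obs, policy=threshold_policy):
    """Return the sequence of actions taken until a final decision is made."""
    trace = []
    stream = iter(observations)
    while True:
        action = policy(belief)
        trace.append(action)
        if action in FINAL_DECISIONS:
            return trace                  # the control problem is complete
        obs = next(stream, None)
        if obs is None:
            return trace                  # no more user responses available
        belief = bayes_update(belief, obs, p_obs)


P_OBS = [[0.8, 0.2], [0.2, 0.8]]          # hypothetical perceptual model
print(run_dialogue((0.5, 0.5), [0, 0], P_OBS))  # ['u3', 'u3', 'u1']
```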

Hereinafter, it is assumed that when the action u is executed in the state x, the state transitions to x′ with probability p(x′ | x, u). This probability p(x′ | x, u) is called the behavior model, and it is stored in the behavior model unit 321. When an action u^* belonging to the "final decision" category is selected, p(x′ | x, u^*) = 0 is assumed for any states x and x′. Strictly speaking, p can then no longer be interpreted as a probability, but it is defined this way for convenience.

In addition, when the action u is executed in the state x, the system receives a reward r(x, u). This reward model r(x, u) is stored in the reward function model unit 331 of the environment model unit 330. The maximum of the expected sum of rewards received from a state with belief b up to T steps into the future is called the value function of that state and is denoted V_T(b). The action to be taken when the belief is b in order to attain this value function is called the optimal policy and is denoted π_T(b).

However, when computing the expected sum of rewards, rewards obtained in the future are generally weighted by the discount rate γ (0 < γ ≦ 1) raised to the power of the number of backup steps (that is, a reward obtained τ steps later is multiplied by γ^τ) before the expectation is taken. The optimal policy π_T(b) is stored in the policy model unit 323 as the policy model.
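As a minimal illustration of the discounting described above, the following sketch sums a sequence of rewards with the reward at step τ weighted by γ^τ. The function name `discounted_return` and the example reward sequence are hypothetical and do not appear in the specification.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward received tau steps ahead is weighted by gamma**tau."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("the discount rate must satisfy 0 < gamma <= 1")
    return sum((gamma ** tau) * r for tau, r in enumerate(rewards))


# Hypothetical example: three unit rewards discounted with gamma = 0.5
example_return = discounted_return([1.0, 1.0, 1.0], 0.5)
```

With γ = 1 the function reduces to the plain sum of rewards, which corresponds to the undiscounted case allowed by 0 < γ ≦ 1.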

As described above, in a situation where the state can only be estimated probabilistically through observation, the basic problem setting when applying a POMDP to dialogue system control is to select the optimal system action at each point in time based on the currently held belief, that is, to determine the optimal policy.

In this problem setting, the optimal policy is determined as the one that maximizes the value function in each state x_i. Therefore, if the value function can be obtained, dialogue control by POMDP becomes possible. The value function calculation unit 340 of the system control unit obtains this value function. The following describes how the value function calculation unit 340 of the present embodiment obtains the value function by value iteration.

FIG. 2 is a block diagram showing the configuration of the value function calculation unit 340. The value function calculation unit 340 includes an initial linear function generation unit 1, a dual transformation unit 2, a dual value iteration processing unit 3, an inverse dual transformation unit 4, a value function determination unit 5, and a linear function update unit 6. The dual value iteration processing unit 3 consists of a labeled convex hull approximation calculation unit 31 and a pruning unit 35. The labeled convex hull approximation calculation unit 31 consists of a convex hull approximation calculation unit 32, which comprises a ConvexNMF calculation unit 33, and a membership determination unit 34.

First, when the number of states is discrete and finite, as in the present embodiment, it can be shown theoretically that the value function of a POMDP is expressed as a piecewise linear function on the belief space (see Non-Patent Document 1). It can also be shown theoretically that this piecewise linear function is in general convex (convex downward). The value function calculation unit 340 calculates the value function by exploiting this convexity. A specific description follows.

The linear function giving each piece (hereinafter also called a "linear element") is written as in Equation (1). In the following, the number of states is denoted N, the number of observation values M, and the number of actions L. In the present embodiment, N = 2, M = 2, and L = 3. The linear function φ_ν(b) can be identified with its coefficient vector.
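The image of Equation (1) is not reproduced in this text. With the belief written as b = (p₁, ..., p_N), a form consistent with the surrounding description would be the following; the component notation ν_n is an assumption made here for illustration:

```latex
\phi_{\nu}(b) = \sum_{n=1}^{N} \nu_n \, p_n , \qquad \nu = (\nu_1, \nu_2, \ldots, \nu_N)
```

Under this reading, identifying φ_ν with its coefficient vector ν simply means treating each linear element as a point in an N-dimensional space.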

(Initial linear function generation process)
The initial linear function generation unit 1 first sets the number of backup steps to T = 1 and calculates the linear functions that initialize the set of linear functions giving the linear elements of the value function at this step. Specifically, for each action u_l (l = 1, 2, ..., L), the initial linear function generation unit 1 calculates Equation (2) from the reward function r evaluated in each state x_n.

Using the set of initial linear functions obtained in this way, the initial linear function generation unit 1 obtains, by Equation (3), the candidate set Γ⁽¹⁾ of linear elements of the value function.
FIG. 3 is a graph showing an example of this candidate set Γ⁽¹⁾. In the graph of FIG. 3, the horizontal axis represents the belief space and the vertical axis the belief value. FIG. 3 shows the candidate set Γ⁽¹⁾ for the case L = 6. Each line in FIG. 3 represents one initial linear function; since L = 6, there are six linear functions.

Here, the value function V₁(b) is given from this candidate set Γ⁽¹⁾ by Equation (4).
The left side of FIG. 4 is a graph showing an example of a candidate set Γ⁽¹⁾, and the right side of FIG. 4 is a graph showing the value function V₁(b) obtained from that candidate set. The value of the value function V₁(b) is also called the belief value.
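Equation (4) is not reproduced in this text. Assuming it is the usual maximum over the candidate set, that is, V₁(b) is the largest value that any element of Γ⁽¹⁾ takes at b, the evaluation can be sketched as follows. Each linear element is represented by its coefficient vector; the function name and the coefficient values are hypothetical.

```python
def value_function(belief, candidates):
    """Return the largest value that any candidate linear element takes at the belief."""
    return max(
        sum(nu_n * p_n for nu_n, p_n in zip(nu, belief))
        for nu in candidates
    )


# Hypothetical candidate set for N = 2 states and L = 3 actions
gamma_1 = [(1.0, -1.0), (-1.0, 1.0), (0.2, 0.2)]
v_certain = value_function((0.9, 0.1), gamma_1)    # first element dominates
v_uncertain = value_function((0.5, 0.5), gamma_1)  # third element dominates
```

Because the maximum of linear functions is convex, this representation automatically yields the convex piecewise linear shape described in the text.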

As shown in FIG. 4, not every element of the candidate set Γ⁽¹⁾ becomes a linear element of the value function V₁(b); in practice, only a small fraction contribute. Only the elements drawn with solid lines in the left graph of FIG. 4 are linear elements of V₁(b), and the right graph of FIG. 4 shows only those contributing elements. Removing from the candidate set Γ⁽¹⁾ the elements that do not become linear elements of the value function V₁(b) in this way is called pruning.

FIG. 5 is a graph showing the candidate set when the number of backup steps is T = 2 in the example of FIG. 3, and FIG. 6 is a graph showing the candidate set when the number of backup steps is T = 3 in the same example. As illustrated in FIGS. 5 and 6, the number of elements in the candidate set generally grows exponentially as the number of backup steps (also called the value iteration horizon) increases, so this pruning process requires an extremely large amount of computation, which increases both the computational load and the computation time. This is the central problem with computing the value function by value iteration in a POMDP.

Therefore, to solve this pruning problem, the present embodiment does not select exactly those elements of the candidate set Γ⁽¹⁾ that become linear elements of the value function V₁(b); instead, it calculates linear functions that approximate the linear elements of V₁(b).

(Dual transformation)
To this end, the dual transformation unit 2 first associates each element of the candidate set Γ⁽¹⁾ with a point in an N-dimensional space, as follows. By regarding the coefficients (ν⁽¹⁾)_n as the coordinates of a point in an N-dimensional space, a mapping can be defined from each linear function to a point in that space. This mapping is called the dual transformation, and the N-dimensional space onto which it maps is called the dual space.

FIG. 7 is a graph showing an example of the dual transformation. In the example of FIG. 7, the linear functions 101 to 106 are mapped to the points 201 to 206 in the dual space, respectively. The dual transformation shown here is only one example, and the transformation is not limited to the one described. The set of points obtained by mapping the elements of the candidate set Γ⁽¹⁾ into the dual space is denoted Γ′⁽¹⁾.

(Convex hull and value function)
The above/below ordering of linear functions on the belief space is preserved in the dual space. That is, as shown in FIG. 8, because the value function is given by a convex piecewise linear function, the linear elements giving the value function form the upper side of the convex hull of the point set Γ′⁽¹⁾ obtained by mapping the candidate set into the dual space. The problem of finding the linear elements of the value function from the candidate set Γ⁽¹⁾ therefore reduces to the problem of finding the upper side of the convex hull of the point set Γ′⁽¹⁾ in the dual space.

When the number of states is N = 2, the dual space is two-dimensional, and finding the convex hull is then essentially a sorting problem for which fast algorithms are known; for example, the Kirkpatrick-Seidel quick convex hull algorithm can be used. Mapping the problem into the dual space therefore allows the value function to be calculated quickly. In other words, for N = 2 this process resolves the difficulty of computing the value function by POMDP value iteration.
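To make the N = 2 case concrete, the following sketch prunes a set of dual points (ν₁, ν₂) down to those whose linear functions attain the maximum for some belief b = (p₁, 1 − p₁) with p₁ in [0, 1]. It uses a plain sort-and-scan upper-envelope construction, not the Kirkpatrick-Seidel algorithm named in the text, and the function name and sample points are hypothetical.

```python
def upper_envelope(points):
    """Indices of dual points (nu1, nu2) whose lines are maximal for some p1 in [0, 1].

    With b = (p1, 1 - p1), the element (nu1, nu2) is the line
    value(p1) = nu2 + (nu1 - nu2) * p1.
    """
    def line(i):
        nu1, nu2 = points[i]
        return nu1 - nu2, nu2  # (slope, intercept)

    # Sorting by slope is the step that makes the 2-D case a sorting problem.
    order = sorted(range(len(points)), key=line)
    hull = []
    for i in order:
        m, c = line(i)
        while hull:
            mj, cj = line(hull[-1])
            if mj == m:  # same slope: the earlier line has the lower intercept
                hull.pop()
                continue
            if len(hull) >= 2:
                mk, ck = line(hull[-2])
                # The last line is never on top if the new line overtakes
                # the second-to-last line no later than the last line did.
                if (ck - c) * (mj - mk) <= (ck - cj) * (m - mk):
                    hull.pop()
                    continue
            break
        hull.append(i)

    # Keep only lines whose active interval overlaps the belief range [0, 1].
    kept = []
    for t, i in enumerate(hull):
        m, c = line(i)
        lo, hi = float("-inf"), float("inf")
        if t > 0:
            mp, cp = line(hull[t - 1])
            lo = (cp - c) / (m - mp)
        if t + 1 < len(hull):
            mn, cn = line(hull[t + 1])
            hi = (c - cn) / (mn - m)
        if lo < 1.0 and hi > 0.0:
            kept.append(i)
    return sorted(kept)


# Hypothetical dual points; the last two never attain the maximum on [0, 1]
sample_points = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6), (0.2, 0.2), (-3.0, 0.9)]
surviving = upper_envelope(sample_points)
```

The cost is dominated by the initial sort, so the pruning runs in O(n log n) time for n candidate elements.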

However, in spaces with N = 3 or more states, it is known that even with the best algorithms known to date, the amount of computation needed to find the convex hull grows exponentially with the number of dimensions. Merely mapping to a convex hull problem in the dual space therefore does not readily resolve the difficulty of computing the value function by POMDP value iteration.

A further consideration is that as the dimension N of the space increases, the number of vertices of the convex hull grows in proportion to the number of points in the point set Γ′⁽¹⁾. In other words, the value function becomes subdivided into a very large number of pieces. Even if that is strictly the case, the differences between adjacent pieces conversely become small, and the function approaches a smooth function. In practice, it is preferable to approximate such a nearly smooth function by a piecewise linear function with few pieces that approximates it well.

(Approximate calculation of the convex hull by NMF)
For the reasons above, in the present embodiment the dual value iteration processing unit 3 of the value function calculation unit 340 includes the convex hull approximation calculation unit 32. Rather than finding the convex hull exactly, the convex hull approximation calculation unit 32 uses non-negative matrix factorization (NMF) to find a polytope that approximates it well. Hereinafter, such a polytope is called an "approximate convex hull" for simplicity. Considering an approximate convex hull in this way means relaxing the discrete combinatorial problem of finding the convex hull of the discrete point set Γ′⁽¹⁾ into the problem of computing, within the whole continuous N-dimensional space, a polytope that closely approximates the convex hull of Γ′⁽¹⁾.

Now, suppose the point set Γ′⁽¹⁾ has |Γ′⁽¹⁾| elements, and arrange them into an N × |Γ′⁽¹⁾| matrix as in Equation (5).
Here, each φ_r′⁽¹⁾ is a column vector. The number of vertices of the convex hull of the point set Γ′⁽¹⁾ is unknown, but here it is assumed to be given by K. In other words, the problem considered here is to limit the number of vertices K so that the computation is sufficiently fast, and then to find the polytope that best approximates the convex hull under that limit.

Now, let the vertices of the polytope to be found be w_k⁽¹⁾ (k = 1, 2, ..., K). These are N-dimensional vectors, and the condition that this polytope contains the point set Γ′⁽¹⁾ is given by Equations (6) and (7).

Here, h_r⁽¹⁾ is a K-dimensional vector satisfying Equation (8).
W⁽¹⁾ is an N × K matrix and H⁽¹⁾ is a K × |Γ′⁽¹⁾| matrix, given by Equations (9) and (10) below.

Furthermore, so that the polytope coincides with the convex hull when its number of vertices equals the number of vertices of the actual convex hull, the vertices w_k⁽¹⁾ of the polytope are required to be expressed in terms of the elements of the point set Γ′⁽¹⁾, as in Equations (11) and (12).
Here, g_k⁽¹⁾ is a |Γ′⁽¹⁾|-dimensional vector satisfying Equation (13).
G⁽¹⁾ is a |Γ′⁽¹⁾| × K matrix, as given in Equation (14).
From Equations (6), (7), (11), and (12), the self-consistent factorization problem of Equations (15) and (16) is obtained.
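The images of Equations (5) to (16) are missing from this text. Assuming the formulation follows the standard convex NMF setup, which matches the matrix shapes given above and the relation W⁽¹⁾ = V⁽¹⁾G⁽¹⁾ stated in the text, the structure can be summarized as below. The column-sum normalizations and the least-squares objective are assumptions, and no correspondence to individual equation numbers is claimed.

```latex
\begin{aligned}
V^{(1)} &= \bigl(\phi'^{(1)}_{1}, \ldots, \phi'^{(1)}_{|\Gamma'^{(1)}|}\bigr) \in \mathbb{R}^{N \times |\Gamma'^{(1)}|} \\
V^{(1)} &\approx W^{(1)} H^{(1)}, \qquad h^{(1)}_{kr} \ge 0, \quad \sum_{k=1}^{K} h^{(1)}_{kr} = 1 \\
W^{(1)} &= V^{(1)} G^{(1)}, \qquad g^{(1)}_{rk} \ge 0, \quad \sum_{r=1}^{|\Gamma'^{(1)}|} g^{(1)}_{rk} = 1 \\
\min_{G^{(1)},\, H^{(1)}} &\; \bigl\lVert V^{(1)} - V^{(1)} G^{(1)} H^{(1)} \bigr\rVert_F^{2}
\end{aligned}
```

Read this way, each column of H⁽¹⁾ expresses a dual point as a convex combination of the polytope vertices, and each column of G⁽¹⁾ expresses a vertex as a convex combination of the dual points, which is what makes the problem self-consistent.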

An algorithm for solving this problem is given in, for example, Non-Patent Document 2. According to the method in that document, the problem of Equations (15) and (16) is solved in practice by repeatedly projecting onto a two-dimensional space, finding the convex hull within that two-dimensional space, and mapping it back to the original space. With this technique, the computation remains fast even as the size |Γ′⁽¹⁾| of the point set Γ′⁽¹⁾ increases.

Solving Equations (15) and (16) yields the vertices of the approximate convex hull, W⁽¹⁾ = V⁽¹⁾G⁽¹⁾. FIG. 9 is a graph showing an example of calculating the approximate convex hull. In the example of FIG. 9, K = 5. In the figure, the broken line is the original convex hull, and the solid line is the approximate convex hull that approximates it.

(Definition of the membership function)
Each element φ of the candidate set Γ⁽¹⁾ is assigned a particular action u. Which action u an element φ is assigned to is called its membership. Generalizing this, the degree to which an element φ contributes to the action u_l is expressed as α_l(φ) ∈ [0, 1], and α(φ) = (α₁(φ), α₂(φ), ..., α_L(φ)) is called the membership function.

The membership determination unit 34 determines the membership of each vertex w_k⁽¹⁾ of the approximate convex hull. The membership of an element φ′ of the dual set Γ′⁽¹⁾ is given by the membership of the corresponding element φ of the candidate set Γ⁽¹⁾. That is, when the element φ of the candidate set Γ⁽¹⁾ corresponds to the action u_l, the membership of the element φ′⁽¹⁾ is given by Equation (17).
Hereinafter, for simplicity, this is written as α_l′(φ_l′⁽¹⁾) = δ_ll′.

Because the approximation was performed without the restriction to the point set Γ′⁽¹⁾, the membership of each vertex w_k⁽¹⁾ of the resulting approximate convex hull is no longer obvious. The membership determination unit 34 therefore determines the membership of the vertex w_k⁽¹⁾ by summing the memberships of the elements of Γ′⁽¹⁾, weighted by how strongly the vertex is tied to each element. In other words, the membership function of a vertex of the approximate convex hull is a function that yields the correct membership when applied to an element of the point set Γ′⁽¹⁾. Determining membership in this way means that when the approximate convex hull coincides exactly with the convex hull of Γ′⁽¹⁾, the result can be expected to reduce to the correct membership.

Specifically, the membership determination unit 34 obtains the membership function for the vertex w_k⁽¹⁾ by Equation (18) below.
As shown in Equation (18), the membership determination unit 34 takes the memberships of the elements of the point set Γ′⁽¹⁾, transforms them with the transformation matrix G that maps the elements of Γ′⁽¹⁾ to the approximate convex hull, and adopts the result as the membership of the vertices of the approximate convex hull.

The action corresponding to the vertex w_k⁽¹⁾ is determined by Equation (19) below.
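Equations (18) and (19) are not reproduced in this text. Assuming that Equation (18) is the G-weighted sum of the element memberships and that Equation (19) selects the action with the largest resulting membership, the computation can be sketched as follows. The function names, the hard (one-action) labels for the elements, and the sample matrix are hypothetical.

```python
def vertex_memberships(G, labels, num_actions):
    """Membership vector of each vertex as the G-weighted sum of element memberships.

    G[r][k] is the weight of dual point r in vertex k; labels[r] is the index of
    the single action assigned to dual point r.
    """
    num_vertices = len(G[0])
    alpha = [[0.0] * num_actions for _ in range(num_vertices)]
    for r, row in enumerate(G):
        for k, weight in enumerate(row):
            alpha[k][labels[r]] += weight
    return alpha


def vertex_actions(alpha):
    """Pick, for each vertex, the action index with the largest membership."""
    return [max(range(len(a)), key=a.__getitem__) for a in alpha]


# Hypothetical example: three dual points, two vertices, three actions
G_example = [[1.0, 0.0], [0.0, 0.7], [0.0, 0.3]]
alpha_example = vertex_memberships(G_example, labels=[0, 1, 2], num_actions=3)
actions_example = vertex_actions(alpha_example)
```

When a vertex coincides with a single dual point, its column of G has a single entry equal to 1, so the vertex inherits that point's action, in line with the behavior the text expects for an exact convex hull.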

(Upper side of the approximate convex hull)
Once the labeled convex hull approximation calculation unit 31 has obtained the vertices w_k⁽¹⁾ of the approximate convex hull and their memberships as described above, the convex hull upper side extraction unit 36 of the pruning unit 35 extracts the upper side of the approximate convex hull and thereby obtains the function corresponding to the approximate value function.

In extracting the upper side of the convex hull, the direction cosines of the coordinates of the N-dimensional space are considered.

To extract the upper side of a convex hull in the dual space, consider the half-lines that extend from the origin parallel to these direction cosines. Among the intersections of each half-line with the convex hull, only the point farthest from the origin is kept, and the set of points kept in this way is extracted as the upper side of the convex hull in the dual space. The vertices on the upper side of the convex hull of the dual set Γ′⁽¹⁾ give the linear functions corresponding to the value function.

The upper side is defined in the same way for the approximate convex hull. Among the vertices w_k⁽¹⁾ (k = 1, 2, ..., K) of the approximate convex hull, those belonging to the upper side are denoted with indices k_s (s = 1, 2, ..., J), where J ≦ K and 1 ≦ k_s ≠ k_t ≦ K (s ≠ t).

(Inverse transformation of the approximation result and the value function)
The inverse dual transformation unit 4 returns the vertices belonging to the upper side of the approximate convex hull to linear functions on the belief space by the inverse dual transformation. Using these vertices, the inverse dual transformation unit 4 obtains the linear functions on the belief space by Equation (20).

FIG. 10 is a graph showing an example of the inverse dual transformation. The vertices 211, 212, 213, and 214 belonging to the upper side of the approximate convex hull in the left graph of FIG. 10 are transformed into the linear functions 111, 112, 113, and 114 on the belief space in the right graph of FIG. 10, respectively.

Based on these linear functions, the value function determination unit 5 obtains the piecewise linear elements of the approximate value function by Equation (21).

Through the above processing, the value function calculation unit 340 can obtain the approximate value function for the number of backup steps T = 1. If the convex hull had been computed exactly rather than approximately, the exact value function V₁(b) would be obtained.

(Belief update by observation)
In the description so far, the value function calculation unit 340 calculated the value function using only the immediately obtained reward. In general, however, a better value function can be obtained by also taking into account rewards obtained in the future.

This means that before selecting an action in the "final decision" category, an action in the "state observation" category can be selected; that is, an observation can be made before the next action in order to update the belief state. Because the observation allows the belief to be estimated more accurately, the expected sum of the rewards finally obtained improves.

First, suppose the belief is b = (p₁, p₂, ..., p_N). Executing the action u here causes a transition to another belief state. If an observation is then made and the observation value o is obtained, the posterior distribution of the belief is given by Equation (22).
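Equation (22) is not reproduced in this text. A minimal sketch of the standard Bayesian belief update is shown below, under the assumption that the perceptual model gives the likelihood p(o | x′) of the observation in the state reached after the transition. The function name `update_belief` and the argument layout are hypothetical.

```python
def update_belief(belief, transition, obs_likelihood):
    """Posterior belief after executing an action and receiving an observation.

    belief[x]          : prior probability of state x
    transition[x][x2]  : behavior model p(x2 | x, u) for the executed action u
    obs_likelihood[x2] : assumed perceptual model p(o | x2) for the observed value o
    """
    n = len(belief)
    # Prediction step: propagate the belief through the behavior model.
    predicted = [sum(belief[x] * transition[x][x2] for x in range(n)) for x2 in range(n)]
    # Correction step: weight by the observation likelihood and normalize.
    unnormalized = [obs_likelihood[x2] * predicted[x2] for x2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("the observation has zero probability under this belief")
    return [value / total for value in unnormalized]


# Hypothetical example: an "ask again" action that leaves the state unchanged
posterior = update_belief([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], [0.8, 0.2])
```

In this example the uniform prior moves toward the first state because the observation is four times more likely there.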

The optimal action policy for this posterior belief distribution is given by the value function.

The linear function update unit 6 therefore updates the linear functions in order to improve the expected sum of rewards obtained in the future. Specifically, the linear function update unit 6 updates the elements of the candidate set Γ⁽¹⁾ by Equation (23).
The updated candidate set is output to the dual transformation unit 2, and the value function for the number of backup steps T = 2 is obtained by the same processing as described above.

From Equation (22), the value function V₂(b) for the number of backup steps T = 2 is calculated by Equation (24).
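Equation (24) is not reproduced in this text. Assuming it follows the standard POMDP backup, in which the immediate expected reward is added to the discounted expected value of the posterior belief, it would take a form such as the following. The symbol b′_{u,o} for the posterior belief and the observation probability p(o | b, u) are notation introduced here for illustration.

```latex
V_2(b) = \max_{u} \Bigl[ \sum_{x} r(x, u)\, p(x)
  + \gamma \sum_{o} p(o \mid b, u)\, V_1\bigl(b'_{u,o}\bigr) \Bigr],
\qquad
p(o \mid b, u) = \sum_{x'} p(o \mid x') \sum_{x} p(x' \mid x, u)\, p(x)
```

Because p(x′ | x, u^*) = 0 for actions in the "final decision" category, the second term vanishes for those actions under this reading, and only the immediate reward remains.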

Equation (24) can be rewritten as Equations (25) and (26) below, using an abbreviated notation.

The value function calculation unit 340 calculates the value function V₂(b) by Equation (28) from the set of linear functions in the candidate set Γ⁽²⁾ of Equation (27).

The membership determination unit 34 obtains the membership of each element by Equation (29).

By repeating the same procedure used to obtain the linear elements of the approximate value function from the candidate set Γ⁽¹⁾, the linear elements of the value function V₂(b), or of its approximation, can be obtained from the candidate set Γ⁽²⁾.

Furthermore, by adding a backup step and repeating the same procedure used to generate the candidate set Γ⁽²⁾ from the linear elements, the candidate set Γ⁽³⁾ can be generated from the resulting linear elements.

In this way, the value function V_T(b) after the backup steps, or its approximation, can be calculated for any value iteration horizon T. The value function calculation unit 340 calculates the value function V_T(b) and its approximation as described above.

The policy determination unit 350 assigns an action to each linear element.

FIGS. 11 and 12 are graphs showing examples of the candidate sets Γ⁽²⁾ and Γ⁽³⁾ at backup steps T = 2 and T = 3 and of their dual transformations. In the example of FIG. 11, the linear functions 121 to 124 are mapped to the points 221 to 224 in the dual space, respectively.

FIGS. 13 and 14 are graphs showing examples of the approximate convex hull calculation at backup steps T = 2 and T = 3. FIG. 15 is a graph showing an example of the approximate value function after backup steps up to T = 3. FIG. 16(a) is a graph showing an example of the approximate value function, and FIG. 16(b) is a graph showing an example of the exact value function. As a comparison of FIG. 16(a) and FIG. 16(b) shows, the difference between the exact and approximate value functions is small, which indicates that the calculation method of the present embodiment is effective.

The weight coefficients of the linear function that gives the value function at a point b on the belief space are given by Equation (30).
The policy determination unit 350 determines the optimal policy by Equation (31) from the membership of that linear function.
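Equations (30) and (31) are not reproduced in this text. Assuming that Equation (30) selects the linear element with the largest value at b and that Equation (31) returns the action given by that element's membership, the policy lookup can be sketched as follows. Each element is simplified here to carry a single action label rather than a full membership vector; the function name, labels, and coefficients are hypothetical.

```python
def optimal_action(belief, elements):
    """Return the action attached to the linear element that is maximal at the belief.

    elements is a list of (coefficient_vector, action_label) pairs.
    """
    def value(item):
        nu, _ = item
        return sum(nu_n * p_n for nu_n, p_n in zip(nu, belief))

    _, action = max(elements, key=value)
    return action


# Hypothetical linear elements for the two-state dialogue example
elements = [((1.0, -1.0), "u1"), ((-1.0, 1.0), "u2"), ((0.2, 0.2), "u3")]
confident_first = optimal_action((0.9, 0.1), elements)
uncertain = optimal_action((0.5, 0.5), elements)
confident_second = optimal_action((0.1, 0.9), elements)
```

With these made-up coefficients, a confident belief triggers one of the two final-decision actions, while an uncertain belief triggers the observation action u3, which mirrors the three belief regions described for the dialogue system.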

FIG. 17 is a graph showing an example of the value function and the optimal action policy in the present embodiment. As shown in FIG. 17, in accordance with the value function, the dialogue system 100 outputs a control command to execute the action "present information about meals" when the belief is between 0 and 0.32, executes the action "ask again" when the belief is between 0.32 and 0.82, and outputs a control command to execute the action "present information about shopping" when the belief is between 0.82 and 1.

Next, the operation of the control policy determination device 30 will be described. FIG. 18 is an operation flowchart of the control policy determination device 30. The value function calculation unit 340 loads the perceptual model, the behavior model, and the reward function from the perceptual model unit 322, the behavior model unit 321, and the reward function model unit 331, respectively (step S11). Next, the number of backup steps is set to T = 1 (step S12), and the initial linear function generation unit 1 generates the set of initial linear functions (step S13). It is then determined whether T has reached the upper limit T_max (step S14); if it has not (NO in step S14), the dual transformation unit 2 applies the dual transformation to the set of linear functions (step S15).

Next, the convex hull approximation calculation unit 32 calculates the approximate convex hull of the points obtained by the dual transformation (step S16). The membership determination unit 34 then calculates the memberships of the vertices of the approximate convex hull (step S17), and the convex hull upper side extraction unit 36 extracts the upper side of the approximate convex hull (step S18). Once the upper side has been extracted, the inverse dual transformation unit 4 applies the inverse dual transformation to the vertices on the upper side (step S19). The value function determination unit 5 then updates the set of linear functions with the linear elements of the value function obtained by the inverse dual transformation (step S20).

Thereafter, the number of backup steps T is incremented (step S21), and it is again determined whether the number of backup steps exceeds the upper limit T_max (step S14). Steps S15 to S21 are repeated in this way until the number of backup steps exceeds the upper limit. When it does (YES in step S14), the value function calculation unit 340 outputs the set of linear functions obtained at that point, which gives the value function (step S22). This value function is the approximate value function described above. The policy determination unit 350 determines the optimal policy using this approximate value function.
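
One way to read the last step is that, for a given belief, the policy picks the action attached to the linear element with the largest value. The following is a minimal sketch under that reading; it assumes each linear element already carries an action label, and the function name and the labels used in the usage example are hypothetical.

```python
from typing import List, Sequence, Tuple

# (coefficients over states, action label attached to this linear element)
LabeledAlpha = Tuple[Sequence[float], str]


def best_action(elements: List[LabeledAlpha],
                belief: Sequence[float]) -> Tuple[str, float]:
    """Return the action and value of the maximizing linear element at `belief`."""
    if not elements:
        raise ValueError("at least one linear element is required")
    best_label, best_value = elements[0][1], float("-inf")
    for coeffs, action in elements:
        # Value of this linear element: inner product with the belief vector.
        v = sum(c * b for c, b in zip(coeffs, belief))
        if v > best_value:
            best_label, best_value = action, v
    return best_label, best_value
```

For example, with elements `[((1.0, 0.0), "act_a"), ((0.0, 1.0), "act_b"), ((0.6, 0.6), "ask")]`, an uncertain belief such as `(0.5, 0.5)` selects `"ask"`, while a confident belief such as `(0.9, 0.1)` selects `"act_a"`.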

In the embodiment described above, when the number of backup steps is 2 or more, the candidate set of linear elements is updated for all observation values at once, as in Expression (26). As a modification, the maximization in Expression (25) may instead be executed sequentially for each observation value. That is, the candidate set of linear elements may be updated one observation value at a time, with pruning applied to each partially updated set of linear functions, until the candidate set has been updated for all observation values.

The operation of the control policy determination device 30 in this modification will now be described. FIG. 19 is an operation flowchart of the control policy determination device 30 according to the modification. The value function calculation unit 340 loads the perceptual model, the behavior model, and the reward function from the perceptual model unit 322, the behavior model unit 321, and the reward function model unit 331, respectively (step S31). Next, the number of backup steps is set to T = 1 (step S32), and the initial linear function generation unit 1 generates a set of initial linear functions (step S33). The dual transform unit 2 then applies the dual transform to the set of initial linear functions (step S34).

Next, the convex hull approximation calculation unit 32 calculates an approximate convex hull of the plurality of points obtained by the dual transform (step S35). The membership determination unit 14 then calculates the membership of the vertices of the approximate convex hull (step S36), and the convex hull upper side extraction unit 36 extracts the upper side of the approximate convex hull (step S37). Once the upper side has been extracted, the inverse dual transform unit 4 applies the inverse dual transform to the vertices on the upper side of the approximate convex hull (step S38). This processing yields the value function for the number of backup steps T = 1.

Thereafter, the number of backup steps T is incremented (step S39), and it is determined whether T exceeds the upper limit T_max (step S40). If T has not reached the upper limit T_max (NO in step S40), the observation value m is set to 1 (step S41). The linear function update unit 6 then updates the set of linear functions for the observation value m (step S42).

Next, the dual transform unit 2 applies the dual transform to the set of linear functions (step S43), and the convex hull approximation calculation unit 32 calculates an approximate convex hull of the plurality of points obtained by the dual transform (step S44). The membership determination unit 14 calculates the membership of the vertices of the approximate convex hull (step S45), and the convex hull upper side extraction unit 36 extracts the upper side of the approximate convex hull (step S46). The inverse dual transform unit 4 then applies the inverse dual transform to the vertices on the upper side of the approximate convex hull (step S47), which yields a value function. The observation value m is then incremented (step S48), and it is determined whether m exceeds the number of observation values M (step S49). If it does not (NO in step S49), the process returns to step S42 and steps S42 to S49 are repeated.

If the incremented observation value m exceeds the number of observation values M (YES in step S49), that is, if the set of linear functions has been updated and the value function calculated for all observation values, the process returns to step S39, where the number of backup steps T is incremented, and it is determined whether T exceeds the upper limit T_max. Steps S39 to S49 are repeated in this way until the number of backup steps exceeds the upper limit. When it does (YES in step S40), the value function calculation unit 340 outputs the set of linear functions obtained at that point, which gives the value function (step S50). This value function is the approximate value function described above. The policy determination unit 350 determines the optimal policy using this approximate value function.
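
The control flow of FIG. 19 can be summarized as two nested loops, with pruning after every per-observation update. The skeleton below is a sketch only: `update_for_observation` and `prune` are hypothetical callables standing in for step S42 and for steps S43 to S47, and the stopping test assumes that iteration ends once T exceeds T_max (the description uses both "reached" and "exceeds").

```python
from typing import Callable, List, TypeVar

F = TypeVar("F")  # whatever represents one linear function


def iterate_per_observation(
    initial: List[F],
    update_for_observation: Callable[[List[F], int], List[F]],
    prune: Callable[[List[F]], List[F]],
    t_max: int,
    num_observations: int,
) -> List[F]:
    """Run the per-observation variant and return the final set of linear functions."""
    functions = prune(initial)        # S34 to S38: value function for T = 1
    t = 2                             # S39: first increment of T
    while t <= t_max:                 # S40: stop once T exceeds T_max (assumed)
        for m in range(1, num_observations + 1):              # S41, S48, S49
            functions = update_for_observation(functions, m)  # S42
            functions = prune(functions)                      # S43 to S47
        t += 1                        # back to S39
    return functions                  # S50
```

Pruning inside the inner loop keeps the intermediate set small before the next observation value is processed, which is the point of the modification.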

The embodiment has been described above using the control of an interactive system as an example, but application of the control policy determination device of the present invention is not limited to the control of such an interactive system.

The present invention has the effect of enabling value iteration to be executed approximately at high speed, and is useful as, for example, a control policy determination device that determines a control policy based on environmental sensing information.

DESCRIPTION OF SYMBOLS
1 Initial linear function generation unit
2 Dual transform unit
3 Dual value iteration processing unit
4 Inverse dual transform unit
5 Value function determination unit
6 Linear function update unit
10 Speech input unit
20 Speech recognition unit
30 Control policy determination device
31 Labeled convex hull approximation calculation unit
32 Convex hull approximation calculation unit
33 ConvexNMF calculation unit
34 Membership determination unit
35 Pruning unit
36 Convex hull upper side extraction unit
40 Output unit
100 Interactive system

Claims

1. A control policy determination device that determines a control policy based on environmental sensing information including uncertainty, the device comprising:
a linear function generation unit that generates, based on the environmental sensing information, a candidate set of linear functions giving linear elements of a value function on a belief space;
a dual transform unit that transforms the candidate set on the belief space into a plurality of points on a dual space;
a convex hull approximation calculation unit that calculates an approximate convex hull approximating the convex hull of the plurality of points;
a membership determination unit that determines a membership function of vertices of the approximate convex hull;
a convex hull upper side extraction unit that extracts an upper side of the approximate convex hull;
an inverse dual transform unit that inversely transforms vertices belonging to the upper side into linear functions on the belief space; and
a linear function update unit that updates the linear functions according to the number of backup steps, based on the linear functions obtained by the inverse transformation,
wherein the dual transform unit further transforms, as the candidate set, the linear functions updated according to the number of backup steps into a plurality of points on the dual space, and
the control policy determination device further comprises:
a value function determination unit that obtains a plurality of linear elements of an approximate value function based on the linear functions obtained by the inverse transformation after the linear functions have been updated for the number of backup steps; and
a policy determination unit that assigns, to each of the plurality of linear elements of the approximate value function, an action according to the membership function.

2. The control policy determination device according to claim 1, wherein the membership determination unit determines, as the membership function of the vertices of the approximate convex hull, a membership function that yields the correct membership when applied to the plurality of points on the dual space.

3. The control policy determination device according to claim 2, wherein the membership determination unit determines, as the membership function of the vertices, the result of converting the membership function of the plurality of points on the dual space with the matrix that converts the plurality of points on the dual space to the approximate convex hull.

4. The control policy determination device according to claim 1, wherein the linear function update unit updates the linear function in each backup step for each observation value.

5. A control system comprising:
the control policy determination device according to any one of claims 1 to 4;
an environmental sensing information input unit that inputs the environmental sensing information; and
an output unit that outputs a control command for executing the action assigned by the control policy determination device.

6. A control policy determination method for obtaining a control policy based on environmental sensing information including uncertainty, the method comprising:
an environmental sensing information input step of inputting the environmental sensing information;
a linear function generation step of generating, based on the environmental sensing information, a candidate set of linear functions giving linear elements of a value function on a belief space;
a dual transform step of transforming the candidate set on the belief space into a plurality of points on a dual space;
an approximate convex hull calculation step of calculating an approximate convex hull approximating the convex hull of the plurality of points;
a membership determination step of determining a membership function of vertices of the approximate convex hull;
a convex hull upper side extraction step of extracting an upper side of the approximate convex hull;
an inverse dual transform step of inversely transforming vertices belonging to the upper side into linear functions on the belief space; and
a linear function update step of updating the linear functions according to the number of backup steps, based on the linear functions obtained by the inverse transformation,
wherein the dual transform step further transforms, as the candidate set, the linear functions updated according to the number of backup steps into a plurality of points on the dual space, and
the control policy determination method further comprises:
a value function determination step of obtaining a plurality of linear elements of an approximate value function based on the linear functions obtained by the inverse transformation after the linear functions have been updated for the number of backup steps; and
a policy determination step of assigning, to each of the plurality of linear elements of the approximate value function, an action according to the membership function.

7. A control policy determination program for causing a computer to function as a control policy determination device that determines a control policy based on environmental sensing information including uncertainty, the control policy determination device comprising:
a linear function generation unit that generates, based on the environmental sensing information, a candidate set of linear functions giving linear elements of a value function on a belief space;
a dual transform unit that transforms the candidate set on the belief space into a plurality of points on a dual space;
a convex hull approximation calculation unit that calculates an approximate convex hull approximating the convex hull of the plurality of points;
a membership determination unit that determines a membership function of vertices of the approximate convex hull;
a convex hull upper side extraction unit that extracts an upper side of the approximate convex hull;
an inverse dual transform unit that inversely transforms vertices belonging to the upper side into linear functions on the belief space; and
a linear function update unit that updates the linear functions according to the number of backup steps, based on the linear functions obtained by the inverse transformation,
wherein the dual transform unit further transforms, as the candidate set, the linear functions updated according to the number of backup steps into a plurality of points on the dual space, and
the control policy determination device further comprises:
a value function determination unit that obtains a plurality of linear elements of an approximate value function based on the linear functions obtained by the inverse transformation after the linear functions have been updated for the number of backup steps; and
a policy determination unit that assigns, to each of the plurality of linear elements of the approximate value function, an action according to the membership function.