CN106296006A

CN106296006A - Regret-minimization evaluation method balancing risk and return in imperfect-information games

Info

Publication number: CN106296006A
Application number: CN201610658485.2A
Authority: CN
Inventors: 王轩; 蒋琳; 张加佳; 滕雯娟; 代佳宁; 王鹏程; 胡开亮; 林云川; 朱航宇
Original assignee: Shenzhen Graduate School Harbin Institute of Technology
Current assignee: Shenzhen Graduate School Harbin Institute of Technology
Priority date: 2016-08-10
Filing date: 2016-08-10
Publication date: 2017-01-04

Abstract

The invention provides a regret-minimization evaluation method that balances risk and return in imperfect-information games, comprising the following steps. Step 1: for each information set, initialize its strategy, its valuation, and the regret value of each action. Step 2: play the game with the current strategy until the game is completed. Step 3: compute the valuation and the regret value of each action on every information set visited during this game. Step 4: compute a new strategy with the regret matching algorithm. Step 5: compute the risk value of the new strategy, weigh return against risk, and select the strategy to use in the next round. Step 6: return to step 2 until the game process ends. The invention borrows the concept of risk from economics, studies the principle of the risk model, and combines it with the regret minimization algorithm for use in imperfect-information machine game playing. While regret minimization yields a return-first strategy, the risk of the strategy is also taken into account, so that a more reasonable Nash equilibrium is reached.

Description

Regret-minimization evaluation method balancing risk and return in imperfect-information games

Technical field

The present invention relates to the field of artificial intelligence, and in particular to a regret-minimization evaluation method that balances risk and return in imperfect-information games.

Background art

Artificial intelligence is an important branch of computer science whose central task is to study how to make computers do work that previously could only be done by human intelligence. Machine game playing is an important research area of artificial intelligence and an important means of testing its level of development. Within machine game playing, imperfect-information games are one of the difficulties and focal points of research. Because the players in an imperfect-information game cannot obtain all of the information, they cannot accurately predict which countermeasures an opponent will take. This resembles real-world situations such as commercial competition and military conflict, so research on such games has strong reference value for building decision support systems.

Summary of the invention

To solve the problems in the prior art, the invention provides a regret-minimization evaluation method that balances risk and return in imperfect-information games, comprising the following steps:

Step 1: for each information set, initialize its strategy, its valuation, and the regret value of each action;

Step 2: play the game with the current strategy until the game is completed;

Step 3: compute the valuation and the regret value of each action on every information set visited during this game;

Step 4: compute a new strategy with the regret matching algorithm;

Step 5: compute the risk value of the new strategy, weigh return against risk, and select the strategy to use in the next round;

Step 6: return to step 2 until the game process ends.

The beneficial effects of the invention are as follows:

The invention borrows the concept of risk from economics, studies the principle of the risk model, and combines it with the regret minimization algorithm for use in imperfect-information machine game playing. While regret minimization yields a return-first strategy, the risk of the strategy is also taken into account, so that a more reasonable Nash equilibrium is reached.

Brief description of the drawings

Fig. 1 is a flow chart of the present invention;

Fig. 2 shows the imperfect-information game process;

Fig. 3 is a schematic diagram of the type I and type II risk losses in the risk model.

Detailed description of the invention

The present invention is further described below with reference to the accompanying drawings.

The model of an imperfect-information game and the basic concepts of the risk model are introduced first.

An imperfect-information extensive-form game is a six-tuple ⟨N, H, P, f_c, {L_i}_{i=1,2,...,N}, {u_i}_{i=1,2,...,N}⟩, where N is the finite set of players and H is a finite set of action sequences. The empty sequence ∅ and every prefix of an action sequence in H are also elements of H. The set of terminal sequences Z ⊆ H consists of the sequences that are not a prefix of any other sequence in H. For a non-terminal sequence h ∈ H, A(h) = {a : ha ∈ H} is the set of actions that can be taken after h. The function P assigns to each non-terminal sequence a member of N ∪ {c}, where c denotes chance: P(h) is the player whose turn it is after h, and if P(h) = c, chance determines the action after h. For player i ∈ N, L_i is the information partition of {h ∈ H : P(h) = i}. The elements of an information partition are called information sets; each information set is a subset of H and represents action sequences that the player cannot tell apart. For an information set I with P(I) = c, the function f_c gives the probability f_c(a|I) with which each action a in A(h) occurs. For player i ∈ N, u_i : Z → R is the utility function, which gives the payoff at each terminal sequence.

A strategy σ_i of player i assigns to each information set I_i ∈ L_i a probability distribution σ_i(I_i) : A(I_i) → [0,1] over the action set A(I_i). The strategy space of player i is denoted Σ_i. A strategy profile contains the strategies of all players and is written σ = (σ₁, σ₂, ..., σ_N). σ_-i denotes the profile made up of the strategies of all players other than i.

Given a strategy profile σ (that is, all players choose their actions according to σ), let π^σ(h) denote the probability that action sequence h occurs. π^σ(h) factors into the contributions of the individual players and of chance to the occurrence of h:

$π^{σ} (h) = \prod_{i \in N \cup \{c\}} π_{i}^{σ} (h)$

Similarly, π_{-i}^σ(h) is the product of the contributions of everyone other than player i, including chance. For two action sequences h and h', let π^σ(h, h') be the transition probability from h to h' under σ: if h is a prefix of h', then π^σ(h, h') = π^σ(h')/π^σ(h); otherwise π^σ(h, h') = 0. The quantities π_i^σ(h, h') and π_{-i}^σ(h, h') are defined analogously.
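As a minimal illustration of this factorization, the following Python sketch computes the probability of an action sequence as a product of per-actor contributions, and the transition probability from a sequence h to an extension h'. The history encoding (a list of `(actor, probability)` pairs) and the function names are assumptions made for this example only; they do not appear in the invention.

```python
from math import prod

def reach_probability(history, only=None, exclude=None):
    """Product of the action probabilities along a history.

    history: list of (actor, prob) pairs, where prob is the probability
             with which that actor chose the action under the profile.
    only:    if given, count only this actor's choices (pi_i).
    exclude: if given, skip this actor's choices (pi_{-i}).
    """
    return prod(
        p for actor, p in history
        if (only is None or actor == only)
        and (exclude is None or actor != exclude)
    )

def transition_probability(h, h_prime):
    """Probability of reaching h_prime from h; 0 if h is not a prefix of h_prime."""
    if h_prime[:len(h)] != h:
        return 0.0
    # Only the actions taken after h contribute.
    return reach_probability(h_prime[len(h):])

# Hypothetical history: chance acts (0.5), player 1 acts (0.4), player 2 acts (0.25).
h = [("c", 0.5), (1, 0.4)]
h_prime = h + [(2, 0.25)]
```

Here `reach_probability(h_prime, exclude=1)` plays the role of π_{-1}^σ(h'), the contribution of everyone except player 1.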

The set W in Fig. 2 is the set of all possible situations of the imperfect-information game environment I. Each element w_i of W represents one possible perfect-information state of I, and the true state of I is some w_i in W. Here the concept of a world is introduced: a world is one possible state of the imperfect-information game. W is the set of worlds for the current game state, and S is a sample set of W, S ⊆ W. The basic procedure of perfect-information Monte Carlo sampling is to draw a random subset S of W, solve each perfect-information world s_i in S, statistically analyze the optimal solutions m_i of the s_i, and finally select the final optimal strategy sequence from M, the set of legal strategy sequences.
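The sampling procedure just described can be sketched as follows. This is an illustration under stated assumptions: `solve` is a hypothetical callback that returns the optimal strategy sequence for one perfect-information world, and the final choice is made by majority vote, which is only one possible reading of the "statistical analysis" step.

```python
import random
from collections import Counter

def perfect_information_monte_carlo(worlds, solve, sample_size, rng=None):
    """Pick a strategy sequence by solving a random sample of worlds.

    worlds:      list of candidate worlds (the set W).
    solve:       hypothetical solver, world -> best strategy sequence m_i.
    sample_size: number of worlds to draw (the size of S).
    """
    rng = rng or random.Random()
    sample = rng.sample(worlds, sample_size)         # S, a random subset of W
    best_moves = [solve(world) for world in sample]  # m_i for each s_i
    # Assumption: the most frequent optimal solution is selected.
    return Counter(best_moves).most_common(1)[0][0]
```

For instance, with worlds `list(range(10))` and a toy solver that answers `"fold"` only for worlds divisible by 5, sampling all ten worlds selects `"raise"`.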

The uncertainty in strategy selection algorithms for machine game playing is attributed to the following two types of risk loss.

Type I risk loss and its calculation:

The risk loss caused by the inaccuracy of the evaluation function's valuation of a world is called type I risk loss. Assume that the optimal strategy sequence of world w is m. The type I risk loss of m is then calculated as:

$L_{w I} = E_{w}^{m} - E_{I}^{m} \qquad (1)$

In the formula above, E_w^m is the evaluation function's estimate of the return from taking strategy sequence m in world w, and E_I^m is the estimate of the return from taking strategy sequence m in the real world.

Type II risk loss and its calculation:

The risk loss caused by an inaccurate judgment of the opponent's optimal strategy is called type II risk loss. The type II risk loss of strategy sequence m is calculated as:

$L_{m I I} = E_{I}^{m} - E_{I}^{m'} \qquad (2)$

E_I^m is the evaluation function's estimate of the return from taking strategy sequence m in the real world I, and E_I^{m'} is the estimated return of the strategy sequence m' actually played by both sides in the real world I.

Fig. 3 illustrates the difference between type I and type II risk loss. The difference between the evaluation function's estimates of the expected return of strategy sequence m in world w and in the real world I is the type I risk loss, shown as L_wI in the figure. The difference in expected return between strategy sequence m and the actual strategy sequence m' in the real world I is the type II risk loss, shown as L_mII in the figure. The risk loss of using strategy sequence m in world w is therefore defined as

$L_{w m} = L_{w I} + L_{m I I} \qquad (3)$
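A small numeric sketch may help here. It assumes the sign convention that appears inside formula (8) of the claims, where the type I loss is the estimate in world w minus the estimate in the real world I, and the type II loss is the estimate for m minus the estimate for the actually played m'. The function and argument names are illustrative only.

```python
def type_one_loss(estimate_in_world, estimate_in_real_world):
    """Type I loss: error of the evaluation function about the world."""
    return estimate_in_world - estimate_in_real_world

def type_two_loss(estimate_planned, estimate_actually_played):
    """Type II loss: error in predicting the strategy sequence actually played."""
    return estimate_planned - estimate_actually_played

def total_risk_loss(e_w_m, e_i_m, e_i_m_actual):
    """Formula (3): sum of the type I and type II losses."""
    return type_one_loss(e_w_m, e_i_m) + type_two_loss(e_i_m, e_i_m_actual)
```

With estimates 5.0 (world w), 3.0 (real world, sequence m) and 2.0 (real world, sequence m'), the two losses are 2.0 and 1.0. The real-world estimate for m cancels in the sum, which is why formula (8) reduces to a difference between the estimate in w and the estimate for m'.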

Each step of the invention is described in detail below with reference to Fig. 1. The basic steps are:

Step 1: Initialization. For player i ∈ N and each of its information sets I ∈ L_i, set the valuation of the strategy v(I, σ) = 0; for each a ∈ A(I), set the regret value r(I, a) = 0; and initialize the strategy to δ_i(I, a) = 1/|A(I)|, so that all actions are initially equally probable and their probabilities sum to 1.

Step 2: The players act in turn according to their own strategies until the game ends, and the result of each game is recorded.

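Step 3: Compute the valuation and the regret value of each action on every information set visited during this game.

The valuation at information set I:

$v_{i} (σ, I) = \sum_{z \in Z_{I}} u_{i} (z) π_{- i}^{σ} (z [I]) π^{σ} (z [I], z) \qquad (4)$

The regret value of not having taken action a at information set I:

$R_{i}^{T} (I, a) = \frac{1}{T} \sum_{t = 1}^{T} (v_{i} (I, σ_{(I \rightarrow a)}^{t}) - v_{i} (I, σ^{t})) \qquad (5)$

Here z is a terminal sequence in Z_I, the set of terminal sequences passing through I; u_i(z) is the actual utility value obtained on reaching the terminal state of the game; z[I] is the prefix of z that lies in information set I; π_{-i}^σ(z[I]) is the probability that all opponents of player i reach z[I]; π^σ(z[I], z) is the transition probability of all players from the history z[I] to z; and σ^t_{(I→a)} is a strategy profile identical to σ^t except that action a is always chosen at information set I. Formula (5) gives player i's average regret for action a over T iterations.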
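Formulas (4) and (5), as stated in the claims, can be sketched in Python as below. The input layout is an assumption for illustration: each terminal sequence through I is passed as a tuple of its utility, the opponents' reach probability of z[I], and the transition probability from z[I] to z, and the per-iteration valuations are passed as plain lists.

```python
def information_set_value(terminals):
    """Formula (4): sum of u_i(z) * pi_{-i}(z[I]) * pi(z[I], z) over z in Z_I.

    terminals: iterable of (utility, opponent_reach, transition) tuples,
               one per terminal sequence passing through the information set.
    """
    return sum(u * reach * trans for u, reach, trans in terminals)

def average_regret(values_always_a, values_current):
    """Formula (5): mean gain from always playing action a at I.

    values_always_a: v_i(I, sigma^t with a forced at I), for t = 1..T.
    values_current:  v_i(I, sigma^t), for t = 1..T.
    """
    T = len(values_current)
    return sum(va - v for va, v in zip(values_always_a, values_current)) / T
```

A positive result means the player would have done better, on average, by always choosing a at this information set.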

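Step 4: Based on the valuations obtained in the previous step for each visited information set, the regret matching algorithm reassigns a probability to each action at each information set, giving a new strategy. Compared with directly taking the action with the largest regret, this has the advantage that the opponent cannot run the same regret computation to infer our strategy. The result is a return-first strategy.

For information set I, regret matching gives the return-first strategy for the next step:

$σ_{i}^{T + 1} (I, a) = \begin{cases} \frac{R_{i}^{T, +} (I, a)}{\sum_{a' \in A (I)} R_{i}^{T, +} (I, a')}, & \text{if } \sum_{a' \in A (I)} R_{i}^{T, +} (I, a') > 0 \\ \frac{1}{| A (I) |}, & \text{otherwise} \end{cases} \qquad (6)$

where R_i^{T,+}(I, a) = max(R_i^T(I, a), 0). When the cumulative regret is positive, each regret is normalized by the total regret and the strategy is updated proportionally; otherwise the strategy for the next iteration is the initial uniform strategy. σ_i^{T+1}(I, a) is the probability with which player i takes action a at information set I in the next round (round T+1).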
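The regret matching rule can be written in a few lines of Python. This is a minimal sketch of the standard rule (positive regrets are normalized; if no regret is positive the uniform strategy is used); the function name and the list-based representation are assumptions for illustration.

```python
def regret_matching(cumulative_regret):
    """Turn the regrets of the actions at one information set into a strategy.

    cumulative_regret: list with one regret value per action.
    Returns a list of action probabilities summing to 1.
    """
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    if total > 0:
        # Play each action in proportion to its positive regret.
        return [r / total for r in positive]
    # No action is regretted: fall back to the uniform strategy.
    n = len(cumulative_regret)
    return [1.0 / n] * n
```

For regrets `[3.0, -1.0, 1.0]` the new strategy is `[0.75, 0.0, 0.25]`: the action with negative regret is never played, and the other two are weighted 3 to 1.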

Step 5: Compute the risk value of the new strategy, weigh return against risk, and select the strategy to use in the next round.

The influence of risk on the payoff is considered first.

In view of the characteristics of imperfect-information machine game playing, a method for approximating the risk loss is proposed. Its basic idea is to compute the mean of the estimated returns over the sample set S and use it in place of the true return of I in the world set W.

Assume that the player's world set for the current state is W, with n elements; that the sample set of W is S, with t elements; and that M is the set of all legal strategy sequences of W, with k elements. The average return is computed first.

Definition: $\overline{E_{s}}$ is the average return over the sample set S, computed as:

$\overline{E_{s}} = \frac{1}{t k} \sum_{i = 1}^{t} \sum_{j = 1}^{k} E_{i}^{j}, (i \in S, j \in M) \qquad (7)$

Based on formula (7), the combined risk loss of strategy sequence σ is approximated as:

$\begin{aligned} L_{W σ} &= \frac{1}{n} \sqrt{\sum_{i = 1}^{n} {L_{w_{i} σ}}^{2}} = \frac{1}{n} \sqrt{\sum_{i = 1}^{n} {(L_{w_{i} I} + L_{σ I I})}^{2}} \\ &= \frac{1}{n} \sqrt{\sum_{i = 1}^{n} {(E_{w_{i}}^{σ} - E_{I}^{σ} + E_{I}^{σ} - E_{I}^{σ'})}^{2}} \\ &= \frac{1}{n} \sqrt{\sum_{i = 1}^{n} {(E_{w_{i}}^{σ} - E_{I}^{σ'})}^{2}} \\ &\approx \frac{1}{t} \sqrt{\sum_{i = 1}^{t} {(E_{w_{i}}^{σ} - \overline{E_{s}})}^{2}}, (w_{i} \in S) \end{aligned} \qquad (8)$

In formula (8), the approximately-equal sign marks the step at which $\overline{E_{s}}$ and the sample set S are used for the approximation.

With the method above, the risk value of the new strategy can be computed.
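Formulas (7) and (8), as stated in the claims, translate into the short Python sketch below. The table layout (`estimates[i][j]` holding the estimated return of strategy sequence j in sampled world i) and the function names are assumptions for this example.

```python
from math import sqrt

def average_yield(estimates):
    """Formula (7): mean estimated return over t sampled worlds and k sequences.

    estimates[i][j]: estimated return E_i^j of strategy sequence j in world i.
    """
    t = len(estimates)
    k = len(estimates[0])
    return sum(sum(row) for row in estimates) / (t * k)

def approximate_risk_loss(strategy_estimates, mean_yield):
    """Last line of formula (8): (1/t) * sqrt(sum of squared deviations).

    strategy_estimates: estimated return of one strategy in each sampled world.
    mean_yield:         the average return from formula (7).
    """
    t = len(strategy_estimates)
    return sqrt(sum((e - mean_yield) ** 2 for e in strategy_estimates)) / t
```

For example, a strategy whose estimates in two sampled worlds are 7.0 and 8.0, against an average return of 4.0, has an approximate risk loss of sqrt(9 + 16) / 2 = 2.5.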

Next, the relation between return and risk is considered.

Assume there are two strategies, A and B. E_A and E_B are the player's expected returns for A and B, and L_A and L_B are the risk losses of A and B. The rules for judging which of A and B is better are as follows:

1: If u_A - L_A > u_B, then A is better than B; conversely, if u_B - L_B > u_A, then B is better than A. Here u_A and u_B are the actual return values of strategies A and B.

2: Otherwise, compute:

$R = \log [\frac{E_{A} - (E_{B} - L_{B})}{E_{B} - (E_{A} - L_{A})}] \qquad (9)$

If R > 0, A is better than B; if R < 0, B is better than A; if R = 0, A and B are equally good and the system may choose between them at random.

With the method above, the current player's old and new strategies can be ranked. The top-ranked strategy is taken as the current strategy balancing risk and return, that is, the player's optimal strategy.
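The two comparison rules can be sketched as one Python function. Two assumptions are made and should be read as such: the same return value is used for u and E (the text uses u in rule 1 and E in formula (9)), and the degenerate cases in which the logarithm in formula (9) is undefined are resolved by taking the limit, which the source does not discuss.

```python
from math import log

def compare_strategies(e_a, l_a, e_b, l_b):
    """Return "A", "B" or "tie" for two strategies.

    e_a, e_b: returns of strategies A and B (used for both u and E).
    l_a, l_b: risk losses of strategies A and B.
    """
    # Rule 1: one strategy wins even after subtracting its own risk loss.
    if e_a - l_a > e_b:
        return "A"
    if e_b - l_b > e_a:
        return "B"
    # Rule 2: formula (9). Both terms are >= 0 once rule 1 has failed.
    numerator = e_a - (e_b - l_b)
    denominator = e_b - (e_a - l_a)
    if numerator == 0 or denominator == 0:
        # Assumption: treat the undefined logarithm by its limit.
        if numerator == denominator:
            return "tie"
        return "A" if denominator == 0 else "B"
    r = log(numerator / denominator)
    if r > 0:
        return "A"
    if r < 0:
        return "B"
    return "tie"  # the caller may then choose at random
```

With returns 6 and 5 and risk losses 3 and 2, rule 1 decides nothing (3 is not above 5, and 3 is not above 6), and formula (9) gives R = log(3/2) > 0, so A is preferred.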

Step 6: Determine whether the whole game process has ended; if not, return to step 2 and continue.
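To show how steps 1 to 6 fit together, the sketch below runs the loop on a single information set of a matrix game. It is a simplified illustration, not the method of the invention: the opponent's mixed strategy is fixed, expected payoffs replace sampled play, and step 5 is represented by a hypothetical `select` hook that defaults to accepting the new strategy. All names are assumptions.

```python
def regret_matching(cumulative_regret):
    """Normalize positive regrets; use the uniform strategy if none is positive."""
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    if total > 0:
        return [r / total for r in positive]
    return [1.0 / len(positive)] * len(positive)

def run_rounds(payoff, opponent, rounds, select=None):
    """Repeat steps 2-5 for a number of rounds and return the final strategy.

    payoff[a][b]: our payoff when we play a and the opponent plays b.
    opponent:     fixed opponent mixed strategy (a simplification).
    select:       hypothetical step-5 hook, (old, new) -> strategy to use next.
    """
    n = len(payoff)
    regret = [0.0] * n            # step 1: zero regret for every action
    strategy = [1.0 / n] * n      # step 1: uniform initial strategy
    for _ in range(rounds):       # step 6: loop until the process ends
        # Steps 2-3: value of each action and of the current strategy.
        action_value = [sum(p * q for p, q in zip(row, opponent)) for row in payoff]
        value = sum(s * v for s, v in zip(strategy, action_value))
        for a in range(n):
            regret[a] += action_value[a] - value
        candidate = regret_matching(regret)                  # step 4
        # Step 5: risk-aware choice between the old and the new strategy.
        strategy = select(strategy, candidate) if select else candidate
    return strategy

# Rock-paper-scissors payoffs (rows: our rock, paper, scissors).
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
```

Against an opponent who always plays rock, `run_rounds(RPS, [1.0, 0.0, 0.0], 5)` settles on always playing paper. A risk-aware `select` hook could compare the old and new strategies using their returns and risk losses before switching.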

The above is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific implementation of the invention shall not be regarded as limited to these descriptions. A person of ordinary skill in the art to which the invention belongs may make several simple deductions or substitutions without departing from the concept of the invention, and all of these shall be regarded as falling within the protection scope of the invention.

Claims

1. A regret-minimization evaluation method balancing risk and return in imperfect-information games, characterized in that:

it comprises the following steps:

Step 1: for each information set, initialize its strategy, its valuation, and the regret value of each action;

Step 2: play the game with the current strategy until the game is completed;

Step 3: compute the valuation and the regret value of each action on every information set visited during this game;

Step 4: compute a new strategy with the regret matching algorithm;

Step 5: compute the risk value of the new strategy, weigh return against risk, and select the strategy to use in the next round;

Step 6: return to step 2 until the game process ends.

2. The regret-minimization evaluation method balancing risk and return in imperfect-information games according to claim 1, characterized in that: in step 1, the initialization is as follows: for player i ∈ N and each of its information sets I ∈ L_i, the valuation of the strategy is v(I, σ) = 0; for each a ∈ A(I), the regret value of action a at information set I is r(I, a) = 0; and the strategy is initialized to δ_i(I, a) = 1/|A(I)|, meaning that initially all actions are equally probable and their probabilities sum to 1; wherein N is the finite set of players, L_i is the information partition of player i, I is an information set, σ is the strategy profile, and a is an action.

3. The regret-minimization evaluation method balancing risk and return in imperfect-information games according to claim 2, characterized in that: in step 3, the valuation at information set I is:

$v_{i} (σ, I) = \sum_{z \in Z_{I}} u_{i} (z) π_{- i}^{σ} (z [I]) π^{σ} (z [I], z) \qquad (4)$

The regret value of not having taken action a at information set I is:

$R_{i}^{T} (I, a) = \frac{1}{T} \sum_{t = 1}^{T} (v_{i} (I, σ_{(I \rightarrow a)}^{t}) - v_{i} (I, σ^{t})) \qquad (5)$

wherein z is a terminal sequence in Z_I, the set of terminal sequences passing through I; u_i(z) is the actual utility value obtained on reaching the terminal state of the game; z[I] is the prefix of the terminal sequence z that lies in information set I; π_{-i}^σ(z[I]) is the probability that all opponents of player i reach z[I]; π^σ(z[I], z) is the transition probability of all players from the history z[I] to z; and σ^t_{(I→a)} is a strategy profile identical to σ^t except that action a is always chosen at information set I. Formula (5) computes player i's average regret for action a over T iterations.

4. The regret-minimization evaluation method balancing risk and return in imperfect-information games according to claim 3, characterized in that: in step 4, based on the valuations obtained in the previous step for each visited information set, the regret matching algorithm reassigns a probability to each action at each information set, giving a new strategy and thereby a return-first strategy; for information set I, regret matching gives the return-first strategy for the next step:

$σ_{i}^{T + 1} (I, a) = \begin{cases} \frac{R_{i}^{T, +} (I, a)}{\sum_{a' \in A (I)} R_{i}^{T, +} (I, a')}, & \text{if } \sum_{a' \in A (I)} R_{i}^{T, +} (I, a') > 0 \\ \frac{1}{| A (I) |}, & \text{otherwise} \end{cases} \qquad (6)$

wherein R_i^{T,+}(I, a) = max(R_i^T(I, a), 0), and the formula means: when the cumulative regret is positive, each regret is normalized by the total regret and the strategy is updated proportionally; otherwise the strategy for the next iteration is the initial uniform strategy; R denotes the regret accumulated over T rounds, a denotes an action, I denotes an information set, and σ_i^{T+1}(I, a) is the probability with which player i takes action a at information set I in the next round (round T+1).

5. The regret-minimization evaluation method balancing risk and return in imperfect-information games according to claim 4, characterized in that: in step 5, in view of the characteristics of imperfect-information machine game playing, a method for approximating the risk loss is proposed, whose basic idea is to compute the mean of the estimated returns over the sample set S and use it in place of the true return of I in the world set W;

Assume that the player's world set for the current state is W, with n elements; that the sample set of W is S, with t elements; and that M is the set of all legal strategy sequences of W, with k elements; the average return is computed first:

Definition: $\overline{E_{s}}$ is the average return over the sample set S, computed as:

$\overline{E_{s}} = \frac{1}{t k} \sum_{i = 1}^{t} \sum_{j = 1}^{k} E_{i}^{j}, (i \in S, j \in M) \qquad (7)$

Based on formula (7), the combined risk loss of strategy sequence σ is approximated as:

$\begin{aligned} L_{W σ} &= \frac{1}{n} \sqrt{\sum_{i = 1}^{n} {L_{w_{i} σ}}^{2}} = \frac{1}{n} \sqrt{\sum_{i = 1}^{n} {(L_{w_{i} I} + L_{σ I I})}^{2}} \\ &= \frac{1}{n} \sqrt{\sum_{i = 1}^{n} {(E_{w_{i}}^{σ} - E_{I}^{σ} + E_{I}^{σ} - E_{I}^{σ'})}^{2}} \\ &= \frac{1}{n} \sqrt{\sum_{i = 1}^{n} {(E_{w_{i}}^{σ} - E_{I}^{σ'})}^{2}} \\ &\approx \frac{1}{t} \sqrt{\sum_{i = 1}^{t} {(E_{w_{i}}^{σ} - \overline{E_{s}})}^{2}}, (w_{i} \in S) \end{aligned} \qquad (8)$

In formula (8), the approximately-equal sign marks the step at which $\overline{E_{s}}$ and the sample set S are used for the approximation; with the method above, the risk value of the new strategy is computed;

Next, the relation between return and risk is considered,

Assume there are two strategies, A and B; E_A and E_B are the player's expected returns for A and B, and L_A and L_B are the risk losses of A and B; the rules for judging which of A and B is better are as follows:

1: If u_A - L_A > u_B, then A is better than B; conversely, if u_B - L_B > u_A, then B is better than A;

2: Otherwise, compute:

$R = \log [\frac{E_{A} - (E_{B} - L_{B})}{E_{B} - (E_{A} - L_{A})}] \qquad (9)$

If R > 0, A is better than B; if R < 0, B is better than A; if R = 0, A and B are equally good and the system may choose between them at random;

With the method above, the current player's old and new strategies are ranked, and the top-ranked strategy is taken as the current strategy balancing risk and return, that is, the player's optimal strategy; wherein R denotes the risk comparison value, L_A and L_B denote the risk losses of strategies A and B, and u_A and u_B denote the actual return values of strategies A and B.