CN115965086A - Non-perfect information game strategy enhancement method based on small sample opponent modeling - Google Patents

Non-perfect information game strategy enhancement method based on small sample opponent modeling

Info

Publication number
CN115965086A
Authority
CN
China
Prior art keywords
strategy
game
style
opponent
adversary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211605230.1A
Other languages
Chinese (zh)
Inventor
王骄
王诗佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN202211605230.1A priority Critical patent/CN115965086A/en
Publication of CN115965086A publication Critical patent/CN115965086A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a non-perfect information game strategy enhancement method based on small sample opponent modeling, belonging to the technical field of non-perfect information machine games. First, an opponent strategy style recognition network is trained with a small-sample (few-shot) learning method; the weaknesses of the opponent strategy are then mined, and a more robust strategy-update scheme dynamically adjusts the counter-strategy in real time, so that strategy profit is improved as much as possible while strategy safety is guaranteed. Compared with a Nash equilibrium strategy, the technical scheme of the invention significantly improves game profit; compared with an over-confident opponent-modeling strategy, it significantly reduces the risk of being exploited by the opponent. The method is suited to real game scenarios, reduces the dependence on data volume, improves the robustness of the strategy, and meets application requirements.

Description

Imperfect information game strategy enhancement method based on small sample opponent modeling
Technical Field
The invention belongs to the technical field of non-perfect information machine games, and particularly relates to a non-perfect information game strategy enhancement method based on small sample opponent modeling.
Background
Machine game research divides game problems into perfect information games and non-perfect (imperfect) information games according to whether all participants can observe complete and accurate situation information. Non-perfect information games are closer to real application scenarios, but their large state spaces, missing information, and uncertain decision payoffs make it difficult for traditional game-tree-based strategy search algorithms to solve for optimal strategies, so research on non-perfect information games lags behind that on perfect information games.
Opponent modeling is an important research direction in machine gaming. Its core idea is to extract features from the opponent's historical behavior and analyze the opponent's individual behavioral tendencies and strategic weaknesses. Existing theory and practice show that, when playing against a non-rational opponent with behavioral characteristics or strategy biases, discovering and exploiting the opponent's hidden characteristics and strategy loopholes helps an agent improve its game profit and win rate. Current research directions in opponent modeling include modeling of opponent actions, opponent types, opponent preferences, and opponent beliefs. As intelligent game scenarios become richer and more mature, opponent modeling technology is being studied in greater depth and is developing towards characteristics closer to real game problems. Chinese patent CN202111316717.3 discloses an implicit opponent modeling method based on deep reinforcement learning, in which the opponent strategy is represented as a feature vector fed into a decision network and the playing level is improved through end-to-end training. Chinese patent CN201610835289.8 discloses an explicit opponent modeling method for incomplete information games, which induces the opponent's action preferences with a statistical learning method.
The implicit opponent modeling method based on deep reinforcement learning disclosed in Chinese patent CN202111316717.3 represents the opponent strategy as a feature vector used as input to a decision network. However, in many real game confrontation environments the agent encounters unknown opponents for which no prior knowledge is available in advance. The biggest limitation preventing intelligent game technology from being applied in practice is the scarcity of real samples: traditional neural network learning or probabilistic reasoning algorithms can hardly learn an accurate and effective model, because such methods need to extract empirical knowledge from large amounts of data.
Although the opponent modeling method for incomplete information games disclosed in Chinese patent CN201610835289.8 adopts explicit opponent modeling, learning the opponent's action preferences by statistics still requires large amounts of data and struggles against unknown opponents. Furthermore, strategy search based on opponent modeling is a double-edged sword: while it can increase strategy profit, it also increases the risk of being exploited. Game problems with complex state-action spaces can hardly avoid model errors, especially when data are scarce, and such model errors can cause serious losses for the searched counter-strategy. Therefore, how to improve game profit as much as possible while guaranteeing strategy safety is the key to improving the robustness of decision algorithms based on opponent models.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a non-perfect information game strategy enhancement method based on small sample opponent modeling. A small-sample learning method is used to train an opponent style recognition model, the weaknesses of the opponent strategy are mined, and a more robust strategy-update scheme dynamically adjusts the counter-strategy in real time, so that strategy profit is improved as much as possible while strategy safety is guaranteed.
A non-perfect information game strategy enhancement method based on small sample opponent modeling specifically comprises the following steps:
step 1: training an adversary strategy style recognition network based on a small sample learning method;
step 1.1: first, build a database of game trajectories representing different strategy styles; each trajectory sample records the state-action sequence of one game round from the opening to time t, τ_t = {s_0, a_1, …, a_{t-1}, s_t}, and this sequence is encoded; here s denotes a game state, a denotes an action, and the subscripts correspond to different times;
the strategy's exploitability index ε_i(σ_i) is used as the type label of a strategy style; exploitability measures the average payoff gap of a strategy relative to the Nash equilibrium strategy and is expressed as:
ε_i(σ_i) = u_i(σ*) - u_i(σ_i, σ*_{-i})
where u_i(σ*) denotes the payoff of player i when all players adopt the Nash equilibrium strategy σ*, and u_i(σ_i, σ*_{-i}) denotes the payoff of player i when player i plays σ_i while the other players keep the Nash equilibrium strategy; considering that an opponent will not adopt an excessively poor strategy and that the styles must remain distinguishable, x (x > 0) exploitability values are chosen as style labels; to match the sample-scarce character of practical problems, only m (0 < m << |S|^|A|) samples are collected for each strategy style, where |S| denotes the state space size and |A| denotes the action space size;
the strategy is first iteratively updated using the MCCFR algorithm starting from a random initialization strategy, a per iteration>Fixing the current strategy for 0 time to calculate the availability, and stopping strategy updating when a preset numerical value is reached; then, randomly initializing b, b the game environment>0 time, the strategy is used each time from the opening game to the ending, and the complete game state action sequence { s } is recorded 0 ,a 1 ,…,a T ,s T Wherein T represents the end time of the game;
Finally, because the game trajectories differ in length, they cannot be fed into the temporal network in batches, so all samples must be unified to one length; to preserve the integrity of the sample data, every sample is zero-padded up to the length of the longest sample, T_max, which denotes the sequence length of the longest temporal sample; this yields a data set with x classes of m samples each;
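As a concrete illustration of the zero-padding step above, the following minimal sketch (Python/NumPy; all names are illustrative, not from the patent) pads a list of encoded trajectories to the length of the longest one:

```python
import numpy as np

def pad_trajectories(trajectories, feature_dim):
    """Zero-pad variable-length game trajectories to the length of the longest one.

    trajectories: list of arrays, each of shape (T_i, feature_dim), where row j
                  encodes the state/action information of time step j.
    Returns an array of shape (num_samples, T_max, feature_dim).
    """
    t_max = max(traj.shape[0] for traj in trajectories)
    batch = np.zeros((len(trajectories), t_max, feature_dim), dtype=np.float32)
    for i, traj in enumerate(trajectories):
        batch[i, :traj.shape[0], :] = traj  # shorter samples keep trailing zeros
    return batch
```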
step 1.2: establishing an adversary strategy style identification network; the adversary strategy style recognition network comprises two sub-modules, namely a double-attention prototype feature extraction module and a distance scaling module;
the double-attention prototype feature extraction module extracts features of input samples collected by strategies from two dimensions of time and space, and the sub-module consists of an inner attention time sequence encoder and an outer attention encoder, wherein the inner attention time sequence encoder outputs time sequence features of each sample, and the outer attention encoder extracts distinguishable features of the samples of the type in the whole space from other types after receiving the time sequence features of all the samples of each type and outputs prototype feature vectors of each strategy style; the distance scaling module assigns different weights z to the features of each dimension of the prototype vector in consideration of the influence of each dimension of the strategy style prototype feature on the calculation of the class distance i
The opponent strategy style recognition network outputs the probability distribution that the current input sample belongs to each preset strategy style in the database, computed as:
p(c = i | τ) = exp(-d(f(τ), Ψ_i)) / Σ_j exp(-d(f(τ), Ψ_j))
where f(τ) is the feature vector of the sample to be identified after the inner-attention temporal encoder, Ψ_i is the class prototype vector of strategy style i produced by the double-attention prototype feature extraction module, d(·,·) is the distance function (the Euclidean distance is used), and c denotes the strategy class. The opponent strategy style recognition network is structured as follows: the inner-attention temporal encoder consists of a fully connected layer, a batch normalization layer, and a ReLU activation layer, followed by an LSTM temporal layer and finally a soft-attention module; the outer-attention encoder consists of a self-attention module. For each class in the support set, the prototype vector output by the double-attention prototype feature extraction module is concatenated with the vector of the query sample after the inner-attention encoder, and the concatenated vectors of all classes are stacked into a two-dimensional input to the distance scaling module. The distance scaling module consists of two convolutional layers, each followed by a batch normalization layer and a ReLU activation layer.
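The classification rule above has the form of the standard prototypical-network softmax over negative distances. A minimal PyTorch sketch is given below (illustrative names; the per-dimension weights z_i from the distance scaling module are assumed to be available as an optional vector):

```python
import torch

def style_probabilities(query_feat, prototypes, scale=None):
    """Probability that a query trajectory belongs to each strategy style.

    query_feat: (d,) feature vector f(tau) from the inner-attention temporal encoder.
    prototypes: (num_styles, d) prototype vectors Psi_i from the double-attention module.
    scale:      optional (d,) per-dimension weights z_i from the distance scaling module.
    """
    diff = prototypes - query_feat.unsqueeze(0)   # (num_styles, d)
    if scale is not None:
        diff = diff * scale                       # weight each feature dimension
    dist = torch.norm(diff, dim=1)                # Euclidean distance d(f(tau), Psi_i)
    return torch.softmax(-dist, dim=0)            # p(c = i | tau) proportional to exp(-dist_i)
```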
step 1.3: train the opponent strategy style recognition network; randomly draw n classes from the strategy style library, sample k samples from each of them to form a support set S, and sample k samples from the remaining samples to form a query set Q; the support set S and the query set Q are fed as the input data of one training episode into the opponent strategy style recognition network of step 1.2, yielding its output vector P;
the data of each training batch is sampled twice: the first sampling is over classes and the second over samples, producing a support set and a query set that share the same classes but contain different samples, each of size n × k; the next batch of training data goes through these two samplings again;
step 1.4: compute the cross-entropy loss:
L = -log p(c = c* | τ)
where c* is the true class of sample τ, i.e., the sample label; the parameters of the opponent strategy style recognition network are updated with the adaptive moment estimation (Adam) optimization algorithm, and steps 1.3 and 1.4 are repeated until the network converges;
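Steps 1.3 and 1.4 together amount to a standard episodic training loop with the loss L = -log p(c = c* | τ) and the Adam optimizer. A hedged sketch, assuming a `recognizer` module that maps a (support, query) pair of tensors to per-style probabilities and an `episode_sampler` that returns torch tensors (labels as integer class indices); both names are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def train_recognizer(recognizer, episode_sampler, steps=10000, lr=1e-3):
    """Episodic training of the opponent strategy style recognition network (illustrative)."""
    optimizer = torch.optim.Adam(recognizer.parameters(), lr=lr)  # adaptive moment estimation
    for _ in range(steps):
        support, query, labels = episode_sampler()          # one n-way, k-shot episode
        probs = recognizer(support, query)                   # (num_query, n_way) style probabilities
        loss = F.nll_loss(torch.log(probs + 1e-9), labels)   # cross entropy: -log p(c = c* | tau)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```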
step 2: during the actual game, invoke the opponent strategy style recognition network trained to convergence in step 1 to identify the opponent strategy style c, together with the confidence α of the recognition result, from the decision trajectory of the current round and the other visible game information, so as to help the agent adjust its strategy flexibly and dynamically when making its own decisions;
step 2.1: convert the known game information from the opening of the current round up to the current time t, namely the game state sequence and the strategy action sequences of all players, τ_t = {s_0, a_1, …, a_{t-1}, s_t}, into an input vector using the encoding of step 1.1;
step 2.2: feed the vectorized known game information into the opponent strategy style recognition network trained to convergence and obtain the classification vector p = (p_1, …, p_x); select the most probable class as the prediction c* of the current opponent strategy style and save its predicted probability as the confidence of the neural network prediction, α = p_{c*};
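Step 2.2 reduces to taking the arg-max of the network output and keeping its probability as the confidence; a small sketch under the same assumed `recognizer` interface as in the training sketch above:

```python
import torch

def identify_opponent_style(recognizer, support, encoded_history):
    """Return the predicted opponent style c* and the confidence alpha of that prediction."""
    with torch.no_grad():
        probs = recognizer(support, encoded_history.unsqueeze(0)).squeeze(0)  # p over preset styles
    c_star = int(torch.argmax(probs))   # most probable strategy style
    alpha = float(probs[c_star])        # confidence alpha = p_{c*}
    return c_star, alpha
```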
step 3: based on the opponent strategy style identified in step 2, compute the best regret-matching strategy σ_c against the current opponent strategy and the Nash equilibrium strategy σ*; then, based on the credibility of the style recognition result, generate the probability distribution over all legal actions of the current decision node with a linear soft-update scheme, improving strategy profit while ensuring strategy safety;
step 3.1: compute the best regret-matching strategy against the identified opponent strategy style and an approximate Nash equilibrium strategy against a rational opponent, both with the Monte Carlo counterfactual regret minimization (MCCFR) algorithm; the CFR algorithm is suited to solving non-perfect information games; when solving the best regret-matching strategy, the opponent strategy is fixed and only our own strategy is updated; when solving the Nash equilibrium strategy, both players' strategies are updated simultaneously;
step 3.2: use the opponent strategy style confidence output in step 2 as the coefficient that fuses the Nash equilibrium strategy with the best regret-matching strategy; the fused strategy is computed as:
σ = α·σ_c + (1 - α)·σ*
The decision action of the method is then drawn by roulette-wheel sampling from the probability distribution over legal actions given by the enhanced strategy; to keep the strategy responsive in real time, the opponent strategy style is re-identified and the strategy is updated every time the method makes a decision, so steps 2 and 3 are executed before each decision.
The invention has the beneficial technical effects that:
the opponent strategy style recognition module based on small sample learning in the technical scheme provided by the invention only needs to collect a small amount of data interacted with different opponents, and compared with the prior art, the requirement amount of training data is reduced to 1%, but the same prediction precision can be achieved. Meanwhile, compared with a Nash equilibrium strategy, the strategy fusion mechanism based on the adversary prediction credibility in the technical scheme provided by the invention obviously improves the game profit and obviously reduces the risk of being utilized by the adversary compared with an over-confident adversary modeling strategy. The method is suitable for a real game scene, reduces the dependence on data quantity, improves the robustness of the strategy and meets the application requirement.
Drawings
FIG. 1 is a technical scheme diagram for online enhancing of imperfect information game strategy based on small sample opponent modeling in the embodiment of the present invention
FIG. 2 is a schematic diagram of a small sample adversary strategy style identification module according to an embodiment of the present invention
FIG. 3 is a block diagram of a neural network structure of a small sample adversary strategy style recognition module according to an embodiment of the present invention
FIG. 4 is a flow chart of an implementation of a non-perfect information game strategy online enhancement method based on small sample opponent modeling in an embodiment of the present invention
Detailed Description
The invention is further explained below with reference to the figures and examples;
FIG. 1 gives an overview of the technical scheme of the invention. As shown in FIG. 1, the imperfect information game strategy enhancement method based on small sample opponent modeling provided by the invention comprises: training an opponent strategy style recognition model with a small-sample learning algorithm; solving the opponent's best regret-matching strategy and the Nash equilibrium strategy with a deep Monte Carlo counterfactual regret minimization algorithm; and online strategy enhancement based on opponent strategy style recognition. The specific steps are as follows:
step 1: training an adversary strategy style recognition network based on a small sample learning method;
in order to deal with the adversary modeling limitation caused by data loss in an actual game scene, designing an adversary strategy style recognition module suitable for game problem data characteristics, and training a neural network by adopting a meta-learning method; traditional small sample learning techniques are mostly used for image processing problems, and gaming data is different from picture type data. First, the game history state motion trajectory is time-ordered, and the individual samples in different categories may be very similar or even identical, i.e., different categories of regions may overlap. Aiming at the particularity of the game data, a double-attention prototype feature extraction module and a distance scaling module shown in fig. 2 are designed, and sample features are extracted from two dimensions of time and space.
Based on the method, a known strategy style is selected, a small amount of sample data of a corresponding category is collected to generate a database, and a neural network is trained to judge the similarity degree of the unknown strategy style sample and the known strategy style. The method comprises the following specific steps:
step 1.1: first, build a database of game trajectories representing different strategy styles; each trajectory sample records the state-action sequence of one game round from the opening to time t, τ_t = {s_0, a_1, …, a_{t-1}, s_t}, and this sequence is encoded; here s denotes a game state, a denotes an action, and the subscripts correspond to different times;
the strategy's exploitability index ε_i(σ_i) is used as the type label of a strategy style; exploitability measures the average payoff gap of a strategy relative to the Nash equilibrium strategy and is expressed as:
ε_i(σ_i) = u_i(σ*) - u_i(σ_i, σ*_{-i})
where u_i(σ*) denotes the payoff of player i when all players adopt the Nash equilibrium strategy σ*, and u_i(σ_i, σ*_{-i}) denotes the payoff of player i when player i plays σ_i while the other players keep the Nash equilibrium strategy; considering that an opponent will not adopt an excessively poor strategy and that the styles must remain distinguishable, the exploitability values 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, and 2.0 are chosen; to match the sample-scarce character of practical problems, only 1000 samples are collected for each strategy style;
The strategy is first iteratively updated with the MCCFR algorithm, starting from a randomly initialized strategy; every 1000 iterations, the current strategy is fixed and its exploitability is computed, and strategy updating stops once a preset value is reached; then the game environment is randomly initialized 1000 times, each time the strategy plays from the opening to the end of the game, and the complete game state-action sequence {s_0, a_1, …, a_T, s_T} is recorded, where T denotes the time at which the game ends;
Finally, because the game trajectories differ in length, they cannot be fed into the temporal network in batches, so all samples must be unified to one length; to preserve the integrity of the sample data, every sample is zero-padded up to the length of the longest sample, T_max, which denotes the sequence length of the longest temporal sample; this yields a data set with 10 classes of 1000 samples each;
step 1.2: establishing an adversary strategy style identification network; the adversary strategy style recognition network comprises two sub-modules, namely a double-attention prototype feature extraction module and a distance scaling module;
the double attentionThe force prototype feature extraction module extracts features of input samples collected by strategies from two dimensions of time and space, and the submodule consists of an inner attention time sequence encoder and an outer attention encoder, wherein the inner attention time sequence encoder outputs time sequence features of each sample, and the outer attention encoder extracts distinguishable features of the samples in the whole space from other classes after receiving the time sequence features of all the samples of each class and outputs prototype feature vectors of each strategy style; the distance scaling module assigns different weights z to the features of each dimension of the prototype vector, taking into account the effect of each dimension of the policy style prototype features on computing the category distance i
The opponent strategy style recognition network outputs the probability distribution that the current input sample belongs to each preset strategy style in the database, computed as:
p(c = i | τ) = exp(-d(f(τ), Ψ_i)) / Σ_j exp(-d(f(τ), Ψ_j))
where f(τ) is the feature vector of the sample to be identified after the inner-attention temporal encoder, Ψ_i is the class prototype vector of strategy style i produced by the double-attention prototype feature extraction module, d(·,·) is the distance function (the Euclidean distance is used), and c denotes the strategy class. The structure of the opponent strategy style recognition network is shown in FIG. 3: the inner-attention temporal encoder consists of a fully connected layer, a batch normalization layer, and a ReLU activation layer, followed by an LSTM temporal layer and finally a soft-attention module; the outer-attention encoder consists of a self-attention module. For each class in the support set, the prototype vector output by the double-attention prototype feature extraction module is concatenated with the vector of the query sample after the inner-attention encoder, and the concatenated vectors of all classes are stacked into a two-dimensional input to the distance scaling module. The distance scaling module consists of two convolutional layers, each followed by a batch normalization layer and a ReLU activation layer.
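One possible PyTorch rendering of the distance scaling module described above, i.e., two one-dimensional convolutions over the stacked prototype/query concatenations, each followed by batch normalization and ReLU; the channel sizes and kernel widths are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class DistanceScaling(nn.Module):
    """Two conv layers (each + BatchNorm + ReLU) over stacked [prototype ; query] rows (illustrative)."""

    def __init__(self, hidden_channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden_channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(hidden_channels),
            nn.ReLU(),
            nn.Conv1d(hidden_channels, 1, kernel_size=3, padding=1),
            nn.BatchNorm1d(1),
            nn.ReLU(),
        )

    def forward(self, stacked):
        # stacked: (num_classes, concat_dim), one row per strategy style
        out = self.net(stacked.unsqueeze(1))  # add channel dim -> (num_classes, 1, concat_dim)
        return out.squeeze(1)                 # per-class scaled features used in the distance computation
```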
step 1.3: train the opponent strategy style recognition network; randomly draw n classes from the strategy style library, sample k samples from each of them to form a support set S, and sample k samples from the remaining samples to form a query set Q; the support set S and the query set Q are fed as the input data of one training episode into the opponent strategy style recognition network of step 1.2, yielding its output vector P;
the data of each training batch is sampled twice: the first sampling is over classes and the second over samples, producing a support set and a query set that share the same classes but contain different samples, each of size n × k; the next batch of training data goes through these two samplings again;
step 1.4: compute the cross-entropy loss:
L = -log p(c = c* | τ)
where c* is the true class of sample τ, i.e., the sample label; the parameters of the opponent strategy style recognition network are updated with the adaptive moment estimation (Adam) optimization algorithm, and steps 1.3 and 1.4 are repeated until the network converges;
step 2: during the actual game, invoke the opponent strategy style recognition network trained to convergence in step 1 to identify the opponent strategy style c, together with the confidence α of the recognition result, from the decision trajectory of the current round and the other visible game information, so as to help the agent adjust its strategy flexibly and dynamically when making its own decisions;
step 2.1: convert the known game information from the opening of the current round up to the current time t, namely the game state sequence and the strategy action sequences of all players, τ_t = {s_0, a_1, …, a_{t-1}, s_t}, into an input vector using the encoding of step 1.1;
step 2.2: feed the vectorized known game information into the opponent strategy style recognition network trained to convergence and obtain the classification vector p = (p_1, …, p_x); select the most probable class as the prediction c* of the current opponent strategy style and save its predicted probability as the confidence of the neural network prediction, α = p_{c*};
step 3: based on the opponent strategy style identified in step 2, compute the best regret-matching strategy σ_c against the current opponent strategy and the Nash equilibrium strategy σ*; then, based on the credibility of the style recognition result, generate the probability distribution over all legal actions of the current decision node with a linear soft-update scheme, improving strategy profit while ensuring strategy safety;
step 3.1: compute the best regret-matching strategy against the identified opponent strategy style and an approximate Nash equilibrium strategy against a rational opponent, both with the Monte Carlo counterfactual regret minimization (MCCFR) algorithm; the CFR algorithm is suited to solving imperfect information games; when solving the best regret-matching strategy, the opponent strategy is fixed and only our own strategy is updated; when solving the Nash equilibrium strategy, both players' strategies are updated simultaneously;
step 3.2: use the opponent strategy style confidence output in step 2 as the coefficient that fuses the Nash equilibrium strategy with the best regret-matching strategy; the fused strategy is computed as:
σ = α·σ_c + (1 - α)·σ*
The decision action of the method is then drawn by roulette-wheel sampling from the probability distribution over legal actions given by the enhanced strategy; to keep the strategy responsive in real time, the opponent strategy style is re-identified and the strategy is updated every time the method makes a decision, so steps 2 and 3 are executed before each decision; the flow is shown in FIG. 4.
The opponent modeling method provided by the invention has practical significance for solving non-perfect information game problems in real scenarios. The opponent model built with the small-sample learning method generalizes from a small amount of data, recognizes differences between strategy styles, and solves the problem of learning an opponent's strategy style from scarce historical game samples. By comparing a new opponent with opponents met before, previously learned successful strategies can be reused against similar new opponents, which matches the characteristic of open environments where games must be played against unknown opponents. Moreover, an opponent model based on a deep neural network, especially one trained on only a small number of samples, carries model deviations that cannot be completely eliminated; such modeling errors may prevent the real opponent strategy from being identified and mislead the agent into adopting a highly risky strategy. For this reason, the invention adopts a more robust strategy enhancement method: when the opponent strategy is hard to identify, a relatively conservative strategy is used to reduce risk, and otherwise a more aggressive strategy is used to obtain higher profit. Most importantly, the invention works in an online-update fashion, i.e., the current opponent strategy can be re-identified at every step of the game and the strategy can be changed flexibly in real time, which reduces the risk of the strategy being counter-exploited and meets the need to adapt dynamically to changes in the opponent strategy.

Claims (7)

1. A non-perfect information game strategy enhancement method based on small sample opponent modeling is characterized by comprising the following steps:
step 1: training an adversary strategy style recognition network based on a small sample learning method, and training until convergence;
step 2: during the actual game, invoking the opponent strategy style recognition network trained to convergence in step 1 to identify the opponent strategy style c, together with the confidence α of the recognition result, from the decision trajectory of the current round and the other visible game information, so as to help the agent adjust its strategy flexibly and dynamically when making its own decisions;
and step 3: adversary strategy style based on step 2 recognitionCalculating the best regret matching strategy sigma for the current adversary strategy c And calculating the Nash equilibrium strategy sigma * (ii) a And based on the credibility of the adversary style identification result, the strategy safety is ensured, the strategy income is improved, and the probability distribution of all legal actions of the current decision node is generated by adopting a linear soft update mode.
2. The non-perfect information game strategy enhancement method based on small sample opponent modeling according to claim 1, characterized in that the step 1 specifically comprises:
step 1.1: first, building a database of game trajectories representing different strategy styles; each trajectory sample records the state-action sequence of one game round from the opening to time t, τ_t = {s_0, a_1, …, a_{t-1}, s_t}, and this sequence is encoded; here s denotes a game state, a denotes an action, and the subscripts correspond to different times;
the strategy's exploitability index ε_i(σ_i) is used as the type label of a strategy style; exploitability measures the average payoff gap of a strategy relative to the Nash equilibrium strategy and is expressed as:
ε_i(σ_i) = u_i(σ*) - u_i(σ_i, σ*_{-i})
where u_i(σ*) denotes the payoff of player i when all players adopt the Nash equilibrium strategy σ*, and u_i(σ_i, σ*_{-i}) denotes the payoff of player i when player i plays σ_i while the other players keep the Nash equilibrium strategy; considering that an opponent will not adopt an excessively poor strategy and that the styles must remain distinguishable, x (x > 0) exploitability values are chosen as style labels; to match the sample-scarce character of practical problems, only m (0 < m << |S|^|A|) samples are collected for each strategy style, where |S| denotes the state space size and |A| denotes the action space size;
the strategy is first iteratively updated using the MCCFR algorithm starting from a random initialization strategy, a per iteration>Fixing the current strategy for 0 time to calculate the availability, and stopping strategy updating when a preset numerical value is reached; then, randomly initializing b, b the game environment>0 time, the strategy is used each time from the opening game to the ending, and the complete game state action sequence { s } is recorded 0 ,a 1 ,…,a T ,s T Wherein T represents the time when the game ends;
Finally, because the game trajectories differ in length, they cannot be fed into the temporal network in batches, so all samples must be unified to one length; to preserve the integrity of the sample data, every sample is zero-padded up to the length of the longest sample, T_max, which denotes the sequence length of the longest temporal sample; this yields a data set with x classes of m samples each;
step 1.2: establishing an adversary strategy style identification network; the adversary strategy style recognition network comprises two sub-modules, namely a double-attention prototype feature extraction module and a distance scaling module;
the adversary strategy style identification network outputs probability distribution of the current input sample belonging to all preset strategy styles in each database, and the calculation formula is as follows:
Figure FDA0003998490460000021
wherein
Figure FDA0003998490460000023
For the feature vector of the sample to be identified after the intra-attention-sequential encoder, Ψ i A class prototype vector of the strategy style i after the double-attention prototype feature extraction module is obtained, d (,) is a distance calculation formula, euclidean distance is adopted, and c represents the strategy category;
step 1.3: training the opponent strategy style recognition network; randomly drawing n classes from the strategy style library, sampling k samples from each of them to form a support set S, and sampling k samples from the remaining samples to form a query set Q; feeding the support set S and the query set Q as the input data of one training episode into the opponent strategy style recognition network of step 1.2 to obtain its output vector P;
the data of each training batch needs to be sampled twice, and the next batch of training data goes through the two samplings again;
step 1.4: computing the cross-entropy loss:
L = -log p(c = c* | τ)
where c* is the true class of sample τ, i.e., the sample label; the parameters of the opponent strategy style recognition network are updated with the adaptive moment estimation (Adam) optimization algorithm, and steps 1.3 and 1.4 are repeated until the opponent strategy style recognition network converges.
3. The non-perfect information game strategy enhancement method based on small sample opponent modeling according to claim 2, wherein the double-attention prototype feature extraction module of step 1.2 extracts features of the strategy-collected input samples along the two dimensions of time and space; the sub-module consists of an inner-attention temporal encoder and an outer-attention encoder, wherein the inner-attention temporal encoder outputs the temporal features of each sample, and the outer-attention encoder, after receiving the temporal features of all samples of each class, extracts the features that distinguish that class from the other classes over the whole space and outputs them as the prototype feature vector of each strategy style; and the distance scaling module, accounting for the influence of each dimension of the strategy-style prototype features on the class-distance computation, assigns a different weight z_i to each dimension of the prototype vector.
4. The non-perfect information game strategy enhancement method based on small sample opponent modeling according to claim 2, wherein the opponent strategy style recognition network of step 1.2 is structured as follows: the inner-attention temporal encoder consists of a fully connected layer, a batch normalization layer, and a ReLU activation layer, followed by an LSTM temporal layer and finally a soft-attention module; the outer-attention encoder consists of a self-attention module; for each class in the support set, the prototype vector output by the double-attention prototype feature extraction module is concatenated with the vector of the query sample after the inner-attention encoder, and the concatenated vectors of all classes are stacked into a two-dimensional input to the distance scaling module; the distance scaling module consists of two convolutional layers, each followed by a batch normalization layer and a ReLU activation layer.
5. The non-perfect information game strategy enhancement method based on small sample opponent modeling according to claim 2, wherein of the two samplings in step 1.3, the first is over classes and the second over samples, producing a support set and a query set that share the same classes but contain different samples, each of size n × k.
6. The non-perfect information game strategy enhancement method based on small sample opponent modeling according to claim 1, characterized in that the step 2 specifically comprises:
step 2.1: converting the known game information from the opening of the current round up to the current time t, namely the game state sequence and the strategy action sequences of all players, τ_t = {s_0, a_1, …, a_{t-1}, s_t}, into an input vector using the encoding of step 1.1;
step 2.2: feeding the vectorized known game information into the opponent strategy style recognition network trained to convergence and obtaining the classification vector p = (p_1, …, p_x); selecting the most probable class as the prediction c* of the current opponent strategy style and saving its predicted probability as the confidence of the neural network prediction, α = p_{c*}.
7. The non-perfect information game strategy enhancement method based on small sample opponent modeling according to claim 1, characterized in that the step 3 specifically comprises:
step 3.1: computing the best regret-matching strategy against the identified opponent strategy style and an approximate Nash equilibrium strategy against a rational opponent, both with the Monte Carlo counterfactual regret minimization algorithm; the CFR algorithm is suited to solving imperfect information games; when solving the best regret-matching strategy, the opponent strategy is fixed and only our own strategy is updated; when solving the Nash equilibrium strategy, both players' strategies are updated simultaneously;
step 3.2: using the opponent strategy style confidence output in step 2 as the coefficient that fuses the Nash equilibrium strategy with the best regret-matching strategy; the fused strategy is computed as:
σ = α·σ_c + (1 - α)·σ*
the decision action of the method is drawn by roulette-wheel sampling from the probability distribution over legal actions given by the enhanced strategy; to keep the strategy responsive in real time, the opponent strategy style is re-identified and the strategy is updated every time the method makes a decision, so that steps 2 and 3 are executed before each decision.
CN202211605230.1A 2022-12-14 2022-12-14 Non-perfect information game strategy enhancement method based on small sample opponent modeling Pending CN115965086A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211605230.1A CN115965086A (en) 2022-12-14 2022-12-14 Non-perfect information game strategy enhancement method based on small sample opponent modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211605230.1A CN115965086A (en) 2022-12-14 2022-12-14 Non-perfect information game strategy enhancement method based on small sample opponent modeling

Publications (1)

Publication Number Publication Date
CN115965086A true CN115965086A (en) 2023-04-14

Family

ID=87359507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211605230.1A Pending CN115965086A (en) 2022-12-14 2022-12-14 Non-perfect information game strategy enhancement method based on small sample opponent modeling

Country Status (1)

Country Link
CN (1) CN115965086A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116578636A (en) * 2023-05-15 2023-08-11 北京大学 Distributed multi-agent cooperation method, system, medium and equipment
CN116578636B (en) * 2023-05-15 2023-12-15 北京大学 Distributed multi-agent cooperation method, system, medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination