CN117933419A - Intelligent agent testing method and device for behavior diversity
- Publication number
- CN117933419A (application number CN202410042647.4A)
- Authority
- CN
- China
- Prior art keywords
- agent
- value
- target
- constraint
- diversity
- Prior art date: 2024-01-11
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
Abstract
The invention provides an agent testing method and device oriented to behavior diversity, belonging to the technical field of computers. The method comprises: constructing a test agent and designing constraint conditions; training the test agent and checking, based on a sliding window, the diversity of the state sequences of the target agent; judging whether the diversity is sufficient and, when it is not, selecting a preferred constraint condition and continuing training under it. By adding constraints during training as explicit guidance for training test agents with different strategies, the invention enables the test agents to reveal various weaknesses of the target agent.
Description
Technical Field
The invention belongs to the technical field of computers, and particularly relates to an intelligent agent testing method and device for behavior diversity.
Background
Adversarial games present a scenario of multi-agent interaction in which each agent strives to optimize its own goals, defeat its opponents and win. Deep Reinforcement Learning (DRL) has become a pioneering method for addressing and understanding adversarial games. Its range of application is wide, from achieving super-human performance in board games to tackling key settings such as game-playing problems, commercial competition, political science and resource allocation tasks. DRL gives agents the ability to learn from large amounts of data, adapt to dynamic environments and optimize complex objectives. However, the process of training agents in an adversarial environment is complex.
In a complex environment, an agent's policy must not only accommodate a static set of rules but also cope with the changing policies of its adversary. If this is not handled effectively, the agent becomes fragile and difficult to generalize to different scenarios. In critical situations such as commercial competition or intelligent game playing, a weak agent may lead to dire consequences, ranging from economic loss to risk to human life.
Some current approaches induce failures by perturbing the agent's observations (i.e., the game environment) so that the agent selects sub-optimal behavior. However, these methods may not be practical in real-world scenarios, because changing the physical environment, for example by introducing pixel noise into the input image, tends to be difficult. Furthermore, such approaches may not be effective at revealing inherent decision defects that are unrelated to digital disturbances of the environment. Other studies have focused on training adversarial policies that defeat a victim agent in the game; an adversarial policy trained in this way can expose decision vulnerabilities of the target policy. However, because the adversarial policy is trained to defeat a fixed victim, it concentrates on finding and exploiting the most easily found weaknesses while possibly ignoring others, so the diversity of weaknesses identified by such approaches is limited. Some studies emphasize the importance of test diversity, but these approaches rely primarily on curiosity-based methods to enhance diversity, which may not provide direct guidance. In complex tasks and environments, such approaches may increase curiosity scores without effectively increasing the diversity of the decision defects that are identified.
Therefore, there is an urgent need for a method that tests an agent with behaviorally diverse strategies, especially in safety-critical scenarios.
Disclosure of Invention
Aiming at the above problems, the invention provides an agent testing method and device oriented to behavior diversity. The invention constructs a test agent and designs constraint conditions, dynamically adjusts the strategy of the test agent by adding preferred constraints during training, and reveals various weaknesses of the target agent through the interaction between the test agent and the target agent.
The invention adopts the technical scheme that:
An agent testing method oriented to behavior diversity comprises the following steps:
1) Designing constraint conditions for the mutual combat behaviors of two agents, wherein the two agents comprise a test agent and a target agent;
2) Training the test agent with the QMIX algorithm, and recording the state sequence of the target agent in each round of combat;
3) Checking whether the diversity of the state sequences of the target agent within a sliding window is sufficient;
4) If the diversity is sufficient, continuing to train the test agent under the current settings; if the diversity is insufficient, selecting a suitable constraint condition and applying it to the training process of the test agent;
5) After training, recording the failure scenarios of the target agent observed during testing.
Further, the designed constraint conditions include:
Movement range C_move: limiting the boundary of the agent's movement range to 50% of its default value;
Reduced attack range: limiting the agent's attack range to 50% of its default value;
Enlarged attack range: limiting the agent's attack range to 150% of its default value;
Damage value C_damage: limiting the damage value of each attack by the agent to less than 50% of its default damage;
Blood-volume difference between the two agents C_health: limiting the blood-volume difference between the two agents to less than 50% of the test agent's maximum health value;
Smaller distance between the two agents: limiting the distance between the two agents to less than 50% of the attack range;
Larger distance between the two agents: limiting the distance between the two agents to more than 50% of the attack range.
Further, the QMIX algorithm is implemented by a QMIX model, and the QMIX model comprises two parts:
The agent network is built on a DRQN network and comprises an input layer and an output layer formed by MLP multi-layer neural networks and an intermediate layer formed by a GRU gated recurrent neural network; the agent network takes the agent's observation at the current moment and the agent's action at the previous moment as inputs, and outputs the agent's value function at the current moment;
The mixing network is a feedforward neural network; it takes as input the value function of each agent at the current moment output by the agent network, and outputs the joint action-value function.
Further, the step of training the test agent with the QMIX algorithm includes:
The QMIX model is updated by setting an evaluation network and a target network, wherein the evaluation network is used to compute the Q value of the current state and the target network is used to compute the target Q value;
selecting the agents' actions with an ε-greedy algorithm;
computing, through the evaluation network, the estimated Q value of each action that each agent may take in a given state;
computing, through the target network, the target Q value corresponding to the optimal action taken by each agent in the next state;
The evaluation network receives as input the Q value corresponding to the action selected by each agent and is used to compute the estimated Q value of each action in the current state; the target network receives the target Q value of each agent;
Computing a target value r + γ·Q_tot(target) by the Q-learning method, where r is the obtained reward value, γ is a hyperparameter, and Q_tot(target) is the target Q value;
computing the difference between the target value and the estimated Q value to obtain the TD-error;
Expanding the mask to the same shape as the TD-error, wiping out the TD-error on padded steps, zeroing the gradients, computing the gradient of the loss function with respect to the network parameters by back-propagation, and updating the network parameters with this gradient so that the estimate approaches the true Q value.
Further, the step of checking whether the diversity of the state sequences of the target agent within the sliding window is sufficient includes:
Computing the distance dis(s, s') between each state sequence s contained in the sliding window of size w and the state sequences s' of all previous combats, and finding the minimum of these distances, denoted d_{s'};
if d_{s'} is greater than a preset similarity threshold, the state sequence s is considered a newly occurring failure trajectory;
Computing the frequency freq = #seq / w of newly occurring failure trajectories within the sliding window, this frequency representing the diversity, where #seq denotes the number of newly occurring failure trajectories; if freq is greater than a preset frequency threshold, the diversity is sufficient, otherwise it is insufficient.
Further, whether the diversity of the state sequences of the target agents in the sliding window is sufficient or not is checked by the diversity checking module.
Further, the step of selecting the appropriate constraint includes:
Recording the diversity gain of each constraint in the history test process as the global preference of each constraint;
recording violations of each constraint in the sliding window in real time to represent local preference of each constraint;
weighted summation of global preferences and local preferences as overall preferences for each constraint;
based on the overall preference, an ε -greedy algorithm is used to select the appropriate constraint.
Further, the suitable constraint condition is selected by a constraint selection module, wherein the constraint selection module comprises a historical diversity gain analysis module and a real-time behavior evaluation module; the historical diversity gain analysis module is used for calculating the global preference, the local preference and the overall preference of each constraint, and the real-time behavior evaluation module is used for recording violations of the constraints in real time and calculating the latest constraint indexes.
Further, the constraint indexes include: I_move, the number of times the test agent leaves the limited movement range within a given number of time steps; I_attack, the average actual attack range within the given time steps; I_damage, the average damage value within the given time steps; I_health, the average blood-volume difference between the two agents within the given time steps; and I_distance, the average distance between the two agents within the given time steps.
An agent testing device for behavioral diversity comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to realize the steps of the method.
A storage medium storing a computer program which, when executed, performs the steps of the above method.
Compared with the prior art, the invention has the following beneficial effects:
1) By adding constraints during training as explicit guidance for training different strategies of the adversary (i.e., the test agent), such adversaries can reveal various weaknesses of the target agent;
2) The behavior of the adversarial agent generated by adding constraints better matches real scenarios and can therefore provide a more reasonable simulation environment for them.
Drawings
FIG. 1 is a flow chart of an agent testing method for behavioral diversity.
Fig. 2 is a statistical view of failure scenarios for four experimental maps.
Detailed Description
In order to make the technical features and advantages or technical effects of the technical scheme of the invention more obvious and understandable, the following detailed description is given with reference to the accompanying drawings.
The invention provides an agent testing method oriented to behavior diversity, the flow of which is shown in Figure 1; the specific steps are as follows:
1. design constraints
By consulting adversarial game manuals and online discussions among users, five key factors that can affect the progress and mode of a game are manually determined.
1. Factors related to the capabilities of the agent itself:
range of movement: the moving range is reduced, and the moving capability of the agent is limited.
Attack range: reducing or increasing the scope of attack may directly change the attack capability of the agent.
Injury value: reducing the injury value may limit the combat ability of the agent.
2. Factors related to the interaction state of two agents:
Blood volume difference: it affects the combat pattern, for example defeating the opponent narrowly (maintaining a small blood-volume difference) or defeating it overwhelmingly.
Distance: the distance between agents affects the agents' policies, such as "long-range attacks" or "short-range attacks".
Each of these factors is then refined into specific constraints; in principle each factor could yield both a lower-bound and an upper-bound variant. However, some variants have the same effect as the reward for winning a battle (e.g., increasing the damage value) or are not allowed by the game (e.g., enlarging the movement range), and are therefore excluded. Thus, from the five factors described above, the following seven constraints are derived (a configuration sketch in Python is given after the list):
1) Move within a smaller range, C_move: the boundary of the movement range is limited to 50% of its default value. Such a movement restriction may prevent the test agent from approaching the target agent and therefore from launching an attack, which differs from its usual win-pursuing behavior in the game.
2) Attack within a smaller range: the attack range is limited to 50% of its default value. A reduced attack range of the test agent may lead the target agent to move closer to its adversary, because it no longer feels the need to keep a large safety distance.
3) Attack within a larger range: the attack range is limited to 150% of its default value. An enlarged attack range of the test agent may lead the target agent to choose "long-range attacks", because this protects it from the test agent's greater attack range.
4) Smaller damage value, C_damage: the damage value of each attack is limited to less than 50% of its default damage. This encourages the test agent to adopt more aggressive approaches to defeat the target agent, which may lead the target agent to explore different strategies.
5) Smaller blood-volume difference between the two agents, C_health: the blood-volume difference between the two agents is limited to less than 50% of the test agent's maximum health value. This encourages the test agent to choose a "gentle" attack strategy in order to keep a small difference in condition between itself and the target agent.
6) Smaller distance between the two agents: the distance between the two agents is limited to less than 50% of the attack range. This encourages the test agent to stay close to the target agent throughout the combat, adopting a "close-range attack" strategy.
7) Larger distance between the two agents: the distance between the two agents is limited to more than 50% of the attack range. This encourages the test agent to keep its distance from the target agent throughout the combat, adopting a "long-range attack" strategy.
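As a minimal illustration, the seven constraints above could be encoded as a configuration table consulted by the training loop. The sketch below is in Python; the class and key names are illustrative assumptions, while the ratios of the default values follow the list above.

```python
from dataclasses import dataclass

@dataclass
class ConstraintSpec:
    factor: str    # which of the five factors the constraint refines
    quantity: str  # the quantity being limited
    bound: str     # "at_most" or "at_least" relative to the reference value
    ratio: float   # fraction of the default / reference value

# Seven constraints derived from the five factors (ratios follow the text above).
CONSTRAINTS = {
    "C_move":          ConstraintSpec("movement range", "movement boundary",       "at_most",  0.50),
    "C_attack_small":  ConstraintSpec("attack range",   "attack range",            "at_most",  0.50),
    "C_attack_large":  ConstraintSpec("attack range",   "attack range",            "at_most",  1.50),  # enlarged to 150%
    "C_damage":        ConstraintSpec("damage value",   "damage per attack",       "at_most",  0.50),
    "C_health":        ConstraintSpec("blood volume",   "blood-volume difference", "at_most",  0.50),
    "C_distance_near": ConstraintSpec("distance",       "inter-agent distance",    "at_most",  0.50),
    "C_distance_far":  ConstraintSpec("distance",       "inter-agent distance",    "at_least", 0.50),
}
```

During a test run, the constraint currently selected by the constraint selection module (Section 4) would be looked up in such a table and turned into a penalty on the test agent's reward, as described later.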
2. Training test agent
1. Network structure and parameter setting
A test agent is trained with the QMIX algorithm; the adversarial environment defines the state space the agent can observe and the action space it can execute. The agent then selects an action based on the observed state and learns through interaction with the environment. The QMIX model consists of two main parts. One is the agent network, which takes the observation o_t^a of agent a at time t and the action u_{t-1}^a of agent a at time t-1 as input and outputs the individual value function Q_a(τ^a, u_t^a) of agent a at time t. The other is the mixing network, which takes the individual value functions as input and outputs the joint action-value function Q_tot(τ, u). The agent network is implemented as a DRQN, with network parameters shared across agents; it contains three layers: an input layer (an MLP multi-layer neural network), an intermediate layer (a GRU gated recurrent neural network) and an output layer (an MLP multi-layer neural network). The mixing network is a feedforward neural network that takes the outputs of the agent networks as input and mixes them monotonically to produce the value of Q_tot.
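A minimal PyTorch sketch of the agent network described above is given below: an MLP input layer, a GRU intermediate layer and an MLP output layer, taking the current observation and the previous action as input and producing per-action Q values. The layer width and the one-hot action encoding are illustrative assumptions rather than values fixed by the invention.

```python
import torch
import torch.nn as nn

class DRQNAgent(nn.Module):
    """Per-agent network of QMIX: MLP -> GRU -> MLP, parameters shared across agents."""
    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        # Input: current observation concatenated with the previous action (one-hot).
        self.fc_in = nn.Linear(obs_dim + n_actions, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)   # recurrent intermediate layer
        self.fc_out = nn.Linear(hidden_dim, n_actions)  # one Q value per action

    def forward(self, obs, prev_action_onehot, hidden):
        x = torch.relu(self.fc_in(torch.cat([obs, prev_action_onehot], dim=-1)))
        h = self.gru(x, hidden)   # carries the action-observation history tau^a
        q = self.fc_out(h)        # Q_a(tau^a, .) at the current time step
        return q, h
```

Because the parameters are shared, a single instance of this module can serve every agent controlled by the test side.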
QMIX is updated as follows: an evaluation network (evaluate net) and a target network (target net) are set.
Q_tot(target): given the next state s', the maximum Q value over all actions selected by the agents, i.e., the target Q value, used to compute the target value in the loss function. Based on the IGM (Individual-Global-Max) condition, the maximum action value of each agent in this state is taken as input.
Q_tot(evaluate): given the current state s, the estimated Q value of the action selected by the agents, used to compute the current estimate in the loss.
The update interval of the target network is set to 200, i.e., the target network is updated every 200 episodes; the replay buffer size buffer_size is set to 5000.
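The evaluation/target network pair and the update schedule described above could be organized as in the following sketch; the class name and the deque-based episode buffer are illustrative assumptions, while the interval of 200 episodes and the buffer size of 5000 follow the text.

```python
import copy
from collections import deque

class QmixTrainerState:
    """Holds the evaluation and target networks and a simple episode replay buffer."""
    def __init__(self, eval_net, buffer_size: int = 5000, target_update_interval: int = 200):
        self.eval_net = eval_net                    # used for Q_tot(evaluate)
        self.target_net = copy.deepcopy(eval_net)   # used for Q_tot(target)
        self.buffer = deque(maxlen=buffer_size)     # stores collected episodes
        self.target_update_interval = target_update_interval

    def maybe_sync_target(self, episode_idx: int) -> None:
        # Hard-copy evaluation parameters into the target network every 200 episodes.
        if episode_idx > 0 and episode_idx % self.target_update_interval == 0:
            self.target_net.load_state_dict(self.eval_net.state_dict())
```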
2. The specific training steps are as follows (a sketch of the loss computation is given after these steps):
2a) Performing action selection by using an epsilon-greedy algorithm;
2b) Calculate the estimated Q values of each individual agent to obtain a Q table; along the action dimension, gather the Q value of the action actually taken by each agent and remove the resulting redundant dimension; assign negative infinity to actions that cannot be executed, and obtain the maximum action value and its index.
2C) Calculating a target Q value of a single agent;
2d) Back-propagate according to the loss function: the evaluation network provides the Q value of the action selected by each agent, and the target network provides the maximum Q value of each agent. The target value r + γ·Q_tot(target) is computed as in Q-learning, where r is the reward value and γ is a hyperparameter. The TD-error is computed, the mask is expanded to the same shape as the TD-error, the TD-error on padded steps is wiped out, gradients are zeroed and back-propagation is performed, and the parameters of the target network are updated at the specified interval.
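The sketch below illustrates steps 2a)-2d): gathering the Q value of the chosen action, masking unavailable actions, forming the Q-learning target r + γ·Q_tot(target) and minimizing the masked TD-error. Tensor shapes and variable names are assumptions, and the per-agent Q values are simply summed here instead of being passed through the mixing network, which is a deliberate simplification.

```python
import torch

def qmix_td_loss(q_evals, q_targets, actions, avail_actions_next,
                 rewards, terminated, mask, gamma: float = 0.99):
    """q_evals: (batch, T, n_agents, n_actions) per-agent Q values for the current states;
       q_targets: same shape, from the target network for the next states;
       actions: (batch, T, n_agents, 1) chosen actions; rewards/terminated/mask: (batch, T, 1)."""
    # 2b) Q value of the action each agent actually took.
    chosen_q = torch.gather(q_evals, dim=3, index=actions).squeeze(3)
    # Unavailable actions in the next state get -inf before taking the max.
    q_targets = q_targets.masked_fill(avail_actions_next == 0, -float("inf"))
    # 2c) Greedy target Q value per agent in the next state.
    max_next_q = q_targets.max(dim=3)[0]
    # In full QMIX both quantities would pass through the mixing network to give
    # Q_tot(evaluate) and Q_tot(target); summing agents is a simplification here.
    q_tot_eval = chosen_q.sum(dim=2, keepdim=True)
    q_tot_target = max_next_q.sum(dim=2, keepdim=True)
    # 2d) Q-learning target r + gamma * Q_tot(target) and masked TD-error.
    target = rewards + gamma * q_tot_target * (1 - terminated)
    td_error = (q_tot_eval - target.detach()) * mask    # wipe out padded time steps
    return (td_error ** 2).sum() / mask.sum()
```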
3. Checking behavioral diversity of target agents
The diversity check measures whether the latest combat trajectory (i.e., state sequence) differs from the trajectories already collected since the start of testing. Note that the invention only considers trajectories in which the target agent loses the combat, since the goal is to find diverse failure scenarios of the target agent.
The specific steps are as follows (a code sketch of the check is given after these steps):
1. A state sequence set Q_i = {s_1, s_2, ..., s_{i-1}, s_i} is maintained from the start of testing up to the current combat, where i is the total number of combats and each s is the trajectory of the target agent in one combat, containing information such as position and blood volume;
2. The state sequence set within the window is Q_w = {s_{i-w+1}, ..., s_i}, where w is the size of the sliding window; each s_j is the state sequence of the j-th combat and is an m×n matrix, indicating that the combat has m frames and the observation of each frame is an n-dimensional vector;
3. For a state sequence s' in the sliding window, its distance to every other existing state sequence s is computed based on the Hamming distance:
dis(s, s') = (d_H(s, s') + | |s| − |s'| |) / max(|s|, |s'|),
where the numerator consists of the Hamming distance between s and s' plus the difference of the sequence lengths, and the denominator is the maximum of the two sequence lengths.
From this, the minimum distance between the state sequence s' and the other existing sequences is obtained: d_{s'} = min_s dis(s, s').
4. The invention uses a predefined similarity threshold θ_s to filter out similar sequences, i.e., when d_{s'} > θ_s, the state sequence differs from the existing state sequences and is regarded as a newly occurring failure trajectory. The similarity threshold θ_s is set to a suitable value, such as 0.3 (map 2m_vs_1z) or 0.4 (3m), depending on the dimension of the state sequences and manual analysis.
5. The frequency of newly occurring failure trajectories within the sliding window is then calculated: freq = #seq / w, where #seq is the number of newly occurring failure trajectories and w is the window size. When freq is smaller than a predefined frequency threshold θ_f (e.g., 0.2), the diversity check module considers that a constraint needs to be added.
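A sketch of the diversity check is given below, assuming each trajectory is stored as a list of equal-width observation vectors. The distance formula and the thresholds follow the text; the helper names and the frame-wise comparison used for the Hamming distance are illustrative assumptions.

```python
import numpy as np

def seq_distance(s, s_prime):
    """dis(s, s') = (Hamming distance on overlapping frames + length difference)
       divided by the length of the longer sequence."""
    m = min(len(s), len(s_prime))
    hamming = sum(int(not np.array_equal(s[k], s_prime[k])) for k in range(m))
    return (hamming + abs(len(s) - len(s_prime))) / max(len(s), len(s_prime))

def diversity_sufficient(window, history, theta_s=0.3, theta_f=0.2):
    """window: the last w failure trajectories; history: all earlier failure trajectories."""
    new_count = 0
    for s in window:
        d_min = min((seq_distance(s, s_old) for s_old in history), default=float("inf"))
        if d_min > theta_s:        # far enough from every known trajectory
            new_count += 1         # counts as a newly occurring failure trajectory
    freq = new_count / len(window) # freq = #seq / w
    return freq > theta_f
```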
4. Selecting constraints and applying to training process for testing agent
Once the diversity check module has determined that it is time to introduce a constraint, the constraint selection module decides which constraint should be applied. The constraint selection module consists of two sub-modules, a historical diversity gain analysis module and a real-time behavior evaluation module; the former produces a global preference function for each constraint and the latter a local preference function for each constraint. The preference function of each constraint is obtained by a weighted sum of the global and local preference functions. The constraint selection module then uses an ε-greedy algorithm to select a constraint according to the preference function and applies it to the training process of the test agent.
1. Historical diversity benefit analysis module
This module collects global information, including the history of constraint selections and the diversity gain obtained from each selection. The goal of constraint selection is to maximize the cumulative diversity through a specific selection strategy. The invention casts this task as a multi-armed bandit problem, whose original goal is to maximize the cumulative reward by optimizing the strategy for pulling the arms. Each constraint is treated as an arm of the bandit, and selecting a constraint corresponds to pulling that arm.
The specific process is as follows (a sketch of the update is given after these steps):
1a) Taking a constraint set and time of constraint selection as inputs;
1b) Selecting constraint c through epsilon-greedy algorithm each time;
1c) The number of distinct trajectories observed before the next constraint selection is taken as the diversity reward v = Reward(c) brought by c;
1d) The average reward R(c) of constraint c is updated from the current reward, in preparation for the next selection, as
R(c) ← R(c) + (v − R(c)) / count(c),
where count(c) is the number of times constraint c has been selected;
1e) The final reward vector R, with one entry R(C_i) per constraint, is used as the global preference of the constraints; a higher reward means a higher preference for selecting C_i;
1f) R(C_i) is normalized so that every element lies in the range [0, 1], eliminating the effect of different orders of magnitude.
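Steps 1a)-1f) amount to bandit-style bookkeeping of average rewards per constraint. The sketch below shows the incremental-mean update and the min-max normalization; the dictionary-based implementation is an illustrative assumption.

```python
class HistoricalDiversityGain:
    """Treats each constraint as an arm of a multi-armed bandit."""
    def __init__(self, constraints):
        self.avg_reward = {c: 0.0 for c in constraints}  # R(c)
        self.count = {c: 0 for c in constraints}         # count(c)

    def update(self, c, diversity_reward: float) -> None:
        # Incremental mean: R(c) <- R(c) + (v - R(c)) / count(c).
        self.count[c] += 1
        self.avg_reward[c] += (diversity_reward - self.avg_reward[c]) / self.count[c]

    def global_preference(self):
        # Min-max normalize R so that every element lies in [0, 1].
        values = self.avg_reward.values()
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0                          # avoid division by zero
        return {c: (r - lo) / span for c, r in self.avg_reward.items()}
```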
2. Real-time behavior evaluation module
To represent local constraint preferences, this module records constraint violations in real time in order to compute the indexes related to the most recent constraints. According to the five factors mentioned above, the invention defines five constraint indexes:
a) I move: the number of times the test agent leaves the limited range of movement in a given time step corresponds to constraint C move, which ranges from 0, max m, where max m is the number of time steps.
B) I attack: average actual attack range over a given time step corresponds to constraintsAnd/>The range is [0.5r,1.5r ], where r is the default attack range.
C) I damage: the average injury value of the test agent over a given time step corresponds to constraint C damage, ranging from [0, max d ], where max d is the default maximum injury value.
D) I health: the mean of the blood volume differences for both agents over a given time step corresponds to constraint C health, ranging from [0, max h ], where max h is the maximum blood volume value.
E) I distance: the average distance of the two parties over a given time step corresponds to the constraintAndThe range is [0, d ], where d is the distance between the two furthest points on the map.
Based on the statistics of these indexes, the invention applies min-max normalization using the upper and lower bounds of each index, x_norm = (x − x_min) / (x_max − x_min). If an index corresponds to a single constraint, this normalized value is used directly as that constraint's local preference. If an index corresponds to two constraints, this normalized value is taken as the local preference of the first constraint (i.e., the reduced attack range and the smaller distance); a second normalized value, computed by exchanging x_min and x_max, is then taken as the local preference of the second constraint (i.e., the enlarged attack range and the larger distance).
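The following sketch shows how the local preferences could be derived from the indexes: each index is min-max normalized against its bounds, the result is used directly when a single constraint corresponds to the index, and the value obtained by swapping x_min and x_max is used for the second constraint of a paired index. The mapping from index names to constraint names is an illustrative assumption.

```python
def min_max(x, x_min, x_max):
    return (x - x_min) / (x_max - x_min)

def local_preferences(indexes, bounds):
    """indexes: {'I_move': value, ...}; bounds: {'I_move': (x_min, x_max), ...}."""
    pref = {}
    for name, value in indexes.items():
        x_min, x_max = bounds[name]
        direct = min_max(value, x_min, x_max)
        if name == "I_attack":                  # paired attack-range constraints
            pref["C_attack_small"] = direct
            pref["C_attack_large"] = min_max(value, x_max, x_min)  # swapped bounds
        elif name == "I_distance":              # paired distance constraints
            pref["C_distance_near"] = direct
            pref["C_distance_far"] = min_max(value, x_max, x_min)
        else:                                   # indexes tied to a single constraint
            pref["C_" + name[2:]] = direct      # e.g. I_move -> C_move
    return pref
```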
3. Constraint selection with epsilon-greedy algorithm and application to training
Through the two modules above, the global and local preferences of each constraint are obtained; the global and local preferences are weighted equally (i.e., with weight 0.5) to obtain a probability vector Pr(C_i), whose entries give the combined probability of selecting constraint C_i. The larger the value, the greater the probability that the constraint is selected.
ε-greedy selection is then performed to increase exploration: with probability ε a constraint is selected at random, and with probability 1 − ε the constraint is selected greedily, i.e., the constraint with the largest combined probability Pr is chosen. Following common practice, ε is set to 0.1.
When a constraint is selected, the invention removes the old constraint from the training of the test agent and applies the new one, i.e., a discount factor is added to reconstruct the reward function so as to penalize behavior of the agent that violates the constraint. The goal of adding constraints is to maximize the return R(π) under the conditions
C_i(π) ≤ d_i, i = 1, ..., m,
where m is the number of constraints, C_i(π) is the discounted cost function under policy π, and d_i is the corresponding constraint threshold.
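The sketch below combines the two preference vectors with equal weights, performs ε-greedy constraint selection, and shows one way a violation penalty could be folded into the reward. The penalty form, subtracting a weighted excess of the cost over the threshold d_i, is an assumption consistent with, but not prescribed by, the constrained objective above.

```python
import random

def select_constraint(global_pref, local_pref, epsilon: float = 0.1):
    # Pr(C_i) = 0.5 * global + 0.5 * local, then epsilon-greedy over constraints.
    combined = {c: 0.5 * global_pref[c] + 0.5 * local_pref[c] for c in global_pref}
    if random.random() < epsilon:
        return random.choice(list(combined))    # explore: random constraint
    return max(combined, key=combined.get)      # exploit: largest Pr(C_i)

def constrained_reward(reward, violation_cost, threshold, penalty_weight: float = 1.0):
    # Penalize the test agent only for the part of the cost that exceeds d_i.
    return reward - penalty_weight * max(0.0, violation_cost - threshold)
```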
Experimental test:
The invention selects the popular real-time strategy game StarCraft II as the test environment and uses the PyMARL framework to interact with the game environment. In addition, the invention compares its results with two commonly used, state-of-the-art techniques, QMIX and EMOGI. QMIX is a widely used DRL algorithm for training strategies that control multiple soldiers in a competitive scenario. EMOGI aims to generate game agents with diverse behaviors by combining DRL with evolutionary algorithms.
The invention is evaluated on four StarCraft II maps; the map names and the soldiers controlled by the two agents are shown in the following table:
Table 1 Experimental maps

Map name | Test party | Tested party
---|---|---
2m_vs_1z | 2 Marines | 1 Zealot
3m | 3 Marines | 3 Marines
2s_vs_1sc | 2 Stalkers | 1 Spine Crawler
3s_vs_3z | 3 Stalkers | 3 Zealots
The invention counts the unique failure scenarios detected on the four maps and compares the results with the baseline methods, as shown in Figure 2.
AdvTest finds more unique failure scenarios within the same amount of testing than the commonly used and state-of-the-art techniques. When testing is complete, AdvTest has found 56, 38, 36 and 1,195 unique failure scenarios on the four maps, while the two baselines find 38, 36, 29 and 1,001 (EMOGI) and 19, 28, 22 and 756 (QMIX), respectively. Compared with EMOGI, AdvTest improves by 47.4%, 5.6%, 24.1% and 19.4% on the four maps; compared with QMIX, it improves by 194.7%, 35.7%, 63.6% and 58.1%.
Although the specific details, algorithms for implementation, and figures of the present invention have been disclosed for illustrative purposes to aid in understanding the contents of the present invention and the implementation thereof, it will be appreciated by those skilled in the art that: various alternatives, variations and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention should not be limited to the preferred embodiments of the present description and the disclosure of the drawings, but the scope of the invention is defined by the claims.
Claims (10)
1. An agent testing method oriented to behavior diversity, characterized by comprising the following steps:
1) Designing constraint conditions for the mutual combat behaviors of two agents, wherein the two agents comprise a test agent and a target agent;
2) Training the test agent with the QMIX algorithm, and recording the state sequence of the target agent in each round of combat;
3) Checking whether the diversity of the state sequences of the target agent within a sliding window is sufficient;
4) If the diversity is sufficient, continuing to train the test agent under the current settings; if the diversity is insufficient, selecting a suitable constraint condition and applying it to the training process of the test agent;
5) After training, recording the failure scenarios of the target agent observed during testing.
2. The method of claim 1, wherein the designed constraint conditions include:
Movement range C_move: limiting the boundary of the agent's movement range to 50% of its default value;
Reduced attack range: limiting the agent's attack range to 50% of its default value;
Enlarged attack range: limiting the agent's attack range to 150% of its default value;
Damage value C_damage: limiting the damage value of each attack by the agent to less than 50% of its default damage;
Blood-volume difference between the two agents C_health: limiting the blood-volume difference between the two agents to less than 50% of the test agent's maximum health value;
Smaller distance between the two agents: limiting the distance between the two agents to less than 50% of the attack range;
Larger distance between the two agents: limiting the distance between the two agents to more than 50% of the attack range.
3. The method of claim 1, wherein the QMIX algorithm is implemented by a QMIX model, and the QMIX model comprises two parts:
the agent network is built on a DRQN network and comprises an input layer and an output layer formed by MLP multi-layer neural networks and an intermediate layer formed by a GRU gated recurrent neural network; the agent network takes the agent's observation at the current moment and the agent's action at the previous moment as inputs, and outputs the agent's value function at the current moment;
the mixing network is a feedforward neural network; it takes as input the value function of each agent at the current moment output by the agent network, and outputs the joint action-value function.
4. The method of claim 3, wherein the step of training the test agent with the QMIX algorithm comprises:
updating the QMIX model by setting an evaluation network and a target network, wherein the evaluation network is used to compute the Q value of the current state and the target network is used to compute the target Q value;
selecting the agents' actions with an ε-greedy algorithm;
computing, through the evaluation network, the estimated Q value of each action that each agent may take in a given state;
computing, through the target network, the target Q value corresponding to the optimal action taken by each agent in the next state;
the evaluation network receives as input the Q value corresponding to the action selected by each agent and is used to compute the estimated Q value of each action in the current state; the target network receives the target Q value of each agent;
computing a target value r + γ·Q_tot(target) by the Q-learning method, where r is the obtained reward value, γ is a hyperparameter, and Q_tot(target) is the target Q value;
computing the difference between the target value and the estimated Q value to obtain the TD-error;
expanding the mask to the same shape as the TD-error, wiping out the TD-error on padded steps, zeroing the gradients, computing the gradient of the loss function with respect to the network parameters by back-propagation, and updating the network parameters with this gradient so that the estimate approaches the true Q value.
5. The method of claim 1, wherein the step of checking whether the diversity of the state sequences of the target agent within the sliding window is sufficient comprises:
computing the distance dis(s, s') between each state sequence s contained in the sliding window of size w and the state sequences s' of all previous combats, and finding the minimum of these distances, denoted d_{s'};
if d_{s'} is greater than a preset similarity threshold, the state sequence s is considered a newly occurring failure trajectory;
computing the frequency freq = #seq / w of newly occurring failure trajectories within the sliding window, this frequency representing the diversity, where #seq denotes the number of newly occurring failure trajectories; if freq is greater than a preset frequency threshold, the diversity is sufficient, otherwise it is insufficient.
6. The method of claim 1 or 5, wherein the diversity checking module checks whether the diversity of the state sequences of the target agents within the sliding window is sufficient.
7. The method of claim 1, wherein the step of selecting the appropriate constraint comprises:
Recording the diversity gain of each constraint in the history test process as the global preference of each constraint;
recording violations of each constraint in the sliding window in real time to represent local preference of each constraint;
weighted summation of global preferences and local preferences as overall preferences for each constraint;
based on the overall preference, an ε -greedy algorithm is used to select the appropriate constraint.
8. The method of claim 7, wherein the suitable constraint condition is selected by a constraint selection module comprising a historical diversity gain analysis module and a real-time behavior evaluation module, wherein the historical diversity gain analysis module is used for calculating the global preference, the local preference and the overall preference of each constraint; and the real-time behavior evaluation module is used for recording violations of the constraints in real time and calculating the latest constraint indexes.
9. The method of claim 8, wherein the constraint indexes include: I_move, the number of times the test agent leaves the limited movement range within a given number of time steps; I_attack, the average actual attack range within the given time steps; I_damage, the average damage value within the given time steps; I_health, the average blood-volume difference between the two agents within the given time steps; and I_distance, the average distance between the two agents within the given time steps.
10. An agent testing device for behavioral diversity, comprising a memory and a processor, the memory having stored therein a computer program that when executed by the processor performs the steps of the method of any one of claims 1-9.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202410042647.4A | 2024-01-11 | 2024-01-11 | Intelligent agent testing method and device for behavior diversity
Publications (1)

Publication Number | Publication Date
---|---
CN117933419A | 2024-04-26
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination