CN117933419A - Agent testing method and device for behavioral diversity - Google Patents

Agent testing method and device for behavioral diversity

Info

Publication number
CN117933419A
CN117933419A (application CN202410042647.4A)
Authority
CN
China
Prior art keywords
agent
value
target
constraint
diversity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410042647.4A
Other languages
Chinese (zh)
Inventor
马序言
王亚文
王俊杰
吴泊逾
闫熠光
李守斌
王青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202410042647.4A
Publication of CN117933419A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an agent testing method and device for behavioral diversity, belonging to the field of computer technology. The method comprises constructing a test agent and designing constraint conditions; training the test agent and checking the diversity of the target agent's state sequences over a sliding window; and judging whether the diversity is sufficient, selecting a preferred constraint condition accordingly, and continuing training. By adding constraints during training as explicit guidance, test agents with different strategies are trained, and these test agents can reveal various weaknesses of the target agent.

Description

Agent testing method and device for behavioral diversity
Technical Field
The invention belongs to the field of computer technology, and in particular relates to an agent testing method and device for behavioral diversity.
Background
Adversarial games present scenarios of multi-agent interaction in which each agent strives to optimize its own goals, defeat opponents, and earn rewards. Deep reinforcement learning (DRL) has become a pioneering method for addressing and understanding adversarial games. It has a wide range of applications, from achieving superhuman performance in board games to handling critical settings such as game problems, commercial competition, political science, and resource allocation tasks. DRL enables agents to learn from large amounts of data, adapt to dynamic environments, and optimize complex objectives. However, training agents in an adversarial environment is a complex process.
In a complex environment, an agent's policy must not only accommodate a static set of rules but also cope with the changing policies of its adversary. If this is not handled effectively, the agent can be fragile and difficult to generalize to different scenarios. In safety-critical situations such as commercial competition or intelligent gaming, a weak agent may lead to dire consequences, from economic loss to risk to human life.
Some current approaches induce failures by perturbing the agent's observations (i.e., the game environment) so that the agent selects sub-optimal behavior. However, these methods may not be practical in real-world scenarios, because changing the physical environment, for example by introducing pixel noise into the input image, tends to be challenging. Furthermore, such approaches may not be effective in revealing inherent decision defects that are unrelated to digital disturbances of the environment. Other studies have focused on training adversarial policies that defeat the victim agent in games; an adversarial policy trained in this way may expose decision vulnerabilities of the target policy. However, because the adversarial policy is trained to defeat a fixed victim, it focuses on finding and exploiting the most easily discovered weaknesses while possibly ignoring others, so the diversity of weaknesses identified by such an approach is limited. Some studies emphasize the importance of test diversity, but these approaches rely primarily on curiosity-based methods to enhance diversity, which may not provide direct guidance. In complex tasks and environments, although these approaches may increase curiosity scores, they may not effectively enhance the diversity of identified decision defects.
Therefore, there is an urgent need for a method for testing the diversity of an agent's policy, especially in safety-critical scenarios.
Disclosure of Invention
In view of the above problems, the invention provides an agent testing method and device for behavioral diversity. The invention constructs a test agent and designs constraint conditions, dynamically adjusts the strategy of the test agent by adding preferred constraints during training, and reveals various weaknesses of the target agent through interaction between the test agent and the target agent.
The invention adopts the technical scheme that:
An agent testing method for behavioral diversity comprises the following steps:
1) Designing constraint conditions for the mutual combat behavior of two agents, wherein the two agents comprise a test agent and a target agent;
2) Training the test agent using the QMIX algorithm, and recording the state sequence of the target agent in each round of combat;
3) Checking whether the diversity of the state sequences of the target agent within a sliding window is sufficient;
4) If the diversity is sufficient, continuing to train the test agent according to the current state; if the diversity is insufficient, selecting an appropriate constraint condition and applying it to the training process of the test agent;
5) After training, recording the failure scenarios of the target agent during testing.
Further, the designed constraint conditions include:
movement range C_move: limiting the boundary of the agent's movement range to 50% of its default value;
attack range C_attack^-: limiting the agent's attack range to 50% of its default value;
attack range C_attack^+: limiting the agent's attack range to 150% of its default value;
damage value C_damage: limiting the damage of each attack by the agent to less than 50% of its default damage;
blood volume difference between the two agents C_health: limiting the difference between the two agents' blood volume values to less than 50% of the test agent's maximum health value;
distance between the two agents C_distance^-: limiting the distance between the two agents to less than 50% of the attack range;
distance between the two agents C_distance^+: limiting the distance between the two agents to more than 50% of the attack range.
Further, the QMIX algorithm is implemented by a QMIX model, and the QMIX model comprises two parts:
the agent network, which is constructed based on a DRQN network and comprises an input layer and an output layer formed by MLP multi-layer neural networks and an intermediate layer formed by a GRU gated recurrent neural network; the agent network takes the observation of the agent at the current moment and the action of the agent at the previous moment as inputs, and outputs the value function of the agent at the current moment;
the mixing network, which is a feed-forward neural network that takes the value function of the agent at the current moment output by the agent network as input and outputs the joint action value function.
Further, the step of training the test agent using the QMIX algorithm comprises:
updating the QMIX model by setting an evaluation network and a target network, wherein the evaluation network is used to calculate the Q value of the current state and the target network is used to calculate the target Q value;
selecting the actions of the agents using an ε-greedy algorithm;
calculating, through the evaluation network, the estimated Q value of each action taken by each agent in a given state;
calculating, through the target network, the target Q value corresponding to the optimal action taken by each agent in the next state;
the evaluation network receives as input the Q value corresponding to the action selected by each agent and is used to calculate the estimated Q value of each action in the current state; the target network receives the target Q value of each agent;
calculating the target value r + γ·Q_tot(target) using the Q-Learning method, wherein r is the obtained reward value, γ is a hyper-parameter, and Q_tot(target) is the target Q value;
calculating the difference between the target value and the estimated Q value to obtain the TD-error;
expanding the mask to the same size as the TD-error, masking out the padded TD-error entries, clearing the gradients, calculating the gradient of the loss function with respect to the network parameters using a back-propagation algorithm, and updating the parameters of the target network using the gradient so that the target network approaches the real Q value.
Further, the step of checking whether the diversity of the state sequences of the target agent within the sliding window is sufficient comprises:
calculating the distance dis(s, s') between each state sequence s contained in the sliding window of size w and the state sequences s' of all previous combats, and finding the minimum distance d_s' among these distances;
if d_s' is greater than a preset similarity threshold, the state sequence s is considered to be a newly occurring failure trajectory;
calculating the frequency freq = #seq / w of newly occurring failure trajectories within the sliding window, the frequency representing the diversity, wherein #seq denotes the number of newly occurring failure trajectories; if freq is greater than a preset frequency threshold, the diversity is sufficient, otherwise it is insufficient.
Further, a diversity checking module checks whether the diversity of the state sequences of the target agent within the sliding window is sufficient.
Further, the step of selecting the appropriate constraint includes:
Recording the diversity gain of each constraint in the history test process as the global preference of each constraint;
recording violations of each constraint in the sliding window in real time to represent local preference of each constraint;
weighted summation of global preferences and local preferences as overall preferences for each constraint;
based on the overall preference, an ε -greedy algorithm is used to select the appropriate constraint.
Further, the appropriate constraint condition is selected by a constraint selection module, wherein the constraint selection module comprises a historical diversity gain analysis module and a real-time behavior evaluation module; the historical diversity gain analysis module is used to calculate the global preference, local preference, and overall preference of each constraint; the real-time behavior evaluation module is used to record violations of the constraints in real time and calculate the latest constraint indexes.
Further, the constraint indexes include: I_move, the number of times the test agent leaves the limited movement range within a given time step; I_attack, the average actual attack range within a given time step; I_damage, the average damage value within a given time step; I_health, the average of the blood volume difference between the two agents within a given time step; and I_distance, the average distance between the two agents within a given time step.
An agent testing device for behavioral diversity comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to realize the steps of the method.
A storage medium storing a computer program which, when executed, performs the steps of the above method.
Compared with the prior art, the invention has the following beneficial effects:
1) By adding constraints during training as explicit guidance, adversaries (i.e., test agents) with different strategies are trained, and such adversaries can reveal various weaknesses of the target agent;
2) The behavior of the adversarial agents generated by adding constraints better matches real scenarios, providing a more reasonable simulation environment for practical applications.
Drawings
FIG. 1 is a flow chart of an agent testing method for behavioral diversity.
Fig. 2 is a statistical view of failure scenarios for four experimental maps.
Detailed Description
In order to make the technical features and advantages or technical effects of the technical scheme of the invention more obvious and understandable, the following detailed description is given with reference to the accompanying drawings.
The invention provides an agent testing method for behavioral diversity, the flow of which is shown in FIG. 1. The specific steps are as follows:
1. Design constraints
By referring to adversarial game manuals and online discussions among users, five key factors that can affect the progress and mode of a game are manually determined.
1. Factors related to the capabilities of the agent itself:
Movement range: reducing the movement range limits the agent's mobility.
Attack range: reducing or increasing the attack range directly changes the agent's attack capability.
Damage value: reducing the damage value limits the agent's combat capability.
2. Factors related to the interaction state of two agents:
Blood volume difference: it affects the combat pattern, for example narrowly defeating an adversary (maintaining a small blood volume difference) or defeating it overwhelmingly.
Distance: the distance between the agents affects their policies, such as a "long-range attack" or a "close-range attack".
These factors are then refined into seven specific constraints; in principle, each factor could yield both a lower-bound and an upper-bound constraint. However, some of the resulting effects are the same as the reward for winning a battle (e.g., a larger damage value) or are not allowed by the game (e.g., a larger movement range), and these are therefore not included in the constraints. Thus, from the five factors described above, the following seven constraints are derived:
1) Smaller movement range C_move: the boundary of the movement range is limited to 50% of its default value. Such a movement restriction may prevent the test agent from approaching the target agent and launching an attack, which differs from the winning behavior it normally pursues in a game.
2) Smaller attack range C_attack^-: the attack range is limited to 50% of its default value. A reduced attack range of the test agent may cause the target agent to move closer to its adversary, because it no longer feels the need to maintain a large safety distance.
3) Larger attack range C_attack^+: the attack range is limited to 150% of its default value. An increased attack range of the test agent may cause the target agent to choose a "long-range attack", because this protects it from the test agent's larger attack range.
4) Smaller damage value C_damage: the damage of each attack is limited to less than 50% of its default damage. This encourages the test agent to adopt more aggressive approaches to defeat the target agent, which may lead the target agent to explore different strategies.
5) Smaller blood volume difference between the two agents C_health: the difference between the two agents' blood volume values is limited to less than 50% of the test agent's maximum health value. This encourages the test agent to choose a "gentle" attack strategy that maintains a small difference in condition between itself and the target agent.
6) Smaller distance between the two agents C_distance^-: the distance between the two agents is limited to less than 50% of the attack range. This encourages the test agent to stay close to the target agent throughout the combat, adopting a "close-range attack" strategy.
7) Larger distance between the two agents C_distance^+: the distance between the two agents is limited to more than 50% of the attack range. This encourages the test agent to keep its distance from the target agent throughout the combat, adopting a "long-range attack" strategy.
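For orientation only, the seven constraints can be summarized as a small configuration table. The sketch below is not part of the patent; the dictionary layout and key names are illustrative assumptions, and only the constrained quantities and relative limits (50%, 150%) come from the text above.

```python
# Illustrative summary of the seven constraints; keys and structure are assumptions,
# the quantities and relative limits follow the description above.
CONSTRAINTS = {
    "C_move":       ("movement range boundary",     "<= 50% of default"),
    "C_attack^-":   ("attack range",                "50% of default"),
    "C_attack^+":   ("attack range",                "150% of default"),
    "C_damage":     ("damage per attack",           "< 50% of default damage"),
    "C_health":     ("blood volume difference",     "< 50% of test agent's max health"),
    "C_distance^-": ("distance between the agents", "< 50% of attack range"),
    "C_distance^+": ("distance between the agents", "> 50% of attack range"),
}
```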
2. Training test agent
1. Network structure and parameter setting
The test agent is trained using the QMIX algorithm. The adversarial environment defines the state space that the agent can observe and the action space it can perform; the agent selects an action based on the observed state and learns through interaction with the environment. The QMIX model consists of two main parts. The first is the agent network, which takes the observation o_t^a of agent a at time t and the action u_{t-1}^a of agent a at time t-1 as input, and outputs the individual value function Q_a(τ^a, u_t^a) of agent a at time t. The second is the mixing network, which takes the individual value functions output by the agent networks as input and outputs the joint action value function Q_tot(τ, u). The agent network is implemented as a DRQN network whose parameters are shared across agents; it contains three layers: an input layer (MLP multi-layer neural network), an intermediate layer (GRU gated recurrent neural network), and an output layer (MLP multi-layer neural network). The mixing network is a feed-forward neural network that takes the outputs of the agent networks as input and mixes them monotonically to produce the value of Q_tot.
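As a concrete illustration of the structure described above, the following PyTorch sketch shows an MLP-GRU-MLP agent network and a monotonic mixing network. It is a minimal sketch under stated assumptions, not the patent's implementation: layer sizes, the hypernetwork design of the mixer, and all variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentNetwork(nn.Module):
    """DRQN-style agent network: MLP input layer -> GRU cell -> MLP output layer."""
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        # Input combines the current observation and the previous (one-hot) action.
        self.fc_in = nn.Linear(obs_dim + n_actions, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.fc_out = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, last_action, hidden):
        x = F.relu(self.fc_in(torch.cat([obs, last_action], dim=-1)))
        h = self.gru(x, hidden)       # recurrent hidden state carries the history
        q = self.fc_out(h)            # per-action value for this agent
        return q, h

class MixingNetwork(nn.Module):
    """Feed-forward monotonic mixer: combines per-agent Q values into Q_tot."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        # Hypernetworks produce mixing weights from the global state;
        # absolute values keep Q_tot monotonic in each agent's Q value.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))
        self.n_agents, self.embed_dim = n_agents, embed_dim

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(agent_qs.view(b, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2
        return q_tot.view(b, 1)
```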
The QMIX model is updated by setting an evaluation network (evaluate net) and a target network (target net).
Q_tot(target): given the next state s', the maximum Q value over all actions selected by the agents, i.e., the target Q value, is used to calculate the target value in the loss function. Based on the IGM (Individual-Global-Max) condition, the maximum action value of each agent in this state is taken as input.
Q_tot(evaluate): given the current state s, the estimated Q value of the action selected by the agents, used to calculate the current estimate in the loss.
The update interval of the target network is set to 200, i.e., the target network is updated every 200 episodes; the replay buffer size buffer_size is set to 5000.
2. The specific training steps are as follows:
2a) Perform action selection using the ε-greedy algorithm;
2b) Calculate the estimated Q value of each individual agent to obtain the Q table; along the action dimension, gather the Q value of the action actually taken by each agent and remove the redundant last dimension; assign minus infinity to actions that cannot be executed, and obtain the maximum action value and its index;
2c) Calculate the target Q value of each individual agent;
2d) Perform back-propagation according to the loss function: the evaluation network receives the Q value of the action selected by each agent, and the target network receives the maximum Q value of each agent. The target value r + γ·Q_tot(target) is calculated following Q-Learning, where r is the reward value and γ is a hyper-parameter. The TD-error is then calculated, the mask is expanded to the same size as the TD-error and the padded TD-error entries are masked out, the gradients are cleared, back-propagation is performed, and the parameters of the target network are updated at the specified interval.
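The target value r + γ·Q_tot(target) and the masked TD-error of step 2d) can be sketched as follows. This is a minimal sketch assuming tensors of shape [batch, time, 1]; the function name and arguments are illustrative, and the optimizer calls are only indicated in a comment.

```python
import torch

def qmix_td_loss(q_eval_tot, q_target_tot, rewards, terminated, mask, gamma=0.99):
    """Masked TD loss built from the target value r + gamma * Q_tot(target).

    q_eval_tot:   Q_tot from the evaluation network for the actions actually taken
    q_target_tot: Q_tot from the target network for the greedy actions in the next state
    mask:         1 for real time steps, 0 for padded steps of shorter episodes
    """
    targets = rewards + gamma * (1.0 - terminated) * q_target_tot
    td_error = q_eval_tot - targets.detach()      # no gradient through the target
    masked_td = td_error * mask                   # wipe out the padded entries
    return (masked_td ** 2).sum() / mask.sum()    # average over valid steps only

# Typical usage: optimizer.zero_grad(); qmix_td_loss(...).backward(); optimizer.step();
# the target network is then refreshed from the evaluation network every 200 episodes.
```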
3. Checking behavioral diversity of target agents
The diversity check measures whether the latest combat trajectories (i.e., state sequences) differ from the existing trajectories collected since the start of the test. Note that the invention only considers trajectories in which the target agent loses the combat, because the goal is to find diverse failure scenarios of the target agent.
The specific steps are as follows:
1. Let the set of state sequences from the beginning of the test to the current combat be Q_i = {s_1, s_2, ..., s_{i-1}, s_i}, wherein i is the total number of combats and each s is the trajectory of the target agent in one combat, containing information such as position and blood volume;
2. The set of state sequences in the window is Q_w = {s_{i-w+1}, ..., s_i}, wherein w is the size of the sliding window and each s_j is the state sequence of the j-th combat, an m×n matrix indicating that the combat has m frames and the observation of each frame is an n-dimensional vector;
3. For each state sequence s in the sliding window, its distance to every existing state sequence s' is calculated based on the Hamming distance as
dis(s, s') = (H(s, s') + |len(s) - len(s')|) / max(len(s), len(s')),
where H(s, s') is the Hamming distance between s and s'; the numerator consists of the Hamming distance plus the difference of the sequence lengths, and the denominator is the maximum of the two sequence lengths. Based on this, the minimum distance d_s' between the state sequence s and the other existing sequences is obtained.
4. The invention uses a predefined similarity threshold θ_s to filter out similar sequences, i.e., when d_s' > θ_s, the state sequence is different from the existing state sequences and is regarded as a newly occurring failure trajectory. The similarity threshold θ_s is set to a suitable value, such as 0.3 (map 2m_vs_1z) or 0.4 (map 3m), depending on the dimension of the state sequence and manual analysis.
5. The frequency of newly occurring failure trajectories within the sliding window is then calculated: freq = #seq / w, where #seq denotes the number of newly occurring failure trajectories and w is the window size. When freq is less than a predefined frequency threshold θ_f (e.g., 0.2), the diversity check module considers that a constraint needs to be added.
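A minimal sketch of the sliding-window diversity check in steps 3 to 5 is given below. It assumes each frame of a state sequence is stored as a comparable tuple; the function names and the handling of unequal lengths (frame-wise comparison of the overlapping part plus the length difference) are one possible reading of the formula above.

```python
def seq_distance(s, s_prime):
    """Length-normalized dissimilarity between two state sequences:
    (frame-wise Hamming distance + length difference) / longer length."""
    hamming = sum(1 for a, b in zip(s, s_prime) if a != b)
    len_diff = abs(len(s) - len(s_prime))
    return (hamming + len_diff) / max(len(s), len(s_prime))

def diversity_check(window, previous, theta_s=0.3, theta_f=0.2):
    """Return (diversity_sufficient, number_of_new_failure_trajectories).

    window:   failure state sequences inside the sliding window (size w)
    previous: failure state sequences recorded before the window
    """
    new_seq = 0
    for s in window:
        # A sequence counts as new if it is far from everything seen before.
        if not previous or min(seq_distance(s, s2) for s2 in previous) > theta_s:
            new_seq += 1
    freq = new_seq / len(window)
    return freq > theta_f, new_seq
```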
4. Selecting constraints and applying them to the training of the test agent
After the diversity check module determines that it is time to introduce a constraint, the constraint selection module decides which constraint should be applied. The constraint selection module consists of two sub-modules, a historical diversity gain analysis module and a real-time behavior evaluation module: the historical diversity gain analysis module produces a global preference function for each constraint, and the real-time behavior evaluation module produces a local preference function for each constraint. The preference function of each constraint is obtained as a weighted sum of the global and local preference functions. The constraint selection module then uses the ε-greedy algorithm to select a constraint according to the preference function and applies it to the training process of the test agent.
1. Historical diversity gain analysis module
This module collects global information, including the history of constraint selections and the diversity gain obtained from each selection. The goal of constraint selection is to maximize the cumulative diversity through a specific selection strategy. The invention casts this task as a multi-armed bandit problem, whose original goal is to maximize the cumulative reward by optimizing the strategy for pulling the "arms". Each constraint is treated as an arm of the bandit, and selecting a constraint corresponds to pulling that arm.
The specific process is as follows:
1a) Take the constraint set and the times at which constraints are selected as inputs;
1b) Each time, select a constraint c through the ε-greedy algorithm;
1c) Take the number of distinct trajectories found before the next constraint selection as the diversity reward Reward(c) brought by c;
1d) Update the average reward R(c) of constraint c according to the current reward, in preparation for the next selection, as follows:
R(c) ← R(c) + (Reward(c) - R(c)) / count(c),
where count(c) is the number of times constraint c has been selected;
1e) Take the final reward vector R(C_i) as the global preference of the constraints, where a higher reward means a higher preference for selecting C_i;
1f) Normalize R(C_i) so that each element lies in the range [0, 1], thereby eliminating the effect of different orders of magnitude.
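The bookkeeping of steps 1a) to 1f) can be sketched as a small bandit-style helper; the class and method names are illustrative assumptions, and the running-average update mirrors the formula in step 1d).

```python
import random
from collections import defaultdict

class GlobalPreference:
    """Tracks the average diversity reward R(c) earned by each constraint."""
    def __init__(self, constraints, epsilon=0.1):
        self.constraints = list(constraints)
        self.epsilon = epsilon
        self.avg_reward = defaultdict(float)   # R(c)
        self.count = defaultdict(int)          # count(c)

    def select(self):
        # epsilon-greedy over the current average rewards (step 1b)
        if random.random() < self.epsilon or not any(self.count.values()):
            return random.choice(self.constraints)
        return max(self.constraints, key=lambda c: self.avg_reward[c])

    def update(self, c, diversity_reward):
        # R(c) <- R(c) + (Reward(c) - R(c)) / count(c)   (step 1d)
        self.count[c] += 1
        self.avg_reward[c] += (diversity_reward - self.avg_reward[c]) / self.count[c]

    def normalized(self):
        # scale R(c) into [0, 1] to remove magnitude effects (step 1f)
        vals = [self.avg_reward[c] for c in self.constraints]
        lo, hi = min(vals), max(vals)
        span = (hi - lo) or 1.0
        return {c: (self.avg_reward[c] - lo) / span for c in self.constraints}
```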
2. Real-time behavior evaluation module
To represent the local constraint preferences, this module records violations of the constraints in real time and computes the latest values of the relevant constraint indexes. According to the five factors mentioned above, the invention defines five constraint indexes:
a) I_move: the number of times the test agent leaves the limited movement range within a given time step; it corresponds to constraint C_move and ranges over [0, max_m], where max_m is the number of time steps.
b) I_attack: the average actual attack range within a given time step; it corresponds to constraints C_attack^- and C_attack^+ and ranges over [0.5r, 1.5r], where r is the default attack range.
c) I_damage: the average damage value of the test agent within a given time step; it corresponds to constraint C_damage and ranges over [0, max_d], where max_d is the default maximum damage value.
d) I_health: the average of the blood volume difference between the two agents within a given time step; it corresponds to constraint C_health and ranges over [0, max_h], where max_h is the maximum blood volume value.
e) I_distance: the average distance between the two agents within a given time step; it corresponds to constraints C_distance^- and C_distance^+ and ranges over [0, d], where d is the distance between the two farthest points on the map.
Based on the statistics of these indexes, the invention performs min-max normalization with respect to the upper and lower bounds of each index, x_norm = (x - x_min) / (x_max - x_min). If an index corresponds to a single constraint, this normalized value is used directly as its local preference. If an index corresponds to two constraints, this normalized value is taken as the local preference of the first of the two constraints; then a second normalized value is calculated by exchanging x_min and x_max, i.e., (x - x_max) / (x_min - x_max), and is taken as the local preference of the second constraint.
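A minimal sketch of this min-max normalization is given below; the dictionary keys, bounds, and the pairing of indexes with one or two constraints are illustrative assumptions.

```python
def local_preferences(index_values, index_bounds):
    """Normalize real-time behavior indexes into local constraint preferences in [0, 1].

    index_values: e.g. {"I_attack": 1.2}         current index value
    index_bounds: e.g. {"I_attack": (0.5, 1.5)}  (x_min, x_max) for that index
    Returns (direct, swapped) per index: indexes tied to one constraint use only
    the direct value; indexes tied to a pair of opposing constraints use both.
    """
    prefs = {}
    for name, x in index_values.items():
        x_min, x_max = index_bounds[name]
        direct = (x - x_min) / (x_max - x_min)    # min-max normalization
        swapped = (x - x_max) / (x_min - x_max)   # bounds exchanged, equals 1 - direct
        prefs[name] = (direct, swapped)
    return prefs
```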
3. Constraint selection with epsilon-greedy algorithm and application to training
Through the above two modules, the global preference and the local preference of each constraint are obtained. The global and local preferences are combined with equal weights (i.e., 0.5) to obtain a probability vector Pr(C_i), which contains the combined probability of selecting each constraint C_i; the larger the value, the greater the probability that the constraint is selected.
ε-greedy selection is then performed to increase exploration: with probability ε a constraint is selected at random, and with probability 1 - ε the constraint is selected greedily, i.e., the constraint with the largest combined probability value Pr is selected. Following the common setting, ε is 0.1.
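Combining the two preferences and the ε-greedy rule can be sketched as follows (equal weights of 0.5 and ε = 0.1, as in the text); the function name and input format are assumptions.

```python
import random

def select_constraint(global_pref, local_pref, epsilon=0.1, weight=0.5):
    """Pick a constraint from combined preferences with epsilon-greedy exploration.

    global_pref, local_pref: dicts mapping constraint name -> score in [0, 1]
    """
    combined = {c: weight * global_pref[c] + (1 - weight) * local_pref[c]
                for c in global_pref}
    if random.random() < epsilon:
        return random.choice(list(combined))      # explore: random constraint
    return max(combined, key=combined.get)        # exploit: highest combined preference
```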
When a constraint is selected, the invention removes the old constraint from the training of the test agent and applies the new constraint, i.e., a discounted penalty term is added to reconstruct the reward function so as to penalize behavior of the agent that violates the constraint. The corresponding constrained objective is shown below:
max_π R(π)  s.t.  C_i(π) ≤ d_i,  i = 1, ..., m,
where m is the number of constraints, C_i(π) is the discounted cost function under policy π, and d_i is the corresponding constraint threshold. The goal of adding constraints is to maximize the return R(π) under the condition that C_i(π) ≤ d_i is satisfied for every constraint.
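The patent does not spell out the penalized reward; the sketch below shows one common penalty-based realization of the constrained objective above, with an illustrative function name and a single penalty coefficient.

```python
def shaped_reward(env_reward, cost_estimates, thresholds, penalty=1.0):
    """Penalize the test agent whenever an active constraint is violated.

    cost_estimates: {constraint: estimated discounted cost C_i(pi)}
    thresholds:     {constraint: threshold d_i}
    """
    violation = sum(max(0.0, cost_estimates[c] - thresholds[c]) for c in cost_estimates)
    return env_reward - penalty * violation
```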
Experimental test:
The invention selects the popular real-time strategy game StarCraft II as the test environment and uses the up-to-date framework PyMARL to interact with the game environment. Furthermore, the invention compares the results with two commonly used, state-of-the-art techniques, QMIX and EMOGI. QMIX is a common DRL algorithm used to train policies that control multiple soldiers in a competitive scenario. EMOGI aims to generate behaviorally diverse game agents by combining DRL with evolutionary algorithms.
The invention evaluates four StarCraft II maps; the map names and the units controlled by the two agents are shown in the following table:
Table 1. Experimental maps

Map name     Test party    Tested party
2m_vs_1z     2 Marines     1 Zealot
3m           3 Marines     3 Marines
2s_vs_1sc    2 Stalkers    1 Spine Crawler
3s_vs_3z     3 Stalkers    3 Zealots
The invention counts the unique failure scenarios detected on the four maps and compares the results with the baseline methods; the results are shown in FIG. 2.
AdvTest finds more unique failure scenarios than the commonly used and state-of-the-art techniques within the same testing budget. When testing was completed, AdvTest had found a total of 56, 38, 36, and 1,195 unique failure scenarios on the four maps, while the two baselines found 38, 36, 29, and 1,001 (EMOGI) and 19, 28, 22, and 756 (QMIX), respectively. Compared with EMOGI, AdvTest improved by 47.4%, 5.6%, 24.1%, and 19.4% on the four maps, respectively. Compared with QMIX, AdvTest improved by 194.7%, 35.7%, 63.6%, and 58.1% on the four maps, respectively.
Although the specific details, algorithms for implementation, and figures of the present invention have been disclosed for illustrative purposes to aid in understanding the contents of the present invention and the implementation thereof, it will be appreciated by those skilled in the art that: various alternatives, variations and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention should not be limited to the preferred embodiments of the present description and the disclosure of the drawings, but the scope of the invention is defined by the claims.

Claims (10)

1. An agent testing method for behavioral diversity, characterized by comprising the following steps:
1) Designing constraint conditions for the mutual combat behavior of two agents, wherein the two agents comprise a test agent and a target agent;
2) Training the test agent using the QMIX algorithm, and recording the state sequence of the target agent in each round of combat;
3) Checking whether the diversity of the state sequences of the target agent within a sliding window is sufficient;
4) If the diversity is sufficient, continuing to train the test agent according to the current state; if the diversity is insufficient, selecting an appropriate constraint condition and applying it to the training process of the test agent;
5) After training, recording the failure scenarios of the target agent during testing.
2. The method of claim 1, wherein the designed constraint conditions include:
movement range C_move: limiting the boundary of the agent's movement range to 50% of its default value;
attack range C_attack^-: limiting the agent's attack range to 50% of its default value;
attack range C_attack^+: limiting the agent's attack range to 150% of its default value;
damage value C_damage: limiting the damage of each attack by the agent to less than 50% of its default damage;
blood volume difference between the two agents C_health: limiting the difference between the two agents' blood volume values to less than 50% of the test agent's maximum health value;
distance between the two agents C_distance^-: limiting the distance between the two agents to less than 50% of the attack range;
distance between the two agents C_distance^+: limiting the distance between the two agents to more than 50% of the attack range.
3. The method of claim 1, wherein the QMIX algorithm is implemented by a QMIX model, and the QMIX model comprises two parts:
the agent network, which is constructed based on a DRQN network and comprises an input layer and an output layer formed by MLP multi-layer neural networks and an intermediate layer formed by a GRU gated recurrent neural network; the agent network takes the observation of the agent at the current moment and the action of the agent at the previous moment as inputs, and outputs the value function of the agent at the current moment;
the mixing network, which is a feed-forward neural network that takes the value function of the agent at the current moment output by the agent network as input and outputs the joint action value function.
4. The method of claim 3, wherein the step of training the test agent using the QMIX algorithm comprises:
updating the QMIX model by setting an evaluation network and a target network, wherein the evaluation network is used to calculate the Q value of the current state and the target network is used to calculate the target Q value;
selecting the actions of the agents using an ε-greedy algorithm;
calculating, through the evaluation network, the estimated Q value of each action taken by each agent in a given state;
calculating, through the target network, the target Q value corresponding to the optimal action taken by each agent in the next state;
the evaluation network receives as input the Q value corresponding to the action selected by each agent and is used to calculate the estimated Q value of each action in the current state; the target network receives the target Q value of each agent;
calculating the target value r + γ·Q_tot(target) using the Q-Learning method, wherein r is the obtained reward value, γ is a hyper-parameter, and Q_tot(target) is the target Q value;
calculating the difference between the target value and the estimated Q value to obtain the TD-error;
expanding the mask to the same size as the TD-error, masking out the padded TD-error entries, clearing the gradients, calculating the gradient of the loss function with respect to the network parameters using a back-propagation algorithm, and updating the parameters of the target network using the gradient so that the target network approaches the real Q value.
5. The method of claim 1, wherein the step of checking whether the diversity of the state sequences of the target agent within the sliding window is sufficient comprises:
calculating the distance dis(s, s') between each state sequence s contained in the sliding window of size w and the state sequences s' of all previous combats, and finding the minimum distance d_s' among these distances;
if d_s' is greater than a preset similarity threshold, the state sequence s is considered to be a newly occurring failure trajectory;
calculating the frequency freq = #seq / w of newly occurring failure trajectories within the sliding window, the frequency representing the diversity, wherein #seq denotes the number of newly occurring failure trajectories; if freq is greater than a preset frequency threshold, the diversity is sufficient, otherwise it is insufficient.
6. The method of claim 1 or 5, wherein the diversity checking module checks whether the diversity of the state sequences of the target agents within the sliding window is sufficient.
7. The method of claim 1, wherein the step of selecting the appropriate constraint comprises:
Recording the diversity gain of each constraint in the history test process as the global preference of each constraint;
recording violations of each constraint in the sliding window in real time to represent local preference of each constraint;
weighted summation of global preferences and local preferences as overall preferences for each constraint;
based on the overall preference, an ε -greedy algorithm is used to select the appropriate constraint.
8. The method of claim 7, wherein the appropriate constraint condition is selected by a constraint selection module comprising a historical diversity gain analysis module and a real-time behavior evaluation module, wherein the historical diversity gain analysis module is configured to calculate the global preference, local preference, and overall preference of each constraint; and the real-time behavior evaluation module is configured to record violations of the constraints in real time and calculate the latest constraint indexes.
9. The method of claim 8, wherein the constraint indexes comprise: I_move, the number of times the test agent leaves the limited movement range within a given time step; I_attack, the average actual attack range within a given time step; I_damage, the average damage value within a given time step; I_health, the average of the blood volume difference between the two agents within a given time step; and I_distance, the average distance between the two agents within a given time step.
10. An agent testing device for behavioral diversity, comprising a memory and a processor, the memory having stored therein a computer program that when executed by the processor performs the steps of the method of any one of claims 1-9.
CN202410042647.4A 2024-01-11 2024-01-11 Agent testing method and device for behavioral diversity Pending CN117933419A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410042647.4A CN117933419A (en) 2024-01-11 2024-01-11 Agent testing method and device for behavioral diversity

Publications (1)

Publication Number Publication Date
CN117933419A 2024-04-26

Family

ID=90762543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410042647.4A Pending CN117933419A (en) 2024-01-11 2024-01-11 Agent testing method and device for behavioral diversity

Country Status (1)

Country Link
CN (1) CN117933419A (en)

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination