CN117111640B - Multi-machine obstacle avoidance strategy learning method and device based on risk attitude self-adjustment - Google Patents

Multi-machine obstacle avoidance strategy learning method and device based on risk attitude self-adjustment

Info

Publication number
CN117111640B
CN117111640B (application CN202311379344.3A)
Authority
CN
China
Prior art keywords
strategy
network
current
value distribution
unmanned aerial
Prior art date
Legal status
Active
Application number
CN202311379344.3A
Other languages
Chinese (zh)
Other versions
CN117111640A (en)
Inventor
陈少飞
李鹏
胡振震
陈佳星
谷学强
苏炯铭
石泉
苏小龙
李胜强
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202311379344.3A
Publication of CN117111640A
Application granted
Publication of CN117111640B

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems


Abstract

The application discloses a multi-machine obstacle avoidance strategy learning method and device based on risk attitude self-adjustment. Through a distributional transformation of the optimistically weighted QMIX algorithm, the strategy value distributions and the joint strategy value distribution of the unmanned aerial vehicles are learned in the strategy networks and the hybrid network respectively. To reduce the risk taken by the unmanned aerial vehicles when making decisions and to adapt to changes in environmental risk, a conditional risk value is introduced into the strategy network to construct a utility value distribution, and the environmental risk is taken into account in the hybrid network through an implicit quantile network, so that a cooperative obstacle avoidance strategy adapted to both kinds of risk is effectively learned. To adapt to dynamic changes in environmental risk, an option framework is adopted to discretize the strategy value distribution into several windows for decision making, giving each unmanned aerial vehicle the ability to adaptively adjust its risk attitude in the current state as the environmental risk changes. The method can effectively adapt to risks caused by uncertainty factors in a dynamic environment, improves strategy quality in cooperative obstacle avoidance, and enhances the robustness of multi-machine strategy learning.

Description

Multi-machine obstacle avoidance strategy learning method and device based on risk attitude self-adjustment
Technical Field
The application relates to the technical field of unmanned aerial vehicles, in particular to a multi-machine obstacle avoidance strategy learning method and device based on risk attitude self-adjustment.
Background
To obtain good cooperation strategies in swarm tasks such as multi-unmanned-aerial-vehicle cooperative obstacle avoidance, multi-agent reinforcement learning methods based on value decomposition are commonly adopted to model the interaction between agents and the environment, and rich theoretical results and practical application experience have been accumulated. However, in real-world applications the environment changes dynamically and is full of uncertainty, and high-risk decision scenarios introduce great blindness into the active cooperative decisions of multiple unmanned aerial vehicles. Studying robust decision making of multiple unmanned aerial vehicles under risk conditions therefore becomes particularly important.
In the multi-unmanned-aerial-vehicle cooperative obstacle avoidance task, risk comes from two sources: the risk of each unmanned aerial vehicle's own strategy when making decisions (certain self-decided actions may cause damage to itself or to other unmanned aerial vehicles, collisions with obstacles, and so on, seriously affecting completion of the cooperative task), and the risk caused by the uncertainty of obstacles in the environment (the position and time at which obstacles appear are random and uncertain, and factors such as wind disturbance affect the normal flight of the unmanned aerial vehicles). The strategy-risk problem is essentially the same as that studied in risk-sensitive reinforcement learning, but the performance of risk-sensitive reinforcement learning is greatly limited in the multi-UAV cooperative obstacle avoidance scenario by the agents' local observations and non-stationary strategies. The uncertainty of the environment inevitably leads to randomness in long-term rewards and in environmental state transitions, which is unknown to the unmanned aerial vehicles and requires them to take the impact of environmental risk into account when making decisions. However, although conventional value-decomposition multi-agent reinforcement learning methods achieve competitive results by learning the expected value of state-action pairs, the expected value ignores the risk hidden behind the uncertainty and randomness of the environment, and the learned strategies may lack the ability to accommodate environmental risk.
The prior art has achieved some results on multi-machine cooperative obstacle avoidance methods, but several shortcomings remain. First, few current studies on multi-machine cooperative obstacle avoidance focus on the influence of environmental risk; mainstream research generally concentrates on strategy generation methods for unmanned aerial vehicle safety and robustness, and ignores the influence of environmental risk on learning cooperation strategies during multi-machine cooperation. Second, although several of the latest value-distributed multi-agent strategy learning methods consider environmental risk or strategy risk, they do not consider the two in a fused manner; the method provided by the invention not only starts from reducing the decision risk of the unmanned aerial vehicles, but also considers how multiple machines can cooperate better to cope with environmental risk. Finally, and most importantly, existing learning methods based on value-decomposition multi-machine cooperative obstacle avoidance strategies cannot adapt to changes in environmental risk; that is, when the environmental risk changes, these methods can only follow the mean value to select an exploration strategy.
Disclosure of Invention
Based on the above, it is necessary to provide a multi-machine obstacle avoidance strategy learning method and device based on risk attitude self-adjustment aiming at the technical problems.
A multi-machine obstacle avoidance strategy learning method based on risk attitude self-adjustment comprises the following steps:
Modeling the multi-unmanned-aerial-vehicle cooperative obstacle avoidance task as a decentralized partially observable multi-agent Markov decision process, wherein the information of each unmanned aerial vehicle comprises: local observation information, actions, and the system state.
Constructing a multi-machine obstacle avoidance strategy learning model based on risk attitude self-adjustment, wherein the multi-machine obstacle avoidance strategy learning model is based on an optimistic weighted QMIX algorithm, a conditional risk value is introduced on the basis of action value distribution to learn utility value distribution in the unmanned plane strategy generation process, an option framework is used for learning an adaptive risk attitude in a strategy layer, and an implicit quantile network is used for integrating environmental risks into a hybrid network during centralized training; the multi-machine obstacle avoidance strategy learning model comprises a first strategy network, a second strategy network, a monotone mixed network and an optimal mixed network.
Taking the current local observation information, the previous action and the global state of each unmanned aerial vehicle as inputs, the multi-machine obstacle avoidance strategy learning model with initialized network parameters is trained in a centralized manner to obtain the optimal strategy of each unmanned aerial vehicle in a risk environment.
And each unmanned aerial vehicle adopts the corresponding optimal strategy to complete the unmanned aerial vehicle cooperation obstacle avoidance task.
A multi-machine obstacle avoidance strategy learning device based on risk attitude self-adjustment, the device comprising:
the task modeling module is used for modeling the multi-unmanned-aerial-vehicle cooperative obstacle avoidance task as a decentralized partially observable multi-agent Markov decision process, wherein the information of each unmanned aerial vehicle comprises: local observation information, actions, and the system state.
The multi-machine obstacle avoidance strategy learning model construction module is used for constructing a multi-machine obstacle avoidance strategy learning model based on risk attitude self-adjustment, the multi-machine obstacle avoidance strategy learning model is based on an optimistically weighted QMIX algorithm, conditional risk values are introduced on the basis of action value distribution to learn utility value distribution in the unmanned plane strategy generation process, an option framework is used for learning the self-adaptive risk attitude in a strategy layer, and an implicit quantile network is used for integrating environmental risks into a hybrid network during centralized training; the multi-machine obstacle avoidance strategy learning model comprises a first strategy network, a second strategy network, a monotone mixed network and an optimal mixed network.
The multi-machine obstacle avoidance strategy learning model centralized training module is used for centrally training the multi-machine obstacle avoidance strategy learning model with initialized network parameters, taking the current local observation information, the previous action and the global state of each unmanned aerial vehicle as inputs, to obtain the optimal strategy of each unmanned aerial vehicle in a risk environment.
And the unmanned aerial vehicle cooperation obstacle avoidance module is used for completing unmanned aerial vehicle cooperation obstacle avoidance tasks by adopting the corresponding optimal strategies by each unmanned aerial vehicle.
According to the multi-machine obstacle avoidance strategy learning method and device based on risk attitude self-adjustment, the risks possibly encountered in the multi-unmanned-aerial-vehicle cooperative obstacle avoidance task are considered through a distributional transformation of the optimistically weighted QMIX algorithm, and the strategy value distribution and the joint strategy value distribution are learned in the strategy networks and the hybrid network of the unmanned aerial vehicles respectively. In order to reduce the risk taken by the unmanned aerial vehicles when making decisions and to adapt to changes in environmental risk, a conditional risk value is introduced into the strategy network to construct a utility value distribution, and the environmental risk is taken into account in the hybrid network through an implicit quantile network, so that a cooperative obstacle avoidance strategy adapted to both kinds of risk is effectively learned. Finally, in order to adapt to dynamic changes in environmental risk, an option framework with high-level strategy selection capability is adopted, and decisions are made by discretizing the strategy value distribution into several windows, giving the unmanned aerial vehicles the ability to adaptively adjust their risk attitude in the current state as the environmental risk changes. The method can effectively adapt to the risk problem caused by uncertainty factors in a dynamic environment, improves the strategy quality in unmanned aerial vehicle cooperative obstacle avoidance, and remarkably enhances the robustness of multi-unmanned-aerial-vehicle strategy learning.
Drawings
FIG. 1 is a flow chart of a multi-machine obstacle avoidance strategy learning method based on risk attitude self-adjustment in one embodiment;
FIG. 2 is a schematic diagram of a framework of a multi-machine obstacle avoidance strategy learning method based on risk attitude self-adjustment in one embodiment;
FIG. 3 is a diagram of the utility value distribution function and the option-framework selection strategy in one embodiment;
FIG. 4 is a schematic diagram of a monotonic hybrid network architecture in another embodiment;
FIG. 5 is a schematic diagram of an optimal hybrid network in another embodiment;
FIG. 6 is a schematic diagram of a policy network structure according to another embodiment;
FIG. 7 is a graph of the performance comparison results (normal Stag_hunt task) of the DWMIX algorithm and conventional value decomposition algorithms in another embodiment;
FIG. 8 is a graph of the performance comparison results (normal MPE task) of the DWMIX algorithm and conventional value decomposition algorithms in another embodiment;
FIG. 9 is a graph of the performance comparison results (normal 5mv6m task) of the DWMIX algorithm and conventional value decomposition algorithms in another embodiment;
FIG. 10 is a graph of the performance comparison results (normal Stag_hunt task) of the DWMIX algorithm and value-distributed multi-agent reinforcement learning algorithms in another embodiment;
FIG. 11 is a graph of the performance comparison results (normal MPE task) of the DWMIX algorithm and value-distributed multi-agent reinforcement learning algorithms in another embodiment;
FIG. 12 is a graph of the performance comparison results (normal 5mv6m task) of the DWMIX algorithm and value-distributed multi-agent reinforcement learning algorithms in another embodiment;
FIG. 13 is a schematic diagram of test results in the case of random state transition in another embodiment;
FIG. 14 is a graph showing test results for the random reward situation in another embodiment;
FIG. 15 is a graph showing test results for the case of random state transition+random rewards in another embodiment;
FIG. 16 is a schematic diagram illustrating a comparison of performance of the DWMIX algorithm and value-distributed multi-agent reinforcement learning algorithms in exploring the Stag_hunt task in another embodiment;
FIG. 17 is a schematic diagram showing the performance of the DWMIX algorithm and the value-distributed multi-agent reinforcement learning algorithm in exploring MPE tasks in another embodiment;
FIG. 18 is a schematic diagram showing a comparison of the performance of the DWMIX algorithm and the value-distributed multi-agent reinforcement learning algorithm on exploring a 5mv6m task in another embodiment;
FIG. 19 is a graph of the results of the DWMIX algorithm and other algorithms tested in a random Stag_hunt task when teammates take a random strategy in another embodiment;
FIG. 20 is a graph of test results of the DWMIX algorithm and other algorithms in a combat Stag_hunt task when teammates take a combat strategy in another embodiment;
FIG. 21 is a schematic diagram illustrating the performance of the DWMIX algorithm and the current latest value-distributed multi-agent reinforcement learning algorithm in a task of 5m_vs_6m according to another embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The environmental risk in the multi-machine cooperative obstacle avoidance task comprises the influence of the environment on an unmanned aerial vehicle during its interaction with the environment and the influence of non-stationary strategies among the cooperating unmanned aerial vehicles; the former risk is reflected in the uncertainty of rewards and environmental state transitions, and the latter is reflected in the influence of the other cooperating unmanned aerial vehicles' strategies on the environment. In order to effectively express the influence of environmental risk on unmanned aerial vehicle decisions, the environmental risk is incorporated into the optimal joint policy value distribution: the optimal joint policy value distribution is no longer represented only by the environmental state and all unmanned aerial vehicle policy value distributions; instead, the environmental risk is learned with an implicit quantile network, that is, a representation conditioned on the environmental risk is substituted for the representation conditioned on the environmental state alone.
The method adopts the idea of value distributions on the basis of the optimistically weighted QMIX algorithm and designs a distributional multi-unmanned-aerial-vehicle cooperative obstacle avoidance strategy learning method that can learn a risk-sensitive strategy and possesses a self-adaptive risk attitude. First, a conditional risk value is introduced to learn a utility value distribution during unmanned aerial vehicle policy generation, and an expected-utility IGM criterion is proposed to ensure that the joint utility value distribution can be decomposed. Second, an option framework is used to learn adaptive risk attitudes at the policy level to cope with environmental risk changes. Then, an implicit quantile network is adopted to integrate the environmental risk into the hybrid network during centralized training, so that an optimal joint policy value distribution reflecting the real environmental risk is learned.
In one embodiment, as shown in fig. 1, a multi-machine obstacle avoidance strategy learning method based on risk attitude self-adjustment is provided, and the method comprises the following steps:
step 100: modeling a multi-unmanned aerial vehicle cooperation obstacle avoidance task into a multi-agent non-centralized part observable Markov decision process, wherein the information of each unmanned aerial vehicle comprises the following steps: local observation information, actions, and system status.
In particular, multi-unmanned-aerial-vehicle cooperative obstacle avoidance tasks typically employ value-decomposition multi-agent reinforcement learning and are modeled as a decentralized partially observable Markov decision process (Dec-POMDP), represented by a tuple comprising the state space, the action space, the transition function, the reward function, the observation space, the observation function, the number of agents and the discount factor. The global state describes all information of the real environment. At each time step, each unmanned aerial vehicle selects an action, forming a joint action. Each unmanned aerial vehicle has its own independent observation; in the cooperative task, it is assumed that the unmanned aerial vehicles share the same reward, and γ represents the discount factor. The history τ denotes the observation-action data of the unmanned aerial vehicle. Given its policy, the goal of each unmanned aerial vehicle is to maximize the cumulative discounted return of the joint optimization strategy.
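For illustration only, the Dec-POMDP elements described above can be captured in a minimal container such as the following sketch (field names and defaults are assumptions, not taken from the patent):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class DecPOMDP:
    """Minimal record of the Dec-POMDP used to model the cooperative task."""
    n_agents: int                         # number of unmanned aerial vehicles
    n_actions: int                        # size of each vehicle's discrete action set
    gamma: float = 0.99                   # discount factor (illustrative value)
    reward_fn: Optional[Callable] = None  # shared reward r(s, joint_action)
    obs_fn: Optional[Callable] = None     # per-agent observation o_a = O(s, a)

    def discounted_return(self, rewards: List[float]) -> float:
        """Cumulative discounted return sum_t gamma^t * r_t."""
        return sum(r * self.gamma ** t for t, r in enumerate(rewards))
```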
step 102: constructing a multi-machine obstacle avoidance strategy learning model based on risk attitude self-adjustment, wherein the multi-machine obstacle avoidance strategy learning model is based on an optimistically weighted QMIX algorithm, a conditional risk value is introduced on the basis of action value distribution in the unmanned plane strategy generation process to learn utility value distribution, an option framework is used for learning an adaptive risk attitude in a strategy layer, and an implicit quantile network is used for integrating environmental risks into a hybrid network during centralized training; the multi-machine obstacle avoidance strategy learning model comprises a first strategy network, a second strategy network, a monotone mixed network and an optimal mixed network.
Specifically, the multi-machine obstacle avoidance strategy learning model (the DWMIX algorithm) is based on the optimistically weighted QMIX algorithm, which is given a distributional transformation by introducing an implicit quantile network during centralized training; the overall framework remains substantially consistent with the optimistically weighted QMIX algorithm. The three most important components of the DWMIX algorithm are the unmanned aerial vehicle policy networks (which learn the option policy and the policy value distribution of each unmanned aerial vehicle), the monotonic hybrid network (which learns the decentralized agent strategies) and the optimal hybrid network (which derives the optimal joint action value distribution containing environmental risk); in practice all of them adopt a target-network structure.
The key to value-decomposition multi-agent reinforcement learning in cooperative tasks is solving the credit assignment problem. The optimistically weighted QMIX algorithm addresses the problem that QMIX often underestimates some joint action values by stipulating that all underestimated action values are given larger weight while the other action values are given smaller weight, which essentially takes an optimistic attitude towards those actions. The optimistic weighting operator assigns weight 1 to samples whose joint action value is underestimated and a constant weight smaller than 1 to the remaining samples.
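For illustration, the optimistic weighting rule described above can be sketched as follows (a minimal sketch assuming the quantities are held in tensors; the constant alpha stands for the weight smaller than 1):

```python
import torch

def optimistic_weight(q_tot: torch.Tensor, td_target: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    """Give full weight 1 to samples whose joint action value is underestimated
    (below the TD target) and the smaller constant alpha < 1 to all others."""
    return torch.where(q_tot < td_target,
                       torch.ones_like(q_tot),
                       torch.full_like(q_tot, alpha))
```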
In order to cope with environmental risk, in particular to adapt to the risk variation present in an uncertain environment, an option framework is adopted on the individual strategy value distribution of each unmanned aerial vehicle: according to the chosen option, a corresponding quantile window of the distribution is selected, and a greedy strategy is executed according to the expected value within that window. The cooperative obstacle avoidance task assumes that all unmanned aerial vehicles adopt the same option, that is, options are selected based on the same risk attitude. In order to learn a more robust strategy, besides considering environmental risk, each unmanned aerial vehicle learns a risk-sensitive strategy by introducing a conditional risk value on top of its strategy value distribution, which reduces the proportion of the bad tail of the unmanned aerial vehicle's strategy value distribution during cooperation and improves the safety of the unmanned aerial vehicle when executing the strategy. The framework of the multi-machine obstacle avoidance strategy learning method based on risk attitude self-adjustment is shown in fig. 2.
In conventional value distribution reinforcement learning, an agent generally selects actions according to the expectation of its action value distribution Z(τ, u), i.e. the policy is risk neutral. However, in the cooperative obstacle avoidance task, each unmanned aerial vehicle, as an independent individual, needs to consider the potential risk of certain of its own actions while maximizing the collective benefit, so each agent needs to learn a risk-sensitive strategy. A risk metric function Ψ is therefore introduced on top of the action value distribution Z(τ, u), and the two are linearly combined as the utility value distribution function Z_u(τ, u), with λ representing the weight. Selecting the conditional risk value CVaR_α as Ψ gives:

Z_u(τ, u) = Z(τ, u) + λ · CVaR_α(Z(τ, u))    (1)

wherein CVaR_α(Z(τ, u)) denotes the conditional risk value of Z(τ, u) at confidence level α.

Policy selection is no longer based on maximizing the expectation E[Z(τ, u)] but on maximizing the expected utility value distribution, i.e. argmax_u E[Z_u(τ, u)]. It is noted that each unmanned aerial vehicle only uses the utility value distribution Z_u to select actions, so that actions with small probability but large harm are avoided, while the action value distribution Z is still adopted when calculating the joint action value during centralized training. As shown in fig. 3, the probability that the unmanned aerial vehicle selects an action on the left side of the return distribution (the low-action-value region) is expected to become smaller. Maximizing E[Z_u] corresponds, through the parameter λ, to a trade-off between minimizing risk and maximizing the original expected value of Z; that is, the expectation of the original value distribution learns a risk-sensitive strategy by taking CVaR_α into account. On this basis, a definition of the expected-utility IGM is given. Assume that for the joint utility value distribution Z_u^tot there exist individual utility value distributions Z_u^a satisfying the following condition:

argmax_u E[Z_u^tot(τ, u)] = ( argmax_{u_1} E[Z_u^1(τ_1, u_1)], …, argmax_{u_n} E[Z_u^n(τ_n, u_n)] )    (2)

then the joint value distribution satisfies the IGM criterion under the expected-utility condition.
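A minimal sketch of how the utility value distribution of formula (1) and the resulting greedy action can be computed from per-action quantile estimates is given below (the shapes, the confidence level and the weight lambda are illustrative assumptions):

```python
import torch

def cvar(quantiles: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """Conditional risk value: mean of the worst alpha-fraction of the sorted
    quantile estimates. quantiles has shape [n_actions, n_quantiles]."""
    n = quantiles.shape[-1]
    k = max(1, int(alpha * n))
    sorted_q, _ = torch.sort(quantiles, dim=-1)
    return sorted_q[..., :k].mean(dim=-1)                        # [n_actions]

def utility_distribution(quantiles: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """Formula (1): Z_u = Z + lambda * CVaR_alpha(Z), broadcast over quantiles."""
    return quantiles + lam * cvar(quantiles).unsqueeze(-1)

def greedy_utility_action(quantiles: torch.Tensor, lam: float = 0.1) -> int:
    """Select the action that maximises the expected utility E[Z_u]."""
    return int(utility_distribution(quantiles, lam).mean(dim=-1).argmax())
```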
It should be noted that the multi-machine obstacle avoidance strategy learning model (DWMIX algorithm for short) has two kinds of unmanned aerial vehicle policy network: one learns the decentralized policy and generates the option value, using the first policy network and the monotonic hybrid network, as shown by the upper-left dotted box in fig. 2; the other learns the optimal joint action value, using the second policy network and the optimal hybrid network, as shown by the lower-right dotted box in fig. 2.
The first policy network is used to learn the option value and the action value distribution function from the current and historical observation-action information of the agent (each unmanned aerial vehicle is one agent). Specifically, the first policy network maps the observation-action information successively through a multi-layer perceptron (MLP) network, a gated recurrent unit (GRU) network and a second MLP network into an action value distribution, and generates the utility value distribution by superimposing the conditional risk value. Then, within the quantile window of the discretized utility value distribution indicated by the current option, an ε-greedy strategy is executed and an action is selected, thereby outputting the unmanned aerial vehicle's value distribution estimate for the current state-action; the data output by the GRU network is additionally passed through a third, two-layer MLP network to learn the value of the current option.
The second policy network is used to learn the current policy value distribution from the current and historical observation-action information of the agent. Specifically, the second policy network maps the observation-action information successively through an MLP network, a GRU network and a second MLP network into a value distribution, executes an ε-greedy strategy to select an action, and thereby outputs the current policy value distribution.
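The two policy networks described above share the same MLP-GRU-MLP backbone; a hedged PyTorch sketch is given below (all layer sizes are assumptions, and the option head applies only to the first policy network):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """MLP -> GRU -> MLP head producing per-action quantile estimates, plus a
    two-layer MLP head on the GRU output that scores the options."""
    def __init__(self, input_dim, n_actions, n_quantiles=32, n_options=3, hidden=64):
        super().__init__()
        # input_dim is assumed to already contain the local observation
        # concatenated with the (one-hot) previous action
        self.fc_in = nn.Sequential(nn.Linear(input_dim, hidden), nn.ReLU())
        self.gru = nn.GRUCell(hidden, hidden)
        self.quantile_head = nn.Linear(hidden, n_actions * n_quantiles)
        self.option_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                         nn.Linear(hidden, n_options))
        self.n_actions, self.n_quantiles = n_actions, n_quantiles

    def forward(self, obs_action, h):
        x = self.fc_in(obs_action)
        h = self.gru(x, h)
        z = self.quantile_head(h).view(-1, self.n_actions, self.n_quantiles)
        q_opt = self.option_head(h)      # option values (first policy network only)
        return z, q_opt, h
```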
The monotonic hybrid network takes the value distributions of the current state-action estimates output by the first policy network of each unmanned aerial vehicle, together with the global state information, and adopts the hypernetwork structure of the QMIX algorithm to obtain the joint action value distribution.
The optimal hybrid network takes the current policy value distributions output by the second policy network of each unmanned aerial vehicle, obtains the joint policy value distribution through a feed-forward network, and integrates the environmental risk into the joint policy value distribution through an implicit quantile network to obtain the optimal joint policy value distribution containing the environmental risk.
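A simplified sketch of such an optimal hybrid network is shown below; it concatenates the agents' values, the global state and an IQN-style cosine embedding of a sampled risk quantile (a real implicit quantile network usually combines the embedding multiplicatively, so this is only an assumption-laden illustration):

```python
import math
import torch
import torch.nn as nn

class OptimalMixer(nn.Module):
    """Feed-forward mixer over per-agent values and the global state, with the
    sampled quantile fraction tau embedded through a cosine basis (IQN style)."""
    def __init__(self, n_agents, state_dim, n_cos=64, hidden=64):
        super().__init__()
        self.n_cos = n_cos
        self.cos_embed = nn.Linear(n_cos, hidden)
        self.mix = nn.Sequential(nn.Linear(n_agents + state_dim + hidden, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, agent_values, state, tau):
        # agent_values: [B, n_agents]; state: [B, state_dim]; tau: [B, 1] in (0, 1)
        i = torch.arange(1, self.n_cos + 1, device=tau.device, dtype=tau.dtype)
        phi = torch.relu(self.cos_embed(torch.cos(math.pi * i * tau)))
        return self.mix(torch.cat([agent_values, state, phi], dim=-1))
```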
Step 104: and (3) taking the current local observation information, the previous action and the global state of each unmanned aerial vehicle as inputs, and carrying out centralized training on a multi-machine obstacle avoidance strategy learning model after network parameter initialization to obtain an optimal strategy of each unmanned aerial vehicle in a risk environment.
Specifically, the multi-machine obstacle avoidance strategy learning model learns in a manner of centralized training and decentralized execution, and integrates an option framework and an implicit quantile network on the basis of an optimistically weighted QMIX algorithm.
Randomly initialize the parameters of the first policy network, the second policy network, the monotonic hybrid network and the optimal hybrid network. Set the initial parameters of the target monotonic hybrid network to the parameters of the monotonic hybrid network, the initial parameters of the target optimal hybrid network to the parameters of the optimal hybrid network, the initial parameters of each unmanned aerial vehicle's first target policy network to the parameters of its first policy network, and the initial parameters of each unmanned aerial vehicle's second target policy network to the parameters of its second policy network. Initialize the experience replay pool, exploration rate, discount factor and confidence level. Based on the experience replay pool, exploration rate, discount factor and confidence level, and taking the current local observation information and previous action of each unmanned aerial vehicle as inputs, train the multi-machine obstacle avoidance strategy learning model: during iterative optimization, the parameters of the two kinds of policy network, the monotonic hybrid network and the optimal hybrid network are updated by minimizing a preset loss function, and the parameters of the two kinds of target policy network, the target monotonic hybrid network and the target optimal hybrid network are updated with a preset period as the update frequency; when the preset number of iterations is reached, the next round of iterative optimization begins, and when the preset number of training rounds is reached, iterative optimization stops, yielding the optimal strategy of each unmanned aerial vehicle.
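As a minimal illustration of the periodic target-network update mentioned above (the helper below is an assumed utility, not taken from the patent):

```python
import torch.nn as nn
from typing import Iterable

def sync_targets(nets: Iterable[nn.Module], target_nets: Iterable[nn.Module]) -> None:
    """Hard-copy the online network parameters into the corresponding target
    networks; called once every preset number of training iterations."""
    for net, tgt in zip(nets, target_nets):
        tgt.load_state_dict(net.state_dict())
```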
Step 106: and each unmanned aerial vehicle adopts a corresponding optimal strategy to complete the unmanned aerial vehicle cooperation obstacle avoidance task.
In the multi-machine obstacle avoidance strategy learning method based on risk attitude self-adjustment, the risks possibly encountered in the multi-unmanned-aerial-vehicle cooperative obstacle avoidance task are considered through a distributional transformation of the optimistically weighted QMIX algorithm, and the strategy value distribution and the joint strategy value distribution are learned in the strategy networks and the hybrid network of the unmanned aerial vehicles respectively. In order to reduce the risk taken by the unmanned aerial vehicles when making decisions and to adapt to changes in environmental risk, a conditional risk value is introduced into the strategy network to construct a utility value distribution, and the environmental risk is taken into account in the hybrid network through an implicit quantile network, so that a cooperative obstacle avoidance strategy adapted to both kinds of risk is effectively learned. Finally, in order to adapt to dynamic changes in environmental risk, an option framework with high-level strategy selection capability is adopted, and decisions are made by discretizing the strategy value distribution into several windows, giving the unmanned aerial vehicles the ability to adaptively adjust their risk attitude in the current state as the environmental risk changes. The method can effectively adapt to the risk problem caused by uncertainty factors in a dynamic environment, improves the strategy quality in unmanned aerial vehicle cooperative obstacle avoidance, and remarkably enhances the robustness of multi-unmanned-aerial-vehicle strategy learning.
The method is not limited to the multi-unmanned-aerial-vehicle cooperative obstacle avoidance task and can also be applied to multi-agent cooperative tasks such as multi-robot cooperative rescue and multi-unmanned-aerial-vehicle exploration, because it considers the environmental risk and the agents' decision risk at the same time, and self-adaptation to environmental risk is particularly important for the robustness of learning cooperation strategies in a risk environment.
In one embodiment, step 104 includes: setting the maximum number of training rounds and the maximum number of iterations, and setting the current training round number and the current iteration number to 1; inputting the current local observation information and the previous-moment action of each unmanned aerial vehicle into its first policy network to obtain the current action value distribution and the current option value of each unmanned aerial vehicle; inputting the current action value distributions of all unmanned aerial vehicles and the current global state into the monotonic hybrid network, and obtaining the joint action value distribution using the hypernetwork structure of the QMIX algorithm; inputting the current local observation information and the previous-moment action of each unmanned aerial vehicle into its second policy network to obtain the current policy value distribution of each unmanned aerial vehicle; inputting the current policy value distributions of all unmanned aerial vehicles and the current global state into the optimal hybrid network, obtaining the joint policy value distribution through a feed-forward network, and fusing the environmental risk into the joint policy value distribution through the implicit quantile network to obtain the optimal joint policy value distribution containing environmental risk; calculating the utility value distribution according to the optimal joint policy value distribution, and calculating the TD target based on the utility value distribution; calculating the total loss of the model with the preset total loss function according to the joint action value distribution, the optimal joint policy value distribution and the TD target; and updating the parameters of the first policy network, the second policy network, the monotonic hybrid network and the optimal hybrid network by minimizing the total loss of the model, updating the parameters of the first target policy network, the second target policy network, the target monotonic hybrid network and the target optimal hybrid network at the preset update frequency, increasing the current iteration number by 1 and performing the next iteration until the iteration number reaches the maximum, then increasing the current training round number by 1 and entering the next training round until the training round number reaches the maximum, thereby obtaining the optimal strategy of each unmanned aerial vehicle.
Specifically, in the centralized training stage, the algorithm needs to learn the joint action value distribution from the action value distribution functions of all unmanned aerial vehicles. The DWMIX algorithm learns two joint value distributions: the joint action value distribution, which is consistent with the weighted QMIX algorithm and is obtained through a hypernetwork structure with a monotonicity constraint, namely the monotonic hybrid network in fig. 4; and the optimal joint policy value distribution, which is implemented with a feed-forward network and incorporates the environmental risk through an implicit quantile network, namely the optimal hybrid network shown in fig. 5. In order to more closely approximate the true joint policy value distribution, a multi-layer neural network with non-negative weights is used, in which both the weights and the biases take the environmental state into account. In this regard, the optimal joint policy value distribution learned by the DWMIX algorithm is more accurate and closer to the true joint value distribution than that learned by the DRIMA algorithm, because DRIMA uses a QMIX-like hypernetwork structure whose monotonicity limits its expressive capacity.
In one embodiment, the first policy network comprises: a GRU network and 3 MLP networks; inputting the current local observation information of each unmanned aerial vehicle and the action at the previous moment into respective first policy networks to obtain the current action value distribution and the current option value of each unmanned aerial vehicle, wherein the method comprises the following steps:
step 300: inputting the current local observation information of each unmanned aerial vehicle and the previous action into the respective first strategy network, processing the current local observation information and the previous action by the first MLP network, the GRD network and the second MLP network in sequence, and mapping the action observation information into the following steps of And by superimposing the conditional risk value +.>A discretized utility value distribution is generated.
Step 302: selecting corresponding window execution in discretized utility value distribution according to optionsAnd greedy strategy and selecting action to obtain the current action value distribution of each unmanned aerial vehicle.
Step 304: and learning the data output by the GRU network through a third MLP network with a double-layer structure to obtain the value of the current option.
Step 306: based on the value of the current option, adoptThe policy determines the options in the current state.
Specifically, the first policy network of each unmanned aerial vehicle learns the option value and the action value distribution function from the agent's current and historical observation-action information. The policy network maps the observation-action information to an action value distribution and generates the utility value distribution by superimposing the conditional risk value. Then, within the window of the discretized utility value distribution selected according to the option, an ε-greedy strategy is executed and actions are selected, thereby outputting the unmanned aerial vehicle's value distribution estimate for the current state-action. Meanwhile, the first policy network learns the option value in the current state from the GRU output data through a two-layer MLP, and determines the option in the current state with an ε-greedy policy.
The first policy network is structured as shown in fig. 6, the target first policy network and the first policy network are structured identically, and the structure of the second policy network only needs to delete a portion of the dashed box in the first policy network shown in fig. 6. The first policy network and the second policy network do not share parameters, but share network parameters within the same type of policy network.
In one embodiment, step 304 includes: selecting actions according to the mean of the action value function within the c-th option; the mean of the action value function within the c-th option is:

Q̄_c(τ, u) = (1/w) · Σ_{j=(c−1)w+1}^{cw} θ_j(τ, u)    (3)

wherein Q̄_c(τ, u) is the mean of the action value function within the c-th option, w denotes the window size, and the N quantile estimates θ_j(τ, u) are divided into K windows, so that w = N / K.

The option value corresponding to each option is optimized and updated with a value-iteration formula; the value-iteration formula is:

Q(o, c) ← Q(o, c) + η [ r + γ ( (1 − β_c(o′)) Q(o′, c) + β_c(o′) max_{c′} Q(o′, c′) ) − Q(o, c) ]    (4)

wherein β_c is the termination probability of option c, η denotes the step size, c′ is the new option selected with probability β_c, Q(o, c) is the value of the current option, γ is the discount factor, o′ is the next observation, o is the current observation, and Q(o′, c) is the option value output when the input is the next observation o′ and the current option c.

An ε-greedy strategy is used to determine the option in the current state; the option is selected as follows:

c_t = c_{t−1} with probability 1 − β_c; a random option from the K options with probability β_c · ε; argmax_c Q(o_t, c) with probability β_c · (1 − ε)    (5)

wherein c_{t−1} is the last option, the random choice indicates that one of the K options is selected uniformly at random, and the latter two cases both correspond to selecting a new option with probability β_c.
Specifically, each unmanned aerial vehicle in the cooperative task selects actions based on its own utility value distribution; although risk is taken into account, selecting actions only according to the mean of the distribution does not fully exploit the rich environmental risk information contained in the distribution. In particular, when the environmental risk changes, action selection based on the mean cannot reflect the change in the unmanned aerial vehicle's attitude toward that risk, which is especially important in multi-UAV cooperative obstacle avoidance tasks in dynamic environments, because the unmanned aerial vehicles need to adapt to environmental risk changes caused by environmental changes during cooperative obstacle avoidance. Based on this, a higher-level strategy is adopted over the utility value distribution, i.e. a distribution interval is selected within which a greedy policy is executed.
The option framework is typically used as a policy selector in hierarchical reinforcement learning. Each option in the option framework is represented by a tuple consisting of a set of start states, an intra-option policy and a termination function that indicates whether to continue executing or to terminate the current option in the current state. When the option framework is used, an option c is first selected according to the option policy, and then an action is selected according to the intra-option policy corresponding to option c. The intra-option policy is no longer based on the whole distribution, but selects actions according to the magnitude of the action values within the distribution window corresponding to option c. The quantile estimates of the utility value distribution are used, and K options (quantile windows) are constructed that share the same start and termination functions. An action is selected according to the mean of the action value function within the quantile window corresponding to the chosen option, where the mean is calculated using formula (3).
The selection of options is optimized by an iterative process similar to Q-learning. The merit of each option is expressed by the option value function and updated by formula (4), where β_c denotes the termination probability of the option and η denotes the step size. The DWMIX algorithm assumes that all cooperating agents hold the same option.
Based on the value function corresponding to each option, the DWMIX algorithm continues to select the last option with probability 1 − β_c, and selects a new option with probability β_c. When selecting a new option, an ε-greedy policy is used. The selection of options is shown in formula (5).
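For illustration, the option-level decision rule of formulas (3) and (5) can be sketched as follows (a plain-Python sketch; the variable names and the nested-list representation of the quantiles are assumptions):

```python
import random

def select_option(q_options, prev_option, beta_term, epsilon):
    """Formula (5) sketch: keep the previous option with probability 1 - beta_term,
    otherwise pick a new option epsilon-greedily over the option values."""
    if prev_option is not None and random.random() > beta_term:
        return prev_option                                     # continue current option
    if random.random() < epsilon:
        return random.randrange(len(q_options))                # random new option
    return max(range(len(q_options)), key=lambda c: q_options[c])  # greedy new option

def action_within_option(quantiles, option, n_options):
    """Formula (3) sketch: average the quantile estimates inside the window indexed
    by the option, then act greedily on that window mean.
    quantiles: nested list of shape [n_actions][n_quantiles]."""
    n_q = len(quantiles[0])
    w = n_q // n_options                                       # window size
    lo, hi = option * w, (option + 1) * w
    window_means = [sum(row[lo:hi]) / w for row in quantiles]
    return max(range(len(window_means)), key=lambda a: window_means[a])
```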
In one embodiment, step 302 includes: inputting the current action value distribution of each unmanned aerial vehicle into the monotonic hybrid network and, after passing through the hypernetwork structure with monotonicity constraints, obtaining the joint action value distribution; inputting the current action value distribution of each unmanned aerial vehicle into the optimal hybrid network and, after passing through the feed-forward neural network, fusing the environmental risk into the joint policy value distribution through the implicit quantile network to obtain the optimal joint policy value distribution; and obtaining the joint utility value distribution from the joint action value distribution and selecting actions according to the joint utility value distribution.
In one embodiment, the predetermined total loss function in step 212 is:

L = λ₁ L_w + λ₂ L_z + λ₃ L_c    (6)

wherein L is the preset total loss function, λ₁, λ₂ and λ₃ are parameters controlling the importance of each loss term, L_c is the loss function between the current option value and the target option value, L_w is the loss function between the joint action value distribution and the optimal joint policy value distribution, and L_z is the loss function between the TD target and the current optimal joint policy value distribution.
In one embodiment, the loss function between the joint action value distribution and the optimal joint policy value distribution is:

L_w = (1/b) Σ_{i=1}^{b} w(s, u) ( Z_tot(τ, u, s) − y_i )²    (7)

y = r + γ E[ Z*_tot(τ′, u′, s′) ],  with u′ = argmax_{u′} E[ U_tot(τ′, u′, s′) ]    (8)

w(s, u) = 1 if E[ Z_tot(τ, u, s) ] < y, and w(s, u) = α otherwise    (9)

wherein Z_tot is the joint action value distribution, y is the TD target, w(s, u) is the weight, α is a constant, r is the current reward, s′, τ′ and u′ are respectively the global state, history and joint action of the next step, U_tot is the joint utility value distribution, b denotes the number of samples in a mini-batch sampled from the buffer, and Z*_tot is the target-network output of the optimal joint policy value distribution.
In one embodiment, the loss function between the TD target and the current optimal joint policy value distribution is:

L_z = (1/b) Σ Huber_κ ( y_z − Z*_tot(τ, u, s, ω) ),  with y_z = r + γ Z*_tot(τ′, u′, s′, ω)    (10)

wherein ω is the environmental risk, Z*_tot is the optimal joint policy value distribution, y_z is the TD target, r is the current reward, γ is the discount factor, Z*_tot(τ′, u′, s′, ω) represents the optimal joint policy value distribution of the next step, and s′, τ′ and u′ are respectively the global state, history and joint action of the next step.
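A hedged sketch of the quantile-regression Huber loss used for this term (formula (10)) is shown below; the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def quantile_huber_loss(pred_quantiles: torch.Tensor, target: torch.Tensor,
                        taus: torch.Tensor, kappa: float = 1.0) -> torch.Tensor:
    """Quantile Huber loss between predicted quantiles [B, N] (with fractions
    taus [N]) and target samples [B, M], as used between the TD target and the
    optimal joint policy value distribution."""
    u = target.unsqueeze(1) - pred_quantiles.unsqueeze(2)        # TD errors [B, N, M]
    huber = F.smooth_l1_loss(pred_quantiles.unsqueeze(2).expand_as(u),
                             target.unsqueeze(1).expand_as(u),
                             reduction='none', beta=kappa)
    weight = torch.abs(taus.view(1, -1, 1) - (u.detach() < 0).float())
    return (weight * huber).mean()
```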
In one embodiment, the loss function between the current option value and the target option value is:

L_c = ( y_c − Q(s_{t−1}, c_{t−1}) )²    (11)

y_c = r + γ [ (1 − β_c(s′)) Q(s′, c) + β_c(s′) max_{c′} Q(s′, c′) ]    (12)

wherein Q(s′, c) represents the option value when the input is the next-step state and the current option, Q(s_{t−1}, c_{t−1}) is the value of the current option evaluated at the last option c_{t−1} and the global state s_{t−1} of the last step, β_c(s′) is the probability that option c terminates in the next-step global state s′, r is the current reward, γ is the discount factor, s′ is the global state of the next step, and c′ is the new option selected with probability β_c.
Specifically, the learning objective of the DWMIX algorithm has multiple parts. First, the approximation error between the joint action value distribution and the optimal joint policy value distribution must be reduced as much as possible in the hybrid networks. Because the environmental risk is considered in the optimal joint policy value distribution, it is closer to the true joint action value distribution; by having the joint action value distribution Z_tot track the optimal joint policy value distribution, the monotonic hybrid network and the policy networks can be well optimized, thereby improving the quality of the selected strategies. In the loss function L_w, the target-network output of the optimal joint policy value distribution is used to calculate the TD target y, but the DWMIX algorithm selects actions based on the joint utility value distribution rather than on the optimal joint policy value distribution. At the same time, the weight w prevents the algorithm from underestimating the optimal joint action value: when the joint action value is below the TD target, the sample is given a larger weight; conversely, for overestimated actions, the smaller value α is used to balance the error. The calculation of the loss function L_w is shown in formulas (7) to (9).
Second, the optimal joint policy value distribution is optimized using a quantile regression method. When optimizing the monotonic hybrid network, the optimal joint policy value distribution is used to guide it, so learning an accurate optimal joint policy value distribution is all the more important. The loss function L_z shown in formula (10) adopts the Huber loss between the TD target y_z and the current optimal joint policy value distribution.
Next, the option values are updated by iterative optimization, minimizing the loss function L_c between the current option value and the target option value, so as to improve the option policy and optimize the risk attitude selected by the agents as the environmental risk changes. The loss function L_c is calculated using formulas (11) and (12).
Finally, from the loss functions L_w, L_z and L_c, the final objective shown in formula (6) is obtained. This objective is trained in an end-to-end fashion to learn the most realistic policy network and hybrid network parameters.
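As an illustration of how these three terms are combined into one end-to-end update (formula (6)), a minimal sketch is given below; the weights, gradient clipping and optimizer handling are assumptions rather than values fixed by the patent:

```python
import torch

def update_step(optimizer, l_w, l_z, l_c, lambdas=(1.0, 1.0, 1.0),
                params=None, max_grad_norm=10.0):
    """One optimization step on L = lambda1*L_w + lambda2*L_z + lambda3*L_c."""
    loss = lambdas[0] * l_w + lambdas[1] * l_z + lambdas[2] * l_c
    optimizer.zero_grad()
    loss.backward()
    if params is not None:
        torch.nn.utils.clip_grad_norm_(params, max_grad_norm)
    optimizer.step()
    return loss.item()
```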
The pseudo code of the multi-machine obstacle avoidance strategy learning method based on risk attitude self-adjustment is as follows:
1: initializing policy network parameters、/>Monotonically mixed network parameters->And optimal hybrid network parameters->
2: initializing target network parameters、/>、/>、/>Experience buffer
3: initialization of,/>,/>And discount factor->Weight->、/>Confidence level->
4: initializing parameters、/>、/>Number of bits->、/>、/>Number of options->
5:
6: collecting global stateAnd all unmanned observations->
7:
8: each unmanned aerial vehicle calculates current action value distribution according to own strategy networkAnd option value->
9: determining options according to formulas (3) - (5);
determining a distribution window according to options and calculating the mean value of action value functions in the options
11 selection action,/>
12: performing joint actionsState transition and obtaining observations and rewards +.>
13: storage of To buffer zone->Sampling a small batch of samples;
14: according toDeriving a utility value distribution->Select to moveActing as
15: computing TD targets
16: calculating a joint action value distribution and an optimal joint strategy value distribution loss according to a formula (7);/>
17: calculated according to formula (10)And the current optimal joint policy value distribution +.>Inter Huber loss;
18: calculating a loss between the target option value and the current option value according to formula (11)
19:
20: by minimizing lossesUpdate parameter->,/>、/>And->
21: in cycles ofUpdating target network parameters +.>,/>,/>,/>
22:End for
23:End for
On line 14, the option is combined with the expectation of the joint utility value distribution to select the action, in order to stay consistent with the risk attitude the unmanned aerial vehicles hold when selecting actions; that is, the risk attitude towards actions is not only embodied in the policy networks but, through the joint utility value distribution, also plays a role during centralized training.
It should be understood that, although the steps in the flowchart of fig. 1 are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated in the present application, the order of execution is not strictly limited and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; nor need these sub-steps or stages be performed sequentially, as they may be performed in turn or alternately with at least a portion of other steps or of the sub-steps or stages of other steps.
In one illustrative embodiment, the algorithm of the present application is evaluated and validated in the multi-agent benchmark environments: the grid-world Stag_hunt, MPE, and the challenging SMAC.
(1) Benchmark task
Stag_hunt task: the 8 predators are required to be of the sizeThe grid world of (c) cooperatively pursues 8 prey objects with typical partially observable characteristics. In this task, when only one agent captures a prey without cooperating with other agents, the environmental feedback reward is-2, and only the reward is 2 when cooperating, and the cooperation ability between agents is tested through this reward and punishment setting.
SMAC task: SMAC tasks aim to evaluate agents' ability to solve complex tasks through collaboration. The randomness of SMAC tasks is mainly reflected in unit attack speed and update order. The randomness of attack speed is realized by setting the delay before an agent's next attack, after it attacks, to between -1 and +2 experimental steps, so that agents never attack in permanent synchrony. The randomness of the update order means that if two agents hit an opponent at the same time, which one inflicts damage first and wins is random. The 5m_vs_6m map is chosen to test algorithm robustness.
MPE task: MPE tasks simulate real-world collaborative tasks as interactions between particles; the Predator-prey task is taken as an example to test the robustness of the algorithm. In Predator-prey, three slower predators need to surround the prey through collaboration, while the faster prey is rewarded for staying away from the predators.
(2) Risk setting
In the three reference environments described above, several types of risk conditions are set by modification thereof. The exploration, dilemma and noise conditions originate from the test methods proposed by the DRIMA algorithm in order to verify the impact of the risk of collaboration, but are considered part of the environmental risk in this embodiment. The risk settings are shown in table 1.
Table 1 risk settings
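As a concrete illustration of how random-reward and random-state-transition risk settings of the kind listed in Table 1 can be injected into an environment, a hedged wrapper sketch is given below (the gym-like step() interface and the sample_random_joint_action() helper are assumptions, not part of the patent or of the benchmark code):

```python
import random

class RiskyEnvWrapper:
    """Adds reward noise and occasional random transitions to a wrapped
    multi-agent environment with a gym-like step() interface."""
    def __init__(self, env, reward_noise_std=0.5, random_transition_prob=0.1):
        self.env = env
        self.reward_noise_std = reward_noise_std
        self.random_transition_prob = random_transition_prob

    def step(self, joint_action):
        # with small probability replace the joint action by a random one,
        # emulating a random state transition
        if random.random() < self.random_transition_prob:
            joint_action = self.env.sample_random_joint_action()  # assumed helper
        obs, reward, done, info = self.env.step(joint_action)
        # additive Gaussian noise on the shared reward emulates random rewards
        reward += random.gauss(0.0, self.reward_noise_std)
        return obs, reward, done, info
```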
(3) Reference algorithm and algorithm setting
The DWMIX algorithm is first compared with the currently popular value decomposition MARL algorithms, mainly comprising the WQMIX algorithms (OWQMIX and CWQMIX), the VDN algorithm and the QMIX algorithm, all implemented on the basis of the current most advanced open-source framework PYMARL. Second, it is compared with the most advanced value-distributed multi-agent reinforcement learning algorithms at present, mainly comprising the RMIX, DQMIX, DFAC and DRIMA algorithms, implemented according to their open-source code.
The quantile numbers, confidence level and number of options in DWMIX are set as described in the method. The exploration rate decays linearly from 1 to 0.01 within 10e6 time steps, and the weight parameter is set to 0.1. The total training steps for the three tasks Stag_hunt, MPE and SMAC are 1M, 5M and 2M, respectively. In evaluating algorithm performance, 20, 20 and 32 test episodes are run on the three tasks respectively, with the average return used to evaluate the algorithms in the Stag_hunt and MPE tasks and the average win rate used in the SMAC task.
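For reference, the experiment settings stated above can be collected as follows (only values that appear in the text are listed; the quantile counts, confidence level and option count are specified in the patent figures and are not reproduced here):

```python
DWMIX_EXPERIMENT_CONFIG = {
    "epsilon_start": 1.0,
    "epsilon_end": 0.01,                 # linear decay, as stated
    "epsilon_decay_steps": int(10e6),    # as stated in the text
    "weight_parameter": 0.1,             # the parameter stated as 0.1
    "total_steps": {"Stag_hunt": 1_000_000, "MPE": 5_000_000, "SMAC": 2_000_000},
    "test_episodes": {"Stag_hunt": 20, "MPE": 20, "SMAC": 32},
    "eval_metric": {"Stag_hunt": "average return", "MPE": "average return",
                    "SMAC": "average win rate"},
}
```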
(4) Performance on a basic environment
Comparing the DWMIX algorithm with the conventional value decomposition algorithms, the performance comparison result on the normal Stag_hunt task is shown in fig. 7, on the normal MPE task in fig. 8, and on the normal 5mv6m task in fig. 9. The DWMIX algorithm achieves the best performance in all three tasks, even in the very hard-to-explore Stag_hunt and SMAC tasks, whereas the other conventional value decomposition algorithms often perform well in only one or two tasks and cannot adapt to three tasks of completely different styles. Further, the performance comparison results of the DWMIX algorithm and the value-distributed multi-agent reinforcement learning algorithms on the normal Stag_hunt, MPE and 5mv6m tasks are shown in fig. 10, fig. 11 and fig. 12, respectively. In the Stag_hunt and MPE tasks, the DWMIX algorithm achieves the best performance by an absolute margin, and in Stag_hunt only the DWMIX algorithm can learn an efficient strategy, which benefits from the algorithm's quantile-based weighting of underestimated action values during the learning phase. In the SMAC task, the DWMIX algorithm's performance approaches that of the RMIX algorithm, achieving the second best performance. This verifies that the distributional modification has a beneficial impact on multi-agent reinforcement learning.
1) Performance in risk setting
Whether the DWMIX algorithm can effectively adapt to environmental risk and improve the robustness of the agents' strategy learning is verified by setting risk conditions in the benchmark Stag_hunt. The DWMIX algorithm is tested in risk environments with random state transitions, random rewards, exploration, noise and dilemma, respectively.
2) Random state transitions and random rewards
The test results for random state transitions are shown in Fig. 13. The results show that the DWMIX algorithm can explore effective cooperation strategies in an uncertain environment with random state transitions and thus achieves the best performance, whereas the other value distribution multi-agent algorithms almost all fail under this risk setting and learn no effective strategy. Under the random reward setting, the DWMIX algorithm reaches performance close to the original weighted QMIX algorithm in the late training stage and learns a good cooperation strategy, while the other value distribution multi-agent algorithms clearly lack this capability. The test results for the random reward case are shown in Fig. 14. Random state transitions and random rewards are then integrated into the environment simultaneously to test the performance of the DWMIX algorithm under greater uncertainty. The test results for the random state transition + random reward case are shown in Fig. 15; only the DWMIX algorithm achieves the best performance, i.e., it can learn a better cooperation strategy in a risk environment with greater uncertainty.
Secondly, to better understand the role of the option framework that realizes the risk attitude self-adaptive adjustment function in the algorithm, the change of the agents' risk attitude under the three environmental risk settings is characterized by counting how frequently the agents select different options at different training stages. Across training stages, the DWMIX algorithm does adjust its risk attitude and does not rely solely on the mean-based quantile option. In the experiments, the agents' risk attitudes under the three environmental risk settings change as the environment changes, and the probability that the DWMIX algorithm selects the mean-based option is very small. This shows that the DWMIX algorithm with the option framework can effectively improve exploration compared with a mean-only selection mode.
3) Exploration
When the training process maintains high exploration for a long time, it becomes harder for agents to learn cooperation-prone strategies, the non-stationarity of strategy learning increases, and the environmental risk therefore grows. The DWMIX algorithm incorporates the environmental risk into the optimal joint action value, takes into account a learned joint action value that is closer to the true one, and realizes self-adaptive adjustment of the risk attitude through the option framework, so it can still find a better cooperation strategy under high exploration. Fig. 16 compares the DWMIX algorithm with the value distribution multi-agent reinforcement learning algorithms on the exploration Stag_hunt task, Fig. 17 on the exploration MPE task, and Fig. 18 on the exploration 5m vs. 6m task. Figs. 16 to 18 show that on the three high-exploration tasks the performance of the DWMIX algorithm is essentially on par with the best-performing value decomposition algorithm and is significantly better than the value distribution multi-agent algorithms.
4) Noise
To further test the robustness of the DWMIX algorithm, experiments are conducted in environments with noisy agents. Fig. 19 shows the results of the DWMIX algorithm and the other algorithms in the random Stag_hunt task when a teammate adopts a random strategy, and Fig. 20 shows the results in the adversarial Stag_hunt task when a teammate adopts an adversarial strategy. In the training phase the algorithm settings are the same as in the benchmark environment, but in the testing phase one agent is randomly designated to follow a random or adversarial strategy, i.e., one cooperating agent does not select actions according to the action value maximization principle. Fig. 20 shows that the performance of the DWMIX algorithm in Stag_hunt is not greatly affected when the teammate follows the adversarial strategy, and Fig. 19 shows that the DWMIX algorithm still obtains competitive results when the teammate follows the random strategy. Overall, across different tasks the DWMIX algorithm can effectively adapt to the presence of noisy agents and obtains a more robust strategy.
5) Dilemma
A risk condition with a dilemma is designed through reward modeling to test how environmental risk caused by non-cooperative behavior among agents affects agent policy learning. The different algorithms are compared on the SMAC task; Fig. 21 shows the performance of the DWMIX algorithm on the 5m vs. 6m task and demonstrates that it achieves state-of-the-art performance on this task.
It should be noted that, in the legends of Figs. 7 to 21, the present method is DWMIX; the first comparison algorithm is the QQMIX algorithm, the second is the QMIX algorithm, the third is the CWQMIX algorithm, the fourth is the VDN algorithm, the fifth is the DFAC-DMIX algorithm, the sixth is the DRIMA-SS algorithm, the seventh is the DQMIX algorithm, the eighth is the RMIX algorithm, and the ninth is the DFAC-DDN algorithm.
(5) Matrix game
To demonstrate the effectiveness of the DWMIX algorithm in recovering the optimal strategy and in maintaining performance when affected by variance, tests are performed on a one-step matrix game. In this experiment, two agents each have three actions to choose from and learn their strategies by maximizing the global reward. In experiment 1, no variance is set, in order to test optimal strategy recovery; in experiment 2, variance is introduced to measure how the algorithms handle risk. The DRIMA, DFAC, RMIX, DQMIX and DWMIX algorithms are compared on the two games, with the DWMIX weight set to 0.1; the reward results of the one-step game experiment for algorithms with different types of risk attitudes are shown in Table 2. In experiments 1 and 2, only the DWMIX algorithm finds a strategy closest to the optimum, and its joint action value estimate reaches 7.96 even when variance is present.
Table 2 rewards results in a one-step gaming experiment using algorithms for different types of risk attitudes.
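For intuition, the following sketch sets up a one-step cooperative matrix game of the kind used in this experiment: two agents, three actions each, a shared global reward, and optional reward variance. The payoff values, the noise level and the function names are illustrative assumptions, not the values used to produce Table 2.

import numpy as np

# Hypothetical 3x3 joint payoff: the optimal joint action (0, 0) pays 8 and
# mis-coordination with it is penalised; the patent's actual payoffs are not given here.
payoff = np.array([[  8.0, -12.0, -12.0],
                   [-12.0,   0.0,   0.0],
                   [-12.0,   0.0,   0.0]])
rng = np.random.default_rng(0)

def global_reward(a1, a2, noise_std=0.0):
    # Experiment 1 corresponds to noise_std = 0 (no variance);
    # experiment 2 adds Gaussian noise to introduce variance.
    return payoff[a1, a2] + rng.normal(0.0, noise_std)

print(global_reward(0, 0))                 # 8.0, the deterministic optimum
print(global_reward(0, 0, noise_std=2.0))  # a noisy reward around 8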
In one embodiment, a multi-machine obstacle avoidance strategy learning device based on risk attitude self-adjustment is provided, including: a task modeling module, a multi-machine obstacle avoidance strategy learning model construction module, a multi-machine obstacle avoidance strategy learning model centralized training module, and an unmanned aerial vehicle cooperative obstacle avoidance module, wherein:
the task modeling module is used for modeling the multi-unmanned aerial vehicle cooperative obstacle avoidance task into a multi-agent non-centralized part observable Markov decision process, and the information of each unmanned aerial vehicle comprises: local observation information, actions, and system status.
The multi-machine obstacle avoidance strategy learning model construction module is used for constructing a multi-machine obstacle avoidance strategy learning model based on risk attitude self-adjustment, the multi-machine obstacle avoidance strategy learning model is based on an optimistically weighted QMIX algorithm, conditional risk values are introduced on the basis of action value distribution to learn utility value distribution in the unmanned plane strategy generation process, an option framework is used for learning the self-adaptive risk attitude in a strategy layer, and an implicit quantile network is used for integrating environmental risks into a hybrid network during centralized training; the multi-machine obstacle avoidance strategy learning model comprises a first strategy network, a second strategy network, a monotone mixed network and an optimal mixed network.
The multi-machine obstacle avoidance strategy learning model centralized training module is used for intensively training the multi-machine obstacle avoidance strategy learning model initialized by the network parameters by adopting the current local observation information and the previous action and the global state of each unmanned aerial vehicle as inputs to obtain the optimal strategy of each unmanned aerial vehicle in the risk environment.
The unmanned aerial vehicle cooperation obstacle avoidance module is used for completing unmanned aerial vehicle cooperation obstacle avoidance tasks by adopting a corresponding optimal strategy.
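The model construction module above obtains a utility value distribution by superimposing a conditional risk value on the action value distribution. The following is a minimal sketch of such a step, assuming the conditional risk value is a CVaR taken over the sorted quantiles of each action's return distribution; the confidence level alpha, the quantile count and the function name are illustrative assumptions rather than the patent's exact definition.

import torch

def cvar_utility(quantiles, alpha=0.25):
    # quantiles: (..., n_actions, n_quantiles) estimates of each action's
    # return distribution.  CVaR_alpha = mean of the worst alpha-fraction of
    # quantiles, which yields a risk-averse utility per action.
    n = quantiles.shape[-1]
    k = max(1, int(alpha * n))                # number of tail quantiles kept
    sorted_q, _ = torch.sort(quantiles, dim=-1)
    return sorted_q[..., :k].mean(dim=-1)     # (..., n_actions)

q = torch.randn(1, 5, 32)                     # 5 actions, 32 quantiles
print(cvar_utility(q, alpha=0.25).shape)      # torch.Size([1, 5])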
In one embodiment, the multi-machine obstacle avoidance strategy learning model centralized training module is further configured to: set a maximum number of training rounds and a maximum number of iterations, and set the current training round number and current iteration number to 1; input the current local observation information of each unmanned aerial vehicle and its action at the previous moment into the respective first strategy network to obtain the current action value distribution and current option value of each unmanned aerial vehicle; input the current action value distributions of all unmanned aerial vehicles and the current global state into the monotonic mixed network and obtain the joint action value distribution through the super network structure of the QMIX algorithm; input the current local observation information of each unmanned aerial vehicle and its action at the previous moment into the respective second strategy network to obtain the current strategy value distribution of each unmanned aerial vehicle; input the current strategy value distributions of all unmanned aerial vehicles and the current global state into the optimal hybrid network, obtain the joint strategy value distribution through a feed-forward network, and merge the environmental risk into the joint strategy value distribution through the implicit quantile network to obtain the optimal joint strategy value distribution containing the environmental risk; calculate the utility value distribution according to the optimal joint strategy value distribution and calculate the TD target based on the utility value distribution; calculate the total model loss with a preset total loss function according to the joint action value distribution, the optimal joint strategy value distribution and the TD target; and update the parameters of the first strategy network, the second strategy network, the monotonic mixed network and the optimal hybrid network by minimizing the total model loss, update the parameters of the first target strategy network, the second target strategy network, the target monotonic mixed network and the target optimal mixed network at a preset update frequency, increase the current iteration number by 1 and perform the next iteration until the iteration number reaches the maximum iteration number, then increase the current training round number by 1 and enter the next training round until the training round number reaches the maximum training round number, thereby obtaining the optimal strategy of each unmanned aerial vehicle.
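The following is a greatly simplified, non-authoritative sketch of one centralized training iteration in the spirit of the module above. It keeps only stand-in per-UAV value-distribution networks, a generic mixing network, greedy action selection and a stand-in TD-style loss; the option head, the second strategy network, the implicit quantile network, the monotonicity constraint and formulas (6) to (12) are deliberately omitted, and all sizes and names are illustrative assumptions.

import torch
import torch.nn as nn

torch.manual_seed(0)
N_AGENTS, N_ACTIONS, N_QUANT, OBS_DIM, STATE_DIM, BATCH = 3, 5, 8, 10, 20, 4

class AgentNet(nn.Module):
    # Stand-in per-UAV network returning an action value distribution.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + N_ACTIONS, 64), nn.ReLU(),
                                 nn.Linear(64, N_ACTIONS * N_QUANT))
    def forward(self, obs, prev_action):
        return self.net(torch.cat([obs, prev_action], -1)).view(-1, N_ACTIONS, N_QUANT)

class Mixer(nn.Module):
    # Stand-in mixing network over chosen-action distributions and the global state.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(N_AGENTS * N_QUANT + STATE_DIM, 64),
                                 nn.ReLU(), nn.Linear(64, N_QUANT))
    def forward(self, chosen_dists, state):
        return self.net(torch.cat([chosen_dists.flatten(1), state], -1))

agents, mixer = AgentNet(), Mixer()
opt = torch.optim.Adam(list(agents.parameters()) + list(mixer.parameters()), 1e-3)

# One iteration on random data standing in for a replay-buffer batch.
obs = torch.randn(BATCH, N_AGENTS, OBS_DIM)
prev = torch.zeros(BATCH, N_AGENTS, N_ACTIONS)
state = torch.randn(BATCH, STATE_DIM)
reward = torch.randn(BATCH, 1)

dists = agents(obs.view(-1, OBS_DIM), prev.view(-1, N_ACTIONS)).view(BATCH, N_AGENTS, N_ACTIONS, N_QUANT)
chosen = dists.mean(-1).argmax(-1)                            # greedy action per UAV
idx = chosen[..., None, None].expand(-1, -1, 1, N_QUANT)
joint = mixer(dists.gather(2, idx).squeeze(2), state)         # joint value distribution
loss = ((joint.mean(-1, keepdim=True) - reward) ** 2).mean()  # stand-in TD-style loss
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))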
In one embodiment, the first policy network comprises: a GRU network and 3 MLP networks. The multi-machine obstacle avoidance strategy learning model centralized training module is further configured to: input the current local observation information of each unmanned aerial vehicle and its action at the previous moment into the respective first strategy network, where the observation-action information is processed by the first MLP network, the GRU network and the second MLP network in sequence and mapped to a discretized action value distribution, and the conditional risk value is superimposed to generate a discretized utility value distribution; select, according to the current option, the corresponding window in the discretized utility value distribution, execute the ε-greedy strategy therein and select an action, thereby obtaining the current action value distribution of each unmanned aerial vehicle; learn the data output by the GRU network through the third MLP network with a double-layer structure to obtain the value of the current option; and determine, according to the value of the current option, the option in the current state using the ε-greedy strategy.
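The following is a sketch of the per-UAV first policy network described above: a first MLP, a GRU, a second MLP producing a quantile distribution per action, and a two-layer third MLP producing one value per option. Layer sizes and names are illustrative assumptions, and the conditional-risk-value step sketched earlier would be applied on top of the returned distribution.

import torch
import torch.nn as nn

class RiskAwareAgent(nn.Module):
    # Sketch of the first policy network: MLP -> GRU -> MLP for the action
    # value distribution, plus a two-layer MLP head for the option values.
    def __init__(self, obs_dim=30, n_actions=6, n_quantiles=32, n_options=4, hidden=64):
        super().__init__()
        self.mlp1 = nn.Sequential(nn.Linear(obs_dim + n_actions, hidden), nn.ReLU())
        self.gru = nn.GRUCell(hidden, hidden)
        self.mlp2 = nn.Linear(hidden, n_actions * n_quantiles)    # value distribution head
        self.mlp3 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, n_options))   # option value head
        self.n_actions, self.n_quantiles = n_actions, n_quantiles

    def forward(self, obs, prev_action_onehot, h):
        x = self.mlp1(torch.cat([obs, prev_action_onehot], dim=-1))
        h = self.gru(x, h)
        dist = self.mlp2(h).view(-1, self.n_actions, self.n_quantiles)
        option_values = self.mlp3(h)
        return dist, option_values, h

agent = RiskAwareAgent()
h = torch.zeros(1, 64)
dist, opt_vals, h = agent(torch.randn(1, 30), torch.zeros(1, 6), h)
print(dist.shape, opt_vals.shape)   # torch.Size([1, 6, 32]) torch.Size([1, 4])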
In one embodiment, the multi-machine obstacle avoidance strategy learning model centralized training module is further configured to: select actions according to the mean of the action value function within the corresponding option, where the mean of the action value function in each option is given by formula (3); optimize and update the option value corresponding to each option using the value iteration formula shown in formula (4); and determine the option in the current state using the ε-greedy strategy, with the option selection shown in formula (5).
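The following is a sketch of the option-level decision described above: the quantiles of the value distribution are split into equal windows, the per-window mean of the action value function plays the role of formula (3), and the option is chosen ε-greedily as in formula (5). The value-iteration update of formula (4) is not reproduced here, and all names and sizes are illustrative assumptions.

import torch

def option_window_means(quantile_dist, n_options):
    # quantile_dist: (n_actions, n_quantiles).  Split the quantiles into
    # n_options equal windows and return each window's mean per action,
    # giving a tensor of shape (n_options, n_actions).
    n_actions, n_quant = quantile_dist.shape
    windows = quantile_dist.view(n_actions, n_options, n_quant // n_options)
    return windows.mean(dim=-1).transpose(0, 1)

def select_option(option_values, epsilon):
    # Epsilon-greedy selection over the option values from the third MLP head.
    if torch.rand(1).item() < epsilon:
        return int(torch.randint(len(option_values), (1,)).item())
    return int(option_values.argmax().item())

dist = torch.randn(6, 32)                        # 6 actions, 32 quantiles
means = option_window_means(dist, n_options=4)   # (4, 6)
k = select_option(torch.randn(4), epsilon=0.1)   # chosen option index
action = int(means[k].argmax().item())           # greedy action inside option k
print(k, action)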
In one embodiment, the multi-machine obstacle avoidance strategy learning model centralized training module is further configured to: input the current action value distribution of each unmanned aerial vehicle into the monotonic mixed network and obtain the joint action value distribution after passing through the super network structure with the monotonicity restriction; input the current action value distribution of each unmanned aerial vehicle into the optimal hybrid network and, after the feed-forward neural network, merge the environmental risk into the joint strategy value distribution through the implicit quantile network to obtain the optimal joint strategy value distribution; and obtain the joint utility value distribution according to the joint action value distribution and select actions according to the joint utility value distribution.
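The following is a sketch of a QMIX-style super network (hypernetwork) with the monotonicity restriction mentioned above, using state-conditioned non-negative weights so that the joint value is monotone in each agent's value. In the embodiment the inputs are value distributions mixed per quantile, whereas this sketch mixes a single value per agent for brevity; all sizes are illustrative assumptions.

import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    # Hypernetworks generate mixing weights from the global state; taking the
    # absolute value of the weights enforces monotonicity of the joint value.
    def __init__(self, n_agents=3, state_dim=20, embed=32):
        super().__init__()
        self.n_agents, self.embed = n_agents, embed
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed)
        self.hyper_b1 = nn.Linear(state_dim, embed)
        self.hyper_w2 = nn.Linear(state_dim, embed)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed), nn.ReLU(),
                                      nn.Linear(embed, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed)
        b1 = self.hyper_b1(state).view(b, 1, self.embed)
        hidden = torch.relu(torch.bmm(agent_qs.view(b, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)

mixer = MonotonicMixer()
print(mixer(torch.randn(4, 3), torch.randn(4, 20)).shape)   # torch.Size([4, 1])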
In one embodiment, the predetermined total loss function is shown in equation (6).
In one embodiment, the loss function between the joint action value distribution and the optimal joint policy value distribution is shown in formulas (7) to (9).
In one embodiment, the loss function between the TD target and the current optimal joint policy value distribution is shown in formula (10).
In one embodiment, the loss function between the current option value and the target option value is shown in formulas (11) and (12).
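Formulas (7) to (10) measure distances between quantile-represented distributions. The quantile Huber loss below is the usual choice for such distances in implicit-quantile-network methods and is given only as a plausible sketch under that assumption, not as the patent's exact formulas; the quantile fractions and shapes are illustrative.

import torch

def quantile_huber_loss(pred, target, taus, kappa=1.0):
    # pred: (batch, N) predicted quantiles; target: (batch, N') target quantiles;
    # taus: (batch, N) quantile fractions associated with `pred`.
    td = target.unsqueeze(1) - pred.unsqueeze(2)               # (batch, N, N')
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))
    weight = (taus.unsqueeze(2) - (td.detach() < 0).float()).abs()
    return (weight * huber / kappa).sum(dim=1).mean()

pred = torch.randn(4, 8)
target = torch.randn(4, 8)
taus = torch.linspace(0.0625, 0.9375, 8).expand(4, 8)          # quantile midpoints
print(quantile_huber_loss(pred, target, taus))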
For specific limitation of the multi-machine obstacle avoidance strategy learning device based on risk attitude self-adjustment, reference may be made to the limitation of the multi-machine obstacle avoidance strategy learning method based on risk attitude self-adjustment hereinabove, and the description thereof will not be repeated here. All or part of each module in the multi-machine obstacle avoidance strategy learning device based on risk attitude self-adjustment can be realized by software, hardware and combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (9)

1. A multi-machine obstacle avoidance strategy learning method based on risk attitude self-adjustment is characterized by comprising the following steps:
modeling a multi-unmanned aerial vehicle cooperation obstacle avoidance task into a multi-agent non-centralized part observable Markov decision process, wherein the information of each unmanned aerial vehicle comprises the following steps: locally observing information, actions, and system states;
constructing a multi-machine obstacle avoidance strategy learning model based on risk attitude self-adjustment, wherein the multi-machine obstacle avoidance strategy learning model is based on an optimistically weighted QMIX model, a conditional risk value is introduced on the basis of action value distribution to learn utility value distribution in the unmanned plane strategy generation process, an option framework is used for learning an adaptive risk attitude in a strategy layer, and an implicit quantile network is used for integrating environmental risks into a hybrid network during centralized training; the multi-machine obstacle avoidance strategy learning model comprises a first strategy network, a second strategy network, a monotone mixed network and an optimal mixed network;
The current local observation information, the previous action and the global state of each unmanned aerial vehicle are used as inputs, and the multi-machine obstacle avoidance strategy learning model after network parameter initialization is intensively trained to obtain the optimal strategy of each unmanned aerial vehicle in a risk environment;
each unmanned aerial vehicle adopts the corresponding optimal strategy to complete unmanned aerial vehicle cooperation obstacle avoidance tasks;
wherein intensively training the network-parameter-initialized multi-machine obstacle avoidance strategy learning model by using the current and historical actions and local observation information of each unmanned aerial vehicle to obtain the optimal strategy of each unmanned aerial vehicle in the risk environment comprises the following steps:
setting the maximum training wheel number and the maximum iteration number, and setting the current training wheel number and the current iteration number to be 1;
inputting the current local observation information of each unmanned aerial vehicle and the action at the previous moment into respective first strategy networks to obtain the current action value distribution and the current option value of each unmanned aerial vehicle;
inputting the current action value distribution and the current global state of all unmanned aerial vehicles into the monotonic mixed network, and obtaining joint action value distribution by adopting a super network structure in a QMIX model;
Inputting the current local observation information of each unmanned aerial vehicle and the action at the previous moment into respective second strategy networks to obtain the current strategy value distribution of each unmanned aerial vehicle;
inputting the current strategy value distribution and the current global state of all unmanned aerial vehicles into the optimal hybrid network, adopting a feed-forward network to obtain a combined strategy value distribution, and merging the environmental risk into the combined strategy value distribution through an implicit quantile network to obtain the optimal combined strategy value distribution containing the environmental risk;
calculating a utility value distribution according to the optimal joint strategy value distribution, and calculating a TD target based on the utility value distribution;
calculating the total model loss by adopting a preset total loss function according to the joint action value distribution, the optimal joint strategy value distribution, the TD target, the target option value and the current option value;
and updating parameters of the first strategy network, the second strategy network, the monotonic mixed network and the optimal mixed network by minimizing the total model loss, updating parameters of the first target strategy network, the second target strategy network, the target monotonic mixed network and the target optimal mixed network with preset updating frequency, increasing the current iteration number by 1, performing next iteration optimization until the iteration number reaches the maximum iteration number, increasing the current training round number by 1, entering the next training round until the training round number reaches the maximum training round number, and obtaining the optimal strategy of each unmanned aerial vehicle.
2. The method of claim 1, wherein the first policy network comprises: a GRU network and 3 MLP networks;
inputting the current local observation information of each unmanned aerial vehicle and the action at the previous moment into respective first policy networks to obtain the current action value distribution and the current option value of each unmanned aerial vehicle, wherein the method comprises the following steps:
inputting the current local observation information of each unmanned aerial vehicle and the action at the previous moment into the respective first strategy network, processing them by the first MLP network, the GRU network and the second MLP network in sequence to map the observation-action information to a discretized action value distribution, and superimposing the conditional risk value to generate a discretized utility value distribution;
selecting, according to the current option, the corresponding window in the discretized utility value distribution, executing the ε-greedy strategy therein and selecting an action, to obtain the current action value distribution of each unmanned aerial vehicle;
learning the data output by the GRU network through the third MLP network with a double-layer structure to obtain the value of the current option;
determining, according to the value of the current option, the option in the current state by adopting the ε-greedy strategy.
3. The method of claim 2, wherein determining the option in the current state according to the current option value by adopting the ε-greedy strategy comprises:
selecting actions according to the mean of the action value function within the corresponding option; the mean of the action value function in each option is as follows:
wherein the left-hand quantity is the mean of the action value function in the corresponding option, one parameter denotes the window size, and the other denotes that the quantiles are divided into the corresponding number of windows;
optimizing and updating the option value corresponding to each option by adopting a value iterative optimization formula; the value iterative optimization formula is as follows:
wherein the terms denote, respectively, the termination probability of the option, the step size, the new option selected with the corresponding probability, the current reward, the discount factor, the value of the current option, the next global state, the current global state, and the value output for the current option when the next global state is input;
by using the ε-greedy strategy, the option in the current state is determined; the option is selected as follows:
wherein one term is the option at the previous moment, and the other indicates that one option is randomly selected from the set of options.
4. The method of claim 1, wherein selecting the corresponding window in the discretized utility value distribution, executing the ε-greedy strategy and selecting actions comprises:
Inputting the current action value distribution of each unmanned aerial vehicle into the monotonic mixed network, and obtaining the joint action value distribution after passing through a super network structure with monotonicity limitation;
inputting the current action value distribution of each unmanned aerial vehicle into the optimal mixed network, and, after passing through a feedforward neural network, merging the environmental risk into the joint strategy value distribution through the implicit quantile network to obtain the optimal joint strategy value distribution;
and obtaining the joint utility value distribution according to the joint action value distribution, and selecting actions according to the joint utility value distribution.
5. The method of claim 1, wherein the predetermined total loss function is:
wherein the terms denote, respectively: the preset total loss function; three parameters controlling the importance of each loss term; the loss function between the current option value and the target option value; the loss function between the joint action value distribution and the optimal joint strategy value distribution; the loss function between the TD target and the current joint action value distribution; the global state at the previous step; the joint action at the previous step; and the environmental risk.
6. The method of claim 5, wherein a loss function between the federated action value distribution versus the optimal federated policy value distribution is:
wherein the terms denote, respectively: the joint action value distribution; the TD target; the weight; a constant; the global state at the previous step; the joint action at the previous step; the option at the previous step; the current reward; the global state, history h and joint action at the next step; the joint utility value distribution; and the size of the mini-batch sampled from the replay buffer.
7. The method of claim 5, wherein the loss function between the TD target and the current joint policy value distribution is:
wherein the terms denote, respectively: the environmental risk; the optimal joint policy value distribution; the TD target; the current reward; the discount factor; the optimal joint strategy value distribution at the next step; and the global state, history h and joint action at the next step.
8. The method of claim 5, wherein the loss function between the current option value and the target option value is:
wherein,representing the value of the option when the status and option are entered as the next step, < >>For the value of the current option->Representation option->Global state in next step->Probability of lower termination, < >>For the current reward->As a discount factor, the number of times the discount is calculated,for the global state of the next step, +. >To be probability->A new option selected; />For the last option +.>Global for the last stepStatus of the device.
9. Multi-machine obstacle avoidance strategy learning device based on risk attitude self-adjustment, which is characterized by comprising:
the task modeling module is used for modeling the multi-unmanned aerial vehicle cooperative obstacle avoidance task into a multi-agent non-centralized part observable Markov decision process, and the information of each unmanned aerial vehicle comprises: locally observing information, actions, and system states;
the multi-machine obstacle avoidance strategy learning model construction module is used for constructing a multi-machine obstacle avoidance strategy learning model based on risk attitude self-adjustment, the multi-machine obstacle avoidance strategy learning model is based on an optimistically weighted QMIX model, conditional risk values are introduced on the basis of action value distribution to learn utility value distribution in the unmanned plane strategy generation process, an option framework is used for learning the self-adaptive risk attitude in a strategy layer, and an implicit quantile network is used for integrating environmental risks into a hybrid network during centralized training; the multi-machine obstacle avoidance strategy learning model comprises a first strategy network, a second strategy network, a monotone mixed network and an optimal mixed network;
the multi-machine obstacle avoidance strategy learning model centralized training module is used for intensively training the multi-machine obstacle avoidance strategy learning model initialized by the network parameters by adopting the current local observation information and the action at the previous moment and the global state of each unmanned plane as inputs to obtain the optimal strategy of each unmanned plane in a risk environment;
The unmanned aerial vehicle cooperation obstacle avoidance module is used for completing unmanned aerial vehicle cooperation obstacle avoidance tasks by adopting the corresponding optimal strategies by each unmanned aerial vehicle;
the multi-machine obstacle avoidance strategy learning model centralized training module is further used to: set a maximum number of training rounds and a maximum number of iterations, and set the current training round number and current iteration number to 1; input the current local observation information of each unmanned aerial vehicle and the action at the previous moment into the respective first strategy networks to obtain the current action value distribution and current option value of each unmanned aerial vehicle; input the current action value distributions and the current global state of all unmanned aerial vehicles into the monotonic mixed network and obtain the joint action value distribution using the super network structure of the QMIX model; input the current local observation information of each unmanned aerial vehicle and the action at the previous moment into the respective second strategy networks to obtain the current strategy value distribution of each unmanned aerial vehicle; input the current strategy value distributions and the current global state of all unmanned aerial vehicles into the optimal hybrid network, obtain the joint strategy value distribution using a feed-forward network, and merge the environmental risk into the joint strategy value distribution through the implicit quantile network to obtain the optimal joint strategy value distribution containing the environmental risk; calculate the utility value distribution according to the optimal joint strategy value distribution and calculate the TD target based on the utility value distribution; calculate the total model loss with a preset total loss function according to the joint action value distribution, the optimal joint strategy value distribution, the TD target, the target option value and the current option value; and update parameters of the first strategy network, the second strategy network, the monotonic mixed network and the optimal mixed network by minimizing the total model loss, update parameters of the first target strategy network, the second target strategy network, the target monotonic mixed network and the target optimal mixed network at a preset update frequency, increase the current iteration number by 1 and perform the next iteration until the iteration number reaches the maximum iteration number, then increase the current training round number by 1 and enter the next training round until the training round number reaches the maximum training round number, thereby obtaining the optimal strategy of each unmanned aerial vehicle.
CN202311379344.3A 2023-10-24 2023-10-24 Multi-machine obstacle avoidance strategy learning method and device based on risk attitude self-adjustment Active CN117111640B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311379344.3A CN117111640B (en) 2023-10-24 2023-10-24 Multi-machine obstacle avoidance strategy learning method and device based on risk attitude self-adjustment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311379344.3A CN117111640B (en) 2023-10-24 2023-10-24 Multi-machine obstacle avoidance strategy learning method and device based on risk attitude self-adjustment

Publications (2)

Publication Number Publication Date
CN117111640A CN117111640A (en) 2023-11-24
CN117111640B true CN117111640B (en) 2024-01-16

Family

ID=88797038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311379344.3A Active CN117111640B (en) 2023-10-24 2023-10-24 Multi-machine obstacle avoidance strategy learning method and device based on risk attitude self-adjustment

Country Status (1)

Country Link
CN (1) CN117111640B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111814988A (en) * 2020-07-07 2020-10-23 北京航空航天大学 Testing method of multi-agent cooperative environment reinforcement learning algorithm
WO2022133090A1 (en) * 2020-12-17 2022-06-23 Intel Corporation Adaptive generation and assessment of autonomous vehicle critical scenarios
CN112947581A (en) * 2021-03-25 2021-06-11 西北工业大学 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning
CN113110592A (en) * 2021-04-23 2021-07-13 南京大学 Unmanned aerial vehicle obstacle avoidance and path planning method
CN113341958A (en) * 2021-05-21 2021-09-03 西北工业大学 Multi-agent reinforcement learning movement planning method with mixed experience
CN114020001A (en) * 2021-12-17 2022-02-08 中国科学院国家空间科学中心 Mars unmanned aerial vehicle intelligent control method based on depth certainty strategy gradient learning
CN114859899A (en) * 2022-04-18 2022-08-05 哈尔滨工业大学人工智能研究院有限公司 Actor-critic stability reinforcement learning method for navigation obstacle avoidance of mobile robot
CN115186807A (en) * 2022-05-19 2022-10-14 南京大学 Method for decomposing performance of multi-agent reinforcement learning algorithm by using optimistic mapping promotion value
CN116401518A (en) * 2023-04-11 2023-07-07 中国人民解放军国防科技大学 Method and device for enhancing multi-agent strategy learning stability
CN116430898A (en) * 2023-04-23 2023-07-14 西安工业大学 Improved QMIX method applied to unmanned aerial vehicle cooperative countermeasure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-UAV Collision Avoidance using Multi-Agent Reinforcement Learning with Counterfactual Credit Assignment; Shuangyao; arXiv; pp. 1-8 *
Research on multi-UAV cooperative confrontation algorithm based on improved reinforcement learning; Zhang Lei; Journal of Ordnance Equipment Engineering; Vol. 44, No. 5; pp. 230-237 *

Also Published As

Publication number Publication date
CN117111640A (en) 2023-11-24

Similar Documents

Publication Publication Date Title
Foerster et al. Stabilising experience replay for deep multi-agent reinforcement learning
Hernandez-Leal et al. A survey and critique of multiagent deep reinforcement learning
CN111291890B (en) Game strategy optimization method, system and storage medium
Hao et al. Exploration in deep reinforcement learning: From single-agent to multiagent domain
Xu et al. Learning to explore via meta-policy gradient
Santos et al. Dyna-H: A heuristic planning reinforcement learning algorithm applied to role-playing game strategy decision systems
Jafferjee et al. Hallucinating value: A pitfall of dyna-style planning with imperfect environment models
Kapturowski et al. Human-level Atari 200x faster
CN112069504A (en) Model enhanced defense method for resisting attack by deep reinforcement learning
CN114925850B (en) Deep reinforcement learning countermeasure defense method for disturbance rewards
Oh et al. Learning to sample with local and global contexts in experience replay buffer
Heinrich et al. Self-play Monte-Carlo tree search in computer poker
CN113239472B (en) Missile guidance method and device based on reinforcement learning
Hogewind et al. Safe reinforcement learning from pixels using a stochastic latent representation
CN117111640B (en) Multi-machine obstacle avoidance strategy learning method and device based on risk attitude self-adjustment
CN117648548A (en) Intelligent decision method and device based on offline-online hybrid reinforcement learning
CN117705113A (en) Unmanned aerial vehicle vision obstacle avoidance and autonomous navigation method for improving PPO
Li et al. Learning adversarial policy in multiple scenes environment via multi-agent reinforcement learning
Yang et al. A Survey on Multiagent Reinforcement Learning Towards Multi-Robot Systems.
CN114757092A (en) System and method for training multi-agent cooperative communication strategy based on teammate perception
Zhou et al. Deep reinforcement learning based intelligent decision making for two-player sequential game with uncertain irrational player
Xu et al. Efficient multi-goal reinforcement learning via value consistency prioritization
Zhang et al. Expode: Exploiting policy discrepancy for efficient exploration in multi-agent reinforcement learning
Tan et al. IMPLANT: an integrated MDP and POMDP learning AgeNT for adaptive games
Chen et al. Offline Fictitious Self-Play for Competitive Games

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant