CN117748747A - AUV cluster energy online monitoring and management system and method

AUV cluster energy online monitoring and management system and method

Info

Publication number
CN117748747A
CN117748747A (application CN202410191318.6A)
Authority
CN
China
Prior art keywords
energy
auv
value
strategy
cluster
Prior art date
Legal status
Granted
Application number
CN202410191318.6A
Other languages
Chinese (zh)
Other versions
CN117748747B (en)
Inventor
陈云赛
刘祺
刘增凯
孙尧
刘子然
张栋
姜清华
李志彤
邢会明
陈泓洲
张辰玮
Current Assignee
Qingdao Harbin Engineering University Innovation Development Center
Original Assignee
Qingdao Harbin Engineering University Innovation Development Center
Priority date
Filing date
Publication date
Application filed by Qingdao Harbin Engineering University Innovation Development Center filed Critical Qingdao Harbin Engineering University Innovation Development Center
Priority to CN202410191318.6A priority Critical patent/CN117748747B/en
Priority claimed from CN202410191318.6A external-priority patent/CN117748747B/en
Publication of CN117748747A publication Critical patent/CN117748747A/en
Application granted granted Critical
Publication of CN117748747B publication Critical patent/CN117748747B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Feedback Control In General (AREA)

Abstract

The invention provides an AUV cluster energy on-line monitoring and management system and method, belonging to the technical field of ship energy management. The system comprises an energy supply unit, an energy storage unit, an energy consumption unit and an energy optimization management and control system, and the method comprises the following steps: each AUV in the AUV cluster is regarded as an agent, and the sub-strategy of each agent is obtained through parallel training with hierarchical reinforcement learning; energy optimization is performed through AUV cluster collision avoidance to generate a cluster cooperative operation strategy; and a neural network model and an agent model are designed to obtain the optimal action of each AUV, the optimal action being the action with the least energy consumption. With the invention, the AUV cluster can automatically adjust its energy scheduling according to real-time collision-avoidance constraints and task demands, realizing economical and efficient energy operation. A reinforcement learning optimization method is introduced, and global energy optimization management and control of the AUV cluster is realized from the energy characteristics, so that the AUV cluster can complete its operation tasks under limited energy conditions.

Description

AUV cluster energy online monitoring and management system and method
Technical Field
The invention belongs to the technical field of ship energy management, and particularly relates to an AUV cluster energy on-line monitoring and management system and method.
Background
In a deep-sea unmanned operation environment, an AUV cluster must be able to autonomously reason about unknown environment states and collaboratively plan operation tasks. It must combine the data from its various payloads, such as optical, acoustic and chemical sensors, to make timely intelligent decisions on the working states of all control units, so that observation tasks are completed efficiently and shortcomings exposed during a task are corrected in time.
However, an AUV cluster is heterogeneous, and the task conditions it faces are complex, variable and subject to many disturbances. To ensure the autonomy and adaptability of the AUV cluster in a complex deep-sea environment, the intelligence level of its energy management and operation control should be improved, so as to achieve autonomous navigation and positioning, dynamic networking, autonomous detection and decision-making, autonomous operation execution, and self-adaptive optimization of the energy supply. This saves energy and enables the AUV cluster to keep working for a longer time on the limited energy available under water. The AUV cluster energy system is an important guarantee for carrying out deep-sea observation, navigation and communication, precise operation, transportation, and other tasks.
Because an AUV works on deep-sea tasks for long periods and needs strong endurance, it is equipped with a power system. Under limited energy supply, an energy management strategy that efficiently manages and optimizes the operation of the energy system plays an important role in saving energy and reducing consumption. Such a strategy should also have self-learning capability, so that it can schedule energy reasonably and effectively in emergencies and guarantee the economic operation of the AUV cluster.
However, when an AUV cluster faces complex operations, its energy supply and demand are uncertain and span multiple spatio-temporal scales, and its overall behavior is complex, which makes conventional model-based control difficult to apply. It is therefore necessary to design a reinforcement learning optimization method, starting from the energy characteristics of the AUV cluster energy system, such as its complex storage and utilization behavior, heterogeneous data and diversified energy demands, to achieve overall energy optimization management and control of the AUV cluster.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an AUV cluster energy on-line monitoring and management system and method, which are reasonably designed, overcome the defects of the prior art and achieve good results.
An AUV cluster energy on-line monitoring and management system comprises an energy supply unit, an energy storage unit, an energy consumption unit and an energy optimization management and control system;
the energy supply unit is used for supplying energy and comprises the power generation system and battery system of an energy supply AUV;
the energy storage unit is used for storing energy: each AUV carries energy storage equipment and provides energy for the energy consumption unit;
the energy consumption unit is used for using energy and mainly refers to the energy-consuming equipment of all AUVs, including sensor equipment, operation equipment, power equipment and communication equipment;
the energy optimization management and control system is used for distributing energy according to the energy demand of each energy supply and storage unit and the current energy state monitored in real time, and for outputting action control instructions to the power execution structure of each AUV.
Further, the energy supply AUV is an AUV equipped with a double-pontoon raft-type wave energy capture device.
An AUV cluster energy on-line monitoring and management method, using the above AUV cluster energy on-line monitoring and management system, comprises the following steps:
S1, each AUV in the AUV cluster is regarded as an agent, and the sub-strategy of each agent is obtained through parallel training with hierarchical reinforcement learning;
S2, energy optimization is performed through clustering-based collision avoidance of the AUV cluster to generate a cluster cooperative operation strategy;
S3, a neural network model and an agent model are designed to obtain the optimal action of each AUV, the optimal action being the action with the least energy consumption.
Further, the step S1 includes the following substeps:
S1.1, decomposing the AUV cluster optimization control problem into a plurality of subtasks, each subtask focusing on a part of the decision-making process in the AUV cluster, and determining the agent corresponding to each subtask;
The current Q-value function of a subtask, $Q(s,a)$, is:
$Q(s,a)=\mathbb{E}\left[R_t \mid s_t=s,\; a_t=a\right]$ ;(1)
where $R_t$ is the return obtained at time step $t$, $s$ is a state, $a$ is an action, $s_t$ is the current state, $a_t$ is the current action, and $\mathbb{E}[\cdot]$ is the expected value;
The predictive Q-value function of a subtask is:
$Q(s,a)=\mathbb{E}\left[r_t+\gamma \max_{a'}Q(s',a')\right]$ ;(2)
where $r_t$ is the reward at time step $t$, $\gamma$ is the discount factor, and $s'$ and $a'$ are the next state and action, respectively;
S1.2, designing a reinforcement learning hierarchy, in which a high-level agent formulates long-term strategy targets, i.e. the high-level strategy, and low-level agents are responsible for executing specific actions;
The Q-value function of the high-level strategy is:
$Q(s,o)=\mathbb{E}\left[R_t \mid s_t=s,\; o_t=o\right]$ ;(3)
where $o$ is a subtask or option and $o_t$ is the subtask or option at the current time $t$;
S1.3, parallel training: for each subtask, an agent is trained independently to accelerate the learning process;
S1.4, sub-strategy learning: each agent learns its sub-strategy by interacting with the marine environment and optimizes it with a reinforcement learning algorithm; during the strategy optimization iterations, the value function of a subtask or of the high-level strategy is updated so as to satisfy the Bellman equation;
The update expression of the Q-value function of a subtask is:
$Q(s,a) \leftarrow Q(s,a)+\alpha\left[r+\gamma \max_{a'}Q(s',a')-Q(s,a)\right]$ ;(4)
where $\alpha$ is the learning rate, $r$ is the reward, $a'$ is a possible action in the new state, and $s'$ is the new state reached after taking action $a$.
S1.5, further adjusting the optimized sub-strategies through the multi-agent cooperative strategy. In the sub-strategy adjustment an option model is used: each sub-strategy is represented by a triplet $(I_o,\pi_o,\beta_o)$, where $I_o$ is the set of states in which option $o$ can be initiated, $\pi_o$ is the strategy followed under option $o$, and $\beta_o(s)$ is the probability that option $o$ terminates in state $s$;
When an option is executed, the agent learns not only when to switch options but also the optimal behavior inside each option; the Q-value function of a subtask after executing option $o$ is:
$Q(s,o)=\mathbb{E}\left[r_{t+1}+\gamma r_{t+2}+\cdots+\gamma^{k-1} r_{t+k}+\gamma^{k}\max_{o'}Q(s_{t+k},o')\right]$, where $k$ is the number of time steps for which the option lasts (5).
further, S2 comprises the following sub-steps:
S2.1, based on a graph-theoretic clustering algorithm, searching for the position and speed of the AUV cluster center with the lowest energy consumption as the objective;
S2.2, acquiring the motion trajectory of each AUV and performing a collision check every t time steps; if a collision would occur, executing step S2.3, otherwise executing step S2.4;
S2.3, introducing a penalty function, recalculating, and searching again for a position and speed of the AUV cluster center that satisfy the constraints;
S2.4, checking the generated AUV motion trajectories; if no collision occurs along the whole path, executing step S2.5, otherwise returning to step S2.3 for recalculation;
S2.5, updating the motion trajectories in real time until each AUV reaches its end point, thereby generating several collision-free motion trajectories for each AUV;
S2.6, calculating the energy consumption of all collision-free motion trajectories of each AUV, selecting the trajectory with the lowest energy consumption for each AUV, and calculating the lowest energy consumption of the AUV cluster as the cluster cooperative operation strategy.
Further, the penalty function is introduced by adding a penalty term to the objective function, which has the form:
$F(x)=E(x)+\lambda P(x)$ ;(6)
where $F(x)$ is the objective function, $E(x)$ is the energy consumption function, $P(x)$ is the penalty function, and $\lambda$ is a weight parameter used to adjust the influence of the penalty term.
Further, S3 comprises the following sub-steps:
S3.1, designing a neural network model, wherein the neural network approximates the Q-value function and comprises n local Q-value networks and one global Q-value network, n being the number of AUVs in the AUV cluster, and each local Q-value network being a recurrent neural network; the sub-strategy and high-level strategy obtained in step S1 and the historical environmental-observation data received by each agent are used as the input of the corresponding local Q-value network, which outputs an evaluation of the local Q-value function;
The global value network adopts a mixing network structure: the local Q-value networks are modeled as a graph structure, a graph attention network is used as the approximator to model the relationships among agents, and the outputs of the local Q-value networks together with the cooperative operation strategy obtained in step S2 are used as inputs to obtain, for each local Q-value network, the weight of its selected Q-value function that maximizes the global Q value;
S3.2, designing an agent model;
A control instruction generated for an action that the AUV needs to complete under water is defined as an action;
The agent state set is: {0: position, 1: attitude, 2: speed, 3: azimuth, 4: type, 5: operating state, 6: energy state}; the agent types include energy supply AUV, operation AUV and communication AUV; the energy state includes the energy supply, energy storage and energy consumption states; the agent action set is: {0: ascend, 1: dive, 2: turn left, 3: turn right, 4: start operation, 5: stop operation, 6: no operation};
The reward function is designed as:
$R(s,a)=w_1 r_{\text{goal}}+w_2 r_{\text{energy}}+w_3 r_{\text{safe}}+w_4 r_{\text{task}}$ ;(7)
where $r_{\text{goal}}$ indicates the proximity of the agent to the target; $r_{\text{energy}}$ represents the energy efficiency of the agent; $r_{\text{safe}}$ represents the safety of the agent executing action $a$; $r_{\text{task}}$ represents the effect of the agent performing the particular task; and $w_1,\dots,w_4$ are weight factors adjusted according to the task;
S3.3, training the neural network model to obtain a trained neural network model;
S3.4, inputting the sub-strategy and high-level strategy obtained in step S1 and the historical environmental-observation data received by the agent of each AUV into the trained local Q-value networks, which output the evaluation of each local Q-value function;
S3.5, the global Q-value network takes the outputs of the local Q-value networks and the cooperative operation strategy obtained in step S2 as input, uses the graph attention method to calculate the weight of the Q-value function selected by each local Q-value network so as to maximize the global Q value, and obtains the action corresponding to the maximum global Q value, which comprises the optimal action of each AUV; the actions are packaged into environment-executable instructions and sent to the AUV cluster.
Further, the recurrent neural network is a GRU neural network, whose update formulas are:
Reset gate: $r_t=\sigma\left(W_r\,[h_{t-1},x_t]+b_r\right)$ ;(8)
Update gate: $z_t=\sigma\left(W_z\,[h_{t-1},x_t]+b_z\right)$ ;(9)
Candidate hidden state: $\tilde{h}_t=\tanh\left(W_h\,[r_t\odot h_{t-1},\,x_t]+b_h\right)$ ;(10)
Final hidden state: $h_t=(1-z_t)\odot h_{t-1}+z_t\odot\tilde{h}_t$ ;(11)
where $r_t$ is the reset gate vector at time step $t$; $\sigma$ is the Sigmoid activation function; $W_r$ is the weight matrix associated with the reset gate; $h_{t-1}$ is the hidden state at time step $t-1$; $x_t$ is the input feature vector at time step $t$; $b_r$ is the bias vector associated with the reset gate; $z_t$ is the update gate vector at time step $t$; $W_z$ is the weight matrix associated with the update gate; $b_z$ is the bias vector associated with the update gate; $\tilde{h}_t$ is the candidate hidden state at time step $t$; $W_h$ is the weight matrix of the hidden state; $b_h$ is the bias vector of the hidden state; and $h_t$ is the final hidden state at time step $t$.
Further, the mixing network corresponding to the global Q-value function is trained with a graph attention network; the TD loss calculated from the global reward and the global Q-value function is back-propagated, and the global mixing network and the local Q-value function networks are trained simultaneously;
The iterative formula of the global Q-value function is:
$Q(s,a) \leftarrow Q(s,a)+\alpha\left[r_t+\gamma \max_{a'}Q(s',a')-Q(s,a)\right]$ ;(12)
where $Q(s,a)$ denotes the expected return (Q value) of taking action $a$ in state $s$; $r_t$ is the instant reward obtained at time step $t$; and $\max_{a'}Q(s',a')$ is the maximum Q value over all possible actions in the next state $s'$;
The TD loss is:
$\delta_t=r_t+\gamma \max_{a'}Q(s',a';\theta^{-})-Q(s,a;\theta)$ ;(13)
where $\delta_t$ is the temporal-difference (TD) error at time step $t$, used to evaluate the gap between the current Q value and the target Q value; $\theta$ are the parameters of the current strategy; and $\theta^{-}$ are the parameters of the target strategy.
Further, the attention mechanism of the graph attention network includes the attention coefficient:
$\alpha_{ij}=\dfrac{\exp\left(\mathrm{LeakyReLU}\left(a^{\top}[W h_i \,\|\, W h_j]\right)\right)}{\sum_{k\in \mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}\left(a^{\top}[W h_i \,\|\, W h_k]\right)\right)}$ ;(14)
where $\alpha_{ij}$ is the attention coefficient of node $i$ for its neighbor node $j$; $\exp$ is the natural exponential function used to normalize the attention coefficients; $\mathrm{LeakyReLU}$ is the activation function; $a^{\top}$ is the transpose of the parameter vector of the attention mechanism; $W$ is a weight matrix used to linearly transform the input features; $h_i$ is the feature vector of node $i$; $\|$ denotes vector concatenation; $\mathcal{N}_i$ is the set of neighbor nodes of node $i$; and $h_i'$ is the updated feature vector of node $i$ under the attention mechanism;
The updated node feature $h_i'$ is:
$h_i'=\sigma\left(\sum_{j\in \mathcal{N}_i}\alpha_{ij} W h_j\right)$ (15).
the invention has the beneficial technical effects that:
the goals of autonomous navigation, autonomous collision avoidance, autonomous task allocation and the like are realized by improving the intelligent level of the energy management and operation control of the AUV cluster, so that the AUV cluster can continuously work for a longer time in a deep sea environment. The AUV cluster can automatically adjust energy scheduling according to real-time collision prevention condition constraint and task requirements, so that economical and efficient energy operation is realized. The reinforcement learning optimization method is introduced, and global energy optimization management and control of the AUV cluster are realized from the energy characteristics, so that the AUV cluster can finish the operation task under the limited energy condition. And evaluating the behaviors of the agent by using the neural network to ensure the output of the optimal behaviors.
Drawings
Fig. 1 is a schematic diagram of an AUV cluster energy on-line monitoring and management system according to the present invention.
Fig. 2 is a flowchart of an AUV cluster energy on-line monitoring and management method according to the present invention.
FIG. 3 is a flow chart of obtaining sub-policies of each agent in the present invention.
FIG. 4 is a flow chart of generating a cluster co-operation policy in the present invention.
Fig. 5 is a flowchart of obtaining the optimal action of each AUV according to the present invention.
Detailed Description
The following is a further description of embodiments of the invention, in conjunction with the specific examples:
an AUV cluster energy on-line monitoring and management system, as shown in figure 1, comprises an energy supply unit, an energy storage unit, an energy consumption unit and an energy optimization management and control system;
the energy supply unit is used for supplying energy, comprises the power generation system and battery system of the energy supply AUV, and uses its own mechanisms to capture energy from external sources such as wave energy, providing energy for the energy storage unit and the energy consumption unit; the energy supply AUV is an AUV with a double-pontoon raft-type wave energy capture device;
the energy storage unit is used for storing energy: each AUV carries energy storage equipment such as storage batteries, supercapacitors and fuel cells, which provide energy for the energy consumption unit;
the energy consumption unit is used for using energy and mainly refers to the energy-consuming equipment of all AUVs, including sensor equipment, operation equipment, power equipment and communication equipment;
the energy optimization management and control system is used for distributing energy according to the energy demands of the energy supply and storage units and the current energy state monitored in real time, and for outputting action control instructions to the power execution structure of each AUV.
An AUV cluster energy on-line monitoring and management method, using the above AUV cluster energy on-line monitoring and management system and shown in figure 2, comprises the following steps:
s1, regarding each AUV in an AUV cluster as an agent, and obtaining each agent sub-strategy through hierarchical reinforcement learning parallel training;
as shown in fig. 3, S1 includes the following sub-steps:
s1.1, decomposing the AUV cluster optimization control problem into a plurality of subtasks, each subtask focusing on a part of the decision-making process in the AUV cluster, and determining the agent corresponding to each subtask;
Value function of a subtask: for each subtask, a state-action Q-value function $Q(s,a)$ can be defined, expressing the expected return of taking action $a$ in state $s$ and thereafter following the current policy;
The current Q-value function of a subtask is:
$Q(s,a)=\mathbb{E}\left[R_t \mid s_t=s,\; a_t=a\right]$ ;(1)
where $R_t$ is the return obtained at time step $t$, $s$ is a state, $a$ is an action, $s_t$ is the current state, and $a_t$ is the current action;
The predictive Q-value function of a subtask is:
$Q(s,a)=\mathbb{E}\left[r_t+\gamma \max_{a'}Q(s',a')\right]$ ;(2)
where $r_t$ is the reward at time step $t$, $\gamma$ is the discount factor, and $s'$ and $a'$ are the next state and action, respectively;
s1.2, designing a reinforcement learning hierarchy, in which a high-level agent formulates long-term strategy targets, i.e. the high-level strategy, and low-level agents are responsible for executing specific actions; this structure allows the agents to learn on different decision levels, which simplifies the learning process and reduces the size of the state-action space;
The Q-value function of the high-level strategy is:
$Q(s,o)=\mathbb{E}\left[R_t \mid s_t=s,\; o_t=o\right]$ ;(3)
where $o$ is a subtask or option and $o_t$ is the subtask or option at the current time $t$;
s1.3, parallel training: aiming at each subtask, independently training an agent to accelerate the learning process;
s1.4, sub-strategy learning: each agent learns its sub-strategy by interacting with the marine environment and optimizes it with a reinforcement learning algorithm; during the strategy optimization iterations, the value function of a subtask or of the high-level strategy is updated so as to satisfy the Bellman equation;
The update expression of the Q-value function of a subtask is:
$Q(s,a) \leftarrow Q(s,a)+\alpha\left[r+\gamma \max_{a'}Q(s',a')-Q(s,a)\right]$ ;(4)
where $\alpha$ is the learning rate, $\gamma$ is the discount factor, $r$ is the reward, $s'$ is the new state reached after taking action $a$, and $a'$ is a possible action in the new state.
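For illustration only, the following minimal Python sketch shows how the subtask Q-value update of formula (4) could be implemented for one agent with a tabular Q function; the state and action encoding, learning rate and discount factor are illustrative assumptions, not values prescribed by the invention.

```python
from collections import defaultdict

def update_subtask_q(q_table, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One tabular update: Q(s,a) <- Q(s,a) + alpha*[r + gamma*max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(q_table[(s_next, a_next)] for a_next in actions)
    td_target = r + gamma * best_next
    q_table[(s, a)] += alpha * (td_target - q_table[(s, a)])
    return q_table[(s, a)]

# Illustrative usage: 7 discrete actions as in the agent action set {0..6}.
actions = range(7)
q_table = defaultdict(float)          # Q values default to 0
update_subtask_q(q_table, s="surfaced", a=1, r=-0.2, s_next="diving", actions=actions)
```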
S1.5, further adjusting the optimized sub-strategies through the multi-agent cooperative strategy. In the sub-strategy adjustment an option model is used: each sub-strategy is represented by a triplet $(I_o,\pi_o,\beta_o)$, where $I_o$ is the set of states in which option $o$ can be initiated, $\pi_o$ is the strategy followed under option $o$, and $\beta_o(s)$ is the probability that option $o$ terminates in state $s$;
When an option is executed, the agent learns not only when to switch options but also the optimal behavior inside each option; the Q-value function of a subtask after executing option $o$ is:
$Q(s,o)=\mathbb{E}\left[r_{t+1}+\gamma r_{t+2}+\cdots+\gamma^{k-1} r_{t+k}+\gamma^{k}\max_{o'}Q(s_{t+k},o')\right]$, where $k$ is the number of time steps for which the option lasts (5).
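As a sketch only, the option triplet $(I_o,\pi_o,\beta_o)$ described above could be represented as follows; the fields, state labels and termination test are assumptions chosen to mirror the standard options framework rather than a prescribed implementation.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation_set: Set[str]                  # I_o: states where the option may start
    policy: Callable[[str], int]              # pi_o: maps a state to a low-level action
    termination_prob: Callable[[str], float]  # beta_o(s): probability of terminating in s

    def can_start(self, state: str) -> bool:
        return state in self.initiation_set

    def should_terminate(self, state: str) -> bool:
        return random.random() < self.termination_prob(state)

# Illustrative option: dive until the AUV reaches working depth.
dive_option = Option(
    initiation_set={"surfaced", "shallow"},
    policy=lambda s: 1,                                   # action 1 = dive
    termination_prob=lambda s: 1.0 if s == "at_depth" else 0.1,
)
```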
s2, performing energy optimization through clustering-based collision avoidance of the AUV cluster to generate a cluster cooperative operation strategy;
as shown in fig. 4, S2 includes the following sub-steps:
s2.1, based on a graph-theoretic clustering algorithm, searching for the position and speed of the AUV cluster center with the lowest energy consumption as the objective;
s2.2, acquiring the motion trajectory of each AUV and performing a collision check every t time steps; if a collision would occur, executing step S2.3, otherwise executing step S2.4;
s2.3, introducing a penalty function, recalculating, and searching again for a position and speed of the AUV cluster center that satisfy the constraints;
The penalty function is introduced by adding a penalty term to the objective function; this penalty term is related to the degree to which the constraints are violated. When the optimization algorithm tries to optimize the objective function, a solution that violates the constraints increases the value of the objective function through the penalty term, making such a solution less attractive during optimization.
By introducing the penalty function, the optimization algorithm can better explore the part of the solution space that satisfies the constraints and finally find an optimal solution, or a near-optimal solution, that meets the constraint conditions.
The objective function has the following form:
$F(x)=E(x)+\lambda P(x)$ ;(6)
where $F(x)$ is the objective function, $E(x)$ is the energy consumption function, $P(x)$ is the penalty function, and $\lambda$ is a weight parameter used to adjust the influence of the penalty term.
s2.4, checking the generated AUV motion trajectories; if no collision occurs along the whole path, executing step S2.5, otherwise returning to step S2.3 for recalculation;
s2.5, updating the motion trajectories in real time until each AUV reaches its end point, thereby generating several collision-free motion trajectories for each AUV;
s2.6, calculating the energy consumption of all collision-free motion trajectories of each AUV, selecting the trajectory with the lowest energy consumption for each AUV, and calculating the lowest energy consumption of the AUV cluster as the cluster cooperative operation strategy.
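A minimal sketch of the penalized trajectory selection in steps S2.3 to S2.6 is given below; the energy and penalty functions shown are placeholder assumptions, and only the structure F(x) = E(x) + λ·P(x) and the lowest-energy selection follow the text.

```python
def objective(trajectory, energy_fn, penalty_fn, lam=10.0):
    """F(x) = E(x) + lambda * P(x): energy cost plus a weighted constraint-violation penalty."""
    return energy_fn(trajectory) + lam * penalty_fn(trajectory)

def select_cooperative_strategy(candidate_trajectories, energy_fn, penalty_fn, lam=10.0):
    """Keep collision-free trajectories (penalty == 0) and pick the lowest-cost one per AUV."""
    best = {}
    for auv_id, trajectories in candidate_trajectories.items():
        feasible = [t for t in trajectories if penalty_fn(t) == 0.0]
        best[auv_id] = min(feasible, key=lambda t: objective(t, energy_fn, penalty_fn, lam))
    cluster_energy = sum(energy_fn(t) for t in best.values())
    return best, cluster_energy

# Illustrative placeholders: energy proportional to path length, no constraint violations.
energy_fn = lambda traj: sum(abs(b - a) for a, b in zip(traj, traj[1:]))
penalty_fn = lambda traj: 0.0   # assume the candidates already passed the collision check
best, total = select_cooperative_strategy({"auv_1": [[0, 1, 3], [0, 2, 3]]}, energy_fn, penalty_fn)
```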
S3, designing a neural network model and an agent model to obtain the optimal action of each AUV, the optimal action being the action with the least energy consumption;
as shown in fig. 5, S3 includes the following sub-steps:
s3.1, designing a neural network model, wherein the neural network approximates the Q-value function and comprises n local Q-value networks and one global Q-value network, n being the number of AUVs in the AUV cluster, and each local Q-value network being a recurrent neural network. The sub-strategy and high-level strategy obtained in step S1 and the historical environmental-observation data received by each agent are used as the input of the corresponding local Q-value network, which outputs an evaluation of the local Q-value function;
The recurrent neural network is a GRU neural network that converts the input sequence into an output sequence in a different domain, the output sequence corresponding to the next step of the input sequence. The whole network is trained with the twice-sampled data rather than in an end-to-end manner, and rewards are passed between the upper-layer and lower-layer controllers. The update formulas of the GRU neural network are as follows:
Reset gate: $r_t=\sigma\left(W_r\,[h_{t-1},x_t]+b_r\right)$ ;(8)
Update gate: $z_t=\sigma\left(W_z\,[h_{t-1},x_t]+b_z\right)$ ;(9)
Candidate hidden state: $\tilde{h}_t=\tanh\left(W_h\,[r_t\odot h_{t-1},\,x_t]+b_h\right)$ ;(10)
Final hidden state: $h_t=(1-z_t)\odot h_{t-1}+z_t\odot\tilde{h}_t$ ;(11)
where $r_t$ is the reset gate vector at time step $t$; $\sigma$ is the Sigmoid activation function; $W_r$ is the weight matrix associated with the reset gate; $h_{t-1}$ is the hidden state at time step $t-1$; $x_t$ is the input feature vector at time step $t$; $b_r$ is the bias vector associated with the reset gate; $z_t$ is the update gate vector at time step $t$; $W_z$ is the weight matrix associated with the update gate; $b_z$ is the bias vector associated with the update gate; $\tilde{h}_t$ is the candidate hidden state at time step $t$; $W_h$ is the weight matrix of the hidden state; $b_h$ is the bias vector of the hidden state; and $h_t$ is the final hidden state at time step $t$.
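The following NumPy sketch of a single GRU step follows formulas (8) to (11); the weight shapes and random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, b_r, W_z, b_z, W_h, b_h):
    """One GRU update following formulas (8)-(11)."""
    xh = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ xh + b_r)                      # (8) reset gate
    z_t = sigmoid(W_z @ xh + b_z)                      # (9) update gate
    xh_reset = np.concatenate([r_t * h_prev, x_t])     # [r_t * h_{t-1}, x_t]
    h_cand = np.tanh(W_h @ xh_reset + b_h)             # (10) candidate hidden state
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand          # (11) final hidden state
    return h_t

# Illustrative dimensions: 4-dimensional observation, 8-dimensional hidden state.
rng = np.random.default_rng(0)
obs_dim, hid_dim = 4, 8
W_r, W_z, W_h = (rng.standard_normal((hid_dim, hid_dim + obs_dim)) * 0.1 for _ in range(3))
b_r = b_z = b_h = np.zeros(hid_dim)
h = gru_step(rng.standard_normal(obs_dim), np.zeros(hid_dim), W_r, b_r, W_z, b_z, W_h, b_h)
```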
The global value network adopts a mixing network structure: the local Q-value networks are modeled as a graph structure, a graph attention network is used as the approximator to model the relationships among agents, and the outputs of the local Q-value networks together with the cooperative operation strategy obtained in step S2 are used as inputs to obtain, for each local Q-value network, the weight of its selected Q-value function that maximizes the global Q value;
The mixing network corresponding to the global Q-value function is trained with a graph attention network; the TD loss calculated from the global reward and the global Q-value function is back-propagated, and the global mixing network and the local Q-value function networks are trained simultaneously;
The iterative formula of the global Q-value function is:
$Q(s,a) \leftarrow Q(s,a)+\alpha\left[r_t+\gamma \max_{a'}Q(s',a')-Q(s,a)\right]$ ;(12)
where $Q(s,a)$ denotes the expected return (Q value) of taking action $a$ in state $s$; $\alpha$ is the learning rate, which determines the extent to which new information overrides old information; $r_t$ is the instant reward obtained at time step $t$; $\gamma$ is the discount factor, used to compute the present value of future rewards; and $\max_{a'}Q(s',a')$ is the maximum Q value over all possible actions in the next state $s'$;
The TD loss is:
$\delta_t=r_t+\gamma \max_{a'}Q(s',a';\theta^{-})-Q(s,a;\theta)$ ;(13)
where $\delta_t$ is the temporal-difference (TD) error at time step $t$, used to evaluate the gap between the current Q value and the target Q value; $\theta$ are the parameters of the current strategy; and $\theta^{-}$ are the parameters of the target strategy;
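As an illustrative sketch only, the TD target and TD error of formulas (12) and (13) could be computed as below; the squared-error loss over the TD error is a common choice and an assumption here, not a detail stated in the text.

```python
import numpy as np

def td_error(r_t, q_sa_current, q_next_target, gamma=0.99):
    """delta_t = r_t + gamma * max_a' Q(s', a'; theta_minus) - Q(s, a; theta)  -- formula (13)."""
    return r_t + gamma * np.max(q_next_target) - q_sa_current

def td_loss(deltas):
    """Mean squared TD error, back-propagated to train the mixing and local Q networks."""
    return float(np.mean(np.square(deltas)))

# Illustrative values: current Q estimate and target-network Q values for the next state.
delta = td_error(r_t=0.5, q_sa_current=1.2, q_next_target=np.array([0.7, 1.1, 0.9]))
loss = td_loss(np.array([delta]))
```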
The attention mechanism of the graph attention network includes:
The attention coefficient:
$\alpha_{ij}=\dfrac{\exp\left(\mathrm{LeakyReLU}\left(a^{\top}[W h_i \,\|\, W h_j]\right)\right)}{\sum_{k\in \mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}\left(a^{\top}[W h_i \,\|\, W h_k]\right)\right)}$ ;(14)
where $\alpha_{ij}$ is the attention coefficient of node $i$ for its neighbor node $j$; $\exp$ is the natural exponential function used to normalize the attention coefficients; $\mathrm{LeakyReLU}$ is the activation function; $a^{\top}$ is the transpose of the parameter vector of the attention mechanism; $W$ is a weight matrix used to linearly transform the input features; $h_i$ is the feature vector of node $i$; $\|$ denotes vector concatenation; $\mathcal{N}_i$ is the set of neighbor nodes of node $i$; and $h_i'$ is the updated feature vector of node $i$ under the attention mechanism;
Together, these variables make up the mathematical operations inside the neural network model that are used for forward and backward propagation, allowing the model to learn to extract features and patterns from the data.
The updated node feature $h_i'$ is:
$h_i'=\sigma\left(\sum_{j\in \mathcal{N}_i}\alpha_{ij} W h_j\right)$ (15).
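The graph-attention update of formulas (14) and (15) can be sketched in NumPy as follows; the single attention head, the LeakyReLU slope and the sigmoid output activation are illustrative assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_update(H, W, a_vec, neighbors, i):
    """Update node i's features: alpha_ij per formula (14), h_i' per formula (15)."""
    Wh = H @ W.T                                            # linearly transform all node features
    scores = np.array([
        leaky_relu(a_vec @ np.concatenate([Wh[i], Wh[j]]))  # a^T [W h_i || W h_j]
        for j in neighbors[i]
    ])
    alpha = np.exp(scores) / np.exp(scores).sum()           # softmax over the neighborhood
    h_new = sum(alpha[k] * Wh[j] for k, j in enumerate(neighbors[i]))
    return 1.0 / (1.0 + np.exp(-h_new))                     # sigma(...) output activation

# Illustrative graph: 3 local Q networks (nodes) with 4-dimensional features.
rng = np.random.default_rng(1)
H = rng.standard_normal((3, 4))
W = rng.standard_normal((4, 4)) * 0.1
a_vec = rng.standard_normal(8) * 0.1
h1_new = gat_update(H, W, a_vec, neighbors={1: [0, 2]}, i=1)
```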
s3.2, designing an agent model;
A control instruction generated for an action that the AUV needs to complete under water is defined as an action;
The agent state set is: {0: position, 1: attitude, 2: speed, 3: azimuth, 4: type, 5: operating state, 6: energy state}; the agent types include energy supply AUV, operation AUV and communication AUV; the energy state includes the energy supply, energy storage and energy consumption states; the agent action set is: {0: ascend, 1: dive, 2: turn left, 3: turn right, 4: start operation, 5: stop operation, 6: no operation};
Designing the reward function: when the agent performs an action, a negative or positive reward is given according to how good or bad the action is, and the reward (evaluation) function is designed accordingly. The agent, as the learning subject, selects actions according to the reward and again obtains feedback from the environment; by learning an action-selection strategy throughout this process, it continuously improves its ability to obtain rewards.
The following factors need to be considered when designing the reward function:
(1) Task goal: the degree to which the AUV completes the set task;
(2) Energy efficiency: how much energy the AUV consumes when performing the task;
(3) Safety: whether the AUV avoids potential hazards while executing the task;
(4) Task execution quality: the accuracy with which the AUV performs the specific job task.
The contribution of each part of the reward function to the overall reward can be set by adjusting the weight parameters according to the specific needs of the task and the environmental conditions. In addition, extra reward or penalty terms may need to be introduced to motivate the agent to explore new states or strategies and to avoid potential long-term negative effects. The reward function needs to be tuned over multiple experiments and iterations to achieve the best learning effect.
The reward function is defined as:
$R(s,a)=w_1 r_{\text{goal}}+w_2 r_{\text{energy}}+w_3 r_{\text{safe}}+w_4 r_{\text{task}}$ ;(7)
where $s$ denotes the current state of the agent and $a$ the action selected by the agent; $r_{\text{goal}}$ indicates the proximity of the agent to the target; $r_{\text{energy}}$ represents the energy efficiency of the agent; $r_{\text{safe}}$ represents the safety of the agent executing action $a$; $r_{\text{task}}$ represents the effect of the agent performing the particular task; and $w_1,\dots,w_4$ are weight factors adjusted according to the task;
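A sketch of the weighted reward of formula (7) is shown below; the individual reward terms and the weights are placeholder assumptions chosen for illustration.

```python
def reward(state, action, weights=(0.4, 0.3, 0.2, 0.1)):
    """R(s,a) = w1*r_goal + w2*r_energy + w3*r_safe + w4*r_task (formula (7))."""
    r_goal = 1.0 - min(state["distance_to_target"] / state["initial_distance"], 1.0)
    r_energy = state["remaining_energy"] / state["battery_capacity"]
    r_safe = -1.0 if state["collision_risk"] else 0.0
    r_task = state["task_progress"]          # fraction of the job task completed
    w1, w2, w3, w4 = weights
    return w1 * r_goal + w2 * r_energy + w3 * r_safe + w4 * r_task

# Illustrative call for one agent step (action 4 = start operation).
r = reward({"distance_to_target": 20.0, "initial_distance": 100.0,
            "remaining_energy": 60.0, "battery_capacity": 100.0,
            "collision_risk": False, "task_progress": 0.25}, action=4)
```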
s3.3, training the neural network model to obtain a trained neural network model;
s3.4, inputting the sub-strategy and high-level strategy obtained in step S1 and the historical environmental-observation data received by the agent of each AUV into the trained local Q-value networks, which output the evaluation of each local Q-value function;
s3.5, the global Q-value network takes the outputs of the local Q-value networks and the cooperative operation strategy obtained in step S2 as input, uses the graph attention method to calculate the weight of the Q-value function selected by each local Q-value network so as to maximize the global Q value, and obtains the action corresponding to the maximum global Q value, i.e. the action with the least energy consumption, which comprises the optimal action of each AUV; the actions are packaged into environment-executable instructions and sent to the AUV cluster.
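The action selection of step S3.5 could look like the following sketch, in which the local Q values for each candidate action are assumed to be already evaluated and the mixing weights come from the graph attention step; the simple weighted sum used as the global Q value is an assumption for illustration.

```python
import numpy as np

def select_joint_action(local_q, attention_weights, action_labels):
    """Pick the joint action whose attention-weighted sum of local Q values is largest (step S3.5)."""
    global_q = local_q @ attention_weights          # one global Q value per candidate joint action
    best_idx = int(np.argmax(global_q))
    return action_labels[best_idx], float(global_q[best_idx])

# Illustrative values: 3 candidate joint actions evaluated by 2 local Q networks.
local_q = np.array([[0.8, 0.5],     # joint action 0
                    [0.6, 0.9],     # joint action 1
                    [0.4, 0.3]])    # joint action 2
weights = np.array([0.6, 0.4])      # mixing weights from the graph attention network
action, q_value = select_joint_action(local_q, weights, action_labels=["ascend", "dive", "turn left"])
instruction = {"command": action, "q_value": q_value}   # packaged as an environment-executable instruction
```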
It should be understood that the above description is not intended to limit the invention to the particular embodiments disclosed; rather, the invention is intended to cover all modifications, adaptations, additions and alternatives falling within its spirit and scope.

Claims (10)

1. An AUV cluster energy on-line monitoring and management system, characterized by comprising an energy supply unit, an energy storage unit, an energy consumption unit and an energy optimization management and control system;
the energy supply unit is used for supplying energy and comprises the power generation system and battery system of an energy supply AUV;
the energy storage unit is used for storing energy: each AUV carries energy storage equipment and provides energy for the energy consumption unit;
the energy consumption unit is used for using energy and mainly refers to the energy-consuming equipment of all AUVs, including sensor equipment, operation equipment, power equipment and communication equipment;
the energy optimization management and control system is used for distributing energy according to the energy demand of each energy supply and storage unit and the current energy state monitored in real time, and for outputting action control instructions to the power execution structure of each AUV.
2. The AUV cluster energy on-line monitoring and management system of claim 1, wherein the energy replenishment AUV is an AUV having a dual pontoon raft type wave energy capture device structure.
3. An AUV cluster energy on-line monitoring and management method, characterized in that an AUV cluster energy on-line monitoring and management system as claimed in claim 1 or 2 is adopted, comprising the following sub-steps:
s1, regarding each AUV in an AUV cluster as an agent, and obtaining each agent sub-strategy through hierarchical reinforcement learning parallel training;
s2, performing energy optimization through clustering-based collision avoidance of the AUV cluster to generate a cluster cooperative operation strategy;
and S3, designing a neural network model and an agent model to obtain the optimal action of each AUV, wherein the optimal action is the action with the least energy consumption.
4. The method for online monitoring and managing AUV cluster energy according to claim 3, wherein the step S1 comprises the following sub-steps:
s1.1, decomposing the AUV cluster optimization control problem into a plurality of subtasks, each subtask focusing on a part of the decision-making process in the AUV cluster, and determining the agent corresponding to each subtask;
the current Q-value function of a subtask, $Q(s,a)$, is:
$Q(s,a)=\mathbb{E}\left[R_t \mid s_t=s,\; a_t=a\right]$ ;(1)
where $R_t$ is the return obtained at time step $t$, $s$ is a state, $a$ is an action, $s_t$ is the current state, $a_t$ is the current action, and $\mathbb{E}[\cdot]$ is the expected value;
the predictive Q-value function of a subtask is:
$Q(s,a)=\mathbb{E}\left[r_t+\gamma \max_{a'}Q(s',a')\right]$ ;(2)
where $r_t$ is the reward at time step $t$, $\gamma$ is the discount factor, and $s'$ and $a'$ are the next state and action, respectively;
s1.2, designing a reinforcement learning hierarchy, in which a high-level agent formulates long-term strategy targets, i.e. the high-level strategy, and low-level agents are responsible for executing specific actions;
the Q-value function of the high-level strategy is:
$Q(s,o)=\mathbb{E}\left[R_t \mid s_t=s,\; o_t=o\right]$ ;(3)
where $o$ is a subtask or option and $o_t$ is the subtask or option at the current time $t$;
s1.3, parallel training: for each subtask, an agent is trained independently to accelerate the learning process;
s1.4, sub-strategy learning: each agent learns its sub-strategy by interacting with the marine environment and optimizes it with a reinforcement learning algorithm; during the strategy optimization iterations, the value function of a subtask or of the high-level strategy is updated so as to satisfy the Bellman equation;
the update expression of the Q-value function of a subtask is:
$Q(s,a) \leftarrow Q(s,a)+\alpha\left[r+\gamma \max_{a'}Q(s',a')-Q(s,a)\right]$ ;(4)
where $\alpha$ is the learning rate, $r$ is the reward, $a'$ is a possible action in the new state, and $s'$ is the new state reached after taking action $a$;
s1.5, further adjusting the optimized sub-strategies through the multi-agent cooperative strategy, an option model being used in the sub-strategy adjustment: each sub-strategy is represented by a triplet $(I_o,\pi_o,\beta_o)$, where $I_o$ is the set of states in which option $o$ can be initiated, $\pi_o$ is the strategy followed under option $o$, and $\beta_o(s)$ is the probability that option $o$ terminates in state $s$;
when an option is executed, the agent learns not only when to switch options but also the optimal behavior inside each option; the Q-value function of a subtask after executing option $o$ is:
$Q(s,o)=\mathbb{E}\left[r_{t+1}+\gamma r_{t+2}+\cdots+\gamma^{k-1} r_{t+k}+\gamma^{k}\max_{o'}Q(s_{t+k},o')\right]$, where $k$ is the number of time steps for which the option lasts (5).
5. the AUV cluster energy on-line monitoring and management method of claim 4, wherein S2 includes the sub-steps of:
s2.1, based on a graph-theoretic clustering algorithm, searching for the position and speed of the AUV cluster center with the lowest energy consumption as the objective;
s2.2, acquiring the motion trajectory of each AUV and performing a collision check every t time steps; if a collision would occur, executing step S2.3, otherwise executing step S2.4;
s2.3, introducing a penalty function, recalculating, and searching again for a position and speed of the AUV cluster center that satisfy the constraints;
s2.4, checking the generated AUV motion trajectories; if no collision occurs along the whole path, executing step S2.5, otherwise returning to step S2.3 for recalculation;
s2.5, updating the motion trajectories in real time until each AUV reaches its end point, thereby generating several collision-free motion trajectories for each AUV;
s2.6, calculating the energy consumption of all collision-free motion trajectories of each AUV, selecting the trajectory with the lowest energy consumption for each AUV, and calculating the lowest energy consumption of the AUV cluster as the cluster cooperative operation strategy.
6. The method for online monitoring and managing AUV cluster energy according to claim 5, wherein the penalty function is introduced by adding a penalty term to the objective function, and the objective function has the following form:
$F(x)=E(x)+\lambda P(x)$ ;(6)
where $F(x)$ is the objective function, $E(x)$ is the energy consumption function, $P(x)$ is the penalty function, and $\lambda$ is a weight parameter used to adjust the influence of the penalty term.
7. The AUV cluster energy on-line monitoring and management method of claim 6, wherein S3 includes the sub-steps of:
s3.1, designing a neural network model, wherein the neural network approximates the Q-value function and comprises n local Q-value networks and one global Q-value network, n being the number of AUVs in the AUV cluster, and each local Q-value network being a recurrent neural network; the sub-strategy and high-level strategy obtained in step S1 and the historical environmental-observation data received by each agent are used as the input of the corresponding local Q-value network, which outputs an evaluation of the local Q-value function;
the global value network adopts a mixing network structure: the local Q-value networks are modeled as a graph structure, a graph attention network is used as the approximator to model the relationships among agents, and the outputs of the local Q-value networks together with the cooperative operation strategy obtained in step S2 are used as inputs to obtain, for each local Q-value network, the weight of its selected Q-value function that maximizes the global Q value;
s3.2, designing an agent model;
a control instruction generated for an action that the AUV needs to complete under water is defined as an action;
the agent state set is: {0: position, 1: attitude, 2: speed, 3: azimuth, 4: type, 5: operating state, 6: energy state}; the agent types include energy supply AUV, operation AUV and communication AUV; the energy state includes the energy supply, energy storage and energy consumption states; the agent action set is: {0: ascend, 1: dive, 2: turn left, 3: turn right, 4: start operation, 5: stop operation, 6: no operation};
the reward function is designed as:
$R(s,a)=w_1 r_{\text{goal}}+w_2 r_{\text{energy}}+w_3 r_{\text{safe}}+w_4 r_{\text{task}}$ ;(7)
where $r_{\text{goal}}$ indicates the proximity of the agent to the target; $r_{\text{energy}}$ represents the energy efficiency of the agent; $r_{\text{safe}}$ represents the safety of the agent executing action $a$; $r_{\text{task}}$ represents the effect of the agent performing the particular task; and $w_1,\dots,w_4$ are weight factors adjusted according to the task;
s3.3, training the neural network model to obtain a trained neural network model;
s3.4, inputting the sub-strategy and high-level strategy obtained in step S1 and the historical environmental-observation data received by the agent of each AUV into the trained local Q-value networks, which output the evaluation of each local Q-value function;
s3.5, the global Q-value network takes the outputs of the local Q-value networks and the cooperative operation strategy obtained in step S2 as input, uses the graph attention method to calculate the weight of the Q-value function selected by each local Q-value network so as to maximize the global Q value, and obtains the action corresponding to the maximum global Q value, which comprises the optimal action of each AUV; the actions are packaged into environment-executable instructions and sent to the AUV cluster.
8. The method for online monitoring and managing AUV cluster energy according to claim 7, wherein the recurrent neural network is a GRU neural network, and the update formulas of the GRU neural network are as follows:
Reset gate: $r_t=\sigma\left(W_r\,[h_{t-1},x_t]+b_r\right)$ ;(8)
Update gate: $z_t=\sigma\left(W_z\,[h_{t-1},x_t]+b_z\right)$ ;(9)
Candidate hidden state: $\tilde{h}_t=\tanh\left(W_h\,[r_t\odot h_{t-1},\,x_t]+b_h\right)$ ;(10)
Final hidden state: $h_t=(1-z_t)\odot h_{t-1}+z_t\odot\tilde{h}_t$ ;(11)
where $r_t$ is the reset gate vector at time step $t$; $\sigma$ is the Sigmoid activation function; $W_r$ is the weight matrix associated with the reset gate; $h_{t-1}$ is the hidden state at time step $t-1$; $x_t$ is the input feature vector at time step $t$; $b_r$ is the bias vector associated with the reset gate; $z_t$ is the update gate vector at time step $t$; $W_z$ is the weight matrix associated with the update gate; $b_z$ is the bias vector associated with the update gate; $\tilde{h}_t$ is the candidate hidden state at time step $t$; $W_h$ is the weight matrix of the hidden state; $b_h$ is the bias vector of the hidden state; and $h_t$ is the final hidden state at time step $t$.
9. The method for online monitoring and managing AUV cluster energy according to claim 8, wherein the mixing network corresponding to the global Q-value function is trained with a graph attention network, the TD loss calculated from the global reward and the global Q-value function is back-propagated, and the global mixing network and the local Q-value function networks are trained simultaneously;
the iterative formula of the global Q-value function is:
$Q(s,a) \leftarrow Q(s,a)+\alpha\left[r_t+\gamma \max_{a'}Q(s',a')-Q(s,a)\right]$ ;(12)
where $Q(s,a)$ denotes the expected return (Q value) of taking action $a$ in state $s$; $r_t$ is the instant reward obtained at time step $t$; and $\max_{a'}Q(s',a')$ is the maximum Q value over all possible actions in the next state $s'$;
the TD loss is:
$\delta_t=r_t+\gamma \max_{a'}Q(s',a';\theta^{-})-Q(s,a;\theta)$ ;(13)
where $\delta_t$ is the temporal-difference (TD) error at time step $t$, used to evaluate the gap between the current Q value and the target Q value; $\theta$ are the parameters of the current strategy; and $\theta^{-}$ are the parameters of the target strategy.
10. The method for online monitoring and managing AUV cluster energy according to claim 9, wherein the attention mechanism of the graph attention network comprises:
the attention coefficient:
$\alpha_{ij}=\dfrac{\exp\left(\mathrm{LeakyReLU}\left(a^{\top}[W h_i \,\|\, W h_j]\right)\right)}{\sum_{k\in \mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}\left(a^{\top}[W h_i \,\|\, W h_k]\right)\right)}$ ;(14)
where $\alpha_{ij}$ is the attention coefficient of node $i$ for its neighbor node $j$; $\exp$ is the natural exponential function used to normalize the attention coefficients; $\mathrm{LeakyReLU}$ is the activation function; $a^{\top}$ is the transpose of the parameter vector of the attention mechanism; $W$ is a weight matrix used to linearly transform the input features; $h_i$ is the feature vector of node $i$; $\|$ denotes vector concatenation; $\mathcal{N}_i$ is the set of neighbor nodes of node $i$; and $h_i'$ is the updated feature vector of node $i$ under the attention mechanism;
and the updated node feature $h_i'$:
$h_i'=\sigma\left(\sum_{j\in \mathcal{N}_i}\alpha_{ij} W h_j\right)$ (15).
CN202410191318.6A 2024-02-21 AUV cluster energy online monitoring and management system and method Active CN117748747B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410191318.6A CN117748747B (en) 2024-02-21 AUV cluster energy online monitoring and management system and method


Publications (2)

Publication Number / Publication Date
CN117748747A / 2024-03-22
CN117748747B / 2024-05-17



Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210221247A1 (en) * 2018-06-22 2021-07-22 Moixa Energy Holdings Limited Systems for machine learning, optimising and managing local multi-asset flexibility of distributed energy storage resources
CN110046800A (en) * 2019-03-14 2019-07-23 南京航空航天大学 The satellite cluster formation adjusting planing method of space-oriented target cooperative observation
CN112507622A (en) * 2020-12-16 2021-03-16 中国人民解放军国防科技大学 Anti-unmanned aerial vehicle task allocation method based on reinforcement learning
CN113902087A (en) * 2021-10-25 2022-01-07 吉林建筑大学 Multi-Agent deep reinforcement learning algorithm
CN114355973A (en) * 2021-12-28 2022-04-15 哈尔滨工程大学 Multi-agent hierarchical reinforcement learning-based unmanned cluster cooperation method under weak observation condition
WO2023160012A1 (en) * 2022-02-25 2023-08-31 南京信息工程大学 Unmanned aerial vehicle assisted edge computing method for random inspection of power grid line
CN116317162A (en) * 2023-03-31 2023-06-23 武汉船用电力推进装置研究所(中国船舶集团有限公司第七一二研究所) Underwater energy platform power supply system and control method thereof
CN116533234A (en) * 2023-04-28 2023-08-04 山东大学 Multi-axis hole assembly method and system based on layered reinforcement learning and distributed learning
CN116546421A (en) * 2023-05-16 2023-08-04 西安邮电大学 Unmanned aerial vehicle position deployment and minimum energy consumption AWAQ algorithm based on edge calculation
CN116700299A (en) * 2023-05-30 2023-09-05 西安天和海防智能科技有限公司 AUV cluster control system and method based on digital twin
CN116588282A (en) * 2023-07-17 2023-08-15 青岛哈尔滨工程大学创新发展中心 AUV intelligent operation and maintenance system and method
CN117055619A (en) * 2023-09-06 2023-11-14 桂林电子科技大学 Unmanned aerial vehicle scheduling method based on multi-agent reinforcement learning
CN117270559A (en) * 2023-09-22 2023-12-22 南京航空航天大学 Unmanned aerial vehicle cluster deployment and track planning method based on reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张冬; 陈涛; 乔玉龙: "AUV distributed autonomous control and mission management method", Fire Control & Command Control, no. 09, 15 September 2011 (2011-09-15), pages 172-175 *
曾隽芳, 刘禹 et al.: "Research on AUV cluster energy management and optimized operation control methods", Proceedings of the 5th Underwater Unmanned Systems Technology Summit Forum, 15 November 2022 (2022-11-15), pages 271-276 *

Similar Documents

Publication Publication Date Title
Wang et al. Dynamic path planning for unmanned surface vehicle in complex offshore areas based on hybrid algorithm
Russell et al. Q-decomposition for reinforcement learning agents
CN102819264B (en) Path planning Q-learning initial method of mobile robot
CN102402712B (en) Robot reinforced learning initialization method based on neural network
CN113052372B (en) Dynamic AUV tracking path planning method based on deep reinforcement learning
CN111538349B (en) Long-range AUV autonomous decision-making method oriented to multiple tasks
CN115544899A (en) Water plant water intake pump station energy-saving scheduling method based on multi-agent deep reinforcement learning
CN117039981A (en) Large-scale power grid optimal scheduling method, device and storage medium for new energy
Wang et al. AUV-Assisted Node Repair for IoUT Relying on Multi-Agent Reinforcement Learning
CN111813143B (en) Underwater glider intelligent control system and method based on reinforcement learning
CN117748747B (en) AUV cluster energy online monitoring and management system and method
Chen et al. Survey of multi-agent strategy based on reinforcement learning
CN117748747A (en) AUV cluster energy online monitoring and management system and method
Wang et al. MUTS-based cooperative target stalking for a multi-USV system
Yan et al. An improved multi-AUV patrol path planning method
CN116702903A (en) Spacecraft cluster game intelligent decision-making method based on deep reinforcement learning
CN115334165B (en) Underwater multi-unmanned platform scheduling method and system based on deep reinforcement learning
Yazdani et al. Perception-aware online trajectory generation for a prescribed manoeuvre of unmanned surface vehicle in cluttered unstructured environment
Gao Soft computing methods for control and instrumentation
CN117111620B (en) Autonomous decision-making method for task allocation of heterogeneous unmanned system
Du et al. Safe multi-agent learning control for unmanned surface vessels cooperative interception mission
CN113837654B (en) Multi-objective-oriented smart grid hierarchical scheduling method
Bo et al. Adaptive Dynamic Programming Based on Parallel Control Theory for Underwater Vehicles
CN117111594B (en) Self-adaptive track control method for unmanned surface vessel
Cao et al. Threat Assessment Strategy of Human-in-the-Loop Unmanned Underwater Vehicle Under Uncertain Events

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant