CN117748747A - AUV cluster energy online monitoring and management system and method

AUV cluster energy online monitoring and management system and method

Info

Publication number
CN117748747A
CN117748747A (application CN202410191318.6A)
Authority
CN
China
Prior art keywords
energy
auv
value
strategy
cluster
Prior art date
Legal status
Granted
Application number
CN202410191318.6A
Other languages
Chinese (zh)
Other versions
CN117748747B (en)
Inventor
陈云赛
刘祺
刘增凯
孙尧
刘子然
张栋
姜清华
李志彤
邢会明
陈泓洲
张辰玮
Current Assignee
Qingdao Harbin Engineering University Innovation Development Center
Original Assignee
Qingdao Harbin Engineering University Innovation Development Center
Priority date
Filing date
Publication date
Application filed by Qingdao Harbin Engineering University Innovation Development Center filed Critical Qingdao Harbin Engineering University Innovation Development Center
Priority to CN202410191318.6A priority Critical patent/CN117748747B/en
Priority claimed from CN202410191318.6A external-priority patent/CN117748747B/en
Publication of CN117748747A publication Critical patent/CN117748747A/en
Application granted granted Critical
Publication of CN117748747B publication Critical patent/CN117748747B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Feedback Control In General (AREA)

Abstract

The invention provides an AUV cluster energy on-line monitoring and management system and method, belonging to the technical field of ship energy management. The system comprises an energy supply unit, an energy storage unit, an energy consumption unit and an energy optimization management and control system, and the method comprises the following steps: each AUV in the AUV cluster is regarded as an agent, and the sub-strategy of each agent is obtained through parallel training with hierarchical reinforcement learning; energy optimization is performed through AUV cluster collision avoidance to generate a cluster cooperative operation strategy; and a neural network model and an agent model are designed to obtain the optimal action of each AUV, the optimal action being the action with the least energy consumption. With the invention, the AUV cluster can automatically adjust its energy scheduling according to real-time collision-avoidance constraints and task demands, realizing economical and efficient energy operation. A reinforcement learning optimization method is introduced, and global energy optimization management and control of the AUV cluster is realized from the energy characteristics, so that the AUV cluster can complete its operation tasks under limited energy conditions.

Description

AUV cluster energy online monitoring and management system and method
Technical Field
The invention belongs to the technical field of ship energy management, and particularly relates to an AUV cluster energy on-line monitoring and management system and method.
Background
In a deep-sea unmanned operation environment, an AUV cluster must be able to autonomously reason about unknown environment states and collaboratively plan operation tasks. It must combine the data from its various payloads, such as optical, acoustic and chemical sensors, to make timely intelligent decisions on the working states of all control units, so that observation tasks are completed efficiently and shortcomings exposed during a task are corrected in time.
However, an AUV cluster is heterogeneous, and the task conditions it faces are complex, variable and subject to many disturbances. To ensure the autonomy and adaptability of the AUV cluster in a complex deep-sea environment, the intelligence level of its energy management and operation control should be improved, so as to achieve autonomous navigation and positioning, dynamic networking, autonomous detection and decision-making, autonomous operation execution, and self-adaptive optimization of the energy supply. This saves energy and enables the AUV cluster to keep working for a longer time on the limited energy available under water. The AUV cluster energy system is an important guarantee for carrying out deep-sea observation, navigation and communication, precise operation, transportation, and other tasks.
Because an AUV works on deep-sea tasks for long periods and needs strong endurance, it is equipped with a power system. Under limited energy supply, an energy management strategy that efficiently manages and optimizes the operation of the energy system plays an important role in saving energy and reducing consumption. Such a strategy should also have self-learning capability, so that it can schedule energy reasonably and effectively in emergencies and guarantee the economic operation of the AUV cluster.
However, when an AUV cluster faces complex operations, its energy supply and demand are uncertain and span multiple spatio-temporal scales, and its overall behavior is complex, which makes conventional model-based control difficult to apply. It is therefore necessary to design a reinforcement learning optimization method, starting from the energy characteristics of the AUV cluster energy system, such as its complex storage and utilization behavior, heterogeneous data and diversified energy demands, to achieve overall energy optimization management and control of the AUV cluster.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an AUV cluster energy on-line monitoring and management system and method, which are reasonably designed, overcome the defects of the prior art and achieve good results.
An AUV cluster energy on-line monitoring and management system comprises an energy supply unit, an energy storage unit, an energy consumption unit and an energy optimization management and control system;
the energy supply unit is used for supplying energy and comprises the power generation system and battery system of an energy supply AUV;
the energy storage unit is used for storing energy: each AUV carries energy storage equipment and provides energy for the energy consumption unit;
the energy consumption unit is used for using energy and mainly refers to the energy-consuming equipment of all AUVs, including sensor equipment, operation equipment, power equipment and communication equipment;
the energy optimization management and control system is used for distributing energy according to the energy demand of each energy supply and storage unit and the current energy state monitored in real time, and for outputting action control instructions to the power execution structure of each AUV.
Further, the energy supply AUV is an AUV equipped with a double-pontoon raft-type wave energy capture device.
An AUV cluster energy on-line monitoring and management method, using the above AUV cluster energy on-line monitoring and management system, comprises the following steps:
S1, each AUV in the AUV cluster is regarded as an agent, and the sub-strategy of each agent is obtained through parallel training with hierarchical reinforcement learning;
S2, energy optimization is performed through clustering-based collision avoidance of the AUV cluster to generate a cluster cooperative operation strategy;
S3, a neural network model and an agent model are designed to obtain the optimal action of each AUV, the optimal action being the action with the least energy consumption.
Further, the step S1 includes the following substeps:
S1.1, decomposing the AUV cluster optimization control problem into a plurality of subtasks, each subtask focusing on a part of the decision-making process in the AUV cluster, and determining the agent corresponding to each subtask;
The current Q-value function of a subtask, $Q(s,a)$, is:
$Q(s,a)=\mathbb{E}\left[R_t \mid s_t=s,\; a_t=a\right]$ ;(1)
where $R_t$ is the return obtained at time step $t$, $s$ is a state, $a$ is an action, $s_t$ is the current state, $a_t$ is the current action, and $\mathbb{E}[\cdot]$ is the expected value;
The predictive Q-value function of a subtask is:
$Q(s,a)=\mathbb{E}\left[r_t+\gamma \max_{a'}Q(s',a')\right]$ ;(2)
where $r_t$ is the reward at time step $t$, $\gamma$ is the discount factor, and $s'$ and $a'$ are the next state and action, respectively;
S1.2, designing a reinforcement learning hierarchy, in which a high-level agent formulates long-term strategy targets, i.e. the high-level strategy, and low-level agents are responsible for executing specific actions;
The Q-value function of the high-level strategy is:
$Q(s,o)=\mathbb{E}\left[R_t \mid s_t=s,\; o_t=o\right]$ ;(3)
where $o$ is a subtask or option and $o_t$ is the subtask or option at the current time $t$;
S1.3, parallel training: for each subtask, an agent is trained independently to accelerate the learning process;
S1.4, sub-strategy learning: each agent learns its sub-strategy by interacting with the marine environment and optimizes it with a reinforcement learning algorithm; during the strategy optimization iterations, the value function of a subtask or of the high-level strategy is updated so as to satisfy the Bellman equation;
The update expression of the Q-value function of a subtask is:
$Q(s,a) \leftarrow Q(s,a)+\alpha\left[r+\gamma \max_{a'}Q(s',a')-Q(s,a)\right]$ ;(4)
where $\alpha$ is the learning rate, $r$ is the reward, $a'$ is a possible action in the new state, and $s'$ is the new state reached after taking action $a$.
S1.5, further adjusting the optimized sub-strategies through the multi-agent cooperative strategy. In the sub-strategy adjustment an option model is used: each sub-strategy is represented by a triplet $(I_o,\pi_o,\beta_o)$, where $I_o$ is the set of states in which option $o$ can be initiated, $\pi_o$ is the strategy followed under option $o$, and $\beta_o(s)$ is the probability that option $o$ terminates in state $s$;
When an option is executed, the agent learns not only when to switch options but also the optimal behavior inside each option; the Q-value function of a subtask after executing option $o$ is:
$Q(s,o)=\mathbb{E}\left[r_{t+1}+\gamma r_{t+2}+\cdots+\gamma^{k-1} r_{t+k}+\gamma^{k}\max_{o'}Q(s_{t+k},o')\right]$, where $k$ is the number of time steps for which the option lasts (5).
further, S2 comprises the following sub-steps:
S2.1, based on a graph-theoretic clustering algorithm, searching for the position and speed of the AUV cluster center with the lowest energy consumption as the objective;
S2.2, acquiring the motion trajectory of each AUV and performing a collision check every t time steps; if a collision would occur, executing step S2.3, otherwise executing step S2.4;
S2.3, introducing a penalty function, recalculating, and searching again for a position and speed of the AUV cluster center that satisfy the constraints;
S2.4, checking the generated AUV motion trajectories; if no collision occurs along the whole path, executing step S2.5, otherwise returning to step S2.3 for recalculation;
S2.5, updating the motion trajectories in real time until each AUV reaches its end point, thereby generating several collision-free motion trajectories for each AUV;
S2.6, calculating the energy consumption of all collision-free motion trajectories of each AUV, selecting the trajectory with the lowest energy consumption for each AUV, and calculating the lowest energy consumption of the AUV cluster as the cluster cooperative operation strategy.
Further, the penalty function is introduced by adding a penalty term to the objective function, which has the form:
$F(x)=E(x)+\lambda P(x)$ ;(6)
where $F(x)$ is the objective function, $E(x)$ is the energy consumption function, $P(x)$ is the penalty function, and $\lambda$ is a weight parameter used to adjust the influence of the penalty term.
Further, S3 comprises the following sub-steps:
S3.1, designing a neural network model, wherein the neural network approximates the Q-value function and comprises n local Q-value networks and one global Q-value network, n being the number of AUVs in the AUV cluster, and each local Q-value network being a recurrent neural network; the sub-strategy and high-level strategy obtained in step S1 and the historical environmental-observation data received by each agent are used as the input of the corresponding local Q-value network, which outputs an evaluation of the local Q-value function;
The global value network adopts a mixing network structure: the local Q-value networks are modeled as a graph structure, a graph attention network is used as the approximator to model the relationships among agents, and the outputs of the local Q-value networks together with the cooperative operation strategy obtained in step S2 are used as inputs to obtain, for each local Q-value network, the weight of its selected Q-value function that maximizes the global Q value;
S3.2, designing an agent model;
A control instruction generated for an action that the AUV needs to complete under water is defined as an action;
The agent state set is: {0: position, 1: attitude, 2: speed, 3: azimuth, 4: type, 5: operating state, 6: energy state}; the agent types include energy supply AUV, operation AUV and communication AUV; the energy state includes the energy supply, energy storage and energy consumption states; the agent action set is: {0: ascend, 1: dive, 2: turn left, 3: turn right, 4: start operation, 5: stop operation, 6: no operation};
The reward function is designed as:
$R(s,a)=w_1 r_{\text{goal}}+w_2 r_{\text{energy}}+w_3 r_{\text{safe}}+w_4 r_{\text{task}}$ ;(7)
where $r_{\text{goal}}$ indicates the proximity of the agent to the target; $r_{\text{energy}}$ represents the energy efficiency of the agent; $r_{\text{safe}}$ represents the safety of the agent executing action $a$; $r_{\text{task}}$ represents the effect of the agent performing the particular task; and $w_1,\dots,w_4$ are weight factors adjusted according to the task;
S3.3, training the neural network model to obtain a trained neural network model;
S3.4, inputting the sub-strategy and high-level strategy obtained in step S1 and the historical environmental-observation data received by the agent of each AUV into the trained local Q-value networks, which output the evaluation of each local Q-value function;
S3.5, the global Q-value network takes the outputs of the local Q-value networks and the cooperative operation strategy obtained in step S2 as input, uses the graph attention method to calculate the weight of the Q-value function selected by each local Q-value network so as to maximize the global Q value, and obtains the action corresponding to the maximum global Q value, which comprises the optimal action of each AUV; the actions are packaged into environment-executable instructions and sent to the AUV cluster.
Further, the recurrent neural network is a GRU neural network, whose update formulas are:
Reset gate: $r_t=\sigma\left(W_r\,[h_{t-1},x_t]+b_r\right)$ ;(8)
Update gate: $z_t=\sigma\left(W_z\,[h_{t-1},x_t]+b_z\right)$ ;(9)
Candidate hidden state: $\tilde{h}_t=\tanh\left(W_h\,[r_t\odot h_{t-1},\,x_t]+b_h\right)$ ;(10)
Final hidden state: $h_t=(1-z_t)\odot h_{t-1}+z_t\odot\tilde{h}_t$ ;(11)
where $r_t$ is the reset gate vector at time step $t$; $\sigma$ is the Sigmoid activation function; $W_r$ is the weight matrix associated with the reset gate; $h_{t-1}$ is the hidden state at time step $t-1$; $x_t$ is the input feature vector at time step $t$; $b_r$ is the bias vector associated with the reset gate; $z_t$ is the update gate vector at time step $t$; $W_z$ is the weight matrix associated with the update gate; $b_z$ is the bias vector associated with the update gate; $\tilde{h}_t$ is the candidate hidden state at time step $t$; $W_h$ is the weight matrix of the hidden state; $b_h$ is the bias vector of the hidden state; and $h_t$ is the final hidden state at time step $t$.
Further, the mixing network corresponding to the global Q-value function is trained with a graph attention network; the TD loss calculated from the global reward and the global Q-value function is back-propagated, and the global mixing network and the local Q-value function networks are trained simultaneously;
The iterative formula of the global Q-value function is:
$Q(s,a) \leftarrow Q(s,a)+\alpha\left[r_t+\gamma \max_{a'}Q(s',a')-Q(s,a)\right]$ ;(12)
where $Q(s,a)$ denotes the expected return (Q value) of taking action $a$ in state $s$; $r_t$ is the instant reward obtained at time step $t$; and $\max_{a'}Q(s',a')$ is the maximum Q value over all possible actions in the next state $s'$;
The TD loss is:
$\delta_t=r_t+\gamma \max_{a'}Q(s',a';\theta^{-})-Q(s,a;\theta)$ ;(13)
where $\delta_t$ is the temporal-difference (TD) error at time step $t$, used to evaluate the gap between the current Q value and the target Q value; $\theta$ are the parameters of the current strategy; and $\theta^{-}$ are the parameters of the target strategy.
Further, the attention mechanism of the graph attention network includes the attention coefficient:
$\alpha_{ij}=\dfrac{\exp\left(\mathrm{LeakyReLU}\left(a^{\top}[W h_i \,\|\, W h_j]\right)\right)}{\sum_{k\in \mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}\left(a^{\top}[W h_i \,\|\, W h_k]\right)\right)}$ ;(14)
where $\alpha_{ij}$ is the attention coefficient of node $i$ for its neighbor node $j$; $\exp$ is the natural exponential function used to normalize the attention coefficients; $\mathrm{LeakyReLU}$ is the activation function; $a^{\top}$ is the transpose of the parameter vector of the attention mechanism; $W$ is a weight matrix used to linearly transform the input features; $h_i$ is the feature vector of node $i$; $\|$ denotes vector concatenation; $\mathcal{N}_i$ is the set of neighbor nodes of node $i$; and $h_i'$ is the updated feature vector of node $i$ under the attention mechanism;
The updated node feature $h_i'$ is:
$h_i'=\sigma\left(\sum_{j\in \mathcal{N}_i}\alpha_{ij} W h_j\right)$ (15).
the invention has the beneficial technical effects that:
the goals of autonomous navigation, autonomous collision avoidance, autonomous task allocation and the like are realized by improving the intelligent level of the energy management and operation control of the AUV cluster, so that the AUV cluster can continuously work for a longer time in a deep sea environment. The AUV cluster can automatically adjust energy scheduling according to real-time collision prevention condition constraint and task requirements, so that economical and efficient energy operation is realized. The reinforcement learning optimization method is introduced, and global energy optimization management and control of the AUV cluster are realized from the energy characteristics, so that the AUV cluster can finish the operation task under the limited energy condition. And evaluating the behaviors of the agent by using the neural network to ensure the output of the optimal behaviors.
Drawings
Fig. 1 is a schematic diagram of an AUV cluster energy on-line monitoring and management system according to the present invention.
Fig. 2 is a flowchart of an AUV cluster energy on-line monitoring and management method according to the present invention.
FIG. 3 is a flow chart of obtaining sub-policies of each agent in the present invention.
FIG. 4 is a flow chart of generating a cluster co-operation policy in the present invention.
Fig. 5 is a flowchart of obtaining the optimal action of each AUV according to the present invention.
Detailed Description
The following is a further description of embodiments of the invention, in conjunction with the specific examples:
an AUV cluster energy on-line monitoring and management system, as shown in figure 1, comprises an energy supply unit, an energy storage unit, an energy consumption unit and an energy optimization management and control system;
the energy supply unit is used for supplying energy, comprises the power generation system and battery system of the energy supply AUV, and uses its own mechanisms to capture energy from external sources such as wave energy, providing energy for the energy storage unit and the energy consumption unit; the energy supply AUV is an AUV with a double-pontoon raft-type wave energy capture device;
the energy storage unit is used for storing energy: each AUV carries energy storage equipment such as storage batteries, supercapacitors and fuel cells, which provide energy for the energy consumption unit;
the energy consumption unit is used for using energy and mainly refers to the energy-consuming equipment of all AUVs, including sensor equipment, operation equipment, power equipment and communication equipment;
the energy optimization management and control system is used for distributing energy according to the energy demands of the energy supply and storage units and the current energy state monitored in real time, and for outputting action control instructions to the power execution structure of each AUV.
An AUV cluster energy on-line monitoring and management method, using the above AUV cluster energy on-line monitoring and management system and shown in figure 2, comprises the following steps:
s1, regarding each AUV in an AUV cluster as an agent, and obtaining each agent sub-strategy through hierarchical reinforcement learning parallel training;
as shown in fig. 3, S1 includes the following sub-steps:
s1.1, decomposing the AUV cluster optimization control problem into a plurality of subtasks, each subtask focusing on a part of the decision-making process in the AUV cluster, and determining the agent corresponding to each subtask;
Value function of a subtask: for each subtask, a state-action Q-value function $Q(s,a)$ can be defined, expressing the expected return of taking action $a$ in state $s$ and thereafter following the current policy;
The current Q-value function of a subtask is:
$Q(s,a)=\mathbb{E}\left[R_t \mid s_t=s,\; a_t=a\right]$ ;(1)
where $R_t$ is the return obtained at time step $t$, $s$ is a state, $a$ is an action, $s_t$ is the current state, and $a_t$ is the current action;
The predictive Q-value function of a subtask is:
$Q(s,a)=\mathbb{E}\left[r_t+\gamma \max_{a'}Q(s',a')\right]$ ;(2)
where $r_t$ is the reward at time step $t$, $\gamma$ is the discount factor, and $s'$ and $a'$ are the next state and action, respectively;
s1.2, designing a reinforcement learning hierarchy, in which a high-level agent formulates long-term strategy targets, i.e. the high-level strategy, and low-level agents are responsible for executing specific actions; this structure allows the agents to learn on different decision levels, which simplifies the learning process and reduces the size of the state-action space;
The Q-value function of the high-level strategy is:
$Q(s,o)=\mathbb{E}\left[R_t \mid s_t=s,\; o_t=o\right]$ ;(3)
where $o$ is a subtask or option and $o_t$ is the subtask or option at the current time $t$;
s1.3, parallel training: aiming at each subtask, independently training an agent to accelerate the learning process;
s1.4, sub-strategy learning: each agent learns its sub-strategy by interacting with the marine environment and optimizes it with a reinforcement learning algorithm; during the strategy optimization iterations, the value function of a subtask or of the high-level strategy is updated so as to satisfy the Bellman equation;
The update expression of the Q-value function of a subtask is:
$Q(s,a) \leftarrow Q(s,a)+\alpha\left[r+\gamma \max_{a'}Q(s',a')-Q(s,a)\right]$ ;(4)
where $\alpha$ is the learning rate, $\gamma$ is the discount factor, $r$ is the reward, $s'$ is the new state reached after taking action $a$, and $a'$ is a possible action in the new state.
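For illustration only, the following minimal Python sketch shows how the subtask Q-value update of formula (4) could be implemented for one agent with a tabular Q function; the state and action encoding, learning rate and discount factor are illustrative assumptions, not values prescribed by the invention.

```python
from collections import defaultdict

def update_subtask_q(q_table, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """One tabular update: Q(s,a) <- Q(s,a) + alpha*[r + gamma*max_a' Q(s',a') - Q(s,a)]."""
    best_next = max(q_table[(s_next, a_next)] for a_next in actions)
    td_target = r + gamma * best_next
    q_table[(s, a)] += alpha * (td_target - q_table[(s, a)])
    return q_table[(s, a)]

# Illustrative usage: 7 discrete actions as in the agent action set {0..6}.
actions = range(7)
q_table = defaultdict(float)          # Q values default to 0
update_subtask_q(q_table, s="surfaced", a=1, r=-0.2, s_next="diving", actions=actions)
```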
S1.5, further adjusting the optimized sub-strategies through the multi-agent cooperative strategy. In the sub-strategy adjustment an option model is used: each sub-strategy is represented by a triplet $(I_o,\pi_o,\beta_o)$, where $I_o$ is the set of states in which option $o$ can be initiated, $\pi_o$ is the strategy followed under option $o$, and $\beta_o(s)$ is the probability that option $o$ terminates in state $s$;
When an option is executed, the agent learns not only when to switch options but also the optimal behavior inside each option; the Q-value function of a subtask after executing option $o$ is:
$Q(s,o)=\mathbb{E}\left[r_{t+1}+\gamma r_{t+2}+\cdots+\gamma^{k-1} r_{t+k}+\gamma^{k}\max_{o'}Q(s_{t+k},o')\right]$, where $k$ is the number of time steps for which the option lasts (5).
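As a sketch only, the option triplet $(I_o,\pi_o,\beta_o)$ described above could be represented as follows; the fields, state labels and termination test are assumptions chosen to mirror the standard options framework rather than a prescribed implementation.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation_set: Set[str]                  # I_o: states where the option may start
    policy: Callable[[str], int]              # pi_o: maps a state to a low-level action
    termination_prob: Callable[[str], float]  # beta_o(s): probability of terminating in s

    def can_start(self, state: str) -> bool:
        return state in self.initiation_set

    def should_terminate(self, state: str) -> bool:
        return random.random() < self.termination_prob(state)

# Illustrative option: dive until the AUV reaches working depth.
dive_option = Option(
    initiation_set={"surfaced", "shallow"},
    policy=lambda s: 1,                                   # action 1 = dive
    termination_prob=lambda s: 1.0 if s == "at_depth" else 0.1,
)
```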
s2, performing energy optimization through clustering-based collision avoidance of the AUV cluster to generate a cluster cooperative operation strategy;
as shown in fig. 4, S2 includes the following sub-steps:
s2.1, based on a graph-theoretic clustering algorithm, searching for the position and speed of the AUV cluster center with the lowest energy consumption as the objective;
s2.2, acquiring the motion trajectory of each AUV and performing a collision check every t time steps; if a collision would occur, executing step S2.3, otherwise executing step S2.4;
s2.3, introducing a penalty function, recalculating, and searching again for a position and speed of the AUV cluster center that satisfy the constraints;
The penalty function is introduced by adding a penalty term to the objective function; this penalty term is related to the degree to which the constraints are violated. When the optimization algorithm tries to optimize the objective function, a solution that violates the constraints increases the value of the objective function through the penalty term, making such a solution less attractive during optimization.
By introducing the penalty function, the optimization algorithm can better explore the part of the solution space that satisfies the constraints and finally find an optimal solution, or a near-optimal solution, that meets the constraint conditions.
The objective function has the following form:
$F(x)=E(x)+\lambda P(x)$ ;(6)
where $F(x)$ is the objective function, $E(x)$ is the energy consumption function, $P(x)$ is the penalty function, and $\lambda$ is a weight parameter used to adjust the influence of the penalty term.
s2.4, checking the generated AUV motion trajectories; if no collision occurs along the whole path, executing step S2.5, otherwise returning to step S2.3 for recalculation;
s2.5, updating the motion trajectories in real time until each AUV reaches its end point, thereby generating several collision-free motion trajectories for each AUV;
s2.6, calculating the energy consumption of all collision-free motion trajectories of each AUV, selecting the trajectory with the lowest energy consumption for each AUV, and calculating the lowest energy consumption of the AUV cluster as the cluster cooperative operation strategy.
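A minimal sketch of the penalized trajectory selection in steps S2.3 to S2.6 is given below; the energy and penalty functions shown are placeholder assumptions, and only the structure F(x) = E(x) + λ·P(x) and the lowest-energy selection follow the text.

```python
def objective(trajectory, energy_fn, penalty_fn, lam=10.0):
    """F(x) = E(x) + lambda * P(x): energy cost plus a weighted constraint-violation penalty."""
    return energy_fn(trajectory) + lam * penalty_fn(trajectory)

def select_cooperative_strategy(candidate_trajectories, energy_fn, penalty_fn, lam=10.0):
    """Keep collision-free trajectories (penalty == 0) and pick the lowest-cost one per AUV."""
    best = {}
    for auv_id, trajectories in candidate_trajectories.items():
        feasible = [t for t in trajectories if penalty_fn(t) == 0.0]
        best[auv_id] = min(feasible, key=lambda t: objective(t, energy_fn, penalty_fn, lam))
    cluster_energy = sum(energy_fn(t) for t in best.values())
    return best, cluster_energy

# Illustrative placeholders: energy proportional to path length, no constraint violations.
energy_fn = lambda traj: sum(abs(b - a) for a, b in zip(traj, traj[1:]))
penalty_fn = lambda traj: 0.0   # assume the candidates already passed the collision check
best, total = select_cooperative_strategy({"auv_1": [[0, 1, 3], [0, 2, 3]]}, energy_fn, penalty_fn)
```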
S3, designing a neural network model and an agent model to obtain the optimal action of each AUV, the optimal action being the action with the least energy consumption;
as shown in fig. 5, S3 includes the following sub-steps:
s3.1, designing a neural network model, wherein the neural network approximates the Q-value function and comprises n local Q-value networks and one global Q-value network, n being the number of AUVs in the AUV cluster, and each local Q-value network being a recurrent neural network. The sub-strategy and high-level strategy obtained in step S1 and the historical environmental-observation data received by each agent are used as the input of the corresponding local Q-value network, which outputs an evaluation of the local Q-value function;
The recurrent neural network is a GRU neural network that converts the input sequence into an output sequence in a different domain, the output sequence corresponding to the next step of the input sequence. The whole network is trained with the twice-sampled data rather than in an end-to-end manner, and rewards are passed between the upper-layer and lower-layer controllers. The update formulas of the GRU neural network are as follows:
Reset gate: $r_t=\sigma\left(W_r\,[h_{t-1},x_t]+b_r\right)$ ;(8)
Update gate: $z_t=\sigma\left(W_z\,[h_{t-1},x_t]+b_z\right)$ ;(9)
Candidate hidden state: $\tilde{h}_t=\tanh\left(W_h\,[r_t\odot h_{t-1},\,x_t]+b_h\right)$ ;(10)
Final hidden state: $h_t=(1-z_t)\odot h_{t-1}+z_t\odot\tilde{h}_t$ ;(11)
where $r_t$ is the reset gate vector at time step $t$; $\sigma$ is the Sigmoid activation function; $W_r$ is the weight matrix associated with the reset gate; $h_{t-1}$ is the hidden state at time step $t-1$; $x_t$ is the input feature vector at time step $t$; $b_r$ is the bias vector associated with the reset gate; $z_t$ is the update gate vector at time step $t$; $W_z$ is the weight matrix associated with the update gate; $b_z$ is the bias vector associated with the update gate; $\tilde{h}_t$ is the candidate hidden state at time step $t$; $W_h$ is the weight matrix of the hidden state; $b_h$ is the bias vector of the hidden state; and $h_t$ is the final hidden state at time step $t$.
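The following NumPy sketch of a single GRU step follows formulas (8) to (11); the weight shapes and random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, b_r, W_z, b_z, W_h, b_h):
    """One GRU update following formulas (8)-(11)."""
    xh = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ xh + b_r)                      # (8) reset gate
    z_t = sigmoid(W_z @ xh + b_z)                      # (9) update gate
    xh_reset = np.concatenate([r_t * h_prev, x_t])     # [r_t * h_{t-1}, x_t]
    h_cand = np.tanh(W_h @ xh_reset + b_h)             # (10) candidate hidden state
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand          # (11) final hidden state
    return h_t

# Illustrative dimensions: 4-dimensional observation, 8-dimensional hidden state.
rng = np.random.default_rng(0)
obs_dim, hid_dim = 4, 8
W_r, W_z, W_h = (rng.standard_normal((hid_dim, hid_dim + obs_dim)) * 0.1 for _ in range(3))
b_r = b_z = b_h = np.zeros(hid_dim)
h = gru_step(rng.standard_normal(obs_dim), np.zeros(hid_dim), W_r, b_r, W_z, b_z, W_h, b_h)
```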
The global value network adopts a mixing network structure: the local Q-value networks are modeled as a graph structure, a graph attention network is used as the approximator to model the relationships among agents, and the outputs of the local Q-value networks together with the cooperative operation strategy obtained in step S2 are used as inputs to obtain, for each local Q-value network, the weight of its selected Q-value function that maximizes the global Q value;
The mixing network corresponding to the global Q-value function is trained with a graph attention network; the TD loss calculated from the global reward and the global Q-value function is back-propagated, and the global mixing network and the local Q-value function networks are trained simultaneously;
The iterative formula of the global Q-value function is:
$Q(s,a) \leftarrow Q(s,a)+\alpha\left[r_t+\gamma \max_{a'}Q(s',a')-Q(s,a)\right]$ ;(12)
where $Q(s,a)$ denotes the expected return (Q value) of taking action $a$ in state $s$; $\alpha$ is the learning rate, which determines the extent to which new information overrides old information; $r_t$ is the instant reward obtained at time step $t$; $\gamma$ is the discount factor, used to compute the present value of future rewards; and $\max_{a'}Q(s',a')$ is the maximum Q value over all possible actions in the next state $s'$;
The TD loss is:
$\delta_t=r_t+\gamma \max_{a'}Q(s',a';\theta^{-})-Q(s,a;\theta)$ ;(13)
where $\delta_t$ is the temporal-difference (TD) error at time step $t$, used to evaluate the gap between the current Q value and the target Q value; $\theta$ are the parameters of the current strategy; and $\theta^{-}$ are the parameters of the target strategy;
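As an illustrative sketch only, the TD target and TD error of formulas (12) and (13) could be computed as below; the squared-error loss over the TD error is a common choice and an assumption here, not a detail stated in the text.

```python
import numpy as np

def td_error(r_t, q_sa_current, q_next_target, gamma=0.99):
    """delta_t = r_t + gamma * max_a' Q(s', a'; theta_minus) - Q(s, a; theta)  -- formula (13)."""
    return r_t + gamma * np.max(q_next_target) - q_sa_current

def td_loss(deltas):
    """Mean squared TD error, back-propagated to train the mixing and local Q networks."""
    return float(np.mean(np.square(deltas)))

# Illustrative values: current Q estimate and target-network Q values for the next state.
delta = td_error(r_t=0.5, q_sa_current=1.2, q_next_target=np.array([0.7, 1.1, 0.9]))
loss = td_loss(np.array([delta]))
```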
The attention mechanism of the graph attention network includes:
The attention coefficient:
$\alpha_{ij}=\dfrac{\exp\left(\mathrm{LeakyReLU}\left(a^{\top}[W h_i \,\|\, W h_j]\right)\right)}{\sum_{k\in \mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}\left(a^{\top}[W h_i \,\|\, W h_k]\right)\right)}$ ;(14)
where $\alpha_{ij}$ is the attention coefficient of node $i$ for its neighbor node $j$; $\exp$ is the natural exponential function used to normalize the attention coefficients; $\mathrm{LeakyReLU}$ is the activation function; $a^{\top}$ is the transpose of the parameter vector of the attention mechanism; $W$ is a weight matrix used to linearly transform the input features; $h_i$ is the feature vector of node $i$; $\|$ denotes vector concatenation; $\mathcal{N}_i$ is the set of neighbor nodes of node $i$; and $h_i'$ is the updated feature vector of node $i$ under the attention mechanism;
Together, these variables make up the mathematical operations inside the neural network model that are used for forward and backward propagation, allowing the model to learn to extract features and patterns from the data.
The updated node feature $h_i'$ is:
$h_i'=\sigma\left(\sum_{j\in \mathcal{N}_i}\alpha_{ij} W h_j\right)$ (15).
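The graph-attention update of formulas (14) and (15) can be sketched in NumPy as follows; the single attention head, the LeakyReLU slope and the sigmoid output activation are illustrative assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_update(H, W, a_vec, neighbors, i):
    """Update node i's features: alpha_ij per formula (14), h_i' per formula (15)."""
    Wh = H @ W.T                                            # linearly transform all node features
    scores = np.array([
        leaky_relu(a_vec @ np.concatenate([Wh[i], Wh[j]]))  # a^T [W h_i || W h_j]
        for j in neighbors[i]
    ])
    alpha = np.exp(scores) / np.exp(scores).sum()           # softmax over the neighborhood
    h_new = sum(alpha[k] * Wh[j] for k, j in enumerate(neighbors[i]))
    return 1.0 / (1.0 + np.exp(-h_new))                     # sigma(...) output activation

# Illustrative graph: 3 local Q networks (nodes) with 4-dimensional features.
rng = np.random.default_rng(1)
H = rng.standard_normal((3, 4))
W = rng.standard_normal((4, 4)) * 0.1
a_vec = rng.standard_normal(8) * 0.1
h1_new = gat_update(H, W, a_vec, neighbors={1: [0, 2]}, i=1)
```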
s3.2, designing an agent model;
A control instruction generated for an action that the AUV needs to complete under water is defined as an action;
The agent state set is: {0: position, 1: attitude, 2: speed, 3: azimuth, 4: type, 5: operating state, 6: energy state}; the agent types include energy supply AUV, operation AUV and communication AUV; the energy state includes the energy supply, energy storage and energy consumption states; the agent action set is: {0: ascend, 1: dive, 2: turn left, 3: turn right, 4: start operation, 5: stop operation, 6: no operation};
Designing the reward function: when the agent performs an action, a negative or positive reward is given according to how good or bad the action is, and the reward (evaluation) function is designed accordingly. The agent, as the learning subject, selects actions according to the reward and again obtains feedback from the environment; by learning an action-selection strategy throughout this process, it continuously improves its ability to obtain rewards.
The following factors need to be considered when designing the reward function:
(1) Task goal: the degree to which the AUV completes the set task;
(2) Energy efficiency: how much energy the AUV consumes when performing the task;
(3) Safety: whether the AUV avoids potential hazards while executing the task;
(4) Task execution quality: the accuracy with which the AUV performs the specific job task.
The contribution of each part of the reward function to the overall reward can be set by adjusting the weight parameters according to the specific needs of the task and the environmental conditions. In addition, extra reward or penalty terms may need to be introduced to motivate the agent to explore new states or strategies and to avoid potential long-term negative effects. The reward function needs to be tuned over multiple experiments and iterations to achieve the best learning effect.
The reward function is defined as:
$R(s,a)=w_1 r_{\text{goal}}+w_2 r_{\text{energy}}+w_3 r_{\text{safe}}+w_4 r_{\text{task}}$ ;(7)
where $s$ denotes the current state of the agent and $a$ the action selected by the agent; $r_{\text{goal}}$ indicates the proximity of the agent to the target; $r_{\text{energy}}$ represents the energy efficiency of the agent; $r_{\text{safe}}$ represents the safety of the agent executing action $a$; $r_{\text{task}}$ represents the effect of the agent performing the particular task; and $w_1,\dots,w_4$ are weight factors adjusted according to the task;
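A sketch of the weighted reward of formula (7) is shown below; the individual reward terms and the weights are placeholder assumptions chosen for illustration.

```python
def reward(state, action, weights=(0.4, 0.3, 0.2, 0.1)):
    """R(s,a) = w1*r_goal + w2*r_energy + w3*r_safe + w4*r_task (formula (7))."""
    r_goal = 1.0 - min(state["distance_to_target"] / state["initial_distance"], 1.0)
    r_energy = state["remaining_energy"] / state["battery_capacity"]
    r_safe = -1.0 if state["collision_risk"] else 0.0
    r_task = state["task_progress"]          # fraction of the job task completed
    w1, w2, w3, w4 = weights
    return w1 * r_goal + w2 * r_energy + w3 * r_safe + w4 * r_task

# Illustrative call for one agent step (action 4 = start operation).
r = reward({"distance_to_target": 20.0, "initial_distance": 100.0,
            "remaining_energy": 60.0, "battery_capacity": 100.0,
            "collision_risk": False, "task_progress": 0.25}, action=4)
```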
s3.3, training the neural network model to obtain a trained neural network model;
s3.4, inputting the sub-strategy and high-level strategy obtained in step S1 and the historical environmental-observation data received by the agent of each AUV into the trained local Q-value networks, which output the evaluation of each local Q-value function;
s3.5, the global Q-value network takes the outputs of the local Q-value networks and the cooperative operation strategy obtained in step S2 as input, uses the graph attention method to calculate the weight of the Q-value function selected by each local Q-value network so as to maximize the global Q value, and obtains the action corresponding to the maximum global Q value, i.e. the action with the least energy consumption, which comprises the optimal action of each AUV; the actions are packaged into environment-executable instructions and sent to the AUV cluster.
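The action selection of step S3.5 could look like the following sketch, in which the local Q values for each candidate action are assumed to be already evaluated and the mixing weights come from the graph attention step; the simple weighted sum used as the global Q value is an assumption for illustration.

```python
import numpy as np

def select_joint_action(local_q, attention_weights, action_labels):
    """Pick the joint action whose attention-weighted sum of local Q values is largest (step S3.5)."""
    global_q = local_q @ attention_weights          # one global Q value per candidate joint action
    best_idx = int(np.argmax(global_q))
    return action_labels[best_idx], float(global_q[best_idx])

# Illustrative values: 3 candidate joint actions evaluated by 2 local Q networks.
local_q = np.array([[0.8, 0.5],     # joint action 0
                    [0.6, 0.9],     # joint action 1
                    [0.4, 0.3]])    # joint action 2
weights = np.array([0.6, 0.4])      # mixing weights from the graph attention network
action, q_value = select_joint_action(local_q, weights, action_labels=["ascend", "dive", "turn left"])
instruction = {"command": action, "q_value": q_value}   # packaged as an environment-executable instruction
```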
It should be understood that the above description is not intended to limit the invention to the particular embodiments disclosed; rather, the invention is intended to cover all modifications, adaptations, additions and alternatives falling within its spirit and scope.

Claims (10)

1. An AUV cluster energy on-line monitoring and management system, characterized by comprising an energy supply unit, an energy storage unit, an energy consumption unit and an energy optimization management and control system;
the energy supply unit is used for supplying energy and comprises the power generation system and battery system of an energy supply AUV;
the energy storage unit is used for storing energy: each AUV carries energy storage equipment and provides energy for the energy consumption unit;
the energy consumption unit is used for using energy and mainly refers to the energy-consuming equipment of all AUVs, including sensor equipment, operation equipment, power equipment and communication equipment;
the energy optimization management and control system is used for distributing energy according to the energy demand of each energy supply and storage unit and the current energy state monitored in real time, and for outputting action control instructions to the power execution structure of each AUV.
2. The AUV cluster energy on-line monitoring and management system of claim 1, wherein the energy replenishment AUV is an AUV having a dual pontoon raft type wave energy capture device structure.
3. An AUV cluster energy on-line monitoring and management method, characterized in that an AUV cluster energy on-line monitoring and management system as claimed in claim 1 or 2 is adopted, comprising the following sub-steps:
s1, regarding each AUV in an AUV cluster as an agent, and obtaining each agent sub-strategy through hierarchical reinforcement learning parallel training;
s2, performing energy optimization through clustering-based collision avoidance of the AUV cluster to generate a cluster cooperative operation strategy;
and S3, designing a neural network model and an agent model to obtain the optimal action of each AUV, wherein the optimal action is the action with the least energy consumption.
4. The method for online monitoring and managing AUV cluster energy according to claim 3, wherein the step S1 comprises the following sub-steps:
s1.1, decomposing the AUV cluster optimization control problem into a plurality of subtasks, each subtask focusing on a part of the decision-making process in the AUV cluster, and determining the agent corresponding to each subtask;
the current Q-value function of a subtask, $Q(s,a)$, is:
$Q(s,a)=\mathbb{E}\left[R_t \mid s_t=s,\; a_t=a\right]$ ;(1)
where $R_t$ is the return obtained at time step $t$, $s$ is a state, $a$ is an action, $s_t$ is the current state, $a_t$ is the current action, and $\mathbb{E}[\cdot]$ is the expected value;
the predictive Q-value function of a subtask is:
$Q(s,a)=\mathbb{E}\left[r_t+\gamma \max_{a'}Q(s',a')\right]$ ;(2)
where $r_t$ is the reward at time step $t$, $\gamma$ is the discount factor, and $s'$ and $a'$ are the next state and action, respectively;
s1.2, designing a reinforcement learning hierarchy, in which a high-level agent formulates long-term strategy targets, i.e. the high-level strategy, and low-level agents are responsible for executing specific actions;
the Q-value function of the high-level strategy is:
$Q(s,o)=\mathbb{E}\left[R_t \mid s_t=s,\; o_t=o\right]$ ;(3)
where $o$ is a subtask or option and $o_t$ is the subtask or option at the current time $t$;
s1.3, parallel training: for each subtask, an agent is trained independently to accelerate the learning process;
s1.4, sub-strategy learning: each agent learns its sub-strategy by interacting with the marine environment and optimizes it with a reinforcement learning algorithm; during the strategy optimization iterations, the value function of a subtask or of the high-level strategy is updated so as to satisfy the Bellman equation;
the update expression of the Q-value function of a subtask is:
$Q(s,a) \leftarrow Q(s,a)+\alpha\left[r+\gamma \max_{a'}Q(s',a')-Q(s,a)\right]$ ;(4)
where $\alpha$ is the learning rate, $r$ is the reward, $a'$ is a possible action in the new state, and $s'$ is the new state reached after taking action $a$;
s1.5, further adjusting the optimized sub-strategies through the multi-agent cooperative strategy, an option model being used in the sub-strategy adjustment: each sub-strategy is represented by a triplet $(I_o,\pi_o,\beta_o)$, where $I_o$ is the set of states in which option $o$ can be initiated, $\pi_o$ is the strategy followed under option $o$, and $\beta_o(s)$ is the probability that option $o$ terminates in state $s$;
when an option is executed, the agent learns not only when to switch options but also the optimal behavior inside each option; the Q-value function of a subtask after executing option $o$ is:
$Q(s,o)=\mathbb{E}\left[r_{t+1}+\gamma r_{t+2}+\cdots+\gamma^{k-1} r_{t+k}+\gamma^{k}\max_{o'}Q(s_{t+k},o')\right]$, where $k$ is the number of time steps for which the option lasts (5).
5. the AUV cluster energy on-line monitoring and management method of claim 4, wherein S2 includes the sub-steps of:
s2.1, based on a graph-theoretic clustering algorithm, searching for the position and speed of the AUV cluster center with the lowest energy consumption as the objective;
s2.2, acquiring the motion trajectory of each AUV and performing a collision check every t time steps; if a collision would occur, executing step S2.3, otherwise executing step S2.4;
s2.3, introducing a penalty function, recalculating, and searching again for a position and speed of the AUV cluster center that satisfy the constraints;
s2.4, checking the generated AUV motion trajectories; if no collision occurs along the whole path, executing step S2.5, otherwise returning to step S2.3 for recalculation;
s2.5, updating the motion trajectories in real time until each AUV reaches its end point, thereby generating several collision-free motion trajectories for each AUV;
s2.6, calculating the energy consumption of all collision-free motion trajectories of each AUV, selecting the trajectory with the lowest energy consumption for each AUV, and calculating the lowest energy consumption of the AUV cluster as the cluster cooperative operation strategy.
6. The method for online monitoring and managing AUV cluster energy according to claim 5, wherein the penalty function is introduced by adding a penalty term to the objective function, and the objective function has the following form:
$F(x)=E(x)+\lambda P(x)$ ;(6)
where $F(x)$ is the objective function, $E(x)$ is the energy consumption function, $P(x)$ is the penalty function, and $\lambda$ is a weight parameter used to adjust the influence of the penalty term.
7. The AUV cluster energy on-line monitoring and management method of claim 6, wherein S3 includes the sub-steps of:
s3.1, designing a neural network model, wherein the neural network approximates the Q-value function and comprises n local Q-value networks and one global Q-value network, n being the number of AUVs in the AUV cluster, and each local Q-value network being a recurrent neural network; the sub-strategy and high-level strategy obtained in step S1 and the historical environmental-observation data received by each agent are used as the input of the corresponding local Q-value network, which outputs an evaluation of the local Q-value function;
the global value network adopts a mixing network structure: the local Q-value networks are modeled as a graph structure, a graph attention network is used as the approximator to model the relationships among agents, and the outputs of the local Q-value networks together with the cooperative operation strategy obtained in step S2 are used as inputs to obtain, for each local Q-value network, the weight of its selected Q-value function that maximizes the global Q value;
s3.2, designing an agent model;
a control instruction generated for an action that the AUV needs to complete under water is defined as an action;
the agent state set is: {0: position, 1: attitude, 2: speed, 3: azimuth, 4: type, 5: operating state, 6: energy state}; the agent types include energy supply AUV, operation AUV and communication AUV; the energy state includes the energy supply, energy storage and energy consumption states; the agent action set is: {0: ascend, 1: dive, 2: turn left, 3: turn right, 4: start operation, 5: stop operation, 6: no operation};
the reward function is designed as:
$R(s,a)=w_1 r_{\text{goal}}+w_2 r_{\text{energy}}+w_3 r_{\text{safe}}+w_4 r_{\text{task}}$ ;(7)
where $r_{\text{goal}}$ indicates the proximity of the agent to the target; $r_{\text{energy}}$ represents the energy efficiency of the agent; $r_{\text{safe}}$ represents the safety of the agent executing action $a$; $r_{\text{task}}$ represents the effect of the agent performing the particular task; and $w_1,\dots,w_4$ are weight factors adjusted according to the task;
s3.3, training the neural network model to obtain a trained neural network model;
s3.4, inputting the sub-strategy and high-level strategy obtained in step S1 and the historical environmental-observation data received by the agent of each AUV into the trained local Q-value networks, which output the evaluation of each local Q-value function;
s3.5, the global Q-value network takes the outputs of the local Q-value networks and the cooperative operation strategy obtained in step S2 as input, uses the graph attention method to calculate the weight of the Q-value function selected by each local Q-value network so as to maximize the global Q value, and obtains the action corresponding to the maximum global Q value, which comprises the optimal action of each AUV; the actions are packaged into environment-executable instructions and sent to the AUV cluster.
8. The method for online monitoring and managing AUV cluster energy according to claim 7, wherein the recurrent neural network is a GRU neural network, and the update formulas of the GRU neural network are as follows:
Reset gate: $r_t=\sigma\left(W_r\,[h_{t-1},x_t]+b_r\right)$ ;(8)
Update gate: $z_t=\sigma\left(W_z\,[h_{t-1},x_t]+b_z\right)$ ;(9)
Candidate hidden state: $\tilde{h}_t=\tanh\left(W_h\,[r_t\odot h_{t-1},\,x_t]+b_h\right)$ ;(10)
Final hidden state: $h_t=(1-z_t)\odot h_{t-1}+z_t\odot\tilde{h}_t$ ;(11)
where $r_t$ is the reset gate vector at time step $t$; $\sigma$ is the Sigmoid activation function; $W_r$ is the weight matrix associated with the reset gate; $h_{t-1}$ is the hidden state at time step $t-1$; $x_t$ is the input feature vector at time step $t$; $b_r$ is the bias vector associated with the reset gate; $z_t$ is the update gate vector at time step $t$; $W_z$ is the weight matrix associated with the update gate; $b_z$ is the bias vector associated with the update gate; $\tilde{h}_t$ is the candidate hidden state at time step $t$; $W_h$ is the weight matrix of the hidden state; $b_h$ is the bias vector of the hidden state; and $h_t$ is the final hidden state at time step $t$.
9. The method for online monitoring and managing AUV cluster energy according to claim 8, wherein the mixing network corresponding to the global Q-value function is trained with a graph attention network, the TD loss calculated from the global reward and the global Q-value function is back-propagated, and the global mixing network and the local Q-value function networks are trained simultaneously;
the iterative formula of the global Q-value function is:
$Q(s,a) \leftarrow Q(s,a)+\alpha\left[r_t+\gamma \max_{a'}Q(s',a')-Q(s,a)\right]$ ;(12)
where $Q(s,a)$ denotes the expected return (Q value) of taking action $a$ in state $s$; $r_t$ is the instant reward obtained at time step $t$; and $\max_{a'}Q(s',a')$ is the maximum Q value over all possible actions in the next state $s'$;
the TD loss is:
$\delta_t=r_t+\gamma \max_{a'}Q(s',a';\theta^{-})-Q(s,a;\theta)$ ;(13)
where $\delta_t$ is the temporal-difference (TD) error at time step $t$, used to evaluate the gap between the current Q value and the target Q value; $\theta$ are the parameters of the current strategy; and $\theta^{-}$ are the parameters of the target strategy.
10. The method for online monitoring and managing AUV cluster energy according to claim 9, wherein the attention mechanism of the graph attention network comprises:
the attention coefficient:
$\alpha_{ij}=\dfrac{\exp\left(\mathrm{LeakyReLU}\left(a^{\top}[W h_i \,\|\, W h_j]\right)\right)}{\sum_{k\in \mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}\left(a^{\top}[W h_i \,\|\, W h_k]\right)\right)}$ ;(14)
where $\alpha_{ij}$ is the attention coefficient of node $i$ for its neighbor node $j$; $\exp$ is the natural exponential function used to normalize the attention coefficients; $\mathrm{LeakyReLU}$ is the activation function; $a^{\top}$ is the transpose of the parameter vector of the attention mechanism; $W$ is a weight matrix used to linearly transform the input features; $h_i$ is the feature vector of node $i$; $\|$ denotes vector concatenation; $\mathcal{N}_i$ is the set of neighbor nodes of node $i$; and $h_i'$ is the updated feature vector of node $i$ under the attention mechanism;
and the updated node feature $h_i'$:
$h_i'=\sigma\left(\sum_{j\in \mathcal{N}_i}\alpha_{ij} W h_j\right)$ (15).
CN202410191318.6A 2024-02-21 AUV cluster energy online monitoring and management system and method Active CN117748747B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410191318.6A CN117748747B (en) 2024-02-21 AUV cluster energy online monitoring and management system and method


Publications (2)

Publication Number / Publication Date
CN117748747A / 2024-03-22
CN117748747B / 2024-05-17



Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210221247A1 (en) * 2018-06-22 2021-07-22 Moixa Energy Holdings Limited Systems for machine learning, optimising and managing local multi-asset flexibility of distributed energy storage resources
CN110046800A (en) * 2019-03-14 2019-07-23 南京航空航天大学 The satellite cluster formation adjusting planing method of space-oriented target cooperative observation
CN112507622A (en) * 2020-12-16 2021-03-16 中国人民解放军国防科技大学 Anti-unmanned aerial vehicle task allocation method based on reinforcement learning
CN113902087A (en) * 2021-10-25 2022-01-07 吉林建筑大学 Multi-Agent deep reinforcement learning algorithm
CN114355973A (en) * 2021-12-28 2022-04-15 哈尔滨工程大学 Multi-agent hierarchical reinforcement learning-based unmanned cluster cooperation method under weak observation condition
WO2023160012A1 (en) * 2022-02-25 2023-08-31 南京信息工程大学 Unmanned aerial vehicle assisted edge computing method for random inspection of power grid line
CN116317162A (en) * 2023-03-31 2023-06-23 武汉船用电力推进装置研究所(中国船舶集团有限公司第七一二研究所) Underwater energy platform power supply system and control method thereof
CN116533234A (en) * 2023-04-28 2023-08-04 山东大学 Multi-axis hole assembly method and system based on layered reinforcement learning and distributed learning
CN116546421A (en) * 2023-05-16 2023-08-04 西安邮电大学 Unmanned aerial vehicle position deployment and minimum energy consumption AWAQ algorithm based on edge calculation
CN116700299A (en) * 2023-05-30 2023-09-05 西安天和海防智能科技有限公司 AUV cluster control system and method based on digital twin
CN116588282A (en) * 2023-07-17 2023-08-15 青岛哈尔滨工程大学创新发展中心 AUV intelligent operation and maintenance system and method
CN117055619A (en) * 2023-09-06 2023-11-14 桂林电子科技大学 Unmanned aerial vehicle scheduling method based on multi-agent reinforcement learning
CN117270559A (en) * 2023-09-22 2023-12-22 南京航空航天大学 Unmanned aerial vehicle cluster deployment and track planning method based on reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张冬; 陈涛; 乔玉龙: "AUV distributed autonomous control and mission management method", Fire Control & Command Control, no. 09, 15 September 2011 (2011-09-15), pages 172-175 *
曾隽芳, 刘禹 et al.: "Research on AUV cluster energy management and optimized operation control methods", Proceedings of the 5th Underwater Unmanned Systems Technology Summit Forum, 15 November 2022 (2022-11-15), pages 271-276 *

Similar Documents

Publication Publication Date Title
Wang et al. Dynamic path planning for unmanned surface vehicle in complex offshore areas based on hybrid algorithm
Russell et al. Q-decomposition for reinforcement learning agents
CN102819264B (en) Path planning Q-learning initial method of mobile robot
CN102402712B (en) Robot reinforced learning initialization method based on neural network
CN113052372B (en) Dynamic AUV tracking path planning method based on deep reinforcement learning
CN111538349B (en) Long-range AUV autonomous decision-making method oriented to multiple tasks
CN115544899A (en) Water plant water intake pump station energy-saving scheduling method based on multi-agent deep reinforcement learning
CN117039981A (en) Large-scale power grid optimal scheduling method, device and storage medium for new energy
Wang et al. AUV-Assisted Node Repair for IoUT Relying on Multi-Agent Reinforcement Learning
CN111813143B (en) Underwater glider intelligent control system and method based on reinforcement learning
CN117748747B (en) AUV cluster energy online monitoring and management system and method
Chen et al. Survey of multi-agent strategy based on reinforcement learning
CN117748747A (en) AUV cluster energy online monitoring and management system and method
Wang et al. MUTS-based cooperative target stalking for a multi-USV system
Yan et al. An improved multi-AUV patrol path planning method
CN116702903A (en) Spacecraft cluster game intelligent decision-making method based on deep reinforcement learning
CN115334165B (en) Underwater multi-unmanned platform scheduling method and system based on deep reinforcement learning
Yazdani et al. Perception-aware online trajectory generation for a prescribed manoeuvre of unmanned surface vehicle in cluttered unstructured environment
Gao Soft computing methods for control and instrumentation
CN117111620B (en) Autonomous decision-making method for task allocation of heterogeneous unmanned system
Du et al. Safe multi-agent learning control for unmanned surface vessels cooperative interception mission
CN113837654B (en) Multi-objective-oriented smart grid hierarchical scheduling method
Bo et al. Adaptive Dynamic Programming Based on Parallel Control Theory for Underwater Vehicles
CN117111594B (en) Self-adaptive track control method for unmanned surface vessel
Cao et al. Threat Assessment Strategy of Human-in-the-Loop Unmanned Underwater Vehicle Under Uncertain Events

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant