CN116909150A - AUV intelligent control system based on PPO algorithm, control method and application

Info

Publication number: CN116909150A
Application number: CN202310915486.0A
Original language: Chinese (zh)
Inventors: 李沂滨, 孙雨泽, 崔明, 张忠铝
Applicant (assignee): Shandong University
Legal status: Pending

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B 13/04 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators
    • G05B 13/042 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Feedback Control In General (AREA)

Abstract

The application relates to an AUV intelligent control system based on the PPO algorithm, a control method and an application thereof, belonging to the technical field of intelligent control of robots, and comprising the following steps: S1: constructing an AUV intelligent control simulation model based on the PPO algorithm according to the AUV running state and executed actions; S2: obtaining the AUV action vector through the PPO algorithm at each control time step, thereby obtaining the AUV running-state transition vector of each control time step, and storing it in a control experience buffer space library; S3: training the AUV simulation model according to the control experience buffer space library obtained under PPO control, so that the trained AUV intelligent control simulation model outputs an execution policy according to the current running state and controls the AUV actions. A priority sampling mechanism is introduced into the PPO algorithm, which increases the training speed and the sample utilization rate and improves the training effect.

Description

AUV intelligent control system based on PPO algorithm, control method and application
Technical Field
The application relates to an improved AUV intelligent control system based on a PPO algorithm, a control method and an application thereof, and belongs to the technical field of intelligent control of robots.
Background
With the development of science and technology, human exploitation of and investment in ocean resources keep growing. The ocean holds rich mineral resources, biological resources and renewable energy, making it an important asset for the sustainable development of human society. The autonomous underwater vehicle (AUV), as an important tool for marine operations, is well suited to offshore search, survey, identification and salvage tasks, and research on intelligent AUV control is bringing marine development into a new era.
Traditional AUV intelligent control methods mainly fall into the following categories:
(1) Model-based methods, such as PID (proportional-integral-derivative) and sliding-mode control, require the relevant parameters to be set in advance; when the external environment changes, maintaining accuracy and autonomy while the AUV executes its task becomes a great challenge.
(2) Machine-learning methods represented by reinforcement learning and deep learning. Reinforcement learning lets the AUV agent continuously interact with and explore the environment and finally learn an optimal strategy without building an accurate model, weakening the constraints imposed by the environment. However, AUV tasks with relatively complex working environments that are closer to real conditions often have large state spaces and continuous action spaces, whereas conventional reinforcement learning is limited to small and generally discrete action and sample spaces. When the input data are images or sound, they tend to be high-dimensional and are difficult for conventional reinforcement learning to handle.
(3) Methods based on deep reinforcement learning, such as the DQN and DDPG algorithms, possess powerful perception and decision-making capabilities and can realize direct control from input to output. Deep reinforcement learning is now widely applied in the field of autonomous AUV control and achieves good results. However, in current research, AUV controllers based on deep reinforcement learning show poor anti-interference capability and stability, and it is difficult for them to reach the desired performance in high-precision undersea tasks.
In recent years, with the development of deep learning and reinforcement learning, deep reinforcement learning algorithms are increasingly applied to AUV intelligent control. While interacting with the environment, the agent generates an optimal behavior policy through continual trial and error and by maximizing the accumulated reward; compared with traditional algorithms it achieves better intelligent control, adapts better to the environment, and can perform real-time intelligent control based on environmental information.
However, the marine environment in which the AUV operates is very complex, and deep reinforcement learning algorithms suffer from a slow training process and slow convergence. Moreover, reinforcement learning rewards are usually tuned manually, so it is difficult to design an ideal environment reward, and the training process suffers from the sparse-reward problem: the agent cannot obtain an effective reward for a long time, which makes training extremely slow or even causes it to fail.
Disclosure of Invention
Aiming at the defects of the prior art and to solve the problems described in the background, the application provides an AUV intelligent control system and control method based on the deep reinforcement learning PPO algorithm, and improves the algorithm in two main aspects: first, a priority sampling mechanism is introduced into the PPO algorithm to increase the training speed and sample utilization rate and improve the training effect; second, a curiosity-driven exploration mechanism is introduced into the PPO algorithm to accelerate the reinforcement learning training process and alleviate the sparse-reward problem.
The technical scheme of the application is as follows:
an AUV intelligent control method based on PPO algorithm includes the steps:
s1: constructing an AUV intelligent control simulation model based on a PPO algorithm according to the AUV running state and the execution action;
s2: obtaining AUV execution motion vectors through PPO algorithm in the control time step, so as to obtain AUV running state transition vectors of each control time step, and storing the AUV running state transition vectors in a control experience buffer space library;
s3: and training the AUV simulation model according to a control experience buffer space library obtained by PPO algorithm control, outputting an execution strategy by the trained AUV intelligent control simulation model according to the current running state, and controlling the AUV action.
Preferably, in S1, an AUV dynamics model is established in the AUV intelligent control simulation model; the AUV dynamics model is derived from the Newton-Euler equations of motion, six-degree-of-freedom mathematical modeling is performed on a streamlined AUV with a length of 2.38 meters, a diameter of 0.32 meters and a weight of 167 kg, and two coordinate systems are defined, namely an inertial coordinate system E-ξηζ and a carrier coordinate system O-xyz;
the inertial coordinate system takes a point on the sea surface as the coordinate origin, the north-south direction as the ξ axis, the east-west direction as the η axis and the vertically downward direction as the ζ axis; in the simulation environment, a six-dimensional vector x = {ξ, η, ζ, φ, θ, ψ} based on the inertial coordinate system represents the position and attitude of the moving AUV, where the position vector is {ξ, η, ζ} and the attitude vector is {φ, θ, ψ}, representing the roll angle, pitch angle and heading angle of the AUV respectively;
the origin of the carrier coordinate system is fixed at the center of gravity of the AUV and is used to express the velocity and angular velocity of the AUV, where the velocity vector {u, v, w} represents the longitudinal, lateral and vertical velocities respectively, and the angular velocity vector {p, q, r} represents the roll rate, pitch rate and yaw rate respectively; the motion state of the AUV can be completely described by these two sets of vectors.
During AUV motion, the established AUV dynamics model uses the fourth-order Runge-Kutta method to obtain the AUV motion state at the next moment from the real-time position and attitude of the AUV, the propeller force, and the angles of the vertical and horizontal rudders, thereby simulating the motion process of the AUV. A propeller is mounted at the stern of the AUV, its force is denoted F, and a horizontal rudder and a vertical rudder are arranged at the stern to change the direction of motion of the AUV.
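As an illustration of the integration step just described, the following minimal Python sketch advances the AUV state with one fourth-order Runge-Kutta step; the function auv_dynamics standing in for the six-degree-of-freedom Newton-Euler model is a placeholder assumption, and the 0.1 s step matches the simulation time step given later in the text.

    import numpy as np

    def auv_dynamics(state, thrust, rudder_v, rudder_h):
        """Placeholder for the 6-DOF Newton-Euler model of the AUV.
        Returns d(state)/dt for a 12-dimensional np.ndarray state
        [xi, eta, zeta, phi, theta, psi, u, v, w, p, q, r]."""
        raise NotImplementedError  # supplied by the AUV dynamics model

    def rk4_step(state, thrust, rudder_v, rudder_h, dt=0.1):
        """One fourth-order Runge-Kutta step of the AUV state."""
        f = lambda s: auv_dynamics(s, thrust, rudder_v, rudder_h)
        k1 = f(state)
        k2 = f(state + 0.5 * dt * k1)
        k3 = f(state + 0.5 * dt * k2)
        k4 = f(state + dt * k3)
        return state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)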
Preferably, in step S1, the AUV intelligent control simulation model is constructed based on the PPO algorithm, and the learning parameters of the PPO algorithm are set, including a state space S, an action space A and a reward function R;
specifically, the state space S contains two types of state observations: observation of the task environment and observation of the AUV's own state; the observation of the task environment includes the distance d from the AUV to the current route and the current heading angle c; the observation of the AUV's own state includes the force F of the AUV's propeller and the rudder angle D of the horizontal rudder; to balance the influence of each observed quantity on neural network training, the state space is normalized, giving the state vector S = {d, c, F, D};
in the action space A, although the AUV in the simulation environment is a dynamic model that moves in three-dimensional space with six degrees of freedom and three actuators, the application trains in a two-dimensional environment and therefore only involves the stern propeller and the stern horizontal rudder, defining the action space A = {F, D};
the reward function of the neural network is adjusted and designed as follows in order to improve the training success rate and accelerate training (a sketch of the resulting composite reward in code follows this list):
(1) An approach reward is set according to the change in distance between the AUV and the specified path, the reward being positively correlated with the per-step decrease of that distance. It is divided into two stages; the first stage is the reward before the target depth is reached, where the difference between the distance from the AUV to the target depth at the previous time step and at the current time step is used as the reward value, as follows:
r_d = d_old - d_new (1)
where r_d is the reward value, d_old is the distance from the AUV to the target depth at the previous time step, and d_new is the distance from the AUV to the target depth at the current time step;
the second stage applies once the AUV has reached the desired depth; the task then becomes holding the desired depth, so the reward and penalty of the approach reward are amplified:
r_d = 3*(d_old - d_new) (2)
(2) According to the change in the distance traveled by the AUV along the path, a forward reward is set that is positively correlated with the distance traveled at each step, i.e. r_forward = η_old - η_new (3);
where r_forward is the forward reward, η_old is the η coordinate of the AUV at the previous time step, and η_new is the η coordinate of the AUV at the current time step;
(3) A heading reward is set according to the difference between the heading angle of the AUV and the specified desired heading angle, and is inversely related to that difference. The AUV is expected to return to the desired track over a forward distance of length L, so the desired heading angle is c_d = arctan(d/L); the heading angle to be adjusted is the sum of the current heading angle c and the desired heading angle c_d: c_change = c + c_d; the heading reward is based on the amount by which this angle decreases, i.e. r_course = c_change_old - c_change_new (4);
where r_course is the heading reward, c_change_old is the heading angle to be adjusted at the end of the previous time step, and c_change_new is the heading angle to be adjusted at the end of the current time step;
(4) A time penalty is set according to the time limit of the AUV intelligent control task; the accumulated penalty grows with simulation time, with r_time = -1 per step, and the task ends when the number of simulation steps exceeds 1500 or the boundary of the simulation environment is reached;
(5) Finally, a task-end reward is set for training: when the task ends, if the AUV has not reached and held the desired depth within the map boundary or within 1000 time steps, a large penalty is subtracted, with r_done = -300; if the task succeeds, a large reward is added, with r_done = 300;
In summary, the reward function is set as R = r_d + r_forward + r_course + r_time + r_done (5).
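The composite reward R = r_d + r_forward + r_course + r_time + r_done can be assembled as in the sketch below; the depth_reached flag and the success argument are assumptions about how the two reward stages and the task outcome are detected, and are not named in the text.

    def compute_reward(d_old, d_new, eta_old, eta_new,
                       c_change_old, c_change_new,
                       depth_reached, done, success):
        # (1) approach reward: decrease of the distance to the target depth,
        #     tripled once the desired depth has been reached (eqs. (1)-(2))
        r_d = (3.0 if depth_reached else 1.0) * (d_old - d_new)
        # (2) forward reward from the change of the eta coordinate (eq. (3))
        r_forward = eta_old - eta_new
        # (3) heading reward: decrease of the heading angle to be adjusted (eq. (4))
        r_course = c_change_old - c_change_new
        # (4) constant per-step time penalty
        r_time = -1.0
        # (5) terminal reward or penalty
        r_done = (300.0 if success else -300.0) if done else 0.0
        return r_d + r_forward + r_course + r_time + r_done  # eq. (5)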
Preferably, in step S2, the intelligent control task of the AUV is completed by adopting a control method based on a PPO algorithm, including:
initializing the parameters of the actor network and the critic network in the PPO algorithm, with simulation time step t = 0.1, target-network soft-update parameter τ = 5×10^-3, actor-network delayed-update frequency parameter σ = 2, control experience buffer space library size D = 10^4, number of transitions drawn from the experience replay buffer space at each time step M = 256, reward discount factor γ = 0.99, critic target-network update frequency C = 2, and maximum number of time steps T = 4×10^5;
the actor and critic networks each comprise an input layer, a hidden layer and an output layer; the number of hidden neurons in both network structures is 128, the hidden-layer activation function is relu, and the output layer uses tanh. The actor network takes the state quantity s of the current task environment as input and outputs the action vector a; the critic network takes the state quantity s and the action quantity a as input and outputs the state value function V(s) for executing that action in that state. The advantage value at time t is then obtained from V(s) as A_t = -V(s_t) + r_t + γ·r_(t+1) + ... + γ^(T-t)·r_T,
where V(s_t) is the state value function at time t, r_t is the reward at time t, r_(t+1) is the reward at time t+1, and T is the maximum number of time steps;
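A minimal PyTorch sketch of the actor and critic architecture described above (one hidden layer of 128 units, relu hidden activation, tanh actor output) is given below; the 4-dimensional state and 2-dimensional action follow S = {d, c, F, D} and A = {F, D}, the Gaussian policy head is an assumption about how actions are sampled, and the critic is written as a state-only value network V(s) with a linear output head, which is a deviation from the text made so the value can cover the reward scale.

    import torch
    import torch.nn as nn

    STATE_DIM, ACTION_DIM, HIDDEN = 4, 2, 128  # S = {d, c, F, D}, A = {F, D}

    class Actor(nn.Module):
        """pi(a|s): 128-unit relu hidden layer, tanh-bounded mean output."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
                nn.Linear(HIDDEN, ACTION_DIM), nn.Tanh())
            # learnable log standard deviation for a Gaussian policy
            # (an assumption; the text does not state the exploration noise model)
            self.log_std = nn.Parameter(torch.zeros(ACTION_DIM))

        def forward(self, s):
            return torch.distributions.Normal(self.net(s), self.log_std.exp())

    class Critic(nn.Module):
        """State value V(s) with a linear output head (editorial assumption)."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
                nn.Linear(HIDDEN, 1))

        def forward(self, s):
            return self.net(s).squeeze(-1)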
a state transition vector is acquired at each time step within the preset control time steps and stored in the control experience buffer space library; each transition includes the state vector s_t at time t, the action vector a taken, the reward value r obtained after executing the action, the state vector s_(t+1) at the next time step, and the episode-end flag done, and collection continues until the number of steps reaches the preset value;
the AUV position is initialized near the target path and the initial state vector s is obtained, with initial time step count i = 0; based on the current state s, the actor network gives the executed action a = π(s|θ), where θ is the neural network parameter; action a is executed in the simulation environment to obtain the new attitude and motion state of the AUV, the new state s' is obtained by the fourth-order Runge-Kutta method, the reward value r and termination flag done are obtained from the reward function, and i = i + 1; the AUV running-state transition vector is stored in the control experience buffer space library and s = s' is set; if the number of steps is less than 1000 the above steps are repeated, otherwise the training stage of step S3 begins.
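A sketch of this experience-collection loop is shown below, reusing the Actor sketch above; env.reset and env.step are stand-ins for the simulation model (initialization near the target path, Runge-Kutta state update, reward and done flag) and are not an API defined in the text.

    import torch

    def collect_rollout(env, actor, buffer, max_steps=1000):
        """Fill the control experience buffer with (s, a, r, s', done, log_prob)
        transitions, as in step S2."""
        s = env.reset()  # AUV initialised near the target path
        for i in range(max_steps):
            dist = actor(torch.as_tensor(s, dtype=torch.float32))
            a = dist.sample()
            log_prob = dist.log_prob(a).sum().item()  # kept for the PPO ratio
            s_next, r, done = env.step(a.numpy())     # RK4 update + reward
            buffer.append((s, a.numpy(), r, s_next, done, log_prob))
            s = env.reset() if done else s_next
        return buffer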
Preferably, when sampling from the control experience buffer space library in step S3, conventional random sampling not only fails to make effective use of high-quality samples but may also cause the model to fall into a locally optimal solution; in addition, the constantly changing number of samples in the replay buffer is unfavorable to training convergence. The application addresses these problems with an adaptive-weight priority sampling mechanism:
considering that the loss function of the neural network is influenced by the advantage value, the adaptive weight is designed so that the advantage value has a stronger influence on the sampling weight; the samples generated by each agent participating in sampling are ranked from 1 to N in descending order of the absolute value of their advantage, and, since the sampling probabilities of all samples must sum to 1, each sample is assigned an adaptive sampling weight according to its rank,
where j denotes the rank of a sample, P_j denotes the sampling probability of the j-th sample, and N denotes the total number of samples; with this adaptive weight calculation, the sampling probability of samples with larger absolute advantage values is increased, so that samples with extremely large or extremely small reward values have a stronger influence on the training of the neural network and the convergence of the algorithm is accelerated;
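The exact adaptive-weight formula is rendered as an image in the source and cannot be reproduced here; the sketch below uses one common rank-based scheme (sampling probability proportional to 1/rank, normalised so the probabilities sum to 1) purely to illustrate priority sampling by the rank of the absolute advantage, and is not necessarily the formula used by the application.

    import numpy as np

    def rank_based_probabilities(advantages):
        """Sampling probability P_j from the rank j of |advantage|
        (rank 1 = largest).  The 1/j weighting is an illustrative
        assumption, not the exact formula of the application."""
        adv = np.asarray(advantages, dtype=np.float64)
        order = np.argsort(-np.abs(adv))            # descending |advantage|
        ranks = np.empty(len(adv), dtype=np.int64)
        ranks[order] = np.arange(1, len(adv) + 1)   # ranks 1..N
        weights = 1.0 / ranks
        return weights / weights.sum()              # probabilities sum to 1

    def sample_batch(buffer, advantages, batch_size=256):
        probs = rank_based_probabilities(advantages)
        idx = np.random.choice(len(buffer), size=batch_size, p=probs)
        return [buffer[i] for i in idx]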
after sampling is completed, the critic network parameters are updated by gradient back-propagation of the neural network using the TD error δ = γ·V(s_(t+1)) + r_(t+1) - V(s_t); the actor network parameters are updated by gradient back-propagation using the clipped surrogate loss L^CLIP(θ) = E_t[min(r_t(θ)·A_t, clip(r_t(θ), 1-ε, 1+ε)·A_t)], where E_t denotes the empirical mean over time steps 0 to t, r_t(θ) = π_θ(a_t|s_t)/π_θold(a_t|s_t) is the action probability ratio, π_θ denotes the current policy, π_θold denotes the policy before the update, ε is the clipping coefficient, a_t is the action vector at time t, and s_t is the state vector at time t;
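The update step just described can be sketched as follows: the critic is regressed toward the TD target and the actor is updated with the clipped surrogate loss. The clip range of 0.2 and the use of separate optimizers are assumed values not stated in the text, and the batch is assumed to be prepared as tensors from the sampled transitions.

    import torch

    def ppo_update(actor, critic, opt_actor, opt_critic, batch,
                   gamma=0.99, clip_eps=0.2):
        # batch: tensors (s, a, r_next, s_next, adv, logp_old) built from the
        # transitions drawn with the priority sampling above
        s, a, r_next, s_next, adv, logp_old = batch
        # critic: minimise the squared TD error
        #   delta = gamma * V(s_{t+1}) + r_{t+1} - V(s_t)
        with torch.no_grad():
            td_target = r_next + gamma * critic(s_next)
        critic_loss = ((td_target - critic(s)) ** 2).mean()
        opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
        # actor: clipped surrogate objective of PPO
        dist = actor(s)
        logp = dist.log_prob(a).sum(-1)
        ratio = torch.exp(logp - logp_old)  # pi_theta / pi_theta_old
        surrogate = torch.min(ratio * adv,
                              torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv)
        actor_loss = -surrogate.mean()
        opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()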
if done does not indicate termination, or termination occurs but the preset number of steps T has not been reached, the procedure returns to S2 for re-initialization and the task continues; once the number of steps exceeds the preset T, training ends;
the above steps are repeated until the preset number of simulation steps is reached; the control performance of the algorithm is judged from the reward convergence of each episode and the completion of the path-tracking task; after network training is finished, the environment information is fed into the policy network, and the policy network outputs the action policy that controls the AUV motion.
An AUV intelligent control system based on a PPO algorithm, comprising:
the simulation module is configured to construct an AUV intelligent control simulation model based on a PPO algorithm according to the AUV running state and the execution action;
the experience acquisition module is configured to acquire AUV execution motion vectors through a PPO algorithm in a control time step, so as to acquire AUV running state transition vectors of each control time step, and store the AUV running state transition vectors into the control experience buffer space library;
the intelligent control module is configured to sort the samples in the control experience buffer space library in descending order of the absolute value of their advantage, assign each sample an adaptive weight, and thereby increase the influence of the advantage value on the sampling weight; and to train the AUV simulation model according to the control experience buffer space library after the weights are assigned, so that the trained AUV intelligent control simulation model outputs an execution policy according to the current running state and controls the AUV actions.
An electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method of the first aspect.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of the first aspect.
The application has the beneficial effects that:
based on the PPO algorithm, the application adds a priority sampling mechanism and a curiosity exploration mechanism to complete the intelligent control task of the AUV. Aiming at the problems that the existing deep reinforcement learning control algorithm has slow convergence speed or is difficult to converge in the training process; according to the application, the sample weight is given to the control experience buffer space library according to the dominant value of the sample, so that the influence of the dominant value on the sampling weight is improved. The sampling probability of samples with larger absolute values of the dominant values is increased, so that samples with extremely large or extremely small rewards can influence the training of the neural network, and the algorithm convergence speed is increased; but also fully develop the relation between exploration and utilization and balance sampling probabilities of different samples. Meanwhile, the rewarding function is modified, different rewarding values are given to behaviors in different stages, and the problem of sparse rewarding is solved. A curiosity exploration mechanism is introduced into the reward function, and when the next state is inconsistent with the prediction of the intelligent agent, the reward is given, and the further the actual state is different from the prediction, the higher the reward is. The exploration capability of the algorithm is improved, and the convergence speed of the deep reinforcement learning algorithm is remarkably improved.
To address the weak anti-interference capability and poor adaptive capability of traditional control algorithms, the application controls the action output of the AUV with the PPO algorithm in a complex marine simulation environment; it has strong adaptive adjustment capability in the face of various disturbance factors, improves anti-interference capability, and adapts to complex and changeable marine environments.
Drawings
FIG. 1 is a general flow chart of the technical scheme of the application;
FIG. 2 is a PPO algorithm training flow chart of the present application;
FIG. 3 is a schematic illustration of an exemplary AUV dynamics model of the present application;
FIG. 4 is a block diagram of an exemplary actor and critic neural network of the present application.
Detailed Description
The application will now be further illustrated by way of example, but not by way of limitation, with reference to the accompanying drawings.
Example 1:
as shown in fig. 1, this example provides an AUV intelligent control method based on a PPO algorithm, including:
s1, constructing an AUV intelligent control simulation model based on a PPO algorithm according to an AUV running state and an execution action;
s2, obtaining AUV execution motion vectors through a PPO algorithm in the control time step, so as to obtain AUV running state transition vectors of each control time step, and storing the AUV running state transition vectors into a control experience buffer space library;
and S3, the samples in the control experience buffer space library are sorted in descending order of the absolute value of their advantage and assigned adaptive weights, increasing the influence of the advantage value on the sampling weight; the AUV simulation model is trained according to the control experience buffer space library after the weights are assigned, and the trained AUV intelligent control simulation model outputs an execution policy according to the current running state and controls the AUV actions;
in the AUV intelligent control simulation model, this embodiment adopts the AUV dynamics model shown in FIG. 3; the model is derived from the Newton-Euler equations of motion, six-degree-of-freedom mathematical modeling is performed on a streamlined AUV with a length of 2.38 meters, a diameter of 0.32 meters and a weight of 167 kg, and two coordinate systems are defined in this embodiment, namely an inertial coordinate system E-ξηζ and a carrier coordinate system O-xyz.
The inertial coordinate system takes a point on the sea surface as the coordinate origin, the north-south direction as the ξ axis, the east-west direction as the η axis and the vertically downward direction as the ζ axis. In the simulation environment, a six-dimensional vector x = {ξ, η, ζ, φ, θ, ψ} based on the inertial coordinate system represents the position and attitude of the moving AUV, where the position vector is {ξ, η, ζ} and the attitude vector is {φ, θ, ψ}, representing the roll angle, pitch angle and heading angle of the AUV respectively.
The origin of the carrier coordinate system is fixed at the center of gravity of the AUV and is used to express the velocity and angular velocity of the AUV, where the velocity vector {u, v, w} represents the longitudinal, lateral and vertical velocities respectively, and the angular velocity vector {p, q, r} represents the roll rate, pitch rate and yaw rate respectively; the motion state of the AUV is fully described by these two sets of vectors.
During AUV motion, the AUV dynamics model obtains the motion state at the next moment from the real-time position and attitude of the AUV, the propeller force, and the angles of the vertical and horizontal rudders, thereby simulating the motion process of the AUV. A propeller is mounted at the stern of the AUV, its force is denoted F, and a horizontal rudder and a vertical rudder are arranged at the stern to change the direction of motion of the AUV.
In the AUV intelligent control simulation model, the whole task of this embodiment is carried out in a two-dimensional simulation environment: the path lies in the two-dimensional plane at depth ζ = 20 in the inertial coordinate system, and the target path is η = 50.
In the AUV intelligent control simulation model, the embodiment is constructed based on a PPO algorithm, and learning parameters of the PPO algorithm are set, wherein the learning parameters comprise a state space S, an action space A and a reward function R.
Specifically, in this embodiment the state space S contains two types of state observations: observation of the task environment and observation of the AUV's own state. The observation of the task environment includes the distance d from the AUV to the current route and the current heading angle c; the observation of the AUV's own state includes the force F of the AUV's propeller and the rudder angle D of the horizontal rudder. The state space is normalized to obtain the state vector S = {d, c, F, D}, which balances the influence of each observed quantity on neural network training.
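A small sketch of assembling the normalised observation S = {d, c, F, D} for the η = 50 target path is given below; the normalisation bounds are assumptions, since the text states only that the state vector is normalised, not the ranges used.

    import numpy as np

    # assumed normalisation bounds (the text does not give the ranges)
    D_MAX, C_MAX, F_MAX, RUDDER_MAX = 50.0, np.pi, 100.0, np.deg2rad(30.0)

    def build_observation(eta, heading, thrust, rudder, target_eta=50.0):
        d = eta - target_eta  # signed distance from the AUV to the target path
        return np.array([d / D_MAX,
                         heading / C_MAX,
                         thrust / F_MAX,
                         rudder / RUDDER_MAX], dtype=np.float32)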
In the action space A, although the AUV in the simulation environment is a dynamic model that moves in three-dimensional space with six degrees of freedom and three actuators, this embodiment trains in a two-dimensional environment and therefore only involves the stern propeller and the stern horizontal rudder, defining the action space A = {F, D}.
The reward function of the neural network is adjusted and designed as follows, and a curiosity-driven exploration mechanism is added at the same time, in order to improve the training success rate and accelerate training:
(1) An approach reward is set according to the change in distance between the AUV and the specified path, the reward being positively correlated with the per-step decrease of that distance. It is divided into two stages; the first stage is the reward before the target depth is reached, where the difference between the distance from the AUV to the target depth at the previous time step and at the current time step is used as the reward value, as follows:
r_d = d_old - d_new (1)
where r_d is the reward value, d_old is the distance from the AUV to the target depth at the previous time step, and d_new is the distance from the AUV to the target depth at the current time step;
the second stage applies once the AUV has reached the desired depth; the task then becomes holding the desired depth, so the reward and penalty of the approach reward are amplified:
r_d = 3*(d_old - d_new) (2)
(2) A forward reward is set according to the change in the distance traveled by the AUV along the path, positively correlated with the distance traveled at each step, i.e. r_forward = η_old - η_new (3);
where r_forward is the forward reward, η_old is the η coordinate of the AUV at the previous time step, and η_new is the η coordinate of the AUV at the current time step;
(3) A heading reward is set according to the difference between the heading angle of the AUV and the specified desired heading angle, and is inversely related to that difference. The AUV is expected to return to the desired track over a forward distance of length L, so the desired heading angle is c_d = arctan(d/L); the heading angle to be adjusted is the sum of the current heading angle c and the desired heading angle c_d: c_change = c + c_d; the heading reward is based on the amount by which this angle decreases, i.e. r_course = c_change_old - c_change_new (4);
where r_course is the heading reward, c_change_old is the heading angle to be adjusted at the end of the previous time step, and c_change_new is the heading angle to be adjusted at the end of the current time step;
(4) A time penalty is set according to the time limit of the AUV intelligent control task; the accumulated penalty grows with simulation time, with r_time = -1 per step, and the task ends when the simulation time is too long (the number of steps exceeds 1500) or the simulation boundary is reached.
(5) Finally, a task-end reward is set for training: when the task ends, if the AUV has not reached and held the desired depth within the map boundary or within 1000 time steps, a large penalty is subtracted, with r_done = -300; if the task succeeds, a large reward is added, with r_done = 300;
In summary, the reward function is set as R = r_d + r_forward + r_course + r_time + r_done (5).
In step S2, the present example completes the intelligent control task of the AUV by using a control method based on the PPO algorithm, including:
The parameters of the actor network and the critic network in the PPO algorithm are initialized; in this embodiment, the number of hidden neurons in the actor and critic network structures is 128, the relu function is used as the hidden-layer activation function, and the tanh function is used for the output layer. The actor network takes the state quantity of the current task environment as input; the critic network takes the state quantity and the action quantity as input and outputs the state value function V(s) for executing that action in that state, and the advantage value at time t is obtained from V(s) as A_t = -V(s_t) + r_t + γ·r_(t+1) + ... + γ^(T-t)·r_T,
where V(s_t) is the state value function at time t, r_t is the reward at time t, r_(t+1) is the reward at time t+1, and T is the maximum number of time steps.
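The advantage estimates can be computed backwards over a stored rollout as in the sketch below, following the discounted-return form given above; adding a bootstrap term from the value of the final state for truncated rollouts would be a further assumption and is noted only in a comment.

    import numpy as np

    def compute_advantages(rewards, values, gamma=0.99):
        """A_t = -V(s_t) + r_t + gamma*r_{t+1} + ... over one rollout.
        `values` holds V(s_t) for each stored state.  (For a truncated
        rollout, a bootstrap term gamma^(T-t) * V(s_T) could be added.)"""
        T = len(rewards)
        adv = np.zeros(T, dtype=np.float32)
        ret = 0.0
        for t in reversed(range(T)):
            ret = rewards[t] + gamma * ret      # discounted return from t
            adv[t] = ret - values[t]
        return adv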
Within the preset control time steps, a state transition vector is acquired at each time step and stored in the control experience buffer space library; each transition includes the state vector s_t at time t, the action vector a taken, the reward value r obtained after executing the action, the state vector s_(t+1) at the next time step, and the episode-end flag done, and collection continues until the number of steps reaches the preset value.
The AUV position is initialized near the target path and the initial state vector s is obtained, with initial time step count i = 0; based on the current state s, the actor network gives the executed action a = π(s|θ); action a is executed in the simulation environment to obtain the new attitude and motion state of the AUV, the new state s' is obtained by the fourth-order Runge-Kutta method, the training progress is tracked through the reward function and the termination flag, and i = i + 1; the AUV running-state transition vector is stored in the control experience buffer space library and s = s' is set; if the number of steps is less than 1000 the above steps are repeated, otherwise the training stage begins.
When sampling from the experience replay buffer in step S3, conventional random sampling not only fails to make effective use of high-quality samples but may also cause the model to fall into a locally optimal solution; in addition, the constantly changing number of samples in the replay buffer is unfavorable to training convergence.
This embodiment trains the AUV intelligent control simulation model with a control experience buffer space library to which a sample-weight mechanism is added, as follows:
considering that the loss function of the neural network is influenced by the advantage value, the adaptive weight is designed so that the advantage value has a stronger influence on the sampling weight; the samples generated by each agent participating in sampling are ranked from 1 to N in descending order of the absolute value of their advantage; during sampling, instead of sampling uniformly over all samples in the experience replay buffer, the sampling weight of each generated sample is calculated, and a preset number of samples is drawn according to these weight values to update the network parameters.
Two samples are drawn from the control experience buffer space library, and all parameters of the actor network and the critic network are updated by back-propagation of the neural network using a mean-square-error loss function;
if the termination state has not been reached, the procedure returns to S2 and the executed action is obtained through the actor network; if the termination state has been reached, the episode ends and the procedure returns to S2 for initialization of the next episode. If the number of steps exceeds the preset value, training ends.
The above steps are repeated until training reaches a preset number of episodes or a convergence condition. The control performance of the algorithm is judged from the reward convergence of each episode and the completion of the path-tracking task; after network training is finished, the environment information is fed into the policy network, which outputs the action policy that controls the AUV motion.
This embodiment provides an AUV intelligent control method based on the PPO algorithm, which overcomes the poor anti-interference performance of traditional control methods, significantly increases the convergence speed of the PPO algorithm during training, and improves training efficiency.
Example 2
The embodiment provides an AUV intelligent control system based on a PPO algorithm, which comprises:
the simulation module is configured to construct an AUV intelligent control simulation model based on a PPO algorithm according to the AUV running state and the execution action;
the experience acquisition module is configured to acquire AUV execution motion vectors through a PPO algorithm in a control time step, so as to acquire AUV running state transition vectors of each control time step, and store the AUV running state transition vectors into the control experience buffer space library;
the intelligent control module is configured to sort the samples in the control experience buffer space library according to the absolute value of the dominant value from large to small, respectively endow the samples with self-adaptive weights, and improve the influence of the dominant value on the sampling weight. And training the AUV simulation model according to the control experience buffer space library after the weight is distributed, outputting an execution strategy by the trained AUV intelligent control simulation model according to the current running state, and controlling the AUV action.
It should be noted that the above modules correspond to the steps described in Embodiment 1, and that the examples and application scenarios implemented by the modules are the same as those of the corresponding steps, but are not limited to the disclosure of Embodiment 1. It should also be noted that the modules described above may be implemented as part of a system in a computer system, for example as a set of computer-executable instructions.
In further examples, there is also provided:
an electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method described in embodiment 1.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method described in embodiment 1.
The method in Embodiment 1 may be embodied directly as being executed by a hardware processor, or executed by a combination of hardware and software modules in the processor. The software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or another storage medium well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, a detailed description is omitted here.
Those of ordinary skill in the art will appreciate that the elements of the various examples described in connection with the present embodiments, i.e., the algorithm steps, can be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above is only a preferred embodiment of the present application, and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
While the foregoing description of the embodiments of the present application has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the application, but rather, it is intended to cover modifications and variations within the scope of the application as defined by the claims of the present application.

Claims (8)

1. An AUV intelligent control method based on a PPO algorithm is characterized by comprising the following steps:
s1: constructing an AUV intelligent control simulation model based on a PPO algorithm according to the AUV running state and the execution action;
s2: obtaining AUV execution motion vectors through PPO algorithm in the control time step, so as to obtain AUV running state transition vectors of each control time step, and storing the AUV running state transition vectors in a control experience buffer space library;
s3: and training the AUV simulation model according to a control experience buffer space library obtained by PPO algorithm control, outputting an execution strategy by the trained AUV intelligent control simulation model according to the current running state, and controlling the AUV action.
2. The method for intelligently controlling the AUV based on the PPO algorithm according to claim 1, wherein in S1, an AUV dynamics model is established in the AUV intelligent control simulation model, six-degree-of-freedom mathematical modeling is performed on the AUV, and two coordinate systems are defined, namely an inertial coordinate system E-ξηζ and a carrier coordinate system O-xyz;
the inertial coordinate system takes a point on the sea surface as the coordinate origin, the north-south direction as the ξ axis, the east-west direction as the η axis and the vertically downward direction as the ζ axis; in the simulation environment, a six-dimensional vector x = {ξ, η, ζ, φ, θ, ψ} based on the inertial coordinate system represents the position and attitude of the moving AUV, where the position vector is {ξ, η, ζ} and the attitude vector is {φ, θ, ψ}, representing the roll angle, pitch angle and heading angle of the AUV respectively;
the origin of the carrier coordinate system is fixed at the center of gravity of the AUV and is used to express the velocity and angular velocity of the AUV, where the velocity vector {u, v, w} represents the longitudinal, lateral and vertical velocities respectively, and the angular velocity vector {p, q, r} represents the roll rate, pitch rate and yaw rate respectively;
during AUV motion, the AUV dynamics model uses the fourth-order Runge-Kutta method to obtain the AUV motion state at the next moment from the real-time position and attitude of the AUV, the propeller force, and the angles of the vertical and horizontal rudders, wherein a propeller is mounted at the stern of the AUV, its force is denoted F, and a horizontal rudder and a vertical rudder are arranged at the stern of the AUV to change its direction of motion.
3. The PPO algorithm-based AUV intelligent control method of claim 1, wherein in step S1, the AUV intelligent control simulation model is constructed based on the PPO algorithm and the learning parameters of the PPO algorithm are set, including a state space S, an action space A and a reward function R;
specifically, the state space S contains two types of state observations: observation of the task environment and observation of the AUV's own state; the observation of the task environment includes the distance d from the AUV to the current route and the current heading angle c; the observation of the AUV's own state includes the force F of the AUV's propeller and the rudder angle D of the horizontal rudder; the state space is normalized to obtain the state vector S = {d, c, F, D};
defining an action space a= { F, D };
the following adjustments and designs are made to the reward function of the neural network:
(1) an approach reward is set according to the change in distance between the AUV and the specified path, the reward being positively correlated with the per-step decrease of that distance; it is divided into two stages, the first stage being the reward before the target depth is reached, where the difference between the distance from the AUV to the target depth at the previous time step and at the current time step is used as the reward value, as follows:
r_d = d_old - d_new (1)
where r_d is the reward value, d_old is the distance from the AUV to the target depth at the previous time step, and d_new is the distance from the AUV to the target depth at the current time step;
the second stage applies once the AUV has reached the desired depth, the task then becoming holding the desired depth, so the reward and penalty of the approach reward are amplified:
r_d = 3*(d_old - d_new) (2)
(2) according to the change in the distance traveled by the AUV along the path, a forward reward is set that is positively correlated with the distance traveled at each step, i.e. r_forward = η_old - η_new (3);
where r_forward is the forward reward, η_old is the η coordinate of the AUV at the previous time step, and η_new is the η coordinate of the AUV at the current time step;
(3) a heading reward is set according to the difference between the heading angle of the AUV and the specified desired heading angle, the reward being inversely related to that difference; the AUV is expected to return to the desired track over a forward distance of length L, so the desired heading angle is c_d = arctan(d/L); the heading angle to be adjusted is the sum of the current heading angle c and the desired heading angle c_d: c_change = c + c_d; the heading reward is based on the amount by which this angle decreases, i.e. r_course = c_change_old - c_change_new (4);
where r_course is the heading reward, c_change_old is the heading angle to be adjusted at the end of the previous time step, and c_change_new is the heading angle to be adjusted at the end of the current time step;
(4) a time penalty is set according to the time limit of the AUV intelligent control task, the accumulated penalty growing with simulation time, with r_time = -1 per step, and the task ends when the number of simulation steps exceeds 1500 or the boundary of the simulation environment is reached;
(5) finally, a task-end reward is set for training: when the task ends, if the AUV has not reached and held the desired depth within the map boundary or within 1000 time steps, a large penalty is subtracted, with r_done = -300; if the task succeeds, a large reward is added, with r_done = 300;
in summary, the reward function is set as R = r_d + r_forward + r_course + r_time + r_done (5).
4. The PPO algorithm-based AUV intelligent control method of claim 1, wherein in step S2, the PPO algorithm-based control method is adopted to complete the intelligent control task of the AUV, and the method comprises the following steps:
initializing the parameters of the actor network and the critic network in the PPO algorithm, with simulation time step t = 0.1, target-network soft-update parameter τ = 5×10^-3, actor-network delayed-update frequency parameter σ = 2, control experience buffer space library size D = 10^4, number of transitions drawn from the experience replay buffer space at each time step M = 256, reward discount factor γ = 0.99, critic target-network update frequency C = 2, and maximum number of time steps T = 4×10^5;
the actor and critic networks each comprise an input layer, a hidden layer and an output layer, the number of hidden neurons in both network structures being 128, the hidden-layer activation function being relu, and the output layer using tanh; the actor network takes the state quantity s of the current task environment as input and outputs the action vector a; the critic network takes the state quantity s and the action quantity a as input and outputs the state value function V(s) for executing that action in that state; the advantage value at time t is obtained from V(s) as A_t = -V(s_t) + r_t + γ·r_(t+1) + ... + γ^(T-t)·r_T,
where V(s_t) is the state value function at time t, r_t is the reward at time t, r_(t+1) is the reward at time t+1, and T is the maximum number of time steps;
a state transition vector is acquired at each time step within the preset control time steps and stored in the control experience buffer space library, each transition including the state vector s_t at time t, the action vector a taken, the reward value r obtained after executing the action, the state vector s_(t+1) at the next time step, and the episode-end flag done, until the number of steps reaches the preset value;
the AUV position is initialized and the initial state vector s is obtained, with initial time step count i = 0; based on the current state s, the actor network gives the executed action a = π(s|θ), where θ is the neural network parameter; action a is executed in the simulation environment to obtain the new attitude and motion state of the AUV, the new state s' is obtained by the fourth-order Runge-Kutta method, the reward value r and termination flag done are obtained from the reward function, and i = i + 1; the AUV running-state transition vector is stored in the control experience buffer space library and s = s' is set; if the number of steps is less than 1000 the above steps are repeated, otherwise the training stage of step S3 begins.
5. The PPO algorithm-based AUV intelligent control method according to claim 1, wherein in step S3, the samples generated by each agent participating in sampling are ranked from 1 to N in descending order of the absolute value of their advantage; the sampling probabilities of all samples sum to 1, and each sample is assigned an adaptive sampling weight according to its rank,
where j denotes the rank of a sample, P_j denotes the sampling probability of the j-th sample, and N denotes the total number of samples;
after sampling is completed, the critic network parameters are updated by gradient back-propagation of the neural network using the TD error δ = γ·V(s_(t+1)) + r_(t+1) - V(s_t); the actor network parameters are updated by gradient back-propagation using the clipped surrogate loss L^CLIP(θ) = E_t[min(r_t(θ)·A_t, clip(r_t(θ), 1-ε, 1+ε)·A_t)], where E_t denotes the empirical mean over time steps 0 to t, r_t(θ) = π_θ(a_t|s_t)/π_θold(a_t|s_t) is the action probability ratio, π_θ denotes the current policy, π_θold denotes the policy before the update, ε is the clipping coefficient, a_t is the action vector at time t, and s_t is the state vector at time t;
if done does not indicate termination, or termination occurs but the preset number of steps T has not been reached, the procedure returns to S2 for re-initialization and the task continues; once the number of steps exceeds the preset T, training ends;
the above steps are repeated until the preset number of steps is reached; the control performance of the algorithm is judged from the reward convergence of each episode and the completion of the path-tracking task; after network training is finished, the environment information is fed into the policy network, and the policy network outputs the action policy that controls the AUV motion.
6. An PPO algorithm-based AUV intelligent control system for executing the PPO algorithm-based AUV intelligent control method of any one of claims 1-5, comprising:
the simulation module is configured to construct an AUV intelligent control simulation model based on a PPO algorithm according to the AUV running state and the execution action;
the experience acquisition module is configured to acquire AUV execution motion vectors through a PPO algorithm in a control time step, so as to acquire AUV running state transition vectors of each control time step, and store the AUV running state transition vectors into the control experience buffer space library;
the intelligent control module is configured to sort the samples in the control experience buffer space library in descending order of the absolute value of their advantage, assign each sample an adaptive weight, and thereby increase the influence of the advantage value on the sampling weight; and to train the AUV simulation model according to the control experience buffer space library after the weights are assigned, so that the trained AUV intelligent control simulation model outputs an execution policy according to the current running state and controls the AUV actions.
7. An electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method of any one of claims 1-5.
8. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of any of claims 1-5.
Application CN202310915486.0A, filed 2023-07-25 (priority date 2023-07-25): AUV intelligent control system based on PPO algorithm, control method and application - status: Pending

Publications (1)

CN116909150A, published 2023-10-20

Family

ID=88357929 (CN)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination