CN112100834A - Underwater glider attitude control method based on deep reinforcement learning - Google Patents

Underwater glider attitude control method based on deep reinforcement learning

Info

Publication number
CN112100834A
CN112100834A (Application CN202010925225.3A)
Authority
CN
China
Prior art keywords
neural network
evaluation
current
target
underwater glider
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010925225.3A
Other languages
Chinese (zh)
Inventor
高剑
宋保维
潘光
张福斌
王鹏
曹永辉
杜晓旭
彭星光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010925225.3A priority Critical patent/CN112100834A/en
Publication of CN112100834A publication Critical patent/CN112100834A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00: Computer-aided design [CAD]
    • G06F30/20: Design optimisation, verification or simulation
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G06N3/045: Combinations of networks
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061: Physical realisation using biological neurons, e.g. biological neurons connected to an integrated circuit
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention provides an underwater glider attitude control method based on deep reinforcement learning, which comprises a learning stage and an application stage. In the learning stage, the motion process of the underwater glider is simulated while real-time motion data are recorded, and the parameters of a current decision neural network, a current evaluation neural network, a target decision neural network and a target evaluation neural network are updated according to the motion data. After the trained deep reinforcement learning neural network model is obtained, it is applied to the gliding motion of the actual underwater glider in the longitudinal plane: given a target pitch angle θ_d, the state values of the underwater glider are collected and input into the deep reinforcement learning neural network model to obtain the control quantity, realizing attitude control of the underwater glider. The method learns from simulation model data or manual experiment data to control the attitude of the underwater glider, so the learning procedure is simple; an accurate mathematical model of the underwater glider is not required, and the method is also suitable for complex environments.

Description

Underwater glider attitude control method based on deep reinforcement learning
Technical Field
The invention relates to a control technology of an underwater robot, in particular to an underwater glider attitude control method based on deep reinforcement learning.
Background
The underwater glider is a novel underwater vehicle developed by combining buoy, submerged buoy and underwater robot technologies; it has no externally mounted propulsion parts and is driven by its own gravity and buoyancy. Its main characteristics are as follows: motion control does not depend on a propeller propulsion system; instead, up-and-down sinking and floating motion is realized by adjusting the net buoyancy of the glider, and forward gliding is obtained from the lift generated by the horizontal wings attached to the fuselage as the vehicle pitches upward or downward. The underwater glider overcomes the drawbacks of high power consumption and short endurance of conventional underwater vehicles, greatly reduces operating and manufacturing costs, extends endurance, and has practical value in military applications and ocean exploration research.
The motion attitude of the underwater glider is easily disturbed by ocean currents and waves. At the same time, the glider has a complex structure and a single actuation mode, its dynamic model is strongly nonlinear, accurate model parameters are difficult to obtain, and models built for different water environments lack generality. Although many traditional control methods can realize attitude control of the underwater glider and achieve a certain control accuracy, they cannot meet high-accuracy requirements and their control processes are complex.
Disclosure of Invention
Technical problem to be solved
The invention aims to overcome the defects of the prior art by providing an underwater glider attitude control method based on deep reinforcement learning: a deep reinforcement learning neural network model is established, and accurate control of the underwater glider attitude is achieved by learning from simulation model data or manual experiment data.
Technical scheme
The invention provides an underwater glider attitude control method based on deep reinforcement learning, which comprises a learning stage and an application stage. In the learning stage, the motion process of the underwater glider is simulated while real-time motion data are recorded, and the parameters of a current decision neural network, a current evaluation neural network, a target decision neural network and a target evaluation neural network are updated according to the motion data. The specific steps are as follows:
Step 1: Establish 4 BP neural networks: a current decision neural network, a current evaluation neural network, a target decision neural network and a target evaluation neural network. The current decision neural network and the target decision neural network are called decision neural networks; the current evaluation neural network and the target evaluation neural network are called evaluation neural networks. The decision neural networks use the state value of the underwater glider as the input quantity and the control quantity a of the underwater glider as the output action. The evaluation neural networks take the state value and the control quantity of the underwater glider as input and the evaluation value as output.
After the neural networks are constructed, the parameters of the 4 neural networks are initialized, and the sizes of the memory bank and the data buffer are initialized.
Step 2: Obtain the state value s_t of the underwater glider at the current moment and input it into the current decision neural network to calculate the output action a_t of the attitude controller at the current moment. Apply the output action a_t to the underwater glider simulator to obtain the state value s_{t+1} of the underwater glider at the next moment. Calculate the reward value r_t of the current moment from the current state s_t, the current action a_t, the target pitch angle θ_d and the next-moment state s_{t+1}.
Preferably, r_t takes the value
r_t = r_1 + r_2 + r_3
where r_1 = -λ_1(d_t - d_{t-1}) is the return obtained as the current pitch angle moves away from or toward the desired angle; r_2 = -λ_2(w_t - w_{t-1}) is the return obtained from the change in angular velocity; and r_3 is valued according to the current pitch angle: if |d_t - d_{t-1}| < 0.1°, r_3 is a positive "reward", and if θ < -90° or θ > 90°, r_3 is a negative number, representing a penalty. Here d_t is the difference between the current pitch angle θ and the target pitch angle θ_d at moment t, w_t is the pitch angular velocity at moment t, and λ_1 and λ_2 are set coefficients.
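A direct transcription of this reward into code might look like the sketch below. It works in radians, interprets d_t as the magnitude of the pitch-angle error, and uses placeholder values for lambda1, lambda2 and the bonus/penalty magnitudes, since the patent only states that λ_1 and λ_2 are set coefficients and that r_3 is a positive reward or a negative penalty.

import math

def reward(theta, theta_prev, w, w_prev, theta_d, lambda1=1.0, lambda2=0.1,
           bonus=1.0, penalty=-10.0):
    """Reward r_t = r1 + r2 + r3 as described above (coefficient values are assumed)."""
    d_t = abs(theta - theta_d)              # current pitch-angle error (interpreted as a magnitude)
    d_prev = abs(theta_prev - theta_d)      # previous pitch-angle error
    r1 = -lambda1 * (d_t - d_prev)          # positive when the error shrinks
    r2 = -lambda2 * (abs(w) - abs(w_prev))  # discourage a growing pitch rate
    r3 = 0.0
    if abs(d_t - d_prev) < math.radians(0.1):
        r3 += bonus                         # small "reward" when the error change is tiny
    if theta < -math.pi / 2 or theta > math.pi / 2:
        r3 += penalty                       # penalty for pitch beyond +/-90 degrees
    return r1 + r2 + r3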
Step 3: Store the tuple (s_t, a_t, r_t, s_{t+1}) obtained in step 2 in the memory bank as one experience data unit and increment t by 1. Compare t with the set memory bank size n: if t < n, return to step 2 with the updated s_t; once the number of experience data units stored in the memory bank reaches n, proceed to step 4.
Step 4: Sample a specified number N of experience data units from the memory bank and store them in the buffer.
The N experience data units in the buffer are priority-ordered, preferably through a priority-based experience replay mechanism, and stored in a SumTree data structure; the value of the head node of the SumTree data structure is the sum of the priorities of the experience data units represented by all nodes, denoted p_total.
Step 5: From the N experience data units stored in the SumTree data structure obtained in step 4, sample m experience data units. To ensure that all kinds of experiences have a chance of being selected, the m units are sampled by the following process:
divide [0, p_total] into m equal intervals, sample uniformly at random within each interval, and retrieve the experience data unit corresponding to each sampled priority, thereby obtaining m experience data units.
Process the m sampled experience data units one by one as follows to obtain the gradient ∇_{σ^Q}L of the current evaluation neural network.
For a given experience data unit (s_t, a_t, r_t, s_{t+1}): input the state s_t and the action signal a_t into the current evaluation neural network to obtain the evaluation value Q of the current evaluation neural network; input the next-moment state s_{t+1} into the target decision neural network to obtain the actuator action signal μ' output by the target decision neural network; input the next-moment state s_{t+1} and the action value μ' output by the target decision neural network into the target evaluation neural network to obtain the evaluation value Q' of the target evaluation neural network.
The gradient ∇_{σ^Q}L of the current evaluation neural network is calculated from the evaluation value Q of the current evaluation neural network, the evaluation value Q' of the target evaluation neural network and the loss function L of the evaluation neural network.
Specifically, the loss function L of the current evaluation neural network is:
L = (1/m) Σ_{i=1}^{m} ω_i δ_i^2
where ω_i is the importance-sampling weight of the i-th experience data unit and the TD error δ_i is:
δ_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|σ^{μ'}) | σ^{Q'}) - Q(s_i, a_i | σ^Q)
Here r_i is the reward value in the i-th experience data unit; s_{i+1} is the next-moment state in the i-th experience data unit; σ^{μ'} are the parameters of the target decision neural network; μ'(s_{i+1}|σ^{μ'}) is the actuator action signal output by the target decision neural network for s_{i+1}; σ^{Q'} are the parameters of the target evaluation neural network; Q'(s_{i+1}, μ'(s_{i+1}|σ^{μ'})|σ^{Q'}) is the evaluation value obtained from the target evaluation neural network for μ'(s_{i+1}|σ^{μ'}) and s_{i+1}; s_i is the current-moment state in the i-th experience data unit; a_i is the action signal in that state; σ^Q are the parameters of the current evaluation neural network; and Q(s_i, a_i|σ^Q) is the evaluation value of the current evaluation neural network. γ is the discount coefficient, taken as γ = 0.99.
The gradient of the loss function L of the current evaluation neural network is then:
∇_{σ^Q}L = -(2/m) Σ_{i=1}^{m} ω_i δ_i ∇_{σ^Q} Q(s_i, a_i|σ^Q)
Step 6: Current evaluation neural network update: according to the gradient ∇_{σ^Q}L of the current evaluation neural network, the current evaluation neural network parameters σ^Q are updated by the increment -α∇_{σ^Q}L, i.e. σ^Q ← σ^Q - α∇_{σ^Q}L, where α is the learning rate of the evaluation neural network, taken as 0.001.
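Under the assumption that the networks are the PyTorch modules sketched above, the TD error, weighted loss and evaluation-network update of steps 5 and 6 could be written as follows; a plain SGD step is used to mirror the fixed learning rate α, and the tensor shapes in the docstring are illustrative.

import torch

GAMMA, ALPHA = 0.99, 1e-3
critic_opt = torch.optim.SGD(critic.parameters(), lr=ALPHA)      # plain gradient step, as in step 6

def update_critic(s, a, r, s_next, weights):
    """s, a, r, s_next: tensors of shapes (m, 4), (m, 1), (m, 1), (m, 4); weights: m importance weights."""
    with torch.no_grad():
        a_next = target_actor(s_next)                             # mu'(s_{i+1} | sigma^mu')
        q_target = r + GAMMA * target_critic(s_next, a_next)      # r_i + gamma * Q'(...)
    q = critic(s, a)                                              # Q(s_i, a_i | sigma^Q)
    td_error = q_target - q                                       # delta_i, shape (m, 1)
    w = torch.as_tensor(weights, dtype=torch.float32).reshape(-1, 1)
    loss = (w * td_error.pow(2)).mean()                           # L = (1/m) sum w_i * delta_i^2
    critic_opt.zero_grad()
    loss.backward()                                               # gradient of L w.r.t. sigma^Q
    critic_opt.step()                                             # sigma^Q <- sigma^Q - alpha * grad L
    return td_error.detach().abs().flatten()                      # |delta_i|, used to refresh priorities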
Step 7: Calculate the gradient ∇_{σ^μ}J of the current decision neural network. The specific gradient is computed as:
∇_{σ^μ}J = (1/m) Σ_{i=1}^{m} ∇_a Q(s_i, a|σ^Q)|_{a=μ(s_i)} · ∇_{σ^μ} μ(s_i|σ^μ)
where ∇_a Q(s_i, a|σ^Q)|_{a=μ(s_i)} is the gradient of the evaluation value of the current evaluation neural network with respect to the action a, μ(s_i) is the actuator action signal output by the current decision neural network for s_i, and ∇_{σ^μ}μ(s_i|σ^μ) is the gradient of the current decision neural network action μ(s_i|σ^μ) with respect to the current decision neural network parameters σ^μ.
Step 8: Update the current decision neural network: according to the gradient ∇_{σ^μ}J of the current decision neural network, the current decision neural network parameters σ^μ are updated by the increment β∇_{σ^μ}J, i.e. σ^μ ← σ^μ + β∇_{σ^μ}J, where β is the learning rate of the decision neural network, taken as 0.001.
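Continuing the same sketch, steps 7 and 8 can be realized by ascending the deterministic policy gradient, which in an automatic-differentiation framework reduces to minimizing -Q(s, μ(s)); BETA is the assumed decision-network learning rate, and actor and critic are the modules from the earlier sketch.

import torch

BETA = 1e-3
actor_opt = torch.optim.SGD(actor.parameters(), lr=BETA)

def update_actor(s):
    """Deterministic policy-gradient step (steps 7-8) for the decision (actor) network."""
    actor_loss = -critic(s, actor(s)).mean()   # maximizing Q(s, mu(s)) == minimizing its negative
    actor_opt.zero_grad()
    actor_loss.backward()                      # autograd applies the chain rule grad_a Q * grad_sigma mu
    actor_opt.step()                           # sigma^mu <- sigma^mu + beta * grad J
    return actor_loss.item()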
Step 9: Update the target evaluation neural network and the target decision neural network: the target evaluation neural network parameters are updated from the updated current evaluation neural network parameters, and the target decision neural network parameters are updated from the updated current decision neural network parameters, according to
σ^{Q'}_{t+1} = τ_1 σ^{Q}_{t+1} + (1 - τ_1) σ^{Q'}_t
σ^{μ'}_{t+1} = τ_2 σ^{μ}_{t+1} + (1 - τ_2) σ^{μ'}_t
where σ^{Q}_{t+1} denotes the updated current evaluation neural network parameters, σ^{Q'}_t the target evaluation neural network parameters to be updated, and σ^{Q'}_{t+1} the updated target evaluation neural network parameters; σ^{μ}_{t+1} denotes the updated current decision neural network parameters, σ^{μ'}_t the target decision neural network parameters to be updated, and σ^{μ'}_{t+1} the updated target decision neural network parameters. τ_1 is the update rate of the target evaluation neural network, taken as 0.001, and τ_2 is the update rate of the target decision neural network, taken as 0.0001.
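The soft-update rule of step 9 is a simple per-parameter interpolation; a sketch consistent with the formulas above, using the two stated update rates and the network objects from the earlier sketch, is:

import torch

TAU1, TAU2 = 1e-3, 1e-4   # update rates for the target evaluation and target decision networks

def soft_update(target_net, source_net, tau):
    """sigma_target <- tau * sigma_source + (1 - tau) * sigma_target."""
    with torch.no_grad():
        for tp, sp in zip(target_net.parameters(), source_net.parameters()):
            tp.mul_(1.0 - tau).add_(tau * sp)

soft_update(target_critic, critic, TAU1)
soft_update(target_actor, actor, TAU2)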
Step 10: Judge whether the number of training iterations exceeds the set number. If so, stop training and save the parameter values of the 4 neural networks; if not, return to step 4 and again sample the specified number N of experience data units from the memory bank into the buffer.
Step 11: After the trained deep reinforcement learning neural network model is obtained, it is applied to the gliding motion of the actual underwater glider in the longitudinal plane: given the target pitch angle θ_d, the state values of the underwater glider are collected and input into the deep reinforcement learning neural network model to obtain the control quantity, realizing attitude control of the underwater glider.
Advantageous effects
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. The underwater glider attitude control method learns from simulation model data or manual experiment data to achieve attitude control of the underwater glider, and the learning procedure is simple.
2. An accurate mathematical model of the underwater glider is not required, and the method is also suitable for complex environments.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1: schematic diagram of the deep reinforcement learning algorithm framework;
FIG. 2: control diagram of the longitudinal gliding motion of the underwater glider;
FIG. 3: structural diagram of the SumTree;
FIG. 4: schematic diagram of the episode reward value during attitude control training of the underwater glider;
FIG. 5: in the application stage, the change of the current pitch angle under the action of the attitude adjustment unit when the desired gliding pitch angle is set to -25°.
Detailed Description
The following describes an example of the present invention in detail. In this example, pitch angle control of an actual underwater glider in the vertical plane is taken as an example. The basic characteristic parameters of the underwater glider are first determined and a mathematical model of its gliding motion in the vertical plane is obtained, where the input control quantity is the velocity command a_t of the movable mass block along the x axis and the state of the glider is s: {v_1, v_3, ω_2, θ}, in which v_1, v_3, ω_2 and θ are, respectively, the velocities of the underwater glider along the x and z axes (the x axis is the forward axis of the body coordinate frame and the z axis is the coordinate axis perpendicular to the body plane), the pitch angular velocity, and the pitch angle. The control block diagram of the longitudinal gliding motion with the slider velocity as input is shown in FIG. 2.
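To make the state and action interface concrete, the sketch below shows one interaction step of the learning stage with a hypothetical simulator wrapper; the env object, its step method, the ordering [v1, v3, w2, theta] inside the state vector and the use of exploration noise are illustrative assumptions rather than details fixed by the patent. It reuses the actor module and reward function sketched earlier, and memory is assumed to be a plain Python list or deque.

import numpy as np
import torch

def interaction_step(env, actor, memory, s_t, theta_d, reward_fn, noise_std=0.05):
    """One loop of steps 2-3: act, simulate, compute the reward, and store (s, a, r, s')."""
    with torch.no_grad():
        a_t = actor(torch.as_tensor(s_t, dtype=torch.float32)).numpy()
    a_t = np.clip(a_t + np.random.normal(0.0, noise_std, a_t.shape), -1.0, 1.0)  # exploration (assumed)
    s_next = env.step(a_t)                                    # glider simulator advances one step
    # state layout assumed as [v1, v3, w2, theta]
    r_t = reward_fn(s_next[3], s_t[3], s_next[2], s_t[2], theta_d)
    memory.append((s_t, a_t, r_t, s_next))                    # one experience data unit
    return s_next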
The principle of the example method is as follows: the pitch angle of the underwater glider in the vertical plane is controlled using a deep reinforcement learning method based on priority sampling. The specific steps are:
Step 1: Establish 4 BP neural networks: a current decision neural network, a current evaluation neural network, a target decision neural network and a target evaluation neural network. The current decision neural network and the target decision neural network are called decision neural networks; the current evaluation neural network and the target evaluation neural network are called evaluation neural networks. The decision neural networks have 4 input quantities, namely the current state values v_1, v_3, ω_2 and θ of the underwater glider, and 1 output quantity, namely the action value a. The evaluation neural networks have 5 input quantities, v_1, v_3, ω_2, θ and a, and 1 output quantity, the evaluation value. In this embodiment, each of the 4 BP neural networks has 2 hidden layers: the first hidden layer has 400 nodes, the second hidden layer has 300 nodes, and each network has one output node.
The mapping process of the BP neural network is as follows:
Input layer → hidden-layer activation function → layer-1 hidden-layer output:
z_{1k} = f_1(Σ_{i=1}^{n} v_{ki} x_i + b_k),  k = 1, ..., h_1
where x_i is the i-th input quantity and n is the total number of inputs, v_{ki} is the weight from the input layer to the layer-1 hidden layer, b_k is the bias threshold, z_{1k} is the k-th node of the first hidden layer, h_1 is taken as 400, and f_1(·) is the hidden-layer activation function, chosen as the ReLU function.
Layer-1 hidden-layer output → hidden-layer activation function → layer-2 hidden-layer output:
z_{2j} = f_2(Σ_{k=1}^{h_1} w_{jk} z_{1k} + b_j),  j = 1, ..., h_2
where w_{jk} is the weight from the layer-1 hidden layer to the layer-2 hidden layer, b_j is the bias threshold, z_{2j} is the j-th output of the layer-2 hidden layer, h_2 is taken as 300, and f_2(·) is the layer-2 activation function, chosen as the ReLU function.
Layer-2 hidden-layer output → output-layer activation function → output-layer output:
y = f_3(Σ_{j=1}^{h_2} w_j z_{2j} + b_l)
where y is the output (the outputs here are scalar values), w_j is the weight between the layer-2 hidden layer and the output layer, b_l is the bias threshold, and f_3(·) is the output-layer activation function; the output-layer activation of the decision neural network is the Tanh function and that of the evaluation neural network is the ReLU function.
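A plain NumPy transcription of this forward mapping, one possible reading of the three layer equations above with randomly initialized weights standing in for the learned parameters, is:

import numpy as np

def relu(s):
    return np.maximum(0.0, s)

def bp_forward(x, params, output_activation=np.tanh):
    """Forward pass of the 2-hidden-layer BP network (400 and 300 nodes)."""
    V, b1, W, b2, w3, b3 = params
    z1 = relu(V @ x + b1)                   # layer-1 hidden output, f1 = ReLU
    z2 = relu(W @ z1 + b2)                  # layer-2 hidden output, f2 = ReLU
    return output_activation(w3 @ z2 + b3)  # f3 = Tanh (decision net) or ReLU (evaluation net)

n_in = 4                                    # v1, v3, w2, theta for the decision network
params = (np.random.randn(400, n_in) * 0.1, np.zeros(400),
          np.random.randn(300, 400) * 0.1, np.zeros(300),
          np.random.randn(1, 300) * 0.1, np.zeros(1))
y = bp_forward(np.zeros(n_in), params)      # scalar action value in [-1, 1]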
The parameters of the 4 neural networks are initialized; the memory bank size is initialized to n = 10000 and the data buffer size to N = 64.
Step 2: Obtain the state value s_t: {v_1, v_3, ω_2, θ} of the underwater glider at the current moment and input it into the current decision neural network to calculate the output action a_t of the attitude controller at the current moment. Apply the output action a_t to a simulator constructed from the mathematical model of the underwater glider gliding in the longitudinal plane to obtain the state value s_{t+1} of the underwater glider at the next moment. Calculate the reward value r_t of the current moment from the current state s_t, the current action a_t, the target pitch angle θ_d and the next-moment state s_{t+1}. In this example r_t takes the value:
r_t = r_1 + r_2 + r_3
where r_1 = -λ_1(d_t - d_{t-1}) is the return obtained as the current pitch angle moves away from or toward the desired angle; r_2 = -λ_2(w_t - w_{t-1}) is the return obtained from the change in angular velocity; and r_3 is valued according to the current pitch angle: if |d_t - d_{t-1}| < 0.1°, r_3 is a positive "reward", and if θ < -90° or θ > 90°, r_3 is a negative number, representing a penalty. Here d_t is the difference between the current pitch angle θ and the target pitch angle θ_d at moment t, w_t is the pitch angular velocity at moment t, and λ_1 and λ_2 are set coefficients.
Step 3: Store the tuple (s_t, a_t, r_t, s_{t+1}) obtained in step 2 in the memory bank as one experience data unit and increment t by 1. Compare t with the set memory bank size n: if t < n, return to step 2 with the updated s_t; once the number of experience data units stored in the memory bank reaches n, proceed to step 4.
Step 4: Sample a specified number N of experience data units from the memory bank and store them in the buffer. The N experience data units in the buffer are priority-ordered through a priority-based experience replay mechanism and stored in a SumTree data structure; the value of the head node of the SumTree data structure is the sum of the priorities of the experience data units represented by all nodes, denoted p_total.
The priority-based experience replay is specifically as follows. Define P(i) as the sampling probability of the i-th experience, calculated as:
P(i) = p_i^η / Σ_k p_k^η
where p_i^η represents the priority of the i-th experience and η sets the degree to which priority is used; when η = 0 the sampling is uniform. The priority is p_i = |δ_i| + ε, where δ_i is the TD error and ε is a small positive constant. So that the complexity of the algorithm does not depend on the size of the experience pool, the resulting P(i) values are arranged in a SumTree data structure. SumTree is a special binary-tree structure in which all leaf nodes store the priorities of the corresponding experiences and each parent node stores the sum of its children, so the value of the head node is the sum of all priorities, denoted p_total. The SumTree principle is shown in FIG. 3: the leaf nodes of the SumTree store the priority of each experience, and the leaf-node indices correspond one-to-one with the experience indices of the memory bank; when an experience is selected by priority, the corresponding experience is taken out of the buffer according to the index of the currently selected priority.
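A compact array-based SumTree, sufficient for the priority bookkeeping described above, could be sketched as follows; the class and method names are illustrative.

import numpy as np

class SumTree:
    """Binary tree whose leaves hold priorities and whose root holds their sum (p_total)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = np.zeros(2 * capacity - 1)   # internal nodes followed by leaves, in one array

    def update(self, leaf_index, priority):
        node = leaf_index + self.capacity - 1    # array position of the leaf
        change = priority - self.tree[node]
        self.tree[node] = priority
        while node != 0:                         # propagate the change up to the root
            node = (node - 1) // 2
            self.tree[node] += change

    def total(self):
        return self.tree[0]                      # p_total

    def get_leaf(self, value):
        """Descend from the root to the leaf whose priority segment contains `value`."""
        node = 0
        while 2 * node + 1 < len(self.tree):     # while the node still has children
            left = 2 * node + 1
            if value <= self.tree[left]:
                node = left
            else:
                value -= self.tree[left]
                node = left + 1
        leaf_index = node - (self.capacity - 1)
        return leaf_index, self.tree[node]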
Step 5: From the N experience data units stored in the SumTree data structure obtained in step 4, sample m experience data units. In this embodiment, to ensure that all kinds of experiences have a chance of being selected, the m units are sampled by the following process:
divide [0, p_total] into m equal intervals, sample uniformly at random within each interval, and retrieve the experience data unit corresponding to each sampled priority, thereby obtaining m experience data units.
Process the m sampled experience data units one by one as follows to obtain the gradient ∇_{σ^Q}L of the current evaluation neural network.
For a given experience data unit (s_t, a_t, r_t, s_{t+1}): input the state s_t and the action signal a_t into the current evaluation neural network to obtain the evaluation value Q of the current evaluation neural network; input the next-moment state s_{t+1} into the target decision neural network to obtain the actuator action signal μ' output by the target decision neural network; input the next-moment state s_{t+1} and the action value μ' output by the target decision neural network into the target evaluation neural network to obtain the evaluation value Q' of the target evaluation neural network.
The gradient ∇_{σ^Q}L of the current evaluation neural network is calculated from the evaluation value Q of the current evaluation neural network, the evaluation value Q' of the target evaluation neural network and the loss function L of the evaluation neural network.
The loss function L of the current evaluation neural network is:
L = (1/m) Σ_{i=1}^{m} ω_i δ_i^2
where ω_i is the importance-sampling weight of the i-th experience data unit and the TD error δ_i is:
δ_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|σ^{μ'}) | σ^{Q'}) - Q(s_i, a_i | σ^Q)
Here r_i is the reward value in the i-th experience data unit; s_{i+1} is the next-moment state in the i-th experience data unit; σ^{μ'} are the parameters of the target decision neural network; μ'(s_{i+1}|σ^{μ'}) is the actuator action signal output by the target decision neural network for s_{i+1}; σ^{Q'} are the parameters of the target evaluation neural network; Q'(s_{i+1}, μ'(s_{i+1}|σ^{μ'})|σ^{Q'}) is the evaluation value obtained from the target evaluation neural network for μ'(s_{i+1}|σ^{μ'}) and s_{i+1}; s_i is the current-moment state in the i-th experience data unit; a_i is the action signal in that state; σ^Q are the parameters of the current evaluation neural network; and Q(s_i, a_i|σ^Q) is the evaluation value of the current evaluation neural network. γ is the discount coefficient, taken as γ = 0.99.
The gradient of the loss function L of the current evaluation neural network is then:
∇_{σ^Q}L = -(2/m) Σ_{i=1}^{m} ω_i δ_i ∇_{σ^Q} Q(s_i, a_i|σ^Q)
Step 6: Current evaluation neural network update: according to the gradient ∇_{σ^Q}L of the current evaluation neural network, the current evaluation neural network parameters σ^Q are updated by the increment -α∇_{σ^Q}L, i.e. σ^Q ← σ^Q - α∇_{σ^Q}L, where α is the learning rate of the evaluation neural network, taken as 0.001.
Step 7: Calculate the gradient ∇_{σ^μ}J of the current decision neural network. The gradient is computed as:
∇_{σ^μ}J = (1/m) Σ_{i=1}^{m} ∇_a Q(s_i, a|σ^Q)|_{a=μ(s_i)} · ∇_{σ^μ} μ(s_i|σ^μ)
where ∇_a Q(s_i, a|σ^Q)|_{a=μ(s_i)} is the gradient of the evaluation value of the current evaluation neural network with respect to the action a, μ(s_i) is the actuator action signal output by the current decision neural network for s_i, and ∇_{σ^μ}μ(s_i|σ^μ) is the gradient of the current decision neural network action μ(s_i|σ^μ) with respect to the current decision neural network parameters σ^μ.
Step 8: Update the current decision neural network: according to the gradient ∇_{σ^μ}J of the current decision neural network, the current decision neural network parameters σ^μ are updated by the increment β∇_{σ^μ}J, i.e. σ^μ ← σ^μ + β∇_{σ^μ}J, where β is the learning rate of the decision neural network, taken as 0.001.
Step 9: Update the target evaluation neural network and the target decision neural network: the target evaluation neural network parameters are updated from the updated current evaluation neural network parameters, and the target decision neural network parameters are updated from the updated current decision neural network parameters, according to
σ^{Q'}_{t+1} = τ_1 σ^{Q}_{t+1} + (1 - τ_1) σ^{Q'}_t
σ^{μ'}_{t+1} = τ_2 σ^{μ}_{t+1} + (1 - τ_2) σ^{μ'}_t
where σ^{Q}_{t+1} denotes the updated current evaluation neural network parameters, σ^{Q'}_t the target evaluation neural network parameters to be updated, and σ^{Q'}_{t+1} the updated target evaluation neural network parameters; σ^{μ}_{t+1} denotes the updated current decision neural network parameters, σ^{μ'}_t the target decision neural network parameters to be updated, and σ^{μ'}_{t+1} the updated target decision neural network parameters. τ_1 is the update rate of the target evaluation neural network, taken as 0.001, and τ_2 is the update rate of the target decision neural network, taken as 0.0001.
Step 10: Judge whether the number of training iterations exceeds the set number. If so, stop training and save the parameter values of the 4 neural networks; if not, return to step 4 and again sample the specified number N of experience data units from the memory bank into the buffer.
Step 11: After the trained deep reinforcement learning neural network model is obtained, it is applied to the gliding motion of the actual underwater glider in the longitudinal plane: given the target pitch angle θ_d, the state values {v_1, v_3, ω_2, θ} of the underwater glider are collected and input into the deep reinforcement learning neural network model to obtain the control quantity (the velocity command a_t of the movable mass block along the x axis), realizing attitude control of the underwater glider.
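In the application stage only the trained decision network is needed. A minimal deployment sketch, assuming the PyTorch actor from the earlier sketches, an assumed parameter file name and a hypothetical read_glider_state()/send_mass_command() I/O layer, is:

import torch

actor.load_state_dict(torch.load("actor.pt"))   # trained decision-network parameters (file name assumed)
actor.eval()

def control_step(read_glider_state, send_mass_command):
    """One control cycle of step 11: measure the state, query the decision network, apply the command."""
    v1, v3, w2, theta = read_glider_state()                   # measured {v1, v3, w2, theta}
    s = torch.tensor([v1, v3, w2, theta], dtype=torch.float32)
    with torch.no_grad():
        a_t = actor(s).item()                                 # velocity command of the movable mass block
    send_mass_command(a_t)
    return a_t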
Fig. 4 shows the episode reward value of the attitude-controller training process in this embodiment. The training effect is mainly measured by the episode reward: after a certain period of training, the larger the average reward value, the better the training effect. As shown in fig. 4, the episode reward rises continuously and, after about 900 training episodes, stabilizes at roughly 850, indicating that the controller has learned a good strategy.
Fig. 5 shows an example of the application stage: the change of the pitch angle during gliding. The initial pitch angle is set to 0° at the start of gliding, and the desired pitch angle is -25°. The pitch angle reaches the desired value after 16 s, and the steady-state error during steady gliding is 0.06°. The error between the pitch angle and the desired angle is thus small during steady gliding, and the glider can be considered able to glide continuously along the desired track.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention.

Claims (8)

1. An underwater glider attitude control method based on deep reinforcement learning is characterized in that: the method comprises the following steps:
step 1: establishing a current decision neural network, a current evaluation neural network, a target decision neural network and a target evaluation neural network; the decision neural networks adopt the state value of the underwater glider as the input quantity and the control quantity a of the underwater glider as the output action; the evaluation neural networks take the state value and the control quantity of the underwater glider as input and the evaluation value as output; initializing the parameters of the 4 neural networks, and initializing a memory bank and a data buffer;
step 2: obtaining the state value s_t of the underwater glider at the current moment, inputting the state value into the current decision neural network to calculate the output action a_t of the attitude controller at the current moment, and applying the output action a_t to an underwater glider simulator to obtain the state value s_{t+1} of the underwater glider at the next moment; calculating the reward value r_t of the current moment according to the current state s_t, the current action a_t, the target pitch angle θ_d and the next-moment state s_{t+1};
step 3: storing the tuple (s_t, a_t, r_t, s_{t+1}) obtained in step 2 in the memory bank as one experience data unit, and incrementing t by 1; comparing t with the set memory bank size n: if t < n, returning to step 2 with the updated s_t; once the number of experience data units stored in the memory bank reaches n, entering step 4;
step 4: sampling a specified number N of experience data units from the memory bank and storing them in the buffer;
step 5: sampling m experience data units from the N experience data units in the buffer, and processing the m sampled experience data units one by one as follows:
for a given experience data unit (s_t, a_t, r_t, s_{t+1}), inputting the state s_t and the action signal a_t into the current evaluation neural network to obtain the evaluation value Q of the current evaluation neural network; inputting the next-moment state s_{t+1} into the target decision neural network to obtain the actuator action signal μ' output by the target decision neural network; inputting the next-moment state s_{t+1} and the action value μ' output by the target decision neural network into the target evaluation neural network to obtain the evaluation value Q' of the target evaluation neural network;
calculating the gradient ∇_{σ^Q}L of the current evaluation neural network from the evaluation value Q of the current evaluation neural network, the evaluation value Q' of the target evaluation neural network and the loss function L of the evaluation neural network;
step 6: current evaluation neural network update: according to the gradient ∇_{σ^Q}L of the current evaluation neural network, updating the current evaluation neural network parameters σ^Q by the increment -α∇_{σ^Q}L, where α is the learning rate of the evaluation neural network;
step 7: calculating the gradient ∇_{σ^μ}J of the current decision neural network;
step 8: updating the current decision neural network: according to the gradient ∇_{σ^μ}J of the current decision neural network, updating the current decision neural network parameters σ^μ by the increment β∇_{σ^μ}J, where β is the learning rate of the decision neural network;
step 9: updating the target evaluation neural network and the target decision neural network: updating the target evaluation neural network parameters according to the updated current evaluation neural network parameters, and updating the target decision neural network parameters according to the updated current decision neural network parameters;
step 10: judging whether the number of training iterations exceeds the set number: if so, stopping training and saving the parameter values of the 4 neural networks; if not, returning to step 4 and again sampling the specified number N of experience data units from the memory bank into the buffer;
step 11: after the trained deep reinforcement learning neural network model is obtained, applying it to the gliding motion of the actual underwater glider in the longitudinal plane: a target pitch angle is given, the state value of the underwater glider is collected and input into the deep reinforcement learning neural network model to obtain the control quantity, realizing attitude control of the underwater glider.
2. The underwater glider attitude control method based on deep reinforcement learning of claim 1, characterized in that: the reward value r_t takes the value:
r_t = r_1 + r_2 + r_3
where r_1 = -λ_1(d_t - d_{t-1}) is the return obtained as the current pitch angle moves away from or toward the target pitch angle; r_2 = -λ_2(w_t - w_{t-1}) is the return obtained from the change in angular velocity; r_3 is valued according to the current pitch angle: if |d_t - d_{t-1}| < 0.1°, r_3 is a positive reward, and if the pitch angle θ at the current moment t satisfies θ < -90° or θ > 90°, r_3 is a negative number, representing a penalty; d_t is the difference between the pitch angle θ and the target pitch angle θ_d at moment t, w_t is the pitch angular velocity at moment t, and λ_1 and λ_2 are set coefficients.
3. The underwater glider attitude control method based on deep reinforcement learning of claim 1, characterized in that: in step 4, the N experience data units in the buffer are priority-ordered through a priority-based experience replay mechanism and stored in a SumTree data structure; the value of the head node of the SumTree data structure is the sum of the priorities of the experience data units represented by all nodes, denoted p_total.
4. The underwater glider attitude control method based on deep reinforcement learning of claim 3, wherein: in step 5, the m experience data units are sampled by the following process: [0, p_total] is divided into m equal intervals, one value is sampled uniformly at random within each interval, and the experience data unit corresponding to each sampled priority is retrieved, thereby obtaining m experience data units.
5. The underwater glider attitude control method based on deep reinforcement learning of claim 1, characterized in that: in step 5, the gradient ∇_{σ^Q}L of the current evaluation neural network is obtained as follows:
the loss function L of the current evaluation neural network is
L = (1/m) Σ_{i=1}^{m} ω_i δ_i^2
where ω_i is the importance-sampling weight of the i-th experience data unit and the TD error δ_i is
δ_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|σ^{μ'}) | σ^{Q'}) - Q(s_i, a_i | σ^Q)
r_i is the reward value in the i-th experience data unit, s_{i+1} is the next-moment state in the i-th experience data unit, σ^{μ'} are the parameters of the target decision neural network, μ'(s_{i+1}|σ^{μ'}) is the actuator action signal output by the target decision neural network for s_{i+1}; σ^{Q'} are the parameters of the target evaluation neural network, Q'(s_{i+1}, μ'(s_{i+1}|σ^{μ'})|σ^{Q'}) is the evaluation value obtained from the target evaluation neural network for μ'(s_{i+1}|σ^{μ'}) and s_{i+1}; s_i is the current-moment state in the i-th experience data unit, a_i is the action signal in that state, σ^Q are the parameters of the current evaluation neural network, and Q(s_i, a_i|σ^Q) is the evaluation value of the current evaluation neural network; γ is the discount coefficient;
the gradient of the loss function L of the current evaluation neural network is then obtained as
∇_{σ^Q}L = -(2/m) Σ_{i=1}^{m} ω_i δ_i ∇_{σ^Q} Q(s_i, a_i|σ^Q).
6. The underwater glider attitude control method based on deep reinforcement learning of claim 1, characterized in that: in step 7, the gradient ∇_{σ^μ}J of the current decision neural network is calculated as
∇_{σ^μ}J = (1/m) Σ_{i=1}^{m} ∇_a Q(s_i, a|σ^Q)|_{a=μ(s_i)} · ∇_{σ^μ} μ(s_i|σ^μ)
where ∇_a Q(s_i, a|σ^Q)|_{a=μ(s_i)} is the gradient of the evaluation value of the current evaluation neural network with respect to the action a, μ(s_i) is the actuator action signal output by the current decision neural network for s_i, and ∇_{σ^μ}μ(s_i|σ^μ) is the gradient of the current decision neural network action μ(s_i|σ^μ) with respect to the current decision neural network parameters σ^μ.
7. The underwater glider attitude control method based on deep reinforcement learning of claim 1, characterized in that: in step 9, the update formulas of the target evaluation neural network and the target decision neural network are:
σ^{Q'}_{t+1} = τ_1 σ^{Q}_{t+1} + (1 - τ_1) σ^{Q'}_t
σ^{μ'}_{t+1} = τ_2 σ^{μ}_{t+1} + (1 - τ_2) σ^{μ'}_t
where σ^{Q}_{t+1} denotes the updated current evaluation neural network parameters, σ^{Q'}_t the target evaluation neural network parameters to be updated, and σ^{Q'}_{t+1} the updated target evaluation neural network parameters; σ^{μ}_{t+1} denotes the updated current decision neural network parameters, σ^{μ'}_t the target decision neural network parameters to be updated, and σ^{μ'}_{t+1} the updated target decision neural network parameters; τ_1 is the update rate of the target evaluation neural network and τ_2 is the update rate of the target decision neural network.
8. The underwater glider attitude control method based on deep reinforcement learning of claim 1, characterized in that: the decision neural networks adopt the state values v_1, v_3, ω_2, θ of the underwater glider as input, where v_1, v_3, ω_2 and θ are, respectively, the velocities of the underwater glider along the x and z axes, the pitch angular velocity and the pitch angle, the x axis being the forward axis of the body coordinate frame and the z axis the coordinate axis perpendicular to the body plane; the control quantity a of the underwater glider is the velocity command of the movable mass block along the x axis.
CN202010925225.3A 2020-09-06 2020-09-06 Underwater glider attitude control method based on deep reinforcement learning Pending CN112100834A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010925225.3A CN112100834A (en) 2020-09-06 2020-09-06 Underwater glider attitude control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010925225.3A CN112100834A (en) 2020-09-06 2020-09-06 Underwater glider attitude control method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN112100834A true CN112100834A (en) 2020-12-18

Family

ID=73758469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010925225.3A Pending CN112100834A (en) 2020-09-06 2020-09-06 Underwater glider attitude control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN112100834A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420513A (en) * 2021-07-01 2021-09-21 西北工业大学 Underwater cylinder turbulent flow partition flow field prediction method based on deep learning
CN113879495A (en) * 2021-10-26 2022-01-04 西北工业大学 Underwater glider dynamic motion planning method based on ocean current prediction
CN114355777A (en) * 2022-01-06 2022-04-15 浙江大学 Dynamic gliding method and system based on distributed pressure sensor and sectional attitude control
CN115046433A (en) * 2021-03-09 2022-09-13 北京理工大学 Aircraft time collaborative guidance method based on deep reinforcement learning
CN118466221A (en) * 2024-07-11 2024-08-09 中国海洋大学 Deep reinforcement learning decision-making method for attack angle of underwater glider

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102644A (en) * 2017-06-22 2017-08-29 华南师范大学 The underwater robot method for controlling trajectory and control system learnt based on deeply
CN108803321A (en) * 2018-05-30 2018-11-13 清华大学 Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN110597058A (en) * 2019-08-28 2019-12-20 浙江工业大学 Three-degree-of-freedom autonomous underwater vehicle control method based on reinforcement learning
CN110879595A (en) * 2019-11-29 2020-03-13 江苏徐工工程机械研究院有限公司 Unmanned mine card tracking control system and method based on deep reinforcement learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102644A (en) * 2017-06-22 2017-08-29 华南师范大学 The underwater robot method for controlling trajectory and control system learnt based on deeply
CN108803321A (en) * 2018-05-30 2018-11-13 清华大学 Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN110597058A (en) * 2019-08-28 2019-12-20 浙江工业大学 Three-degree-of-freedom autonomous underwater vehicle control method based on reinforcement learning
CN110879595A (en) * 2019-11-29 2020-03-13 江苏徐工工程机械研究院有限公司 Unmanned mine card tracking control system and method based on deep reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Liu Xi et al.: "DDPG-optimized aircraft attitude control based on dynamic inversion", Computer Simulation *
Ge Dongxu: "Principles and Applications of Data Mining", 31 March 2020 *
Wei Pengcheng et al.: "Integration and Development of Big Data Analytics and Machine Learning", 31 May 2017 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115046433A (en) * 2021-03-09 2022-09-13 北京理工大学 Aircraft time collaborative guidance method based on deep reinforcement learning
CN113420513A (en) * 2021-07-01 2021-09-21 西北工业大学 Underwater cylinder turbulent flow partition flow field prediction method based on deep learning
CN113420513B (en) * 2021-07-01 2023-03-07 西北工业大学 Underwater cylinder turbulent flow partition flow field prediction method based on deep learning
CN113879495A (en) * 2021-10-26 2022-01-04 西北工业大学 Underwater glider dynamic motion planning method based on ocean current prediction
CN113879495B (en) * 2021-10-26 2024-04-19 西北工业大学 Dynamic motion planning method for underwater glider based on ocean current prediction
CN114355777A (en) * 2022-01-06 2022-04-15 浙江大学 Dynamic gliding method and system based on distributed pressure sensor and sectional attitude control
CN114355777B (en) * 2022-01-06 2023-10-10 浙江大学 Dynamic gliding method and system based on distributed pressure sensor and sectional attitude control
CN118466221A (en) * 2024-07-11 2024-08-09 中国海洋大学 Deep reinforcement learning decision-making method for attack angle of underwater glider
CN118466221B (en) * 2024-07-11 2024-09-17 中国海洋大学 Deep reinforcement learning decision-making method for attack angle of underwater glider

Similar Documents

Publication Publication Date Title
CN112100834A (en) Underwater glider attitude control method based on deep reinforcement learning
CN107102644B (en) Underwater robot track control method and control system based on deep reinforcement learning
CN106483852B (en) A kind of stratospheric airship control method based on Q-Learning algorithm and neural network
CN108803321A (en) Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN110597058B (en) Three-degree-of-freedom autonomous underwater vehicle control method based on reinforcement learning
CN113052372B (en) Dynamic AUV tracking path planning method based on deep reinforcement learning
CN111290270B (en) Underwater robot backstepping speed and heading control method based on Q-learning parameter adaptive technology
CN111240345A (en) Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
CN113377121B (en) Aircraft intelligent disturbance rejection control method based on deep reinforcement learning
CN114355777B (en) Dynamic gliding method and system based on distributed pressure sensor and sectional attitude control
CN1718378A (en) S face control method of flotation under water robot motion
CN109300144A (en) A kind of pedestrian track prediction technique of mosaic society's power model and Kalman filtering
CN112286218A (en) Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN114428517B (en) End-to-end autonomous landing control method for unmanned plane and unmanned ship cooperative platform
CN111273677B (en) Autonomous underwater robot speed and heading control method based on reinforcement learning technology
CN113792857B (en) Pulse neural network training method based on membrane potential self-increasing mechanism
CN111487863A (en) Active suspension reinforcement learning control method based on deep Q neural network
CN114967713B (en) Underwater vehicle buoyancy discrete change control method based on reinforcement learning
Huang et al. Attitude control of fixed-wing UAV based on DDQN
CN117818706B (en) Method, system, equipment and medium for predicting speed of medium-low speed maglev train
CN117289709B (en) High-ultrasonic-speed appearance-changing aircraft attitude control method based on deep reinforcement learning
CN118034373A (en) Method and system for controlling residence of optimal intelligent area of stratospheric airship environment
Dong et al. Gliding motion optimization for a biomimetic gliding robotic fish
CN114637327A (en) Online track generation guidance method based on depth strategic gradient reinforcement learning
CN114237268A (en) Unmanned aerial vehicle strong robust attitude control method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201218

WD01 Invention patent application deemed withdrawn after publication