CN112100834A - Underwater glider attitude control method based on deep reinforcement learning
- Publication number: CN112100834A (application CN202010925225.3A)
- Authority: CN (China)
- Prior art keywords: neural network, evaluation, current, target, underwater glider
- Prior art date: 2020-09-06
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F30/20: Design optimisation, verification or simulation (Computer-aided design [CAD])
- G06N3/044: Recurrent networks, e.g. Hopfield networks
- G06N3/045: Combinations of networks
- G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/061: Physical realisation of neural networks using biological neurons
- G06N3/08: Learning methods
Abstract
The invention provides an underwater glider attitude control method based on deep reinforcement learning, which comprises a learning stage and an application stage. In the learning stage, the motion process of an underwater glider is simulated and real-time motion data are recorded at the same time; the parameters of a current decision neural network, a current evaluation neural network, a target decision neural network and a target evaluation neural network are updated according to the motion data. After the trained deep reinforcement learning neural network model is obtained, it is applied to the gliding motion of the actual underwater glider in the longitudinal plane: given a target pitch angle θ_d, the state values of the underwater glider are collected and input into the deep reinforcement learning neural network model to obtain the control quantity, thereby realizing attitude control of the underwater glider. The method learns from simulation model data or manual experiment data to control the attitude of the underwater glider, so the learning mode is simple; an accurate mathematical model of the underwater glider is not required, and the method is also suitable for complex environments.
Description
Technical Field
The invention relates to a control technology of an underwater robot, in particular to an underwater glider attitude control method based on deep reinforcement learning.
Background
The underwater glider is a novel underwater vehicle developed by combining buoy, submerged buoy and underwater robot technologies; it has no externally hung propulsion parts and is driven by its own gravity. Its main characteristic is that motion control does not depend on a propeller propulsion system: up-and-down sinking and floating motion is realized by adjusting the net buoyancy of the glider, and the lift force generated by the horizontal wings attached to the fuselage, inclined upwards or downwards, drives the glider to glide forwards. The underwater glider overcomes the drawbacks of high power consumption and short endurance of conventional underwater vehicles, greatly reduces operating and manufacturing costs, improves endurance, and has practical value in military applications and ocean exploration research.
The motion attitude of the underwater glider is easily disturbed by ocean currents and waves. At the same time, the underwater glider has a complex structure and a single actuation mode, its dynamic model is strongly nonlinear, accurate model parameters are difficult to obtain, and models built for different water environments lack generality. Although many traditional control methods can realize attitude control of the underwater glider with a certain accuracy, they cannot meet high-accuracy requirements and their control processes are complex.
Disclosure of Invention
Technical problem to be solved
The invention aims to overcome the defects and shortcomings of the prior art by providing an underwater glider attitude control method based on deep reinforcement learning: a deep reinforcement learning neural network model is established, and accurate control of the underwater glider attitude is achieved by learning from simulation model data or manual experiment data.
Technical scheme
The invention provides an underwater glider attitude control method based on deep reinforcement learning, which comprises a learning stage and an application stage. In the learning stage, the motion process of the underwater glider is simulated and real-time motion data are recorded at the same time; the parameters of a current decision neural network, a current evaluation neural network, a target decision neural network and a target evaluation neural network are updated according to the motion data. The specific steps are as follows:
Step 1: 4 BP neural networks are established, namely a current decision neural network, a current evaluation neural network, a target decision neural network and a target evaluation neural network. The current decision neural network and the target decision neural network are called decision neural networks, and the current evaluation neural network and the target evaluation neural network are called evaluation neural networks. A decision neural network uses the state value of the underwater glider as its input quantity and the control quantity a of the underwater glider as its output action. An evaluation neural network takes the state value and the control quantity of the underwater glider as input and the evaluation value as output;
After the neural networks are constructed, the parameters of the 4 neural networks are initialized, and the sizes of the memory bank and the data buffer are initialized.
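For illustration only, the initialization of step 1 could be organised as in the following sketch (hypothetical Python/PyTorch code, not part of the claimed method; the layer sizes and output activations follow the embodiment described later, and all variable names are assumptions):

    import copy
    import torch.nn as nn

    state_dim, action_dim = 4, 1              # s = {v1, v3, w2, theta}, a = slider speed command

    def make_decision_net():                  # decision (actor) network: state -> action
        return nn.Sequential(nn.Linear(state_dim, 400), nn.ReLU(),
                             nn.Linear(400, 300), nn.ReLU(),
                             nn.Linear(300, action_dim), nn.Tanh())

    def make_evaluation_net():                # evaluation (critic) network: (state, action) -> value
        return nn.Sequential(nn.Linear(state_dim + action_dim, 400), nn.ReLU(),
                             nn.Linear(400, 300), nn.ReLU(),
                             nn.Linear(300, 1), nn.ReLU())   # output ReLU as in the embodiment

    mu, q = make_decision_net(), make_evaluation_net()          # current networks
    mu_target, q_target = copy.deepcopy(mu), copy.deepcopy(q)   # target networks start as copies
    memory_bank, MEMORY_SIZE, BUFFER_SIZE = [], 10000, 64       # n and N as used in the embodiment

The target networks are deliberately created as copies of the current networks and are afterwards only moved slowly towards them (step 9), which is what stabilises the bootstrapped evaluation targets.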
Step 2: obtain the state value s_t of the underwater glider at the current moment and input it into the current decision neural network to calculate the output action a_t of the attitude controller at the current moment; apply the output action a_t to an underwater glider simulator to obtain the state value s_{t+1} of the underwater glider at the next moment. According to the state s_t at the current moment, the action a_t at the current moment, the target pitch angle θ_d and the state s_{t+1} at the next moment, calculate the reward value r_t of the current moment.
Preferably, r_t takes the value

r_t = r_1 + r_2 + r_3

where r_1 = -λ_1(d_t - d_{t-1}), r_1 representing the return obtained as the current pitch angle moves away from or towards the desired angle; r_2 = -λ_2(w_t - w_{t-1}), r_2 representing the return obtained from the change in angular velocity; and r_3 takes a value according to the current pitch angle: if |d_t - d_{t-1}| < 0.1°, r_3 is a positive "reward", and if θ < -90° or θ > 90°, r_3 is a negative number, representing a penalty. d_t is the difference between the current pitch angle θ and the target pitch angle θ_d at time t, w_t is the magnitude of the pitch angular velocity at time t, and λ_1 and λ_2 are set coefficients.
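For illustration only, this reward could be computed as in the following sketch (hypothetical Python code; the concrete values of lambda1, lambda2 and of the positive reward and penalty are assumptions, since the patent leaves them as settable coefficients):

    def reward(theta, theta_prev, w, w_prev, theta_d,
               lambda1=1.0, lambda2=0.1, bonus=1.0, penalty=-10.0):
        """r_t = r1 + r2 + r3 for pitch-angle tracking; angles in degrees."""
        d_t = theta - theta_d                            # d_t: error between current and target pitch angle
        d_prev = theta_prev - theta_d                    # d_{t-1}: previous error
        r1 = -lambda1 * (d_t - d_prev)                   # r1 = -lambda1 * (d_t - d_{t-1})
        r2 = -lambda2 * (w - w_prev)                     # r2 = -lambda2 * (w_t - w_{t-1})
        r3 = bonus if abs(d_t - d_prev) < 0.1 else 0.0   # small change of the error: positive "reward"
        if theta < -90.0 or theta > 90.0:                # pitch angle outside +/- 90 degrees: penalty
            r3 = penalty
        return r1 + r2 + r3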
Step 3: store the tuple (s_t, a_t, r_t, s_{t+1}) obtained in step 2 as a group of experience data units in the memory bank, and increase t by 1; compare t with the set memory bank size n: if t < n, return to step 2 with the updated s_t; once the number of experience data units stored in the memory bank reaches n, proceed to step 4.
Step 4: sample a specified number N of experience data units from the memory bank and store them in the buffer.
The N experience data units in the buffer are priority-ordered, preferably through a priority-based experience replay mechanism, and stored in a SumTree data structure; the value of the head node in the SumTree data structure is the sum of the priorities of the experience data units represented by all nodes, denoted p_total.
Step 5: from the N experience data units stored in the SumTree data structure obtained in step 4, sample m experience data units. In this embodiment, in order to ensure that all kinds of experience have a chance to be selected, the m experience data units are sampled through the following process: divide [0, p_total] equally into m intervals and sample randomly and uniformly within each interval; the experience data units corresponding to the sampled priorities are then taken out, giving m experience data units.
The m sampled experience data units are processed one by one according to the following process to obtain the gradient of the loss function of the current evaluation neural network.

For a given experience data unit (s_t, a_t, r_t, s_{t+1}): input the state s_t and the action signal a_t into the current evaluation neural network to obtain the evaluation value Q of the current evaluation neural network; input the next-moment state s_{t+1} into the target decision neural network to obtain the actuator action signal μ' output by the target decision neural network; and input the next-moment state s_{t+1} together with the action value μ' output by the target decision neural network into the target evaluation neural network to obtain the evaluation value Q' of the target evaluation neural network;
The gradient of the current evaluation neural network is calculated from the evaluation value Q of the current evaluation neural network, the evaluation value Q' of the target evaluation neural network and the loss function L of the evaluation neural network.

Specifically, the loss function L of the current evaluation neural network is

L = (1/m) Σ_i ω_i δ_i²

where ω_i is the importance-sampling weight of the i-th experience data unit and δ_i is:

δ_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1} | σ^{μ'}) | σ^{Q'}) - Q(s_i, a_i | σ^Q)

Here r_i is the reward value in the i-th experience data unit; s_{i+1} is the next-moment state in the i-th experience data unit; σ^{μ'} denotes the parameters of the target decision neural network, and μ'(s_{i+1} | σ^{μ'}) is the actuator action signal output by the target decision neural network for s_{i+1}; σ^{Q'} denotes the parameters of the target evaluation neural network, and Q'(s_{i+1}, μ'(s_{i+1} | σ^{μ'}) | σ^{Q'}) is the evaluation value obtained by feeding μ'(s_{i+1} | σ^{μ'}) and s_{i+1} into the target evaluation neural network; s_i is the current-moment state in the i-th experience data unit, a_i is the action signal of the i-th experience data unit in the current-moment state, σ^Q denotes the parameters of the current evaluation neural network, and Q(s_i, a_i | σ^Q) is the evaluation value of the current evaluation neural network; γ is the discount coefficient, γ = 0.99.

The gradient of the loss function L with respect to the current evaluation neural network parameters σ^Q is then

∇_{σ^Q} L = -(2/m) Σ_i ω_i δ_i ∇_{σ^Q} Q(s_i, a_i | σ^Q)
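A minimal sketch of this evaluation-network update, under the assumption of a PyTorch-style implementation in which the weighted squared TD error above is minimised by automatic differentiation (function and variable names are illustrative, not from the patent; states and actions are assumed to be stacked row-wise, rewards and weights as 1-D tensors of length m):

    import torch

    def critic_update(q_net, q_target, mu_target, batch, weights, optimizer, gamma=0.99):
        """One update of the current evaluation network; weights are the omega_i."""
        s, a, r, s_next = batch
        with torch.no_grad():
            a_next = mu_target(s_next)                               # mu'(s_{i+1} | sigma^mu')
            y = r + gamma * q_target(torch.cat([s_next, a_next], dim=1)).squeeze(1)
        q = q_net(torch.cat([s, a], dim=1)).squeeze(1)               # Q(s_i, a_i | sigma^Q)
        delta = y - q                                                # TD error delta_i
        loss = (weights * delta.pow(2)).mean()                       # L = (1/m) sum_i omega_i delta_i^2
        optimizer.zero_grad()
        loss.backward()                                              # gradient of L w.r.t. sigma^Q
        optimizer.step()                                             # gradient step with learning rate alpha (step 6)
        return delta.detach().abs()                                  # |delta_i|, usable to refresh priorities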
Step 6: current evaluation neural network update: according to the gradient ∇_{σ^Q}L of the current evaluation neural network, the current evaluation neural network parameters σ^Q are updated by a gradient step, σ^Q ← σ^Q - α ∇_{σ^Q}L, where α is the learning rate of the evaluation neural network, taking the value 0.001.
Step 7: calculate the gradient ∇_{σ^μ}J of the current decision neural network, where J denotes the policy objective. The specific calculation formula is

∇_{σ^μ} J = (1/m) Σ_i ∇_a Q(s_i, a | σ^Q)|_{a=μ(s_i)} ∇_{σ^μ} μ(s_i | σ^μ)

where ∇_a Q(s_i, a | σ^Q)|_{a=μ(s_i)} is the gradient of the evaluation value of the current evaluation neural network with respect to the action a; μ(s_i) is the actuator action signal output for s_i by the current decision neural network; and ∇_{σ^μ} μ(s_i | σ^μ) is the gradient of the current decision neural network action with respect to the current decision neural network parameters σ^μ.
Step 8: update the current decision neural network: according to the gradient ∇_{σ^μ}J of the current decision neural network, the current decision neural network parameters σ^μ are updated as σ^μ ← σ^μ + β ∇_{σ^μ}J, where β is the learning rate of the decision neural network, taking the value 0.001.
Step 9: update the target evaluation neural network and the target decision neural network: the target evaluation neural network parameters are updated from the updated current evaluation neural network parameters, and the target decision neural network parameters are updated from the updated current decision neural network parameters, according to

σ_{t+1}^{Q'} = τ_1 σ_{t+1}^Q + (1 - τ_1) σ_t^{Q'}
σ_{t+1}^{μ'} = τ_2 σ_{t+1}^μ + (1 - τ_2) σ_t^{μ'}

where σ_{t+1}^Q denotes the updated current evaluation neural network parameters, σ_t^{Q'} the target evaluation neural network parameters to be updated, and σ_{t+1}^{Q'} the updated target evaluation neural network parameters; σ_{t+1}^μ denotes the updated current decision neural network parameters, σ_t^{μ'} the target decision neural network parameters to be updated, and σ_{t+1}^{μ'} the updated target decision neural network parameters; τ_1 is the update rate of the target evaluation neural network, taking the value 0.001, and τ_2 is the update rate of the target decision neural network, taking the value 0.0001.
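Again for illustration (continuing the same hypothetical PyTorch-style sketches; the helper name soft_update is an assumption), the soft update of step 9 can be written as:

    import torch

    @torch.no_grad()
    def soft_update(target_net, current_net, tau):
        """sigma_target <- tau * sigma_current + (1 - tau) * sigma_target."""
        for p_t, p_c in zip(target_net.parameters(), current_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p_c)

    # after each training iteration, per step 9:
    # soft_update(q_target, q, tau=0.001)       # target evaluation network, tau_1
    # soft_update(mu_target, mu, tau=0.0001)    # target decision network, tau_2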
Step 10: judge whether the number of training iterations exceeds the set number; if so, stop training and save the parameter values of the 4 neural networks; if not, return to step 4 and again sample the specified number N of experience data units from the memory bank into the buffer.
Step 11: after the trained deep reinforcement learning neural network model is obtained, it is applied to the gliding motion of the actual underwater glider in the longitudinal plane: given the target pitch angle θ_d, the state values of the underwater glider are collected and input into the deep reinforcement learning neural network model to obtain the control quantity, thereby realizing attitude control of the underwater glider.
Advantageous effects
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. The method learns from simulation model data or manual experiment data to achieve attitude control of the underwater glider, and the learning mode is simple.
2. An accurate mathematical model of the underwater glider does not need to be obtained, and the method is also suitable for complex environments.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1: a deep reinforcement learning algorithm frame schematic diagram;
FIG. 2: a longitudinal gliding motion control chart of the underwater glider;
FIG. 3: SumTree structural diagram;
FIG. 4: schematic diagram of the episode reward value during the attitude control training process of the underwater glider;
FIG. 5: change process of the current pitch angle under the action of the attitude adjustment unit in the application stage, with the desired gliding pitch angle set to -25°.
Detailed Description
The following describes an example of the present invention in detail. In this example, pitch angle control in the vertical plane of an actual underwater glider is taken as the example. The basic characteristic parameters of the underwater glider are first determined and a mathematical model of the underwater glider gliding in the vertical plane is obtained, in which the input control quantity is the speed command a_t of the movable mass block along the x axis and the state of the glider is s: {v_1, v_3, ω_2, θ}, where v_1 and v_3 are the velocities of the underwater glider along the x and z axes (the x axis is the forward axis of the body coordinate system and the z axis is the coordinate axis perpendicular to the body plane), ω_2 is the pitch angular velocity and θ is the pitch angle. The control block diagram of the longitudinal gliding motion with the slider speed as input is shown in FIG. 2.
The principle of the example method is as follows: the pitch angle of the underwater glider in the vertical plane is controlled using a deep reinforcement learning method based on priority sampling. The specific steps are:
Step 1: 4 BP neural networks are established, namely a current decision neural network, a current evaluation neural network, a target decision neural network and a target evaluation neural network. The current decision neural network and the target decision neural network are called decision neural networks, and the current evaluation neural network and the target evaluation neural network are called evaluation neural networks. A decision neural network has 4 input quantities, namely the state values v_1, v_3, ω_2 and θ of the underwater glider at the current moment, and 1 output quantity, namely the action value a. An evaluation neural network has 5 input quantities, v_1, v_3, ω_2, θ and a, and 1 output quantity, namely the evaluation value. In this embodiment, each of the 4 BP neural networks has 2 hidden layers; the first hidden layer has 400 nodes, the second hidden layer has 300 nodes, and each network has one output node.
The mapping process of a BP neural network is as follows.

Output of the layer-1 hidden layer (input layer information passed through the hidden-layer activation function):

z_{1k} = f_1( Σ_i v_{ki} x_i + b_k ),  k = 1, …, h_1

where x_i is the i-th input quantity, v_{ki} is the weight from the input layer to the layer-1 hidden layer, b_k is the bias threshold, z_{1k} is the k-th node of the first hidden layer, h_1 is taken as 400, and f_1(·) is the hidden-layer activation function, chosen as the ReLU function.

Output of the layer-2 hidden layer:

z_{2j} = f_2( Σ_k w_{jk} z_{1k} + b_j ),  j = 1, …, h_2

where w_{jk} is the weight from the layer-1 hidden layer to the layer-2 hidden layer, b_j is the bias threshold, z_{2j} is the j-th output of the layer-2 hidden layer, h_2 is taken as 300, and f_2(·) is the activation function of that layer, chosen as the ReLU function.

Network output:

y = f_3( Σ_j w_j z_{2j} + b_l )

where y is the output (the outputs are scalar values), w_j is the weight between the layer-2 hidden layer and the output layer, b_l is the bias threshold, and f_3(·) is the activation function of the output layer; the output-layer activation function of the decision neural networks is the Tanh function, and that of the evaluation neural networks is the ReLU function.
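For illustration only, the three-layer mapping above is written out numerically in the following sketch (hypothetical Python/NumPy code, not part of the patent; the randomly initialised weight matrices V, W, w and the bias vectors are placeholders), showing how a decision neural network maps the 4-dimensional state to a single Tanh-bounded action:

    import numpy as np

    rng = np.random.default_rng(0)
    h1, h2 = 400, 300                                           # hidden layer sizes from this embodiment
    V, b1 = rng.standard_normal((h1, 4)) * 0.1, np.zeros(h1)    # input layer -> hidden layer 1
    W, b2 = rng.standard_normal((h2, h1)) * 0.1, np.zeros(h2)   # hidden layer 1 -> hidden layer 2
    w, b3 = rng.standard_normal(h2) * 0.1, 0.0                  # hidden layer 2 -> output

    relu = lambda x: np.maximum(x, 0.0)

    def decision_forward(state):
        """state = [v1, v3, w2, theta]; returns the bounded action a."""
        z1 = relu(V @ state + b1)           # z_{1k} = f1(sum_i v_{ki} x_i + b_k)
        z2 = relu(W @ z1 + b2)              # z_{2j} = f2(sum_k w_{jk} z_{1k} + b_j)
        return np.tanh(w @ z2 + b3)         # y = f3(sum_j w_j z_{2j} + b_l), Tanh for a decision network

    a = decision_forward(np.array([0.3, -0.05, 0.0, 0.1]))      # example state value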
The parameters of the 4 neural networks are initialized; the memory bank size is initialized to n = 10000 and the data buffer size to N = 64.
Step 2: obtain the state value s_t: {v_1, v_3, ω_2, θ} of the underwater glider at the current moment, input it into the current decision neural network to calculate the output action a_t of the attitude controller at the current moment, and apply the output action a_t to a simulator constructed from the mathematical model of the underwater glider gliding in the longitudinal plane, obtaining the state value s_{t+1} of the underwater glider at the next moment. According to the state s_t at the current moment, the action a_t at the current moment, the target pitch angle θ_d and the state s_{t+1} at the next moment, calculate the reward value r_t of the current moment. In this example, r_t takes the value:

r_t = r_1 + r_2 + r_3

where r_1 = -λ_1(d_t - d_{t-1}), r_1 representing the return obtained as the current pitch angle moves away from or towards the desired angle; r_2 = -λ_2(w_t - w_{t-1}), r_2 representing the return obtained from the change in angular velocity; and r_3 takes a value according to the current pitch angle: if |d_t - d_{t-1}| < 0.1°, r_3 is a positive "reward", and if θ < -90° or θ > 90°, r_3 is a negative number, representing a penalty. d_t is the difference between the current pitch angle θ and the target pitch angle θ_d at time t, w_t is the magnitude of the pitch angular velocity at time t, and λ_1 and λ_2 are set coefficients.
Step 3: store the tuple (s_t, a_t, r_t, s_{t+1}) obtained in step 2 as a group of experience data units in the memory bank, and increase t by 1; compare t with the set memory bank size n: if t < n, return to step 2 with the updated s_t; once the number of experience data units stored in the memory bank reaches n, proceed to step 4.
Step 4: sample a specified number N of experience data units from the memory bank and store them in the buffer; the N experience data units in the buffer are priority-ordered through a priority-based experience replay mechanism and stored in a SumTree data structure; the value of the head node in the SumTree data structure is the sum of the priorities of the experience data units represented by all nodes, denoted p_total.

The priority-based experience replay is specifically as follows. Define P(i) as the sampling probability of the i-th experience, calculated as

P(i) = p_i^η / Σ_k p_k^η

where p_i denotes the priority of the i-th experience and η sets the degree to which priority is used; when η = 0 the sampling reduces to uniform experience sampling. The priority is p_i = |δ_i| + ε, where δ_i is the TD error and ε is a small positive constant. So that the complexity of the algorithm does not depend on the size of the experience pool, the resulting priorities are arranged in a SumTree data structure. A SumTree is a special binary tree structure in which all leaf nodes store the priorities corresponding to experiences and the value of a parent node is the sum of its child nodes, so that the value of the head node is the sum of all priorities, denoted p_total. The SumTree principle is shown in FIG. 3: the leaf nodes in the SumTree store the priority of each experience, the sequence numbers of the leaf nodes correspond one to one to the experience sequence numbers in the memory bank, and when an experience is selected by priority, the corresponding experience is taken out of the buffer according to the sequence number of the selected priority.
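As an illustration of this data structure and of the stratified sampling over [0, p_total] used in step 5 below, the following sketch (hypothetical Python code; the class SumTree, the function sample_batch and the importance-sampling exponent are assumptions following common prioritized-replay practice, not text from the patent) stores the leaf priorities after the internal nodes of an array and propagates every priority change up to the head node:

    import numpy as np

    class SumTree:
        def __init__(self, capacity):
            self.capacity = capacity                    # number of leaves (buffer size N)
            self.tree = np.zeros(2 * capacity - 1)      # internal nodes followed by leaves
            self.data = [None] * capacity               # experience units aligned with the leaves
            self.write = 0

        def add(self, priority, experience):
            leaf = self.write + self.capacity - 1
            self.data[self.write] = experience
            self.update(leaf, priority)
            self.write = (self.write + 1) % self.capacity

        def update(self, leaf, priority):
            change = priority - self.tree[leaf]
            self.tree[leaf] = priority
            while leaf != 0:                            # propagate the change up to the head node
                leaf = (leaf - 1) // 2
                self.tree[leaf] += change

        def total(self):
            return self.tree[0]                         # p_total: value of the head node

        def get(self, value):
            """Walk from the head node to the leaf whose cumulative priority covers value."""
            idx = 0
            while 2 * idx + 1 < len(self.tree):
                left = 2 * idx + 1
                if value <= self.tree[left]:
                    idx = left
                else:
                    value -= self.tree[left]
                    idx = left + 1
            leaf_no = idx - (self.capacity - 1)
            return leaf_no, self.tree[idx], self.data[leaf_no]

    def sample_batch(tree, m, is_exponent=0.4):
        """Divide [0, p_total] into m equal segments and draw one experience from each."""
        batch, leaf_ids, weights = [], [], []
        segment = tree.total() / m
        for i in range(m):
            value = np.random.uniform(segment * i, segment * (i + 1))
            leaf, priority, experience = tree.get(value)
            prob = priority / tree.total()                            # sampling probability P(i)
            weights.append((tree.capacity * prob) ** (-is_exponent))  # importance-sampling weight omega_i
            leaf_ids.append(leaf)
            batch.append(experience)
        weights = np.array(weights) / max(weights)                    # normalise the weights
        return batch, leaf_ids, weights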
Step 5: from the N experience data units stored in the SumTree data structure obtained in step 4, sample m experience data units. In this embodiment, in order to ensure that all kinds of experience have a chance to be selected, the m experience data units are sampled through the following process: divide [0, p_total] equally into m intervals and sample randomly and uniformly within each interval; the experience data units corresponding to the sampled priorities are then taken out, giving m experience data units.
The m sampled experience data units are processed one by one according to the following process to obtain the gradient of the loss function of the current evaluation neural network.

For a given experience data unit (s_t, a_t, r_t, s_{t+1}): input the state s_t and the action signal a_t into the current evaluation neural network to obtain the evaluation value Q of the current evaluation neural network; input the next-moment state s_{t+1} into the target decision neural network to obtain the actuator action signal μ' output by the target decision neural network; and input the next-moment state s_{t+1} together with the action value μ' output by the target decision neural network into the target evaluation neural network to obtain the evaluation value Q' of the target evaluation neural network;
The gradient of the current evaluation neural network is calculated from the evaluation value Q of the current evaluation neural network, the evaluation value Q' of the target evaluation neural network and the loss function L of the evaluation neural network.

The loss function L of the current evaluation neural network is

L = (1/m) Σ_i ω_i δ_i²

where ω_i is the importance-sampling weight of the i-th experience data unit and δ_i is:

δ_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1} | σ^{μ'}) | σ^{Q'}) - Q(s_i, a_i | σ^Q)

Here r_i is the reward value in the i-th experience data unit; s_{i+1} is the next-moment state in the i-th experience data unit; σ^{μ'} denotes the parameters of the target decision neural network, and μ'(s_{i+1} | σ^{μ'}) is the actuator action signal output by the target decision neural network for s_{i+1}; σ^{Q'} denotes the parameters of the target evaluation neural network, and Q'(s_{i+1}, μ'(s_{i+1} | σ^{μ'}) | σ^{Q'}) is the evaluation value obtained by feeding μ'(s_{i+1} | σ^{μ'}) and s_{i+1} into the target evaluation neural network; s_i is the current-moment state in the i-th experience data unit, a_i is the action signal of the i-th experience data unit in the current-moment state, σ^Q denotes the parameters of the current evaluation neural network, and Q(s_i, a_i | σ^Q) is the evaluation value of the current evaluation neural network; γ is the discount coefficient, γ = 0.99.

The gradient of the loss function L with respect to the current evaluation neural network parameters σ^Q is then

∇_{σ^Q} L = -(2/m) Σ_i ω_i δ_i ∇_{σ^Q} Q(s_i, a_i | σ^Q)
Step 6: current evaluation neural network update: according to the gradient ∇_{σ^Q}L of the current evaluation neural network, the current evaluation neural network parameters σ^Q are updated by a gradient step, σ^Q ← σ^Q - α ∇_{σ^Q}L, where α is the learning rate of the evaluation neural network, taking the value 0.001.
Step 7: calculate the gradient ∇_{σ^μ}J of the current decision neural network, where J denotes the policy objective. The calculation formula is

∇_{σ^μ} J = (1/m) Σ_i ∇_a Q(s_i, a | σ^Q)|_{a=μ(s_i)} ∇_{σ^μ} μ(s_i | σ^μ)

where ∇_a Q(s_i, a | σ^Q)|_{a=μ(s_i)} is the gradient of the evaluation value of the current evaluation neural network with respect to the action a; μ(s_i) is the actuator action signal output for s_i by the current decision neural network; and ∇_{σ^μ} μ(s_i | σ^μ) is the gradient of the current decision neural network action with respect to the current decision neural network parameters σ^μ.
Step 8: update the current decision neural network: according to the gradient ∇_{σ^μ}J of the current decision neural network, the current decision neural network parameters σ^μ are updated as σ^μ ← σ^μ + β ∇_{σ^μ}J, where β is the learning rate of the decision neural network, taking the value 0.001.
Step 9: update the target evaluation neural network and the target decision neural network: the target evaluation neural network parameters are updated from the updated current evaluation neural network parameters, and the target decision neural network parameters are updated from the updated current decision neural network parameters, according to

σ_{t+1}^{Q'} = τ_1 σ_{t+1}^Q + (1 - τ_1) σ_t^{Q'}
σ_{t+1}^{μ'} = τ_2 σ_{t+1}^μ + (1 - τ_2) σ_t^{μ'}

where σ_{t+1}^Q denotes the updated current evaluation neural network parameters, σ_t^{Q'} the target evaluation neural network parameters to be updated, and σ_{t+1}^{Q'} the updated target evaluation neural network parameters; σ_{t+1}^μ denotes the updated current decision neural network parameters, σ_t^{μ'} the target decision neural network parameters to be updated, and σ_{t+1}^{μ'} the updated target decision neural network parameters; τ_1 is the update rate of the target evaluation neural network, taking the value 0.001, and τ_2 is the update rate of the target decision neural network, taking the value 0.0001.
Step 10: judge whether the number of training iterations exceeds the set number; if so, stop training and save the parameter values of the 4 neural networks; if not, return to step 4 and again sample the specified number N of experience data units from the memory bank into the buffer.
Step 11: after the trained deep reinforcement learning neural network model is obtained, it is applied to the gliding motion of the actual underwater glider in the longitudinal plane: given the target pitch angle θ_d, the state values {v_1, v_3, ω_2, θ} of the underwater glider are collected and input into the deep reinforcement learning neural network model to obtain the control quantity (the speed command a_t of the movable mass block along the x axis), thereby realizing attitude control of the underwater glider.
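For illustration only, the application stage of step 11 could be organised as in the following sketch (hypothetical Python code; read_glider_state, send_slider_speed_command and the file path are placeholder names, not interfaces defined in the patent), in which only the trained decision neural network is used online:

    import torch

    actor = torch.load("trained_decision_net.pt")   # placeholder path for the saved decision network
    actor.eval()
    theta_d = -25.0                                 # target pitch angle, in the units used during training

    def control_step(read_glider_state, send_slider_speed_command):
        """One control cycle: read {v1, v3, w2, theta}, output the slider speed command a_t."""
        v1, v3, w2, theta = read_glider_state()
        state = torch.tensor([v1, v3, w2, theta], dtype=torch.float32)
        with torch.no_grad():
            a_t = actor(state).item()               # control quantity: speed command of the movable mass
        send_slider_speed_command(a_t)
        return theta - theta_d                      # pitch-angle error, for monitoring only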
Fig. 4 shows the episode reward value of the training process of the attitude controller in this embodiment. The training effect is mainly measured by the episode reward value: after a certain period of training, the larger the average reward value, the better the training effect. As shown in fig. 4, the episode reward value rises continuously and, after about 900 training episodes, stabilises at around 850, indicating that the controller has learned a good strategy.
Fig. 5 shows an example of the application stage, i.e. the change of the pitch angle during gliding. Gliding starts with an initial pitch angle of 0° and a desired pitch angle of -25°. The pitch angle reaches the desired value after 16 s, and the steady-state error during steady gliding is 0.06°. The error between the pitch angle and the desired angle during steady gliding is thus small, and the glider can be considered able to glide continuously along the desired track.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention.
Claims (8)
1. An underwater glider attitude control method based on deep reinforcement learning is characterized in that: the method comprises the following steps:
step 1: establishing a current decision neural network, a current evaluation neural network, a target decision neural network and a target evaluation neural network; the decision neural network adopts the state value of the underwater glider as an input quantity and adopts the control quantity a of the underwater glider as an output action; the evaluation neural network takes the state value and the control quantity of the underwater glider as input and takes the evaluation value as output; initializing parameters of 4 neural networks, and initializing a memory bank and a data buffer area;
step 2: obtaining the state value s_t of the underwater glider at the current moment, inputting the state value into the current decision neural network to calculate the output action a_t of the attitude controller at the current moment, and applying the output action a_t to an underwater glider simulator to obtain the state value s_{t+1} of the underwater glider at the next moment; according to the state s_t at the current moment, the action a_t at the current moment, the target pitch angle θ_d and the state s_{t+1} at the next moment, calculating the reward value r_t of the current moment;

step 3: storing the tuple (s_t, a_t, r_t, s_{t+1}) obtained in step 2 as a group of experience data units in the memory bank, and increasing t by 1; judging the relation between t and the set memory bank size n: if t < n, returning to step 2 with the updated s_t; after the number of experience data units stored in the memory bank reaches n, entering step 4;

step 4: sampling a specified number N of experience data units from the memory bank and storing them in the buffer;

step 5: sampling m experience data units from the N experience data units in the buffer, and processing the m sampled experience data units one by one according to the following process:

for a given experience data unit (s_t, a_t, r_t, s_{t+1}): inputting the state s_t and the action signal a_t into the current evaluation neural network to obtain the evaluation value Q of the current evaluation neural network; inputting the next-moment state s_{t+1} into the target decision neural network to obtain the actuator action signal μ' output by the target decision neural network; inputting the next-moment state s_{t+1} and the action value μ' output by the target decision neural network into the target evaluation neural network to obtain the evaluation value Q' of the target evaluation neural network;

calculating the gradient ∇_{σ^Q}L of the current evaluation neural network from the evaluation value Q of the current evaluation neural network, the evaluation value Q' of the target evaluation neural network and the loss function L of the evaluation neural network;
Step 6: current evaluation neural network update: gradient of neural network according to current evaluationFor the currently evaluated neural network parameter sigmaQSelf-increasingUpdating, wherein alpha is the learning rate of the evaluation neural network;
And 8: updating the current decision neural network: gradient of neural network according to current decisionFor the current decision neural network parameter sigmaμSelf-increasingUpdating, wherein beta is the learning rate of the decision neural network;
step 9: updating the target evaluation neural network and the target decision neural network: updating the target evaluation neural network parameters according to the updated current evaluation neural network parameters, and updating the target decision neural network parameters according to the updated current decision neural network parameters;

step 10: judging whether the number of training iterations exceeds the set number; if so, stopping training and saving the parameter values of the 4 neural networks; if not, returning to step 4 and again sampling the specified number N of experience data units from the memory bank into the buffer;
step 11: after the trained deep reinforcement learning neural network model is obtained, the model is applied to the actual underwater glider in the gliding motion of the longitudinal plane, a target pitch angle is given, the state value of the underwater glider is collected and input into the deep reinforcement learning neural network model to obtain a control quantity, and the attitude control of the underwater glider is realized.
2. The underwater glider attitude control method based on deep reinforcement learning of claim 1, characterized in that: the reward value r_t takes the value

r_t = r_1 + r_2 + r_3

where r_1 = -λ_1(d_t - d_{t-1}), r_1 representing the return obtained when the current pitch angle moves away from or towards the target pitch angle; r_2 = -λ_2(w_t - w_{t-1}), r_2 representing the return obtained from the change in angular velocity; and r_3 takes a value according to the current pitch angle: if |d_t - d_{t-1}| < 0.1°, r_3 is a positive reward, and if the pitch angle θ at the current moment t satisfies θ < -90° or θ > 90°, r_3 is a negative number, representing a penalty; d_t is the difference between the pitch angle θ and the target pitch angle θ_d at time t, w_t is the magnitude of the pitch angular velocity at time t, and λ_1 and λ_2 are set coefficients.
3. The underwater glider attitude control method based on deep reinforcement learning of claim 1, characterized in that: in step 4, the N experience data units in the buffer are priority-ordered through a priority-based experience replay mechanism and stored in a SumTree data structure; the value of the head node in the SumTree data structure is the sum of the priorities of the experience data units represented by all nodes, denoted p_total.
4. The underwater glider attitude control method based on deep reinforcement learning of claim 3, characterized in that: in step 5, the m experience data units are sampled by the following process: dividing [0, p_total] equally into m intervals, sampling randomly and uniformly within each interval, and taking the experience data units corresponding to the sampled priorities, thereby obtaining m experience data units.
5. The underwater glider attitude control method based on deep reinforcement learning of claim 1, characterized in that: in step 5, the process of obtaining the gradient ∇_{σ^Q}L of the current evaluation neural network is:

the loss function L of the current evaluation neural network is

L = (1/m) Σ_i ω_i δ_i²

where ω_i is the importance-sampling weight of the i-th experience data unit and δ_i is:

δ_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1} | σ^{μ'}) | σ^{Q'}) - Q(s_i, a_i | σ^Q)

r_i is the reward value in the i-th experience data unit; s_{i+1} is the next-moment state in the i-th experience data unit; σ^{μ'} denotes the parameters of the target decision neural network, and μ'(s_{i+1} | σ^{μ'}) is the actuator action signal output by the target decision neural network for s_{i+1}; σ^{Q'} denotes the parameters of the target evaluation neural network, and Q'(s_{i+1}, μ'(s_{i+1} | σ^{μ'}) | σ^{Q'}) is the evaluation value obtained by feeding μ'(s_{i+1} | σ^{μ'}) and s_{i+1} into the target evaluation neural network; s_i is the current-moment state in the i-th experience data unit, a_i is the action signal of the i-th experience data unit in the current-moment state, σ^Q denotes the parameters of the current evaluation neural network, and Q(s_i, a_i | σ^Q) is the evaluation value of the current evaluation neural network; γ is the discount coefficient;

the gradient of the loss function L of the current evaluation neural network is then obtained as

∇_{σ^Q} L = -(2/m) Σ_i ω_i δ_i ∇_{σ^Q} Q(s_i, a_i | σ^Q)
6. The underwater glider attitude control method based on deep reinforcement learning of claim 1, characterized in that: in step 7, the gradient ∇_{σ^μ}J of the current decision neural network is calculated as

∇_{σ^μ} J = (1/m) Σ_i ∇_a Q(s_i, a | σ^Q)|_{a=μ(s_i)} ∇_{σ^μ} μ(s_i | σ^μ)

where ∇_a Q(s_i, a | σ^Q)|_{a=μ(s_i)} is the gradient of the evaluation value of the current evaluation neural network with respect to the action a; μ(s_i) is the actuator action signal output for s_i by the current decision neural network; and ∇_{σ^μ} μ(s_i | σ^μ) is the gradient of the current decision neural network action with respect to the current decision neural network parameters σ^μ.
7. The underwater glider attitude control method based on deep reinforcement learning of claim 1, characterized in that: in step 9, the update formulas of the target evaluation neural network and the target decision neural network are:

σ_{t+1}^{Q'} = τ_1 σ_{t+1}^Q + (1 - τ_1) σ_t^{Q'}
σ_{t+1}^{μ'} = τ_2 σ_{t+1}^μ + (1 - τ_2) σ_t^{μ'}

where σ_{t+1}^Q denotes the updated current evaluation neural network parameters, σ_t^{Q'} the target evaluation neural network parameters to be updated, and σ_{t+1}^{Q'} the updated target evaluation neural network parameters; σ_{t+1}^μ denotes the updated current decision neural network parameters, σ_t^{μ'} the target decision neural network parameters to be updated, and σ_{t+1}^{μ'} the updated target decision neural network parameters; τ_1 is the update rate of the target evaluation neural network and τ_2 is the update rate of the target decision neural network.
8. The underwater glider attitude control method based on deep reinforcement learning of claim 1, characterized in that: the decision neural network adopts the state values v_1, v_3, ω_2, θ of the underwater glider as input, where v_1, v_3, ω_2 and θ are respectively the velocities of the underwater glider along the x and z axes, the pitch angular velocity and the pitch angle, the x axis being the forward axis of the body coordinate system and the z axis the coordinate axis perpendicular to the body plane; the control quantity a of the underwater glider is the speed command of the movable mass block along the x axis.