CN114911157A - Robot navigation control method and system based on partial observable reinforcement learning - Google Patents
Robot navigation control method and system based on partial observable reinforcement learning
- Publication number: CN114911157A
- Application number: CN202210366719.1A
- Authority: CN (China)
- Prior art keywords: network, robot, state, action, observation
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
Abstract
The invention discloses a robot navigation control method and system based on partially observable reinforcement learning, mainly applied to navigation tasks in which a robot operates in an uncertain environment with an unknown model. To complete such navigation tasks, the invention adopts a reinforcement learning algorithm for partially observable environments. The system comprises a filtering unit, a planning unit, a playback pool and a learning unit. In the invention, the belief state is represented by state particles to reduce the computational complexity of belief-state updating; simulation planning based on a learned model improves sample utilization; a resampling method prevents the particle-degeneracy problem; and reward shaping based on the negative information entropy of the belief state improves the training efficiency and stability of the algorithm in sparse-reward navigation tasks. The method achieves efficient and stable policy learning in partially observable environments whose model is unknown, and the learned policy is used in actual robot navigation tasks.
Description
Technical Field
The invention relates to a robot navigation control method and system based on reinforcement learning in a partial observable environment, and belongs to the technical field of robot control.
Background
With the development of the technology, robots have been widely applied to various production and living fields, and various application scenes therewith also provide more new challenges for the robot technology. The robot navigation is one of the most important tasks in the field of robot control, and a large number of robot navigation control requirements exist in practical application scenes, such as sweeping robots, warehousing and transportation robots, search and rescue robots and the like. Most of the traditional robot navigation algorithms need to obtain accurate modeling of the environment, which greatly limits the application range of the algorithms. While reinforcement learning can learn control strategies from data generated by interaction with the environment, it is increasingly applied to robot navigation tasks.
The environment in which the robot is located is usually very complex, and due to the blocking of obstacles, the detection range of the sensor and other factors, the robot can only obtain partial information of the environment through the sensor. The difficulty of the decision task under the incomplete information is greatly increased compared with that under the complete information. Meanwhile, the performance of the sensor of the robot is limited, the information obtained by the sensor is noisy, and the uncertainty caused by the noise can interfere the decision of the robot. Therefore, how to control the robot under uncertain environment is an urgent problem to be solved in the field of robot navigation.
The existing partially observable reinforcement learning algorithm cannot effectively encourage the robot to take actions of obtaining environmental information, and an optimal strategy is difficult to obtain in tasks where the environmental information is crucial. In addition, when the robot executes the navigation task, the reward can be obtained only when the robot reaches a target point, and therefore the reward is sparse. The training speed of the existing partially observable reinforcement learning algorithm is low in the environment with sparse rewards, and the performance of the algorithm is unstable.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the common problems of the existing robot navigation technology under the uncertain environment, the invention provides a robot navigation control method and system based on partial observable reinforcement learning. The robot navigation task is modeled as a Partially Observable Markov Decision Process (POMDP), and the problem is solved by using a reinforcement learning algorithm under a Partially Observable environment. The method effectively solves the problem of sparse rewards when the reinforcement learning is utilized to process the robot navigation task, and implicitly encourages the robot to actively take the action of obtaining the environmental information under partial observable environments, thereby obtaining a better strategy and improving the efficiency and the stability of the navigation control method.
The technical scheme is as follows: a robot navigation control method based on partial observable reinforcement learning specifically comprises the following steps:
S1, initializing network parameters, including: the parameters ψ of the transition model D_ψ, the parameters θ of the observation model Z_θ, the parameters ρ of the policy network π_ρ, and the parameters ω of the double-Q-value network Q_ω. Set the training time-step counter t to 0 and go to S2;
S2, generate K weighted belief-state particles according to the prior over the initial state, with all initial weights set to 1; the robot obtains an initial observation o_1 through the sensor; go to S3;
s3, if the training time step counter t is less than the maximum training step number L, t ← t +1, and then S4 is entered; otherwise, go to S27;
S8, combine the particles with the average belief state, copy the result N times, and assign each copy a weight, obtaining N new weighted particle sets; the superscript (n) denotes the nth copy; go to S9;
s9, setting a planning time step counter i as t-1, and entering S10;
s10, if the planning time step counter is less than the maximum planning step number H, i ← i +1, and the process goes to S11; otherwise, go to S19;
S12, for each particle in each copy, obtain the next-time state and reward from the transition model D_ψ; go to S13;
S14, estimate the belief-state information entropy for each copy from the particle estimate of the current belief state; go to S15;
S15, update the weight of each copy's particles, where A^(n) denotes the advantage function; go to S16;
s16, if resampling is needed, entering S17; otherwise, go to S18;
s17, resampling the copied particles, and entering S18;
s18, entering S10;
S19, sample n uniformly from 1 to N, and output the first action a_t of the nth copy's planned trajectory; go to S20;
S20, the robot takes action a_t, interacting with the training environment to obtain the next-time state s_{t+1}, the next-time observation o_{t+1}, and the reward r_t; go to S21;
s21, if resampling is needed, entering S22; otherwise, go to S23;
s22, resampling the belief state particles, and entering S23;
s25, the learning unit samples the training data from the playback pool, updates the network parameters, and enters S26;
s26, entering S3;
and S27, finish training and output the trained networks for robot navigation control. The specific steps of the deployment stage of the robot navigation control are obtained by removing the playback pool and the learning unit and skipping steps S24 and S25; in that case the environment in S20 only needs to provide observation and reward information, not the real state information.
In the above technical solution, the environment (training environment) where the robot is located is modeled as a POMDP, and the POMDP may be represented by the following six-tuple:
(1) state space S, where s_t ∈ S denotes the state of the robot at time t;
(2) action space A, where a_t ∈ A denotes the action taken by the robot at time t;
(3) transition probability function T: S × A × S → [0, 1], where T(s_t, a_t, s_{t+1}) denotes the probability that the robot in state s_t, taking action a_t, transitions to s_{t+1};
(4) reward function R: S × A → ℝ, where R(s_t, a_t) denotes the immediate reward available to the robot in state s_t taking action a_t;
(5) observation space O, where o_t ∈ O denotes the observation obtained by the robot at time t;
(6) observation probability function Z: S × A × O → [0, 1], where Z(s_t, a_{t-1}, o_t) denotes the probability that the robot, having taken action a_{t-1} and transitioned to s_t, obtains observation o_t.
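The six-tuple above can be sketched as a small data structure. This is an illustrative sketch only — the sets and probability functions below are toy placeholders, not the patent's navigation environment.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# A minimal container mirroring the POMDP six-tuple (S, A, T, R, O, Z)
# described above. All concrete members are illustrative placeholders.
@dataclass
class POMDP:
    states: Sequence            # state space S
    actions: Sequence           # action space A
    transition: Callable        # T(s, a, s') -> probability in [0, 1]
    reward: Callable            # R(s, a) -> immediate reward
    observations: Sequence      # observation space O
    observe: Callable           # Z(s', a, o) -> probability in [0, 1]

# A toy two-state example: action 0 stays put, action 1 flips the state,
# and a noisy sensor reports the true state with probability 0.9.
toy = POMDP(
    states=[0, 1],
    actions=[0, 1],
    transition=lambda s, a, s2: 1.0 if s2 == (s ^ a) else 0.0,
    reward=lambda s, a: 1.0 if s == 1 else 0.0,
    observations=[0, 1],
    observe=lambda s2, a, o: 0.9 if o == s2 else 0.1,
)
```

Because only Z links states to observations, an agent in this toy model must reason over a distribution of possible states — exactly the belief-state machinery the method builds.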
The goal of the POMDP is to obtain a policy π: H → A, based on the historical sequence of actions and observations, that maximizes the expected cumulative reward. The cumulative reward G_t is defined as:
G_t = Σ_{k=0}^{∞} γ^k · r_{t+k}
where γ ∈ (0, 1] is a discount factor used to weigh immediate rewards against delayed rewards, and r_t denotes the reward the robot receives at time t.
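The discounted return above can be computed for a finite reward sequence by accumulating backwards. A minimal sketch, with an illustrative sparse-reward sequence (only the final, goal-reaching step pays +100, as in the navigation task described later):

```python
# Discounted cumulative reward G_t = sum_k gamma^k * r_{t+k}, for a finite
# (truncated) reward sequence.
def discounted_return(rewards, gamma=0.95):
    g = 0.0
    for r in reversed(rewards):  # backward recursion: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

# Sparse reward: three steps of 0, then +100 at the goal.
g0 = discounted_return([0.0, 0.0, 0.0, 100.0], gamma=0.95)
```

With γ = 0.95 this gives 100 · 0.95³ = 85.7375, showing how discounting shrinks a reward that arrives late — the core difficulty of sparse-reward navigation.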
In the above technical solution, the belief state b_t(s) = p(s_t = s | h_t) represents the probability distribution of s_t given the known history h_t = {b_0, a_0, o_1, …, a_{t-1}, o_t}, where b_0 denotes the initial state probability distribution.
In the foregoing technical solution, in S1, the network includes:
transition model D_ψ, where ψ is a parameter of the transition model;
observation model Z_θ, where θ is a parameter of the observation model;
policy network π_ρ, where ρ is a parameter of the policy network;
double-Q-value network Q_ω, where ω is a parameter of the double-Q-value network.
The transition model D_ψ is used by the filtering unit to update the state particles and by the planning unit for simulation; its input is a state and an action, its output is the next-time state and reward, and its network structure is a fully connected network. The observation model Z_θ is used to update the particle weights in the filtering unit; its input is a state and an observation, its output is the probability of the observation, and its network structure is a fully connected network. The policy network π_ρ provides the policy for the robot's simulation in the planning module; its input is the belief-state particles and the average belief state, and its output is an action together with the logarithm of the action's probability. Structurally, a fully connected network outputs the mean μ and variance σ² of the action, the action is sampled from the Gaussian distribution N(μ, σ²), and the log-probability of the action is computed from the Gaussian density. The double-Q-value network Q_ω is used to update the particle weights in the planning module; its input is a state and an action, and its output is two Q values. The double-Q-value network consists of two fully connected networks Q_1 and Q_2, and each Q_i (i = 1, 2) maintains a target Q-value network TQ_i with the same structure for network parameter updates.
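The Gaussian action head of the policy network can be sketched without any deep-learning framework: given a mean μ and variance σ² (which the real network would output), sample an action and evaluate its log-probability in closed form. The concrete numbers are illustrative only.

```python
import math
import random

# Minimal sketch of the Gaussian policy head described above: sample an
# action a ~ N(mu, sigma^2) and return it with its log-probability.
def sample_action(mu, sigma2, rng):
    sigma = math.sqrt(sigma2)
    a = rng.gauss(mu, sigma)
    # log N(a; mu, sigma^2) = -1/2 * log(2*pi*sigma^2) - (a - mu)^2 / (2*sigma^2)
    log_prob = -0.5 * math.log(2 * math.pi * sigma2) - (a - mu) ** 2 / (2 * sigma2)
    return a, log_prob

a, lp = sample_action(mu=0.0, sigma2=0.25, rng=random.Random(0))
```

The log-probability term is what the entropy-regularized Q and policy updates later in the document consume.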
In the above technical solution, in S2, representing the belief state by weighted particles is a common approximation for handling the excessive computational complexity of exact belief-state updates; the particle-update procedure is known as particle filtering, or the sequential Monte Carlo method.
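A single bootstrap particle-filter step can be sketched as follows: propagate each particle through a transition model, then reweight it by the observation likelihood. The 1-D Gaussian transition and observation models here are illustrative stand-ins for the learned networks D_ψ and Z_θ, not the patent's actual models.

```python
import math
import random

# One bootstrap particle-filter step: propagate, then reweight.
def pf_step(particles, weights, action, obs, rng):
    new_particles, new_weights = [], []
    for s, w in zip(particles, weights):
        s2 = s + action + rng.gauss(0.0, 0.1)       # sample next state (stand-in for D_psi)
        lik = math.exp(-0.5 * (obs - s2) ** 2)      # observation likelihood (stand-in for Z_theta)
        new_particles.append(s2)
        new_weights.append(w * lik)
    return new_particles, new_weights

rng = random.Random(0)
ps, ws = pf_step([0.0] * 100, [1.0] * 100, action=1.0, obs=1.0, rng=rng)
```

After the step, the weighted particle cloud concentrates around states consistent with both the action taken and the observation received, which is exactly the belief update b_t → b_{t+1}.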
In the above technical solution, S7-S19 constitute the planning unit, in which the robot performs simulation planning with each copy of the belief-state particles in order to select an optimal action.
In the above technical solution, in S14, the belief-state information entropy is estimated using the belief-state particles to approximate the belief-state probability distribution; kernel density estimation (KDE) with a Gaussian kernel is used to estimate the probability density of the belief state.
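The entropy estimate can be sketched in one dimension: place a Gaussian kernel on every weighted particle, evaluate the resulting density at the particle locations, and average −log p. The fixed bandwidth h here is an assumption for illustration (the embodiment later uses Silverman's rule instead).

```python
import math

# Entropy of a particle belief via Gaussian KDE, evaluated at the particles.
def belief_entropy(particles, weights, h=0.2):
    total = sum(weights)
    probs = []
    for s in particles:
        p = sum(w * math.exp(-0.5 * ((s - sj) / h) ** 2) / (h * math.sqrt(2 * math.pi))
                for sj, w in zip(particles, weights)) / total
        probs.append(p)
    # H(b) ~= -sum_k w_k/W * log p(s_k)
    return -sum(w / total * math.log(p) for w, p in zip(weights, probs))

# A tightly clustered belief (robot is nearly localized) has lower entropy
# than a spread-out one (robot is uncertain about its state).
low = belief_entropy([0.0, 0.01, -0.01, 0.02], [1, 1, 1, 1])
high = belief_entropy([0.0, 1.0, -1.0, 2.0], [1, 1, 1, 1])
```

Using the negative of this entropy as a shaping potential rewards actions that concentrate the belief, which is the information-gathering incentive described in the abstract.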
In the above technical solution, in S15, the advantage function A is computed from the temporal-difference (TD) error, where Q_ω takes the smaller of the two outputs of the double-Q network.
When the advantage function A is calculated, reward shaping based on the belief state negative information entropy is added, so that the robot is encouraged to take the action of obtaining information, and the efficiency and the stability of the algorithm are improved.
In the above technical solutions, in S17 and S22, resampling is a technique commonly used in particle filtering to prevent particle degeneracy. Specifically, N particles are drawn with replacement from the N weighted particles, with probability proportional to their weights, and the weights of the new particles are then all reset to 1.
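The multinomial resampling just described is a few lines of code — draw N particles with replacement, with probability proportional to weight, then reset all weights to 1:

```python
import random

# Multinomial resampling: weight-proportional draw with replacement,
# then uniform weights.
def resample(particles, weights, rng):
    n = len(particles)
    new = rng.choices(particles, weights=weights, k=n)
    return new, [1.0] * n

rng = random.Random(0)
ps, ws = resample(["a", "b", "c"], [0.1, 0.8, 0.1], rng)
```

Particles with negligible weight are likely to disappear and heavy particles are duplicated, which is precisely how degeneracy (all weight collapsing onto one particle) is avoided.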
In the foregoing technical solution, in S25, the updating the network parameter includes:
(1) The transition model and the observation model use the mean squared error between predicted and true values as the loss function; a specified optimization method, such as stochastic gradient descent or Adam, is used to optimize the loss function and update the network parameters.
(2) The two networks of the double-Q network are updated in the same way, based on the temporal-difference (TD) error computed against the target Q-value network, where α is a temperature coefficient controlling the weight given to the policy entropy. A specified optimization method, such as stochastic gradient descent or Adam, is used to optimize the loss function and update the network parameters. In addition, the Q-value network parameters are copied to the target Q-value network every fixed number of update steps.
(3) The policy function is updated based on a loss function in which α is the temperature coefficient and Q_ω takes the smaller of the two outputs of the double-Q network. A specified optimization method, such as stochastic gradient descent or Adam, is used to optimize the loss function and update the network parameters.
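The patent's exact loss formulas are not reproduced in this text; the update rule it describes (target Q network, temperature coefficient α on the policy entropy, smaller of two Q outputs) matches the standard entropy-regularized TD target, sketched here under that assumption with illustrative numbers:

```python
# Entropy-regularized TD target: y = r + gamma * (min(TQ1, TQ2) - alpha * log pi(a'|s'))
def td_target(r, gamma, tq1, tq2, log_pi_next, alpha=1.0):
    return r + gamma * (min(tq1, tq2) - alpha * log_pi_next)

# Squared TD error for one sample; in training this would be averaged
# over a minibatch drawn from the playback pool.
def q_loss(q, target):
    return (q - target) ** 2

y = td_target(r=0.0, gamma=0.95, tq1=2.0, tq2=1.5, log_pi_next=-0.7, alpha=1.0)
loss = q_loss(1.0, y)
```

Taking the minimum of the two target-Q outputs counters overestimation bias, and the −α·log π term keeps the policy stochastic, which helps exploration in the sparse-reward setting.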
In order to achieve the above object, the present invention provides a robot navigation control system based on partially observable reinforcement learning, comprising: the device comprises a filtering unit, a planning unit, a playback pool and a learning unit.
And the filtering unit is used for updating the belief state particles and the weights thereof and interactively obtaining state, observation and reward information with the training environment by using the action obtained from the planning unit. And in addition, the training data is also processed and stored in a playback pool.
And the planning unit is used for receiving the weighted particles provided by the filtering unit, performing simulation planning using the learned transition model and the policy network, and outputting an action to the filtering unit.
The playback pool is used for storing the processed training data and providing the learning unit with the training data required for learning, i.e., a data set of tuples that the filtering unit stores into the playback pool and the learning unit samples.
And the learning unit is used for sampling training data from the playback pool, training the networks with a given optimization method, and providing the updated network parameters to the filtering unit and the planning unit.
In the system, the training environment, namely the actual application environment or the highly-simulated virtual environment, is used for training the robot navigation control method, interacts with the filtering unit, and provides state, observation and reward information for filtering.
Based on the technical scheme, the neural networks can be trained for practical use. The playback pool and the learning unit are removed, and the specific steps of the deployment stage are obtained by skipping steps S24 and S25; in this case the environment in S20 only needs to provide observation and reward information, not the real state information.
Advantageous effects: owing to the application of the above technical scheme, compared with the prior art, the invention has the following advantages:
the invention processes the robot navigation control task by using reinforcement learning, and can learn and obtain the control strategy from data generated by interaction with the environment. The problem that the traditional control method needs environment accurate modeling is solved, and the application range of the control method is expanded.
The invention models the environment as a POMDP and can thereby describe the uncertainty in the environment. Traditional methods struggle with tasks involving occlusion, limited sensor detection range, and sensor noise, whereas the invention can effectively handle navigation control in such environments.
The invention adopts a model-based partially observable reinforcement learning algorithm, can improve the utilization rate of training samples and improve the training efficiency.
The invention adopts reward shaping based on a potential function, which effectively addresses the sparse-reward problem of real robot navigation tasks without changing the optimal policy, and improves the training efficiency and stability of the algorithm.
The negative information entropy of the belief state is used as the potential function in the reward-shaping method, which encourages the robot to take information-gathering actions, making an optimal policy easier to obtain than with traditional control methods.
Drawings
FIG. 1 is a diagram of an overall training framework for an embodiment of the present invention;
FIG. 2 is a diagram of unit interactions during a training phase in accordance with an embodiment of the present invention;
FIG. 3 is a diagram of cell interaction during a use phase according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an embodiment of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
Fig. 4 is a top view of a robot navigation environment, in which the robot is in one of 2 rooms of the same size on the left and right. The state of the robot is its absolute coordinates in the whole house. The robot can move at a limited speed in any direction. The robot carries 4 sensors facing up, down, left and right; each sensor measures the distance from the robot to the nearest wall in its direction, subject to Gaussian noise. The initial position of the robot is random; its goal is to reach the charging spot at the bottom of the left room or the top of the right room, and a reward of +100 is obtained upon reaching the target position. Since the robot only receives observations, it cannot directly judge which room it is in; it can only determine its room from the change in wall distances once it reaches a shaded region in the figure.
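A minimal sketch of the Fig. 4 sensor model makes the ambiguity concrete: four range sensors return distances to the current room's walls plus Gaussian noise. The room size and noise level are assumed values for illustration; the actual dimensions are not specified in this text.

```python
import random

ROOM_SIZE = 5.0  # assumed side length of each square room

def sense_in_room(x, y, rng, noise=0.05):
    # (x, y) are coordinates relative to the current room's lower-left corner.
    # Readings are identical for the same in-room position in either room,
    # which is why the robot cannot tell the two rooms apart from one reading.
    up, down = ROOM_SIZE - y, y
    left, right = x, ROOM_SIZE - x
    return [d + rng.gauss(0.0, noise) for d in (up, down, left, right)]

obs = sense_in_room(1.0, 2.0, random.Random(0))
```

Under this model the belief over the robot's room stays bimodal until an information-gathering action (entering a region where the geometry differs) disambiguates it — the situation the entropy-based reward shaping is designed to resolve.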
The steps of the robot in the training phase are as follows:
S1, initializing network parameters, including: the parameters ψ of the transition model D_ψ, the parameters θ of the observation model Z_θ, the parameters ρ of the policy network π_ρ, and the parameters ω of the double-Q-value network Q_ω. Set the training time-step counter t to 0 and go to S2;
S2, generate 100 weighted belief-state particles according to the initial-state prior, with all initial weights set to 1; the robot obtains an initial observation o_1 through the sensor; go to S3;
S3, if the training time-step counter t is less than the maximum number of training steps L = 10,000, then t ← t + 1 and go to S4; otherwise, go to S27;
S6, sample the M = 3 particles with the highest weights; the filtering unit inputs the weighted particles to the planning unit, as in FIG. 2 and FIG. 3; go to S7;
S8, combine the particles with the average belief state, copy the result N = 30 times, and assign each copy a weight, obtaining N new weighted particle sets; the superscript (n) denotes the nth copy; go to S9;
s9, setting a planning time step counter i as t-1, and entering S10;
S10, if the planning time-step counter is less than the maximum number of planning steps H = 10, then i ← i + 1 and go to S11; otherwise, go to S19;
S12, for each particle in each copy, obtain the next-time state and reward from the transition model D_ψ; go to S13;
S14, estimate the belief-state information entropy for each copy from the particle estimate of the current belief state; go to S15;
s16, if resampling is needed, entering S17; otherwise, go to S18;
s17, resampling the copied particles, and entering S18;
s18, entering S10;
S19, sample n uniformly from 1 to N, and output the first action a_t of the nth copy's planned trajectory; the planning unit inputs it to the filtering unit, as in FIG. 2 and FIG. 3; go to S20;
S20, the robot takes action a_t, interacting with the training environment to obtain the next-time state s_{t+1}, the next-time observation o_{t+1}, and the reward r_t; as in FIG. 2, s_{t+1}, o_{t+1} and r_t are input to the filtering unit; go to S21;
S21, if resampling is needed, entering S22; otherwise, go to S23;
s22, resampling the belief state particles, and entering S23;
s25, the learning unit in FIG. 2 samples training data from the playback pool, updates network parameters, and transmits the updated network parameters to the filtering unit and the planning unit, and then the process goes to S26;
s26, entering S3;
and S27, finish training and output the trained networks for robot navigation control. Referring to FIG. 3, the specific steps of the deployment stage are obtained by removing the playback pool and the learning unit and skipping steps S24 and S25; the environment in S20 then only needs to provide observation and reward information to the filtering module, not the real state information.
The whole training process block diagram refers to fig. 1.
In the above embodiment, in S1, the networks include the transition model D_ψ, where ψ is a parameter of the transition model; the observation model Z_θ, where θ is a parameter of the observation model; the policy network π_ρ, where ρ is a parameter of the policy network; and the double-Q-value network Q_ω, where ω is a parameter of the double-Q-value network. The transition model D_ψ takes a state and an action as input and outputs the next-time state and reward; its structure is a 4-layer fully connected network with 256/256/256/3 neurons per layer. The observation model Z_θ takes a state and an observation as input and outputs the probability of the observation; its structure is a 4-layer fully connected network with 256/256/256/1 neurons per layer. The policy network π_ρ takes the belief-state particles and the average belief state as input and outputs an action together with the logarithm of the action's probability; a fully connected network outputs the mean μ and variance σ² of the action, the action is sampled from the Gaussian distribution N(μ, σ²), and the log-probability is computed from the Gaussian density; the network has 3 fully connected layers with 256/256/4 neurons per layer. The double-Q-value network Q_ω takes a state and an action as input and outputs two Q values; it consists of two fully connected networks Q_1 and Q_2, each a 3-layer fully connected network with 256/256/1 neurons per layer, and each Q_i (i = 1, 2) maintains a target Q-value network TQ_i with the same structure for network parameter updates.
Parameter initialization uses the PyTorch default parameter initialization method.
In the above embodiment, S10-S19 constitute the planning unit, in which the robot performs simulation planning with each copy of the belief-state particles in order to select an optimal action.
In the above embodiment, in S14, the belief-state information entropy is estimated using the belief-state particles to approximate the belief-state probability distribution, with a multivariate Gaussian kernel density estimate using Silverman's empirical window width. The kernel density estimate is a weighted sum of Gaussian kernels centered on the particles, where D is the dimension of the state and the window-width matrix H is a diagonal matrix; the elements on the main diagonal are given by Silverman's rule of thumb as a function of the number of particles, the state dimension D, and the per-dimension sample standard deviation.
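The text names Silverman's empirical window width, but the formula itself is not reproduced here; the standard multivariate rule of thumb, h_d = σ_d · (4 / ((D + 2) · N))^(1/(D+4)) per dimension d, is sketched below under that assumption.

```python
import math

# Silverman's rule-of-thumb bandwidth, one entry per state dimension
# (the main diagonal of the window-width matrix H).
def silverman_bandwidths(samples):
    n = len(samples)
    dim = len(samples[0])
    hs = []
    for d in range(dim):
        col = [s[d] for s in samples]
        mean = sum(col) / n
        sigma = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        hs.append(sigma * (4.0 / ((dim + 2) * n)) ** (1.0 / (dim + 4)))
    return hs

hs = silverman_bandwidths([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)])
```

Note that a dimension with zero sample spread gets bandwidth 0; a practical implementation would clamp it to a small positive floor before dividing by it in the kernel.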
In the above embodiment, in S15, the advantage function A is computed from the temporal-difference (TD) error, where Q_ω takes the smaller of the two outputs of the double-Q network. When the advantage function A is calculated, reward shaping based on the negative information entropy of the belief state is added, which encourages the robot to take information-gathering actions and improves the efficiency and stability of the algorithm.
In the above embodiment, in S17 and S22, resampling is a technique commonly used in particle filtering to prevent particle degeneracy. Specifically, N particles are drawn with replacement from the N weighted particles, with probability proportional to their weights, and the weights of the new particles are then all reset to 1.
In the foregoing embodiment, in S25, the updating the network parameter includes:
(1) The transition model and the observation model use the mean squared error between predicted and true values as the loss function; the Adam optimization method with a learning rate of 0.001 is used to optimize the loss function and update the network parameters.
(2) The two networks of the double-Q network are updated in the same way, based on the temporal-difference (TD) error computed against the target Q-value network, where α = 1 is the temperature coefficient controlling the weight given to the policy entropy and γ = 0.95 is the discount factor. The Adam optimization method with a learning rate of 0.001 is used to optimize the loss function and update the network parameters. In addition, the Q-value network parameters are copied to the target Q-value network every 5 steps.
(3) The policy function is updated based on a loss function in which α = 1 is the temperature coefficient and Q_ω takes the smaller of the two outputs of the double-Q network. The Adam optimization method with a learning rate of 0.001 is used to optimize the loss function and update the network parameters.
The following is the specific structure of this embodiment, including: a filtering unit, a planning unit, a replay pool, and a learning unit.
The filtering unit is used for updating the belief-state particles and their weights, and for using the actions obtained from the planning unit to interact with the training environment and obtain state, observation, and reward information. It also processes the training data and stores it in the replay pool.
The planning unit is used for receiving the weighted particles provided by the filtering unit, performing simulated planning with the learned transition model and the policy network, and outputting the action to the filtering unit.
The replay pool is used for storing the processed training data and providing the learning unit with the training data required for learning, namely a data set consisting of the tuples stored in the replay pool by the filtering unit.
The learning unit is used for sampling the training data in the replay pool, training the networks with the given optimization method, and providing the updated network parameters to the filtering unit and the planning unit.
After the training phase ends, the replay pool and the learning unit are discarded, and the specific steps of the deployment phase are obtained by skipping steps S24 and S25; in this case the environment in S20 only needs to provide observation and reward information and no longer needs to provide the true state information.
Claims (10)
1. A robot navigation control method based on partial observable reinforcement learning is characterized by comprising the following steps:
S1, initializing network parameters, including: the parameter ψ of the transition model D_ψ, the parameter θ of the observation model Z_θ, the parameter ρ of the policy network π_ρ, and the parameter ω of the double Q-value network Q_ω; setting the training time step counter t to 0, and proceeding to S2;
S2, generating K weighted belief-state particles from the prior over the initial state, with all initial weights set to 1; the robot obtains an initial observation o_1 through its sensor, and proceeds to S3;
S3, if the training time step counter t is less than the maximum number of training steps L, t ← t + 1, and proceeding to S4; otherwise, proceeding to S27;
S8, concatenating the particles with the average belief state, copying the result N times, and assigning each copy a weight to obtain N new weighted copies, where the superscript (n) denotes the n-th copy, and proceeding to S9;
S9, setting the planning time step counter i = t - 1, and proceeding to S10;
S10, if the planning time step counter is less than the maximum number of planning steps H, i ← i + 1, and proceeding to S11; otherwise, proceeding to S19;
S12, for each particle in each copy, obtaining the next-time state and reward from the transition model D_ψ, and proceeding to S13;
S14, estimating the belief-state information entropy for each copy from the estimate of the current belief state, and proceeding to S15;
S15, updating the weight of the particles in each copy, where A denotes the advantage function, and proceeding to S16;
S16, if resampling is needed, proceeding to S17; otherwise, proceeding to S18;
S17, resampling the copy particles, and proceeding to S18;
S18, returning to S10;
S19, sampling n uniformly from 1 to N, outputting the first action a_t of the robot's planned trajectory in the n-th copy, and proceeding to S20;
S20, the robot executes action a_t, interacting with the training environment to obtain the next-time state s_{t+1}, the next-time observation o_{t+1}, and the reward r_t, and proceeding to S21;
S21, if resampling is needed, proceeding to S22; otherwise, proceeding to S23;
S22, resampling the belief-state particles, and proceeding to S23;
S25, the learning unit sampling training data from the replay pool, updating the network parameters, and proceeding to S26;
S26, returning to S3;
S27, ending the training, and outputting the trained networks for the navigation control of the robot.
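The control flow of steps S1 through S27 can be summarized in a schematic loop. Everything below — the stub environment, the placeholder planner, and the toy step counts — is illustrative, not the patented implementation:

```python
import random

rng = random.Random(0)
L, H, K, N = 3, 2, 8, 4  # toy values: max training steps, planning horizon, particles, copies

# S1-S2: initialize parameters (stubbed) and the weighted belief particles.
particles = [rng.random() for _ in range(K)]
weights = [1.0] * K

def plan(particles, weights):
    # S8-S19: copy the particles N times, roll the copies forward H steps with
    # the learned transition model and policy, reweight copies by advantage
    # plus entropy shaping, then return the first action of a sampled copy.
    return rng.random()  # placeholder action

def env_step(action):
    # S20: the training environment returns next state, observation, and reward.
    return rng.random(), rng.random(), 1.0

replay_pool = []
for t in range(1, L + 1):                 # S3: loop until the step budget L
    action = plan(particles, weights)
    state, obs, reward = env_step(action)
    # S21-S22: resample the belief particles if needed (weights reset to 1).
    particles = rng.choices(particles, weights=weights, k=K)
    weights = [1.0] * K
    replay_pool.append((state, obs, action, reward))  # store for the learner
    # S25: the learning unit would sample the pool and update parameters here.
# S27: training ends; the trained networks are used for navigation control.
```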
2. The method of claim 1, wherein when the trained network is used for robot navigation control, the replay pool and the learning unit are discarded, and steps S24 and S25 are skipped to obtain the specific steps of the deployment stage of robot navigation control, so that the environment in S20 only needs to provide observation and reward information and does not need to provide the true state information.
3. The method of claim 1, wherein the robot training environment is modeled as a POMDP, represented by the following six-tuple:
(1) a state space S, where s_t ∈ S denotes the state of the robot at time t;
(2) an action space A, where a_t ∈ A denotes the action taken by the robot at time t;
(3) a transition probability function T: S × A × S → [0,1], where T(s_t, a_t, s_{t+1}) denotes the probability that the robot in state s_t transitions to s_{t+1} after taking action a_t;
(4) a reward function R: S × A → ℝ, where R(s_t, a_t) denotes the immediate reward obtained when the robot takes action a_t in state s_t;
(5) an observation space O, where o_t ∈ O denotes the observation obtained by the robot at time t;
(6) an observation probability function Z: S × A × O → [0,1], where Z(s_t, a_{t-1}, o_t) denotes the probability that the robot obtains observation o_t after taking action a_{t-1} and transitioning to state s_t;
the goal of the POMDP is to obtain a policy π: H → A, based on the historical action-observation sequence, that maximizes the expected cumulative reward, where the cumulative reward G_t is defined as:
G_t = Σ_{k=0}^{∞} γ^k · r_{t+k}
where γ ∈ (0, 1] is a discount factor used to trade off immediate and delayed rewards, and r_t denotes the reward received by the robot at time t.
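The discounted cumulative reward G_t = Σ_k γ^k · r_{t+k}, truncated to a finite episode, can be computed as:

```python
def discounted_return(rewards, gamma=0.95):
    """Cumulative reward G_t = sum_k gamma**k * r_{t+k} over a finite episode,
    where rewards[k] is the reward received k steps after time t."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

g0 = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```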
4. The method for controlling robot navigation based on partially observable reinforcement learning of claim 1, wherein the belief state b_t(s) = p(s_t = s | h_t) denotes the probability distribution of s_t given the known history h_t = {b_0, a_0, o_1, …, a_{t-1}, o_t}, where b_0 denotes the initial state probability distribution.
5. The method of claim 1, wherein the transition model D_ψ takes the state and action as input and outputs the next-time state and reward, the transition model network being a fully connected network; the observation model Z_θ takes the state and observation as input and outputs the probability of the observation, the observation model network being a fully connected network; the policy network π_ρ takes the belief-state particles and the average belief state as input and outputs an action and the logarithm of its probability, the policy network passing a fully connected network to output the mean μ and variance σ² of the action, sampling an action from the Gaussian distribution N(μ, σ²), and computing the logarithm of the action's probability from the Gaussian distribution; the double Q-value network Q_ω takes the state and action as input and outputs two Q values, the double Q-value network consisting of two fully connected networks Q_1 and Q_2, and for each Q_i (i = 1, 2) a target Q-value network TQ_i (i = 1, 2) with the same network structure is maintained for the network parameter updates.
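The Gaussian policy head described in the claim — a network emits μ and σ², an action is sampled from N(μ, σ²), and the log-probability is computed in closed form — can be sketched as follows (the fully connected network itself is omitted; μ and σ² are taken as given, so this is only the sampling step):

```python
import math
import random

def sample_action(mu, var, rng=random.Random(0)):
    """Sample a ~ N(mu, var) and return (a, log N(a; mu, var))."""
    sigma = math.sqrt(var)
    a = rng.gauss(mu, sigma)
    # Closed-form log-density of a univariate Gaussian at the sampled point.
    log_prob = -0.5 * math.log(2 * math.pi * var) - (a - mu) ** 2 / (2 * var)
    return a, log_prob

a, logp = sample_action(mu=0.0, var=1.0)
```

The log-probability is what the entropy-regularized losses in the description consume (the α·log π terms).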
6. The method for controlling robot navigation based on partially observable reinforcement learning of claim 1, wherein in S14, the belief-state probability distribution is estimated from the belief-state particles, the estimation using kernel density estimation with a Gaussian kernel, and the belief-state information entropy is then estimated from the estimated distribution.
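Claim 6's entropy estimate — fit a Gaussian kernel density to the belief particles, then estimate the entropy of the fitted density — can be sketched in one dimension with a Monte Carlo estimate evaluated at the particle locations themselves (the bandwidth value is an illustrative assumption, not from the patent):

```python
import math

def gaussian_kde(particles, h):
    """Return p(x) under a 1-D Gaussian kernel density estimate with bandwidth h."""
    n = len(particles)
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))
    def p(x):
        return norm * sum(math.exp(-((x - s) / h) ** 2 / 2) for s in particles)
    return p

def entropy_estimate(particles, h=0.2):
    """H(b) ~= -(1/N) * sum_i log p(s_i): the entropy of the fitted density,
    estimated by averaging -log p at the particles themselves."""
    p = gaussian_kde(particles, h)
    return -sum(math.log(p(s)) for s in particles) / len(particles)

# A tightly clustered belief (low uncertainty) should score lower entropy
# than a spread-out belief (high uncertainty).
h_tight = entropy_estimate([0.0, 0.01, -0.01, 0.02])
h_spread = entropy_estimate([0.0, 1.0, -1.0, 2.0])
```

This ordering is what makes negative-entropy reward shaping meaningful: actions that concentrate the belief reduce the estimated entropy and are rewarded.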
8. The method for controlling robot navigation based on partially observable reinforcement learning of claim 1, wherein in S17 and S22, the resampling selects N particles at random with replacement from the N weighted particles according to their weights, and then sets the weights of the new particles to 1.
9. The method for controlling robot navigation based on partially observable reinforcement learning of claim 1, wherein in S25, updating the network parameters includes:
(1) the transition model and the observation model use the mean squared error between the predicted and true values as the loss function; the loss function is minimized with the optimization method to update the network parameters;
(2) the two networks of the double Q-value network are updated in the same way, based on the temporal-difference error with respect to the target Q-value network:
where α is a temperature coefficient controlling the weight given to the policy entropy; the loss function is minimized with the optimization method to update the network parameters; and the Q-value network parameters are copied to the target Q-value network every fixed number of update steps;
(3) the policy function is updated based on a loss function:
where α is the temperature coefficient and Q_ω is the smaller of the two outputs of the double Q-value network; the loss function is minimized with the optimization method to update the network parameters.
10. A robot navigation control system based on partially observable reinforcement learning, comprising: a filtering unit, a planning unit, a replay pool, and a learning unit;
the filtering unit is used for updating the belief-state particles and their weights, and for using the actions obtained from the planning unit to interact with the training environment and obtain state, observation, and reward information; it is also used for processing the training data and storing it in the replay pool;
the planning unit is used for receiving the weighted particles provided by the filtering unit, performing simulated planning with the learned transition model and the policy network, and outputting the action to the filtering unit;
the replay pool is a database supporting random access, used for storing the processed training data and providing the learning unit with the training data required for learning;
the learning unit is used for sampling the training data in the replay pool, training the networks with the given optimization method, and providing the updated network parameters to the filtering unit and the planning unit;
the training environment is an actual application environment or a simulated virtual environment, used for training the robot navigation control method; it interacts with the filtering unit and provides the state, observation, and reward information for filtering.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210366719.1A CN114911157A (en) | 2022-04-08 | 2022-04-08 | Robot navigation control method and system based on partial observable reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114911157A true CN114911157A (en) | 2022-08-16 |
Family
ID=82762508
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115826013A (en) * | 2023-02-15 | 2023-03-21 | 广东工业大学 | Beidou satellite positioning method based on lightweight reinforcement learning in urban multipath environment |
CN115826013B (en) * | 2023-02-15 | 2023-04-21 | 广东工业大学 | Beidou satellite positioning method based on light reinforcement learning under urban multipath environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||