CN114911157A - Robot navigation control method and system based on partially observable reinforcement learning - Google Patents

Robot navigation control method and system based on partially observable reinforcement learning

Info

Publication number
CN114911157A
Authority
CN
China
Prior art keywords
network
robot
state
action
observation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210366719.1A
Other languages
Chinese (zh)
Inventor
章宗长
俞扬
孔祥瀚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202210366719.1A priority Critical patent/CN114911157A/en
Publication of CN114911157A publication Critical patent/CN114911157A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B 13/04 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B 13/042 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a robot navigation control method and system based on partially observable reinforcement learning, mainly applied to navigation tasks performed by a robot in an uncertain environment whose model is unknown. To complete navigation tasks in such uncertain environments, the invention adopts a reinforcement learning algorithm for partially observable environments. The system comprises a filtering unit, a planning unit, a playback pool and a learning unit. In the invention, the belief state is represented by state particles to reduce the computational complexity of belief-state updating; simulated planning based on a learned model is used to improve sample utilization; a resampling method is used to prevent particle degeneracy; and reward shaping based on the negative information entropy of the belief state is used to improve the training efficiency and stability of the algorithm in navigation tasks with sparse rewards. The method can achieve efficient and stable strategy learning in a partially observable environment with an unknown model, and the learned strategy can be used in actual robot navigation tasks.

Description

Robot navigation control method and system based on partially observable reinforcement learning
Technical Field
The invention relates to a robot navigation control method and system based on reinforcement learning in a partially observable environment, and belongs to the technical field of robot control.
Background
With the development of technology, robots have been widely applied in many fields of production and daily life, and the resulting variety of application scenarios poses new challenges for robot technology. Robot navigation is one of the most important tasks in the field of robot control, and there is a large demand for navigation control in practical application scenarios, such as sweeping robots, warehousing and transportation robots, and search and rescue robots. Most traditional robot navigation algorithms require an accurate model of the environment, which greatly limits their range of application. Reinforcement learning, by contrast, can learn control strategies from data generated by interaction with the environment, and is therefore increasingly applied to robot navigation tasks.
The environment in which the robot operates is usually very complex. Because of occlusion by obstacles, the limited detection range of the sensors and other factors, the robot can obtain only partial information about the environment through its sensors, and decision-making under incomplete information is far more difficult than under complete information. Moreover, the performance of the robot's sensors is limited and the information they provide is noisy; the uncertainty caused by this noise can interfere with the robot's decisions. How to control a robot in an uncertain environment is therefore an urgent problem in the field of robot navigation.
Existing partially observable reinforcement learning algorithms cannot effectively encourage the robot to take information-gathering actions, and they have difficulty obtaining an optimal strategy in tasks where environmental information is crucial. In addition, when the robot executes a navigation task it receives a reward only upon reaching the target point, so the reward is sparse. Existing partially observable reinforcement learning algorithms train slowly in environments with sparse rewards, and their performance is unstable.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the common problems of existing robot navigation technology in uncertain environments, the invention provides a robot navigation control method and system based on partially observable reinforcement learning. The robot navigation task is modeled as a Partially Observable Markov Decision Process (POMDP), and the problem is solved with a reinforcement learning algorithm for partially observable environments. The method effectively alleviates the sparse-reward problem that arises when reinforcement learning is applied to robot navigation tasks, and implicitly encourages the robot to actively take information-gathering actions in partially observable environments, thereby obtaining a better strategy and improving the efficiency and stability of the navigation control method.
The technical scheme is as follows: a robot navigation control method based on partially observable reinforcement learning specifically comprises the following steps:
S1, initializing network parameters, including: the parameter ψ of the transfer model D_ψ, the parameter θ of the observation model Z_θ, the parameter ρ of the policy network π_ρ, and the parameter ω of the double Q-value network Q_ω; setting the training time-step counter t to 0, and entering S2;
S2, generating K weighted belief-state particles from the prior over the initial state and setting all of their initial weights to 1; the robot obtains an initial observation o_1 through its sensors and enters S3;
S3, if the training time-step counter t is less than the maximum number of training steps L, then t ← t + 1 and the method enters S4; otherwise, it enters S27;
S4, the robot updates the particle weights according to the observation model Z_θ(s, o), and enters S5;
S5, calculating the average belief state, and entering S6;
S6, sampling the M particles with the highest weights among the K particles, and entering S7;
S7, normalizing the weights of the M particles, and entering S8;
S8, combining the M particles with the average belief state, copying the combination N times and giving each copy its particle weights, so that N new sets of weighted particles are obtained, where the superscript (n) denotes the n-th copy; entering S9;
S9, setting the planning time-step counter i to t-1, and entering S10;
S10, if the planning time-step counter is less than the maximum number of planning steps H, then i ← i + 1 and the method enters S11; otherwise, it enters S19;
S11, for each copy, obtaining an action from the policy network, and entering S12;
S12, for each particle in each copy, obtaining the next-time state and reward from the transfer model D_ψ, and entering S13;
S13, updating the average belief state of each copy, and entering S14;
S14, estimating the information entropy of the belief state of each copy, based on the estimate of the current belief state, and entering S15;
S15, updating the particle weights of each copy, where A^{(m)(n)} denotes the advantage function, and entering S16;
S16, if resampling is needed, entering S17; otherwise, entering S18;
S17, resampling the copied particles, and entering S18;
S18, entering S10;
S19, sampling n uniformly from 1 to N, outputting the first action a_t of the planned trajectory of the n-th copy of the robot, and entering S20;
S20, the robot takes action a_t and interacts with the training environment to obtain the next-time state s_{t+1}, the next-time observation o_{t+1} and the reward r_t, and enters S21;
S21, if resampling is needed, entering S22; otherwise, entering S23;
S22, resampling the belief-state particles, and entering S23;
S23, updating the belief-state particles according to the transfer model, and entering S24;
S24, storing the data tuple into the playback pool, and entering S25;
S25, the learning unit samples training data from the playback pool and updates the network parameters, and the method enters S26;
S26, entering S3;
S27, finishing the training, and outputting the trained networks for robot navigation control. The specific steps of the use stage of the robot navigation control can be obtained by cancelling the playback pool and the learning unit and skipping steps S24 and S25; in that case the environment in S20 only needs to provide observation and reward information, not the real state information.
In the above technical solution, the environment in which the robot is located (the training environment) is modeled as a POMDP, which may be represented by the following six-tuple:
(1) the state space S, where s_t ∈ S denotes the state of the robot at time t;
(2) the action space A, where a_t ∈ A denotes the action taken by the robot at time t;
(3) the transition probability function T: S × A × S → [0, 1], where T(s_t, a_t, s_{t+1}) denotes the probability that the robot in state s_t transitions to s_{t+1} after taking action a_t;
(4) the reward function R: S × A → R, where R(s_t, a_t) denotes the immediate reward available to the robot in state s_t taking action a_t;
(5) the observation space O, where o_t ∈ O denotes the observation obtained by the robot at time t;
(6) the observation probability function Z: S × A × O → [0, 1], where Z(s_t, a_{t-1}, o_t) denotes the probability that the robot obtains observation o_t after taking action a_{t-1} and transitioning to s_t.
The goal of the POMDP is to obtain a strategy π: H → A based on the historical sequence of actions and observations so as to maximize the expected cumulative reward, where the cumulative reward G_t is defined as
G_t = Σ_{k=0}^{∞} γ^k r_{t+k},
in which γ ∈ (0, 1] is a discount factor used to weigh the immediate reward against delayed rewards, and r_t denotes the reward the robot receives at time t.
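For illustration only, the cumulative reward defined above can be computed as in the following short sketch, using the discount factor γ = 0.95 from the embodiment and a sparse navigation reward as an example:

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    """Cumulative discounted reward G_t = sum_k gamma^k * r_{t+k}, evaluated at t = 0."""
    return float(sum(g * r for g, r in zip(gamma ** np.arange(len(rewards)), rewards)))

# Example: a sparse navigation reward of +100 received at step 20.
rewards = [0.0] * 20 + [100.0]
print(discounted_return(rewards, gamma=0.95))  # 100 * 0.95**20, roughly 35.8
```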
In the above technical solution, the belief state b_t(s) = p(s_t = s | h_t) denotes the probability that s_t equals s given the known history h_t = {b_0, a_0, o_1, …, a_{t-1}, o_t}, where b_0 denotes the initial state probability distribution.
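For illustration only, the following sketch shows the exact Bayes-rule belief update implied by this definition for a small discrete example (two hypothetical states standing in for "left room" and "right room"); the transition and observation tables are invented for the example. It is the cost of this exact update in large state spaces that motivates the particle approximation described next.

```python
import numpy as np

def belief_update(b, a_idx, o_idx, T, Z):
    """b'(s') is proportional to Z(s', a, o) * sum_s T(s, a, s') * b(s)."""
    predicted = T[:, a_idx, :].T @ b          # predict: sum over previous states
    b_new = Z[:, a_idx, o_idx] * predicted    # correct: weight by observation probability
    return b_new / b_new.sum()

# Two states, one action, two observations; weakly informative sensing (toy numbers).
T = np.array([[[0.9, 0.1]], [[0.1, 0.9]]])   # T[s, a, s']
Z = np.array([[[0.6, 0.4]], [[0.4, 0.6]]])   # Z[s', a, o]
b0 = np.array([0.5, 0.5])
print(belief_update(b0, a_idx=0, o_idx=0, T=T, Z=Z))  # -> [0.6, 0.4]
```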
In the foregoing technical solution, in S1, the networks include:
the transfer model D_ψ, where ψ is the parameter of the transfer model;
the observation model Z_θ, where θ is the parameter of the observation model;
the policy network π_ρ, where ρ is the parameter of the policy network;
the double Q-value network Q_ω, where ω is the parameter of the double Q-value network.
The transfer model D_ψ is used for updating the state particles in the filtering unit and for simulation in the planning unit; its input is a state and an action, its output is the state and reward at the next time, and its network structure is a fully connected network. The observation model Z_θ is used for updating the particle weights in the filtering unit; its input is a state and an observation, its output is the probability of the observation, and its network structure is a fully connected network. The policy network π_ρ is used for providing the strategy for the robot's simulation in the planning unit; its input is the belief-state particles and the average belief state, and its output is an action and the logarithm of the probability of that action. Its network structure outputs the mean μ and variance σ² of the action through a fully connected network, samples the action from the Gaussian distribution N(μ, σ²), and computes the log-probability of the action from that Gaussian distribution. The double Q-value network Q_ω is used for updating the particle weights in the planning unit; its input is a state and an action, and its output is two Q values. The double Q-value network consists of two fully connected networks Q_1 and Q_2, and for each Q_i (i = 1, 2) a target Q-value network TQ_i (i = 1, 2) with the same structure is maintained for network parameter updates.
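For illustration only, the following PyTorch sketch shows one way the four networks described above could be realized. The concrete dimensions (a 2-dimensional state and action, a 4-dimensional observation, 256-unit hidden layers), the sigmoid output standing in for an observation probability, and the log-variance parameterization are assumptions of the sketch, loosely following the embodiment given later; they are not prescribed by the patent.

```python
import torch
import torch.nn as nn

class TransferModel(nn.Module):
    """D_psi: (state, action) -> (next state, reward)."""
    def __init__(self, state_dim=2, action_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim + 1))
    def forward(self, s, a):
        out = self.net(torch.cat([s, a], dim=-1))
        return out[..., :-1], out[..., -1]          # next state, reward

class ObservationModel(nn.Module):
    """Z_theta: (state, observation) -> probability of the observation (sigmoid is an assumption)."""
    def __init__(self, state_dim=2, obs_dim=4, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
    def forward(self, s, o):
        return torch.sigmoid(self.net(torch.cat([s, o], dim=-1)))

class PolicyNetwork(nn.Module):
    """pi_rho: belief features -> Gaussian action and its log-probability."""
    def __init__(self, belief_dim=8, action_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(belief_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim))       # mean and log-variance
    def forward(self, b):
        mu, log_var = self.net(b).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mu, torch.exp(0.5 * log_var))
        a = dist.rsample()
        return a, dist.log_prob(a).sum(-1)

class DoubleQNetwork(nn.Module):
    """Q_omega: two independent Q heads; the smaller output is used downstream."""
    def __init__(self, state_dim=2, action_dim=2, hidden=256):
        super().__init__()
        def head():
            return nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1))
        self.q1, self.q2 = head(), head()
    def forward(self, s, a):
        x = torch.cat([s, a], dim=-1)
        return self.q1(x), self.q2(x)
```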
In the above technical solution, in S2, representing the belief state with weighted particles is a common approximation for dealing with the otherwise prohibitive computational complexity of exact belief-state updating; the process of updating the particles is called particle filtering, or the sequential Monte Carlo method.
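For illustration only, a minimal numpy sketch of the weighted-particle belief representation and weight update of S2 and S4 is given below; the helper names, the 2-dimensional state and the Gaussian stand-in for the observation model Z_θ are assumptions of the sketch.

```python
import numpy as np

def init_particles(prior_sampler, K=100):
    """S2: draw K belief-state particles from the initial-state prior, weights all set to 1."""
    particles = np.stack([prior_sampler() for _ in range(K)])   # shape (K, state_dim)
    weights = np.ones(K)
    return particles, weights

def reweight(particles, weights, obs, obs_likelihood):
    """S4: scale each particle weight by the observation model Z_theta(s, o)."""
    return weights * np.array([obs_likelihood(s, obs) for s in particles])

def average_belief(particles, weights):
    """S5: the weighted mean of the particles as a summary of the belief state."""
    w = weights / (weights.sum() + 1e-12)
    return (w[:, None] * particles).sum(axis=0)

# Example with a hypothetical 2-D state and a dummy Gaussian likelihood.
rng = np.random.default_rng(0)
prior = lambda: rng.uniform(0.0, 10.0, size=2)
lik = lambda s, o: float(np.exp(-0.5 * np.sum((s - o) ** 2)))
P, W = init_particles(prior, K=100)
W = reweight(P, W, obs=np.array([5.0, 5.0]), obs_likelihood=lik)
print(average_belief(P, W))
```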
In the above technical solution, S7-S19 constitute the planning unit: the robot performs simulated planning with each copy of the belief-state particles in order to select an optimal action.
In the above technical solution, in S14, the belief-state particles are used to estimate the belief-state probability distribution when estimating the belief-state information entropy; a kernel density estimation (KDE) method with a Gaussian kernel is used to estimate the probability distribution of the belief state.
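For illustration only, the following sketch estimates the belief-state entropy from particles with a Gaussian KDE; the equal particle weighting and the fixed bandwidth are assumptions of the sketch (the embodiment below uses Silverman's empirical bandwidth instead).

```python
import numpy as np

def gaussian_kde_entropy(particles, bandwidth=0.5):
    """Estimate H(b) = -E[log p(s)] from particles using a Gaussian KDE.

    particles: array of shape (M, D); bandwidth: scalar standard deviation of the kernel.
    """
    M, D = particles.shape
    diff = particles[:, None, :] - particles[None, :, :]           # (M, M, D)
    sq = np.sum(diff ** 2, axis=-1) / (2.0 * bandwidth ** 2)
    log_norm = -0.5 * D * np.log(2.0 * np.pi * bandwidth ** 2)
    # log p(s_i): kernels centred at every particle, averaged over the M particles
    log_p = log_norm + np.log(np.exp(-sq).sum(axis=1)) - np.log(M)
    return float(-np.mean(log_p))                                   # Monte Carlo entropy estimate

particles = np.random.default_rng(1).normal(size=(30, 2))
print(gaussian_kde_entropy(particles))
```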
In the above technical solution, in S15, the advantage function A is computed from the temporal-difference (TD) error, with Q_ω taken as the smaller of the two outputs of the double Q-value network. When the advantage function A is computed, reward shaping based on the negative information entropy of the belief state is added, so that the robot is encouraged to take information-gathering actions and the efficiency and stability of the algorithm are improved.
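The shaping formula itself is not reproduced here; one standard way to realize the description above is potential-based shaping with the negative belief entropy as the potential, as in the following sketch (the function name and the form F = γΦ(b_{t+1}) - Φ(b_t) are assumptions of the sketch, not the patent's exact formula):

```python
def shaped_reward(r, entropy_next, entropy_curr, gamma=0.95):
    """Potential-based reward shaping with potential Phi(b) = -H(b).

    r:            environment reward r_t
    entropy_curr: estimated belief-state entropy H(b_t)
    entropy_next: estimated belief-state entropy H(b_{t+1})
    """
    phi_curr = -entropy_curr
    phi_next = -entropy_next
    return r + gamma * phi_next - phi_curr   # potential-based shaping leaves the optimal policy unchanged

# Example: an information-gathering action that sharply reduces belief entropy gets a positive bonus.
print(shaped_reward(r=0.0, entropy_next=0.3, entropy_curr=1.8, gamma=0.95))
```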
In the above technical solutions, in S17 and S22, resampling is a technique commonly used in particle filtering to prevent particle degeneracy. Specifically, N particles are drawn with replacement from the N weighted particles with probability proportional to their weights, and the weights of the new particles are then all reset to 1.
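For illustration only, a minimal numpy sketch of this multinomial resampling step is:

```python
import numpy as np

def resample(particles, weights, rng=np.random.default_rng()):
    """Draw N particles with replacement, proportionally to weight; reset all weights to 1."""
    N = len(particles)                      # particles: numpy array of shape (N, state_dim)
    p = weights / weights.sum()
    idx = rng.choice(N, size=N, replace=True, p=p)
    return particles[idx].copy(), np.ones(N)
```

The patent does not specify when resampling "is needed"; a common criterion (an assumption here, not taken from the patent) is to resample when the effective sample size (Σw)²/Σw² drops below a threshold such as N/2.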
In the foregoing technical solution, in S25, updating the network parameters includes:
(1) The transfer model and the observation model use the mean squared error between the predicted value and the true value as the loss function; the loss is optimized with a specified optimization method, such as stochastic gradient descent or Adam, and the network parameters are updated.
(2) The two networks of the double Q-value network are updated in the same way, based on the temporal-difference (TD) error with respect to the target Q-value networks, where α is the temperature coefficient controlling the degree of attention paid to the strategy entropy. The loss is optimized with a specified optimization method, such as stochastic gradient descent or Adam, and the network parameters are updated. In addition, the Q-value network parameters are copied to the target Q-value networks every fixed number of update steps.
(3) The policy function is updated based on a loss function in which α is the temperature coefficient and Q_ω is the smaller of the two outputs of the double Q-value network. The loss is optimized with a specified optimization method, such as stochastic gradient descent or Adam, and the network parameters are updated.
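The loss formulas appear only as images in the original. The description (a TD error against target Q-value networks, a temperature coefficient α weighting the strategy entropy, and the smaller of the two Q outputs) matches the standard soft actor-critic losses, so the following PyTorch sketch is one plausible reading rather than the patent's exact formulas. It assumes double_q and target_q are callables returning the two Q values, and that the policy returns an action together with its log-probability.

```python
import torch
import torch.nn.functional as F

def q_loss(double_q, target_q, policy, batch, alpha=1.0, gamma=0.95):
    """TD loss for both Q heads against the target networks (SAC-style, assumed form)."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    with torch.no_grad():
        a_next, logp_next = policy(s_next)
        tq1, tq2 = target_q(s_next, a_next)
        y = r + gamma * (torch.min(tq1, tq2).squeeze(-1) - alpha * logp_next)  # soft TD target
    q1, q2 = double_q(s, a)
    return F.mse_loss(q1.squeeze(-1), y) + F.mse_loss(q2.squeeze(-1), y)

def policy_loss(double_q, policy, batch, alpha=1.0):
    """Minimize alpha * log pi - min(Q1, Q2), i.e. maximize soft value."""
    s = batch["s"]
    a, logp = policy(s)
    q1, q2 = double_q(s, a)
    return (alpha * logp - torch.min(q1, q2).squeeze(-1)).mean()

def update_targets(double_q, target_q):
    """Copy the Q parameters to the target networks every fixed number of update steps
    (assumes both are torch.nn.Module instances)."""
    target_q.load_state_dict(double_q.state_dict())
```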
In order to achieve the above object, the present invention provides a robot navigation control system based on partially observable reinforcement learning, comprising: a filtering unit, a planning unit, a playback pool and a learning unit.
The filtering unit is used for updating the belief-state particles and their weights, and for interacting with the training environment, using the actions obtained from the planning unit, to obtain state, observation and reward information. It also processes the training data and stores it in the playback pool.
The planning unit is used for receiving the weighted particles provided by the filtering unit, performing simulated planning with the learned transfer model and the policy network, and outputting the action provided to the filtering unit.
The playback pool is used for storing the processed training data and providing the learning unit with the training data required for learning, i.e. a data set consisting of the tuples stored in the playback pool by the filtering unit.
The learning unit is used for sampling the training data in the playback pool, training the networks with a given optimization method, and providing the updated network parameters to the filtering unit and the planning unit.
In the system, the training environment, i.e. the actual application environment or a highly realistic virtual environment, is used for training the robot navigation control method; it interacts with the filtering unit and provides state, observation and reward information for filtering.
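For illustration only, a minimal sketch of such a playback pool, assuming a fixed capacity and uniform random sampling of the stored tuples, is:

```python
import random
from collections import deque

class PlaybackPool:
    """Fixed-capacity store of training tuples with uniform random sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        """transition: e.g. (particles, action, reward, observation, next_particles)."""
        self.buffer.append(transition)

    def sample(self, batch_size=64):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```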
Based on the above technical scheme, the neural networks can be trained and then used in practice. The playback pool and the learning unit are cancelled, and the specific steps of the use stage can be obtained by skipping steps S24 and S25; in this case the environment in S20 only needs to provide observation and reward information, not the real state information.
Advantageous effects: owing to the application of the above technical scheme, the invention has the following advantages over the prior art:
The invention handles the robot navigation control task with reinforcement learning and can learn the control strategy from data generated by interaction with the environment. This removes the need for accurate environment modeling required by traditional control methods and expands the application range of the control method.
The invention models the environment as a POMDP and can thus characterize the uncertainty in the environment. Traditional methods have difficulty handling tasks with occlusion, limited sensor detection range and sensor noise, while the invention can effectively handle navigation control tasks in such environments.
The invention adopts a model-based partially observable reinforcement learning algorithm, which can improve the utilization rate of training samples and the training efficiency.
The invention adopts reward shaping based on a potential function, which can effectively alleviate the sparse-reward problem of real robot navigation tasks without changing the optimal strategy, and improves the training efficiency and stability of the algorithm.
Using the negative information entropy of the belief state as the potential function in the reward-shaping method encourages the robot to take information-gathering actions, so that an optimal strategy is easier to obtain than with traditional control methods.
Drawings
FIG. 1 is a diagram of an overall training framework for an embodiment of the present invention;
FIG. 2 is a diagram of unit interactions during a training phase in accordance with an embodiment of the present invention;
FIG. 3 is a diagram of cell interaction during a use phase according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an embodiment of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention; various equivalent modifications of the invention that occur to those skilled in the art upon reading the present disclosure fall within the scope of the appended claims.
Fig. 4 is a top view of the robot navigation environment, in which the robot is in one of two rooms of the same size, on the left and on the right. The state of the robot is its absolute coordinates within the whole house. The robot can move in any direction with limited speed. The robot is provided with 4 sensors in the four cardinal directions (up, down, left and right); each sensor measures the distance from the robot to the nearest wall in that direction, with Gaussian noise. The initial position of the robot is random, and its goal is to reach the charging area below the left room or above the right room; a reward of +100 is obtained when the robot reaches the target position. Because the robot only receives observations, it cannot tell which room it is in unless it reaches the shaded part of the figure, where the change in the wall distances allows it to determine which room it is in.
The steps of the robot in the training phase are as follows:
S1, initializing network parameters, including: the parameter ψ of the transfer model D_ψ, the parameter θ of the observation model Z_θ, the parameter ρ of the policy network π_ρ, and the parameter ω of the double Q-value network Q_ω; setting the training time-step counter t to 0, and entering S2;
S2, generating K = 100 weighted belief-state particles from the prior over the initial state and setting all of their initial weights to 1; the robot obtains an initial observation o_1 through its sensors and enters S3;
S3, if the training time-step counter t is less than the maximum number of training steps L = 10,000, then t ← t + 1 and the method enters S4; otherwise, it enters S27;
S4, the robot updates the particle weights according to the observation model Z_θ(s, o), and enters S5;
S5, calculating the average belief state, and entering S6;
S6, sampling the M = 3 particles with the highest weights among the 100 particles; as in FIG. 2 and FIG. 3, the filtering unit inputs the weighted particles to the planning unit, and the method enters S7;
S7, normalizing the weights of the M particles, and entering S8;
S8, combining the particles with the average belief state, copying the combination N = 30 times and giving each copy its particle weights, so that N new sets of weighted particles are obtained, where the superscript (n) denotes the n-th copy; entering S9;
S9, setting the planning time-step counter i to t-1, and entering S10;
S10, if the planning time-step counter is less than the maximum number of planning steps H = 10, then i ← i + 1 and the method enters S11; otherwise, it enters S19;
S11, for each copy, obtaining an action from the policy network, and entering S12;
S12, for each particle in each copy, obtaining the next-time state and reward from the transfer model D_ψ, and entering S13;
S13, updating the average belief state of each copy, and entering S14;
S14, estimating the information entropy of the belief state of each copy, based on the estimate of the current belief state, and entering S15;
S15, updating the particle weights of each copy, where A^{(m)(n)} denotes the advantage function, and entering S16;
S16, if resampling is needed, entering S17; otherwise, entering S18;
S17, resampling the copied particles, and entering S18;
S18, entering S10;
S19, sampling n uniformly from 1 to N, outputting the first action a_t of the planned trajectory of the n-th copy of the robot; as in FIG. 2 and FIG. 3, the planning unit inputs the action to the filtering unit, and the method enters S20;
S20, the robot takes action a_t and interacts with the training environment to obtain the next-time state s_{t+1}, the next-time observation o_{t+1} and the reward r_t; as in FIG. 2, s_{t+1}, o_{t+1} and r_t are input to the filtering unit, and the method enters S21;
S21, if resampling is needed, entering S22; otherwise, entering S23;
S22, resampling the belief-state particles, and entering S23;
S23, updating the belief-state particles according to the transfer model, and entering S24;
S24, storing the data tuple into the playback pool of FIG. 2, and entering S25;
S25, the learning unit of FIG. 2 samples training data from the playback pool, updates the network parameters, transmits the updated parameters to the filtering unit and the planning unit, and enters S26;
S26, entering S3;
S27, finishing the training, and outputting the trained networks for robot navigation control. Referring to FIG. 3, the specific steps of the use stage of the robot navigation control can be obtained by cancelling the playback pool and the learning unit and skipping steps S24 and S25; in this case the environment in S20 only needs to provide observation and reward information to the filtering unit, not the real state information.
The block diagram of the whole training process is shown in FIG. 1.
In the above embodiment, in S1, the networks include: the transfer model D_ψ, where ψ is the parameter of the transfer model; the observation model Z_θ, where θ is the parameter of the observation model; the policy network π_ρ, where ρ is the parameter of the policy network; and the double Q-value network Q_ω, where ω is the parameter of the double Q-value network. The transfer model D_ψ takes a state and an action as input and outputs the state and reward at the next time; its network structure is a 4-layer fully connected network with 256/256/256/3 neurons per layer. The observation model Z_θ takes a state and an observation as input and outputs the probability of the observation; its network structure is a 4-layer fully connected network with 256/256/256/1 neurons per layer. The policy network π_ρ takes the belief-state particles and the average belief state as input and outputs an action and the logarithm of its probability; its network structure outputs the mean μ and variance σ² of the action through a fully connected network, samples the action from the Gaussian distribution N(μ, σ²), and computes the log-probability of the action from that Gaussian distribution; it is a 3-layer fully connected network with 256/256/4 neurons per layer. The double Q-value network Q_ω takes a state and an action as input and outputs two Q values; it consists of two fully connected networks Q_1 and Q_2, each a 3-layer fully connected network with 256/256/1 neurons per layer, and for each Q_i (i = 1, 2) a target Q-value network TQ_i (i = 1, 2) with the same structure is maintained for network parameter updates. The parameters are initialized with the PyTorch default initialization method.
In the above embodiment, S10-S19 constitute the planning unit: the robot performs simulated planning with each copy of the belief-state particles in order to select an optimal action.
In the above embodiment, in S14, the belief-state particles are used to estimate the belief-state probability distribution when estimating the belief-state information entropy, using a multivariate Gaussian kernel density estimation method with the Silverman empirical bandwidth. The bandwidth matrix H is diagonal, and each element on its main diagonal is computed from the standard deviation of the state particles in the corresponding dimension, where D is the dimension of the state.
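For illustration only, the following sketch combines a diagonal Gaussian KDE with one common form of Silverman's rule of thumb; the exact constant in the bandwidth formula is an assumption, since the patent only states that the diagonal elements of H are computed from the per-dimension standard deviation of the state particles.

```python
import numpy as np

def silverman_bandwidths(particles):
    """Diagonal bandwidths h_i = sigma_i * (4 / ((D + 2) * M)) ** (1 / (D + 4)).

    This is one common form of Silverman's rule of thumb, not necessarily the
    patent's exact formula.
    """
    M, D = particles.shape
    sigma = particles.std(axis=0, ddof=1)
    return sigma * (4.0 / ((D + 2) * M)) ** (1.0 / (D + 4))

def kde_log_density(query, particles, h):
    """Log of a diagonal-bandwidth Gaussian KDE evaluated at the query points."""
    diff = (query[:, None, :] - particles[None, :, :]) / h       # (Q, M, D)
    sq = 0.5 * np.sum(diff ** 2, axis=-1)
    log_norm = -np.sum(np.log(h)) - 0.5 * len(h) * np.log(2 * np.pi)
    return log_norm + np.log(np.exp(-sq).mean(axis=1))

particles = np.random.default_rng(2).normal(size=(30, 2))
h = silverman_bandwidths(particles)
entropy = -np.mean(kde_log_density(particles, particles, h))    # entropy of the particle belief
print(h, entropy)
```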
In the above embodiment, in S15, the advantage function A is computed from the temporal-difference (TD) error, with Q_ω taken as the smaller of the two outputs of the double Q-value network. When the advantage function A is computed, reward shaping based on the negative information entropy of the belief state is added, so that the robot is encouraged to take information-gathering actions and the efficiency and stability of the algorithm are improved.
In the above embodiment, in S17 and S22, resampling is a technique commonly used in particle filtering to prevent particle degeneracy. Specifically, N particles are drawn with replacement from the N weighted particles with probability proportional to their weights, and the weights of the new particles are then all reset to 1.
In the foregoing embodiment, in S25, updating the network parameters includes:
(1) The transfer model and the observation model use the mean squared error between the predicted value and the true value as the loss function; the loss is optimized with the Adam optimization method at a learning rate of 0.001, and the network parameters are updated.
(2) The two networks of the double Q-value network are updated in the same way, based on the temporal-difference (TD) error with respect to the target Q-value networks, where α = 1 is the temperature coefficient controlling the degree of attention paid to the strategy entropy and γ = 0.95 is the discount factor. The loss is optimized with the Adam optimization method at a learning rate of 0.001, and the network parameters are updated. In addition, the Q-value network parameters are copied to the target Q-value networks every 5 update steps.
(3) The policy function is updated based on a loss function in which α = 1 is the temperature coefficient and Q_ω is the smaller of the two outputs of the double Q-value network. The loss is optimized with the Adam optimization method at a learning rate of 0.001, and the network parameters are updated.
The specific structure of this embodiment comprises: a filtering unit, a planning unit, a playback pool and a learning unit.
The filtering unit is used for updating the belief-state particles and their weights, and for interacting with the training environment, using the actions obtained from the planning unit, to obtain state, observation and reward information. It also processes the training data and stores it in the playback pool.
The planning unit is used for receiving the weighted particles provided by the filtering unit, performing simulated planning with the learned transfer model and the policy network, and outputting the action provided to the filtering unit.
The playback pool is used for storing the processed training data and providing the learning unit with the training data required for learning, i.e. a data set consisting of the tuples stored in the playback pool by the filtering unit.
The learning unit is used for sampling the training data in the playback pool, training the networks with a given optimization method, and providing the updated network parameters to the filtering unit and the planning unit.
After the training phase is finished, the playback pool and the learning unit are cancelled, and the specific steps of the use stage can be obtained by skipping steps S24 and S25; in this case the environment in S20 only needs to provide observation and reward information, not the real state information.

Claims (10)

1. A robot navigation control method based on partially observable reinforcement learning, which is characterized by comprising the following steps:
S1, initializing network parameters, including: the parameter ψ of the transfer model D_ψ, the parameter θ of the observation model Z_θ, the parameter ρ of the policy network π_ρ, and the parameter ω of the double Q-value network Q_ω; setting the training time-step counter t to 0, and entering S2;
S2, generating K weighted belief-state particles from the prior over the initial state and setting all of their initial weights to 1; the robot obtains an initial observation o_1 through its sensors and enters S3;
S3, if the training time-step counter t is less than the maximum number of training steps L, then t ← t + 1 and the method enters S4; otherwise, it enters S27;
S4, the robot updates the particle weights according to the observation model Z_θ(s, o), and enters S5;
S5, calculating the average belief state, and entering S6;
S6, sampling the M particles with the highest weights among the K particles, and entering S7;
S7, normalizing the weights of the M particles, and entering S8;
S8, combining the M particles with the average belief state, copying the combination N times and giving each copy its particle weights, so that N new sets of weighted particles are obtained, where the superscript (n) denotes the n-th copy; entering S9;
S9, setting the planning time-step counter i to t-1, and entering S10;
S10, if the planning time-step counter is less than the maximum number of planning steps H, then i ← i + 1 and the method enters S11; otherwise, it enters S19;
S11, for each copy, obtaining an action from the policy network, and entering S12;
S12, for each particle in each copy, obtaining the next-time state and reward from the transfer model D_ψ, and entering S13;
S13, updating the average belief state of each copy, and entering S14;
S14, estimating the information entropy of the belief state of each copy, based on the estimate of the current belief state, and entering S15;
S15, updating the particle weights of each copy, where A^{(m)(n)} denotes the advantage function, and entering S16;
S16, if resampling is needed, entering S17; otherwise, entering S18;
S17, resampling the copied particles, and entering S18;
S18, entering S10;
S19, sampling n uniformly from 1 to N, outputting the first action a_t of the planned trajectory of the n-th copy of the robot, and entering S20;
S20, the robot takes action a_t and interacts with the training environment to obtain the next-time state s_{t+1}, the next-time observation o_{t+1} and the reward r_t, and enters S21;
S21, if resampling is needed, entering S22; otherwise, entering S23;
S22, resampling the belief-state particles, and entering S23;
S23, updating the belief-state particles according to the transfer model, and entering S24;
S24, storing the data tuple into the playback pool, and entering S25;
S25, the learning unit samples training data from the playback pool and updates the network parameters, and the method enters S26;
S26, entering S3;
S27, finishing the training, and outputting the trained networks for robot navigation control.
2. The method of claim 1, wherein, when the trained networks are used for robot navigation control, the playback pool and the learning unit are cancelled and steps S24 and S25 are skipped to obtain the specific steps of the use stage of the robot navigation control, so that the environment in S20 only needs to provide observation and reward information, not the real state information.
3. The method of claim 1, wherein the training environment of the robot is modeled as a POMDP, which is expressed by the following six-tuple:
(1) the state space S, where s_t ∈ S denotes the state of the robot at time t;
(2) the action space A, where a_t ∈ A denotes the action taken by the robot at time t;
(3) the transition probability function T: S × A × S → [0, 1], where T(s_t, a_t, s_{t+1}) denotes the probability that the robot in state s_t transitions to s_{t+1} after taking action a_t;
(4) the reward function R: S × A → R, where R(s_t, a_t) denotes the immediate reward available to the robot in state s_t taking action a_t;
(5) the observation space O, where o_t ∈ O denotes the observation obtained by the robot at time t;
(6) the observation probability function Z: S × A × O → [0, 1], where Z(s_t, a_{t-1}, o_t) denotes the probability that the robot obtains observation o_t after taking action a_{t-1} and transitioning to s_t;
the goal of the POMDP is to obtain a strategy π: H → A based on the historical sequence of actions and observations so as to maximize the expected cumulative reward, where the cumulative reward G_t is defined as G_t = Σ_{k=0}^{∞} γ^k r_{t+k}, in which γ ∈ (0, 1] is a discount factor used to weigh the immediate reward against delayed rewards and r_t denotes the reward the robot receives at time t.
4. The method for controlling robot navigation based on partially observable reinforcement learning of claim 1, wherein the belief state b_t(s) = p(s_t = s | h_t) denotes the probability that s_t equals s given the known history h_t = {b_0, a_0, o_1, …, a_{t-1}, o_t}, where b_0 denotes the initial state probability distribution.
5. The method of claim 1, wherein the transfer model D_ψ takes a state and an action as input and outputs the state and reward at the next time, its network structure being a fully connected network; the observation model Z_θ takes a state and an observation as input and outputs the probability of the observation, its network structure being a fully connected network; the policy network π_ρ takes the belief-state particles and the average belief state as input and outputs an action and the logarithm of the probability of that action, its network structure outputting the mean μ and variance σ² of the action through a fully connected network, sampling the action from the Gaussian distribution N(μ, σ²) and computing the log-probability of the action from that Gaussian distribution; the double Q-value network Q_ω takes a state and an action as input and outputs two Q values, the double Q-value network consisting of two fully connected networks Q_1 and Q_2, and for each Q_i (i = 1, 2) a target Q-value network TQ_i (i = 1, 2) with the same structure is maintained for network parameter updates.
6. The method for controlling robot navigation based on partially observable reinforcement learning of claim 1, wherein, in S14, the belief-state particles are used to estimate the belief-state probability distribution in the estimation of the belief-state information entropy, and the probability distribution of the belief state is estimated with a kernel density estimation method using a Gaussian kernel.
7. The method for controlling robot navigation based on partially observable reinforcement learning of claim 1, wherein, in S15, the advantage function A is computed from the temporal-difference (TD) error, with Q_ω taken as the smaller of the two outputs of the double Q-value network.
8. The method for controlling robot navigation based on partially observable reinforcement learning of claim 1, wherein, in S17 and S22, the resampling draws N particles with replacement from the N weighted particles with probability proportional to their weights, and then sets the weights of the new particles to 1.
9. The method for controlling robot navigation based on partially observable reinforcement learning of claim 1, wherein, in S25, updating the network parameters includes:
(1) the transfer model and the observation model use the mean squared error between the predicted value and the true value as the loss function; the loss function is optimized with an optimization method, and the network parameters are updated;
(2) the two networks of the double Q-value network are updated in the same way, based on the temporal-difference error with respect to the target Q-value networks, where α is the temperature coefficient controlling the degree of attention paid to the strategy entropy; the loss function is optimized with an optimization method and the network parameters are updated; the Q-value network parameters are copied to the target Q-value networks every fixed number of update steps;
(3) the policy function is updated based on a loss function in which α is the temperature coefficient and Q_ω is the smaller of the two outputs of the double Q-value network; the loss function is optimized with an optimization method and the network parameters are updated.
10. A robot navigation control system based on partially observable reinforcement learning, characterized by comprising: a filtering unit, a planning unit, a playback pool and a learning unit;
the filtering unit is used for updating the belief-state particles and their weights, and for interacting with the training environment, using the actions obtained from the planning unit, to obtain state, observation and reward information; it is also used for processing the training data and storing it in the playback pool;
the planning unit is used for receiving the weighted particles provided by the filtering unit, performing simulated planning with the learned transfer model and the policy network, and outputting the action provided to the filtering unit;
the playback pool is a database supporting random access, used for storing the processed training data and providing the learning unit with the training data required for learning;
the learning unit is used for sampling the training data in the playback pool, training the networks with a given optimization method, and providing the updated network parameters to the filtering unit and the planning unit;
the training environment is the actual application environment or a simulated virtual environment, used for training the robot navigation control method; it interacts with the filtering unit and provides state, observation and reward information for filtering.
CN202210366719.1A 2022-04-08 2022-04-08 Robot navigation control method and system based on partial observable reinforcement learning Pending CN114911157A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210366719.1A CN114911157A (en) 2022-04-08 2022-04-08 Robot navigation control method and system based on partial observable reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210366719.1A CN114911157A (en) 2022-04-08 2022-04-08 Robot navigation control method and system based on partial observable reinforcement learning

Publications (1)

Publication Number Publication Date
CN114911157A true CN114911157A (en) 2022-08-16

Family

ID=82762508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210366719.1A Pending CN114911157A (en) 2022-04-08 2022-04-08 Robot navigation control method and system based on partial observable reinforcement learning

Country Status (1)

Country Link
CN (1) CN114911157A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115826013A (en) * 2023-02-15 2023-03-21 广东工业大学 Beidou satellite positioning method based on lightweight reinforcement learning in urban multipath environment
CN115826013B (en) * 2023-02-15 2023-04-21 广东工业大学 Beidou satellite positioning method based on light reinforcement learning under urban multipath environment

Similar Documents

Publication Publication Date Title
CN110928189B (en) Robust control method based on reinforcement learning and Lyapunov function
Song et al. New chaotic PSO-based neural network predictive control for nonlinear process
CN114839884B (en) Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN114967713B (en) Underwater vehicle buoyancy discrete change control method based on reinforcement learning
CN116700327A (en) Unmanned aerial vehicle track planning method based on continuous action dominant function learning
CN116052254A (en) Visual continuous emotion recognition method based on extended Kalman filtering neural network
CN114626307B (en) Distributed consistent target state estimation method based on variational Bayes
CN114911157A (en) Robot navigation control method and system based on partial observable reinforcement learning
Wei et al. Boosting offline reinforcement learning with residual generative modeling
CN115972211A (en) Control strategy offline training method based on model uncertainty and behavior prior
CN111798494A (en) Maneuvering target robust tracking method under generalized correlation entropy criterion
CN115374933A (en) Intelligent planning and decision-making method for landing behavior of multi-node detector
CN114626505A (en) Mobile robot deep reinforcement learning control method
CN105424043A (en) Motion state estimation method based on maneuver judgment
CN113407820B (en) Method for processing data by using model, related system and storage medium
Wang et al. A KNN based Kalman filter Gaussian process regression
Yin et al. Sample efficient deep reinforcement learning via local planning
Du et al. A novel locally regularized automatic construction method for RBF neural models
Xu et al. Residual autoencoder-LSTM for city region vehicle emission pollution prediction
CN115009291B (en) Automatic driving assistance decision making method and system based on network evolution replay buffer area
CN114995106A (en) PID self-tuning method, device and equipment based on improved wavelet neural network
WO2021140698A1 (en) Information processing device, method, and program
CN114662656A (en) Deep neural network model training method, autonomous navigation method and system
CN108960406B (en) MEMS gyroscope random error prediction method based on BFO wavelet neural network
Li et al. Covid-19 Epidemic Trend Prediction Based on CNN-StackBiLSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination