CN114911157A - Robot navigation control method and system based on partially observable reinforcement learning - Google Patents

Robot navigation control method and system based on partially observable reinforcement learning

Info

Publication number
CN114911157A
Authority
CN
China
Prior art keywords
network
robot
state
action
observation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210366719.1A
Other languages
Chinese (zh)
Inventor
章宗长
俞扬
孔祥瀚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202210366719.1A priority Critical patent/CN114911157A/en
Publication of CN114911157A publication Critical patent/CN114911157A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B 13/04 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B 13/042 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a robot navigation control method and system based on partially observable reinforcement learning, mainly applied to navigation tasks performed by a robot in an uncertain environment whose model is unknown. To complete navigation tasks in such uncertain environments, the invention adopts a reinforcement learning algorithm for partially observable environments. The system comprises a filtering unit, a planning unit, a playback pool and a learning unit. In the invention, the belief state is represented by state particles to reduce the computational complexity of belief-state updating; simulated planning based on a learned model is used to improve sample utilization; a resampling method is used to prevent particle degeneracy; and reward shaping based on the negative information entropy of the belief state is used to improve the training efficiency and stability of the algorithm in navigation tasks with sparse rewards. The method can achieve efficient and stable strategy learning in a partially observable environment with an unknown model, and the learned strategy can be used in actual robot navigation tasks.

Description

Robot navigation control method and system based on partially observable reinforcement learning
Technical Field
The invention relates to a robot navigation control method and system based on reinforcement learning in a partially observable environment, and belongs to the technical field of robot control.
Background
With the development of technology, robots have been widely applied in many fields of production and daily life, and the resulting variety of application scenarios poses new challenges for robot technology. Robot navigation is one of the most important tasks in the field of robot control, and there is a large demand for navigation control in practical application scenarios, such as sweeping robots, warehousing and transportation robots, and search and rescue robots. Most traditional robot navigation algorithms require an accurate model of the environment, which greatly limits their range of application. Reinforcement learning, by contrast, can learn control strategies from data generated by interaction with the environment, and is therefore increasingly applied to robot navigation tasks.
The environment in which the robot operates is usually very complex. Because of occlusion by obstacles, the limited detection range of the sensors and other factors, the robot can obtain only partial information about the environment through its sensors, and decision-making under incomplete information is far more difficult than under complete information. Moreover, the performance of the robot's sensors is limited and the information they provide is noisy; the uncertainty caused by this noise can interfere with the robot's decisions. How to control a robot in an uncertain environment is therefore an urgent problem in the field of robot navigation.
Existing partially observable reinforcement learning algorithms cannot effectively encourage the robot to take information-gathering actions, and they have difficulty obtaining an optimal strategy in tasks where environmental information is crucial. In addition, when the robot executes a navigation task it receives a reward only upon reaching the target point, so the reward is sparse. Existing partially observable reinforcement learning algorithms train slowly in environments with sparse rewards, and their performance is unstable.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the common problems of existing robot navigation technology in uncertain environments, the invention provides a robot navigation control method and system based on partially observable reinforcement learning. The robot navigation task is modeled as a Partially Observable Markov Decision Process (POMDP), and the problem is solved with a reinforcement learning algorithm for partially observable environments. The method effectively alleviates the sparse-reward problem that arises when reinforcement learning is applied to robot navigation tasks, and implicitly encourages the robot to actively take information-gathering actions in partially observable environments, thereby obtaining a better strategy and improving the efficiency and stability of the navigation control method.
The technical scheme is as follows: a robot navigation control method based on partially observable reinforcement learning specifically comprises the following steps:
S1, initializing network parameters, including: the parameter ψ of the transfer model D_ψ, the parameter θ of the observation model Z_θ, the parameter ρ of the policy network π_ρ, and the parameter ω of the double Q-value network Q_ω; setting the training time-step counter t to 0, and entering S2;
S2, generating K weighted belief-state particles from the prior over the initial state and setting all of their initial weights to 1; the robot obtains an initial observation o_1 through its sensors and enters S3;
S3, if the training time-step counter t is less than the maximum number of training steps L, then t ← t + 1 and the method enters S4; otherwise, it enters S27;
S4, the robot updates the particle weights according to the observation model Z_θ(s, o), and enters S5;
S5, calculating the average belief state, and entering S6;
S6, sampling the M particles with the highest weights among the K particles, and entering S7;
S7, normalizing the weights of the M particles, and entering S8;
S8, combining the M particles with the average belief state, copying the combination N times and giving each copy its particle weights, so that N new sets of weighted particles are obtained, where the superscript (n) denotes the n-th copy; entering S9;
S9, setting the planning time-step counter i to t-1, and entering S10;
S10, if the planning time-step counter is less than the maximum number of planning steps H, then i ← i + 1 and the method enters S11; otherwise, it enters S19;
S11, for each copy, obtaining an action from the policy network, and entering S12;
S12, for each particle in each copy, obtaining the next-time state and reward from the transfer model D_ψ, and entering S13;
S13, updating the average belief state of each copy, and entering S14;
S14, estimating the information entropy of the belief state of each copy, based on the estimate of the current belief state, and entering S15;
S15, updating the particle weights of each copy, where A^{(m)(n)} denotes the advantage function, and entering S16;
S16, if resampling is needed, entering S17; otherwise, entering S18;
S17, resampling the copied particles, and entering S18;
S18, entering S10;
S19, sampling n uniformly from 1 to N, outputting the first action a_t of the planned trajectory of the n-th copy of the robot, and entering S20;
S20, the robot takes action a_t and interacts with the training environment to obtain the next-time state s_{t+1}, the next-time observation o_{t+1} and the reward r_t, and enters S21;
S21, if resampling is needed, entering S22; otherwise, entering S23;
S22, resampling the belief-state particles, and entering S23;
S23, updating the belief-state particles according to the transfer model, and entering S24;
S24, storing the data tuple into the playback pool, and entering S25;
S25, the learning unit samples training data from the playback pool and updates the network parameters, and the method enters S26;
S26, entering S3;
S27, finishing the training, and outputting the trained networks for robot navigation control. The specific steps of the use stage of the robot navigation control can be obtained by cancelling the playback pool and the learning unit and skipping steps S24 and S25; in that case the environment in S20 only needs to provide observation and reward information, not the real state information.
In the above technical solution, the environment in which the robot is located (the training environment) is modeled as a POMDP, which may be represented by the following six-tuple:
(1) the state space S, where s_t ∈ S denotes the state of the robot at time t;
(2) the action space A, where a_t ∈ A denotes the action taken by the robot at time t;
(3) the transition probability function T: S × A × S → [0, 1], where T(s_t, a_t, s_{t+1}) denotes the probability that the robot in state s_t transitions to s_{t+1} after taking action a_t;
(4) the reward function R: S × A → R, where R(s_t, a_t) denotes the immediate reward available to the robot in state s_t taking action a_t;
(5) the observation space O, where o_t ∈ O denotes the observation obtained by the robot at time t;
(6) the observation probability function Z: S × A × O → [0, 1], where Z(s_t, a_{t-1}, o_t) denotes the probability that the robot obtains observation o_t after taking action a_{t-1} and transitioning to s_t.
The goal of the POMDP is to obtain a strategy π: H → A based on the historical sequence of actions and observations so as to maximize the expected cumulative reward, where the cumulative reward G_t is defined as
G_t = Σ_{k=0}^{∞} γ^k r_{t+k},
in which γ ∈ (0, 1] is a discount factor used to weigh the immediate reward against delayed rewards, and r_t denotes the reward the robot receives at time t.
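For illustration only, the cumulative reward defined above can be computed as in the following short sketch, using the discount factor γ = 0.95 from the embodiment and a sparse navigation reward as an example:

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    """Cumulative discounted reward G_t = sum_k gamma^k * r_{t+k}, evaluated at t = 0."""
    return float(sum(g * r for g, r in zip(gamma ** np.arange(len(rewards)), rewards)))

# Example: a sparse navigation reward of +100 received at step 20.
rewards = [0.0] * 20 + [100.0]
print(discounted_return(rewards, gamma=0.95))  # 100 * 0.95**20, roughly 35.8
```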
In the above technical solution, the belief state b_t(s) = p(s_t = s | h_t) denotes the probability that s_t equals s given the known history h_t = {b_0, a_0, o_1, …, a_{t-1}, o_t}, where b_0 denotes the initial state probability distribution.
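For illustration only, the following sketch shows the exact Bayes-rule belief update implied by this definition for a small discrete example (two hypothetical states standing in for "left room" and "right room"); the transition and observation tables are invented for the example. It is the cost of this exact update in large state spaces that motivates the particle approximation described next.

```python
import numpy as np

def belief_update(b, a_idx, o_idx, T, Z):
    """b'(s') is proportional to Z(s', a, o) * sum_s T(s, a, s') * b(s)."""
    predicted = T[:, a_idx, :].T @ b          # predict: sum over previous states
    b_new = Z[:, a_idx, o_idx] * predicted    # correct: weight by observation probability
    return b_new / b_new.sum()

# Two states, one action, two observations; weakly informative sensing (toy numbers).
T = np.array([[[0.9, 0.1]], [[0.1, 0.9]]])   # T[s, a, s']
Z = np.array([[[0.6, 0.4]], [[0.4, 0.6]]])   # Z[s', a, o]
b0 = np.array([0.5, 0.5])
print(belief_update(b0, a_idx=0, o_idx=0, T=T, Z=Z))  # -> [0.6, 0.4]
```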
In the foregoing technical solution, in S1, the networks include:
the transfer model D_ψ, where ψ is the parameter of the transfer model;
the observation model Z_θ, where θ is the parameter of the observation model;
the policy network π_ρ, where ρ is the parameter of the policy network;
the double Q-value network Q_ω, where ω is the parameter of the double Q-value network.
The transfer model D_ψ is used for updating the state particles in the filtering unit and for simulation in the planning unit; its input is a state and an action, its output is the state and reward at the next time, and its network structure is a fully connected network. The observation model Z_θ is used for updating the particle weights in the filtering unit; its input is a state and an observation, its output is the probability of the observation, and its network structure is a fully connected network. The policy network π_ρ is used for providing the strategy for the robot's simulation in the planning unit; its input is the belief-state particles and the average belief state, and its output is an action and the logarithm of the probability of that action. Its network structure outputs the mean μ and variance σ² of the action through a fully connected network, samples the action from the Gaussian distribution N(μ, σ²), and computes the log-probability of the action from that Gaussian distribution. The double Q-value network Q_ω is used for updating the particle weights in the planning unit; its input is a state and an action, and its output is two Q values. The double Q-value network consists of two fully connected networks Q_1 and Q_2, and for each Q_i (i = 1, 2) a target Q-value network TQ_i (i = 1, 2) with the same structure is maintained for network parameter updates.
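For illustration only, the following PyTorch sketch shows one way the four networks described above could be realized. The concrete dimensions (a 2-dimensional state and action, a 4-dimensional observation, 256-unit hidden layers), the sigmoid output standing in for an observation probability, and the log-variance parameterization are assumptions of the sketch, loosely following the embodiment given later; they are not prescribed by the patent.

```python
import torch
import torch.nn as nn

class TransferModel(nn.Module):
    """D_psi: (state, action) -> (next state, reward)."""
    def __init__(self, state_dim=2, action_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim + 1))
    def forward(self, s, a):
        out = self.net(torch.cat([s, a], dim=-1))
        return out[..., :-1], out[..., -1]          # next state, reward

class ObservationModel(nn.Module):
    """Z_theta: (state, observation) -> probability of the observation (sigmoid is an assumption)."""
    def __init__(self, state_dim=2, obs_dim=4, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
    def forward(self, s, o):
        return torch.sigmoid(self.net(torch.cat([s, o], dim=-1)))

class PolicyNetwork(nn.Module):
    """pi_rho: belief features -> Gaussian action and its log-probability."""
    def __init__(self, belief_dim=8, action_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(belief_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * action_dim))       # mean and log-variance
    def forward(self, b):
        mu, log_var = self.net(b).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mu, torch.exp(0.5 * log_var))
        a = dist.rsample()
        return a, dist.log_prob(a).sum(-1)

class DoubleQNetwork(nn.Module):
    """Q_omega: two independent Q heads; the smaller output is used downstream."""
    def __init__(self, state_dim=2, action_dim=2, hidden=256):
        super().__init__()
        def head():
            return nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1))
        self.q1, self.q2 = head(), head()
    def forward(self, s, a):
        x = torch.cat([s, a], dim=-1)
        return self.q1(x), self.q2(x)
```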
In the above technical solution, in S2, representing the belief state with weighted particles is a common approximation for dealing with the otherwise prohibitive computational complexity of exact belief-state updating; the process of updating the particles is called particle filtering, or the sequential Monte Carlo method.
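For illustration only, a minimal numpy sketch of the weighted-particle belief representation and weight update of S2 and S4 is given below; the helper names, the 2-dimensional state and the Gaussian stand-in for the observation model Z_θ are assumptions of the sketch.

```python
import numpy as np

def init_particles(prior_sampler, K=100):
    """S2: draw K belief-state particles from the initial-state prior, weights all set to 1."""
    particles = np.stack([prior_sampler() for _ in range(K)])   # shape (K, state_dim)
    weights = np.ones(K)
    return particles, weights

def reweight(particles, weights, obs, obs_likelihood):
    """S4: scale each particle weight by the observation model Z_theta(s, o)."""
    return weights * np.array([obs_likelihood(s, obs) for s in particles])

def average_belief(particles, weights):
    """S5: the weighted mean of the particles as a summary of the belief state."""
    w = weights / (weights.sum() + 1e-12)
    return (w[:, None] * particles).sum(axis=0)

# Example with a hypothetical 2-D state and a dummy Gaussian likelihood.
rng = np.random.default_rng(0)
prior = lambda: rng.uniform(0.0, 10.0, size=2)
lik = lambda s, o: float(np.exp(-0.5 * np.sum((s - o) ** 2)))
P, W = init_particles(prior, K=100)
W = reweight(P, W, obs=np.array([5.0, 5.0]), obs_likelihood=lik)
print(average_belief(P, W))
```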
In the above technical solution, S7-S19 constitute the planning unit: the robot performs simulated planning with each copy of the belief-state particles in order to select an optimal action.
In the above technical solution, in S14, the belief-state particles are used to estimate the belief-state probability distribution when estimating the belief-state information entropy; a kernel density estimation (KDE) method with a Gaussian kernel is used to estimate the probability distribution of the belief state.
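For illustration only, the following sketch estimates the belief-state entropy from particles with a Gaussian KDE; the equal particle weighting and the fixed bandwidth are assumptions of the sketch (the embodiment below uses Silverman's empirical bandwidth instead).

```python
import numpy as np

def gaussian_kde_entropy(particles, bandwidth=0.5):
    """Estimate H(b) = -E[log p(s)] from particles using a Gaussian KDE.

    particles: array of shape (M, D); bandwidth: scalar standard deviation of the kernel.
    """
    M, D = particles.shape
    diff = particles[:, None, :] - particles[None, :, :]           # (M, M, D)
    sq = np.sum(diff ** 2, axis=-1) / (2.0 * bandwidth ** 2)
    log_norm = -0.5 * D * np.log(2.0 * np.pi * bandwidth ** 2)
    # log p(s_i): kernels centred at every particle, averaged over the M particles
    log_p = log_norm + np.log(np.exp(-sq).sum(axis=1)) - np.log(M)
    return float(-np.mean(log_p))                                   # Monte Carlo entropy estimate

particles = np.random.default_rng(1).normal(size=(30, 2))
print(gaussian_kde_entropy(particles))
```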
In the above technical solution, in S15, the advantage function A is computed from the temporal-difference (TD) error, with Q_ω taken as the smaller of the two outputs of the double Q-value network. When the advantage function A is computed, reward shaping based on the negative information entropy of the belief state is added, so that the robot is encouraged to take information-gathering actions and the efficiency and stability of the algorithm are improved.
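The shaping formula itself is not reproduced here; one standard way to realize the description above is potential-based shaping with the negative belief entropy as the potential, as in the following sketch (the function name and the form F = γΦ(b_{t+1}) - Φ(b_t) are assumptions of the sketch, not the patent's exact formula):

```python
def shaped_reward(r, entropy_next, entropy_curr, gamma=0.95):
    """Potential-based reward shaping with potential Phi(b) = -H(b).

    r:            environment reward r_t
    entropy_curr: estimated belief-state entropy H(b_t)
    entropy_next: estimated belief-state entropy H(b_{t+1})
    """
    phi_curr = -entropy_curr
    phi_next = -entropy_next
    return r + gamma * phi_next - phi_curr   # potential-based shaping leaves the optimal policy unchanged

# Example: an information-gathering action that sharply reduces belief entropy gets a positive bonus.
print(shaped_reward(r=0.0, entropy_next=0.3, entropy_curr=1.8, gamma=0.95))
```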
In the above technical solutions, in S17 and S22, resampling is a technique commonly used in particle filtering to prevent particle degeneracy. Specifically, N particles are drawn with replacement from the N weighted particles with probability proportional to their weights, and the weights of the new particles are then all reset to 1.
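For illustration only, a minimal numpy sketch of this multinomial resampling step is:

```python
import numpy as np

def resample(particles, weights, rng=np.random.default_rng()):
    """Draw N particles with replacement, proportionally to weight; reset all weights to 1."""
    N = len(particles)                      # particles: numpy array of shape (N, state_dim)
    p = weights / weights.sum()
    idx = rng.choice(N, size=N, replace=True, p=p)
    return particles[idx].copy(), np.ones(N)
```

The patent does not specify when resampling "is needed"; a common criterion (an assumption here, not taken from the patent) is to resample when the effective sample size (Σw)²/Σw² drops below a threshold such as N/2.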
In the foregoing technical solution, in S25, updating the network parameters includes:
(1) The transfer model and the observation model use the mean squared error between the predicted value and the true value as the loss function; the loss is optimized with a specified optimization method, such as stochastic gradient descent or Adam, and the network parameters are updated.
(2) The two networks of the double Q-value network are updated in the same way, based on the temporal-difference (TD) error with respect to the target Q-value networks, where α is the temperature coefficient controlling the degree of attention paid to the strategy entropy. The loss is optimized with a specified optimization method, such as stochastic gradient descent or Adam, and the network parameters are updated. In addition, the Q-value network parameters are copied to the target Q-value networks every fixed number of update steps.
(3) The policy function is updated based on a loss function in which α is the temperature coefficient and Q_ω is the smaller of the two outputs of the double Q-value network. The loss is optimized with a specified optimization method, such as stochastic gradient descent or Adam, and the network parameters are updated.
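The loss formulas appear only as images in the original. The description (a TD error against target Q-value networks, a temperature coefficient α weighting the strategy entropy, and the smaller of the two Q outputs) matches the standard soft actor-critic losses, so the following PyTorch sketch is one plausible reading rather than the patent's exact formulas. It assumes double_q and target_q are callables returning the two Q values, and that the policy returns an action together with its log-probability.

```python
import torch
import torch.nn.functional as F

def q_loss(double_q, target_q, policy, batch, alpha=1.0, gamma=0.95):
    """TD loss for both Q heads against the target networks (SAC-style, assumed form)."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]
    with torch.no_grad():
        a_next, logp_next = policy(s_next)
        tq1, tq2 = target_q(s_next, a_next)
        y = r + gamma * (torch.min(tq1, tq2).squeeze(-1) - alpha * logp_next)  # soft TD target
    q1, q2 = double_q(s, a)
    return F.mse_loss(q1.squeeze(-1), y) + F.mse_loss(q2.squeeze(-1), y)

def policy_loss(double_q, policy, batch, alpha=1.0):
    """Minimize alpha * log pi - min(Q1, Q2), i.e. maximize soft value."""
    s = batch["s"]
    a, logp = policy(s)
    q1, q2 = double_q(s, a)
    return (alpha * logp - torch.min(q1, q2).squeeze(-1)).mean()

def update_targets(double_q, target_q):
    """Copy the Q parameters to the target networks every fixed number of update steps
    (assumes both are torch.nn.Module instances)."""
    target_q.load_state_dict(double_q.state_dict())
```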
In order to achieve the above object, the present invention provides a robot navigation control system based on partially observable reinforcement learning, comprising: a filtering unit, a planning unit, a playback pool and a learning unit.
The filtering unit is used for updating the belief-state particles and their weights, and for interacting with the training environment, using the actions obtained from the planning unit, to obtain state, observation and reward information. It also processes the training data and stores it in the playback pool.
The planning unit is used for receiving the weighted particles provided by the filtering unit, performing simulated planning with the learned transfer model and the policy network, and outputting the action provided to the filtering unit.
The playback pool is used for storing the processed training data and providing the learning unit with the training data required for learning, i.e. a data set consisting of the tuples stored in the playback pool by the filtering unit.
The learning unit is used for sampling the training data in the playback pool, training the networks with a given optimization method, and providing the updated network parameters to the filtering unit and the planning unit.
In the system, the training environment, i.e. the actual application environment or a highly realistic virtual environment, is used for training the robot navigation control method; it interacts with the filtering unit and provides state, observation and reward information for filtering.
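For illustration only, a minimal sketch of such a playback pool, assuming a fixed capacity and uniform random sampling of the stored tuples, is:

```python
import random
from collections import deque

class PlaybackPool:
    """Fixed-capacity store of training tuples with uniform random sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        """transition: e.g. (particles, action, reward, observation, next_particles)."""
        self.buffer.append(transition)

    def sample(self, batch_size=64):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```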
Based on the above technical scheme, the neural networks can be trained and then used in practice. The playback pool and the learning unit are cancelled, and the specific steps of the use stage can be obtained by skipping steps S24 and S25; in this case the environment in S20 only needs to provide observation and reward information, not the real state information.
Advantageous effects: owing to the application of the above technical scheme, the invention has the following advantages over the prior art:
The invention handles the robot navigation control task with reinforcement learning and can learn the control strategy from data generated by interaction with the environment. This removes the need for accurate environment modeling required by traditional control methods and expands the application range of the control method.
The invention models the environment as a POMDP and can thus characterize the uncertainty in the environment. Traditional methods have difficulty handling tasks with occlusion, limited sensor detection range and sensor noise, while the invention can effectively handle navigation control tasks in such environments.
The invention adopts a model-based partially observable reinforcement learning algorithm, which can improve the utilization rate of training samples and the training efficiency.
The invention adopts reward shaping based on a potential function, which can effectively alleviate the sparse-reward problem of real robot navigation tasks without changing the optimal strategy, and improves the training efficiency and stability of the algorithm.
Using the negative information entropy of the belief state as the potential function in the reward-shaping method encourages the robot to take information-gathering actions, so that an optimal strategy is easier to obtain than with traditional control methods.
Drawings
FIG. 1 is a diagram of an overall training framework for an embodiment of the present invention;
FIG. 2 is a diagram of unit interactions during a training phase in accordance with an embodiment of the present invention;
FIG. 3 is a diagram of cell interaction during a use phase according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an embodiment of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention; various equivalent modifications of the invention that occur to those skilled in the art upon reading the present disclosure fall within the scope of the appended claims.
Fig. 4 is a top view of the robot navigation environment, in which the robot is in one of two rooms of the same size, on the left and on the right. The state of the robot is its absolute coordinates within the whole house. The robot can move in any direction with limited speed. The robot is provided with 4 sensors in the four cardinal directions (up, down, left and right); each sensor measures the distance from the robot to the nearest wall in that direction, with Gaussian noise. The initial position of the robot is random, and its goal is to reach the charging area below the left room or above the right room; a reward of +100 is obtained when the robot reaches the target position. Because the robot only receives observations, it cannot tell which room it is in unless it reaches the shaded part of the figure, where the change in the wall distances allows it to determine which room it is in.
The steps of the robot in the training phase are as follows:
S1, initializing network parameters, including: the parameter ψ of the transfer model D_ψ, the parameter θ of the observation model Z_θ, the parameter ρ of the policy network π_ρ, and the parameter ω of the double Q-value network Q_ω; setting the training time-step counter t to 0, and entering S2;
S2, generating K = 100 weighted belief-state particles from the prior over the initial state and setting all of their initial weights to 1; the robot obtains an initial observation o_1 through its sensors and enters S3;
S3, if the training time-step counter t is less than the maximum number of training steps L = 10,000, then t ← t + 1 and the method enters S4; otherwise, it enters S27;
S4, the robot updates the particle weights according to the observation model Z_θ(s, o), and enters S5;
S5, calculating the average belief state, and entering S6;
S6, sampling the M = 3 particles with the highest weights among the 100 particles; as in FIG. 2 and FIG. 3, the filtering unit inputs the weighted particles to the planning unit, and the method enters S7;
S7, normalizing the weights of the M particles, and entering S8;
S8, combining the particles with the average belief state, copying the combination N = 30 times and giving each copy its particle weights, so that N new sets of weighted particles are obtained, where the superscript (n) denotes the n-th copy; entering S9;
S9, setting the planning time-step counter i to t-1, and entering S10;
S10, if the planning time-step counter is less than the maximum number of planning steps H = 10, then i ← i + 1 and the method enters S11; otherwise, it enters S19;
S11, for each copy, obtaining an action from the policy network, and entering S12;
S12, for each particle in each copy, obtaining the next-time state and reward from the transfer model D_ψ, and entering S13;
S13, updating the average belief state of each copy, and entering S14;
S14, estimating the information entropy of the belief state of each copy, based on the estimate of the current belief state, and entering S15;
S15, updating the particle weights of each copy, where A^{(m)(n)} denotes the advantage function, and entering S16;
S16, if resampling is needed, entering S17; otherwise, entering S18;
S17, resampling the copied particles, and entering S18;
S18, entering S10;
S19, sampling n uniformly from 1 to N, outputting the first action a_t of the planned trajectory of the n-th copy of the robot; as in FIG. 2 and FIG. 3, the planning unit inputs the action to the filtering unit, and the method enters S20;
S20, the robot takes action a_t and interacts with the training environment to obtain the next-time state s_{t+1}, the next-time observation o_{t+1} and the reward r_t; as in FIG. 2, s_{t+1}, o_{t+1} and r_t are input to the filtering unit, and the method enters S21;
S21, if resampling is needed, entering S22; otherwise, entering S23;
S22, resampling the belief-state particles, and entering S23;
S23, updating the belief-state particles according to the transfer model, and entering S24;
S24, storing the data tuple into the playback pool of FIG. 2, and entering S25;
S25, the learning unit of FIG. 2 samples training data from the playback pool, updates the network parameters, transmits the updated parameters to the filtering unit and the planning unit, and enters S26;
S26, entering S3;
S27, finishing the training, and outputting the trained networks for robot navigation control. Referring to FIG. 3, the specific steps of the use stage of the robot navigation control can be obtained by cancelling the playback pool and the learning unit and skipping steps S24 and S25; in this case the environment in S20 only needs to provide observation and reward information to the filtering unit, not the real state information.
The block diagram of the whole training process is shown in FIG. 1.
In the above embodiment, in S1, the networks include: the transfer model D_ψ, where ψ is the parameter of the transfer model; the observation model Z_θ, where θ is the parameter of the observation model; the policy network π_ρ, where ρ is the parameter of the policy network; and the double Q-value network Q_ω, where ω is the parameter of the double Q-value network. The transfer model D_ψ takes a state and an action as input and outputs the state and reward at the next time; its network structure is a 4-layer fully connected network with 256/256/256/3 neurons per layer. The observation model Z_θ takes a state and an observation as input and outputs the probability of the observation; its network structure is a 4-layer fully connected network with 256/256/256/1 neurons per layer. The policy network π_ρ takes the belief-state particles and the average belief state as input and outputs an action and the logarithm of its probability; its network structure outputs the mean μ and variance σ² of the action through a fully connected network, samples the action from the Gaussian distribution N(μ, σ²), and computes the log-probability of the action from that Gaussian distribution; it is a 3-layer fully connected network with 256/256/4 neurons per layer. The double Q-value network Q_ω takes a state and an action as input and outputs two Q values; it consists of two fully connected networks Q_1 and Q_2, each a 3-layer fully connected network with 256/256/1 neurons per layer, and for each Q_i (i = 1, 2) a target Q-value network TQ_i (i = 1, 2) with the same structure is maintained for network parameter updates. The parameters are initialized with the PyTorch default initialization method.
In the above embodiment, S10-S19 constitute the planning unit: the robot performs simulated planning with each copy of the belief-state particles in order to select an optimal action.
In the above embodiment, in S14, the belief-state particles are used to estimate the belief-state probability distribution when estimating the belief-state information entropy, using a multivariate Gaussian kernel density estimation method with the Silverman empirical bandwidth. The bandwidth matrix H is diagonal, and each element on its main diagonal is computed from the standard deviation of the state particles in the corresponding dimension, where D is the dimension of the state.
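For illustration only, the following sketch combines a diagonal Gaussian KDE with one common form of Silverman's rule of thumb; the exact constant in the bandwidth formula is an assumption, since the patent only states that the diagonal elements of H are computed from the per-dimension standard deviation of the state particles.

```python
import numpy as np

def silverman_bandwidths(particles):
    """Diagonal bandwidths h_i = sigma_i * (4 / ((D + 2) * M)) ** (1 / (D + 4)).

    This is one common form of Silverman's rule of thumb, not necessarily the
    patent's exact formula.
    """
    M, D = particles.shape
    sigma = particles.std(axis=0, ddof=1)
    return sigma * (4.0 / ((D + 2) * M)) ** (1.0 / (D + 4))

def kde_log_density(query, particles, h):
    """Log of a diagonal-bandwidth Gaussian KDE evaluated at the query points."""
    diff = (query[:, None, :] - particles[None, :, :]) / h       # (Q, M, D)
    sq = 0.5 * np.sum(diff ** 2, axis=-1)
    log_norm = -np.sum(np.log(h)) - 0.5 * len(h) * np.log(2 * np.pi)
    return log_norm + np.log(np.exp(-sq).mean(axis=1))

particles = np.random.default_rng(2).normal(size=(30, 2))
h = silverman_bandwidths(particles)
entropy = -np.mean(kde_log_density(particles, particles, h))    # entropy of the particle belief
print(h, entropy)
```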
In the above embodiment, in S15, the advantage function A is computed from the temporal-difference (TD) error, with Q_ω taken as the smaller of the two outputs of the double Q-value network. When the advantage function A is computed, reward shaping based on the negative information entropy of the belief state is added, so that the robot is encouraged to take information-gathering actions and the efficiency and stability of the algorithm are improved.
In the above embodiment, in S17 and S22, resampling is a technique commonly used in particle filtering to prevent particle degeneracy. Specifically, N particles are drawn with replacement from the N weighted particles with probability proportional to their weights, and the weights of the new particles are then all reset to 1.
In the foregoing embodiment, in S25, updating the network parameters includes:
(1) The transfer model and the observation model use the mean squared error between the predicted value and the true value as the loss function; the loss is optimized with the Adam optimization method at a learning rate of 0.001, and the network parameters are updated.
(2) The two networks of the double Q-value network are updated in the same way, based on the temporal-difference (TD) error with respect to the target Q-value networks, where α = 1 is the temperature coefficient controlling the degree of attention paid to the strategy entropy and γ = 0.95 is the discount factor. The loss is optimized with the Adam optimization method at a learning rate of 0.001, and the network parameters are updated. In addition, the Q-value network parameters are copied to the target Q-value networks every 5 update steps.
(3) The policy function is updated based on a loss function in which α = 1 is the temperature coefficient and Q_ω is the smaller of the two outputs of the double Q-value network. The loss is optimized with the Adam optimization method at a learning rate of 0.001, and the network parameters are updated.
The specific structure of this embodiment comprises: a filtering unit, a planning unit, a playback pool and a learning unit.
The filtering unit is used for updating the belief-state particles and their weights, and for interacting with the training environment, using the actions obtained from the planning unit, to obtain state, observation and reward information. It also processes the training data and stores it in the playback pool.
The planning unit is used for receiving the weighted particles provided by the filtering unit, performing simulated planning with the learned transfer model and the policy network, and outputting the action provided to the filtering unit.
The playback pool is used for storing the processed training data and providing the learning unit with the training data required for learning, i.e. a data set consisting of the tuples stored in the playback pool by the filtering unit.
The learning unit is used for sampling the training data in the playback pool, training the networks with a given optimization method, and providing the updated network parameters to the filtering unit and the planning unit.
After the training phase is finished, the playback pool and the learning unit are cancelled, and the specific steps of the use stage can be obtained by skipping steps S24 and S25; in this case the environment in S20 only needs to provide observation and reward information, not the real state information.

Claims (10)

1. A robot navigation control method based on partially observable reinforcement learning, which is characterized by comprising the following steps:
S1, initializing network parameters, including: the parameter ψ of the transfer model D_ψ, the parameter θ of the observation model Z_θ, the parameter ρ of the policy network π_ρ, and the parameter ω of the double Q-value network Q_ω; setting the training time-step counter t to 0, and entering S2;
S2, generating K weighted belief-state particles from the prior over the initial state and setting all of their initial weights to 1; the robot obtains an initial observation o_1 through its sensors and enters S3;
S3, if the training time-step counter t is less than the maximum number of training steps L, then t ← t + 1 and the method enters S4; otherwise, it enters S27;
S4, the robot updates the particle weights according to the observation model Z_θ(s, o), and enters S5;
S5, calculating the average belief state, and entering S6;
S6, sampling the M particles with the highest weights among the K particles, and entering S7;
S7, normalizing the weights of the M particles, and entering S8;
S8, combining the M particles with the average belief state, copying the combination N times and giving each copy its particle weights, so that N new sets of weighted particles are obtained, where the superscript (n) denotes the n-th copy; entering S9;
S9, setting the planning time-step counter i to t-1, and entering S10;
S10, if the planning time-step counter is less than the maximum number of planning steps H, then i ← i + 1 and the method enters S11; otherwise, it enters S19;
S11, for each copy, obtaining an action from the policy network, and entering S12;
S12, for each particle in each copy, obtaining the next-time state and reward from the transfer model D_ψ, and entering S13;
S13, updating the average belief state of each copy, and entering S14;
S14, estimating the information entropy of the belief state of each copy, based on the estimate of the current belief state, and entering S15;
S15, updating the particle weights of each copy, where A^{(m)(n)} denotes the advantage function, and entering S16;
S16, if resampling is needed, entering S17; otherwise, entering S18;
S17, resampling the copied particles, and entering S18;
S18, entering S10;
S19, sampling n uniformly from 1 to N, outputting the first action a_t of the planned trajectory of the n-th copy of the robot, and entering S20;
S20, the robot takes action a_t and interacts with the training environment to obtain the next-time state s_{t+1}, the next-time observation o_{t+1} and the reward r_t, and enters S21;
S21, if resampling is needed, entering S22; otherwise, entering S23;
S22, resampling the belief-state particles, and entering S23;
S23, updating the belief-state particles according to the transfer model, and entering S24;
S24, storing the data tuple into the playback pool, and entering S25;
S25, the learning unit samples training data from the playback pool and updates the network parameters, and the method enters S26;
S26, entering S3;
S27, finishing the training, and outputting the trained networks for robot navigation control.
2. The method of claim 1, wherein, when the trained networks are used for robot navigation control, the playback pool and the learning unit are cancelled and steps S24 and S25 are skipped to obtain the specific steps of the use stage of the robot navigation control, so that the environment in S20 only needs to provide observation and reward information, not the real state information.
3. The method of claim 1, wherein the training environment of the robot is modeled as a POMDP, which is expressed by the following six-tuple:
(1) the state space S, where s_t ∈ S denotes the state of the robot at time t;
(2) the action space A, where a_t ∈ A denotes the action taken by the robot at time t;
(3) the transition probability function T: S × A × S → [0, 1], where T(s_t, a_t, s_{t+1}) denotes the probability that the robot in state s_t transitions to s_{t+1} after taking action a_t;
(4) the reward function R: S × A → R, where R(s_t, a_t) denotes the immediate reward available to the robot in state s_t taking action a_t;
(5) the observation space O, where o_t ∈ O denotes the observation obtained by the robot at time t;
(6) the observation probability function Z: S × A × O → [0, 1], where Z(s_t, a_{t-1}, o_t) denotes the probability that the robot obtains observation o_t after taking action a_{t-1} and transitioning to s_t;
the goal of the POMDP is to obtain a strategy π: H → A based on the historical sequence of actions and observations so as to maximize the expected cumulative reward, where the cumulative reward G_t is defined as G_t = Σ_{k=0}^{∞} γ^k r_{t+k}, in which γ ∈ (0, 1] is a discount factor used to weigh the immediate reward against delayed rewards and r_t denotes the reward the robot receives at time t.
4. The method for controlling robot navigation based on partially observable reinforcement learning of claim 1, wherein the belief state b_t(s) = p(s_t = s | h_t) denotes the probability that s_t equals s given the known history h_t = {b_0, a_0, o_1, …, a_{t-1}, o_t}, where b_0 denotes the initial state probability distribution.
5. The method of claim 1, wherein the transfer model D_ψ takes a state and an action as input and outputs the state and reward at the next time, its network structure being a fully connected network; the observation model Z_θ takes a state and an observation as input and outputs the probability of the observation, its network structure being a fully connected network; the policy network π_ρ takes the belief-state particles and the average belief state as input and outputs an action and the logarithm of the probability of that action, its network structure outputting the mean μ and variance σ² of the action through a fully connected network, sampling the action from the Gaussian distribution N(μ, σ²) and computing the log-probability of the action from that Gaussian distribution; the double Q-value network Q_ω takes a state and an action as input and outputs two Q values, the double Q-value network consisting of two fully connected networks Q_1 and Q_2, and for each Q_i (i = 1, 2) a target Q-value network TQ_i (i = 1, 2) with the same structure is maintained for network parameter updates.
6. The method for controlling robot navigation based on partially observable reinforcement learning of claim 1, wherein, in S14, the belief-state particles are used to estimate the belief-state probability distribution in the estimation of the belief-state information entropy, and the probability distribution of the belief state is estimated with a kernel density estimation method using a Gaussian kernel.
7. The method for controlling robot navigation based on partially observable reinforcement learning of claim 1, wherein, in S15, the advantage function A is computed from the temporal-difference (TD) error, with Q_ω taken as the smaller of the two outputs of the double Q-value network.
8. The method for controlling robot navigation based on partially observable reinforcement learning of claim 1, wherein, in S17 and S22, the resampling draws N particles with replacement from the N weighted particles with probability proportional to their weights, and then sets the weights of the new particles to 1.
9. The method for controlling robot navigation based on partially observable reinforcement learning of claim 1, wherein, in S25, updating the network parameters includes:
(1) the transfer model and the observation model use the mean squared error between the predicted value and the true value as the loss function; the loss function is optimized with an optimization method, and the network parameters are updated;
(2) the two networks of the double Q-value network are updated in the same way, based on the temporal-difference error with respect to the target Q-value networks, where α is the temperature coefficient controlling the degree of attention paid to the strategy entropy; the loss function is optimized with an optimization method and the network parameters are updated; the Q-value network parameters are copied to the target Q-value networks every fixed number of update steps;
(3) the policy function is updated based on a loss function in which α is the temperature coefficient and Q_ω is the smaller of the two outputs of the double Q-value network; the loss function is optimized with an optimization method and the network parameters are updated.
10. A robot navigation control system based on partially observable reinforcement learning, characterized by comprising: a filtering unit, a planning unit, a playback pool and a learning unit;
the filtering unit is used for updating the belief-state particles and their weights, and for interacting with the training environment, using the actions obtained from the planning unit, to obtain state, observation and reward information; it is also used for processing the training data and storing it in the playback pool;
the planning unit is used for receiving the weighted particles provided by the filtering unit, performing simulated planning with the learned transfer model and the policy network, and outputting the action provided to the filtering unit;
the playback pool is a database supporting random access, used for storing the processed training data and providing the learning unit with the training data required for learning;
the learning unit is used for sampling the training data in the playback pool, training the networks with a given optimization method, and providing the updated network parameters to the filtering unit and the planning unit;
the training environment is the actual application environment or a simulated virtual environment, used for training the robot navigation control method; it interacts with the filtering unit and provides state, observation and reward information for filtering.
CN202210366719.1A 2022-04-08 2022-04-08 Robot navigation control method and system based on partial observable reinforcement learning Pending CN114911157A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210366719.1A CN114911157A (en) 2022-04-08 2022-04-08 Robot navigation control method and system based on partial observable reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210366719.1A CN114911157A (en) 2022-04-08 2022-04-08 Robot navigation control method and system based on partial observable reinforcement learning

Publications (1)

Publication Number Publication Date
CN114911157A true CN114911157A (en) 2022-08-16

Family

ID=82762508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210366719.1A Pending CN114911157A (en) 2022-04-08 2022-04-08 Robot navigation control method and system based on partial observable reinforcement learning

Country Status (1)

Country Link
CN (1) CN114911157A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115826013A (en) * 2023-02-15 2023-03-21 广东工业大学 Beidou satellite positioning method based on lightweight reinforcement learning in urban multipath environment
CN115826013B (en) * 2023-02-15 2023-04-21 广东工业大学 Beidou satellite positioning method based on light reinforcement learning under urban multipath environment

Similar Documents

Publication Publication Date Title
CN110928189B (en) Robust control method based on reinforcement learning and Lyapunov function
Song et al. New chaotic PSO-based neural network predictive control for nonlinear process
CN114839884B (en) Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN114967713B (en) Underwater vehicle buoyancy discrete change control method based on reinforcement learning
CN116700327A (en) Unmanned aerial vehicle track planning method based on continuous action dominant function learning
CN116052254A (en) Visual continuous emotion recognition method based on extended Kalman filtering neural network
CN114626307B (en) Distributed consistent target state estimation method based on variational Bayes
CN114911157A (en) Robot navigation control method and system based on partial observable reinforcement learning
Wei et al. Boosting offline reinforcement learning with residual generative modeling
CN115972211A (en) Control strategy offline training method based on model uncertainty and behavior prior
CN111798494A (en) Maneuvering target robust tracking method under generalized correlation entropy criterion
CN115374933A (en) Intelligent planning and decision-making method for landing behavior of multi-node detector
CN114626505A (en) Mobile robot deep reinforcement learning control method
CN105424043A (en) Motion state estimation method based on maneuver judgment
CN113407820B (en) Method for processing data by using model, related system and storage medium
Wang et al. A KNN based Kalman filter Gaussian process regression
Yin et al. Sample efficient deep reinforcement learning via local planning
Du et al. A novel locally regularized automatic construction method for RBF neural models
Xu et al. Residual autoencoder-LSTM for city region vehicle emission pollution prediction
CN115009291B (en) Automatic driving assistance decision making method and system based on network evolution replay buffer area
CN114995106A (en) PID self-tuning method, device and equipment based on improved wavelet neural network
WO2021140698A1 (en) Information processing device, method, and program
CN114662656A (en) Deep neural network model training method, autonomous navigation method and system
CN108960406B (en) MEMS gyroscope random error prediction method based on BFO wavelet neural network
Li et al. Covid-19 Epidemic Trend Prediction Based on CNN-StackBiLSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination