CN106950969A - Continuous control method for a mobile robot based on a mapless motion planner - Google Patents
- Publication number
- CN106950969A CN106950969A CN201710294685.9A CN201710294685A CN106950969A CN 106950969 A CN106950969 A CN 106950969A CN 201710294685 A CN201710294685 A CN 201710294685A CN 106950969 A CN106950969 A CN 106950969A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0276—Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0219—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory ensuring the processing of the whole working surface
Landscapes
- Engineering & Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Manipulator (AREA)
- Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
Abstract
The present invention proposes a continuous control method for a mobile robot based on a mapless motion planner. Its main contents include: the mapless motion planner, the asynchronous deep deterministic policy gradient, reinforcement learning, the critic network, and the reward function. The process is as follows: the mapless motion planner is trained end to end, and a transfer function is defined for it so that the control frequency is guaranteed; the original deep deterministic policy gradient is modified into an asynchronous deep deterministic policy gradient; reinforcement learning is carried out, with training and sample collection executed in parallel; the motion planner is evaluated with the critic network, and a reward function is defined that checks whether the target has been reached. The present invention uses a high-precision laser range sensor and can compute paths accurately and efficiently; it demonstrates that, without any manual design or prior map, a feasible optimized path can be found efficiently, navigating the robot to the target position without colliding with obstacles in the environment.
Description
Technical field
The present invention relates to the field of robot control, and in particular to a continuous control method for a mobile robot based on a mapless motion planner.
Background technology
With the development of science and technology, mobile robot navigation has increasingly become one of the hot research topics in robotics and artificial intelligence, and is also an embodiment of the intelligence level of a fully autonomous robot. Ideally, when working in an unknown environment, a mobile robot can obtain local environmental information from its own sensors, build a map of the environment autonomously, and, according to the map thus built, plan a feasible collision-free path to the destination. In this way mobile robots can be applied to fields such as everyday navigation and path planning, bringing convenience to people's travel and work. However, traditional methods realize navigation by simultaneous localization and mapping (SLAM), which is not only time-consuming but also strongly dependent on the map.
The present invention proposes a continuous control method for a mobile robot based on a mapless motion planner. The mapless motion planner is trained end to end, and a transfer function is defined for it so that the control frequency is guaranteed and the robot can react to new observations immediately. The original deep deterministic policy gradient is modified into an asynchronous deep deterministic policy gradient; reinforcement learning is carried out, with training and sample collection executed in parallel; the motion planner is evaluated with the critic network, and a reward function is defined that checks whether the target has been reached. The present invention uses a high-precision laser range sensor and can compute paths accurately and efficiently; it demonstrates that, without any manual design or prior map, a feasible optimized path can be found efficiently, navigating the robot to the target position without colliding with obstacles in the environment.
The content of the invention
In view of problems such as the time cost of navigation, the object of the present invention is to provide a continuous control method for a mobile robot based on a mapless motion planner: the planner is trained end to end, and a transfer function is defined for it so that the control frequency is guaranteed and the robot can react to new observations immediately; the original deep deterministic policy gradient is modified into an asynchronous deep deterministic policy gradient; reinforcement learning is carried out, with training and sample collection executed in parallel; the motion planner is evaluated with the critic network, and a reward function is defined that checks whether the target has been reached.
To solve the above problems, the present invention provides a continuous control method for a mobile robot based on a mapless motion planner, whose main contents include:
(1) the mapless motion planner;
(2) the asynchronous deep deterministic policy gradient;
(3) reinforcement learning;
(4) the critic network;
(5) the reward function.
In the described continuous control method for a mobile robot based on a mapless motion planner, only 10-dimensional range findings and the position of the target relative to the robot are extracted as references; the mapless motion planner is trained end to end from scratch by an asynchronous deep reinforcement learning method, and can directly output continuous linear and angular velocities.
The described mapless motion planner takes the 10-dimensional range findings and the target position as input and continuous steering commands as output. The planner is trained end to end and can be applied directly in both virtual and real environments; it can navigate the mobile robot to the desired target without colliding with any obstacle.
Further, regarding the described transfer function, a transfer function is defined for the mapless motion planner:
v_t = f(x_t, p_t, v_{t-1}) (1)
where x_t is the observation of the raw sensor data, p_t is the relative position of the target, and v_{t-1} is the velocity of the mobile robot in the last time step. Together they can be regarded as the instantaneous state of the mobile robot. The model maps the state directly to an action, namely the next velocity v_t. An effective motion planner must guarantee the control frequency, so that the robot can react to new observations immediately.
In the described asynchronous deep deterministic policy gradient, compared with the original deep deterministic policy gradient, the sampling process is separated into another thread. In the training thread, each iteration step updates the weights of the critic network θ^Q and the actor network θ^u from a batch collected from the replay buffer. The prediction target of the critic network is computed from the reward r_i and the estimated Q-value γQ′, where Q′ is the output of the target critic network with weights θ^{Q′} for the next state s_{t+1}, taking as input the optimal action a_{t+1} = u′(s_{t+1} | θ^{u′}) estimated by the target actor network.
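The critic's prediction target described above can be written down directly. This is a minimal sketch with hypothetical names and an assumed discount factor; the stand-in target networks are placeholders for the target actor u′ and target critic Q′:

```python
GAMMA = 0.99  # discount factor gamma (hypothetical value, not given in the text)

def critic_target(r_i, s_next, target_actor, target_critic, done):
    """Prediction target for the critic: the reward plus the discounted
    Q' of the next state, evaluated at the target actor's action.
    Terminal transitions contribute only the reward."""
    if done:
        return r_i
    a_next = target_actor(s_next)                       # a_{t+1} = u'(s_{t+1} | theta^{u'})
    return r_i + GAMMA * target_critic(s_next, a_next)  # r_i + gamma * Q'(s_{t+1}, a_{t+1})

# Stand-in target networks for illustration:
actor = lambda s: [0.5, 0.0]
critic = lambda s, a: 1.0
y = critic_target(0.1, [0.0] * 14, actor, critic, done=False)
```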
Further, regarding the described sample collection, the actor network is updated by the policy gradient over a sampled batch of transitions. The sample-collection thread runs in parallel, with actions decided by the actor network; during training, a random process N is added to the actions to encourage exploration of the action space. New transitions are saved into the replay buffer shared by the training and sampling threads. The asynchronous deep deterministic policy gradient can also be realized with multiple data-collection threads, as in other asynchronous methods. Where the original deep deterministic policy gradient collects one sample per back-propagation iteration, the parallel asynchronous version collects many more samples in each step.
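A minimal sketch of the parallel collection side, assuming Gaussian exploration noise for the random process N (the text does not specify its form) and a toy environment; all names are illustrative:

```python
import random
import threading
from collections import deque

buffer = deque(maxlen=10000)   # replay buffer shared by training and sampling threads
lock = threading.Lock()

def sample_thread(env_step, actor, n_steps):
    """Collection thread: the actor picks an action, exploration noise N
    is added, and each transition (s, a, r, s', done) is pushed into the
    shared replay buffer while the training thread consumes batches."""
    s = [0.0] * 14
    for _ in range(n_steps):
        a = actor(s)
        a = [x + random.gauss(0.0, 0.1) for x in a]   # random process N
        s2, r, done = env_step(s, a)
        with lock:
            buffer.append((s, a, r, s2, done))
        s = [0.0] * 14 if done else s2

# Toy environment and untrained actor, for illustration only:
env = lambda s, a: (s, 0.0, False)
t = threading.Thread(target=sample_thread, args=(env, lambda s: [0.0, 0.0], 50))
t.start()
t.join()
```

Running several such threads against separate environment instances, all feeding the same buffer, gives the multi-collector variant mentioned above.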
In the described reinforcement learning, the sparse 10-dimensional laser range findings, the previous action, and the relative target position are merged into a 14-dimensional input vector. The 10-dimensional range findings are sampled at a uniform angular distribution from the raw laser results between -90 and 90 degrees, and the range information is normalized to (0, 1). The two-dimensional action of each time step consists of the angular and linear velocities of the mobile robot. The two-dimensional target position is expressed in polar coordinates (distance and angle) relative to the mobile robot's coordinate frame. After three fully connected neural network layers with 512 nodes each, the input vector is transformed into the linear and angular velocity commands of the mobile robot.
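Assembling the 14-dimensional input vector from the three parts named above can be sketched as follows (hypothetical function name; normalization to (0, 1) is shown as simple clamping for illustration):

```python
import math

def build_state(ranges10, prev_action, target_polar):
    """Merge the 10 normalized range findings, the previous 2-D action
    (linear, angular), and the target's polar coordinates (distance,
    angle) into the 14-dimensional input vector of the planner."""
    assert len(ranges10) == 10 and len(prev_action) == 2 and len(target_polar) == 2
    normed = [min(max(r, 0.0), 1.0) for r in ranges10]  # ranges normalized to (0, 1)
    return normed + list(prev_action) + list(target_polar)

state = build_state([0.2] * 10, [0.3, 0.1], [1.5, math.pi / 4])
```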
Further, regarding the described laser range findings, to constrain the angular velocity to the range (-1, 1), a hyperbolic tangent function (tanh) is used as the activation function; in addition, the range of the linear velocity is constrained to (0, 1) by a sigmoid function. Because the laser findings cannot cover the area behind the mobile robot, it cannot move backwards. The output action is multiplied by two hyperparameters to determine the final linear and angular velocities directly executed by the mobile robot. Considering the real dynamics, 0.5 m/s is chosen as the maximum linear velocity and 1 rad/s as the maximum angular velocity.
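The output constraints described above amount to a small post-processing step (hypothetical function name; the two hyperparameters are the maximum speeds stated in the text):

```python
import math

V_MAX, W_MAX = 0.5, 1.0  # hyperparameters: max linear (m/s) and max angular (rad/s) speed

def scale_action(lin_raw, ang_raw):
    """Constrain the raw network outputs: a sigmoid keeps the linear
    speed in (0, 1), so no backward motion is possible (the laser cannot
    see behind the robot), and tanh keeps the angular speed in (-1, 1);
    both are then scaled by the hyperparameters to the commands the
    robot actually executes."""
    lin = 1.0 / (1.0 + math.exp(-lin_raw))  # sigmoid -> (0, 1)
    ang = math.tanh(ang_raw)                # tanh    -> (-1, 1)
    return V_MAX * lin, W_MAX * ang

v, w = scale_action(0.0, 0.0)
```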
The described critic network predicts the Q-value of a state-action pair. The input state is processed by three fully connected neural network layers; the action is merged in at the second fully connected layer. The Q-value is finally activated by a linear activation function:
y = kx + b (2)
where x is the input of the last layer, y is the predicted Q-value, and k and b are the trained weight and bias of this layer.
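The structure above, with the action fused in at the second layer and a linear output y = kx + b, can be sketched in pure Python. The tiny deterministic parameters are for illustration only (the real layers have 512 nodes and trained weights):

```python
def linear(x, W, b):
    """Fully connected layer y = W x + b over nested-list weights."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def critic_q(state, action, params):
    """Critic sketch: the state passes through the first fully connected
    layer; the action is merged (concatenated) at the second layer; the
    final layer uses the linear activation y = kx + b to output Q."""
    h1 = relu(linear(state, params["W1"], params["b1"]))
    h2 = relu(linear(h1 + list(action), params["W2"], params["b2"]))  # merge action here
    return linear(h2, params["W3"], params["b3"])[0]                  # linear output = Q

# Tiny deterministic parameters for illustration:
p = {"W1": [[0.1] * 4] * 3, "b1": [0.0] * 3,
     "W2": [[0.1] * 5] * 3, "b2": [0.0] * 3,
     "W3": [[1.0] * 3],     "b3": [0.5]}
q = critic_q([1.0, 1.0, 1.0, 1.0], [0.5, 0.5], p)
```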
For the described reward function, the mobile robot tries to reach the desired target position without colliding with obstacles. The reward function has three different conditions: if the robot is found to have reached the target, as checked by a distance threshold, it receives a positive reward r_reach; but if the minimum range measurement shows that the robot has collided with an obstacle, it receives a negative reward r_collision. Either of these two conditions terminates the training episode. Otherwise, the reward is the difference from the distance at the previous time step, d_{t-1} - d_t, multiplied by a hyperparameter c_r; this reward drives the robot toward the target position. The reward function is used directly by the critic network, without clipping or normalization.
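The three-case reward can be written directly. All numeric values here are hypothetical placeholders (the text names the quantities but not their magnitudes):

```python
C_R = 1.0            # distance-shaping hyperparameter c_r (hypothetical value)
R_REACH = 10.0       # positive reward on reaching the target (hypothetical)
R_COLLISION = -10.0  # negative reward on collision (hypothetical)
D_THRESH = 0.2       # distance threshold for "target reached" (hypothetical)
D_COLLIDE = 0.1      # minimum range reading that counts as a collision (hypothetical)

def reward(d_prev, d_now, min_range):
    """Three-case reward: r_reach if the target is within the distance
    threshold, r_collision if the minimum range reading signals a crash,
    otherwise c_r * (d_{t-1} - d_t), which pulls the robot toward the
    goal. The first two cases also terminate the episode."""
    if d_now < D_THRESH:
        return R_REACH, True
    if min_range < D_COLLIDE:
        return R_COLLISION, True
    return C_R * (d_prev - d_now), False

r, done = reward(1.0, 0.8, 0.5)
```

Note that the shaping term is positive when the robot gets closer and negative when it moves away, so it carries a dense gradient signal even before any terminal event occurs.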
Brief description of the drawings
Fig. 1 is a system framework diagram of the continuous control method for a mobile robot based on a mapless motion planner of the present invention.
Fig. 2 shows the transfer function of the mapless motion planner in the continuous control method of the present invention.
Fig. 3 illustrates the reinforcement learning in the continuous control method of the present invention.
Embodiment
It should be noted that, where no conflict arises, the embodiments of this application and the features of the embodiments can be combined with one another. The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a system framework diagram of the continuous control method for a mobile robot based on a mapless motion planner of the present invention, mainly comprising the mapless motion planner, the asynchronous deep deterministic policy gradient, reinforcement learning, the critic network, and the reward function.
In the continuous control method based on the mapless motion planner, only 10-dimensional range findings and the position of the target relative to the robot are extracted as references; the mapless motion planner is trained end to end from scratch by an asynchronous deep reinforcement learning method, and can directly output continuous linear and angular velocities.
In the asynchronous deep deterministic policy gradient, compared with the original deep deterministic policy gradient, the sampling process is separated into another thread. In the training thread, each iteration step updates the weights of the critic network θ^Q and the actor network θ^u from a batch collected from the replay buffer. The prediction target of the critic network is computed from the reward r_i and the estimated Q-value γQ′, where Q′ is the output of the target critic network with weights θ^{Q′} for the next state s_{t+1}, taking as input the optimal action a_{t+1} = u′(s_{t+1} | θ^{u′}) estimated by the target actor network.
The actor network is updated by the policy gradient over a sampled batch of transitions. The sample-collection thread runs in parallel, with actions decided by the actor network; during training, a random process N is added to the actions to encourage exploration of the action space. New transitions are saved into the replay buffer shared by the training and sampling threads. The asynchronous deep deterministic policy gradient can also be realized with multiple data-collection threads, as in other asynchronous methods. Where the original deep deterministic policy gradient collects one sample per back-propagation iteration, the parallel asynchronous version collects many more samples in each step.
The critic network predicts the Q-value of a state-action pair. The input state is processed by three fully connected neural network layers; the action is merged in at the second fully connected layer. The Q-value is finally activated by a linear activation function:
y = kx + b (1)
where x is the input of the last layer, y is the predicted Q-value, and k and b are the trained weight and bias of this layer.
With the collision-free reward function, the mobile robot tries to reach the desired target position without colliding with obstacles. The reward function has three different conditions: if the robot is found to have reached the target, as checked by a distance threshold, it receives a positive reward r_reach; but if the minimum range measurement shows that the robot has collided with an obstacle, it receives a negative reward r_collision. Either of these two conditions terminates the training episode. Otherwise, the reward is the difference from the distance at the previous time step, d_{t-1} - d_t, multiplied by a hyperparameter c_r; this reward drives the robot toward the target position. The reward function is used directly by the critic network, without clipping or normalization.
Fig. 2 shows the transfer function of the mapless motion planner in the continuous control method of the present invention. The mapless motion planner takes the 10-dimensional range findings and the target position as input and continuous steering commands as output. The planner is trained end to end and can be applied directly in both virtual and real environments; it can navigate the mobile robot to the desired target without colliding with any obstacle.
A transfer function is defined for the mapless motion planner:
v_t = f(x_t, p_t, v_{t-1}) (3)
where x_t is the observation of the raw sensor data, p_t is the relative position of the target, and v_{t-1} is the velocity of the mobile robot in the last time step. Together they can be regarded as the instantaneous state of the mobile robot. The model maps the state directly to an action, namely the next velocity v_t. An effective motion planner must guarantee the control frequency, so that the robot can react to new observations immediately.
Fig. 3 illustrates the reinforcement learning in the continuous control method of the present invention. The sparse 10-dimensional laser range findings, the previous action, and the relative target position are merged into a 14-dimensional input vector. The 10-dimensional range findings are sampled at a uniform angular distribution from the raw laser results between -90 and 90 degrees, and the range information is normalized to (0, 1). The two-dimensional action of each time step consists of the angular and linear velocities of the mobile robot. The two-dimensional target position is expressed in polar coordinates (distance and angle) relative to the mobile robot's coordinate frame. After three fully connected neural network layers with 512 nodes each, the input vector is transformed into the linear and angular velocity commands of the mobile robot.
To constrain the angular velocity to the range (-1, 1), a hyperbolic tangent function (tanh) is used as the activation function; in addition, the range of the linear velocity is constrained to (0, 1) by a sigmoid function. Because the laser findings cannot cover the area behind the mobile robot, it cannot move backwards. The output action is multiplied by two hyperparameters to determine the final linear and angular velocities directly executed by the mobile robot. Considering the real dynamics, 0.5 m/s is chosen as the maximum linear velocity and 1 rad/s as the maximum angular velocity.
For those skilled in the art, the present invention is not restricted to the details of the above embodiments; it can be realized in other concrete forms without departing from its spirit and scope. Furthermore, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope, and such improvements and modifications should also be regarded as falling within the protection scope of the present invention. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Claims (10)
1. A continuous control method for a mobile robot based on a mapless motion planner, characterized by mainly comprising: a mapless motion planner (1); an asynchronous deep deterministic policy gradient (2); reinforcement learning (3); a critic network (4); and a reward function (5).
2. The continuous control method for a mobile robot based on a mapless motion planner according to claim 1, characterized in that only 10-dimensional range findings and the position of the target relative to the robot are extracted as references; the mapless motion planner is trained end to end from scratch by an asynchronous deep reinforcement learning method, and can directly output continuous linear and angular velocities.
3. The mapless motion planner (1) according to claim 1, characterized in that it takes the 10-dimensional range findings and the target position as input and continuous steering commands as output; the planner is trained end to end and can be applied directly in both virtual and real environments; it can navigate the mobile robot to the desired target without colliding with any obstacle.
4. The transfer function according to claim 3, characterized in that a transfer function is defined for the mapless motion planner:
v_t = f(x_t, p_t, v_{t-1}) (1)
where x_t is the observation of the raw sensor data, p_t is the relative position of the target, and v_{t-1} is the velocity of the mobile robot in the last time step; together they can be regarded as the instantaneous state of the mobile robot; the model maps the state directly to an action, namely the next velocity v_t; an effective motion planner must guarantee the control frequency, so that the robot can react to new observations immediately.
5. The asynchronous deep deterministic policy gradient (2) according to claim 1, characterized in that, compared with the original deep deterministic policy gradient, the sampling process is separated into another thread; in the training thread, each iteration step updates the weights of the critic network θ^Q and the actor network θ^u from a batch collected from the replay buffer; the prediction target of the critic network is computed from the reward r_i and the estimated Q-value γQ′, where Q′ is the output of the target critic network with weights θ^{Q′} for the next state s_{t+1}, taking as input the optimal action a_{t+1} = u′(s_{t+1} | θ^{u′}) estimated by the target actor network.
6. The sample collection according to claim 5, characterized in that the actor network is updated by the policy gradient over a sampled batch of transitions; the sample-collection thread runs in parallel, with actions decided by the actor network; during training, a random process N is added to the actions to encourage exploration of the action space; new transitions are saved into the replay buffer shared by the training and sampling threads; the asynchronous deep deterministic policy gradient can also be realized with multiple data-collection threads, as in other asynchronous methods; where the original deep deterministic policy gradient collects one sample per back-propagation iteration, the parallel asynchronous version collects many more samples in each step.
7. The reinforcement learning (3) according to claim 1, characterized in that the sparse 10-dimensional laser range findings, the previous action, and the relative target position are merged into a 14-dimensional input vector; the 10-dimensional range findings are sampled at a uniform angular distribution from the raw laser results between -90 and 90 degrees, and the range information is normalized to (0, 1); the two-dimensional action of each time step consists of the angular and linear velocities of the mobile robot; the two-dimensional target position is expressed in polar coordinates (distance and angle) relative to the mobile robot's coordinate frame; after three fully connected neural network layers with 512 nodes each, the input vector is transformed into the linear and angular velocity commands of the mobile robot.
8. The laser range findings according to claim 7, characterized in that, to constrain the angular velocity to the range (-1, 1), a hyperbolic tangent function (tanh) is used as the activation function; in addition, the range of the linear velocity is constrained to (0, 1) by a sigmoid function; because the laser findings cannot cover the area behind the mobile robot, it cannot move backwards; the output action is multiplied by two hyperparameters to determine the final linear and angular velocities directly executed by the mobile robot; considering the real dynamics, 0.5 m/s is chosen as the maximum linear velocity and 1 rad/s as the maximum angular velocity.
9. The critic network (4) according to claim 1, characterized in that it predicts the Q-value of a state-action pair; the input state is processed by three fully connected neural network layers; the action is merged in at the second fully connected layer; the Q-value is finally activated by a linear activation function:
y = kx + b (2)
where x is the input of the last layer, y is the predicted Q-value, and k and b are the trained weight and bias of this layer.
10. The reward function (5) according to claim 1, characterized in that the mobile robot tries to reach the desired target position without colliding with obstacles; the reward function has three different conditions: if the robot is found to have reached the target, as checked by a distance threshold, it receives a positive reward r_reach; but if the minimum range measurement shows that the robot has collided with an obstacle, it receives a negative reward r_collision; either of these two conditions terminates the training episode; otherwise, the reward is the difference from the distance at the previous time step, d_{t-1} - d_t, multiplied by a hyperparameter c_r, which drives the robot toward the target position; the reward function is used directly by the critic network, without clipping or normalization.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710294685.9A CN106950969A (en) | 2017-04-28 | 2017-04-28 | It is a kind of based on the mobile robot continuous control method without map movement planner |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106950969A true CN106950969A (en) | 2017-07-14 |
Family
ID=59477823
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710294685.9A Withdrawn CN106950969A (en) | 2017-04-28 | 2017-04-28 | It is a kind of based on the mobile robot continuous control method without map movement planner |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106950969A (en) |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107450593A (en) * | 2017-08-30 | 2017-12-08 | 清华大学 | A kind of unmanned plane autonomous navigation method and system |
CN107490377A (en) * | 2017-07-17 | 2017-12-19 | 五邑大学 | Indoor map-free navigation system and navigation method |
CN108287542A (en) * | 2018-01-04 | 2018-07-17 | 浙江大学 | Unmanned plane and unmanned boat cooperation control system and method based on collaboration cloud control |
CN108320051A (en) * | 2018-01-17 | 2018-07-24 | 哈尔滨工程大学 | A kind of mobile robot dynamic collision-free planning method based on GRU network models |
CN108536144A (en) * | 2018-04-10 | 2018-09-14 | 上海理工大学 | A kind of paths planning method of fusion dense convolutional network and competition framework |
CN109085825A (en) * | 2018-07-13 | 2018-12-25 | 安徽灵图壹智能科技有限公司 | A kind of unmanned mine car mining optimal route selection method |
CN109242098A (en) * | 2018-07-25 | 2019-01-18 | 深圳先进技术研究院 | Limit neural network structure searching method and Related product under cost |
CN109241552A (en) * | 2018-07-12 | 2019-01-18 | 哈尔滨工程大学 | A kind of underwater robot motion planning method based on multiple constraint target |
CN109668484A (en) * | 2019-01-18 | 2019-04-23 | 北京瀚科瑞杰科技发展有限公司 | A kind of target drone maneuvering control method and system that target drone is interacted with attack plane |
CN110147891A (en) * | 2019-05-23 | 2019-08-20 | 北京地平线机器人技术研发有限公司 | Method, apparatus and electronic equipment applied to intensified learning training process |
CN110488835A (en) * | 2019-08-28 | 2019-11-22 | 北京航空航天大学 | A kind of unmanned systems intelligence local paths planning method based on double reverse transmittance nerve networks |
CN110753936A (en) * | 2017-08-25 | 2020-02-04 | 谷歌有限责任公司 | Batch reinforcement learning |
CN110908384A (en) * | 2019-12-05 | 2020-03-24 | 中山大学 | Formation navigation method for distributed multi-robot collaborative unknown random maze |
CN111515961A (en) * | 2020-06-02 | 2020-08-11 | 南京大学 | Reinforcement learning reward method suitable for mobile mechanical arm |
CN112857370A (en) * | 2021-01-07 | 2021-05-28 | 北京大学 | Robot map-free navigation method based on time sequence information modeling |
CN113093727A (en) * | 2021-03-08 | 2021-07-09 | 哈尔滨工业大学(深圳) | Robot map-free navigation method based on deep security reinforcement learning |
CN113260936A (en) * | 2018-12-26 | 2021-08-13 | 三菱电机株式会社 | Mobile body control device, mobile body control learning device, and mobile body control method |
TWI815613B (en) * | 2022-08-16 | 2023-09-11 | 和碩聯合科技股份有限公司 | Navigation method for robot and robot thereof |
- 2017-04-28: Application CN201710294685.9A filed; published as CN106950969A; status not active (Withdrawn)
Non-Patent Citations (1)
Title |
---|
LEI TAI et al.: "Virtual-to-real Deep Reinforcement Learning: Continuous Control of Mobile Robots for Mapless Navigation", published online: https://arxiv.org/abs/1703.00420 |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107490377A (en) * | 2017-07-17 | 2017-12-19 | Wuyi University | Indoor map-free navigation system and navigation method |
CN110753936A (en) * | 2017-08-25 | 2020-02-04 | Google LLC | Batch reinforcement learning |
CN107450593B (en) * | 2017-08-30 | 2020-06-12 | Tsinghua University | Unmanned aerial vehicle autonomous navigation method and system |
CN107450593A (en) * | 2017-08-30 | 2017-12-08 | Tsinghua University | Unmanned aerial vehicle autonomous navigation method and system |
CN108287542A (en) * | 2018-01-04 | 2018-07-17 | Zhejiang University | Unmanned aerial vehicle and unmanned ship cooperative control system and method based on cooperative cloud control |
CN108287542B (en) * | 2018-01-04 | 2021-01-26 | Zhejiang University | Unmanned aerial vehicle and unmanned ship cooperative control system and method based on cooperative cloud control |
CN108320051B (en) * | 2018-01-17 | 2021-11-23 | Harbin Engineering University | Mobile robot dynamic collision avoidance planning method based on GRU network model |
CN108320051A (en) * | 2018-01-17 | 2018-07-24 | Harbin Engineering University | Mobile robot dynamic collision avoidance planning method based on GRU network model |
CN108536144A (en) * | 2018-04-10 | 2018-09-14 | University of Shanghai for Science and Technology | Path planning method fusing a dense convolutional network and a dueling architecture |
CN109241552A (en) * | 2018-07-12 | 2019-01-18 | Harbin Engineering University | Underwater robot motion planning method based on multiple constraint targets |
CN109241552B (en) * | 2018-07-12 | 2022-04-05 | Harbin Engineering University | Underwater robot motion planning method based on multiple constraint targets |
CN109085825A (en) * | 2018-07-13 | 2018-12-25 | Anhui Lingtuyi Intelligent Technology Co., Ltd. | Optimal route selection method for unmanned mining trucks |
CN109242098A (en) * | 2018-07-25 | 2019-01-18 | Shenzhen Institutes of Advanced Technology | Neural network architecture search method under cost constraints and related products |
CN113260936A (en) * | 2018-12-26 | 2021-08-13 | Mitsubishi Electric Corporation | Mobile body control device, mobile body control learning device, and mobile body control method |
CN113260936B (en) * | 2018-12-26 | 2024-05-07 | Mitsubishi Electric Corporation | Moving object control device, moving object control learning device, and moving object control method |
CN109668484A (en) * | 2019-01-18 | 2019-04-23 | Beijing Hanke Ruijie Technology Development Co., Ltd. | Target drone maneuver control method and system for interaction between a target drone and an attack aircraft |
CN109668484B (en) * | 2019-01-18 | 2023-05-02 | Beijing Hanke Technology Group Co., Ltd. | Target aircraft maneuvering flight control method and system for interaction of target aircraft and attack aircraft |
CN110147891A (en) * | 2019-05-23 | 2019-08-20 | Beijing Horizon Robotics Technology R&D Co., Ltd. | Method, apparatus, and electronic device applied to a reinforcement learning training process |
CN110488835A (en) * | 2019-08-28 | 2019-11-22 | Beihang University | Intelligent local path planning method for unmanned systems based on dual back-propagation neural networks |
CN110908384A (en) * | 2019-12-05 | 2020-03-24 | Sun Yat-sen University | Formation navigation method for distributed multi-robot collaboration in an unknown random maze |
CN110908384B (en) * | 2019-12-05 | 2022-09-23 | Sun Yat-sen University | Formation navigation method for distributed multi-robot collaboration in an unknown random maze |
CN111515961A (en) * | 2020-06-02 | 2020-08-11 | Nanjing University | Reinforcement learning reward method for a mobile manipulator |
CN111515961B (en) * | 2020-06-02 | 2022-06-21 | Nanjing University | Reinforcement learning reward method for a mobile manipulator |
CN112857370A (en) * | 2021-01-07 | 2021-05-28 | Peking University | Robot map-free navigation method based on temporal information modeling |
CN113093727A (en) * | 2021-03-08 | 2021-07-09 | Harbin Institute of Technology (Shenzhen) | Robot map-free navigation method based on deep safe reinforcement learning |
TWI815613B (en) * | 2022-08-16 | 2023-09-11 | Pegatron Corporation | Navigation method for robot and robot thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106950969A (en) | Mobile robot continuous control method based on a mapless motion planner | |
CN113110509B (en) | Warehousing system multi-robot path planning method based on deep reinforcement learning | |
CN108279692B (en) | UUV dynamic planning method based on LSTM-RNN | |
Brunner et al. | Teaching a machine to read maps with deep reinforcement learning | |
CN104155998B (en) | Path planning method based on the potential field method | |
CN108645413A (en) | Dynamic correction method for simultaneous localization and mapping of a mobile robot | |
CN106873585A (en) | Navigation route search method, robot, and system | |
Saulnier et al. | Information theoretic active exploration in signed distance fields | |
CN110095120A (en) | Bio-inspired self-organizing map path planning method for autonomous underwater vehicles under ocean circulation | |
CN110515382A (en) | Smart device and localization method thereof | |
CN114879660B (en) | Target-driven robot environment sensing method | |
Wang | Automatic control of mobile robot based on autonomous navigation algorithm | |
Klein | Data-driven meets navigation: Concepts, models, and experimental validation | |
CN107562837B (en) | Maneuvering target tracking algorithm based on road network | |
Jiang et al. | Intelligent Plant Cultivation Robot Based on Key Marker Algorithm Using Visual and Laser Sensors | |
CN114594776B (en) | Navigation obstacle avoidance method based on layering and modular learning | |
CN114153216B (en) | Lunar surface path planning system and method based on deep reinforcement learning and block planning | |
Kim et al. | Path integration mechanism with coarse coding of neurons | |
CN115690343A (en) | Robot laser radar scanning and mapping method based on visual following | |
Chauvin-Hameau | Informative path planning for algae farm surveying | |
CN112907644B (en) | Machine map-oriented visual positioning method | |
Abidin et al. | A calibration framework for swarming ASVs’ system design | |
El-Fakdi et al. | Autonomous underwater vehicle control using reinforcement learning policy search methods | |
Kashyap et al. | Modified type-2 fuzzy controller for intercollision avoidance of single and multi-humanoid robots in complex terrains | |
KR20220090732A (en) | Method and system for determining action of device for given state using model trained based on risk measure parameter |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | | Application publication date: 2017-07-14 |