CN109241552A - Underwater robot motion planning method based on multiple constraint targets - Google Patents
Underwater robot motion planning method based on multiple constraint targets
- Publication number
- CN109241552A (application CN201810764979.8A)
- Authority
- CN
- China
- Prior art keywords
- robot
- training
- constraint
- target
- underwater
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2111/00—Details relating to CAD techniques
- G06F2111/04—Constraint-based CAD
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Evolutionary Computation (AREA)
- Geometry (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
- Manipulator (AREA)
Abstract
An underwater robot motion planning method based on multiple constraint targets, belonging to the fields of machine learning and underwater robot motion planning. Model construction stage: the signals of the robot's obstacle-avoidance sonar and the flow-velocity signal of the current sensor are converted into the current environment state; a discrete action space is established according to the dynamic constraints; reward functions are established with underwater obstacles as constraints; a Markov decision process is established on the basis of the multi-objective constraints, laying the foundation for the algorithm. Training stage: training is carried out with the Q-learning algorithm; in the current environment, actions are executed according to an ε-greedy policy; after each step the current policy is evaluated and updated on the basis of the original policy, and the policy is improved until it adapts to the environment, achieving the planning goal. The invention takes into account multiple constraint targets such as the water flow, obstacles and the target point; by combining a reinforcement learning method with the underwater multi-constraint targets, it realizes motion planning for underwater robots with strong real-time performance and applicability to a variety of environments.
Description
Technical field
The invention belongs to the fields of machine learning and underwater robot motion planning, and in particular relates to an underwater robot motion planning method based on multiple constraint targets.
Background art
Intelligent underwater robots have broad application prospects in marine scientific research, ocean development, underwater engineering and other fields. They generally operate in complex marine environments; to better complete various operational tasks and to ensure their own safety, they need autonomous motion planning capability in unknown environments, i.e. the ability to avoid obstacles and navigate to a target point.
Traditional underwater robot motion planning techniques require a global map to be constructed in advance. When the environment changes, the model must be rebuilt, so adaptability is poor and practicality is limited. Reinforcement learning is an unsupervised learning method based on continuous trial: knowledge is acquired through repeated actions and evaluations, and the policy is improved to adapt to the environment until the final evaluation-function value is maximized, achieving the goal of learning.
It has been verified that reinforcement learning can be applied to underwater robots, but traditional reinforcement-learning-based underwater robot motion planning methods consider only a single constraint target and do not simultaneously account for the influence of multi-objective constraints such as the water-flow constraint, the target constraint and the obstacle constraint on the robot's motion.
Summary of the invention
The purpose of the present invention is to provide an underwater robot motion planning method based on multiple constraint targets. The method constructs a dynamic model of the underwater robot under the influence of water flow, fuses the multiple constraint targets with a reinforcement learning method, constructs reasonable reward signals and an action space, and outputs an optimal control policy for the underwater robot through training. Furthermore, the invention combines the underwater multi-constraint targets with the Q-learning algorithm of reinforcement learning, allowing the underwater robot to acquire environmental features in an unknown underwater environment, perform policy iteration, and complete its motion planning.
The object of the present invention is achieved as follows:
An underwater robot motion planning method based on multiple constraint targets is divided into a model construction stage and an algorithm training stage, and specifically includes the following steps:
(1) Model construction stage, referring specifically to the construction of the Markov decision process E. A reinforcement learning task can usually be described by a Markov decision process. Because of the particularity of the underwater environment, multi-objective constraints such as the environment constraint, the obstacle constraint and the target-point constraint are considered when constructing the Markov decision process, specifically including the following steps:
(1-1) Establish the current environment state x_t from the sensor signals. Let the obstacle distance in the direction of the robot's i-th degree of freedom be l_i; if there is no obstacle in that direction, set l_i to infinity. Let the flow velocity at the robot's current position be v_c. Localize the robot in real time and compute the Euclidean distance d between the robot and the target point.
(1-2) Establish the action space A of the robot according to the maximum speed at which the underwater robot can advance. A consists of five motion commands: forward, forward-left, forward-right, lateral thrust left and lateral thrust right, with linear velocity v_a and angular velocity ω_a.
(1-3) Consider the obstacle constraint. Let h_i be the underwater safe guard distance for the i-th degree of freedom. If the detected l_i < h_i, a collision is considered to have occurred and a negative reward r_ter is set.
(1-4) Consider the target-point constraint, with target-point threshold d'. If d is detected to increase, a negative reward r_opp is set; if d is detected to decrease, a positive reward r_move is set; if d < d' is detected, the robot has reached the target point and a positive reward r_arr is set.
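The multi-constraint reward of steps (1-3) and (1-4) can be sketched as a single function. This is a minimal illustrative sketch: the concrete reward magnitudes are assumptions, since the specification leaves the values of r_ter, r_opp, r_move and r_arr open.

```python
# Sketch of the reward design of steps (1-3)/(1-4).
# The numeric reward values are illustrative assumptions, not from the patent.
R_TER, R_OPP, R_MOVE, R_ARR = -100.0, -1.0, 1.0, 100.0

def reward(l, h, d, d_prev, d_goal):
    """l: obstacle distances per degree of freedom; h: safe distances per DOF;
    d: current distance to target; d_prev: previous distance; d_goal: threshold d'."""
    if any(li < hi for li, hi in zip(l, h)):
        return R_TER                              # collision: terminating penalty
    if d < d_goal:
        return R_ARR                              # target point reached
    return R_MOVE if d < d_prev else R_OPP        # approaching vs. receding
```

The collision check is evaluated first so that the obstacle constraint dominates the target-point constraint, matching the order of steps (2-8) to (2-13) below.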
(2) Algorithm training stage, referring specifically to the robot performing continuous trial and error in computer simulation to learn a policy, specifically including the following steps:
(2-1) Initialize t = 0, where t is the step index within each training episode; initialize r_t = 0, where r_t is the reward obtained when the robot executes the t-th action.
(2-2) Initialize a matrix Q(x, a), recording the Q value obtainable by selecting action a in state x; initialize it to 0.
(2-3) Initialize a counter count = 0 to record the total number of training episodes; set a value M representing the total number of episodes the robot needs to train.
(2-4) While count is less than the specified number of training episodes M, execute (2-5); otherwise execute (2-14).
(2-5) Obtain the sensor signals and the current state x_t, including the obstacle information, i.e. the obstacle distance l_i in the direction of the robot's i-th degree of freedom (set to infinity if there is no obstacle); the ocean-current velocity v_ct at the current position; and the robot's own position, from which the Euclidean distance d between the target point and the robot is computed.
(2-6) Select action a_t according to the matrix Q.
(2-7) Consider the kinematic constraint and the water-flow constraint: combine the velocity of the selected action a_t with the flow velocity according to the formula for the actually exhibited external speed, simulate with the combined velocity, and update l_i.
(2-8) If l_i < h_i, execute (2-9); otherwise execute (2-10).
(2-9) A collision has occurred: set r_t = r_ter, end this episode, set x_{t+1} to empty, update the matrix Q, increment count, set t = 0, and restart training from (2-4).
(2-10) If d' < d, execute (2-11); otherwise the target point has been reached: set r_t = r_arr, end this episode, set x_{t+1} to empty, update the matrix Q, increment count, set t = 0, and restart training from (2-4).
(2-11) If d_t < d_{t-1}, execute (2-12); otherwise execute (2-13).
(2-12) d has decreased: set r_t = r_move, update x_{t+1}, update the matrix Q, increment t, and continue the training flow from (2-5).
(2-13) d has increased: set r_t = r_opp, update x_{t+1}, update the matrix Q, increment t, and continue the training flow from (2-5).
(2-14) End training and obtain the trained matrix Q.
(2-15) Output the underwater robot motion planning policy.
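Steps (2-1) to (2-15) amount to a tabular Q-learning episode loop. The following is a minimal runnable sketch on a toy one-dimensional stand-in for the simulated environment; env_step, the state/action encoding, and all numeric parameters are illustrative assumptions, not the patent's simulator.

```python
import random

# Toy stand-in for the simulation: states 0..4 on a line, target at state 4,
# "collision" when the robot steps below state 0. All values are assumptions.
def env_step(x, a):
    x2 = x + a
    if x2 < 0:                 # collision -> terminal penalty r_ter
        return None, -100.0
    if x2 >= 4:                # target reached -> terminal reward r_arr
        return None, 100.0
    return x2, 1.0 if a > 0 else -1.0   # r_move / r_opp

ALPHA, GAMMA, EPS, M = 0.5, 0.9, 0.8, 500
Q = {(x, a): 0.0 for x in range(5) for a in (-1, 1)}   # step (2-2)

random.seed(0)
for count in range(M):                                  # steps (2-3)/(2-4)
    x = 0                                               # episode start
    while x is not None:                                # steps (2-5)..(2-13)
        # epsilon-greedy selection: exploit when the draw falls below EPS,
        # following the convention of the specification (claim 4)
        if random.random() < EPS:
            a = max((-1, 1), key=lambda b: Q[(x, b)])
        else:
            a = random.choice((-1, 1))
        x2, r = env_step(x, a)
        # terminal transitions (x2 is None) carry no bootstrap term
        target = r if x2 is None else r + GAMMA * max(Q[(x2, b)] for b in (-1, 1))
        Q[(x, a)] = (1 - ALPHA) * Q[(x, a)] + ALPHA * target   # claim 5 update
        x = x2

# step (2-15): the greedy policy extracted from the trained table
policy = {x: max((-1, 1), key=lambda b: Q[(x, b)]) for x in range(4)}
```

After training, the greedy policy moves toward the target from every state, which is the planning behaviour the steps above describe.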
The kinematic constraint, i.e. the kinematic restriction of the underwater robot itself during training, is as follows: assuming the vehicle's center of gravity has coordinates (x, y) in the fixed coordinate frame, the robot's velocity in the fixed frame is expressed in terms of the trim angle θ, the heel angle φ, and an influence coefficient α of the kinematic constraint on the underwater robot's speed.
The water-flow constraint is taken into account during action selection in training as follows: during learning, in state x_t the flow velocity obtained by the ADCP is v_ct; the robot selects an action a_t from the action set according to the policy, and the vehicle's own speed is v_at. When the robot executes the action, the water-flow constraint is considered, and the route speed actually exhibited externally is v_it = v_at + β·v_ct, where β is the influence coefficient of the water flow on the underwater robot's speed.
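The composition v_it = v_at + β·v_ct can be written in a few lines. Treating the velocities as 2-D component tuples is an assumption for illustration; the patent does not fix the dimensionality.

```python
# v_it = v_at + beta * v_ct, per the water-flow constraint.
# Representing velocities as (x, y) component tuples is an assumption.
def actual_velocity(v_a, v_c, beta):
    """v_a: commanded body velocity; v_c: ADCP-measured current velocity;
    beta: influence coefficient of the water flow on the robot's speed."""
    return tuple(va + beta * vc for va, vc in zip(v_a, v_c))
```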
The action a_t is selected as follows: use an ε-greedy policy, set a threshold ε, and generate a random number ε' with the computer. If the random number is less than the threshold, i.e. ε' < ε, the robot executes the action corresponding to the maximum element of the row Q(x_t, ·) of the Q matrix, i.e. a_t = argmax_a Q(x_t, a); if the random number is greater than the threshold, i.e. ε' > ε, the robot randomly selects an action to execute, i.e. a_t = random_a Q(x_t, a).
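Note that in this convention ε' < ε selects the greedy action, so ε acts as an exploitation probability rather than the exploration probability of the more common ε-greedy formulation. A minimal sketch under that convention:

```python
import random

def select_action(Q, x, actions, eps):
    """Greedy when the random draw falls below eps, random otherwise,
    matching the patent's convention a_t = argmax_a Q(x_t, a)."""
    if random.random() < eps:
        return max(actions, key=lambda a: Q[(x, a)])
    return random.choice(actions)
```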
The matrix Q is updated as follows: assume the state before the robot executes the action is x_t, the executed action is a_t, the reward obtained from feedback is r_t, and the state reached after executing the action is x_{t+1}; then
Q(x_t, a_t) ← (1-α)·Q(x_t, a_t) + α·(r_t + γ·max_{a'} Q(x_{t+1}, a'))
where α is the learning rate and γ is the discount factor.
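The update rule can be checked directly. The dictionary-backed table and the terminal-state handling (x_{t+1} set to empty after a collision or arrival, so the bootstrap term vanishes) follow the training steps above; representing "empty" as None is an implementation assumption.

```python
def update_q(Q, x, a, r, x_next, actions, alpha, gamma):
    """Q(x,a) <- (1-alpha)*Q(x,a) + alpha*(r + gamma*max_a' Q(x',a')).
    x_next is None for a terminal transition (collision or arrival)."""
    bootstrap = 0.0 if x_next is None else max(Q[(x_next, b)] for b in actions)
    Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * (r + gamma * bootstrap)
```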
The beneficial effects of the present invention are:
(1) The invention takes into account multiple constraint targets such as the water flow, obstacles and the target point, whereas traditional reinforcement learning planning methods do not consider multiple constraint targets simultaneously; the trained method therefore has practicality and robustness.
(2) By combining a reinforcement learning method with the underwater multi-constraint targets, the invention realizes motion planning for underwater robots with strong real-time performance, and is applicable to a variety of environments.
Brief description of the drawings
Fig. 1 is a schematic diagram of the model construction of the underwater robot motion planning method based on multiple constraint targets;
Fig. 2 is a flow chart of the training stage of the underwater robot motion planning method based on multiple constraint targets.
Specific embodiment
Specific embodiments of the present invention are further described below with reference to the accompanying drawings:
The present invention relates to an underwater robot motion planning method that combines multi-objective constraints with a reinforcement learning method and applies them to underwater robot motion planning. Model construction stage: the signals of the robot's obstacle-avoidance sonar and the flow-velocity signal of the current sensor are converted into the current environment state; a discrete action space is established based on the dynamic constraints of the underwater robot; reward functions are established with underwater obstacles as constraints; a Markov decision process is established based on the multi-objective constraints, laying the foundation for the algorithm. Training stage: training is carried out with the Q-learning algorithm; in the current environment, actions are executed according to an ε-greedy policy; after each step the current policy is evaluated and updated based on the original policy, and the policy is improved until it adapts to the environment, achieving the planning goal. By combining reinforcement learning with the underwater multi-constraint targets, the invention realizes motion planning for underwater robots with strong real-time performance and applicability to a variety of environments.
Aiming at the particularity of the underwater environment, the invention combines the multiple constraint targets with a reinforcement learning method to train an underwater robot motion planning policy. The method is divided into a model construction stage and a policy training stage, comprising the following steps:
1. Model construction stage, as shown in Fig. 1, with the following specific steps:
A reinforcement learning task can usually be described by a Markov decision process. Because of the particularity of the underwater environment, multi-objective constraints such as the environment constraint, the obstacle constraint and the target-point constraint are considered when constructing the Markov decision process.
Concrete composition of the state space X: first, the avoidance sonar of the underwater robot processes the obstacle information of the robot's surroundings, i.e. the obstacle clearance l_i in the direction of the robot's i-th degree of freedom; second, the ADCP processes the ocean-current information of the surroundings, i.e. the flow velocity v_c at the robot's current position; third, GPS processes the relative position of the robot and the target point, i.e. the Euclidean distance d between the robot and the target point.
Concrete composition of the action space A: the action space in the present invention comprises five control commands, namely forward, forward-left, forward-right, lateral thrust left and lateral thrust right. The linear velocity of the robot is a fixed value v_a.
Concrete composition of the reward function R: if the robot collides, the reward value is r_ter; if the robot does not collide but moves farther and farther from the target point, the reward value is r_opp; if the robot does not collide and moves closer and closer to the target point, the reward value is r_move; if the robot reaches the target point, the reward value is r_arr.
2. Policy training stage, with the flow shown in Fig. 2, and the following specific steps:
A virtual environment is first set up for training, as follows:
A simulated marine environment is established with robot motion simulation software, in which obstacles, a target point and ocean currents are set. The obstacles and target point can be defined randomly, and 6-12 different robot starting points are defined.
The two-dimensional plane is rasterized; the ocean current within each grid cell can be regarded as identical, and the flow field is generated at random with a stream function Ψ(x, y). The velocity field of the ocean current can be obtained from the stream function; owing to the incompressibility of the fluid,
vc_x = ∂Ψ/∂y, vc_y = -∂Ψ/∂x
where vc_x and vc_y are the velocity components of the ocean current along the X and Y axes at position (x, y), taken at the center point of each grid cell.
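A divergence-free current field can be generated numerically from a random stream function by central differences. The particular Ψ used here, a sum of random sinusoids, is an illustrative assumption; the patent only requires that Ψ be generated at random.

```python
import math
import random

random.seed(1)
# Illustrative random stream function: a sum of random sinusoids (assumption).
coeffs = [(random.uniform(-1, 1), random.uniform(0.5, 2.0), random.uniform(0.5, 2.0))
          for _ in range(3)]

def psi(x, y):
    return sum(A * math.sin(kx * x) * math.sin(ky * y) for A, kx, ky in coeffs)

def current(x, y, h=1e-5):
    """vc_x = dPsi/dy, vc_y = -dPsi/dx, evaluated by central differences
    at a point such as a grid-cell centre."""
    vcx = (psi(x, y + h) - psi(x, y - h)) / (2 * h)
    vcy = -(psi(x + h, y) - psi(x - h, y)) / (2 * h)
    return vcx, vcy
```

Because the components come from one stream function, the resulting field is incompressible (zero divergence) by construction, which is the property the text invokes.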
Policy training is then carried out, with the following specific steps:
1) Initialize t = 0, where t is the step index within each training episode; initialize r_t = 0, where r_t is the reward obtained when the robot executes the t-th action. Define a matrix Q(x, a), recording the Q value obtainable by selecting action a in state x, initialized to 0. Initialize a counter count = 0 to record the total number of training episodes. Set a value M representing the total number of episodes the robot needs to train. Initialize the safety radius h_i in the direction of the robot's i-th degree of freedom. Set a value d', the threshold on the distance between the robot and the target point.
2) Initialize the robot's state: randomly select a starting point and begin exploration.
3) The robot obtains the environment information x_t, including the obstacle information, i.e. the distance l_i between the robot and the obstacle in the direction of the i-th degree of freedom (set to infinity if there is no obstacle); the ocean-current velocity v_c at the current position; and its own position, from which the Euclidean distance d between the target point and the robot is computed.
4) Set a threshold ε and generate a random number ε' with the computer. If ε' < ε, the robot selects, according to the matrix Q(x, a), the action a with the maximum value in state x_t, i.e. a_t = argmax_a Q(x_t, a); if ε' > ε, the robot randomly selects an action from the action space to execute, i.e. a_t = random_a Q(x_t, a).
5) Considering the kinematic constraint and the water-flow constraint, the robot moves in the simulated environment at the actually exhibited external speed v_it.
6) After the robot has executed action a_t, it obtains the environment information x_{t+1} again.
6-1) If l_i < h_i, a collision has occurred and this episode ends; increment the counter count and update the matrix Q according to
Q(x_t, a_t) ← (1-α)·Q(x_t, a_t) + α·(r_t + γ·max_{a'} Q(x_{t+1}, a'))
then reset the step index t = 0. If count < M, retrain from step 2); if count = M, continue with step 7).
6-2) If l_i > h_i, no collision has occurred; continue to judge whether the robot has reached the target point.
6-2-1) If d ≤ d', the target point has been reached and this episode ends; increment the counter count, update the matrix Q, and reset t = 0. If count < M, retrain from step 2); if count = M, continue with step 7).
6-2-2) If d > d', the target point has not been reached; increment t and continue this episode from step 3).
7) Training is complete; output the underwater robot motion planning policy.
The advantage of this method is that it takes into account multiple constraint targets such as the water flow, obstacles and the target point; traditional reinforcement learning planning methods do not consider multiple constraint targets simultaneously, and the methods they train lack practicality and robustness. By fusing the features of the multiple constraint targets through reinforcement learning, the present invention can train a more practical underwater robot motion planning policy.
Claims (5)
1. An underwater robot motion planning method based on multiple constraint targets, characterized in that the method is divided into a model construction stage and an algorithm training stage, comprising:
(1) the model construction stage: construction of the Markov decision process E, comprising the following steps:
(1-1) establishing the current environment state x_t from the sensor signals; letting the obstacle distance in the direction of the robot's i-th degree of freedom be l_i, and setting l_i to infinity if there is no obstacle in that direction; letting the flow velocity at the robot's current position be v_c; localizing the robot in real time and computing the Euclidean distance d between the robot and the target point;
(1-2) establishing the action space A of the robot according to the maximum speed at which the underwater robot can advance, A comprising five motion commands: forward, forward-left, forward-right, lateral thrust left and lateral thrust right, with linear velocity v_a and angular velocity ω_a;
(1-3) considering the obstacle constraint: letting h_i be the underwater safe guard distance for the i-th degree of freedom; if the detected l_i < h_i, a collision is considered to have occurred and a negative reward r_ter is set;
(1-4) considering the target-point constraint, with target-point threshold d': if d is detected to increase, a negative reward r_opp is set; if d is detected to decrease, a positive reward r_move is set; if d < d' is detected, the robot has reached the target point and a positive reward r_arr is set;
(2) the algorithm training stage: the robot performing continuous trial and error in computer simulation to learn a policy, comprising the following steps:
(2-1) initializing t = 0, where t is the step index within each training episode, and r_t = 0, where r_t is the reward obtained when the robot executes the t-th action;
(2-2) initializing a matrix Q(x, a), recording the Q value obtained by selecting action a in state x;
(2-3) initializing a counter count = 0 to record the total number of training episodes, and setting a value M representing the total number of episodes the robot needs to train;
(2-4) while count is less than the specified number of training episodes M, executing (2-5), otherwise executing (2-14);
(2-5) obtaining the sensor signals and the current state x_t, the current state x_t including the obstacle information, i.e. the obstacle distance l_i in the direction of the robot's i-th degree of freedom, the ocean-current velocity v_ct at the current position, and the robot's own position, from which the Euclidean distance d between the target point and the robot is computed;
(2-6) selecting action a_t according to the matrix Q;
(2-7) considering the kinematic constraint and the water-flow constraint: combining the velocity of the selected action a_t with the flow velocity, simulating with the resulting actually exhibited external route speed, and updating l_i;
(2-8) if l_i < h_i, executing (2-9), otherwise executing (2-10);
(2-9) a collision having occurred: setting r_t = r_ter, ending this episode, setting x_{t+1} to empty, updating the matrix Q, incrementing count, setting t = 0, and restarting training from (2-4);
(2-10) if d' < d, executing (2-11); otherwise the target point having been reached: setting r_t = r_arr, ending this episode, setting x_{t+1} to empty, updating the matrix Q, incrementing count, setting t = 0, and restarting training from (2-4);
(2-11) if d_t < d_{t-1}, executing (2-12), otherwise executing (2-13);
(2-12) d having decreased: setting r_t = r_move, updating x_{t+1}, updating the matrix Q, incrementing t, and continuing the training flow from (2-5);
(2-13) d having increased: setting r_t = r_opp, updating x_{t+1}, updating the matrix Q, incrementing t, and continuing the training flow from (2-5);
(2-14) ending training to obtain the trained matrix Q;
(2-15) outputting the underwater robot motion planning policy.
2. The underwater robot motion planning method based on multiple constraint targets according to claim 1, characterized in that the kinematic constraint, i.e. the kinematic restriction of the underwater robot itself during training, is as follows: assuming the vehicle's center of gravity has coordinates (x, y) in the fixed coordinate frame, the robot's velocity in the fixed frame is expressed in terms of the trim angle θ, the heel angle φ, and an influence coefficient α of the kinematic constraint on the underwater robot's speed.
3. The underwater robot motion planning method based on multiple constraint targets according to claim 1, characterized in that the water-flow constraint is taken into account during action selection in training as follows: during learning, in state x_t the flow velocity obtained by the ADCP is v_ct; the robot selects an action a_t from the action set according to the policy, and the vehicle's own speed is v_at; when the robot executes the action, the water-flow constraint is considered, and the route speed actually exhibited externally is v_it = v_at + β·v_ct, where β is the influence coefficient of the water flow on the underwater robot's speed.
4. The underwater robot motion planning method based on multiple constraint targets according to claim 1, characterized in that the method of selecting the action a_t is: using an ε-greedy policy, setting a threshold ε, and generating a random number ε' with the computer; if the random number is less than the threshold, i.e. ε' < ε, the robot executes the action corresponding to the maximum element of Q(x_t, ·) in the Q matrix, i.e. a_t = argmax_a Q(x_t, a); if the random number is greater than the threshold, i.e. ε' > ε, the robot randomly selects an action to execute, i.e. a_t = random_a Q(x_t, a).
5. The underwater robot motion planning method based on multiple constraint targets according to claim 1, characterized in that the method of updating the matrix Q is: assuming the state before the robot executes the action is x_t, the executed action is a_t, the reward obtained from feedback is r_t, and the state reached after executing the action is x_{t+1}, then
Q(x_t, a_t) ← (1-α)·Q(x_t, a_t) + α·(r_t + γ·max_{a'} Q(x_{t+1}, a'))
where α is the learning rate and γ is the discount factor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810764979.8A CN109241552B (en) | 2018-07-12 | 2018-07-12 | Underwater robot motion planning method based on multiple constraint targets |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109241552A true CN109241552A (en) | 2019-01-18 |
CN109241552B CN109241552B (en) | 2022-04-05 |
Family
ID=65072571
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810764979.8A Active CN109241552B (en) | 2018-07-12 | 2018-07-12 | Underwater robot motion planning method based on multiple constraint targets |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109241552B (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110196605A (en) * | 2019-04-26 | 2019-09-03 | 大连海事大学 | A kind of more dynamic object methods of the unmanned aerial vehicle group of intensified learning collaboratively searching in unknown sea area |
CN110333739A (en) * | 2019-08-21 | 2019-10-15 | 哈尔滨工程大学 | A kind of AUV conduct programming and method of controlling operation based on intensified learning |
CN110765686A (en) * | 2019-10-22 | 2020-02-07 | 中国人民解放军战略支援部队信息工程大学 | Method for designing shipborne sonar sounding line by using limited wave band submarine topography |
CN110779132A (en) * | 2019-11-13 | 2020-02-11 | 垚控科技(上海)有限公司 | Water pump equipment operation control system of air conditioning system based on reinforcement learning |
CN110955239A (en) * | 2019-11-12 | 2020-04-03 | 中国地质大学(武汉) | Unmanned ship multi-target trajectory planning method and system based on inverse reinforcement learning |
CN112052511A (en) * | 2020-06-15 | 2020-12-08 | 成都蓉奥科技有限公司 | Air combat maneuver strategy generation technology based on deep random game |
CN112149354A (en) * | 2020-09-24 | 2020-12-29 | 哈尔滨工程大学 | Reinforced learning algorithm research platform for UUV cluster |
CN112631269A (en) * | 2019-10-08 | 2021-04-09 | 国立大学法人静冈大学 | Autonomous mobile robot and control program for autonomous mobile robot |
CN112650246A (en) * | 2020-12-23 | 2021-04-13 | 武汉理工大学 | Ship autonomous navigation method and device |
CN112925319A (en) * | 2021-01-25 | 2021-06-08 | 哈尔滨工程大学 | Underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning |
CN112945232A (en) * | 2019-12-11 | 2021-06-11 | 中国科学院沈阳自动化研究所 | Target value planning method for near-bottom terrain tracking of underwater robot |
CN113052257A (en) * | 2021-04-13 | 2021-06-29 | 中国电子科技集团公司信息科学研究院 | Deep reinforcement learning method and device based on visual converter |
CN113110493A (en) * | 2021-05-07 | 2021-07-13 | 北京邮电大学 | Path planning equipment and path planning method based on photonic neural network |
WO2021147192A1 (en) * | 2020-01-21 | 2021-07-29 | 厦门邑通软件科技有限公司 | Machine heuristic learning method, system and device for operation behavior record management |
CN114559439A (en) * | 2022-04-27 | 2022-05-31 | 南通科美自动化科技有限公司 | Intelligent obstacle avoidance control method and device for mobile robot and electronic equipment |
CN116295449A (en) * | 2023-05-25 | 2023-06-23 | 吉林大学 | Method and device for indicating path of autonomous underwater vehicle |
CN117079118A (en) * | 2023-10-16 | 2023-11-17 | 广州华夏汇海科技有限公司 | Underwater walking detection method and system based on visual detection |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102117071A (en) * | 2009-12-30 | 2011-07-06 | 中国科学院沈阳自动化研究所 | Multi-underwater robot semi-physical simulation system and control method thereof |
CN105512769A (en) * | 2015-12-16 | 2016-04-20 | 上海交通大学 | Unmanned aerial vehicle route planning system and unmanned aerial vehicle route planning method based on genetic programming |
CN106950969A (en) * | 2017-04-28 | 2017-07-14 | 深圳市唯特视科技有限公司 | It is a kind of based on the mobile robot continuous control method without map movement planner |
CN107065881A (en) * | 2017-05-17 | 2017-08-18 | 清华大学 | A kind of robot global path planning method learnt based on deeply |
CN107168309A (en) * | 2017-05-02 | 2017-09-15 | 哈尔滨工程大学 | A kind of underwater multi-robot paths planning method of Behavior-based control |
WO2017161632A1 (en) * | 2016-03-24 | 2017-09-28 | 苏州大学张家港工业技术研究院 | Cleaning robot optimal target path planning method based on model learning |
CN107883961A (en) * | 2017-11-06 | 2018-04-06 | 哈尔滨工程大学 | A kind of underwater robot method for optimizing route based on Smooth RRT algorithms |
CN107918396A (en) * | 2017-11-30 | 2018-04-17 | 深圳市智能机器人研究院 | A kind of underwater cleaning robot paths planning method and system based on hull model |
CN108268031A (en) * | 2016-12-30 | 2018-07-10 | 深圳光启合众科技有限公司 | Paths planning method, device and robot |
- 2018-07-12: Application CN201810764979.8A filed (CN); patent CN109241552B granted; status: Active
Non-Patent Citations (3)
Title |
---|
YUSHAN SUN: "Research on Recovery Strategy and Motion Control for Autonomous Underwater Vehicle", Proceedings of 2018 International Conference on Advanced Control, Automation and Artificial Intelligence (ACAAI2018) * |
ZHANG GUO-CHENG: "A Novel Adaptive Second Order Sliding Mode Path Following Control for a Portable AUV", Ocean Engineering * |
HONG Ye et al.: "Research on AUV Global Path Planning Based on Hierarchical Markov Decision Processes", Journal of System Simulation * |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110196605B (en) * | 2019-04-26 | 2022-03-22 | 大连海事大学 | Method for cooperatively searching multiple dynamic targets in unknown sea area by reinforcement learning unmanned aerial vehicle cluster |
CN110196605A (en) * | 2019-04-26 | 2019-09-03 | 大连海事大学 | A kind of more dynamic object methods of the unmanned aerial vehicle group of intensified learning collaboratively searching in unknown sea area |
CN110333739A (en) * | 2019-08-21 | 2019-10-15 | 哈尔滨工程大学 | A kind of AUV conduct programming and method of controlling operation based on intensified learning |
CN110333739B (en) * | 2019-08-21 | 2020-07-31 | 哈尔滨工程大学 | AUV (autonomous Underwater vehicle) behavior planning and action control method based on reinforcement learning |
CN112631269A (en) * | 2019-10-08 | 2021-04-09 | 国立大学法人静冈大学 | Autonomous mobile robot and control program for autonomous mobile robot |
CN112631269B (en) * | 2019-10-08 | 2024-06-11 | 国立大学法人静冈大学 | Autonomous mobile robot and control program for autonomous mobile robot |
CN110765686A (en) * | 2019-10-22 | 2020-02-07 | 中国人民解放军战略支援部队信息工程大学 | Method for designing shipborne sonar sounding line by using limited wave band submarine topography |
CN110765686B (en) * | 2019-10-22 | 2020-09-11 | 中国人民解放军战略支援部队信息工程大学 | Method for designing shipborne sonar sounding line by using limited wave band submarine topography |
CN110955239A (en) * | 2019-11-12 | 2020-04-03 | 中国地质大学(武汉) | Unmanned ship multi-target trajectory planning method and system based on inverse reinforcement learning |
CN110955239B (en) * | 2019-11-12 | 2021-03-02 | 中国地质大学(武汉) | Unmanned ship multi-target trajectory planning method and system based on inverse reinforcement learning |
CN110779132A (en) * | 2019-11-13 | 2020-02-11 | 垚控科技(上海)有限公司 | Water pump equipment operation control system of air conditioning system based on reinforcement learning |
CN112945232A (en) * | 2019-12-11 | 2021-06-11 | 中国科学院沈阳自动化研究所 | Target value planning method for near-bottom terrain tracking of underwater robot |
CN112945232B (en) * | 2019-12-11 | 2022-11-01 | 中国科学院沈阳自动化研究所 | Target value planning method for near-bottom terrain tracking of underwater robot |
WO2021147192A1 (en) * | 2020-01-21 | 2021-07-29 | 厦门邑通软件科技有限公司 | Machine heuristic learning method, system and device for operation behavior record management |
CN112052511A (en) * | 2020-06-15 | 2020-12-08 | 成都蓉奥科技有限公司 | Air combat maneuver strategy generation technology based on deep random game |
CN112149354A (en) * | 2020-09-24 | 2020-12-29 | 哈尔滨工程大学 | Reinforced learning algorithm research platform for UUV cluster |
CN112650246A (en) * | 2020-12-23 | 2021-04-13 | 武汉理工大学 | Ship autonomous navigation method and device |
CN112925319A (en) * | 2021-01-25 | 2021-06-08 | 哈尔滨工程大学 | Underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning |
CN113052257B (en) * | 2021-04-13 | 2024-04-16 | 中国电子科技集团公司信息科学研究院 | Deep reinforcement learning method and device based on visual transducer |
CN113052257A (en) * | 2021-04-13 | 2021-06-29 | 中国电子科技集团公司信息科学研究院 | Deep reinforcement learning method and device based on visual converter |
CN113110493A (en) * | 2021-05-07 | 2021-07-13 | 北京邮电大学 | Path planning equipment and path planning method based on photonic neural network |
CN114559439A (en) * | 2022-04-27 | 2022-05-31 | 南通科美自动化科技有限公司 | Intelligent obstacle avoidance control method and device for mobile robot and electronic equipment |
CN114559439B (en) * | 2022-04-27 | 2022-07-26 | 南通科美自动化科技有限公司 | Mobile robot intelligent obstacle avoidance control method and device and electronic equipment |
CN116295449A (en) * | 2023-05-25 | 2023-06-23 | 吉林大学 | Method and device for indicating path of autonomous underwater vehicle |
CN116295449B (en) * | 2023-05-25 | 2023-09-12 | 吉林大学 | Method and device for indicating path of autonomous underwater vehicle |
CN117079118A (en) * | 2023-10-16 | 2023-11-17 | 广州华夏汇海科技有限公司 | Underwater walking detection method and system based on visual detection |
CN117079118B (en) * | 2023-10-16 | 2024-01-16 | 广州华夏汇海科技有限公司 | Underwater walking detection method and system based on visual detection |
Also Published As
Publication number | Publication date |
---|---|
CN109241552B (en) | 2022-04-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109241552A (en) | A kind of underwater robot motion planning method based on multiple constraint target | |
CN109540151B (en) | AUV three-dimensional path planning method based on reinforcement learning | |
Chen et al. | A hybrid path planning algorithm for unmanned surface vehicles in complex environment with dynamic obstacles | |
Cao et al. | Target search control of AUV in underwater environment with deep reinforcement learning | |
CN108564202B (en) | Unmanned ship route optimization method based on environment forecast information | |
CN108803313B (en) | Path planning method based on ocean current prediction model | |
CN106338919A (en) | USV (Unmanned Surface Vehicle) track tracking control method based on enhanced learning type intelligent algorithm | |
Hao et al. | Dynamic path planning of a three-dimensional underwater AUV based on an adaptive genetic algorithm | |
Wang et al. | Improved quantum particle swarm optimization algorithm for offline path planning in AUVs | |
Xue et al. | Proximal policy optimization with reciprocal velocity obstacle based collision avoidance path planning for multi-unmanned surface vehicles | |
CN108549394A (en) | A kind of more AUV straight lines formation control methods based on pilotage people and virtual pilotage people | |
CN107168309A (en) | A kind of underwater multi-robot paths planning method of Behavior-based control | |
Cao et al. | Toward optimal rendezvous of multiple underwater gliders: 3D path planning with combined sawtooth and spiral motion | |
CN104794526A (en) | Automatic ship anti-collision method optimized by wolf colony search algorithm | |
CN109765929A (en) | A kind of UUV Real Time Obstacle Avoiding planing method based on improvement RNN | |
Lan et al. | Path planning for underwater gliders in time-varying ocean current using deep reinforcement learning | |
Sun et al. | A novel fuzzy control algorithm for three-dimensional AUV path planning based on sonar model | |
CN117215317B (en) | Unmanned ship local path planning method, equipment and storage medium | |
Fang et al. | Autonomous underwater vehicle formation control and obstacle avoidance using multi-agent generative adversarial imitation learning | |
CN113093804A (en) | Unmanned ship formation control method and control system based on inversion sliding mode control | |
Xia et al. | Research on collision avoidance algorithm of unmanned surface vehicle based on deep reinforcement learning | |
CN114973650A (en) | Vehicle ramp entrance confluence control method, vehicle, electronic device, and storage medium | |
Yan et al. | Real-world learning control for autonomous exploration of a biomimetic robotic shark | |
Wang et al. | Obstacle avoidance for environmentally-driven USVs based on deep reinforcement learning in large-scale uncertain environments | |
Liang et al. | Economic MPC-based planning for marine vehicles: Tuning safety and energy efficiency |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
EE01 | Entry into force of recordation of patent licensing contract | ||
Application publication date: 20190118
Assignee: Osenda (Shandong) Offshore Engineering Co., Ltd.
Assignor: HARBIN ENGINEERING University
Contract record no.: X2023230000005
Denomination of invention: A Motion Planning Method of Underwater Vehicle Based on Multi-constrained Targets
Granted publication date: 20220405
License type: Exclusive License
Record date: 20230130