CN111880535B - Unmanned ship hybrid sensing autonomous obstacle avoidance method and system based on reinforcement learning - Google Patents

Unmanned ship hybrid sensing autonomous obstacle avoidance method and system based on reinforcement learning

Info

Publication number
CN111880535B
CN111880535B
Authority
CN
China
Prior art keywords
unmanned ship
network
reward
target
coordinate system
Prior art date
Legal status
Active
Application number
CN202010715076.8A
Other languages
Chinese (zh)
Other versions
CN111880535A (en
Inventor
Zhang Weidong (张卫东)
Wang Xuechun (王雪纯)
Xu Xinli (徐鑫莉)
Cai Yunze (蔡云泽)
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202010715076.8A priority Critical patent/CN111880535B/en
Publication of CN111880535A publication Critical patent/CN111880535A/en
Application granted granted Critical
Publication of CN111880535B publication Critical patent/CN111880535B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05D - SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 - Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 - Control of position or course in two dimensions
    • G05D1/021 - Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0257 - using a radar
    • G05D1/0212 - with means for defining a desired trajectory
    • G05D1/0221 - involving a learning process
    • G05D1/0223 - involving speed control of the vehicle
    • G05D1/0276 - using signals provided by a source external to the vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to a reinforcement learning-based hybrid perception autonomous obstacle avoidance method and system for an unmanned ship, wherein the method comprises the following steps: 1) building a marine environment; 2) setting an action space according to the propeller configuration of the unmanned ship, and learning a reinforcement learning state code from the global planning information provided by the static chart and the obstacle information within the detection radius of the radar system; 3) setting reward target weights to obtain a composite reward function; 4) building and training an evaluation network and a policy network; 5) feeding the reinforcement learning state code into the evaluation network and the policy network respectively, feeding the composite reward function into the evaluation network, and determining the controller output from the action corresponding to the mean of the learned policy. Compared with the prior art, the invention has a high self-learning ability and can adapt to different large-scale complex environments after simple deployment training, thereby realizing autonomous perception, autonomous navigation and autonomous obstacle avoidance.

Description

Unmanned ship hybrid sensing autonomous obstacle avoidance method and system based on reinforcement learning
Technical Field
The invention relates to an unmanned ship autonomous obstacle avoidance method and system, in particular to an unmanned ship hybrid sensing autonomous obstacle avoidance method and system based on reinforcement learning.
Background
An unmanned ship is an unmanned surface vehicle capable of autonomous navigation, autonomous obstacle avoidance and autonomous surface operation, with the advantages of small size, high speed, good stealth and no risk of casualties. It is well suited both to surface operation tasks in dangerous sea areas that pose a high casualty risk to crews and to simple surface tasks requiring little human involvement. With a good cost-effectiveness ratio, unmanned ships have been widely and effectively applied in ocean monitoring, ocean survey, maritime search and rescue, unmanned freight and other fields.
At present, the mainstream approach to autonomous navigation of unmanned ships is to deploy separate algorithms for autonomous perception, autonomous navigation and autonomous obstacle avoidance, which cooperate to complete navigation and operation tasks. For example, vision-based perception involves pattern recognition and target detection algorithms; global-planning autonomous navigation is typically realized with grid maps, the A* algorithm or genetic algorithms; and local dynamic collision avoidance mainly applies methods such as artificial potential fields and optimal reciprocal collision avoidance. Although these methods perform well in their respective application settings, each functional module must be carefully designed, and the combined algorithm requires global configuration and parameter tuning, making the unmanned ship's intelligent algorithm complex and tedious to implement. Moreover, because these methods lack the ability to learn autonomously, they adapt poorly to large-scale complex environments, and the algorithm modules must be redesigned and recombined for each new environment.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provide a reinforcement learning-based unmanned ship hybrid perception autonomous obstacle avoidance method and system with autonomous learning and environmental characteristic adaptation capabilities.
The purpose of the invention can be realized by the following technical scheme:
a reinforcement learning-based unmanned ship hybrid perception autonomous obstacle avoidance method, comprising the following steps:
1) building a marine environment: establishing an interaction rule between the unmanned ship and the marine environment, generating random obstacles, and randomly generating an initial point and a target point for the unmanned ship;
2) setting an action space and a state space: setting the action space according to the propeller configuration of the unmanned ship, and learning a reinforcement learning state code from the global planning information provided by the static chart and the obstacle information within the detection radius of the radar system;
3) determining the reward function: setting reward target weights to obtain a composite reward function;
4) establishing and training an evaluation network and a policy network: the evaluation network and the policy network are each formed by connecting a state-encoding network with a perceptron, and the network parameters are initialized and trained;
5) agent decision controller output: feeding the reinforcement learning state code into the evaluation network and the policy network respectively, feeding the composite reward function into the evaluation network, and determining the controller output from the action corresponding to the mean of the learned policy.
Preferably, the interaction rule between the unmanned ship and the marine environment in step 1) follows the unmanned ship's own dynamic equations.
Preferably, the random obstacles generated in step 1) are of four kinds: random static obstacles that can be depicted on a chart, random dynamic obstacles that cannot be depicted on a chart, random dynamic obstacles with autonomous control capability, and random dynamic obstacles without autonomous control capability.
Preferably, the action space in step 2) comprises the discretized surge force and yaw moment.
Preferably, the reinforcement learning state code in step 2) is obtained through deep network learning, specifically:
the features of the static chart are learned through a convolutional neural network combined with fully connected layers to obtain a static planning state code; this static planning state code, together with the dynamic obstacle avoidance state code fed back by the radar system, serves as the key features of the reinforcement learning state code, whose importance is re-allocated through a learned overall weight matrix to obtain the final reinforcement learning state code.
Preferably, the dynamic obstacle avoidance state code is the nine-tuple

$$s_t^{dyn} = \left(\sigma_t,\ d_t^{target},\ \varphi_t^{target},\ \psi_t,\ u_t,\ v_t,\ r_t,\ d_t^{obs},\ \varphi_t^{obs}\right)$$

where the subscript $t$ denotes time $t$; $\sigma_t$ is the obstacle-detected flag within the detection radius; $d_t^{target}$ and $\varphi_t^{target}$ are the distance and angle from the unmanned ship to the target in the world coordinate system; $\psi_t$ is the yaw angle of the unmanned ship in the world coordinate system; $u_t$, $v_t$ and $r_t$ are the surge speed, sway speed and yaw rate in the unmanned ship's body coordinate system; and $d_t^{obs}$ and $\varphi_t^{obs}$ are the distance and angle of the nearest obstacle in the world coordinate system.
Preferably, the composite reward function in step 3) is the product of a reward target weight matrix and the reward targets, which comprise: a distance reward target, an obstacle avoidance reward target, a speed reward target and an energy consumption reward target.
Preferably, the reward targets are obtained as follows:

in the task of navigating the unmanned ship to the target point, if $d_{t+1}^{target} < d_t^{target}$, the distance reward target $R_{distance} = 1$; otherwise $R_{distance} = 0$, where $d_t^{target}$ is the distance from the unmanned ship to the target in the world coordinate system, the subscript $t$ denoting time $t$ and $t+1$ denoting time $t+1$;

when the radar detects an obstacle and the unmanned ship is within the range threatened by the obstacle, if $d_{t+1}^{obs} > d_t^{obs}$, the obstacle avoidance reward target $R_{obstacle} = 1$; otherwise $R_{obstacle} = 0$, where $d_t^{obs}$ is the distance to the nearest obstacle in the world coordinate system;

if $\sqrt{u_t^2 + v_t^2} \geq v_{th}$, the speed reward target $R_{speed} = 1$; otherwise $R_{speed} = 0$, where $u_t$ is the surge speed and $v_t$ the sway speed in the unmanned ship's body coordinate system, and $v_{th}$ is the set speed threshold;

if $|\tau_u| + |\tau_r| \leq \tau_{th}$, the energy consumption reward target $R_{consumption} = 1$; otherwise $R_{consumption} = 0$, where $\tau_u$ is the surge force and $\tau_r$ the yaw moment of the unmanned ship, and $\tau_{th}$ is the set energy consumption threshold.
Preferably, step 4) is completed based on the A3C (asynchronous advantage actor-critic) algorithm.
An unmanned ship hybrid perception autonomous obstacle avoidance system based on reinforcement learning comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the processor realizes the autonomous obstacle avoidance method when running the computer program.
Compared with the prior art, the invention has the following advantages:
the algorithm has a high self-learning capacity and can adapt to different large-scale complex environments through simple deployment training, thereby realizing autonomous perception, autonomous navigation and autonomous obstacle avoidance;
the algorithm integrates environmental perception with navigation and obstacle avoidance, eliminating the heavy burden of separate configuration and global parameter tuning caused by modular algorithm design;
the algorithm has both static planning and dynamic collision avoidance capabilities: on the one hand, trajectory planning is realized by learning the static sea chart; on the other hand, the algorithm copes with real-time sea-surface threats with a reliable and stable avoidance capability.
Drawings
Fig. 1 is a schematic diagram of the overall structure of the unmanned surface vehicle hybrid sensing autonomous obstacle avoidance method based on reinforcement learning.
Fig. 2 is a schematic diagram of state coding of the unmanned ship hybrid perception reinforcement learning algorithm.
Fig. 3 is a parameter explanatory diagram of dynamic obstacle avoidance coding.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. Note that the following embodiments are merely illustrative; the invention is not limited to the applications or uses described, nor to the embodiments below.
Examples
As shown in fig. 1, an unmanned surface vehicle hybrid perception autonomous obstacle avoidance method based on reinforcement learning includes the following steps:
1) building a marine environment: establishing an interaction rule between the unmanned ship and a marine environment, generating random obstacles, and randomly generating an initial point and a final point of the unmanned ship;
the unmanned ship and marine environment interaction rule follows the self-kinetic equation of the unmanned ship:
Figure BDA0002597880580000041
Figure BDA0002597880580000042
wherein eta is [ x, y, psi ═ x, y, psi]TContaining unmanned boat position and yaw angle information, v ═ u, upsilon, r]TContaining yaw, surge, yaw speed information, [ tau ═u,0,τt]TThe pitching force and the yawing force of the unmanned boat, M is the mass of the unmanned boat, R (psi) is a function of the yaw angle psi, and C (v) and g (v) are functions of v respectively;
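For illustration, the interaction rule can be simulated by numerically integrating the above equations. The following Python sketch assumes a diagonal inertia matrix, rigid-body Coriolis terms and linear damping for $C(\nu)$ and $g(\nu)$; the mass, inertia and damping values are placeholders rather than parameters taken from the invention.

```python
import numpy as np

def usv_step(eta, nu, tau, dt=0.1, m=50.0, Iz=20.0,
             d_lin=np.array([25.0, 40.0, 10.0])):
    """One forward-Euler step of the 3-DOF model
    eta_dot = R(psi) * nu,  M * nu_dot + C(nu) * nu + g(nu) = tau.
    eta = [x, y, psi] (world frame), nu = [u, v, r] (body frame),
    tau = [tau_u, 0, tau_r] for the under-actuated unmanned ship.
    m, Iz and the linear damping d_lin are illustrative placeholders."""
    psi = eta[2]
    u, v, r = nu
    M = np.diag([m, m, Iz])                   # assumed diagonal inertia matrix
    C = np.array([[0.0,   -m * r, 0.0],       # rigid-body Coriolis/centripetal terms
                  [m * r,  0.0,   0.0],
                  [0.0,    0.0,   0.0]])
    g = d_lin * nu                            # g(nu) modelled here as linear damping
    R = np.array([[np.cos(psi), -np.sin(psi), 0.0],
                  [np.sin(psi),  np.cos(psi), 0.0],
                  [0.0,          0.0,         1.0]])
    nu_dot = np.linalg.solve(M, tau - C @ nu - g)
    return eta + dt * (R @ nu), nu + dt * nu_dot
```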
the random obstacles generated include 4 kinds: random static obstacles that can be delineated by a chart, random dynamic obstacles that cannot be delineated by a chart, random dynamic obstacles with autonomous control capability, and random dynamic obstacles without autonomous control capability.
For each generated marine environment, four pairs of initial and target points are set at random, and the agent interacts 500 times with the marine environment for each pair of initial and target points.
2) Setting an action space and a state space: setting the action space according to the propeller configuration of the unmanned ship, and learning a reinforcement learning state code from the global planning information provided by the static chart and the obstacle information within the detection radius of the radar system.
The action space comprises the discretized surge force and yaw moment; a construction sketch is given below.
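A minimal sketch of how such a discrete action set can be constructed follows; the thrust and moment limits are assumptions, while the 20 levels per force follow the embodiment described later.

```python
import itertools
import numpy as np

TAU_U_MAX = 100.0   # assumed surge-force limit
TAU_R_MAX = 50.0    # assumed yaw-moment limit
N_LEVELS = 20       # each propulsion force discretized into 20 levels

surge_levels = np.linspace(-TAU_U_MAX, TAU_U_MAX, N_LEVELS)
yaw_levels = np.linspace(-TAU_R_MAX, TAU_R_MAX, N_LEVELS)

# Joint discrete action set tau = [tau_u, 0, tau_r]: 20 x 20 = 400 actions
ACTIONS = [np.array([tu, 0.0, tr])
           for tu, tr in itertools.product(surge_levels, yaw_levels)]
```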
The reinforcement learning state code is obtained through deep network learning, specifically:
the features of the static chart are learned through a convolutional neural network combined with fully connected layers to obtain a static planning state code; this static planning state code, together with the dynamic obstacle avoidance state code fed back by the radar system, serves as the key features of the reinforcement learning state code, whose importance is re-allocated through a learned overall weight matrix to obtain the final reinforcement learning state code.
The dynamic obstacle avoidance state code is the nine-tuple

$$s_t^{dyn} = \left(\sigma_t,\ d_t^{target},\ \varphi_t^{target},\ \psi_t,\ u_t,\ v_t,\ r_t,\ d_t^{obs},\ \varphi_t^{obs}\right)$$

where the subscript $t$ denotes time $t$; $\sigma_t$ is the obstacle-detected flag within the detection radius; $d_t^{target}$ and $\varphi_t^{target}$ are the distance and angle from the unmanned ship to the target in the world coordinate system; $\psi_t$ is the yaw angle of the unmanned ship in the world coordinate system; $u_t$, $v_t$ and $r_t$ are the surge speed, sway speed and yaw rate in the unmanned ship's body coordinate system; and $d_t^{obs}$ and $\varphi_t^{obs}$ are the distance and angle of the nearest obstacle in the world coordinate system.
The action space of the under-actuated unmanned ship consists of the discretized outputs of the surge force and the yaw moment, each discretized into 20 levels according to the thrust magnitude. Referring to fig. 2, the static planning state code is obtained by learning the sea chart features through a combined CNN and FC network and is finally compressed into a 256-dimensional vector. The nine-tuple of the dynamic obstacle avoidance state code is illustrated in fig. 3. The reinforcement learning state code is the 265-dimensional vector formed by combining the two codes and multiplying them by a learned weight matrix.
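A minimal PyTorch sketch of this hybrid state encoder follows; the chart resolution and convolutional channel counts are assumptions, while the 256-dimensional static code, the 9-dimensional dynamic code and the learned 265 x 265 weight matrix follow the description above.

```python
import torch
import torch.nn as nn

class HybridStateEncoder(nn.Module):
    """Sketch of the hybrid perception state encoder: a CNN + FC stack
    compresses the static chart into a 256-dim planning code, which is
    concatenated with the 9-dim dynamic obstacle-avoidance tuple and
    re-weighted by a learned matrix into the final 265-dim state code.
    The 64x64 chart size and channel counts are assumptions."""
    def __init__(self, chart_size=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        feat = 32 * (chart_size // 4) ** 2
        self.fc = nn.Linear(feat, 256)                    # static planning code
        self.reweight = nn.Linear(265, 265, bias=False)   # learned overall weight matrix

    def forward(self, chart, dyn_code):
        # chart: (B, 1, H, W) static sea chart; dyn_code: (B, 9) radar nine-tuple
        static_code = torch.relu(self.fc(self.cnn(chart)))
        s = torch.cat([static_code, dyn_code], dim=1)     # 256 + 9 = 265 dims
        return self.reweight(s)                           # importance re-allocation
```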
3) Determining the reward function: setting reward target weights to obtain a composite reward function.
The composite reward function is the product of a reward target weight matrix and the reward targets, which comprise: a distance reward target, an obstacle avoidance reward target, a speed reward target and an energy consumption reward target.
The reward targets are obtained as follows:

in the task of navigating the unmanned ship to the target point, if $d_{t+1}^{target} < d_t^{target}$, the distance reward target $R_{distance} = 1$; otherwise $R_{distance} = 0$, where $d_t^{target}$ is the distance from the unmanned ship to the target in the world coordinate system, the subscript $t$ denoting time $t$ and $t+1$ denoting time $t+1$;

when the radar detects an obstacle and the unmanned ship is within the range threatened by the obstacle, if $d_{t+1}^{obs} > d_t^{obs}$, the obstacle avoidance reward target $R_{obstacle} = 1$; otherwise $R_{obstacle} = 0$, where $d_t^{obs}$ is the distance to the nearest obstacle in the world coordinate system;

if $\sqrt{u_t^2 + v_t^2} \geq v_{th}$, the speed reward target $R_{speed} = 1$; otherwise $R_{speed} = 0$, where $u_t$ is the surge speed and $v_t$ the sway speed in the unmanned ship's body coordinate system, and $v_{th}$ is the set speed threshold;

if $|\tau_u| + |\tau_r| \leq \tau_{th}$, the energy consumption reward target $R_{consumption} = 1$; otherwise $R_{consumption} = 0$, where $\tau_u$ is the surge force and $\tau_r$ the yaw moment of the unmanned ship, and $\tau_{th}$ is the set energy consumption threshold.
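Taken together, the composite reward can be sketched as the product of the weight matrix (here a weight vector) with the four binary reward targets; the weights and thresholds below are illustrative placeholders, not values fixed by the invention.

```python
import numpy as np

def composite_reward(d_tgt, d_tgt_next, d_obs, d_obs_next, obstacle_seen,
                     u, v, tau_u, tau_r,
                     v_th=1.0, tau_th=80.0,
                     weights=np.array([1.0, 1.0, 0.3, 0.2])):
    """Composite reward = weight vector . [R_distance, R_obstacle,
    R_speed, R_consumption]; all weights/thresholds are placeholders."""
    r_distance = 1.0 if d_tgt_next < d_tgt else 0.0                      # closed in on target
    r_obstacle = 1.0 if (obstacle_seen and d_obs_next > d_obs) else 0.0  # backed off nearest obstacle
    r_speed = 1.0 if np.hypot(u, v) >= v_th else 0.0                     # speed above threshold
    r_consume = 1.0 if abs(tau_u) + abs(tau_r) <= tau_th else 0.0        # low energy use
    return float(weights @ np.array([r_distance, r_obstacle, r_speed, r_consume]))
```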
4) Establishing and training the evaluation network and the policy network: this step is completed based on the A3C algorithm. The evaluation network and the policy network are each formed by connecting the state-encoding network with a perceptron, and the network parameters are initialized and trained. During training, the gradient accumulation of the evaluation network follows the update rule

$$d\theta \leftarrow d\theta + \frac{\partial \left(r_t + \gamma V(s_{t+1};\theta) - V(s_t;\theta)\right)^2}{\partial \theta}$$

and the gradient accumulation of the policy network follows

$$d\omega \leftarrow d\omega + \nabla_{\omega} \log \pi(a_t|s_t;\omega)\left(r_t + \gamma V(s_{t+1};\theta) - V(s_t;\theta)\right)$$

where $\omega$ is the network parameter of the policy network, $\theta$ is the network parameter of the evaluation network, $s_t$ is the state code of the unmanned ship at time $t$, $a_t$ is the decision of the unmanned ship at time $t$, $\pi(a_t|s_t;\omega)$ is the action output by the policy network in state $s_t$, $r_t$ is the reward value given by the environment after the unmanned ship makes decision $a_t$, $V(s_t;\theta)$ is the value predicted by the evaluation network in state $s_t$, and $\gamma$ is the discount factor.
Training updates the parameters of the two networks to obtain $V(s)$ and $\pi(a|s)$, and at the same time yields the hybrid perception state encoding.
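The two update rules correspond to one-step advantage actor-critic losses. The sketch below, written for a single transition over a discrete action set, treats the TD target as a constant for the evaluation network and the advantage as a constant for the policy network; the discount factor value is an assumption.

```python
import torch
import torch.nn.functional as F

def a3c_losses(logits, v_t, v_t1, action, r_t, gamma=0.99):
    """One-step A3C losses for a single transition.
    logits: policy-network outputs over the discrete actions in state s_t;
    v_t, v_t1: evaluation-network values V(s_t), V(s_{t+1});
    action: index of a_t; r_t: reward returned by the environment."""
    td_target = r_t + gamma * v_t1.detach()       # r_t + gamma * V(s_{t+1}; theta)
    advantage = td_target - v_t                   # TD error shared by both networks
    critic_loss = advantage.pow(2)                # squared TD error, minimized over theta
    log_pi = F.log_softmax(logits, dim=-1)[action]
    actor_loss = -log_pi * advantage.detach()     # ascend log pi * advantage over omega
    return critic_loss, actor_loss
```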
5) Agent decision controller output: the reinforcement learning state code is fed into the evaluation network and the policy network respectively, the composite reward function is fed into the evaluation network, and the controller output is determined by the action corresponding to the mean of the learned policy.
In this embodiment, during training the controller output, i.e. the action selection, is obtained by sampling from the learned mean-variance policy distribution. When the unmanned ship collides, the current training episode ends early; once 500 training episodes have been completed for the current pair of initial and target points, the method returns to step 1 and regenerates the initial and target points; and once four pairs of initial and target points have been set for the current environment, a new marine environment is generated.
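The training schedule of this embodiment can be summarized by the following skeleton; `make_random_marine_env` and the `agent` methods are hypothetical placeholders standing in for the environment of step 1) and the A3C learner of step 4).

```python
def train(agent, num_environments=10):
    """Training skeleton implied by the embodiment: per random environment,
    4 start/goal pairs; 500 episodes per pair; episodes end early on
    collision. The environment and agent APIs below are hypothetical."""
    for _ in range(num_environments):
        env = make_random_marine_env()             # step 1): random obstacles
        for _ in range(4):                         # 4 initial/target pairs per environment
            env.reset_start_and_goal()             # new random initial and target points
            for _ in range(500):                   # 500 interactions per pair
                state, done = env.reset(), False
                while not done:                    # collision or arrival ends the episode
                    action = agent.sample_action(state)  # sample from mean-variance policy
                    state, reward, done = env.step(action)
                    agent.accumulate_gradients(state, action, reward)
                agent.apply_and_sync()             # asynchronous A3C parameter update
```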
In the actual test environment, the marine environment, initial point and target point are regenerated; the unmanned ship interacts with the marine environment to observe the global planning and local obstacle avoidance information, obtains the reinforcement learning state code through the network trained in step 4), and executes the action corresponding to the mean of the policy distribution under that state code, i.e. the controller output, so as to complete the assigned marine operation task.
An unmanned ship hybrid sensing autonomous obstacle avoidance system based on reinforcement learning comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein the autonomous obstacle avoidance method is realized when the processor runs the computer program.
The above embodiments are merely examples and do not limit the scope of the present invention. These embodiments may be implemented in other various manners, and various omissions, substitutions, and changes may be made without departing from the scope of the technical idea of the present invention.

Claims (2)

1. An unmanned ship hybrid perception autonomous obstacle avoidance method based on reinforcement learning is characterized by comprising the following steps:
1) building a marine environment: establishing an interaction rule between the unmanned ship and the marine environment, generating random obstacles, and randomly generating an initial point and a final point of the unmanned ship;
2) setting an action space and a state space: setting the action space according to the propeller configuration of the unmanned ship, and learning a reinforcement learning state code from the global planning information provided by the static chart and the obstacle information within the detection radius of the radar system;
3) determining the reward function: setting reward target weights to obtain a composite reward function;
4) establishing and training an evaluation network and a policy network: the evaluation network and the policy network are each formed by connecting a state-encoding network with a perceptron, and the network parameters are initialized and trained;
5) agent decision controller output: feeding the reinforcement learning state code into the evaluation network and the policy network respectively, feeding the composite reward function into the evaluation network, and determining the controller output from the action corresponding to the mean of the learned policy;
in step 1), the interaction rule between the unmanned ship and the marine environment follows the unmanned ship's own dynamic equations;
the random obstacles generated in step 1) are of four kinds: random static obstacles that can be depicted on a chart, random dynamic obstacles that cannot be depicted on a chart, random dynamic obstacles with autonomous control capability, and random dynamic obstacles without autonomous control capability;
the action space in step 2) comprises the discretized surge force and yaw moment;
the reinforcement learning state code in step 2) is obtained through deep network learning, specifically:
the features of the static chart are learned through a convolutional neural network combined with fully connected layers to obtain a static planning state code; this static planning state code, together with the dynamic obstacle avoidance state code fed back by the radar system, serves as the key features of the reinforcement learning state code, whose importance is re-allocated through a learned overall weight matrix to obtain the final reinforcement learning state code;
the dynamic obstacle avoidance state code is the nine-tuple

$$s_t^{dyn} = \left(\sigma_t,\ d_t^{target},\ \varphi_t^{target},\ \psi_t,\ u_t,\ v_t,\ r_t,\ d_t^{obs},\ \varphi_t^{obs}\right)$$

where the subscript $t$ denotes time $t$; $\sigma_t$ is the obstacle-detected flag within the detection radius; $d_t^{target}$ and $\varphi_t^{target}$ are the distance and angle from the unmanned ship to the target in the world coordinate system; $\psi_t$ is the yaw angle of the unmanned ship in the world coordinate system; $u_t$, $v_t$ and $r_t$ are the surge speed, sway speed and yaw rate in the unmanned ship's body coordinate system; and $d_t^{obs}$ and $\varphi_t^{obs}$ are the distance and angle of the nearest obstacle in the world coordinate system;
the composite reward function in step 3) is the product of a reward target weight matrix and the reward targets, which comprise: a distance reward target, an obstacle avoidance reward target, a speed reward target and an energy consumption reward target;
the reward objectives are obtained by:
in the task of navigating the unmanned ship to the target point, if
Figure FDA0003366163700000023
Then the distance to the reward target Rdistance1, otherwise Rdistance=0,
Figure FDA0003366163700000024
The distance between the unmanned ship and the target in the world coordinate system is shown, subscript t represents the time t, and subscript t +1 represents the time t + 1;
when the radar detects an obstacle and is within the range threatened by the obstacle, if the radar detects the obstacle
Figure FDA0003366163700000025
Obstacle avoidance reward target Robstance1, otherwise Robstance=0,
Figure FDA0003366163700000026
The subscript t represents the time t, and the subscript t +1 represents the time t +1, wherein the distance is the nearest barrier in a world coordinate system;
if it is used
Figure FDA0003366163700000027
Then the speed reward target Rspeed1, otherwise Rspeed=0,utIs the surging speed, v, of the coordinate system of the unmanned shiptIs the swaying speed, v, of the coordinate system of the unmanned shipthSetting a speed threshold;
if it is used
Figure FDA0003366163700000028
Then the energy consumption awards the target RconsumptionNot all right 1, otherwise Rconsumption=0,τuIs the surging force, tau, of the unmanned boatrIs the bow shaking force, tau, of the unmanned boatthSetting a threshold value for energy consumption;
step 4) is completed based on the A3C algorithm.
2. An unmanned boat hybrid perception autonomous obstacle avoidance system based on reinforcement learning, comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor implements the autonomous obstacle avoidance method of claim 1 when running the computer program.
CN202010715076.8A 2020-07-23 2020-07-23 Unmanned ship hybrid sensing autonomous obstacle avoidance method and system based on reinforcement learning Active CN111880535B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010715076.8A CN111880535B (en) 2020-07-23 2020-07-23 Unmanned ship hybrid sensing autonomous obstacle avoidance method and system based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010715076.8A CN111880535B (en) 2020-07-23 2020-07-23 Unmanned ship hybrid sensing autonomous obstacle avoidance method and system based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN111880535A (en) 2020-11-03
CN111880535B (en) 2022-07-15

Family

ID=73155952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010715076.8A Active CN111880535B (en) 2020-07-23 2020-07-23 Unmanned ship hybrid sensing autonomous obstacle avoidance method and system based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN111880535B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112540614B (en) * 2020-11-26 2022-10-25 江苏科技大学 Unmanned ship track control method based on deep reinforcement learning
CN112698646B (en) * 2020-12-05 2022-09-13 西北工业大学 Aircraft path planning method based on reinforcement learning
CN112925319B (en) * 2021-01-25 2022-06-07 哈尔滨工程大学 Underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning
CN113176776B (en) * 2021-03-03 2022-08-19 上海大学 Unmanned ship weather self-adaptive obstacle avoidance method based on deep reinforcement learning
CN114077258B (en) * 2021-11-22 2023-11-21 江苏科技大学 Unmanned ship pose control method based on reinforcement learning PPO2 algorithm
CN114721409B (en) * 2022-06-08 2022-09-20 山东大学 Underwater vehicle docking control method based on reinforcement learning
CN114942643B (en) * 2022-06-17 2024-05-14 华中科技大学 Construction method and application of USV unmanned ship path planning model


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108319276A (en) * 2017-12-26 2018-07-24 上海交通大学 Underwater robot attitude regulation control device and method based on Boolean network
CN108489491A (en) * 2018-02-09 2018-09-04 上海交通大学 A kind of Three-dimensional Track Intelligent planning method of autonomous underwater vehicle
CN109540151A (en) * 2018-03-25 2019-03-29 哈尔滨工程大学 A kind of AUV three-dimensional path planning method based on intensified learning
CN110632931A (en) * 2019-10-09 2019-12-31 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN110775200A (en) * 2019-10-23 2020-02-11 上海交通大学 AUV quick laying and recovering device under high sea condition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Path planning for unmanned ships based on Q-Learning; Wang Chengbo et al.; Ship & Ocean Engineering (《船海工程》); 2018-10-31; full text *

Also Published As

Publication number Publication date
CN111880535A (en) 2020-11-03

Similar Documents

Publication Publication Date Title
CN111880535B (en) Unmanned ship hybrid sensing autonomous obstacle avoidance method and system based on reinforcement learning
Zhou et al. The review unmanned surface vehicle path planning: Based on multi-modality constraint
Statheros et al. Autonomous ship collision avoidance navigation concepts, technologies and techniques
Perera et al. Experimental evaluations on ship autonomous navigation and collision avoidance by intelligent guidance
CN101408772B (en) AUV intelligent touching-avoiding method
CN109765929B (en) UUV real-time obstacle avoidance planning method based on improved RNN
Wang et al. Ship route planning based on double-cycling genetic algorithm considering ship maneuverability constraint
CN108416152A (en) The optimal global path planning method of unmanned boat ant colony energy consumption based on electronic chart
Oh et al. Development of collision avoidance algorithms for the c-enduro usv
CN112925319B (en) Underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning
Wang et al. Cooperative collision avoidance for unmanned surface vehicles based on improved genetic algorithm
CN111123923A (en) Unmanned ship local path dynamic optimization method
Xinchi et al. A research on intelligent obstacle avoidance for unmanned surface vehicles
Zhuang et al. Navigating high‐speed unmanned surface vehicles: System approach and validations
CN109416373A (en) Flow measurement device for structural body
Xia et al. Research on collision avoidance algorithm of unmanned surface vehicle based on deep reinforcement learning
Patil et al. Deep reinforcement learning for continuous docking control of autonomous underwater vehicles: A benchmarking study
Sun et al. Collision avoidance control for unmanned surface vehicle with COLREGs compliance
Wu et al. An overview of developments and challenges for unmanned surface vehicle autonomous berthing
Hinostroza et al. Experimental and numerical simulations of zig-zag manoeuvres of a self-running ship model
CN111694880A (en) Unmanned ship platform health management method and system based on multi-source data
Hayner et al. Halo: Hazard-aware landing optimization for autonomous systems
Ayob et al. Neuroevolutionary autonomous surface vehicle simulation in restricted waters
Stelzer Autonomous sailboat navigation
Cheng et al. Trajectory optimization for ship navigation safety using genetic annealing algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant