CN110955239B

CN110955239B - Unmanned ship multi-target trajectory planning method and system based on inverse reinforcement learning

Info

Publication number: CN110955239B
Application number: CN201911102540.XA
Authority: CN
Inventors: 刘峰; 陈畅
Original assignee: China University of Geosciences
Current assignee: China University of Geosciences
Priority date: 2019-11-12
Filing date: 2019-11-12
Publication date: 2021-03-02
Anticipated expiration: 2039-11-12
Also published as: CN110955239A

Abstract

The invention provides an unmanned ship multi-target track planning method and system based on inverse reinforcement learning, wherein the method comprises the following steps: obtaining an optimal strategy pool by utilizing reinforcement learning, inputting information of a final target state to obtain an optimal path reaching a final target point, and controlling the unmanned ship to move forwards according to the optimal path; when the obstacle appears in the current place, a path capable of avoiding the obstacle is obtained by utilizing reverse reinforcement learning based on multiple target points, the unmanned ship is controlled to reach a stage new target point, and emergency obstacle avoidance is achieved. The system comprises an initialization module, a strategy estimation module, a strategy optimization module and a multi-target point module. The invention has the beneficial effects that: the method not only can realize global path planning, but also can reduce the calculation time by using the trained strategy pool and multiple target points under the condition of a complex sea area, thereby realizing emergency dynamic obstacle avoidance.

Description

Unmanned ship multi-target trajectory planning method and system based on inverse reinforcement learning

Technical Field

The invention relates to the field of unmanned ship path planning, in particular to an unmanned ship multi-target track planning method and system based on inverse reinforcement learning.

Background

The exploration of the earth by human beings has never stopped, and with the rise of artificial intelligence, various unmanned devices are successively put into application, such as unmanned vehicles and unmanned planes, and the use of the unmanned devices facilitates the exploration of more unknown fields by human beings. The ocean occupies 70% of the surface area of the earth, and how to explore the ocean and develop ocean resources becomes the focus of attention of all countries. Under the large environment of artificial intelligence, the research of unmanned ships is scheduled.

Compared with the use of unmanned vehicles on land, the complex marine environment brings many new challenges to the research of unmanned ships, such as submarine vortexes and submarine organisms, and the dynamic and static obstacles form a complex and staggered marine environment, which causes difficulty to the unmanned ships. The unmanned ship motion path planning is a key technology for safe driving of the unmanned ship, and in some complex marine environments, the problems are difficult to deal with by a traditional path planning algorithm.

Chinese patent applications with patent numbers CN201810229544, CN201811612058 and CN201910494894 relate to the problem of trajectory planning of unmanned ships, but in general, there are the following problems: firstly, in the prior art, obstacle information on a map needs to be known in advance, and a suddenly-appearing dynamic obstacle cannot be avoided; secondly, in the prior art, a single target point is set, and once the single target point meets a large target point such as a submarine vortex and a dark current, the problem of next planning cannot be solved; thirdly, the prior art mainly aims at global path planning, is easy to fall into a local optimal point, and cannot deal with emergency situations.

Disclosure of Invention

In view of the above, the invention provides an unmanned ship multi-target trajectory planning method and system based on inverse reinforcement learning, according to the situation of dynamic and static obstacles in a complex sea area, a path planning model which is calculated in advance is utilized, a target point is replaced in time in an emergency situation, and other paths are directly started without re-calculation, so that the effect of avoiding risks is achieved.

The invention provides an unmanned ship multi-target track planning method based on inverse reinforcement learning, which comprises the following steps of:

s1, initializing a forward reinforcement learning model and a reverse reinforcement learning model: initializing a state Q value, initializing a behavior Q value function, initializing a behavior space, and initializing a strategy;

s2, performing path planning by using a positive reinforcement learning model, and establishing an optimal strategy pool;

s3, inputting state information of a final target point according to the optimal strategy pool, obtaining an optimal path reaching the final target point, and controlling the unmanned ship to move forwards according to the optimal path;

s4, judging whether an obstacle appears in front according to the real-time environment, if so, executing a step S5, otherwise, returning to the step S3;

s5, acquiring a path capable of avoiding the obstacle by adopting reverse reinforcement learning based on multiple target points, controlling the unmanned ship to reach a staged new target point, and then executing the step S6;

and S6, judging whether the unmanned ship reaches the final target point, if so, ending the process, otherwise, returning to the step S3.

Further, the step S1, wherein:

the process of initializing the state Q value is as follows: constructing rasterized state points according to the chart information and the detected environment information, and initializing a state Q value of each state point, wherein the state Q value is set for each state point on the chart, a negative Q value is set for an obstacle, a Q value of 0 is set for a state point on a feasible path, and a positive Q value is set for a target state;

the process of initializing the behavior space is as follows: determining a set of behaviors that can be performed by the state points according to whether obstacles and critical points exist around all the constructed state points;

the process of initializing the behavior Q value function is as follows: the behavior Q value function is Q (s, a), wherein s represents a state point, a represents any behavior in a behavior space of the state point s, the behavior Q value function Q (s, a) represents a behavior Q value obtained after the current state s carries out a behavior a, each behavior of the state is given an initial behavior Q value for the state point on the feasibility path, and the initial behavior Q value is null for the obstacle and the target state;

the process of initializing the strategy is to determine the first behavior in the behavior space of the state point for the state point on the feasible path.

Further, the specific process of step S2 is as follows:

s201, strategy estimation: updating the behavior Q value function of the state point on each feasible path, wherein the specific process is as follows: for any behavior in the behavior space of the current state, firstly acquiring a next state reached after the behavior is executed, obtaining a maximum behavior Q value of the next state by using a greedy algorithm, and then updating a behavior Q value function of the current state according to the state Q value of the next state, the behavior Q value of the current state and the maximum behavior Q value of the next state;

s202, strategy optimization: for the state after the strategy estimation is completed for one time, selecting the behavior with the maximum behavior Q value as the optimal strategy by using a greedy algorithm, and updating a strategy pool, wherein the strategy pool stores the behaviors corresponding to the maximum behavior Q values of the state points on all feasible paths;

s203, judging whether the iteration reaches the iteration time limit, if so, performing the step S204, otherwise, returning to the step S201, and continuing to perform the iteration;

and S204, forming an optimal strategy pool after reinforcement learning according to the strategy pool in the step S202 after iteration is completed.

Further, in step S201, the specific process of updating the behavior Q value function of the current state is as follows: if the next state is the target state or the obstacle, updating the behavior Q value function of the current state according to the formula (a); if the next state is the state on the feasible path, updating the behavior Q value function of the current state according to the formula (b)

Wherein, a_iAny behavior representing the current state of updating, i ═ 1, …, n, n represents the number of behaviors in the behavior space of the current state, s represents the number of behaviors in the behavior space of the current state₁Indicating the current state, Q(s)₁，a_i) Indicating the current state s before update₁Behavior Q value function of in behavior a_iTaking the value of;

indicating the updated current state s₁Behavior Q value function of in behavior a_iTaking the value of; s₂Indicating the current state s₁Conduct action a_iThe next state later reached; q(s)₂) Represents a state s₂A lower state Q value; max (Q(s)₂A)) represents the state s obtained by means of the greedy algorithm₂Maximum behavior Q value of; r represents the return obtained by selecting a behavior, α is the learning rate, represents the rate of updating the Q value of the behavior, and γ represents the loss factor.

Further, in the step S5, according to the optimal path obtained in the step S3, the state point where the unmanned ship travels in a short time in the future is set as a target state in the reverse reinforcement learning, and an optimal path set reaching any other plurality of local target points can be obtained by using the optimal strategy pool generated in the step S2; and determining a local path capable of avoiding emergency by using the optimal path set, and controlling the unmanned ship to advance along the local path to reach a staged new target point.

The invention also provides an unmanned ship multi-target trajectory planning system based on inverse reinforcement learning, which comprises an initialization module, a strategy estimation module, a strategy optimization module and a multi-target-point module, wherein:

the initialization module is used for initializing the positive and negative reinforcement learning models and comprises an initialization state Q value, an initialization behavior Q value function, an initialization behavior space and an initialization strategy;

the strategy estimation module is used for updating a behavior Q value function of the current state;

the strategy optimization module generates an optimal strategy pool by using the updating result of the strategy estimation module;

the multi-target point module is used for setting a plurality of local target points in advance, setting the local target points as starting points when an emergency occurs, setting state points which are driven by the unmanned ship in a short time in the future as target states, and performing reverse reinforcement learning by using an optimal strategy pool.

Further, the unmanned ship multi-target trajectory planning system based on the inverse reinforcement learning further comprises a greedy algorithm module and a stepping module, wherein the greedy algorithm module is used for selecting a behavior which enables a behavior Q value of a state to be maximum; the stepping module is used for acquiring the next state reached after a certain action is executed; when the strategy estimation module updates the behavior Q value function of the current state, for any behavior in the behavior space of the current state, the stepping module is firstly utilized to obtain the next state reached after the behavior is executed, then the greedy algorithm module is utilized to obtain the maximum behavior Q value of the next state, and the strategy estimation module updates the behavior Q value function of the current state by utilizing the state Q value of the next state, the behavior Q value of the current state and the maximum behavior Q value of the next state.

Further, the policy optimization module selects a behavior corresponding to the maximum behavior Q value by using a greedy algorithm according to the behavior Q value function updated by the policy estimation module, stores the behavior as an optimal policy, further obtains optimal policies of all state points on the feasible path, and generates an optimal policy pool.

The technical scheme provided by the invention has the beneficial effects that:

(1) setting coordinates of unmanned ship driving in a short time in the future as target points of a reinforcement learning algorithm by using an inverse reinforcement learning idea, and acquiring a local optimal path for obstacle avoidance;

(2) global path planning is realized by using reinforcement learning, and dynamic obstacle avoidance is realized by using inverse reinforcement learning;

(3) the strategy model calculated by utilizing the greedy algorithm iterative behavior Q value can avoid the re-calculation of the algorithm and is not easy to fall into a local optimal point.

Drawings

FIG. 1 is a structural diagram of an unmanned ship multi-target trajectory planning system based on inverse reinforcement learning according to an embodiment of the present invention;

FIG. 2 is a flowchart of a multi-target trajectory planning method for an unmanned ship based on inverse reinforcement learning according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating Q values of an initialization state according to an embodiment of the present invention;

FIG. 4 is a flow chart of a path planning through reinforcement learning according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of an optimal policy pool provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of an optimal path provided by an embodiment of the invention;

fig. 7 is a schematic diagram of a staged new target point provided by the embodiment of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be further described with reference to the accompanying drawings.

Referring to fig. 1, an embodiment of the present invention provides an unmanned ship multi-target trajectory planning system based on inverse reinforcement learning, including an initialization module 1, a strategy estimation module 2, a strategy optimization module 3, and a multi-target point module 4, where:

the initialization module 1 is used for initializing a forward reinforcement learning model and a reverse reinforcement learning model, and initializing a state Q value, a behavior Q value function, a behavior space and a strategy; the strategy estimation module 2 is used for updating a behavior Q value function of the current state; the strategy optimization module 3 generates an optimal strategy pool by using the updating result of the strategy estimation module 2; the multi-target point module 4 is used for setting a plurality of local target points in advance, setting the local target points as starting points when an emergency occurs, setting state points which the unmanned ship runs to in a short time in the future as target states, and performing reverse reinforcement learning by using an optimal strategy pool.

The system further comprises a greedy algorithm module 5 and a stepping module 6, wherein the greedy algorithm module 5 is used for selecting the behavior which enables the behavior Q value of the state to be maximum, and the stepping module 6 is used for acquiring the next state reached after executing a certain behavior.

Referring to fig. 2, an embodiment of the present invention provides a multi-target trajectory planning method for an unmanned ship based on inverse reinforcement learning, including the following steps:

s1, initializing a forward reinforcement learning model and a reverse reinforcement learning model: an initialization state Q value, an initialization behavior Q value function, an initialization behavior space, and an initialization policy.

And constructing rasterized state points according to the chart information and the detected environment information, and initializing a state Q value of each state point, wherein the state Q value is set for each state point on the chart, a negative Q value is set for an obstacle, a Q value of 0 is set for a state point on a feasible path, and a positive Q value is set for a target state. Preferably, referring to fig. 3 and table 1, the obstacle is set to-100, the target point is set to 10, and the feasible path is set to 0.

TABLE 1 initialization State Q, possible State, and Return value

Initializing a behavior space, wherein the behavior space is a set of behaviors that can be performed by determining all constructed state points according to whether barriers and critical points exist around the state points, and for the unmanned ship, the behavior space is divided into 8 types, namely front, rear, left, right, front left, front right, rear left and rear right.

Initializing a behavior Q value function, wherein the behavior Q value function is Q (s, a), s represents a state point, a represents any behavior in a behavior space of the state point s, and the behavior Q value function Q (s, a) represents a behavior Q value obtained after the current state s is subjected to the behavior a. For a state point on the feasibility path, each behavior of the state is given an initial behavior Q value, preferably the initial value is set to 0; for the obstacle and target state, the initial behavior Q is null.

And initializing a strategy, determining the first behavior in the behavior space of the state point for the state point on the feasible path, and preferably selecting the state point according to the sequence of front, back, left, right, front left, front right, back left and back right.

And S2, performing path planning by using the positive reinforcement learning model, and establishing an optimal strategy pool.

Specifically, referring to fig. 4, the specific process of step S2 includes:

s201, strategy estimation is carried out by utilizing a strategy estimation module 2: updating the behavior Q value function of the state point on each feasible path, specifically, for any behavior in the behavior space of the current state, firstly, acquiring the next state reached after the execution of the behavior by using a stepping module 6, then, obtaining the maximum behavior Q value of the next state by using a greedy algorithm module 5, and updating the behavior Q value function of the current state by using a strategy estimation module 2 according to the state Q value of the next state, the behavior Q value of the current state and the maximum behavior Q value of the next state;

if the next state is the target state or the obstacle, updating the behavior Q value function of the current state according to the formula (1); if the next state is the state on the feasible path, updating the behavior Q value function of the current state according to the formula (2)

Wherein, a_iN, n represents the number of behaviors in the behavior space of the current state, s₁Indicating the current state, Q(s)₁，a_i) Indicating the current state s before update₁Behavior Q value function of in behavior a_iTaking the value of;

indicating the updated current state s₁Behavior Q value function of in behavior a_iTaking the value of; s₂Indicating the current state s₁Conduct action a_iThe next state later reached; q(s)₂) Represents a state s₂A lower state Q value; max (Q(s)₂，a_i) Represents the state s obtained by means of the greedy algorithm module 5₂Maximum behavior Q value of; r represents the reward obtained by the selected behavior, i.e. the reward value set in table 1; α is a learning rate, which indicates a rate of updating a Q value, and if α is selected too much, an error is large, and if α is selected too little, calculation efficiency is low, where α is selected to be 0.1 in this embodiment; γ represents a loss factor, preferably γ ═ 0.9.

S202, strategy optimization is carried out by utilizing a strategy optimization module 3: and according to the behavior Q value function updated by the strategy estimation module 2, selecting the behavior with the maximum behavior Q value as an optimal strategy by using a greedy algorithm module 5, and updating a strategy pool, wherein the strategy pool stores behaviors corresponding to the maximum behavior Q values of state points on all feasible paths.

And S203, judging whether the iteration reaches the iteration time limit, if so, performing a step 204, otherwise, returning to the step S201, and continuing to perform the iteration.

S204, forming an optimal strategy pool after reinforcement learning according to the strategy pool in step S202 after the iteration is completed, where the optimal strategy pool formed after the iteration is completed is as shown in fig. 5 for the data shown in fig. 3.

S3, inputting the information of the final target state according to the optimal strategy pool, so as to obtain the optimal path reaching the final target point, and controlling the unmanned ship to move forward according to the optimal path; the resulting optimal path using the data shown in fig. 3 is shown in fig. 6, where the state point S represents the final target point.

S4, judging whether emergency occurs in front according to real-time environment, if yes, executing step S5, otherwise returning to step S3.

S5, based on the multiple target points, a path capable of avoiding emergency situations is obtained by adopting reverse reinforcement learning, the unmanned ship is controlled to reach a staged new target point, and then the step S6 is executed. Firstly, a plurality of local target points are set in advance by using the multi-target module 4, the local target points are set as starting points, and state points which are driven by the unmanned ship in a short time in the future are set as target states in reverse reinforcement learning according to the optimal path obtained in the step S3; by adopting reverse reinforcement learning, the optimal path set reaching the plurality of local target points can be obtained by utilizing the optimal strategy pool generated in the step S2; screening the optimal path sets, determining a local path which can avoid emergency, and controlling the unmanned ship to advance along the local path to reach a staged new target point.

Specifically, referring to fig. 7, when the unmanned ship travels to the state point a, the unmanned ship will travel to the state point B according to the optimal path in fig. 6, and at this time, it is detected that an obstacle appears at the point C, and obstacle avoidance needs to be performed; the multi-target point module 2 sets a plurality of local target points 1, 2, and 3 in advance, then sets the three local target points 1, 2, and 3 as starting points of inverse reinforcement learning, sets a point B as a target state of inverse reinforcement learning, inputs the point B into the optimal strategy pool formed in step S2, and obtains optimal paths B1, B2, and B3 reaching the local target points 1, 2, and 3, where the path B1 can avoid obstacles, so the unmanned ship is controlled to reach the staged target point according to the local optimal path B1 with the state point 1 as the staged target point.

In this document, the terms front, back, upper and lower are used to define the components in the drawings and the positions of the components relative to each other, and are used for clarity and convenience of the technical solution. It is to be understood that the use of the directional terms should not be taken to limit the scope of the claims.

The features of the embodiments and embodiments described herein above may be combined with each other without conflict.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. An unmanned ship multi-target track planning method based on inverse reinforcement learning is characterized by comprising the following steps:

the specific process of step S2 is as follows:

s204, forming an optimal strategy pool after reinforcement learning according to the strategy pool in the step S202 after iteration is completed;

s5, setting the state point of the unmanned ship which is driven in a short time in the future as a target state in reverse reinforcement learning according to the optimal path acquired in the step S3 based on multiple target points, and acquiring an optimal path set reaching any other multiple local target points by using the optimal strategy pool generated in the step S2; determining a local path capable of avoiding emergency by using the optimal path set, controlling the unmanned ship to advance along the local path to reach a staged new target point, and then executing the step S6;

2. The inverse reinforcement learning-based unmanned aerial vehicle multi-target trajectory planning method of claim 1, wherein the step S1 is that:

3. The unmanned aerial vehicle multi-target trajectory planning method based on inverse reinforcement learning of claim 1, wherein in the step S201, the specific process of updating the behavior Q value function of the current state is as follows: if the next state is the target state or the obstacle, updating the behavior Q value function of the current state according to the formula (a); if the next state is the state on the feasible path, updating the behavior Q value function of the current state according to the formula (b)

Wherein, a_iAny behavior representing the current state of updating, i ═ 1, …, n, n represents the number of behaviors in the behavior space of the current state, s represents the number of behaviors in the behavior space of the current state₁Indicating the current state, Q(s)₁,a_i) Indicating the current state s before update₁Behavior Q value function of in behavior a_iTaking the value of;

indicating the updated current state s₁Behavior Q value function of in behavior a_iTaking the value of; s₂Indicating the current state s₁Conduct action a_iThe next state later reached; q(s)₂) Represents a state s₂A lower state Q value; max (Q(s)₂A)) represents the state s obtained by means of the greedy algorithm₂Maximum behavior Q value of; r represents the return obtained by selecting the behavior, α is the learning rate, which represents the rate of updating the Q value of the behavior, and γ represents the loss factor.

4. The unmanned ship multi-target trajectory planning system based on inverse reinforcement learning is characterized by comprising an initialization module, a strategy estimation module, a strategy optimization module and a multi-target-point module, wherein:

the multi-target point module is used for setting a plurality of local target points in advance, setting the local target points as starting points when an emergency occurs, setting state points which are driven by the unmanned ship in a short time in the future as target states, and performing reverse reinforcement learning by using an optimal strategy pool;

the system also comprises a greedy algorithm module and a stepping module, wherein the greedy algorithm module is used for selecting the behavior which enables the behavior Q value of the state to be maximum; the stepping module is used for acquiring a next state reached after a certain action is executed; when the strategy estimation module updates a behavior Q value function of the current state, for any behavior in a behavior space of the current state, firstly, the stepping module is used for acquiring a next state reached after the behavior is executed, then, the greedy algorithm module is used for acquiring a maximum behavior Q value of the next state, and the strategy estimation module updates the behavior Q value function of the current state by using the state Q value of the next state, the behavior Q value of the current state and the maximum behavior Q value of the next state;

and the strategy optimization module selects the behavior corresponding to the maximum behavior Q value by using a greedy algorithm according to the behavior Q value function updated by the strategy estimation module, stores the behavior as an optimal strategy, further acquires the optimal strategies of all state points on the feasible path and generates an optimal strategy pool.