CN110955239B - Unmanned ship multi-target trajectory planning method and system based on inverse reinforcement learning - Google Patents
- Publication number
- CN110955239B (application CN201911102540.XA)
- Authority
- CN
- China
- Prior art keywords
- behavior
- state
- value
- strategy
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/0206—Control of position or course in two dimensions specially adapted to water vehicles
Abstract
The invention provides an unmanned ship multi-target trajectory planning method and system based on inverse reinforcement learning. In the method, an optimal strategy pool is obtained through reinforcement learning, information of the final target state is input to obtain an optimal path to the final target point, and the unmanned ship is controlled to advance along that path; when an obstacle appears ahead, a path that avoids the obstacle is obtained through inverse reinforcement learning based on multiple target points, and the unmanned ship is controlled to reach a staged new target point, achieving emergency obstacle avoidance. The system comprises an initialization module, a strategy estimation module, a strategy optimization module and a multi-target-point module. The beneficial effects of the invention are that the method not only realizes global path planning, but also, in complex sea areas, reduces calculation time by reusing the trained strategy pool and the multiple target points, thereby achieving emergency dynamic obstacle avoidance.
Description
Technical Field
The invention relates to the field of unmanned ship path planning, in particular to an unmanned ship multi-target track planning method and system based on inverse reinforcement learning.
Background
Human exploration of the earth has never stopped, and with the rise of artificial intelligence, various unmanned devices such as unmanned vehicles and unmanned aerial vehicles have successively been put into application, making it easier to explore unknown territory. The ocean covers about 70% of the earth's surface, and how to explore it and develop ocean resources has become a focus of attention for many countries. Against this backdrop of artificial intelligence, research on unmanned ships has been put on the agenda.
Compared with the use of unmanned vehicles on land, the complex marine environment poses many new challenges to the research of unmanned ships, such as undersea vortexes and marine organisms; these dynamic and static obstacles form a complex, interleaved marine environment that makes operation difficult for unmanned ships. Motion path planning is a key technology for the safe navigation of unmanned ships, and in some complex marine environments these problems are difficult for traditional path planning algorithms to handle.
Chinese patent applications CN201810229544, CN201811612058 and CN201910494894 address the trajectory planning problem of unmanned ships, but in general they share the following shortcomings: firstly, the prior art needs to know the obstacle information on the map in advance and cannot avoid a dynamic obstacle that appears suddenly; secondly, the prior art sets a single target point, and once a large obstacle such as a submarine vortex or undercurrent is encountered, the next stage of planning cannot be carried out; thirdly, the prior art mainly targets global path planning, falls easily into a local optimum, and cannot handle emergencies.
Disclosure of Invention
In view of the above, the invention provides an unmanned ship multi-target trajectory planning method and system based on inverse reinforcement learning. According to the dynamic and static obstacles present in a complex sea area, a path planning model computed in advance is used; in an emergency the target point is replaced in time and an alternative path is adopted directly, without re-computation, so that the hazard is avoided.
The invention provides an unmanned ship multi-target trajectory planning method based on inverse reinforcement learning, which comprises the following steps:
s1, initializing a forward reinforcement learning model and a reverse reinforcement learning model: initializing a state Q value, initializing a behavior Q value function, initializing a behavior space, and initializing a strategy;
s2, performing path planning by using the forward reinforcement learning model, and establishing an optimal strategy pool;
s3, inputting state information of a final target point according to the optimal strategy pool, obtaining an optimal path reaching the final target point, and controlling the unmanned ship to move forwards according to the optimal path;
s4, judging whether an obstacle appears in front according to the real-time environment, if so, executing a step S5, otherwise, returning to the step S3;
s5, acquiring a path capable of avoiding the obstacle by adopting reverse reinforcement learning based on multiple target points, controlling the unmanned ship to reach a staged new target point, and then executing the step S6;
and S6, judging whether the unmanned ship reaches the final target point, if so, ending the process, otherwise, returning to the step S3.
Further, the step S1, wherein:
the process of initializing the state Q value is as follows: constructing rasterized state points according to the chart information and the detected environment information, and initializing a state Q value of each state point, wherein the state Q value is set for each state point on the chart, a negative Q value is set for an obstacle, a Q value of 0 is set for a state point on a feasible path, and a positive Q value is set for a target state;
the process of initializing the behavior space is as follows: for each constructed state point, the set of behaviors that the state point can perform is determined according to whether obstacles and critical points exist around it;
the process of initializing the behavior Q value function is as follows: the behavior Q value function is Q(s, a), wherein s represents a state point, a represents any behavior in the behavior space of the state point s, and the behavior Q value function Q(s, a) represents the behavior Q value obtained after the current state s performs behavior a; for a state point on a feasible path, each behavior of the state is given an initial behavior Q value, and for the obstacle and the target state the initial behavior Q value is null;
the process of initializing the strategy is to determine the first behavior in the behavior space of the state point for the state point on the feasible path.
Further, the specific process of step S2 is as follows:
s201, strategy estimation: updating the behavior Q value function of the state point on each feasible path, wherein the specific process is as follows: for any behavior in the behavior space of the current state, firstly acquiring a next state reached after the behavior is executed, obtaining a maximum behavior Q value of the next state by using a greedy algorithm, and then updating a behavior Q value function of the current state according to the state Q value of the next state, the behavior Q value of the current state and the maximum behavior Q value of the next state;
s202, strategy optimization: for the state after the strategy estimation is completed for one time, selecting the behavior with the maximum behavior Q value as the optimal strategy by using a greedy algorithm, and updating a strategy pool, wherein the strategy pool stores the behaviors corresponding to the maximum behavior Q values of the state points on all feasible paths;
s203, judging whether the number of iterations has reached the iteration limit; if so, performing step S204, otherwise returning to step S201 and continuing the iteration;
and S204, forming an optimal strategy pool after reinforcement learning according to the strategy pool in the step S202 after iteration is completed.
Further, in step S201, the specific process of updating the behavior Q value function of the current state is as follows: if the next state is the target state or the obstacle, updating the behavior Q value function of the current state according to the formula (a); if the next state is the state on the feasible path, updating the behavior Q value function of the current state according to the formula (b)
wherein a_i denotes any behavior of the current state being updated, i = 1, …, n, and n is the number of behaviors in the behavior space of the current state; s_1 denotes the current state; Q(s_1, a_i) denotes the value of the behavior Q value function of the current state s_1 at behavior a_i before the update; Q'(s_1, a_i) denotes that value after the update; s_2 denotes the next state reached after the current state s_1 performs behavior a_i; Q(s_2) denotes the state Q value of state s_2; max(Q(s_2, a)) denotes the maximum behavior Q value of state s_2 obtained by the greedy algorithm; r denotes the return obtained for the selected behavior; α is the learning rate, i.e. the rate at which the behavior Q value is updated; and γ denotes the loss factor.
Further, in step S5, according to the optimal path obtained in step S3, the state point that the unmanned ship will reach in a short time is set as the target state in inverse reinforcement learning, and an optimal path set reaching any of several other local target points can be obtained by using the optimal strategy pool generated in step S2; a local path that avoids the emergency is then determined from the optimal path set, and the unmanned ship is controlled to advance along this local path to a staged new target point.
The invention also provides an unmanned ship multi-target trajectory planning system based on inverse reinforcement learning, which comprises an initialization module, a strategy estimation module, a strategy optimization module and a multi-target-point module, wherein:
the initialization module is used for initializing the positive and negative reinforcement learning models and comprises an initialization state Q value, an initialization behavior Q value function, an initialization behavior space and an initialization strategy;
the strategy estimation module is used for updating a behavior Q value function of the current state;
the strategy optimization module generates an optimal strategy pool by using the updating result of the strategy estimation module;
the multi-target point module is used for setting a plurality of local target points in advance, setting the local target points as starting points when an emergency occurs, setting the state point that the unmanned ship will reach in a short time as the target state, and performing inverse reinforcement learning by using the optimal strategy pool.
Further, the unmanned ship multi-target trajectory planning system based on the inverse reinforcement learning further comprises a greedy algorithm module and a stepping module, wherein the greedy algorithm module is used for selecting a behavior which enables a behavior Q value of a state to be maximum; the stepping module is used for acquiring the next state reached after a certain action is executed; when the strategy estimation module updates the behavior Q value function of the current state, for any behavior in the behavior space of the current state, the stepping module is firstly utilized to obtain the next state reached after the behavior is executed, then the greedy algorithm module is utilized to obtain the maximum behavior Q value of the next state, and the strategy estimation module updates the behavior Q value function of the current state by utilizing the state Q value of the next state, the behavior Q value of the current state and the maximum behavior Q value of the next state.
Further, the policy optimization module selects a behavior corresponding to the maximum behavior Q value by using a greedy algorithm according to the behavior Q value function updated by the policy estimation module, stores the behavior as an optimal policy, further obtains optimal policies of all state points on the feasible path, and generates an optimal policy pool.
The technical scheme provided by the invention has the beneficial effects that:
(1) using the idea of inverse reinforcement learning, the coordinates that the unmanned ship will reach in a short time are set as target points of the reinforcement learning algorithm, and a locally optimal obstacle-avoidance path is obtained;
(2) global path planning is realized by using reinforcement learning, and dynamic obstacle avoidance is realized by using inverse reinforcement learning;
(3) the strategy model obtained by iterating the behavior Q values with the greedy algorithm avoids re-computation and is not prone to falling into a local optimum.
Drawings
FIG. 1 is a structural diagram of an unmanned ship multi-target trajectory planning system based on inverse reinforcement learning according to an embodiment of the present invention;
FIG. 2 is a flowchart of a multi-target trajectory planning method for an unmanned ship based on inverse reinforcement learning according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating Q values of an initialization state according to an embodiment of the present invention;
FIG. 4 is a flow chart of a path planning through reinforcement learning according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an optimal policy pool provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of an optimal path provided by an embodiment of the invention;
fig. 7 is a schematic diagram of a staged new target point provided by the embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be further described with reference to the accompanying drawings.
Referring to fig. 1, an embodiment of the present invention provides an unmanned ship multi-target trajectory planning system based on inverse reinforcement learning, including an initialization module 1, a strategy estimation module 2, a strategy optimization module 3, and a multi-target point module 4, where:
the initialization module 1 is used for initializing the forward reinforcement learning model and the inverse reinforcement learning model, including initializing the state Q values, the behavior Q value function, the behavior space and the strategy; the strategy estimation module 2 is used for updating the behavior Q value function of the current state; the strategy optimization module 3 generates an optimal strategy pool from the update results of the strategy estimation module 2; the multi-target point module 4 is used for setting a plurality of local target points in advance, setting the local target points as starting points when an emergency occurs, setting the state point that the unmanned ship will reach in a short time as the target state, and performing inverse reinforcement learning by using the optimal strategy pool.
The system further comprises a greedy algorithm module 5 and a stepping module 6, wherein the greedy algorithm module 5 is used for selecting the behavior which enables the behavior Q value of the state to be maximum, and the stepping module 6 is used for acquiring the next state reached after executing a certain behavior.
Referring to fig. 2, an embodiment of the present invention provides a multi-target trajectory planning method for an unmanned ship based on inverse reinforcement learning, including the following steps:
s1, initializing a forward reinforcement learning model and a reverse reinforcement learning model: an initialization state Q value, an initialization behavior Q value function, an initialization behavior space, and an initialization policy.
Rasterized state points are constructed according to the chart information and the detected environment information, and a state Q value is initialized for each state point: a negative Q value is set for an obstacle, a Q value of 0 is set for a state point on a feasible path, and a positive Q value is set for the target state. Preferably, referring to fig. 3 and table 1, the obstacle is set to -100, the target point is set to 10, and the feasible path is set to 0.
Table 1. Initialization of state Q values, possible states, and return values
The behavior space is initialized as follows: the behavior space of each constructed state point is the set of behaviors it can perform, determined according to whether obstacles and critical points exist around it; for the unmanned ship, the behavior space comprises 8 behaviors, namely front, rear, left, right, front-left, front-right, rear-left and rear-right.
The behavior Q value function is initialized as follows: the behavior Q value function is Q(s, a), where s represents a state point, a represents any behavior in the behavior space of the state point s, and Q(s, a) represents the behavior Q value obtained after the current state s performs behavior a. For a state point on a feasible path, each behavior of the state is given an initial behavior Q value, preferably set to 0; for the obstacle and the target state, the initial behavior Q value is null.
The strategy is initialized by determining, for each state point on a feasible path, the first behavior in the behavior space of that state point, preferably selecting behaviors in the order front, rear, left, right, front-left, front-right, rear-left, rear-right.
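As a concrete illustration of this initialization step, the following is a minimal sketch in Python, assuming a simple two-dimensional grid chart; the grid encoding, array layout and helper names are illustrative assumptions and are not prescribed by the embodiment, while the preferred values (-100 for obstacles, 10 for the target, 0 for feasible points, and the eight behaviors) follow the description above.

```python
import numpy as np

FREE, OBSTACLE, TARGET = 0, 1, 2

# Eight behaviors of the unmanned ship encoded as (row, column) offsets:
# front, rear, left, right, front-left, front-right, rear-left, rear-right.
ACTIONS = {
    "front": (-1, 0), "rear": (1, 0), "left": (0, -1), "right": (0, 1),
    "front_left": (-1, -1), "front_right": (-1, 1),
    "rear_left": (1, -1), "rear_right": (1, 1),
}

def initialize(grid):
    """grid: 2-D array of FREE / OBSTACLE / TARGET cells built from the chart."""
    rows, cols = grid.shape
    # State Q values: -100 for obstacles, 10 for the target, 0 for feasible points.
    state_q = np.zeros((rows, cols))
    state_q[grid == OBSTACLE] = -100.0
    state_q[grid == TARGET] = 10.0

    action_space = {}  # behavior space of each feasible state point
    behavior_q = {}    # behavior Q value table; None ("null") for obstacle/target
    policy = {}        # initial strategy: first behavior in the state's space
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != FREE:
                behavior_q[(r, c)] = None
                continue
            legal = [a for a, (dr, dc) in ACTIONS.items()
                     if 0 <= r + dr < rows and 0 <= c + dc < cols]
            action_space[(r, c)] = legal
            behavior_q[(r, c)] = {a: 0.0 for a in legal}  # initial behavior Q = 0
            policy[(r, c)] = legal[0]
    return state_q, action_space, behavior_q, policy
```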
S2, path planning is performed with the forward reinforcement learning model and an optimal strategy pool is established.
Specifically, referring to fig. 4, the specific process of step S2 includes:
s201, strategy estimation is carried out by utilizing a strategy estimation module 2: updating the behavior Q value function of the state point on each feasible path, specifically, for any behavior in the behavior space of the current state, firstly, acquiring the next state reached after the execution of the behavior by using a stepping module 6, then, obtaining the maximum behavior Q value of the next state by using a greedy algorithm module 5, and updating the behavior Q value function of the current state by using a strategy estimation module 2 according to the state Q value of the next state, the behavior Q value of the current state and the maximum behavior Q value of the next state;
if the next state is the target state or the obstacle, updating the behavior Q value function of the current state according to the formula (1); if the next state is the state on the feasible path, updating the behavior Q value function of the current state according to the formula (2)
wherein a_i denotes any behavior of the current state being updated, i = 1, …, n, and n is the number of behaviors in the behavior space of the current state; s_1 denotes the current state; Q(s_1, a_i) denotes the value of the behavior Q value function of the current state s_1 at behavior a_i before the update; Q'(s_1, a_i) denotes that value after the update; s_2 denotes the next state reached after the current state s_1 performs behavior a_i; Q(s_2) denotes the state Q value of state s_2; max(Q(s_2, a)) denotes the maximum behavior Q value of state s_2 obtained by the greedy algorithm module 5; r denotes the return obtained for the selected behavior, i.e. the return value set in Table 1; α is the learning rate, i.e. the rate at which the behavior Q value is updated; if α is too large the error is large, and if α is too small the calculation efficiency is low, so α = 0.1 is chosen in this embodiment; γ denotes the loss factor, preferably γ = 0.9.
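Formulas (1) and (2) are rendered as images in the source text, so the sketch below only assumes a standard Q-learning-style update consistent with the surrounding description: a terminal next state (target or obstacle) updates from the return alone, while a feasible next state also bootstraps from its maximum behavior Q value. The function name and data layout continue the sketch above and are assumptions; α = 0.1 and γ = 0.9 follow this embodiment.

```python
ALPHA, GAMMA = 0.1, 0.9  # learning rate and loss factor used in this embodiment

def update_behavior_q(behavior_q, state_q, s1, a, s2):
    """One policy-estimation update of Q(s1, a) after behavior a leads to state s2."""
    q_old = behavior_q[s1][a]
    r = state_q[s2]  # return of the selected behavior: the state Q value of s2 (Table 1)
    if behavior_q[s2] is None:
        # Assumed formula (1) case: the next state is the target state or an obstacle,
        # so there is no behavior Q value of s2 to bootstrap from.
        q_new = q_old + ALPHA * (r - q_old)
    else:
        # Assumed formula (2) case: the next state lies on a feasible path, so the
        # maximum behavior Q value of s2 (greedy selection) enters the update.
        q_new = q_old + ALPHA * (r + GAMMA * max(behavior_q[s2].values()) - q_old)
    behavior_q[s1][a] = q_new
    return q_new
```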
S202, strategy optimization is carried out by utilizing a strategy optimization module 3: and according to the behavior Q value function updated by the strategy estimation module 2, selecting the behavior with the maximum behavior Q value as an optimal strategy by using a greedy algorithm module 5, and updating a strategy pool, wherein the strategy pool stores behaviors corresponding to the maximum behavior Q values of state points on all feasible paths.
S203, judging whether the number of iterations has reached the iteration limit; if so, performing step S204, otherwise returning to step S201 and continuing the iteration.
S204, after the iterations are completed, the strategy pool of step S202 constitutes the optimal strategy pool obtained through reinforcement learning; for the data shown in fig. 3, the resulting optimal strategy pool is shown in fig. 5.
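Continuing the same sketch, one possible arrangement of steps S201–S204 is shown below: every behavior of every feasible state point is updated with the stepping rule, the greedy behavior of each state point is written into the strategy pool, and the pool left after the final iteration is taken as the optimal strategy pool. The sweep order, iteration limit and helper names are assumptions rather than details fixed by the embodiment.

```python
def step(s, a):
    """Stepping module: next state reached from state s after performing behavior a."""
    dr, dc = ACTIONS[a]
    return (s[0] + dr, s[1] + dc)

def build_policy_pool(state_q, action_space, behavior_q, iterations=500):
    """S201-S204: iterate policy estimation and optimization, return the strategy pool."""
    policy_pool = {}
    for _ in range(iterations):
        # S201 policy estimation: update every behavior of every feasible state point.
        for s1, behaviors in action_space.items():
            for a in behaviors:
                update_behavior_q(behavior_q, state_q, s1, a, step(s1, a))
        # S202 policy optimization: store the greedy (maximum behavior Q value)
        # behavior of each feasible state point in the strategy pool.
        policy_pool = {s: max(q, key=q.get)
                       for s, q in behavior_q.items() if q is not None}
    # S204: the pool left after the last iteration is the optimal strategy pool.
    return policy_pool
```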
S3, inputting the information of the final target state according to the optimal strategy pool, so as to obtain the optimal path reaching the final target point, and controlling the unmanned ship to move forward according to the optimal path; the resulting optimal path using the data shown in fig. 3 is shown in fig. 6, where the state point S represents the final target point.
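Under the same assumptions, reading the optimal path out of the pool (step S3) can be sketched as a greedy roll-out from the ship's current state until the target state (positive state Q value) is reached; the step limit is only an illustrative safeguard against loops.

```python
def rollout(policy_pool, state_q, start, max_steps=1000):
    """S3: follow the pooled greedy behaviors from the start state to the target state."""
    path, s = [start], start
    for _ in range(max_steps):
        if state_q[s] > 0 or s not in policy_pool:  # positive state Q value marks the target
            break
        s = step(s, policy_pool[s])
        path.append(s)
    return path
```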
S4, judging whether an obstacle (emergency) appears ahead according to the real-time environment; if so, executing step S5, otherwise returning to step S3.
S5, based on the multiple target points, a path capable of avoiding the emergency is obtained by inverse reinforcement learning, the unmanned ship is controlled to reach a staged new target point, and step S6 is then executed. Firstly, a plurality of local target points are set in advance by the multi-target point module 4, these local target points are set as starting points, and the state point that the unmanned ship will reach in a short time on the optimal path obtained in step S3 is set as the target state of the inverse reinforcement learning; using inverse reinforcement learning, the set of optimal paths reaching the plurality of local target points is obtained from the optimal strategy pool generated in step S2; this path set is screened to determine a local path that avoids the emergency, and the unmanned ship is controlled to advance along that local path to the staged new target point.
Specifically, referring to fig. 7, when the unmanned ship travels to state point A, it would proceed to state point B according to the optimal path of fig. 6; at this moment an obstacle is detected at point C and obstacle avoidance is required. The multi-target point module 4 has set local target points 1, 2 and 3 in advance; these three local target points are set as starting points of the inverse reinforcement learning, point B is set as the target state of the inverse reinforcement learning and input into the optimal strategy pool formed in step S2, and the optimal paths B1, B2 and B3 reaching local target points 1, 2 and 3 are obtained. Since path B1 avoids the obstacle, state point 1 is taken as the staged target point and the unmanned ship is controlled to reach it along the locally optimal path B1.
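The mechanism by which the pool yields the local paths B1, B2 and B3 is not spelled out in the text, so the sketch below is only one plausible reading under the assumptions above: each pre-set local target point is used as a starting point, the pooled policy is rolled out until the inverse-reinforcement-learning target state (point B) is reached, and the first candidate path that stays clear of the detected obstacle fixes the staged new target point. The obstacle cell set and the local target points are illustrative inputs.

```python
def avoid_obstacle(policy_pool, state_q, point_b, local_targets, obstacle_cells, max_steps=200):
    """S5: pick a staged new target point and a local path from point B that avoids the obstacle."""
    for target in local_targets:
        # Roll the pooled policy out from the local target point until the
        # inverse-reinforcement-learning target state (point B) is reached.
        path, s = [target], target
        for _ in range(max_steps):
            if s == point_b or s not in policy_pool:
                break
            s = step(s, policy_pool[s])
            path.append(s)
        # Reversed, the candidate runs from point B toward the local target point.
        candidate = list(reversed(path))
        if candidate[0] == point_b and not any(c in obstacle_cells for c in candidate):
            return target, candidate
    return None, []  # no clear local path was found; global re-planning would be needed
```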
And S6, judging whether the unmanned ship reaches the final target point, if so, ending the process, otherwise, returning to the step S3.
In this document, the terms front, back, upper and lower are used to define the components in the drawings and the positions of the components relative to each other, and are used for clarity and convenience of the technical solution. It is to be understood that the use of the directional terms should not be taken to limit the scope of the claims.
The features of the embodiments described hereinabove may be combined with each other without conflict.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (4)
1. An unmanned ship multi-target trajectory planning method based on inverse reinforcement learning, characterized by comprising the following steps:
s1, initializing a forward reinforcement learning model and a reverse reinforcement learning model: initializing a state Q value, initializing a behavior Q value function, initializing a behavior space, and initializing a strategy;
s2, performing path planning by using the forward reinforcement learning model, and establishing an optimal strategy pool;
the specific process of step S2 is as follows:
s201, strategy estimation: updating the behavior Q value function of the state point on each feasible path, wherein the specific process is as follows: for any behavior in the behavior space of the current state, firstly acquiring a next state reached after the behavior is executed, obtaining a maximum behavior Q value of the next state by using a greedy algorithm, and then updating a behavior Q value function of the current state according to the state Q value of the next state, the behavior Q value of the current state and the maximum behavior Q value of the next state;
s202, strategy optimization: for the state after the strategy estimation is completed for one time, selecting the behavior with the maximum behavior Q value as the optimal strategy by using a greedy algorithm, and updating a strategy pool, wherein the strategy pool stores the behaviors corresponding to the maximum behavior Q values of the state points on all feasible paths;
s203, judging whether the number of iterations has reached the iteration limit; if so, performing step S204, otherwise returning to step S201 and continuing the iteration;
s204, forming an optimal strategy pool after reinforcement learning according to the strategy pool in the step S202 after iteration is completed;
s3, inputting state information of a final target point according to the optimal strategy pool, obtaining an optimal path reaching the final target point, and controlling the unmanned ship to move forwards according to the optimal path;
s4, judging whether an obstacle appears in front according to the real-time environment, if so, executing a step S5, otherwise, returning to the step S3;
s5, setting the state point of the unmanned ship which is driven in a short time in the future as a target state in reverse reinforcement learning according to the optimal path acquired in the step S3 based on multiple target points, and acquiring an optimal path set reaching any other multiple local target points by using the optimal strategy pool generated in the step S2; determining a local path capable of avoiding emergency by using the optimal path set, controlling the unmanned ship to advance along the local path to reach a staged new target point, and then executing the step S6;
and S6, judging whether the unmanned ship reaches the final target point, if so, ending the process, otherwise, returning to the step S3.
2. The unmanned ship multi-target trajectory planning method based on inverse reinforcement learning of claim 1, wherein in the step S1:
the process of initializing the state Q value is as follows: constructing rasterized state points according to the chart information and the detected environment information, and initializing a state Q value of each state point, wherein the state Q value is set for each state point on the chart, a negative Q value is set for an obstacle, a Q value of 0 is set for a state point on a feasible path, and a positive Q value is set for a target state;
the process of initializing the behavior space is as follows: for each constructed state point, the set of behaviors that the state point can perform is determined according to whether obstacles and critical points exist around it;
the process of initializing the behavior Q value function is as follows: the behavior Q value function is Q(s, a), wherein s represents a state point, a represents any behavior in the behavior space of the state point s, and the behavior Q value function Q(s, a) represents the behavior Q value obtained after the current state s performs behavior a; for a state point on a feasible path, each behavior of the state is given an initial behavior Q value, and for the obstacle and the target state the initial behavior Q value is null;
the process of initializing the strategy is to determine the first behavior in the behavior space of the state point for the state point on the feasible path.
3. The unmanned ship multi-target trajectory planning method based on inverse reinforcement learning of claim 1, wherein in the step S201, the specific process of updating the behavior Q value function of the current state is as follows: if the next state is the target state or an obstacle, the behavior Q value function of the current state is updated according to formula (a); if the next state is a state on a feasible path, the behavior Q value function of the current state is updated according to formula (b),
wherein a_i denotes any behavior of the current state being updated, i = 1, …, n, and n is the number of behaviors in the behavior space of the current state; s_1 denotes the current state; Q(s_1, a_i) denotes the value of the behavior Q value function of the current state s_1 at behavior a_i before the update; Q'(s_1, a_i) denotes that value after the update; s_2 denotes the next state reached after the current state s_1 performs behavior a_i; Q(s_2) denotes the state Q value of state s_2; max(Q(s_2, a)) denotes the maximum behavior Q value of state s_2 obtained by the greedy algorithm; r denotes the return obtained for the selected behavior; α is the learning rate, i.e. the rate at which the behavior Q value is updated; and γ denotes the loss factor.
4. The unmanned ship multi-target trajectory planning system based on inverse reinforcement learning is characterized by comprising an initialization module, a strategy estimation module, a strategy optimization module and a multi-target-point module, wherein:
the initialization module is used for initializing the positive and negative reinforcement learning models and comprises an initialization state Q value, an initialization behavior Q value function, an initialization behavior space and an initialization strategy;
the strategy estimation module is used for updating a behavior Q value function of the current state;
the strategy optimization module generates an optimal strategy pool by using the updating result of the strategy estimation module;
the multi-target point module is used for setting a plurality of local target points in advance, setting the local target points as starting points when an emergency occurs, setting the state point that the unmanned ship will reach in a short time as the target state, and performing inverse reinforcement learning by using the optimal strategy pool;
the system also comprises a greedy algorithm module and a stepping module, wherein the greedy algorithm module is used for selecting the behavior which enables the behavior Q value of the state to be maximum; the stepping module is used for acquiring a next state reached after a certain action is executed; when the strategy estimation module updates a behavior Q value function of the current state, for any behavior in a behavior space of the current state, firstly, the stepping module is used for acquiring a next state reached after the behavior is executed, then, the greedy algorithm module is used for acquiring a maximum behavior Q value of the next state, and the strategy estimation module updates the behavior Q value function of the current state by using the state Q value of the next state, the behavior Q value of the current state and the maximum behavior Q value of the next state;
and the strategy optimization module selects the behavior corresponding to the maximum behavior Q value by using a greedy algorithm according to the behavior Q value function updated by the strategy estimation module, stores the behavior as an optimal strategy, further acquires the optimal strategies of all state points on the feasible path and generates an optimal strategy pool.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911102540.XA CN110955239B (en) | 2019-11-12 | 2019-11-12 | Unmanned ship multi-target trajectory planning method and system based on inverse reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911102540.XA CN110955239B (en) | 2019-11-12 | 2019-11-12 | Unmanned ship multi-target trajectory planning method and system based on inverse reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110955239A CN110955239A (en) | 2020-04-03 |
CN110955239B (en) | 2021-03-02 |
Family
ID=69977440
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911102540.XA Expired - Fee Related CN110955239B (en) | 2019-11-12 | 2019-11-12 | Unmanned ship multi-target trajectory planning method and system based on inverse reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110955239B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109241552A (en) * | 2018-07-12 | 2019-01-18 | 哈尔滨工程大学 | A kind of underwater robot motion planning method based on multiple constraint target |
CN109799820A (en) * | 2019-01-22 | 2019-05-24 | 智慧航海(青岛)科技有限公司 | Unmanned ship local paths planning method based on the random road sign figure method of comparison expression |
CN110174118A (en) * | 2019-05-29 | 2019-08-27 | 北京洛必德科技有限公司 | Robot multiple-objective search-path layout method and apparatus based on intensified learning |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9051043B1 (en) * | 2012-12-28 | 2015-06-09 | Google Inc. | Providing emergency medical services using unmanned aerial vehicles |
CN104298239B (en) * | 2014-09-29 | 2016-08-24 | 湖南大学 | A kind of indoor mobile robot strengthens map study paths planning method |
US10902347B2 (en) * | 2017-04-11 | 2021-01-26 | International Business Machines Corporation | Rule creation using MDP and inverse reinforcement learning |
US10235881B2 (en) * | 2017-07-28 | 2019-03-19 | Toyota Motor Engineering & Manufacturing North America, Inc. | Autonomous operation capability configuration for a vehicle |
US10678241B2 (en) * | 2017-09-06 | 2020-06-09 | GM Global Technology Operations LLC | Unsupervised learning agents for autonomous driving applications |
CN107544516A (en) * | 2017-10-11 | 2018-01-05 | 苏州大学 | Automated driving system and method based on relative entropy depth against intensified learning |
CN108724182B (en) * | 2018-05-23 | 2020-03-17 | 苏州大学 | End-to-end game robot generation method and system based on multi-class simulation learning |
CN108921873B (en) * | 2018-05-29 | 2021-08-31 | 福州大学 | Markov decision-making online multi-target tracking method based on kernel correlation filtering optimization |
CN109405843B (en) * | 2018-09-21 | 2020-01-03 | 北京三快在线科技有限公司 | Path planning method and device and mobile device |
CN109540136A (en) * | 2018-10-25 | 2019-03-29 | 广东华中科技大学工业技术研究院 | A kind of more unmanned boat collaboration paths planning methods |
CN109726866A (en) * | 2018-12-27 | 2019-05-07 | 浙江农林大学 | Unmanned boat paths planning method based on Q learning neural network |
CN110321811B (en) * | 2019-06-17 | 2023-05-02 | 中国工程物理研究院电子工程研究所 | Target detection method in unmanned aerial vehicle aerial video for deep reverse reinforcement learning |
- 2019-11-12: CN application CN201911102540.XA, patent CN110955239B (en), not active (Expired - Fee Related)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109241552A (en) * | 2018-07-12 | 2019-01-18 | 哈尔滨工程大学 | A kind of underwater robot motion planning method based on multiple constraint target |
CN109799820A (en) * | 2019-01-22 | 2019-05-24 | 智慧航海(青岛)科技有限公司 | Unmanned ship local paths planning method based on the random road sign figure method of comparison expression |
CN110174118A (en) * | 2019-05-29 | 2019-08-27 | 北京洛必德科技有限公司 | Robot multiple-objective search-path layout method and apparatus based on intensified learning |
Non-Patent Citations (2)
Title |
---|
Zhang Shiwen et al., "Dynamic Multi-objective Optimization Algorithm Based on Ecological Strategy," Journal of Computer Research and Development, 2014-06-15, pp. 1313-1330 *
Feng Shuo et al., "Application of the Reinforcement-Learning-Supported RNSGA-II Algorithm in Trajectory Planning," Computer Engineering and Applications, 2019-09-04, pp. 246-251 *
Also Published As
Publication number | Publication date |
---|---|
CN110955239A (en) | 2020-04-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10466058B2 (en) | Navigation for vehicles | |
CN113110509B (en) | Warehousing system multi-robot path planning method based on deep reinforcement learning | |
Chen et al. | Optimal time-consuming path planning for autonomous underwater vehicles based on a dynamic neural network model in ocean current environments | |
CN110546653A (en) | Action selection for reinforcement learning using neural networks | |
Zhao et al. | Route planning for autonomous vessels based on improved artificial fish swarm algorithm | |
CN104850009A (en) | Coordination control method for multi-unmanned aerial vehicle team based on predation escape pigeon optimization | |
CN109657863A (en) | A kind of unmanned boat global path dynamic optimization method based on glowworm swarm algorithm | |
CN112577507A (en) | Electric vehicle path planning method based on Harris eagle optimization algorithm | |
Bai et al. | USV path planning algorithm based on plant growth | |
Wang et al. | Research on dynamic path planning of wheeled robot based on deep reinforcement learning on the slope ground | |
Xia et al. | Research on collision avoidance algorithm of unmanned surface vehicle based on deep reinforcement learning | |
Zhao et al. | Path planning for autonomous surface vessels based on improved artificial fish swarm algorithm: a further study | |
CN115129064A (en) | Path planning method based on fusion of improved firefly algorithm and dynamic window method | |
CN115655279A (en) | Marine unmanned rescue airship path planning method based on improved whale algorithm | |
CN117666589A (en) | Unmanned ship missile interception and avoidance algorithm based on reinforcement learning, interception and avoidance system and readable storage medium | |
CN110955239B (en) | Unmanned ship multi-target trajectory planning method and system based on inverse reinforcement learning | |
CN117787517A (en) | UUV search path design method based on improved whale algorithm | |
Mishra et al. | A review on vision based control of autonomous vehicles using artificial intelligence techniques | |
CN117032247A (en) | Marine rescue search path planning method, device and equipment | |
CN116048126A (en) | ABC rapid convergence-based unmanned aerial vehicle real-time path planning method | |
CN114237303A (en) | Unmanned aerial vehicle path planning method and device based on Monte Carlo tree search | |
CN112595333A (en) | Road navigation data processing method and device, electronic equipment and storage medium | |
Tran et al. | Mobile robot planner with low-cost cameras using deep reinforcement learning | |
CN116991179B (en) | Unmanned aerial vehicle search track optimization method, device, equipment and medium | |
CN113503878B (en) | Unmanned ship path planning method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20210302 Termination date: 20211112 |
|
CF01 | Termination of patent right due to non-payment of annual fee |