CN115718497A

CN115718497A - Multi-unmanned-boat collision avoidance decision method

Info

Publication number: CN115718497A
Application number: CN202211480755.7A
Authority: CN
Inventors: 吴德烽; 薛德来; 刘源铄; 刘启俊
Original assignee: Jimei University
Current assignee: Jimei University
Priority date: 2022-11-24
Filing date: 2022-11-24
Publication date: 2023-02-28

Abstract

The invention relates to a collision avoidance decision method for multiple unmanned boats. The method considers both collision risk and COLREGs: environmental information is represented and environmental risk is evaluated through mutual velocity barrier (reciprocal velocity obstacle, RVO) regions, and near-end strategy optimization (proximal policy optimization, PPO) makes decisions according to the evaluated risk. The mutual velocity barrier algorithm is used to improve the action space and the reward function of the near-end strategy optimization algorithm, and a neural network based on a recurrent module maps the states of a varying number of surrounding obstacles directly to actions, so as to solve the collision avoidance problem under limited information. The method develops a new reward function based on the mutual velocity barrier region and the expected collision time, which adapts to a variety of environments and addresses the sparse reward problem. By combining the advantages of near-end strategy optimization and the mutual velocity barrier, the invention achieves COLREGs-based collision avoidance for multiple unmanned boats and ensures that they navigate safely while performing tasks.

Description

Multi-unmanned-boat collision avoidance decision method

Technical Field

The invention belongs to the field of autonomous decision making for multiple unmanned boats. It relates to unmanned boat technology, path planning algorithms, collision avoidance algorithms and control methods for multiple unmanned boats, and in particular to a collision avoidance decision method for multiple unmanned boats.

Background

In recent years, the demand for resources has prompted countries to step up the exploration and utilization of the oceans, and the development of unmanned technology provides technical support for this work. As a new type of marine equipment, the unmanned boat is widely used in the exploration and utilization of ocean resources. A single unmanned boat can rarely complete a marine exploration and development task on its own, whereas a cluster of unmanned boats can effectively carry out tasks such as marine monitoring, marine rescue and assisted mooring. Unmanned boats are a new field of unmanned driving research. The marine environment is more complicated than the land environment, and multiple unmanned boats pose challenges to maritime safety and environmental protection in marine traffic engineering, which places higher requirements on their navigation control and navigation safety. Ensuring safe navigation of multiple unmanned boats under the International Regulations for Preventing Collisions at Sea (COLREGs) and realizing autonomous collision avoidance among them is of strategic importance.

In research on multiple unmanned boats, control methods mainly take two forms. 1) Centralized control: in a centralized system, a controller flexibly coordinates several unmanned boats in the same workspace and avoids collisions within the group, provided that the environment information of the group is known. This approach achieves more accurate control, but it places higher demands on the system, is less robust, and is difficult to extend to large groups. 2) Distributed control: each vessel makes decisions independently from its own sensors, which suits the deployment of large numbers of unmanned boats at relatively low computational complexity. This approach is robust to errors and sudden faults in the motion of individual boats in the cluster, but its control precision is lower and its response is slower, so a mature collision avoidance algorithm is needed to achieve safe navigation at sea. Research institutes, universities and enterprises have studied ship path planning and collision avoidance algorithms extensively and obtained a series of results. Most of this work, however, addresses collision avoidance path planning for a single vessel, and research on multiple unmanned boats is scarce. A collision avoidance decision method for multiple unmanned boats is therefore needed to realize their safe navigation and safe operation at sea.

The prior art offers low control precision for multiple unmanned boats, and its control methods generalize poorly. The artificial potential field method, the dynamic window method and model predictive control are mostly applied to single unmanned boats and are rarely applied to the interaction of multiple unmanned boats. The grid map method ignores the smooth motion trajectory of an unmanned boat. The velocity obstacle method is more widely applied to multiple unmanned boats, but the boats may oscillate during collision avoidance. Deep reinforcement learning offers a solution for collision avoidance in complex environments, but for multiple unmanned boats it requires tuning of the network and the reward function and exhibits randomness. Most existing collision avoidance algorithms target a single unmanned boat; when applied to multiple unmanned boats they are prone to oscillation and to falling into local optima.

Disclosure of Invention

The invention aims to solve the problem that existing schemes do not provide collision avoidance and path planning that conform to COLREGs and therefore cannot adequately ensure safe navigation and safe operation of multiple unmanned boats at sea, and provides a collision avoidance decision method for multiple unmanned boats. The mutual velocity barrier algorithm is used to improve the action space and the reward function of the near-end strategy optimization algorithm, and a neural network based on a recurrent module maps the states of a varying number of surrounding obstacles directly to actions, so as to solve the collision avoidance problem under limited information. The method develops a new reward function based on the mutual velocity barrier region and the expected collision time, which adapts to a variety of environments and addresses the sparse reward problem. Under the control of the proposed algorithm, multiple unmanned boats are capable of collision avoidance path planning and comply with COLREGs.

To achieve this purpose, the technical scheme of the invention is as follows. A collision avoidance decision method for multiple unmanned boats is based on the near-end strategy optimization (proximal policy optimization, PPO) algorithm, extended with a strategy from the mutual velocity barrier (reciprocal velocity obstacle, RVO) algorithm. The mutual velocity barrier algorithm improves the reward function of the near-end strategy optimization algorithm, which addresses the sparse reward problem in reinforcement learning, speeds up network updates, raises learning efficiency, and mitigates high randomness and a low learning rate. As shown in FIG. 1, the method comprises the following steps:

step 1, constructing a decision model;

step 2, loading unknown environment and training a model;

step 3, designing a test environment, and extracting the current monitorable environment information;

step 4, environmental perception;

step 5, data processing;

step 6, risk assessment is carried out, and the current risk state of the unmanned ship is checked;

step 7, executing corresponding decision behaviors aiming at risks according to the step 6;

step 8, calculating the reward value according to the step 7;

step 9, judging whether collision avoidance is achieved, and returning the reward value and the result.

For step 1, near-end strategy optimization (PPO) has a three-network structure and is a variant of the policy gradient algorithm; its structure is shown in FIG. 2. The algorithm starts by initializing the neural networks. It is provided with two actor networks, each with two layers of 256 neurons, in which network π performs sampling and the old network π_old performs updating. During the training cycle, π receives the current environment information, selects an action according to that information, updates the state to s' and returns a reward r. The two actor networks are constrained by an adaptive KL penalty. The critic network also has two layers of 256 neurons each; it evaluates the quality of the action from s' and r and updates π. This shortens the network update time and improves the efficiency of the algorithm. As shown in FIGS. 3 and 4, the mutual velocity barrier is a velocity-based collision avoidance algorithm: surrounding information is represented by vectors and collision risk is evaluated from the speed and direction of motion, which improves collision avoidance efficiency compared with observing positions alone. Combined with the mutual velocity barrier, near-end strategy optimization performs well on many different tasks and better than previous algorithms.

For step 2, a training environment is designed. The optimization objective of the near-end strategy optimization algorithm is to maximize the expected reward, and the expectation is computed by importance sampling. Importance sampling is the key to updating the θ network with data collected under the θ' network; the two unmanned boats are described by two distribution functions p and q. The expectation is computed as:

$$E_{x \sim p}[f(x)] = E_{x \sim q}\left[f(x)\,\frac{p(x)}{q(x)}\right]$$

In theory q(x) can be an arbitrary distribution, but in practice p(x) and q(x) should be close, as can be seen from the variance of the estimate:

$$\mathrm{Var}_{x \sim p}[f(x)] = E_{x \sim p}\left[f(x)^2\right] - \left(E_{x \sim p}[f(x)]\right)^2$$

When the number of samples reaches 1000 or more, p(x) = q(x) is taken to hold.
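The importance sampling step can be illustrated with a minimal sketch. It is not taken from the patent: the Gaussian choices for p and q, the test function f(x) = x², and the helper names `normal_pdf`, `importance_estimate` and `demo` are all assumptions made for illustration. The sketch estimates an expectation under p using only samples drawn from q, reweighted by p(x)/q(x).

```python
import math
import random


def normal_pdf(x, mean, std=1.0):
    """Density of a normal distribution (illustrative choice of p and q)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_estimate(f, p_pdf, q_pdf, q_sampler, n):
    """Estimate E_{x~p}[f(x)] from n samples of q, weighting each by p(x)/q(x)."""
    total = 0.0
    for _ in range(n):
        x = q_sampler()
        total += f(x) * p_pdf(x) / q_pdf(x)
    return total / n


def demo(n=20000, seed=0):
    """Assumed example: p = N(0, 1), q = N(0.5, 1), f(x) = x^2, so E_p[f] = 1."""
    rng = random.Random(seed)
    return importance_estimate(
        f=lambda x: x * x,
        p_pdf=lambda x: normal_pdf(x, 0.0),
        q_pdf=lambda x: normal_pdf(x, 0.5),
        q_sampler=lambda: rng.gauss(0.5, 1.0),
        n=n,
    )
```

The closer q is to p, the smaller the spread of the weights p(x)/q(x) and the fewer samples are needed for a stable estimate, which is why the two distributions are kept close in practice.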

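Importance sampling converts the on-policy strategy into an off-policy strategy. In the policy gradient, the expectation

$$E_{\tau \sim p_\theta(\tau)}\left[R(\tau)\,\nabla \log p_\theta(\tau)\right]$$

is converted into

$$E_{\tau \sim p_{\theta'}(\tau)}\left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}\,R(\tau)\,\nabla \log p_\theta(\tau)\right]$$

where τ is the sampled trajectory and

$$\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}$$

is the correction term.

Applied to the actual environment, the gradient update is

$$E_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\,A^{\theta}(s_t,a_t)\,\nabla \log p_\theta(a_t^n \mid s_t^n)\right]$$

where $A^{\theta}(s_t,a_t)$ is the evaluation function, which evaluates the quality of selecting action a in state s at time t.

The new optimization function is

$$J^{\theta'}(\theta) = E_{(s_t,a_t) \sim \pi_{\theta'}}\left[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\,A^{\theta}(s_t,a_t)\right]$$

From the above, the near-end strategy optimization objective is defined as

$$J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\,\mathrm{KL}(\theta,\theta')$$

where β is a weight coefficient and the KL divergence measures the difference between θ and θ', meaning the difference between the behaviors (actors) corresponding to the two sets of parameters. The term β KL(θ, θ') acts as the constraint.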
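The KL-penalized objective can be sketched for a discrete action space as follows. This is an illustrative sketch, not the patent's implementation: the sample layout, the function names and the direction of the KL term (old policy relative to new) are assumptions, and the halving and doubling rule for β is the adaptive-penalty heuristic commonly used with PPO rather than something stated in the text.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def ppo_penalty_objective(samples, beta):
    """Mean of ratio * advantage minus beta times the mean KL divergence.

    samples: list of (new_probs, old_probs, action, advantage), where
    new_probs and old_probs are the action distributions of the network being
    optimized (theta) and of the sampling network (theta') in the same state.
    """
    surrogate = 0.0
    kl = 0.0
    for new_probs, old_probs, action, advantage in samples:
        ratio = new_probs[action] / old_probs[action]  # importance weight
        surrogate += ratio * advantage
        kl += kl_divergence(old_probs, new_probs)  # direction is an assumption
    n = len(samples)
    return surrogate / n - beta * kl / n


def adapt_beta(beta, kl, kl_target):
    """Adaptive KL penalty: strengthen beta when KL is too large, relax it when small."""
    if kl > 1.5 * kl_target:
        return beta * 2.0
    if kl < kl_target / 1.5:
        return beta / 2.0
    return beta
```

When the two networks agree, every ratio equals 1 and the KL term vanishes, so the objective reduces to the mean advantage; the penalty only takes effect as θ drifts away from θ'.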

The mutual velocity barrier assumes that the other party uses the same strategy rather than maintaining uniform motion, as shown in FIG. 4, and is described by equation (9):

$$RVO_B^A(v_B, v_A) = \{\, v'_A \mid 2v'_A - v_A \in VO_B^A(v_B) \,\} \tag{9}$$

Under the mutual velocity barrier, an unmanned boat does not simply select a new velocity outside the velocity barriers of the other unmanned boats; instead it selects the average of its current velocity and a velocity outside those velocity barriers. Here v_A and v_B are the currently selected velocities of the unmanned boats. The mutual velocity barrier $RVO_B^A(v_B, v_A)$ of unmanned boat B to unmanned boat A contains all velocities of agent A that are the average of its current velocity v_A and a velocity inside the velocity barrier $VO_B^A(v_B)$ of unmanned boat B. Geometrically, it can be interpreted as the velocity barrier $VO_B^A(v_B)$ translated so that its apex lies at $(v_A + v_B)/2$.

Because collision avoidance of unmanned boats follows the maritime collision avoidance rules, the right (starboard) side is selected when the collision avoidance strategy is executed. Let unmanned boats A and B select new velocities v'_A and v'_B outside each other's mutual velocity barriers; equation (10) demonstrates that this choice is safe.

For step 2, the operation process of the algorithm training model is specifically divided into the following steps:

step 2.1, determining the current positions of the unmanned boats and their target points according to the designed unknown environment;

step 2.2, evaluating the current collision risk with the mutual velocity barrier and feeding the result back to near-end strategy optimization; network π executes the action and updates the position state and the action state to obtain the network parameters θ';

step 2.3, network π_old makes a decision according to the environment to obtain the network parameters θ;

step 2.4, updating θ from θ' through the KL divergence between θ and θ';

step 2.5, when the mutual velocity barrier evaluation detects a collision risk, predicting the velocity state of the obstacle at the next moment and changing the speed and heading of the unmanned boat accordingly so that it avoids the obstacle;

step 2.6, if the unmanned boat moves farther from the target point, feeding back a lower reward value and adjusting its direction of motion toward the target point;

step 2.7, if the selected speed differs greatly from the desired speed, feeding back a lower reward value and adjusting the speed of the unmanned boat toward the desired speed;

step 2.8, judging whether collision avoidance is finished; if it is finished and the target point is reached, a basic collision avoidance route is obtained;

step 2.9, if collision avoidance is not finished, returning to step 2.1 and continuing the iterative update until the target point is reached;

step 2.10, training N times to obtain the optimal collision avoidance route, which completes algorithm training and yields the trained model.

For step 3, a test environment is designed, and preliminary information is obtained from the test environment and the current position state of the unmanned boat for making the decision at the next moment.

For step 4, the surrounding environment information is monitored and represented by mutual velocity barrier vectors.

For step 5, the GRU neural network processes the input information into a fixed dimension, see FIG. 5.

For step 6, a maximum detection range is set for the sensors of each unmanned boat, and the received signals are divided into the size, current speed, current heading and collision avoidance radius of each unmanned boat within the detection range. Once this prior information about the local environment is obtained, local collision avoidance path planning can be carried out.

For step 7, collision avoidance, normal navigation or acceleration is performed according to the evaluation of the mutual velocity barrier algorithm.

For step 8, the reward is fed back according to the distance between the current state of the unmanned boat and the target point, guiding the decision behavior of the unmanned boat at the next moment.

For step 9, the model learns the action strategy by continuously interacting with the environment; the learning effect is represented by the cumulative reward of each training episode, and the total reward value and the result are calculated.

Compared with the prior art, the invention has the following beneficial effects:

The method of the invention forms an extension strategy that combines the mutual velocity barrier algorithm with the near-end strategy optimization algorithm. When the algorithm performs local collision avoidance, the reward function improved by the mutual velocity barrier determines the decision behavior. Surrounding obstacles and other environmental information are uniformly represented by mutual velocity barrier vectors and used to evaluate collision risk: obstacles are detected within the detectable range, and the observed velocity (magnitude and direction) of each obstacle is used to judge whether its position at the next moment poses a collision threat. Near-end strategy optimization then executes collision avoidance behavior according to the collision risk. The collision avoidance rules conform to COLREGs, and the safe navigation task is completed along an optimal path. The structure of the algorithm flow after the mutual velocity barrier is added is shown in FIG. 6.

The method fuses near-end strategy optimization with the mutual velocity barrier, which is used both to represent environmental information and to improve the reward function. This improves the collision avoidance efficiency of the algorithm, addresses its tendency to fall into local optima and to oscillate, and gives good generalization, so that the overall efficiency of safe, collision-free navigation of multiple unmanned surface boats is improved.

Drawings

Fig. 1 is a flow chart of collision avoidance decision of multiple unmanned boats.

Fig. 2 is a diagram of a near-end policy optimization algorithm.

FIG. 3 is a diagram of the velocity barrier algorithm.

Fig. 4 is a diagram of a mutual velocity barrier algorithm.

Fig. 5 is a flowchart of GRU data processing.

FIG. 6 is a schematic diagram of a fusion mutual velocity barrier algorithm structure based on near-end strategy optimization.

Fig. 7 is a structure view of a double-paddle unmanned boat.

Fig. 8 is verification of mutual collision avoidance of multiple unmanned boats.

Fig. 9 is a static obstacle verification scenario for collision avoidance of multiple unmanned boats.

Fig. 10 is a verification of multiple unmanned boats in a scenario with dynamic and static obstacles.

Detailed Description

The technical scheme of the invention is specifically explained below with reference to the accompanying drawings.

To address the low collision avoidance rate, the excessive randomness and the tendency to fall into local optima of the near-end strategy optimization algorithm, the algorithm is improved by adding the mutual velocity barrier.

Collision avoidance is one of the technical problems faced by unmanned boat clusters, and a good decision strategy is needed to ensure safe navigation in complex sea areas. Near-end strategy optimization explores unknown environments well and responds very quickly, but applications on unmanned boats must take into account characteristics such as low sailing speed and smooth trajectories. The mutual velocity barrier algorithm is therefore introduced to improve the reward function mechanism and to solve the collision avoidance problem under limited information.

Near-end strategy optimization is improved by adding an extension strategy combined with the mutual velocity barrier, as follows:

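The geometric definition of the velocity barrier (VO) is shown in FIG. 3. Let

$$A \oplus B = \{\, a + b \mid a \in A,\ b \in B \,\}$$

denote the Minkowski sum of the two unmanned boats A and B, and let −A denote unmanned boat A reflected in its reference point:

$$-A = \{\, -a \mid a \in A \,\}$$

Let λ(s, v) denote the ray starting at s in the direction of v:

$$\lambda(s, v) = \{\, s + t v \mid t \ge 0 \,\}$$

The VO region of unmanned boat A induced by unmanned boat B is given by

$$VO_B^A(v_B) = \{\, v_A \mid \lambda(p_A,\, v_A - v_B) \cap (B \oplus -A) \neq \emptyset \,\}$$

where p_A is the position of unmanned boat A. A velocity v_A inside this region indicates that unmanned boats A and B will collide at some moment.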

In the actual navigation of USVs, this approach can cause undesirable oscillation when each unmanned boat treats the other unmanned boats as moving obstacles and selects for itself a velocity outside every velocity barrier they induce. Consider the following case. Unmanned boats A and B move toward each other with velocities v_A and v_B, so that $v_A \in VO_B^A(v_B)$ and $v_B \in VO_A^B(v_A)$; continuing at the current velocities would lead to a collision. Unmanned boat A therefore changes its velocity to v'_A so that it lies outside the velocity barrier of B, i.e. $v'_A \notin VO_B^A(v_B)$. At the same time, unmanned boat B changes its velocity to v'_B so that it lies outside the velocity barrier of A, i.e. $v'_B \notin VO_A^B(v_A)$. In the new situation, however, the old velocities v_A and v_B lie outside the velocity barriers of B and A, respectively (i.e. $v_A \notin VO_B^A(v'_B)$ and $v_B \notin VO_A^B(v'_A)$). If both agents prefer the old velocities, because those lead them directly to their targets, they will select them again. In the next cycle these velocities again lead to a collision, so v'_A and v'_B may be selected again, and so on. Thus, when the velocity barrier method is used for mutual avoidance, the agents oscillate between these two velocities.

To solve the above problem, the velocity barrier is improved into the mutual velocity barrier, described by the following formula:

$$RVO_B^A(v_B, v_A) = \{\, v'_A \mid 2v'_A - v_A \in VO_B^A(v_B) \,\}$$

Instead of selecting for each unmanned boat a new velocity outside the velocity barriers of the other unmanned boats, a new velocity is selected that is the average of its current velocity and a velocity outside those velocity barriers. The mutual velocity barrier $RVO_B^A(v_B, v_A)$ of unmanned boat B to unmanned boat A contains all velocities of agent A that are the average of its current velocity v_A and a velocity inside the velocity barrier $VO_B^A(v_B)$ of unmanned boat B. Geometrically, it is the velocity barrier $VO_B^A(v_B)$ translated so that its apex lies at $(v_A + v_B)/2$.

Because collision avoidance of unmanned boats follows the maritime collision avoidance rules, the right (starboard) side is selected when the collision avoidance strategy is executed. Let unmanned boats A and B select new velocities v'_A and v'_B outside each other's mutual velocity barriers. The following formula demonstrates that this choice is safe.
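The membership tests behind the velocity barrier and the mutual velocity barrier can be sketched for disc-shaped boats. This is a minimal illustration under stated assumptions, not the patent's code: both boats are approximated by discs (so the Minkowski sum becomes a disc of radius R_A + R_B), and the function names and argument layout are hypothetical. A candidate velocity lies in the mutual velocity barrier exactly when its reflection 2v'_A − v_A lies in the ordinary velocity barrier.

```python
import math


def in_velocity_obstacle(p_a, p_b, v_a, v_b, radius_sum):
    """True if the ray from p_a along v_a - v_b hits the disc of radius_sum at p_b."""
    rx, ry = p_b[0] - p_a[0], p_b[1] - p_a[1]   # line of sight from A to B
    ux, uy = v_a[0] - v_b[0], v_a[1] - v_b[1]   # velocity of A relative to B
    dist = math.hypot(rx, ry)
    if dist <= radius_sum:
        return True                              # discs already overlap
    speed = math.hypot(ux, uy)
    if speed == 0.0:
        return False                             # no relative motion
    # The collision cone has its axis on the line of sight and half-angle
    # asin(radius_sum / dist); compare cosines to avoid inverse trig calls.
    cos_angle = (rx * ux + ry * uy) / (dist * speed)
    cos_half = math.sqrt(dist * dist - radius_sum * radius_sum) / dist
    return cos_angle >= cos_half


def in_reciprocal_velocity_obstacle(p_a, p_b, v_a, v_b, v_new, radius_sum):
    """True if v_new is in the mutual velocity barrier of B to A."""
    reflected = (2.0 * v_new[0] - v_a[0], 2.0 * v_new[1] - v_a[1])
    return in_velocity_obstacle(p_a, p_b, reflected, v_b, radius_sum)
```

For two boats 10 units apart approaching head-on at unit speed with a combined radius of 2, the velocity (1, −0.3) is still inside the plain velocity barrier but already outside the mutual velocity barrier, which reflects that each boat only needs to take half of the avoidance effort.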

The simplified structure of the algorithm operation flow after the improvement is shown in fig. 6. The operation steps are as follows:

step 1, constructing a decision model, wherein the neural network structure is 2 layers, and each layer comprises 256 neurons.

Step 2, training the model and making decision actions. The unmanned boat is a double-paddle underactuated unmanned boat, as shown in FIG. 7. The center of mass c of the unmanned boat lies at the midpoint of the double-paddle axis, and (x_c, y_c) are the coordinates of the center of mass; α is the heading angle, i.e. the angle between the direction of motion of the unmanned boat and the x axis. The pose vector of the unmanned boat is P = (x_c, y_c, α)^T. Here r_l is the radius of motion, Δα is the heading angle increment, v_l is the linear velocity of the left paddle, v_r is the linear velocity of the right paddle, and l is the distance between the two paddles.

The kinematic equations of the double-paddle differential-drive unmanned boat, obtained from rigid-body mechanics, are as follows:

wherein v is the linear velocity at the barycenter of the unmanned ship, and omega is the steering angular velocity of the unmanned ship;

Assume the initial pose vector of the unmanned boat is S_start = (x_0, y_0, α)^T, with current position x_c = S_start[0], y_c = S_start[1], α = S_start[2].

where cur is the curvature and ste ∈ {−1, 0, 1}: ste = −1 means the unmanned boat turns left, ste = 0 means it moves straight, and ste = 1 means it turns right. r_min is the minimum turning radius.

Rotation angle:

$$\delta = |ste| \times l_{step} \times cur \times gea$$

where l_step is the step size and gea ∈ {−1, 1}: gea = −1 is the reverse gear and gea = 1 is the forward gear.

Movement distance:

$$l_{trans} = (1 - |ste|) \times l_{step} \times gea$$

Rotation matrix

Translation matrix

If |ω| ≥ 0.01 (that is, ω ≥ 0.01 or ω ≤ −0.01), the position at the next moment is

If ω → 0, the next time position is

Wherein

T_s is the sampling time.
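The pose update over one sampling period can be sketched as follows. Because the kinematic equations themselves are not reproduced in this text, the sketch assumes the standard differential-drive model, v = (v_l + v_r)/2 and ω = (v_r − v_l)/l with counter-clockwise rotation taken as positive; only the 0.01 threshold on ω and the symbols v_l, v_r, l, α and T_s come from the description. The function names are hypothetical.

```python
import math


def paddles_to_body(v_left, v_right, paddle_distance):
    """Map paddle speeds to body speed v and turn rate omega (assumed model)."""
    v = 0.5 * (v_left + v_right)
    omega = (v_right - v_left) / paddle_distance  # sign convention is assumed
    return v, omega


def step_pose(x, y, alpha, v, omega, t_s, omega_eps=0.01):
    """Advance the pose (x, y, alpha) by one sampling period t_s."""
    if abs(omega) >= omega_eps:
        # Turning case: the boat moves along a circular arc of radius v / omega.
        radius = v / omega
        new_alpha = alpha + omega * t_s
        new_x = x + radius * (math.sin(new_alpha) - math.sin(alpha))
        new_y = y - radius * (math.cos(new_alpha) - math.cos(alpha))
        return new_x, new_y, new_alpha
    # Near-zero turn rate: approximate the motion as a straight line.
    return (x + v * t_s * math.cos(alpha),
            y + v * t_s * math.sin(alpha),
            alpha + omega * t_s)
```

Splitting on |ω| avoids dividing by a turn rate close to zero, which is the usual reason for treating the straight-line case separately.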

Coordinate transformation between the double-paddle center and the center of mass after the unmanned boat has moved:

Step 3, different test environments are designed to demonstrate the generalization capability of the model.

Step 4, the surrounding environment information is uniformly represented by vectors, which are used as the model input for decision making.

Step 5, during navigation the unmanned boat must observe the behavior of other vessels while ensuring its own safe navigation. Because all vessels learn at the same time, the environment is continually reshaped: the number of other vessels detected around the unmanned boat keeps changing, and so does the dimension of the network's input data. For variable-length input sequences, the GRU algorithm is used to extract the valid information, as shown in FIG. 5, where O_1, O_2, O_3, ..., O_n are the observations of the surrounding vessels within the detection range and O_self is the boat's own state. The extracted information is concatenated with the boat's own state value to form an observation O of fixed length. The GRU algorithm preserves the information of each vessel without distortion, the observation data are normalized to speed up training, and the optimal action is selected through network learning.
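The way a GRU turns a variable number of neighbour observations into a fixed-length vector can be sketched in pure Python. This is an illustration only: the bias-free gate equations, the random untrained weights, the layer sizes, and the names `TinyGRU` and `build_observation` are assumptions and do not come from the patent.

```python
import math
import random


def _matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyGRU:
    """Bias-free GRU cell with random weights, for illustration only."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_size = hidden_size
        # One input and one recurrent matrix per gate: update z, reset r, candidate n.
        self.w = {gate: mat(hidden_size, input_size) for gate in "zrn"}
        self.u = {gate: mat(hidden_size, hidden_size) for gate in "zrn"}

    def step(self, x, h):
        z = [_sigmoid(a + b) for a, b in zip(_matvec(self.w["z"], x), _matvec(self.u["z"], h))]
        r = [_sigmoid(a + b) for a, b in zip(_matvec(self.w["r"], x), _matvec(self.u["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b)
             for a, b in zip(_matvec(self.w["n"], x), _matvec(self.u["n"], reset_h))]
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def encode(self, observations):
        """Fold any number of neighbour observations into one hidden vector."""
        h = [0.0] * self.hidden_size
        for obs in observations:
            h = self.step(obs, h)
        return h


def build_observation(gru, neighbour_obs, self_state):
    """Concatenate the GRU summary of the neighbours with the boat's own state."""
    return gru.encode(neighbour_obs) + list(self_state)
```

Whether one neighbour or ten are in range, the result has length `hidden_size + len(self_state)`, so the policy network always sees an input of the same dimension.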

Step 6, for collision risk evaluation, environmental information is fed into the model through the mutual velocity barrier, so that the model adjusts its decision behavior according to the position and velocity information of the obstacles around the unmanned boat.

Step 7, the collision avoidance check is converted into a circle and line-segment collision detection problem. From the geometric definition of the mutual velocity barrier, collision avoidance between two unmanned boats can be converted into collision avoidance between a mass point and a circle: unmanned boat A is treated as a mass point, and its radius R_A is added to that of unmanned boat B. The trajectory of the mass point is the velocity ray emitted from the starting point E; assuming collision avoidance is completed after time t, the end point is denoted L. C denotes the center of the collision circle, i.e. P_B, and R denotes the radius of the circle, i.e. R_A + R_B.

$$d = L - E$$

is the direction vector of the ray, from the starting point to the end point; in the mutual velocity barrier it represents the velocity.

$$f = E - C$$

is the vector from the center of the circle to the starting point of the ray.

Substituting the parametric equations

$$P_x = E_x + t\,d_x$$

$$P_y = E_y + t\,d_y$$

into the equation of the circle, $(P_x - C_x)^2 + (P_y - C_y)^2 = R^2$, finally gives a quadratic equation in t:

$$(d \cdot d)\,t^2 + 2\,(f \cdot d)\,t + (f \cdot f - R^2) = 0$$

Solving this equation case by case determines the relative position of the velocity trajectory of the mass point and the circle.
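The case analysis of the quadratic can be sketched as a small function. The symbols E, L, C and R follow the description; the function name, the tuple-based point representation and the convention of returning `None` when there is no collision are assumptions made for illustration.

```python
import math


def first_hit_parameter(start, end, center, radius):
    """Smallest t in [0, 1] at which the segment start -> end meets the circle.

    Returns None when the segment misses the circle, and 0.0 when the
    starting point is already inside it.
    """
    dx, dy = end[0] - start[0], end[1] - start[1]        # d = L - E
    fx, fy = start[0] - center[0], start[1] - center[1]  # f = E - C
    a = dx * dx + dy * dy
    b = 2.0 * (fx * dx + fy * dy)
    c = fx * fx + fy * fy - radius * radius
    if a == 0.0:
        return 0.0 if c <= 0.0 else None   # degenerate segment: a single point
    discriminant = b * b - 4.0 * a * c
    if discriminant < 0.0:
        return None                        # the line never touches the circle
    root = math.sqrt(discriminant)
    t1 = (-b - root) / (2.0 * a)           # entry point
    t2 = (-b + root) / (2.0 * a)           # exit point
    if 0.0 <= t1 <= 1.0:
        return t1                          # enters the circle within the segment
    if t1 < 0.0 <= t2:
        return 0.0                         # started inside the circle
    return None                            # circle is behind or beyond the end point
```

If `end` is chosen as `start` plus the relative velocity multiplied by a look-ahead time, the returned t scales that look-ahead time to an expected time to collision.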

Step 8, to address the sparse reward problem, a reward evaluation function is applied at every action step: a positive reward is given when the unmanned boat approaches the target point and avoids obstacles, and a negative reward is given when it does not approach the target point, so that the boat reaches the target point along the optimal path in the shortest time. To this end, the invention defines a reward function for the mutual velocity barrier algorithm, denoted R_rvo:

where p_1, p_2, p_3, p_4, p_5, p_6 are constants set according to the environment in the experiments and used to tune the reward function so as to improve the performance of the policy function.

The distance dd represents the difference between the selected speed v_t and the desired speed, and its maximum is set to 3, i.e. dd_max = 3. R_dd is the speed reward function; it takes values in the range (0, 1) and is inversely proportional to the distance, i.e. the closer the selected speed is to the desired speed, the larger the reward value. R_t is the time reward function; it takes values in the range (0, 1) and is inversely proportional to time, i.e. the shorter the time, the larger the reward value. t_min is the expected minimum time for the unmanned boat to collide with an obstacle at the current velocity.

Step 9, after the test is finished, whether the unmanned boat has reached the target point is judged from the test result and the fed-back reward value.

By combining the advantages of near-end strategy optimization and the mutual velocity barrier, the invention achieves COLREGs-based collision avoidance for multiple unmanned boats and ensures that they navigate safely while performing tasks.

In the invention, the fusion of the two algorithms and the use of the kinematic model bring the method closer to the navigation state of a real unmanned boat. The unmanned boats can act independently as well as operate cooperatively, so that collision avoidance for multiple unmanned boats is achieved efficiently.

The above are preferred embodiments of the present invention. All changes made according to the technical scheme of the present invention whose functional effects do not exceed the scope of that technical scheme fall within the protection scope of the present invention.

Claims

1. A collision avoidance decision method for multiple unmanned boats is characterized by comprising the following steps:

step 1, constructing a decision model;

step 2, loading unknown environment and training a decision model;

step 3, designing a test environment, and extracting current environment information capable of being monitored;

step 4, sensing the environment;

step 5, data processing;

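step 6, performing risk assessment and checking the current risk state of the unmanned boat;

step 7, executing the corresponding decision behavior for the risk according to step 6;

step 8, calculating the reward value according to step 7;

step 9, judging whether collision avoidance is achieved, and returning the reward value and the result.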

2. The collision avoidance decision method for multiple unmanned boats according to claim 1, characterized in that in step 1 the decision model is constructed from a near-end strategy optimization algorithm and a mutual velocity barrier algorithm; the near-end strategy optimization algorithm starts by initializing the neural networks and is provided with two actor networks, each with two layers of 256 neurons, wherein network π performs sampling and the old network π_old performs updating; during the training cycle, network π receives the current environment information, selects an action according to that information, updates the state to s' and returns the reward r; the two actor networks are constrained by an adaptive KL penalty; the critic network has two layers of 256 neurons each, evaluates the quality of the action from s' and r, and updates network π; the mutual velocity barrier algorithm is a velocity-based collision avoidance algorithm in which surrounding information is represented by vectors and collision risk is evaluated from the speed and direction of motion.

3. The collision avoidance decision method for the multiple unmanned boats according to claim 2, characterized in that in step 2 an unknown environment needs to be designed; the optimization target of the near-end strategy optimization algorithm is to maximize the expected reward, and importance sampling is selected as the sampling method when the expectation is calculated; importance sampling is the key to updating the network with parameter θ using data collected by the network with parameter θ', and two unmanned boats are described by two distribution functions p and q; the expectation is calculated as follows:

$$E_{x\sim p}[f(x)] = \int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx = E_{x\sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$$

where f(x) is the sampled function, x is a value sampled from p(x) or q(x), p = p(x) and q = q(x); in theory q can be any distribution, but in practice p and q must be close, as follows from the variances under the two distributions:

$$\mathrm{Var}_{x\sim p}[f(x)] = E_{x\sim p}\!\left[f(x)^2\right] - \left(E_{x\sim p}[f(x)]\right)^2$$

$$\mathrm{Var}_{x\sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right] = E_{x\sim p}\!\left[f(x)^2\,\frac{p(x)}{q(x)}\right] - \left(E_{x\sim p}[f(x)]\right)^2$$

the two expectations are equal, but the second moment under q carries the extra factor p(x)/q(x), so the variance grows as p and q move apart;

when the number of samples drawn under the distributions p(x) and q(x) exceeds 1000, p(x) = q(x) is taken to hold;
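The importance sampling step can be tried numerically with a minimal sketch. The Gaussian choices for p and q, the test function x², and all names below are assumptions made for illustration; the claim does not specify them.

```python
import math
import random


def normal_pdf(x, mean, std=1.0):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def square(x):
    return x * x


def importance_sampling_mean(f, p_mean, q_mean, n, rng):
    """Estimate E_{x~p}[f(x)] using samples drawn from q only."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(q_mean, 1.0)                              # sample from q
        weight = normal_pdf(x, p_mean) / normal_pdf(x, q_mean)  # p(x) / q(x)
        total += f(x) * weight
    return total / n


rng = random.Random(0)
# Target: E_{x~N(0,1)}[x^2] = 1, estimated from two different proposals q.
close_estimate = importance_sampling_mean(square, 0.0, 0.5, 100_000, rng)
far_estimate = importance_sampling_mean(square, 0.0, 3.0, 100_000, rng)
```

With q centred at 0.5 the estimate settles near the true value of 1. With q centred at 3.0 the weights p(x)/q(x) vary enormously, so the estimate is typically much less reliable, which is the practical reason for keeping p and q close.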

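an online (on-policy) strategy is converted into an offline (off-policy) strategy by the importance sampling method; in the strategy gradient, the expectation to be solved is:

$$E_{\tau\sim p_\theta(\tau)}\!\left[R(\tau)\,\nabla\log p_\theta(\tau)\right]$$

which is converted to:

$$E_{\tau\sim p_{\theta'}(\tau)}\!\left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}\,R(\tau)\,\nabla\log p_\theta(\tau)\right]$$

where R(τ) is the reward value, τ is the sampled trajectory, p_θ and p_θ' are probability values, and

$$\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}$$

is the correction term;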

and when the method is applied to an actual environment, the gradient update is:

$$E_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[\frac{p_\theta(a_t\mid s_t)}{p_{\theta'}(a_t\mid s_t)}\,A^{\theta}(s_t,a_t)\,\nabla\log p_\theta(a_t^n\mid s_t^n)\right]$$

wherein A^θ(s_t, a_t) is the evaluation function, used for evaluating the quality of the action a selected in the state s at the moment t; π_θ and π_θ' are the strategies of the two distributions; p_θ and p_θ' are probability values; and n denotes the n-th sample;

the new optimization function is:

$$J^{\theta'}(\theta) = E_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[\frac{p_\theta(a_t\mid s_t)}{p_{\theta'}(a_t\mid s_t)}\,A^{\theta}(s_t,a_t)\right]$$

the near-end policy optimization algorithm definition is obtained from the above formula:

$$J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\,\mathrm{KL}(\theta,\theta')$$

wherein β is a weight coefficient, θ' denotes the demonstration parameter, θ denotes the parameter to be optimized, and the KL divergence measures the difference between θ and θ', the difference being that of the behaviors (actors) corresponding to the parameters; β KL(θ, θ') is the limiting condition;
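The penalized objective can be estimated from a batch of samples, as in the following minimal sketch. The function name, the use of log-probabilities as inputs, and the sample-based KL approximation are assumptions of this illustration rather than details given in the claim.

```python
import math


def ppo_penalty_objective(logp_new, logp_old, advantages, beta):
    """Batch estimate of the surrogate objective minus beta times KL.

    logp_new / logp_old: log-probabilities of the taken actions under the
    policy being optimized (theta) and the sampling policy (theta').
    The KL term is approximated by the batch mean of logp_old - logp_new,
    which is an assumption of this sketch.
    """
    n = len(advantages)
    surrogate = sum(
        math.exp(ln - lo) * adv          # importance ratio p_theta / p_theta'
        for ln, lo, adv in zip(logp_new, logp_old, advantages)
    ) / n
    kl_estimate = sum(lo - ln for ln, lo in zip(logp_new, logp_old)) / n
    return surrogate - beta * kl_estimate
```

When the two policies coincide, every ratio equals 1 and the KL estimate is 0, so the value reduces to the mean of the evaluation function over the batch.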

the mutual speed barrier assumes that the other party uses the same strategy rather than maintaining uniform motion, and is described by the following equation:

$$RVO^A_B(v_B, v_A) = \{\, v'_A \mid 2v'_A - v_A \in VO^A_B(v_B) \,\}$$

under the mutual speed obstacle, each unmanned boat does not select a new speed lying outside the speed obstacles of the other unmanned boats, but instead selects the average of its current speed and a speed lying outside the speed obstacles of the other unmanned boats; v_A and v_B are the currently selected speeds of the unmanned boats, and the mutual speed barrier RVO^A_B(v_B, v_A) from the unmanned boat B to the unmanned boat A contains all speeds of the unmanned boat A that are the average of the current speed v_A and a speed within the speed obstacle VO^A_B(v_B); it can be geometrically interpreted as the speed obstacle VO^A_B(v_B) translated so that its vertex is located at (v_A + v_B)/2;

considering that collision avoidance of the unmanned boats follows the maritime collision avoidance rules (COLREGs), the right (starboard) side is selected when a collision avoidance strategy is executed; let the unmanned boats A and B select new speeds v'_A and v'_B outside each other's mutual speed barriers; the following formula demonstrates its safety:
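One way to write the safety argument is sketched here under assumptions that this claim does not state. VO^A_B(v_B) denotes the speed obstacle of boat B to boat A. The mutual speed obstacle is RVO^A_B(v_B, v_A) = { v'_A | 2v'_A - v_A ∈ VO^A_B(v_B) }. Both boats pass on the same side with v'_A = v_A + u and v'_B = v_B - u for some vector u. The speed obstacle is symmetric and translation invariant. Under those assumptions the chain of equivalences is:

```latex
\begin{aligned}
v_A + u \notin RVO^A_B(v_B, v_A)
  &\iff 2(v_A + u) - v_A \notin VO^A_B(v_B) \\
  &\iff v_A + 2u \notin VO^A_B(v_B) \\
  &\iff v_A + u \notin VO^A_B(v_B - u)  && \text{(translation invariance)} \\
  &\iff v_B - u \notin VO^B_A(v_A + u)  && \text{(symmetry)} \\
  &\iff v_B - 2u \notin VO^B_A(v_A)     && \text{(translation invariance)} \\
  &\iff v_B - u \notin RVO^B_A(v_A, v_B)
\end{aligned}
```

The third and fourth lines state that each new speed lies outside the speed obstacle induced by the other boat's new speed, so the pair of new speeds is collision-free for as long as both are maintained.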

4. The collision avoidance decision method for multiple unmanned boats according to claim 3, wherein in the step 2, the specific steps of training the decision model are as follows:

step 2.4, updating θ by means of θ' through the KL divergence between θ and θ';

step 2.6, if the unmanned boat moves farther away from the target point, feeding back a lower reward value and adjusting the movement direction of the unmanned boat so that it approaches the target point;

step 2.10, training N times to obtain the optimal collision avoidance route, and finishing the training to obtain the trained decision model.

5. The collision avoidance decision method for multiple unmanned boats according to claim 1, wherein the step 3 is specifically as follows: designing a test environment, and obtaining preliminary information according to the test environment and the current unmanned ship position state for making a decision at the next moment.

6. The collision avoidance decision method for multiple unmanned boats according to claim 2, wherein the step 4 is implemented in a manner that: ambient information is monitored and represented by a mutual velocity barrier vector.
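A minimal sketch of how a candidate velocity might be tested against a mutual velocity barrier is given below. It assumes that boats are modelled as discs, that the barrier is defined by mirroring the candidate velocity through the current velocity (2v' - v_A) into the ordinary velocity obstacle, and that all function and parameter names are hypothetical.

```python
def in_velocity_obstacle(v, p_a, p_b, v_b, combined_radius):
    """True if boat A moving at v would eventually hit boat B moving at v_b.

    Boats are modelled as discs (an assumption); combined_radius is the sum
    of the two collision avoidance radii.
    """
    rel_px, rel_py = p_b[0] - p_a[0], p_b[1] - p_a[1]
    rel_vx, rel_vy = v[0] - v_b[0], v[1] - v_b[1]
    if rel_px ** 2 + rel_py ** 2 <= combined_radius ** 2:
        return True                      # already overlapping
    speed_sq = rel_vx ** 2 + rel_vy ** 2
    if speed_sq == 0.0:
        return False                     # no relative motion
    # Time at which the relative-motion ray passes closest to B.
    t = (rel_px * rel_vx + rel_py * rel_vy) / speed_sq
    if t <= 0.0:
        return False                     # boats are moving apart
    dx, dy = rel_px - t * rel_vx, rel_py - t * rel_vy
    return dx * dx + dy * dy < combined_radius ** 2


def in_mutual_velocity_obstacle(v_new, v_a, p_a, p_b, v_b, combined_radius):
    """v_new is inside the mutual barrier iff 2*v_new - v_a is inside the obstacle."""
    mirrored = (2.0 * v_new[0] - v_a[0], 2.0 * v_new[1] - v_a[1])
    return in_velocity_obstacle(mirrored, p_a, p_b, v_b, combined_radius)
```

For a head-on encounter with A at (0, 0) moving at (1, 0) and B at (10, 0) moving at (-1, 0), keeping the current velocity lies inside the barrier, whereas the candidate (1, -1), a turn to the right in a frame with y pointing to port, lies outside it.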

7. The collision avoidance decision method for multiple unmanned boats according to claim 1, wherein the step 5 is implemented in a specific manner as follows: the GRU neural network processes the input information into the same dimension.
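The idea of mapping a variable number of obstacle observations to a fixed dimension can be sketched with a tiny recurrent cell. The code below is an untrained, pure-Python GRU written for illustration. The layer sizes, the random weights, the gate convention (update gate blending old state and candidate) and the four input features per obstacle are all assumptions, not details taken from the claim.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(w, x, u, h, b):
    """Compute W x + U h + b row by row."""
    return [
        sum(wi * xi for wi, xi in zip(w_row, x))
        + sum(ui * hi for ui, hi in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(w, u, b)
    ]


class TinyGRU:
    """Minimal GRU cell with randomly initialised, untrained weights."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.hidden_size = hidden_size
        # One (W, U, b) triple each for update gate z, reset gate r, candidate h.
        self.params = {
            name: (mat(hidden_size, input_size),
                   mat(hidden_size, hidden_size),
                   [0.0] * hidden_size)
            for name in ("z", "r", "h")
        }

    def step(self, x, h):
        w_z, u_z, b_z = self.params["z"]
        w_r, u_r, b_r = self.params["r"]
        w_h, u_h, b_h = self.params["h"]
        z = [sigmoid(v) for v in affine(w_z, x, u_z, h, b_z)]
        r = [sigmoid(v) for v in affine(w_r, x, u_r, h, b_r)]
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(v) for v in affine(w_h, x, u_h, gated, b_h)]
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def encode(self, sequence):
        """Fold a sequence of any length into one hidden vector."""
        h = [0.0] * self.hidden_size
        for x in sequence:
            h = self.step(x, h)
        return h
```

Whether one obstacle or five are observed, `encode` returns a vector of length `hidden_size`, which is what allows a fixed-size policy network to consume it.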

8. The collision avoidance decision-making method for multiple unmanned boats according to claim 2, wherein the step 6 is implemented in a manner that: a maximum detection range is set for the sensor of each unmanned boat, and the signals to be received are divided into the size, the current speed, the current heading and the collision avoidance radius of the other boats within the detectable range; after the prior information of the local environment is obtained, local collision avoidance path planning is realized.
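The detection range limit can be pictured with the following minimal sketch. The record layout, the field names and the choice to measure range between boat centres are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class BoatSignal:
    """Signal received from another boat (field names are illustrative)."""
    x: float
    y: float
    size: float
    speed: float
    heading: float            # radians
    avoidance_radius: float


def detect_boats(own_x, own_y, others, max_range):
    """Return the signals of boats within the maximum range, nearest first."""
    def distance(boat):
        return math.hypot(boat.x - own_x, boat.y - own_y)

    visible = [boat for boat in others if distance(boat) <= max_range]
    return sorted(visible, key=distance)
```

Boats beyond `max_range` are simply not reported, so the decision model only ever plans against the local neighbourhood.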

9. The collision avoidance decision method for multiple unmanned boats according to claim 2, wherein the step 7 is implemented in a manner that: collision avoidance behavior, normal navigation or acceleration behavior is performed according to the evaluation of the mutual velocity barrier algorithm.

10. The collision avoidance decision method for multiple unmanned boats according to claim 1, wherein the step 8 is implemented in a manner that: the reward is fed back according to the distance between the current state of the unmanned boat and the target point, and guides the decision-making behavior of the unmanned boat at the next moment.
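A distance-based reward of this kind could look like the minimal sketch below. The progress-difference form, the arrival bonus and every constant are assumptions for illustration; the claim only states that the reward depends on the distance to the target point.

```python
import math


def distance_reward(prev_pos, cur_pos, goal, goal_radius=1.0, goal_bonus=10.0):
    """Reward progress toward the target point (all constants are assumed)."""
    prev_dist = math.dist(prev_pos, goal)
    cur_dist = math.dist(cur_pos, goal)
    if cur_dist <= goal_radius:
        return goal_bonus            # target point reached
    return prev_dist - cur_dist      # positive when closer, negative when farther
```

A step that closes the gap to the target by one unit earns +1, and a step that opens it by one unit earns -1, so moving away is fed back as a lower reward.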