CN112540614B - Unmanned ship track control method based on deep reinforcement learning - Google Patents

Unmanned ship track control method based on deep reinforcement learning Download PDF

Info

Publication number
CN112540614B
Authority
CN
China
Prior art keywords
unmanned
reward
unmanned ship
network
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011353012.4A
Other languages
Chinese (zh)
Other versions
CN112540614A (en)
Inventor
仲伟波
李浩东
冯友兵
常琦
许强
林伟
孙彬
胡智威
齐国庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology filed Critical Jiangsu University of Science and Technology
Priority to CN202011353012.4A priority Critical patent/CN112540614B/en
Publication of CN112540614A publication Critical patent/CN112540614A/en
Application granted granted Critical
Publication of CN112540614B publication Critical patent/CN112540614B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02Control of position or course in two dimensions
    • G05D1/0206Control of position or course in two dimensions specially adapted to water vehicles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention belongs to the field of unmanned ship track control and discloses an unmanned ship track control method based on deep reinforcement learning. A deep reinforcement learning framework is designed for track control of the unmanned boat, a system with large hysteresis, and this framework enables a large-hysteresis, non-Markov system such as an unmanned boat to obtain a good training effect through deep reinforcement learning.

Description

Unmanned ship track control method based on deep reinforcement learning
Technical Field
The invention belongs to the field of unmanned ship track control, and particularly relates to an unmanned ship track control method based on deep reinforcement learning.
Background
In recent years deep neural networks have developed rapidly, and reinforcement learning combined with deep neural networks has achieved remarkable results in board games, video games, recommendation systems and other areas. Deep reinforcement learning achieves good training effects in these fields because their rules are relatively clear, the state transitions strictly satisfy the Markov property, and the factors influencing the agent are relatively few and controllable. When deep reinforcement learning is applied to an unmanned boat, the boat is influenced by many environmental factors, and the factors that must be considered differ between tasks and environments. Whether the unmanned boat can obtain sufficient and accurate environmental information is an important factor affecting the learning effect of deep reinforcement learning. Track control is the basis on which the unmanned boat completes other tasks, and applying deep reinforcement learning to track control is an important step for the unmanned boat to move from automatic control toward artificial intelligence.
Disclosure of Invention
The invention designs a deep reinforcement learning framework for track control of the unmanned boat, a system with large hysteresis; this framework enables a large-hysteresis, non-Markov system such as an unmanned boat to obtain a good training effect through deep reinforcement learning.
The invention is realized by the following technical scheme: an unmanned ship track control method based on deep reinforcement learning comprises the following steps:
Step one: initializing the network parameters of the decision network Q and the target network Q';
Step two: obtaining the current state S_t of the unmanned boat, comprising the position information and speed information at the current moment, the data of the obstacle-avoidance sensor carried by the unmanned boat, and the rudder angle position and propeller output power at the previous moment;
Step three: preprocessing the state information of the unmanned boat: to account for the large inertia of the boat, difference quantities of the length and angle information are introduced into the state information; to account for the delay of the computing board, integral (delayed) quantities of the state information are introduced into the state information;
Step four: substituting the state S_t′ into the decision network Q and obtaining an action ac and a reward r according to the policy π(ac|s);
Step five: executing the action, entering the next state S_{t+1}, and preprocessing it to obtain the state S_{t+1}′;
Step six: storing (S_t′, S_{t+1}′, ac, r), together with its sampling priority, as one piece of data in the experience pool;
Step seven: sampling m pieces of data with the sampling priority as the basis of the sampling probability and feeding them into the target network to obtain the loss function ω;
Step eight: updating the decision network Q using ω;
Step nine: if i ≥ n, updating the target network Q′ once with the parameters of the decision network Q and setting i = 0;
Step ten: checking whether the training termination condition is reached; if so, ending the training, otherwise jumping to step two.
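The ten steps above can be summarised in a short training-loop sketch. The following Python code is purely illustrative: the environment interface (env), the helper functions preprocess and dqn_loss, and the PyTorch-style network and optimizer calls are assumptions introduced for the example and are not part of the disclosed method.

```python
import random

def train(env, decision_net, target_net, replay_pool, optimizer,
          preprocess, dqn_loss, n_target_sync=100, batch_size=32, epsilon=0.1):
    """Illustrative loop for steps one to ten (all helper objects are hypothetical)."""
    target_net.load_state_dict(decision_net.state_dict())   # step one: initialise Q and Q'
    i = 0
    state = preprocess(env.reset())                          # steps two and three
    while not env.training_finished():                       # step ten
        # step four: choose an action with the decision network (epsilon-greedy policy)
        if random.random() < epsilon:
            action = env.random_action()
        else:
            action = decision_net.best_action(state)
        # step five: execute the action and preprocess the next state
        next_raw, reward, done = env.step(action)
        next_state = preprocess(next_raw)
        # step six: store the transition; new records enter at the highest sampling level
        replay_pool.add(state, next_state, reward, action)
        # step seven: sample m transitions by priority and compute the loss omega
        batch = replay_pool.sample(batch_size)
        loss = dqn_loss(decision_net, target_net, batch)
        # step eight: update the decision network Q
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        i += 1
        # step nine: copy Q into the target network Q' every n updates
        if i >= n_target_sync:
            target_net.load_state_dict(decision_net.state_dict())
            i = 0
        state = preprocess(env.reset()) if done else next_state
```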
Further, in step two, the actuation information of the rudder angle and the propeller output power at the previous moment is also used as a part of the state information.
Further, in step three, the data of the state S is preprocessed when it is input into the decision network, so that the large-hysteresis system, which does not satisfy the Markov property, can also satisfy the Markov property to a certain extent.
Further, the rewards received by the unmanned boat are specified in detail to prevent the low learning and training efficiency caused by sparse rewards.
Further, in step two, the probability with which the data used to train the neural network is sampled is dynamically adjusted, so that the newest data can be used as early as possible while all data are used evenly, improving the overall utilization of the data.
Compared with the prior art, the invention has the following beneficial effects: a deep reinforcement learning framework is designed for track control of the unmanned boat, a system with large hysteresis, and this framework enables such a large-hysteresis, non-Markov system to obtain a good training effect through deep reinforcement learning. Through difference preprocessing of the state information, the state transition of the unmanned boat conforms to the Markov property to a certain extent, and through delay preprocessing the influence of the actuation delay of the unmanned boat on the training effect is reduced adaptively. Detailed reward functions are set with track control as the main objective, the relations among the reward components are analysed, and accidents that the unmanned boat may encounter during training are taken into account in the reward design so that training is not disrupted by them.
Drawings
FIG. 1 is a block diagram of an algorithm flow of an unmanned ship track control method based on deep reinforcement learning according to the present invention;
FIG. 2 is a data flow diagram of the unmanned surface vehicle track control method based on deep reinforcement learning;
fig. 3 is a diagram of unmanned ship hardware distribution and connection of the unmanned ship track control method based on deep reinforcement learning according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Moreover, the technical solutions in the embodiments of the present invention may be combined with each other, provided that the combination can be realized by a person skilled in the art; when technical solutions are contradictory or cannot be realized, the combination should be considered not to exist and does not fall within the protection scope of the present invention.
The specific implementation process is explained with reference to the drawings:
(1) Network parameter initialization: if training is performed for the first time, the weight parameters of the networks are initialized randomly; otherwise the networks are initialized with the parameters saved at the end of the previous training. The parameter i counts the updates of the decision (evaluation) network, and the target network is updated once after the decision network has been updated n times. The unmanned boat samples the environmental data at an interval of T seconds, updates the decision network once every T seconds, and updates the target network once every n·T seconds.
(2) The acquired current state information includes the position of the unmanned boat. Since the position of the target track point is known, a coordinate system is established with the current target track point as the origin and the target track direction as the positive x axis, so the coordinates of the unmanned boat can be calculated as G_t = (x_t, y_t). The angle from the current target track direction to the next target track direction is Δθ_t, with -180° < Δθ_t ≤ 180°. The data of the obstacle-avoidance sensor of the unmanned boat is D_t.
Because the computation on the computing board is time-consuming, the response delay of the motors is not negligible; moreover, the influence of the rudder angle and the propeller power output on the state of the unmanned boat is continuous, and the action at the previous moment influences the state at the next moment, so this information needs to be included in the state. The propeller output power of the unmanned boat is Pu_t and the rudder angle output is Ang_t, so the total current power output is F_t = (Pu_t, Ang_t); F_{t-1} is therefore incorporated into the state information.
The finally obtained current state information is S_t = (G_t, Δθ_t, D_t, F_{t-1}).
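As an illustration only, the raw state S_t = (G_t, Δθ_t, D_t, F_{t-1}) could be flattened into a single vector as sketched below; the field names are assumptions made for the example, not the patent's own data structures.

```python
import numpy as np

def build_raw_state(pos_xy, dtheta_deg, obstacle_ranges, prev_thrust, prev_rudder_deg):
    """Assemble S_t = (G_t, delta_theta_t, D_t, F_{t-1}) as one flat float vector.

    pos_xy           -- (x_t, y_t) in the frame centred on the current track point
    dtheta_deg       -- angle to the next track direction, in (-180, 180]
    obstacle_ranges  -- readings D_t of the obstacle-avoidance sensor
    prev_thrust, prev_rudder_deg -- propeller output and rudder angle at t-1, i.e. F_{t-1}
    """
    return np.concatenate([
        np.asarray(pos_xy, dtype=np.float32),
        np.array([dtheta_deg], dtype=np.float32),
        np.asarray(obstacle_ranges, dtype=np.float32),
        np.array([prev_thrust, prev_rudder_deg], dtype=np.float32),
    ])
```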
(3) The difference and integral quantities of the state information are introduced to eliminate the influence of the large hysteresis of the unmanned boat. In practice the data are sampled discretely on the time axis, so in the discrete system the differential and integral quantities are replaced by difference and delay quantities. The state information before preprocessing is S_t, and the preprocessed state information is S_t′.
The difference quantity is used to eliminate the influence of the large inertia of the unmanned boat: the rudder angle and the propeller action directly influence the acceleration of the boat. The change of the boat's speed conforms to the Markov property, whereas the change of its position does not: the position at the next moment is influenced not only by the current output power but also by the current speed, so the speed is also listed as part of the state information. The information of the unmanned boat related to distance or heading, together with its difference, should be introduced into the state information.
The delay quantity is used to eliminate the influence of the time difference between making a decision and the action taking effect; the delayed states of the previous λ moments are introduced into the state space. Let the actual delay be τ and let T be the sampling interval of the unmanned boat system; λ is set to satisfy λT > τ. During training, the network weights corresponding to the state information at the moment closest to the actual delay rise quickly, while the weights on state information with no corresponding effect decay quickly towards 0, because their actions and outcomes have little or no correlation. In this way the delay of the unmanned boat from decision to action is handled adaptively.
Considering that the difference quantity can be represented linearly by the delay quantities, that the influence of the delay quantities on the state transition of the unmanned boat is reflected in the weights of the neural network, and in order to simplify the deep neural network, the state preprocessing is finally simplified to:
S_t′ = (S_t, S_{t-1}, S_{t-2}, …, S_{t-λ})
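A minimal sketch of this preprocessing, assuming the raw states are fixed-length vectors: the current state is stacked with the previous λ states, with λ chosen so that λT exceeds the actual delay τ as described above.

```python
from collections import deque
import numpy as np

class StatePreprocessor:
    """Form S_t' = (S_t, S_{t-1}, ..., S_{t-lambda}) from successive raw states."""

    def __init__(self, lam, state_dim):
        self.lam = lam
        # pre-fill with zeros so S_t' has a fixed length from the first sample onwards
        self.history = deque([np.zeros(state_dim, dtype=np.float32)] * lam, maxlen=lam)

    def __call__(self, raw_state):
        raw_state = np.asarray(raw_state, dtype=np.float32)
        stacked = np.concatenate([raw_state, *self.history])  # S_t first, then S_{t-1} ... S_{t-lambda}
        self.history.appendleft(raw_state)                     # becomes S_{t-1} at the next call
        return stacked
```

For example, with a sampling interval of T = 0.5 s and an actual delay of roughly τ = 1.5 s (values assumed only for illustration), λ = 4 would satisfy λT > τ.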
(4) The system is applied to an unmanned surface vehicle with two propellers and a single rudder. The two propellers are controlled by one signal and have the same power output. To make it convenient for the control board to control the unmanned boat, a discrete action space is set: the propeller output thrust Pu_t is divided into 10 gears from zero thrust to maximum thrust, and the rudder angle Ang_t ranges from -60° to 60° with a resolution of 5°, giving 25 angles. The action is A_t = (Pu_t, Ang_t).
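The 250 discrete actions described above (10 thrust gears × 25 rudder angles at 5° steps) can be enumerated as in the sketch below; the normalisation of the thrust gears to the range 0 to 1 is an assumption made for the example.

```python
import numpy as np

N_THRUST_GEARS = 10
THRUSTS = np.linspace(0.0, 1.0, N_THRUST_GEARS)   # gears from zero thrust to maximum thrust
RUDDER_ANGLES = np.arange(-60, 61, 5)              # -60 deg ... 60 deg in 5 deg steps -> 25 angles

# every action A_t = (Pu_t, Ang_t); a DQN-style network outputs one index into this table
ACTION_TABLE = [(float(pu), int(ang)) for pu in THRUSTS for ang in RUDDER_ANGLES]
assert len(ACTION_TABLE) == 250
```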
(5) Reward setting: in order to achieve the training objective, the reward function is set in detail:
r = k·r_v·r_y + r_s + r_z
each component is explained separately below, where the individual letters a, b, c, d, g, h, k are all constants.
r_v is the speed reward for moving in the direction approaching the current target track point. The horizontal distance between the unmanned boat and the target track is x_t, with x_t ≥ 0. (The expressions for the approach speed and for r_v are given only as equation images in the original and are not reproduced here.)
r_y is the track-keeping reward: the more accurately the unmanned boat holds the track line, the larger the reward. The vertical distance between the unmanned boat and the target track is y_t (y_t ≥ 0). (The expression for r_y is given only as an equation image in the original.)
r_s is the position reward: the closer the unmanned boat is to the target position, i.e. the smaller its distance d_t to the target track point, the larger the reward. (The expressions for d_t and for the piecewise form of r_s are given only as equation images in the original.)
As soon as the unmanned boat comes within the range threshold d of the target track point, its current track point is updated to the next track point, so c/d_t in the above formula does not tend to infinity. However, since the unmanned boat may be very close to a track point when it starts or finishes sailing, the piecewise function above limits the maximum value of the position reward to prevent the boat from obtaining an unreasonably large reward.
r_z is the obstacle-avoidance reward: the unmanned boat obtains information about obstacles in front of it through the obstacle-avoidance sensor. Based on the magnitude of its sailing speed, a dynamic safety distance gv_d is set, and the unmanned boat receives a negative reward when its distance to an obstacle is less than this safety distance. (The expressions for the speed magnitude and for r_z are given only as equation images in the original.)
The final reward function is r = k·r_v·r_y + r_s + r_z. The terms r_v and r_y are multiplied rather than added because approaching the track point and keeping the track must be achieved simultaneously; if the two rewards were simply added, the unmanned boat could still receive an unreasonable positive reward by holding the track while no longer moving forward.
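The component definitions above appear only as equation images, so the sketch below merely wires already-computed component values into the published combination r = k·r_v·r_y + r_s + r_z; the components themselves are passed in as arguments rather than reimplemented.

```python
def total_reward(r_v, r_y, r_s, r_z, k=1.0):
    """Combine the reward components as r = k * r_v * r_y + r_s + r_z.

    r_v -- speed reward toward the current target track point
    r_y -- track-keeping reward (larger for smaller cross-track distance y_t)
    r_s -- position reward (capped by the piecewise rule described above)
    r_z -- obstacle-avoidance reward (negative inside the dynamic safety distance)
    k   -- constant weight on the coupled speed/track-keeping term
    """
    # r_v and r_y are multiplied rather than added, so the boat cannot keep
    # collecting a positive reward by holding the track while no longer moving
    # toward the track point.
    return k * r_v * r_y + r_s + r_z
```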
(6) Two deep neural networks with the same structure are set: the decision network Q and the target network Q′. The specific updating process is shown in fig. 1 and the data flow in fig. 2. After the environmental information is collected, the decision network Q selects the action to be executed by the unmanned boat. Each time the decision network is updated for an action, the error function for the update is derived from the target network Q′. The target network Q′ cannot be updated at every step, otherwise the target would keep changing, which is unfavourable for the convergence of the parameters; therefore the constant n is set, and the target network Q′ is updated once every time the decision network Q has been updated n times.
(7) Each record in the experience pool contains the preprocessed state information, the reward, the action information and the next state information, i.e. (S_t′, S_{t+1}′, R_t′, A_t). The reward is also preprocessed data; the reason for preprocessing is the same as for the state preprocessing and is not repeated.
The data actually stored by the experience pool should also include the unique number N of the piece of data, the sampling probability level P, and the number of times M the data was sampled.
Each piece of data in the experience pool therefore has the format (N, P, M, S_t′, S_{t+1}′, R_t′, A_t).
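One possible in-memory representation of such a record is sketched below; the field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Record:
    """One experience-pool record in the format (N, P, M, S_t', S_{t+1}', R_t', A_t)."""
    N: int          # unique number of the record
    P: int          # sampling probability level (3, 2 or 1)
    M: int          # how many times the record has been sampled so far
    s: Any          # preprocessed state S_t'
    s_next: Any     # preprocessed next state S_{t+1}'
    r: float        # preprocessed reward R_t'
    a: int          # action A_t (e.g. an index into the discrete action table)
```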
(8) The data in the experience pool are divided into three sampling levels; data with a higher sampling level is sampled with a higher probability.
Newly stored data initially has sampling level three, to ensure that the newest data put into the experience pool is used as soon as possible. After data with sampling level three has been sampled three times, its level drops to two; after data with sampling level two has been sampled five times, its level drops to one. Ten pieces of data are sampled for each update. This arrangement keeps most of the data in the experience pool at level one. The sampling levels improve the efficiency of data use and accelerate convergence.
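A sketch of this three-level sampling scheme, reusing the hypothetical Record class from (7) above. The relative weights of the three levels, and the interpretation that the five samples of a level-two record are counted after its demotion, are assumptions: the text only states that higher levels are sampled with higher probability.

```python
import random

class TieredReplayPool:
    """Experience pool with three sampling levels; new records enter at level 3."""

    LEVEL_WEIGHT = {3: 4.0, 2: 2.0, 1: 1.0}   # assumed relative sampling weights

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.records = []
        self.next_id = 0

    def add(self, s, s_next, reward, action):
        if len(self.records) >= self.capacity:
            self.records.pop(0)                # drop the oldest record when full
        self.records.append(Record(self.next_id, 3, 0, s, s_next, reward, action))
        self.next_id += 1

    def sample(self, m=10):
        weights = [self.LEVEL_WEIGHT[rec.P] for rec in self.records]
        batch = random.choices(self.records, weights=weights, k=m)
        for rec in batch:
            rec.M += 1
            if rec.P == 3 and rec.M >= 3:      # level 3 -> level 2 after three samples
                rec.P = 2
            elif rec.P == 2 and rec.M >= 8:    # level 2 -> level 1 after five further samples
                rec.P = 1
        return batch
```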
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (4)

1. An unmanned ship track control method based on deep reinforcement learning, characterized by comprising the following steps:
Step one: initializing the network parameters of the decision network Q and the target network Q';
Step two: obtaining the current state S_t of the unmanned boat, comprising the position information and speed information at the current moment, the data of the obstacle-avoidance sensor carried by the unmanned boat, and the rudder angle position and propeller output power at the previous moment;
Step three: preprocessing the state information of the unmanned boat: to account for the large inertia of the boat, difference quantities of the length and angle information are introduced into the state information; to account for the hysteresis of the boat, integral (delayed) quantities of the state information are introduced to form the state S_t′, where S_t′ = (S_t, S_{t-1}, S_{t-2}, …, S_{t-λ});
Step four: will be state S' t Substituting into the decision network Q and obtaining the action ac and the reward r according to the strategy pi (ac | s),
the reward function is:
r = k·r_v·r_y + r_s + r_z
wherein: r_v is the speed reward for moving in the direction approaching the current target track point; the horizontal distance between the unmanned boat and the target track is x_t, with x_t ≥ 0 (the expressions for the approach speed and for r_v are given only as equation images in the original);
r_y is the track-keeping reward: the more accurately the unmanned boat holds the track line, the larger the reward; the vertical distance between the unmanned boat and the target track is y_t, with y_t ≥ 0 (the expression for r_y is given only as an equation image in the original);
r_s is the position reward: the closer the unmanned boat is to the target position, i.e. the smaller its distance to the target track point, the larger the reward (the expressions for the distance and for the piecewise form of r_s are given only as equation images in the original);
the current track point of the unmanned boat is updated to the next track point as soon as the unmanned boat comes within the range threshold d of the target track point;
r_z is the obstacle-avoidance reward: the unmanned boat obtains information about obstacles in front of it through the obstacle-avoidance sensor; based on the magnitude of its sailing speed, a dynamic safety distance gv_d is set, and the unmanned boat receives a negative reward when its distance to an obstacle is less than this safety distance (the expressions for the speed magnitude and for r_z are given only as equation images in the original);
in the above formula, the letters a, b, c, d, g, h and k are constants;
Step five: executing the action, entering the next state S_{t+1}, and preprocessing it to obtain the state S_{t+1}′;
Step six: will (S) t ′,S′ t+1 Ac, r) as a piece of data together with the sampling priority is stored in an experience pool;
Step seven: sampling m pieces of data with the sampling priority as the basis of the sampling probability and feeding them into the target network to obtain the loss function ω;
Step eight: updating the decision network Q using the loss function ω;
Step nine: if i ≥ n, updating the target network Q′ once with the parameters of the decision network Q and setting i = 0, where i is the number of updates of the decision network Q and n is a preset constant;
Step ten: checking whether the training termination condition is reached; if so, ending the training, otherwise jumping to step two.
2. The unmanned ship track control method based on deep reinforcement learning of claim 1, characterized in that: in step two, the actuation information of the rudder angle and the propeller output power at the previous moment is also used as a part of the current state information.
3. The unmanned ship track control method based on deep reinforcement learning of claim 1, characterized in that: in step three, the state S_t′ is input into the state-action value function network, so that the large-hysteresis system, which does not satisfy the Markov property, can also satisfy the Markov property to a certain extent.
4. The unmanned ship track control method based on deep reinforcement learning of claim 1, characterized in that: in step two, the probability with which the data used to train the neural network is sampled is dynamically adjusted, so that the newest data can be used as early as possible while all data are used evenly.
CN202011353012.4A 2020-11-26 2020-11-26 Unmanned ship track control method based on deep reinforcement learning Active CN112540614B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011353012.4A CN112540614B (en) 2020-11-26 2020-11-26 Unmanned ship track control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011353012.4A CN112540614B (en) 2020-11-26 2020-11-26 Unmanned ship track control method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN112540614A CN112540614A (en) 2021-03-23
CN112540614B true CN112540614B (en) 2022-10-25

Family

ID=75016863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011353012.4A Active CN112540614B (en) 2020-11-26 2020-11-26 Unmanned ship track control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN112540614B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114839884B (en) * 2022-07-05 2022-09-30 山东大学 Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN115657683B (en) * 2022-11-14 2023-05-02 中国电子科技集团公司第十研究所 Unmanned cable-free submersible real-time obstacle avoidance method capable of being used for inspection operation task

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109765916A (en) * 2019-03-26 2019-05-17 武汉欣海远航科技研发有限公司 A kind of unmanned surface vehicle path following control device design method
CN110109355A (en) * 2019-04-29 2019-08-09 山东科技大学 A kind of unmanned boat unusual service condition self-healing control method based on intensified learning
CN110658829A (en) * 2019-10-30 2020-01-07 武汉理工大学 Intelligent collision avoidance method for unmanned surface vehicle based on deep reinforcement learning
WO2020056299A1 (en) * 2018-09-14 2020-03-19 Google Llc Deep reinforcement learning-based techniques for end to end robot navigation
CN111880535A (en) * 2020-07-23 2020-11-03 上海交通大学 Unmanned ship hybrid sensing autonomous obstacle avoidance method and system based on reinforcement learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020056299A1 (en) * 2018-09-14 2020-03-19 Google Llc Deep reinforcement learning-based techniques for end to end robot navigation
CN109765916A (en) * 2019-03-26 2019-05-17 武汉欣海远航科技研发有限公司 A kind of unmanned surface vehicle path following control device design method
CN110109355A (en) * 2019-04-29 2019-08-09 山东科技大学 A kind of unmanned boat unusual service condition self-healing control method based on intensified learning
CN110658829A (en) * 2019-10-30 2020-01-07 武汉理工大学 Intelligent collision avoidance method for unmanned surface vehicle based on deep reinforcement learning
CN111880535A (en) * 2020-07-23 2020-11-03 上海交通大学 Unmanned ship hybrid sensing autonomous obstacle avoidance method and system based on reinforcement learning

Also Published As

Publication number Publication date
CN112540614A (en) 2021-03-23

Similar Documents

Publication Publication Date Title
CN111694365B (en) Unmanned ship formation path tracking method based on deep reinforcement learning
CN111061277B (en) Unmanned vehicle global path planning method and device
CN108820157B (en) Intelligent ship collision avoidance method based on reinforcement learning
CN111667513A (en) Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
CN113095481B (en) Air combat maneuver method based on parallel self-game
CN112540614B (en) Unmanned ship track control method based on deep reinforcement learning
CN111483468B (en) Unmanned vehicle lane change decision-making method and system based on confrontation and imitation learning
CN112100917B (en) Expert countermeasure system-based intelligent ship collision avoidance simulation test system and method
CN110658829A (en) Intelligent collision avoidance method for unmanned surface vehicle based on deep reinforcement learning
CN112286218B (en) Aircraft large-attack-angle rock-and-roll suppression method based on depth certainty strategy gradient
CN111260027B (en) Intelligent agent automatic decision-making method based on reinforcement learning
CN112433525A (en) Mobile robot navigation method based on simulation learning and deep reinforcement learning
CN112180950B (en) Intelligent ship autonomous collision avoidance and path planning method based on reinforcement learning
CN109145451B (en) Motion behavior identification and track estimation method for high-speed gliding aircraft
CN115033022A (en) DDPG unmanned aerial vehicle landing method based on expert experience and oriented to mobile platform
CN114967721B (en) Unmanned aerial vehicle self-service path planning and obstacle avoidance strategy method based on DQ-CapsNet
CN113110546A (en) Unmanned aerial vehicle autonomous flight control method based on offline reinforcement learning
CN114859910A (en) Unmanned ship path following system and method based on deep reinforcement learning
CN113268074A (en) Unmanned aerial vehicle flight path planning method based on joint optimization
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN112651374A (en) Future trajectory prediction method based on social information and automatic driving system
CN114371729B (en) Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback
CN115373415A (en) Unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning
CN115933712A (en) Bionic fish leader-follower formation control method based on deep reinforcement learning
CN114997048A (en) Automatic driving vehicle lane keeping method based on TD3 algorithm improved by exploration strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20210323

Assignee: CSIC PRIDE (NANJING) ATMOSPHERE MARINE INFORMATION SYSTEM Co.,Ltd.

Assignor: JIANGSU University OF SCIENCE AND TECHNOLOGY

Contract record no.: X2022320000094

Denomination of invention: A path control method for unmanned craft based on deep reinforcement learning

License type: Common License

Record date: 20220609

GR01 Patent grant