CN117111594B - Self-adaptive track control method for unmanned surface vessel - Google Patents

Self-adaptive track control method for unmanned surface vessel

Info

Publication number
CN117111594B
CN117111594B CN202310530731.6A CN202310530731A CN117111594B
Authority
CN
China
Prior art keywords
network
unmanned
surface vessel
gate
unmanned surface
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310530731.6A
Other languages
Chinese (zh)
Other versions
CN117111594A (en)
Inventor
张卫东
林源
陈树康
仓乃梦
曹刚
贾泽华
吴迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hainan University
Original Assignee
Hainan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hainan University filed Critical Hainan University
Priority to CN202310530731.6A priority Critical patent/CN117111594B/en
Publication of CN117111594A publication Critical patent/CN117111594A/en
Application granted granted Critical
Publication of CN117111594B publication Critical patent/CN117111594B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to a self-adaptive track control method for an unmanned surface vessel, comprising the following steps. Aiming at the time-varying and highly nonlinear character of unmanned surface vessel operating data in complex environments, the nonlinear features of the navigation data are learned based on the Peephole LSTM method, which introduces a constant error carousel, and the temporal regularities among the data are mined to form the state space of the unmanned surface vessel. Real-time self-adaptive track control of the unmanned surface vessel is then performed based on the deep reinforcement learning DDPG algorithm: a double-layer network architecture is constructed and the action strategy of the network is adjusted and optimized by maximizing the global reward. The experience replay technique stores the samples at each moment in a replay buffer, and non-uniform mini-batch sampling reduces the correlation between samples. The parameters of the target network are updated periodically by iteratively computing the loss function. Compared with the prior art, the invention improves the sailing efficiency and safety of the unmanned surface vessel.

Description

Self-adaptive track control method for unmanned surface vessel
Technical Field
The invention relates to the technical field of intelligent control, in particular to self-adaptive track control of an unmanned surface vessel.
Background
With the rapid development of autonomous driving technology, the unmanned surface vessel, as a new type of waterborne platform, plays an important role in environmental monitoring, reconnaissance and water patrol. When facing complicated and changeable sea conditions, however, how to ensure that the unmanned ship carries out its sailing tasks effectively and to improve its motion control performance has drawn wide attention from researchers at home and abroad. Current track control methods are mainly based on feedback control and model predictive control; for the time-varying data generated while an unmanned surface vessel sails, their model structures are often too complex, require a large amount of online computation and have difficulty outputting a track control strategy in real time, which reduces the task execution efficiency of the unmanned surface vessel and increases the risk factors during its navigation. Therefore, for complex navigation environments and application scenarios, a self-adaptive track control method for the unmanned surface vessel is needed to ensure its efficiency and safety.
Disclosure of Invention
In view of the above, the present invention aims to provide a self-adaptive track control method for an unmanned surface vessel that dynamically controls the track of the unmanned surface vessel under complex sea conditions, so as to improve its sailing efficiency and safety.
Based on the above purpose, the invention provides a self-adaptive track control method for an unmanned surface vessel, comprising the following steps:
s1, based on the Peephole LSTM method, taking the average endurance mileage, average endurance time and average sailing speed of the unmanned ship as the input of the Peephole LSTM at the current moment, and obtaining the complex temporal data sequence generated while the unmanned ship runs by learning the nonlinear characteristics of its navigation data;
s2, taking the obtained complex temporal data sequence generated while the unmanned ship runs as the state space of the deep reinforcement learning DDPG algorithm, and setting the action space of the unmanned ship as (V, β), wherein V is the speed of the unmanned ship and β is its rudder angle value; training is performed based on the DDPG algorithm and the self-adaptive track control strategy of the unmanned surface vessel is output in real time, the DDPG network comprising an Actor network and a Critic network;
the step S2 specifically comprises the following steps:
s21, initializing parameters of an Actor and a Critic network in a training starting stage, outputting an unmanned ship control strategy by a prediction network Actor based on a state space at the current moment, and taking the output action value as input of the prediction network Critic;
s22, the Critic network evaluates the action a_t output by the Actor network in the current state S_t and obtains the reward function r_t, the unmanned ship state transitions to S_{t+1}, and the value function Q at the current moment is output;
s23, the Actor network adjusts and optimizes the action strategy of the Actor network according to the value function output by the Critic network, and updates the network parameters.
S24, updating network parameters of the target network based on a soft update mode to finish DDPG network training;
s3, controlling navigation of the unmanned surface vessel in real time based on an optimal track control strategy of the unmanned surface vessel output by the DDPG network.
Preferably, in step S1, the Peephole LSTM method further comprises the following steps:
adding peephole connections to all gates in the network;
inputting the data to the forget gate, and learning the nonlinear characteristics of the unmanned ship navigation data in the low-level neurons;
inputting the result of the forget gate at the current moment into the input gate;
outputting the result of the input gate to the output gate, and generating the network output in the last layer of neurons.
Preferably, in the Peephole LSTM method, the forget gate update formula is:

f_t = σ(W_f · [C_{t-1}, h_{t-1}, x_t^k] + b_f)

wherein f_t is the forget gate at time t, σ denotes the sigmoid activation function, W_f represents the input-layer weight of the forget gate, C_{t-1} and h_{t-1} respectively denote the cell state and the output at time t-1, x_t^k represents the k unmanned ship indexes input at time t, and b_f represents the bias coefficient of the forget gate;

the result of the forget gate at time t is input to the input gate i; the input gate has a structure similar to that of the forget gate, and its update formula is:

i_t = σ(W_i · [C_{t-1}, h_{t-1}, x_t^k] + b_i)

wherein i_t is the input gate at time t, W_i is the input-layer weight of the input gate, and b_i represents the bias coefficient of the input gate; the Peephole LSTM adopts a multi-layer neuron structure, processes part of the computation at each layer, transmits the hidden state of the neurons at the current moment to the Peephole LSTM layer at the next moment, and generates the network output in the last layer of neurons;

the output gate update formula is:

o_t = σ(W_o · [C_t, h_{t-1}, x_t^k] + b_o)

wherein o_t is the output gate at time t, W_o represents the input-layer weight of the output gate, and b_o represents the bias coefficient of the output gate.
Preferably, the DDPG network uses the average task execution duration as the reward function of the DDPG, the average task execution duration T being calculated as:

T = (1/N) · Σ_{n=1}^{N} T_n

wherein n indexes the tasks executed by the unmanned ship, N is the total number of tasks, and T_n is the length of time the unmanned ship takes to perform the n-th task.
Preferably, the DDPG network adopts a deterministic strategy, and random noise is added to the prediction network during the training stage, so that the Actor network retains a certain exploration capability when outputting deterministic actions.
Preferably, the DDPG network uses the experience replay technique to store the state S_t at each moment, the executed action value, the obtained reward function and the state at the next moment in a replay buffer; non-uniform mini-batch sampling is adopted when sampling, reducing the correlation between samples.
Preferably, the DDPG network performs iterative calculation through the prediction network and the target network, and calculates the loss function L based on the mean square error.
Preferably, the DDPG network updates the network parameters of the Critic in the prediction network based on back-propagation of the neural network, and calculates the gradient of the prediction network based on stochastic gradient descent.
Preferably, the DDPG network uses the parameters θ^μ and θ^Q of the prediction network Actor and Critic to update the corresponding parameters in the target network, respectively.
The invention has the beneficial effects that:
A. The invention processes the nonlinear complex data generated while the unmanned surface vessel runs based on the Peephole LSTM algorithm; by arranging a multi-layer neural network structure, the coupling relations among different indexes are analysed and the temporal regularities among the data are mined, so that the state changes of the unmanned surface vessel at different moments are described and a state space is formed.
B. Based on the DDPG algorithm, the invention updates the network parameters step by step by setting up a prediction network and a target network, preventing the neural network from over-fitting; through extensive offline training, the unmanned surface vessel can adapt to navigation in various complex environments and output the current track control strategy in real time, improving its sailing efficiency and safety.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings described below are only those of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a method for adaptive track control according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a real-time adaptive track control flow of the deep reinforcement learning DDPG algorithm on an unmanned surface vessel according to an embodiment of the present invention.
Detailed Description
The present invention will be further described in detail with reference to specific embodiments in order to make the objects, technical solutions and advantages of the present invention more apparent.
It is to be noted that unless otherwise defined, technical or scientific terms used herein should be taken in a general sense as understood by one of ordinary skill in the art to which the present invention belongs. The terms "first," "second," and the like, as used herein, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
The embodiment provides a self-adaptive track control method of an unmanned surface vessel, and reference is made to fig. 1.
The self-adaptive track control of the unmanned surface vessel comprises the following steps:
s1, considering that the operating data of an unmanned surface vessel in a complex environment are time-varying and highly nonlinear, the state changes of the unmanned surface vessel at different moments cannot be directly described by a state transition function. The average endurance mileage, average endurance time and average sailing speed of the unmanned ship are taken as the input x_t^k of the Peephole LSTM at the current moment, where k is the number of unmanned ship operating-data indexes.
The Peephole LSTM long short-term memory neural network is widely used to process complex sequence data; by introducing a constant error carousel, it overcomes the vanishing-gradient and exploding-gradient problems of RNNs trained on long sequences. A deep neural network has better generalization ability than a shallow one. By adding peephole connections to all gates in the Peephole LSTM network, the gates can still observe the current cell state even when the output gate is closed. The nonlinear characteristics of the unmanned ship navigation data are learned in the low-level neurons and then combined in the deep-level neurons, mining the temporal regularities of the data. The forget gate update formula is as follows:
f_t = σ(W_f · [C_{t-1}, h_{t-1}, x_t^k] + b_f)

wherein f_t is the forget gate at time t, σ denotes the sigmoid activation function, W_f represents the input-layer weight of the forget gate, C_{t-1} and h_{t-1} respectively denote the cell state and the output at time t-1, x_t^k represents the k unmanned ship indexes input at time t, and b_f represents the bias coefficient of the forget gate. The result of the forget gate at time t is input to the input gate i; the input gate has a structure similar to that of the forget gate, and its update formula is as follows:

i_t = σ(W_i · [C_{t-1}, h_{t-1}, x_t^k] + b_i)

wherein i_t is the input gate at time t, W_i is the input-layer weight of the input gate, and b_i represents the bias coefficient of the input gate. The Peephole LSTM adopts a multi-layer neuron structure, processes part of the computation at each layer, transmits the hidden state of the neurons at the current moment to the Peephole LSTM layer at the next moment, and generates the network output in the last layer of neurons. The output gate update formula is as follows:

o_t = σ(W_o · [C_t, h_{t-1}, x_t^k] + b_o)

wherein o_t is the output gate at time t, W_o represents the input-layer weight of the output gate, and b_o represents the bias coefficient of the output gate. Based on the Peephole LSTM model, the complex temporal data sequence generated while the unmanned ship runs is effectively analysed and used as the state space of the deep reinforcement learning DDPG algorithm.
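For concreteness, one forward step of such a peephole LSTM cell can be sketched as follows (Python/NumPy). This is a minimal illustration only: the hidden size, the weight initialization and the stacking of the three navigation indexes into the input x_t are assumptions of the example, not details fixed by the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class PeepholeLSTMCell:
    """One layer of a peephole LSTM: every gate also sees the cell state C
    (the peephole connections), so the gates can observe the cell state
    even while the output gate is closed."""
    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        d = n_in + 2 * n_hidden  # gate input: [C_{t-1}, h_{t-1}, x_t]
        self.W_f = rng.normal(0, 0.1, (n_hidden, d)); self.b_f = np.zeros(n_hidden)
        self.W_i = rng.normal(0, 0.1, (n_hidden, d)); self.b_i = np.zeros(n_hidden)
        self.W_c = rng.normal(0, 0.1, (n_hidden, n_in + n_hidden)); self.b_c = np.zeros(n_hidden)
        self.W_o = rng.normal(0, 0.1, (n_hidden, d)); self.b_o = np.zeros(n_hidden)

    def step(self, x_t, h_prev, C_prev):
        z = np.concatenate([C_prev, h_prev, x_t])
        f_t = sigmoid(self.W_f @ z + self.b_f)                  # forget gate
        i_t = sigmoid(self.W_i @ z + self.b_i)                  # input gate
        c_hat = np.tanh(self.W_c @ np.concatenate([h_prev, x_t]) + self.b_c)
        C_t = f_t * C_prev + i_t * c_hat                        # new cell state
        z_o = np.concatenate([C_t, h_prev, x_t])                # peephole sees C_t
        o_t = sigmoid(self.W_o @ z_o + self.b_o)                # output gate
        h_t = o_t * np.tanh(C_t)
        return h_t, C_t

# usage: the k = 3 navigation indexes (endurance mileage, endurance time,
# sailing speed) as one input vector per moment (values hypothetical)
cell = PeepholeLSTMCell(n_in=3, n_hidden=8)
h, C = np.zeros(8), np.zeros(8)
h, C = cell.step(np.array([12.5, 0.8, 3.2]), h, C)
```

Stacking such cells, with the h_t of one layer fed as the x_t of the next, reproduces the multi-layer structure described above, the last layer producing the network output.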
S2, real-time self-adaptive track control of the unmanned surface vessel is performed based on the deep reinforcement learning DDPG algorithm. DDPG extends deep reinforcement learning beyond discrete action spaces, so that the unmanned ship can output the optimal control strategy of the current stage over a continuous action space. The action space of the unmanned ship is set as (V, β), wherein V is the navigational speed of the unmanned ship and β is its rudder angle value. Through a large amount of offline training, the unmanned ship can output the optimal track control strategy of the current stage in a complex navigation environment.
The DDPG adopts a prediction network and a target network, which improves the stability of the algorithm and the convergence performance of the network. The average task execution duration is used as the reward function of the DDPG to improve the task execution efficiency of the unmanned ship and ensure that tasks are carried out effectively. The average task execution duration T is calculated as follows:

T = (1/N) · Σ_{n=1}^{N} T_n

wherein n indexes the tasks executed by the unmanned ship, N is the total number of tasks, and T_n is the length of time the unmanned ship takes to perform the n-th task.
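As a sketch, the reward computation is a simple average; note that the patent does not fix a sign convention, so the negation below (shorter average duration scores higher) is an assumption of the example.

```python
def average_task_duration(durations):
    """T = (1/N) * sum(T_n): mean time per executed task."""
    return sum(durations) / len(durations)

def reward(durations):
    # assumption: faster task completion should be rewarded more
    return -average_task_duration(durations)
```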
S21, DDPG adopts a deterministic strategy, and random noise is added to the prediction network during the training stage, so that the Actor network retains a certain exploration capability when outputting deterministic actions. At the beginning of training, the parameters of the Actor and Critic networks are initialized. The prediction network Actor outputs the unmanned ship control strategy based on the state space at the current moment, and the output action value is taken as the input of the prediction network Critic. The Critic network evaluates the action a_t output by the Actor network in the current state S_t and obtains the reward function r_t, the unmanned ship state transitions to S_{t+1}, and the value function Q at the current moment is output:

Q(s_t, a_t) = r(s_t, a_t) + γ · Q(s_{t+1}, μ(s_{t+1}))

wherein Q(s_t, a_t) is the value function obtained when the prediction network uses action a_t in state s_t; this is the Bellman equation. r(s_t, a_t) is the reward value obtained when the unmanned ship executes action a_t in state s_t, γ is the discount coefficient, and Q(s_{t+1}, μ(s_{t+1})) is the value function obtained when the unmanned ship executes the action strategy μ(s_{t+1}) in state s_{t+1}.
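A minimal PyTorch sketch of the prediction networks and the noisy deterministic action selection of S21 is given below. The layer sizes, the tanh output scaling and the Gaussian form of the exploration noise are illustrative assumptions; the patent only states that random noise is added.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the Peephole-LSTM state vector to a deterministic action (V, beta)."""
    def __init__(self, state_dim, action_dim=2, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),  # squash to [-1, 1]
        )
        self.max_action = max_action

    def forward(self, s):
        return self.max_action * self.net(s)

class Critic(nn.Module):
    """Estimates the value function Q(s, a) for a state-action pair."""
    def __init__(self, state_dim, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def select_action(actor, state, noise_std=0.1):
    """Deterministic action plus exploration noise, as in S21."""
    with torch.no_grad():
        a = actor(state)
    a = a + noise_std * torch.randn_like(a)  # random exploration noise
    return a.clamp(-actor.max_action, actor.max_action)
```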
S22, the Actor network adjusts and optimizes its action strategy according to the value function output by the Critic network, and updates the network parameters. Using the experience replay technique, the state S_t at each moment, the executed action value a_t, the obtained reward function r_t and the state S_{t+1} at the next moment are stored in a replay buffer; non-uniform mini-batch sampling is adopted when sampling, reducing the correlation between samples. The loss function L of the network is calculated based on the mean square error:

L = (1/N) · Σ_t [ r_t + γ · Q'(s_{t+1}, μ'(s_{t+1})) − Q(s_t, a_t) ]²

wherein N represents the number of samples drawn from the replay buffer, μ' represents the action policy used in the target network, and Q'(s_{t+1}, μ'(s_{t+1})) is the value function obtained when the target network uses the policy μ'(s_{t+1}) in state s_{t+1}.
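The replay buffer and the mean-squared loss of S22 can be sketched as follows. The recency-weighted sampling is one plausible reading of the patent's "non-uniform mini-batch sampling"; other weightings (e.g. TD-error-based priorities) would fit the description equally well.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

class ReplayBuffer:
    """Stores (S_t, a_t, r_t, S_{t+1}) tuples and draws mini-batches
    non-uniformly to reduce the correlation between samples."""
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        weights = [i + 1 for i in range(len(self.buf))]  # newer samples weighted higher
        batch = random.choices(self.buf, weights=weights, k=batch_size)
        s, a, r, s_next = zip(*batch)
        return (torch.stack(s), torch.stack(a),
                torch.tensor(r).unsqueeze(1), torch.stack(s_next))

def critic_loss(critic, target_actor, target_critic, batch, gamma=0.99):
    """L = mean squared TD error against the target networks."""
    s, a, r, s_next = batch
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    return F.mse_loss(critic(s, a), y)
```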
S23, the loss function is calculated iteratively, the network parameters of the Critic in the prediction network are updated based on back-propagation of the neural network, and the gradient of the prediction network is calculated with stochastic gradient descent:

∇_{θ^μ} J ≈ (1/N) · Σ_s ∇_a Q(s, a | θ^Q) |_{a=μ(s)} · ∇_{θ^μ} μ(s | θ^μ)

wherein ∇_{θ^μ} J is the gradient of the network, θ^μ denotes the parameters of the Actor in the prediction network, ∇_a Q(s, a | θ^Q) |_{a=μ(s)} is the gradient of the value function with respect to the action a = μ(s) taken in state s, and ∇_{θ^μ} μ(s | θ^μ) is the gradient of the strategy μ(s) with respect to the prediction network parameters θ^μ.
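In an automatic-differentiation framework, this deterministic policy gradient is obtained by back-propagating −Q(s, μ(s)) through the Critic into the Actor, as sketched below; the names follow the Actor/Critic sketch above, and the learning rate is an assumption.

```python
import torch.optim as optim

def actor_update(actor, critic, actor_opt, states):
    """One S23 step: ascend Q(s, mu(s)) w.r.t. the Actor parameters;
    loss.backward() realizes grad_a Q * grad_theta mu via the chain rule."""
    loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()

# usage (illustrative): actor_opt = optim.Adam(actor.parameters(), lr=1e-4)
```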
S24, the parameters θ^μ and θ^Q of the prediction network Actor and Critic are used to update the corresponding parameters in the target network, respectively. In order to avoid frequent updating of the network parameters, the parameters are updated in a soft-update manner so as to prevent the DDPG network from over-fitting:

θ^{Q'} ← η · θ^Q + (1 − η) · θ^{Q'}
θ^{μ'} ← η · θ^μ + (1 − η) · θ^{μ'}

wherein η is the update coefficient between the prediction network and the target network, θ^Q and θ^μ represent the parameters of the Critic and Actor in the prediction network, and θ^{μ'} and θ^{Q'} represent the parameters of the Actor and Critic in the target network, respectively.
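The soft update is a direct element-wise implementation of the two formulas above; only the value of η is an illustrative choice.

```python
def soft_update(target_net, pred_net, eta=0.005):
    """theta' <- eta * theta + (1 - eta) * theta', per parameter (S24)."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), pred_net.parameters()):
            p_t.mul_(1.0 - eta).add_(eta * p)

# applied to both networks after each training step:
# soft_update(target_critic, critic); soft_update(target_actor, actor)
```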
S3, controlling and optimizing navigation of the unmanned surface vessel in real time based on an optimal track control strategy of the unmanned surface vessel output by the DDPG network, so that the unmanned surface vessel can adapt to a complex navigation environment, and the task execution efficiency and the navigation safety are improved.
Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the invention (including the claims) is limited to these examples; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the invention, the steps may be implemented in any order and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.
The present invention is intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the present invention should be included in the scope of the present invention.

Claims (7)

1. A self-adaptive track control method for an unmanned surface vessel, characterized by comprising the following steps:
s1, based on the Peephole LSTM method, taking the average endurance mileage, average endurance time and average sailing speed of the unmanned ship as the input of the Peephole LSTM at the current moment, and obtaining the complex temporal data sequence generated while the unmanned ship runs by learning the nonlinear characteristics of its navigation data;
s2, taking the obtained complex temporal data sequence generated while the unmanned ship runs as the state space of the deep reinforcement learning DDPG algorithm, and setting the action space of the unmanned ship as (V, β), wherein V is the speed of the unmanned ship and β is its rudder angle value; training is performed based on the DDPG algorithm and the self-adaptive track control strategy of the unmanned surface vessel is output in real time, the DDPG network comprising an Actor network and a Critic network;
the step S2 specifically comprises the following steps:
s21, initializing parameters of an Actor and a Critic network in a training starting stage, outputting an unmanned ship control strategy by a prediction network Actor based on a state space at the current moment, and taking the output action value as input of the prediction network Critic;
s22, the Critic network evaluates the action a_t output by the Actor network in the current state S_t and obtains the reward function r_t, the unmanned ship state transitions to S_{t+1}, and the value function Q at the current moment is output;
s23, the Actor network adjusts and optimizes the action strategy of the Actor network according to a value function output by the Critic network, and updates network parameters;
s24, updating network parameters of the target network based on a soft update mode to finish DDPG network training;
s3, controlling navigation of the unmanned surface vessel in real time based on an optimal track control strategy of the unmanned surface vessel output by the DDPG network;
in step S1, the Peephole LSTM method further comprises the following steps:
adding peephole connections to all gates in the network;
inputting the data to the forget gate, and learning the nonlinear characteristics of the unmanned ship navigation data in the low-level neurons;
inputting the result of the forget gate at the current moment into the input gate;
outputting the result of the input gate to the output gate, and generating the network output in the last layer of neurons;
in the Peephole LSTM method, the forget gate update formula is:

f_t = σ(W_f · [C_{t-1}, h_{t-1}, x_t^k] + b_f)

wherein f_t is the forget gate at time t, σ denotes the sigmoid activation function, W_f represents the input-layer weight of the forget gate, C_{t-1} and h_{t-1} respectively denote the cell state and the output at time t-1, x_t^k represents the k unmanned ship indexes input at time t, and b_f represents the bias coefficient of the forget gate;

the result of the forget gate at time t is input to the input gate i; the input gate has a structure similar to that of the forget gate, and its update formula is:

i_t = σ(W_i · [C_{t-1}, h_{t-1}, x_t^k] + b_i)

wherein i_t is the input gate at time t, W_i is the input-layer weight of the input gate, and b_i represents the bias coefficient of the input gate; the Peephole LSTM adopts a multi-layer neuron structure, processes part of the computation at each layer, transmits the hidden state of the neurons at the current moment to the Peephole LSTM layer at the next moment, and generates the network output in the last layer of neurons;

the output gate update formula is:

o_t = σ(W_o · [C_t, h_{t-1}, x_t^k] + b_o)

wherein o_t is the output gate at time t, W_o represents the input-layer weight of the output gate, and b_o represents the bias coefficient of the output gate.
2. The adaptive track control method of the unmanned surface vessel according to claim 1, wherein the DDPG network uses the average task execution duration as the reward function of the DDPG, the average task execution duration T being calculated as:

T = (1/N) · Σ_{n=1}^{N} T_n

wherein n indexes the tasks executed by the unmanned ship, N is the total number of tasks, and T_n is the length of time the unmanned ship takes to perform the n-th task.
3. The self-adaptive track control method of the unmanned surface vessel according to claim 1, wherein the DDPG network adopts a deterministic strategy, and random noise is added into the predictive network in the training stage of the network, so that the Actor network has a certain exploration capacity when outputting deterministic actions.
4. The self-adaptive track control method of the unmanned surface vessel according to claim 1, wherein the DDPG network uses the experience replay technique to store the state S_t at each moment, the executed action value, the obtained reward function and the state at the next moment in a replay buffer, and non-uniform mini-batch sampling is adopted when sampling, reducing the correlation between samples.
5. The adaptive track control method of an unmanned surface vessel according to claim 1, wherein the DDPG network performs iterative calculation through the prediction network and the target network, and calculates the loss function L based on the mean square error.
6. The adaptive track control method of an unmanned surface vessel according to claim 5, wherein the DDPG network updates the network parameters of the Critic in the prediction network based on back-propagation of the neural network, and calculates the gradient of the prediction network based on stochastic gradient descent.
7. The method of claim 1, wherein the DDPG network uses the parameters θ^μ and θ^Q of the prediction network Actor and Critic to update the corresponding parameters in the target network, respectively.
CN202310530731.6A 2023-05-12 2023-05-12 Self-adaptive track control method for unmanned surface vessel Active CN117111594B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310530731.6A CN117111594B (en) 2023-05-12 2023-05-12 Self-adaptive track control method for unmanned surface vessel

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310530731.6A CN117111594B (en) 2023-05-12 2023-05-12 Self-adaptive track control method for unmanned surface vessel

Publications (2)

Publication Number Publication Date
CN117111594A CN117111594A (en) 2023-11-24
CN117111594B (en) 2024-04-12

Family

ID=88795429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310530731.6A Active CN117111594B (en) 2023-05-12 2023-05-12 Self-adaptive track control method for unmanned surface vessel

Country Status (1)

Country Link
CN (1) CN117111594B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117806311A (en) * 2023-11-30 2024-04-02 中船(北京)智能装备科技有限公司 Unmanned ship path planning method and device with multiple task points

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110658829A (en) * 2019-10-30 2020-01-07 武汉理工大学 Intelligent collision avoidance method for unmanned surface vehicle based on deep reinforcement learning
CN110782664A (en) * 2019-10-16 2020-02-11 北京航空航天大学 Running state monitoring method of intelligent vehicle road system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782664A (en) * 2019-10-16 2020-02-11 北京航空航天大学 Running state monitoring method of intelligent vehicle road system
CN110658829A (en) * 2019-10-30 2020-01-07 武汉理工大学 Intelligent collision avoidance method for unmanned surface vehicle based on deep reinforcement learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A Cooperative Control Method Based on Reinforcement Learning; Zhiwei Zhuang et al.; 2018 Chinese Automation Congress; 2019-01-24; full text *
Applications of Deep Reinforcement Learning in Communications and Networking: A Survey; Nguyen Cong Luong et al.; IEEE Communications Surveys & Tutorials; 2019-05-14; full text *
AUV path tracking with real-time obstacle avoidance via reinforcement learning under adaptive constraints; Chenming Zhang et al.; Ocean Engineering; 2022-08-15; full text *
DDoS attack detection method based on LSTM traffic prediction; Cheng Jieren et al.; Journal of Huazhong University of Science and Technology; 2019-04-12; full text *
Design of formation control algorithm for multiple underwater vehicles based on deep reinforcement learning; Yan Jing et al.; Control and Decision; 2023-04-19; full text *

Also Published As

Publication number Publication date
CN117111594A (en) 2023-11-24

Similar Documents

Publication Publication Date Title
CN108803321A (en) Autonomous Underwater Vehicle Trajectory Tracking Control method based on deeply study
CN106325071B (en) One kind being based on the adaptive tender course heading control method of event driven Generalized Prediction
CN113052372B (en) Dynamic AUV tracking path planning method based on deep reinforcement learning
CN109901403A (en) A kind of face autonomous underwater robot neural network S control method
CN117111594B (en) Self-adaptive track control method for unmanned surface vessel
Ma et al. Neural network model-based reinforcement learning control for auv 3-d path following
CN114839884B (en) Underwater vehicle bottom layer control method and system based on deep reinforcement learning
Jiang et al. Neural network based adaptive sliding mode tracking control of autonomous surface vehicles with input quantization and saturation
CN111176122A (en) Underwater robot parameter self-adaptive backstepping control method based on double BP neural network Q learning technology
Knudsen et al. Deep learning for station keeping of AUVs
CN117666355A (en) Flexible shaft-based vector propeller control system and method
Li et al. Parallel path following control of cyber-physical maritime autonomous surface ships based on deep neural predictor
CN114715331B (en) Floating ocean platform power positioning control method and system
CN116880191A (en) Intelligent control method of process industrial production system based on time sequence prediction
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
Zhao et al. Consciousness neural network for path tracking control of floating objects at sea
CN115453880A (en) Training method of generative model for state prediction based on antagonistic neural network
US20240046111A1 (en) Methods and apparatuses of determining for controlling a multi-agent reinforcement learning environment
Li et al. Robust model predictive ship heading control with event-triggered strategy
CN117111620B (en) Autonomous decision-making method for task allocation of heterogeneous unmanned system
Bande et al. Online model adaptation of autonomous underwater vehicles with LSTM networks
Khorasgani et al. Deep reinforcement learning with adjustments
Zhang et al. An on-line adaptive hybrid PID autopilot of ship heading control using auto-tuning BP & RBF neurons
CN118466227B (en) Electric propeller track tracking control method and system based on artificial intelligence
Yu et al. Vessel trajectory prediction based on modified LSTM with attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant