CN113176776A

CN113176776A - Unmanned ship weather self-adaptive obstacle avoidance method based on deep reinforcement learning

Info

Publication number: CN113176776A
Application number: CN202110235684.3A
Authority: CN
Inventors: 骆祥峰; 张瀚; 谢少荣; 陈雪
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2021-03-03
Filing date: 2021-03-03
Publication date: 2021-07-27
Anticipated expiration: 2041-03-03
Also published as: CN113176776B

Abstract

The invention discloses a weather self-adaptive obstacle avoidance method for an unmanned ship based on deep reinforcement learning, which comprises the following steps: constructing a deep reinforcement network based on a PPO algorithm; the method comprises the following steps of constructing a simulation environment for unmanned ship obstacle avoidance and an unmanned ship model, and defining the state space of the unmanned ship model, wherein the state space comprises the following steps: the method comprises the following steps that an environment image collected by an image sensor on an unmanned ship model and three-dimensional coordinate information of a preset target point are obtained; the motion space includes: the steering angle and the thrust of the unmanned ship model; designing a reward function based on the time sequence distance as an optimization basis; sampling sample data generated when the unmanned ship model interacts with a simulation environment in different weather by using a deep enhanced network; based on a PPO algorithm, training the depth strengthening network by using sample data to obtain an automatic obstacle avoidance model of the unmanned ship under different weathers. The method can sense weather changes in real time, dynamically select the pre-training obstacle avoidance model, and enable the unmanned ship model to adapt to different weathers.

Description

Unmanned ship weather self-adaptive obstacle avoidance method based on deep reinforcement learning

Technical Field

The invention relates to a weather adaptive obstacle avoidance method for an unmanned ship, in particular to a weather adaptive obstacle avoidance method for the unmanned ship based on deep reinforcement learning.

Background

With the increasing ocean development activities, unmanned boats serve as water surface intelligent task platforms and can carry various sensors to complete tasks such as environment monitoring, hydrological investigation, patrol, search and rescue and the like. It is one of the basic capabilities of unmanned boats to avoid ships, reefs, or other obstacles to reach a designated location in an optimal path during mission.

The unmanned ship guarantees navigation safety and realizes an intelligent autonomous obstacle avoidance technology without leaving. The autonomous obstacle avoidance of the unmanned ship means that the surrounding environment of the unmanned ship on the water surface is detected through the ship-borne sensing equipment, a smooth collision-free path is planned, and the unmanned ship can quickly and safely reach a task position. However, the conventional obstacle avoidance algorithm depends on exact environmental information, needs to artificially design a mathematical model of a scene, has high implementation complexity, and is difficult to balance implementation cost and solution effect. Especially in environments such as weather changes, it is difficult to cope with the complexity of the operating environment of the unmanned ship by relying on expert experience for modeling. The advanced reinforcement learning technology, an artificial intelligence field, provides a solution for the perception decision problem in a complex environment, and self-learns the obstacle avoidance strategy by interacting with the environment through a trial and error mechanism. However, the obstacle avoidance method based on deep reinforcement learning still has the problems of difficult setting of the reward function, high training calculation force requirement, large algorithm convergence difference of different tasks and the like. The method is applied to the unmanned ship obstacle avoidance task with changing weather, and due to the fact that the complexity of the environment is improved due to the increase of weather variables, the obstacle avoidance model based on deep reinforcement learning is difficult to converge or the convergence speed is low, and therefore the unmanned ship obstacle avoidance efficiency is low.

Disclosure of Invention

In order to solve the problems in the prior art, the invention aims to overcome the defects in the prior art and provide a weather adaptive obstacle avoidance method of an unmanned ship based on deep reinforcement learning, wherein a deep reinforcement network is constructed based on a PPO algorithm; constructing a simulation environment for unmanned ship obstacle avoidance and an unmanned ship model in a Unity3D simulation engine, and defining a state space and an action space of the unmanned ship model; designing a reward function based on a time sequence distance as an optimization updating basis of an obstacle avoidance strategy, and sampling sample data generated when the unmanned ship model interacts with a simulation environment in different weather by using a depth-enhanced network; training the depth strengthening network by using sample data based on a PPO algorithm to obtain automatic obstacle avoidance models of the unmanned ship under different weathers, and calling corresponding obstacle avoidance models to avoid obstacles when the unmanned ship perceives the different weathers. Discrete rewards are continuous by adding a reward function of time sequence distance information, and the problem of sparse rewards in the unmanned ship offshore obstacle avoidance task is solved; the unmanned ship model adapts to different weathers by using a mode of sensing weather changes in real time and dynamically switching the pre-training obstacle avoidance model, and the problem that the obstacle avoidance strategy is trained in the changed weathers and is difficult to converge is solved.

The embodiment of the invention provides a weather adaptive obstacle avoidance method for an unmanned ship, which is based on a PPO algorithm to construct a deep reinforcement network; the method comprises the following steps of constructing a simulation environment and an unmanned ship model of unmanned ship obstacle avoidance, and designing a state space of the unmanned ship model, wherein the state space comprises: the unmanned ship model comprises an environmental image acquired by an image sensor on the unmanned ship model and three-dimensional coordinate information of a preset target point; the motion space includes: the steering angle and the thrust of the unmanned ship model; sampling a plurality of sample data generated by the unmanned ship model in different weathers when the unmanned ship model interacts with the simulation environment by using the deep reinforcement network; and training the depth strengthening network by using the plurality of sample data to obtain an automatic obstacle avoidance model of the unmanned ship under different weathers based on a PPO algorithm.

In order to achieve the purpose of the invention, the invention adopts the following technical scheme:

an unmanned ship weather self-adaptive obstacle avoidance method based on deep reinforcement learning comprises the following steps:

(1) constructing a deep reinforcement network based on a PPO algorithm;

(2) constructing a simulation environment for unmanned ship obstacle avoidance and an unmanned ship model, and defining a state space and an action space of the unmanned ship model; the state space comprises an environment image acquired by an image sensor on the unmanned ship model and three-dimensional coordinate information of a preset target point; the action space comprises a steering angle and a thrust of the unmanned ship model;

(3) designing a reward function based on the time sequence distance;

(4) sampling a plurality of sample data generated by the unmanned ship model in different weathers when the unmanned ship model interacts with a simulation environment by using the deep reinforcing network;

(5) based on a PPO algorithm, the depth strengthening network is trained by using a plurality of sample data, so that automatic obstacle avoidance models of the unmanned ship under different weathers are obtained.

Preferably, in the step (3), a reward function for training the deep augmentation network is related to a distance between the unmanned ship model and a preset target point, and a formula of the reward function based on the time-series distance is as follows:

wherein r is_tDenotes the reward value at time t, - λ is a predetermined negative reward value, d_tAnd (U, T) represents the distance between the unmanned ship model U and a preset target point T at the moment T, and delta is a preset distance value.

Preferably, in the step (5), the step of constructing the deep reinforcement network based on the PPO algorithm is as follows:

(5-1) constructing the deep reinforcement network comprising a strategy network and a value network based on a PPO algorithm;

(5-2) sampling a plurality of sample data generated by the unmanned ship model when interacting with the simulation environment in different weathers by using a depth-enhanced network;

(5-3) sampling a plurality of sample data generated when the unmanned ship model interacts with simulation environments under different weathers by using the strategy network in the initialized deep reinforcement network;

and (5-4) training the depth strengthening network by using a plurality of sample data, thereby obtaining an automatic obstacle avoidance model of the unmanned ship.

Preferably, in the step (5-3), the policy network includes a new policy network and an old policy network; each sample data includes a status, an action, and a reward;

preferably, in the step (5-4), based on a PPO algorithm, training the deep augmentation network by using a plurality of sample data to obtain an automatic obstacle avoidance model of the unmanned ship, and the specific steps are as follows:

(5-4-1) inputting the state in each sample data into the value network to obtain the value of each sample data, and updating the parameters of the value network by using an advantage function calculated based on the value and the accumulated reward of each sample data;

and (5-4-2) updating the new strategy network according to the plurality of sample data, the new strategy network, the old strategy network and a preset objective function, and taking the new strategy network before updating as the old strategy network.

Preferably, in the step (5-3), the policy network comprises three convolutional layers and a fully-connected layer which are connected in sequence; the method comprises the following steps of sampling a plurality of sample data generated by the unmanned ship model when interacting with a simulation environment by utilizing a strategy network in an initialized deep reinforcement network, and comprises the following steps:

inputting the current state generated when the unmanned ship model interacts with the simulation environment into a first convolution layer in a strategy network, and outputting the probability of each action executed by the unmanned ship model through a full connection layer of the strategy network;

the method comprises the steps of controlling the unmanned ship model to execute the action with the maximum probability, obtaining the reward returned by a preset reward function, and obtaining the next state after the unmanned ship model executes the action with the maximum probability, wherein each sample data comprises the current state, the action executed in the current state and the reward returned by the reward function.

Preferably, the simulated environment comprises a plurality of weather types;

based on the PPO algorithm, training the depth-enhanced network by using sample data sampled in different weathers to obtain an automatic obstacle avoidance model of the unmanned ship, and the method comprises the following steps:

based on the PPO algorithm and each weather type, training the deep reinforcement network by using a plurality of sample data to obtain an automatic obstacle avoidance model corresponding to each weather type.

Preferably, the unmanned ship weather adaptive obstacle avoidance method based on the deep reinforcement learning collects environmental images of the unmanned ship according to a preset time interval; inputting the weather type of the environment image into a preset weather perception model, and identifying the weather type of the environment image;

selecting and executing the actions of the unmanned ship according to the collected environment image categories and a pre-trained automatic obstacle avoidance model until a preset destination is reached; the pre-trained automatic obstacle avoidance model is an unmanned ship automatic obstacle avoidance model which is obtained by training under the interaction with different weather environments based on a PPO algorithm; each automatic obstacle avoidance model corresponds to a different weather type.

Preferably, the generation method of the weather perception model is as follows:

constructing a marine weather simulation scene, wherein the marine weather simulation scene comprises unmanned ship autonomous driving scene simulation, full-time illumination simulation and marine weather simulation;

collecting a plurality of weather image samples when the unmanned ship model autonomously drives in the marine weather simulation scene;

and training the weather perception model based on the convolutional neural network by using a plurality of weather image samples to obtain the weather perception model for weather identification.

The invention provides an unmanned ship, which comprises at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the unmanned ship weather adaptive obstacle avoidance method based on the deep reinforcement learning.

Compared with the prior art, the method comprises the steps of firstly constructing the depth strengthening network based on the PPO algorithm, then constructing the simulation environment and the unmanned ship model of the unmanned ship obstacle avoidance, defining a state space and an action space suitable for the unmanned ship obstacle avoidance task, designing a reward function based on a time sequence distance, continuously distributing discrete rewards, solving the problem of sparse rewards, sampling a plurality of sample data generated when the unmanned ship model interacts with different weather environments by using the depth strengthening network, and then training the depth strengthening network by using the plurality of sample data based on the PPO algorithm to obtain the automatic obstacle avoidance model of the unmanned ship under different weather conditions. The updating amplitude can be limited by using a PPO algorithm, so that the obstacle avoidance strategy of the automatic obstacle avoidance model in the training process can be stably optimized in a progressive mode. Meanwhile, the unmanned ship model is adapted to different weathers by using a mode of sensing weather changes in real time and dynamically switching the pre-training obstacle avoidance model, the problem that the obstacle avoidance strategy is difficult to converge when the weather changes is solved, and the training efficiency and the obstacle avoidance success rate of the unmanned ship when the weather changes are improved.

Preferably, the number of the automatic obstacle avoidance models is multiple; each automatic obstacle avoidance model corresponds to a different weather type; before selecting the action of the unmanned ship and executing according to the collected environment image and the preset automatic obstacle avoidance model, the method further comprises the following steps: inputting the collected environment image into a preset weather perception model, and identifying a target weather type represented by the environment image; then, selecting an automatic obstacle avoidance model corresponding to the target weather type according to the corresponding relation between the preset weather type and the automatic obstacle avoidance model; inputting the environment image into an automatic obstacle avoidance model corresponding to the target weather type, and selecting and executing the action of the unmanned ship; in the process of carrying out the obstacle avoidance task by the unmanned ship, the invention senses the change of weather in real time through the preset weather sensing model, switches the automatic obstacle avoidance model corresponding to the corresponding weather type according to the self-adaptive rule, and selects and executes the correct action to realize the self-adaptive obstacle avoidance of the unmanned ship under the dynamic weather, so that the unmanned ship can keep better obstacle avoidance success rate under the dynamic weather.

Preferably, the convolutional neural network-based weather awareness model includes: the device comprises a first convolution layer, a pooling layer, a second convolution layer, a third convolution layer, a fourth convolution layer, two full-connection layers and a classifier which are connected in sequence; the loss function of the classifier is:

where loss (x, l) represents the loss function, x represents the weather image sample, l represents the tag set,t' represents the total number of weather types, S_iRepresents the output of the fully connected layer, y_iA real label representing a sample of the weather image,

representing the accumulation from i-1 to the total number of weather types. The method provides a concrete structure of the weather perception model and a loss function of the classifier.

Compared with the prior art, the invention has the following obvious and prominent substantive characteristics and remarkable advantages:

1. the method is characterized in that a deep reinforcement network is constructed based on a PPO algorithm; the method comprises the following steps of constructing a simulation environment for unmanned ship obstacle avoidance and an unmanned ship model, and defining the state space of the unmanned ship model, wherein the state space comprises the following steps: the method comprises the following steps that an environment image collected by an image sensor on an unmanned ship model and three-dimensional coordinate information of a preset target point are obtained; the motion space includes: the steering angle and the thrust of the unmanned ship model; designing a reward function based on the time sequence distance as an optimization basis; sampling sample data generated when the unmanned ship model interacts with a simulation environment in different weather by using a deep enhanced network; training the depth strengthening network by using sample data based on a PPO algorithm to obtain an automatic obstacle avoidance model of the unmanned ship under different weathers;

2. the method can sense weather changes in real time, dynamically select the pre-training obstacle avoidance model, and enable the unmanned ship model to adapt to different weathers;

3. according to the method, the discrete rewards are continuous by adding the reward function of the time sequence distance information, so that the problem of sparse rewards in the unmanned ship offshore obstacle avoidance task is solved; the unmanned ship model adapts to different weathers by using a mode of sensing weather changes in real time and dynamically switching the pre-training obstacle avoidance model, and the problem that the obstacle avoidance strategy is trained in the changed weathers and is difficult to converge is solved.

Drawings

One or more embodiments are illustrated by way of example in the accompanying drawings, which correspond to the figures in which like reference numerals refer to similar elements and which are not to scale unless otherwise specified.

Fig. 1 is a flow chart of an unmanned surface vehicle weather adaptive obstacle avoidance method based on deep reinforcement learning according to the invention.

Fig. 2 is a schematic diagram of a deep enhancement network according to the third embodiment of the present invention.

Fig. 3 is a schematic diagram of a policy network of three, implemented in accordance with the present invention.

Fig. 4 is a schematic diagram of an unmanned ship obstacle avoidance training process in the third embodiment of the invention.

FIG. 5 is a graphical illustration of reward values and entropy values trained under different weather conditions in accordance with the invention.

Fig. 6 is a detailed flowchart of an unmanned ship automatic obstacle avoidance method adaptive to changing weather according to the fourth embodiment of the invention.

Fig. 7 is a schematic diagram of a convolutional neural network-based weather awareness model in accordance with the fourth embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings. However, it will be appreciated by those of ordinary skill in the art that numerous technical details are set forth in order to provide a better understanding of the present application in various embodiments of the present invention. However, the technical solution claimed in the present application can be implemented without these technical details and various changes and modifications based on the following embodiments.

The above-described scheme is further illustrated below with reference to specific embodiments, which are detailed below:

the first embodiment is as follows:

in this embodiment, referring to fig. 1, a depth reinforcement learning-based unmanned surface vehicle weather adaptive obstacle avoidance method includes the following steps:

(1) constructing a deep reinforcement network based on a PPO algorithm;

(3) designing a reward function based on the time sequence distance;

The method can sense weather changes in real time, dynamically select the pre-training obstacle avoidance model, and enable the unmanned ship model to adapt to different weathers.

Example two:

this embodiment is substantially the same as the first embodiment, and is characterized in that:

in this embodiment, in the step (3), the reward function for training the depth-enhanced network is related to the distance between the unmanned ship model and the preset target point, and the formula of the reward function based on the time-series distance is as follows:

In this embodiment, in the step (5), the step of constructing the deep mesh reinforcement based on the PPO algorithm is as follows:

In this embodiment, in the step (5-3), the policy network includes a new policy network and an old policy network; each sample data includes a status, an action, and a reward;

in the step (5-4), training the depth-enhanced network by using a plurality of sample data based on a PPO algorithm to obtain an automatic obstacle avoidance model of the unmanned ship, and specifically, the method comprises the following steps:

In this embodiment, in the step (5-3), the policy network includes three convolutional layers and a fully-connected layer connected in sequence; the method comprises the following steps of sampling a plurality of sample data generated by the unmanned ship model when interacting with a simulation environment by utilizing a strategy network in an initialized deep reinforcement network, and comprises the following steps:

In this embodiment, the simulation environment includes a plurality of weather types;

In the embodiment, the environment image of the unmanned ship is collected according to a preset time interval; inputting the weather type of the environment image into a preset weather perception model, and identifying the weather type of the environment image;

In this embodiment, the generation manner of the weather perception model is as follows:

The method of the embodiment constructs a deep reinforcement network based on a PPO algorithm; the method comprises the following steps of constructing a simulation environment for unmanned ship obstacle avoidance and an unmanned ship model, and defining the state space of the unmanned ship model, wherein the state space comprises the following steps: the method comprises the following steps that an environment image collected by an image sensor on an unmanned ship model and three-dimensional coordinate information of a preset target point are obtained; the motion space includes: the steering angle and the thrust of the unmanned ship model; designing a reward function based on the time sequence distance as an optimization basis; sampling sample data generated when the unmanned ship model interacts with a simulation environment in different weather by using a deep enhanced network; based on a PPO algorithm, training the depth strengthening network by using sample data to obtain an automatic obstacle avoidance model of the unmanned ship under different weathers.

Example three:

in this embodiment, the unmanned surface vehicle weather adaptive obstacle avoidance method based on the deep reinforcement learning is applied to electronic devices such as a notebook computer and a desktop computer, the electronic devices can execute the method under different weather conditions to generate automatic obstacle avoidance models corresponding to different weather types, and the unmanned surface vehicle can execute tasks under different weather conditions by using the automatic obstacle avoidance models to achieve automatic obstacle avoidance so as to reach a specified destination by using an optimal path.

The obstacle avoidance of the unmanned ship is defined as follows: the unmanned ship senses the sea surface environment by using the image sensor, avoids the obstacle of the sea surface static island and reef and reaches the position of the preset target point at the highest speed.

In this embodiment, Deep Reinforcement Learning (DRL) is used to solve the problem of obstacle avoidance in a complex marine system, and Deep Reinforcement Learning combines the sensing ability of Deep Learning and the decision-making ability of Reinforcement Learning, so that a decision-making strategy can be autonomously learned from the experience accumulated in the decision-making process, and the problem of sensing decision-making in the complex system can be solved. The deep reinforcement learning mainly collects behavior experience through real-time interaction with the environment, the autonomous learning process of the strategy is unsupervised, and an optimal strategy model can be obtained only by setting a corresponding reward function as a feedback index of the task completion degree, so that the method is suitable for an unknown dynamic environment.

Firstly, an obstacle avoidance scene is in a certain state, the unmanned ship selects an action based on state information acquired by a visual sensor, an instant return is obtained after the action is executed, meanwhile, the unmanned ship model executes the action and has certain influence on the environment, so that the environment is converted into a new state, and then, the optimization of an obstacle avoidance strategy is carried out by using a model-free depth reinforcement learning algorithm. The above process can be repeatedly executed in the interaction of the unmanned ship and the obstacle avoidance scene, and finally, knowledge and experience are learned in a round of iteration process, and an optimal action strategy for solving the problem of obstacle avoidance of the unmanned ship is formed. The specific flow of this embodiment is shown in fig. 1:

101, constructing a deep reinforcement network based on a PPO algorithm;

in the embodiment, a deep reinforcement network is constructed by adopting a PPO algorithm, a near-end Policy Optimization (PPO for short) algorithm has better balance among the difficulty of implementation, the sampling complexity and the debugging time, and the algorithm tries to update to a more optimal Policy in each iteration step and can limit the updating amplitude of the Policy, so that the obstacle avoidance Policy can be stably optimized in a progressive manner;

102, constructing a simulation environment and an unmanned ship model of unmanned ship obstacle avoidance, and defining a state space and an action space of the unmanned ship model, wherein the state space comprises: the method comprises the following steps that an environment image collected by an image sensor on an unmanned ship model and three-dimensional coordinate information of a preset target point are obtained; the motion space includes: the steering angle and the thrust of the unmanned ship model;

in the embodiment, the obstacle avoidance scene of the unmanned ship is modeled through computer simulation, and comprises a plurality of weather types, such as sunny day, cloudy day, sunny dusk, sunny night, unmanned ship, sensors and the like, so that the scene is convenient to adjust, different marine environments are simulated, and the method is economical, safe and reliable, and can greatly shorten the development period; when constructing an unmanned obstacle avoidance scene comprising an unmanned ship obstacle avoidance simulation environment and an unmanned obstacle avoidance model, a Unity3D engine can be adopted to specifically construct the unmanned ship obstacle avoidance simulation environment;

the state space of the unmanned ship model is environment information perceived by the unmanned ship model through a visual sensor, and the environment information comprises two forms of images and vectors; in this embodiment, the states include: and the image acquired by the vision sensor and the three-dimensional coordinate information of the preset destination.

The action space of the unmanned ship model consists of steering and thrust, and defines a mode for controlling the unmanned ship model to move; the steering of the unmanned ship model is discrete movement and can comprise a plurality of discrete steering anglesSelecting a steering angle from the unmanned ship models at each moment for steering; the discrete motion comprises 7 steering angles, respectively:

discrete thrust, 1 for applied thrust and 0 for no thrust;

103, designing a reward function based on a time sequence distance;

for the training of the deep reinforcement network, after the state space and the action space of the unmanned ship model are defined, a reward function suitable for an obstacle avoidance task is designed, and the setting of the reward function is one of the keys of whether the strategy can be converged;

the method comprises the steps that a reward function for training a deep reinforcement network is set to be related to the real-time distance between an unmanned ship model and a preset target point, namely the distance between the unmanned ship model and the preset target point is added into the reward function, discrete reward values are converted into continuous values, and the problem of sparse rewards is solved; in one example, the formula for the reward function is as follows:

Wherein x is_usvAbscissa, y, representing model of unmanned boat_usvOrdinate, x, representing the model of the unmanned ship_targetRepresenting preset target pointsAbscissa of (a), y_targetRepresenting the ordinate of the preset target point;

104, acquiring a plurality of sample data generated in the process of dynamic interaction between the unmanned ship model and simulation environments under different weathers by using a depth-enhanced network in the process of executing an obstacle avoidance task by the unmanned ship model based on a PPO algorithm; the method comprises the following steps that only weather types in a simulation environment are changed in the simulation environment, so that corresponding obstacle avoidance model training samples can be generated according to different weather types;

firstly, constructing a deep reinforcement network comprising a strategy network and a value network based on a PPO algorithm, wherein a basic frame of the deep reinforcement network is an Actor-Critic frame, and is shown in FIG. 2;

the strategy network Actor10 is used for predicting the probability distribution of the selective action of the unmanned ship model in the current state, the input of the strategy network Actor10 is the current state observed by the vision sensor of the unmanned ship model, and the output of the strategy network Actor10 is the probability of each action taken by the unmanned ship model in the current state. Referring to fig. 3, a fully-connected layer of a policy network combines 7 steering operations and 2 thrust operations in an action space, 14 neurons are arranged in the fully-connected layer to represent different combinations of steering and thrust respectively to obtain actions, and the output of each neuron represents the probability of executing the corresponding action;

the value network criticic 20 is an evaluator, the input of the evaluator is the current state of the unmanned ship model, and the output is the value of the current state; the network structure of the value network can be the same as that of the strategy network, only the output neuron of the full connection layer is set to be 1, and the value of one state is output after the state is input into the value network;

the value network Critic outputs the value of the current state, the strategy network Actor selects the optimal action based on the probability of each action taken by the unmanned ship model in the current state, and controls the unmanned ship model to execute the optimal action, the visual sensor of the unmanned ship model observes the next state, the value network Critic calculates the value of the next state, and calculates the difference between the value of the next state and the value obtained in the obstacle avoidance environment minus the value of the current state as the TD error, the deep reinforcement network can optimize the update amplitude of the action probability in the strategy network Actor according to the TD error, when the value of the next state is greater than the value of the current state, the unmanned ship model achieves a better state after executing the current action, namely the next state, the value network Critic feeds back the probability of the action selection in the current state to the strategy network Actor to increase, otherwise, the number is reduced; the magnitude of the increase or decrease is determined by the TD error;

then, sampling a plurality of sample data generated when the unmanned ship model interacts with a simulation environment by using a strategy network in the initialized deep reinforcement network;

the concrete mode is as follows: the vision sensor on the unmanned ship model acquires an image to form a state s_tInputting the data into the first convolutional layer of the strategic network Actor, outputting the probability of executing various actions by the full-connection layer of the strategic network Actor, and controlling the unmanned ship model to execute the action a with the highest probability_tEntering the next state in the obstacle avoidance environment to obtain the reward r returned by the environment_tThe image acquired by the vision sensor forms the next state s_t+1Then change the state s_t+1Inputting the data into a policy network Actor, circularly sampling and converting each state s_tAction a_t-a prize r_tAs one sample data, a plurality of sample data can be obtained; when sampling of sample data is carried out by using a policy network Actor, updating of the policy network is not carried out;

and 105, training the depth strengthening network by using sample data under different weathers based on the PPO algorithm to obtain an automatic obstacle avoidance model of the unmanned ship under different weathers.

Inputting the state in each sample data into a value network to obtain the value and the accumulated reward of each sample data, and updating the parameters of the value network by using an advantage function calculated based on the value and the accumulated reward of each sample data; updating the new strategy network according to a plurality of sample data, the new strategy network, the old strategy network and a preset objective function, and taking the new strategy network before updating as the old strategy network;

in the embodiment, in the process of training the depth-enhanced network by adopting a PPO algorithm, a visual sensor on an unmanned ship model acquires an image forming state, the current state is used as a feature vector and is respectively input into a strategy network and a value network, the strategy network outputs an action to be executed in the current state, controls the unmanned ship model to execute the action, and the value network outputs the value of the current state; and then updating parameters of a strategy network and a value network by using rewards fed back by the environment according to a target function adopted by the PPO algorithm, and continuously and circularly iterating the three steps of executing actions, obtaining rewards and updating the strategies until the training is finished to obtain the automatic obstacle avoidance model of the unmanned ship.

The updating of the value network and the policy network in the deep hardened network is specifically described below with reference to fig. 4, where the policy network includes a new policy network and an old policy network.

When the PPO algorithm is used for training the deep reinforcement network, the target function of the PPO algorithm is as follows:

wherein r is_t(theta) represents the ratio of the old strategy network to the new strategy network in one iteration updating, theta represents the strategy parameter to be optimized, epsilon represents a preset constant for controlling the strategy updating amplitude,

indicates the expected value of time t, A_tRepresenting the merit function. The PPO algorithm utilizes a proportion item to describe the difference between the new strategy and the old strategy, and designs a cutting operation on a target function to reduce the difference between the new strategy and the old strategy so as to select an action which has a huge difference with the previous strategy in case that the strategy is updated too fast;

wherein the content of the first and second substances,

representing old policy network,. pi_θRepresenting a new policy network to be optimized, pi_θ(as) represents a new policy network with a state of s at time t and action as a,

and (4) representing the old policy network with the state bit s and action a at the time t.

In this embodiment, if the policy network and the value network share parameters in the deep enhanced network, the error term of the policy network and the value network may be added to the objective function, and the information entropy reward is added to increase exploration, so as to obtain the following objective function:

wherein, c₁And c₂Is a pre-set parameter of the process,

representing a policy n_θInformation entropy reward of, s_tRepresenting the state at time t, theta representing the strategy parameter to be optimized,

representing a value network V_θ(s_t) And target value

The square error of (a) is calculated,

representing the expected value at time t.

After the strategy network finishes the collection of a plurality of sample data, for the update of the value network, the advantage function A is calculated by using the value v and the accumulated reward r obtained by the state s of the sample data through the value network_tAnd using the dominance function A_tBack-propagation of the mean-square-error-loss of (d) update-value-network parameters. For policyAnd slightly updating the network, namely respectively inputting the state in each sample data into the new strategy network and the old strategy network to obtain the new strategy probability and the old strategy probability corresponding to the action in each sample data, and updating the new strategy network according to the new strategy probability and the old strategy probability corresponding to the action in each sample data and a preset objective function. Specifically, based on the new policy probability and the old policy probability corresponding to the action in the sample data in each iteration process, the ratio r of the old policy network to the new policy network in each iteration update is calculated_t(θ), then calculating the loss based on the loss function in the formula (1), and propagating the updated new policy network back, and using the new policy network before updating as the old policy network. After the strategy network and the value network are updated, the automatic obstacle avoidance model of the unmanned ship can be obtained.

The training curves in different weather are shown in fig. 5; after the automatic obstacle avoidance models of various weather types are obtained through training, the advantages and disadvantages of the automatic obstacle avoidance models can be evaluated through accumulated rewards, information entropy and obstacle avoidance success rate. And evaluating the automatic obstacle avoidance models corresponding to various weather types by taking the obstacle avoidance success rate as an evaluation index. The various weather types include: sunny (SunnyDay), cloudy (CloudDay), day (12 o 'clock), dusk (ClearDusk) (18 o' clock), night (ClearNight) (24 o 'clock), day (12 o' clock), dynamic weather: ChangingWeather1 day-dusk-night switching and ChangingWeather2 day-dusk switching. Table 1 below shows the average values of the obstacle avoidance success rate and the obstacle avoidance success rate for 300, 500, and 1000 rounds of tests, respectively.

Table 1 shows the obstacle avoidance success rate and the average value table of the obstacle avoidance success rate in the third embodiment of the present invention

From the above, the automatic obstacle avoidance model in this embodiment has an obstacle avoidance success rate of about 0.9 in several different static weathers. However, it is difficult to train a model with a high obstacle avoidance success rate in an environment with weather changes, and because the dynamic changes of weather increase the complexity of the environment, one model is difficult to adapt to the obstacle avoidance environment with weather changes. A second embodiment of the present invention is therefore proposed to solve this problem.

Example four:

the embodiment relates to the real-time weather classification of a state image acquired by an unmanned ship and the calling of an unmanned ship automatic obstacle avoidance model pre-trained in different weathers in a weather change environment, and different automatic obstacle avoidance models can be selected according to weather types.

A specific flow of the automatic obstacle avoidance method according to the present embodiment is shown in fig. 6.

The simulation environment of this embodiment is substantially the same as that of the first embodiment, and mainly differs in that: adding dynamically changing weather such as clear day-clear dusk-clear night dynamic switching, clear day-cloudy day-clear dusk dynamic switching, etc. A plurality of automatic obstacle avoidance models are preset in the unmanned ship; a plurality of automatic obstacle avoidance models under different weather conditions are preset, namely, the automatic obstacle avoidance models correspond to weather types one to one, and the weather types are as follows: sunny day, cloudy day, dusk in sunny day, night in sunny day, etc. Matching the acquired environment image category with a preset automatic obstacle avoidance model, selecting and executing the actions of the unmanned ship until the unmanned ship reaches a preset destination; the automatic obstacle avoidance model is the unmanned ship automatic obstacle avoidance model based on the depth reinforcement learning and generated in the first embodiment and under different weather conditions.

Step 201, collecting an environment image of the unmanned ship according to a preset time interval.

Step 202, inputting the collected environment image into a preset weather perception model, and identifying a target weather type represented by the environment image.

In this embodiment, a weather perception model is further preset in the unmanned ship and used for identifying the weather type of the current environment of the unmanned ship, an image sensor is mounted on the unmanned ship in advance, the image sensor can be a camera, in the process that the unmanned ship executes tasks, the image sensor collects the environment image of the unmanned ship according to a preset time interval and inputs the environment image into the weather perception model, and the target weather type represented by the environment image is identified. In one example, weather identification can be performed once every 15 times of acquisition of the environment image, so that the unmanned ship can move more smoothly, delay of execution of actions of the unmanned ship is avoided, and meanwhile, a high obstacle avoidance success rate can be kept.

In this embodiment, the weather sensor adopts a conventional convolutional neural network structure, and the specific network layer setting please refer to fig. 7, which selects the classifier as the Softmax classifier, and optimizes the parameters in the weather sensing model by using an average method that maximizes the logarithmic probability of the correct tag, where the loss function of the Softmax classifier is:

wherein loss (x, l) represents a loss function, x represents a weather image sample, l represents a tag set, T' represents a total number of weather types, S_iRepresents the output of the fully connected layer, y_iA real label representing a sample of the weather image.

Wherein f is_i(x, w) represents the ith of the last fully-connected layer output in the weather perception model, x represents the weather image sample input, w represents the parameter of the weather perception model, exp represents an exponential function with e as the base, f_j(x, w) represents each element, Σ, in the last fully-connected layer output in the weather awareness model_jRepresenting the accumulation of each element.

The above-mentioned Softmax classifier can adopt a back propagation algorithm with a descending random gradient to train a weather perception model, minimize the loss of the Softmax classifier, and output the probability that the current weather image sample belongs to each weather category, so that the Softmax classifier has the capability of distinguishing weather.

Step 203, comprising the following substeps:

and a substep 2031 of selecting an automatic obstacle avoidance model corresponding to the target weather type based on the corresponding relationship between the preset weather type and the automatic obstacle avoidance model.

And a substep 2032 of inputting the environment image into an automatic obstacle avoidance model corresponding to the target weather type, selecting an action of the unmanned ship and executing the action until a preset destination is reached.

Specifically, the corresponding relation between the weather type and the automatic obstacle avoidance model is preset in the unmanned ship, so that the automatic obstacle avoidance model corresponding to the target weather type can be selected, then the environment image is input into the automatic obstacle avoidance model corresponding to the target weather type, the automatic obstacle avoidance model selects and executes actions, and in the task execution process of the unmanned ship, even if the weather changes, the corresponding automatic obstacle avoidance model can be selected to select and execute the actions of the unmanned ship until the unmanned ship reaches the preset destination.

In this embodiment, the automatic obstacle avoidance model switched based on the dynamic weather adaptation in this embodiment is compared with a single automatic obstacle avoidance model for the dynamic weather, and in this embodiment, the obstacle avoidance success rate is used as an evaluation index. Dynamic weather includes: dynamic switching between day-dusk-night; dynamic switching of day-cloudy-dusk. Table 2 below shows the average values of the obstacle avoidance success rate and the obstacle avoidance success rate for 300, 500, and 1000 rounds of tests, respectively.

Table 2 shows the four obstacle avoidance success rates and the average value table of the obstacle avoidance success rates in the embodiment of the present invention

As can be seen from the above, the automatic obstacle avoidance model switched based on the dynamic weather adaptation in this embodiment has a higher obstacle avoidance success rate, and is suitable for a scene with dynamically changing weather.

Compared with the prior art, in the task execution process of the unmanned ship, the change of weather is sensed in real time through the preset weather sensing model, the automatic obstacle avoidance model corresponding to the corresponding weather type is switched according to the self-adaptive rule, and the correct action is selected and executed, so that the self-adaptive obstacle avoidance of the unmanned ship under the dynamic weather is realized, and the unmanned ship can keep a good obstacle avoidance success rate under the dynamic weather.

The steps of the above methods are divided for clarity, and the implementation may be combined into one step or split some steps, and the steps are divided into multiple steps, so long as the same logical relationship is included, which are all within the protection scope of the present patent; it is within the scope of the patent to add insignificant modifications to the algorithms or processes or to introduce insignificant design changes to the core design without changing the algorithms or processes.

Example five:

the present embodiments relate to an unmanned boat comprising at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to implement the method for automatic obstacle avoidance of an unmanned surface vehicle as described in any one of the fourth to sixth embodiments. Wherein, unmanned ship still includes image sensor, propeller etc. and no longer gives unnecessary details here.

Where the memory and processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting together one or more of the various circuits of the processor and the memory. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor.

The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory may be used to store data used by the processor in performing operations.

Example six:

the present embodiment relates to a computer-readable storage medium storing a computer program. The computer program realizes the above-described method embodiments when executed by a processor.

That is, as can be understood by those skilled in the art, all or part of the steps in the method according to the above embodiments may be implemented by a program instructing related hardware, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single chip, a chip, or the like) or a processor (processor) to execute all or part of the steps in the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice. The embodiments of the present invention have been described with reference to the accompanying drawings, but the present invention is not limited to the embodiments, and various changes and modifications can be made according to the purpose of the invention, and any changes, modifications, substitutions, combinations or simplifications made according to the spirit and principle of the technical solution of the present invention shall be equivalent substitutions, as long as the purpose of the present invention is met, and the present invention shall fall within the protection scope of the present invention without departing from the technical principle and inventive concept of the present invention.

Claims

1. An unmanned ship weather self-adaptive obstacle avoidance method based on deep reinforcement learning is characterized by comprising the following steps:

(1) constructing a deep reinforcement network based on a PPO algorithm;

(3) designing a reward function based on the time sequence distance;

2. The unmanned surface vehicle weather adaptive obstacle avoidance method based on the deep reinforcement learning of claim 1, wherein in the step (3), a reward function for training the deep reinforcement network is related to a distance between the unmanned surface vehicle model and a preset target point, and a formula of the reward function based on a time sequence distance is as follows:

3. The unmanned surface vehicle weather adaptive obstacle avoidance method based on the deep reinforcement learning of claim 1, wherein in the step (5), the step of constructing the deep reinforcement network based on the PPO algorithm is as follows:

4. The unmanned ship weather adaptive obstacle avoidance method based on deep reinforcement learning of claim 3, wherein in the step (5-3), the policy network comprises a new policy network and an old policy network; each sample data includes a status, an action, and a reward;

5. The unmanned surface vehicle weather adaptive obstacle avoidance method based on deep reinforcement learning of claim 3, wherein in the step (5-3), the policy network comprises three convolutional layers and fully-connected layers which are connected in sequence; the method comprises the following steps of sampling a plurality of sample data generated by the unmanned ship model when interacting with a simulation environment by utilizing a strategy network in an initialized deep reinforcement network, and comprises the following steps:

6. The unmanned surface vehicle weather adaptive obstacle avoidance method based on the deep reinforcement learning of any one of claims 1 to 5, wherein the simulation environment comprises a plurality of weather types;

7. The unmanned surface vehicle weather adaptive obstacle avoidance method based on the deep reinforcement learning of claims 1 to 5, characterized in that: collecting an environment image of the unmanned ship according to a preset time interval; inputting the weather type of the environment image into a preset weather perception model, and identifying the weather type of the environment image;

8. The unmanned surface vehicle weather adaptive obstacle avoidance method based on the deep reinforcement learning of claim 7, wherein the weather perception model is generated in a manner that: