CN115470934A - Sequence model-based reinforcement learning path planning algorithm in marine environment - Google Patents


Info

Publication number
CN115470934A
CN115470934A (application CN202211118607.0A)
Authority
CN
China
Prior art keywords
data
environment
network
reward
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211118607.0A
Other languages
Chinese (zh)
Inventor
Yang Jiachen (杨嘉琛)
Dai Huiao (代慧澳)
Wen Jiabao (温家宝)
Xiao Shuai (肖帅)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202211118607.0A priority Critical patent/CN115470934A/en
Publication of CN115470934A publication Critical patent/CN115470934A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The underwater vehicle plays an important role in the exploration and protection of national ocean resources, but because of the particularity of the marine environment it cannot communicate with the outside while working underwater; designing an underwater vehicle path planning algorithm with autonomous control capability is therefore an important guarantee for its normal operation. Aiming at the path planning task in a complex marine environment, the method uses the Decision Transformer algorithm to control the motion of a submersible in a python simulation environment. Because the Decision Transformer predicts and controls actions from trajectory sequences, it removes the requirement of traditional reinforcement learning algorithms that the input state satisfy the Markov property; a converged, near-optimal policy can be obtained in a partially observable environment with unknown ocean currents, the given path planning task is completed, and a sequence-model-based reinforcement learning path planning algorithm for the marine environment is finally obtained.

Description

Sequence model-based reinforcement learning path planning algorithm in marine environment
Technical Field
The invention relates to a path planning algorithm, in particular to a reinforcement learning algorithm that uses a Transformer model to train a marine underwater agent in a complex marine environment.
Background
Because of the unique conditions of the marine environment, autonomous underwater vehicles play an increasingly important role in the development and protection of marine resources in various countries. The ocean environment is complex: large-scale ocean currents, obstacles and other factors interfere with the normal operation of an underwater vehicle, and the various mature GPS-based path planning algorithms available today are greatly limited by underwater communication technology. An autonomous underwater vehicle designed for the specific environment of a particular sea area can therefore adapt to the marine environment better.
A reinforcement learning algorithm samples data and improves its own policy through continuous interaction with the environment, finally converging to an optimal policy. In this process the algorithm must make full use of the data collected so far while still ensuring exploration, learn the action value function of the current environment, and obtain the policy function directly or indirectly so that the agent can make autonomous decisions. Using a reinforcement learning algorithm to train an agent to complete a given path planning task in a marine simulation environment therefore has broad prospects.
The basic assumption of reinforcement learning modelling is that the environment satisfies the Markov property, i.e. the current state depends only on the previous state and not on earlier history; in other words, the current state already contains all the historical information needed for decision making. However, because of the complexity of the marine environment, or because of sensor limitations, complete information about the current state, such as changes in the ocean current, is often unavailable in practice. Traditional reinforcement learning algorithms must be carefully designed to adapt to such partially observable situations, and training is particularly difficult when rewards are sparse.
The ability of the Transformer model to process sequence data has been widely verified in the field of natural language processing. Applied to image processing, the Transformer has also achieved performance comparable to convolutional neural networks, and the self-attention mechanism it is built on has made large pre-trained models possible. A Transformer-based reinforcement learning algorithm is therefore expected to alleviate the reward sparsity problem of traditional reinforcement learning while making better use of samples.
However, because of the complexity of the marine environment and the sparsity of the reward, directly applying the Decision Transformer algorithm converges to a good policy only slowly, and its ability to explore the environment is insufficient.
On the basis of the Decision Transformer, an offline reinforcement learning algorithm that balances exploration and exploitation is therefore used for ocean path planning. Entropy regularization is applied during training to keep the output actions sufficiently exploratory, maintaining the entropy of the network close to a target value until enough experience data has been collected. A good policy is obtained from a small amount of data, and an autonomous agent that can overcome ocean currents and avoid obstacles is trained.
Disclosure of Invention
Aiming at the path planning problem, the invention provides a Decision Transformer-based offline reinforcement learning algorithm for complex marine environments and adjusts the exploration capability of the agent with an entropy regularization method. The method uses a python simulation environment to acquire the position and velocity of the agent in real time as the state input, and obtains the output action after passing it through a Transformer network. The system finds the policy that maximizes the return through continuous training and completes the given path planning task. The technical scheme is as follows:
In a first step, a Decision Transformer-based network structure is constructed, wherein the input of the network consists of three sequences (reward-to-go, state and action), each of length K, and the output is a prediction of the sequence.
(1) Data preprocessing. The collected data is stored episode by episode, which makes it convenient to feed into the Transformer as sequences of reward-to-go, state and action. In traditional reinforcement learning the reward is given when an action taken in the current state leads to the next state, and it evaluates the agent's behaviour; the Decision Transformer used here instead feeds the accumulated future reward, i.e. the reward-to-go, directly into the network, so all rewards following the current state are summed to form the reward-to-go of the current state.
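A minimal sketch of this preprocessing step (an illustration, not the patented code): the reward-to-go of step t is the sum of all rewards from step t to the end of the episode.

```python
import numpy as np

def rewards_to_go(rewards):
    """Suffix sums of the per-step rewards of one episode.

    rewards: 1-D array of rewards in time order.
    returns: array of the same length; element t is the sum of all
             rewards from step t to the end of the episode.
    """
    rtg = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

# Example with a short, hypothetical episode:
print(rewards_to_go(np.array([-0.5, -0.5, 50.0, -0.5, 500.0])))
# [548.5 549.  549.5 499.5 500. ]
```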
(2) Data embedding and position coding. The model input consists of three sequences: the reward-to-go, the state and the action. Because their feature dimensions differ, the three sequences are embedded separately, each converted by a fully connected layer into data of the same dimension. The three embedded feature vectors are then concatenated, position coding is performed, and the feature vectors generated from the sequence positions are added. The feature vectors obtained after embedding and position coding are the input of the model.
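The PyTorch sketch below illustrates one way this embedding and position coding could look; the class name, the 128-dimensional hidden size (taken from the detailed embodiment later in this document) and the use of a learned timestep embedding are assumptions, not a claim about the patented implementation.

```python
import torch
import torch.nn as nn

class TrajectoryEmbedding(nn.Module):
    """Embed reward-to-go, state and action into a common dimension and
    interleave them as (R_1, s_1, a_1, R_2, s_2, a_2, ...)."""
    def __init__(self, state_dim=7, act_dim=1, hidden=128, max_len=1000):
        super().__init__()
        self.embed_rtg = nn.Linear(1, hidden)            # reward-to-go is a scalar
        self.embed_state = nn.Linear(state_dim, hidden)
        self.embed_action = nn.Linear(act_dim, hidden)
        self.embed_time = nn.Embedding(max_len, hidden)  # position (timestep) code

    def forward(self, rtg, states, actions, timesteps):
        # rtg: (B, K, 1), states: (B, K, state_dim), actions: (B, K, act_dim)
        # timesteps: (B, K) integer step indices inside the episode
        t = self.embed_time(timesteps)
        r = self.embed_rtg(rtg) + t
        s = self.embed_state(states) + t
        a = self.embed_action(actions) + t
        # interleave along the time axis -> (B, 3K, hidden)
        return torch.stack([r, s, a], dim=2).reshape(rtg.size(0), -1, r.size(-1))

emb = TrajectoryEmbedding()
x = emb(torch.zeros(256, 20, 1), torch.zeros(256, 20, 7),
        torch.zeros(256, 20, 1), torch.zeros(256, 20, dtype=torch.long))
print(x.shape)  # torch.Size([256, 60, 128])
```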
(3) Network structure design. The main structure of the network uses the GPT-2 model and comprises 12 stacked Transformer decoder blocks, with the feature dimension of the model reduced from the original 768 to 256; it is otherwise consistent with the 12-layer GPT-2 model. No particular network structure is required here: any Transformer model based on the self-attention mechanism can be used for the sequence-to-sequence prediction in the present invention.
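As an illustration of how such a backbone could be instantiated, the sketch below uses the Hugging Face `transformers` package; configuration values other than the 12 layers and the 256-dimensional features are assumptions for the sketch.

```python
import torch
from transformers import GPT2Config, GPT2Model

config = GPT2Config(
    vocab_size=1,      # unused: trajectory embeddings are passed in directly
    n_layer=12,        # 12 stacked decoder blocks
    n_embd=256,        # feature dimension reduced from 768 to 256
    n_head=8,          # assumed head count
    n_positions=1024,
)
backbone = GPT2Model(config)

tokens = torch.zeros(4, 60, 256)                     # (batch, 3K, n_embd)
hidden = backbone(inputs_embeds=tokens).last_hidden_state
print(hidden.shape)                                  # torch.Size([4, 60, 256])
```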
(4) Loss function design. Since six discrete actions are output, a cross entropy function is used for training: a sequence of length K is input, the output is obtained from the Decision Transformer, and the output of the network has the same form as the input, again containing reward-to-go, state and action. The action item is selected to compute the cross entropy loss, and the label used in this process is the last action value in the sequence. To strike a balance between exploration and exploitation, an entropy-regularization term for the output actions (the right-hand term inside the brackets of formula (1)) is added to the loss function; it raises the entropy of the output actions during training so that different actions are selected, increasing the algorithm's ability to explore the environment.
$$ L = \frac{1}{N}\sum_{i=1}^{N}\left( -\sum_{c=1}^{M} y_{ic}\log p_{ic} + \alpha \sum_{c=1}^{M} p_{ic}\log p_{ic} \right) \tag{1} $$
In formula (1), N is the number of samples in a batch, M is the number of discrete actions, p_{ic} is the probability of action c output by the network for sample i, y_{ic} is a sign function (0 or 1) taking the value 1 if the true class of sample i equals c and 0 otherwise, and α is the entropy coefficient.
Without the entropy regularization term, the loss describes a classification problem: after the 20 sequence steps are input, the model predicts the action corresponding to the 20th step. However, the input also contains the reward-to-go, which can be regarded as playing the role of the action value Q in traditional reinforcement learning, so the Transformer network implicitly learns the action value function of the current state; this is the fundamental difference from imitation learning.
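A sketch of this loss, assuming the reading of formula (1) above: cross entropy over the six discrete actions minus α times the entropy of the predicted action distribution, so that minimizing the loss fits the action labels while keeping the entropy high.

```python
import torch
import torch.nn.functional as F

def dt_loss(action_logits, action_targets, alpha=0.1):
    # action_logits: (B, M) unnormalized scores for the predicted action
    # action_targets: (B,) integer labels (the action actually taken)
    ce = F.cross_entropy(action_logits, action_targets)
    log_probs = F.log_softmax(action_logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return ce - alpha * entropy   # low loss = fits the labels yet stays exploratory

logits = torch.randn(256, 6, requires_grad=True)
targets = torch.randint(0, 6, (256,))
print(dt_loss(logits, targets))
```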
In a second step, a reward function is set according to the characteristics of the simulation environment, and random experience is collected to train the model.
(1) Reward function design. The task scenario used here is to reach the intermediate point and the target point while avoiding obstacles. A large positive reward is therefore given when the agent reaches the intermediate point or the target point, a large negative reward is given when it hits an obstacle, a small penalty is given for moving towards the target point, and a somewhat larger penalty, still small but greater than the towards-target penalty, is given for moving away from the target point.
(2) Data collection with a random policy. In the absence of any data, a random policy is first used to sample the environment and obtain whole episodes of data for training.
(3) Training the network with the collected data. The collected data is divided into mini-batches and fed into the network, which is trained with the cross entropy function defined above.
(4) Interacting with the environment again to generate new experience. The trained network already has a preliminary understanding of the environment. During sampling, an initial reward-to-go value must be provided, and a suitable target value is computed from the characteristics of the environment and the reward settings. The sampled data is added to the buffer episode by episode.
(5) Steps (3) and (4) are repeated until a good policy is obtained.
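The alternation of steps (2)-(5) can be summarized by the skeleton below; every name in it is a hypothetical stub (the environment interaction and the gradient step are faked), so only the control flow is meant literally.

```python
import random

def collect_episode(policy=None):
    # stub: one episode as a list of (state, action, reward) tuples
    return [((0.0,) * 7, random.randrange(6), random.uniform(-1, 1)) for _ in range(50)]

def train_on_batch(batch):
    # stub: one gradient step on a mini-batch; returns a fake loss value
    return random.random()

buffer = [collect_episode() for _ in range(1000)]        # (2) random experience
for iteration in range(10):
    for _ in range(50):                                  # (3) train on mini-batches
        batch = random.sample(buffer, 256)
        loss = train_on_batch(batch)
    for _ in range(10):                                  # (4) collect new experience
        buffer.append(collect_episode(policy="trained network"))
# (5) the two phases alternate until the policy is good enough
```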
The invention constructs, for the first time, a Decision Transformer-based offline reinforcement learning algorithm for ocean path planning. The algorithm can produce an obstacle-avoiding policy that reaches multiple target points in a given ocean environment, and it converges even under partial observability where the ocean current information is unknown. Its advantages and positive effects are:
(1) The invention applies offline reinforcement learning and can converge to a good policy with a relatively small amount of data.
(2) An entropy regularization term is introduced into the loss function, giving the offline reinforcement learning algorithm a strong exploration capability.
(3) The method is tested in a simulation environment, does not require the observations to satisfy the Markov property, and can adapt to path planning tasks under partially observable conditions.
Drawings
FIG. 1 is a schematic diagram of the inputs and outputs of the Decision Transformer network in an embodiment of the present invention
FIG. 2 is the loss curve of the cross entropy function during training in an embodiment of the present invention
FIG. 3 is a path planning trajectory of the underwater vehicle agent simulated in the Python simulation environment in an embodiment of the present invention
FIG. 4 is the average reward value obtained per episode when the target reward value is 520 in an embodiment of the present invention
FIG. 5 is the average reward value obtained per episode when the target reward value is 1040 in an embodiment of the present invention
Detailed Description
In order to make the technical scheme of the invention clearer, the invention is further explained below with reference to the attached drawings. The invention is realized by the following steps:
In a first step, a Decision Transformer-based network structure is constructed, wherein the input of the network consists of three sequences (reward-to-go, state and action), each of length K, and the output is a prediction of the sequence.
(1) Data preprocessing. Sequences of length K are sampled from the data of a whole episode; each sequence contains the three data types reward-to-go, state and action. Here the batch size is 256 and the sampling length K is 20; the reward-to-go and the action are one-dimensional and the state is seven-dimensional, so the input reward-to-go and action tensors have dimension (256, 20, 1) and the input state tensor has dimension (256, 20, 7).
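A sketch of this sampling step (the helper name and episode layout are assumptions, not taken from the patent): length-K windows are cut from stored episodes to build batches with exactly the shapes quoted above.

```python
import numpy as np

def sample_batch(episodes, batch_size=256, K=20, state_dim=7):
    rtg = np.zeros((batch_size, K, 1), dtype=np.float32)
    states = np.zeros((batch_size, K, state_dim), dtype=np.float32)
    actions = np.zeros((batch_size, K, 1), dtype=np.float32)
    for b in range(batch_size):
        ep = episodes[np.random.randint(len(episodes))]          # one stored episode
        start = np.random.randint(0, len(ep["states"]) - K + 1)  # random window start
        rtg[b] = ep["rtg"][start:start + K, None]
        states[b] = ep["states"][start:start + K]
        actions[b] = ep["actions"][start:start + K, None]
    return rtg, states, actions

fake_ep = {"rtg": np.arange(100.0), "states": np.zeros((100, 7)), "actions": np.zeros(100)}
r, s, a = sample_batch([fake_ep])
print(r.shape, s.shape, a.shape)   # (256, 20, 1) (256, 20, 7) (256, 20, 1)
```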
(2) Data embedding and position coding. The preprocessed data passes through linear layers and the corresponding position codes are added; the three resulting streams, each of dimension (256, 20, 128), are concatenated along the second dimension to obtain a tensor of dimension (256, 60, 128). This is the input of the GPT-2 network.
(3) Network structure design. A Transformer model based on the self-attention mechanism is chosen to handle the sequence problem. Fig. 1 is a schematic diagram of the model, including the network inputs and outputs.
(4) Loss function design. The coefficient α controls how strongly the algorithm constrains the entropy of the network's output actions; it is set to the relatively small value 0.1, so that the network mainly minimizes the cross entropy loss while the randomness of the output actions is still increased during training.
(See formula (1) above.)
In a second step, a reward function is set according to the characteristics of the simulation environment, and random experience is collected to train the model.
(1) Reward function design. Positive rewards of 50 and 500 are given when the agent reaches the intermediate point and the target point respectively, a negative reward of -10 is given when an obstacle is hit, a smaller penalty of -0.5 is given when moving towards the target point, and a larger penalty of -1 is given when moving away from the target point.
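A compact sketch of this reward shaping; the boolean flags are hypothetical environment outputs, not part of the patent.

```python
def step_reward(hit_obstacle, reached_target, reached_intermediate, moved_towards_target):
    """Per-step reward following the values described in this embodiment."""
    if hit_obstacle:
        return -10.0
    if reached_target:
        return 500.0
    if reached_intermediate:
        return 50.0
    return -0.5 if moved_towards_target else -1.0
```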
(2) Data collection with a random policy. 1000 episodes of data were collected at random, each with a maximum of 1000 steps.
(3) Training the network with the collected data. Mini-batches of the collected data are fed into the network, which is trained with the cross entropy function defined above. FIG. 2 shows the loss curve of the cross entropy function during training.
(4) Interacting with the environment again, 10 new episodes of experience are generated and added to the buffer. Since the model outputs a probability for each discrete action, the action is sampled from the distribution over the six actions, which increases the exploratory behaviour of the network.
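For illustration, sampling the action from the predicted distribution rather than taking the arg-max could look like the sketch below, with random logits standing in for the network output.

```python
import torch

logits = torch.randn(6)                               # network output for the current step
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                                # stochastic choice among 6 actions
print(int(action), [round(p, 3) for p in dist.probs.tolist()])
```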
(5) Steps (3) and (4) are repeated until a good policy is obtained. Fig. 3 is a simulation picture obtained after the network has converged. The underwater vehicle starts from the lower starting point, passes through the intermediate target point and finally reaches the upper red target point, completing the given path planning task; the planned path avoids the eight preset obstacles and no collision occurs. The arrows in the figure indicate the direction of the ocean current.
(6) The initial reward-to-go of the Decision Transformer is set according to the maximum reward the environment can provide. In this environment the rewards for reaching the intermediate point and the target point are 20 and 500 respectively, so the maximum reward obtainable without any penalty, and hence the theoretical target reward, is 520. Fig. 4 shows the average reward obtained per episode when the expected reward is 520. The reward-to-go has a large influence on the performance of the algorithm, so in the path planning task of the present invention it can also be set to an unattainably large value, such as 1040, which positively encourages the agent to plan the desired path. Figure 5 shows the average reward obtained per episode when the initial reward-to-go is 1040.
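The conditioning on an initial target reward can be sketched as the rollout below; the `env` and `model` interfaces are hypothetical stand-ins for the python simulation environment and the trained network.

```python
def rollout(env, model, target_return=520.0, max_steps=1000):
    """Roll out the trained model conditioned on an initial reward-to-go."""
    state = env.reset()
    rtg, history = target_return, []
    for _ in range(max_steps):
        history.append((rtg, state))
        action = model.act(history)          # hypothetical inference call
        state, reward, done = env.step(action)
        rtg -= reward                        # remaining reward-to-go shrinks as rewards arrive
        if done:
            break
    return history
```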

Claims (5)

1. A reinforcement learning path planning algorithm in a marine environment based on a sequence model, wherein the method comprises:
(1) Simulating the ocean currents, obstacles and other elements of the environment with a python simulation environment, acquiring the position and velocity of the agent in real time as the state input, obtaining the output action through a Decision Transformer network, finding the policy that maximizes the return through continuous training, and completing the given path planning task;
(2) Preprocessing and storing the data, and embedding and position coding the data;
(3) Constructing a Decision Transformer-based network structure, wherein the input of the network consists of three sequences each of length K;
(4) Designing a loss function;
(5) Designing a reward function suitable for a path planning task;
(6) Training the network using the actions as labels.
2. The method for planning the offline reinforcement learning path in the complex marine environment according to claim 1, wherein: the data processing method in the step (2) comprises the following steps:
The collected data is stored episode by episode and input into the Transformer as sequences of reward-to-go, state and action; the Decision Transformer feeds the accumulated reward directly into the network as the reward-to-go, i.e. all rewards following the current state are summed to form the reward-to-go of the current state.
Three sequences, the reward-to-go, the state and the action, are input; because their feature dimensions differ, the three sequences are embedded separately and converted by fully connected layers into data of the same dimension. The three embedded feature vectors are concatenated, position coding is performed, and the feature vectors generated from the sequence positions are added; the feature vectors obtained after embedding and position coding are the input vectors of the model.
3. The method for planning the offline reinforcement learning path in the complex marine environment according to claim 1, wherein: the design and selection of the loss function in the step (5) comprises the following steps:
the selected loss function satisfies the conditions: and selecting a cross entropy function when the action output by the intelligent agent is a discrete action, and selecting a mean square error function when the action output by the intelligent agent is a continuous action.
The simulation environment used in the verification process outputs six discrete actions, so a cross entropy function is used for training, a sequence with the length of K is input, and the output is obtained after a precision transform, wherein the output of the network is the same data as the input and represents the reward, the state and the action. Selecting an action item to calculate a cross entropy loss function, wherein a label used in the process is the last action value in the sequence, in order to enable the algorithm to reach a balance between exploration and utilization, the entropy of an output action is added into the loss function, the entropy of the output action is increased in the training process to select different actions, and the capability of the algorithm for exploring the environment is increased.
4. The method for planning the offline reinforcement learning path in the complex marine environment according to claim 1, wherein: the design and selection of the reward function in the step (6) comprises the following steps:
The path planning task used by the invention is to reach the intermediate point and the target point while avoiding obstacles, so a larger positive reward is given when the agent reaches the intermediate point or the target point, a larger negative reward is given when an obstacle is encountered, a smaller penalty is given when moving towards the target point, and a smaller penalty, greater than the towards-target penalty, is given when moving away from the target point.
5. The method for planning the offline reinforcement learning path in the complex marine environment according to claim 1, wherein: the training of the neural network in the step (7) comprises the following steps:
step 1: data was collected using a random strategy. Under the condition of no data, firstly, sampling in an environment by using a random strategy to obtain a whole part of data for training;
step 2: the network is trained using the collected data. Dividing the collected data into a plurality of batches and sending the data into a network, and using the cross entropy function defined in the foregoing to train;
and step 3: interact with the environment again, resulting in a new experience. The trained network already preliminarily knows the environment, and experience obtained by sampling again is more targeted, wherein initial unobtained reward values need to be input during sampling, and the optimal reward values need to be calculated through environment characteristics and reward setting. Sampling to obtain data, and filling the data into a buffer area according to the local;
and 4, step 4: and (4) continuously repeating the step 2 and the step 3 until a better strategy is obtained.
CN202211118607.0A 2022-09-14 2022-09-14 Sequence model-based reinforcement learning path planning algorithm in marine environment Pending CN115470934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211118607.0A CN115470934A (en) 2022-09-14 2022-09-14 Sequence model-based reinforcement learning path planning algorithm in marine environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211118607.0A CN115470934A (en) 2022-09-14 2022-09-14 Sequence model-based reinforcement learning path planning algorithm in marine environment

Publications (1)

Publication Number Publication Date
CN115470934A true CN115470934A (en) 2022-12-13

Family

ID=84333643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211118607.0A Pending CN115470934A (en) 2022-09-14 2022-09-14 Sequence model-based reinforcement learning path planning algorithm in marine environment

Country Status (1)

Country Link
CN (1) CN115470934A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115790608A (en) * 2023-01-31 2023-03-14 天津大学 AUV path planning algorithm and device based on reinforcement learning
CN115790608B (en) * 2023-01-31 2023-05-30 天津大学 AUV path planning algorithm and device based on reinforcement learning
CN115946133A (en) * 2023-03-16 2023-04-11 季华实验室 Mechanical arm plug-in control method, device, equipment and medium based on reinforcement learning
CN116540701A (en) * 2023-04-19 2023-08-04 广州里工实业有限公司 Path planning method, system, device and storage medium
CN116540701B (en) * 2023-04-19 2024-03-05 广州里工实业有限公司 Path planning method, system, device and storage medium
CN116295449A (en) * 2023-05-25 2023-06-23 吉林大学 Method and device for indicating path of autonomous underwater vehicle
CN116295449B (en) * 2023-05-25 2023-09-12 吉林大学 Method and device for indicating path of autonomous underwater vehicle

Similar Documents

Publication Publication Date Title
CN115470934A (en) Sequence model-based reinforcement learning path planning algorithm in marine environment
CN113176776B (en) Unmanned ship weather self-adaptive obstacle avoidance method based on deep reinforcement learning
Jiang et al. Path planning for intelligent robots based on deep Q-learning with experience replay and heuristic knowledge
CN112819253A (en) Unmanned aerial vehicle obstacle avoidance and path planning device and method
CN113110509A (en) Warehousing system multi-robot path planning method based on deep reinforcement learning
CN112183742B (en) Neural network hybrid quantization method based on progressive quantization and Hessian information
CN114019370B (en) Motor fault detection method based on gray level image and lightweight CNN-SVM model
CN112906828A (en) Image classification method based on time domain coding and impulse neural network
CN115100574A (en) Action identification method and system based on fusion graph convolution network and Transformer network
CN111476285A (en) Training method of image classification model, image classification method and storage medium
CN101706888A (en) Method for predicting travel time
CN115375877A (en) Three-dimensional point cloud classification method and device based on channel attention mechanism
CN113894780B (en) Multi-robot cooperation countermeasure method, device, electronic equipment and storage medium
CN114626598A (en) Multi-modal trajectory prediction method based on semantic environment modeling
Nam et al. Sequential Semantic Generative Communication for Progressive Text-to-Image Generation
CN116109945B (en) Remote sensing image interpretation method based on ordered continuous learning
WO2023179609A1 (en) Data processing method and apparatus
CN115630566B (en) Data assimilation method and system based on deep learning and dynamic constraint
Shi et al. Motion planning for unmanned vehicle based on hybrid deep learning
CN116892932A (en) Navigation decision method combining curiosity mechanism and self-imitation learning
CN115063597A (en) Image identification method based on brain-like learning
Zhang et al. Reinforcement Learning from Demonstrations by Novel Interactive Expert and Application to Automatic Berthing Control Systems for Unmanned Surface Vessel
CN114065834A (en) Model training method, terminal device and computer storage medium
Szymak Using neuro-evolutionary-fuzzy method to control a swarm of unmanned underwater vehicles
Lan et al. Learning-based path planning algorithm in ocean currents for multi-glider

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Wen Jiabao

Inventor after: Dai Huiao

Inventor after: Yang Jiachen

Inventor after: Xiao Shuai

Inventor before: Yang Jiachen

Inventor before: Dai Huiao

Inventor before: Wen Jiabao

Inventor before: Xiao Shuai