CN115470934A - Sequence model-based reinforcement learning path planning algorithm in marine environment - Google Patents


Info

Publication number
CN115470934A
CN115470934A (application CN202211118607.0A)
Authority
CN
China
Prior art keywords
data
environment
network
reward
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211118607.0A
Other languages
Chinese (zh)
Inventor
Yang Jiachen (杨嘉琛)
Dai Huiao (代慧澳)
Wen Jiabao (温家宝)
Xiao Shuai (肖帅)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202211118607.0A priority Critical patent/CN115470934A/en
Publication of CN115470934A publication Critical patent/CN115470934A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The underwater vehicle plays an important role in the exploration and protection of national ocean resources, but because of the particularity of the marine environment it cannot communicate with the outside while working underwater; designing an underwater vehicle path planning algorithm with autonomous control capability is therefore an important guarantee for its normal operation. Aiming at the path planning task in a complex marine environment, the method uses the Decision Transformer algorithm to control the motion of a submersible in a python simulation environment. Because the Decision Transformer predicts and controls actions from trajectory sequences, it removes the requirement of traditional reinforcement learning algorithms that the input state satisfy the Markov property; a converged, near-optimal policy can be obtained in a partially observable environment with unknown ocean currents, the given path planning task is completed, and a sequence-model-based reinforcement learning path planning algorithm for the marine environment is finally obtained.

Description

Sequence model-based reinforcement learning path planning algorithm in marine environment
Technical Field
The invention relates to a path planning algorithm, in particular to a reinforcement learning algorithm that uses a Transformer model to train a marine underwater agent in a complex marine environment.
Background
Because of the unique conditions of the marine environment, autonomous underwater vehicles play an increasingly important role in the development and protection of marine resources in various countries. The ocean environment is complex: large-scale ocean currents, obstacles and other factors interfere with the normal operation of an underwater vehicle, and the various mature GPS-based path planning algorithms available today are greatly limited by underwater communication technology. An autonomous underwater vehicle designed for the specific environment of a particular sea area can therefore adapt to the marine environment better.
A reinforcement learning algorithm samples data and improves its own policy through continuous interaction with the environment, finally converging to an optimal policy. In this process the algorithm must make full use of the data collected so far while still ensuring exploration, learn the action value function of the current environment, and obtain the policy function directly or indirectly so that the agent can make autonomous decisions. Using a reinforcement learning algorithm to train an agent to complete a given path planning task in a marine simulation environment therefore has broad prospects.
The basic assumption of reinforcement learning modelling is that the environment satisfies the Markov property, i.e. the current state depends only on the previous state and not on earlier history; in other words, the current state already contains all the historical information needed for decision making. However, because of the complexity of the marine environment, or because of sensor limitations, complete information about the current state, such as changes in the ocean current, is often unavailable in practice. Traditional reinforcement learning algorithms must be carefully designed to adapt to such partially observable situations, and training is particularly difficult when rewards are sparse.
The ability of the Transformer model to process sequence data has been widely verified in the field of natural language processing. Applied to image processing, the Transformer has also achieved performance comparable to convolutional neural networks, and the self-attention mechanism it is built on has made large pre-trained models possible. A Transformer-based reinforcement learning algorithm is therefore expected to alleviate the reward sparsity problem of traditional reinforcement learning while making better use of samples.
However, because of the complexity of the marine environment and the sparsity of the reward, directly applying the Decision Transformer algorithm converges to a good policy only slowly, and its ability to explore the environment is insufficient.
On the basis of the Decision Transformer, an offline reinforcement learning algorithm that balances exploration and exploitation is therefore used for ocean path planning. Entropy regularization is applied during training to keep the output actions sufficiently exploratory, maintaining the entropy of the network close to a target value until enough experience data has been collected. A good policy is obtained from a small amount of data, and an autonomous agent that can overcome ocean currents and avoid obstacles is trained.
Disclosure of Invention
Aiming at the path planning problem, the invention provides a Decision Transformer-based offline reinforcement learning algorithm for complex marine environments and adjusts the exploration capability of the agent with an entropy regularization method. The method uses a python simulation environment to acquire the position and velocity of the agent in real time as the state input, and obtains the output action after passing it through a Transformer network. The system finds the policy that maximizes the return through continuous training and completes the given path planning task. The technical scheme is as follows:
In a first step, a Decision Transformer-based network structure is constructed, wherein the input of the network consists of three sequences (reward-to-go, state and action), each of length K, and the output is a prediction of the sequence.
(1) Data preprocessing. The collected data is stored episode by episode, which makes it convenient to feed into the Transformer as sequences of reward-to-go, state and action. In traditional reinforcement learning the reward is given when an action taken in the current state leads to the next state, and it evaluates the agent's behaviour; the Decision Transformer used here instead feeds the accumulated future reward, i.e. the reward-to-go, directly into the network, so all rewards following the current state are summed to form the reward-to-go of the current state.
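A minimal sketch of this preprocessing step (an illustration, not the patented code): the reward-to-go of step t is the sum of all rewards from step t to the end of the episode.

```python
import numpy as np

def rewards_to_go(rewards):
    """Suffix sums of the per-step rewards of one episode.

    rewards: 1-D array of rewards in time order.
    returns: array of the same length; element t is the sum of all
             rewards from step t to the end of the episode.
    """
    rtg = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

# Example with a short, hypothetical episode:
print(rewards_to_go(np.array([-0.5, -0.5, 50.0, -0.5, 500.0])))
# [548.5 549.  549.5 499.5 500. ]
```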
(2) Data embedding and position coding. The model input consists of three sequences: the reward-to-go, the state and the action. Because their feature dimensions differ, the three sequences are embedded separately, each converted by a fully connected layer into data of the same dimension. The three embedded feature vectors are then concatenated, position coding is performed, and the feature vectors generated from the sequence positions are added. The feature vectors obtained after embedding and position coding are the input of the model.
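The PyTorch sketch below illustrates one way this embedding and position coding could look; the class name, the 128-dimensional hidden size (taken from the detailed embodiment later in this document) and the use of a learned timestep embedding are assumptions, not a claim about the patented implementation.

```python
import torch
import torch.nn as nn

class TrajectoryEmbedding(nn.Module):
    """Embed reward-to-go, state and action into a common dimension and
    interleave them as (R_1, s_1, a_1, R_2, s_2, a_2, ...)."""
    def __init__(self, state_dim=7, act_dim=1, hidden=128, max_len=1000):
        super().__init__()
        self.embed_rtg = nn.Linear(1, hidden)            # reward-to-go is a scalar
        self.embed_state = nn.Linear(state_dim, hidden)
        self.embed_action = nn.Linear(act_dim, hidden)
        self.embed_time = nn.Embedding(max_len, hidden)  # position (timestep) code

    def forward(self, rtg, states, actions, timesteps):
        # rtg: (B, K, 1), states: (B, K, state_dim), actions: (B, K, act_dim)
        # timesteps: (B, K) integer step indices inside the episode
        t = self.embed_time(timesteps)
        r = self.embed_rtg(rtg) + t
        s = self.embed_state(states) + t
        a = self.embed_action(actions) + t
        # interleave along the time axis -> (B, 3K, hidden)
        return torch.stack([r, s, a], dim=2).reshape(rtg.size(0), -1, r.size(-1))

emb = TrajectoryEmbedding()
x = emb(torch.zeros(256, 20, 1), torch.zeros(256, 20, 7),
        torch.zeros(256, 20, 1), torch.zeros(256, 20, dtype=torch.long))
print(x.shape)  # torch.Size([256, 60, 128])
```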
(3) Network structure design. The main structure of the network uses the GPT-2 model and comprises 12 stacked Transformer decoder blocks, with the feature dimension of the model reduced from the original 768 to 256; it is otherwise consistent with the 12-layer GPT-2 model. No particular network structure is required here: any Transformer model based on the self-attention mechanism can be used for the sequence-to-sequence prediction in the present invention.
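As an illustration of how such a backbone could be instantiated, the sketch below uses the Hugging Face `transformers` package; configuration values other than the 12 layers and the 256-dimensional features are assumptions for the sketch.

```python
import torch
from transformers import GPT2Config, GPT2Model

config = GPT2Config(
    vocab_size=1,      # unused: trajectory embeddings are passed in directly
    n_layer=12,        # 12 stacked decoder blocks
    n_embd=256,        # feature dimension reduced from 768 to 256
    n_head=8,          # assumed head count
    n_positions=1024,
)
backbone = GPT2Model(config)

tokens = torch.zeros(4, 60, 256)                     # (batch, 3K, n_embd)
hidden = backbone(inputs_embeds=tokens).last_hidden_state
print(hidden.shape)                                  # torch.Size([4, 60, 256])
```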
(4) Loss function design. Since six discrete actions are output, a cross entropy function is used for training: a sequence of length K is input, the output is obtained from the Decision Transformer, and the output of the network has the same form as the input, again containing reward-to-go, state and action. The action item is selected to compute the cross entropy loss, and the label used in this process is the last action value in the sequence. To strike a balance between exploration and exploitation, an entropy-regularization term for the output actions (the right-hand term inside the brackets of formula (1)) is added to the loss function; it raises the entropy of the output actions during training so that different actions are selected, increasing the algorithm's ability to explore the environment.
$$ L = \frac{1}{N}\sum_{i=1}^{N}\left( -\sum_{c=1}^{M} y_{ic}\log p_{ic} + \alpha \sum_{c=1}^{M} p_{ic}\log p_{ic} \right) \tag{1} $$
In formula (1), N is the number of samples in a batch, M is the number of discrete actions, p_{ic} is the probability of action c output by the network for sample i, y_{ic} is a sign function (0 or 1) taking the value 1 if the true class of sample i equals c and 0 otherwise, and α is the entropy coefficient.
Without the entropy regularization term, the loss describes a classification problem: after the 20 sequence steps are input, the model predicts the action corresponding to the 20th step. However, the input also contains the reward-to-go, which can be regarded as playing the role of the action value Q in traditional reinforcement learning, so the Transformer network implicitly learns the action value function of the current state; this is the fundamental difference from imitation learning.
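A sketch of this loss, assuming the reading of formula (1) above: cross entropy over the six discrete actions minus α times the entropy of the predicted action distribution, so that minimizing the loss fits the action labels while keeping the entropy high.

```python
import torch
import torch.nn.functional as F

def dt_loss(action_logits, action_targets, alpha=0.1):
    # action_logits: (B, M) unnormalized scores for the predicted action
    # action_targets: (B,) integer labels (the action actually taken)
    ce = F.cross_entropy(action_logits, action_targets)
    log_probs = F.log_softmax(action_logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return ce - alpha * entropy   # low loss = fits the labels yet stays exploratory

logits = torch.randn(256, 6, requires_grad=True)
targets = torch.randint(0, 6, (256,))
print(dt_loss(logits, targets))
```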
In a second step, a reward function is set according to the characteristics of the simulation environment, and random experience is collected to train the model.
(1) Reward function design. The task scenario used here is to reach the intermediate point and the target point while avoiding obstacles. A large positive reward is therefore given when the agent reaches the intermediate point or the target point, a large negative reward is given when it hits an obstacle, a small penalty is given for moving towards the target point, and a somewhat larger penalty, still small but greater than the towards-target penalty, is given for moving away from the target point.
(2) Data collection with a random policy. In the absence of any data, a random policy is first used to sample the environment and obtain whole episodes of data for training.
(3) Training the network with the collected data. The collected data is divided into mini-batches and fed into the network, which is trained with the cross entropy function defined above.
(4) Interacting with the environment again to generate new experience. The trained network already has a preliminary understanding of the environment. During sampling, an initial reward-to-go value must be provided, and a suitable target value is computed from the characteristics of the environment and the reward settings. The sampled data is added to the buffer episode by episode.
(5) Steps (3) and (4) are repeated until a good policy is obtained.
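The alternation of steps (2)-(5) can be summarized by the skeleton below; every name in it is a hypothetical stub (the environment interaction and the gradient step are faked), so only the control flow is meant literally.

```python
import random

def collect_episode(policy=None):
    # stub: one episode as a list of (state, action, reward) tuples
    return [((0.0,) * 7, random.randrange(6), random.uniform(-1, 1)) for _ in range(50)]

def train_on_batch(batch):
    # stub: one gradient step on a mini-batch; returns a fake loss value
    return random.random()

buffer = [collect_episode() for _ in range(1000)]        # (2) random experience
for iteration in range(10):
    for _ in range(50):                                  # (3) train on mini-batches
        batch = random.sample(buffer, 256)
        loss = train_on_batch(batch)
    for _ in range(10):                                  # (4) collect new experience
        buffer.append(collect_episode(policy="trained network"))
# (5) the two phases alternate until the policy is good enough
```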
The invention constructs, for the first time, a Decision Transformer-based offline reinforcement learning algorithm for ocean path planning. The algorithm can produce an obstacle-avoiding policy that reaches multiple target points in a given ocean environment, and it converges even under partial observability where the ocean current information is unknown. Its advantages and positive effects are:
(1) The invention applies offline reinforcement learning and can converge to a good policy with a relatively small amount of data.
(2) An entropy regularization term is introduced into the loss function, giving the offline reinforcement learning algorithm a strong exploration capability.
(3) The method is tested in a simulation environment, does not require the observations to satisfy the Markov property, and can adapt to path planning tasks under partially observable conditions.
Drawings
FIG. 1 is a schematic diagram of the inputs and outputs of the Decision Transformer network in an embodiment of the present invention
FIG. 2 is the loss curve of the cross entropy function during training in an embodiment of the present invention
FIG. 3 is a path planning trajectory of the underwater vehicle agent simulated in the Python simulation environment in an embodiment of the present invention
FIG. 4 is the average reward value obtained per episode when the target reward value is 520 in an embodiment of the present invention
FIG. 5 is the average reward value obtained per episode when the target reward value is 1040 in an embodiment of the present invention
Detailed Description
In order to make the technical scheme of the invention clearer, the invention is further explained below with reference to the attached drawings. The invention is realized by the following steps:
In a first step, a Decision Transformer-based network structure is constructed, wherein the input of the network consists of three sequences (reward-to-go, state and action), each of length K, and the output is a prediction of the sequence.
(1) Data preprocessing. Sequences of length K are sampled from the data of a whole episode; each sequence contains the three data types reward-to-go, state and action. Here the batch size is 256 and the sampling length K is 20; the reward-to-go and the action are one-dimensional and the state is seven-dimensional, so the input reward-to-go and action tensors have dimension (256, 20, 1) and the input state tensor has dimension (256, 20, 7).
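A sketch of this sampling step (the helper name and episode layout are assumptions, not taken from the patent): length-K windows are cut from stored episodes to build batches with exactly the shapes quoted above.

```python
import numpy as np

def sample_batch(episodes, batch_size=256, K=20, state_dim=7):
    rtg = np.zeros((batch_size, K, 1), dtype=np.float32)
    states = np.zeros((batch_size, K, state_dim), dtype=np.float32)
    actions = np.zeros((batch_size, K, 1), dtype=np.float32)
    for b in range(batch_size):
        ep = episodes[np.random.randint(len(episodes))]          # one stored episode
        start = np.random.randint(0, len(ep["states"]) - K + 1)  # random window start
        rtg[b] = ep["rtg"][start:start + K, None]
        states[b] = ep["states"][start:start + K]
        actions[b] = ep["actions"][start:start + K, None]
    return rtg, states, actions

fake_ep = {"rtg": np.arange(100.0), "states": np.zeros((100, 7)), "actions": np.zeros(100)}
r, s, a = sample_batch([fake_ep])
print(r.shape, s.shape, a.shape)   # (256, 20, 1) (256, 20, 7) (256, 20, 1)
```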
(2) Data embedding and position coding. The preprocessed data passes through linear layers and the corresponding position codes are added; the three resulting streams, each of dimension (256, 20, 128), are concatenated along the second dimension to obtain a tensor of dimension (256, 60, 128). This is the input of the GPT-2 network.
(3) Network structure design. A Transformer model based on the self-attention mechanism is chosen to handle the sequence problem. Fig. 1 is a schematic diagram of the model, including the network inputs and outputs.
(4) Loss function design. The coefficient α controls how strongly the algorithm constrains the entropy of the network's output actions; it is set to the relatively small value 0.1, so that the network mainly minimizes the cross entropy loss while the randomness of the output actions is still increased during training.
(See formula (1) above.)
In a second step, a reward function is set according to the characteristics of the simulation environment, and random experience is collected to train the model.
(1) Reward function design. Positive rewards of 50 and 500 are given when the agent reaches the intermediate point and the target point respectively, a negative reward of -10 is given when an obstacle is hit, a smaller penalty of -0.5 is given when moving towards the target point, and a larger penalty of -1 is given when moving away from the target point.
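A compact sketch of this reward shaping; the boolean flags are hypothetical environment outputs, not part of the patent.

```python
def step_reward(hit_obstacle, reached_target, reached_intermediate, moved_towards_target):
    """Per-step reward following the values described in this embodiment."""
    if hit_obstacle:
        return -10.0
    if reached_target:
        return 500.0
    if reached_intermediate:
        return 50.0
    return -0.5 if moved_towards_target else -1.0
```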
(2) Data collection with a random policy. 1000 episodes of data were collected at random, each with a maximum of 1000 steps.
(3) Training the network with the collected data. Mini-batches of the collected data are fed into the network, which is trained with the cross entropy function defined above. FIG. 2 shows the loss curve of the cross entropy function during training.
(4) Interacting with the environment again, 10 new episodes of experience are generated and added to the buffer. Since the model outputs a probability for each discrete action, the action is sampled from the distribution over the six actions, which increases the exploratory behaviour of the network.
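For illustration, sampling the action from the predicted distribution rather than taking the arg-max could look like the sketch below, with random logits standing in for the network output.

```python
import torch

logits = torch.randn(6)                               # network output for the current step
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                                # stochastic choice among 6 actions
print(int(action), [round(p, 3) for p in dist.probs.tolist()])
```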
(5) Steps (3) and (4) are repeated until a good policy is obtained. Fig. 3 is a simulation picture obtained after the network has converged. The underwater vehicle starts from the lower starting point, passes through the intermediate target point and finally reaches the upper red target point, completing the given path planning task; the planned path avoids the eight preset obstacles and no collision occurs. The arrows in the figure indicate the direction of the ocean current.
(6) The initial reward-to-go of the Decision Transformer is set according to the maximum reward the environment can provide. In this environment the rewards for reaching the intermediate point and the target point are 20 and 500 respectively, so the maximum reward obtainable without any penalty, and hence the theoretical target reward, is 520. Fig. 4 shows the average reward obtained per episode when the expected reward is 520. The reward-to-go has a large influence on the performance of the algorithm, so in the path planning task of the present invention it can also be set to an unattainably large value, such as 1040, which positively encourages the agent to plan the desired path. Figure 5 shows the average reward obtained per episode when the initial reward-to-go is 1040.
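The conditioning on an initial target reward can be sketched as the rollout below; the `env` and `model` interfaces are hypothetical stand-ins for the python simulation environment and the trained network.

```python
def rollout(env, model, target_return=520.0, max_steps=1000):
    """Roll out the trained model conditioned on an initial reward-to-go."""
    state = env.reset()
    rtg, history = target_return, []
    for _ in range(max_steps):
        history.append((rtg, state))
        action = model.act(history)          # hypothetical inference call
        state, reward, done = env.step(action)
        rtg -= reward                        # remaining reward-to-go shrinks as rewards arrive
        if done:
            break
    return history
```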

Claims (5)

1. A reinforcement learning path planning algorithm in a marine environment based on a sequence model, wherein the method comprises:
(1) Simulating the ocean currents, obstacles and other elements of the environment with a python simulation environment, acquiring the position and velocity of the agent in real time as the state input, obtaining the output action through a Decision Transformer network, finding the policy that maximizes the return through continuous training, and completing the given path planning task;
(2) Preprocessing and storing the data, and embedding and position coding the data;
(3) Constructing a Decision Transformer-based network structure, wherein the input of the network consists of three sequences each of length K;
(4) Designing a loss function;
(5) Designing a reward function suitable for a path planning task;
(6) Training the network using the actions as labels.
2. The method for planning the offline reinforcement learning path in the complex marine environment according to claim 1, wherein: the data processing method in the step (2) comprises the following steps:
The collected data is stored episode by episode and input into the Transformer as sequences of reward-to-go, state and action; the Decision Transformer feeds the accumulated reward directly into the network as the reward-to-go, i.e. all rewards following the current state are summed to form the reward-to-go of the current state.
Three sequences, the reward-to-go, the state and the action, are input; because their feature dimensions differ, the three sequences are embedded separately and converted by fully connected layers into data of the same dimension. The three embedded feature vectors are concatenated, position coding is performed, and the feature vectors generated from the sequence positions are added; the feature vectors obtained after embedding and position coding are the input vectors of the model.
3. The method for planning the offline reinforcement learning path in the complex marine environment according to claim 1, wherein: the design and selection of the loss function in the step (5) comprises the following steps:
the selected loss function satisfies the conditions: and selecting a cross entropy function when the action output by the intelligent agent is a discrete action, and selecting a mean square error function when the action output by the intelligent agent is a continuous action.
The simulation environment used in the verification process outputs six discrete actions, so a cross entropy function is used for training, a sequence with the length of K is input, and the output is obtained after a precision transform, wherein the output of the network is the same data as the input and represents the reward, the state and the action. Selecting an action item to calculate a cross entropy loss function, wherein a label used in the process is the last action value in the sequence, in order to enable the algorithm to reach a balance between exploration and utilization, the entropy of an output action is added into the loss function, the entropy of the output action is increased in the training process to select different actions, and the capability of the algorithm for exploring the environment is increased.
4. The method for planning the offline reinforcement learning path in the complex marine environment according to claim 1, wherein: the design and selection of the reward function in the step (6) comprises the following steps:
The path planning task used by the invention is to reach the intermediate point and the target point while avoiding obstacles, so a larger positive reward is given when the agent reaches the intermediate point or the target point, a larger negative reward is given when an obstacle is encountered, a smaller penalty is given when moving towards the target point, and a smaller penalty, greater than the towards-target penalty, is given when moving away from the target point.
5. The method for planning the offline reinforcement learning path in the complex marine environment according to claim 1, wherein: the training of the neural network in the step (7) comprises the following steps:
step 1: data was collected using a random strategy. Under the condition of no data, firstly, sampling in an environment by using a random strategy to obtain a whole part of data for training;
step 2: the network is trained using the collected data. Dividing the collected data into a plurality of batches and sending the data into a network, and using the cross entropy function defined in the foregoing to train;
and step 3: interact with the environment again, resulting in a new experience. The trained network already preliminarily knows the environment, and experience obtained by sampling again is more targeted, wherein initial unobtained reward values need to be input during sampling, and the optimal reward values need to be calculated through environment characteristics and reward setting. Sampling to obtain data, and filling the data into a buffer area according to the local;
and 4, step 4: and (4) continuously repeating the step 2 and the step 3 until a better strategy is obtained.
CN202211118607.0A 2022-09-14 2022-09-14 Sequence model-based reinforcement learning path planning algorithm in marine environment Pending CN115470934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211118607.0A CN115470934A (en) 2022-09-14 2022-09-14 Sequence model-based reinforcement learning path planning algorithm in marine environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211118607.0A CN115470934A (en) 2022-09-14 2022-09-14 Sequence model-based reinforcement learning path planning algorithm in marine environment

Publications (1)

Publication Number Publication Date
CN115470934A true CN115470934A (en) 2022-12-13

Family

ID=84333643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211118607.0A Pending CN115470934A (en) 2022-09-14 2022-09-14 Sequence model-based reinforcement learning path planning algorithm in marine environment

Country Status (1)

Country Link
CN (1) CN115470934A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115790608A (en) * 2023-01-31 2023-03-14 天津大学 AUV path planning algorithm and device based on reinforcement learning
CN115790608B (en) * 2023-01-31 2023-05-30 天津大学 AUV path planning algorithm and device based on reinforcement learning
CN115946133A (en) * 2023-03-16 2023-04-11 季华实验室 Mechanical arm plug-in control method, device, equipment and medium based on reinforcement learning
CN116540701A (en) * 2023-04-19 2023-08-04 广州里工实业有限公司 Path planning method, system, device and storage medium
CN116540701B (en) * 2023-04-19 2024-03-05 广州里工实业有限公司 Path planning method, system, device and storage medium
CN116295449A (en) * 2023-05-25 2023-06-23 吉林大学 Method and device for indicating path of autonomous underwater vehicle
CN116295449B (en) * 2023-05-25 2023-09-12 吉林大学 Method and device for indicating path of autonomous underwater vehicle

Similar Documents

Publication Publication Date Title
CN115470934A (en) Sequence model-based reinforcement learning path planning algorithm in marine environment
CN113176776B (en) Unmanned ship weather self-adaptive obstacle avoidance method based on deep reinforcement learning
Jiang et al. Path planning for intelligent robots based on deep Q-learning with experience replay and heuristic knowledge
CN112819253A (en) Unmanned aerial vehicle obstacle avoidance and path planning device and method
CN113110509A (en) Warehousing system multi-robot path planning method based on deep reinforcement learning
CN112183742B (en) Neural network hybrid quantization method based on progressive quantization and Hessian information
CN114019370B (en) Motor fault detection method based on gray level image and lightweight CNN-SVM model
CN112906828A (en) Image classification method based on time domain coding and impulse neural network
CN115100574A (en) Action identification method and system based on fusion graph convolution network and Transformer network
CN111476285A (en) Training method of image classification model, image classification method and storage medium
CN101706888A (en) Method for predicting travel time
CN115375877A (en) Three-dimensional point cloud classification method and device based on channel attention mechanism
CN113894780B (en) Multi-robot cooperation countermeasure method, device, electronic equipment and storage medium
CN114626598A (en) Multi-modal trajectory prediction method based on semantic environment modeling
Nam et al. Sequential Semantic Generative Communication for Progressive Text-to-Image Generation
CN116109945B (en) Remote sensing image interpretation method based on ordered continuous learning
WO2023179609A1 (en) Data processing method and apparatus
CN115630566B (en) Data assimilation method and system based on deep learning and dynamic constraint
Shi et al. Motion planning for unmanned vehicle based on hybrid deep learning
CN116892932A (en) Navigation decision method combining curiosity mechanism and self-imitation learning
CN115063597A (en) Image identification method based on brain-like learning
Zhang et al. Reinforcement Learning from Demonstrations by Novel Interactive Expert and Application to Automatic Berthing Control Systems for Unmanned Surface Vessel
CN114065834A (en) Model training method, terminal device and computer storage medium
Szymak Using neuro-evolutionary-fuzzy method to control a swarm of unmanned underwater vehicles
Lan et al. Learning-based path planning algorithm in ocean currents for multi-glider

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Wen Jiabao

Inventor after: Dai Huiao

Inventor after: Yang Jiachen

Inventor after: Xiao Shuai

Inventor before: Yang Jiachen

Inventor before: Dai Huiao

Inventor before: Wen Jiabao

Inventor before: Xiao Shuai