CN116523154A - Model training method, route planning method and related devices - Google Patents

Model training method, route planning method and related devices

Info

Publication number
CN116523154A
CN116523154A
Authority
CN
China
Prior art keywords
model
training
state
trained
sailing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310286403.6A
Other languages
Chinese (zh)
Other versions
CN116523154B (en)
Inventor
吴阿丹
车涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest Institute of Eco Environment and Resources of CAS
Original Assignee
Northwest Institute of Eco Environment and Resources of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest Institute of Eco Environment and Resources of CAS filed Critical Northwest Institute of Eco Environment and Resources of CAS
Priority to CN202310286403.6A priority Critical patent/CN116523154B/en
Publication of CN116523154A publication Critical patent/CN116523154A/en
Application granted granted Critical
Publication of CN116523154B publication Critical patent/CN116523154B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q10/047Optimisation of routes or paths, e.g. travelling salesman problem

Abstract

The application provides a model training method, a route planning method and related devices, and relates to the field of machine learning. The method comprises: obtaining a model to be trained based on reinforcement learning; acquiring, through the model to be trained, an experience set generated by interaction between a training ship and the environment; and updating the model to be trained according to the experience set until the model to be trained meets the training conditions, so as to obtain the route planning model. Each historical experience in the experience set comprises an instant reward and a new sailing state obtained when the training ship executes a sailing action generated by the model to be trained. Since the instant reward includes an internal instant reward that is positively correlated with the novelty of the new sailing state, and the novelty characterizes the difference between the new sailing state and the conventional sailing state, the efficiency of exploring the environment can be improved when training the reinforcement learning model to formulate navigation routes.

Description

Model training method, route planning method and related devices
Technical Field
The present disclosure relates to the field of machine learning, and in particular, to a model training method, a route planning method, and related devices.
Background
The development potential of the Arctic region has attracted global attention. Against this background, carrying out prospective research on Arctic route planning and establishing a real-time intelligent route planning system for Arctic ice regions as soon as possible has very important application value. However, the remote and severe Arctic environment still poses great challenges to route planning in iced areas, such as icebergs, floating ice and snowstorms. Among these, sea ice is the key factor affecting routes, but its distribution varies greatly and irregularly both between years and within a year, which makes the selection of shipping routes uncertain.
Reinforcement learning is autonomous learning in which an agent learns in a "trial and error" manner, obtaining reward guidance by constantly interacting with the environment, with the goal of learning an optimal strategy so as to obtain the maximum return when making decisions. Accordingly, it has been proposed in the related art to generate a route planning model for planning routes by means of reinforcement learning. However, as the complexity of route planning increases, the scale of the state space and action space of the environment also increases greatly, so that traditional reinforcement learning methods suffer from poor exploration efficiency during interaction with the environment.
Disclosure of Invention
In order to overcome at least one defect in the prior art, the application provides a model training method, a route planning method and related devices, which are used to improve the efficiency of exploring the environment in the process of training a reinforcement learning model to formulate a navigation route. The method specifically comprises the following steps:
in a first aspect, the present application provides a model training method, the method comprising:
obtaining a model to be trained, wherein the model to be trained is a reinforcement learning model;
acquiring an experience set generated by interaction between a training ship and the environment through the model to be trained, wherein each historical experience in the experience set comprises instant rewards and new sailing states obtained by the training ship executing sailing actions generated by the model to be trained; the instant rewards include an internal instant reward positively correlated with a novelty of the new navigational state, the novelty characterizing a difference between the new navigational state and a conventional navigational state;
updating the model to be trained according to the experience set until the model to be trained meets training conditions, and obtaining the route planning model.
In a second aspect, the present application provides a method of route planning, the method comprising:
Determining a target ship, a starting point and an ending point of the target ship;
and planning a navigation route from the starting point to the end point for the target ship by using the route planning model trained by the model training method.
In a third aspect, the present application provides a model training apparatus, the apparatus comprising:
the experience generation module is used for acquiring a model to be trained, wherein the model to be trained is a reinforcement learning model;
the experience generation module is further used for obtaining an experience set generated by interaction between the training ship and the environment through the model to be trained, wherein each historical experience in the experience set comprises instant rewards and new sailing states obtained by the training ship executing sailing actions generated by the model to be trained; the instant rewards include an internal instant reward positively correlated with a novelty of the new navigational state, the novelty characterizing a difference between the new navigational state and a conventional navigational state;
and the model updating module is used for updating the model to be trained according to the experience set until the model to be trained meets the training conditions, and obtaining the route planning model.
In a fourth aspect, the present application provides a storage medium storing a computer program which, when executed by a processor, implements the model training method or the route planning method.
In a fifth aspect, the present application provides an electronic device, the electronic device comprising a processor and a memory, the memory storing a computer program, which when executed by the processor, implements the model training method or the route planning method.
Compared with the prior art, the application has the following beneficial effects:
In the model training method, the route planning method and the related devices provided by this embodiment, a model to be trained based on reinforcement learning is obtained; an experience set generated by interaction between a training ship and the environment is acquired through the model to be trained; and the model to be trained is updated according to the experience set until it meets the training conditions, so as to obtain the route planning model. Each historical experience in the experience set comprises an instant reward and a new sailing state obtained when the training ship executes a sailing action generated by the model to be trained. Since the instant reward includes an internal instant reward that is positively correlated with the novelty of the new sailing state, and the novelty characterizes the difference between the new sailing state and the conventional sailing state, the efficiency of exploring the environment can be improved when training the reinforcement learning model to formulate navigation routes.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a model training method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the relationship of internal/external instant rewards provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of a model training principle provided in an embodiment of the present application;
FIG. 4 is a schematic diagram showing the comparison of the effects of the scheme according to the embodiment of the present application;
FIG. 5 is a second schematic diagram showing the comparison of the effects of the scheme provided in the embodiment of the present application;
FIG. 6 is a schematic structural diagram of a model training device according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Icon: 101-an experience generation module; 102-a model update module; 201-a memory; 202-a processor; 203-a communication unit; 204-system bus.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
In the description of the present application, it should be noted that the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Based on the above statement, as introduced in the background art, the development potential of the Arctic region has attracted global attention. Against this background, carrying out prospective research on Arctic route planning and establishing a real-time intelligent route planning system for Arctic ice regions as soon as possible has very important application value.
Conventional path planning methods include the Dijkstra algorithm, the A* algorithm, the ant colony algorithm and the like. These algorithms solve the global route reachability problem, but only implement route planning for a static environment, i.e. the grid data are static; if the data change, the route must be recalculated. However, the actual Arctic environment is severe and variable, so that obstacles change in real time, for example sudden changes in regions of sea ice, snow and high winds. Therefore, conventional route planning algorithms cannot be combined with the external environment to achieve safe route planning in an unknown dynamic grid environment.
From the viewpoint of implementation approach and efficiency, the Dijkstra algorithm requires global map information for its running environment and uses a greedy strategy to determine the optimal path. If the map is very large and contains many nodes, its execution efficiency is low.
The A* algorithm improves on the Dijkstra algorithm's search by introducing an evaluation function, which can rank the nodes at each step of the search; however, it can only find all paths from the origin to a specific target point, or the shortest paths to certain points. Compared with the Dijkstra algorithm, the A* algorithm remarkably improves computational efficiency and is a typical heuristic algorithm, but its evaluation function requires a certain amount of prior experience to set its parameters. The natural environments of the individual legs of an Arctic ice-region route differ, and it is difficult for a general evaluation function to plan a route; function parameters need to be set for different areas, which involves a certain subjectivity. Meanwhile, on a complex map with many obstacles, when the number of searched nodes becomes too large, the performance of the A* algorithm drops sharply.
The ant colony algorithm has the characteristics of distributed computation, positive information feedback and heuristic search, and is essentially a heuristic global optimization algorithm among evolutionary algorithms; however, the positive feedback of pheromones requires long-term exploration, and the algorithm converges slowly.
Thus, these conventional methods have neither the ability to perceive and cope with changes in the environment as a whole, nor the ability to learn and transfer. Especially in unknown environments, once small changes in the environment occur, the environmental knowledge previously learned by the algorithm from samples is no longer valid, and correct movement decisions cannot be made.
In order to overcome the above problems, route planning by reinforcement learning has been proposed in the related art. Reinforcement learning does not use prior training samples but learns based on the principle of rewards and penalties: the agent learns autonomously in a "trial and error" manner and obtains reward-guided actions by constantly interacting with the environment, with the goal of learning an optimal strategy so as to obtain the maximum return when making decisions. The currently mainstream deep reinforcement learning algorithms mainly fall into the following three categories: value-based methods (Value Based Algorithms), policy gradient methods (Policy Gradient Algorithms, PG), and methods based on the Actor-Critic framework (Actor-Critic Algorithms).
(1) Value Based Algorithms: value-based methods derive the optimal strategy in a reinforcement learning problem by learning a value function of the states s and actions a in the scene. Such methods calculate the value Q of all actions for each state in the scene during training, and at decision time select the action with the largest Q value for indirect decision making; representative methods include the Sarsa algorithm, the Q-learning algorithm and the DQN algorithm. However, such methods rely on random policy exploration, which is inefficient and cannot be applied to continuous action problems.
(2) Policy Gradient (PG) Algorithms: the PG method directly learns a policy function of the state s, and can directly generate the optimal action through the policy function in different states; a representative method is the REINFORCE algorithm.
(3) Actor-Critic Algorithms: the Actor-Critic algorithm combines the advantages of value-based and policy gradient methods, learning a value function and a policy function at the same time. It generates the optimal action through policy-function decisions and optimizes the policy through the value function according to the rewards obtained; representative methods include the A3C (Asynchronous Advantage Actor-Critic) algorithm, the Deep Deterministic Policy Gradient (DDPG) algorithm, the Proximal Policy Optimization (PPO) algorithm, and the like.
Research that directly applies reinforcement learning to the route planning problem at the present stage suffers from small scene scale and low scene complexity. As scene complexity increases, directly applying deep reinforcement learning algorithms such as DQN to the route planning problem makes it difficult to effectively learn the feature representation and state transitions of the environment, so complex route planning problems cannot be solved, and problems such as overly long exploration times and non-converging policy functions arise. Overall, the learning strategy of value-based methods is single and depends on random policy learning, and policy updates fluctuate greatly during training, so it is difficult to adapt to the requirements of complex route scenes. Policy gradient methods, although capable of handling continuous actions, require sampling by the Monte Carlo method after acquiring the trajectory of the whole route, which greatly reduces sample utilization and makes effective application difficult. Actor-Critic methods combine the advantages of both: they make decisions according to the probability distribution of the policy function, so the policies are more diverse; and they optimize the agent's policy through the learned value function, which greatly increases sample utilization, so their application value is higher.
All three types of reinforcement learning methods can solve the route planning problem to different degrees, but all of them learn a value function or policy function by continuously exploring the environment and using the feedback obtained from it, so exploration efficiency determines training efficiency and training results. When the complexity of the problem increases and the scale of the state space and action space of the environment grows greatly, it is difficult to explore the optimal strategy for the route planning problem in such a huge space by directly using traditional reinforcement learning methods. The present application aims to solve the problems that existing deep reinforcement learning algorithms have low training and exploration efficiency in route planning and cannot converge in complex route planning scenes.
Based on the discovery of the above technical problems, the inventors have made creative efforts to propose the following technical solutions to solve or improve the above problems. It should be noted that the drawbacks of the above prior-art solutions were all identified by the inventors after practice and careful study; therefore, the discovery process of the above problems and the solutions proposed hereinafter in the embodiments of the present application should be regarded as contributions made by the inventors in the course of the invention, and should not be construed as content known to those skilled in the art.
In view of this, in order to overcome at least one of the shortcomings in the prior art, the present embodiment provides a model training method. In the method, a model to be trained is obtained, wherein the model to be trained is a reinforcement learning model; acquiring an experience set generated by interaction between a training ship and the environment through a model to be trained; updating the model to be trained according to the experience set until the model to be trained meets the training conditions, and obtaining the route planning model. Each historical experience in the experience set comprises an instant reward obtained by training the ship to execute the sailing action generated by the model to be trained and a new sailing state. Since the instant rewards include internal instant rewards that are positively correlated with the novelty of the new navigational state and that characterize the difference between the new navigational state and the conventional navigational state, the efficiency of exploring the environment can be improved in training the reinforcement learning model to formulate navigational routes.
The electronic device implementing the method may be any device capable of providing sufficient computing power. For example, the electronic device may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, servers, and the like.
When the electronic device is a server, the server set may be centralized or distributed (e.g., the server may be a distributed system). In some embodiments, the server may be local or remote to the user terminal. In some embodiments, the server may be implemented on a cloud platform; by way of example only, the Cloud platform may include a private Cloud, public Cloud, hybrid Cloud, community Cloud (Community Cloud), distributed Cloud, cross-Cloud (Inter-Cloud), multi-Cloud (Multi-Cloud), or the like, or any combination thereof. In some embodiments, the server may be implemented on an electronic device having one or more components.
In order to make the implementation easier, the steps of the model training method provided in this embodiment are described in detail below with reference to fig. 1. It should be understood that the operations of the flow diagrams may be performed out of order and that steps that have no logical context may be performed in reverse order or concurrently. Moreover, one or more other operations may be added to the flow diagrams and one or more operations may be removed from the flow diagrams as directed by those skilled in the art. As shown in fig. 1, the method includes:
S101, obtaining a model to be trained.
The model to be trained is a reinforcement learning model.
S102, obtaining an experience set generated by interaction between a training ship and the environment through a model to be trained.
In this embodiment, the training ship is used as an intelligent agent, and interacts with the navigation environment to generate an experience set, wherein each history experience in the experience set comprises an instant reward and a new navigation state obtained by the training ship executing the navigation action generated by the model to be trained; the instant rewards include internal instant rewards that are positively correlated with the novelty of the new navigational state, which characterizes the difference between the new navigational state and the conventional navigational state.
It should be understood here that, to address the problem that optimizing the policy only through external environment rewards gives the reinforcement learning algorithm insufficient exploration capability, this embodiment introduces an internal instant reward based on state novelty on top of the external instant reward, so as to encourage the training ship to actively explore unknown states in the environment.
The intuitive idea is that the training ship is trained with collected samples; for the training ship, the more unfamiliar and novel a sample is, the less adequately the policy network has learned that sample, and the more it needs to be learned. In other words, for the sailing state s_t at time t, among all the states the training ship has visited, the smaller the number of states similar to s_t, the higher its novelty; and the more novel the state, the higher the corresponding exploration reward the training ship obtains, so exploration is encouraged. In this embodiment, to measure the novelty of the new sailing state of the training ship, the method compares the new sailing state with the conventional sailing state. The specific implementation of step S102 includes:
s102-1, inputting the current sailing state of the training ship into a model to be trained, and obtaining the sailing action which the training ship should take in the current sailing state.
S102-2, obtaining the internal instant rewards according to the difference between the new sailing state and the conventional sailing state after the training ship executes the sailing action.
In an alternative embodiment, the electronic device obtains a new sailing state of the training ship after performing the sailing action; predicting to obtain a conventional sailing state according to the current sailing state and sailing action of the training ship; and obtaining the internal instant rewards according to the difference between the new sailing state and the conventional sailing state.
In order to measure the difference between the new navigation state and the normal navigation state, the electronic device in the embodiment may vectorize the new navigation state to obtain a first state vector; vectorizing the conventional navigation state to obtain a second state vector; and obtaining the internal instant rewards according to the Euclidean distance between the first state vector and the second state vector.
Illustratively, as shown in FIG. 2, E in the figure represents the reinforcement learning environment; s_t, a_t and π in turn represent the state, the action and the policy function in a standard Markov decision problem; MSE represents the mean-square error function used to synchronously update the prediction network P and to calculate the internal instant reward generated by the sailing action during training of the model to be trained. The instant reward calculation process is divided into two stages, which yield the external instant reward and the internal instant reward respectively:
(1) Calculating the external instant reward r^E: assume that at time t the current sailing state of the training ship, acting as the agent, is s_t. An action a_t is generated according to the policy function π fitted by the model to be trained; after the agent performs a_t and reaches the new state s_{t+1}, the external environment E feeds back the external instant reward r^E.
(2) Calculating the internal instant reward r^I: in state s_t, the training ship executes action a_t and reaches the new state s_{t+1}. With continued reference to FIG. 2, given the current sailing state s_t of the training ship and the action a_t, the prediction network P provided in this embodiment predicts the conventional sailing state that the agent is most likely to reach after executing action a_t in the current sailing state s_t, and the prediction is expressed as s'_{t+1}:
s'_{t+1} = f(s_t, a_t; θ_P);
The prediction network receives the state s_t and the action a_t, and outputs the most likely conventional sailing state s'_{t+1}. It will be appreciated that the conventional sailing state predicted by the prediction network represents the state the training ship is most likely to reach; the more this most likely state differs from the new sailing state that the training ship actually reaches after performing the sailing action, the more novel the new sailing state is.
To quantify the difference between the predicted conventional sailing state and the new sailing state, the electronic device may extract effective characteristic information from each of the conventional sailing state and the new sailing state. For example, the electronic device may encode the current sailing state s_t into a high-dimensional feature space through a deep neural network, denoted φ(s_t). Thus, the first state vector of the new sailing state may be expressed as φ(s_{t+1}), and the second state vector of the conventional sailing state may be expressed as:
φ(s'_{t+1}) = f(φ(s_t), a_t; θ_P);
The optimization objective of the prediction network is to minimize the mean square error (MSE) between the predicted conventional sailing state φ(s'_{t+1}) and the new sailing state φ(s_{t+1}), i.e. MSE(φ(s'_{t+1}), φ(s_{t+1})).
It can be seen from the definition of the prediction network that the larger the difference between the output of the prediction network and the actual new sailing state, the higher the degree of novelty of that state, and the higher the corresponding internal instant exploration reward; that is, the internal instant reward r^I is proportional to the distance between the prediction network's output and the real state:
r^I ∝ dis(φ(s'_{t+1}), φ(s_{t+1}));
This embodiment may take the Euclidean distance between the two as the internal instant reward r^I.
Thus, by introducing the internal instant reward r^I, the training ship's capacity to explore a complex environment during reinforcement learning training can be effectively increased.
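For clarity, the following sketch shows how the prediction network P and the Euclidean-distance novelty measure described above might be implemented; the network sizes, class names (StateEncoder, PredictionNet) and the one-hot action encoding are illustrative assumptions rather than details fixed by this embodiment.

```python
# Sketch of the novelty-based internal reward: a prediction network P forecasts the
# feature of the "conventional" next state, and the Euclidean distance to the feature
# of the actually reached state is used as r_I. Sizes and names are assumptions.
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Encodes a raw sailing state into a feature vector phi(s)."""
    def __init__(self, state_dim: int, feat_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, feat_dim))

    def forward(self, s):
        return self.net(s)

class PredictionNet(nn.Module):
    """Predicts phi(s'_{t+1}) from (phi(s_t), a_t)."""
    def __init__(self, feat_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + action_dim, 128), nn.ReLU(),
                                 nn.Linear(128, feat_dim))

    def forward(self, phi_s, a_onehot):
        return self.net(torch.cat([phi_s, a_onehot], dim=-1))

def internal_reward(encoder, predictor, s_t, a_onehot, s_next):
    """Returns (r_I, MSE loss used to update the prediction network)."""
    with torch.no_grad():
        phi_next = encoder(s_next)                        # phi(s_{t+1})
    phi_pred = predictor(encoder(s_t), a_onehot)          # phi(s'_{t+1})
    r_int = torch.norm(phi_pred.detach() - phi_next, dim=-1)   # Euclidean novelty measure
    pred_loss = ((phi_pred - phi_next) ** 2).mean()             # MSE for updating P
    return r_int, pred_loss
```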
S102-3, evaluating the new sailing state according to a preset rewarding evaluation rule to obtain external instant rewards.
Before describing the preset reward evaluation rule in detail, the state space and action space involved in this embodiment are described. The target navigation area of the Arctic is divided into a plurality of grids, and navigation risk (Risk Index Outcome, RIO), weather and topography elements are further assigned to each grid on the basis of the constructed grid, so that the Arctic navigation environment can be described more accurately. In this embodiment, the RIO calculated with the POLARIS model represents the spatial distribution of the risk of ships sailing in the Arctic, and the data used are sea ice thickness and concentration data. Each grid is navigable only when the following three conditions are met simultaneously:
(1) RIO is greater than 0.
(2) The wind speed is less than 20 m/s (an empirical value determined by the captain of the polar research center's Xuelong ("Snow Dragon") icebreaker: the polar navigation wind speed should be less than 40 knots, i.e. about 20 m/s).
(3) The water depth is greater than 13 m (the shallowest section of the Arctic Northeast Passage is the Sannikov Strait, with a water depth of 13 m; Xu and Yang, 2020).
Grid points meeting all three conditions have their attribute value set to 1, indicating that they are navigable. Grids that fail any one of the conditions have their attribute set to -1, indicating that they are not navigable, and land grids have attribute 0. In this way, the dynamically passable area and the obstacles in the target navigation area are determined.
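A minimal sketch of this grid-attribute assignment is given below, assuming the RIO, wind speed, water depth and land mask are available as arrays of the same shape; the thresholds follow the three conditions listed above, while the array names are illustrative.

```python
# Sketch of the grid-attribute assignment (thresholds from the text: RIO > 0,
# wind speed < 20 m/s, water depth > 13 m). Array names and the land mask are assumptions.
import numpy as np

def build_grid_attributes(rio, wind_speed, water_depth, is_land):
    """Return a grid of 1 (navigable), -1 (non-navigable) or 0 (land)."""
    attr = np.full(rio.shape, -1, dtype=np.int8)                     # default: not navigable
    navigable = (rio > 0) & (wind_speed < 20.0) & (water_depth > 13.0)
    attr[navigable] = 1
    attr[is_land] = 0                                                # land cells keep attribute 0
    return attr
```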
Assume that the state space S_t of the training ship, acting as the agent, at time t includes the coordinate information P_t of the training ship, the distance information D_t, and the action mask M_t for navigation actions:

S_t = {P_t, D_t, M_t};

where the coordinate information P_t includes the ship's current coordinates and the longitude and latitude coordinates of the start point and end point; the distance information D_t represents the longitude and latitude distance from the current coordinates to the end point; and the action mask M_t consists of a set of elements containing only 0 and 1, used to indicate whether the training ship can move in each direction within a certain range of its current coordinates, being set to 1 where movement is possible and 0 where it is not.
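A short sketch of how the state S_t = {P_t, D_t, M_t} might be assembled into a flat vector for the policy network follows; the exact field layout is an illustrative assumption.

```python
# Sketch of state assembly: concatenate coordinate information P_t, distance
# information D_t and the action mask M_t. Field layout is an assumption.
import numpy as np

def build_state(current_xy, start_xy, end_xy, action_mask):
    p_t = np.concatenate([current_xy, start_xy, end_xy])       # coordinate information P_t
    d_t = np.asarray(end_xy) - np.asarray(current_xy)          # lon/lat distance to the end point D_t
    return np.concatenate([p_t, d_t, action_mask]).astype(np.float32)
```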
For the action space of the training ship, this embodiment models the action space at time t as a discretized action set A_t = {a_e, a_s, a_w, a_n}, i.e. the agent moves east, south, west or north from its current coordinates. At each decision the agent selects one action from the action set A_t and moves a fixed unit distance. It will be appreciated that the selectable actions of the training ship differ in different states, and such an action space design makes the agent's decision making considerably harder. Therefore, this embodiment introduces a mask carrying prior information for the actions and prevents actions that cannot be performed from being selected. When the agent is in the current state s_t, the probability that the i-th action a_i is selected from the action space is proportional to P(a_i)·Mask(a_i), where P(a_i) represents the original action probability output by the model to be trained for action a_i, and Mask(a_i) is the mask indicating whether the corresponding action is selectable, being 1 when selectable and 0 when not. Thus, for actions that cannot be performed, the probability of being selected is set to 0 by the mask.
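The following sketch illustrates masked action selection; the renormalisation of the remaining probabilities before sampling is an assumption, since the text above only states that masked actions receive probability 0.

```python
# Sketch of action masking during decision making: invalid actions get probability 0
# and the remaining probabilities are renormalised before sampling (assumed step).
import torch

def sample_masked_action(action_logits, action_mask):
    """action_logits: raw policy output; action_mask: 1 = selectable, 0 = blocked."""
    probs = torch.softmax(action_logits, dim=-1) * action_mask          # P(a_i) * Mask(a_i)
    probs = probs / probs.sum(dim=-1, keepdim=True).clamp_min(1e-8)     # renormalise
    return torch.distributions.Categorical(probs=probs).sample()

# Example: four discrete moves {east, south, west, north}; "west" is blocked.
logits = torch.tensor([0.2, 1.5, -0.3, 0.7])
mask = torch.tensor([1.0, 1.0, 0.0, 1.0])
action = sample_masked_action(logits, mask)
```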
Based on the state space and action space designed above, assume that at time t the training ship is in state s_t, selects and executes the sailing action a_t from the action space, and reaches the new sailing state s_{t+1}. The external instant reward obtained at this time is denoted r_t^E and consists of a basic distance reward r_{t,1}, a collision penalty r_{t,2} and a success reward r_{t,3}. The components of the external instant reward are detailed below:
(1) The basic distance reward mainly comprises two parts: the Euclidean distance between the current coordinate point and the end point, and the Euclidean distance between the current coordinate point and the start point. Here X_p represents the current coordinates of the training ship, X_e and X_s represent the coordinates of the end point and the start point of the route respectively, and c_1 and c_2 represent the weight coefficients of the two reward components.
(2) Collision penalty and success reward: when the training ship collides with an obstacle during sailing, or reaches the target point, it obtains a fixed negative or positive reward respectively, which guides the training ship to avoid obstacles and to advance. The two terms r_{t,2} and r_{t,3} can be expressed as:

r_{t,2} = -M in the case of collision, and 0 in the case of no_collision;
r_{t,3} = N in the case of success, and 0 in the case of no_success;

where collision and no_collision respectively indicate that a collision has or has not occurred, so that the agent obtains a reward of -M when it encounters an obstacle; success and no_success respectively represent the state of successfully reaching the target point and of not reaching it, so that a reward of N is obtained when the target point is reached. The specific values of M and N can be adaptively adjusted according to requirements, and this embodiment does not specifically limit them.
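A sketch of the external instant reward assembled from the three terms above is given below; the additive combination, the sign convention of the distance term and the constants c_1, c_2, M, N are illustrative assumptions.

```python
# Sketch of the external instant reward r_E = r1 + r2 + r3. The sign convention of the
# distance term (rewarding progress toward the end point and away from the start) and
# the constants c1, c2, M, N are assumptions, not values fixed by the text.
import numpy as np

def external_reward(x_p, x_s, x_e, collided, reached_goal,
                    c1=1.0, c2=0.1, M=10.0, N=100.0):
    d_end = np.linalg.norm(np.asarray(x_p) - np.asarray(x_e))    # distance to the end point
    d_start = np.linalg.norm(np.asarray(x_p) - np.asarray(x_s))  # distance from the start point
    r1 = -c1 * d_end + c2 * d_start                              # basic distance reward (assumed form)
    r2 = -M if collided else 0.0                                 # collision penalty
    r3 = N if reached_goal else 0.0                              # success reward
    return r1 + r2 + r3
```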
In the above implementation, the internal instant reward and the external instant reward have been described; step S102 further includes:
s102-4, obtaining the instant rewards after the training ship executes the sailing action according to the internal instant rewards and the external instant rewards.
In an alternative embodiment, the electronic device obtains weights of the internal instant rewards and the external instant rewards in a current training period respectively; the instant rewards are obtained according to the weights of the internal instant rewards and the external instant rewards in the current training period, and the mathematical expression can be expressed as follows:
r_t = α·r^I + r^E;
where r_t denotes the instant reward, r^I denotes the internal instant reward, α > 0 denotes the weight of the internal instant reward, and r^E denotes the external instant reward, whose weight in the expression is set to 1.
It has further been found that introducing an internal instant reward based on novelty can effectively improve the efficiency and performance of the agent's exploration. However, during training, the weight coefficient of the internal instant reward depends heavily on manual setting, and in most practical problems a fixed weight often cannot effectively balance the internal and external instant rewards. That is, if the weight is designed too small, the exploration reward cannot noticeably improve the policy; if it is designed too large, the policy network updates become unstable and high-variance, and may even fail to converge.
Furthermore, the optimal weight coefficient tends to differ at different stages of training. For example, in the early stage of training, the training ship acting as the agent should be encouraged to explore more; as training proceeds, the importance of exploration gradually decreases. Therefore, it is important to choose an appropriate, effective weight for the internal instant reward at different stages of model training.
In view of this, to improve the exploration ability of the agent and to ensure the stability of algorithm training in long-sequence route planning problems, this embodiment provides a mechanism for adaptively adjusting the weight of the introduced internal instant reward relative to the external instant reward.
In an alternative embodiment, the electronic device obtains the average internal instant reward within a preset training period; obtains a weight adjustment coefficient according to the difference between the average internal instant reward and the target expected reward; and adjusts the weight of the previous training period according to the weight adjustment coefficient to obtain the weight of the internal instant reward in the current training period.
It should be understood here that, before each parameter update of the model to be trained, a certain amount of history experience needs to be collected in the current training period, and when the history experience of the current training period is collected, the weight of the previous training period is updated to obtain the weight of the current training period, so as to calculate the instant rewards in each history experience; therefore, the calculation expression corresponding to the weight of the current training period is:
α_new = α_old · k;

where α_new denotes the weight of the current training period; α_old denotes the weight of the previous training period; k denotes the weight adjustment coefficient, which is derived from the base update amplitude β and the difference between the average internal instant reward and the target expected reward r_target. The target expected reward is obtained by collecting and processing the external instant rewards of environmental navigation tracks in advance in the early stage of training, so that during training of the model to be trained the expected internal instant reward always fluctuates around r_target; for example, if the external instant rewards of the environmental navigation tracks collected and processed in advance lie in the range -1 to 0, r_target can be designed as ((-1+0)/2) × 0.1 = -0.05. The average internal instant reward r̄^I is computed as follows: assuming the number of preset training periods is N and the step length in each period is T, the internal instant rewards r^I_{n,t} within each period are averaged to give

r̄^I = (1/(N·T)) · Σ_{n=1}^{N} Σ_{t=1}^{T} r^I_{n,t}.
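The following sketch illustrates one plausible form of the adaptive weight update; since the exact update formula is not fully legible in the source text, the multiplicative rule with base amplitude β shown here is an assumption consistent with the description above.

```python
# Sketch of the adaptive weight update for the internal reward: raise alpha when the
# average internal reward falls below the target, lower it otherwise. The multiplicative
# form with base amplitude beta is an assumption, not a formula from the source text.
import numpy as np

def update_alpha(alpha_old, internal_rewards, r_target, beta=0.01):
    """internal_rewards: array of shape (N, T) collected over N periods of T steps."""
    r_mean = np.mean(internal_rewards)                  # average internal instant reward
    k = 1.0 + beta * np.sign(r_target - r_mean)         # weight adjustment coefficient (assumed form)
    return alpha_old * k

def instant_reward(r_internal, r_external, alpha):
    """r_t = alpha * r_I + r_E (weight of the external reward fixed to 1)."""
    return alpha * r_internal + r_external
```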
With continued reference to FIG. 1, the model training method further includes, based on the historical experience collected in the implementation described above:
and S103, updating the model to be trained according to the experience set until the model to be trained meets the training conditions, and obtaining the route planning model.
In an alternative embodiment, the model to be trained is an improved proximal policy optimization algorithm model. It should be understood that the proximal policy optimization algorithm model is a reinforcement learning method based on the Actor-Critic framework; it has the advantages of stable training, strong anti-interference capability and easy convergence, and can obtain an excellent learning effect. The proximal policy optimization algorithm model uses the Actor-Critic framework, representing the agent's value function with a value network based on a deep neural network and its policy function with a policy network based on a deep neural network. The value function in the proximal policy optimization algorithm is used to evaluate the current state and to score the policy network according to the quality of the value, thereby optimizing the policy network parameters; it can be expressed as:
v_π(s) = E[r_{t+1} + γ·G_{t+1} | S_t = s];
where r_{t+1} represents the instant reward obtained after the training ship executes the sailing action at time t, and G_{t+1} represents the cumulative reward:

G_{t+1} = r_{t+2} + γ·r_{t+3} + γ²·r_{t+4} + … = Σ_{k≥0} γ^k · r_{t+k+2};
the cumulative rewards represent cumulative sums of rewards obtained by the training ship from the time t+1 to the end of the environment, and the fitted strategy function is used for outputting navigation actions to be executed in the corresponding states according to the states of the environment input, and the aim is to obtain the final cumulative rewards by optimizing the strategy.
As shown in fig. 3, during training of the improved proximal policy optimization algorithm model, the agent continuously interacts with the environment simulating the channel until an experience set with N historical experiences is obtained. In the i-th interaction with the environment, the internal instant reward and the external instant reward are calculated, and the instant reward of the i-th interaction is obtained from them, thereby yielding a historical experience ([s_i, a_i, r_i], s_{i+1}). Finally, the experience set collected through interaction is sampled, the network parameters are optimized through the corresponding value loss function and policy loss function, and the above steps are repeated until the algorithm converges; the converged policy network can then be used to generate the optimal route.
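A sketch of this interaction loop is shown below; the environment interface (reset/step) and the helper names are assumptions used only for illustration.

```python
# Sketch of the interaction loop in Fig. 3: the agent interacts with the simulated
# channel environment until N historical experiences ([s_i, a_i, r_i], s_{i+1}) are
# collected. The env/policy/reward_model interfaces are assumptions.
def collect_experience(env, policy, reward_model, alpha, n_experiences):
    experiences, s = [], env.reset()
    while len(experiences) < n_experiences:
        a = policy.act(s)                                # navigation action from the policy
        s_next, r_ext, done = env.step(a)                # external instant reward from E
        r_int = reward_model.internal_reward(s, a, s_next)
        r = alpha * r_int + r_ext                        # combined instant reward
        experiences.append(((s, a, r), s_next))
        s = env.reset() if done else s_next
    return experiences
```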
Similarly to the proximal policy optimization algorithm model, the value loss function corresponding to the improved proximal policy optimization algorithm model is the mean square error of the advantage function:

L_t^{VF} = MSE(A_t) = E[(r + γ·V(s_{t+1}) − V(s_t))²];

where A_t = r + γ·V(s_{t+1}) − V(s_t) represents the estimate of the advantage function of the training ship, acting as the agent, at time t, and characterizes how good executing action a_t in the current state s_t is relative to the average policy. V(s_t) denotes the output of the value function at s_t, r denotes the reward obtained by the training ship, and γ denotes the reward discount coefficient in reinforcement learning, which is less than 1.
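A minimal sketch of the value loss follows, using the TD error as the advantage estimate as described above; the value-network interface is an assumption.

```python
# Sketch of the value loss: the TD error r + gamma*V(s_{t+1}) - V(s_t) serves as the
# advantage estimate A_t, and the value network is trained on its mean square.
import torch

def value_loss(value_net, s_t, s_next, r, gamma=0.99):
    with torch.no_grad():
        target = r + gamma * value_net(s_next).squeeze(-1)    # r + gamma * V(s_{t+1})
    v = value_net(s_t).squeeze(-1)                            # V(s_t)
    advantage = target - v                                    # A_t estimate
    return (advantage ** 2).mean(), advantage.detach()
```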
Similarly to the proximal policy optimization algorithm model, the policy loss function corresponding to the improved proximal policy optimization algorithm model is:

L^{CLIP}(θ) = E_t[min(r_t(θ)·A_t, clip(r_t(θ), 1−ε, 1+ε)·A_t)];

where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) is the importance weight, representing the ratio between the new and old policies of the training ship, acting as the agent, at time t. clip(r_t(θ), 1−ε, 1+ε) means that the value of r_t(θ) is limited to the interval [1−ε, 1+ε], which ensures the stability of policy convergence. Thus, after at least one training period, a route planning model meeting the training conditions is obtained. It should be understood that this embodiment introduces an internal instant reward on the basis of the proximal policy optimization algorithm model to obtain the improved proximal policy optimization algorithm model; the remaining structure and training details of the improved model are consistent with those of the proximal policy optimization algorithm model and are not described again in this embodiment.
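A minimal sketch of the clipped surrogate objective follows; log-probabilities are used for the ratio r_t(θ), which is the usual numerically stable formulation.

```python
# Sketch of the clipped surrogate objective: r_t(theta) is the ratio of new to old
# policy probabilities for the taken action, clipped to [1 - eps, 1 + eps].
import torch

def clipped_policy_loss(new_log_prob, old_log_prob, advantage, eps=0.2):
    ratio = torch.exp(new_log_prob - old_log_prob)            # r_t(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()              # maximise L^CLIP -> minimise negative
```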
As shown in fig. 4 and 5, the improved proximal policy optimization algorithm model is referred to as the AEPPO algorithm. An Arctic Northeast Passage verification scene was constructed and trained with the DQN algorithm, the PPO algorithm and the AEPPO algorithm respectively; the reward curves and step-change curves during training are shown in fig. 4 and fig. 5. It can be seen that there is a large gap between the training reward and episode-length curves of the three different algorithms: the reward curve of the AEPPO algorithm still rises and converges quickly, while the reward increases of the DQN and PPO algorithms are noticeably slower.
Further, the three trained algorithm network models were used for prediction to obtain predicted navigation paths, and the traditional A* algorithm was added for comparison, giving the training time, prediction time and prediction results of the four algorithms shown in the following table:
algorithm Average training time Average prediction time Predicting navigation points
A star / 2.13s 419
DQN 5m11s 0.203s 425
PPO 2m23s 0.218s 425
AEPPO 2m16s 0.237 422
It can be seen that all three reinforcement learning algorithms complete training within about 5 minutes, with the PPO and AEPPO algorithms requiring shorter training times (within 3 minutes). The numbers of predicted navigation points of the three reinforcement learning algorithms are 425, 425 and 422 respectively, and their prediction times are essentially the same (about 0.2 seconds). Among them, the average navigation distance obtained by the AEPPO algorithm proposed in this application is closest to that of the A* algorithm, and its prediction result is the best.
Based on the above model training method, this embodiment also provides a route planning method. In this method, the electronic device implementing the method determines a target ship and the starting point and end point of the target ship; then, the route planning model trained by the above model training method plans a navigation route from the starting point to the end point for the target ship.
Based on the same inventive concept as the model training method provided in the present embodiment, the present embodiment also provides a model training apparatus including at least one software functional module that may be stored in a memory or solidified in an electronic device in a software form. A processor in the electronic device is configured to execute the executable modules stored in the memory. For example, a software function module included in the model training apparatus, a computer program, and the like. Referring to fig. 6, functionally divided, the model training apparatus may include:
the experience generation module 101 is configured to obtain a model to be trained, where the model to be trained is a reinforcement learning model;
the experience generation module 101 is further configured to obtain, through the model to be trained, an experience set generated by interaction between the training ship and the environment, where each historical experience in the experience set includes an instant reward and a new sailing state obtained by the training ship executing the sailing action generated by the model to be trained; the instant rewards include internal instant rewards positively correlated with the novelty of the new navigational state, the novelty characterizing the difference between the new navigational state and the conventional navigational state;
The model updating module 102 is configured to update the model to be trained according to the experience set until the model to be trained meets the training condition, and obtain the route planning model.
In the present embodiment, the experience generating module 101 is used to implement steps S101 and S102 in fig. 1, and for a detailed description of the experience generating module 101, reference may be made to a detailed description of steps S101 and S102. The model updating module 102 is used to implement step S103 in fig. 1, and a detailed description of the model updating module 102 may be referred to as a detailed description of step S103. In addition, it should be noted that, since the model training method provided in the present embodiment has the same inventive concept, the above experience generation module 101 and model update module 102 may also be used to implement other steps or sub-steps of the method, which is not specifically limited in this embodiment.
In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
It should also be appreciated that the above embodiments, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored on a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application.
Accordingly, the present embodiment also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the model training method or the route planning method provided by the present embodiment. The computer-readable storage medium may be any of various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Referring to fig. 7, an electronic device provided in this embodiment includes a processor 202 and a memory 201. The memory 201 stores a computer program, and the processor reads and executes the computer program corresponding to the above embodiment in the memory 201 to realize the model training method or the route planning method provided in the present embodiment.
With continued reference to fig. 7, the electronic device further includes a communication unit 203. The memory 201, the processor 202 and the communication unit 203 are electrically connected to each other, directly or indirectly, through a system bus 204 to achieve data transmission or interaction.
The memory 201 may be an information recording device based on any electronic, magnetic, optical or other physical principle for recording execution instructions, data, etc. In some embodiments, the memory 201 may be, but is not limited to, volatile memory, non-volatile memory, storage drives, and the like.
In some embodiments, the volatile memory may be random access memory (Random Access Memory, RAM); in some embodiments, the non-volatile Memory may be Read Only Memory (ROM), programmable ROM (Programmable Read-Only Memory, PROM), erasable ROM (Erasable Programmable Read-Only Memory, EPROM), electrically erasable ROM (Electric Erasable Programmable Read-Only Memory, EEPROM), flash Memory, or the like; in some embodiments, the storage drive may be a magnetic disk drive, a solid state disk, any type of storage disk (e.g., optical disk, DVD, etc.), or a similar storage medium, or a combination thereof, etc.
The communication unit 203 is used for transmitting and receiving data through a network. In some embodiments, the network may include a wired network, a wireless network, a fiber optic network, a telecommunications network, an intranet, the internet, a local area network (Local Area Network, LAN), a wide area network (Wide Area Network, WAN), a wireless local area network (Wireless Local Area Network, WLAN), a metropolitan area network (Metropolitan Area Network, MAN), a public switched telephone network (Public Switched Telephone Network, PSTN), a Bluetooth network, a ZigBee network, a near field communication (Near Field Communication, NFC) network, or the like, or any combination thereof. In some embodiments, the network may include one or more network access points. For example, the network may include wired or wireless network access points, such as base stations and/or network switching nodes, through which one or more components of the service request processing system may connect to the network to exchange data and/or information.
The processor 202 may be an integrated circuit chip with signal processing capabilities and may include one or more processing cores (e.g., a single-core processor or a multi-core processor). By way of example only, the processors may include a central processing unit (Central Processing Unit, CPU), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a special instruction set Processor (Application Specific Instruction-set Processor, ASIP), a graphics processing unit (Graphics Processing Unit, GPU), a physical processing unit (Physics Processing Unit, PPU), a digital signal Processor (Digital Signal Processor, DSP), a field programmable gate array (Field Programmable Gate Array, FPGA), a programmable logic device (Programmable Logic Device, PLD), a controller, a microcontroller unit, a reduced instruction set computer (Reduced Instruction Set Computing, RISC), a microprocessor, or the like, or any combination thereof.
It should be understood that the apparatus and method disclosed in the above embodiments may be implemented in other manners. The apparatus embodiments described above are merely illustrative, for example, flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing describes merely various embodiments of the present application, but the scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed in the present application, and such changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of model training, the method comprising:
obtaining a model to be trained, wherein the model to be trained is a reinforcement learning model;
acquiring an experience set generated by interaction between a training ship and the environment through the model to be trained, wherein each historical experience in the experience set comprises instant rewards and a new sailing state obtained by the training ship executing a sailing action generated by the model to be trained; the instant rewards include an internal instant reward positively correlated with a novelty of the new sailing state, the novelty characterizing a difference between the new sailing state and a conventional sailing state;
updating the model to be trained according to the experience set until the model to be trained meets a training condition, so as to obtain a route planning model.
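For orientation only, the following Python sketch shows one way the training loop of claim 1 could be organized. The interfaces used here (`model.act`, `model.update`, `env.reset`, `env.step`, the batch size and the stopping test) are illustrative assumptions and are not taken from the patent.

```python
import random
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Experience:
    """One historical experience: state, action, instant reward, new sailing state."""
    state: Any
    action: Any
    reward: float
    next_state: Any

def train_route_model(model, env, intrinsic_reward, extrinsic_reward,
                      max_episodes: int = 200, batch_size: int = 64):
    """Let the training vessel interact with the environment through the model,
    collect the resulting experiences, and update the model from the experience set."""
    experience_set: List[Experience] = []
    for _ in range(max_episodes):
        state = env.reset()                       # initial sailing state
        done = False
        while not done:
            action = model.act(state)             # sailing action for the current state
            next_state, done = env.step(action)   # new sailing state after the action
            r_int = intrinsic_reward(state, action, next_state)  # novelty-based part
            r_ext = extrinsic_reward(next_state)                 # rule-based part
            experience_set.append(Experience(state, action, r_int + r_ext, next_state))
            state = next_state
        batch = random.sample(experience_set, min(len(experience_set), batch_size))
        model.update(batch)                       # e.g. a policy-gradient or Q-learning step
        if model.training_condition_met():        # e.g. reward curve has converged
            break
    return model                                  # the trained route planning model
```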
2. The model training method according to claim 1, wherein the instant rewards further include external instant rewards of the training ship, and the acquiring an experience set generated by interaction between the training ship and the environment through the model to be trained comprises:
inputting a current sailing state of the training ship into the model to be trained to obtain a sailing action that the training ship should take in the current sailing state;
obtaining the internal instant rewards according to the difference between the new sailing state after the training ship executes the sailing action and the conventional sailing state;
evaluating the new sailing state according to a preset reward evaluation rule to obtain the external instant rewards;
and obtaining the instant rewards after the training ship executes the sailing action according to the internal instant rewards and the external instant rewards.
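The claims leave the "preset reward evaluation rule" of claim 2 open. A common choice in route planning is to reward progress toward the end point and penalize hazardous states; the rule below is only an illustrative assumption of that kind, with made-up field names and constants.

```python
import math

def external_instant_reward(new_state: dict,
                            goal: tuple,
                            hazard_penalty: float = -10.0,
                            arrival_bonus: float = 100.0) -> float:
    """Illustrative evaluation rule: reward is the negative distance to the end
    point, minus a penalty if the new sailing state is hazardous, plus a bonus
    when the end point is reached."""
    x, y = new_state["position"]
    gx, gy = goal
    distance = math.hypot(gx - x, gy - y)
    reward = -distance                        # closer to the end point -> larger reward
    if new_state.get("in_hazard", False):     # e.g. sea ice, shallow water, storm cell
        reward += hazard_penalty
    if distance < 1.0:                        # within one grid cell of the end point
        reward += arrival_bonus
    return reward

# Example call with made-up values:
# external_instant_reward({"position": (3.0, 4.0), "in_hazard": False}, goal=(0.0, 0.0))  # -> -5.0
```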
3. The model training method according to claim 2, wherein the obtaining the internal instant rewards according to the difference between the new sailing state after the training ship executes the sailing action and the conventional sailing state comprises:
acquiring a new sailing state of the training ship after executing the sailing action;
predicting the conventional sailing state according to the current sailing state of the training ship and the sailing action;
and obtaining the internal instant rewards according to the difference between the new sailing state and the conventional sailing state.
4. The model training method according to claim 3, wherein the obtaining the internal instant rewards according to the difference between the new sailing state and the conventional sailing state comprises:
vectorizing the new sailing state to obtain a first state vector;
vectorizing the conventional sailing state to obtain a second state vector;
and obtaining the internal instant rewards according to Euclidean distance between the first state vector and the second state vector.
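Claims 3 and 4 together describe a forward-model style intrinsic reward: a predictor estimates the conventional (expected) next state from the current state and action, both states are vectorized, and the internal reward grows with the Euclidean distance between the two vectors. Below is a minimal NumPy sketch under those assumptions; the linear predictor, learning rate and scale factor are illustrative choices, not details given in the patent.

```python
import numpy as np

class ConventionalStatePredictor:
    """Toy linear forward model: predicts the conventional next-state vector from
    the concatenated (state, action) vector. A practical system would typically
    use a small neural network trained on past transitions instead."""
    def __init__(self, state_dim: int, action_dim: int, lr: float = 1e-2):
        self.W = np.zeros((state_dim, state_dim + action_dim))
        self.lr = lr

    def predict(self, state_vec: np.ndarray, action_vec: np.ndarray) -> np.ndarray:
        return self.W @ np.concatenate([state_vec, action_vec])

    def update(self, state_vec, action_vec, observed_next_vec) -> None:
        """One gradient step, so frequently seen transitions become 'conventional'."""
        x = np.concatenate([state_vec, action_vec])
        error = self.W @ x - observed_next_vec
        self.W -= self.lr * np.outer(error, x)

def internal_instant_reward(predictor: ConventionalStatePredictor,
                            state_vec: np.ndarray, action_vec: np.ndarray,
                            new_state_vec: np.ndarray, scale: float = 0.1) -> float:
    """Internal reward = scaled Euclidean distance between the (vectorized) new
    sailing state and the predicted conventional sailing state (claims 3 and 4)."""
    predicted = predictor.predict(state_vec, action_vec)        # second state vector
    novelty = float(np.linalg.norm(new_state_vec - predicted))  # Euclidean distance
    predictor.update(state_vec, action_vec, new_state_vec)      # keep the baseline current
    return scale * novelty
```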
5. The model training method according to claim 3, wherein the obtaining the instant rewards after the training ship executes the sailing action according to the internal instant rewards and the external instant rewards comprises:
acquiring weights of the internal instant rewards and the external instant rewards in a current training period respectively;
and obtaining the instant rewards according to the weights of the internal instant rewards and the external instant rewards in the current training period.
6. The model training method according to claim 5, wherein the acquiring the weight of the internal instant rewards in the current training period comprises:
acquiring an average internal instant reward over a preset training period;
obtaining a weight adjustment coefficient according to a difference between the average internal instant reward and a target expected reward;
and adjusting the weight of the previous training period according to the weight adjustment coefficient to obtain the weight of the internal instant rewards in the current training period.
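Claims 5 and 6 can be read as forming the instant reward from a weighted sum of the two components, where the weight of the internal part is re-tuned each training period so that the average internal reward tracks a target expected reward. The functions below are a hypothetical sketch of that mechanism; the multiplicative update rule, sensitivity and clipping bounds are assumptions.

```python
def adjust_internal_weight(prev_weight: float,
                           avg_internal_reward: float,
                           target_reward: float,
                           sensitivity: float = 0.05,
                           min_w: float = 0.0,
                           max_w: float = 1.0) -> float:
    """Claim 6 (as read here): derive an adjustment coefficient from the gap
    between the average internal reward of the preset training period and the
    target expected reward, then scale the previous period's weight by it."""
    adjustment = 1.0 + sensitivity * (target_reward - avg_internal_reward)
    return min(max_w, max(min_w, prev_weight * adjustment))

def instant_reward(r_internal: float, r_external: float,
                   w_internal: float, w_external: float = 1.0) -> float:
    """Claim 5 (as read here): weighted combination of the two reward components."""
    return w_internal * r_internal + w_external * r_external

# Example: intrinsic rewards are running above target, so their weight is reduced.
w = adjust_internal_weight(prev_weight=0.5, avg_internal_reward=2.0, target_reward=1.0)  # 0.475
r = instant_reward(r_internal=1.3, r_external=-4.2, w_internal=w)
```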
7. A method of route planning, the method comprising:
determining a target ship, a starting point and an ending point of the target ship;
planning a route from the starting point to the ending point for the target ship by using a route planning model trained with the model training method according to any one of claims 1-6.
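Claim 7 applies the trained model at inference time. A rollout along the following lines would produce the planned route; the `env.reset(start=..., goal=...)` interface and the `position` field are assumptions for illustration only.

```python
def plan_route(model, env, start, end, max_steps: int = 1000):
    """Roll the trained route planning model out from the starting point to the
    ending point, collecting the visited positions as the planned route."""
    state = env.reset(start=start, goal=end)   # assumed environment interface
    route = [start]
    for _ in range(max_steps):
        action = model.act(state)              # sailing action for the current state
        state, done = env.step(action)
        route.append(state["position"])        # assumes the state exposes a position
        if done:                               # the ending point has been reached
            break
    return route
```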
8. A model training apparatus, the apparatus comprising:
an experience generation module, used for acquiring a model to be trained, wherein the model to be trained is a reinforcement learning model;
the experience generation module is further used for acquiring an experience set generated by interaction between a training ship and the environment through the model to be trained, wherein each historical experience in the experience set comprises instant rewards and a new sailing state obtained by the training ship executing a sailing action generated by the model to be trained; the instant rewards include an internal instant reward positively correlated with a novelty of the new sailing state, the novelty characterizing a difference between the new sailing state and a conventional sailing state;
and a model updating module, used for updating the model to be trained according to the experience set until the model to be trained meets a training condition, so as to obtain a route planning model.
9. A storage medium storing a computer program which, when executed by a processor, implements the model training method of any one of claims 1-6 or the route planning method of claim 7.
10. An electronic device comprising a processor and a memory, the memory storing a computer program which, when executed by the processor, implements the model training method of any one of claims 1-6 or the route planning method of claim 7.
CN202310286403.6A 2023-03-22 2023-03-22 Model training method, route planning method and related devices Active CN116523154B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310286403.6A CN116523154B (en) 2023-03-22 2023-03-22 Model training method, route planning method and related devices

Publications (2)

Publication Number Publication Date
CN116523154A (en) 2023-08-01
CN116523154B (en) 2024-03-29

Family

ID=87407140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310286403.6A Active CN116523154B (en) 2023-03-22 2023-03-22 Model training method, route planning method and related devices

Country Status (1)

Country Link
CN (1) CN116523154B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160260119A1 (en) * 2015-03-05 2016-09-08 Naver Corporation System and method of determining connection route of terminal requesting connection
CN109870162A (en) * 2019-04-04 2019-06-11 北京航空航天大学 Unmanned aerial vehicle flight path planning method based on a competitive deep learning network
US20220036261A1 (en) * 2020-07-24 2022-02-03 Tata Consultancy Services Limited Method and system for dynamically predicting vehicle arrival time using a temporal difference learning technique
CN114357479A (en) * 2021-12-23 2022-04-15 国网辽宁省电力有限公司信息通信分公司 APUF improvement method, device and system based on random number and storage medium
CN115509233A (en) * 2022-09-29 2022-12-23 山东交通学院 Robot path planning method and system based on prior experience playback mechanism
CN115599093A (en) * 2022-09-26 2023-01-13 重庆邮电大学(Cn) Self-adaptive unmanned ship path planning method based on fuzzy set and deep reinforcement learning
CN115826581A (en) * 2022-12-28 2023-03-21 大连大学 Mobile robot path planning algorithm combining fuzzy control and reinforcement learning
CN115826601A (en) * 2022-11-17 2023-03-21 中国人民解放军海军航空大学 Unmanned aerial vehicle path planning method based on reverse reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SO-HYUN PARK: "A Deep Learning Approach to Analyze Airline Customer Propensities: The Case of South Korea", 《MDPI》, 12 February 2022 (2022-02-12) *
ZHANG Fashuai: "Navigation control of unmanned surface vehicles based on deep reinforcement learning", 《计测技术》, vol. 38, no. 1, 31 December 2018 (2018-12-31), pages 207 - 211 *

Also Published As

Publication number Publication date
CN116523154B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
JP7247334B2 (en) A device for determining the optimum route of a sea vessel
WO2020143847A2 (en) Determining action selection policies of an execution device
Tsou et al. An Ant Colony Algorithm for efficient ship routing
CN110608738B (en) Unmanned ship global meteorological air route dynamic planning method and system
CN110472738A Real-time obstacle avoidance algorithm for unmanned surface vehicles based on deep reinforcement learning
CN109556609B (en) Artificial intelligence-based collision avoidance method and device
CN115169439B (en) Effective wave height prediction method and system based on sequence-to-sequence network
CN110906935A (en) Unmanned ship path planning method
CN114625151A (en) Underwater robot obstacle avoidance path planning method based on reinforcement learning
US11204803B2 (en) Determining action selection policies of an execution device
CN113722980A (en) Ocean wave height prediction method, system, computer equipment, storage medium and terminal
Dong et al. Double ant colony algorithm based on dynamic feedback for energy-saving route planning for ships
Li et al. Three-dimensional path planning for AUVs in ocean currents environment based on an improved compression factor particle swarm optimization algorithm
Zhang et al. A multi-objective path planning method for the wave glider in the complex marine environment
CN114089754A (en) Autonomous path planning method and system based on artificial intelligence
Guo et al. Mission-driven path planning and design of submersible unmanned ship with multiple navigation states
CN112819255B (en) Multi-criterion ship route determining method and device, computer equipment and readable storage medium
Wu et al. An autonomous coverage path planning algorithm for maritime search and rescue of persons-in-water based on deep reinforcement learning
CN110969289A (en) Unmanned ship meteorological air line continuous dynamic optimization method and system
CN116523154B (en) Model training method, route planning method and related devices
CN115334165B (en) Underwater multi-unmanned platform scheduling method and system based on deep reinforcement learning
Szlapczynski et al. Ship weather routing featuring w-MOEA/D and uncertainty handling
Hu et al. Adaptive environmental sampling for underwater vehicles based on ant colony optimization algorithm
CN115009291B (en) Automatic driving assistance decision making method and system based on network evolution replay buffer area
CN115655279A (en) Marine unmanned rescue airship path planning method based on improved whale algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant