CN112395690A - Reinforcement learning-based shipboard aircraft surface guarantee process optimization method - Google Patents

Reinforcement learning-based shipboard aircraft surface guarantee process optimization method Download PDF

Info

Publication number
CN112395690A
Authority
CN
China
Prior art keywords
guarantee
carrier
based aircraft
aircraft
station
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011328243.XA
Other languages
Chinese (zh)
Inventor
Zhang Yong (张勇)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Naval Aeronautical University
Original Assignee
Naval Aeronautical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Naval Aeronautical University filed Critical Naval Aeronautical University
Priority to CN202011328243.XA priority Critical patent/CN112395690A/en
Publication of CN112395690A publication Critical patent/CN112395690A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/10 Geometric CAD
    • G06F 30/15 Vehicle, aircraft or watercraft design
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model

Abstract

The invention discloses a reinforcement learning-based method for optimizing the surface guarantee process of carrier-based aircraft, and belongs to the technical field of carrier-based aircraft surface guarantee. First, a guarantee scheduling model is established according to the characteristics of the carrier-based aircraft surface guarantee process and is reduced to a hybrid flow-shop scheduling problem. Then, following the reinforcement learning framework, corresponding state and action representations are designed for the surface guarantee scheduling problem, the surface guarantee process is cast as a Markov decision process, and a corresponding reward function is designed. Finally, the scheduling problem is solved with a reinforcement learning algorithm so as to minimize the guarantee completion time. While preserving the quality of the surface guarantee schedule, the invention markedly improves the real-time performance of the solving process and can provide a reasonable solution for real-time scheduling of carrier-based aircraft surface guarantee.

Description

Reinforcement learning-based shipboard aircraft surface guarantee process optimization method
Technical Field
The invention belongs to the technical field of shipboard aircraft surface guarantee, and in particular relates to a method for optimizing the shipboard aircraft surface guarantee process based on reinforcement learning.
Background
Carrier-based aircraft operations form a periodic process: catapult takeoff, mission execution, return and landing, deck service guarantee, and catapult takeoff again, repeated cycle after cycle. Because an aircraft carrier embarks only a limited number of aircraft, keeping this cycle running efficiently and in an orderly manner is an effective way to bring the aircraft's combat capability into full play. Within this process, completing surface guarantee and catapult takeoff safely and quickly is the key factor constraining takeoff capability. Optimizing the deck guarantee process mainly involves modeling and optimizing the guarantee flow, and an optimization method must be designed with both optimization quality and real-time performance in mind.
At present, two methods are commonly used to optimize the carrier-based aircraft surface guarantee process: (1) the movements of aircraft and guarantee equipment on the carrier deck are simulated and war-gamed on a "display board" (a scaled-down replica of the carrier deck and its associated equipment, on which scale models of the various carrier-based aircraft are placed, with markers indicating each aircraft's operating state), and a scheduling plan is drawn up accordingly; (2) the surface guarantee flow is modeled and then optimized with traditional intelligent optimization methods to produce a scheduling plan. The display-board approach is simple to operate and gives a clear picture of the aircraft situation, but it depends too heavily on manual experience and suffers from untimely status updates and a lack of interactivity; scheduling optimized by traditional intelligent algorithms achieves better optimization quality but has poor real-time performance.
Therefore, engineering applications urgently need a carrier-based aircraft surface guarantee process optimization method that combines optimization quality with real-time solution performance and has good applicability.
Disclosure of Invention
To solve the above technical problems, the invention provides a reinforcement learning-based carrier-based aircraft surface guarantee process optimization method that balances optimization quality against solution efficiency and has good applicability.
The technical scheme adopted by the invention is as follows: according to the guarantee process of carrier-based aircraft on the deck, a surface guarantee scheduling model is established and cast as a Markov decision process; according to the characteristics of the model, suitable state and action representations and a reward function for reinforcement learning are designed; and the problem is solved with a reinforcement learning algorithm so as to minimize the guarantee completion time.
Specifically, the technical scheme of the invention is as follows:
a reinforced learning-based shipboard aircraft surface guarantee flow optimization method comprises the following steps:
s1, determining guarantee time of the shipboard aircraft on each guarantee station according to the type of the shipboard aircraft to be guaranteed and historical experience guarantee data;
s2, establishing a ship-based aircraft surface guarantee scheduling model according to a ship-based aircraft surface guarantee flow;
s3, combining a reinforcement learning process, designing corresponding state and action expressions in reinforcement learning in the scheduling problem of the carrier-based aircraft surface guarantee process, and summarizing the carrier-based aircraft surface guarantee process into a Markov decision process;
s4, designing a reward function r (s, a) ═ omega c _ t in the reinforcement learning process according to the characteristics that the guarantee efficiency of the ship-based aircraft surface is closely related to the guarantee efficiency of a single guarantee station and the optimization target of the modeli,j,l+β;
And S5, optimizing and solving the problem by using a reinforcement learning algorithm.
In step S2, the specific carrier-based aircraft surface guarantee scheduling model is:
min max{ET_{i,j,l}},  i ∈ I, j = f, l ∈ M_j    (1)
s.t.
Σ_{l ∈ M_j} y_{i,j,l} = 1,  ∀ i ∈ I, ∀ j ∈ F    (2)
ST_{i,j,l} = max{AT_{i,j,l}, FT_{j,l}},  i ∈ I, j ∈ F, l ∈ M_j    (3)
ET_{i,j,l} = ST_{i,j,l} + t_{i,j,l},  i ∈ I, j ∈ F, l ∈ M_j    (4)
AT_{i,j+1,l} = ET_{i,j,l} + ct_{j,j+1},  i ∈ I, j ∈ F, j ≠ f    (5)
(6)-(8): [equation images not legible in the source; constraints involving the station times BT_{j,l}, FT_{j,l} and the decision variables y_{i,j,l}]
where I is the set of carrier-based aircraft awaiting guarantee, f is the number of guarantee stages, F is the set of guarantee stages, M_j (j ∈ F) is the set of guarantee stations in stage j, t_{i,j,l} is the time the aircraft requires for guarantee at the corresponding station, ct_{j,j+1} is the transfer time of an aircraft between adjacent guarantee stages, AT_{i,j,l} is the moment the aircraft arrives at the guarantee station, FT_{j,l} is the moment the station finishes guaranteeing its current aircraft and becomes free, ST_{i,j,l} is the moment the aircraft begins receiving guarantee at the corresponding station, ET_{i,j,l} is the moment the aircraft's guarantee is completed, BT_{j,l} is the moment the station begins guaranteeing an aircraft, and y_{i,j,l} are the decision variables.
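For concreteness, the timing recursion of equations (3)-(5) and the makespan objective of equation (1) can be sketched in Python as below. This is a minimal illustration rather than the patented implementation: the data layout (nested lists indexed [i][j][l]), the fixed aircraft processing order, and all names are assumptions of the sketch.

```python
# Minimal sketch of the timing recursion (3)-(5) and the makespan objective (1).
# Assumed layout: t[i][j][l] is the guarantee time of aircraft i at station l
# of stage j; ct[j] is the transfer time from stage j to stage j+1;
# assignment[i][j] is the station chosen for aircraft i in stage j.
# Aircraft are rolled forward in index order, a simplification of the
# event-driven deck process.

def makespan(assignment, t, ct, arrival0):
    n, f = len(assignment), len(assignment[0])
    FT = [[0.0] * len(t[0][j]) for j in range(f)]   # FT_{j,l}: station free times
    finish = 0.0
    for i in range(n):
        at = arrival0[i]                             # initial arrival time of aircraft i
        for j in range(f):
            l = assignment[i][j]
            st = max(at, FT[j][l])                   # ST = max{AT, FT}       (3)
            et = st + t[i][j][l]                     # ET = ST + t            (4)
            FT[j][l] = et                            # station is busy until ET
            if j < f - 1:
                at = et + ct[j]                      # AT_{i,j+1} = ET + ct   (5)
            finish = max(finish, et)
    return finish                                    # max ET over the final stage (1)
```

Given any feasible station assignment (i.e., any setting of the y_{i,j,l}), this function returns the completion time that objective (1) seeks to minimize.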
In step S3, during carrier-based aircraft surface guarantee scheduling, the pair (i, j) (i ∈ I, j ∈ F) is regarded as a state in the Markov decision process, and selecting a guarantee station a ∈ M_{j+1} of the next stage is regarded as an action in the Markov decision process.
In step S4, in order to represent uniformly the influence of each station's guarantee completion time on the overall completion time of the schedule, a linear reward function is proposed, as shown in formula (9): the reward r(s, a) obtained by executing an action is negatively correlated with the time the aircraft spends at the station.
r(s, a) = ω·c_t_{i,j,l} + β    (9)
where c_t_{i,j,l} = FT_{j,l} − AT_{i,j,l} is the time the aircraft spends waiting for and receiving guarantee at a given station, and ω and β are integers with ω ∈ [-5, -1] and β ∈ (0, 300].
Furthermore, the problem is optimized and solved through a reinforcement learning method.
For the carrier-based aircraft surface guarantee process, the invention designs the state and action representations and the corresponding reward function in reinforcement learning, and on the basis of preserving the optimization quality of the surface guarantee schedule provides a reinforcement learning-based optimization method that greatly reduces computation time and can meet the requirement of real-time scheduling on the carrier deck.
Drawings
FIG. 1 is a flow chart of the shipboard aircraft surface guarantee process of the invention.
FIG. 2 is a flow chart of the Q-learning algorithm designed in the invention.
FIG. 3 is a flow chart of the invention.
Detailed Description
Specifically, the method for optimizing the ship-based aircraft surface guarantee process based on reinforcement learning comprises the following steps:
s1, determining guarantee time of the shipboard aircraft to be guaranteed on each guarantee station according to the type of the shipboard aircraft to be guaranteed and historical experience guarantee data
On the deck, a carrier-based aircraft must pass through three guarantee stages (inspection and maintenance, refueling, and weapon mounting) before catapult takeoff. The guarantee stage set F and the guarantee station sets M_j (j ∈ F) form the carrier-based aircraft surface guarantee system, and n carrier-based aircraft form the set I of aircraft to be guaranteed. Historical guarantee data determine the time t_{i,j,l} (i ∈ I, j ∈ F, l ∈ M_j) each aircraft requires for guarantee at each station. To compensate for errors in the empirical data, the transfer time between guarantee stages is taken as ct_{j,j+1} ~ N(2, 0.1) (j ∈ F, j ≠ f), and the transfer time from the weapon-mounting stage to the catapult as ct_{f-1,f} ~ N(2, 0.2).
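A small Python sketch of this setup step follows. Only the N(2, 0.1) and N(2, 0.2) transfer-time distributions come from the description; the fleet size, stage and station counts, and per-stage mean guarantee times are invented for the example, and the second parameter of N(·, ·) is read as a variance here.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 8                         # number of aircraft to be guaranteed (assumed)
stations = [3, 2, 2]          # |M_j| per stage: maintenance, refueling, weapons (assumed)
f = len(stations)             # number of guarantee stages

# t[i][j][l]: time aircraft i needs at station l of stage j, standing in for
# the historical experience data (per-stage means in minutes are assumed).
mean_t = [15.0, 10.0, 12.0]
t = [[[float(rng.normal(mean_t[j], 1.0)) for l in range(stations[j])]
      for j in range(f)]
     for i in range(n)]

# Transfer times: ct_{j,j+1} ~ N(2, 0.1) between adjacent guarantee stages,
# and ct_{f-1,f} ~ N(2, 0.2) from the weapon-mounting stage to the catapult.
ct = [float(rng.normal(2.0, np.sqrt(0.1))) for _ in range(f - 1)]
ct_to_catapult = float(rng.normal(2.0, np.sqrt(0.2)))
```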
S2, establishing a corresponding guarantee scheduling model according to a ship-based aircraft surface guarantee flow
According to the carrier-based aircraft surface guarantee process, scheduling is modeled as shown in FIG. 1. After all aircraft have landed and taxied to a temporary parking area at the bow, tractors tow them from the parking area, at fixed intervals (usually about 2 min), through the maintenance-inspection, refueling, and weapon-mounting stages in turn for the corresponding guarantee; because each stage contains several stations differing in personnel and equipment, an aircraft may receive its guarantee at any station of the stage. The purpose of deck aviation guarantee scheduling is to ensure that deck guarantee proceeds safely and efficiently, to optimize the guarantee completion time, and thereby to improve the carrier's operational capability. Following the process shown in FIG. 1, the deck aviation guarantee scheduling problem can be cast as a hybrid flow-shop scheduling problem.
S3, designing corresponding state and action expression in the scheduling problem of the carrier-based aircraft surface guarantee process according to the Markov property of the scheduling process and combining with a reinforcement learning process, and summarizing the carrier-based aircraft surface guarantee process into a Markov decision process
In the carrier-based aircraft surface guarantee scheduling process, the pair (i, j) (i ∈ I, j ∈ F) is regarded as a state s in the Markov decision process, and selecting a guarantee station a ∈ M_{j+1} of the next stage is regarded as an action a in the Markov decision process.
S4, designing a corresponding reward function according to the characteristics of the ship surface guarantee process of the ship-based aircraft and the optimization target of the model
In order to represent uniformly the influence of each station's guarantee completion time on the overall completion time of the schedule, a linear reward function is proposed: the reward r(s, a) obtained by executing an action is negatively correlated with the aircraft's guarantee completion time at the station. The reward function in reinforcement learning is set to r(s, a) = ω·c_t_{i,j,l} + β, and the learning rate is taken as α = 0.1 and the discount factor as γ = 0.9.
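The reward of this step is simple enough to state directly in code; a sketch follows, using the ω = -2 and β = 150 values from claim 5, which are one admissible choice among ω ∈ [-5, -1], β ∈ (0, 300].

```python
# Reward of the linear form r(s, a) = ω·c_t + β, where c_t = FT_{j,l} - AT_{i,j,l}
# is the time the aircraft spends waiting for and receiving guarantee at the
# station. OMEGA = -2 and BETA = 150 follow claim 5; other values in the
# stated ranges are equally admissible.

OMEGA, BETA = -2, 150

def reward(ft_jl: float, at_ijl: float) -> float:
    c_t = ft_jl - at_ijl
    return OMEGA * c_t + BETA
```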
S5, optimizing and solving the problem by using a reinforcement learning algorithm
The Q-learning algorithm is one of the most prominent reinforcement learning algorithms and is based on a value function. The value function Q(s, a) for a given state and action can be expressed as
Q(s, a) = E[ Σ_{t=0}^{∞} γ^t · r_{t+1}(s, a) ]
wherein r ist+1(s, a) is the reward obtained at time step t, γ ∈ (0, 1)]Is a discount factor.
The value function Q (s, a) is iterated as follows,
Q(s, a) ← Q(s, a) + α · [ r(s, a) + γ · max_{a′} Q(s′, a′) − Q(s, a) ]
where α is the learning rate and γ is the discount factor.
In the Q-learning algorithm, the agent must interact continuously with the environment, and whether it can select the correct action from the observed information determines whether that interaction is effective. When selecting an action, the agent should on the one hand choose, in each state, the action that maximizes the value function Q(s, a), so as to obtain as much reward as possible (exploitation); on the other hand, to avoid falling into local optima, it should also explore other actions in search of the optimal Q(s, a) (exploration). The invention adopts the Boltzmann exploration strategy, in which the selection probability of each action is determined by a random distribution function. Given a temperature coefficient T (T > 1), the probability that the i-th action is selected in the state at time step t is
p_i = exp(Q(s, a_i)/T) / Σ_{n=1}^{N} exp(Q(s, a_n)/T)
where N is the total number of actions available for selection in the current state.
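A minimal sketch of this selection rule in Python follows; the max-shift is added only for numerical stability and is an implementation detail not stated in the description.

```python
import numpy as np

def boltzmann_select(q_values: np.ndarray, T: float, rng: np.random.Generator) -> int:
    """Pick action i with probability exp(Q(s, a_i)/T) / sum_n exp(Q(s, a_n)/T)."""
    z = q_values / T
    z = z - z.max()          # shift for numerical stability; probabilities unchanged
    p = np.exp(z)
    p = p / p.sum()
    return int(rng.choice(len(q_values), p=p))
```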
At the beginning of learning, the temperature coefficient T is large and the Q(s, a) values are relatively small, so all actions are selected with nearly the same probability, which favors exploring the actions whose Q values are not yet optimal. As learning progresses, T gradually decreases: the selection probability of each action follows the changes in Q(s, a) while the probability of taking random actions falls, which helps select the optimal action with the largest Q(s, a). In the later stage of learning, T tends to 0, the agent selects actions with larger Q(s, a) with ever greater probability, and finally selects the action with the maximum Q(s, a) every time; at that point the exploration strategy has become a greedy strategy.
The temperature coefficient T is lowered from its initial value toward 0 over the course of learning by an annealing schedule [equation image not legible in the source], where e is the number of learning episodes, e_0 is a constant, and T_0 is the initial value of T (T_0 = 500 in the implementation of the invention).
Further, the specific implementation process of step S5 is as follows:
S5.1, input the discount factor γ and the learning rate α, and randomly initialize the Q values;
S5.2, for each training episode, initialize the state s to the starting state, initialize ST_{i,j,l}, ET_{i,j,l}, BT_{j,l} and FT_{j,l} to 0, and initialize AT_{i,j,l} to the initial arrival time;
S5.3, in each state, compute the arrival time of the carrier-based aircraft at the corresponding stage from AT_{i,j+1,l} = ET_{i,j,l} + ct_{j,j+1} (j ∈ F, j ≠ f);
S5.4, for each carrier-based aircraft, select an action a ∈ M_j according to the Q values and the exploration strategy, and execute it;
S5.5, observe the state of the next step and compute the reward r(s, a) of the selected action from r(s, a) = ω·c_t_{i,j,l} + β;
S5.6, update the Q values according to
Q(s, a) ← Q(s, a) + α · [ r(s, a) + γ · max_{a′} Q(s′, a′) − Q(s, a) ].
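Putting S5.1-S5.6 together, a compact tabular Q-learning loop might look as follows. It reuses the data layout and the boltzmann_select, OMEGA, and BETA helpers sketched above. The episode structure (one pass over the aircraft per episode), the exponential annealing of T with an assumed constant e_0, and the re-indexing that lets the action for state (i, j) pick the station of stage j (equivalent to the description's (i, j) → M_{j+1} convention) are all assumptions of the sketch.

```python
import numpy as np

ALPHA, GAMMA = 0.1, 0.9    # learning rate and discount factor from step S4
T0, E0 = 500.0, 50.0       # T0 = 500 from the description; annealing constant e0 assumed
EPISODES = 200             # assumed training budget

def train(n, stations, t, ct, arrival0, rng):
    f = len(stations)
    # Q[j][i][l]: value of sending aircraft i (state (i, j)) to station l of
    # stage j, an equivalent re-indexing of the (i, j) -> M_{j+1} convention.
    Q = [rng.uniform(size=(n, stations[j])) for j in range(f)]
    for e in range(EPISODES):
        T = max(T0 * np.exp(-e / E0), 1e-3)          # assumed annealing schedule
        FT = [[0.0] * stations[j] for j in range(f)]  # S5.2: reset station times
        for i in range(n):
            at = arrival0[i]
            for j in range(f):
                a = boltzmann_select(Q[j][i], T, rng)     # S5.4: pick a station
                st = max(at, FT[j][a])                    # S5.3: timing recursion
                et = st + t[i][j][a]
                FT[j][a] = et
                r = OMEGA * (et - at) + BETA              # S5.5: reward, eq. (9)
                nxt = Q[j + 1][i].max() if j < f - 1 else 0.0
                Q[j][i][a] += ALPHA * (r + GAMMA * nxt - Q[j][i][a])  # S5.6
                if j < f - 1:
                    at = et + ct[j]
    return Q
```

After training, a schedule is read off by replacing the Boltzmann draw with an argmax over each state's Q values.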

Claims (6)

1. A ship-based aircraft surface guarantee process optimization method based on reinforcement learning is characterized by comprising the following steps:
s1, determining time required by guarantee on each guarantee station according to the type of a shipboard aircraft to be guaranteed and historical guarantee experience data;
s2, establishing a guarantee scheduling model according to a ship surface guarantee flow of the ship-based aircraft;
s3, combining a reinforcement learning process, designing a corresponding state s and action a in the scheduling problem of the carrier-based aircraft surface guarantee flow, and summarizing the carrier-based aircraft surface guarantee flow into a Markov decision process;
s4, according to the characteristic that the guarantee efficiency of the ship-based aircraft surface is determined by the guarantee efficiency of each guarantee station, and model optimizationAiming at the goal, the corresponding reward function in the design reinforcement learning is r (s, a) ═ ω c _ ti,j,l+ β, wherein, c _ ti,j,l=FTj,l-ATi,j,lThe time from the arrival of the carrier-based aircraft at a certain guarantee station to the completion of the guarantee at the station is represented, both omega and beta are integers, and omega belongs to the range of-5 and-1],β∈(0,300);
And S5, optimizing and solving the scheduling problem by using a reinforcement learning algorithm.
2. The reinforcement learning-based carrier-based aircraft surface guarantee process optimization method according to claim 1, wherein the step S1 specifically comprises:
a guarantee stage set F and a guarantee station set Mj(j belongs to F) to form a carrier-based aircraft surface guarantee system, n carrier-based aircraft form a carrier-based aircraft set I to be guaranteed, and historical guarantee experience data are used for determining the time t required for the carrier-based aircraft to receive guarantee at each guarantee stationi,j,l(i∈I,j∈F,l∈Mj)。
3. The reinforcement learning-based carrier-based aircraft surface guarantee process optimization method according to claim 1, wherein the step S2 specifically comprises:
taking a to-be-guaranteed shipboard aircraft set I formed by n shipboard aircraft as a to-be-processed workpiece set, and establishing a hybrid flow shop scheduling model:
min max{ET_{i,j,l}},  i ∈ I, j = f, l ∈ M_j
s.t.
Σ_{l ∈ M_j} y_{i,j,l} = 1,  ∀ i ∈ I, ∀ j ∈ F
ST_{i,j,l} = max{AT_{i,j,l}, FT_{j,l}},  i ∈ I, j ∈ F, l ∈ M_j
ET_{i,j,l} = ST_{i,j,l} + t_{i,j,l},  i ∈ I, j ∈ F, l ∈ M_j
AT_{i,j+1,l} = ET_{i,j,l} + ct_{j,j+1},  i ∈ I, j ∈ F, j ≠ f
[three further constraint images, not legible in the source, involving the station times BT_{j,l}, FT_{j,l} and the decision variables y_{i,j,l}]
wherein I = {1, 2, …, n} is the set of carrier-based aircraft to be guaranteed, f is the number of guarantee stages, F is the set of guarantee stages, M_j (j ∈ F) is the set of guarantee stations in stage j, t_{i,j,l} is the time the aircraft requires for guarantee at the corresponding station, ct_{j,j+1} is the transfer time of an aircraft between adjacent guarantee stages, AT_{i,j,l} is the moment the aircraft arrives at the guarantee station, FT_{j,l} is the moment the station finishes guaranteeing its current aircraft and becomes free, ST_{i,j,l} is the moment the aircraft begins receiving guarantee at the corresponding station, ET_{i,j,l} is the moment the aircraft's guarantee is completed, BT_{j,l} is the moment the station begins guaranteeing an aircraft, and y_{i,j,l} are the decision variables.
4. The reinforcement learning-based carrier-based aircraft surface guarantee process optimization method according to claim 1, wherein the state and action representation in step S3 is specifically:
in the carrier-based aircraft surface guarantee scheduling process, the aircraft-stage pair (i, j) (i ∈ I, j ∈ F) is regarded as a state s in the Markov decision process, and each guarantee station l (l ∈ M_{j+1}) of the next guarantee stage is regarded as an action a in the Markov decision process.
5. The reinforcement learning-based carrier-based aircraft surface guarantee process optimization method according to claim 1, wherein the coefficients ω and β of the reward function in step S4 take the values ω = -2 and β = 150.
6. The reinforcement learning-based carrier-based aircraft surface guarantee process optimization method according to claim 1, wherein in step S5 the reinforcement learning algorithm solving process specifically comprises:
inputting the discount factor γ and the learning rate α, randomly initializing the Q-value table, and initializing the time parameters for each training episode; in each state, computing the arrival time of the aircraft at the corresponding stage from AT_{i,j+1,l} = ET_{i,j,l} + ct_{j,j+1} (j ∈ F, j ≠ f); for each carrier-based aircraft, selecting an action a ∈ M_j according to the Q-value table and the exploration strategy and executing it; observing the state of the next step and computing the reward r(s, a) = ω·c_t_{i,j,l} + β of the current action; and updating the Q-value table.
CN202011328243.XA 2020-11-24 2020-11-24 Reinforcement learning-based shipboard aircraft surface guarantee process optimization method Pending CN112395690A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011328243.XA CN112395690A (en) 2020-11-24 2020-11-24 Reinforcement learning-based shipboard aircraft surface guarantee process optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011328243.XA CN112395690A (en) 2020-11-24 2020-11-24 Reinforcement learning-based shipboard aircraft surface guarantee process optimization method

Publications (1)

Publication Number Publication Date
CN112395690A true CN112395690A (en) 2021-02-23

Family

ID=74607713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011328243.XA Pending CN112395690A (en) Reinforcement learning-based shipboard aircraft surface guarantee process optimization method

Country Status (1)

Country Link
CN (1) CN112395690A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114595958A (en) * 2022-02-28 2022-06-07 哈尔滨理工大学 Shipboard aircraft guarantee operator scheduling method for emergency
CN117215196A (en) * 2023-10-17 2023-12-12 成都正扬博创电子技术有限公司 Ship-borne comprehensive control computer intelligent decision-making method based on deep reinforcement learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103390195A (en) * 2013-05-28 2013-11-13 重庆大学 Machine workshop task scheduling energy-saving optimization system based on reinforcement learning
CN110716550A (en) * 2019-11-06 2020-01-21 南京理工大学 Gear shifting strategy dynamic optimization method based on deep reinforcement learning
CN110781614A (en) * 2019-12-06 2020-02-11 北京工业大学 Shipboard aircraft tripping recovery online scheduling method based on deep reinforcement learning
CN110794842A (en) * 2019-11-15 2020-02-14 北京邮电大学 Reinforced learning path planning algorithm based on potential field
CN110930016A (en) * 2019-11-19 2020-03-27 三峡大学 Cascade reservoir random optimization scheduling method based on deep Q learning
CN111738488A (en) * 2020-05-14 2020-10-02 华为技术有限公司 Task scheduling method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103390195A (en) * 2013-05-28 2013-11-13 重庆大学 Machine workshop task scheduling energy-saving optimization system based on reinforcement learning
CN110716550A (en) * 2019-11-06 2020-01-21 南京理工大学 Gear shifting strategy dynamic optimization method based on deep reinforcement learning
CN110794842A (en) * 2019-11-15 2020-02-14 北京邮电大学 Reinforced learning path planning algorithm based on potential field
CN110930016A (en) * 2019-11-19 2020-03-27 三峡大学 Cascade reservoir random optimization scheduling method based on deep Q learning
CN110781614A (en) * 2019-12-06 2020-02-11 北京工业大学 Shipboard aircraft tripping recovery online scheduling method based on deep reinforcement learning
CN111738488A (en) * 2020-05-14 2020-10-02 华为技术有限公司 Task scheduling method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEI HAN et al.: "A Reinforcement Learning Method for a Hybrid Flow-Shop Scheduling Problem", MDPI: Algorithms *
ZHANG Dongyang et al.: "Applying a Reinforcement Learning Algorithm to the Permutation Flow-Shop Scheduling Problem", Computer Systems & Applications *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114595958A (en) * 2022-02-28 2022-06-07 哈尔滨理工大学 Shipboard aircraft guarantee operator scheduling method for emergency
CN117215196A (en) * 2023-10-17 2023-12-12 成都正扬博创电子技术有限公司 Ship-borne comprehensive control computer intelligent decision-making method based on deep reinforcement learning
CN117215196B (en) * 2023-10-17 2024-04-05 成都正扬博创电子技术有限公司 Ship-borne comprehensive control computer intelligent decision-making method based on deep reinforcement learning

Similar Documents

Publication Publication Date Title
Masmoudi et al. Project scheduling under uncertainty using fuzzy modelling and solving techniques
CN107219858B (en) Multi-unmanned aerial vehicle cooperative coupling task allocation method for improving firefly algorithm
CN108170147B (en) Unmanned aerial vehicle task planning method based on self-organizing neural network
CN112395690A (en) Reinforced learning-based shipboard aircraft surface guarantee flow optimization method
CN109359888B (en) Comprehensive scheduling method for tight connection constraint among multiple equipment processes
CN115204497A (en) Prefabricated part production scheduling optimization method and system based on reinforcement learning
CN112836974B (en) Dynamic scheduling method for multiple field bridges between boxes based on DQN and MCTS
CN109615188A (en) A kind of predistribution combines the multi-robot Task Allocation of Hungary Algorithm
CN111783357A (en) Transfer route optimization method and system based on passenger delay reduction
CN113706023A (en) Shipboard aircraft guarantee operator scheduling method based on deep reinforcement learning
CN112508398A (en) Dynamic production scheduling method and device based on deep reinforcement learning and electronic equipment
CN115564242A (en) Ship power equipment-oriented scheduling method and system for preemptible task maintenance personnel
CN112685883B (en) Guarantee operation scheduling method for shipboard aircraft
CN113139747A (en) Method for reordering coating of work returning vehicle based on deep reinforcement learning
Jiang et al. Optimization of support scheduling on deck of carrier aircraft based on improved differential evolution algorithm
CN107958313A (en) A kind of discrete ripples optimization algorithm
CN117196169A (en) Machine position scheduling method based on deep reinforcement learning
CN113158549B (en) Diversified task oriented ship formation grade repair plan compilation method
CN115756789A (en) GPU scheduling optimization method for deep learning inference service system
CN115293623A (en) Training method and device for production scheduling model, electronic equipment and medium
CN113988443A (en) Automatic wharf cooperative scheduling method based on deep reinforcement learning
CN113743784A (en) Production time sequence table intelligent generation method based on deep reinforcement learning
CN114326644A (en) Double-field bridge flexible scheduling method under dynamic port intercepting time
Narapureddy et al. Optimal scheduling methodology for machines, tool transporter and tools in a multi-machine flexible manufacturing system without tool delay using flower pollination algorithm
CN114384931A (en) Unmanned aerial vehicle multi-target optimal control method and device based on strategy gradient

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210223