CN112085520A - Flight space control method based on artificial intelligence deep reinforcement learning - Google Patents
- Publication number
- CN112085520A (application number CN202010814188.9A)
- Authority
- CN
- China
- Prior art keywords
- reinforcement learning
- deep reinforcement
- neural network
- learning model
- flight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
- G06Q30/0206—Price or cost determination based on market factors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
Abstract
A flight space control method based on artificial intelligence deep reinforcement learning is carried out through the following steps: S1, establishing a deep reinforcement learning model; S2, training the deep reinforcement learning model to obtain a trained deep reinforcement learning model; and S3, inputting the sale state of the time unit to be decided into the deep reinforcement learning model trained in step S2, which outputs the current flight opening condition. During training, the method applies an artificial intelligence deep reinforcement learning algorithm that interacts with simulation environments containing different passenger demand probability distributions; the trained deep reinforcement learning model can perform real-time dynamic cabin space control with excellent effect, so that decision control can be carried out in real time during the sales process.
Description
Technical Field
The invention relates to the technical field of flight space control, in particular to a flight space control method based on artificial intelligence deep reinforcement learning.
Background
How to control flight cabin space so as to maximize profit is an important task in an airline's pricing strategy.
The concept and methods of "yield management" originated in the civil aviation industry, with the aim of allocating a flight's seats in the most efficient way according to demand, so that a relatively fixed capacity adequately matches the potential demand of each market segment.
Traditional revenue management methods mainly include overbooking control, cabin (booking-class) control, multi-level pricing, group booking management, and the like.
According to the classical seat control theory, the EMSR (Expected Marginal Seat Revenue) method assumes that the fare classes are independent of one another and, when x seats remain, calculates the expected marginal revenue of the seat corresponding to price level k. The EMSR model is built on Littlewood's rule, which assumes that booking demand arrives in order from low fare classes to high fare classes, and the EMSR model determines the daily booking control quantities from a fixed probability distribution of passenger demand. EMSR is thus a static, heuristic seat control method with independent fare classes.
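For reference, in the standard revenue-management literature the EMSR quantity described above is written as the expected marginal revenue of the x-th seat made available to price level k, with fare f_k and random demand D_k (this formula is given here for context and is not reproduced from the patent text):

$$\mathrm{EMSR}_{k}(x)=f_{k}\cdot P\!\left(D_{k}\geq x\right)$$

Seats continue to be protected for class k as long as this quantity exceeds the fare of the next lower class.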
A static cabin space control strategy determines the seat booking control quantities only once per day. Because the probability distribution of actual passenger demand and the demand estimates carry large uncertainty, the EMSR method cannot dynamically adjust the control strategy in real time according to the continuously updated demand and capacity information over the airline's whole ticket-selling period.
Therefore, in order to overcome the defects in the prior art, it is necessary to provide a flight space control method based on artificial intelligence deep reinforcement learning.
Disclosure of Invention
The invention aims to avoid the defects of the prior art and provides a flight space control method based on artificial intelligence deep reinforcement learning, which can carry out dynamic space control in real time according to specific environmental conditions and has a good effect.
The object of the invention is achieved by the following technical measures.
The flight space control method based on artificial intelligence deep reinforcement learning is provided and comprises the following steps:
s1, establishing a deep reinforcement learning model;
s2, training the deep reinforcement learning model to obtain a trained deep reinforcement learning model;
and S3, inputting the sale state of the time unit to be decided to the deep reinforcement learning model trained in the step S2, and outputting the current flight opening condition by the trained deep reinforcement learning model.
Preferably, step S1 specifically includes:
s11, establishing a passenger arrival process and a sales state as a simulation environment;
s12, taking flight income as the maximum income target;
s13, establishing a deep reinforcement learning model, and deciding the current flight opening condition at the current time point through the deep reinforcement learning model according to the sale State.
Preferably, step S11 is to simulate the number of passengers arriving in the same time unit each day and the upper limit of fare each passenger can bear during the period of M days before the aircraft takes off,
and the defined sale State is composed of current market information, sale data of seats in each cabin level of the flight, the rest number of the seats and the takeoff time from the flight.
Preferably, the deep reinforcement learning model established in step S13 specifically includes:
establishing an original Q neural network, wherein a value function is Q (s, a; theta), and iteratively updating a parameter theta in the deep learning process, wherein s is a sale state, a is an opening action and theta is a neural network weight parameter;
establishing a target Q ' neural network with the same structure as the original Q neural network, wherein the target Q ' neural network does not update parameters in the deep learning process, and the parameters after the iterative update of the original Q neural network are copied from the original Q neural network at intervals as the parameters of the target Q ' neural network;
the deep reinforcement learning model is based on a DQN algorithm, wherein when calculating the target value of the accumulated reward, the calculation is split into two steps according to the mode in the Double DQN algorithm:
1) obtaining, through the original Q neural network, the opening action a that maximizes the value function,
2) obtaining a target value of the accumulated reward corresponding to the opening action a through a target Q' neural network;
and (3) adopting a sampling rule that takes the load factor (seat occupancy rate) into consideration when extracting data from the experience replay pool:
specifically, on the basis of the sampling algorithm in the Prioritized Replay DQN algorithm, the weight calculation is changed to a Gaussian kernel weight computed from the load factor of each data sample:
where w is the weight, i and j denote the i-th and j-th data samples in the experience replay pool, x(i) is the load factor of the current data sample, x̄ is the average load factor of all data samples, and T is a parameter set to 15.
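For context, the two-step target computation above corresponds to the standard Double DQN target; using the notation of this section, with r the immediate reward, γ the discount factor and θ' the target-network parameters (symbols added here for clarity, not taken from the original text):

$$a^{*}=\arg\max_{a} Q(s_{t+1},a;\theta),\qquad y_{t}=r_{t}+\gamma\,Q'(s_{t+1},a^{*};\theta')$$

The Gaussian kernel weight formula itself is not reproduced in this text; a plausible form consistent with the variable definitions above, where each stored sample is weighted according to how close its load factor x(i) lies to the average load factor x̄, would be

$$w_{i}=\frac{\exp\!\bigl(-(x^{(i)}-\bar{x})^{2}/T\bigr)}{\sum_{j}\exp\!\bigl(-(x^{(j)}-\bar{x})^{2}/T\bigr)}$$

with T = 15; the exact kernel bandwidth convention is an assumption.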
Preferably, in step S2, the deep reinforcement learning model is trained to obtain a trained deep reinforcement learning model, specifically:
s21, randomly initializing a simulation environment;
initializing an experience playback pool replay _ memory, wherein the capacity is N, and N is a natural number and is used for storing training samples;
initializing an original Q neural network of the action-value function, and randomly initializing a weight parameter theta; initializing a target Q 'neural network of the target action-value function, wherein the structure and the initialization weight theta of the target Q' neural network are the same as those of the original Q neural network;
s22, when k is set to 1, the process proceeds to step S23;
s23, step is changed to 1, and the process proceeds to step S24;
s24, resetting the simulation environment, and randomly generating a simulation environment;
s25, simulating the arrival of passengers on a flight line by utilizing the CAM, dividing a period of time before the flight takes off into time units according to hours, and simulating a sales process;
s26, judging whether passengers arrive in the current time unit or not, if no passengers arrive in the time unit, arriving at the next time unit and returning to the step S26; otherwise, go to step S27;
s27, when a passenger arrives, the deep reinforcement learning model decides the current flight opening condition through the State (t) of the current time unit t given by the simulation environment, and disturbance is added on the basis of the decision of the deep reinforcement learning model to form action;
S28, the simulation environment opens the cabin according to the action, and gives the State (t+1) of the next time unit, the reward value Reward obtained by the current action, and the end-of-round training flag done; Reward is formed, when a ticket is sold, from the equivalent price of the sold ticket, the load-factor equivalent index value and the equivalent weight of the current time unit;
s29, storing the data (State (t), action, Reward, State (t +1) and done) into a deep reinforcement learning model experience playback pool replay _ memory;
s210, assigning the State (t +1) to the State (t) to complete a step; judging whether done is True, namely whether the whole airline sales process is finished, if yes, entering step S211; otherwise, return to step S26;
S211, adding the equivalent total revenue value to Reward, and stopping the current round; proceeding to step S212;
S212, setting step = step + 1, and judging whether step is less than the threshold learning_start; if yes, returning to step S24; otherwise, proceeding to step S213;
S213, extracting minibatch data from replay_memory according to the self-defined rule that considers the load factor, so as to calculate and update the parameters of the original Q neural network (the estimated-value network); after every C iterations of the original Q neural network parameters, updating the parameters of the target Q' neural network to the parameters of the current original Q neural network;
s214, judging whether the training stopping condition is met, if so, entering the step S215, otherwise, enabling k to be k +1, and returning to the step S23;
and S215, taking the currently trained deep reinforcement learning model as the well-trained deep reinforcement learning model.
Preferably, the disturbance added in S27 on the basis of the decision of the deep reinforcement learning model is specifically: a random sample value generated by the np.random.randn(1) function and scaled by a coefficient e, the sample obeying the standard normal distribution; the standard normal distribution is a normal distribution with a mean value of 0 and a standard deviation of 1, and is denoted as N(0, 1).
Preferably, in step S214, judging whether the training stopping condition is met specifically means judging whether the number of training rounds k has reached a stopping threshold or whether the neural network has converged.
Preferably, the stop threshold in step S214 is 8-15 times.
Preferably, the simulation environment established in step S11 includes: passenger cabin capacity, cabin class, passenger cabin price, length of time of sale, random seeds, and the number of people passengers arriving per time unit during the 30 days before aircraft takeoff and the upper limit of fare each passenger can bear.
Preferably, the deep reinforcement learning model established in step S13 decides, according to the sale State, the action for the current flight opening situation at the current time point as an integer value between 0 and P, where P is a natural number with 1 ≤ P ≤ 20, and actions 0 to P respectively represent opening cabins of different levels; the original Q neural network adopts a fully connected neural network structure with three hidden layers, with batch normalization processing added, and is trained with the Adam optimization algorithm.
The flight space control method based on artificial intelligence deep reinforcement learning is carried out through the following steps: S1, establishing a deep reinforcement learning model; S2, training the deep reinforcement learning model to obtain a trained deep reinforcement learning model; and S3, inputting the sale state of the time unit to be decided into the deep reinforcement learning model trained in step S2, which outputs the current flight opening condition. During training, the method applies an artificial intelligence deep reinforcement learning algorithm that interacts with simulation environments containing different passenger demand probability distributions; the trained deep reinforcement learning model can perform real-time dynamic cabin space control with excellent effect, so that decision control can be carried out in real time during the sales process.
Detailed Description
The invention is further illustrated by the following examples.
Example 1.
A flight space control method based on artificial intelligence deep reinforcement learning is carried out through the following steps:
s1, establishing a deep reinforcement learning model;
s2, training the deep reinforcement learning model to obtain a trained deep reinforcement learning model;
and S3, inputting the sale state of the time unit to be decided to the deep reinforcement learning model trained in the step S2, and outputting the current flight opening condition by the trained deep reinforcement learning model.
Wherein, step S1 specifically includes:
s11, establishing a passenger arrival process and a sales state as a simulation environment;
s12, taking flight income as the maximum income target;
s13, establishing a deep reinforcement learning model, and deciding the current flight opening condition at the current time point through the deep reinforcement learning model according to the sale State.
The simulation environment of step S11 is based on a CAM (Customer Arrival Model): it simulates, over the M days before the aircraft takes off, the number of passengers arriving in each time unit of each day and the upper limit of the fare that each passenger can bear, wherein M is a natural number with 10 ≤ M < 20000; M may for example be 30, 60 or another value.
The number of passengers arriving in each time unit can be generated through a Poisson distribution, wherein the rate parameter λ of the Poisson distribution is generated through a binomial distribution; the mean and variance of the upper limit of the fare each passenger can bear are generated through a WTP (Willingness To Pay) function, the fare upper limit of each passenger is then generated through a uniform distribution according to this mean and variance, and the specific time point at which a passenger arrives is obtained through a uniform distribution.
And the defined sale State is composed of current market information, sale data of seats in each cabin level of the flight, the rest number of the seats and the takeoff time from the flight.
The deep reinforcement learning model established in step S13 specifically includes:
establishing an original Q neural network, wherein a value function is Q (s, a; theta), and iteratively updating a parameter theta in the deep learning process, wherein s is a sale state, a is an opening action and theta is a neural network weight parameter; establishing a target Q ' neural network with the same structure as the original Q neural network, wherein the target Q ' neural network does not update parameters in the deep learning process, and the parameters after the iterative update of the original Q neural network are copied from the original Q neural network at intervals as the parameters of the target Q ' neural network;
the deep reinforcement learning model is based on a DQN algorithm, wherein when calculating the target value of the accumulated reward, the calculation is split into two steps according to the mode in the Double DQN algorithm:
1) obtaining, through the original Q neural network, the opening action a that maximizes the value function,
2) obtaining a target value of the accumulated reward corresponding to the opening action a through a target Q' neural network;
and (3) adopting a sampling rule that takes the load factor (seat occupancy rate) into consideration when extracting data from the experience replay pool:
specifically, on the basis of the sampling algorithm in the Prioritized Replay DQN algorithm, the weight calculation is changed to a Gaussian kernel weight computed from the load factor of each data sample:
where w is the weight, i and j denote the i-th and j-th data samples in the experience replay pool, x(i) is the load factor of the current data sample, x̄ is the average load factor of all data samples, and T is a parameter set to 15.
Step S2, training the deep reinforcement learning model to obtain a trained deep reinforcement learning model, specifically:
s21, randomly initializing a simulation environment;
initializing an experience playback pool replay _ memory, wherein the capacity is N, and N is a natural number and is used for storing training samples; initializing an original Q neural network of the action-value function, and randomly initializing a weight parameter theta; initializing a target Q 'neural network of the target action-value function, wherein the structure and the initialization weight theta of the target Q' neural network are the same as those of the original Q neural network;
s22, when k is set to 1, the process proceeds to step S23;
s23, step is changed to 1, and the process proceeds to step S24;
s24, resetting the simulation environment, and randomly generating a simulation environment;
s25, simulating the arrival of passengers on a flight line by utilizing the CAM, dividing a period of time before the flight takes off into time units according to hours, and simulating a sales process;
s26, judging whether passengers arrive in the current time unit or not, if no passengers arrive in the time unit, arriving at the next time unit and returning to the step S26; otherwise, go to step S27;
s27, when a passenger arrives, the deep reinforcement learning model decides the current flight opening condition through the State (t) of the current time unit t given by the simulation environment, and disturbance is added on the basis of the decision of the deep reinforcement learning model to form action;
the disturbance added on the basis of the decision of the deep reinforcement learning model specifically comprises the following steps: the disturbance added on the basis of the decision of the deep reinforcement learning model specifically comprises the following steps: random sample values generated by np.random.randn (1). e function, which obey the standard normal distribution; the normal distribution is a normal distribution having a mean value of 0 and a standard deviation of 1, and is denoted as N (0, 1).
The opening actions 0, 1, 2, 3 and 4 may represent the opening of five different cabin (fare-class) levels, corresponding respectively to the 'Y', 'H', 'B', 'M' and 'N' booking classes. The number of opening actions and the corresponding cabin levels can be set flexibly according to the actual situation and are not limited to the five cabin levels of this embodiment.
For example, suppose the decision action of the deep reinforcement learning model is 1 and, after the disturbance is added, the combined action becomes 4. The model's decision action 1 corresponds to opening the 'H' cabin, while the perturbed action 4 corresponds to opening the 'N' cabin; actions 0, 1, 2, 3 and 4 always correspond to the 'Y', 'H', 'B', 'M' and 'N' cabins.
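As an illustration only, the perturbation step could be implemented as follows; the patent names np.random.randn and a standard normal distribution, while the decaying scale e and the clipping to the valid action range are assumptions added here:

```python
import numpy as np

def perturb_action(model_action: int, e: float, num_actions: int = 5) -> int:
    """Superpose exploration noise on the cabin-opening action decided by the DRL model.

    model_action: raw action from the deep reinforcement learning model (0..num_actions-1)
    e: assumed time-decaying scale of the noise (the patent only states that the noise
       follows the standard normal distribution N(0, 1) and decreases over time)
    """
    noise = np.random.randn(1)[0] * e            # standard normal sample, scaled
    action = int(round(model_action + noise))    # combine decision and disturbance
    return min(max(action, 0), num_actions - 1)  # keep within the valid cabin levels

# Example: the model decides action 1 ('H' cabin); after perturbation it may become 4 ('N' cabin).
print(perturb_action(1, e=2.0))
```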
S28, the simulation environment opens the cabin according to the action, and gives the State (t+1) of the next time unit, the reward value Reward obtained by the current action, and the end-of-round training flag done; Reward is formed, when a ticket is sold, from the equivalent price of the sold ticket, the load-factor equivalent index value and the equivalent weight of the current time unit;
s29, storing the data (State (t), action, Reward, State (t +1) and done) into a deep reinforcement learning model experience playback pool replay _ memory;
s210, assigning the State (t +1) to the State (t) to complete a step; judging whether done is True, namely whether the whole airline sales process is finished, if yes, entering step S211; otherwise, return to step S26;
S211, adding the equivalent total revenue value to Reward, and stopping the current round; proceeding to step S212;
S212, setting step = step + 1, and judging whether step is less than the threshold learning_start; if yes, returning to step S24; otherwise, proceeding to step S213;
S213, extracting minibatch data from replay_memory according to the self-defined rule that considers the load factor, so as to calculate and update the parameters of the original Q neural network (the estimated-value network); after every C iterations of the original Q neural network parameters, updating the parameters of the target Q' neural network to the parameters of the current original Q neural network;
S214, judging whether the training stopping condition is met; if yes, proceeding to step S215, otherwise setting k = k + 1 and returning to step S23; the training stopping condition may specifically be whether the number of training rounds k (k being a natural number) has reached a stopping threshold or whether the neural network has converged;
and S215, taking the currently trained deep reinforcement learning model as the well-trained deep reinforcement learning model.
The invention can combine the market real-time market information to carry out real-time dynamic cabin space control through a deep reinforcement learning model. Compared with an EMSR (enhanced learning) static cabin space control method with independent cabin levels, the deep reinforcement learning model is a dynamic planning model, has stronger adaptability and more flexible application, can perform real-time dynamic cabin space control, and has better effect on timely feedback of market information.
The reward R in the deep reinforcement learning model is calculated by a reward function defined in the model. The dynamic programming problem here is a non-linear, non-convex problem with a very large state space, so the design of the reward function determines whether a model that meets the requirements and has the required decision-making capability can be trained in the existing simulation environment, and whether it can be trained efficiently. The reward function designed in the invention consists of three parts: revenue (the equivalent price each time a ticket is sold, plus the equivalent total revenue value when the whole airline sales process ends), load factor (the load-factor equivalent index value), and industry prior knowledge (the farther from the takeoff time, the more the policy tends to open low-price cabins, which is converted in the model into the equivalent weight of the current time unit). This prevents the model from exploring the state space endlessly, gives the exploration directionality, and lets the model focus on the two indicators of revenue and load factor.
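A minimal sketch of such a three-part reward is given below; the actual formula, coefficients and time-weighting schedule are not disclosed in this text, so every numeric choice here is an assumption for illustration only:

```python
def step_reward(ticket_price, load_factor, days_to_departure, horizon=30,
                price_scale=1000.0, alpha=1.0, beta=1.0):
    """Illustrative per-sale reward combining the three described components:
    equivalent ticket price, load-factor index, and a time-unit weight that tolerates
    low-price sales far from departure (assumed linear schedule)."""
    time_weight = days_to_departure / horizon   # close to 1 far from takeoff, 0 near takeoff
    price_term = ticket_price / price_scale     # equivalent price of the sold ticket
    # Far from takeoff: emphasize filling seats (load factor); near takeoff: emphasize price.
    return alpha * price_term * (1.0 - time_weight) + beta * load_factor * time_weight

def episode_bonus(total_revenue, revenue_scale=100000.0):
    """Equivalent total-revenue value added to Reward when the whole sales process ends."""
    return total_revenue / revenue_scale
```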
The self-defined sampling rule that considers the load factor in the deep reinforcement learning model strengthens the neural network's tendency to improve the load factor and increases the training speed.
Experimental results show that, in the same simulation environment, a comparison of the revenue distributions of EMSR and DRL over 1000 verification runs shows that DRL is superior to EMSR in decision effect.
According to the flight space control method based on artificial intelligence deep reinforcement learning, interactive training learning is carried out on simulation environments containing different passenger demand probability distributions in the training process by applying an artificial intelligence deep reinforcement learning algorithm, a trained deep reinforcement learning model DRL can carry out real-time dynamic space control and is good in effect, and decision control can be carried out in real time in the sales process.
Example 2.
The flight space control method based on artificial intelligence deep reinforcement learning is described in combination with a specific example.
1. Establishing a passenger arrival process and a sales state similar to the actual situation as a simulation environment;
simulating the number of passengers arriving in each time unit (certain time of a certain day) and the upper limit of the fare each passenger can bear during the 30 days before the aircraft takes off:
the method comprises the following steps: the number of passengers arriving at each time unit is generated through Poisson distribution, wherein a ratio parameter lambda of the Poisson distribution is generated through binomial distribution, a mean value and a variance of the upper limit of the fare price which can be borne by each passenger are generated through a WTP (willingto pay) function, the upper limit of the fare price which can be borne by each passenger is generated through uniform distribution according to the mean value and the variance, and the specific time point of arrival of the passengers is obtained through uniform distribution.
The sale State (current market information, sale data of seats in each cabin level of the flight, the rest number of the seats, and the departure time from the flight) is defined.
The manner in which the Poisson distribution generates the number of passengers arriving per time unit is well known in the art; the following describes code that uses the Poisson distribution to produce the number of passengers arriving per time unit,
wherein binomial is a binomial distribution, poisson is a Poisson distribution, uniform is a uniform distribution, and mean and var are the mean and variance obtained from the WTP function.
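The code itself is not reproduced in this text; a minimal numpy sketch consistent with that description (binomial → arrival rate λ, Poisson → number of arrivals, WTP mean/variance → uniform fare upper limit, uniform → arrival instant) might look as follows. The wtp function and all numeric parameters are assumptions for illustration.

```python
import numpy as np

def wtp(day, base_mean=700.0, base_var=150.0, horizon=30):
    """Assumed willingness-to-pay model: mean fare ceiling rises as departure nears."""
    mean = base_mean + 300.0 * (1.0 - day / horizon)
    var = base_var
    return mean, var

def simulate_day(day, horizon=30, rng=np.random.default_rng()):
    """Simulate one day of the customer arrival model (CAM)."""
    # Rate parameter lambda of the Poisson distribution, generated via a binomial draw.
    lam = rng.binomial(n=10, p=0.3)
    # Number of arriving passengers on this day.
    n_passengers = rng.poisson(lam)
    # Fare upper limit each passenger can bear, drawn uniformly around the WTP mean/variance.
    mean, var = wtp(day, horizon=horizon)
    fare_limits = rng.uniform(mean - var, mean + var, size=n_passengers)
    # Specific arrival instants within the day (hour of day), uniformly distributed.
    arrival_hours = np.sort(rng.uniform(0, 24, size=n_passengers))
    return list(zip(arrival_hours, fare_limits))

# Example: passengers arriving 30 days before departure.
print(simulate_day(day=30))
```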
2. Designing a profit target: the accumulated profit is the largest in the whole sale process 30 days before the aircraft takes off.
3. Designing a DRL (deep reinforcement learning) model which decides, according to the sale State, the action for the current flight opening condition at the current time point as an integer value between 0 and P, wherein P is a natural number with 1 ≤ P ≤ 20 and actions 0 to P respectively represent the opening of different cabin levels; for example, action 0 indicates opening the highest-level cabin; the original Q neural network adopts a fully connected neural network structure with three hidden layers, with batch normalization processing added, and is trained with the Adam optimization algorithm.
The value of P can be selected according to the actual situation, for example 3, 4, 5 or 6; different values represent different cabin levels, and the number and values of the cabin levels can be set flexibly according to the actual situation. For example, action 0 indicates opening the highest-level cabin, and the opening actions 0, 1, 2, 3 and 4 may correspond respectively to the 'Y', 'H', 'B', 'M' and 'N' cabins.
4. Randomly initializing a simulation environment; setting a cabin capacity of 200, a cabin class of [ 'Y', 'H', 'B', 'M', 'N' ], a cabin price of [1000,900,800,700,600], a time period of sale of 30, and a random seed of None;
5. initializing an experience playback pool replay _ memory, wherein the capacity is N-2000 and is used for storing training samples;
6. initializing an action-value function original Q neural network, three hidden layers, each hidden layer comprises 16 neurons, and randomly initializing a weight parameter theta;
7. initializing a target Q' neural network of the target action-value function, wherein the structure and the initialization weight theta are the same as Q;
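Steps 6 and 7 can be sketched as follows, assuming PyTorch (the patent does not name a framework); the state dimension and action count are illustrative, while the three hidden layers of 16 neurons, batch normalization and the Adam optimizer follow the text:

```python
import torch
import torch.nn as nn

def build_q_net(state_dim=9, n_actions=5, hidden=16):
    """Fully connected Q network: three hidden layers of 16 units with batch normalization."""
    return nn.Sequential(
        nn.Linear(state_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
        nn.Linear(hidden, n_actions),
    )

q_net = build_q_net()                            # original Q network, random weights theta
target_net = build_q_net()                       # target Q' network, same structure
target_net.load_state_dict(q_net.state_dict())   # initialize theta' = theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
```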
8. Resetting the simulation environment and randomly generating a new simulation environment: the time parameter is set to the starting time point of 30 days; in each reset, the generated number of passengers arriving in each time unit (a certain time of a certain day) during the 30 days before the aircraft takes off and the fare upper limit each passenger can bear are different;
9. simulating the arrival of passengers on a flight line by using the CAM, dividing a period of time (such as 30 days) before the flight takes off into time units according to hours, and simulating a sales process;
10. If passengers arrive in the time unit, the DRL model decides the current flight opening condition from the State(t) of the current time unit given by the simulation environment, for example [list([0, 0, 0, 0, 0]), 200, 0, 30, 0] (corresponding respectively to the number of seats sold per cabin level, the number of remaining seats, the revenue, the day, and the time point within that day); a disturbance (a random value generated from the standard normal distribution and decreasing over time) is superposed on the DRL model's decision to increase the exploration capability of the DRL agent during training, forming the action; if no passenger arrives in the time unit, the process moves to the next time unit;
11. The simulation environment opens the cabin according to the action, and gives the State(t+1) of the next time unit, the reward value Reward obtained by the current action, and the end-of-round training flag done; Reward is formed, when a ticket is sold, from the equivalent price of the sold ticket, the load-factor equivalent index value and the equivalent weight of the current time unit;
12. storing the data (State (t), action, Reward, State (t +1), done) into the DRL model experience playback pool replay _ memory;
13. Assigning State(t+1) to State(t), and repeating steps 10-12 (each repetition being one step) so that the agent keeps interacting with the simulation environment to generate data; when done is True (the whole sales process of the airline is finished), the equivalent total revenue value is added to Reward, the current round is stopped, the next round begins, and the process jumps to step 8;
14. When step reaches the set threshold learning_starts = 2000, 64 data samples are extracted from replay_memory through the self-defined rule that considers the load factor, so as to calculate and update the parameters of the estimated-value original Q neural network; thereafter the network is trained once per step;
15. After every C = 500 iterations, the parameters of the target Q' neural network are updated to the parameters of the current original Q neural network, as illustrated in the sketch below;
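Steps 14 and 15 might be sketched as follows; the replay-pool layout and the Gaussian-kernel weighting form are the same assumptions noted earlier:

```python
import numpy as np

def sample_minibatch(replay_memory, load_factors, batch_size=64, T=15.0,
                     rng=np.random.default_rng()):
    """Draw a minibatch whose sampling probabilities follow a Gaussian kernel
    on each stored transition's load factor (assumed kernel form, bandwidth T = 15)."""
    x = np.asarray(load_factors, dtype=float)   # load factor recorded with each transition
    w = np.exp(-(x - x.mean()) ** 2 / T)        # kernel weight of every stored sample
    idx = rng.choice(len(replay_memory), size=batch_size, p=w / w.sum())
    return [replay_memory[i] for i in idx]

# Step 15 (using the PyTorch networks sketched above): every C = 500 updates,
# copy the original Q network parameters into the target Q' network, e.g.
#   if update_count % 500 == 0:
#       target_net.load_state_dict(q_net.state_dict())
```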
16. repeating steps 8-15 several times until a stopping condition is reached or manually stopping training when the neural network has converged; and obtaining the trained DRL model, and realizing the real-time sales strategy calculation of the single airline.
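After training, the model can be queried at each decision time unit roughly as follows; this is a usage sketch under the same PyTorch assumption, with an illustrative state layout matching step 10 and a hypothetical, commented-out weight file:

```python
import numpy as np
import torch
import torch.nn as nn

# Same architecture as the training sketch above (three hidden layers of 16 units).
q_net = nn.Sequential(
    nn.Linear(9, 16), nn.BatchNorm1d(16), nn.ReLU(),
    nn.Linear(16, 16), nn.BatchNorm1d(16), nn.ReLU(),
    nn.Linear(16, 16), nn.BatchNorm1d(16), nn.ReLU(),
    nn.Linear(16, 5),
)
# q_net.load_state_dict(torch.load("drl_cabin_control.pt"))  # hypothetical trained weights
q_net.eval()

def flatten_state(state):
    """[list(sold_per_class), remaining_seats, revenue, day, hour] -> flat float vector."""
    sold, remaining, revenue, day, hour = state
    return np.array(list(sold) + [remaining, revenue, day, hour], dtype=np.float32)

state = [list([12, 30, 25, 40, 8]), 85, 96000.0, 7, 14]  # illustrative sale State(t)
with torch.no_grad():
    q_values = q_net(torch.from_numpy(flatten_state(state)).unsqueeze(0))
print(int(q_values.argmax(dim=1)))  # index of the cabin level to open at this time unit
```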
It should be noted that, this embodiment is an example selected to illustrate, and the parameter selection of the present invention is not limited to specific data in this embodiment, and those skilled in the art can flexibly set the parameter according to specific needs.
According to the flight space control method based on artificial intelligence deep reinforcement learning, interactive training learning is carried out on simulation environments containing different passenger demand probability distributions in the training process by applying an artificial intelligence deep reinforcement learning algorithm, a trained deep reinforcement learning model DRL can carry out real-time dynamic space control and is good in effect, and decision control can be carried out in real time in the sales process.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and not for limiting the protection scope of the present invention, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
Claims (10)
1. A flight space control method based on artificial intelligence deep reinforcement learning is characterized by comprising the following steps:
s1, establishing a deep reinforcement learning model;
s2, training the deep reinforcement learning model to obtain a trained deep reinforcement learning model;
and S3, inputting the sale state of the time unit to be decided to the deep reinforcement learning model trained in the step S2, and outputting the current flight opening condition by the trained deep reinforcement learning model.
2. The flight space control method based on artificial intelligence deep reinforcement learning as claimed in claim 1,
step S1 specifically includes:
s11, establishing a passenger arrival process and a sales state as a simulation environment;
s12, taking flight income as the maximum income target;
s13, establishing a deep reinforcement learning model, and deciding the current flight opening condition at the current time point through the deep reinforcement learning model according to the sale State.
3. The flight space control method for artificial intelligence deep reinforcement learning according to claim 2,
step S11 is to simulate the number of passengers arriving at the same time unit every day and the upper limit of fare that each passenger can bear in the process of M days before the takeoff of the airplane, wherein M is a natural number;
and the defined sale State is composed of current market information, sale data of seats in each cabin level of the flight, the rest number of the seats and the takeoff time from the flight.
4. The flight space control method for artificial intelligence deep reinforcement learning according to claim 3,
the deep reinforcement learning model established in step S13 specifically includes:
establishing an original Q neural network, wherein a value function is Q (s, a; theta), and iteratively updating a parameter theta in the deep learning process, wherein s is a sale state, a is an opening action and theta is a neural network weight parameter;
establishing a target Q ' neural network with the same structure as the original Q neural network, wherein the target Q ' neural network does not update parameters in the deep learning process, and the parameters after the iterative update of the original Q neural network are copied from the original Q neural network at intervals as the parameters of the target Q ' neural network;
the deep reinforcement learning model is based on a DQN algorithm, wherein when calculating the target value of the accumulated reward, the calculation is split into two steps according to the mode in the Double DQN algorithm:
1) obtaining, through the original Q neural network, the opening action a that maximizes the value function,
2) obtaining a target value of the accumulated reward corresponding to the opening action a through a target Q' neural network;
and (3) adopting a sampling rule that takes the load factor (seat occupancy rate) into consideration when extracting data from the experience replay pool:
specifically, on the basis of the sampling algorithm in the Prioritized Replay DQN algorithm, the weight calculation is changed to a Gaussian kernel weight computed from the load factor of each data sample:
where w is the weight, i and j denote the i-th and j-th data samples in the experience replay pool, x(i) is the load factor of the current data sample, x̄ is the average load factor of all data samples, and T is a parameter set to 15.
5. The flight space control method based on artificial intelligence deep reinforcement learning as claimed in claim 4,
step S2, training the deep reinforcement learning model to obtain a trained deep reinforcement learning model, specifically:
s21, randomly initializing a simulation environment;
initializing an experience playback pool replay _ memory, wherein the capacity is N, and N is a natural number and is used for storing training samples;
initializing an original Q neural network of the action-value function, and randomly initializing a weight parameter theta; initializing a target Q 'neural network of the target action-value function, wherein the structure and the initialization weight theta of the target Q' neural network are the same as those of the original Q neural network;
s22, when k is set to 1, the process proceeds to step S23;
s23, step is changed to 1, and the process proceeds to step S24;
s24, resetting the simulation environment, and randomly generating a simulation environment;
s25, simulating the arrival of passengers on a flight line by utilizing the CAM, dividing a period of time before the flight takes off into time units according to hours, and simulating a sales process;
s26, judging whether passengers arrive in the current time unit or not, if no passengers arrive in the time unit, arriving at the next time unit and returning to the step S26; otherwise, go to step S27;
s27, when a passenger arrives, the deep reinforcement learning model decides the current flight opening condition through the State (t) of the current time unit t given by the simulation environment, and disturbance is added on the basis of the decision of the deep reinforcement learning model to form action;
S28, the simulation environment opens the cabin according to the action, and gives the State (t+1) of the next time unit, the reward value Reward obtained by the current action, and the end-of-round training flag done; Reward is formed, when a ticket is sold, from the equivalent price of the sold ticket, the load-factor equivalent index value and the equivalent weight of the current time unit;
s29, storing the data (State (t), action, Reward, State (t +1) and done) into a deep reinforcement learning model experience playback pool replay _ memory;
s210, assigning the State (t +1) to the State (t) to complete a step; judging whether done is True, namely whether the whole airline sales process is finished, if yes, entering step S211; otherwise, return to step S26;
S211, adding the equivalent total revenue value to Reward, and stopping the current round; proceeding to step S212;
S212, setting step = step + 1, and judging whether step is less than the threshold learning_start; if yes, returning to step S24; otherwise, proceeding to step S213;
S213, extracting minibatch data from replay_memory according to the self-defined rule that considers the load factor, so as to calculate and update the parameters of the original Q neural network (the estimated-value network); after every C iterations of the original Q neural network parameters, updating the parameters of the target Q' neural network to the parameters of the current original Q neural network;
s214, judging whether the training stopping condition is met, if so, entering the step S215, otherwise, enabling k to be k +1, and returning to the step S23;
and S215, taking the currently trained deep reinforcement learning model as the well-trained deep reinforcement learning model.
6. The flight space control method based on artificial intelligence deep reinforcement learning of claim 5, wherein the disturbance added in S27 on the basis of the decision of the deep reinforcement learning model is specifically: a random sample value generated by the np.random.randn(1) function and scaled by a coefficient e, the sample obeying the standard normal distribution; the standard normal distribution is a normal distribution with a mean value of 0 and a standard deviation of 1, and is denoted as N(0, 1).
7. The flight space control method based on artificial intelligence deep reinforcement learning of claim 6, wherein in step S214, judging whether the training stopping condition is met specifically means judging whether the number of training rounds k has reached a stopping threshold or whether the neural network has converged.
8. The flight space control method based on artificial intelligence deep reinforcement learning as claimed in claim 7, wherein the stopping threshold in step S214 is 8 to 15 rounds.
9. The flight space control method based on artificial intelligence deep reinforcement learning as claimed in claim 5,
the simulation environment established in step S11 includes: passenger cabin capacity, cabin class, passenger cabin price, length of time of sale, random seeds, and the number of people passengers arriving per time unit during the 30 days before aircraft takeoff and the upper limit of fare each passenger can bear.
10. The flight space control method based on artificial intelligence deep reinforcement learning as claimed in claim 5,
the deep reinforcement learning model established in the step S13 determines, through the sale State, that the action of the current flight opening situation at the current time point is an integer value between 0 and P, where P is a natural number, and P is greater than or equal to 1 and less than or equal to 20, where action 0-P represents opening of compartments at different levels respectively; the original Q neural network structure adopts a fully-connected neural network structure with three hidden layers, batch normalization processing is added, and an Adam optimization algorithm is adopted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010814188.9A CN112085520A (en) | 2020-08-13 | 2020-08-13 | Flight space control method based on artificial intelligence deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010814188.9A CN112085520A (en) | 2020-08-13 | 2020-08-13 | Flight space control method based on artificial intelligence deep reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112085520A true CN112085520A (en) | 2020-12-15 |
Family
ID=73728272
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010814188.9A Pending CN112085520A (en) | 2020-08-13 | 2020-08-13 | Flight space control method based on artificial intelligence deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112085520A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113762687A (en) * | 2021-01-04 | 2021-12-07 | 北京京东振世信息技术有限公司 | Personnel scheduling and scheduling method and device in warehouse |
CN113762687B (en) * | 2021-01-04 | 2024-03-01 | 北京京东振世信息技术有限公司 | Personnel scheduling method and device in warehouse |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111858009B (en) | Task scheduling method of mobile edge computing system based on migration and reinforcement learning | |
US5432887A (en) | Neural network system and method for factory floor scheduling | |
CN107423442A (en) | Method and system, storage medium and computer equipment are recommended in application based on user's portrait behavioural analysis | |
JP2022525702A (en) | Systems and methods for model fairness | |
CN109669452A (en) | A kind of cloud robot task dispatching method and system based on parallel intensified learning | |
CN110892477A (en) | Gradient direction data segmentation for neural networks | |
CN112580801B (en) | Reinforced learning training method and decision-making method based on reinforced learning | |
CN112948412B (en) | Flight inventory updating method, system, electronic device and storage medium | |
CN111967971A (en) | Bank client data processing method and device | |
CN115115389A (en) | Express customer loss prediction method based on value subdivision and integrated prediction | |
Shihab et al. | A deep reinforcement learning approach to seat inventory control for airline revenue management | |
CN112085520A (en) | Flight space control method based on artificial intelligence deep reinforcement learning | |
US20210035213A1 (en) | Order execution for stock trading | |
CN113269402A (en) | Flight space control method and device and computer equipment | |
CA2016451C (en) | Apparatus and method for computer-aided decision making | |
Prosvetov | GAN for recommendation system | |
CN110222878B (en) | Short-term load prediction method based on artificial fish swarm neural network | |
Alamdari et al. | Deep reinforcement learning in seat inventory control problem: an action generation approach | |
CN116228325A (en) | Advertisement putting method, device, medium and equipment | |
CN114648367A (en) | Single-flight aircraft ticket prediction planning method, system and storage medium | |
Verma et al. | Correlated learning for aggregation systems | |
CN113191527A (en) | Prediction method and device for population prediction based on prediction model | |
Kozat et al. | Switching strategies for sequential decision problems with multiplicative loss with application to portfolios | |
Jumadinova et al. | A multi-agent prediction market based on boolean network evolution | |
Jo et al. | Airline dynamic pricing with patient customers using deep exploration-based reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||