CN114595958A - Shipboard aircraft guarantee operator scheduling method for emergency - Google Patents
- Publication number
- CN114595958A (application CN202210211487.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0631—Resource planning, allocation, distributing or scheduling for enterprises or organisations
- G06Q10/06311—Scheduling, planning or task assignment for a person or group
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/29—Graphical models, e.g. Bayesian networks
- G06F18/295—Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0633—Workflow analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Abstract
The invention discloses a scheduling method for carrier-based aircraft support personnel, used to handle uncertain emergency situations in the scheduling of support operations on the deck. First, the support process carried out by support personnel for the carrier-based aircraft is modeled as a Markov decision process. Then, a modified Soft Actor-Critic (SAC) scheduling algorithm is designed according to the characteristics of the process: (1) to reduce the learning difficulty, the SAC algorithm is extended into a multi-agent algorithm and environment data processing is added, reducing the environment state information each agent must handle; (2) to avoid action conflicts, an adaptive conflict penalty coefficient is designed to improve scheduling quality; (3) to optimize the overall training process, an invalid action masking mechanism and a preferential experience replay mechanism are set up. Finally, the designed algorithm is trained, and the trained agents are put into scheduling. The method handles deck emergencies better, makes deck scheduling more robust to uncertainty, and increases the efficiency of deck scheduling.
Description
(I) technical field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a method for scheduling carrier-based aircraft support personnel under emergency conditions.
(II) background of the invention
As the core of a navy's blue-water operations and a symbol combining national power and naval strength, the aircraft carrier has unmatched military value. The combat capability of an aircraft carrier is conventionally measured by the sortie rate of its carrier-based aircraft. The sortie rate depends on many factors; once a carrier is built its hardware configuration is essentially fixed, so the main avenue for improving its combat capability is to improve the utilization efficiency of all its resources. That efficiency is reflected in the number of launch and recovery cycles of the carrier-based aircraft, which in turn hinges on the various support tasks performed on the deck. Since most support operations on current carriers are carried out by support personnel, the scheduling of support personnel directly affects the efficiency of the support work.
Although traditional intelligent optimization algorithms can obtain fairly good scheduling strategies, they mainly optimize an overall schedule list and have difficulty coping with deck emergencies. The paper "Integrated optimization of scheduling and resource allocation for carrier aircraft deck service support operations" builds an integrated optimization model of scheduling and resource allocation for deck service support operations and solves it with NSGA-II, but does not consider emergencies during support operations. The paper on a combined support-personnel configuration and scheduling optimization method for carrier aircraft fleet movement, based on a margin/artificial-bee-colony algorithm, optimizes personnel configuration and scheduling simultaneously, solving the configuration problem with a margin optimization algorithm in the outer layer and the scheduling problem with an improved artificial bee colony algorithm in the inner layer, but it cannot handle the effects of deck emergencies. The patent "Shipboard aircraft support operator scheduling method based on deep reinforcement learning" provides a scheduling method based on the MADDPG algorithm; although it considers real-time scheduling and can respond immediately to deck emergencies, it does not handle conflicts between executed actions, compensating for possible conflicts only through frequent decisions, which lengthens the decision time. Therefore, when solving the support-personnel scheduling problem, the decisions can be further optimized by considering action conflicts and deck emergencies.
(III) disclosure of the invention
The invention aims to provide a scheduling algorithm for carrier-based aircraft support personnel under emergency conditions, and to improve the execution efficiency of carrier-based aircraft support work. To this end, the invention adopts the following technical scheme:
Step 1: analyze the launch and recovery process of the carrier-based aircraft, determine the process characteristics, extract the key requirements, and provide the basic requirements for the design in the subsequent steps;
Step 2: determine the environment and the agents according to the characteristics found in step 1; set up the state space, the action space and the state transitions according to the environment characteristics, design a corresponding reward function, establish the overall environment-agent training framework, and decide to solve the problem model with a multi-agent algorithm;
Step 3: process the raw environment data: filter out the data unnecessary for learning and make decisions using only the necessary data, reducing the observations each agent needs and thereby reducing the learning difficulty and optimizing the learning process;
Step 4: based on the settings and requirements of the previous steps: first, because the SAC (Soft Actor-Critic) algorithm is strongly robust and suited to the deck scheduling environment, design a multi-agent learning framework for the SAC algorithm and apply it to multi-agent learning; second, adopt invalid action masking to further optimize the learning process; third, to learn low-probability deck emergencies, set up a corresponding preferential experience extraction mechanism so that emergencies can be learned effectively; finally, to resolve the action-conflict problem, set an adaptive conflict penalty coefficient that adjusts itself to the current situation, better avoiding action conflicts, and avoid the few remaining conflicts through suboptimal selection;
Step 5: train with the algorithm and environment set up in the previous steps; when the agents can output good scheduling decisions, stop training and save the agents;
Step 6: put the saved agents into practical use, while storing the real data into the experience pool for further learning later.
The invention has the following beneficial effects:
(1) the method provided by the invention is developed from a strongly robust single-agent algorithm, inherits the excellent robustness of the original algorithm, and is well suited to the support scheduling problem;
(2) by setting the adaptive conflict penalty coefficient, the method reduces the algorithm's sensitive hyperparameters, brings convergence forward by 30-50 training iterations, and markedly reduces the reward fluctuation in the later stage of training;
(3) by setting invalid action masking and preferential experience extraction, the method optimizes the learning process, enables the agents to complete tasks they otherwise could not, and stabilizes the learning process.
(IV) description of the drawings
FIG. 1 is a general block diagram of an algorithm;
FIG. 2 is a flow chart of a shipboard aircraft deck safeguard operation;
FIG. 3 is an environment-agent architecture;
FIG. 4 is a multi-agent algorithm framework;
FIG. 5 is a schematic diagram of an invalid action mask;
FIG. 6 is a schematic diagram of a preferred experience extraction for emergency situations;
FIG. 7 is a diagram illustrating adaptive collision penalty coefficients;
FIG. 8 is a reward graph with a preferential experience extraction mechanism for emergency situations;
FIG. 9 is a reward curve without the preferential experience extraction mechanism for emergency situations;
FIG. 10 is a diagram of agent 3 reward curves without adding invalid action masks and adaptive conflict coefficients;
FIG. 11 is a diagram of agent 6 reward curves without adding invalid action masks and adaptive conflict coefficients;
FIG. 12 is a diagram of agent 3 reward curves without adding adaptive conflict coefficients;
FIG. 13 is a diagram of agent 6 reward curves without the addition of an adaptive conflict factor;
FIG. 14 is a diagram of the reward curve of agent 1 according to the present invention;
FIG. 15 is a diagram of the method agent 2 reward curve of the present invention;
FIG. 16 is a diagram of the agent 3 reward curve of the proposed method;
FIG. 17 is a diagram of the reward curve of agent 4 according to the present invention;
FIG. 18 is a diagram of the agent 5 reward curve of the proposed method;
FIG. 19 is a diagram of agent 6 reward curves for the proposed method;
(V) detailed description of the preferred embodiments
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and experimental examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. The general structure of the invention is shown in figure 1.
Step 1: analyzing the launch and recovery process of the carrier-based aircraft;
after a carrier-based aircraft is recovered to the deck, it must go through the following processes before it can be launched again: (1) fault inspection and maintenance; (2) anchoring, oxygen charging, nitrogen filling, tyre replacement, etc.; (3) refueling; (4) munition loading. After these processes the aircraft can take off. The processes are strongly sequential: except that the refueling process may be carried out in parallel with the processes other than munition loading, the processes are executed strictly in order. The flow is shown in fig. 2.
Step 2: according to the guarantee operation characteristics, a Markov decision process model is constructed to serve as an intelligent agent training environment;
step 2.1: environment and agent;
analysis of the process flow of the carrier-based aircraft support process shows that the environment should be set as the support stations on the flight deck. On this basis the agents are determined. If all support teams were set as one agent, the whole system would be a single-agent system; but such a design would make the agent's action space huge and, since the action space is discrete, extremely difficult to handle. Therefore each individual support team is set as one agent, splitting the original huge action space, converting the decision system into a multi-agent system, and using a multi-agent algorithm for control decisions. Although this increases the complexity of the algorithm, the action space becomes much easier to handle. The environment-agent structure is shown in fig. 3.
Because the scheduling objects are support teams, a team is fully capable of executing an instruction once the decision system issues it, and of resolving problems that arise during the support process (except emergencies, which change the environment state). When a process is completed, the environment state changes and a new instruction must be issued. The algorithm is therefore set to make decisions only when the environment state changes, which avoids many unnecessary commands.
Step 2.2: determining a state space, an action space and a state transition;
the joint action A is defined as the quadruple (A_r, A_o, A_f, A_a), where A_r is the set of all maintenance team actions, A_o the set of all anchoring/servicing team actions, A_f the set of all refueling team actions, and A_a the set of all munition-loading team actions. Each element is an (n, t) pair, where n is the team number and t is the target station number.
Based on the environment and agent structure determined above, the environment state space can be constructed as follows:
station_s = (snum, ie, ia, wp), team_s = (tnum, tc, tp, tb)
where station_s is the support-station state: snum is the station number, ie flags whether an aircraft is parked at the station, ia flags whether the station has an active emergency, and wp is the process to be executed for the aircraft currently parked there; team_s is the support-team state, i.e. the state of each agent: tnum is the agent number, tc is the process class the agent is responsible for, tp is the agent's current position, and tb flags whether the agent is currently executing a support process.
Each agent's action space is the set of target stations; under the environment-agent structure determined above, the size of each agent's action space equals the number of support stations. The joint action is the concatenation of the individual agents' actions.
The state transitions follow the deck support operation flow and its logic described above, advancing at the specified time step, and are reflected in the changes of the flag bits of the environment state.
In particular, at each time step, each security position has a certain probability of emergency situations. After an emergency occurs, if a process is executed in the current machine position, the process is reserved in the middle section, and the process can be recovered after the emergency is finished by corresponding group processing. Two types of emergency situations are contemplated herein, including equipment failure and fuel leaks. The equipment fault needs to be responsible for a maintenance and inspection group of the shipboard aircraft to process, and the fuel oil leakage needs to be responsible for an oiling group of the shipboard aircraft to process. For the rest of the practical situations, emergency emergencies may occur, and the processing means and the form are similar to those considered in the text, so that the emergency emergencies are not considered.
Step 2.3: determining a reward function;
to avoid the learning difficulty caused by a sparse-reward environment, an instant reward function is defined for the agents, so that an agent obtains immediate feedback when it makes a decision; the reward is given to each individual agent. The reward function R(s, a) is defined as follows:
where a, b and c are constant hyperparameters satisfying a > b > 0 > -c, dis is the moving distance, and long is the maximum movable distance.
In particular, the -c penalty also helps keep a single agent from falling into the local optimum of giving up exploration and spinning in place.
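The exact functional form of R(s, a) is not reproduced in this text, so the following is only an illustrative instant reward consistent with the stated constraints a > b > 0 > -c and the distance terms dis and long; the branch conditions and default values are assumptions.

```python
def instant_reward(process_completed, useful_move, dis, long,
                   a=10.0, b=1.0, c=2.0):
    # Illustrative shaped reward: a > b > 0 > -c as stated in the text.
    if process_completed:              # finishing a support process: largest reward
        return a
    if useful_move:                    # moving toward a valid target: small reward
        return b * (1.0 - dis / long)  # that shrinks with the distance travelled
    return -c                          # idle/invalid decision: fixed penalty
```

The fixed -c branch is what discourages an agent from "spinning in place" instead of exploring.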
Step 3: processing environment data;
suppose the system has n support stations in total and 4 classes of support teams, with m_i support teams of class i; the resulting total state space is enormous. Such a huge state space is not conducive to learning, and for each agent the environment states of the other support processes are not information necessary for its own learning. Environment state processing is therefore introduced: for each agent, the information necessary for its own learning is screened out of the huge environment state, arranged, and delivered to the agent for learning. The processed environment information is:
station_s_agent = (snum, wt), team_s_agent = (op, ob)
where station_s_agent is the simplified support-station state and wt identifies whether the current agent can execute its process at the current station; team_s_agent is the simplified support-team state, op is the agent's current position, and ob flags whether the agent is executing a support process. The processed state-space size is 3^n × m_i × 2.
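The per-agent filtering step can be sketched as below. Encoding each station in one of three states matches the 3^n factor in the processed state-space size; the concrete dictionary keys and the three-valued code are assumptions, not taken from the patent.

```python
def filter_observation(stations, team):
    # Keep, for each station, only whether this agent's process class (tc)
    # can be executed there right now.
    wt = []
    for s in stations:
        if not s["ie"] or s["wp"] != team["tc"]:
            wt.append(0)   # nothing for this agent to do at this station
        elif s["ia"]:
            wt.append(2)   # the agent's process is blocked by an emergency
        else:
            wt.append(1)   # the agent's process is executable here
    return {"wt": wt, "op": team["tp"], "ob": team["tb"]}
```

Every agent thus observes n ternary station codes plus its own position and busy flag, instead of the full joint state.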
Step 4: designing the support scheduling algorithm;
step 4.1: designing a multi-agent algorithm;
existing multi-agent algorithms are many and varied. For the support scheduling problem, the environment is filled with a large amount of uncertainty, both inherent to the environment and external to it, so the robustness of the algorithm becomes a decisive selection criterion.
Among single-agent algorithms, the SAC algorithm maximizes the entropy of the decision actions, and its fitting of multi-modal action-value functions makes it a single-agent deep reinforcement learning algorithm with excellent robustness. For continuous action spaces, however, it only fits the assumed multi-modal Boltzmann distribution with a uni-modal normal distribution through the reparameterization trick, which greatly weakens its real effect. For discrete actions the constraint on the action description is lifted, so the reparameterization trick can be abandoned and multi-modal fitting realized.
According to the environment set up in step 2, the support scheduling problem is essentially a discrete action-space model, so a discrete SAC algorithm can be adopted. This algorithm, however, is a single-agent algorithm, while the analysis in step 2 shows that a multi-agent algorithm is needed. A multi-agent framework is therefore designed for it, converting it into a multi-agent algorithm that meets the conditions of the support scheduling problem and solves it.
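Why no reparameterization trick is needed for discrete actions can be shown in a few lines: with a categorical policy the expectation over actions in the SAC policy objective is computed in closed form. The function below is a generic sketch of that objective, not the patent's implementation; `alpha` is the SAC temperature.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def discrete_sac_policy_loss(logits, q_values, alpha):
    # Discrete SAC policy objective, exact (no sampling, no reparameterization):
    #   J_pi = sum_a pi(a|s) * (alpha * log pi(a|s) - Q(s, a))
    probs = softmax(logits)
    return sum(p * (alpha * math.log(p) - q) for p, q in zip(probs, q_values))
```

Minimizing this loss shifts probability mass toward high-Q actions while the entropy term (weighted by `alpha`) keeps the policy stochastic.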
To address the non-stationary environment problem of distributed single-agent algorithms, an action-value function sharing global information is designed for each agent, and the policy network is updated using the value estimates of this action-value function, realizing policy updates that take global information into account. This process occurs only in the training phase; in the execution phase each agent still uses only its own decision information, but its decisions are improved because global information was used during policy updating. Because global information is used during training updates, the environment transitions remain relatively stationary during updating, enabling stable learning and realizing centralized training with decentralized execution (CTDE). The multi-agent algorithm framework is shown in fig. 4.
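The CTDE information flow just described can be sketched as a minimal agent class. All names are illustrative; only the flow follows the text: the critic conditions on the global state and joint action during training, while the acting policy sees only the agent's local observation.

```python
class CTDEAgent:
    """Sketch of centralized training with decentralized execution."""

    def __init__(self, agent_id, n_actions):
        self.agent_id = agent_id
        self.n_actions = n_actions

    def act(self, local_obs):
        # Execution phase: only the agent's own (filtered) observation is
        # available.  A real agent would query its policy network here.
        return hash((self.agent_id, tuple(local_obs))) % self.n_actions

    def critic_input(self, global_state, joint_action):
        # Training phase: the value estimate conditions on global information,
        # which keeps the environment stationary from the critic's viewpoint.
        return tuple(global_state) + tuple(joint_action)
```

The stationarity argument is exactly the one in the text: because the critic sees the joint action, the other agents' changing policies no longer look like environment drift.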
Step 4.2: setting invalid action masking;
invalid action masking is a technique for optimizing the learning process in large action spaces. For the support scheduling problem, according to the environment model set up in step 2, the size of the action space grows with the number of support stations, and under the "integrated" combined support model the support stations of the whole deck must be considered, making learning difficult. An invalid action masking mechanism is therefore introduced to mask actions that have no practical meaning but might still be selected: the probability of selecting an invalid action becomes zero, and so does its back-propagated gradient. Every action the agent selects is then meaningful, which to some extent also relieves the agents' decision-conflict problem and optimizes the learning process. The invalid action masking mechanism is shown in fig. 5.
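The masking step is conventionally implemented by pushing invalid logits to negative infinity before the softmax, which is what zeroes both the selection probability and the gradient. A minimal sketch (function name and list representation are illustrative):

```python
import math

def masked_action_probabilities(logits, valid_mask):
    # Invalid actions get logit -inf, hence probability exactly 0 after the
    # softmax, and no gradient would flow back through them.
    masked = [x if ok else -math.inf for x, ok in zip(logits, valid_mask)]
    m = max(masked)
    exps = [math.exp(x - m) if x != -math.inf else 0.0 for x in masked]
    z = sum(exps)
    return [e / z for e in exps]
```

The remaining probability mass is renormalized over the valid actions, so the agent always selects a meaningful target station.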
Step 4.3: setting a preferential experience extraction mechanism for emergencies;
preferential experience extraction is an importance sampling mechanism. In an off-policy algorithm the agent learns by extracting past experience from the experience pool each time, which reduces the correlation of the training data and benefits learning. For general learning, preferentially extracting the experiences with larger temporal-difference (TD) error yields better learning efficiency.
For the support scheduling problem, combined with the model set up in step 2, emergencies occur with low frequency but cause great harm, so they must be handled with emphasis. Because of their low frequency it is difficult to learn emergencies effectively with the usual extraction mechanism, so the priority extraction strategy is changed: experiences containing emergencies are extracted preferentially, so that the policy for emergencies can be learned better. However, over-extracting certain experiences causes overfitting and an unsatisfactory final learning effect, so the priority of each experience decays exponentially with the number of times it has been selected; by design, even an emergency experience falls below the priority of ordinary experiences after being selected several times. The priority is calculated as follows:
P = ne · e^(-η·nc)    (4)
where P is the experience priority, ne is the number of emergencies contained in the sampled experience, η is a proportionality coefficient (a hyperparameter), and nc is the cumulative number of times the experience has been sampled.
In practice, calculating the priority of experience in the experience pool for each learning would result in a large amount of wasted computation. Therefore, during each experience extraction, the priority of the random extraction part is calculated by experience, and then the experience with the priority higher than the priority is selected for learning. The overall preferential experience extraction mechanism for emergency situations is shown in fig. 6.
Step 4.4: setting an adaptive conflict penalty coefficient;
because the algorithm is modified from the SAC algorithm, the adjustment of SAC's specific temperature coefficient adopts the adaptive parameter method of the literature. For the support scheduling problem, decision conflicts must be avoided, so a corresponding penalty coefficient is set for them. If the conflict penalty were added directly to the reward function, the manually set penalty coefficient would not be optimal and would be fixed: a coefficient large enough for the middle and later stages adds an unfavorable negative reward term to the early stage of training, when exploration should be encouraged, making early exploration difficult; a coefficient small enough for early exploration cannot achieve the expected effect in the middle and later stages. The penalty is therefore moved inside the agent and its size is adjusted adaptively.
First consider maximizing the cumulative expected reward subject to the conflict constraint, where c is the number of conflicts and τ is the conflict limit, which may be set to a small positive number.
Consider the last time step T. Let Z = E[r(s_t, a_t)] + δ_T · l(c), where l(c) = τ - c and δ_T is the conflict coefficient of the T-th time step. If l(c) ≥ 0, let δ_T = 0, so that Z = E[r(s_t, a_t)]; if l(c) < 0, let δ_T → ∞, so that Z → -∞.
Pushing the same construction back to the preceding time steps gives a similar form at every step, so the objective function can be written as:
J(δ) = δ · (τ - c)    (9)
the whole algorithm is not sensitive to the initial value of delta, so the design can realize the self-adaptive change of the collision penalty coefficient, and the collision penalty coefficient is always kept in a reasonable range.
Step 4.4: suboptimal selection to avoid the few remaining conflicts;
Although the adaptive conflict coefficient is added, conflicts cannot be avoided entirely. At this point, however, conflicts are very rare, and they can be resolved by comparing suboptimal actions by distance: a priority is defined over actions according to their distance, where a larger distance means a lower priority; the suboptimal action with the highest priority is executed, and if a conflict remains, the step is repeated.
This procedure is evidently involved: applying it directly, without the conflict-coefficient processing, would greatly reduce execution efficiency and seriously harm the intelligent agent's learning.
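The repeat-until-no-conflict procedure can be sketched as follows, assuming linearly arranged stands (so the move distance is the absolute difference of stand numbers, as in the test example) and one target stand per agent. The data layout and tie-breaking are illustrative assumptions, not the patent's implementation.

```python
def resolve_conflicts(targets, positions, n_stands, max_rounds=10):
    """Resolve the few remaining assignment conflicts by suboptimal
    selection.  When several agents pick the same stand, the nearest
    agent keeps it (larger distance = lower priority) and the others
    are reassigned to their nearest free stand; repeat if conflicts
    remain."""
    targets = list(targets)
    for _ in range(max_rounds):
        clash = {}
        for i, t in enumerate(targets):
            clash.setdefault(t, []).append(i)
        conflicted = [ids for ids in clash.values() if len(ids) > 1]
        if not conflicted:
            break
        taken = set(targets)
        for ids in conflicted:
            # keep the agent closest to the contested stand
            ids.sort(key=lambda i: abs(positions[i] - targets[i]))
            for i in ids[1:]:
                free = [s for s in range(n_stands) if s not in taken]
                if not free:
                    continue
                best = min(free, key=lambda s: abs(positions[i] - s))
                taken.add(best)
                targets[i] = best
    return targets
```

The loop illustrates why this is costly as a primary mechanism: every round rescans all assignments, which is acceptable only because the adaptive conflict coefficient has already made conflicts rare.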
Step 5: the intelligent agent is placed in the constructed environment and trained according to the designed algorithm until it accurately generates scheduling instructions, yielding a trained intelligent agent;
Step 6: the trained intelligent agent is applied in the field to guide guarantee personnel in their guarantee operations; at this time the intelligent agent can store real data in the experience pool for further learning in idle time.
In this test example, only a single guarantee process is considered, since the overall schedule can be regarded as a simple combination of different processes. The preferential experience extraction for emergencies is compared only in the single-agent case, since it is independent of the multi-agent architecture. Because of the large number of agents, the effect of the unmodified algorithm is illustrated with agent 3 and agent 6 only. The process has 6 guarantee groups and 16 guarantee stands in total, and an emergency occurs with probability 2.5% at each action. The stands are arranged linearly, i.e., the distance between two stands is the absolute value of the difference of their numbers. The reward function is set as follows:
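The scenario above can be sketched as a minimal configuration. This is illustrative only: the constant and function names are assumptions, and the reward function itself is omitted since its formula is not reproduced in the text.

```python
import random

N_STANDS = 16        # guarantee stands, numbered 0..15, arranged on a line
N_GROUPS = 6         # guarantee groups
P_EMERGENCY = 0.025  # probability of an emergency at each action

def distance(a, b):
    """Linear stand layout: the distance between two stands is the
    absolute value of their number difference."""
    return abs(a - b)

def step_emergency(rng=random):
    """Draw whether an emergency occurs at this action (2.5% chance)."""
    return rng.random() < P_EMERGENCY
```

Over many actions this yields roughly one emergency per forty decisions, which is frequent enough for the emergency-prioritized experience extraction to matter during training.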
Figs. 8 and 9 show the reward curves with and without preferential experience extraction for emergencies, respectively. With it, the training variance is clearly reduced: the reward stays near 40 and drops below 30 only once. Without it, the reward fluctuates sharply, even dropping below 15. In the portions where the reward exceeds 40, the extra reward is obtained because an emergency was resolved.
Figs. 10 and 11 are the reward curves of agent 3 and agent 6 without invalid action masks and adaptive conflict coefficients; in this case the agents are clearly unable to complete the task at all. Figs. 12 and 13 are the reward curves of agent 3 and agent 6 without the adaptive conflict coefficient; here the agents' rewards converge and stabilize around 76, indicating good performance. Figs. 14-19 show the reward curves of the agents under the proposed algorithm; in this case the agents again complete their tasks well, and compared with the variant without the adaptive conflict coefficient, convergence is reached 30-50 training episodes sooner and the reward fluctuates less after 300 training rounds. Under the proposed algorithm the agents' reward curves fluctuate around training episode 60, because the conflict coefficient is still changing and briefly disturbs training, but stability is quickly recovered. These examples demonstrate the effectiveness of the present invention.
The above test examples of the present invention are merely to illustrate the calculation model and the calculation flow of the present invention in detail, and are not intended to limit the embodiments of the present invention. It will be apparent to those skilled in the art that other variations and modifications can be made on the basis of the foregoing description, and it is not intended to be exhaustive of all embodiments, and all obvious variations and modifications can be made without departing from the scope of the invention.
Claims (1)
1. A method for scheduling carrier-based aircraft support operators for emergency situations is characterized by comprising the following steps:
step 1: analyzing the handling recovery process of the carrier-based aircraft;
step 2: according to the guarantee operation characteristics, a Markov decision process model is constructed to be used as an intelligent agent training environment;
step 3: processing the environmental data;
step 4: designing a multi-agent algorithm for the scheduling of guarantee personnel; according to the process characteristics, adding an invalid action mask, a preferential experience extraction mechanism for emergencies and an adaptive conflict penalty coefficient, and finally adopting suboptimal selection to avoid the few remaining conflicts and optimize the overall training;
step 5: training the intelligent agent in the constructed environment until the intelligent agent can output a good scheduling strategy;
step 6: the trained intelligent agent is applied to a scene to guide support personnel to carry out support operation, and at the moment, the intelligent agent can store real data into an experience pool so that the intelligent agent can learn again in idle time;
the method is characterized in that the environmental data processing process in the step 3 is as follows:
the total number of the security stands is n, 4 types of security groups are arranged, and each type of security group has miI is the guaranteed team category, then the total state space size is:
the huge state space is not beneficial to learning, and for each intelligent agent, the environment states of other guarantee processes are not necessary information for the learning of the intelligent agent, so that the environment state processing is set, the necessary information for the learning of the intelligent agent is screened out for each intelligent agent from the huge environment states, and the information is sorted and then delivered to the intelligent agent for learning; the environment information after finishing is as follows:
in the formula, state _ space _ agent is environment information received by the intelligent agent, state _ s _ agent is a simplified guaranteed machine position state, and wt is a process number which can be executed by the current intelligent agent; the team _ s _ agent is the simplified guarantee group state, op is the current position of the agent, ob is whether the agent is executing the guarantee process, and the simplified state space is 3n×mi×2;
The multi-agent algorithm design process in the step 4 is characterized by comprising the following steps:
because the SAC (Soft Actor-Critic) algorithm has good robustness, it is used as the basic algorithm; aiming at the unstable-environment problem of a distributed single-agent algorithm, an action value function sharing global information is designed for the intelligent agent, and the strategy network is updated using this value evaluation, realizing strategy updating that considers global information and centralized training with decentralized execution;
the method is characterized in that the invalid action masking in the step 4, a prior experience extraction mechanism aiming at the emergency and an adaptive conflict penalty coefficient are designed as follows:
an invalid action shielding mechanism, which shields actions that can be selected but have no practical significance; the probability of selecting an invalid action is set to zero, so that the back-propagation gradient of the invalid action is also zero;
the preferential experience extraction mechanism for emergencies preferentially extracts the experiences of emergencies while balancing their selection counts, so that learning targets the emergencies; the experience priority is calculated as:
P = ne·e^(−η·nc) (2)
in the formula, P is the experience priority, ne is the number of emergency situations when the experience was sampled, η is a proportionality coefficient and a hyperparameter, and nc is the cumulative number of times the experience has been sampled;
because it is inconvenient to directly set the negative reward for conflict actions, the adaptive conflict penalty coefficient is adopted to adjust this negative reward automatically; the optimization objective function is:
J(δ)=δ·(τ-c) (3)
in the formula, J is the optimization objective function, δ is the conflict penalty coefficient, τ is the target conflict number, usually set to a small positive number, and c is the actual conflict number.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210211487.2A CN114595958B (en) | 2022-02-28 | 2022-02-28 | Shipboard aircraft guarantee operator scheduling method aiming at emergency |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210211487.2A CN114595958B (en) | 2022-02-28 | 2022-02-28 | Shipboard aircraft guarantee operator scheduling method aiming at emergency |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114595958A true CN114595958A (en) | 2022-06-07 |
CN114595958B CN114595958B (en) | 2022-10-04 |
Family
ID=81807421
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210211487.2A Active CN114595958B (en) | 2022-02-28 | 2022-02-28 | Shipboard aircraft guarantee operator scheduling method aiming at emergency |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114595958B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116307251A (en) * | 2023-04-12 | 2023-06-23 | 哈尔滨理工大学 | Work schedule optimization method based on reinforcement learning |
CN117215196A (en) * | 2023-10-17 | 2023-12-12 | 成都正扬博创电子技术有限公司 | Ship-borne comprehensive control computer intelligent decision-making method based on deep reinforcement learning |
CN118297357A (en) * | 2024-06-05 | 2024-07-05 | 中国人民解放军海军航空大学 | Airplane guarantee operation scheduling method and device based on graph attention neural network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110781614A (en) * | 2019-12-06 | 2020-02-11 | 北京工业大学 | Shipboard aircraft tripping recovery online scheduling method based on deep reinforcement learning |
CN112215328A (en) * | 2020-10-29 | 2021-01-12 | 腾讯科技(深圳)有限公司 | Training of intelligent agent, and action control method and device based on intelligent agent |
CN112395690A (en) * | 2020-11-24 | 2021-02-23 | 中国人民解放军海军航空大学 | Reinforced learning-based shipboard aircraft surface guarantee flow optimization method |
CN112685883A (en) * | 2020-12-23 | 2021-04-20 | 郑州大学 | Guarantee operation scheduling method for shipboard aircraft |
CN113706023A (en) * | 2021-08-31 | 2021-11-26 | 哈尔滨理工大学 | Shipboard aircraft guarantee operator scheduling method based on deep reinforcement learning |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110781614A (en) * | 2019-12-06 | 2020-02-11 | 北京工业大学 | Shipboard aircraft tripping recovery online scheduling method based on deep reinforcement learning |
CN112215328A (en) * | 2020-10-29 | 2021-01-12 | 腾讯科技(深圳)有限公司 | Training of intelligent agent, and action control method and device based on intelligent agent |
CN112395690A (en) * | 2020-11-24 | 2021-02-23 | 中国人民解放军海军航空大学 | Reinforced learning-based shipboard aircraft surface guarantee flow optimization method |
CN112685883A (en) * | 2020-12-23 | 2021-04-20 | 郑州大学 | Guarantee operation scheduling method for shipboard aircraft |
CN113706023A (en) * | 2021-08-31 | 2021-11-26 | 哈尔滨理工大学 | Shipboard aircraft guarantee operator scheduling method based on deep reinforcement learning |
Non-Patent Citations (1)
Title |
---|
YU Tongtong et al.: "Research on Online Scheduling Method for Carrier-Based Aircraft Based on Deep Reinforcement Learning", High Technology Letters (《高技术通讯》) *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116307251A (en) * | 2023-04-12 | 2023-06-23 | 哈尔滨理工大学 | Work schedule optimization method based on reinforcement learning |
CN116307251B (en) * | 2023-04-12 | 2023-09-19 | 哈尔滨理工大学 | Work schedule optimization method based on reinforcement learning |
CN117215196A (en) * | 2023-10-17 | 2023-12-12 | 成都正扬博创电子技术有限公司 | Ship-borne comprehensive control computer intelligent decision-making method based on deep reinforcement learning |
CN117215196B (en) * | 2023-10-17 | 2024-04-05 | 成都正扬博创电子技术有限公司 | Ship-borne comprehensive control computer intelligent decision-making method based on deep reinforcement learning |
CN118297357A (en) * | 2024-06-05 | 2024-07-05 | 中国人民解放军海军航空大学 | Airplane guarantee operation scheduling method and device based on graph attention neural network |
Also Published As
Publication number | Publication date |
---|---|
CN114595958B (en) | 2022-10-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114595958B (en) | Shipboard aircraft guarantee operator scheduling method aiming at emergency | |
CN113706023B (en) | Shipboard aircraft guarantee operator scheduling method based on deep reinforcement learning | |
CN110852448A (en) | Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning | |
CN109740839A (en) | Train Dynamic method of adjustment and system under a kind of emergency event | |
CN110781614A (en) | Shipboard aircraft tripping recovery online scheduling method based on deep reinforcement learning | |
CN108256671A (en) | A kind of more resources of multitask based on learning-oriented genetic algorithm roll distribution method | |
CN113919068B (en) | Task-based aviation equipment support system simulation evaluation method | |
CN115185294B (en) | QMIX-based aviation soldier multi-formation collaborative autonomous behavior decision modeling method | |
Zhao et al. | Designing two-level rescue depot location and dynamic rescue policies for unmanned vehicles | |
CN117196169A (en) | Machine position scheduling method based on deep reinforcement learning | |
Qu et al. | Design and implementation of system generator based on rule engine | |
CN117762464B (en) | Cloud computing-based software operation and maintenance system and method | |
CN117361013A (en) | Multi-machine shelf storage scheduling method based on deep reinforcement learning | |
CN112395690A (en) | Reinforced learning-based shipboard aircraft surface guarantee flow optimization method | |
CN117114282A (en) | Port scheduling method, port scheduling device, electronic equipment and storage medium | |
Zhang et al. | Dispatching and path planning of automated guided vehicles based on petri nets and deep reinforcement learning | |
Zhang et al. | MARL-Based Multi-Satellite Intelligent Task Planning Method | |
CN113962447B (en) | Complex equipment batch long-term maintenance plan optimization method based on improved particle swarm algorithm | |
WU et al. | Carrier-based aircraft operation support scheduling based on apprenticeship learning algorithm | |
Hao et al. | Cooperative Carrier Aircraft Support Operation Scheduling via Multi-Agent Reinforcement Learning | |
Wang et al. | A Multi-Crane Scheduling Scheme with Dynamic Priority in Transit Warehouse | |
CN117634859B (en) | Resource balance construction scheduling method, device and equipment based on deep reinforcement learning | |
CN115293451B (en) | Resource dynamic scheduling method based on deep reinforcement learning | |
CN115062896B (en) | Aviation unit scheduling task loop generation method and computer readable storage medium | |
CN118153649B (en) | Soft and hard all-in-one machine integrating large model training and reasoning and large model training method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||