CN114861368B - Construction method of railway longitudinal section design learning model based on near-end strategy - Google Patents

Construction method of railway longitudinal section design learning model based on near-end strategy

Info

Publication number
CN114861368B
CN114861368B (application number CN202210659378.7A)
Authority
CN
China
Prior art keywords
slope
point
rewards
railway
longitudinal section
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210659378.7A
Other languages
Chinese (zh)
Other versions
CN114861368A (en)
Inventor
缪鹍
戴炎林
况卫
周启航
肖智
王介源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202210659378.7A priority Critical patent/CN114861368B/en
Publication of CN114861368A publication Critical patent/CN114861368A/en
Application granted granted Critical
Publication of CN114861368B publication Critical patent/CN114861368B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/10Geometric CAD
    • G06F30/17Mechanical parametric or variational design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a construction method of a railway longitudinal section design learning model based on a near-end strategy, relates to the application of deep reinforcement learning theory in the field of intelligent railway alignment selection, and is an intelligent design method for railway longitudinal section schemes based on the proximal policy optimization algorithm (Proximal Policy Optimization, PPO). The invention builds a railway longitudinal section design learning model based on proximal policy optimization, combines a railway longitudinal section cutting-line model with deep reinforcement learning theory, defines the state vector and action vector of the cutting-line model, handles the various constraints of railway longitudinal section design through a reward function, and gives the form of a railway longitudinal section cost reward function. The automatically optimized longitudinal section scheme takes engineering cost and the operating environment into account comprehensively, bypasses obstacles and adapts to the terrain well, and provides an early design reference for engineering designers.

Description

Construction method of railway longitudinal section design learning model based on near-end strategy
Technical Field
The invention relates to the application of deep reinforcement learning theory in the field of intelligent railway alignment selection, and in particular to an intelligent design method for railway longitudinal section schemes based on a near-end strategy (Proximal Policy Optimization, PPO).
Background
Reinforcement learning is a machine learning method in which an agent maximizes reward or achieves a specific goal by continually learning and optimizing its policy while interacting with the environment. Conventional reinforcement learning methods generally use value iteration, policy iteration and Q-learning to solve the Bellman optimality equation. When the environment of the problem to be solved is particularly complex, the information the agent must process keeps growing, and with these methods the iteration process becomes extremely complex and long. With the rapid development of deep learning in recent years, its nonlinear representation ability has been used to solve reinforcement learning problems, forming the theory of deep reinforcement learning and a range of deep reinforcement learning algorithms. These methods are widely applied in research fields such as Atari games, path planning and robot control, among which the proximal policy optimization algorithm is the most commonly used.
Disclosure of Invention
The invention uses reinforcement learning theory to construct a longitudinal section design learning model based on the proximal policy optimization algorithm. The main content of the invention is as follows:
(1) Firstly, a longitudinal section cutting-line model is constructed: the motion mode of the slope change points is designed, a method for calculating the slope change point coordinates is obtained, and a kinematic model of the environment in the reinforcement learning problem is built; the objective function, design variables and constraints of the longitudinal section cutting-line model are defined, laying a theoretical basis for the definition of the elements required by the subsequent reinforcement learning;
The longitudinal section cutting-line model is defined in a plane rectangular coordinate system with route mileage as the S axis and elevation as the Z axis. The starting point of the line is S(S_S, Z_S), the end point is E(S_E, Z_E), and the distance between the two points is l_s = S_E − S_S. Assume the initial longitudinal section scheme has M slope change points; M cutting lines M_1, M_2, …, M_M are set perpendicular to the S axis at uniform spacing, so the initial distance between adjacent cutting lines is d_s = l_s/(M+1). Let the intersection of the m-th cutting line with the baseline be B_m (m = 1, 2, …, M); its coordinates (S_Bm, Z_Bm) in the S-Z coordinate system satisfy S_Bm = S_S + m·d_s and Z_Bm = Z_S + (m·d_s/l_s)(Z_E − Z_S);
A local s-z coordinate system is built with B_m (m = 1, 2, …, M) as its origin; the s-z system is a local coordinate system relative to the S-Z system. The slope change point VI_m (m = 1, 2, …, M) has coordinates (s_m, z_m) in the s-z system, and its coordinates (S_m, Z_m) in the S-Z system satisfy S_m = S_Bm + s_m and Z_m = Z_Bm + z_m;
The optimization objectives comprise earthwork, bridge and tunnel construction costs, and the constraints require the design result to comply with the industry design specifications: the slope lengths, gradients and gradient differences of the scheme lie within the required ranges, the alignment passes through the start point, end point and control points, and transition curves do not overlap vertical curves;
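For illustration, a minimal Python sketch of the cutting-line geometry above is given below; it assumes the baseline is the straight segment from S to E, and all function and variable names are illustrative rather than prescribed by the model.

```python
import numpy as np

def baseline_intersections(S_start, Z_start, S_end, Z_end, M):
    """Return the intersections B_m of the M cutting lines with the S-E baseline."""
    l_s = S_end - S_start                    # start-to-end distance
    d_s = l_s / (M + 1)                      # spacing between adjacent cutting lines
    m = np.arange(1, M + 1)
    S_B = S_start + m * d_s                  # mileage of each intersection B_m
    Z_B = Z_start + (m * d_s / l_s) * (Z_end - Z_start)  # elevation on the baseline
    return np.stack([S_B, Z_B], axis=1)      # shape (M, 2)

def to_global(B, s_local, z_local):
    """Convert local offsets (s_m, z_m) measured from B_m into global (S_m, Z_m)."""
    return B + np.stack([s_local, z_local], axis=1)

# toy usage: 3 slope change points on a 12 km line climbing from 100 m to 160 m
B = baseline_intersections(0.0, 100.0, 12000.0, 160.0, M=3)
VI = to_global(B, s_local=np.array([-50.0, 20.0, 0.0]),
                  z_local=np.array([3.0, -2.5, 1.0]))
```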
(2) Then, based on reinforcement learning theory and combined with the longitudinal section cutting-line model, the reinforcement learning elements are defined: the forms of the state vector and action vector are specified, a cost reward function and a violation reward function are designed, and the reinforcement learning model for intelligent longitudinal section design is constructed;
The model takes the number of slope change points, the mileage and elevation of each slope change point, and the start and end point elevations as design variables, and defines the corresponding reinforcement learning elements of the PPO algorithm, namely a state vector s, an action vector a and a reward value r; the state vector s and action vector a of the longitudinal section cutting-line model are defined as follows:
Each scheme has M+1 slope-length constraints, M+1 gradient constraints, M gradient-difference constraints and 3 start/end point constraints (including the gradient of the start-end connecting line), plus N_c control-point constraints and N_h transition-curve/vertical-curve overlap constraints, where N_h is the larger of the number of transition curves and the number of vertical curves. A scheme violation count variable f is defined, with maximum value f_max = 3M + 5 + N_c + N_h. Let M_max denote the maximum number of slope change points, l_s = S_E − S_S the start-to-end distance, Z_max the maximum elevation in the environment, and (S_m, Z_m) (m = 1, 2, …, M) the coordinates of slope change point VI_m in the S-Z coordinate system. The state vector s then collects these quantities in normalized form: the violation count f/f_max, the number of slope change points M/M_max, and the mileage and elevation of the start point, end point and every slope change point, normalized by l_s and Z_max respectively;
Let ΔM denote the change in the number of slope change points and (s_m, z_m) (m = 1, 2, …, M) the coordinates of slope change point VI_m in the local s-z coordinate system with origin B_m; the action vector a is then defined as a = (ΔM, s_1, s_2, …, s_M, z_1, z_2, …, z_M);
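The sketch below shows one way such state and action vectors can be assembled in Python; the exact ordering and normalization of the patent's formulas are not reproduced, so the layout here is an assumption for illustration only.

```python
import numpy as np

def build_state(f, f_max, M, M_max, points_SZ, l_s, Z_max):
    """points_SZ: (M+2, 2) array of (S, Z) for the start point, the M slope change
    points and the end point, in global coordinates."""
    S_norm = points_SZ[:, 0] / l_s           # mileages normalized by the start-end distance
    Z_norm = points_SZ[:, 1] / Z_max         # elevations normalized by the maximum elevation
    return np.concatenate([[f / f_max, M / M_max], S_norm, Z_norm])

def build_action(delta_M, s_local, z_local):
    """a = (ΔM, s_1, ..., s_M, z_1, ..., z_M) in the local cutting-line coordinates."""
    return np.concatenate([[delta_M], s_local, z_local])
```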
The reward value r is r = r_1 + r_2, where r_1 is a cost reward and r_2 is a violation reward. The cost reward r_1 is defined as a negative reward so that minimizing engineering cost is encouraged, and the violation reward r_2 handles constraint violations of the longitudinal section scheme. The cost reward r_1 combines the earthwork cost f_E, the bridge cost f_B and the tunnel cost f_T, normalized using the earthwork unit price c and A_max, the maximum of all roadbed cross-sectional areas A_i;
The violation reward r_2 consists of six parts: the slope-length violation reward r_21, the gradient violation reward r_22, the gradient-difference violation reward r_23, the start-end connecting-line gradient violation reward r_24, the control-point elevation violation reward r_25 and the vertical-curve/transition-curve overlap violation reward r_26;
The final r_2 is obtained by combining the sum of these six terms with the scheme violation count f.
The variable α_i (i = 1, 2, …, M+1) indicates whether the slope length l_i of the i-th slope segment is shorter than the minimum slope length l_min; the slope-length violation reward r_21 penalizes the scheme according to the α_i. The variable β_j (j = 1, 2, …, M+1) indicates whether the absolute gradient |i_j| of the j-th slope segment exceeds the maximum gradient i_max; the gradient violation reward r_22 penalizes the scheme according to the β_j. The variable χ_j (j = 1, 2, …, M) indicates whether the absolute gradient difference |Δi_j| between the j-th and (j+1)-th slope segments exceeds the allowed value; the gradient-difference violation reward r_23 penalizes the scheme according to the χ_j. The start-end connecting-line gradient violation reward r_24 penalizes schemes whose connecting-line gradient i_SE = (Z_E + z_E − Z_S − z_S)/l_s violates the gradient limit. The variable δ_k (k = 1, 2, …, N_c) indicates whether the design elevation z_k at the k-th control point lies outside its constraint bounds [z_kmin, z_kmax]; the control-point elevation violation reward r_25 penalizes the scheme according to the δ_k. The vertical-curve/transition-curve overlap violation reward r_26 is defined in terms of l_m, the total length of the overlapping portion of the m-th vertical curve and a transition curve, and the tangent length of the m-th vertical curve;
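Because the exact penalty formulas are not reproduced here, the following sketch only illustrates the indicator-style structure of the violation reward: each violated constraint flips an indicator, the six partial rewards are summed, and the violation count f is returned alongside them. The constant PENALTY and the per-constraint weighting are assumptions, not the patent's values.

```python
import numpy as np

PENALTY = -1.0   # placeholder magnitude for one violated constraint (not the patent's value)

def violation_reward(slope_len, grad, grad_diff, i_SE, z_ctrl, z_ctrl_min, z_ctrl_max,
                     overlap_len, l_min, i_max, di_max):
    """Indicator-style r_2: every violated constraint contributes one penalty unit."""
    alpha = slope_len < l_min                               # slope-length violations (M+1 segments)
    beta = np.abs(grad) > i_max                             # gradient violations (M+1 segments)
    chi = np.abs(grad_diff) > di_max                        # gradient-difference violations (M)
    delta = (z_ctrl < z_ctrl_min) | (z_ctrl > z_ctrl_max)   # control-point elevation violations (N_c)
    sigma = overlap_len > 0                                 # vertical/transition curve overlaps
    r21, r22, r23 = PENALTY * alpha.sum(), PENALTY * beta.sum(), PENALTY * chi.sum()
    r24 = PENALTY * float(abs(i_SE) > i_max)                # start-end connecting-line gradient
    r25, r26 = PENALTY * delta.sum(), PENALTY * sigma.sum()
    f = int(alpha.sum() + beta.sum() + chi.sum()            # scheme violation count
            + (abs(i_SE) > i_max) + delta.sum() + sigma.sum())
    r2 = r21 + r22 + r23 + r24 + r25 + r26                  # the patent additionally combines this sum with f
    return r2, f
```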
(3) Finally, the PPO algorithm is introduced into the longitudinal section cutting-line model, and an effective-dimension design is proposed for the dimensionality and state-update problems that arise when the number of slope change points is itself optimized; the division of bridges and tunnels along the line is defined, and a nominal cost is introduced that converts bridge and tunnel costs into equivalent earthwork costs under a unified standard, so that the cost reward function is unified and algorithm training is simplified; the concrete implementation of the PPO-based longitudinal section reinforcement learning model is described and a technical route is given;
The effective dimension means that the input and output vectors of the algorithm are handled specially: in the relevant calculations, only the entries within the dimension corresponding to the current number of slope change points are used. This setting satisfies the neural network's requirement for fixed input and output dimensions and guarantees that the vector dimensions cannot overflow during optimization; experiments show that it does not affect the convergence of the algorithm;
Using effective dimensions, however, raises the problem of how schemes with different numbers of slope change points learn from one another. The invention introduces a restart mechanism for this: when the previous state has n_z1 slope change points, the position information of the start point, end point and all slope change points has dimension 2(n_z1 + 2); after the neural network computes a perturbation Δn_z of the number of slope change points, the next state has n_z2 slope change points and its position information has dimension 2(n_z2 + 2), so the effective dimensions of the action entries unrelated to the start and end points no longer match. In this case the model restarts: it regenerates a longitudinal section scheme with n_z2 slope change points and then superimposes the action computed from the previous state onto it to produce the longitudinal section scheme of the next state.
The internal design steps of the model are as follows (an illustrative sketch of this loop is given after the steps):
Step 1: initialize the longitudinal section cutting-line model and the proximal policy optimization algorithm to obtain an initial longitudinal section scheme and an initial state vector; the initial scheme serves as the reference scheme;
Step 2: input the state vector; the proximal policy outputs an action vector;
Step 3: the cutting-line model checks whether the number of slope change points has changed; if so, a longitudinal section scheme with the new number of slope change points is regenerated and taken as the new reference scheme; the action is then superimposed on the reference scheme to obtain a new longitudinal section scheme;
Step 4: compute the new state vector and the reward value of this update;
Step 5: record and save s_t, a_t, s_{t+1}, r_{t+1};
Step 6: repeat Step 2 to Step 5; after a certain number of steps, the neural networks in the PPO algorithm begin to update their parameters, i.e. the policy is updated;
Step 7: when the termination condition is reached, the algorithm has converged and reinforcement learning has succeeded.
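A compact Python sketch of Steps 1–7 follows; env, agent and their methods (reset, select_action, step, update) are hypothetical wrappers around the cutting-line model and the PPO networks, named here only for illustration.

```python
def train(env, agent, total_steps=100_000, update_every=256):
    """Steps 1-7: interact with the cutting-line environment and periodically update PPO."""
    s = env.reset()                          # Step 1: initial profile scheme and state vector
    buffer = []
    for t in range(total_steps):
        a = agent.select_action(s)           # Step 2: the proximal policy outputs an action
        s_next, r, done = env.step(a)        # Steps 3-4: superimpose the action (restart if ΔM != 0),
                                             #            compute the new state and the reward
        buffer.append((s, a, r, s_next))     # Step 5: record s_t, a_t, r_{t+1}, s_{t+1}
        s = env.reset() if done else s_next
        if (t + 1) % update_every == 0:      # Step 6: update the actor/critic networks
            agent.update(buffer)
            buffer.clear()
    return agent                             # Step 7: stop once the termination condition is met
```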
The technical scheme of the model is shown in figure 1.
Drawings
FIG. 1 is the technical roadmap of the model;
FIG. 2 is a schematic diagram of deep reinforcement learning;
FIG. 3 shows the interaction of PPO with the environment;
FIG. 4 is a flow chart of the Critic network update;
FIG. 5 is a flow chart of the Actor network update in PPO-Clip.
Detailed Description
First, a deep reinforcement learning theory is introduced.
A method that uses the nonlinear approximation ability of deep neural networks to approximate the optimal value function or the optimal policy of a reinforcement learning problem is called a deep reinforcement learning method (see FIG. 2); such methods can be divided into value-function-based and policy-based deep reinforcement learning algorithms.
(1) Deep reinforcement learning algorithm based on value function
A value-function-based deep reinforcement learning algorithm establishes a function v̂(s, θ) described by a parameter θ (i.e. the neural network weights w and bias terms b), approximating the value function with the neural network parameter θ. The deep neural network computes the value of a state from a continuous input variable s describing the state features, and θ is adjusted so that the output is consistent with the state values obtained under a given policy π; the function is thus an approximate representation of the state value function v_π(s):
v̂(s, θ) ≈ v_π(s)   (1)
An approximate representation of the value function is constructed in this way, and the reinforcement learning problem is converted into solving for the approximate value-function parameter θ. Using the deep neural network as a nonlinear approximator, a state value function of the simplest form can be written as
v̂(s, θ) = σ(w·s + b)   (2)
where the nonlinear function σ is the activation function of the neuron and b is the bias term. Using the backpropagation and gradient descent algorithms of the deep neural network, the parameters w and b are iterated continuously so that the value function approaches the optimal value function; the action to execute can then be selected with the common ε-greedy policy, and the reinforcement learning problem is solved.
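As a concrete illustration of this idea, the PyTorch sketch below fits a small network v̂(s, θ) to target values by gradient descent; the architecture, optimizer and targets are generic assumptions rather than the patent's choices.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """v_hat(s, theta): nonlinear approximation of the state value function."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),   # sigma(w*s + b), stacked
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

def fit_value(net, states, targets, lr=1e-3, epochs=10):
    """Iterate the weights and biases by backpropagation so v_hat approaches the targets."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(net(states), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net
```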
(2) Deep reinforcement learning algorithm based on strategy
A policy-based deep reinforcement learning algorithm directly approximates the policy with the nonlinearity of a deep neural network and iterates the policy parameters by computing the expected total reward of the policy, so as to approach the optimal policy. When solving the reinforcement learning problem, a deep neural network with parameter θ represents the policy π, and optimizing θ is optimizing π. The main idea of the classical policy gradient algorithm is: at time t, the state s_t obtained from the agent's interaction with the environment is input to the policy π with parameter θ, which outputs a probability distribution over actions; the agent selects an action a_t, executes it, obtains the next state s_{t+1} and receives the reward r_{t+1}. By repeating these steps, a batch of samples is collected under policy π, and the parameter θ is updated to θ' by stochastic gradient descent, changing policy π into policy π', until the policy becomes the optimal policy π*.
Assume that a complete round (episode) under policy π is the trajectory of states, actions and rewards obtained as the agent goes from the initial state s_0 through all intermediate states to the terminal state s_{T−1}. Round e is defined as
e = {s_0, a_0, r_1, s_1, a_1, r_2, …, s_{T−1}, a_{T−1}, r_T}
The probability of occurrence of round e is
p_θ(e) = p(s_0) p_θ(a_0|s_0) p(s_1|s_0, a_0) p_θ(a_1|s_1) … = p(s_0) ∏_{t=0}^{T−1} π_θ(a_t|s_t) p(s_{t+1}|s_t, a_t)   (3)
The most basic policy-based deep reinforcement learning algorithm optimizes the expected total reward of a complete round under the policy. The sum R(e) of all discounted rewards collected from the initial state s_0 until the terminal state s_{T−1} is
R(e) = ∑_{t=0}^{T−1} γ^t r_{t+1}   (4)
and its expectation is
R̄_θ = E_{e∼p_θ(e)}[R(e)] = ∑_e R(e) p_θ(e)   (5)
Unlike value-function-based algorithms, policy-based deep reinforcement learning algorithms generally update the policy directly along the gradient of the expected total reward so as to maximize it. For N rounds of T time steps each, the policy gradient of the cumulative reward can be approximated as
∇R̄_θ ≈ (1/N) ∑_{n=1}^{N} ∑_{t=0}^{T−1} R(e^n) ∇log π_θ(a_t^n | s_t^n)   (6)
After a number of complete rounds, the N samples approximate the expectation of the cumulative reward, and the parameter θ is updated by
θ ← θ + η·∇R̄_θ   (7)
where η is the learning rate, which controls the rate of parameter updates.
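The sketch below writes the update of equations (6)–(7) in REINFORCE form for a Gaussian policy over continuous actions: the log-probabilities of the taken actions are weighted by each round's return and the parameters move along the resulting gradient. It is a generic illustration of the policy gradient, not the patent's implementation.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi_theta(a|s): a Gaussian over continuous actions with learned mean and std."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, action_dim))
        self.log_sigma = nn.Parameter(torch.zeros(action_dim))

    def dist(self, s):
        return torch.distributions.Normal(self.mu(s), self.log_sigma.exp())

def policy_gradient_update(policy, episodes, eta=1e-3):
    """theta <- theta + eta * grad of (1/N) * sum_n R(e^n) * sum_t log pi(a_t^n | s_t^n)."""
    opt = torch.optim.SGD(policy.parameters(), lr=eta)
    loss = 0.0
    for states, actions, episode_return in episodes:     # episode_return = R(e^n)
        logp = policy.dist(states).log_prob(actions).sum()
        loss = loss - episode_return * logp              # minimize the negative objective
    (loss / len(episodes)).backward()
    opt.step()
    opt.zero_grad()
```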
the difference between the deep reinforcement learning algorithm based on the value function and the deep reinforcement learning algorithm based on the strategy is that the former outputs the value of all actions based on the environmental information, and the intelligent agent only selects the action with the largest value, and the actions in the method are discrete. And outputting all the probabilities of possible actions according to the environment information, wherein the intelligent agent can select the next action according to the probabilities, and each action is possible to be selected, so that the continuous action problem can be solved by adopting the method.
The design of a railway longitudinal section involves many factors; the final alignment is determined by environment, topography, geology and other conditions, and the design method or design strategy cannot be captured by a simple linear model. The nonlinearity of deep learning can address this: through activation functions, deep learning establishes a nonlinear mapping from a series of features to results, laying a foundation for its application in the field of alignment selection. Alignment design is, however, particularly complex: plain supervised learning requires a large amount of labelled data, and although a large amount of alignment scheme data exists, it is difficult to convert into quantified data usable by deep learning, and schemes from different terrains differ too much for effective training. Reinforcement learning does not require such supervision: it lets an agent discover a problem-solving strategy without large amounts of labelled data. Combining deep learning with reinforcement learning is therefore a promising research route toward artificial-intelligence-assisted railway alignment selection.
(3) Common deep reinforcement learning algorithm
Deep reinforcement learning algorithms have been developed for many years, with different network structures and parameter-update methods. The Deep Q-Network (DQN) proposed by Mnih et al. introduced an experience replay mechanism that removes the correlation between samples and added a target Q-value network, improving the stability of the algorithm; it performs excellently on discrete-action problems. The Double Deep Q-Network (DDQN), an improvement on DQN, keeps two sets of network parameters θ and θ⁻: θ is used to select the optimal action with the largest Q value, while θ⁻ is used to evaluate the Q value of that action. DDQN thus separates action selection from action evaluation, reducing the risk of Q-value overestimation present in DQN. Value-function-based deep reinforcement learning algorithms such as DQN and DDQN are suitable for discrete-action problems and suffer from dimensionality explosion when handling continuous actions. Railway alignment selection involves many design variables and many design constraints, and variables such as design elevation are continuous, so these methods are clearly unsuitable, and a policy-based deep reinforcement learning method should be chosen.
Common policy-based reinforcement learning methods include the policy gradient algorithm and the REINFORCE algorithm. The most widely applied deep reinforcement learning algorithms now combine the ideas of value functions and policies; the most basic of these is the Actor-Critic (AC) algorithm. The AC algorithm has two network structures: one, called the Actor, selects behaviors according to probabilities, i.e. the policy system; the other, called the Critic, scores the behaviors selected by the Actor, i.e. the value system. The Actor modifies its parameters according to the Critic's score, thereby modifying the probabilities of the selected behaviors. Because the Critic network is difficult to converge and the update of the Actor network depends strongly on the Critic's value judgment, convergence of the whole AC algorithm becomes very slow.
To address the slow convergence of the AC algorithm, Lillicrap et al. proposed the Deep Deterministic Policy Gradient (DDPG) algorithm, which builds on the Deterministic Policy Gradient (DPG) method and incorporates the ideas of DQN. Like AC, DDPG has a value-function-based network and a policy-based network; because of the DQN ideas, the former contains a state estimation network and a state target network, and the latter contains an action estimation network and an action target network. In the policy network structure, the action estimation network outputs actions in real time for the Actor to execute, while the action target network is used to update the value network structure. Both networks in the value network structure output the value of the current state, but the input of the state target network comes from two parts, the output of the action target network and the state observation, while the input of the state estimation network comes from the output of the action estimation network. DDPG also uses an experience replay mechanism. Experiments show that in continuous-action tasks DDPG outperforms DQN in every respect, and its learning process converges much faster than the AC algorithm.
The original policy gradient algorithm is sensitive to parameters such as the learning rate and step size, which makes it hard to train on certain problems. If the step size is set too large, the learned policy keeps jumping around and the algorithm does not converge; conversely, if the step size is too small, training takes an extremely long time. To address this, the Trust Region Policy Optimization (TRPO) algorithm and the Proximal Policy Optimization (PPO) algorithm were proposed in turn; both limit the update amplitude of the new policy relative to the old policy, making the algorithm easier to train and converge. Compared with TRPO, PPO is easier to implement and performs more stably on most tasks; it has become OpenAI's default algorithm for reinforcement learning problems and one of the most widely used deep reinforcement learning algorithms.
The near-end strategy optimization (Proximal Policy Optimization, PPO) algorithm adopted by the invention is, like the AC algorithm, a deep reinforcement learning algorithm based on both values and policies. To limit the update amplitude of the policy, the PPO algorithm uses two Actor networks, one Critic network and a database that stores the trial-and-error data.
As shown in FIG. 3, the PPO algorithm has two Actor networks, Actor_New and Actor_Old, and one Critic network. The agent interacts with the environment to obtain a state s, which is input to the Actor_New network; Actor_New outputs a mean μ and a variance σ that define a normal distribution over actions; an action a is sampled from this distribution and applied to the environment; the agent executes action a, obtains a reward r and moves to the next state s_, which becomes the next input s. The tuple [(s, a, r), s_] is stored in the database and the above steps are repeated. Actor_New is not updated during this process.
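A Python sketch of this collection phase follows; actor_new and env are hypothetical objects with the interfaces shown (a network returning (μ, σ) and an environment with reset/step), assumed only for illustration.

```python
import torch

def collect(env, actor_new, horizon=128):
    """Roll out Actor_New without updating it; store [(s, a, r), s_next] tuples."""
    buffer, s = [], env.reset()
    for _ in range(horizon):
        with torch.no_grad():
            mu, sigma = actor_new(torch.as_tensor(s, dtype=torch.float32))
            a = torch.distributions.Normal(mu, sigma).sample()
        s_next, r, done = env.step(a.numpy())
        buffer.append((s, a.numpy(), r, s_next))          # trial-and-error database
        s = env.reset() if done else s_next
    return buffer
```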
As shown in FIG. 4, after a complete round of training, the state s_last of the last time step is input to the Critic network to obtain the value estimate v* of the last state; the discounted rewards R_t of all time steps are then computed, giving the real value matrix V_t of all states. All states s stored in the database are input to the Critic network to obtain the estimated value matrix V; c_loss is computed from V_t and V and backpropagated, and the Critic network parameters are updated with the gradient descent algorithm:
V_t = R_t = r_t + γ r_{t+1} + γ² r_{t+2} + … + γ^{T−1−t} r_{T−1} + γ^{T−t} v*   (t = 0, 1, 2, …, T)   (8)
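The sketch below illustrates the Critic update of FIG. 4 and equation (8): discounted returns are bootstrapped backwards from the value estimate v* of the last state, and c_loss is taken here as the mean squared error between these returns and the Critic's estimates; the critic interface and the use of MSE are assumptions.

```python
import torch

def discounted_returns(rewards, v_last, gamma=0.99):
    """V_t = r_t + gamma*r_{t+1} + ... + gamma^(T-t) * v*, computed backwards (equation (8))."""
    returns, running = [], float(v_last)
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return torch.tensor(list(reversed(returns)), dtype=torch.float32)

def update_critic(critic, optimizer, states, returns):
    """c_loss = MSE(V_t, V); backpropagate and take one gradient-descent step."""
    values = critic(states).squeeze(-1)
    c_loss = torch.nn.functional.mse_loss(values, returns)
    optimizer.zero_grad()
    c_loss.backward()
    optimizer.step()
    return c_loss.item()
```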
To update the Actor_New network, importance sampling theory is used. Suppose there are two distributions p(x) and q(x), and that sampling directly from p(x) is not feasible; the expectation of a function f(x) of a random variable x that obeys p(x) can still be computed by sampling only from q(x):
E_{x∼p}[f(x)] = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx   (9)
= E_{x∼q}[ f(x) p(x)/q(x) ]   (10)
Equation (6) then becomes
∇R̄_θ = E_{(s_t, a_t)∼π_θ'}[ (p_θ(s_t, a_t)/p_θ'(s_t, a_t)) A^θ'(s_t, a_t) ∇log π_θ(a_t|s_t) ]   (11)
where the advantage function is defined as:
A^θ(s_t, a_t) = ∑_t V_t − V_t*   (12)
Assuming that the distributions of policy π and policy π' do not differ much, equation (11) becomes
∇R̄_θ ≈ E_{(s_t, a_t)∼π_θ'}[ (π_θ(a_t|s_t)/π_θ'(a_t|s_t)) A^θ'(s_t, a_t) ∇log π_θ(a_t|s_t) ]   (13)
Converting the gradient into the corresponding likelihood (surrogate) objective, this reduces to equation (14):
J^θ'(θ) = E_{(s_t, a_t)∼π_θ'}[ (π_θ(a_t|s_t)/π_θ'(a_t|s_t)) A^θ'(s_t, a_t) ]   (14)
To ensure that the action probability distributions given by policy π and policy π' do not drift far apart, two methods are currently used. The first adds to the likelihood function of the PPO algorithm a penalty computed from the KL divergence between the two distributions; this is the PPO-Penalty algorithm (referred to herein as the PPO1 algorithm):
J_PPO1^θ'(θ) = J^θ'(θ) − β·KL(θ, θ')   (15)
The second clips the likelihood function to a certain range; this is the PPO-Clip algorithm (referred to herein as the PPO2 algorithm):
J_PPO2^θ'(θ) ≈ ∑_{(s_t, a_t)} min( (π_θ(a_t|s_t)/π_θ'(a_t|s_t))·A^θ'(s_t, a_t), clip(π_θ(a_t|s_t)/π_θ'(a_t|s_t), 1−ε, 1+ε)·A^θ'(s_t, a_t) )   (16)
To update the Actor_New network, a_loss must be computed from the likelihood function above. All states s stored in the experience pool are input to the Actor_New and Actor_Old networks, which output the values μ_new, σ_new and μ_old, σ_old, giving the normal distributions Normal_new and Normal_old. All actions a stored in the experience pool are then evaluated under Normal_new and Normal_old to compute J_PPO1 or J_PPO2, i.e. a_loss, which is backpropagated, and the Actor_New network is updated with a gradient ascent algorithm. After a certain number of time steps, the parameters of Actor_New are assigned to the Actor_Old network. The update process is shown in FIG. 5.
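A PyTorch sketch of the a_loss computation for PPO-Clip (equation (16)) follows: the probability ratio between Actor_New and Actor_Old is clipped to [1 − ε, 1 + ε] and combined with the advantage. Networks returning (μ, σ) are assumed, as in FIG. 3, and the interfaces are illustrative.

```python
import torch

def ppo_clip_loss(actor_new, actor_old, states, actions, advantages, eps=0.2):
    """a_loss for PPO2: -E[ min(ratio * A, clip(ratio, 1-eps, 1+eps) * A) ]."""
    mu_new, sigma_new = actor_new(states)
    mu_old, sigma_old = actor_old(states)
    dist_new = torch.distributions.Normal(mu_new, sigma_new)
    dist_old = torch.distributions.Normal(mu_old, sigma_old)
    # probability ratio pi_theta(a|s) / pi_theta'(a|s), computed in log space
    ratio = (dist_new.log_prob(actions).sum(-1)
             - dist_old.log_prob(actions).sum(-1).detach()).exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()    # gradient ascent on J = descent on -J

def update_actor(actor_new, actor_old, optimizer, states, actions, advantages):
    loss = ppo_clip_loss(actor_new, actor_old, states, actions, advantages)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After each update cycle, Actor_Old can be synchronized with actor_old.load_state_dict(actor_new.state_dict()).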
In summary, the PPO algorithm flow is shown in table 1:
TABLE 1 PPO Algorithm flow
According to the test results, PPO2 performs better than PPO1, so the invention uses PPO2, i.e. PPO-Clip, in practice.

Claims (1)

1. A construction method of a railway longitudinal section design learning model based on a near-end strategy is characterized by comprising the following steps:
Step 1, determining the initial positions of the cutting lines and the positions of the slope change points according to a longitudinal section cutting-line model;
the longitudinal section cutting-line model is a plane rectangular coordinate system with route mileage as the S axis and elevation as the Z axis; the starting point of the line is S(S_S, Z_S), the end point is E(S_E, Z_E), and the distance between the two points is l_s = S_E − S_S; assuming the initial longitudinal section scheme has M slope change points, M cutting lines M_1, M_2, …, M_M are set perpendicular to the S axis at uniform spacing, and the initial distance between adjacent cutting lines is d_s = l_s/(M+1); the intersection of the m-th cutting line with the baseline is B_m (m = 1, 2, …, M), whose coordinates (S_Bm, Z_Bm) in the S-Z coordinate system satisfy S_Bm = S_S + m·d_s and Z_Bm = Z_S + (m·d_s/l_s)(Z_E − Z_S);
a local s-z coordinate system is built with B_m (m = 1, 2, …, M) as its origin, the s-z system being a local coordinate system relative to the S-Z system; the slope change point VI_m (m = 1, 2, …, M) has coordinates (s_m, z_m) in the s-z coordinate system, and its coordinates (S_m, Z_m) in the S-Z coordinate system are S_m = S_Bm + s_m, Z_m = Z_Bm + z_m;
Step 2, determining the optimization objectives and constraints;
the optimization objectives comprise earthwork, bridge and tunnel construction costs, and the constraints require the design result to comply with the industry design specifications: the slope lengths, gradients and gradient differences of the scheme lie within the required ranges, the alignment passes through the start point, end point and control points, and transition curves do not overlap vertical curves;
Step 3, determining the reward function of the proximal policy optimization algorithm PPO, constructing the railway longitudinal section design learning model, and solving for the optimized scheme;
the railway longitudinal section design learning model takes the number of slope change points, the mileage and elevation of each slope change point, and the start and end point elevations as design variables, and defines the reinforcement learning elements corresponding to the PPO algorithm:
the longitudinal section cutting-line model is based on reinforcement learning theory, and the defined reinforcement learning elements comprise a state vector s, an action vector a and a reward value r;
the state vector s and the action vector a of the longitudinal section cutting-line model are defined as follows:
each scheme has M+1 slope-length constraints, M+1 gradient constraints, M gradient-difference constraints and 3 start/end point constraints (including the gradient of the start-end connecting line), plus N_c control-point constraints and N_h transition-curve/vertical-curve overlap constraints, where N_h is the larger of the number of transition curves and the number of vertical curves; a scheme violation count variable f is defined, with maximum value f_max = 3M + 5 + N_c + N_h;
M_max denotes the maximum number of slope change points, l_s = S_E − S_S the start-to-end distance, Z_max the maximum elevation in the environment, and (S_m, Z_m) (m = 1, 2, …, M) the coordinates of slope change point VI_m in the S-Z coordinate system; the state vector s collects these quantities in normalized form: the violation count f/f_max, the number of slope change points M/M_max, and the mileage and elevation of the start point, end point and every slope change point, normalized by l_s and Z_max respectively;
ΔM denotes the change in the number of slope change points and (s_m, z_m) (m = 1, 2, …, M) the coordinates of slope change point VI_m in the local s-z coordinate system with origin B_m; the action vector a is defined by formula (5):
a = (ΔM, s_1, s_2, …, s_M, z_1, z_2, …, z_M)   (5)
the reward value r of the longitudinal section cutting-line model is r = r_1 + r_2, where r_1 is a cost reward and r_2 is a violation reward;
the cost reward r_1 is defined as a negative reward so that engineering cost is minimized, while the violation reward r_2 handles constraint violations of the longitudinal section scheme; the cost reward r_1 is defined by formula (6) as a combination of f_E, the earthwork cost, f_B, the bridge cost, and f_T, the tunnel cost, normalized using c, the earthwork unit price, and A_max, the maximum of all roadbed cross-sectional areas A_i;
the violation reward r_2 consists of six parts: the slope-length violation reward r_21, the gradient violation reward r_22, the gradient-difference violation reward r_23, the start-end connecting-line gradient violation reward r_24, the control-point elevation violation reward r_25 and the vertical-curve/transition-curve overlap violation reward r_26; the final r_2 is obtained by combining the sum of these six terms with the scheme violation count f;
a variable α_i (i = 1, 2, …, M+1) is defined that indicates whether the slope length l_i of the i-th slope segment is shorter than the minimum slope length l_min; the slope-length violation reward r_21 penalizes the scheme according to the α_i; a variable β_j (j = 1, 2, …, M+1) is defined that indicates whether the absolute gradient |i_j| of the j-th slope segment exceeds the maximum gradient i_max; the gradient violation reward r_22 penalizes the scheme according to the β_j; a variable χ_j (j = 1, 2, …, M) is defined that indicates whether the absolute gradient difference |Δi_j| between the j-th and (j+1)-th slope segments exceeds the allowed value; the gradient-difference violation reward r_23 penalizes the scheme according to the χ_j;
the start-end connecting-line gradient violation reward r_24 penalizes schemes whose connecting-line gradient i_SE violates the gradient limit, where i_SE = (Z_E + z_E − Z_S − z_S)/l_s;
a variable δ_k (k = 1, 2, …, N_c) is defined that indicates whether the design elevation z_k at the k-th control point lies outside its constraint bounds [z_kmin, z_kmax]; the control-point elevation violation reward r_25 penalizes the scheme according to the δ_k;
the vertical-curve/transition-curve overlap violation reward r_26 is defined in terms of l_m, the total length of the overlapping portion of the m-th vertical curve and a transition curve, and the tangent length of the m-th vertical curve.
CN202210659378.7A 2022-06-13 2022-06-13 Construction method of railway longitudinal section design learning model based on near-end strategy Active CN114861368B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210659378.7A CN114861368B (en) 2022-06-13 2022-06-13 Construction method of railway longitudinal section design learning model based on near-end strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210659378.7A CN114861368B (en) 2022-06-13 2022-06-13 Construction method of railway longitudinal section design learning model based on near-end strategy

Publications (2)

Publication Number Publication Date
CN114861368A CN114861368A (en) 2022-08-05
CN114861368B true CN114861368B (en) 2023-09-12

Family

ID=82624954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210659378.7A Active CN114861368B (en) 2022-06-13 2022-06-13 Construction method of railway longitudinal section design learning model based on near-end strategy

Country Status (1)

Country Link
CN (1) CN114861368B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116757347B (en) * 2023-06-19 2024-02-13 中南大学 Railway line selection method and system based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447437A (en) * 2018-10-17 2019-03-08 Central South University Automatic construction method for highway (railway) vertical profiles including cut-fill transition
CN112231870A (en) * 2020-09-23 2021-01-15 西南交通大学 Intelligent generation method for railway line in complex mountain area
CN114519292A (en) * 2021-12-17 2022-05-20 北京航空航天大学 Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10606962B2 (en) * 2011-09-27 2020-03-31 Autodesk, Inc. Horizontal optimization of transport alignments

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447437A (en) * 2018-10-17 2019-03-08 Central South University Automatic construction method for highway (railway) vertical profiles including cut-fill transition
CN112231870A (en) * 2020-09-23 2021-01-15 西南交通大学 Intelligent generation method for railway line in complex mountain area
CN114519292A (en) * 2021-12-17 2022-05-20 北京航空航天大学 Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Review of intelligent optimization methods for railway alignment (铁路线路智能优化方法研究综述); Xue Xingong; Li Wei; Pu Hao; Journal of the China Railway Society (铁道学报), No. 03; full text *

Also Published As

Publication number Publication date
CN114861368A (en) 2022-08-05

Similar Documents

Publication Publication Date Title
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
CN110262511B (en) Biped robot adaptive walking control method based on deep reinforcement learning
CN112356830B (en) Intelligent parking method based on model reinforcement learning
CN111766782B (en) Strategy selection method based on Actor-Critic framework in deep reinforcement learning
CN114237222B (en) Delivery vehicle path planning method based on reinforcement learning
Lee et al. Monte-carlo tree search in continuous action spaces with value gradients
CN114861368B (en) Construction method of railway longitudinal section design learning model based on near-end strategy
CN112613608A (en) Reinforced learning method and related device
CN114519433A (en) Multi-agent reinforcement learning and strategy execution method and computer equipment
CN114626505A (en) Mobile robot deep reinforcement learning control method
Tong et al. Enhancing rolling horizon evolution with policy and value networks
CN117520956A (en) Two-stage automatic feature engineering method based on reinforcement learning and meta learning
CN116307331B (en) Aircraft trajectory planning method
CN111507499B (en) Method, device and system for constructing model for prediction and testing method
CN116306947A (en) Multi-agent decision method based on Monte Carlo tree exploration
CN112297012B (en) Robot reinforcement learning method based on self-adaptive model
Raza et al. Policy reuse in reinforcement learning for modular agents
Wenwen Application Research of end to end behavior decision based on deep reinforcement learning
Shi et al. Adaptive reinforcement q-learning algorithm for swarm-robot system using pheromone mechanism
Ji et al. Research on Path Planning of Mobile Robot Based on Reinforcement Learning
CN116718198B (en) Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph
CN118083808B (en) Dynamic path planning method and device for crown block system
Sakurai et al. Making Uncoordinated Autonomous Machines Cooperate: A Game Theory Approach
Garnica et al. Autonomous virtual vehicles with FNN-GA and Q-learning in a video game environment
Beltman Optimization of ideal racing line

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant