CN114861368B - Construction method of railway longitudinal section design learning model based on near-end strategy - Google Patents

Construction method of railway longitudinal section design learning model based on near-end strategy

Info

Publication number
CN114861368B
CN114861368B (application number CN202210659378.7A)
Authority
CN
China
Prior art keywords
slope
point
rewards
railway
longitudinal section
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210659378.7A
Other languages
Chinese (zh)
Other versions
CN114861368A (en)
Inventor
缪鹍
戴炎林
况卫
周启航
肖智
王介源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202210659378.7A priority Critical patent/CN114861368B/en
Publication of CN114861368A publication Critical patent/CN114861368A/en
Application granted granted Critical
Publication of CN114861368B publication Critical patent/CN114861368B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/10Geometric CAD
    • G06F30/17Mechanical parametric or variational design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a construction method of a railway longitudinal section design learning model based on a near-end strategy, relates to the application of deep reinforcement learning theory in the field of intelligent railway alignment selection, and is an intelligent design method for railway longitudinal section schemes based on the proximal policy optimization algorithm (Proximal Policy Optimization, PPO). The invention builds a railway longitudinal section design learning model based on proximal policy optimization, combines a railway longitudinal section cutting-line model with deep reinforcement learning theory, defines the state vector and action vector of the cutting-line model, handles the various constraints of railway longitudinal section design through a reward function, and gives the form of a railway longitudinal section cost reward function. The automatically optimized longitudinal section scheme takes engineering cost and the operating environment into account comprehensively, bypasses obstacles and adapts to the terrain well, and provides an early design reference for engineering designers.

Description

Construction method of railway longitudinal section design learning model based on near-end strategy
Technical Field
The invention relates to the application of deep reinforcement learning theory in the field of intelligent railway alignment selection, and in particular to an intelligent design method for railway longitudinal section schemes based on a near-end strategy (Proximal Policy Optimization, PPO).
Background
Reinforcement learning is a machine learning method in which an agent maximizes reward or achieves a specific goal by continually learning and optimizing its policy while interacting with the environment. Conventional reinforcement learning methods generally use value iteration, policy iteration and Q-learning to solve the Bellman optimality equation. When the environment of the problem to be solved is particularly complex, the information the agent must process keeps growing, and with these methods the iteration process becomes extremely complex and long. With the rapid development of deep learning in recent years, its nonlinear representation ability has been used to solve reinforcement learning problems, forming the theory of deep reinforcement learning and a range of deep reinforcement learning algorithms. These methods are widely applied in research fields such as Atari games, path planning and robot control, among which the proximal policy optimization algorithm is the most commonly used.
Disclosure of Invention
The invention uses reinforcement learning theory to construct a longitudinal section design learning model based on the proximal policy optimization algorithm. The main content of the invention is as follows:
(1) Firstly, a longitudinal section cutting-line model is constructed: the motion mode of the slope change points is designed, a method for calculating the slope change point coordinates is obtained, and a kinematic model of the environment in the reinforcement learning problem is built; the objective function, design variables and constraints of the longitudinal section cutting-line model are defined, laying a theoretical basis for the definition of the elements required by the subsequent reinforcement learning;
The longitudinal section cutting-line model is defined in a plane rectangular coordinate system with route mileage as the S axis and elevation as the Z axis. The starting point of the line is S(S_S, Z_S), the end point is E(S_E, Z_E), and the distance between the two points is l_s = S_E − S_S. Assume the initial longitudinal section scheme has M slope change points; M cutting lines M_1, M_2, …, M_M are set perpendicular to the S axis at uniform spacing, so the initial distance between adjacent cutting lines is d_s = l_s/(M+1). Let the intersection of the m-th cutting line with the baseline be B_m (m = 1, 2, …, M); its coordinates (S_Bm, Z_Bm) in the S-Z coordinate system satisfy S_Bm = S_S + m·d_s and Z_Bm = Z_S + (m·d_s/l_s)(Z_E − Z_S);
A local s-z coordinate system is built with B_m (m = 1, 2, …, M) as its origin; the s-z system is a local coordinate system relative to the S-Z system. The slope change point VI_m (m = 1, 2, …, M) has coordinates (s_m, z_m) in the s-z system, and its coordinates (S_m, Z_m) in the S-Z system satisfy S_m = S_Bm + s_m and Z_m = Z_Bm + z_m;
The optimization objectives comprise earthwork, bridge and tunnel construction costs, and the constraints require the design result to comply with the industry design specifications: the slope lengths, gradients and gradient differences of the scheme lie within the required ranges, the alignment passes through the start point, end point and control points, and transition curves do not overlap vertical curves;
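For illustration, a minimal Python sketch of the cutting-line geometry above is given below; it assumes the baseline is the straight segment from S to E, and all function and variable names are illustrative rather than prescribed by the model.

```python
import numpy as np

def baseline_intersections(S_start, Z_start, S_end, Z_end, M):
    """Return the intersections B_m of the M cutting lines with the S-E baseline."""
    l_s = S_end - S_start                    # start-to-end distance
    d_s = l_s / (M + 1)                      # spacing between adjacent cutting lines
    m = np.arange(1, M + 1)
    S_B = S_start + m * d_s                  # mileage of each intersection B_m
    Z_B = Z_start + (m * d_s / l_s) * (Z_end - Z_start)  # elevation on the baseline
    return np.stack([S_B, Z_B], axis=1)      # shape (M, 2)

def to_global(B, s_local, z_local):
    """Convert local offsets (s_m, z_m) measured from B_m into global (S_m, Z_m)."""
    return B + np.stack([s_local, z_local], axis=1)

# toy usage: 3 slope change points on a 12 km line climbing from 100 m to 160 m
B = baseline_intersections(0.0, 100.0, 12000.0, 160.0, M=3)
VI = to_global(B, s_local=np.array([-50.0, 20.0, 0.0]),
                  z_local=np.array([3.0, -2.5, 1.0]))
```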
(2) Then, based on reinforcement learning theory and combined with the longitudinal section cutting-line model, the reinforcement learning elements are defined: the forms of the state vector and action vector are specified, a cost reward function and a violation reward function are designed, and the reinforcement learning model for intelligent longitudinal section design is constructed;
The model takes the number of slope change points, the mileage and elevation of each slope change point, and the start and end point elevations as design variables, and defines the corresponding reinforcement learning elements of the PPO algorithm, namely a state vector s, an action vector a and a reward value r; the state vector s and action vector a of the longitudinal section cutting-line model are defined as follows:
Each scheme has M+1 slope-length constraints, M+1 gradient constraints, M gradient-difference constraints and 3 start/end point constraints (including the gradient of the start-end connecting line), plus N_c control-point constraints and N_h transition-curve/vertical-curve overlap constraints, where N_h is the larger of the number of transition curves and the number of vertical curves. A scheme violation count variable f is defined, with maximum value f_max = 3M + 5 + N_c + N_h. Let M_max denote the maximum number of slope change points, l_s = S_E − S_S the start-to-end distance, Z_max the maximum elevation in the environment, and (S_m, Z_m) (m = 1, 2, …, M) the coordinates of slope change point VI_m in the S-Z coordinate system. The state vector s then collects these quantities in normalized form: the violation count f/f_max, the number of slope change points M/M_max, and the mileage and elevation of the start point, end point and every slope change point, normalized by l_s and Z_max respectively;
Let ΔM denote the change in the number of slope change points and (s_m, z_m) (m = 1, 2, …, M) the coordinates of slope change point VI_m in the local s-z coordinate system with origin B_m; the action vector a is then defined as a = (ΔM, s_1, s_2, …, s_M, z_1, z_2, …, z_M);
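The sketch below shows one way such state and action vectors can be assembled in Python; the exact ordering and normalization of the patent's formulas are not reproduced, so the layout here is an assumption for illustration only.

```python
import numpy as np

def build_state(f, f_max, M, M_max, points_SZ, l_s, Z_max):
    """points_SZ: (M+2, 2) array of (S, Z) for the start point, the M slope change
    points and the end point, in global coordinates."""
    S_norm = points_SZ[:, 0] / l_s           # mileages normalized by the start-end distance
    Z_norm = points_SZ[:, 1] / Z_max         # elevations normalized by the maximum elevation
    return np.concatenate([[f / f_max, M / M_max], S_norm, Z_norm])

def build_action(delta_M, s_local, z_local):
    """a = (ΔM, s_1, ..., s_M, z_1, ..., z_M) in the local cutting-line coordinates."""
    return np.concatenate([[delta_M], s_local, z_local])
```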
The reward value r is r = r_1 + r_2, where r_1 is a cost reward and r_2 is a violation reward. The cost reward r_1 is defined as a negative reward so that minimizing engineering cost is encouraged, and the violation reward r_2 handles constraint violations of the longitudinal section scheme. The cost reward r_1 combines the earthwork cost f_E, the bridge cost f_B and the tunnel cost f_T, normalized using the earthwork unit price c and A_max, the maximum of all roadbed cross-sectional areas A_i;
The violation reward r_2 consists of six parts: the slope-length violation reward r_21, the gradient violation reward r_22, the gradient-difference violation reward r_23, the start-end connecting-line gradient violation reward r_24, the control-point elevation violation reward r_25 and the vertical-curve/transition-curve overlap violation reward r_26;
The final r_2 is obtained by combining the sum of these six terms with the scheme violation count f.
The variable α_i (i = 1, 2, …, M+1) indicates whether the slope length l_i of the i-th slope segment is shorter than the minimum slope length l_min; the slope-length violation reward r_21 penalizes the scheme according to the α_i. The variable β_j (j = 1, 2, …, M+1) indicates whether the absolute gradient |i_j| of the j-th slope segment exceeds the maximum gradient i_max; the gradient violation reward r_22 penalizes the scheme according to the β_j. The variable χ_j (j = 1, 2, …, M) indicates whether the absolute gradient difference |Δi_j| between the j-th and (j+1)-th slope segments exceeds the allowed value; the gradient-difference violation reward r_23 penalizes the scheme according to the χ_j. The start-end connecting-line gradient violation reward r_24 penalizes schemes whose connecting-line gradient i_SE = (Z_E + z_E − Z_S − z_S)/l_s violates the gradient limit. The variable δ_k (k = 1, 2, …, N_c) indicates whether the design elevation z_k at the k-th control point lies outside its constraint bounds [z_kmin, z_kmax]; the control-point elevation violation reward r_25 penalizes the scheme according to the δ_k. The vertical-curve/transition-curve overlap violation reward r_26 is defined in terms of l_m, the total length of the overlapping portion of the m-th vertical curve and a transition curve, and the tangent length of the m-th vertical curve;
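Because the exact penalty formulas are not reproduced here, the following sketch only illustrates the indicator-style structure of the violation reward: each violated constraint flips an indicator, the six partial rewards are summed, and the violation count f is returned alongside them. The constant PENALTY and the per-constraint weighting are assumptions, not the patent's values.

```python
import numpy as np

PENALTY = -1.0   # placeholder magnitude for one violated constraint (not the patent's value)

def violation_reward(slope_len, grad, grad_diff, i_SE, z_ctrl, z_ctrl_min, z_ctrl_max,
                     overlap_len, l_min, i_max, di_max):
    """Indicator-style r_2: every violated constraint contributes one penalty unit."""
    alpha = slope_len < l_min                               # slope-length violations (M+1 segments)
    beta = np.abs(grad) > i_max                             # gradient violations (M+1 segments)
    chi = np.abs(grad_diff) > di_max                        # gradient-difference violations (M)
    delta = (z_ctrl < z_ctrl_min) | (z_ctrl > z_ctrl_max)   # control-point elevation violations (N_c)
    sigma = overlap_len > 0                                 # vertical/transition curve overlaps
    r21, r22, r23 = PENALTY * alpha.sum(), PENALTY * beta.sum(), PENALTY * chi.sum()
    r24 = PENALTY * float(abs(i_SE) > i_max)                # start-end connecting-line gradient
    r25, r26 = PENALTY * delta.sum(), PENALTY * sigma.sum()
    f = int(alpha.sum() + beta.sum() + chi.sum()            # scheme violation count
            + (abs(i_SE) > i_max) + delta.sum() + sigma.sum())
    r2 = r21 + r22 + r23 + r24 + r25 + r26                  # the patent additionally combines this sum with f
    return r2, f
```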
(3) Finally, the PPO algorithm is introduced into the longitudinal section cutting-line model, and an effective-dimension design is proposed for the dimensionality and state-update problems that arise when the number of slope change points is itself optimized; the division of bridges and tunnels along the line is defined, and a nominal cost is introduced that converts bridge and tunnel costs into equivalent earthwork costs under a unified standard, so that the cost reward function is unified and algorithm training is simplified; the concrete implementation of the PPO-based longitudinal section reinforcement learning model is described and a technical route is given;
The effective dimension means that the input and output vectors of the algorithm are handled specially: in the relevant calculations, only the entries within the dimension corresponding to the current number of slope change points are used. This setting satisfies the neural network's requirement for fixed input and output dimensions and guarantees that the vector dimensions cannot overflow during optimization; experiments show that it does not affect the convergence of the algorithm;
Using effective dimensions, however, raises the problem of how schemes with different numbers of slope change points learn from one another. The invention introduces a restart mechanism for this: when the previous state has n_z1 slope change points, the position information of the start point, end point and all slope change points has dimension 2(n_z1 + 2); after the neural network computes a perturbation Δn_z of the number of slope change points, the next state has n_z2 slope change points and its position information has dimension 2(n_z2 + 2), so the effective dimensions of the action entries unrelated to the start and end points no longer match. In this case the model restarts: it regenerates a longitudinal section scheme with n_z2 slope change points and then superimposes the action computed from the previous state onto it to produce the longitudinal section scheme of the next state.
The internal design steps of the model are as follows (an illustrative sketch of this loop is given after the steps):
Step 1: initialize the longitudinal section cutting-line model and the proximal policy optimization algorithm to obtain an initial longitudinal section scheme and an initial state vector; the initial scheme serves as the reference scheme;
Step 2: input the state vector; the proximal policy outputs an action vector;
Step 3: the cutting-line model checks whether the number of slope change points has changed; if so, a longitudinal section scheme with the new number of slope change points is regenerated and taken as the new reference scheme; the action is then superimposed on the reference scheme to obtain a new longitudinal section scheme;
Step 4: compute the new state vector and the reward value of this update;
Step 5: record and save s_t, a_t, s_{t+1}, r_{t+1};
Step 6: repeat Step 2 to Step 5; after a certain number of steps, the neural networks in the PPO algorithm begin to update their parameters, i.e. the policy is updated;
Step 7: when the termination condition is reached, the algorithm has converged and reinforcement learning has succeeded.
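A compact Python sketch of Steps 1–7 follows; env, agent and their methods (reset, select_action, step, update) are hypothetical wrappers around the cutting-line model and the PPO networks, named here only for illustration.

```python
def train(env, agent, total_steps=100_000, update_every=256):
    """Steps 1-7: interact with the cutting-line environment and periodically update PPO."""
    s = env.reset()                          # Step 1: initial profile scheme and state vector
    buffer = []
    for t in range(total_steps):
        a = agent.select_action(s)           # Step 2: the proximal policy outputs an action
        s_next, r, done = env.step(a)        # Steps 3-4: superimpose the action (restart if ΔM != 0),
                                             #            compute the new state and the reward
        buffer.append((s, a, r, s_next))     # Step 5: record s_t, a_t, r_{t+1}, s_{t+1}
        s = env.reset() if done else s_next
        if (t + 1) % update_every == 0:      # Step 6: update the actor/critic networks
            agent.update(buffer)
            buffer.clear()
    return agent                             # Step 7: stop once the termination condition is met
```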
The technical scheme of the model is shown in figure 1.
Drawings
FIG. 1 is the technical roadmap of the model;
FIG. 2 is a schematic diagram of deep reinforcement learning;
FIG. 3 shows the interaction of PPO with the environment;
FIG. 4 is a flow chart of the Critic network update;
FIG. 5 is a flow chart of the Actor network update in PPO-Clip.
Detailed Description
First, a deep reinforcement learning theory is introduced.
A method that uses the nonlinear approximation ability of deep neural networks to approximate the optimal value function or the optimal policy of a reinforcement learning problem is called a deep reinforcement learning method (see FIG. 2); such methods can be divided into value-function-based and policy-based deep reinforcement learning algorithms.
(1) Deep reinforcement learning algorithm based on value function
A value-function-based deep reinforcement learning algorithm establishes a function v̂(s, θ) described by a parameter θ (i.e. the neural network weights w and bias terms b), approximating the value function with the neural network parameter θ. The deep neural network computes the value of a state from a continuous input variable s describing the state features, and θ is adjusted so that the output is consistent with the state values obtained under a given policy π; the function is thus an approximate representation of the state value function v_π(s):
v̂(s, θ) ≈ v_π(s)   (1)
An approximate representation of the value function is constructed in this way, and the reinforcement learning problem is converted into solving for the approximate value-function parameter θ. Using the deep neural network as a nonlinear approximator, a state value function of the simplest form can be written as
v̂(s, θ) = σ(w·s + b)   (2)
where the nonlinear function σ is the activation function of the neuron and b is the bias term. Using the backpropagation and gradient descent algorithms of the deep neural network, the parameters w and b are iterated continuously so that the value function approaches the optimal value function; the action to execute can then be selected with the common ε-greedy policy, and the reinforcement learning problem is solved.
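As a concrete illustration of this idea, the PyTorch sketch below fits a small network v̂(s, θ) to target values by gradient descent; the architecture, optimizer and targets are generic assumptions rather than the patent's choices.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """v_hat(s, theta): nonlinear approximation of the state value function."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),   # sigma(w*s + b), stacked
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

def fit_value(net, states, targets, lr=1e-3, epochs=10):
    """Iterate the weights and biases by backpropagation so v_hat approaches the targets."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(net(states), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net
```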
(2) Deep reinforcement learning algorithm based on strategy
A policy-based deep reinforcement learning algorithm directly approximates the policy with the nonlinearity of a deep neural network and iterates the policy parameters by computing the expected total reward of the policy, so as to approach the optimal policy. When solving the reinforcement learning problem, a deep neural network with parameter θ represents the policy π, and optimizing θ is optimizing π. The main idea of the classical policy gradient algorithm is: at time t, the state s_t obtained from the agent's interaction with the environment is input to the policy π with parameter θ, which outputs a probability distribution over actions; the agent selects an action a_t, executes it, obtains the next state s_{t+1} and receives the reward r_{t+1}. By repeating these steps, a batch of samples is collected under policy π, and the parameter θ is updated to θ' by stochastic gradient descent, changing policy π into policy π', until the policy becomes the optimal policy π*.
Assume that a complete round (episode) under policy π is the trajectory of states, actions and rewards obtained as the agent goes from the initial state s_0 through all intermediate states to the terminal state s_{T−1}. Round e is defined as
e = {s_0, a_0, r_1, s_1, a_1, r_2, …, s_{T−1}, a_{T−1}, r_T}
The probability of occurrence of round e is
p_θ(e) = p(s_0) p_θ(a_0|s_0) p(s_1|s_0, a_0) p_θ(a_1|s_1) … = p(s_0) ∏_{t=0}^{T−1} π_θ(a_t|s_t) p(s_{t+1}|s_t, a_t)   (3)
The most basic policy-based deep reinforcement learning algorithm optimizes the expected total reward of a complete round under the policy. The sum R(e) of all discounted rewards collected from the initial state s_0 until the terminal state s_{T−1} is
R(e) = ∑_{t=0}^{T−1} γ^t r_{t+1}   (4)
and its expectation is
R̄_θ = E_{e∼p_θ(e)}[R(e)] = ∑_e R(e) p_θ(e)   (5)
Unlike value-function-based algorithms, policy-based deep reinforcement learning algorithms generally update the policy directly along the gradient of the expected total reward so as to maximize it. For N rounds of T time steps each, the policy gradient of the cumulative reward can be approximated as
∇R̄_θ ≈ (1/N) ∑_{n=1}^{N} ∑_{t=0}^{T−1} R(e^n) ∇log π_θ(a_t^n | s_t^n)   (6)
After a number of complete rounds, the N samples approximate the expectation of the cumulative reward, and the parameter θ is updated by
θ ← θ + η·∇R̄_θ   (7)
where η is the learning rate, which controls the rate of parameter updates.
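The sketch below writes the update of equations (6)–(7) in REINFORCE form for a Gaussian policy over continuous actions: the log-probabilities of the taken actions are weighted by each round's return and the parameters move along the resulting gradient. It is a generic illustration of the policy gradient, not the patent's implementation.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi_theta(a|s): a Gaussian over continuous actions with learned mean and std."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                nn.Linear(hidden, action_dim))
        self.log_sigma = nn.Parameter(torch.zeros(action_dim))

    def dist(self, s):
        return torch.distributions.Normal(self.mu(s), self.log_sigma.exp())

def policy_gradient_update(policy, episodes, eta=1e-3):
    """theta <- theta + eta * grad of (1/N) * sum_n R(e^n) * sum_t log pi(a_t^n | s_t^n)."""
    opt = torch.optim.SGD(policy.parameters(), lr=eta)
    loss = 0.0
    for states, actions, episode_return in episodes:     # episode_return = R(e^n)
        logp = policy.dist(states).log_prob(actions).sum()
        loss = loss - episode_return * logp              # minimize the negative objective
    (loss / len(episodes)).backward()
    opt.step()
    opt.zero_grad()
```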
the difference between the deep reinforcement learning algorithm based on the value function and the deep reinforcement learning algorithm based on the strategy is that the former outputs the value of all actions based on the environmental information, and the intelligent agent only selects the action with the largest value, and the actions in the method are discrete. And outputting all the probabilities of possible actions according to the environment information, wherein the intelligent agent can select the next action according to the probabilities, and each action is possible to be selected, so that the continuous action problem can be solved by adopting the method.
The design of a railway longitudinal section involves many factors; the final alignment is determined by environment, topography, geology and other conditions, and the design method or design strategy cannot be captured by a simple linear model. The nonlinearity of deep learning can address this: through activation functions, deep learning establishes a nonlinear mapping from a series of features to results, laying a foundation for its application in the field of alignment selection. Alignment design is, however, particularly complex: plain supervised learning requires a large amount of labelled data, and although a large amount of alignment scheme data exists, it is difficult to convert into quantified data usable by deep learning, and schemes from different terrains differ too much for effective training. Reinforcement learning does not require such supervision: it lets an agent discover a problem-solving strategy without large amounts of labelled data. Combining deep learning with reinforcement learning is therefore a promising research route toward artificial-intelligence-assisted railway alignment selection.
(3) Common deep reinforcement learning algorithm
Deep reinforcement learning algorithms have been developed for many years, with different network structures and parameter-update methods. The Deep Q-Network (DQN) proposed by Mnih et al. introduced an experience replay mechanism that removes the correlation between samples and added a target Q-value network, improving the stability of the algorithm; it performs excellently on discrete-action problems. The Double Deep Q-Network (DDQN), an improvement on DQN, keeps two sets of network parameters θ and θ⁻: θ is used to select the optimal action with the largest Q value, while θ⁻ is used to evaluate the Q value of that action. DDQN thus separates action selection from action evaluation, reducing the risk of Q-value overestimation present in DQN. Value-function-based deep reinforcement learning algorithms such as DQN and DDQN are suitable for discrete-action problems and suffer from dimensionality explosion when handling continuous actions. Railway alignment selection involves many design variables and many design constraints, and variables such as design elevation are continuous, so these methods are clearly unsuitable, and a policy-based deep reinforcement learning method should be chosen.
Common policy-based reinforcement learning methods include the policy gradient algorithm and the REINFORCE algorithm. The most widely applied deep reinforcement learning algorithms now combine the ideas of value functions and policies; the most basic of these is the Actor-Critic (AC) algorithm. The AC algorithm has two network structures: one, called the Actor, selects behaviors according to probabilities, i.e. the policy system; the other, called the Critic, scores the behaviors selected by the Actor, i.e. the value system. The Actor modifies its parameters according to the Critic's score, thereby modifying the probabilities of the selected behaviors. Because the Critic network is difficult to converge and the update of the Actor network depends strongly on the Critic's value judgment, convergence of the whole AC algorithm becomes very slow.
To address the slow convergence of the AC algorithm, Lillicrap et al. proposed the Deep Deterministic Policy Gradient (DDPG) algorithm, which builds on the Deterministic Policy Gradient (DPG) method and incorporates the ideas of DQN. Like AC, DDPG has a value-function-based network and a policy-based network; because of the DQN ideas, the former contains a state estimation network and a state target network, and the latter contains an action estimation network and an action target network. In the policy network structure, the action estimation network outputs actions in real time for the Actor to execute, while the action target network is used to update the value network structure. Both networks in the value network structure output the value of the current state, but the input of the state target network comes from two parts, the output of the action target network and the state observation, while the input of the state estimation network comes from the output of the action estimation network. DDPG also uses an experience replay mechanism. Experiments show that in continuous-action tasks DDPG outperforms DQN in every respect, and its learning process converges much faster than the AC algorithm.
The original policy gradient algorithm is sensitive to parameters such as the learning rate and step size, which makes it hard to train on certain problems. If the step size is set too large, the learned policy keeps jumping around and the algorithm does not converge; conversely, if the step size is too small, training takes an extremely long time. To address this, the Trust Region Policy Optimization (TRPO) algorithm and the Proximal Policy Optimization (PPO) algorithm were proposed in turn; both limit the update amplitude of the new policy relative to the old policy, making the algorithm easier to train and converge. Compared with TRPO, PPO is easier to implement and performs more stably on most tasks; it has become OpenAI's default algorithm for reinforcement learning problems and one of the most widely used deep reinforcement learning algorithms.
The near-end strategy optimization (Proximal Policy Optimization, PPO) algorithm adopted by the invention is, like the AC algorithm, a deep reinforcement learning algorithm based on both values and policies. To limit the update amplitude of the policy, the PPO algorithm uses two Actor networks, one Critic network and a database that stores the trial-and-error data.
As shown in FIG. 3, the PPO algorithm has two Actor networks, Actor_New and Actor_Old, and one Critic network. The agent interacts with the environment to obtain a state s, which is input to the Actor_New network; Actor_New outputs a mean μ and a variance σ that define a normal distribution over actions; an action a is sampled from this distribution and applied to the environment; the agent executes action a, obtains a reward r and moves to the next state s_, which becomes the next input s. The tuple [(s, a, r), s_] is stored in the database and the above steps are repeated. Actor_New is not updated during this process.
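A Python sketch of this collection phase follows; actor_new and env are hypothetical objects with the interfaces shown (a network returning (μ, σ) and an environment with reset/step), assumed only for illustration.

```python
import torch

def collect(env, actor_new, horizon=128):
    """Roll out Actor_New without updating it; store [(s, a, r), s_next] tuples."""
    buffer, s = [], env.reset()
    for _ in range(horizon):
        with torch.no_grad():
            mu, sigma = actor_new(torch.as_tensor(s, dtype=torch.float32))
            a = torch.distributions.Normal(mu, sigma).sample()
        s_next, r, done = env.step(a.numpy())
        buffer.append((s, a.numpy(), r, s_next))          # trial-and-error database
        s = env.reset() if done else s_next
    return buffer
```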
As shown in FIG. 4, after a complete round of training, the state s_last of the last time step is input to the Critic network to obtain the value estimate v* of the last state; the discounted rewards R_t of all time steps are then computed, giving the real value matrix V_t of all states. All states s stored in the database are input to the Critic network to obtain the estimated value matrix V; c_loss is computed from V_t and V and backpropagated, and the Critic network parameters are updated with the gradient descent algorithm:
V_t = R_t = r_t + γ r_{t+1} + γ² r_{t+2} + … + γ^{T−1−t} r_{T−1} + γ^{T−t} v*   (t = 0, 1, 2, …, T)   (8)
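The sketch below illustrates the Critic update of FIG. 4 and equation (8): discounted returns are bootstrapped backwards from the value estimate v* of the last state, and c_loss is taken here as the mean squared error between these returns and the Critic's estimates; the critic interface and the use of MSE are assumptions.

```python
import torch

def discounted_returns(rewards, v_last, gamma=0.99):
    """V_t = r_t + gamma*r_{t+1} + ... + gamma^(T-t) * v*, computed backwards (equation (8))."""
    returns, running = [], float(v_last)
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return torch.tensor(list(reversed(returns)), dtype=torch.float32)

def update_critic(critic, optimizer, states, returns):
    """c_loss = MSE(V_t, V); backpropagate and take one gradient-descent step."""
    values = critic(states).squeeze(-1)
    c_loss = torch.nn.functional.mse_loss(values, returns)
    optimizer.zero_grad()
    c_loss.backward()
    optimizer.step()
    return c_loss.item()
```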
To update the Actor_New network, importance sampling theory is used. Suppose there are two distributions p(x) and q(x), and that sampling directly from p(x) is not feasible; the expectation of a function f(x) of a random variable x that obeys p(x) can still be computed by sampling only from q(x):
E_{x∼p}[f(x)] = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx   (9)
= E_{x∼q}[ f(x) p(x)/q(x) ]   (10)
Equation (6) then becomes
∇R̄_θ = E_{(s_t, a_t)∼π_θ'}[ (p_θ(s_t, a_t)/p_θ'(s_t, a_t)) A^θ'(s_t, a_t) ∇log π_θ(a_t|s_t) ]   (11)
where the advantage function is defined as:
A^θ(s_t, a_t) = ∑_t V_t − V_t*   (12)
Assuming that the distributions of policy π and policy π' do not differ much, equation (11) becomes
∇R̄_θ ≈ E_{(s_t, a_t)∼π_θ'}[ (π_θ(a_t|s_t)/π_θ'(a_t|s_t)) A^θ'(s_t, a_t) ∇log π_θ(a_t|s_t) ]   (13)
Converting the gradient into the corresponding likelihood (surrogate) objective, this reduces to equation (14):
J^θ'(θ) = E_{(s_t, a_t)∼π_θ'}[ (π_θ(a_t|s_t)/π_θ'(a_t|s_t)) A^θ'(s_t, a_t) ]   (14)
To ensure that the action probability distributions given by policy π and policy π' do not drift far apart, two methods are currently used. The first adds to the likelihood function of the PPO algorithm a penalty computed from the KL divergence between the two distributions; this is the PPO-Penalty algorithm (referred to herein as the PPO1 algorithm):
J_PPO1^θ'(θ) = J^θ'(θ) − β·KL(θ, θ')   (15)
The second clips the likelihood function to a certain range; this is the PPO-Clip algorithm (referred to herein as the PPO2 algorithm):
J_PPO2^θ'(θ) ≈ ∑_{(s_t, a_t)} min( (π_θ(a_t|s_t)/π_θ'(a_t|s_t))·A^θ'(s_t, a_t), clip(π_θ(a_t|s_t)/π_θ'(a_t|s_t), 1−ε, 1+ε)·A^θ'(s_t, a_t) )   (16)
To update the Actor_New network, a_loss must be computed from the likelihood function above. All states s stored in the experience pool are input to the Actor_New and Actor_Old networks, which output the values μ_new, σ_new and μ_old, σ_old, giving the normal distributions Normal_new and Normal_old. All actions a stored in the experience pool are then evaluated under Normal_new and Normal_old to compute J_PPO1 or J_PPO2, i.e. a_loss, which is backpropagated, and the Actor_New network is updated with a gradient ascent algorithm. After a certain number of time steps, the parameters of Actor_New are assigned to the Actor_Old network. The update process is shown in FIG. 5.
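A PyTorch sketch of the a_loss computation for PPO-Clip (equation (16)) follows: the probability ratio between Actor_New and Actor_Old is clipped to [1 − ε, 1 + ε] and combined with the advantage. Networks returning (μ, σ) are assumed, as in FIG. 3, and the interfaces are illustrative.

```python
import torch

def ppo_clip_loss(actor_new, actor_old, states, actions, advantages, eps=0.2):
    """a_loss for PPO2: -E[ min(ratio * A, clip(ratio, 1-eps, 1+eps) * A) ]."""
    mu_new, sigma_new = actor_new(states)
    mu_old, sigma_old = actor_old(states)
    dist_new = torch.distributions.Normal(mu_new, sigma_new)
    dist_old = torch.distributions.Normal(mu_old, sigma_old)
    # probability ratio pi_theta(a|s) / pi_theta'(a|s), computed in log space
    ratio = (dist_new.log_prob(actions).sum(-1)
             - dist_old.log_prob(actions).sum(-1).detach()).exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()    # gradient ascent on J = descent on -J

def update_actor(actor_new, actor_old, optimizer, states, actions, advantages):
    loss = ppo_clip_loss(actor_new, actor_old, states, actions, advantages)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After each update cycle, Actor_Old can be synchronized with actor_old.load_state_dict(actor_new.state_dict()).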
In summary, the PPO algorithm flow is shown in table 1:
TABLE 1 PPO Algorithm flow
According to the test results, PPO2 performs better than PPO1, so the invention uses PPO2, i.e. PPO-Clip, in practice.

Claims (1)

1. A construction method of a railway longitudinal section design learning model based on a near-end strategy is characterized by comprising the following steps:
Step 1, determining the initial positions of the cutting lines and the positions of the slope change points according to a longitudinal section cutting-line model;
the longitudinal section cutting-line model is a plane rectangular coordinate system with route mileage as the S axis and elevation as the Z axis; the starting point of the line is S(S_S, Z_S), the end point is E(S_E, Z_E), and the distance between the two points is l_s = S_E − S_S; assuming the initial longitudinal section scheme has M slope change points, M cutting lines M_1, M_2, …, M_M are set perpendicular to the S axis at uniform spacing, and the initial distance between adjacent cutting lines is d_s = l_s/(M+1); the intersection of the m-th cutting line with the baseline is B_m (m = 1, 2, …, M), whose coordinates (S_Bm, Z_Bm) in the S-Z coordinate system satisfy S_Bm = S_S + m·d_s and Z_Bm = Z_S + (m·d_s/l_s)(Z_E − Z_S);
a local s-z coordinate system is built with B_m (m = 1, 2, …, M) as its origin, the s-z system being a local coordinate system relative to the S-Z system; the slope change point VI_m (m = 1, 2, …, M) has coordinates (s_m, z_m) in the s-z coordinate system, and its coordinates (S_m, Z_m) in the S-Z coordinate system are S_m = S_Bm + s_m, Z_m = Z_Bm + z_m;
Step 2, determining the optimization objectives and constraints;
the optimization objectives comprise earthwork, bridge and tunnel construction costs, and the constraints require the design result to comply with the industry design specifications: the slope lengths, gradients and gradient differences of the scheme lie within the required ranges, the alignment passes through the start point, end point and control points, and transition curves do not overlap vertical curves;
Step 3, determining the reward function of the proximal policy optimization algorithm PPO, constructing the railway longitudinal section design learning model, and solving for the optimized scheme;
the railway longitudinal section design learning model takes the number of slope change points, the mileage and elevation of each slope change point, and the start and end point elevations as design variables, and defines the reinforcement learning elements corresponding to the PPO algorithm:
the longitudinal section cutting-line model is based on reinforcement learning theory, and the defined reinforcement learning elements comprise a state vector s, an action vector a and a reward value r;
the state vector s and the action vector a of the longitudinal section cutting-line model are defined as follows:
each scheme has M+1 slope-length constraints, M+1 gradient constraints, M gradient-difference constraints and 3 start/end point constraints (including the gradient of the start-end connecting line), plus N_c control-point constraints and N_h transition-curve/vertical-curve overlap constraints, where N_h is the larger of the number of transition curves and the number of vertical curves; a scheme violation count variable f is defined, with maximum value f_max = 3M + 5 + N_c + N_h;
M_max denotes the maximum number of slope change points, l_s = S_E − S_S the start-to-end distance, Z_max the maximum elevation in the environment, and (S_m, Z_m) (m = 1, 2, …, M) the coordinates of slope change point VI_m in the S-Z coordinate system; the state vector s collects these quantities in normalized form: the violation count f/f_max, the number of slope change points M/M_max, and the mileage and elevation of the start point, end point and every slope change point, normalized by l_s and Z_max respectively;
ΔM denotes the change in the number of slope change points and (s_m, z_m) (m = 1, 2, …, M) the coordinates of slope change point VI_m in the local s-z coordinate system with origin B_m; the action vector a is defined by formula (5):
a = (ΔM, s_1, s_2, …, s_M, z_1, z_2, …, z_M)   (5)
the reward value r of the longitudinal section cutting-line model is r = r_1 + r_2, where r_1 is a cost reward and r_2 is a violation reward;
the cost reward r_1 is defined as a negative reward so that engineering cost is minimized, while the violation reward r_2 handles constraint violations of the longitudinal section scheme; the cost reward r_1 is defined by formula (6) as a combination of f_E, the earthwork cost, f_B, the bridge cost, and f_T, the tunnel cost, normalized using c, the earthwork unit price, and A_max, the maximum of all roadbed cross-sectional areas A_i;
the violation reward r_2 consists of six parts: the slope-length violation reward r_21, the gradient violation reward r_22, the gradient-difference violation reward r_23, the start-end connecting-line gradient violation reward r_24, the control-point elevation violation reward r_25 and the vertical-curve/transition-curve overlap violation reward r_26; the final r_2 is obtained by combining the sum of these six terms with the scheme violation count f;
a variable α_i (i = 1, 2, …, M+1) is defined that indicates whether the slope length l_i of the i-th slope segment is shorter than the minimum slope length l_min; the slope-length violation reward r_21 penalizes the scheme according to the α_i; a variable β_j (j = 1, 2, …, M+1) is defined that indicates whether the absolute gradient |i_j| of the j-th slope segment exceeds the maximum gradient i_max; the gradient violation reward r_22 penalizes the scheme according to the β_j; a variable χ_j (j = 1, 2, …, M) is defined that indicates whether the absolute gradient difference |Δi_j| between the j-th and (j+1)-th slope segments exceeds the allowed value; the gradient-difference violation reward r_23 penalizes the scheme according to the χ_j;
the start-end connecting-line gradient violation reward r_24 penalizes schemes whose connecting-line gradient i_SE violates the gradient limit, where i_SE = (Z_E + z_E − Z_S − z_S)/l_s;
a variable δ_k (k = 1, 2, …, N_c) is defined that indicates whether the design elevation z_k at the k-th control point lies outside its constraint bounds [z_kmin, z_kmax]; the control-point elevation violation reward r_25 penalizes the scheme according to the δ_k;
the vertical-curve/transition-curve overlap violation reward r_26 is defined in terms of l_m, the total length of the overlapping portion of the m-th vertical curve and a transition curve, and the tangent length of the m-th vertical curve.
CN202210659378.7A 2022-06-13 2022-06-13 Construction method of railway longitudinal section design learning model based on near-end strategy Active CN114861368B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210659378.7A CN114861368B (en) 2022-06-13 2022-06-13 Construction method of railway longitudinal section design learning model based on near-end strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210659378.7A CN114861368B (en) 2022-06-13 2022-06-13 Construction method of railway longitudinal section design learning model based on near-end strategy

Publications (2)

Publication Number Publication Date
CN114861368A CN114861368A (en) 2022-08-05
CN114861368B true CN114861368B (en) 2023-09-12

Family

ID=82624954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210659378.7A Active CN114861368B (en) 2022-06-13 2022-06-13 Construction method of railway longitudinal section design learning model based on near-end strategy

Country Status (1)

Country Link
CN (1) CN114861368B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116757347B (en) * 2023-06-19 2024-02-13 中南大学 Railway line selection method and system based on deep learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447437A (en) * 2018-10-17 2019-03-08 Central South University Automatic construction method for highway (railway) vertical profiles including cut-fill transition
CN112231870A (en) * 2020-09-23 2021-01-15 西南交通大学 Intelligent generation method for railway line in complex mountain area
CN114519292A (en) * 2021-12-17 2022-05-20 北京航空航天大学 Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10606962B2 (en) * 2011-09-27 2020-03-31 Autodesk, Inc. Horizontal optimization of transport alignments

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447437A (en) * 2018-10-17 2019-03-08 Central South University Automatic construction method for highway (railway) vertical profiles including cut-fill transition
CN112231870A (en) * 2020-09-23 2021-01-15 西南交通大学 Intelligent generation method for railway line in complex mountain area
CN114519292A (en) * 2021-12-17 2022-05-20 北京航空航天大学 Air-to-air missile over-shoulder launching guidance law design method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Review of intelligent optimization methods for railway alignment (铁路线路智能优化方法研究综述); Xue Xingong; Li Wei; Pu Hao; Journal of the China Railway Society (铁道学报), No. 03; full text *

Also Published As

Publication number Publication date
CN114861368A (en) 2022-08-05

Similar Documents

Publication Publication Date Title
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
CN110262511B (en) Biped robot adaptive walking control method based on deep reinforcement learning
CN112356830B (en) Intelligent parking method based on model reinforcement learning
CN111766782B (en) Strategy selection method based on Actor-Critic framework in deep reinforcement learning
CN114237222B (en) Delivery vehicle path planning method based on reinforcement learning
Lee et al. Monte-carlo tree search in continuous action spaces with value gradients
CN114861368B (en) Construction method of railway longitudinal section design learning model based on near-end strategy
CN112613608A (en) Reinforced learning method and related device
CN114519433A (en) Multi-agent reinforcement learning and strategy execution method and computer equipment
CN114626505A (en) Mobile robot deep reinforcement learning control method
Tong et al. Enhancing rolling horizon evolution with policy and value networks
CN117520956A (en) Two-stage automatic feature engineering method based on reinforcement learning and meta learning
CN116307331B (en) Aircraft trajectory planning method
CN111507499B (en) Method, device and system for constructing model for prediction and testing method
CN116306947A (en) Multi-agent decision method based on Monte Carlo tree exploration
CN112297012B (en) Robot reinforcement learning method based on self-adaptive model
Raza et al. Policy reuse in reinforcement learning for modular agents
Wenwen Application Research of end to end behavior decision based on deep reinforcement learning
Shi et al. Adaptive reinforcement q-learning algorithm for swarm-robot system using pheromone mechanism
Ji et al. Research on Path Planning of Mobile Robot Based on Reinforcement Learning
CN116718198B (en) Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph
CN118083808B (en) Dynamic path planning method and device for crown block system
Sakurai et al. Making Uncoordinated Autonomous Machines Cooperate: A Game Theory Approach
Garnica et al. Autonomous virtual vehicles with FNN-GA and Q-learning in a video game environment
Beltman Optimization of ideal racing line

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant