CN114019795B - Intelligent decision method for shield tunneling correction based on reinforcement learning - Google Patents

Intelligent decision method for shield tunneling correction based on reinforcement learning

Info

Publication number
CN114019795B
Authority
CN
China
Prior art keywords
shield
decision
deviation
model
deviation correcting
Prior art date
Legal status
Active
Application number
CN202111203818.XA
Other languages
Chinese (zh)
Other versions
CN114019795A (en)
Inventor
庄元顺
苏叶茂
牟松
徐进
刘绥美
李开富
张炬
朱菁
梅元元
张中华
陈可
刘洋
梁博
李才洪
杨冰
胡可
陈鑫
李明扬
Current Assignee
Southwest Jiaotong University
China Railway Engineering Service Co Ltd
China Railway Hi Tech Industry Corp Ltd
Original Assignee
Southwest Jiaotong University
China Railway Engineering Service Co Ltd
China Railway Hi Tech Industry Corp Ltd
Priority date
Filing date
Publication date
Application filed by Southwest Jiaotong University, China Railway Engineering Service Co Ltd, China Railway Hi Tech Industry Corp Ltd filed Critical Southwest Jiaotong University
Priority to CN202111203818.XA priority Critical patent/CN114019795B/en
Publication of CN114019795A publication Critical patent/CN114019795A/en
Application granted granted Critical
Publication of CN114019795B publication Critical patent/CN114019795B/en


Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B13/04: Adaptive control systems, electric, involving the use of models or simulators
    • G05B13/042: Adaptive control systems, electric, involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • E: FIXED CONSTRUCTIONS
    • E21: EARTH OR ROCK DRILLING; MINING
    • E21D: SHAFTS; TUNNELS; GALLERIES; LARGE UNDERGROUND CHAMBERS
    • E21D9/00: Tunnels or galleries, with or without linings; Methods or apparatus for making thereof; Layout of tunnels or galleries
    • E21D9/06: Making by using a driving shield, i.e. advanced by pushing means bearing against the already placed lining
    • E21D9/093: Control of the driving shield, e.g. of the hydraulic advancing cylinders


Abstract

The application belongs to the technical field of shield construction, and particularly relates to an intelligent decision method for shield tunneling deviation correction based on reinforcement learning. An environment state set, an action set and a reward function are designed, and a shield deviation correcting simulation environment is constructed; a shield deviation correcting decision model is constructed; a model evaluation method is constructed to obtain the shield deviation correcting decision model with the highest reward score after interaction with the simulation environment; the parameters of the value function network structure are determined by grid search; according to the grid search result, the determined shield deviation correcting decision model is trained for multiple rounds in the simulation environment; finally, the state data of the shield deviation correcting decision model are input into the final model, which directly outputs the value of each executable action as the decision scheme. The deviation correcting decision scheme provided by the application avoids the need for the shield driver to correct deviation by personal judgment according to site conditions and the serpentine correction caused by manual operation.

Description

Intelligent decision method for shield tunneling correction based on reinforcement learning
Technical Field
The application belongs to the technical field of shield construction, and particularly relates to an intelligent decision method for shield tunneling deviation correction based on reinforcement learning.
Background
Tunnel construction is an important component of underground space development, and shield machines are very widely used to complete tunnel boring tasks. In a shield construction project, the shield attitude is a key factor in the shield operator's decisions on the advance scheme: when the shield machine deviates from the design axis, the relevant tunneling parameters need to be adjusted in time so that the machine gradually returns to the axis. The attitude of the shield machine is closely related to ground surface settlement, segment assembly and so on, and directly affects the quality and alignment of the finished tunnel. Therefore, control of the advance attitude of the shield machine is a key problem in the quality management of shield construction projects. Existing shield attitude correction technology can be roughly divided into the following categories:
(1) Based on the three-point method, and combined with devices such as a total station, prisms and inclinometers, the calculation of shield coordinates and attitude deviation is improved, providing the necessary basic support for deviation correction control decisions on the shield attitude.
(2) Through descriptive statistics and regression analysis of construction history data, the influence of tunneling parameters such as cylinder stroke and earth chamber pressure on the shield attitude is analyzed and the correlation between tunneling parameters and shield attitude is explored; the relevant tunneling parameters are then adjusted in reverse according to these rules to control the shield direction, providing a theoretical basis for the deviation correction operating decisions of the shield operator.
(3) Feature selection is applied to the tunneling parameters by methods such as recursive feature elimination and random forests, and the attitude deviation, attitude angle and so on of the shield are predicted by methods such as XGBoost and neural networks. Because the shield attitude is the main basis for shield operating decisions, the attitude predicted from the construction parameters can serve as a reference for the next parameter adjustment decision, so that parameters can be adjusted in advance to control the tunneling direction.
(4) Fuzzy mathematics and PID control are adopted: a deviation correction curve model of the shield is established based on a kinematic model, and a PID control system with outer-loop planning and precise inner-loop control is built to realize deviation correction control of the shield attitude.
Accurate measurement and calculation of the shield attitude, the influence of related parameters on the attitude, and attitude prediction research all provide basic decision support for shield operators, but they cannot provide a direct decision scheme, and every shield operation still has to be controlled manually. Manual correction easily causes serpentine correction because the correction timing and correction amount are hard to control, and precise control of the shield attitude depends heavily on the project experience of the operators.
Disclosure of Invention
The application discloses an intelligent decision method for shield tunneling deviation correction based on reinforcement learning, which aims to solve the problems identified in the background art: precise control of the shield attitude depends heavily on the project experience of operators, and purely manual correction easily causes serpentine correction because the correction timing and amount are hard to control.
In order to solve the technical problems, the application adopts the following technical scheme:
a shield tunneling deviation correction intelligent decision method based on reinforcement learning comprises the following steps:
step 1: combining the tunneling deviation correcting strategy process and technical experience of the shield project site, designing an environment state set, an action set and a reward function, and constructing a shield deviation correcting simulation environment based on a reinforcement learning frame;
step 2: constructing a shield deviation correcting decision model interacted with a shield deviation correcting simulation environment;
step 3: constructing a model evaluation method to obtain a shield deviation rectification decision model with the highest rewarding score after the shield deviation rectification decision model interacts with a shield deviation rectification simulation environment;
step 4: determining the parameters of the value function network structure of the shield deviation correcting decision model by a grid searching method;
step 5: according to the grid search result, carrying out multi-round training in a simulation environment on the shield deviation correcting decision model whose value function network structure has been determined; because the value function network is a component of the shield deviation correcting decision model, determining the value function network structure from the grid search result also determines the structure of the decision model;
step 6: inputting the state data of the shield deviation correcting decision model into the final model, which directly outputs the value of each executable action as the decision scheme. The final model is the training model saved after the shield deviation correcting decision model determined in step 5 has been trained for multiple rounds in the simulation environment.
The method can provide a reference decision scheme for the shield driver according to the adjustment strategy of the key decision parameters obtained from the environmental state set, helps relieve the driver's decision uncertainty and decision burden, reduces the driver's working intensity, and effectively ensures that the driver has enough energy to produce a timely and efficient response when facing an emergency, thereby improving project safety and ensuring steady project progress; the deviation correcting decision scheme provided by the application avoids the need for the shield driver to correct deviation by personal judgment according to site conditions and the serpentine correction caused by manual operation.
Preferably, the environmental state set consists of the key shield attitude parameters measured by the shield measurement system.
Preferably, the key attitude parameters of the shield include the notch ring horizontal deviation, notch ring vertical deviation, shield tail horizontal deviation, shield tail vertical deviation, roll angle, pitch angle, horizontal yaw angle and vertical yaw angle.
Preferably, the action set is designed according to the shield deviation correcting principle.
Preferably, the reward function is designed based on the deviation correcting direction and the deviation correcting speed of the shield and the deviation between the shield machine and the design curve;
R(s,a,s′)=r_d+r_v+r_y
where r_d denotes the reward for the shield deviation correcting direction, r_v the reward for the deviation correcting speed, and r_y the axis deviation reward of the shield machine;
the reward for the deviation correcting direction is given by:
r_d = 1, if (Δy_{t-1} < 0 and d_rt > 0) or (Δy_{t-1} > 0 and d_rt < 0) or (Δy_{t-1} = 0 and d_rt = 0); r_d = -1, otherwise
where Δy_{t-1} is the deviation between the shield machine and the design axis at the previous moment t-1 and d_rt = Δy_t - Δy_{t-1} is the change in that deviation;
the reward for the deviation correcting speed is given by:
r_v = 1 if 0 < |d_rt| < 1 mm; r_v = 0 if 1 mm ≤ |d_rt| < 3 mm; r_v = -1 if 3 mm ≤ |d_rt| < 5 mm; r_v = -10 if 5 mm ≤ |d_rt| < 10 mm; r_v = -100 if |d_rt| ≥ 10 mm
where |d_rt| is the absolute value of the change in the axis deviation of the shield machine;
the reward for the axis deviation is given by:
r_y = 0 if 0 ≤ |Δy_t| < 3 mm; r_y = -1 if 3 mm ≤ |Δy_t| < 5 mm; r_y = -10 if 5 mm ≤ |Δy_t| < 10 mm; r_y = -100 if |Δy_t| ≥ 10 mm
where |Δy_t| is the distance by which the shield machine deviates from the design axis.
Further, in order to improve sample utilization and reduce sample correlation, the step 2 further comprises constructing an experience pool with a queue structure, used to store the interaction history data obtained when the shield deviation correcting decision model interacts with the shield deviation correcting simulation environment; sample data drawn from the experience pool are used to train the value function network.
Furthermore, because both the interaction with the shield deviation correcting simulation environment and the model update come from the same shield deviation correcting decision model, the data generated by interaction affect the iteration. To reduce the instability of model updates, the shield deviation correcting decision model comprises two convolutional neural networks with identical structures, forming a double-network mechanism: one serves as the online network, which selects the decision action with the highest value and interacts with the simulation environment to obtain samples; the other serves as the target network, used to calculate the value of the decision executed by the online network. At each training step the parameters of the online network are updated according to the training formula below, and the parameters of the target network are replaced by the parameters of the online network after a number of iterations;
the training formula is as follows:
Q(s,a;θ) ← Q(s,a;θ) + α[r + γQ(s′, argmax_{a′} Q(s′,a′;θ); θ⁻) - Q(s,a;θ)]
wherein: s is the environment state of the shield deviation correcting decision model, a is the decision action of the model in that state, s′ is the next environment state reached after the decision action is executed, r is the feedback obtained from the environment after the action is executed, γ is the discount factor, Q(s,a;θ) is the value function network, θ denotes the parameters of the online network, θ⁻ denotes the parameters of the target network, and α denotes the degree of each parameter update.
Preferably, the step 4 includes the steps of:
step 4.1: sorting out the structural parameters of the value function network that need to be determined;
step 4.2: determining a candidate value list of the parameters to be determined;
step 4.3: combining the candidate values of the undetermined parameters to form a plurality of candidate shield deviation correcting decision models;
step 4.4: training and testing the candidate shield deviation correcting decision models in the same round number in a shield deviation correcting simulation environment, and calculating test average scores of the candidate shield deviation correcting decision models according to the model evaluation method established in the step 3;
step 4.5: and determining the value of the parameter corresponding to the shield deviation correcting decision model with the highest test average score as the network structure parameter of the value function.
Preferably, the step 5 includes the steps of:
step 5.1: initializing parameters of a shield deviation correcting decision model; initializing experience pool capacity, parameter epsilon, sampling batch batch_size, online value function network parameters and initializing parameters of a target network by using the online network parameters;
step 5.2: initializing a simulation environment to obtain an initial state and preprocessing;
step 5.3: if epsilon is larger than 0.5, randomly selecting actions with epsilon probability, otherwise, selecting the action with the largest value with 1-epsilon probability;
step 5.4: the shield deviation correcting decision model executes the action with the highest value selected in step 5.3, obtains the new state, reward and termination signal returned by the shield deviation correcting simulation environment, and stores the previous state, the executed action, the reward, the new state and the termination signal into the experience pool as experience data;
step 5.5: if the new state is a termination state or negative rewards are obtained in the continuous prescribed times, the training of the current round is terminated, and the step 5.2 is executed;
step 5.6: when the number of the samples in the experience pool is larger than that of the samples in the training batch, randomly sampling the batch_size samples from the experience pool, and updating an online value function network by using a gradient descent method;
step 5.7: updating the target network by using the parameters of the online value function network every preset fixed iteration times;
step 5.8: and 5.2 to 5.7 are repeated until the set training round number is reached, and the trained shield deviation correcting decision model is stored.
Preferably, the step 5.2 includes the steps of:
step 5.21: color processing of the state image: converting the color space of the image so that it becomes a grayscale image; the state image is a screenshot taken while the shield deviation correcting decision model interacts with the visual simulation environment, i.e. an image reflecting the relative position of the shield machine and the design axis;
step 5.22: setting the shield deviation correcting decision model to make a decision once every 4 image frames;
step 5.23: using a frame stacking method, 3 image frames are stacked together as the input to the network.
In summary, due to the adoption of the technical scheme, the beneficial effects of the application are as follows:
1. The application integrates data analysis, machine learning, artificial intelligence and related technologies into the construction process of the industry, so that problems in the current construction process can be found in time, and the risks that may occur during construction and the consequences of the current decision scheme can be pre-judged in advance, thereby avoiding risks and making better decisions.
2. The intelligent decision of shield deviation correction is realized by constructing the reinforcement learning model, the intelligent decision management trend of the building construction industry including shield projects is met, the informatization and data development of the industry are promoted, and the application field of reinforcement learning is expanded.
3. The adjustment strategy for the key decision parameters obtained from the environmental state can provide a reference decision scheme for the shield driver, helps relieve the driver's decision uncertainty and decision burden, reduces the driver's working intensity, and effectively ensures that the driver has enough energy to produce a timely and efficient response when facing an emergency, thereby improving project safety and ensuring steady project progress. The deviation correcting decision scheme provided by the application avoids the need for the shield driver to correct deviation by personal judgment according to site conditions and the serpentine correction caused by manual operation.
4. Deep reinforcement learning is introduced into the exploration of intelligent decision-making for shield projects, and the designed model scheme can also be applied to research on the intelligent management of related large equipment, enriching the applied research of reinforcement learning in the field of large-scale engineering.
Drawings
The application will now be described by way of example and with reference to the accompanying drawings in which:
FIG. 1 is a flow chart of the intelligent decision-making method for shield tunneling deviation correction.
FIG. 2 is an example of a visual simulation environment for shield deviation rectification decision making in the present application.
FIG. 3 is an overall framework of the shield deviation correcting decision model of the present application.
FIG. 4 is a graph showing a comparative multi-model evaluation method according to the present application.
FIG. 5 shows the decision result of the shield deviation correcting decision model of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.
Embodiments of the present application are described in detail below with reference to fig. 1 and 5; the intelligent decision model of shield deviation correction mentioned in fig. 1 is simply called a shield deviation correction decision model.
A shield tunneling deviation correction intelligent decision method based on reinforcement learning comprises the following steps:
step 1: constructing a shield deviation correction simulation environment based on a reinforcement learning frame;
(1) Design of environmental state sets
In the shield construction process, the direct decision maker who determines how to tunnel and how to adjust the parameters is the shield driver. From the distances of the shield notch and the shield tail center from the design axis in the horizontal and vertical directions, together with the attitude angles and the tunneling trend, the shield driver can judge whether the shield machine is tilting up or nose-diving, whether it deviates from the axis, whether the degree of deviation is acceptable, whether correction is needed, and so on. The state data in the environmental state set comprise the key attitude parameters measured by the shield measurement system. The state space of the simulation environment is therefore defined as S = S1 × S2 × S3 × S4 × S5 × S6 × S7 × S8, where Si (i = 1, 2, ..., 8) denotes each key attitude parameter, as shown in Table 1 below.
Table 1 key attitude parameters for measurement and calculation by measurement system
S: parameter (unit): description
S1: notch ring horizontal deviation (mm): horizontal distance between the shield machine head (cutterhead center) and the axis
S2: notch ring vertical deviation (mm): vertical distance between the shield machine head (cutterhead center) and the axis
S3: shield tail horizontal deviation (mm): horizontal distance between the tail of the shield machine and the axis
S4: shield tail vertical deviation (mm): vertical distance between the tail of the shield machine and the axis
S5: roll angle (degrees): angle by which the shield machine rotates about its body axis
S6: pitch angle (degrees): angle between the axial direction of the shield machine and the horizontal plane
S7: horizontal yaw angle (mm/m): horizontal deviation produced per 1 m of tunneling if tunneling continues in the current horizontal direction
S8: vertical yaw angle (mm/m): vertical deviation produced per 1 m of tunneling if tunneling continues in the current vertical direction
In actual operation on a project site, the shield tunneling decisions of the decision maker are mainly based on the system diagram displayed on a monitoring screen. Because the attitude parameters in the state set are obtained from individual measurements and calculations by the measurement system, real-time support from the measurement system is not available when constructing the simulation environment; therefore a picture reflecting the shield tunneling situation is used as the state, i.e. the shield tunneling state displayed on the monitoring screen in real time is taken as the state reflecting the shield tunneling.
(2) Action set design
To control the shield attitude, the track deviation of the shield machine is generally corrected by changing the advancing and rotating direction of the cutterhead. The jacks behind the shield cutterhead are generally divided into groups A, B, C and D, representing the right, lower, left and upper areas respectively. The thrust stroke of the jacks is changed mainly by adjusting the thrust pressures of the left and right groups (group C and group A), and the correction moment is adjusted through the stroke difference; the correction control principle in the vertical direction is the same as in the horizontal direction, mainly adjusting the thrust stroke difference of groups B and D. Therefore, taking the horizontal direction as an example, the action set is defined as A = A1 × A2 × A3, where A1 represents the thrust pressure change of group A, A2 represents the thrust pressure change of group C, and A3 represents the cutterhead rotational speed. To avoid too rapid a pressure change, the range of the thrust pressure change is set to [-5, 5]: a negative value means decreasing the thrust pressure, 0 means keeping the pressure unchanged, and a positive value means increasing the pressure. When a pressure-changing action is executed, the adjusted thrust pressure is also required to remain within the allowable threshold, as detailed in Table 2 below. The threshold range is the reasonable value range of the thrust pressure, which is the range defined in the technical specification parameters of the shield machine.
Table 2 basic actions of horizontal deviation correction for shield
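As a concrete illustration, the discrete action set A = A1 × A2 × A3 described above can be enumerated directly. The following is a minimal sketch, assuming integer pressure-change steps in [-5, 5], a small illustrative list of cutterhead speed levels (the specific values of Table 2 are not reproduced here), and clipping of the adjusted pressure to an assumed allowable threshold; all names and numeric bounds other than the [-5, 5] range are hypothetical.

```python
import itertools
import numpy as np

PRESSURE_CHANGES = range(-5, 6)          # change of thrust pressure for groups A and C: -5 ... +5
CUTTERHEAD_SPEEDS = (0.8, 1.0, 1.2)      # illustrative speed levels; the actual values come from Table 2

# A = A1 x A2 x A3: every combination of (group A change, group C change, cutterhead speed)
ACTION_SET = list(itertools.product(PRESSURE_CHANGES, PRESSURE_CHANGES, CUTTERHEAD_SPEEDS))

def apply_action(pressure_a, pressure_c, action, p_min=0.0, p_max=30.0):
    """Apply one action, keeping the adjusted pressures within an assumed threshold [p_min, p_max]."""
    dp_a, dp_c, speed = action
    new_a = float(np.clip(pressure_a + dp_a, p_min, p_max))
    new_c = float(np.clip(pressure_c + dp_c, p_min, p_max))
    return new_a, new_c, speed
```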
(3) Design of reward functions
The application designs the reward function from three aspects: the deviation correcting direction, the deviation correcting speed, and the deviation between the shield machine and the design axis; the formula is as follows:
R(s,a,s′)=r_d+r_v+r_y
wherein r_d represents rewards of shield deviation correcting directions, r_v represents rewards of deviation correcting speeds, and r_y represents axis deviation rewards of the shield machine;
deviation rectifying direction
Let Δy_t be the deviation between the shield and the design axis at time t, Δy_{t-1} the deviation at the previous moment t-1, and d_rt = Δy_t - Δy_{t-1} the change in deviation before and after correction. From the change in deviation and the sign of the deviation at the previous moment it can be judged whether the shield is tunneling toward the design axis, and corresponding positive or negative feedback is given as the direction reward r_d:
r_d = 1, if (Δy_{t-1} < 0 and d_rt > 0) or (Δy_{t-1} > 0 and d_rt < 0) or (Δy_{t-1} = 0 and d_rt = 0); r_d = -1, otherwise
for example: if the shield is positioned at the left side of the axis at the previous moment, namely delta y t-1 <0, d is the right side of the original position of the shield at the next moment rt >0, indicating that the shield realizes rightward deviation correction, and giving forward rewards 1; if the shield is positioned on the right side of the axis at the previous moment, namely delta y t-1 >0, d is the left side of the original position of the shield at the next moment rt <0, indicating that the shield achieves a left correction, and therefore also gives a positive prize of 1. Combining the two conditions, namely when the original position of the shield is not positioned on the axis, realizing the correction of the corresponding direction, the environment feeds back the forward rewards 1; conversely, when correction is needed and tunneling towards the correction direction is not performed, the environment feeds back negative rewards-1. If the shield is on the axis at the beginning, namely delta y t-1 When =0, the variation d of the front and rear time is rt When=0, the shield advances along the axis, and the environment feeds back a positive reward 1; variation d of the deviation between the front and rear time rt A value of 0 indicates that the environment will give a negative prize of-1 when the shield begins to deviate from the axis.
Deviation correcting speed
According to the adjustment principle of correcting gradually, without abrupt or violent correction, a larger negative feedback must be given to actions for which the change in deviation before and after correction is too large:
r_v = 1 if 0 < |d_rt| < 1 mm; r_v = 0 if 1 mm ≤ |d_rt| < 3 mm; r_v = -1 if 3 mm ≤ |d_rt| < 5 mm; r_v = -10 if 5 mm ≤ |d_rt| < 10 mm; r_v = -100 if |d_rt| ≥ 10 mm
where |d_rt| is the absolute value of the change in the shield-axis deviation. The application segments the correction amount: when the correction amount is greater than 0 mm and less than 1 mm, it conforms to the principle of slow correction and a positive reward of 1 is given; when it is greater than or equal to 1 mm and less than 3 mm, the current correction amount is considered acceptable and no reward is given; when it is greater than or equal to 3 mm and less than 5 mm, the correction amount is considered large and a negative reward of -1 is given; when it is greater than or equal to 5 mm and less than 10 mm, the correction amount exceeds the acceptable range and may cause serious accidents, so a larger negative reward of -10 is given; and when it reaches 10 mm or more, the correction speed seriously exceeds the regulations and a negative reward of -100 is given.
Deviation size
In order to strictly control the tunneling of the shield machine so that it follows the design axis, besides ensuring the correct correction direction and correction speed, the deviation from the design axis should not be allowed to become too large:
r_y = 0 if 0 ≤ |Δy_t| < 3 mm; r_y = -1 if 3 mm ≤ |Δy_t| < 5 mm; r_y = -10 if 5 mm ≤ |Δy_t| < 10 mm; r_y = -100 if |Δy_t| ≥ 10 mm
where |Δy_t| is the distance by which the shield machine deviates from the design axis. The acceptance criterion generally requires the deviation of the finished tunnel to be within 5 mm, so besides the correction direction and correction speed, the deviation itself is also controlled during the correction process: a deviation greater than or equal to 0 mm and less than 3 mm is treated as acceptable and receives no negative reward; when it is greater than or equal to 3 mm and less than 5 mm, a negative reward of -1 is given; when it is greater than or equal to 5 mm and less than 10 mm, a negative reward of -10 is given; and when it is greater than or equal to 10 mm, a negative reward of -100 is given.
(4) Visual shield deviation rectifying simulation environment
The open-source toolkit Gym is used as the basis for building the environment, and the designed simulation environment is visualised, as shown in FIG. 2. Corresponding to the shield tunneling process, the black rectangle in the figure represents the shield deviation correcting decision model (the agent); the areas on both sides of the black rectangle are the rock and soil environment around the design axis, and the line on which the black rectangle travels is the design axis. The whole image reflects the advancing direction of the agent during the interaction, its deviation from the design axis, and so on.
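A custom Gym environment with this structure could be organised as in the following sketch. It assumes the older Gym API (reset() returning an observation and step() returning a 4-tuple), a discrete action index over the A1 × A2 × A3 combinations, and a rendered monitoring-style image as the observation; the internal kinematics update is only a placeholder, since the patent does not disclose it, and the reward reuses the total_reward helper from the sketch above.

```python
import itertools
import numpy as np
import gym
from gym import spaces

class ShieldCorrectionEnv(gym.Env):
    """Minimal sketch of a shield deviation correcting environment (illustrative, not the patented implementation)."""

    def __init__(self, img_size=(96, 96)):
        super().__init__()
        # A1/A2: thrust pressure change of groups A and C in [-5, 5]; A3: cutterhead speed level (illustrative)
        self.actions = list(itertools.product(range(-5, 6), range(-5, 6), (0, 1)))
        self.action_space = spaces.Discrete(len(self.actions))
        self.observation_space = spaces.Box(0, 255, shape=(*img_size, 1), dtype=np.uint8)
        self.img_size = img_size
        self.dy = 0.0

    def reset(self):
        self.dy = 0.0                      # deviation from the design axis (mm)
        return self._render_state()

    def step(self, action_idx):
        dp_a, dp_c, _speed = self.actions[action_idx]
        dy_prev = self.dy
        # placeholder kinematics: a pressure difference between groups A and C shifts the shield laterally
        self.dy += 0.1 * (dp_a - dp_c)
        reward = total_reward(dy_prev, self.dy)    # reward terms from the previous sketch
        done = abs(self.dy) >= 10.0
        return self._render_state(), reward, done, {}

    def _render_state(self):
        # stand-in for the monitoring-screen screenshot that serves as the state image
        img = np.zeros((*self.img_size, 1), dtype=np.uint8)
        col = int(np.clip(self.img_size[1] / 2 + self.dy, 0, self.img_size[1] - 1))
        img[:, col, 0] = 255
        return img
```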
Step 2: and constructing a shield deviation correcting decision model.
The whole frame of the constructed shield deviation correcting decision model is shown in figure 3 based on the reinforcement learning algorithm.
The shield deviation correcting decision model is an intelligent agent interacting with the shield deviation correcting simulation environment (hereinafter, the shield deviation correcting decision model is called as the intelligent agent); the goal is to maximize the overall rewards obtained from the environment. The input of the intelligent agent is the environmental state, and the output is the action with the highest value under the corresponding state, namely decision. In the intelligent agent, a convolutional neural network is adopted as a value function network to evaluate the value of the decision.
The agent generates one piece of experience data in each interaction with the environment, including the reward r from the environment; in one round of interaction the agent interacts with the environment n times, and the goal is that, through training, the agent's overall reward for one round of interaction with the environment is maximised.
Considering that rewards for actions performed at future times generally have less impact on the current time, a discount factor γ is introduced into the calculation of the overall reward: R_t = Σ_{k=0} γ^k r_{t+k+1}
where t is the current time, k counts the steps after t, γ is the discount factor, and R_t is the overall (discounted) reward from time t onward.
The training data of the agent come from the experience data obtained by interaction between the agent and the shield deviation correcting simulation environment. In order to improve the utilization of experience data and reduce the correlation between samples, an experience pool is built using a queue mechanism: a certain amount of experience data is stored in the pool, and data are then sampled from it in batches for training the value function network; the data sampled in a batch are referred to here as sample data.
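A queue-based experience pool of this kind is commonly implemented with a fixed-length deque; the following is a minimal sketch under that assumption (the class and method names are illustrative).

```python
import random
from collections import deque

class ExperiencePool:
    """FIFO experience pool: old interactions are discarded once the queue is full."""

    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # random sampling breaks the temporal correlation between consecutive interactions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```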
In addition, if both the interaction with the shield deviation correcting simulation environment and the update come from the same model, the data generated by interaction will affect the iteration. To reduce the instability of the agent update, two convolutional neural networks with completely identical structures are used to form a double-network mechanism: one serves as the online network, which selects the decision action with the highest value and interacts with the simulation environment to obtain experience data; the other serves as the target network, used to calculate the value of the decision executed by the online network. At each training step the parameters of the online network are updated according to the following formula, and the parameters of the target network are replaced with the parameters of the online network after a certain number of iterations. The formula is as follows:
Q(s,a;θ) ← Q(s,a;θ) + α[r + γQ(s′, argmax_{a′} Q(s′,a′;θ); θ⁻) - Q(s,a;θ)]
wherein: s is the environment state of the shield deviation correcting decision model, a is the decision action of the model in that state, s′ is the next environment state reached after the decision action is executed, r is the feedback obtained from the environment after the action is executed, γ is the discount factor, Q(s,a;θ) is the value function network, θ denotes the parameters of the online network, θ⁻ denotes the parameters of the target network, and α denotes the degree of each parameter update.
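The double-network update above corresponds to the Double DQN target: the online network chooses the next action, while the target network evaluates it. A small NumPy sketch of the target computation for a sampled batch is shown below (the array names are illustrative; the actual gradient step would then regress Q(s,a;θ) toward these targets).

```python
import numpy as np

def ddqn_targets(q_online_next, q_target_next, rewards, dones, gamma=0.95):
    """Compute r + γ * Q_target(s', argmax_a' Q_online(s', a')) for a batch.

    q_online_next, q_target_next: arrays of shape (batch, n_actions) holding the
    online and target network outputs for the next states s'.
    rewards, dones: arrays of shape (batch,).
    """
    best_actions = np.argmax(q_online_next, axis=1)                       # action selection by the online network
    next_values = q_target_next[np.arange(len(rewards)), best_actions]    # evaluation by the target network
    return rewards + gamma * next_values * (1.0 - dones)                  # no bootstrap on terminal transitions
```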
Step 3: evaluation method for design intelligent agent
The trained and saved agent is made to interact with the shield deviation correcting simulation environment for 100 rounds, and the average reward score obtained by the agent over the 100 test rounds is counted; the higher the reward score, the better the strategy learned by the model.
If several agents need to be compared, each agent interacts with the shield deviation correcting simulation environment for 100 rounds, the overall reward obtained by each agent in each round is counted, and the agent strategy with the highest average round score is the best.
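Such an evaluation amounts to averaging the episode return over a fixed number of test rounds. A minimal sketch is given below, assuming a greedy policy over the value network's outputs and the older Gym step API; the function names are illustrative.

```python
import numpy as np

def evaluate(env, q_values_fn, episodes=100):
    """Average total reward of a greedy policy over `episodes` test rounds.

    q_values_fn(state) should return a vector of action values for one state,
    e.g. a wrapper around the trained value function network.
    """
    scores = []
    for _ in range(episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            action = int(np.argmax(q_values_fn(state)))   # greedy action, no exploration during testing
            state, reward, done, _ = env.step(action)
            total += reward
        scores.append(total)
    return float(np.mean(scores))
```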
Step 4: network structure for determining value function
The value function network mainly extracts information from the current state image of the shield in order to calculate values; considering the size of the image and the computation of the convolution layers, the neural network does not need to be very deep. In this embodiment the model is designed with two convolution layer / max-pooling layer pairs to extract features from the input state; the extracted feature data are flattened by a flatten layer, and the value of each action is obtained through two fully connected layers.
Step 4.1: sorting out the structural parameters of the value function network that need to be determined; based on the preliminarily designed network structure, these parameters comprise the size and number of the convolution kernels of the two convolution layers and the number of units of the fully connected layer;
Step 4.2: determining a candidate value list for the parameters to be determined. The candidate values are determined empirically from common convolutional neural network parameter settings, as shown in Table 3 below:
Table 3 Value function network parameters to be determined
Step 4.3: combining the candidate parameter values to form 54 candidate agents;
Step 4.4: training each of the 54 candidate agents for 600 rounds in the shield deviation correcting simulation environment, performing 100 rounds of tests, and calculating the test average score of each candidate model according to the model evaluation method;
Step 4.5: based on the grid search results, the agent "3-32-4-64-64" achieved the highest score in the tests and ranked at the top in the average number of actions performed per round. The value function network structure of the final model is therefore:
the first convolution layer uses 32 convolution kernels of size 3×3 with a stride of 3 and a ReLU activation function;
the first pooling layer uses max pooling for dimensionality reduction, with a 2×2 filter;
the second convolution layer uses 64 convolution kernels of size 4×4 with the stride set to 1, and the activation function is again ReLU;
the second pooling layer is also a max-pooling layer, with the filter size set to 2×2;
the flattened data are fed into a fully connected layer of 64 neurons with ReLU activation;
the last fully connected layer is the output layer.
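Assembled in code, the selected "3-32-4-64-64" structure could look like the following sketch. The patent does not name a framework, so Keras is assumed here; the 96×96 input size and the use of 3 stacked grayscale frames as channels are likewise assumptions based on the preprocessing described later.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_value_network(n_actions, input_shape=(96, 96, 3)):
    """Sketch of the value function network: two conv/max-pool pairs, flatten, two dense layers."""
    model = keras.Sequential([
        layers.Conv2D(32, kernel_size=3, strides=3, activation="relu", input_shape=input_shape),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(64, kernel_size=4, strides=1, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_actions, activation="linear"),   # one value per action
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
    return model
```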
Step 5: and training an intelligent body. And according to the network search result, training the determined shield deviation correcting intelligent decision model in a simulation environment for a plurality of rounds.
Step 5.1: initializing agent parameters and related training parameter settings as shown in table 4:
table 4 initialization parameter settings
Parameter (value): meaning
episode (5000): number of rounds of interaction between the agent and the environment
batch_size (32): size of each sampled data batch
save_training_frequency (50): number of rounds between saves of the training model
update_target_model_frequency (5): number of rounds between updates of the target network
memory_size (50000): experience pool size
gamma (0.95): reward discount factor
epsilon (1.0): exploration rate
epsilon_decay (0.99999): decay rate of the exploration rate
epsilon_min (0.1): minimum value of the exploration rate
learning_rate (0.001): learning rate of the CNN network
Step 5.2: initializing a simulation environment, obtaining an initial state and processing, wherein the method comprises the following steps:
a. color processing of the state image, namely converting the color space of the image to change the image into a gray image;
b. frame skipping, namely setting an agent to make a decision every 4 frames of images;
c. the frame number stacking, because the superposition of multi-frame images can reflect the dynamic change of the agent, the application uses the method for superposing the frame number of the images, and the superposition of 3 frames of images is used as the input of the network;
step 5.3: if epsilon is larger than 0.5, randomly selecting actions with epsilon probability, otherwise, selecting the action with the largest value with 1-epsilon probability;
step 5.4: the intelligent agent executes the action selected in the step 5.3 to obtain a new state, rewards and termination signals returned by the environment, and stores the previous state, the executed action, the rewards, the new state and the termination signals into an experience pool as experience data;
step 5.5: if the new state is a termination state or negative rewards are obtained continuously for 10 times, the training of the current round is terminated, and the step 5.2 is executed in a return mode;
Step 5.6: when the amount of experience data in the experience pool is larger than one training batch, 32 pieces of experience data are randomly sampled from the pool and the online value function network is updated by gradient descent. Because the initial amount of data in the experience pool is 0, interaction data are stored into the pool one by one as the agent keeps interacting with the shield deviation correcting simulation environment; the training data of the agent are obtained by randomly sampling from the pool, the amount sampled for each training step being one batch, which in this embodiment is 32; once the amount of data stored in the experience pool exceeds 32, the agent can sample batches of data for training.
Step 5.7: updating the target network with parameters of the online value function network every 5 training rounds;
step 5.8: step 5.2 to step 5.7 are repeatedly executed, 550 rounds of training are performed, and the trained agent is saved.
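Putting steps 5.1 to 5.8 together, the training procedure might be organised as in the following sketch. It is a simplified outline under several assumptions: Keras models as produced by the earlier network sketch, the older Gym step API, grayscale conversion by simple channel averaging, the initialization values of Table 4, and a loose reading of the epsilon rule in step 5.3; none of the names below come from the patent itself.

```python
import random
from collections import deque
import numpy as np

def preprocess(frame):
    """Grayscale conversion of the monitoring-screen image (simple channel average, scaled to [0, 1])."""
    return frame.mean(axis=2, keepdims=True).astype(np.float32) / 255.0

def train(env, online, target, episodes=5000, batch_size=32, gamma=0.95,
          epsilon=1.0, epsilon_decay=0.99999, epsilon_min=0.1,
          update_target_every=5, save_every=50):
    memory = deque(maxlen=50000)                                  # experience pool (step 5.1)
    target.set_weights(online.get_weights())
    for episode in range(episodes):
        frames = deque([preprocess(env.reset())] * 3, maxlen=3)   # stack 3 frames (step 5.2)
        state = np.concatenate(frames, axis=-1)
        done, negatives = False, 0
        while not done and negatives < 10:
            # step 5.3: epsilon-greedy action selection
            if epsilon > 0.5 and random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(online.predict(state[None], verbose=0)[0]))
            # step 5.4: act with a 4-frame skip and store the transition
            reward = 0.0
            for _ in range(4):
                frame, r, done, _ = env.step(action)
                reward += r
                if done:
                    break
            frames.append(preprocess(frame))
            next_state = np.concatenate(frames, axis=-1)
            memory.append((state, action, reward, next_state, done))
            negatives = negatives + 1 if reward < 0 else 0        # step 5.5: stop after repeated negative rewards
            state = next_state
            epsilon = max(epsilon_min, epsilon * epsilon_decay)
            # step 5.6: sample a batch and update the online network with Double DQN targets
            if len(memory) > batch_size:
                batch = random.sample(memory, batch_size)
                s, a, r, s2, d = map(np.array, zip(*batch))
                q = online.predict(s, verbose=0)
                best = np.argmax(online.predict(s2, verbose=0), axis=1)
                q_next = target.predict(s2, verbose=0)
                q[np.arange(batch_size), a] = r + gamma * q_next[np.arange(batch_size), best] * (1 - d)
                online.train_on_batch(s, q)
        # step 5.7: periodically copy the online parameters into the target network
        if episode % update_target_every == 0:
            target.set_weights(online.get_weights())
        # step 5.8: periodically save the training model
        if episode % save_every == 0:
            online.save(f"shield_dqn_{episode}.h5")
```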
Step 6: and (5) intelligent decision of the shield deviation correcting model. And (3) carrying out decision test on the intelligent agent in the constructed shield deviation correction simulation environment, inputting a state image in which the intelligent agent is positioned into a model in the test process, and executing action value of the model decision output by the intelligent agent. A portion of the control trajectory of the agent model in the simulation environment is shown in fig. 5. According to the control decision result, the designed shield deviation rectifying intelligent decision model can basically move forward along the design axis at the parts of the eased curve and the straight curve, which indicates that the decision scheme of the model is effective and provides intelligent decision support for shield deviation rectifying decision of a project site.
The above examples merely illustrate specific embodiments of the application, which are described in more detail and are not to be construed as limiting the scope of the application. It should be noted that it is possible for a person skilled in the art to make several variants and modifications without departing from the technical idea of the application, which fall within the scope of protection of the application.

Claims (8)

1. The intelligent decision-making method for shield tunneling deviation correction based on reinforcement learning is characterized by comprising the following steps:
step 1: designing an environment state set, an action set and a reward function, and constructing a shield deviation correction simulation environment based on a reinforcement learning frame;
the environment state set is a shield key attitude parameter measured by a shield measurement system;
the action set is designed according to the shield deviation correcting principle;
the reward function is designed based on deviation correcting direction, deviation correcting speed and deviation between the shield machine and a design curve;
step 2: constructing a shield deviation correcting decision model interacted with a shield deviation correcting simulation environment;
step 3: constructing a model evaluation method to obtain a shield deviation rectification decision model with the highest rewarding score after the shield deviation rectification decision model interacts with a shield deviation rectification simulation environment; the shield deviation correcting decision model comprises a double-network mechanism formed by two convolution neural networks with identical structures;
step 4: determining the parameters of the value function network structure of the shield deviation correcting decision model by a grid searching method;
step 5: according to the grid search result, carrying out multi-pass training on the shield deviation correction decision model of the determined value function network structure in a simulation environment;
step 6: and (3) inputting the state data of the shield deviation correcting decision model into a final model, and directly outputting the value of the execution action by the model to be used as a decision scheme.
2. The intelligent decision-making method for shield tunneling correction based on reinforcement learning according to claim 1, wherein the key attitude parameters of the shield include the notch ring horizontal deviation, notch ring vertical deviation, shield tail horizontal deviation, shield tail vertical deviation, roll angle, pitch angle, horizontal yaw angle and vertical yaw angle.
3. The intelligent decision-making method for shield tunneling correction based on reinforcement learning according to claim 1, wherein the reward function is represented by the following formula;
R(s,a,s′)=r_d+r_v+r_y
where r_d denotes the reward for the shield deviation correcting direction, r_v the reward for the deviation correcting speed, and r_y the axis deviation reward of the shield machine;
the reward for the deviation correcting direction is given by:
r_d = 1, if (Δy_{t-1} < 0 and d_rt > 0) or (Δy_{t-1} > 0 and d_rt < 0) or (Δy_{t-1} = 0 and d_rt = 0); r_d = -1, otherwise
where Δy_{t-1} is the deviation between the shield machine and the design axis at the previous moment t-1 and d_rt = Δy_t - Δy_{t-1} is the change in that deviation;
the reward for the deviation correcting speed is given by:
r_v = 1 if 0 < |d_rt| < 1 mm; r_v = 0 if 1 mm ≤ |d_rt| < 3 mm; r_v = -1 if 3 mm ≤ |d_rt| < 5 mm; r_v = -10 if 5 mm ≤ |d_rt| < 10 mm; r_v = -100 if |d_rt| ≥ 10 mm
where |d_rt| is the absolute value of the change in the axis deviation of the shield machine;
the reward for the axis deviation is given by:
r_y = 0 if 0 ≤ |Δy_t| < 3 mm; r_y = -1 if 3 mm ≤ |Δy_t| < 5 mm; r_y = -10 if 5 mm ≤ |Δy_t| < 10 mm; r_y = -100 if |Δy_t| ≥ 10 mm
where |Δy_t| is the distance by which the shield machine deviates from the design axis.
4. The intelligent decision-making method for shield tunneling correction based on reinforcement learning according to claim 1, wherein the step 2 further comprises constructing an experience pool by adopting a queue structure for storing training data obtained by interaction of a shield correction decision-making model and a shield correction simulation environment; the training data in the experience pool is used for training of the value function network.
5. The intelligent decision-making method for shield tunneling correction based on reinforcement learning according to claim 1, wherein the shield deviation correcting decision model comprises a double-network mechanism formed by two convolutional neural networks with identical structures; one serves as an online network, which selects the decision action with the highest value and interacts with the shield deviation correcting simulation environment to obtain samples; the other serves as a target network, used to calculate the value of the decision executed by the online network; at each training step the parameters of the online network are updated according to the following training formula, and the parameters of the target network are replaced by the parameters of the online network after a number of iterations;
the training formula is as follows:
Q(s,a;θ) ← Q(s,a;θ) + α[r + γQ(s′, argmax_{a′} Q(s′,a′;θ); θ⁻) - Q(s,a;θ)]
wherein: s is the environment state of the shield deviation correcting decision model, a is the decision action of the model in that state, s′ is the next environment state reached after the decision action is executed, r is the feedback obtained from the environment after the action is executed, γ is the discount factor, Q(s,a;θ) is the value function network, θ denotes the parameters of the online network, θ⁻ denotes the parameters of the target network, and α denotes the degree of each parameter update.
6. The intelligent decision-making method for shield tunneling correction based on reinforcement learning according to claim 4, wherein the step 4 comprises the following steps:
step 4.1: sorting out the structural parameters of the value function network that need to be determined;
step 4.2: determining a candidate value list of the parameters to be determined;
step 4.3: combining the candidate values of the undetermined parameters to form a plurality of candidate shield deviation correcting decision models;
step 4.4: training and testing the candidate shield deviation correcting decision models in the same round number in a shield deviation correcting simulation environment, and calculating test average scores of the candidate shield deviation correcting decision models according to the model evaluation method established in the step 3;
step 4.5: and determining the value of the parameter corresponding to the shield deviation correcting decision model with the highest test average score as the network structure parameter of the value function.
7. The intelligent decision-making method for shield tunneling correction based on reinforcement learning according to claim 6, wherein the step 5 comprises the following steps:
step 5.1: initializing model parameters; initializing experience pool capacity, parameter epsilon, sampling batch batch_size, online value function network parameters and initializing parameters of a target network by using the online network parameters;
step 5.2: initializing the shield deviation correcting simulation environment, obtaining an initial state and preprocessing it;
step 5.3: if epsilon is larger than 0.5, randomly selecting actions with epsilon probability, otherwise, selecting the action with the largest value with 1-epsilon probability;
step 5.4: the shield deviation correcting decision model executes the action with the highest value selected in step 5.3, obtains the new state, reward and termination signal returned by the shield deviation correcting simulation environment, and stores the previous state, the executed action, the reward, the new state and the termination signal into the experience pool as experience data;
step 5.5: if the new state is a termination state or negative rewards are obtained in the continuous prescribed times, the training of the current round is terminated, and the step 5.2 is executed;
step 5.6: when the number of the samples in the experience pool is larger than that of the samples in the training batch, randomly sampling the batch_size samples from the experience pool, and updating an online value function network by using a gradient descent method;
step 5.7: updating the target network by using the parameters of the online value function network every preset fixed iteration times;
step 5.8: and 5.2 to 5.7 are repeated until the set training round number is reached, and the trained shield deviation correcting decision model is stored.
8. The intelligent decision-making method for shield tunneling correction based on reinforcement learning according to claim 7, wherein the step 5.2 comprises the following steps:
step 5.21: color processing of the state image, converting the color space of the image, and changing the image into a gray image;
step 5.22: setting a shield deviation correcting decision model to make a decision every 4 frames of images;
step 5.23: using a frame stacking method, 3 image frames are stacked together as the input to the network.
CN202111203818.XA 2021-10-15 2021-10-15 Intelligent decision method for shield tunneling correction based on reinforcement learning Active CN114019795B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111203818.XA CN114019795B (en) 2021-10-15 2021-10-15 Intelligent decision method for shield tunneling correction based on reinforcement learning


Publications (2)

Publication Number Publication Date
CN114019795A CN114019795A (en) 2022-02-08
CN114019795B true CN114019795B (en) 2023-10-20

Family

ID=80056253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111203818.XA Active CN114019795B (en) 2021-10-15 2021-10-15 Intelligent decision method for shield tunneling correction based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN114019795B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114722697A (en) * 2022-03-09 2022-07-08 山东拓新电气有限公司 Method and device for determining control parameters of heading machine based on machine learning
CN114706322B (en) * 2022-03-30 2023-06-23 西南交通大学 Automatic control simulation system for posture of shield tunneling machine
CN115057006A (en) * 2022-06-15 2022-09-16 中国科学院软件研究所 Distillation strategy evaluation method, device and medium based on reinforcement learning


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210287459A1 (en) * 2018-09-30 2021-09-16 Strong Force Intellectual Capital, Llc Digital twin systems and methods for transportation systems

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1891979A (en) * 2005-04-18 2007-01-10 普拉德研究及开发股份有限公司 Shoulder bed effects removal
JP2021014727A (en) * 2019-07-12 2021-02-12 株式会社奥村組 Construction management method for shield excavation machine
CN111651825A (en) * 2020-06-08 2020-09-11 盾构及掘进技术国家重点实验室 Method for calculating curve bending degree of shield tunneling track
CN113486463A (en) * 2021-07-02 2021-10-08 中铁工程装备集团有限公司 Shield optimal autonomous tunneling control method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on decision-making methods for shield attitude adjustment based on machine learning; 郭正刚 (Guo Zhenggang); Doctoral dissertation, Dalian University of Technology; full text *
Research on shield advance attitude control strategy; 任颖莹 (Ren Yingying) et al.; Tunnel Construction (No. 6); pp. 1038-1044 *

Also Published As

Publication number Publication date
CN114019795A (en) 2022-02-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant