CN114019795B - Intelligent decision method for shield tunneling correction based on reinforcement learning - Google Patents

Intelligent decision method for shield tunneling correction based on reinforcement learning

Info

Publication number
CN114019795B
Authority
CN
China
Prior art keywords
shield
decision
deviation
model
deviation correcting
Prior art date
Legal status
Active
Application number
CN202111203818.XA
Other languages
Chinese (zh)
Other versions
CN114019795A (en)
Inventor
庄元顺
苏叶茂
牟松
徐进
刘绥美
李开富
张炬
朱菁
梅元元
张中华
陈可
刘洋
梁博
李才洪
杨冰
胡可
陈鑫
李明扬
Current Assignee
Southwest Jiaotong University
China Railway Engineering Service Co Ltd
China Railway Hi Tech Industry Corp Ltd
Original Assignee
Southwest Jiaotong University
China Railway Engineering Service Co Ltd
China Railway Hi Tech Industry Corp Ltd
Priority date
Filing date
Publication date
Application filed by Southwest Jiaotong University, China Railway Engineering Service Co Ltd, China Railway Hi Tech Industry Corp Ltd filed Critical Southwest Jiaotong University
Priority to CN202111203818.XA priority Critical patent/CN114019795B/en
Publication of CN114019795A publication Critical patent/CN114019795A/en
Application granted granted Critical
Publication of CN114019795B publication Critical patent/CN114019795B/en


Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B13/04: Adaptive control systems, electric, involving the use of models or simulators
    • G05B13/042: Adaptive control systems, electric, involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • E: FIXED CONSTRUCTIONS
    • E21: EARTH OR ROCK DRILLING; MINING
    • E21D: SHAFTS; TUNNELS; GALLERIES; LARGE UNDERGROUND CHAMBERS
    • E21D9/00: Tunnels or galleries, with or without linings; Methods or apparatus for making thereof; Layout of tunnels or galleries
    • E21D9/06: Making by using a driving shield, i.e. advanced by pushing means bearing against the already placed lining
    • E21D9/093: Control of the driving shield, e.g. of the hydraulic advancing cylinders


Abstract

The application belongs to the technical field of shield construction, and particularly relates to an intelligent decision method for shield tunneling deviation correction based on reinforcement learning. An environment state set, an action set and a reward function are designed, and a shield deviation correcting simulation environment is constructed; a shield deviation correcting decision model is constructed; a model evaluation method is constructed to obtain the shield deviation correcting decision model with the highest reward score after interaction with the simulation environment; the parameters of the value function network structure are determined by grid search; according to the grid search result, the determined shield deviation correcting decision model is trained for multiple rounds in the simulation environment; finally, the state data of the shield deviation correcting decision model are input into the final model, which directly outputs the value of each executable action as the decision scheme. The deviation correcting decision scheme provided by the application avoids the need for the shield driver to correct deviation by personal judgment according to site conditions and the serpentine correction caused by manual operation.

Description

Intelligent decision method for shield tunneling correction based on reinforcement learning
Technical Field
The application belongs to the technical field of shield construction, and particularly relates to an intelligent decision method for shield tunneling deviation correction based on reinforcement learning.
Background
Tunnel construction is an important component of underground space development, and shield machines are very widely used to complete tunnel boring tasks. In a shield construction project, the shield attitude is a key factor in the shield operator's decisions on the advance scheme: when the shield machine deviates from the design axis, the relevant tunneling parameters need to be adjusted in time so that the machine gradually returns to the axis. The attitude of the shield machine is closely related to ground surface settlement, segment assembly and so on, and directly affects the quality and alignment of the finished tunnel. Therefore, control of the advance attitude of the shield machine is a key problem in the quality management of shield construction projects. Existing shield attitude correction technology can be roughly divided into the following categories:
(1) Based on the three-point method, and combined with devices such as a total station, prisms and inclinometers, the calculation of shield coordinates and attitude deviation is improved, providing the necessary basic support for deviation correction control decisions on the shield attitude.
(2) Through descriptive statistics and regression analysis of construction history data, the influence of tunneling parameters such as cylinder stroke and earth chamber pressure on the shield attitude is analyzed and the correlation between tunneling parameters and shield attitude is explored; the relevant tunneling parameters are then adjusted in reverse according to these rules to control the shield direction, providing a theoretical basis for the deviation correction operating decisions of the shield operator.
(3) Feature selection is applied to the tunneling parameters by methods such as recursive feature elimination and random forests, and the attitude deviation, attitude angle and so on of the shield are predicted by methods such as XGBoost and neural networks. Because the shield attitude is the main basis for shield operating decisions, the attitude predicted from the construction parameters can serve as a reference for the next parameter adjustment decision, so that parameters can be adjusted in advance to control the tunneling direction.
(4) Fuzzy mathematics and PID control are adopted: a deviation correction curve model of the shield is established based on a kinematic model, and a PID control system with outer-loop planning and precise inner-loop control is built to realize deviation correction control of the shield attitude.
Accurate measurement and calculation of the shield attitude, the influence of related parameters on the attitude, and attitude prediction research all provide basic decision support for shield operators, but they cannot provide a direct decision scheme, and every shield operation still has to be controlled manually. Manual correction easily causes serpentine correction because the correction timing and correction amount are hard to control, and precise control of the shield attitude depends heavily on the project experience of the operators.
Disclosure of Invention
The application discloses an intelligent decision method for shield tunneling deviation correction based on reinforcement learning, which aims to solve the problems identified in the background art: precise control of the shield attitude depends heavily on the project experience of operators, and purely manual correction easily causes serpentine correction because the correction timing and amount are hard to control.
In order to solve the technical problems, the application adopts the following technical scheme:
a shield tunneling deviation correction intelligent decision method based on reinforcement learning comprises the following steps:
step 1: combining the tunneling deviation correcting strategy process and technical experience of the shield project site, designing an environment state set, an action set and a reward function, and constructing a shield deviation correcting simulation environment based on a reinforcement learning frame;
step 2: constructing a shield deviation correcting decision model interacted with a shield deviation correcting simulation environment;
step 3: constructing a model evaluation method to obtain a shield deviation rectification decision model with the highest rewarding score after the shield deviation rectification decision model interacts with a shield deviation rectification simulation environment;
step 4: determining the parameters of the value function network structure of the shield deviation correcting decision model by a grid searching method;
step 5: according to the grid search result, carrying out multi-round training in a simulation environment on the shield deviation correcting decision model whose value function network structure has been determined; because the value function network is a component of the shield deviation correcting decision model, determining the value function network structure from the grid search result also determines the structure of the decision model;
step 6: inputting the state data of the shield deviation correcting decision model into the final model, which directly outputs the value of each executable action as the decision scheme. The final model is the training model saved after the shield deviation correcting decision model determined in step 5 has been trained for multiple rounds in the simulation environment.
The method can provide a reference decision scheme for the shield driver according to the adjustment strategy of the key decision parameters obtained from the environmental state set, helps relieve the driver's decision uncertainty and decision burden, reduces the driver's working intensity, and effectively ensures that the driver has enough energy to produce a timely and efficient response when facing an emergency, thereby improving project safety and ensuring steady project progress; the deviation correcting decision scheme provided by the application avoids the need for the shield driver to correct deviation by personal judgment according to site conditions and the serpentine correction caused by manual operation.
Preferably, the environmental state set consists of the key shield attitude parameters measured by the shield measurement system.
Preferably, the key attitude parameters of the shield include the notch ring horizontal deviation, notch ring vertical deviation, shield tail horizontal deviation, shield tail vertical deviation, roll angle, pitch angle, horizontal yaw angle and vertical yaw angle.
Preferably, the action set is designed according to the shield deviation correcting principle.
Preferably, the reward function is designed based on the deviation correcting direction and the deviation correcting speed of the shield and the deviation between the shield machine and the design curve;
R(s,a,s′)=r_d+r_v+r_y
where r_d denotes the reward for the shield deviation correcting direction, r_v the reward for the deviation correcting speed, and r_y the axis deviation reward of the shield machine;
the reward for the deviation correcting direction is given by:
r_d = 1, if (Δy_{t-1} < 0 and d_rt > 0) or (Δy_{t-1} > 0 and d_rt < 0) or (Δy_{t-1} = 0 and d_rt = 0); r_d = -1, otherwise
where Δy_{t-1} is the deviation between the shield machine and the design axis at the previous moment t-1 and d_rt = Δy_t - Δy_{t-1} is the change in that deviation;
the reward for the deviation correcting speed is given by:
r_v = 1 if 0 < |d_rt| < 1 mm; r_v = 0 if 1 mm ≤ |d_rt| < 3 mm; r_v = -1 if 3 mm ≤ |d_rt| < 5 mm; r_v = -10 if 5 mm ≤ |d_rt| < 10 mm; r_v = -100 if |d_rt| ≥ 10 mm
where |d_rt| is the absolute value of the change in the axis deviation of the shield machine;
the reward for the axis deviation is given by:
r_y = 0 if 0 ≤ |Δy_t| < 3 mm; r_y = -1 if 3 mm ≤ |Δy_t| < 5 mm; r_y = -10 if 5 mm ≤ |Δy_t| < 10 mm; r_y = -100 if |Δy_t| ≥ 10 mm
where |Δy_t| is the distance by which the shield machine deviates from the design axis.
Further, in order to improve sample utilization and reduce sample correlation, the step 2 further comprises constructing an experience pool with a queue structure, used to store the interaction history data obtained when the shield deviation correcting decision model interacts with the shield deviation correcting simulation environment; sample data drawn from the experience pool are used to train the value function network.
Furthermore, because both the interaction with the shield deviation correcting simulation environment and the model update come from the same shield deviation correcting decision model, the data generated by interaction affect the iteration. To reduce the instability of model updates, the shield deviation correcting decision model comprises two convolutional neural networks with identical structures, forming a double-network mechanism: one serves as the online network, which selects the decision action with the highest value and interacts with the simulation environment to obtain samples; the other serves as the target network, used to calculate the value of the decision executed by the online network. At each training step the parameters of the online network are updated according to the training formula below, and the parameters of the target network are replaced by the parameters of the online network after a number of iterations;
the training formula is as follows:
Q(s,a;θ) ← Q(s,a;θ) + α[r + γQ(s′, argmax_{a′} Q(s′,a′;θ); θ⁻) - Q(s,a;θ)]
wherein: s is the environment state of the shield deviation correcting decision model, a is the decision action of the model in that state, s′ is the next environment state reached after the decision action is executed, r is the feedback obtained from the environment after the action is executed, γ is the discount factor, Q(s,a;θ) is the value function network, θ denotes the parameters of the online network, θ⁻ denotes the parameters of the target network, and α denotes the degree of each parameter update.
Preferably, the step 4 includes the steps of:
step 4.1: sorting out the structural parameters of the value function network that need to be determined;
step 4.2: determining a candidate value list of the parameters to be determined;
step 4.3: combining the candidate values of the undetermined parameters to form a plurality of candidate shield deviation correcting decision models;
step 4.4: training and testing the candidate shield deviation correcting decision models in the same round number in a shield deviation correcting simulation environment, and calculating test average scores of the candidate shield deviation correcting decision models according to the model evaluation method established in the step 3;
step 4.5: and determining the value of the parameter corresponding to the shield deviation correcting decision model with the highest test average score as the network structure parameter of the value function.
Preferably, the step 5 includes the steps of:
step 5.1: initializing parameters of a shield deviation correcting decision model; initializing experience pool capacity, parameter epsilon, sampling batch batch_size, online value function network parameters and initializing parameters of a target network by using the online network parameters;
step 5.2: initializing a simulation environment to obtain an initial state and preprocessing;
step 5.3: if epsilon is larger than 0.5, randomly selecting actions with epsilon probability, otherwise, selecting the action with the largest value with 1-epsilon probability;
step 5.4: the shield deviation correcting decision model executes the action with the highest value selected in step 5.3, obtains the new state, reward and termination signal returned by the shield deviation correcting simulation environment, and stores the previous state, the executed action, the reward, the new state and the termination signal into the experience pool as experience data;
step 5.5: if the new state is a termination state or negative rewards are obtained in the continuous prescribed times, the training of the current round is terminated, and the step 5.2 is executed;
step 5.6: when the number of the samples in the experience pool is larger than that of the samples in the training batch, randomly sampling the batch_size samples from the experience pool, and updating an online value function network by using a gradient descent method;
step 5.7: updating the target network by using the parameters of the online value function network every preset fixed iteration times;
step 5.8: and 5.2 to 5.7 are repeated until the set training round number is reached, and the trained shield deviation correcting decision model is stored.
Preferably, the step 5.2 includes the steps of:
step 5.21: color processing of the state image: converting the color space of the image so that it becomes a grayscale image; the state image is a screenshot taken while the shield deviation correcting decision model interacts with the visual simulation environment, i.e. an image reflecting the relative position of the shield machine and the design axis;
step 5.22: setting the shield deviation correcting decision model to make a decision once every 4 image frames;
step 5.23: using a frame stacking method, 3 image frames are stacked together as the input to the network.
In summary, due to the adoption of the technical scheme, the beneficial effects of the application are as follows:
1. The application integrates data analysis, machine learning, artificial intelligence and related technologies into the construction process of the industry, so that problems in the current construction process can be found in time, and the risks that may occur during construction and the consequences of the current decision scheme can be pre-judged in advance, thereby avoiding risks and making better decisions.
2. The intelligent decision of shield deviation correction is realized by constructing the reinforcement learning model, the intelligent decision management trend of the building construction industry including shield projects is met, the informatization and data development of the industry are promoted, and the application field of reinforcement learning is expanded.
3. The adjustment strategy for the key decision parameters obtained from the environmental state can provide a reference decision scheme for the shield driver, helps relieve the driver's decision uncertainty and decision burden, reduces the driver's working intensity, and effectively ensures that the driver has enough energy to produce a timely and efficient response when facing an emergency, thereby improving project safety and ensuring steady project progress. The deviation correcting decision scheme provided by the application avoids the need for the shield driver to correct deviation by personal judgment according to site conditions and the serpentine correction caused by manual operation.
4. Deep reinforcement learning is introduced into the exploration of intelligent decision-making for shield projects, and the designed model scheme can also be applied to research on the intelligent management of related large equipment, enriching the applied research of reinforcement learning in the field of large-scale engineering.
Drawings
The application will now be described by way of example and with reference to the accompanying drawings in which:
FIG. 1 is a flow chart of the intelligent decision-making method for shield tunneling deviation correction.
FIG. 2 is an example of a visual simulation environment for shield deviation rectification decision making in the present application.
FIG. 3 is an overall framework of the shield deviation correcting decision model of the present application.
FIG. 4 is a graph showing a comparative multi-model evaluation method according to the present application.
FIG. 5 shows the decision result of the shield deviation correcting decision model of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.
Embodiments of the present application are described in detail below with reference to fig. 1 and 5; the intelligent decision model of shield deviation correction mentioned in fig. 1 is simply called a shield deviation correction decision model.
A shield tunneling deviation correction intelligent decision method based on reinforcement learning comprises the following steps:
step 1: constructing a shield deviation correction simulation environment based on a reinforcement learning frame;
(1) Design of environmental state sets
In the shield construction process, the direct decision maker who determines how to tunnel and how to adjust the parameters is the shield driver. From the distances of the shield notch and the shield tail center from the design axis in the horizontal and vertical directions, together with the attitude angles and the tunneling trend, the shield driver can judge whether the shield machine is tilting up or nose-diving, whether it deviates from the axis, whether the degree of deviation is acceptable, whether correction is needed, and so on. The state data in the environmental state set comprise the key attitude parameters measured by the shield measurement system. The state space of the simulation environment is therefore defined as S = S1 × S2 × S3 × S4 × S5 × S6 × S7 × S8, where Si (i = 1, 2, ..., 8) denotes each key attitude parameter, as shown in Table 1 below.
Table 1 key attitude parameters for measurement and calculation by measurement system
S: parameter (unit): description
S1: notch ring horizontal deviation (mm): horizontal distance between the shield machine head (cutterhead center) and the axis
S2: notch ring vertical deviation (mm): vertical distance between the shield machine head (cutterhead center) and the axis
S3: shield tail horizontal deviation (mm): horizontal distance between the tail of the shield machine and the axis
S4: shield tail vertical deviation (mm): vertical distance between the tail of the shield machine and the axis
S5: roll angle (degrees): angle by which the shield machine rotates about its body axis
S6: pitch angle (degrees): angle between the axial direction of the shield machine and the horizontal plane
S7: horizontal yaw angle (mm/m): horizontal deviation produced per 1 m of tunneling if tunneling continues in the current horizontal direction
S8: vertical yaw angle (mm/m): vertical deviation produced per 1 m of tunneling if tunneling continues in the current vertical direction
In actual operation on a project site, the shield tunneling decisions of the decision maker are mainly based on the system diagram displayed on a monitoring screen. Because the attitude parameters in the state set are obtained from individual measurements and calculations by the measurement system, real-time support from the measurement system is not available when constructing the simulation environment; therefore a picture reflecting the shield tunneling situation is used as the state, i.e. the shield tunneling state displayed on the monitoring screen in real time is taken as the state reflecting the shield tunneling.
(2) Action set design
To control the shield attitude, the track deviation of the shield machine is generally corrected by changing the advancing and rotating direction of the cutterhead. The jacks behind the shield cutterhead are generally divided into groups A, B, C and D, representing the right, lower, left and upper areas respectively. The thrust stroke of the jacks is changed mainly by adjusting the thrust pressures of the left and right groups (group C and group A), and the correction moment is adjusted through the stroke difference; the correction control principle in the vertical direction is the same as in the horizontal direction, mainly adjusting the thrust stroke difference of groups B and D. Therefore, taking the horizontal direction as an example, the action set is defined as A = A1 × A2 × A3, where A1 represents the thrust pressure change of group A, A2 represents the thrust pressure change of group C, and A3 represents the cutterhead rotational speed. To avoid too rapid a pressure change, the range of the thrust pressure change is set to [-5, 5]: a negative value means decreasing the thrust pressure, 0 means keeping the pressure unchanged, and a positive value means increasing the pressure. When a pressure-changing action is executed, the adjusted thrust pressure is also required to remain within the allowable threshold, as detailed in Table 2 below. The threshold range is the reasonable value range of the thrust pressure, which is the range defined in the technical specification parameters of the shield machine.
Table 2 basic actions of horizontal deviation correction for shield
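As a concrete illustration, the discrete action set A = A1 × A2 × A3 described above can be enumerated directly. The following is a minimal sketch, assuming integer pressure-change steps in [-5, 5], a small illustrative list of cutterhead speed levels (the specific values of Table 2 are not reproduced here), and clipping of the adjusted pressure to an assumed allowable threshold; all names and numeric bounds other than the [-5, 5] range are hypothetical.

```python
import itertools
import numpy as np

PRESSURE_CHANGES = range(-5, 6)          # change of thrust pressure for groups A and C: -5 ... +5
CUTTERHEAD_SPEEDS = (0.8, 1.0, 1.2)      # illustrative speed levels; the actual values come from Table 2

# A = A1 x A2 x A3: every combination of (group A change, group C change, cutterhead speed)
ACTION_SET = list(itertools.product(PRESSURE_CHANGES, PRESSURE_CHANGES, CUTTERHEAD_SPEEDS))

def apply_action(pressure_a, pressure_c, action, p_min=0.0, p_max=30.0):
    """Apply one action, keeping the adjusted pressures within an assumed threshold [p_min, p_max]."""
    dp_a, dp_c, speed = action
    new_a = float(np.clip(pressure_a + dp_a, p_min, p_max))
    new_c = float(np.clip(pressure_c + dp_c, p_min, p_max))
    return new_a, new_c, speed
```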
(3) Design of reward functions
The application designs the reward function from three aspects: the deviation correcting direction, the deviation correcting speed, and the deviation between the shield machine and the design axis; the formula is as follows:
R(s,a,s′)=r_d+r_v+r_y
wherein r_d represents rewards of shield deviation correcting directions, r_v represents rewards of deviation correcting speeds, and r_y represents axis deviation rewards of the shield machine;
deviation rectifying direction
Let Δy_t be the deviation between the shield and the design axis at time t, Δy_{t-1} the deviation at the previous moment t-1, and d_rt = Δy_t - Δy_{t-1} the change in deviation before and after correction. From the change in deviation and the sign of the deviation at the previous moment it can be judged whether the shield is tunneling toward the design axis, and corresponding positive or negative feedback is given as the direction reward r_d:
r_d = 1, if (Δy_{t-1} < 0 and d_rt > 0) or (Δy_{t-1} > 0 and d_rt < 0) or (Δy_{t-1} = 0 and d_rt = 0); r_d = -1, otherwise
for example: if the shield is positioned at the left side of the axis at the previous moment, namely delta y t-1 <0, d is the right side of the original position of the shield at the next moment rt >0, indicating that the shield realizes rightward deviation correction, and giving forward rewards 1; if the shield is positioned on the right side of the axis at the previous moment, namely delta y t-1 >0, d is the left side of the original position of the shield at the next moment rt <0, indicating that the shield achieves a left correction, and therefore also gives a positive prize of 1. Combining the two conditions, namely when the original position of the shield is not positioned on the axis, realizing the correction of the corresponding direction, the environment feeds back the forward rewards 1; conversely, when correction is needed and tunneling towards the correction direction is not performed, the environment feeds back negative rewards-1. If the shield is on the axis at the beginning, namely delta y t-1 When =0, the variation d of the front and rear time is rt When=0, the shield advances along the axis, and the environment feeds back a positive reward 1; variation d of the deviation between the front and rear time rt A value of 0 indicates that the environment will give a negative prize of-1 when the shield begins to deviate from the axis.
Deviation correcting speed
According to the adjustment principle of correcting gradually, without abrupt or violent correction, a larger negative feedback must be given to actions for which the change in deviation before and after correction is too large:
r_v = 1 if 0 < |d_rt| < 1 mm; r_v = 0 if 1 mm ≤ |d_rt| < 3 mm; r_v = -1 if 3 mm ≤ |d_rt| < 5 mm; r_v = -10 if 5 mm ≤ |d_rt| < 10 mm; r_v = -100 if |d_rt| ≥ 10 mm
where |d_rt| is the absolute value of the change in the shield-axis deviation. The application segments the correction amount: when the correction amount is greater than 0 mm and less than 1 mm, it conforms to the principle of slow correction and a positive reward of 1 is given; when it is greater than or equal to 1 mm and less than 3 mm, the current correction amount is considered acceptable and no reward is given; when it is greater than or equal to 3 mm and less than 5 mm, the correction amount is considered large and a negative reward of -1 is given; when it is greater than or equal to 5 mm and less than 10 mm, the correction amount exceeds the acceptable range and may cause serious accidents, so a larger negative reward of -10 is given; and when it reaches 10 mm or more, the correction speed seriously exceeds the regulations and a negative reward of -100 is given.
Deviation size
In order to strictly control the tunneling of the shield machine so that it follows the design axis, besides ensuring the correct correction direction and correction speed, the deviation from the design axis should not be allowed to become too large:
r_y = 0 if 0 ≤ |Δy_t| < 3 mm; r_y = -1 if 3 mm ≤ |Δy_t| < 5 mm; r_y = -10 if 5 mm ≤ |Δy_t| < 10 mm; r_y = -100 if |Δy_t| ≥ 10 mm
where |Δy_t| is the distance by which the shield machine deviates from the design axis. The acceptance criterion generally requires the deviation of the finished tunnel to be within 5 mm, so besides the correction direction and correction speed, the deviation itself is also controlled during the correction process: a deviation greater than or equal to 0 mm and less than 3 mm is treated as acceptable and receives no negative reward; when it is greater than or equal to 3 mm and less than 5 mm, a negative reward of -1 is given; when it is greater than or equal to 5 mm and less than 10 mm, a negative reward of -10 is given; and when it is greater than or equal to 10 mm, a negative reward of -100 is given.
(4) Visual shield deviation rectifying simulation environment
The open-source toolkit Gym is used as the basis for building the environment, and the designed simulation environment is visualised, as shown in FIG. 2. Corresponding to the shield tunneling process, the black rectangle in the figure represents the shield deviation correcting decision model (the agent); the areas on both sides of the black rectangle are the rock and soil environment around the design axis, and the line on which the black rectangle travels is the design axis. The whole image reflects the advancing direction of the agent during the interaction, its deviation from the design axis, and so on.
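A custom Gym environment with this structure could be organised as in the following sketch. It assumes the older Gym API (reset() returning an observation and step() returning a 4-tuple), a discrete action index over the A1 × A2 × A3 combinations, and a rendered monitoring-style image as the observation; the internal kinematics update is only a placeholder, since the patent does not disclose it, and the reward reuses the total_reward helper from the sketch above.

```python
import itertools
import numpy as np
import gym
from gym import spaces

class ShieldCorrectionEnv(gym.Env):
    """Minimal sketch of a shield deviation correcting environment (illustrative, not the patented implementation)."""

    def __init__(self, img_size=(96, 96)):
        super().__init__()
        # A1/A2: thrust pressure change of groups A and C in [-5, 5]; A3: cutterhead speed level (illustrative)
        self.actions = list(itertools.product(range(-5, 6), range(-5, 6), (0, 1)))
        self.action_space = spaces.Discrete(len(self.actions))
        self.observation_space = spaces.Box(0, 255, shape=(*img_size, 1), dtype=np.uint8)
        self.img_size = img_size
        self.dy = 0.0

    def reset(self):
        self.dy = 0.0                      # deviation from the design axis (mm)
        return self._render_state()

    def step(self, action_idx):
        dp_a, dp_c, _speed = self.actions[action_idx]
        dy_prev = self.dy
        # placeholder kinematics: a pressure difference between groups A and C shifts the shield laterally
        self.dy += 0.1 * (dp_a - dp_c)
        reward = total_reward(dy_prev, self.dy)    # reward terms from the previous sketch
        done = abs(self.dy) >= 10.0
        return self._render_state(), reward, done, {}

    def _render_state(self):
        # stand-in for the monitoring-screen screenshot that serves as the state image
        img = np.zeros((*self.img_size, 1), dtype=np.uint8)
        col = int(np.clip(self.img_size[1] / 2 + self.dy, 0, self.img_size[1] - 1))
        img[:, col, 0] = 255
        return img
```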
Step 2: and constructing a shield deviation correcting decision model.
The whole frame of the constructed shield deviation correcting decision model is shown in figure 3 based on the reinforcement learning algorithm.
The shield deviation correcting decision model is an intelligent agent interacting with the shield deviation correcting simulation environment (hereinafter, the shield deviation correcting decision model is called as the intelligent agent); the goal is to maximize the overall rewards obtained from the environment. The input of the intelligent agent is the environmental state, and the output is the action with the highest value under the corresponding state, namely decision. In the intelligent agent, a convolutional neural network is adopted as a value function network to evaluate the value of the decision.
The agent generates one piece of experience data in each interaction with the environment, including the reward r from the environment; in one round of interaction the agent interacts with the environment n times, and the goal is that, through training, the agent's overall reward for one round of interaction with the environment is maximised.
Considering that rewards for actions performed at future times generally have less impact on the current time, a discount factor γ is introduced into the calculation of the overall reward: R_t = Σ_{k=0} γ^k r_{t+k+1}
where t is the current time, k counts the steps after t, γ is the discount factor, and R_t is the overall (discounted) reward from time t onward.
The training data of the agent come from the experience data obtained by interaction between the agent and the shield deviation correcting simulation environment. In order to improve the utilization of experience data and reduce the correlation between samples, an experience pool is built using a queue mechanism: a certain amount of experience data is stored in the pool, and data are then sampled from it in batches for training the value function network; the data sampled in a batch are referred to here as sample data.
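A queue-based experience pool of this kind is commonly implemented with a fixed-length deque; the following is a minimal sketch under that assumption (the class and method names are illustrative).

```python
import random
from collections import deque

class ExperiencePool:
    """FIFO experience pool: old interactions are discarded once the queue is full."""

    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # random sampling breaks the temporal correlation between consecutive interactions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```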
In addition, if both the interaction with the shield deviation correcting simulation environment and the update come from the same model, the data generated by interaction will affect the iteration. To reduce the instability of the agent update, two convolutional neural networks with completely identical structures are used to form a double-network mechanism: one serves as the online network, which selects the decision action with the highest value and interacts with the simulation environment to obtain experience data; the other serves as the target network, used to calculate the value of the decision executed by the online network. At each training step the parameters of the online network are updated according to the following formula, and the parameters of the target network are replaced with the parameters of the online network after a certain number of iterations. The formula is as follows:
Q(s,a;θ) ← Q(s,a;θ) + α[r + γQ(s′, argmax_{a′} Q(s′,a′;θ); θ⁻) - Q(s,a;θ)]
wherein: s is the environment state of the shield deviation correcting decision model, a is the decision action of the model in that state, s′ is the next environment state reached after the decision action is executed, r is the feedback obtained from the environment after the action is executed, γ is the discount factor, Q(s,a;θ) is the value function network, θ denotes the parameters of the online network, θ⁻ denotes the parameters of the target network, and α denotes the degree of each parameter update.
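The double-network update above corresponds to the Double DQN target: the online network chooses the next action, while the target network evaluates it. A small NumPy sketch of the target computation for a sampled batch is shown below (the array names are illustrative; the actual gradient step would then regress Q(s,a;θ) toward these targets).

```python
import numpy as np

def ddqn_targets(q_online_next, q_target_next, rewards, dones, gamma=0.95):
    """Compute r + γ * Q_target(s', argmax_a' Q_online(s', a')) for a batch.

    q_online_next, q_target_next: arrays of shape (batch, n_actions) holding the
    online and target network outputs for the next states s'.
    rewards, dones: arrays of shape (batch,).
    """
    best_actions = np.argmax(q_online_next, axis=1)                       # action selection by the online network
    next_values = q_target_next[np.arange(len(rewards)), best_actions]    # evaluation by the target network
    return rewards + gamma * next_values * (1.0 - dones)                  # no bootstrap on terminal transitions
```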
Step 3: evaluation method for design intelligent agent
The trained and saved agent is made to interact with the shield deviation correcting simulation environment for 100 rounds, and the average reward score obtained by the agent over the 100 test rounds is counted; the higher the reward score, the better the strategy learned by the model.
If several agents need to be compared, each agent interacts with the shield deviation correcting simulation environment for 100 rounds, the overall reward obtained by each agent in each round is counted, and the agent strategy with the highest average round score is the best.
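Such an evaluation amounts to averaging the episode return over a fixed number of test rounds. A minimal sketch is given below, assuming a greedy policy over the value network's outputs and the older Gym step API; the function names are illustrative.

```python
import numpy as np

def evaluate(env, q_values_fn, episodes=100):
    """Average total reward of a greedy policy over `episodes` test rounds.

    q_values_fn(state) should return a vector of action values for one state,
    e.g. a wrapper around the trained value function network.
    """
    scores = []
    for _ in range(episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            action = int(np.argmax(q_values_fn(state)))   # greedy action, no exploration during testing
            state, reward, done, _ = env.step(action)
            total += reward
        scores.append(total)
    return float(np.mean(scores))
```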
Step 4: network structure for determining value function
The value function network mainly extracts information from the current state image of the shield in order to calculate values; considering the size of the image and the computation of the convolution layers, the neural network does not need to be very deep. In this embodiment the model is designed with two convolution layer / max-pooling layer pairs to extract features from the input state; the extracted feature data are flattened by a flatten layer, and the value of each action is obtained through two fully connected layers.
Step 4.1: sorting out the structural parameters of the value function network that need to be determined; based on the preliminarily designed network structure, these parameters comprise the size and number of the convolution kernels of the two convolution layers and the number of units of the fully connected layer;
Step 4.2: determining a candidate value list for the parameters to be determined. The candidate values are determined empirically from common convolutional neural network parameter settings, as shown in Table 3 below:
Table 3 Value function network parameters to be determined
Step 4.3: combining the candidate parameter values to form 54 candidate agents;
Step 4.4: training each of the 54 candidate agents for 600 rounds in the shield deviation correcting simulation environment, performing 100 rounds of tests, and calculating the test average score of each candidate model according to the model evaluation method;
Step 4.5: based on the grid search results, the agent "3-32-4-64-64" achieved the highest score in the tests and ranked at the top in the average number of actions performed per round. The value function network structure of the final model is therefore:
the first convolution layer uses 32 convolution kernels of size 3×3 with a stride of 3 and a ReLU activation function;
the first pooling layer uses max pooling for dimensionality reduction, with a 2×2 filter;
the second convolution layer uses 64 convolution kernels of size 4×4 with the stride set to 1, and the activation function is again ReLU;
the second pooling layer is also a max-pooling layer, with the filter size set to 2×2;
the flattened data are fed into a fully connected layer of 64 neurons with ReLU activation;
the last fully connected layer is the output layer.
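Assembled in code, the selected "3-32-4-64-64" structure could look like the following sketch. The patent does not name a framework, so Keras is assumed here; the 96×96 input size and the use of 3 stacked grayscale frames as channels are likewise assumptions based on the preprocessing described later.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_value_network(n_actions, input_shape=(96, 96, 3)):
    """Sketch of the value function network: two conv/max-pool pairs, flatten, two dense layers."""
    model = keras.Sequential([
        layers.Conv2D(32, kernel_size=3, strides=3, activation="relu", input_shape=input_shape),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(64, kernel_size=4, strides=1, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_actions, activation="linear"),   # one value per action
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
    return model
```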
Step 5: and training an intelligent body. And according to the network search result, training the determined shield deviation correcting intelligent decision model in a simulation environment for a plurality of rounds.
Step 5.1: initializing agent parameters and related training parameter settings as shown in table 4:
table 4 initialization parameter settings
Parameter (value): meaning
episode (5000): number of rounds of interaction between the agent and the environment
batch_size (32): size of each sampled data batch
save_training_frequency (50): number of rounds between saves of the training model
update_target_model_frequency (5): number of rounds between updates of the target network
memory_size (50000): experience pool size
gamma (0.95): reward discount factor
epsilon (1.0): exploration rate
epsilon_decay (0.99999): decay rate of the exploration rate
epsilon_min (0.1): minimum value of the exploration rate
learning_rate (0.001): learning rate of the CNN network
Step 5.2: initializing a simulation environment, obtaining an initial state and processing, wherein the method comprises the following steps:
a. color processing of the state image, namely converting the color space of the image to change the image into a gray image;
b. frame skipping, namely setting an agent to make a decision every 4 frames of images;
c. the frame number stacking, because the superposition of multi-frame images can reflect the dynamic change of the agent, the application uses the method for superposing the frame number of the images, and the superposition of 3 frames of images is used as the input of the network;
step 5.3: if epsilon is larger than 0.5, randomly selecting actions with epsilon probability, otherwise, selecting the action with the largest value with 1-epsilon probability;
step 5.4: the intelligent agent executes the action selected in the step 5.3 to obtain a new state, rewards and termination signals returned by the environment, and stores the previous state, the executed action, the rewards, the new state and the termination signals into an experience pool as experience data;
step 5.5: if the new state is a termination state or negative rewards are obtained continuously for 10 times, the training of the current round is terminated, and the step 5.2 is executed in a return mode;
Step 5.6: when the amount of experience data in the experience pool is larger than one training batch, 32 pieces of experience data are randomly sampled from the pool and the online value function network is updated by gradient descent. Because the initial amount of data in the experience pool is 0, interaction data are stored into the pool one by one as the agent keeps interacting with the shield deviation correcting simulation environment; the training data of the agent are obtained by randomly sampling from the pool, the amount sampled for each training step being one batch, which in this embodiment is 32; once the amount of data stored in the experience pool exceeds 32, the agent can sample batches of data for training.
Step 5.7: updating the target network with parameters of the online value function network every 5 training rounds;
step 5.8: step 5.2 to step 5.7 are repeatedly executed, 550 rounds of training are performed, and the trained agent is saved.
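Putting steps 5.1 to 5.8 together, the training procedure might be organised as in the following sketch. It is a simplified outline under several assumptions: Keras models as produced by the earlier network sketch, the older Gym step API, grayscale conversion by simple channel averaging, the initialization values of Table 4, and a loose reading of the epsilon rule in step 5.3; none of the names below come from the patent itself.

```python
import random
from collections import deque
import numpy as np

def preprocess(frame):
    """Grayscale conversion of the monitoring-screen image (simple channel average, scaled to [0, 1])."""
    return frame.mean(axis=2, keepdims=True).astype(np.float32) / 255.0

def train(env, online, target, episodes=5000, batch_size=32, gamma=0.95,
          epsilon=1.0, epsilon_decay=0.99999, epsilon_min=0.1,
          update_target_every=5, save_every=50):
    memory = deque(maxlen=50000)                                  # experience pool (step 5.1)
    target.set_weights(online.get_weights())
    for episode in range(episodes):
        frames = deque([preprocess(env.reset())] * 3, maxlen=3)   # stack 3 frames (step 5.2)
        state = np.concatenate(frames, axis=-1)
        done, negatives = False, 0
        while not done and negatives < 10:
            # step 5.3: epsilon-greedy action selection
            if epsilon > 0.5 and random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(online.predict(state[None], verbose=0)[0]))
            # step 5.4: act with a 4-frame skip and store the transition
            reward = 0.0
            for _ in range(4):
                frame, r, done, _ = env.step(action)
                reward += r
                if done:
                    break
            frames.append(preprocess(frame))
            next_state = np.concatenate(frames, axis=-1)
            memory.append((state, action, reward, next_state, done))
            negatives = negatives + 1 if reward < 0 else 0        # step 5.5: stop after repeated negative rewards
            state = next_state
            epsilon = max(epsilon_min, epsilon * epsilon_decay)
            # step 5.6: sample a batch and update the online network with Double DQN targets
            if len(memory) > batch_size:
                batch = random.sample(memory, batch_size)
                s, a, r, s2, d = map(np.array, zip(*batch))
                q = online.predict(s, verbose=0)
                best = np.argmax(online.predict(s2, verbose=0), axis=1)
                q_next = target.predict(s2, verbose=0)
                q[np.arange(batch_size), a] = r + gamma * q_next[np.arange(batch_size), best] * (1 - d)
                online.train_on_batch(s, q)
        # step 5.7: periodically copy the online parameters into the target network
        if episode % update_target_every == 0:
            target.set_weights(online.get_weights())
        # step 5.8: periodically save the training model
        if episode % save_every == 0:
            online.save(f"shield_dqn_{episode}.h5")
```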
Step 6: and (5) intelligent decision of the shield deviation correcting model. And (3) carrying out decision test on the intelligent agent in the constructed shield deviation correction simulation environment, inputting a state image in which the intelligent agent is positioned into a model in the test process, and executing action value of the model decision output by the intelligent agent. A portion of the control trajectory of the agent model in the simulation environment is shown in fig. 5. According to the control decision result, the designed shield deviation rectifying intelligent decision model can basically move forward along the design axis at the parts of the eased curve and the straight curve, which indicates that the decision scheme of the model is effective and provides intelligent decision support for shield deviation rectifying decision of a project site.
The above examples merely illustrate specific embodiments of the application, which are described in more detail and are not to be construed as limiting the scope of the application. It should be noted that it is possible for a person skilled in the art to make several variants and modifications without departing from the technical idea of the application, which fall within the scope of protection of the application.

Claims (8)

1. The intelligent decision-making method for shield tunneling deviation correction based on reinforcement learning is characterized by comprising the following steps:
step 1: designing an environment state set, an action set and a reward function, and constructing a shield deviation correction simulation environment based on a reinforcement learning frame;
the environment state set is a shield key attitude parameter measured by a shield measurement system;
the action set is designed according to the shield deviation correcting principle;
the reward function is designed based on deviation correcting direction, deviation correcting speed and deviation between the shield machine and a design curve;
step 2: constructing a shield deviation correcting decision model interacted with a shield deviation correcting simulation environment;
step 3: constructing a model evaluation method to obtain a shield deviation rectification decision model with the highest rewarding score after the shield deviation rectification decision model interacts with a shield deviation rectification simulation environment; the shield deviation correcting decision model comprises a double-network mechanism formed by two convolution neural networks with identical structures;
step 4: determining the parameters of the value function network structure of the shield deviation correcting decision model by a grid searching method;
step 5: according to the grid search result, carrying out multi-pass training on the shield deviation correction decision model of the determined value function network structure in a simulation environment;
step 6: and (3) inputting the state data of the shield deviation correcting decision model into a final model, and directly outputting the value of the execution action by the model to be used as a decision scheme.
2. The intelligent decision-making method for shield tunneling correction based on reinforcement learning according to claim 1, wherein the key attitude parameters of the shield include the notch ring horizontal deviation, notch ring vertical deviation, shield tail horizontal deviation, shield tail vertical deviation, roll angle, pitch angle, horizontal yaw angle and vertical yaw angle.
3. The intelligent decision-making method for shield tunneling correction based on reinforcement learning according to claim 1, wherein the reward function is represented by the following formula;
R(s,a,s′)=r_d+r_v+r_y
where r_d denotes the reward for the shield deviation correcting direction, r_v the reward for the deviation correcting speed, and r_y the axis deviation reward of the shield machine;
the reward for the deviation correcting direction is given by:
r_d = 1, if (Δy_{t-1} < 0 and d_rt > 0) or (Δy_{t-1} > 0 and d_rt < 0) or (Δy_{t-1} = 0 and d_rt = 0); r_d = -1, otherwise
where Δy_{t-1} is the deviation between the shield machine and the design axis at the previous moment t-1 and d_rt = Δy_t - Δy_{t-1} is the change in that deviation;
the reward for the deviation correcting speed is given by:
r_v = 1 if 0 < |d_rt| < 1 mm; r_v = 0 if 1 mm ≤ |d_rt| < 3 mm; r_v = -1 if 3 mm ≤ |d_rt| < 5 mm; r_v = -10 if 5 mm ≤ |d_rt| < 10 mm; r_v = -100 if |d_rt| ≥ 10 mm
where |d_rt| is the absolute value of the change in the axis deviation of the shield machine;
the reward for the axis deviation is given by:
r_y = 0 if 0 ≤ |Δy_t| < 3 mm; r_y = -1 if 3 mm ≤ |Δy_t| < 5 mm; r_y = -10 if 5 mm ≤ |Δy_t| < 10 mm; r_y = -100 if |Δy_t| ≥ 10 mm
where |Δy_t| is the distance by which the shield machine deviates from the design axis.
4. The intelligent decision-making method for shield tunneling correction based on reinforcement learning according to claim 1, wherein the step 2 further comprises constructing an experience pool by adopting a queue structure for storing training data obtained by interaction of a shield correction decision-making model and a shield correction simulation environment; the training data in the experience pool is used for training of the value function network.
5. The intelligent decision-making method for shield tunneling correction based on reinforcement learning according to claim 1, wherein the shield deviation correcting decision model comprises a double-network mechanism formed by two convolutional neural networks with identical structures; one serves as an online network, which selects the decision action with the highest value and interacts with the shield deviation correcting simulation environment to obtain samples; the other serves as a target network, used to calculate the value of the decision executed by the online network; at each training step the parameters of the online network are updated according to the following training formula, and the parameters of the target network are replaced by the parameters of the online network after a number of iterations;
the training formula is as follows:
Q(s,a;θ) ← Q(s,a;θ) + α[r + γQ(s′, argmax_{a′} Q(s′,a′;θ); θ⁻) - Q(s,a;θ)]
wherein: s is the environment state of the shield deviation correcting decision model, a is the decision action of the model in that state, s′ is the next environment state reached after the decision action is executed, r is the feedback obtained from the environment after the action is executed, γ is the discount factor, Q(s,a;θ) is the value function network, θ denotes the parameters of the online network, θ⁻ denotes the parameters of the target network, and α denotes the degree of each parameter update.
6. The intelligent decision-making method for shield tunneling correction based on reinforcement learning according to claim 4, wherein the step 4 comprises the following steps:
step 4.1: sorting out the structural parameters of the value function network that need to be determined;
step 4.2: determining a candidate value list of the parameters to be determined;
step 4.3: combining the candidate values of the undetermined parameters to form a plurality of candidate shield deviation correcting decision models;
step 4.4: training and testing the candidate shield deviation correcting decision models in the same round number in a shield deviation correcting simulation environment, and calculating test average scores of the candidate shield deviation correcting decision models according to the model evaluation method established in the step 3;
step 4.5: and determining the value of the parameter corresponding to the shield deviation correcting decision model with the highest test average score as the network structure parameter of the value function.
7. The intelligent decision-making method for shield tunneling correction based on reinforcement learning according to claim 6, wherein the step 5 comprises the following steps:
step 5.1: initializing model parameters; initializing experience pool capacity, parameter epsilon, sampling batch batch_size, online value function network parameters and initializing parameters of a target network by using the online network parameters;
step 5.2: initializing the shield deviation correcting simulation environment, obtaining an initial state and preprocessing it;
step 5.3: if epsilon is larger than 0.5, randomly selecting actions with epsilon probability, otherwise, selecting the action with the largest value with 1-epsilon probability;
step 5.4: the shield deviation correcting decision model executes the action with the highest value selected in step 5.3, obtains the new state, reward and termination signal returned by the shield deviation correcting simulation environment, and stores the previous state, the executed action, the reward, the new state and the termination signal into the experience pool as experience data;
step 5.5: if the new state is a termination state or negative rewards are obtained in the continuous prescribed times, the training of the current round is terminated, and the step 5.2 is executed;
step 5.6: when the number of the samples in the experience pool is larger than that of the samples in the training batch, randomly sampling the batch_size samples from the experience pool, and updating an online value function network by using a gradient descent method;
step 5.7: updating the target network by using the parameters of the online value function network every preset fixed iteration times;
step 5.8: and 5.2 to 5.7 are repeated until the set training round number is reached, and the trained shield deviation correcting decision model is stored.
8. The intelligent decision-making method for shield tunneling correction based on reinforcement learning according to claim 7, wherein the step 5.2 comprises the following steps:
step 5.21: color processing of the state image, converting the color space of the image, and changing the image into a gray image;
step 5.22: setting a shield deviation correcting decision model to make a decision every 4 frames of images;
step 5.23: using a frame stacking method, 3 image frames are stacked together as the input to the network.
CN202111203818.XA 2021-10-15 2021-10-15 Intelligent decision method for shield tunneling correction based on reinforcement learning Active CN114019795B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111203818.XA CN114019795B (en) 2021-10-15 2021-10-15 Intelligent decision method for shield tunneling correction based on reinforcement learning


Publications (2)

Publication Number Publication Date
CN114019795A CN114019795A (en) 2022-02-08
CN114019795B true CN114019795B (en) 2023-10-20

Family

ID=80056253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111203818.XA Active CN114019795B (en) 2021-10-15 2021-10-15 Intelligent decision method for shield tunneling correction based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN114019795B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114722697A (en) * 2022-03-09 2022-07-08 山东拓新电气有限公司 Method and device for determining control parameters of heading machine based on machine learning
CN114706322B (en) * 2022-03-30 2023-06-23 西南交通大学 Automatic control simulation system for posture of shield tunneling machine
CN115057006A (en) * 2022-06-15 2022-09-16 中国科学院软件研究所 Distillation strategy evaluation method, device and medium based on reinforcement learning


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210287459A1 (en) * 2018-09-30 2021-09-16 Strong Force Intellectual Capital, Llc Digital twin systems and methods for transportation systems

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1891979A (en) * 2005-04-18 2007-01-10 普拉德研究及开发股份有限公司 Shoulder bed effects removal
JP2021014727A (en) * 2019-07-12 2021-02-12 株式会社奥村組 Construction management method for shield excavation machine
CN111651825A (en) * 2020-06-08 2020-09-11 盾构及掘进技术国家重点实验室 Method for calculating curve bending degree of shield tunneling track
CN113486463A (en) * 2021-07-02 2021-10-08 中铁工程装备集团有限公司 Shield optimal autonomous tunneling control method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on decision-making methods for shield attitude adjustment based on machine learning; 郭正刚 (Guo Zhenggang); Doctoral dissertation, Dalian University of Technology; full text *
Research on shield advance attitude control strategy; 任颖莹 (Ren Yingying) et al.; Tunnel Construction (No. 6); pp. 1038-1044 *

Also Published As

Publication number Publication date
CN114019795A (en) 2022-02-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant