CN111445081A - Digital twin virtual-real self-adaptive iterative optimization method for dynamic scheduling of product operation - Google Patents

Digital twin virtual-real self-adaptive iterative optimization method for dynamic scheduling of product operation

Info

Publication number
CN111445081A
CN111445081A (application CN202010251710.7A)
Authority
CN
China
Prior art keywords
network
deep
digital twin
reinforcement learning
library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010251710.7A
Other languages
Chinese (zh)
Inventor
刘振宇
胡亮
裘辿
陈俊奇
谭建荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010251710.7A priority Critical patent/CN111445081A/en
Publication of CN111445081A publication Critical patent/CN111445081A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06312 Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Development Economics (AREA)
  • Operations Research (AREA)
  • Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Algebra (AREA)
  • Probability & Statistics with Applications (AREA)
  • Educational Administration (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a digital twin virtual-real self-adaptive iterative optimization method for dynamic scheduling of product operation. A timed S3PR net model is established for the product digital twin operation flow; a first sublayer structure for library-to-transition feature propagation is constructed; a second sublayer structure for transition-to-library feature propagation is constructed; a neural network is built to fit the mapping between the identification state of the timed S3PR net model and the scheduling return; the dynamic scheduling problem of the timed S3PR net is converted into a Markov decision model; the Markov decision model is solved with the deep Q-value network reinforcement learning method; and three experiments are designed to verify the advantages of the proposed method. By using deep reinforcement learning, the method outperforms traditional heuristic rule methods, heuristic search methods and reinforcement learning methods combined with a fully connected neural network in scheduling performance, computational efficiency and adaptability.

Description

Digital twin virtual-real self-adaptive iterative optimization method for dynamic scheduling of product operation
Technical Field
The invention relates to a method for processing virtual-real simulation data of equipment, and in particular to a digital twin virtual-real self-adaptive iterative optimization method for dynamic scheduling of product operation based on deep reinforcement learning, belonging to the field of system scheduling management.
Background
Scheduling optimization is a classic research topic in workflow control and is the key to achieving coordinated linkage of unit components and full utilization of resources in complex products. The full-life-cycle virtual-real consistency and iterative optimization capability required by the digital twin architecture of a complex product pose new problems for traditional scheduling optimization methods: first, how can a scheduling method be deployed smoothly from the virtual simulation prototype to the actual physical product, and how can consistency between the two be guaranteed during the product's service; second, how can the scheduling method use operation data collected through CPS and other information technologies to optimize itself iteratively, improving its intelligence and generating scheduling strategies that better fit the real service environment; in addition, the real-time requirements of complex product operation also place emphasis on the computational efficiency of the scheduling method.
At present, some research applies reinforcement learning or deep reinforcement learning to flow scheduling problems, but the practical results are not ideal, for two reasons. First, early deep reinforcement learning directly used a deep neural network to fit the value function on the traditional Q-learning framework and was easily disturbed by the correlation among experiences during training, making the learning process unstable; this defect was only effectively addressed by the two techniques of experience replay and a double value function when the Deep Q-Network (DQN) method was proposed, whereas existing reinforcement-learning-based flow scheduling methods are still based on the earlier value-function fitting mode. Second, even the deep reinforcement learning scheduling methods fit the state with only a multi-layer fully connected neural network (Multi-Layer Perceptron, MLP); such a fitting mode ignores the structural constraints among flow elements such as operations, resources and their timing, so the implicit information contained in the flow state is not fully exploited.
Disclosure of Invention
In order to solve the problems in the background art, the invention provides a DQN method combined with a graph convolutional network (GCN), applies it to digital twin operation flow scheduling with the three characteristics of resource sharing, route flexibility and task randomness, and realizes more intelligent and stable scheduling optimization.
By using deep reinforcement learning, the method outperforms traditional heuristic rule methods, heuristic search methods and reinforcement learning methods combined with a fully connected neural network in digital twin operation scheduling performance, computational efficiency and adaptability.
In order to achieve the above purpose, the method comprises the following specific steps:
S1. Establish a timed S3PR net model for the product digital twin operation flow;
the libraries represent the operations of the product digital twin operation flow, and the transitions represent the changeovers between these operations.
S2. Construct a first sublayer structure for library-to-transition feature propagation in the timed S3PR net model;
S3. Construct a second sublayer structure for transition-to-library feature propagation in the timed S3PR net model;
S4. Build a neural network from the first sublayer structure and the second sublayer structure, and use the neural network to fit the state of the timed S3PR net model;
S5. Convert the dynamic scheduling problem of the product digital twin operation flow based on the timed S3PR net model into a Markov decision model;
S6. Solve the Markov decision model established in S5 with the deep Q-value network (Deep Q-Network) reinforcement learning method, realizing virtual-real self-adaptive iterative optimization.
In step S2, a 2-unit shared convolution kernel is constructed to compute the weighted sum of the features of each transition's preceding operation library and the features of its preceding resource library, and this weighted sum serves as the first sublayer structure.
In the specific implementation, a dummy resource library whose marking is always greater than zero is added for each autonomous transition, so that all transitions in the timed S3PR net have the same preceding-node structure.
In step S3, the second sublayer structure is constructed according to a linear shift-invariant filter. In the specific implementation, a linear shift-invariant filter on the directed weighted graph is used:

$$\tilde{F} \;=\; H(A)\,F \;=\; \sum_{k=0}^{K} h_k A^{k} F$$

where F and F̃ are respectively the original signal and the filtered signal on the directed weighted graph, H is the filter, A is the adjacency matrix of the directed weighted graph, h_k are the filter parameters, and K is the order of the filter. The filter order K is limited to 1 so that the filtered features of a library depend only on the library's own original features and the original features of its preceding transitions.
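For the limiting case used here (K = 1), the filter reduces to the first-order form below; this expansion follows directly from the formula above and is the basis of the second sublayer structure.

$$\tilde{F} \;=\; h_0 A^{0} F + h_1 A^{1} F \;=\; h_0 F + h_1 A F$$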
In step S4, the output of the first sublayer structure is used as the input of the second sublayer structure, and the two sublayer structures are combined to form the convolution layer of the timed S3PR net model. The convolution layer takes the input matrix and the output matrix of the timed S3PR net as static parameters, so the number of trainable weights depends only on the dimensions of the input and output features and is independent of the scale of the timed S3PR net, which overcomes the weight-explosion problem in deep neural network construction.
In step S5, a quintuple (S, A, Φ, r, γ) is used to describe the interactive environment of timed S3PR net scheduling, where S represents the state space, A represents the action space, Φ represents the state transition rule, r represents the return function, and γ represents the discount coefficient.
In step S6, the steps of solving the Markov decision model established in step S5 with the deep Q-value network (Deep Q-Network) reinforcement learning method are as follows:
S61. Establish a value function Q(s_τ, a) to evaluate the value of selecting action a ∈ A in the current state s_τ ∈ S at time step τ, where Q denotes the value, s_τ denotes the state at time step τ, a denotes an action, S denotes the state space, and A denotes the action space;
S62. The optimal value function obeys the Bellman Equation, and the Bellman equation is used during training to learn the value function Q(s_τ, a) in an iterative manner;
S63. The deep Q-value network reinforcement learning method fits the value function Q(s_τ, a) with two deep neural networks of identical structure, and the weighted experience replay (Prioritized Experience Replay) method is used in training;
In the specific implementation, the scheduling Agent receives a large penalty when scheduling falls into deadlock. Weighted experience replay raises the replay probability of deadlock experiences in the early stage of training, so that the scheduling Agent learns a deadlock-avoidance strategy as quickly as possible; in the later stage of training the emphasis shifts to improving scheduling performance, and states with β ≠ 0 are replayed with higher probability so that the scheduling Agent becomes more sensitive to changes in task completion time, where β denotes the total cost of the tasks completed in the current scheduling step.
S64. A mask ζ is introduced into the deep Q-value network reinforcement learning method to avoid trial and error on invalid actions during training, thereby further improving the convergence speed.
Specifically, the Adam optimization algorithm is selected, the learning rate is set to 0.0001, and the target neural network in the DQN is updated every 1,000 steps; a performance evaluation is performed every 10,000 steps, in which 100 cycles are scheduled with the current neural network and the average cycle return and deadlock rate are recorded.
In the invention, a dummy resource library whose marking is always greater than zero is first added for each autonomous transition in the timed S3PR net, so that all transitions in the timed S3PR net have the same preceding-node structure, and on this basis the first sublayer structure for library-to-transition feature propagation is constructed. Then the second sublayer structure for transition-to-library feature propagation is constructed with the construction method of a linear shift-invariant filter on a directed weighted graph. The first (P2T) sublayer and the second sublayer are then combined into a convolution layer, overcoming the difficulties faced in generalizing from the classical convolution layer of convolutional neural networks on Euclidean-space data to a graph convolution layer on a graph.
Solving the dynamic scheduling optimization problem with deep reinforcement learning requires an operating environment for the digital twin operation flow. A quintuple (S, A, Φ, r, γ) is used to describe the digital twin interactive environment of timed S3PR net scheduling, where S represents the state space, A represents the action space, Φ represents the state transition rule, r represents the reward function, and γ represents the discount coefficient.
Next, a value function Q(s_τ, a) is defined to evaluate the value of selecting action a ∈ A in the current state s_τ ∈ S at time step τ, and the Bellman equation is used during training to learn Q(s_τ, a) in an iterative manner. Q(s_τ, a) is fitted with two deep neural networks of identical structure using the DQN method, and the weighted experience replay (Prioritized Experience Replay) method is used in training: the scheduling Agent receives a large penalty when scheduling falls into deadlock, and weighted experience replay raises the replay probability of deadlock experiences in the early stage of training so that the scheduling Agent learns a deadlock-avoidance strategy as quickly as possible; in the later stage of training the emphasis shifts to improving scheduling performance, and states with β ≠ 0 are replayed with higher probability so that the scheduling Agent becomes more sensitive to changes in task completion time. In addition, a mask ζ is introduced into the DQN process to avoid trial and error on invalid actions during training, thereby further improving the convergence speed.
Compared with the prior art, the invention has the following advantages:
the invention constructs a time S3And (3) a depth map convolutional neural network of a PR network model structure. By constructing two convolution sublayers to respectively calculate the characteristic propagation from the library place to the transition and from the transition to the library place, the time S is assigned3PR networks identify the mining of deep implicit information in states. Compared with a fully connected neural network, time S3The graph convolution neural network of the PR network has less trainable weight, higher robustness and better convergence
The invention, combining the graph convolutional neural network, proposes a deep reinforcement learning optimization method for timed S3PR net dynamic scheduling. The dynamic scheduling problem of the work flow's timed S3PR net model is converted into a Markov decision model, the state, action and return in digital twin scheduling are formally defined, and the Markov decision model is then solved by combining the graph convolutional network with the deep Q-value network learning method, realizing the optimization of digital twin scheduling performance.
Drawings
FIG. 1 is a schematic diagram of the operation of the process of the present invention.
FIG. 2 is a schematic diagram of the first sublayer structural feature propagation calculation process in an embodiment of the present invention.
FIG. 3 is a schematic diagram of a convolutional layer structure in an embodiment of the present invention.
FIG. 4 is a schematic diagram of the partial structure and the operation flow of the chemiluminescence immunoassay analyzer for model verification in the embodiment of the invention.
FIG. 5 is a diagram of results of dynamic scheduling of a chemiluminescence immunoassay analyzer workflow according to an example model of the present invention.
Detailed Description
The present invention is further described below with reference to the accompanying drawings, taking the workflow scheduling of a chemiluminescence immunoassay analyzer as a specific example:
the embodiment of the invention and the process thereof are as follows:
the function that a scheduling Agent based on deep reinforcement learning can play in a digital twin application is described in the figure 1.
S1. Divide the structure of the example and determine the operation flow;
as shown in fig. 2(a), the present invention is described by taking the workflow schedule of the chemiluminescence immunoassay analyzer as an example, and specifically includes the following steps:
Sorting out the local structure of the chemiluminescence immunoassay analyzer: it comprises three transport modules (T1, T2 and T3, with transport times of 2, 3 and 4 respectively) and four manipulator modules (M1, M2, M3 and M4, with action times of 8, 16, 10 and 10 respectively); each transport module can transport only one sample strip at a time, and each manipulator module is dual-channel (i.e. it can process two sample strips at the same time). This structure can handle three different types of samples (S1, S2 and S3) and interfaces with the input ports (I1, I2 and I3) and output ports (O1, O2 and O3) of each sample strip. The transport modules move sample strips between the operation modules and the input/output ports: T1 transports among I1, O1, M1, M2, M3 and M4; T2 transports among I2, O3, M2 and M3; and T3 transports among I3, O2, M1 and M4. As can be seen from the figure, the minimum completion times of the three sample types are 12, 29 and 29, respectively.
S2. Construct the graph convolutional neural network of the timed S3PR net;
As shown in FIG. 2(b), the timed S3PR net model of the chemiluminescence immunoassay analyzer is constructed, where p1 to p3 represent the inputs of samples S1, S2 and S3 respectively, and p23 to p29 represent T1, M2, M1, T3, M4, M3 and T2 respectively. The operation time of each step in the operation flow is indicated in brackets after the library name in the figure, and schedulable transitions and autonomous transitions are represented by solid black squares and black hatched squares respectively.
The specific implementation steps for constructing the graph convolutional neural network of the timed S3PR net are as follows:
S21. A dummy resource library whose marking is always greater than zero is added for each autonomous transition, so that all transitions in the timed S3PR net have the same preceding-node structure. A 2-unit shared convolution kernel can then be constructed, by analogy with a classical convolution layer, to compute the weighted sum of the features of each transition's preceding operation library and preceding resource library, and the result is taken as the feature f_t of that transition:

$$f_t \;=\; w_P f_p + w_R f_r + b$$

where f_p, f_r and the all-ones vector 1 are the d-dimensional feature vectors of the preceding operation library, the preceding resource library and the dummy resource library respectively, w_P and w_R are the d × d trainable weight matrices corresponding to the 2 kernel units, and b is a d-dimensional trainable bias. f_t is also a d-dimensional feature vector; keeping the transition and library feature dimensions consistent in the first sublayer allows the transition-to-library feature propagation to be computed subsequently. The convolution in the first sublayer can be organized through the input matrix I of the timed S3PR net, the calculation being as shown in FIG. 4. First, the input matrix I of the timed S3PR net is transposed and rearranged to obtain a new location matrix I′ ∈ {0, 1}^((2|T|)×(|P|+1)), where |T| and |P| are the cardinalities of the transition set and the library set respectively; the odd rows of I′ index the preceding operation library of each transition and the even rows index its preceding resource library. In each row of I′, only the element at the position indexed by that library is 1 and all others are 0, so I′ is a static parameter of the P2T sublayer. Then the input of the P2T sublayer, the library state feature matrix F_P ∈ R^(|P|×d), is extended by one all-ones row, representing the features of the dummy resource library, to obtain F_P′ ∈ R^((|P|+1)×d). Multiplying I′ by F_P′ yields the feature matrix used for convolution, F_P″ ∈ R^((2|T|)×d). Finally, by analogy with a classical convolution layer, F_P″ is convolved with a shared kernel of dimension 2 × 1 × d × d with a stride of 2, and the weighted sums give the desired transition feature matrix F_T ∈ R^(|T|×d). F_P and F_T together serve as the input of the second sublayer.
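A minimal sketch of this first (P2T) sublayer forward pass, written in a PyTorch style, is given below for illustration; the class and variable names are assumptions rather than the patent's reference implementation, and the 2-unit shared kernel is realized directly as the two weight matrices w_P and w_R.

```python
import torch
import torch.nn as nn


class P2TSublayer(nn.Module):
    """Library-to-transition (P2T) feature propagation, first sublayer. Sketch only.

    loc_matrix is the static {0,1} location matrix I' of shape (2*|T|, |P|+1):
    odd rows select each transition's preceding operation library, even rows its
    preceding resource library (the extra column is the dummy resource library).
    """

    def __init__(self, loc_matrix: torch.Tensor, d: int):
        super().__init__()
        self.register_buffer("loc_matrix", loc_matrix.float())  # static parameter, not trained
        self.w_p = nn.Parameter(torch.randn(d, d) * 0.01)       # kernel unit for operation libraries
        self.w_r = nn.Parameter(torch.randn(d, d) * 0.01)       # kernel unit for resource libraries
        self.b = nn.Parameter(torch.zeros(d))                    # shared bias

    def forward(self, place_feats: torch.Tensor) -> torch.Tensor:
        # place_feats: library state feature matrix F_P of shape (|P|, d)
        d = place_feats.shape[1]
        dummy = torch.ones(1, d, dtype=place_feats.dtype, device=place_feats.device)
        fp_ext = torch.cat([place_feats, dummy], dim=0)          # F_P' of shape (|P|+1, d)
        gathered = self.loc_matrix @ fp_ext                      # F_P'' of shape (2*|T|, d)
        op_feats, res_feats = gathered[0::2], gathered[1::2]     # the two kernel inputs per transition
        # stride-2 weighted sum of the 2-unit shared kernel: f_t = w_P f_p + w_R f_r + b
        return op_feats @ self.w_p + res_feats @ self.w_r + self.b   # F_T of shape (|T|, d)
```

The output F_T, together with F_P, then forms the input of the second sublayer, consistent with the description above.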
S22. A linear shift-invariant filter on the directed weighted graph is used, with the filter order K limited to 1 so that the filtered features of a library depend only on the library's own original features and the original features of its preceding transitions; this yields the second sublayer structure for transition-to-library feature propagation. Following the construction theory of the linear shift-invariant filter, the transition-to-library feature propagation formula of the second sublayer is defined as

$$F_P' \;=\; \big(h_0 F_P + h_1 \hat{O}\, F_T\big)\, W + B$$

where h_0 and h_1 are the trainable scalar weights of the first-order filter, Ô is the normalized output matrix of the timed S3PR net, W is a trainable weight matrix of dimension d × d′ that maps the original d-dimensional features to new d′-dimensional features, and B is a trainable bias.
S23. As shown in FIG. 3, the output of the first sublayer structure is used as the input of the second sublayer structure, and the two sublayer structures are combined to construct the timed S3PR net convolution layer (PNC); a neural network built from such layers is used to fit the state of the timed S3PR net model. The convolution layer takes the input matrix and the output matrix of the timed S3PR net as static parameters, so the number of trainable weights depends only on the dimensions of the input and output features and is independent of the scale of the timed S3PR net, which overcomes the weight-explosion problem in deep neural network construction.
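An illustrative composition of the two sublayer sketches above into one such PNC layer could be written as follows (an assumption for illustration, reusing the P2TSublayer and T2PSublayer classes above):

```python
import torch.nn as nn


class PNCLayer(nn.Module):
    """One timed S3PR net convolution layer: P2T followed by T2P (sketch only)."""

    def __init__(self, loc_matrix, out_matrix, d_in: int, d_out: int):
        super().__init__()
        self.p2t = P2TSublayer(loc_matrix, d_in)
        self.t2p = T2PSublayer(out_matrix, d_in, d_out)

    def forward(self, place_feats):
        trans_feats = self.p2t(place_feats)            # library -> transition
        return self.t2p(place_feats, trans_feats)      # transition -> library
```

Note that the trainable weights of such a layer depend only on d_in and d_out, not on |P| or |T|, which matches the statement that the number of trainable weights is independent of the net scale.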
S3, giving time S3The dynamic scheduling problem of the PR network is converted into a Markov decision model;
s31, utilizing quintuple
Figure BDA0002435732420000068
To describe the timing S3An interactive environment for PR mesh scheduling, in which,
Figure BDA0002435732420000069
the representation of the state space is represented by,
Figure BDA00024357324200000610
represents the motion space, phi represents the state transition rule, r represents the return function, and gamma represents the discount coefficient.
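The interactive environment described by this quintuple can be organized as a conventional reset/step interface; the skeleton below is only a sketch of how such an environment might be framed, with all names and the default γ value chosen for illustration rather than taken from the patent.

```python
class TimedS3PREnv:
    """Skeleton of the timed S3PR net scheduling environment described by the
    quintuple (S, A, Phi, r, gamma); a sketch only, method bodies omitted."""

    def __init__(self, gamma: float = 0.99):
        self.gamma = gamma        # discount coefficient gamma (0.99 is an assumed value)
        self.marking = None       # current identification state of the timed S3PR net

    def reset(self):
        """Return the initial identification state s0 (an element of the state space S)."""
        raise NotImplementedError

    def step(self, action):
        """Apply the state transition rule Phi for the chosen schedulable transition and
        return (next_state, reward, done), where reward follows the return function r."""
        raise NotImplementedError

    def valid_actions(self):
        """Return the mask of schedulable transitions enabled in the current marking
        (used later as the action mask during Q-value selection)."""
        raise NotImplementedError
```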
S4, solving the Markov decision model by using a Deep Q value network (Deep Q-network) reinforcement learning method, wherein the steps are as follows:
s41, defining a cost function
Figure BDA0002435732420000081
To evaluate the current state at time step tau
Figure BDA0002435732420000082
Selection actions
Figure BDA0002435732420000083
The value of (A) is obtained.
S42, the optimal cost function obeys Bellman Equation (Bellman Equation), and the Bellman Equation pair is utilized in the training process
Figure BDA0002435732420000084
Learning is performed in an iterative manner.
S43, fitting by using a DQN method through two deep neural networks with the same structure
Figure BDA0002435732420000085
During specific implementation, when scheduling is trapped in deadlock, a scheduling Agent obtains a large penalty, the playback probability of deadlock Experience can be improved in the early training stage through weight Experience playback, the scheduling Agent learns a strategy of deadlock avoidance as soon as possible, so that the improvement of scheduling performance is emphasized in the later training stage, the playback probability of a state containing β ≠ 0 is higher, and the scheduling Agent is more sensitive to the change of task completion time.
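One possible way to realize the priority weighting just described is sketched below; the bonus constants and the attributes of the stored transition (is_deadlock, beta) are illustrative assumptions, not values from the patent.

```python
def replay_priority(transition, td_error: float, deadlock_bonus: float = 10.0,
                    beta_bonus: float = 2.0, eps: float = 1e-3) -> float:
    """Assign a sampling priority to one stored experience (sketch only).

    transition is assumed to expose .is_deadlock and .beta (total cost of tasks
    completed in the scheduling step); higher priority => replayed more often.
    """
    priority = abs(td_error) + eps                 # usual prioritized-replay term
    if transition.is_deadlock:
        priority += deadlock_bonus                 # replay deadlock experiences early and often
    if transition.beta != 0:
        priority += beta_bonus                     # emphasize steps that complete tasks
    return priority
```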
S44. By introducing a mask ζ into the DQN, trial and error on invalid actions during training is avoided, thereby further improving the convergence speed.
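The action mask ζ can be realized by setting the Q-values of disabled transitions to negative infinity before the greedy choice; the following is a sketch under that assumption.

```python
import torch


def masked_greedy_action(q_values: torch.Tensor, action_mask: torch.Tensor) -> int:
    """Pick the best valid action (sketch only).

    q_values:    (|A|,) Q-value estimates for the current state
    action_mask: (|A|,) boolean mask zeta, True where the transition is enabled
    """
    masked_q = q_values.masked_fill(~action_mask, float("-inf"))  # invalid actions never chosen
    return int(torch.argmax(masked_q).item())
```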
(Algorithm 1, the timed S3PR net operating environment, and Algorithm 2, the scheduling Agent, are presented as figures in the original filing and are not reproduced here.)
S5. timing S for realizing algorithm 13PR network operating environment and scheduling Agent of algorithm 2 for verifying DQN method combined with PNCN3Performance on PR network dynamic scheduling;
FIG. 5 is a diagram illustrating the result of dynamic scheduling;
S51. The network comprises seven convolution layers and one fully connected layer: PNC(12)-BN-LeakyReLU-PNC(12)-BN-LeakyReLU-PNC(24)-BN-LeakyReLU-PNC(24)-BN-LeakyReLU-PNC(36)-BN-LeakyReLU-PNC(36)-BN-LeakyReLU-PNC(12)-BN-LeakyReLU-FC(18)-Linear, where BN denotes batch normalization and LeakyReLU denotes the leaky rectified linear unit activation;
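Reusing the illustrative PNCLayer sketch above, this architecture could be assembled roughly as follows; num_actions, the output size of the final linear layer and the exact placement of batch normalization are assumptions, since the patent only gives the layer string.

```python
import torch
import torch.nn as nn


class PNCQNetwork(nn.Module):
    """Sketch of the PNC(12)-...-PNC(12)-FC(18)-Linear Q-network listed in S51."""

    def __init__(self, loc_matrix, out_matrix, d_in: int, num_actions: int):
        super().__init__()
        widths = [12, 12, 24, 24, 36, 36, 12]                   # PNC layer widths from S51
        layers, prev = [], d_in
        for w in widths:
            layers += [PNCLayer(loc_matrix, out_matrix, prev, w),
                       nn.BatchNorm1d(w),                        # BN over the library dimension (illustrative)
                       nn.LeakyReLU()]
            prev = w
        self.body = nn.ModuleList(layers)
        num_places = out_matrix.shape[0]
        self.fc = nn.Linear(num_places * prev, 18)               # FC(18)
        self.out = nn.Linear(18, num_actions)                    # final linear layer of Q-values

    def forward(self, place_feats: torch.Tensor) -> torch.Tensor:
        x = place_feats                                          # (|P|, d_in) for a single state
        for layer in self.body:
            x = layer(x)
        return self.out(self.fc(x.flatten()))                    # (num_actions,) Q-value estimates
```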
S52. Three reference methods are implemented: FCFS+, D2WS and MLPQ, and they are compared with the method proposed in this patent (a deep Q-value network using the PNC layer, abbreviated PNCQ).
The settings for scheduling-Agent training in the experiment are as follows. Training starts with 10,000 warm-up steps in which FCFS+ is used to fill the weighted experience replay. Training uses the ε-greedy strategy: ε decreases linearly from 1.0 to 0.1 over the first 200,000 training steps and remains at 0.1 in the subsequent training steps. Each training step performs one back-propagation update of the neural network; the optimizer is the Adam algorithm with a learning rate of 0.0001, and the target neural network in the DQN is updated every 1,000 steps. A performance evaluation is carried out every 10,000 steps: 100 cycles are scheduled with the current neural network and the average cycle return and deadlock rate are recorded, while FCFS+ and D2WS schedule the same cycles for comparison. The experiment records a training procedure of 3,000,000 steps, and the results are shown in FIG. 5.
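The ε schedule and the periodic updates described above could be expressed as in the sketch below; only the numbers stated in the text (1.0 → 0.1 over 200,000 steps, target update every 1,000 steps, evaluation every 10,000 steps, Adam with learning rate 0.0001, 3,000,000 total steps) are taken from the experiment, and the surrounding loop structure and names are assumptions.

```python
def epsilon_at(step: int, start: float = 1.0, end: float = 0.1, decay_steps: int = 200_000) -> float:
    """Linear epsilon-greedy schedule: 1.0 -> 0.1 over the first 200,000 steps, then constant."""
    if step >= decay_steps:
        return end
    return start + (end - start) * step / decay_steps


# Illustrative outline of the reported training schedule (q_net, target_net and
# evaluate_100_cycles are hypothetical names, not from the patent):
#
#   optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
#   for step in range(3_000_000):
#       eps = epsilon_at(step)
#       ... epsilon-greedy action selection, environment step, prioritized replay sampling,
#       ... one back-propagation update of q_net
#       if step % 1_000 == 0:
#           target_net.load_state_dict(q_net.state_dict())   # update target network
#       if step % 10_000 == 0:
#           evaluate_100_cycles(q_net)                        # record average cycle return and deadlock rate
```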
From the cycle-return curves shown in FIG. 5(a), it can be seen that the curve of PNCQ lies entirely above that of MLPQ, indicating that both the convergence speed and the scheduling performance of PNCQ are better than those of MLPQ. PNCQ reaches the performance of FCFS+ at about 500,000 training steps, whereas MLPQ needs more than 1,300,000 steps; PNCQ approaches D2WS at about 800,000 steps. The value curve in FIG. 5(c) and the loss curve in FIG. 5(d) also support that PNCQ converges faster than MLPQ. The return statistics of the last 100 evaluation cycles are shown in FIG. 5(e): the average cycle return finally reached by PNCQ is about 309.95, about 12.3% higher than FCFS+ and about 7.0% higher than MLPQ, and only about 1.3% lower than D2WS.
Although D2WS achieves the best scheduling performance, its time complexity is high and unacceptable in some real-time scenarios. The experiment measured the computation time of each method for scheduling 100 cycles on a computer equipped with an Intel i5 3.4 GHz processor and 8 GB of memory, as shown in the table below. As can be seen from the results, the slight performance advantage of D2WS over PNCQ is built on hundreds of times the computation time. PNCQ therefore combines good scheduling performance with good computational performance.
Table 1. Computation time of each scheduling method (unit: ms); the table values are given as an image in the original publication and are not reproduced here.
The above example merely illustrates the results of the invention on one embodiment, and the specific implementation of the invention is not limited to this example. Any alternative that achieves a similar effect according to the principles and concepts of the invention shall be regarded as falling within the protection scope of the invention.

Claims (5)

1. A digital twin virtual-real self-adaptive iterative optimization method for dynamic scheduling of product operation, characterized in that the method comprises the following steps:
S1. Establish a timed S3PR net model for the product digital twin operation flow;
S2. Construct a first sublayer structure for library-to-transition feature propagation in the timed S3PR net model;
S3. Construct a second sublayer structure for transition-to-library feature propagation in the timed S3PR net model;
S4. Build a neural network from the first sublayer structure and the second sublayer structure, and use the neural network to fit the state of the timed S3PR net model;
S5. Convert the dynamic scheduling problem of the product digital twin operation flow based on the timed S3PR net model into a Markov decision model;
S6. Solve the Markov decision model established in S5 with the deep Q-value network (Deep Q-Network) reinforcement learning method, realizing digital twin virtual-real self-adaptive iterative optimization.
2. The digital twin virtual-real self-adaptive iterative optimization method for dynamic scheduling of product operation according to claim 1, characterized in that:
in step S2, a 2-unit shared convolution kernel is constructed to compute the weighted sum of the features of each transition's preceding operation library and the features of its preceding resource library, and this weighted sum serves as the first sublayer structure; in the specific implementation, a dummy resource library whose marking is always greater than zero is added for each autonomous transition, so that all transitions in the timed S3PR net have the same preceding-node structure.
3. The digital twin virtual-real self-adaptive iterative optimization method for dynamic scheduling of product operation according to claim 1, characterized in that:
in step S3, the second sublayer structure is constructed according to a linear shift-invariant filter; in the specific implementation, a linear shift-invariant filter on the directed weighted graph is used:

$$\tilde{F} \;=\; H(A)\,F \;=\; \sum_{k=0}^{K} h_k A^{k} F$$

where F and F̃ are respectively the original signal and the filtered signal on the directed weighted graph, H is the filter, A is the adjacency matrix of the directed weighted graph, h_k are the filter parameters, and K is the order of the filter; the filter order K is limited to 1 so that the filtered features of a library depend only on the library's own original features and the original features of its preceding transitions.
4. The digital twin virtual-real self-adaptive iterative optimization method for dynamic scheduling of product operation according to claim 1, characterized in that:
in step S4, the output of the first sublayer structure is used as the input of the second sublayer structure, and the two sublayer structures are combined to form the convolution layer of the timed S3PR net model; the convolution layer takes the input matrix and the output matrix of the timed S3PR net as static parameters.
5. The digital twin virtual-real self-adaptive iterative optimization method for dynamic scheduling of product operation according to claim 1, characterized in that in step S6, the steps of solving the Markov decision model established in step S5 with the deep Q-value network reinforcement learning method are as follows:
S61. establish a value function Q(s_τ, a) to evaluate the value of selecting action a ∈ A in the current state s_τ ∈ S at time step τ, where Q denotes the value, s_τ denotes the state at time step τ, a denotes an action, S denotes the state space, and A denotes the action space;
S62. the optimal value function obeys the Bellman Equation, and the Bellman equation is used during training to learn the value function Q(s_τ, a) in an iterative manner;
S63. the deep Q-value network reinforcement learning method fits the value function Q(s_τ, a) with two deep neural networks of identical structure, and the weighted experience replay (Prioritized Experience Replay) method is used in training;
S64. a mask ζ is introduced into the deep Q-value network reinforcement learning method to avoid trial and error on invalid actions during training, thereby further improving the convergence speed.
CN202010251710.7A 2020-04-01 2020-04-01 Digital twin virtual-real self-adaptive iterative optimization method for dynamic scheduling of product operation Pending CN111445081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010251710.7A CN111445081A (en) 2020-04-01 2020-04-01 Digital twin virtual-real self-adaptive iterative optimization method for dynamic scheduling of product operation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010251710.7A CN111445081A (en) 2020-04-01 2020-04-01 Digital twin virtual-real self-adaptive iterative optimization method for dynamic scheduling of product operation

Publications (1)

Publication Number Publication Date
CN111445081A true CN111445081A (en) 2020-07-24

Family

ID=71651016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010251710.7A Pending CN111445081A (en) 2020-04-01 2020-04-01 Digital twin virtual-real self-adaptive iterative optimization method for dynamic scheduling of product operation

Country Status (1)

Country Link
CN (1) CN111445081A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200493A (en) * 2020-11-02 2021-01-08 傲林科技有限公司 Digital twin model construction method and device
CN117406684A (en) * 2023-12-14 2024-01-16 华侨大学 Flexible flow shop scheduling method based on Petri network and fully-connected neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228314A (en) * 2016-08-11 2016-12-14 电子科技大学 The workflow schedule method of study is strengthened based on the degree of depth
CN108304795A (en) * 2018-01-29 2018-07-20 清华大学 Human skeleton Activity recognition method and device based on deeply study
CN110045608A (en) * 2019-04-02 2019-07-23 太原理工大学 Based on the twin mechanical equipment component structural dynamic state of parameters optimization method of number
US20190294975A1 (en) * 2018-03-21 2019-09-26 Swim.IT Inc Predicting using digital twins
CN110930016A (en) * 2019-11-19 2020-03-27 三峡大学 Cascade reservoir random optimization scheduling method based on deep Q learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228314A (en) * 2016-08-11 2016-12-14 电子科技大学 The workflow schedule method of study is strengthened based on the degree of depth
CN108304795A (en) * 2018-01-29 2018-07-20 清华大学 Human skeleton Activity recognition method and device based on deeply study
US20190294975A1 (en) * 2018-03-21 2019-09-26 Swim.IT Inc Predicting using digital twins
CN110045608A (en) * 2019-04-02 2019-07-23 太原理工大学 Based on the twin mechanical equipment component structural dynamic state of parameters optimization method of number
CN110930016A (en) * 2019-11-19 2020-03-27 三峡大学 Cascade reservoir random optimization scheduling method based on deep Q learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LIANG HU ET AL.: "Petri-net-based dynamic scheduling of flexible manufacturing system via deep reinforcement learning with graph convolutional network", 《JOURNAL OF MANUFACTURING SYSTEMS》 *
PENGFEI WU ET AL.: "Research on the Virtual Reality Synchronization of Workshop Digital Twin", 《2019 IEEE 8TH JOINT INTERNATIONAL INFORMATION TECHNOLOGY AND ARTIFICIAL INTELLIGENCE CONFERENCE (ITAIC)》 *
XUE Han: "Research on the Synchronization and Control of S3PR Nets", China Master's Theses Full-text Database, Information Science and Technology Series *
TAO Fei et al.: "Digital twin and its potential application exploration", Computer Integrated Manufacturing Systems *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200493A (en) * 2020-11-02 2021-01-08 傲林科技有限公司 Digital twin model construction method and device
CN117406684A (en) * 2023-12-14 2024-01-16 华侨大学 Flexible flow shop scheduling method based on Petri network and fully-connected neural network
CN117406684B (en) * 2023-12-14 2024-02-27 华侨大学 Flexible flow shop scheduling method based on Petri network and fully-connected neural network

Similar Documents

Publication Publication Date Title
Hu et al. Petri-net-based dynamic scheduling of flexible manufacturing system via deep reinforcement learning with graph convolutional network
CN110119467B (en) Project recommendation method, device, equipment and storage medium based on session
CN113053115B (en) Traffic prediction method based on multi-scale graph convolution network model
Li et al. An effective hybrid genetic algorithm and tabu search for flexible job shop scheduling problem
CN104662526B (en) Apparatus and method for efficiently updating spiking neuron network
CN114915630B (en) Task allocation method, network training method and device based on Internet of Things equipment
CN112328914A (en) Task allocation method based on space-time crowdsourcing worker behavior prediction
CN115186821B (en) Core particle-oriented neural network inference overhead estimation method and device and electronic equipment
CN112631717A (en) Network service function chain dynamic deployment system and method based on asynchronous reinforcement learning
CN111445081A (en) Digital twin virtual-real self-adaptive iterative optimization method for dynamic scheduling of product operation
Xia et al. Learning sparse relational transition models
CN115828831B (en) Multi-core-chip operator placement strategy generation method based on deep reinforcement learning
CN113469891A (en) Neural network architecture searching method, training method and image completion method
CN113537580A (en) Public transport passenger flow prediction method and system based on adaptive graph learning
Samsudin et al. A hybrid least squares support vector machines and GMDH approach for river flow forecasting
Jain et al. Queueing network modelling of flexible manufacturing system using mean value analysis
CN117195976A (en) Traffic flow prediction method and system based on layered attention
CN116975686A (en) Method for training student model, behavior prediction method and device
CN113537613B (en) Temporal network prediction method for die body perception
CN115544307A (en) Directed graph data feature extraction and expression method and system based on incidence matrix
Chang et al. A fuzzy neural network for the flow time estimation in a semiconductor manufacturing factory
CN116560731A (en) Data processing method and related device thereof
CN109978143B (en) Stack type self-encoder based on SIMD architecture and encoding method
CN115001978B (en) Cloud tenant virtual network intelligent mapping method based on reinforcement learning model
CN116991564B (en) Operator internal parallel acceleration method for heterogeneous dual-core MCU

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200724