CN113837628B - Metallurgical industry workshop crown block scheduling method based on deep reinforcement learning - Google Patents

Metallurgical industry workshop crown block scheduling method based on deep reinforcement learning

Info

Publication number
CN113837628B
CN113837628B (application CN202111142373.9A; published as CN113837628A)
Authority
CN
China
Prior art keywords
crown block
task
reinforcement learning
overhead traveling
crown
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111142373.9A
Other languages
Chinese (zh)
Other versions
CN113837628A (en)
Inventor
冯凯
张云贵
马湧
梁青艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Iron and Steel Research Institute Group
Original Assignee
China Iron and Steel Research Institute Group
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Iron and Steel Research Institute Group filed Critical China Iron and Steel Research Institute Group
Publication of CN113837628A publication Critical patent/CN113837628A/en
Application granted granted Critical
Publication of CN113837628B publication Critical patent/CN113837628B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06313Resource planning in a project environment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/08Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/083Shipping
    • G06Q10/0838Historical data

Abstract

The invention discloses a deep-reinforcement-learning-based method for scheduling overhead cranes (crown blocks) in metallurgical industry workshops, belonging to the technical field of workshop crane scheduling. The method comprises the following steps: (1) acquiring the spatial layout of the bay (span area) in which the cranes operate in a metallurgical workshop, together with a table of historical crane transport tasks; (2) building a deep reinforcement learning model for crane scheduling from the bay layout, with each crane acting as an agent and the bay space as the environment; (3) optimizing the parameters of the model and training it on the historical transport-task table; (4) periodically collecting the current positions and states of the cranes in the bay and the status of running and pending transport tasks, encoding them as an environment state, feeding this state into the trained model, and generating a crane scheduling scheme. For transport tasks that arise at random or change at short notice in a metallurgical workshop, the invention produces a globally optimized scheduling scheme in time, improves crane scheduling efficiency, and shows strong robustness and effectiveness.

Description

Method for dispatching shop crown blocks in metallurgical industry based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of workshop overhead crane scheduling, and in particular relates to a method that uses deep reinforcement learning to generate crane scheduling schemes in the face of uncertain crane transport tasks.
Background
In metallurgical enterprises, whether in smelting shops or in storage shops, overhead cranes (crown blocks) are the most important means of transport. Crane scheduling is an important part of production management, and its efficiency largely determines the efficiency of production logistics and how well successive process steps connect and match. Reasonable crane scheduling is therefore essential to the overall performance of a metallurgical enterprise. Because of unplanned factors in the metallurgical production process, such as time fluctuations in smelting operations or transport-plan changes caused by equipment failures, crane transport tasks in metallurgical workshops carry a degree of uncertainty. Faced with uncertain transport tasks, crane dispatchers can usually only improvise a scheduling scheme from accumulated experience or a fixed set of rules. Such schemes inevitably suffer from unbalanced crane loads, frequent conflicts along transport routes, and low overall scheduling efficiency.
Deep reinforcement learning is a framework for solving complex sequential decision problems. By learning through trial and error on historical crane tasks and by sensing spatial information in the workshop in real time, it can provide a dynamic, optimized crane scheduling scheme for uncertain transport tasks. How to use deep reinforcement learning to achieve fast and efficient crane scheduling in the actual production process is a problem that requires further research.
Disclosure of Invention
The invention aims to provide a deep-reinforcement-learning-based method for scheduling overhead cranes in metallurgical industry workshops, addressing the problem that an optimized crane scheduling scheme cannot be obtained when transport tasks are uncertain. The method generates crane scheduling schemes with deep reinforcement learning, effectively solves the optimized scheduling problem for uncertain transport tasks, improves crane scheduling efficiency, reduces the probability of conflicts along crane transport paths, and enables transport tasks to be completed efficiently.
The metallurgical industry workshop overhead crane scheduling method based on deep reinforcement learning disclosed by the invention comprises the following steps:
(1) Acquire the spatial layout of the bay (span area) in which the cranes operate in the metallurgical workshop and a table of historical crane transport tasks;
(2) According to the bay layout, create a deep reinforcement learning model with each crane as an agent and the bay space as the environment;
(3) Optimize the parameters of the deep reinforcement learning model and train it on the historical crane transport-task table;
(4) At a fixed time interval, collect the current positions and states of the cranes in the bay and the status of the transport tasks being executed and waiting to be executed, and generate an environment state;
(5) Input the environment state into the trained deep reinforcement learning model, output the operation to be executed by each crane, and generate the current crane scheduling scheme; match cranes to transport tasks and assign their transport paths according to this scheme, then return to step (4) until all transport tasks in the bay are completed.
Further, in step (1), the spatial layout of the bay in which the cranes operate refers to the top view of a particular bay in a steelmaking workshop; the data to be acquired include the length and width of the bay and the transverse and longitudinal distances of every station in the bay from the bay edges.
Further, in step (1), the historical crane transport-task table includes, for each task, a task number, a start station, a target station, a start time and an end time.
Further, in step (2), the deep reinforcement learning model uses the DQN algorithm to build an "action-environment state-reward" feedback framework. Each crane in the bay is abstracted as an individual agent and the bay itself as the environment; an agent's actions are crane operations, and the state it observes consists of the task information and the state information of all cranes in the bay. The environment state comprises the positions of the task start and end stations within the bay and the positions and states of all cranes in the same bay. The reward function defines the reward obtained when a crane performs each operation under empty and loaded conditions, and this reward is fed back to the agent as an immediate reward.
Further, in step (2), the environment state is expressed as a 3 × N matrix, where the number of columns N is a positive integer and the bay space is discretized into N positions. The first, second and third rows of the matrix hold the relative positions within the bay of the task start stations, the cranes and the task end stations, respectively. When a transport task is generated, the cells corresponding to its start and end stations take the task number as their value, and the cell in the second row corresponding to a crane takes the crane number as its value. Cells in the first and third rows with no task, and cells in the second row with no crane, are 0. After a crane hoists a task, the start-station cell becomes 0 and the crane cell becomes a combination of the crane number (integer part) and the task number (decimal part). When the crane sets the task down, the end-station cell becomes 0 and the crane cell reverts to the crane number.
Further, in step (2), the crane operations comprise 5 actions: move left, move right, stay, hoist the ladle and set down the ladle, denoted by 0, 1, 2, 3 and 4, respectively.
Compared with the prior art, the invention has the following advantages and positive effects. A deep reinforcement learning model for crane scheduling is constructed and trained on historical crane transport tasks: each crane in the bay is abstracted as an individual agent, task information and crane positions and states are expressed as the environment state, and the scheduling model is trained with DQN, yielding a real-time crane scheduling strategy. The resulting scheduling model can promptly produce a globally optimized scheduling scheme for transport tasks that arise at random or change at short notice, and meets the efficiency requirements of online application. On top of improving crane scheduling efficiency, the method preserves the robustness and effectiveness of the scheduling model in actual production scenarios.
Drawings
Fig. 1 is a schematic flow diagram of a method for scheduling a crown block in a metallurgical industry workshop based on deep reinforcement learning according to an embodiment of the present invention;
FIG. 2 is the environment-state matrix for the current crane positions and tasks in an embodiment of the present invention;
FIG. 3 is the environment-state matrix after a crane has hoisted a task in an embodiment of the present invention;
Fig. 4 shows the training process for two cranes in an embodiment of the present invention.
Detailed Description
To make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
As shown in fig. 1, the method for scheduling overhead cranes in a steelmaking workshop based on deep reinforcement learning according to an embodiment of the present invention includes the following five steps; the implementation of each step is described below.
Step 1, obtaining the spatial layout of the bay in which the cranes operate in the metallurgical workshop, together with the historical crane transport-task table.
In this embodiment, for the bay in which the cranes actually operate, the length and width of the bay and the transverse and longitudinal distances of every station from the bay edges are collected. The bay layout refers to the top view of a particular bay in a steelmaking workshop, such as the raw-material bay, the molten-steel receiving bay or the refining bay.
The historical crane transport-task table includes the task number, start station, target station, task start time and task end time.
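As an illustration of the data gathered in step 1, the bay layout and the historical task table can be held in simple records. The following is a minimal Python sketch; the class names, field names and numeric values are assumptions for demonstration and are not defined by the patent.

```python
# Illustrative containers for the bay layout and the historical task table.
# All names and values here are assumed for demonstration purposes only.
from dataclasses import dataclass

@dataclass
class BayLayout:
    length_m: float      # bay length in metres
    width_m: float       # bay width in metres
    stations: dict       # station name -> longitudinal distance from the bay edge (m)

@dataclass
class TransportTask:
    task_id: int
    start_station: str
    target_station: str
    start_time: float
    end_time: float

# Example records mirroring Table 1 (hypothetical values)
layout = BayLayout(length_m=210.0, width_m=30.0,
                   stations={"A": 12.0, "B": 35.0, "C": 180.0, "D": 48.0})
history = [
    TransportTask(1, "A", "B", 0.0, 310.0),
    TransportTask(2, "B", "C", 120.0, 540.0),
    TransportTask(3, "C", "D", 300.0, 760.0),
]
```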
Step 2, creating the deep reinforcement learning model framework according to the bay layout, with each crane as an agent and the bay space as the environment.
The deep reinforcement learning model framework established by the invention is an "action-environment state-reward" feedback framework designed with the DQN (Deep Q-Network) algorithm. Specifically, each crane is abstracted as an individual agent and the bay in which it operates as the environment state; an agent's actions are crane operations, and the state it observes consists of the task information and the state information of all cranes in the bay. A reward function is designed to guide the cranes to complete their transport tasks efficiently.
The environment state comprises the positions of the task start and end stations within the bay and the positions and states of all cranes in the same bay, and is represented as a 3 × N matrix. The first, second and third rows hold the relative positions of the task start stations, the cranes and the task end stations, respectively. N is a positive integer: the bay is divided into N positions according to its spatial size and the distribution of stations. Typically every 5 to 10 m of bay length is mapped to one position, and different scaling ratios can be chosen for specific application scenarios.
In the embodiment of the invention, the bay layout is converted into a 3 × 30 matrix by scaling the relative positions proportionally, and this matrix represents the environment state of the deep reinforcement learning model; the bay length equals 30 columns multiplied by the scaling factor. The 3 × 30 environment-state matrix directly reflects the station information, the crane states, and the running and pending tasks in the bay.
As shown in fig. 2, the first, second and third rows of the matrix hold the relative positions of the task start stations, the cranes and the task end stations in the bay. When a task is generated, the cells corresponding to its start and end stations take the task number (greater than 0) as their value, and cells with no task are 0. In fig. 2 there are 3 cranes in the bay: crane 1 at position 3, crane 2 at position 6 and crane 3 at position 26, all idle in the current state. There are two tasks in the current environment state: task 1 starts at position 2 and ends at position 5, and task 2 starts at position 26 and ends at position 7. After a crane hoists a task, the start-station cell becomes 0 and the crane cell becomes a combination of the crane number (integer part) and the task number (decimal part), as shown in fig. 3: crane 1 has moved to position 2 and hoisted task 1, so its cell value becomes 1.1. Setting a task down is the reverse process; for example, when crane 1 sets down task 1, it has moved to position 5, the end-station cell of task 1 becomes 0, and the crane cell reverts to 1.
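The encoding of Figs. 2 and 3 can be reproduced with a short NumPy sketch. The matrix layout and the integer/decimal value convention follow the description above; the hoist helper is an illustrative assumption and, like the example in the text, presumes single-digit task numbers.

```python
# Sketch of the 3 x 30 environment-state matrix: row 0 = task start stations,
# row 1 = crane positions, row 2 = task end stations (positions are 1-based in the text).
import numpy as np

N = 30
state = np.zeros((3, N))

# Idle situation of Fig. 2: cranes 1, 2, 3 at positions 3, 6, 26
state[1, 3 - 1], state[1, 6 - 1], state[1, 26 - 1] = 1, 2, 3
# Task 1 runs from position 2 to 5, task 2 from position 26 to 7
state[0, 2 - 1], state[2, 5 - 1] = 1, 1
state[0, 26 - 1], state[2, 7 - 1] = 2, 2

def hoist(state, crane_no, task_no, crane_pos):
    """Crane picks a task up: the start-station cell is cleared and the crane cell
    becomes crane_no + task_no/10 (e.g. 1.1 for crane 1 carrying task 1).
    The decimal encoding assumes single-digit task numbers."""
    start_col = np.flatnonzero(state[0] == task_no)[0]
    state[0, start_col] = 0
    state[1][state[1] == crane_no] = 0        # clear the crane's previous cell
    state[1, crane_pos - 1] = crane_no + task_no / 10

hoist(state, crane_no=1, task_no=1, crane_pos=2)   # reproduces the 1.1 entry of Fig. 3
```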
At the same time, the invention standardizes the crane operations into 5 types: move left, move right, stay, hoist the ladle and set down the ladle, represented by 0, 1, 2, 3 and 4 in that order.
The parameters of the deep reinforcement learning model comprise the Q-network structure, the model factors and the reward function. The Q-network structure parameters include the number of convolution layers, the number of fully connected layers, the convolution kernel size, the number of neurons in the fully connected layers and the activation function; the model factors include the reward discount factor, the experience-pool size, the exploration rate and the learning rate; the reward function R specifies the reward for hoisting, setting down, staying and moving under empty and loaded conditions.
The reward function R is the environment's evaluation rule for the action the agent currently takes; it is fed back to the agent as an immediate reward and is the key guiding signal from which the agent learns and improves its policy. In general, the scheduling objective of the cranes is to minimize the total time needed to complete all tasks, so each action a crane performs additionally reduces the reward by 0.5. The cranes therefore have to complete all tasks with as few actions as possible, which minimizes the total completion time.
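A hedged sketch of such a reward function is shown below. The per-action cost of 0.5 comes from the text; all other values are illustrative assumptions, since the concrete settings of Table 2 are only available as an image in the original publication.

```python
# Sketch of the reward function R; only the -0.5 per-action cost is taken from the
# text, all other values are assumed for illustration.
LEFT, RIGHT, STAY, HOIST, DROP = 0, 1, 2, 3, 4
STEP_COST = -0.5                    # every executed action reduces the reward by 0.5

def reward(action, loaded, at_start_station=False, at_end_station=False):
    r = STEP_COST
    if action == HOIST and not loaded and at_start_station:
        r += 10.0                   # assumed bonus for a valid pick-up
    elif action == DROP and loaded and at_end_station:
        r += 10.0                   # assumed bonus for a valid delivery
    elif action in (HOIST, DROP):
        r -= 5.0                    # assumed penalty for an invalid hoist/drop
    return r
```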
The action-value network (Q network) in the DQN algorithm of the invention is written as Q^π(s, a), in parameterized form Q^π(s, a | θ), where s is the environment state, a is the agent's action and θ are the Q-network parameters. The optimal policy of the action-value network is solved iteratively as

$$\theta_{k+1} = \theta_k + \alpha\Big[r_t + \gamma \max_{a_{t+1}} Q^{\pi}\big(s_{t+1}, a_{t+1}\,\big|\,\theta_k^-\big) - Q^{\pi}\big(s_t, a_t\,\big|\,\theta_k\big)\Big]\,\nabla_{\theta} Q^{\pi}\big(s_t, a_t\,\big|\,\theta_k\big)$$

$$\pi^{*}(a\mid s) = \arg\max_{a} Q^{\pi}(s, a\mid\theta)$$

wherein θ_{k+1} are the Q-network parameters at iteration k+1; θ_k the Q-network parameters at iteration k; α the learning rate, in the range 0 to 1; r_t the immediate reward at time t, computed from the reward function R; γ the reward discount factor, in the range 0 to 1; Q^π(s_t, a_t | θ_k) the Q network at time t in iteration k, with s_t the environment state and a_t the agent action at time t; Q^π(s_{t+1}, a_{t+1} | θ_k^-) the target Q network at the next time step in iteration k, with s_{t+1}, a_{t+1} the environment state and action at the next time step; θ_k^- the target-Q-network parameters in iteration k; and ∇_θ Q^π(s_t, a_t | θ_k) the gradient of the Q network at time t in iteration k with respect to the parameters θ.
π*(a|s) is the agent action found from the Q-network output that maximizes the total reward of the action sequence.
In every iteration, all cranes in the bay complete all transport tasks. From the start time to the end time of an iteration, the Q network of each crane takes the environment state s_t at the current time, outputs the action a_t to execute, and receives the immediate reward r_t; after the iteration is finished, the total task-completion time and the total reward are obtained.
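The update above is the standard DQN semi-gradient step, with the bracketed temporal-difference error scaling the gradient of the online Q network. A minimal PyTorch sketch is given below; the network, optimizer and batch objects are assumptions, and the batch is taken from the memory pool described later.

```python
# Minimal sketch of one DQN update; q_net, target_net and optimizer are assumed objects.
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.8):
    s, a, r, s_next, done = batch                            # tensors sampled from the memory pool
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s_t, a_t | theta_k)
    with torch.no_grad():                                    # target-network parameters held fixed
        td_target = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)
    loss = F.mse_loss(q_sa, td_target)                       # squared TD error
    optimizer.zero_grad()
    loss.backward()                                          # gradient w.r.t. theta through Q(s_t, a_t)
    optimizer.step()                                         # parameter step theta_k -> theta_{k+1}
    return loss.item()
```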
Step 3, performing parameter optimization and training on the deep reinforcement learning model.
In this embodiment, the historical crane transport-task table includes the task number, start station, target station, start time and end time, as shown in Table 1. Task numbers are generated automatically when transport tasks are assigned in the actual production process, and the historical task list is sorted by task number in ascending order.
TABLE 1 Historical crown block transportation task table

Task number | Starting time | Completion time | Initial station | Destination station
001         | St001         | Et001           | Station A       | Station B
002         | St002         | Et002           | Station B       | Station C
003         | St003         | Et003           | Station C       | Station D
In this embodiment, the deep reinforcement learning model is trained on the historical crane transport-task data. The training process for two cranes is shown in fig. 4. The case of more than two cranes is analogous: each crane is an agent with the same structure, and at each time t the agents are trained alternately in turn. The environment state observed by the next crane reflects the operation just executed by the previous crane. The immediate rewards of all cranes at time t are summed into a composite immediate reward and stored in each crane's memory pool, which is used to update its Q network.
In fig. 4, (1) to (9) show the sequence of the training steps for the two crane Q networks. In step (1), crane 1 observes the environment state s_t^1 at time t. In step (2), the Q network of crane 1 takes s_t^1 as input and outputs the action a_t^1 of crane 1 at time t. In step (3), crane 1 stores its current environment state and action, together with the environment state s_t^2 currently observed by crane 2, in the memory pool of crane 1. In step (4), the Q network of crane 2 obtains the environment state s_t^2 at the current time t. In step (5), the Q network of crane 2 outputs its current action a_t^2. In step (6), crane 2 stores its current environment state and action, together with the environment state s_{t+1}^1 observed by crane 1 at the next time step, in the memory pool of crane 2; s_{t+1}^1 is the environment state observed by crane 1 at time t+1. In step (7), the composite immediate reward of the two Q-network outputs of crane 1 and crane 2 is computed from the environment feedback: r_t^1 is the reward of crane 1 at time t, r_t^2 the reward of crane 2 at time t, and the composite immediate reward of the two cranes at time t is r_t = r_t^1 + r_t^2. In step (8), the computed composite immediate reward is stored in each crane's memory pool. In step (9), the weights over all possible action combinations of the cranes are updated according to the composite immediate reward. During the iterations, experience is periodically sampled from the memory pools to update the Q networks.
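The alternating sequence (1) to (9) can be summarized in a short sketch. The environment interface (observe, apply, immediate_reward) and the agent and memory objects are assumptions used only to make the step order explicit.

```python
# Sketch of one alternating training step for two cranes, following steps (1)-(9) of Fig. 4.
# env, agents and memories are assumed objects; env.observe returns the state matrix seen
# by the given crane, env.apply executes an action, env.immediate_reward evaluates it.
def train_step(env, agents, memories):
    s1 = env.observe(crane=0)                      # (1) crane 1 observes s_t^1
    a1 = agents[0].act(s1)                         # (2) crane 1's Q network outputs a_t^1
    env.apply(crane=0, action=a1)
    s2 = env.observe(crane=1)                      # (4) crane 2 observes s_t^2 after crane 1 acted
    a2 = agents[1].act(s2)                         # (5) crane 2's Q network outputs a_t^2
    env.apply(crane=1, action=a2)
    s1_next = env.observe(crane=0)                 # (6) state seen by crane 1 at t+1
    r_total = env.immediate_reward(crane=0) + env.immediate_reward(crane=1)  # (7) r_t = r_t^1 + r_t^2
    memories[0].append((s1, a1, r_total, s2))      # (3)+(8) crane 1's memory pool
    memories[1].append((s2, a2, r_total, s1_next)) # (6)+(8) crane 2's memory pool
    # (9) each Q network is periodically updated from experience sampled from its memory pool
```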
Step 4, obtaining the optimized deep reinforcement learning model.
In this embodiment, the parameters of the deep reinforcement learning model, including the Q-network structure, the model factors and the reward function, are adjusted and optimized according to the training results on the historical transport-task data. The optimized result is as follows:
the Q network structure is composed of 2 convolutional layers and 2 fully-connected layers, wherein the convolutional cores of the convolutional layers are 3x3, the neurons of the fully-connected layers are 2048, and the ReLU function is adopted as an activation function. In the model factors, the learning rate is 0.0005, the discount factor is 0.8, the size of the experience pool is 100 ten thousand, and the exploration rate is 0.3. The bonus function settings are shown in table 2.
Table 2 Reward function setting table
(The table is provided as an image in the original publication; it lists the reward value for each crane operation, i.e. hoisting, setting down, staying and moving, under no-load and full-load conditions.)
Step 5, at a fixed time interval, acquiring the current positions and states of the cranes in the bay and the status of the transport tasks being executed and waiting to be executed, generating the corresponding environment state, inputting it into the trained deep reinforcement learning model, and outputting the current crane scheduling scheme, namely the action each crane should execute; cranes are then matched to the pending transport tasks according to the scheduling scheme and their transport paths are assigned.
In this embodiment, in practical application the trained and optimized crane-scheduling deep reinforcement learning model is first deployed. Second, the positions and states of the cranes in the bay at the current moment and the start and target stations of the transport tasks being executed and waiting to be executed are collected. Third, the environment state of the model is updated with the information obtained in the second step; after reading the environment state, the deep reinforcement learning model generates a crane scheduling scheme, and by updating the crane positions and the hoist/lower operations in the environment state it indicates which crane is assigned to each pending task and how that crane has to move to carry out the transport. When indicating crane movements, the crane's travel track and possible avoidance or waiting are taken into account.
On the time line, the second and third steps of this process are therefore repeated at the fixed time interval until all transport tasks in the bay are completed, and the deep reinforcement learning model dynamically generates optimized crane scheduling schemes for transport tasks that arise at random or change at short notice.
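A hedged sketch of this online loop is given below; the plant interface, the build_state and dispatch callbacks and the polling interval are assumptions used only to show the shape of the loop.

```python
# Sketch of the online dispatching loop of step 5. `plant` is an assumed interface to the
# bay; `build_state` encodes the 3 x N state matrix and `dispatch` issues crane commands.
import time

def run_dispatcher(model, plant, build_state, dispatch, interval_s=5.0):
    while plant.has_open_tasks():                     # until all tasks in the bay are finished
        snapshot = plant.read_positions_and_tasks()   # crane positions/states, running and pending tasks
        state = build_state(snapshot)                 # 3 x N environment-state matrix
        actions = model.select_actions(state)         # one operation (0..4) per crane
        dispatch(plant, actions)                      # assign cranes to tasks and issue movements
        time.sleep(interval_s)                        # repeat at the fixed time interval
```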
The prior art provides reinforcement-learning-based crane-scheduling simulation methods, which require a large number of steps such as agent modelling and the construction of rules and strategies and are not suited to practical application scenarios. Existing simulation methods can only simulate and compare a number of scheduling schemes for a task set fixed at a given moment and then select the best one. Under actual production conditions tasks are generated continuously, and if a large number of scheduling schemes had to be generated, compared and selected each time, the efficiency requirements of online application clearly could not be met. Compared with the prior art, the crane scheduling method based on a deep reinforcement learning model saves a large number of steps such as agent modelling and the construction of rules and strategies, is generic with respect to the applicable scenarios, and can be used directly to build and operate a practical crane scheduling model. The method trains the crane scheduling model on historical tasks; the trained, optimized model can be transplanted directly into the crane scheduling system and delivers a scheduling scheme in a single run, meeting the requirements of online application without renewed training or simulation-based comparison.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A metallurgical industry workshop overhead traveling crane scheduling method based on deep reinforcement learning is characterized by comprising the following steps:
step 1, acquiring the spatial layout of a cross region where a crown block is located in a metallurgical workshop and a historical crown block transportation task data table;
the transportation task data comprises a task number, a starting station, a target station, a starting time and an ending time;
step 2, according to the cross-region space layout, a crown block is used as an intelligent agent, the cross-region space is used as an environment, and a deep reinforcement learning model is created;
the deep reinforcement learning model adopts a DQN algorithm to design a feedback mechanism frame of 'action-environment state-reward', each crown block in a cross region is abstracted into a single intelligent body, the cross region where the crown block is located is abstracted into an environment state, the action of the intelligent body is the operation action of the crown block, and the states observed by the intelligent body are task information and state information of all crown blocks in the cross region; the environment state comprises the positions of a task starting station and a task finishing station in a bay area, and the positions and the states of all crown blocks in the same bay area; the set reward function is a reward value when the crown block executes different operations under the conditions of no load and full load, and the reward value is fed back to the intelligent agent in an immediate reward mode;
step 3, performing parameter optimization and training on the deep reinforcement learning model according to a historical crown block transportation task data table;
step 4, acquiring the position and the state of the current crown block in the cross-region at regular time, and transportation task data which are being executed and are to be executed, and generating an environment state;
and 5, inputting the environmental state into the trained deep reinforcement learning model, outputting the operation action to be executed by each overhead traveling crane, generating a current overhead traveling crane dispatching scheme, matching the overhead traveling cranes for the transportation tasks according to the current overhead traveling crane dispatching scheme, and designating the transportation paths of the overhead traveling cranes, and continuing to execute the step 4 until all the transportation tasks in the cross-region are completed.
2. The method for dispatching the crown blocks in the metallurgical industry workshop based on the deep reinforcement learning as claimed in claim 1, wherein the spatial layout of the bay where the crown block is located obtained in step 1 refers to obtaining an overhead layout of the bay in the steelmaking workshop, obtaining the length and width of the bay, and obtaining the relative distance between all stations in the bay and the edge of the bay in the transverse direction and the longitudinal direction.
3. The method for scheduling the crown block in the metallurgical industry workshop based on the deep reinforcement learning as claimed in claim 1, wherein in the step 2, the environmental state is expressed as a 3x N matrix; the column number N of the matrix is a positive integer, and the trans-regional space is represented as N positions; the first, second and third rows of the matrix are the relative positions of a task starting station, a crown block and a task end station in a cross area respectively; when a crown block transportation task is generated, the value of the corresponding position of the starting station and the terminal station in the matrix is the serial number of the task; the position value of the crown block corresponding to the second row of the matrix is the number of the crown block; the position values of the first row and the third row of the matrix without tasks are 0, and the position value of the second row of the matrix without overhead travelling cranes is 0; after the overhead traveling crane hoists the task, the position value of the starting station of the task becomes 0, the position value of the overhead traveling crane becomes the combination of the number of the overhead traveling crane as an integer and the number of the task as a decimal; when the crane puts down the task, the position value of the station at the task end point becomes 0, and the position value of the crown block becomes the number of the crown block.
4. The method for dispatching the crown blocks in the metallurgical industry workshop based on the deep reinforcement learning as claimed in claim 1, wherein in the step 2, the operating actions of the crown blocks are specified to be 5, namely, left-moving, right-moving, static, hoisting and ladle-placing, which are sequentially represented by 0,1,2,3,4.
5. The method for dispatching the crown blocks in the metallurgical industry workshop based on the deep reinforcement learning as claimed in claim 1 or 4, wherein the reward function comprises reward values of the crown blocks during carrying out ladle lifting, ladle placing, standing and moving under the conditions of no load and full load.
6. The metallurgical industry workshop overhead traveling crane dispatching method based on deep reinforcement learning of claim 5, wherein in the reward function the reward value is additionally reduced by 0.5 each time a crane performs an action.
7. The method for dispatching overhead traveling cranes in a metallurgical industry workshop based on deep reinforcement learning according to claim 1, wherein in step 2 the optimal policy of the action-value network (Q network) of each crane is solved iteratively as follows:

$$\theta_{k+1} = \theta_k + \alpha\Big[r_t + \gamma \max_{a_{t+1}} Q^{\pi}\big(s_{t+1}, a_{t+1}\,\big|\,\theta_k^-\big) - Q^{\pi}\big(s_t, a_t\,\big|\,\theta_k\big)\Big]\,\nabla_{\theta} Q^{\pi}\big(s_t, a_t\,\big|\,\theta_k\big)$$

$$\pi^{*}(a\mid s) = \arg\max_{a} Q^{\pi}(s, a\mid\theta)$$

wherein Q^π(s, a | θ) denotes the Q network, s the environment state, a the agent's action and θ the Q-network parameters; θ_{k+1} are the Q-network parameters at iteration k+1 and θ_k those at iteration k; α is the learning rate, in the range 0 to 1; γ the reward discount factor, in the range 0 to 1; s_t, s_{t+1} the environment states at times t and t+1; a_t, a_{t+1} the agent actions at times t and t+1; r_t the immediate reward at time t; Q^π(s_t, a_t | θ_k) the Q network at time t in iteration k; Q^π(s_{t+1}, a_{t+1} | θ_k^-) the target Q network at time t+1 in iteration k; θ_k^- the target-Q-network parameters in iteration k; ∇_θ Q^π(s_t, a_t | θ_k) the gradient of the Q network at time t in iteration k with respect to the parameters θ; and π*(a|s) denotes the agent action found from the Q-network output that maximizes the total reward of the action sequence;
all the crown blocks in the cross area complete all the transportation tasks in each iteration process, the Q network of each crown block obtains the environmental state of the current time from the starting time to the ending time in one iteration process, the execution action is output, the immediate reward is obtained, and the total time and the total reward for completing the tasks are obtained after one iteration is completed.
8. The method for dispatching the crown blocks in the metallurgical industry workshop based on the deep reinforcement learning according to claim 1, wherein in the step 3, the number of crown blocks in a cross area is larger than 1, each crown block is modeled as an agent with the same structure, at the moment t, the agents are alternately trained in sequence, the environmental state observed by the next crown block is obtained based on the operation executed by the previous crown block, and the immediate rewards of all crown blocks at the moment t are summed and stored in a memory pool as a comprehensive immediate reward.
CN202111142373.9A 2021-09-16 2021-09-28 Metallurgical industry workshop crown block scheduling method based on deep reinforcement learning Active CN113837628B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111096536 2021-09-16
CN2021110965364 2021-09-16

Publications (2)

Publication Number Publication Date
CN113837628A CN113837628A (en) 2021-12-24
CN113837628B true CN113837628B (en) 2022-12-09

Family

ID=78966970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111142373.9A Active CN113837628B (en) 2021-09-16 2021-09-28 Metallurgical industry workshop crown block scheduling method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113837628B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115640986B (en) * 2022-12-13 2023-03-28 北京云迹科技股份有限公司 Robot scheduling method, device, equipment and medium based on rewards


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633772A (en) * 2021-01-05 2021-04-09 东华大学 Multi-agent deep reinforcement learning and scheduling method for textile fabric dyeing workshop
CN112884239A (en) * 2021-03-12 2021-06-01 重庆大学 Aerospace detonator production scheduling method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Simulation model for workshop overhead crane scheduling based on an immune genetic algorithm; Zheng Zhong et al.; Systems Engineering - Theory & Practice; 2013-01-15 (No. 01); full text *
Multi-objective modelling and solution of overhead crane scheduling in a continuous casting workshop under spatio-temporal constraints; Gao Xiaoqiang et al.; Systems Engineering - Theory & Practice; 2017-09-25 (No. 09); full text *

Also Published As

Publication number Publication date
CN113837628A (en) 2021-12-24

Similar Documents

Publication Publication Date Title
CN112734172B (en) Hybrid flow shop scheduling method based on time sequence difference
Yu et al. Optimizing task scheduling in human-robot collaboration with deep multi-agent reinforcement learning
CN111985672B (en) Single-piece job shop scheduling method for multi-Agent deep reinforcement learning
CN113837628B (en) Metallurgical industry workshop crown block scheduling method based on deep reinforcement learning
CN112446642A (en) Multi-crown-block scheduling optimization method and system
Liu et al. An improved genetic algorithm with modified critical path-based searching for integrated process planning and scheduling problem considering automated guided vehicle transportation task
CN116500986A (en) Method and system for generating priority scheduling rule of distributed job shop
CN112348314A (en) Distributed flexible workshop scheduling method and system with crane
CN111353646A (en) Steel-making flexible scheduling optimization method with switching time, system, medium and equipment
CN107357267B (en) The method for solving mixed production line scheduling problem based on discrete flower pollination algorithm
Zhao et al. Model and heuristic solutions for the multiple double-load crane scheduling problem in slab yards
CN112732436A (en) Deep reinforcement learning acceleration method of multi-core processor-single graphics processor
Hirashima et al. A Q-learning for group-based plan of container transfer scheduling
Hani et al. Simulation based optimization of a train maintenance facility
CN106019940A (en) UKF (Unscented Kalman Filter) neural network-based converter steelmaking process cost control method and system
CN114237222A (en) Method for planning route of delivery vehicle based on reinforcement learning
Zeng et al. A method integrating simulation and reinforcement learning for operation scheduling in container terminals
CN116957177A (en) Flexible workshop production line planning method, system, equipment and medium
CN106865418A (en) A kind of control method of coil of strip reservoir area loop wheel machine equipment
Li et al. An Efficient 2-opt Operator for the Robotic Task Sequencing Problem
Fattahi et al. A hybrid genetic algorithm and parallel variable neighborhood search for jobshop scheduling with an assembly stage
CN114675647A (en) AGV trolley scheduling and path planning method
JP3347006B2 (en) Planning device and planning method
Yuan et al. Flexible Assembly Shop Scheduling Based on Improved Genetic Algorithm
Roh et al. Optimal scheduling of block lifting in consideration of the minimization of traveling distance while unloaded and wire and shackle replacement of a gantry crane

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant