CN116720703A - AGV multi-target task scheduling method and system based on deep reinforcement learning

AGV multi-target task scheduling method and system based on deep reinforcement learning

Info

Publication number
CN116720703A
Authority
CN
China
Prior art keywords
module
agv
task
task scheduling
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310726554.9A
Other languages
Chinese (zh)
Inventor
吴小倩
郑益民
吴庆耀
秦卓睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Bangqi Technology Intelligent Development Co ltd
Original Assignee
Shenzhen Bangqi Technology Intelligent Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Bangqi Technology Intelligent Development Co ltd filed Critical Shenzhen Bangqi Technology Intelligent Development Co ltd
Priority to CN202310726554.9A
Publication of CN116720703A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00: Administration; Management
    • G06Q 10/06: Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q 10/063: Operations research, analysis or management
    • G06Q 10/0631: Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/092: Reinforcement learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00: Administration; Management
    • G06Q 10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"


Abstract

The invention discloses an AGV multi-target task scheduling method based on deep reinforcement learning, which comprises the following steps: acquiring real-time environment state information and making decisions through a deep Q network to obtain an optimal scheduling sequence; constructing a deep reinforcement learning task scheduling model; training the task scheduling model with samples consisting of the processed real-time environment state information, actions and rewards; and deploying the trained task scheduling model for prediction. The invention dynamically schedules AGV tasks in a specific scene in real time using deep reinforcement learning, and introduces a Transformer module so that the model attends to global task information, learns the characteristics and importance of different tasks, and learns general knowledge shared among tasks. The reinforcement learning reward function is improved by introducing AGV collision waiting time, so that the method considers not only the goal of reducing wasted environmental resources but also the goal of reducing traffic collisions and blocking while AGVs are driving, which better fits AGV task scheduling in a specific scene.

Description

AGV multi-target task scheduling method and system based on deep reinforcement learning
Technical Field
The invention relates to the technical field of AGV scheduling, in particular to an AGV multi-target task scheduling method and system based on deep reinforcement learning.
Background
In recent years, artificial intelligence and its related industries have developed rapidly and become a focus of attention for academia, industry and governments worldwide; China's State Council has issued the "New Generation Artificial Intelligence Development Plan", underscoring the national strategic status of artificial intelligence research and industry.
The prior publication CN114912809A discloses a DDQN-based AGV scheduling method for automated container terminals. Aiming at data conversion of the actual conditions of an automated container terminal, it builds an AGV task-allocation scheduling model for the terminal's horizontal transport link by mathematical modeling; converts that AGV scheduling model into a reinforcement learning DDQN model using a Markov decision process (MDP); constructs the DDQN deep reinforcement learning neural network Q network; programs and simulates external environment changes in the DDQN model and trains the Q network model; and packages the trained Q network model into a real-time online dispatching system for the terminal's horizontal transport AGVs, which schedules the terminal AGVs in real time and efficiently. By introducing the deep reinforcement learning DDQN method into the AGV task-allocation scheduling problem of the automated container terminal's horizontal transport link, the method achieves real-time, efficient task dispatching for AGVs facing a dynamic terminal environment, improving AGV utilization and transport efficiency.
However, although that method improves transport efficiency, it emphasizes only the dispatching rules of the automated container terminal's horizontal transport link, ignores the real-time dynamic task information of the whole system, applies only to that AGV scheduling scene, and cannot account for collision waiting time during actual AGV operation. How to combine an attention mechanism and general task knowledge in the deep reinforcement learning process, while also pursuing the goal of reducing traffic collisions and blocking while AGVs are driving, therefore remains a problem to be solved in task scheduling.
Disclosure of Invention
In order to overcome the defects in the prior art, the embodiments of the invention provide an AGV multi-target task scheduling method and system based on deep reinforcement learning, which introduce a Transformer with an attention mechanism so that the model attends to global task information, learns the characteristics and importance of different tasks, and learns general knowledge shared among tasks; at the same time, the goal of reducing traffic collisions and blocking while AGVs are driving is taken into account, which better fits AGV task scheduling in a specific scene, and such multi-target scheduling greatly improves task execution efficiency.
In order to achieve the above purpose, the present invention provides the following technical solutions:
the invention provides an AGV multi-target task scheduling method based on deep reinforcement learning, comprising the following steps:
S1: collecting and preprocessing real-time environment state information in the system;
S2: constructing a task scheduling model, the task scheduling model comprising an AGV module, a task module, a map module and a prediction module;
S3: using the preprocessed real-time environment state information together with the corresponding actions and rewards as samples to train the task scheduling model;
the process of training the task scheduling model in step S3 is as follows:
S31: dividing each sample into an AGV state, a map state and a task state, inputting these into the AGV module, the map module and the task module respectively, and concatenating the feature vectors output by the AGV module and the map module with the feature vector output by the task module, into which the Transformer is introduced, to obtain the feature vector of the current input training sample;
S32: inputting the feature vector of the current input training sample into the prediction module to obtain the predicted rewards of the corresponding actions, computing the mean square error between the actual prediction and the target model's prediction as the objective function of the training model, and training the task scheduling model by stochastic gradient descent to finally obtain a trained task scheduling model;
S4: performing deployment prediction with the trained task scheduling model, wherein real-time environment state information is input into the trained task scheduling model and the next scheduling option is predicted from the model's output.
Preferably, the real-time environment state information collected in S1 comprises an AGV state, a map state and a task state, wherein the AGV state comprises a departure point, a target point and the running time of the current task; the map state comprises the warehouse entry point, the warehouse entry buffer zone, the high-level buffer zone, the warehouse exit buffer zone, the AGV waiting queue at the warehouse exit point, the running time of tasks in progress and the AGV queue in the waiting area; and the task state comprises the corresponding target point of the task for each departure point.
Preferably, the training samples used in S3 to train the task scheduling model are expressed as:
x_i = (s_t, a_t, R_t, s_{t+1})
a_t = (agv_k, task_l)
where x_i is the i-th training sample, s_t is the environment state at time t, a_t is the action taken at time t, meaning that a specific AGV is assigned to a specific task, and R_t is the reward obtained after taking the action at time t, comprising the idle-waiting loss of the resources in the map module at time t and the AGV collision-waiting loss at time t, so that the method reduces AGV collision waiting time while also attending to wasted map resources.
Preferably, the feature vector of the current input training sample in S31 is calculated as follows:
A1, computing the embedding matrix of the current input training sample, with the specific formula:
E = [x_{i1}, x_{i2}, \ldots, x_{in}] M
where E denotes the embedding matrix of the samples input to the current module, x_{i1}, x_{i2}, ..., x_{in} denote the input information required by the current module, and M denotes a learnable embedding transformation matrix;
A2, obtaining the query matrix Q, the key matrix K and the value matrix V from the embedding matrix through different linear transformations, with the specific formulas:
Q = E W_Q
K = E W_K
V = E W_V
where W_Q, W_K and W_V denote linear transformation matrices;
A3, attention-weighting Q, K and V, then performing normalization and a fully connected linear operation to finally obtain the feature vector, with the specific formulas:
S = softmax(Q K^T / \sqrt{d_k}) V
\bar{S} = LN(S + E)
z = W_1 \bar{S} + b_1
h = LN(z + \bar{S})
where S is the attention-weighted result, \bar{S} is the normalization result, z is the result of the fully connected linear operation, h is the feature vector, \sqrt{d_k} is the scaling factor, W_1 and b_1 are adaptively learnable parameters, LN(·) denotes the layer normalization operation, and softmax(·) denotes the activation function used to compute the attention scores of the real-time state in the task module; in addition, jump (residual) connections are employed after the attention computation and after the feed-forward computation in the self-attention module of the Transformer, as the formulas for \bar{S} and h show.
Preferably, the next scheduling option predicted in S4 is obtained from the feature vector h: in the prediction module, h is passed through a single fully connected layer and a Softmax operation to obtain a score for each scheduling option, and the scheduling option with the highest score is taken as the prediction of the next scheduling option in the current state.
The invention further provides an AGV multi-target task scheduling system based on deep reinforcement learning, to which the above AGV multi-target task scheduling method based on deep reinforcement learning is applied, comprising a preprocessing module, a model construction module, a model training module, a model prediction module and a data storage module;
the preprocessing module is used for collecting and preprocessing real-time environment state information in the system;
the model construction module is used for constructing the task scheduling model, which comprises an AGV module, a task module, a map module and a prediction module;
the model training module is used for training the task scheduling model with samples consisting of the preprocessed real-time environment state information and the corresponding actions and rewards;
the model prediction module is used for deployment prediction with the trained task scheduling model, wherein real-time environment state information is input into the trained task scheduling model and the next scheduling option is predicted from the model's output;
the data storage module is used for storing the data information of the whole task scheduling process.
The technical effects and advantages of the invention are:
1. The AGV multi-target task scheduling method based on deep reinforcement learning provided by the invention uses a self-attention mechanism to process task states; this gives strong modeling capability over the data, enables the model to attend to global task information, learn the characteristics of different tasks and learn general task knowledge, and greatly improves task execution efficiency.
2. The task scheduling method provided by the invention dynamically feeds back the real-time environment state while using the Transformer to focus on the more important tasks; by introducing AGV collision waiting time, it accounts for traffic collisions and blocking while AGVs are driving, and thus better fits AGV task scheduling in a specific scene.
Drawings
Fig. 1 is a schematic diagram of the training process of the task scheduling model according to an embodiment of the present invention.
Fig. 2 is a schematic overall flow chart of an AGV multi-objective task scheduling method based on deep reinforcement learning according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of an AGV multi-objective task scheduling system based on deep reinforcement learning according to an embodiment of the present invention.
Detailed Description
The following is a clear and complete description of the embodiments of the present invention with reference to the accompanying drawings; it is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
As shown in fig. 1 and 2, in an embodiment of the present invention, there is provided an AGV multi-target task scheduling method based on deep reinforcement learning, comprising the following steps:
S1: collecting and preprocessing real-time environment state information in the system;
further, the real-time environment state information collected in S1 comprises an AGV state, a map state and a task state, wherein the AGV state comprises a departure point, a target point and the running time of the current task; the map state comprises the warehouse entry point, the warehouse entry buffer zone, the high-level buffer zone, the warehouse exit buffer zone, the AGV waiting queue at the warehouse exit point, the running time of tasks in progress and the AGV queue in the waiting area; and the task state comprises the corresponding target point of the task for each departure point.
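To make the three state groups concrete, the following is a minimal illustrative sketch, not part of the patent disclosure, of how they might be packed into tensors; PyTorch and all field names and dimensions are assumptions introduced here for exposition only.

```python
import torch

def encode_states(agvs, map_info, tasks):
    """Pack the three state groups described above into tensors (hypothetical layout)."""
    # AGV state: departure point id, target point id, current task running time
    agv_state = torch.tensor(
        [[a["departure"], a["target"], a["runtime"]] for a in agvs],
        dtype=torch.float32)
    # Map state: queue lengths / runtimes of the zones and queues listed above
    map_state = torch.tensor(
        [map_info["entry_point_queue"], map_info["entry_buffer"],
         map_info["high_level_buffer"], map_info["exit_buffer"],
         map_info["exit_point_queue"], map_info["running_task_time"],
         map_info["waiting_area_queue"]], dtype=torch.float32)
    # Task state: (departure point id, target point id) per pending task
    task_state = torch.tensor(
        [[t["departure"], t["target"]] for t in tasks], dtype=torch.float32)
    return agv_state, map_state, task_state
```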
In this embodiment, it should be specifically noted that preprocessing the real-time environment state information means making a decision on the real-time environment information through the deep Q network to obtain an optimal scheduling sequence.
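As an illustration of such a decision step, the sketch below shows a conventional epsilon-greedy choice over a deep Q network's outputs; the patent states only that the deep Q network makes the decision, so the exploration scheme and every name here are assumptions.

```python
import random
import torch

def select_action(q_network, state, n_actions, epsilon=0.1):
    """Epsilon-greedy decision over (AGV, task) pairings (assumed scheme)."""
    if random.random() < epsilon:
        # Explore: pick a random scheduling action
        return random.randrange(n_actions)
    with torch.no_grad():
        # Exploit: pick the action with the highest predicted Q-value
        return int(q_network(state).argmax().item())
```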
S2: the method comprises the steps of constructing a task scheduling model, wherein the task scheduling model comprises an AGV module, a task module, a map module and a prediction module;
in this embodiment, it is specifically required to explain that there are a plurality of material transportation tasks and a plurality of AGV trolleys in the manufacturing system, under a certain constraint condition, the material transportation tasks need to be reasonably assigned to each AGV trolley, and the execution sequence is specified, and meanwhile, the material transportation tasks are completed by real-time scheduling under the dynamic condition, so that the whole system meets a certain performance optimization index, and the process is task scheduling.
S3: taking the preprocessed environment real-time state information, corresponding actions and rewards as a sample training task scheduling model;
further, the training sample expression of the sample training task scheduling model in S3 is:
xi=(s t ,a t ,R t ,s t+1 )
a t =(agv k ,task l )
wherein xi is an ith training sample, st is an environmental state at t, at is an action taken at t, which means that a specific AGV is allocated to a specific task, and Rt is a reward obtained after the action is taken at t, including idle waiting loss of resources in a map module at t and collision waiting loss at t AGV, so that the method can reduce time consumption of collision waiting of the AGV while focusing on map resource waste.
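The sketch below renders the sample tuple x_i = (s_t, a_t, R_t, s_{t+1}) and the two-part reward in code; the weighting of the two loss terms and the use of a replay buffer are assumptions, since the patent does not disclose how samples are stored or how the terms are balanced.

```python
import random
from collections import deque, namedtuple

# One sample x_i = (s_t, a_t, R_t, s_{t+1}); the action is an (AGV, task) pair.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

def compute_reward(idle_wait_loss, collision_wait_loss, w_idle=1.0, w_coll=1.0):
    """R_t penalizes both idle waiting of map resources and AGV collision
    waiting; the weights w_idle and w_coll are assumed, not from the patent."""
    return -(w_idle * idle_wait_loss + w_coll * collision_wait_loss)

class ReplayBuffer:
    """Standard DQN experience store (assumed; the patent describes the samples
    but not how they are buffered)."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```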
The process of training the task scheduling model in the step S3 is as follows:
S31: dividing each sample into an AGV state, a map state and a task state, inputting these into the AGV module, the map module and the task module respectively, and concatenating the feature vectors output by the AGV module and the map module with the feature vector output by the task module, into which the Transformer is introduced, to obtain the feature vector of the current input training sample;
further, the specific calculation step of the feature vector of the currently input training sample in S31 is as follows:
A1, computing the embedding matrix of the current input training sample, with the specific formula:
E = [x_{i1}, x_{i2}, \ldots, x_{in}] M
where E denotes the embedding matrix of the samples input to the current module, x_{i1}, x_{i2}, ..., x_{in} denote the input information required by the current module, and M denotes a learnable embedding transformation matrix;
A2, obtaining the query matrix Q, the key matrix K and the value matrix V from the embedding matrix through different linear transformations, with the specific formulas:
Q = E W_Q
K = E W_K
V = E W_V
where W_Q, W_K and W_V denote linear transformation matrices;
A3, attention-weighting Q, K and V, then performing normalization and a fully connected linear operation to finally obtain the feature vector, with the specific formulas:
S = softmax(Q K^T / \sqrt{d_k}) V
\bar{S} = LN(S + E)
z = W_1 \bar{S} + b_1
h = LN(z + \bar{S})
where S is the attention-weighted result, \bar{S} is the normalization result, z is the result of the fully connected linear operation, h is the feature vector, \sqrt{d_k} is the scaling factor, W_1 and b_1 are adaptively learnable parameters, LN(·) denotes the layer normalization operation, and softmax(·) denotes the activation function used to compute the attention scores of the real-time state in the task module; in addition, jump (residual) connections are employed after the attention computation and after the feed-forward computation in the self-attention module of the Transformer, as the formulas for \bar{S} and h show.
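The following sketch assembles the reconstructed A1-A3 computation into a single module: embedding, scaled dot-product self-attention, and a fully connected layer, each followed by a jump (residual) connection and layer normalization; the hidden sizes and the single-head, single-layer form are assumptions, not disclosed by the patent.

```python
import math
import torch
import torch.nn as nn

class TaskSelfAttention(nn.Module):
    """Illustrative task-module encoder following steps A1-A3 above."""
    def __init__(self, in_dim, d_model=64):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model, bias=False)  # learnable M
        self.w_q = nn.Linear(d_model, d_model, bias=False)   # W_Q
        self.w_k = nn.Linear(d_model, d_model, bias=False)   # W_K
        self.w_v = nn.Linear(d_model, d_model, bias=False)   # W_V
        self.ln1 = nn.LayerNorm(d_model)
        self.ffn = nn.Linear(d_model, d_model)               # W_1, b_1
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: (n_tasks, in_dim)
        e = self.embed(x)                  # A1: embedding matrix E
        q, k, v = self.w_q(e), self.w_k(e), self.w_v(e)      # A2: Q, K, V
        attn = torch.softmax(
            q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
        s = attn @ v                       # A3: attention-weighted result S
        s_bar = self.ln1(s + e)            # jump connection + LayerNorm
        z = self.ffn(s_bar)                # fully connected linear operation
        return self.ln2(z + s_bar)         # feature vectors h
```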
S32: inputting the feature vector of the current input training sample into the prediction module to obtain the predicted rewards of the corresponding actions, computing the mean square error between the actual prediction and the target model's prediction as the objective function of the training model, and training the task scheduling model by stochastic gradient descent to finally obtain a trained task scheduling model.
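A minimal sketch of one such update follows, reusing the Transition tuples from the earlier sketch; the discount factor and optimizer settings are assumptions, as the patent specifies only the mean-square-error objective against a target model and stochastic gradient descent.

```python
import torch
import torch.nn.functional as F

def train_step(q_net, target_net, optimizer, batch, gamma=0.99):
    """One update: MSE between the online network's prediction for the taken
    actions and the target network's bootstrapped estimate (gamma assumed)."""
    states = torch.stack([t.state for t in batch])
    actions = torch.tensor([t.action for t in batch])
    rewards = torch.tensor([t.reward for t in batch], dtype=torch.float32)
    next_states = torch.stack([t.next_state for t in batch])

    # Q-value the online network assigns to each taken action
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target model's prediction: reward plus discounted best next value
        q_target = rewards + gamma * target_net(next_states).max(dim=1).values
    loss = F.mse_loss(q_pred, q_target)   # mean square error objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                      # stochastic gradient descent update
    return loss.item()
```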
S4: performing deployment prediction with the trained task scheduling model, wherein real-time environment state information is input into the trained task scheduling model and the next scheduling option is predicted from the model's output.
Further, the next scheduling option predicted in S4 is a feature vectorAnd obtaining the score of each scheduling option through single-layer full-connection layer and Softmax operation in the prediction module, and taking the scheduling option with the highest score as the prediction result of the next scheduling option in the current state.
Based on the same idea as the deep-reinforcement-learning-based AGV multi-target task scheduling method in the above embodiment, the invention also provides an AGV multi-target task scheduling system based on deep reinforcement learning, which can be used to execute the method. For ease of illustration, the structural schematic of the system embodiment shows only the portions relevant to the embodiment of the present invention; those skilled in the art will appreciate that the illustrated structure does not limit the apparatus, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
As shown in fig. 3, in an embodiment of the present invention, an AGV multi-objective task scheduling system based on deep reinforcement learning is provided, which includes a preprocessing module, a model building module, a model training module, a model prediction module, and a data storage module;
the preprocessing module is used for collecting and preprocessing environment real-time state information in the system;
the model construction module is used for constructing the task scheduling model, which comprises an AGV module, a task module, a map module and a prediction module;
the model training module is used for training the task scheduling model with samples consisting of the preprocessed real-time environment state information and the corresponding actions and rewards;
further, training the task scheduling model specifically comprises:
dividing each sample into an AGV state, a map state and a task state, inputting these into the AGV module, the map module and the task module respectively, and concatenating the feature vectors output by the AGV module and the map module with the feature vector output by the task module, into which the Transformer is introduced, to obtain the feature vector of the current input training sample;
inputting the feature vector of the sample into the prediction module to obtain the predicted rewards of the corresponding actions, computing the mean square error between the prediction and the target model's prediction as the objective function of the training model, and training the task scheduling model by stochastic gradient descent to finally obtain a trained task scheduling model.
the model prediction module is used for deployment prediction with the trained task scheduling model, wherein real-time environment state information is input into the trained task scheduling model and the next scheduling option is predicted from the model's output;
the data storage module is used for storing data information in the whole task scheduling process.
In this embodiment, it should be specifically noted that the deep-reinforcement-learning-based AGV multi-target task scheduling system of the invention corresponds one-to-one with the deep-reinforcement-learning-based AGV multi-target task scheduling method of the invention, so the technical features and beneficial effects described in the method embodiment above apply equally to this system embodiment; for details, refer to the description of the method embodiment, which is not repeated here.
In addition, in the deep-reinforcement-learning-based AGV multi-target task scheduling system of the above embodiment, the logical division into program modules is merely illustrative; in practical applications, the above functions may be allocated to different program modules as needed, for example to meet the configuration requirements of the corresponding hardware or to ease software implementation. That is, the internal structure of the system may be divided into different program modules to perform all or part of the functions described above.
Finally, the foregoing description covers only preferred embodiments of the invention and is not intended to limit it; any modifications, equivalents and alternatives falling within the spirit and principles of the invention are intended to be included within its scope.

Claims (6)

1. An AGV multi-target task scheduling method based on deep reinforcement learning, characterized by comprising the following steps:
S1: collecting and preprocessing real-time environment state information in the system;
S2: constructing a task scheduling model, the task scheduling model comprising an AGV module, a task module, a map module and a prediction module;
S3: using the preprocessed real-time environment state information together with the corresponding actions and rewards as samples to train the task scheduling model;
the process of training the task scheduling model in step S3 is as follows:
S31: dividing each sample into an AGV state, a map state and a task state, inputting these into the AGV module, the map module and the task module respectively, and concatenating the feature vectors output by the AGV module and the map module with the feature vector output by the task module, into which the Transformer is introduced, to obtain the feature vector of the current input training sample;
S32: inputting the feature vector of the current input training sample into the prediction module to obtain the predicted rewards of the corresponding actions, computing the mean square error between the actual prediction and the target model's prediction as the objective function of the training model, and training the task scheduling model by stochastic gradient descent to finally obtain a trained task scheduling model;
S4: performing deployment prediction with the trained task scheduling model, wherein real-time environment state information is input into the trained task scheduling model and the next scheduling option is predicted from the model's output.
2. The AGV multi-target task scheduling method based on deep reinforcement learning according to claim 1, characterized in that the real-time environment state information collected in S1 comprises an AGV state, a map state and a task state, wherein the AGV state comprises a departure point, a target point and the running time of the current task; the map state comprises the warehouse entry point, the warehouse entry buffer zone, the high-level buffer zone, the warehouse exit buffer zone, the AGV waiting queue at the warehouse exit point, the running time of tasks in progress and the AGV queue in the waiting area; and the task state comprises the corresponding target point of the task for each departure point.
3. The AGV multi-target task scheduling method based on deep reinforcement learning according to claim 1, characterized in that the training samples used in S3 to train the task scheduling model are expressed as:
x_i = (s_t, a_t, R_t, s_{t+1})
a_t = (agv_k, task_l)
where x_i is the i-th training sample, s_t is the environment state at time t, a_t is the action taken at time t, meaning that a specific AGV is assigned to a specific task, and R_t is the reward obtained after taking the action at time t, comprising the idle-waiting loss of the resources in the map module at time t and the AGV collision-waiting loss at time t, so that the method reduces AGV collision waiting time while also attending to wasted map resources.
4. The AGV multi-target task scheduling method based on deep reinforcement learning according to claim 1, characterized in that the feature vector of the current input training sample in S31 is calculated as follows:
A1, computing the embedding matrix of the current input training sample, with the specific formula:
E = [x_{i1}, x_{i2}, \ldots, x_{in}] M
where E denotes the embedding matrix of the samples input to the current module, x_{i1}, x_{i2}, ..., x_{in} denote the input information required by the current module, and M denotes a learnable embedding transformation matrix;
A2, obtaining the query matrix Q, the key matrix K and the value matrix V from the embedding matrix through different linear transformations, with the specific formulas:
Q = E W_Q
K = E W_K
V = E W_V
where W_Q, W_K and W_V denote linear transformation matrices;
A3, attention-weighting Q, K and V, then performing normalization and a fully connected linear operation to finally obtain the feature vector, with the specific formulas:
S = softmax(Q K^T / \sqrt{d_k}) V
\bar{S} = LN(S + E)
z = W_1 \bar{S} + b_1
h = LN(z + \bar{S})
where S is the attention-weighted result, \bar{S} is the normalization result, z is the result of the fully connected linear operation, h is the feature vector, \sqrt{d_k} is the scaling factor, W_1 and b_1 are adaptively learnable parameters, LN(·) denotes the layer normalization operation, and softmax(·) denotes the activation function used to compute the attention scores of the real-time state in the task module; in addition, jump (residual) connections are employed after the attention computation and after the feed-forward computation in the self-attention module of the Transformer, as the formulas for \bar{S} and h show.
5. The AGV multi-target task scheduling method based on deep reinforcement learning according to claim 1, characterized in that the next scheduling option predicted in S4 is obtained from the feature vector h: in the prediction module, h is passed through a single fully connected layer and a Softmax operation to obtain a score for each scheduling option, and the scheduling option with the highest score is taken as the prediction of the next scheduling option in the current state.
6. An AGV multi-target task scheduling system based on deep reinforcement learning, characterized in that the AGV multi-target task scheduling method based on deep reinforcement learning according to any one of claims 1-5 is applied thereto, the system comprising a preprocessing module, a model construction module, a model training module, a model prediction module and a data storage module;
the preprocessing module is used for collecting and preprocessing environment real-time state information in the system;
the model construction module is used for constructing the task scheduling model, which comprises an AGV module, a task module, a map module and a prediction module;
the model training module is used for training the task scheduling model with samples consisting of the preprocessed real-time environment state information and the corresponding actions and rewards;
the model prediction module is used for deployment prediction with the trained task scheduling model, wherein real-time environment state information is input into the trained task scheduling model and the next scheduling option is predicted from the model's output;
the data storage module is used for storing data information in the whole task scheduling process.
CN202310726554.9A 2023-06-19 2023-06-19 AGV multi-target task scheduling method and system based on deep reinforcement learning Pending CN116720703A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310726554.9A CN116720703A (en) 2023-06-19 2023-06-19 AGV multi-target task scheduling method and system based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310726554.9A CN116720703A (en) 2023-06-19 2023-06-19 AGV multi-target task scheduling method and system based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116720703A 2023-09-08

Family

ID=87865799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310726554.9A Pending CN116720703A (en) 2023-06-19 2023-06-19 AGV multi-target task scheduling method and system based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116720703A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236821A (en) * 2023-11-10 2023-12-15 淄博纽氏达特机器人系统技术有限公司 Online three-dimensional boxing method based on hierarchical reinforcement learning
CN117236821B (en) * 2023-11-10 2024-02-06 淄博纽氏达特机器人系统技术有限公司 Online three-dimensional boxing method based on hierarchical reinforcement learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination