CN114201303A - Task unloading optimization method of fixed path AGV in industrial Internet of things environment - Google Patents
- Publication number
- CN114201303A (application CN202111539145.5A)
- Authority
- CN
- China
- Prior art keywords
- agv
- unloading
- task
- edge server
- time
- Prior art date
- Legal status: Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5072—Grid computing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/29—Graphical models, e.g. Bayesian networks
- G06F18/295—Markov models or related models, e.g. semi-Markov models; Markov random fields; Networks embedding Markov models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
Abstract
The invention discloses a task offloading optimization method for fixed-path AGVs in an industrial Internet of Things environment. Starting from a conventional model-free reinforcement learning method based on value-function updating, the invention optimizes the task offloading scheduling problem in scenarios where AGVs assist edge computing, combining a load balancing algorithm with an improved DQN algorithm. Under constraints such as delay-sensitive task processing, fixed paths, and efficiency, the method minimizes the total offload processing time across multiple AGVs and multiple server nodes in the Internet of Things environment. The method requires little prior knowledge, offloads and transmits tasks over short distances, meets data security requirements, is readily reusable in similar application scenarios, and has strong practical value.
Description
Technical Field
The invention belongs to the field of edge computing, and particularly relates to a reinforcement learning method for task offloading optimization of fixed-path AGVs in an industrial Internet of Things environment.
Background
At present, large numbers of sensors are deployed in the industrial Internet of Things, and time-sensitive data operations and decision execution are pervasive, so a purely cloud-centric computing model can no longer meet practical requirements. Edge computing has therefore been brought into industrial scenarios: data computing tasks are distributed from the centralized cloud to edge devices closer to the data source. Being closer to the target, edge servers provide cloud-like computing capability to users at short range, which satisfies well the real-time processing requirements of industrial Internet of Things applications. Within this field, the task offloading scheduling optimization problem is a focus of current research. Existing task scheduling methods and models include particle swarm optimization, genetic algorithms, ant colony optimization, game theory, and the like. These methods may perform well in certain specific scenarios, but there remains much room for improvement in the complexity of algorithm design and in broader applicability.
With the development of AI, various reinforcement learning algorithms have been shown to hold notable advantages for sequential decision problems. They are well suited to the policy selection problems over the complex search spaces found in edge computing scenarios, can deliver good solutions with little prior knowledge, and at the same time satisfy privacy-protection requirements. Reinforcement learning divides roughly into two categories: model-based and model-free. As data security receives ever more emphasis and detailed prior knowledge about multi-user nodes is hard to obtain, model-free reinforcement learning is the better fit for the task offloading scheduling problem under edge computing. Model-free reinforcement learning can itself be subdivided into two broad families. One is policy optimization, which maintains no value-function model but searches directly for an optimal policy, usually a parameterized policy whose parameters are updated to maximize expected return. The other is reinforcement learning based on value-function updating, typified by the Q-Learning algorithm, where Q is a memory table of historical experience indexed by the current state and action selection, representing the expected cumulative return of taking an action in a given state. The Q-Learning algorithm constructs an agent, places it in a Markov model of the problem to be solved, and at each step either selects an action by querying accumulated learning experience or selects one at random under an exploration strategy. Two difficulties arise in this setting: the correlation between samples is too strong, and the learning target depends on the estimate itself, making it hard to converge to the optimum. In addition, when the target's state space and action space are too large, a tabular implementation becomes impractical. To address these problems, a DQN deep reinforcement network is introduced to reduce the correlation between samples, combined with a load balancing algorithm so that training converges better. On this basis, the invention proposes a value-function-updating reinforcement learning method for AGV-assisted multi-node task offloading scheduling.
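For reference, the value-function update at the heart of Q-Learning is usually written as follows. This is the standard textbook form, not a formula reproduced from the patent, and the symbol α (learning rate) is the conventional notation:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$$

The second difficulty named above is visible here: the target $r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a')$ contains the Q estimate itself, which is what DQN's separate target network is introduced to stabilize.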
Disclosure of Invention
The invention aims to solve the problem of an AGV offloading its tasks along a given path in an industrial Internet of Things edge computing scenario, so that the task completion time is minimized.
The edge computing scenario mainly comprises several AGV carts and several edge servers. Each AGV carries a given set of tasks to be processed, travels along a given route, and offloads carried tasks to edge servers it passes. The AGV routes differ but may intersect, i.e., pass through the same edge server. Each edge server has a given processing efficiency and maximum capacity; once the offloaded tasks reach the maximum capacity, new tasks are rejected. By the time an AGV cart reaches its target end point, all carried tasks must have been offloaded, and the overall task offloading flow finishes when the servers have processed every task. To allocate the tasks carried by all AGVs reasonably, so that the total processing time of all carried tasks is as short as possible, the invention provides this method for optimizing fixed-path AGV task offloading in the industrial Internet of Things environment.
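A minimal sketch of this scenario as data structures may help fix the notation; the class and attribute names below are illustrative assumptions, not identifiers from the patent:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EdgeServer:
    efficiency: float      # task amount processed per unit time (given)
    capacity: float        # maximum task amount the server will accept
    accepted: float = 0.0  # task amount already offloaded to this server

    def accept(self, amount: float) -> float:
        """Accept up to `amount` of tasks; anything beyond capacity is rejected."""
        taken = min(amount, self.capacity - self.accepted)
        self.accepted += taken
        return taken

@dataclass
class AGV:
    task_amount: float  # total task amount carried from the starting point
    speed: float        # travel speed along the fixed route
    route: List[int]    # indices of the edge servers passed, in order
```

Two AGVs' `route` lists may share a server index; those shared nodes are exactly where the resource conflicts handled in the second stage arise.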
To achieve this purpose, the invention adopts the following technical scheme. A task offloading optimization method for a fixed-path AGV (automatic guided vehicle) in an industrial Internet of Things environment comprises the following steps:
Step one: several AGVs each set out from a given starting point along a planned path and offload tasks at the edge servers they pass while travelling; each AGV's offloading is complete when it reaches its target point, and the task flow finishes when the edge servers have processed all offloaded tasks. This scenario is modeled.
Step two: to minimize the total time consumed by the whole offload processing, the model is handled in two stages. In the first stage, the resource conflicts caused by multiple AGVs competing are ignored, and the optimal offloading scheme of each AGV across the edge servers is solved. In the second stage, building on the first, the nodes responsible for resource conflicts are optimized and the conflicts resolved, so that the overall optimum is reached.
Step three: first, the first-stage processing of AGV task offloading is performed. Ignoring the resource conflicts caused by multiple AGVs offloading, the task amount each AGV offloads at a passing edge server depends on that server's processing efficiency and capacity and on the time of arrival at the server. Resources are allocated to each AGV under these conditions using a weighted round-robin load balancing algorithm.
Step four: the second stage processes the nodes that caused resource conflicts in the first stage. A Markov model is built over the conflicting edge server nodes and AGVs: the initialized state space is the task amount carried by the AGVs, the action space is the amount each AGV offloads at the edge server node, and the reward for executing an action is the reciprocal of the task processing time; since the AGVs reach the edge server node at different times, each AGV has an offloading priority.
Step five: the constraints of the application scenario are set as small objectives for reinforcement learning, and making the total scheduled time as small as possible is the large objective, realized on the basis of the small ones.
Step six: at the start of a training episode of the reinforcement learning method, the agent begins from the initial state of the Markov model and selects its next action according to the improved policy; after taking an action, the agent reaches the next environment state, which grants a corresponding reward based on the current features.
Step seven: during the training of AGV task offloading, an experience pool of fixed capacity is used, fully exploiting the advantage of off-policy learning, so that sample correlation is broken up and sample utilization is improved.
Step eight: when the algorithm's maximum number of training episodes is reached, training stops, the maximum cumulative reward at convergence is output, and the optimal task offloading strategy is obtained.
Further, in the two-stage processing scheme, the deep reinforcement learning state of the second stage is represented by a vector whose i-th component indicates the task status of the AGV at the i-th common node; the action space is the task offloading amount of each AGV at the resource conflict node.
Further, the completion time of a task consists of three parts: the travel time of the AGV, the time for the AGV to offload the task to the edge server, and the processing time of the edge server (a reconstruction of these formulas is sketched below). The offload rate of the task is

s_i = W log(1 + p g_i / N_0)

From these quantities, the weighted round-robin load balancing algorithm derives the allocated task amounts, where Task and AR are the initial and remaining allocation amounts, respectively.
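The formula images for these three times and for the allocation are not reproduced in this text. A plausible reconstruction consistent with the surrounding definitions is the following, where $d_i$ (path distance from the start to the $i$-th edge server), $v$ (AGV speed), $D_i$ (task amount offloaded at server $i$), and $E_i$ (processing efficiency of server $i$) are assumed symbols rather than the patent's original notation:

$$t_i^{\mathrm{travel}} = \frac{d_i}{v}, \qquad t_i^{\mathrm{off}} = \frac{D_i}{s_i}, \qquad t_i^{\mathrm{proc}} = \frac{D_i}{E_i}, \qquad s_i = W\log\left(1 + \frac{p\,g_i}{N_0}\right)$$

In the offload-rate expression, $W$, $p$, $g_i$ and $N_0$ carry their usual meanings in the Shannon capacity formula: channel bandwidth, transmit power, channel gain to server $i$, and noise power.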
furthermore, the AGV reaches each edge server at different time, and the processing capacity of the edge servers is considered, so that the AGV has different unloading priorities during selecting actions;
thus, the corresponding bonus setting is as follows:
whereinIndicating the unloading priority of the ith AGV at the edge server at the time t;indicating the unloading amount of the ith AGV at the edge server at the time t; etIndicating the processing efficiency of the edge server.
Further, the total return obtained by the agent accounts not only for the current reward but also for future long-term rewards, and the further away in time a reward lies, the less accurately its value can be estimated. The cumulative return is therefore discounted (see the sketch below), where γ is a discount factor that shrinks the weight, within the current return, of reward values further away in time.
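The reward and cumulative-return formulas are likewise missing from this text. One reward form consistent with the description — the offloading priority times the inverse of the processing time of the offloaded amount — together with the standard discounted return would be (the exact reward expression is an assumption; the return is the conventional form):

$$r_t = \rho_i^t \cdot \frac{E_t}{d_i^t}, \qquad R_t = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k}$$

with $\gamma \in [0, 1)$, so that rewards further in the future contribute geometrically less to the current return.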
Further, in step six, each passage of the agent from one state to another generates a learning experience in the experience pool, comprising the features of the previous state, the selected action, the reward obtained, and the next state. The agent learns from past experience in the pool; when the AGV finally finishes offloading, it enters the terminal state and the next training episode begins.
Further, the AGV must comply with the following restrictions during offload scheduling:
a) the node an AGV selects for offloading must be an edge server node covered by the AGV's path;
b) when passing an edge server node, the AGV may choose whether or not to offload; once the tasks accepted by a server node reach its capacity limit, the node refuses further offloaded tasks;
c) all tasks carried by an AGV must have been offloaded by the time it reaches the target point, and the time to finish processing all tasks carried by the AGVs should be as short as possible.
The beneficial effects of the invention are as follows:
The invention improves the original offload scheduling algorithm for the characteristics of AGV task offloading scheduling scenarios in the industrial Internet of Things environment, and innovatively proposes a new offloading method that combines a load balancing algorithm with a reinforcement learning algorithm. Under constraints such as the AGVs' fixed paths, delay sensitivity, and limited resources, it achieves the goal of minimizing total task completion time by selecting offloading schemes reasonably. The method requires little prior knowledge and no deep knowledge of the edge server nodes' information, so it meets privacy-protection requirements; it is readily reusable in similar application scenarios and has strong practical value.
Drawings
FIG. 1 is a schematic view of an AGV offloading model according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a resource conflict scenario provided by an embodiment of the present invention;
FIG. 3 is a flowchart of the DQN algorithm provided by an embodiment of the present invention.
Detailed Description
The method of the invention is further described below with reference to the accompanying drawings and examples.
A task offloading optimization method for a fixed-path AGV in an industrial Internet of Things environment comprises the following steps:
Step one: in an industrial Internet of Things environment, several AGVs are given, each carrying several tasks to be processed, with given travel routes along which several edge servers are distributed. The tasks carried by the AGVs must be offloaded to the edge servers for processing; by the time an AGV reaches its given end point, its carried tasks must have been offloaded, and the flow finishes when the edge servers have processed all tasks. This scenario is modeled.
Step two: to make the time for completing all the AGVs' tasks as short as possible, the method splits task offloading into two processing stages. The first stage allocates each AGV's tasks with a weighted round-robin load balancing algorithm; on that basis, the second stage trains the part causing edge server resource conflicts with the deep reinforcement learning DQN algorithm, obtaining an optimal allocation scheme that makes the final completion time of all tasks shortest.
Step three: in the first stage of the method, load-balanced allocation is performed for each AGV using a weighted round-robin load balancing algorithm. Ignoring at first the offloading influence of the other AGVs, the tasks carried by each AGV are allocated in a load-balanced way, considering the processing efficiency and capacity of the edge servers and the order in which each edge server is reached.
Step four: the completion time of a task consists of three parts: the travel time of the AGV, the time for the AGV to offload the task to the edge server, and the processing time of the edge server, as formalized above. The offload rate of the task is

s_i = W log(1 + p g_i / N_0)

and the weighted round-robin algorithm then derives the allocated task amounts, where Task and AR are the initial and remaining allocation amounts, respectively.
and step five, training the nodes causing the resource conflict by adopting a deep reinforcement learning DQN algorithm in the second stage of the method. When the best allocation is obtained for each AGV in the first stage, the result may be that the total amount of unloading at the common edge server exceeds the capacity of the edge server, which may cause resource conflict problem, so the amount of unloading of AGVs at the common edge server needs to be adjusted. Therefore, in the second stage, the deep reinforcement learning DQN algorithm is adopted to train the part, so that the optimal distribution mode is obtained.
At the moment, a Markov model is built for the edge server nodes and the AGVs which cause the conflict, the initialized state space is the task quantity carried by the AGVs, the action space is the unloading quantity of a certain edge server node, the reward obtained by executing the action is the reciprocal of the task quantity, and considering that the time of each AGV reaching the edge server node is different, each AGV has an unloading priority;
and step six, the unloading process of the AGV at each edge server corresponds to the state transition process of the Markov model, and each state transition of the Markov model generates a learning unit which comprises a state on the agent, an action for selecting the last state, an incentive given to the state transition by the environment and the current state. Wherein the reward is set to the inverse of the task processing time to maximize the reward because the final goal is to minimize the task completion time. The reward function is:
adding an attenuation factor: furthermore, since the total reward earned by the agent takes into account not only the current reward, but also future long-term rewards, and the further apart the time, the less accurate the value of the future reward earned, the cumulative reward can be expressed as:
wherein gamma is a discount factor, and the discount factor is used to make the proportion of the reward value with longer time interval in the current return smaller. Because the state space and the action space are both large, the invention uses the experience pool of DQN to perform experience playback, fully utilizes the advantage of off-policy, and can break the association between data, and its target is updated as:
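The target-update formula is not reproduced in this text. The standard DQN target with a slowly-updated target network $\theta^-$, matching the description of step 7 in the embodiment below, is:

$$y_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-)$$

and the Q network's weights $\theta$ are updated by minimizing the squared error $\left(y_t - Q(s_t, a_t; \theta)\right)^2$ over minibatches sampled from the experience pool.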
and step seven, stopping training when the algorithm reaches the maximum training period, and outputting an action sequence corresponding to the maximum result, which represents the unloading action selection of the AGVs in the actual scene, so as to obtain the optimal unloading sequence of each AGV at the common edge server, namely the optimal unloading strategy for task unloading scheduling of the AGVs.
Example:
FIG. 1 is a schematic view of an AGV offloading model according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a resource conflict scenario provided by an embodiment of the present invention;
A task offloading optimization method for a fixed-path AGV in an industrial Internet of Things environment comprises the following steps:
Step 1: first determine the basic information of each AGV and edge server, including the task amount carried by each AGV, the AGV travel speed, the processing efficiency and capacity of each edge server, and the AGV travel routes.
Step 2: determine the completion time of the tasks, comprising the travel time for the AGV to reach the edge server, the time for the AGV to offload tasks to the edge server, and the time for the edge server to process the tasks, as given by the formulas above.
and 3, performing load balance distribution on the unloading of the AGV at the edge servers by using a weighted polling algorithm according to the processing efficiency and capacity of the edge servers and the time of the AGV reaching each edge server as weights, wherein the time of reaching each edge server is different, namely the edge server node which arrives first is processed first, so that the part of tasks are initial distribution tasks and then are redistributed according to the processing efficiency weights.
Step 4: the initial allocation amount of the AGV to each edge server, and the allocation then produced by the weighted round-robin load balancing algorithm, follow the formulas above; a minimal implementation sketch is given below.
Step 5: the weighted round-robin algorithm yields the optimal allocation scheme for each AGV taken alone, but if several AGVs all offload according to their own optimal schemes at a common edge server, the offloaded amount may exceed that server's capacity, so this part must be optimized. The invention trains this part with the deep reinforcement learning DQN algorithm. FIG. 3 is a flowchart of the DQN algorithm provided by an embodiment of the present invention.
Step 6: construct the Markov model and initialize the AGV state, which comprises the task amount carried by the AGV and the position of the edge server where it is located. The action is the amount each AGV offloads at the common edge server. The reward function is the inverse of the processing time of the offloaded tasks. Initialize the maximum number of training episodes, the experience pool size, and the weight parameters θ.
Step 7: execute the ε-greedy policy: draw a random number, and if it is less than ε, randomly select a node among all nodes as the action for the current state (otherwise the action with the largest estimated value is taken). The agent executes the action, moves from the current state to the next state, and obtains the reward R_{t+1} granted for the edge server processing that portion of tasks. After the learning unit for this state transition is determined, the next state is updated and the learning target is updated with the formula given above, where γ is the discount factor: the agent's total return considers not only the current reward but also future long-term rewards, whose value is less accurate the more distant they are, so the discount factor shrinks the share of long-interval rewards in the current return. θ⁻ denotes weights that are updated more slowly than the Q network's weights. Finally, the complete learning unit is pushed into the experience pool, and subsequent training draws random samples from the pool via experience replay.
Step 8: if the agent has not reached the terminal state, repeat the above steps until the tasks are finished and the agent enters the terminal state. If the agent offloads more tasks than an edge server's capacity allows, it is given a punitive reward. Compute the cumulative return of the current episode from the start state to the terminal state, and update the experience pool information if the pool is full or the return exceeds the current maximum target reward.
Step 9: if the number of training episodes has not reached the maximum, repeat the training loop; once the maximum is reached, stop the reinforcement learning training, start from the initial state, and greedily select the action with the largest return at each step until the final state. Record all action selections to obtain the offload scheduling decision sequence, and output the result as the solution minimizing the AGVs' task completion time under the constraints of the industrial Internet of Things environment. A sketch of the complete training loop follows.
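Steps 6 through 9 together form a standard DQN training loop. The sketch below shows that loop under stated assumptions: `env` is a hypothetical environment wrapping the Markov model above and exposing `reset()`/`step(action)`; the network size, ε value, and other hyper-parameters are illustrative defaults, not values from the patent:

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small MLP approximating Q(s, a) for all actions at once."""
    def __init__(self, n_states, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, x):
        return self.net(x)

def train_dqn(env, n_states, n_actions, episodes=500, gamma=0.9,
              eps=0.1, batch=64, pool_size=10_000, sync_every=100):
    q = QNet(n_states, n_actions)
    q_target = QNet(n_states, n_actions)
    q_target.load_state_dict(q.state_dict())  # theta^- starts equal to theta
    opt = torch.optim.Adam(q.parameters(), lr=1e-3)
    pool = deque(maxlen=pool_size)            # fixed-capacity experience pool
    steps = 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Step 7: epsilon-greedy action selection.
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    a = int(q(torch.tensor(s, dtype=torch.float32)).argmax())
            s2, r, done = env.step(a)         # env returns a punitive r if capacity is exceeded
            pool.append((s, a, r, s2, float(done)))  # push the learning unit
            s = s2
            # Experience replay: random minibatches break sample correlation.
            if len(pool) >= batch:
                sb, ab, rb, s2b, db = zip(*random.sample(pool, batch))
                sb = torch.tensor(sb, dtype=torch.float32)
                s2b = torch.tensor(s2b, dtype=torch.float32)
                rb = torch.tensor(rb, dtype=torch.float32)
                db = torch.tensor(db, dtype=torch.float32)
                ab = torch.tensor(ab, dtype=torch.int64)
                with torch.no_grad():         # DQN target uses the slow network theta^-
                    y = rb + gamma * q_target(s2b).max(1).values * (1 - db)
                pred = q(sb).gather(1, ab.unsqueeze(1)).squeeze(1)
                loss = nn.functional.mse_loss(pred, y)
                opt.zero_grad(); loss.backward(); opt.step()
            steps += 1
            if steps % sync_every == 0:       # theta^- trails theta (step 7)
                q_target.load_state_dict(q.state_dict())
    return q  # a final greedy rollout of q yields the offload schedule (step 9)
```

After training, the greedy rollout of step 9 simply starts from `env.reset()` and repeatedly takes `argmax` actions from the returned network until the terminal state, recording the action sequence as the offload scheduling decision.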
It should be understood that parts of the specification not set forth in detail belong to the prior art. Those skilled in the relevant art should understand that the above examples are only intended to help the reader understand the principles and implementation of the invention, and that the scope of protection is not limited to these examples; all equivalent substitutions made on the basis of the invention fall within the protection scope of the claims.
Claims (7)
1. A task offloading optimization method for a fixed-path AGV in an industrial Internet of Things environment, characterized by comprising the following steps:
Step one: several AGVs each set out from a given starting point along a planned path and offload tasks at the edge servers they pass while travelling; each AGV's offloading is complete when it reaches its target point, and the task flow finishes when the edge servers have processed all offloaded tasks; this scenario is modeled;
Step two: to minimize the total time consumed by the whole offload processing, the model is handled in two stages: in the first stage, the resource conflicts caused by multiple AGVs competing are ignored, and the optimal offloading scheme of each AGV across the edge servers is solved; in the second stage, building on the first, the nodes responsible for resource conflicts are optimized and the conflicts resolved, so that the overall optimum is reached;
Step three: first, the first-stage processing of AGV task offloading is performed; ignoring the resource conflicts caused by multiple AGVs offloading, the task amount each AGV offloads at a passing edge server depends on that server's processing efficiency and capacity and on the time of arrival at the server; resources are allocated to each AGV under these conditions using a weighted round-robin load balancing algorithm;
Step four: the second stage processes the nodes that caused resource conflicts in the first stage; a Markov model is built over the conflicting edge server nodes and AGVs: the initialized state space is the task amount carried by the AGVs, the action space is the amount each AGV offloads at the edge server node, and the reward for executing an action is the reciprocal of the task processing time; since the AGVs reach the edge server node at different times, each AGV has an offloading priority;
Step five: the constraints of the application scenario are set as small objectives for reinforcement learning, and making the total scheduled time as small as possible is the large objective, realized on the basis of the small ones;
Step six: at the start of a training episode of the reinforcement learning method, the agent begins from the initial state of the Markov model and selects its next action according to the improved policy; after taking an action, the agent reaches the next environment state, which grants a corresponding reward based on the current features;
Step seven: during the training of AGV task offloading, an experience pool of fixed capacity is used, fully exploiting the advantage of off-policy learning, so that sample correlation is broken up and sample utilization is improved;
Step eight: when the algorithm's maximum number of training episodes is reached, training stops, the maximum cumulative reward at convergence is output, and the optimal task offloading strategy is obtained.
2. The method according to claim 1, characterized in that, in the two-stage processing scheme, the deep reinforcement learning state of the second stage is represented by a vector whose i-th component indicates the task status of the AGV at the i-th common node, and the action space is the task offloading amount of each AGV at the resource conflict node.
3. The method according to claim 1, characterized in that the completion time of a task consists of three parts: the travel time of the AGV, the time for the AGV to offload the task to the edge server, and the processing time of the edge server; the offload rate of the task is s_i = W log(1 + p g_i / N_0); and the weighted round-robin load balancing algorithm derives therefrom the allocated task amounts, where Task and AR are the initial and remaining allocation amounts, respectively.
4. The method according to claim 1, characterized in that the AGVs reach each edge server at different times and the processing capacity of the edge servers is taken into account, so that the AGVs have different offloading priorities when selecting actions, and the corresponding reward is set accordingly.
5. The method according to claim 4, characterized in that the total return obtained by the agent accounts not only for the current reward but also for future long-term rewards, whose value is estimated less accurately the further away in time they are, so the cumulative return is discounted, where γ is a discount factor that shrinks the weight of temporally distant reward values in the current return.
6. The method according to claim 1, characterized in that, in step six, each passage of the agent from one state to another generates a learning experience in the experience pool, comprising the features of the previous state, the selected action, the reward obtained, and the next state; the agent learns from past experience in the pool, and when the AGV finishes offloading it enters the terminal state and the next training episode begins.
7. The method according to any one of claims 1 to 6, characterized in that the AGV must comply with the following restrictions during offload scheduling:
a) the node an AGV selects for offloading must be an edge server node covered by the AGV's path;
b) when passing an edge server node, the AGV may choose whether or not to offload; once the tasks accepted by a server node reach its capacity limit, the node refuses further offloaded tasks;
c) all tasks carried by an AGV must have been offloaded by the time it reaches the target point, and the time to finish processing all tasks carried by the AGVs should be as short as possible.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111539145.5A CN114201303A (en) | 2021-12-15 | 2021-12-15 | Task unloading optimization method of fixed path AGV in industrial Internet of things environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114201303A true CN114201303A (en) | 2022-03-18 |
Family
ID=80654335
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111539145.5A Pending CN114201303A (en) | 2021-12-15 | 2021-12-15 | Task unloading optimization method of fixed path AGV in industrial Internet of things environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114201303A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115063088A (en) * | 2022-08-16 | 2022-09-16 | 成都秦川物联网科技股份有限公司 | Industrial Internet of things for material transportation control and control method |
CN115063088B (en) * | 2022-08-16 | 2022-11-08 | 成都秦川物联网科技股份有限公司 | Industrial Internet of things system for material transportation control and control method |
US11774947B2 (en) | 2022-08-16 | 2023-10-03 | Chengdu Qinchuan Iot Technology Co., Ltd. | Industrial internet of things for material transportation control, control methods and media thereof |
CN115660386A (en) * | 2022-12-13 | 2023-01-31 | 北京云迹科技股份有限公司 | Robot scheduling reward method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |