CN115237119A - AGV collaborative transfer target distribution and decision algorithm - Google Patents

AGV collaborative transfer target distribution and decision algorithm

Info

Publication number
CN115237119A
CN115237119A CN202210627881.4A CN202210627881A CN115237119A CN 115237119 A CN115237119 A CN 115237119A CN 202210627881 A CN202210627881 A CN 202210627881A CN 115237119 A CN115237119 A CN 115237119A
Authority
CN
China
Prior art keywords
agv
action
decision
target
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210627881.4A
Other languages
Chinese (zh)
Inventor
魏才盛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Ruizhixingyuan Intelligent Technology Co ltd
Original Assignee
Suzhou Ruizhixingyuan Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Ruizhixingyuan Intelligent Technology Co ltd filed Critical Suzhou Ruizhixingyuan Intelligent Technology Co ltd
Priority to CN202210627881.4A priority Critical patent/CN115237119A/en
Publication of CN115237119A publication Critical patent/CN115237119A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/40 Business processes related to the transportation industry

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Theoretical Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Development Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Educational Administration (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an AGV (automated guided vehicle) cooperative transport target allocation and decision algorithm. The method establishes a dominance matrix and a target decision matrix based on the number of AGVs and the number of targets to be carried, and builds a target allocation optimization function from them; it analyses the AGV transport environment, establishes a probability matrix and a penetration action matrix, and builds an action decision objective function and an action decision constraint; the two objective functions are integrated by linear weighting into a unified objective function; reinforcement learning is applied with a reward function designed from the unified objective function, forming a multi-stage decision problem model; a joint reward function is designed to reward actions; and a selection strategy and a backtracking update formula are set and improved to solve the decision problem model. The invention has the advantages of high calculation speed, short training time and good convergence, thereby achieving intelligent calculation, accurate allocation and sound action decisions.

Description

AGV collaborative transfer target distribution and decision algorithm
Technical Field
The invention relates to the technical field of intelligent handling equipment, and in particular to an AGV (automated guided vehicle) cooperative transport target allocation and decision algorithm.
Background
An AGV is an unmanned transport vehicle that achieves positioning, orientation, obstacle avoidance and path planning by means of a rotatable laser scanner that senses the surrounding environment. As key equipment in goods transportation, the AGVs together form a transportation network and complete transport activities between a starting point and target units under the instruction of a control system. How to perform target allocation and action decision for the AGVs through the control system is therefore an important issue to be solved urgently.
The AGV target allocation and action decision problem is an extension of the vehicle routing problem (VRP) and is NP-hard, so large-scale instances are difficult to solve with conventional exact algorithms. Because the scenario is complex and involves multiple constraints, multiple objectives and uncertainty, the problem is far more complex than traditional VRP and scheduling problems. At present, scheduling methods based on experience and rules are mainly used, which leads to a low on-time delivery rate, low AGV utilization and long delivery times. It is therefore urgent to combine optimal scheduling theory with the characteristics of AGV delivery and to design an effective optimization theory and method for solving the problem.
In conclusion, studying the AGV scheduling problem has very important practical significance; the invention integrates target allocation and action decision under multiple constraints and maneuvers and designs an effective intelligent algorithm to solve the problem.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide an AGV cooperative transport target allocation and decision algorithm which has the advantages of high calculation speed, short training time and good convergence effect, thereby realizing the effects of intelligent calculation, accurate allocation and action decision.
In order to achieve the purpose, the invention provides the following technical scheme:
an AGV cooperative transport target distribution and decision algorithm comprises the following steps:
s1: respectively establishing a dominance matrix and a target decision matrix based on the number of AGVs and the number of targets to be carried, and establishing a target allocation optimization function based on the dominance matrix and the target decision matrix;
s2: analyzing an AGV carrying environment, establishing a probability matrix, optimizing according to the probability matrix to obtain a penetration action matrix, and establishing an action decision objective function based on the probability matrix and the penetration action matrix;
s3: establishing action decision constraint according to the action decision objective function;
s4: performing linear weighted integration on the target allocation optimization function and the action decision objective function to establish a unified objective function;
s5: performing reinforcement learning, performing Markov decision process modeling on target allocation and decision in stages to construct a state space and an action space, and designing a reward function based on the unified objective function to form a multi-stage decision problem model;
s6, designing a joint reward function based on the unified objective function;
s7: establishing a corresponding Monte Carlo tree for each AGV, searching each Monte Carlo tree, forming a combined action set from the search results, substituting it into the joint reward function for evaluation and obtaining a reward value;
s8: setting a selection strategy and a backtracking updating formula, returning the obtained reward value to each Monte Carlo tree, improving the selection strategy and the backtracking updating formula of the algorithm, and solving a decision problem model.
As a further improvement of the present invention, the step S1 specifically includes:
in the AGV cooperative transport process, there are N_U AGVs and N_Tar targets to be carried;
the established dominance matrix is A_ij, where the value in row i (i = 1, 2, ..., N_U) and column j (j = 1, 2, ..., N_Tar) represents the comprehensive dominance of the ith AGV with respect to target j;
the established target decision matrix is X_ij, where an element value of 1 means that the ith AGV is allocated to target j, and the element value is 0 otherwise;
the established target allocation optimization function is:
[formula image]
wherein: J_dis,i is the target allocation optimization function of the ith AGV.
As a further improvement of the present invention, step S1 further includes:
in the process of allocating targets, each AGV can only be allocated to one target and each target is allocated at least one AGV, so the following constraint model is established:
[formula image]
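As an illustration only (the formulas above are given as images in the original), the following Python sketch shows one plausible reading of step S1, assuming the common assignment-style form J_dis,i = sum_j A_ij * X_ij together with the stated allocation constraints; the matrix sizes and values are hypothetical.

    import numpy as np

    N_U, N_TAR = 4, 2                      # hypothetical numbers of AGVs and targets

    # Dominance matrix A: A[i, j] is the comprehensive dominance of AGV i for target j.
    A = np.random.rand(N_U, N_TAR)

    # Target decision matrix X: X[i, j] = 1 if AGV i is allocated to target j, else 0.
    X = np.zeros((N_U, N_TAR), dtype=int)

    def allocation_feasible(X):
        """Constraint model: each AGV gets exactly one target, each target at least one AGV."""
        return bool((X.sum(axis=1) == 1).all() and (X.sum(axis=0) >= 1).all())

    def J_dis(A, X, i):
        """Assumed form of the target allocation optimization function for AGV i."""
        return float(A[i] @ X[i])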
as a further improvement of the present invention, in the step S2, due to the AGV transport process, different targets make different maneuvers based on different environments to pass through the obstacle, and different targets are set to deploy different air defense areas in different numbers, that is, the target j has N in total d,j Individual obstacle region, AGV has N A Selecting one action for defense through each barrier area, and establishing probability matrix P mk Matrix represents performing action m (m =1, 2..., N) A ) N in the barrier region k (k =1,2.. Times.n) d,j ) The probability percentage can be reduced, and the probability matrix is optimized to obtain a penetration matrix M km The matrix represents that if the AGV selects the penetration action m in the obstacle area k, the matrix element is 1, otherwise, the matrix element is 0;
establishing an action decision objective function of the AGV:
Figure BDA0003678551000000032
wherein: j is a unit of pene,i Representing the action decision objective function of the ith AGV.
As a further improvement of the invention, the step S2 further comprises constraining the number of times each action may be selected during AGV transport according to the physical characteristics of the AGV, specifying that each action is selected no more than b_1 times, and establishing the action decision constraint:
[formula image]
The decision variables are then selected optimally in a discrete decision space based on the action decision constraint.
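A companion sketch for steps S2-S3 follows, under the assumption that the action decision objective accumulates the blocking-probability reduction P[m, k] over the obstacle areas in which the penetration matrix selects action m; the exact objective is shown only as an image, so this form, the sizes and the values are assumptions.

    import numpy as np

    N_A, N_D = 5, 4          # hypothetical: 5 candidate actions, 4 obstacle areas of target j
    b1 = 2                   # each action may be selected at most b1 times

    # P[m, k]: percentage by which action m reduces the blocking probability in obstacle area k.
    P = np.random.rand(N_A, N_D)

    # M[k, m] = 1 if the AGV selects penetration action m in obstacle area k, else 0.
    M = np.zeros((N_D, N_A), dtype=int)
    for k in range(N_D):
        M[k, np.random.randint(N_A)] = 1   # one penetration action per obstacle area

    def action_constraint_ok(M, b1=b1):
        """Action decision constraint: no action is selected more than b1 times."""
        return bool((M.sum(axis=0) <= b1).all())

    def J_pene(P, M):
        """Assumed form: total blocking-probability reduction achieved by the chosen actions."""
        return float(sum(P[m, k] for k in range(M.shape[0])
                         for m in range(M.shape[1]) if M[k, m]))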
As a further improvement of the present invention, the step S4 specifically includes:
establishing a unified objective function for the ith AGV based on the target allocation optimization function and the action decision objective function:
[formula image]
converting the multi-objective optimization of the AGV cluster into a single-objective optimization by the linear weighting method:
[formula image]
wherein: [formula image] is a weight factor, and [formula image]
as a further improvement of the present invention, said step S4 further includes establishing an objective function constraint:
[formula image]
wherein the objective function constraint is a combination of a constraint model and an action decision constraint.
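The linear weighting of step S4 can be sketched as below; the grouping of the weighted terms and the assumption that each per-AGV unified objective is J_i = J_dis,i + J_pene,i are not spelled out in the text (the formulas are images), so both are assumptions, kept consistent with the later example where all weight factors equal 1/50.

    def unified_objective(J_dis_list, J_pene_list, rho=None):
        """Assumed linear-weighting integration of the per-AGV objectives.

        J_dis_list, J_pene_list: per-AGV target allocation and action decision objective values.
        rho: weight factors; equal weights are used when none are given.
        """
        J_i = [jd + jp for jd, jp in zip(J_dis_list, J_pene_list)]
        if rho is None:
            rho = [1.0 / len(J_i)] * len(J_i)
        return sum(r * j for r, j in zip(rho, J_i))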
As a further improvement of the present invention, the step S5 specifically includes:
the state space is established as:
[formula image]
wherein: S_dis is the target allocation state, S_r is the state of obstacle area 1, and the symbols given by [formula image], [formula image] and [formula image] are the states of obstacle areas 2, 3 and 4 passed through in sequence. By integrating the state spaces of all targets, the total state space can be expressed as:
[formula image]
The AGV takes different actions according to the state of each stage. In the target allocation stage the action is the choice of a target; the AGV can select only one target, and the result is represented as a discrete vector [formula image]. When there are N_Tar targets, N_Tar actions are selectable in the target allocation stage;
in the obstacle passing stage the actions are 5 types of defense actions, and the AGV selects one of them in a given obstacle area state. The 5 defense actions are expressed as a vector whose elements represent action 1, action 2, action 3, action 4 and action 5 in sequence, and the reward function is established:
[formula image]
as a further improvement of the present invention, the joint reward function is:
[formula image]
A penalty of -1 is given when the target decision matrix and the penetration matrix do not satisfy the constraints; otherwise the reward value is the unified objective function value.
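Because the joint reward rule is stated explicitly (a penalty of -1 on constraint violation, otherwise the unified objective value), it can be sketched directly; the helpers used here are the hypothetical ones from the earlier sketches, and the use of a single P matrix for all AGVs is a simplification.

    def joint_reward(X, M_list, A, P, rho=None):
        """Joint reward of steps S5-S6: -1 if the target decision matrix X or any
        penetration matrix in M_list violates its constraint, otherwise the
        unified objective value (built from the earlier sketch functions)."""
        if not allocation_feasible(X) or not all(action_constraint_ok(Mi) for Mi in M_list):
            return -1.0
        J_dis_list = [J_dis(A, X, i) for i in range(X.shape[0])]
        J_pene_list = [J_pene(P, Mi) for Mi in M_list]
        return unified_objective(J_dis_list, J_pene_list, rho)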
As a further improvement of the present invention, the selection decision is:
[formula image]
wherein: v_father is the parent node of the node v being evaluated, C_p is a constant used to balance exploration and exploitation, and Q(v) is the result based on the allocations and penetration actions of all AGVs;
the backtracking update formula is as follows:
[formula image]
wherein: N_new(v) is the new node count, N_old(v) the old node count, Q_new(v) the new penetration result, Q_old(v) the old penetration result, and ΔQ the penetration result difference.
The invention has the beneficial effects that: a target allocation optimization function and an action decision objective function are respectively established by considering the dominance, the target values and the action probabilities of the AGVs; the two functions are integrated into a unified objective function for collaborative task planning of the AGVs; a state space and an action space are constructed in stages within a reinforcement learning framework and a reward function is designed according to the unified objective function; finally, an improved Monte Carlo tree search reinforcement learning algorithm is provided.
Drawings
FIGS. 1-5 are search depth curves of the search trees of some of the AGVs;
FIGS. 6-10 show the J_i values of some of the AGVs.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples. In which like parts are designated by like reference numerals. It should be noted that the terms "front," "back," "left," "right," "upper" and "lower" used in the following description refer to directions in the drawings, and the terms "bottom" and "top," "inner" and "outer" refer to directions toward and away from, respectively, the geometric center of a particular component.
Referring to fig. 1, a specific embodiment of an AGV cooperative transport target allocation and decision algorithm according to the present invention includes the following steps:
s1: respectively establishing a dominance matrix and a target decision matrix based on the number of AGVs and the number of targets to be carried, and establishing a target allocation optimization function based on the dominance matrix and the target decision matrix;
s2: analyzing an AGV carrying environment, establishing a probability matrix, optimizing according to the probability matrix to obtain a penetration action matrix, and establishing an action decision objective function based on the probability matrix and the penetration action matrix;
s3: establishing action decision constraint according to the action decision objective function;
s4: performing linear weighted integration on the target allocation optimization function and the action decision objective function to establish a unified objective function;
s5: performing reinforcement learning, performing Markov decision process modeling on target allocation and decision in stages to construct a state space and an action space, and designing a reward function based on the unified objective function to form a multi-stage decision problem model;
s6, designing a joint reward function based on the unified objective function;
s7: establishing a corresponding Monte Carlo tree for each AGV, searching each Monte Carlo tree, forming a combined action set from the search results, substituting it into the joint reward function for evaluation and obtaining a reward value;
s8: setting a selection strategy and a backtracking updating formula, returning the obtained reward value to each Monte Carlo tree, improving the selection strategy and the backtracking updating formula of the algorithm, and solving the decision problem model.
The step S1 specifically comprises the following steps:
in the AGV cooperative transport process, there are N_U AGVs and N_Tar targets to be carried;
the established dominance matrix is A_ij, where the value in row i (i = 1, 2, ..., N_U) and column j (j = 1, 2, ..., N_Tar) represents the comprehensive dominance of the ith AGV with respect to target j;
the established target decision matrix is X_ij, where an element value of 1 means that the ith AGV is allocated to target j, and the element value is 0 otherwise;
the established target allocation optimization function is:
[formula image]
wherein: J_dis,i is the target allocation optimization function of the ith AGV. In the process of allocating targets, each AGV can only be allocated to one target and each target is allocated at least one AGV, so the following constraint model is established:
[formula image]
Each AGV is then subject to constraint control through this constraint model.
During AGV transport the passing environments of different targets differ; the environments mainly consist of different obstacles, and the AGV can make different maneuvers to pass through them. The original blocking probability of each obstacle and the blocking reduction achieved by each maneuver are shown in the following tables:
Table 1: original blocking probability of each obstacle
[table image]
Table 2: blocking-reduction ratio of each AGV action against each obstacle
[table image]
During AGV transport, different targets require different maneuvers in different environments in order to pass through the obstacles, and different numbers of defense (obstacle) areas are deployed for different targets; that is, target j has N_d,j obstacle areas in total and the AGV has N_A candidate actions, one of which is selected for penetration through each obstacle area. A probability matrix P_mk is established, in which the entry for action m (m = 1, 2, ..., N_A) and obstacle area k (k = 1, 2, ..., N_d,j) represents the percentage by which performing that action in that area reduces the blocking probability. The probability matrix is optimized to obtain a penetration matrix M_km, whose element is 1 if the AGV selects penetration action m in obstacle area k, and 0 otherwise;
the action decision objective function of the AGV is established:
[formula image]
wherein: J_pene,i denotes the action decision objective function of the ith AGV.
According to the physical characteristics of the AGV, the number of times each action may be selected during transport is constrained: each action is selected no more than b_1 times, and the action decision constraint is established:
[formula image]
The decision variables are selected optimally in a discrete decision space based on the action decision constraint; this solution form matches the reinforcement learning solution process, and the multi-stage decision problem is solved in a unified way with a Monte Carlo tree search algorithm.
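For orientation, one iteration of a plain Monte Carlo tree search is sketched below using the Node, uct_select and backup helpers from the earlier sketch; the expansion and rollout details are generic assumptions, since the patent's contribution lies in the modified selection and backtracking steps described later.

    import random

    def mcts_iteration(root, legal_actions, rollout_reward):
        """One illustrative MCTS iteration: selection, expansion, random rollout, backup.

        legal_actions(node) -> list of actions available in the node's state (assumed helper).
        rollout_reward(node) -> reward of a random simulation from the node, e.g. the joint
        reward once a complete allocation and action sequence has been sampled.
        """
        node = root
        # 1. Selection: descend with the selection strategy while the node is fully expanded.
        while node.children and len(node.children) == len(legal_actions(node)):
            node = uct_select(node)
        # 2. Expansion: add one untried action as a child.
        untried = [a for a in legal_actions(node) if a not in node.children]
        if untried:
            action = random.choice(untried)
            child = Node(parent=node)
            node.children[action] = child
            node = child
        # 3. Simulation (rollout): estimate the value of the new node.
        delta_q = rollout_reward(node)
        # 4. Backtracking update along the path to the root.
        backup(node, delta_q)
        return delta_q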
A unified objective function is established for the ith AGV based on the target allocation optimization function and the action decision objective function:
[formula image]
The multi-objective optimization of the AGV cluster is converted into a single-objective optimization by the linear weighting method:
[formula image]
wherein: [formula image] is a weight factor, and [formula image]
Establishing an objective function constraint:
[formula image]
wherein the objective function constraint is a combination of a constraint model and an action decision constraint.
The problem to be solved is modeled as a Markov decision process to facilitate the algorithm's solution process. The state is the information by which the agent represents its own characteristics, and during iteration the agent selects actions according to the state, so the choice of state has a very important influence on the quality of the training result. The problem solved by the invention consists of two stages, target allocation and action decision, where target allocation is the precondition stage of action decision: only after the AGV has selected a target can actions be selected in sequence according to the obstacle areas of that target. Therefore the target allocation is taken as a precondition state and combined with the subsequent obstacle-area states for unified state space modeling, and for each target j the state space is established as follows:
[formula image]
wherein: S_dis is the target allocation state, S_r is the state of obstacle area 1, and the symbols given by [formula image], [formula image] and [formula image] are the states of obstacle areas 2, 3 and 4 passed through in sequence. By integrating the state spaces of all targets, the total state space can be expressed as:
[formula image]
The AGV acts according to the state of each stage. In the target allocation stage the action is the choice of a target; the AGV can select only one target, and the result is represented as a discrete vector [formula image]. When there are N_Tar targets, N_Tar actions are selectable in the target allocation stage;
in the obstacle passing stage the actions are 5 types of defense actions, and the AGV selects one of them in a given obstacle area state. The 5 defense actions are expressed as a vector whose elements represent action 1, action 2, action 3, action 4 and action 5 in sequence, and the reward function is established:
[formula image]
the reward function is the most central part of reinforcement learning and guides the intelligent agent to learn. For the problem, the decision result can be evaluated according to the uniform objective function only when the agent reaches the final state, namely, the complete objective distribution matrix X and the action matrix M are decided.
Based on the unified objective function and the objective function constraint, the joint reward function is designed as follows:
[formula image]
A penalty of -1 is given when the target decision matrix and the penetration matrix do not satisfy the constraints; otherwise the reward value is the unified objective function value.
A Monte Carlo tree is established for each AGV and each tree is searched independently; the search results form a joint action set, which is substituted into the joint reward function for evaluation. Finally, the obtained reward value is returned to each tree and the nodes along each tree's result path are updated by backtracking.
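The coordination loop described in this paragraph can be sketched as follows; how each tree proposes its action (propose_action) and how the joint action set is evaluated (evaluate_joint, e.g. the joint reward function) are left as assumed callables.

    def joint_step(trees, propose_action, evaluate_joint):
        """One coordination step over the per-AGV Monte Carlo trees (illustrative).

        trees: list of root Nodes, one per AGV.
        propose_action(tree) -> (leaf_node, action) chosen by that tree's independent search.
        evaluate_joint(actions) -> reward of the combined action set.
        """
        proposals = [propose_action(tree) for tree in trees]       # independent searches
        joint_actions = [action for _, action in proposals]        # combined action set
        reward = evaluate_joint(joint_actions)                     # joint reward function
        for leaf, _ in proposals:                                  # return the reward value to
            backup(leaf, reward)                                   # every tree's result path
        return reward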
To avoid the influence of the average simulation result on the problem solution result, the selection decision is established as follows:
[formula image]
wherein: v_father is the parent node of the node v being evaluated, C_p is a constant used to balance exploration and exploitation, and Q(v) is the result based on the allocations and penetration actions of all AGVs;
the backtracking update formula is as follows:
[formula image]
wherein: N_new(v) is the new node count, N_old(v) the old node count, Q_new(v) the new penetration result, Q_old(v) the old penetration result, and ΔQ the penetration result difference.
Taking 20 AGVs and 5 targets as an example, the effectiveness of the proposed unified objective function and improved algorithm is verified. Differences in target value and obstacle areas are taken into account, and a table of target value dominance and situation information is given; the initial heading angle of each AGV and the carrying time of each target also differ. Each action may be selected no more than b_1 = 2 times, the weight factors ρ_1, ρ_2, ..., ρ_50 are all taken as 1/50, and the algorithm parameter is set as
[formula image]
The algorithm runs on an i5-9400F processor at 2.90 GHz, the simulation environment is Python 3.6, and the total number of iteration steps is set to 20000.
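The reported settings (20 AGVs, 5 targets, b_1 = 2, weight factors all 1/50, 20000 iteration steps) can be wired into a minimal training loop as below, reusing the Node, joint_step and related helpers sketched earlier; the propose/evaluate callables are assumed, and the algorithm parameter shown only as an image above is not reproduced.

    # Hypothetical training loop using the embodiment's reported settings.
    N_AGV, N_TARGET = 20, 5
    B1 = 2                        # each action selected at most twice
    RHO = [1.0 / 50] * 50         # equal weight factors from the example
    TOTAL_STEPS = 20000           # total number of iteration steps

    def train(propose_action, evaluate_joint):
        trees = [Node() for _ in range(N_AGV)]      # one Monte Carlo tree per AGV
        for _ in range(TOTAL_STEPS):
            joint_step(trees, propose_action, evaluate_joint)
        return trees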
After 20000 steps of learning and training the algorithm converges, with a total training time of 162.1854 seconds. The target allocation matrix obtained by training and the action matrix of each AGV are shown in the following table; to simplify presentation, actions 1 to 5 are denoted by the numbers 1 to 5, and the action sequence an AGV performs while passing through the obstacle areas in order is written as a one-dimensional vector.
Results of the algorithm:
[table image]
Each AGV searches independently according to its own heuristic factors during training. Owing to space limitations, AGV U1, AGV U5, AGV U10, AGV U15 and AGV U20 are taken as examples to show the search depth and the single-AGV objective function value J_i during training.
Figs. 1 to 5 show the tree search depth data of some of the AGVs during training. It can be seen that after 10000 training steps the tree search reaches its maximum depth, which shows that, guided by the reward function and rollout random simulation, the AGV strategy locates the range of the optimal solution in the earlier training, and the search strategy mainly searches within that range in the later training.
FIGS. 6-10 show the J_i data of some of the AGVs during training. In the early stage of training the search depth is shallow and Monte Carlo sampling dominates: the search tree obtains return values by random simulation. As the search depth increases, the tree stores the better solutions it has found in the node information and keeps searching for the optimal solution with the help of the heuristic factors. Oscillation appears in the search process in the later stage of training because the exploration term of the selection strategy takes effect: the optimal solution recorded at the current node is set aside and other nodes with fewer visits are selected, which avoids local optima.
The working principle and the effect are as follows:
the method comprises the steps of respectively establishing a target distribution optimization function and an action decision objective function by considering the advantages, the target value and the action probability of the AGVs, integrating the two functions to form a unified objective function for collaborative task planning of the AGVs, constructing a state space and an action space in stages in a reinforcement learning frame, designing a reward function according to the unified objective function, and finally providing an improved Monte Carlo tree search reinforcement learning algorithm.
The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited to the above embodiments, and all technical solutions that belong to the idea of the present invention belong to the scope of the present invention. It should be noted that modifications and adaptations to those skilled in the art without departing from the principles of the present invention should also be considered as within the scope of the present invention.

Claims (10)

1. An AGV cooperative transport target distribution and decision algorithm is characterized by comprising the following steps of:
s1: respectively establishing a dominance matrix and a target decision matrix based on the number of AGVs and the number of targets to be carried, and establishing a target allocation optimization function based on the dominance matrix and the target decision matrix;
s2: analyzing an AGV carrying environment, establishing a probability matrix, optimizing according to the probability matrix to obtain a penetration action matrix, and establishing an action decision objective function based on the probability matrix and the penetration action matrix;
s3: establishing action decision constraint according to the action decision objective function;
s4: performing linear weighted integration on the target allocation optimization function and the action decision objective function to establish a unified objective function;
s5: performing reinforcement learning, performing Markov decision process modeling on target allocation and decision in stages to construct a state space and an action space, and designing a reward function based on the unified objective function to form a multi-stage decision problem model;
s6, designing a joint reward function based on the unified objective function;
s7: establishing a corresponding Monte Carlo tree for each AGV, searching each Monte Carlo tree, forming a combined action set from the search results, substituting it into the joint reward function for evaluation, and obtaining a reward value;
s8: setting a selection strategy and a backtracking updating formula, returning the obtained reward value to each Monte Carlo tree, improving the selection strategy and the backtracking updating formula of the algorithm, and solving a decision problem model.
2. The AGV cooperative transport target allocation and decision algorithm of claim 1, wherein: the step S1 specifically includes:
in the AGV cooperative transport process, there are N_U AGVs and N_Tar targets to be carried;
the established dominance matrix is A_ij, where the value in row i (i = 1, 2, ..., N_U) and column j (j = 1, 2, ..., N_Tar) represents the comprehensive dominance of the ith AGV with respect to target j;
the established target decision matrix is X_ij, where an element value of 1 means that the ith AGV is allocated to target j, and the element value is 0 otherwise;
the established target allocation optimization function is:
[formula image]
wherein: J_dis,i is the target allocation optimization function of the ith AGV.
3. The AGV cooperative transport target allocation and decision algorithm of claim 2, wherein: the step S1 further comprises the following steps:
in the process of allocating targets, each AGV can only be allocated to one target and each target is allocated at least one AGV, so the following constraint model is established for the target allocation process:
[formula image]
4. The AGV cooperative transport target allocation and decision algorithm according to claim 1, wherein: in the step S2, during AGV transport different targets require different maneuvers in different environments in order to pass through the obstacles, and different numbers of defense (obstacle) areas are deployed for different targets; that is, target j has N_d,j obstacle areas in total and the AGV has N_A candidate actions, one of which is selected for penetration through each obstacle area. A probability matrix P_mk is established, in which the entry for action m (m = 1, 2, ..., N_A) and obstacle area k (k = 1, 2, ..., N_d,j) represents the percentage by which performing that action in that area reduces the blocking probability. The probability matrix is optimized to obtain a penetration matrix M_km, whose element is 1 if the AGV selects penetration action m in obstacle area k, and 0 otherwise;
the action decision objective function of the AGV is established:
[formula image]
wherein: J_pene,i denotes the action decision objective function of the ith AGV.
5. The AGV cooperative transport target allocation and decision algorithm according to claim 3, wherein: in step S2, according to the physical characteristics of the AGV, the number of times each action may be selected during transport is constrained, each action being selected no more than b_1 times, and the action decision constraint is established:
[formula image]
and optimally selecting decision variables in a discrete decision space based on action decision constraints.
6. The AGV cooperative transport target allocation and decision algorithm according to claim 1, wherein: the step S4 specifically comprises the following steps:
establishing a unified objective function for the ith AGV based on the target allocation optimization function and the action decision objective function:
[formula image]
converting the multi-objective optimization of the AGV cluster into a single-objective optimization by the linear weighting method:
[formula image]
wherein: [formula image] is a weight factor, and [formula image]
7. the AGV cooperative transport target allocation and decision algorithm according to claim 6, wherein: the step S4 further includes establishing an objective function constraint:
[formula image]
wherein the objective function constraint is a combination of a constraint model and an action decision constraint.
8. The AGV cooperative transport target allocation and decision algorithm according to claim 1, wherein: the step S5 specifically comprises the following steps:
the state space is established as:
[formula image]
wherein: S_dis is the target allocation state, S_r is the state of obstacle area 1, and the symbols given by [formula image], [formula image] and [formula image] are the states of obstacle areas 2, 3 and 4 passed through in sequence. By integrating the state spaces of all targets, the total state space can be expressed as:
[formula image]
The AGV takes different actions according to the state of each stage. In the target allocation stage the action is the choice of a target; the AGV can select only one target, and the result is represented as a discrete vector [formula image]. When there are N_Tar targets, N_Tar actions are selectable in the target allocation stage;
in the obstacle passing stage the actions are 5 types of defense actions, and the AGV selects one of them in a given obstacle area state. The 5 defense actions are expressed as a vector whose elements represent action 1, action 2, action 3, action 4 and action 5 in sequence, and the reward function is established:
[formula image]
9. the AGV cooperative transport target allocation and decision algorithm of claim 8, wherein: the joint reward function is:
[formula image]
A penalty of -1 is given when the target decision matrix and the penetration matrix do not satisfy the constraints; otherwise the reward value is the unified objective function value.
10. The AGV cooperative transport target allocation and decision algorithm according to claim 1, wherein: the selection decision is:
[formula image]
wherein: v_father is the parent node of the node v being evaluated, C_p is a constant used to balance exploration and exploitation, and Q(v) is the result based on the allocations and penetration actions of all AGVs;
the backtracking update formula is as follows:
[formula image]
wherein: N_new(v) is the new node count, N_old(v) the old node count, Q_new(v) the new penetration result, Q_old(v) the old penetration result, and ΔQ the penetration result difference.
CN202210627881.4A 2022-06-06 2022-06-06 AGV collaborative transfer target distribution and decision algorithm Pending CN115237119A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210627881.4A CN115237119A (en) 2022-06-06 2022-06-06 AGV collaborative transfer target distribution and decision algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210627881.4A CN115237119A (en) 2022-06-06 2022-06-06 AGV collaborative transfer target distribution and decision algorithm

Publications (1)

Publication Number Publication Date
CN115237119A true CN115237119A (en) 2022-10-25

Family

ID=83670192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210627881.4A Pending CN115237119A (en) 2022-06-06 2022-06-06 AGV collaborative transfer target distribution and decision algorithm

Country Status (1)

Country Link
CN (1) CN115237119A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116149375A (en) * 2023-04-21 2023-05-23 中国人民解放军国防科技大学 Unmanned aerial vehicle search planning method and device for online decision, electronic equipment and medium
CN116149375B (en) * 2023-04-21 2023-07-07 中国人民解放军国防科技大学 Unmanned aerial vehicle search planning method and device for online decision, electronic equipment and medium
CN116673968A (en) * 2023-08-03 2023-09-01 南京云创大数据科技股份有限公司 Mechanical arm track planning element selection method and system based on reinforcement learning
CN116673968B (en) * 2023-08-03 2023-10-10 南京云创大数据科技股份有限公司 Mechanical arm track planning element selection method and system based on reinforcement learning

Similar Documents

Publication Publication Date Title
CN115237119A (en) AGV collaborative transfer target distribution and decision algorithm
CN113159432B (en) Multi-agent path planning method based on deep reinforcement learning
CN112733421B (en) Task planning method for cooperation of unmanned aerial vehicle with ground fight
CN107886201B (en) Multi-objective optimization method and device for multi-unmanned aerial vehicle task allocation
CN101136081B (en) Unmanned aircraft multiple planes synergic tasks distributing method based on ant colony intelligence
CN107807665B (en) Unmanned aerial vehicle formation detection task cooperative allocation method and device
CN107677273A (en) A kind of cluster unmanned plane Multiple routes planning method based on two-dimensional grid division
CN114422056A (en) Air-ground non-orthogonal multiple access uplink transmission method based on intelligent reflecting surface
CN102778229A (en) Mobile Agent path planning method based on improved ant colony algorithm under unknown environment
CN114020031B (en) Unmanned aerial vehicle cluster collaborative dynamic target searching method based on improved pigeon colony optimization
CN114460959A (en) Unmanned aerial vehicle group cooperative autonomous decision-making method and device based on multi-body game
CN110109653B (en) Intelligent engine for land fighter chess and operation method thereof
CN116736883B (en) Unmanned aerial vehicle cluster intelligent cooperative motion planning method
CN114167898B (en) Global path planning method and system for collecting data of unmanned aerial vehicle
Park et al. APE: A data-driven, behavioral model-based anti-poaching engine
CN116702633B (en) Heterogeneous warhead task reliability planning method based on multi-objective dynamic optimization
CN115016537B (en) Heterogeneous unmanned aerial vehicle configuration and task planning combined optimization method in SEAD scene
Yasear et al. Fine-Tuning the Ant Colony System Algorithm Through Harris’s Hawk Optimizer for Travelling Salesman Problem.
CN114326822B (en) Unmanned aerial vehicle cluster information sharing method based on evolutionary game
CN115454067A (en) Path planning method based on fusion algorithm
CN113283827B (en) Two-stage unmanned aerial vehicle logistics path planning method based on deep reinforcement learning
CN112486185A (en) Path planning method based on ant colony and VO algorithm in unknown environment
Gaowei et al. Using multi-layer coding genetic algorithm to solve time-critical task assignment of heterogeneous UAV teaming
CN114578845B (en) Unmanned aerial vehicle track planning method based on improved ant colony algorithm
Li et al. Improved genetic algorithm for multi-agent task allocation with time windows

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination