WO2022007179A1 - Multi-AGV motion planning method, device and system - Google Patents

Multi-AGV motion planning method, device and system

Info

Publication number
WO2022007179A1
WO2022007179A1 (application PCT/CN2020/114422)
Authority
WO
WIPO (PCT)
Prior art keywords
agv
action
current
neural network
current agv
Prior art date
Application number
PCT/CN2020/114422
Other languages
English (en)
French (fr)
Inventor
王学强
张一凡
邹李兵
张富强
Original Assignee
歌尔股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 歌尔股份有限公司
Priority to US17/309,922 priority Critical patent/US20220317695A1/en
Publication of WO2022007179A1 publication Critical patent/WO2022007179A1/zh

Classifications

    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • B60W60/0011 Drive control systems specially adapted for autonomous road vehicles; planning or execution of driving tasks involving control alternatives for a single driving scenario, e.g. planning several paths to avoid obstacles
    • G05D1/0088 Control of position, course, altitude or attitude of land, water, air or space vehicles characterized by the autonomous decision making process, e.g. artificial intelligence, predefined behaviours
    • G05D1/0214 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • G05D1/0289 Control of position or course in two dimensions specially adapted to land vehicles involving a plurality of land vehicles, e.g. fleet or convoy travelling, with means for avoiding collisions between vehicles
    • G05D1/0297 Fleet control by controlling means in a control room
    • G06N3/04 Neural networks; architecture, e.g. interconnection topology
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks, characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/08 Neural networks; learning methods
    • G06N3/092 Reinforcement learning
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • the present application relates to the field of computer technology, and in particular to a multi-AGV motion planning method, device and system.
  • AGV: Automated Guided Vehicle
  • the present application provides a multi-AGV motion planning method, device and system for solving or partially solving the above technical problems.
  • the embodiment of the present application provides a multi-AGV motion planning method, the method includes:
  • An object model for describing the sequence decision-making process of multi-AGV motion planning is established by reinforcement learning method.
  • the object model includes: AGV state, action space and evaluation indicators of motion planning results.
  • the AGV state includes the current AGV state, other AGV states calculated according to the current AGV state, and the permitted actions of the current AGV;
  • an embodiment of the present application provides a multi-AGV motion planning device, the device comprising:
  • the modeling unit is used to establish an object model for describing the sequence decision-making process of multi-AGV motion planning through reinforcement learning methods.
  • the object model includes: AGV state, action space and evaluation indicators of motion planning results, where the AGV state includes the current AGV state, other AGV states calculated based on the current AGV state, and the permitted actions of the current AGV;
  • the training unit is used to build a neural network model based on the object model, perform environment settings including AGV group deployment, and use the AGV object model in the set environment to train the neural network model until a stable neural network model is obtained;
  • the setting unit is used to set the action constraint rules, and use the action constraint rules to judge the validity of the current AGV's to-be-executed actions obtained based on the evaluation index;
  • the implementation unit is used to input the current AGV state, other AGV states and permitted actions in the current environment into the trained neural network model after starting the motion planning, obtain the evaluation index of the motion planning result output by the neural network model, obtain the action to be executed by the current AGV according to the evaluation index, and use the action constraint rules to judge the validity of the action to be executed, so that the current AGV executes a valid action according to the judgment result.
  • the embodiments of the present application provide a multi-AGV motion planning system, including a control center that deploys a multi-AGV motion planning device, an AGV group and a task scheduling center platform deployed in an environment composed of multiple AGVs;
  • the AGV in the AGV group uses its own sensor to obtain its own state and upload it to the control center, and receives the action command issued by the control center, and executes the action corresponding to the action command;
  • the task scheduling center platform is used to complete task planning and task distribution, send the task of an AGV to the control center, and the control center controls the AGV to complete the task;
  • the control center has a built-in stable neural network model, inputs the received AGV state into the neural network model, uses the neural network model to calculate the motion planning strategy of the AGV, and then generates action instructions according to the motion planning strategy and sends them to the AGV.
  • the present application provides an electronic device including a memory and a processor; the memory stores computer-executable instructions; and the computer-executable instructions, when executed, cause the processor to execute the multi-AGV motion planning method.
  • the present application provides a computer-readable storage medium on which one or more computer programs are stored, and when the one or more computer programs are executed, implement a multi-AGV motion planning method.
  • the embodiment of the present invention adopts the deep reinforcement learning method, utilizes the advantages of the neural network model in calculating high-dimensional state space and the characteristics of reinforcement learning online control, and improves the multi-AGV motion planning method in dynamic environments.
  • the optimal path search algorithm and the deep reinforcement learning algorithm are combined, and the optimal path search algorithm is used to constrain the learning direction of the deep reinforcement learning algorithm, so that the AGV can plan a feasible route in a dense environment and avoid falling into a local deadlock state.
  • FIG. 1 is a schematic flowchart of a multi-AGV motion planning method according to an embodiment of the present application
  • FIG. 2 is a schematic diagram of a multi-AGV motion environment according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a multi-AGV motion planning design according to an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a deep reinforcement learning training framework established based on the GA3C framework according to an embodiment of the present application;
  • FIG. 5 is a functional block diagram of a multi-AGV motion planning apparatus according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a multi-AGV motion planning system according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a control state of a system by a control center according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application.
  • multi-AGV motion planning is carried out in a dynamic and dense environment.
  • the multi-AGV motion planning problem faces problems such as increased search space and dynamic changes in the environment, and the difficulty will increase exponentially with the expansion of the environment and the number of AGVs.
  • existing motion planning algorithms are difficult to solve or too computationally expensive, and in dense environments they are prone to AGV deadlock and cannot complete motion planning well.
  • the embodiments of this application are based on the deep reinforcement learning method to solve the motion planning problem of multiple AGVs in a dynamic environment, and combine the deep reinforcement learning method with the optimal path search method to solve the motion planning problem of multiple AGVs in a dense environment, thereby completing the multi-AGV motion planning task in a dynamic, dense environment.
  • the method includes the following steps:
  • step S110: an object model for describing the sequential decision-making process of multi-AGV motion planning is established by the reinforcement learning method; the object model includes the AGV state, the action space and the evaluation indicators of the motion planning result, where the AGV state includes the current AGV state, other AGV states calculated based on the current AGV state, and the permitted actions of the current AGV.
  • the permitted actions of the current AGV are obtained according to the optimal path strategy and the action space.
  • step S120 a neural network model is built based on the object model, environment settings including AGV group deployment are performed, and the neural network model is trained by using the object model of the AGV in the set environment until a stable neural network model is obtained.
  • Step S130 setting an action constraint rule, and using the action constraint rule to judge the validity of the current AGV's to-be-executed action obtained based on the evaluation index.
  • the action constraint rule can judge the validity of the action to be executed by the current AGV.
  • when the action to be executed is judged to be valid, the current AGV can directly execute it after the motion planning is started.
  • when the action to be executed is judged to be invalid, after the motion planning is started, the current AGV needs to extract, from the action space, multiple executable actions contained in its optimal path and select a corresponding action to execute from them.
  • step S140: after starting the motion planning, input the current AGV state, other AGV states and permitted actions in the current environment into the trained neural network model, obtain the evaluation index of the motion planning result output by the neural network model, obtain the action to be executed by the current AGV according to the evaluation index, and use the action constraint rule to judge the validity of the action to be executed, so that the current AGV executes a valid action according to the judgment result.
  • this embodiment adopts the deep reinforcement learning method on the one hand, and uses the advantages of the neural network model in computing high-dimensional state space and the characteristics of reinforcement learning online control to improve the multi-AGV motion planning method in a dynamic environment.
  • on the other hand, this embodiment combines the optimal path search algorithm with the deep reinforcement learning algorithm, and uses the optimal path search algorithm to constrain the learning direction of the deep reinforcement learning algorithm, so that the AGV can plan a feasible route in a dense environment and avoid falling into a local deadlock state.
  • in Figure 2, the solid black circles represent the positions of the AGVs; the solid black stars represent the target positions of the AGVs; the hatched areas represent forbidden zones in the environment, such as obstacles, work areas and shelves, which can generally be drawn according to the map drawings of the task scene, and which the AGVs are prohibited from entering; the dotted lines represent paths and the arrows indicate path directions, and an AGV must travel along its path in the indicated direction; the road network is composed of paths and represents the routes the AGVs are permitted to travel in the environment.
  • the paths in the road network need to form a closed loop, that is, taking any two points on the road network as the departure position and the target position, at least one path from the departure position to the target position can be found on the road network.
  • the motion planning algorithm design includes the construction of the object model, the design and training of the neural network model, and the design of action constraints.
  • the above-mentioned step S110 specifically uses a Markov decision process (MDP) to describe the object model; a Markov decision process is a mathematical model of sequential decision making, used to simulate the stochastic policies and returns achievable by an agent in an environment whose state has the Markov property.
  • the object model includes: the AGV state, the action space, and the evaluation indicators of the motion planning result; the AGV state includes the current AGV state o_a, the other AGV states, and the permitted action a_a of the current AGV, and the evaluation indicators of the motion planning result include the policy π and the value function V; where,
  • the current AGV state o_a includes: the speed of the current AGV in its body polar coordinate system, the target position, the current AGV size, the average speed, and the road network distance between the current AGV and the target position;
  • the other AGV states include: the relative positions of the other AGVs in the body polar coordinate system of the current AGV, the relative positions of their targets in that coordinate system, their relative speeds in that coordinate system, the road network distances between the other AGVs and their target positions, the sizes of the other AGVs, the sums of the other AGV sizes and the current AGV size, and the road network distances between the other AGVs and the current AGV;
  • the permitted action a_a of the current AGV is obtained according to the optimal path strategy and the action space;
  • the action space a is the speed and angle in the body polar coordinate system of the current AGV;
  • the policy π describes the strategy by which the current AGV selects action a_t in the joint state o_t at time t;
  • the value function V describes the expected return of taking policy π in the joint state o_t;
  • the joint state o_t is the state composed of the current AGV state o_a, the other AGV states, and the permitted action a_a.
  • the object model further includes: a reward r describing the reward given when the current AGV takes action a_t, and a discount factor γ, a decay factor used when calculating the reward obtained by the current AGV for performing action a_t, which adjusts the output of the value function.
  • the AGV state s_a includes the AGV position, speed, target position, size, the average speed of the most recent n states (n is an integer greater than 1), and the road network distance between the AGV and its target.
  • the AGV state s_a is obtained in a way that does not depend on a specific sensor, and can be expressed as s_a = [p_x, p_y, v_x, v_y, p_gx, p_gy, r, v_m, d_grn] ∈ R^9, where, in the Euclidean coordinate system Σ_e of the multi-AGV motion environment,
  • (p_x, p_y) represents the position of the AGV,
  • (v_x, v_y) represents the speed of the AGV,
  • (p_gx, p_gy) represents the target position,
  • r represents the size of the AGV,
  • v_m represents the average speed of the most recent n states,
  • d_grn represents the road network distance between the AGV and the target.
  • to adapt to different motion environments, avoid the influence of coordinate transformations, and enable a decentralized training method, this embodiment transforms the AGV state represented in the global Euclidean coordinate system Σ_e into the body polar coordinate system Σ_p of the current AGV.
  • the AGV observation state represented under Σ_p, that is, the current AGV state o_a, is expressed as o_a = [r_v, θ_v, r_g, θ_g, r, v_m, d_grm] ∈ R^7, where
  • (r_v, θ_v) represents the polar coordinates of the current AGV speed,
  • (r_g, θ_g) represents the polar coordinates of the current AGV target,
  • r represents the current AGV size,
  • v_m represents the average speed of the most recent n states,
  • d_grm represents the road network distance between the current AGV and the target.
  • the AGV travels along the optimal path in the road network from the departure position to the target position; therefore, the actions the AGV is permitted to execute at its current location are constrained by the optimal path at that location, and the constraint is that the direction of a permitted action is consistent with the direction of the optimal path.
  • based on this, the action space a and the optimal path direction θ_or are encoded to obtain the permitted action a_a of the current AGV at its current location.
  • in this embodiment, the optimal path (or the optimal path direction θ_or) of the AGV can be obtained in the road network from the departure position and the target position by using a path search algorithm, where the path search algorithms include the A* algorithm (a direct search method for solving the shortest path in a static road network), the Dijkstra algorithm (a shortest path algorithm), and the M* algorithm (also known as the Big M algorithm).
  • the action space a taken by the current AGV refers to the velocity v and angle ⁇ in the body polar coordinate system ⁇ p of the current AGV.
  • the action space a is described in the body polar coordinate system Σ_p and is not affected by coordinate transformations of the global coordinate systems of different environments; it is expressed as a = [v, θ] ∈ R^2.
  • the state transition model p describes the probability distribution p(o_{t+1}, o_t | a_t) of the current AGV taking action a_t in the current joint state o_t and transitioning to the next joint state o_{t+1}.
  • the reward function describes the reward given by the multi-AGV motion environment for the current AGV to take action a t , including at least three description types.
  • the first description type R_goal is the reward given when the current AGV reaches or approaches the target position;
  • the second description type R_agv is the penalty given when the current AGV collides with or comes close to another AGV;
  • the third description type R_rn is the penalty given when the current AGV deviates from the road network; where,
  • R_goal is set as follows: when the current AGV reaches the target position p_g, the positive maximum reward value α is given; when it approaches the target position, a positive discounted reward value i*α is given; and when it is stationary or moves away from the target position, a negative discounted reward value -i*α is given; the first discount coefficient i is set based on the road network distance, and the positive and negative discounted reward values i*α and -i*α are calculated from the first discount coefficient i and the maximum reward value α;
  • R_agv is set as follows: when the distance between the current AGV and another AGV is less than a first threshold t_agv, the maximum penalty value β is given; when the distance is greater than the first threshold t_agv and less than a second threshold m*t_agv (m is a preset multiplier), a discounted penalty value j*β is given; and when the distance is greater than the second threshold m*t_agv, no penalty is given; the second discount coefficient j is set based on the distance, and the discounted penalty value j*β is calculated from the second discount coefficient j and the maximum penalty value β;
  • R_rn is set as follows: when the distance d_rn between the current AGV and its current path is not less than a third threshold t_rn, a penalty δ is given; otherwise no penalty is given;
  • at time t, the complete reward function is R_t = R_goal + R_agv + R_rn.
  • the strategy ⁇ of the current AGV to select the action a t can be described as:
  • w t represents the parameters of the current neural network model
  • the goal of this strategy ⁇ is to minimize the expected time for the current AGV to reach the target position, and to avoid collisions with other AGVs and avoid deviating from the road network.
  • V(o t ; w t ) E[R t
  • the object model can be described based on the Markov decision process, and the construction of the object model can be completed.
  • a neural network model can be constructed by the following methods:
  • a fully connected neural network is used to build a neural network model, as shown in Figure 3, the input of the neural network model includes the current AGV state, other AGV states and the current AGV's permission actions; among them,
  • the permission action of the current AGV is input to the first single hidden layer fully connected neural network,
  • the current AGV state is input to the second single hidden layer fully connected neural network,
  • other AGV states are input to the Long Short-Term Memory (LSTM) network, and then to the third single hidden layer fully connected neural network,
  • the outputs of the three single hidden layer fully connected neural networks are concatenated into one tensor and input to the double hidden layer fully connected neural network, which outputs the evaluation index of the motion planning result;
  • the output of the neural network model is the output of the double hidden layer fully connected neural network.
  • the dimension of the third single hidden layer fully connected neural network is larger than that of the first single hidden layer fully connected neural network
  • the dimension of the third single hidden layer fully connected neural network is larger than that of the second single hidden layer fully connected neural network
  • a single-hidden-layer fully-connected neural network is composed of the fully-connected layer and the activation function ReLU in series, that is, the first single-hidden-layer fully-connected neural network and the second single-hidden-layer fully-connected neural network are composed of the fully connected layer and ReLU in series.
  • the third single hidden layer fully connected neural network is composed of the fully-connected layer and the activation function ReLU in series.
  • since the tensor dimension of the other AGV states is much larger than those of the permitted actions and the current AGV state, directly concatenating them into one tensor would unbalance the input dimensions and degrade the performance of the neural network model; therefore, each of the three input tensors is first fed into its corresponding single hidden layer fully connected neural network to balance the dimensions.
  • concatenating the balanced outputs into one tensor and feeding it into the double hidden layer fully connected neural network can improve the performance of the entire model.
  • in one embodiment, before the other AGV states are input to the LSTM, the distances between the other AGVs and the current AGV are first extracted, and the other AGV states are sorted by their distance to the current AGV from far to near, and then input to the LSTM.
  • a deep reinforcement learning training framework can be established based on the GA3C training framework, and a neural network model can be built.
  • the established deep reinforcement learning training framework is composed of three parts: an agent composed of multiple AGVs, a data queue, and a GPU-based neural network model.
  • GA3C refers to the Asynchronous Advantage Actor-Critic (A3C) algorithm applied to Graphics Processing Unit (GPU).
  • the GA3C framework uses the GPU for reinforcement learning training, which can improve the speed and performance of model training.
  • a plurality of parallel computing GPUs are added to the GA3C training framework, and a multi-GPU-based neural network model is established.
  • the deep reinforcement learning training framework is composed of an agent, a data queue and a multi-GPU neural network model.
  • the deep reinforcement learning training framework in this embodiment includes:
  • An agent composed of multiple AGVs is used to interact with the external environment, obtain data such as the status and actions of AGVs in real time, and provide data for building a deep reinforcement learning training framework.
  • the data queue includes a predictor and a trainer, and the obtained training data and prediction data are stored in the data queue.
  • the predictor selects an action a_t according to the current policy π, collects the AGV state and action space as training data, and feeds them to the trainer for training; the predictor also inputs the current AGV state to the neural network model, obtains the policy π and the value function V from the neural network model, selects an action a_t, and calculates the reward r to obtain prediction data; the trainer inputs the current AGV state, the permitted actions, and the other AGV states to the neural network model and trains it in order to update the parameters of the neural network model.
  • step S120 in FIG. 1 specifically sets up at least one environment, where multiple AGVs are deployed in each environment, and the map and/or road network and/or number of AGVs in each environment are different;
  • the AGV state, other AGV states and the permitted actions of the current AGV are input to the neural network model for training to update the parameters of the neural network model.
  • the multi-GPU neural network model is composed of multiple GPUs performing parallel operations to form a multi-GPU neural network model based on deep reinforcement learning.
  • the process of updating the parameters of the neural network model includes:
  • the current AGV state, the permitted actions, and the other AGV states are used as training data; the parameters of the neural network model are randomly initialized; based on the obtained action a_t and the obtained reward r, the policy loss function f_π(w) and the value function loss function f_V(w) are calculated; the parameters of the neural network model in each GPU are updated through the back-propagation algorithm to train the neural network model; the changes of the reward r, the policy loss function f_π(w) and the value function loss function f_V(w) accumulated over a period of time are monitored, and the performance of the neural network model is judged from the reward and the loss functions until a stable neural network model is obtained.
  • a higher accumulated reward and lower loss function values indicate better model performance; when the values of both the reward and the loss functions reach a stable interval, a stable neural network model is obtained.
  • the policy loss function f_π(w) is expressed as f_π(w) = log π(a_t | o_t; w_t)(R_t - V(o_t; w_t)) + φH(π(o_t; w)), where φH(π(o_t; w)) is a regularization term that adjusts f_π(w); the value function loss function f_V(w) is expressed as f_V(w) = (R_t - V(o_t; w_t))^2.
  • the action constraint rule is used to limit the strategy output by the neural network model and further limit the selection of the current AGV action, so as to constrain the action executed by the current AGV within the range of actions permitted by the optimal path and promote convergence of the model during training.
  • the current AGV action to be executed obtained based on the evaluation index in step S130 of FIG. 1 may be the action with the highest probability in the neural network output strategy selected by a greedy algorithm.
  • correspondingly, using the action constraint rules to judge the validity of the action to be executed, so that the current AGV executes a valid action according to the judgment result, includes: judging whether the action to be executed is a permitted action of the current AGV; if it is, the current AGV executes it;
  • if the action to be executed is not a permitted action of the current AGV, the valid action of the current AGV is calculated from the evaluation index of the motion planning result output by the neural network model and the permitted actions of the current AGV, and executed. Specifically: multiple first permitted actions contained in the optimal path are extracted from the action space of the current AGV; a second permitted action satisfying a preset condition is selected from the multiple first permitted actions using the evaluation index of the motion planning result output by the neural network model; whether the current AGV would collide when executing the second permitted action is calculated according to the other AGV states; if a collision would occur, the current AGV executes a stationary action, and if not, the current AGV executes the second permitted action.
  • the probability value corresponding to each first permission action may be calculated by using the strategy ⁇ , and the first permission action corresponding to the maximum probability value is selected as the second permission action.
  • the present application also provides a multi-AGV motion planning device.
  • FIG. 5 is a functional block diagram of a multi-AGV motion planning apparatus according to an embodiment of the present application. As shown in FIG. 5 , the multi-AGV motion planning apparatus 500 includes:
  • the modeling unit 510 is used to establish an object model for describing the sequence decision-making process of multi-AGV motion planning through the reinforcement learning method.
  • the object model includes: AGV state, action space and evaluation indicators of motion planning results, wherein the AGV state includes: The current AGV state, other AGV states calculated according to the current AGV state, and the permitted actions of the current AGV;
  • a training unit 520, configured to build a neural network model based on the object model, perform environment settings including AGV group deployment, and train the neural network model using the object model of the AGVs in the set environment until a stable neural network model is obtained;
  • the setting unit 530 is configured to set an action constraint rule, and use the action constraint rule to judge the validity of the to-be-executed action of the current AGV obtained based on the evaluation index;
  • the implementation unit 540 is used to input the current AGV state, other AGV states and permitted actions in the current environment into the trained neural network model after starting the motion planning, obtain the evaluation index of the motion planning result output by the neural network model, obtain the action to be executed by the current AGV according to the evaluation index, and use the action constraint rule to judge the validity of the action to be executed, so that the current AGV executes a valid action according to the judgment result.
  • the implementing unit 540 includes a judgment module, a first action execution module and a second action execution module;
  • the judgment module is used to judge whether the action to be executed is the permitted action of the current AGV
  • a first action execution module configured to make the current AGV perform the to-be-executed action when the to-be-executed action is a permitted action of the current AGV
  • the second action execution module is used to calculate and execute the effective action of the current AGV according to the evaluation index of the motion planning result output by the neural network model and the permitted action of the current AGV when the action to be executed is not the permitted action of the current AGV.
  • the second action execution module is further configured to extract multiple first permitted actions contained in the optimal path from the action space of the current AGV; select, using the evaluation index of the motion planning result output by the neural network model, a second permitted action satisfying a preset condition from the multiple first permitted actions; and calculate, according to the other AGV states, whether the current AGV would collide when executing the second permitted action; if a collision would occur, the current AGV executes a stationary action, and if not, the current AGV executes the second permitted action.
  • the evaluation indicators of the motion planning result include: a policy π describing the current AGV's selection of action a_t in the joint state at time t, and a value function V describing the expected return of taking the policy π in the joint state, the joint state being a state composed of the current AGV state, other AGV states and permitted actions;
  • the second action execution module is specifically configured to use the strategy ⁇ to calculate the probability value corresponding to each first permission action, and select the first permission action corresponding to the maximum probability value as the second permission action.
  • the modeling unit 510 uses the Markov decision process to describe the object model, and the permitted action is obtained according to the optimal path strategy and the action space, and the action space is the speed and angle in the polar coordinate system of the current AGV;
  • the current AGV status includes: the current speed of the AGV in the body polar coordinate system, the target position, the current AGV size, the average speed, and the road network distance between the current AGV and the target position;
  • other AGV states include: the relative positions of the other AGVs in the body polar coordinate system of the current AGV, the relative positions of their targets in that coordinate system, their relative speeds in that coordinate system, the road network distances between the other AGVs and their target positions, the sizes of the other AGVs, the sums of the other AGV sizes and the current AGV size, and the road network distances between the other AGVs and the current AGV;
  • the object model also includes: a reward used to describe the action given by the current AGV, the reward includes at least three description types, the first description type is the reward given when the current AGV reaches or approaches the target position, and the second description type is The penalty given when the current AGV collides or approaches other AGVs, and the third description type is the penalty given when the current AGV deviates from the road network; among them,
  • the first description type is set as follows: when the current AGV reaches the target position p_g, the positive maximum reward value α is given; the first discount coefficient i is set based on the road network distance, and the positive discounted reward value i*α and the negative discounted reward value -i*α are calculated from the first discount coefficient i and the maximum reward value α; when the current AGV approaches the target position p_g, the positive discounted reward value i*α is given; and when the current AGV is stationary or moves away from the target position p_g, the negative discounted reward value -i*α is given;
  • the second description type is set as follows: when the distance between the current AGV and another AGV is less than a first threshold t_agv, the maximum penalty value β is given; the second discount coefficient j is set based on the distance, and the discounted penalty value j*β is calculated from the second discount coefficient j and the maximum penalty value β; when the distance is greater than the first threshold t_agv and less than a second threshold m*t_agv, the discounted penalty value j*β is given; and when the distance is greater than the second threshold m*t_agv, no penalty is given, where m is a preset multiplier;
  • the third description type is set as follows: when the distance between the current AGV and its current path is not less than a third threshold t_rn, a penalty δ is given; when the distance is less than the third threshold t_rn, no penalty is given.
  • the training unit 520 includes a model building module, which uses fully connected neural networks to build the neural network model; the input of the neural network model includes the current AGV state, the other AGV states, and the permitted actions of the current AGV; the permitted actions of the current AGV are input to the first single hidden layer fully connected neural network, the current AGV state is input to the second single hidden layer fully connected neural network, and the other AGV states are input to the long short-term memory network and then to the third single hidden layer fully connected neural network;
  • the outputs of the three single hidden layer fully connected neural networks are concatenated into one tensor and input to the double hidden layer fully connected neural network, which outputs the evaluation index of the motion planning result;
  • the dimension of the fully connected neural network with the third single hidden layer is greater than that of the fully connected neural network with the first single hidden layer, and the dimension of the fully connected neural network with the third single hidden layer is greater than that of the fully connected neural network with the second single hidden layer.
  • the training unit 520 further includes an environment setting module, which is used to set up at least one environment, with multiple AGVs deployed in each environment and the maps and/or road networks and/or numbers of AGVs differing between environments; the current AGV state, the other AGV states and the permitted actions of the current AGV in each environment are input into the neural network model for training to update the parameters of the neural network model.
  • FIG. 6 is a schematic structural diagram of a multi-AGV motion planning system according to an embodiment of the present application
  • FIG. 7 shows a schematic diagram of a control state of the system by a control center in an embodiment of the present application.
  • as shown in FIG. 6 and FIG. 7, the multi-AGV motion planning system 600 includes: a control center 610 in which the multi-AGV motion planning device 500 is deployed, an AGV group 620 composed of a plurality of AGVs deployed in an environment, and a task scheduling center platform 630;
  • the AGVs in the AGV group 620 use their own sensors to obtain their own states and upload them to the control center 610 , and receive action instructions issued by the control center 610 to execute actions corresponding to the action instructions.
  • the AGVs in the same environment operate in a multi-process mode, and the multi-process data sharing technology is used to realize multi-AGV data communication.
  • the multi-process data synchronization technology is used to realize multi-AGV synchronous operation.
  • the task scheduling center platform 630 is used to complete task planning and task distribution, and send the task of a certain AGV to the control center 610, and the control center 610 controls the AGV to complete the task.
  • the task scheduling center platform sends the task of the jth AGV to the control center, and the control center controls the jth AGV to complete the task.
  • the task scheduling center platform can include a display interface to supervise the execution of tasks by the AGV group in real time, and to intervene in a timely manner in emergency situations such as collisions and environmental changes.
  • the control center 610 has a built-in stable neural network model, inputs the received state of the AGV into the neural network model, uses the neural network model to calculate the motion planning strategy of the AGV, and then generates an action instruction according to the motion planning strategy and sends it to the AGV.
  • the technical solution of the present application adopts the deep reinforcement learning method, utilizes the advantages of the neural network model in calculating high-dimensional state space and the characteristics of reinforcement learning online control, and improves the multi-AGV motion planning method in dynamic environments.
  • on the other hand, this embodiment combines the optimal path search algorithm with the deep reinforcement learning algorithm and uses the optimal path search algorithm to constrain the learning direction of the deep reinforcement learning algorithm, so that the AGV can plan a feasible route in a dense environment and avoid falling into a local deadlock state.
  • modules in the device in the embodiment can be adaptively changed and arranged in one or more devices different from the embodiment.
  • the modules or units or components in the embodiments may be combined into one module or unit or component, and may further be divided into multiple sub-modules or sub-units or sub-assemblies. Unless at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination.
  • Each feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
  • Various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof.
  • a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all functions of some or all components of the multi-AGV motion planning apparatus according to the embodiments of the present application.
  • the present application can also be implemented as an apparatus or apparatus program (eg, computer programs and computer program products) for performing part or all of the methods described herein.
  • Such a program implementing the present application may be stored on a computer-readable medium, or may be in the form of one or more signals. Such signals may be downloaded from Internet sites, or provided on carrier signals, or in any other form.
  • FIG. 8 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • the electronic device 800 includes a processor 810 and a memory 820 arranged to store computer-executable instructions (computer-readable program code).
  • the memory 820 may be electronic memory such as flash memory, EEPROM (electrically erasable programmable read only memory), EPROM, hard disk, or ROM.
  • the memory 820 has storage space 830 for storing computer readable program code 831 for performing any of the method steps in the above-described methods.
  • the storage space 830 for storing computer-readable program code may include various computer-readable program codes 831 for implementing various steps in the above methods, respectively.
  • Computer readable program code 831 can be read from or written to one or more computer program products. These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks. Such a computer program product is typically a computer-readable storage medium as described in FIG. 9 .
  • FIG. 9 shows a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application.
  • the computer-readable storage medium 900 stores computer-readable program code 831 for performing the method steps according to the present application, which can be read by the processor 810 of the electronic device 800; when the computer-readable program code 831 is run by the electronic device 800, the electronic device 800 is caused to execute each step of the method described above.
  • the computer-readable program code 831 stored in the computer-readable storage medium can execute the method shown in any of the above-mentioned embodiments.
  • the computer readable program code 831 may be compressed in a suitable form.


Abstract

A multi-AGV motion planning method, device (500) and system (600). The method includes: establishing an object model by a reinforcement learning method (S110); building a neural network model based on the object model, performing environment settings including the deployment of an AGV group (620), and training the neural network model using the object model of the AGVs in the set environment until a stable neural network model is obtained (S120); setting action constraint rules (S130); and, after starting motion planning, inputting the current AGV state, other AGV states and permitted actions in the current environment into the trained neural network model, obtaining the evaluation index of the motion planning result output by the neural network model, obtaining the action to be executed by the current AGV according to the evaluation index, and judging the validity of the action to be executed using the action constraint rules, so that the current AGV executes a valid action according to the judgment result (S140). The performance of the multi-AGV motion planning method in dynamic, dense environments can thereby be improved.

Description

Multi-AGV motion planning method, device and system
Technical Field
The present application relates to the field of computer technology, and in particular to a multi-AGV motion planning method, device and system.
Background of the Invention
In recent years, the application of Automated Guided Vehicles (AGVs) in intelligent manufacturing and logistics has been gradually popularized, effectively improving transport efficiency in production, manufacturing and handling and reducing the burden of manual labor. Cooperative operation of multiple AGVs can improve AGV working efficiency, but it also makes the motion planning problem more complex and difficult. Common path planning algorithms, such as the A* algorithm and the artificial potential field method, suffer from high computational cost and are prone to falling into local minima when handling multi-AGV motion planning problems.
Summary of the Invention
The present application provides a multi-AGV motion planning method, device and system for solving or partially solving the above technical problems.
In one aspect, an embodiment of the present application provides a multi-AGV motion planning method, which includes:
establishing, by a reinforcement learning method, an object model for describing the sequential decision-making process of multi-AGV motion planning, the object model including the AGV state, the action space and the evaluation indicators of the motion planning result, where the AGV state includes the current AGV state, other AGV states calculated according to the current AGV state, and the permitted actions of the current AGV;
building a neural network model based on the object model, performing environment settings including AGV group deployment, and training the neural network model using the object model of the AGVs in the set environment until a stable neural network model is obtained;
setting action constraint rules, and using the action constraint rules to judge the validity of the action to be executed by the current AGV obtained based on the evaluation index;
after starting motion planning, inputting the current AGV state, other AGV states and permitted actions in the current environment into the trained neural network model, obtaining the evaluation index of the motion planning result output by the neural network model, obtaining the action to be executed by the current AGV according to the evaluation index, and judging the validity of the action to be executed using the action constraint rules, so that the current AGV executes a valid action according to the judgment result.
In another aspect, an embodiment of the present application provides a multi-AGV motion planning device, which includes:
a modeling unit, configured to establish, by a reinforcement learning method, an object model for describing the sequential decision-making process of multi-AGV motion planning, the object model including the AGV state, the action space and the evaluation indicators of the motion planning result, where the AGV state includes the current AGV state, other AGV states calculated according to the current AGV state, and the permitted actions of the current AGV;
a training unit, configured to build a neural network model based on the object model, perform environment settings including AGV group deployment, and train the neural network model using the object model of the AGVs in the set environment until a stable neural network model is obtained;
a setting unit, configured to set action constraint rules, and use the action constraint rules to judge the validity of the action to be executed by the current AGV obtained based on the evaluation index;
an implementation unit, configured to, after starting motion planning, input the current AGV state, other AGV states and permitted actions in the current environment into the trained neural network model, obtain the evaluation index of the motion planning result output by the neural network model, obtain the action to be executed by the current AGV according to the evaluation index, and judge the validity of the action to be executed using the action constraint rules, so that the current AGV executes a valid action according to the judgment result.
In yet another aspect, an embodiment of the present application provides a multi-AGV motion planning system, including a control center in which the multi-AGV motion planning device is deployed, an AGV group composed of multiple AGVs deployed in an environment, and a task scheduling center platform;
an AGV in the AGV group uses its own sensors to obtain its own state and uploads it to the control center, receives action instructions issued by the control center, and executes the actions corresponding to the action instructions;
the task scheduling center platform is configured to complete task planning and task dispatch, and to send the task of a certain AGV to the control center, which controls that AGV to complete the task;
the control center has a built-in stable neural network model, inputs the received AGV state into the neural network model, uses the neural network model to calculate the motion planning strategy of that AGV, and then generates action instructions according to the motion planning strategy and issues them to that AGV.
In a fourth aspect, the present application provides an electronic device including a memory and a processor; the memory stores computer-executable instructions; and the computer-executable instructions, when executed, cause the processor to execute the multi-AGV motion planning method.
In yet another aspect, the present application provides a computer-readable storage medium on which one or more computer programs are stored, and the one or more computer programs, when executed, implement the multi-AGV motion planning method.
The beneficial effects of the present application are as follows: on the one hand, the embodiments of the present invention adopt a deep reinforcement learning method and exploit the advantages of neural network models in handling high-dimensional state spaces and the online control characteristics of reinforcement learning to improve the performance of the multi-AGV motion planning method in dynamic environments; on the other hand, the optimal path search algorithm is combined with the deep reinforcement learning algorithm, and the optimal path search algorithm is used to constrain the learning direction of the deep reinforcement learning algorithm, so that the AGVs can plan feasible routes in dense environments and avoid falling into local deadlock states.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a multi-AGV motion planning method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a multi-AGV motion environment according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a multi-AGV motion planning design according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a deep reinforcement learning training framework established based on the GA3C framework according to an embodiment of the present application;
FIG. 5 is a functional block diagram of a multi-AGV motion planning device according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a multi-AGV motion planning system according to an embodiment of the present application;
FIG. 7 is a schematic diagram of the control state of the system by the control center according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
Exemplary embodiments will be described in detail here, and examples thereof are shown in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of devices and methods consistent with some aspects of the present application as detailed in the appended claims.
At present, when multi-AGV motion planning is carried out in dynamic, dense environments, the problem faces challenges such as an enlarged search space and dynamic changes in the environment, and its difficulty grows exponentially as the environment scale and the number of AGVs increase. Existing motion planning algorithms are difficult to solve or too computationally expensive, and in dense environments they are prone to AGV deadlock and cannot complete motion planning well.
Based on the above, the embodiments of the present application take the deep reinforcement learning method as a basis to solve the motion planning problem of multiple AGVs in dynamic environments, and combine the deep reinforcement learning method with the optimal path search method to solve the motion planning problem of multiple AGVs in dense environments, thereby completing the multi-AGV motion planning task in dynamic, dense environments.
Referring to FIG. 1, the method includes the following steps:
Step S110: establishing, by a reinforcement learning method, an object model for describing the sequential decision-making process of multi-AGV motion planning, the object model including the AGV state, the action space and the evaluation indicators of the motion planning result, where the AGV state includes the current AGV state, other AGV states calculated according to the current AGV state, and the permitted actions of the current AGV.
In this step, the permitted actions of the current AGV are obtained according to the optimal path strategy and the action space.
Step S120: building a neural network model based on the object model, performing environment settings including AGV group deployment, and training the neural network model using the object model of the AGVs in the set environment until a stable neural network model is obtained.
Step S130: setting action constraint rules, and using the action constraint rules to judge the validity of the action to be executed by the current AGV obtained based on the evaluation index.
In this step, the action constraint rules can judge the validity of the action to be executed by the current AGV. When the action to be executed is judged to be valid, the current AGV can directly execute it after motion planning is started. When the action to be executed is judged to be invalid, after motion planning is started the current AGV needs to extract, from the action space, multiple executable actions contained in its optimal path and select a corresponding action to execute from them.
Step S140: after starting motion planning, inputting the current AGV state, other AGV states and permitted actions in the current environment into the trained neural network model, obtaining the evaluation index of the motion planning result output by the neural network model, obtaining the action to be executed by the current AGV according to the evaluation index, and judging the validity of the action to be executed using the action constraint rules, so that the current AGV executes a valid action according to the judgment result.
As shown in FIG. 1, on the one hand this embodiment adopts the deep reinforcement learning method and exploits the advantages of neural network models in handling high-dimensional state spaces and the online control characteristics of reinforcement learning to improve the performance of the multi-AGV motion planning method in dynamic environments; on the other hand, this embodiment combines the optimal path search algorithm with the deep reinforcement learning algorithm and uses the optimal path search algorithm to constrain the learning direction of the deep reinforcement learning algorithm, so that the AGVs can plan feasible routes in dense environments and avoid falling into local deadlock states.
Taking a multi-AGV motion environment as an example, the implementation steps of the motion planning method for the current AGV in the multi-AGV motion environment are described in detail below.
The multi-AGV motion environment shown in FIG. 2 is a common AGV task scene, such as a factory, warehouse or logistics center. In FIG. 2, the solid black circles represent the positions of the AGVs; the solid black stars represent the target positions of the AGVs; the hatched areas represent forbidden zones in the environment, such as obstacles, work areas and shelves, which can generally be drawn according to actual conditions such as the map drawings and safety specifications of the task scene, and which the AGVs are prohibited from entering; the dotted lines represent paths, the arrows indicate the directions of the paths, and an AGV must travel along its path in the indicated direction; the road network is composed of paths and represents the routes the AGVs are permitted to travel within the environment. The paths in the road network need to form a closed loop, that is, taking any two points on the road network as a departure position and a target position, at least one path from the departure position to the target position can be found on the road network.
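The closed-loop requirement on the road network can be checked as a strong-connectivity test on the directed graph of paths. A minimal Python sketch, assuming an adjacency-list representation with illustrative node names:

```python
from collections import deque

def strongly_connected(graph):
    """Check the closed-loop requirement: every node can reach every other
    node along directed paths. graph: {node: [neighbor, ...]}."""
    nodes = list(graph)

    def reachable_from(start):
        seen, frontier = {start}, deque([start])
        while frontier:
            u = frontier.popleft()
            for v in graph.get(u, []):
                if v not in seen:
                    seen.add(v)
                    frontier.append(v)
        return seen

    return all(reachable_from(n) == set(nodes) for n in nodes)

# A four-node loop satisfies the requirement; removing one edge breaks it.
road_network = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}
print(strongly_connected(road_network))               # True
print(strongly_connected({**road_network, "D": []}))  # False
```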
After the multi-AGV motion environment is set up, the motion planning algorithm is designed; this design includes the construction of the object model, the design and training of the neural network model, and the design of the action constraints.
Construction of the object model:
In one embodiment, the above step S110 specifically describes the object model using a Markov decision process (MDP). A Markov decision process is a mathematical model of sequential decision making, used to simulate the stochastic policies and returns achievable by an agent in an environment whose system state has the Markov property.
In the multi-AGV motion environment of this embodiment, the object model includes the AGV state, the action space, and the evaluation indicators of the motion planning result; the AGV state includes the current AGV state o_a, the other AGV states, and the permitted action a_a of the current AGV, and the evaluation indicators of the motion planning result include the policy π and the value function V, where:
The current AGV state o_a includes: the speed of the current AGV in its body polar coordinate system, the target position, the current AGV size, the average speed, and the road network distance between the current AGV and the target position.
The other AGV states include: the relative positions of the other AGVs in the body polar coordinate system of the current AGV, the relative positions of their targets in that coordinate system, their relative speeds in that coordinate system, the road network distances between the other AGVs and their target positions, the sizes of the other AGVs, the sums of the other AGV sizes and the current AGV size, and the road network distances between the other AGVs and the current AGV.
The permitted action a_a of the current AGV is obtained according to the optimal path strategy and the action space.
The action space a is the speed and angle in the body polar coordinate system of the current AGV.
The policy π describes the strategy by which the current AGV selects action a_t in the joint state o_t at time t.
The value function V describes the expected return of taking policy π in the joint state o_t.
The joint state o_t is the state composed of the current AGV state o_a, the other AGV states, and the permitted action a_a.
In an embodiment, the object model further includes a reward r describing the reward given when the current AGV takes action a_t, and a discount factor γ, a decay factor used when calculating the reward obtained by the current AGV for executing action a_t, which adjusts the output of the value function.
That is, the specific contents of the above object model are as follows:
1. State s:
The AGV state s_a includes the AGV position, speed, target position, size, the average speed of the most recent n states (n is an integer greater than 1), and the road network distance between the AGV and its target. The AGV state s_a is obtained in a way that does not depend on a specific sensor and can be expressed as:
s_a = [p_x, p_y, v_x, v_y, p_gx, p_gy, r, v_m, d_grn] ∈ R^9
where, in the Euclidean coordinate system Σ_e based on the multi-AGV motion environment, (p_x, p_y) represents the position of the AGV, (v_x, v_y) represents the speed of the AGV, (p_gx, p_gy) represents the target position, r represents the size of the AGV, v_m represents the average speed of the most recent n states, and d_grn represents the road network distance between the AGV and the target.
Similarly, relative to the current AGV, the states of the other AGVs in the multi-AGV motion environment are represented in the same form, the state of the i-th AGV being expressed analogously to s_a above.
Observation o:
To adapt to different motion environments, avoid the influence of coordinate transformations, and enable a decentralized training method, this embodiment transforms the AGV state represented in the global Euclidean coordinate system Σ_e into the AGV observation state represented in the body polar coordinate system Σ_p of the current AGV, that is, the current AGV state o_a, expressed as:
o_a = [r_v, θ_v, r_g, θ_g, r, v_m, d_grm] ∈ R^7
where (r_v, θ_v) represents the polar coordinates of the current AGV speed, (r_g, θ_g) represents the polar coordinates of the current AGV target, r represents the current AGV size, v_m represents the average speed of the most recent n states, and d_grm represents the road network distance between the current AGV and the target.
Similarly, the states of the other AGVs in the environment are observed in the body polar coordinate system Σ_p of the current AGV, that is, the other AGV states, where the observation state of the i-th AGV includes: the polar coordinates of the relative position of the i-th AGV under Σ_p, the polar coordinates of the relative speed of the i-th AGV under Σ_p, the polar coordinates of the relative position of the target of the i-th AGV under Σ_p, the road network distance between the i-th AGV and its target, the size of the i-th AGV, the sum of the sizes of the i-th AGV and the current AGV, and the road network distance between the i-th AGV and the current AGV.
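A minimal Python sketch of the transformation from the global Euclidean state s_a to the body polar observation o_a described above; the assumption that the AGV heading defines the polar axis of Σ_p, and the externally supplied road-network distance d_grm, are illustrative:

```python
import math
import numpy as np

def to_body_polar(s_a, heading, d_grm):
    """Transform a global Euclidean AGV state s_a into the body polar
    observation o_a = [r_v, theta_v, r_g, theta_g, r, v_m, d_grm].

    s_a     : [p_x, p_y, v_x, v_y, p_gx, p_gy, r, v_m, d_grn] in the global frame
    heading : assumed orientation of the AGV (rad), used as the polar axis
    d_grm   : road-network distance from the current AGV to its target
    """
    p_x, p_y, v_x, v_y, p_gx, p_gy, r, v_m, _ = s_a

    def polar(dx, dy):
        # magnitude and angle relative to the AGV heading
        return math.hypot(dx, dy), math.atan2(dy, dx) - heading

    r_v, theta_v = polar(v_x, v_y)                 # speed in polar form
    r_g, theta_g = polar(p_gx - p_x, p_gy - p_y)   # target relative to the AGV
    return np.array([r_v, theta_v, r_g, theta_g, r, v_m, d_grm])

# Example: AGV at (1, 1) heading along +x, moving at 0.5 m/s, target at (4, 5)
s_a = [1.0, 1.0, 0.5, 0.0, 4.0, 5.0, 0.4, 0.45, 7.0]
print(to_body_polar(s_a, heading=0.0, d_grm=7.0))
```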
Permitted action a_a:
The AGV travels along the optimal path in the road network from the departure position to the target position; therefore, the actions the AGV is permitted to execute at its current location are constrained by the optimal path at that location, and the constraint is that the direction of a permitted action is consistent with the direction of the optimal path. Based on this, this embodiment encodes the action space a with the optimal path direction θ_or to obtain the permitted action a_a of the current AGV at its current location.
In this embodiment, the optimal path (or the optimal path direction θ_or) of the AGV can be obtained in the road network from the departure position and the target position by using a path search algorithm, where the path search algorithms include the A* algorithm (a direct search method for solving the shortest path in a static road network), the Dijkstra algorithm (a shortest path algorithm), and the M* algorithm (also known as the Big M algorithm).
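A minimal Python sketch of obtaining the optimal path with one of the named algorithms (Dijkstra) and of restricting the action space to actions whose direction is consistent with the optimal path direction θ_or; the toy road network, the angular tolerance and the discretized action set are illustrative assumptions, and the application's actual encoding of a with θ_or is not reproduced here:

```python
import heapq
import math

def dijkstra(graph, start, goal):
    """Shortest path on a directed road network.
    graph: {node: [(neighbor, edge_length), ...]}, nodes are (x, y) tuples."""
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, math.inf):
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [goal], goal            # reconstruct path from start to goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

def optimal_path_direction(path):
    """Direction theta_or of the first segment of the optimal path."""
    (x0, y0), (x1, y1) = path[0], path[1]
    return math.atan2(y1 - y0, x1 - x0)

def permitted_actions(action_space, theta_or, tol=math.pi / 8):
    """Keep only actions whose angle is consistent with the optimal path
    direction (illustrative consistency test; the tolerance is an assumption)."""
    return [(v, th) for v, th in action_space
            if abs(math.atan2(math.sin(th - theta_or), math.cos(th - theta_or))) <= tol]

# A toy road network with positions as node identifiers
graph = {
    (0, 0): [((1, 0), 1.0)],
    (1, 0): [((1, 1), 1.0), ((2, 0), 1.0)],
    (1, 1): [((2, 1), 1.0)],
    (2, 0): [((2, 1), 1.0)],
    (2, 1): [],
}
path = dijkstra(graph, (0, 0), (2, 1))
theta_or = optimal_path_direction(path)
actions = [(0.5, th) for th in (-math.pi / 2, 0.0, math.pi / 2, math.pi)]
print(path, theta_or, permitted_actions(actions, theta_or))
```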
2. Action space a
The action space a taken by the current AGV refers to the speed v and angle θ in the body polar coordinate system Σ_p of the current AGV. The action space a is described in the body polar coordinate system Σ_p and is not affected by coordinate transformations of the global coordinate systems of different environments; it is expressed as:
a = [v, θ] ∈ R^2
3. State transition model p:
The probability distribution p(o_{t+1}, o_t | a_t) of the current AGV taking action a_t in the current joint state o_t and transitioning to the next joint state o_{t+1}.
4. Reward r
The reward function describes the reward given by the multi-AGV motion environment for the current AGV taking action a_t, and includes at least three description types: the first description type R_goal is the reward given when the current AGV reaches or approaches the target position, the second description type R_agv is the penalty given when the current AGV collides with or comes close to another AGV, and the third description type R_rn is the penalty given when the current AGV deviates from the road network; where,
the first description type R_goal is set as follows:
when the current AGV reaches the target position p_g, the positive maximum reward value α is given;
when the current AGV approaches the target position p_g, a positive discounted reward value i*α is given; and,
when the current AGV is stationary or moves away from the target position p_g, a negative discounted reward value -i*α is given;
where the first discount coefficient i is set based on the road network distance, and the positive discounted reward value i*α and the negative discounted reward value -i*α are calculated from the first discount coefficient i and the maximum reward value α.
The second description type R_agv is set as follows:
when the distance between the current AGV and another AGV is less than a first threshold t_agv, the maximum penalty value β is given;
when the distance between the current AGV and another AGV is greater than the first threshold t_agv and less than a second threshold m*t_agv, a discounted penalty value j*β is given; and,
when the distance between the current AGV and another AGV is greater than the second threshold m*t_agv, no penalty is given, where m is a preset multiplier;
where the second discount coefficient j is set based on the distance, and the discounted penalty value j*β is calculated from the second discount coefficient j and the maximum penalty value β.
The third description type R_rn is set as follows:
when the distance d_rn between the current AGV and its current path is not less than a third threshold t_rn, a penalty δ is given;
when the distance d_rn between the current AGV and its current path is less than the third threshold t_rn, no penalty is given.
At time t, the complete reward function R_t is:
R_t = R_goal + R_agv + R_rn
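A minimal Python sketch of the three reward terms above; the concrete values of α, β, δ, the thresholds t_agv, m and t_rn, and the way the discount coefficients i and j are computed are illustrative assumptions, since the application fixes only the structure of R_goal, R_agv and R_rn:

```python
def reward(d_grn_prev, d_grn, at_goal, d_other_min, d_rn,
           alpha=1.0, beta=-0.5, delta=-0.3,
           t_agv=0.5, m=2.0, t_rn=0.2):
    """Reward R_t = R_goal + R_agv + R_rn for the current AGV.
    All constants and both discount coefficients below are assumed values."""
    # R_goal: reward for reaching / approaching the target (road-network distance)
    if at_goal:
        r_goal = alpha
    else:
        i = min(1.0, abs(d_grn_prev - d_grn) / max(d_grn_prev, 1e-6))  # assumed coefficient
        r_goal = i * alpha if d_grn < d_grn_prev else -i * alpha       # stationary/receding -> negative

    # R_agv: penalty for colliding with or coming close to another AGV
    if d_other_min < t_agv:
        r_agv = beta
    elif d_other_min < m * t_agv:
        j = (m * t_agv - d_other_min) / (m * t_agv - t_agv)            # assumed coefficient
        r_agv = j * beta
    else:
        r_agv = 0.0

    # R_rn: penalty for deviating from the road network
    r_rn = delta if d_rn >= t_rn else 0.0
    return r_goal + r_agv + r_rn

print(reward(d_grn_prev=5.0, d_grn=4.6, at_goal=False, d_other_min=0.8, d_rn=0.05))
```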
5. Discount factor γ
The discount factor is a decay factor used when calculating the reward obtained by the current AGV for executing action a_t, and is used to adjust the output of the value function, γ ∈ [0, 1).
6. Policy π
According to the joint state o_t of the current AGV at time t, the policy π by which the current AGV selects action a_t can be described as:
π: (a_t | o_t; w_t)
where w_t represents the parameters of the current neural network model. The goal of the policy π is to minimize the expected time for the current AGV to reach the target position while avoiding collisions with other AGVs and avoiding deviation from the road network.
7. Value function V
The expected return of the current AGV taking policy π in the joint state o_t can be described as:
V(o_t; w_t) = E[R_t | o_t]
Using the parameters described above, the object model can be described based on the Markov decision process, completing the construction of the object model.
Design and training of the neural network model
In this embodiment, the neural network model can be constructed as follows:
Fully connected neural networks are used to build the neural network model. As shown in FIG. 3, the input of the neural network model includes the current AGV state, the other AGV states, and the permitted actions of the current AGV, where:
the permitted actions of the current AGV are input to the first single-hidden-layer fully connected neural network,
the current AGV state is input to the second single-hidden-layer fully connected neural network,
the other AGV states are input to a Long Short-Term Memory (LSTM) network and then to the third single-hidden-layer fully connected neural network,
the outputs of the three single-hidden-layer fully connected neural networks are concatenated into one tensor and input to a double-hidden-layer fully connected neural network, which outputs the evaluation index of the motion planning result;
the output of the neural network model is the output of the double-hidden-layer fully connected neural network.
The dimension of the third single-hidden-layer fully connected neural network is larger than that of the first single-hidden-layer fully connected neural network, and larger than that of the second single-hidden-layer fully connected neural network. Each single-hidden-layer fully connected neural network is composed of a fully connected layer followed by a ReLU activation function in series; that is, the first, second and third single-hidden-layer fully connected neural networks are each composed of a fully connected layer and a ReLU in series.
Since the tensor dimension corresponding to the other AGV states in this embodiment is much larger than the tensor dimensions of the permitted actions and the current AGV state, directly concatenating the other AGV states, the current AGV state and the permitted actions into one tensor and inputting it into the neural network model would affect the performance of the neural network model because of the imbalance of the input data dimensions, leading to poor outputs. In this embodiment, each of the three input tensors is first fed into its corresponding single-hidden-layer fully connected neural network, the dimensions of the input tensors are balanced by these networks, and the results are then concatenated into one tensor and fed into the double-hidden-layer fully connected neural network, which improves the performance of the entire model.
In one embodiment, before the other AGV states are input to the long short-term memory network, the distances between the other AGVs and the current AGV are first extracted, the other AGV states are sorted by their distance to the current AGV from far to near, and then input to the LSTM.
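A minimal PyTorch sketch of the network of FIG. 3 as described above (three single-hidden-layer fully connected branches, an LSTM over the far-to-near sorted other-AGV states, concatenation, and a double-hidden-layer head outputting the policy π and the value V); the layer widths, the LSTM hidden size and the use of a discretized policy head are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MotionPlanningNet(nn.Module):
    """Sketch of the network in FIG. 3: three single-hidden-layer FC branches
    (permitted actions, current AGV state, LSTM-encoded other AGV states),
    concatenated and fed to a double-hidden-layer FC head that outputs the
    policy pi and the value V. Sizes are illustrative assumptions."""

    def __init__(self, act_dim=8, obs_dim=7, other_dim=8, lstm_hidden=64):
        super().__init__()
        self.act_branch = nn.Sequential(nn.Linear(act_dim, 32), nn.ReLU())
        self.obs_branch = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU())
        self.lstm = nn.LSTM(other_dim, lstm_hidden, batch_first=True)
        # third branch has a larger dimension than the first two, as described
        self.other_branch = nn.Sequential(nn.Linear(lstm_hidden, 64), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(32 + 32 + 64, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.policy = nn.Linear(128, act_dim)   # logits over a discretized action set
        self.value = nn.Linear(128, 1)

    def forward(self, permitted, obs, others):
        # others: (batch, n_other, other_dim), sorted by distance from far to near
        a = self.act_branch(permitted)
        o = self.obs_branch(obs)
        _, (h_n, _) = self.lstm(others)
        s = self.other_branch(h_n[-1])
        x = self.head(torch.cat([a, o, s], dim=-1))
        return torch.softmax(self.policy(x), dim=-1), self.value(x)

# Example forward pass with a batch of 2 and 3 other AGVs
net = MotionPlanningNet()
pi, v = net(torch.rand(2, 8), torch.rand(2, 7), torch.rand(2, 3, 8))
print(pi.shape, v.shape)   # torch.Size([2, 8]) torch.Size([2, 1])
```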
在一个实施例中,可以基于GA3C训练框架建立深度强化学习训练框架,搭建神经网络模型,建立的深度强化学习训练框架由多个AGV构成的智能体、数据队列和基于GPU的神经网络模型三个部分组成。
需要说明的是,GA3C是指应用于图形处理器(Graphics Processing Unit,GPU)的异步优势评价器算法(Asynchronous Advantage Actor-Critic,A3C),GA3C框架使用GPU进行强化学习训练,能够提升模型训练速度和性能。
在一个优选实施例中,在GA3C训练框架中添加多个并行运算GPU,建立基于多GPU的神经网络模型,相应的,深度强化学习训练框架由智能体、数据队列和多GPU的神经网络模型组成。
如图4所示,本实施例中的深度强化学习训练框架包括:
由多个AGV构成的智能体,用于与外界环境交互,实时获取AGV的状态、动作等数据,为搭建深度强化学习训练框架提供数据。
数据队列包括预测器和训练器,得到的训练数据和预测数据均存储在该数据队列中。
其中,通过预测器根据当前策略π选择动作a t,并收集AGV状态、动作空间作为训练数据,输入训练器进行训练;通过预测器输入当前AGV状态到神经网络模型,从神经网络模型得到策略π和值函数V,并选择动作a t,计算奖励r得到预测数据;通过训练器输入当前AGV状态、许可动作、其他AGV状态到神经网络模型,训练神经网络模型,以更新神经网络模型的参数。
在一个实施例中,图1步骤S120具体是设置至少一个环境,每个环境中部署多个AGV,各个环境中的地图和/或路网和/或AGV数量不同;将每个环境下的当前AGV状态、其他AGV状态和当前AGV的许可动作输入到神经网络模型进行训练,以更新神经网络模型的参数。
可以理解的是,不同环境在路径、禁行区、AGV数量、目标等各个方面可以完全不同、部分相同或者完全一致,增加环境的类型,能够促进算法在不同环境下训练,保证模型的适用性。
在一个实施例中,多GPU的神经网络模型由多个并行运算的GPU共同承载,构成基于深度强化学习的多GPU神经网络模型。
在一个实施例中,更新神经网络模型的参数的过程包括:
将当前AGV状态、许可动作、其他AGV状态作为训练数据;随机初始化神经网络模型参数,基于获得动作a t和获得的奖励r,计算策略损失函数f π(w)和值函数损失函数f V(w);通过反向传播算法,更新各个GPU中神经网络模型的参数,训练神经网络模型;检测一段时间内累积的奖励r、策略损失函数f π(w)和值函数损失函数f V(w)的变化过程,根据奖励和损失函数判断神经网络模型的性能,直至得到稳定的神经网络模型。
本实施例中,累积的奖励越高,表示神经网络模型的性能越好,损失函数的值越低,表示神经网络模型的性能越好。当奖励和损失函数的值均到达稳定区间,则可得到稳定的神经网络模型。
其中,策略损失函数f π(w)表示为:
f π(w)=logπ(a t|o t;w t)(R t-V(o t;w t))+φH(π(o t;w))
其中,φH(π(o t;w))表示策略损失函数中的正则项,起到调节f π(w)的作用。
值函数损失函数f V(w)表示为:
f V(w)=(R t-V(o t;w t)) 2
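策略损失函数f π(w)与值函数损失函数f V(w)的一个PyTorch示意实现如下(熵正则项系数φ的取值为假设值;实现中对需要最大化的策略目标取负号,以便使用梯度下降,并非本申请限定的实现):

```python
import torch

def a3c_losses(pi, v, action, R_t, phi=0.01):
    # pi: 长度为n_actions的动作概率张量;v: 值函数输出(标量张量);
    # action: 所选动作a_t的索引;R_t: 折扣回报
    advantage = R_t - v.detach()                       # R_t - V(o_t; w_t)
    log_prob = torch.log(pi[action] + 1e-8)            # log π(a_t | o_t; w_t)
    entropy = -(pi * torch.log(pi + 1e-8)).sum()       # H(π(o_t; w)),熵正则项
    f_pi = -(log_prob * advantage + phi * entropy)     # 取负号以便用梯度下降最大化原目标
    f_v = (R_t - v).pow(2)                             # (R_t - V(o_t; w_t))^2
    return f_pi, f_v
```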
动作约束设计
在本实施例中,动作约束规则用于限制神经网络模型输出的策略,并进一步限制当前AGV动作的选择,从而将当前AGV执行的动作约束在最优路径许可动作范围内,促进模型在训练过程中收敛。
在一个实施例中,图1步骤S130中基于评估指标得到的当前AGV的待执行动作可以是采用贪婪算法选择神经网络输出策略中概率最大的动作。
相应的图1中利用动作约束规则对待执行动作进行有效判断,使当前AGV根据判断结果执行有效动作,包括:
判断待执行动作是否为当前AGV的许可动作,若待执行动作为当前AGV的许可动作,当前AGV执行该待执行动作。
若待执行动作不是当前AGV的许可动作,根据神经网络模型输出的运动规划结果的评估指标和当前AGV的许可动作,计算当前AGV的有效动作并执行。具体是:从当前AGV的动作空间中提取最优路径包含的多个第一许可动作;利用神经网络模型输出的运动规划结果的评估指标从多个第一许可动作中选择满足预设条件的第二许可动作;根据其他AGV状态计算当前AGV执行第二许可动作是否会发生碰撞,若发生碰撞,当前AGV执行静止动作,若未发生碰撞,当前AGV执行该第二许可动作。
在一个实施例中,可以利用策略π计算出每个第一许可动作对应的概率值,选择概率值最大对应的第一许可动作为第二许可动作。
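上述动作约束流程可以用如下Python草图示意(其中collide(a, others)为假设的碰撞预测接口,静止动作以速度为0的动作表示,均非本申请限定的实现):

```python
def select_valid_action(policy_probs, actions, permitted, others, collide):
    # policy_probs: 与actions等长的概率列表;permitted: 许可动作集合;
    # others: 其他AGV状态;collide(a, others): 预测执行动作a是否会碰撞(假设性接口)
    # 贪婪选择策略中概率最大的动作作为待执行动作
    best = max(range(len(actions)), key=lambda k: policy_probs[k])
    a_exec = actions[best]
    if a_exec in permitted:                # 待执行动作为许可动作,直接执行
        return a_exec
    # 否则在许可动作(第一许可动作)中选取概率最大者作为第二许可动作
    candidates = [k for k, a in enumerate(actions) if a in permitted]
    if not candidates:
        return (0.0, 0.0)                  # 无许可动作时执行静止动作(示例)
    second = actions[max(candidates, key=lambda k: policy_probs[k])]
    # 根据其他AGV状态预测碰撞:若会碰撞则执行静止动作,否则执行第二许可动作
    return (0.0, 0.0) if collide(second, others) else second
```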
与前述方法相对应,本申请还提供了一种多AGV运动规划装置。
图5是本申请一个实施例的多AGV运动规划装置的功能框图,如图5所示,多AGV运动规划装置500包括:
建模单元510,用于通过强化学习方法建立用于描述多AGV运动规划的序列决策过程的对象模型,该对象模型中包括:AGV状态、动作空间以及运动规划结果的评估指标,其中AGV状态包括当前AGV状态、根据当前AGV状态计算得到的其他AGV状态和当前AGV的许可动作;
训练单元520,用于基于对象模型搭建神经网络模型,进行包括AGV群部署在内的环境设置,利用所设置环境下AGV的对象模型对神经网络模型进行训练,直至得到稳定的神经网络模型;
设置单元530,用于设置动作约束规则,利用该动作约束规则对基于评估指标得到的当前AGV的待执行动作进行有效性判断;
实施单元540,用于启动运动规划后,将当前环境下的当前AGV状态、其他AGV状态和许可动作输入到训练后的神经网络模型中,获得该神经网络模型输出的运动规划结果的评估指标,根据该评估指标得到当前AGV待执行动作,并利用动作约束规则对该待执行动作进行有效判断,使当前AGV根据判断结果执行有效动作。
在一个实施例中,实施单元540包括判断模块、第一动作执行模块和第二动作执行模块;
判断模块,用于判断待执行动作是否为当前AGV的许可动作;
第一动作执行模块,用于在该待执行动作为当前AGV的许可动作时,使当前AGV执行该待执行动作;
第二动作执行模块,用于在该待执行动作不是当前AGV的许可动作时,根据神经网络模型输出的运动规划结果的评估指标和当前AGV的许可动作,计算当前AGV的有效动作并执行。
在一个实施例中,第二动作执行模块,还用于从当前AGV的动作空间中提取最优路径包含的多个第一许可动作;利用神经网络模型输出的运动规划结果的评估指标从多个第一许可动作中选择满足预设条件的第二许可动作;根据其他AGV状态计算当前AGV执行第二许可动作是否会发生碰撞,若发生碰撞,当前AGV执行静止动作,若未发生碰撞,当前AGV执行该第二许可动作。
在一个实施例中,运动规划结果的评估指标包括:用于描述当前AGV在t时刻的联合状态下选择动作a t的策略π和用于描述在联合状态下采取策略π的预期收益的值函数V,联合状态为由当前AGV状态、其他AGV状态和许可动作构成的状态;
第二动作执行模块,具体是用于利用策略π计算出每个第一许可动作对应的概率值,选择概率值最大对应的第一许可动作为第二许可动作。
在一个实施例中,建模单元510是使用马尔可夫决策过程描述对象模型,许可动作根据最优路径策略和动作空间获得,动作空间为在当前AGV的本体极坐标系下的速度与角度;
当前AGV状态包括:当前AGV在本体极坐标系下的速度、目标位置、当前AGV尺寸、平均速度、当前AGV与目标位置的路网距离;
其他AGV状态包括:其他AGV在当前AGV本体极坐标系下的相对位置、其他AGV在当前AGV本体极坐标系下其目标的相对位置、其他AGV在当前AGV本体极坐标系下其相对速度、其他AGV与其目标位置的路网距离、其他AGV尺寸、其他AGV尺寸与当前AGV尺寸的和值、其他AGV与当前AGV的路网距离;
对象模型还包括:用于描述当前AGV采取动作给予的奖励,该奖励包括至少三种描述类型,第一种描述类型为在当前AGV到达或靠近目标位置时给予的奖励,第二种描述类型为在当前AGV碰撞或靠近其他AGV时给予的惩罚,第三种描述类型为在当前AGV偏离路网时给予的惩罚;其中,
通过下述方式设置第一种描述类型:
在当前AGV到达目标位置p g时,给予正值的最大奖励值α;基于路网距离设置第一折扣系数i,由第一折扣系数i和最大奖励值α计算正值的折扣奖励值i*α和负值的折扣奖励值-i*α,在当前AGV靠近目标位置p g时,给予正值的折扣奖励值i*α;以及在当前AGV静止或远离目标位置p g时,给予负值的折扣奖励值-i*α;
通过下述方式设置第二种描述类型:
当前AGV与其他AGV的距离小于第一阈值条件t agv时,给予最大惩罚值β;基于距离设置第二折扣系数j,由第二折扣系数j和最大惩罚值β计算折扣惩罚值j*β,在当前AGV与其他AGV的距离大于第一阈值条件t agv且小于第二阈值条件m*t agv时,给予折扣惩罚值j*β;以及在当前AGV与其他AGV的距离大于第二阈值条件m*t agv时,不给予惩罚;m为预设倍数值;
通过下述方式设置第三种描述类型:
在当前AGV与当前路径的距离不小于第三阈值条件t rn时,给予惩罚δ;在当前AGV与当前路径的距离小于第三阈值条件t rn时,不给予惩罚。
在一个实施例中,训练单元520包括模型构建模块,该模型构建模块是采用全连接神经网络搭建神经网络模型,神经网络模型的输入包括当前AGV状态、其他AGV状态和当前AGV的许可动作;其中,当前AGV的许可动作输入到第一单隐层全连接神经网络,当前AGV状态输入到第二单隐层全连接神经网络,其他AGV状态输入到长短期记忆网络,再输入到第三单隐层全连接神经网络,将三个单隐层全连接神经网络的输出串联成为一个张量后,输入到双隐层全连接神经网络,由双隐层全连接神经网络输出运动规划结果的评估指标;第三单隐层全连接神经网络的维度大于第一单隐层全连接神经网络的维度,第三单隐层全连接神经网络的维度大于第二单隐层全连接神经网络的维度。
在一个实施例中,训练单元520还包括环境设置模块,该环境设置模块用于设置至少一个环境,每个环境中部署多个AGV,各个环境中的地图和/或路网和/或AGV数量不同,将每个环境下的当前AGV状态、其他AGV状态和当前AGV的许可动作输入到神经网络模型进行训练,以更新神经网络模型的参数。
以上所描述的装置实施例仅仅是示意性的,具体实施方式可以参照前述方法实施例的具体实施方式进行,在此不再赘述。
图6是本申请一个实施例的多AGV运动规划系统的结构示意图,图7示出本申请一个实施例的控制中心对系统的控制状态示意图,如图6与图7所示,该多AGV运动规划系统600包括:部署多AGV运动规划装置500的控制中心610、部署在环境中由多个AGV构成的AGV群620和任务调度中心平台630;
AGV群620中的AGV利用自身传感器获取自身状态并上传给控制中心610,以及接收控制中心610下发的动作指令,执行动作指令对应的动作。
其中,同一环境下的AGV采用多进程方式运行,通过多进程数据共享技术实现多AGV数据通信,同时采用多进程数据同步技术实现多AGV同步运行。
任务调度中心平台630,用于完成任务规划与任务派发,将某个AGV的任务发送至控制中心610,由控制中心610控制该AGV完成任务。
例如,任务调度中心平台将第j台AGV的任务发送至控制中心,由控制中心控制该第j台AGV完成任务。任务调度中心平台可以包含显示界面,用以实时监督AGV群执行任务的情况,并在碰撞、环境改变等突发状况下及时进行干预。
控制中心610,内置稳定的神经网络模型,将接收到的AGV状态输入到神经网络模型,利用神经网络模型计算得到该AGV的运动规划策略,再根据该运动规划策略生成动作指令并下发给该AGV。
综上所述,本申请的技术方案,一方面采用深度强化学习方法,利用神经网络模型在计算高维状态空间方面的优势和强化学习在线控制方面的特点,改善多AGV运动规划方法在动态环境中的性能;另一方面将最优路径搜索算法和深度强化学习算法进行结合,使用最优路径搜索算法约束深度强化学习算法的学习方向,以使AGV在密集环境下能够规划出可行路线,避免AGV陷入局部死锁状态。
需要说明的是:
在此提供的算法和显示不与任何特定计算机、虚拟装置或者其它设备固有相关。各种通用装置也可以与基于在此的示教一起使用。根据上面的描述,构造这类装置所要求的结构是显而易见的。此外,本申请也不针对任何特定编程语言。应当明白,可以利用各种编程语言实现在此描述的本申请的内容,并且上面对特定语言所做的描述是为了披露本申请的最佳实施方式。
本领域那些技术人员可以理解,可以对实施例中的设备中的模块进行自适应性地改变并且把它们设置在与该实施例不同的一个或多个设备中。可以把实施例中的模块或单元或组件组合成一个模块或单元或组件,以及此外可以把它们分成多个子模块或子单元或子组件。除了这样的特征和/或过程或者单元中的至少一些是相互排斥之外,可以采用任何组合对本说明书(包括伴随的权利要求、摘要和附图)中公开的所有特征以及如此公开的任何方法或者设备的所有过程或单元进行组合。除非另外明确陈述,本说明书(包括伴随的权利要求、摘要和附图)中公开的每个特征可以由提供相同、等同或相似目的的替代特征来代替。
此外,本领域的技术人员能够理解,尽管在此所述的一些实施例包括其它实施例中所包括的某些特征而不是其它特征,但是不同实施例的特征的组合意味着处于本申请的范围之内并且形成不同的实施例。例如,在下面的权利要求书中,所要求保护的实施例的任意之一都可以以任意的组合方式来使用。
本申请的各个部件实施例可以以硬件实现,或者以在一个或者多个处理器上运行的软件模块实现,或者以它们的组合实现。本领域的技术人员应当理解,可以在实践中使用微处理器或者数字信号处理器(DSP)来实现根据本申请实施例的多AGV运动规划装置中的一些或者全部部件的一些或者全部功能。本申请还可以实现为用于执行这里所描述的方法的一部分或者全部的设备或者装置程序(例如,计算机程序和计算机程序产品)。这样的实现本申请的程序可以存储在计算机可读介质上,或者可以具有一个或者多个信号的形式。这样的信号可以从因特网网站上下载得到,或者在载体信号上提供,或者以任何其他形式提供。
例如,图8示出本申请一个实施例的电子设备的结构示意图,该电子设备800包括处理器810和被安排成存储计算机可执行指令(计算机可读程序代码)的存储器820。存储器820可以是诸如闪存、EEPROM(电可擦除可编程只读存储器)、EPROM、硬盘或者ROM之类的电子存储器。存储器820具有存储用于执行上述方法中的任何方法步骤的计算机可读程序代码831的存储空间830。例如,用于存储计算机可读程序代码的存储空间830可以包括分别用于实现上面的方法中的各种步骤的各个计算机可读程序代码831。计算机可读程序代码831可以从一个或者多个计算机程序产品中读出或者写入到这一个或者多个计算机程序产品中。这些计算机程序产品包括诸如硬盘,紧致盘(CD)、存储卡或者软盘之类的程序代码载体。这样的计算机程序产品通常为例如图9所述的计算机可读存储介质。
图9示出了根据本申请一个实施例的一种计算机可读存储介质的结构示意图。该计算机可读存储介质900存储有用于执行根据本申请的方法步骤的计算机可读程序代码831,可以被多AGV运动规划装置800的处理器810读取,当计算机可读程序代码831由多AGV运动规划装置800运行时,导致该多AGV运动规划装置800执行上面所描述的方法中的各个步骤,具体来说,该计算机可读存储介质存储的计算机可读程序代码831可以执行上述任一实施例中示出的方法。计算机可读程序代码831可以以适当形式进行压缩。
应该注意的是上述实施例对本申请进行说明而不是对本申请进行限制,并且本领域技术人员在不脱离所附权利要求的范围的情况下可设计出替换实施例。在权利要求中,不应将位于括号之间的任何参考符号构造成对权利要求的限制。单词“包含”不排除存在未列在权利要求中的元件或步骤。位于元件之前的单词“一”或“一个”不排除存在多个这样的元件。本申请可以借助于包括有若干不同元件的硬件以及借助于适当编程的计算机来实现。在列举了若干装置的单元权利要求中,这些装置中的若干个可以是通过同一个硬件项来具体体现。单词第一、第二、以及第三等的使用不表示任何顺序。可将这些单词解释为名称。

Claims (15)

  1. 一种多AGV运动规划方法,其特征在于,包括:
    通过强化学习方法建立用于描述多AGV运动规划的序列决策过程的对象模型,所述对象模型中包括:AGV状态、动作空间以及运动规划结果的评估指标,其中AGV状态包括当前AGV状态、根据当前AGV状态计算得到的其他AGV状态和当前AGV的许可动作;
    基于所述对象模型搭建神经网络模型,进行包括AGV群部署在内的环境设置,利用所设置环境下AGV的对象模型对所述神经网络模型进行训练,直至得到稳定的神经网络模型;
    设置动作约束规则,利用所述动作约束规则对基于评估指标得到的当前AGV的待执行动作进行有效性判断;
    启动运动规划后,将当前环境下的当前AGV状态、其他AGV状态和许可动作输入到训练后的神经网络模型中,获得所述神经网络模型输出的运动规划结果的评估指标,根据所述评估指标得到当前AGV待执行动作,并利用所述动作约束规则对所述待执行动作进行有效判断,使当前AGV根据判断结果执行有效动作。
  2. 如权利要求1所述的方法,其特征在于,利用所述动作约束规则对所述待执行动作进行有效判断,使当前AGV根据判断结果执行有效动作,包括:
    判断所述待执行动作是否为当前AGV的许可动作;
    若所述待执行动作为当前AGV的许可动作,当前AGV执行该待执行动作;
    若所述待执行动作不是当前AGV的许可动作,根据神经网络模型输出的运动规划结果的评估指标和当前AGV的许可动作,计算当前AGV的有效动作并执行。
  3. 如权利要求2所述的方法,其特征在于,若所述待执行动作不是当前AGV的许可动作,根据神经网络模型输出的运动规划结果的评估指标和当前AGV的许可动作,计算当前AGV的有效动作并执行,包括:
    从当前AGV的动作空间中提取最优路径包含的多个第一许可动作;
    利用所述神经网络模型输出的运动规划结果的评估指标从所述多个第一许可动作中选择满足预设条件的第二许可动作;
    根据其他AGV状态计算当前AGV执行第二许可动作是否会发生碰撞,若发生碰撞,当前AGV执行静止动作,若未发生碰撞,当前AGV执行该第二许可动作。
  4. 如权利要求3所述的方法,其特征在于,所述运动规划结果的评估指标包括:用于描述当前AGV在t时刻的联合状态下选择动作a t的策略π和用于描述在联合状态下采取策略π的预期收益的值函数V,联合状态为由当前AGV状态、其他AGV状态和许可动作构成的状态;
    利用所述神经网络模型输出的运动规划结果的评估指标从所述多个第一许可动作中选择满足预设条件的第二许可动作,包括:
    利用所述策略π计算出每个第一许可动作对应的概率值,选择概率值最大对应的第一许可动作为第二许可动作。
  5. 如权利要求1所述的方法,其特征在于,使用马尔可夫决策过程描述所述对象模型,所述许可动作根据最优路径策略和动作空间获得,所述动作空间为在当前AGV的本体极坐标系下的速度与角度;
    所述当前AGV状态包括:当前AGV在本体极坐标系下的速度、目标位置、当前AGV尺寸、平均速度、当前AGV与目标位置的路网距离;
    其他AGV状态包括:其他AGV在当前AGV本体极坐标系下的相对位置、其他AGV在当前AGV本体极坐标系下其目标的相对位置、其他AGV在当前AGV本体极坐标系下其相对速度、其他AGV与其目标位置的路网距离、其他AGV尺寸、其他AGV尺寸与当前AGV尺寸的和值、其他AGV与当前AGV的路网距离。
  6. 如权利要求5所述的方法,其特征在于,所述对象模型还包括:用于描述当前AGV采取动作给予的奖励,该奖励包括至少三种描述类型,第一种描述类型为在当前AGV到达或靠近目标位置时给予的奖励,第二种描述类型为在当前AGV碰撞或靠近其他AGV时给予的惩罚,第三种描述类型为在当前AGV偏离路网时给予的惩罚;其中,
    通过下述方式设置所述第一种描述类型:
    在当前AGV到达目标位置p g时,给予正值的最大奖励值α;基于路网距离设置第一折扣系数i,由第一折扣系数i和最大奖励值α计算正值的折扣奖励值i*α和负值的折扣奖励值-i*α,在当前AGV靠近目标位置p g时,给予正值的折扣奖励值i*α;以及在当前AGV静止或远离目标位置p g时,给予负值的折扣奖励值-i*α;
    通过下述方式设置所述第二种描述类型:
    当前AGV与其他AGV的距离小于第一阈值条件t agv时,给予最大惩罚值β;基于距离设置第二折扣系数j,由第二折扣系数j和最大惩罚值β计算折扣惩罚值j*β,在当前AGV与其他AGV的距离大于所述第一阈值条件t agv且小于第二阈值条件m*t agv时,给予折扣惩罚值j*β;以及在当前AGV与其他AGV的距离大于所述第二阈值条件m*t agv时,不给予惩罚;m为预设倍数值;
    通过下述方式设置所述第三种描述类型:
    在当前AGV与当前路径的距离不小于第三阈值条件t rn时,给予惩罚δ;在当前AGV与当前路径的距离小于第三阈值条件t rn时,不给予惩罚。
  7. 如权利要求6所述的方法,其特征在于,基于所述对象模型搭建神经网络模型,包括:
    采用全连接神经网络搭建所述神经网络模型,所述神经网络模型的输入包括当前AGV状态、其他AGV状态和当前AGV的许可动作;其中,
    当前AGV的许可动作输入到第一单隐层全连接神经网络,
    当前AGV状态输入到第二单隐层全连接神经网络,
    其他AGV状态输入到长短期记忆网络,再输入到第三单隐层全连接神经网络,
    将三个单隐层全连接神经网络的输出串联成为一个张量后,输入到双隐层全连接神经网络,由双隐层全连接神经网络输出运动规划结果的评估指标;
    所述第三单隐层全连接神经网络的维度大于第一单隐层全连接神经网络的维度,所述第三单隐层全连接神经网络的维度大于第二单隐层全连接神经网络的维度。
  8. 如权利要求7所述的方法,其特征在于,在将其他AGV状态输入到长短期记忆网络之前,还包括:
    提取其他AGV与当前AGV的距离,将其他AGV状态按照与当前AGV的距离由远到近排序之后,再输入到长短期记忆网络。
  9. 如权利要求7所述的方法,其特征在于,基于所述对象模型搭建神经网络模型,还包括:
    由全连接层和激活函数串行后组成单隐层全连接神经网络。
  10. 如权利要求1所述的方法,其特征在于,进行包括AGV群部署在内的环境设置,利用所设置环境下AGV的对象模型对所述神经网络模型进行训练,包括:
    设置至少一个环境,每个环境中部署多个AGV,各个环境中的地图和/或路网和/或AGV数量不同;
    将每个环境下的当前AGV状态、其他AGV状态和当前AGV的许可动作输入到所述神经网络模型进行训练,以更新所述神经网络模型的参数。
  11. 如权利要求6所述的方法,其特征在于,
    在t时刻,基于三种描述类型的奖励通过下述方式得到对当前AGV采取动作给予的奖励R t
    R t=R goal+R agv+R rn
    其中,R goal为第一种描述类型,R agv为第二种描述类型,R rn为第三种描述类型。
  12. 一种多AGV运动规划装置,其特征在于,包括:
    建模单元,用于通过强化学习方法建立用于描述多AGV运动规划的序列决策过程的对象模型,所述对象模型中包括:AGV状态、动作空间以及运动规划结果的评估指标,其中AGV状态包括当前AGV状态、根据当前AGV状态计算得到的其他AGV状态和当前AGV的许可动作;
    训练单元,用于基于所述对象模型搭建神经网络模型,进行包括AGV群部署在内的环境设置,利用所设置环境下AGV的对象模型对所述神经网络模型进行训练,直至得到稳定的神经网络模型;
    设置单元,用于设置动作约束规则,利用所述动作约束规则对基于评估指标得到的当前AGV的待执行动作进行有效性判断;
    实施单元,用于启动运动规划后,将当前环境下的当前AGV状态、其他AGV状态和许可动作输入到训练后的神经网络模型中,获得所述神经网络模型输出的运动规划结果的评估指标,根据所述评估指标得到当前AGV待执行动作,并利用所述动作约束规则对所述待执行动作进行有效判断,使当前AGV根据判断结果执行有效动作。
  13. 如权利要求12所述的装置,其特征在于,
    所述实施单元包括判断模块、第一动作执行模块和第二动作执行模块;
    判断模块,用于判断待执行动作是否为当前AGV的许可动作;
    第一动作执行模块,用于在该待执行动作为当前AGV的许可动作时,使当前AGV执行该待执行动作;
    第二动作执行模块,用于在该待执行动作不是当前AGV的许可动作时,根据神经网络模型输出的运动规划结果的评估指标和当前AGV的许可动作,计算当前AGV的有效动作并执行。
  14. 如权利要求12所述的装置,其特征在于,
    所述训练单元包括模型构建模块,该模型构建模块是采用全连接神经网络搭建神经网络模型,神经网络模型的输入包括当前AGV状态、其他AGV状态和当前AGV的许可动作;其中,当前AGV的许可动作输入到第一单隐层全连接神经网络,当前AGV状态输入到第二单隐层全连接神经网络,其他AGV状态输入到长短期记忆网络,再输入到第三单隐层全连接神经网络,将三个单隐层全连接神经网络的输出串联成为一个张量后,输入到双隐层全连接神经网络,由双隐层全连接神经网络输出运动规划结果的评估指标;第三单隐层全连接神经网络的维度大于第一单隐层全连接神经网络的维度,第三单隐层全连接神经网络的维度大于第二单隐层全连接神经网络的维度。
  15. 一种多AGV运动规划系统,其特征在于,包括部署如权利要求12至14任一所述的多AGV运动规划装置的控制中心、部署在环境中由多个AGV构成的AGV群和任务调度中心平台;
    所述AGV群中的AGV利用自身传感器获取自身状态并上传给控制中心,以及接收控制中心下发的动作指令,执行所述动作指令对应的动作;
    所述任务调度中心平台,用于完成任务规划与任务派发,将某个AGV的任务发送至控制中心,由控制中心控制该AGV完成任务;
    所述控制中心,内置稳定的神经网络模型,将接收到的AGV状态输入到神经网络模型,利用神经网络模型计算得到该AGV的运动规划策略,再根据该运动规划策略生成动作指令并下发给该AGV。
PCT/CN2020/114422 2020-07-10 2020-09-10 一种多agv运动规划方法、装置和系统 WO2022007179A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/309,922 US20220317695A1 (en) 2020-07-10 2020-09-10 Multi-agv motion planning method, device and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010663470.1A CN112015174B (zh) 2020-07-10 2020-07-10 一种多agv运动规划方法、装置和系统
CN202010663470.1 2020-07-10

Publications (1)

Publication Number Publication Date
WO2022007179A1 true WO2022007179A1 (zh) 2022-01-13

Family

ID=73498821

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/114422 WO2022007179A1 (zh) 2020-07-10 2020-09-10 一种多agv运动规划方法、装置和系统

Country Status (3)

Country Link
US (1) US20220317695A1 (zh)
CN (1) CN112015174B (zh)
WO (1) WO2022007179A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114708527A (zh) * 2022-03-09 2022-07-05 中国石油大学(华东) 一种基于极坐标表示的数字冰壶策略价值提取方法
CN114815840A (zh) * 2022-04-29 2022-07-29 中国科学技术大学 基于深度强化学习的多智能体路径规划方法
CN115049324A (zh) * 2022-08-17 2022-09-13 上海国际港务(集团)股份有限公司 码头agv车调度方法、装置、计算机设备和存储介质

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112835333B (zh) * 2020-12-31 2022-03-15 北京工商大学 一种基于深度强化学习多agv避障与路径规划方法及系统
CN113128770B (zh) * 2021-04-23 2022-08-09 新疆大学 基于dqn的不确定车间环境下物料配送实时优化方法
CN113627646A (zh) * 2021-07-08 2021-11-09 中汽创智科技有限公司 一种基于神经网络的路径规划方法、装置、设备及介质
CN113342004B (zh) * 2021-07-29 2021-10-29 山东华力机电有限公司 基于人工智能和视觉感知的多agv小车调度方法及系统
US20220114301A1 (en) * 2021-12-23 2022-04-14 Intel Corporation Methods and devices for a collaboration of automated and autonomous machines
US11814072B2 (en) * 2022-02-14 2023-11-14 May Mobility, Inc. Method and system for conditional operation of an autonomous agent
CN114463997B (zh) * 2022-02-14 2023-06-16 中国科学院电工研究所 一种无信号灯交叉路口车辆协同控制方法及系统
WO2023203721A1 (ja) * 2022-04-21 2023-10-26 日本電信電話株式会社 配送計画装置、配送計画方法、及びプログラム
CN114779780B (zh) * 2022-04-26 2023-05-12 四川大学 一种随机环境下路径规划方法及系统
CN117162086A (zh) * 2023-08-07 2023-12-05 南京云创大数据科技股份有限公司 一种用于机械臂目标寻找的训练方法、方法及训练系统

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019168763A (ja) * 2018-03-22 2019-10-03 三菱電機株式会社 運転開始条件変換装置および運転開始条件変換方法
CN110991545A (zh) * 2019-12-10 2020-04-10 中国人民解放军军事科学院国防科技创新研究院 一种面向多智能体对抗的强化学习训练优化方法及装置
CN111061277A (zh) * 2019-12-31 2020-04-24 歌尔股份有限公司 一种无人车全局路径规划方法和装置
CN111240356A (zh) * 2020-01-14 2020-06-05 西北工业大学 一种基于深度强化学习的无人机集群会合方法
CN111260031A (zh) * 2020-01-14 2020-06-09 西北工业大学 一种基于深度强化学习的无人机集群目标防卫方法
CN111309035A (zh) * 2020-05-14 2020-06-19 浙江远传信息技术股份有限公司 多机器人协同移动与动态避障方法、装置、设备及介质

Family Cites Families (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6377878B1 (en) * 1999-06-24 2002-04-23 Sandia Corporation Convergent method of and apparatus for distributed control of robotic systems using fuzzy logic
CN1592919A (zh) * 2000-07-10 2005-03-09 美国联合包裹服务公司 确定活动飞行器之间的碰撞路径的方法和相关系统及计算机软件程序产品
US8954260B2 (en) * 2010-06-15 2015-02-10 GM Global Technology Operations LLC Method and system for collision assessment for vehicles
CN102929281A (zh) * 2012-11-05 2013-02-13 西南科技大学 一种不完全感知环境下的机器人kNN路径规划方法
JP5997092B2 (ja) * 2013-04-17 2016-09-28 日本電信電話株式会社 ロボット協調搬送計画装置、方法及びプログラム
JP2019518273A (ja) * 2016-04-27 2019-06-27 ニューララ インコーポレイテッド 深層ニューラルネットワークベースのq学習の経験メモリをプルーニングする方法及び装置
US20180208195A1 (en) * 2017-01-20 2018-07-26 Pcms Holdings, Inc. Collaborative risk controller for vehicles using v2v
US11243532B1 (en) * 2017-09-27 2022-02-08 Apple Inc. Evaluating varying-sized action spaces using reinforcement learning
AU2018274849A1 (en) * 2017-12-05 2019-06-20 The Raymond Corporation Systems and methods for a material handling vehicle with a modular frame
CN107967513B (zh) * 2017-12-25 2019-02-15 徐雪松 多机器人强化学习协同搜索方法及系统
WO2019168710A1 (en) * 2018-03-02 2019-09-06 Walmart Apollo, Llc Systems and methods for delivering merchandise using autonomous ground vehicles
CN108600379A (zh) * 2018-04-28 2018-09-28 中国科学院软件研究所 一种基于深度确定性策略梯度的异构多智能体协同决策方法
CN109059931B (zh) * 2018-09-05 2019-04-26 北京航空航天大学 一种基于多智能体强化学习的路径规划方法
EP3874418A4 (en) * 2018-10-29 2022-08-10 HRL Laboratories, LLC ARTIFICIAL NEURAL NETWORKS WITH COMPETITIVE BACKWARD MODULATED SPIKE TIME DEPENDENT PLASTICITY AND METHODS FOR TRAINING SAME
US11657251B2 (en) * 2018-11-12 2023-05-23 Honda Motor Co., Ltd. System and method for multi-agent reinforcement learning with periodic parameter sharing
US10853670B2 (en) * 2018-11-21 2020-12-01 Ford Global Technologies, Llc Road surface characterization using pose observations of adjacent vehicles
JP6962900B2 (ja) * 2018-12-03 2021-11-05 ファナック株式会社 生産計画装置
JP2020098401A (ja) * 2018-12-17 2020-06-25 トヨタ自動車株式会社 車両情報管理装置及び車両情報管理方法
CN109540150B (zh) * 2018-12-26 2022-05-27 北京化工大学 一种应用于危化品环境下多机器人路径规划方法
CN109724612B (zh) * 2019-01-14 2021-06-15 浙江华睿科技有限公司 一种基于拓扑地图的agv路径规划方法及设备
GB2581789A (en) * 2019-02-22 2020-09-02 Tra Robotics Ltd Wireless battery charging system for automated guided vehicles (AGVs)
US20200272159A1 (en) * 2019-02-25 2020-08-27 Denso International America, Inc. Method and vehicle control system for intelligent vehicle control about a roundabout
WO2020206071A1 (en) * 2019-04-02 2020-10-08 Brain Corporation Systems, apparatuses, and methods for cost evaluation and motion planning for robotic devices
CN110471297B (zh) * 2019-07-30 2020-08-11 清华大学 多智能体协同控制方法、系统及设备
CN110378439B (zh) * 2019-08-09 2021-03-30 重庆理工大学 基于Q-Learning算法的单机器人路径规划方法
US20210148716A1 (en) * 2019-11-20 2021-05-20 Here Global B.V. Method, apparatus and computer program product for vehicle platooning
US11586862B2 (en) * 2020-02-14 2023-02-21 Ford Global Technologies, Llc Enhanced object detection with clustering

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019168763A (ja) * 2018-03-22 2019-10-03 三菱電機株式会社 運転開始条件変換装置および運転開始条件変換方法
CN110991545A (zh) * 2019-12-10 2020-04-10 中国人民解放军军事科学院国防科技创新研究院 一种面向多智能体对抗的强化学习训练优化方法及装置
CN111061277A (zh) * 2019-12-31 2020-04-24 歌尔股份有限公司 一种无人车全局路径规划方法和装置
CN111240356A (zh) * 2020-01-14 2020-06-05 西北工业大学 一种基于深度强化学习的无人机集群会合方法
CN111260031A (zh) * 2020-01-14 2020-06-09 西北工业大学 一种基于深度强化学习的无人机集群目标防卫方法
CN111309035A (zh) * 2020-05-14 2020-06-19 浙江远传信息技术股份有限公司 多机器人协同移动与动态避障方法、装置、设备及介质

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114708527A (zh) * 2022-03-09 2022-07-05 中国石油大学(华东) 一种基于极坐标表示的数字冰壶策略价值提取方法
CN114815840A (zh) * 2022-04-29 2022-07-29 中国科学技术大学 基于深度强化学习的多智能体路径规划方法
CN115049324A (zh) * 2022-08-17 2022-09-13 上海国际港务(集团)股份有限公司 码头agv车调度方法、装置、计算机设备和存储介质

Also Published As

Publication number Publication date
CN112015174A (zh) 2020-12-01
US20220317695A1 (en) 2022-10-06
CN112015174B (zh) 2022-06-28

Similar Documents

Publication Publication Date Title
WO2022007179A1 (zh) 一种多agv运动规划方法、装置和系统
WO2021135554A1 (zh) 一种无人车全局路径规划方法和装置
CN110955242B (zh) 机器人导航方法、系统、机器人及存储介质
CN107169608B (zh) 多无人机执行多任务的分配方法及装置
CN107103164B (zh) 无人机执行多任务的分配方法及装置
US11474529B2 (en) System and method for motion planning of an autonomous driving machine
CN110132282B (zh) 无人机路径规划方法及装置
CN109933067A (zh) 一种基于遗传算法和粒子群算法的无人艇避碰方法
Niu et al. Accelerated sim-to-real deep reinforcement learning: Learning collision avoidance from human player
Xu et al. Deep reinforcement learning-based path planning of underactuated surface vessels
Qu et al. Pursuit-evasion game strategy of USV based on deep reinforcement learning in complex multi-obstacle environment
Tagliaferri et al. A real-time strategy-decision program for sailing yacht races
Yang et al. Improved reinforcement learning for collision-free local path planning of dynamic obstacle
Gao et al. A constraint approximation assisted PSO for computationally expensive constrained problems
CN113110101A (zh) 一种生产线移动机器人聚集式回收入库仿真方法及系统
Cook et al. Intelligent cooperative control for urban tracking with unmanned air vehicles
Keong et al. Reinforcement learning for autonomous aircraft avoidance
Mohammed et al. Reinforcement learning and deep neural network for autonomous driving
CN114742644A (zh) 训练多场景风控系统、预测业务对象风险的方法和装置
Wang et al. Efficient Reinforcement Learning for Autonomous Ship Collision Avoidance under Learning Experience Reuse
KR20230024392A (ko) 주행 의사 결정 방법 및 장치 및 칩
Toan et al. Environment exploration for mapless navigation based on deep reinforcement learning
Elfahim et al. Drone path optimization in complex environment based on Q-learning algorithm
CN115079706B (zh) 人机协同控制移动式机器人智能避障方法和系统
Zhu et al. A location-based advising method in teacher–student frameworks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20944156

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20944156

Country of ref document: EP

Kind code of ref document: A1