CN117032134A - Multi-AGV scheduling strategy evaluation and optimization method and system based on approximate synchronization estimation
- Publication number: CN117032134A
- Application number: CN202311099265.7A
- Authority: CN (China)
- Prior art keywords: strategy, agv, agvs, model, optimization
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B19/00—Programme-control systems
- G05B19/02—Programme-control systems electric
- G05B19/418—Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM]
- G05B19/4189—Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM] characterised by the transport system
- G05B19/41895—Total factory control, i.e. centrally controlling a plurality of machines, e.g. direct or distributed numerical control [DNC], flexible manufacturing systems [FMS], integrated manufacturing systems [IMS] or computer integrated manufacturing [CIM] characterised by the transport system using automatic guided vehicles [AGV]
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B2219/00—Program-control systems
- G05B2219/30—Nc systems
- G05B2219/31—From computer integrated manufacturing till monitoring
- G05B2219/31002—Computer controlled agv conveys workpieces between buffer and cell
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The application discloses a multi-AGV scheduling strategy evaluation and optimization method and system based on approximate synchronization estimation, wherein the method comprises the following steps: constructing a strategy model shared by multiple AGVs and a joint strategy evaluation model; based on the strategy model shared by multiple AGVs and the joint strategy evaluation model, letting the AGVs interact with the environment to generate training samples, and calculating marginal advantage functions; calculating an update-magnitude clipping amount according to the uncertainty of the AGV strategies; and, in any environmental state, substituting the marginal advantage function for the joint advantage function and, in combination with the proximal policy optimization (PPO) algorithm, using the clipping amount as the clipping interval of the strategy ratio, optimizing the strategy of each AGV individually until the multi-AGV scheduling strategy is optimized. The method achieves stable and accurate strategy evaluation in AGV scheduling tasks and improves training stability and scheduling efficiency.
Description
Technical Field
The application belongs to the field of AGV scheduling, and particularly relates to a multi-AGV scheduling strategy evaluation and optimization method and system based on approximate synchronous estimation.
Background
Decision tasks in many real-world scenarios, such as traffic scheduling, multi-sensor collaboration and robot cluster collaboration, can be modeled as multi-agent collaborative decision-making problems. In such problems, in order to ensure independence between agents and completeness of information during training, a centralized-evaluation, decentralized-execution training framework is generally adopted; this framework needs to evaluate the agents' policies in a differentiated manner under a shared reward function.
In policy-based deep reinforcement learning methods, the policy of each agent is generally evaluated by a centralized critic network. The critic network takes the environmental state, the historical counterfactual experience and the strategy of each agent as input, so the evaluation of the current agent's strategy depends on the past specific behaviors of the other agents. For the current agent, however, the other agents are part of the environment and their behavior is subject to uncertainty. Because of this uncertainty, the same behavior of the current agent may receive different evaluations under different counterfactual experiences, which affects the stability of the training process. In addition, because the strategies of the other agents are continuously updated, the new strategies deviate from the distribution of past experience; this asynchrony between experience and strategy introduces a certain error into the evaluation result.
Reducing the variance caused by such uncertainty in multi-automated-guided-vehicle (AGV) strategy evaluation is therefore essential for the stability of the training process; at the same time, keeping experience and strategy synchronized during strategy evaluation is critical for reducing evaluation errors.
Disclosure of Invention
The application aims to overcome the above defects and to provide a multi-AGV scheduling strategy evaluation and optimization method and system based on approximate synchronization estimation.
In order to achieve the above purpose, the application adopts the following technical scheme:
the first aspect of the present application provides a multi-AGV scheduling policy evaluation and optimization method based on approximate synchronization estimation, comprising:
constructing a strategy model shared by multiple AGVs and a joint strategy evaluation model;
based on the strategy model shared by multiple AGVs and the joint strategy evaluation model, letting the AGVs interact with the environment to generate training samples, and calculating marginal advantage functions;
calculating an update-magnitude clipping amount according to the uncertainty of the AGV strategies;
and, in any environmental state, substituting the marginal advantage function for the joint advantage function and, in combination with the proximal policy optimization algorithm, using the clipping amount as the clipping interval of the strategy ratio, optimizing the strategy of each AGV individually until the multi-AGV scheduling strategy is optimized.
As a further improvement of the application, constructing the strategy model shared by multiple AGVs and the joint strategy evaluation model comprises the following steps:
taking the historical trajectory information, current observation information and AGV identifier of each AGV as input and outputting the probability of each AGV action, constructing a decision model shared by the multiple AGVs through a recurrent neural network;
and taking the actions of all AGVs and the environmental state as input and outputting a joint action-state value function, constructing a joint strategy evaluation model based on a critic network.
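For concreteness, a minimal sketch of these two models is given below (PyTorch is used for illustration; the layer sizes, the GRU choice and the AGV-id embedding are assumptions rather than details fixed by the application):

```python
import torch
import torch.nn as nn

class SharedAGVPolicy(nn.Module):
    """Strategy model shared by all AGVs: a GRU over each AGV's own trajectory plus an
    AGV-id embedding; outputs unnormalized action scores (logits)."""
    def __init__(self, obs_dim, n_actions, n_agvs, hidden=64):
        super().__init__()
        self.id_embed = nn.Embedding(n_agvs, hidden)
        self.rnn = nn.GRU(obs_dim + n_actions, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_actions)

    def forward(self, hist_obs_act, cur_obs_act, agv_id):
        # hist_obs_act: (B, T, obs_dim + n_actions) past observations and actions
        # cur_obs_act:  (B, obs_dim + n_actions)    current observation with last action one-hot
        _, h = self.rnn(torch.cat([hist_obs_act, cur_obs_act.unsqueeze(1)], dim=1))
        feat = torch.cat([h.squeeze(0), self.id_embed(agv_id)], dim=-1)
        return self.head(feat)

class JointCritic(nn.Module):
    """Joint strategy evaluation model: global state + joint one-hot actions ->
    joint action-state value."""
    def __init__(self, state_dim, n_agvs, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_agvs * n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, joint_actions_onehot):
        return self.net(torch.cat([state, joint_actions_onehot], dim=-1)).squeeze(-1)
```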
As a further improvement of the application, letting the AGVs interact with the environment to generate training samples and calculating the marginal advantage functions, based on the strategy model shared by multiple AGVs and the joint strategy evaluation model, comprises:
based on the strategy model shared by multiple AGVs and the joint strategy evaluation model, fixing the action of the AGV to be evaluated, and sampling multiple groups of actions according to the joint strategy of all other AGVs;
combining each sampled group of actions with the action of the AGV to be evaluated, and simulating the interaction process to generate a group of state transition data;
inputting the state transition data into the joint strategy evaluation model, and calculating the joint action-state value function corresponding to each piece of transition data according to the value-function calculation method in reinforcement learning;
averaging the joint action-state value functions to obtain the marginal advantage function of the AGV to be evaluated;
and repeating the above steps for each AGV to obtain the marginal advantage function of each AGV.
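The marginal estimate described by these steps can be sketched as follows; the interfaces of the strategy networks, the critic and the learned transition/reward models, as well as the sample count and discount factor, are illustrative assumptions. The sketch returns the averaged joint value for the fixed action; how this quantity is turned into the final marginal advantage (e.g., by subtracting a baseline) is left to the surrounding update procedure.

```python
import torch

def marginal_value_estimate(critic, policies, trans_model, reward_model,
                            state, joint_action, agv_idx, n_samples=8, gamma=0.99):
    """Monte-Carlo estimate of the marginal action value of AGV `agv_idx`: its own action
    is held fixed while the other AGVs' actions are re-sampled from their current
    strategies; each sampled joint action is rolled one step through the learned
    transition/reward models and scored with a one-step TD target, then averaged."""
    q_values = []
    for _ in range(n_samples):
        joint = joint_action.clone()
        for j, pi in enumerate(policies):
            if j != agv_idx:                      # re-sample every other AGV's action
                joint[j] = torch.distributions.Categorical(logits=pi(state, j)).sample()
        next_state = trans_model(state, joint)    # simulated one-step transition
        reward = reward_model(state, joint)       # shared team reward
        next_joint = torch.stack([torch.distributions.Categorical(
            logits=pi(next_state, j)).sample() for j, pi in enumerate(policies)])
        q_values.append(reward + gamma * critic(next_state, next_joint))
    return torch.stack(q_values).mean(dim=0)      # averaged joint value for the fixed action
```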
As a further improvement of the application, simulating the interaction process requires constructing a state transition model and a reward function model, both of which take the global state and the joint action as input.
As a further improvement of the present application, calculating the update-magnitude clipping amount according to the uncertainty of the AGV strategy includes:
obtaining the action distribution of each AGV through the strategy network, calculating the variance of the action distribution, and taking the result as the update clipping amount of the corresponding AGV.
As a further improvement of the present application, obtaining the action distribution of each AGV through the strategy network includes:
generating a corresponding action mask according to the executable-action list of each AGV, and multiplying the output of each AGV strategy network by the corresponding action mask to obtain value-function estimates of the legal actions;
and then applying a Gumbel-Softmax activation to obtain a differentiable probability distribution, introducing a temperature parameter t that dynamically adjusts the probability distribution over the legal-action values, to obtain the action distribution of each AGV.
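A sketch of the masked, temperature-controlled Gumbel-Softmax step follows; masking illegal actions to negative infinity before the softmax is one common way to realize the multiplication by the action mask described above, and all shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def masked_gumbel_softmax(action_values, action_mask, temperature=1.0):
    """Differentiable action distribution restricted to one AGV's legal actions.
    `action_values` are the strategy-network outputs, `action_mask` is 1 for
    executable actions and 0 otherwise, and the temperature t controls how flat
    the resulting distribution is."""
    masked = action_values.masked_fill(action_mask == 0, float('-inf'))
    # standard Gumbel(0, 1) noise for a reparameterized categorical sample
    uniform = torch.rand_like(action_values).clamp_min(1e-20)
    gumbel = -torch.log(-torch.log(uniform).clamp_min(1e-20))
    return F.softmax((masked + gumbel) / temperature, dim=-1)
```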
As a further improvement of the present application, the individual optimization of the policy of each AGV includes:
and using an edge dominance function as an update baseline, combining a near-end strategy optimization algorithm, adding an update gradient cut-off quantity as an update constraint, cutting and updating an AGV strategy network through a strategy ratio, controlling the update amplitude of the AGVs through a cutting interval, and performing independent optimization on the strategy of each AGV.
A second aspect of the present application provides a system for evaluating and optimizing a multi-AGV scheduling policy based on approximate synchronization estimation, including:
the construction module is used for constructing a strategy model shared by multiple AGVs and a joint strategy evaluation model;
the interaction module is used for letting the AGVs interact with the environment to generate training samples, based on the strategy model shared by multiple AGVs and the joint strategy evaluation model, and calculating marginal advantage functions;
the uncertainty module is used for calculating an update-magnitude clipping amount according to the uncertainty of the AGV strategies;
and the optimization module is used for substituting, in any environmental state, the marginal advantage function for the joint advantage function and, in combination with the proximal policy optimization algorithm, using the clipping amount as the clipping interval of the strategy ratio to optimize the strategy of each AGV individually until the multi-AGV scheduling strategy is optimized.
A third aspect of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the multiple AGV scheduling policy evaluation and optimization method based on approximate synchronization estimation when executing the computer program.
A fourth aspect of the present application is to provide a computer readable storage medium storing a computer program, which when executed by a processor implements the multi-AGV scheduling policy evaluation and optimization method based on approximate synchronization estimation.
Compared with the prior art, the application has the following technical effects:
according to the AGV scheduling strategy evaluation and optimization method based on approximate synchronous estimation, the variance of strategy estimation is reduced by improving the number of samples, so that the stability of the training process is improved. Meanwhile, by means of uncertainty estimation on the policies of the agents, the policy updating amplitude of each agent is dynamically restrained, approximate synchronous policy estimation is achieved, and errors of policy estimation are reduced. In the multi-AGV scheduling task sharing the reward function, the edge advantage function is obtained by integrating the joint value function according to the strategies of different AGVs, and the differential evaluation of the different AGVs under the shared reward function is realized, so that the centralized scheduling of the multi-AGV system is completed.
Drawings
FIG. 1 is a flow chart of a multi-AGV scheduling strategy evaluation and optimization method based on approximate synchronization estimation;
FIG. 2 is a method framework diagram presented by an embodiment of the present application;
FIG. 3 is a schematic diagram of the marginal advantage estimation method according to the present application;
FIG. 4 is a schematic diagram of the approximate synchronization estimation method of the present application;
FIG. 5 is a schematic illustration of a system for evaluating and optimizing multiple AGV scheduling strategies based on approximate synchronization estimation in accordance with the present application;
FIG. 6 is a schematic diagram of an electronic device according to the present application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It will be apparent that the described embodiments are some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that, the terminal according to the embodiment of the present application may include, but is not limited to, a mobile phone, a personal digital assistant (Personal Digital Assistant, PDA), a wireless handheld device, a Tablet Computer (Tablet Computer), a personal Computer (Personal Computer, PC), an MP3 player, an MP4 player, a wearable device (e.g., smart glasses, smart watches, smart bracelets, etc.), a smart home device, and other smart devices.
In addition, the term "and/or" herein is merely an association relationship describing an association object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
As shown in FIG. 1, a first object of the present application is to provide a multi-AGV scheduling strategy evaluation and optimization method based on approximate synchronization estimation, which includes the following steps:
S1, constructing a strategy model shared by multiple AGVs and a joint strategy evaluation model;
S2, based on the strategy model shared by multiple AGVs and the joint strategy evaluation model, letting the AGVs interact with the environment to generate training samples, and calculating marginal advantage functions;
S3, calculating an update-magnitude clipping amount according to the uncertainty of the AGV strategies;
S4, in any environmental state, substituting the marginal advantage function for the joint advantage function and, in combination with the proximal policy optimization algorithm, using the clipping amount as the clipping interval of the strategy ratio, optimizing the strategy of each AGV individually until the multi-AGV scheduling strategy is optimized.
The multi-AGV scheduling strategy evaluation and optimization method based on approximate synchronization estimation builds on marginal advantage estimation: the variance of strategy estimation is reduced by increasing the number of samples, which improves the stability of the training process. Meanwhile, by estimating the uncertainty of each agent's strategy, the strategy update magnitude of each agent is dynamically constrained, achieving approximately synchronous strategy estimation and reducing the error of strategy estimation. The algorithm performance improves considerably on a general multi-agent reinforcement learning test platform. The method achieves stable and accurate strategy evaluation in AGV scheduling tasks and improves training stability and scheduling efficiency.
Optionally, in the above step, calculating the update-magnitude clipping amount according to the uncertainty of the AGV strategy includes:
obtaining the action distribution of each AGV through the strategy network, calculating the variance of the action distribution, and taking the result as the update clipping amount of the corresponding AGV.
Obtaining the action distribution of each AGV through the strategy network includes:
generating a corresponding action mask according to the executable-action list of each AGV, and multiplying the output of each AGV strategy network by the corresponding action mask to obtain value-function estimates of the legal actions;
and then applying a Gumbel-Softmax activation to obtain a differentiable probability distribution, introducing a temperature parameter t that dynamically adjusts the probability distribution over the legal-action values, to obtain the action distribution of each AGV.
To realize synchronous estimation of the different AGV strategies under the centralized-evaluation, decentralized-execution multi-agent training framework, the method takes the variance of each AGV strategy as the measure of the agent's action uncertainty, and limits the strategy change within a certain range by clipping the strategy update magnitude according to that uncertainty, thereby realizing approximately synchronous estimation.
Optionally, in the above step, optimizing the strategy of each AGV individually includes:
using the marginal advantage function as the update baseline, combining the proximal policy optimization algorithm, adding the update clipping amount as an update constraint, updating the AGV strategy network through strategy-ratio clipping, controlling the update magnitude of each AGV through the clipping interval, and optimizing the strategy of each AGV individually.
The method is based on a deep reinforcement learning algorithm: the variance of AGV strategy evaluation is reduced through multi-sample estimation, and update-magnitude clipping is used to approximate and simplify the synchronous estimation, ensuring strategy evaluation accuracy and training efficiency. The method can be applied to AGV system scheduling tasks based on on-policy multi-agent reinforcement learning to improve the accuracy of each AGV's scheduling strategy evaluation and the stability of the training process.
The application is described in detail below with reference to the attached drawings and specific examples:
referring to fig. 2, the agent is an AGV, and the embodiment provides a multi-AGV scheduling policy evaluation and optimization method based on approximate synchronization estimation, which includes the following steps:
step one: the method comprises the following steps of constructing a strategy model and a combined strategy evaluation model shared by a plurality of AGVs:
a strategy model shared by a plurality of AGVs is constructed through a cyclic neural network, and the historical track information (comprising historical actions and historical observation information) of each AGV, the current observation information and the AGV identification are taken as inputs, so that the probability of the action of the AGV is input; and constructing a joint strategy evaluation model based on the critic network, and inputting a joint action-state value function by taking the actions of all AGVs and the environmental states as inputs.
Step two: based on the strategy model shared by multiple AGVs and the joint strategy evaluation model, let the AGVs interact with the environment to generate training samples, and calculate the marginal advantage functions. Referring to FIG. 3, the specific process is as follows:
1) Based on the strategy model shared by multiple AGVs, fix the action of the AGV to be evaluated, and sample multiple groups of actions according to the joint strategy of all other AGVs; some groups of interaction samples may be invalid.
2) Combine the sampled actions with the action of the AGV to be evaluated, and generate a group of state transition data by simulating the interaction process described in the previous step; the samples are maintained in a sample buffer pool.
3) Calculate the joint action-state value function corresponding to each piece of transition data according to the value-function calculation method in reinforcement learning;
4) Average this set of value functions to obtain the marginal advantage function of the AGV to be evaluated;
5) Repeat steps 1) through 4) for each AGV.
In step two, in order to simulate the interaction between the AGVs and the environment, a state transition model and a reward function model need to be constructed; both models take the global state and the joint action as input.
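A minimal sketch of these two learned models is given below; the network sizes and the encoding of the joint action as a concatenated one-hot vector are assumptions.

```python
import torch
import torch.nn as nn

class TransitionModel(nn.Module):
    """Predicts the next global state from the current global state and the joint action."""
    def __init__(self, state_dim, joint_action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, state, joint_action):
        return self.net(torch.cat([state, joint_action], dim=-1))

class RewardModel(nn.Module):
    """Predicts the shared team reward from the global state and the joint action."""
    def __init__(self, state_dim, joint_action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, joint_action):
        return self.net(torch.cat([state, joint_action], dim=-1)).squeeze(-1)
```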
The sample buffer pool is used to randomly sample training samples, train the critic network, perform TD-residual estimation to obtain a counterfactual advantage estimate, and apply the Monte Carlo method to obtain the marginal advantage estimate.
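For the TD-residual estimation mentioned here, a generalized-advantage-style accumulation over one sampled trajectory is one natural reading; the sketch below assumes scalar per-step values from the critic and treats gamma and lambda as illustrative hyper-parameters rather than values fixed by the application.

```python
import torch

def td_residual_advantage(rewards, values, next_values, gamma=0.99, lam=0.95):
    """Advantage estimate accumulated from one-step TD residuals along a trajectory
    (a GAE-style computation); `values` / `next_values` come from the critic."""
    deltas = rewards + gamma * next_values - values     # one-step TD residuals
    advantages = torch.zeros_like(rewards)
    running = torch.zeros(())                            # scalar accumulator
    for t in reversed(range(rewards.shape[0])):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```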
Step three: calculate the update-magnitude clipping amount according to the uncertainty of the AGV strategies. The specific process is as follows:
Obtain the action distribution of each AGV through the strategy network, calculate the variance of the action distribution, and take the result as the update clipping amount of the corresponding AGV.
As an option, in step three, as shown in FIG. 4, since an AGV can only execute certain specific actions from its action set, a corresponding action mask is generated according to the executable-action list of the AGV in order to improve action sampling efficiency, and the output of each AGV strategy network is multiplied by the corresponding action mask to obtain value-function estimates of the legal actions. A differentiable probability distribution is then obtained through a Gumbel-Softmax activation; a temperature parameter t is introduced in this work to dynamically adjust the probability distribution over the legal-action values, so that the distribution gradually flattens as training proceeds, making the training process more stable.
Step four: in any environmental state, substitute the marginal advantage function obtained in step two for the joint advantage function, and, in combination with the proximal policy optimization algorithm, use the clipping amount obtained in step three as the clipping interval of the strategy ratio to optimize the strategy of each AGV individually.
Optionally, in step four, the algorithm uses the marginal advantage estimate obtained in step two as the update baseline and adds the variance estimate from step three as an update constraint. Solving the constrained optimization problem directly would require computing the Hessian matrix associated with the KL divergence, which is computationally expensive and makes training inefficient.
Therefore, this embodiment follows the proximal policy optimization method: the AGV strategy network is updated through strategy-ratio clipping, and the AGV update magnitude is controlled by the clipping interval. When the advantage estimate is positive, the AGV raises the probability of the corresponding action, but the increase does not exceed the set threshold.
More specifically, as shown in FIG. 4, training samples are randomly drawn from the sample buffer pool, the strategy network is trained to obtain the action probabilities of trajectories 1 to n, gradient-constraint calculation yields the strategy gradients for trajectories 1 to n, and the strategy network optimization result is then obtained.
The multi-AGV scheduling strategy evaluation and optimization method based on approximate synchronization estimation builds on marginal advantage estimation: the variance of strategy estimation is reduced by increasing the number of samples, which improves the stability of the training process. Meanwhile, by estimating the uncertainty of the AGV strategies, the strategy update magnitude of each AGV is dynamically constrained, achieving approximately synchronous strategy estimation and reducing the error of strategy estimation. The algorithm performance improves considerably on a general multi-AGV reinforcement learning test platform.
The shared multi-agent strategy network may be implemented by a multi-layer perceptron and a recurrent neural network, where the input layer of the network is the recurrent neural network and the output layer is the multi-layer perceptron. The critic network is implemented by a multi-layer perceptron. The output-layer scale of the strategy network and the critic network can be adjusted according to the requirements of the target task, realizing different degrees of strategy fitting and evaluation capability. Experience replay is realized by a readable memory unit.
The algorithm involved in the application can be deployed on any machine with a storage unit and floating-point arithmetic capability. For a simulation task, a corresponding simulation environment needs to be deployed on the machine; for tasks in real-world scenarios, corresponding sensors and actuators also need to be deployed for physical interaction with the environment.
As a specific embodiment, as shown in FIG. 5, a second object of the present application is to provide a multi-AGV scheduling strategy evaluation and optimization system based on approximate synchronization estimation, including:
the construction module is used for constructing a strategy model shared by multiple AGVs and a joint strategy evaluation model;
the interaction module is used for letting the AGVs interact with the environment to generate training samples, based on the strategy model shared by multiple AGVs and the joint strategy evaluation model, and calculating marginal advantage functions;
the uncertainty module is used for calculating an update-magnitude clipping amount according to the uncertainty of the AGV strategies;
and the optimization module is used for substituting, in any environmental state, the marginal advantage function for the joint advantage function and, in combination with the proximal policy optimization algorithm, using the clipping amount as the clipping interval of the strategy ratio to optimize the strategy of each AGV individually until the multi-AGV scheduling strategy is optimized.
According to an embodiment of the present application, there is also provided an electronic device and a non-transitory computer-readable storage medium storing computer instructions.
FIG. 6 is a schematic diagram of an electronic device for implementing the multi-AGV scheduling policy evaluation and optimization method based on approximate synchronization estimation according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in FIG. 6, the electronic device includes: one or more processors 501, memory 502, and interfaces for connecting components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the electronic device, including instructions stored in or on memory to display graphical information of a GUI (graphical user interface) on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 501 is illustrated in FIG. 6.
Memory 502 is a non-transitory computer readable storage medium provided by the present application. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the multi-AGV scheduling policy evaluation and optimization method based on approximate synchronization estimation provided by the application. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the multi-AGV scheduling policy evaluation and optimization method based on approximate synchronization estimation provided by the present application.
The memory 502 is used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and units, such as program instructions/units corresponding to the optimization method for multi-AGV scheduling policy evaluation based on approximate synchronization estimation in the embodiment of the application. The processor 501 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and units stored in the memory 502, i.e., implements the multi-AGV scheduling policy evaluation and optimization method based on approximate synchronization estimation in the above-described method embodiment.
Memory 502 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created by the use of the electronic device according to the multi-AGV scheduling policy evaluation and optimization method based on approximate synchronization estimation provided by the embodiment of the present application. In addition, memory 502 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 502 may optionally include memory remotely located with respect to the processor 501, which may be connected via a network to the electronic device implementing the multi-AGV scheduling policy evaluation and optimization method based on approximate synchronization estimation provided by embodiments of the present application. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the multi-AGV scheduling policy evaluation and optimization method based on approximate synchronization estimation may further include: an input device 503 and an output device 504. The processor 501, memory 502, input device 503 and output device 504 may be connected by a bus or otherwise, as exemplified in FIG. 6.
The input device 503 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device implementing the multi-AGV scheduling policy evaluation and optimization method based on approximate synchronization estimation provided by the embodiments of the present application, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output device 504 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibration motors), among others. The display device may include, but is not limited to, an LCD (liquid crystal display), an LED (light emitting diode) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASIC (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, PLDs (programmable logic devices)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LAN (local area network), WAN (wide area network), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present application and not for limiting the same, and although the present application has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the application without departing from the spirit and scope of the application, which is intended to be covered by the claims.
Claims (10)
1. A multi-AGV scheduling strategy evaluation and optimization method based on approximate synchronization estimation, characterized by comprising the following steps:
constructing a strategy model shared by multiple AGVs and a joint strategy evaluation model;
based on the strategy model shared by multiple AGVs and the joint strategy evaluation model, letting the AGVs interact with the environment to generate training samples, and calculating marginal advantage functions;
calculating an update-magnitude clipping amount according to the uncertainty of the AGV strategies;
and, in any environmental state, substituting the marginal advantage function for the joint advantage function and, in combination with the proximal policy optimization algorithm, using the clipping amount as the clipping interval of the strategy ratio, optimizing the strategy of each AGV individually until the multi-AGV scheduling strategy is optimized.
2. The multi-AGV scheduling strategy evaluation and optimization method based on approximate synchronization estimation according to claim 1, wherein constructing the strategy model shared by multiple AGVs and the joint strategy evaluation model comprises:
taking the historical trajectory information, current observation information and AGV identifier of each AGV as input and outputting the probability of each AGV action, constructing a decision model shared by the multiple AGVs through a recurrent neural network;
and taking the actions of all AGVs and the environmental state as input and outputting a joint action-state value function, constructing a joint strategy evaluation model based on a critic network.
3. The multi-AGV scheduling strategy evaluation and optimization method based on approximate synchronization estimation according to claim 1, wherein letting the AGVs interact with the environment to generate training samples and calculating the marginal advantage functions, based on the strategy model shared by multiple AGVs and the joint strategy evaluation model, comprises:
based on the strategy model shared by multiple AGVs and the joint strategy evaluation model, fixing the action of the AGV to be evaluated, and sampling multiple groups of actions according to the joint strategy of all other AGVs;
combining each sampled group of actions with the action of the AGV to be evaluated, and simulating the interaction process to generate a group of state transition data;
inputting the state transition data into the joint strategy evaluation model, and calculating the joint action-state value function corresponding to each piece of transition data according to the value-function calculation method in reinforcement learning;
averaging all the joint action-state value functions to obtain the marginal advantage function of the AGV to be evaluated;
and repeating the above steps for each AGV to obtain the marginal advantage function of each AGV.
4. The multi-AGV scheduling strategy evaluation and optimization method based on approximate synchronization estimation according to claim 3, wherein simulating the interaction process requires constructing a state transition model and a reward function model, both of which take the global state and the joint action as input.
5. The multi-AGV scheduling strategy evaluation and optimization method based on approximate synchronization estimation according to claim 1, wherein calculating the update-magnitude clipping amount according to the uncertainty of the AGV strategy comprises:
obtaining the action distribution of each AGV through the strategy network, calculating the variance of the action distribution, and taking the result as the update clipping amount of the corresponding AGV.
6. The multi-AGV scheduling strategy evaluation and optimization method based on approximate synchronization estimation according to claim 5, wherein obtaining the action distribution of each AGV through the strategy network comprises:
generating a corresponding action mask according to the executable-action list of each AGV, and multiplying the output of each AGV strategy network by the corresponding action mask to obtain value-function estimates of the legal actions;
and then applying a Gumbel-Softmax activation to obtain a differentiable probability distribution, introducing a temperature parameter t that dynamically adjusts the probability distribution over the legal-action values, to obtain the action distribution of each AGV.
7. The multi-AGV scheduling strategy evaluation and optimization method based on approximate synchronization estimation according to claim 1, wherein optimizing the strategy of each AGV individually comprises:
using the marginal advantage function as the update baseline, combining the proximal policy optimization algorithm, adding the update clipping amount as an update constraint, updating the AGV strategy network through strategy-ratio clipping, controlling the update magnitude of each AGV through the clipping interval, and optimizing the strategy of each AGV individually.
8. A multi-AGV scheduling strategy evaluation and optimization system based on approximate synchronization estimation, comprising:
the construction module, used for constructing a strategy model shared by multiple AGVs and a joint strategy evaluation model;
the interaction module, used for letting the AGVs interact with the environment to generate training samples, based on the strategy model shared by multiple AGVs and the joint strategy evaluation model, and calculating marginal advantage functions;
the uncertainty module, used for calculating an update-magnitude clipping amount according to the uncertainty of the AGV strategies;
and the optimization module, used for substituting, in any environmental state, the marginal advantage function for the joint advantage function and, in combination with the proximal policy optimization algorithm, using the clipping amount as the clipping interval of the strategy ratio to optimize the strategy of each AGV individually until the multi-AGV scheduling strategy is optimized.
9. An electronic device, characterized in that: comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the multi-AGV scheduling strategy evaluation and optimization method based on approximate synchronization estimation according to any one of claims 1-7 when executing the computer program.
10. A computer-readable storage medium, characterized in that: the computer-readable storage medium stores a computer program which, when executed by a processor, implements the multi-AGV scheduling strategy evaluation and optimization method based on approximate synchronization estimation according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311099265.7A CN117032134A (en) | 2023-08-29 | 2023-08-29 | Multi-AGV scheduling strategy evaluation and optimization method and system based on approximate synchronization estimation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311099265.7A CN117032134A (en) | 2023-08-29 | 2023-08-29 | Multi-AGV scheduling strategy evaluation and optimization method and system based on approximate synchronization estimation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117032134A true CN117032134A (en) | 2023-11-10 |
Family
ID=88622749
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311099265.7A Pending CN117032134A (en) | 2023-08-29 | 2023-08-29 | Multi-AGV scheduling strategy evaluation and optimization method and system based on approximate synchronization estimation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117032134A (en) |
- 2023-08-29: Application CN202311099265.7A filed in China; publication CN117032134A currently pending
Legal Events
| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |