CN116739077A - Multi-agent deep reinforcement learning method and device based on course learning - Google Patents


Info

Publication number
CN116739077A
Authority
CN
China
Prior art keywords
reinforcement learning
deep reinforcement
agent
training
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311029693.2A
Other languages
Chinese (zh)
Other versions
CN116739077B (en)
Inventor
李敏
宋南骏
曾祥光
罗仕杰
张加衡
张童伟
张森
张越龙
潘云伟
邢丽静
黄傲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202311029693.2A priority Critical patent/CN116739077B/en
Publication of CN116739077A publication Critical patent/CN116739077A/en
Application granted granted Critical
Publication of CN116739077B publication Critical patent/CN116739077B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/092 - Reinforcement learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Abstract

The application relates to a multi-agent deep reinforcement learning method and device based on course learning. The method comprises the following steps: determining an environment of an initial multi-agent deep reinforcement learning model based on the action radius; completing one round of training of the multi-agent deep reinforcement learning model in that environment; repeating the process until the multi-agent deep reinforcement learning model completes the training of one time node; repeating the time-node training process of the multi-agent deep reinforcement learning model to obtain a model to be evaluated; calculating the reward value obtained by the interaction between the model to be evaluated and the environment, updating the action radius according to the calculated reward value, and repeating the above steps until the complete multi-agent deep reinforcement learning model is obtained. The method can alleviate the problem that heavy occupation of computer resources leads to low efficiency when a computer processes course tasks.

Description

Multi-agent deep reinforcement learning method and device based on course learning
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a multi-agent deep reinforcement learning method and device based on course learning.
Background
Along with the development of artificial intelligence, the multi-agent cooperation problem has become a classical cooperative game problem in the field of multi-agent systems and has wide application in multi-robot control tasks such as trapping, interception, searching and tracking. However, because the course tasks to which multi-agent deep reinforcement learning is applied are highly complex, conventional multi-agent deep reinforcement learning methods involve a large amount of computation when solving such tasks and occupy considerable computer resources, which leads to low efficiency when the computer processes course tasks.
No effective solution has yet been proposed for the problem in the prior art that heavy occupation of computer resources leads to low efficiency when a computer processes course tasks.
Disclosure of Invention
Based on the foregoing, it is necessary to provide a multi-agent deep reinforcement learning method and device based on course learning.
In a first aspect, the present application provides a multi-agent deep reinforcement learning method based on course learning. The method comprises the following steps:
setting an action radius inversely proportional to the difficulty of an element based on the difficulty of the element corresponding to the course task, and generating a difficulty measure according to the action radius;
determining the environment of an initial multi-agent deep reinforcement learning model according to the action radius corresponding to the first column of the one-dimensional matrix of the difficulty measure;
updating experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain basic experience storage information which can be used for model training;
based on the basic experience storage information, updating the experience storage information according to the result of the actions selected under a greedy strategy; calculating the loss of the training network of each agent based on the updated experience storage information; based on the loss of the training network of each agent, updating the parameters of the training networks of the multiple agents by adopting a gradient descent method, and updating the parameters of the target networks of the multiple agents by adopting a time sequence difference method, so as to complete one round of training of the multi-agent deep reinforcement learning model;
repeating the training process of the multi-agent deep reinforcement learning model on the basis of the updated experience storage information until the training of the multi-agent deep reinforcement learning model with a preset first threshold number of times is completed, and marking that the multi-agent deep reinforcement learning model completes the training of a time node;
repeating the training process of the time node of the multi-agent deep reinforcement learning model on the basis of the basic experience storage information until the multi-agent deep reinforcement learning model finishes the training of the time node for a preset second threshold number of times, and taking the multi-agent deep reinforcement learning model trained last as a model to be evaluated;
taking the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for a preset first threshold number of times as the reward value of one time node, and calculating the average of the reward values of a preset third threshold number of time nodes; updating the action radius based on the average of the reward values;
based on the updated action radius, obtaining a model to be evaluated corresponding to the updated action radius, updating the action radius by using a reward value obtained by interaction between the model to be evaluated corresponding to the updated action radius and the environment, and repeating the steps until the training times of the multi-agent deep reinforcement learning model reach a preset fourth threshold time, and taking the multi-agent deep reinforcement learning model after the last training as a complete multi-agent deep reinforcement learning model.
In one embodiment, the setting an action radius inversely proportional to the difficulty of the element based on the difficulty of the element corresponding to the course task, and generating a difficulty measure according to the action radius, includes:
determining a course object corresponding to the course task based on the course task;
determining, based on the curriculum object, an element associated with the curriculum object;
setting action radii inversely proportional to the difficulty of the element based on the difficulty of the element, and arranging the action radii from largest to smallest;
and generating a one-dimensional matrix from the action radii according to a preset arrangement order to obtain the difficulty measure.
In one embodiment, before the updating of the experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain the basic experience storage information that can be used for model training, the method includes:
initializing parameters of a training network and parameters of a target network of the initial multi-agent deep reinforcement learning model, and initializing experience storage information corresponding to the environment of the initial multi-agent deep reinforcement learning model.
In one embodiment, in the environment of the initial multi-agent deep reinforcement learning model, updating the experience storage information to obtain the basic experience storage information which can be used for model training comprises the following steps:
updating the experience storage information according to the result of the multi-agent selecting action in the environment of the initial multi-agent deep reinforcement learning model according to a greedy strategy;
and under the condition that the times of updating the experience storage information reach the preset updating threshold times, obtaining the basic experience storage information which can be used for model training.
In one embodiment, the updating the empirically stored information according to the result of the multi-agent selection action in the environment of the initial multi-agent deep reinforcement learning model according to a greedy strategy includes:
controlling the multi-agent, and selecting actions according to a greedy strategy in the environment of the initial multi-agent deep reinforcement learning model;
and storing information generated by the multi-agent selection action in an experience storage area, and updating the experience storage information.
In one embodiment, the storing the information generated by the multi-agent selection action in the experience storage area, and updating the experience storage information includes:
determining state information of the next moment obtained by interaction between the multi-agent and the environment of the initial multi-agent deep reinforcement learning model;
calculating a reward value obtained by the multi-agent selecting action according to a greedy strategy in the environment of the initial multi-agent deep reinforcement learning model;
and storing the action information selected by the multiple agents, the rewarding value obtained by the selected action and the obtained state information at the next moment in the experience storage area to finish updating of the experience storage information.
In one embodiment, the calculating the loss of the training network of each agent based on the updated empirically stored information includes:
calculating a differential value of a target network of each intelligent agent based on the updated experience storage information;
and calculating the loss of the training network of each intelligent agent according to the differential value of the target network of each intelligent agent.
In one embodiment, the taking of the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for a preset first threshold number of times as the reward value of one time node, the calculating of the average of the reward values of a preset third threshold number of time nodes, and the updating of the action radius based on the average of the reward values include:
calculating the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for a preset first threshold number of times to obtain the reward value of one time node;
calculating the average of the reward values of a preset third threshold number of time nodes;
judging whether the average of the reward values is greater than or equal to a reward value threshold;
and updating the action radius to the column next to the current column of the one-dimensional matrix of the difficulty measure when the average of the reward values is greater than or equal to the reward value threshold.
In a second aspect, the application also provides a multi-agent deep reinforcement learning device based on course learning. The device comprises:
the generation module is used for setting an action radius inversely proportional to the difficulty of the element based on the difficulty of the element corresponding to the course task, and generating a difficulty measure according to the action radius;
the determining module is used for determining the environment of the initial multi-agent deep reinforcement learning model according to the action radius corresponding to the first column of the one-dimensional matrix of the difficulty measure;
the first updating module is used for updating experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain basic experience storage information which can be used for model training;
the first training module is used for updating the experience storage information according to the result of the actions selected under the greedy strategy on the basis of the basic experience storage information; calculating the loss of the training network of each agent based on the updated experience storage information; and, based on the loss of the training network of each agent, updating the parameters of the training networks of the multiple agents by adopting a gradient descent method and updating the parameters of the target networks of the multiple agents by adopting a time sequence difference method, so as to complete one round of training of the multi-agent deep reinforcement learning model;
the second training module is used for repeating the training process of the multi-agent deep reinforcement learning model on the basis of the updated experience storage information until the training of the multi-agent deep reinforcement learning model with the preset first threshold number of times is completed, and recording that the multi-agent deep reinforcement learning model completes the training of a time node;
the third training module is used for repeating the training process of the time node of the multi-agent deep reinforcement learning model on the basis of the basic experience storage information until the multi-agent deep reinforcement learning model finishes the training of the time node for the preset second threshold number of times, and taking the multi-agent deep reinforcement learning model after the last training as the model to be evaluated;
the second updating module is used for taking the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for the preset first threshold number of times as the reward value of one time node, and calculating the average of the reward values of the preset third threshold number of time nodes; and updating the action radius based on the average of the reward values;
and the model acquisition module is used for obtaining a model to be evaluated corresponding to the updated action radius based on the updated action radius, updating the action radius by utilizing the reward value obtained by the interaction between the model to be evaluated corresponding to the updated action radius and the environment, repeating the above steps until the number of training iterations of the multi-agent deep reinforcement learning model reaches a preset fourth threshold number of times, and taking the multi-agent deep reinforcement learning model trained last as the complete multi-agent deep reinforcement learning model.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to realize the multi-agent deep reinforcement learning method based on course learning according to the first aspect.
According to the multi-agent deep reinforcement learning method, device and computer equipment based on course learning, the action radius is set based on the difficulty of the element corresponding to the course task, and then the environment of the initial multi-agent deep reinforcement learning model is determined according to the action radius. And updating experience storage information in the environment of the initial multi-agent deep reinforcement learning model, training the multi-agent deep reinforcement learning model based on the updated experience storage information, repeating the steps until the multi-agent deep reinforcement learning model completes training of one time node, repeating the training process of the time node of the multi-agent deep reinforcement learning model until training of the time node of a preset second threshold number is completed, and obtaining the model to be evaluated. And updating the action radius according to the rewarding value obtained by interaction of the model to be evaluated and the environment. Based on the updated action radius, obtaining a model to be evaluated corresponding to the updated action radius, updating the action radius by using a reward value obtained by interaction between the model to be evaluated corresponding to the updated action radius and the environment, repeating the steps until a trained multi-agent deep reinforcement learning model with a preset fourth threshold number of times is obtained, and taking the multi-agent deep reinforcement learning model after the last training as a complete multi-agent deep reinforcement learning model. According to the method, the action radius related to the element difficulty of the course task is set, and the adjustment mechanism of the action radius is set according to the rewarding value of the training result, so that a complex task can be split into a plurality of tasks and is gradually executed from simple to complex, the complexity of executing the task is reduced, the calculated amount in the course of solving the course task is reduced, the resources of a computer are further reduced, and the problem that the efficiency of processing the course task by the computer is low due to the fact that the resources of the computer are occupied is solved.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a block diagram of a hardware architecture of a terminal of a multi-agent deep reinforcement learning method based on course learning according to an embodiment of the present application;
FIG. 2 is a flowchart of a multi-agent deep reinforcement learning method based on course learning according to an embodiment of the present application;
FIG. 3 is a flow chart of a multi-agent deep reinforcement learning method based on course learning according to a preferred embodiment of the present application;
fig. 4 is a block diagram of a multi-agent deep reinforcement learning device based on course learning according to an embodiment of the present application.
Detailed Description
The present application will be described and illustrated with reference to the accompanying drawings and examples for a clearer understanding of the objects, technical solutions and advantages of the present application.
Unless defined otherwise, technical or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," "these" and similar terms in this application are not intended to be limiting in number, but may be singular or plural. The terms "comprising," "including," "having," and any variations thereof, as used herein, are intended to encompass non-exclusive inclusion; for example, a process, method, and system, article, or apparatus that comprises a list of steps or modules (units) is not limited to the list of steps or modules (units), but may include other steps or modules (units) not listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this disclosure are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. Typically, the character "/" indicates that the associated object is an "or" relationship. The terms "first," "second," "third," and the like, as referred to in this disclosure, merely distinguish similar objects and do not represent a particular ordering for objects.
The method embodiments provided in the present embodiment may be executed in a terminal, a computer, or similar computing device. For example, the method runs on a terminal, and fig. 1 is a block diagram of a hardware structure of the terminal of the multi-agent deep reinforcement learning method based on course learning according to the present embodiment. As shown in fig. 1, the terminal may include one or more (only one is shown in fig. 1) processors 102 and a memory 104 for storing data, wherein the processors 102 may include, but are not limited to, a microprocessor MCU, a programmable logic device FPGA, or the like. The terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and is not intended to limit the structure of the terminal. For example, the terminal may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to the multi-agent deep reinforcement learning method based on course learning according to the embodiment of the present application, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. The network includes a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
In this embodiment, a multi-agent deep reinforcement learning method based on course learning is provided, fig. 2 is a flowchart of the multi-agent deep reinforcement learning method based on course learning in this embodiment, as shown in fig. 2, and the flowchart includes the following steps:
step S210, setting an action radius inversely proportional to the difficulty of the element based on the difficulty of the element corresponding to the course task, and generating a difficulty measure according to the action radius.
In this step, the course task may be one or more of a trapping task, an intercepting task, a searching task, and a tracking task. The element corresponding to the course task may be an element associated with the course object corresponding to the course task, and the element reflects the difficulty of completing the course task. For example, in a trapping task, the course object corresponding to the trapping task may be a circle centered on our own agent, and the element corresponding to the trapping task may be a discrete value of the trapping radius of the system. For example, the element corresponding to a trapping task includes a simplest trapping radius and a most difficult trapping radius, wherein the simplest trapping radius has a value of 9 m and the most difficult trapping radius has a value of 3 m. The action radius may be a radius length inversely proportional to the difficulty of the element. Setting an action radius inversely proportional to the difficulty of the element corresponding to the course task and generating a difficulty measure according to the action radius may be implemented as follows: based on the course task, determine the course object corresponding to the course task; based on the course object, determine the element associated with the course object; based on the difficulty of the element, set action radii inversely proportional to the difficulty of the element and arrange them from largest to smallest; and finally generate a one-dimensional matrix from the action radii according to the preset arrangement order to obtain the difficulty measure. The one-dimensional matrix of the difficulty measure D can be expressed by the following matrix formula:
D = [d_1, d_2, …, d_n]
where d_1 denotes the value of the largest action radius, d_2 denotes the value of the action radius ranked second when the action radii are arranged from largest to smallest, and d_n denotes the value of the action radius ranked n-th.
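As an illustration of how such a difficulty measure could be built, the following Python sketch constructs the one-dimensional matrix of action radii from the trapping-task example above; the number of difficulty levels and the linear spacing are assumptions made here for illustration and are not specified by the application.

```python
# Minimal sketch of building the difficulty measure D = [d_1, ..., d_n]
# (number of levels and linear spacing are assumptions, not from the application).
import numpy as np

def build_difficulty_measure(easiest_radius=9.0, hardest_radius=3.0, n_levels=4):
    """Action radii sorted from largest (easiest) to smallest (hardest)."""
    return np.linspace(easiest_radius, hardest_radius, n_levels)

difficulty_measure = build_difficulty_measure()  # e.g. array([9., 7., 5., 3.])
current_column = 0                               # start from the first column d_1
action_radius = difficulty_measure[current_column]
```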
Step S220, determining the environment of the initial multi-agent deep reinforcement learning model according to the action radius corresponding to the first column of the one-dimensional matrix of the difficulty measure.
Specifically, a simulation space is established according to the action radius corresponding to the first column of the one-dimensional matrix of the difficulty measure, and the environment of the initial multi-agent deep reinforcement learning model is then determined according to the simulation space. The initial multi-agent deep reinforcement learning model may include a target network of the initial multi-agent deep reinforcement learning model and a training network of the initial multi-agent deep reinforcement learning model. Specifically, the target network may be an Actor-Critic target network, and the training network may be an Actor-Critic training network.
The environment of the initial multi-agent deep reinforcement learning model is used to obtain, through interaction with it, the initial experience storage information corresponding to that environment.
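As a rough illustration of the Actor-Critic training networks and target networks mentioned above, the following PyTorch-style sketch shows one possible network pair for a single agent; the layer widths, activation functions, and dimensions are assumptions for illustration only and are not taken from the application.

```python
# Illustrative Actor-Critic training networks and target-network copies
# for one agent (all sizes and layer choices are assumptions).
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())
    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    def __init__(self, state_dim, joint_act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + joint_act_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, state, joint_action):
        return self.net(torch.cat([state, joint_action], dim=-1))

# Training networks and their target-network copies for one agent.
actor, critic = Actor(obs_dim=10, act_dim=2), Critic(state_dim=30, joint_act_dim=6)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
```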
Step S230, updating the experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain the basic experience storage information that can be used for model training.
Updating the experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain the basic experience storage information that can be used for training the multi-agent deep reinforcement learning model may proceed as follows: the experience storage information is updated according to the results of the actions selected by the multiple agents in the environment of the initial multi-agent deep reinforcement learning model according to a greedy strategy, and the basic experience storage information that can be used for model training is obtained once the number of updates of the experience storage information reaches a preset update threshold number of times. Specifically, the multiple agents are controlled to select actions according to the greedy strategy in the environment of the initial multi-agent deep reinforcement learning model, the information generated by the selected actions is stored in the experience storage area, and the experience storage information is updated. The greedy strategy may be an epsilon-greedy strategy. Specifically, all agents can select actions according to the epsilon-greedy strategy in the environment of the initial multi-agent deep reinforcement learning model. After an agent selects an action in this environment according to the epsilon-greedy strategy, if the small-probability (exploration) event does not occur, the agent selects its action according to the target network of the initial multi-agent deep reinforcement learning model and interacts with the environment to obtain the state at the next moment. The experience storage area is used to store the experience storage information. The preset update threshold number of times may be set according to an empirical value; when the number of updates of the experience storage information reaches this threshold, the experience storage information is sufficient to support the training of the initial multi-agent deep reinforcement learning model.
Storing the information generated by the actions selected by the multiple agents in the experience storage area and updating the experience storage information may proceed as follows: determine the state information at the next moment obtained by the interaction between the multiple agents and the environment of the initial multi-agent deep reinforcement learning model; calculate the reward value obtained by the actions selected by the multiple agents according to the greedy strategy in that environment; and finally store the action information selected by the multiple agents, the reward value obtained by the selected actions, and the obtained state information at the next moment in the experience storage area, thereby completing the update of the experience storage information.
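The experience-collection step described above can be sketched as follows; the buffer size, the exploration probability, and the environment/agent interfaces are assumptions used only for illustration.

```python
# Illustrative epsilon-greedy action selection and experience storage
# (buffer size, epsilon and the env/agent interfaces are assumptions).
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # experience storage area
EPSILON = 0.1                           # probability of the small-probability (exploration) event

def select_action(agent, obs, action_space):
    """Explore with small probability; otherwise act from the agent's target network."""
    if random.random() < EPSILON:
        return action_space.sample()
    return agent.target_act(obs)        # hypothetical helper wrapping the target Actor

def collect_step(env, agents, observations):
    actions = [select_action(a, o, env.action_space) for a, o in zip(agents, observations)]
    next_observations, rewards, done, _ = env.step(actions)   # interact with the environment
    replay_buffer.append((observations, actions, rewards, next_observations))
    return next_observations, done
```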
Step S240, based on the basic experience storage information, updating the experience storage information according to the result of the actions selected under the greedy strategy; calculating the loss of the training network of each agent based on the updated experience storage information; and, based on the loss of the training network of each agent, updating the parameters of the training networks of the multiple agents by a gradient descent method and updating the parameters of the target networks of the multiple agents by a time sequence difference method, so as to complete one round of training of the multi-agent deep reinforcement learning model.
In this step, the loss of the training network of each agent may be calculated from the updated experience storage information by first calculating the differential value of the target network of each agent based on the updated experience storage information and then calculating the loss of the training network of each agent from that differential value. The differential value y of the target network of each agent is calculated based on the updated experience storage information as follows:
y = r_i + γ Q′
where r_i denotes the reward value obtained by the action selected by agent i, γ is the reward discount coefficient, and Q′ is the action-value function output by the target network for the state at the next moment.
The loss L(θ_i) of the training network of each agent is calculated from the differential value y of the target network of that agent as follows:
L(θ_i) = E_{s,a,r,s′}[ (y - Q)² ]
where E_{s,a,r,s′} denotes the expectation computed in parallel over a batch of experience storage information, and Q is the action-value function output by the training network.
The parameter θ′_i of the target network of agent i is updated by the time sequence difference method according to the following formula:
θ′_i ← τ θ_i + (1 - τ) θ′_i
where τ is the update coefficient of the time sequence difference and θ_i is the parameter of the training network of agent i.
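Taken together, the differential value, the loss, the gradient-descent update, and the time-sequence-difference update correspond to one training step per agent. The sketch below is an illustrative PyTorch-flavoured rendering of that step; the attribute names (critic, target_critic, critic_optim, target_actions) and the batch format are assumptions, not the application's actual implementation.

```python
# Illustrative training step for one agent i (interfaces and names are assumptions).
import torch
import torch.nn.functional as F

def train_agent(agent, batch, gamma=0.95, tau=0.01):
    state, actions, reward_i, next_state = batch               # sampled experience storage information
    with torch.no_grad():
        next_actions = agent.target_actions(next_state)        # hypothetical helper (target Actors)
        y = reward_i + gamma * agent.target_critic(next_state, next_actions)  # differential value y

    q = agent.critic(state, actions)                           # training-network action value
    loss = F.mse_loss(q, y)                                    # L(theta_i) = E[(y - Q)^2]

    agent.critic_optim.zero_grad()
    loss.backward()                                            # gradient descent on the training network
    agent.critic_optim.step()

    # Time-sequence-difference (soft) update of the target-network parameters.
    for p_target, p_train in zip(agent.target_critic.parameters(), agent.critic.parameters()):
        p_target.data.copy_(tau * p_train.data + (1 - tau) * p_target.data)
```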
One round of training of the multi-agent deep reinforcement learning model is thus accomplished by updating the experience storage information on the basis of the basic experience storage information according to the results of the actions selected under the greedy strategy, and then updating the parameters of the training networks and the parameters of the target networks of the multiple agents.
Step S250, repeating the training process of the multi-agent deep reinforcement learning model on the basis of the updated experience storage information until the training of the multi-agent deep reinforcement learning model with the preset first threshold number of times is completed, and recording that the multi-agent deep reinforcement learning model completes the training of a time node.
The preset first threshold number of times is the number of times that the training of the multi-agent deep reinforcement learning model is required to complete the training of a time node, and is empirically set, and may be a relatively large value, for example, the preset first threshold number of times is 200 times. The training of the multi-agent deep reinforcement learning model is completed on the basis of updated experience storage information in the process of repeating the training of the multi-agent deep reinforcement learning model.
Step S260, repeating the training process of the time node of the multi-agent deep reinforcement learning model on the basis of the basic experience storage information until the multi-agent deep reinforcement learning model finishes the training of the time node for a preset second threshold number of times, and taking the multi-agent deep reinforcement learning model trained last as the model to be evaluated.
The preset second threshold number is the training number of the time node, which is required to be completed by the multi-agent deep reinforcement learning model to obtain the condition of the model to be evaluated, and is set according to experience, and may be a relatively large value, for example, the preset second threshold number is 100 times. In the process of repeating the training of the time node, the training of the time node of the multi-agent deep reinforcement learning model is completed on the basis of the basic experience storage information.
The purpose of obtaining the model to be evaluated is to further calculate the reward value obtained by the interaction between the model to be evaluated corresponding to the current action radius and the environment, so that the action radius can be adjusted.
Step S270, taking the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for a preset first threshold number of times as the reward value of one time node, and calculating the average of the reward values of a preset third threshold number of time nodes; the action radius is updated based on the average of the reward values.
In this step, taking the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for a preset first threshold number of times as the reward value of one time node, calculating the average of the reward values of a preset third threshold number of time nodes, and updating the action radius according to the average of the reward values may proceed as follows: calculate the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for a preset first threshold number of times to obtain the reward value of one time node; further calculate the average of the reward values of a preset third threshold number of time nodes; judge whether the average of the reward values is greater than or equal to the reward value threshold; and, when the average of the reward values is greater than or equal to the reward value threshold, update the action radius to the column next to the current column of the one-dimensional matrix of the difficulty measure. Calculating the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for the preset first threshold number of times is, in effect, calculating the sum of the reward values obtained by the actions selected by each agent of the model to be evaluated in the environment of the model to be evaluated. Specifically, the reward value r_i obtained by the action selected by agent i is determined by the following components:
r_i is composed of a distance reward r_i^distance, a trapping reward r_i^capture, a help reward r_i^help, and a collision reward r_i^collision, where i is the number of the chaser; the distance reward r_i^distance is computed from the coordinates of the chaser and the coordinates of the evader; the trapping reward r_i^capture is determined by the distance L_i^distance from the chaser to the evader and the action radius d; the help reward r_i^help is determined by the distance L_j^distance from chaser j to the evader; and the collision reward r_i^collision is determined by the coordinates of the obstacle and the obstacle radius r_o.
The average of the reward values of the preset third threshold number of time nodes may be the sum of the reward values calculated for each of those time nodes divided by the preset third threshold number. The preset third threshold number is the number of time-node reward values used to evaluate the model to be evaluated; it may be preset according to the specific situation, and, to improve efficiency, may be set to a relatively small value, for example 6.
When the average of the reward values is greater than or equal to the reward value threshold, the action radius is updated to the column next to the current column of the one-dimensional matrix of the difficulty measure. For example, when the current action radius is d_1 and the calculated average of the reward values is greater than or equal to the reward value threshold, the action radius is updated to d_2, the column next to the current column d_1 of the one-dimensional matrix of the difficulty measure. The reward value threshold may be empirically preset. If the average of the reward values is smaller than the reward value threshold, the action radius is not updated, and the multi-agent deep reinforcement learning model continues to be trained with the current action radius until the next model to be evaluated is obtained. If the average of the reward values is smaller than the reward value threshold but the current action radius is already the last column d_n of the one-dimensional matrix of the difficulty measure, the column next to the current column d_n defaults to d_n itself, and the multi-agent deep reinforcement learning model is trained with the action radius d_n.
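The evaluation and radius-update rule can be summarised in a short sketch; the reward value threshold below is a placeholder, and the helper assumes that the reward value of each evaluated time node has already been computed.

```python
# Illustrative update of the action radius based on the average reward value
# (the reward threshold is a placeholder, not a value from the application).
REWARD_THRESHOLD = 0.0   # placeholder reward value threshold
THIRD_THRESHOLD = 6      # number of time-node reward values averaged (example from the text)

def evaluate_and_update(node_rewards, difficulty_measure, current_column):
    """node_rewards: reward value (sum over one time node) of each evaluated time node."""
    recent = node_rewards[-THIRD_THRESHOLD:]
    avg_reward = sum(recent) / len(recent)
    if avg_reward >= REWARD_THRESHOLD:
        # Move to the next column; past the last column the radius stays at d_n.
        current_column = min(current_column + 1, len(difficulty_measure) - 1)
    return current_column, difficulty_measure[current_column]
```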
Step S280, based on the updated action radius, obtaining a model to be evaluated corresponding to the updated action radius, updating the action radius by using a reward value obtained by interaction between the model to be evaluated corresponding to the updated action radius and the environment, and repeating the step until the training times of the multi-agent deep reinforcement learning model reach a preset fourth threshold time, and taking the multi-agent deep reinforcement learning model trained last time as a complete multi-agent deep reinforcement learning model.
Obtaining the model to be evaluated corresponding to the updated action radius based on the updated action radius, and updating the action radius by using the reward value obtained by the interaction between that model and the environment, is in effect achieved by repeating the processes from step S220 to step S270. The preset fourth threshold number may be the number of training rounds of the multi-agent deep reinforcement learning model set according to an empirical value; when the training of the multi-agent deep reinforcement learning model reaches the preset fourth threshold number of times, the model is by default regarded as having converged. In order to ensure the performance of the complete multi-agent deep reinforcement learning model, the preset fourth threshold number may be set to a larger value, and the preset fourth threshold number is greater than the product of the preset first threshold number and the preset second threshold number. For example, the fourth threshold number of times may be preset to 100000 times.
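Putting the preceding steps together, the overall course-learning training loop can be sketched as follows. The threshold values mirror the example values given in the text, and make_environment, sample_batch and rollout_reward are hypothetical helpers; agents, difficulty_measure, collect_step, train_agent, THIRD_THRESHOLD and evaluate_and_update refer to the earlier illustrative sketches rather than to the application's actual code.

```python
# Illustrative outer loop combining the sketches above (all values are example values).
FIRST_THRESHOLD = 200        # training rounds per time node
SECOND_THRESHOLD = 100       # time nodes per model to be evaluated
FOURTH_THRESHOLD = 100_000   # total training rounds for the complete model

total_rounds, current_column = 0, 0
action_radius = difficulty_measure[current_column]

while total_rounds < FOURTH_THRESHOLD:
    env = make_environment(action_radius)             # hypothetical environment factory
    observations = env.reset()
    for _node in range(SECOND_THRESHOLD):             # training of SECOND_THRESHOLD time nodes
        for _ in range(FIRST_THRESHOLD):               # one time node = FIRST_THRESHOLD rounds
            observations, _done = collect_step(env, agents, observations)
            batch = sample_batch(replay_buffer)        # hypothetical mini-batch sampler
            for agent in agents:
                train_agent(agent, batch)
            total_rounds += 1
    # The last trained model is the model to be evaluated; adjust the action radius.
    node_rewards = [rollout_reward(env, agents, FIRST_THRESHOLD)  # hypothetical evaluation rollout
                    for _ in range(THIRD_THRESHOLD)]
    current_column, action_radius = evaluate_and_update(node_rewards, difficulty_measure, current_column)
```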
Step S210 to step S280 are described above, in which the action radius is set based on the difficulty of the element corresponding to the course task, and then the environment of the initial multi-agent deep reinforcement learning model is determined according to the action radius. And updating experience storage information in the environment of the initial multi-agent deep reinforcement learning model, training the multi-agent deep reinforcement learning model based on the updated experience storage information, repeating the steps until the multi-agent deep reinforcement learning model completes training of one time node, repeating the training process of the time node of the multi-agent deep reinforcement learning model until training of the time node of a preset second threshold number is completed, and obtaining the model to be evaluated. And updating the action radius according to the rewarding value obtained by interaction of the model to be evaluated and the environment. Based on the updated action radius, obtaining a model to be evaluated corresponding to the updated action radius, updating the action radius by using a reward value obtained by interaction between the model to be evaluated corresponding to the updated action radius and the environment, repeating the steps until a trained multi-agent deep reinforcement learning model with a preset fourth threshold number of times is obtained, and taking the multi-agent deep reinforcement learning model after the last training as a complete multi-agent deep reinforcement learning model. According to the method, the action radius related to the element difficulty of the course task is set, and the adjustment mechanism of the action radius is set according to the rewarding value of the training result, so that a complex task can be split into a plurality of tasks and is gradually executed from simple to complex, the complexity of executing the task is reduced, the calculated amount in the course of solving the course task is reduced, the resources of a computer are further reduced, and the problem that the efficiency of processing the course task by the computer is low due to the fact that the resources of the computer are occupied is solved.
In one embodiment, in the context of an initial multi-agent deep reinforcement learning model, updating the empirically stored information and, based on the updated empirically stored information, before calculating the loss of the training network for each agent, comprising the steps of:
parameters of a training network and parameters of a target network of the initial multi-agent deep reinforcement learning model are initialized, and empirical storage information corresponding to an environment of the initial multi-agent deep reinforcement learning model is initialized.
The present embodiment is described and illustrated below by way of preferred embodiments.
FIG. 3 is a flow chart of a multi-agent deep reinforcement learning method based on course learning according to a preferred embodiment of the present application. As shown in fig. 3, the multi-agent deep reinforcement learning method based on course learning includes the following steps:
step S310, setting an action radius inversely proportional to the difficulty of the element based on the difficulty of the element corresponding to the course task, and generating a difficulty measure according to the action radius;
step S320, determining the environment of the initial multi-agent deep reinforcement learning model according to the action radius corresponding to the first column of the one-dimensional matrix of the difficulty measure;
step S330, updating experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain basic experience storage information which can be used for model training;
step S340, based on the basic experience storage information, updating the experience storage information according to the result of the actions selected under the greedy strategy; calculating the loss of the training network of each agent based on the updated experience storage information; based on the loss of the training network of each agent, updating the training networks of the multiple agents by adopting a gradient descent method and updating the target networks of the multiple agents by adopting a time sequence difference method, so as to complete one round of training of the multi-agent deep reinforcement learning model;
step S350, repeating the training process of the multi-agent deep reinforcement learning model on the basis of the updated experience storage information until the training of the multi-agent deep reinforcement learning model with the preset first threshold number of times is completed, and recording that the multi-agent deep reinforcement learning model completes the training of a time node;
step S360, repeating the training process of the time node of the multi-agent deep reinforcement learning model on the basis of the basic experience storage information until the multi-agent deep reinforcement learning model finishes the training of the time node for the preset second threshold number of times, and taking the multi-agent deep reinforcement learning model trained last as the model to be evaluated;
step S370, taking the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for a preset first threshold number of times as the reward value of one time node, and calculating the average of the reward values of a preset third threshold number of time nodes; updating the action radius according to the average of the reward values;
step S380, based on the updated action radius, obtaining a model to be evaluated corresponding to the updated action radius, updating the action radius by using a reward value obtained by interaction between the model to be evaluated corresponding to the updated action radius and the environment, and repeating the step until the training times of the multi-agent deep reinforcement learning model reach a preset fourth threshold time, and taking the multi-agent deep reinforcement learning model trained last time as a complete multi-agent deep reinforcement learning model.
Step S310 to step S380 above, firstly, set an action radius based on the difficulty of the element corresponding to the course task, and then determine the environment of the initial multi-agent deep reinforcement learning model according to the action radius. And updating experience storage information in the environment of the initial multi-agent deep reinforcement learning model, training the multi-agent deep reinforcement learning model based on the updated experience storage information, repeating the steps until the multi-agent deep reinforcement learning model completes training of one time node, repeating the training process of the time node of the multi-agent deep reinforcement learning model until training of the time node of a preset second threshold number is completed, and obtaining the model to be evaluated. And updating the action radius according to the rewarding value obtained by interaction of the model to be evaluated and the environment. Based on the updated action radius, obtaining a model to be evaluated corresponding to the updated action radius, updating the action radius by using a reward value obtained by interaction between the model to be evaluated corresponding to the updated action radius and the environment, repeating the steps until a trained multi-agent deep reinforcement learning model with a preset fourth threshold number of times is obtained, and taking the multi-agent deep reinforcement learning model after the last training as a complete multi-agent deep reinforcement learning model. According to the method, the action radius related to the element difficulty of the course task is set, and the adjustment mechanism of the action radius is set according to the rewarding value of the training result, so that a complex task can be split into a plurality of tasks and is gradually executed from simple to complex, the complexity of executing the task is reduced, the calculated amount in the course of solving the course task is reduced, the resources of a computer are further reduced, and the problem that the efficiency of processing the course task by the computer is low due to the fact that the resources of the computer are occupied is solved.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, in this embodiment, a multi-agent deep reinforcement learning device based on course learning is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, and is not described in detail. The terms "module," "unit," "sub-unit," and the like as used below may refer to a combination of software and/or hardware that performs a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementations in hardware, or a combination of software and hardware, are also possible and contemplated.
In one embodiment, fig. 4 is a block diagram of a multi-agent deep reinforcement learning device based on course learning according to an embodiment of the present application, as shown in fig. 4, the multi-agent deep reinforcement learning device based on course learning includes:
a generating module 41, configured to set an action radius inversely proportional to the difficulty of the element based on the difficulty of the element corresponding to the course task, and generate a difficulty measure according to the action radius;
a determining module 42, configured to determine an environment of the initial multi-agent deep reinforcement learning model according to an action radius corresponding to a first column of the one-dimensional matrix of the difficulty measure;
a first updating module 43, configured to update experience storage information in an environment of an initial multi-agent deep reinforcement learning model, to obtain basic experience storage information that can be used for model training;
a first training module 44, configured to update the experience storage information, on the basis of the basic experience storage information, according to the result of the actions selected under the greedy strategy; calculate the loss of the training network of each agent based on the updated experience storage information; and, based on the loss of the training network of each agent, update the parameters of the training networks of the multiple agents by adopting a gradient descent method and update the parameters of the target networks of the multiple agents by adopting a time sequence difference method, so as to complete one round of training of the multi-agent deep reinforcement learning model;
The second training module 45 is configured to repeat the training process of the multi-agent deep reinforcement learning model based on the updated experience storage information until the training of the multi-agent deep reinforcement learning model with the preset first threshold number of times is completed, and record that the multi-agent deep reinforcement learning model completes the training of a time node;
the third training module 46 is configured to repeat the training process of the time node of the multi-agent deep reinforcement learning model based on the basic experience storage information until the multi-agent deep reinforcement learning model completes training of the time node with a preset second threshold number of times, and take the multi-intelligent deep reinforcement learning model after the last training as the model to be evaluated;
a second updating module 47, configured to take the sum of the reward values obtained from a preset first threshold number of interactions between the model to be evaluated and the environment as the reward value of one time node, to calculate the average of the reward values over a preset third threshold number of time nodes, and to update the action radius according to the average of the reward values;
and a model obtaining module 48, configured to obtain, based on the updated action radius, a model to be evaluated corresponding to the updated action radius, to further update the action radius using the reward values obtained from interaction between that model and the environment, and to repeat these steps until a multi-agent deep reinforcement learning model trained a preset fourth threshold number of times is obtained, the multi-agent deep reinforcement learning model after the last training being taken as the complete multi-agent deep reinforcement learning model.
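To make the curriculum set-up of the generating module 41 and the determining module 42 concrete, the following minimal Python sketch shows one way a difficulty measurer of this kind might be built: one action radius per course-task element, inversely proportional to that element's difficulty, arranged from large to small in a one-dimensional matrix whose first column fixes the initial environment. All names, the scale factor, and the exact form of the inverse proportionality are illustrative assumptions rather than the patented implementation.

import numpy as np

def build_difficulty_measurer(element_difficulties, scale=1.0):
    # One action radius per course-task element, inversely proportional to the
    # element's difficulty (assumed form: radius = scale / difficulty), arranged
    # from large (easy) to small (hard) as a one-dimensional matrix.
    difficulties = np.asarray(element_difficulties, dtype=float)
    radii = scale / difficulties
    return np.sort(radii)[::-1].reshape(1, -1)

def initial_action_radius(measurer):
    # The determining module fixes the initial environment from the first column.
    return float(measurer[0, 0])

# Example with three elements of increasing difficulty.
measurer = build_difficulty_measurer([1.0, 2.0, 4.0])
print(measurer)                          # [[1.   0.5  0.25]]
print(initial_action_radius(measurer))   # 1.0

Because the radii are sorted from large to small, training always starts in the easiest environment and only moves to a smaller radius when the adjustment mechanism described below allows it.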
According to the above multi-agent deep reinforcement learning device based on course learning, the action radius is set based on the difficulty of the element corresponding to the course task, and the environment of the initial multi-agent deep reinforcement learning model is then determined according to the action radius. Experience storage information is updated in the environment of the initial multi-agent deep reinforcement learning model, the multi-agent deep reinforcement learning model is trained on the basis of the updated experience storage information, and this is repeated until the model completes the training of one time node; the training process of a time node is then repeated until a preset second threshold number of time nodes has been trained, which yields the model to be evaluated. The action radius is updated according to the reward values obtained from interaction between the model to be evaluated and the environment. Based on the updated action radius, a model to be evaluated corresponding to the updated action radius is obtained, the action radius is updated again using the reward values obtained from interaction between that model and the environment, and these steps are repeated until a multi-agent deep reinforcement learning model trained a preset fourth threshold number of times is obtained, the model after the last training being taken as the complete multi-agent deep reinforcement learning model. By setting an action radius related to the element difficulty of the course task and an adjustment mechanism for the action radius driven by the reward values of the training results, a complex task can be split into several tasks that are executed progressively from simple to complex. This reduces the complexity of executing the task and the amount of computation involved in solving the course task, and therefore the computer resources consumed, which addresses the problem of low efficiency in processing course tasks caused by excessive occupation of computer resources.
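The adjustment mechanism summarized above — averaging the reward values of several time nodes and moving to the next column of the difficulty measurer when the average is high enough — can be illustrated by the short Python sketch below. The function name, the threshold, and the example reward values are assumptions made purely for illustration.

import numpy as np

def update_action_radius(node_rewards, measurer, column, reward_threshold):
    # node_rewards: reward values of the last few time nodes, where each entry is
    # already the sum of the rewards collected over one node's interactions.
    # If the average clears the threshold, advance to the next (harder) column.
    average_reward = float(np.mean(node_rewards))
    if average_reward >= reward_threshold and column + 1 < measurer.shape[1]:
        column += 1
    return column, float(measurer[0, column])

# Example, reusing the illustrative measurer of the previous sketch.
measurer = np.array([[1.0, 0.5, 0.25]])
column, radius = update_action_radius([12.0, 15.0, 11.0], measurer,
                                       column=0, reward_threshold=10.0)
print(column, radius)   # 1 0.5 -> training continues in the harder environment

In the full device, a rule of this kind would be invoked once per evaluation round by the second updating module and the model obtaining module until the preset fourth threshold number of trained models has been produced.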
The above-described modules may be functional modules or program modules, and may be implemented by software or by hardware. For modules implemented in hardware, the modules may all be located in the same processor, or may be distributed over different processors in any combination.
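For modules implemented in software, the core computations of the first updating module (filling the experience store through greedy action selection) and the first training module (the per-agent temporal-difference update of the training and target networks) might look roughly like the following PyTorch sketch. The network architecture, the ε-greedy form of the greedy strategy, the soft target update, and all hyper-parameters are assumptions for illustration only; the patent does not prescribe them.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    # Illustrative per-agent Q-network; the patent does not fix an architecture.
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        return self.net(obs)

def epsilon_greedy_action(q_net, obs, n_actions, epsilon=0.1):
    # Assumed form of the greedy strategy used while filling the experience store:
    # explore a random action with probability epsilon, otherwise exploit.
    if torch.rand(1).item() < epsilon:
        return int(torch.randint(0, n_actions, (1,)).item())
    with torch.no_grad():
        return int(q_net(obs.unsqueeze(0)).argmax(dim=1).item())

def train_step(train_net, target_net, optimizer, batch, gamma=0.99, tau=0.01):
    # One per-agent update: temporal-difference target from the target network,
    # loss of the training network, gradient descent, then a soft target update.
    obs, actions, rewards, next_obs, dones = batch
    q = train_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        td_target = rewards + gamma * (1.0 - dones) * target_net(next_obs).max(dim=1).values
    loss = F.mse_loss(q, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), train_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
    return loss.item()

# Toy usage for a single agent, with random data standing in for sampled experience.
obs_dim, n_actions, batch_size = 8, 5, 32
train_net, target_net = QNet(obs_dim, n_actions), QNet(obs_dim, n_actions)
target_net.load_state_dict(train_net.state_dict())
optimizer = torch.optim.Adam(train_net.parameters(), lr=1e-3)
print(epsilon_greedy_action(train_net, torch.randn(obs_dim), n_actions))
batch = (torch.randn(batch_size, obs_dim),
         torch.randint(0, n_actions, (batch_size,)),
         torch.randn(batch_size),
         torch.randn(batch_size, obs_dim),
         torch.zeros(batch_size))
print(train_step(train_net, target_net, optimizer, batch))

A periodic hard copy of the training-network parameters into the target network would serve the same purpose as the soft update shown here; the patent text does not state which variant is used.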
In one embodiment, a computer device is provided, which includes a memory storing a computer program and a processor that, when executing the computer program, implements any one of the course-learning-based multi-agent deep reinforcement learning methods of the above embodiments.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements any one of the course-learning-based multi-agent deep reinforcement learning methods of the above embodiments.
The user information (including but not limited to user equipment information, user personal information, and the like) and the data (including but not limited to data for analysis, stored data, displayed data, and the like) involved in the present application are information and data authorized by the user or fully authorized by all parties concerned.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, and the computer program, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM is available in many forms, such as static random access memory (SRAM) and dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processor referred to in the embodiments provided in the present application may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, or a data processing logic unit based on quantum computing.
The technical features of the above embodiments may be combined in any manner. For brevity of description, not all possible combinations of these technical features have been described; nevertheless, as long as a combination of these technical features involves no contradiction, it should be regarded as falling within the scope of this description.
The foregoing examples merely illustrate several embodiments of the application and are described in comparatively specific detail, but they should not therefore be construed as limiting the scope of the application. It should be noted that a person skilled in the art could make several variations and improvements without departing from the concept of the application, all of which fall within the scope of protection of the application. Accordingly, the scope of protection of the application shall be subject to the appended claims.

Claims (10)

1. A multi-agent deep reinforcement learning method based on course learning, the method comprising:
setting, based on the difficulty of an element corresponding to the course task, an action radius inversely proportional to the difficulty of the element, and generating a difficulty measurer according to the action radius;
determining the environment of an initial multi-agent deep reinforcement learning model according to the action radius corresponding to the first column of the one-dimensional matrix of the difficulty measurer;
updating experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain basic experience storage information which can be used for model training;
on the basis of the basic experience storage information, updating the experience storage information according to the result of actions selected under a greedy strategy; calculating the loss of the training network of each agent based on the updated experience storage information; and, based on the loss of the training network of each agent, updating the parameters of the training networks of the multiple agents by a gradient descent method and the parameters of the target networks of the multiple agents by a temporal-difference method, so as to complete the training of a multi-agent deep reinforcement learning model;
repeating the training process of the multi-agent deep reinforcement learning model on the basis of the updated experience storage information until the multi-agent deep reinforcement learning model has been trained a preset first threshold number of times, and recording that the multi-agent deep reinforcement learning model has completed the training of one time node;
repeating the training process of a time node of the multi-agent deep reinforcement learning model on the basis of the basic experience storage information until the multi-agent deep reinforcement learning model has completed the training of a preset second threshold number of time nodes, and taking the multi-agent deep reinforcement learning model after the last training as a model to be evaluated;
taking the sum of the reward values obtained from a preset first threshold number of interactions between the model to be evaluated and the environment as the reward value of one time node, and calculating the average of the reward values over a preset third threshold number of time nodes; updating the action radius according to the average of the reward values;
based on the updated action radius, obtaining a model to be evaluated corresponding to the updated action radius, updating the action radius using the reward values obtained from interaction between the model to be evaluated corresponding to the updated action radius and the environment, and repeating these steps until the number of training times of the multi-agent deep reinforcement learning model reaches a preset fourth threshold number, and taking the multi-agent deep reinforcement learning model after the last training as a complete multi-agent deep reinforcement learning model.
2. The course learning-based multi-agent deep reinforcement learning method of claim 1, wherein the setting, based on the difficulty of the element corresponding to the course task, of an action radius inversely proportional to the difficulty of the element, and the generating of the difficulty measurer according to the action radius, comprise:
determining a course object corresponding to the course task based on the course task;
determining, based on the course object, an element associated with the course object;
setting, based on the difficulty of the element, the action radius inversely proportional to the difficulty of the element, and arranging the action radii from large to small;
and generating a one-dimensional matrix from the action radii according to a preset arrangement order to obtain the difficulty measurer.
3. The course learning-based multi-agent deep reinforcement learning method of claim 1, wherein the updating of experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain basic experience storage information usable for model training is preceded by:
initializing parameters of a training network and parameters of a target network of the initial multi-agent deep reinforcement learning model, and initializing experience storage information corresponding to the environment of the initial multi-agent deep reinforcement learning model.
4. The course learning-based multi-agent deep reinforcement learning method of claim 1, wherein updating the experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain the basic experience storage information usable for model training comprises:
updating the experience storage information according to the result of actions selected by the multiple agents under a greedy strategy in the environment of the initial multi-agent deep reinforcement learning model;
and obtaining the basic experience storage information which can be used for model training when the number of times the experience storage information has been updated reaches a preset update threshold number.
5. The course learning-based multi-agent deep reinforcement learning method of claim 4, wherein the updating of the experience storage information according to the result of actions selected by the multiple agents under a greedy strategy in the environment of the initial multi-agent deep reinforcement learning model comprises:
controlling the multiple agents to select actions according to a greedy strategy in the environment of the initial multi-agent deep reinforcement learning model;
and storing the information generated by the agents' action selection in an experience storage area, and updating the experience storage information.
6. The course learning-based multi-agent deep reinforcement learning method of claim 5, wherein the storing of the information generated by the agents' action selection in an experience storage area and the updating of the experience storage information comprise:
determining the state information of the next moment obtained from interaction between the multiple agents and the environment of the initial multi-agent deep reinforcement learning model;
calculating the reward value obtained when the multiple agents select actions according to a greedy strategy in the environment of the initial multi-agent deep reinforcement learning model;
and storing the action information selected by the multiple agents, the reward value obtained by the selected actions, and the obtained state information of the next moment in the experience storage area, so as to complete the update of the experience storage information.
7. The course learning-based multi-agent deep reinforcement learning method of claim 1, wherein the calculating of the loss of the training network of each agent based on the updated experience storage information comprises:
calculating a temporal-difference value of the target network of each agent based on the updated experience storage information;
and calculating the loss of the training network of each agent according to the temporal-difference value of the target network of that agent.
8. The course learning-based multi-agent deep reinforcement learning method of claim 1, wherein the taking of the sum of the reward values obtained from a preset first threshold number of interactions between the model to be evaluated and the environment as the reward value of one time node, the calculating of the average of the reward values over a preset third threshold number of time nodes, and the updating of the action radius according to the average of the reward values comprise:
calculating the sum of the reward values obtained from a preset first threshold number of interactions between the model to be evaluated and the environment to obtain the reward value of one time node;
calculating the average of the reward values over a preset third threshold number of time nodes;
judging whether the average of the reward values is greater than or equal to a reward-value threshold;
and updating the action radius to the action radius of the column following the current column of the one-dimensional matrix of the difficulty measurer when the average of the reward values is greater than or equal to the reward-value threshold.
9. A multi-agent deep reinforcement learning device based on course learning, the device comprising:
the generating module is used for setting, based on the difficulty of the element corresponding to the course task, an action radius inversely proportional to the difficulty of the element, and generating a difficulty measurer according to the action radius;
the determining module is used for determining the environment of the initial multi-agent deep reinforcement learning model according to the action radius corresponding to the first column of the one-dimensional matrix of the difficulty measurer;
the first updating module is used for updating experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain basic experience storage information which can be used for model training;
the first training module is used for updating, on the basis of the basic experience storage information, the experience storage information according to the result of actions selected under a greedy strategy; calculating the loss of the training network of each agent based on the updated experience storage information; and, based on the loss of the training network of each agent, updating the parameters of the training networks of the multiple agents by a gradient descent method and the parameters of the target networks of the multiple agents by a temporal-difference method, so as to complete the training of a multi-agent deep reinforcement learning model;
the second training module is used for repeating the training process of the multi-agent deep reinforcement learning model on the basis of the updated experience storage information until the multi-agent deep reinforcement learning model has been trained a preset first threshold number of times, and recording that the multi-agent deep reinforcement learning model has completed the training of one time node;
the third training module is used for repeating the training process of a time node of the multi-agent deep reinforcement learning model on the basis of the basic experience storage information until the multi-agent deep reinforcement learning model has completed the training of a preset second threshold number of time nodes, and taking the multi-agent deep reinforcement learning model after the last training as a model to be evaluated;
the second updating module is used for taking the sum of the reward values obtained from a preset first threshold number of interactions between the model to be evaluated and the environment as the reward value of one time node, and calculating the average of the reward values over a preset third threshold number of time nodes; updating the action radius according to the average of the reward values;
and the model obtaining module is used for obtaining, based on the updated action radius, a model to be evaluated corresponding to the updated action radius, updating the action radius using the reward values obtained from interaction between the model to be evaluated corresponding to the updated action radius and the environment, and repeating these steps until the number of training times of the multi-agent deep reinforcement learning model reaches a preset fourth threshold number, and taking the multi-agent deep reinforcement learning model after the last training as a complete multi-agent deep reinforcement learning model.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
CN202311029693.2A 2023-08-16 2023-08-16 Multi-agent deep reinforcement learning method and device based on course learning Active CN116739077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311029693.2A CN116739077B (en) 2023-08-16 2023-08-16 Multi-agent deep reinforcement learning method and device based on course learning

Publications (2)

Publication Number Publication Date
CN116739077A true CN116739077A (en) 2023-09-12
CN116739077B CN116739077B (en) 2023-10-31

Family

ID=87903053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311029693.2A Active CN116739077B (en) 2023-08-16 2023-08-16 Multi-agent deep reinforcement learning method and device based on course learning

Country Status (1)

Country Link
CN (1) CN116739077B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160225278A1 (en) * 2015-01-31 2016-08-04 Usa Life Nutrition Llc Method and apparatus for incentivization of learning
US20210357800A1 (en) * 2020-05-13 2021-11-18 Seagate Technology Llc Distributed decentralized machine learning model training
WO2022017596A1 (en) * 2020-07-22 2022-01-27 Telefonaktiebolaget Lm Ericsson (Publ) Method and computer system determining a representation of a parameter
US20220075383A1 (en) * 2020-09-10 2022-03-10 Kabushiki Kaisha Toshiba Task performing agent systems and methods
CN113449458A (en) * 2021-07-15 2021-09-28 海南大学 Multi-agent depth certainty strategy gradient method based on course learning
CN116127848A (en) * 2023-02-27 2023-05-16 东南大学 Multi-unmanned aerial vehicle collaborative tracking method based on deep reinforcement learning
CN116225016A (en) * 2023-03-06 2023-06-06 东北大学 Multi-agent path planning method based on distributed collaborative depth reinforcement learning model

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LIANG ZHANG et al.: "A deep reinforce learning-based intrusion detection method for safeguarding Internet of Things", International Conference on Computer Network Security and Software Engineering, pages 127140 *
TONY DIANA: "Using sentiment analysis to reinforce learning: The case of airport community engagement", Journal of Air Transport Management, pages 1-8 *
LIU JINGSHU et al.: "Multi-UAV cooperative reconnaissance mission planning based on clustering and reinforcement learning" (in Chinese), Journal of China Academy of Electronics and Information Technology, vol. 18, no. 1, pages 21-25 *
ZHOU DONGXU: "Research on motion planning methods for robotic arms based on deep reinforcement learning" (in Chinese), China Doctoral Dissertations Full-text Database, Information Science and Technology, pages 140-4 *
CHEN PING et al.: "Research on mimic defense design of Internet of Vehicles systems based on reinforcement learning" (in Chinese), Journal of Information Security Research, vol. 8, no. 6, pages 545-553 *

Also Published As

Publication number Publication date
CN116739077B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
US20230252327A1 (en) Neural architecture search for convolutional neural networks
CN109635917B (en) Multi-agent cooperation decision and training method
CN109948029B (en) Neural network self-adaptive depth Hash image searching method
CN110674869B (en) Classification processing and graph convolution neural network model training method and device
US11388424B2 (en) Making object-level predictions of the future state of a physical system
CN111406264A (en) Neural architecture search
CN112513886B (en) Information processing method, information processing apparatus, and information processing program
CN111445020A (en) Graph-based convolutional network training method, device and system
CN112269382B (en) Robot multi-target path planning method
CN112947591A (en) Path planning method, device, medium and unmanned aerial vehicle based on improved ant colony algorithm
CN110009048B (en) Method and equipment for constructing neural network model
CN116739077B (en) Multi-agent deep reinforcement learning method and device based on course learning
CN111046955B (en) Multi-agent confrontation strategy intelligent prediction method and device based on graph network
CN110610231A (en) Information processing method, electronic equipment and storage medium
CN113609785B (en) Federal learning super-parameter selection system and method based on Bayesian optimization
CN111008705A (en) Searching method, device and equipment
KR101947780B1 (en) Method and system for downsizing neural network
JP7398625B2 (en) Machine learning devices, information processing methods and programs
CN115238134A (en) Method and apparatus for generating a graph vector representation of a graph data structure
CN110705437A (en) Face key point detection method and system based on dynamic cascade regression
CN113963551B (en) Vehicle positioning method, system, device and medium based on cooperative positioning
US20240078427A1 (en) Collaborative machine learning whose result is stored in a shared memory controlled by a central device
CN116030079A (en) Geofence partitioning method, device, computer equipment and storage medium
KR20220013231A (en) Electronic device and method for inferring objects within a video
CN116599845A (en) Safety communication and resource allocation method and device for power grid information physical system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant