CN116739077A - Multi-agent deep reinforcement learning method and device based on course learning - Google Patents


Info

Publication number
CN116739077A
Authority
CN
China
Prior art keywords
reinforcement learning
deep reinforcement
agent
training
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311029693.2A
Other languages
Chinese (zh)
Other versions
CN116739077B (en)
Inventor
李敏
宋南骏
曾祥光
罗仕杰
张加衡
张童伟
张森
张越龙
潘云伟
邢丽静
黄傲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202311029693.2A priority Critical patent/CN116739077B/en
Publication of CN116739077A publication Critical patent/CN116739077A/en
Application granted granted Critical
Publication of CN116739077B publication Critical patent/CN116739077B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/092 - Reinforcement learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Abstract

The application relates to a multi-agent deep reinforcement learning method and device based on course learning. The method comprises the following steps: determining an environment of an initial multi-agent deep reinforcement learning model based on the action radius; completing one round of training of the multi-agent deep reinforcement learning model in that environment; repeating the process until the multi-agent deep reinforcement learning model completes the training of one time node; repeating the time-node training process of the multi-agent deep reinforcement learning model to obtain a model to be evaluated; calculating the reward value obtained by the interaction between the model to be evaluated and the environment, updating the action radius according to the calculated reward value, and repeating the above steps until the complete multi-agent deep reinforcement learning model is obtained. The method can alleviate the problem that heavy occupation of computer resources leads to low efficiency when a computer processes course tasks.

Description

Multi-agent deep reinforcement learning method and device based on course learning
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a multi-agent deep reinforcement learning method and device based on course learning.
Background
Along with the development of artificial intelligence, the multi-agent cooperation problem has become a classical cooperative game problem in the field of multi-agent systems and has wide application in multi-robot control tasks such as trapping, interception, searching and tracking. However, because the course tasks to which multi-agent deep reinforcement learning is applied are highly complex, conventional multi-agent deep reinforcement learning methods involve a large amount of computation when solving such tasks and occupy considerable computer resources, which leads to low efficiency when the computer processes course tasks.
No effective solution has yet been proposed for the problem in the prior art that heavy occupation of computer resources leads to low efficiency when a computer processes course tasks.
Disclosure of Invention
Based on the foregoing, it is necessary to provide a multi-agent deep reinforcement learning method and device based on course learning.
In a first aspect, the present application provides a multi-agent deep reinforcement learning method based on course learning. The method comprises the following steps:
setting an action radius inversely proportional to the difficulty of an element based on the difficulty of the element corresponding to the course task, and generating a difficulty measure according to the action radius;
determining the environment of an initial multi-agent deep reinforcement learning model according to the action radius corresponding to the first column of the one-dimensional matrix of the difficulty measure;
updating experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain basic experience storage information which can be used for model training;
based on the basic experience storage information, updating the experience storage information according to the result of the actions selected under a greedy strategy; calculating the loss of the training network of each agent based on the updated experience storage information; based on the loss of the training network of each agent, updating the parameters of the training networks of the multiple agents by adopting a gradient descent method, and updating the parameters of the target networks of the multiple agents by adopting a time sequence difference method, so as to complete one round of training of the multi-agent deep reinforcement learning model;
repeating the training process of the multi-agent deep reinforcement learning model on the basis of the updated experience storage information until the training of the multi-agent deep reinforcement learning model with a preset first threshold number of times is completed, and marking that the multi-agent deep reinforcement learning model completes the training of a time node;
repeating the training process of the time node of the multi-agent deep reinforcement learning model on the basis of the basic experience storage information until the multi-agent deep reinforcement learning model finishes the training of the time node for a preset second threshold number of times, and taking the multi-agent deep reinforcement learning model trained last as a model to be evaluated;
taking the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for a preset first threshold number of times as the reward value of one time node, and calculating the average of the reward values of a preset third threshold number of time nodes; updating the action radius based on the average of the reward values;
based on the updated action radius, obtaining a model to be evaluated corresponding to the updated action radius, updating the action radius by using a reward value obtained by interaction between the model to be evaluated corresponding to the updated action radius and the environment, and repeating the steps until the training times of the multi-agent deep reinforcement learning model reach a preset fourth threshold time, and taking the multi-agent deep reinforcement learning model after the last training as a complete multi-agent deep reinforcement learning model.
In one embodiment, the setting an action radius inversely proportional to the difficulty of the element based on the difficulty of the element corresponding to the course task, and generating a difficulty measure according to the action radius, includes:
determining a course object corresponding to the course task based on the course task;
determining, based on the curriculum object, an element associated with the curriculum object;
setting action radii inversely proportional to the difficulty of the element based on the difficulty of the element, and arranging the action radii from largest to smallest;
and generating a one-dimensional matrix from the action radii according to a preset arrangement order to obtain the difficulty measure.
In one embodiment, before the updating of the experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain the basic experience storage information that can be used for model training, the method includes:
initializing parameters of a training network and parameters of a target network of the initial multi-agent deep reinforcement learning model, and initializing experience storage information corresponding to the environment of the initial multi-agent deep reinforcement learning model.
In one embodiment, in the environment of the initial multi-agent deep reinforcement learning model, updating the experience storage information to obtain the basic experience storage information which can be used for model training comprises the following steps:
updating the experience storage information according to the result of the multi-agent selecting action in the environment of the initial multi-agent deep reinforcement learning model according to a greedy strategy;
and under the condition that the times of updating the experience storage information reach the preset updating threshold times, obtaining the basic experience storage information which can be used for model training.
In one embodiment, the updating the empirically stored information according to the result of the multi-agent selection action in the environment of the initial multi-agent deep reinforcement learning model according to a greedy strategy includes:
controlling the multi-agent, and selecting actions according to a greedy strategy in the environment of the initial multi-agent deep reinforcement learning model;
and storing information generated by the multi-agent selection action in an experience storage area, and updating the experience storage information.
In one embodiment, the storing the information generated by the multi-agent selection action in the experience storage area, and updating the experience storage information includes:
determining state information of the next moment obtained by interaction between the multi-agent and the environment of the initial multi-agent deep reinforcement learning model;
calculating a reward value obtained by the multi-agent selecting action according to a greedy strategy in the environment of the initial multi-agent deep reinforcement learning model;
and storing the action information selected by the multiple agents, the rewarding value obtained by the selected action and the obtained state information at the next moment in the experience storage area to finish updating of the experience storage information.
In one embodiment, the calculating the loss of the training network of each agent based on the updated empirically stored information includes:
calculating a differential value of a target network of each intelligent agent based on the updated experience storage information;
and calculating the loss of the training network of each intelligent agent according to the differential value of the target network of each intelligent agent.
In one embodiment, the taking of the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for a preset first threshold number of times as the reward value of one time node, the calculating of the average of the reward values of a preset third threshold number of time nodes, and the updating of the action radius based on the average of the reward values include:
calculating the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for a preset first threshold number of times to obtain the reward value of one time node;
calculating the average of the reward values of a preset third threshold number of time nodes;
judging whether the average of the reward values is greater than or equal to a reward value threshold;
and updating the action radius to the column next to the current column of the one-dimensional matrix of the difficulty measure when the average of the reward values is greater than or equal to the reward value threshold.
In a second aspect, the application also provides a multi-agent deep reinforcement learning device based on course learning. The device comprises:
the generation module is used for setting an action radius inversely proportional to the difficulty of the element based on the difficulty of the element corresponding to the course task, and generating a difficulty measure according to the action radius;
the determining module is used for determining the environment of the initial multi-agent deep reinforcement learning model according to the action radius corresponding to the first column of the one-dimensional matrix of the difficulty measure;
the first updating module is used for updating experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain basic experience storage information which can be used for model training;
the first training module is used for updating the experience storage information according to the result of the actions selected under the greedy strategy on the basis of the basic experience storage information; calculating the loss of the training network of each agent based on the updated experience storage information; and, based on the loss of the training network of each agent, updating the parameters of the training networks of the multiple agents by adopting a gradient descent method and updating the parameters of the target networks of the multiple agents by adopting a time sequence difference method, so as to complete one round of training of the multi-agent deep reinforcement learning model;
the second training module is used for repeating the training process of the multi-agent deep reinforcement learning model on the basis of the updated experience storage information until the training of the multi-agent deep reinforcement learning model with the preset first threshold number of times is completed, and recording that the multi-agent deep reinforcement learning model completes the training of a time node;
the third training module is used for repeating the training process of the time node of the multi-agent deep reinforcement learning model on the basis of the basic experience storage information until the multi-agent deep reinforcement learning model finishes the training of the time node for the preset second threshold number of times, and taking the multi-agent deep reinforcement learning model after the last training as the model to be evaluated;
the second updating module is used for taking the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for the preset first threshold number of times as the reward value of one time node, and calculating the average of the reward values of the preset third threshold number of time nodes; and updating the action radius based on the average of the reward values;
and the model acquisition module is used for obtaining a model to be evaluated corresponding to the updated action radius based on the updated action radius, updating the action radius by utilizing the reward value obtained by the interaction between the model to be evaluated corresponding to the updated action radius and the environment, repeating the above steps until the number of training iterations of the multi-agent deep reinforcement learning model reaches a preset fourth threshold number of times, and taking the multi-agent deep reinforcement learning model trained last as the complete multi-agent deep reinforcement learning model.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to realize the multi-agent deep reinforcement learning method based on course learning according to the first aspect.
According to the multi-agent deep reinforcement learning method, device and computer equipment based on course learning, the action radius is set based on the difficulty of the element corresponding to the course task, and then the environment of the initial multi-agent deep reinforcement learning model is determined according to the action radius. And updating experience storage information in the environment of the initial multi-agent deep reinforcement learning model, training the multi-agent deep reinforcement learning model based on the updated experience storage information, repeating the steps until the multi-agent deep reinforcement learning model completes training of one time node, repeating the training process of the time node of the multi-agent deep reinforcement learning model until training of the time node of a preset second threshold number is completed, and obtaining the model to be evaluated. And updating the action radius according to the rewarding value obtained by interaction of the model to be evaluated and the environment. Based on the updated action radius, obtaining a model to be evaluated corresponding to the updated action radius, updating the action radius by using a reward value obtained by interaction between the model to be evaluated corresponding to the updated action radius and the environment, repeating the steps until a trained multi-agent deep reinforcement learning model with a preset fourth threshold number of times is obtained, and taking the multi-agent deep reinforcement learning model after the last training as a complete multi-agent deep reinforcement learning model. According to the method, the action radius related to the element difficulty of the course task is set, and the adjustment mechanism of the action radius is set according to the rewarding value of the training result, so that a complex task can be split into a plurality of tasks and is gradually executed from simple to complex, the complexity of executing the task is reduced, the calculated amount in the course of solving the course task is reduced, the resources of a computer are further reduced, and the problem that the efficiency of processing the course task by the computer is low due to the fact that the resources of the computer are occupied is solved.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a block diagram of a hardware architecture of a terminal of a multi-agent deep reinforcement learning method based on course learning according to an embodiment of the present application;
FIG. 2 is a flowchart of a multi-agent deep reinforcement learning method based on course learning according to an embodiment of the present application;
FIG. 3 is a flow chart of a multi-agent deep reinforcement learning method based on course learning according to a preferred embodiment of the present application;
fig. 4 is a block diagram of a multi-agent deep reinforcement learning device based on course learning according to an embodiment of the present application.
Detailed Description
The present application will be described and illustrated with reference to the accompanying drawings and examples for a clearer understanding of the objects, technical solutions and advantages of the present application.
Unless defined otherwise, technical or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," "these" and similar terms in this application are not intended to be limiting in number, but may be singular or plural. The terms "comprising," "including," "having," and any variations thereof, as used herein, are intended to encompass non-exclusive inclusion; for example, a process, method, and system, article, or apparatus that comprises a list of steps or modules (units) is not limited to the list of steps or modules (units), but may include other steps or modules (units) not listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this disclosure are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. Typically, the character "/" indicates that the associated object is an "or" relationship. The terms "first," "second," "third," and the like, as referred to in this disclosure, merely distinguish similar objects and do not represent a particular ordering for objects.
The method embodiments provided in the present embodiment may be executed in a terminal, a computer, or similar computing device. For example, the method runs on a terminal, and fig. 1 is a block diagram of a hardware structure of the terminal of the multi-agent deep reinforcement learning method based on course learning according to the present embodiment. As shown in fig. 1, the terminal may include one or more (only one is shown in fig. 1) processors 102 and a memory 104 for storing data, wherein the processors 102 may include, but are not limited to, a microprocessor MCU, a programmable logic device FPGA, or the like. The terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and is not intended to limit the structure of the terminal. For example, the terminal may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to the multi-agent deep reinforcement learning method based on course learning according to the embodiment of the present application, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. The network includes a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
In this embodiment, a multi-agent deep reinforcement learning method based on course learning is provided, fig. 2 is a flowchart of the multi-agent deep reinforcement learning method based on course learning in this embodiment, as shown in fig. 2, and the flowchart includes the following steps:
step S210, setting an action radius inversely proportional to the difficulty of the element based on the difficulty of the element corresponding to the course task, and generating a difficulty measure according to the action radius.
In this step, the course task may be one or more of a trapping task, an intercepting task, a searching task, and a tracking task. The element corresponding to the course task may be an element associated with the course object corresponding to the course task, and the element reflects the difficulty of completing the course task. For example, in a trapping task, the course object corresponding to the trapping task may be a circle centered on our own agent, and the element corresponding to the trapping task may be a discrete value of the trapping radius of the system. For example, the element corresponding to a trapping task includes a simplest trapping radius and a most difficult trapping radius, wherein the simplest trapping radius has a value of 9 m and the most difficult trapping radius has a value of 3 m. The action radius may be a radius length inversely proportional to the difficulty of the element. Setting an action radius inversely proportional to the difficulty of the element corresponding to the course task and generating a difficulty measure according to the action radius may be implemented as follows: based on the course task, determine the course object corresponding to the course task; based on the course object, determine the element associated with the course object; based on the difficulty of the element, set action radii inversely proportional to the difficulty of the element and arrange them from largest to smallest; and finally generate a one-dimensional matrix from the action radii according to the preset arrangement order to obtain the difficulty measure. The one-dimensional matrix of the difficulty measure D can be expressed by the following matrix formula:
D = [d_1, d_2, …, d_n]
where d_1 denotes the value of the largest action radius, d_2 denotes the value of the action radius ranked second when the action radii are arranged from largest to smallest, and d_n denotes the value of the action radius ranked n-th.
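As an illustration of how such a difficulty measure could be built, the following Python sketch constructs the one-dimensional matrix of action radii from the trapping-task example above; the number of difficulty levels and the linear spacing are assumptions made here for illustration and are not specified by the application.

```python
# Minimal sketch of building the difficulty measure D = [d_1, ..., d_n]
# (number of levels and linear spacing are assumptions, not from the application).
import numpy as np

def build_difficulty_measure(easiest_radius=9.0, hardest_radius=3.0, n_levels=4):
    """Action radii sorted from largest (easiest) to smallest (hardest)."""
    return np.linspace(easiest_radius, hardest_radius, n_levels)

difficulty_measure = build_difficulty_measure()  # e.g. array([9., 7., 5., 3.])
current_column = 0                               # start from the first column d_1
action_radius = difficulty_measure[current_column]
```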
Step S220, determining the environment of the initial multi-agent deep reinforcement learning model according to the action radius corresponding to the first column of the one-dimensional matrix of the difficulty measure.
Specifically, a simulation space is established according to the action radius corresponding to the first column of the one-dimensional matrix of the difficulty measure, and the environment of the initial multi-agent deep reinforcement learning model is then determined according to the simulation space. The initial multi-agent deep reinforcement learning model may include a target network of the initial multi-agent deep reinforcement learning model and a training network of the initial multi-agent deep reinforcement learning model. Specifically, the target network may be an Actor-Critic target network, and the training network may be an Actor-Critic training network.
The environment of the initial multi-agent deep reinforcement learning model is used to obtain, through interaction with it, the initial experience storage information corresponding to that environment.
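As a rough illustration of the Actor-Critic training networks and target networks mentioned above, the following PyTorch-style sketch shows one possible network pair for a single agent; the layer widths, activation functions, and dimensions are assumptions for illustration only and are not taken from the application.

```python
# Illustrative Actor-Critic training networks and target-network copies
# for one agent (all sizes and layer choices are assumptions).
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim), nn.Tanh())
    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    def __init__(self, state_dim, joint_act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + joint_act_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, state, joint_action):
        return self.net(torch.cat([state, joint_action], dim=-1))

# Training networks and their target-network copies for one agent.
actor, critic = Actor(obs_dim=10, act_dim=2), Critic(state_dim=30, joint_act_dim=6)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
```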
Step S230, updating the experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain the basic experience storage information that can be used for model training.
Updating the experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain the basic experience storage information that can be used for training the multi-agent deep reinforcement learning model may proceed as follows: the experience storage information is updated according to the results of the actions selected by the multiple agents in the environment of the initial multi-agent deep reinforcement learning model according to a greedy strategy, and the basic experience storage information that can be used for model training is obtained once the number of updates of the experience storage information reaches a preset update threshold number of times. Specifically, the multiple agents are controlled to select actions according to the greedy strategy in the environment of the initial multi-agent deep reinforcement learning model, the information generated by the selected actions is stored in the experience storage area, and the experience storage information is updated. The greedy strategy may be an epsilon-greedy strategy. Specifically, all agents can select actions according to the epsilon-greedy strategy in the environment of the initial multi-agent deep reinforcement learning model. After an agent selects an action in this environment according to the epsilon-greedy strategy, if the small-probability (exploration) event does not occur, the agent selects its action according to the target network of the initial multi-agent deep reinforcement learning model and interacts with the environment to obtain the state at the next moment. The experience storage area is used to store the experience storage information. The preset update threshold number of times may be set according to an empirical value; when the number of updates of the experience storage information reaches this threshold, the experience storage information is sufficient to support the training of the initial multi-agent deep reinforcement learning model.
Storing the information generated by the actions selected by the multiple agents in the experience storage area and updating the experience storage information may proceed as follows: determine the state information at the next moment obtained by the interaction between the multiple agents and the environment of the initial multi-agent deep reinforcement learning model; calculate the reward value obtained by the actions selected by the multiple agents according to the greedy strategy in that environment; and finally store the action information selected by the multiple agents, the reward value obtained by the selected actions, and the obtained state information at the next moment in the experience storage area, thereby completing the update of the experience storage information.
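The experience-collection step described above can be sketched as follows; the buffer size, the exploration probability, and the environment/agent interfaces are assumptions used only for illustration.

```python
# Illustrative epsilon-greedy action selection and experience storage
# (buffer size, epsilon and the env/agent interfaces are assumptions).
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # experience storage area
EPSILON = 0.1                           # probability of the small-probability (exploration) event

def select_action(agent, obs, action_space):
    """Explore with small probability; otherwise act from the agent's target network."""
    if random.random() < EPSILON:
        return action_space.sample()
    return agent.target_act(obs)        # hypothetical helper wrapping the target Actor

def collect_step(env, agents, observations):
    actions = [select_action(a, o, env.action_space) for a, o in zip(agents, observations)]
    next_observations, rewards, done, _ = env.step(actions)   # interact with the environment
    replay_buffer.append((observations, actions, rewards, next_observations))
    return next_observations, done
```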
Step S240, based on the basic experience storage information, updating the experience storage information according to the result of the actions selected under the greedy strategy; calculating the loss of the training network of each agent based on the updated experience storage information; and, based on the loss of the training network of each agent, updating the parameters of the training networks of the multiple agents by a gradient descent method and updating the parameters of the target networks of the multiple agents by a time sequence difference method, so as to complete one round of training of the multi-agent deep reinforcement learning model.
In this step, the loss of the training network of each agent may be calculated from the updated experience storage information by first calculating the differential value of the target network of each agent based on the updated experience storage information and then calculating the loss of the training network of each agent from that differential value. The differential value y of the target network of each agent is calculated based on the updated experience storage information as follows:
y = r_i + γ Q′
where r_i denotes the reward value obtained by the action selected by agent i, γ is the reward discount coefficient, and Q′ is the action-value function output by the target network for the state at the next moment.
The loss L(θ_i) of the training network of each agent is calculated from the differential value y of the target network of that agent as follows:
L(θ_i) = E_{s,a,r,s′}[ (y - Q)² ]
where E_{s,a,r,s′} denotes the expectation computed in parallel over a batch of experience storage information, and Q is the action-value function output by the training network.
The parameter θ′_i of the target network of agent i is updated by the time sequence difference method according to the following formula:
θ′_i ← τ θ_i + (1 - τ) θ′_i
where τ is the update coefficient of the time sequence difference and θ_i is the parameter of the training network of agent i.
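Taken together, the differential value, the loss, the gradient-descent update, and the time-sequence-difference update correspond to one training step per agent. The sketch below is an illustrative PyTorch-flavoured rendering of that step; the attribute names (critic, target_critic, critic_optim, target_actions) and the batch format are assumptions, not the application's actual implementation.

```python
# Illustrative training step for one agent i (interfaces and names are assumptions).
import torch
import torch.nn.functional as F

def train_agent(agent, batch, gamma=0.95, tau=0.01):
    state, actions, reward_i, next_state = batch               # sampled experience storage information
    with torch.no_grad():
        next_actions = agent.target_actions(next_state)        # hypothetical helper (target Actors)
        y = reward_i + gamma * agent.target_critic(next_state, next_actions)  # differential value y

    q = agent.critic(state, actions)                           # training-network action value
    loss = F.mse_loss(q, y)                                    # L(theta_i) = E[(y - Q)^2]

    agent.critic_optim.zero_grad()
    loss.backward()                                            # gradient descent on the training network
    agent.critic_optim.step()

    # Time-sequence-difference (soft) update of the target-network parameters.
    for p_target, p_train in zip(agent.target_critic.parameters(), agent.critic.parameters()):
        p_target.data.copy_(tau * p_train.data + (1 - tau) * p_target.data)
```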
One round of training of the multi-agent deep reinforcement learning model is thus accomplished by updating the experience storage information on the basis of the basic experience storage information according to the results of the actions selected under the greedy strategy, and then updating the parameters of the training networks and the parameters of the target networks of the multiple agents.
Step S250, repeating the training process of the multi-agent deep reinforcement learning model on the basis of the updated experience storage information until the training of the multi-agent deep reinforcement learning model with the preset first threshold number of times is completed, and recording that the multi-agent deep reinforcement learning model completes the training of a time node.
The preset first threshold number of times is the number of times that the training of the multi-agent deep reinforcement learning model is required to complete the training of a time node, and is empirically set, and may be a relatively large value, for example, the preset first threshold number of times is 200 times. The training of the multi-agent deep reinforcement learning model is completed on the basis of updated experience storage information in the process of repeating the training of the multi-agent deep reinforcement learning model.
Step S260, repeating the training process of the time node of the multi-agent deep reinforcement learning model on the basis of the basic experience storage information until the multi-agent deep reinforcement learning model finishes the training of the time node for a preset second threshold number of times, and taking the multi-agent deep reinforcement learning model trained last as the model to be evaluated.
The preset second threshold number is the training number of the time node, which is required to be completed by the multi-agent deep reinforcement learning model to obtain the condition of the model to be evaluated, and is set according to experience, and may be a relatively large value, for example, the preset second threshold number is 100 times. In the process of repeating the training of the time node, the training of the time node of the multi-agent deep reinforcement learning model is completed on the basis of the basic experience storage information.
The purpose of obtaining the model to be evaluated is to further calculate the reward value obtained by the interaction between the model to be evaluated corresponding to the current action radius and the environment, so that the action radius can be adjusted.
Step S270, taking the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for a preset first threshold number of times as the reward value of one time node, and calculating the average of the reward values of a preset third threshold number of time nodes; the action radius is updated based on the average of the reward values.
In this step, taking the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for a preset first threshold number of times as the reward value of one time node, calculating the average of the reward values of a preset third threshold number of time nodes, and updating the action radius according to the average of the reward values may proceed as follows: calculate the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for a preset first threshold number of times to obtain the reward value of one time node; further calculate the average of the reward values of a preset third threshold number of time nodes; judge whether the average of the reward values is greater than or equal to the reward value threshold; and, when the average of the reward values is greater than or equal to the reward value threshold, update the action radius to the column next to the current column of the one-dimensional matrix of the difficulty measure. Calculating the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for the preset first threshold number of times is, in effect, calculating the sum of the reward values obtained by the actions selected by each agent of the model to be evaluated in the environment of the model to be evaluated. Specifically, the reward value r_i obtained by the action selected by agent i is determined by the following components:
r_i is composed of a distance reward r_i^distance, a trapping reward r_i^capture, a help reward r_i^help, and a collision reward r_i^collision, where i is the number of the chaser; the distance reward r_i^distance is computed from the coordinates of the chaser and the coordinates of the evader; the trapping reward r_i^capture is determined by the distance L_i^distance from the chaser to the evader and the action radius d; the help reward r_i^help is determined by the distance L_j^distance from chaser j to the evader; and the collision reward r_i^collision is determined by the coordinates of the obstacle and the obstacle radius r_o.
The average of the reward values of the preset third threshold number of time nodes may be the sum of the reward values calculated for each of those time nodes divided by the preset third threshold number. The preset third threshold number is the number of time-node reward values used to evaluate the model to be evaluated; it may be preset according to the specific situation, and, to improve efficiency, may be set to a relatively small value, for example 6.
When the average of the reward values is greater than or equal to the reward value threshold, the action radius is updated to the column next to the current column of the one-dimensional matrix of the difficulty measure. For example, when the current action radius is d_1 and the calculated average of the reward values is greater than or equal to the reward value threshold, the action radius is updated to d_2, the column next to the current column d_1 of the one-dimensional matrix of the difficulty measure. The reward value threshold may be empirically preset. If the average of the reward values is smaller than the reward value threshold, the action radius is not updated, and the multi-agent deep reinforcement learning model continues to be trained with the current action radius until the next model to be evaluated is obtained. If the average of the reward values is smaller than the reward value threshold but the current action radius is already the last column d_n of the one-dimensional matrix of the difficulty measure, the column next to the current column d_n defaults to d_n itself, and the multi-agent deep reinforcement learning model is trained with the action radius d_n.
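The evaluation and radius-update rule can be summarised in a short sketch; the reward value threshold below is a placeholder, and the helper assumes that the reward value of each evaluated time node has already been computed.

```python
# Illustrative update of the action radius based on the average reward value
# (the reward threshold is a placeholder, not a value from the application).
REWARD_THRESHOLD = 0.0   # placeholder reward value threshold
THIRD_THRESHOLD = 6      # number of time-node reward values averaged (example from the text)

def evaluate_and_update(node_rewards, difficulty_measure, current_column):
    """node_rewards: reward value (sum over one time node) of each evaluated time node."""
    recent = node_rewards[-THIRD_THRESHOLD:]
    avg_reward = sum(recent) / len(recent)
    if avg_reward >= REWARD_THRESHOLD:
        # Move to the next column; past the last column the radius stays at d_n.
        current_column = min(current_column + 1, len(difficulty_measure) - 1)
    return current_column, difficulty_measure[current_column]
```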
Step S280, based on the updated action radius, obtaining a model to be evaluated corresponding to the updated action radius, updating the action radius by using a reward value obtained by interaction between the model to be evaluated corresponding to the updated action radius and the environment, and repeating the step until the training times of the multi-agent deep reinforcement learning model reach a preset fourth threshold time, and taking the multi-agent deep reinforcement learning model trained last time as a complete multi-agent deep reinforcement learning model.
Obtaining the model to be evaluated corresponding to the updated action radius based on the updated action radius, and updating the action radius by using the reward value obtained by the interaction between that model and the environment, is in effect achieved by repeating the processes from step S220 to step S270. The preset fourth threshold number may be the number of training rounds of the multi-agent deep reinforcement learning model set according to an empirical value; when the training of the multi-agent deep reinforcement learning model reaches the preset fourth threshold number of times, the model is by default regarded as having converged. In order to ensure the performance of the complete multi-agent deep reinforcement learning model, the preset fourth threshold number may be set to a larger value, and the preset fourth threshold number is greater than the product of the preset first threshold number and the preset second threshold number. For example, the fourth threshold number of times may be preset to 100000 times.
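Putting the preceding steps together, the overall course-learning training loop can be sketched as follows. The threshold values mirror the example values given in the text, and make_environment, sample_batch and rollout_reward are hypothetical helpers; agents, difficulty_measure, collect_step, train_agent, THIRD_THRESHOLD and evaluate_and_update refer to the earlier illustrative sketches rather than to the application's actual code.

```python
# Illustrative outer loop combining the sketches above (all values are example values).
FIRST_THRESHOLD = 200        # training rounds per time node
SECOND_THRESHOLD = 100       # time nodes per model to be evaluated
FOURTH_THRESHOLD = 100_000   # total training rounds for the complete model

total_rounds, current_column = 0, 0
action_radius = difficulty_measure[current_column]

while total_rounds < FOURTH_THRESHOLD:
    env = make_environment(action_radius)             # hypothetical environment factory
    observations = env.reset()
    for _node in range(SECOND_THRESHOLD):             # training of SECOND_THRESHOLD time nodes
        for _ in range(FIRST_THRESHOLD):               # one time node = FIRST_THRESHOLD rounds
            observations, _done = collect_step(env, agents, observations)
            batch = sample_batch(replay_buffer)        # hypothetical mini-batch sampler
            for agent in agents:
                train_agent(agent, batch)
            total_rounds += 1
    # The last trained model is the model to be evaluated; adjust the action radius.
    node_rewards = [rollout_reward(env, agents, FIRST_THRESHOLD)  # hypothetical evaluation rollout
                    for _ in range(THIRD_THRESHOLD)]
    current_column, action_radius = evaluate_and_update(node_rewards, difficulty_measure, current_column)
```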
Step S210 to step S280 are described above, in which the action radius is set based on the difficulty of the element corresponding to the course task, and then the environment of the initial multi-agent deep reinforcement learning model is determined according to the action radius. And updating experience storage information in the environment of the initial multi-agent deep reinforcement learning model, training the multi-agent deep reinforcement learning model based on the updated experience storage information, repeating the steps until the multi-agent deep reinforcement learning model completes training of one time node, repeating the training process of the time node of the multi-agent deep reinforcement learning model until training of the time node of a preset second threshold number is completed, and obtaining the model to be evaluated. And updating the action radius according to the rewarding value obtained by interaction of the model to be evaluated and the environment. Based on the updated action radius, obtaining a model to be evaluated corresponding to the updated action radius, updating the action radius by using a reward value obtained by interaction between the model to be evaluated corresponding to the updated action radius and the environment, repeating the steps until a trained multi-agent deep reinforcement learning model with a preset fourth threshold number of times is obtained, and taking the multi-agent deep reinforcement learning model after the last training as a complete multi-agent deep reinforcement learning model. According to the method, the action radius related to the element difficulty of the course task is set, and the adjustment mechanism of the action radius is set according to the rewarding value of the training result, so that a complex task can be split into a plurality of tasks and is gradually executed from simple to complex, the complexity of executing the task is reduced, the calculated amount in the course of solving the course task is reduced, the resources of a computer are further reduced, and the problem that the efficiency of processing the course task by the computer is low due to the fact that the resources of the computer are occupied is solved.
In one embodiment, in the context of an initial multi-agent deep reinforcement learning model, updating the empirically stored information and, based on the updated empirically stored information, before calculating the loss of the training network for each agent, comprising the steps of:
parameters of a training network and parameters of a target network of the initial multi-agent deep reinforcement learning model are initialized, and empirical storage information corresponding to an environment of the initial multi-agent deep reinforcement learning model is initialized.
The present embodiment is described and illustrated below by way of preferred embodiments.
FIG. 3 is a flow chart of a multi-agent deep reinforcement learning method based on course learning according to a preferred embodiment of the present application. As shown in fig. 3, the multi-agent deep reinforcement learning method based on course learning includes the following steps:
step S310, setting an action radius inversely proportional to the difficulty of the element based on the difficulty of the element corresponding to the course task, and generating a difficulty measure according to the action radius;
step S320, determining the environment of the initial multi-agent deep reinforcement learning model according to the action radius corresponding to the first column of the one-dimensional matrix of the difficulty measure;
step S330, updating experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain basic experience storage information which can be used for model training;
step S340, based on the basic experience storage information, updating the experience storage information according to the result of the actions selected under the greedy strategy; calculating the loss of the training network of each agent based on the updated experience storage information; based on the loss of the training network of each agent, updating the training networks of the multiple agents by adopting a gradient descent method and updating the target networks of the multiple agents by adopting a time sequence difference method, so as to complete one round of training of the multi-agent deep reinforcement learning model;
step S350, repeating the training process of the multi-agent deep reinforcement learning model on the basis of the updated experience storage information until the training of the multi-agent deep reinforcement learning model with the preset first threshold number of times is completed, and recording that the multi-agent deep reinforcement learning model completes the training of a time node;
step S360, repeating the training process of the time node of the multi-agent deep reinforcement learning model on the basis of the basic experience storage information until the multi-agent deep reinforcement learning model finishes the training of the time node for the preset second threshold number of times, and taking the multi-agent deep reinforcement learning model trained last as the model to be evaluated;
step S370, taking the sum of the reward values obtained by the interaction of the model to be evaluated with the environment for a preset first threshold number of times as the reward value of one time node, and calculating the average of the reward values of a preset third threshold number of time nodes; updating the action radius according to the average of the reward values;
step S380, based on the updated action radius, obtaining a model to be evaluated corresponding to the updated action radius, updating the action radius by using a reward value obtained by interaction between the model to be evaluated corresponding to the updated action radius and the environment, and repeating the step until the training times of the multi-agent deep reinforcement learning model reach a preset fourth threshold time, and taking the multi-agent deep reinforcement learning model trained last time as a complete multi-agent deep reinforcement learning model.
Step S310 to step S380 above, firstly, set an action radius based on the difficulty of the element corresponding to the course task, and then determine the environment of the initial multi-agent deep reinforcement learning model according to the action radius. And updating experience storage information in the environment of the initial multi-agent deep reinforcement learning model, training the multi-agent deep reinforcement learning model based on the updated experience storage information, repeating the steps until the multi-agent deep reinforcement learning model completes training of one time node, repeating the training process of the time node of the multi-agent deep reinforcement learning model until training of the time node of a preset second threshold number is completed, and obtaining the model to be evaluated. And updating the action radius according to the rewarding value obtained by interaction of the model to be evaluated and the environment. Based on the updated action radius, obtaining a model to be evaluated corresponding to the updated action radius, updating the action radius by using a reward value obtained by interaction between the model to be evaluated corresponding to the updated action radius and the environment, repeating the steps until a trained multi-agent deep reinforcement learning model with a preset fourth threshold number of times is obtained, and taking the multi-agent deep reinforcement learning model after the last training as a complete multi-agent deep reinforcement learning model. According to the method, the action radius related to the element difficulty of the course task is set, and the adjustment mechanism of the action radius is set according to the rewarding value of the training result, so that a complex task can be split into a plurality of tasks and is gradually executed from simple to complex, the complexity of executing the task is reduced, the calculated amount in the course of solving the course task is reduced, the resources of a computer are further reduced, and the problem that the efficiency of processing the course task by the computer is low due to the fact that the resources of the computer are occupied is solved.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, in this embodiment, a multi-agent deep reinforcement learning device based on course learning is further provided, and the device is used to implement the foregoing embodiments and preferred embodiments, and is not described in detail. The terms "module," "unit," "sub-unit," and the like as used below may refer to a combination of software and/or hardware that performs a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementations in hardware, or a combination of software and hardware, are also possible and contemplated.
In one embodiment, fig. 4 is a block diagram of a multi-agent deep reinforcement learning device based on course learning according to an embodiment of the present application, as shown in fig. 4, the multi-agent deep reinforcement learning device based on course learning includes:
a generating module 41, configured to set an action radius inversely proportional to the difficulty of the element based on the difficulty of the element corresponding to the course task, and generate a difficulty measure according to the action radius;
a determining module 42, configured to determine an environment of the initial multi-agent deep reinforcement learning model according to an action radius corresponding to a first column of the one-dimensional matrix of the difficulty measure;
a first updating module 43, configured to update experience storage information in an environment of an initial multi-agent deep reinforcement learning model, to obtain basic experience storage information that can be used for model training;
a first training module 44, configured to update the experience storage information, on the basis of the basic experience storage information, according to the result of the actions selected under the greedy strategy; calculate the loss of the training network of each agent based on the updated experience storage information; and, based on the loss of the training network of each agent, update the parameters of the training networks of the multiple agents by adopting a gradient descent method and update the parameters of the target networks of the multiple agents by adopting a time sequence difference method, so as to complete one round of training of the multi-agent deep reinforcement learning model;
The second training module 45 is configured to repeat the training process of the multi-agent deep reinforcement learning model based on the updated experience storage information until the training of the multi-agent deep reinforcement learning model with the preset first threshold number of times is completed, and record that the multi-agent deep reinforcement learning model completes the training of a time node;
the third training module 46 is configured to repeat the training process of the time node of the multi-agent deep reinforcement learning model based on the basic experience storage information until the multi-agent deep reinforcement learning model completes training of the time node with a preset second threshold number of times, and take the multi-intelligent deep reinforcement learning model after the last training as the model to be evaluated;
a second updating module 47, configured to take the sum of the reward values obtained from a preset first threshold number of interactions between the model to be evaluated and the environment as the reward value of one time node, to calculate the average of the reward values over a preset third threshold number of time nodes, and to update the action radius according to the average of the reward values;
and a model obtaining module 48, configured to obtain, based on the updated action radius, a model to be evaluated corresponding to the updated action radius, to further update the action radius using the reward values obtained from interaction between that model and the environment, and to repeat these steps until a multi-agent deep reinforcement learning model trained a preset fourth threshold number of times is obtained, the multi-agent deep reinforcement learning model after the last training being taken as the complete multi-agent deep reinforcement learning model.
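To make the curriculum set-up of the generating module 41 and the determining module 42 concrete, the following minimal Python sketch shows one way a difficulty measurer of this kind might be built: one action radius per course-task element, inversely proportional to that element's difficulty, arranged from large to small in a one-dimensional matrix whose first column fixes the initial environment. All names, the scale factor, and the exact form of the inverse proportionality are illustrative assumptions rather than the patented implementation.

import numpy as np

def build_difficulty_measurer(element_difficulties, scale=1.0):
    # One action radius per course-task element, inversely proportional to the
    # element's difficulty (assumed form: radius = scale / difficulty), arranged
    # from large (easy) to small (hard) as a one-dimensional matrix.
    difficulties = np.asarray(element_difficulties, dtype=float)
    radii = scale / difficulties
    return np.sort(radii)[::-1].reshape(1, -1)

def initial_action_radius(measurer):
    # The determining module fixes the initial environment from the first column.
    return float(measurer[0, 0])

# Example with three elements of increasing difficulty.
measurer = build_difficulty_measurer([1.0, 2.0, 4.0])
print(measurer)                          # [[1.   0.5  0.25]]
print(initial_action_radius(measurer))   # 1.0

Because the radii are sorted from large to small, training always starts in the easiest environment and only moves to a smaller radius when the adjustment mechanism described below allows it.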
According to the above multi-agent deep reinforcement learning device based on course learning, the action radius is set based on the difficulty of the element corresponding to the course task, and the environment of the initial multi-agent deep reinforcement learning model is then determined according to the action radius. Experience storage information is updated in the environment of the initial multi-agent deep reinforcement learning model, the multi-agent deep reinforcement learning model is trained on the basis of the updated experience storage information, and this is repeated until the model completes the training of one time node; the training process of a time node is then repeated until a preset second threshold number of time nodes has been trained, which yields the model to be evaluated. The action radius is updated according to the reward values obtained from interaction between the model to be evaluated and the environment. Based on the updated action radius, a model to be evaluated corresponding to the updated action radius is obtained, the action radius is updated again using the reward values obtained from interaction between that model and the environment, and these steps are repeated until a multi-agent deep reinforcement learning model trained a preset fourth threshold number of times is obtained, the model after the last training being taken as the complete multi-agent deep reinforcement learning model. By setting an action radius related to the element difficulty of the course task and an adjustment mechanism for the action radius driven by the reward values of the training results, a complex task can be split into several tasks that are executed progressively from simple to complex. This reduces the complexity of executing the task and the amount of computation involved in solving the course task, and therefore the computer resources consumed, which addresses the problem of low efficiency in processing course tasks caused by excessive occupation of computer resources.
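The adjustment mechanism summarized above — averaging the reward values of several time nodes and moving to the next column of the difficulty measurer when the average is high enough — can be illustrated by the short Python sketch below. The function name, the threshold, and the example reward values are assumptions made purely for illustration.

import numpy as np

def update_action_radius(node_rewards, measurer, column, reward_threshold):
    # node_rewards: reward values of the last few time nodes, where each entry is
    # already the sum of the rewards collected over one node's interactions.
    # If the average clears the threshold, advance to the next (harder) column.
    average_reward = float(np.mean(node_rewards))
    if average_reward >= reward_threshold and column + 1 < measurer.shape[1]:
        column += 1
    return column, float(measurer[0, column])

# Example, reusing the illustrative measurer of the previous sketch.
measurer = np.array([[1.0, 0.5, 0.25]])
column, radius = update_action_radius([12.0, 15.0, 11.0], measurer,
                                       column=0, reward_threshold=10.0)
print(column, radius)   # 1 0.5 -> training continues in the harder environment

In the full device, a rule of this kind would be invoked once per evaluation round by the second updating module and the model obtaining module until the preset fourth threshold number of trained models has been produced.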
The above-described modules may be functional modules or program modules, and may be implemented by software or by hardware. For modules implemented in hardware, the modules may all be located in the same processor, or may be distributed over different processors in any combination.
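For modules implemented in software, the core computations of the first updating module (filling the experience store through greedy action selection) and the first training module (the per-agent temporal-difference update of the training and target networks) might look roughly like the following PyTorch sketch. The network architecture, the ε-greedy form of the greedy strategy, the soft target update, and all hyper-parameters are assumptions for illustration only; the patent does not prescribe them.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    # Illustrative per-agent Q-network; the patent does not fix an architecture.
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        return self.net(obs)

def epsilon_greedy_action(q_net, obs, n_actions, epsilon=0.1):
    # Assumed form of the greedy strategy used while filling the experience store:
    # explore a random action with probability epsilon, otherwise exploit.
    if torch.rand(1).item() < epsilon:
        return int(torch.randint(0, n_actions, (1,)).item())
    with torch.no_grad():
        return int(q_net(obs.unsqueeze(0)).argmax(dim=1).item())

def train_step(train_net, target_net, optimizer, batch, gamma=0.99, tau=0.01):
    # One per-agent update: temporal-difference target from the target network,
    # loss of the training network, gradient descent, then a soft target update.
    obs, actions, rewards, next_obs, dones = batch
    q = train_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        td_target = rewards + gamma * (1.0 - dones) * target_net(next_obs).max(dim=1).values
    loss = F.mse_loss(q, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), train_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
    return loss.item()

# Toy usage for a single agent, with random data standing in for sampled experience.
obs_dim, n_actions, batch_size = 8, 5, 32
train_net, target_net = QNet(obs_dim, n_actions), QNet(obs_dim, n_actions)
target_net.load_state_dict(train_net.state_dict())
optimizer = torch.optim.Adam(train_net.parameters(), lr=1e-3)
print(epsilon_greedy_action(train_net, torch.randn(obs_dim), n_actions))
batch = (torch.randn(batch_size, obs_dim),
         torch.randint(0, n_actions, (batch_size,)),
         torch.randn(batch_size),
         torch.randn(batch_size, obs_dim),
         torch.zeros(batch_size))
print(train_step(train_net, target_net, optimizer, batch))

A periodic hard copy of the training-network parameters into the target network would serve the same purpose as the soft update shown here; the patent text does not state which variant is used.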
In one embodiment, a computer device is provided, which includes a memory storing a computer program and a processor that, when executing the computer program, implements any one of the course-learning-based multi-agent deep reinforcement learning methods of the above embodiments.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements any one of the course-learning-based multi-agent deep reinforcement learning methods of the above embodiments.
The user information (including but not limited to user equipment information, user personal information, and the like) and the data (including but not limited to data for analysis, stored data, displayed data, and the like) involved in the present application are information and data authorized by the user or fully authorized by all parties concerned.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, and the computer program, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM is available in many forms, such as static random access memory (SRAM) and dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processor referred to in the embodiments provided in the present application may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, or a data processing logic unit based on quantum computing.
The technical features of the above embodiments may be combined in any manner. For brevity of description, not all possible combinations of these technical features have been described; nevertheless, as long as a combination of these technical features involves no contradiction, it should be regarded as falling within the scope of this description.
The foregoing examples merely illustrate several embodiments of the application and are described in comparatively specific detail, but they should not therefore be construed as limiting the scope of the application. It should be noted that a person skilled in the art could make several variations and improvements without departing from the concept of the application, all of which fall within the scope of protection of the application. Accordingly, the scope of protection of the application shall be subject to the appended claims.

Claims (10)

1. A multi-agent deep reinforcement learning method based on course learning, the method comprising:
setting, based on the difficulty of an element corresponding to the course task, an action radius inversely proportional to the difficulty of the element, and generating a difficulty measurer according to the action radius;
determining the environment of an initial multi-agent deep reinforcement learning model according to the action radius corresponding to the first column of the one-dimensional matrix of the difficulty measurer;
updating experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain basic experience storage information which can be used for model training;
on the basis of the basic experience storage information, updating the experience storage information according to the result of actions selected under a greedy strategy; calculating the loss of the training network of each agent based on the updated experience storage information; and, based on the loss of the training network of each agent, updating the parameters of the training networks of the multiple agents by a gradient descent method and the parameters of the target networks of the multiple agents by a temporal-difference method, so as to complete the training of a multi-agent deep reinforcement learning model;
repeating the training process of the multi-agent deep reinforcement learning model on the basis of the updated experience storage information until the multi-agent deep reinforcement learning model has been trained a preset first threshold number of times, and recording that the multi-agent deep reinforcement learning model has completed the training of one time node;
repeating the training process of a time node of the multi-agent deep reinforcement learning model on the basis of the basic experience storage information until the multi-agent deep reinforcement learning model has completed the training of a preset second threshold number of time nodes, and taking the multi-agent deep reinforcement learning model after the last training as a model to be evaluated;
taking the sum of the reward values obtained from a preset first threshold number of interactions between the model to be evaluated and the environment as the reward value of one time node, and calculating the average of the reward values over a preset third threshold number of time nodes; updating the action radius according to the average of the reward values;
based on the updated action radius, obtaining a model to be evaluated corresponding to the updated action radius, updating the action radius using the reward values obtained from interaction between the model to be evaluated corresponding to the updated action radius and the environment, and repeating these steps until the number of training times of the multi-agent deep reinforcement learning model reaches a preset fourth threshold number, and taking the multi-agent deep reinforcement learning model after the last training as a complete multi-agent deep reinforcement learning model.
2. The course learning-based multi-agent deep reinforcement learning method of claim 1, wherein the setting, based on the difficulty of the element corresponding to the course task, of an action radius inversely proportional to the difficulty of the element, and the generating of the difficulty measurer according to the action radius, comprise:
determining a course object corresponding to the course task based on the course task;
determining, based on the course object, an element associated with the course object;
setting, based on the difficulty of the element, the action radius inversely proportional to the difficulty of the element, and arranging the action radii from large to small;
and generating a one-dimensional matrix from the action radii according to a preset arrangement order to obtain the difficulty measurer.
3. The course learning-based multi-agent deep reinforcement learning method of claim 1, wherein the updating of experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain basic experience storage information usable for model training is preceded by:
initializing parameters of a training network and parameters of a target network of the initial multi-agent deep reinforcement learning model, and initializing experience storage information corresponding to the environment of the initial multi-agent deep reinforcement learning model.
4. The course learning-based multi-agent deep reinforcement learning method of claim 1, wherein updating the experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain the basic experience storage information usable for model training comprises:
updating the experience storage information according to the result of actions selected by the multiple agents under a greedy strategy in the environment of the initial multi-agent deep reinforcement learning model;
and obtaining the basic experience storage information which can be used for model training when the number of times the experience storage information has been updated reaches a preset update threshold number.
5. The course learning-based multi-agent deep reinforcement learning method of claim 4, wherein the updating of the experience storage information according to the result of actions selected by the multiple agents under a greedy strategy in the environment of the initial multi-agent deep reinforcement learning model comprises:
controlling the multiple agents to select actions according to a greedy strategy in the environment of the initial multi-agent deep reinforcement learning model;
and storing the information generated by the agents' action selection in an experience storage area, and updating the experience storage information.
6. The course learning-based multi-agent deep reinforcement learning method of claim 5, wherein the storing of the information generated by the agents' action selection in an experience storage area and the updating of the experience storage information comprise:
determining the state information of the next moment obtained from interaction between the multiple agents and the environment of the initial multi-agent deep reinforcement learning model;
calculating the reward value obtained when the multiple agents select actions according to a greedy strategy in the environment of the initial multi-agent deep reinforcement learning model;
and storing the action information selected by the multiple agents, the reward value obtained by the selected actions, and the obtained state information of the next moment in the experience storage area, so as to complete the update of the experience storage information.
7. The course learning-based multi-agent deep reinforcement learning method of claim 1, wherein the calculating of the loss of the training network of each agent based on the updated experience storage information comprises:
calculating a temporal-difference value of the target network of each agent based on the updated experience storage information;
and calculating the loss of the training network of each agent according to the temporal-difference value of the target network of that agent.
8. The course learning-based multi-agent deep reinforcement learning method of claim 1, wherein the taking of the sum of the reward values obtained from a preset first threshold number of interactions between the model to be evaluated and the environment as the reward value of one time node, the calculating of the average of the reward values over a preset third threshold number of time nodes, and the updating of the action radius according to the average of the reward values comprise:
calculating the sum of the reward values obtained from a preset first threshold number of interactions between the model to be evaluated and the environment to obtain the reward value of one time node;
calculating the average of the reward values over a preset third threshold number of time nodes;
judging whether the average of the reward values is greater than or equal to a reward-value threshold;
and updating the action radius to the action radius of the column following the current column of the one-dimensional matrix of the difficulty measurer when the average of the reward values is greater than or equal to the reward-value threshold.
9. A multi-agent deep reinforcement learning device based on course learning, the device comprising:
the generating module is used for setting, based on the difficulty of the element corresponding to the course task, an action radius inversely proportional to the difficulty of the element, and generating a difficulty measurer according to the action radius;
the determining module is used for determining the environment of the initial multi-agent deep reinforcement learning model according to the action radius corresponding to the first column of the one-dimensional matrix of the difficulty measurer;
the first updating module is used for updating experience storage information in the environment of the initial multi-agent deep reinforcement learning model to obtain basic experience storage information which can be used for model training;
the first training module is used for updating, on the basis of the basic experience storage information, the experience storage information according to the result of actions selected under a greedy strategy; calculating the loss of the training network of each agent based on the updated experience storage information; and, based on the loss of the training network of each agent, updating the parameters of the training networks of the multiple agents by a gradient descent method and the parameters of the target networks of the multiple agents by a temporal-difference method, so as to complete the training of a multi-agent deep reinforcement learning model;
the second training module is used for repeating the training process of the multi-agent deep reinforcement learning model on the basis of the updated experience storage information until the multi-agent deep reinforcement learning model has been trained a preset first threshold number of times, and recording that the multi-agent deep reinforcement learning model has completed the training of one time node;
the third training module is used for repeating the training process of a time node of the multi-agent deep reinforcement learning model on the basis of the basic experience storage information until the multi-agent deep reinforcement learning model has completed the training of a preset second threshold number of time nodes, and taking the multi-agent deep reinforcement learning model after the last training as a model to be evaluated;
the second updating module is used for taking the sum of the reward values obtained from a preset first threshold number of interactions between the model to be evaluated and the environment as the reward value of one time node, and calculating the average of the reward values over a preset third threshold number of time nodes; updating the action radius according to the average of the reward values;
and the model obtaining module is used for obtaining, based on the updated action radius, a model to be evaluated corresponding to the updated action radius, updating the action radius using the reward values obtained from interaction between the model to be evaluated corresponding to the updated action radius and the environment, and repeating these steps until the number of training times of the multi-agent deep reinforcement learning model reaches a preset fourth threshold number, and taking the multi-agent deep reinforcement learning model after the last training as a complete multi-agent deep reinforcement learning model.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
CN202311029693.2A 2023-08-16 2023-08-16 Multi-agent deep reinforcement learning method and device based on course learning Active CN116739077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311029693.2A CN116739077B (en) 2023-08-16 2023-08-16 Multi-agent deep reinforcement learning method and device based on course learning

Publications (2)

Publication Number Publication Date
CN116739077A true CN116739077A (en) 2023-09-12
CN116739077B CN116739077B (en) 2023-10-31

Family

ID=87903053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311029693.2A Active CN116739077B (en) 2023-08-16 2023-08-16 Multi-agent deep reinforcement learning method and device based on course learning

Country Status (1)

Country Link
CN (1) CN116739077B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160225278A1 (en) * 2015-01-31 2016-08-04 Usa Life Nutrition Llc Method and apparatus for incentivization of learning
US20210357800A1 (en) * 2020-05-13 2021-11-18 Seagate Technology Llc Distributed decentralized machine learning model training
WO2022017596A1 (en) * 2020-07-22 2022-01-27 Telefonaktiebolaget Lm Ericsson (Publ) Method and computer system determining a representation of a parameter
US20220075383A1 (en) * 2020-09-10 2022-03-10 Kabushiki Kaisha Toshiba Task performing agent systems and methods
CN113449458A (en) * 2021-07-15 2021-09-28 海南大学 Multi-agent depth certainty strategy gradient method based on course learning
CN116127848A (en) * 2023-02-27 2023-05-16 东南大学 Multi-unmanned aerial vehicle collaborative tracking method based on deep reinforcement learning
CN116225016A (en) * 2023-03-06 2023-06-06 东北大学 Multi-agent path planning method based on distributed collaborative depth reinforcement learning model

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LIANG ZHANG et al.: "A deep reinforce learning-based intrusion detection method for safeguarding Internet of Things", International Conference on Computer Network Security and Software Engineering, pages 127140 *
TONY DIANA: "Using sentiment analysis to reinforce learning: The case of airport community engagement", Journal of Air Transport Management, pages 1-8 *
LIU JINGSHU et al.: "Multi-UAV cooperative reconnaissance mission planning based on clustering and reinforcement learning" (in Chinese), Journal of China Academy of Electronics and Information Technology, vol. 18, no. 1, pages 21-25 *
ZHOU DONGXU: "Research on motion planning methods for robotic arms based on deep reinforcement learning" (in Chinese), China Doctoral Dissertations Full-text Database, Information Science and Technology, pages 140-4 *
CHEN PING et al.: "Research on mimic defense design of Internet of Vehicles systems based on reinforcement learning" (in Chinese), Journal of Information Security Research, vol. 8, no. 6, pages 545-553 *

Also Published As

Publication number Publication date
CN116739077B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
US20230252327A1 (en) Neural architecture search for convolutional neural networks
CN109635917B (en) Multi-agent cooperation decision and training method
CN109948029B (en) Neural network self-adaptive depth Hash image searching method
CN110674869B (en) Classification processing and graph convolution neural network model training method and device
US11388424B2 (en) Making object-level predictions of the future state of a physical system
CN111406264A (en) Neural architecture search
CN112513886B (en) Information processing method, information processing apparatus, and information processing program
CN111445020A (en) Graph-based convolutional network training method, device and system
CN112269382B (en) Robot multi-target path planning method
CN112947591A (en) Path planning method, device, medium and unmanned aerial vehicle based on improved ant colony algorithm
CN110009048B (en) Method and equipment for constructing neural network model
CN116739077B (en) Multi-agent deep reinforcement learning method and device based on course learning
CN111046955B (en) Multi-agent confrontation strategy intelligent prediction method and device based on graph network
CN110610231A (en) Information processing method, electronic equipment and storage medium
CN113609785B (en) Federal learning super-parameter selection system and method based on Bayesian optimization
CN111008705A (en) Searching method, device and equipment
KR101947780B1 (en) Method and system for downsizing neural network
JP7398625B2 (en) Machine learning devices, information processing methods and programs
CN115238134A (en) Method and apparatus for generating a graph vector representation of a graph data structure
CN110705437A (en) Face key point detection method and system based on dynamic cascade regression
CN113963551B (en) Vehicle positioning method, system, device and medium based on cooperative positioning
US20240078427A1 (en) Collaborative machine learning whose result is stored in a shared memory controlled by a central device
CN116030079A (en) Geofence partitioning method, device, computer equipment and storage medium
KR20220013231A (en) Electronic device and method for inferring objects within a video
CN116599845A (en) Safety communication and resource allocation method and device for power grid information physical system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant