CN113110101B - Production line mobile robot gathering type recovery and warehousing simulation method and system - Google Patents

Production line mobile robot gathering type recovery and warehousing simulation method and system

Info

Publication number
CN113110101B
CN113110101B (application number CN202110423843.2A)
Authority
CN
China
Prior art keywords
agent
agents
mobile robot
warehousing
intelligent
Prior art date
Legal status
Active
Application number
CN202110423843.2A
Other languages
Chinese (zh)
Other versions
CN113110101A (en)
Inventor
张涵
程金
王琪琪
王中华
Current Assignee
University of Jinan
Original Assignee
University of Jinan
Priority date
Filing date
Publication date
Application filed by University of Jinan filed Critical University of Jinan
Priority to CN202110423843.2A priority Critical patent/CN113110101B/en
Publication of CN113110101A publication Critical patent/CN113110101A/en
Application granted granted Critical
Publication of CN113110101B publication Critical patent/CN113110101B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B17/00Systems involving the use of models or simulators of said systems
    • G05B17/02Systems involving the use of models or simulators of said systems electric

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

In this scheme, an improved artificial potential energy function mechanism is added to the deep deterministic policy gradient algorithm to design the reward function mechanism of the agents in the algorithm, so that the agents can learn high-reward clustering actions through the reward mechanism based on the improved artificial potential energy function, thereby realizing the clustering effect of multiple agents. In addition, local communication information of a specific agent is added to the critic neural network module of the deep deterministic policy gradient algorithm, so that the agent can better judge the surrounding environment and can learn a better clustering strategy to realize the gathering-type recovery and warehousing movement.

Description

Production line mobile robot gathering type recovery and warehousing simulation method and system
Technical Field
The disclosure belongs to the technical field of motion control of intelligent mobile robots, and particularly relates to a method and a system for simulating gathering type recycling and warehousing of mobile robots in a production line.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
At the present stage, with the rapid development of artificial intelligence technology, reinforcement learning algorithms are used to solve many complex problems in real life. A single-agent system has difficulty solving such problems and is limited in speed and reliability, whereas the cooperation of multiple agents can accomplish higher-level tasks. After multiple agents complete their tasks on a production line, gathering-type recovery and warehousing of the production line mobile robots needs to be realized: the mobile agents move in a gathered cluster, which better maintains the cluster formation, and the gathering-type recovery and warehousing of the mobile agents is achieved efficiently through mutual cooperation.
The inventor finds that, for the problem of gathering-type recovery and warehousing of mobile robots, most existing control methods adopt a reinforcement learning control algorithm to let the agent learn a motion control strategy, but so far there is no simple and stable multi-agent cluster control algorithm that realizes cluster-type movement toward a static target. Meanwhile, existing methods based on deep reinforcement learning require previously explored samples for the agents to learn from; the agents cannot draw experience from the environment in which they are located by exploring unknown environments themselves, so the control effect depends heavily on the richness of the training samples, and the agents cannot effectively cope with the diversity and changes of the environment.
Disclosure of Invention
In order to solve the above problems, the present disclosure provides a production line mobile robot gathering type recovery and warehousing simulation method and system, in which an improved artificial potential energy function mechanism is added to the deep deterministic policy gradient algorithm to realize the clustering effect of multiple agents, and specific local communication information is added to the critic network to improve the stability with which multiple mobile agents maintain the cluster and to speed up the convergence of agent training.
According to a first aspect of the embodiments of the present disclosure, there is provided a production line mobile robot gathering type recycling and warehousing simulation method, including:
establishing a recovery warehousing kinematic model for the mobile robot based on scene information and mobile robot parameter information;
selecting, for each mobile robot, a storage position in the warehouse as a target, generating an optimal behavior strategy for each mobile robot by using a pre-trained improved deep deterministic policy gradient model, and realizing the recovery of the mobile robots through control of force and velocity;
wherein the improved deep deterministic policy gradient model comprises an actor network and a critic network; the rewards among agents are calculated through a reward function mechanism based on an improved artificial potential energy function, and the agents' judgment of the surrounding environment is increased by introducing state information of other agents within the local range of a specific agent; the model is trained using the historical experiences, stored in the experience pool, that the agents explore randomly.
Further, the reward function mechanism based on the improved artificial potential energy function is specifically expressed as follows: for a single agent g, if there are i agents h_i around it, then its artificial potential energy reward function is:
[formula given as an image in the original: the reward R_g of agent g, summing the pairwise potential energy rewards over the surrounding agents h_i]
where d_{g,h_i} is the distance between agent g and the other agents h_i, and R_g is the sum of the total artificial potential energy function rewards of the single agent g.
Further, state information of other agents within the local range of a specific agent, including position information p_other and velocity information v_other, is added to the input layer of the critic network, increasing the agent's judgment of the surrounding environment.
Further, the recovery warehousing kinematic model is specifically as follows:
Δv_i^t = (F_i^t / m)·Δt + F_noise
Δp_i^t = v_i^t·Δt + p_noise
where Δv_i^t is the velocity change of the agent, Δp_i^t is the position change, F_noise and p_noise represent force random noise and position random noise respectively, F_i^t is the force on the agent at time t, v_i^t is the velocity of the agent at time t, and m is the mass of the agent.
According to a second aspect of the embodiments of the present disclosure, there is provided a production line mobile robot gathering type recycling and warehousing simulation system, including:
the motion model construction unit is used for establishing a recovery warehousing kinematics model for the mobile robot based on scene information and mobile robot parameter information;
the path planning unit is used for selecting, for each mobile robot, a storage position in the warehouse as a target, generating an optimal behavior strategy for each mobile robot by using a pre-trained improved deep deterministic policy gradient model, and realizing the recovery of the mobile robots through control of force and velocity;
wherein the improved deep deterministic policy gradient model comprises an actor network and a critic network; the rewards among agents are calculated through a reward function mechanism based on an improved artificial potential energy function, and the agents' judgment of the surrounding environment is increased by introducing state information of other agents within the local range of a specific agent; the model is trained using the historical experiences, stored in the experience pool, that the agents explore randomly.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the production line mobile robot gathering type recovery and warehousing simulation method when executing the program.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the production line mobile robot gathering type recovery and warehousing simulation method.
Compared with the prior art, the beneficial effect of this disclosure is:
(1) in this scheme, an improved artificial potential energy function mechanism is added to the deep deterministic policy gradient algorithm to design the reward function mechanism of the agents in the algorithm, so that the agents can learn high-reward clustering actions through the reward mechanism based on the improved artificial potential energy function, thereby realizing the clustering effect of multiple agents;
(2) in this scheme, local communication information of a specific agent is added to the critic neural network module of the deep deterministic policy gradient algorithm, so that the agent can better judge the surrounding environment and can learn a better clustering strategy to realize the gathering-type recovery and warehousing movement.
Advantages of additional aspects of the disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
Fig. 1 is a flowchart of a production line mobile robot gathering type recycling and warehousing simulation method based on an improved DDPG algorithm according to a first embodiment of the present disclosure;
FIG. 2 is an image of an artificial potential energy function according to a first embodiment of the disclosure;
FIG. 3 is a drawing illustrating a division of multi-agent local communication information as described in one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a multi-agent aggregate retrieval motion trajectory without an improved artificial potential energy function according to one embodiment of the present disclosure;
FIG. 5 is a diagram illustrating a multi-agent aggregate retrieval motion profile using an improved artificial potential energy function according to an embodiment of the present disclosure;
fig. 6 is a diagram illustrating an effect of a multi-mobile agent achieving cluster-based recycling and warehousing in a first embodiment of the disclosure.
FIG. 7 is a diagram illustrating the total reward of the system without local communication information of other agents added, according to the first embodiment of the disclosure;
FIG. 8 is a diagram illustrating the total reward of the system with local communication information of other agents added, according to the first embodiment of the disclosure.
Detailed Description
The present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present disclosure. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; it should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
The first embodiment is as follows:
the embodiment aims to provide a production line mobile robot gathering type recovery and warehousing simulation method.
A production line mobile robot gathering type recovery warehousing simulation method comprises the following steps:
establishing a recovery warehousing kinematic model for the mobile robot based on scene information and mobile robot parameter information;
selecting, for each mobile robot, a storage position in the warehouse as a target, generating an optimal behavior strategy for each mobile robot by using a pre-trained improved deep deterministic policy gradient model, and realizing the recovery of the mobile robots through control of force and velocity;
wherein the improved deep deterministic policy gradient model comprises an actor network and a critic network; the rewards among agents are calculated through a reward function mechanism based on an improved artificial potential energy function, and the agents' judgment of the surrounding environment is increased by introducing state information of other agents within the local range of a specific agent; the model is trained using the historical experiences, stored in the experience pool, that the agents explore randomly.
Specifically, for ease of understanding, the embodiments of the present disclosure are described in detail below with reference to the accompanying drawings:
according to the scheme, an incentive function mechanism of an agent in a DDPG (Deep Deterministic Policy Gradient) algorithm is designed, an incentive mechanism which is good in incentive model design and based on an improved artificial potential energy function is designed, the agent can learn a clustering action with high incentive, and specific agent local communication information is added into a critic neural network module of the Deep Deterministic Gradient algorithm, so that the agent can better judge the surrounding environment, and the agent can learn a better clustering strategy to realize the movement of warehousing, recycling and warehousing.
First, the position of agent i (an agent in this embodiment refers to a production line mobile robot) in two-dimensional space is defined as p_i = (x_i, y_i), together with its velocity v_i. Each agent has a radius d_i. During random exploration, the agent generates a random force F_i according to the DDPG algorithm and moves under it; by learning from the experience of a series of explorations, it generates a better behavior strategy, and the agent's next movement is realized through force and velocity.
In order to achieve the purpose, the invention adopts the following technical scheme:
the method comprises the following steps: for realizing the motion of a simple intelligent agent in a two-dimensional plane space, firstly establishing a kinematic model for the intelligent agent:
Figure BDA0003029018980000061
the amount of change in the velocity and the amount of change in the position of the agent during each time period Δ t are as shown in the system of equations (1) in which random noise F is introducednoiseAnd pnoiseThe method is used for increasing certain force randomness and position randomness in the searching process of the energy body.
Step two: in order to enable an intelligent agent to obtain a better learning strategy, a DDPG learning framework needs to be established, and a system flow chart is designed as shown in figure 1.
The framework is composed of two modules, an actor network and a critic network, and each of the two networks contains two deep neural networks with the same structure. The actor network comprises an action estimation network and an action target network: the action estimation network estimates a suitable action A according to the current state s to make the agent move, and the action target network estimates the action A′ at the next moment according to the state s′ after the actual movement.
The critic network is composed of a value estimation network and a value target network. The value estimation network fits the value Q(s_j, A_j) of agent i's current action, taking the current state s and current action A of agent i as neural network inputs; the value target network fits the value Q(s′_j, A′_j) of the agent's next action, taking the state s′_j and action A′_j at the next moment as neural network inputs.
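For orientation, a possible layout of these four networks is sketched below in PyTorch. The state and action dimensions, hidden-layer sizes and activation functions are illustrative assumptions; the patent does not specify the network architecture.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Action (policy) network μ(s|θ^μ): maps a state to a 2-D force action."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # bounded force output
        )
    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Value network Q(s, A|θ^Q): scores a state-action pair."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Each module exists twice: an estimation network and a target network of identical structure.
actor, actor_target = Actor(8, 2), Actor(8, 2)
critic, critic_target = Critic(8, 2), Critic(8, 2)
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())
```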
The solution described in the present disclosure employs an improved deep reinforcement learning method to study the group cooperation of multiple agents. It does not rely on already existing samples for the agents to learn from; instead, through learning and training the agents draw experience from the environment in which they are located, exploring unknown environments by themselves. In an unknown exploration environment, the experience values obtained by training are measured with a reward function, and newer, better experience values are then obtained for the next exploration; through this design, the task of multi-agent cluster state control is realized. The traditional artificial potential field method is relatively cumbersome to apply in practice, since two types of equations, a repulsive field and an attractive (gravitational) field, must be set to achieve a suitable distance between the mobile agents. The traditional artificial potential energy functions are as follows:
the gravity function:
Figure BDA0003029018980000071
repulsion function:
Figure BDA0003029018980000072
where ω is the gravitational scale factor, λ (q, q)goal) Indicating the distance of the current state of the object from the target. Eta is a repulsive scale factor, lambda0Representing the radius of influence of each obstacle. The traditional artificial potential energy function equation is complex and is difficult to form certain stability; based on the above problem, the present disclosure proposes a reward function mechanism based on an improved artificial potential energy function as described in step three.
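The two traditional potentials can be written as a short Python sketch; it follows the standard quadratic attraction / bounded-radius repulsion forms reconstructed in equations (2) and (3), with placeholder coefficient values.

```python
import numpy as np

def attraction_potential(q, q_goal, omega=1.0):
    """Traditional gravitational (attraction) potential: grows with distance to the goal."""
    dist = np.linalg.norm(np.asarray(q) - np.asarray(q_goal))  # λ(q, q_goal)
    return 0.5 * omega * dist ** 2

def repulsion_potential(q, q_obstacle, eta=1.0, lambda0=0.5):
    """Traditional repulsive potential: active only inside the obstacle influence radius λ_0."""
    dist = np.linalg.norm(np.asarray(q) - np.asarray(q_obstacle))  # λ(q, q_obs)
    if dist > lambda0 or dist == 0.0:
        return 0.0
    return 0.5 * eta * (1.0 / dist - 1.0 / lambda0) ** 2
```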
Step three: a reward function mechanism based on an improved artificial potential energy function is established.
In the gathering-type recovery of mobile agents on a production line, both the efficiency of recovery and warehousing and the avoidance of collisions between mobile agents must be considered, so the cluster reward function between two agents a and b is designed as follows:
[formula (4), given as an image in the original: the improved pairwise potential energy reward D as a function of the inter-agent distance d_ab and the proportionality coefficient ρ]
The image of the improved artificial potential energy function is shown in figure 2. In formula (4), D represents the reward of the agent, ρ is a proportionality coefficient taking a suitable value in (0, 1), and d_ab is the distance between the two agents. When d_ab → 0, D suddenly becomes strongly negative, and when the distance between the two agents is relatively large, the negative reward received by the agent also increases. Compared with the traditional artificial potential energy function, the improved equation is simpler and can achieve a stable effect.
For a single agent g, if there are i agents h_i around it, then the artificial potential energy reward function of agent g is:
[formula given as an image in the original: the reward R_g of agent g, summing the pairwise potential energy rewards over the surrounding agents h_i]
where d_{g,h_i} is the distance between agent g and the other agents h_i, and R_g is the sum of the total artificial potential energy function rewards of the single agent g.
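A small Python sketch of this clustering reward follows. Because the concrete expression of formula (4) is only given as an image in the original, pairwise_reward below is an illustrative stand-in that merely mimics the described behaviour (strongly negative as d_ab → 0, increasingly negative as the agents drift apart); it is not the patent's exact formula. Only the summation structure of R_g is taken from the text.

```python
import numpy as np

def pairwise_reward(d_ab, rho=0.5):
    """Illustrative stand-in for formula (4): strongly negative as d_ab -> 0,
    increasingly negative as the two agents move far apart (not the exact patented expression)."""
    return -rho / max(d_ab, 1e-6) - (1.0 - rho) * d_ab

def potential_energy_reward(p_g, neighbor_positions, rho=0.5):
    """R_g: sum of the pairwise potential energy rewards between agent g and its i neighbours h_i."""
    p_g = np.asarray(p_g, dtype=float)
    return sum(
        pairwise_reward(float(np.linalg.norm(p_g - np.asarray(p_h))), rho)
        for p_h in neighbor_positions
    )
```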
Step four: an experience pool structure is established to store the historical experiences randomly explored by the agents, so as to provide the agents with historical experience for learning. The experience information (s_j, A_j, r_j, s′_j) obtained by each agent during model training is stored in that agent's experience storage area, where r_j is the reward value at the current moment, and the stored information is provided to the agent for learning.
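A minimal sketch of such an experience pool in Python is given below; the capacity value is an assumption, and the tuple layout (s_j, A_j, r_j, s′_j) follows the text above.

```python
import random
from collections import deque

class ExperiencePool:
    """Per-agent experience storage area holding (s_j, A_j, r_j, s'_j) tuples."""
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, n):
        """Draw a group of n experiences for the agent to learn from (step six)."""
        return random.sample(self.buffer, n)

    def __len__(self):
        return len(self.buffer)
```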
Step five: status information of other agents within the local range, including position information p_other and velocity information v_other, is added to the input layer of the critic network to increase the agent's judgment of the surrounding environment. As shown in FIG. 3, the coordinate of agent i is p_i; for the status information of another local agent to be added, the following condition should be met:
||p_i - p_other|| < d_min   (5)
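The neighbour-selection rule of condition (5) can be sketched as follows; how the selected states are padded or truncated to a fixed-size critic input is not described in the patent, so the flattened variable-length vector returned here is only an illustrative choice.

```python
import numpy as np

def local_communication_info(p_i, others, d_min=1.0):
    """Collect (p_other, v_other) of agents with ||p_i - p_other|| < d_min (condition (5))
    and flatten them for concatenation into the critic input layer."""
    p_i = np.asarray(p_i, dtype=float)
    selected = []
    for p_other, v_other in others:  # others: iterable of (position, velocity) pairs
        if np.linalg.norm(p_i - np.asarray(p_other)) < d_min:
            selected.extend(list(p_other) + list(v_other))
    return np.asarray(selected, dtype=float)
```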
step six: training a multi-agent DDPG algorithm model, and taking a group of experience n pieces from an experience pool to enable the agent to learn.
Step seven: the four neural network parameters are updated in reverse using a gradient descent algorithm. The action estimation network parameter is θ^μ, and the gradient-ascent formula for the action estimation network parameters is:
∇_{θ^μ} J ≈ (1/n) Σ_j ∇_A Q(s, A | θ^Q) |_{s = s_j, A = μ(s_j)} · ∇_{θ^μ} μ(s | θ^μ) |_{s = s_j}   (6)
The value estimation network parameter is θ^Q, and the loss function is defined as:
L = (1/n) Σ_j ( y_j - Q(s_j, A_j | θ^Q) )²   (7)
where L is the average loss over n experiences, and y_j is:
y_j = r_j + γ·Q′( s′_j, μ′(s′_j | θ^{μ′}) | θ^{Q′} )   (8)
where y_j expresses the accumulated value Q of the agent's action at the next moment (the value of the agent's action at the next moment is calculated using the improved artificial potential energy function mechanism), γ is the discount factor, r_j is the current reward, θ^{μ′} and θ^{Q′} are the parameters of the action target network and the value target network respectively, and η is the update proportion parameter. The (j+1)-th parameter update of the action target network and the value target network is:
θ^{μ′}_{j+1} = η·θ^μ_j + (1 - η)·θ^{μ′}_j,  θ^{Q′}_{j+1} = η·θ^Q_j + (1 - η)·θ^{Q′}_j   (9)
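One training step of step seven can be sketched in PyTorch as below, using the Actor/Critic modules from the earlier sketch; the optimizers and the γ and η values are assumptions, and the reward r is supplied by the improved artificial potential energy reward described in step three.

```python
import torch

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, eta=0.01):
    """One update: critic loss (7) with target (8), actor gradient step (6), soft update (9)."""
    s, a, r, s_next = batch  # tensors of shape (n, state_dim), (n, action_dim), (n, 1), (n, state_dim)

    # Target value y_j = r_j + γ·Q'(s'_j, μ'(s'_j|θ^{μ'}) | θ^{Q'})
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))

    # Critic loss L = (1/n) Σ_j (y_j - Q(s_j, A_j|θ^Q))²
    critic_loss = torch.mean((y - critic(s, a)) ** 2)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: ascend Q(s, μ(s)); implemented as descending its negative
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target update: θ' <- η·θ + (1 - η)·θ'
    for target, source in ((actor_target, actor), (critic_target, critic)):
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.data.mul_(1.0 - eta).add_(eta * sp.data)
```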
step eight: the experimental results show that the artificial potential energy function design method and the system for the deep certainty strategy gradient algorithm can well enable a plurality of mobile intelligent bodies to achieve a cluster form, and achieve the gathering recovery and warehousing of the plurality of mobile intelligent bodies on a production line by means of the clustering movement learned by the intelligent bodies.
Fig. 4 shows the multi-agent aggregate recovery motion trajectories without the improved artificial potential energy function, and fig. 5 shows the aggregate recovery with the cluster state maintained from the beginning, implemented by applying the designed improved reward function mechanism. By comparison, fig. 5 maintains the cluster effect more efficiently, with the agents forming a cluster at the initial positions and moving to the target warehousing position. The red circles represent the mobile agents, the green pentagon represents the warehousing position, and the black frame represents the recovery warehouse. Fig. 6 corresponds to the motion trajectories of the multiple mobile agents in fig. 5 during gathering-type recovery and warehousing; achieving cluster-type recovery and warehousing saves moving space and improves the safety factor of the movement.
Comparing fig. 7 and fig. 8, it is found that when the local communication information of other agents is added, the number of training rounds needed for the total reward of the system to converge is much smaller: convergence is reached stably and rapidly, and the cluster state of the multiple mobile agents is achieved quickly.
The second embodiment:
the embodiment aims to provide a production line mobile robot gathering type recovery and warehousing simulation system.
A production line mobile robot gathering type recovery and warehousing simulation system comprises:
the motion model construction unit is used for establishing a recovery warehousing kinematics model for the mobile robot based on scene information and mobile robot parameter information;
the path planning unit is used for selecting, for each mobile robot, a storage position in the warehouse as a target, generating an optimal behavior strategy for each mobile robot by using a pre-trained improved deep deterministic policy gradient model, and realizing the recovery of the mobile robots through control of force and velocity;
wherein the improved deep deterministic policy gradient model comprises an actor network and a critic network; the rewards among agents are calculated through a reward function mechanism based on an improved artificial potential energy function, and the agents' judgment of the surrounding environment is increased by introducing state information of other agents within the local range of a specific agent; the model is trained using the historical experiences, stored in the experience pool, that the agents explore randomly.
In further embodiments, there is also provided:
an electronic device comprising a memory and a processor, and computer instructions stored on the memory and executed on the processor, the computer instructions when executed by the processor performing the method of embodiment one. For brevity, no further description is provided herein.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processor, a digital signal processor DSP, an application specific integrated circuit ASIC, an off-the-shelf programmable gate array FPGA or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of embodiment one.
The method in the first embodiment may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor. The software modules may be located in ram, flash, rom, prom, or eprom, registers, among other storage media as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps of the method in combination with hardware of the processor. To avoid repetition, it is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The production line mobile robot gathering type recovery and warehousing simulation method and system described above can be implemented and have wide application prospects.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (8)

1. A production line mobile robot gathering type recovery warehousing simulation method is characterized by comprising the following steps:
establishing a recovery warehousing kinematic model for the mobile robot based on scene information and mobile robot parameter information;
each mobile robot selects a storage position in the warehouse as a target, an optimal behavior strategy is generated for each mobile robot by using a pre-trained improved deep deterministic policy gradient model, and the recovery of the mobile robots is realized through control of force and velocity;
the improved deep deterministic policy gradient model comprises an actor network and a critic network; the rewards among agents are calculated through a reward function mechanism based on an improved artificial potential energy function, and the agents' judgment of the surrounding environment is increased by introducing state information of other agents within the local range of a specific agent; the model is trained using the historical experiences, stored in the experience pool, that the agents explore randomly; the reward function mechanism based on the improved artificial potential energy function is specifically expressed as follows: for a single agent g, if there are i agents h_i around it, then its artificial potential energy reward function is:
[formula given as an image in the original: the reward R_g of agent g, summing the pairwise potential energy rewards over the surrounding agents h_i]
where d_{g,h_i} is the distance between agent g and the other agents h_i, R_g is the sum of the total artificial potential energy function rewards of the single agent g, and ρ is a proportionality coefficient.
2. The production line mobile robot gathering type recovery and warehousing simulation method as claimed in claim 1, wherein state information of other agents within the local range of a specific agent, including position information p_other and velocity information v_other, is added to the input layer of the critic network, increasing the agent's judgment of the surrounding environment.
3. The production line mobile robot gathering type recovery and warehousing simulation method as claimed in claim 1, wherein the improved deep deterministic policy gradient model training selects training samples from the experience pool to perform model training, and the neural network parameters are updated in reverse using a gradient descent algorithm.
4. The production line mobile robot gathering type recovery and warehousing simulation method as claimed in claim 1, wherein the recovery warehousing kinematic model is as follows:
Δv_i^t = (F_i^t / m)·Δt + F_noise
Δp_i^t = v_i^t·Δt + p_noise
where Δv_i^t is the velocity change of the agent, Δp_i^t is the position change, F_noise and p_noise represent force random noise and position random noise respectively, F_i^t is the force on the agent at time t, v_i^t is the velocity of the agent at time t, and m is the mass of the agent.
5. The production line mobile robot gathering type recovery and warehousing simulation method as claimed in claim 1, wherein the actor network comprises an action estimation network and an action target network, and the critic network comprises a value estimation network and a value target network.
6. A production line mobile robot gathering type recovery and warehousing simulation system, characterized by comprising:
the motion model construction unit is used for establishing a recovery warehousing kinematics model for the mobile robot based on scene information and mobile robot parameter information;
the path planning unit is used for selecting, for each mobile robot, a storage position in the warehouse as a target, generating an optimal behavior strategy for each mobile robot by using a pre-trained improved deep deterministic policy gradient model, and realizing the recovery of the mobile robots through control of force and velocity;
the improved deep deterministic policy gradient model comprises an actor network and a critic network; the rewards among agents are calculated through a reward function mechanism based on an improved artificial potential energy function, and the agents' judgment of the surrounding environment is increased by introducing state information of other agents within the local range of a specific agent; the model is trained using the historical experiences, stored in the experience pool, that the agents explore randomly; the reward function mechanism based on the improved artificial potential energy function is specifically expressed as follows: for a single agent g, if there are i agents h_i around it, then its artificial potential energy reward function is:
[formula given as an image in the original: the reward R_g of agent g, summing the pairwise potential energy rewards over the surrounding agents h_i]
where d_{g,h_i} is the distance between agent g and the other agents h_i, R_g is the sum of the total artificial potential energy function rewards of the single agent g, and ρ is a proportionality coefficient.
7. An electronic device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the method of any one of claims 1 to 5 when executing the program.
8. A non-transitory computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the production line mobile robot gathering type recovery and warehousing simulation method according to any one of claims 1-5.
CN202110423843.2A 2021-04-20 2021-04-20 Production line mobile robot gathering type recovery and warehousing simulation method and system Active CN113110101B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110423843.2A CN113110101B (en) 2021-04-20 2021-04-20 Production line mobile robot gathering type recovery and warehousing simulation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110423843.2A CN113110101B (en) 2021-04-20 2021-04-20 Production line mobile robot gathering type recovery and warehousing simulation method and system

Publications (2)

Publication Number Publication Date
CN113110101A CN113110101A (en) 2021-07-13
CN113110101B true CN113110101B (en) 2022-06-21

Family

ID=76718853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110423843.2A Active CN113110101B (en) 2021-04-20 2021-04-20 Production line mobile robot gathering type recovery and warehousing simulation method and system

Country Status (1)

Country Link
CN (1) CN113110101B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113759902B (en) * 2021-08-17 2023-10-27 中南民族大学 Multi-agent local interaction path planning method, device, equipment and storage medium
CN114254722B (en) * 2021-11-17 2022-12-06 中国人民解放军军事科学院国防科技创新研究院 Multi-intelligent-model fusion method for game confrontation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412490A (en) * 2013-08-14 2013-11-27 山东大学 Polyclone artificial immunity network algorithm for multirobot dynamic path planning
CN110597067A (en) * 2019-10-11 2019-12-20 济南大学 Cluster control method and system for multiple mobile robots
WO2020180014A2 (en) * 2019-03-05 2020-09-10 네이버랩스 주식회사 Method and system for training autonomous driving agent on basis of deep reinforcement learning
CN113589842A (en) * 2021-07-26 2021-11-02 中国电子科技集团公司第五十四研究所 Unmanned clustering task cooperation method based on multi-agent reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110794842A (en) * 2019-11-15 2020-02-14 北京邮电大学 Reinforced learning path planning algorithm based on potential field

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412490A (en) * 2013-08-14 2013-11-27 山东大学 Polyclone artificial immunity network algorithm for multirobot dynamic path planning
WO2020180014A2 (en) * 2019-03-05 2020-09-10 네이버랩스 주식회사 Method and system for training autonomous driving agent on basis of deep reinforcement learning
CN110597067A (en) * 2019-10-11 2019-12-20 济南大学 Cluster control method and system for multiple mobile robots
CN113589842A (en) * 2021-07-26 2021-11-02 中国电子科技集团公司第五十四研究所 Unmanned clustering task cooperation method based on multi-agent reinforcement learning

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
An actor-critic deep reinforcement learning approach for metro train scheduling with rolling stock circulation under stochastic demand;Cheng-shuo Ying;《Transportation Research Part B: Methodological》;20200909;正文全文 *
Deep Reinforcement Learning Approach for Flocking Control of Multi-agents;Han Zhang, Jin Cheng;《Proceedings of the 40th Chinese Control Conference》;20211006;第5002-5007页 *
Moving Forward in Formation: A Decentralized Hierarchical Learning Approach to Multi-Agent Moving Together;Shanqi Liu;《2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)》;20211216;正文全文 *
Multiagent Motion Planning Based on Deep Reinforcement Learning in Complex Environments;Dingwei Wu;《2021 6th International Conference on Control and Robotics Engineering》;20210526;正文全文 *
基于DDPG算法的无人机集群追击任务;张耀中,许佳林;《航空学报》;20201116;第309-321页 *
基于分层强化学习及人工势场的多Agent 路径规划方法;郑延斌,李波,安德宇,李娜;《计算机应用》;20151210;第3491-3496页 *
基于强化学习的全自主机器人足球系统协作研究;王腾,李长江;《科学技术与工程》;20110429;第979-982+1011页 *
基于深度强化学习和人工势场法的移动机器人导航;陈满,李茂军,李宜伟,赖志强;《云南大学学报(自然科学版)》;20211110;第1125-1133页 *
基于深度强化学习的三维路径规划算法;黄东晋等;《计算机工程与应用》;20200325(第15期);第30-36页 *
未知环境下基于PF-DQN的无人机路径规划;何金等;《兵工自动化》;20200909(第09期);第15-21页 *

Also Published As

Publication number Publication date
CN113110101A (en) 2021-07-13

Similar Documents

Publication Publication Date Title
CN111061277B (en) Unmanned vehicle global path planning method and device
CN113110592B (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN112947562B (en) Multi-unmanned aerial vehicle motion planning method based on artificial potential field method and MADDPG
CN113095481B (en) Air combat maneuver method based on parallel self-game
CN110794842A (en) Reinforced learning path planning algorithm based on potential field
CN113110101B (en) Production line mobile robot gathering type recovery and warehousing simulation method and system
Fridman et al. Deeptraffic: Crowdsourced hyperparameter tuning of deep reinforcement learning systems for multi-agent dense traffic navigation
CN112433525A (en) Mobile robot navigation method based on simulation learning and deep reinforcement learning
CN112362066A (en) Path planning method based on improved deep reinforcement learning
CN110737968A (en) Crowd trajectory prediction method and system based on deep convolutional long and short memory network
CN112231968A (en) Crowd evacuation simulation method and system based on deep reinforcement learning algorithm
CN112488320A (en) Training method and system for multiple intelligent agents under complex conditions
CN113962012A (en) Unmanned aerial vehicle countermeasure strategy optimization method and device
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
Xin et al. DRL-based improvement for autonomous UAV motion path planning in unknown environments
CN114815891A (en) PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method
CN116796844A (en) M2 GPI-based unmanned aerial vehicle one-to-one chase game method
CN115097861B (en) Multi-unmanned aerial vehicle trapping strategy method based on CEL-MADDPG
CN116400726A (en) Rotor unmanned aerial vehicle escape method and system based on reinforcement learning
CN113095500B (en) Robot tracking method based on multi-agent reinforcement learning
CN116340737A (en) Heterogeneous cluster zero communication target distribution method based on multi-agent reinforcement learning
Alet et al. Robotic gripper design with evolutionary strategies and graph element networks
CN115097814A (en) Mobile robot path planning method, system and application based on improved PSO algorithm
CN110751869B (en) Simulated environment and battlefield situation strategy transfer technology based on countermeasure discrimination migration method
Tang et al. Reinforcement learning for robots path planning with rule-based shallow-trial

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant