CN113110101B - Production line mobile robot gathering type recovery and warehousing simulation method and system - Google Patents

Production line mobile robot gathering type recovery and warehousing simulation method and system

Info

Publication number
CN113110101B
CN113110101B (application number CN202110423843.2A)
Authority
CN
China
Prior art keywords
agent
agents
mobile robot
warehousing
intelligent
Prior art date
Legal status
Active
Application number
CN202110423843.2A
Other languages
Chinese (zh)
Other versions
CN113110101A (en)
Inventor
张涵
程金
王琪琪
王中华
Current Assignee
University of Jinan
Original Assignee
University of Jinan
Priority date
Filing date
Publication date
Application filed by University of Jinan filed Critical University of Jinan
Priority to CN202110423843.2A priority Critical patent/CN113110101B/en
Publication of CN113110101A publication Critical patent/CN113110101A/en
Application granted granted Critical
Publication of CN113110101B publication Critical patent/CN113110101B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B17/00Systems involving the use of models or simulators of said systems
    • G05B17/02Systems involving the use of models or simulators of said systems electric

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

In this scheme, an improved artificial potential energy function mechanism is added to the deep deterministic policy gradient algorithm to design the reward function mechanism of the agents in the algorithm, so that the agents can learn high-reward clustering actions through the reward mechanism based on the improved artificial potential energy function, thereby realizing the clustering effect of multiple agents. In addition, local communication information of a specific agent is added to the critic neural network module of the deep deterministic policy gradient algorithm, so that the agent can better judge the surrounding environment and can learn a better clustering strategy to realize the gathering-type recovery and warehousing movement.

Description

Production line mobile robot gathering type recovery and warehousing simulation method and system
Technical Field
The disclosure belongs to the technical field of motion control of intelligent mobile robots, and particularly relates to a method and a system for simulating gathering type recycling and warehousing of mobile robots in a production line.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
At the present stage, with the rapid development of artificial intelligence technology, reinforcement learning algorithms are used to solve many complex problems in real life. A single-agent system has difficulty solving such problems and is limited in speed and reliability, whereas the cooperation of multiple agents can accomplish higher-level tasks. After multiple agents complete their tasks on a production line, gathering-type recovery and warehousing of the production line mobile robots needs to be realized: the mobile agents move in a gathered cluster, which better maintains the cluster formation, and the gathering-type recovery and warehousing of the mobile agents is achieved efficiently through mutual cooperation.
The inventor finds that, for the problem of gathering-type recovery and warehousing of mobile robots, most existing control methods adopt a reinforcement learning control algorithm to let the agent learn a motion control strategy, but so far there is no simple and stable multi-agent cluster control algorithm that realizes cluster-type movement toward a static target. Meanwhile, existing methods based on deep reinforcement learning require previously explored samples for the agents to learn from; the agents cannot draw experience from the environment in which they are located by exploring unknown environments themselves, so the control effect depends heavily on the richness of the training samples, and the agents cannot effectively cope with the diversity and changes of the environment.
Disclosure of Invention
In order to solve the above problems, the present disclosure provides a production line mobile robot gathering type recovery and warehousing simulation method and system, in which an improved artificial potential energy function mechanism is added to the deep deterministic policy gradient algorithm to realize the clustering effect of multiple agents, and specific local communication information is added to the critic network to improve the stability with which multiple mobile agents maintain the cluster and to speed up the convergence of agent training.
According to a first aspect of the embodiments of the present disclosure, there is provided a production line mobile robot gathering type recycling and warehousing simulation method, including:
establishing a recovery warehousing kinematic model for the mobile robot based on scene information and mobile robot parameter information;
selecting, for each mobile robot, a storage position in the warehouse as a target, generating an optimal behavior strategy for each mobile robot by using a pre-trained improved deep deterministic policy gradient model, and realizing the recovery of the mobile robots through control of force and velocity;
wherein the improved deep deterministic policy gradient model comprises an actor network and a critic network; the rewards among agents are calculated through a reward function mechanism based on an improved artificial potential energy function, and the agents' judgment of the surrounding environment is increased by introducing state information of other agents within the local range of a specific agent; the model is trained using the historical experiences, stored in the experience pool, that the agents explore randomly.
Further, the reward function mechanism based on the improved artificial potential energy function is specifically expressed as follows: for a single agent g, if there are i agents h_i around it, then its artificial potential energy reward function is:
[formula given as an image in the original: the reward R_g of agent g, summing the pairwise potential energy rewards over the surrounding agents h_i]
where d_{g,h_i} is the distance between agent g and the other agents h_i, and R_g is the sum of the total artificial potential energy function rewards of the single agent g.
Further, state information of other agents within the local range of a specific agent, including position information p_other and velocity information v_other, is added to the input layer of the critic network, increasing the agent's judgment of the surrounding environment.
Further, the recovery warehousing kinematic model is specifically as follows:
Δv_i^t = (F_i^t / m)·Δt + F_noise
Δp_i^t = v_i^t·Δt + p_noise
where Δv_i^t is the velocity change of the agent, Δp_i^t is the position change, F_noise and p_noise represent force random noise and position random noise respectively, F_i^t is the force on the agent at time t, v_i^t is the velocity of the agent at time t, and m is the mass of the agent.
According to a second aspect of the embodiments of the present disclosure, there is provided a production line mobile robot gathering type recycling and warehousing simulation system, including:
the motion model construction unit is used for establishing a recovery warehousing kinematics model for the mobile robot based on scene information and mobile robot parameter information;
the path planning unit is used for selecting, for each mobile robot, a storage position in the warehouse as a target, generating an optimal behavior strategy for each mobile robot by using a pre-trained improved deep deterministic policy gradient model, and realizing the recovery of the mobile robots through control of force and velocity;
wherein the improved deep deterministic policy gradient model comprises an actor network and a critic network; the rewards among agents are calculated through a reward function mechanism based on an improved artificial potential energy function, and the agents' judgment of the surrounding environment is increased by introducing state information of other agents within the local range of a specific agent; the model is trained using the historical experiences, stored in the experience pool, that the agents explore randomly.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the production line mobile robot gathering type recovery and warehousing simulation method when executing the program.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the production line mobile robot gathering type recovery and warehousing simulation method.
Compared with the prior art, the beneficial effect of this disclosure is:
(1) in this scheme, an improved artificial potential energy function mechanism is added to the deep deterministic policy gradient algorithm to design the reward function mechanism of the agents in the algorithm, so that the agents can learn high-reward clustering actions through the reward mechanism based on the improved artificial potential energy function, thereby realizing the clustering effect of multiple agents;
(2) in this scheme, local communication information of a specific agent is added to the critic neural network module of the deep deterministic policy gradient algorithm, so that the agent can better judge the surrounding environment and can learn a better clustering strategy to realize the gathering-type recovery and warehousing movement.
Advantages of additional aspects of the disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
Fig. 1 is a flowchart of a production line mobile robot gathering type recycling and warehousing simulation method based on an improved DDPG algorithm according to a first embodiment of the present disclosure;
FIG. 2 is an image of an artificial potential energy function according to a first embodiment of the disclosure;
FIG. 3 is a drawing illustrating a division of multi-agent local communication information as described in one embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a multi-agent aggregate retrieval motion trajectory without an improved artificial potential energy function according to one embodiment of the present disclosure;
FIG. 5 is a diagram illustrating a multi-agent aggregate retrieval motion profile using an improved artificial potential energy function according to an embodiment of the present disclosure;
fig. 6 is a diagram illustrating an effect of a multi-mobile agent achieving cluster-based recycling and warehousing in a first embodiment of the disclosure.
FIG. 7 is a diagram illustrating the total reward of the system without local communication information of other agents added, according to the first embodiment of the disclosure;
FIG. 8 is a diagram illustrating the total reward of the system with local communication information of other agents added, according to the first embodiment of the disclosure.
Detailed Description
The present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present disclosure. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; it should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
The first embodiment is as follows:
the embodiment aims to provide a production line mobile robot gathering type recovery and warehousing simulation method.
A production line mobile robot gathering type recovery warehousing simulation method comprises the following steps:
establishing a recovery warehousing kinematic model for the mobile robot based on scene information and mobile robot parameter information;
selecting, for each mobile robot, a storage position in the warehouse as a target, generating an optimal behavior strategy for each mobile robot by using a pre-trained improved deep deterministic policy gradient model, and realizing the recovery of the mobile robots through control of force and velocity;
wherein the improved deep deterministic policy gradient model comprises an actor network and a critic network; the rewards among agents are calculated through a reward function mechanism based on an improved artificial potential energy function, and the agents' judgment of the surrounding environment is increased by introducing state information of other agents within the local range of a specific agent; the model is trained using the historical experiences, stored in the experience pool, that the agents explore randomly.
Specifically, for ease of understanding, the embodiments of the present disclosure are described in detail below with reference to the accompanying drawings:
according to the scheme, an incentive function mechanism of an agent in a DDPG (Deep Deterministic Policy Gradient) algorithm is designed, an incentive mechanism which is good in incentive model design and based on an improved artificial potential energy function is designed, the agent can learn a clustering action with high incentive, and specific agent local communication information is added into a critic neural network module of the Deep Deterministic Gradient algorithm, so that the agent can better judge the surrounding environment, and the agent can learn a better clustering strategy to realize the movement of warehousing, recycling and warehousing.
First, the position of agent i (an agent in this embodiment refers to a production line mobile robot) in two-dimensional space is defined as p_i = (x_i, y_i), together with its velocity v_i. Each agent has a radius d_i. During random exploration, the agent generates a random force F_i according to the DDPG algorithm and moves under it; by learning from the experience of a series of explorations, it generates a better behavior strategy, and the agent's next movement is realized through force and velocity.
In order to achieve the purpose, the invention adopts the following technical scheme:
the method comprises the following steps: for realizing the motion of a simple intelligent agent in a two-dimensional plane space, firstly establishing a kinematic model for the intelligent agent:
Figure BDA0003029018980000061
the amount of change in the velocity and the amount of change in the position of the agent during each time period Δ t are as shown in the system of equations (1) in which random noise F is introducednoiseAnd pnoiseThe method is used for increasing certain force randomness and position randomness in the searching process of the energy body.
Step two: in order to enable an intelligent agent to obtain a better learning strategy, a DDPG learning framework needs to be established, and a system flow chart is designed as shown in figure 1.
The framework is composed of two modules, an actor network and a critic network, and each of the two networks contains two deep neural networks with the same structure. The actor network comprises an action estimation network and an action target network: the action estimation network estimates a suitable action A according to the current state s to make the agent move, and the action target network estimates the action A′ at the next moment according to the state s′ after the actual movement.
The critic network is composed of a value estimation network and a value target network. The value estimation network fits the value Q(s_j, A_j) of agent i's current action, taking the current state s and current action A of agent i as neural network inputs; the value target network fits the value Q(s′_j, A′_j) of the agent's next action, taking the state s′_j and action A′_j at the next moment as neural network inputs.
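For orientation, a possible layout of these four networks is sketched below in PyTorch. The state and action dimensions, hidden-layer sizes and activation functions are illustrative assumptions; the patent does not specify the network architecture.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Action (policy) network μ(s|θ^μ): maps a state to a 2-D force action."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # bounded force output
        )
    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Value network Q(s, A|θ^Q): scores a state-action pair."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Each module exists twice: an estimation network and a target network of identical structure.
actor, actor_target = Actor(8, 2), Actor(8, 2)
critic, critic_target = Critic(8, 2), Critic(8, 2)
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())
```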
The solution described in the present disclosure employs an improved deep reinforcement learning method to study the group cooperation of multiple agents. It does not rely on already existing samples for the agents to learn from; instead, through learning and training the agents draw experience from the environment in which they are located, exploring unknown environments by themselves. In an unknown exploration environment, the experience values obtained by training are measured with a reward function, and newer, better experience values are then obtained for the next exploration; through this design, the task of multi-agent cluster state control is realized. The traditional artificial potential field method is relatively cumbersome to apply in practice, since two types of equations, a repulsive field and an attractive (gravitational) field, must be set to achieve a suitable distance between the mobile agents. The traditional artificial potential energy functions are as follows:
the gravity function:
Figure BDA0003029018980000071
repulsion function:
Figure BDA0003029018980000072
where ω is the gravitational scale factor, λ (q, q)goal) Indicating the distance of the current state of the object from the target. Eta is a repulsive scale factor, lambda0Representing the radius of influence of each obstacle. The traditional artificial potential energy function equation is complex and is difficult to form certain stability; based on the above problem, the present disclosure proposes a reward function mechanism based on an improved artificial potential energy function as described in step three.
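The two traditional potentials can be written as a short Python sketch; it follows the standard quadratic attraction / bounded-radius repulsion forms reconstructed in equations (2) and (3), with placeholder coefficient values.

```python
import numpy as np

def attraction_potential(q, q_goal, omega=1.0):
    """Traditional gravitational (attraction) potential: grows with distance to the goal."""
    dist = np.linalg.norm(np.asarray(q) - np.asarray(q_goal))  # λ(q, q_goal)
    return 0.5 * omega * dist ** 2

def repulsion_potential(q, q_obstacle, eta=1.0, lambda0=0.5):
    """Traditional repulsive potential: active only inside the obstacle influence radius λ_0."""
    dist = np.linalg.norm(np.asarray(q) - np.asarray(q_obstacle))  # λ(q, q_obs)
    if dist > lambda0 or dist == 0.0:
        return 0.0
    return 0.5 * eta * (1.0 / dist - 1.0 / lambda0) ** 2
```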
Step three: a reward function mechanism based on an improved artificial potential energy function is established.
In the gathering-type recovery of mobile agents on a production line, both the efficiency of recovery and warehousing and the avoidance of collisions between mobile agents must be considered, so the cluster reward function between two agents a and b is designed as follows:
[formula (4), given as an image in the original: the improved pairwise potential energy reward D as a function of the inter-agent distance d_ab and the proportionality coefficient ρ]
The image of the improved artificial potential energy function is shown in figure 2. In formula (4), D represents the reward of the agent, ρ is a proportionality coefficient taking a suitable value in (0, 1), and d_ab is the distance between the two agents. When d_ab → 0, D suddenly becomes strongly negative, and when the distance between the two agents is relatively large, the negative reward received by the agent also increases. Compared with the traditional artificial potential energy function, the improved equation is simpler and can achieve a stable effect.
For a single agent g, if there are i agents h_i around it, then the artificial potential energy reward function of agent g is:
[formula given as an image in the original: the reward R_g of agent g, summing the pairwise potential energy rewards over the surrounding agents h_i]
where d_{g,h_i} is the distance between agent g and the other agents h_i, and R_g is the sum of the total artificial potential energy function rewards of the single agent g.
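A small Python sketch of this clustering reward follows. Because the concrete expression of formula (4) is only given as an image in the original, pairwise_reward below is an illustrative stand-in that merely mimics the described behaviour (strongly negative as d_ab → 0, increasingly negative as the agents drift apart); it is not the patent's exact formula. Only the summation structure of R_g is taken from the text.

```python
import numpy as np

def pairwise_reward(d_ab, rho=0.5):
    """Illustrative stand-in for formula (4): strongly negative as d_ab -> 0,
    increasingly negative as the two agents move far apart (not the exact patented expression)."""
    return -rho / max(d_ab, 1e-6) - (1.0 - rho) * d_ab

def potential_energy_reward(p_g, neighbor_positions, rho=0.5):
    """R_g: sum of the pairwise potential energy rewards between agent g and its i neighbours h_i."""
    p_g = np.asarray(p_g, dtype=float)
    return sum(
        pairwise_reward(float(np.linalg.norm(p_g - np.asarray(p_h))), rho)
        for p_h in neighbor_positions
    )
```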
Step four: an experience pool structure is established to store the historical experiences randomly explored by the agents, so as to provide the agents with historical experience for learning. The experience information (s_j, A_j, r_j, s′_j) obtained by each agent during model training is stored in that agent's experience storage area, where r_j is the reward value at the current moment, and the stored information is provided to the agent for learning.
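A minimal sketch of such an experience pool in Python is given below; the capacity value is an assumption, and the tuple layout (s_j, A_j, r_j, s′_j) follows the text above.

```python
import random
from collections import deque

class ExperiencePool:
    """Per-agent experience storage area holding (s_j, A_j, r_j, s'_j) tuples."""
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, n):
        """Draw a group of n experiences for the agent to learn from (step six)."""
        return random.sample(self.buffer, n)

    def __len__(self):
        return len(self.buffer)
```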
Step five: status information of other agents within the local range, including position information p_other and velocity information v_other, is added to the input layer of the critic network to increase the agent's judgment of the surrounding environment. As shown in FIG. 3, the coordinate of agent i is p_i; for the status information of another local agent to be added, the following condition should be met:
||p_i - p_other|| < d_min   (5)
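The neighbour-selection rule of condition (5) can be sketched as follows; how the selected states are padded or truncated to a fixed-size critic input is not described in the patent, so the flattened variable-length vector returned here is only an illustrative choice.

```python
import numpy as np

def local_communication_info(p_i, others, d_min=1.0):
    """Collect (p_other, v_other) of agents with ||p_i - p_other|| < d_min (condition (5))
    and flatten them for concatenation into the critic input layer."""
    p_i = np.asarray(p_i, dtype=float)
    selected = []
    for p_other, v_other in others:  # others: iterable of (position, velocity) pairs
        if np.linalg.norm(p_i - np.asarray(p_other)) < d_min:
            selected.extend(list(p_other) + list(v_other))
    return np.asarray(selected, dtype=float)
```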
step six: training a multi-agent DDPG algorithm model, and taking a group of experience n pieces from an experience pool to enable the agent to learn.
Step seven: the four neural network parameters are updated in reverse using a gradient descent algorithm. The action estimation network parameter is θ^μ, and the gradient-ascent formula for the action estimation network parameters is:
∇_{θ^μ} J ≈ (1/n) Σ_j ∇_A Q(s, A | θ^Q) |_{s = s_j, A = μ(s_j)} · ∇_{θ^μ} μ(s | θ^μ) |_{s = s_j}   (6)
The value estimation network parameter is θ^Q, and the loss function is defined as:
L = (1/n) Σ_j ( y_j - Q(s_j, A_j | θ^Q) )²   (7)
where L is the average loss over n experiences, and y_j is:
y_j = r_j + γ·Q′( s′_j, μ′(s′_j | θ^{μ′}) | θ^{Q′} )   (8)
where y_j expresses the accumulated value Q of the agent's action at the next moment (the value of the agent's action at the next moment is calculated using the improved artificial potential energy function mechanism), γ is the discount factor, r_j is the current reward, θ^{μ′} and θ^{Q′} are the parameters of the action target network and the value target network respectively, and η is the update proportion parameter. The (j+1)-th parameter update of the action target network and the value target network is:
θ^{μ′}_{j+1} = η·θ^μ_j + (1 - η)·θ^{μ′}_j,  θ^{Q′}_{j+1} = η·θ^Q_j + (1 - η)·θ^{Q′}_j   (9)
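One training step of step seven can be sketched in PyTorch as below, using the Actor/Critic modules from the earlier sketch; the optimizers and the γ and η values are assumptions, and the reward r is supplied by the improved artificial potential energy reward described in step three.

```python
import torch

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, eta=0.01):
    """One update: critic loss (7) with target (8), actor gradient step (6), soft update (9)."""
    s, a, r, s_next = batch  # tensors of shape (n, state_dim), (n, action_dim), (n, 1), (n, state_dim)

    # Target value y_j = r_j + γ·Q'(s'_j, μ'(s'_j|θ^{μ'}) | θ^{Q'})
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))

    # Critic loss L = (1/n) Σ_j (y_j - Q(s_j, A_j|θ^Q))²
    critic_loss = torch.mean((y - critic(s, a)) ** 2)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: ascend Q(s, μ(s)); implemented as descending its negative
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target update: θ' <- η·θ + (1 - η)·θ'
    for target, source in ((actor_target, actor), (critic_target, critic)):
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.data.mul_(1.0 - eta).add_(eta * sp.data)
```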
step eight: the experimental results show that the artificial potential energy function design method and the system for the deep certainty strategy gradient algorithm can well enable a plurality of mobile intelligent bodies to achieve a cluster form, and achieve the gathering recovery and warehousing of the plurality of mobile intelligent bodies on a production line by means of the clustering movement learned by the intelligent bodies.
Fig. 4 shows the multi-agent aggregate recovery motion trajectories without the improved artificial potential energy function, and fig. 5 shows the aggregate recovery with the cluster state maintained from the beginning, implemented by applying the designed improved reward function mechanism. By comparison, fig. 5 maintains the cluster effect more efficiently, with the agents forming a cluster at the initial positions and moving to the target warehousing position. The red circles represent the mobile agents, the green pentagon represents the warehousing position, and the black frame represents the recovery warehouse. Fig. 6 corresponds to the motion trajectories of the multiple mobile agents in fig. 5 during gathering-type recovery and warehousing; achieving cluster-type recovery and warehousing saves moving space and improves the safety factor of the movement.
Comparing fig. 7 and fig. 8, it is found that when the local communication information of other agents is added, the number of training rounds needed for the total reward of the system to converge is much smaller: convergence is reached stably and rapidly, and the cluster state of the multiple mobile agents is achieved quickly.
The second embodiment:
the embodiment aims to provide a production line mobile robot gathering type recovery and warehousing simulation system.
A production line mobile robot gathering type recovery and warehousing simulation system comprises:
the motion model construction unit is used for establishing a recovery warehousing kinematics model for the mobile robot based on scene information and mobile robot parameter information;
the path planning unit is used for selecting, for each mobile robot, a storage position in the warehouse as a target, generating an optimal behavior strategy for each mobile robot by using a pre-trained improved deep deterministic policy gradient model, and realizing the recovery of the mobile robots through control of force and velocity;
wherein the improved deep deterministic policy gradient model comprises an actor network and a critic network; the rewards among agents are calculated through a reward function mechanism based on an improved artificial potential energy function, and the agents' judgment of the surrounding environment is increased by introducing state information of other agents within the local range of a specific agent; the model is trained using the historical experiences, stored in the experience pool, that the agents explore randomly.
In further embodiments, there is also provided:
an electronic device comprising a memory and a processor, and computer instructions stored on the memory and executed on the processor, the computer instructions when executed by the processor performing the method of embodiment one. For brevity, no further description is provided herein.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processor, a digital signal processor DSP, an application specific integrated circuit ASIC, an off-the-shelf programmable gate array FPGA or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of embodiment one.
The method in the first embodiment may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor. The software modules may be located in ram, flash, rom, prom, or eprom, registers, among other storage media as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps of the method in combination with hardware of the processor. To avoid repetition, it is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The production line mobile robot gathering type recovery and warehousing simulation method and system described above can be implemented and have wide application prospects.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (8)

1. A production line mobile robot gathering type recovery warehousing simulation method is characterized by comprising the following steps:
establishing a recovery warehousing kinematic model for the mobile robot based on scene information and mobile robot parameter information;
each mobile robot selects a storage position in the warehouse as a target, an optimal behavior strategy is generated for each mobile robot by using a pre-trained improved deep deterministic policy gradient model, and the recovery of the mobile robots is realized through control of force and velocity;
the improved deep deterministic policy gradient model comprises an actor network and a critic network; the rewards among agents are calculated through a reward function mechanism based on an improved artificial potential energy function, and the agents' judgment of the surrounding environment is increased by introducing state information of other agents within the local range of a specific agent; the model is trained using the historical experiences, stored in the experience pool, that the agents explore randomly; the reward function mechanism based on the improved artificial potential energy function is specifically expressed as follows: for a single agent g, if there are i agents h_i around it, then its artificial potential energy reward function is:
[formula given as an image in the original: the reward R_g of agent g, summing the pairwise potential energy rewards over the surrounding agents h_i]
where d_{g,h_i} is the distance between agent g and the other agents h_i, R_g is the sum of the total artificial potential energy function rewards of the single agent g, and ρ is a proportionality coefficient.
2. The production line mobile robot gathering type recovery and warehousing simulation method as claimed in claim 1, wherein state information of other agents within the local range of a specific agent, including position information p_other and velocity information v_other, is added to the input layer of the critic network, increasing the agent's judgment of the surrounding environment.
3. The production line mobile robot gathering type recovery and warehousing simulation method as claimed in claim 1, wherein the improved deep deterministic policy gradient model training selects training samples from the experience pool to perform model training, and the neural network parameters are updated in reverse using a gradient descent algorithm.
4. The production line mobile robot gathering type recovery and warehousing simulation method as claimed in claim 1, wherein the recovery warehousing kinematic model is as follows:
Δv_i^t = (F_i^t / m)·Δt + F_noise
Δp_i^t = v_i^t·Δt + p_noise
where Δv_i^t is the velocity change of the agent, Δp_i^t is the position change, F_noise and p_noise represent force random noise and position random noise respectively, F_i^t is the force on the agent at time t, v_i^t is the velocity of the agent at time t, and m is the mass of the agent.
5. The production line mobile robot gathering type recovery and warehousing simulation method as claimed in claim 1, wherein the actor network comprises an action estimation network and an action target network, and the critic network comprises a value estimation network and a value target network.
6. A production line mobile robot gathering type recovery and warehousing simulation system, characterized by comprising:
the motion model construction unit is used for establishing a recovery warehousing kinematics model for the mobile robot based on scene information and mobile robot parameter information;
the path planning unit is used for selecting, for each mobile robot, a storage position in the warehouse as a target, generating an optimal behavior strategy for each mobile robot by using a pre-trained improved deep deterministic policy gradient model, and realizing the recovery of the mobile robots through control of force and velocity;
the improved deep deterministic policy gradient model comprises an actor network and a critic network; the rewards among agents are calculated through a reward function mechanism based on an improved artificial potential energy function, and the agents' judgment of the surrounding environment is increased by introducing state information of other agents within the local range of a specific agent; the model is trained using the historical experiences, stored in the experience pool, that the agents explore randomly; the reward function mechanism based on the improved artificial potential energy function is specifically expressed as follows: for a single agent g, if there are i agents h_i around it, then its artificial potential energy reward function is:
[formula given as an image in the original: the reward R_g of agent g, summing the pairwise potential energy rewards over the surrounding agents h_i]
where d_{g,h_i} is the distance between agent g and the other agents h_i, R_g is the sum of the total artificial potential energy function rewards of the single agent g, and ρ is a proportionality coefficient.
7. An electronic device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor implements the method of any one of claims 1 to 5 when executing the program.
8. A non-transitory computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the production line mobile robot gathering type recovery and warehousing simulation method according to any one of claims 1-5.
CN202110423843.2A 2021-04-20 2021-04-20 Production line mobile robot gathering type recovery and warehousing simulation method and system Active CN113110101B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110423843.2A CN113110101B (en) 2021-04-20 2021-04-20 Production line mobile robot gathering type recovery and warehousing simulation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110423843.2A CN113110101B (en) 2021-04-20 2021-04-20 Production line mobile robot gathering type recovery and warehousing simulation method and system

Publications (2)

Publication Number Publication Date
CN113110101A CN113110101A (en) 2021-07-13
CN113110101B true CN113110101B (en) 2022-06-21

Family

ID=76718853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110423843.2A Active CN113110101B (en) 2021-04-20 2021-04-20 Production line mobile robot gathering type recovery and warehousing simulation method and system

Country Status (1)

Country Link
CN (1) CN113110101B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113759902B (en) * 2021-08-17 2023-10-27 中南民族大学 Multi-agent local interaction path planning method, device, equipment and storage medium
CN114254722B (en) * 2021-11-17 2022-12-06 中国人民解放军军事科学院国防科技创新研究院 Multi-intelligent-model fusion method for game confrontation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412490A (en) * 2013-08-14 2013-11-27 山东大学 Polyclone artificial immunity network algorithm for multirobot dynamic path planning
CN110597067A (en) * 2019-10-11 2019-12-20 济南大学 Cluster control method and system for multiple mobile robots
WO2020180014A2 (en) * 2019-03-05 2020-09-10 네이버랩스 주식회사 Method and system for training autonomous driving agent on basis of deep reinforcement learning
CN113589842A (en) * 2021-07-26 2021-11-02 中国电子科技集团公司第五十四研究所 Unmanned clustering task cooperation method based on multi-agent reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110794842A (en) * 2019-11-15 2020-02-14 北京邮电大学 Reinforced learning path planning algorithm based on potential field

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412490A (en) * 2013-08-14 2013-11-27 山东大学 Polyclone artificial immunity network algorithm for multirobot dynamic path planning
WO2020180014A2 (en) * 2019-03-05 2020-09-10 네이버랩스 주식회사 Method and system for training autonomous driving agent on basis of deep reinforcement learning
CN110597067A (en) * 2019-10-11 2019-12-20 济南大学 Cluster control method and system for multiple mobile robots
CN113589842A (en) * 2021-07-26 2021-11-02 中国电子科技集团公司第五十四研究所 Unmanned clustering task cooperation method based on multi-agent reinforcement learning

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
An actor-critic deep reinforcement learning approach for metro train scheduling with rolling stock circulation under stochastic demand;Cheng-shuo Ying;《Transportation Research Part B: Methodological》;20200909;正文全文 *
Deep Reinforcement Learning Approach for Flocking Control of Multi-agents;Han Zhang, Jin Cheng;《Proceedings of the 40th Chinese Control Conference》;20211006;第5002-5007页 *
Moving Forward in Formation: A Decentralized Hierarchical Learning Approach to Multi-Agent Moving Together;Shanqi Liu;《2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)》;20211216;正文全文 *
Multiagent Motion Planning Based on Deep Reinforcement Learning in Complex Environments;Dingwei Wu;《2021 6th International Conference on Control and Robotics Engineering》;20210526;正文全文 *
基于DDPG算法的无人机集群追击任务;张耀中,许佳林;《航空学报》;20201116;第309-321页 *
基于分层强化学习及人工势场的多Agent 路径规划方法;郑延斌,李波,安德宇,李娜;《计算机应用》;20151210;第3491-3496页 *
基于强化学习的全自主机器人足球系统协作研究;王腾,李长江;《科学技术与工程》;20110429;第979-982+1011页 *
基于深度强化学习和人工势场法的移动机器人导航;陈满,李茂军,李宜伟,赖志强;《云南大学学报(自然科学版)》;20211110;第1125-1133页 *
基于深度强化学习的三维路径规划算法;黄东晋等;《计算机工程与应用》;20200325(第15期);第30-36页 *
未知环境下基于PF-DQN的无人机路径规划;何金等;《兵工自动化》;20200909(第09期);第15-21页 *

Also Published As

Publication number Publication date
CN113110101A (en) 2021-07-13

Similar Documents

Publication Publication Date Title
CN111061277B (en) Unmanned vehicle global path planning method and device
CN113110592B (en) Unmanned aerial vehicle obstacle avoidance and path planning method
CN112947562B (en) Multi-unmanned aerial vehicle motion planning method based on artificial potential field method and MADDPG
CN113095481B (en) Air combat maneuver method based on parallel self-game
CN110794842A (en) Reinforced learning path planning algorithm based on potential field
CN113110101B (en) Production line mobile robot gathering type recovery and warehousing simulation method and system
Fridman et al. Deeptraffic: Crowdsourced hyperparameter tuning of deep reinforcement learning systems for multi-agent dense traffic navigation
CN112433525A (en) Mobile robot navigation method based on simulation learning and deep reinforcement learning
CN112362066A (en) Path planning method based on improved deep reinforcement learning
CN110737968A (en) Crowd trajectory prediction method and system based on deep convolutional long and short memory network
CN112231968A (en) Crowd evacuation simulation method and system based on deep reinforcement learning algorithm
CN112488320A (en) Training method and system for multiple intelligent agents under complex conditions
CN113962012A (en) Unmanned aerial vehicle countermeasure strategy optimization method and device
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
Xin et al. DRL-based improvement for autonomous UAV motion path planning in unknown environments
CN114815891A (en) PER-IDQN-based multi-unmanned aerial vehicle enclosure capture tactical method
CN116796844A (en) M2 GPI-based unmanned aerial vehicle one-to-one chase game method
CN115097861B (en) Multi-unmanned aerial vehicle trapping strategy method based on CEL-MADDPG
CN116400726A (en) Rotor unmanned aerial vehicle escape method and system based on reinforcement learning
CN113095500B (en) Robot tracking method based on multi-agent reinforcement learning
CN116340737A (en) Heterogeneous cluster zero communication target distribution method based on multi-agent reinforcement learning
Alet et al. Robotic gripper design with evolutionary strategies and graph element networks
CN115097814A (en) Mobile robot path planning method, system and application based on improved PSO algorithm
CN110751869B (en) Simulated environment and battlefield situation strategy transfer technology based on countermeasure discrimination migration method
Tang et al. Reinforcement learning for robots path planning with rule-based shallow-trial

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant