CN115946133B - Mechanical arm plug-in control method, device, equipment and medium based on reinforcement learning

Info

Publication number: CN115946133B (granted publication of application CN115946133A)
Application number: CN202310255934.9A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 熊得竹, 杨红杰, 温志庆
Applicant/Assignee: Individual
Legal status: Active (granted)

Classifications

    • Y02P 90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Abstract

The application relates to the technical field of mechanical arm control, and provides a mechanical arm plug-in control method, device, equipment and medium based on reinforcement learning, wherein the method comprises the following steps: setting intermediate point information according to the container pose information and a preset distance, wherein the intermediate point information is information of a preset point located directly above the container model; training a first reinforcement learning model based on a random first initialization pose and the intermediate point information, the first reinforcement learning model being used to generate a first movement strategy; training a second reinforcement learning model based on a random second initialization pose, the trained first reinforcement learning model, the intermediate point information and the container pose information, the second reinforcement learning model being used to generate a second movement strategy; and deploying the trained first reinforcement learning model and the trained second reinforcement learning model to a control end of the mechanical arm so as to control the mechanical arm to perform the plug-in operation. The method can effectively reduce the data calculation amount of the plug-in method.

Description

Mechanical arm plug-in control method, device, equipment and medium based on reinforcement learning
Technical Field
The application relates to the technical field of mechanical arm control, in particular to a mechanical arm plug-in control method, device, equipment and medium based on reinforcement learning.
Background
Existing plug-in methods typically use the end of a mechanical arm to insert a plug-in into a target position. The workflow of such a plug-in method is as follows: first, a pre-trained motion control model generates and sends a motion instruction; the mechanical arm is then controlled to insert the plug-in into the target position according to the received motion instruction. The process by which the motion control model generates the motion instruction comprises the following steps: 1. planning a route for the end of the mechanical arm model; 2. solving, by inverse kinematics, the joint angles corresponding to each position of the end, and generating the motion instruction according to the joint angles. Because the inverse kinematics solution performed by the motion control model is subject to uncertainty, the existing motion control model needs a large amount of calculation to generate a motion instruction capable of inserting the plug-in into the container; that is, the existing plug-in method has the problem of a large calculation amount.
Accordingly, the prior art is subject to improvement and development.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, an object of the present application is to provide a method, an apparatus, a device and a medium for controlling a robot plug-in based on reinforcement learning, which can effectively reduce the data calculation amount of the plug-in method.
In a first aspect, the present application provides a method for controlling a robot insert based on reinforcement learning, where the method includes the steps of:
setting intermediate point information according to the pose information of the container and the preset distance, wherein the intermediate point information is information of a preset point positioned right above the container model;
training a first reinforcement learning model based on random first initialization pose and intermediate point information, wherein the first reinforcement learning model is used for generating a first movement strategy, and the first movement strategy is used for controlling the movement of the mechanical arm model to enable the plug-in model on the tail end of the mechanical arm model to move to the intermediate point information;
training a second reinforcement learning model based on the random second initialization pose, the trained first reinforcement learning model, the intermediate point information and the container pose information, the second reinforcement learning model being used for generating a second movement strategy for controlling the mechanical arm model to insert the plug-in model on the end of the mechanical arm model into the container model when the first movement strategy controls the mechanical arm model to move the plug-in model on the end of the mechanical arm model to the intermediate point information;
and deploying the trained first reinforcement learning model and the trained second reinforcement learning model to a control end of the mechanical arm so as to control the mechanical arm to carry out plug-in.
According to the mechanical arm plug-in control method based on reinforcement learning, the plug-in process is split into moving the plug-in to the intermediate point information and inserting the plug-in into the container from the intermediate point information. Because the first reinforcement learning model and the second reinforcement learning model are both models constructed based on reinforcement learning, and no inverse kinematics solution is needed when the mechanical arm is controlled with a reinforcement learning algorithm, the method can effectively reduce the data calculation amount of the plug-in method. Moreover, because the first movement strategy is only used for moving the plug-in to the intermediate point information, it only needs to consider the displacement precision of the plug-in and need not consider the orientation precision of the plug-in during the movement, while the second movement strategy is only used for inserting the plug-in into the container and therefore only needs to focus on the orientation precision of the plug-in. The method can thus further reduce the data calculation amount required by the first reinforcement learning model to generate the first movement strategy and by the second reinforcement learning model to generate the second movement strategy, thereby further reducing the data calculation amount of the plug-in method.
Preferably, the training process for training the first reinforcement learning model based on the random first initialization pose and the intermediate point information is:
training a first reinforcement learning model based on the random first initialization pose, the intermediate point information, a first reward function and a penalty function, wherein the first reward function is used for outputting a first reward output value according to the distance between the end of the mechanical arm model and the preset point located directly above the container model, and the penalty function is used for outputting a penalty output value according to the movement speed of the joint angles of the mechanical arm model during the simulated movement.
According to the above technical scheme, the first reward output value and the penalty output value are generated by the first reward function and the penalty function respectively, and the network parameters of the first reinforcement learning model are optimized according to the first reward output value and the penalty output value, so that the orientation of the plug-in model is kept as close as possible to the first target orientation while the amount of arm motion is reduced. This effectively reduces the difficulty of inserting the plug-in model into the container model, avoids excessive movement of the mechanical arm model, effectively improves the movement efficiency of the mechanical arm and reduces the energy consumption of the mechanical arm.
Preferably, the number of preset points located directly above the container model is two, and the first reward function outputs the first reward output value $r_1$ as a function, parameterized by the constants $c_1$ and $c_2$, that decreases as the average distance $\lvert d \rvert$ increases, with

$$\lvert d \rvert = \frac{\lvert d_1 \rvert + \lvert d_2 \rvert}{2},\qquad
\lvert d_1 \rvert = \sqrt{(x_1 - x_{t1})^2 + (y_1 - y_{t1})^2 + (z_1 - z_{t1})^2},\qquad
\lvert d_2 \rvert = \sqrt{(x_2 - x_{t2})^2 + (y_2 - y_{t2})^2 + (z_2 - z_{t2})^2}$$

wherein $r_1$ represents the first reward output value, $\lvert d \rvert$ represents the average spatial distance between the two points on the end of the mechanical arm model and the two preset points located directly above the container model, $c_1$ and $c_2$ are constants, $\lvert d_1 \rvert$ represents the Euclidean distance between point No. 1 on the end of the mechanical arm model and preset point No. 1 located directly above the container model, $\lvert d_2 \rvert$ represents the Euclidean distance between point No. 2 on the end of the mechanical arm model and preset point No. 2 located directly above the container model, $(x_1, y_1, z_1)$ represents the coordinates of point No. 1 on the end of the mechanical arm model in the spatial coordinate system after the mechanical arm model moves according to the first movement strategy, $(x_{t1}, y_{t1}, z_{t1})$ represents the coordinates of preset point No. 1 located directly above the container model in the spatial coordinate system, $(x_2, y_2, z_2)$ represents the coordinates of point No. 2 on the end of the mechanical arm model in the spatial coordinate system after the mechanical arm model moves according to the first movement strategy, and $(x_{t2}, y_{t2}, z_{t2})$ represents the coordinates of preset point No. 2 located directly above the container model in the spatial coordinate system.
Preferably, the training process for training the second reinforcement learning model based on the random second initialization pose, the trained first reinforcement learning model, the intermediate point information and the container pose information is as follows:
and training a second reinforcement learning model based on the random second initialization pose, the trained first reinforcement learning model, the intermediate point information, the container pose information, a second reward function and a penalty function, wherein the second reward function is used for outputting a second reward output value according to the distance and direction between the end of the mechanical arm model and the container model, and the penalty function is used for outputting a penalty output value according to the movement speed of the joint angles of the mechanical arm model during the simulated movement.
According to the above technical scheme, the second reward output value and the penalty output value are generated by the second reward function and the penalty function respectively, and the network parameters of the second reinforcement learning model are optimized according to the second reward output value and the penalty output value, so that the orientation of the plug-in model is kept as close as possible to the second target orientation while the amount of arm motion is reduced. This effectively reduces the difficulty of inserting the plug-in model into the container model and avoids excessive movement of the mechanical arm model.
Preferably, the second reward function outputs the second reward output value $r_2$ by combining a distance reward $r_x$ and a directional reward $r_0$, wherein $r_2$ represents the second reward output value, $r_x$ represents a reward value for the distance between the end of the mechanical arm model and a preset point located within the container model, $r_0$ represents a reward value for the directional agreement between the end of the mechanical arm model and the container model, $(x, y, z)$ represents the unit vector of the end of the mechanical arm model after the mechanical arm model moves according to the second movement strategy, and $(x_t, y_t, z_t)$ represents the unit vector of the container model.
Preferably, the penalty function outputs the penalty output value $r_p$ from the movement speeds of the joint angles of the mechanical arm model, wherein $r_p$ represents the penalty output value and $\lvert a_1 \rvert$ to $\lvert a_6 \rvert$ represent the absolute values of the movement speeds of the six joint angles of the mechanical arm model.
Preferably, the first reinforcement learning model and the second reinforcement learning model are both models constructed based on reinforcement learning algorithm.
In a second aspect, the present application provides a robot insert control device based on reinforcement learning, for controlling a robot insert, the device comprising:
the setting module is used for setting middle point information according to the pose information of the container and the preset distance, wherein the middle point information is information of a preset point positioned right above the container model;
the first training module is used for training a first reinforcement learning model based on random first initialization pose and intermediate point information, the first reinforcement learning model is used for generating a first movement strategy, and the first movement strategy is used for controlling the movement of the mechanical arm model to enable the plug-in model on the tail end of the mechanical arm model to move to the intermediate point information;
the second training module is used for training a second reinforcement learning model based on the random second initialization pose, the trained first reinforcement learning model, the middle point information and the container pose information, the second reinforcement learning model is used for generating a second movement strategy, and the second movement strategy is used for controlling the mechanical arm model to insert the plug-in model at the tail end of the mechanical arm model into the container model when the first movement strategy controls the mechanical arm model to move the plug-in model at the tail end of the mechanical arm model to the middle point information;
the deployment module is used for deploying the trained first reinforcement learning model and the trained second reinforcement learning model to the control end of the mechanical arm so as to control the mechanical arm to carry out plug-in.
According to the mechanical arm plug-in control device based on reinforcement learning, the plug-in process is split into moving the plug-in to the intermediate point information and inserting the plug-in into the container from the intermediate point information. Because the first reinforcement learning model and the second reinforcement learning model are both models constructed based on reinforcement learning, and no inverse kinematics solution is needed when the mechanical arm is controlled with a reinforcement learning algorithm, the device can effectively reduce the data calculation amount of the plug-in method. Moreover, because the first movement strategy is only used for moving the plug-in to the intermediate point information, it only needs to consider the displacement precision of the plug-in and need not consider the orientation precision of the plug-in during the movement, while the second movement strategy is only used for inserting the plug-in into the container and therefore only needs to focus on the orientation precision of the plug-in. The device can thus further reduce the data calculation amount required by the first reinforcement learning model to generate the first movement strategy and by the second reinforcement learning model to generate the second movement strategy, thereby further reducing the data calculation amount of the plug-in method.
In a third aspect, the present application also provides an electronic device comprising a processor and a memory storing computer readable instructions which, when executed by the processor, perform the steps of the method as provided in the first aspect above.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs steps in a method as provided in the first aspect above.
As can be seen from the foregoing, the present application provides a method, an apparatus, a device and a medium for controlling a mechanical arm plug-in based on reinforcement learning, in which the plug-in process is split into moving the plug-in to the intermediate point information and inserting the plug-in into the container from the intermediate point information. Because the first reinforcement learning model and the second reinforcement learning model are both models constructed based on reinforcement learning, and no inverse kinematics solution is needed when the mechanical arm is controlled with a reinforcement learning algorithm, the method can effectively reduce the data calculation amount of the plug-in method. Moreover, because the first movement strategy is only used for moving the plug-in to the intermediate point information, it only needs to consider the displacement precision of the plug-in and need not consider the orientation precision of the plug-in during the movement, while the second movement strategy is only used for inserting the plug-in into the container and therefore only needs to focus on the orientation precision of the plug-in. The method can thus further reduce the data calculation amount required by the first reinforcement learning model to generate the first movement strategy and by the second reinforcement learning model to generate the second movement strategy, thereby further reducing the data calculation amount of the plug-in method.
Drawings
Fig. 1 is a flowchart of a method for controlling a mechanical arm plug-in based on reinforcement learning according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a mechanical arm plug-in control device based on reinforcement learning according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals: 201. setting a module; 202. a first training module; 203. a second training module; 204. deploying a module; 31. a processor; 32. a memory; 33. a communication bus.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
Referring to fig. 1, an embodiment of the present application provides a method for controlling a mechanical arm insert based on reinforcement learning, which is used for controlling the mechanical arm insert, and includes the following steps:
A1. setting intermediate point information according to the pose information of the container and the preset distance, wherein the intermediate point information is information of a preset point positioned right above the container model;
A2. training a first reinforcement learning model based on random first initialization pose and intermediate point information, wherein the first reinforcement learning model is used for generating a first movement strategy, and the first movement strategy is used for controlling the movement of the mechanical arm model to enable the plug-in model on the tail end of the mechanical arm model to move to the intermediate point information;
A3. training a second reinforcement learning model based on the random second initialization pose, the trained first reinforcement learning model, the intermediate point information and the container pose information, the second reinforcement learning model being used for generating a second movement strategy for controlling the mechanical arm model to insert the plug-in model on the end of the mechanical arm model into the container model when the first movement strategy controls the mechanical arm model to move the plug-in model on the end of the mechanical arm model to the intermediate point information;
A4. And deploying the trained first reinforcement learning model and the trained second reinforcement learning model to a control end of the mechanical arm so as to control the mechanical arm to carry out plug-in.
The mechanical arm is a mechanical arm capable of moving with multiple degrees of freedom; the mechanical arm in the embodiment of the application is preferably a six-degree-of-freedom mechanical arm, and specifically a UR5 mechanical arm. Before training the first reinforcement learning model and the second reinforcement learning model, the hyperparameters of the first reinforcement learning model and the second reinforcement learning model are respectively set to preset parameters. Because the hyperparameters may include a plurality of parameters, and the preset parameters are preset values, those skilled in the art can change the parameter types included in the preset parameters according to the hyperparameters and change the sizes of the parameters according to actual needs. Preferably, the hyperparameters in this embodiment include a total step number and an iteration step length, the preset parameters include a preset total step number and a preset iteration step length, the preset total step number is 200, and the preset iteration step length is 0.01. In the training process of the first reinforcement learning model and the second reinforcement learning model, the actual iteration number is initially 0; each optimization of the Actor network and the Critic network corresponds to completing one network iteration, at which point the actual iteration number is incremented by 1, and when the actual iteration number exceeds the total step number, one round of training of the first reinforcement learning model or the second reinforcement learning model is completed.
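By way of a non-limiting illustration, the iteration bookkeeping described above may be organized as in the following Python sketch; the variable names, the placement of the counter update and the (unused here) role of the iteration step length are assumptions and do not limit the embodiment:

```python
# Hypothetical sketch of the preset hyperparameters and iteration counting described above;
# names and structure are illustrative assumptions, not the literal implementation.
TOTAL_STEPS = 200     # preset total step number
ITER_STEP = 0.01      # preset iteration step length (its exact use is not detailed here)

def train_one_round(optimize_actor_and_critic):
    """Run one training round; every Actor/Critic optimization counts as one iteration."""
    iterations = 0                      # actual iteration number, initially 0
    while iterations <= TOTAL_STEPS:    # stop once the count exceeds the total step number
        optimize_actor_and_critic()     # one joint update of the Actor and Critic networks
        iterations += 1                 # iteration number is incremented after each update
```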
In step A1, the container is a device for accommodating the plug-in, and the container pose information includes the position (corresponding to the target position) and the pose of the container. The pose of the container may be vertical or inclined (i.e., the insertion port of the container may face vertically upward or obliquely upward); in the embodiment of the present application the pose of the container is preferably vertical, i.e., the insertion port of the container faces vertically upward. The preset distance is a preset value, and a person skilled in the art can adjust the size of the preset distance according to the depth of the container and/or the size of the plug-in; the minimum distance from the intermediate point information to the container is equal to the preset distance. It should be understood that the intermediate point information corresponds to a virtual point, and in order to meet the requirements of the mathematical calculation, at least two preset points are required to represent the virtual point in the simulation environment. Specifically, the intermediate point information is represented by at least two preset points directly above the container model, and the two preset points are arranged symmetrically with the center of the container model as the center of symmetry.
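By way of a non-limiting illustration, the two preset points representing the intermediate point information may be derived from the container pose and the preset distance as in the following Python sketch; the lateral half-spacing between the two points and the axis conventions are assumptions for illustration only:

```python
import numpy as np

def intermediate_points(container_center, preset_distance, half_spacing=0.02):
    """Two preset points a preset distance directly above the container opening,
    placed symmetrically about the vertical axis through the container center."""
    cx, cy, cz = container_center                                  # container opening faces vertically up
    p1 = np.array([cx - half_spacing, cy, cz + preset_distance])   # preset point No. 1
    p2 = np.array([cx + half_spacing, cy, cz + preset_distance])   # preset point No. 2
    return p1, p2

# Example: container center at (0.4, 0.0, 0.1) m with a preset distance of 0.05 m.
p1, p2 = intermediate_points((0.4, 0.0, 0.1), preset_distance=0.05)
```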
In step A2, the first initialization pose is randomly generated; it is the initial pose of the mechanical arm model before starting the plug-in operation and reflects the position and pose of the mechanical arm model before training of the first reinforcement learning model begins. The first reinforcement learning model is a model constructed based on a reinforcement learning algorithm in the computer simulation environment of the embodiment of the present application, and is preferably a model constructed based on the Soft Actor-Critic (SAC) algorithm. Specifically, the Soft Actor-Critic (SAC) algorithm includes an Actor network and a Critic network. The Actor network is configured to generate the first movement strategy according to parameters such as the intermediate point information, the joint angles of the mechanical arm model, the end position of the mechanical arm model and the position of the plug-in model, so that the mechanical arm model moves the plug-in model according to the first movement strategy. The Critic network is configured to generate a first evaluation result according to the pose information of the moved plug-in model and the intermediate point information, and the first evaluation result reflects the quality of the first movement strategy. The training process of the first reinforcement learning model is as follows: generate a first movement strategy with the Actor network, generate a first evaluation result with the Critic network, optimize the Actor network and the Critic network with the first evaluation result, and repeat these steps with the optimized Actor network and Critic network until the first movement strategy generated by the Actor network moves the plug-in model on the end of the mechanical arm model to the intermediate point information. Step A2 corresponds to training the first reinforcement learning model in the simulation environment so as to optimize the first reinforcement learning model.
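As a hedged illustration of this training loop (the disclosure does not name a specific software library), the first reinforcement learning model could be trained with an off-the-shelf Soft Actor-Critic implementation such as the one in stable-baselines3; the toy environment below only mimics the interface (observations containing the joint angles, end position, plug-in position and intermediate point) and is an assumption, not the patented simulator:

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces
from stable_baselines3 import SAC

class ReachMidpointEnv(gym.Env):
    """Toy stand-in for the stage-1 simulation environment (illustrative only)."""
    def __init__(self):
        super().__init__()
        # observation: joint angles (6), end position (3), intermediate point (3)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(12,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(6,), dtype=np.float32)  # joint-angle speeds

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return np.zeros(12, dtype=np.float32), {}

    def step(self, action):
        obs = np.zeros(12, dtype=np.float32)
        reward = 0.0      # in the real set-up this would be the first reward plus the penalty term
        return obs, reward, True, False, {}

model_stage1 = SAC("MlpPolicy", ReachMidpointEnv(), verbose=0)   # Actor and Critic networks
model_stage1.learn(total_timesteps=200)                          # preset total step number
```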
In step A3, the second initialization pose is randomly generated, and the second reinforcement learning model is a model constructed based on a reinforcement learning algorithm in a simulation environment of a computer according to the embodiment of the present application, and the second reinforcement learning model in the embodiment of the present application is preferably a model constructed based on a Soft Actor-Critic (SAC) algorithm. Specifically, the Soft Actor-Critic (SAC) algorithm includes an Actor network and a Critic network, where the Actor network is configured to generate a second movement policy according to parameters such as middle point information, a joint angle of the mechanical arm model, a terminal position of the mechanical arm model, and a position of the plug-in module, so that the mechanical arm model moves the plug-in module according to the second movement policy, and the Critic network is configured to generate a second evaluation result according to pose information of the moved plug-in module and pose information of the container, where the second evaluation result can reflect quality of the second movement policy. The working flow of the step A3 is as follows: the pose of the mechanical arm model is adjusted to be a second initialization pose, then the first reinforcement learning model trained in the step A2 is utilized to generate a first movement strategy according to the intermediate point information so as to move the plug-in model to the intermediate point information, and then the second reinforcement learning model is trained, so that the second movement strategy generated by the second reinforcement learning model can be used for controlling the mechanical arm model to insert the plug-in model into the container model. The training process of the second reinforcement learning model is as follows: generating a second movement strategy by using the Actor network, generating a second evaluation result by using the Critic network, optimizing the Actor network and the Critic network by using the second evaluation result, and repeatedly executing the steps by using the optimized Actor network and the optimized Critic network until the second movement strategy generated by the Actor network enables the mechanical arm model to insert the plug-in model into the container model. Step A3 is equivalent to controlling the mechanical arm model to pre-move to the middle point information position based on the trained first reinforcement learning model in the simulation environment, and then training the second reinforcement learning model to optimize the second reinforcement learning model, so that a second movement strategy generated by the second reinforcement learning model is equivalent to adjusting the orientation angle of the plug-in model, and the plug-in model is vertically inserted into the container model. 
It should be understood that the simulation environment includes a mechanical arm model, an insert model, a first reinforcement learning model, a second reinforcement learning model and a container model, where the mechanical arm model, the insert model and the container model are equivalent to a physical model, the insert model is located at an end of the mechanical arm model, the first reinforcement learning model and the second reinforcement learning model are equivalent to a control model of the mechanical arm model, and the first reinforcement learning model and the second reinforcement learning model are both used for controlling the mechanical arm model to perform simulation motion so as to move the insert model.
In step A4, the deployment of the trained first reinforcement learning model and the trained second reinforcement learning model to the control end of the mechanical arm to control the mechanical arm to perform the plug-in belongs to the prior art, and will not be discussed in detail here. It should be appreciated that the robotic arm model corresponds to a robotic arm, the insert model corresponds to an insert, and the container model corresponds to a container, such that upon deployment of the first reinforcement learning model and the second reinforcement learning model to the robotic arm control end, a first movement strategy generated by the first reinforcement learning model enables the robotic arm to move the insert to the midpoint information, and a second movement strategy generated by the second reinforcement learning model enables the robotic arm to insert the insert on its end from the midpoint information into the container. It should also be appreciated that, since the robotic arm moves the insert from the initial position to the midpoint information first, and then inserts the insert from the midpoint information into the container, i.e., the robotic arm performs the first movement strategy first and then the second movement strategy, the priority of the first reinforcement learning model is preferably higher than the priority of the second reinforcement learning model.
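As a hedged sketch of the deployed two-stage control (the robot interface, the predicate functions and the use of a predict() method are illustrative assumptions, not the literal control-end implementation), the first movement strategy is executed first and the second movement strategy afterwards:

```python
def control_insertion(robot, model_stage1, model_stage2, at_intermediate_point, inserted):
    # Stage 1 (higher priority): the first movement strategy moves the plug-in to the intermediate point.
    while not at_intermediate_point(robot):
        action, _ = model_stage1.predict(robot.observe(), deterministic=True)
        robot.apply_joint_speeds(action)
    # Stage 2: the second movement strategy adjusts the orientation and inserts the plug-in into the container.
    while not inserted(robot):
        action, _ = model_stage2.predict(robot.observe(), deterministic=True)
        robot.apply_joint_speeds(action)
```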
The working principle of this embodiment is as follows: the first reinforcement learning model is used to generate the first movement strategy and the mechanical arm is controlled to move the plug-in to the intermediate point information according to the first movement strategy; the second reinforcement learning model is then used to generate the second movement strategy and the mechanical arm is controlled to insert the plug-in into the container according to the second movement strategy. That is, this embodiment is equivalent to splitting the plug-in process into moving the plug-in to the intermediate point information and inserting the plug-in into the container from the intermediate point information. Because the first reinforcement learning model and the second reinforcement learning model are both models constructed based on reinforcement learning, no inverse kinematics solution is needed when the mechanical arm is controlled with a reinforcement learning algorithm, so the method effectively avoids the large calculation amount caused by the uncertainty of the inverse kinematics solution; that is, the method can effectively reduce the data calculation amount of the plug-in method. Because the first movement strategy is only used for moving the plug-in to the intermediate point information, it only needs to consider the displacement precision of the plug-in and need not consider the orientation precision of the plug-in during the movement (that is, the pose of the plug-in need not be considered while the plug-in moves to the intermediate point information), and because the second movement strategy is only used for inserting the plug-in into the container, it only needs to consider the orientation precision of the plug-in. The method can therefore further reduce the data calculation amount required by the first reinforcement learning model to generate the first movement strategy and by the second reinforcement learning model to generate the second movement strategy, thereby further reducing the data calculation amount of the plug-in method. In addition, because the first reinforcement learning model is trained based on a random first initialization pose and the second reinforcement learning model is trained based on a random second initialization pose, the robustness of the first reinforcement learning model and the second reinforcement learning model can be effectively improved, and adopting the multi-stage reinforcement learning strategy allows the two models to compensate for each other's shortcomings, so the movement precision and the movement efficiency of the mechanical arm can be effectively improved. This embodiment corresponds to splitting the complex calculation process of the existing plug-in method into two calculation processes that respectively focus on the displacement precision of the plug-in and the orientation precision of the plug-in.
According to the mechanical arm plug-in control method based on reinforcement learning, the plug-in process is split into moving the plug-in to the intermediate point information and inserting the plug-in into the container from the intermediate point information. Because the first reinforcement learning model and the second reinforcement learning model are both models constructed based on reinforcement learning, and no inverse kinematics solution is needed when the mechanical arm is controlled with a reinforcement learning algorithm, the data calculation amount of the plug-in method can be effectively reduced. Moreover, because the first movement strategy is only used for moving the plug-in to the intermediate point information, it only needs to consider the displacement precision of the plug-in and need not consider the orientation precision of the plug-in during the movement, while the second movement strategy is only used for inserting the plug-in into the container and therefore only needs to pay attention to the orientation precision of the plug-in. The data calculation amount required by the first reinforcement learning model to generate the first movement strategy and by the second reinforcement learning model to generate the second movement strategy can thus be further reduced, and the data calculation amount of the plug-in method can be further reduced.
In some embodiments, the training process for training the first reinforcement learning model based on the random first initialization pose and the intermediate point information is:
training a first reinforcement learning model based on the random first initialization pose, the intermediate point information, a first reward function and a penalty function, wherein the first reward function is used for outputting a first reward output value according to the distance between the end of the mechanical arm model and the preset point located directly above the container model, and the penalty function is used for outputting a penalty output value according to the movement speed of the joint angles of the mechanical arm model during the simulated movement.
The first reward function outputs the first reward output value according to the distance between the end of the mechanical arm model and the preset point located directly above the container model. The smaller the absolute value of this distance, the closer the orientation of the plug-in model is to the first target orientation (the ideal orientation when the plug-in model has moved to the intermediate point information, which preferably coincides with the orientation of the container model), and the closer the orientation of the plug-in model is to the first target orientation, the more easily the plug-in model is inserted into the container model. Therefore, the smaller the absolute value of the distance between the end of the mechanical arm model and the preset point located directly above the container model, the better; i.e., the first reward output value is inversely related to that absolute value. The penalty function outputs the penalty output value according to the movement speed of the joint angles of the mechanical arm model during the simulated movement. Because the movement speed of the joint angles of the mechanical arm model is positively correlated with the amount of motion of the mechanical arm model, and the smaller the amount of motion needed to move the plug-in model to the intermediate point information, the better the corresponding first movement strategy, a smaller movement speed of the joint angles of the mechanical arm model is better; i.e., the movement speed of the joint angles of the mechanical arm model is positively correlated with the penalty output value.
The process of training the first reinforcement learning model according to this embodiment is: 1. the Actor network generates a first movement strategy according to parameters such as the middle point information, the joint angle of the mechanical arm model, the tail end position of the mechanical arm model, the position of the plug-in model and the like; 2. the Actor network controls the movement of the mechanical arm model according to the first movement strategy; 3. the Critic network outputs a first rewarding output value according to the distance between the tail end of the mechanical arm model and a preset point positioned right above the container model by using a first rewarding function, and outputs a punishment output value according to the movement speed of the joint angle of the mechanical arm model in the simulation movement process by using a punishment function; 4. optimizing network parameters of an Actor network and a Critic network according to the first rewarding output value and the punishment output value respectively; 5. and repeatedly executing the steps by using the optimized Actor network and the optimized Critic network until the total step number of the first reinforcement learning model is greater than the preset maximum step number. It should be understood that optimizing the network parameters of the Actor network and the Critic network according to the first reward output value is equivalent to pre-adjusting the orientation of the plug-in model, so that the orientation of the plug-in model after moving to the intermediate point information is as consistent as possible with the orientation of the container model, and optimizing the network parameters of the Actor network and the Critic network according to the punishment output value is equivalent to optimizing the mechanical arm actions corresponding to the first movement strategy, so as to reduce unnecessary movement of the mechanical arm, so that the mechanical arm can move the plug-in to the intermediate point information more quickly, further effectively improving the movement efficiency of the mechanical arm and reducing the energy consumption of the mechanical arm.
According to this embodiment, the first reward output value and the penalty output value are generated by the first reward function and the penalty function respectively, and the network parameters of the first reinforcement learning model (namely the Actor network and the Critic network) are optimized according to the first reward output value and the penalty output value, so that the orientation of the plug-in model is kept as close as possible to the first target orientation while the amount of arm motion is reduced. This effectively reduces the difficulty of inserting the plug-in model into the container model, avoids excessive movement of the mechanical arm model, effectively improves the movement efficiency of the mechanical arm and reduces the energy consumption of the mechanical arm.
In some preferred embodiments, the number of preset points located directly above the container model is two, and the first reward function, denoted formula (1), outputs the first reward output value $r_1$ as a function, parameterized by the constants $c_1$ and $c_2$, that decreases as the average distance $\lvert d \rvert$ increases, with

$$\lvert d \rvert = \frac{\lvert d_1 \rvert + \lvert d_2 \rvert}{2},\qquad
\lvert d_1 \rvert = \sqrt{(x_1 - x_{t1})^2 + (y_1 - y_{t1})^2 + (z_1 - z_{t1})^2},\qquad
\lvert d_2 \rvert = \sqrt{(x_2 - x_{t2})^2 + (y_2 - y_{t2})^2 + (z_2 - z_{t2})^2} \tag{1}$$

wherein $r_1$ represents the first reward output value, $\lvert d \rvert$ represents the average spatial distance between the two points on the end of the mechanical arm model and the two preset points located directly above the container model, $c_1$ and $c_2$ are constants ($c_1$ and $c_2$ are preset values related to the modelling environment; $c_1$ is preferably 10 and $c_2$ is preferably 230), $\lvert d_1 \rvert$ represents the Euclidean distance between point No. 1 on the end of the mechanical arm model and preset point No. 1 located directly above the container model, $\lvert d_2 \rvert$ represents the Euclidean distance between point No. 2 on the end of the mechanical arm model and preset point No. 2 located directly above the container model, $(x_1, y_1, z_1)$ represents the coordinates of point No. 1 on the end of the mechanical arm model in the spatial coordinate system after the mechanical arm model moves according to the first movement strategy, $(x_{t1}, y_{t1}, z_{t1})$ represents the coordinates of preset point No. 1 located directly above the container model in the spatial coordinate system, $(x_2, y_2, z_2)$ represents the coordinates of point No. 2 on the end of the mechanical arm model in the spatial coordinate system after the mechanical arm model moves according to the first movement strategy, and $(x_{t2}, y_{t2}, z_{t2})$ represents the coordinates of preset point No. 2 located directly above the container model in the spatial coordinate system. It should be understood that the first reward function of this embodiment corresponds to determining the orientation of the plug-in model based on the average distance between the points on the end of the mechanical arm model and the preset points located directly above the container model. The number of points on the end of the mechanical arm model is the same as the number of preset points located directly above the container model, and those skilled in the art can change the number and positions of the points on the end of the mechanical arm model and of the preset points located directly above the container model according to actual needs and adjust formula (1) accordingly; for example, if points No. 1, No. 2 and No. 3 are provided on the end of the mechanical arm model and directly above the container model, the average spatial distance must be calculated from $\lvert d_3 \rvert$ (the Euclidean distance between point No. 3 on the end of the mechanical arm model and preset point No. 3 located directly above the container model) in addition to $\lvert d_1 \rvert$ and $\lvert d_2 \rvert$. It should also be understood that, in order for the two points on the end of the mechanical arm model to be in a rigid relationship with the two preset points located directly above the container model, the distance between the two points on the end of the mechanical arm model is equal to the distance between the two preset points located directly above the container model.
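The following Python sketch instantiates the first reward function under the assumption of a simple linear combination $r_1 = c_1 - c_2 \lvert d \rvert$; only the distance terms and the constants are given by the embodiment, so the combining form is an assumption:

```python
import numpy as np

def first_reward(end_points, preset_points, c1=10.0, c2=230.0):
    """First reward output value r1 from the two end points and the two preset points."""
    d1 = np.linalg.norm(np.asarray(end_points[0]) - np.asarray(preset_points[0]))  # |d1|
    d2 = np.linalg.norm(np.asarray(end_points[1]) - np.asarray(preset_points[1]))  # |d2|
    d = (d1 + d2) / 2.0                                                             # average distance |d|
    return c1 - c2 * d   # assumed form: the reward decreases as the average distance grows
```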
In some preferred embodiments, the training process for training the second reinforcement learning model based on the random second initialization pose, the trained first reinforcement learning model, the intermediate point information and the container pose information is:
training a second reinforcement learning model based on the random second initialization pose, the trained first reinforcement learning model, the intermediate point information, the container pose information, a second reward function and a penalty function, wherein the second reward function is used for outputting a second reward output value according to the distance and direction between the end of the mechanical arm model and the container model, and the penalty function is used for outputting a penalty output value according to the movement speed of the joint angles of the mechanical arm model during the simulated movement.
The Actor network and the Critic network of the second reinforcement learning model randomly generate their network parameters before the second reinforcement learning model is trained. The second reward function outputs the second reward output value according to the distance and direction between the end of the mechanical arm model and the container model. The smaller the distance and the direction deviation between the end of the mechanical arm model and the container model, the closer the orientation of the plug-in model is to the second target orientation (the ideal orientation for successfully inserting the plug-in model into the container model), and the closer the orientation of the plug-in model is to the second target orientation, the more easily the plug-in model is inserted into the container model. Therefore, the smaller the distance and the direction deviation between the end of the mechanical arm model and the container model, the better; i.e., the second reward output value is inversely related to the absolute value of the distance between the end of the mechanical arm model and the container model. The penalty function of this embodiment is the same as in the above embodiment and will not be discussed in detail here.
The process of training the second reinforcement learning model in this embodiment is: 1. the Actor network generates a second movement strategy according to parameters such as the container pose information, the intermediate point information, the joint angles of the mechanical arm model, the end position of the mechanical arm model and the position of the plug-in model; 2. the Actor network controls the movement of the mechanical arm model according to the second movement strategy; 3. the Critic network outputs a second reward output value according to the distance and direction between the end of the mechanical arm model and the container model using the second reward function, and outputs a penalty output value according to the movement speed of the joint angles of the mechanical arm model during training using the penalty function; 4. the network parameters of the Actor network and the Critic network are optimized according to the second reward output value and the penalty output value respectively; 5. the above steps are repeated with the optimized Actor network and the optimized Critic network until the number of iterations of the second reinforcement learning model exceeds the preset total step number. It should be understood that optimizing the network parameters of the Actor network and the Critic network according to the second reward output value is equivalent to pre-adjusting the orientation of the plug-in model, and optimizing the network parameters of the Actor network and the Critic network according to the penalty output value is equivalent to optimizing the mechanical arm actions corresponding to the second movement strategy, so as to reduce unnecessary movement of the mechanical arm and allow it to insert the plug-in into the container more quickly, thereby effectively improving the movement efficiency of the mechanical arm and reducing its energy consumption.
According to this embodiment, the second reward output value and the penalty output value are generated by the second reward function and the penalty function respectively, and the network parameters of the second reinforcement learning model (namely the Actor network and the Critic network) are optimized according to the second reward output value and the penalty output value, so that the orientation of the plug-in model is kept as close as possible to the second target orientation while the amount of arm motion is reduced. This effectively reduces the difficulty of inserting the plug-in model into the container model, avoids excessive movement of the mechanical arm model, effectively improves the movement efficiency of the mechanical arm and reduces the energy consumption of the mechanical arm.
In some preferred embodiments, the second reward function, denoted formula (2), outputs the second reward output value $r_2$ by combining a distance reward $r_x$ and a directional reward $r_0$, wherein $r_2$ represents the second reward output value, $r_x$ represents a reward value for the distance between the end of the mechanical arm model and a preset point located within the container model ($r_x$ is calculated in a manner similar to $r_1$), $r_0$ represents a reward value for the directional agreement between the end of the mechanical arm model and the container model, $(x, y, z)$ represents the unit vector of the end of the mechanical arm model after the mechanical arm model moves according to the second movement strategy, and $(x_t, y_t, z_t)$ represents the unit vector of the container model.
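The following Python sketch instantiates the second reward function under two assumptions: the distance term $r_x$ reuses the same linear form as the first reward, and the directional term $r_0$ is the dot product of the two unit vectors; neither form is stated explicitly in the embodiment:

```python
import numpy as np

def second_reward(end_pos, in_container_point, end_unit_vec, container_unit_vec,
                  c1=10.0, c2=230.0):
    """Second reward output value r2 = rx + r0 (additive combination is an assumption)."""
    # distance reward rx, computed analogously to the first reward function
    r_x = c1 - c2 * np.linalg.norm(np.asarray(end_pos) - np.asarray(in_container_point))
    # directional reward r0: alignment between the end's unit vector and the container's unit vector
    r_0 = float(np.dot(end_unit_vec, container_unit_vec))
    return r_x + r_0
```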
In some preferred embodiments, the penalty function, denoted formula (3), outputs the penalty output value $r_p$ from the movement speeds of the joint angles of the mechanical arm model, wherein $r_p$ represents the penalty output value and $\lvert a_1 \rvert$ to $\lvert a_6 \rvert$ represent the absolute values of the movement speeds of the six joint angles of the mechanical arm model.
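The following Python sketch instantiates the penalty term as the sum of the absolute joint-angle speeds $\lvert a_1 \rvert$ to $\lvert a_6 \rvert$; the embodiment only states a positive correlation, so the exact form and scaling are assumptions:

```python
import numpy as np

def penalty(joint_speeds):
    """Penalty output value rp from the six joint-angle movement speeds (assumed sum form)."""
    a = np.asarray(joint_speeds, dtype=float)
    return float(np.sum(np.abs(a)))   # rp grows with the amount of arm motion
```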
As can be seen from the foregoing, according to the reinforcement learning-based mechanical arm plug-in control method, the plug-in process is split into moving the plug-in to the intermediate point information and inserting the plug-in into the container from the intermediate point information. Because the first reinforcement learning model and the second reinforcement learning model are both models constructed based on reinforcement learning, and no inverse kinematics solution is needed when the mechanical arm is controlled with a reinforcement learning algorithm, the method can effectively reduce the data calculation amount of the plug-in method. Moreover, because the first movement strategy is only used for moving the plug-in to the intermediate point information, it only needs to consider the displacement precision of the plug-in and need not consider the orientation precision of the plug-in during the movement, while the second movement strategy is only used for inserting the plug-in into the container and therefore only needs to focus on the orientation precision of the plug-in. The method can thus further reduce the data calculation amount required by the first reinforcement learning model to generate the first movement strategy and by the second reinforcement learning model to generate the second movement strategy, thereby further reducing the data calculation amount of the plug-in method.
Referring to fig. 2, an embodiment of the present application provides a robot plug-in control device based on reinforcement learning, for controlling a robot plug-in, the device includes:
the setting module 201 is configured to set intermediate point information according to the pose information of the container and a preset distance, where the intermediate point information is information of a preset point located right above the container model;
a first training module 202, configured to train a first reinforcement learning model based on random first initialization pose and intermediate point information, where the first reinforcement learning model is configured to generate a first movement strategy, and the first movement strategy is configured to control movement of the manipulator model to move the plug-in model at the end of the manipulator model to the intermediate point information;
a second training module 203, configured to train a second reinforcement learning model based on the random second initialization pose, the trained first reinforcement learning model, the intermediate point information, and the container pose information, where the second reinforcement learning model is configured to generate a second movement strategy, and the second movement strategy is configured to control the mechanical arm model to insert the insert model on its end into the container model when the first movement strategy controls the mechanical arm model to move the insert model on its end to the intermediate point information;
The deployment module 204 is configured to deploy the trained first reinforcement learning model and the trained second reinforcement learning model to the control end of the mechanical arm to control the mechanical arm to perform the plug-in.
The working principle of the device is the same as that of the mechanical arm plug-in control method based on reinforcement learning provided in the first aspect, and will not be discussed in detail here.
According to the mechanical arm plug-in control device based on reinforcement learning, the plug-in process is split into moving the plug-in to the intermediate point information and inserting the plug-in into the container from the intermediate point information. Because the first reinforcement learning model and the second reinforcement learning model are both models constructed based on reinforcement learning, and no inverse kinematics solution is needed when the mechanical arm is controlled with a reinforcement learning algorithm, the device can effectively reduce the data calculation amount of the plug-in method. Moreover, because the first movement strategy is only used for moving the plug-in to the intermediate point information, it only needs to consider the displacement precision of the plug-in and need not consider the orientation precision of the plug-in during the movement, while the second movement strategy is only used for inserting the plug-in into the container and therefore only needs to focus on the orientation precision of the plug-in. The device can thus further reduce the data calculation amount required by the first reinforcement learning model to generate the first movement strategy and by the second reinforcement learning model to generate the second movement strategy, thereby further reducing the data calculation amount of the plug-in method.
In a third aspect, referring to fig. 3, fig. 3 shows an electronic device provided in the present application, including: a processor 31 and a memory 32, the processor 31 and the memory 32 being interconnected and communicating with each other through a communication bus 33 and/or another form of connection mechanism (not shown), the memory 32 storing computer-readable instructions executable by the processor 31; when the electronic device operates, the processor 31 executes these instructions to perform the method in any of the alternative implementations of the above embodiments, so as to implement the following functions: setting intermediate point information according to the pose information of the container and the preset distance, wherein the intermediate point information is information of a preset point positioned right above the container model; training a first reinforcement learning model based on a random first initialization pose and the intermediate point information, wherein the first reinforcement learning model is used for generating a first movement strategy, and the first movement strategy is used for controlling the movement of the mechanical arm model so that the plug-in model on the tail end of the mechanical arm model moves to the intermediate point information; training a second reinforcement learning model based on a random second initialization pose, the trained first reinforcement learning model, the intermediate point information and the container pose information, wherein the second reinforcement learning model is used for generating a second movement strategy, and the second movement strategy is used for controlling the mechanical arm model to insert the plug-in model on its end into the container model when the first movement strategy controls the mechanical arm model to move the plug-in model on its end to the intermediate point information; and deploying the trained first reinforcement learning model and the trained second reinforcement learning model to the control end of the mechanical arm so as to control the mechanical arm to carry out the plug-in.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor 31, performs a method in any of the alternative implementations of the above embodiments to implement the following functions: setting intermediate point information according to the pose information of the container and the preset distance, wherein the intermediate point information is information of a preset point positioned right above the container model; training a first reinforcement learning model based on random first initialization pose and intermediate point information, wherein the first reinforcement learning model is used for generating a first movement strategy, and the first movement strategy is used for controlling the movement of the mechanical arm model to enable the plug-in model on the tail end of the mechanical arm model to move to the intermediate point information; training a second reinforcement learning model based on the random second initialization pose, the trained first reinforcement learning model, the intermediate point information and the container pose information, the second reinforcement learning model being used for generating a second movement strategy for controlling the mechanical arm model to insert the plug-in model on the end of the mechanical arm model into the container model when the first movement strategy controls the mechanical arm model to move the plug-in model on the end of the mechanical arm model to the intermediate point information; and deploying the trained first reinforcement learning model and the trained second reinforcement learning model to a control end of the mechanical arm so as to control the mechanical arm to carry out plug-in.
As can be seen from the foregoing, the present application provides a reinforcement learning-based mechanical arm plug-in control method, device, equipment and medium, in which the plug-in process is split into moving the plug-in to the intermediate point and inserting the plug-in from the intermediate point into the container. Because the first reinforcement learning model and the second reinforcement learning model are both constructed based on reinforcement learning, and controlling the mechanical arm with a reinforcement learning algorithm requires no inverse kinematics solving, the scheme can effectively reduce the data calculation amount of the plug-in process. Moreover, because the first movement strategy is only used to move the plug-in to the intermediate point, it only needs to consider the displacement precision of the plug-in and need not consider its orientation precision during the movement, while the second movement strategy is only used to insert the plug-in into the container and only needs to focus on the orientation precision of the plug-in. The data calculation amount required by the first reinforcement learning model to generate the first movement strategy and by the second reinforcement learning model to generate the second movement strategy is therefore reduced, which further reduces the data calculation amount of the plug-in process.
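To make the reward shaping behind the two movement strategies concrete, the sketch below computes the quantities used to train the two models: the average Euclidean distance between two points on the arm end and two preset points above the container (first reward), a distance-plus-direction term between the arm end and the container (second reward), and a joint-speed penalty. Only the quantities themselves follow the description; the exponential and additive combining forms and the constants are assumptions made for this example:

```python
import numpy as np

def first_reward(end_points, preset_points, c1=1.0, c2=1.0):
    """First reward: based on the average Euclidean distance |d| between two
    points on the arm end and the two preset points above the container model."""
    d1 = np.linalg.norm(np.asarray(end_points[0]) - np.asarray(preset_points[0]))
    d2 = np.linalg.norm(np.asarray(end_points[1]) - np.asarray(preset_points[1]))
    d = 0.5 * (d1 + d2)
    return c1 * np.exp(-c2 * d)            # assumed form: reward grows as |d| shrinks

def second_reward(end_unit_vec, container_unit_vec, distance_reward, r0=1.0):
    """Second reward: combines a distance reward with a direction term between
    the unit vector of the arm end and the unit vector of the container model."""
    alignment = float(np.dot(end_unit_vec, container_unit_vec))   # 1.0 when aligned
    return distance_reward + r0 * alignment                       # assumed combining form

def speed_penalty(joint_speeds, k=0.01):
    """Penalty that increases with the motion speeds |a_1|..|a_6| of the joint
    angles during the simulated motion."""
    return -k * float(np.sum(np.abs(joint_speeds)))                # assumed form
```

In training, a reward value of this kind and the penalty output value would be combined at each simulation step to form the signal that updates the corresponding reinforcement learning model.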
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division into units is merely a logical functional division, and there may be other manners of division in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, devices or units, and may be in electrical, mechanical or other form.
Further, the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Furthermore, functional modules in various embodiments of the present application may be integrated together to form a single portion, or each module may exist alone, or two or more modules may be integrated to form a single portion.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above is only an example of the present application, and is not intended to limit the scope of the present application, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (9)

1. A reinforcement learning-based mechanical arm plug-in control method for controlling a mechanical arm plug-in, characterized by comprising the following steps:
setting intermediate point information according to the pose information of the container and the preset distance, wherein the intermediate point information is information of a preset point positioned right above the container model;
training a first reinforcement learning model based on a random first initialization pose and the intermediate point information, wherein the first reinforcement learning model is used for generating a first movement strategy, and the first movement strategy is used for controlling the movement of the mechanical arm model to enable the plug-in model on the tail end of the mechanical arm model to move to the intermediate point information;
training a second reinforcement learning model based on a random second initialization pose, the trained first reinforcement learning model, the intermediate point information and the container pose information, wherein the second reinforcement learning model is used for generating a second movement strategy, and the second movement strategy is used for controlling the mechanical arm model to insert the plug-in model on the tail end of the mechanical arm model into the container model when the first movement strategy controls the mechanical arm model to move the plug-in model on the tail end of the mechanical arm model to the intermediate point information;
Deploying the trained first reinforcement learning model and the trained second reinforcement learning model to a control end of the mechanical arm so as to control the mechanical arm to carry out plug-in;
the training process for training the first reinforcement learning model based on the random first initialization pose and the intermediate point information is as follows:
training a first reinforcement learning model based on a random first initialization pose, middle point information, a first reward function and a penalty function, wherein the first reward function is used for outputting a first reward output value according to the distance between the tail end of the mechanical arm model and a preset point positioned right above the container model, and the penalty function is used for outputting a penalty output value according to the motion speed of a joint angle of the mechanical arm model in a simulation motion process;
the first reward function formula is:
[Formula — Figure QLYQS_1: the first reward output value r_1 as a function of the average distance |d| and the constants c_1, c_2]
where r_1 represents the first reward output value; |d| = (|d_1| + |d_2|) / 2 is the average spatial distance between the two points on the end of the mechanical arm model and the two preset points located directly above the container model; c_1 and c_2 are constants; |d_1| = sqrt((x_1 - x_t1)^2 + (y_1 - y_t1)^2 + (z_1 - z_t1)^2) is the Euclidean distance between point No. 1 on the end of the mechanical arm model and preset point No. 1 located directly above the container model; |d_2| = sqrt((x_2 - x_t2)^2 + (y_2 - y_t2)^2 + (z_2 - z_t2)^2) is the Euclidean distance between point No. 2 on the end of the mechanical arm model and preset point No. 2 located directly above the container model; (x_1, y_1, z_1) are the coordinates, in the spatial coordinate system, of point No. 1 on the end of the mechanical arm model after the mechanical arm model moves according to the first movement strategy; (x_t1, y_t1, z_t1) are the coordinates of preset point No. 1 located directly above the container model in the spatial coordinate system; (x_2, y_2, z_2) are the coordinates, in the spatial coordinate system, of point No. 2 on the end of the mechanical arm model after the mechanical arm model moves according to the first movement strategy; and (x_t2, y_t2, z_t2) are the coordinates of preset point No. 2 located directly above the container model in the spatial coordinate system.
2. The reinforcement learning-based robotic arm insert control method of claim 1, wherein the number of preset points located directly above the container is two.
3. The reinforcement learning-based mechanical arm plug-in control method according to claim 1, wherein the training process for training the second reinforcement learning model based on the random second initialization pose, the trained first reinforcement learning model, the midpoint information and the container pose information is as follows:
training a second reinforcement learning model based on a random second initialization pose, the trained first reinforcement learning model, the intermediate point information, the container pose information, a second reward function and a penalty function, wherein the second reward function is used for outputting a second reward output value according to the distance and the direction between the end of the mechanical arm model and the container model, and the penalty function is used for outputting a penalty output value according to the motion speed of the joint angles of the mechanical arm model during the simulated motion.
4. The reinforcement learning-based mechanical arm plug-in control method according to claim 3, wherein the second reward function formula is:
[Formula — Figure QLYQS_2: the second reward output value r_2 expressed in terms of the distance reward r_x and the direction reward r_0]
where r_2 represents the second reward output value; r_x represents the reward value for the distance between the end of the mechanical arm model and a preset point located within the container model; r_0 is the reward value for the direction between the end of the mechanical arm model and the container model; (x, y, z) is the unit vector of the end of the mechanical arm model after the mechanical arm model moves according to the second movement strategy; and (x_t, y_t, z_t) is the unit vector of the container model.
5. The reinforcement learning-based mechanical arm plug-in control method according to claim 1 or claim 3, wherein the penalty function formula is:
[Formula — Figure QLYQS_3: the penalty output value r_p expressed in terms of the joint-angle motion speeds |a_1| to |a_6|]
where r_p represents the penalty output value, and |a_1| to |a_6| represent the motion speeds of the respective joint angles of the mechanical arm model.
6. The reinforcement learning-based mechanical arm plug-in control method according to claim 1, wherein the first reinforcement learning model and the second reinforcement learning model are models constructed based on a reinforcement learning algorithm.
7. A reinforcement learning-based mechanical arm plug-in control device for controlling a mechanical arm plug-in, the device comprising:
The setting module is used for setting middle point information according to the pose information of the container and the preset distance, wherein the middle point information is information of a preset point positioned right above the container model;
the first training module is used for training a first reinforcement learning model based on the random first initialization pose and the middle point information, the first reinforcement learning model is used for generating a first movement strategy, and the first movement strategy is used for controlling the movement of the mechanical arm model to enable the plug-in model on the tail end of the mechanical arm model to move to the middle point information;
a second training module, configured to train a second reinforcement learning model based on a random second initialization pose, the trained first reinforcement learning model, the intermediate point information and the container pose information, wherein the second reinforcement learning model is used for generating a second movement strategy, and the second movement strategy is used for controlling the mechanical arm model to insert the plug-in model on its end into the container model when the first movement strategy controls the mechanical arm model to move the plug-in model on its end to the intermediate point information;
the deployment module is used for deploying the trained first reinforcement learning model and the trained second reinforcement learning model to the control end of the mechanical arm so as to control the mechanical arm to carry out plug-in;
The training process for training the first reinforcement learning model based on the random first initialization pose and the intermediate point information is as follows:
training a first reinforcement learning model based on a random first initialization pose, middle point information, a first reward function and a penalty function, wherein the first reward function is used for outputting a first reward output value according to the distance between the tail end of the mechanical arm model and a preset point positioned right above the container model, and the penalty function is used for outputting a penalty output value according to the motion speed of a joint angle of the mechanical arm model in a simulation motion process;
the first reward function formula is:
[Formula — Figure QLYQS_4: the first reward output value r_1 as a function of the average distance |d| and the constants c_1, c_2]
where r_1 represents the first reward output value; |d| = (|d_1| + |d_2|) / 2 is the average spatial distance between the two points on the end of the mechanical arm model and the two preset points located directly above the container model; c_1 and c_2 are constants; |d_1| is the Euclidean distance between point No. 1 on the end of the mechanical arm model and preset point No. 1 located directly above the container model; |d_2| is the Euclidean distance between point No. 2 on the end of the mechanical arm model and preset point No. 2 located directly above the container model; (x_1, y_1, z_1) are the coordinates, in the spatial coordinate system, of point No. 1 on the end of the mechanical arm model after the mechanical arm model moves according to the first movement strategy; (x_t1, y_t1, z_t1) are the coordinates of preset point No. 1 located directly above the container model in the spatial coordinate system; (x_2, y_2, z_2) are the coordinates, in the spatial coordinate system, of point No. 2 on the end of the mechanical arm model after the mechanical arm model moves according to the first movement strategy; and (x_t2, y_t2, z_t2) are the coordinates of preset point No. 2 located directly above the container model in the spatial coordinate system.
8. An electronic device comprising a processor and a memory storing computer readable instructions that, when executed by the processor, perform the steps of the method of any of claims 1-6.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, performs the steps of the method according to any of claims 1-6.
CN202310255934.9A 2023-03-16 2023-03-16 Mechanical arm plug-in control method, device, equipment and medium based on reinforcement learning Active CN115946133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310255934.9A CN115946133B (en) 2023-03-16 2023-03-16 Mechanical arm plug-in control method, device, equipment and medium based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310255934.9A CN115946133B (en) 2023-03-16 2023-03-16 Mechanical arm plug-in control method, device, equipment and medium based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN115946133A CN115946133A (en) 2023-04-11
CN115946133B (en) 2023-06-02

Family

ID=85896274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310255934.9A Active CN115946133B (en) 2023-03-16 2023-03-16 Mechanical arm plug-in control method, device, equipment and medium based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN115946133B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11836577B2 (en) * 2018-11-27 2023-12-05 Amazon Technologies, Inc. Reinforcement learning model training through simulation
US11407118B1 (en) * 2018-12-10 2022-08-09 Joseph E Augenbraun Robot for performing dextrous tasks and related methods and systems
CN111347426B (en) * 2020-03-26 2021-06-04 季华实验室 Mechanical arm accurate placement track planning method based on 3D vision
CN113551661A (en) * 2020-04-23 2021-10-26 曰轮法寺 Pose identification and track planning method, device and system, storage medium and equipment
CN112192614A (en) * 2020-10-09 2021-01-08 西南科技大学 Man-machine cooperation based shaft hole assembling method for nuclear operation and maintenance robot
CN115416024A (en) * 2022-08-31 2022-12-02 北京精密机电控制设备研究所 Moment-controlled mechanical arm autonomous trajectory planning method and system
CN115470934A (en) * 2022-09-14 2022-12-13 天津大学 Sequence model-based reinforcement learning path planning algorithm in marine environment
CN115598975A (en) * 2022-09-22 2023-01-13 哈尔滨工业大学(Cn) Pin hole assembly DDPG reinforcement learning acceleration method based on parameter pre-training

Also Published As

Publication number Publication date
CN115946133A (en) 2023-04-11

Similar Documents

Publication Publication Date Title
EP3621773B1 (en) Viewpoint invariant visual servoing of robot end effector using recurrent neural network
EP1832947A2 (en) Device, program, recording medium and method for robot simulation
JP5750657B2 (en) Reinforcement learning device, control device, and reinforcement learning method
US7421303B2 (en) Parallel LCP solver and system incorporating same
US7937359B1 (en) Method of operation for parallel LCP solver
JP6671694B1 (en) Machine learning device, machine learning system, data processing system, and machine learning method
Jacobs et al. A generalized god-object method for plausible finger-based interactions in virtual environments
WO2022197252A9 (en) Autonomous driving methods and systems
CN114952868A (en) 7-degree-of-freedom SRS (sounding reference Signal) type mechanical arm control method and device and piano playing robot
CN115946133B (en) Mechanical arm plug-in control method, device, equipment and medium based on reinforcement learning
CN110322558A (en) A kind of simplified method of the grid model folded based on fireworks algorithm tri patch
US11207773B2 (en) Action transfer device, action transfer method, and non-transitory computer-readable medium storing action transfer program
CN115107021A (en) Mechanical arm path planning rapid prototyping system
JP7074723B2 (en) Learning equipment and programs
JP2008242859A (en) Motion control device for object, motion control method, and computer program
JP7179672B2 (en) Computer system and machine learning method
US6798414B2 (en) Animation generation method and apparatus
Yu et al. Generalizable whole-body global manipulation of deformable linear objects by dual-arm robot in 3-D constrained environments
CN117162103B (en) Redundant robot self-collision avoidance control method
JP7458521B2 (en) Method and apparatus for optimizing the execution of functions on a control system
CN114571456B (en) Electric connector assembling method and system based on robot skill learning
CN113218399B (en) Maze navigation method and device based on multi-agent layered reinforcement learning
Feng et al. Reinforcement Learning-Based Impedance Learning for Robot Admittance Control in Industrial Assembly
JP2008262288A (en) Estimation device
Wang et al. Reinforcement Learning based End-to-End Control of Bimanual Robotic Coordination

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant