CN116985151B

CN116985151B - Reinforced learning obstacle avoidance planning and training method for mechanical arm in constraint truss

Info

Publication number: CN116985151B
Application number: CN202311271561.0A
Authority: CN
Inventors: 贺亮; 侯月阳; 卢山; 张文婧; 张世源; 宋婷
Original assignee: Shanghai Aerospace Control Technology Institute; Taicang Yangtze River Delta Research Institute of Northwestern Polytechnical University
Current assignee: Shanghai Aerospace Control Technology Institute; Taicang Yangtze River Delta Research Institute of Northwestern Polytechnical University
Priority date: 2023-09-28
Filing date: 2023-09-28
Publication date: 2024-01-26
Anticipated expiration: 2043-09-28
Also published as: CN116985151A

Abstract

The invention discloses a reinforcement learning obstacle avoidance planning and training method for a mechanical arm in a constraint truss, which comprises the following steps: s1, determining DH parameters of the mechanical arm and the relative position relation between the constraint truss and the mechanical arm; s2, constructing a digital twin training scene of the mechanical arm; s3, building a mechanical arm kinematics model; s4, according to the results of the steps S2 and S3, training S5 for completing discrete point simulation learning is carried out, and according to the results of the step S4, reinforcement learning training of other positions in the space is completed; s6, training of the whole scene is completed according to the results of the steps S4 and S5, a physical system for training operation of the mechanical arm is built, and one-to-one digital twin and demonstration test of operation of the mechanical arm is achieved. According to the invention, reinforcement learning is combined with imitation learning, so that the mechanical arm does not learn from zero, a human operation demonstration sample is given to the mechanical arm, and reinforcement learning is performed on the basis of learning human demonstration, so that the training speed is greatly increased, and the mechanical arm exceeding the current level can be obtained.

Description

Reinforced learning obstacle avoidance planning and training method for mechanical arm in constraint truss

Technical Field

The invention belongs to the technical field of mechanical arm control, and particularly relates to a reinforcement learning obstacle avoidance planning and training method for a mechanical arm in a constraint truss.

Background

In recent years, artificial intelligence techniques typified by machine learning have been developed extremely rapidly, and have been increasingly used successfully in the fields of computer vision, speech recognition, robotics, and the like. At present, research and application of the space manipulator artificial intelligence technology are still in the early development stage. In face of a series of challenges brought by factors such as large space-to-earth transmission delay and high dynamic change of space environment, artificial intelligence becomes an important support technology for follow-up space on-orbit control tasks mainly comprising spacecraft on-orbit assembly and maintenance, space rubbish cleaning and the like. By combining with the artificial intelligence technology, the space manipulator has the capacity of autonomously completing space intelligent sensing, planning and control, and can remarkably improve the real-time performance, accuracy, reliability and safety of space control tasks and the working efficiency of completing on-orbit tasks.

At present, training of the intelligent control method for obstacle avoidance of the space manipulator is mainly realized by reinforcement learning, for example:

1. the mechanical arm obstacle avoidance path planning research (Li Anchuang) based on deep reinforcement learning trains the motion of the three-degree-of-freedom welding mechanical arm offline through a three-layer neural network, and the mechanical arm facing on-orbit fine operation is generally an omnibearing mechanical arm, namely, at least has six degrees of freedom, and the reinforcement learning can sink into dimension disasters after the degrees of freedom are increased, the length is increased greatly during training, and even training fails.

2. The layered reinforcement learning research and the application (Jin Xudong) thereof in the problem of obstacle avoidance of the mechanical arm are similar, the reinforcement learning obstacle avoidance training is also carried out on the mechanical arm with three degrees of freedom, the concept of the three degrees of freedom indicated in the paper as a redundant mechanical arm is inaccurate, the redundancy degree of freedom of the all-round operation mechanical arm is six degrees of freedom, the layered structure is not fully explored in the text, the solving quality cannot be ensured, and the problem of dimension disaster is still difficult to solve if the exploration is excessive.

3. The method (Shi Minhao) for planning and learning the mechanical arm reinforcement learning task based on the user guidance indicates that more tasks need to consume more time of the user to perform the feedback of the execution result, and is difficult to train efficiently and has long training time.

Therefore, the existing mechanical arm obstacle avoidance technology is still to be improved.

Disclosure of Invention

The invention aims to: in order to overcome the defects, the invention aims to provide a reinforcement learning obstacle avoidance planning and training method for a mechanical arm in a constraint truss, which combines reinforcement learning with imitation learning, so that the mechanical arm does not learn from zero, a demonstration sample for human operation is given to the mechanical arm, and reinforcement learning is carried out on the basis of learning human demonstration, thereby not only greatly accelerating the training speed, but also obtaining the mechanical arm exceeding the current level.

The technical scheme is as follows: in order to achieve the above purpose, the invention provides a reinforcement learning obstacle avoidance planning and training method for a mechanical arm in a constraint truss, which comprises the following steps: the specific reinforcement learning obstacle avoidance planning and training method is as follows:

s1): setting initial conditions, and determining DH parameters of the mechanical arm and the relative position relation between the constraint truss and the mechanical arm; the relative position relationship between the constraint truss and the mechanical arm is that the mechanical arm needs to move and operate in the truss, but can not touch the truss, and joints of the mechanical arm can not collide with each other;

s2): according to the setting of the step S1), the construction of a digital twin training scene of the mechanical arm is completed, namely, a digital twin scene of obstacle avoidance operation of the mechanical arm is established by adopting view modeling software, and the construction of the digital twin scene comprises three parts of model optimization and processing, collision detection design and development of a graphical user interface;

s3): according to the setting of the step S1), the establishment of a mechanical arm kinematic model is completed, namely, according to the DH parameters of the mechanical arm, transformation matrixes of all joints can be obtained, and the transformation matrixes of all joints are multiplied in sequence to obtain a positive kinematic formula;

s4): according to the results of the steps S2) and S3), training of discrete point imitation learning is completed; namely, intelligent training is carried out on the mechanical arm by adopting a simulated learning scheme; adding a component for recording a manual demonstration sample on the digital twin mechanical arm, operating a running scene during manual recording, enabling a user to drag the tail end of the mechanical arm to move in the scene, ensuring no collision between mechanical arm joints, ensuring no collision between the mechanical arm joints and a constraint truss, enabling each track to reach a target position from an initial position, dragging to a proper number of times, storing a sample, enabling the mechanical arm to acquire a track given by the user, and enabling the mechanical arm to describe the simulated ability by generating a multi-layer neural network obtained by anti-simulation learning;

S5): according to the result of the step S4), the deep reinforcement learning training of other positions in the space is completed; the deep reinforcement learning training outputs specific action according to state variable state of the environment, and updates parameters of the neural network according to rewards reward obtained by the environment according to the action; using two sets of neural networks of eval and target to represent a strategy function actor and a value function critic; the Actor receives the environment information, outputs corresponding action variables, and the critic network calculates a reward value according to the corresponding action variables;

s6): according to the results of the steps S4) and S5), the training of the whole scene is completed, a physical system for the training operation of the mechanical arm is built, and the demonstration test of one-to-one digital twin and the physical mechanical arm operation is realized.

The mechanical arm in the constraint truss reinforcement learning obstacle avoidance planning and training method is characterized in that the mechanical arm adopts a 6-joint mechanical arm and can be operated in all directions, and DH parameters are set as follows:

s101): for each connecting rodi(i＝1、…，n-1) completing steps S102) to S105); the connecting rod 1 is a base;

s102): each connecting rodiEstablishing a coordinate system; establishing a connecting rodiIs of the coordinate system of (2)zThe shaft is a joint shaft, which is a jointiThe positive direction of the motion axis of +1 is z _i A shaft;

s103): establishing a connecting rodiOrigin of coordinate systemO _i : if it isz _i Shaft and method for producing the samez _i-1 The axes intersect, then take intersection point of two axes as origin; if it isz _i Shaft and method for producing the samez _i-1 The axes are different or parallel, and the common perpendicular lines of the two axes are used for connecting withz _i The intersection point of the axes is the origin;

s104): establishing a connecting rodiOf a coordinate systemxShafts, i.e.x _i Shaft pressEstablishment ofx _i Shafts, i.e.x _i Shaft and method for producing the samez _i-1 Shaft and method for manufacturing the samez _i The axes are vertical at the same time; if it isz _i-1 Shaft and method for producing the samez _i Axes being parallel, the common perpendicular line thereof beingx _i A shaft;

s105): establishing a connecting rodiOf a coordinate systemyShafts, i.e.y _i A shaft according to establishedx _i Shaft and method for producing the samez _i Shaft, set up according to right hand ruley _i Shaft, i.e. order；

Definition: torsion angle of rod pieceα _i : winding machinex _i The shaft rotates, fromz _i-1 The shaft rotates toz _i The rotation angle of the shaft;

length of roda _i : edge of the framex _i Shaft, slavez _i-1 The shaft moves toz _i The distance of the axis;

joint distanced _i : edge of the framez _i-1 Shaft, slavex _i-1 The shaft moves tox _i The distance of the axis;

joint cornerθ _i : winding machinez _i-1 The shaft rotates, fromx _i-1 The shaft rotates tox _i Angle of the shaft.

The invention relates to a reinforcement learning obstacle avoidance planning and training method for a mechanical arm in a constraint truss, wherein in the step S2), model optimization and processing are conducted by leading the mechanical arm and the constraint truss in through three-dimensional modeling software, and the optimization and processing are conducted, and the specific optimization and processing process is as follows:

s201): coarse optimization, namely carrying out large-scale surface reduction on large parts in the model at the fastest speed so as to avoid excessive calculated amount remained in the fine optimization stage;

S202): the parts are optimized, namely after entering the parts, various parts are split, some parts hidden inside are directly eliminated, and some regular parts can be directly simplified; performing iterative optimization on the influence degree of the part view with the complex boundary on the overall shape of the spacecraft until the area reduction rate is converged to 0%; such as an elongated cylinder, can be directly simplified to a prism;

s203): the fine optimization, namely, for some small parts, can be performed with model iterative optimization until the face reduction rate optimization converges to 0%, so that a large amount of memory can be saved and the software fluency can be obviously improved. For example, the screw head model can be iteratively optimized, the screw rod is directly simplified into a quadrangular prism, a large amount of memory can be saved, the software fluency is obviously improved, the original model is 1G, the original model is 24MB after the optimization, the occupied storage space after the face reduction is reduced to 1/50 of the original size, and the details are not obviously lost.

The method for model iterative optimization of the small part in the fine optimization process in the step S203) is as follows:

s2031): setting an initial value of the number of faces of all small parts except the mechanical arm rod piece and the constraint truss after face reduction and model similarity of all small parts after face reduction, wherein the model similarity is defined as the volume of the small parts after face reduction/the volume before face reduction;

S2032): subtracting the surface according to the initial value of the surface number after subtracting the surface of the set small part;

s2033): calculating the model similarity after subtracting the surface, comparing with the model similarity set value, and if the model similarity set value is greater than the model similarity set value, turning to step S2034);

s2034): increasing the initial value of the number of the surfaces of the small part after the surface is subtracted, and then turning to the step S2032), and iterating until the end;

s2035): and if the model similarity after the face reduction is smaller than or equal to the model similarity set value, finishing the face reduction. It is noted that the small parts are screws

The invention relates to a reinforcement learning obstacle avoidance planning and training method for a mechanical arm in a constraint truss, wherein the surface reduction process is as follows:

(a) The method comprises the following steps Acquiring an initial total surface number of a model, and determining an initial value of the total surface number after surface subtraction;

(b) The method comprises the following steps Considering that the mechanical arm and the constraint truss in the scene are of a long rod type structure, calculating the linear density of the number of surfaces of each part of the model, namely the part linear density, wherein the part linear density is the number of surfaces of the model in unit length in the length direction of each part, the index clearly shows the complexity degree of the rod-shaped object, and then calculating the model bus density, namely the model bus density = the total number of surfaces of the model/the total length of each part of the model;

(c) The method comprises the following steps The number of the surfaces of each part after subtracting the surface can be determined according to the initial total surface number of the model, the initial value of the total surface number after subtracting the surface, the model bus density and the part linear density, and the calculation formula is as follows: the initial value of the total number of the surfaces of the parts after the surface reduction is multiplied by the initial total number of the surfaces of the model, and the linear density of the parts is multiplied by the bus density of the model;

(d) The method comprises the following steps Since the linear density is introduced, the subtracting face of each part is not linearly calculated, the total face number after subtracting face is calculated and possibly different from the initial value of the total face number after subtracting face, and further processing is needed;

the further processing principle is as follows: if the model surface reduction total surface number intermediate value is greater than the surface reduction total surface number initial value, making the surface reduction surface number final value of the part with the largest surface reduction surface number = the surface reduction surface number intermediate value of the part with the largest surface reduction surface number- (the surface reduction surface total surface number intermediate value-the surface reduction total surface number initial value);

if the model number of intermediate subtracted faces is less than the initial value of the number of intermediate subtracted faces, the final value of the number of parts with the smallest number of intermediate subtracted faces is set to =the intermediate value of the number of parts with the largest number of intermediate subtracted faces + (initial value of the number of intermediate subtracted faces-intermediate value of the number of intermediate subtracted faces).

The linear density calculation formula of the part i in the step (b) of the face reduction process is as follows:

Wherein i=1: n and N are the total number of parts in the model,represents the linear density of part i, +.>Representing the number of model faces of part i, +.>Representing the model length of part i.

The overall linear density of the model is:

wherein,representing the overall linear density of the model;

the invention relates to a method for planning and training obstacle avoidance by reinforcement learning of a mechanical arm in a constraint truss, which comprises the following specific process of collision detection design in the step S2): for the movable part, namely the joint of the mechanical arm, collision is possible, a series of convex objects such as cubes, capsules, spheres/ellipsoids and cones are used as surrounding boxes for collision detection, coarse collision detection is firstly carried out, and then fine collision detection is carried out, so that the calculation amount is reduced; for complex non-convex objects, multiple convex collider combinations are required;

the coarse collision detection refers to mutual detection between the rod pieces and the OBB-Box of the frame, and because the rod pieces with large whole scene are fewer, only a small quantity of OBB-Box needs to be detected between the rod pieces and the frame, so that the calculated amount of collision detection is smaller at the moment, and when the coarse detection has collision, the fine collision detection is carried out;

the fine collision detection is to perform collision detection on a plurality of convex objects forming non-convex objects, for example, each arm rod of the mechanical arm and each rod piece of the frame are all non-convex objects, namely parts, and the non-convex objects are formed by a plurality of convex objects; similar to the rough detection, collision detection is performed between the bounding box of each convex object of a certain non-convex object in the model and the bounding box of each convex object of other non-convex objects.

The invention relates to a method for planning and training obstacle avoidance by reinforcement learning of a mechanical arm in a constraint truss, wherein the development logic of a graphical user interface in the step S2 is as follows: firstly, decomposing huge manager functions, enabling function modules with similar responsibilities to be concentrated in a single manager, abstracting all manager classes, uniformly inheriting the abstract manager classes in a general single instance, and delivering the abstract manager classes to a main manager for uniform maintenance; the sub-managers are not coupled, the main manager is responsible for connecting with an engine and is used as a core component for starting, coordinating and running the whole software;

the method comprises the following steps: packaging all scene object data and corresponding time functions thereof together to form an abstract scene management module serving as a design prototype of the scene object; all local area network communication protocols are abstracted to form a base class, and the network communication module can be written by calling the base class;

decoupling of top and bottom business logic is completed by aggregating two abstract interfaces on a message management module and organizing codes by utilizing a lightweight MOM (Manager of Managers) architecture; the newly added scene and protocol complete the standardized flow of object construction by inheriting the corresponding abstract interface, the abstract interface is a bridge for communicating a specific module and is a secondary development standard, after the scene is newly added, the scene management module completes updating, after the protocol is newly added, the base class completes updating, and then the network communication module automatically updates;

After the design is finished, the message management module shields the business logic of the bottom layer, only needs to pay attention to the new scene or the content design of the newly added protocol during development, does not need to pay attention to the interaction with other modules, and the interaction part is completely finished by the message management module, so that the development efficiency and maintainability of the system are greatly improved.

The invention relates to a method for planning and training obstacle avoidance by reinforcement learning of a mechanical arm in a constraint truss, wherein the positive kinematic formula in the step S3 is obtained as follows: description of the first with the Link transformation matrixiThe coordinate system of the joints is ati-pose in 1 joint coordinate system;

with ends in the basePose and position; representing a connecting rodiThe coordinate system being relative to the connecting rodi-1 transformation of the coordinate system by the connecting rodiThe coordinate system is obtained through the following four sub-transformations in sequence:

1) Winding machinex _i-1 Rotation of shaftA corner;

2) Edge of the framex _i-1 Shaft movementa _i-1 ；

3) Winding machinez _i Rotation of shaftA corner;

4) Edge of the framez _i Shaft movementd _i ；

The transformation is described by a dynamic coordinate system, and the general formula of the connecting rod transformation is obtained according to the principle of' left to right

Wherein c represents cos and s represents sin;

the connecting rods are transformedMultiplying to obtain a mechanical arm transformation matrix +.>

Wherein,nindicating the total number of joints, as can be seen from the above formulaIs thatnThe function of the individual joint variables represents the description of the terminal coordinate system relative to the base coordinate system, so that the positive kinematics of the mechanical arm can be obtained;

According to the above meterCalculating to obtain an inverted positive kinematic transformation matrix as（n=7）；

（1）

Wherein px, py and pz represent the positions of the three directions of x, y and z, nx, sx and ax represent the posture transformation vectors of the x direction, and the y and z directions have the same meaning.

The invention relates to a method for planning and training obstacle avoidance by reinforcement learning of a mechanical arm in a constraint truss, wherein the parameters of a Target neural network in the step S5) are carried out in a soft update mode, and the method comprises the following steps:

（2）

in the middle ofFor parameter sets that need to be updated +.>Is weight value->As an actor in eval neural networksParameter of->Representation ofsIn the state of->Generating behavior under parametersaIs a network of (a); />Is Critic in eval neural network>Parameter of->Representation->Under parameters generatesProbability of state; />Is the actor +.>Parameter of->Representation ofsIn the state of->Generating behavior under parametersaIs a network of (a); />Is critic in target neural network>Parameter of->Representation->Under parameters generatesProbability of state;

the actor part in the eval neural network is optimized by a policy gradient method:

（3）

in the method, in the process of the invention,sin the state of the device, the device is in a state,Jfor optimization of the index, v represents the gradient operation,representing optimization metricsJAt->Gradient under parameters,/->As a total number of samples, j=1，…，/>；/>Representing network->In-behavioraGradient under->Representing probability->In parameter->Gradient under->Representation->Under parameters generatesProbability of state;

critic in eval neural networks uses root mean square error to define loss using a method similar to supervised learning:

（4）

wherein,as a total number of samples,j=1，…，/>；s _j representing a samplejThe state of the device is that,a _j representing a samplejBehavior, Q%s _j, a _j |θ ^Q ) Representation ofs _j In the state of->Generating behavior under parametersa _j Is a function of the probability of (1),y _j for rewarding probability->For the probability of being random,is weight value->Representation->Under parameters generates _j+1 Probability of state->Representation ofs _j+1 In the state of->Production under parameters->A network of probabilities;

optimizing network parameters by adopting a gradient descent method;

when the position of an object is randomly initialized, the mechanical arm is controlled to move the tail end joint to a designated position above the object, and a state acquisition mechanism and a rewarding mechanism are designed by an algorithm;

the state variables generated from the environment information are shown as follows:

（5）

in the method, in the process of the invention,is->To->Distance state variable between>Is the three-dimensional coordinates of joint i, < >>Is the three-dimensional coordinate of the center point of the object corresponding to the joint i; />Is->To->Distance state variable between>Is the three-dimensional coordinates of the base of the mechanical arm; />Is->To->Distance state variable between >Is the three-dimensional coordinates of the point under the object corresponding to joint i, < >>Above the end joint corresponding to joint iThree-dimensional coordinates of the points; />Is->To->Distance state variable between>Is the collision occurrence;

guiding the mechanical arm to make correct action by using a reward mechanism, wherein the correct action is divided into two stages;

the first stage guides the end joint of the mechanical arm to a position above the object:

（6）

in the method, in the process of the invention,1，/>2，/>3，/>4 are all rewarding values; i denotes the joint->Is->At->Values on the axis; />Is->At->Values on the axis; />Is the normal vector of the grip plane; />Is the normal vector of the upper surface of the object;

and in the second stage, guiding the tail end joint of the mechanical arm to vertically move upwards:

（7）

in the method, in the process of the invention,is the three-dimensional coordinates of the point under the object corresponding to joint i, < >>Is the three-dimensional coordinates of the point above the end joint corresponding to joint i; />And->Two prize values for the second phase respectively.

The physical system of the mechanical arm training operation in the step S6 comprises a plurality of twin computers, a control computer for mechanical arm training and control, a mechanical arm controller and a mechanical arm;

the specific working process of the physical system for the mechanical arm training operation is as follows: the control computer can train according to the simulated learning method in the step S4 and the reinforcement learning method in the step S5), the mechanical arm trains according to a given objective function and the joint angle, obstacle distance and other data fed back by the mechanical arm in a digital twin mode, and joint data in the training process are sent to the digital twin computer for real-time display; after the optimal motion path is obtained through training, the joint angle is sent to the digital twin computer and the mechanical arm in real time, and synchronous demonstration of the digital twin and the physical mechanical arm is carried out.

The technical scheme can be seen that the invention has the following beneficial effects:

1. the reinforcement learning obstacle avoidance planning and training method for the mechanical arm in the constraint truss can effectively shorten the traversing time of the mechanical arm space, can also combine the automatic training of reinforcement learning to realize the high-latitude training effect of the redundant mechanical arm, solves the dimension disaster problem caused by reinforcement learning training, and simultaneously displays the running condition of the redundant mechanical arm in real time through a digital twin training scene to realize the efficient judgment of the simulation learning of the mechanical arm.

2. The reinforcement learning obstacle avoidance planning and training method for the mechanical arm in the constraint truss combines reinforcement learning and imitation learning, so that the mechanical arm does not learn from scratch, a demonstration sample for human operation is given to the mechanical arm, reinforcement learning is performed on the basis of learning human demonstration, the training speed is greatly increased, and the mechanical arm exceeding the current level can be obtained.

3. The invention adopts the inverse kinematics analysis solution inversion solving method, avoids numerical solution calculation with large calculation amount, thereby being convenient for on-orbit application, and realizing the technical advantages of narrow space refined operation under the condition that the on-orbit mechanical arm is convenient to install.

4. According to the invention, the elbow and wrist bias mechanical arm can be decoupled in position and posture after inverse kinematics inversion, so that the joint angle expression can be solved.

Drawings

FIG. 1 is a schematic diagram of a method for reinforcement learning obstacle avoidance planning and training of a manipulator in a constraint truss according to the present invention;

FIG. 2 is a schematic diagram of the relationship between a restraining truss and a mechanical arm according to the present invention;

FIG. 3 is a side view of a robotic arm of the present invention;

FIG. 4 is a schematic view of a mechanical arm according to the present invention;

FIG. 5 is a table of DH parameters of the robotic arm of the present invention;

FIG. 6 is a diagram of a digital twin scenario of a constraining truss and a mechanical arm of the present invention;

FIG. 7 is a block diagram of a user interface module according to the present invention;

FIG. 8 is a diagram of a manipulator training manipulator system according to the present invention;

fig. 9 is a schematic diagram of the coordinates of a mechanical arm according to the present invention.

Detailed Description

The invention is further elucidated below in connection with the drawings and the specific embodiments.

Examples

The method for planning and training the obstacle avoidance of the reinforcement learning of the mechanical arm in the constraint truss comprises the following steps: the specific reinforcement learning obstacle avoidance planning and training method is as follows:

s1): setting initial conditions, and determining DH parameters of the mechanical arm and the relative position relation between the constraint truss and the mechanical arm; the relative position relationship between the constraint truss and the mechanical arm is that the mechanical arm needs to move and operate in the truss, but can not touch the truss, and joints of the mechanical arm can not collide with each other, as shown in fig. 2 and 5;

S5): according to the result of the step S4), the deep reinforcement learning training of other positions in the space is completed; namely, the deep reinforcement learning outputs specific action according to state variable state of the environment, and updates parameters of the neural network according to rewards reward obtained by the environment according to the action; using two sets of neural networks of eval and target to represent a strategy function actor and a value function critic; the Actor receives the environment information, outputs corresponding action variables, and the critic network calculates the rewarding value according to the corresponding action variables.

The manipulator described in this embodiment employs an elbow-wrist offset manipulator, it being conceivable that the elbow and wrist of the manipulator are offset.

It should be noted that, in the step S3), the angular displacement of each joint and the geometric parameters of the connecting rod are calculated, and the posture and the position of the end effector of the mechanical arm relative to the base are obtained, namely, the positive kinematics of the mechanical arm.

It should be noted that, in the training process of performing the discrete point simulation learning in step S4), the trajectory of the possible position of the mechanical arm in the target operation area is given as sparsely as possible. The sample assembly is a module program, a user drags the mechanical arm in the scene, and the assembly program can record the track of the mechanical arm as a sample. Note that, the module program here is an existing component Demonstration Recorder of the Unity three-dimensional software, and is used for recording data.

In this embodiment, the mechanical arm shown in fig. 3 and 4 adopts a 6-joint mechanical arm, so that the mechanical arm can be operated in all directions, and DH parameters are set as follows:

the mechanical arm adopts a 6-joint mechanical arm, can be operated in all directions, and has DH parameter setting rules as follows: s101): for each connecting rodi(i＝1、…，n-1) completing the following steps S102) to S105); the connecting rod 1 is a base;

s102): each connecting rodiEstablishing a coordinate system; establishing a connecting rodiIs of the coordinate system of (2)zShafts, i.e.z _i An axis, which is a joint axis (the joint can only rotate around one axis): by jointsiThe positive direction of the motion axis of +1 isz _i An axis (the joint numbers from the proximal end to the distal end are sequentially increased by 1);

s104): establishing a connecting rodiOf a coordinate systemxShafts, i.e.x _i And (3) a shaft: pressing the buttonEstablishment ofx _i Shafts, i.e.x _i Shaft and method for producing the samez _i-1 Shaft and method for manufacturing the samez _i The axes are vertical at the same time; if it isz _i-1 Shaft and method for producing the samez _i Axes being parallel, the common perpendicular line thereof beingx _i A shaft;

s105): establishing a connecting rodiOf a coordinate systemyShafts, i.e.y _i And (3) a shaft: according to establishedx _i Shaft and method for producing the samez _i Shaft, set up according to right hand rule y _i Shaft, i.e. order；

joint cornerθ _i : winding machinez _i-1 The shaft rotates, fromx _i-1 Rotate tox _i Is a function of the angle of (a).

In the step S2) of the present embodiment, the model optimization and processing means that the mechanical arm and the constraint truss are imported through the three-dimensional modeling software, and are optimized and processed, as shown in fig. 6, and the specific optimization and processing process is as follows:

s202): the parts are optimized, namely after entering the parts, various parts are split, some parts hidden inside are directly eliminated, and some regular parts can be directly simplified; performing iterative optimization on the influence degree of the part view with the complex boundary on the overall shape of the spacecraft until the area reduction rate is converged to 0%; such as an elongated cylinder, can be directly simplified into a prism (this sentence is put into the specification);

The large-scale parts in step S201) refer to the rod members of the mechanical arm and the rod members of the constraint truss.

In the step S203), the method for performing model iterative optimization on the small part in the fine optimization process in the embodiment is as follows:

s2035): and if the model similarity after the face reduction is smaller than or equal to the model similarity set value, finishing the face reduction.

The reinforcement learning obstacle avoidance planning and training method for the mechanical arm in the constraint truss in the embodiment comprises the following steps of:

(b) The method comprises the following steps Considering that the mechanical arm and the constraint truss in the scene are of a long rod type structure, calculating the linear density of the number of surfaces of each part of the model, namely the part linear density, wherein the part linear density is the number of surfaces of the model, which are arranged in unit length in the length direction, of each part, the index clearly shows the complexity degree of the rod-shaped object, and then calculating the model bus density, namely the total number of surfaces of the model/the total length of each part of the model;

the further processing principle is as follows: if the model number of total faces after face reduction (herein written as the number of total faces after face reduction intermediate value) > the number of total faces after face reduction initial value, let the number of parts face final value of the number of the face after face reduction being the largest = number of parts face intermediate value of the number of the face after face reduction being the largest- (number of total faces after face reduction intermediate value-number of total faces after face reduction initial value), if the number of total faces after face reduction intermediate value < number of total faces after face reduction initial value, let the number of parts face final value of the number of the face after face reduction being the smallest = number of parts face intermediate value of the number of the face after face reduction being the largest + (number of total faces after face reduction initial value-number of total faces after face reduction intermediate value).

For example, the model includes two parts a and b, the initial total number of faces of the model is 10000, the total number of faces after face subtraction is 1000, the number of faces a of the part is 4000, the linear density is 10, the number of faces b of the part is 6000, the linear density is 30, and the total linear density of the model is 20, then the number of faces after face subtraction of the part a is: 4000 x 1000/10000 x 10/20=200, the number of parts b minus the face is: 6000 x 1000/10000 x 30/20=900, 200+900 > 1000, so that the number of the face after subtracting the face of the part b is 900- (200+900-1000) =800, the face of the final part a is subtracted to 200, and the face of the part b is subtracted to 800.

In the step (a), the unit engine is called, and the initial total number of faces of the model can be obtained.

It should be noted that the total length of each part of the model in the step (b) is automatically measured by software according to the coordinates of each part. In step (b), if there is a large cube or sphere, the cube is diagonal and the sphere is the sphere diameter, calculated according to the longest length. And step (b) of the face reduction processThe linear density calculation formula of the part i is as follows:

wherein i=1: n and N are the total number of parts in the model,represents the linear density of part i, +.>Representing the number of model faces of part i, +.>Representing the model length of part i;

The overall linear density of the model is:

wherein,representing the overall linear density of the model.

The specific process of collision detection design in step S2) is as follows: for the movable part, namely the joint of the mechanical arm, collision is possible, a series of convex objects such as cubes, capsules, spheres/ellipsoids and cones are used as surrounding boxes for collision detection, coarse collision detection is firstly carried out, and then fine collision detection is carried out, so that the calculation amount is reduced; for complex non-convex objects, multiple convex collider combinations are required;

In order to distinguish between self-interference of the mechanical arm and collision with the outside, it is necessary to mark the components of the mechanical arm as one group and the outside obstacle as another group, so that the components of the mechanical arm have collision body properties.

The development logic of the graphical user interface in step S2) in this embodiment is: firstly, decomposing huge manager functions, enabling function modules with similar responsibilities to be concentrated in a single manager, abstracting all manager classes, uniformly inheriting the abstract manager classes in a general single instance, and delivering the abstract manager classes to a main manager for uniform maintenance; the sub-managers are not coupled, the main manager is responsible for connecting with an engine and is used as a core component for starting, coordinating and running the whole software; as shown in fig. 7, the graphical user interface includes a graphical user interface module, a master manager, and a generic case, where the generic case includes a message management module, a scene management module, an archive loading module, an object pool module, and a network communication module, where output ends of the message management module, the scene management module, the archive loading module, the object pool module, and the network communication module are all connected with the master manager, and the master manager is bi-directionally connected with the user interface module.

The specific working process is as follows: packaging all scene object data and corresponding time functions thereof together to form an abstract scene management module serving as a design prototype of the scene object; all local area network communication protocols are abstracted to form a base class, and the network communication module can be written by calling the base class;

after the design is finished, the message management module shields the business logic of the bottom layer, only needs to be concerned about the new scene or the content design of the new protocol during development, does not need to be concerned about the interaction with other modules, and the interaction part is completely finished by the message management module, thereby greatly improving the development efficiency and maintainability of the system.

The positive kinematic formula in step S3) in this embodiment is obtained as follows: description of the first with the Link transformation matrixiThe coordinate system of the joints is ati-pose in 1 joint coordinate system;

the position of the tail end in the base; representing a connecting rodiThe coordinate system being relative to the connecting rodi-1 transformation of the coordinate system by the connecting rodiThe coordinate system is obtained through the following four sub-transformations in sequence:

1) Winding machinex _i-1 Rotation of shaftA corner;

2) Edge of the framex _i-1 Shaft movementa _i-1 ；

3) Winding machinez _i Rotation of shaftA corner;

4) Edge of the framez _i Shaft movementd _i ；

Wherein c represents cos and s represents sin;

will eachIndividual link conversionMultiplying to obtain a mechanical arm transformation matrix +.>

from the above calculation, the inverse positive kinematic transformation matrix can be obtained as（n=7）；

（1）

In this embodiment, in step S5), the parameters of the target neural network are updated in a soft manner, which is specifically as follows:

（2）

In the middle ofFor parameter sets that need to be updated +.>Is weight value->As an actor in eval neural networksParameter of->Representation ofsIn the state of->Generating behavior under parametersaIs a network of (a); />Is Critic in eval neural network>Parameter of->Representation->Under parameters generatesProbability of state; />Is the actor +.>Parameter of->Representation ofsIn the state of->Generating behavior under parametersaIs a network of (a);is critic in target neural network>Parameter of->Representation->Under parameters generatesProbability of state.

(3)

in the method, in the process of the invention,sin the state of the device, the device is in a state,Jfor optimization of the index, v represents the gradient operation,representing optimization metricsJAt->Gradient under parameters,/->As a total number of samples,j=1，…，/>；/>representing network->In-behavioraGradient under->Representing probability->In parameter->Gradient under->Representation->Under parameters generatesProbability of state;

（4）

wherein,as a total number of samples,j=1，…，/>，s _j representing a samplejThe state of the device is that,a _j representing a samplejBehavior, Q%s _j, a _j |θ ^Q ) Representation ofs _j In the state of->Generating behavior under parametersa _j Is a function of the probability of (1),y _j for rewarding probability- >For the probability of being random,is weight value->Representation->Under parameters generates _j+1 Probability of state->Representation ofs _j+1 In the state of->Production under parameters->Probability network.

Optimizing network parameters by adopting a gradient descent method;

（5）

in the method, in the process of the invention,is->To->Distance state variable between>Is the three-dimensional coordinates of joint i, < >>Is the three-dimensional coordinate of the center point of the object corresponding to the joint i; />Is->To->Distance state variable between>Is the three-dimensional coordinates of the base of the mechanical arm; />Is->To->Distance state variable between>Is the three-dimensional coordinates of the point under the object corresponding to joint i, < >>Is the three-dimensional coordinates of the point above the end joint corresponding to joint i; />Is->To->Distance state variable between>Is the collision occurrence;

（6）

in the method, in the process of the invention,1，/>2，/>3，/>4 are all rewarding values; i denotes the joint- >Is->At->Values on the axis; />Is->At->Values on the axis; />Is the normal vector of the grip plane; />Is the normal vector of the upper surface of the object;

（7）

in the method, in the process of the invention,is the three-dimensional coordinates of the point under the object corresponding to joint i, < >>Above the end joint corresponding to joint iThree-dimensional coordinates of the points; />And->Two prize values for the second phase respectively.

In this embodiment, the physical system for the training operation of the mechanical arm in step S6) includes a digital twin computer, a control computer for training and controlling the mechanical arm, a mechanical arm controller, and a mechanical arm;

as shown in fig. 8, the specific working process of the physical system for the mechanical arm training operation is as follows: the control computer can train according to the simulation learning method of the step S4) and the deep reinforcement learning method of the step S5), the mechanical arm trains according to a given objective function (formula 4) and the joint angle and obstacle distance data (formula 5) fed back by the mechanical arm in a digital twin mode, and joint data in the training process are sent to the digital twin computer for real-time display; after the optimal motion path is obtained through training, the joint angle is sent to the digital twin computer and the mechanical arm in real time, and synchronous demonstration of the digital twin and the physical mechanical arm is carried out.

It should be noted that the imitative learning is a supervised learning method, and the strategy network of the mechanical arm can be trained to approach the state and the action according to the demonstration state and the corresponding action given by the operator. Reinforcement learning is considered as a process in which a robot arm gradually grows by continuously trial and error to accumulate experience, and imitation learning is considered as a process in which the robot arm continuously performs imitation learning on self-decisions following a user with abundant experience to obtain growth. The use of imitation learning allows the robotic arm to achieve a similar level of performance to the operator it simulates, in much shorter time than reinforcement learning when faced with the same complexity task.

However, the defects are obvious, the final result of the imitation learning is only infinitely close to the level of an operator and cannot exceed the level of the operator, and the reinforcement learning is long in learning time, but the mechanical arm with far exceeding human level can be finally obtained through training; in addition, the human body has no way to give countermeasures for various states existing in the environment, and often only a small part of all environment states are contained in the human demonstration, so if only imitative learning is performed, the mechanical arm has no way to cope with the situation which is never encountered in the human demonstration data.

Therefore, reinforcement learning is combined with imitation learning, so that the mechanical arm does not learn from zero, a demonstration sample for human operation is given to the mechanical arm, and reinforcement learning is performed on the basis of learning human demonstration, thereby not only greatly increasing the training speed, but also obtaining the mechanical arm exceeding the current level. The invention can be used in scenes with constraints on space environment, such as scenes with truss or plate constraints, such as space station outside operation, large spacecraft on-orbit construction, and the like, and can realize the training and demonstration of obstacle avoidance by simulating reinforcement learning and digital twin scenes.

The foregoing is merely a preferred embodiment of the invention, and it should be noted that modifications could be made by those skilled in the art without departing from the principles of the invention, which modifications would also be considered to be within the scope of the invention.

Claims

1. A reinforcement learning obstacle avoidance planning and training method for a mechanical arm in a constraint truss is characterized in that: comprising the following steps:

S5): according to the result of the step S4), reinforcement learning training of other positions except for the discrete point positions in the space is completed; namely, the deep reinforcement learning outputs specific action according to the state variable state of the environment, and updates the parameters of the neural network according to rewards reward obtained by the action; using two sets of neural networks of eval and target to represent a strategy function actor and a value function critic; the actor receives the environment information, outputs corresponding action variables, and calculates a reward value according to the corresponding action variables by critic;

the parameters of the target neural network are carried out in a soft update mode, and the parameters are specifically as follows:

wherein soft update is a parameter set to be updated, gamma is a weight value, θ ^Q For actor Q (s, a|θ in eval network ^Q ) Parameters of (1), Q (s, a|theta ^Q ) Representing state s, θ ^Q A network that produces behavior a under parameters; θ ^μ Is Criticμ (s|θ) in eval network ^μ ) Is set according to the parameters μ (s|θ ^μ ) Represents θ ^μ Probability of generating state s under parameters; θ ^Q′ Is the actor Q' (s, a|theta) in the target network ^Q′ ) Parameters of (1), Q'(s，a|θ ^Q′ ) Representing state s, θ ^Q′ A network that produces behavior a under parameters; θ ^μ′ Is the critic mu' (s|theta) in the target network ^μ′ ) Is set to μ' (s|θ) ^μ′ ) Represents θ ^μ′ Probability of generating state s under parameters;

The actor in the eval neural network is optimized by adopting a policy gradient method:

wherein s is a state, J is an optimization index,representing gradient operations->Representing the optimization index J at theta ^μ Gradient under parameters, N _sample For the total number of samples, j=1,.. _sample ，/>Representing the network Q (s, a|θ ^Q ) The gradient in the case of the behaviour a,

representing the probability μ (s|θ) ^μ ) At the parameter theta ^μ The gradient of the lower part of the gradient,

critic in eval neural networks uses root mean square error to define loss:

wherein s is _j Representing sample j state, a _j The behavior of sample j is represented and,

Q(s _j ，a _j |θ ^Q ) Representing state s _j Down, θ ^Q Generating behavior a under parameters _j Is a network of (a), y _j To give a bonus probability, r _j Is a random probability, gamma is a weight, μ'(s) _j+1 |θ ^μ′ ) Represents θ ^μ′ Generating state s under parameters _j Probability of +1, Q'(s) _j+1 ，μ′(s _j+1 |θ ^μ′ )|θ ^Q′ ) Representing state s _j At +1, θ ^Q′ Generating μ'(s) under parameters _j+1 |θ ^μ′ ) A network of probabilities;

optimizing network parameters by adopting a gradient descent method;

when the position of the object is randomly initialized, the mechanical arm is controlled to move the tail end joint to a designated position above the object, so that a state acquisition mechanism and a rewarding mechanism are designed;

in dis_jt _i Is a joint _i To tgt _i Distance state variable, joint between _i Is the three-dimensional coordinates of joint i, tgt _i Is the three-dimensional coordinate of the center point of the object corresponding to the joint i; dis_jj _i Is a joint _i To joint ₀ Distance state variable, joint between ₀ Is the three-dimensional coordinates of the base of the mechanical arm; dis_th _i Is tpoint _i To hPoint _i Distance state variable between, setpoint _i Is the three-dimensional coordinate of the point below the object corresponding to the joint i, and is the hPoint _i Is the three-dimensional coordinates of the point above the end joint corresponding to joint i; dis_hj _i Is hPoint _i To joint _i A distance state variable between the two, dis_col is a collision occurrence condition;

in jre ₁ ，jre ₂ ，jre ₃ ，jre ₄ Are all rewarding values; hPoint (hPoint) _i _x _i Is hPoint _i At x _i Values on the axis; hPoint (hPoint) _i _y _i Is hPoint _i In y _i Values on the axis; hvect is the normal vector of the grip plane; tvect is the normal vector of the upper surface of the object;

part ₁ and part ₂ Two prize values for the second phase respectively;

2. The constrained intra-truss mechanical arm reinforcement learning obstacle avoidance planning and training method according to claim 1, wherein the method is characterized in that: the model optimization and processing in the step S2) refers to importing a mechanical arm and a constraint truss through three-dimensional modeling software, and optimizing and processing the mechanical arm and the constraint truss, wherein the specific optimization and processing process is as follows:

s201): coarse optimization, namely carrying out large-scale surface reduction on large parts in the model so as to avoid excessive calculated amount remained in a fine optimization stage;

s202): the parts are optimized, the parts are separated into various parts, some parts hidden inside are directly eliminated, and regular parts can be directly simplified; performing iterative optimization on the influence degree of the part view with the complex boundary on the overall shape of the spacecraft until the area reduction rate is converged to 0%;

s203): and fine optimization, namely performing model iterative optimization on some small parts until the face reduction rate optimization converges to 0%.

3. The constrained intra-truss mechanical arm reinforcement learning obstacle avoidance planning and training method according to claim 2, wherein the method is characterized in that: the model iterative optimization method for the small part in the fine optimization process in the step S203) comprises the following steps:

S2032): subtracting the surface according to the initial value of the surface number after subtracting the surface of the small part set in the step S2031);

4. The constrained intra-truss mechanical arm reinforcement learning obstacle avoidance planning and training method according to claim 3, wherein the method is characterized in that: the surface reduction process in the step S2032) is as follows:

(b) The method comprises the following steps Considering that the mechanical arm and the constraint truss in the scene are of a long rod type structure, calculating the linear density of the number of the surfaces of each part of the model, namely solving the number of the surfaces of the model in unit length in the length direction for each part, clearly showing the complexity degree of the rod-shaped object by the linear density of the part, and then calculating the bus density of the model, namely

Model bus density = model total number of faces/model total length of each part;

(c) The method comprises the following steps The number of the surfaces of each part after subtracting the surface can be determined according to the initial total surface number of the model, the initial value of the total surface number after subtracting the surface, the model bus density and the part linear density, and the calculation formula is as follows:

the initial value of the total number of the surfaces of the parts after the surface reduction is multiplied by the initial total number of the surfaces of the model, and the linear density of the parts is multiplied by the bus density of the model;

(d) The method comprises the following steps Because linear density is introduced, the subtracting face of each part is not linearly calculated, and when the calculated total face number after subtracting face is different from the initial value of the total face number after subtracting face, further processing is needed;

5. The constrained intra-truss mechanical arm reinforcement learning obstacle avoidance planning and training method according to claim 4, wherein the method is characterized in that: linear density of part i in step (b) of the face reduction process

The calculation formula is as follows: ρ _i _lj＝S _i _lj/L _i _lj

Wherein i=1: n and N are the total number of parts in the model, ρ _i Lj represents the linear density of part i, S _i Lj represents the number of model surfaces of part i, L _i Lj represents the model length of part i;

the model bus density is:

where ρ_mx represents the model bus density.

6. The constrained intra-truss mechanical arm reinforcement learning obstacle avoidance planning and training method according to claim 1, wherein the method is characterized in that: the specific process of collision detection design in the step S2) is as follows: for the movable part, namely the joint of the mechanical arm, which is likely to collide, any convex object of a cube, a capsule body, a sphere, an ellipsoid and a cone is used as a surrounding box for collision detection, coarse collision detection is firstly carried out, and then fine collision detection is carried out;

the coarse collision detection is mutual detection between the mechanical arm and the OBB-Box of the constraint truss;

because the number of the large mechanical arms in the whole scene is small, only a small number of OBB-Box detection distances are needed between the mechanical arms and between the rod piece and the constraint truss, and therefore, when the coarse collision detection is in collision, the fine collision detection is carried out;

the fine collision detection is to perform collision detection on a plurality of convex objects forming a non-convex object, each arm rod of the mechanical arm and each rod piece of the constraint truss are all non-convex objects, and collision detection is performed between the bounding box of each convex object of a certain non-convex object in the model and the bounding box of each convex object of other non-convex objects.

7. The constrained intra-truss mechanical arm reinforcement learning obstacle avoidance planning and training method according to claim 1, wherein the method is characterized in that: the development logic of the graphical user interface in step S2) is as follows: firstly, decomposing huge manager functions, enabling function modules with similar responsibilities to be concentrated in a single manager, abstracting all manager classes, uniformly inheriting the abstract manager classes in a general single instance, and delivering the abstract manager classes to a main manager for uniform maintenance; the sub-managers are not coupled, the main manager is responsible for connecting with an engine and is used as a core component for starting, coordinating and running the whole software;

decoupling of top-layer and bottom-layer business logic is completed by aggregating two abstract interfaces on a message management module and organizing codes by utilizing a lightweight MOM architecture; the newly added scene and protocol complete the standardized flow of object construction by inheriting the corresponding abstract interface, the abstract interface is a bridge for communicating a specific module and is a secondary development standard, after the scene is newly added, the scene management module completes updating, after the protocol is newly added, the base class completes updating, and then the network communication module automatically updates;

8. The constrained intra-truss mechanical arm reinforcement learning obstacle avoidance planning and training method according to claim 2, wherein the method is characterized in that: the physical system of the mechanical arm training operation in the step S6) comprises a digital twin computer, a control computer for mechanical arm training and control, a mechanical arm controller and a mechanical arm;

the specific working process of the physical system for the mechanical arm training operation is as follows: the control computer trains according to the simulated learning method in the step S4) and the deep reinforcement learning method in the step S5), the mechanical arm trains according to a given objective function and joint angle and obstacle distance data digitally fed back by the mechanical arm, and joint data in the training process are sent to the digital twin computer for real-time display; after the optimal motion path is obtained through training, the joint angle is sent to the digital twin computer and the mechanical arm in real time, and synchronous demonstration of the digital twin and the physical mechanical arm is carried out.