CN113954076B - Robot precision assembling method based on cross-modal prediction assembling scene - Google Patents

Robot precision assembling method based on cross-modal prediction assembling scene

Info

Publication number
CN113954076B
CN113954076B (application CN202111336234.XA)
Authority
CN
China
Prior art keywords
data
assembly
visual
robot
clamping actuator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111336234.XA
Other languages
Chinese (zh)
Other versions
CN113954076A (en)
Inventor
楼云江
刘瑞凯
杨先声
黎阿建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202111336234.XA priority Critical patent/CN113954076B/en
Publication of CN113954076A publication Critical patent/CN113954076A/en
Priority to PCT/CN2022/128555 priority patent/WO2023083034A1/en
Application granted granted Critical
Publication of CN113954076B publication Critical patent/CN113954076B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1602 Programme controls characterised by the control system, structure, architecture
    • B25J9/161 Hardware, e.g. neural networks, fuzzy logic, interfaces, processor
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/1679 Programme controls characterised by the tasks executed
    • B25J9/1687 Assembly, peg and hole, palletising, straight line, weaving pattern movement
    • B25J9/1694 Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697 Vision controlled systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Mechanical Engineering (AREA)
  • Robotics (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Fuzzy Systems (AREA)
  • Manipulator (AREA)

Abstract

The invention relates to a robot assembly method and system based on cross-modal prediction of the assembly scene, in which a tactile sensor is arranged on the clamping actuator at the end of the robot and a vision device photographs the assembly area of the clamping actuator. A plurality of neural network models are provided, comprising a multilayer perceptron, a reinforcement learning network model and a tensor fusion network model, and the training data comprise visual data from the vision device, tactile data from the tactile sensor, pose data from the robot controller, and action feedback data and moment feedback data of the robot. The predicted visual characterization vector output by the multilayer perceptron replaces the conventional practice of collecting actual real pictures with the vision device, so that even when the vision device is occluded by the clamping actuator or goes out of focus, the whole system can still complete the precision assembly step, which improves the reliability of the system.

Description

Robot precision assembling method based on cross-modal prediction assembling scene
Technical Field
The invention relates to a robot-based assembly control method, and in particular to a robot precision assembly method based on cross-modal prediction of the assembly scene. The invention belongs to the technical field of robots.
Background
In batch production lines of consumer electronics products, most assembly tasks with low precision requirements have been fully automated by industrial robots, but many precision assembly tasks still have to be completed manually by workers. A key reason is that the vision device used by an industrial robot during precision assembly is easily affected by the environment: when the vision device is occluded by the clamping actuator or goes out of focus, the precision assembly step cannot be completed.
Disclosure of Invention
The invention provides a robot precision assembly method and system based on cross-modal prediction of the assembly scene, and aims to solve at least one of the technical problems in the prior art.
The technical solution of the invention relates to a robot assembly method based on cross-modal prediction of the assembly scene, in which a tactile sensor is arranged on the clamping actuator at the end of the robot and the assembly area of the clamping actuator is photographed by a vision device, and
the method comprises the following steps:
S10, providing a plurality of neural network models, the neural network models comprising a multilayer perceptron, a reinforcement learning network model and a tensor fusion network model, wherein the training data comprise visual data from the vision device, tactile data from the tactile sensor, pose data from the robot controller, and action feedback data and moment feedback data of the robot;
S20, driving the clamping actuator by the robot controller to start the assembly action from an assembly initial point; then collecting, in real time through the vision device and the tactile sensor, the reference visual data, actual visual data, initial tactile data and actual tactile data corresponding to the assembly area, with the robot controller providing the initial pose data and the actual pose data; and performing compression and/or filtering to convert these data into the corresponding reference visual data characterization vector, actual visual characterization vector, actual tactile data characterization vector, initial tactile data characterization vector, initial pose data characterization vector and actual pose data characterization vector;
S30, splicing and fusing the reference visual data characterization vector, the initial tactile data characterization vector, the actual tactile data characterization vector, the initial pose data characterization vector and the actual pose data characterization vector, and inputting the spliced and fused vector into the multilayer perceptron to output a predicted visual characterization vector close to the actual visual characterization vector;
S40, splicing and fusing the predicted visual characterization vector and the actual tactile data characterization vector, and inputting the fused vector into the reinforcement learning network model to output predicted action control data of the robot;
S50, reducing the dimensions of the tactile data characterization vector and the assembly force data of the robot through causal convolution, inputting them together with the predicted visual characterization vector into the tensor fusion network model, and judging, through a damping point predictor formed by a multilayer perceptron, whether the clamping actuator has contacted the assembly damping node;
S60, according to the prediction of whether the clamping actuator has contacted the assembly damping node, implementing position control and force control through the robot controller, so as to calculate the pose data of the next assembly motion node and adjust the assembly force of the clamping actuator.
Further, the step S20 includes: S21, filtering the tactile data through a Kalman filter, and obtaining tactile flow data of the robot along the assembly action direction by using an optical flow method.
Further, the step S21 further includes: reducing the dimensionality of the tactile flow data to 9 tactile detection points × 1 tactile detection point through Gaussian convolution to obtain the processed tactile data.
Further, the visual data includes RGB image data, and the step S20 further includes:
S22, cropping the collected RGB image to 128 pixels × 128 pixels and converting it into a grayscale image; then, for the last waypoint in each assembly action step of the clamping actuator, compressing the grayscale image converted from the corresponding RGB image through an RGB image encoder to output an RGB characterization vector;
S23, providing a variational auto-encoder, inputting the processed grayscale images into the variational auto-encoder for multiple rounds of training, and finally outputting an RGB characterization vector with dimension 16 × 1.
Further, the step S30 includes:
S31, reducing the dimension of the reference visual data through a picture encoder to obtain a reference visual data characterization vector with dimension 16 × 1, passing the reference visual data characterization vector through a picture decoder to obtain reference picture data, obtaining a loss function by comparing the pixels of the reference picture data with those of the original input picture using the mean square error, and training the picture encoder and the picture decoder through back-propagation and parameter updating with this loss function.
Further, in step S10, the step of training the reinforcement learning network model includes:
S11, before the clamping actuator is moved to the assembly damping node, implementing position control and force control through the robot controller so that the assembly force of the clamping actuator along the direction of the assembly damping node is M newtons; after the clamping actuator is moved to the assembly damping node, implementing position control and force control through the robot controller so that the assembly force of the clamping actuator along the direction of the assembly damping node is N newtons;
S12, inputting the action feedback data and the moment feedback data into the robot controller, and calculating the assembly force of the next assembly motion node of the robot from the action feedback gain and the moment feedback gain output by the robot controller, wherein N is greater than M.
Further, the step S10 further includes:
S13, dividing the path from the clamping actuator to the assembly damping node, before the clamping actuator is moved to the assembly damping node, into 50 action steps, setting a plurality of data acquisition points in each action step, and acquiring visual data and tactile data once at each data acquisition point.
Further, the step S10 further includes:
S14, when the clamping actuator has moved 50 action steps without reaching the assembly damping node, driving, by the robot controller, the clamping actuator to reset to the assembly initial point and restarting the assembly action.
Further, the step S10 further includes:
S15, dividing each action step into 10 segments, so that 11 data acquisition points are set in total.
The invention also relates to a computer-readable storage medium, on which program instructions are stored, which program instructions, when executed by a processor, implement the above-mentioned method.
The beneficial effects of the invention are as follows.
The system introduces a multilayer perceptron neural network: the reference visual data characterization vector, the initial tactile data characterization vector, the actual tactile data characterization vector, the initial pose data characterization vector and the actual pose data characterization vector are spliced, fused and input into the multilayer perceptron for training, which finally yields a multilayer perceptron able to predict the characterization vector of the actual picture, so that a camera is no longer needed afterwards to collect picture data. The predicted visual characterization vector output by the multilayer perceptron replaces the conventional practice of collecting actual real pictures with the vision device, so that even when the vision device is occluded by the clamping actuator or goes out of focus, the whole system can still complete the precision assembly step, which improves the reliability of the system.
Drawings
Fig. 1 is a flow diagram of a method according to the invention.
Fig. 2 is a detailed block diagram of the robot motion control part in the method according to the invention.
Fig. 3 is a schematic view of the arrangement of the clamp actuators according to the embodiment of the present invention.
Fig. 4 shows the hardware platform of the robot and the assembly control system according to an embodiment of the present invention.
FIGS. 5 and 6 are graphs illustrating reinforcement learning results in the method according to the present invention.
Detailed Description
The conception, the specific structure and the technical effects produced by the present invention will be clearly and completely described in conjunction with the embodiments and the attached drawings, so as to fully understand the objects, the schemes and the effects of the present invention.
It should be noted that, unless otherwise specified, when a feature is referred to as being "fixed" or "connected" to another feature, it may be directly fixed or connected to the other feature or indirectly fixed or connected to the other feature. Furthermore, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any combination of one or more of the associated listed items.
The technical solution of the invention is implemented on the basic hardware of a robot, for example by adding hardware and software to an existing robot. Referring to figs. 3 and 4, in some embodiments, an assembly system according to the present invention comprises: a clamping actuator 1 arranged at the end of the robot's moving part; a tactile sensor 2 provided inside a jaw of the clamping actuator 1; a vision device arranged to photograph the assembly area of the clamping actuator; and a computer device (not shown) in communication with the robot motion controller, the clamping actuator 1, the tactile sensor 2 and the vision device.
Referring to fig. 3, in a typical assembly application example of the present invention, the assembly system according to the present invention can perform snap-fit plug-and-socket assembly. In a preferred embodiment, the tactile sensor 2 may be a 5 × 5 dot-matrix pressure-sensitive tactile sensor. In order to better measure the pre-slip of the clamped part 4 under external force, a soft rubber pad 3 (about 5 mm thick) is arranged between the fingertip of the clamping actuator 1 and the tactile sensor 2, and a layer of rubber film is pasted on the other side of the tactile sensor (i.e. the side that contacts the clamped part 4). Referring to fig. 4, the vision device may be an RGB-D camera capable of outputting RGB picture data and depth picture data at the same time. The robot is a serial articulated robot with the clamping actuator 1 mounted at its end. The computer device may be independent of the robot controller or may be integrated into the robot controller for performing the method according to the invention.
Referring to fig. 1 and 2, in some embodiments, a method according to the present invention includes the steps of:
S10, providing a plurality of neural network models, the neural network models comprising a multilayer perceptron, a reinforcement learning network model and a tensor fusion network model, wherein the training data comprise visual data from the vision device, tactile data from the tactile sensor, pose data from the robot controller, and action feedback data and moment feedback data of the robot;
S20, driving the clamping actuator by the robot controller to start the assembly action from an assembly initial point; then collecting, in real time through the vision device and the tactile sensor, the reference visual data, actual visual data, initial tactile data and actual tactile data corresponding to the assembly area, with the robot controller providing the initial pose data and the actual pose data; and performing compression and/or filtering to convert these data into the corresponding reference visual data characterization vector, actual visual characterization vector, actual tactile data characterization vector, initial tactile data characterization vector, initial pose data characterization vector and actual pose data characterization vector;
S30, splicing and fusing the reference visual data characterization vector, the initial tactile data characterization vector, the actual tactile data characterization vector, the initial pose data characterization vector and the actual pose data characterization vector, and inputting the spliced and fused vector into the multilayer perceptron to output a predicted visual characterization vector close to the actual visual characterization vector;
S40, splicing and fusing the predicted visual characterization vector and the actual tactile data characterization vector, and inputting the fused vector into the reinforcement learning network model to output predicted action control data of the robot;
S50, reducing the dimensions of the tactile data characterization vector and the assembly force data of the robot through causal convolution, inputting them together with the predicted visual characterization vector into the tensor fusion network model, and judging, through a damping point predictor formed by a multilayer perceptron, whether the clamping actuator has contacted the assembly damping node;
S60, according to the prediction of whether the clamping actuator has contacted the assembly damping node, implementing position control and force control through the robot controller, so as to calculate the pose data of the next assembly motion node and adjust the assembly force of the clamping actuator.
Wherein the visual data comprises RGB image data.
Specific embodiments of the above steps are described below by way of a specific 3C component assembly example. In this example, the assembly task is to plug a USB-C charging cable plug into its socket: the clamping actuator 1 grips the USB-C charging cable plug, the USB-C socket is fixed, the tactile sensor collects the pressure tactile data of the clamping actuator 1 gripping the plug, and the vision device collects visual data of the USB-C charging cable plug (hereinafter called the plug) and the USB-C socket (hereinafter called the socket).
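Before going into the individual steps, the overall S10 to S60 control loop can be summarized, for illustration only, by the following Python sketch; every name used here (robot, camera, tactile, encoders, mlp_predictor, rl_policy, tfn_damping and their methods) is a hypothetical stand-in, not an interface defined by the patent.

import numpy as np

def assembly_step(robot, camera, tactile, encoders, mlp_predictor, rl_policy, tfn_damping):
    # S20: acquire reference/actual observations and compress them into characterization vectors
    z_ref = encoders.image(camera.reference_frame())        # reference visual vector (16 x 1)
    tac_ref = encoders.tactile(tactile.reference_frame())   # initial tactile vector
    tac_real = encoders.tactile(tactile.read())             # actual tactile vector
    pos_ref = robot.initial_pose()                          # initial pose vector
    pos_real = robot.current_pose()                         # actual pose vector

    # S30: cross-modal prediction of the visual vector, no live camera frame required
    z_pred = mlp_predictor(np.concatenate([z_ref, tac_ref, tac_real, pos_ref, pos_real]))

    # S40: predicted visual vector + actual tactile vector -> predicted action control data
    action = rl_policy(np.concatenate([z_pred, tac_real]))

    # S50: tensor fusion of tactile, force and predicted visual features -> damping-node contact
    at_damping_node = tfn_damping(tac_real, robot.assembly_force(), z_pred)

    # S60: hybrid position/force control; raise the assembly force once the damping node is reached
    robot.move_hybrid(action, force_z=12.0 if at_damping_node else 2.0)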
For the embodiment of step S10
S10, providing a plurality of neural network models, wherein the neural network models comprise a multilayer perceptron, a reinforcement learning network model and a tensor fusion network model.
The training data comprise visual data from the vision device, tactile data from the tactile sensor, pose data from the robot controller, and action feedback data and moment feedback data of the robot.
The tactile data come from the tactile sensor, which in this example is a 5 × 5 dot-matrix pressure tactile sensor that collects the pressure tactile data of the clamping actuator gripping the plug.
Step S10 further specifically includes the following steps. As shown in fig. 1, a plurality of sets of the RGB image data and the tactile data collected in the assembly area are input into the reinforcement learning network model for training. In this embodiment, the real-time Cartesian-space pose of the clamping actuator at the end of the robot is read and input into the reinforcement learning network model to train it, and the RGB image data and tactile information of the area where the plug and the socket are located are collected. The closer the clamping actuator gets to the assembly damping node, or the deeper it is inserted into the assembly damping node, the larger the reward function of the reinforcement learning network model becomes; in this embodiment, the reward increases with the proximity of the plug to the socket and with the depth to which the plug is inserted into the socket. During the reinforcement learning insertion task, the RGB image data and tactile data of the assembly area are collected, and the image data are input into the auto-encoder for training. In step S10, training the reinforcement learning network model on the robot platform further includes the following steps:
S11, before the clamping actuator is moved to the assembly damping node, implementing position control and force control through the robot controller so that the assembly force of the clamping actuator along the direction of the assembly damping node is M newtons; after the clamping actuator is moved to the assembly damping node, implementing position control and force control through the robot controller so that the assembly force of the clamping actuator along the direction of the assembly damping node is N newtons;
S12, inputting the action feedback data and the moment feedback data into the robot controller, and calculating the assembly force of the next assembly motion node of the robot from the action feedback gain and the moment feedback gain output by the robot controller, wherein N is greater than M.
In connection with the above embodiments, a hybrid force/position control method is used to control the movement of the robot: the motion along the direction in which the plug and socket are assembled (the vertical direction, i.e. the z-axis direction in figs. 3 and 4) is decoupled and uses force control, while the motion of the robot in the other five degrees of freedom (x, y, roll, pitch, yaw) uses position control. Before the plug reaches the snap damping point, the robot drives the clamping actuator to assemble along the z axis with a force of M = 2 newtons; after the plug reaches the damping point, the robot drives the clamping actuator to increase the assembly force along the z axis to N = 12 newtons. For the 5 position-controlled dimensions, the feedback gain of the system is larger to ensure accuracy; for the 1 force-controlled dimension, the feedback gain is smaller to ensure compliance during assembly of the components.
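As a minimal sketch of this decoupling, the assignment of controlled dimensions and the two force setpoints from the embodiment can be expressed as follows; the gain values and the ordering of the degrees of freedom are assumptions of this sketch, not taken from the patent.

import numpy as np

POSITION_DIMS = np.array([1, 1, 0, 1, 1, 1])   # x, y, z, roll, pitch, yaw; z (index 2) is force-controlled

def assembly_force_setpoint(reached_damping_point):
    # Target insertion force along the z axis, in newtons (2 N before, 12 N after the damping point)
    return 12.0 if reached_damping_point else 2.0

def feedback_gains():
    # Larger gains for the 5 position-controlled DOFs, a smaller gain for the force-controlled DOF
    kp_position = 400.0     # assumed stiff position gain
    kp_force = 0.5          # assumed compliant force gain
    return np.where(POSITION_DIMS == 1, kp_position, kp_force)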
S13, dividing the path from the clamping actuator to the assembly damping node, before the clamping actuator is moved to the assembly damping node, into 50 action steps, setting a plurality of data acquisition points in each action step, and acquiring visual data and tactile data once at each data acquisition point.
S14, when the clamping actuator has moved 50 action steps without reaching the assembly damping node, the robot controller drives the clamping actuator to reset to the assembly initial point and the assembly action is restarted.
S15, dividing each action step into 10 segments, so that 11 data acquisition points are set in total.
In connection with the above embodiment, for steps S13 to S15, one complete insertion test of the robot is referred to as one "round", and each round consists of no more than 50 "action steps". If the robot has still not completed the insertion task after moving 50 action steps, the insertion is judged to have failed and the robot resets to the initial point. Each action step is divided into 10 segments, giving a total of 11 "waypoints"; the robot drives the clamping actuator through the waypoints in sequence to complete an action step. For the RGB images of the plug and the socket, data are collected once per action step; for the tactile data, data are collected once per waypoint.
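The round structure above can be illustrated by the following sketch of one data-collection round, with the 50-action-step limit, 11 waypoints per action step and a proximity/depth-shaped reward; all object and method names are hypothetical.

def run_round(robot, camera, tactile, policy, max_steps=50, waypoints_per_step=11):
    robot.reset_to_initial_point()
    log = {"rgb": [], "tactile": [], "reward": []}
    for step in range(max_steps):
        action = policy(robot.observation())
        for waypoint in robot.interpolate(action, n_waypoints=waypoints_per_step):
            robot.move_to(waypoint)
            log["tactile"].append(tactile.read())    # tactile data at every waypoint
        log["rgb"].append(camera.capture())          # one RGB frame per action step
        # Reward grows as the plug approaches the socket and as the insertion depth increases
        log["reward"].append(-robot.distance_to_socket() + robot.insertion_depth())
        if robot.assembly_completed():
            break
    else:
        robot.reset_to_initial_point()               # 50 steps without success: round failed, reset
    return log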
For the embodiment of step S20
The robot controller drives the clamping actuator to start the assembly action from the assembly initial point; reference visual data, actual visual data, initial tactile data and actual tactile data corresponding to the assembly area are then collected in real time through the vision device and the tactile sensor, the initial pose data and the actual pose data are provided by the robot controller, and compression and/or filtering is performed to convert these data into the corresponding reference visual data characterization vector, actual visual characterization vector, actual tactile data characterization vector, initial tactile data characterization vector, initial pose data characterization vector and actual pose data characterization vector. In the above embodiment, the clamping actuator carries the plug to a point about 1 mm directly above the socket; this point is taken as the assembly initial point, and the robot reads the Cartesian-space six-dimensional pose vector of the clamping actuator at this initial point through its own system (such as the ROS system).
Step S20 further specifically includes the following steps:
and S21, filtering the tactile data through a Kalman filter, and obtaining tactile flow data of the robot along the assembly action direction by using an optical flow method. As shown in fig. 1, in conjunction with the above-described embodiment, the collected haptic information is filtered by using a kalman filter, and the haptic flow in the x and y directions of the grasping actuator in each action step (the dimension is 25 (5 × 5 detection points) × 2 (two directions of the x and y axes) × 10 (one haptic flow information is calculated for each two continuous waypoints)) is calculated by using a Farneback method in analogy with the optical flow method.
More specifically, the tactile flow data obtained in this step are reduced to a dimension of 9 tactile detection points × 1 tactile detection point through Gaussian convolution to obtain the processed tactile data.
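As an illustration of the tactile pre-processing, a minimal per-taxel Kalman filter for the 5 × 5 pressure array is sketched below; the noise parameters are assumptions, and the Farneback tactile-flow computation and the Gaussian dimension reduction to 9 × 1 are omitted here.

import numpy as np

class TaxelKalmanFilter:
    # One scalar Kalman filter per taxel of the 5 x 5 pressure array, assuming a
    # constant-value process model between consecutive tactile frames.
    def __init__(self, shape=(5, 5), process_noise=1e-3, measurement_noise=1e-1):
        self.x = np.zeros(shape)                 # filtered pressure estimate
        self.p = np.ones(shape)                  # estimate covariance
        self.q = process_noise
        self.r = measurement_noise

    def update(self, measurement):
        self.p = self.p + self.q                 # predict step (state assumed constant)
        gain = self.p / (self.p + self.r)        # Kalman gain
        self.x = self.x + gain * (measurement - self.x)
        self.p = (1.0 - gain) * self.p
        return self.x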
S22, cutting the collected RGB image to 128 pixels by 128 pixels and converting the RGB image into a gray-scale image. And then, for the last path point in each assembly action step of the clamping actuator, compressing the gray scale image converted from the corresponding RGB image by an RGB image encoder to output an RGB characterization vector.
And S23, providing a variational self-encoder, inputting the processed gray-scale image to the variational self-encoder for training for multiple times, and finally outputting the dimension of the RGB representation vector to be 16 multiplied by 1. In steps S22 and S23, in combination with the above embodiment, the RGB images output by the camera of the plug and socket are cut into 128 × 128 sizes around the assembly area, and the RGB images are converted into grayscale images to reduce the data amount and processing time.
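A minimal sketch of such a variational auto-encoder, compressing the 128 × 128 grayscale assembly image into a 16-dimensional characterization vector, is given below; the convolutional layer sizes and the KL weight are assumptions of this sketch.

import torch
import torch.nn as nn

class ImageVAE(nn.Module):
    def __init__(self, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(                                   # 1 x 128 x 128 input
            nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),        # -> 16 x 64 x 64
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),       # -> 32 x 32 x 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),       # -> 64 x 16 x 16
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(64 * 16 * 16, latent_dim)
        self.fc_logvar = nn.Linear(64 * 16 * 16, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)          # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(reconstruction, x, mu, logvar, kl_weight=1e-3):
    # Pixel-wise MSE reconstruction loss plus KL regularization (the KL weight is an assumption)
    mse = nn.functional.mse_loss(reconstruction, x)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return mse + kl_weight * kld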
Detailed description of the step S30
S30, splicing and fusing the reference visual data characterization vector (z_ref), the initial tactile data characterization vector (tac_ref), the actual tactile data characterization vector (tac_real), the initial pose data characterization vector (pos_ref) and the actual pose data characterization vector (pos_real), and inputting the spliced and fused vector into the multilayer perceptron to output a predicted visual characterization vector close to the actual visual characterization vector (z_real).
Step S30 further specifically includes the following steps:
S31, reducing the dimension of the reference visual data through a picture encoder to obtain a reference visual data characterization vector with dimension 16 × 1; passing the reference visual data characterization vector through a picture decoder to obtain reference picture data; obtaining a loss function by comparing the pixels of the reference picture data with those of the original input picture using the mean square error (MSE); and training the picture encoder and the picture decoder through back-propagation and parameter updating with this loss function. A new picture subsequently acquired by the vision device is passed through the trained picture encoder to obtain its characterization vector.
The initial pose data characterization vector (pos_ref), the reference visual data characterization vector (z_ref) and the initial tactile data characterization vector (tac_ref) are a set of data collected separately in step S20; the three pieces of information are feedback, from different angles, on the same object in the same state. The vision device acquires a reference picture, from which the reference visual data characterization vector (z_ref) is obtained through the picture encoder, and acquires an actual real picture, from which the actual visual characterization vector (z_real) is obtained through the picture encoder. Specifically, the reference visual data characterization vector (z_ref), the initial tactile data characterization vector (tac_ref), the actual tactile data characterization vector (tac_real), the initial pose data characterization vector (pos_ref) and the actual pose data characterization vector (pos_real) are spliced and fused into one vector and input into the multilayer perceptron, which outputs a predicted visual characterization vector of the same dimension; the mean square error between this output and the characterization vector (z_real) of the actual real picture gives the loss function, and the multilayer perceptron is trained through back-propagation and parameter updating. The final result is a multilayer perceptron that can predict the characterization vector of the actual picture, so that the vision device is no longer needed to collect picture data: the predicted visual characterization vector computed from the five characterization vectors (z_ref), (tac_ref), (tac_real), (pos_ref) and (pos_real) replaces the characterization vector (z_real) of the actual real picture.
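To illustrate this cross-modal predictor, a hedged sketch follows: the five characterization vectors are concatenated and mapped by a multilayer perceptron to a predicted 16-dimensional visual vector, trained with a mean-square-error loss against z_real; the hidden layer sizes and the tactile/pose vector dimensions are assumptions of this sketch.

import torch
import torch.nn as nn

class CrossModalPredictor(nn.Module):
    def __init__(self, z_dim=16, tac_dim=9, pose_dim=6, hidden=128):
        super().__init__()
        in_dim = z_dim + 2 * tac_dim + 2 * pose_dim      # z_ref, tac_ref, tac_real, pos_ref, pos_real
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim),
        )

    def forward(self, z_ref, tac_ref, tac_real, pos_ref, pos_real):
        fused = torch.cat([z_ref, tac_ref, tac_real, pos_ref, pos_real], dim=-1)
        return self.mlp(fused)                           # predicted visual characterization vector

def train_step(model, optimizer, batch):
    # The prediction is compared with the encoder output of the actual real picture (z_real)
    z_pred = model(batch["z_ref"], batch["tac_ref"], batch["tac_real"],
                   batch["pos_ref"], batch["pos_real"])
    loss = nn.functional.mse_loss(z_pred, batch["z_real"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()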
Detailed description of step S40
S40, splicing and fusing the predicted visual characterization vector and the actual tactile data characterization vector, and inputting the fused vector into the reinforcement learning network model (NAF) to output the predicted action control data of the robot. The predicted visual characterization vector replaces the conventional practice of collecting actual real pictures with the vision device, so that even when the vision device is occluded by the clamping actuator or goes out of focus, the whole system can still complete the precision assembly step, which improves the reliability of the system.
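Only the greedy-action branch of a NAF-style policy is sketched here: the predicted visual characterization vector and the actual tactile data characterization vector are concatenated into the state, and the network outputs the next motion command of the clamping actuator. The dimensions, layer sizes and action scaling are assumptions, and the value and advantage heads of the full NAF model are omitted.

import torch
import torch.nn as nn

class AssemblyPolicy(nn.Module):
    def __init__(self, z_dim=16, tac_dim=9, action_dim=6, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + tac_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),    # bounded action increment in [-1, 1]
        )

    def forward(self, z_pred, tac_real):
        state = torch.cat([z_pred, tac_real], dim=-1)    # S40: splice and fuse the two vectors
        return self.net(state)                           # predicted action control data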
Detailed description of step S50
S50, reducing the dimensions of the tactile data characterization vector and the assembly force data of the robot through causal convolution, inputting them together with the predicted visual characterization vector into the tensor fusion network model, and judging, through a damping point predictor formed by a multilayer perceptron, whether the clamping actuator has contacted the assembly damping node.
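A hedged sketch of one way to realize this step follows: each modality vector is appended with a constant 1 and the outer product of the three is taken (the tensor-fusion construction), after which a small multilayer perceptron classifies whether the clamping actuator has reached the assembly damping node. The feature dimensions and the causal-convolution settings are assumptions of this sketch.

import torch
import torch.nn as nn

class DampingPointPredictor(nn.Module):
    def __init__(self, tac_dim=9, force_dim=6, z_dim=16, hidden=64):
        super().__init__()
        # Causal 1-D convolutions (padding trimmed on the right) reduce the tactile and force inputs
        self.tac_conv = nn.Conv1d(1, 1, kernel_size=3, padding=2)
        self.force_conv = nn.Conv1d(1, 1, kernel_size=3, padding=2)
        fused_dim = (tac_dim + 1) * (force_dim + 1) * (z_dim + 1)
        self.classifier = nn.Sequential(nn.Linear(fused_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, 1))

    @staticmethod
    def _causal(conv, x):
        # Apply the convolution and drop the rightmost outputs so each output only sees past inputs
        return conv(x.unsqueeze(1)).squeeze(1)[:, : x.size(1)]

    def forward(self, tac, force, z_pred):
        tac = self._causal(self.tac_conv, tac)
        force = self._causal(self.force_conv, force)
        ones = torch.ones(tac.size(0), 1, device=tac.device)
        t = torch.cat([tac, ones], dim=1)                            # (B, tac_dim + 1)
        f = torch.cat([force, ones], dim=1)                          # (B, force_dim + 1)
        v = torch.cat([z_pred, ones], dim=1)                         # (B, z_dim + 1)
        fused = torch.einsum("bi,bj,bk->bijk", t, f, v).flatten(1)   # tensor fusion (outer product)
        return torch.sigmoid(self.classifier(fused))                 # probability of damping-node contact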
In some embodiments, the torques of the 6 active joints of the robot shown in fig. 4 are solved through a Jacobian-matrix-based hybrid force/position control law (the formula is given only as an image in the original patent), wherein Kp and Kv are the proportional-derivative (PD) control parameters, KFi belongs to the proportional-integral (PI) control parameters, and S is a selection matrix whose diagonal entries are 1 for the position-controlled dimensions and 0 for the force-controlled dimension. The robot controller implements position control through the PD control algorithm and force control through the PI control algorithm.
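Because the patent gives the torque formula only as an image, the sketch below shows a generic Jacobian-transpose hybrid force/position law with the same structure (a selection matrix separating PD position control from PI force control); the gains, the DOF ordering and the exact form of the law are assumptions, not the patent's formula.

import numpy as np

S = np.diag([1, 1, 0, 1, 1, 1])                       # 1 = position-controlled DOF, 0 = force-controlled (z)
Kp, Kv = np.diag([400.0] * 6), np.diag([40.0] * 6)    # assumed PD gains for position control
Kf, Kfi = 0.5, 0.1                                    # assumed PI gains for force control

def joint_torques(J, x_err, xdot_err, f_err, f_err_int):
    # Torques for the 6 active joints from task-space position and force errors
    position_part = S @ (Kp @ x_err + Kv @ xdot_err)                   # PD on the position-controlled DOFs
    force_part = (np.eye(6) - S) @ (Kf * f_err + Kfi * f_err_int)      # PI on the force-controlled DOF
    return J.T @ (position_part + force_part)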
Detailed description of step S60
S60, according to the prediction of whether the clamping actuator has contacted the assembly damping node, implementing position control and force control through the robot controller, so as to calculate the pose data of the next assembly motion node and adjust the assembly force of the clamping actuator.
Referring to figs. 5 and 6, which illustrate the reinforcement learning results of the method of the present invention, the network model trained by combining RGB image data, F/T force sensor data and robot moment feedback data obtains more reward (i.e. succeeds more often in achieving the expected assembly result) and needs fewer steps per assembly operation as the number of test rounds (episodes) increases.
It should be recognized that the method steps in embodiments of the present invention may be embodied or carried out by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The method may use standard programming techniques. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, the operations of processes described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described herein (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of computing platform operatively connected to a suitable interface, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, and the like. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optically read and/or write storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer, which when read by the storage medium or device, is operative to configure and operate the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described herein includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention may also include the computer itself when programmed according to the methods and techniques described herein.
A computer program can be applied to input data to perform the functions described herein to transform the input data to generate output data that is stored to non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.
The present invention is not limited to the above embodiments, and any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention as long as the technical effects of the present invention are achieved by the same means. The invention is capable of other modifications and variations in its technical solution and/or its implementation, within the scope of protection of the invention.

Claims (10)

1. A robot assembly method based on cross-modal prediction of the assembly scene, wherein a tactile sensor is arranged on the clamping actuator at the end of the robot and the assembly area of the clamping actuator is photographed by a vision device,
characterized in that the method comprises the following steps:
S10, providing a plurality of neural network models, the neural network models comprising a multilayer perceptron, a reinforcement learning network model and a tensor fusion network model, wherein the training data comprise visual data from the vision device, tactile data from the tactile sensor, pose data from the robot controller, and action feedback data and moment feedback data of the robot;
S20, driving the clamping actuator by the robot controller to start the assembly action from an assembly initial point; then collecting, in real time through the vision device and the tactile sensor, the reference visual data, actual visual data, initial tactile data and actual tactile data corresponding to the assembly area, with the robot controller providing the initial pose data and the actual pose data; and performing compression and/or filtering to convert these data into the corresponding reference visual data characterization vector, actual visual characterization vector, actual tactile data characterization vector, initial tactile data characterization vector, initial pose data characterization vector and actual pose data characterization vector;
S30, splicing and fusing the reference visual data characterization vector, the initial tactile data characterization vector, the actual tactile data characterization vector, the initial pose data characterization vector and the actual pose data characterization vector, and inputting the spliced and fused vector into the multilayer perceptron to output a predicted visual characterization vector close to the actual visual characterization vector;
S40, splicing and fusing the predicted visual characterization vector and the actual tactile data characterization vector, and inputting the fused vector into the reinforcement learning network model to output predicted action control data of the robot;
S50, reducing the dimensions of the tactile data characterization vector and the assembly force data of the robot through causal convolution, inputting them together with the predicted visual characterization vector into the tensor fusion network model, and judging, through a damping point predictor formed by a multilayer perceptron, whether the clamping actuator has contacted the assembly damping node;
S60, according to the prediction of whether the clamping actuator has contacted the assembly damping node, implementing position control and force control through the robot controller, so as to calculate the pose data of the next assembly motion node and adjust the assembly force of the clamping actuator.
2. The method of claim 1, wherein the step S20 comprises:
S21, filtering the tactile data through a Kalman filter, and obtaining tactile flow data of the robot along the assembly action direction by using an optical flow method.
3. The method of claim 2, wherein the step S21 further comprises:
the dimensionality of the haptic flow data reduced by Gaussian convolution is 9 haptic detection points multiplied by 1 haptic detection point to obtain processed haptic data.
4. The method of claim 1, wherein the visual data comprises RGB image data,
the step S20 further includes:
S22, cropping the collected RGB image to 128 pixels × 128 pixels and converting it into a grayscale image, and then, for the last waypoint in each assembly action step of the clamping actuator, compressing the grayscale image converted from the corresponding RGB image through an RGB image encoder into an RGB characterization vector;
S23, providing a variational auto-encoder, inputting the processed grayscale images into the variational auto-encoder for multiple rounds of training, and finally outputting an RGB characterization vector with dimension 16 × 1.
5. The method of claim 1, wherein the step S30 comprises:
S31, reducing the dimension of the reference visual data through a picture encoder to obtain a reference visual data characterization vector with dimension 16 × 1, passing the reference visual data characterization vector through a picture decoder to obtain reference picture data, obtaining a loss function by comparing the pixels of the reference picture data with those of the original input picture using the mean square error, and training the picture encoder and the picture decoder through back-propagation and parameter updating with this loss function.
6. The method according to claim 1, wherein in the step S10, the step of training the reinforcement learning network model includes:
S11, before the clamping actuator is moved to the assembly damping node, implementing position control and force control through the robot controller so that the assembly force of the clamping actuator along the direction of the assembly damping node is M newtons; after the clamping actuator is moved to the assembly damping node, implementing position control and force control through the robot controller so that the assembly force of the clamping actuator along the direction of the assembly damping node is N newtons;
S12, inputting the action feedback data and the moment feedback data into the robot controller, and calculating the assembly force of the next assembly motion node of the robot from the action feedback gain and the moment feedback gain output by the robot controller,
wherein N > M.
7. The method of claim 6, wherein the step S10 further comprises:
S13, dividing the path from the clamping actuator to the assembly damping node, before the clamping actuator is moved to the assembly damping node, into 50 action steps, setting a plurality of data acquisition points in each action step, and acquiring visual data and tactile data once at each data acquisition point.
8. The method according to claim 7, wherein the step S10 further comprises,
S14, when the clamping actuator has moved 50 action steps without reaching the assembly damping node, driving, by the robot controller, the clamping actuator to reset to the assembly initial point and restarting the assembly action.
9. The method of claim 7, wherein the step S10 further comprises:
S15, dividing each action step into 10 segments, so that 11 data acquisition points are set in total.
10. A computer readable storage medium having stored thereon program instructions which, when executed by a processor, implement the method of any one of claims 1 to 9.
CN202111336234.XA 2021-11-12 2021-11-12 Robot precision assembling method based on cross-modal prediction assembling scene Active CN113954076B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111336234.XA CN113954076B (en) 2021-11-12 2021-11-12 Robot precision assembling method based on cross-modal prediction assembling scene
PCT/CN2022/128555 WO2023083034A1 (en) 2021-11-12 2022-10-31 Vision and touch combined robot precision assembly control method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111336234.XA CN113954076B (en) 2021-11-12 2021-11-12 Robot precision assembling method based on cross-modal prediction assembling scene

Publications (2)

Publication Number Publication Date
CN113954076A CN113954076A (en) 2022-01-21
CN113954076B true CN113954076B (en) 2023-01-13

Family

ID=79470270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111336234.XA Active CN113954076B (en) 2021-11-12 2021-11-12 Robot precision assembling method based on cross-modal prediction assembling scene

Country Status (1)

Country Link
CN (1) CN113954076B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023083034A1 (en) * 2021-11-12 2023-05-19 哈尔滨工业大学(深圳) Vision and touch combined robot precision assembly control method and system
CN115070767A (en) * 2022-07-04 2022-09-20 中国科学院沈阳自动化研究所 Dynamic assembly method based on Actor Critic

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11188680A (en) * 1997-12-22 1999-07-13 Matsushita Electric Works Ltd Part assembling device
CN109129474A (en) * 2018-08-10 2019-01-04 上海交通大学 Manipulator active grabbing device and method based on multi-modal fusion
CN109543823A (en) * 2018-11-30 2019-03-29 山东大学 A kind of flexible assembly system and method based on multimodal information description
CN111055279A (en) * 2019-12-17 2020-04-24 清华大学深圳国际研究生院 Multi-mode object grabbing method and system based on combination of touch sense and vision
CN111444954A (en) * 2020-03-24 2020-07-24 广东省智能制造研究所 Robot autonomous assembly method based on multi-mode perception and learning
CN113172629A (en) * 2021-05-06 2021-07-27 清华大学深圳国际研究生院 Object grabbing method based on time sequence tactile data processing
CN113510700A (en) * 2021-05-19 2021-10-19 哈尔滨理工大学 Touch perception method for robot grabbing task

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A High-Precision Assembly System of 3C Parts Based on 6D Pose; Yixin Xie et al.; Proceedings of The 2021 IEEE International; 2021-07-19; pp. 554-559 *

Also Published As

Publication number Publication date
CN113954076A (en) 2022-01-21

Similar Documents

Publication Publication Date Title
CN110268358B (en) Position control device and position control method
CN113954076B (en) Robot precision assembling method based on cross-modal prediction assembling scene
TWI638249B (en) Position control device and position control method
CN108873768B (en) Task execution system and method, learning device and method, and recording medium
CN110573308B (en) Computer-based method and system for spatial programming of robotic devices
CN112109075B (en) Control system and control method
KR101988937B1 (en) Method and apparatus for calibration of a robot hand and a camera
CN113927602B (en) Robot precision assembly control method and system based on visual and tactile fusion
US20150239127A1 (en) Visual debugging of robotic tasks
JP2019014030A (en) Control device for robot, robot, robot system, and calibration method for camera
CN113878588B (en) Robot compliant assembly method based on tactile feedback and oriented to buckle type connection
JP7458741B2 (en) Robot control device and its control method and program
CN113386123A (en) Control device, robot, learning device, robot system, and learning method
WO2020241037A1 (en) Learning device, learning method, learning program, automatic control device, automatic control method, and automatic control program
JP2018167334A (en) Teaching device and teaching method
CN113412178A (en) Robot control device, robot system, and robot control method
WO2023037634A1 (en) Command value generating device, method, and program
WO2023083034A1 (en) Vision and touch combined robot precision assembly control method and system
CN114051444A (en) Executing an application by means of at least one robot
CN115997183A (en) Target-oriented control of robotic arms
CN117103277A (en) Mechanical arm sensing method based on multi-mode data fusion
Pérez et al. FPGA-based visual control system using dynamic perceptibility
CN116276998A (en) Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration
CN117377560A (en) Object handling in case of collision avoidance using complementary constraints
Berti et al. Kalman filter for tracking robotic arms using low cost 3d vision systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant