CN117464683A - Method for controlling mechanical arm to simulate video motion - Google Patents
- Publication number: CN117464683A
- Application number: CN202311571826.9A
- Authority: CN
- Country: China
- Prior art keywords: key feature; coordinate system; feature points; target object; pixel
- Legal status: Pending
Classifications
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1656—Programme controls characterised by programming, planning systems for manipulators
- B25J9/1669—Programme controls characterised by special application, e.g. multi-arm co-operation, assembly, grasping
- B25J9/1664—Programme controls characterised by motion, path, trajectory planning
Abstract
A method for controlling a mechanical arm to simulate video actions, which relates to a mechanical arm and a camera; the mechanical arm comprises a base, an arm section and a clamping jaw connected in sequence, the clamping jaw being used for clamping a target object. The method comprises the following steps: 1, acquiring the position of the target object in the camera coordinate system; 2, acquiring the pixel position of the target object on the image; 3, extracting key feature points of the target object from the image; 4, acquiring a data set and training a neural network; 5, mapping the pose of the target object in the demonstration video to real space; 6, mapping the spatial coordinates of the target object to the base coordinate system. The invention realizes feature recognition of the target object clamped by the mechanical arm in a video, accurately perceives the real three-dimensional spatial operation trajectory of the target object, and thereby controls the mechanical arm to stably copy/restore the operations.
Description
Technical Field
The invention relates to the technical field of robot automatic control, in particular to a method for controlling a mechanical arm to simulate video actions.
Background
With the development of computer and robotics, traditional manual assembly is being gradually replaced by industrial robotic assembly. The wide application of industrial robots greatly improves assembly efficiency and reliability. At present, on an industrial production line, a teaching programming or off-line programming method is mainly used for ensuring that a robot automatically completes an assembly task. However, when different types of assembly tasks need to be handled, the programming effort is significant. In addition, due to the low level of human-machine cooperation, conventional robot programming methods are difficult to apply to complex assembly scenarios.
Man-machine cooperation assembly based on artificial intelligence is actively developed, which further improves assembly efficiency, quality, flexibility and sustainability, and can be suitable for various future assembly scenes. In human-computer collaboration based on artificial intelligence, vision is an important way for robots to obtain environmental information. However, it is difficult for the robot to perform feature recognition on an unknown object from a video, and in particular, it is difficult to accurately perceive a real three-dimensional space operation trajectory of a target object from the video and stably reproduce the operations.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, and provides a method for controlling a mechanical arm to simulate video actions, which realizes feature recognition on unknown objects in video and accurate perception of real three-dimensional space operation tracks of target objects, thereby stably copying the operations.
The technical scheme of the invention is as follows: a method for controlling a mechanical arm to simulate video actions, which relates to the mechanical arm and a camera; the mechanical arm comprises a base, an arm section and a clamping jaw which are sequentially connected, and the clamping jaw is used for clamping a target object; the target object is marked as P; the camera is a depth camera;
the method comprises the following steps:
S01, acquiring the position of the target object in the camera coordinate system:
Calibrate the hand-eye system and establish the coordinate conversion relation between the base and the camera, using this relation to determine the conversion matrix Matrix_trans between the base coordinate system and the camera coordinate system. Since the pose of the clamping jaw in the base coordinate system can be derived from the state of the mechanical arm, the target object is assumed to lie at the clamping-jaw center point P_Rob: (x_Rob, y_Rob, z_Rob) in the base coordinate system; the position P_Dcam of the clamping-jaw center point P_Rob in the camera coordinate system is then calculated based on Equation 1;
Equation 1: P_Dcam = Matrix_trans^(-1) · P_Rob, where Matrix_trans is the conversion matrix from the camera coordinate system to the base coordinate system, applied in homogeneous coordinates;
S02, acquiring the pixel position of the target object on the image:
After internal reference calibration of the camera, the internal reference matrix Matrix_in shown in Equation 2 is obtained;
Equation 2: Matrix_in = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]], where f_x is the focal length of the camera along the X-axis of the captured image, c_x the optical center point along the X-axis of the captured image, f_y the focal length of the camera along the Y-axis of the captured image, and c_y the optical center point along the Y-axis of the captured image;
The position P_Dcam of the target object in the camera coordinate system is converted to the position P_pixel in the pixel coordinate system based on Equation 3;
Equation 3: P_pixel = Matrix_in · P_Dcam, where P_pixel is referred to for short as the target-object pixel coordinate point;
The target-object pixel coordinate point P_pixel is normalized to obtain the pixel position P_image of the target object, as shown in Equation 4;
Equation 4: P_image = (P_pixel[0] / P_pixel[2], P_pixel[1] / P_pixel[2]), where [0], [1] and [2] index the first, second and third coordinate values, so that P_pixel[0], P_pixel[1] and P_pixel[2] are the X-, Y- and Z-axis coordinate values of the target object in the pixel coordinate system;
S03, extracting key feature points of the target object from the image:
First, establish on the photographed image the extension-line vector from the arm-segment end center point S to the target-object pixel position P_image and screen out the points on the forward side of this vector; next, define a center region according to the position of the target area and delete the pixel points outside the center region; finally, screen out the target area according to the depth-value range set for the target scene. Then extract 3 key feature points of the target object from the target area, obtaining their positions in the pixel coordinate system. Compared with the unscreened common feature points, these 3 key feature points have relatively higher stability and discrimination, and can be detected again in images photographed by the camera under different viewing angles and illumination conditions;
In this step, the target area is the area where the target object is located;
S04, acquiring a data set and training the neural network:
Record the positions of the 3 key feature points in the pixel coordinate system and in the camera coordinate system; the mechanical arm then performs a random operation, comprising rotation and translation, on the target object, and the following two operations are performed in no particular order: I, through the conversion matrix between the base and camera coordinate systems, calculate the new positions of the clamping-jaw center point and the 3 key feature points in the camera coordinate system after the operation is executed; II, acquire the target area newly generated after the operation. The random operation is repeated a number of times, and the positions of the 3 key feature points and the target area before and after each operation are recorded to form a data set for training the subsequent Keypoint-RCNN neural network. After training is completed, whenever the mechanical-arm clamping jaw in a demonstration video is observed to clamp any target object, the Keypoint-RCNN neural network automatically acquires 3 key feature points, tracks them as the demonstration video plays, and predicts their positions in the pixel coordinate system;
S05, mapping the pose of the target object in the demonstration video to real space:
Prepare in advance a demonstration video that the mechanical arm needs to imitate, in which the clamping jaw of a mechanical arm clamps an arbitrary article and executes operations; the Keypoint-RCNN neural network automatically acquires 3 key feature points from the demonstration video, tracks them as the video plays, and predicts their positions in the pixel coordinate system;
First, the built-in field-of-view parameter of the camera and the actual lengths between the key feature points in the first frame of the demonstration video are measured as prior knowledge;
then, when the camera is horizontally placed, a space coordinate system is established, in the space coordinate system, the origin is the center of the camera lens, the x-axis is arranged along the horizontal direction and is perpendicular to the axial center line of the camera lens, the y-axis is arranged along the vertical direction, and the z-axis is directed to the direction of the target object along the lens; finally, mapping coordinates of 3 key feature points automatically acquired by the Keypoint-RCNN neural network in an image coordinate system to coordinates in a space coordinate system;
S06, mapping the spatial coordinates of the target object to the base coordinate system:
The spatial coordinates of the 3 key feature points in all frames of the demonstration video are mapped to the base coordinate system through the conversion matrix Matrix_trans between the base and camera coordinate systems, giving Equation 10;
Equation 10: P_Rob_all = Matrix_trans · P_space_all, where P_space_all denotes the spatial coordinates of the 3 key feature points in all frames and P_Rob_all their base coordinates in all frames; the target position that the mechanical arm needs to reach can be calculated from these base coordinates, and a control command is then output to control the mechanical arm to act so as to reach the target position.
The invention further adopts the technical scheme that: in step S01, the hand-eye system is a two-component system consisting of the mechanical arm and the depth camera; the base coordinate system takes the base as its reference; and the clamping-jaw center point is the sphere center of the smallest sphere that can completely contain the clamping jaw.
The invention further adopts the technical scheme that: in step S05, the mapping process is as follows:
The coordinates of a key feature point in the image coordinate system are (P_cx, P_cy), where the subscript c denotes the serial number of the key feature point, P_cx the pixel coordinate of the c-th key feature point in the x direction, and P_cy its pixel coordinate in the y direction. The coordinates of a key feature point in real space are (x_c, y_c, z_c), where x_c, y_c and z_c denote the spatial coordinates of the c-th key feature point on the x-, y- and z-axes respectively;
The effective focal length EFL is calculated with reference to Equation 7;
Equation 7: EFL = H / (2 · tan(Fov / 2)), where Fov is the camera's built-in field-of-view parameter and H is the pixel height of the photographed image;
The conversion mapping of a key feature point from the image coordinate system to the spatial coordinate system gives its spatial position P_c, see Equation 8;
Equation 8: P_c = (x_c, y_c, z_c) = γ_c · (P_cx, P_cy, EFL), where the subscript c denotes the serial number of the key feature point and γ_c denotes the scale factor of the key feature point with serial number c, in mm/pixel;
A trust-region algorithm is adopted to obtain the numerical solution of Equation 8. The initial value of the trust-region algorithm is set between 0 and 1, and γ_1, γ_2 and γ_3 of the first frame of the demonstration video are solved first; the calculation result of the previous frame is then taken as the initial value of the iterative solution for the current frame, and so on, yielding the spatial coordinates of all key feature points in all frames of the demonstration video, expressed as Equation 9;
Equation 9: P_space_all = [P_1^1 … P_1^j; P_2^1 … P_2^j; P_3^1 … P_3^j], where the superscript denotes the frame number (the maximum frame number being j), the subscript denotes the serial number of the key feature point, and each column contains the spatial coordinates of the 3 key feature points in one frame.
The invention further adopts the technical scheme that: in step S05, to solve for the unknowns γ_c, the actual lengths L between any two key feature points are combined to construct a system of equations, shown as Equations 10-1, 10-2 and 10-3;
Equation 10-1: L_1-2 = ||P_1 − P_2||_2, where L_1-2 is the actual length between key feature point 1 and key feature point 2, and the expanded expressions of P_1 and P_2 are given by Equation 8;
Equation 10-2: L_1-3 = ||P_1 − P_3||_2, where L_1-3 is the actual length between key feature point 1 and key feature point 3, and the expanded expressions of P_1 and P_3 are given by Equation 8;
Equation 10-3: L_2-3 = ||P_2 − P_3||_2, where L_2-3 is the actual length between key feature point 2 and key feature point 3, and the expanded expressions of P_2 and P_3 are given by Equation 8;
By substituting Equation 8 into Equations 10-1, 10-2 and 10-3, the system can be solved for γ_c, namely γ_1, γ_2 and γ_3.
The invention further adopts the technical scheme that: in step S04, a scale-invariant feature transform (SIFT) algorithm is applied to extract a number of feature points from the target area of the photographed image while obtaining their positions in the pixel coordinate system; the 3 key feature points with the highest response values are then further selected as the main observation targets; the scale-invariant feature transform algorithm comprises the following steps:
I, scale-space extrema detection: by constructing a difference-of-Gaussians scale space, search for local maxima or minima across different scales; these extreme points are the potential key feature points;
II, key point localization: finely localize the potential key feature points to remove points with relatively low contrast and edge responses, improving the stability and reliability of the key feature points;
III, key point orientation assignment: assign one or more orientations, determined by the directions of the local image gradients, to each screened key feature point, so that the SIFT descriptor is rotation-invariant;
IV, key point descriptor generation: generate a descriptor for each screened key feature point; the descriptor is a histogram of the gradients of the region around the key point.
Compared with the prior art, the invention has the following advantages: the method realizes feature recognition of the target object clamped by the mechanical arm in the video, accurately perceives the real three-dimensional space operation track of the target object, and further controls the mechanical arm to stably copy/restore the operations.
The invention is further described below with reference to the drawings and examples.
Drawings
FIG. 1 is a schematic illustration of an extension line from the center point of the end of an arm segment to the pixel position of a target object on a photographed image;
FIG. 2 is a schematic diagram of screening out points on the forward side of an extension vector on a captured image;
FIG. 3 is a schematic diagram of defining a center region on a captured image according to a location of a target region and deleting pixels outside the center region;
fig. 4 is a schematic diagram of screening out a target area according to a depth value range set in a target scene on a photographed image.
Detailed Description
Example 1:
a method for controlling a mechanical arm to simulate video actions relates to the mechanical arm and a camera. The mechanical arm comprises a base, an arm section and a clamping jaw which are sequentially connected, and the clamping jaw is used for clamping a target object. The target object is identified as P. The camera is a depth camera.
The method comprises the following steps:
S01, acquiring the position of the target object in the camera coordinate system:
Calibrate the hand-eye system and establish the coordinate conversion relation between the base and the camera, using this relation to determine the conversion matrix Matrix_trans between the base coordinate system and the camera coordinate system. Since the pose of the clamping jaw in the base coordinate system can be derived from the state of the mechanical arm, the target object is assumed to lie at the clamping-jaw center point P_Rob: (x_Rob, y_Rob, z_Rob) in the base coordinate system; the position P_Dcam of the clamping-jaw center point P_Rob in the camera coordinate system is then calculated based on Equation 1;
Equation 1: P_Dcam = Matrix_trans^(-1) · P_Rob, where Matrix_trans is the conversion matrix from the camera coordinate system to the base coordinate system, applied in homogeneous coordinates;
In this step, the hand-eye system is a two-component system consisting of the mechanical arm and the depth camera; the base coordinate system takes the base as its reference; and the clamping-jaw center point is the sphere center of the smallest sphere that can completely contain the clamping jaw.
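The base-to-camera conversion of step S01 can be sketched in a few lines of NumPy. The calibration numbers, the frame layout, and the helper names `make_transform` and `base_to_camera` are illustrative assumptions, not values from the patent:

```python
import numpy as np

def make_transform(R, t):
    """Assemble a 4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Illustrative hand-eye calibration result: camera frame rotated 180 degrees
# about z and offset 0.5 m along x from the base origin (camera-to-base).
R = np.array([[-1.0, 0.0, 0.0],
              [0.0, -1.0, 0.0],
              [0.0, 0.0, 1.0]])
T_cam2base = make_transform(R, np.array([0.5, 0.0, 0.0]))

def base_to_camera(p_rob, T_cam2base):
    """Equation 1: P_Dcam = Matrix_trans^(-1) . P_Rob in homogeneous coordinates."""
    p_h = np.append(p_rob, 1.0)                 # lift to homogeneous form
    return (np.linalg.inv(T_cam2base) @ p_h)[:3]

p_rob = np.array([0.5, 0.2, 0.3])   # jaw centre point in the base frame (metres)
p_dcam = base_to_camera(p_rob, T_cam2base)
```

Working in homogeneous coordinates lets the rotation and translation be applied (and inverted) as a single matrix product.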
S02, acquiring pixel positions of a target object on an image:
After internal reference calibration of the camera, the internal reference matrix Matrix_in shown in Equation 2 is obtained;
Equation 2: Matrix_in = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]], where f_x is the focal length of the camera along the X-axis of the captured image, c_x the optical center point along the X-axis of the captured image, f_y the focal length of the camera along the Y-axis of the captured image, and c_y the optical center point along the Y-axis of the captured image;
The position P_Dcam of the target object in the camera coordinate system is converted to the position P_pixel in the pixel coordinate system based on Equation 3;
Equation 3: P_pixel = Matrix_in · P_Dcam, where P_pixel is referred to for short as the target-object pixel coordinate point;
The target-object pixel coordinate point P_pixel is normalized to obtain the pixel position P_image of the target object, as shown in Equation 4;
Equation 4: P_image = (P_pixel[0] / P_pixel[2], P_pixel[1] / P_pixel[2]), where [0], [1] and [2] index the first, second and third coordinate values, so that P_pixel[0], P_pixel[1] and P_pixel[2] are the X-, Y- and Z-axis coordinate values of the target object in the pixel coordinate system.
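Equations 2-4 amount to a standard pinhole projection followed by perspective division. A minimal sketch, where the intrinsics f_x, f_y, c_x, c_y are made-up numbers rather than calibrated values:

```python
import numpy as np

# Illustrative intrinsics (Equation 2); a real Matrix_in comes from calibration.
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0
Matrix_in = np.array([[fx, 0.0, cx],
                      [0.0, fy, cy],
                      [0.0, 0.0, 1.0]])

def project(p_dcam):
    """Equations 3-4: project a camera-frame point to its pixel position."""
    p_pixel = Matrix_in @ p_dcam   # Equation 3: homogeneous pixel coordinates
    # Equation 4: normalize by the third component (the depth term)
    return np.array([p_pixel[0] / p_pixel[2], p_pixel[1] / p_pixel[2]])

# A point 1 m in front of the lens, slightly right of and below the axis.
p_image = project(np.array([0.1, -0.05, 1.0]))
```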
S03, extracting key feature points of a target object from the image:
Referring to fig. 1, first establish on the photographed image the extension-line vector from the arm-segment end center point S to the target-object pixel position P_image; referring to fig. 2, screen out the points on the forward side of this vector; referring to fig. 3, then define a center region according to the position of the target area and delete the pixel points outside the center region; referring to fig. 4, finally screen out the target area according to the depth-value range set for the target scene. Extract 3 key feature points of the target object from the target area, obtaining their positions in the pixel coordinate system; compared with the unscreened common feature points, these 3 key feature points have relatively higher stability and discrimination, and can be detected again in images photographed by the camera under different viewing angles and illumination conditions.
In this step, the target area is an area where the target object is located.
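The three screening stages of step S03 might be sketched as follows. The thresholds (`half_size`, the depth range) and the flat `(u, v, depth)` point list are hypothetical stand-ins for the photographed depth image:

```python
import numpy as np

def screen_target_area(points, S, P_image, half_size=80.0, d_min=0.3, d_max=1.2):
    """Keep points that (1) lie on the forward side of the S -> P_image
    extension-line vector, (2) fall inside a square centre region around
    P_image, and (3) sit within the depth range set for the target scene."""
    v = P_image - S                                  # extension-line vector
    kept = []
    for u, w, d in points:
        p = np.array([u, w])
        if np.dot(p - S, v) <= 0:                    # behind S: discard
            continue
        if np.any(np.abs(p - P_image) > half_size):  # outside centre region
            continue
        if not (d_min <= d <= d_max):                # outside depth range
            continue
        kept.append((u, w, d))
    return kept

S = np.array([100.0, 100.0])        # arm-segment end centre point (pixels)
P_image = np.array([300.0, 250.0])  # target-object pixel position
points = [(310.0, 255.0, 0.8),      # valid target-area point
          (50.0, 60.0, 0.8),        # behind S along the vector
          (310.0, 255.0, 2.0)]      # depth outside the configured range
area = screen_target_area(points, S, P_image)
```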
S04, acquiring a data set and training a neural network:
Record the positions of the 3 key feature points in the pixel coordinate system and in the camera coordinate system; the mechanical arm then performs a random operation, comprising rotation and translation, on the target object, and the following two operations are performed in no particular order: I, through the conversion matrix between the base and camera coordinate systems, calculate the new positions of the clamping-jaw center point and the 3 key feature points in the camera coordinate system after the operation is executed; II, acquire the target area newly generated after the operation. The random operation is repeated a number of times, and the positions of the 3 key feature points and the target area before and after each operation are recorded to form a data set for training the subsequent Keypoint-RCNN neural network. After training is completed, whenever the mechanical-arm clamping jaw in a demonstration video is observed to clamp any target object, the Keypoint-RCNN neural network automatically acquires 3 key feature points, tracks them as the demonstration video plays, and predicts their positions in the pixel coordinate system.
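The data-collection loop of step S04 could be organized as below. Every name here (`random_operation`, `collect_dataset`, the `observe` callback) is a hypothetical placeholder; the real system would execute the move on the arm and read the camera between the two observations:

```python
import random

def random_operation():
    """Stand-in for a random rotation + translation of the clamping jaw."""
    return {"rotation": [random.uniform(-10.0, 10.0) for _ in range(3)],   # deg
            "translation": [random.uniform(-0.05, 0.05) for _ in range(3)]}  # m

def collect_dataset(n_samples, observe):
    """Each record pairs the 3 key feature points' pixel and camera-frame
    positions plus the target area, before and after one random operation.
    `observe` returns (pixel_pts, camera_pts, target_area)."""
    dataset = []
    for _ in range(n_samples):
        op = random_operation()
        before = observe()
        # ... the mechanical arm would execute `op` here ...
        after = observe()
        dataset.append({"op": op, "before": before, "after": after})
    return dataset

# Dummy observer so the sketch runs stand-alone.
dummy = lambda: ([(0, 0)] * 3, [(0.0, 0.0, 1.0)] * 3, None)
data = collect_dataset(5, dummy)
```

The accumulated records then serve as training samples for the Keypoint-RCNN network described in the text.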
In this step, a scale-invariant feature transform (SIFT) algorithm is applied to extract a number of feature points from the target area of the photographed image while obtaining their positions in the pixel coordinate system; the 3 key feature points with the highest response values are then further selected as the main observation targets.
The scale-invariant feature transform algorithm comprises the following steps:
I, scale-space extrema detection: by constructing a difference-of-Gaussians scale space, search for local maxima or minima across different scales; these extreme points are the potential key feature points;
II, key point localization: finely localize the potential key feature points to remove points with relatively low contrast and edge responses, improving the stability and reliability of the key feature points;
III, key point orientation assignment: assign one or more orientations, determined by the directions of the local image gradients, to each screened key feature point, so that the SIFT descriptor is rotation-invariant;
IV, key point descriptor generation: generate a descriptor for each screened key feature point; the descriptor is a histogram of the gradients of the region around the key point, normalized to the scale and orientation of the key feature point so as to ensure scale and rotation invariance.
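The final selection in this step, picking the 3 SIFT feature points with the highest response values, reduces to a sort. The sketch below assumes a detector (for example OpenCV's `cv2.SIFT_create().detect`, whose `KeyPoint.response` field carries the response value) has already produced `(x, y, response)` tuples; the numbers are invented:

```python
def top3_by_response(keypoints):
    """Return the 3 feature points with the highest detector response,
    which serve as the key feature points of the target object."""
    return sorted(keypoints, key=lambda kp: kp[2], reverse=True)[:3]

# Hypothetical (x, y, response) tuples from a SIFT pass over the target area.
detected = [(12.0, 40.0, 0.031), (55.0, 18.0, 0.090),
            (30.0, 30.0, 0.012), (44.0, 61.0, 0.075),
            (20.0, 25.0, 0.050)]
key_feature_points = top3_by_response(detected)
```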
S05, mapping the pose of the target object in the demonstration video to real space:
Prepare in advance a demonstration video that the mechanical arm needs to imitate, in which the clamping jaw of a mechanical arm clamps an arbitrary article and executes operations; the Keypoint-RCNN neural network automatically acquires 3 key feature points from the demonstration video, tracks them as the video plays, and predicts their positions in the pixel coordinate system;
First, the built-in field-of-view parameter of the camera and the actual lengths between the key feature points in the first frame of the demonstration video are measured as prior knowledge;
then, when the camera is horizontally placed, a space coordinate system is established, in the space coordinate system, the origin is the center of the camera lens, the x-axis is arranged along the horizontal direction and is perpendicular to the axial center line of the camera lens, the y-axis is arranged along the vertical direction, and the z-axis is directed to the direction of the target object along the lens; and finally, mapping coordinates of the 3 key feature points automatically acquired by the Keypoint-RCNN neural network in an image coordinate system to coordinates in a space coordinate system.
The mapping process is as follows:
The coordinates of a key feature point in the image coordinate system are (P_cx, P_cy), where the subscript c denotes the serial number of the key feature point, P_cx the pixel coordinate of the c-th key feature point in the x direction, and P_cy its pixel coordinate in the y direction. The coordinates of a key feature point in real space are (x_c, y_c, z_c), where x_c, y_c and z_c denote the spatial coordinates of the c-th key feature point on the x-, y- and z-axes respectively;
The effective focal length EFL is calculated with reference to Equation 7;
Equation 7: EFL = H / (2 · tan(Fov / 2)), where Fov is the camera's built-in field-of-view parameter and H is the pixel height of the photographed image;
The conversion mapping of a key feature point from the image coordinate system to the spatial coordinate system gives its spatial position P_c, see Equation 8;
Equation 8: P_c = (x_c, y_c, z_c) = γ_c · (P_cx, P_cy, EFL), where the subscript c denotes the serial number of the key feature point and γ_c denotes the scale factor of the key feature point with serial number c, in mm/pixel;
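Equations 7 and 8 can be checked numerically. The field of view, image height and scale factor below are illustrative, and the sketch assumes the pixel coordinates (P_cx, P_cy) are measured relative to the image center:

```python
import math

def effective_focal_length(fov_rad, height_px):
    """Equation 7: EFL = H / (2 * tan(Fov / 2)), in pixel units."""
    return height_px / (2.0 * math.tan(fov_rad / 2.0))

def back_project(p_cx, p_cy, gamma, efl):
    """Equation 8: P_c = gamma * (P_cx, P_cy, EFL); with gamma in mm/pixel
    the result is a spatial position in millimetres."""
    return (gamma * p_cx, gamma * p_cy, gamma * efl)

EFL = effective_focal_length(math.radians(60.0), 480.0)  # illustrative camera
P1 = back_project(120.0, -80.0, 0.5, EFL)                # gamma = 0.5 mm/pixel
```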
A trust-region algorithm is adopted to obtain the numerical solution of Equation 8. The initial value of the trust-region algorithm is set between 0 and 1, and γ_1, γ_2 and γ_3 of the first frame of the demonstration video are solved first; the calculation result of the previous frame is then taken as the initial value of the iterative solution for the current frame, and so on, yielding the spatial coordinates of all key feature points in all frames of the demonstration video, expressed as Equation 9;
Equation 9: P_space_all = [P_1^1 … P_1^j; P_2^1 … P_2^j; P_3^1 … P_3^j], where the superscript denotes the frame number (the maximum frame number being j), the subscript denotes the serial number of the key feature point, and each column contains the spatial coordinates of the 3 key feature points in one frame.
In this step, to solve for the unknowns γ_c, the actual lengths L between any two key feature points are combined to construct a system of equations, shown as Equations 10-1, 10-2 and 10-3;
Equation 10-1: L_1-2 = ||P_1 − P_2||_2, where L_1-2 is the actual length between key feature point 1 and key feature point 2, and the expanded expressions of P_1 and P_2 are given by Equation 8;
Equation 10-2: L_1-3 = ||P_1 − P_3||_2, where L_1-3 is the actual length between key feature point 1 and key feature point 3, and the expanded expressions of P_1 and P_3 are given by Equation 8;
Equation 10-3: L_2-3 = ||P_2 − P_3||_2, where L_2-3 is the actual length between key feature point 2 and key feature point 3, and the expanded expressions of P_2 and P_3 are given by Equation 8;
By substituting Equation 8 into Equations 10-1, 10-2 and 10-3, the system can be solved for γ_c, namely γ_1, γ_2 and γ_3.
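The system of Equations 10-1 to 10-3 has three unknowns (γ_1, γ_2, γ_3) and three length constraints. As a simplified stand-in for the patent's trust-region solver, the sketch below applies a plain Newton iteration to the same residuals; the rays and lengths are synthetic test data, not measurements:

```python
import numpy as np

def solve_gammas(v, L, gamma0, iters=100):
    """Solve ||g_a*v_a - g_b*v_b|| = L_ab for the pairs (1,2), (1,3), (2,3)
    by Newton iteration (a simplification of the trust-region approach).
    v[c] is the back-projection ray (P_cx, P_cy, EFL) of key feature point c."""
    pairs = [(0, 1), (0, 2), (1, 2)]
    g = np.asarray(gamma0, dtype=float)
    for _ in range(iters):
        r = np.empty(3)
        J = np.zeros((3, 3))
        for k, (a, b) in enumerate(pairs):
            d = g[a] * v[a] - g[b] * v[b]
            n = np.linalg.norm(d)
            r[k] = n - L[k]
            J[k, a] = np.dot(d, v[a]) / n    # d r_k / d g_a
            J[k, b] = -np.dot(d, v[b]) / n   # d r_k / d g_b
        g = g - np.linalg.solve(J, r)
        if np.max(np.abs(r)) < 1e-12:
            break
    return g

# Synthetic check: build rays and inter-point lengths from known scale
# factors, then recover them from an initial value in (0, 1).
v = [np.array([100.0, 50.0, 400.0]),
     np.array([-80.0, 60.0, 400.0]),
     np.array([20.0, -90.0, 400.0])]
true_g = np.array([0.9, 1.1, 1.0])
L = [np.linalg.norm(true_g[a] * v[a] - true_g[b] * v[b])
     for a, b in [(0, 1), (0, 2), (1, 2)]]
gammas = solve_gammas(v, L, gamma0=[0.8, 0.8, 0.8])
```

Seeding each frame with the previous frame's result, as the text describes, keeps the iteration close to its solution and fast to converge.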
S06, mapping the space coordinates of the target object to a base coordinate system:
The spatial coordinates of the 3 key feature points in all frames of the demonstration video are mapped to the base coordinate system through the conversion matrix Matrix_trans between the base and camera coordinate systems, giving Equation 10;
Equation 10: P_Rob_all = Matrix_trans · P_space_all, where P_space_all denotes the spatial coordinates of the 3 key feature points in all frames and P_Rob_all their base coordinates in the base coordinate system in all frames; the target position that the mechanical arm needs to reach can be calculated from these base coordinates, and a control command is then output by the control system of the mechanical arm to control its action so as to reach the target position.
Claims (5)
1. A method for controlling a mechanical arm to simulate video actions, which relates to the mechanical arm and a camera; the mechanical arm comprises a base, an arm section and a clamping jaw which are sequentially connected, and the clamping jaw is used for clamping a target object; the target object is marked as P; the camera is a depth camera;
the method is characterized by comprising the following steps:
s01, acquiring the position of a target object in a camera coordinate system:
calibrating the hand-eye system to establish the coordinate conversion relation between the base and the camera, and using this relation to determine the conversion matrix between the base coordinate system and the camera coordinate system; since the pose of the clamping jaw in the base coordinate system can be derived from the state of the mechanical arm, assume the target object is at the clamping jaw center point P_Rob: (x_Rob, y_Rob, z_Rob) in the base coordinate system, and then calculate from equation 1 the position P_Dcam of the clamping jaw center point P_Rob in the camera coordinate system;
Equation 1: P_Dcam = T · P_Rob; where T denotes the conversion matrix between the base coordinate system and the camera coordinate system determined above;
s02, acquiring pixel positions of a target object on an image:
after intrinsic calibration of the camera, the intrinsic matrix Matrix_in shown in equation 2 is obtained;
Equation 2: Matrix_in = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]; where f_x is the focal length of the camera along the X-axis of the captured image, c_x is the optical center along the X-axis of the captured image, f_y is the focal length along the Y-axis of the captured image, and c_y is the optical center along the Y-axis of the captured image;
the position P_Dcam of the target object in the camera coordinate system is converted to the position P_pixel in the pixel coordinate system according to equation 3;
Equation 3: P_pixel = Matrix_in · P_Dcam; where P_pixel is referred to as the pixel coordinate point of the target object;
the pixel coordinate point P_pixel of the target object is normalized to obtain the pixel position P_image of the target object, as shown in equation 4;
Equation 4: P_image = (P_pixel[0] / P_pixel[2], P_pixel[1] / P_pixel[2]); where [0], [1] and [2] index the first, second and third coordinate values, so that P_pixel[0] is the X-axis coordinate value, P_pixel[1] the Y-axis coordinate value and P_pixel[2] the Z-axis coordinate value of the target object in the pixel coordinate system;
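Equations 2-4 amount to a standard pinhole projection, which can be sketched as follows. The focal lengths, optical center and test point are made-up values for illustration, not taken from the patent.

```python
# Illustrative sketch of equations 2-4: build the intrinsic matrix, project a
# camera-frame point, and normalize by the third coordinate to get the pixel
# position of the target object.
import numpy as np

fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0   # assumed calibration values
Matrix_in = np.array([[fx, 0.0, cx],
                      [0.0, fy, cy],
                      [0.0, 0.0, 1.0]])       # equation 2

P_Dcam = np.array([0.1, -0.05, 2.0])          # target in camera frame (m)
P_pixel = Matrix_in @ P_Dcam                  # equation 3
P_image = P_pixel[:2] / P_pixel[2]            # equation 4 normalization
print(P_image)  # -> [350. 225.]
```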
s03, extracting key feature points of a target object from the image:
on the captured image, first establish an extension vector from the arm segment end center point S to the target object pixel position P_image, and screen out the points on the forward side of this extension vector; then define a central region according to the position of the target region and delete the pixel points outside the central region; finally, screen out the target region according to the depth value range set for the target scene; extract 3 key feature points of the target object from the target region to obtain their positions in the pixel coordinate system; compared with the other, unscreened common feature points, these 3 key feature points have relatively higher stability and discriminability, and can be detected again in images captured by the camera under different viewing angles and illumination conditions;
in this step, the target region is the region where the target object is located;
s04, acquiring a data set and training a neural network:
recording the positions of the 3 key feature points in the pixel coordinate system and in the camera coordinate system, then having the mechanical arm perform a random operation on the target object, the random operation comprising rotation and translation; the following two operations are performed in either order: I, calculate, through the conversion matrix between the base coordinate system and the camera coordinate system, the new positions of the clamping jaw center point and the 3 key feature points in the camera coordinate system after the operation; II, acquire the target region newly generated by the operation; the random operation is repeated many times, and the positions of the 3 key feature points and the target region before and after each operation are recorded to form a data set for training the subsequent Keypoint-RCNN neural network; after training, whenever the clamping jaw of the mechanical arm in a demonstration video is observed gripping any target object, the Keypoint-RCNN neural network automatically acquires 3 key feature points, tracks them and predicts their positions in the pixel coordinate system as the demonstration video plays;
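The dataset-recording loop of S04 can be sketched as below. The record layout, the `random_rigid_op` helper and the restriction to rotation about the z-axis are assumptions for illustration, not the patent's specification; a real implementation would drive the arm and re-project the points through the conversion matrix.

```python
# Hedged sketch: apply a random rigid operation (rotation + translation) to
# the 3 key feature points and log their positions before and after each
# operation, building a training data set.
import numpy as np

rng = np.random.default_rng(0)

def random_rigid_op(pts: np.ndarray, rng) -> np.ndarray:
    """Rotate the 3x3 point array about z by a random angle, then translate."""
    theta = rng.uniform(-np.pi, np.pi)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    t = rng.uniform(-50.0, 50.0, size=3)          # translation in mm
    return pts @ R.T + t

dataset = []
pts = np.array([[0.0, 0.0, 0.0], [30.0, 0.0, 0.0], [0.0, 40.0, 0.0]])
for _ in range(100):                              # repeat the random operation
    new_pts = random_rigid_op(pts, rng)
    dataset.append({"before": pts.copy(), "after": new_pts})
    pts = new_pts
print(len(dataset))  # -> 100
```

Because each operation is rigid, the inter-point distances are preserved across records, which is exactly the property the length constraints of equations 10-1 to 10-3 rely on.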
s05, mapping the pose of the target object in the demonstration video to real space:
a demonstration video that the mechanical arm is to imitate is prepared in advance; in the demonstration video, the clamping jaw of the mechanical arm grips an arbitrary article and performs an operation; the Keypoint-RCNN neural network automatically acquires 3 key feature points from the demonstration video, tracks them and predicts their positions in the pixel coordinate system as the video plays;
first, the camera's built-in field-of-view parameter and the actual lengths between the key feature points in the first frame of the demonstration video are measured as prior knowledge;
then, with the camera placed horizontally, a spatial coordinate system is established whose origin is the center of the camera lens, whose x-axis lies in the horizontal direction perpendicular to the axial center line of the lens, whose y-axis lies in the vertical direction, and whose z-axis points along the lens toward the target object; finally, the coordinates of the 3 key feature points automatically acquired by the Keypoint-RCNN neural network in the image coordinate system are mapped to coordinates in the spatial coordinate system;
s06, mapping the space coordinates of the target object to a base coordinate system:
the spatial coordinates of the 3 key feature points in all frames of the demonstration video are mapped to the base coordinate system through the conversion matrix obtained by hand-eye calibration, giving equation 10;
Equation 10: P_base = T · P_space; where P_space denotes the spatial coordinates of the 3 key feature points in all frames, P_base denotes their base coordinates in the base coordinate system, and T denotes the conversion matrix between the base coordinate system and the camera coordinate system; the target position to be reached by the mechanical arm can be calculated from the base coordinates, and a control command is then output to move the mechanical arm to the target position.
2. The method for controlling a robotic arm to simulate a video action of claim 1, wherein: in step S01, the hand-eye system is a two-element system consisting of the mechanical arm and the depth camera, the base coordinate system takes the base as its reference, and the clamping jaw center point is the center of the smallest sphere that can completely contain the clamping jaw.
3. The method for controlling a robotic arm to simulate a video action of claim 1, wherein: in step S05, the mapping process is as follows:
the coordinates of a key feature point in the image coordinate system are (P_cx, P_cy); where the subscript c is the serial number of the key feature point, P_cx is the pixel coordinate of the c-th key feature point in the x direction, and P_cy is its pixel coordinate in the y direction; the coordinates of a key feature point in real space are (x_c, y_c, z_c); where x_c, y_c and z_c are the spatial coordinates of the c-th key feature point along the x, y and z axes respectively;
the effective focal length EFL is calculated with reference to equation 7;
Equation 7: EFL = H / (2 · tan(Fov / 2)); where Fov is the camera's built-in field-of-view parameter and H is the pixel height of the captured image;
the transformation P_c mapping a key feature point from the image coordinate system to the spatial coordinate system is given by equation 8;
Equation 8: P_c = (x_c, y_c, z_c) = γ_c · (P_cx - c_x, P_cy - c_y, EFL); where the subscript c is the serial number of the key feature point and γ_c is the scale factor of the key feature point with serial number c, in mm/pixel;
a trust-region algorithm is adopted to obtain a numerical solution of equation 8; the initial value of the trust-region algorithm is set between 0 and 1, γ_1, γ_2 and γ_3 are first solved for the first frame of the demonstration video, and the result computed for each frame is then used as the initial value for the iterative solution of the next frame; proceeding frame by frame in this way yields the spatial coordinates of all key feature points in all frames of the demonstration video, expressed as in equation 9;
Equation 9: P_space = [P_1^(1), P_2^(1), P_3^(1) | P_1^(2), P_2^(2), P_3^(2) | … | P_1^(j), P_2^(j), P_3^(j)]; where the superscript is the frame number (the maximum frame number is j), the subscript is the serial number of the key feature point, and each block between vertical lines contains the spatial coordinates of the 3 key feature points in one frame.
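Equations 7 and 8 of claim 3 can be sketched numerically. The field-of-view value, image height and the back-projection form below (optical center plus a per-point scale factor γ_c) are illustrative assumptions, not values from the patent.

```python
# Sketch of equation 7 (effective focal length from the field-of-view
# parameter and image pixel height) and equation 8 (back-projection of an
# image point into the spatial coordinate system with scale factor gamma_c).
import math

def efl(fov_rad: float, H: int) -> float:
    """Equation 7: EFL = H / (2 * tan(Fov / 2)), in pixels."""
    return H / (2.0 * math.tan(fov_rad / 2.0))

def back_project(pcx, pcy, cx, cy, gamma, EFL):
    """Equation 8: spatial coordinates of one key feature point (mm)."""
    return (gamma * (pcx - cx), gamma * (pcy - cy), gamma * EFL)

EFL = efl(math.radians(90.0), 480)   # 90-degree FOV, 480 px tall image
print(round(EFL))                    # -> 240
print(tuple(round(v, 6) for v in back_project(420, 340, 320, 240, 0.5, EFL)))
# -> (50.0, 50.0, 120.0)
```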
4. The method of controlling a robotic arm to simulate video actions as claimed in claim 3, wherein: in step S05, the unknowns γ_c are solved by combining the actual lengths L between pairs of key feature points into a system of equations, shown as equations 10-1, 10-2 and 10-3;
Equation 10-1: L_{1-2} = ||P_1 - P_2||_2; where L_{1-2} is the actual length between key feature point 1 and key feature point 2, and the expanded expressions of P_1 and P_2 are given by equation 8;
Equation 10-2: L_{1-3} = ||P_1 - P_3||_2; where L_{1-3} is the actual length between key feature point 1 and key feature point 3, and the expanded expressions of P_1 and P_3 are given by equation 8;
Equation 10-3: L_{2-3} = ||P_2 - P_3||_2; where L_{2-3} is the actual length between key feature point 2 and key feature point 3, and the expanded expressions of P_2 and P_3 are given by equation 8;
substituting equation 8 into equations 10-1, 10-2 and 10-3 allows γ_c to be solved, γ_c comprising γ_1, γ_2 and γ_3.
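One possible implementation of the claim-4 solve (an assumption, not the patent's own code) uses SciPy's Trust Region Reflective least-squares solver, with the inter-point lengths as constraints and initial values between 0 and 1 as the claim specifies. The geometry and lengths below are synthetic, generated from a ground-truth configuration.

```python
# Solve for the scale factors gamma_1..gamma_3 of equation 8 from the length
# constraints of equations 10-1, 10-2 and 10-3.
import numpy as np
from scipy.optimize import least_squares

EFL = 100.0                 # effective focal length in pixels (illustrative)
cx = cy = 0.0               # optical center (illustrative)
img_pts = np.array([[-40.0, -30.0], [40.0, -30.0], [0.0, 50.0]])  # pixels
# Actual lengths between key feature points (prior knowledge); here taken
# from a ground-truth configuration with all gammas equal to 1.
L12, L13, L23 = 80.0, np.sqrt(8000.0), np.sqrt(8000.0)

def spatial(gammas):
    """Equation 8: back-project each image point with its own scale factor."""
    return np.array([[g * (u - cx), g * (v - cy), g * EFL]
                     for g, (u, v) in zip(gammas, img_pts)])

def residuals(gammas):
    P = spatial(gammas)
    return [np.linalg.norm(P[0] - P[1]) - L12,
            np.linalg.norm(P[0] - P[2]) - L13,
            np.linalg.norm(P[1] - P[2]) - L23]

# 'trf' is SciPy's Trust Region Reflective method; bounds keep gammas positive.
res = least_squares(residuals, x0=[0.5, 0.5, 0.5], method="trf",
                    bounds=(1e-3, 10.0))
print(np.max(np.abs(res.fun)) < 1e-6)  # residuals vanish at the solution
```

As with the perspective-three-point problem, several positive solutions can satisfy the three length equations, which is why the claims seed each frame with the previous frame's result.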
5. The method for controlling a robotic arm to simulate a video action according to claim 4, wherein: in step S04, a scale-invariant feature transform (SIFT) algorithm is applied to extract a plurality of feature points from the target region of the captured image and obtain their positions in the pixel coordinate system, and the 3 key feature points with the highest response values are then selected as the main observation targets; the scale-invariant feature transform algorithm comprises the following steps:
I, scale-space extremum detection: a Gaussian difference scale space is constructed and local maxima or minima are searched across different scales; these extremum points are the potential key feature points;
II, keypoint localization: the potential key feature points are finely localized to remove points of relatively low contrast and edge responses, improving the stability and reliability of the key feature points;
III, keypoint orientation assignment: one or more orientations, determined by the directions of the local image gradients, are assigned to each screened key feature point so that the SIFT descriptor is rotation-invariant;
IV, keypoint descriptor generation: a descriptor, a histogram of the gradients in the region around the keypoint, is generated for each screened key feature point.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311571826.9A CN117464683A (en) | 2023-11-23 | 2023-11-23 | Method for controlling mechanical arm to simulate video motion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117464683A true CN117464683A (en) | 2024-01-30 |
Family
ID=89631268
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111136659A (en) * | 2020-01-15 | 2020-05-12 | 南京大学 | Mechanical arm action learning method and system based on third person scale imitation learning |
CN111203878A (en) * | 2020-01-14 | 2020-05-29 | 北京航空航天大学 | Robot sequence task learning method based on visual simulation |
CN111300431A (en) * | 2020-03-31 | 2020-06-19 | 山东大学 | Cross-scene-oriented robot vision simulation learning method and system |
CN112180720A (en) * | 2020-09-08 | 2021-01-05 | 武汉大学 | Fiber placement process parameter model construction method and system based on simulation learning |
CN112509392A (en) * | 2020-12-16 | 2021-03-16 | 复旦大学 | Robot behavior teaching method based on meta-learning |
CN112975968A (en) * | 2021-02-26 | 2021-06-18 | 同济大学 | Mechanical arm simulation learning method based on third visual angle variable main body demonstration video |
CN113524194A (en) * | 2021-04-28 | 2021-10-22 | 重庆理工大学 | Target grabbing method of robot vision grabbing system based on multi-mode feature deep learning |
KR20230134328A (en) * | 2022-03-14 | 2023-09-21 | 한국전자통신연구원 | Apparatus and method for teaching robot |
Non-Patent Citations (1)
Title |
---|
GAO Ge: "Operation Behavior Recognition Method for Demonstration Learning of Service Robots", China Master's Theses Full-text Database, Information Science and Technology, 15 May 2019 (2019-05-15) * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||