CN111055279B - Multi-mode object grabbing method and system based on combination of touch sense and vision - Google Patents

Multi-mode object grabbing method and system based on combination of touch sense and vision

Info

Publication number
CN111055279B
CN111055279B (application CN201911304586.XA)
Authority
CN
China
Prior art keywords
grabbing
data
manipulator
tactile
camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911304586.XA
Other languages
Chinese (zh)
Other versions
CN111055279A (en)
Inventor
刘厚德
周星如
张郑
王学谦
阮见
刘思成
梁斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN201911304586.XA priority Critical patent/CN111055279B/en
Publication of CN111055279A publication Critical patent/CN111055279A/en
Application granted granted Critical
Publication of CN111055279B publication Critical patent/CN111055279B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1694Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697Vision controlled systems
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1679Programme controls characterised by the tasks executed
    • B25J9/1682Dual arm manipulator; Coordination of several manipulators

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)

Abstract

The embodiments of the application disclose a multi-modal object grabbing method and system based on the combination of touch and vision. The method comprises the following steps: calibrating a camera; filtering background interference out of the image acquired from the camera; preprocessing the image to obtain a set of grabbing candidate regions, and selecting the N highest-scoring candidates as the manipulator's feasible grabbing regions; controlling the manipulator to randomly select one feasible grabbing region, close on it with a certain force, and stay for M time periods to acquire tactile data of the target object; fusing the collected tactile data with the image data obtained in A2, inputting the fused data into a convolutional neural network, and judging whether the grab is feasible; and sending a grabbing instruction to control the mechanical arm and manipulator to complete the action of grabbing the target object. The system is used to execute the method. The embodiments of the application can improve the one-time grabbing success rate.

Description

Multi-mode object grabbing method and system based on combination of touch sense and vision
Technical Field
The application relates to the technical field of robots, in particular to a multi-modal object grabbing method and system based on combination of touch sense and vision.
Background
The rapid development of artificial intelligence and hardware has greatly advanced industrial processes and robot science. Grabbing is the most basic function a robot needs in order to perform elementary task operations such as sorting and picking. In industrial production environments and logistics sorting tasks, robot grabbing is very common. However, current grabbing work generally uses a single modality: the grab is predicted through vision alone to obtain a point suitable for the manipulator to grab. Yet when the grab point is determined by vision alone, the center of gravity and the surface roughness of the object are often difficult to determine, and there are also objective factors such as systematic errors of the robot itself, input errors of the vision sensor, and environmental noise.
Generally, a point-contact grabbing method acquires point cloud information of the visible portion of the target object through a fixed depth camera and reconstructs the surface with a Gaussian process. By setting constraints for stable grabbing, such as the force closure principle, a set of feasible grabbing points satisfying the conditions is screened out, and finally the grabbing success rate is verified in a simulation environment and on a physical robot. However, this approach has a disadvantage: grabbing by vision alone obtains too little information about the object, so misjudgment is likely and the grab fails.
The general grabbing approach uses the robot's vision to judge the whole environment. The procedure is: first collect environmental information with a camera to obtain an overall RGBD image, then remove the background, segment the different objects into separate images, empirically infer the center of gravity of each object from the surface of the image of interest, determine the optimal grabbing point from that center of gravity, and grab. Such an approach infers the center of gravity only from experience with the object's shape, but the mass distribution may be uneven, leading to grabbing failure. It is also difficult to obtain all the geometric information of the objects in this way, so judging whether objects will collide with each other is not accurate; and because the center of gravity is hard to judge from vision alone, mistaken grabs and grabbing failures occur easily.
Another scheme relies on touch alone, using a GelSight tactile sensor: a miniature camera beneath a highly deformable surface captures the deformation state, from which the shape of the object surface at that moment is judged. This requires "exploring" the object surface before grabbing and then determining these shapes with a model trained offline in advance, to obtain the surface positions most suitable for grabbing. The problems with this scheme are: (1) with only touch and no vision, the grabbing position must be set manually, the time cost is too high, and the human-computer interaction is unfriendly; (2) if the position of the grabbing point is judged incorrectly, the robot fails to grab the object, which disturbs the environment of stacked objects; (3) the overall environment facing the robot is unknown, so other objects may be misjudged during motion planning of the mechanical arm, affecting the whole operating environment.
The above background disclosure is only for the purpose of assisting in understanding the inventive concepts and technical solutions of the present application and does not necessarily pertain to the prior art of the present application, and should not be used to assess the novelty and inventive step of the present application in the absence of explicit evidence to suggest that such matter has been disclosed at the filing date of the present application.
Disclosure of Invention
To address such difficulties in robot grabbing, the present application provides a multi-modal object grabbing method and system based on the combination of touch and vision, which simulates the process of a manipulator grabbing a target object in a real scene and can improve the one-time grabbing success rate.
In a first aspect, the present application provides a multi-modal object grabbing method based on haptic and visual combination, comprising:
a1, calibrating the camera to realize conversion from a world coordinate system to a pixel coordinate system;
a2, filtering out background interference factors of the image acquired from the camera;
a3, preprocessing the image in A2 to obtain a grabbing candidate region set, and selecting N grabbing candidate regions with the highest scores from the grabbing candidate region set as manipulator feasible grabbing regions;
a4, controlling the manipulator to randomly select one manipulator feasible grabbing area, close on it with a certain force, and stay for M time periods to acquire tactile data of the target object; fusing the collected tactile data with the image data obtained in A2, inputting the fused data into a convolutional neural network, and judging whether the grab is feasible; if the grab is not feasible, judging the remaining manipulator feasible grabbing areas by the same steps; if none of the N manipulator feasible grabbing areas can be grabbed, judging that the target object exceeds the manipulator's grabbing capability;
and A5, sending a grabbing command to control the mechanical arm and the mechanical hand to complete the action of grabbing the target object.
In some preferred embodiments, the method further comprises establishing a data set:
acquiring visual data of various objects;
for each of a plurality of parts of each object, continuously acquiring tactile data over P tactile-sensor acquisition cycles with the applied force increasing from small to large, and finally superposing all the time-series tactile data, so that one part yields P+1 tactile samples; a plurality of parts are acquired per object, giving a plurality of groups of tactile data;
aligning the plurality of sets of haptic data;
arranging the visual data and the tactile data into a column to realize the fusion of the visual data and the tactile data to obtain the visual tactile data;
and inputting the visual and tactile data into the convolutional neural network to train the characteristics of different target objects in the data set.
In some preferred embodiments, aligning the plurality of sets of haptic data is specifically: aligning the multiple groups of tactile data using the DTW (Dynamic Time Warping) method.
In some preferred embodiments, the plurality of sets of haptic data are two sets of haptic data for a two finger manipulator.
In some preferred embodiments, said A1 is specifically: calibrating the camera using Zhang Zhengyou's checkerboard calibration method.
In some preferred embodiments, the a2 includes: acquiring an image only containing a target object; performing two classifications of foreground and background on the image and framing out a target object in the foreground; after classification, the background is masked.
In some preferred embodiments, the manipulator is a two-finger manipulator; the convolutional neural network is y = f(x), where y indicates whether the grab can be performed and follows a binary 0/1 distribution; x = (D_CAMERA, D_LSENSOR, D_RSENSOR), where D_CAMERA, D_LSENSOR and D_RSENSOR are the data acquired from the camera, the left fingertip tactile sensor of the two-finger manipulator, and the right fingertip tactile sensor of the two-finger manipulator, respectively.
In some preferred embodiments, N has a value of 3, M has a value of 100, and P has a value of 100.
In a second aspect, the present application provides a multi-modal object grasping system based on a combination of touch and vision, comprising a camera, a manipulator, a robotic arm, a master control computer, and a force sensor; the main control computer is used for executing the method.
In a third aspect, the present application provides a computer readable storage medium having stored therein program instructions which, when executed by a processor of a computer, cause the processor to perform the above-described method.
Compared with the prior art, the beneficial effects of the embodiment of the application are as follows:
The contact area between the manipulator and the target object is modeled through vision, feasible grabbing areas that the manipulator can grab are screened out, and touch is used to judge whether the grabbed surface and force can successfully grab the screened areas; finally, the mechanical arm and manipulator are controlled to complete the action of grabbing the target object. The process by which a person grabs an object is fully simulated in a real physical environment. This can solve the problem that an object is difficult to grab when the object model information obtained by vision is too limited. It also more faithfully reproduces the real situation of the manipulator contacting the target object, so that in actual operation the one-time grabbing success rate is higher, repeated operations are reduced, and the time and energy costs of grabbing are reduced.
Drawings
FIG. 1 is a schematic diagram of a multi-modal object grasping system based on a combination of touch sensation and vision according to an embodiment of the present application;
FIG. 2 is an information interaction diagram of a multi-modal object grabbing method based on haptic and visual combination according to one embodiment of the present application;
FIG. 3 illustrates a workflow of a multi-modal object grabbing method based on haptic and visual integration according to one embodiment of the present application;
FIG. 4 illustrates the fusion of visual data and haptic data according to one embodiment of the present application.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present application more clearly apparent, the present application is further described in detail below with reference to fig. 1 to 4 and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. The connection may be for fixation or for circuit connection.
It will be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like, refer to an orientation or positional relationship indicated in the drawings that is solely for the purpose of facilitating the description of the embodiments and simplifying the description, and do not indicate or imply that the referenced device or element must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be considered as limiting the application.
Referring to fig. 1 and 2, the present embodiment provides a multi-modal object grabbing system based on the combination of touch sense and vision, which includes a main control computer 1, a camera 2, a camera fixing bracket 3, a manipulator 4, a mechanical arm 6, a storage platform 7 and a force sensor 8.
Technical terms which will be mentioned hereinafter are explained.
The force closure condition refers to the ability of the manipulator, through the contact forces applied to the object to be grabbed, to balance any external force and external moment while satisfying the corresponding friction constraints. A commonly used method to determine whether a grasp satisfies the force closure condition is to check whether the corresponding grasp matrix has full row rank. In general, the friction constraints differ because the contact models between the manipulator and the object to be grabbed differ. The contact models include the frictionless point contact model, the point contact model with friction, and the soft finger contact model; the frictionless point contact model has no friction constraint because the contact is modeled in an idealized way, while the point contact model with friction and the soft finger model have corresponding friction constraints. This embodiment uses the point contact model with friction.
Grasp matrix: a representation, in a multidimensional vector space, of the mapping between the grasping force and the associated contact forces at all contact points; all these contact forces must satisfy the friction constraints under the corresponding contact model.
Convolutional Neural Network: a feedforward artificial neural network with a deep structure that includes convolution computations and is widely applied in the image field. A common structure is: input layer - convolutional layer - pooling layer - fully-connected layer - output layer. The input layer of a convolutional neural network can process multidimensional data: the input layer of a one-dimensional convolutional neural network receives a one-dimensional or two-dimensional array, where the one-dimensional array is usually a time or spectrum sample and the two-dimensional array may include multiple channels; the input layer of a two-dimensional convolutional neural network receives a two-dimensional or three-dimensional array; the input layer of a three-dimensional convolutional neural network receives a four-dimensional array. Because convolutional neural networks are widely used in computer vision, many studies assume three-dimensional input data in advance when introducing their structure, i.e., two-dimensional pixels on a plane plus RGB channels. The hidden layers of a convolutional neural network comprise three common structures: convolutional layers, pooling layers and fully-connected layers; some more modern algorithms may have more complex structures such as Inception modules and residual blocks. In the common architecture, the convolutional and pooling layers are what characterize a convolutional neural network: the convolution kernels in the convolutional layers contain weight coefficients, while the pooling layers do not. The fully-connected layer in a convolutional neural network is equivalent to the hidden layer in a traditional feedforward neural network; it is usually placed at the end of the hidden layers and only passes signals to other fully-connected layers. The output layer is usually preceded by a fully-connected layer, so its structure and working principle are the same as those of the output layer in a traditional feedforward neural network. For image classification problems, the output layer outputs the classification label using a logistic function or a normalized exponential function (softmax function).
Multimodal: each source or form of information may be referred to as a modality. For example, humans have touch, hearing, vision, smell; information media such as voice, video, text and the like; a wide variety of sensors such as radar, infrared, accelerometer, etc. Each of the above may be referred to as a modality. Also, the modality may be defined very broadly, for example, two different languages may be considered as two modalities, and even the data sets collected under two different situations may be considered as two modalities. In this embodiment, one modality is a visual sensor and the other modality is a tactile sensor.
The main control computer 1 is equipped with a Windows operating system and an Ubuntu 16.04 system.
The camera 2 is a 3D depth camera for acquiring information of an object to be grabbed. Referring to fig. 1, a camera 2 is mounted on a fixed bracket 3, and the arrangement of the camera 2 is vertically downward. The 3D depth camera collects digital information about the depth and RGB image of the target object to be grabbed and transmits the obtained object information to the main control computer 1.
The manipulator 4 is a two-finger manipulator. The fingertips of the two-finger robot are equipped with a tactile sensor 41. The tactile sensor is a magneto-rheological tactile sensor. The tactile sensor 41 collects object information such as surface texture, roughness, and object centroid information about the target object to be grasped, and transmits the obtained object information to the main control computer 1. In other embodiments, the manipulator 4 may also be a three-finger manipulator, a four-finger manipulator or a five-finger manipulator, according to actual needs.
The mechanical arm 6 is a six-degree-of-freedom mechanical arm. In other embodiments, the mechanical arm 6 may also be a robotic arm having another number of degrees of freedom. The two-finger manipulator is mounted on the six-degree-of-freedom mechanical arm.
The storage platform 7 is used for placing the target object 5 to be grasped.
The force sensor 8 is a six-dimensional force sensor.
The main function of the main control computer 1 is to process the visual and tactile input data. It calculates the three-dimensional coordinate information of the target object from the depth and RGB image data obtained by the 3D depth camera; then, through inverse kinematics calculation, the main control computer 1 communicates with the controller of the mechanical arm 6 to control the position of the arm and the two fingers of the manipulator, completing the grabbing task.
The six-degree-of-freedom mechanical arm mainly performs the grabbing task. The controller of the mechanical arm 6 receives a motion command from the main control computer and moves to the designated position. The motion command is obtained by subtracting the measured offset between the end of the mechanical arm 6 and the two-finger manipulator from the spatial position of the target object; the spatial pose of the arm end is then obtained through inverse kinematics calculation, generating the motion command that moves the arm end to that spatial pose and the angle commands for the rotation of each corresponding joint, so that the grabbing task can be completed.
The two-finger manipulator is the key tool for performing the grabbing task: after the six-degree-of-freedom mechanical arm has moved to the designated position, the main control computer 1 sends an instruction to the manipulator, which completes the grab by opening and closing once it has reached the specific position.
The six-dimensional force sensor 8 is used to monitor the force of the two-finger robot when it is closed (i.e. to obtain the value of the contact force), because the gripping force required for gripping objects with different degrees of hardness is different. Therefore, the two-finger manipulator can sense the local shape and the grabbing strength of the object. In this embodiment, the gripping force is also used as one of the criteria for determining whether stable gripping can be achieved.
The depth and RGB information of the object to be grabbed are acquired by the 3D depth camera as raw data; OpenNI configured on the main control computer acquires this information, and data processing is performed with OpenCV to locate and grab the target object. This pipeline runs on the Ubuntu 16.04 system.
The embodiment also provides a multi-modal object grabbing method based on the combination of touch sense and vision, which comprises a training phase and a real object grabbing phase.
In order to make the object stably grabbed, in the training phase, visual and tactile data are respectively collected, and a data set is established. The training phase includes steps S1 through S5.
And step S1, acquiring visual data of various objects.
The visual data of 179 items in the YCB open source dataset are used as the visual data for training. In other embodiments, various objects can be selected from daily life, and then the visual data of the objects can be acquired through the camera.
Step S2: for each of several parts of each object, continuously acquire tactile data over P tactile-sensor acquisition cycles while the applied force increases from small to large, and finally superpose all the time-series tactile data, so that one part yields P+1 tactile samples; several parts are collected per object, giving multiple groups of tactile data. Step S2 collects the tactile information corresponding to the visual information of the item. In this embodiment, P has a value of 100.
The articles used are manufactured by 3D printing. For each of the different parts of each object, tactile data are continuously acquired over 100 tactile-sensor acquisition cycles with the applied force increasing from small to large, and finally all the time-series tactile data are superposed, so that one part yields 101 tactile samples; at least 10 parts are acquired per object. To simplify the acquisition, when collecting the tactile data the manipulator only attempts grabs on the object from light to heavy force along the three coordinate axes of the world coordinate system, until the object can be grabbed stably. The tactile data comprise object information and the corresponding forces; the object information includes information about the grabbed surface.
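A minimal Python sketch of this acquisition scheme, assuming hypothetical driver callbacks `read_tactile_frame` (one (n, 3) taxel frame per sensor cycle) and `command_grip_force` (closes the gripper with a given force); the array shapes are illustrative only:

```python
import numpy as np

P = 100  # tactile sensor acquisition cycles per grasped part

def collect_tactile_sequence(read_tactile_frame, command_grip_force, forces):
    """Collect P frames while the grip force is ramped from small to large.

    `read_tactile_frame` and `command_grip_force` are hypothetical callbacks
    standing in for the real sensor/manipulator drivers; each frame is an
    (n, 3) array of taxel readings.
    """
    frames = []
    for f in forces:                      # forces: P values from small to large
        command_grip_force(f)             # close the gripper with force f
        frames.append(np.asarray(read_tactile_frame()))
    stacked = np.stack(frames, axis=0)    # (P, n, 3) time series
    superposed = stacked.sum(axis=0)      # superposition of all P frames
    # one grasped part yields P + 1 samples: the P raw frames plus the superposition
    return np.concatenate([stacked, superposed[None]], axis=0)
```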
Step S3, align the multiple sets of haptic data.
Step S3 processes the tactile sensor data. Since a two-finger manipulator is used and tactile sensors are mounted on both fingertips, the tactile acquisition time series may be inconsistent (one sensor may collect more than 100 samples because of its sampling frequency, so the data may not match). The multiple groups of tactile data are the two groups of tactile data of the two-finger manipulator. The two sets of data are aligned using DTW (Dynamic Time Warping).
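A minimal sketch of the DTW alignment described above, written against plain NumPy arrays; the Euclidean frame distance and the (T, 3) sequence shape are assumptions for illustration:

```python
import numpy as np

def dtw_align(left_seq, right_seq):
    """Classic dynamic time warping between two tactile time series.

    Each sequence is a (T, 3) array (one frame per sensor cycle); returns the
    warping path as a list of (i, j) index pairs aligning the two series.
    """
    n, m = len(left_seq), len(right_seq)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(left_seq[i - 1] - right_seq[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack from (n, m) to recover the warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```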
And step S4, arranging the visual data and the tactile data into a column to realize the fusion of the visual data and the tactile data to obtain the visual tactile data.
Step S4 is visual haptic data fusion. Because both tactile and visual data are in an n x 3 data format (n being the number of sets of data acquired), the data is arranged in a column, as shown in fig. 4.
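A one-line sketch of this column-wise fusion, assuming the visual data and the two fingertip tactile arrays have already been brought to a common n x 3 shape:

```python
import numpy as np

def fuse_visual_tactile(visual, tactile_left, tactile_right):
    """Arrange the n x 3 visual array and the two n x 3 tactile arrays into one column."""
    return np.concatenate([visual, tactile_left, tactile_right], axis=0)  # shape (3n, 3)

# example: n = 100 rows per modality
fused = fuse_visual_tactile(np.zeros((100, 3)), np.ones((100, 3)), np.ones((100, 3)))
```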
And step S5, inputting the visual and tactile data into the convolutional neural network, and training the characteristics of different target objects in the data set.
In this embodiment, a neural network y = f(x) is trained, where y indicates whether grabbing is possible and follows a binary 0/1 distribution, and x = (D_CAMERA, D_LSENSOR, D_RSENSOR), where D_CAMERA, D_LSENSOR and D_RSENSOR are the data collected from the camera, the left fingertip sensor and the right fingertip sensor, respectively. Since the data of the left and right fingertips can be represented in an image format, the neural network here is a convolutional neural network.
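A hedged PyTorch sketch of such a binary grasp-feasibility classifier; the layer sizes, the 3 x 64 x 64 input shape and the training hyperparameters are illustrative assumptions rather than the network actually used in this embodiment:

```python
import torch
import torch.nn as nn

class VisuoTactileNet(nn.Module):
    """Binary grasp-feasibility classifier y = f(x) over fused visuo-tactile data."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, 2)   # two classes: graspable / not graspable

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.classifier(h)

# one training step; x is a placeholder batch of fused samples, y in {0, 1}
model = VisuoTactileNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
x = torch.randn(8, 3, 64, 64)
y = torch.randint(0, 2, (8,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```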
The object capture stage includes steps a1 through a5, and the execution subject is the main control computer 1.
Step A1, calibrating the camera to realize the conversion from the world coordinate system to the pixel coordinate system.
Step A1 is 3D depth camera calibration. To achieve accurate positioning, the 3D depth camera is first calibrated; Zhang Zhengyou's checkerboard calibration method is used to obtain the transformation matrix from the world coordinate system to the pixel coordinate system, thereby realizing the coordinate-system conversion.
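A minimal OpenCV sketch of Zhang Zhengyou's checkerboard calibration; the board size, square edge length and image paths are assumptions for illustration:

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)   # inner corners per checkerboard row / column (assumed)
square = 0.025     # square edge length in metres (assumed)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):          # assumed location of calibration shots
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# intrinsics K, distortion coefficients and per-view extrinsics (R, t):
# together they give the world -> pixel transformation used in step A1
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```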
And step A2, filtering background interference factors from the image acquired from the camera.
Step A2 specifically includes: segmenting the point cloud and removing noise interference. Because background interference exists during actual recognition of the object to be grabbed, the background interference must be filtered out first to obtain image information containing only the object to be grabbed. In the background-filtering step, an open-source Mask R-CNN network performs a two-class (foreground/background) classification on the picture and frames the object in the foreground; after classification, a masking operation is applied to the background, i.e. all pixel values of the background portion of the image are set to 0, eliminating the influence of the background on the foreground object.
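A minimal sketch of this background-masking step using the off-the-shelf Mask R-CNN from torchvision; the score and mask thresholds are assumptions:

```python
import numpy as np
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(pretrained=True).eval()

def mask_background(rgb):
    """rgb: HxWx3 uint8 image; returns the image with background pixels set to 0."""
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        pred = model([tensor])[0]
    keep = pred["scores"] > 0.7               # keep confident detections only (assumed threshold)
    if keep.sum() == 0:
        return rgb
    masks = pred["masks"][keep, 0] > 0.5      # (k, H, W) boolean instance masks
    foreground = masks.any(dim=0).numpy()
    out = rgb.copy()
    out[~foreground] = 0                      # assign all background pixel values to 0
    return out
```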
And A3, preprocessing the image in the step A2 to obtain a capture candidate region set, and selecting N capture candidate regions with the highest scores from the capture candidate region set as the feasible capture regions of the manipulator.
In this embodiment, N has a value of 3.
Preprocessing the image in the step A2; namely, preprocessing is performed on the visual data by using a neural network to obtain a capture candidate area set, and 3 candidates with the highest scores are selected as manipulator feasible capture areas.
The grabbing candidate region set can be obtained by prior-art means, and the selection of the highest-scoring candidates can likewise be achieved with the prior art, such as that of Chinese patent application No. 201910527766.8.
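A minimal sketch of selecting the N = 3 highest-scoring candidates; the (score, region) candidate format is an assumption for illustration:

```python
def select_feasible_regions(candidates, n=3):
    """candidates: iterable of (score, region) pairs from the preprocessing network.
    Returns the n highest-scoring regions as the manipulator feasible grabbing areas."""
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    return [region for _, region in ranked[:n]]
```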
Step A4, controlling the manipulator to randomly select one manipulator feasible grabbing area, close on it with a certain force, and stay for M time periods to acquire tactile data of the target object; fusing the collected tactile data with the image data obtained in step A2, inputting the fused data into the convolutional neural network, and judging whether the grab is feasible; if not, judging the remaining manipulator feasible grabbing areas by the same steps; and if none of the N manipulator feasible grabbing areas can be grabbed, judging that the target object exceeds the grabbing capability of the manipulator.
In the present embodiment, the value of M is 100. Referring to fig. 3, according to the control instruction of the main control computer 1, the manipulator 4 randomly selects one candidate grabbing area, i.e. one manipulator feasible grabbing area; based on the pose relationship between the six-degree-of-freedom mechanical arm and the two-finger manipulator, the main control computer converts the grab-point coordinates of the selected feasible grabbing area, via the MoveIt! software, into a pose command for the arm motion and a timing control command for opening and closing the two-finger manipulator, which are sent to the mechanical arm and the two-finger manipulator respectively. The manipulator is closed with a certain force and left to collect data for 100 time periods. The main control computer 1 fuses the acquired tactile data with the image data obtained in step A2 and inputs them into the convolutional neural network to determine whether the grab is feasible; if not, the other two feasible grabbing areas are judged by the same steps. If all three manipulator feasible grabbing areas are judged not graspable, the object is considered to exceed the grabbing capability of the manipulator. The tactile data comprise the object information of the target object and the corresponding closing force of the manipulator; the object information of the target object includes surface information of the grabbed target object.
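A minimal sketch of this step A4 decision loop, assuming hypothetical callbacks for gripper closing/tactile collection, visuo-tactile fusion and CNN classification:

```python
import random

def try_grasp_regions(regions, close_and_collect, fuse_with_image, classify, m=100):
    """Step A4 decision loop (sketch).

    `close_and_collect` closes the gripper on a region and records m tactile
    periods, `fuse_with_image` fuses the tactile data with the step-A2 image,
    and `classify` runs the trained CNN (1 = graspable, 0 = not graspable).
    All three are hypothetical callbacks standing in for the real system.
    """
    remaining = list(regions)
    random.shuffle(remaining)                 # regions are tried in random order
    for region in remaining:
        tactile = close_and_collect(region, periods=m)
        if classify(fuse_with_image(tactile)) == 1:
            return region                     # feasible grab found
    return None                               # all N regions failed: beyond grabbing capability
```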
The specific instruction sequence is as follows: in the initial state the two-finger gripper is closed and horizontal, 20 cm above the object; when the manipulator reaches a position 5 cm from the object to be grabbed, the two-finger manipulator opens and its position and posture are adjusted to avoid collision with the object; when the manipulator reaches the optimal grabbing area without yet contacting it, it closes with a certain force and stays for a certain time, completing the grabbing instruction.
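A hedged ROS/MoveIt sketch of converting a selected grasp pose into arm motion along the lines of this instruction sequence; the planning group name "manipulator" and the 20 cm approach height are assumptions about the specific robot setup:

```python
import copy
import sys
import rospy
import moveit_commander

moveit_commander.roscpp_initialize(sys.argv)
rospy.init_node("grasp_executor")
arm = moveit_commander.MoveGroupCommander("manipulator")  # assumed planning group name

def move_to_pregrasp(grasp_pose, approach_height=0.20):
    """Move the arm end effector to a point `approach_height` above the grasp pose.

    `grasp_pose` is a geometry_msgs/Pose for the selected feasible grabbing area.
    """
    target = copy.deepcopy(grasp_pose)
    target.position.z += approach_height   # stay above the object before opening the gripper
    arm.set_pose_target(target)
    arm.go(wait=True)                      # plan and execute
    arm.stop()
    arm.clear_pose_targets()
```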
Whether the grab in step A4 is feasible can be determined as follows: first, features are extracted from the visual data and the tactile data by separate small neural networks and then fused; the fused data are input into the convolutional neural network, which is trained so that the output is a binary classification result: graspable is 1 and non-graspable is 0.
Step a5, the main control computer 1 sends out a grabbing command to control the mechanical arm and the mechanical hand to complete the action of grabbing the target object.
After receiving the grabbing command sent by the main control computer 1, the mechanical arm moves to the specified spatial position and adjusts its end attitude. After the mechanical arm reaches the expected position, the two-finger manipulator executes the control command and completes the action of grabbing the target object, achieving a stable grab. The mechanical arm and the manipulator execute their tasks in sequence to complete the instruction.
The conditions for stable grabbing are that the force closure condition under the contact model, the task constraints, the constraints of the mechanical arm's own structure, and the tactilely sensed shape and grabbing force, among others, are satisfied. As long as these conditions are satisfied when the grabbing task is performed, the grab can be judged successful.
In this method, the contact area between the manipulator and the target object is first modeled through vision and the areas the manipulator can grab are screened out; the similarity between the target object and the objects in the data set is measured by the convolutional neural network, and touch is used to judge whether the grabbed surface and force can successfully grab the screened area; finally, the target object is grabbed under the control of the mechanical arm. The mechanical arm control mainly includes motion control of the arm body and grabbing pose control of the manipulator. The process by which a person grabs an object is fully simulated in a real physical environment. This can solve the problem that an object is difficult to grab when the object model information obtained by vision is too limited. It also more faithfully reproduces the real situation of the manipulator contacting the target object, so that in actual operation the one-time grabbing success rate is higher, repeated operations are reduced, and the time and energy costs of grabbing are reduced.
For a robot to complete a grabbing task, the common existing difficulties are the limited precision of sensors, the unknown weight and center of mass of the target object, its irregular shape and surface friction coefficient, and the objectively non-ideal environment, which together prevent the surface information of the object from being acquired accurately and the grabbing task from being completed accurately. This embodiment proposes the concept of multi-modal fused grabbing and trains a neural network suitable for multi-modal grabbing; in the actual grabbing process, stable grabbing is achieved by using the multi-modal visual and tactile information through the convolutional-neural-network method, in cooperation with the grabbing configuration of the two-finger manipulator. Specifically, the shape of the target object is judged through vision and an area suitable for grabbing is found; upon contact with the object, its weight and center of gravity are judged, and the grabbing force and grabbing area are adjusted.
For the situation in which the position of the target object in the world coordinate system is unknown and its surface shape, material, friction coefficient and center-of-mass position are unknown, this embodiment provides a multi-modal grabbing approach based on visual-tactile fusion: the separately acquired visual and tactile data are fused and passed through a convolutional neural network, which can improve the object grabbing success rate.
In order to ensure accurate grabbing, this embodiment establishes a grabbing coordinate system based on the two fingers and the target object, i.e. step A1; from the pose relationship between the two, the target object can be accurately located and described, and the grabbing action can be accurately described in a parameterized way.
Judging the appearance and the center of mass of the object by combining the visual and tactile modalities achieves stable grabbing, which is of great significance for completing difficult tasks, expanding the application range of robots, and promoting the development of the robot industry.
Those skilled in the art will appreciate that all or part of the processes of the embodiments methods may be performed by a computer program, which may be stored in a computer-readable storage medium and executed to perform the processes of the embodiments methods. And the aforementioned storage medium includes: various media capable of storing program codes, such as ROM or RAM, magnetic or optical disks, etc.
The foregoing is a further detailed description of the present application in connection with specific/preferred embodiments and is not intended to limit the present application to that particular description. For a person skilled in the art to which the present application pertains, several alternatives or modifications to the described embodiments may be made without departing from the concept of the present application, and these alternatives or modifications should be considered as falling within the scope of the present application.

Claims (7)

1. A multi-modal object grabbing method based on combination of touch sense and vision is characterized by comprising the following steps:
a1, calibrating the camera to realize conversion from a world coordinate system to a pixel coordinate system;
a2, filtering out background interference factors of the image acquired from the camera;
a3, preprocessing the image in A2 to obtain a grabbing candidate region set, and selecting N grabbing candidate regions with the highest scores from the grabbing candidate region set as manipulator feasible grabbing regions;
a4, controlling the manipulator to randomly select one manipulator feasible grabbing area, close on it with a certain force, and stay for M time periods to acquire tactile data of the target object; fusing the collected tactile data with the image data obtained in A2, inputting the fused data into a convolutional neural network, and judging whether the grab is feasible; if the grab is not feasible, judging the remaining manipulator feasible grabbing areas by the same steps; if none of the N manipulator feasible grabbing areas can be grabbed, judging that the target object exceeds the manipulator's grabbing capability;
a5, sending a grabbing command to control the mechanical arm and the mechanical hand to complete the action of grabbing the target object;
judging the shape of the target object through vision, and finding out an area suitable for grabbing; the weight and the gravity center of the object are judged while the object is contacted, and the grabbing strength and the grabbing area are adjusted;
further comprising establishing a data set:
acquiring visual data of various objects;
for each of a plurality of parts of each object, continuously acquiring tactile data over P tactile-sensor acquisition cycles with the applied force increasing from small to large, and finally superposing all the time-series tactile data, wherein one part yields P+1 tactile samples and a plurality of parts are acquired per object to obtain a plurality of groups of tactile data, the groups of tactile data being the two groups of tactile data acquired by the tactile sensors on the different fingers of the two-finger manipulator;
aligning the multiple groups of tactile data using the DTW (Dynamic Time Warping) method;
arranging the visual data and the tactile data into a column to realize the fusion of the visual data and the tactile data to obtain the visual tactile data;
and inputting the visual and tactile data into the convolutional neural network to train the characteristics of different target objects in the data set.
2. The method according to claim 1, wherein A1 is specifically: calibrating the camera using Zhang Zhengyou's checkerboard calibration method.
3. The method according to claim 1, wherein said a2 comprises: performing two classifications of foreground and background on the image and framing out a target object in the foreground; after classification, the background is masked.
4. The method of claim 1, wherein the manipulator is a two-finger manipulator; the convolutional neural network is y = f(x), where y indicates whether the grab can be performed and follows a binary 0/1 distribution; x = (D_CAMERA, D_LSENSOR, D_RSENSOR), where D_CAMERA, D_LSENSOR and D_RSENSOR are the data acquired from the camera, the left fingertip tactile sensor of the two-finger manipulator, and the right fingertip tactile sensor of the two-finger manipulator, respectively.
5. The method of claim 1, further comprising: the value of N is 3, the value of M is 100, and the value of P is 100.
6. A multi-modal object grabbing system based on a combination of touch and vision, characterized in that: the system comprises a camera, a manipulator, a mechanical arm, a main control computer and a force sensor; the main control computer is adapted to perform the method according to any one of claims 1 to 5.
7. A computer-readable storage medium characterized by: the computer-readable storage medium has stored therein program instructions which, when executed by a processor of a computer, cause the processor to carry out the method according to any one of claims 1 to 5.
CN201911304586.XA 2019-12-17 2019-12-17 Multi-mode object grabbing method and system based on combination of touch sense and vision Active CN111055279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911304586.XA CN111055279B (en) 2019-12-17 2019-12-17 Multi-mode object grabbing method and system based on combination of touch sense and vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911304586.XA CN111055279B (en) 2019-12-17 2019-12-17 Multi-mode object grabbing method and system based on combination of touch sense and vision

Publications (2)

Publication Number Publication Date
CN111055279A CN111055279A (en) 2020-04-24
CN111055279B true CN111055279B (en) 2022-02-15

Family

ID=70302073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911304586.XA Active CN111055279B (en) 2019-12-17 2019-12-17 Multi-mode object grabbing method and system based on combination of touch sense and vision

Country Status (1)

Country Link
CN (1) CN111055279B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582186B (en) * 2020-05-11 2023-12-15 深圳阿米嘎嘎科技有限公司 Object edge recognition method, device, system and medium based on vision and touch
CN112060085B (en) * 2020-08-24 2021-10-08 清华大学 Robot operation pose control method based on visual-touch multi-scale positioning
CN112047113B (en) * 2020-08-26 2022-02-22 苏州中科全象智能科技有限公司 3D visual stacking system and method based on artificial intelligence technology
CN112347900B (en) * 2020-11-04 2022-10-14 中国海洋大学 Monocular vision underwater target automatic grabbing method based on distance estimation
CN112388655B (en) * 2020-12-04 2021-06-04 齐鲁工业大学 Grabbed object identification method based on fusion of touch vibration signals and visual images
CN112720496B (en) * 2020-12-30 2022-04-29 深兰智能科技(上海)有限公司 Control method and device for manipulator, pickup device and storage medium
CN112809679B (en) * 2021-01-25 2023-04-07 清华大学深圳国际研究生院 Method and device for grabbing deformable object and computer readable storage medium
CN113172663A (en) * 2021-03-24 2021-07-27 深圳先进技术研究院 Manipulator grabbing stability identification method and device and electronic equipment
CN113172629B (en) * 2021-05-06 2023-08-01 清华大学深圳国际研究生院 Object grabbing method based on time sequence tactile data processing
CN113894779B (en) * 2021-09-10 2023-10-17 人工智能与数字经济广东省实验室(广州) Multi-mode data processing method applied to robot interaction
CN113730054A (en) * 2021-09-13 2021-12-03 桂林电子科技大学 Method for controlling gripping force of myoelectric artificial limb
CN113942009B (en) * 2021-09-13 2023-04-18 苏州大学 Robot bionic hand grabbing method
CN113696186B (en) * 2021-10-09 2022-09-30 东南大学 Mechanical arm autonomous moving and grabbing method based on visual-touch fusion under complex illumination condition
CN114030843B (en) * 2021-10-27 2022-11-18 因格(苏州)智能技术有限公司 Article circulation method and system
CN113954076B (en) * 2021-11-12 2023-01-13 哈尔滨工业大学(深圳) Robot precision assembling method based on cross-modal prediction assembling scene
CN113858217B (en) * 2021-12-01 2022-02-15 常州唯实智能物联创新中心有限公司 Multi-robot interaction three-dimensional visual pose perception method and system
CN114056704B (en) * 2021-12-03 2022-11-11 惠州市德赛电池有限公司 Feeding deviation rectifying method and equipment and storage medium
CN114229451A (en) * 2021-12-30 2022-03-25 宁波智能成型技术创新中心有限公司 Intelligent grabbing anti-falling detection and regulation method based on multi-axial force and moment
CN115837985B (en) * 2023-02-28 2023-05-09 纳博特南京科技有限公司 Disordered grabbing method based on machine vision
CN117773952A (en) * 2024-02-23 2024-03-29 浙江强脑科技有限公司 Bionic hand control method, storage medium, control device and bionic hand

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106671112A (en) * 2016-12-13 2017-05-17 清华大学 Judging method of grabbing stability of mechanical arm based on touch sensation array information
CN106960099A (en) * 2017-03-28 2017-07-18 清华大学 A kind of manipulator grasp stability recognition methods based on deep learning
KR20190070386A (en) * 2017-12-12 2019-06-21 한국로봇융합연구원 Robot hand for grasping object by using visual information and tactual information and control method thereof
CN110091331A (en) * 2019-05-06 2019-08-06 广东工业大学 Grasping body method, apparatus, equipment and storage medium based on manipulator
CN110170994A (en) * 2019-04-29 2019-08-27 清华大学 A kind of tactile method of servo-controlling for manipulator crawl task
CN110271000A (en) * 2019-06-18 2019-09-24 清华大学深圳研究生院 A kind of grasping body method based on oval face contact
CN110428465A (en) * 2019-07-12 2019-11-08 中国科学院自动化研究所 View-based access control model and the mechanical arm grasping means of tactile, system, device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101308373B1 (en) * 2010-11-17 2013-09-16 삼성전자주식회사 Method of controlling robot

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106671112A (en) * 2016-12-13 2017-05-17 清华大学 Judging method of grabbing stability of mechanical arm based on touch sensation array information
CN106960099A (en) * 2017-03-28 2017-07-18 清华大学 A kind of manipulator grasp stability recognition methods based on deep learning
KR20190070386A (en) * 2017-12-12 2019-06-21 한국로봇융합연구원 Robot hand for grasping object by using visual information and tactual information and control method thereof
CN110170994A (en) * 2019-04-29 2019-08-27 清华大学 A kind of tactile method of servo-controlling for manipulator crawl task
CN110091331A (en) * 2019-05-06 2019-08-06 广东工业大学 Grasping body method, apparatus, equipment and storage medium based on manipulator
CN110271000A (en) * 2019-06-18 2019-09-24 清华大学深圳研究生院 A kind of grasping body method based on oval face contact
CN110428465A (en) * 2019-07-12 2019-11-08 中国科学院自动化研究所 View-based access control model and the mechanical arm grasping means of tactile, system, device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Robot grasp slip detection based on visual-tactile fusion; Cui Shaowei et al.; Journal of Huazhong University of Science and Technology (Natural Science Edition); 2019-11-12; Vol. 48, No. 1; https://doi.org/10.13245/j.hust.200118 *

Also Published As

Publication number Publication date
CN111055279A (en) 2020-04-24

Similar Documents

Publication Publication Date Title
CN111055279B (en) Multi-mode object grabbing method and system based on combination of touch sense and vision
Calandra et al. More than a feeling: Learning to grasp and regrasp using vision and touch
Newbury et al. Deep learning approaches to grasp synthesis: A review
JP6921151B2 (en) Deep machine learning methods and equipment for robot grip
CN108972494B (en) Humanoid manipulator grabbing control system and data processing method thereof
Wang et al. Controlling object hand-over in human–robot collaboration via natural wearable sensing
Schmidt et al. Grasping of unknown objects using deep convolutional neural networks based on depth images
JP5209751B2 (en) Robot drive system, robot drive method, and robot drive program
US20230108488A1 (en) Method and system for grasping an object
Yun et al. Grasping pose detection for loose stacked object based on convolutional neural network with multiple self-powered sensors information
CN110271000B (en) Object grabbing method based on elliptical surface contact
Hossain et al. Pick-place of dynamic objects by robot manipulator based on deep learning and easy user interface teaching systems
Suzuki et al. Grasping of unknown objects on a planar surface using a single depth image
WO2022015807A1 (en) Method and system for object grasping
Ottenhaus et al. Visuo-haptic grasping of unknown objects based on gaussian process implicit surfaces and deep learning
Simão et al. Natural control of an industrial robot using hand gesture recognition with neural networks
CN113172629A (en) Object grabbing method based on time sequence tactile data processing
Hak et al. Reverse control for humanoid robot task recognition
Iscimen et al. Smart robot arm motion using computer vision
Kwan et al. Gesture recognition for initiating human-to-robot handovers
Hossain et al. Object recognition and robot grasping: A deep learning based approach
Ugur et al. Learning to grasp with parental scaffolding
Useche et al. Algorithm of detection, classification and gripping of occluded objects by CNN techniques and Haar classifiers
CN113420752A (en) Three-finger gesture generation method and system based on grabbing point detection
Saito et al. Detecting features of tools, objects, and actions from effects in a robot using deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant