CN114495109A - Grabbing robot based on matching of target and scene characters and grabbing method and system - Google Patents

Grabbing robot based on matching of target and scene characters and grabbing method and system

Info

Publication number
CN114495109A
CN114495109A
Authority
CN
China
Prior art keywords
target
detection
grabbing
text
coordinate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210081494.5A
Other languages
Chinese (zh)
Inventor
许庆阳
刘志超
丁凯旋
宋勇
李贻斌
张承进
袁宪锋
庞豹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202210081494.5A priority Critical patent/CN114495109A/en
Publication of CN114495109A publication Critical patent/CN114495109A/en
Pending legal-status Critical Current

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 9/00 Programme-controlled manipulators
    • B25J 9/16 Programme controls
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of intelligent robots and provides a grabbing robot based on matching of a target and scene characters, and a grabbing method and system. From the image of the target to be grabbed acquired by the camera and the target detection model, a CNN performs feature extraction and regresses the classification results and bounding boxes of the targets to be grabbed. For targets with the same classification result, a text detection and recognition model extracts and recognizes the characters in each target detection-box region, and an initial three-dimensional coordinate is obtained after the character recognition result is successfully matched with the specific target. A target tracking algorithm then locates the detection box of the specific target to be grabbed to obtain the final grabbing coordinates, and the chassis motion and mechanical arm motion are controlled according to the grabbing coordinates to grab the specific target.

Description

Grabbing robot based on matching of target and scene characters and grabbing method and system
Technical Field
The invention belongs to the field of intelligent robots, and particularly relates to a grabbing robot based on matching of a target and scene characters, and a grabbing method and a grabbing system of the grabbing robot.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In the prior art, most robot grasp detection algorithms either perform grasp detection directly on a single object, or distinguish multiple objects with complex neural networks through segmentation, classification and labelling. However, when a grabbing scene contains a large number of objects whose appearance and colour are consistent or which belong to the same category, these detection algorithms cannot finely distinguish the objects, which directly affects the grabbing decision of the robot and results in insufficient grabbing precision.
Disclosure of Invention
In order to solve at least one technical problem in the background technology, the invention provides a grabbing robot based on matching of a target and scene characters, a grabbing method and a grabbing system.
In order to achieve the purpose, the invention adopts the following technical scheme:
the first aspect of the invention provides a grabbing robot based on matching of a target and scene characters, which comprises: a depth camera, a chassis, a mechanical arm and a controller;
the controller comprises a preliminary detection module of an object to be grabbed, a text detection and identification module and an object grabbing module;
the preliminary detection module of the object to be grabbed is configured to: according to the image of the target to be grabbed acquired by the camera and the target detection model, perform feature extraction using a CNN, and regress the classification result and bounding box of the target to be grabbed;
the text detection recognition module is configured to: for targets with the same classification result, extracting characters in a target detection box area by adopting a text detection and identification model for detection and identification, and obtaining an initial three-dimensional coordinate after the character identification result is successfully matched with a specific target;
the object grabbing module is configured to: and positioning the specific grabbing target detection frame by using a target tracking algorithm to obtain a final grabbing coordinate, and controlling the chassis motion and the mechanical arm motion to grab the specific target according to the grabbing coordinate.
The second aspect of the present invention provides a capture method based on matching of a target and a scene text, including the following steps:
acquiring an image of a target to be captured;
performing feature extraction by using CNN according to the image of the target to be captured and the target detection model, and performing regression to obtain a classification result and a boundary frame of the target to be captured;
for targets with the same classification result, extracting characters in a target detection box area by adopting a text detection and identification model for detection and identification, and obtaining an initial three-dimensional coordinate after the character identification result is successfully matched with a specific target;
and positioning the specific grabbing target detection frame by using a target tracking algorithm to obtain a final grabbing coordinate, and controlling the chassis motion and the mechanical arm motion to grab the specific target according to the grabbing coordinate.
The third aspect of the present invention provides a capture system based on matching of a target and a scene text, including:
the robot comprises a preliminary detection module of an object to be grabbed, a text detection and identification module and an object grabbing module;
the preliminary detection module of the target to be grabbed is used for acquiring an image of the target to be grabbed; performing feature extraction by using CNN according to the image of the target to be captured and the target detection model, and performing regression to obtain a classification result and a boundary frame of the target to be captured;
the text detection and identification module is used for extracting characters in a target detection frame area for detection and identification by adopting a text detection and identification model for targets with the same classification result, and obtaining an initial three-dimensional coordinate after the character identification result is successfully matched with a specific target;
the target grabbing module is used for positioning the specific grabbed target detection frame by using a target tracking algorithm to obtain a final grabbing coordinate, and controlling the chassis motion and the mechanical arm motion to grab the specific target according to the grabbing coordinate.
Compared with the prior art, the invention has the beneficial effects that:
according to the method, the lightweight target detection model NanoDet is used for carrying out target detection on the object to be grabbed, and then the image in the detection frame area is subjected to enhancement processing, so that adverse factors such as the target area is too small are overcome. And carrying out character detection and recognition on the enhanced detection box area by utilizing a character detection and recognition model PP-OCR, and extracting character information. And fusing the target information provided by the two models to realize matching of the character recognition result and the object target detection frame, and finishing accurate positioning of the object to be grabbed. The real-time tracking of a specific target is realized through a KCF tracking algorithm, so that the accurate grabbing control of the robot is realized.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention, and are included to illustrate an exemplary embodiment of the invention and not to limit the invention.
FIG. 1 is a schematic overall flow chart of a target capture monitoring and positioning method according to an embodiment of the present invention;
FIG. 2 is a structure diagram of a NanoDet according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process of enhancing an image of a detection frame region according to an embodiment of the present invention;
FIG. 4 is a schematic view of a PP-OCR detection flow according to an embodiment of the present invention;
FIG. 5 is a diagram of a CRNN structure according to an embodiment of the present invention;
FIGS. 6(a)-6(b) illustrate an IOU calculation process according to an embodiment of the present invention;
FIGS. 7(a)-7(d) illustrate a target tracking process according to an embodiment of the present invention;
FIGS. 8(a)-8(c) illustrate a depth camera calibration and registration process according to an embodiment of the present invention;
FIGS. 9(a)-9(c) illustrate a robot grasping action according to an embodiment of the present invention;
FIGS. 10(a)-10(e) are diagrams illustrating the character detection and recognition effect according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example one
The invention provides a method that uses the character information on the objects to be grabbed and fuses the related information provided by the target detection and text detection and recognition algorithms to build an accurate detection system for a specific grabbing target, realizing accurate recognition and positioning of the target object. Lightweight models are adopted to guarantee the real-time performance of the system and make the grabbing task easy to deploy on the robot controller, solving the problem that current grasp detection algorithms cannot carefully distinguish similar objects.
As shown in fig. 1, the present embodiment provides a grabbing robot based on matching of a target and scene characters, including a depth camera, a chassis, a mechanical arm, and a controller;
the controller comprises a preliminary detection module of an object to be grabbed and a text detection and identification module;
the depth camera is used for capturing an image of an object to be grabbed, and the preliminary detection module of the object to be grabbed is configured to: and performing feature extraction by using the CNN according to the image of the target to be captured and the target detection model, and regressing to obtain the classification result category and the boundary frame of the target to be captured.
The text detection and recognition module is configured to: for targets with the same classification result, extract the characters in the target detection-box region with the text detection and recognition model, obtain an initial three-dimensional coordinate after the character recognition result is successfully matched with the specific target, locate the detection box of the specific grabbing target with a target tracking algorithm to obtain real-time grabbing coordinates, and control the chassis motion and mechanical arm motion to grab the specific target according to the grabbing coordinates.
In this embodiment, the target detection model is NanoDet, a high-speed, lightweight anchor-free target detection model that provides performance close to the YOLO series and is also convenient to train and port.
The network structure of the target detection model is shown in fig. 2. NanoDet is an FCOS (Fully Convolutional One-Stage Object Detection) style detection network, and the model can be divided into three parts: a backbone network, a feature fusion layer and a detection head. To keep the number of model parameters as small as possible, the backbone network adopts ShuffleNetV2, the last convolution layer of ShuffleNetV2 is removed, and the 8x, 16x and 32x downsampled features are extracted and fed into the PAN for multi-scale feature fusion.
The feature fusion layer adopts a PAN augmented with a bottom-up path: the low-level feature maps are downsampled and the downsampled result is added to the higher-level features. The detection head of NanoDet adopts two 96-channel convolution layers, and the same set of convolutions is used for box regression and classification.
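As an illustration of this fusion scheme, the sketch below shows a minimal PAN-style layer in PyTorch: the top-down path upsamples higher-level features and adds them to lower levels, and the added bottom-up path downsamples the low-level maps and adds the result back to the higher levels. This is an assumption for exposition, not the NanoDet source; the feature sizes and channel width are hypothetical.

```python
# Minimal PAN-style fusion sketch (illustrative; sizes are assumptions,
# not the actual NanoDet configuration).
import torch
import torch.nn.functional as F

def pan_fuse(c3, c4, c5):
    """c3, c4, c5: backbone features at 8x, 16x, 32x downsampling,
    already projected to the same channel width."""
    # Top-down path: upsample higher-level maps and add to lower levels.
    p5 = c5
    p4 = c4 + F.interpolate(p5, size=c4.shape[-2:], mode="bilinear", align_corners=False)
    p3 = c3 + F.interpolate(p4, size=c3.shape[-2:], mode="bilinear", align_corners=False)
    # Bottom-up path: downsample lower-level maps and add to higher levels.
    n3 = p3
    n4 = p4 + F.interpolate(n3, size=p4.shape[-2:], mode="bilinear", align_corners=False)
    n5 = p5 + F.interpolate(n4, size=p5.shape[-2:], mode="bilinear", align_corners=False)
    return n3, n4, n5

# Example with dummy 96-channel feature maps for a 320x320 input.
c3 = torch.randn(1, 96, 40, 40)   # 8x downsampled
c4 = torch.randn(1, 96, 20, 20)   # 16x downsampled
c5 = torch.randn(1, 96, 10, 10)   # 32x downsampled
n3, n4, n5 = pan_fuse(c3, c4, c5)
```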
The target detection module obtains the detection boxes of the objects in real time and classifies the image target in each detection box, thereby locating the two-dimensional position of every object in the image. For objects with the same classification result, the detection system further distinguishes them by extracting the character information on the objects.
Because the camera captures the whole scene, the region where the characters are located is small and is often affected by other factors such as illumination. If the whole image were input to the text detection network, the features of the character region could not be fully extracted, and the text region detection effect would be poor.
In order to improve the accuracy of subsequent text detection and recognition, character area image enhancement operation is performed before character detection.
As shown in fig. 3, the controller further includes a text region image enhancement module configured to perform image cropping, image enlargement and padding, grayscale processing and image sharpening.
(1) Image cropping: each target object is cropped from the whole image according to the target bounding box generated by target detection;
(2) Image enlargement and padding: because each cropped target region is small, the cropped region is enlarged to twice its original size by bicubic interpolation, the enlarged image is border-padded, and every detection-box region is padded into a square image so that all regions share the same aspect ratio;
(3) Grayscale processing: the enlarged and padded picture is converted to grayscale to remove the influence of variables such as colour and illumination, and histogram equalization is then applied to the grayscale picture to increase the contrast of the text region;
(4) Image sharpening: finally, an image sharpening method is adopted to enhance the edges of the characters in the image and make the characters clearer.
Through the above processing, an enhanced picture of identical shape is obtained for each target detection-box region and used as the input of the text detection and recognition model.
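A minimal OpenCV sketch of this enhancement pipeline is given below; the scale factor, padding value and sharpening kernel are assumptions for illustration rather than the exact parameters of the embodiment.

```python
# Sketch of the detection-box image enhancement steps (crop, enlarge, pad,
# grayscale + histogram equalization, sharpen). Parameter values are assumed.
import cv2
import numpy as np

def enhance_box_region(image, box):
    x1, y1, x2, y2 = box                              # bounding box from target detection
    crop = image[y1:y2, x1:x2]                        # (1) image cropping
    crop = cv2.resize(crop, None, fx=2.0, fy=2.0,
                      interpolation=cv2.INTER_CUBIC)  # (2) enlarge 2x with bicubic interpolation
    h, w = crop.shape[:2]
    side = max(h, w)                                  # pad the region into a square
    top, bottom = (side - h) // 2, side - h - (side - h) // 2
    left, right = (side - w) // 2, side - w - (side - w) // 2
    crop = cv2.copyMakeBorder(crop, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=(0, 0, 0))
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)     # (3) grayscale
    gray = cv2.equalizeHist(gray)                     #     histogram equalization
    kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
    return cv2.filter2D(gray, -1, kernel)             # (4) sharpening
```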
As shown in fig. 4, the text detection and recognition model includes a text detection module, a detection box correction module, and a text recognition module.
The text detection module is configured to:
locate the region of the image where the text lies. DB-Net is used as the text detector. DB-Net moves the binarization step inside the segmentation network, replacing the standard binarization

B_{i,j} = 1 if P_{i,j} >= t, and B_{i,j} = 0 otherwise

with the differentiable binarization function

B̂_{i,j} = 1 / (1 + e^{-k(P_{i,j} - T_{i,j})})

In the two formulae, B_{i,j} and B̂_{i,j} are the binary maps, P_{i,j} is the probability map, t and T_{i,j} are the thresholds (T_{i,j} being the adaptive threshold map predicted by the network), and k is an amplification factor.

Differentiable binarization solves the problem that the gradient of standard binarization is not differentiable during training. To further improve efficiency, PP-OCR adopts six strategies to slim down DB-Net.
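The difference between the two binarizations can be illustrated with a short NumPy sketch; the amplification factor k = 50 and the constant threshold map used here are assumptions for the example.

```python
import numpy as np

def standard_binarize(prob_map, t=0.3):
    """Hard threshold: not differentiable, so it cannot be trained end to end."""
    return (prob_map >= t).astype(np.float32)

def differentiable_binarize(prob_map, thresh_map, k=50.0):
    """Approximate binarization B_hat = 1 / (1 + exp(-k * (P - T))),
    differentiable with respect to both the probability map P and threshold map T."""
    return 1.0 / (1.0 + np.exp(-k * (prob_map - thresh_map)))

P = np.random.rand(4, 4)      # probability map from the segmentation head
T = np.full_like(P, 0.3)      # adaptive threshold map (constant here for illustration)
print(standard_binarize(P))
print(differentiable_binarize(P, T))
```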
The detection frame rectification module is configured to:
Before the text in a detection box is recognized, the detection box needs to be rectified. PP-OCR provides a text direction classifier: the text detection box is first converted into a horizontal rectangular box by a geometric transformation, the direction of the converted text is then judged, and if the text box is upside down it is further flipped. Four strategies are meanwhile adopted to enhance the model capability and reduce the model volume.
The text recognition module is configured to: use CRNN as the text recognizer. The CRNN network structure is shown in fig. 5; CRNN integrates feature extraction and sequence modelling and performs sequence alignment with the CTC (Connectionist Temporal Classification) loss. To enhance the text recognition capability and reduce the model volume, nine strategies are adopted.
In this embodiment, the text detection and recognition model adopts the ultra-lightweight PP-OCR text detection and recognition network, which is easier to deploy on the mobile end. Through PP-OCR text detection and recognition, the character information contained on all target objects found by target detection is obtained, and even if the target detection module classifies the objects into the same class, the objects can be further distinguished according to the recognized character information.
Obtaining the initial three-dimensional coordinates is configured to: determine whether the given character information contained in the recognition result text belongs to the corresponding actual object in the detection box; if so, the matching between the characters and the target object is completed, and the coordinates of the matched target bounding box are combined with the camera depth information to obtain the initial three-dimensional coordinates of the target to be grabbed.
The preliminary detection module and the text detection and recognition module operate on a per-frame basis. Let the current image frame be F_i. Through the above process, n regions D_1, D_2, …, D_n, each taking one target detection box as a unit, are extracted, and the corresponding actual objects in the detection boxes are d_1, d_2, …, d_n. Given the specific character information t to be matched in the task, text detection and recognition is performed on the n detection-box regions, and the recognition results are [T_1, T_2, …, T_n]. If a certain recognition result text T_t contains the given character information t, the following determination is made:

t ⊆ T_t  ⟹  t ∈ d_t

By the above relation, the character information t is determined to belong to object d_t, the characters are matched with the target object, and the detection position of the target object at this moment is combined with the depth information to obtain the initial three-dimensional coordinates of the target to be grabbed.
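A minimal sketch of this matching step is shown below; the substring test stands in for the matching rule, and the example strings and helper name are hypothetical.

```python
# Match the given character information t against the OCR results of the
# n detection-box regions; return the index of the matched box (or None).
def match_text_to_box(recognized_texts, given_text):
    """recognized_texts: list [T_1, ..., T_n] of OCR strings, one per detection box.
    given_text: the specific character information t to be matched."""
    for idx, text in enumerate(recognized_texts):
        if given_text in text:      # t ⊆ T_t  =>  t belongs to object d_t
            return idx
    return None

# Example: boxes recognized as patient names on medicine bottles (hypothetical data).
texts = ["Zhang San 2x daily", "Li Si 1x daily", "Wang Wu 3x daily"]
matched = match_text_to_box(texts, "Li Si")   # -> 1, i.e. the second detection box
```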
According to the physical characteristics of the mobile service robot, the robot takes the initial three-dimensional coordinates as the first coordinate input, moves the chassis and prepares the grabbing action. Movement of the robot chassis changes the picture captured by the camera in real time, so the detection box located from the character recognition result of the previous moment becomes relatively offset, and the robot must receive the new coordinate position of the object in real time. If character detection and recognition were performed on the new detection-box region in every frame, the network computation would be enormous, the real-time performance of the whole system would deteriorate, and the grabbing efficiency of the robot would be directly affected. Therefore, a tracking algorithm is adopted to track the target in real time.
In order to solve the problem of target position changes caused by robot movement, the KCF (Kernel Correlation Filter) tracking algorithm is introduced, and is configured to:
A circulant matrix is constructed from the collected image block to represent dense samples of the target and its background, thereby constructing a large training set. After the first image frame has passed the two-stage detection, the detection box of the target object to be grabbed is located; the KCF algorithm then tracks the located detection-box region in real time, and training the tracker amounts to finding the filter ω that minimizes the objective function.
The steps for solving ω are as follows:
(1) A ridge regression equation is constructed:

min_ω Σ_t ( f(x_t) - y_t )² + λ‖ω‖²

where x_t is a single acquired training sample, y_t is the corresponding confidence label, and λ is the regularization parameter that prevents overfitting of the regression.

The cyclic shifts of a single training sample x = [x_1, x_2, …, x_n] constitute the sample set X, which is the following circulant matrix:

X = C(x) =
[ x_1  x_2  …  x_n
  x_n  x_1  …  x_{n-1}
  …    …        …
  x_2  x_3  …  x_1 ]

(2) In the ridge regression equation, f(X) = ω^T X; taking the derivative of the equation with respect to ω and setting it to zero gives

ω = (X^T X + λI)^{-1} X^T y

where X^T is the transpose of the sample matrix X, I is the identity matrix, and y is the column vector formed by the labels y_t.

The circulant matrix X has the property of being diagonalizable in Fourier space:

X = F diag(x̂) F^H

where F is the discrete Fourier matrix, x̂ is the Fourier transform of the first row x of X, and F^H is the conjugate transpose of F.

Substituting the Fourier diagonalization into the ridge regression solution and simplifying, the filter in the Fourier domain is obtained:

ω̂ = (x̂* ⊙ ŷ) / (x̂* ⊙ x̂ + λ)

where x̂* is the complex conjugate of x̂ and ⊙ denotes element-wise multiplication.

According to the Fourier space transformation,

ω = F^{-1}(ω̂)

where F^{-1} denotes the inverse Fourier transform.
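A minimal NumPy sketch of this closed-form solution in the Fourier domain (linear kernel, single channel, which is a simplification of the full KCF tracker) might look like the following; the training patch, label response and regularization value are assumed inputs.

```python
import numpy as np

def train_linear_correlation_filter(x, y, lam=1e-4):
    """x: 2-D image patch (the single training sample whose cyclic shifts form X),
    y: desired Gaussian-shaped response of the same size, lam: regularization λ.
    Returns the filter ω = F^{-1}( x̂* ⊙ ŷ / (x̂* ⊙ x̂ + λ) )."""
    x_hat = np.fft.fft2(x)
    y_hat = np.fft.fft2(y)
    w_hat = (np.conj(x_hat) * y_hat) / (np.conj(x_hat) * x_hat + lam)
    return np.real(np.fft.ifft2(w_hat))

def detect(w, z):
    """Correlation response of filter w on a new patch z; the peak location
    indicates the target's cyclic shift relative to the training patch."""
    response = np.real(np.fft.ifft2(np.fft.fft2(w) * np.fft.fft2(z)))
    return np.unravel_index(np.argmax(response), response.shape)
```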
Positioning the specific grabbed target detection box by using a target tracking algorithm comprises:
While the robot moves to execute the grabbing task, and considering that the KCF tracking algorithm drifts as errors accumulate over a long time, this embodiment computes the intersection over union (IOU) between the tracking box in each image frame and all current target detection boxes and takes the detection box with the largest value, so that the bounding box of the target to be grabbed can be located in every frame.
The IOU calculation is illustrated in FIGS. 6(a)-6(b), and the IOU calculation formula is:

IOU(A, B) = area(A ∩ B) / area(A ∪ B)
suppose that n detection frames with the same classification label result are generated in the target detection at this time, and are respectively A1,A2,……,An
Through character detection and recognition, the object in detection box A_t is located as containing the specific character information, that is, the object in box A_t is the target to be grabbed. The KCF tracking algorithm then samples the target in A_t to generate a tracking box T and tracks the target in real time while the robot moves. Throughout the process, the IOU between T and each A_i (i = 1, 2, …, n) is calculated at every moment, and the detection box with the maximum IOU is taken as A_t:

A_t = argmax_{i = 1, …, n} IOU(T, A_i)
FIGS. 7(a)-7(d) show the complete process of target detection box tracking from time T0 to time T3.
By finding and tracking the detection box with the largest IOU, the detection box of the grabbing target is located in real time and the real-time position of the grabbing target is updated, and the robot completes the grabbing task according to this grabbing position. This positioning strategy requires only one pass of character detection and recognition to keep the detection box located in real time, which reduces the overall computation and guarantees real-time performance.
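The box-association step can be sketched as follows; the [x1, y1, x2, y2] box format and helper names are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter) if inter > 0 else 0.0

def locate_grasp_box(tracking_box, detection_boxes):
    """Return the detection box A_i with the largest IOU against the KCF
    tracking box T; this box is taken as the target to be grabbed."""
    return max(detection_boxes, key=lambda box: iou(tracking_box, box))
```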
As shown in FIGS. 8(a)-8(c), the depth camera calibration and registration process is configured to:
using 8 multiplied by 11 checkerboard to calibrate RGB and depth maps of the depth camera by Zhang-Zhengyou calibration method, and obtaining that the internal reference matrixes of the RGB camera and the depth camera are respectively HrgbAnd HirThe external reference matrix consists of a rotation matrix and a translation vectorAre each Rrgb、TrgbAnd Rir、Tir
Let P_rgb and P_ir be the spatial coordinates of the same point in the RGB camera coordinate system and the depth camera coordinate system respectively. Because the two coordinate systems differ, the two sets of coordinates are related by a rotation matrix and a translation vector:

P_rgb = R · P_ir + T
By computational derivation, the rotation matrix R and the translation vector T can be expressed as:

R = R_rgb · R_ir^{-1}

T = T_rgb - R · T_ir
and performing camera coordinate conversion by using the rotation matrix and the translation vector obtained by calculation to align the RGB-D images, and manually fine-tuning the translation vector between the two cameras according to the actual alignment condition to obtain a better alignment effect.
As shown in FIGS. 9(a)-9(c), after the specific grabbing target detection box has been located, the mechanical arm grabbing action is configured to:
and using the two-dimensional coordinates of the central area of the positioned grabbing detection frame and the depth information of the area corresponding to the registered depth map as original three-dimensional coordinate information of the grabbing object, calculating a transformation matrix of a camera coordinate system and a mechanical arm coordinate system, and mapping the three-dimensional coordinates obtained by the camera to the mechanical arm coordinate system, namely the grabbing coordinates of the robot. The robot moves to the reach range of the mechanical arm according to the grabbing coordinates received in real time, the mechanical arm executes grabbing actions, and the robot finishes grabbing tasks.
The invention fuses the two detection algorithms of target detection and text detection and recognition: character information is integrated on top of the position information provided by the target detection algorithm, realizing accurate detection of a specific target object. The detection system is built with lightweight deep learning models, is easy to deploy on the robot controller, and runs in real time on the robot. Designed experiments on a hospital medicine-bottle grabbing scene show the feasibility of the method: by recognizing the specific character information on a medicine bottle, the robot completes real-time detection and positioning of the specific target and realizes an intelligent medicine-bottle grabbing task in the hospital scene.
Example two
The embodiment provides a capture method based on matching of a target and scene characters, which comprises the following steps:
step 1: acquiring an image of a target to be captured;
step 2: performing feature extraction by using CNN according to the image of the target to be captured and the target detection model, and performing regression to obtain a classification result and a boundary frame of the target to be captured;
step 3: for targets with the same classification result, extracting characters in a target detection frame area by adopting a text detection and recognition model for detection and recognition, and obtaining an initial three-dimensional coordinate after the character recognition result is successfully matched with a specific target;
step 4: positioning the specific grabbing target detection frame by using a target tracking algorithm to obtain a final grabbing coordinate, and controlling the chassis motion and the mechanical arm motion to grab the specific target according to the grabbing coordinate.
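Read end to end, these four steps form the grabbing loop sketched below. The detector, OCR, tracker and robot interfaces are hypothetical stand-ins for the NanoDet, PP-OCR and KCF components of embodiment one, and the helpers enhance_box_region, match_text_to_box and locate_grasp_box refer to the sketches given earlier.

```python
# High-level sketch of the grabbing loop (hypothetical interfaces; the real
# detector, OCR model, tracker and robot driver are described in embodiment one).
def grab_specific_target(camera, detector, ocr, tracker, robot, given_text):
    frame = camera.read()
    boxes = detector.detect(frame)                       # step 2: candidate detection boxes
    texts = [ocr.recognize(enhance_box_region(frame, b)) for b in boxes]
    idx = match_text_to_box(texts, given_text)           # step 3: character matching
    if idx is None:
        return False                                     # target not in view
    tracker.init(frame, boxes[idx])
    grasp_box = boxes[idx]
    while not robot.target_in_reach():                   # step 4: track while the chassis moves
        frame = camera.read()
        track_box = tracker.update(frame)
        grasp_box = locate_grasp_box(track_box, detector.detect(frame))
        robot.move_chassis_towards(camera.to_3d(grasp_box))
    robot.arm_grab(camera.to_3d(grasp_box))              # final grabbing coordinates
    return True
```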
EXAMPLE III
The embodiment provides a grabbing system based on matching of a target and scene text, which includes a robot configured to receive a grabbing instruction issued by a terminal;
the robot comprises a preliminary detection module of an object to be grabbed, a text detection and identification module and an object grabbing module;
the preliminary detection module of the target to be grabbed is used for acquiring an image of the target to be grabbed; performing feature extraction by using CNN according to the image of the target to be captured and the target detection model, and performing regression to obtain a classification result and a boundary frame of the target to be captured;
the text detection and identification module is used for extracting characters in a target detection frame area for detection and identification by adopting a text detection and identification model for targets with the same classification result, and obtaining an initial three-dimensional coordinate after the character identification result is successfully matched with a specific target;
the target grabbing module is used for positioning the specific grabbed target detection frame by using a target tracking algorithm to obtain a final grabbing coordinate, and controlling the chassis motion and the mechanical arm motion to grab the specific target according to the grabbing coordinate.
Taking a grabbing scene of a service robot in a medical environment as an example, firstly, a command for grabbing a medicine bottle is given to the robot, namely, name information of a specific patient is sent to the robot. The target detection module detects and frames all medicine bottles in the robot visual field to obtain the position of a boundary frame of each medicine bottle, then the image enhancement operation extracts a target image in the detection frame area and performs enhancement processing, the enhanced image is sent to the character detection and identification module for character detection and identification, and finally the given patient name information is matched according to the character identification result. The character detection effect is shown in fig. 10(a) to 10 (e).
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A grabbing robot based on matching of a target and scene characters, characterized by comprising: a depth camera, a chassis, a mechanical arm and a controller; the controller comprises a preliminary detection module for the target to be grabbed, a text detection and recognition module and a target grabbing module;
the preliminary detection module of the object to be grabbed is configured to: according to the target image to be captured obtained by the depth camera and the target detection model, perform feature extraction by using a CNN, and perform regression to obtain a classification result and a bounding box of the target to be captured;
the text detection recognition module is configured to: for targets with the same classification result, extracting characters in a target detection box area by adopting a text detection and identification model for detection and identification, and obtaining an initial three-dimensional coordinate after the character identification result is successfully matched with a specific target;
the object grabbing module is configured to: and positioning the specific grabbing target detection frame by using a target tracking algorithm to obtain a final grabbing coordinate, and controlling the chassis motion and the mechanical arm motion to grab the specific target according to the grabbing coordinate.
2. The grabbing robot based on matching of a target and scene characters of claim 1, wherein the target tracking algorithm is configured to: introduce the KCF tracking algorithm based on a kernel correlation filter, construct a circulant matrix from the acquired image block to represent samples that densely sample the target and its background, thereby construct a large training set, carry out training, and search for the filter that minimizes the objective function.
3. The grabbing robot based on matching of a target and scene characters of claim 1, wherein the target detection model is configured to: adopt a NanoDet network comprising a backbone network, a feature fusion layer and a detection head, wherein the backbone network adopts ShuffleNetV2 and the feature fusion layer adopts PAN.
4. The grabbing robot based on matching of a target and scene characters of claim 1, wherein the text detection and recognition model employs a PP-OCR text detection and recognition network.
5. The grabbing robot based on matching of a target and scene characters according to claim 1, wherein locating the specific grabbing target detection box is configured to: take the two-dimensional coordinates of the central area of the located grabbing detection box and the depth information of the corresponding area of the registered depth map as the original three-dimensional coordinate information of the object to be grabbed, calculate a transformation matrix between the camera coordinate system and the mechanical arm coordinate system, and map the three-dimensional coordinates obtained by the camera into the mechanical arm coordinate system to obtain the final grabbing coordinates.
6. The grabbing robot based on matching of a target and scene characters of claim 5, wherein the depth map is obtained by a depth camera calibration and registration process configured to:
calibrating RGB and a depth map of the depth camera by using a Zhang Zhengyou calibration method to obtain an internal reference matrix and an external reference matrix of the RGB camera and the depth camera, wherein the external reference matrix consists of a rotation matrix and a translation vector;
and performing camera coordinate conversion according to the obtained rotation matrix and translation vector to obtain a depth map.
7. The grabbing robot based on matching of a target and scene characters of claim 1, wherein the controller further comprises a text region image enhancement module configured to: perform image cropping, image enlargement and padding, grayscale processing and image sharpening on the image of the target to be grabbed.
8. The grabbing robot based on matching of a target and scene characters of claim 1, wherein the matching of the character recognition result with a specific target comprises: comparing a plurality of regions, each taking a target detection box as a unit, with the corresponding actual objects in the detection boxes according to the given specific character information to be matched, judging whether the given character information contained in the recognition result text belongs to the corresponding actual object in the detection box, and if so, completing the matching of the characters with the target object.
9. A grabbing method based on matching of a target and scene characters, characterized in that the method is applied to a robot and comprises the following steps:
acquiring an image of a target to be captured;
according to the image of the target to be grabbed and the target detection model, performing feature extraction by using CNN (convolutional neural network), and regressing to obtain a classification result and a bounding box of the target to be grabbed;
for targets with the same classification result, extracting characters in a target detection box area by adopting a text detection and identification model for detection and identification, and obtaining an initial three-dimensional coordinate after the character identification result is successfully matched with a specific target;
and positioning the specific grabbing target detection frame by using a target tracking algorithm to obtain a final grabbing coordinate, and controlling the chassis motion and the mechanical arm motion to grab the specific target according to the grabbing coordinate.
10. A grabbing system based on matching of a target and scene characters, characterized in that the system is applied to a robot and comprises: a preliminary detection module for the target to be grabbed, a text detection and recognition module and a target grabbing module;
the preliminary detection module of the target to be grabbed is used for acquiring an image of the target to be grabbed; according to the image of the target to be grabbed and the target detection model, performing feature extraction by using CNN (convolutional neural network), and regressing to obtain a classification result and a bounding box of the target to be grabbed;
the text detection and identification module is used for extracting characters in a target detection frame area for detection and identification by adopting a text detection and identification model for targets with the same classification result, and obtaining an initial three-dimensional coordinate after the character identification result is successfully matched with a specific target;
the target grabbing module is used for positioning the specific grabbed target detection frame by using a target tracking algorithm to obtain a final grabbing coordinate, and controlling the chassis motion and the mechanical arm motion to grab the specific target according to the grabbing coordinate.
CN202210081494.5A 2022-01-24 2022-01-24 Grabbing robot based on matching of target and scene characters and grabbing method and system Pending CN114495109A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210081494.5A CN114495109A (en) 2022-01-24 2022-01-24 Grabbing robot based on matching of target and scene characters and grabbing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210081494.5A CN114495109A (en) 2022-01-24 2022-01-24 Grabbing robot based on matching of target and scene characters and grabbing method and system

Publications (1)

Publication Number Publication Date
CN114495109A true CN114495109A (en) 2022-05-13

Family

ID=81474528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210081494.5A Pending CN114495109A (en) 2022-01-24 2022-01-24 Grabbing robot based on matching of target and scene characters and grabbing method and system

Country Status (1)

Country Link
CN (1) CN114495109A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115019202A (en) * 2022-05-26 2022-09-06 北京化工大学 Step-by-step grabbing detection method applied to service type mobile mechanical arm
CN115219852A (en) * 2022-09-19 2022-10-21 国网江西省电力有限公司电力科学研究院 Intelligent fault studying and judging method for distribution line of unmanned aerial vehicle

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101145200A (en) * 2007-10-26 2008-03-19 浙江工业大学 Inner river ship automatic identification system of multiple vision sensor information fusion
CN107967473A (en) * 2016-10-20 2018-04-27 南京万云信息技术有限公司 Based on picture and text identification and semantic robot autonomous localization and navigation
CN108256523A (en) * 2018-01-11 2018-07-06 上海展扬通信技术有限公司 Recognition methods, device and computer readable storage medium based on mobile terminal
CN109599105A (en) * 2018-11-30 2019-04-09 广州富港万嘉智能科技有限公司 Dish method, system and storage medium are taken based on image and the automatic of speech recognition
CN109822561A (en) * 2018-11-30 2019-05-31 广州富港万嘉智能科技有限公司 It is a kind of that dish method, system and storage medium are taken based on speech recognition automatically
CN109948416A (en) * 2018-12-31 2019-06-28 上海眼控科技股份有限公司 A kind of illegal occupancy bus zone automatic auditing method based on deep learning
CN110992422A (en) * 2019-11-04 2020-04-10 浙江工业大学 Medicine box posture estimation method based on 3D vision
CN111482967A (en) * 2020-06-08 2020-08-04 河北工业大学 Intelligent detection and capture method based on ROS platform
CN111823236A (en) * 2020-07-25 2020-10-27 湘潭大学 Library management robot and control method thereof
CN112258161A (en) * 2020-11-03 2021-01-22 苏州市龙测智能科技有限公司 Intelligent software testing system and testing method based on robot
WO2021076205A1 (en) * 2019-10-14 2021-04-22 UiPath Inc. Systems and methods of activity target selection for robotic process automation
CN113220818A (en) * 2021-05-27 2021-08-06 南昌智能新能源汽车研究院 Automatic mapping and high-precision positioning method for parking lot
CN113344967A (en) * 2021-06-07 2021-09-03 哈尔滨理工大学 Dynamic target identification tracking method under complex background
CN113450408A (en) * 2021-06-23 2021-09-28 中国人民解放军63653部队 Irregular object pose estimation method and device based on depth camera
CN113555087A (en) * 2021-07-19 2021-10-26 吉林大学第一医院 Artificial intelligence film reading method based on convolutional neural network algorithm
US20220016766A1 (en) * 2020-07-14 2022-01-20 Vicarious Fpc, Inc. Method and system for grasping an object

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101145200A (en) * 2007-10-26 2008-03-19 浙江工业大学 Inner river ship automatic identification system of multiple vision sensor information fusion
CN107967473A (en) * 2016-10-20 2018-04-27 南京万云信息技术有限公司 Based on picture and text identification and semantic robot autonomous localization and navigation
CN108256523A (en) * 2018-01-11 2018-07-06 上海展扬通信技术有限公司 Recognition methods, device and computer readable storage medium based on mobile terminal
CN109599105A (en) * 2018-11-30 2019-04-09 广州富港万嘉智能科技有限公司 Dish method, system and storage medium are taken based on image and the automatic of speech recognition
CN109822561A (en) * 2018-11-30 2019-05-31 广州富港万嘉智能科技有限公司 It is a kind of that dish method, system and storage medium are taken based on speech recognition automatically
CN109948416A (en) * 2018-12-31 2019-06-28 上海眼控科技股份有限公司 A kind of illegal occupancy bus zone automatic auditing method based on deep learning
WO2021076205A1 (en) * 2019-10-14 2021-04-22 UiPath Inc. Systems and methods of activity target selection for robotic process automation
CN110992422A (en) * 2019-11-04 2020-04-10 浙江工业大学 Medicine box posture estimation method based on 3D vision
CN111482967A (en) * 2020-06-08 2020-08-04 河北工业大学 Intelligent detection and capture method based on ROS platform
US20220016766A1 (en) * 2020-07-14 2022-01-20 Vicarious Fpc, Inc. Method and system for grasping an object
CN111823236A (en) * 2020-07-25 2020-10-27 湘潭大学 Library management robot and control method thereof
CN112258161A (en) * 2020-11-03 2021-01-22 苏州市龙测智能科技有限公司 Intelligent software testing system and testing method based on robot
CN113220818A (en) * 2021-05-27 2021-08-06 南昌智能新能源汽车研究院 Automatic mapping and high-precision positioning method for parking lot
CN113344967A (en) * 2021-06-07 2021-09-03 哈尔滨理工大学 Dynamic target identification tracking method under complex background
CN113450408A (en) * 2021-06-23 2021-09-28 中国人民解放军63653部队 Irregular object pose estimation method and device based on depth camera
CN113555087A (en) * 2021-07-19 2021-10-26 吉林大学第一医院 Artificial intelligence film reading method based on convolutional neural network algorithm

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ZHICHAO LIU ET AL.: "Scene images and text information‐based object location of robot grasping", 《IET CYBER‐SYSTEMS AND ROBOTICS》, 28 April 2022 (2022-04-28), pages 116 - 130 *
付纪元 et al.: "Research on grasping with a closed-loop visual servo system for a home service robot", Journal of Beijing Information Science and Technology University, vol. 35, no. 3, 15 June 2020 (2020-06-15), pages 19-25 *
卢振利; 谢亚飞; 周立志; 单长考; 波罗瓦茨·布朗尼斯拉夫; 李斌: "A machine-vision-based system for robot identification and sorting of boxed cigarettes", High Technology Letters, no. 06, 15 June 2016 (2016-06-15)
穆玉理: "Object detection by deep learning with the Pascal VOC object detection data", Telecom World, no. 05, 25 May 2018 (2018-05-25)
龙慧; 朱定局; 田娟: "A review of research on the application of deep learning in intelligent robots", Computer Science, no. 2, 15 November 2018 (2018-11-15)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115019202A (en) * 2022-05-26 2022-09-06 北京化工大学 Step-by-step grabbing detection method applied to service type mobile mechanical arm
CN115219852A (en) * 2022-09-19 2022-10-21 国网江西省电力有限公司电力科学研究院 Intelligent fault studying and judging method for distribution line of unmanned aerial vehicle
CN115219852B (en) * 2022-09-19 2023-03-24 国网江西省电力有限公司电力科学研究院 Intelligent fault studying and judging method for distribution line of unmanned aerial vehicle

Similar Documents

Publication Publication Date Title
CN113330490B (en) Three-dimensional (3D) assisted personalized home object detection
CN110059558B (en) Orchard obstacle real-time detection method based on improved SSD network
CN108304873B (en) Target detection method and system based on high-resolution optical satellite remote sensing image
EP3499414B1 (en) Lightweight 3d vision camera with intelligent segmentation engine for machine vision and auto identification
CN108090435B (en) Parking available area identification method, system and medium
WO2020042419A1 (en) Gait-based identity recognition method and apparatus, and electronic device
CN104680508B (en) Convolutional neural networks and the target object detection method based on convolutional neural networks
CN107953329B (en) Object recognition and attitude estimation method and device and mechanical arm grabbing system
CN105574527B (en) A kind of quick object detecting method based on local feature learning
CN111553949B (en) Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
CN114495109A (en) Grabbing robot based on matching of target and scene characters and grabbing method and system
CN111862201A (en) Deep learning-based spatial non-cooperative target relative pose estimation method
CN111402331B (en) Robot repositioning method based on visual word bag and laser matching
CN112784712B (en) Missing child early warning implementation method and device based on real-time monitoring
CN110543817A (en) Pedestrian re-identification method based on posture guidance feature learning
CN112396036A (en) Method for re-identifying blocked pedestrians by combining space transformation network and multi-scale feature extraction
CN118189959A (en) Unmanned aerial vehicle target positioning method based on YOLO attitude estimation
CN114332814A (en) Parking frame identification method and device, electronic equipment and storage medium
Qureshi et al. Highway traffic surveillance over uav dataset via blob detection and histogram of gradient
CN113850195A (en) AI intelligent object identification method based on 3D vision
CN115797397B (en) Method and system for all-weather autonomous following of robot by target personnel
CN116912763A (en) Multi-pedestrian re-recognition method integrating gait face modes
CN110458177B (en) Method for acquiring image depth information, image processing device and storage medium
CN116977824A (en) Point cloud position identification method based on overlapping area
Yang et al. Target position and posture recognition based on RGB-D images for autonomous grasping robot arm manipulation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination