CN117237751A - Training method, recognition method, system and equipment for grabbing detection model - Google Patents

Training method, recognition method, system and equipment for grabbing detection model

Info

Publication number
CN117237751A
Authority
CN
China
Prior art keywords
feature
training
processing
data set
detection model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310986735.5A
Other languages
Chinese (zh)
Inventor
程良伦
陈泳斌
王涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Nengge Knowledge Technology Co ltd
Guangdong University of Technology
Original Assignee
Guangdong Nengge Knowledge Technology Co ltd
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Nengge Knowledge Technology Co ltd, Guangdong University of Technology filed Critical Guangdong Nengge Knowledge Technology Co ltd
Priority to CN202310986735.5A priority Critical patent/CN117237751A/en
Publication of CN117237751A publication Critical patent/CN117237751A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The application discloses a training method, an identification method, a system and equipment for a grabbing detection model. The method obtains a labeled RGB-D image training set and performs point cloud conversion processing on it to obtain a point cloud data set; performs multi-scale feature extraction processing on the RGB-D image training set to obtain a global feature atlas; performs feature interpolation processing and size balance processing on the point cloud data set, respectively, to obtain an interpolation feature data set and a local feature data set; performs feature fusion processing on the local feature data set and the interpolation feature data set to obtain a semantic space feature set, and performs multi-angle prediction processing on the semantic space feature set to obtain a training prediction candidate set; and updates the parameters of the initialized grabbing detection model according to the training prediction candidate set to obtain the grabbing detection model. The method can effectively improve the ability of the grabbing detection model to distinguish the edges of stacked objects and effectively improve its recognition accuracy. The application can be widely applied to the technical field of artificial intelligence.

Description

Training method, recognition method, system and equipment for grabbing detection model
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a training method, an identification method, a system and equipment for a grabbing detection model.
Background
In recent years, with the continuous development of science and technology, scenarios in which a robot or manipulator automatically grasps target objects in stacked scenes have become increasingly common.
At present, for the problem of automatic recognition and object grabbing by a robot or manipulator in a stacked scene, conventional grabbing detection methods mainly fall into two categories:
In the first method, one or more object grabbing frames are output directly from a planar RGB image, and the grabbing parameters mainly comprise the center-point position and rotation angle of the object to be grabbed. This method is suitable for top-down grabbing by a manipulator and works well on a single object; however, as the scene becomes more complex the performance of the algorithm deteriorates, the edges of target objects in the scene are more easily misjudged, and the lack of depth information makes the manipulator grab inaccurately and may even damage the surface of the object.
In the second method, a scene point cloud obtained directly or indirectly (from an RGB-D image) is taken as input and, using the semantic and geometric information of the point cloud, is fed into a backbone network to obtain the 6D (i.e. six-degrees-of-freedom) pose of the object to be grabbed, after which grabbing is performed. However, this method is too idealized and depends on the environment (the lighting around the camera and the quality of the point cloud data): in a stacked scene it easily treats different objects in contact as a single whole, it has little fault tolerance for small or irregularly sized objects, and in some scenes where collision-free grabbing is required the grabbing quality evaluated by the algorithm is too low, so that heavily contacting objects in the scene cannot be grabbed successfully.
Accordingly, the problems existing in the prior art need to be solved and optimized.
Disclosure of Invention
The present application aims to solve at least one of the technical problems existing in the related art to a certain extent.
Therefore, a first object of the embodiments of the present application is to provide a training method for a grabbing detection model, which can effectively improve the distinguishing capability of the grabbing detection model on the edges of stacked objects, and effectively improve the recognition accuracy of the grabbing detection model.
A second object of the embodiment of the present application is to provide an identification method for a grabbing detection model.
A third object of the embodiment of the present application is to provide a training system for a grabbing detection model.
In order to achieve the technical purpose, the technical scheme adopted by the embodiment of the application comprises the following steps:
in a first aspect, an embodiment of the present application provides a training method for a grabbing detection model, where the training method includes:
acquiring a labeled RGB-D image training set, and performing point cloud conversion processing on the labeled RGB-D image training set to obtain a point cloud data set;
performing multi-scale feature extraction processing on the labeled RGB-D image training set to obtain a global feature atlas;
performing feature interpolation processing on the point cloud data set according to the global feature atlas to obtain an interpolation feature set, and performing size balance processing on the point cloud data set to obtain a local feature data set;
performing feature fusion processing on the local feature data set and the interpolation feature data set to obtain a semantic space feature set, and performing multi-angle prediction processing on the semantic space feature set to obtain a training prediction candidate set;
and updating parameters of the initialized grabbing detection model according to the training prediction candidate set to obtain a trained grabbing detection model.
In addition, the training method according to the above embodiment of the present application may further have the following additional technical features:
further, in an embodiment of the present application, the performing a multi-scale feature extraction process on the labeled RGB-D image training set to obtain a global feature atlas includes:
performing first feature extraction processing on the RGB-D image training set to obtain a first feature image set;
performing second feature extraction processing on the first feature atlas to obtain a second feature atlas;
performing third feature extraction processing on the second feature atlas to obtain a third feature atlas;
and carrying out semantic aggregation processing on the first feature atlas, the second feature atlas and the third feature atlas to obtain a global feature atlas.
Further, in an embodiment of the present application, the performing feature interpolation processing on the point cloud data set according to the global feature atlas to obtain an interpolation feature set includes:
sequentially performing up-sampling processing and instance segmentation processing on the point cloud data set to obtain a first intermediate data set;
downsampling the first intermediate data set to obtain a second intermediate data set;
and carrying out feature interpolation processing on the second intermediate data set according to the global feature atlas to obtain the interpolation feature set.
Further, in an embodiment of the present application, the performing a size balancing process on the point cloud data set to obtain a local feature data set includes:
acquiring a preset multi-layer annular cylinder;
grouping the point cloud data sets according to the multi-layer annular cylinder to obtain grouped data sets;
and performing sensing processing on the group data set according to the first intermediate data set to obtain the local characteristic data set.
Further, in an embodiment of the present application, the performing multi-angle prediction processing on the semantic space feature set to obtain a training prediction candidate set includes:
performing up-sampling processing on the semantic space feature set to obtain a third intermediate data set;
performing grabbing width processing on the third intermediate data set to obtain a width candidate set;
performing grabbing angle processing on the third intermediate data set to obtain an angle candidate set;
performing grabbing probability processing on the third intermediate data set to obtain a probability candidate set;
and generating the training prediction candidate set according to the width candidate set, the angle candidate set and the probability candidate set.
Further, in an embodiment of the present application, the parameter updating of the initialized grabbing detection model according to the training prediction candidate set, to obtain a trained grabbing detection model, includes:
obtaining a stacked object label corresponding to the training prediction candidate set;
determining a training loss value according to the training prediction candidate set and the stacked object label;
and according to the training loss value, carrying out parameter updating on the initialized grabbing detection model to obtain a trained grabbing detection model.
In a second aspect, an embodiment of the present application provides a method for identifying with a grabbing detection model, including:
acquiring an RGB-D image to be detected;
inputting the RGB-D image to be detected into the grabbing detection model trained according to the first aspect, so as to obtain a prediction candidate set;
and carrying out confusion degree analysis on the prediction candidate set to obtain a first prediction candidate frame, wherein the first prediction candidate frame is used for representing the prediction candidate frame with the lowest confusion degree in the prediction candidate set.
In a third aspect, an embodiment of the present application provides a training system for a grabbing detection model, including:
the acquisition module is used for acquiring the labeled RGB-D image training set and carrying out point cloud conversion processing on the labeled RGB-D image training set to obtain a point cloud data set;
the first processing module is used for carrying out multi-scale feature extraction processing on the marked RGB-D image training set to obtain a global feature image set;
the second processing module is used for carrying out characteristic interpolation processing on the point cloud data set according to the global characteristic atlas to obtain an interpolation characteristic set, and carrying out size balance processing on the point cloud data set to obtain a local characteristic data set;
the third processing module is used for carrying out feature fusion processing on the local feature data set and the interpolation feature data set to obtain a semantic space feature set, and carrying out multi-angle prediction processing on the semantic space feature set to obtain a training prediction candidate set;
and the updating module is used for carrying out parameter updating on the initialized grabbing detection model according to the training prediction candidate set to obtain a trained grabbing detection model.
In a fourth aspect, an embodiment of the present application further provides a computer device, including:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
In a fifth aspect, embodiments of the present application further provide a computer readable storage medium in which a processor executable program is stored, which when executed by the processor is configured to implement the above-described method.
The advantages and benefits of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
According to the training method for the grabbing detection model disclosed in the embodiment of the present application, a labeled RGB-D image training set is obtained, and point cloud conversion processing is performed on it to obtain a point cloud data set; multi-scale feature extraction processing is performed on the labeled RGB-D image training set to obtain a global feature atlas; feature interpolation processing is performed on the point cloud data set according to the global feature atlas to obtain an interpolation feature set, and size balance processing is performed on the point cloud data set to obtain a local feature data set; feature fusion processing is performed on the local feature data set and the interpolation feature data set to obtain a semantic space feature set, and multi-angle prediction processing is performed on the semantic space feature set to obtain a training prediction candidate set; and the parameters of the initialized grabbing detection model are updated according to the training prediction candidate set to obtain a trained grabbing detection model. The training method can effectively improve the ability of the grabbing detection model to distinguish the edges of stacked objects and effectively improve the identification accuracy of the grabbing detection model.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the following description refers to the accompanying drawings of the embodiments of the present application or of the related prior art. It should be understood that the drawings in the following description show only some embodiments of the technical solutions of the present application, and that those skilled in the art can obtain other drawings from these drawings without inventive labor.
Fig. 1 is a flow chart of a training method for grabbing a detection model according to an embodiment of the present application;
fig. 2 is a schematic diagram of a feature extraction network according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a model corresponding to step 130 according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a size balancing network according to an embodiment of the present application;
fig. 5 is a schematic diagram of a model corresponding to step 140 according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a training system for a grabbing detection model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
At present, for the problem of automatic recognition and object grabbing by a robot or manipulator in a stacked scene, conventional grabbing detection methods mainly fall into two categories:
In the first method, one or more object grabbing frames are output directly from a planar RGB image, and the grabbing parameters mainly comprise the center-point position and rotation angle of the object to be grabbed. This method is suitable for top-down grabbing by a manipulator and works well on a single object; however, as the scene becomes more complex the performance of the algorithm deteriorates, the edges of target objects in the scene are more easily misjudged, and the lack of depth information makes the manipulator grab inaccurately and may even damage the surface of the object.
In the second method, a scene point cloud obtained directly or indirectly (from an RGB-D image) is taken as input and, using the semantic and geometric information of the point cloud, is fed into a backbone network to obtain the 6D (i.e. six-degrees-of-freedom) pose of the object to be grabbed, after which grabbing is performed. However, this method is too idealized and depends on the environment (the lighting around the camera and the quality of the point cloud data): in a stacked scene it easily treats different objects in contact as a single whole, it has little fault tolerance for small or irregularly sized objects, and in some scenes where collision-free grabbing is required the grabbing quality evaluated by the algorithm is too low, so that heavily contacting objects in the scene cannot be grabbed successfully.
Therefore, the embodiment of the application provides a training method for a grabbing detection model, which can effectively improve the distinguishing capability of the grabbing detection model on the edge of a stacked object and effectively improve the identification accuracy of the grabbing detection model.
Referring to fig. 1, in an embodiment of the present application, a training method for a grabbing detection model includes:
step 110, acquiring a labeled RGB-D image training set, and performing point cloud conversion processing on the labeled RGB-D image training set to obtain a point cloud data set;
it can be understood that the RGB-D image training set may be divided into a batch of RGB images (color images) and a batch of depth images, where the RGB images are used to supplement contour information of objects in a scene and global semantic information of the scene, and the depth images are used to identify the pose of the objects; the labeled RGB-D image training set also contains labels corresponding to objects in the RGB-D image. It can be further understood that the point cloud data set is used for enriching scene features, and specific implementation manners of generating the point cloud data according to the RGB-D image are various, which will not be described in detail herein.
Referring to fig. 2, step 120, performing multi-scale feature extraction processing on the labeled RGB-D image training set to obtain a global feature atlas;
step 120, performing multi-scale feature extraction processing on the labeled RGB-D image training set to obtain a global feature atlas, including:
step 121, performing first feature extraction processing on the RGB-D image training set to obtain a first feature atlas;
step 122, performing a second feature extraction process on the first feature atlas to obtain a second feature atlas;
step 123, performing third feature extraction processing on the second feature atlas to obtain a third feature atlas;
And step 124, performing semantic aggregation processing on the first feature atlas, the second feature atlas and the third feature atlas to obtain a global feature atlas.
It can be understood that in the embodiment of the present application, the multi-scale feature extraction processing on the RGB-D image training set may be implemented by the feature extraction network in the grabbing detection model, so as to obtain the global feature atlas. The feature extraction network in the present application includes a 7×7 convolution layer with 128 channels, a 3×3 max pooling layer, four basic-block layers, a global max pooling layer (GMP), a global average pooling layer (GAP), attention channel modules, and upsampling layers. Specifically, taking a certain RGB-D image in the RGB-D image training set as the input of the feature extraction network, the network outputs a 16× downsampled feature map of the RGB-D image at the third basic-block layer, and this 16× feature map is taken as the first feature map; the fourth basic-block layer then outputs a 32× downsampled feature map, which is taken as the second feature map; the global max pooling layer then outputs a max-pooled feature map, which is taken as the third feature map. The first, second and third feature maps are each passed through a 1×1 convolution layer with 128 channels so that their channels are aligned and they have the same input form. The first feature map then passes through a global average pooling layer, a first attention channel module, a 1×1 convolution layer with 128 channels and a second attention channel convolution layer; the flow of the channel-aligned second feature map is similar to that of the first feature map, so it is not repeated here. Finally, the first, second and third feature maps are taken as the input of semantic aggregation, the local information of different sizes extracted by the network is integrated to obtain rich semantic context information of the scene, and the global feature atlas is output at the second upsampling layer of the feature extraction network.
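A minimal PyTorch sketch of the multi-scale feature extraction described above is given below; it is not the applicant's implementation. The exact basic block, attention channel module and aggregation scheme are not fully specified in the text, so a ResNet-style basic block, an SE-style channel attention module and additive aggregation are assumed purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):            # assumed SE-style attention channel module
    def __init__(self, c, r=8):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(),
                                nn.Linear(c // r, c), nn.Sigmoid())
    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))       # global average pooling -> channel weights
        return x * w[:, :, None, None]

class BasicBlock(nn.Module):                  # assumed ResNet-style basic block
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(cin, cout, 3, stride, 1, bias=False)
        self.conv2 = nn.Conv2d(cout, cout, 3, 1, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(cout), nn.BatchNorm2d(cout)
        self.skip = (nn.Conv2d(cin, cout, 1, stride, bias=False)
                     if stride != 1 or cin != cout else nn.Identity())
    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        return F.relu(self.bn2(self.conv2(out)) + self.skip(x))

class FeatureExtractor(nn.Module):
    def __init__(self, in_ch=4, c=128):       # 4 input channels for RGB-D
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_ch, c, 7, 2, 3), nn.ReLU(),
                                  nn.MaxPool2d(3, 2, 1))
        self.layer1 = BasicBlock(c, c)
        self.layer2 = BasicBlock(c, c, stride=2)
        self.layer3 = BasicBlock(c, c, stride=2)   # 1/16 scale -> first feature map
        self.layer4 = BasicBlock(c, c, stride=2)   # 1/32 scale -> second feature map
        self.align = nn.ModuleList([nn.Conv2d(c, c, 1) for _ in range(3)])
        self.attn = ChannelAttention(c)
    def forward(self, x):
        f = self.layer2(self.layer1(self.stem(x)))
        f16 = self.layer3(f)                       # first feature map (16x downsampled)
        f32 = self.layer4(f16)                     # second feature map (32x downsampled)
        gmp = F.adaptive_max_pool2d(f32, 1)        # third feature map (global max pooling)
        # align channels with 1x1 convolutions, apply channel attention, aggregate at 1/16 scale
        f16a = self.attn(self.align[0](f16))
        f32a = F.interpolate(self.attn(self.align[1](f32)), size=f16.shape[-2:], mode='bilinear')
        ga = self.align[2](gmp).expand_as(f16a)
        fused = f16a + f32a + ga                   # semantic aggregation (assumed additive)
        return F.interpolate(fused, scale_factor=4, mode='bilinear')  # upsampled global feature maps
```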
Referring to fig. 3, step 130, performing feature interpolation processing on the point cloud data set according to the global feature atlas to obtain an interpolation feature set, and performing size balance processing on the point cloud data set to obtain a local feature data set;
step 130, performing feature interpolation processing on the point cloud data set according to the global feature atlas to obtain an interpolation feature set, including:
step 131, sequentially performing up-sampling processing and instance segmentation processing on the point cloud data set to obtain a first intermediate data set;
step 132, performing downsampling processing on the first intermediate data set to obtain a second intermediate data set;
and step 133, performing feature interpolation processing on the second intermediate data set according to the global feature atlas to obtain the interpolation feature set.
It may be understood that in the embodiment of the present application, an up-sampling process may first be performed on the point cloud data set to increase the feature dimension of the point cloud; a K-nearest-neighbor algorithm may then be used for feature extraction and point cloud analysis to obtain the object type and feature information of each point, yielding a segmented point cloud; the segmented point cloud is then down-sampled to obtain a second intermediate data set grouped by instance, where the second intermediate data set contains the object types to which the point cloud data in the scene belong and the corresponding feature information. It should be noted that the second intermediate data set obtained after the up-sampling, instance segmentation and down-sampling of the point cloud data is used to reflect the local characteristics of the objects. It may be further understood that, in the embodiment of the present application, the point cloud features of the scene (i.e., the interpolation feature set) are obtained by performing feature interpolation processing on the second intermediate data set according to the global feature atlas.
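One plausible realisation of the feature interpolation step, assuming every point in the second intermediate data set retains the pixel coordinates it was back-projected from, is to bilinearly sample the global feature maps at those coordinates. The application does not fix the exact interpolation scheme, so the following sketch is illustrative only.

```python
import torch
import torch.nn.functional as F

def interpolate_point_features(global_feats, pixel_uv, image_size):
    """Sample per-point features from dense image features.

    global_feats : (B, C, H, W) global feature maps from the feature extraction network
    pixel_uv     : (B, N, 2) pixel coordinates (u, v) of each point in the second intermediate set
    image_size   : (H_img, W_img) of the original RGB-D image
    returns      : (B, N, C) interpolated feature set
    """
    h, w = image_size
    # normalise pixel coordinates to [-1, 1] as required by grid_sample
    grid = torch.stack([pixel_uv[..., 0] / (w - 1) * 2 - 1,
                        pixel_uv[..., 1] / (h - 1) * 2 - 1], dim=-1)   # (B, N, 2)
    sampled = F.grid_sample(global_feats, grid.unsqueeze(2),           # (B, C, N, 1)
                            mode='bilinear', align_corners=True)
    return sampled.squeeze(-1).permute(0, 2, 1)                        # (B, N, C)
```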
Referring to fig. 3 and 4, in step 130, performing a size balancing process on the point cloud data set to obtain a local feature data set, including:
step 135, obtaining a preset multilayer annular cylinder;
Step 136, grouping the point cloud data sets according to the multi-layer annular cylinder to obtain grouped data sets;
and step 137, performing sensing processing on the group data set according to the first intermediate data set to obtain the local characteristic data set.
It may be appreciated that in the embodiment of the present application, the size balancing process for the point cloud data set may be implemented by a size balancing network, where the size balancing network is mainly composed of a multi-layer annular cylinder and a multi-layer perceptron (MLP), and the multi-layer annular cylinder may be expressed as:
M_c = (r, h, k)
wherein r is the grabbing radius of the manipulator, r ∈ (d_min, d_width), d_min is the minimum grabbing radius of the manipulator, d_width is the maximum grabbing width of the manipulator, h is the height of the annular cylinder, and k is the number of layers of the annular cylinder.
It will be appreciated that the point cloud data set is grouped by means of the multi-layer annular cylinder M_c: each object in the scene is classified according to the minimum enclosing cylinder corresponding to its size, and the corresponding local features are extracted to obtain the grouped data set, which is used to represent the multi-size local features of the scene.
It can be understood that after the grouped data set is obtained, the multi-layer perceptron is used to encode the point cloud data of the different annular cylinders; to further enrich the semantic information of the grouped data set, it is interpolated according to the first intermediate data set for subsequent grabbing detection, so as to obtain the local feature data set of the point cloud data in the scene.
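The size balance processing can be sketched as follows: points are assigned to one of the k annular cylinders of M_c = (r, h, k) around a candidate grabbing centre according to their radial distance, and each ring is encoded with a shared multi-layer perceptron followed by max pooling. The MLP widths, the ring boundaries and the per-ring pooling below are assumptions made for the example, not values given by the application.

```python
import torch
import torch.nn as nn

class SizeBalanceEncoder(nn.Module):
    def __init__(self, in_dim=3, feat_dim=128, k=4):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, feat_dim), nn.ReLU())

    def forward(self, points, center, d_min, d_width, height):
        """points: (N, 3); center: (3,). Returns (k, feat_dim) per-ring local features."""
        rel = points - center
        radius = torch.linalg.norm(rel[:, :2], dim=1)            # radial distance in the grasp plane
        in_height = rel[:, 2].abs() <= height / 2                # keep points inside the cylinder height
        # k ring boundaries between the minimum grabbing radius and the maximum grabbing width
        edges = torch.linspace(d_min, d_width, self.k + 1)
        feats = []
        for i in range(self.k):
            mask = in_height & (radius >= edges[i]) & (radius < edges[i + 1])
            if mask.any():
                feats.append(self.mlp(rel[mask]).max(dim=0).values)   # max-pool the ring's point features
            else:
                feats.append(torch.zeros(self.mlp[-2].out_features))  # empty ring -> zero feature
        return torch.stack(feats)                                 # grouped multi-size local features
```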
Referring to fig. 5, step 140, performing feature fusion processing on the local feature data set and the interpolation feature data set to obtain a semantic space feature set, and performing multi-angle prediction processing on the semantic space feature set to obtain a training prediction candidate set;
it may be understood that in the embodiment of the present application, the multi-angle prediction processing on the semantic space feature data set may be implemented by capturing a prediction network, first, feature fusion may be performed on the local feature data set and the interpolation feature data set to obtain a semantic space feature set, then, the semantic space feature set is input into a 1×1 convolution layer with 128 channels, so as to reduce the channel dimension of the semantic space feature set, and the semantic space feature set with reduced channel dimension is input into the capturing prediction network for multi-angle prediction processing.
With continued reference to fig. 5, in step 140, performing multi-angle prediction processing on the semantic space feature set to obtain a training prediction candidate set, including:
Step 141, performing up-sampling processing on the semantic space feature set to obtain a third intermediate data set;
step 142, performing grabbing width processing on the third intermediate data set to obtain a width candidate set;
step 143, performing grabbing angle processing on the third intermediate data set to obtain an angle candidate set;
step 144, performing grabbing probability processing on the third intermediate data set to obtain a probability candidate set;
step 145, generating the training prediction candidate set according to the width candidate set, the angle candidate set and the probability candidate set.
It may be appreciated that in the embodiment of the present application, the grabbing prediction network first performs preliminary processing on the semantic space feature set with two transposed convolution layers with 128 channels, so that the image size in the semantic space feature set is the same as the size of the initially input RGB-D image; after the third intermediate data set is obtained through this up-sampling processing, three branches perform the grabbing width processing, the grabbing angle processing and the grabbing probability processing respectively, where each branch has a 1×1 convolution layer and a 3×3 convolution layer.
Specifically, for a certain RGB-D image input to the capture detection model, the corresponding training prediction candidate set may be represented by an approximation function, which may be specifically expressed as:
G = (Γ, W, Q) ∈ R^(3×H×W)
wherein Γ is the manipulator grabbing-angle configuration corresponding to the angle candidate set, W is the manipulator grabbing-width configuration corresponding to the width candidate set, Q is the manipulator grabbing-probability configuration corresponding to the probability candidate set, H is the height of the output prediction candidate frame, W is the width of the output prediction candidate frame, R^(3×H×W) denotes the real-valued space of dimension 3×H×W, and G is used to represent the prediction candidate set comprising the angle grabbing configuration Γ, the width grabbing configuration W and the grabbing probability configuration Q.
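A compact PyTorch sketch of the grabbing prediction network as described (two 128-channel transposed convolution layers followed by three parallel branches, each with a 1×1 and a 3×3 convolution) is given below. The (sin 2θ, cos 2θ) angle encoding and the sigmoid on the probability branch are assumptions commonly used in planar grasp detection, not details stated in the application.

```python
import torch
import torch.nn as nn

class GraspPredictionHead(nn.Module):
    def __init__(self, c=128):
        super().__init__()
        # two transposed convolutions restore the feature maps to the input RGB-D resolution
        self.up = nn.Sequential(
            nn.ConvTranspose2d(c, c, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(c, c, 4, stride=2, padding=1), nn.ReLU())
        def branch(out_ch):
            return nn.Sequential(nn.Conv2d(c, c, 1), nn.ReLU(),
                                 nn.Conv2d(c, out_ch, 3, padding=1))
        self.width_head = branch(1)       # grabbing width map W
        self.angle_head = branch(2)       # grabbing angle, assumed (sin 2θ, cos 2θ) encoding
        self.quality_head = branch(1)     # grabbing probability map Q

    def forward(self, semantic_feats):
        x = self.up(semantic_feats)                            # third intermediate data set
        ang = self.angle_head(x)
        width = self.width_head(x)
        angle = 0.5 * torch.atan2(ang[:, 0:1], ang[:, 1:2])    # grabbing angle map Γ
        quality = torch.sigmoid(self.quality_head(x))          # grabbing probability map Q
        return width, angle, quality                           # together they form G = (Γ, W, Q)
```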
Step 150, updating parameters of the initialized grabbing detection model according to the training prediction candidate set to obtain a trained grabbing detection model.
Step 150, performing parameter updating on the initialized grabbing detection model according to the training prediction candidate set to obtain a trained grabbing detection model, including:
step 151, obtaining a stacking object label corresponding to the training prediction candidate set;
step 152, determining a training loss value according to the training prediction candidate set and the stacked object label;
And step 153, updating parameters of the initialized grabbing detection model according to the training loss value to obtain a trained grabbing detection model.
It can be understood that in the embodiment of the application, before the grabbing detection model is put into use, training is needed to adjust the parameters in the grabbing detection model, so that a better prediction effect is achieved. Specifically, when the model is trained, a batch of RGB-D images may be acquired, where each RGB-D image includes image data of a stacked object, and a stacked object tag corresponding to the RGB-D image is also acquired, where the tag is used to characterize a real type of the stacked object in the RGB-D image. Then, each RGB-D image and the corresponding stacked object label can be used as a group of training data, the input data of the model is the RGB-D image, the RGB-D image is predicted through the model, and the output data of the model is a training prediction candidate frame. After obtaining the training prediction candidate frame output by the model, the accuracy of model prediction can be evaluated according to the training prediction candidate frame and the stacked object label, so that parameters of the model are updated. It should be noted that, the training prediction candidate frame in the embodiment of the present application is one of the recognition results in the training prediction candidate set.
Specifically, for machine learning models, the accuracy of model predictions may be measured by a loss function (Loss Function), which is defined on a single training sample and measures its prediction error: the loss value is determined from the label of the single training sample and the model's prediction for it. In actual training, a training data set contains many training samples, so a cost function (Cost Function) is generally adopted to measure the overall error of the training data set; the cost function is defined on the whole training data set and computes the average prediction error over all training samples, which better measures the prediction effect of the model. For a general machine learning model, the cost function plus a regularization term measuring the complexity of the model can be used as the training objective function, and based on this objective function the loss value of the whole training data set can be obtained. There are many kinds of common loss functions, such as the 0-1 loss function, the square loss function, the absolute loss function, the logarithmic loss function and the cross-entropy loss function, any of which can be used as the loss function of a machine learning model and will not be described in detail here. In embodiments of the present application, one of these loss functions, for example the cross-entropy loss function, may be selected to determine the training loss value. Based on the training loss value, the parameters of the model are updated with a back-propagation algorithm, and after several iterations the trained grabbing detection model is obtained. The specific number of iterations may be preset, or training may be deemed complete when the model meets the accuracy requirement on a test set.
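As a hedged illustration of the parameter update step, the loop below trains the model with a smooth L1 loss on the width and angle maps and a binary cross-entropy on the grasp probability map; the specific loss composition, optimizer and hyperparameters are assumptions, since the application only requires that a suitable loss function (for example, cross entropy) be selected.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs=50, lr=1e-3, device='cuda'):
    """model maps an RGB-D batch to (width, angle, quality) maps; loader yields labelled batches."""
    model.to(device).train()
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for rgbd, gt_width, gt_angle, gt_quality in loader:   # stacked-object labels rendered as maps
            rgbd = rgbd.to(device)
            pred_w, pred_a, pred_q = model(rgbd)
            loss = (F.smooth_l1_loss(pred_w, gt_width.to(device))
                    + F.smooth_l1_loss(pred_a, gt_angle.to(device))
                    + F.binary_cross_entropy(pred_q, gt_quality.to(device)))
            optim.zero_grad()
            loss.backward()          # back-propagation to update the initialized model parameters
            optim.step()
    return model
```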
In summary, in the embodiment of the application, more accurate semantic information of the scene is obtained through the feature extraction network structure; feature interpolation is performed on the point cloud data set using the global feature atlas to obtain the local features of the scene; the first intermediate data set obtained from point cloud instance segmentation and the multi-layer perceptron are used for size balance processing to obtain richer local features of the scene; and these feature data are input into the grabbing detection network for prediction and analysis, so that more accurate grabbing poses can be generated.
In addition, the embodiment of the application provides a method for identifying a grabbing detection model, which comprises the following steps:
step 210, acquiring an RGB-D image to be detected;
step 220, inputting the RGB-D image to be detected into the grabbing detection model trained as described above, to obtain a prediction candidate set;
And step 230, performing confusion degree analysis on the prediction candidate set to obtain a first prediction candidate frame, wherein the first prediction candidate frame is used for representing the prediction candidate frame with the lowest confusion degree in the prediction candidate set.
It can be understood that, for an RGB-D image to be detected that is input into the grabbing detection model, the conventional approach after obtaining the prediction candidates is simply to select the prediction grabbing candidate frame with the maximum score and execute it. In contrast, the embodiment of the application presets a fixed threshold in the range of 0 to 1, integrates the prediction candidate frames whose scores exceed the threshold into the prediction candidate set, analyzes the confusion degree of the prediction candidate set in combination with the context semantics of the scene, and guides the manipulator to grab objects according to the first prediction candidate frame with the lowest confusion degree, instead of relying on the data set. This effectively improves the object grabbing success rate in stacked scenes.
Specifically, in the embodiment of the present application, in order to intuitively measure how cluttered the surroundings of the target object are, the confusion degree is used to represent the degree of confusion of the corresponding semantic categories within a prediction candidate frame, and may be expressed as follows:
wherein S is the number of pixels of the object in the grabbing frame, c is the semantic category, H(c) is the confusion degree of semantic category c, t(c_i) denotes the i-th pixel in the region whose semantic class is c, h_i denotes the height of the i-th object, h_t denotes the average height of all objects within the scene area, and the height coefficient is used to measure the influence of objects of different heights around the target object on the grabbing difficulty.
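Since the exact confusion degree formula is not reproduced in this text, the sketch below uses a placeholder entropy-style score over the semantic classes in a candidate frame, weighted by relative object heights, purely to illustrate how the first prediction candidate frame (the one with the lowest confusion degree) would be selected from the thresholded candidate set.

```python
import numpy as np

def select_candidate(candidates, semantic_map, height_map, tau=0.5):
    """candidates: list of dicts with 'score' and 'box' = (x0, y0, x1, y1).

    semantic_map and height_map give per-pixel class ids and object heights.
    Returns the candidate with the lowest (placeholder) confusion degree.
    """
    def confusion(box):
        x0, y0, x1, y1 = box
        classes = semantic_map[y0:y1, x0:x1].ravel()
        heights = height_map[y0:y1, x0:x1].ravel()
        mean_h = heights.mean() + 1e-6
        score = 0.0
        for c in np.unique(classes):
            mask = classes == c
            p = mask.mean()                       # fraction of pixels of class c in the region
            w = heights[mask].mean() / mean_h     # height coefficient (assumed weighting)
            score += -w * p * np.log(p + 1e-9)    # entropy-style clutter contribution
        return score

    kept = [c for c in candidates if c['score'] > tau]   # preset fixed threshold in (0, 1)
    return min(kept, key=lambda c: confusion(c['box'])) if kept else None
```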
Referring to fig. 6, an embodiment of the present application further provides a training system for a grabbing detection model, including:
the acquisition module 101 is configured to acquire a labeled RGB-D image training set, and perform point cloud conversion processing on the labeled RGB-D image training set to obtain a point cloud data set;
the first processing module 102 is configured to perform multi-scale feature extraction processing on the labeled RGB-D image training set to obtain a global feature atlas;
a second processing module 103, configured to perform feature interpolation processing on the point cloud data set according to the global feature atlas to obtain an interpolation feature set, and perform size balance processing on the point cloud data set to obtain a local feature data set;
the third processing module 104 is configured to perform feature fusion processing on the local feature data set and the interpolation feature data set to obtain a semantic space feature set, and perform multi-angle prediction processing on the semantic space feature set to obtain a training prediction candidate set;
and the updating module 105 is configured to update parameters of the initialized grabbing detection model according to the training prediction candidate set, so as to obtain a trained grabbing detection model.
It can be understood that the content in the above method embodiment is applicable to the system embodiment, and the functions specifically implemented by the system embodiment are the same as those of the above method embodiment, and the achieved beneficial effects are the same as those of the above method embodiment.
Referring to fig. 7, an embodiment of the present application further provides a computer apparatus, including:
at least one processor 201;
at least one memory 202 for storing at least one program;
the at least one program, when executed by the at least one processor 201, causes the at least one processor 201 to implement the method embodiments described above.
Similarly, it can be understood that the content in the above method embodiment is applicable to the embodiment of the present apparatus, and the functions specifically implemented by the embodiment of the present apparatus are the same as those of the embodiment of the foregoing method, and the achieved beneficial effects are the same as those achieved by the embodiment of the foregoing method.
The embodiment of the present application further provides a computer readable storage medium in which a program executable by the processor 201 is stored; when executed by the processor 201, the program is configured to implement the above-described method embodiment.
Similarly, the content in the above method embodiment is applicable to the present computer-readable storage medium embodiment, the functions specifically implemented by the present computer-readable storage medium embodiment are the same as those of the above method embodiment, and the beneficial effects achieved are the same as those achieved by the above method embodiment.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present application are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the application is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the functions and/or features may be integrated in a single physical device and/or software module or may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present application. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the application as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the application, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium may even be paper or other suitable medium upon which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
In the description of the present specification, reference to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the application, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present application, and the equivalent modifications or substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (10)

1. A training method for a grabbing detection model, the training method comprising:
acquiring a labeled RGB-D image training set, and performing point cloud conversion processing on the labeled RGB-D image training set to obtain a point cloud data set;
performing multi-scale feature extraction processing on the labeled RGB-D image training set to obtain a global feature atlas;
performing feature interpolation processing on the point cloud data set according to the global feature atlas to obtain an interpolation feature set, and performing size balance processing on the point cloud data set to obtain a local feature data set;
performing feature fusion processing on the local feature data set and the interpolation feature data set to obtain a semantic space feature set, and performing multi-angle prediction processing on the semantic space feature set to obtain a training prediction candidate set;
and updating parameters of the initialized grabbing detection model according to the training prediction candidate set to obtain a trained grabbing detection model.
2. The method for training a grabbing detection model according to claim 1, wherein the performing multi-scale feature extraction processing on the labeled RGB-D image training set to obtain a global feature atlas includes:
performing first feature extraction processing on the RGB-D image training set to obtain a first feature image set;
performing second feature extraction processing on the first feature atlas to obtain a second feature atlas;
performing third feature extraction processing on the second feature atlas to obtain a third feature atlas;
and carrying out semantic aggregation processing on the first feature atlas, the second feature atlas and the third feature atlas to obtain a global feature atlas.
3. The method for training a grabbing detection model according to claim 1, wherein the performing feature interpolation processing on the point cloud data set according to the global feature atlas to obtain an interpolation feature set includes:
sequentially performing up-sampling processing and instance segmentation processing on the point cloud data set to obtain a first intermediate data set;
downsampling the first intermediate data set to obtain a second intermediate data set;
and carrying out feature interpolation processing on the second intermediate data set according to the global feature atlas to obtain the interpolation feature set.
4. The method for training a grabbing detection model according to claim 3, wherein performing a size balance process on the point cloud data set to obtain a local feature data set comprises:
acquiring a preset multi-layer annular cylinder;
grouping the point cloud data sets according to the multi-layer annular cylinder to obtain grouped data sets;
and performing sensing processing on the group data set according to the first intermediate data set to obtain the local characteristic data set.
5. The method for training a grabbing detection model according to claim 1, wherein the performing multi-angle prediction processing on the semantic space feature set to obtain a training prediction candidate set includes:
performing up-sampling processing on the semantic space feature set to obtain a third intermediate data set;
performing grabbing width processing on the third intermediate data set to obtain a width candidate set;
performing grabbing angle processing on the third intermediate data set to obtain an angle candidate set;
performing grabbing probability processing on the third intermediate data set to obtain a probability candidate set;
and generating the training prediction candidate set according to the width candidate set, the angle candidate set and the probability candidate set.
6. The method for training a grabbing detection model according to claim 1, wherein the parameter updating of the initialized grabbing detection model according to the training prediction candidate set to obtain a trained grabbing detection model comprises:
obtaining a stacked object label corresponding to the training prediction candidate set;
determining a training loss value according to the training prediction candidate set and the stacked object label;
and according to the training loss value, carrying out parameter updating on the initialized grabbing detection model to obtain a trained grabbing detection model.
7. An identification method for a grabbing detection model, comprising:
acquiring an RGB-D image to be detected;
inputting the RGB-D image to be detected into the grabbing detection model trained by the training method according to any one of claims 1-6 to obtain a prediction candidate set;
and carrying out confusion degree analysis on the prediction candidate set to obtain a first prediction candidate frame, wherein the first prediction candidate frame is used for representing the prediction candidate frame with the lowest confusion degree in the prediction candidate set.
8. A training system for a grabbing detection model, comprising:
the acquisition module is used for acquiring the labeled RGB-D image training set and carrying out point cloud conversion processing on the labeled RGB-D image training set to obtain a point cloud data set;
the first processing module is used for carrying out multi-scale feature extraction processing on the marked RGB-D image training set to obtain a global feature image set;
the second processing module is used for carrying out characteristic interpolation processing on the point cloud data set according to the global characteristic atlas to obtain an interpolation characteristic set, and carrying out size balance processing on the point cloud data set to obtain a local characteristic data set;
the third processing module is used for carrying out feature fusion processing on the local feature data set and the interpolation feature data set to obtain a semantic space feature set, and carrying out multi-angle prediction processing on the semantic space feature set to obtain a training prediction candidate set;
and the updating module is used for carrying out parameter updating on the initialized grabbing detection model according to the training prediction candidate set to obtain a trained grabbing detection model.
9. A computer device, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any of claims 1-7.
10. A computer readable storage medium, in which a processor executable program is stored, characterized in that the processor executable program is for implementing the method according to any of claims 1-7 when being executed by the processor.
CN202310986735.5A 2023-08-07 2023-08-07 Training method, recognition method, system and equipment for grabbing detection model Pending CN117237751A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310986735.5A CN117237751A (en) 2023-08-07 2023-08-07 Training method, recognition method, system and equipment for grabbing detection model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310986735.5A CN117237751A (en) 2023-08-07 2023-08-07 Training method, recognition method, system and equipment for grabbing detection model

Publications (1)

Publication Number Publication Date
CN117237751A true CN117237751A (en) 2023-12-15

Family

ID=89086916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310986735.5A Pending CN117237751A (en) 2023-08-07 2023-08-07 Training method, recognition method, system and equipment for grabbing detection model

Country Status (1)

Country Link
CN (1) CN117237751A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117523181A (en) * 2023-12-29 2024-02-06 佛山科学技术学院 Multi-scale object grabbing point detection method and system based on unstructured scene

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117523181A (en) * 2023-12-29 2024-02-06 佛山科学技术学院 Multi-scale object grabbing point detection method and system based on unstructured scene

Similar Documents

Publication Publication Date Title
CN112884064B (en) Target detection and identification method based on neural network
CN107169954B (en) Image significance detection method based on parallel convolutional neural network
CN112434586B (en) Multi-complex scene target detection method based on domain self-adaptive learning
CN112233181A (en) 6D pose recognition method and device and computer storage medium
CN110659550A (en) Traffic sign recognition method, traffic sign recognition device, computer equipment and storage medium
CN117253154B (en) Container weak and small serial number target detection and identification method based on deep learning
CN117237751A (en) Training method, recognition method, system and equipment for grabbing detection model
CN113240716B (en) Twin network target tracking method and system with multi-feature fusion
CN114419413A (en) Method for constructing sensing field self-adaptive transformer substation insulator defect detection neural network
CN116452966A (en) Target detection method, device and equipment for underwater image and storage medium
CN116994135A (en) Ship target detection method based on vision and radar fusion
Fan et al. A novel sonar target detection and classification algorithm
CN113221731B (en) Multi-scale remote sensing image target detection method and system
CN116681885A (en) Infrared image target identification method and system for power transmission and transformation equipment
CN114782827B (en) Object capture point acquisition method and device based on image
CN116953702A (en) Rotary target detection method and device based on deduction paradigm
CN116523881A (en) Abnormal temperature detection method and device for power equipment
CN116342536A (en) Aluminum strip surface defect detection method, system and equipment based on lightweight model
CN112446292B (en) 2D image salient object detection method and system
CN115760695A (en) Image anomaly identification method based on depth vision model
CN114820732A (en) System and method for detecting and describing key points of high-speed train image
CN117011722A (en) License plate recognition method and device based on unmanned aerial vehicle real-time monitoring video
CN114049478A (en) Infrared ship image rapid identification method and system based on improved Cascade R-CNN
CN113408356A (en) Pedestrian re-identification method, device and equipment based on deep learning and storage medium
CN113807354A (en) Image semantic segmentation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination