CN113034575A - Model construction method, pose estimation method and object picking device - Google Patents


Info

Publication number
CN113034575A
Authority
CN
China
Prior art keywords
target object
pose
semantic information
information
level semantic
Prior art date
Legal status
Pending
Application number
CN202110111623.6A
Other languages
Chinese (zh)
Inventor
杨洋
Current Assignee
Shenzhen Huahan Weiye Technology Co ltd
Original Assignee
Shenzhen Huahan Weiye Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Huahan Weiye Technology Co ltd filed Critical Shenzhen Huahan Weiye Technology Co ltd
Priority to CN202110111623.6A priority Critical patent/CN113034575A/en
Publication of CN113034575A publication Critical patent/CN113034575A/en
Pending legal-status Critical Current

Classifications

    • G06T 7/70: Image analysis; Determining position or orientation of objects or cameras
    • G06F 18/214: Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/241: Pattern recognition; Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/25: Pattern recognition; Fusion techniques
    • G06N 3/08: Neural networks; Learning methods
    • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06V 20/64: Scenes; Type of objects; Three-dimensional objects
    • G06T 2207/20081: Special algorithmic details; Training; Learning
    • G06T 2207/20084: Special algorithmic details; Artificial neural networks [ANN]
    • G06T 2207/20221: Special algorithmic details; Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a model construction method, a pose estimation method and an object picking device. The model construction method comprises the following steps: acquiring dense fusion data of a target object; training a preset network model according to the dense fusion data to learn network weight parameters; and configuring a pose estimation model of the target object according to the network weight parameters. The network model comprises a backbone node layer and a head node layer, and the head node layer comprises a classification layer and a regression layer. The backbone node layer is used for constructing high-level semantic information of the target object according to the dense fusion data; the classification layer is used for processing the high-level semantic information to determine the category and score of the target object, and the regression layer is used for processing the high-level semantic information to predict the pose and confidence of the target object. This scheme improves the ability of the pose estimation model to reconstruct panoramic information in detection scenes with complex backgrounds and strong interference, and thereby improves detection of the pose of the target object.

Description

Model construction method, pose estimation method and object picking device
Technical Field
The invention relates to the technical field of image processing, in particular to a model construction method, a pose estimation method and an object picking device.
Background
In the current manufacturing industry, the assembly process consumes a great deal of time and capital. To improve production efficiency and reduce labor costs, people have begun to explore automatic assembly with robots. Part identification and grasp-position planning are indispensable links in the automatic assembly process and have an important influence on assembly quality. Vision-based part pose estimation and grasp-position planning can significantly improve the automation and flexibility of product assembly, reduce the time consumed, lower costs, and raise manufacturing efficiency. Robot automation involves two key technologies: part identification and automatic grasping.
Computer vision techniques occupy an important position in robot perception of unstructured scenes. Visual images are an effective means of acquiring real-world information; a visual perception algorithm extracts task-relevant features such as object position, angle and posture so that the robot can execute the corresponding operations and complete a specified task. For industrial robot sorting, scene data can be acquired with a vision sensor, but identifying the target object in the scene and estimating its position and posture is the core problem in calculating the grasping position and grasping path of an industrial robot. In recent years, with the rapid development of deep learning, pose estimation based on deep learning has become the mainstream approach in the field; however, most existing deep-learning pose estimation algorithms rely on information such as the color and texture of the object surface, identify low-texture and reflective industrial parts poorly, and therefore hinder efficient automatic sorting of parts.
At present, relatively mature machine vision grasping methods based on artificial intelligence predict the pose of a workpiece from a two-dimensional image acquired by a camera, but such methods usually lack three-dimensional information about the workpiece and can only achieve two-dimensional pose estimation. Traditional reinforcement learning methods have great limitations when dealing with high-dimensional state and action spaces; with limited samples and computing units their capacity to represent complex functions is limited, and their performance in practical applications is not ideal. Meanwhile, traditional deep reinforcement learning algorithms require a large amount of training data, and during training the robot must grasp repeatedly by trial and error before it can acquire a stable picking capability. Such training has a long cycle and low efficiency, carries safety risks during actual training, and often cannot meet the requirements of industrial production.
Disclosure of Invention
The technical problem mainly solved by the invention is how to improve the accuracy of machine vision picking. To solve this problem, the application provides a model construction method, a pose estimation method and an object picking device. According to a first aspect, an embodiment provides a pose estimation model construction method, which includes: acquiring dense fusion data of a target object, the dense fusion data being obtained by heterogeneously fusing two-dimensional image data and three-dimensional point cloud data of the target object; training a preset network model according to the dense fusion data and learning to obtain network weight parameters; and configuring a pose estimation model of the target object according to the network weight parameters.
The network model comprises a backbone node layer and a head node layer, wherein the head node layer comprises a classification layer and a regression layer; the backbone node layer is used for constructing high-level semantic information of the target object according to the dense fusion data; the high-level semantic information comprises coordinates and feature vectors of feature points on the surface of the target object; a classification layer in the head node layers is used for processing the high-level semantic information to determine the category and the score of the target object, and a regression layer in the head node layers is used for processing the high-level semantic information to predict the pose and the confidence of the target object.
For a regression layer in the network model, predicting one or more poses of the target object according to coordinates of each feature point in the high-level semantic information, calculating confidence degrees of the predicted poses according to feature vectors of each feature point in the high-level semantic information, comparing the confidence degrees of the poses, and determining the pose corresponding to the highest confidence degree as the optimal pose; establishing a total loss function for the network model and expressing as
L = L_conf + L_loc
L_conf = -Σ_{i∈Pos} α_i^p · log(ĉ_i^p) - Σ_{i∈Neg} log(ĉ_i^0)
L_loc = (1/N) Σ_{i=1}^{N} ( L_i · s_i - w · log(s_i) )
where L_conf and L_loc are the loss function of the classification layer and the loss function of the regression layer respectively; the superscript p is the index of the class and the superscript 0 denotes the background; α_i^p is a weight coefficient; ĉ_i^p is the score of the i-th feature point for category p (and ĉ_i^0 its background score); Pos denotes the set of feature points outside the background and Neg denotes the set of feature points belonging to the background; N is the number of feature points; L_i is the pose loss function required to predict the i-th pose; s_i is the confidence of the i-th pose; w is a weight coefficient; and log() is the logarithm function.
Predicting one or more poses of the target object according to the coordinates of each feature point in the high-level semantic information includes: acquiring the coordinates of each feature point in the high-level semantic information, expressed as x_j; and, when the shape of the target object is determined to be an asymmetric structure, establishing a first loss function for the i-th pose, expressed as
L_i = (1/M) Σ_{j=1}^{M} || (R·x_j + t) - (R̄·x_j + t̄) ||
where M is the number of selected feature points, j is the traversal index over M, R̄ and t̄ are the labeled rotation matrix and translation vector respectively, and R and t are the calculated rotation matrix and translation vector. R and t are obtained through iterative calculation when the total loss function converges and serve as one or more poses of the target object.
Predicting one or more poses of the target object according to the coordinates of each feature point in the high-level semantic information includes: acquiring the coordinates of each feature point in the high-level semantic information, expressed as x_j; and, when the shape of the target object is determined to be a symmetric structure, establishing a second loss function for the i-th pose, expressed as
L_i = (1/M) Σ_{j=1}^{M} min_{k∈[1,M]} || (R·x_j + t) - (R̄·x_k + t̄) ||
where M is the number of selected feature points, j and k are traversal indices over M, R̄ and t̄ are the labeled rotation matrix and translation vector respectively, and R and t are the calculated rotation matrix and translation vector. R and t are obtained through iterative calculation when the total loss function converges and serve as one or more poses of the target object.
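To make the two pose loss terms concrete, the following is a minimal NumPy sketch (not taken from the patent; the function names, array shapes and the use of NumPy are assumptions for illustration). The asymmetric loss averages the distances between corresponding points transformed by the calculated pose (R, t) and the labeled pose (R̄, t̄); the symmetric loss matches each transformed point to its nearest point under the labeled pose before averaging.

```python
import numpy as np

def pose_loss_asymmetric(x, R, t, R_gt, t_gt):
    """First loss function: mean distance between corresponding points
    transformed by the calculated pose (R, t) and the labeled pose (R_gt, t_gt)."""
    pred = x @ R.T + t          # (M, 3) feature points under the calculated pose
    gt = x @ R_gt.T + t_gt      # (M, 3) feature points under the labeled pose
    return np.mean(np.linalg.norm(pred - gt, axis=1))

def pose_loss_symmetric(x, R, t, R_gt, t_gt):
    """Second loss function: for each point under the calculated pose, take the
    distance to the closest point under the labeled pose, then average."""
    pred = x @ R.T + t                       # (M, 3)
    gt = x @ R_gt.T + t_gt                   # (M, 3)
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=2)  # (M, M) pairwise distances
    return np.mean(d.min(axis=1))
```

The minimum over k in the symmetric case is what keeps the loss well defined when several rotations map the object onto itself.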
Calculating the confidence of each predicted pose according to the feature vector of each feature point in the high-level semantic information includes: acquiring the feature vector of each feature point in the high-level semantic information, expressed as p_v, and calculating the confidence of the i-th pose with a normalization function, expressed as
s_i = exp(p_i) / Σ_{v=1}^{N} exp(p_v)
where p_i corresponds to the i-th feature point in the high-level semantic information and the subscript v is the traversal index over the N feature points.
Training the preset network model according to the dense fusion data and learning the network weight parameters includes: inputting the dense fusion data into the network model, optimizing the total loss function through back propagation, and obtaining the network weight parameters when the total loss function converges.
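A short PyTorch-style sketch of this training step is given below under stated assumptions: the model interface, the use of cross-entropy as a stand-in for the Pos/Neg classification loss, and the value of the weight coefficient w are illustrative and not taken from the patent.

```python
import torch
import torch.nn.functional as F

def total_loss(class_scores, class_labels, pose_losses, confidences, w=0.015):
    """L = L_conf + L_loc.
    class_scores: (N, n_classes + 1) per-feature-point scores (class 0 = background)
    class_labels: (N,) integer class labels per feature point
    pose_losses:  (N,) pose loss L_i of each predicted pose
    confidences:  (N,) confidence s_i of each predicted pose
    w: weight coefficient (value assumed here)
    """
    l_conf = F.cross_entropy(class_scores, class_labels)                          # classification-layer loss
    l_loc = torch.mean(pose_losses * confidences - w * torch.log(confidences))    # confidence-weighted regression loss
    return l_conf + l_loc

# Illustrative optimization by back propagation (model and loader are placeholders):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# for fused_data, labels, model_points in loader:
#     class_scores, poses, confidences, pose_losses = model(fused_data, labels, model_points)
#     loss = total_loss(class_scores, labels, pose_losses, confidences)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```

When the loss converges, the current model parameters are the learned network weight parameters used to configure the pose estimation model.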
According to a second aspect, an embodiment provides a pose estimation method for a target object, which includes: acquiring a scene image of the target object; processing the scene image with the pose estimation model of the target object configured in the first aspect, and obtaining category information and pose information of the target object through pose estimation; and outputting the category information and the pose information of the target object.
According to a third aspect, an embodiment provides a picking apparatus for a target object, comprising: a sensor for acquiring a scene image of the target object; a processor, connected to the sensor, for processing the scene image with the pose estimation method of the second aspect and outputting the category information and pose information of the target object; and a controller, connected to the sensor and the processor, for controlling the sensor to capture an image of the target object and controlling a motion mechanism to grasp the target object according to the category information and pose information of the target object.
According to a fourth aspect, an embodiment provides a computer-readable storage medium, characterized in that the medium has stored thereon a program executable by a processor to implement the construction method described in the first aspect above and to implement the pose estimation method described in the second aspect above.
The beneficial effect of this application is:
according to the model construction method, the pose estimation method and the object picking device of the embodiment, the model construction method comprises the following steps: acquiring dense fusion data of a target object; training a preset network model according to the dense fusion data, and learning to obtain a network weight parameter; configuring and forming a pose estimation model of the target object according to the network weight parameters; the network model comprises a backbone node layer and a head node layer, the head node layer comprises a classification layer and a regression layer, the backbone node layer is used for constructing high-level semantic information of a target object according to dense fusion data, the classification layer in the head node layer is used for processing the high-level semantic information to judge the category and the score of the target object, and the regression layer in the head node layer is used for processing the high-level semantic information to predict the pose and the confidence of the target object. On one hand, as the pose estimation model adopts the backbone node layer and the head node layer to carry out deep learning processing on the dense fusion data of the target object, the deep analysis of the dense fusion data can be realized, and the panoramic information reconstruction capability of the pose estimation model on the detection scene with complex background and strong interference is improved, so that the performance of the pose detection of the target object is improved; on the other hand, the constructed pose estimation model provides possibility for pose estimation of the target object, and the category information and the pose information of the target object can be accurately output by extracting object features only by inputting the scene image of the target object into the model, so that the motion path of the motion mechanism can be controlled, the self-adaptive grabbing operation of the target object is realized, and particularly the sorting capability of the robot on the scattered part scene is improved.
Drawings
Fig. 1 is a flowchart of a pose estimation method of a target object in the present application;
FIG. 2 is a schematic diagram of a network model;
FIG. 3 is a schematic diagram of a classification layer and a regression layer;
FIG. 4 is a schematic diagram of pose estimation;
FIG. 5 is a flowchart of a pose estimation model construction method in the present application;
FIG. 6 is a flow chart of training a network model;
FIG. 7 is a flow chart for establishing a total loss function;
FIG. 8 is a flow chart of predicting the pose of a target object;
FIG. 9 is a schematic diagram of a picking apparatus for a target object according to the present application;
FIG. 10 is a schematic diagram of a processor and controller;
fig. 11 is a schematic structural diagram of a pose recognition apparatus in an embodiment.
Detailed Description
The present invention will be described in further detail with reference to the following detailed description and accompanying drawings. Wherein like elements in different embodiments are numbered with like associated elements. In the following description, numerous details are set forth in order to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of the features may be omitted or replaced with other elements, materials, methods in different instances. In some instances, certain operations related to the present application have not been shown or described in detail in order to avoid obscuring the core of the present application from excessive description, and it is not necessary for those skilled in the art to describe these operations in detail, so that they may be fully understood from the description in the specification and the general knowledge in the art.
Furthermore, the features, operations, or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Also, the various steps or actions in the method descriptions may be transposed or transposed in order, as will be apparent to one of ordinary skill in the art. Thus, the various sequences in the specification and drawings are for the purpose of describing certain embodiments only and are not intended to imply a required sequence unless otherwise indicated where such sequence must be followed.
The numbering of the components as such, e.g., "first", "second", etc., is used herein only to distinguish the objects as described, and does not have any sequential or technical meaning. The term "connected" and "coupled" when used in this application, unless otherwise indicated, includes both direct and indirect connections (couplings).
China is one of the world's five largest consumers of industrial robots: the number of installed robots increased by 59% in 2019, and consumption exceeded that of Europe and the United States combined, so the demand for intelligent industrial robots is huge. Robots for handling, loading and unloading account for more than two thirds of this, and the added value brought by intelligent upgrading is significant. With the development of artificial intelligence, attention has gradually turned to workpiece grasping pose estimation based on artificial intelligence: a pre-trained deep reinforcement learning network performs data dimension reduction and feature extraction on images, a control strategy for the robot is obtained from the feature extraction result, and the robot uses this strategy to control the motion path and pose of the mechanical arm, thereby realizing adaptive picking of the target.
Machine vision grasping methods based on artificial intelligence predict the pose of a workpiece from a two-dimensional image acquired by a camera, but such methods usually lack three-dimensional information about the workpiece and can only achieve two-dimensional pose estimation. There are fewer artificial intelligence methods that estimate workpiece pose from its three-dimensional point cloud information, and at present they are realized with deep reinforcement learning. However, traditional reinforcement learning methods have great limitations when dealing with high-dimensional state and action spaces, have limited capacity to represent complex functions with limited samples and computing units, and perform unsatisfactorily in practice. Meanwhile, traditional deep reinforcement learning algorithms require a large amount of training data, and during training the robot must grasp repeatedly by trial and error before it can acquire a stable picking capability; such training has a long cycle and low efficiency, carries safety risks, and often cannot meet the requirements of industrial production. Therefore, the invention simulates a real scene by combining a two-dimensional image with depth information, establishes a pose estimation network and a multi-modal data fusion network based on deep learning, builds a corresponding data set, and obtains an object pose estimation algorithm through continuous iteration, so as to improve the robot's ability to sort scenes of scattered parts.
The technical solution of the present application will be specifically described with reference to the following examples.
Embodiment 1
Referring to fig. 1, the present application discloses a pose estimation method for a target object, which includes steps S110-S130, which are described below.
Step S110, a scene image of the target object is acquired.
It should be noted that the target object in the present embodiment may be a product on an industrial production line, a mechanical part in an article box, a tool on an operation table, and the like, for example, an irregularly shaped mechanical part in the tool box of fig. 9. Then, the scene image of the target object may be acquired by an image capturing device such as a camera, a vision sensor, or the like.
It can be understood that the scene image of the target object refers to the two-dimensional and three-dimensional capture results of the target object in a simple or complex scene, such as RGB-D data (i.e., registered color and depth maps). The scene image contains not only the target object but also other objects and background. Such a visual image is an effective means of acquiring the scene in which the target object is located; a visual perception algorithm extracts task-relevant features such as object position, angle and posture so that the robot can perform the corresponding operations and complete the designated task. Of course, the scene image of the target object may also be only a two-dimensional image or only a three-dimensional image of the target object in the scene, but a single image weakens the expression of some object features and thus affects the accuracy of the subsequent pose estimation.
And step S120, processing the scene image according to a preset pose estimation model of the target object, and obtaining the category information and the pose information of the target object through pose estimation.
For industrial robot sorting, scene data is acquired with a vision sensor, but identifying the target object in the scene and estimating its position and posture is the core problem in calculating the grasping position and grasping path of an industrial robot. Pose estimation based on deep learning has become the common means of implementation; however, most existing mainstream deep-learning pose estimation algorithms rely on information such as the color and texture of the object surface, identify low-texture and reflective industrial parts poorly, and therefore hinder efficient automatic sorting of parts. In this embodiment, a real scene is simulated by combining a two-dimensional image with depth information, a pose estimation network and a multi-modal data fusion network based on deep learning are established, and a corresponding data set is built so that an object pose estimation algorithm (i.e., a pose estimation model) is obtained through continuous iteration, thereby improving the robot's ability to sort scenes of scattered parts. Then, when the constructed pose estimation model performs pose estimation on the scene image, the target object can be identified from the scene image and its pose estimated individually, and finally the category information and pose information of the target object are given.
And step S130, outputting the category information and the pose information of the target object. The category information of the target object reflects attributes of the target object, such as bottles, fruits, tables, toolboxes, and the like, which are related to the training samples and the label attributes of the pose estimation model, and the object category can be output as long as the model is trained. In addition, the pose information of the target object reflects the spatial pose of the target object, such as the pose position, the appearance shape, the pose orientation, and the like, which are related to the estimation algorithm of the pose estimation model, and the pose estimation model in the application can accurately estimate the spatial pose of the target object.
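Steps S110 to S130 can be summarized by the following hypothetical inference sketch; the camera input format, the model interface (fuse/predict) and the output dictionary are assumptions used only to show the flow, not the patent's actual implementation.

```python
import numpy as np

def estimate_object_pose(rgb: np.ndarray, depth: np.ndarray, model):
    """S110: the scene image is an RGB-D pair; S120: run the pose estimation model;
    S130: output category information and pose information of the target object."""
    fused = model.fuse(rgb, depth)                       # dense fusion of image and point cloud data
    categories, scores, poses, confidences = model.predict(fused)
    best = int(np.argmax(confidences))                   # pose with the highest confidence
    return {
        "category": categories[int(np.argmax(scores))],  # class with the highest score
        "rotation": poses[best][0],                      # rotation matrix R of the best pose
        "translation": poses[best][1],                   # translation vector t of the best pose
    }
```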
The related pose estimation model is obtained by a network model through sample training and learning, and then the pose estimation model has the same network structure as the network model. In a specific embodiment, referring to fig. 2, the network structure specifically includes a backbone node layer 100 and a head node layer 200, wherein the head node layer 200 may include a classification layer 201 and a regression layer 202.
The backbone node layer 100 is configured to construct high-level semantic information of the target object according to the scene image of the target object; the high-level semantic information here includes coordinates and feature vectors of feature points of the surface of the target object. The head node layer 200 is configured to process high-level semantic information output by the backbone node layer, and analyze the high-level semantic information to obtain a category and a pose of the target object.
Among them, the classification layer 201 in the head node layer 200 is used to process the high-level semantic information to determine the category and the score of the target object, and the regression layer 202 in the head node layer 200 is used to process the high-level semantic information to predict the pose and the confidence of the target object. Both the classification layer 201 and the regression layer 202 may adopt the network structure mode of full connection + ReLU in fig. 3, for example, the classification layer 201 uses one layer of full connection + ReLU, the regression layer 202 adopts two layers of full connection + ReLU, one path of high-level semantic information may be calculated to obtain the category and score of the target object after passing through one layer of full connection + ReLU, and the other path of high-level semantic information may be calculated to obtain the pose and confidence of the target object after passing through the other two layers of full connection + ReLU.
When the high-level semantic information is processed by the classification layer 201, the probability of the target object belonging to which category is determined and the score corresponding to the category is used to express the probability numerically, and the higher the score is, the higher the probability of belonging to a certain category is, and it is preferable to select the category corresponding to the maximum score as the final category determination result. Similarly, when the regression layer 202 is used to process the high-level semantic information, the possibility of which pose the target object is in is determined and is expressed numerically by the confidence of the predicted pose, the higher the confidence is, the higher the possibility of being in a certain pose is, and the predicted pose corresponding to the maximum confidence is preferably selected as the final pose determination result.
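The head structure described above (one "fully connected + ReLU" stage for classification, two for regression) could be organized roughly as in the following PyTorch sketch; the layer widths, the feature dimension, the axis-angle pose parameterization and the sigmoid applied to the raw confidence are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class HeadNodeLayer(nn.Module):
    """Classification layer (one FC + ReLU stage) and regression layer (two FC + ReLU stages)."""
    def __init__(self, feat_dim: int = 1408, n_classes: int = 21):
        super().__init__()
        self.classifier = nn.Sequential(            # category and score per feature point
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, n_classes + 1))          # n object classes + 1 background class
        self.regressor = nn.Sequential(             # pose (axis-angle + translation) and confidence
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 3 + 3 + 1))              # rotation (3), translation (3), confidence (1)

    def forward(self, semantic):                    # semantic: (N, feat_dim) per-feature-point vectors
        scores = self.classifier(semantic)          # (N, n_classes + 1) class scores
        reg = self.regressor(semantic)
        rot, trans = reg[:, :3], reg[:, 3:6]        # equivalent axis-angle vector and translation
        conf = torch.sigmoid(reg[:, 6:])            # confidence of each predicted pose (normalization assumed)
        return scores, rot, trans, conf
```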
In the present embodiment, the main roles of the regression layer 202 are pose prediction and confidence representation for the target object, and its operating principle can be represented by fig. 4. High-level semantic information (including the feature vectors and feature point coordinates of the target object) is input into the regression layer, which outputs a rotation matrix R_i, a translation vector t_i and a confidence s_i. The rotation matrix R_i and translation vector t_i represent the rotation and translation of the current spatial pose of the target object in the camera coordinate system and can serve as the i-th predicted pose of the target object; the confidence s_i characterizes the likelihood of the i-th predicted pose.
Referring to fig. 3, for the classification layer, the feature vectors and feature point coordinates of the high-level semantic information undergo a "fully connected + ReLU" operation to obtain the categories of the target object and the score corresponding to each category; if the target object can be classified into n + 1 classes (n object classes plus 1 background class), n + 1 scores are obtained after the fully connected processing of the classification layer.
Referring to fig. 3 and 4, for the regression layer, the feature vectors and feature point coordinates of the high-level semantic information undergo two "fully connected + ReLU" operations to obtain the corresponding position estimate x̂. That is, the input feature vector p (whose subscripts denote the x, y and z components) is processed by a series of fully connected layers and ReLU activation functions to obtain the estimated coordinates x̂, and the pose is finally estimated through the transformation relation R, t of the loss function. There are two transformation relations: one predicts the estimate (i.e., the confidence) of the corresponding position from the feature vector, and the other solves for the corresponding transformation relation. The predicted estimates are carried by the weights of the neural network, and the transformation relations are obtained by regression of the variables of the loss function. In addition, the regression layer analyzes the input feature point coordinates and feature vectors and outputs a rotation matrix R and a translation vector t, where the rotation matrix is expressed as an equivalent axis-angle vector whose direction indicates the rotation axis and whose norm is the rotation angle.
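Because the rotation is expressed as an equivalent axis-angle vector (direction = rotation axis, norm = rotation angle), it can be converted into a rotation matrix with the Rodrigues formula; a small NumPy sketch is shown below for illustration (this conversion is standard practice and is an assumption, not a procedure quoted from the patent).

```python
import numpy as np

def axis_angle_to_matrix(r: np.ndarray) -> np.ndarray:
    """Convert an equivalent axis-angle vector r (norm = angle, direction = axis)
    into a 3x3 rotation matrix via the Rodrigues formula."""
    theta = np.linalg.norm(r)
    if theta < 1e-8:
        return np.eye(3)                       # no rotation
    k = r / theta                              # unit rotation axis
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])           # skew-symmetric cross-product matrix
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
```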
It can be understood that the pose estimation model in this embodiment is a neural network, and its functional components are related to the structure of the model itself and the training and learning process of the model, so for clear understanding of the construction process of the pose estimation model, a specific description will be made through a construction method, and a detailed description will be given below in the second embodiment.
Embodiment 2
Referring to fig. 5, the present embodiment discloses a method for constructing a pose estimation model, which includes steps S210-S230, which are described below.
Step S210, acquiring dense fusion data of the target object, where the dense fusion data is obtained by heterogeneously fusing two-dimensional image data and three-dimensional point cloud data of the target object. The two-dimensional image data and the three-dimensional point cloud data are heterogeneous data located in different feature spaces, so the two kinds of data can be processed separately by a heterogeneous network in order to preserve both structures at the same time, make full use of the respective advantages of object depth information and object image information, and accurately represent the surface feature points of the target object by means of the dense fusion data.
It should be noted that the target object may be a product on an industrial production line, a mechanical part in an article box, a tool on an operation table, or the like. Then, a two-dimensional image of the target object may be acquired by an optical image capturing component such as a camera, a video camera, or the like, and two-dimensional image data may be generated; also, the three-dimensional point cloud data of the target object, which may be a part of the appearance shape data of the surface of the target object, may be acquired by a scanning apparatus such as a contact or non-contact type (e.g., a laser scanning apparatus), or may be derived even by using three-dimensional drawing software. In one embodiment, in order to obtain two-dimensional image data and three-dimensional point cloud data of a target object, a 2D camera and a 3D camera are used in cooperation for image capture, the 2D camera can be used for acquiring a two-dimensional image of the target object, and the 3D camera can be used for acquiring a three-dimensional point cloud of the target object; of course, a 3D camera of RGB-D data may also be employed to capture three-dimensional and two-dimensional images of the target object, thereby forming three-dimensional point cloud data and two-dimensional image data from depth and color information.
And step S220, training a preset network model according to the dense fusion data, and learning to obtain a network weight parameter.
In a particular embodiment, referring to fig. 2 and 3, the network model may include a backbone node level 100 and a head node level 200, where the head node level 200 includes a classification level 201 and a regression level 202. And, the backbone node layer 100 is configured to construct high-level semantic information of the target object according to the dense fusion data, where the high-level semantic information includes coordinates and feature vectors of each feature point on the surface of the target object. A classification layer 201 in the head node layer 200 is used to process the high-level semantic information to determine the class and score of the target object, and a regression layer 202 in the head node layer 200 is used to process the high-level semantic information to predict the pose and confidence of the target object. For the specific functions of the classification layer 201 and the regression layer 202 in the network model, reference may be made to the related description in the first embodiment.
It should be noted that the dense fusion data is input to the network model as a training sample, and the network is optimized through iterative computation and a network back propagation algorithm, so as to obtain a network weight parameter through learning.
And step S230, configuring and forming a pose estimation model of the target object according to the network weight parameters. And configuring parameters of each layer of the network model into finally learned network weight parameters, and constructing and forming a pose estimation model.
In the present embodiment, the above step S210 mainly involves acquiring the dense fusion data of the target object, and how to obtain the dense fusion data by heterogeneous fusion according to the two-dimensional image data and the three-dimensional point cloud data of the target object will be described in detail below.
(1) Acquiring two-dimensional image data and three-dimensional point cloud data of the target object. For example, a 3D camera producing RGB-D data collects the relevant data and outputs a color map and a depth map at the same time; the RGB image and the depth image are registered, so the pixel points have a one-to-one correspondence. Of course, in some cases a 2D camera and a 3D camera can be used together: the 2D camera acquires a two-dimensional image of the target object and the 3D camera acquires a three-dimensional point cloud of the target object.
(2) Extracting semantic information of the two-dimensional image data to obtain color space characteristics, and extracting point cloud characteristics of the three-dimensional point cloud data to obtain geometric space characteristics.
The image corresponding to the two-dimensional image data often contains rich semantics, for example at three levels: low-level semantics (color, texture and the like of pixels), mid-level semantics (roughness, contrast, compactness and the like of image blocks) and high-level semantics (the category of the object contained in the image or image region, and the like). Extracting semantic information from the two-dimensional image data can therefore be regarded as semantic segmentation: the image is segmented using semantic information, and which semantics are chosen as segmentation targets can be decided according to actual needs, which both expands the receptive field and improves feature abstraction. When extracting semantic information from the two-dimensional image data, it is desirable to include both high-level and low-level semantics: high-level semantics have good abstraction ability and low-level semantics have good locality, so effectively combining them expresses the object well in the color space and yields the color space features. In a specific embodiment, to obtain the color space features, the region of interest of the image corresponding to the two-dimensional image data is first segmented to obtain mask data. Two-dimensional image segmentation divides an image into several specific regions with unique categories and extracts a target of interest, and is a key step from image processing to image analysis; it can be realized, for example, by binarization in image processing, or by instance segmentation or semantic segmentation based on deep learning, and the invention preferably adopts semantic segmentation based on deep learning. When a convolutional neural network is used for semantic segmentation, segmentation networks such as SegNet and FCN can be used. Then, according to the mask data, a region image is obtained by cropping the image corresponding to the two-dimensional image data, and each pixel point in the region image is mapped to the color space to obtain low-level semantic information and/or high-level semantic information. Because the mask data masks everything outside the region of interest of the image, only the region of interest needs to be processed: a bounding box is formed around it and the image inside the bounding box is cut out to form the region image. Next, the color space features are obtained from the low-level semantic information and/or the high-level semantic information; since high-level semantics have good abstraction ability and low-level semantics have good locality, it is preferable to combine the low-level and high-level semantic information to obtain the color space features.
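As a concrete illustration of the mask-and-crop step (the segmentation network producing the mask is assumed to exist and is not shown), the region image could be obtained roughly as follows:

```python
import numpy as np

def crop_region_by_mask(image: np.ndarray, mask: np.ndarray):
    """Cut out the bounding box of the region of interest given a binary mask
    produced by a semantic segmentation network (e.g. SegNet or FCN)."""
    ys, xs = np.nonzero(mask)                  # pixel coordinates inside the mask
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    region = image[y0:y1, x0:x1]               # cropped region image
    region_mask = mask[y0:y1, x0:x1]
    return region, region_mask, (y0, x0)       # offset kept so pixels can be mapped back
```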
The three-dimensional point cloud data is a set of vectors of the target object in a three-dimensional coordinate system, recorded as data points each containing three-dimensional coordinates. As three-dimensional point cloud data is applied ever more widely in many fields, feature extraction from point cloud data is a key technology in point cloud processing, is the basis of subsequent work such as region segmentation and surface reconstruction, and affects the application effect of the data. The point cloud features of three-dimensional point cloud data mainly refer to the three-dimensional coordinate vectors of the data points, and may specifically include 3D geometric features (such as radius, elevation difference, elevation standard deviation and point density) and 3D local shape features (such as linear features, planar features, scattering features, total variance, anisotropy, eigenvalues, normal vectors and curvatures), and may also include color information (such as R, G, B) and reflection intensity information. In this embodiment it is desirable to build a high-dimensional feature from the three-dimensional point cloud data, that is, to express the point cloud features through the fitting of a continuous function; and because the point cloud also has rotation characteristics, the network can recognize it correctly no matter in which coordinate system the point cloud is presented. To extract the point cloud features of the three-dimensional point cloud data, the extraction process may include: first, mapping the input three-dimensional point cloud data into a high-dimensional space (a stage that can be called the feature extraction layer); then, applying a symmetric operation such as maximum pooling or average pooling to the high-dimensional features to obtain the locally invariant feature information output by the feature extraction layer; and finally, combining the obtained features to obtain the final point cloud features. After the point cloud features are expressed in the geometric space, the geometric space features are obtained.
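The extraction process just described (map each point into a high-dimensional space, apply a symmetric pooling operation, combine the features) is sketched below as a simplified PointNet-style PyTorch module; the layer sizes and the choice of max pooling are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PointFeatureExtractor(nn.Module):
    """Map each 3D point to a high-dimensional feature, then pool with a
    symmetric (order-invariant) operation to obtain a locally invariant feature."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(              # shared per-point feature extraction layer
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, out_dim, 1), nn.ReLU())

    def forward(self, points):                 # points: (B, 3, N)
        per_point = self.mlp(points)           # (B, out_dim, N) high-dimensional point features
        global_feat = torch.max(per_point, dim=2).values   # symmetric max pooling over points
        return per_point, global_feat          # local features and pooled (invariant) feature
```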
In a specific embodiment, under the condition of obtaining mask data corresponding to two-dimensional image data, pixel points corresponding to the mask data can be mapped to a geometric space, and data matching is carried out on the pixel points and three-dimensional point cloud data in the geometric space to obtain point cloud mask data; for example, under the condition that the three-dimensional point cloud data is judged to be ordered point cloud, a pixel index of a pixel point corresponding to the mask data mapped to a geometric space is obtained, a data index of the three-dimensional point cloud data mapped to the geometric space is determined according to the pixel index, and the point cloud mask data is obtained from the three-dimensional point cloud data by using the data index; or for example, under the condition that the three-dimensional point cloud data is judged to be disordered point cloud, pixel points corresponding to the mask data are obtained and mapped to pixel coordinates of a geometric space, the pixel coordinates are subjected to numerical value conversion according to a preset camera internal reference matrix to obtain point cloud coordinates, and the point cloud mask data are obtained from the three-dimensional point cloud data by utilizing the point cloud coordinates. And then, carrying out feature transformation on the point cloud mask data to obtain local point cloud features and global point cloud features. The feature transformation is to generate a new set of high-dimensional feature vectors from input data through certain nonlinear transformation, so as to obtain point cloud features with different dimensional characteristics. And finally, obtaining the geometric space characteristics by combining the local point cloud characteristics and the global point cloud characteristics.
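For the unordered-point-cloud case, converting masked pixel coordinates into point cloud coordinates with the camera intrinsic matrix is a depth back-projection; a minimal NumPy sketch under pinhole-camera assumptions (fx, fy, cx, cy are the standard intrinsic parameters, not symbols from the patent) is:

```python
import numpy as np

def mask_pixels_to_points(depth: np.ndarray, mask: np.ndarray, K: np.ndarray):
    """Back-project masked pixels into camera-frame 3D points using the
    intrinsic matrix K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    v, u = np.nonzero(mask)                   # row (v) and column (u) pixel indices
    z = depth[v, u]                           # depth values at the masked pixels
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)        # (M, 3) point cloud mask data
```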
(3) Fusing the color space features and the geometric space features to obtain pixel-level fusion features. For example, the color feature of each pixel point in the color space features and the point cloud feature corresponding to that pixel point in the geometric space features are concatenated along the channel dimension to obtain the pixel-level fusion features.
(4) Pooling the pixel-level fusion features to obtain global features. For example, global average pooling or maximum pooling can be applied to the pixel-level fusion features through a convolutional neural network (CNN), and the global features corresponding to the pixel-level fusion features are obtained after this processing. For example, a copy of the pixel-level fusion features is fed into the CNN for information integration, and the global features are obtained by global average pooling or maximum pooling; average or maximum pooling addresses the unordered nature of the point cloud, because it is a symmetric function whose output does not depend on the order of the input variables.
Global features refer to the overall properties of an image, typically including color features, texture features, and shape features. The global feature is a pixel-level low-level visual feature, so that the global feature has the characteristics of good invariance, simplicity in calculation, intuition in representation and the like, but also has the characteristics of high feature dimension and large calculation amount. Compared with global features, the pixel-level fusion features have the characteristics of local image features, the number of the fusion features in an image is rich, the correlation degree between the features is small, and the detection of other features cannot be influenced due to the disappearance of partial features even under the shielding condition. In addition, the global feature is obtained by performing pooling processing through a convolutional neural network, the adaptability of the algorithm to the position, scale and other change features can be improved, the global feature and the pixel-level fusion feature are combined to obtain dense fusion data, the dense fusion data not only has the local feature information of the image and the local feature information of the point cloud, but also has the global feature information of the pixel-level fusion feature, and therefore the low-dimensional feature is mapped into the high-dimensional global feature information capable of reflecting object information through nonlinear transformation.
(5) Concatenating the pixel-level fusion features and the global features to obtain dense fusion data. The dense fusion data here is used for detecting the surface feature points of the target object. For example, the global features can be concatenated along the channel dimension behind the pixel-level fusion features to obtain a set of fusion features with context information; because these features are pixel-level, they are also referred to as dense fusion data.
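Steps (3) to (5) amount to concatenating per-pixel color features, per-pixel geometric features and a broadcast global feature along the channel dimension; a minimal PyTorch sketch of this assumed arrangement is:

```python
import torch

def dense_fuse(color_feat, geom_feat, global_feat):
    """color_feat, geom_feat: (B, C1, N) and (B, C2, N) per-pixel/per-point features;
    global_feat: (B, Cg) pooled global feature. Returns (B, C1+C2+Cg, N) dense fusion data."""
    pixel_fusion = torch.cat([color_feat, geom_feat], dim=1)          # pixel-level fusion feature
    n = pixel_fusion.shape[2]
    global_rep = global_feat.unsqueeze(2).expand(-1, -1, n)           # broadcast global feature to each pixel
    return torch.cat([pixel_fusion, global_rep], dim=1)               # dense fusion data with context
```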
The above describes the process of obtaining dense fusion data of the target object through heterogeneous fusion of its two-dimensional image data and three-dimensional point cloud data; once the dense fusion data is obtained, it can be used as training samples to train the network model.
It should be noted that each dense fusion datum may correspond to one predicted pose, i.e., the i-th rotation matrix R and translation vector t, so the final output is a set of predicted poses. In this embodiment, the optimal predicted pose can be selected in a self-supervised manner, with a confidence output for each predicted pose as the basis of judgment. It can be understood that the network model achieves this pose prediction and confidence calculation through the design of the loss function during training and optimization, so how to establish the loss function and how to train the model are described in detail below.
In this embodiment, referring to fig. 6, the step S220 mainly relates to a training and learning process of a network model, and may specifically include steps S221 to S223, which are respectively described as follows.
Step S221, high-level semantic information of the target object is constructed according to the dense fusion data, wherein the high-level semantic information comprises coordinates and feature vectors of feature points on the surface of the target object. In one embodiment, referring to fig. 2, the densely fused data is processed by using the backbone node layer 100 in the network model to construct high-level semantic information of the target object. The high-level semantic information refers to characteristic information which has good abstract capability and can reflect the surface texture, shape and contour characteristics of an object.
In step S222, the category and the score of the target object are judged according to the high-level semantic information. In one particular embodiment, referring to FIG. 2, the high level semantic information is processed by a classification layer 201 in the head node layer 200 to determine the class and score of the target object; for example, the probability of the target object belonging to which category is determined and is represented numerically by the score corresponding to the category, and the higher the score is, the higher the probability of belonging to a certain category is.
And step S223, predicting the pose and the confidence coefficient of the target object according to the high-level semantic information. In one particular embodiment, referring to FIG. 2, the high level semantic information is processed using a regression layer 202 in the head node layer 200 to predict the pose and confidence of the target object; for example, the probability of the target object in which pose is determined and the probability of the target object in which pose is determined is numerically represented by predicting the confidence of the pose, and the probability of the target object in a certain pose is higher as the confidence is higher.
For example, in fig. 7, the above step S223 mainly relates to a process of predicting the pose and the confidence of the target object, and may specifically include steps S310 to S340, which are respectively described as follows.
And S310, predicting one or more poses of the target object according to the coordinates of the feature points in the high-level semantic information. Since the shape of the target object is divided into a symmetric structure and an asymmetric structure, different loss functions need to be established according to situations to predict the pose of the target object, so step S310 may specifically include steps S311-S314, and the specific contents may refer to fig. 8.
Step S311, obtaining the coordinates and feature vectors of each feature point in the high-level semantic information. For example, the coordinates of each feature point in the high-level semantic information are obtained and expressed as x_j, and the feature vector of each feature point is obtained and expressed as p_v.
In step S312, it is determined whether the shape of the target object is an asymmetric structure, and if the shape of the target object is an asymmetric structure, the process proceeds to step S313, and if the shape of the target object is not an asymmetric structure (i.e., a symmetric structure), the process proceeds to step S314.
Step S313, when the shape of the target object is an asymmetric structure, establishing a first loss function for the i-th pose, expressed as
L_i = (1/M) Σ_{j=1}^{M} || (R·x_j + t) - (R̄·x_j + t̄) ||
where M is the number of selected feature points, j is the traversal index over M, R̄ and t̄ are the labeled rotation matrix and translation vector respectively, and R and t are the calculated rotation matrix and translation vector.
Then R and t can be obtained through iterative calculation when the total loss function converges, where the construction of the total loss function is explained in step S340 below. Referring to fig. 2 and 3, the high-level semantic information (including the feature vectors and feature point coordinates of the target object) is input into the regression layer, which outputs a rotation matrix R_i and a translation vector t_i during the iterative calculation; the rotation matrix R_i and translation vector t_i represent the rotation and translation of the current spatial pose of the target object in the camera coordinate system and therefore can serve as the i-th predicted pose of the target object. Of course, one or more predicted poses are obtained when the total loss function converges, and only one of them needs to be selected as the optimal pose.
It should be noted that the purpose of the first loss function is to minimize the mean Euclidean distance between the coordinates of the sampling points on the network model under the labeled pose and their coordinates under the predicted pose. A loss function of this form is established for each pose predicted from the dense fusion data.
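As a minimal illustration (a sketch, not the prescribed implementation of this disclosure), the first loss function can be written in Python/NumPy as follows; the argument names model_points, R_label, t_label, R_pred and t_pred are hypothetical stand-ins for the M sampled feature-point coordinates $x_j$, the labeled rotation matrix and translation vector, and the calculated rotation matrix and translation vector.

```python
import numpy as np

def asymmetric_pose_loss(model_points, R_label, t_label, R_pred, t_pred):
    """Mean Euclidean distance between the M sampled feature points transformed
    by the labeled pose and the same points transformed by the predicted pose."""
    pts_label = model_points @ R_label.T + t_label   # points under the labeled pose
    pts_pred = model_points @ R_pred.T + t_pred      # points under the predicted pose
    return np.mean(np.linalg.norm(pts_label - pts_pred, axis=1))
```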
The first loss function applies to an asymmetric target object. If the object has a symmetric shape, several different poses can all serve as solutions satisfying the optimization problem; for symmetric objects such as spheres there are even infinitely many poses that solve it, so the first loss function becomes ambiguous, which is not conducive to training the network model.
Step S314, under the condition that the shape of the target object is a symmetrical structure, establishing a second loss function aiming at the ith pose and expressing the second loss function as
$$L_i = \frac{1}{M}\sum_{j=1}^{M}\min_{1\le k\le M}\left\|\left(\bar{R}x_k+\bar{t}\right)-\left(Rx_j+t\right)\right\|$$
wherein M is the number of selected feature points, j and k are traversal indices within the range of M, $\bar{R}$ and $\bar{t}$ are respectively the labeled rotation matrix and translation vector, and R and t are respectively the calculated rotation matrix and translation vector.
Then, R and t are obtained by iterative calculation, being taken at the point where the total loss function converges; the construction of the total loss function will be explained in step S340 below. Referring to fig. 2 and 3, the high-level semantic information (including the feature vectors and feature point coordinates of the target object) is input into the regression layer, and during the iterative computation the regression layer outputs a rotation matrix $R_i$, a translation vector $t_i$ and a confidence $s_i$, wherein the rotation matrix $R_i$ and the translation vector $t_i$ represent the rotation-translation relation of the current spatial pose of the target object in the camera coordinate system, and can therefore serve as the ith predicted pose of the target object. Of course, one or more predicted poses will be obtained when the total loss function converges, and only one of the predicted poses needs to be selected as the optimal pose.
It should be noted that the purpose of the second loss function is to find, for each sampling point of the network model under the predicted pose, the closest sampling point under the labeled pose, compute the distance between this pair of points, and then minimize the mean of these distances. During the optimization, the second loss function gradually brings all corresponding points of the model under the labeled pose and under the predicted pose into alignment, and finally converges to only one optimal value; this prevents a merely symmetric overlap of the object shape from becoming a feasible solution of the optimization target, since only a point-by-point fit can be a feasible solution.
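For comparison, a minimal NumPy sketch of the second (closest-point) loss function is given below under the same hypothetical argument names; it differs from the asymmetric case only in that each predicted-pose point is matched to its nearest labeled-pose point before averaging.

```python
import numpy as np

def symmetric_pose_loss(model_points, R_label, t_label, R_pred, t_pred):
    """For every sampled point under the predicted pose, take the distance to the
    closest sampled point under the labeled pose, then average these distances."""
    pts_label = model_points @ R_label.T + t_label   # (M, 3) labeled-pose points
    pts_pred = model_points @ R_pred.T + t_pred      # (M, 3) predicted-pose points
    # pairwise distances: dists[j, k] = || pts_pred[j] - pts_label[k] ||
    dists = np.linalg.norm(pts_pred[:, None, :] - pts_label[None, :, :], axis=2)
    return np.mean(dists.min(axis=1))                # nearest labeled point per predicted point
```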
And step S320, calculating the confidence of each predicted pose according to the feature vector of each feature point in the high-level semantic information. In one embodiment, if the feature vector of each feature point in the high-level semantic information is represented as $p_v$, a normalization function (i.e., the Softmax function) can be used to calculate the confidence of the ith pose. The normalization function utilized here is expressed as
$$s_i = \frac{\exp\left(p_i\right)}{\sum_{v=1}^{N}\exp\left(p_v\right)}$$
wherein $p_i$ corresponds to the ith feature point in the high-level semantic information, and the subscript v is the traversal index within the range of N. For the ith feature point, the confidence $s_i$ characterizes how likely the ith predicted pose is, while the rotation matrix $R_i$ and the translation vector $t_i$ represent the ith predicted pose of the target object.
And step S330, comparing the confidences of the poses, and determining the pose corresponding to the highest confidence as the optimal pose. It can be understood that the higher the confidence, the more likely the corresponding pose is; therefore, selecting the predicted pose corresponding to the maximum confidence as the final pose determination result ensures the accuracy of the prediction result.
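A short sketch of steps S320 and S330 together, assuming the per-point values $p_v$ are already available as a NumPy array; subtracting the maximum before exponentiating is only a numerical-stability detail, not part of the method as stated.

```python
import numpy as np

def select_best_pose(scores, poses):
    """scores: (N,) per-feature-point values p_v from the regression layer;
    poses: list of N candidate poses (R_i, t_i).
    Returns the pose with the highest Softmax confidence s_i and that confidence."""
    exp_s = np.exp(scores - scores.max())   # stabilized exponentials
    confidences = exp_s / exp_s.sum()       # s_i = exp(p_i) / sum_v exp(p_v)
    best = int(np.argmax(confidences))      # step S330: highest confidence wins
    return poses[best], confidences[best]
```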
In step S340, given the first loss function $L_i$ (or the second loss function) and the confidence $s_i$, a total loss function for the network model may be established and expressed as
$$L = L_{conf} + L_{loc}$$
$$L_{conf} = -\sum_{i\in Pos}\alpha_i^{p}\log\left(c_i^{p}\right) - \sum_{i\in Neg}\log\left(c_i^{0}\right)$$
$$L_{loc} = \frac{1}{N}\sum_{i=1}^{N}\left(L_i s_i - w\log s_i\right)$$
wherein $L_{conf}$ and $L_{loc}$ are respectively the loss function of the classification layer and the loss function of the regression layer; the superscript p is the index of the category and the superscript 0 denotes the background information; $\alpha_i^{p}$ are the weighting coefficients (the weighting coefficient corresponding to background information is assigned 0, and the weighting coefficient corresponding to foreground information is assigned 1); $c_i^{p}$ represents the score of the ith category; Pos represents the set of feature points outside the background, and Neg represents the set of background feature points; N is the number of feature points, $L_i$ is the pose loss function required to predict the ith pose, $s_i$ is the confidence of the ith pose, w is a weight coefficient, and log() is a logarithmic operation function.
It should be noted that the total loss function L obtained here is used for the iterative computation when training the network model: the dense fusion data may be input into the network model, the total loss function L is optimized through back propagation, and the network weight parameters are obtained when the total loss function L converges. After the network weight parameters of the network model are obtained, a pose estimation model for the target object may be configured.
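A deliberately simplified, self-contained PyTorch training loop in the same spirit is sketched below: fake data and a toy regression head stand in for the dense fusion data and the network model of this disclosure, and the loss is a reduced confidence-weighted form rather than the full total loss L.

```python
import torch
import torch.nn as nn

class ToyRegressionHead(nn.Module):
    """Simplified stand-in for the regression layer: maps a fused feature vector
    to 12 pose parameters (9 for R, 3 for t) plus one confidence value."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 12 + 1)

    def forward(self, x):
        out = self.fc(x)
        return out[:, :12], torch.sigmoid(out[:, 12])   # pose parameters, confidence

model = ToyRegressionHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

features = torch.randn(64, 32)        # fake fused features, only to make the loop runnable
target_pose = torch.randn(64, 12)     # fake labeled pose parameters

for epoch in range(100):
    pred_pose, conf = model(features)
    pose_err = (pred_pose - target_pose).norm(dim=1)                   # stand-in for L_i
    loss = (pose_err * conf - 0.015 * torch.log(conf + 1e-8)).mean()   # confidence-weighted loss
    optimizer.zero_grad()
    loss.backward()                   # back propagation of the (simplified) total loss
    optimizer.step()

torch.save(model.state_dict(), "pose_head.pth")   # learned network weight parameters
```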
Those skilled in the art can understand that the second embodiment of the present application introduces a pose estimation model construction method; after the pose estimation model is constructed, it can be used for the pose estimation of a target object. In this embodiment, a pose estimation method for a target object is introduced: only a scene image is required as input, and by performing pose estimation processing with the pose estimation model, both the category information of the target object (i.e., which type of object the target object most likely belongs to) and the pose information of the target object (i.e., which predicted pose the target object is most likely in) can be output.
Example III
Referring to fig. 9, the embodiment discloses an acquisition apparatus for a target object, which mainly includes a sensor 41, a processor 42, a controller 43 and a movement mechanism 44, which are described below.
The sensor 41 is used to capture a scene image of the target object; for the description of the scene image, reference may be made to the related contents in the first embodiment. The sensor 41 may be a vision sensor with an image capture function, such as a camera device or a laser scanning device. The target object may be a product on an industrial production line, a mechanical part in an article box, a tool on an operation table, and the like, and is not particularly limited.
The processor 42 is connected to the sensor 41. The processor 42 is configured to output the category information and the pose information of the target object by the pose estimation method disclosed in the first embodiment. For the pose estimation method used by the processor 42, see steps S110 to S130 in the first embodiment.
The controller 43 is connected to the sensor 41 and the processor 42. The controller 43 is configured to control the sensor 41 to capture a scene image of the target object, for example, parameters such as an image capture time, a capture interval, a capture position, and the like of the sensor 41 may be set. In addition, the controller 43 is further configured to control the motion mechanism 44 to capture the target object according to the category information and the pose information of the target object output by the processor 42. For example, in fig. 9, the controller 43 may output a motion command to the motion mechanism 44 based on the category information and the pose information output by the processor 42, so that the motion mechanism grasps the target object 45 in the toolbox.
In the present embodiment, the motion mechanism 44 is a robotic arm fitted with a gripper or a suction cup; the motion mechanism 44 is used to receive control commands sent by the controller 43 and to perform a grabbing or sucking operation on the target object within the movement range of the robotic arm. If the category information indicates that the target object is a cylinder, the motion mechanism 44 performs a grasping operation on the cylinder; if the category information indicates that the target object is a plane body or a sphere, the motion mechanism 44 performs a sucking operation on the plane body or the sphere; if the pose information indicates that the plane body is in an inclined state, the motion mechanism 44 rotates the suction cup to the inclined suction angle. It can be understood that such adaptive operation of the motion mechanism 44 can improve the performance of the tasks executed on the target object.
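The category- and pose-dependent behaviour just described could be expressed as a small dispatch routine such as the sketch below; the category names, the tilt threshold and the returned dictionary structure are hypothetical illustrations and not part of this disclosure.

```python
def choose_action(category, tilt_angle_deg=0.0):
    """Pick gripper vs. suction cup from the category information and adjust the
    suction angle when the pose information indicates an inclined plane body."""
    if category == "cylinder":
        return {"tool": "gripper", "action": "grasp"}
    if category in ("plane", "sphere"):
        action = {"tool": "suction_cup", "action": "suck"}
        if category == "plane" and abs(tilt_angle_deg) > 1.0:   # inclined plane body
            action["suction_angle_deg"] = tilt_angle_deg        # rotate the cup to the incline
        return action
    return {"tool": None, "action": "skip"}                     # unknown category: no operation
```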
In one implementation, referring to fig. 10, the processor 42 may include an image acquisition module 421 and a pose estimation module 422. The image acquisition module 421 is configured to acquire a scene image of the target object, for example by acquiring RGB-D data directly from the sensor 41, so as to obtain a scene image containing both color and depth information; the pose estimation module 422 is connected to the image acquisition module 421 and is configured to perform category determination and pose estimation on the target object, so as to obtain the category information and the pose information of the target object.
In a particular embodiment, referring to FIG. 10, the controller 43 includes an optimization module 431 and a control module 432. Wherein the optimization module 431 is connected to the pose estimation module 422 in the processor 42 for planning the motion route and the grasping/sucking position of the motion mechanism 44 according to the pose information of the target object relative to the sensor 41. The control module 432 is connected to the optimization module 431, and is configured to output a control instruction, on one hand, control the motion mechanism 44 to grasp/suck the target object according to the planned motion route and the grasping position, and on the other hand, the control module 432 further outputs a control instruction to control the sensor 41 to acquire an image of the target object.
Those skilled in the art can understand that the target object picking device disclosed in this embodiment can enable the controller to control the motion mechanism to accurately pick/suck the target object according to the pose information output by the processor, and can effectively improve the accuracy of picking/sucking while ensuring the execution efficiency, and enhance the practical performance of the device in the application process.
Example IV
Referring to fig. 11, the present embodiment discloses a pose estimation apparatus for a target object, which may include a memory 51 and a processor 52, which are respectively described below.
The memory 51 serves as a computer-readable storage medium for storing a program, which may be the program code corresponding to the pose estimation method steps S110 to S130 in the first embodiment.
Of course, the memory 51 may also store some network weight parameters and network training process data, as well as data such as two-dimensional image data, three-dimensional point cloud data, training sample data, category information, and pose information.
The processor 52 is connected to the memory 51 for executing the program stored in the memory 51 to implement the corresponding pose estimation method. In one particular embodiment, the functions performed by processor 52 may be summarized as: acquiring a scene image of a target object from a sensor; processing the scene image according to a pose estimation model of the target object formed by configuration, and obtaining the category information and the pose information of the target object through pose estimation; and outputting the category information and the pose information of the target object to a controller.
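A hypothetical outline of that sequence of functions is given below; the sensor, model and controller interfaces (read_rgbd, estimate, receive) are placeholder names, not APIs defined in this disclosure.

```python
def run_pose_estimation(sensor, model, controller):
    """Acquire a scene image, run the configured pose estimation model,
    and forward the result to the controller."""
    rgb, depth = sensor.read_rgbd()              # acquire a scene image from the sensor
    category, pose = model.estimate(rgb, depth)  # category and pose via pose estimation
    controller.receive(category, pose)           # output category information and pose information
    return category, pose
```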
Those skilled in the art will appreciate that all or part of the functions of the various methods in the above embodiments may be implemented by hardware, or may be implemented by computer programs. When all or part of the functions of the above embodiments are implemented by a computer program, the program may be stored in a computer-readable storage medium, and the storage medium may include: a read only memory, a random access memory, a magnetic disk, an optical disk, a hard disk, etc., and the program is executed by a computer to realize the above functions. For example, the program may be stored in a memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above may be implemented. In addition, when all or part of the functions in the above embodiments are implemented by a computer program, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and may be downloaded or copied to a memory of a local device, or may be version-updated in a system of the local device, and when the program in the memory is executed by a processor, all or part of the functions in the above embodiments may be implemented.
The present invention has been described in terms of specific examples, which are provided to aid understanding of the invention and are not intended to be limiting. For a person skilled in the art to which the invention pertains, several simple deductions, modifications or substitutions may be made according to the idea of the invention.

Claims (10)

1. A pose estimation model construction method is characterized by comprising the following steps:
acquiring dense fusion data of a target object; the dense fusion data is obtained by heterogeneously fusing two-dimensional image data and three-dimensional point cloud data of a target object;
training a preset network model according to the dense fusion data, and learning to obtain a network weight parameter;
and forming a pose estimation model of the target object according to the network weight parameter configuration.
2. The method of constructing of claim 1, wherein the network model includes a backbone node layer and a head node layer, the head node layer including a classification layer and a regression layer;
the backbone node layer is used for constructing high-level semantic information of the target object according to the dense fusion data; the high-level semantic information comprises coordinates and feature vectors of feature points on the surface of the target object;
a classification layer in the head node layers is used for processing the high-level semantic information to determine the category and the score of the target object, and a regression layer in the head node layers is used for processing the high-level semantic information to predict the pose and the confidence of the target object.
3. The construction method according to claim 2,
for a regression layer in the network model, predicting one or more poses of the target object according to coordinates of each feature point in the high-level semantic information, calculating confidence degrees of the predicted poses according to feature vectors of each feature point in the high-level semantic information, comparing the confidence degrees of the poses, and determining the pose corresponding to the highest confidence degree as the optimal pose;
establishing a total loss function for the network model and expressing as
$$L = L_{conf} + L_{loc}$$
$$L_{conf} = -\sum_{i\in Pos}\alpha_i^{p}\log\left(c_i^{p}\right) - \sum_{i\in Neg}\log\left(c_i^{0}\right)$$
$$L_{loc} = \frac{1}{N}\sum_{i=1}^{N}\left(L_i s_i - w\log s_i\right)$$
wherein $L_{conf}$ and $L_{loc}$ are respectively the loss function of the classification layer and the loss function of the regression layer, the superscript p is the index of the category, the superscript 0 is the background information, $\alpha_i^{p}$ is the weight coefficient, $c_i^{p}$ represents the score of the ith category, Pos represents the set of feature points outside the background, and Neg represents the set of background feature points; N is the number of feature points, $L_i$ is the pose loss function required to predict the ith pose, $s_i$ is the confidence of the ith pose, w is a weight coefficient, and log() is a logarithm operation function.
4. The construction method according to claim 3, wherein the predicting one or more poses of the target object according to the coordinates of each feature point in the high-level semantic information comprises:
acquiring coordinates of each feature point in the high-level semantic information and expressing the coordinates as $x_j$;
When the shape of the target object is judged to be an asymmetric structure, a first loss function aiming at the ith pose is established and expressed as
$$L_i = \frac{1}{M}\sum_{j=1}^{M}\left\|\left(\bar{R}x_j+\bar{t}\right)-\left(Rx_j+t\right)\right\|$$
wherein M is the number of selected feature points, j is the traversal index within the range of M, $\bar{R}$ and $\bar{t}$ are respectively the labeled rotation matrix and translation vector, and R and t are respectively the calculated rotation matrix and translation vector;
and obtaining R and t through iterative calculation when the total loss function converges, and using R and t as one or more poses of the target object.
5. The construction method according to claim 3, wherein the predicting one or more poses of the target object according to the coordinates of each feature point in the high-level semantic information comprises:
acquiring coordinates of each feature point in the high-level semantic information and expressing the coordinates as $x_j$;
When the shape of the target object is judged to be a symmetrical structure, a second loss function aiming at the ith pose is established and expressed as
$$L_i = \frac{1}{M}\sum_{j=1}^{M}\min_{1\le k\le M}\left\|\left(\bar{R}x_k+\bar{t}\right)-\left(Rx_j+t\right)\right\|$$
wherein M is the number of selected feature points, j and k are traversal indices within the range of M, $\bar{R}$ and $\bar{t}$ are respectively the labeled rotation matrix and translation vector, and R and t are respectively the calculated rotation matrix and translation vector;
and obtaining R and t through iterative calculation when the total loss function is converged, and taking the R and t as one or more poses of the target object.
6. The construction method according to claim 3, wherein the calculating the confidence of each predicted pose according to the feature vector of each feature point in the high-level semantic information comprises:
acquiring a feature vector of each feature point in the high-level semantic information and expressing the feature vector as $p_v$;
Calculating the confidence coefficient of the ith pose by using a normalization function; the normalization function is expressed as
$$s_i = \frac{\exp\left(p_i\right)}{\sum_{v=1}^{N}\exp\left(p_v\right)}$$
wherein $p_i$ corresponds to the ith feature point in the high-level semantic information, and the subscript v is the traversal index within the range of N.
7. The construction method according to any one of claims 3 to 6, wherein the training of the preset network model according to the dense fusion data and the learning of the network weight parameters comprise:
and inputting the dense fusion data into the network model, optimizing the total loss function through back propagation, and obtaining a network weight parameter when the total loss function is converged.
8. A pose estimation method of a target object, characterized by comprising:
acquiring a scene image of a target object;
processing the scene image according to the pose estimation model of the target object configured and formed in the claims 2-6, and obtaining the category information and the pose information of the target object through pose estimation;
and outputting the category information and the pose information of the target object.
9. An acquisition device for a target object, comprising:
the sensor is used for acquiring a scene image of a target object;
a processor connected to the sensor for processing the scene image by the pose estimation method of claim 8 to output the category information and pose information of the target object;
and the controller is connected with the sensor and the processor and is used for controlling the sensor to capture the image of the target object and controlling a motion mechanism to capture the target object according to the category information and the pose information of the target object.
10. A computer-readable storage medium characterized in that the medium has stored thereon a program executable by a processor to implement the building method according to any one of claims 1 to 7, and to implement the pose estimation method according to claim 8.
CN202110111623.6A 2021-01-27 2021-01-27 Model construction method, pose estimation method and object picking device Pending CN113034575A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110111623.6A CN113034575A (en) 2021-01-27 2021-01-27 Model construction method, pose estimation method and object picking device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110111623.6A CN113034575A (en) 2021-01-27 2021-01-27 Model construction method, pose estimation method and object picking device

Publications (1)

Publication Number Publication Date
CN113034575A true CN113034575A (en) 2021-06-25

Family

ID=76459595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110111623.6A Pending CN113034575A (en) 2021-01-27 2021-01-27 Model construction method, pose estimation method and object picking device

Country Status (1)

Country Link
CN (1) CN113034575A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609985A (en) * 2021-08-05 2021-11-05 诺亚机器人科技(上海)有限公司 Object pose detection method, detection device, robot and storage medium
CN113609985B (en) * 2021-08-05 2024-02-23 诺亚机器人科技(上海)有限公司 Object pose detection method, detection device, robot and storable medium
CN114919819A (en) * 2022-06-01 2022-08-19 中迪机器人(盐城)有限公司 Steel belt film pasting automatic control method and system
CN114919819B (en) * 2022-06-01 2023-06-06 中迪机器人(盐城)有限公司 Automatic control method and system for steel belt film sticking
CN117746381A (en) * 2023-12-12 2024-03-22 北京迁移科技有限公司 Pose estimation model configuration method and pose estimation method

Similar Documents

Publication Publication Date Title
CN112476434B (en) Visual 3D pick-and-place method and system based on cooperative robot
Schmidt et al. Grasping of unknown objects using deep convolutional neural networks based on depth images
CN112836734A (en) Heterogeneous data fusion method and device and storage medium
CN112297013B (en) Robot intelligent grabbing method based on digital twin and deep neural network
US11741701B2 (en) Autonomous task performance based on visual embeddings
CN113034575A (en) Model construction method, pose estimation method and object picking device
CN111695562B (en) Autonomous robot grabbing method based on convolutional neural network
Kokic et al. Learning to estimate pose and shape of hand-held objects from rgb images
CN111243017A (en) Intelligent robot grabbing method based on 3D vision
CN113192128A (en) Mechanical arm grabbing planning method and system combined with self-supervision learning
Zhuang et al. Instance segmentation based 6D pose estimation of industrial objects using point clouds for robotic bin-picking
Zillich et al. Knowing your limits-self-evaluation and prediction in object recognition
CN115816460A (en) Manipulator grabbing method based on deep learning target detection and image segmentation
Dong et al. A review of robotic grasp detection technology
CN116912238A (en) Weld joint pipeline identification method and system based on multidimensional identification network cascade fusion
US20220402125A1 (en) System and method for determining a grasping hand model
CN114998573A (en) Grabbing pose detection method based on RGB-D feature depth fusion
Dyrstad Training convolutional neural networks in virtual reality for grasp detection from 3D images
Fang et al. A pick-and-throw method for enhancing robotic sorting ability via deep reinforcement learning
Geng et al. A Novel Real-time Grasping Method Cobimbed with YOLO and GDFCN
Wu et al. A novel cable-grasping planner for manipulator based on the operation surface
Drögemüller et al. Automatic generation of realistic training data for learning parallel-jaw grasping from synthetic stereo images
Furukawa et al. Grasping position detection using template matching and differential evolution for bulk bolts
Zhang et al. Robotic grasp detection using effective graspable feature selection and precise classification
Varadarajan et al. Attention driven grasping for clearing a heap of objects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination