WO2022178952A1 - Target pose estimation method and system based on attention mechanism and Hough voting - Google Patents

Target pose estimation method and system based on attention mechanism and Hough voting

Info

Publication number
WO2022178952A1
WO2022178952A1 · PCT/CN2021/084690 · CN2021084690W
Authority
WO
WIPO (PCT)
Prior art keywords
network
target object
estimation
convolutional neural
target
Prior art date
Application number
PCT/CN2021/084690
Other languages
English (en)
French (fr)
Inventor
王耀南
刘学兵
朱青
袁小芳
毛建旭
冯明涛
周显恩
谭浩然
Original Assignee
湖南大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 湖南大学 filed Critical 湖南大学
Publication of WO2022178952A1 publication Critical patent/WO2022178952A1/zh


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Definitions

  • The invention relates to the fields of robot visual perception and computer vision, and in particular to a target pose estimation method and system based on an attention mechanism and Hough voting.
  • Object pose estimation refers to identifying known objects in the current scene from the camera's viewpoint and estimating their 3-axis orientation and 3-axis position in the camera's 3-dimensional coordinate system. More specifically, it refers to the rigid-body transformation matrix T that maps the object's 3-dimensional model from its own coordinate system to the camera coordinate system; T consists of a 3-dimensional rotation matrix R and a 3-dimensional translation vector t, which together constitute the 6-dimensional pose P of the object.
  • Object pose estimation is a key part of robot scene understanding. Computer vision techniques built on it have achieved a series of results in robot grasping, human-computer interaction and augmented reality, and have been widely applied. Because scenes are complex and poses vary over a large range, object pose estimation faces many challenges; the effects of background interference, cluttered stacking and occlusion, illumination differences and weak surface texture on pose estimation must be overcome.
  • Early object pose estimation methods mainly relied on template matching and feature point detection.
  • Template-matching methods first detect the target region, match the extracted image against the standard template images in a pose database, and select the pose of the most similar template as the result. Feature-point-based methods first compute image features such as SIFT, ORB and HOG in the input image, match them with the known feature points of the object image to establish 2D-3D correspondences, and finally solve the object pose with the PnP method.
  • In addition, when a depth image is available, the ICP method can be used to iteratively refine the object pose, or 3D point features can be used to establish more robust 2D-3D point correspondences and improve pose accuracy.
  • However, because templates or feature points must be computed manually for specific objects, these methods are not robust and the process is cumbersome; they are also easily affected by background clutter or occlusion, and their accuracy is low.
  • For object pose estimation, the main deep-learning approaches are: 1) use a convolutional neural network to extract image convolution features, then use a multi-layer perceptron network to fit the relationship between the features and the output pose, and output the 6-dimensional pose of the target object; 2) following the traditional 2D-3D correspondence idea, use a deep network to directly predict the 2-dimensional image coordinates of the object's 3-dimensional keypoints, and then solve the object pose with the PnP method; 3) use a Hough network to make point-wise pose or keypoint predictions, evaluate and refine them, and select the best parameters as the output.
  • For scenes where depth images are available, PointNet-like networks are generally used to learn 3-dimensional features from the extracted point cloud, which are then fused with the color image features for subsequent pose prediction.
  • Compared with earlier methods, deep-learning-based methods have greatly improved feature extraction capability, pose prediction accuracy and generalization; however, because deep networks are hard to interpret, how to use them to extract image features efficiently and predict poses accurately remains an active research direction in this field.
  • The invention provides a target pose estimation method and system based on an attention mechanism and Hough voting.
  • In view of the different ways in which the 3-dimensional rotation matrix and the 3-dimensional translation vector of a 6-dimensional pose constrain the color and depth images, different strategies are used to estimate the two sets of parameters. This allows the color and depth image features of the target object to be extracted efficiently and a more accurate pose parameter estimation model to be established, while avoiding the computational redundancy of large-scale neural networks; the structure is simple and the pose estimation accuracy is high.
  • the present invention provides the following technical solutions:
  • a target pose estimation method based on an attention mechanism and Hough voting includes the following steps:
  • Step S1: acquire a color image and a depth image of a scene containing multiple target objects;
  • Step S2: obtain the category and segmentation mask of each target object from the color image with a target segmentation method;
  • the target segmentation method may be any existing well-known segmentation method, such as the Mask RCNN instance segmentation network;
  • the object categories depend on the training dataset used; for example, the YCB dataset contains 21 everyday objects such as bottles, cans, cups and chairs;
  • Step S3: using the segmentation masks obtained in step S2, crop and splice the color image and the depth image, extract an image block for each target object, and normalize it;
  • color image blocks and depth image blocks corresponding to the target objects are cropped from the full color image and depth image and concatenated along the channel dimension, yielding 4-channel image blocks O (3 color channels plus 1 depth channel) for each target object, with o_j ∈ O, j = 1, 2, ..., k, where k is the number of target objects in the image; an illustrative sketch of this step is given below;
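By way of illustration only, the following sketch (not taken from the patent; function and variable names are assumptions) shows how a masked object can be cropped from aligned color and depth images and stacked into such a 4-channel block:

```python
import numpy as np

def extract_object_block(rgb, depth, mask):
    """Crop the masked object from aligned RGB (H,W,3) and depth (H,W) images
    and concatenate them into a 4-channel block (3 color + 1 depth)."""
    ys, xs = np.nonzero(mask)                              # pixels of the object
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    rgb_crop = rgb[y0:y1, x0:x1].astype(np.float32)
    depth_crop = depth[y0:y1, x0:x1].astype(np.float32)
    mask_crop = mask[y0:y1, x0:x1]
    # zero out background pixels so only the target object remains
    rgb_crop[~mask_crop] = 0
    depth_crop[~mask_crop] = 0
    return np.concatenate([rgb_crop, depth_crop[..., None]], axis=-1)  # (h, w, 4)
```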
  • Step S4: construct a rotation estimation network and a translation vector estimation network;
  • the rotation estimation network includes, connected in series, a feature extraction network based on bidirectional spatial attention, a feature concatenation network, a multi-scale pooling network and a multi-layer perceptron network; the bidirectional spatial attention feature extraction network comprises a ResNet34 convolutional neural network and two parallel branches, a spatial aggregation convolutional neural network and a spatial distribution convolutional neural network;
  • the translation vector estimation network consists of a PointNet++ network and a point-by-point Hough voting network connected in series;
  • Step S5: network training;
  • using the color images and depth images of different scenes in a known target pose estimation dataset, processed according to steps S1-S3, the normalized image blocks of each target object, the corresponding object point clouds and the corresponding rotation-matrix quaternions and 3-dimensional translation unit vectors are used to train the rotation estimation network and the translation vector estimation network respectively;
  • during training, the absolute angle error of the rotation matrix is used as the loss of the rotation estimation network, and the absolute angle error of the translation vector is used as the loss of the translation vector estimation network;
  • Step S6: after the target object image whose pose is to be estimated is processed according to steps S1 to S3, it is input into the rotation estimation network and the translation vector estimation network trained in step S5, and the 3-dimensional rotation matrix and the 3-dimensional translation vector are estimated respectively to achieve target pose estimation.
  • Rotation estimation normalization: for each target object image block O, normalize the color channel values from the range [0, 255] and the depth channel values from the range [near, far] to [-1, 1];
  • then, taking the minimum bounding rectangle of block O as the boundary and keeping the set aspect ratio, up-sample or down-sample each target object image block O, scale it to a fixed rectangular size, fill the blank area with 0, and obtain image blocks O_R with uniform width and height (a sketch of this step follows below).
  • 3D point cloud normalization: obtain the 3D point cloud of each target object from its image block O, normalize the color values from [0, 255] and the depth values from [near, far] to [-1, 1], remove the centroid from the 3-dimensional coordinates of the point cloud to obtain offset coordinates, and convert the offset coordinates into unit vectors to obtain normalized coordinates, so that the 3D point cloud data of all target objects lie in the same space;
  • near and far are the nearest and farthest values of the target object's depth image, respectively.
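A minimal sketch of the image-block normalization described above, assuming a fixed square output size of 128 (the actual size is not specified here); resizing uses a simple nearest-neighbour mapping to keep the example dependency-free:

```python
import numpy as np

def normalize_block(block, near, far, out_size=128):
    """Normalize a 4-channel (RGB + depth) object block to [-1, 1] and
    letterbox it into a fixed square canvas, padding blank area with zeros."""
    rgb = block[..., :3] / 255.0 * 2.0 - 1.0                     # [0,255] -> [-1,1]
    depth = (block[..., 3:] - near) / (far - near) * 2.0 - 1.0   # [near,far] -> [-1,1]
    norm = np.concatenate([rgb, depth], axis=-1)

    h, w = norm.shape[:2]
    scale = out_size / max(h, w)                  # keep the aspect ratio
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    yi = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)   # nearest-neighbour rows
    xi = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)   # nearest-neighbour cols
    resized = norm[yi][:, xi]

    canvas = np.zeros((out_size, out_size, 4), dtype=np.float32)  # blank area filled with 0
    canvas[:new_h, :new_w] = resized
    return canvas
```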
  • the spatial aggregation convolutional neural network takes the convolutional features obtained by the ResNet34 convolutional neural network as input; from the context distribution feature F_{d-c}: [(H×W)×(H×W), H, W] produced by the convolutional network, it extracts the global-to-local aggregation feature F_c: [H×W, H, W] corresponding to the feature constraint relationship between the H×W global points and the local H×W points, and outputs it;
  • the spatial distribution convolutional neural network likewise takes the ResNet34 convolutional features as input; from the context distribution feature F_{d-c}: [(H×W)×(H×W), H, W], it extracts the local-to-global distribution feature F_d: [H×W, H, W] corresponding to the feature constraint relationship between the H×W local points and the H×W global points, and outputs it.
  • the spatial distribution network obtains the feature constraint relationship between the H×W local points and the H×W global points, extracts the feature values of the corresponding points channel by channel according to their positions in feature space, arranges them according to the two-dimensional positions of the feature image, and generates the distribution feature F_d: [H×W, H, W];
  • each position in the feature image contains H×W values, representing the distribution constraint relationship between the H×W global points and that position (one possible reading of these two features is sketched below);
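One possible PyTorch-style reading of the bidirectional spatial attention features described above is sketched below; it is an interpretation rather than the patent's exact implementation, and the query/key construction of the (H×W)×(H×W) context map is an assumption:

```python
import torch
import torch.nn as nn

class BidirectionalSpatialAttention(nn.Module):
    """Sketch: pairwise context between all H*W positions of a feature map,
    read out in two directions (aggregation F_c and distribution F_d)."""
    def __init__(self, in_channels, key_channels=64):
        super().__init__()
        self.query = nn.Conv2d(in_channels, key_channels, 1)
        self.key = nn.Conv2d(in_channels, key_channels, 1)

    def forward(self, feat):                        # feat: (B, C, H, W) from ResNet34
        b, _, h, w = feat.shape
        q = self.query(feat).flatten(2)             # (B, K, H*W)
        k = self.key(feat).flatten(2)               # (B, K, H*W)
        # affinity[b, p, g]: how strongly local position p relates to global position g
        affinity = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # (B, H*W, H*W)
        # aggregation F_c: at each spatial position g, a length-H*W vector of
        # affinities received from all local positions (global -> local)
        f_c = affinity.reshape(b, h * w, h, w)
        # distribution F_d: at each spatial position p, a length-H*W vector of
        # its affinities toward all global positions (local -> global)
        f_d = affinity.transpose(1, 2).reshape(b, h * w, h, w)
        return f_c, f_d
```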
  • the rotation estimation network uses the ResNet34 convolutional neural network to obtain convolutional features, which are then fed into the spatial aggregation convolutional neural network and the spatial distribution convolutional neural network to extract the aggregation features and distribution features; the feature concatenation network concatenates the aggregation and distribution features, the multi-scale pooling network performs multi-scale pooling on the concatenated features to obtain the feature vector of the target object image, and finally the multi-layer perceptron network regresses the 3-dimensional rotation matrix of the target object from this feature vector;
  • the translation vector estimation network feeds the normalized 3D point cloud of the target object into the PointNet++ network to obtain point cloud features, and then uses a point-by-point Hough voting network in the form of a multi-layer perceptron to regress, point by point, the unit vector of the target object's 3D translation vector (a sketch of such a voting head follows below).
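A minimal PyTorch-style sketch of such a point-wise voting head (the PointNet++ backbone is treated as given and the feature dimension is an assumption): a shared MLP maps each point's feature to a 3-vector, which is normalized to a unit direction toward the translation center.

```python
import torch.nn as nn
import torch.nn.functional as F

class PointwiseHoughVotingHead(nn.Module):
    """Shared MLP that regresses, for every point, a unit vector toward the
    object's 3D translation center (a sketch; dimensions are assumptions)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(feat_dim, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 3, 1),
        )

    def forward(self, point_feats):           # (B, feat_dim, N) from PointNet++
        votes = self.mlp(point_feats)          # (B, 3, N) raw per-point directions
        return F.normalize(votes, dim=1)       # unit vectors, one per point
```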
  • the 3D point cloud coordinates and the predicted unit vectors of each target object are used to build the set of line equations on which the object's 3-dimensional translation vector lies; solving for the point in 3D space closest to this set of lines yields the 3-dimensional translation vector t of each target object (a closed-form sketch follows below).
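For illustration, the point closest in the least-squares sense to a set of 3D lines, each defined by a cloud point p_i and a predicted unit direction w_i, has a standard closed-form solution; the sketch below follows that standard derivation and is not code from the patent:

```python
import numpy as np

def closest_point_to_lines(points, dirs):
    """Least-squares point nearest to the lines x = p_i + s * w_i (w_i unit vectors).
    points: (N, 3) cloud points, dirs: (N, 3) predicted unit vote directions."""
    eye = np.eye(3)
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for p, w in zip(points, dirs):
        proj = eye - np.outer(w, w)   # projector onto the plane orthogonal to the line
        A += proj
        b += proj @ p
    return np.linalg.solve(A, b)      # candidate 3D translation vector t
```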
  • normalizing the three-dimensional point cloud of each target object specifically refers to the point cloud normalization described above;
  • training the rotation estimation network uses the rotation-estimation-normalized image blocks as its input data; the network outputs a rotation-matrix quaternion Q, which is normalized to unit length and then converted into a rotation matrix, and the absolute angle error between this rotation matrix and the ground-truth rotation matrix is used as the rotation loss;
  • training the translation vector estimation network takes the normalized 3D point cloud of image block O as input data and, as output data, the unit vectors by which the surface point cloud of the target object points toward the 3D translation vector; the angle error L_t is used as the translation vector loss, L_t is backpropagated, and the gradient descent method is used to train and update the parameters of the translation vector estimation network, where the ground-truth translation vector of the i-th pixel is used and m denotes the number of pixels in the target object image block (a sketch of both losses follows below).
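The exact loss formulas are rendered as images in the published application; the sketch below is therefore an interpretation of the described absolute-angle losses (quaternion output normalized to unit length, converted to a rotation matrix and compared with the ground truth by rotation angle; per-point unit vote vectors compared by angle):

```python
import torch
import torch.nn.functional as F

def quat_to_rotmat(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    q = F.normalize(q, dim=-1)
    w, x, y, z = q.unbind(-1)
    return torch.stack([
        1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y),
        2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x),
        2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y),
    ], dim=-1).reshape(q.shape[:-1] + (3, 3))

def rotation_angle_loss(q_pred, R_gt):
    """Absolute angle of the relative rotation between prediction and ground truth."""
    R_pred = quat_to_rotmat(q_pred)
    R_rel = R_pred @ R_gt.transpose(-1, -2)
    cos = (R_rel.diagonal(dim1=-2, dim2=-1).sum(-1) - 1.0) / 2.0
    return torch.arccos(cos.clamp(-1.0, 1.0)).mean()

def translation_angle_loss(v_pred, v_gt):
    """Mean angle between predicted and ground-truth per-point unit vectors."""
    cos = (F.normalize(v_pred, dim=-1) * F.normalize(v_gt, dim=-1)).sum(-1)
    return torch.arccos(cos.clamp(-1.0, 1.0)).mean()
```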
  • a target pose estimation system based on an attention mechanism and Hough voting includes:
  • Image acquisition module: uses an RGB-D camera to acquire color images and depth images of scenes containing multiple target objects;
  • Target segmentation module: segments the color image to obtain the category and segmentation mask of each target object;
  • Target extraction module: based on the segmentation masks of each object, crops and splices the color image and the depth image and extracts the image block of each target object;
  • Normalization module: normalizes the coordinates, color values and depth values of the 3D point cloud in each target object image block to obtain the 3D point cloud data of all target objects in the same space;
  • Pose estimation network construction module: builds a rotation estimation network and a translation vector estimation network;
  • the rotation estimation network includes, connected in series, a feature extraction network based on bidirectional spatial attention, a feature concatenation network, a multi-scale pooling network and a multi-layer perceptron network; the bidirectional spatial attention feature extraction network comprises a ResNet34 convolutional neural network and two parallel branches, a spatial aggregation convolutional neural network and a spatial distribution convolutional neural network;
  • the translation vector estimation network consists of a PointNet++ network and a point-by-point voting network connected in series;
  • Network training module: trains the pose estimation networks on a deep learning workstation;
  • using the color images and depth images of different scenes in a known target pose estimation dataset, the image acquisition module, target segmentation module, target extraction module and normalization module are called for processing; the resulting normalized target object image blocks, the corresponding object point clouds and the corresponding rotation-matrix quaternions and 3-dimensional translation unit vectors are used to train the rotation estimation network and the translation vector estimation network respectively; during training, the absolute angle error of the rotation matrix is used as the rotation estimation network loss, the absolute angle error of the translation vector is used as the translation vector estimation network loss, and the parameters are updated by gradient descent;
  • Pose estimation module: uses the trained rotation estimation network and translation vector estimation network to perform 3-dimensional rotation matrix estimation and 3-dimensional translation vector estimation on the target object image blocks whose pose is to be estimated, realizing target pose estimation.
  • the spatial aggregation convolutional neural network adopts a convolutional neural network architecture; it takes the convolutional features obtained by the ResNet34 convolutional neural network as input and, from the context distribution feature F_{d-c}: [(H×W)×(H×W), H, W] produced by the convolutional network, extracts the global-to-local aggregation feature F_c: [H×W, H, W] corresponding to the feature constraint relationship between the H×W global points and the local H×W points, which it outputs;
  • the spatial distribution convolutional neural network adopts a convolutional neural network architecture; it takes the ResNet34 convolutional features as input and, from the context distribution feature F_{d-c}: [(H×W)×(H×W), H, W], extracts the local-to-global distribution feature F_d: [H×W, H, W], which it outputs;
  • the translation vector estimation network includes a PointNet++ network and a point-by-point Hough voting network, the point-by-point Hough voting network adopting a multi-layer perceptron architecture;
  • the translation vector estimation network feeds the normalized 3D point cloud of the target object into the PointNet++ network to obtain point cloud features, and then uses the point-by-point Hough voting network based on a multi-layer perceptron to regress, point by point, the unit vector of the target object's 3-dimensional translation vector.
  • The invention provides a target pose estimation method and system based on an attention mechanism and Hough voting.
  • The method includes the following steps: acquire a color image and a depth image; segment and crop the color image to obtain color and depth image blocks of each target object; estimate the 6-dimensional pose of the target object with two strategies: for the 3-dimensional rotation matrix, a feature extraction network based on bidirectional spatial attention performs robust feature extraction using two-dimensional feature constraints on the target surface, and a multi-layer perceptron network then regresses the target's 3-dimensional rotation matrix; for the 3-dimensional translation vector, the point cloud of the target object is reconstructed and normalized, a Hough voting network estimates the point cloud's 3-dimensional translation direction vector point by point, and finally a set of lines through the translation center is built and the nearest point in space is solved to obtain the target's 3-dimensional translation vector.
  • The input and output data are in unit form: the 3-dimensional rotation matrix estimation network's input is normalized to color and image data in the [0, 1] space and its output rotation matrix is in unit quaternion form; the 3-dimensional translation vector estimation network's input is normalized to point cloud data in the [-1, 1] space and it outputs, point by point, unit direction vectors pointing toward the translation vector; this effectively avoids vanishing, exploding or unstable training gradients across data of different dimensions and scales, and accelerates network convergence.
  • FIG. 1 is a schematic diagram of the network structure of the target pose estimation method involved in the example of the present invention.
  • Figure 2 is a schematic diagram of the YCB dataset used to train and validate the pose estimation network proposed by the method of the present invention, where (a) is the RGB image of scene 1, (b) the Depth image corresponding to (a), (c) the Label image annotating the target objects of (a), (d) the RGB image of scene 2, (e) the Depth image corresponding to (d), (f) the Label image corresponding to (d), (g) the RGB image of scene 3, (h) the Depth image corresponding to (g), and (i) the Label image of the target objects corresponding to (g);
  • Figure 3 shows the RGB and Depth image blocks of 6 target objects obtained from the dataset shown in Figure 2, where (a) is the RGB image block of target object 1, (b) the Depth image block corresponding to (a), (c) the RGB image block of target object 2, (d) the Depth image block corresponding to (c), (e) the RGB image block of target object 3, (f) the Depth image block corresponding to (e), (g) the RGB image block of target object 4, (h) the Depth image block corresponding to (g), (i) the RGB image block of target object 5, (j) the Depth image block corresponding to (i), (k) the RGB image block of target object 6, and (l) the Depth image block corresponding to (k);
  • Figure 4 shows the target point clouds of the 6 target objects obtained from the dataset shown in Figure 2, where (a)-(f) are the target point clouds of target objects 1-6, respectively;
  • Figure 5 is a schematic diagram of the rotation matrix estimation network loss as the number of iterations increases;
  • Figure 6 is a schematic diagram of the translation vector estimation network loss as the number of iterations increases;
  • Figure 7 is a schematic diagram of the 6D pose test results, where (a)-(l) are 12 different test scenarios respectively.
  • the present invention provides a target pose estimation method based on attention mechanism and Hough voting.
  • the specific network structure is shown in Figure 1, including the following steps:
  • Step S1: obtain the color and depth images of the scene containing the target objects; as shown in Figure 2, the RGB image and Depth image captured by the RGB-D camera in each scene are shown, together with the annotated Label images of the target objects, for 3 scenes in total;
  • Step S2: obtain the category and segmentation mask of each object from the color image with an existing state-of-the-art target segmentation method;
  • Step S4: construct a rotation estimation network and a translation vector estimation network;
  • the rotation estimation network includes, connected in series, a feature extraction network based on bidirectional spatial attention, a feature concatenation network, a multi-scale pooling network and a multi-layer perceptron network; the bidirectional spatial attention feature extraction network comprises a ResNet34 convolutional neural network and two parallel branches, a spatial aggregation convolutional neural network and a spatial distribution convolutional neural network;
  • the translation vector estimation network consists of a PointNet++ network and a point-by-point Hough voting network connected in series;
  • Step S5: network training;
  • using the color images and depth images of different scenes in a known target pose estimation dataset, processed according to steps S1-S3, the normalized image blocks of each target object, the corresponding 3D point clouds of the target objects and the corresponding rotation-matrix quaternions and 3-dimensional translation unit vectors are used to train the rotation estimation network and the translation vector estimation network respectively;
  • during training, the absolute angle error of the rotation matrix is used as the rotation estimation network loss, and the absolute angle error of the translation vector is used as the translation vector estimation network loss;
  • Step S6: after the target object image whose pose is to be estimated is processed according to steps S1 to S3, it is input into the rotation estimation network and the translation vector estimation network trained in step S5, and the 3-dimensional rotation matrix and the 3-dimensional translation vector are estimated respectively to achieve target pose estimation.
  • In step S2, target segmentation takes the scene color image as input and outputs the segmentation mask of each known object.
  • Any existing state-of-the-art target segmentation method may be used for the specific implementation.
  • The present invention does not cover this part, but the accuracy of the segmentation result affects the accuracy of the final object pose estimation of the invention.
  • Object pose estimation is decomposed into two independent tasks, namely 3-dimensional rotation matrix estimation and 3-dimensional translation vector estimation;
  • Rotation estimation normalization: perform data normalization on the cropped target object image blocks O, normalizing the color channel values from the range [0, 255] and the depth channel values from [near, far] to [0, 1], where near and far are the nearest and farthest values of the target depth image, respectively;
  • Step S53: input the image block O_R into the rotation estimation network; the rotation estimation network uses the ResNet34 convolutional neural network to obtain convolutional features, which are then fed into the spatial aggregation convolutional neural network and the spatial distribution convolutional neural network to extract the aggregation features and distribution features; the feature concatenation network concatenates the aggregation and distribution features, the multi-scale pooling network performs multi-scale pooling on the concatenated features to obtain the feature vector F_A of the target object image; finally, the multi-layer perceptron network regresses the 3D rotation matrix of the target object from this feature vector;
  • the spatial aggregation convolutional neural network takes the convolutional features obtained by the ResNet34 convolutional neural network as input; from the context distribution feature F_{d-c}: [(H×W)×(H×W), H, W] produced by the convolutional network, it extracts the global-to-local aggregation feature F_c: [H×W, H, W] corresponding to the feature constraint relationship between the H×W global points and the local H×W points, and outputs it;
  • the spatial distribution convolutional neural network likewise takes the ResNet34 convolutional features as input; from the context distribution feature F_{d-c}: [(H×W)×(H×W), H, W], it extracts the local-to-global distribution feature F_d: [H×W, H, W] corresponding to the feature constraint relationship between the H×W local points and the H×W global points, and outputs it.
  • the spatial distribution network obtains the feature constraint relationship between the H×W local points and the H×W global points, extracts the feature values of the corresponding points channel by channel according to their positions in feature space, arranges them according to the two-dimensional positions of the feature image, and generates the distribution feature F_d: [H×W, H, W];
  • each position in the feature image contains H×W values, representing the distribution constraint relationship between the H×W global points and that position;
  • the translation vector estimation network feeds the normalized 3D point cloud of the target object into the PointNet++ network to obtain point cloud features, and then uses the point-by-point Hough voting network based on a multi-layer perceptron to regress, point by point, the unit vector of the 3D translation vector of the target object.
  • the rotation estimation network is trained by using the rotation-estimation-normalized image blocks as its input data; the network outputs a rotation-matrix quaternion Q, which is normalized to unit length, converted into a rotation matrix and compared with the ground-truth rotation by absolute angle error;
  • the translation vector estimation network is trained with the normalized 3D point cloud of image block O as input data and, as output data, the unit vectors by which each surface point of the target object points toward the 3D translation vector; the angle error L_t is used as the translation vector loss, L_t is backpropagated, and the gradient descent method is used to train and update the parameters of the translation vector estimation network, where the ground-truth translation vector of the i-th pixel is used and m denotes the number of pixels in the target object image block.
  • a minimum value is set. When the loss value is less than this value, the training will be stopped to achieve the optimal effect. The selection of the minimum value will be continuously adjusted according to the results of the actual simulation experiment;
  • the three-dimensional rotation estimation network and the three-dimensional translation vector estimation network are independent of each other; their training does not interfere and can be completed in parallel.
  • each test result shows the 3D bounding boxes of the known objects in the scene.
  • the bounding boxes are computed from the known object models, the camera intrinsics and the 6D poses of the objects in the scene.
  • the 6D poses include the annotated ground truth and the estimates computed by the network; the degree of agreement between the two reflects the computational accuracy of the network and further verifies the effectiveness and accuracy of the method described in this scheme.
  • an embodiment of the present invention also provides a target pose estimation system based on an attention mechanism and Hough voting, which is characterized by comprising:
  • Image acquisition module: uses an RGB-D camera to acquire color images and depth images of scenes containing multiple target objects;
  • the RGB-D camera may be an Azure Kinect DK camera;
  • Target segmentation module: segments the color image to obtain the category and segmentation mask of each target object;
  • Target extraction module: based on the segmentation masks of each object, crops and splices the color image and the depth image and extracts the image block of each target object;
  • Normalization module: normalizes the coordinates, color values and depth values of the 3D point cloud in each target object image block to obtain the 3D point cloud data of all target objects in the same space;
  • Pose estimation network construction module: builds a rotation estimation network and a translation vector estimation network;
  • the rotation estimation network includes, connected in series, a feature extraction network based on bidirectional spatial attention, a feature concatenation network, a multi-scale pooling network and a multi-layer perceptron network; the bidirectional spatial attention feature extraction network comprises a ResNet34 convolutional neural network and two parallel branches, a spatial aggregation convolutional neural network and a spatial distribution convolutional neural network;
  • the translation vector estimation network consists of a PointNet++ network and a point-by-point voting network connected in series;
  • Network training module: trains the pose estimation networks on a deep learning workstation;
  • using the color images and depth images of different scenes in a known target pose estimation dataset, the image acquisition module, target segmentation module, target extraction module and normalization module are called for processing; the resulting normalized target object image blocks, the corresponding object point clouds and the corresponding rotation-matrix quaternions and 3-dimensional translation unit vectors are used to train the rotation estimation network and the translation vector estimation network respectively; during training, the absolute angle error of the rotation matrix is used as the rotation estimation network loss, the absolute angle error of the translation vector is used as the translation vector estimation network loss, and the parameters are updated by gradient descent;
  • Pose estimation module: uses the trained rotation estimation network and translation vector estimation network to perform 3-dimensional rotation matrix estimation and 3-dimensional translation vector estimation on the target object image blocks whose pose is to be estimated, realizing target pose estimation.
  • each unit module may exist physically on its own, or two or more unit modules may be integrated into one unit module, and the modules may be implemented in hardware or software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target pose estimation method and system based on an attention mechanism and Hough voting. The method includes the following steps: acquire a color image and a depth image; segment and crop the color image to obtain color and depth image blocks of each target object; estimate the 6-dimensional pose of the target object with two strategies: for the 3-dimensional rotation matrix, a feature extraction network based on bidirectional spatial attention uses two-dimensional feature constraints on the target surface for robust feature extraction, and a multi-layer perceptron network then regresses the target's 3-dimensional rotation matrix; for the 3-dimensional translation vector, the point cloud of the target object is reconstructed and normalized, a Hough voting network estimates the point cloud's 3-dimensional translation direction vector point by point, and finally a set of lines through the translation center is built and the nearest point in space is solved to obtain the target's 3-dimensional translation vector. The method of the invention estimates the rotation matrix and the translation vector separately, and is fast and accurate.

Description

Target pose estimation method and system based on attention mechanism and Hough voting
Technical Field
The present invention relates to the fields of robot visual perception and computer vision, and in particular to a target pose estimation method and system based on an attention mechanism and Hough voting.
Background Art
Object pose estimation refers to identifying known objects in the current scene from the camera's viewpoint and estimating their 3-axis orientation and 3-axis position in the camera's 3-dimensional coordinate system. More specifically, it refers to the rigid-body transformation matrix T that maps the object's 3-dimensional model from its own coordinate system to the camera coordinate system; T consists of a 3-dimensional rotation matrix R and a 3-dimensional translation vector t, which together constitute the object's 6-dimensional pose P. Object pose estimation is a key part of robot scene understanding; computer vision techniques built on it have achieved a series of results in robot grasping, human-computer interaction and augmented reality and have been widely applied. Because scenes are complex and poses vary over a large range, object pose estimation faces many challenges, and the effects of background interference, cluttered stacking and occlusion, illumination differences and weak surface textures on pose estimation must be overcome.
Early object pose estimation methods mainly relied on template matching and feature point detection. Template-matching methods first detect the target region, match the extracted image against the standard template images in a pose database, and select the pose of the most similar template as the result. Feature-point methods first compute image features such as SIFT, ORB and HOG in the input image, match them with the known feature points of the object image to establish 2D-3D correspondences, and finally solve the object pose with the PnP method. In addition, when a depth image is available, the ICP method can be used to iteratively refine the object pose, or 3D point features can be used to establish more robust 2D-3D point correspondences and improve pose accuracy. However, because templates or feature points must be computed manually for specific objects, these methods are not robust and the process is cumbersome; they are also easily affected by background or occlusion, and their accuracy is low.
Nowadays, deep-learning-based computer vision methods extract features directly from raw images, learn feature descriptions autonomously from massive data samples and fit the processing results; they are simple in process, highly robust, generalize well, and have become the mainstream. For object pose estimation, the main approaches are: 1) use a convolutional neural network to extract image convolution features, then use a multi-layer perceptron network to fit the relationship between the features and the output pose, and output the 6-dimensional pose of the target object; 2) following the traditional 2D-3D correspondence idea, use a deep network to directly predict the 2-dimensional image coordinates of the object's 3-dimensional keypoints, and then solve the object pose with the PnP method; 3) use a Hough network to make point-wise pose or keypoint predictions, evaluate and refine them, and select the best parameters as the output. For scenes where depth images are available, a PointNet-like network is generally used to learn 3-dimensional features from the extracted point cloud, which are then fused with the color image features for subsequent pose prediction. Compared with earlier pose estimation methods, deep-learning-based methods have greatly improved feature extraction capability, pose prediction accuracy and generalization; however, because deep networks are hard to interpret, how to use them to extract image features efficiently and predict poses accurately remains an active research direction in this field.
Summary of the Invention
The present invention provides a target pose estimation method and system based on an attention mechanism and Hough voting. In view of the different ways in which the 3-dimensional rotation matrix and the 3-dimensional translation vector of a 6-dimensional pose constrain the color and depth images, different strategies are used to estimate the two sets of parameters. The color and depth image features of the target object can thus be extracted efficiently and a more accurate pose parameter estimation model established, while the computational redundancy of large-scale neural networks is avoided; the structure is simple and the pose estimation accuracy is high.
To achieve the above objective, the present invention provides the following technical solutions:
In one aspect, a target pose estimation method based on an attention mechanism and Hough voting includes the following steps:
Step S1: acquire a color image and a depth image of a scene containing multiple target objects;
Step S2: obtain the category and segmentation mask of each target object from the color image with a target segmentation method;
the target segmentation method may be any existing well-known segmentation method, such as the Mask RCNN instance segmentation network;
the object categories depend on the training dataset used; for example, the YCB dataset contains 21 everyday objects such as bottles, cans, cups and chairs;
Step S3: using the segmentation masks obtained in step S2, crop and splice the color image and the depth image, extract an image block for each target object, and normalize it;
color image blocks and depth image blocks corresponding to the target objects are cropped from the full color image and depth image and concatenated along the channel dimension, yielding 4-channel image blocks O (3 color channels plus 1 depth channel) for each target object, with o_j ∈ O, j = 1, 2, ..., k, where k is the number of target objects in the image;
Step S4: construct a rotation estimation network and a translation vector estimation network;
the rotation estimation network includes, connected in series, a feature extraction network based on bidirectional spatial attention, a feature concatenation network, a multi-scale pooling network and a multi-layer perceptron network; the bidirectional spatial attention feature extraction network comprises a ResNet34 convolutional neural network and two parallel branches, a spatial aggregation convolutional neural network and a spatial distribution convolutional neural network;
the translation vector estimation network consists of a PointNet++ network and a point-by-point Hough voting network connected in series;
Step S5: network training;
using the color images and depth images of different scenes in a known target pose estimation dataset, processed according to steps S1-S3, the normalized image blocks of each target object, the corresponding object point clouds and the corresponding rotation-matrix quaternions and 3-dimensional translation unit vectors are used to train the rotation estimation network and the translation vector estimation network respectively; during training, the absolute angle error of the rotation matrix is used as the rotation estimation network loss, and the absolute angle error of the translation vector is used as the translation vector estimation network loss;
Step S6: after the target object image whose pose is to be estimated is processed according to steps S1-S3, it is input into the rotation estimation network and the translation vector estimation network trained in step S5, and the 3-dimensional rotation matrix and the 3-dimensional translation vector are estimated respectively to achieve target pose estimation.
Further, the specific process of normalizing each target object image block is as follows:
Rotation estimation normalization: for each target object image block O, normalize the color channel values from the range [0, 255] and the depth channel values from the range [near, far] to [-1, 1]; then, taking the minimum bounding rectangle of block O as the boundary and keeping the set aspect ratio, up-sample or down-sample each target object image block O, scale it to a fixed rectangular size with the blank area filled with 0, and obtain target object image blocks O_R with uniform width and height;
3D point cloud normalization: obtain the 3D point cloud of each target object from its image block O, normalize the color values from [0, 255] and the depth values from [near, far] to [-1, 1], remove the centroid from the 3-dimensional coordinates of the point cloud to obtain offset coordinates, and convert the offset coordinates into unit vectors to obtain normalized coordinates, so that the 3D point cloud data of all target objects lie in the same space;
where near and far are the nearest and farthest values of the target object's depth image, respectively.
Further, the spatial aggregation convolutional neural network takes the convolutional features obtained by the ResNet34 convolutional neural network as input; from the context distribution feature F_{d-c}: [(H×W)×(H×W), H, W] produced by the convolutional network, it extracts the global-to-local aggregation feature F_c: [H×W, H, W] corresponding to the feature constraint relationship between the H×W global points and the local H×W points, and outputs it;
the spatial distribution convolutional neural network likewise takes the ResNet34 convolutional features as input; from the context distribution feature F_{d-c}: [(H×W)×(H×W), H, W], it extracts the local-to-global distribution feature F_d: [H×W, H, W] corresponding to the feature constraint relationship between the H×W local points and the H×W global points, and outputs it.
The spatial distribution network obtains the feature constraint relationship between the H×W local points and the H×W global points, extracts the feature values of the corresponding points channel by channel according to their positions in feature space, arranges them according to the two-dimensional positions of the feature image, and generates the distribution feature F_d: [H×W, H, W]; each position in the feature image contains H×W values, representing the distribution constraint relationship between the H×W global points and that position;
the rotation estimation network uses the ResNet34 convolutional neural network to obtain convolutional features, which are fed into the spatial aggregation convolutional neural network and the spatial distribution convolutional neural network to extract the aggregation features and distribution features; the feature concatenation network concatenates the aggregation and distribution features, the multi-scale pooling network then performs multi-scale pooling on the concatenated features to obtain the feature vector of the target object image, and finally the multi-layer perceptron network regresses the 3-dimensional rotation matrix of the target object from this feature vector;
Further, the translation vector estimation network feeds the normalized 3D point cloud of the target object into the PointNet++ network to obtain point cloud features, and then uses the point-by-point Hough voting network based on a multi-layer perceptron to regress, point by point, the unit vector of the target object's 3-dimensional translation vector.
Further, the 3D point cloud coordinates and unit vectors of each target object are used to build the set of line equations on which the object's 3-dimensional translation vector lies; solving for the point in 3D space closest to this set of lines yields the 3-dimensional translation vector t of each target object.
Further, normalizing the 3D point cloud of each target object specifically refers to:
First, the 3D point cloud V of a target object is obtained from its image block O using the camera intrinsics and the pinhole imaging model, V = (X, Y, Z, I);
where:
Figure PCTCN2021084690-appb-000001
f_x and f_y are the equivalent focal lengths and, together with the image coordinate offsets c_x and c_y, constitute the camera intrinsics K; u_i and v_i are the horizontal and vertical coordinates of pixel i of image block O in the original input image, I = (R, G, B) is the color value, D(u_i, v_i) is the depth value of pixel i in image block O, i = 1, 2, ..., m, and m is the number of pixels in the target object image block;
Next, the 3D centroid G of the point cloud V is computed:
Figure PCTCN2021084690-appb-000002
The point cloud V is then normalized: each channel of the color value I is normalized from [0, 255] to [-1, 1]; the centroid is first removed from the 3D coordinates to obtain the offset coordinates ΔS(ΔX, ΔY, ΔZ) = (X - G_x, Y - G_y, Z - G_z), and ΔS is then unit-vectorized, norm(ΔX, ΔY, ΔZ), to obtain the normalized vector
Figure PCTCN2021084690-appb-000003
Combining with the color values gives the normalized 3D point cloud (a sketch of this computation follows below)
Figure PCTCN2021084690-appb-000004
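A sketch of this back-projection and point cloud normalization under the stated pinhole model; the intrinsics layout is the standard one and the variable names are assumptions:

```python
import numpy as np

def build_normalized_cloud(colors, depths, us, vs, fx, fy, cx, cy):
    """Back-project an object image block into a 3D point cloud and normalize it.
    colors: (m, 3) RGB values in [0, 255]; depths: (m,) depth values D(u_i, v_i);
    us, vs: (m,) pixel coordinates of the block's pixels in the original image."""
    Z = depths
    X = (us - cx) * Z / fx                 # standard pinhole back-projection
    Y = (vs - cy) * Z / fy
    pts = np.stack([X, Y, Z], axis=-1)     # (m, 3) point cloud V

    G = pts.mean(axis=0)                   # 3D centroid
    offset = pts - G                       # centroid removal
    coords = offset / (np.linalg.norm(offset, axis=-1, keepdims=True) + 1e-12)

    rgb = colors / 255.0 * 2.0 - 1.0       # [0, 255] -> [-1, 1]
    return np.concatenate([coords, rgb], axis=-1)   # normalized cloud V_norm
```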
Further, the rotation estimation network is trained by using the rotation-estimation-normalized image blocks as its input data; the network outputs a rotation-matrix quaternion Q, which is normalized to unit length and then converted into a rotation matrix
Figure PCTCN2021084690-appb-000005
The absolute angle error L_R between the rotation matrix
Figure PCTCN2021084690-appb-000006
and the ground-truth rotation
Figure PCTCN2021084690-appb-000007
is used as the rotation matrix loss:
Figure PCTCN2021084690-appb-000008
where E is the identity matrix; L_R is backpropagated, and the gradient descent method is used to train the rotation estimation network and update the parameters of the feature extraction network based on bidirectional spatial attention.
Further, the translation vector estimation network is trained with the normalized 3D point cloud of image block O as input data, and with the unit vectors by which each surface point of the target object points toward the 3-dimensional translation vector
Figure PCTCN2021084690-appb-000009
that is, the unit vectors
Figure PCTCN2021084690-appb-000010
as output data; the angle error L_t is used as the translation vector loss:
Figure PCTCN2021084690-appb-000011
L_t is backpropagated and the gradient descent method is used to train and update the parameters of the translation vector estimation network, where
Figure PCTCN2021084690-appb-000012
denotes the ground-truth translation vector of the i-th pixel:
Figure PCTCN2021084690-appb-000013
and m is the number of pixels in the target object image block.
In the actual translation vector estimation, the obtained unit vectors W are used to construct the set L of line equations connecting any point of the target object point cloud to the 3-dimensional translation vector, l_i ∈ L, i = 1, 2, ..., m, where l is a 3D line equation:
Figure PCTCN2021084690-appb-000014
Then the point q: (x, y, z) in 3D space closest to the line set L is solved; q is the 3-dimensional translation vector t of the target object.
In another aspect, a target pose estimation system based on an attention mechanism and Hough voting includes:
an image acquisition module: uses an RGB-D camera to acquire color images and depth images of scenes containing multiple target objects;
a target segmentation module: segments the color image to obtain the category and segmentation mask of each target object;
a target extraction module: based on the segmentation masks of each object, crops and splices the color image and the depth image and extracts the image block of each target object;
a normalization module: normalizes the coordinates, color values and depth values of the 3D point cloud in each target object image block to obtain the 3D point cloud data of all target objects in the same space;
a pose estimation network construction module: builds a rotation estimation network and a translation vector estimation network;
the rotation estimation network includes, connected in series, a feature extraction network based on bidirectional spatial attention, a feature concatenation network, a multi-scale pooling network and a multi-layer perceptron network; the bidirectional spatial attention feature extraction network comprises a ResNet34 convolutional neural network and two parallel branches, a spatial aggregation convolutional neural network and a spatial distribution convolutional neural network;
the translation vector estimation network consists of a PointNet++ network and a point-by-point voting network connected in series;
a network training module: trains the pose estimation networks on a deep learning workstation;
using the color images and depth images of different scenes in a known target pose estimation dataset, the image acquisition module, target segmentation module, target extraction module and normalization module are called for processing; the resulting normalized target object image blocks, the corresponding object point clouds and the corresponding rotation-matrix quaternions and 3-dimensional translation unit vectors are used to train the rotation estimation network and the translation vector estimation network respectively; during training, the absolute angle error of the rotation matrix is used as the rotation estimation network loss, the absolute angle error of the translation vector is used as the translation vector estimation network loss, and the parameters are updated by gradient descent;
a pose estimation module: uses the trained rotation estimation network and translation vector estimation network to perform 3-dimensional rotation matrix estimation and 3-dimensional translation vector estimation on the target object image blocks whose pose is to be estimated, realizing target pose estimation.
Further, the spatial aggregation convolutional neural network adopts a convolutional neural network architecture; it takes the convolutional features obtained by the ResNet34 convolutional neural network as input and, from the context distribution feature F_{d-c}: [(H×W)×(H×W), H, W] produced by the convolutional network, extracts the global-to-local aggregation feature F_c: [H×W, H, W] corresponding to the feature constraint relationship between the H×W global points and the local H×W points, which it outputs;
the spatial distribution convolutional neural network adopts a convolutional neural network architecture; it takes the ResNet34 convolutional features as input and, from the context distribution feature F_{d-c}: [(H×W)×(H×W), H, W], extracts the local-to-global distribution feature F_d: [H×W, H, W] corresponding to the feature constraint relationship between the H×W local points and the H×W global points, which it outputs;
the translation vector estimation network includes a PointNet++ network and a point-by-point Hough voting network, the point-by-point Hough voting network adopting a multi-layer perceptron architecture;
the translation vector estimation network feeds the normalized 3D point cloud of the target object into the PointNet++ network to obtain point cloud features, and then uses the point-by-point Hough voting network based on a multi-layer perceptron to regress, point by point, the unit vector of the target object's 3-dimensional translation vector.
The 3-dimensional rotation matrix and 3-dimensional translation vector estimation networks are independent of each other; their training does not interfere and can be completed in parallel, yielding the target object pose components R and t and hence the target object pose P = [R|t].
Beneficial Effects
The present invention provides a target pose estimation method and system based on an attention mechanism and Hough voting. The method includes the following steps: acquire a color image and a depth image; segment and crop the color image to obtain color and depth image blocks of each target object; estimate the 6-dimensional pose of the target object with two strategies: for the 3-dimensional rotation matrix, a feature extraction network based on bidirectional spatial attention performs robust feature extraction using the two-dimensional feature constraints of the target surface, and a multi-layer perceptron network then regresses the target's 3-dimensional rotation matrix; for the 3-dimensional translation vector, the point cloud of the target object is reconstructed and normalized, a Hough voting network estimates the point cloud's 3-dimensional translation direction vector point by point, and finally a set of lines through the translation center is built and the nearest point in space is solved to obtain the target's 3-dimensional translation vector.
Compared with the prior art, the invention has the following advantages:
1. In view of the different ways in which the pose parameters, namely the 3-dimensional rotation matrix and the 3-dimensional translation vector, constrain the color and depth images as the target object's pose changes, different strategies are used for parameter estimation; this effectively extracts the color and depth image features of the target object, establishes a more accurate parameter estimation model, and improves the expressive and inference capability of the network;
2. The input and output data are in unit form: the 3-dimensional rotation matrix estimation network's input is normalized to color and image data in the [0, 1] space and its output rotation matrix is in unit quaternion form; the 3-dimensional translation vector estimation network's input is normalized to point cloud data in the [-1, 1] space and it outputs, point by point, unit direction vectors pointing toward the translation vector; this effectively avoids vanishing, exploding or unstable training gradients across data of different dimensions and scales, and accelerates network convergence.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the network structure of the target pose estimation method in the example of the present invention;
Fig. 2 is a schematic diagram of the YCB dataset used to train and validate the pose estimation network proposed by the method of the present invention, where (a) is the RGB image of scene 1, (b) the Depth image corresponding to (a), (c) the Label image annotating the target objects of (a), (d) the RGB image of scene 2, (e) the Depth image corresponding to (d), (f) the Label image corresponding to (d), (g) the RGB image of scene 3, (h) the Depth image corresponding to (g), and (i) the Label image of the target objects corresponding to (g);
Fig. 3 shows the RGB and Depth image blocks of 6 target objects obtained from the dataset shown in Fig. 2, where (a) is the RGB image block of target object 1, (b) the Depth image block corresponding to (a), (c) the RGB image block of target object 2, (d) the Depth image block corresponding to (c), (e) the RGB image block of target object 3, (f) the Depth image block corresponding to (e), (g) the RGB image block of target object 4, (h) the Depth image block corresponding to (g), (i) the RGB image block of target object 5, (j) the Depth image block corresponding to (i), (k) the RGB image block of target object 6, and (l) the Depth image block corresponding to (k);
Fig. 4 shows the target point clouds of the 6 target objects obtained from the dataset shown in Fig. 2, where (a)-(f) are the target point clouds of target objects 1-6, respectively;
Fig. 5 is a schematic diagram of the rotation matrix estimation network loss as the number of iterations increases;
Fig. 6 is a schematic diagram of the translation vector estimation network loss as the number of iterations increases;
Fig. 7 is a schematic diagram of the 6D pose test results, where (a)-(l) are 12 different test scenes.
Detailed Description of the Embodiments
To make the technical problems to be solved, the technical solutions and the advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments.
Addressing the problems of existing object pose estimation methods, the present invention provides a target pose estimation method based on an attention mechanism and Hough voting; the specific network structure is shown in Fig. 1, and the method includes the following steps:
Step S1: acquire the color and depth images of a scene containing the target objects; as shown in Fig. 2, the RGB image and Depth image captured by the RGB-D camera in each scene are shown, together with the annotated Label images of the target objects, for 3 scenes in total;
Step S2: obtain the category and segmentation mask of each object from the color image with an existing state-of-the-art target segmentation method;
Step S3: using the segmentation masks obtained in step S2, crop the color and depth images of the corresponding objects from the input image and concatenate them along the channel dimension, obtaining 4-channel image blocks O (3 color channels plus 1 depth channel) for each target object, with o_j ∈ O, j = 1, 2, ..., k, where k is the number of target objects in the image;
Step S4: construct a rotation estimation network and a translation vector estimation network;
the rotation estimation network includes, connected in series, a feature extraction network based on bidirectional spatial attention, a feature concatenation network, a multi-scale pooling network and a multi-layer perceptron network; the bidirectional spatial attention feature extraction network comprises a ResNet34 convolutional neural network and two parallel branches, a spatial aggregation convolutional neural network and a spatial distribution convolutional neural network;
the translation vector estimation network consists of a PointNet++ network and a point-by-point Hough voting network connected in series;
Step S5: network training;
using the color images and depth images of different scenes in a known target pose estimation dataset, processed according to steps S1-S3, the normalized image blocks of each target object, the corresponding 3D point clouds of the target objects and the corresponding rotation-matrix quaternions and 3-dimensional translation unit vectors are used to train the rotation estimation network and the translation vector estimation network respectively; during training, the absolute angle error of the rotation matrix is used as the rotation estimation network loss and the absolute angle error of the translation vector is used as the translation vector estimation network loss;
Step S6: after the target object image whose pose is to be estimated is processed according to steps S1-S3, it is input into the rotation estimation network and the translation vector estimation network trained in step S5, and the 3-dimensional rotation matrix and the 3-dimensional translation vector are estimated respectively to achieve target pose estimation.
In step S2, target segmentation takes the scene color image as input and outputs the segmentation mask of each known object; any existing state-of-the-art target segmentation method may be used for the specific implementation. The present invention does not cover this part, but the accuracy of the segmentation result affects the accuracy of the final object pose estimation of the invention.
Object pose estimation is decomposed into two independent tasks, namely 3-dimensional rotation matrix estimation and 3-dimensional translation vector estimation;
Rotation estimation normalization: data normalization is performed on the cropped target object image blocks O, normalizing the color channel values from the range [0, 255] and the depth channel values from [near, far] to [0, 1], where near and far are the nearest and farthest values of the target depth image, respectively;
taking the minimum bounding rectangle of image block O as the boundary and keeping the aspect ratio, the block is up-sampled or down-sampled and scaled to a fixed rectangular size, with the blank area filled with 0; as shown in Fig. 3, the scaling and padding make all image blocks the same width and height, giving the image blocks O_R that are subsequently used for training the rotation matrix estimation network;
Step S53: the image block O_R is input into the rotation estimation network; the rotation estimation network uses the ResNet34 convolutional neural network to obtain convolutional features, which are then fed into the spatial aggregation convolutional neural network and the spatial distribution convolutional neural network to extract the aggregation features and distribution features; the feature concatenation network concatenates the aggregation and distribution features, the multi-scale pooling network performs multi-scale pooling on the concatenated features to obtain the feature vector F_A of the target object image; finally, the multi-layer perceptron network regresses the 3-dimensional rotation matrix of the target object from this feature vector;
the spatial aggregation convolutional neural network takes the convolutional features obtained by the ResNet34 convolutional neural network as input; from the context distribution feature F_{d-c}: [(H×W)×(H×W), H, W] produced by the convolutional network, it extracts the global-to-local aggregation feature F_c: [H×W, H, W] corresponding to the feature constraint relationship between the H×W global points and the local H×W points, and outputs it;
the spatial distribution convolutional neural network likewise takes the ResNet34 convolutional features as input; from the context distribution feature F_{d-c}: [(H×W)×(H×W), H, W], it extracts the local-to-global distribution feature F_d: [H×W, H, W] corresponding to the feature constraint relationship between the H×W local points and the H×W global points, and outputs it.
The spatial distribution network obtains the feature constraint relationship between the H×W local points and the H×W global points, extracts the feature values of the corresponding points channel by channel according to their positions in feature space, arranges them according to the two-dimensional positions of the feature image, and generates the distribution feature F_d: [H×W, H, W]; each position in the feature image contains H×W values, representing the distribution constraint relationship between the H×W global points and that position;
3D point cloud normalization:
First, the 3D point cloud V of a target object is obtained from its image block O using the camera intrinsics and the pinhole imaging model, V = (X, Y, Z, I);
where:
Figure PCTCN2021084690-appb-000015
f_x and f_y are the equivalent focal lengths and, together with the image coordinate offsets c_x and c_y, constitute the camera intrinsics K; u_i and v_i are the horizontal and vertical coordinates of pixel i of image block O in the original input image, I = (R, G, B) is the color value, D(u_i, v_i) is the depth value of pixel i in image block O, i = 1, 2, ..., m, and m is the number of pixels in the target object image block;
Next, the 3D centroid G of the point cloud V is computed:
Figure PCTCN2021084690-appb-000016
The point cloud V is then normalized: each channel of the color value I is normalized from [0, 255] to [-1, 1]; the centroid is first removed from the 3D coordinates to obtain the offset coordinates ΔS(ΔX, ΔY, ΔZ) = (X - G_x, Y - G_y, Z - G_z), and ΔS is then unit-vectorized, norm(ΔX, ΔY, ΔZ), to obtain the normalized vector
Figure PCTCN2021084690-appb-000017
Combining with the color values gives the normalized 3D point cloud
Figure PCTCN2021084690-appb-000018
The target object point cloud V_norm is input into a translation vector estimation network, which generates, point by point, the unit vector by which each cloud point points toward the target object's 3-dimensional translation vector
Figure PCTCN2021084690-appb-000019
The obtained unit vectors W are used to construct the set L of line equations connecting any point of the target object point cloud to the 3-dimensional translation vector, l_i ∈ L, i = 1, 2, ..., m, where l is a 3D line equation:
Figure PCTCN2021084690-appb-000020
Then the point q: (x, y, z) in 3D space closest to the line set L is solved; q is the 3-dimensional translation vector t of the target object.
The translation vector estimation network feeds the normalized 3D point cloud of the target object into the PointNet++ network to obtain point cloud features, and then uses the point-by-point Hough voting network based on a multi-layer perceptron to regress, point by point, the unit vector of the target object's 3-dimensional translation vector.
During network parameter training:
the rotation estimation network is trained by using the rotation-estimation-normalized image blocks as its input data; the network outputs a rotation-matrix quaternion Q, which is normalized to unit length and then converted into a rotation matrix
Figure PCTCN2021084690-appb-000021
The absolute angle error L_R between the rotation matrix
Figure PCTCN2021084690-appb-000022
and the ground-truth rotation
Figure PCTCN2021084690-appb-000023
is used as the rotation matrix loss:
Figure PCTCN2021084690-appb-000024
where E is the identity matrix; L_R is backpropagated, and the gradient descent method is used to train the rotation estimation network and update the parameters of the feature extraction network based on bidirectional spatial attention.
As shown in Fig. 4, random sampling or duplication is used to make all target point clouds contain the same number of points, yielding the target point cloud data used for training the translation vector estimation network; the translation vector estimation network is trained with the normalized 3D point cloud of image block O as input data, and with the unit vectors by which each surface point of the target object points toward the 3-dimensional translation vector
Figure PCTCN2021084690-appb-000025
that is, the unit vectors
Figure PCTCN2021084690-appb-000026
as output data; the angle error L_t is used as the translation vector loss:
Figure PCTCN2021084690-appb-000027
L_t is backpropagated and the gradient descent method is used to train and update the parameters of the translation vector estimation network, where
Figure PCTCN2021084690-appb-000028
denotes the ground-truth translation vector of the i-th pixel:
Figure PCTCN2021084690-appb-000029
and m is the number of pixels in the target object image block.
Generally, a minimum loss value is set; when the loss falls below this value, training stops, the optimal effect having been reached; the choice of this minimum value is adjusted continuously according to the results of actual simulation experiments.
In this example, see Fig. 5 and Fig. 6: the rotation matrix estimation and translation vector estimation network losses decrease as the number of iterations increases, and after a certain number of iterations the estimation losses level off.
The 3-dimensional rotation estimation network and the 3-dimensional translation vector estimation network are independent of each other; their training does not interfere and can be completed in parallel, predicting the target object pose components R and t respectively and giving the target object pose P = [R|t].
During training, the two networks can be run in parallel, their losses computed and backpropagated independently, and the network weights updated to obtain the best network performance.
See Fig. 7 for the 6D pose test results of the networks trained with the method of this example of the present invention on the validation split of the dataset, covering 12 test scenes (a)-(l); each test result shows the 3D bounding boxes of the known objects in the scene. The bounding boxes are computed from the known object models, the camera intrinsics and the 6D poses of the objects in the scene; the 6D poses include the annotated ground truth and the estimates computed by the network, and the degree of agreement between the two reflects the computational accuracy of the network, further verifying the effectiveness and accuracy of the method described in this scheme. Based on the above method, an embodiment of the present invention also provides a target pose estimation system based on an attention mechanism and Hough voting, comprising:
an image acquisition module: uses an RGB-D camera to acquire color images and depth images of scenes containing multiple target objects;
the RGB-D camera is an Azure Kinect DK camera;
a target segmentation module: segments the color image to obtain the category and segmentation mask of each target object;
a target extraction module: based on the segmentation masks of each object, crops and splices the color image and the depth image and extracts the image block of each target object;
a normalization module: normalizes the coordinates, color values and depth values of the 3D point cloud in each target object image block to obtain the 3D point cloud data of all target objects in the same space;
a pose estimation network construction module: builds a rotation estimation network and a translation vector estimation network;
the rotation estimation network includes, connected in series, a feature extraction network based on bidirectional spatial attention, a feature concatenation network, a multi-scale pooling network and a multi-layer perceptron network; the bidirectional spatial attention feature extraction network comprises a ResNet34 convolutional neural network and two parallel branches, a spatial aggregation convolutional neural network and a spatial distribution convolutional neural network;
the translation vector estimation network consists of a PointNet++ network and a point-by-point voting network connected in series;
a network training module: trains the pose estimation networks on a deep learning workstation;
using the color images and depth images of different scenes in a known target pose estimation dataset, the image acquisition module, target segmentation module, target extraction module and normalization module are called for processing; the resulting normalized target object image blocks, the corresponding object point clouds and the corresponding rotation-matrix quaternions and 3-dimensional translation unit vectors are used to train the rotation estimation network and the translation vector estimation network respectively; during training, the absolute angle error of the rotation matrix is used as the rotation estimation network loss, the absolute angle error of the translation vector is used as the translation vector estimation network loss, and the parameters are updated by gradient descent;
the deep learning workstation is a Dell P5820x graphics workstation;
a pose estimation module: uses the trained rotation estimation network and translation vector estimation network to perform 3-dimensional rotation matrix estimation and 3-dimensional translation vector estimation on the target object image blocks whose pose is to be estimated, realizing target pose estimation.
It should be understood that the functional unit modules in the embodiments of the present invention may be integrated into one processing unit, may exist physically on their own, or two or more unit modules may be integrated into one unit module, and may be implemented in hardware or software.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them; although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the specific implementations of the present invention may still be modified or equivalently replaced, and any modification or equivalent replacement that does not depart from the spirit and scope of the present invention shall fall within the scope of protection of the claims of the present invention.

Claims (10)

  1. A target pose estimation method based on an attention mechanism and Hough voting, characterized by comprising the following steps:
    Step S1: acquiring a color image and a depth image of a scene containing multiple target objects;
    Step S2: obtaining the category and segmentation mask of each target object from the color image with a target segmentation method;
    Step S3: using the segmentation masks obtained in step S2, cropping and splicing the color image and the depth image, extracting an image block for each target object, and normalizing it;
    Step S4: constructing a rotation estimation network and a translation vector estimation network;
    wherein the rotation estimation network comprises, connected in series, a feature extraction network based on bidirectional spatial attention, a feature concatenation network, a multi-scale pooling network and a multi-layer perceptron network, the bidirectional spatial attention feature extraction network comprising a ResNet34 convolutional neural network and two parallel branches, a spatial aggregation convolutional neural network and a spatial distribution convolutional neural network;
    and the translation vector estimation network comprises a PointNet++ network and a point-by-point Hough voting network connected in series;
    Step S5: network training;
    using the color images and depth images of different scenes in a known target pose estimation dataset, processed according to steps S1-S3, the normalized image blocks of each target object, the corresponding 3D point clouds of the target objects and the corresponding rotation-matrix quaternions and 3-dimensional translation unit vectors are used to train the rotation estimation network and the translation vector estimation network respectively; during training, the absolute angle error of the rotation matrix is used as the rotation estimation network loss and the absolute angle error of the translation vector is used as the translation vector estimation network loss;
    Step S6: after the target object image whose pose is to be estimated is processed according to steps S1-S3, inputting it into the rotation estimation network and the translation vector estimation network trained in step S5, and performing 3-dimensional rotation matrix estimation and 3-dimensional translation vector estimation respectively to realize target pose estimation.
  2. The method according to claim 1, characterized in that the specific process of normalizing each target object image block is as follows:
    rotation estimation normalization: for each target object image block O, normalizing the color channel values from the range [0, 255] and the depth channel values from the range [near, far] to [-1, 1]; then, taking the minimum bounding rectangle of block O as the boundary and keeping the set aspect ratio, up-sampling or down-sampling each target object image block O, scaling it to a fixed rectangular size with the blank area filled with 0, to obtain target object image blocks O_R with uniform width and height;
    3D point cloud normalization: obtaining the 3D point cloud of each target object from its image block O, normalizing the color values of the 3D point cloud from [0, 255] to [-1, 1], removing the centroid from the 3-dimensional coordinates of the point cloud to obtain offset coordinates, and unit-vectorizing the offset coordinates to obtain normalized coordinates, so as to obtain the 3D point cloud data of all target objects in the same space;
    wherein near and far are the nearest and farthest values of the target object's depth image, respectively.
  3. The method according to claim 1, characterized in that the spatial aggregation convolutional neural network takes the convolutional features obtained by the ResNet34 convolutional neural network as input and, from the context distribution feature F_{d-c}: [(H×W)×(H×W), H, W] produced by the convolutional network, extracts the global-to-local aggregation feature F_c: [H×W, H, W] corresponding to the feature constraint relationship between the H×W global points and the local H×W points, which it outputs;
    and the spatial distribution convolutional neural network takes the convolutional features obtained by the ResNet34 convolutional neural network as input and, from the context distribution feature F_{d-c}: [(H×W)×(H×W), H, W], extracts the local-to-global distribution feature F_d: [H×W, H, W] corresponding to the feature constraint relationship between the H×W local points and the H×W global points, which it outputs.
  4. The method according to claim 1, characterized in that the translation vector estimation network feeds the normalized 3D point cloud of the target object into the PointNet++ network to obtain point cloud features, and then uses the point-by-point Hough voting network based on a multi-layer perceptron to regress, point by point, the unit vector of the target object's 3-dimensional translation vector.
  5. The method according to claim 4, characterized in that the 3D point cloud coordinates and unit vectors of each target object are used to build the set of line equations on which the object's 3-dimensional translation vector lies, and the 3-dimensional translation vector t of each target object is obtained by solving for the point in 3D space closest to this set of lines.
  6. The method according to claim 1, characterized in that the 3D point cloud normalization specifically refers to:
    first, obtaining the 3D point cloud V of a target object from its image block O using the camera intrinsics and the pinhole imaging model, V = (X, Y, Z, I);
    where:
    Figure PCTCN2021084690-appb-100001
    f_x and f_y are the equivalent focal lengths and, together with the image coordinate offsets c_x and c_y, constitute the camera intrinsics K; u_i and v_i are the horizontal and vertical coordinates of pixel i of image block O in the original input image, I = (R, G, B) is the color value, D(u_i, v_i) is the depth value of pixel i in image block O, i = 1, 2, ..., m, and m is the number of pixels in the target object image block;
    next, computing the 3D centroid G of the point cloud V:
    Figure PCTCN2021084690-appb-100002
    then normalizing the point cloud V: each channel of the color value I is normalized from [0, 255] to [-1, 1]; the centroid is first removed from the 3D coordinates to obtain the offset coordinates ΔS(ΔX, ΔY, ΔZ) = (X - G_x, Y - G_y, Z - G_z), and ΔS is then unit-vectorized, norm(ΔX, ΔY, ΔZ), to obtain the normalized vector
    Figure PCTCN2021084690-appb-100003
    and combining with the color values to obtain the normalized 3D point cloud V_norm:
    Figure PCTCN2021084690-appb-100004
  7. The method according to claim 1, characterized in that the rotation estimation network is trained by using the rotation-estimation-normalized image blocks as its input data; the network outputs a rotation-matrix quaternion Q, which is normalized to unit length and then converted into a rotation matrix
    Figure PCTCN2021084690-appb-100005
    the absolute angle error L_R between the rotation matrix
    Figure PCTCN2021084690-appb-100006
    and the ground-truth rotation
    Figure PCTCN2021084690-appb-100007
    is used as the rotation matrix loss:
    Figure PCTCN2021084690-appb-100008
    where E is the identity matrix; L_R is backpropagated, and the gradient descent method is used to train the rotation estimation network and update the rotation estimation network parameters.
  8. The method according to claim 1, characterized in that the translation vector estimation network is trained with the normalized 3D point cloud of image block O as input data, and with the unit vectors by which each surface point of the target object points toward the 3-dimensional translation vector
    Figure PCTCN2021084690-appb-100009
    that is, the unit vectors
    Figure PCTCN2021084690-appb-100010
    as output data; the angle error L_t is used as the translation vector loss:
    Figure PCTCN2021084690-appb-100011
    L_t is backpropagated and the gradient descent method is used to train and update the parameters of the translation vector estimation network, where
    Figure PCTCN2021084690-appb-100012
    denotes the ground-truth translation vector of the i-th pixel:
    Figure PCTCN2021084690-appb-100013
    and m is the number of pixels in the target object image block.
  9. A target pose estimation system based on an attention mechanism and Hough voting, characterized by comprising:
    an image acquisition module: using an RGB-D camera to acquire color images and depth images of scenes containing multiple target objects;
    a target segmentation module: for segmenting the color image to obtain the category and segmentation mask of each target object;
    a target extraction module: based on the segmentation masks of each object, cropping and splicing the color image and the depth image and extracting the image block of each target object;
    a normalization module: normalizing the coordinates, color values and depth values of the 3D point cloud in each target object image block to obtain the 3D point cloud data of all target objects in the same space;
    a pose estimation network construction module: for constructing a rotation estimation network and a translation vector estimation network;
    wherein the rotation estimation network comprises, connected in series, a feature extraction network based on bidirectional spatial attention, a feature concatenation network, a multi-scale pooling network and a multi-layer perceptron network, the bidirectional spatial attention feature extraction network comprising a ResNet34 convolutional neural network and two parallel branches, a spatial aggregation convolutional neural network and a spatial distribution convolutional neural network;
    and the translation vector estimation network comprises a PointNet++ network and a point-by-point voting network connected in series;
    a network training module: training the pose estimation networks on a deep learning workstation;
    using the color images and depth images of different scenes in a known target pose estimation dataset, the image acquisition module, target segmentation module, target extraction module and normalization module are called for processing; the resulting normalized target object image blocks, the corresponding object point clouds and the corresponding rotation-matrix quaternions and 3-dimensional translation unit vectors are used to train the rotation estimation network and the translation vector estimation network respectively; during training, the absolute angle error of the rotation matrix is used as the rotation estimation network loss, the absolute angle error of the translation vector is used as the translation vector estimation network loss, and the parameters are updated by gradient descent;
    a pose estimation module: using the trained rotation estimation network and translation vector estimation network to perform 3-dimensional rotation matrix estimation and 3-dimensional translation vector estimation on the target object image blocks whose pose is to be estimated, realizing target pose estimation.
  10. The system according to claim 9, characterized in that the spatial aggregation convolutional neural network adopts a convolutional neural network architecture; it takes the convolutional features obtained by the ResNet34 convolutional neural network as input and, from the context distribution feature F_{d-c}: [(H×W)×(H×W), H, W] produced by the convolutional network, extracts the global-to-local aggregation feature F_c: [H×W, H, W] corresponding to the feature constraint relationship between the H×W global points and the local H×W points, which it outputs;
    the spatial distribution convolutional neural network adopts a convolutional neural network architecture; it takes the convolutional features obtained by the ResNet34 convolutional neural network as input and, from the context distribution feature F_{d-c}: [(H×W)×(H×W), H, W], extracts the local-to-global distribution feature F_d: [H×W, H, W] corresponding to the feature constraint relationship between the H×W local points and the H×W global points, which it outputs;
    the translation vector estimation network comprises a PointNet++ network and a point-by-point Hough voting network, the point-by-point Hough voting network adopting a multi-layer perceptron architecture;
    and the translation vector estimation network feeds the normalized 3D point cloud of the target object into the PointNet++ network to obtain point cloud features, and then uses the point-by-point Hough voting network based on a multi-layer perceptron to regress, point by point, the unit vector of the target object's 3-dimensional translation vector.
PCT/CN2021/084690 2021-02-25 2021-03-31 Target pose estimation method and system based on attention mechanism and Hough voting WO2022178952A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110211748.6 2021-02-25
CN202110211748.6A CN113065546B (zh) 2021-02-25 2021-02-25 Target pose estimation method and system based on attention mechanism and Hough voting

Publications (1)

Publication Number Publication Date
WO2022178952A1 true WO2022178952A1 (zh) 2022-09-01

Family

ID=76559164

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084690 WO2022178952A1 (zh) 2021-02-25 2021-03-31 一种基于注意力机制和霍夫投票的目标位姿估计方法及系统

Country Status (2)

Country Link
CN (1) CN113065546B (zh)
WO (1) WO2022178952A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115311274A (zh) * 2022-10-11 2022-11-08 四川路桥华东建设有限责任公司 一种基于空间变换自注意力模块的焊缝检测方法及系统
CN115578461A (zh) * 2022-11-14 2023-01-06 之江实验室 基于双向rgb-d特征融合的物体姿态估计方法及装置

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780240B (zh) * 2021-09-29 2023-12-26 上海交通大学 基于神经网络及旋转特征增强的物体位姿估计方法
CN113989318B (zh) * 2021-10-20 2023-04-07 电子科技大学 基于深度学习的单目视觉里程计位姿优化与误差修正方法
CN114170312A (zh) * 2021-12-07 2022-03-11 南方电网电力科技股份有限公司 一种基于特征融合的目标物体位姿估计方法及装置
CN114820932B (zh) * 2022-04-25 2024-05-03 电子科技大学 一种基于图神经网络和关系优化的全景三维场景理解方法
CN115082572B (zh) * 2022-07-22 2023-11-03 南京慧尔视智能科技有限公司 一种雷达和相机联合自动标定方法和系统
CN115761116B (zh) * 2022-11-03 2023-08-18 云南大学 一种基于单目相机的透视投影下三维人脸重建方法
CN117788577A (zh) * 2023-12-21 2024-03-29 西南交通大学 一种基于深度学习的螺栓6d姿态估计方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130051639A1 (en) * 2011-08-23 2013-02-28 Kabushiki Kaisha Toshiba Object location method and system
CN110458128A (zh) * 2019-08-16 2019-11-15 广东工业大学 一种姿态特征获取方法、装置、设备及存储介质
CN111179324A (zh) * 2019-12-30 2020-05-19 同济大学 基于颜色和深度信息融合的物体六自由度位姿估计方法
CN111783986A (zh) * 2020-07-02 2020-10-16 清华大学 网络训练方法及装置、姿态预测方法及装置

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5346863B2 (ja) * 2010-03-30 2013-11-20 大日本スクリーン製造株式会社 3次元位置・姿勢認識装置およびそれを用いたシステム、方法、プログラム
GB201215944D0 (en) * 2012-09-06 2012-10-24 Univ Manchester Image processing apparatus and method for fittng a deformable shape model to an image using random forests
US20200301015A1 (en) * 2019-03-21 2020-09-24 Foresight Ai Inc. Systems and methods for localization
CN111325797B (zh) * 2020-03-03 2023-07-25 华东理工大学 一种基于自监督学习的位姿估计方法
CN111723721A (zh) * 2020-06-15 2020-09-29 中国传媒大学 基于rgb-d的三维目标检测方法、系统及装置
CN111784770B (zh) * 2020-06-28 2022-04-01 河北工业大学 基于shot和icp算法的无序抓取中的三维姿态估计方法
CN111862201B (zh) * 2020-07-17 2023-06-23 北京航空航天大学 一种基于深度学习的空间非合作目标相对位姿估计方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130051639A1 (en) * 2011-08-23 2013-02-28 Kabushiki Kaisha Toshiba Object location method and system
CN110458128A (zh) * 2019-08-16 2019-11-15 广东工业大学 一种姿态特征获取方法、装置、设备及存储介质
CN111179324A (zh) * 2019-12-30 2020-05-19 同济大学 基于颜色和深度信息融合的物体六自由度位姿估计方法
CN111783986A (zh) * 2020-07-02 2020-10-16 清华大学 网络训练方法及装置、姿态预测方法及装置

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115311274A (zh) * 2022-10-11 2022-11-08 四川路桥华东建设有限责任公司 一种基于空间变换自注意力模块的焊缝检测方法及系统
CN115311274B (zh) * 2022-10-11 2022-12-23 四川路桥华东建设有限责任公司 一种基于空间变换自注意力模块的焊缝检测方法及系统
CN115578461A (zh) * 2022-11-14 2023-01-06 之江实验室 基于双向rgb-d特征融合的物体姿态估计方法及装置
CN115578461B (zh) * 2022-11-14 2023-03-10 之江实验室 基于双向rgb-d特征融合的物体姿态估计方法及装置

Also Published As

Publication number Publication date
CN113065546A (zh) 2021-07-02
CN113065546B (zh) 2022-08-12

Similar Documents

Publication Publication Date Title
WO2022178952A1 (zh) 一种基于注意力机制和霍夫投票的目标位姿估计方法及系统
Fan et al. Pothole detection based on disparity transformation and road surface modeling
US11436437B2 (en) Three-dimension (3D) assisted personalized home object detection
US11315266B2 (en) Self-supervised depth estimation method and system
WO2019174377A1 (zh) 一种基于单目相机的三维场景稠密重建方法
CN111563415B (zh) 一种基于双目视觉的三维目标检测系统及方法
CN112258618A (zh) 基于先验激光点云与深度图融合的语义建图与定位方法
JP7439153B2 (ja) 全方位場所認識のためのリフトされたセマンティックグラフ埋め込み
CN111998862B (zh) 一种基于bnn的稠密双目slam方法
WO2021164887A1 (en) 6d pose and shape estimation method
WO2022099613A1 (zh) 图像生成模型的训练方法、新视角图像生成方法及装置
CN113393503B (zh) 一种分割驱动形状先验变形的类别级物体6d位姿估计方法
CN112465903A (zh) 一种基于深度学习点云匹配的6dof物体姿态估计方法
WO2022052782A1 (zh) 图像的处理方法及相关设备
CN111768447A (zh) 一种基于模板匹配的单目相机物体位姿估计方法及系统
EP4107650A1 (en) Systems and methods for object detection including pose and size estimation
CN114913552B (zh) 一种基于单视角点云序列的三维人体稠密对应估计方法
Mukasa et al. 3d scene mesh from cnn depth predictions and sparse monocular slam
CN115482268A (zh) 一种基于散斑匹配网络的高精度三维形貌测量方法与系统
Zhu et al. A review of 6d object pose estimation
Burkov et al. Multi-neus: 3d head portraits from single image with neural implicit functions
CN113409242A (zh) 一种轨交弓网点云智能监测方法
CN117351078A (zh) 基于形状先验的目标尺寸与6d姿态估计方法
KR20210018114A (ko) 교차 도메인 메트릭 학습 시스템 및 방법
CN116883590A (zh) 一种三维人脸点云优化方法、介质及系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21927386

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21927386

Country of ref document: EP

Kind code of ref document: A1