WO2020063475A1 - 6D pose estimation network training method and device based on deep learning iterative matching - Google Patents

6D pose estimation network training method and device based on deep learning iterative matching

Info

Publication number
WO2020063475A1
Authority
WO
WIPO (PCT)
Prior art keywords
target object
segmentation mask
picture
pose
estimated
Prior art date
Application number
PCT/CN2019/106993
Other languages
English (en)
French (fr)
Inventor
季向阳
王谷
李益
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学
Publication of WO2020063475A1 publication Critical patent/WO2020063475A1/zh
Priority to US17/023,919 (published as US11200696B2)

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75 - Determining position or orientation of objects or cameras using feature-based methods involving models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/64 - Three-dimensional objects
    • G06V20/647 - Three-dimensional objects by matching two-dimensional images to three-dimensional objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]

Definitions

  • the disclosure relates to the field of artificial intelligence, and in particular, to a 6D pose estimation network training method and device based on deep learning iterative matching.
  • the 6D pose of an object, that is, the 3D position and 3D orientation of the object, can be used for tasks such as grasping or motion planning.
  • In traditional techniques, a depth camera is generally used to estimate the pose of an object. However, depth cameras have many limitations, such as limitations on frame rate, field of view, resolution, and depth range, which make it difficult for technologies that rely on depth cameras to detect small, transparent, or fast-moving objects. Estimating the 6D pose of an object using only RGB images, however, remains very challenging, because factors such as lighting, pose changes, and occlusion all affect the appearance of the object in the image. A robust 6D pose estimation method also needs to be able to handle both textured and untextured objects.
  • the present disclosure proposes a 6D pose estimation network training method and device based on deep learning iterative matching, to address the problems that the 6D pose estimates of objects obtained by existing deep learning methods are not accurate enough and that there is no method for improving a 6D pose estimate without relying on depth information.
  • a 6D pose estimation network training method based on deep learning iterative matching which is characterized in that the method includes:
  • a 6D pose estimation network training device based on deep learning iterative matching which is characterized in that the device includes:
  • An obtaining module, configured to obtain a rendered picture and a first segmentation mask of the target object by using the three-dimensional model of the target object and the initial 6D pose estimate.
  • An input module, configured to input the rendered picture, the first segmentation mask, an observation picture of the target object, and a second segmentation mask of the target object in the observation picture into a deep convolutional neural network, to obtain a 6D pose estimate, a third segmentation mask, and optical flow.
  • An iteration module, configured to update the initial 6D pose estimate with the obtained 6D pose estimate, replace the second segmentation mask with the third segmentation mask, and perform the above steps again, so as to iteratively train the deep convolutional neural network.
  • Memory for storing processor-executable instructions
  • the processor is configured to implement the above method when executing the processor-executable instructions.
  • a non-volatile computer-readable storage medium on which computer program instructions are stored, characterized in that the computer program instructions implement the above method when executed by a processor.
  • Since the training method proposed in the embodiments of the present disclosure does not rely on depth information when refining the initial 6D pose estimate, the estimation result is accurate. Because environmental conditions such as lighting and occlusion can be adjusted as needed during rendering, the method is robust to problems such as lighting and occlusion, and because the segmentation mask can be obtained whether or not the object is textured, the method can handle both textured and untextured objects.
  • FIG. 1a shows a flowchart of a 6D pose estimation network training method based on deep learning iterative matching according to an embodiment of the present disclosure.
  • FIG. 1b shows a schematic diagram of a 6D pose estimation network training method based on deep learning iterative matching according to an embodiment of the present disclosure.
  • FIG. 2 is a schematic diagram showing one example of a zoom-in operation according to an embodiment of the present disclosure.
  • FIG. 3 shows a flowchart of a method for training a 6D pose estimation model based on deep learning iterative matching according to an embodiment of the present disclosure.
  • FIG. 4 shows a structural schematic diagram of an example of a deep convolutional neural network according to an embodiment of the present disclosure.
  • FIG. 5 is a block diagram of a device 1900 for training a 6D pose estimation network based on deep learning iterative matching according to an exemplary embodiment.
  • "Exemplary" here means "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as superior to or better than other embodiments.
  • FIG. 1a shows a flowchart of a 6D pose estimation network training method based on deep learning iterative matching according to an embodiment of the present disclosure.
  • FIG. 1b shows a schematic diagram of the method. As shown in FIG. 1a, the method includes:
  • the target object may be any object to be subjected to pose estimation during the network training process, such as an object or a person.
  • the initial 6D pose estimation pose (0) of the target object may be a preset initial value, or an initial value obtained through an estimation method of other related technologies.
  • a rendered picture of the target object and a first segmentation mask may be obtained, where the first segmentation mask may be a segmentation mask of the target object in the rendered picture.
  • the observation picture of the target object may be a picture obtained by shooting an actual target object.
  • the second segmentation mask may be obtained from a segmentation annotation of the observation picture, and the segmentation annotation may be obtained using an object segmentation method of the related art.
  • the deep convolutional neural network can have three branches that respectively regress the 6D pose estimate, the third segmentation mask, and the optical flow for iterative training.
  • the network parameters of the deep convolutional neural network can be adjusted according to the loss function.
  • when the iterative training meets the training condition, training can be considered complete; the training condition can be set according to actual needs, for example, the value of the loss function falling below a threshold or the number of iterations reaching a threshold, which is not limited in the present disclosure.
  • Since the training method proposed in the embodiments of the present disclosure does not rely on depth information when refining the initial 6D pose estimate, the estimation result is accurate. Because environmental conditions such as lighting and occlusion can be adjusted as needed during rendering, the method is robust to problems such as lighting and occlusion, and because the segmentation mask can be obtained whether or not the object is textured, the method can handle both textured and untextured objects.
  • iteratively training the deep convolutional neural network may include: iteratively training the deep convolutional neural network using an SGD optimization algorithm.
  • the SGD (Stochastic Gradient Descent) optimization algorithm can be used to iteratively train the deep convolutional neural network and optimize the loss function until convergence, so as to achieve a better training effect.
  • inputting the rendered picture, the first segmentation mask, the observation picture of the target object, and the second segmentation mask of the target object in the observation picture into the deep convolutional neural network includes: performing an enlargement operation on the rendered picture of the target object together with the rectangular region bounding the second segmentation mask of the target object in the observation picture, so that the 2D projection center of the three-dimensional model of the target object is located at the center of the enlarged rendered picture and the target object in the observation picture lies entirely within the enlarged observation picture; and inputting the enlarged rendered picture, the first segmentation mask, the enlarged observation picture, and the second segmentation mask of the target object in the observation picture into the deep convolutional neural network.
  • FIG. 2 is a schematic diagram showing one example of a zoom-in operation according to an embodiment of the present disclosure.
  • As shown in FIG. 2, the rectangular area bounded by the second segmentation mask in the observation picture is enlarged to obtain the enlarged observation picture, and the rendered picture is enlarged to obtain the enlarged rendered picture, where the observation picture and the rendered picture can be enlarged by the same proportion.
  • the two-dimensional projection center of the three-dimensional model of the target object is located at the center of the enlarged rendered picture, and the target object in the observation picture is completely located in the enlarged observation picture.
  • Since the deep convolutional neural network in this embodiment processes paired pictures (the rendered picture and the observation picture) enlarged around the target object, it is less affected by the scale of the object, and the estimation result is more accurate.
  • FIG. 3 shows a flowchart of a method for training a 6D pose estimation model based on deep learning iterative matching according to an embodiment of the present disclosure. As shown in FIG. 3, the method further includes:
  • Steps S104, S105, and S106 serve as the testing or use process of the network and are performed after the iterative training of the network is completed. During testing or use, the optical flow and segmentation mask branches of the deep convolutional neural network trained in steps S101-S103 are removed, and steps S104, S105, and S106 are repeated until a preset iterative convergence condition is met, at which point the 6D pose estimation result is obtained.
  • the present disclosure places no restrictions on iterative convergence conditions.
  • inputting the rendered picture of the target object to be estimated, the fourth segmentation mask, the observation picture of the target object to be estimated, and the fifth segmentation mask of the target object to be estimated in the observation picture into the trained deep convolutional neural network includes: performing an enlargement operation on the rendered picture of the target object to be estimated together with the rectangular region bounding the initially predicted fifth segmentation mask of the target object to be estimated in the observation picture, so that the 2D projection center of the three-dimensional model of the target object to be estimated is located at the center of the enlarged rendered picture and the target object to be estimated in the observation picture lies entirely within the enlarged observation picture; and inputting the enlarged rendered picture, the fourth segmentation mask, the enlarged observation picture, and the fifth segmentation mask of the target object to be estimated in the observation picture into the trained deep convolutional neural network.
  • the fifth segmentation mask of the initial prediction may be a segmentation mask obtained by other related technologies.
  • Through an enlargement process similar to that described above, the estimation result is less affected by the scale of the object and is therefore more accurate.
  • the 6D pose estimate output by the deep convolutional neural network may be represented by a relative pose transformation with respect to the target pose, where the target pose is the annotated pose of the target object in the observation picture and can be obtained by manual annotation or other related pose recognition techniques.
  • the relative attitude transformation amount includes a relative rotation transformation amount and a relative translation transformation amount, and can be expressed in a manner of decoupling translation and rotation.
  • for the relative rotation transformation, the center point of the target object in the camera coordinate system can be used as the origin of the camera coordinate system, so that rotating the target object does not affect its translation in the camera coordinate system. This decouples rotation from translation.
  • the relative rotation transformation amount can be expressed by a transformation matrix.
  • the relative translation transform amount can be expressed by the offset and scale change in 2D pixel space, instead of directly expressed by the coordinate difference in 3D space.
  • the relative translation transformation amount can be represented by a transformation vector.
  • Assume the relative translation transformation is t_Δ = (v_x, v_y, v_z), where v_x and v_y respectively denote the pixel offsets along the x and y directions between the target object in the rendered picture and the target object in the observation picture, v_z denotes the scale-change factor of the target object, and t_src = (x_src, y_src, z_src) and t_tgt = (x_tgt, y_tgt, z_tgt) denote the translations of the target object in the rendered picture and in the observation picture relative to the coordinate origin. Then:
  • v_x = f_x (x_tgt / z_tgt - x_src / z_src),
  • v_y = f_y (y_tgt / z_tgt - y_src / z_src),
  • v_z = log(z_src / z_tgt),
  • where f_x and f_y are the focal lengths of the camera. The scale-change factor v_z is expressed as a ratio, so it is independent of the absolute scale of the target object; the logarithm makes v_z = 0 correspond to no change in scale.
  • Since f_x and f_y are both fixed constants, they can be regarded as 1 during actual training of the network.
  • This decoupled representation of the relative rotation and translation transformations makes the network easier to train and allows it to be applied to zero-shot learning, that is, refining the 6D pose estimates of models that have not been seen before.
  • the deep convolutional neural network may be constructed based on a FlowNet model for predicting optical flow.
  • the basic structure of the deep convolutional neural network can be the structure of the simple version of FlowNet for predicting optical flow, retaining its optical-flow prediction branch, adding segmentation masks to the input, and adding to the output a branch for predicting the segmentation mask and a branch for predicting the 6D pose estimate.
  • Among them, the optical flow branch and the segmentation mask branch only play a role during training, as auxiliaries that make training more stable; during testing and application, only the 6D pose estimation branch may be used.
  • the input of the network can include 8 channels, that is, 3 channels of the observation image, 1 channel of the segmentation mask of the observation image, 3 channels of the rendered image, and 1 channel of the segmentation mask of the rendered image.
  • the network weights of the added segmentation mask channels can be initialized to 0; other parts, if they are new layers, can be randomly initialized, and the remaining layers that are the same as in the original FlowNet can keep their original weights.
  • a fully connected layer with 3 output neurons can be used for the relative translation transformation,
  • a fully connected layer with 4 output neurons can be used for the relative rotation transformation, where 4 indicates that a quaternion (four elements) is used to represent the relative rotation transformation.
  • FIG. 4 is a schematic structural diagram of an example of a deep convolutional neural network according to an embodiment of the present disclosure.
  • the network is based on the FlowNet Convs and FlowNet DeConvs (FlowNet convolution and deconvolution) models.
  • During training, the enlarged rendered picture and its segmentation mask, together with the enlarged observation picture and its segmentation mask, are input into the FlowNet Convs model to obtain the 6D pose estimate (including the relative rotation transformation (Rotation) and the relative translation transformation (Translation)).
  • the FlowNet DeConvs model obtains the optical flow and the segmentation mask (the third segmentation mask above) from the feature map produced by the FlowNet Convs model.
  • After iterative training is complete, the optical flow and segmentation mask branches are removed when testing and when using the network for pose estimation.
  • a loss function is formed based on a weighted sum of the loss functions of the three branches of 6D pose estimation, optical flow, and the third segmentation mask.
  • the optical flow branch and the third segmentation mask branch are only used for iterative training.
  • For example, the following loss function can be used: L = αL_pose + βL_flow + γL_mask,
  • where L_pose denotes the loss function of the 6D pose estimation branch,
  • α denotes the weight coefficient of the loss function of the 6D pose estimation branch,
  • L_flow denotes the loss function of the optical flow branch,
  • β denotes the weight coefficient of the loss function of the optical flow branch,
  • L_mask denotes the loss function of the third segmentation mask branch,
  • γ denotes the weight coefficient of the loss function of the third segmentation mask branch.
  • Assume the target pose for the observation picture is p = [R|t] and the estimated pose is p̂ = [R̂|t̂]. Then the loss function of the 6D pose estimation branch can be: L_pose = (1/n) Σ_{j=1..n} L1((R̂ x_j + t̂) - (R x_j + t)),
  • where R denotes the rotation in the target pose and R̂ the rotation in the estimated pose,
  • t denotes the translation in the target pose and t̂ the translation in the estimated pose,
  • x_j denotes the coordinates of the j-th point in the three-dimensional model of the target object,
  • L_1 denotes the 1-norm,
  • n denotes the total number of points in the three-dimensional model.
  • R, R̂, t, and t̂ denote the absolute rotations and translations relative to the coordinate origin; R̂ and t̂ of the estimated pose can be obtained by composing the target pose with the relative rotation transformation and relative translation transformation output by the deep convolutional neural network.
  • in a possible implementation, when the number of pictures used for training and their segmentation masks is greater than or equal to a preset number, the optical flow branch and/or the third segmentation mask branch may be removed during the iterative training.
  • the loss function also needs to be adjusted accordingly.
  • a person skilled in the art may set a preset number of values according to actual conditions, which is not limited in the present disclosure.
  • An embodiment of the present disclosure provides a 6D pose estimation network training device based on deep learning iterative matching, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the processor When the processor executes the instructions, the above method is implemented.
  • An embodiment of the present disclosure provides a non-volatile computer-readable storage medium on which computer program instructions are stored, and is characterized in that the computer program instructions implement the foregoing method when executed by a processor.
  • FIG. 5 is a block diagram of a device 1900 for training a 6D pose estimation network based on deep learning iterative matching according to an exemplary embodiment.
  • the device 1900 may be provided as a server, as shown in FIG. 5.
  • the device 1900 includes a processing component 1922, which further includes one or more processors, and a memory resource represented by a memory 1932, for storing instructions executable by the processing component 1922, such as an application program.
  • the application program stored in the memory 1932 may include one or more modules each corresponding to a set of instructions.
  • the processing component 1922 is configured to execute instructions to perform the method described above.
  • the device 1900 may further include a power supply component 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and an input / output (I / O) interface 1958.
  • the device 1900 can operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
  • a non-volatile computer-readable storage medium such as a memory 1932 including computer program instructions, and the computer program instructions may be executed by the processing component 1922 of the device 1900 to complete the above method.
  • the present disclosure may be a system, method, and / or computer program product.
  • the computer program product may include a computer-readable storage medium having computer-readable program instructions for causing a processor to implement various aspects of the present disclosure.
  • the computer-readable storage medium may be a tangible device that can hold and store instructions used by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electric storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • A more specific (non-exhaustive) list of examples of the computer-readable storage medium includes: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or an in-groove raised structure on which instructions are stored, and any suitable combination of the above.
  • Computer-readable storage media as used herein are not to be interpreted as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (for example, light pulses through fiber-optic cables), or electrical signals transmitted through electrical wires.
  • the computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to various computing / processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and / or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers.
  • the network adapter card or network interface in each computing / processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing / processing device .
  • Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages.
  • the programming languages include object-oriented programming languages, such as Smalltalk, C++, and the like, and conventional procedural programming languages, such as the "C" language or similar programming languages.
  • Computer-readable program instructions may be executed entirely on a user's computer, partly on a user's computer, as a stand-alone software package, partly on a user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • In cases involving a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized by using state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.
  • These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, thereby producing a machine, such that these instructions, when executed by the processor of the computer or the other programmable data processing apparatus, produce a means for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to work in a specific manner, so that the computer-readable medium storing the instructions includes an article of manufacture that includes instructions implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • Computer-readable program instructions can also be loaded onto a computer, another programmable data processing apparatus, or other devices, so that a series of operating steps are performed on the computer, the other programmable data processing apparatus, or the other devices to produce a computer-implemented process, so that the instructions executed on the computer, the other programmable data processing apparatus, or the other devices implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in the flowcharts or block diagrams may represent a module, a program segment, or a part of instructions, which contains one or more executable instructions for implementing a specified logical function.
  • In some alternative implementations, the functions noted in the blocks may also occur in an order different from that noted in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved.
  • each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or can be implemented by a combination of dedicated hardware and computer instructions.

Abstract

The present disclosure relates to a 6D pose estimation network training method and device based on deep learning iterative matching. The method includes: using a three-dimensional model of a target object and an initial 6D pose estimate to obtain a rendered picture and a first segmentation mask of the target object; inputting the rendered picture, the first segmentation mask, an observation picture of the target object, and a second segmentation mask of the target object in the observation picture into a deep convolutional neural network to obtain a 6D pose estimate, a third segmentation mask, and optical flow; and updating the initial 6D pose estimate with the obtained 6D pose estimate, replacing the second segmentation mask with the third segmentation mask, and performing the above steps again, so as to iteratively train the deep convolutional neural network. The training method proposed in the embodiments of the present disclosure does not rely on depth information when refining the initial 6D pose estimate, and the estimation result is accurate. The method is robust to problems such as lighting and occlusion, and can handle both textured and untextured objects.

Description

6D pose estimation network training method and device based on deep learning iterative matching
This application claims priority to Chinese Patent Application No. 201811114456.5, entitled "6D pose estimation network training method and device based on deep learning iterative matching", filed with the Chinese Patent Office on September 25, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to a 6D pose estimation network training method and device based on deep learning iterative matching.
Background
Obtaining the pose of an object in 3D space from a 2D image is very important in many real-world applications. For example, in robotics, recognizing the 6D pose of an object, that is, its 3D position and 3D orientation, can provide key information for tasks such as grasping or motion planning; in virtual reality scenarios, accurate 6D object poses enable people to interact with objects.
In traditional techniques, a depth camera is generally used to estimate the pose of an object. However, depth cameras have many limitations, for example in frame rate, field of view, resolution, and depth range, which make it difficult for technologies that rely on depth cameras to detect small, transparent, or fast-moving objects. Estimating the 6D pose of an object using only RGB images, however, remains very challenging, because factors such as lighting, pose changes, and occlusion all affect the appearance of the object in the image. A robust 6D pose estimation method also needs to be able to handle both textured and untextured objects.
Recently, some deep-learning-based methods have been proposed to obtain the 6D pose estimate of an object from RGB images, generally by extending object detection or segmentation approaches. These methods are a considerable improvement over traditional RGB-only methods, but they still fall short of RGB-D-based methods. They therefore generally need to further exploit depth information and refine the initial pose estimate using ICP (Iterative Closest Point). However, ICP is sensitive to the initial estimate and may converge to a local minimum, especially under occlusion, and methods based on depth information are themselves limited by the depth camera.
Summary
In view of this, the present disclosure proposes a 6D pose estimation network training method and device based on deep learning iterative matching, to address the problems that the 6D pose estimates of objects obtained by existing deep learning methods are not accurate enough and that there is no method for improving a 6D pose estimate without relying on depth information.
In one aspect, a 6D pose estimation network training method based on deep learning iterative matching is proposed, the method including:
using a three-dimensional model of a target object and an initial 6D pose estimate to obtain a rendered picture and a first segmentation mask of the target object,
inputting the rendered picture, the first segmentation mask, an observation picture of the target object, and a second segmentation mask of the target object in the observation picture into a deep convolutional neural network to obtain a 6D pose estimate, a third segmentation mask, and optical flow,
updating the initial 6D pose estimate with the obtained 6D pose estimate, replacing the second segmentation mask with the third segmentation mask, and performing the above steps again, so as to iteratively train the deep convolutional neural network.
In another aspect, a 6D pose estimation network training device based on deep learning iterative matching is proposed, the device including:
an obtaining module, configured to use a three-dimensional model of a target object and an initial 6D pose estimate to obtain a rendered picture and a first segmentation mask of the target object,
an input module, configured to input the rendered picture, the first segmentation mask, an observation picture of the target object, and a second segmentation mask of the target object in the observation picture into a deep convolutional neural network to obtain a 6D pose estimate, a third segmentation mask, and optical flow,
an iteration module, configured to update the initial 6D pose estimate with the obtained 6D pose estimate, replace the second segmentation mask with the third segmentation mask, and perform the above steps again, so as to iteratively train the deep convolutional neural network.
In another aspect, a 6D pose estimation network training device based on deep learning iterative matching is proposed, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the above method when executing the processor-executable instructions.
In another aspect, a non-volatile computer-readable storage medium is proposed, on which computer program instructions are stored, wherein the computer program instructions implement the above method when executed by a processor.
The training method proposed in the embodiments of the present disclosure does not rely on depth information when refining the initial 6D pose estimate, and the estimation result is accurate. Since environmental conditions such as lighting and occlusion can be adjusted as needed during rendering, the method is robust to problems such as lighting and occlusion; moreover, since the segmentation mask can be obtained whether or not the object is textured, the method can handle both textured and untextured objects.
Other features and aspects of the present disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Brief Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features, and aspects of the present disclosure together with the specification, and serve to explain the principles of the present disclosure.
FIG. 1a shows a flowchart of a 6D pose estimation network training method based on deep learning iterative matching according to an embodiment of the present disclosure.
FIG. 1b shows a schematic diagram of a 6D pose estimation network training method based on deep learning iterative matching according to an embodiment of the present disclosure.
FIG. 2 shows a schematic diagram of an example of the enlargement operation according to an embodiment of the present disclosure.
FIG. 3 shows a flowchart of a method for training a 6D pose estimation model based on deep learning iterative matching according to an embodiment of the present disclosure.
FIG. 4 shows a schematic structural diagram of an example of a deep convolutional neural network according to an embodiment of the present disclosure.
FIG. 5 is a block diagram of a device 1900 for training a 6D pose estimation network based on deep learning iterative matching according to an exemplary embodiment.
Detailed Description
Various exemplary embodiments, features, and aspects of the present disclosure are described in detail below with reference to the accompanying drawings. The same reference numerals in the drawings denote elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.
The word "exemplary" as used herein means "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as superior to or better than other embodiments.
In addition, in order to better describe the present disclosure, numerous specific details are given in the following detailed description. A person skilled in the art should understand that the present disclosure can also be implemented without certain specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.
FIG. 1a shows a flowchart of a 6D pose estimation network training method based on deep learning iterative matching according to an embodiment of the present disclosure. FIG. 1b shows a schematic diagram of the method. As shown in FIG. 1a, the method includes:
S101: obtaining a rendered picture and a first segmentation mask of the target object by using the three-dimensional model of the target object and the initial 6D pose estimate pose(0).
The target object may be any object on which pose estimation is to be performed during network training, for example a physical object or a person. The initial 6D pose estimate pose(0) of the target object may be a preset initial value, or an initial value obtained by an estimation method of the related art. Rendering based on the three-dimensional model of the target object and the initial 6D pose estimate pose(0) yields the rendered picture and the first segmentation mask of the target object, where the first segmentation mask may be the segmentation mask of the target object in the rendered picture.
S102: inputting the rendered picture, the first segmentation mask, the observation picture of the target object, and the second segmentation mask of the target object in the observation picture into a deep convolutional neural network, to obtain a 6D pose estimate Δpose(0), a third segmentation mask, and optical flow.
The observation picture of the target object may be a picture obtained by photographing the actual target object. The second segmentation mask may be obtained from a segmentation annotation of the observation picture, and the segmentation annotation may be obtained using an object segmentation method of the related art. The deep convolutional neural network may have three branches that respectively regress the 6D pose estimate, the third segmentation mask, and the optical flow for use in iterative training.
S103: updating the initial 6D pose estimate pose(0) of step S101 with the obtained 6D pose estimate Δpose(0), replacing the second segmentation mask of step S102 with the third segmentation mask, and performing steps S101, S102, and S103 again, so as to iteratively train the deep convolutional neural network. Here, updating means that the obtained 6D pose estimate Δpose(0) and the initial 6D pose estimate pose(0) are combined by computation into a new 6D pose estimate, which serves as the input of the next iteration; a person skilled in the art can implement this by related technical means, and the present disclosure does not limit the specific implementation of the update.
Before steps S101 and S102 are performed again, the network parameters of the deep convolutional neural network can be adjusted according to the loss function. When the iterative training meets the training condition, training can be considered complete; the training condition can be set according to actual needs, for example, the value of the loss function falling below a threshold or the number of iterations reaching a threshold, which is not limited in the present disclosure.
The training method proposed in the embodiments of the present disclosure does not rely on depth information when refining the initial 6D pose estimate, and the estimation result is accurate. Since environmental conditions such as lighting and occlusion can be adjusted as needed during rendering, the method is robust to problems such as lighting and occlusion; moreover, since the segmentation mask can be obtained whether or not the object is textured, the method can handle both textured and untextured objects.
On the two major public benchmark datasets, LINEMOD and Occluded LINEMOD, this method achieves a large performance improvement over previous methods.
In a possible implementation, iteratively training the deep convolutional neural network may include: iteratively training the deep convolutional neural network using the SGD optimization algorithm.
The SGD (Stochastic Gradient Descent) optimization algorithm can be used to iteratively train the deep convolutional neural network and optimize the loss function until convergence, so as to achieve a better training effect.
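As a minimal illustration of this optimization step, the sketch below runs SGD in PyTorch on a stand-in model and dummy data; the model, the data, and the learning rate and momentum values are placeholders for illustration, not values given by this embodiment.

```python
import torch
import torch.nn as nn

# Stand-in model and data purely to make the loop runnable; in the method
# described here, the model would be the FlowNet-based pose network and the
# loss the weighted pose / flow / mask loss.
model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 7))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

for step in range(100):
    x = torch.randn(32, 8)          # dummy input batch
    target = torch.zeros(32, 7)     # dummy regression target
    loss = (model(x) - target).abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                # SGD update, repeated until convergence
```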
In a possible implementation, inputting the rendered picture, the first segmentation mask, the observation picture of the target object, and the second segmentation mask of the target object in the observation picture into the deep convolutional neural network includes:
performing an enlargement operation on the rendered picture of the target object together with the rectangular region bounding the second segmentation mask of the target object in the observation picture, so that the 2D projection center of the three-dimensional model of the target object is located at the center of the enlarged rendered picture and the target object in the observation picture lies entirely within the enlarged observation picture; and inputting the enlarged rendered picture, the first segmentation mask, the enlarged observation picture, and the second segmentation mask of the target object in the observation picture into the deep convolutional neural network.
FIG. 2 shows a schematic diagram of an example of the enlargement operation according to an embodiment of the present disclosure.
As shown in FIG. 2, the rectangular region bounding the second segmentation mask in the observation picture is enlarged to obtain the enlarged observation picture, and the rendered picture is enlarged to obtain the enlarged rendered picture, where the observation picture and the rendered picture can be enlarged by the same ratio. In the enlarged rendered picture, the 2D projection center of the three-dimensional model of the target object is located at the center of the enlarged rendered picture, and the target object in the observation picture lies entirely within the enlarged observation picture.
Since the deep convolutional neural network in this embodiment processes paired pictures (the rendered picture and the observation picture) enlarged around the target object, it is less affected by the scale of the object, and the estimation result is more accurate.
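A minimal sketch of such an enlargement operation is given below, assuming the 2D projection center of the model under the current pose is already available, the observation mask is a binary array, and a hypothetical output size of 256x256 with a margin factor of 1.4 is used; these specifics are illustrative assumptions rather than values fixed by the embodiment.

```python
import numpy as np
import cv2

def zoom_pair(rendered, observed, obs_mask, center_xy, out_size=256, margin=1.4):
    """Crop both pictures around the object and enlarge them by the same ratio.

    rendered, observed: HxWx3 images; obs_mask: HxW binary mask of the object
    in the observation picture; center_xy: 2D projection center (cx, cy) of the
    3D model under the current pose estimate.
    """
    ys, xs = np.nonzero(obs_mask)
    cx, cy = center_xy
    # Half-size of a square window centered on the projection center that
    # still contains the whole mask bounding box, with some margin.
    half = margin * max(np.abs(xs - cx).max(), np.abs(ys - cy).max(), 1.0)
    x0, x1 = int(cx - half), int(cx + half)
    y0, y1 = int(cy - half), int(cy + half)

    def crop_resize(img):
        # Pad so the crop window never falls outside the image.
        pad = int(half) + 1
        padded = cv2.copyMakeBorder(img, pad, pad, pad, pad, cv2.BORDER_CONSTANT)
        patch = padded[y0 + pad:y1 + pad, x0 + pad:x1 + pad]
        return cv2.resize(patch, (out_size, out_size))

    # Same window and same scale factor for the rendered and observed pictures.
    return crop_resize(rendered), crop_resize(observed)
```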
FIG. 3 shows a flowchart of a method for training a 6D pose estimation model based on deep learning iterative matching according to an embodiment of the present disclosure. As shown in FIG. 3, the method further includes:
performing 6D pose estimation on a target object to be estimated using the trained deep convolutional neural network through the following steps:
S104: obtaining a rendered picture and a fourth segmentation mask of the target object to be estimated by using the three-dimensional model of the target object to be estimated and an initial 6D pose estimate;
S105: inputting the rendered picture of the target object to be estimated, the fourth segmentation mask, an observation picture of the target object to be estimated, and a fifth segmentation mask of the target object to be estimated in the observation picture into the trained deep convolutional neural network, to obtain a 6D pose estimate;
S106: updating the initial 6D pose estimate of the target object to be estimated with the obtained 6D pose estimate, and performing steps S104 and S105 again, so as to iteratively refine the initial 6D pose estimate of the target object to be estimated.
Steps S104, S105, and S106 serve as the testing or use process of the network and are performed after the iterative training of the network is completed. During testing or use, the optical flow and segmentation mask branches of the deep convolutional neural network trained in steps S101-S103 are removed, and steps S104, S105, and S106 are repeated until a preset iterative convergence condition is met, at which point the 6D pose estimation result is obtained. The present disclosure places no restriction on the iterative convergence condition.
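The test-time loop of steps S104-S106 can be sketched as follows, with the renderer and the trained network (optical flow and mask branches removed) passed in as callables; the fixed number of iterations and the compose helper that applies the relative pose transformation are illustrative assumptions, since the embodiment leaves the convergence condition and the update computation open.

```python
def refine_pose(initial_pose, obs_image, obs_mask, render_fn, net_fn, num_iters=4):
    """Iteratively refine a 6D pose estimate (test-time use of the network).

    render_fn(pose) -> (rendered_image, rendered_mask): renders the 3D model of
        the object under `pose` (rendered mask = fourth segmentation mask).
    net_fn(rendered_image, rendered_mask, obs_image, obs_mask) -> delta_pose:
        the trained network with flow/mask branches removed, returning the
        relative pose transformation; `delta_pose.compose(pose)` is an assumed
        helper that applies the relative transformation to the current pose.
    obs_mask: initially predicted mask of the object (fifth segmentation mask).
    """
    pose = initial_pose
    for _ in range(num_iters):                      # assumed fixed-iteration stop rule
        rendered, rendered_mask = render_fn(pose)   # S104: render with current estimate
        delta = net_fn(rendered, rendered_mask, obs_image, obs_mask)  # S105
        pose = delta.compose(pose)                  # S106: update the pose estimate
    return pose
```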
In a possible implementation, inputting the rendered picture of the target object to be estimated, the fourth segmentation mask, the observation picture of the target object to be estimated, and the fifth segmentation mask of the target object to be estimated in the observation picture into the trained deep convolutional neural network includes: performing an enlargement operation on the rendered picture of the target object to be estimated together with the rectangular region bounding the initially predicted fifth segmentation mask of the target object to be estimated in the observation picture, so that the 2D projection center of the three-dimensional model of the target object to be estimated is located at the center of the enlarged rendered picture and the target object to be estimated in the observation picture lies entirely within the enlarged observation picture; and inputting the enlarged rendered picture, the fourth segmentation mask, the enlarged observation picture, and the fifth segmentation mask of the target object to be estimated in the observation picture into the trained deep convolutional neural network.
The initially predicted fifth segmentation mask may be a segmentation mask obtained by other related techniques. Through an enlargement process similar to that described above, the estimation result is less affected by the scale of the object and is therefore more accurate.
In a possible implementation, the 6D pose estimate output by the deep convolutional neural network may be represented by a relative pose transformation with respect to the target pose, where the target pose is the annotated pose of the target object in the observation picture and may be annotated manually or by other related pose recognition techniques. The relative pose transformation includes a relative rotation transformation and a relative translation transformation, and can be expressed in a way that decouples translation from rotation.
For the relative rotation transformation, the center point of the target object in the camera coordinate system can be used as the origin of the camera coordinate system, so that rotating the target object does not affect its translation in the camera coordinate system. Rotation is thereby decoupled from translation. The relative rotation transformation can be represented by a transformation matrix.
The relative translation transformation can be expressed by the offset and scale change in 2D pixel space, rather than directly by the coordinate difference in 3D space. The relative translation transformation can be represented by a transformation vector.
For example, assume that the relative translation transformation is t_Δ = (v_x, v_y, v_z), where v_x and v_y respectively denote the pixel offsets along the x and y directions between the target object in the rendered picture and the target object in the observation picture, and v_z denotes the scale-change factor of the target object. Further assume that the source translation of the target object in the rendered picture relative to the coordinate origin and the target translation of the target object in the observation picture relative to the coordinate origin are t_src = (x_src, y_src, z_src) and t_tgt = (x_tgt, y_tgt, z_tgt), respectively. Then the relative translation transformation can be obtained from the following formulas:
v_x = f_x (x_tgt / z_tgt - x_src / z_src),
v_y = f_y (y_tgt / z_tgt - y_src / z_src),
v_z = log(z_src / z_tgt),
where f_x and f_y are the focal lengths of the camera. The scale-change factor v_z is expressed as a ratio, so that it is independent of the absolute scale of the target object; the logarithm is used so that v_z = 0 corresponds to no change in scale. Considering that f_x and f_y are fixed constants, they can be treated as 1 during actual training of the network.
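The following numpy sketch transcribes the three formulas above and their inverse (recovering the target translation from a predicted (v_x, v_y, v_z)), with f_x = f_y = 1 as suggested for training; the function names and the round-trip example are illustrative.

```python
import numpy as np

def encode_translation(t_src, t_tgt, fx=1.0, fy=1.0):
    """Relative translation (v_x, v_y, v_z) between source and target translations."""
    xs, ys, zs = t_src
    xt, yt, zt = t_tgt
    vx = fx * (xt / zt - xs / zs)
    vy = fy * (yt / zt - ys / zs)
    vz = np.log(zs / zt)                  # log-ratio: vz = 0 means no scale change
    return np.array([vx, vy, vz])

def apply_translation(t_src, v, fx=1.0, fy=1.0):
    """Invert the encoding: recover t_tgt from t_src and (v_x, v_y, v_z)."""
    xs, ys, zs = t_src
    vx, vy, vz = v
    zt = zs / np.exp(vz)
    xt = (vx / fx + xs / zs) * zt
    yt = (vy / fy + ys / zs) * zt
    return np.array([xt, yt, zt])

# Round-trip check on an arbitrary example.
t_src = np.array([0.10, -0.05, 0.80])
t_tgt = np.array([0.12, -0.02, 0.70])
v = encode_translation(t_src, t_tgt)
assert np.allclose(apply_translation(t_src, v), t_tgt)
```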
This decoupled representation of the relative rotation and translation transformations makes the network easier to train and allows it to be applied to zero-shot learning, that is, refining the 6D pose estimates of models that have not been seen before.
In a possible implementation, the deep convolutional neural network may be built on the FlowNet model for predicting optical flow. The basic structure of the deep convolutional neural network may be the structure of the simple version of FlowNet for predicting optical flow, retaining its optical-flow prediction branch, adding segmentation masks to the input, and adding to the output a branch for predicting the segmentation mask and a branch for predicting the 6D pose estimate. The optical flow branch and the segmentation mask branch are used only during training, as auxiliaries that make training more stable; during testing and application, only the 6D pose estimation branch may be kept.
The input of the network may contain 8 channels, namely 3 channels of the observation picture, 1 channel of the segmentation mask of the observation picture, 3 channels of the rendered picture, and 1 channel of the segmentation mask of the rendered picture. The network weights of the added segmentation mask channels can be initialized to 0; other new layers can be randomly initialized, and the remaining layers that are the same as in the original FlowNet can keep their original weights.
When the network is used for pose estimation, a fully connected layer with 3 output neurons can be used for the relative translation transformation, and a fully connected layer with 4 output neurons can be used for the relative rotation transformation, where 4 indicates that a quaternion is used to represent the relative rotation transformation.
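The sketch below illustrates, with a toy backbone rather than the actual FlowNet-based structure, how the 8-channel input could be concatenated, how the weights of the two added mask channels of the first convolution could be zero-initialized, and how the two fully connected heads (3 neurons for translation, 4 for a quaternion rotation) could be attached; the channel ordering, layer sizes, and feature dimension are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PoseHead(nn.Module):
    """Toy stand-in for the FlowNet-based backbone plus the two pose heads."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # First conv takes 8 channels: observed RGB (3) + observed mask (1)
        # + rendered RGB (3) + rendered mask (1).
        self.conv1 = nn.Conv2d(8, 64, kernel_size=7, stride=2, padding=3)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, feat_dim)
        self.fc_trans = nn.Linear(feat_dim, 3)   # relative translation (vx, vy, vz)
        self.fc_rot = nn.Linear(feat_dim, 4)     # relative rotation as a quaternion
        with torch.no_grad():
            # Zero-initialize the weights of the two added mask channels
            # (assumed here to be channels 3 and 7 of the concatenated input).
            self.conv1.weight[:, 3].zero_()
            self.conv1.weight[:, 7].zero_()

    def forward(self, obs_rgb, obs_mask, ren_rgb, ren_mask):
        x = torch.cat([obs_rgb, obs_mask, ren_rgb, ren_mask], dim=1)  # N x 8 x H x W
        f = torch.relu(self.fc(self.pool(torch.relu(self.conv1(x))).flatten(1)))
        quat = torch.nn.functional.normalize(self.fc_rot(f), dim=1)   # unit quaternion
        return self.fc_trans(f), quat
```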
FIG. 4 shows a schematic structural diagram of an example of a deep convolutional neural network according to an embodiment of the present disclosure.
In this example, the network is based on the FlowNet Convs and FlowNet DeConvs (FlowNet convolution and deconvolution) models. During training, the enlarged rendered picture and its segmentation mask, together with the enlarged observation picture and its segmentation mask, are input into the FlowNet Convs model to obtain the 6D pose estimate (including the relative rotation transformation (Rotation) and the relative translation transformation (Translation)); the FlowNet DeConvs model obtains the optical flow and the segmentation mask (the third segmentation mask above) from the feature map produced by the FlowNet Convs model. After iterative training is complete, the optical flow and segmentation mask branches are removed when testing and when using the network for pose estimation.
In a possible implementation, in the iterative training, the loss function is formed as a weighted sum of the loss functions of the three branches of 6D pose estimation, optical flow, and the third segmentation mask. The optical flow branch and the third segmentation mask branch are used only for the iterative training.
For example, the following loss function can be used:
L = αL_pose + βL_flow + γL_mask,
where L_pose denotes the loss function of the 6D pose estimation branch, α the weight coefficient of the loss function of the 6D pose estimation branch, L_flow the loss function of the optical flow branch, β the weight coefficient of the loss function of the optical flow branch, L_mask the loss function of the third segmentation mask branch, and γ the weight coefficient of the loss function of the third segmentation mask branch.
The weight coefficients of the different branches can be set as needed, for example α = 0.1, β = 0.25, and γ = 0.03; L_flow can be the same as in the FlowNet model, and L_mask can use the sigmoid cross-entropy loss function.
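A small sketch of assembling this weighted loss with the example weights is shown below, assuming a pose loss computed elsewhere, a FlowNet-style endpoint-error loss for the optical flow branch, and a sigmoid cross-entropy loss for the mask branch; the tensor shapes and the helper signature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def total_loss(pose_loss, flow_pred, flow_gt, mask_logits, mask_gt,
               alpha=0.1, beta=0.25, gamma=0.03):
    """Weighted sum of the pose, optical flow, and segmentation mask losses."""
    # Optical flow branch: average endpoint error between predicted and
    # ground-truth flow fields of shape N x 2 x H x W.
    flow_loss = torch.norm(flow_pred - flow_gt, dim=1).mean()
    # Third segmentation mask branch: sigmoid cross-entropy against the
    # ground-truth mask (both N x 1 x H x W, mask_gt in {0, 1}).
    mask_loss = F.binary_cross_entropy_with_logits(mask_logits, mask_gt)
    return alpha * pose_loss + beta * flow_loss + gamma * mask_loss
```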
In a possible implementation, assume that the target pose for the observation picture is p = [R|t] and that the estimated pose is p̂ = [R̂|t̂]. Then the loss function of the 6D pose estimation branch can be:
L_pose = (1/n) Σ_{j=1..n} L1((R̂ x_j + t̂) - (R x_j + t)),
where R denotes the rotation in the target pose, R̂ the rotation in the estimated pose, t the translation in the target pose, t̂ the translation in the estimated pose, x_j the coordinates of the j-th point in the three-dimensional model of the target object, L_1 the 1-norm, and n the total number of points in the three-dimensional model. Here R, R̂, t, and t̂ denote absolute rotations and translations relative to the coordinate origin; R̂ and t̂ of the estimated pose can be obtained by composing the target pose with the relative rotation transformation and relative translation transformation output by the deep convolutional neural network.
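A direct numpy transcription of this point-matching loss is given below; R_est, t_est and R_tgt, t_tgt stand for the absolute rotations and translations of the estimated and target poses, and points holds the n model points (the names are illustrative).

```python
import numpy as np

def pose_point_matching_loss(R_est, t_est, R_tgt, t_tgt, points):
    """L_pose = (1/n) * sum_j || (R_est x_j + t_est) - (R_tgt x_j + t_tgt) ||_1.

    R_est, R_tgt: 3x3 rotation matrices; t_est, t_tgt: 3-vectors;
    points: n x 3 array of points x_j from the object's 3D model.
    """
    transformed_est = points @ R_est.T + t_est   # (R_est x_j + t_est) for all j
    transformed_tgt = points @ R_tgt.T + t_tgt   # (R_tgt x_j + t_tgt) for all j
    # 1-norm per point, averaged over the n model points.
    return np.abs(transformed_est - transformed_tgt).sum(axis=1).mean()
```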
In a possible implementation, when the number of pictures used for training and their segmentation masks is greater than or equal to a preset number, the optical flow branch and/or the third segmentation mask branch can be removed during the iterative training. In that case, the loss function also needs to be adjusted accordingly. A person skilled in the art can set the value of the preset number according to the actual situation, which is not limited in the present disclosure.
An embodiment of the present disclosure provides a 6D pose estimation network training device based on deep learning iterative matching, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to implement the above method when executing the processor-executable instructions.
An embodiment of the present disclosure provides a non-volatile computer-readable storage medium on which computer program instructions are stored, wherein the computer program instructions implement the above method when executed by a processor.
FIG. 5 is a block diagram of a device 1900 for training a 6D pose estimation network based on deep learning iterative matching according to an exemplary embodiment. For example, the device 1900 may be provided as a server. Referring to FIG. 5, the device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as an application program. The application program stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute the instructions to perform the above method.
The device 1900 may further include a power supply component 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and an input/output (I/O) interface 1958. The device 1900 can operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, for example a memory 1932 including computer program instructions, and the computer program instructions can be executed by the processing component 1922 of the device 1900 to complete the above method.
The present disclosure may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present disclosure.
The computer-readable storage medium may be a tangible device that can hold and store instructions used by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or an in-groove raised structure on which instructions are stored, and any suitable combination of the foregoing. The computer-readable storage medium as used herein is not to be interpreted as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (for example, light pulses through fiber-optic cables), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein can be downloaded from the computer-readable storage medium to respective computing/processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages, such as Smalltalk, C++, and the like, and conventional procedural programming languages, such as the "C" language or similar programming languages. The computer-readable program instructions may be executed entirely on a user's computer, partly on a user's computer, as a stand-alone software package, partly on a user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized using the state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions, thereby implementing various aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, thereby producing a machine, such that these instructions, when executed by the processor of the computer or the other programmable data processing apparatus, produce a means for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; these instructions cause a computer, a programmable data processing apparatus, and/or other devices to work in a specific manner, so that the computer-readable medium storing the instructions includes an article of manufacture that includes instructions implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or other devices, so that a series of operation steps are performed on the computer, the other programmable data processing apparatus, or the other devices to produce a computer-implemented process, so that the instructions executed on the computer, the other programmable data processing apparatus, or the other devices implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings show the possible architectures, functions, and operations of systems, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a part of instructions, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may also occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or can be implemented by a combination of dedicated hardware and computer instructions.
The embodiments of the present disclosure have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies in the market, or to enable other persons of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

  1. A 6D pose estimation network training method based on deep learning iterative matching, characterized in that the method comprises:
    using a three-dimensional model of a target object and an initial 6D pose estimate to obtain a rendered picture and a first segmentation mask of the target object,
    inputting the rendered picture, the first segmentation mask, an observation picture of the target object, and a second segmentation mask of the target object in the observation picture into a deep convolutional neural network to obtain a 6D pose estimate, a third segmentation mask, and optical flow,
    updating the initial 6D pose estimate with the obtained 6D pose estimate, replacing the second segmentation mask with the third segmentation mask, and performing the above steps again, so as to iteratively train the deep convolutional neural network.
  2. The method according to claim 1, characterized in that inputting the rendered picture, the first segmentation mask, the observation picture of the target object, and the second segmentation mask of the target object in the observation picture into the deep convolutional neural network comprises:
    performing an enlargement operation on the rendered picture of the target object together with the rectangular region bounding the second segmentation mask of the target object in the observation picture, so that the 2D projection center of the three-dimensional model of the target object is located at the center of the enlarged rendered picture, and the target object in the observation picture lies entirely within the enlarged observation picture;
    inputting the enlarged rendered picture, the first segmentation mask, the enlarged observation picture, and the second segmentation mask of the target object in the observation picture into the deep convolutional neural network.
  3. The method according to claim 1, characterized in that the method further comprises: performing 6D pose estimation on a target object to be estimated using the trained deep convolutional neural network through the following steps:
    using a three-dimensional model of the target object to be estimated and an initial 6D pose estimate to obtain a rendered picture and a fourth segmentation mask of the target object to be estimated;
    inputting the rendered picture of the target object to be estimated, the fourth segmentation mask, an observation picture of the target object to be estimated, and a fifth segmentation mask of the target object to be estimated in the observation picture into the trained deep convolutional neural network to obtain a 6D pose estimate;
    updating the initial 6D pose estimate of the target object to be estimated with the obtained 6D pose estimate, and performing the above steps again, so as to iteratively refine the initial 6D pose estimate of the target object to be estimated.
  4. The method according to claim 3, characterized in that inputting the rendered picture of the target object to be estimated, the fourth segmentation mask, the observation picture of the target object to be estimated, and the fifth segmentation mask of the target object to be estimated in the observation picture into the trained deep convolutional neural network comprises:
    performing an enlargement operation on the rendered picture of the target object to be estimated together with the rectangular region bounding the initially predicted fifth segmentation mask of the target object to be estimated in the observation picture, so that the 2D projection center of the three-dimensional model of the target object to be estimated is located at the center of the enlarged rendered picture, and the target object to be estimated in the observation picture lies entirely within the enlarged observation picture;
    inputting the enlarged rendered picture, the fourth segmentation mask, the enlarged observation picture, and the fifth segmentation mask of the target object to be estimated in the observation picture into the trained deep convolutional neural network.
  5. The method according to claim 1, characterized in that
    the 6D pose estimate output by the deep convolutional neural network is represented by a relative pose transformation with respect to a target pose, wherein the target pose is the pose of the target object in the observation picture, the relative pose transformation comprises a relative rotation transformation and a relative translation transformation, the relative rotation transformation uses the center point of the target object in the camera coordinate system as the origin of the camera coordinate system, and the relative translation transformation is expressed by an offset and a scale change in 2D pixel space.
  6. The method according to claim 1, characterized in that the deep convolutional neural network is built on a FlowNet model for predicting optical flow.
  7. The method according to claim 1, characterized in that the method further comprises:
    in the iterative training, forming a loss function as a weighted sum of the loss functions of the three branches of 6D pose estimation, optical flow, and the third segmentation mask, wherein the optical flow branch and the third segmentation mask branch are used only for the iterative training.
  8. The method according to claim 7, wherein the loss function of the 6D pose estimation branch is:
    L_pose = (1/n) Σ_{j=1..n} L1((R̂ x_j + t̂) - (R x_j + t)),
    where p = [R|t] is the target pose, p̂ = [R̂|t̂] is the estimated pose, R denotes the rotation in the target pose, R̂ denotes the rotation in the estimated pose, t denotes the translation in the target pose, t̂ denotes the translation in the estimated pose, x_j denotes the coordinates of the j-th point in the three-dimensional model of the target object, L_1 denotes the 1-norm, and n denotes the total number of points in the three-dimensional model.
  9. A 6D pose estimation network training device based on deep learning iterative matching, characterized in that it comprises:
    a processor;
    a memory for storing processor-executable instructions;
    wherein the processor is configured to implement the method according to any one of claims 1 to 8 when executing the processor-executable instructions.
  10. A non-volatile computer-readable storage medium on which computer program instructions are stored, characterized in that the computer program instructions implement the method according to any one of claims 1 to 8 when executed by a processor.
PCT/CN2019/106993 2018-09-25 2019-09-20 基于深度学习迭代匹配的6d姿态估计网络训练方法及装置 WO2020063475A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/023,919 US11200696B2 (en) 2018-09-25 2020-09-17 Method and apparatus for training 6D pose estimation network based on deep learning iterative matching

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811114456.5A CN109215080B (zh) 2018-09-25 2018-09-25 基于深度学习迭代匹配的6d姿态估计网络训练方法及装置
CN201811114456.5 2018-09-25

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/023,919 Continuation US11200696B2 (en) 2018-09-25 2020-09-17 Method and apparatus for training 6D pose estimation network based on deep learning iterative matching

Publications (1)

Publication Number Publication Date
WO2020063475A1 2020-04-02

Family

ID=64984702

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/106993 WO2020063475A1 (zh) 2018-09-25 2019-09-20 基于深度学习迭代匹配的6d姿态估计网络训练方法及装置

Country Status (3)

Country Link
US (1) US11200696B2 (zh)
CN (1) CN109215080B (zh)
WO (1) WO2020063475A1 (zh)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553949A (zh) * 2020-04-30 2020-08-18 张辉 基于单帧rgb-d图像深度学习对不规则工件的定位抓取方法
CN112528858A (zh) * 2020-12-10 2021-03-19 北京百度网讯科技有限公司 人体姿态估计模型的训练方法、装置、设备、介质及产品
CN113158773A (zh) * 2021-03-05 2021-07-23 普联技术有限公司 一种活体检测模型的训练方法及训练装置
CN113223058A (zh) * 2021-05-12 2021-08-06 北京百度网讯科技有限公司 光流估计模型的训练方法、装置、电子设备及存储介质
CN113450351A (zh) * 2021-08-06 2021-09-28 推想医疗科技股份有限公司 分割模型训练方法、图像分割方法、装置、设备及介质
CN115420277A (zh) * 2022-08-31 2022-12-02 北京航空航天大学 一种物体位姿测量方法及电子设备
CN113223058B (zh) * 2021-05-12 2024-04-30 北京百度网讯科技有限公司 光流估计模型的训练方法、装置、电子设备及存储介质

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109215080B (zh) 2018-09-25 2020-08-11 清华大学 基于深度学习迭代匹配的6d姿态估计网络训练方法及装置
WO2020061648A1 (en) * 2018-09-26 2020-04-02 Sitesee Pty Ltd Apparatus and method for three-dimensional object recognition
US10964015B2 (en) * 2019-01-15 2021-03-30 International Business Machines Corporation Product defect detection
CN109977847B (zh) * 2019-03-22 2021-07-16 北京市商汤科技开发有限公司 图像生成方法及装置、电子设备和存储介质
CN110322510B (zh) * 2019-06-27 2021-08-27 电子科技大学 一种利用轮廓信息的6d位姿估计方法
CN110660101B (zh) * 2019-08-19 2022-06-07 浙江理工大学 基于rgb图像和坐标系变换的物体6d姿势预测方法
CN110598771A (zh) * 2019-08-30 2019-12-20 北京影谱科技股份有限公司 一种基于深度语义分割网络的视觉目标识别方法和装置
CN110503689B (zh) 2019-08-30 2022-04-26 清华大学 位姿预测方法、模型训练方法及装置
CN111144401B (zh) * 2019-11-06 2024-01-26 华能国际电力股份有限公司海门电厂 一种电厂集控室深度学习和视觉伺服的触屏控制操作方法
CN111145253B (zh) * 2019-12-12 2023-04-07 深圳先进技术研究院 一种高效的物体6d姿态估计算法
CN111401230B (zh) * 2020-03-13 2023-11-28 深圳市商汤科技有限公司 姿态估计方法及装置、电子设备和存储介质
CN111489394B (zh) * 2020-03-16 2023-04-21 华南理工大学 物体姿态估计模型训练方法、系统、装置及介质
CN111415389B (zh) * 2020-03-18 2023-08-29 清华大学 基于强化学习的无标签六维物体姿态预测方法及装置
CN111462239B (zh) * 2020-04-03 2023-04-14 清华大学 姿态编码器训练及姿态估计方法及装置
CN111652798B (zh) * 2020-05-26 2023-09-29 浙江大华技术股份有限公司 人脸姿态迁移方法和计算机存储介质
CN111784772B (zh) * 2020-07-02 2022-12-02 清华大学 基于域随机化的姿态估计模型训练方法及装置
CN111783986A (zh) * 2020-07-02 2020-10-16 清华大学 网络训练方法及装置、姿态预测方法及装置
CN111968235B (zh) * 2020-07-08 2024-04-12 杭州易现先进科技有限公司 一种物体姿态估计方法、装置、系统和计算机设备
CN111932530B (zh) * 2020-09-18 2024-02-23 北京百度网讯科技有限公司 三维对象检测方法、装置、设备和可读存储介质
CN112508007B (zh) * 2020-11-18 2023-09-29 中国人民解放军战略支援部队航天工程大学 基于图像分割Mask和神经渲染的空间目标6D姿态估计方法
CN112508027B (zh) * 2020-11-30 2024-03-26 北京百度网讯科技有限公司 用于实例分割的头部模型、实例分割模型、图像分割方法及装置
CN113192141A (zh) * 2020-12-10 2021-07-30 中国科学院深圳先进技术研究院 一种6d姿态估计方法
CN112668492B (zh) * 2020-12-30 2023-06-20 中山大学 一种自监督学习与骨骼信息的行为识别方法
CN112802032A (zh) * 2021-01-19 2021-05-14 上海商汤智能科技有限公司 图像分割网络的训练和图像处理方法、装置、设备及介质
CN112767486B (zh) * 2021-01-27 2022-11-29 清华大学 基于深度卷积神经网络的单目6d姿态估计方法及装置
CN112991445B (zh) * 2021-03-03 2023-10-24 网易(杭州)网络有限公司 模型训练方法、姿态预测方法、装置、设备及存储介质
CN113743189B (zh) * 2021-06-29 2024-02-02 杭州电子科技大学 一种基于分割引导的人体姿态识别方法
CN113470124B (zh) * 2021-06-30 2023-09-22 北京达佳互联信息技术有限公司 特效模型的训练方法及装置、特效生成方法及装置
WO2023004558A1 (en) * 2021-07-26 2023-02-02 Shanghaitech University Neural implicit function for end-to-end reconstruction of dynamic cryo-em structures
CN113592991B (zh) * 2021-08-03 2023-09-05 北京奇艺世纪科技有限公司 一种基于神经辐射场的图像渲染方法、装置及电子设备
CN114119753A (zh) * 2021-12-08 2022-03-01 北湾科技(武汉)有限公司 面向机械臂抓取的透明物体6d姿态估计方法
DE102022106765B3 (de) 2022-03-23 2023-08-24 Bayerische Motoren Werke Aktiengesellschaft Verfahren zum Bestimmen einer Lage eines Objekts relativ zu einer Erfassungseinrichtung, Computerprogramm und Datenträger
CN115578563A (zh) * 2022-10-19 2023-01-06 中国科学院空间应用工程与技术中心 一种图像分割方法、系统、存储介质和电子设备
CN116416307B (zh) * 2023-02-07 2024-04-02 浙江大学 基于深度学习的预制构件吊装拼接3d视觉引导方法
CN117689990A (zh) * 2024-02-02 2024-03-12 南昌航空大学 一种基于6d姿态估计的三支流双向融合网络方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8744169B2 (en) * 2011-05-31 2014-06-03 Toyota Motor Europe Nv/Sa Voting strategy for visual ego-motion from stereo
CN104851094A (zh) * 2015-05-14 2015-08-19 西安电子科技大学 一种基于rgb-d的slam算法的改进方法
CN106846382A (zh) * 2017-01-22 2017-06-13 深圳市唯特视科技有限公司 一种基于直方图控制点的图像配准目标检测方法
CN109215080A (zh) * 2018-09-25 2019-01-15 清华大学 基于深度学习迭代匹配的6d姿态估计网络训练方法及装置

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10235771B2 (en) * 2016-11-11 2019-03-19 Qualcomm Incorporated Methods and systems of performing object pose estimation
CN106897697A (zh) * 2017-02-24 2017-06-27 深圳市唯特视科技有限公司 一种基于可视化编译器的人物和姿势检测方法
CN107622257A (zh) * 2017-10-13 2018-01-23 深圳市未来媒体技术研究院 一种神经网络训练方法及三维手势姿态估计方法
US10885659B2 (en) * 2018-01-15 2021-01-05 Samsung Electronics Co., Ltd. Object pose estimating method and apparatus
CN108171748B (zh) * 2018-01-23 2021-12-07 哈工大机器人(合肥)国际创新研究院 一种面向机器人智能抓取应用的视觉识别与定位方法
US10977827B2 (en) * 2018-03-27 2021-04-13 J. William Mauchly Multiview estimation of 6D pose
WO2020010328A1 (en) * 2018-07-05 2020-01-09 The Regents Of The University Of Colorado, A Body Corporate Multi-modal fingertip sensor with proximity, contact, and force localization capabilities
CN112823376A (zh) * 2018-10-15 2021-05-18 文塔纳医疗系统公司 图像增强以实现改善的核检测和分割
US11034026B2 (en) * 2019-01-10 2021-06-15 General Electric Company Utilizing optical data to dynamically control operation of a snake-arm robot
US11030766B2 (en) * 2019-03-25 2021-06-08 Dishcraft Robotics, Inc. Automated manipulation of transparent vessels
EP3869392A1 (en) * 2020-02-20 2021-08-25 Toyota Jidosha Kabushiki Kaisha Object pose estimation from an rgb image by a neural network outputting a parameter map

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8744169B2 (en) * 2011-05-31 2014-06-03 Toyota Motor Europe Nv/Sa Voting strategy for visual ego-motion from stereo
CN104851094A (zh) * 2015-05-14 2015-08-19 西安电子科技大学 一种基于rgb-d的slam算法的改进方法
CN106846382A (zh) * 2017-01-22 2017-06-13 深圳市唯特视科技有限公司 一种基于直方图控制点的图像配准目标检测方法
CN109215080A (zh) * 2018-09-25 2019-01-15 清华大学 基于深度学习迭代匹配的6d姿态估计网络训练方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CAO, XUN ET AL.: "3D Spatial Reconstruction and Communication from Vision Field", IEEE XPLORE, 31 December 2012 (2012-12-31), XP032228389 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111553949A (zh) * 2020-04-30 2020-08-18 张辉 基于单帧rgb-d图像深度学习对不规则工件的定位抓取方法
CN112528858A (zh) * 2020-12-10 2021-03-19 北京百度网讯科技有限公司 人体姿态估计模型的训练方法、装置、设备、介质及产品
CN113158773A (zh) * 2021-03-05 2021-07-23 普联技术有限公司 一种活体检测模型的训练方法及训练装置
CN113158773B (zh) * 2021-03-05 2024-03-22 普联技术有限公司 一种活体检测模型的训练方法及训练装置
CN113223058A (zh) * 2021-05-12 2021-08-06 北京百度网讯科技有限公司 光流估计模型的训练方法、装置、电子设备及存储介质
CN113223058B (zh) * 2021-05-12 2024-04-30 北京百度网讯科技有限公司 光流估计模型的训练方法、装置、电子设备及存储介质
CN113450351A (zh) * 2021-08-06 2021-09-28 推想医疗科技股份有限公司 分割模型训练方法、图像分割方法、装置、设备及介质
CN113450351B (zh) * 2021-08-06 2024-01-30 推想医疗科技股份有限公司 分割模型训练方法、图像分割方法、装置、设备及介质
CN115420277A (zh) * 2022-08-31 2022-12-02 北京航空航天大学 一种物体位姿测量方法及电子设备
CN115420277B (zh) * 2022-08-31 2024-04-12 北京航空航天大学 一种物体位姿测量方法及电子设备

Also Published As

Publication number Publication date
US20210004984A1 (en) 2021-01-07
CN109215080A (zh) 2019-01-15
US11200696B2 (en) 2021-12-14
CN109215080B (zh) 2020-08-11

Similar Documents

Publication Publication Date Title
WO2020063475A1 (zh) 基于深度学习迭代匹配的6d姿态估计网络训练方法及装置
JP7236545B2 (ja) ビデオターゲット追跡方法と装置、コンピュータ装置、プログラム
US9213899B2 (en) Context-aware tracking of a video object using a sparse representation framework
US10803546B2 (en) Systems and methods for unsupervised learning of geometry from images using depth-normal consistency
US11880977B2 (en) Interactive image matting using neural networks
WO2019174377A1 (zh) 一种基于单目相机的三维场景稠密重建方法
CN106204522B (zh) 对单个图像的联合深度估计和语义标注
Koestler et al. Tandem: Tracking and dense mapping in real-time using deep multi-view stereo
US8958630B1 (en) System and method for generating a classifier for semantically segmenting an image
CN111739005B (zh) 图像检测方法、装置、电子设备及存储介质
US20190057532A1 (en) Realistic augmentation of images and videos with graphics
CN110838122B (zh) 点云的分割方法、装置及计算机存储介质
CN110706262B (zh) 图像处理方法、装置、设备及存储介质
WO2022218012A1 (zh) 特征提取方法、装置、设备、存储介质以及程序产品
CA3137297C (en) Adaptive convolutions in neural networks
US11948310B2 (en) Systems and methods for jointly training a machine-learning-based monocular optical flow, depth, and scene flow estimator
WO2023109361A1 (zh) 用于视频处理的方法、系统、设备、介质和产品
CN110288691B (zh) 渲染图像的方法、装置、电子设备和计算机可读存储介质
WO2023178951A1 (zh) 图像分析方法、模型的训练方法、装置、设备、介质及程序
CN112085842A (zh) 深度值确定方法及装置、电子设备和存储介质
KR20230083212A (ko) 객체 자세 추정 장치 및 방법
KR20230078502A (ko) 이미지 처리 장치 및 방법
CN115375847A (zh) 材质恢复方法、三维模型的生成方法和模型的训练方法
KR20230128284A (ko) 변형가능 모델들에 의한 3차원 스캔 등록
Kaviani et al. Semi-Supervised 3D hand shape and pose estimation with label propagation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19866700

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19866700

Country of ref document: EP

Kind code of ref document: A1