CN111260680B - RGBD camera-based unsupervised pose estimation network construction method - Google Patents
RGBD camera-based unsupervised pose estimation network construction method
- Publication number
- CN111260680B CN111260680B CN202010034081.2A CN202010034081A CN111260680B CN 111260680 B CN111260680 B CN 111260680B CN 202010034081 A CN202010034081 A CN 202010034081A CN 111260680 B CN111260680 B CN 111260680B
- Authority
- CN
- China
- Prior art keywords
- network
- sequence
- pose
- image
- camera
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an RGBD camera-based unsupervised pose estimation network construction method. Estimating camera motion from images is a major research topic for vision-based mobile robots. Traditional methods are prone to failure in environments with low texture, complex geometry, difficult illumination, or occlusion, while most deep-learning-based methods require additional supervision data, which complicates the work and increases cost. The disclosed convolutional-neural-network method remedies the shortcomings of traditional approaches: it exploits the distance information of the depth image, combines it with classical geometric knowledge, and adds constraints through forward- and reverse-order inputs, so that the network can accurately estimate the camera pose.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to an RGBD camera-based unsupervised pose estimation network construction method.
Background
Simultaneous Localization and Mapping (SLAM) is an important research direction of machine vision; through thirty years of development, SLAM and its related technologies have become research hotspots in robotics, image processing, deep learning, structure from motion, augmented reality, and other fields. Recovering structure from consecutive frames, i.e. Structure from Motion (SFM), is a major focus. Although conventional SFM methods are effective in many cases, they rely on accurate image correspondences and are prone to failure in environments with low texture, complex geometry, difficult lighting, or occlusion.
To address this problem, and with the development of deep learning in recent years, methods based on deep learning have been proposed and applied to the various stages of the conventional SFM pipeline. Because these methods train on large data sets and use accurate external supervision during training, inter-frame estimation in a fixed scene becomes more accurate, compensating for the shortcomings of traditional methods. However, external supervision data, particularly the ground-truth relative motion between frames, is not easily obtained and requires additional sensors such as an IMU or GPS, which complicates the task and increases cost.
Depth cameras have been widely used in SLAM research in recent years, since accurate color maps and the corresponding depth maps can be acquired easily. The color map carries rich feature information: a neural network can learn feature representations from it and perceive feature correlations between different frames, while the depth map provides distance information about objects. Fusing the depth map's distance information with deep learning, combined with geometric knowledge, offers a new idea for unsupervised networks.
Disclosure of Invention
The invention aims to overcome the shortcomings of traditional methods and the limitations of supervised deep learning by providing an unsupervised pose estimation network construction method based on an RGBD camera. The method not only uses deep learning to learn the inter-frame transformation relationship, but also combines it with classical geometric knowledge: the inter-frame transformation produced by the network and the distance information of the depth map guide the network toward more accurate results, achieving an unsupervised effect, and constraints added through forward- and reverse-order networks make the network's estimates more accurate. The method comprises the following specific steps:
step (1): obtaining color images and depth images of the same scene with an RGB-D camera
Capture continuous color images I_{t−1}, I_t, I_{t+1} with the RGB-D camera, together with the corresponding continuous depth images D_{t−1}, D_t, D_{t+1}, each of resolution H × W, where H and W are the height and width of the image respectively; select the color image I_t at time t and its adjacent images I_{t−1} and I_{t+1}; each color image has three RGB channels, and the three color images I_{t−1}, I_t, I_{t+1} are spliced along the channel dimension into a 9-channel input feature_0, wherein feature refers to a feature map obtained after a convolution layer and 0 refers to the 0th convolution operation, i.e. the raw input;
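As an illustration of this splice, the following sketch (PyTorch is assumed here; the patent does not prescribe a framework, and the function name is ours) concatenates the three RGB frames along the channel axis:

```python
import torch

def make_feature_0(I_prev: torch.Tensor, I_t: torch.Tensor, I_next: torch.Tensor) -> torch.Tensor:
    # Each frame has shape (B, 3, H, W); concatenating along the channel
    # dimension yields the 9-channel input the patent denotes feature_0.
    return torch.cat([I_prev, I_t, I_next], dim=1)  # (B, 9, H, W)
```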
step (2): learning the inter-frame structural relationship with the pose network
The pose network is composed of a convolutional neural network, with a ReLU activation layer after each convolution layer. The 9-channel sequence feature_0 first passes through a convolution layer with a 7 × 7 kernel, then a convolution layer with a 5 × 5 kernel, and then five convolution layers with 3 × 3 kernels, yielding a 256-channel feature map; a convolution layer with a 1 × 1 kernel then reduces the dimensionality to 12 channels. Finally, averaging over the H and W dimensions collapses each channel to a single number, giving a group of 12 numbers. This group is split into two groups of 6 numbers, denoted T_{t→t−1} and T_{t→t+1} respectively. For T_{t→t−1}, the first three numbers represent the displacement from the coordinate system of I_t to that of I_{t−1}, and the last three represent the corresponding rotation expressed in Euler angles; T_{t→t+1} is interpreted in the same way;
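A hedged sketch of this layer sequence follows (PyTorch assumed; the patent does not state the convolution strides or intermediate channel widths, so the stride-2, channel-doubling schedule below is an assumption borrowed from SfM-Learner-style encoders):

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """Pose network sketch: 7x7, 5x5, then five 3x3 conv layers (ReLU after
    each), a 1x1 reduction to 12 channels, and a global average over H, W."""
    def __init__(self):
        super().__init__()
        chans = [9, 16, 32, 64, 128, 256, 256, 256]  # widths are assumptions
        kernels = [7, 5, 3, 3, 3, 3, 3]
        layers = []
        for c_in, c_out, k in zip(chans[:-1], chans[1:], kernels):
            layers += [nn.Conv2d(c_in, c_out, k, stride=2, padding=k // 2),
                       nn.ReLU(inplace=True)]
        self.encoder = nn.Sequential(*layers)
        self.head = nn.Conv2d(256, 12, kernel_size=1)  # 1x1 dimension reduction

    def forward(self, x):                  # x: (B, 9, H, W), i.e. feature_0
        f = self.head(self.encoder(x))     # (B, 12, H', W')
        pose = f.mean(dim=[2, 3])          # average over H and W -> (B, 12)
        # Split into T_{t->t-1} and T_{t->t+1}: 3 displacement values
        # followed by 3 Euler angles each.
        return pose[:, :6], pose[:, 6:]
```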
and (3): completing self-supervision by using the inter-frame camera pose relation combined with the distance information of the depth map and geometric knowledge:
for imagesThe corresponding depth map isImage corresponding to time t +1Has a conversion relation of T t→t+1 (ii) a For imagesA certain point pixel onCorrespond toIs asThe camera projection model and the inter-frame triangular relation can be obtainedCorrespond toIs formed by a plurality of pixelsThere is a relationship:
p_{t+1} ~ K · T_{t→t+1} · D_t(p_t) · K^{−1} · p_t ①
wherein K is the camera intrinsic matrix; according to equation ①, each pixel p_t of I_t is mapped to the corresponding location p_{t+1} in the space of I_{t+1}; from the pixel values at these projected locations and the positions of the initial pixels, the synthesized view Î_t corresponding to I_t is then obtained by differentiable bilinear sampling interpolation. Each pixel value of the synthesized view is not a simple lookup: differentiable bilinear sampling obtains it by weighting the four pixels around the projected location;
Î_t(p_t) = Σ_{i ∈ {top, bottom}, j ∈ {left, right}} w^{ij} · I_{t+1}(p_{t+1}^{ij}) ②
where i ∈ {top, bottom} and j ∈ {left, right}, so that p_{t+1}^{ij} denotes the four pixels surrounding the projected location p_{t+1}, and w^{ij} denotes the weight of each of the four pixels, with Σ w^{ij} = 1. Once the synthesized view Î_t is obtained, it forms self-supervision with the original view I_t, giving the loss function between the two frames:
L_photo = Σ_p | I_t(p) − Î_t(p) | ③
Thus, by synthesizing a new view from the depth map and constructing a photometric error, self-supervision is achieved without any external supervision;
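A sketch of equations ① to ③ follows (PyTorch assumed; the 6-dimensional pose is taken to have been converted beforehand to a 4 × 4 homogeneous matrix, and out-of-view handling is omitted; both are our simplifications, not details from the patent):

```python
import torch
import torch.nn.functional as F

def inverse_warp(I_next, D_t, T, K):
    """Synthesize the view at time t from I_{t+1} via equation 1,
    sampling with differentiable bilinear interpolation (equation 2).
    I_next: (B, 3, H, W); D_t: (B, 1, H, W); T: (B, 4, 4); K: (B, 3, 3)."""
    B, _, H, W = I_next.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).view(1, 3, -1).expand(B, -1, -1)
    # Back-project: X = D_t(p_t) * K^-1 * p_t, then re-project:
    # p_{t+1} ~ K * T_{t->t+1} * X                      (equation 1)
    cam = torch.linalg.inv(K) @ pix * D_t.view(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)
    proj = K @ (T @ cam_h)[:, :3, :]
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    # Normalize to [-1, 1]; grid_sample performs the 4-neighbour weighting.
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(B, H, W, 2)
    return F.grid_sample(I_next, grid, mode="bilinear", align_corners=True)

def photometric_loss(I_t, I_synth):
    # Equation 3: photometric error between original and synthesized views.
    return (I_t - I_synth).abs().mean()
```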
and (4): preventing corruption of the training gradients with a mask network
When the traditional geometric knowledge of the previous step is used, preconditions such as the absence of dynamic objects and occluding objects in the image must be met; a mask network is provided to prevent violations of these preconditions from inhibiting network training. The mask network shares the first five convolution layers with the pose network and is trained together with it; upsampling through four 4 × 4 convolution layers and one 3 × 3 convolution layer yields the mask I_t^mask corresponding to a sequence. With the mask value P_t^mask(p) for each pixel p, the loss function for two frames becomes, from equation ③:
L_mask = Σ_p P_t^mask(p) · | I_t(p) − Î_t(p) | ④
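A minimal sketch of this masked loss (the mask is assumed to come from a sigmoid so its values lie in [0, 1]; note that in SfM-Learner-style training such a mask additionally needs a regularization term pulling it toward 1, otherwise the trivial all-zero mask minimizes the loss):

```python
def masked_photometric_loss(I_t, I_synth, mask):
    # Equation 4: weight the per-pixel photometric error by the predicted
    # mask P_t^mask, down-weighting pixels that violate the static-scene
    # assumption (dynamic objects, occlusions). mask: (B, 1, H, W) in [0, 1].
    return (mask * (I_t - I_synth).abs()).mean()
```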
And (5): adding constraints through a reverse-order network so that the network estimates the relative pose between frames more accurately
For forward-order input, the input sequence is (I_{t−1}, I_t, I_{t+1}); the image input of the reverse-order network is (I_{t+1}, I_t, I_{t−1}). A good pose estimation network can estimate the inter-frame pose relationship not only for a forward-order sequence but also when the image sequence is input in reverse order, and this increases the constraints on the network. For a sequence of three images, the poses obtained by the network for the forward order are T_{t→t−1} and T_{t→t+1}, and the poses obtained for the reverse order are T′_{t→t+1} and T′_{t→t−1}; ideally the forward-order and reverse-order estimates of the same transformation are identical. However, a network estimate always carries an error, and this error provides an additional constraint through the following loss function:
L_pose = Σ_{s ∈ {t−1, t+1}} ( ‖ t_{t→s} − t′_{t→s} ‖_1 + ω · ‖ r_{t→s} − r′_{t→s} ‖_1 ) ⑤
Here t_{t→s} denotes the displacement estimated by the network for the forward-order input and t′_{t→s} the displacement estimated for the reverse-order input; r_{t→s} denotes the rotation estimated for the forward-order input and r′_{t→s} the rotation estimated for the reverse-order input; ω denotes the weight balancing the rotation term;
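A minimal sketch of this loss (the residual form and the placement of the weight ω follow equation ⑤; the tensor and function names are ours):

```python
def forward_reverse_loss(t_fwd, r_fwd, t_rev, r_rev, omega=1.0):
    # The reverse-order pass re-estimates the same inter-frame transforms,
    # so forward and reverse estimates should agree; penalize their residual.
    return (t_fwd - t_rev).abs().mean() + omega * (r_fwd - r_rev).abs().mean()
```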
therefore, the pose network is trained by adding constraints, so that the network has the capability of accurately estimating the relative motion between frames.
The invention has the beneficial effects that: the deep learning method finds the association between adjacent frames from the feature information of the color images; by using the distance information provided by the depth image and combining it with traditional geometric methods, the network avoids cumbersome external supervision and achieves unsupervised learning; and the constraints added through the forward- and reverse-order networks make the estimate of the camera's motion more accurate.
Drawings
FIG. 1 is a single-sequence flow chart of the present invention;
FIG. 2 is an image reconstruction process;
FIG. 3 is the combined forward- and reverse-order network.
Detailed Description
The invention is further described below with reference to the accompanying drawings; the method comprises the following steps:
step (1): obtaining color images and depth images of the same scene with an RGB-D camera
Use the RGB-D camera to obtain continuous color images I_{t−1}, I_t, I_{t+1} and the corresponding continuous depth images D_{t−1}, D_t, D_{t+1}, each of resolution H × W, where H and W are the height and width of the image respectively. Select the color image I_t at time t and its adjacent images I_{t−1} and I_{t+1}. Each color image has three RGB channels; the three color images I_{t−1}, I_t, I_{t+1} are spliced along the channel dimension into a 9-channel input feature_0 (where feature refers to a feature map obtained after a convolution layer, and 0 refers to the 0th convolution operation, i.e. the raw input).
Step (2): learning of inter-frame structure relationship based on convolutional neural network
The pose network is mainly composed of a convolutional neural network, with a ReLU activation layer after each convolution layer. The 9-channel sequence feature_0 first passes through a convolution layer with a 7 × 7 kernel, then a convolution layer with a 5 × 5 kernel, and then five convolution layers with 3 × 3 kernels, yielding a 256-channel feature map; a convolution layer with a 1 × 1 kernel then reduces the dimensionality to 12 channels. Finally, averaging over the H and W dimensions collapses each channel to a single number, giving a group of 12 numbers. This group is split into two groups of 6 numbers, denoted T_{t→t−1} and T_{t→t+1} respectively. For T_{t→t−1}, the first three numbers represent the displacement from the coordinate system of I_t to that of I_{t−1}, and the last three represent the corresponding rotation expressed in Euler angles; T_{t→t+1} is interpreted in the same way.
And (3): complete self-supervision by using the inter-frame camera pose relation combined with the distance information of the depth map and geometric knowledge. For image I_t the corresponding depth map is D_t, and the transformation to the image I_{t+1} at time t+1 is T_{t→t+1}. For a pixel p_t on image I_t, the corresponding depth value is D_t(p_t). From the camera projection model and the inter-frame triangular relation, the pixel p_{t+1} on I_{t+1} corresponding to p_t satisfies:
p_{t+1} ~ K · T_{t→t+1} · D_t(p_t) · K^{−1} · p_t ①
wherein K is the camera intrinsic matrix. According to equation ①, each pixel p_t of I_t is easily mapped to the corresponding location p_{t+1} in the space of I_{t+1}; from the pixel values at these projected locations and the positions of the initial pixels, the synthesized view Î_t corresponding to I_t is obtained by differentiable bilinear sampling interpolation. Each pixel value of the synthesized view is not a simple lookup: differentiable bilinear sampling obtains it by weighting the four pixels around the projected location, as shown in FIG. 2 of the accompanying drawings:
Î_t(p_t) = Σ_{i ∈ {top, bottom}, j ∈ {left, right}} w^{ij} · I_{t+1}(p_{t+1}^{ij}) ②
Where i ∈ {top, bottom} and j ∈ {left, right}, so that p_{t+1}^{ij} denotes the four pixels surrounding the projected location p_{t+1}, and w^{ij} denotes the weight of each of the four pixels, with Σ w^{ij} = 1. Once the synthesized view Î_t is obtained, it forms self-supervision with the original view I_t, giving the loss function between the two frames:
L_photo = Σ_p | I_t(p) − Î_t(p) | ③
Thus, by synthesizing a new view from the depth map and constructing a photometric error, self-supervision without external supervision is achieved.
And (4): use the mask network to prevent the training gradients from being corrupted. Because the conventional geometric knowledge used in the previous step presupposes that the image contains no dynamic objects, no occluding objects, and the like, a mask network is provided to prevent violations of these preconditions from inhibiting network training. The mask network is trained together with the pose network; upsampling through four 4 × 4 convolution layers and one 3 × 3 convolution layer yields the mask I_t^mask corresponding to a sequence. With the mask value P_t^mask(p) for each pixel p, the loss function for two frames becomes, from equation ③:
L_mask = Σ_p P_t^mask(p) · | I_t(p) − Î_t(p) | ④
And (5): use a reverse-order network, whose structure is shown in FIG. 3, to add constraints so that the network estimates the relative pose between frames more accurately. For forward-order input, the input sequence is (I_{t−1}, I_t, I_{t+1}); the image input of the reverse-order network is (I_{t+1}, I_t, I_{t−1}). A good pose estimation network can estimate the inter-frame pose relationship not only for a forward-order sequence but also when the image sequence is input in reverse order, and this increases the constraints on the network. For a sequence of three images, the poses obtained by the network for the forward order are T_{t→t−1} and T_{t→t+1}, and the poses obtained for the reverse order are T′_{t→t+1} and T′_{t→t−1}; ideally the forward-order and reverse-order estimates of the same transformation are identical. However, a network estimate always carries an error, and this error provides an additional constraint through the following loss function:
L_pose = Σ_{s ∈ {t−1, t+1}} ( ‖ t_{t→s} − t′_{t→s} ‖_1 + ω · ‖ r_{t→s} − r′_{t→s} ‖_1 ) ⑤
Here t_{t→s} denotes the displacement estimated by the network for the forward-order input and t′_{t→s} the displacement estimated for the reverse-order input; r_{t→s} denotes the rotation estimated for the forward-order input and r′_{t→s} the rotation estimated for the reverse-order input; ω denotes the weight balancing the rotation term.
Therefore, the pose network is trained with the added constraints, so that the network has the capability of accurately estimating the relative motion between frames.
Claims (1)
1. An RGBD camera-based unsupervised pose estimation network construction method is characterized by comprising the following specific steps:
step (1): obtaining color images and depth images of the same scene using an RGB-D camera
continuous color images I_{t−1}, I_t, I_{t+1} are captured with the RGB-D camera, together with the corresponding continuous depth images D_{t−1}, D_t, D_{t+1}, each of resolution H × W, where H and W are the height and width of the image respectively; a color image I_t at time t and the images I_{t−1}, I_{t+1} adjacent to it are selected; each color image has three RGB channels, and the three color images I_{t−1}, I_t, I_{t+1} are spliced along the channel dimension into a 9-channel sequence feature_0, wherein feature refers to a feature map obtained after convolution, and 0 refers to the 0th convolution operation, i.e. the raw input;
step (2): pose network based learning of inter-frame structural relationships
the pose network is formed by a convolutional neural network, with a ReLU activation layer after each convolution layer; the 9-channel sequence feature_0 first passes through a convolution layer with a 7 × 7 kernel, then a convolution layer with a 5 × 5 kernel, and then five convolution layers with 3 × 3 kernels, yielding a 256-channel feature map; a convolution layer with a 1 × 1 kernel then reduces the dimensionality to 12 channels; finally, averaging over the H and W dimensions collapses each channel to a single number, giving a group of 12 numbers; this group is split into two groups of 6 numbers, denoted T_{t→t−1} and T_{t→t+1} respectively; for T_{t→t−1}, the first three numbers represent the displacement from the coordinate system of I_t to that of I_{t−1}, and the last three represent the corresponding rotation expressed in Euler angles; T_{t→t+1} is interpreted in the same way;
and (3): completing self-supervision by using the inter-frame camera pose relation combined with the distance information of the depth map and geometric knowledge:
for imagesThe corresponding depth map isImage corresponding to time t +1Has a conversion relation of T t→t+1 (ii) a For imagesA certain point pixel onIs correspondingly provided withIs asThe camera projection model and the inter-frame triangular relation can be used for obtainingCorrespond toIs formed by a plurality of pixelsThere is a relationship:
p_{t+1} ~ K · T_{t→t+1} · D_t(p_t) · K^{−1} · p_t ①
wherein K is the camera intrinsic matrix; according to equation ①, each pixel p_t of I_t is mapped to the corresponding location p_{t+1} in the space of I_{t+1}; from the pixel values at these projected locations and the positions of the initial pixels, the synthesized view Î_t corresponding to I_t is obtained by differentiable bilinear sampling interpolation; each pixel value of the synthesized view is not a simple lookup: differentiable bilinear sampling obtains it by weighting the four pixels around the projected location;
Î_t(p_t) = Σ_{i ∈ {top, bottom}, j ∈ {left, right}} w^{ij} · I_{t+1}(p_{t+1}^{ij}) ②
wherein i ∈ {top, bottom} and j ∈ {left, right}, so that p_{t+1}^{ij} denotes the four pixels surrounding the projected location p_{t+1}, and w^{ij} denotes the weight of each of the four pixels, with Σ w^{ij} = 1; once the synthesized view Î_t is obtained, it forms self-supervision with the original view I_t, giving the loss function between the two frames:
L_photo = Σ_p | I_t(p) − Î_t(p) | ③
thereby, synthesizing a new view from the depth map and constructing a photometric error achieves self-supervision without any external supervision;
and (4): preventing corruption of the training gradients with a mask network
the mask network and the pose network share the first five convolution layers and are trained together; upsampling through four 4 × 4 convolution layers and one 3 × 3 convolution layer yields the mask I_t^mask corresponding to a sequence; with the mask value P_t^mask(p) for each pixel p, the loss function for two frames becomes, from equation ③:
L_mask = Σ_p P_t^mask(p) · | I_t(p) − Î_t(p) | ④
And (5): adding constraints through a reverse-order network so that the network estimates the relative pose between frames more accurately
when the forward-order image input is used, the input sequence is (I_{t−1}, I_t, I_{t+1}), and the image input of the reverse-order network is (I_{t+1}, I_t, I_{t−1}); a good pose estimation network can estimate the inter-frame pose relationship not only for a forward-order sequence but also when the image sequence is input in reverse order, which increases the constraints; for a sequence of three images, the poses obtained by the network for the forward order are T_{t→t−1} and T_{t→t+1}, and the poses obtained for the reverse order are T′_{t→t+1} and T′_{t→t−1}; ideally the forward-order and reverse-order estimates of the same transformation are identical; however, a network estimate always carries an error, and this error provides an additional constraint through the following loss function:
L_pose = Σ_{s ∈ {t−1, t+1}} ( ‖ t_{t→s} − t′_{t→s} ‖_1 + ω · ‖ r_{t→s} − r′_{t→s} ‖_1 ) ⑤
wherein t_{t→s} denotes the displacement estimated by the network for the forward-order input, t′_{t→s} the displacement estimated for the reverse-order input, r_{t→s} the rotation estimated for the forward-order input, r′_{t→s} the rotation estimated for the reverse-order input, and ω the weight;
therefore, the pose network is trained with the added constraints, so that the network has the capability of accurately estimating the relative motion between frames.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010034081.2A CN111260680B (en) | 2020-01-13 | 2020-01-13 | RGBD camera-based unsupervised pose estimation network construction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010034081.2A CN111260680B (en) | 2020-01-13 | 2020-01-13 | RGBD camera-based unsupervised pose estimation network construction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111260680A CN111260680A (en) | 2020-06-09 |
CN111260680B true CN111260680B (en) | 2023-01-03 |
Family
ID=70954018
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010034081.2A Active CN111260680B (en) | 2020-01-13 | 2020-01-13 | RGBD camera-based unsupervised pose estimation network construction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111260680B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111739078B (en) * | 2020-06-15 | 2022-11-18 | 大连理工大学 | Monocular unsupervised depth estimation method based on context attention mechanism |
CN112489128A (en) * | 2020-12-14 | 2021-03-12 | 南通大学 | RGB-D indoor unmanned aerial vehicle positioning implementation method based on unsupervised deep learning |
CN113888629A (en) * | 2021-10-28 | 2022-01-04 | 浙江大学 | RGBD camera-based rapid object three-dimensional pose estimation method |
CN114998411B (en) * | 2022-04-29 | 2024-01-09 | 中国科学院上海微系统与信息技术研究所 | Self-supervision monocular depth estimation method and device combining space-time enhancement luminosity loss |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106658023A (en) * | 2016-12-21 | 2017-05-10 | 山东大学 | End-to-end visual odometer and method based on deep learning |
CN110490928A (en) * | 2019-07-05 | 2019-11-22 | 天津大学 | A kind of camera Attitude estimation method based on deep neural network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111915663B (en) * | 2016-09-15 | 2024-04-30 | Google LLC | Image depth prediction neural network |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106658023A (en) * | 2016-12-21 | 2017-05-10 | 山东大学 | End-to-end visual odometer and method based on deep learning |
CN110490928A (en) * | 2019-07-05 | 2019-11-22 | 天津大学 | A kind of camera Attitude estimation method based on deep neural network |
Non-Patent Citations (5)
Title |
---|
A Positioning System Based on Monocular Vision for Industrial Robots; Mingyu Gao et al.; IEEE Xplore; 2016-11-03; full text *
A Target Detection System for Mobile Robot Based On Single Shot Multibox Detector Neural Network; Yujie Du; IEEE Xplore; 2019-05-30; full text *
Circular Trajectory Planning with Pose Control for Six-DOF Manipulator; Jincan Li et al.; IEEE Xplore; 2019-06-21; full text *
Real-Time Multi-Person Pose Estimation and Tracking with Deep Learning; Xu Zhongxiong et al.; Journal of China Academy of Electronics and Information Technology; 2018-08-20 (No. 04); full text *
Unsupervised Monocular Depth Estimation Fusing Dilated Convolutional Networks and SLAM; Dai Renyue et al.; Laser & Optoelectronics Progress; 2019-09-02 (No. 06); full text *
Also Published As
Publication number | Publication date |
---|---|
CN111260680A (en) | 2020-06-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111260680B (en) | RGBD camera-based unsupervised pose estimation network construction method | |
CN110782490B (en) | Video depth map estimation method and device with space-time consistency | |
CN110490928B (en) | Camera attitude estimation method based on deep neural network | |
CN109377530B (en) | Binocular depth estimation method based on depth neural network | |
CN108765479A (en) | Using deep learning to monocular view estimation of Depth optimization method in video sequence | |
CN111105432B (en) | Unsupervised end-to-end driving environment perception method based on deep learning | |
CN111354030B (en) | Method for generating unsupervised monocular image depth map embedded into SENet unit | |
CN108986166A (en) | A kind of monocular vision mileage prediction technique and odometer based on semi-supervised learning | |
CN115187638B (en) | Unsupervised monocular depth estimation method based on optical flow mask | |
CN113554032B (en) | Remote sensing image segmentation method based on multi-path parallel network of high perception | |
CN109272493A (en) | A kind of monocular vision odometer method based on recursive convolution neural network | |
CN112767467B (en) | Double-image depth estimation method based on self-supervision deep learning | |
CN110992414B (en) | Indoor monocular scene depth estimation method based on convolutional neural network | |
CN113077505A (en) | Optimization method of monocular depth estimation network based on contrast learning | |
CN113283525A (en) | Image matching method based on deep learning | |
CN115883764A (en) | Underwater high-speed video frame interpolation method and system based on data cooperation | |
CN115035171A (en) | Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion | |
CN112241959A (en) | Attention mechanism generation semantic segmentation method based on superpixels | |
CN113610912B (en) | System and method for estimating monocular depth of low-resolution image in three-dimensional scene reconstruction | |
CN112419411B (en) | Realization method of vision odometer based on convolutional neural network and optical flow characteristics | |
Berenguel-Baeta et al. | Fredsnet: Joint monocular depth and semantic segmentation with fast fourier convolutions from single panoramas | |
Liu et al. | Towards better data exploitation in self-supervised monocular depth estimation | |
CN112132880A (en) | Real-time dense depth estimation method based on sparse measurement and monocular RGB (red, green and blue) image | |
CN112164078B (en) | RGB-D multi-scale semantic segmentation method based on encoder-decoder | |
CN117197229B (en) | Multi-stage estimation monocular vision odometer method based on brightness alignment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||