CN107507126A - A method for restoring 3D scenes from an RGB image - Google Patents

A method for restoring 3D scenes from an RGB image

Info

Publication number
CN107507126A
CN107507126A (application CN201710621981.5A)
Authority
CN
China
Prior art keywords
scenes
rgb image
reduced
window
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710621981.5A
Other languages
Chinese (zh)
Other versions
CN107507126B (en)
Inventor
李扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuexin (Dalian) Technology Co.,Ltd.
Original Assignee
Dalian Hechuang Lanren Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Hechuang Lanren Technology Co., Ltd. filed Critical Dalian Hechuang Lanren Technology Co., Ltd.
Priority to CN201710621981.5A priority Critical patent/CN107507126B/en
Publication of CN107507126A publication Critical patent/CN107507126A/en
Application granted granted Critical
Publication of CN107507126B publication Critical patent/CN107507126B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/08Projecting images onto non-planar surfaces, e.g. geodetic screens
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The present invention relates to the field of 3D structure prediction, and in particular to a method for restoring the 3D structure of a scene from RGB images. The method mainly comprises an image instance segmentation device, a 3D model generation device, and an object relative-position prediction device. The input data of the method is a 2D RGB image; the output data is a 3D model of the entire environment. The method can help a robot better understand its surrounding environment and reduce the difficulty of movement, obstacle avoidance, and path planning.

Description

A method for restoring 3D scenes from an RGB image
Technical field
The present invention relates to the field of 3D structure prediction, and in particular to a method for restoring 3D scenes from an RGB image.
Background technology
The human brain has the ability to complete the 3D information of an environment. Seeing one side of a 3D object, a person can guess what the other sides look like and form the 3D structure of the whole object in the mind. This ability to infer a complete 3D structure from partial information is of great help to humans in environment awareness, movement, obstacle avoidance, and path planning. Machines do not yet possess a robust 3D structure completion ability: even with the help of infrared rangefinders and the like, a machine can only obtain the 3D information of the visible parts. Current methods for helping machines achieve 3D structure completion mainly concentrate on using pre-defined 3D object models in a CAD library to synthesize or splice together 3D structures that match the objects in the environment. This approach makes the generated 3D structures overly uniform and not generally usable in the changeable scenes of real life. A machine and a human were tested on their reactions to the same picture: Fig. 4 is the 2D image of a scene; Fig. 5 shows the machine's understanding of the image given depth information, where the occluded parts remain unknown; Fig. 6 shows the human understanding of the image, where the brain completes the occluded objects and restores the 3D structure. In recent years, as the low-hanging fruit of supervised learning has largely been picked, unsupervised learning has become a research hotspot. Self-supervised generative models such as GAN (Generative Adversarial Network) and VAE (Variational Auto-Encoder) have attracted increasing attention. A generative model can use a random vector in a latent space to generate a meaningful image or animation. Figs. 7-10 illustrate a 2D face-image completion effect realized with GAN: Figs. 7 and 9 are input face images with noise, and Figs. 8 and 10 are the output faces after noise removal.
The same generative models have also been extended to the task of generating 3D structures: the 3D-VAE-GAN proposed by MIT in 2016 can infer the 3D image of an object from its RGB image. Like a human, a machine now has the ability to infer an object's 3D structure from an RGB image. However, 3D-VAE-GAN can only predict the 3D structure of a single object, and it is helpless for complex environments or scenes containing occluded objects.
In summary, how to use a 2D image of a scene containing multiple objects to restore the 3D structure of the scene is an urgent problem to be solved.
Summary of the invention
It is an object of the present invention to provide a method for restoring 3D scenes from an RGB image, so as to help a machine understand complex environments containing multiple objects and occlusions.
To achieve this goal, the present invention provides the following technical scheme: a method for restoring 3D scenes from an RGB image, whose specific flow is:
Step S1: read in an RGB image;
Step S2: perform object detection over the image, and calibrate each detected object with a rectangular window;
Step S3: perform instance segmentation on the detected image to obtain the mask of each single object;
Step S4: predict the 3D shape of each single object with a deep learning model, the input data being the mask of the single object;
Step S5: perform regression prediction of the relative relationship between each pair of objects in the image;
Step S6: construct a graph from the predicted relative relationships between the objects;
Step S7: perform global graph optimization to obtain the optimal 3D spatial arrangement of the objects;
Step S8: obtain the 3D scene.
Further, step S2 specifically includes: using a deep learning model to perform feature extraction on the image, generate candidate regions for detection objects, classify the detection windows, and optimally estimate the window positions. The advantage is that detection results obtained with a deep learning model are high in precision and small in error.
Further, step S3 specifically includes: performing feature up-sampling inside the object candidate regions produced by step S2, classifying the pixels inside the detection window at pixel level using bilinear interpolation, and integrating the pixels of the same class within the window into the mask of the detected object.
Specifically, the algorithm used for the pixel-level classification within the window in step S3 is SVM. The beneficial effect is that performing segmentation on the basis of detection greatly reduces interference between classes and improves the quality of pixel classification.
Further, step S4 specifically includes: inputting the object mask obtained by step S3, projecting the object mask into a latent variable space using a variational auto-encoder (VAE), and generating the corresponding 3D structure from the latent variable using a generative adversarial network (GAN).
Further, step S5 involves a regression algorithm for predicting the relative relationship between objects, characterized in that: a 3D transition matrix represents the translation and rotation relationship between two objects; a deep learning algorithm and big-data training are used to make the model approximate the distribution of relative relationships between real objects.
Further, the graph in step S6 is an undirected graph G(V, E), where V is the set of vertices, here the set of detected objects, and E is the set of edges, here the transition matrices between detected objects. G is obtained by using step S5 to estimate the transition matrix between each pair of objects.
Further, the purpose of step S7 is to optimize the graph produced by step S6.
The present invention uses an object detection method to detect the multiple objects in an image; performs pixel-level segmentation on the objects to obtain their masks; on this basis, predicts the relative position of each pair of objects and generates an undirected graph; and finally uses a graph optimization algorithm to optimize the placement relations of the objects in the environment. The present invention solves the problem of directly inferring a 3D spatial structure from a 2D image.
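The eight steps above can be outlined as a single pipeline. The sketch below is purely illustrative: every function is a placeholder standing in for the detector, segmenter, VAE-GAN, and pose regressor described in the text, with fixed stub outputs; none of it is the patent's actual implementation.

```python
# Hypothetical end-to-end sketch of steps S1-S8. All functions are stubs.

def detect_objects(image):
    # S2: return (window, confidence) pairs; stubbed with two fixed boxes.
    return [((0, 0, 4, 4), 0.9), ((4, 0, 8, 4), 0.8)]

def segment_instance(image, window):
    # S3: return the pixel set (mask) of the object inside the window (stub).
    x0, y0, x1, y1 = window
    return [(x, y) for x in range(x0, x1) for y in range(y0, y1)]

def predict_3d_shape(mask):
    # S4: the VAE-GAN would map mask -> latent -> 3D shape; stubbed as a size.
    return {"voxels": len(mask)}

def predict_relative_pose(shape_a, shape_b):
    # S5: the regressor would output a 4x4 transition matrix; stubbed as identity.
    return [[1 if i == j else 0 for j in range(4)] for i in range(4)]

def restore_scene(image):
    detections = detect_objects(image)                           # S2
    masks = [segment_instance(image, w) for w, _ in detections]  # S3
    shapes = [predict_3d_shape(m) for m in masks]                # S4
    edges = {}                                                   # S5-S6: undirected graph
    for i in range(len(shapes)):
        for j in range(i + 1, len(shapes)):
            edges[(i, j)] = predict_relative_pose(shapes[i], shapes[j])
    # S7 (graph optimization) and S8 would arrange the shapes using `edges`.
    return shapes, edges
```

Running `restore_scene(...)` on the stubs yields one 3D shape per detection plus one relative-pose edge per object pair, which is exactly the data the graph optimization of step S7 consumes.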
Brief description of the drawings
Fig. 1 is the flow chart of the 3D estimation algorithm of the present invention.
Fig. 2 is an instance segmentation result based on Mask R-CNN.
Fig. 3 shows converting an image into a 3D model structure using an auto-encoder + generator.
Fig. 4 is the 2D image of a scene used to test the reactions of a machine and a human to the same picture.
Fig. 5 shows the machine's understanding of the image given depth information.
Fig. 6 shows the human understanding of the image.
Fig. 7 is a noisy 2D face image input to the GAN-based completion.
Fig. 8 is the output image of Fig. 7, the face after noise removal.
Fig. 9 is another noisy 2D face image input to the GAN-based completion.
Fig. 10 is the output image of Fig. 9, the face after noise removal.
Embodiment
The specific algorithm for each step of the present invention is described below.
A method for restoring 3D scenes from an RGB image; the specific flow is shown in Fig. 1.
The embodiment of performing object detection over the image, involved in step S2, is as follows:
Object detection is performed over the image, and each detected object is calibrated with a rectangular window. A detection scheme based on the Faster R-CNN algorithm is used here. The concrete steps are: feature extraction, candidate region generation, window classification, and refinement. The feature extraction part is implemented with a deep convolutional neural network (CNN); the CNN comprises several convolutional layers (conv) and activation layers (relu). In this scheme the CNN is a VGG network trained on ImageNet. Faster R-CNN uses an RPN (region proposal network) to generate candidate regions. In Faster R-CNN, the RPN is a very small convolutional network (a 3×3 conv followed by two 1×1 convs) that slides a window over all the features of the conv5_3 feature map. Each sliding window is associated with 9 anchor boxes related to its receptive field. The RPN performs bounding-box regression and box confidence scoring for each anchor box. By combining these losses into one common loss over all the features, the whole pipeline can be trained end to end. The top-layer output of Faster R-CNN also includes bounding-box regression and window confidence scores; these two parts are used respectively to refine the windows and to identify/classify the objects in the windows.
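The "9 anchor boxes per sliding window" can be illustrated as 3 scales times 3 aspect ratios centred on one feature-map location. The scale and ratio values below are the common Faster R-CNN defaults, assumed here for illustration; the patent text does not state them.

```python
# Generate the 9 anchor boxes for one sliding-window centre (cx, cy).
# Scales/ratios are assumed defaults, not taken from the patent.

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * r ** 0.5   # width scaled by sqrt(ratio) ...
            h = s / r ** 0.5   # ... so that w * h == s * s for every ratio
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

All 9 boxes share the same centre; bounding-box regression then shifts and resizes whichever anchors score highly.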
The embodiment of performing instance segmentation on the detected image and obtaining the mask of each single object, involved in step S3, is as follows:
Instance segmentation refers to subdividing a 2D image into multiple image sub-regions (sets of pixels), where the pixels in each sub-region belong to the same object. The scheme uses Mask R-CNN from Facebook AI Research. Fig. 2 shows an instance segmentation result. For each object in the image, instance segmentation generates a mask, with which the object can be segmented out of the image. Mask R-CNN is improved on the basis of Faster R-CNN. By adding a per-class object mask prediction, Mask R-CNN extends Faster R-CNN to instance segmentation, in parallel with the existing bounding-box regressor and object classifier. Because Faster R-CNN's RoIPool is not designed for pixel-to-pixel alignment between network input and output, Mask R-CNN replaces it with RoIAlign. RoIAlign uses bilinear interpolation to compute exact values of the input features in each sub-window, instead of RoIPool's max-pooling method. Mask R-CNN's training dataset is the open-source Microsoft COCO.
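The bilinear interpolation at the heart of RoIAlign can be shown in a few lines: sample a feature map at a continuous coordinate instead of snapping to the nearest cell as RoIPool does. This is a minimal single-channel, pure-Python sketch of the sampling step only, not of the full RoIAlign operator.

```python
# Bilinearly sample a 2D feature map at a non-integer coordinate (x, y).

def bilinear_sample(feat, x, y):
    """feat: 2D list of floats indexed as feat[row][col]."""
    x0, y0 = int(x), int(y)                       # top-left integer neighbour
    x1 = min(x0 + 1, len(feat[0]) - 1)            # clamp right neighbour
    y1 = min(y0 + 1, len(feat) - 1)               # clamp bottom neighbour
    dx, dy = x - x0, y - y0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy
```

For example, sampling the 2×2 map [[0, 1], [2, 3]] at (0.5, 0.5) blends all four cells equally.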
The embodiment of predicting the 3D shape of a single object with a deep learning model, involved in step S4, is as follows:
The 3D shape of a single object is predicted using VAE-GAN; the input data is the mask of the single object. A VAE (variational auto-encoder) is an auto-encoding scheme that maps an image into a latent variable space and then maps the data in the latent variable space back to an image. A GAN (generative adversarial network) is a method that generates realistic data from data in the latent variable space. This scheme grafts the encoding part of a VAE in front of a GAN: the VAE maps the 2D image into the latent variable space, and the GAN then maps the latent variable onto a 3D model. VAE-GAN consists of three parts: an image encoder E, a decoder (generator) G, and a discriminator D. The image encoder comprises five spatial convolution layers with kernel sizes {11, 5, 5, 5, 8} and strides {4, 2, 2, 2, 1}; batch normalization and ReLU are applied between the convolution layers, and a sampler draws a 200-dimensional vector. The structure of VAE-GAN is shown in Fig. 3: input instance mask → encoder → latent variable → generator → reconstructed 3D shape.
The loss function L involved in the VAE-GAN learning process consists of three parts: an object reconstruction loss Lrecon, a cross-entropy loss L3D-GAN, and a relative-entropy (KL) loss LKL. Written as an equation:
L = L3D-GAN + a1·LKL + a2·Lrecon,
where a1 and a2 are the weights of the KL loss and the reconstruction loss respectively. We set a1 = 5 and a2 = 10^4. The three losses are computed as follows:
L3D-GAN = log D(x) + log(1 − D(G(z))),
LKL = DKL(q(z|y) ‖ p(z)),
Lrecon = ‖G(E(y)) − x‖2,
where x is a 3D model from the training set, y is the corresponding 2D model, and q(z|y) is the variable probability distribution of the latent space z. This scheme chooses p(z) to be a multivariate Gaussian distribution with mean 0 and variance 1. Training VAE-GAN requires 2D images and their corresponding 3D models to be available simultaneously; the training dataset is obtained by manually calibrating CAD models.
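A minimal numeric sketch of how the three loss terms combine with the stated weights (a1 = 5, a2 = 10^4). The component losses are passed in as plain numbers here, since computing them for real requires the trained networks D, G, and E; the helper names are our own.

```python
import math

def vae_gan_loss(l_3dgan, l_kl, l_recon, a1=5.0, a2=1e4):
    # L = L3D-GAN + a1 * LKL + a2 * Lrecon, with the weights from the text.
    return l_3dgan + a1 * l_kl + a2 * l_recon

def gan_term(d_real, d_fake):
    # L3D-GAN = log D(x) + log(1 - D(G(z))) for discriminator outputs in (0, 1).
    return math.log(d_real) + math.log(1.0 - d_fake)
```

Note how heavily a2 = 10^4 amplifies the reconstruction term: even a small Lrecon dominates unless the generator reproduces the target 3D model closely.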
The embodiment of the regression prediction of the relative relationship between each pair of objects in the image, involved in step S5, is as follows:
Regression prediction is performed on the relative relationship between each pair of objects in the image. After the 3D shape of each object has been obtained, a regressor is used to predict the relative relationship between the 3D models. The relative relationship between 3D models is represented by a transition matrix T in 3D space. T is a 4×4 matrix, where the 3×3 submatrix in the upper-left corner of T represents the rotation relationship in three dimensions, and the fourth column of T, [Xt, Yt, Zt], represents the translation relationship in three dimensions. The goal of the regression algorithm is to minimize the difference between the predicted and actual values of the transition matrix. The present invention represents this error using an L2 loss, i.e., the squared difference between the true value and the predicted value. The regression algorithm uses an MLP (multi-layer neural network). The actual values of the transition matrices in the training data are computed directly from the simulated 3D models.
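The layout of the 4×4 transition matrix T described above, and the L2 regression target, can be sketched directly; this is an illustration of the data structure, not of the MLP regressor itself.

```python
# Build a 4x4 homogeneous transition matrix T: upper-left 3x3 = rotation,
# fourth column = translation [Xt, Yt, Zt]; bottom row is [0, 0, 0, 1].

def make_T(rotation3x3, translation_xyz):
    T = [row[:] + [t] for row, t in zip(rotation3x3, translation_xyz)]
    T.append([0.0, 0.0, 0.0, 1.0])
    return T

def l2_loss(T_pred, T_true):
    # Squared element-wise difference between predicted and true matrices.
    return sum((p - t) ** 2
               for rp, rt in zip(T_pred, T_true)
               for p, t in zip(rp, rt))
```

The regressor is trained to drive `l2_loss` between its predicted T and the ground-truth T (computed from the simulated 3D models) to zero.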
The embodiment of the graph optimization involved in step S7 is as follows.
Step S6 has computed the transition matrix between every pair of detected objects, yielding the undirected graph G(V, E), where V is the set of vertices, here the set of detected objects, and E is the set of edges, here the transformation relations between the detected objects. If there are two or more objects in the environment, then, since an inferred relative position relation exists between every pair of objects, propagating placements through different detected objects can produce conflicting information about the positions of the objects. The following describes how to selectively use the edge set of the undirected graph G to eliminate this error.
A deep neural network has been used for object detection in step S2. For each detected object, the neural network outputs a confidence C to represent the reliability of that detection. We use this confidence to screen the edge set of the undirected graph:
3D placement algorithm
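The text stops before spelling out the placement algorithm itself. The sketch below is one plausible reading of the confidence-based screening only, under two assumptions of ours: an edge's reliability is the lower of its two endpoint confidences, and a threshold of 0.5 is used.

```python
# Screen the edge set of the undirected graph G by detection confidence.
# Assumptions (not from the patent text): edge weight = min of the two
# endpoint confidences; edges below `threshold` are discarded.

def screen_edges(confidences, edges, threshold=0.5):
    """confidences: {vertex: C}; edges: iterable of (i, j) vertex pairs."""
    kept = []
    for i, j in edges:
        weight = min(confidences[i], confidences[j])  # edge reliability
        if weight >= threshold:
            kept.append(((i, j), weight))
    # Most reliable relative poses first, for propagating placements.
    return sorted(kept, key=lambda e: -e[1])
```

With such a screening, placement propagation follows only high-confidence relative poses, avoiding the conflicting information described above.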

Claims (9)

  1. A method for restoring 3D scenes from an RGB image, characterized in that the concrete steps are:
    Step S1: read in an RGB image;
    Step S2: perform object detection over the image, and calibrate each detected object with a rectangular window;
    Step S3: perform instance segmentation on the detected image to obtain the mask of each single object;
    Step S4: predict the 3D shape of each single object with a deep learning model, the input data being the mask of the single object;
    Step S5: perform regression prediction of the relative relationship between each pair of objects in the image;
    Step S6: construct a graph from the predicted relative relationships between the objects;
    Step S7: perform global graph optimization to obtain the optimal 3D spatial arrangement of the objects;
    Step S8: obtain the 3D scene.
  2. The method for restoring 3D scenes from an RGB image according to claim 1, characterized in that:
    the step S2 specifically includes: using a deep learning model to perform feature extraction on the image, generate candidate regions for detection objects, classify the detection windows, and optimally estimate the window positions.
  3. The method for restoring 3D scenes from an RGB image according to claim 1, characterized in that:
    the step S3 specifically includes: performing feature up-sampling inside the object candidate regions produced by the step S2, classifying the pixels inside the detection window at pixel level using bilinear interpolation, and integrating the pixels of the same class within the window into the mask of the detected object.
  4. The method for restoring 3D scenes from an RGB image according to claim 3, characterized in that: the algorithm used for the pixel-level classification within the window in the step S3 is SVM.
  5. The method for restoring 3D scenes from an RGB image according to claim 1, characterized in that:
    the step S4 specifically includes: inputting the object mask obtained by the step S3, projecting the object mask into a latent variable space using a variational auto-encoder, and generating the corresponding 3D structure from the latent variable using a generative adversarial network.
  6. The method for restoring 3D scenes from an RGB image according to claim 1, characterized in that:
    the step S5 involves a regression algorithm for predicting the relative relationship between objects; a 3D transition matrix represents the translation and rotation relationship between two objects; a deep learning algorithm and big-data training are used to make the model approximate the distribution of relative relationships between real objects.
  7. The method for restoring 3D scenes from an RGB image according to claim 1, characterized in that: the graph in the step S6 is G(V, E), where V is the set of vertices, here the set of detected objects, and E is the set of edges, here the transition matrices between detected objects.
  8. The method for restoring 3D scenes from an RGB image according to claim 7, characterized in that: G is obtained by using the step S5 to estimate the transition matrices between pairs of objects.
  9. The method for restoring 3D scenes from an RGB image according to claim 1, characterized in that: the purpose of the step S7 is to optimize the graph produced by the step S6.
CN201710621981.5A 2017-07-27 2017-07-27 Method for restoring 3D scene by using RGB image Active CN107507126B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710621981.5A CN107507126B (en) 2017-07-27 2017-07-27 Method for restoring 3D scene by using RGB image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710621981.5A CN107507126B (en) 2017-07-27 2017-07-27 Method for restoring 3D scene by using RGB image

Publications (2)

Publication Number Publication Date
CN107507126A true CN107507126A (en) 2017-12-22
CN107507126B CN107507126B (en) 2020-09-18

Family

ID=60689937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710621981.5A Active CN107507126B (en) 2017-07-27 2017-07-27 Method for restoring 3D scene by using RGB image

Country Status (1)

Country Link
CN (1) CN107507126B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304831A (en) * 2018-03-15 2018-07-20 广东工业大学 A kind of method and device that monitoring worker safety helmet is worn
CN108388880A (en) * 2018-03-15 2018-08-10 广东工业大学 A kind of method and device that monitoring driver drives using mobile phone
CN108495110A (en) * 2018-01-19 2018-09-04 天津大学 A kind of virtual visual point image generating method fighting network based on production
CN108648197A (en) * 2018-04-12 2018-10-12 天津大学 A kind of object candidate area extracting method based on image background mask
CN108875818A (en) * 2018-06-06 2018-11-23 西安交通大学 Based on variation from code machine and confrontation network integration zero sample image classification method
CN108932693A (en) * 2018-06-15 2018-12-04 中国科学院自动化研究所 Face editor complementing method and device based on face geological information
CN108959551A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Method for digging, device, storage medium and the terminal device of neighbour's semanteme
CN109145769A (en) * 2018-08-01 2019-01-04 辽宁工业大学 The target detection network design method of blending image segmentation feature
CN109191414A (en) * 2018-08-21 2019-01-11 北京旷视科技有限公司 A kind of image processing method, device, electronic equipment and storage medium
CN109472795A (en) * 2018-10-29 2019-03-15 三星电子(中国)研发中心 A kind of image edit method and device
CN110648305A (en) * 2018-06-08 2020-01-03 财团法人工业技术研究院 Industrial image detection method, system and computer readable recording medium
CN110706328A (en) * 2019-08-21 2020-01-17 重庆特斯联智慧科技股份有限公司 Three-dimensional scene virtual generation method and system based on GAN network
CN112505049A (en) * 2020-10-14 2021-03-16 上海互觉科技有限公司 Mask inhibition-based method and system for detecting surface defects of precision components
CN112712460A (en) * 2020-12-09 2021-04-27 杭州妙绘科技有限公司 Portrait generation method and device, electronic equipment and medium
WO2022240354A1 (en) * 2021-05-14 2022-11-17 Lemon Inc. A high-resolution portrait stylization frameworks using a hierarchical variational encoder

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101714262A (en) * 2009-12-10 2010-05-26 北京大学 Method for reconstructing three-dimensional scene of single image
CN105930382A (en) * 2016-04-14 2016-09-07 严进龙 Method for searching for 3D model with 2D pictures
CN106709568A (en) * 2016-12-16 2017-05-24 北京工业大学 RGB-D image object detection and semantic segmentation method based on deep convolution network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101714262A (en) * 2009-12-10 2010-05-26 北京大学 Method for reconstructing three-dimensional scene of single image
CN105930382A (en) * 2016-04-14 2016-09-07 严进龙 Method for searching for 3D model with 2D pictures
CN106709568A (en) * 2016-12-16 2017-05-24 北京工业大学 RGB-D image object detection and semantic segmentation method based on deep convolution network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
METAL1: "Facebook's image recognition is very powerful: three machine-vision tools open-sourced at once" (Facebook的图像识别很强大，一次开源了三款机器视觉工具), 《HTTPS://BLOG.CSDN.NET/METAL1/ARTICLE/DETAILS/53318437》 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108495110B (en) * 2018-01-19 2020-03-17 天津大学 Virtual viewpoint image generation method based on generation type countermeasure network
CN108495110A (en) * 2018-01-19 2018-09-04 天津大学 A kind of virtual visual point image generating method fighting network based on production
CN108388880A (en) * 2018-03-15 2018-08-10 广东工业大学 A kind of method and device that monitoring driver drives using mobile phone
CN108304831B (en) * 2018-03-15 2022-03-22 广东工业大学 Method and device for monitoring wearing of safety helmet of worker
CN108304831A (en) * 2018-03-15 2018-07-20 广东工业大学 A kind of method and device that monitoring worker safety helmet is worn
CN108648197B (en) * 2018-04-12 2021-09-07 天津大学 Target candidate region extraction method based on image background mask
CN108648197A (en) * 2018-04-12 2018-10-12 天津大学 A kind of object candidate area extracting method based on image background mask
CN108875818A (en) * 2018-06-06 2018-11-23 西安交通大学 Based on variation from code machine and confrontation network integration zero sample image classification method
CN110648305A (en) * 2018-06-08 2020-01-03 财团法人工业技术研究院 Industrial image detection method, system and computer readable recording medium
US11315231B2 (en) 2018-06-08 2022-04-26 Industrial Technology Research Institute Industrial image inspection method and system and computer readable recording medium
CN108932693A (en) * 2018-06-15 2018-12-04 中国科学院自动化研究所 Face editor complementing method and device based on face geological information
CN108932693B (en) * 2018-06-15 2020-09-22 中国科学院自动化研究所 Face editing and completing method and device based on face geometric information
CN108959551B (en) * 2018-06-29 2021-07-13 北京百度网讯科技有限公司 Neighbor semantic mining method and device, storage medium and terminal equipment
CN108959551A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Method for digging, device, storage medium and the terminal device of neighbour's semanteme
CN109145769A (en) * 2018-08-01 2019-01-04 辽宁工业大学 The target detection network design method of blending image segmentation feature
CN109191414A (en) * 2018-08-21 2019-01-11 北京旷视科技有限公司 A kind of image processing method, device, electronic equipment and storage medium
CN109472795A (en) * 2018-10-29 2019-03-15 三星电子(中国)研发中心 A kind of image edit method and device
CN110706328A (en) * 2019-08-21 2020-01-17 重庆特斯联智慧科技股份有限公司 Three-dimensional scene virtual generation method and system based on GAN network
CN112505049B (en) * 2020-10-14 2021-08-03 上海互觉科技有限公司 Mask inhibition-based method and system for detecting surface defects of precision components
CN112505049A (en) * 2020-10-14 2021-03-16 上海互觉科技有限公司 Mask inhibition-based method and system for detecting surface defects of precision components
CN112712460A (en) * 2020-12-09 2021-04-27 杭州妙绘科技有限公司 Portrait generation method and device, electronic equipment and medium
CN112712460B (en) * 2020-12-09 2024-05-24 杭州妙绘科技有限公司 Portrait generation method, device, electronic equipment and medium
WO2022240354A1 (en) * 2021-05-14 2022-11-17 Lemon Inc. A high-resolution portrait stylization frameworks using a hierarchical variational encoder
US11720994B2 (en) 2021-05-14 2023-08-08 Lemon Inc. High-resolution portrait stylization frameworks using a hierarchical variational encoder

Also Published As

Publication number Publication date
CN107507126B (en) 2020-09-18

Similar Documents

Publication Publication Date Title
CN107507126A (en) A kind of method that 3D scenes are reduced using RGB image
Bozic et al. Transformerfusion: Monocular rgb scene reconstruction using transformers
Ge et al. 3d hand shape and pose estimation from a single rgb image
KR102610030B1 (en) Deep learning system for cuboid detection
CN107818580A (en) 3D reconstructions are carried out to real object according to depth map
Saputra et al. Learning monocular visual odometry through geometry-aware curriculum learning
CN108121995A (en) For identifying the method and apparatus of object
Ranjan et al. Learning human optical flow
EP4377898A1 (en) Neural radiance field generative modeling of object classes from single two-dimensional views
Nozawa et al. 3D car shape reconstruction from a contour sketch using GAN and lazy learning
US20230419600A1 (en) Volumetric performance capture with neural rendering
Peng Machines' perception of space
US11138812B1 (en) Image processing for updating a model of an environment
Rabby et al. BeyondPixels: A comprehensive review of the evolution of neural radiance fields
Yu et al. NF-atlas: Multi-volume neural feature fields for large scale lidar mapping
CN116134491A (en) Multi-view neuro-human prediction using implicit differentiable renderers for facial expression, body posture morphology, and clothing performance capture
JP2023079022A (en) Information processing device and information generation method
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
Correia et al. 3D reconstruction of human bodies from single-view and multi-view images: A systematic review
CN116758212A (en) 3D reconstruction method, device, equipment and medium based on self-adaptive denoising algorithm
CN116434343A (en) Video motion recognition method based on high-low frequency double branches
CN116883524A (en) Image generation model training, image generation method and device and computer equipment
Williams Structured Generative Models for Scene Understanding
Van de Maele et al. Object-centric scene representations using active inference
Zeng et al. 3D Reconstruction of buildings based on transformer-MVSNet

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 116023 Room 1 203, Block 12, Block C, 6 Yiyang Road, Qixianling, Dalian High-tech Industrial Park, Liaoning Province

Applicant after: Hechuang Lazy Man (Dalian) Technology Co.,Ltd.

Address before: 116023 Dalian High-tech Park, Liaoning Province, Takeoff Software Park Phase II 1-04B

Applicant before: DALIAN HECHUANG LANREN TECHNOLOGY CO.,LTD.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220328

Address after: 116000 unit 07, 16 / F, animation tour industry building, No. 10, Torch Road, Dalian high tech Industrial Park, Dalian, Liaoning Province

Patentee after: Dalian Yongge Technology Co.,Ltd.

Address before: 116023 Room 1 203, Block 12, Block C, 6 Yiyang Road, Qixianling, Dalian High-tech Industrial Park, Liaoning Province

Patentee before: Hechuang Lazy Man (Dalian) Technology Co.,Ltd.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20240108

Address after: No. 532-5, 5th Floor, World Trade Building, No. 25 Tongxing Street, Zhongshan District, Dalian City, Liaoning Province, 116001

Patentee after: Yuexin (Dalian) Technology Co.,Ltd.

Address before: 116000 unit 07, 16 / F, animation tour industry building, No. 10, Torch Road, Dalian high tech Industrial Park, Dalian, Liaoning Province

Patentee before: Dalian Yongge Technology Co.,Ltd.

TR01 Transfer of patent right