CN108305229A

CN108305229A - A multi-view reconstruction method based on a deep learning silhouette network

Info

Publication number: CN108305229A
Application number: CN201810081726.0A
Authority: CN
Inventors: 夏春秋
Original assignee: Shenzhen Vision Technology Co Ltd
Current assignee: Shenzhen Vision Technology Co Ltd
Priority date: 2018-01-29
Filing date: 2018-01-29
Publication date: 2018-07-20

Abstract

The present invention proposes a multi-view reconstruction method based on a deep learning silhouette network. Its main contents are: introducing a deep learning architecture, 3D shape encoding, building the silhouette network, and network training and testing. The process is as follows. A deep learning silhouette network is introduced; the network learns a 3D shape encoding of one or more input images, and this encoding is then used to adjust a decoder that generates new views. A silhouette-based proxy loss is then introduced: when the decoder contains no 3D representation, the network uses this 2D loss to encode 3D shape, and the 2D loss is not limited by the resolution of a 3D representation. A large blobby object dataset is generated and used to pre-train the network, after which the silhouette network is fine-tuned on the dataset. The invention learns 3D shape with a neural network: the silhouettes generated in new views force the network to encode 3D shape and to combine information from multiple views efficiently, which improves multi-view reconstruction performance.

Description

A multi-view reconstruction method based on a deep learning silhouette network

Technical field

The present invention relates to the field of view reconstruction, and more particularly to a multi-view reconstruction method based on a deep learning silhouette network.

Background art

Multi-view reconstruction recovers a three-dimensional model of a scene from several pictures of the scene taken from different viewpoints. Multi-view 3D reconstruction of natural scenes has long been a fundamental problem in computer vision: once a 3D model of the target has been reconstructed, the target can be analysed quantitatively and information about it can be processed. Multi-view reconstruction is applied in medical imaging. On the one hand, view reconstruction yields visual structural or functional information in biomedical imaging for use in biological research or clinical diagnosis; on the other hand, reconstruction can also produce a structural model of the human body, from which the best treatment plan can be worked out before surgery. In military operations, information gathered in advance by simple means such as aerial photography can be turned into a 3D model of the battlefield by multi-view reconstruction, helping to seize the initiative, control the course of the conflict, deploy forces and plan strategy. Society is becoming more complex, criminal cases of all kinds keep emerging, and the situations faced by police investigators are increasingly complicated. With multi-view reconstruction, crime scenes can be reconstructed in 3D and analysed by simulation, and a 3D animated crime-scene restoration and analysis system can be designed: investigators reconstruct the scene from floor-plan sketches of the scene, scene photographs and a basic verbal description of how the crime was committed. Cameras, objects, environments, animals and human bodies can also be animated and simulated, so as to reproduce various simulated crime scenes, human bodies, and the occurrence, course and outcome of events. Through the advanced rendering functions of the system, highly realistic 3D scene pictures and animations can be produced; combined with sound and text, these yield multimedia video and image material of virtual 3D crime scenes and case processes for use by investigators, technicians and commanders. Although multi-view reconstruction has been studied extensively, combining information from multiple views and improving reconstruction performance accordingly remains challenging when there are few input views, the baseline is wide and the illumination conditions are complex.

The present invention proposes a multi-view reconstruction method based on a deep learning silhouette network. A deep learning silhouette network is introduced that learns a 3D shape encoding of one or more input images; this encoding is then used to adjust a decoder that generates new views. A silhouette-based proxy loss is then introduced: when the decoder contains no 3D representation, the network uses this 2D loss to encode 3D shape, and the 2D loss is not limited by the resolution of a 3D representation. A large blobby object dataset is generated and used to pre-train the network, after which the silhouette network is fine-tuned on the dataset. By learning 3D shape with a neural network, the invention uses the silhouettes generated in new views to force the network to encode 3D shape and to combine information from multiple views efficiently, improving multi-view reconstruction performance.

Summary of the invention

For view reconstruction, the present invention proposes a multi-view reconstruction method based on a deep learning silhouette network. A deep learning silhouette network is introduced that learns a 3D shape encoding of one or more input images; this encoding is then used to adjust a decoder that generates new views. A silhouette-based proxy loss is then introduced: when the decoder contains no 3D representation, the network uses this 2D loss to encode 3D shape, and the 2D loss is not limited by the resolution of a 3D representation. A large blobby object dataset is generated and used to pre-train the network, after which the silhouette network is fine-tuned on the dataset.

To solve the above problems, the present invention proposes a multi-view reconstruction method based on a deep learning silhouette network, whose main contents include:

(1) introducing a deep learning architecture;

(2) 3D shape encoding;

(3) building the silhouette network;

(4) network training and testing.

In introducing the deep learning architecture, a blobby object dataset and a real sculpture dataset are introduced. The blobby object dataset is used for pre-training; the sculpture dataset is used to demonstrate that silhouettes make it possible to learn and encode 3D shape, and to generate new silhouette views of sculptures of various shapes and materials. To predict 3D shape from multiple images and to handle smooth 3D surfaces, as applied to smoothly textured sculptures, a deep learning silhouette network (SilNet) is introduced. The network learns a 3D shape encoding of one or more input images, and this encoding is then used to adjust a decoder that generates new views. A silhouette-based proxy loss is then introduced: when the decoder contains no 3D representation, the network uses this 2D loss to encode 3D shape, and the 2D loss is not limited by the resolution of a 3D representation. A large blobby object dataset is generated and used to pre-train the network, after which SilNet is fine-tuned on the dataset.

Further, the blobby object dataset consists of smooth surfaces created from implicit surfaces and contains 11706 blob objects, divided into a training set, a validation set and a test set in the ratio 75:10:15. Each object has five images, and to learn from them the deep learning silhouette network must find an encoding of 3D shape. The images are rendered in Blender using orthographic projection, in the following steps. First, three light sources are placed at random around the object. Second, the camera is rotated about the z-axis, and the value of each rendering angle θ is chosen at random from [0°, 120°]. Finally, a complex texture model is used, with surface scattering and specular reflection.
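As an illustration of the split and view sampling described above, the following minimal sketch computes the 75:10:15 partition sizes for 11706 objects and draws five rendering angles for one object. The function names, the integer rounding rule and the uniform sampling are assumptions; the text states only the ratio, the number of images per object and the range [0°, 120°].

```python
import random


def split_counts(total, ratios=(75, 10, 15)):
    """Split `total` items into train/validation/test counts by integer ratios.

    Rounding rule (an assumption): floor the first two parts and give the
    remainder to the test set so that the three counts always sum to `total`.
    """
    denom = sum(ratios)
    train = total * ratios[0] // denom
    val = total * ratios[1] // denom
    test = total - train - val
    return train, val, test


def sample_view_angles(num_views=5, low=0.0, high=120.0, rng=None):
    """Draw one rendering angle in degrees per view, uniformly from [low, high]."""
    rng = rng or random.Random()
    return [rng.uniform(low, high) for _ in range(num_views)]


train, val, test = split_counts(11706)
angles = sample_view_angles(rng=random.Random(0))  # seeded for repeatability
print(train, val, test)
```

Under this rounding rule the three sets contain 8779, 1170 and 1757 objects; a different rounding convention would shift these counts by at most one object.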

Further, for the real sculpture dataset, a new dataset of 307 real sculptures is compiled and, like the blobby object dataset, divided into a training set, a validation set and a test set in the ratio 75:10:15. As for the blob objects, images are rendered for 5 views; the sculptures in the images are not subject to a strict orientation constraint.

Further, the loss function is as follows. Given a set of angles θ₁…θ_N and a set of images I₁…I_N, let S denote the ground-truth silhouette, with S_{x,y} ∈ {0, 1}, where 0 denotes object and 1 denotes non-object. A function g^θ′ is learned that generates a silhouette S at angle θ′. The per-pixel binary cross-entropy loss L, which minimizes the difference between the ground-truth silhouette and the predicted silhouette, is given by the following function:
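The formula itself is not reproduced in this text. A minimal sketch, assuming the standard per-pixel binary cross-entropy L = −Σ_{x,y} [ S_{x,y} log Ŝ_{x,y} + (1 − S_{x,y}) log(1 − Ŝ_{x,y}) ], where Ŝ is the silhouette predicted at angle θ′, could look as follows. The symbol Ŝ, the function name, the summation over pixels (rather than a mean) and the clamping constant are assumptions.

```python
import math


def silhouette_bce(target, predicted, eps=1e-7):
    """Sum of per-pixel binary cross-entropy between two silhouettes.

    target:    2D list of 0/1 labels (ground-truth silhouette S).
    predicted: 2D list of probabilities in [0, 1] (network output at theta').
    eps:       clamp applied to predictions to avoid log(0) (an assumption).
    """
    total = 0.0
    for t_row, p_row in zip(target, predicted):
        for t, p in zip(t_row, p_row):
            p = min(max(p, eps), 1.0 - eps)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total


# An uninformative prediction of 0.5 everywhere costs log(2) per pixel.
loss = silhouette_bce([[0, 1]], [[0.5, 0.5]])
```

The loss is computed on the 2D silhouette only, which is why it does not depend on the resolution of any 3D representation.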

In the 3D shape encoding, one or more images are given and a new view of the 3D object is to be generated: the network is required to predict the object's silhouette at a new viewpoint that is specified only by its angle. To perform this task the network must encode 3D shape. Because the task concentrates on silhouettes, the network does not have to learn to predict image intensities, which makes learning easier, and no geometric-image or 3D voxel representation is needed during training. Geometrically, if the network can predict the silhouettes of multiple views, it must have encoded the outline of the three-dimensional object in the given images. SilNet therefore encodes 3D shape; the encoding is then probed either by requiring it to create silhouettes of new objects at new viewpoints, or by extracting the learned 3D shape.

In building the silhouette network, SilNet uses an encoder and decoder design in order to generate silhouettes from multiple views. It consists of an encoder f that is replicated at least as many times as there are views; because all encoder copies share their parameters, memory does not increase. The encoders are combined in a pooling layer φ, which max-pools the feature vectors of the individual modules to learn a combined feature vector; this allows SilNet to handle multiple images. After the pooling layer has processed the important features of each image, a decoder g^θ′ up-samples the feature vector to generate a 2D silhouette image at the new viewpoint θ′. With the 3D decoder, the learned feature vector instead generates a latent 3D shape representation, and a projection layer projects this latent shape to a 2D silhouette. Each image and its θ are encoded in a separate encoder module. Images are resized to 112 × 112, and the parameter θ is encoded as (sin θ, cos θ) to represent the distribution of angles; these θ values pass through two fully connected layers and are concatenated to the corresponding module. In the decoder, the feature vector is up-sampled, followed by a pixel-wise sigmoid function, and an additional convolutional layer is added after each of the last two up-sampling layers.
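Two steps of this design can be sketched in isolation: the (sin θ, cos θ) encoding of the rendering angle and the element-wise max pooling that merges per-view feature vectors. The sketch below is illustrative only; the function names and the three-element feature vectors are hypothetical stand-ins for the outputs of the shared encoder f, whose layers are not specified here.

```python
import math


def encode_angle(theta_deg):
    """Encode an angle as (sin, cos), so that 0 and 360 degrees coincide."""
    t = math.radians(theta_deg)
    return (math.sin(t), math.cos(t))


def max_pool_views(features):
    """Element-wise maximum over per-view feature vectors of equal length."""
    return [max(column) for column in zip(*features)]


# Hypothetical feature vectors standing in for encoder outputs f(I_i, theta_i).
view_features = [
    [0.1, 0.9, -0.3],
    [0.4, 0.2, 0.0],
    [0.3, 0.5, -0.1],
]
combined = max_pool_views(view_features)  # [0.4, 0.9, 0.0]
```

Because the maximum is taken per element, the combined vector has the same length for any number of views and does not depend on the order in which the views are supplied, which is what lets one network accept one or several input images.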

Further, for the 3D decoder: in order to extract the 3D object and to determine whether the 2D features encode information about 3D shape, the decoder of SilNet is changed so that, with the encoder held fixed, SilNet learns a latent representation of 3D shape. In the 3D decoder, the combined feature vector is up-sampled with 3D transposed convolutions to produce a volume of size 57 × 57 × 57, denoted V, followed by a sigmoid layer. This volume can serve as the 3D representation of the object. It is projected to a silhouette by the projection layer and, as with the 2D decoder, the binary cross-entropy loss is computed on the projected silhouette, so that no loss is applied directly to the 3D representation.

Further, the projection layer is denoted T_θ′. Given a voxel volume that is assumed to represent the 3D shape, the layer projects the voxels onto a 2D image so that the loss function on silhouettes can be used. V is first rotated by θ′ using a nearest-neighbour sampler to give V_θ′; whether a pixel is filled is then determined by the minimum of all depth values at each pixel location. Assuming orthographic projection and rotation about the vertical axis by θ′, the projected image pixel p_{j,k} is given by:

The rotated volume V_θ′(i, j, k) is given by:

where V_θ′(i, j, k) denotes the rotated volume. Because the layer is a composition of differentiable functions, the pixel-level classification loss can be back-propagated through it.
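The two formulas referred to above are not reproduced in this text. The following minimal sketch shows one way such a layer could work, assuming p_{j,k} = min_i V_θ′(i, j, k) for the projection (consistent with 0 denoting object) and an inverse-mapped, centre-anchored rotation in the i-k plane with rounding to the nearest voxel. The rotation direction, the centring, the treatment of out-of-range samples as empty and all function names are assumptions.

```python
import math


def rotate_volume(vol, theta_deg):
    """Rotate a cubic volume about its vertical (j) axis, nearest-neighbour.

    vol[i][j][k]: i is depth, j is height, k is width; 0 = object, 1 = empty.
    Samples that fall outside the grid are treated as empty (an assumption).
    """
    n = len(vol)
    c = (n - 1) / 2.0  # rotate about the volume centre
    t = math.radians(theta_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    out = [[[1] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for k in range(n):
            # Map each output cell back to its nearest source cell.
            src_i = int(round(cos_t * (i - c) - sin_t * (k - c) + c))
            src_k = int(round(sin_t * (i - c) + cos_t * (k - c) + c))
            if 0 <= src_i < n and 0 <= src_k < n:
                for j in range(n):
                    out[i][j][k] = vol[src_i][j][src_k]
    return out


def project_silhouette(vol):
    """Orthographic projection along depth: pixel (j, k) is the minimum over i."""
    n = len(vol)
    return [[min(vol[i][j][k] for i in range(n)) for k in range(n)]
            for j in range(n)]


# A 3 x 3 x 3 volume with a single object voxel at depth 0, centre of the image.
vol = [[[1] * 3 for _ in range(3)] for _ in range(3)]
vol[0][1][1] = 0
front = project_silhouette(rotate_volume(vol, 0))
side = project_silhouette(rotate_volume(vol, 90))
```

In an automatic differentiation framework the gradient of the minimum flows to the voxel that attains it, so a loss on the projected silhouette can still update the volume even though no loss is placed on the volume itself.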

In network training and testing, results are reported on the test partition of each dataset using intersection over union (IoU). For a predicted silhouette S and a ground-truth silhouette, the IoU is defined in terms of an indicator function I, where I = 1 if a pixel belongs to the object; the mean IoU is the average over all images. Each dataset is split at random into a training set, a validation set and a test set such that every object appears in only one set, which ensures that SilNet's generalization to unseen objects is measured. When training with N modules, N+1 views of an object are selected at random; the mask of one view serves as the silhouette to be predicted, and the remaining views serve as input images. After the results have been computed, SilNet is retrained along with the training, validation and test sets. To compare the results of networks with different numbers of modules, N+1 views of an object are selected at random for the network with N modules; each additional module receives one additional, previously unselected view, and when different SilNets are compared, the unselected views are kept the same. For the blobby object dataset, SilNet is trained with stochastic gradient descent, with momentum 0.9, weight decay 0.001 and batch size 16. The network initialized on the blobby object dataset is then fine-tuned on the sculpture dataset.
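The IoU formula is not reproduced in this text. A minimal sketch, assuming the usual definition (number of pixels marked as object in both silhouettes divided by the number marked as object in either), could be written as follows. The function names, the binarised inputs and the value returned for two empty silhouettes are assumptions; the object label defaults to 0 to match the labelling convention stated for S.

```python
def silhouette_iou(predicted, target, object_label=0):
    """Intersection over union of the object regions of two binary silhouettes."""
    inter = 0
    union = 0
    for p_row, t_row in zip(predicted, target):
        for p, t in zip(p_row, t_row):
            p_obj = p == object_label
            t_obj = t == object_label
            inter += p_obj and t_obj
            union += p_obj or t_obj
    # Two empty silhouettes agree perfectly (a convention chosen here).
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (predicted, target) silhouette pairs."""
    return sum(silhouette_iou(p, t) for p, t in pairs) / len(pairs)


pred = [[0, 0], [1, 1]]
truth = [[0, 1], [1, 1]]
score = silhouette_iou(pred, truth)               # 1 shared pixel of 2 -> 0.5
average = mean_iou([(pred, truth), (truth, truth)])  # (0.5 + 1.0) / 2
```

A network output passed through a sigmoid would first need to be thresholded to obtain binary silhouettes; the threshold is not stated in the text.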

Brief description of the drawings

Fig. 1 is a system framework diagram of the multi-view reconstruction method based on a deep learning silhouette network of the present invention.

Fig. 2 shows new views generated by the multi-view reconstruction method based on a deep learning silhouette network of the present invention.

Fig. 3 is a training framework diagram of the multi-view reconstruction method based on a deep learning silhouette network of the present invention.

Fig. 4 shows dataset samples for the multi-view reconstruction method based on a deep learning silhouette network of the present invention.

Detailed description of the embodiments

It should be noted that, where no conflict arises, the embodiments of the present application and the features in those embodiments may be combined with one another. The invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a system framework diagram of the multi-view reconstruction method based on a deep learning silhouette network of the present invention. It mainly comprises introducing a deep learning architecture, 3D shape encoding, building the silhouette network, and network training and testing.

Fig. 2 shows new views generated by the multi-view reconstruction method based on a deep learning silhouette network of the present invention. SilNet processes images together with their rendering angles θ and generates new views of a sculpture. Given a set of angles θ₁…θ_N and a set of images I₁…I_N, let S denote the ground-truth silhouette, with S_{x,y} ∈ {0, 1}, where 0 denotes object and 1 denotes non-object. A function g^θ′ is learned that generates a silhouette S at angle θ′. The per-pixel binary cross-entropy loss L, which minimizes the difference between the ground-truth silhouette and the predicted silhouette, is given by the following function:

Fig. 3 is a training framework diagram of the multi-view reconstruction method based on a deep learning silhouette network of the present invention. Independent encoders f process each image I and rendering angle θ and output feature vectors, and the decoder g generates the silhouette of the target in the new view θ′. To generate silhouettes from multiple views, SilNet uses an encoder and decoder design. It consists of an encoder f that is replicated at least as many times as there are views; because all encoder copies share their parameters, memory does not increase. The encoders are combined in a pooling layer φ, which max-pools the feature vectors of the individual modules to learn a combined feature vector; this allows SilNet to handle multiple images. After the pooling layer has processed the important features of each image, a decoder g^θ′ up-samples the feature vector to generate a 2D silhouette image at the new viewpoint θ′. With the 3D decoder, the learned feature vector instead generates a latent 3D shape representation, and a projection layer projects this latent shape to a 2D silhouette. Each image and its θ are encoded in a separate encoder module. Images are resized to 112 × 112, and the parameter θ is encoded as (sin θ, cos θ) to represent the distribution of angles; these θ values pass through two fully connected layers and are concatenated to the corresponding module. In the decoder, the feature vector is up-sampled, followed by a pixel-wise sigmoid function, and an additional convolutional layer is added after each of the last two up-sampling layers.

Fig. 4 shows dataset samples for the multi-view reconstruction method based on a deep learning silhouette network of the present invention. The samples from each dataset illustrate the diversity and complexity of the sculptures. The blobby object dataset is used for pre-training; the sculpture dataset is used to demonstrate that silhouettes make it possible to learn and encode 3D shape, and to generate new silhouette views of sculptures of various shapes and materials.

To those skilled in the art, the present invention is not limited to the details of the above embodiments, and it can be realized in other specific forms without departing from its spirit and scope. In addition, those skilled in the art may make various modifications and variations to the invention without departing from its spirit and scope, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the invention.

Claims

1. A multi-view reconstruction method based on a deep learning silhouette network, characterized in that it mainly comprises: introducing a deep learning architecture (1); 3D shape encoding (2); building the silhouette network (3); and network training and testing (4).

2. The introducing of a deep learning architecture (1) according to claim 1, characterized in that a blobby object dataset and a real sculpture dataset are introduced, wherein the blobby object dataset is used for pre-training and the sculpture dataset is used to demonstrate that silhouettes make it possible to learn and encode 3D shape, and to generate new silhouette views of sculptures of various shapes and materials; in order to predict 3D shape from multiple images and to handle smooth 3D surfaces, as applied to smoothly textured sculptures, a deep learning silhouette network (SilNet) is introduced; the network learns a 3D shape encoding of one or more input images, and this encoding is then used to adjust a decoder that generates new views; a silhouette-based proxy loss is then introduced, and when the decoder contains no 3D representation the network uses this 2D loss to encode 3D shape, the 2D loss not being limited by the resolution of a 3D representation; a large blobby object dataset is generated and used to pre-train the network, and SilNet is fine-tuned on the dataset.

3. The blobby object dataset according to claim 2, characterized in that it consists of smooth surfaces created from implicit surfaces and contains 11706 blob objects, divided into a training set, a validation set and a test set in the ratio 75:10:15; each object has five images, and to learn from them the deep learning silhouette network must find an encoding of 3D shape; the images are rendered in Blender using orthographic projection, in the following steps: first, three light sources are placed at random around the object; second, the camera is rotated about the z-axis, and the value of each rendering angle θ is chosen at random from [0°, 120°]; finally, a complex texture model is used, with surface scattering and specular reflection.

4. The real sculpture dataset according to claim 2, characterized in that a new dataset of 307 real sculptures is compiled and, like the blobby object dataset, divided into a training set, a validation set and a test set in the ratio 75:10:15; as for the blob objects, images are rendered for 5 views, and the sculptures in the images are not subject to a strict orientation constraint.

5. The loss according to claim 2, characterized in that the loss function is as follows: given a set of angles θ₁…θ_N and a set of images I₁…I_N, S denotes the ground-truth silhouette, with S_{x,y} ∈ {0, 1}, where 0 denotes object and 1 denotes non-object; a function g^θ′ is learned that generates a silhouette S at angle θ′; the per-pixel binary cross-entropy loss L, which minimizes the difference between the ground-truth silhouette and the predicted silhouette, is given by the following function:

6. The 3D shape encoding (2) according to claim 1, characterized in that one or more images are given and a new view of the 3D object is to be generated, the network being required to predict the object's silhouette at a new viewpoint that is specified only by its angle; to perform this task the network must encode 3D shape; because the task concentrates on silhouettes, the network does not have to learn to predict image intensities, which makes learning easier, and no geometric-image or 3D voxel representation is needed during training; geometrically, if the network can predict the silhouettes of multiple views, it must have encoded the outline of the three-dimensional object in the given images; SilNet therefore encodes 3D shape, and the encoding is then probed either by requiring it to create silhouettes of new objects at new viewpoints, or by extracting the learned 3D shape.

7. The building of the silhouette network (3) according to claim 1, characterized in that, in order to generate silhouettes from multiple views, SilNet uses an encoder and decoder design; it consists of an encoder f that is replicated at least as many times as there are views, and because all encoder copies share their parameters, memory does not increase; the encoders are combined in a pooling layer φ, which max-pools the feature vectors of the individual modules to learn a combined feature vector, which allows SilNet to handle multiple images; after the pooling layer has processed the important features of each image, a decoder g^θ′ up-samples the feature vector to generate a 2D silhouette image at the new viewpoint θ′; with the 3D decoder, the learned feature vector instead generates a latent 3D shape representation, and a projection layer projects this latent shape to a 2D silhouette; each image and its θ are encoded in a separate encoder module; images are resized to 112 × 112, and the parameter θ is encoded as (sin θ, cos θ) to represent the distribution of angles, these θ values passing through two fully connected layers and being concatenated to the corresponding module; in the decoder, the feature vector is up-sampled, followed by a pixel-wise sigmoid function, and an additional convolutional layer is added after each of the last two up-sampling layers.

8. The 3D decoder according to claim 7, characterized in that, in order to extract the 3D object and to determine whether the 2D features encode information about 3D shape, the decoder of SilNet is changed so that, with the encoder held fixed, SilNet learns a latent representation of 3D shape; in the 3D decoder, the combined feature vector is up-sampled with 3D transposed convolutions to produce a volume of size 57 × 57 × 57, denoted V, followed by a sigmoid layer; this volume can serve as the 3D representation of the object; it is projected to a silhouette by the projection layer and, as with the 2D decoder, the binary cross-entropy loss is computed on the projected silhouette, so that no loss is applied directly to the 3D representation.

9. The projection layer according to claim 7, characterized in that the projection layer is denoted T_θ′; given a voxel volume that is assumed to represent the 3D shape, the layer projects the voxels onto a 2D image so that the loss function on silhouettes can be used; V is first rotated by θ′ using a nearest-neighbour sampler to give V_θ′, and whether a pixel is filled is determined by the minimum of all depth values at each pixel location; assuming orthographic projection and rotation about the vertical axis by θ′, the projected image pixel p_{j,k} is given by:

The rotated volume V_θ′(i, j, k) is given by:

where V_θ′(i, j, k) denotes the rotated volume; because the layer is a composition of differentiable functions, the pixel-level classification loss can be back-propagated through it.

10. The network training and testing (4) according to claim 1, characterized in that results are reported on the test partition of each dataset using intersection over union (IoU); for a predicted silhouette S and a ground-truth silhouette, the IoU is defined in terms of an indicator function I, where I = 1 if a pixel belongs to the object, and the mean IoU is the average over all images; each dataset is split at random into a training set, a validation set and a test set such that every object appears in only one set, which ensures that SilNet's generalization to unseen objects is measured; when training with N modules, N+1 views of an object are selected at random, the mask of one view serving as the silhouette to be predicted and the remaining views serving as input images; after the results have been computed, SilNet is retrained along with the training, validation and test sets; to compare the results of networks with different numbers of modules, N+1 views of an object are selected at random for the network with N modules, each additional module receives one additional, previously unselected view, and when different SilNets are compared the unselected views are kept the same; for the blobby object dataset, SilNet is trained with stochastic gradient descent, with momentum 0.9, weight decay 0.001 and batch size 16; the network initialized on the blobby object dataset is then fine-tuned on the sculpture dataset.