CN110689514A - Training method and computer device for a new viewing angle synthesis model of a transparent object - Google Patents

Training method and computer device for a new viewing angle synthesis model of a transparent object

Info

Publication number
CN110689514A
CN110689514A (application number CN201910964836.6A)
Authority
CN
China
Prior art keywords
image
prediction
visual angle
mask
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910964836.6A
Other languages
Chinese (zh)
Other versions
CN110689514B (en)
Inventor
黄惠
吴博剑
吕佳辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN201910964836.6A
Publication of CN110689514A
Application granted
Publication of CN110689514B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

During training, a convolutional neural network outputs a prediction mask, a prediction attenuation map and a prediction refraction flow from a first image, a second image and a mixing coefficient, rather than directly producing a predicted image. Because the prediction refraction flow reflects the light transmission matrix at the new viewing angle, the convolutional neural network learns the complex light transmission behavior of light passing through the transparent object; the predicted image of the transparent object at the new viewing angle is then obtained from the prediction mask, the prediction attenuation map and the prediction refraction flow. A new viewing angle synthesis model is obtained by iteratively training the convolutional neural network. Given the transparent object image at a first viewing angle and the transparent object image at a second viewing angle, the trained new viewing angle synthesis model can obtain a high-quality composite image at any viewing angle between the first viewing angle and the second viewing angle.

Description

Training method and computer device for a new viewing angle synthesis model of a transparent object
Technical Field
The present application relates to the field of image processing technologies, and in particular to a training method and a computer device for a new viewing angle synthesis model of a transparent object.
Background
New viewing angle synthesis generates images at new viewpoints from images of an object or scene captured at fixed viewpoints, typically by interpolating or warping images from nearby viewpoints. Current research on new viewing angle synthesis, on the one hand, focuses mainly on Lambertian surfaces: it is difficult to explicitly model light transmission, and viewpoint-dependent effects such as specular reflection or transparency are not considered, so feature correspondences between images are missing. This causes methods based on image warping or on geometric inference to fail, making new viewing angle synthesis of transparent objects very challenging. On the other hand, training an image-to-image network to directly output the image at the new viewing angle requires the network not only to account for the light transmission behavior but also to model the attributes of the image itself, which remains very difficult for transparent objects. Existing new viewing angle synthesis methods therefore cannot be applied directly to transparent objects.
Therefore, the prior art is in need of improvement.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a training method and a computer device for a new viewing angle synthesis model of a transparent object, so as to realize new viewing angle synthesis for transparent objects.
In one aspect, an embodiment of the present invention provides a method for training a new viewing angle synthesis model of a transparent object, the method comprising:
inputting a first image, a second image and a mixing coefficient in training data into a convolutional neural network, and outputting a prediction mask, a prediction attenuation map and a prediction refraction flow through the convolutional neural network, wherein the training data comprises a plurality of training image groups, each training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first viewing angle, the second image is a transparent object image shot at a second viewing angle, the real image is a transparent object image shot at a new viewing angle between the first viewing angle and the second viewing angle, and the mixing coefficient represents the viewing angle relation among the first viewing angle, the second viewing angle and the new viewing angle;
calculating a predicted image according to the prediction mask, the prediction attenuation map and the prediction refraction flow, wherein the predicted image is the transparent object image predicted by the convolutional neural network at the new viewing angle;
and adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image, and continuing to perform the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network until a preset training condition is met, so as to obtain a new viewing angle synthesis model.
As a further improved technical solution, the convolutional neural network comprises an encoding module and a decoding module, and the inputting of the first image, the second image and the mixing coefficient in the training data into the convolutional neural network and the outputting of a prediction mask, a prediction attenuation map and a prediction refraction flow through the convolutional neural network comprises:
inputting the first image, the second image and the mixing coefficient into the encoding module to obtain depth features; and inputting the depth features into the decoding module to obtain the prediction mask, the prediction attenuation map and the prediction refraction flow.
As a further improved technical solution, the encoding module includes a first encoder, a second encoder and a convolutional layer, the depth features include a first depth feature, a second depth feature, a third depth feature, a fourth depth feature and a mixed depth feature, and the inputting of the first image, the second image and the mixing coefficient into the encoding module to obtain the depth features includes:
inputting the first image into a first encoder to obtain a first depth feature and a second depth feature corresponding to the first image;
inputting the second image into a second encoder to obtain a third depth feature and a fourth depth feature corresponding to the second image;
the second depth feature, the fourth depth feature, and the blending coefficient are input to the convolutional layer to obtain a blended depth feature.
As a further improved technical solution, the decoding module includes a first decoder, a second decoder and a third decoder, and the inputting the depth features into the decoding module to obtain a prediction mask, a prediction attenuation map and a prediction refraction flow includes:
inputting the first depth feature, the third depth feature and the mixed depth feature into the first decoder to obtain the prediction mask;
inputting the first depth feature, the third depth feature and the mixed depth feature into the second decoder to obtain the prediction attenuation map;
and inputting the first depth feature, the third depth feature and the mixed depth feature into the third decoder to obtain the prediction refraction flow.
As a further improved technical solution, the adjusting of the parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image includes:
calculating a total loss value according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image;
and adjusting parameters of the convolutional neural network according to the total loss value.
As a further improvement, the calculating of the total loss value according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image includes:
calculating a real mask, a real attenuation map and a real refraction flow according to the real image;
calculating a mask loss value according to the predicted mask and the real mask;
calculating an attenuation loss value according to the predicted attenuation map and the real attenuation map;
calculating a refraction flow loss value according to the predicted refraction flow and the real refraction flow;
calculating a composition loss value and a perceptual loss value according to the predicted image and the real image;
and calculating the total loss value according to the mask loss value, the attenuation loss value, the refraction flow loss value, the composition loss value and the perceptual loss value.
As a further improved technical solution, before inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network, the method includes:
and calculating the mixing coefficient according to the viewing angle sequence number of the first image, the viewing angle sequence number of the second image and the viewing angle sequence number of the real image.
In a second aspect, an embodiment of the present invention provides a new viewing angle synthesis method for a transparent object, the method comprising:
acquiring a first image to be processed, a second image to be processed and a mixing coefficient to be processed;
inputting the first image to be processed, the second image to be processed and the mixing coefficient to be processed into a new viewing angle synthesis model to obtain a mask to be processed, an attenuation map to be processed and a refraction flow to be processed, wherein the new viewing angle synthesis model is obtained by training through the above training method of the new viewing angle synthesis model of the transparent object;
and calculating a composite image through an environment matte according to the mask to be processed, the attenuation map to be processed and the refraction flow to be processed, wherein the viewing angle of the composite image is between the viewing angle of the first image to be processed and the viewing angle of the second image to be processed.
In a third aspect, an embodiment of the present invention provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the following steps when executing the computer program:
inputting a first image, a second image and a mixing coefficient in training data into a convolutional neural network, and outputting a prediction mask, a prediction attenuation map and a prediction refraction flow through the convolutional neural network, wherein the training data comprises a plurality of training image groups, each training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first viewing angle, the second image is a transparent object image shot at a second viewing angle, the real image is a transparent object image shot at a new viewing angle between the first viewing angle and the second viewing angle, and the mixing coefficient represents the viewing angle relation among the first viewing angle, the second viewing angle and the new viewing angle;
calculating a predicted image according to the prediction mask, the prediction attenuation map and the prediction refraction flow, wherein the predicted image is the transparent object image predicted by the convolutional neural network at the new viewing angle;
and adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image, and continuing to perform the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network until a preset training condition is met, so as to obtain a new viewing angle synthesis model.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following steps:
inputting a first image, a second image and a mixing coefficient in training data into a convolutional neural network, and outputting a prediction mask, a prediction attenuation map and a prediction refraction flow through the convolutional neural network, wherein the training data comprises a plurality of training image groups, each training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first viewing angle, the second image is a transparent object image shot at a second viewing angle, the real image is a transparent object image shot at a new viewing angle between the first viewing angle and the second viewing angle, and the mixing coefficient represents the viewing angle relation among the first viewing angle, the second viewing angle and the new viewing angle;
calculating a predicted image according to the prediction mask, the prediction attenuation map and the prediction refraction flow, wherein the predicted image is the transparent object image predicted by the convolutional neural network at the new viewing angle;
and adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image, and continuing to perform the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network until a preset training condition is met, so as to obtain a new viewing angle synthesis model.
Compared with the prior art, the embodiment of the invention has the following advantages:
according to the training method provided by the embodiment of the invention, a first image, a second image and a mixing coefficient in training data are input into a convolutional neural network, and a prediction mask, a prediction attenuation map and a prediction refraction flow are output through the convolutional neural network, wherein the training data comprises a plurality of groups of training image groups, each group of training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first visual angle, the second image is a transparent object image shot at a second visual angle, the real image is a transparent object image shot at a new visual angle between the first visual angle and the second visual angle, and the mixing coefficient represents a visual angle relation among the first visual angle, the second visual angle and the new visual angle; calculating to obtain a predicted image according to the prediction mask, the prediction attenuation map and the prediction refraction flow, wherein the predicted image is a transparent object image predicted by a convolutional neural network under a new visual angle; and adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the prediction image and the real image, and continuously executing the step of inputting the first image and the second image in the training data into the convolutional neural network until a preset training condition is met to obtain a new view synthesis model. During training, the convolutional neural network outputs a prediction mask, a prediction attenuation map and a prediction refraction flow according to a first image, a second image and a mixing coefficient instead of directly obtaining a prediction image, wherein the prediction refraction flow reflects a light transmission matrix of a new visual angle, so that the convolutional neural network learns the complex light transmission behavior of light rays passing through a transparent object, then obtains the prediction image of the transparent object under the new visual angle according to the prediction mask, the prediction attenuation map and the prediction refraction flow, and obtains a new visual angle synthetic model through iterative training of the convolutional neural network; the new visual angle synthesis model obtained through training can obtain a synthesis image of any visual angle between the first visual angle and the second visual angle according to the transparent image of the first visual angle and the transparent image of the second visual angle, and the synthesis image has high quality.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic flow chart of a method for training a new viewing angle synthesis model of a transparent object according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the process of inputting a first image, a second image and a mixing coefficient into a convolutional neural network to obtain a prediction mask, a prediction attenuation map and a prediction refraction flow in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a hierarchical structure of a convolutional neural network in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the quality of predicted images obtained with different combinations, evaluated using PSNR and SSIM, in an embodiment of the present invention;
FIG. 5 is a schematic diagram of acquiring a real mask, a real attenuation map and a real refraction flow from a real image according to an embodiment of the present invention;
FIG. 6 is a rendering background diagram in an embodiment of the invention;
FIG. 7 is a diagram illustrating real images captured with a Point Grey Flea color camera for training and testing in an embodiment of the present invention;
FIG. 8 is a diagram illustrating the quantitative evaluation results of 5 other categories according to an embodiment of the present invention;
FIG. 9 is a schematic flow chart illustrating a new viewing angle synthesis method for a transparent object according to an embodiment of the present invention;
FIG. 10 is a diagram illustrating a synthesis example of Airplane in an embodiment of the present invention;
FIG. 11 is a diagram illustrating a synthesis example of Glass_water in an embodiment of the present invention;
FIG. 12 is a diagram showing a synthetic example of Bottle in the embodiment of the present invention;
FIG. 13 is a diagram illustrating a synthesis example of Bench in an embodiment of the present invention;
FIG. 14 is a diagram illustrating the synthesis of Table in the embodiment of the present invention;
Fig. 15 is an internal structural view of a computer device in the embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Various non-limiting embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Referring to FIG. 1, a method for training a new viewing angle synthesis model of a transparent object in an embodiment of the present invention is shown. In this embodiment, the method may include, for example, the following steps:
and S1, inputting a first image, a second image and a mixing coefficient in training data into a convolutional neural network, and outputting a prediction mask, a prediction attenuation map and a prediction refraction flow through the convolutional neural network, wherein the training data comprises a plurality of groups of training image groups, each group of training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first visual angle, the second image is a transparent object image shot at a second visual angle, the real image is a transparent object image shot at a new visual angle between the first visual angle and the second visual angle, and the mixing coefficient represents the visual angle relation among the first visual angle, the second visual angle and the new visual angle.
In the embodiment of the invention, the first image, the second image and the real image come from sparsely sampled images shot by cameras at different viewing angles: for a transparent object, images of the transparent object at multiple viewing angles can be shot, and the images at the multiple viewing angles are numbered with viewing angle sequence numbers. For example, the camera moves around the transparent object at a constant speed and captures an image sequence C = {C_k | k = 0, 1, …, N}, where C_0 denotes the image with viewing angle sequence number 0. A first image C_L and a second image C_R (0 ≤ L < R ≤ N) are randomly selected from the image sequence, together with a real image C_t (L < t < R) for supervised learning. The first image is the transparent object image shot at the first viewing angle, and in this example the viewing angle sequence number of the first viewing angle is L; similarly, the viewing angle sequence number of the second viewing angle is R, and the viewing angle sequence number corresponding to the real image is t. The acquisition of the training data will be described in detail later.
Specifically, step S1 is preceded by:
and M, calculating a mixing coefficient according to the visual angle sequence number of the first image, the visual angle sequence number of the second image and the visual angle sequence number of the real visual angle image.
Since the image sequence is captured by a camera moving around the transparent object at a constant speed, the first image, the second image and the real image are selected at training time, and the mixing coefficient is determined once they have been selected. The mixing coefficient represents the viewing angle relation among the first viewing angle, the second viewing angle and the new viewing angle, and the mixing coefficient α can be calculated by formula (1):

α = (t - L) / (R - L)    (1)

where t is the viewing angle sequence number of the real image, L is the viewing angle sequence number of the first image, and R is the viewing angle sequence number of the second image. The mixing coefficient is input into the convolutional neural network, and according to the mixing coefficient the convolutional neural network outputs the mask, refraction flow and attenuation map corresponding to the predicted image of the first image and the second image under that mixing coefficient.
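As an illustration of how a training image group and its mixing coefficient might be assembled from such an image sequence, a minimal Python sketch follows; the function and variable names are hypothetical and the sampling strategy is only one possibility, not part of the claimed method.

```python
import random

def sample_training_group(image_sequence):
    """Sample (C_L, C_R, C_t, alpha) from a view-ordered image sequence.

    image_sequence: images C_0 ... C_N captured while the camera moves
    around the transparent object at a constant speed (length >= 3).
    """
    n = len(image_sequence) - 1
    # Randomly pick view indices with 0 <= L < t < R <= N.
    L, R = sorted(random.sample(range(n + 1), 2))
    while R - L < 2:                      # need at least one view in between
        L, R = sorted(random.sample(range(n + 1), 2))
    t = random.randint(L + 1, R - 1)

    alpha = (t - L) / (R - L)             # formula (1): mixing coefficient
    return image_sequence[L], image_sequence[R], image_sequence[t], alpha
```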
In the embodiment of the invention, the convolutional neural network can obtain, from the first image, the second image and the mixing coefficient, the prediction mask m̂, the prediction attenuation map ρ̂ and the prediction refraction flow Ŵ corresponding to the new viewing angle. Details of step S1 will be described later.
And S2, calculating a predicted image of the first image and the second image under the mixing coefficient according to the prediction mask, the prediction attenuation map and the prediction refraction flow, wherein the predicted image is the transparent object image predicted by the convolutional neural network at the new viewing angle.
In the embodiment of the invention, the environment matte can describe the reflection and refraction of the transparent object when it interacts with light in the environment, as well as any transmission effect of the foreground object; in order to composite the transparent object into a new background well, the core of the environment matte is to accurately estimate the light transmission matrix. Using the environment matte, the new viewing angle image with viewing angle sequence number t, that is, the transparent object image predicted by the convolutional neural network at the new viewing angle, can be synthesized from the prediction mask, the prediction attenuation map and the prediction refraction flow. For a transparent object, the environment matte can be expressed as follows:

C_ij = (1 - m_ij)·B_ij + m_ij·(F_ij + ρ_ij·B(W_ij))    (2)

where C denotes the composite image, F denotes the ambient illumination, and B is the background image. If the background image B equals 0, then C equals F; that is, when the background image is pure black, the ambient illumination F is easily obtained. Further, since the subject is a transparent object, F = 0. In addition, m ∈ {0, 1} denotes the binary object mask; where m = 0, the composite color comes directly from the background image. The refraction flow W is used to represent the light transmission matrix and characterizes the correspondence between pixels of the composite image and pixels of the background image; for simplicity, it is assumed that one pixel of the composite image comes from only one corresponding pixel in the background image, and W denotes the correspondence between a single pixel in the composite image and a pixel in the background image. B(W) denotes the background image indexed pixel by pixel through W: for example, if W_ij = (a, b), then B_ab is indexed to compute C_ij, where B_ab and C_ij denote the background pixel at location (a, b) and the synthesized pixel value at location (i, j), respectively. Further, ρ denotes the attenuation map; for each pixel, if no light passes through, the attenuation value is 0, and if light passes through without attenuation, the attenuation value equals 1.
In an embodiment of the invention, the prediction mask m̂, the prediction attenuation map ρ̂ and the prediction refraction flow Ŵ output by the convolutional neural network are substituted into formula (2), the pixel value of each pixel in the predicted image is calculated, and the predicted image Ĉ is obtained from these pixel values.
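A minimal sketch of the composition of formula (2) (with F = 0) is given below, assuming the refraction flow stores absolute pixel coordinates into the background image and using PyTorch's grid_sample for the pixel-by-pixel indexing; the tensor layout, coordinate normalization and function name are assumptions for illustration, not the disclosed implementation.

```python
import torch
import torch.nn.functional as F

def composite(mask, attenuation, refraction_flow, background):
    """Apply formula (2) with ambient illumination F = 0.

    mask:            (B, 1, H, W) predicted mask, values in [0, 1]
    attenuation:     (B, 1, H, W) predicted attenuation map
    refraction_flow: (B, 2, H, W) per-pixel (x, y) coordinates into the
                     background image, in pixels
    background:      (B, 3, H, W) background image B
    """
    b, _, h, w = background.shape
    # Convert absolute pixel coordinates to the [-1, 1] range expected
    # by grid_sample (grid layout: (B, H, W, 2), x first).
    grid = refraction_flow.permute(0, 2, 3, 1).clone()
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    refracted_bg = F.grid_sample(background, grid, align_corners=True)

    # C = (1 - m) * B + m * (rho * B(W)), since F = 0 for transparent objects
    return (1.0 - mask) * background + mask * attenuation * refracted_bg
```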
S3, adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image, and continuing to perform the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network until a preset training condition is met, so as to obtain a new viewing angle synthesis model.
In the embodiment of the invention, training is carried out in a supervised manner: the real image is used, together with the outputs of the convolutional neural network and the predicted image, to supervise the training of the convolutional neural network. The mask loss, attenuation loss and refraction flow loss are formed from the mask, attenuation map and refraction flow corresponding to the real image, and, in order to synthesize higher-quality new viewing angle images, a composition loss and a perceptual loss are added during training; the parameters of the convolutional neural network are adjusted accordingly until the preset training condition is met, so as to obtain the new viewing angle synthesis model.
During training, the convolutional neural network outputs the prediction mask, the prediction attenuation map and the prediction refraction flow from the first image, the second image and the mixing coefficient instead of directly producing a predicted image; because the prediction refraction flow reflects the light transmission matrix at the new viewing angle, the convolutional neural network learns the complex light transmission behavior of light passing through the transparent object. The predicted image of the transparent object at the new viewing angle is then obtained from the prediction mask, the prediction attenuation map and the prediction refraction flow, and the new viewing angle synthesis model is obtained by iteratively training the convolutional neural network. Given the transparent object image at the first viewing angle and the transparent object image at the second viewing angle, the trained new viewing angle synthesis model can obtain a high-quality composite image at any viewing angle between the first viewing angle and the second viewing angle.
Details of step S3 will be described later.
The details of step S1 in another implementation will be described in detail below.
In an embodiment of the invention, since light transmission through transparent objects is highly non-linear, the light transmission relationship is learned and modeled by the convolutional neural network by synthesizing a prediction mask, a prediction attenuation map and a prediction refraction flow for the intermediate viewing angle. Referring to FIG. 2, the first image C_L, the second image C_R and the mixing coefficient α are input into the convolutional neural network 100 to obtain the prediction mask m̂, the prediction attenuation map ρ̂ and the prediction refraction flow Ŵ, as shown in formula (3):

(m̂, ρ̂, Ŵ) = Network(C_L, C_R, α)    (3)
Network denotes the convolutional neural network. The convolutional neural network adopts an encoding-decoding framework and learns to synthesize the new viewing angle of the transparent object: the first image and the second image are taken as the input of the convolutional neural network and are projected into a matched depth feature space through successive convolutional layers; after the features are blended, a mixed depth feature is obtained, and the mixed depth feature serves as the basis for decoding and is used to simultaneously predict the mask, the attenuation map and the refraction flow at the new viewing angle.
Specifically, the convolutional neural network includes: an encoding module and a decoding module, and step S1 includes:
and S11, inputting the first image, the second image and the mixing coefficient into the coding module to obtain the depth characteristic.
In the embodiment of the present invention, referring to FIG. 3, the hierarchical structure of the convolutional neural network is shown. The encoding module includes a first encoder enc1 (101), a second encoder enc2 (102) and a convolutional layer CNN, and the first encoder enc1 and the second encoder enc2 share weights. Each encoder has multiple layers; for example, the first encoder and the second encoder each have 8 encoder layers (for convenience of description, the first encoder layer of the first encoder is denoted enc1-L1 (1011)), and the numbers of output channels of the encoder layers are: 64, 128, 256, 512, 512, 512, 512 and 512. In the encoding stage, each encoder uses the 8 consecutive encoder layers to downsample the first image and the second image to 1/256 of their original size.
The depth features include a first depth feature, a second depth feature, a third depth feature, a fourth depth feature, and a mixed depth feature, and specifically, the step S11 further includes:
and S111, inputting the first image into a first encoder to obtain a first depth feature and a second depth feature corresponding to the first image.
In the embodiment of the invention, how to balance depth feature blending and skip connections is studied: the influence on the final synthesis result is quantitatively investigated for combinations in which the last p layers are used for depth feature blending and the first q layers are used for skip connections, denoted (p blended; q connected). The quality of the predicted images obtained with different combinations is evaluated using PSNR and SSIM, and the results are summarized in FIG. 4: 3 examples are selected for quantitative evaluation of the performance of the different networks (the minimum value indicating the best performance), and M(ask)-IoU, A(ttenuation)-MSE, F(low)-EPE, C(omposition)-L1, PSNR and SSIM are evaluated respectively. According to the experimental data, (p = 6; q = 2) is selected as a better combination in order to balance detail preservation and feature blending.
In this embodiment of the present invention, the first depth feature is the feature output by the shallow coding layers of the first encoder; for example, when the first encoder has 8 encoder layers, the first coding layer and the second coding layer may be set as the shallow coding layers, so that the first depth feature includes the first depth feature-L1 output by the first coding layer of the first encoder and the first depth feature-L2 output by the second coding layer of the first encoder. The second depth feature is the feature output by the deep coding layers of the first encoder; for example, when the first encoder has 8 encoder layers, the third coding layer to the eighth coding layer may be set as the deep coding layers, so that the second depth feature includes the output results of the third coding layer to the eighth coding layer of the first encoder, namely the second depth feature-L3, the second depth feature-L4, the second depth feature-L5, the second depth feature-L6, the second depth feature-L7 and the second depth feature-L8.
and S112, inputting the second image into a second encoder to obtain a third depth feature and a fourth depth feature corresponding to the second image.
In this embodiment of the present invention, the third depth feature is the feature output by the shallow coding layers of the second encoder; for example, when the second encoder has 8 encoder layers, the first coding layer and the second coding layer may be set as the shallow coding layers, so that the third depth feature includes the third depth feature-L1 output by the first coding layer of the second encoder and the third depth feature-L2 output by the second coding layer of the second encoder. The fourth depth feature is the feature output by the deep coding layers of the second encoder; for example, when the second encoder has 8 encoder layers, the third coding layer to the eighth coding layer may be set as the deep coding layers, so that the fourth depth feature includes the output results of the third coding layer to the eighth coding layer of the second encoder, namely the fourth depth feature-L3, the fourth depth feature-L4, the fourth depth feature-L5, the fourth depth feature-L6, the fourth depth feature-L7 and the fourth depth feature-L8.
and S113, inputting the second depth feature, the fourth depth feature and the mixing coefficient into the convolutional layer to obtain a mixed depth feature.
In the embodiment of the present invention, the second depth feature is the feature output by the deep coding layers of the first encoder, and the fourth depth feature is the feature output by the deep coding layers of the second encoder. In order to synthesize the new viewing angle image, the inherent transformation relation among the first viewing angle, the second viewing angle and the new viewing angle is simulated by blending in the depth feature space: in the convolutional layer CNN, the depth features of the two encoders are blended according to formula (4) to obtain the mixed depth feature:

f_mix^k = (1 - α)·f_L^k + α·f_R^k    (4)

where k denotes a deep coding layer (assuming the third coding layer to the eighth coding layer are the deep coding layers, k may be 3, 4, …, 8), f_L^k denotes the depth feature corresponding to the first image output by the k-th coding layer of the first encoder, f_R^k denotes the depth feature corresponding to the second image output by the k-th coding layer of the second encoder, α is the mixing coefficient, and f_mix^k is the mixed depth feature of the k-th layer.
And S12, inputting the depth features into the decoding module to obtain a prediction mask, a prediction attenuation map and a prediction refraction flow.
In an embodiment of the invention, referring to FIG. 3, the decoding module comprises a first decoder (103), a second decoder (104) and a third decoder (105), which output the prediction mask, the prediction attenuation map and the prediction refraction flow, respectively, according to the depth features. Assuming that the encoder downsamples the first image and the second image to 1/256 of their original size with 8 consecutive encoding layers, then, due to symmetry, the decoder upsamples the compressed depth features in the opposite way with the same number of transposed decoding layers. Specifically, step S12 includes:
and S121, inputting the first depth feature, the third depth feature and the mixed depth feature into the decoding module and inputting the first depth feature, the third depth feature and the mixed depth feature into a first decoder to obtain a prediction mask.
In the embodiment of the invention, the first depth feature and the third depth feature are the output results of the shallow coding layers of the encoders, and the features of the shallow coding layers are connected to the decoder layers with the same spatial dimensions through skip connections (shown as 501-504 in FIG. 3), so that more detail and context information can be transmitted to the higher-resolution decoding layers.
For example, the first encoder and the second encoder each have 8 encoding layers, the first encoding layer of the first encoder and the first encoding layer of the second encoder are skip-connected to respective first decoding layers of the decoding module, and the second encoding layer of the first encoder and the second encoding layer of the second encoder are skip-connected to respective second decoding layers of the decoding module.
The mixed depth feature is obtained from the output results of the deep coding layers of the encoders, and in step S121 the first decoder is configured to output the prediction mask m̂ according to the first depth feature, the third depth feature and the mixed depth feature.
And S122, inputting the first depth feature, the third depth feature and the mixed depth feature into the second decoder of the decoding module to obtain the prediction attenuation map.
In an embodiment of the invention, the second decoder is configured to output the prediction attenuation map ρ̂.
S123, inputting the first depth feature, the third depth feature and the mixed depth feature into the third decoder of the decoding module to obtain the prediction refraction flow.
In an embodiment of the invention, the third decoder is configured to output the prediction refraction flow Ŵ according to the first depth feature, the third depth feature and the mixed depth feature.
In the embodiment of the present invention, after decoding, the predicted image Ĉ can be obtained by formula (2).
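To make the data flow through the shared-weight encoders, the feature blending and the three decoder heads concrete, a simplified sketch is given below. It is only an illustration: the number of layers is reduced from 8 to 4, the kernel sizes, activations and the linear form of the blending are assumptions, and the class and function names (TransparentViewSynthesisNet, down, up) are hypothetical rather than the exact network of FIG. 3.

```python
import torch
import torch.nn as nn

def down(cin, cout):
    # Stride-2 convolution: each encoder layer halves the spatial size.
    return nn.Sequential(nn.Conv2d(cin, cout, 4, 2, 1), nn.LeakyReLU(0.2, True))

def up(cin, cout):
    # Transposed convolution: each decoder layer doubles the spatial size.
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, 2, 1), nn.ReLU(True))

class TransparentViewSynthesisNet(nn.Module):
    """Shared-weight encoder applied to both views, deep features blended with
    the mixing coefficient, shallow features skip-connected to three decoders."""

    def __init__(self):
        super().__init__()
        self.e1, self.e2 = down(3, 64), down(64, 128)      # shallow layers (skip)
        self.e3, self.e4 = down(128, 256), down(256, 512)  # deep layers (blended)

        def head(out_ch):
            return nn.ModuleDict({
                "d4": up(512, 256),
                "d3": up(256 + 256, 128),       # concat blended e3 features
                "d2": up(128 + 2 * 128, 64),    # concat e2 skips of both views
                "d1": nn.ConvTranspose2d(64 + 2 * 64, out_ch, 4, 2, 1),
            })
        self.mask_head = head(1)
        self.atten_head = head(1)
        self.flow_head = head(2)

    def encode(self, x):
        f1 = self.e1(x)
        f2 = self.e2(f1)
        f3 = self.e3(f2)
        f4 = self.e4(f3)
        return f1, f2, f3, f4

    def decode(self, head, f4b, f3b, skips):
        x = head["d4"](f4b)
        x = head["d3"](torch.cat([x, f3b], dim=1))
        x = head["d2"](torch.cat([x, *skips[1]], dim=1))
        return head["d1"](torch.cat([x, *skips[0]], dim=1))

    def forward(self, img_l, img_r, alpha):
        l1, l2, l3, l4 = self.encode(img_l)    # the same encoder weights are
        r1, r2, r3, r4 = self.encode(img_r)    # used for both input views
        a = torch.as_tensor(alpha, dtype=img_l.dtype,
                            device=img_l.device).view(-1, 1, 1, 1)
        f3b = (1 - a) * l3 + a * r3            # deep-feature blending, formula (4)
        f4b = (1 - a) * l4 + a * r4
        skips = [[l1, r1], [l2, r2]]
        mask = torch.sigmoid(self.decode(self.mask_head, f4b, f3b, skips))
        atten = torch.sigmoid(self.decode(self.atten_head, f4b, f3b, skips))
        flow = torch.tanh(self.decode(self.flow_head, f4b, f3b, skips))
        return mask, atten, flow               # flow in [-1, 1], to be scaled
                                               # by the image size as described
```

In this sketch the two shallow levels are skip-connected to every decoder head while the two deep levels are blended with the mixing coefficient before decoding, mirroring the division between blended deep layers and skip-connected shallow layers described above.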
The details of step S3 in another implementation will be described in detail below.
Specifically, step S3 includes:
S31, calculating the total loss value according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image.
Specifically, step S31 includes:
S311, acquiring a real mask, a real attenuation map and a real refraction flow according to the real image.
Firstly, the corresponding real mask m_t, real attenuation map ρ_t and real refraction flow W_t are acquired from the real image. Referring to FIG. 5, C_t is the real image, m_t is the real mask corresponding to the real image, ρ_t is the real attenuation map corresponding to the real image, and W_t is the real refraction flow corresponding to the real image.
How the corresponding real mask, real attenuation map and real refraction flow are obtained from the real image is described in detail later, when the real dataset is described.
And S312, calculating a mask loss value according to the predicted mask and the real mask.
In the embodiment of the invention, the mask prediction of the transparent object is a binary classification problem; an additional softmax layer may be used to normalize the output, and the mask loss value L_m is calculated using a binary cross-entropy function, as shown in formula (5):

L_m = -(1/(H·W)) Σ_{i,j} [ m_ij·log(m̂_ij) + (1 - m_ij)·log(1 - m̂_ij) ]    (5)

where H and W denote the height and width of the first and second input images (the height of the first input image is the same as that of the second input image, and the width of the first input image is the same as that of the second input image), and m_ij and m̂_ij are the pixel values at location (i, j) of the binary real mask and of the normalized output prediction mask, respectively.
And S313, calculating an attenuation loss value according to the predicted attenuation map and the real attenuation map.
In the embodiment of the invention, the attenuation loss value L_a is calculated using the MSE function, as shown in formula (6):

L_a = (1/(H·W)) Σ_{i,j} (ρ_ij - ρ̂_ij)²    (6)

where ρ_ij and ρ̂_ij denote the real and predicted attenuation values at pixel (i, j), and the prediction attenuation map ρ̂ is normalized using a sigmoid activation function.
And S314, calculating a refraction flow loss value according to the predicted refraction flow and the real refraction flow.
In an embodiment of the present invention, the dimension of the prediction refraction flow is H × W × 2, and it is defined as the index relationship between a synthesized pixel and its corresponding background pixel; the two channels represent the pixel displacement in the x and y dimensions, respectively. The output may be normalized by a tanh activation function and then scaled using the size of the first and second input images. The refraction flow loss value L_f is calculated using the average end-point error (EPE) function, as shown in formula (7):

L_f = (1/(H·W)) Σ_{i,j} √( (W^x_ij - Ŵ^x_ij)² + (W^y_ij - Ŵ^y_ij)² )    (7)

where W and Ŵ denote the real and predicted refraction flow, H and W denote the height and width of the first and second input images (the height of the first input image is the same as that of the second input image, and the width of the first input image is the same as that of the second input image), W^x_ij denotes the pixel displacement in the x dimension of the real refraction flow at position (i, j), Ŵ^x_ij denotes the pixel displacement in the x dimension of the predicted refraction flow at position (i, j), W^y_ij denotes the pixel displacement in the y dimension of the real refraction flow at position (i, j), and Ŵ^y_ij denotes the pixel displacement in the y dimension of the predicted refraction flow at position (i, j).
And S315, calculating a composition loss value and a perceptual loss value according to the predicted image and the real image.
In an embodiment of the present invention, in order to minimize the difference between the predicted image and the real image, the composition loss value L_c may be calculated using the L1 function, as shown in formula (8):

L_c = (1/(H·W)) Σ_{i,j} | Ĉ_ij - C_ij |    (8)

where H and W denote the height and width of the first and second input images (the height of the first input image is the same as that of the second input image, and the width of the first input image is the same as that of the second input image), Ĉ_ij denotes the pixel value of the predicted image at (i, j), and C_ij denotes the pixel value of the real image at (i, j).
Furthermore, in order to better preserve details, reduce ambiguity and increase the sharpness of the predicted image, a perceptual loss L_p is added, as shown in formula (9):

L_p = (1/N)·|| φ(Ĉ) - φ(C) ||²    (9)

where φ(·) denotes the conv4_3 feature of the VGG16 model pre-trained on ImageNet, and N is the total number of channels in that layer.
S316, calculating the total loss value according to the mask loss value, the attenuation loss value, the refraction flow loss value, the composition loss value and the perceptual loss value.
In an embodiment of the invention, the total loss value is calculated from the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image; the network is trained by minimizing the total loss value, which can be expressed by formula (10):

L = ω_m·L_m + ω_a·L_a + ω_f·L_f + ω_c·L_c + ω_p·L_p    (10)

where L denotes the total loss value, L_m denotes the mask loss value and ω_m its balance weight, L_a denotes the attenuation loss value and ω_a its balance weight, L_f denotes the refraction flow loss value and ω_f its balance weight, L_c denotes the composition loss value and ω_c its balance weight, and L_p denotes the perceptual loss value and ω_p its balance weight. The weights may be set as ω_m = 1, ω_a = 10, ω_f = 1, ω_c = 10 and ω_p = 1.
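A compact sketch of how the losses of formulas (5) to (10) could be combined is given below; the use of torchvision's VGG16 truncated around conv4_3, the mean reductions and the helper names are assumptions for illustration, not the exact implementation of the embodiment.

```python
import torch
import torch.nn.functional as F
import torchvision

class PerceptualFeatures(torch.nn.Module):
    """VGG16 features up to conv4_3 (ImageNet-pretrained, frozen).
    Newer torchvision versions use the `weights=` argument instead."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(pretrained=True).features[:23].eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg

    def forward(self, x):
        # Simplification: ImageNet mean/std normalization is omitted here.
        return self.vgg(x)

def total_loss(pred_mask, pred_atten, pred_flow, pred_img,
               gt_mask, gt_atten, gt_flow, gt_img, perceptual,
               w=(1.0, 10.0, 1.0, 10.0, 1.0)):
    l_m = F.binary_cross_entropy(pred_mask, gt_mask)            # formula (5)
    l_a = F.mse_loss(pred_atten, gt_atten)                      # formula (6)
    l_f = torch.norm(pred_flow - gt_flow, dim=1).mean()         # EPE, formula (7)
    l_c = F.l1_loss(pred_img, gt_img)                           # formula (8)
    l_p = F.mse_loss(perceptual(pred_img), perceptual(gt_img))  # formula (9)
    wm, wa, wf, wc, wp = w
    return wm * l_m + wa * l_a + wf * l_f + wc * l_c + wp * l_p  # formula (10)
```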
And S32, adjusting the parameters of the convolutional neural network according to the total loss value.
In the embodiment of the present invention, training may be implemented with PyTorch: the parameters of the convolutional neural network are initialized with the Xavier algorithm, the Adam algorithm with default parameters is used as the optimizer, and after the parameters are modified, the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network continues to be performed. In one implementation, with the learning rate fixed to 0.0002, training for 100 epochs on a Titan X GPU takes about 10 to 12 hours.
In another implementation, after the parameters are modified, the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network continues to be performed until a preset training condition is met, where the preset training condition includes the total loss value meeting a preset requirement or the number of training iterations reaching a preset number. The preset requirement can be determined according to the new viewing angle synthesis model and is not described in detail here; the preset number may be the maximum number of training iterations of the convolutional neural network, for example 50000. Therefore, after the total loss value is calculated, it is judged whether the total loss value meets the preset requirement; if so, the training ends. If not, it is judged whether the number of training iterations of the convolutional neural network has reached the preset number; if not, the parameters of the convolutional neural network are adjusted according to the total loss value, and if so, the training ends. Judging whether the training of the convolutional neural network is finished according to both the loss value and the number of training iterations prevents training from entering an infinite loop because the loss value cannot reach the preset requirement.
Further, since the parameters of the convolutional neural network are modified only when the preset condition is not met (for example, the total loss value does not meet the preset requirement and the number of training iterations has not reached the preset number), after the parameters of the convolutional neural network are modified according to the total loss value, the convolutional neural network needs to continue to be trained, that is, the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network continues to be performed. The first image, the second image and the mixing coefficient used in this continued step may be ones that have not yet been input into the convolutional neural network. For example, all first images and second images in the training data have unique image identifiers (e.g., viewing angle sequence numbers) and the values of the mixing coefficients differ, so the image identifiers of the first and second images input into the convolutional neural network for the first training iteration differ from those input for the second training iteration: for instance, the first image input for the first training iteration has viewing angle sequence number 1, the second image has viewing angle sequence number 7 and the mixing coefficient is 0.5, while the first image input for the second training iteration has viewing angle sequence number 2, the second image has viewing angle sequence number 10 and the mixing coefficient is 0.6.
In practical applications, because the number of first images and second images in the training data is limited, in order to improve the training effect of the convolutional neural network, the first images, second images and mixing coefficients in the training data may be input into the convolutional neural network in sequence to train it; after all first images, second images and corresponding mixing coefficients in the training data have been input, the operation of sequentially inputting them into the convolutional neural network can continue to be performed, so that the training image groups in the training data are input into the convolutional neural network cyclically. It should be noted that, in the process of inputting the first images and second images for training, the images may or may not be input in the order of the viewing angle sequence numbers of the first images; likewise, the same first image, second image and mixing coefficient may or may not be reused for training the convolutional neural network.
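The training procedure described above can be organized roughly as in the following sketch; the data loader layout, the helper names (compositor, loss_fn, perceptual) and the stopping thresholds are assumptions for illustration rather than the exact implementation.

```python
import torch

def train(model, loader, compositor, loss_fn, perceptual,
          lr=0.0002, max_iters=50000, loss_threshold=None, device="cuda"):
    """Each training group: two input views, the mixing coefficient, the
    background image and the ground-truth image / mask / attenuation / flow."""
    model.to(device)

    # Xavier initialization of the convolution layers, Adam with default betas.
    for m in model.modules():
        if isinstance(m, (torch.nn.Conv2d, torch.nn.ConvTranspose2d)):
            torch.nn.init.xavier_uniform_(m.weight)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    it = 0
    while it < max_iters:
        for img_l, img_r, alpha, bg, gt_img, gt_mask, gt_atten, gt_flow in loader:
            img_l, img_r, alpha, bg = (x.to(device) for x in (img_l, img_r, alpha, bg))
            gt_img, gt_mask, gt_atten, gt_flow = (x.to(device) for x in
                                                  (gt_img, gt_mask, gt_atten, gt_flow))
            pred_mask, pred_atten, pred_flow = model(img_l, img_r, alpha)
            # pred_flow assumed scaled to pixel coordinates before compositing.
            pred_img = compositor(pred_mask, pred_atten, pred_flow, bg)  # formula (2)
            loss = loss_fn(pred_mask, pred_atten, pred_flow, pred_img,
                           gt_mask, gt_atten, gt_flow, gt_img, perceptual)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            it += 1
            # Preset training condition: the loss meets the requirement or the
            # number of iterations reaches the preset number.
            if loss_threshold is not None and loss.item() < loss_threshold:
                return model
            if it >= max_iters:
                return model
    return model
```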
The training data in one implementation is described in detail below.
Since there is currently no open dataset specifically for new viewing angle synthesis of transparent objects, the present invention creates a training data set comprising a synthetic dataset and a real dataset: the synthetic dataset comprises 8 different model classes rendered with POVRay under different camera views, serving as selectable first and second images, and the real dataset includes 6 real transparent objects photographed for evaluation.
In an embodiment of the invention, the synthetic dataset contains 3D objects of 8 classes collected from ShapeNet, including Airplane, Bench, Bottle, Car, Jar, Lamp and Table; for each class, 400 models are randomly selected, 350 for training and 50 for testing. In addition, 400 Glass_water models are used as an additional example to verify that the new viewing angle synthesis model obtained after training can be effectively extended to general examples. During rendering, each model appears as a transparent object with the refractive index set to 1.5. The camera used to capture the transparent object is arranged as a pinhole model with fixed focal length and viewpoint, and the resolution of the display screen is 512 x 512; for each camera view, the screen displays a series of binary Gray code images for mask extraction and environment matting, so 18 Gray code images need to be rendered, 9 for rows and 9 for columns. Furthermore, the attenuation map is easily obtained by rendering the model in front of a pure white background image; the background image used in rendering is shown in FIG. 6. Each pixel in the background image is pre-coded with a unique color value to avoid repetitive patterns and to help the loss function compute gradients more efficiently during grid sampling. In view of the rendering of each object, in order to meet the preset training requirements and increase the diversity of training examples, the object is first randomly rotated to some initialized position in the virtual scene and then rotated from -10° to 10° around the y-axis (the coordinate system of POVRay), and an image sequence is acquired at a rotation interval of 2°.
In an embodiment of the present invention, the real dataset contains 6 real transparent objects (Hand, Goblet, Dog, Monkey, Mouse and Rabbit) used for algorithm evaluation, see fig. 7. The real images for training and testing were captured with a Point Grey Flea color camera (FL3-U3-13S2C-CS). Similar to the rendering of the synthetic dataset, the transparent object is placed on a turntable in front of a DELL LCD display (U2412M). During shooting, the turntable is rotated from 0° to 360° at intervals of 2°. Gray code patterns, a pure white image and the color-coded background image are displayed on the display and are used to extract the real masks, real attenuation maps and real refraction flows.
In the embodiment of the present invention, in addition to the three categories evaluated in fig. 4, quantitative evaluations were also performed on the other 5 categories, see fig. 8; the average PSNR and SSIM of each category are higher than 20.0 and 0.85 respectively, which shows that the images synthesized by the network produce good visual results.
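For reference, PSNR and SSIM values of the kind reported here can be computed as in the sketch below; scikit-image (version 0.19 or later for the channel_axis argument) is assumed, and the exact evaluation settings used in the patent are not specified here.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(synthesized, real):
    """Compare a synthesized view with its corresponding real image.
    Both inputs are H x W x 3 uint8 arrays of the same view."""
    psnr = peak_signal_noise_ratio(real, synthesized, data_range=255)
    ssim = structural_similarity(real, synthesized, channel_axis=-1, data_range=255)
    return psnr, ssim
```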
When the method is trained, the convolutional neural network outputs a prediction mask, a prediction attenuation map and a prediction refraction flow according to the first image, the second image and the mixing coefficient, instead of directly obtaining a prediction image; the prediction mask, the prediction attenuation map and the prediction refraction flow reflect the light transmission matrix of the new visual angle, so that the convolutional neural network learns the complex transport behavior of light passing through the transparent object. The prediction image of the transparent object at the new visual angle is then obtained from the prediction mask, the prediction attenuation map and the prediction refraction flow, and the new visual angle synthesis model is obtained through iterative training. The new visual angle synthesis model obtained by training can obtain, from the transparent object image at the first visual angle and the transparent object image at the second visual angle, a synthesized image at any visual angle between the first visual angle and the second visual angle, and the synthesized image is of high quality.
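A minimal sketch of one training iteration, written in PyTorch, is given below for illustration. The network net, the composition function compose (standing in for the environment mask of formula (2)), the perceptual distance perceptual and the equal loss weights are all placeholders assumed for this sketch; they are not the patent's actual architecture, loss forms or weights.

```python
import torch.nn.functional as F

def training_step(net, optimizer, first, second, alpha, real,
                  real_mask, real_atten, real_flow, compose, perceptual):
    """One iteration: predict mask / attenuation map / refraction flow,
    compose the prediction image, combine the mask, attenuation, refraction
    flow, composition and perceptual loss terms, and adjust the network
    parameters by back-propagation."""
    pred_mask, pred_atten, pred_flow = net(first, second, alpha)
    pred_image = compose(pred_mask, pred_atten, pred_flow)

    total_loss = (F.binary_cross_entropy(pred_mask, real_mask)   # mask loss
                  + F.l1_loss(pred_atten, real_atten)            # attenuation loss
                  + F.l1_loss(pred_flow, real_flow)              # refraction flow loss
                  + F.l1_loss(pred_image, real)                  # composition loss
                  + perceptual(pred_image, real))                # perceptual loss

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```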
The embodiment of the present invention further provides a new viewing angle synthesis method for a transparent object, referring to fig. 9, the method may include the following steps:
K1, acquiring a first image to be processed, a second image to be processed and a preset mixing coefficient.
In the embodiment of the present invention, the view sequence number x of the first image X to be processed is different from the view sequence number y of the second image Y to be processed, and the preset mixing coefficient α' is greater than 0 and smaller than 1. The view sequence number of the synthesized image can be obtained from equation (11).
[Equation (11), given as an image in the original publication, expresses the view sequence number of the synthesized image in terms of the view sequence numbers x and y and the preset mixing coefficient α'.]
For example, if the view sequence number x is equal to 2, the view sequence number y is equal to 8 and the preset mixing coefficient α' is 0.5, the view sequence number of the synthesized image is 4.
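Because equation (11) is only available as an image in the source, the sketch below assumes a simple linear relation between the view sequence numbers and the mixing coefficient; this assumed form is for illustration only and may differ from the exact formula in the patent.

```python
def synthesized_view_number(x, y, alpha):
    """Assumed linear form: view sequence number of the synthesized image
    for first-view number x, second-view number y and 0 < alpha < 1."""
    return (1 - alpha) * x + alpha * y

def mixing_coefficient(x, y, z):
    """Inverse of the assumed relation: mixing coefficient for the real
    (new-view) image whose view sequence number is z."""
    return (z - x) / (y - x)
```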
K2, inputting the first image to be processed, the second image to be processed and the preset mixing coefficient into a new visual angle synthesis model to obtain a mask to be processed, an attenuation map to be processed and a refraction flow to be processed; the new visual angle synthetic model is obtained by the training method of the new visual angle synthetic model of the transparent object.
K3, calculating to obtain a composite image of the first image to be processed and the second image to be processed under a preset mixing coefficient according to the mask to be processed, the attenuation map to be processed and the refraction flow to be processed by adopting an environment mask, wherein the visual angle of the composite image is between the visual angle of the first image to be processed and the visual angle of the second image to be processed.
In the embodiment of the invention, by adopting the environment mask expression shown in formula (2), the synthesized image can be obtained from the mask m to be processed, the attenuation map ρ to be processed and the refraction flow W to be processed.
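Formula (2) itself is not reproduced in this part of the description, so the sketch below assumes the common environment-matting form I = m * ρ * B(W) + (1 - m) * B, where B is the known background and B(W) denotes the background sampled at the pixel positions given by the refraction flow; the function name and the nearest-neighbour sampling are illustrative assumptions, not the patent's exact expression.

```python
import numpy as np

def compose_with_environment_mask(mask, atten, flow, background):
    """Compose a view from mask m (H x W), attenuation map rho (H x W) and
    refraction flow W (H x W x 2, pixel coordinates into the background),
    against a background image B of the same H x W x 3 size, assuming
    I = m * rho * B(W) + (1 - m) * B with nearest-neighbour sampling."""
    ys = np.clip(np.rint(flow[..., 1]).astype(int), 0, background.shape[0] - 1)
    xs = np.clip(np.rint(flow[..., 0]).astype(int), 0, background.shape[1] - 1)
    refracted = background[ys, xs]          # B sampled at the refraction flow
    m = mask[..., None]
    rho = atten[..., None]
    return m * rho * refracted + (1.0 - m) * background
```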
By way of example, fig. 10 shows a synthesis example of Airplane, in which the viewing angle of the camera relative to the object ranges from -10° to 10°: A is the image taken at -10°, B the image taken at -8°, C at -6°, D at -4°, E at -2°, F at 0°, G at 2°, H at 4°, I at 6°, J at 8° and K at 10°. With image A taken as the first image to be processed and image K as the second image to be processed, the mask to be processed, the attenuation map to be processed and the refraction flow to be processed can be output at different new visual angles (i.e. different preset mixing coefficients); these outputs are pixel-wise quantities and can be visualized as images. A1 is the visualized mask to be processed corresponding to image A, B1 is the visualized mask to be processed corresponding to image B, ..., and K1 is the visualized mask to be processed corresponding to image K; A2 is the visualized attenuation map to be processed corresponding to image A, B2 corresponds to image B, ..., and K2 corresponds to image K; A3 is the visualized refraction flow to be processed corresponding to image A, B3 corresponds to image B, ..., and K3 corresponds to image K. From the mask to be processed, the attenuation map to be processed and the refraction flow to be processed, the synthesized image is obtained through the environment mask: a is the synthesized image at the -10° view obtained from images A and K, b is the synthesized image at -8° obtained from images A and K, c is the synthesized image at -6°, ..., and k is the synthesized image at 10° obtained from images A and K. Compared with each corresponding real image, the average PSNR and SSIM in this case are (25.7, 0.9567) and (19.4, 0.9004), which clearly shows that the synthesis results are visually reasonable.
Referring to fig. 11, this embodiment also shows a synthesis example of Glass_water, where A is the image taken at -10°, B the image taken at -8°, C at -6°, D at -4°, E at -2°, F at 0°, G at 2°, H at 4°, I at 6°, J at 8° and K at 10°. Using the new visual angle synthesis model obtained through training, with image A as the first image to be processed and image K as the second image to be processed, different synthesized images can be obtained at different new visual angles (i.e. different preset mixing coefficients). The average PSNR and SSIM in this case are (19.4, 0.9004), which clearly shows that the synthesis results are visually reasonable.
Referring to fig. 12, the embodiment of the present invention further shows a synthesis example of Bottle, where A is the image captured at -10°, B the image captured at 0° and C the image captured at 10°; image A is taken as the first image to be processed and image C as the second image to be processed. A1, B1 and C1 are the visualized masks to be processed corresponding to the real images A, B and C respectively; A2, B2 and C2 are the corresponding visualized attenuation maps to be processed; and A3, B3 and C3 are the corresponding visualized refraction flows to be processed. a is the synthesized image at the -10° view obtained from images A and C, b is the synthesized image at 0° obtained from images A and C, and c is the synthesized image at 10° obtained from images A and C. The average PSNR and SSIM in this case are (23.5, 0.9584), which clearly shows that the synthesis results are visually reasonable.
Referring to fig. 13, the embodiment of the present invention further shows a synthesis example of Bench, presented in the same format as fig. 12: A is the image captured at -10°, B the image captured at 0° and C the image captured at 10°; image A is taken as the first image to be processed and image C as the second image to be processed. A1, B1 and C1 are the visualized masks to be processed corresponding to the real images A, B and C respectively; A2, B2 and C2 are the corresponding visualized attenuation maps to be processed; and A3, B3 and C3 are the corresponding visualized refraction flows to be processed. a is the synthesized image at the -10° view obtained from images A and C, b is the synthesized image at 0° obtained from images A and C, and c is the synthesized image at 10° obtained from images A and C. The average PSNR and SSIM in this case are (21.6, 0.9243), which clearly shows that the synthesis results are visually reasonable.
Referring to fig. 14, the embodiment of the present invention further shows a synthesis example of Table, presented in the same format as fig. 12: A is the image captured at -10°, B the image captured at 0° and C the image captured at 10°; image A is taken as the first image to be processed and image C as the second image to be processed. A1, B1 and C1 are the visualized masks to be processed corresponding to the real images A, B and C respectively; A2, B2 and C2 are the corresponding visualized attenuation maps to be processed; and A3, B3 and C3 are the corresponding visualized refraction flows to be processed. a is the synthesized image at the -10° view obtained from images A and C, b is the synthesized image at 0° obtained from images A and C, and c is the synthesized image at 10° obtained from images A and C. The average PSNR and SSIM in this case are (21.4, 0.9907), which clearly shows that the synthesis results are visually reasonable.
The above examples show that the trained new visual angle synthesis model of the transparent object can accurately predict and reproduce the light transmission characteristics of different objects at new visual angles.
In one embodiment, the present invention provides a computer device, which may be a terminal, whose internal structure is shown in fig. 15. The computer device includes a processor, a memory, a network interface, a display screen and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the training method of the new visual angle synthesis model of the transparent object. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a key, a trackball or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad or mouse.
Those skilled in the art will appreciate that the structure shown in fig. 15 is merely a block diagram of part of the structure related to the present solution and does not limit the computer device to which the present solution is applied; a particular computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, there is provided a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
inputting a first image, a second image and a mixing coefficient in training data into a convolutional neural network, and outputting a prediction mask, a prediction attenuation map and a prediction refraction flow through the convolutional neural network, wherein the training data comprises a plurality of groups of training image groups, each group of training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first visual angle, the second image is a transparent object image shot at a second visual angle, the real image is a transparent object image shot at a new visual angle between the first visual angle and the second visual angle, and the mixing coefficient represents the relationship among the first visual angle, the second visual angle and the new visual angle;
according to the prediction mask, the prediction attenuation map and the prediction refraction flow, calculating to obtain a prediction image of the first image and the second image under a mixed coefficient, wherein the prediction image is a transparent object image predicted by a convolutional neural network under a new visual angle;
and adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the prediction image and the real image, and continuously executing the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network until preset training conditions are met so as to obtain a new view angle synthetic model.
In one embodiment, a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, performs the steps of:
inputting a first image, a second image and a mixing coefficient in training data into a convolutional neural network, and outputting a prediction mask, a prediction attenuation map and a prediction refraction flow through the convolutional neural network, wherein the training data comprises a plurality of groups of training image groups, each group of training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first visual angle, the second image is a transparent object image shot at a second visual angle, the real image is a transparent object image shot at a new visual angle between the first visual angle and the second visual angle, and the mixing coefficient represents the relationship among the first visual angle, the second visual angle and the new visual angle;
according to the prediction mask, the prediction attenuation map and the prediction refraction flow, calculating to obtain a prediction image of the first image and the second image under a mixed coefficient, wherein the prediction image is a transparent object image predicted by a convolutional neural network under a new visual angle;
and adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the prediction image and the real image, and continuously executing the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network until preset training conditions are met so as to obtain a new view angle synthetic model.
According to the training method provided by the embodiment of the invention, a first image, a second image and a mixing coefficient in training data are input into a convolutional neural network, and a prediction mask, a prediction attenuation map and a prediction refraction flow are output through the convolutional neural network, wherein the training data comprises a plurality of groups of training image groups, each group of training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first visual angle, the second image is a transparent object image shot at a second visual angle, the real image is a transparent object image shot at a new visual angle between the first visual angle and the second visual angle, and the mixing coefficient represents the visual angle relationship among the first visual angle, the second visual angle and the new visual angle; calculating to obtain a predicted image according to the prediction mask, the prediction attenuation map and the prediction refraction flow, wherein the predicted image is a transparent object image predicted by a convolutional neural network under a new visual angle; and adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the prediction image and the real image, and continuously executing the step of inputting the first image and the second image in the training data into the convolutional neural network until a preset training condition is met to obtain a new view synthesis model. During training, the convolutional neural network outputs a prediction mask, a prediction attenuation map and a prediction refraction flow according to a first image, a second image and a mixing coefficient instead of directly obtaining a prediction image, wherein the prediction refraction flow reflects a light transmission matrix of a new visual angle, so that the convolutional neural network learns the complex light transmission behavior of light rays passing through a transparent object, then obtains the prediction image of the transparent object under the new visual angle according to the prediction mask, the prediction attenuation map and the prediction refraction flow, and obtains a new visual angle synthetic model through iterative training of the convolutional neural network; the new visual angle synthesis model obtained through training can obtain a synthesis image of any visual angle between the first visual angle and the second visual angle according to the transparent image of the first visual angle and the transparent image of the second visual angle, and the synthesis image has high quality.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

Claims (10)

1. A method of training a new perspective composite model of a transparent object, the method comprising:
inputting a first image, a second image and a mixing coefficient in training data into a convolutional neural network, and outputting a prediction mask, a prediction attenuation map and a prediction refraction flow through the convolutional neural network, wherein the training data comprises a plurality of groups of training image groups, each group of training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first visual angle, the second image is a transparent object image shot at a second visual angle, the real image is a transparent object image shot at a new visual angle between the first visual angle and the second visual angle, and the mixing coefficient represents the relationship among the first visual angle, the second visual angle and the new visual angle;
according to the prediction mask, the prediction attenuation map and the prediction refraction flow, calculating to obtain a prediction image of the first image and the second image under a mixed coefficient, wherein the prediction image is a transparent object image predicted by a convolutional neural network under a new visual angle;
and adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the prediction image and the real image, and continuously executing the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network until preset training conditions are met so as to obtain a new view angle synthetic model.
2. The method of claim 1, wherein the convolutional neural network comprises: an encoding module and a decoding module, and wherein the inputting the first image and the second image in the training data into the convolutional neural network and outputting the prediction mask, the prediction attenuation map and the prediction refraction flow through the convolutional neural network comprises:
inputting the first image, the second image and the mixing coefficient into the coding module to obtain a depth characteristic;
inputting the depth features into the decoding module to obtain a prediction mask, a prediction attenuation map and a prediction refraction flow.
3. The method of claim 2, wherein the encoding module comprises a first encoder, a second encoder and a convolutional layer, the depth features comprise a first depth feature, a second depth feature, a third depth feature, a fourth depth feature and a mixed depth feature, and the inputting the first image, the second image and the mixing coefficient into the encoding module to obtain the depth features comprises:
inputting the first image into a first encoder to obtain a first depth feature and a second depth feature corresponding to the first image;
inputting the second image into a second encoder to obtain a third depth feature and a fourth depth feature corresponding to the second image;
the second depth feature, the fourth depth feature, and the blending coefficient are input to the convolutional layer to obtain a blended depth feature.
4. The method of claim 3, wherein the decoding module comprises a first decoder, a second decoder and a third decoder, and the inputting the depth features into the decoding module to obtain the prediction mask, the prediction attenuation map and the prediction refraction flow comprises:
inputting the first depth feature, the third depth feature and the mixed depth feature into the first decoder of the decoding module to obtain the prediction mask;
inputting the first depth feature, the third depth feature and the mixed depth feature into the second decoder of the decoding module to obtain the prediction attenuation map;
and inputting the first depth feature, the third depth feature and the mixed depth feature into the third decoder of the decoding module to obtain the prediction refraction flow.
5. The method of claim 1, wherein said adjusting parameters of said convolutional neural network based on said prediction mask, said prediction attenuation map, said prediction refraction flow, said prediction image and said real image comprises:
calculating a total loss value from the prediction mask, the prediction attenuation map, the prediction refraction flow, the prediction image and the real image;
and adjusting parameters of the convolutional neural network according to the total loss value.
6. The method of claim 5, wherein said calculating a total loss value from said prediction mask, said prediction attenuation map, said prediction refraction flow, said prediction image and said real image comprises:
calculating a real mask, a real attenuation map and a real refraction flow according to the real image;
calculating a mask loss value according to the predicted mask and the real mask;
calculating an attenuation loss value according to the predicted attenuation map and the real attenuation map;
calculating a refraction flow loss value according to the predicted refraction flow and the real refraction flow;
calculating a composition loss value and a perception loss value according to the predicted image and the real image;
calculating a total loss value from the mask loss value, the attenuation loss value, the refraction flow loss value, the composition loss value and the perceptual loss value.
7. The method of any one of claims 1 to 6, wherein before inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network, the method further comprises:
and calculating a mixing coefficient according to the visual angle sequence number of the first image, the visual angle sequence number of the second image and the visual angle sequence number of the real visual angle image.
8. A method for new viewing angle synthesis of a transparent object, the method comprising:
acquiring a first image to be processed, a second image to be processed and a preset mixing coefficient;
inputting the first image to be processed, the second image to be processed and the preset mixing coefficient into a new visual angle synthesis model to obtain a mask to be processed, an attenuation map to be processed and a refraction flow to be processed; wherein the new visual angle synthesis model is obtained by training according to the method of any one of claims 1 to 7;
and calculating to obtain a synthetic image of the first image to be processed and the second image to be processed under a preset mixing coefficient according to the mask to be processed, the attenuation map to be processed and the refraction flow to be processed by adopting an environment mask, wherein the visual angle of the synthetic image is between the visual angle of the first image to be processed and the visual angle of the second image to be processed.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN201910964836.6A 2019-10-11 2019-10-11 Training method and computer equipment for new visual angle synthetic model of transparent object Active CN110689514B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910964836.6A CN110689514B (en) 2019-10-11 2019-10-11 Training method and computer equipment for new visual angle synthetic model of transparent object

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910964836.6A CN110689514B (en) 2019-10-11 2019-10-11 Training method and computer equipment for new visual angle synthetic model of transparent object

Publications (2)

Publication Number Publication Date
CN110689514A true CN110689514A (en) 2020-01-14
CN110689514B CN110689514B (en) 2022-11-11

Family

ID=69112213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910964836.6A Active CN110689514B (en) 2019-10-11 2019-10-11 Training method and computer equipment for new visual angle synthetic model of transparent object

Country Status (1)

Country Link
CN (1) CN110689514B (en)

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004164571A (en) * 2002-06-27 2004-06-10 Mitsubishi Electric Research Laboratories Inc Method for modeling three-dimensional object
US8947430B1 (en) * 2010-02-26 2015-02-03 Nvidia Corporation System and method for rendering a particle-based fluid surface
US20190026956A1 (en) * 2012-02-24 2019-01-24 Matterport, Inc. Employing three-dimensional (3d) data predicted from two-dimensional (2d) images using neural networks for 3d modeling applications and other applications
US20140043436A1 (en) * 2012-02-24 2014-02-13 Matterport, Inc. Capturing and Aligning Three-Dimensional Scenes
US20140055570A1 (en) * 2012-03-19 2014-02-27 Fittingbox Model and method for producing 3d photorealistic models
US20170352149A1 (en) * 2014-12-24 2017-12-07 Datalogic Ip Tech S.R.L. System and method for identifying the presence or absence of transparent pills in blister packer machines using high resolution 3d stereo reconstruction based on color linear cameras
US20180137611A1 (en) * 2016-11-14 2018-05-17 Ricoh Co., Ltd. Novel View Synthesis Using Deep Convolutional Neural Networks
CN106683188A (en) * 2016-11-17 2017-05-17 长春理工大学 Double-surface three-dimensional reconstructing method, device and system for transparent object
EP3343507A1 (en) * 2016-12-30 2018-07-04 Dassault Systèmes Producing a segmented image of a scene
CN106920243A (en) * 2017-03-09 2017-07-04 桂林电子科技大学 The ceramic material part method for sequence image segmentation of improved full convolutional neural networks
US20190095791A1 (en) * 2017-09-26 2019-03-28 Nvidia Corporation Learning affinity via a spatial propagation neural network
CN108416834A (en) * 2018-01-08 2018-08-17 长春理工大学 Transparent objects surface three dimension reconstructing method, device and system
WO2019140414A1 (en) * 2018-01-14 2019-07-18 Light Field Lab, Inc. Systems and methods for rendering data from a 3d environment
CN108416751A (en) * 2018-03-08 2018-08-17 深圳市唯特视科技有限公司 A kind of new viewpoint image combining method assisting full resolution network based on depth
US20190304134A1 (en) * 2018-03-27 2019-10-03 J. William Mauchly Multiview Estimation of 6D Pose
US20190304069A1 (en) * 2018-03-29 2019-10-03 Pixar Denoising monte carlo renderings using neural networks with asymmetric loss
CN108765425A (en) * 2018-05-15 2018-11-06 深圳大学 Image partition method, device, computer equipment and storage medium
CN109118531A (en) * 2018-07-26 2019-01-01 深圳大学 Three-dimensional rebuilding method, device, computer equipment and the storage medium of transparent substance
CN109238167A (en) * 2018-07-26 2019-01-18 深圳大学 Transparent substance light corresponding relationship acquisition system
CN109712080A (en) * 2018-10-12 2019-05-03 迈格威科技有限公司 Image processing method, image processing apparatus and storage medium
CN110033486A (en) * 2019-04-19 2019-07-19 山东大学 Transparent crystal growth course edge and volume method of real-time and system
CN110060335A (en) * 2019-04-24 2019-07-26 吉林大学 There are the virtual reality fusion methods of mirror article and transparent substance in a kind of scene

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BOJIAN WU ET AL.: "Full 3D Reconstruction of Transparent Objects", 《ACM TRANSACTIONS ON GRAPHICS》 *
GUANYING CHEN ET AL.: "TOM-Net: Learning Transparent Object Matting from a Single Image", 《2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
JONATHAN DYSSEL STETS ET AL.: "Single-Shot Analysis of Refractive Shape Using Convolutional Neural Networks", 《 19TH IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022077146A1 (en) * 2020-10-12 2022-04-21 深圳大学 Mesh reconstruction method and apparatus for transparent object, and computer device, and storage medium
US11748948B2 (en) 2020-10-12 2023-09-05 Shenzhen University Mesh reconstruction method and apparatus for transparent object, computer device and storage medium

Also Published As

Publication number Publication date
CN110689514B (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN111798400B (en) Non-reference low-illumination image enhancement method and system based on generation countermeasure network
Baldassarre et al. Deep koalarization: Image colorization using cnns and inception-resnet-v2
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
Huang et al. Bidirectional recurrent convolutional networks for multi-frame super-resolution
CN106780543B (en) A kind of double frame estimating depths and movement technique based on convolutional neural networks
CN110599395B (en) Target image generation method, device, server and storage medium
CN110378838B (en) Variable-view-angle image generation method and device, storage medium and electronic equipment
JP7026222B2 (en) Image generation network training and image processing methods, equipment, electronics, and media
CN109255831A (en) The method that single-view face three-dimensional reconstruction and texture based on multi-task learning generate
CN109643383A (en) Domain separates neural network
CN110351511A (en) Video frame rate upconversion system and method based on scene depth estimation
WO2023138062A1 (en) Image processing method and apparatus
Richardt et al. Capture, reconstruction, and representation of the visual real world for virtual reality
Li et al. A lightweight depth estimation network for wide-baseline light fields
CN110264526A (en) A kind of scene depth and camera position posture method for solving based on deep learning
CN116977522A (en) Rendering method and device of three-dimensional model, computer equipment and storage medium
CN113077505A (en) Optimization method of monocular depth estimation network based on contrast learning
CN116524121A (en) Monocular video three-dimensional human body reconstruction method, system, equipment and medium
CN110689514B (en) Training method and computer equipment for new visual angle synthetic model of transparent object
Yuan et al. Presim: A 3d photo-realistic environment simulator for visual ai
CN113298931A (en) Reconstruction method and device of object model, terminal equipment and storage medium
CN115953524B (en) Data processing method, device, computer equipment and storage medium
CN111080754B (en) Character animation production method and device for connecting characteristic points of head and limbs
CN108520532A (en) Identify the method and device of movement direction of object in video
CN112116646A (en) Light field image depth estimation method based on depth convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant