CN110689514A - Training method and computer device for a new viewing angle synthesis model of a transparent object - Google Patents

Training method and computer device for a new viewing angle synthesis model of a transparent object

Info

Publication number
CN110689514A
CN110689514A (application number CN201910964836.6A)
Authority
CN
China
Prior art keywords
image
prediction
visual angle
mask
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910964836.6A
Other languages
Chinese (zh)
Other versions
CN110689514B (en)
Inventor
黄惠
吴博剑
吕佳辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN201910964836.6A
Publication of CN110689514A
Application granted
Publication of CN110689514B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

During training, a convolutional neural network outputs a prediction mask, a prediction attenuation map and a prediction refraction flow from a first image, a second image and a mixing coefficient, rather than directly producing a predicted image. Because the prediction refraction flow reflects the light transmission matrix at the new viewing angle, the convolutional neural network learns the complex light transmission behavior of light passing through the transparent object; the predicted image of the transparent object at the new viewing angle is then obtained from the prediction mask, the prediction attenuation map and the prediction refraction flow. A new viewing angle synthesis model is obtained by iteratively training the convolutional neural network. Given the transparent object image at a first viewing angle and the transparent object image at a second viewing angle, the trained new viewing angle synthesis model can obtain a high-quality composite image at any viewing angle between the first viewing angle and the second viewing angle.

Description

Training method and computer device for a new viewing angle synthesis model of a transparent object
Technical Field
The present application relates to the field of image processing technologies, and in particular to a training method and a computer device for a new viewing angle synthesis model of a transparent object.
Background
New viewing angle synthesis generates images at new viewpoints from images of an object or scene captured at fixed viewpoints, typically by interpolating or warping images from nearby viewpoints. Current research on new viewing angle synthesis, on the one hand, focuses mainly on Lambertian surfaces: it is difficult to explicitly model light transmission, and viewpoint-dependent effects such as specular reflection or transparency are not considered, so feature correspondences between images are missing. This causes methods based on image warping or on geometric inference to fail, making new viewing angle synthesis of transparent objects very challenging. On the other hand, training an image-to-image network to directly output the image at the new viewing angle requires the network not only to account for the light transmission behavior but also to model the attributes of the image itself, which remains very difficult for transparent objects. Existing new viewing angle synthesis methods therefore cannot be applied directly to transparent objects.
Therefore, the prior art is in need of improvement.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a training method and a computer device for a new viewing angle synthesis model of a transparent object, so as to realize new viewing angle synthesis for transparent objects.
In one aspect, an embodiment of the present invention provides a method for training a new viewing angle synthesis model of a transparent object, the method comprising:
inputting a first image, a second image and a mixing coefficient in training data into a convolutional neural network, and outputting a prediction mask, a prediction attenuation map and a prediction refraction flow through the convolutional neural network, wherein the training data comprises a plurality of training image groups, each training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first viewing angle, the second image is a transparent object image shot at a second viewing angle, the real image is a transparent object image shot at a new viewing angle between the first viewing angle and the second viewing angle, and the mixing coefficient represents the viewing angle relation among the first viewing angle, the second viewing angle and the new viewing angle;
calculating a predicted image according to the prediction mask, the prediction attenuation map and the prediction refraction flow, wherein the predicted image is the transparent object image predicted by the convolutional neural network at the new viewing angle;
and adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image, and continuing to perform the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network until a preset training condition is met, so as to obtain a new viewing angle synthesis model.
As a further improved technical solution, the convolutional neural network comprises an encoding module and a decoding module, and the inputting of the first image, the second image and the mixing coefficient in the training data into the convolutional neural network and the outputting of a prediction mask, a prediction attenuation map and a prediction refraction flow through the convolutional neural network comprises:
inputting the first image, the second image and the mixing coefficient into the encoding module to obtain depth features; and inputting the depth features into the decoding module to obtain the prediction mask, the prediction attenuation map and the prediction refraction flow.
As a further improved technical solution, the encoding module includes a first encoder, a second encoder and a convolutional layer, the depth features include a first depth feature, a second depth feature, a third depth feature, a fourth depth feature and a mixed depth feature, and the inputting of the first image, the second image and the mixing coefficient into the encoding module to obtain the depth features includes:
inputting the first image into a first encoder to obtain a first depth feature and a second depth feature corresponding to the first image;
inputting the second image into a second encoder to obtain a third depth feature and a fourth depth feature corresponding to the second image;
the second depth feature, the fourth depth feature, and the blending coefficient are input to the convolutional layer to obtain a blended depth feature.
As a further improved technical solution, the decoding module includes a first decoder, a second decoder and a third decoder, and the inputting the depth features into the decoding module to obtain a prediction mask, a prediction attenuation map and a prediction refraction flow includes:
inputting the first depth feature, the third depth feature and the mixed depth feature into the first decoder to obtain the prediction mask;
inputting the first depth feature, the third depth feature and the mixed depth feature into the second decoder to obtain the prediction attenuation map;
and inputting the first depth feature, the third depth feature and the mixed depth feature into the third decoder to obtain the prediction refraction flow.
As a further improved technical solution, the adjusting of the parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image includes:
calculating a total loss value according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image;
and adjusting parameters of the convolutional neural network according to the total loss value.
As a further improvement, the calculating of the total loss value according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image includes:
calculating a real mask, a real attenuation map and a real refraction flow according to the real image;
calculating a mask loss value according to the predicted mask and the real mask;
calculating an attenuation loss value according to the predicted attenuation map and the real attenuation map;
calculating a refraction flow loss value according to the predicted refraction flow and the real refraction flow;
calculating a composition loss value and a perceptual loss value according to the predicted image and the real image;
and calculating the total loss value according to the mask loss value, the attenuation loss value, the refraction flow loss value, the composition loss value and the perceptual loss value.
As a further improved technical solution, before inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network, the method includes:
and calculating the mixing coefficient according to the viewing angle sequence number of the first image, the viewing angle sequence number of the second image and the viewing angle sequence number of the real image.
In a second aspect, an embodiment of the present invention provides a new viewing angle synthesis method for a transparent object, the method comprising:
acquiring a first image to be processed, a second image to be processed and a mixing coefficient to be processed;
inputting the first image to be processed, the second image to be processed and the mixing coefficient to be processed into a new viewing angle synthesis model to obtain a mask to be processed, an attenuation map to be processed and a refraction flow to be processed, wherein the new viewing angle synthesis model is obtained by training through the above training method of the new viewing angle synthesis model of the transparent object;
and calculating a composite image through an environment matte according to the mask to be processed, the attenuation map to be processed and the refraction flow to be processed, wherein the viewing angle of the composite image is between the viewing angle of the first image to be processed and the viewing angle of the second image to be processed.
In a third aspect, an embodiment of the present invention provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the following steps when executing the computer program:
inputting a first image, a second image and a mixing coefficient in training data into a convolutional neural network, and outputting a prediction mask, a prediction attenuation map and a prediction refraction flow through the convolutional neural network, wherein the training data comprises a plurality of training image groups, each training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first viewing angle, the second image is a transparent object image shot at a second viewing angle, the real image is a transparent object image shot at a new viewing angle between the first viewing angle and the second viewing angle, and the mixing coefficient represents the viewing angle relation among the first viewing angle, the second viewing angle and the new viewing angle;
calculating a predicted image according to the prediction mask, the prediction attenuation map and the prediction refraction flow, wherein the predicted image is the transparent object image predicted by the convolutional neural network at the new viewing angle;
and adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image, and continuing to perform the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network until a preset training condition is met, so as to obtain a new viewing angle synthesis model.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following steps:
inputting a first image, a second image and a mixing coefficient in training data into a convolutional neural network, and outputting a prediction mask, a prediction attenuation map and a prediction refraction flow through the convolutional neural network, wherein the training data comprises a plurality of training image groups, each training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first viewing angle, the second image is a transparent object image shot at a second viewing angle, the real image is a transparent object image shot at a new viewing angle between the first viewing angle and the second viewing angle, and the mixing coefficient represents the viewing angle relation among the first viewing angle, the second viewing angle and the new viewing angle;
calculating a predicted image according to the prediction mask, the prediction attenuation map and the prediction refraction flow, wherein the predicted image is the transparent object image predicted by the convolutional neural network at the new viewing angle;
and adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image, and continuing to perform the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network until a preset training condition is met, so as to obtain a new viewing angle synthesis model.
Compared with the prior art, the embodiment of the invention has the following advantages:
according to the training method provided by the embodiment of the invention, a first image, a second image and a mixing coefficient in training data are input into a convolutional neural network, and a prediction mask, a prediction attenuation map and a prediction refraction flow are output through the convolutional neural network, wherein the training data comprises a plurality of groups of training image groups, each group of training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first visual angle, the second image is a transparent object image shot at a second visual angle, the real image is a transparent object image shot at a new visual angle between the first visual angle and the second visual angle, and the mixing coefficient represents a visual angle relation among the first visual angle, the second visual angle and the new visual angle; calculating to obtain a predicted image according to the prediction mask, the prediction attenuation map and the prediction refraction flow, wherein the predicted image is a transparent object image predicted by a convolutional neural network under a new visual angle; and adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the prediction image and the real image, and continuously executing the step of inputting the first image and the second image in the training data into the convolutional neural network until a preset training condition is met to obtain a new view synthesis model. During training, the convolutional neural network outputs a prediction mask, a prediction attenuation map and a prediction refraction flow according to a first image, a second image and a mixing coefficient instead of directly obtaining a prediction image, wherein the prediction refraction flow reflects a light transmission matrix of a new visual angle, so that the convolutional neural network learns the complex light transmission behavior of light rays passing through a transparent object, then obtains the prediction image of the transparent object under the new visual angle according to the prediction mask, the prediction attenuation map and the prediction refraction flow, and obtains a new visual angle synthetic model through iterative training of the convolutional neural network; the new visual angle synthesis model obtained through training can obtain a synthesis image of any visual angle between the first visual angle and the second visual angle according to the transparent image of the first visual angle and the transparent image of the second visual angle, and the synthesis image has high quality.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic flow chart of a method for training a new viewing angle synthesis model of a transparent object according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the process of inputting a first image, a second image and a mixing coefficient into a convolutional neural network to obtain a prediction mask, a prediction attenuation map and a prediction refraction flow in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a hierarchical structure of a convolutional neural network in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the quality of predicted images obtained with different combinations, evaluated using PSNR and SSIM, in an embodiment of the present invention;
FIG. 5 is a schematic diagram of acquiring a real mask, a real attenuation map and a real refraction flow from a real image according to an embodiment of the present invention;
FIG. 6 is a rendering background diagram in an embodiment of the invention;
FIG. 7 is a diagram illustrating real images captured with a Point Grey Flea color camera for training and testing in an embodiment of the present invention;
FIG. 8 is a diagram illustrating the quantitative evaluation results of 5 other categories according to an embodiment of the present invention;
FIG. 9 is a schematic flow chart illustrating a new viewing angle synthesis method for a transparent object according to an embodiment of the present invention;
FIG. 10 is a diagram illustrating a synthesis example of Airplane in an embodiment of the present invention;
FIG. 11 is a diagram illustrating a synthesis example of Glass_water in an embodiment of the present invention;
FIG. 12 is a diagram showing a synthetic example of Bottle in the embodiment of the present invention;
FIG. 13 is a diagram illustrating a synthesis example of Bench in an embodiment of the present invention;
FIG. 14 is a diagram illustrating the synthesis of Table in the embodiment of the present invention;
Fig. 15 is an internal structural view of a computer device in the embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Various non-limiting embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Referring to FIG. 1, a method for training a new viewing angle synthesis model of a transparent object in an embodiment of the present invention is shown. In this embodiment, the method may include, for example, the following steps:
and S1, inputting a first image, a second image and a mixing coefficient in training data into a convolutional neural network, and outputting a prediction mask, a prediction attenuation map and a prediction refraction flow through the convolutional neural network, wherein the training data comprises a plurality of groups of training image groups, each group of training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first visual angle, the second image is a transparent object image shot at a second visual angle, the real image is a transparent object image shot at a new visual angle between the first visual angle and the second visual angle, and the mixing coefficient represents the visual angle relation among the first visual angle, the second visual angle and the new visual angle.
In the embodiment of the invention, the first image, the second image and the real image come from sparsely sampled images shot by cameras at different viewing angles: for a transparent object, images of the transparent object at multiple viewing angles can be shot, and the images at the multiple viewing angles are numbered with viewing angle sequence numbers. For example, the camera moves around the transparent object at a constant speed and captures an image sequence C = {C_k | k = 0, 1, …, N}, where C_0 denotes the image with viewing angle sequence number 0. A first image C_L and a second image C_R (0 ≤ L < R ≤ N) are randomly selected from the image sequence, together with a real image C_t (L < t < R) for supervised learning. The first image is the transparent object image shot at the first viewing angle, and in this example the viewing angle sequence number of the first viewing angle is L; similarly, the viewing angle sequence number of the second viewing angle is R, and the viewing angle sequence number corresponding to the real image is t. The acquisition of the training data will be described in detail later.
Specifically, step S1 is preceded by:
and M, calculating a mixing coefficient according to the visual angle sequence number of the first image, the visual angle sequence number of the second image and the visual angle sequence number of the real visual angle image.
Since the image sequence is captured by a camera moving around the transparent object at a constant speed, the first image, the second image and the real image are selected at training time, and the mixing coefficient is determined once they have been selected. The mixing coefficient represents the viewing angle relation among the first viewing angle, the second viewing angle and the new viewing angle, and the mixing coefficient α can be calculated by formula (1):

α = (t - L) / (R - L)    (1)

where t is the viewing angle sequence number of the real image, L is the viewing angle sequence number of the first image, and R is the viewing angle sequence number of the second image. The mixing coefficient is input into the convolutional neural network, and according to the mixing coefficient the convolutional neural network outputs the mask, refraction flow and attenuation map corresponding to the predicted image of the first image and the second image under that mixing coefficient.
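As an illustration of how a training image group and its mixing coefficient might be assembled from such an image sequence, a minimal Python sketch follows; the function and variable names are hypothetical and the sampling strategy is only one possibility, not part of the claimed method.

```python
import random

def sample_training_group(image_sequence):
    """Sample (C_L, C_R, C_t, alpha) from a view-ordered image sequence.

    image_sequence: images C_0 ... C_N captured while the camera moves
    around the transparent object at a constant speed (length >= 3).
    """
    n = len(image_sequence) - 1
    # Randomly pick view indices with 0 <= L < t < R <= N.
    L, R = sorted(random.sample(range(n + 1), 2))
    while R - L < 2:                      # need at least one view in between
        L, R = sorted(random.sample(range(n + 1), 2))
    t = random.randint(L + 1, R - 1)

    alpha = (t - L) / (R - L)             # formula (1): mixing coefficient
    return image_sequence[L], image_sequence[R], image_sequence[t], alpha
```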
In the embodiment of the invention, the convolutional neural network can obtain, from the first image, the second image and the mixing coefficient, the prediction mask m̂, the prediction attenuation map ρ̂ and the prediction refraction flow Ŵ corresponding to the new viewing angle. Details of step S1 will be described later.
And S2, calculating a predicted image of the first image and the second image under the mixing coefficient according to the prediction mask, the prediction attenuation map and the prediction refraction flow, wherein the predicted image is the transparent object image predicted by the convolutional neural network at the new viewing angle.
In the embodiment of the invention, the environment matte can describe the reflection and refraction of the transparent object when it interacts with light in the environment, as well as any transmission effect of the foreground object; in order to composite the transparent object into a new background well, the core of the environment matte is to accurately estimate the light transmission matrix. Using the environment matte, the new viewing angle image with viewing angle sequence number t, that is, the transparent object image predicted by the convolutional neural network at the new viewing angle, can be synthesized from the prediction mask, the prediction attenuation map and the prediction refraction flow. For a transparent object, the environment matte can be expressed as follows:

C_ij = (1 - m_ij)·B_ij + m_ij·(F_ij + ρ_ij·B(W_ij))    (2)

where C denotes the composite image, F denotes the ambient illumination, and B is the background image. If the background image B equals 0, then C equals F; that is, when the background image is pure black, the ambient illumination F is easily obtained. Further, since the subject is a transparent object, F = 0. In addition, m ∈ {0, 1} denotes the binary object mask; where m = 0, the composite color comes directly from the background image. The refraction flow W is used to represent the light transmission matrix and characterizes the correspondence between pixels of the composite image and pixels of the background image; for simplicity, it is assumed that one pixel of the composite image comes from only one corresponding pixel in the background image, and W denotes the correspondence between a single pixel in the composite image and a pixel in the background image. B(W) denotes the background image indexed pixel by pixel through W: for example, if W_ij = (a, b), then B_ab is indexed to compute C_ij, where B_ab and C_ij denote the background pixel at location (a, b) and the synthesized pixel value at location (i, j), respectively. Further, ρ denotes the attenuation map; for each pixel, if no light passes through, the attenuation value is 0, and if light passes through without attenuation, the attenuation value equals 1.
In an embodiment of the invention, the prediction mask m̂, the prediction attenuation map ρ̂ and the prediction refraction flow Ŵ output by the convolutional neural network are substituted into formula (2), the pixel value of each pixel in the predicted image is calculated, and the predicted image Ĉ is obtained from these pixel values.
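A minimal sketch of the composition of formula (2) (with F = 0) is given below, assuming the refraction flow stores absolute pixel coordinates into the background image and using PyTorch's grid_sample for the pixel-by-pixel indexing; the tensor layout, coordinate normalization and function name are assumptions for illustration, not the disclosed implementation.

```python
import torch
import torch.nn.functional as F

def composite(mask, attenuation, refraction_flow, background):
    """Apply formula (2) with ambient illumination F = 0.

    mask:            (B, 1, H, W) predicted mask, values in [0, 1]
    attenuation:     (B, 1, H, W) predicted attenuation map
    refraction_flow: (B, 2, H, W) per-pixel (x, y) coordinates into the
                     background image, in pixels
    background:      (B, 3, H, W) background image B
    """
    b, _, h, w = background.shape
    # Convert absolute pixel coordinates to the [-1, 1] range expected
    # by grid_sample (grid layout: (B, H, W, 2), x first).
    grid = refraction_flow.permute(0, 2, 3, 1).clone()
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    refracted_bg = F.grid_sample(background, grid, align_corners=True)

    # C = (1 - m) * B + m * (rho * B(W)), since F = 0 for transparent objects
    return (1.0 - mask) * background + mask * attenuation * refracted_bg
```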
S3, adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image, and continuing to perform the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network until a preset training condition is met, so as to obtain a new viewing angle synthesis model.
In the embodiment of the invention, training is carried out in a supervised manner: the real image is used, together with the outputs of the convolutional neural network and the predicted image, to supervise the training of the convolutional neural network. The mask loss, attenuation loss and refraction flow loss are formed from the mask, attenuation map and refraction flow corresponding to the real image, and, in order to synthesize higher-quality new viewing angle images, a composition loss and a perceptual loss are added during training; the parameters of the convolutional neural network are adjusted accordingly until the preset training condition is met, so as to obtain the new viewing angle synthesis model.
During training, the convolutional neural network outputs the prediction mask, the prediction attenuation map and the prediction refraction flow from the first image, the second image and the mixing coefficient instead of directly producing a predicted image; because the prediction refraction flow reflects the light transmission matrix at the new viewing angle, the convolutional neural network learns the complex light transmission behavior of light passing through the transparent object. The predicted image of the transparent object at the new viewing angle is then obtained from the prediction mask, the prediction attenuation map and the prediction refraction flow, and the new viewing angle synthesis model is obtained by iteratively training the convolutional neural network. Given the transparent object image at the first viewing angle and the transparent object image at the second viewing angle, the trained new viewing angle synthesis model can obtain a high-quality composite image at any viewing angle between the first viewing angle and the second viewing angle.
Details of step S3 will be described later.
The details of step S1 in another implementation will be described in detail below.
In an embodiment of the invention, since light transmission through transparent objects is highly non-linear, the light transmission relationship is learned and modeled by the convolutional neural network by synthesizing a prediction mask, a prediction attenuation map and a prediction refraction flow for the intermediate viewing angle. Referring to FIG. 2, the first image C_L, the second image C_R and the mixing coefficient α are input into the convolutional neural network 100 to obtain the prediction mask m̂, the prediction attenuation map ρ̂ and the prediction refraction flow Ŵ, as shown in formula (3):

(m̂, ρ̂, Ŵ) = Network(C_L, C_R, α)    (3)
Network denotes the convolutional neural network. The convolutional neural network adopts an encoding-decoding framework and learns to synthesize the new viewing angle of the transparent object: the first image and the second image are taken as the input of the convolutional neural network and are projected into a matched depth feature space through successive convolutional layers; after the features are blended, a mixed depth feature is obtained, and the mixed depth feature serves as the basis for decoding and is used to simultaneously predict the mask, the attenuation map and the refraction flow at the new viewing angle.
Specifically, the convolutional neural network includes: an encoding module and a decoding module, and step S1 includes:
and S11, inputting the first image, the second image and the mixing coefficient into the coding module to obtain the depth characteristic.
In the embodiment of the present invention, referring to FIG. 3, the hierarchical structure of the convolutional neural network is shown. The encoding module includes a first encoder enc1 (101), a second encoder enc2 (102) and a convolutional layer CNN, and the first encoder enc1 and the second encoder enc2 share weights. Each encoder has multiple layers; for example, the first encoder and the second encoder each have 8 encoder layers (for convenience of description, the first encoder layer of the first encoder is denoted enc1-L1 (1011)), and the numbers of output channels of the encoder layers are: 64, 128, 256, 512, 512, 512, 512 and 512. In the encoding stage, each encoder uses the 8 consecutive encoder layers to downsample the first image and the second image to 1/256 of their original size.
The depth features include a first depth feature, a second depth feature, a third depth feature, a fourth depth feature, and a mixed depth feature, and specifically, the step S11 further includes:
and S111, inputting the first image into a first encoder to obtain a first depth feature and a second depth feature corresponding to the first image.
In the embodiment of the invention, how to balance depth feature blending and skip connections is studied: the influence on the final synthesis result is quantitatively investigated for combinations in which the last p layers are used for depth feature blending and the first q layers are used for skip connections, denoted (p blended; q connected). The quality of the predicted images obtained with different combinations is evaluated using PSNR and SSIM, and the results are summarized in FIG. 4: 3 examples are selected for quantitative evaluation of the performance of the different networks (the minimum value indicating the best performance), and M(ask)-IoU, A(ttenuation)-MSE, F(low)-EPE, C(omposition)-L1, PSNR and SSIM are evaluated respectively. According to the experimental data, (p = 6; q = 2) is selected as a better combination in order to balance detail preservation and feature blending.
In this embodiment of the present invention, the first depth feature is the feature output by the shallow coding layers of the first encoder; for example, when the first encoder has 8 encoder layers, the first coding layer and the second coding layer may be set as the shallow coding layers, so that the first depth feature includes the first depth feature-L1 output by the first coding layer of the first encoder and the first depth feature-L2 output by the second coding layer of the first encoder. The second depth feature is the feature output by the deep coding layers of the first encoder; for example, when the first encoder has 8 encoder layers, the third coding layer to the eighth coding layer may be set as the deep coding layers, so that the second depth feature includes the output results of the third coding layer to the eighth coding layer of the first encoder, namely the second depth feature-L3, the second depth feature-L4, the second depth feature-L5, the second depth feature-L6, the second depth feature-L7 and the second depth feature-L8.
and S112, inputting the second image into a second encoder to obtain a third depth feature and a fourth depth feature corresponding to the second image.
In this embodiment of the present invention, the third depth feature is the feature output by the shallow coding layers of the second encoder; for example, when the second encoder has 8 encoder layers, the first coding layer and the second coding layer may be set as the shallow coding layers, so that the third depth feature includes the third depth feature-L1 output by the first coding layer of the second encoder and the third depth feature-L2 output by the second coding layer of the second encoder. The fourth depth feature is the feature output by the deep coding layers of the second encoder; for example, when the second encoder has 8 encoder layers, the third coding layer to the eighth coding layer may be set as the deep coding layers, so that the fourth depth feature includes the output results of the third coding layer to the eighth coding layer of the second encoder, namely the fourth depth feature-L3, the fourth depth feature-L4, the fourth depth feature-L5, the fourth depth feature-L6, the fourth depth feature-L7 and the fourth depth feature-L8.
and S113, inputting the second depth feature, the fourth depth feature and the mixing coefficient into the convolutional layer to obtain a mixed depth feature.
In the embodiment of the present invention, the second depth feature is the feature output by the deep coding layers of the first encoder, and the fourth depth feature is the feature output by the deep coding layers of the second encoder. In order to synthesize the new viewing angle image, the inherent transformation relation among the first viewing angle, the second viewing angle and the new viewing angle is simulated by blending in the depth feature space: in the convolutional layer CNN, the depth features of the two encoders are blended according to formula (4) to obtain the mixed depth feature:

f_mix^k = (1 - α)·f_L^k + α·f_R^k    (4)

where k denotes a deep coding layer (assuming the third coding layer to the eighth coding layer are the deep coding layers, k may be 3, 4, …, 8), f_L^k denotes the depth feature corresponding to the first image output by the k-th coding layer of the first encoder, f_R^k denotes the depth feature corresponding to the second image output by the k-th coding layer of the second encoder, α is the mixing coefficient, and f_mix^k is the mixed depth feature of the k-th layer.
And S12, inputting the depth features into the decoding module to obtain a prediction mask, a prediction attenuation map and a prediction refraction flow.
In an embodiment of the invention, referring to FIG. 3, the decoding module comprises a first decoder (103), a second decoder (104) and a third decoder (105), which output the prediction mask, the prediction attenuation map and the prediction refraction flow, respectively, according to the depth features. Assuming that the encoder downsamples the first image and the second image to 1/256 of their original size with 8 consecutive encoding layers, then, due to symmetry, the decoder upsamples the compressed depth features in the opposite way with the same number of transposed decoding layers. Specifically, step S12 includes:
and S121, inputting the first depth feature, the third depth feature and the mixed depth feature into the decoding module and inputting the first depth feature, the third depth feature and the mixed depth feature into a first decoder to obtain a prediction mask.
In the embodiment of the invention, the first depth feature and the third depth feature are the output results of the shallow coding layers of the encoders, and the features of the shallow coding layers are connected to the decoder layers with the same spatial dimensions through skip connections (shown as 501-504 in FIG. 3), so that more detail and context information can be transmitted to the higher-resolution decoding layers.
For example, the first encoder and the second encoder each have 8 encoding layers, the first encoding layer of the first encoder and the first encoding layer of the second encoder are skip-connected to respective first decoding layers of the decoding module, and the second encoding layer of the first encoder and the second encoding layer of the second encoder are skip-connected to respective second decoding layers of the decoding module.
The mixed depth feature is obtained from the output results of the deep coding layers of the encoders, and in step S121 the first decoder is configured to output the prediction mask m̂ according to the first depth feature, the third depth feature and the mixed depth feature.
And S122, inputting the first depth feature, the third depth feature and the mixed depth feature into the second decoder of the decoding module to obtain the prediction attenuation map.
In an embodiment of the invention, the second decoder is configured to output the prediction attenuation map ρ̂.
S123, inputting the first depth feature, the third depth feature and the mixed depth feature into the third decoder of the decoding module to obtain the prediction refraction flow.
In an embodiment of the invention, the third decoder is configured to output the prediction refraction flow Ŵ according to the first depth feature, the third depth feature and the mixed depth feature.
In the embodiment of the present invention, after decoding, the predicted image Ĉ can be obtained by formula (2).
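To make the data flow through the shared-weight encoders, the feature blending and the three decoder heads concrete, a simplified sketch is given below. It is only an illustration: the number of layers is reduced from 8 to 4, the kernel sizes, activations and the linear form of the blending are assumptions, and the class and function names (TransparentViewSynthesisNet, down, up) are hypothetical rather than the exact network of FIG. 3.

```python
import torch
import torch.nn as nn

def down(cin, cout):
    # Stride-2 convolution: each encoder layer halves the spatial size.
    return nn.Sequential(nn.Conv2d(cin, cout, 4, 2, 1), nn.LeakyReLU(0.2, True))

def up(cin, cout):
    # Transposed convolution: each decoder layer doubles the spatial size.
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, 2, 1), nn.ReLU(True))

class TransparentViewSynthesisNet(nn.Module):
    """Shared-weight encoder applied to both views, deep features blended with
    the mixing coefficient, shallow features skip-connected to three decoders."""

    def __init__(self):
        super().__init__()
        self.e1, self.e2 = down(3, 64), down(64, 128)      # shallow layers (skip)
        self.e3, self.e4 = down(128, 256), down(256, 512)  # deep layers (blended)

        def head(out_ch):
            return nn.ModuleDict({
                "d4": up(512, 256),
                "d3": up(256 + 256, 128),       # concat blended e3 features
                "d2": up(128 + 2 * 128, 64),    # concat e2 skips of both views
                "d1": nn.ConvTranspose2d(64 + 2 * 64, out_ch, 4, 2, 1),
            })
        self.mask_head = head(1)
        self.atten_head = head(1)
        self.flow_head = head(2)

    def encode(self, x):
        f1 = self.e1(x)
        f2 = self.e2(f1)
        f3 = self.e3(f2)
        f4 = self.e4(f3)
        return f1, f2, f3, f4

    def decode(self, head, f4b, f3b, skips):
        x = head["d4"](f4b)
        x = head["d3"](torch.cat([x, f3b], dim=1))
        x = head["d2"](torch.cat([x, *skips[1]], dim=1))
        return head["d1"](torch.cat([x, *skips[0]], dim=1))

    def forward(self, img_l, img_r, alpha):
        l1, l2, l3, l4 = self.encode(img_l)    # the same encoder weights are
        r1, r2, r3, r4 = self.encode(img_r)    # used for both input views
        a = torch.as_tensor(alpha, dtype=img_l.dtype,
                            device=img_l.device).view(-1, 1, 1, 1)
        f3b = (1 - a) * l3 + a * r3            # deep-feature blending, formula (4)
        f4b = (1 - a) * l4 + a * r4
        skips = [[l1, r1], [l2, r2]]
        mask = torch.sigmoid(self.decode(self.mask_head, f4b, f3b, skips))
        atten = torch.sigmoid(self.decode(self.atten_head, f4b, f3b, skips))
        flow = torch.tanh(self.decode(self.flow_head, f4b, f3b, skips))
        return mask, atten, flow               # flow in [-1, 1], to be scaled
                                               # by the image size as described
```

In this sketch the two shallow levels are skip-connected to every decoder head while the two deep levels are blended with the mixing coefficient before decoding, mirroring the division between blended deep layers and skip-connected shallow layers described above.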
The details of step S3 in another implementation will be described in detail below.
Specifically, step S3 includes:
S31, calculating the total loss value according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image.
Specifically, step S31 includes:
S311, acquiring a real mask, a real attenuation map and a real refraction flow according to the real image.
Firstly, the corresponding real mask m_t, real attenuation map ρ_t and real refraction flow W_t are acquired from the real image. Referring to FIG. 5, C_t is the real image, m_t is the real mask corresponding to the real image, ρ_t is the real attenuation map corresponding to the real image, and W_t is the real refraction flow corresponding to the real image.
How the corresponding real mask, real attenuation map and real refraction flow are obtained from the real image is described in detail later, when the real dataset is described.
And S312, calculating a mask loss value according to the predicted mask and the real mask.
In the embodiment of the invention, the mask prediction of the transparent object is a binary classification problem; an additional softmax layer may be used to normalize the output, and the mask loss value L_m is calculated using a binary cross-entropy function, as shown in formula (5):

L_m = -(1/(H·W)) Σ_{i,j} [ m_ij·log(m̂_ij) + (1 - m_ij)·log(1 - m̂_ij) ]    (5)

where H and W denote the height and width of the first and second input images (the height of the first input image is the same as that of the second input image, and the width of the first input image is the same as that of the second input image), and m_ij and m̂_ij are the pixel values at location (i, j) of the binary real mask and of the normalized output prediction mask, respectively.
And S313, calculating an attenuation loss value according to the predicted attenuation map and the real attenuation map.
In the embodiment of the invention, the attenuation loss value L_a is calculated using the MSE function, as shown in formula (6):

L_a = (1/(H·W)) Σ_{i,j} (ρ_ij - ρ̂_ij)²    (6)

where ρ_ij and ρ̂_ij denote the real and predicted attenuation values at pixel (i, j), and the prediction attenuation map ρ̂ is normalized using a sigmoid activation function.
And S314, calculating a refraction flow loss value according to the predicted refraction flow and the real refraction flow.
In an embodiment of the present invention, the dimension of the prediction refraction flow is H × W × 2, and it is defined as the index relationship between a synthesized pixel and its corresponding background pixel; the two channels represent the pixel displacement in the x and y dimensions, respectively. The output may be normalized by a tanh activation function and then scaled using the size of the first and second input images. The refraction flow loss value L_f is calculated using the average end-point error (EPE) function, as shown in formula (7):

L_f = (1/(H·W)) Σ_{i,j} √( (W^x_ij - Ŵ^x_ij)² + (W^y_ij - Ŵ^y_ij)² )    (7)

where W and Ŵ denote the real and predicted refraction flow, H and W denote the height and width of the first and second input images (the height of the first input image is the same as that of the second input image, and the width of the first input image is the same as that of the second input image), W^x_ij denotes the pixel displacement in the x dimension of the real refraction flow at position (i, j), Ŵ^x_ij denotes the pixel displacement in the x dimension of the predicted refraction flow at position (i, j), W^y_ij denotes the pixel displacement in the y dimension of the real refraction flow at position (i, j), and Ŵ^y_ij denotes the pixel displacement in the y dimension of the predicted refraction flow at position (i, j).
And S315, calculating a composition loss value and a perceptual loss value according to the predicted image and the real image.
In an embodiment of the present invention, in order to minimize the difference between the predicted image and the real image, the composition loss value L_c may be calculated using the L1 function, as shown in formula (8):

L_c = (1/(H·W)) Σ_{i,j} | Ĉ_ij - C_ij |    (8)

where H and W denote the height and width of the first and second input images (the height of the first input image is the same as that of the second input image, and the width of the first input image is the same as that of the second input image), Ĉ_ij denotes the pixel value of the predicted image at (i, j), and C_ij denotes the pixel value of the real image at (i, j).
Furthermore, in order to better preserve details, reduce ambiguity and increase the sharpness of the predicted image, a perceptual loss L_p is added, as shown in formula (9):

L_p = (1/N)·|| φ(Ĉ) - φ(C) ||²    (9)

where φ(·) denotes the conv4_3 feature of the VGG16 model pre-trained on ImageNet, and N is the total number of channels in that layer.
S316, calculating the total loss value according to the mask loss value, the attenuation loss value, the refraction flow loss value, the composition loss value and the perceptual loss value.
In an embodiment of the invention, the total loss value is calculated from the prediction mask, the prediction attenuation map, the prediction refraction flow, the predicted image and the real image; the network is trained by minimizing the total loss value, which can be expressed by formula (10):

L = ω_m·L_m + ω_a·L_a + ω_f·L_f + ω_c·L_c + ω_p·L_p    (10)

where L denotes the total loss value, L_m denotes the mask loss value and ω_m its balance weight, L_a denotes the attenuation loss value and ω_a its balance weight, L_f denotes the refraction flow loss value and ω_f its balance weight, L_c denotes the composition loss value and ω_c its balance weight, and L_p denotes the perceptual loss value and ω_p its balance weight. The weights may be set as ω_m = 1, ω_a = 10, ω_f = 1, ω_c = 10 and ω_p = 1.
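A compact sketch of how the losses of formulas (5) to (10) could be combined is given below; the use of torchvision's VGG16 truncated around conv4_3, the mean reductions and the helper names are assumptions for illustration, not the exact implementation of the embodiment.

```python
import torch
import torch.nn.functional as F
import torchvision

class PerceptualFeatures(torch.nn.Module):
    """VGG16 features up to conv4_3 (ImageNet-pretrained, frozen).
    Newer torchvision versions use the `weights=` argument instead."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(pretrained=True).features[:23].eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg

    def forward(self, x):
        # Simplification: ImageNet mean/std normalization is omitted here.
        return self.vgg(x)

def total_loss(pred_mask, pred_atten, pred_flow, pred_img,
               gt_mask, gt_atten, gt_flow, gt_img, perceptual,
               w=(1.0, 10.0, 1.0, 10.0, 1.0)):
    l_m = F.binary_cross_entropy(pred_mask, gt_mask)            # formula (5)
    l_a = F.mse_loss(pred_atten, gt_atten)                      # formula (6)
    l_f = torch.norm(pred_flow - gt_flow, dim=1).mean()         # EPE, formula (7)
    l_c = F.l1_loss(pred_img, gt_img)                           # formula (8)
    l_p = F.mse_loss(perceptual(pred_img), perceptual(gt_img))  # formula (9)
    wm, wa, wf, wc, wp = w
    return wm * l_m + wa * l_a + wf * l_f + wc * l_c + wp * l_p  # formula (10)
```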
And S32, adjusting the parameters of the convolutional neural network according to the total loss value.
In the embodiment of the present invention, training may be implemented with PyTorch: the parameters of the convolutional neural network are initialized with the Xavier algorithm, the Adam algorithm with default parameters is used as the optimizer, and after the parameters are modified, the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network continues to be performed. In one implementation, with the learning rate fixed to 0.0002, training for 100 epochs on a Titan X GPU takes about 10 to 12 hours.
In another implementation, after the parameters are modified, the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network continues to be performed until a preset training condition is met, where the preset training condition includes the total loss value meeting a preset requirement or the number of training iterations reaching a preset number. The preset requirement can be determined according to the new viewing angle synthesis model and is not described in detail here; the preset number may be the maximum number of training iterations of the convolutional neural network, for example 50000. Therefore, after the total loss value is calculated, it is judged whether the total loss value meets the preset requirement; if so, the training ends. If not, it is judged whether the number of training iterations of the convolutional neural network has reached the preset number; if not, the parameters of the convolutional neural network are adjusted according to the total loss value, and if so, the training ends. Judging whether the training of the convolutional neural network is finished according to both the loss value and the number of training iterations prevents training from entering an infinite loop because the loss value cannot reach the preset requirement.
Further, since the parameters of the convolutional neural network are modified only when the preset condition is not met (for example, the total loss value does not meet the preset requirement and the number of training iterations has not reached the preset number), after the parameters of the convolutional neural network are modified according to the total loss value, the convolutional neural network needs to continue to be trained, that is, the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network continues to be performed. The first image, the second image and the mixing coefficient used in this continued step may be ones that have not yet been input into the convolutional neural network. For example, all first images and second images in the training data have unique image identifiers (e.g., viewing angle sequence numbers) and the values of the mixing coefficients differ, so the image identifiers of the first and second images input into the convolutional neural network for the first training iteration differ from those input for the second training iteration: for instance, the first image input for the first training iteration has viewing angle sequence number 1, the second image has viewing angle sequence number 7 and the mixing coefficient is 0.5, while the first image input for the second training iteration has viewing angle sequence number 2, the second image has viewing angle sequence number 10 and the mixing coefficient is 0.6.
In practical applications, because the number of first images and second images in the training data is limited, in order to improve the training effect of the convolutional neural network, the first images, second images and mixing coefficients in the training data may be input into the convolutional neural network in sequence to train it; after all first images, second images and corresponding mixing coefficients in the training data have been input, the operation of sequentially inputting them into the convolutional neural network can continue to be performed, so that the training image groups in the training data are input into the convolutional neural network cyclically. It should be noted that, in the process of inputting the first images and second images for training, the images may or may not be input in the order of the viewing angle sequence numbers of the first images; likewise, the same first image, second image and mixing coefficient may or may not be reused for training the convolutional neural network.
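The training procedure described above can be organized roughly as in the following sketch; the data loader layout, the helper names (compositor, loss_fn, perceptual) and the stopping thresholds are assumptions for illustration rather than the exact implementation.

```python
import torch

def train(model, loader, compositor, loss_fn, perceptual,
          lr=0.0002, max_iters=50000, loss_threshold=None, device="cuda"):
    """Each training group: two input views, the mixing coefficient, the
    background image and the ground-truth image / mask / attenuation / flow."""
    model.to(device)

    # Xavier initialization of the convolution layers, Adam with default betas.
    for m in model.modules():
        if isinstance(m, (torch.nn.Conv2d, torch.nn.ConvTranspose2d)):
            torch.nn.init.xavier_uniform_(m.weight)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    it = 0
    while it < max_iters:
        for img_l, img_r, alpha, bg, gt_img, gt_mask, gt_atten, gt_flow in loader:
            img_l, img_r, alpha, bg = (x.to(device) for x in (img_l, img_r, alpha, bg))
            gt_img, gt_mask, gt_atten, gt_flow = (x.to(device) for x in
                                                  (gt_img, gt_mask, gt_atten, gt_flow))
            pred_mask, pred_atten, pred_flow = model(img_l, img_r, alpha)
            # pred_flow assumed scaled to pixel coordinates before compositing.
            pred_img = compositor(pred_mask, pred_atten, pred_flow, bg)  # formula (2)
            loss = loss_fn(pred_mask, pred_atten, pred_flow, pred_img,
                           gt_mask, gt_atten, gt_flow, gt_img, perceptual)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            it += 1
            # Preset training condition: the loss meets the requirement or the
            # number of iterations reaches the preset number.
            if loss_threshold is not None and loss.item() < loss_threshold:
                return model
            if it >= max_iters:
                return model
    return model
```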
The training data in one implementation is described in detail below.
Since there is currently no open dataset specifically for new viewing angle synthesis of transparent objects, the present invention creates a training data set comprising a synthetic dataset and a real dataset: the synthetic dataset comprises 8 different model classes rendered with POVRay under different camera views, serving as selectable first and second images, and the real dataset includes 6 real transparent objects photographed for evaluation.
In an embodiment of the invention, the synthetic dataset contains 3D objects of 8 classes collected from ShapeNet, including Airplane, Bench, Bottle, Car, Jar, Lamp and Table; for each class, 400 models are randomly selected, 350 for training and 50 for testing. In addition, 400 Glass_water models are used as an additional example to verify that the new viewing angle synthesis model obtained after training can be effectively extended to general examples. During rendering, each model appears as a transparent object with the refractive index set to 1.5. The camera used to capture the transparent object is arranged as a pinhole model with fixed focal length and viewpoint, and the resolution of the display screen is 512 x 512; for each camera view, the screen displays a series of binary Gray code images for mask extraction and environment matting, so 18 Gray code images need to be rendered, 9 for rows and 9 for columns. Furthermore, the attenuation map is easily obtained by rendering the model in front of a pure white background image; the background image used in rendering is shown in FIG. 6. Each pixel in the background image is pre-coded with a unique color value to avoid repetitive patterns and to help the loss function compute gradients more efficiently during grid sampling. In view of the rendering of each object, in order to meet the preset training requirements and increase the diversity of training examples, the object is first randomly rotated to some initialized position in the virtual scene and then rotated from -10° to 10° around the y-axis (the coordinate system of POVRay), and an image sequence is acquired at a rotation interval of 2°.
In an embodiment of the present invention, the real dataset contains 6 real transparent objects (Hand, Goblet, Dog, Monkey, Mouse and Rabbit) used for algorithm evaluation, see fig. 7. The real images for training and testing were captured with a Point Grey Flea color camera (FL3-U3-13S2C-CS). Similar to the rendering of the synthetic dataset, the transparent object is placed on a turntable in front of a DELL LCD display (U2412M). During shooting, the turntable is rotated from 0° to 360° at intervals of 2°. Gray code patterns, a pure white image and the color-coded background image are displayed on the display and are used to extract the real masks, real attenuation maps and real refraction flows.
In the embodiment of the present invention, in addition to the three categories evaluated in fig. 4, quantitative evaluations were also performed on the other 5 categories, see fig. 8; the average PSNR and SSIM of each category are higher than 20.0 and 0.85 respectively, which shows that the images synthesized by the network produce good visual results.
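For reference, PSNR and SSIM values of the kind reported here can be computed as in the sketch below; scikit-image (version 0.19 or later for the channel_axis argument) is assumed, and the exact evaluation settings used in the patent are not specified here.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(synthesized, real):
    """Compare a synthesized view with its corresponding real image.
    Both inputs are H x W x 3 uint8 arrays of the same view."""
    psnr = peak_signal_noise_ratio(real, synthesized, data_range=255)
    ssim = structural_similarity(real, synthesized, channel_axis=-1, data_range=255)
    return psnr, ssim
```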
When the method is trained, the convolutional neural network outputs a prediction mask, a prediction attenuation map and a prediction refraction flow according to the first image, the second image and the mixing coefficient, instead of directly obtaining a prediction image; the prediction mask, the prediction attenuation map and the prediction refraction flow reflect the light transmission matrix of the new visual angle, so that the convolutional neural network learns the complex transport behavior of light passing through the transparent object. The prediction image of the transparent object at the new visual angle is then obtained from the prediction mask, the prediction attenuation map and the prediction refraction flow, and the new visual angle synthesis model is obtained through iterative training. The new visual angle synthesis model obtained by training can obtain, from the transparent object image at the first visual angle and the transparent object image at the second visual angle, a synthesized image at any visual angle between the first visual angle and the second visual angle, and the synthesized image is of high quality.
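A minimal sketch of one training iteration, written in PyTorch, is given below for illustration. The network net, the composition function compose (standing in for the environment mask of formula (2)), the perceptual distance perceptual and the equal loss weights are all placeholders assumed for this sketch; they are not the patent's actual architecture, loss forms or weights.

```python
import torch.nn.functional as F

def training_step(net, optimizer, first, second, alpha, real,
                  real_mask, real_atten, real_flow, compose, perceptual):
    """One iteration: predict mask / attenuation map / refraction flow,
    compose the prediction image, combine the mask, attenuation, refraction
    flow, composition and perceptual loss terms, and adjust the network
    parameters by back-propagation."""
    pred_mask, pred_atten, pred_flow = net(first, second, alpha)
    pred_image = compose(pred_mask, pred_atten, pred_flow)

    total_loss = (F.binary_cross_entropy(pred_mask, real_mask)   # mask loss
                  + F.l1_loss(pred_atten, real_atten)            # attenuation loss
                  + F.l1_loss(pred_flow, real_flow)              # refraction flow loss
                  + F.l1_loss(pred_image, real)                  # composition loss
                  + perceptual(pred_image, real))                # perceptual loss

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```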
The embodiment of the present invention further provides a new viewing angle synthesis method for a transparent object, referring to fig. 9, the method may include the following steps:
K1, acquiring a first image to be processed, a second image to be processed and a preset mixing coefficient.
In the embodiment of the present invention, the view sequence number x of the first image X to be processed is different from the view sequence number y of the second image Y to be processed, and the preset mixing coefficient α' is greater than 0 and smaller than 1. The view sequence number of the synthesized image can be obtained from equation (11).
[Equation (11), given as an image in the original publication, expresses the view sequence number of the synthesized image in terms of the view sequence numbers x and y and the preset mixing coefficient α'.]
For example, if the view sequence number x is equal to 2, the view sequence number y is equal to 8 and the preset mixing coefficient α' is 0.5, the view sequence number of the synthesized image is 4.
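Because equation (11) is only available as an image in the source, the sketch below assumes a simple linear relation between the view sequence numbers and the mixing coefficient; this assumed form is for illustration only and may differ from the exact formula in the patent.

```python
def synthesized_view_number(x, y, alpha):
    """Assumed linear form: view sequence number of the synthesized image
    for first-view number x, second-view number y and 0 < alpha < 1."""
    return (1 - alpha) * x + alpha * y

def mixing_coefficient(x, y, z):
    """Inverse of the assumed relation: mixing coefficient for the real
    (new-view) image whose view sequence number is z."""
    return (z - x) / (y - x)
```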
K2, inputting the first image to be processed, the second image to be processed and the preset mixing coefficient into a new visual angle synthesis model to obtain a mask to be processed, an attenuation map to be processed and a refraction flow to be processed; the new visual angle synthetic model is obtained by the training method of the new visual angle synthetic model of the transparent object.
K3, calculating to obtain a composite image of the first image to be processed and the second image to be processed under a preset mixing coefficient according to the mask to be processed, the attenuation map to be processed and the refraction flow to be processed by adopting an environment mask, wherein the visual angle of the composite image is between the visual angle of the first image to be processed and the visual angle of the second image to be processed.
In the embodiment of the invention, by adopting the environment mask expression shown in formula (2), the synthesized image can be obtained from the mask m to be processed, the attenuation map ρ to be processed and the refraction flow W to be processed.
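Formula (2) itself is not reproduced in this part of the description, so the sketch below assumes the common environment-matting form I = m * ρ * B(W) + (1 - m) * B, where B is the known background and B(W) denotes the background sampled at the pixel positions given by the refraction flow; the function name and the nearest-neighbour sampling are illustrative assumptions, not the patent's exact expression.

```python
import numpy as np

def compose_with_environment_mask(mask, atten, flow, background):
    """Compose a view from mask m (H x W), attenuation map rho (H x W) and
    refraction flow W (H x W x 2, pixel coordinates into the background),
    against a background image B of the same H x W x 3 size, assuming
    I = m * rho * B(W) + (1 - m) * B with nearest-neighbour sampling."""
    ys = np.clip(np.rint(flow[..., 1]).astype(int), 0, background.shape[0] - 1)
    xs = np.clip(np.rint(flow[..., 0]).astype(int), 0, background.shape[1] - 1)
    refracted = background[ys, xs]          # B sampled at the refraction flow
    m = mask[..., None]
    rho = atten[..., None]
    return m * rho * refracted + (1.0 - m) * background
```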
By way of example, fig. 10 shows a synthesis example of Airplane, in which the viewing angle of the camera relative to the object ranges from -10° to 10°: A is the image taken at -10°, B the image taken at -8°, C at -6°, D at -4°, E at -2°, F at 0°, G at 2°, H at 4°, I at 6°, J at 8° and K at 10°. With image A taken as the first image to be processed and image K as the second image to be processed, the mask to be processed, the attenuation map to be processed and the refraction flow to be processed can be output at different new visual angles (i.e. different preset mixing coefficients); these outputs are pixel-wise quantities and can be visualized as images. A1 is the visualized mask to be processed corresponding to image A, B1 is the visualized mask to be processed corresponding to image B, ..., and K1 is the visualized mask to be processed corresponding to image K; A2 is the visualized attenuation map to be processed corresponding to image A, B2 corresponds to image B, ..., and K2 corresponds to image K; A3 is the visualized refraction flow to be processed corresponding to image A, B3 corresponds to image B, ..., and K3 corresponds to image K. From the mask to be processed, the attenuation map to be processed and the refraction flow to be processed, the synthesized image is obtained through the environment mask: a is the synthesized image at the -10° view obtained from images A and K, b is the synthesized image at -8° obtained from images A and K, c is the synthesized image at -6°, ..., and k is the synthesized image at 10° obtained from images A and K. Compared with each corresponding real image, the average PSNR and SSIM in this case are (25.7, 0.9567) and (19.4, 0.9004), which clearly shows that the synthesis results are visually reasonable.
Referring to fig. 11, this embodiment also shows a synthesis example of Glass_water, where A is the image taken at -10°, B the image taken at -8°, C at -6°, D at -4°, E at -2°, F at 0°, G at 2°, H at 4°, I at 6°, J at 8° and K at 10°. Using the new visual angle synthesis model obtained through training, with image A as the first image to be processed and image K as the second image to be processed, different synthesized images can be obtained at different new visual angles (i.e. different preset mixing coefficients). The average PSNR and SSIM in this case are (19.4, 0.9004), which clearly shows that the synthesis results are visually reasonable.
Referring to fig. 12, the embodiment of the present invention further shows a synthesis example of Bottle, where A is the image captured at -10°, B the image captured at 0° and C the image captured at 10°; image A is taken as the first image to be processed and image C as the second image to be processed. A1, B1 and C1 are the visualized masks to be processed corresponding to the real images A, B and C respectively; A2, B2 and C2 are the corresponding visualized attenuation maps to be processed; and A3, B3 and C3 are the corresponding visualized refraction flows to be processed. a is the synthesized image at the -10° view obtained from images A and C, b is the synthesized image at 0° obtained from images A and C, and c is the synthesized image at 10° obtained from images A and C. The average PSNR and SSIM in this case are (23.5, 0.9584), which clearly shows that the synthesis results are visually reasonable.
Referring to fig. 13, the embodiment of the present invention further shows a synthesis example of Bench, presented in the same format as fig. 12: A is the image captured at -10°, B the image captured at 0° and C the image captured at 10°; image A is taken as the first image to be processed and image C as the second image to be processed. A1, B1 and C1 are the visualized masks to be processed corresponding to the real images A, B and C respectively; A2, B2 and C2 are the corresponding visualized attenuation maps to be processed; and A3, B3 and C3 are the corresponding visualized refraction flows to be processed. a is the synthesized image at the -10° view obtained from images A and C, b is the synthesized image at 0° obtained from images A and C, and c is the synthesized image at 10° obtained from images A and C. The average PSNR and SSIM in this case are (21.6, 0.9243), which clearly shows that the synthesis results are visually reasonable.
Referring to fig. 14, the embodiment of the present invention further shows a synthesis example of Table, presented in the same format as fig. 12: A is the image captured at -10°, B the image captured at 0° and C the image captured at 10°; image A is taken as the first image to be processed and image C as the second image to be processed. A1, B1 and C1 are the visualized masks to be processed corresponding to the real images A, B and C respectively; A2, B2 and C2 are the corresponding visualized attenuation maps to be processed; and A3, B3 and C3 are the corresponding visualized refraction flows to be processed. a is the synthesized image at the -10° view obtained from images A and C, b is the synthesized image at 0° obtained from images A and C, and c is the synthesized image at 10° obtained from images A and C. The average PSNR and SSIM in this case are (21.4, 0.9907), which clearly shows that the synthesis results are visually reasonable.
The above examples show that the trained new visual angle synthesis model of the transparent object can accurately predict and reproduce the light transmission characteristics of different objects at new visual angles.
In one embodiment, the present invention provides a computer device, which may be a terminal, whose internal structure is shown in fig. 15. The computer device includes a processor, a memory, a network interface, a display screen and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the training method of the new visual angle synthesis model of the transparent object. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a key, a trackball or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad or mouse.
Those skilled in the art will appreciate that the structure shown in fig. 15 is merely a block diagram of part of the structure related to the present solution and does not limit the computer device to which the present solution is applied; a particular computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, there is provided a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
inputting a first image, a second image and a mixing coefficient in training data into a convolutional neural network, and outputting a prediction mask, a prediction attenuation map and a prediction refraction flow through the convolutional neural network, wherein the training data comprises a plurality of groups of training image groups, each group of training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first visual angle, the second image is a transparent object image shot at a second visual angle, the real image is a transparent object image shot at a new visual angle between the first visual angle and the second visual angle, and the mixing coefficient represents the relationship among the first visual angle, the second visual angle and the new visual angle;
according to the prediction mask, the prediction attenuation map and the prediction refraction flow, calculating to obtain a prediction image of the first image and the second image under a mixed coefficient, wherein the prediction image is a transparent object image predicted by a convolutional neural network under a new visual angle;
and adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the prediction image and the real image, and continuously executing the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network until preset training conditions are met so as to obtain a new view angle synthetic model.
In one embodiment, a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, performs the steps of:
inputting a first image, a second image and a mixing coefficient in training data into a convolutional neural network, and outputting a prediction mask, a prediction attenuation map and a prediction refraction flow through the convolutional neural network, wherein the training data comprises a plurality of groups of training image groups, each group of training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first visual angle, the second image is a transparent object image shot at a second visual angle, the real image is a transparent object image shot at a new visual angle between the first visual angle and the second visual angle, and the mixing coefficient represents the relationship among the first visual angle, the second visual angle and the new visual angle;
according to the prediction mask, the prediction attenuation map and the prediction refraction flow, calculating to obtain a prediction image of the first image and the second image under a mixed coefficient, wherein the prediction image is a transparent object image predicted by a convolutional neural network under a new visual angle;
and adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the prediction image and the real image, and continuously executing the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network until preset training conditions are met so as to obtain a new view angle synthetic model.
According to the training method provided by the embodiment of the invention, a first image, a second image and a mixing coefficient in training data are input into a convolutional neural network, and a prediction mask, a prediction attenuation map and a prediction refraction flow are output through the convolutional neural network, wherein the training data comprises a plurality of groups of training image groups, each group of training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first visual angle, the second image is a transparent object image shot at a second visual angle, the real image is a transparent object image shot at a new visual angle between the first visual angle and the second visual angle, and the mixing coefficient represents the visual angle relationship among the first visual angle, the second visual angle and the new visual angle; calculating to obtain a predicted image according to the prediction mask, the prediction attenuation map and the prediction refraction flow, wherein the predicted image is a transparent object image predicted by a convolutional neural network under a new visual angle; and adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the prediction image and the real image, and continuously executing the step of inputting the first image and the second image in the training data into the convolutional neural network until a preset training condition is met to obtain a new view synthesis model. During training, the convolutional neural network outputs a prediction mask, a prediction attenuation map and a prediction refraction flow according to a first image, a second image and a mixing coefficient instead of directly obtaining a prediction image, wherein the prediction refraction flow reflects a light transmission matrix of a new visual angle, so that the convolutional neural network learns the complex light transmission behavior of light rays passing through a transparent object, then obtains the prediction image of the transparent object under the new visual angle according to the prediction mask, the prediction attenuation map and the prediction refraction flow, and obtains a new visual angle synthetic model through iterative training of the convolutional neural network; the new visual angle synthesis model obtained through training can obtain a synthesis image of any visual angle between the first visual angle and the second visual angle according to the transparent image of the first visual angle and the transparent image of the second visual angle, and the synthesis image has high quality.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

Claims (10)

1. A method of training a new perspective composite model of a transparent object, the method comprising:
inputting a first image, a second image and a mixing coefficient in training data into a convolutional neural network, and outputting a prediction mask, a prediction attenuation map and a prediction refraction flow through the convolutional neural network, wherein the training data comprises a plurality of groups of training image groups, each group of training image group comprises a first image, a second image, a real image and a mixing coefficient, the first image is a transparent object image shot at a first visual angle, the second image is a transparent object image shot at a second visual angle, the real image is a transparent object image shot at a new visual angle between the first visual angle and the second visual angle, and the mixing coefficient represents the relationship among the first visual angle, the second visual angle and the new visual angle;
according to the prediction mask, the prediction attenuation map and the prediction refraction flow, calculating to obtain a prediction image of the first image and the second image under a mixed coefficient, wherein the prediction image is a transparent object image predicted by a convolutional neural network under a new visual angle;
and adjusting parameters of the convolutional neural network according to the prediction mask, the prediction attenuation map, the prediction refraction flow, the prediction image and the real image, and continuously executing the step of inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network until preset training conditions are met so as to obtain a new view angle synthetic model.
2. The method of claim 1, wherein the convolutional neural network comprises: an encoding module and a decoding module, and wherein the inputting the first image and the second image in the training data into the convolutional neural network and outputting the prediction mask, the prediction attenuation map and the prediction refraction flow through the convolutional neural network comprises:
inputting the first image, the second image and the mixing coefficient into the coding module to obtain a depth characteristic;
inputting the depth features into the decoding module to obtain a prediction mask, a prediction attenuation map and a prediction refraction flow.
3. The method of claim 2, wherein the encoding module comprises a first encoder, a second encoder and a convolutional layer, the depth features comprise a first depth feature, a second depth feature, a third depth feature, a fourth depth feature and a mixed depth feature, and the inputting the first image, the second image and the mixing coefficient into the encoding module to obtain the depth features comprises:
inputting the first image into a first encoder to obtain a first depth feature and a second depth feature corresponding to the first image;
inputting the second image into a second encoder to obtain a third depth feature and a fourth depth feature corresponding to the second image;
the second depth feature, the fourth depth feature, and the blending coefficient are input to the convolutional layer to obtain a blended depth feature.
4. The method of claim 3, wherein the decoding module comprises a first decoder, a second decoder and a third decoder, and the inputting the depth features into the decoding module to obtain the prediction mask, the prediction attenuation map and the prediction refraction flow comprises:
inputting the first depth feature, the third depth feature and the mixed depth feature into the first decoder of the decoding module to obtain the prediction mask;
inputting the first depth feature, the third depth feature and the mixed depth feature into the second decoder of the decoding module to obtain the prediction attenuation map;
and inputting the first depth feature, the third depth feature and the mixed depth feature into the third decoder of the decoding module to obtain the prediction refraction flow.
5. The method of claim 1, wherein said adjusting parameters of said convolutional neural network based on said prediction mask, said prediction attenuation map, said prediction refraction flow, said prediction image and said real image comprises:
calculating a total loss value from the prediction mask, the prediction attenuation map, the prediction refraction flow, the prediction image and the real image;
and adjusting parameters of the convolutional neural network according to the total loss value.
6. The method of claim 5, wherein said calculating a total loss value from said prediction mask, said prediction attenuation map, said prediction refraction flow, said prediction image and said real image comprises:
calculating a real mask, a real attenuation map and a real refraction flow according to the real image;
calculating a mask loss value according to the predicted mask and the real mask;
calculating an attenuation loss value according to the predicted attenuation map and the real attenuation map;
calculating a refraction flow loss value according to the predicted refraction flow and the real refraction flow;
calculating a composition loss value and a perception loss value according to the predicted image and the real image;
calculating a total loss value from the mask loss value, the attenuation loss value, the refraction flow loss value, the composition loss value and the perceptual loss value.
7. The method of any one of claims 1 to 6, wherein before inputting the first image, the second image and the mixing coefficient in the training data into the convolutional neural network, the method further comprises:
and calculating a mixing coefficient according to the visual angle sequence number of the first image, the visual angle sequence number of the second image and the visual angle sequence number of the real visual angle image.
8. A method for new viewing angle synthesis of a transparent object, the method comprising:
acquiring a first image to be processed, a second image to be processed and a preset mixing coefficient;
inputting the first image to be processed, the second image to be processed and the preset mixing coefficient into a new visual angle synthesis model to obtain a mask to be processed, an attenuation map to be processed and a refraction flow to be processed; wherein the new visual angle synthesis model is obtained by training according to the method of any one of claims 1 to 7;
and calculating to obtain a synthetic image of the first image to be processed and the second image to be processed under a preset mixing coefficient according to the mask to be processed, the attenuation map to be processed and the refraction flow to be processed by adopting an environment mask, wherein the visual angle of the synthetic image is between the visual angle of the first image to be processed and the visual angle of the second image to be processed.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN201910964836.6A 2019-10-11 2019-10-11 Training method and computer equipment for new visual angle synthetic model of transparent object Active CN110689514B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910964836.6A CN110689514B (en) 2019-10-11 2019-10-11 Training method and computer equipment for new visual angle synthetic model of transparent object

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910964836.6A CN110689514B (en) 2019-10-11 2019-10-11 Training method and computer equipment for new visual angle synthetic model of transparent object

Publications (2)

Publication Number Publication Date
CN110689514A true CN110689514A (en) 2020-01-14
CN110689514B CN110689514B (en) 2022-11-11

Family

ID=69112213

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910964836.6A Active CN110689514B (en) 2019-10-11 2019-10-11 Training method and computer equipment for new visual angle synthetic model of transparent object

Country Status (1)

Country Link
CN (1) CN110689514B (en)

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004164571A (en) * 2002-06-27 2004-06-10 Mitsubishi Electric Research Laboratories Inc Method for modeling three-dimensional object
US8947430B1 (en) * 2010-02-26 2015-02-03 Nvidia Corporation System and method for rendering a particle-based fluid surface
US20190026956A1 (en) * 2012-02-24 2019-01-24 Matterport, Inc. Employing three-dimensional (3d) data predicted from two-dimensional (2d) images using neural networks for 3d modeling applications and other applications
US20140043436A1 (en) * 2012-02-24 2014-02-13 Matterport, Inc. Capturing and Aligning Three-Dimensional Scenes
US20140055570A1 (en) * 2012-03-19 2014-02-27 Fittingbox Model and method for producing 3d photorealistic models
US20170352149A1 (en) * 2014-12-24 2017-12-07 Datalogic Ip Tech S.R.L. System and method for identifying the presence or absence of transparent pills in blister packer machines using high resolution 3d stereo reconstruction based on color linear cameras
US20180137611A1 (en) * 2016-11-14 2018-05-17 Ricoh Co., Ltd. Novel View Synthesis Using Deep Convolutional Neural Networks
CN106683188A (en) * 2016-11-17 2017-05-17 长春理工大学 Double-surface three-dimensional reconstructing method, device and system for transparent object
EP3343507A1 (en) * 2016-12-30 2018-07-04 Dassault Systèmes Producing a segmented image of a scene
CN106920243A (en) * 2017-03-09 2017-07-04 桂林电子科技大学 The ceramic material part method for sequence image segmentation of improved full convolutional neural networks
US20190095791A1 (en) * 2017-09-26 2019-03-28 Nvidia Corporation Learning affinity via a spatial propagation neural network
CN108416834A (en) * 2018-01-08 2018-08-17 长春理工大学 Transparent objects surface three dimension reconstructing method, device and system
WO2019140414A1 (en) * 2018-01-14 2019-07-18 Light Field Lab, Inc. Systems and methods for rendering data from a 3d environment
CN108416751A (en) * 2018-03-08 2018-08-17 深圳市唯特视科技有限公司 A kind of new viewpoint image combining method assisting full resolution network based on depth
US20190304134A1 (en) * 2018-03-27 2019-10-03 J. William Mauchly Multiview Estimation of 6D Pose
US20190304069A1 (en) * 2018-03-29 2019-10-03 Pixar Denoising monte carlo renderings using neural networks with asymmetric loss
CN108765425A (en) * 2018-05-15 2018-11-06 深圳大学 Image partition method, device, computer equipment and storage medium
CN109118531A (en) * 2018-07-26 2019-01-01 深圳大学 Three-dimensional rebuilding method, device, computer equipment and the storage medium of transparent substance
CN109238167A (en) * 2018-07-26 2019-01-18 深圳大学 Transparent substance light corresponding relationship acquisition system
CN109712080A (en) * 2018-10-12 2019-05-03 迈格威科技有限公司 Image processing method, image processing apparatus and storage medium
CN110033486A (en) * 2019-04-19 2019-07-19 山东大学 Transparent crystal growth course edge and volume method of real-time and system
CN110060335A (en) * 2019-04-24 2019-07-26 吉林大学 There are the virtual reality fusion methods of mirror article and transparent substance in a kind of scene

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BOJIAN WU ET AL.: "Full 3D Reconstruction of Transparent Objects", 《ACM TRANSACTIONS ON GRAPHICS》 *
GUANYING CHEN ET AL.: "TOM-Net: Learning Transparent Object Matting from a Single Image", 《2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
JONATHAN DYSSEL STETS ET AL.: "Single-Shot Analysis of Refractive Shape Using Convolutional Neural Networks", 《 19TH IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022077146A1 (en) * 2020-10-12 2022-04-21 深圳大学 Mesh reconstruction method and apparatus for transparent object, and computer device, and storage medium
US11748948B2 (en) 2020-10-12 2023-09-05 Shenzhen University Mesh reconstruction method and apparatus for transparent object, computer device and storage medium

Also Published As

Publication number Publication date
CN110689514B (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN111798400B (en) Non-reference low-illumination image enhancement method and system based on generation countermeasure network
Baldassarre et al. Deep koalarization: Image colorization using cnns and inception-resnet-v2
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
Huang et al. Bidirectional recurrent convolutional networks for multi-frame super-resolution
CN106780543B (en) A kind of double frame estimating depths and movement technique based on convolutional neural networks
CN110599395B (en) Target image generation method, device, server and storage medium
CN110378838B (en) Variable-view-angle image generation method and device, storage medium and electronic equipment
JP7026222B2 (en) Image generation network training and image processing methods, equipment, electronics, and media
CN109255831A (en) The method that single-view face three-dimensional reconstruction and texture based on multi-task learning generate
CN109643383A (en) Domain separates neural network
CN110351511A (en) Video frame rate upconversion system and method based on scene depth estimation
WO2023138062A1 (en) Image processing method and apparatus
Richardt et al. Capture, reconstruction, and representation of the visual real world for virtual reality
Li et al. A lightweight depth estimation network for wide-baseline light fields
CN110264526A (en) A kind of scene depth and camera position posture method for solving based on deep learning
CN116977522A (en) Rendering method and device of three-dimensional model, computer equipment and storage medium
CN113077505A (en) Optimization method of monocular depth estimation network based on contrast learning
CN116524121A (en) Monocular video three-dimensional human body reconstruction method, system, equipment and medium
CN110689514B (en) Training method and computer equipment for new visual angle synthetic model of transparent object
Yuan et al. Presim: A 3d photo-realistic environment simulator for visual ai
CN113298931A (en) Reconstruction method and device of object model, terminal equipment and storage medium
CN115953524B (en) Data processing method, device, computer equipment and storage medium
CN111080754B (en) Character animation production method and device for connecting characteristic points of head and limbs
CN108520532A (en) Identify the method and device of movement direction of object in video
CN112116646A (en) Light field image depth estimation method based on depth convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant