Background
Background blurring is an important technique in photography: a photographer blurs the uninteresting part of an image, that is, the background, in order to highlight a particular region of the image.
In the prior art, background blurring algorithms are either limited to data containing portraits, or train an end-to-end network that takes depth estimation and image saliency segmentation as prior knowledge together with the original image as input data. The disadvantages of both approaches are obvious. The data sets used by the first category of methods do not contain images of general scenes such as natural scenery, so these algorithms generally do not work on images other than portraits. The second category, which only feeds depth estimation and image saliency as prior knowledge into end-to-end neural network training, has poor interpretability and runs slowly.
Disclosure of Invention
Purpose of the invention: one object is to provide a method for fast blurring of a monocular visual image background based on depth perception, so as to solve the above problems in the prior art. A further object is to provide a system implementing the method.
Technical solution: a method for fast blurring of a monocular visual image background based on depth perception comprises the following steps:
step one, establishing a model training data set;
step two, constructing a depth-aware monocular visual image training network;
step three, the trained neural network model receives the picture to be bokeh-rendered and outputs the picture once bokeh rendering is complete.
In a further embodiment, step one further comprises: establishing the picture data set used for model training in step two. To improve learning from data found in real scenes, all pictures in the data set are real-scene pictures. The pictures are stored in pairs, that is, each original picture is stored together with its corresponding picture carrying the bokeh rendering effect. The original pictures serve as input data during model training, and the pictures with the bokeh effect serve as comparison data against which the model output pictures are compared during training.
In a further embodiment, the second step is further:
A depth-aware monocular visual image training network is constructed and trained to form a network model that generates the bokeh effect rendering. Training proceeds as follows: first, the network receives the input picture data $I_{org}$ from the data set established in step one; second, feature maps are extracted by the designed convolutional layers working together with an activation function; then, the weights are learned under a loss function; finally, the preset degree of blur is produced by iterating the blur function, and the final image is output as the weighted sum of the weights and the iterated blur results. This process completes the training and yields a network model that outputs the picture $I_{bokeh}$, where $I_{bokeh}$ is the picture rendered with the bokeh effect. The model for generating the picture with the bokeh rendering effect can be constructed as:

$$I_{bokeh} = \sum_{i} W_i \odot B_i(I_{org})$$

That is, the comparison data in the data set established in step one is regarded as a weighted sum of smoothed versions of the input data. Here $I_{org}$ denotes the original image, $\odot$ denotes element-wise matrix multiplication, $B_i(\cdot)$ denotes the i-th level blur function, and $W_i$ denotes the feature weight matrix of the i-th level. The weights $W_i$ also satisfy the normalization condition of a weighted sum, $\sum_{i} W_i = 1$ at every pixel.

The i-th level blur function $B_i(\cdot)$ is obtained by iterating a shallow blur neural network, written here as $G(\cdot)$, i times:

$$B_i(I_{org}) = \underbrace{G(G(\cdots G(I_{org})\cdots))}_{i\ \text{times}}$$

During training, the loss function $l$ combines a reconstruction term with the structural similarity (SSIM), which tightens the comparison between the output data and the comparison data; back-propagating the error value reduces the difference between the model and the actual data, so that the model is better optimized. The loss $l$ is computed from the reconstruction error between $I_{bokeh}$ and $\hat{I}_{bokeh}$ together with $SSIM(I_{bokeh}, \hat{I}_{bokeh})$, where $I_{bokeh}$ denotes the image with the bokeh effect generated by the model, $\hat{I}_{bokeh}$ denotes the image in the data set that actually carries the bokeh effect, and $SSIM(I_{bokeh}, \hat{I}_{bokeh})$ denotes the structural similarity between the generated image and the actual image:

$$SSIM(I_{bokeh}, \hat{I}_{bokeh}) = [L(I_{bokeh}, \hat{I}_{bokeh})]^{\alpha} \cdot [C(I_{bokeh}, \hat{I}_{bokeh})]^{\beta} \cdot [S(I_{bokeh}, \hat{I}_{bokeh})]^{\gamma}$$

where $\alpha$, $\beta$ and $\gamma$ are preset constants, $L(I_{bokeh}, \hat{I}_{bokeh})$ denotes the luminance relationship between the generated image $I_{bokeh}$ and the actual image $\hat{I}_{bokeh}$, $C(I_{bokeh}, \hat{I}_{bokeh})$ denotes the contrast relationship between them, and $S(I_{bokeh}, \hat{I}_{bokeh})$ denotes the structural relationship between them.
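As an illustration of the weighted-sum model described above, the following is a minimal NumPy sketch, not the trained network of the invention. It assumes a grayscale image, uses a hypothetical 3x3 mean filter `box_blur` as a stand-in for the shallow blur network, and uses hand-written weight maps instead of learned ones; the function names are illustrative only.

```python
import numpy as np

def box_blur(img: np.ndarray) -> np.ndarray:
    """3x3 mean filter with edge padding (stand-in for one blur-network pass)."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros_like(img, dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def blend_blur_levels(levels, weights) -> np.ndarray:
    """Element-wise weighted sum: I_bokeh = sum_i W_i * B_i(I_org)."""
    levels = np.stack(levels)    # shape (n, H, W)
    weights = np.stack(weights)  # shape (n, H, W), assumed to sum to 1 per pixel
    return (weights * levels).sum(axis=0)

# Toy input: a single bright pixel on a dark background.
img = np.zeros((5, 5))
img[2, 2] = 9.0

# Two levels: the unblurred image and one blur pass.
levels = [img, box_blur(img)]

# Hand-written weights: keep the left three columns sharp, blur the rest.
w_sharp = np.zeros((5, 5))
w_sharp[:, :3] = 1.0
w_blur = 1.0 - w_sharp

out = blend_blur_levels(levels, [w_sharp, w_blur])
```

In this toy case the bright pixel lies in the sharp region and is kept unchanged, while the pixel to its right takes its value from the blurred level.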
In a further embodiment, the third step is further:
First, the bokeh effect rendering network model obtained in step two receives the picture whose background is to be blurred. Second, the received picture is resized according to a preset multiple so that its size matches the input size accepted by the depth network in the model; the depth estimation network produces an estimated image, which is up-sampled by deconvolution to obtain the depth map. The depth map is fed into the convolutional layers of the bokeh effect rendering network model, which, together with an activation function, extract the feature maps of the received image and compute their weights; the resulting values are used as the weights of the blur-layer weighted sum. Third, the shallow blur network in the bokeh effect rendering network model blurs the received picture by means of the blur function. Fourth, the weighted sum of the obtained weights and the blur function outputs is computed in the manner defined in the bokeh effect rendering network model to obtain the values of the processed image. Finally, the bokeh effect rendering network model outputs the processed image values, that is, the image rendered with the bokeh effect.
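The step that turns a depth map into weights for the weighted sum can be sketched as follows. This is only an assumption-laden illustration: the text does not name the activation function, so a per-pixel softmax is used here because it makes the weights of all blur levels sum to 1, and the hypothetical `scores_from_depth` replaces the learned convolutional layers with a hand-written rule that maps normalized depth to a preferred blur level.

```python
import numpy as np

def weights_from_scores(scores: np.ndarray) -> np.ndarray:
    """Per-pixel softmax over the blur-level axis (axis 0)."""
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

def scores_from_depth(depth: np.ndarray, n_levels: int) -> np.ndarray:
    """Hypothetical stand-in for the convolutional layers.

    Level i scores highest where the depth (normalized to [0, 1]) is close to
    i / (n_levels - 1), so farther pixels prefer stronger blur levels.
    """
    centers = np.linspace(0.0, 1.0, n_levels).reshape(-1, 1, 1)
    return -50.0 * (depth[None] - centers) ** 2

# Toy depth map with a near, a middle and a far pixel.
depth = np.array([[0.0, 0.5, 1.0]])
W = weights_from_scores(scores_from_depth(depth, 3))
```

Each pixel of `W` holds one weight per blur level; the near pixel favours level 0 (no blur) and the far pixel favours the strongest level.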
A system for fast blurring of a monocular visual image background based on depth perception specifically comprises:
a first module for establishing a training set;
a second module for obtaining a network model with the bokeh effect;
a third module for realizing the bokeh effect.
In a further embodiment, the first module is further to:
establish the picture data set used for model training in the second module. The data set consists of photos that appear in pairs, each pair being a flat (unblurred) monocular visual image and the corresponding image with the bokeh rendering effect. The flat monocular images serve as input data during model training, and the pictures with the bokeh effect serve as comparison data against which the model output pictures are compared during training. To ensure that the bokeh effect is improved effectively across different scenes, the pictures in the set depict real scenes with different subjects.
In a further embodiment, the second module is further to:
The module first establishes a depth estimation network for obtaining a depth map. The network receives the flat monocular visual images from the first module as training data and uses the images with the bokeh rendering effect as comparison data for the images generated by the network, so that the resulting error is back-propagated to optimize the parameters of the depth estimation network. After the depth estimation network has been trained, the depth map it produces is fed into a preset number of convolutional layers with activation functions to obtain feature maps and, from them, the feature map weights; blurred images are obtained by running the blur network for a preset number of iterations. Finally, the feature weights are combined with the blurred images, and the image containing the bokeh effect is obtained by weighted summation, which completes the bokeh effect network model.
The training process of the depth estimation network is as follows: first, the pictures used as input image data in the first module are resized, that is, their size is adjusted according to a preset multiple to obtain pictures that match the network input size; then the processed pictures are fed into the depth estimation network to generate depth estimation images; finally, the image size of the depth estimation images is restored by a deconvolution layer.
The depth map generated by the depth estimation network passes through a set number of convolutions and activation functions to obtain the feature map weights; the blurred images are then obtained by iterating the blur network function a preset number of times, and the image with the bokeh effect is finally obtained by weighted summation.
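The size adjustment and restoration around the depth estimation network can be sketched as below. Two assumptions are made and should be read as such: "preset multiple" is interpreted as rounding each side up to a multiple of a fixed value (32 is a hypothetical choice), and nearest-neighbour resampling stands in for both the resize and the learned deconvolution layer, which the sketch does not implement.

```python
import numpy as np

def target_size(h: int, w: int, multiple: int) -> tuple:
    """Round each side up to the nearest multiple of `multiple`."""
    def round_up(v: int) -> int:
        return ((v + multiple - 1) // multiple) * multiple
    return round_up(h), round_up(w)

def resize_nearest(img: np.ndarray, new_h: int, new_w: int) -> np.ndarray:
    """Nearest-neighbour resize (stand-in for resize and deconvolution)."""
    h, w = img.shape[:2]
    rows = np.arange(new_h) * h // new_h  # source row for each output row
    cols = np.arange(new_w) * w // new_w  # source column for each output column
    return img[rows][:, cols]

# Toy 30x50 picture; 32 is an assumed network stride, not a value from the text.
img = np.arange(30 * 50, dtype=np.float64).reshape(30, 50)
th, tw = target_size(30, 50, 32)
net_in = resize_nearest(img, th, tw)       # size accepted by the depth network
restored = resize_nearest(net_in, 30, 50)  # back to the original picture size
```

Here a real depth network would run between the two resize calls; the sketch only shows that the picture is brought to an accepted size and that the result is returned to the original size.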
The model that finally generates the picture with the bokeh rendering effect can be constructed as:

$$I_{bokeh} = \sum_{i} W_i \odot B_i(I_{org})$$

That is, the comparison data in the data set established by the first module is regarded as a weighted sum of smoothed versions of the input data. Here $I_{bokeh}$ denotes the finally generated image, $I_{org}$ denotes the original image, $\odot$ denotes element-wise matrix multiplication, $B_i(\cdot)$ is the i-th level blur function, and $W_i$ denotes the feature weight matrix of the i-th level.

The i-th level blur function $B_i(\cdot)$ is obtained by iterating a shallow blur neural network, written here as $G(\cdot)$, i times:

$$B_i(I_{org}) = \underbrace{G(G(\cdots G(I_{org})\cdots))}_{i\ \text{times}}$$

During training, the loss function $l$ combines a reconstruction term with the structural similarity (SSIM), which tightens the comparison between the output data and the comparison data and reduces the difference between the model and the actual data, so that the model is better optimized. The loss $l$ is computed from the reconstruction error between $I_{bokeh}$ and $\hat{I}_{bokeh}$ together with $SSIM(I_{bokeh}, \hat{I}_{bokeh})$, where $I_{bokeh}$ denotes the image with the bokeh effect generated by the model, $\hat{I}_{bokeh}$ denotes the image that actually carries the bokeh effect, and $SSIM(I_{bokeh}, \hat{I}_{bokeh})$ denotes the structural similarity between the generated image and the actual image:

$$SSIM(I_{bokeh}, \hat{I}_{bokeh}) = [L(I_{bokeh}, \hat{I}_{bokeh})]^{\alpha} \cdot [C(I_{bokeh}, \hat{I}_{bokeh})]^{\beta} \cdot [S(I_{bokeh}, \hat{I}_{bokeh})]^{\gamma}$$

where $\alpha$, $\beta$ and $\gamma$ are preset constants, $L(I_{bokeh}, \hat{I}_{bokeh})$ denotes the luminance relationship between the generated image $I_{bokeh}$ and the actual image $\hat{I}_{bokeh}$, $C(I_{bokeh}, \hat{I}_{bokeh})$ denotes the contrast relationship between them, and $S(I_{bokeh}, \hat{I}_{bokeh})$ denotes the structural relationship between them.
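The structural similarity described above, a product of luminance, contrast and structure terms raised to preset exponents, can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: statistics are taken over the whole image rather than over local windows, and the stabilizing constants follow the commonly used choice of $(0.01R)^2$ and $(0.03R)^2$ for data range $R$; the text itself does not specify these details.

```python
import numpy as np

def ssim_global(x, y, alpha=1.0, beta=1.0, gamma=1.0, data_range=1.0):
    """Single-window SSIM: luminance**alpha * contrast**beta * structure**gamma.

    Note: the structure term can be negative, so non-integer gamma values
    may yield NaN; the defaults (all exponents 1) avoid that.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    c3 = c2 / 2.0
    mu_x, mu_y = x.mean(), y.mean()
    sd_x, sd_y = x.std(), y.std()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (sd_x ** 2 + sd_y ** 2 + c2)
    structure = (cov + c3) / (sd_x * sd_y + c3)
    return luminance ** alpha * contrast ** beta * structure ** gamma

# Toy images: a gradient and the same gradient flipped vertically.
a = np.linspace(0.0, 1.0, 16).reshape(4, 4)
b = a[::-1]

same = ssim_global(a, a)
flipped = ssim_global(a, b)
```

An image compared with itself scores 1, while the flipped gradient, which has the same brightness and contrast but a different structure, scores lower.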
In a further embodiment, the third module is further to:
The module first uses the bokeh effect rendering network model obtained in the second module to receive the picture whose background is to be blurred, and feeds the picture into the depth estimation network in the model. Second, the depth map produced by the depth estimation network is up-sampled by deconvolution so as to restore the size of the input image data; the depth map is then fed into the convolutional layers of the bokeh effect rendering network model, which, together with an activation function, compute the weights of the feature maps of the received image, and the resulting values are used as the weights of the blur-network weighted sum. Third, the shallow blur network in the bokeh effect rendering network model performs the blurring operation on the received depth image signal of the picture to be blurred. Fourth, the weighted sum of the obtained weights and the blur function outputs is computed in the manner defined in the bokeh effect rendering network model to obtain the values of the processed image. Finally, the bokeh effect rendering network model outputs the processed image values, that is, the image rendered with the bokeh effect.
Advantageous effects: the invention provides a method for fast blurring of a monocular visual image background based on depth perception and a system implementing the method. Compared with other algorithms, the background blurring technique adopted by the invention generalizes better, has a smaller error and adapts to a variety of scenes; it trains an end-to-end neural network on a large amount of data, thereby solving the running-speed problem of background blurring algorithms.
Detailed Description
The applicant observes that a single-lens reflex camera, owing to its large aperture and its many integrated sensors, can easily achieve background blurring by adjusting a few parameters, and that a binocular camera can perceive a certain amount of scene depth while shooting and can produce a background blurring effect through a depth algorithm. However, for devices with a monocular camera and few optical sensors, the problem of obtaining a fast, good background blurring effect directly from an arbitrary image has not been solved effectively.
In the prior art, bokeh rendering methods either concentrate on portrait photos, or train an end-to-end network that takes depth estimation and image saliency segmentation as prior knowledge together with the original image as input data. For portrait photos, a semantic segmentation method is generally used to segment the person from the image, and the remaining area is then blurred. Some approaches combine hardware and software, as shown in fig. 3: a neural network first segments the person in the image and generates a foreground mask for the portrait; a dense depth map is then generated using a sensor with dual-pixel autofocus hardware, and this depth map is used together with the foreground mask for depth-dependent rendering of the shallow depth-of-field image. Other approaches use existing networks for portrait segmentation and single-image depth estimation, and blur the scene outside the portrait to different degrees according to depth. As mentioned above, the data sets of the first category of methods do not contain images of general scenes such as natural scenery, so these algorithms generally do not work on images other than portraits. The second category feeds depth estimation and image saliency as prior knowledge into end-to-end neural network training, which has poor interpretability and runs slowly.
In order to solve the problems in the prior art, the invention provides a monocular visual image background rapid blurring method based on depth perception and a system for realizing the method.
The present invention will be further described in detail with reference to the following examples and accompanying drawings.
In the present application, a method for fast blurring a monocular visual image background based on depth perception and a system for implementing the method are provided, wherein the method for fast blurring a monocular visual image background based on depth perception includes the following steps:
Step one, establishing a model training data set. In this step, the picture data set used for model training in step two is established; to improve learning from data found in real scenes, all pictures in the data set are real-scene pictures. The pictures are stored in pairs, that is, each original picture is stored together with its corresponding picture carrying the bokeh rendering effect. The original pictures serve as input data during model training, and the pictures with the bokeh effect serve as comparison data against which the model output pictures are compared during training.
Step two, constructing a depth-aware monocular visual image training network. This step builds the depth-aware monocular visual image training network and trains it to form a network model that generates the bokeh effect rendering. Training proceeds as follows: first, the network receives the input picture data $I_{org}$ from the data set established in step one; second, features are extracted by the designed convolutional layers; then, the preset blurring effect is produced by iterating the blur function; finally, the weights are learned under the loss function. This process completes the training and yields a network model that outputs the picture $I_{bokeh}$, where $I_{bokeh}$ is the picture rendered with the bokeh effect. The model for generating the picture with the bokeh rendering effect can be constructed as:

$$I_{bokeh} = \sum_{i} W_i \odot B_i(I_{org})$$

That is, the comparison data in the data set established in step one is regarded as a weighted sum of smoothed versions of the input data. Here $I_{org}$ denotes the original image, $\odot$ denotes element-wise matrix multiplication, $B_i(\cdot)$ is the i-th level blur function, and $W_i$ denotes the feature weight matrix of the i-th level.

The i-th level blur function $B_i(\cdot)$ is obtained by iterating a shallow blur neural network, written here as $G(\cdot)$, i times:

$$B_i(I_{org}) = \underbrace{G(G(\cdots G(I_{org})\cdots))}_{i\ \text{times}}$$

During training, the loss function $l$ combines a reconstruction term with the structural similarity (SSIM), which tightens the comparison between the output data and the comparison data and reduces the difference between the model and the actual data, so that the model is better optimized. The loss $l$ is computed from the reconstruction error between $I_{bokeh}$ and $\hat{I}_{bokeh}$ together with $SSIM(I_{bokeh}, \hat{I}_{bokeh})$, where $I_{bokeh}$ denotes the image with the bokeh effect generated by the model, $\hat{I}_{bokeh}$ denotes the image that actually carries the bokeh effect, and $SSIM(I_{bokeh}, \hat{I}_{bokeh})$ denotes the structural similarity between the generated image and the actual image:

$$SSIM(I_{bokeh}, \hat{I}_{bokeh}) = [L(I_{bokeh}, \hat{I}_{bokeh})]^{\alpha} \cdot [C(I_{bokeh}, \hat{I}_{bokeh})]^{\beta} \cdot [S(I_{bokeh}, \hat{I}_{bokeh})]^{\gamma}$$

where $\alpha$, $\beta$ and $\gamma$ are preset constants, $L(I_{bokeh}, \hat{I}_{bokeh})$ denotes the luminance relationship between the generated image $I_{bokeh}$ and the actual image $\hat{I}_{bokeh}$, $C(I_{bokeh}, \hat{I}_{bokeh})$ denotes the contrast relationship between them, and $S(I_{bokeh}, \hat{I}_{bokeh})$ denotes the structural relationship between them.
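One possible way to combine a reconstruction term with SSIM into a single training loss is sketched below. The exact combination used by the invention is not recoverable from this text, so the form chosen here, mean absolute error plus `ssim_weight * (1 - SSIM)`, is purely an assumption; the SSIM used is the single-window form with all three exponents equal to 1, and `ssim_weight` is a hypothetical parameter.

```python
import numpy as np

def l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Reconstruction term: mean absolute error."""
    return float(np.abs(pred - target).mean())

def ssim_simple(x: np.ndarray, y: np.ndarray, data_range: float = 1.0) -> float:
    """Single-window SSIM with alpha = beta = gamma = 1 (combined form)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(num / den)

def bokeh_loss(pred, target, ssim_weight: float = 1.0) -> float:
    """Assumed combination: L1 + ssim_weight * (1 - SSIM)."""
    return l1_loss(pred, target) + ssim_weight * (1.0 - ssim_simple(pred, target))

# Toy check: a perfect prediction versus a uniformly brightened one.
target = np.linspace(0.0, 0.8, 16).reshape(4, 4)
perfect = bokeh_loss(target, target)
brighter = bokeh_loss(target + 0.1, target)
```

With this form the loss is zero for a perfect prediction and grows both with pixel-wise error and with loss of structural similarity.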
Step three, the trained neural network model receives the picture to be bokeh-rendered and outputs the picture once bokeh rendering is complete. First, the bokeh effect rendering network model obtained in step two receives the picture whose background is to be blurred. Second, the convolutional layers and activation function in the bokeh effect rendering network model extract the feature maps of the received image and compute their weights, and the resulting values are used as the weights of the blur-layer weighted sum. Third, the shallow blur network in the bokeh effect rendering network model performs the blurring operation on the received picture. Fourth, the weighted sum of the obtained weights and the blur function outputs is computed in the manner defined in the bokeh effect rendering network model to obtain the values of the processed image. Finally, the bokeh effect rendering network model outputs the processed image values, that is, the image rendered with the bokeh effect.
Based on the method, a system implementing the method can be constructed, which specifically comprises the following modules:
A first module for establishing a training set. The module establishes the picture data set used for model training in the second module. The data set consists of photos that appear in pairs, each pair being a flat (unblurred) monocular visual image and the corresponding image with the bokeh rendering effect. The flat monocular images serve as input data during model training, and the pictures with the bokeh effect serve as comparison data against which the model output pictures are compared during training. To ensure that the bokeh effect is improved effectively across different scenes, the pictures in the set depict real scenes with different subjects.
A second module for obtaining a network model with the bokeh effect. The module first establishes a depth estimation network for obtaining a depth map. The network receives the flat monocular visual images from the first module as training data and uses the images with the bokeh rendering effect as comparison data for the images generated by the network, so that the resulting error is back-propagated to optimize the parameters of the depth estimation network. After the depth estimation network has been trained, the depth map it produces is fed into a preset number of convolutional layers with activation functions to obtain feature maps and, from them, the feature map weights; blurred images are obtained by running the blur network for a preset number of iterations. Finally, the feature weights are combined with the blurred images, and the image containing the bokeh effect is obtained by weighted summation, which completes the bokeh effect network model.
The training process of the depth estimation network is as follows: first, the pictures used as input images in the first module are resized, that is, their size is adjusted according to a set multiple to obtain pictures that match the network input size; then the processed pictures are fed into the depth estimation network to generate depth estimation images; finally, the image size of the depth estimation images is restored by a deconvolution layer.
The depth map generated by the depth estimation network passes through a set number of convolutions and activation functions to obtain the feature map weights; the blurred images are then obtained by iterating the blur network function a preset number of times, and the image with the bokeh effect is finally obtained by weighted summation.
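The idea of obtaining progressively blurred images by iterating one blur operation a preset number of times can be sketched as follows. The five-point average `blur_once` is a hypothetical stand-in for the shallow blur network, whose architecture is not given in this text, and level 0 is taken to be the unblurred image, which is also an assumption.

```python
import numpy as np

def blur_once(img: np.ndarray) -> np.ndarray:
    """Five-point average (centre plus four neighbours, edge-padded)."""
    p = np.pad(img, 1, mode="edge")
    return (p[1:-1, 1:-1] + p[:-2, 1:-1] + p[2:, 1:-1]
            + p[1:-1, :-2] + p[1:-1, 2:]) / 5.0

def blur_levels(img: np.ndarray, n_levels: int) -> list:
    """Return [B_0(I), ..., B_{n_levels-1}(I)], where B_i applies blur_once i times."""
    levels = [np.asarray(img, dtype=np.float64)]
    for _ in range(1, n_levels):
        # Reuse the previous level so level i costs one extra pass, not i passes.
        levels.append(blur_once(levels[-1]))
    return levels

# Toy input: a unit impulse in the middle of a 7x7 image.
img = np.zeros((7, 7))
img[3, 3] = 1.0
levels = blur_levels(img, 4)
```

Because each level is computed from the previous one, a stack of n levels needs only n - 1 blur passes, which is one reason an iterated shallow blur can be fast.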
The model that finally generates the picture with the bokeh rendering effect can be constructed as:

$$I_{bokeh} = \sum_{i} W_i \odot B_i(I_{org})$$

That is, the comparison data in the data set established by the first module is regarded as a weighted sum of smoothed versions of the input data. Here $I_{bokeh}$ denotes the finally generated image, $I_{org}$ denotes the original image, $\odot$ denotes element-wise matrix multiplication, $B_i(\cdot)$ is the i-th level blur function, and $W_i$ denotes the feature weight matrix of the i-th level.

The i-th level blur function $B_i(\cdot)$ is obtained by iterating a shallow blur neural network, written here as $G(\cdot)$, i times:

$$B_i(I_{org}) = \underbrace{G(G(\cdots G(I_{org})\cdots))}_{i\ \text{times}}$$

During training, the loss function $l$ combines a reconstruction term with the structural similarity (SSIM), which tightens the comparison between the output data and the comparison data and reduces the difference between the model and the actual data, so that the model is better optimized. The loss $l$ is computed from the reconstruction error between $I_{bokeh}$ and $\hat{I}_{bokeh}$ together with $SSIM(I_{bokeh}, \hat{I}_{bokeh})$, where $I_{bokeh}$ denotes the image with the bokeh effect generated by the model, $\hat{I}_{bokeh}$ denotes the image that actually carries the bokeh effect, and $SSIM(I_{bokeh}, \hat{I}_{bokeh})$ denotes the structural similarity between the generated image and the actual image:

$$SSIM(I_{bokeh}, \hat{I}_{bokeh}) = [L(I_{bokeh}, \hat{I}_{bokeh})]^{\alpha} \cdot [C(I_{bokeh}, \hat{I}_{bokeh})]^{\beta} \cdot [S(I_{bokeh}, \hat{I}_{bokeh})]^{\gamma}$$

where $\alpha$, $\beta$ and $\gamma$ are preset constants, $L(I_{bokeh}, \hat{I}_{bokeh})$ denotes the luminance relationship between the generated image $I_{bokeh}$ and the actual image $\hat{I}_{bokeh}$, $C(I_{bokeh}, \hat{I}_{bokeh})$ denotes the contrast relationship between them, and $S(I_{bokeh}, \hat{I}_{bokeh})$ denotes the structural relationship between them.
A third module for realizing the bokeh effect. The module first uses the bokeh effect rendering network model obtained in the second module to receive the picture whose background is to be blurred, and feeds the picture into the depth estimation network in the model. Second, the depth map produced by the depth estimation network is up-sampled by deconvolution so as to restore the size of the input image data; the depth map is then fed into the convolutional layers of the bokeh effect rendering network model, which, together with an activation function, compute the weights of the feature maps of the received image, and the resulting values are used as the weights of the blur-network weighted sum. Third, the shallow blur network in the bokeh effect rendering network model performs the blurring operation on the received depth image signal of the picture to be blurred. Fourth, the weighted sum of the obtained weights and the blur function outputs is computed in the manner defined in the bokeh effect rendering network model to obtain the values of the processed image. Finally, the bokeh effect rendering network model outputs the processed image values, that is, the image rendered with the bokeh effect.
The invention can be applied to image blurring on a PC, to automatic blurring when shooting on a mobile device, and to similar uses in which the background of a picture is to be blurred. Figs. 6 to 8 show the results of background blurring of photos according to the invention. The original photo is on the left and the processed photo with the bokeh effect is on the right; as the result images show, the sharpness of the main subject is maintained while the background is blurred.
As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.