CN115100337A - Whole body portrait video relighting method and device based on convolutional neural network - Google Patents


Info

Publication number
CN115100337A
Authority
CN
China
Prior art keywords: image, convolutional neural, video, frames, neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210612418.2A
Other languages
Chinese (zh)
Inventor
黄海
朱玥琰
陈洪
李琳
徐嵩
穆俊生
陈傲然
于华妍
张舒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
MIGU Culture Technology Co Ltd
Original Assignee
Beijing University of Posts and Telecommunications
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications and MIGU Culture Technology Co Ltd
Priority to CN202210612418.2A
Publication of CN115100337A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/005 General purpose rendering architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/50 Lighting effects
    • G06T 15/506 Illumination models

Abstract

The application discloses a whole body portrait video relighting method and device based on a convolutional neural network, wherein the method comprises the following steps: acquiring a video image to be processed, wherein the video to be processed comprises a whole-body portrait video image; inputting a plurality of image frames of the video image to be processed and a target lighting scene into a pre-trained image processing model to obtain a rendered image frame sequence, wherein the image processing model is used for rendering the image frames and the target lighting scene into rendered image frames under the target lighting scene and performing time consistency processing on the rendered image frames; and synthesizing the rendered image frame sequence into a relighting video image. The application can realize whole-body portrait relighting and improve the relighting effect.

Description

Whole body portrait video relighting method and device based on convolutional neural network
Technical Field
The application relates to the field of image processing, in particular to a whole body portrait video relighting method and device based on a convolutional neural network.
Background
With the rise of digital photography technology, demands on digital image processing keep growing. Manual image editing involves a heavy workload and demands considerable professional skill from users, which has drawn the industry's attention to technologies such as automatic image enhancement. Owing to the human eye's sensitivity to light, relighting is becoming one of the most important frontier technologies. Realistic relighting provides an immersive visual effect for augmented reality, virtual reality and digital special effects, and is widely applied in the multimedia technology industry. Especially in stage performance scenes, complex and variable lighting settings are often required to support the performance effect, and how to achieve temporally consistent relighting video results under challenging dynamic lighting conditions remains a difficult problem to solve.
There are many methods in the prior art that are applied in the field of video and image relighting:
the traditional re-illumination method needs to densely sample illumination through a multi-view camera, quantize the illumination through methods such as time differentiation and integration, and then re-map the illumination onto an image to be illuminated. The method does not need strong hardware support, the calculated amount is not negligible, and the algorithm efficiency and the generated picture quality have larger promotion space; in the current stage, the end-to-end re-illumination operation is realized based on a deep learning method, a normal map and an albedo image are predicted through a neural network, and finally, the synthetic rendering of the re-illumination image is executed. Google proposes a method for simulating complex light transmission by constructing an implicit reflection model, however, the albedo image prediction of human body clothing is not accurate enough, and a definite time consistency modeling is not performed, which can cause a certain time instability; the university of qinghua proposes embedding facial albedo, geometry, and specular reflection and shading by explicitly modeling multiple reflection channels, but does not take into account the complex light transmission effects of global illumination, sub-surface scattering, etc. In addition, the method only performs experiments on the facial relighting, and the whole body relighting is still a difficult point to be solved; a video portrait relighting scheme is provided by Shanghai science and technology university, real-time face relighting is realized through antagonism training, but due to the lack of facial geometric information, certain artifacts and false faces exist in a relighting result, and the sense of reality is insufficient.
In summary, the relighting methods of the prior art all have limitations and cannot achieve consistent relighting video results under challenging dynamic lighting conditions.
Summary of the application
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, the application aims to achieve a temporally consistent relighting video effect under dynamic lighting conditions, and provides a whole body portrait video relighting method based on a convolutional neural network.
Another objective of the present application is to provide a convolutional neural network-based whole-body portrait video relighting device.
According to a first aspect of the embodiments of the present application, a method for relighting a whole-body portrait video based on a convolutional neural network is provided, which includes the following steps:
acquiring a video image to be processed, wherein the video to be processed comprises a whole-body portrait video image;
inputting a plurality of image frames of the video image to be processed and a target lighting scene into a pre-trained image processing model to obtain a rendered image frame sequence, wherein the image processing model is used for rendering the image frames and the target lighting scene into the rendered image frames under the target lighting scene and performing time consistency processing on the rendered image frames;
synthesizing the sequence of rendered image frames into a relighting video image.
In some possible embodiments, the pre-trained image processing model includes a first convolutional neural network and a second convolutional neural network, and the inputting the plurality of image frames of the video image to be processed and the target illumination scene into the pre-trained image processing model to obtain the sequence of rendered image frames includes:
inputting each image frame in the image frames into the first convolution neural network frame by frame to perform light removal processing, and obtaining a portrait albedo image and a normal map under a standard lighting scene;
simultaneously inputting the portrait albedo image and the normal map of a plurality of adjacent image frames and the target illumination environment map into a plurality of second convolutional neural networks to obtain a plurality of synthesized relighting frames, wherein the second convolutional neural networks encode time consistency through inter-frame attention;
synthesizing a background image in the image frames and the plurality of relighting frames to generate the sequence of rendered image frames.
In some possible embodiments, before inputting each of the image frames frame by frame into the first convolutional neural network for de-illumination processing, the method further comprises:
acquiring training data, wherein the training data comprises video image data under different lighting scenes;
respectively inputting video image data under two different lighting scenes into a first initial convolutional neural network which is constructed in advance, and respectively obtaining albedo images and normal maps which correspond to the video image data under the two different lighting scenes;
calculating Euclidean space distances of albedo images and normal maps corresponding to the video image data under the two different lighting scenes, and a characteristic space distance of a convolutional layer characteristic map of the first initial convolutional neural network;
and (3) taking the Euclidean spatial distance of the albedo image and the normal map and the characteristic spatial distance together as a loss function to train the network, so as to obtain the trained first convolutional neural network.
In some possible embodiments, before simultaneously inputting the portrait albedo images and normal maps of the adjacent image frames and the target illumination environment map into the second convolutional neural networks to obtain the plurality of synthesized relighting frames, the method further includes:
performing time consistency coding on a plurality of second initial convolutional neural networks which are constructed in advance according to the inter-frame attention mechanism;
simultaneously and respectively inputting the portrait albedo images and normal maps of the plurality of adjacent image frames corresponding to any one of the two lighting scenes into the plurality of second initial convolutional neural networks, and outputting the plurality of relighting frames;
and training the plurality of second initial convolutional neural networks based on the plurality of relighting frames and the target illumination environment maps of the plurality of adjacent image frames to obtain a plurality of trained second convolutional neural networks.
According to a second aspect of the embodiments of the present application, there is provided a convolutional neural network-based whole-body portrait video relighting apparatus, including:
the device comprises a first acquisition module, a second acquisition module and a processing module, wherein the first acquisition module is used for acquiring a video image to be processed, and the video to be processed comprises a whole-body portrait video image;
the rendering module is used for inputting a plurality of image frames of the video image to be processed and a target lighting scene into a pre-trained image processing model to obtain a rendering image frame sequence, wherein the image processing model is used for rendering the image frames and the target lighting scene into the rendering image frames under the target lighting scene and performing time consistency processing on the rendering image frames;
a composition module for composing the sequence of rendered image frames into a relighting video image.
In some possible embodiments, the rendering module includes:
the first input unit is used for inputting each image frame in the image frames into the first convolution neural network frame by frame to perform light removal processing, so as to obtain a portrait albedo image and a normal map under a standard lighting scene;
the second input unit is used for simultaneously inputting the portrait albedo images and normal maps of a plurality of adjacent image frames and the target illumination environment map into a plurality of second convolutional neural networks to obtain a plurality of synthesized relighting frames, wherein the second convolutional neural networks encode time consistency through an inter-frame attention mechanism;
a synthesizing unit, configured to synthesize a background image in the image frames and the plurality of relighting frames, and generate the rendered image frame sequence.
In some possible embodiments, the apparatus further comprises:
the second acquisition module is used for acquiring training data, wherein the training data comprises video image data under different lighting scenes;
the first input module is used for respectively inputting video image data under two different lighting scenes into a first initial convolutional neural network which is constructed in advance to respectively obtain albedo images and normal maps corresponding to the video image data under the two different lighting scenes;
the calculation module is used for calculating Euclidean space distances of the albedo image and the normal map corresponding to the video image data under the two different lighting scenes and a feature space distance of the convolutional layer feature map of the first initial convolutional neural network;
and the first training module is used for training the network by taking the Euclidean spatial distance of the albedo image and the normal map and the characteristic spatial distance together as a loss function to obtain the trained first convolution neural network.
In some possible embodiments, the apparatus further comprises:
the coding module is used for carrying out time consistency coding on a plurality of second initial convolutional neural networks which are constructed in advance according to the inter-frame attention mechanism;
a second input module, configured to simultaneously and respectively input the portrait albedo images and normal maps of the multiple adjacent image frames corresponding to any one of the two lighting scenes into the multiple second initial convolutional neural networks, and output the multiple relighting frames;
and the second training module is used for training the plurality of second initial convolutional neural networks based on the plurality of relighting frames and the target illumination environment maps of the plurality of adjacent image frames to obtain the plurality of trained second convolutional neural networks.
According to a third aspect of embodiments herein, there is provided an electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the convolutional neural network-based whole-body portrait video relighting method according to any one of the first aspect.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium, in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the convolutional neural network-based whole-body portrait video relighting method as defined in any one of the first aspects.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the convolutional neural network-based whole-body portrait video relighting method as defined in any one of the first aspects.
The beneficial effect of this application:
according to the whole body portrait video relighting method based on the convolutional neural network, a to-be-processed video image is obtained, wherein the to-be-processed video comprises the whole body portrait video image, a plurality of image frames and a target lighting scene of the to-be-processed video image are input into a pre-trained image processing model, a rendered image frame sequence is obtained, the image processing model is used for rendering the image frames and the target lighting scene into rendered image frames under the target lighting scene, time consistency processing is conducted on the rendered image frames, and the rendered image frame sequence is synthesized into the relighting video image. The application can realize the relighting of the portrait of the whole body and improve the relighting effect.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of a convolutional neural network-based whole-body portrait video relighting method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a training process of a first convolutional neural network according to an embodiment of the present application;
FIG. 3 is a diagram illustrating a training process of a second convolutional neural network according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a convolutional neural network-based whole-body portrait video relighting device according to an embodiment of the present application;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments, not all embodiments, of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Interpretation of the related terms:
(1) image (video) generation: image (video) generation refers to generating a target image (video) from an input vector. The input vector may be random noise or a user-specified condition vector for directing the machine learning algorithm to output a satisfactory image (video). The specific application scenes comprise handwritten digit generation, two-dimensional face synthesis, style migration, image restoration, image enhancement and the like.
(2) Image (video) enhancement: in the process of acquiring the image (video), due to the influence of various factors, the quality of the image (video) is degraded, such as low brightness, strong noise, poor color, lack of detail information and the like. Image (video) enhancement refers to improving the visual effect of an image (video) through a series of techniques, and converting the image (video) into a form more suitable for computer or human analysis and application. Typical methods are histogram equalization, Retinex algorithm, defogging algorithm, gamma correction, etc.
(3) Image (video) relighting: the image (video) re-illumination task belongs to an image (video) enhancement task, which means that given input images and ambient light illumination, an algorithm regularly changes the brightness of pixel values of the input images according to illumination characteristics, performs detail processing such as highlight, shadow, surface scattering and the like, and outputs a new rendered image under a corresponding illumination effect.
(4) Photorealistic rendering: photorealistic rendering refers to generating, in a computer, realistic graphical images of three-dimensional scenes, such that the human eye cannot distinguish the generated image from a photograph of the same scene taken by a camera. The three main criteria for realism are photorealism, physical correctness, and high performance. The main photorealistic rendering algorithms include the scanline algorithm, the ray tracing algorithm, and the radiosity algorithm.
(5) Time consistency modeling: temporal consistency between video frames is one of the basic characteristics of video information. At present, many video processing algorithms edit frame by frame without fully accounting for the continuity between frames, and the generated video often exhibits slight unnatural flicker in parts of the picture during playback, which is called temporal inconsistency. Considering that strong continuity exists between consecutive frames, the algorithm can be designed around this continuity, which is called time consistency modeling.
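As a concrete illustration of the flicker that time consistency modeling targets (the following sketch is illustrative and not part of the patent), a crude inconsistency score can be computed from consecutive-frame differences; real evaluations usually warp frames by optical flow first, which this sketch omits:

```python
import torch

def flicker_score(frames):
    """Crude temporal-inconsistency measure: mean squared difference between
    consecutive frames. Frame-by-frame edits that ignore continuity inflate
    this score; temporally consistent output keeps it close to the motion
    floor of the original video."""
    diffs = [((a - b) ** 2).mean() for a, b in zip(frames, frames[1:])]
    return torch.stack(diffs).mean()
```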
The method and the device for whole-body portrait video relighting based on the convolutional neural network proposed by the embodiment of the present application are described below with reference to the accompanying drawings, and first, the method for whole-body portrait video relighting based on the convolutional neural network proposed by the embodiment of the present application will be described with reference to the accompanying drawings.
Fig. 1 is a flowchart of a convolutional neural network-based whole-body portrait video relighting method according to an embodiment of the present application.
As shown in fig. 1, the method for relighting the whole-body portrait video based on the convolutional neural network comprises the following steps:
and step S110, acquiring a video image to be processed.
The video to be processed comprises a whole-body portrait video image, and the video to be processed can be a video image needing to be relighted.
In the embodiment of the present application, the video image to be processed may be a whole-body portrait video image, and "to be processed" means that the whole-body portrait video image is to be relighted.
It should be noted that the form of the video image to be processed may be a video segment, or may be a video frame image decomposed according to a video segment.
Step S120, inputting a plurality of image frames of the video image to be processed and the target lighting scene into a pre-trained image processing model to obtain a rendering image frame sequence.
The image processing model is used for rendering the image frame and the target illumination scene into a rendered image frame under the target illumination scene, and performing time consistency processing on the rendered image frame, wherein the target illumination scene can be a relighting scene which needs to be achieved by processing the video image to be processed.
In the embodiment of the present application, after the video to be processed is obtained, a plurality of image frames of the video image to be processed and the target lighting scene may be input to a pre-trained image processing model, so as to obtain a sequence of rendered image frames. That is to say, the image processing model is trained in advance, and relighting processing based on time consistency can be performed on a plurality of image frames of the video image to be processed according to the target lighting scene, so as to obtain a rendered image frame sequence after relighting processing corresponding to the plurality of image frames of the video image to be processed.
Step S130, the rendered image frame sequence is synthesized into a relighting video image.
In the embodiment of the present application, after a plurality of image frames of a video image to be processed and a target illumination scene are input to a pre-trained image processing model to obtain a sequence of rendered image frames, the sequence of rendered image frames may be synthesized into a relighting video image. That is, after the sequence of rendered image frames is obtained, the image relighting process has been completed, and the sequence of relighting-processed image frames can be synthesized into a relighting video image in the sequence order.
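To make step S130 concrete (a sketch under our own assumptions; the patent does not specify frame format, codec or frame rate), the rendered frame sequence can be written out in order with OpenCV:

```python
import cv2

def frames_to_video(rendered_frames, out_path="relit.mp4", fps=25):
    """Synthesize a relighting video from an ordered sequence of rendered
    frames. `rendered_frames` is assumed to be a list of HxWx3 uint8 BGR
    arrays; the mp4v codec and 25 fps are illustrative choices."""
    h, w = rendered_frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in rendered_frames:
        writer.write(frame)
    writer.release()
```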
In some possible embodiments, the pre-trained image processing model includes a first convolutional neural network and a second convolutional neural network, and the inputting the plurality of image frames of the video image to be processed and the target lighting scene into the pre-trained image processing model to obtain the sequence of rendered image frames includes:
inputting each image frame in the image frames into a first convolution neural network frame by frame for light removal processing to obtain a portrait albedo image and a normal map under a standard lighting scene;
simultaneously inputting the portrait albedo images and normal maps of a plurality of adjacent image frames, together with the target illumination environment map, into a plurality of second convolutional neural networks to obtain a plurality of synthesized relighting frames, wherein the second convolutional neural networks encode time consistency through an inter-frame attention mechanism;
a background image in the image frames and the relighting frames are combined to generate the sequence of rendered image frames.
The first convolutional neural network can be an image de-illumination processing network and is used for processing an input image to obtain a portrait albedo image and a normal map under a standard illumination scene, and the second convolutional neural network can be an image relighting processing network and is used for relighting the portrait albedo image and the normal map under the input standard illumination scene according to a target illumination environment and in a mode of a plurality of adjacent image frames to obtain a plurality of synthesized relighting frames.
In the embodiment of the application, after the video image to be processed is obtained, each image frame can be input into the first convolutional neural network frame by frame for light removal processing to obtain a portrait albedo image and a normal map under a standard lighting scene; then the portrait albedo images and normal maps of a plurality of adjacent image frames, together with the target illumination environment map, can be simultaneously input into a plurality of second convolutional neural networks to obtain a plurality of synthesized relighting frames, wherein the second convolutional neural networks encode time consistency through an inter-frame attention mechanism; and then the background image in the image frames and the relighting frames can be synthesized to generate the rendered image frame sequence. That is to say, the image processing model comprises the first convolutional neural network and the second convolutional neural network: the first convolutional neural network performs light removal processing on each image frame, frame by frame, to obtain a portrait albedo image and a normal map under a standard lighting scene, and the second convolutional neural network performs time-consistency-based relighting on the portrait albedo image and the normal map under the standard lighting scene according to the target lighting environment to obtain the relighting frames.
It is noted that the standard lighting environment may be an average brightness white light lighting environment.
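To summarize the two-stage flow above in code, the following inference sketch is illustrative only: `delight_net`, `relight_net`, the portrait masks and all tensor shapes are our assumptions, not names or values from the patent:

```python
import torch

def relight_sequence(frames, masks, target_env, delight_net, relight_net):
    """Hypothetical two-stage pipeline: de-light each frame, jointly relight
    a window of k adjacent frames, then composite the relit foreground over
    each frame's background using a portrait mask.

    frames: list of k tensors, 3xHxW; masks: list of k tensors, 1xHxW in
    [0, 1]; target_env: target illumination environment map tensor.
    """
    with torch.no_grad():
        # stage 1: frame-by-frame de-lighting into albedo + normal maps
        # under the standard lighting scene
        pairs = [delight_net(f.unsqueeze(0)) for f in frames]
        albedos = torch.cat([a for a, _ in pairs])  # k x 3 x H x W
        normals = torch.cat([n for _, n in pairs])  # k x 3 x H x W
        # stage 2: the k adjacent frames are relit jointly so inter-frame
        # attention can enforce temporal consistency across the window
        relit = relight_net(albedos, normals, target_env)
    rendered = []
    for frame, mask, fg in zip(frames, masks, relit):
        rendered.append(mask * fg + (1 - mask) * frame)  # keep background
    return rendered
```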
In some possible embodiments, before inputting each of the image frames frame by frame to the first convolutional neural network for the de-illumination processing, the method further includes:
acquiring training data, wherein the training data comprises video image data under different lighting scenes;
respectively inputting video image data under two different lighting scenes into a first initial convolutional neural network which is constructed in advance, and respectively obtaining albedo images and normal maps which correspond to the video image data under the two different lighting scenes;
calculating Euclidean space distances of albedo images and normal maps corresponding to video image data under two different lighting scenes and a characteristic space distance of a convolutional layer characteristic map of a first initial convolutional neural network;
and (3) taking the Euclidean spatial distance and the characteristic spatial distance of the albedo image and the normal map together as a loss function to train the network, so as to obtain a trained first convolutional neural network.
In the embodiment of the present application, before each image frame in the image frames is input to the first convolutional neural network frame by frame for the light removal processing, training of the first convolutional neural network may be performed:
training data can be obtained by building a lighting stage and adjusting the brightness and the color of light, so that complex and changeable ambient lighting including white light and colored light under a stage performance scene is simulated. Through a monocular RGB-D (Red Green Blue-Depth) camera, whole body portrait static videos of different volunteers in a standard lighting environment are shot, the video length can be set according to needs, for example, the video length can be 3-5 seconds, a color albedo image without lamplight interference and Depth volume data are generated, Depth information is mapped into a normal map, lamplight setting can be randomly adjusted again, and the whole body portrait static videos in a non-standard lighting environment are obtained.
The whole-body portrait static videos under the standard lighting environment are used as supervision, and the whole-body portrait static videos under non-standard lighting environments are used to train the first convolutional neural network. For the whole-body portrait static videos of the same volunteer under non-standard lighting environments, two groups of videos with different lighting environments are randomly selected and decomposed into image sequences, the two groups of image sequences being denoted frame A and frame B. Two first initial convolutional neural networks with the same structure are constructed, denoted network A and network B; frame A is input into network A and frame B into network B, obtaining the albedo image and normal map corresponding to frame A and to frame B respectively.
The Euclidean spatial distances between the albedo images and between the normal maps corresponding to frame A and frame B, and the feature spatial distances between the convolutional layer feature maps of network A and network B, are calculated; taking these Euclidean spatial distances and feature spatial distances as the loss function, the first initial convolutional neural network is iterated to optimize its parameters, obtaining the first convolutional neural network.
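One reading of this loss, sketched in PyTorch (the relative weighting is our assumption; the patent gives no numeric values):

```python
import torch.nn.functional as F

def delight_loss(albedo_a, normal_a, feats_a, albedo_b, normal_b, feats_b):
    """Training loss sketch for the de-lighting network: the albedo and
    normal predictions of network A and network B (the same person under two
    different lightings) should agree in Euclidean space, and their
    convolutional feature maps should agree in feature space."""
    l_albedo = F.mse_loss(albedo_a, albedo_b)   # Euclidean distance, albedo
    l_normal = F.mse_loss(normal_a, normal_b)   # Euclidean distance, normals
    l_feat = sum(F.mse_loss(fa, fb) for fa, fb in zip(feats_a, feats_b))
    return l_albedo + l_normal + 0.1 * l_feat   # 0.1 is an assumed weight
```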
It should be noted that the first convolutional neural network serves as the de-illumination network and can remove the influence of random complex lighting on the color and shadows of the portrait. The encoder of the first convolutional neural network consists of a series of downsampling convolutional layers that encode the input image frame into a 1 x 1 feature vector representation. The decoder of the first convolutional neural network upsamples the feature vector through a series of transposed convolutional layers, restoring the feature map layer by layer to the size of the input image frame, and finally outputs the albedo image and the normal map. Skip connections are made between the encoder and the decoder.
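A minimal PyTorch sketch of such an encoder-decoder follows. Only the overall shape comes from the description above (downsampling encoder, 1 x 1 bottleneck, transposed-convolution decoder, skip connections, albedo and normal outputs); the channel widths, depth and 256 x 256 input size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DelightNet(nn.Module):
    """U-Net-style de-lighting network sketch: five stride-2 encoder stages
    reduce a 3 x 256 x 256 frame to 8 x 8, a bottleneck convolution squeezes
    it to a 1 x 1 feature vector, and a transposed-convolution decoder with
    skip connections restores the input resolution."""

    def __init__(self, widths=(32, 64, 128, 256, 512)):
        super().__init__()
        self.enc, self.dec = nn.ModuleList(), nn.ModuleList()
        c_in = 3
        for c in widths:
            self.enc.append(nn.Sequential(
                nn.Conv2d(c_in, c, 4, stride=2, padding=1), nn.LeakyReLU(0.2)))
            c_in = c
        self.bottleneck = nn.Conv2d(c_in, 512, 8)         # 8x8 -> 1x1 vector
        self.up = nn.ConvTranspose2d(512, widths[-1], 8)  # 1x1 -> 8x8
        outs = list(reversed(widths))[1:] + [32]
        for c_skip, c_out in zip(reversed(widths), outs):
            self.dec.append(nn.Sequential(
                nn.ConvTranspose2d(c_skip * 2, c_out, 4, stride=2, padding=1),
                nn.ReLU()))
        self.head = nn.Conv2d(32, 6, 3, padding=1)  # 3 albedo + 3 normal channels

    def forward(self, x):                           # x: B x 3 x 256 x 256
        skips = []
        for layer in self.enc:
            x = layer(x)
            skips.append(x)
        x = self.up(self.bottleneck(x))             # through the 1x1 bottleneck
        for layer, skip in zip(self.dec, reversed(skips)):
            x = layer(torch.cat([x, skip], dim=1))  # skip connection
        albedo, normal = self.head(x).chunk(2, dim=1)
        return torch.sigmoid(albedo), torch.tanh(normal)
```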
In some possible embodiments, before simultaneously inputting the portrait albedo images and normal maps of the adjacent image frames and the target illumination environment map into the plurality of second convolutional neural networks to obtain the plurality of synthesized relighting frames, the method further includes:
carrying out time consistency coding on a plurality of second initial convolutional neural networks which are constructed in advance according to an interframe attention mechanism;
simultaneously and respectively inputting the portrait albedo images and normal maps of a plurality of adjacent image frames corresponding to either of the two lighting scenes into the plurality of second initial convolutional neural networks, and outputting a plurality of relighting frames;
and training the plurality of second initial convolutional neural networks based on the plurality of relighting frames and the target illumination environment maps of the plurality of adjacent image frames to obtain a plurality of trained second convolutional neural networks.
In this embodiment of the present application, after obtaining the first convolutional neural network, training of the second convolutional neural network may be performed on the basis of the first convolutional neural network:
A plurality of second initial convolutional neural networks are constructed and, as shown in fig. 3, time consistency coding is performed on them according to an inter-frame attention mechanism. The plurality of second initial convolutional neural networks can be denoted network 1, network 2, ..., network k, where k is a positive integer greater than 1. For the same volunteer, the portrait albedo images and normal maps of a plurality of adjacent image frames corresponding to either of the two lighting scenes used in training the first convolutional neural network are simultaneously input into the plurality of second initial convolutional neural networks; correspondingly, the plurality of image frames can be denoted frame 1, frame 2, ..., frame k. Relighting frames corresponding to the k image frames are obtained, the target lighting environment maps of the k adjacent image frames are respectively used as supervision for the k networks, and iterative parameter optimization is performed on the second initial convolutional neural networks according to the relighting frames output for the k image frames, obtaining k second convolutional neural networks.
It should be noted that the second convolutional neural network serves as the synthesis network: it outputs the relighting video frames under the target lighting environment and cooperatively processes color, shadow and highlights to synthesize a highly realistic relighting effect. The encoder of the second convolutional neural network consists of a series of downsampling convolutional layers that encode the input albedo image and normal map into a 1 x 1 feature vector called the bottleneck layer; the bottleneck layer of each of the k second initial convolutional neural networks is encoded as a 1 x 1 x 128 vector, and the k bottleneck vectors are jointly encoded with a temporal attention mechanism so as to capture illumination dependencies among different frames and perform time consistency modeling. The decoder of the second convolutional neural network upsamples the feature vector through a series of transposed convolutional layers, restoring the features layer by layer to the size of the input image frame, and finally outputs the synthesized relighting image. Skip connections are made between the encoder and the decoder.
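A sketch of the bottleneck-level inter-frame attention described above; using standard multi-head self-attention with a residual connection is our assumption, since the patent only states that the k bottleneck vectors are jointly encoded with a temporal attention mechanism:

```python
import torch
import torch.nn as nn

class BottleneckTemporalAttention(nn.Module):
    """The k bottleneck vectors (one 1 x 1 x 128 code per adjacent frame)
    attend to each other so that each frame's code is informed by its
    neighbours, capturing illumination dependencies across frames."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, codes):                  # codes: B x k x 128
        mixed, _ = self.attn(codes, codes, codes)
        return self.norm(codes + mixed)        # residual keeps per-frame content


# usage sketch: fuse k = 5 adjacent bottleneck vectors before decoding
codes = torch.randn(1, 5, 128)
fused = BottleneckTemporalAttention()(codes)   # same shape, temporally mixed
```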
Through the above steps, a video image to be processed is obtained, wherein the video to be processed comprises a whole-body portrait video image; a plurality of image frames of the video image to be processed and a target lighting scene are input into a pre-trained image processing model to obtain a rendered image frame sequence, the image processing model being used for rendering the image frames and the target lighting scene into rendered image frames under the target lighting scene and performing time consistency processing on the rendered image frames; and the rendered image frame sequence is synthesized into a relighting video image. The application can thus realize whole-body portrait relighting and improve the relighting effect.
In order to implement the foregoing embodiment, as shown in fig. 4, there is further provided a whole-body portrait video relighting apparatus 400 based on a convolutional neural network in the present embodiment, where the apparatus 400 includes: a first acquisition module 410, a rendering module 420, and a composition module 430.
A first obtaining module 410, configured to obtain a video image to be processed, where the video to be processed includes a whole-body portrait video image;
a rendering module 420, configured to input a plurality of image frames of a video image to be processed and a target lighting scene into a pre-trained image processing model to obtain a rendering image frame sequence, where the image processing model is configured to render the image frames and the target lighting scene into rendering image frames in the target lighting scene, and perform time consistency processing on the rendering image frames;
a composition module 430 for composing the sequence of rendered image frames into a relighting video image.
In some possible embodiments, the rendering module 420 includes:
the first input unit is used for inputting each image frame in the image frames into the first convolution neural network frame by frame to carry out light removal processing so as to obtain a portrait albedo image and a normal map under a standard lighting scene;
the second input unit is used for simultaneously inputting the portrait albedo images and normal maps of a plurality of adjacent image frames, together with the target illumination environment map, into a plurality of second convolutional neural networks to obtain a plurality of synthesized relighting frames, wherein the second convolutional neural networks encode time consistency through an inter-frame attention mechanism;
and the synthesis unit is used for synthesizing the background image in the image frames and the relighting frames to generate the rendered image frame sequence.
In some possible embodiments, the convolutional neural network based whole-body portrait video relighting apparatus 400 further comprises:
the second acquisition module is used for acquiring training data, and the training data comprises video image data under different lighting scenes;
the first input module is used for respectively inputting the video image data under two different lighting scenes into a first initial convolution neural network which is constructed in advance to respectively obtain an albedo image and a normal map which correspond to the video image data under the two different lighting scenes;
the calculation module is used for calculating Euclidean space distances of albedo images and normal map corresponding to video image data under two different lighting scenes and a characteristic space distance of a convolutional layer characteristic map of a first initial convolutional neural network;
and the first training module is used for training the network by taking the Euclidean spatial distance and the characteristic spatial distance of the albedo image and the normal map as loss functions together to obtain a trained first convolution neural network.
In some possible embodiments, the convolutional neural network based whole-body portrait video relighting apparatus 400 further comprises:
the coding module is used for carrying out time consistency coding on a plurality of second initial convolutional neural networks which are constructed in advance according to an inter-frame attention mechanism;
the second input module is used for simultaneously and respectively inputting the portrait albedo image and the normal map of a plurality of adjacent image frames corresponding to any one of the two lighting scenes into a plurality of second initial convolution neural networks and outputting a plurality of relighting frames;
and the second training module is used for training the plurality of second initial convolutional neural networks based on the plurality of relighting frames and the target illumination environment maps of the plurality of adjacent image frames to obtain the plurality of trained second convolutional neural networks.
According to the convolutional neural network-based whole body portrait video relighting device, a video image to be processed is obtained, wherein the video image to be processed comprises a whole body portrait video image; a plurality of image frames of the video image to be processed and a target lighting scene are input into a pre-trained image processing model to obtain a rendered image frame sequence, the image processing model being used for rendering the image frames and the target lighting scene into rendered image frames under the target lighting scene and performing time consistency processing on the rendered image frames; and the rendered image frame sequence is synthesized into a relighting video image. The device can thus realize whole-body portrait relighting and improve the relighting effect.
It should be noted that the foregoing explanation of the embodiment of the method for relighting the whole-body portrait video based on the convolutional neural network is also applicable to the device for relighting the whole-body portrait video based on the convolutional neural network of the embodiment, and details are not repeated here.
The present disclosure also provides an electronic device, a computer-readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. The electronic device 500 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the electronic device 500 comprises a computing unit 501, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
A number of components in the electronic device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the electronic device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 501 performs the various methods and processes described above, such as the convolutional neural network-based whole-body portrait video relighting method. For example, in some embodiments, the convolutional neural network-based whole-body portrait video relighting method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the convolutional neural network-based whole-body portrait video relighting method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the convolutional neural network-based whole-body portrait video relighting method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server can be a cloud Server, also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service ("Virtual Private Server", or simply "VPS"). The server may also be a server of a distributed system, or a server that incorporates a blockchain.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means at least two, e.g., two, three, etc., unless explicitly defined otherwise.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A whole-body portrait video relighting method based on a convolutional neural network is characterized by comprising the following steps:
acquiring a video image to be processed, wherein the video to be processed comprises a whole-body portrait video image;
inputting a plurality of image frames of the video image to be processed and a target illumination scene into a pre-trained image processing model to obtain a rendered image frame sequence, wherein the image processing model is used for rendering the image frames and the target illumination scene into the rendered image frames under the target illumination scene and performing time consistency processing on the rendered image frames;
synthesizing the sequence of rendered image frames into a relighting video image.
2. The method of claim 1, wherein the pre-trained image processing model comprises a first convolutional neural network and a second convolutional neural network, and wherein inputting the plurality of image frames of the video image to be processed and the target lighting scene into the pre-trained image processing model to obtain a sequence of rendered image frames comprises:
inputting each image frame in the image frames into the first convolution neural network frame by frame to perform light removal processing to obtain a portrait albedo image and a normal map under a standard lighting scene;
simultaneously inputting the portrait albedo images and normal maps of a plurality of adjacent image frames, together with the target illumination environment map, into a plurality of second convolutional neural networks to obtain a plurality of synthesized relighting frames, wherein the second convolutional neural networks encode time consistency through an inter-frame attention mechanism;
and synthesizing a background image in the image frames and the plurality of relighting frames to generate the rendered image frame sequence.
3. The method according to claim 2, wherein before said inputting each of said image frames frame by frame to said first convolutional neural network for de-illumination processing, further comprising:
acquiring training data, wherein the training data comprises video image data under different lighting scenes;
respectively inputting video image data under two different lighting scenes into a first initial convolutional neural network which is constructed in advance, and respectively obtaining albedo images and normal maps which correspond to the video image data under the two different lighting scenes;
calculating Euclidean space distances of albedo images and normal maps corresponding to the video image data under the two different lighting scenes, and a characteristic space distance of a convolutional layer characteristic map of the first initial convolutional neural network;
and (3) taking the Euclidean spatial distance of the albedo image and the normal map and the characteristic spatial distance together as a loss function to train the network, so as to obtain the trained first convolutional neural network.
4. The method of claim 3, wherein before said simultaneously inputting the portrait albedo images and normal maps of the adjacent image frames and the target illumination environment map into the plurality of second convolutional neural networks to obtain the plurality of synthesized relighting frames, the method further comprises:
performing time consistency coding on a plurality of second initial convolutional neural networks which are constructed in advance according to the inter-frame attention mechanism;
simultaneously and respectively inputting the portrait albedo images and normal maps of the plurality of adjacent image frames corresponding to any one of the two lighting scenes into the plurality of second initial convolutional neural networks, and outputting the plurality of relighting frames;
and training the plurality of second initial convolutional neural networks based on the plurality of multiple illumination frames and the target illumination environment map of the plurality of adjacent image frames to obtain a plurality of trained second convolutional neural networks.
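The patent does not publish the attention architecture; the block below is only a minimal sketch of what "temporal consistency coding through an inter-frame attention mechanism" could look like, with per-pixel feature tokens of adjacent frames attending to one another:

```python
import torch
import torch.nn as nn

class InterFrameAttention(nn.Module):
    # Hypothetical temporal-consistency block: every pixel location forms a
    # short sequence across T adjacent frames, and self-attention lets each
    # frame's features attend to its neighbors'.
    def __init__(self, channels, heads=4):  # channels must be divisible by heads
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, feats):                        # feats: (T, C, H, W)
        t, c, h, w = feats.shape
        tokens = feats.flatten(2).permute(2, 0, 1)   # (H*W, T, C): per-pixel sequences
        out, _ = self.attn(tokens, tokens, tokens)   # attend across the T frames
        return out.permute(1, 2, 0).reshape(t, c, h, w)
```

For example, InterFrameAttention(64)(torch.randn(5, 64, 32, 32)) returns a tensor of the same shape whose features have been mixed across the five frames.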
5. A whole-body portrait video relighting device based on a convolutional neural network is characterized by comprising:
a first acquisition module, configured to acquire a video image to be processed, wherein the video to be processed comprises a whole-body portrait video image;
a rendering module, configured to input a plurality of image frames of the video image to be processed and a target lighting scene into a pre-trained image processing model to obtain a sequence of rendered image frames, wherein the image processing model is configured to render the image frames under the target lighting scene and to perform temporal consistency processing on the rendered image frames;
and a compositing module, configured to composite the sequence of rendered image frames into a relighting video image.
6. The apparatus of claim 5, wherein the rendering module comprises:
a first input unit, configured to input each of the image frames frame by frame into the first convolutional neural network for de-lighting processing, to obtain a portrait albedo image and a normal map under a standard lighting scene;
a second input unit, configured to simultaneously input the portrait albedo images and normal maps of a plurality of adjacent image frames, together with a target illumination environment map, into a plurality of second convolutional neural networks to obtain a plurality of synthesized relighting frames, wherein the second convolutional neural networks encode temporal consistency through an inter-frame attention mechanism;
and a synthesizing unit, configured to composite a background image from the image frames with the plurality of relighting frames to generate the sequence of rendered image frames.
7. The apparatus of claim 6, further comprising:
a second acquisition module, configured to acquire training data, wherein the training data comprises video image data under different lighting scenes;
a first input module, configured to respectively input video image data under two different lighting scenes into a pre-constructed first initial convolutional neural network to obtain the albedo images and normal maps corresponding to the video image data under each of the two lighting scenes;
a calculation module, configured to calculate the Euclidean distances between the albedo images and between the normal maps corresponding to the video image data under the two different lighting scenes, as well as the feature space distance between the convolutional layer feature maps of the first initial convolutional neural network;
and a first training module, configured to train the network with the Euclidean distances of the albedo images and normal maps, together with the feature space distance, as the loss function, to obtain the trained first convolutional neural network.
8. The apparatus of claim 7, further comprising:
a coding module, configured to perform temporal consistency coding on a plurality of pre-constructed second initial convolutional neural networks according to the inter-frame attention mechanism;
a second input module, configured to simultaneously and respectively input the portrait albedo images and normal maps of the plurality of adjacent image frames corresponding to either of the two lighting scenes into the plurality of second initial convolutional neural networks, and to output the plurality of relighting frames;
and a second training module, configured to train the plurality of second initial convolutional neural networks based on the plurality of relighting frames and the target illumination environment maps of the plurality of adjacent image frames to obtain the plurality of trained second convolutional neural networks.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the convolutional neural network-based whole-body portrait video relighting method of any of claims 1 to 4.
10. A computer readable storage medium, wherein instructions in the computer readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the convolutional neural network-based whole-body portrait video relighting method of any of claims 1-4.
CN202210612418.2A 2022-05-31 2022-05-31 Whole body portrait video relighting method and device based on convolutional neural network Pending CN115100337A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210612418.2A CN115100337A (en) 2022-05-31 2022-05-31 Whole body portrait video relighting method and device based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210612418.2A CN115100337A (en) 2022-05-31 2022-05-31 Whole body portrait video relighting method and device based on convolutional neural network

Publications (1)

Publication Number Publication Date
CN115100337A 2022-09-23

Family

ID=83289747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210612418.2A Pending CN115100337A (en) 2022-05-31 2022-05-31 Whole body portrait video relighting method and device based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN115100337A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631286A (en) * 2022-12-21 2023-01-20 北京百度网讯科技有限公司 Image rendering method, device, equipment and storage medium
CN116612537A (en) * 2023-07-21 2023-08-18 武汉理工大学 Semi-supervised action detection method based on background weakening and consistency calculation
CN116612537B (en) * 2023-07-21 2023-10-03 武汉理工大学 Semi-supervised action detection method based on background weakening and consistency calculation
CN117252787A (en) * 2023-11-17 2023-12-19 北京渲光科技有限公司 Image re-illumination method, model training method, device, equipment and medium
CN117252787B (en) * 2023-11-17 2024-02-02 北京渲光科技有限公司 Image re-illumination method, model training method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN108875935B Natural image target material visual characteristic mapping method based on generative adversarial network
CN115082639B (en) Image generation method, device, electronic equipment and storage medium
CN115100337A (en) Whole body portrait video relighting method and device based on convolutional neural network
CN114820906B (en) Image rendering method and device, electronic equipment and storage medium
CN115345980B (en) Generation method and device of personalized texture map
CN108805971B Ambient occlusion method
CN114820905B (en) Virtual image generation method and device, electronic equipment and readable storage medium
US20220156987A1 (en) Adaptive convolutions in neural networks
CN102667865B Method for building an environment map
CN113240783B (en) Stylized rendering method and device, readable storage medium and electronic equipment
AU2022231680B2 (en) Techniques for re-aging faces in images and video frames
US20230368459A1 (en) Systems and methods for rendering virtual objects using editable light-source parameter estimation
CN117333637B (en) Modeling and rendering method, device and equipment for three-dimensional scene
CN116740261A (en) Image reconstruction method and device and training method and device of image reconstruction model
CN115713585B (en) Texture image reconstruction method, apparatus, computer device and storage medium
Huang et al. Image dehazing based on robust sparse representation
CN115953524A (en) Data processing method and device, computer equipment and storage medium
CN115049559A (en) Model training method, human face image processing method, human face model processing device, electronic equipment and readable storage medium
CN113920023A (en) Image processing method and device, computer readable medium and electronic device
Fang et al. Detail Maintained Low-Light Video Image Enhancement Algorithm
CN116012666B (en) Image generation, model training and information reconstruction methods and devices and electronic equipment
CN115359170B (en) Scene data generation method and device, electronic equipment and storage medium
Wang et al. Differentiable Rendering Approach to Mesh Optimization for Digital Human Reconstruction
CN117557721A (en) Method, system, equipment and medium for reconstructing detail three-dimensional face of single image
Bai et al. Local-to-Global Panorama Inpainting for Locale-Aware Indoor Lighting Prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination