CN113870372A - Video hair color conversion method based on deep learning - Google Patents

Video hair color conversion method based on deep learning

Info

Publication number
CN113870372A
CN113870372A
Authority
CN
China
Prior art keywords
hair
hair color
color
target image
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111012366.7A
Other languages
Chinese (zh)
Other versions
CN113870372B (en)
Inventor
伍克煜
郑友怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202111012366.7A priority Critical patent/CN113870372B/en
Priority to PCT/CN2021/126912 priority patent/WO2023029184A1/en
Publication of CN113870372A publication Critical patent/CN113870372A/en
Application granted granted Critical
Publication of CN113870372B publication Critical patent/CN113870372B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/001 Texturing; Colouring; Generation of texture or colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/90 Determination of colour characteristics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10024 Color image
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a video hair color conversion method based on a deep neural network. The method is the first to introduce a luminance map to represent the geometric structure of the hair, and proposes normalizing it in time and space to further ensure both the stability of consecutive frames and the consistency of the generated result with the reference image. The invention also designs three conditional modules, for color, structure and background, to further decouple the hair attributes. Because the luminance map is highly decoupled from color and a dedicated module is designed for each hair attribute, the hair color, structure and lighting as well as the image background are highly decoupled; exploiting this, the method achieves high-fidelity hair color conversion. The invention further introduces a discriminator and a cycle consistency loss to generate more realistic and more temporally coherent results.

Description

A video hair color conversion method based on deep learning

Technical Field

The invention belongs to the field of computer vision, and in particular relates to a video hair color conversion method based on deep learning.

Background Art

Hair is one of the most important components of portrait depiction, and it has inspired a large body of excellent work in computer vision. However, existing work addresses only static hair, and work on video sequences remains insufficient. Moreover, unlike most parts of the human face, hair is extremely delicate, varied and complex: it consists of thousands of thin strands and is affected by lighting, motion and occlusion, making it difficult to analyze, represent and generate. Existing studies use Gabor filters to extract the growth direction of hair strands as the hair geometry and decouple it from color. A common drawback of this approach is that the Gabor filter loses a great deal of detail, so the generated video sequences lack fidelity and tend to flicker.

Summary of the Invention

To address the shortcomings of the prior art, the present invention proposes a video hair color conversion method based on deep learning. A normalized luminance map is used to represent the geometric structure of the hair, and three conditional modules are designed for the individual hair attributes, so that the hair attributes are highly decoupled and can be recombined to achieve hair color conversion; the entire training procedure requires no paired data. A cycle consistency loss is proposed, and a discriminator is used to remove flicker that may arise during generation and to strengthen the consistency between the generated result and the reference image.

The present invention is achieved through the following technical solution:

A video hair color conversion method based on deep learning, comprising the following steps:

Step 1: Convert each frame of the video containing the target images whose hair color is to be converted from RGB space to LAB space, extract the L channel and normalize it in time and space, and use a structural feature extraction module to obtain a feature map of the target image that encodes hair structure and lighting.

The L channel (luminance map) represents brightness; the variation of brightness expresses the strand structure of the hair while preserving the original lighting. Most importantly, the luminance map retains all fine details, so a more realistic image can be reconstructed. Since brightness is also related to color depth, and to compensate for the different lighting conditions of different images, the invention proposes to normalize the luminance map in time and space and to use the normalized result to represent the geometric structure of the hair. This preserves both the strand structure and the original lighting conditions, producing more realistic results.

Step 2: Select a reference image with the desired hair color, use a color feature extraction module to extract the hair color feature of the reference image, and superimpose the hair color feature onto the hair mask of the target image to obtain a hair color mask;

Step 3: Using the hair mask of the target image, extract a feature map of the background region of the target image (everything except the hair) with a background region feature extraction module;

Step 4: Feed the feature map of hair structure and lighting extracted in Step 1, the hair color mask obtained in Step 2, and the background region feature map extracted in Step 3 into a backbone generation network, which integrates them to generate the target image with the hair color of the reference image.

The structural feature extraction module, the color feature extraction module, the background region feature extraction module, and the backbone generation network are obtained by training on collected videos.

Further, in Step 1, each frame of the video containing the target images whose hair color is to be converted is converted from RGB space to LAB space, and the L channel is extracted and normalized in time and space, specifically:

(1.1) Convert each frame of the video containing the target images whose hair color is to be converted from CIE RGB to the CIE XYZ color space, and then from CIE XYZ to the LAB color space.

(1.2) Extract the L channel, compute the mean and variance of the L values of all pixels over the entire video sequence, and normalize the L channel using the formula Lt^norm = (Lt − mean(L)) / V, where mean(L) denotes the mean of L, V denotes the variance, and t is the index of the image.

Further, in step (1.2), the mean and variance of the L values are computed over the pixels corresponding to the hair region of the entire video sequence.
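By way of illustration only, the normalization described in (1.2) could be sketched as follows; the function name, the use of scikit-image for the color conversion, and the epsilon guard are assumptions of the sketch rather than part of the invention, and the statistics are taken over the hair-region pixels of the whole clip as specified above.

```python
import numpy as np
from skimage import color  # RGB -> LAB conversion (via CIE XYZ internally)

def normalize_luminance(frames_rgb, hair_masks, eps=1e-6):
    """Normalize the L channel of a video sequence in time and space.

    frames_rgb: sequence of H x W x 3 RGB frames in [0, 1]
    hair_masks: sequence of H x W binary hair masks (1 = hair)
    Returns per-frame normalized luminance maps Lt_norm = (Lt - mean(L)) / V,
    where mean(L) and V are computed over the hair pixels of the whole clip.
    """
    # L channel of every frame (skimage converts RGB -> XYZ -> LAB)
    lum = np.stack([color.rgb2lab(f)[..., 0] for f in frames_rgb])   # T x H x W
    masks = np.stack(hair_masks).astype(bool)                        # T x H x W

    hair_values = lum[masks]          # L values of hair pixels across all frames
    mean_l = hair_values.mean()
    var_l = hair_values.var()         # the patent's formula divides by the variance V

    return (lum - mean_l) / (var_l + eps)
```

Using a single mean and variance for the entire sequence, rather than per-frame statistics, is what keeps consecutive normalized luminance maps from flickering.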

Further, the color feature extraction module consists of a 4-layer down-sampling partial convolution network and an instance-wise average pooling layer.

By exploiting the ability of partial convolution to extract features based on a mask, the features of the hair region of the reference image are extracted while interference from features outside the hair region is avoided; the instance-wise average pooling layer then compresses the features into a single feature vector that captures the global color of the reference image.
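A minimal sketch of such a mask-aware layer is given below, assuming the standard partial-convolution formulation (convolve the masked input, renormalize by the window's mask coverage, then update the mask); the class name, kernel size and stride are illustrative assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Convolution that only attends to pixels where mask == 1 and updates the mask."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=2, padding=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=True)
        # Fixed all-ones kernel used to count valid (mask == 1) pixels per window.
        self.register_buffer("ones", torch.ones(1, 1, kernel_size, kernel_size))
        self.stride, self.padding = stride, padding

    def forward(self, x, mask):
        # x: B x C x H x W, mask: B x 1 x H x W with 1 = valid (hair) pixel
        with torch.no_grad():
            coverage = F.conv2d(mask, self.ones, stride=self.stride, padding=self.padding)
        out = self.conv(x * mask)
        # Renormalize by the fraction of valid pixels inside each window.
        scale = self.ones.numel() / coverage.clamp(min=1.0)
        new_mask = (coverage > 0).float()     # a window is valid if it saw any hair pixel
        out = out * scale * new_mask
        return out, new_mask
```

Stacking four such layers with stride 2 halves the resolution each time while the mask shrinks with it, so only hair pixels ever contribute to the pooled color feature.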

Further, the structural feature extraction module consists of several down-sampling modules and residual blocks connected in sequence, and the backbone generation network consists of several residual blocks and up-sampling modules connected in sequence; the structural feature extraction module and the backbone generation network have symmetric structures. In Step 4, the feature map of hair structure and lighting extracted in Step 1 and the hair color mask obtained in Step 2 are concatenated along the feature channel and fed into the residual blocks of the backbone generation network for further feature extraction; the result is passed to the up-sampling modules, which obtain multi-scale features through skip connections to the down-sampling modules of the structural feature extraction module, and the background region feature maps extracted in Step 3 are combined in the last n up-sampling modules to finally produce the target image with the hair color of the reference image, where n is the number of background region feature maps.

Further, when there are multiple consecutive frames of target images whose hair color is to be converted, the conversion results of the previous k frames are fed back to the structural feature extraction module together with the current target image as joint input, to ensure the temporal coherence of the generated video sequence.

Further, the loss function used in training includes: the L1 loss and perceptual loss between the generated target image with the reference hair color and the ground truth, the generative adversarial loss of the network, the feature matching loss between the generated image and the ground truth, a temporal coherence loss, and a cycle consistency loss, expressed as:

Ltotal = λ1·L1 + λp·Lp + λadv·Ladv + λFM·LFM + λstable·Lstable + λchromatic·Lchromatic + λcyc·Lcyc

where L1 and Lp are, respectively, the L1 loss and the perceptual loss between the network output image and the ground truth, Ladv is the generative adversarial loss of the network, LFM is the feature matching loss between the network output image and the ground truth, Lstable is the temporal coherence loss, and Lchromatic and Lcyc are the cycle consistency losses; each λ is the weight of the corresponding loss and is a real number.
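As a hedged sketch of how these terms could be combined, the weighted sum might be assembled as below; the individual loss values are stand-ins computed elsewhere, and the default weights mix the values reported later in the embodiment with assumed placeholders for the ones that are not stated.

```python
def total_loss(losses, weights=None):
    """Weighted sum of the training losses.

    losses: dict mapping a term name to its scalar tensor value, e.g.
            {"l1": ..., "perceptual": ..., "adv": ..., "fm": ...,
             "stable": ..., "chromatic": ..., "cyc": ...}
    weights: dict of lambda values; defaults below combine the values stated in
             the embodiment with assumed placeholders for the unreported ones.
    """
    if weights is None:
        weights = {
            "l1": 10.0, "perceptual": 10.0, "adv": 0.1,   # stated in the embodiment
            "chromatic": 10.0, "stable": 1.0,             # stated in the embodiment
            "fm": 1.0, "cyc": 1.0,                        # assumed placeholders
        }
    return sum(weights[name] * value for name, value in losses.items())
```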

Further, the cycle consistency loss is trained as follows: for two video sequences X and Y, first take sequence Y as the reference image and use the steps of the method above to convert the hair color of sequence X, generating a sequence X*; then take X* as the reference image and Y as the sequence whose hair color is to be converted, and repeat the steps above to generate a new sequence Y*. The loss functions are as follows:

[Cycle consistency loss terms given as equation images: Figure BDA0003239427410000038, Figure BDA0003239427410000039, Figure BDA0003239427410000041, Figure BDA0003239427410000042]

where I is the target image, Iref denotes the reference image, and G denotes the output of the backbone generation network; I* denotes the result of one hair color conversion, Icyc denotes the result obtained after two conversions, and k+1 denotes the number of fed-back input images; Φj denotes the output of the j-th layer of the VGG network, t is the frame index within the video sequence, and the subscript l denotes the L channel of the corresponding image, i.e. the luminance map of image I; D denotes the output of the discriminator, and E denotes expectation; M denotes the hair mask of the target image, Mref denotes the hair mask of the reference image, and ||·||1 denotes L1 regularization.

The cycle consistency loss further ensures that the color of the generated result is consistent with the reference image.
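A sketch of the resulting two-pass training step is shown below; `recolor` stands for the full pipeline of Steps 1 to 4 and `perceptual` for a VGG-feature distance, both assumed helpers, and using the first frame of a clip as its reference image is likewise an assumption of the sketch.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_step(recolor, clip_x, clip_y, perceptual):
    """One training step of the cycle constraint between two clips X and Y.

    clip_x, clip_y: tensors of shape T x C x H x W
    recolor(frames, reference) -> recolored frames (the full hair recoloring pipeline)
    perceptual(a, b)           -> VGG-feature distance between images a and b
    """
    # First pass: recolor X with Y's hair color -> X*
    clip_x_star = recolor(clip_x, reference=clip_y[0])
    # Second pass: recolor Y with X*'s hair color -> Y*
    clip_y_star = recolor(clip_y, reference=clip_x_star[0])

    # Y* should reproduce Y: pixel-level and perceptual cycle terms.
    loss_cyc = F.l1_loss(clip_y_star, clip_y)
    loss_cyc_perc = sum(perceptual(a, b) for a, b in zip(clip_y_star, clip_y)) / len(clip_y)
    return loss_cyc + loss_cyc_perc
```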

The outstanding contributions of the present invention are:

The invention proposes iHairRecolorer, the first deep-learning-based method for transferring the hair color of a reference image to a video. It is the first to use the L channel of the LAB space, normalized in time and space, in place of the traditional orientation map as the representation of hair structure; the luminance map not only preserves fine structural features and the original lighting conditions but is also highly decoupled from the hair color. In addition, a novel cycle consistency loss is employed to better match the colors of the generated result and the reference image. The method can be trained without any paired data and remains robust and stable on test data. The invention also introduces a discriminator to ensure the temporal consistency of the generated video sequences, producing more realistic and smoother results that outperform all existing methods.

Brief Description of the Drawings

Fig. 1 is a diagram of the network pipeline structure of the present invention;

Fig. 2 shows hair color conversion results of the present invention.

Detailed Description of the Embodiments

Hair is extremely delicate, varied and complex: it consists of thousands of thin strands and is affected by lighting, motion and occlusion, making it difficult to analyze, represent and generate. The goal of the present invention is to obtain a new video sequence in which the hair structure faithfully reproduces the original video while the hair color matches the reference image. This requires structure and color to be highly decoupled, which is generally computed with a Gabor filter. To address the problems above, the present invention proposes to use a luminance map to represent the hair structure and to normalize it in time and space, and uses three carefully designed conditional modules (such as partial convolution) together with the proposed loss functions to train the video hair color conversion model. The method specifically includes the following steps:

Step 1: Convert each frame of the video containing the target images whose hair color is to be converted from RGB space to LAB space, extract the L channel and normalize it in time and space, and use a structural feature extraction module to obtain a feature map of the target image that encodes hair structure and lighting.

Step 2: Select a reference image with the desired hair color, use a color feature extraction module to extract the hair color feature of the reference image, and superimpose the hair color feature onto the hair mask of the target image to obtain a hair color mask;

Step 3: Using the hair mask of the target image, extract a feature map of the background region of the target image (everything except the hair) with a background region feature extraction module;

Step 4: Feed the feature map of hair structure and lighting extracted in Step 1, the hair color mask obtained in Step 2, and the background region feature map extracted in Step 3 into a backbone generation network, which integrates them to generate the target image with the hair color of the reference image.

The structural feature extraction module, the color feature extraction module, the background region feature extraction module, and the backbone generation network are obtained by training on collected videos.

Fig. 1 illustrates the structure of the network and the direction of data flow in the present invention. The method of the present invention is further described below with reference to a specific embodiment:

For a given video sequence of T frames, I(T) = {I1, I2, ..., IT}, and a reference image Iref, our goal is to remove the original hair color in I(T) and convert it to the same color as the hair in the reference image, while keeping the other hair attributes unchanged. Therefore, the hair attributes in the video sequence are first decomposed into shape, structure, lighting, color, and the region outside the hair.

The extraction of the structural features exploits the high decoupling of luminance and color that is unique to the LAB space, and the L channel is normalized with a single set of statistics for the whole clip: mean(L) and the variance V are computed over the hair-region pixels of all frames (i.e. over the values Lt(i) with Mt(i) = 1), and each frame is normalized as Lt^norm = (Lt − mean(L)) / V, where M denotes the mask of the hair region of the target image, Lt denotes the luminance map of the t-th frame, i denotes a pixel of the luminance map, and Lt^norm denotes the normalized luminance map of the t-th frame. The product M·L removes the influence of the background on the normalization of the luminance map. Computing the luminance maps of all frames of a video sequence together, with a unified mean and variance for normalization, eliminates jitter in the luminance-map sequence and produces smoother results. The resulting luminance maps are shown in Fig. 1 (Clip X). Finally, the luminance map of the target image is passed through the structural feature extraction module (Luminance Module) to obtain a feature map containing hair structure and lighting; in this embodiment, the Luminance Module consists of 3 down-sampling layers and 4 residual blocks (ResNet blocks).
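A sketch of a Luminance Module with this 3-down-sampling + 4-residual-block layout is given below; the channel widths, normalization layers and activation choices are assumptions, not values taken from the patent.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)

class LuminanceModule(nn.Module):
    """Encodes normalized luminance map(s) into a hair structure/lighting feature map."""

    def __init__(self, in_ch=3, base=64):        # in_ch = current + fed-back luminance maps
        super().__init__()
        downs, prev = [], in_ch
        for ch in (base, base * 2, base * 4):     # 3 down-sampling layers
            downs.append(nn.Sequential(
                nn.Conv2d(prev, ch, 4, stride=2, padding=1),
                nn.InstanceNorm2d(ch), nn.ReLU(inplace=True)))
            prev = ch
        self.downs = nn.ModuleList(downs)
        self.blocks = nn.Sequential(*[ResBlock(prev) for _ in range(4)])  # 4 residual blocks

    def forward(self, lum):
        skips, x = [], lum
        for down in self.downs:
            x = down(x)
            skips.append(x)                       # multi-scale features for skip connections
        return self.blocks(x), skips
```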

For the reference image with the desired hair color, the color feature extraction module (Color Module) shown in Fig. 1 is used; it consists of a 4-layer down-sampling partial convolution network and an instance-wise average pooling layer. A hair segmentation network is first used to obtain the hair mask of the hair region of the reference image. The image then passes through 4 down-sampling partial convolution layers; every down-sampling layer reduces the image resolution, and the hair mask is updated accordingly at each layer to avoid interference from features outside the hair. The feature map obtained after the 4 down-sampling layers is further compressed into a 512-dimensional feature vector by the instance-wise average pooling layer. This not only preserves the global information of the hair color but also removes the influence of differences in hair shape and structure. The extracted feature vector is then superimposed on the hair mask of the target image according to the following formula:

[Formula given as an equation image: Figure BDA0003239427410000061]

where A′ref denotes the feature vector extracted from the reference image, Mref denotes the hair mask of the reference image, and M denotes the hair mask of the target image; A is the color feature map obtained by superimposing the feature vector onto M, i.e. the hair color mask.
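The instance-wise pooling and the superposition of the resulting 512-dimensional color vector onto the target hair mask might look like the following sketch; the function names are illustrative, and the partial-convolution encoder that produces `feat` is assumed to exist (see the PartialConv2d sketch above).

```python
import torch

def instance_average_pool(feat, mask):
    """Average a B x C x H x W feature map over the (downsampled) hair mask -> B x C vector."""
    mask = mask.expand_as(feat)
    return (feat * mask).sum(dim=(-1, -2)) / mask.sum(dim=(-1, -2)).clamp(min=1.0)

def build_hair_color_mask(color_vec, target_hair_mask):
    """Broadcast the reference color vector onto the target image's hair mask.

    color_vec:        B x 512 global color feature of the reference hair
    target_hair_mask: B x 1 x H x W binary mask of the target image's hair
    Returns the B x 512 x H x W hair color mask A fed to the backbone generator.
    """
    return color_vec[:, :, None, None] * target_hair_mask
```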

Further, the background region feature extraction module (Background Module) shown in Fig. 1 is used to extract the feature maps of the background region of the target image (everything except the hair). Specifically, the hair region is first removed using the target-image hair mask obtained by segmentation, with the remaining region left unchanged; the remaining region is fed into the background region feature extraction module, features of different granularities are obtained through two down-sampling layers, and these features are combined with the newly generated hair features in the last two layers of the backbone generator.

As shown in Fig. 1, the backbone generation network consists of 4 ResBlocks and 3 up-sampling layers and has a network structure symmetric to the Luminance Module. Its input is the combination, concatenated along the feature channel, of the target image's feature map of hair structure and lighting and the hair color mask. After the 4 ResBlocks further extract features, skip connections bring in the multi-scale features from the down-sampling layers of the Luminance Module, and the background features are combined in the last 2 layers to generate the image with the hair color of the reference image.
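Putting the pieces together, a backbone generator that mirrors the Luminance Module (4 ResBlocks, 3 up-sampling layers, skip connections, background features merged only in the last two layers) could be sketched as below; it reuses the ResBlock from the Luminance Module sketch, and every channel size is an assumption.

```python
import torch
import torch.nn as nn

class BackboneGenerator(nn.Module):
    """4 ResBlocks + 3 up-sampling layers; symmetric to the Luminance Module sketch above."""

    def __init__(self, struct_ch=256, color_ch=512, base=64, out_ch=3):
        super().__init__()
        ch = struct_ch
        self.fuse = nn.Conv2d(struct_ch + color_ch, ch, 3, padding=1)  # concat structure + color
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(4)])  # ResBlock from the sketch above
        ups = []
        for skip_ch, bg_ch in [(base * 4, 0), (base * 2, base * 2), (base, base)]:
            ups.append(nn.Sequential(
                nn.ConvTranspose2d(ch + skip_ch + bg_ch, ch // 2, 4, stride=2, padding=1),
                nn.InstanceNorm2d(ch // 2), nn.ReLU(inplace=True)))
            ch //= 2
        self.ups = nn.ModuleList(ups)
        self.to_rgb = nn.Conv2d(ch, out_ch, 3, padding=1)

    def forward(self, struct_feat, color_mask_feat, skips, bg_feats):
        # struct_feat: Luminance Module output; color_mask_feat: hair color mask resized
        # to the same resolution (an assumption of this sketch); skips: Luminance Module
        # down-sampling features; bg_feats: two Background Module feature maps.
        x = self.blocks(self.fuse(torch.cat([struct_feat, color_mask_feat], dim=1)))
        bg = [None] + list(bg_feats)              # background enters only in the last 2 layers
        for up, skip, b in zip(self.ups, reversed(skips), bg):
            feats = [x, skip] if b is None else [x, skip, b]
            x = up(torch.cat(feats, dim=1))
        return torch.tanh(self.to_rgb(x))
```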

As a preferred scheme, if the hair color of multiple consecutive target images in a video needs to be converted, the recolored images generated from the previous k frames of the current target image can be fed into the network together with the current frame as input, producing a smoother video sequence with stronger temporal coherence. To keep the input consistent, the first and second frames use the same three images as input.
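A small helper illustrating how the fed-back frames could be assembled (k = 2, i.e. three inputs, with the first and second frames receiving the same three images as described) is sketched below; it is an assumed utility, not the patent's code.

```python
def temporal_input(lum_maps, prev_outputs, t, k=2):
    """Stack the current luminance map with the previous k recolored frames.

    lum_maps:     list of normalized luminance maps, one per frame
    prev_outputs: list of already-generated (recolored) frames, indices < t
    Returns a list of k + 1 network inputs for frame t.
    """
    if t < k:                                 # frames 0 and 1: use the same three images
        return [lum_maps[t]] * (k + 1)
    return prev_outputs[t - k:t] + [lum_maps[t]]
```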

The structural feature extraction module, the color feature extraction module, the background region feature extraction module, and the backbone generation network are obtained by training on collected videos as follows:

As shown by the cycle consistency part of Fig. 1, for two video sequences X and Y, sequence Y is first taken as the reference image and the steps of the method described above are used to convert the hair color of sequence X, generating sequence X*; then X* is taken as the reference image and Y as the sequence whose hair color is to be converted, and the same steps are repeated to generate a new sequence Y*. Y* should be identical to Y, so the following loss functions can be used as constraints:

[Loss terms given as equation images: Figure BDA0003239427410000071 through Figure BDA0003239427410000076]

where L1 and Lp are, respectively, the L1 loss and the perceptual loss between the network output image and the ground truth, Ladv is the generative adversarial loss of the network, LFM is the feature matching loss between the network output image and the ground truth, and Lstable is the temporal coherence loss.

I is the target image, Iref denotes the reference image, and G denotes the output of the backbone generation network; I* denotes the result of one hair color conversion, Icyc ∈ Y* denotes the result obtained after two conversions, and k+1 denotes the number of input images of the structural feature extraction module, which is 3 in this embodiment; Φj denotes the output of the j-th layer of the VGG network, and the subscript l denotes the L channel of the corresponding image, i.e. the luminance map of image I; D denotes the output of the discriminator, and E denotes expectation.

λ is the weight of the corresponding loss; in this embodiment the values are λ1 = 10, λp = 10, λadv = 0.1, [the remaining weights are given in Figure BDA0003239427410000081], λchromatic = 10 and λstable = 1, and the network is obtained after training with these settings. Fig. 2 shows hair color conversion results obtained with the present invention. As can be seen from the figure, the invention can handle different hairstyles, from long to short and from simple to complex, as well as a variety of colors, including some mixed colors; the generated target image remains consistent with the reference image while also preserving the lighting conditions of the source image.

Obviously, the above embodiment is merely an example given for clarity of description and is not a limitation on the implementation. For those of ordinary skill in the art, changes or modifications in other forms can be made on the basis of the above description. It is neither necessary nor possible to exhaustively list all implementations here, and obvious changes or modifications derived therefrom still fall within the protection scope of the present invention.

Claims (8)

1. A video hair color conversion method based on deep learning is characterized by comprising the following steps:
the method comprises the following steps: converting each frame of a video containing a target image of the hair color to be converted from an RGB space to an LAB space, extracting an L space, standardizing the L space in time and space, and acquiring a characteristic diagram of the target image containing the hair structure and illumination by using a structural characteristic extraction module.
Step two: selecting a reference image with pre-converted hair colors, extracting hair color features of the reference image by using a color feature extraction module, and superposing the hair color features on a hair mask of a target image to obtain a hair color mask;
step three: extracting a background region characteristic diagram of the target image except hairs by using a background region characteristic extraction module according to the target image hair mask;
step four: inputting the feature map of the target image containing hair structure and illumination extracted in step one, the hair color mask obtained in step two, and the background region feature map extracted in step three into a backbone generation network, which integrates them to generate the target image with the hair color of the reference image.
The structure feature extraction module, the color feature extraction module, the background region feature extraction module and the backbone generation network are obtained through collected video training.
2. The method for converting the hair color of the video based on the deep learning as claimed in claim 1, wherein in the step 1, each frame of the video including the target image of the hair color to be converted is converted from RGB space to LAB space, L space is extracted and normalized in time and space, specifically:
(1.1) converting each frame image of the video containing the target image of the hair color to be converted from CIE RGB to CIE XYZ color space and then from CIE XYZ to LAB color space.
(1.2) extracting the L space, calculating the mean and variance of the L values of all pixel points of the whole video sequence, and normalizing the L space using the formula Lt^norm = (Lt − mean(L)) / V, where mean(L) represents the mean of L, V represents the variance, and t is the index of the image.
3. The method according to claim 1, wherein in step (1.2), the mean and variance of the L values of the pixels corresponding to the hair region in the whole video sequence are calculated.
4. The deep learning based video hair color conversion method according to claim 1, wherein the color feature extraction module comprises a 4-layer down-sampled partial convolution network and an instance-level average pooling layer.
5. The method according to claim 1, wherein the structural feature extraction module comprises a plurality of down-sampling modules and residual blocks connected in sequence, the backbone generation network comprises a plurality of residual blocks and up-sampling modules connected in sequence, the structural feature extraction module and the backbone generation network have symmetric structures, and the corresponding up-sampling and down-sampling modules are connected by skip connections. In the fourth step, the feature map of hair structure and illumination extracted in the first step and the hair color mask extracted in the second step are concatenated along the feature channel and input into the plurality of residual blocks of the backbone generation network to extract features; the extracted features are input into the up-sampling modules, which obtain multi-scale features in turn through the skip connections, and the background region feature maps extracted in the third step are combined in the last n up-sampling modules to finally obtain the target image with the hair color of the reference image.
6. The method for converting hair color of video based on deep learning of claim 1, wherein when there are a plurality of consecutive target images whose hair color is to be converted, the conversion results of the previous k frames of target images are fed back to the structural feature extraction module together with the current target image as common input, to ensure the temporal coherence of the generated video sequence.
7. The method of claim 1, wherein the loss function used in training comprises: an L1 loss and a perceptual loss between the generated target image with the reference-image hair color and the ground truth, a generative adversarial loss of the network, a feature matching loss between the generated target image with the reference-image hair color and the ground truth, a temporal coherence loss, and a cycle consistency loss.
8. The method according to claim 7, wherein the cycle consistency loss training specifically comprises: for two video sequences X and Y, first using the video sequence Y as a reference image and using the above steps to convert the hair color of the video sequence X, generating a video sequence X*; then using the video sequence X* as the reference image and the video sequence Y as the video sequence whose hair color is to be converted, and repeating the above steps to generate a new video sequence Y*, wherein the loss function is as follows:
[Loss terms given as equation images: Figure FDA0003239427400000021, Figure FDA0003239427400000022, Figure FDA0003239427400000031, Figure FDA0003239427400000032, Figure FDA0003239427400000033]

wherein I is the target image, Iref represents the reference image, and G represents the output of the backbone generation network; I* represents the result of one hair color conversion, Icyc represents the result obtained after two conversions, and k+1 represents the number of fed-back input images; Φj represents the output of the j-th layer of the VGG network, t is the index of the video sequence, and the subscript l represents the L channel of the corresponding image, i.e. the luminance map of the image I; D denotes the output of the discriminator, and E denotes expectation; M denotes the hair mask of the template image, Mref denotes the hair mask of the reference image, and ||·||1 denotes L1 regularization.
CN202111012366.7A 2021-08-31 2021-08-31 A video hair color conversion method based on deep learning Active CN113870372B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111012366.7A CN113870372B (en) 2021-08-31 2021-08-31 A video hair color conversion method based on deep learning
PCT/CN2021/126912 WO2023029184A1 (en) 2021-08-31 2021-10-28 Video hair color conversion method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111012366.7A CN113870372B (en) 2021-08-31 2021-08-31 A video hair color conversion method based on deep learning

Publications (2)

Publication Number Publication Date
CN113870372A true CN113870372A (en) 2021-12-31
CN113870372B CN113870372B (en) 2024-06-21

Family

ID=78988977

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111012366.7A Active CN113870372B (en) 2021-08-31 2021-08-31 A video hair color conversion method based on deep learning

Country Status (2)

Country Link
CN (1) CN113870372B (en)
WO (1) WO2023029184A1 (en)


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679328B2 (en) * 2018-05-23 2020-06-09 Adobe Inc. Machine learning techniques for increasing color consistency across videos
CN109658330B (en) * 2018-12-10 2023-12-26 广州市久邦数码科技有限公司 Color development adjusting method and device
CN110969631B (en) * 2019-11-25 2023-04-11 杭州小影创新科技股份有限公司 Method and system for dyeing hair by refined photos
CN111524205A (en) * 2020-04-23 2020-08-11 北京信息科技大学 Image colorization processing method and device based on recurrent generative adversarial network
CN112614060B (en) * 2020-12-09 2024-10-18 深圳数联天下智能科技有限公司 Face image hair rendering method and device, electronic equipment and medium
CN112529914B (en) * 2020-12-18 2021-08-13 北京中科深智科技有限公司 Real-time hair segmentation method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102881011A (en) * 2012-08-31 2013-01-16 北京航空航天大学 Region-segmentation-based portrait illumination transfer method
CN108629819A (en) * 2018-05-15 2018-10-09 北京字节跳动网络技术有限公司 Image hair dyeing treating method and apparatus
WO2020037965A1 (en) * 2018-08-21 2020-02-27 北京大学深圳研究生院 Method for multi-motion flow deep convolutional network model for video prediction
US11017560B1 (en) * 2019-04-15 2021-05-25 Facebook Technologies, Llc Controllable video characters with natural motions extracted from real-world videos
CN110458906A (en) * 2019-06-26 2019-11-15 重庆邮电大学 A medical image colorization method based on depth color transfer
CN113313657A (en) * 2021-07-29 2021-08-27 北京航空航天大学杭州创新研究院 Unsupervised learning method and system for low-illumination image enhancement

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
伍克煜; 刘峰江; 许浩; 张浩天; 王贝贝: "Synthetic biology gene design software: a review of iGEM design" (合成生物学基因设计软件：iGEM设计综述), 生物信息学 (Bioinformatics), no. 001, 31 December 2020 (2020-12-31) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596370A (en) * 2022-03-04 2022-06-07 深圳万兴软件有限公司 Video color conversion method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113870372B (en) 2024-06-21
WO2023029184A1 (en) 2023-03-09

Similar Documents

Publication Publication Date Title
Huang et al. Deep learning for image colorization: Current and future prospects
CN109509248B (en) A Neural Network-Based Photon Mapping Rendering Method and System
CN111145290B (en) Image colorization method, system and computer readable storage medium
Gao et al. Multi-modal convolutional dictionary learning
CN110443768A (en) Single-frame image super-resolution reconstruction method based on Multiple Differential consistency constraint and symmetrical redundant network
CN111783658B (en) A Two-Stage Expression Animation Generation Method Based on Dual Generative Adversarial Networks
CN112001843B (en) A deep learning-based infrared image super-resolution reconstruction method
CN115170559A (en) Personalized human head nerve radiation field substrate representation and reconstruction method based on multilevel Hash coding
CN112767283A (en) Non-uniform image defogging method based on multi-image block division
CN114972036A (en) Blind image super-resolution reconstruction method and system based on fusion degradation prior
Hang et al. Prinet: A prior driven spectral super-resolution network
CN113870372B (en) A video hair color conversion method based on deep learning
Que et al. Integrating spectral and spatial bilateral pyramid networks for pansharpening
CN111723840A (en) A method for clustering and style transfer of ultrasound images
CN110443754B (en) Method for improving resolution of digital image
Cheng et al. Characteristic regularisation for super-resolving face images
Lin et al. PixWizard: Versatile image-to-image visual assistant with open-language instructions
CN111275620A (en) Image super-resolution method based on Stacking ensemble learning
CN114663802B (en) Cross-modal video migration method of surveillance videos based on feature spatiotemporal constraints
Neagoe An optimum 2d color space for pattern recognition.
Muqeet et al. Video face re-aging: Toward temporally consistent face re-aging
CN114638761A (en) Hyperspectral image panchromatic sharpening method, device and medium
Kim et al. RPF: Reference-based progressive face super-resolution without losing details and identity
Wu et al. RUN: Rethinking the UNet Architecture for Efficient Image Restoration
Wang et al. Dual Degradation-Inspired Deep Unfolding Network for Low-Light Image Enhancement

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant