CN116342379A - Flexible and diverse face image aging generation system - Google Patents

Flexible and diverse face image aging generation system

Info

Publication number
CN116342379A
Authority
CN
China
Prior art keywords
age
aging
condition
image
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310338136.2A
Other languages
Chinese (zh)
Inventor
李佩佩 (Li Peipei)
何召锋 (He Zhaofeng)
王锐 (Wang Rui)
曹春水 (Cao Chunshui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202310338136.2A
Publication of CN116342379A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/04 Context-preserving transformations, e.g. by using an importance map
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/18 Image warping, e.g. rearranging pixels individually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4046 Scaling of whole images or parts thereof using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/178 Estimating age from a face image; using age information for improving recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person
    • G06T 2207/30201 Face
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the technical field of image processing and provides a flexible and diverse face image aging generation system. The system comprises: an acquisition unit for acquiring an original input image $x_{src}$, a reference image $x_{ref}$, and a predefined aging text $t_{ref}$; a CLIP encoder for mapping the reference image $x_{ref}$ and the aging text $t_{ref}$ into the CLIP latent space to obtain the latent vectors $e_{img}$ and $e_{txt}$, respectively; a probabilistic age prediction unit that applies a KL-divergence constraint with the text prior $\mathcal{N}(e_{txt}, I)$ and derives from $e_{img}$ a probabilistic representation of the aging condition, $e_{age} = \mathcal{N}(\mu_\phi(e_{img}), \sigma_\phi^2(e_{img})\,I)$; a diffusion autoencoder for encoding the original input image $x_{src}$ into a semantic condition $z_{src}$; and a first diffusion decoder for decoding the semantic condition $z_{src}$, the step-$t$ noisy image $x_t$ from the pre-trained diffusion autoencoder, and the aging condition $e_{age}$ into a denoised, age-edited image $p$. This technical scheme addresses the low flexibility of face aging in the prior art.

Description

Flexible and diverse face image aging generation system
Technical Field
The invention relates to the technical field of image processing, and in particular to a flexible and diverse face image aging generation system.
Background
Face aging aims to preserve a person's identity while simulating how the face would appear at different ages; it has practical applications in age estimation, cross-age face recognition, film and television production, and medical cosmetology. The rapid development of deep learning has driven face aging research over the past decade. At present, face aging faces three main problems. First, earlier GAN-based methods often struggle to generate high-quality aging results robustly, and many outputs exhibit obvious artifacts in practice. Second, existing aging methods usually take a fixed age label as input, which greatly limits the flexibility of face aging. Third, previous methods ignore the diversity of aging: because aging is affected by complex environmental factors, producing only a single aging pattern is unrealistic. All three problems remain to be solved.
Disclosure of Invention
The invention provides a flexible and diverse face image aging generation system, which solves the problem of low face aging flexibility in the related art.
The technical scheme of the invention is as follows. The system comprises:
an acquisition unit for acquiring an original input image $x_{src}$, a reference image $x_{ref}$, and a predefined aging text $t_{ref}$;

a CLIP encoder for mapping the reference image $x_{ref}$ and the aging text $t_{ref}$ into the CLIP latent space to obtain the latent vectors $e_{img}$ and $e_{txt}$, respectively;

a probabilistic age prediction unit that applies a KL-divergence constraint with the text prior $\mathcal{N}(e_{txt}, I)$ and derives from $e_{img}$ a probabilistic representation of the aging condition, $e_{age} = \mathcal{N}(\mu_\phi(e_{img}), \sigma_\phi^2(e_{img})\,I)$, where $\mathcal{N}(0, I)$ denotes the standard normal distribution, $\mu_\phi$ the mean, $\sigma_\phi^2$ the variance, and $\phi$ the network parameters;

a diffusion autoencoder for encoding the original input image $x_{src}$ into a semantic condition $z_{src}$; and

a first diffusion decoder for decoding the semantic condition $z_{src}$, the step-$t$ noisy image $x_t$ from the pre-trained diffusion autoencoder, and the aging condition $e_{age}$ into a denoised, age-edited image $p$.
The working principle and the beneficial effects of the invention are as follows:
Because images and text as aging conditions better match human intuition and cognition, the invention first maps the reference image $x_{ref}$ and the predefined aging text $t_{ref}$ into the CLIP latent space through a pre-trained CLIP encoder, obtaining the corresponding representations $e_{img}$ and $e_{txt}$ and exploiting the highly aligned text-image structure of the CLIP latent space. The aging condition is then treated as a sample from a probability distribution, and the text prior is used as a KL-divergence constraint on the probabilistic representation of the aging condition $e_{age}$, realizing aging-condition generation with flexible image-text interaction.
Drawings
The invention will be described in further detail with reference to the drawings and the detailed description.
FIG. 1 is a schematic block diagram of the present invention;
FIG. 2 is a schematic diagram of the probabilistic age prediction unit of the present invention;
FIG. 3 is a schematic diagram of the diffusion autoencoder of the present invention;
FIG. 4 is a schematic diagram of the adaptive modulation module of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below in conjunction with the embodiments. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without inventive effort fall within the scope of the invention.
As shown in FIG. 1, the flexible and diverse face image aging generation system of this embodiment comprises:
an acquisition unit for acquiring an original input image $x_{src}$, a reference image $x_{ref}$, and a predefined aging text $t_{ref}$;

a CLIP encoder for mapping the reference image $x_{ref}$ and the aging text $t_{ref}$ into the CLIP latent space to obtain the latent vectors $e_{img}$ and $e_{txt}$, respectively;

a probabilistic age prediction unit that applies a KL-divergence constraint with the text prior $\mathcal{N}(e_{txt}, I)$ and derives from $e_{img}$ a probabilistic representation of the aging condition, $e_{age} = \mathcal{N}(\mu_\phi(e_{img}), \sigma_\phi^2(e_{img})\,I)$, where $\mathcal{N}(0, I)$ denotes the standard normal distribution, $\mu_\phi$ the mean, $\sigma_\phi^2$ the variance, and $\phi$ the network parameters of the prediction unit;

a diffusion autoencoder for encoding the original input image $x_{src}$ into a semantic condition $z_{src}$; and

a first diffusion decoder for decoding the semantic condition $z_{src}$, the step-$t$ noisy image $x_t$ from the pre-trained diffusion autoencoder, and the aging condition $e_{age}$ into a denoised, age-edited image $p$.
This embodiment first maps the input reference image $x_{ref}$ and the predefined aging text $t_{ref}$ into the CLIP latent space through a pre-trained CLIP encoder, obtaining the latent vectors $e_{img}$ and $e_{txt}$, respectively. Exploiting the highly aligned text-image structure of the CLIP latent space, a lightweight network is designed to generate a probabilistic representation of the aging condition, as shown in FIG. 2. The lightweight network contains a second multi-layer perceptron (MLP): the embedding vector $e_{img}$ is fed into this MLP to obtain the mean $\mu_\phi(e_{img})$ and the variance $\sigma_\phi^2(e_{img})$, and the aging condition is treated as a sample from the resulting distribution, $e_{age} = \mathcal{N}(\mu_\phi(e_{img}), \sigma_\phi^2(e_{img})\,I)$, constrained toward the text prior $\mathcal{N}(e_{txt}, I)$ by a KL divergence. The semantic encoder of a pre-trained diffusion autoencoder then encodes the original input image $x_{src}$ into the semantic condition $z_{src}$, which contains subject-aware features of the image. Finally, the semantic condition $z_{src}$ is combined with the age condition $z_{age}$ obtained from the aging encoder, and the first diffusion decoder iteratively runs the reverse process ($T$ steps in total) $p(x_{t-1} \mid x_t, z_{tar}, t)$ to obtain the age-edited image, where $p(x_T) = \mathcal{N}(0, I)$ and $z_{tar} = z_{src} + z_{age}$. The first diffusion decoder is a pre-trained diffusion decoder.
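For illustration only, a minimal PyTorch-style sketch of such a probabilistic age prediction head follows. The hidden width, layer layout, and the use of a log-variance head are assumptions not specified in the patent; only the mean/variance heads, the reparameterized sample, and the KL term toward the text prior are taken from the description.

    import torch
    import torch.nn as nn

    class ProbabilisticAgePredictor(nn.Module):
        """Maps a CLIP image embedding e_img to the aging condition e_age."""
        def __init__(self, clip_dim: int = 512, hidden: int = 512):
            super().__init__()
            self.backbone = nn.Sequential(nn.Linear(clip_dim, hidden), nn.ReLU())
            self.mu_head = nn.Linear(hidden, clip_dim)      # mean mu_phi(e_img)
            self.logvar_head = nn.Linear(hidden, clip_dim)  # log sigma_phi^2(e_img)

        def forward(self, e_img: torch.Tensor):
            h = self.backbone(e_img)
            mu, logvar = self.mu_head(h), self.logvar_head(h)
            # Reparameterized sample: e_age ~ N(mu, sigma^2 I)
            e_age = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            return e_age, mu, logvar

    def text_prior_kl(mu, logvar, e_txt):
        # Closed-form KL( N(mu, sigma^2 I) || N(e_txt, I) ), averaged over the batch
        var = torch.exp(logvar)
        return 0.5 * (var + (mu - e_txt) ** 2 - 1.0 - logvar).sum(dim=-1).mean()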
In recent years, the introduction of the DDPM (Denoising Diffusion Probabilistic Model) has greatly advanced the field of image generation. A DDPM is trained through a forward noising process and a reverse denoising process. In the forward process, the DDPM repeatedly adds noise with fixed parameters to an image until it approximates random Gaussian noise; in the reverse process, a network is trained to repeatedly predict and remove the noise, restoring the original image. Once trained, the DDPM can generate images by sampling from Gaussian noise. In this embodiment, the pre-trained diffusion autoencoder additionally trains a semantic encoder on top of the DDPM, and the semantic encoder's features are used as the condition in the reverse process; its structure is shown in FIG. 3.
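As background, a minimal sketch of the DDPM forward noising step and noise-prediction objective follows; the linear schedule and the conditioning interface of eps_model are illustrative assumptions, not the patent's implementation.

    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product alpha_bar_t

    def forward_noise(x0: torch.Tensor, t: torch.Tensor):
        # q(x_t | x_0): x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
        eps = torch.randn_like(x0)
        ab = alpha_bar[t].view(-1, 1, 1, 1)
        return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps, eps

    def ddpm_loss(eps_model, x0, z_cond):
        # Train the network to predict the injected noise, conditioned on z_cond
        t = torch.randint(0, T, (x0.shape[0],))
        x_t, eps = forward_noise(x0, t)
        return ((eps_model(x_t, t, z_cond) - eps) ** 2).mean()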
It should be noted that, during training and sampling, the text prior can be used to apply a small perturbation to the text vector and obtain the aging condition directly, i.e., $e_{age} = e_{txt} + \sigma \cdot \eta$ with $\eta \sim \mathcal{N}(0, I)$. This helps the model learn to use text and image information jointly to guide aging generation, making the generation of aging conditions more flexible.
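A one-line illustration of this text-only conditioning path; the scale sigma is an assumed hyperparameter.

    import torch

    def text_only_condition(e_txt: torch.Tensor, sigma: float = 0.1):
        # e_age = e_txt + sigma * eta, with eta ~ N(0, I); sigma is an assumed value
        return e_txt + sigma * torch.randn_like(e_txt)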
Further, an adaptive modulation module is provided for adaptively fusing the aging condition $e_{age}$ with the semantic condition $z_{src}$ to obtain the age condition $z_{age}$; the age condition $z_{age}$, in place of the aging condition $e_{age}$, is input to the decoder to obtain the age-edited image $p$. The adaptive fusion specifically comprises:

using a multi-layer perceptron (MLP) to map the aging condition $e_{age}$ into the latent space of the diffusion autoencoder, obtaining the mapping vector $\Delta z_{age}$;

learning the weight parameters $\gamma_\theta$ and $\beta_\theta$ from the mapping vector $\Delta z_{age}$ through two fully connected layers, and adaptively fusing with the semantic condition $z_{src}$ to obtain the age condition $z_{age}$.

In this embodiment, adaptive modulation maps $e_{age}$ from the CLIP latent space into the latent space of the pre-trained diffusion autoencoder, where it is adaptively fused with the semantic condition $z_{src}$ to obtain diverse aging conditions $z_{age}$.
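The fusion formula itself is not written out in the source; the sketch below assumes the common affine-modulation form z_age = gamma * z_src + beta, with illustrative dimensions.

    import torch
    import torch.nn as nn

    class AdaptiveModulation(nn.Module):
        """Maps e_age into the diffusion latent space and fuses it with z_src."""
        def __init__(self, clip_dim: int = 512, z_dim: int = 512):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(clip_dim, z_dim), nn.ReLU(),
                                     nn.Linear(z_dim, z_dim))  # e_age -> delta_z_age
            self.to_gamma = nn.Linear(z_dim, z_dim)  # FC layer learning gamma_theta
            self.to_beta = nn.Linear(z_dim, z_dim)   # FC layer learning beta_theta

        def forward(self, e_age: torch.Tensor, z_src: torch.Tensor):
            delta_z = self.mlp(e_age)
            gamma, beta = self.to_gamma(delta_z), self.to_beta(delta_z)
            # Assumed affine fusion; the source does not state the exact formula
            return gamma * z_src + beta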
Further, the system further comprises:

a calculation unit for computing the loss function $L$. Specifically, this embodiment proposes training the model with six loss constraints:

$L = L_{tKL} + \lambda_1 L_{age} + \lambda_2 L_{clip} + \lambda_3 L_{id} + \lambda_4 L_{norm} + \lambda_5 L_{rec}$
In this embodiment, the denoising process is performed by the pre-trained conditional diffusion decoder of the diffusion autoencoder: guided by $z_{src}$, the original input image $x_{src}$ can be reconstructed as $\hat{x}_{src}$; similarly, the reference image $x_{ref}$ can be reconstructed as $\hat{x}_{ref}$. To make the obtained aging condition conform to the text prior while avoiding collapse to a fixed value, a text-guided KL-divergence constraint is proposed:

$L_{tKL} = D_{KL}\big(\mathcal{N}(\mu_\phi(e_{img}), \sigma_\phi^2(e_{img})\,I) \,\|\, \mathcal{N}(e_{txt}, I)\big)$
In the actual training process, since the CLIP latent space lies on a hypersphere, a Euclidean-distance constraint is equivalent to a negative-cosine-similarity constraint; the distance term in the KL divergence is therefore relaxed into a negative cosine similarity plus a norm constraint.
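A sketch of this relaxed constraint, under the assumption that the squared-distance term of the closed-form KL is replaced by a negative cosine similarity plus a norm penalty; the exact weighting of these terms is not given in the source.

    import torch
    import torch.nn.functional as F

    def relaxed_text_kl(mu, logvar, e_txt):
        var = torch.exp(logvar)
        # Variance part of KL( N(mu, sigma^2 I) || N(e_txt, I) )
        var_term = 0.5 * (var - 1.0 - logvar).sum(dim=-1)
        # Distance part relaxed to negative cosine similarity plus a norm term
        dist_term = -F.cosine_similarity(mu, e_txt, dim=-1) + mu.norm(dim=-1)
        return (var_term + dist_term).mean()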
To ensure that the aging results meet the target age condition, two losses are proposed: the age-feature contrastive loss $L_{age}$ and the CLIP directional loss $L_{clip}$. The age loss $L_{age}$ is a contrastive loss over cosine similarities $\langle\cdot,\cdot\rangle$ in the feature space of a pre-trained age estimator $f(\cdot)$, computed on the intermediate reconstructions $\hat{x}_{src}$ and $\hat{x}_{ref}$ and the generated result $p$ with margin $m$ (the exact expression is rendered as an image in the original document). Because the intermediate diffusion reconstructions are too blurry, a conventional $L_2$ loss easily causes large age deviations; therefore, in keeping with the characteristics of the diffusion model, the age contrastive loss is used to keep the age of the generated result consistent. In the experiments, $m = 0.25$.
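The source renders $L_{age}$ only as an image; the sketch below assumes a margin-based triplet form over cosine similarities that pulls the age features of the output toward the reference reconstruction and away from the source reconstruction. This specific expression is an assumption.

    import torch.nn.functional as F

    def age_contrastive_loss(f, p, x_hat_src, x_hat_ref, m: float = 0.25):
        # f: pre-trained age estimator used as a feature extractor
        sim_ref = F.cosine_similarity(f(p), f(x_hat_ref), dim=-1)
        sim_src = F.cosine_similarity(f(p), f(x_hat_src), dim=-1)
        # Hinge with margin m: prefer reference-age features over source-age features
        return F.relu(sim_src - sim_ref + m).mean()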
Meanwhile, to avoid the age bias introduced by a single age estimator, the pre-trained large model CLIP provides additional age supervision:

$\Delta I = E_{img}(p) - E_{img}(x_{src})$

$L_{clip} = 1 - \langle \Delta I, \Delta T \rangle$

$\Delta T = E_{txt}(t_{ref}) - E_{txt}(t_{src})$

(the expressions for $\Delta I$ and $L_{clip}$ follow the standard CLIP directional loss; the originals are rendered as images in the source document)
where $E_{txt}(\cdot)$ and $E_{img}(\cdot)$ denote the pre-trained CLIP text and image encoders, respectively; in the experiments the source text $t_{src}$ is chosen as "a face".
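A sketch of this directional supervision, assuming the standard CLIP directional-loss form noted above:

    import torch.nn.functional as F

    def clip_directional_loss(E_img, E_txt, p, x_src, t_ref_tok, t_src_tok):
        # 1 - cos(delta_image, delta_text); E_img/E_txt are the CLIP encoders
        delta_i = E_img(p) - E_img(x_src)
        delta_t = E_txt(t_ref_tok) - E_txt(t_src_tok)
        return 1.0 - F.cosine_similarity(delta_i, delta_t, dim=-1).mean()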
To improve the preservation of age-irrelevant features and the quality of the generated image, three further losses are proposed: $L_{id}$, $L_{norm}$, and $L_{rec}$. Specifically, a pre-trained face recognition model $R(\cdot)$ is used to keep identity features unchanged during aging:

$L_{id} = 1 - \langle R(x_{src}), R(p) \rangle$

where $\langle\cdot,\cdot\rangle$ denotes cosine similarity and $R(\cdot)$ is the feature representation in the feature space of the pre-trained face recognition model (the expression for $L_{id}$ is rendered as an image in the original document; the form above is reconstructed from this description). To ensure generation quality, a regularization loss $L_{norm}$ on the diverse aging conditions is defined as:

$L_{norm} = \| z_{age} \|_2$
To keep the age-irrelevant characteristics of the model unchanged, the reference image is randomly set to be identical to the input image during training, with an $L_1$ constraint:

$L_{rec} = \| p - x_{src} \|_1$

(the expression for $L_{rec}$ is rendered as an image in the original document; the $L_1$ form follows this description)
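A sketch of these three preservation terms, with $L_{id}$ and $L_{rec}$ written in the hedged forms described above (their exact expressions are images in the source):

    import torch.nn.functional as F

    def identity_loss(R, x_src, p):
        # 1 - cosine similarity of face-recognition features (assumed form)
        return (1.0 - F.cosine_similarity(R(x_src), R(p), dim=-1)).mean()

    def norm_loss(z_age):
        return z_age.norm(dim=-1).mean()    # L_norm = ||z_age||_2

    def rec_loss(p, x_src):
        return (p - x_src).abs().mean()     # L1 consistency when x_ref == x_src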
The overall objective function can be summarized as follows:

$L = L_{tKL} + \lambda_1 L_{age} + \lambda_2 L_{clip} + \lambda_3 L_{id} + \lambda_4 L_{norm} + \lambda_5 L_{rec}$

where $\lambda_i$ is the weight of each loss.
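Assembling the overall objective, with placeholder weights lambda_i (their values are unspecified in the source):

    def total_loss(losses: dict, lam: dict):
        # L = L_tKL + sum_i lambda_i * L_i over the five weighted terms
        return losses["tKL"] + sum(lam[k] * losses[k]
                                   for k in ("age", "clip", "id", "norm", "rec"))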
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (3)

1. A flexible and diverse face image aging generation system, characterized by comprising:

an acquisition unit for acquiring an original input image $x_{src}$, a reference image $x_{ref}$, and a predefined aging text $t_{ref}$;

a CLIP encoder for mapping the reference image $x_{ref}$ and the aging text $t_{ref}$ into the CLIP latent space to obtain the latent vectors $e_{img}$ and $e_{txt}$, respectively;

a probabilistic age prediction unit that applies a KL-divergence constraint with the text prior $\mathcal{N}(e_{txt}, I)$ and derives from $e_{img}$ a probabilistic representation of the aging condition, $e_{age} = \mathcal{N}(\mu_\phi(e_{img}), \sigma_\phi^2(e_{img})\,I)$, wherein $\mathcal{N}(0, I)$ denotes the standard normal distribution, $\mu_\phi$ the mean, $\sigma_\phi^2$ the variance, and $\phi$ the network parameters;

a diffusion autoencoder for encoding the original input image $x_{src}$ into a semantic condition $z_{src}$; and

a first diffusion decoder for decoding the semantic condition $z_{src}$, the step-$t$ noisy image $x_t$ of the pre-trained diffusion autoencoder, and the aging condition $e_{age}$ into a denoised, age-edited image $p$.
2. The flexible and diverse face image aging generation system of claim 1, further comprising:

an adaptive modulation module for adaptively fusing the aging condition $e_{age}$ with the semantic condition $z_{src}$ to obtain the age condition $z_{age}$, the age condition $z_{age}$ being input to the decoder in place of the aging condition $e_{age}$ to obtain the age-edited image $p$; the adaptive fusion specifically comprising:

using a multi-layer perceptron (MLP) to map the aging condition $e_{age}$ into the latent space of the diffusion autoencoder, obtaining the mapping vector $\Delta z_{age}$;

learning weight parameters from the mapping vector $\Delta z_{age}$ through two fully connected layers, and adaptively fusing with the semantic condition $z_{src}$ to obtain the age condition $z_{age}$.
3. The flexible and diverse face image aging generation system of claim 1, further comprising:

a calculation unit for computing the loss function $L$, specifically:

$L = L_{tKL} + \lambda_1 L_{age} + \lambda_2 L_{clip} + \lambda_3 L_{id} + \lambda_4 L_{norm} + \lambda_5 L_{rec}$

wherein $L_{tKL}$ denotes the KL-divergence constraint loss, $L_{age}$ the age-feature contrastive loss, $L_{clip}$ the CLIP directional loss, $L_{id}$ the identity loss, $L_{norm}$ the regularization loss, and $L_{rec}$ the consistency loss, and $\lambda_1, \lambda_2, \lambda_3, \lambda_4, \lambda_5$ are the weights of the respective losses;

the KL-divergence constraint loss $L_{tKL}$ is:

$L_{tKL} = D_{KL}\big(\mathcal{N}(\mu_\phi(e_{img}), \sigma_\phi^2(e_{img})\,I) \,\|\, \mathcal{N}(e_{txt}, I)\big)$

the age-feature contrastive loss $L_{age}$ is a margin-$m$ contrastive loss over cosine similarities $\langle\cdot,\cdot\rangle$ in the feature space of a pre-trained age estimator (its exact expression is rendered as an image in the original document), wherein $\hat{x}_{src}$ is obtained by reconstructing the original input image $x_{src}$ with the pre-trained diffusion decoder of the diffusion autoencoder, $\hat{x}_{ref}$ is obtained by reconstructing the reference image $x_{ref}$ with the pre-trained diffusion decoder of the diffusion autoencoder, $p$ is the target image output by the first diffusion decoder, and $m$ is a parameter;

the CLIP directional loss $L_{clip}$ is (form reconstructed from the description; the original expressions are rendered as images):

$\Delta I = E_{img}(p) - E_{img}(x_{src})$

$L_{clip} = 1 - \langle \Delta I, \Delta T \rangle$

$\Delta T = E_{txt}(t_{ref}) - E_{txt}(t_{src})$

wherein $E_{txt}(\cdot)$ and $E_{img}(\cdot)$ denote the pre-trained CLIP text and image encoders, and $t_{src}$ is the selected source text;

the identity loss $L_{id}$ is (form reconstructed from the description):

$L_{id} = 1 - \langle R(x_{src}), R(p) \rangle$

wherein $R(\cdot)$ is the feature representation in the feature space of the pre-trained face recognition model;

the regularization loss $L_{norm}$ is:

$L_{norm} = \| z_{age} \|_2$

and the consistency loss $L_{rec}$ is (form reconstructed from the description):

$L_{rec} = \| p - x_{src} \|_1$
CN202310338136.2A 2023-03-31 2023-03-31 Flexible and diverse face image aging generation system Pending CN116342379A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310338136.2A 2023-03-31 2023-03-31 Flexible and diverse face image aging generation system

Publications (1)

Publication Number Publication Date
CN116342379A 2023-06-27

Family

ID=86885568

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310338136.2A 2023-03-31 2023-03-31 Flexible and diverse face image aging generation system

Country Status (1)

Country Link
CN (1) CN116342379A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116542292A (en) * 2023-07-04 2023-08-04 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of image generation model
CN116542292B (en) * 2023-07-04 2023-09-26 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of image generation model
CN118297820A (en) * 2024-03-27 2024-07-05 北京智象未来科技有限公司 Training method for image generation model, image generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109033095B (en) Target transformation method based on attention mechanism
CN116342379A (en) Flexible and various human face image aging generation system
Gai et al. New image denoising algorithm via improved deep convolutional neural network with perceptive loss
CN113658051A (en) Image defogging method and system based on cyclic generation countermeasure network
Creswell et al. Adversarial information factorization
CN113723295A (en) Face counterfeiting detection method based on image domain frequency domain double-flow network
CN113538608B (en) Controllable figure image generation method based on generation countermeasure network
Zhao et al. CREAM: CNN-REgularized ADMM framework for compressive-sensed image reconstruction
CN115457169A (en) Voice-driven human face animation generation method and system
CN116309890A (en) Model generation method, stylized image generation method and device and electronic equipment
Uddin et al. A perceptually inspired new blind image denoising method using L1 and perceptual loss
CN111564205A (en) Pathological image dyeing normalization method and device
CN114820303A (en) Method, system and storage medium for reconstructing super-resolution face image from low-definition image
US11526972B2 (en) Simultaneously correcting image degradations of multiple types in an image of a face
CN113496460B (en) Neural style migration method and system based on feature adjustment
CN117291232A (en) Image generation method and device based on diffusion model
CN117291850A (en) Infrared polarized image fusion enhancement method based on learnable low-rank representation
Jeon et al. Continuous face aging generative adversarial networks
CN114283181B (en) Dynamic texture migration method and system based on sample
CN115374854A (en) Multi-modal emotion recognition method and device and computer readable storage medium
CN115034965A (en) Super-resolution underwater image enhancement method and system based on deep learning
CN118250411B (en) Light-weight personalized face vision dubbing method
CN115496989B (en) Generator, generator training method and method for avoiding image coordinate adhesion
CN117911246B (en) Multi-mode image super-resolution reconstruction method based on structured knowledge distillation
Eswar et al. Understanding Taxonomy of Generative Models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination