CN117523626A - Pseudo RGB-D face recognition method - Google Patents

Pseudo RGB-D face recognition method

Info

Publication number
CN117523626A
Authority
CN
China
Prior art keywords
image
rgb
face
face recognition
pseudo
Prior art date
Legal status
Pending
Application number
CN202210959034.8A
Other languages
Chinese (zh)
Inventor
Jin Bo (金波)
Nuno Gonçalves (努诺·贡萨尔维斯)
Current Assignee
Nuno Gonçalves
Original Assignee
Nuno Gonçalves
Priority date
Filing date
Publication date
Application filed by Nuno Gonçalves
Priority to CN202210959034.8A
Publication of CN117523626A
Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer image processing, and particularly relates to a face recognition method based on pseudo-depth generation and fusion. The method comprises the following steps: S100: preprocessing; S200: label addition; S300: pseudo depth map generation; S400: image fusion; S500: training of a pseudo RGB-D face recognition model; S600: pseudo RGB-D face recognition testing. Experiments show that the pseudo RGB-D face recognition mode achieves higher accuracy than ordinary RGB face recognition. In addition, the invention also provides a training method for D+GAN, an image translation model based on a generative adversarial network that utilizes multiple face attributes. Experiments show that the resulting model can generate face depth maps of higher quality. The advantage of the invention is that big data are fully exploited to extract third-dimension spatial information from two-dimensional face images, which improves the robustness and accuracy of face recognition and the performance of face recognition tasks.

Description

Pseudo RGB-D face recognition method
Technical Field
The invention belongs to the fields of machine learning and image processing, relates to a face recognition method based on a pseudo RGB-D mode, and particularly relates to a depth-image generation and fusion method based on a generative adversarial network.
Background
Darwin's theory proposes natural selection, i.e. survival of the fittest and elimination of the rest. The genetic characteristics of organisms that adapt to the environment are preserved by natural selection, and this principle has had a profound influence on academic research. Today, nearly all higher organisms have two eyes, which allow stereoscopic localization that is critical for foraging; in contrast, most monocular organisms have gone extinct. Thanks to a great deal of past experience, humans can still localize three-dimensionally with one eye for a period of time.
In recent decades, biometric recognition technology has attracted the attention of researchers because of its unique, stable, versatile and hard-to-counterfeit characteristics. Because of its non-invasiveness, face recognition has become the most user-friendly biometric identification method, which has led to its widespread use [2][3]. However, the accuracy of face recognition based on RGB images is generally affected by many factors, such as illumination conditions, age and head pose changes. Human vision is three-dimensional, in contrast to the most common 2D face images, which lack spatial stereoscopic information about the face. The importance of facial spatial information is beyond doubt. Recent advances in, and the popularity of, inexpensive RGB-D sensors have made it possible to utilize 3D information. Compared with RGB face recognition, RGB-D face recognition requires depth images captured by depth sensors such as Kinect and PrimeSense, and its accuracy is better thanks to the effective utilization of spatial features [8][9]. In modern society, face recognition systems, while convenient, raise many information security and privacy concerns. Furthermore, RGB-D data has no widely adopted file format, and RGB-D cameras are far less common than ordinary cameras. Thus, RGB-D face images are not easily collected and are much scarcer than RGB face images.
Disclosure of Invention
The advent of machine learning allows computers to simulate the human learning process and learn from historical experience to make predictions. It may therefore be possible, by using a machine learning algorithm, to let a model effectively predict a depth map from its corresponding RGB image. With the development of big data and the improvement of computer hardware, deep learning techniques, widely applied in scientific and industrial fields in recent years, exhibit stronger inference performance than traditional machine learning models. Monocular depth estimation inspires us to acquire 3D information from 2D face images through deep learning. The aim of the invention is to acquire three-dimensional spatial information from two-dimensional RGB face images through a big-data machine learning model, so as to improve the accuracy of two-dimensional face recognition. The invention is referred to as a pseudo RGB-D face recognition method. The overall framework is shown in FIG. 1.
The generative adversarial network (GAN) proposed by Ian Goodfellow et al. is a model that learns a mapping from random noise vectors to output images. The original GAN is composed of two parts, a generator and a discriminator. The purpose of the generator is to map input Gaussian noise to a fake image, and the discriminator determines whether an input image comes from the generator, i.e. it computes the probability that the input image is fake. The conditional generative adversarial network (cGAN) proposed by Mehdi Mirza and Simon Osindero is a supervised model that can generate output images satisfying required conditions from random noise. Pix2Pix, proposed by Phillip Isola et al., can be seen as a special case of cGAN: it uses 2D images as the input conditions of the cGAN to achieve image-to-image translation. ACGAN, proposed by Augustus Odena, Christopher Olah and Jonathon Shlens, not only judges the authenticity of the input image but also classifies the category of the input image in the discriminator part.
In order to better adapt to the task of generating the corresponding depth map from an RGB face image, we comprehensively draw on these network structures, combine them with several advanced techniques, and propose D+GAN. FIG. 2 shows the main structures of cGAN, Pix2Pix, ACGAN and D+GAN and concisely illustrates the differences between the main structure of D+GAN and those of the other GANs. They all control the generated image by introducing external conditions. For cGAN and ACGAN, the generator generates fake samples from random noise and the conditions. For Pix2Pix, the generator generates a fake image from an image that can be regarded as the condition. For D+GAN, the generator generates a fake image from the conditional image and its corresponding labels. For cGAN and Pix2Pix, the discriminator determines whether a sample is a real sample satisfying the condition. For ACGAN, the discriminator determines not only whether the sample is a real sample satisfying the condition, but also the category of each sample. For D+GAN, the discriminator determines not only whether the input sample is a real sample corresponding to the conditional image, but also the multiple categories to which each sample belongs.
In practice, images always have different backgrounds, which affects the performance of the processing algorithms. The training image pairs converted from 3D data have a black background. Pseudo-depth face recognition provides a modular algorithmic framework:
F1: the Preprocessing part, whose input is the original RGB or grayscale face image obtained by a color sensor and whose output is the processed image. The preprocessing part includes normalization of the image values, and the image size is unified to 256 x 256. It is important to remove the background of the face image. First, a threshold is calculated using the Otsu method. Then, the image is converted into a binary image by this threshold. The 8-connected objects of the binary image are then labeled to locate the face. Next, the background pixels are replaced with black pixels. Finally, an opening operation, i.e. erosion followed by dilation, is performed to remove small objects and smooth the boundaries of the larger objects in the image.
F2: the Generating Model part, whose input is the processed image and whose output is a pseudo depth map of the same size, in pixel correspondence with the input image. This part provides D+GAN as the preferred embodiment.
F3: the Image Fusion part, whose inputs are the preprocessed RGB image and the pseudo depth image obtained by F2, and whose output is the pseudo RGB-D image after algorithmic fusion. This part provides the NSST algorithm as the preferred embodiment.
F4: feature Extraction the input to the feature extraction section is a fused pseudo RGB-D image. The output of the method is the characteristics for face recognition extracted by the algorithm model.
F5: the Face Recognition part, whose input is the features for face recognition and whose output is the predicted identity labels or the distances between features.
The invention is also characterized in that:
F1: the background removal method for face images in the Preprocessing part; a simple and effective embodiment is used here. First, the threshold is calculated using the Otsu method (MATLAB implementation: level = graythresh(image)). Then, the image is converted into a binary image by this threshold (MATLAB implementation: BW = im2bw(image, level)). The 8-connected objects are labeled to locate the face in the binary image (MATLAB implementation: [L, NUM] = bwlabel(BW, 8)). Next, the background pixels are replaced with black pixels. Finally, an opening operation, i.e. erosion followed by dilation, is performed to remove small objects and smooth the boundaries of the larger objects in the image (MATLAB implementation: image = imopen(image, SE), where SE denotes a structuring element).
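By way of non-limiting illustration, the following Python/OpenCV sketch performs the same preprocessing steps (resizing to 256 x 256, Otsu thresholding, 8-connected component labelling, background blackening, morphological opening and normalization); the OpenCV function names are assumed equivalents of the MATLAB functions referenced above, not the original implementation.

```python
import cv2
import numpy as np

def preprocess_face(img_bgr):
    # Resize to the 256 x 256 size used throughout the method
    img = cv2.resize(img_bgr, (256, 256))
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Otsu threshold (counterpart of MATLAB graythresh)
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Label 8-connected components (counterpart of bwlabel(BW, 8)) and keep
    # the largest foreground component as the face region
    num, labels = cv2.connectedComponents(bw, connectivity=8)
    if num > 1:
        sizes = [(labels == i).sum() for i in range(1, num)]
        face_mask = (labels == (1 + int(np.argmax(sizes)))).astype(np.uint8)
    else:
        face_mask = (bw > 0).astype(np.uint8)

    # Replace background pixels with black pixels
    img[face_mask == 0] = 0

    # Opening (erosion followed by dilation) to remove small objects and
    # smooth the boundaries of the larger objects
    se = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    img = cv2.morphologyEx(img, cv2.MORPH_OPEN, se)

    # Normalize pixel values to [0, 1]
    return img.astype(np.float32) / 255.0
```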
F2: the Generating Model part. Experiments show that the depth maps generated by generative adversarial network (GAN) models have a better effect. We propose D+GAN, an image translation model based on a generative adversarial network that utilizes multiple face attributes; the SSIM, RMSE and PSNR indices show that it generates higher-quality depth maps than other GAN models such as Pix2Pix and CycleGAN. D+GAN also structurally uses advanced techniques such as residual modules, self-attention modules and spectral normalization to improve performance. The D+GAN processing requires that, for the RGB pictures obtained from the color sensor and processed by S100 and, via a face attribute classifier, for the processed standard depth images obtained from the depth sensor, the age, gender and race categories be appended to the end of the file name in order from left to right. For the age category, the 19-39 years old label corresponds to label "0", the 40-60 years old label corresponds to label "1", and the over-60 label corresponds to label "2". For the gender category, the male label corresponds to label "0" and the female label corresponds to label "1". For the race category, the Caucasian (white) label corresponds to label "0", the Mongolian (yellow) label corresponds to label "1", and the Negroid (black) label corresponds to label "2".
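By way of non-limiting illustration, the following Python sketch shows this label convention; the filename separator and the helper name are assumptions, while the label values follow the text.

```python
# Label maps exactly as described in F2/S200 (age, gender, race)
AGE_LABELS = {"19-39": 0, "40-60": 1, "60+": 2}
GENDER_LABELS = {"male": 0, "female": 1}
RACE_LABELS = {"caucasian": 0, "mongolian": 1, "negroid": 2}

def labeled_filename(stem, age_group, gender, race, ext=".png"):
    """Append age, gender and race class indices from left to right, e.g.
    labeled_filename('face_0001', '40-60', 'female', 'mongolian')
    -> 'face_0001_1_1_1.png' (the underscore separator is an assumption)."""
    return f"{stem}_{AGE_LABELS[age_group]}_{GENDER_LABELS[gender]}_{RACE_LABELS[race]}{ext}"
```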
F3: the Image Fusion part. In practical application, NSST first performs multi-scale and multi-directional decomposition of the input image through a non-subsampled pyramid (NSP) and shearing filters. Then, according to the formulated fusion strategy, the decomposed high-frequency and low-frequency subband images are combined into new subband images. Finally, the final fused image is obtained by applying the inverse NSST to the new subband images. In our example, the filter set for the Laplacian pyramid decomposition is "maxflat", the vector representing the decomposition directions is set to [3, 4], and the vector representing the local support of the shearing filters is set to [8, 16, 16]. The fusion coefficient is set to 0.5. The beneficial effect is that the number of channels of the fused image is consistent with that of the color or grayscale image before fusion, so that the advantages of deep-learning face recognition pre-trained models can be fully exploited.
F4: feature Extraction feature extraction part, we typically perform a fine tuning training test of pseudo RGB-D patterns on a pretrained model of large scale RGB face recognition. If the number of the face types of the test set is small, the pre-training model of RGB face recognition is used for directly extracting the characteristics of the pseudo RGB-D image, and then a linear Support Vector Machine (SVM) is trained for testing, so that a better face recognition effect can be obtained.
F5: the Face Recognition part. Experiments show that, with the same algorithm, the pseudo RGB-D face recognition method performs better than RGB face recognition. Moreover, the improvement obtained by using the pseudo RGB-D mode instead of the RGB mode is larger for conventional machine learning (ML) models than for deep learning (DL) models.
Drawings
FIG. 1 is a schematic diagram of the overall modular framework of the pseudo RGB-D face recognition method. RGB Color Sensor / RGB-D Sensor: the image sources; RGB Face Image: the input RGB face image; Preprocessing: resizing, grayscale conversion, background removal and normalization; Input Face Image: the face image fed to the generating model; Generating Model: the depth map generation model (Monodepth, D+GAN, Pix2Pix and CycleGAN are different kinds of depth map generation models); Generated Depth Map(s): the generated depth maps; Image Fusion: image fusion (Weighted Average, LP (Laplacian pyramid), NSST (non-subsampled shearlet transform) and Wavelet are different fusion methods); Pseudo RGB-D Face Images: the fused pseudo RGB-D face images; Feature Extractor: feature extraction (PCA (principal component analysis) and ICA (independent component analysis) are traditional machine learning models; InsightFace and FaceNet are deep learning models); Features: the extracted features; Face Recognition: performed either as a classification problem (identity label prediction) or as a metric learning problem (feature distance comparison).
FIG. 2 is a diagram comparing the main structure of the D+GAN according to the present invention with the structures of the related models cGAN, Pix2Pix and ACGAN; the differences between these structures are described in the Disclosure of Invention above.
Fig. 3 is a schematic block diagram of the system structure of the present invention.
Detailed Description
In order to describe in detail the technical content, structural features, achieved objects and effects of the present invention, the following description is given in connection with the embodiments and the accompanying drawings.
Referring to FIG. 1, in order to solve the technical problem described in the present invention, the method adopted by the present invention comprises the following steps:
S100: an RGB image sample containing a face is obtained; the image sample is first resized to 256 x 256, and the background of the face image sample is then removed so that the background part becomes completely black.
S200: a face attribute classifier is used to obtain the categories to which the RGB face image belongs, and the age, gender and race categories are appended in order from left to right to the file name as labels.
S300: the RGB face image and its labels are input into a trained depth map generation model to obtain a pseudo depth map.
S400: the pseudo depth map and the RGB image are fused by the NSST algorithm to obtain a fused image, which is called a pseudo RGB-D image.
S500: the pseudo RGB-D image is input into a deep-learning face recognition model pre-trained on large RGB face image data sets for fine-tuning training, so as to obtain a deep-learning face recognition model suitable for pseudo RGB-D images.
S600: the pseudo RGB-D face images for testing are input into the deep learning model based on pseudo RGB-D images obtained in S500 to perform the face recognition test. Experiments show that the pseudo RGB-D face recognition mode achieves higher accuracy than the conventional RGB face recognition mode.
The background removal method for face images of step S100, in a simple and effective embodiment, is as follows. First, the threshold is calculated using the Otsu method (MATLAB implementation: level = graythresh(image)). Then, the image is converted into a binary image by this threshold (MATLAB implementation: BW = im2bw(image, level)). The 8-connected objects are labeled to locate the face in the binary image (MATLAB implementation: [L, NUM] = bwlabel(BW, 8)). Next, the background pixels are replaced with black pixels. Finally, an opening operation, i.e. erosion followed by dilation, is performed to remove small objects and smooth the boundaries of the larger objects in the image (MATLAB implementation: image = imopen(image, SE), where SE denotes a structuring element).
In step S200, the age category labels 19-39 years old, 40-60 years old and over 60 years old correspond to labels "0", "1" and "2", respectively. For the gender category, the male label corresponds to label "0" and the female label corresponds to label "1". For the race category, the Caucasian (white) label corresponds to label "0", the Mongolian (yellow) label corresponds to label "1", and the Negroid (black) label corresponds to label "2".
In step S300, referring to FIG. 3, we propose D+GAN, an image translation model based on a generative adversarial network that utilizes multiple face attributes. The generator of the adversarial network uses a U-shaped structure; the encoder of the generator is composed of a plurality of convolution layers with different parameters, activation functions and batch normalization. The encoder part implements feature extraction and feature compression, which is a downsampling process. The decoder part of the generator consists of a plurality of convolution layers with different parameters, deconvolution layers, activation functions and batch normalization. The decoder part realizes the recovery of the feature size, which is an upsampling process. The bottom junction of the U-shaped structure adopts a residual module group design, and each residual module unit comprises a residual module and a self-attention module; the combination of multiple residual modules and self-attention modules is what we call the residual module group. The generator model also uses skip connections between the convolution layers of the encoder and decoder to construct the information flow, which can effectively alleviate the gradient vanishing problem.
In our example, the encoder consists of 8 two-dimensional convolution layers. The number of convolution kernels is set to [64, 128, 256, 256, 512, 512, 512, 512], and the stride is set to [2, 1, 2, 1]. Each convolution layer is followed by a Batch Normalization (BN) layer to normalize the input features and speed up the convergence process, and a ReLU activation function to introduce sparsity into the data and suppress overfitting. The ReLU formula is:
f(x) = max(0, x)
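By way of non-limiting illustration, the following PyTorch sketch builds such an encoder; the 3 x 3 kernel size and the alternating stride pattern beyond the values listed above are assumptions.

```python
import torch
import torch.nn as nn

def make_encoder(in_channels=3,
                 channels=(64, 128, 256, 256, 512, 512, 512, 512),
                 strides=(2, 1, 2, 1, 2, 1, 2, 1)):  # alternating pattern assumed
    layers, prev = [], in_channels
    for ch, st in zip(channels, strides):
        layers += [
            nn.Conv2d(prev, ch, kernel_size=3, stride=st, padding=1),
            nn.BatchNorm2d(ch),     # BN speeds up convergence, as in the text
            nn.ReLU(inplace=True),  # f(x) = max(0, x)
        ]
        prev = ch
    return nn.Sequential(*layers)

# Example: feats = make_encoder()(torch.randn(1, 3, 256, 256))
```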
To fully extract features and increase model capacity, ten groups of residual modules and self-attention modules are used in series between the encoder and decoder of the generator. In our design we use residual blocks instead of the original UNet design. In a residual block H(x), a skip connection changes the original mapping from F(x) to F(x) + x, making the neural network easier to optimize. The number of convolution kernels is 256, the kernel size is 3 x 3, and the stride is set to 1.
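By way of non-limiting illustration, a minimal PyTorch sketch of one residual unit follows; the exact layer ordering inside the block is not fixed by the text and is an assumption.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # skip connection: the mapping becomes F(x) + x
        return self.relu(self.body(x) + x)
```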
The self-attention mechanism can learn from distant blocks, so in our design it is used in both the generator and the discriminator. The self-attention module helps to learn multi-level and long-range dependencies across image regions, which is complementary to the convolution layers. In the self-attention module, the input features x with n channels are converted by convolution operations into queries (Q = W_Q x), keys (K = W_K x) and values (V = W_V x); the numbers of channels of Q, K and V become n/8, n/8 and n, respectively. Next, Q, K and V are flattened along the spatial dimension to obtain q of size m x n/8, k of size m x n/8 and v of size m x n, where m denotes the feature size. The final output of the attention weight distribution is calculated as:
attention(q, k, v) = softmax(q k^T) v
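By way of non-limiting illustration, the following PyTorch sketch implements this self-attention module; the learnable output scale gamma is an assumption commonly used with such modules and is not specified in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)  # n -> n/8 channels
        self.k = nn.Conv2d(channels, channels // 8, 1)  # n -> n/8 channels
        self.v = nn.Conv2d(channels, channels, 1)       # n -> n channels
        self.gamma = nn.Parameter(torch.zeros(1))       # assumed output scale

    def forward(self, x):
        b, n, h, w = x.shape
        m = h * w                                            # feature size m
        q = self.q(x).view(b, n // 8, m).permute(0, 2, 1)    # (b, m, n/8)
        k = self.k(x).view(b, n // 8, m)                     # (b, n/8, m)
        v = self.v(x).view(b, n, m)                          # (b, n, m)
        attn = F.softmax(torch.bmm(q, k), dim=-1)            # softmax(q k^T)
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, n, h, w)
        return self.gamma * out + x                          # residual combination
```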
The D+GAN discriminator consists of a backbone structure for discriminating between real and fake and three branches for identifying the face attributes of the generated image. In the backbone network, to provide more inter-channel information exchange and save computational resources, we insert a self-attention module, as described above, after some of the higher convolution layers and before the branching node. Specifically, there are ten convolution layers, where the numbers of convolution kernels are set to [64, 64, 64, 128, 128, 128, 256, 256, 512] and the strides are set to [2, 1, 1, 2, 1, 1, 2, 1, 1, 2] in order. The size of the convolution kernels is 3 x 3, except for the first layer, which is 5 x 5. To make the training process more stable, we apply spectral normalization in these 10 convolution layers to make the neural network robust to input perturbations.
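By way of non-limiting illustration, the following PyTorch sketch builds such a spectrally normalized backbone with the listed strides; since the text lists ten layers but nine kernel counts, a tenth 512-kernel layer is assumed, as are the LeakyReLU activations and the four-channel (RGB condition plus depth) input.

```python
import torch.nn as nn

def sn_conv(cin, cout, k, s):
    # spectrally normalized convolution: the weight is rescaled by its spectral norm
    return nn.utils.spectral_norm(nn.Conv2d(cin, cout, k, stride=s, padding=k // 2))

def make_backbone(in_channels=4):
    # in_channels=4 assumes the RGB conditional image concatenated with a depth channel
    channels = [64, 64, 64, 128, 128, 128, 256, 256, 512, 512]  # tenth layer assumed
    strides = [2, 1, 1, 2, 1, 1, 2, 1, 1, 2]
    layers, prev = [], in_channels
    for i, (ch, st) in enumerate(zip(channels, strides)):
        k = 5 if i == 0 else 3                 # 5x5 first layer, 3x3 otherwise
        layers += [sn_conv(prev, ch, k, st), nn.LeakyReLU(0.2, inplace=True)]
        prev = ch
    # a self-attention module (see the sketch above) would be inserted before
    # the branching node; its exact placement is not fixed by the text
    return nn.Sequential(*layers)
```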
Spectral normalization: specifically, for a weight matrix W_{m x n} of the neural network, the spectral norm is the largest singular value of the matrix. The maximum singular value σ(W_{m x n}) is defined as
σ(W_{m x n}) = max_{h ≠ 0} ||W_{m x n} h||_2 / ||h||_2,
i.e. the largest singular value of W_{m x n}. In practical application, σ(W_{m x n}) is computed, and during forward propagation in training the weight W_{m x n} is updated to W_{m x n} / σ(W_{m x n}); this is the process of spectral normalization.
The four branch networks take the output of the branching node as input and perform different classification tasks. The first sub-discriminator network is used to determine whether the depth map is real or fake, which is essentially a classification task. Similarly, the second, third and fourth sub-discriminator networks are used to classify age, gender and race, respectively. In detail, the age labels are divided into three categories: 19-39 years old, 40-60 years old and over 60 years old. The gender labels are divided into male and female. The race labels are divided into three categories: Caucasian, Mongolian and Negroid. The four sub-discriminator networks have the same network structure except for the last layer, and each consists of seven two-dimensional convolution layers with a kernel size of 3 x 3. The number of convolution kernels in the first six layers is 512 with a stride of 1, and the number of kernels in the last layer is 2 or 3 with a stride of 2.
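By way of non-limiting illustration, the following PyTorch sketch builds one such branch; the LeakyReLU activations and the final global pooling to per-class logits are assumptions not specified in the text.

```python
import torch.nn as nn

def make_branch(num_classes, in_channels=512):
    layers, prev = [], in_channels
    for _ in range(6):
        # six 3x3 convolutions with 512 kernels and stride 1
        layers += [nn.Conv2d(prev, 512, 3, stride=1, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]
        prev = 512
    # last layer: 2 or 3 kernels (class logit maps), stride 2
    layers.append(nn.Conv2d(prev, num_classes, 3, stride=2, padding=1))
    # pool to per-class logits (an assumption; not specified in the text)
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten()]
    return nn.Sequential(*layers)

# four sub-discriminators: real/fake (2), age (3), gender (2), race (3)
heads = {name: make_branch(n) for name, n in
         [("real_fake", 2), ("age", 3), ("gender", 2), ("race", 3)]}
```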
The discriminator loss L_D consists of two parts. The first part, L_{S,D}, is the standard GAN loss used to distinguish real training samples from generated samples, where X denotes the image to be translated, corresponding to the RGB face image, Y denotes the conditional image, corresponding to the real depth image, p_data denotes the probability distribution of the corresponding data set, and D_1 denotes the output of the first discriminator. For the conditional real image Y and the generated image G(X), the classifiers in the discriminator should be able to predict the classes they belong to. The second part, L_{C,D}, is the classification loss, i.e. the cross-entropy loss of the age, gender and race classifications, where D_i denotes the i-th discriminator and C_i denotes the corresponding label. Overall, the training loss of the discriminator L_D can be expressed as:
L_D = λ_1 L_{S,D} + λ_2 L_{C,D}
For the generator, the loss function L_G comprises several parts. First, the generated samples are expected to fool the discriminator, which gives the adversarial term L_{S,G}. To ensure the similarity of the input and output images of the generator, an L2 reconstruction loss L_{O,G} is introduced. Next, the generator is expected to produce high-quality samples that can be correctly classified by the discriminator; accordingly, the classification loss L_{C,G} is defined analogously to L_{C,D}. Furthermore, to avoid overfitting, a weight regularization term L_{W,G} is introduced. In summary, the training loss of the generator L_G can be expressed as:
L_G = λ_1 L_{S,G} + λ_2 L_{C,G} + λ_3 L_{O,G} + λ_4 L_{W,G}
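By way of non-limiting illustration, the following PyTorch sketch combines the loss terms as above; because the individual formulas are not reproduced in this text, standard choices are assumed for each term (binary cross-entropy for the adversarial terms, cross-entropy for the attribute terms, mean squared error for L_{O,G} and an L2 penalty for L_{W,G}), and the λ values are placeholders.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake, cls_logits_real, cls_labels,
                       lambdas=(1.0, 1.0)):               # placeholder lambdas
    # L_S,D: separate real depth maps from generated ones (adversarial term)
    l_s = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
          F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    # L_C,D: cross-entropy over the age, gender and race branches
    l_c = sum(F.cross_entropy(logits, labels)
              for logits, labels in zip(cls_logits_real, cls_labels))
    return lambdas[0] * l_s + lambdas[1] * l_c

def generator_loss(d_fake, cls_logits_fake, cls_labels, fake_depth, real_depth,
                   generator, lambdas=(1.0, 1.0, 10.0, 1e-4)):   # placeholder lambdas
    l_s = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))  # fool D
    l_c = sum(F.cross_entropy(logits, labels)
              for logits, labels in zip(cls_logits_fake, cls_labels))          # L_C,G
    l_o = F.mse_loss(fake_depth, real_depth)                                   # L_O,G (L2)
    l_w = sum((p ** 2).sum() for p in generator.parameters())                  # L_W,G
    return lambdas[0] * l_s + lambdas[1] * l_c + lambdas[2] * l_o + lambdas[3] * l_w
```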
training the D+GAN optimizer we used the Adadelta optimizer.
The depth map generation models of step S300 include Monodepth, DenseDepth, 3DMM, Pix2Pix, CycleGAN and D+GAN, among which Pix2Pix, CycleGAN and D+GAN are GAN models. Experiments show that, in terms of the indices Structural Similarity (SSIM), Root Mean Square Error (RMSE) and Peak Signal-to-Noise Ratio (PSNR), the depth maps generated by the GAN models are better. The experimental results show that the effect of the embodiment D+GAN is the best, and it can significantly improve RGB face recognition performance.
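By way of non-limiting illustration, the three indices can be computed for a generated depth map against its ground truth as in the following sketch (scikit-image; 8-bit single-channel images assumed).

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def depth_quality(pred: np.ndarray, gt: np.ndarray) -> dict:
    # SSIM between ground-truth and generated depth maps
    ssim = structural_similarity(gt, pred, data_range=255)
    # RMSE in pixel-value units
    rmse = float(np.sqrt(np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)))
    # PSNR in dB
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    return {"SSIM": ssim, "RMSE": rmse, "PSNR": psnr}
```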
Step S400 performs the fusion of the pseudo depth map and the processed RGB image using a technique based on the non-subsampled shearlet transform (NSST). Theoretically, a shearlet system is expressed mathematically as
A_{D,S}(Ψ) = { Ψ_{j,k,l}(x) = |det(D)|^{j/2} Ψ(S^l D^j x − k) : j, l ∈ Z; k ∈ Z^2 },
where j, k and l denote the scale, shift and direction, respectively, and D is an anisotropic dilation matrix. In practical application, NSST first performs multi-scale and multi-directional decomposition of the input image through a non-subsampled pyramid (NSP) and shearing filters, which can be implemented with the MATLAB function [dst, shear_f] = nsst_dec2(x, shear_parameters, lpfilt). Then, according to the formulated fusion strategy, the decomposed high-frequency and low-frequency subband images are combined into new subband images. Finally, the final fused image is obtained by applying the inverse NSST to the new subband images, which can be implemented with the MATLAB function x = nsst_rec2(dst, shear_f, lpfilt). In our example, the filter set for the Laplacian pyramid decomposition is "maxflat", the vector representing the decomposition directions is set to [3, 4], and the vector representing the local support of the shearing filters is set to [8, 16, 16]. The fusion coefficient is set to 0.5.
In steps S500 and S600, we typically perform fine-tuning training and testing in the pseudo RGB-D mode on a model pre-trained for large-scale RGB face recognition. If the number of face classes in the test set is small, the RGB face recognition pre-trained model can be used directly to extract features of the pseudo RGB-D images, and a linear Support Vector Machine (SVM) is then trained for testing, which also obtains a good face recognition effect. Experiments show that, with the same algorithm, the pseudo RGB-D face recognition method performs better than RGB face recognition. Moreover, the improvement obtained by using the pseudo RGB-D mode instead of the RGB mode is larger for conventional machine learning (ML) models, such as Principal Component Analysis (PCA) and Independent Component Analysis (ICA), than for deep learning (DL) models.
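By way of non-limiting illustration, the small-test-set procedure can be sketched as follows; extract_features stands in for a pre-trained feature extractor such as InsightFace or FaceNet and is a hypothetical callable.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_pseudo_rgbd_svm(extract_features, train_images, train_ids):
    # Embed the fused pseudo RGB-D images with an RGB-pretrained extractor
    X = np.stack([extract_features(img) for img in train_images])  # (N, d)
    clf = LinearSVC(C=1.0)            # linear Support Vector Machine
    clf.fit(X, train_ids)
    return clf

def predict_identity(clf, extract_features, image):
    return clf.predict(extract_features(image).reshape(1, -1))[0]
```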

Claims (9)

1. A face recognition method based on pseudo-depth generation and fusion, characterized by comprising the following steps:
S100: an RGB image sample containing a face is obtained; the image sample is first resized to 256 x 256, and the background of the face image sample is then removed so that the background part becomes completely black.
S200: a face attribute classifier is used to obtain the categories to which the RGB face image belongs, and the age, gender and race categories are appended in order from left to right to the file name as labels.
S300: the RGB face image and its labels are input into a trained depth map generation model to obtain a pseudo depth map.
S400: the pseudo depth map and the RGB image are fused by the NSST algorithm to obtain a fused image, which is called a pseudo RGB-D image.
S500: the pseudo RGB-D image is input into a deep-learning face recognition model pre-trained on large RGB face image data sets for fine-tuning training, so as to obtain a deep-learning face recognition model suitable for pseudo RGB-D images.
S600: the pseudo RGB-D face images for testing are input into the deep learning model based on pseudo RGB-D images obtained in S500 to perform the face recognition test. Experiments show that the pseudo RGB-D face recognition mode achieves higher accuracy than the conventional RGB face recognition mode.
2. The face recognition method based on pseudo-depth generation and fusion according to claim 1, wherein the background removal method for face images in step S100 is, in a simple and effective embodiment: first, the threshold is calculated using the Otsu method (MATLAB implementation: level = graythresh(image)); then, the image is converted into a binary image by this threshold (MATLAB implementation: BW = im2bw(image, level)); the 8-connected objects are labeled to locate the face in the binary image (MATLAB implementation: [L, NUM] = bwlabel(BW, 8)); next, the background pixels are replaced with black pixels; finally, an opening operation, i.e. erosion followed by dilation, is performed to remove small objects and smooth the boundaries of the larger objects in the image (MATLAB implementation: image = imopen(image, SE), where SE denotes a structuring element).
3. The face recognition method based on pseudo-depth generation and fusion according to claim 1, wherein in step S200 the age category labels 19-39 years old, 40-60 years old and over 60 years old correspond to labels "0", "1" and "2", respectively; the gender category male label corresponds to label "0" and the female label corresponds to label "1"; and the race category Caucasian (white) label corresponds to label "0", the Mongolian (yellow) label corresponds to label "1", and the Negroid (black) label corresponds to label "2".
4. The face recognition method based on pseudo-depth generation and fusion according to claim 1, wherein the depth map generation models of step S300 include Monodepth, DenseDepth, 3DMM, Pix2Pix, CycleGAN and D+GAN, among which Pix2Pix, CycleGAN and D+GAN are GAN models; experiments show that the depth maps generated by the GAN models have a better effect, the embodiment D+GAN performs best, and it can significantly improve RGB face recognition performance.
5. The face recognition method based on pseudo-depth generation and fusion according to claim 1, wherein the implementation flow of step S400 is as follows: NSST first performs multi-scale and multi-directional decomposition of the input image through a non-subsampled pyramid (NSP) and shearing filters, which can be implemented with the MATLAB function [dst, shear_f] = nsst_dec2(x, shear_parameters, lpfilt); then, according to the formulated fusion strategy, the decomposed high-frequency and low-frequency subband images are combined into new subband images; finally, the final fused image is obtained by applying the inverse NSST to the new subband images, which can be implemented with the MATLAB function x = nsst_rec2(dst, shear_f, lpfilt); in our example, the filter set for the Laplacian pyramid decomposition is "maxflat", the vector representing the decomposition directions is set to [3, 4], the vector representing the local support of the shearing filters is set to [8, 16, 16], and the fusion coefficient is set to 0.5.
6. The face recognition method based on pseudo-depth generation and fusion according to claim 1, wherein in step S500 we typically perform fine-tuning training and testing in the pseudo RGB-D mode on a model pre-trained on a large RGB face recognition data set; if the number of face classes in the test set is small, the RGB face recognition pre-trained model is used directly to extract features of the pseudo RGB-D images, and a linear Support Vector Machine (SVM) is then trained for testing, which also obtains a good face recognition effect.
7. A training method for D+GAN, an image translation model based on a generative adversarial network that utilizes multiple face attributes, characterized by comprising the following steps:
P100: for the RGB pictures obtained from the color sensor and processed by S100 and, via a face attribute classifier, for the processed standard depth images obtained from the depth sensor, the age, gender and race categories are appended in order from left to right to the file name.
P200: a generator of the adversarial network is built. The generator uses a U-shaped structure; the encoder of the generator is composed of a plurality of convolution layers with different parameters, activation functions and batch normalization, and implements feature extraction and feature compression, which is a downsampling process. The decoder part of the generator consists of a plurality of convolution layers with different parameters, deconvolution layers, activation functions and batch normalization, and realizes the recovery of the feature size, which is an upsampling process. The bottom junction of the U-shaped structure adopts a residual module group design, and each residual module unit comprises a residual module and a self-attention module; the combination of multiple residual modules and self-attention modules is called the residual module group. The generator model also uses skip connections between the convolution layers of the encoder and decoder to construct the information flow, which can effectively alleviate the gradient vanishing problem.
P300: a discriminator of the adversarial network is built. The discriminator network not only judges whether the input face image is real or fake, but also classifies the age, gender and race of the input face image. The discriminator backbone network consists of a plurality of convolution layers with different parameters, activation functions, batch normalization, spectral normalization and self-attention modules, and is used to judge whether the input face image was generated by the generator. The three branch networks of the discriminator each consist of a plurality of convolution layers with different parameters, batch normalization and activation functions, and are used to classify the age, gender and race of the face depth image, respectively.
P400: parameters are set, the RGB face images containing face attribute labels and the pixel-wise corresponding depth maps are input into the D+GAN, training is started, and training is stopped when the total loss of the generator and the total loss of the discriminator converge, so as to obtain the trained D+GAN model and its parameters.
8. The training method of D+GAN according to claim 7, wherein in step P200 Batch Normalization (BN) is used to normalize the input features to speed up convergence, and the activation function uses ReLU to introduce sparsity into the data and suppress overfitting; in our design, in the residual block H(x), a skip connection changes the original mapping from F(x) to F(x) + x, making the neural network easier to optimize; and the self-attention module following the residual module helps to learn multi-level and long-range dependencies across image regions, which is complementary to the convolution layers.
9. The training method of D+GAN according to claim 7, wherein in step P300 spectral normalization is applied in the multiple convolution layers of the discriminator backbone network to make the neural network robust to input perturbations, so that the training process is more stable.
CN202210959034.8A 2022-08-04 2022-08-04 Pseudo RGB-D face recognition method Pending CN117523626A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210959034.8A CN117523626A (en) 2022-08-04 2022-08-04 Pseudo RGB-D face recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210959034.8A CN117523626A (en) 2022-08-04 2022-08-04 Pseudo RGB-D face recognition method

Publications (1)

Publication Number Publication Date
CN117523626A true CN117523626A (en) 2024-02-06

Family

ID=89751909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210959034.8A Pending CN117523626A (en) 2022-08-04 2022-08-04 Pseudo RGB-D face recognition method

Country Status (1)

Country Link
CN (1) CN117523626A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117894083A (en) * 2024-03-14 2024-04-16 中电科大数据研究院有限公司 Image recognition method and system based on deep learning



Legal Events

Date Code Title Description
PB01 Publication
DD01 Delivery of document by public notice

Addressee: Jin Bo

Document name: Notice of Publication of Invention Patent Application