CN112906815A - Method for predicting a human face from voice based on a conditional generative adversarial network - Google Patents
Method for predicting a human face from voice based on a conditional generative adversarial network
- Publication number
- CN112906815A (application CN202110273900.3A)
- Authority
- CN
- China
- Prior art keywords
- network
- data
- face
- sound
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/214—Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/22—Pattern recognition; Matching criteria, e.g. proximity measures
- G06N20/00—Machine learning
- G10L25/30—Speech or voice analysis techniques characterised by the use of neural networks
- G10L25/48—Speech or voice analysis techniques specially adapted for particular use
Abstract
The invention provides a method for predicting a human face from voice based on a conditional generative adversarial network (CGAN), comprising the following steps: a data construction step, in which voice data and face data are collected, the data are cleaned, and one-hot labels are generated from age and gender annotations; a voice classification model design and training step, in which Mel-spectrum features are extracted from the voice data, and the features and label data are fed into a deep-learning classification network for training to obtain the classification-network weights; a face generation network design and training step, in which the labels and face data are fed into a pretrained conditional generative adversarial network for training to obtain the face-generation-network weights; and a model prediction step, in which preprocessed voice data are fed into the voice classifier to obtain a classification label, and the label is fed into the face generator to obtain the predicted face. The invention belongs to the field of deep-learning applications; it realizes the prediction of a speaker's face image from input voice and fills a gap in this field.
Description
Technical Field
The invention relates to the technical field of deep-learning applications, and in particular to a method for predicting a human face from voice based on a conditional generative adversarial network.
Background
In recent years, the development of deep learning has attracted broad attention, and its applications have penetrated many aspects of daily life. Deep learning grew out of neural-network research; its basic idea is to imitate the human brain in analyzing data and discovering the hidden relations between inputs and outputs. Deep-learning techniques now show impressive results on image processing, natural language processing, audio processing, and other problems, with the performance on image processing being the most remarkable.
Image processing problems can be divided into image detection, image classification, image generation, and so on. The generative adversarial network (GAN) is a promising image generation model whose essence is a game-theoretic adversarial process. A GAN consists of a generator and a discriminator: the generator aims to synthesize fake pictures, while the discriminator aims to distinguish synthesized pictures from real ones, and the two reach an equilibrium through repeated competition. However, the output of the original GAN is uncontrollable. To address this, the conditional generative adversarial network (CGAN) was proposed; its idea is to add a constraint condition to the original network so that the generated pictures satisfy specified requirements. This improvement has greatly promoted the application of GANs in a variety of areas.
Building on the conditional GAN, techniques such as generating pictures from text or from colors have achieved good results, but the field of predicting a face portrait from voice remains unsatisfactory. Existing voice-portrait techniques produce low-resolution pictures that are hard to apply in practical work, and most of them use raw voice features directly as the constraint condition of the GAN, which increases the learning difficulty of the network and yields models with unsatisfactory results.
Disclosure of Invention
To overcome these defects, the invention provides a method for predicting a human face from voice based on a conditional generative adversarial network.
The technical scheme adopted by the invention is as follows:
A method for predicting a human face from voice based on a conditional generative adversarial network, comprising: data construction; voice classification network design and training; face image generation network design and training; and model prediction. The data construction step collects Chinese (mainland China) voice data from the Common Voice data set and Asian face data from the mainstream UTKface data set, cleans the data, and builds one-hot encoded labels for the voice and face data from the annotation data of each database. The voice classification network design and training step designs a network structure using deep-learning classification techniques and trains it on the constructed data to obtain a network model. The face image generation network design and training step applies the principles of the conditional generative adversarial network and trains on the constructed data to obtain a network model. The model prediction step connects the voice classification network and the face image generation network in series, realizing the prediction of a face from voice.
Specifically, the method comprises the following implementation steps:
s1, data construction, wherein Common Voice data set Chinese (mainland China) Voice data and UTKface data set Asian face data are collected; carrying out data cleaning on the voice data and the face image data; according to original age and gender labels in the data set, establishing a one-hot coding label for the voice data and the face image data, and keeping the consistency of coding rules of the voice data and the face image data;
s2, designing and training a sound classification network model, wherein the network model comprises three sub-networks, namely a Mel frequency spectrum transformation network, a pre-trained resnet50 network and a full-connection network; firstly, inputting voice data subjected to data processing into a Mel frequency spectrum conversion network to obtain a Mel frequency spectrum of the voice data; then inputting the Mel frequency spectrum into a pre-trained resnet50 network to obtain sound characteristics with higher accuracy; finally, the output of the resnet50 network is input into a full-connection network after certain data processing, and is output as a predicted one-hot sound classification label; optimizing the similarity between the predicted sound classification label and the real sound coding label, updating the weight of the network, and obtaining a convergent network;
s3, designing and training a face image generation network, wherein the network is a pre-trained CGAN network, random seeds are used as network input, a face one-hot coded label is used as a constraint condition, and a generator and a discriminator of the network are trained simultaneously to balance the two in a game; a generator after network convergence is taken as a face image generation network;
s4, model prediction, namely preprocessing the sound to be predicted and inputting the preprocessed sound into a sound classification network to obtain a one-hot sound classification label; and inputting the classification labels into a human face image generation network to obtain a predicted human face image.
Further, in step S1, the data cleaning proceeds as follows:
S11, remove silent voice segments;
S12, remove voice data and face image data whose annotations are defective;
S13, uniformly clip the voice data to a length of 5 s.
Further, in step S1, one-hot encoded labels are built for the voice and face data. According to the annotations, the labels are divided into eight cases: male under 19, male 19-29, male 30-39, male over 40, female under 19, female 19-29, female 30-39, and female over 40, encoded respectively as 00000001, 00000010, 00000100, 00001000, 00010000, 00100000, 01000000, and 10000000.
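The eight-way label scheme above can be sketched in Python. The bit ordering (male under 19 mapped to 00000001, i.e. the lowest bit) follows the listing above, but the helper itself and the treatment of age exactly 40 (placed in the oldest bucket here, since the listing skips it) are illustrative assumptions, not the patent's code:

```python
def one_hot_label(gender, age):
    """Map (gender, age) to the 8-bit one-hot code shared by voice and face data.

    Classes 0..3 are male (<19, 19-29, 30-39, 40+); classes 4..7 are female.
    Class k is encoded with a 1 in bit position k from the right, matching
    the codes 00000001 ... 10000000 listed in the patent.
    """
    if age < 19:
        bucket = 0
    elif age < 30:
        bucket = 1
    elif age < 40:
        bucket = 2
    else:
        bucket = 3          # assumption: age 40 itself goes in the oldest bucket
    cls = bucket + (4 if gender == "female" else 0)
    return format(1 << cls, "08b")

print(one_hot_label("male", 15))    # 00000001
print(one_hot_label("female", 45))  # 10000000
```

Using the same function for both voice and face annotations is one way to keep the two encoding rules consistent, as step S1 requires.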
Further, in step S2, the specific training steps of the voice classification network are as follows:
S21, feed the processed voice data into the Mel-spectrum conversion network, which is implemented with wrapper functions from the librosa toolkit;
S22, feed the extracted Mel spectra into the pretrained ResNet50 network to obtain more discriminative voice features;
S23, apply max pooling to the ResNet50 output and feed the result into the fully connected layer to obtain the predicted one-hot label;
S24, compute the cross-entropy loss from the predicted one-hot label and the true one-hot label, and update the parameters of the ResNet50 network and the fully connected layer;
S25, repeat steps S21 to S24 until the training iteration count is reached, then stop training and save the classification network.
Further, in step S3, the face image generation network is trained as follows: a random seed and the one-hot encoded face label serve as input to the CGAN generator, whose output is a generated random face picture; the random face picture, the one-hot face label, and a real face picture are fed into the CGAN discriminator, whose output value judges whether the generator's synthesized picture is realistic and whether it conforms to the label constraint; the generator and discriminator are trained simultaneously, and the network weights are updated by optimizing a loss function until the network is balanced; the converged CGAN generator is taken as the face image generation network.
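The balance between generator and discriminator described above comes from their opposing losses. A scalar sketch follows, assuming a sigmoid discriminator output D(x | label) in (0, 1) and the standard non-saturating GAN losses; the patent does not spell out its loss functions, so this is an illustration of the usual CGAN objective, not the patent's training code:

```python
import math

def d_loss(d_real, d_fake):
    """Discriminator loss: push D(real|label) -> 1 and D(fake|label) -> 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake):
    """Non-saturating generator loss: push D(fake|label) -> 1."""
    return -math.log(d_fake)

# Early in training the discriminator easily spots fakes (d_fake small),
# so the generator loss is large; at equilibrium D outputs ~0.5 for both.
print(round(g_loss(0.05), 3))      # 2.996 - generator doing badly
print(round(g_loss(0.5), 3))       # 0.693 - near equilibrium
print(round(d_loss(0.5, 0.5), 3))  # 1.386 - D at equilibrium, 2*log(2)
```

Conditioning both networks on the same label is what forces the converged generator to respect the age/gender constraint at prediction time.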
In summary, the invention discloses a method for predicting a human face from voice based on a conditional generative adversarial network. Its beneficial effects are as follows: it fills a gap in the voice-portrait field using deep-learning voice classification and face image generation techniques. Converting the voice features into classification labels, and then using those labels as the constraint condition of the generative adversarial network, reduces the learning difficulty of the network and improves the quality of the generated pictures.
Drawings
FIG. 1 is an overall design block diagram of the method for predicting a human face from voice based on a conditional generative adversarial network
FIG. 2 is a flow chart of the data construction of the method for predicting a human face from voice based on a conditional generative adversarial network
FIG. 3 is a flow chart of the training of the voice classification network in the method for predicting a human face from voice based on a conditional generative adversarial network
FIG. 4 is a flow chart of the model prediction of the method for predicting a human face from voice based on a conditional generative adversarial network
Detailed Description
The present invention is described in further detail below with reference to the drawings and specific embodiments. It should be understood that the described embodiments are only some embodiments, not all of them.
In the field of image generation, existing voice-portrait techniques suffer from low-quality generated images and unsatisfactory model learning. The invention discloses a method for predicting a human face from voice based on a conditional generative adversarial network, which decomposes the voice-portrait problem into two stages, predicting classification labels from voice and then generating face images from those labels, thereby reducing the model's learning difficulty and yielding face images of higher resolution.
This embodiment is based on the TensorFlow framework and the PyCharm development environment. TensorFlow is an open-source Python machine-learning library containing various toolkits suited to deep-learning algorithms; it builds neural-network models efficiently and flexibly and is one of the mainstream programming frameworks at present.
The embodiment discloses a method for predicting a human face from voice based on a conditional generative adversarial network. As shown in FIG. 1, the design process is mainly as follows:
S1, construct the training data: collect Chinese (mainland China) voice data from the Common Voice data set and Asian face data from the UTKface data set, process the voice and face image data respectively, and make one-hot encoded labels from the original age and gender labels;
S2, design and train the voice classification network, which is divided into three sub-networks: a Mel-spectrum conversion network, a pretrained ResNet50 network, and a fully connected classification network; the processed voice data serve as network input, and training updates the network weights by optimizing the similarity between the predicted classification label and the one-hot encoded label;
S3, design and train the face image generation network, a pretrained CGAN divided into a generator and a discriminator; a random seed and the one-hot encoded face label serve as generator input, and the output is a random face image; the random face image, the one-hot face label, and a real face image serve as discriminator input, and the output value judges whether the generated image is realistic and satisfies the constraint; the generator and discriminator are trained simultaneously, and after convergence the generator is taken as the face image generation network;
S4, model prediction: connect the voice classification network trained in S2 and the face image generation network trained in S3 in series; the voice to be predicted is preprocessed and fed into the voice classification network to obtain a voice classification label, which then serves as the constraint condition of the face generation network to obtain the predicted face image.
Specifically, as shown in FIG. 2, the data construction process of the method is as follows:
Step 1, collect 78 hours of Chinese (mainland China) voice data from the Common Voice audio data set and 3440 Asian face images from the UTKface data set;
Step 2, remove silent segments from the voice data;
Step 3, remove voice data and face image data whose annotations are incomplete;
Step 4, uniformly clip the voice data to a length of 5 s;
Step 5, construct the one-hot encoded labels for the voice and face data from the original data-set labels. The labels are divided into eight cases: male under 19, male 19-29, male 30-39, male over 40, female under 19, female 19-29, female 30-39, and female over 40, encoded respectively as 00000001, 00000010, 00000100, 00001000, 00010000, 00100000, 01000000, and 10000000.
Specifically, as shown in FIG. 3, the training process of the voice classification network is as follows:
Step 1, use the processed voice data as network input. Because the amount of audio data is large, the voice data are divided into a training set and a test set at a ratio of 100:1, and the training-set data serve as the network input;
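The 100:1 split of Step 1 can be sketched as follows. The patent does not specify the splitting code, so the shuffling, seeding, and rounding here are illustrative choices:

```python
import random

def split_100_to_1(samples, seed=0):
    """Shuffle, then hold out roughly 1 of every 101 samples as the test set (100:1)."""
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, len(shuffled) // 101)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

train, test = split_100_to_1(list(range(1010)))
print(len(train), len(test))  # 1000 10
```

A fixed held-out test set is what Step 6 below uses for the periodic accuracy checks every 200 training rounds.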
Step 2, extract the Mel-spectrum features of the voice data. Note that the Mel-spectrum conversion network is composed of wrapper functions from the librosa toolkit;
Step 3, feed the extracted Mel spectra into the pretrained ResNet50 network, which outputs more discriminative voice features. Note that the pretrained ResNet50 network can be obtained from the architecture wrapped by the Keras module in TensorFlow 2.0;
Step 4, apply max pooling to the ResNet50 output;
Step 5, use the data processed in Step 4 as input to the fully connected network to obtain the predicted classification label. Note that the rule for this label is consistent with the one-hot encoding rule used in data construction;
Step 6, compute the loss from the predicted classification label and the true one-hot encoded label, optimize the network parameters, and save the network model. Note that a network performance test is run every 200 training rounds, with the test-set data as input, to measure the network's accuracy; the loss used during training is the cross-entropy loss;
Step 7, repeat Steps 2 to 6 until the training iteration count is reached, then stop training and save the network as the voice classifier.
Specifically, in this method the face image generation network is the generator of a pretrained CGAN. An official open-source CGAN can be downloaded from the GitHub open-source code repository; the processed face image data and the one-hot encoded face labels serve as training data, the generator and discriminator of the network are trained simultaneously, and after the network converges the generator is taken as the face image generation network of this embodiment.
Specifically, as shown in FIG. 4, the model prediction process of the method is as follows:
Step 1, preprocess the voice data to be predicted. Note that preprocessing includes checking whether the voice data are valid, reporting an error for silent voice data, and clipping the voice data to a length of 5 s;
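The checks in Step 1 can be sketched as follows, assuming mono samples in [-1, 1] at a known sample rate. The silence threshold and function name are illustrative assumptions; the patent specifies only the behavior (reject silent input, clip to 5 s):

```python
def preprocess(samples, sample_rate, clip_seconds=5, silence_threshold=1e-3):
    """Validate and clip a voice signal, as in prediction-time preprocessing.

    Raises ValueError for empty or near-silent input, then clips the signal
    to at most `clip_seconds` of audio.
    """
    if not samples:
        raise ValueError("invalid voice data: empty signal")
    if max(abs(s) for s in samples) < silence_threshold:
        raise ValueError("invalid voice data: silent signal")
    return samples[: clip_seconds * sample_rate]

sr = 16000
signal = [0.1] * (sr * 7)      # 7 s of placeholder audio
clipped = preprocess(signal, sr)
print(len(clipped) / sr)       # 5.0
```

Raising an explicit error on silent input mirrors the "error notification" behavior the embodiment describes, rather than silently producing an arbitrary face.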
Step 2, feed the processed voice data into the trained voice classification network and output the voice classification label. Note that the label is a one-hot code representing the age and gender attributes of the voice;
Step 3, feed the voice classification label into the face image generation network and output the predicted face image.
Because the voice and face data used here are Asian, Chinese-language data, the method as trained applies only to Asian speakers of Chinese. With different training data chosen according to the actual application scenario, the method can be generalized to face prediction for any language.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make, use, or implement the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit of the invention. Therefore, the present invention is not limited to the embodiments described herein; other embodiments obtained without inventive effort are within its scope.
Claims (6)
1. A method for predicting a human face from voice based on a conditional generative adversarial network, the method comprising the steps of:
S1, data construction: collecting voice data, performing data cleaning, and making one-hot label codes from the speakers' age and gender labels, the labels comprising four age classes and two gender classes in total; collecting face image data, performing data cleaning, making one-hot label codes from the annotated age and gender of the faces, and keeping the rules for making the voice label data and the face label data consistent;
S2, voice classification network model design and training: the model is divided into three sub-networks, namely a Mel-spectrum conversion network for extracting large-scale voice features, a pretrained ResNet50 network for recognizing those features, and a fully connected network for classifying the voice data according to the recognized features; the processed voice data serve as input, and the similarity between the network's classification output and the voice label code is optimized until the voice classification network model converges;
S3, face generation network design and training: the network consists of a pretrained CGAN; random seeds and the face label data serve as input, the generator and discriminator of the CGAN are brought into game-theoretic balance, and the face generation network converges;
S4, model prediction: preprocessing the voice data and feeding them into the voice classification network to obtain the corresponding label code; and feeding the label code into the face generation network to output the predicted speaker face image.
2. The method according to claim 1, wherein the voice data in step S1 are collected from the Common Voice open-source data set, which contains original age and gender labels, and the face image data are collected from the Asian face data in the UTKface open-source data set, which likewise contains original age and gender labels.
3. The method according to claim 1, wherein the data cleaning in step S1 comprises: removing silent voice segments; removing voice data and face image data whose annotations are defective; and clipping the voice data to a uniform length.
4. The method according to claim 1, wherein the one-hot label codes in step S1 are divided into eight cases: male under 19, male 19-29, male 30-39, male over 40, female under 19, female 19-29, female 30-39, and female over 40, encoded respectively as 00000001, 00000010, 00000100, 00001000, 00010000, 00100000, 01000000, and 10000000.
5. The method according to claim 1, wherein step S2 proceeds as follows: first, the processed voice data serve as input and the Mel-spectrum conversion network extracts the Mel-spectrum features of the voice; the feature spectra are then fed into the pretrained ResNet50 network to obtain the voice feature representation; finally, the ResNet50 output, after processing, is fed into the fully connected network to obtain the voice classification label; the similarity between the output classification label and the one-hot encoded label is optimized, and the classification-network weights are updated.
6. The method according to claim 1, wherein step S3 proceeds as follows: random noise and the face label data serve as input to the CGAN generator, whose output is a random face image; the random face image, the face label data, and real face image data serve as input to the CGAN discriminator, whose output value judges whether the image produced by the generator is realistic and satisfies the label data; the generator and discriminator are trained simultaneously and the network weights are updated; after the network stabilizes, the generator is taken out and used as the face generation network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110273900.3A CN112906815A (en) | 2021-03-15 | 2021-03-15 | Method for predicting human face by sound based on condition generation countermeasure network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112906815A true CN112906815A (en) | 2021-06-04 |
Family
ID=76105021
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110273900.3A Pending CN112906815A (en) | 2021-03-15 | 2021-03-15 | Method for predicting human face by sound based on condition generation countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112906815A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114912539A (en) * | 2022-05-30 | 2022-08-16 | 吉林大学 | Environmental sound classification method and system based on reinforcement learning |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |