CN112906815A - Method for predicting a human face from voice based on a conditional generative adversarial network - Google Patents
Method for predicting a human face from voice based on a conditional generative adversarial network
- Publication number
- CN112906815A (application CN202110273900.3A)
- Authority
- CN
- China
- Prior art keywords
- network
- data
- face
- sound
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/214—Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/22—Pattern recognition; Matching criteria, e.g. proximity measures
- G06N20/00—Machine learning
- G10L25/30—Speech or voice analysis techniques characterised by the use of neural networks
- G10L25/48—Speech or voice analysis techniques specially adapted for particular use
Abstract
The invention provides a method for predicting a human face from voice based on a conditional generative adversarial network (CGAN), comprising the following steps: a data construction step, in which voice data and face data are collected, the data are cleaned, and one-hot labels are generated from age and gender annotations; a voice classification model design and training step, in which Mel-spectrum features are extracted from the voice data, and the features and label data are fed into a deep-learning classification network for training to obtain the classification-network weights; a face generation network design and training step, in which the labels and face data are fed into a pretrained conditional generative adversarial network for training to obtain the face-generation-network weights; and a model prediction step, in which preprocessed voice data are fed into the voice classifier to obtain a classification label, and the label is fed into the face generator to obtain the predicted face. The invention belongs to the field of deep-learning applications; it realizes the prediction of a speaker's face image from input voice and fills a gap in this field.
Description
Technical Field
The invention relates to the technical field of deep-learning applications, and in particular to a method for predicting a human face from voice based on a conditional generative adversarial network.
Background
In recent years, the development of deep learning has attracted broad attention, and its applications have penetrated many aspects of daily life. Deep learning grew out of neural-network research; its basic idea is to imitate the human brain in analyzing data and discovering the hidden relations between inputs and outputs. Deep-learning techniques now show impressive results on image processing, natural language processing, audio processing, and other problems, with the performance on image processing being the most remarkable.
Image processing problems can be divided into image detection, image classification, image generation, and so on. The generative adversarial network (GAN) is a promising image generation model whose essence is a game-theoretic adversarial process. A GAN consists of a generator and a discriminator: the generator aims to synthesize fake pictures, while the discriminator aims to distinguish synthesized pictures from real ones, and the two reach an equilibrium through repeated competition. However, the output of the original GAN is uncontrollable. To address this, the conditional generative adversarial network (CGAN) was proposed; its idea is to add a constraint condition to the original network so that the generated pictures satisfy specified requirements. This improvement has greatly promoted the application of GANs in a variety of areas.
Building on the conditional GAN, techniques such as generating pictures from text or from colors have achieved good results, but the field of predicting a face portrait from voice remains unsatisfactory. Existing voice-portrait techniques produce low-resolution pictures that are hard to apply in practical work, and most of them use raw voice features directly as the constraint condition of the GAN, which increases the learning difficulty of the network and yields models with unsatisfactory results.
Disclosure of Invention
To overcome these defects, the invention provides a method for predicting a human face from voice based on a conditional generative adversarial network.
The technical scheme adopted by the invention is as follows:
A method for predicting a human face from voice based on a conditional generative adversarial network, comprising: data construction; voice classification network design and training; face image generation network design and training; and model prediction. The data construction step collects Chinese (mainland China) voice data from the Common Voice data set and Asian face data from the mainstream UTKface data set, cleans the data, and builds one-hot encoded labels for the voice and face data from the annotation data of each database. The voice classification network design and training step designs a network structure using deep-learning classification techniques and trains it on the constructed data to obtain a network model. The face image generation network design and training step applies the principles of the conditional generative adversarial network and trains on the constructed data to obtain a network model. The model prediction step connects the voice classification network and the face image generation network in series, realizing the prediction of a face from voice.
Specifically, the method comprises the following implementation steps:
s1, data construction, wherein Common Voice data set Chinese (mainland China) Voice data and UTKface data set Asian face data are collected; carrying out data cleaning on the voice data and the face image data; according to original age and gender labels in the data set, establishing a one-hot coding label for the voice data and the face image data, and keeping the consistency of coding rules of the voice data and the face image data;
s2, designing and training a sound classification network model, wherein the network model comprises three sub-networks, namely a Mel frequency spectrum transformation network, a pre-trained resnet50 network and a full-connection network; firstly, inputting voice data subjected to data processing into a Mel frequency spectrum conversion network to obtain a Mel frequency spectrum of the voice data; then inputting the Mel frequency spectrum into a pre-trained resnet50 network to obtain sound characteristics with higher accuracy; finally, the output of the resnet50 network is input into a full-connection network after certain data processing, and is output as a predicted one-hot sound classification label; optimizing the similarity between the predicted sound classification label and the real sound coding label, updating the weight of the network, and obtaining a convergent network;
s3, designing and training a face image generation network, wherein the network is a pre-trained CGAN network, random seeds are used as network input, a face one-hot coded label is used as a constraint condition, and a generator and a discriminator of the network are trained simultaneously to balance the two in a game; a generator after network convergence is taken as a face image generation network;
s4, model prediction, namely preprocessing the sound to be predicted and inputting the preprocessed sound into a sound classification network to obtain a one-hot sound classification label; and inputting the classification labels into a human face image generation network to obtain a predicted human face image.
Further, in step S1, the data cleaning proceeds as follows:
S11, remove silent voice segments;
S12, remove voice data and face image data whose annotations are defective;
S13, uniformly clip the voice data to a length of 5 s.
Further, in step S1, one-hot encoded labels are built for the voice and face data. According to the annotations, the labels are divided into eight cases: male under 19, male 19-29, male 30-39, male over 40, female under 19, female 19-29, female 30-39, and female over 40, encoded respectively as 00000001, 00000010, 00000100, 00001000, 00010000, 00100000, 01000000, and 10000000.
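The eight-way label scheme above can be sketched in Python. The bit ordering (male under 19 mapped to 00000001, i.e. the lowest bit) follows the listing above, but the helper itself and the treatment of age exactly 40 (placed in the oldest bucket here, since the listing skips it) are illustrative assumptions, not the patent's code:

```python
def one_hot_label(gender, age):
    """Map (gender, age) to the 8-bit one-hot code shared by voice and face data.

    Classes 0..3 are male (<19, 19-29, 30-39, 40+); classes 4..7 are female.
    Class k is encoded with a 1 in bit position k from the right, matching
    the codes 00000001 ... 10000000 listed in the patent.
    """
    if age < 19:
        bucket = 0
    elif age < 30:
        bucket = 1
    elif age < 40:
        bucket = 2
    else:
        bucket = 3          # assumption: age 40 itself goes in the oldest bucket
    cls = bucket + (4 if gender == "female" else 0)
    return format(1 << cls, "08b")

print(one_hot_label("male", 15))    # 00000001
print(one_hot_label("female", 45))  # 10000000
```

Using the same function for both voice and face annotations is one way to keep the two encoding rules consistent, as step S1 requires.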
Further, in step S2, the specific training steps of the voice classification network are as follows:
S21, feed the processed voice data into the Mel-spectrum conversion network, which is implemented with wrapper functions from the librosa toolkit;
S22, feed the extracted Mel spectra into the pretrained ResNet50 network to obtain more discriminative voice features;
S23, apply max pooling to the ResNet50 output and feed the result into the fully connected layer to obtain the predicted one-hot label;
S24, compute the cross-entropy loss from the predicted one-hot label and the true one-hot label, and update the parameters of the ResNet50 network and the fully connected layer;
S25, repeat steps S21 to S24 until the training iteration count is reached, then stop training and save the classification network.
Further, in step S3, the face image generation network is trained as follows: a random seed and the one-hot encoded face label serve as input to the CGAN generator, whose output is a generated random face picture; the random face picture, the one-hot face label, and a real face picture are fed into the CGAN discriminator, whose output value judges whether the generator's synthesized picture is realistic and whether it conforms to the label constraint; the generator and discriminator are trained simultaneously, and the network weights are updated by optimizing a loss function until the network is balanced; the converged CGAN generator is taken as the face image generation network.
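The balance between generator and discriminator described above comes from their opposing losses. A scalar sketch follows, assuming a sigmoid discriminator output D(x | label) in (0, 1) and the standard non-saturating GAN losses; the patent does not spell out its loss functions, so this is an illustration of the usual CGAN objective, not the patent's training code:

```python
import math

def d_loss(d_real, d_fake):
    """Discriminator loss: push D(real|label) -> 1 and D(fake|label) -> 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake):
    """Non-saturating generator loss: push D(fake|label) -> 1."""
    return -math.log(d_fake)

# Early in training the discriminator easily spots fakes (d_fake small),
# so the generator loss is large; at equilibrium D outputs ~0.5 for both.
print(round(g_loss(0.05), 3))      # 2.996 - generator doing badly
print(round(g_loss(0.5), 3))       # 0.693 - near equilibrium
print(round(d_loss(0.5, 0.5), 3))  # 1.386 - D at equilibrium, 2*log(2)
```

Conditioning both networks on the same label is what forces the converged generator to respect the age/gender constraint at prediction time.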
In summary, the invention discloses a method for predicting a human face from voice based on a conditional generative adversarial network. Its beneficial effects are as follows: it fills a gap in the voice-portrait field using deep-learning voice classification and face image generation techniques. Converting the voice features into classification labels, and then using those labels as the constraint condition of the generative adversarial network, reduces the learning difficulty of the network and improves the quality of the generated pictures.
Drawings
FIG. 1 is an overall design block diagram of the method for predicting a human face from voice based on a conditional generative adversarial network
FIG. 2 is a flow chart of the data construction of the method for predicting a human face from voice based on a conditional generative adversarial network
FIG. 3 is a flow chart of the training of the voice classification network in the method for predicting a human face from voice based on a conditional generative adversarial network
FIG. 4 is a flow chart of the model prediction of the method for predicting a human face from voice based on a conditional generative adversarial network
Detailed Description
The present invention is described in further detail below with reference to the drawings and specific embodiments. It should be understood that the described embodiments are only some embodiments, not all of them.
In the field of image generation, existing voice-portrait techniques suffer from low-quality generated images and unsatisfactory model learning. The invention discloses a method for predicting a human face from voice based on a conditional generative adversarial network, which decomposes the voice-portrait problem into two stages, predicting classification labels from voice and then generating face images from those labels, thereby reducing the model's learning difficulty and yielding face images of higher resolution.
This embodiment is based on the TensorFlow framework and the PyCharm development environment. TensorFlow is an open-source Python machine-learning library containing various toolkits suited to deep-learning algorithms; it builds neural-network models efficiently and flexibly and is one of the mainstream programming frameworks at present.
The embodiment discloses a method for predicting a human face from voice based on a conditional generative adversarial network. As shown in FIG. 1, the design process is mainly as follows:
S1, construct the training data: collect Chinese (mainland China) voice data from the Common Voice data set and Asian face data from the UTKface data set, process the voice and face image data respectively, and make one-hot encoded labels from the original age and gender labels;
S2, design and train the voice classification network, which is divided into three sub-networks: a Mel-spectrum conversion network, a pretrained ResNet50 network, and a fully connected classification network; the processed voice data serve as network input, and training updates the network weights by optimizing the similarity between the predicted classification label and the one-hot encoded label;
S3, design and train the face image generation network, a pretrained CGAN divided into a generator and a discriminator; a random seed and the one-hot encoded face label serve as generator input, and the output is a random face image; the random face image, the one-hot face label, and a real face image serve as discriminator input, and the output value judges whether the generated image is realistic and satisfies the constraint; the generator and discriminator are trained simultaneously, and after convergence the generator is taken as the face image generation network;
S4, model prediction: connect the voice classification network trained in S2 and the face image generation network trained in S3 in series; the voice to be predicted is preprocessed and fed into the voice classification network to obtain a voice classification label, which then serves as the constraint condition of the face generation network to obtain the predicted face image.
Specifically, as shown in FIG. 2, the data construction process of the method is as follows:
Step 1, collect 78 hours of Chinese (mainland China) voice data from the Common Voice audio data set and 3440 Asian face images from the UTKface data set;
Step 2, remove silent segments from the voice data;
Step 3, remove voice data and face image data whose annotations are incomplete;
Step 4, uniformly clip the voice data to a length of 5 s;
Step 5, construct the one-hot encoded labels for the voice and face data from the original data-set labels. The labels are divided into eight cases: male under 19, male 19-29, male 30-39, male over 40, female under 19, female 19-29, female 30-39, and female over 40, encoded respectively as 00000001, 00000010, 00000100, 00001000, 00010000, 00100000, 01000000, and 10000000.
Specifically, as shown in FIG. 3, the training process of the voice classification network is as follows:
Step 1, use the processed voice data as network input. Because the amount of audio data is large, the voice data are divided into a training set and a test set at a ratio of 100:1, and the training-set data serve as the network input;
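The 100:1 split of Step 1 can be sketched as follows. The patent does not specify the splitting code, so the shuffling, seeding, and rounding here are illustrative choices:

```python
import random

def split_100_to_1(samples, seed=0):
    """Shuffle, then hold out roughly 1 of every 101 samples as the test set (100:1)."""
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, len(shuffled) // 101)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

train, test = split_100_to_1(list(range(1010)))
print(len(train), len(test))  # 1000 10
```

A fixed held-out test set is what Step 6 below uses for the periodic accuracy checks every 200 training rounds.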
Step 2, extract the Mel-spectrum features of the voice data. Note that the Mel-spectrum conversion network is composed of wrapper functions from the librosa toolkit;
Step 3, feed the extracted Mel spectra into the pretrained ResNet50 network, which outputs more discriminative voice features. Note that the pretrained ResNet50 network can be obtained from the architecture wrapped by the Keras module in TensorFlow 2.0;
Step 4, apply max pooling to the ResNet50 output;
Step 5, use the data processed in Step 4 as input to the fully connected network to obtain the predicted classification label. Note that the rule for this label is consistent with the one-hot encoding rule used in data construction;
Step 6, compute the loss from the predicted classification label and the true one-hot encoded label, optimize the network parameters, and save the network model. Note that a network performance test is run every 200 training rounds, with the test-set data as input, to measure the network's accuracy; the loss used during training is the cross-entropy loss;
Step 7, repeat Steps 2 to 6 until the training iteration count is reached, then stop training and save the network as the voice classifier.
Specifically, in this method the face image generation network is the generator of a pretrained CGAN. An official open-source CGAN can be downloaded from the GitHub open-source code repository; the processed face image data and the one-hot encoded face labels serve as training data, the generator and discriminator of the network are trained simultaneously, and after the network converges the generator is taken as the face image generation network of this embodiment.
Specifically, as shown in FIG. 4, the model prediction process of the method is as follows:
Step 1, preprocess the voice data to be predicted. Note that preprocessing includes checking whether the voice data are valid, reporting an error for silent voice data, and clipping the voice data to a length of 5 s;
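The checks in Step 1 can be sketched as follows, assuming mono samples in [-1, 1] at a known sample rate. The silence threshold and function name are illustrative assumptions; the patent specifies only the behavior (reject silent input, clip to 5 s):

```python
def preprocess(samples, sample_rate, clip_seconds=5, silence_threshold=1e-3):
    """Validate and clip a voice signal, as in prediction-time preprocessing.

    Raises ValueError for empty or near-silent input, then clips the signal
    to at most `clip_seconds` of audio.
    """
    if not samples:
        raise ValueError("invalid voice data: empty signal")
    if max(abs(s) for s in samples) < silence_threshold:
        raise ValueError("invalid voice data: silent signal")
    return samples[: clip_seconds * sample_rate]

sr = 16000
signal = [0.1] * (sr * 7)      # 7 s of placeholder audio
clipped = preprocess(signal, sr)
print(len(clipped) / sr)       # 5.0
```

Raising an explicit error on silent input mirrors the "error notification" behavior the embodiment describes, rather than silently producing an arbitrary face.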
Step 2, feed the processed voice data into the trained voice classification network and output the voice classification label. Note that the label is a one-hot code representing the age and gender attributes of the voice;
Step 3, feed the voice classification label into the face image generation network and output the predicted face image.
Because the voice and face data used here are Asian, Chinese-language data, the method as trained applies only to Asian speakers of Chinese. With different training data chosen according to the actual application scenario, the method can be generalized to face prediction for any language.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make, use, or implement the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit of the invention. Therefore, the present invention is not limited to the embodiments described herein; other embodiments obtained without inventive effort are within its scope.
Claims (6)
1. A method for predicting a human face from voice based on a conditional generative adversarial network, the method comprising the steps of:
S1, data construction: collecting voice data, performing data cleaning, and making one-hot label codes from the speakers' age and gender labels, the labels comprising four age classes and two gender classes in total; collecting face image data, performing data cleaning, making one-hot label codes from the annotated age and gender of the faces, and keeping the rules for making the voice label data and the face label data consistent;
S2, voice classification network model design and training: the model is divided into three sub-networks, namely a Mel-spectrum conversion network for extracting large-scale voice features, a pretrained ResNet50 network for recognizing those features, and a fully connected network for classifying the voice data according to the recognized features; the processed voice data serve as input, and the similarity between the network's classification output and the voice label code is optimized until the voice classification network model converges;
S3, face generation network design and training: the network consists of a pretrained CGAN; random seeds and the face label data serve as input, the generator and discriminator of the CGAN are brought into game-theoretic balance, and the face generation network converges;
S4, model prediction: preprocessing the voice data and feeding them into the voice classification network to obtain the corresponding label code; and feeding the label code into the face generation network to output the predicted speaker face image.
2. The method according to claim 1, wherein the voice data in step S1 are collected from the Common Voice open-source data set, which contains original age and gender labels, and the face image data are collected from the Asian face data in the UTKface open-source data set, which likewise contains original age and gender labels.
3. The method according to claim 1, wherein the data cleaning in step S1 comprises: removing silent voice segments; removing voice data and face image data whose annotations are defective; and clipping the voice data to a uniform length.
4. The method according to claim 1, wherein the one-hot label codes in step S1 are divided into eight cases: male under 19, male 19-29, male 30-39, male over 40, female under 19, female 19-29, female 30-39, and female over 40, encoded respectively as 00000001, 00000010, 00000100, 00001000, 00010000, 00100000, 01000000, and 10000000.
5. The method according to claim 1, wherein step S2 proceeds as follows: first, the processed voice data serve as input and the Mel-spectrum conversion network extracts the Mel-spectrum features of the voice; the feature spectra are then fed into the pretrained ResNet50 network to obtain the voice feature representation; finally, the ResNet50 output, after processing, is fed into the fully connected network to obtain the voice classification label; the similarity between the output classification label and the one-hot encoded label is optimized, and the classification-network weights are updated.
6. The method according to claim 1, wherein step S3 proceeds as follows: random noise and the face label data serve as input to the CGAN generator, whose output is a random face image; the random face image, the face label data, and real face image data serve as input to the CGAN discriminator, whose output value judges whether the image produced by the generator is realistic and satisfies the label data; the generator and discriminator are trained simultaneously and the network weights are updated; after the network stabilizes, the generator is taken out and used as the face generation network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110273900.3A CN112906815A (en) | 2021-03-15 | 2021-03-15 | Method for predicting human face by sound based on condition generation countermeasure network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112906815A true CN112906815A (en) | 2021-06-04 |
Family
ID=76105021
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110273900.3A Pending CN112906815A (en) | 2021-03-15 | 2021-03-15 | Method for predicting human face by sound based on condition generation countermeasure network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112906815A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114912539A (en) * | 2022-05-30 | 2022-08-16 | 吉林大学 | Environmental sound classification method and system based on reinforcement learning |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |