CN112562722A - Audio-driven digital human generation method and system based on semantics - Google Patents

Audio-driven digital human generation method and system based on semantics

Info

Publication number
CN112562722A
CN112562722A
Authority
CN
China
Prior art keywords
semantic
face
audio
mouth
face image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011382282.8A
Other languages
Chinese (zh)
Inventor
王涛
徐常亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinhua Zhiyun Technology Co ltd
Original Assignee
Xinhua Zhiyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinhua Zhiyun Technology Co ltd
Priority to CN202011382282.8A
Publication of CN112562722A
Pending legal-status Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 - Transforming into visible information
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 - 3D [Three Dimensional] image rendering
    • G06T15/005 - General purpose rendering architectures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 - Detection; Localisation; Normalisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G06V40/171 - Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1822 - Parsing for meaning understanding
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 - Transforming into visible information
    • G10L2021/105 - Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Quality & Reliability (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a semantic-based audio-driven digital human generation method and system. The generation method comprises the following steps: acquiring a target audio and a first face image sequence; performing feature extraction on the target audio to obtain corresponding audio features; inputting the audio features into a pre-trained semantic conversion network, which performs semantic conversion on the audio features to obtain a corresponding semantic motion sequence comprising a plurality of mouth semantic graphs; and acquiring, based on the first face image sequence, face images to be rendered equal in number to the mouth semantic graphs, occluding the mouth regions of the face images to be rendered, and performing face synthesis based on the mouth semantic graphs and the face images to be rendered to generate a synthesized face sequence. The invention realizes the conversion from audio to facial semantics through the semantic conversion network and uses the facial semantics to achieve an accurate expression of the mouth shape.

Description

Audio-driven digital human generation method and system based on semantics
Technical Field
The invention relates to the field of machine learning, in particular to a semantic-based audio-driven digital human generation method and system.
Background
Audio-driven videos in which a digital person performs synchronized speaking motion are widely used in video sharing scenarios such as news broadcasting, training sharing and advertising;
a method for synchronously driving a three-dimensional face mouth-shape and facial-pose animation with speech is disclosed in publication CN1032188842: MPEG-4-defined mouth-shape feature parameters and facial-pose feature parameters corresponding to each initial and final (the phonetic units of a Chinese syllable) in a video frame are extracted, the difference Vel between each feature-point coordinate and the standard-frame coordinate is calculated together with the corresponding scale reference P defined on the face according to MPEG-4, and the face motion parameters are then calculated from the difference Vel and the scale reference P;
the patent application adopts the constructed three-dimensional face as a digital person, and the face generated by modeling is greatly different from a real face, so that the method is not suitable for occasions requiring the consistency of the digital face and the real face, such as news broadcasting, training sharing and the like;
because facial movement during speech is a very fine and complex process, feature-point coordinates can only represent facial movement coarsely; face feature-point localization contains errors, and facial movement during speech differs between individuals. That method also associates every initial and final with mouth-shape and facial-pose feature parameters, yet the tone, language and speed of the speech are all related to facial movement, so the method has large limitations.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a semantic-based audio-driven digital person generation method and system that can express the face accurately and finely and are suitable for occasions requiring the digital person to resemble a target person.
In order to solve the technical problem, the invention is solved by the following technical scheme:
a semantic-based audio-driven digital human generation method comprises the following steps:
acquiring a target audio and a target face image sequence, and after masking the mouth region of each target face image in the target face image sequence, acquiring a corresponding first face image sequence;
extracting the characteristics of the target audio to obtain corresponding audio characteristics;
inputting the audio features into a pre-trained semantic conversion network, and performing semantic conversion on the audio features by the semantic conversion network to obtain a corresponding semantic motion sequence, wherein the semantic motion sequence comprises a plurality of mouth semantic graphs;
and constructing a second face image sequence based on the first face image sequence, the second face image sequence containing as many face images to be rendered as there are mouth semantic graphs, and generating a synthesized face sequence based on the mouth semantic graphs and the face images to be rendered, the synthesized face sequence containing synthesized faces in one-to-one correspondence with the mouth semantic graphs.
As an implementable embodiment, the semantic conversion network comprises a recurrent neural network and an upsampling convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors;
the up-sampling convolutional neural network is used for generating the semantic motion sequence based on the expression vectors.
As an implementable embodiment:
respectively connecting the semantic graph of the mouth part with the corresponding face image to be rendered to obtain corresponding data to be synthesized;
and inputting the data to be synthesized into a preset neural rendering network, and synthesizing and rendering the face image to be rendered by the neural rendering network based on the mouth semantic graph to generate a corresponding synthesized face.
As an implementation manner, pre-training the semantic conversion network includes the following steps:
acquiring a speaking video corresponding to a target face, extracting audio features of the speaking video, acquiring sample audio features, extracting video frames of the speaking video, detecting the face in each video frame, segmenting a mouth semantic graph of the face, and taking the obtained mouth semantic graph as a sample semantic graph;
training the semantic conversion network based on the sample audio features and the sample semantic graph.
As an implementation manner, pre-training the neural rendering network includes the following steps:
masking the mouth region of the face in the video frame to obtain a corresponding sample image to be rendered;
connecting the sample image to be rendered and the corresponding sample semantic graph to obtain corresponding sample data to be synthesized;
and training the neural rendering network based on the sample data to be synthesized and the sample face image.
As an implementable embodiment:
the audio features are mel-frequency cepstral coefficients.
The invention also provides a semantic-based audio-driven digital human generation system, which comprises:
the data acquisition module is used for acquiring a target audio and a target face image sequence, and after masking processing is carried out on mouth regions of all target face images in the target face image sequence, a corresponding first face image sequence is obtained;
the characteristic extraction module is used for extracting the characteristics of the target audio to obtain corresponding audio characteristics;
the semantic conversion module is used for inputting the audio features to a pre-trained semantic conversion network, and the semantic conversion network performs semantic conversion on the audio features to obtain a corresponding semantic motion sequence, wherein the semantic motion sequence comprises a plurality of mouth semantic graphs;
and the synthesis rendering module is used for constructing a second face image sequence based on the first face image sequence, the second face image sequence containing as many face images to be rendered as there are mouth semantic graphs, and for performing face synthesis based on the mouth semantic graphs and the face images to be rendered to generate a synthesized face sequence containing synthesized faces in one-to-one correspondence with the mouth semantic graphs.
As an implementable embodiment:
the semantic conversion network comprises a recurrent neural network and an up-sampling convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors;
the up-sampling convolutional neural network is used for generating the semantic motion sequence based on the expression vectors.
As one implementable way, the synthesis rendering module includes:
the connecting unit is used for connecting the semantic graph of the mouth part with the corresponding face image to be rendered respectively to obtain corresponding data to be synthesized;
and the rendering unit is used for inputting the data to be synthesized into a preset neural rendering network, and the neural rendering network synthesizes and renders the face image to be rendered based on the mouth semantic graph to generate a corresponding synthesized face.
The invention also proposes a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of the preceding claims.
Due to the adoption of the technical scheme, the invention has the remarkable technical effects that:
according to the invention, through a pre-trained semantic conversion network, the semantic is adopted to achieve the fine expression of the mouth shape, the semantic is essentially a binary image of the mouth shape of the digital human face, and compared with key points or parameters of the face, the expression of the face is more accurate and fine.
The invention performs synthesis rendering through the neural rendering network, which realizes robust audio-driven digital human generation more accurately, makes the synthesized face closer to the real face, and improves the viewing experience.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow diagram of a semantic-based audio-driven digital human generation method of the present invention;
FIG. 2 is a schematic diagram of a network architecture of a neural rendering network according to embodiment 1;
FIG. 3 is a schematic diagram of a neural rendering network generating a corresponding synthetic face based on a semantic graph of a mouth and a face image to be rendered in a case;
fig. 4 is a schematic diagram of module connection of a semantic-based audio-driven digital human generation system according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, which illustrate the present invention and are not to be construed as limiting it.
Embodiment 1, a semantic-based audio-driven digital human generation method, as shown in fig. 1, includes the following steps:
s100, acquiring a target audio and a target face image sequence, and masking mouth regions of all target face images in the target face image sequence to obtain a corresponding first face image sequence;
after the mouth region of the target face image is subjected to mask processing, corresponding face images to be rendered are obtained, and the face images to be rendered corresponding to the target face images one to one form a first face image sequence.
S200, extracting the characteristics of the target audio to obtain corresponding audio characteristics;
s300, inputting the audio features into a pre-trained semantic conversion network, and performing semantic conversion on the audio features by the semantic conversion network to obtain a corresponding semantic motion sequence, wherein the semantic motion sequence comprises a plurality of mouth semantic graphs;
s400, constructing a second face image sequence based on the first face image sequence, wherein the second face image sequence comprises the same number of face images to be rendered as the mouth semantic graphs, carrying out face synthesis based on the mouth semantic graphs and the face images to be rendered to generate a synthesized face sequence, and the synthesized face sequence comprises synthesized faces corresponding to the mouth semantic graphs one by one.
When the number of mouth semantic graphs is less than or equal to the number of face images to be rendered in the first face image sequence, the corresponding number of face images to be rendered are extracted in sequence to form the second face image sequence;
when the number of mouth semantic graphs is greater than the number of face images to be rendered in the first face image sequence, the first face image sequence can be looped to extend it to the corresponding number of face images to be rendered and form the second face image sequence;
for example, by cyclically playing the first face image sequence forward and then in reverse, an image sequence of unlimited length can be generated, and the second face image sequence is obtained by extracting, in order, a number of face images to be rendered matching the number of mouth semantic graphs, as sketched below.
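The following Python sketch, which is not part of the original disclosure and uses hypothetical function and variable names, illustrates this forward-and-reverse ("ping-pong") looping strategy for building the second face image sequence:

```python
# Illustrative sketch (assumed names): build the second face image sequence by
# looping the first sequence forward and then in reverse until it is as long as
# the number of mouth semantic graphs.
def build_second_sequence(first_sequence, num_semantic_graphs):
    n = len(first_sequence)
    if num_semantic_graphs <= n:
        # Enough frames already: take the first images in order.
        return first_sequence[:num_semantic_graphs]
    # One forward pass followed by one reverse pass: 0, 1, ..., n-1, n-2, ..., 1
    cycle = list(range(n)) + list(range(n - 2, 0, -1))
    return [first_sequence[cycle[i % len(cycle)]] for i in range(num_semantic_graphs)]
```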
The obtained synthesized face sequence is a picture sequence of the digital person speaking, generated based on the target face; the speaking action of the digital person is consistent with the target audio, and a corresponding video can then be generated from the synthesized face sequence and the target audio.
In this embodiment, the conversion between audio and semantics is realized through the pre-trained semantic conversion network, and semantics are used to achieve a fine expression of the mouth shape. The semantic graph is essentially a binary image of the digital human mouth shape and, compared with facial key points or parameters, expresses the face more accurately and finely, which makes the method suitable for scenarios such as news broadcasting and teaching or training, where the digital human must look like a target person and the speaking action must appear real.
The specific way of obtaining the face image to be rendered in step S100 is as follows:
manually setting a mask on the target face image to shield the mouth region of the target face image, and taking the obtained image as the face image to be rendered, wherein a person skilled in the art can set the mask region according to the actual situation.
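As a minimal sketch of this masking step (the rectangle coordinates below are placeholders, since the patent leaves the exact region to the practitioner):

```python
import numpy as np

# Hedged sketch: occlude a manually chosen, fixed mouth region of a face image.
# face_image is assumed to be an H x W x C uint8 array; the rectangle below is
# only an example region.
def mask_mouth_region(face_image, top=140, bottom=220, left=60, right=196):
    masked = face_image.copy()
    masked[top:bottom, left:right] = 0  # zero out (occlude) the mouth area
    return masked
```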
The specific way of extracting the audio features in step S200 is as follows:
the audio features are Mel-frequency cepstrum coefficients (MFCCs). In this embodiment, the target audio is divided into 40-millisecond frames and the corresponding Mel-frequency cepstrum coefficients are extracted; a person skilled in the art can choose the framing interval according to actual needs.
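One way to extract such features, given only as a sketch (the sample rate, number of coefficients and window settings are assumptions not stated in the patent), uses the librosa library:

```python
import librosa

# Sketch: MFCC extraction with 40 ms frames, as described above.
audio, sr = librosa.load("target_audio.wav", sr=16000)   # assumed path and sample rate
frame_len = int(0.040 * sr)                               # 40 ms analysis window
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                            n_fft=frame_len, hop_length=frame_len)
# mfcc has shape (13, num_frames): one column of coefficients per 40 ms of audio
```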
Further, the pre-training mode of the semantic conversion network in step S300 is as follows:
a1, acquiring a speaking video corresponding to a target face, extracting audio features of the speaking video, acquiring sample audio features, extracting video frames of the speaking video, detecting the face in each video frame, segmenting a semantic graph of the mouth of the face, and taking the obtained semantic graph of the mouth as a sample semantic graph;
in this embodiment, speaking videos of the target person are collected, and audio and video are separated for each speaking video to obtain the corresponding speech audio and a number of video frames; the audio data are divided into 40-millisecond frames and the corresponding Mel-frequency cepstrum coefficients are extracted to obtain the sample audio features; based on existing face detection and face segmentation techniques, the face in each video frame is detected and the mouth semantic graph corresponding to the face is extracted as a sample semantic graph.
A2, training the semantic conversion network based on the sample audio features and the sample semantic graph, and iteratively training the semantic conversion network based on the following steps:
taking the sample audio features as the input of the semantic conversion network, which outputs a predicted mouth semantic graph, i.e. a predicted semantic graph;
performing loss calculation based on the corresponding sample semantic graph (ground-truth data) and the predicted semantic graph (predicted data), backpropagating gradients based on the computed first loss value, and updating the parameters of the semantic conversion network, the first loss value being the sum of the cross-entropy loss and the perceptual loss;
and ending the training when the number of training iterations reaches a preset iteration threshold or the loss value falls to a preset loss threshold.
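A minimal training-loop sketch of these steps is given below. It assumes a `model` that maps sample audio features to predicted mouth semantic graphs in [0, 1], a VGG-based `perceptual_loss` (see the formula later in this description), and a data loader yielding (audio feature, sample semantic graph) pairs; all names, the optimizer and the hyperparameters are assumptions, not part of the original disclosure.

```python
import torch
import torch.nn.functional as F

def train_semantic_network(model, perceptual_loss, loader, epochs=100, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for audio_feat, target_map in loader:
            pred_map = model(audio_feat)
            # First loss value: cross-entropy loss plus perceptual loss
            loss = F.binary_cross_entropy(pred_map, target_map) \
                   + perceptual_loss(pred_map, target_map)
            optimizer.zero_grad()
            loss.backward()          # backpropagate gradients
            optimizer.step()         # update the semantic conversion network
```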
The model training step is a conventional technical means in the field, so it is not further detailed in this embodiment; a person skilled in the art can likewise train a corresponding semantic conversion network.
Further, the semantic conversion network comprises a recurrent neural network and an upsampling convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors;
the up-sampling convolutional neural network is used for generating the semantic motion sequence based on the expression vectors.
In this embodiment, the recurrent neural network consists of two GRU (gated recurrent unit) layers; this network extracts the temporal relationships of the input audio, its output is averaged over the time dimension, and the result is fed to the Linear layer.
In the embodiment, Tanh is used as an activation function in an output layer of the upsampling convolutional neural network for predicting the mouth semantic graph.
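The following PyTorch sketch illustrates this two-stage structure; the layer sizes and channel widths are assumptions, since the exact structure is given only in Table 1 below (provided as an image in the original document):

```python
import torch.nn as nn

# Hedged sketch of the semantic conversion network: two GRU layers, time-averaged
# output, a Linear layer, a reshape to a small feature map, and an upsampling
# convolutional decoder with a Tanh output for the mouth semantic graph.
class SemanticConversionNet(nn.Module):
    def __init__(self, feat_dim=13, hidden=256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.linear = nn.Linear(hidden, 64 * 8 * 8)
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(inplace=True),  # 8 -> 16
            nn.ConvTranspose2d(32, 16, 4, 2, 1), nn.ReLU(inplace=True),  # 16 -> 32
            nn.ConvTranspose2d(16, 8, 4, 2, 1), nn.ReLU(inplace=True),   # 32 -> 64
            nn.ConvTranspose2d(8, 1, 4, 2, 1), nn.Tanh(),                # 64 -> 128
        )

    def forward(self, audio_features):               # (batch, time, feat_dim)
        out, _ = self.gru(audio_features)
        expr = out.mean(dim=1)                        # average over the time dimension
        x = self.linear(expr).view(-1, 64, 8, 8)      # reshape to a feature map
        return self.upsample(x)                       # (batch, 1, 128, 128) semantic graph
```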
The network structure of the semantic conversion network is specifically shown in the following table:
TABLE 1
(Table 1, listing the layer-by-layer structure of the semantic conversion network, is provided as an image in the original document.)
Here kernel denotes the convolution kernel, stride denotes the step size, the Linear layer is a fully connected layer, and the Reshape layer is used to transform vector dimensions.
Further, in step S400, performing face synthesis based on the mouth semantic graph and the face image to be rendered, and generating a synthesized face sequence specifically includes:
s410, connecting the semantic graphs of the mouth parts with the corresponding face images to be rendered respectively to obtain corresponding data to be synthesized;
the connection refers to connecting two pieces of multidimensional data of the mouth semantic graph and the face image to be rendered on one channel (dimension), for example, connecting a 20-dimensional vector and a 30-dimensional vector to form a 50-dimensional vector.
And S420, inputting the data to be synthesized into a preset neural rendering network, and synthesizing and rendering the face image to be rendered by the neural rendering network based on the mouth semantic graph to generate a corresponding synthesized face.
In this embodiment, synthesis rendering is performed through the neural rendering network, which realizes robust audio-driven digital human generation more accurately, makes the synthesized face closer to the real face, and improves the viewing experience.
The pre-training mode of the neural rendering network is as follows:
b1, masking the mouth region of the face in the video frame to obtain a corresponding sample image to be rendered;
the video frames are the video frames extracted in step A1;
in this step, the mouth region is the same as the mouth region in step S100; that is, once a person skilled in the art has set a fixed region for occluding the mouth according to actual needs, mask processing is applied, based on that fixed region, both to the target face images used for synthesizing the digital person and to the video frames serving as training samples.
B2, connecting the sample image to be rendered and the corresponding sample semantic graph to obtain corresponding sample data to be synthesized;
the connection described in the above-described connection synchronization step S410 is not described in detail.
B3, training the neural rendering network based on the sample data to be synthesized and the sample face image.
Taking the sample data to be synthesized as the input of the neural rendering network, which outputs a predicted synthesized face, i.e. a predicted face image;
loss calculation is performed based on the corresponding sample face image (ground-truth data) and the predicted face image (predicted data), gradients are backpropagated based on the computed second loss value, and the parameters of the neural rendering network are updated, the second loss value being the sum of the L1 loss and the perceptual loss;
and the training ends when the number of training iterations reaches a preset iteration threshold or the loss value falls to a preset loss threshold.
The model training step is a conventional technical means in the field, so it is not further detailed in this embodiment; a person skilled in the art can likewise train a corresponding neural rendering network.
Note that the cross-entropy loss, the L1 loss and the perceptual loss are all conventional loss functions in the art; a person skilled in the art can calculate the corresponding loss values according to the actual situation, so detailed formulas are not given for all of them.
In this embodiment, the semantic conversion network and the neural rendering network both calculate the perceptual loss by the following formula:

$L_{perc}\left(Y,\hat{Y}\right)=\frac{1}{C_{j}H_{j}W_{j}}\left\lVert V_{j}(Y)-V_{j}\left(\hat{Y}\right)\right\rVert_{2}^{2}$

In the present embodiment, the real data Y and the predicted data Ŷ are respectively input into a pre-trained VGG network V, where V_j denotes the activation of the j-th layer of the VGG network when processing the real or predicted data, with shape (C_j, H_j, W_j). The squared L2 distance between the two activations is then used as the corresponding perceptual loss.

When training the semantic conversion network, the real data Y is the sample semantic graph and the predicted data Ŷ is the corresponding predicted semantic graph;

when training the neural rendering network, the real data Y is the sample face image and the predicted data Ŷ is the corresponding predicted face image.
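A common implementation of such a VGG-based perceptual loss is given below only as a sketch; the VGG variant, the selected layers and the input normalization are assumptions not stated in the text.

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    # Sum of per-layer squared L2 distances between VGG activations,
    # each normalized by C_j * H_j * W_j.
    def __init__(self, layer_ids=(3, 8, 15, 22)):    # selected ReLU layers of VGG-16
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False                   # the VGG network stays frozen
        self.vgg = vgg
        self.layer_ids = set(layer_ids)

    def forward(self, pred, target):
        # Expects 3-channel inputs; a 1-channel semantic graph can be repeated to 3 channels.
        loss, x, y = 0.0, pred, target
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                c, h, w = x.shape[1:]
                loss = loss + ((x - y) ** 2).sum() / (c * h * w)
        return loss
```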
In this embodiment, the network structure of the neural rendering network is specifically shown in the following table:
TABLE 2
(Table 2, listing the layer-by-layer structure of the neural rendering network, is provided as an image in the original document.)
In the above table, Mask M_t denotes the masked face image, i.e. the face image to be rendered or the sample image to be rendered; Image Q denotes the mouth semantic graph; Skip indicates a skip-layer (shortcut) connection. The network architecture of the neural rendering network is shown in fig. 2, in which the dotted lines indicate skip-layer connections.
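A compact encoder-decoder sketch with skip-layer connections, in the spirit of the network described above, is shown below; the channel widths and depth are assumptions, since the exact layers of Table 2 are given only as an image in the original document. The 4-channel input is the concatenation of the mouth semantic graph and the masked face image.

```python
import torch
import torch.nn as nn

class NeuralRenderNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(4, 32, 4, 2, 1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(inplace=True))
        self.enc3 = nn.Sequential(nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(inplace=True))
        self.dec3 = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(inplace=True))
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(128, 32, 4, 2, 1), nn.ReLU(inplace=True))
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())

    def forward(self, x):                       # x: (B, 4, H, W), semantic graph + masked face
        e1 = self.enc1(x)                       # (B, 32, H/2, W/2)
        e2 = self.enc2(e1)                      # (B, 64, H/4, W/4)
        e3 = self.enc3(e2)                      # (B, 128, H/8, W/8)
        d3 = self.dec3(e3)                      # (B, 64, H/4, W/4)
        d2 = self.dec2(torch.cat([d3, e2], 1))  # skip-layer connection from enc2
        d1 = self.dec1(torch.cat([d2, e1], 1))  # skip-layer connection from enc1
        return d1                               # synthesized face, (B, 3, H, W)
```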
Case (2):
and acquiring a speaking video of the target person, and training to acquire a semantic conversion network and a video rendering network by using audio data (MFCC) and image data in the speaking video according to the training steps.
Referring to fig. 3, according to actual needs, a segment of speaking video is extracted from the pre-collected speaking videos, or a speaking video designated by the user (not used for training) is selected; the video frames of this speaking video form the target face image sequence, and the mouth region of the face in each video frame (i.e. each target face image) is masked to obtain the face images to be rendered, thereby generating the first face image sequence;
the target audio is acquired and its Mel-frequency cepstrum coefficients are extracted to obtain the corresponding audio features; the audio features are input into the semantic conversion network to obtain the corresponding mouth semantic graphs;
the first face image sequence is extended by forward-and-reverse looping until the length of the obtained image sequence matches the number of mouth semantic graphs; the face images to be rendered in this image sequence are then extracted in order, each mouth semantic graph is connected with its corresponding face image to be rendered and input into the neural rendering network, and the corresponding synthesized faces are obtained;
finally, a video is generated from the obtained synthesized faces and the target audio, in which the mouth shape of the digital person corresponding to the target person is synchronized with the target audio.
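The final muxing of the synthesized frames with the target audio can be done with standard tooling; the sketch below drives ffmpeg from Python, and the paths, frame rate and codec settings are assumptions rather than values from the patent.

```python
import subprocess

def frames_and_audio_to_video(frame_pattern="out/frame_%05d.png",
                              audio_path="target_audio.wav",
                              output_path="digital_human.mp4",
                              fps=25):
    # Combine the synthesized face frames with the target audio into one video.
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps), "-i", frame_pattern,   # image-sequence input
        "-i", audio_path,                              # target audio input
        "-c:v", "libx264", "-pix_fmt", "yuv420p",      # widely compatible video encoding
        "-c:a", "aac", "-shortest",                    # encode audio, stop at the shorter stream
        output_path,
    ], check=True)
```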
Embodiment 2, a semantic-based audio-driven digital human generation system, as shown in fig. 4, includes:
the data acquisition module 100 is configured to acquire a target audio and a target face image sequence, and perform masking processing on a mouth region of each target face image in the target face image sequence to obtain a corresponding first face image sequence;
the feature extraction module 200 is configured to perform feature extraction on the target audio to obtain corresponding audio features;
a semantic conversion module 300, configured to input the audio features into a pre-trained semantic conversion network, where the audio features are subjected to semantic conversion by the semantic conversion network to obtain a corresponding semantic motion sequence, where the semantic motion sequence includes a plurality of mouth semantic graphs;
and a synthesis rendering module 400, configured to construct a second face image sequence based on the first face image sequence, the second face image sequence containing as many face images to be rendered as there are mouth semantic graphs, and to perform face synthesis based on the mouth semantic graphs and the face images to be rendered to generate a synthesized face sequence containing synthesized faces in one-to-one correspondence with the mouth semantic graphs.
The semantic conversion network comprises a recurrent neural network and an up-sampling convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors;
the up-sampling convolutional neural network is used for generating the semantic motion sequence based on the expression vectors.
The synthesis rendering module 400 includes:
the connecting unit is used for connecting the semantic graph of the mouth part with the corresponding face image to be rendered respectively to obtain corresponding data to be synthesized;
and the rendering unit is used for inputting the data to be synthesized into a preset neural rendering network, and the neural rendering network synthesizes and renders the face image to be rendered based on the mouth semantic graph to generate a corresponding synthesized face.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
Embodiment 3 is a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of embodiment 1.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that:
reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, the appearances of the phrase "one embodiment" or "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
In addition, it should be noted that the specific embodiments described in the present specification may differ in the shape of the components, the names of the components, and the like. All equivalent or simple changes of the structure, the characteristics and the principle of the invention which are described in the patent conception of the invention are included in the protection scope of the patent of the invention. Various modifications, additions and substitutions for the specific embodiments described may be made by those skilled in the art without departing from the scope of the invention as defined in the accompanying claims.

Claims (10)

1. A semantic-based audio-driven digital human generation method is characterized by comprising the following steps:
acquiring a target audio and a target face image sequence, and after masking the mouth region of each target face image in the target face image sequence, acquiring a corresponding first face image sequence;
extracting the characteristics of the target audio to obtain corresponding audio characteristics;
inputting the audio features into a pre-trained semantic conversion network, and performing semantic conversion on the audio features by the semantic conversion network to obtain a corresponding semantic motion sequence, wherein the semantic motion sequence comprises a plurality of mouth semantic graphs;
and constructing a second face image sequence based on the first face image sequence, the second face image sequence containing as many face images to be rendered as there are mouth semantic graphs, and generating a synthesized face sequence based on the mouth semantic graphs and the face images to be rendered, the synthesized face sequence containing synthesized faces in one-to-one correspondence with the mouth semantic graphs.
2. The semantic-based audio driven digital human generation method of claim 1, wherein the semantic conversion network comprises a recurrent neural network and an upsampled convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors:
the up-sampling convolution neural network is used for generating a semantic motion sequence based on the expression vector.
3. The semantic-based audio-driven digital human generation method of claim 1 or 2, wherein:
connecting the semantic graph of the mouth with the corresponding face image to be rendered to obtain corresponding data to be synthesized;
and inputting the data to be synthesized into a preset neural rendering network, and performing synthetic rendering on the face image to be rendered by the neural rendering network based on the mouth semantic graph to generate a corresponding synthetic face.
4. The semantic-based audio-driven digital human generation method of claim 3, wherein the pre-training of the semantic conversion network comprises:
acquiring a speaking video corresponding to a target face, extracting audio features of the speaking video, acquiring sample audio features, extracting video frames of the speaking video, detecting the face in each video frame, segmenting a mouth semantic graph of the face, and taking the obtained mouth semantic graph as a sample semantic graph;
training the semantic conversion network based on the sample audio features and the sample semantic graph.
5. The semantic-based audio-driven digital human generation method of claim 4, wherein pre-training the neural rendering network comprises:
masking the mouth region of the face in the video frame to obtain a corresponding sample image to be rendered;
connecting the sample image to be rendered and the corresponding sample semantic graph to obtain corresponding sample data to be synthesized;
and training the neural rendering network based on the sample data to be synthesized and the sample face image.
6. The semantic-based audio-driven digital human generation method of claim 5, wherein:
the audio features are mel-frequency cepstral coefficients.
7. A semantic-based audio-driven digital human generation system, comprising:
the data acquisition module is used for acquiring a target audio and a target face image sequence, and after masking processing is carried out on mouth regions of all target face images in the target face image sequence, a corresponding first face image sequence is obtained;
the characteristic extraction module is used for extracting the characteristics of the target audio to obtain corresponding audio characteristics;
the semantic conversion module is used for inputting the audio features to a pre-trained semantic conversion network, and the semantic conversion network performs semantic conversion on the audio features to obtain a corresponding semantic motion sequence, wherein the semantic motion sequence comprises a plurality of mouth semantic graphs;
and the synthesis rendering module is used for constructing a second face image sequence based on the first face image sequence, the second face image sequence containing as many face images to be rendered as there are mouth semantic graphs, and for performing face synthesis based on the mouth semantic graphs and the face images to be rendered to generate a synthesized face sequence containing synthesized faces in one-to-one correspondence with the mouth semantic graphs.
8. The semantic-based audio-driven digital human generation system of claim 7, wherein:
the semantic conversion network comprises a cyclic neural network and an up-sampling convolutional neural network;
the recurrent neural network is used for converting the audio features into expression vectors:
the up-sampling convolution neural network is used for generating a semantic motion sequence based on the expression vector.
9. The semantic-based audio-driven digital human generation system according to claim 7 or 8, wherein the synthesis rendering module comprises:
the connecting unit is used for connecting the semantic graph of the mouth part with the corresponding face image to be rendered respectively to obtain corresponding data to be synthesized;
and the rendering unit is used for inputting the data to be synthesized into a preset neural rendering network, and the neural rendering network synthesizes and renders the face image to be rendered based on the mouth semantic graph to generate a corresponding synthesized face.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202011382282.8A 2020-12-01 2020-12-01 Audio-driven digital human generation method and system based on semantics Pending CN112562722A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011382282.8A CN112562722A (en) 2020-12-01 2020-12-01 Audio-driven digital human generation method and system based on semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011382282.8A CN112562722A (en) 2020-12-01 2020-12-01 Audio-driven digital human generation method and system based on semantics

Publications (1)

Publication Number Publication Date
CN112562722A true CN112562722A (en) 2021-03-26

Family

ID=75045817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011382282.8A Pending CN112562722A (en) 2020-12-01 2020-12-01 Audio-driven digital human generation method and system based on semantics

Country Status (1)

Country Link
CN (1) CN112562722A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975952A (en) * 2016-05-26 2016-09-28 天津艾思科尔科技有限公司 Beard detection method and system in video image
CN108229245A (en) * 2016-12-14 2018-06-29 贵港市瑞成科技有限公司 Method for detecting fatigue driving based on facial video features
CN110866968A (en) * 2019-10-18 2020-03-06 平安科技(深圳)有限公司 Method for generating virtual character video based on neural network and related equipment
CN111243626A (en) * 2019-12-30 2020-06-05 清华大学 Speaking video generation method and system
CN111259875A (en) * 2020-05-06 2020-06-09 中国人民解放军国防科技大学 Lip reading method based on self-adaptive magnetic space-time diagramm volumetric network

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096242A (en) * 2021-04-29 2021-07-09 平安科技(深圳)有限公司 Virtual anchor generation method and device, electronic equipment and storage medium
WO2022242381A1 (en) * 2021-05-21 2022-11-24 上海商汤智能科技有限公司 Image generation method and apparatus, device, and storage medium
CN113299312A (en) * 2021-05-21 2021-08-24 北京市商汤科技开发有限公司 Image generation method, device, equipment and storage medium
CN113269872A (en) * 2021-06-01 2021-08-17 广东工业大学 Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization
CN113380269A (en) * 2021-06-08 2021-09-10 北京百度网讯科技有限公司 Video image generation method, apparatus, device, medium, and computer program product
CN113380269B (en) * 2021-06-08 2023-01-10 北京百度网讯科技有限公司 Video image generation method, apparatus, device, medium, and computer program product
CN113674184A (en) * 2021-07-19 2021-11-19 清华大学 Virtual speaker limb gesture generation method, device, equipment and storage medium
CN113628635B (en) * 2021-07-19 2023-09-15 武汉理工大学 Voice-driven speaker face video generation method based on teacher student network
CN113628635A (en) * 2021-07-19 2021-11-09 武汉理工大学 Voice-driven speaking face video generation method based on teacher and student network
CN113822969A (en) * 2021-09-15 2021-12-21 宿迁硅基智能科技有限公司 Method, device and server for training nerve radiation field model and face generation
CN113851145A (en) * 2021-09-23 2021-12-28 厦门大学 Virtual human action sequence synthesis method combining voice and semantic key actions
CN113851145B (en) * 2021-09-23 2024-06-07 厦门大学 Virtual human action sequence synthesis method combining voice and semantic key actions
CN113747086A (en) * 2021-09-30 2021-12-03 深圳追一科技有限公司 Digital human video generation method and device, electronic equipment and storage medium
CN113723385A (en) * 2021-11-04 2021-11-30 新东方教育科技集团有限公司 Video processing method and device and neural network training method and device
WO2023077742A1 (en) * 2021-11-04 2023-05-11 新东方教育科技集团有限公司 Video processing method and apparatus, and neural network training method and apparatus
WO2023088080A1 (en) * 2021-11-22 2023-05-25 上海商汤智能科技有限公司 Speaking video generation method and apparatus, and electronic device and storage medium
CN113822968B (en) * 2021-11-24 2022-03-04 北京影创信息科技有限公司 Method, system and storage medium for driving virtual human in real time by voice
CN113822968A (en) * 2021-11-24 2021-12-21 北京影创信息科技有限公司 Method, system and storage medium for driving virtual human in real time by voice
CN114419702A (en) * 2021-12-31 2022-04-29 南京硅基智能科技有限公司 Digital human generation model, training method of model, and digital human generation method
CN114419702B (en) * 2021-12-31 2023-12-01 南京硅基智能科技有限公司 Digital person generation model, training method of model, and digital person generation method
CN115330913A (en) * 2022-10-17 2022-11-11 广州趣丸网络科技有限公司 Three-dimensional digital population form generation method and device, electronic equipment and storage medium
CN115953521A (en) * 2023-03-14 2023-04-11 世优(北京)科技有限公司 Remote digital human rendering method, device and system
CN116342835A (en) * 2023-03-31 2023-06-27 华院计算技术(上海)股份有限公司 Face three-dimensional surface grid generation method, device, computing equipment and storage medium
CN117036555B (en) * 2023-05-18 2024-06-21 无锡捷通数智科技有限公司 Digital person generation method and device and digital person generation system
CN117036555A (en) * 2023-05-18 2023-11-10 无锡捷通数智科技有限公司 Digital person generation method and device and digital person generation system
CN116385604A (en) * 2023-06-02 2023-07-04 摩尔线程智能科技(北京)有限责任公司 Video generation and model training method, device, equipment and storage medium
CN116385604B (en) * 2023-06-02 2023-12-19 摩尔线程智能科技(北京)有限责任公司 Video generation and model training method, device, equipment and storage medium
CN116778040B (en) * 2023-08-17 2024-04-09 北京百度网讯科技有限公司 Face image generation method based on mouth shape, training method and device of model
CN116778040A (en) * 2023-08-17 2023-09-19 北京百度网讯科技有限公司 Face image generation method based on mouth shape, training method and device of model
CN117372553A (en) * 2023-08-25 2024-01-09 华院计算技术(上海)股份有限公司 Face image generation method and device, computer readable storage medium and terminal
CN117372553B (en) * 2023-08-25 2024-05-10 华院计算技术(上海)股份有限公司 Face image generation method and device, computer readable storage medium and terminal
CN116994600B (en) * 2023-09-28 2023-12-12 中影年年(北京)文化传媒有限公司 Method and system for driving character mouth shape based on audio frequency
CN116994600A (en) * 2023-09-28 2023-11-03 中影年年(北京)文化传媒有限公司 Method and system for driving character mouth shape based on audio frequency
CN117689783A (en) * 2024-02-02 2024-03-12 湖南马栏山视频先进技术研究院有限公司 Face voice driving method and device based on super-parameter nerve radiation field
CN117689783B (en) * 2024-02-02 2024-04-30 湖南马栏山视频先进技术研究院有限公司 Face voice driving method and device based on super-parameter nerve radiation field

Similar Documents

Publication Publication Date Title
CN112562722A (en) Audio-driven digital human generation method and system based on semantics
CN111325817B (en) Virtual character scene video generation method, terminal equipment and medium
Cao et al. Expressive speech-driven facial animation
CN103650002B (en) Text based video generates
CN110853670B (en) Music-driven dance generation method
CN112001992A (en) Voice-driven 3D virtual human expression sound-picture synchronization method and system based on deep learning
JP2009533786A (en) Self-realistic talking head creation system and method
KR20060090687A (en) System and method for audio-visual content synthesis
US11847726B2 (en) Method for outputting blend shape value, storage medium, and electronic device
JP2003529861A (en) A method for animating a synthetic model of a human face driven by acoustic signals
CN110910479B (en) Video processing method, device, electronic equipment and readable storage medium
CN113838173B (en) Virtual human head motion synthesis method driven by combination of voice and background sound
CN113838174B (en) Audio-driven face animation generation method, device, equipment and medium
CN116051692B (en) Three-dimensional digital human face animation generation method based on voice driving
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN115578512A (en) Method, device and equipment for training and using generation model of voice broadcast video
CN116828129B (en) Ultra-clear 2D digital person generation method and system
CN117409121A (en) Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
CN116912375A (en) Facial animation generation method and device, electronic equipment and storage medium
CN116417008A (en) Cross-mode audio-video fusion voice separation method
CN115439614B (en) Virtual image generation method and device, electronic equipment and storage medium
CN115223224A (en) Digital human speaking video generation method, system, terminal device and medium
Deena et al. Speech-driven facial animation using a shared Gaussian process latent variable model
Mahavidyalaya Phoneme and viseme based approach for lip synchronization
CN114360491A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20210326