CN113963092B - Audio and video fitting associated computing method, device, medium and equipment - Google Patents

Audio and video fitting associated computing method, device, medium and equipment

Info

Publication number
CN113963092B
Authority
CN
China
Prior art keywords
voice
sequence
feature
coding sequence
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111442573.6A
Other languages
Chinese (zh)
Other versions
CN113963092A (en)
Inventor
王苏振 (Wang Suzhen)
李林橙 (Li Lincheng)
丁彧 (Ding Yu)
吕唐杰 (Lv Tangjie)
范长杰 (Fan Changjie)
胡志鹏 (Hu Zhipeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202111442573.6A priority Critical patent/CN113963092B/en
Publication of CN113963092A publication Critical patent/CN113963092A/en
Application granted granted Critical
Publication of CN113963092B publication Critical patent/CN113963092B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a computing method for audio-video fitting association that exploits both the phoneme features and the voice features of a target voice, as well as structural features extracted from a reference image. The invention also provides a corresponding computing device, medium and equipment for audio-video fitting association. Implemented as a preparatory stage for face-talking animation synthesis, the embodiments provided by the invention can significantly improve the synthesis quality and final effect of the facial speaking animation.

Description

Audio and video fitting associated computing method, device, medium and equipment
Technical Field
The invention belongs to the field of neural networks and artificial intelligence, and particularly relates to a computing method, a computing device, a computing medium and computing equipment for audio and video fitting association.
Background
Face speech animation construction in audio-video fitting association calculation refers to synthesizing a virtual talking animation of a character, taking the character in a given reference image as the basis and driven by arbitrary input voice. Ideally, the generated virtual talking animation should present mouth shapes, expressions and natural head motions that match the input voice. The technology can be widely applied in fields such as virtual assistants, intelligent customer service, news broadcasting and teleconferencing, and, being based on artificial intelligence, greatly reduces the manual labor of the related industries.
The prior art for constructing facial speech animation mainly relies on audio-visual correlation calculation over video data: it constructs features related to the speaking action from the source data and uses a deep generative model to generate simulated face images from those features, thereby obtaining the animation effect of the simulated face speaking the specified voice.
When a three-dimensional morphable model or facial key points are selected as the visual modality representation, the regions outside the face are not represented, so the facial speech animation synthesized by the deep generative model is blurred outside the face region and its quality is poor.
When voice features are selected as the speech modality representation, a deep generative model trained on data from a single subject generalizes poorly, while with training data from multiple subjects the model cannot learn an accurate audio-visual association representation because of the differences among the subjects.
When the phoneme features of a single subject are selected as the speech modality representation, the difficulty of generalizing across timbres is alleviated to some extent, but phoneme features are hard to associate with the emotion carried by natural speech, so the facial speech animation generated by the trained deep generative model shows obvious motion flaws, such as unnatural mouth movements when speaking sentences with a strong tone of voice.
It can be seen that each of the prior-art solutions has difficulty producing high-quality facial speech animation.
Disclosure of Invention
In order to overcome the above-mentioned drawbacks in the prior art, the present invention provides a method for calculating an audio/video fitting association, which includes:
acquiring a head motion coding sequence, a target voice, and a reference image containing a target head portrait;
extracting a phoneme feature sequence and a voice feature sequence from the target voice;
extracting structural features from the reference image;
splicing the phoneme feature sequence and the head motion coding sequence to obtain a first joint coding sequence, and splicing the voice feature sequence and the structural features to obtain a second joint coding sequence;
inputting the first joint coding sequence into an encoder of a neural network model based on an attention mechanism, the encoder obtaining a hidden space representation of a target voice frame;
jointly inputting the hidden space representation and the second joint coding sequence into a decoder of the neural network model based on the attention mechanism, wherein the hidden space representation serves as the key-value pair for the decoder's attention and the second joint coding sequence serves as the query vector of the decoder;
the decoder outputting a feature vector of the target voice frame;
converting the feature vector into description parameters of a dense motion field. A schematic sketch of this pipeline is given below.
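For orientation only, the following Python-style sketch outlines the data flow of the above steps. All module and attribute names (phoneme_extractor, keypoint_detector, fuse_speech_and_structure, to_motion_params and so on) are hypothetical placeholders, not part of the claimed method.

```python
# Schematic sketch of the claimed pipeline (hypothetical names, PyTorch-style).
import torch

def audio_video_fitting_association(head_codes, phoneme_seq, speech_seq, reference_image, model):
    # Structural features extracted from the reference image.
    structure = model.keypoint_detector.intermediate_features(reference_image)

    # First joint coding: phoneme feature sequence spliced with the head motion coding sequence.
    first_joint = torch.cat([phoneme_seq, head_codes], dim=-1)            # (2n+1, P + H)

    # Second joint coding: voice feature sequence spliced with the structural features.
    second_joint = model.fuse_speech_and_structure(speech_seq, structure)

    # Attention-based encoder-decoder: the encoder's hidden space representation is the
    # key/value of the decoder's attention; the second joint coding is the query.
    hidden = model.encoder(first_joint)
    features = model.decoder(tgt=second_joint, memory=hidden)

    # Convert the target frame's feature vector into dense-motion-field description parameters.
    return model.to_motion_params(features)
```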
According to one aspect of the invention, the step of obtaining the head motion coding sequence in the method comprises: generating the head motion coding sequence according to preset head motion data; or generating the head motion coding sequence according to the video matched with the target voice. According to another aspect of the present invention, the step of extracting the phoneme feature sequence and the speech feature sequence from the target speech in the method includes: splitting the target voice into a plurality of voice frames according to a preset period; respectively extracting the phoneme characteristic and the voice characteristic of each voice frame; and selecting a plurality of phoneme features to form the phoneme feature sequence according to a preset time sequence window, and selecting a plurality of voice features to form the voice feature sequence.
According to another aspect of the invention, the step of extracting structural features from the reference image in the method comprises: inputting the reference image into a pre-trained non-supervision key point detector, and extracting a characteristic diagram representation of the middle layer output of the non-supervision key point detector as the structural characteristic.
According to another aspect of the present invention, before splicing the speech feature sequence and the structural feature to obtain the second joint coding sequence, the method further comprises: and respectively modifying the channel dimension of each voice feature in the voice feature sequence by using an up-sampling convolution network to enable the channel dimension of the voice feature to be consistent with the structural feature.
According to another aspect of the invention, the step of converting the feature vector into the description parameters of the dense motion field in the method comprises: the feature vector is converted into the descriptive parameters using a fully connected layer.
According to another aspect of the present invention, the step of converting the feature vector of the target voice frame into the description parameters using the fully connected layer in the method includes: inputting the feature vector of the target voice frame into two fully connected models respectively, the two fully connected models respectively outputting the description parameters of the corresponding categories.
According to another aspect of the invention, the description parameters in the method include: key point parameters for composing the dense motion field, and local affine transformation parameters corresponding to the key points.
According to another aspect of the invention, after said converting said feature vector into a description parameter of the dense motion field, the method further comprises: and generating a video containing the target head portrait according to the description parameters.
Correspondingly, the invention provides an audio and video fitting association calculating device, which comprises:
The head motion data module is used for acquiring or generating a head motion coding sequence;
The voice processing module is used for extracting a phoneme characteristic sequence and a voice characteristic sequence from the target voice;
the key point detection module is used for extracting structural features from a reference image containing the target head portrait;
The first joint coding module is used for splicing the phoneme characteristic sequences and the head motion coding sequences to obtain a first joint coding sequence;
The second joint coding module is used for splicing the voice characteristic sequence and the structural characteristic to obtain a second joint coding sequence;
A neural network model based on an attention mechanism, the neural network model including an encoder and a decoder;
the encoder is used for receiving the first joint coding sequence as input and, after temporal modeling, outputting the hidden space representation of the target voice frame;
The decoder is configured to receive the hidden space representation and the second joint coding sequence as input and to output the feature vector of the target voice frame, where the hidden space representation is the key-value pair for the decoder's attention and the second joint coding sequence is the query vector of the decoder;
and the converter module is connected with the output end of the decoder and is used for converting the characteristic vector into the description parameter of the dense motion field.
Furthermore, the present invention provides one or more computer-readable media storing computer-executable instructions that, when used by one or more computer devices, cause the one or more computer devices to perform the method of computing an audio-video fit association as described hereinbefore.
The invention also provides a computer device comprising a memory and a processor, wherein: the memory stores a computer program, and the processor implements the audio and video fitting associated computing method when executing the computer program.
When the embodiments provided by the invention perform audio-video fitting association calculation, the selected input features include the phoneme features and the voice features carried by the voice of a single subject, together with the structural features of the reference image. Compared with the prior art, the selection of phoneme features and voice features improves the generalization of the audio-video fitting association result as well as the visual representation of the mouth shape; the selection of structural features lets the calculation attend more closely to the structural distribution of the character and the background in the reference image, which further improves generalization, so the quality of the face-talking animation synthesized from the calculation result improves correspondingly. Implemented as a preparatory stage for face-talking animation synthesis, the embodiments provided by the invention can significantly improve the synthesis quality and final effect of the facial speaking animation.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings, in which:
FIG. 1 is a flow diagram of one embodiment of a method of computing an audio video fit association in accordance with the present invention;
FIG. 2 is a flow chart of an alternative embodiment of step S120 shown in FIG. 1;
FIG. 3 is a schematic structural diagram of one embodiment of a computing device associated with audio-video fitting in accordance with the present invention;
FIG. 4 is a schematic diagram of an exemplary computer device for performing an embodiment of the audio-video fit-association calculation method of the present invention.
The same or similar reference numbers in the drawings refer to the same or similar parts.
Detailed Description
For a better understanding and explanation of the present invention, reference will be made to the following detailed description of the invention taken in conjunction with the accompanying drawings. The invention is not limited to these specific embodiments only. On the contrary, the invention is intended to cover modifications and equivalent arrangements included within the scope of the appended claims.
It should be noted that numerous specific details are set forth in the following detailed description. It will be understood by those skilled in the art that the present invention may be practiced without these specific details. In the description of the various embodiments, well-known principles, structures and components are not described in detail so as not to obscure the salient points of the invention.
The invention provides a computing method of audio and video fitting association, please refer to fig. 1, fig. 1 is a flow chart of a specific implementation of the computing method of audio and video fitting association according to the invention, the method comprises:
Step S110, a head motion coding sequence, target voice and a reference image containing a target head portrait are acquired;
Step S120, extracting a phoneme characteristic sequence and a voice characteristic sequence from the target voice;
step S130, extracting structural features from the reference image;
Step S200, splicing the phoneme characteristic sequence and the head motion coding sequence to obtain a first joint coding sequence, and splicing the voice characteristic sequence and the structural characteristic to obtain a second joint coding sequence;
Step S300, inputting the first joint coding sequence into an encoder of a neural network model based on an attention mechanism, and outputting a hidden space representation of a target voice frame by the encoder;
Step S400, the hidden space representation and the second joint coding sequence are jointly input into a decoder of the neural network model based on the attention mechanism, wherein the hidden space representation serves as the key-value pair for the decoder's attention, and the second joint coding sequence serves as the query vector of the decoder;
step S500, the decoder outputs the characteristic vector of the target voice frame;
Step S600, converting the feature vector into a description parameter of the dense motion field.
Specifically, the audio-video fitting association calculation method of the present invention is a preparatory step for synthesizing facial speech animation. The ultimate purpose of the synthesized facial speech animation is: given a still image containing a face and a driving source (e.g., an audio-video clip or a sound clip), animate the still face based on the driving source so that it presents, in sequence, speaking actions, head motions, expressions and so on that match the sound of the driving source; that is, to bring the still image to life. To make this animation of the still image look more natural, information describing the head motion is usually introduced, which is precisely the purpose of acquiring the head motion coding sequence in step S110. This embodiment does not limit the source of the head motion coding sequence: it may be generated from preset head motion data, or from a video matched with the voice.
In step S120, the term "voice" refers to the sound data used as the driving source to drive the still image, such as directly recorded sound data or audio track data extracted from a video. The features naturally carried by the voice that can be exploited include phoneme features and voice features. Phoneme features are the phoneme labels extracted from the voice with an automatic speech recognition tool, typically represented with one-hot coding. Voice features are spectral features calculated from the voice and are typically used to represent the emotion it carries; in this embodiment, the voice features are composed of MFCC (Mel-frequency cepstral coefficient) features, FBANK (filter bank) features and fundamental frequency features. As is well known to those skilled in the art, given the nature of speech processing, phoneme features and voice features are not extracted from the complete voice data but from voice frames obtained by framing the voice at a preset length (i.e., voice segments of a preset length). The phoneme feature sequence in step S120 is therefore composed of the phoneme features corresponding to a plurality of voice frames, and likewise the voice feature sequence is composed of the voice features corresponding to a plurality of voice frames.
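As a concrete illustration (not part of the claims), the spectral voice features described above might be computed per 40 ms frame with librosa as sketched below. The exact dimensions (13 MFCCs, 26 log-mel filter banks, one fundamental-frequency value) are assumptions for the sketch, and the phoneme one-hot labels would come separately from an automatic speech recognition aligner, which is not shown.

```python
import numpy as np
import librosa

def extract_voice_features(wav_path, frame_period_s=0.040):
    """Sketch: per-frame MFCC, FBANK (log-mel) and fundamental-frequency features."""
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(sr * frame_period_s)                      # 40 ms -> 25 frames per second

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)           # (13, T)
    fbank = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=26, hop_length=hop))   # (26, T)
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr, hop_length=hop)     # (T,)
    f0 = np.nan_to_num(f0)                              # unvoiced frames -> 0

    T = min(mfcc.shape[1], fbank.shape[1], f0.shape[0])  # align frame counts defensively
    return np.concatenate([mfcc[:, :T], fbank[:, :T], f0[None, :T]], axis=0).T   # (T, 40)
```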
In step S130, the term "reference image" refers to the driven still image, and the structural features are the usable structural information the reference image naturally carries. In an exemplary embodiment, extracting the structural features from the reference image may be implemented as: inputting the reference image into a pre-trained unsupervised key point detector and extracting the feature-map representation output by an intermediate layer of the unsupervised key point detector as the structural features. The structural features mainly contain the structural distribution information of the head, body, background and so on of the character in the reference image, which is the training data the deep neural network needs when driving the still image.
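One way such an intermediate feature map could be captured is sketched below, assuming a PyTorch implementation of the pre-trained keypoint detector; the layer name is a placeholder that depends on the detector actually used.

```python
import torch

def extract_structural_features(keypoint_detector, reference_image, layer_name="encoder.block3"):
    """Sketch: take an intermediate feature map of a pre-trained unsupervised
    keypoint detector as the structural feature f_r."""
    captured = {}

    def hook(module, inputs, output):
        captured["feature_map"] = output.detach()

    layer = dict(keypoint_detector.named_modules())[layer_name]   # placeholder layer name
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        keypoint_detector(reference_image)        # reference_image: (1, 3, H, W)
    handle.remove()
    return captured["feature_map"]                # e.g. (1, C, H', W')
```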
Referring to fig. 2, fig. 2 is a flow chart of an alternative embodiment of step S120 shown in fig. 1, where step S120 further includes the following steps:
step S121, splitting the voice into a plurality of voice frames according to a preset period;
step S122, extracting the phoneme characteristic and the voice characteristic of each voice frame respectively;
Step S123, selecting a plurality of the phoneme features to form the phoneme feature sequence and selecting a plurality of the voice features to form the voice feature sequence according to a preset time sequence window.
In this alternative embodiment, when the preset period in step S121 is set to 40 ms, each second of the voice data can be split into 25 voice frames, and the phoneme feature and the voice feature of each voice frame are extracted separately in step S122. To improve the stability of the motion field representation across consecutive frames, a preset timing window may be used in step S123 to select a plurality of phoneme features to form the phoneme feature sequence and a plurality of voice features to form the voice feature sequence, the timing window constraining the temporal extent of both sequences. If the phoneme feature and the voice feature of the i-th frame are denoted p_i and a_i respectively, and the length of the timing window is chosen as 2n+1, then the phoneme feature sequence is (p_{i-n}, ..., p_i, ..., p_{i+n}) and the voice feature sequence is (a_{i-n}, ..., a_i, ..., a_{i+n}).
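A minimal sketch of this windowing follows; edge padding at the sequence boundaries is an assumption of the sketch, not something the embodiment prescribes.

```python
import numpy as np

def select_window(phoneme_feats, voice_feats, i, n):
    """Sketch: select the (2n+1)-frame window centred on frame i.

    phoneme_feats: (T, P) array of per-frame phoneme features p_t
    voice_feats:   (T, A) array of per-frame voice features a_t
    Returns (p_{i-n..i+n}, a_{i-n..i+n}).
    """
    T = phoneme_feats.shape[0]
    idx = np.clip(np.arange(i - n, i + n + 1), 0, T - 1)   # edge-pad out-of-range frames
    return phoneme_feats[idx], voice_feats[idx]
```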
In step S200, the phoneme feature sequence and the head motion coding sequence are spliced to obtain a first joint coding sequence, and the speech feature sequence and the structural feature are spliced to obtain a second joint coding sequence, so as to construct a suitable input for a subsequent neural network.
To facilitate splicing the head motion coding sequence with the phoneme feature sequence, the head motion coding sequence should generally have a data arrangement similar to that of the phoneme feature sequence; for example, the i-th head motion code in the sequence is denoted h_i, and the head motion coding sequence is (h_{i-n}, ..., h_i, ..., h_{i+n}). To form the first joint coding of the i-th frame by splicing p_i and h_i together, the phoneme feature p_i may first be suitably preprocessed, for example converted into an embedding vector in the manner of a word vector, and finally all the spliced first joint codings are combined into the first joint coding sequence.
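A sketch of one way to build the first joint coding, assuming the phoneme label is embedded with a learned lookup table; the phoneme inventory size and embedding width are illustrative.

```python
import torch
import torch.nn as nn

class FirstJointCoding(nn.Module):
    """Sketch: embed the one-hot phoneme label of each frame (a word-vector style
    lookup) and splice it with the corresponding head motion code."""
    def __init__(self, num_phonemes=40, phone_dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, phone_dim)

    def forward(self, phoneme_ids, head_codes):
        # phoneme_ids: (B, 2n+1) integer labels; head_codes: (B, 2n+1, H)
        p = self.embed(phoneme_ids)                  # (B, 2n+1, phone_dim)
        return torch.cat([p, head_codes], dim=-1)    # (B, 2n+1, phone_dim + H)
```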
Before splicing the voice feature sequence and the structural features to obtain the second joint coding sequence, the method further includes: using an up-sampling convolution network to modify the channel dimension of each voice feature in the voice feature sequence so that it is consistent with the structural features. If the structural feature is denoted f_r, then for the voice feature sequence (a_{i-n}, ..., a_i, ..., a_{i+n}), the voice feature a_i of the i-th frame has its two-dimensional feature dimensions modified by the up-sampling convolution network to match the feature dimensions of the structural feature, so that the modified a_i and the structural feature f_r can be spliced along the channel dimension to obtain the second joint coding of the i-th frame; finally, all the spliced second joint codings are combined into the second joint coding sequence.
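A sketch of such a fusion follows, assuming for illustration a 40-dimensional voice feature per frame and a 16x16 structural feature map; both sizes, and the two-layer transposed-convolution up-sampler, are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class SecondJointCoding(nn.Module):
    """Sketch: up-sample each frame's voice feature to the spatial size of the
    structural feature map f_r and splice the two along the channel dimension."""
    def __init__(self, voice_dim=40, out_channels=32, base_size=4):
        super().__init__()
        self.out_channels, self.base_size = out_channels, base_size
        self.project = nn.Linear(voice_dim, out_channels * base_size * base_size)
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(out_channels, out_channels, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(out_channels, out_channels, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, voice_feat, structure):
        # voice_feat: (B, voice_dim); structure: (B, C, 16, 16) in this sketch
        x = self.project(voice_feat).view(-1, self.out_channels, self.base_size, self.base_size)
        x = self.upsample(x)                         # (B, out_channels, 16, 16)
        return torch.cat([x, structure], dim=1)      # channel-wise splice
```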
It will be understood by those skilled in the art that after step S200 is performed, the input constructed for the subsequent neural network for the i-th frame is (a_{i-n:i+n}, p_{i-n:i+n}, h_{i-n:i+n}, f_r), i.e., a conditional window input of length 2n+1, centred on the i-th frame, composed of the phoneme features, head motion codes, voice features and the structural feature. Here i and n are positive integers, the window covers frames i-n to i+n, and the value of n is not particularly limited.
In step S300, the first joint coding sequence is input into the encoder of the attention-mechanism-based neural network model, and the encoder outputs the hidden space representation of the target voice frame, i.e., a generic hidden-space mouth-shape representation of the target voice frame. When the input is (a_{i-n:i+n}, p_{i-n:i+n}, h_{i-n:i+n}, f_r), the target voice frame is the i-th frame described above. Since this stage models with the phoneme features of the voice, the hidden space representation has good timbre generalization capability.
In step S400, the hidden space representation and the second joint coding sequence are jointly input into the decoder of the attention-mechanism-based neural network model, where the hidden space representation serves as the key-value pair for the decoder's attention and the second joint coding sequence serves as the query vector of the decoder. Typically, the attention-mechanism-based neural network model may be a Transformer, the hidden space representation corresponding to the key-value input of the decoder and the second joint coding sequence to the query input of the decoder. Because the second joint coding sequence is used as the decoder's query vector, the data inherent in the voice features and structural features it contains is fully exploited, while the decoder attends over the hidden space representation as key and value; the output of the model can therefore further modulate the mouth shape while preserving its accuracy, making the mouth movements more natural.
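A sketch of this encoder-decoder arrangement using PyTorch's standard Transformer modules is given below. The projection layers, dimensions and layer counts are assumptions, and the per-frame second joint coding is assumed to have been flattened to a vector beforehand.

```python
import torch
import torch.nn as nn

class AudioVisualTransformer(nn.Module):
    """Sketch: the encoder consumes the first joint coding sequence; the decoder's
    cross-attention uses the encoder output as key/value (memory) and the second
    joint coding sequence as the query."""
    def __init__(self, first_dim=128, second_dim=512, d_model=256, nhead=4, num_layers=3):
        super().__init__()
        self.proj_first = nn.Linear(first_dim, d_model)
        self.proj_second = nn.Linear(second_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)

    def forward(self, first_joint_seq, second_joint_seq):
        # first_joint_seq: (B, 2n+1, first_dim); second_joint_seq: (B, 2n+1, second_dim)
        memory = self.encoder(self.proj_first(first_joint_seq))   # hidden space representation
        query = self.proj_second(second_joint_seq)
        # Cross-attention: query = second joint coding, key/value = hidden representation.
        return self.decoder(tgt=query, memory=memory)             # (B, 2n+1, d_model)
```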
In step S500, the decoder outputs the feature vector of the target voice frame; more precisely, it outputs a set of feature vectors that contains the feature vector of the target voice frame. For example, when the input is (a_{i-n:i+n}, p_{i-n:i+n}, h_{i-n:i+n}, f_r), the decoder outputs feature vectors for 2n+1 frames; the target voice frame is the i-th of these 2n+1 frames, and its feature vector is selected for step S600.
In step S600, the feature vector of the target voice frame is converted into the description parameters of the dense motion field; typically, the feature vector may be converted into the description parameters using a fully connected layer. More preferably, step S600 may be implemented as follows: the feature vector of the target voice frame is input into two fully connected models respectively, and the two fully connected models respectively output the description parameters of the corresponding categories. The description parameters include key point parameters for composing the dense motion field and local affine transformation parameters corresponding to the key points. The key point parameters include coordinate data and the like, and the local affine transformation parameters include the local affine transformation matrix of each key point and its first-order Jacobian.
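A sketch of the two fully connected heads, assuming for illustration 10 key points and a 256-dimensional decoder feature vector; both numbers are assumptions, not values prescribed by the embodiment.

```python
import torch
import torch.nn as nn

class MotionParameterHeads(nn.Module):
    """Sketch: two fully connected heads mapping the decoder's per-frame feature
    vector to key point coordinates and to 2x2 local affine (Jacobian) matrices."""
    def __init__(self, feat_dim=256, num_kp=10):
        super().__init__()
        self.num_kp = num_kp
        self.kp_head = nn.Linear(feat_dim, num_kp * 2)    # key point coordinates (x, y)
        self.jac_head = nn.Linear(feat_dim, num_kp * 4)   # local affine transformation matrices

    def forward(self, feature_vector):                    # (B, feat_dim)
        kp = self.kp_head(feature_vector).view(-1, self.num_kp, 2)
        jacobian = self.jac_head(feature_vector).view(-1, self.num_kp, 2, 2)
        return kp, jacobian
```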
After converting the feature vectors into description parameters of the dense motion field, it is obvious that subsequent steps related to facial speech animation synthesis may be performed using the description parameters, e.g. generating a video containing the target avatar from the description parameters.
It should be noted that although the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations be performed in that particular order or that all illustrated operations be performed to achieve desirable results. Rather, the steps depicted in the flowcharts may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
For example, although fig. 1 shows that step S120 and step S130 are sequentially performed, in other embodiments, step S120 and step S130 may be performed in parallel.
Correspondingly, the invention also provides a computing device associated with audio-video fitting, please refer to fig. 3, fig. 3 is a schematic structural diagram of a specific embodiment of the computing device associated with audio-video fitting according to the invention, the device comprises:
a head motion data module 110 for acquiring or generating a head motion coding sequence;
A voice processing module 120 for extracting a phoneme feature sequence and a voice feature sequence from the target voice 121;
a keypoint detection module 130 for extracting structural features from a reference image 131 containing the target avatar;
A first joint coding module 210, configured to splice the phoneme feature sequence and the head motion coding sequence to obtain a first joint coding sequence;
A second joint coding module 220, configured to splice the speech feature sequence and the structural feature to obtain a second joint coding sequence;
A neural network model 300 based on an attention mechanism, the neural network model comprising an encoder 310 and a decoder 320;
the encoder 310 is configured to receive the first joint coding sequence as input and, after temporal modeling, to output the hidden space representation of the target voice frame;
The decoder 320 is configured to receive the hidden space representation and the second joint coding sequence as input and to output the feature vector of the target voice frame, where the hidden space representation is the key-value pair for the attention of the decoder 320 and the second joint coding sequence is the query vector of the decoder 320;
a converter module 400 connected to the output of the decoder 320 for converting the feature vector into a description parameter of the dense motion field.
The terms appearing in this section, such as "coding sequence", "phoneme feature sequence" and "structural features", have the same meanings as in the foregoing; for their definitions and working principles, refer to the descriptions and explanations of the relevant sections above, which are not repeated here for brevity.
Optionally, the computing device further includes an upsampling convolutional network 140 for modifying the channel dimensions of each speech feature in the sequence of speech features, respectively, so that the channel dimensions of the speech features are consistent with the structural features.
Alternatively, the converter module 400 may be implemented with a fully connected layer comprising two fully connected models, such as the fully connected model 401 and the fully connected model 402 in fig. 3, which respectively output the description parameters of the corresponding categories; specifically, the description parameters include key point parameters for composing the dense motion field and local affine transformation parameters corresponding to the key points.
Typically, the attention-mechanism-based neural network model 300 is a Transformer; the hidden space representation corresponds to the key-value input of the decoder 320, and the second joint coding sequence corresponds to the query input of the decoder 320.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a typical computer device for executing an embodiment of the audio-video fitting association calculation method according to the present invention. More specifically, the audio-video fitting association computing device described above may be included as part of the computer device. The computer device comprises at least the following parts: a CPU (central processing unit) 501, a RAM (random access memory) 502, a ROM (read only memory) 503, a system bus 500, a hard disk control unit 504, a hard disk 505, a man-machine interaction external device control unit 506, a man-machine interaction external device 507, a serial interface control unit 508, a serial interface external device 509, a parallel interface control unit 510, a parallel interface external device 511, a display device control unit 512, and a display device 513. The CPU 501, the RAM 502, the ROM 503, the hard disk control unit 504, the man-machine interaction external device control unit 506, the serial interface control unit 508, the parallel interface control unit 510, and the display device control unit 512 are connected to the system bus 500 and communicate with one another through it. Further, the hard disk control unit 504 is connected to a hard disk 505; the man-machine interaction external device control unit 506 is connected to a man-machine interaction external device 507, typically a mouse, a trackball, a touch screen or a keyboard; the serial interface control unit 508 is connected to a serial interface external device 509; the parallel interface control unit 510 is connected to a parallel interface external device 511; the display device control unit 512 is connected to a display device 513.
The block diagram depicted in FIG. 4 illustrates the structure of one type of computer device capable of practicing the various embodiments of the invention and does not limit the environments in which the invention can be practiced. In some cases, components of the computer device may be added or removed as needed. For example, the device shown in fig. 4 may omit the man-machine interaction external device 507 and the display device 513, in which case the embodiment is simply a server accessed by external devices. The computer devices shown in fig. 4 may, of course, implement the operating environment of the present invention on their own, or may be interconnected by a network to provide an operating environment to which the various embodiments of the present invention are applicable; for example, the modules and/or steps of the present invention may be implemented in a distributed fashion across the interconnected computer devices.
Furthermore, one or more computer-readable media storing computer-executable instructions that, when used by one or more computer devices, cause the one or more computer devices to perform various embodiments of an audio-video fit-associated computing method as described above, such as the audio-video fit-associated computing method shown in fig. 1, are disclosed. Computer readable media can be any available media that can be accessed by the computer device and includes both volatile and nonvolatile media, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer-readable media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computing device. Combinations of any of the above should also be included within the scope of computer readable media.
Correspondingly, the invention also discloses a computer device, which comprises a memory and a processor, wherein: the memory stores a computer program which when executed by the processor implements various embodiments of the audio-video fit-related computing method as described above, such as the audio-video fit-related computing method shown in fig. 1.
The part related to software logic in the audio and video fitting associated calculation method provided by the invention can be realized by a programmable logic device, and can also be implemented as a computer program product, and the program product enables a computer to execute the method. The computer program product comprises a computer-readable storage medium having computer program logic or code portions embodied therein for carrying out the steps of the methods described above. The computer readable storage medium may be a built-in medium installed in a computer or a removable medium (e.g., a hot-pluggable storage device) detachable from a computer main body. The built-in medium includes, but is not limited to, rewritable nonvolatile memory such as RAM, ROM, and hard disk. The removable media includes, but is not limited to: optical storage media (e.g., CD-ROM and DVD), magneto-optical storage media (e.g., MO), magnetic storage media (e.g., magnetic tape or removable hard disk), media with built-in rewritable non-volatile memory (e.g., memory card), and media with built-in ROM (e.g., ROM cartridge).
It will be appreciated by those skilled in the art that any computer system having suitable programming means is capable of executing the steps of the method of the present invention embodied in a computer program product. Although most of the specific embodiments described in this specification focus on software programs, alternative embodiments that implement the methods provided by the present invention in hardware are also within the scope of the invention as claimed.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements, units or steps, and that the singular does not exclude a plurality. A plurality of components, units or means recited in the claims can also be implemented by means of one component, unit or means in software or hardware.
When the embodiments provided by the invention perform audio-video fitting association calculation, the selected input features include the phoneme features and the voice features carried by the voice of a single subject, together with the structural features of the reference image. Compared with the prior art, the selection of phoneme features and voice features improves the generalization of the audio-video fitting association result as well as the visual representation of the mouth shape; the selection of structural features lets the calculation attend more closely to the structural distribution of the character and the background in the reference image, which further improves generalization, so the quality of the face-talking animation synthesized from the calculation result improves correspondingly. Implemented as a preparatory stage for face-talking animation synthesis, the embodiments provided by the invention can significantly improve the synthesis quality and final effect of the facial speaking animation.
The above disclosure is intended to be illustrative of only and not limiting of the scope of the invention, which is to be given the full breadth of the appended claims and any and all equivalents thereof.

Claims (12)

1. A method for computing an audio-video fit association, the method comprising:
Acquiring a head motion coding sequence, target voice and a reference image containing a target head portrait;
extracting a phoneme characteristic sequence and a voice characteristic sequence from the target voice;
Extracting structural features from the reference image;
Splicing the phoneme characteristic sequence and the head motion coding sequence to obtain a first joint coding sequence, and splicing the voice characteristic sequence and the structural characteristic to obtain a second joint coding sequence;
Inputting the first joint coding sequence into an encoder of a neural network model based on an attention mechanism, wherein the encoder obtains a hidden space representation of a target voice frame;
Jointly inputting the hidden space representation and the second joint coding sequence into a decoder of the neural network model based on the attention mechanism, wherein the hidden space representation serves as the key-value pair for the decoder's attention and the second joint coding sequence serves as the query vector of the decoder;
The decoder outputs the feature vector of the target voice frame;
the feature vector is converted into a description parameter of the dense motion field.
2. The method of claim 1, wherein the step of obtaining a head motion encoding sequence comprises:
generating the head motion coding sequence according to preset head motion data; or
generating the head motion coding sequence according to a video matched with the target voice.
3. The method of claim 1, wherein the step of extracting a phoneme feature sequence and a speech feature sequence from a target speech comprises:
splitting the target voice into a plurality of voice frames according to a preset period;
respectively extracting the phoneme characteristic and the voice characteristic of each voice frame;
and selecting a plurality of phoneme features to form the phoneme feature sequence according to a preset time sequence window, and selecting a plurality of voice features to form the voice feature sequence.
4. The method of claim 1, wherein the step of extracting structural features from the reference image comprises:
inputting the reference image into a pre-trained non-supervision key point detector, and extracting a characteristic diagram representation of the middle layer output of the non-supervision key point detector as the structural characteristic.
5. The method of claim 1, further comprising, prior to concatenating the sequence of speech features with the structural feature to obtain the second joint coding sequence:
and respectively modifying the channel dimension of each voice feature in the voice feature sequence by using an up-sampling convolution network to enable the channel dimension of the voice feature to be consistent with the structural feature.
6. The method according to claim 1, wherein the step of converting the feature vector into a description parameter of the dense motion field comprises:
The feature vector is converted into the descriptive parameters using a fully connected layer.
7. The method of claim 6, wherein the step of converting feature vectors of the target speech frame into the description parameters using a full connection layer comprises:
And respectively inputting the feature vectors of the target voice frame into two full-connection models, and respectively outputting the description parameters of the corresponding categories through the two full-connection models.
8. The method of claim 1, wherein the descriptive parameters include:
key point parameters for composing the dense motion field, and local affine transformation parameters corresponding to the key points.
9. The method according to claim 1, wherein after said converting said feature vector into a description parameter of the dense motion field, the method further comprises:
And generating a video containing the target head portrait according to the description parameters.
10. A computing device associated with audio-video fits, the device comprising:
The head motion data module is used for acquiring or generating a head motion coding sequence;
The voice processing module is used for extracting a phoneme characteristic sequence and a voice characteristic sequence from the target voice;
The key point detection module is used for extracting structural features from a reference image containing the target head portrait;
The first joint coding module is used for splicing the phoneme characteristic sequences and the head motion coding sequences to obtain a first joint coding sequence;
The second joint coding module is used for splicing the voice characteristic sequence and the structural characteristic to obtain a second joint coding sequence;
A neural network model based on an attention mechanism, the neural network model including an encoder and a decoder;
the encoder is used for receiving the first joint coding sequence as input and, after temporal modeling, outputting the hidden space representation of the target voice frame;
The decoder is configured to receive the hidden space representation and the second joint coding sequence as input and to output the feature vector of the target voice frame, where the hidden space representation is the key-value pair for the decoder's attention and the second joint coding sequence is the query vector of the decoder;
and the converter module is connected with the output end of the decoder and is used for converting the characteristic vector into the description parameter of the dense motion field.
11. One or more computer-readable media storing computer-executable instructions that, when used by one or more computer devices, cause the one or more computer devices to perform the method of computing an audio-video fit association as recited in any one of claims 1-9.
12. A computer device comprising a memory and a processor, wherein:
The memory stores a computer program which when executed by the processor implements the audio-video fit-related computing method according to any one of claims 1 to 9.
CN202111442573.6A 2021-11-30 2021-11-30 Audio and video fitting associated computing method, device, medium and equipment Active CN113963092B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111442573.6A CN113963092B (en) 2021-11-30 2021-11-30 Audio and video fitting associated computing method, device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111442573.6A CN113963092B (en) 2021-11-30 2021-11-30 Audio and video fitting associated computing method, device, medium and equipment

Publications (2)

Publication Number Publication Date
CN113963092A CN113963092A (en) 2022-01-21
CN113963092B true CN113963092B (en) 2024-05-03

Family

ID=79472581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111442573.6A Active CN113963092B (en) 2021-11-30 2021-11-30 Audio and video fitting associated computing method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN113963092B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778041B (en) * 2023-08-22 2023-12-12 北京百度网讯科技有限公司 Multi-mode-based face image generation method, model training method and equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111489424A (en) * 2020-04-10 2020-08-04 网易(杭州)网络有限公司 Virtual character expression generation method, control method, device and terminal equipment
CN112017633A (en) * 2020-09-10 2020-12-01 北京地平线信息技术有限公司 Voice recognition method, voice recognition device, storage medium and electronic equipment
CN112581569A (en) * 2020-12-11 2021-03-30 中国科学院软件研究所 Adaptive emotion expression speaker facial animation generation method and electronic device
CN112712813A (en) * 2021-03-26 2021-04-27 北京达佳互联信息技术有限公司 Voice processing method, device, equipment and storage medium
CN112785667A (en) * 2021-01-25 2021-05-11 北京有竹居网络技术有限公司 Video generation method, device, medium and electronic equipment
CN113168826A (en) * 2018-12-03 2021-07-23 Groove X Inc. Robot, speech synthesis program, and speech output method
CN113269066A (en) * 2021-05-14 2021-08-17 网易(杭州)网络有限公司 Speaking video generation method and device and electronic equipment
CN113314094A (en) * 2021-05-28 2021-08-27 北京达佳互联信息技术有限公司 Lip-shaped model training method and device and voice animation synthesis method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019226964A1 (en) * 2018-05-24 2019-11-28 Warner Bros. Entertainment Inc. Matching mouth shape and movement in digital video to alternative audio
US11417041B2 (en) * 2020-02-12 2022-08-16 Adobe Inc. Style-aware audio-driven talking head animation from a single image

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113168826A (en) * 2018-12-03 2021-07-23 Groove X Inc. Robot, speech synthesis program, and speech output method
CN111489424A (en) * 2020-04-10 2020-08-04 网易(杭州)网络有限公司 Virtual character expression generation method, control method, device and terminal equipment
CN112017633A (en) * 2020-09-10 2020-12-01 北京地平线信息技术有限公司 Voice recognition method, voice recognition device, storage medium and electronic equipment
CN112581569A (en) * 2020-12-11 2021-03-30 中国科学院软件研究所 Adaptive emotion expression speaker facial animation generation method and electronic device
CN112785667A (en) * 2021-01-25 2021-05-11 北京有竹居网络技术有限公司 Video generation method, device, medium and electronic equipment
CN112712813A (en) * 2021-03-26 2021-04-27 北京达佳互联信息技术有限公司 Voice processing method, device, equipment and storage medium
CN113269066A (en) * 2021-05-14 2021-08-17 网易(杭州)网络有限公司 Speaking video generation method and device and electronic equipment
CN113314094A (en) * 2021-05-28 2021-08-27 北京达佳互联信息技术有限公司 Lip-shaped model training method and device and voice animation synthesis method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fine-grained grocery product recognition by one-shot learning; Weidong Geng et al.; ACM International Conference on Multimedia; 2018-12-31; pp. 1706-1714 *
Research on Expressions and Actions of Real-Time Speech-Synchronized 3D Virtual Humans; Wei Xueling; China Masters' Theses Full-text Database, Information Science and Technology; 2017-01-15 (No. 1); p. I138-641 *

Also Published As

Publication number Publication date
CN113963092A (en) 2022-01-21

Similar Documents

Publication Publication Date Title
CN110246488B (en) Voice conversion method and device of semi-optimized cycleGAN model
CN110600013B (en) Training method and device for non-parallel corpus voice conversion data enhancement model
CN112329451B (en) Sign language action video generation method, device, equipment and storage medium
CN113077537A (en) Video generation method, storage medium and equipment
CN113111812A (en) Mouth action driving model training method and assembly
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
Wang et al. HMM trajectory-guided sample selection for photo-realistic talking head
Thangthai et al. Synthesising visual speech using dynamic visemes and deep learning architectures
CN116309984A (en) Mouth shape animation generation method and system based on text driving
CN117349427A (en) Artificial intelligence multi-mode content generation system for public opinion event coping
CN113963092B (en) Audio and video fitting associated computing method, device, medium and equipment
CN115497448A (en) Method and device for synthesizing voice animation, electronic equipment and storage medium
Websdale et al. Speaker-independent speech animation using perceptual loss functions and synthetic data
Huang et al. Fine-grained talking face generation with video reinterpretation
Filntisis et al. Video-realistic expressive audio-visual speech synthesis for the Greek language
CN112185340A (en) Speech synthesis method, speech synthesis device, storage medium and electronic apparatus
CN117152285A (en) Virtual person generating method, device, equipment and medium based on audio control
CN114581570B (en) Three-dimensional face action generation method and system
Sun et al. Pre-avatar: An automatic presentation generation framework leveraging talking avatar
CN113990295A (en) Video generation method and device
Chu et al. CorrTalk: Correlation Between Hierarchical Speech and Facial Activity Variances for 3D Animation
CN115529500A (en) Method and device for generating dynamic image
CN116129852A (en) Training method of speech synthesis model, speech synthesis method and related equipment
Liu et al. Optimization of an image-based talking head system
Kolivand et al. Realistic lip syncing for virtual character using common viseme set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant