CN113963092B - Audio and video fitting associated computing method, device, medium and equipment - Google Patents

Audio and video fitting associated computing method, device, medium and equipment

Info

Publication number
CN113963092B
Authority
CN
China
Prior art keywords
voice
sequence
feature
coding sequence
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111442573.6A
Other languages
Chinese (zh)
Other versions
CN113963092A (en)
Inventor
王苏振 (Wang Suzhen)
李林橙 (Li Lincheng)
丁彧 (Ding Yu)
吕唐杰 (Lv Tangjie)
范长杰 (Fan Changjie)
胡志鹏 (Hu Zhipeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202111442573.6A priority Critical patent/CN113963092B/en
Publication of CN113963092A publication Critical patent/CN113963092A/en
Application granted granted Critical
Publication of CN113963092B publication Critical patent/CN113963092B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a computing method for audio-video fitting association that exploits both the phoneme features and the voice features of a target voice, as well as structural features extracted from a reference image. The invention also provides a corresponding computing device, medium and equipment for audio-video fitting association. Implemented as a preparatory stage for face-talking animation synthesis, the embodiments provided by the invention can significantly improve the synthesis quality and final effect of the facial speaking animation.

Description

Audio and video fitting associated computing method, device, medium and equipment
Technical Field
The invention belongs to the field of neural networks and artificial intelligence, and particularly relates to a computing method, a computing device, a computing medium and computing equipment for audio and video fitting association.
Background
Face speech animation construction in audio-video fitting association calculation refers to synthesizing a virtual talking animation of a character, taking the character in a given reference image as the basis and driven by arbitrary input voice. Ideally, the generated virtual talking animation should present mouth shapes, expressions and natural head motions that match the input voice. The technology can be widely applied in fields such as virtual assistants, intelligent customer service, news broadcasting and teleconferencing, and, being based on artificial intelligence, greatly reduces the manual labor of the related industries.
The prior art for constructing facial speech animation mainly relies on audio-visual correlation calculation over video data: it constructs features related to the speaking action from the source data and uses a deep generative model to generate simulated face images from those features, thereby obtaining the animation effect of the simulated face speaking the specified voice.
When a three-dimensional morphable model or facial key points are selected as the visual modality representation, the regions outside the face are not represented, so the facial speech animation synthesized by the deep generative model is blurred outside the face region and its quality is poor.
When voice features are selected as the speech modality representation, a deep generative model trained on data from a single subject generalizes poorly, while with training data from multiple subjects the model cannot learn an accurate audio-visual association representation because of the differences among the subjects.
When the phoneme features of a single subject are selected as the speech modality representation, the difficulty of generalizing across timbres is alleviated to some extent, but phoneme features are hard to associate with the emotion carried by natural speech, so the facial speech animation generated by the trained deep generative model shows obvious motion flaws, such as unnatural mouth movements when speaking sentences with a strong tone of voice.
It can be seen that each of the prior-art solutions has difficulty producing high-quality facial speech animation.
Disclosure of Invention
In order to overcome the above-mentioned drawbacks in the prior art, the present invention provides a method for calculating an audio/video fitting association, which includes:
acquiring a head motion coding sequence, a target voice, and a reference image containing a target head portrait;
extracting a phoneme feature sequence and a voice feature sequence from the target voice;
extracting structural features from the reference image;
splicing the phoneme feature sequence and the head motion coding sequence to obtain a first joint coding sequence, and splicing the voice feature sequence and the structural features to obtain a second joint coding sequence;
inputting the first joint coding sequence into an encoder of a neural network model based on an attention mechanism, the encoder obtaining a hidden space representation of a target voice frame;
jointly inputting the hidden space representation and the second joint coding sequence into a decoder of the neural network model based on the attention mechanism, wherein the hidden space representation serves as the key-value pair for the decoder's attention and the second joint coding sequence serves as the query vector of the decoder;
the decoder outputting a feature vector of the target voice frame;
converting the feature vector into description parameters of a dense motion field. A schematic sketch of this pipeline is given below.
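For orientation only, the following Python-style sketch outlines the data flow of the above steps. All module and attribute names (phoneme_extractor, keypoint_detector, fuse_speech_and_structure, to_motion_params and so on) are hypothetical placeholders, not part of the claimed method.

```python
# Schematic sketch of the claimed pipeline (hypothetical names, PyTorch-style).
import torch

def audio_video_fitting_association(head_codes, phoneme_seq, speech_seq, reference_image, model):
    # Structural features extracted from the reference image.
    structure = model.keypoint_detector.intermediate_features(reference_image)

    # First joint coding: phoneme feature sequence spliced with the head motion coding sequence.
    first_joint = torch.cat([phoneme_seq, head_codes], dim=-1)            # (2n+1, P + H)

    # Second joint coding: voice feature sequence spliced with the structural features.
    second_joint = model.fuse_speech_and_structure(speech_seq, structure)

    # Attention-based encoder-decoder: the encoder's hidden space representation is the
    # key/value of the decoder's attention; the second joint coding is the query.
    hidden = model.encoder(first_joint)
    features = model.decoder(tgt=second_joint, memory=hidden)

    # Convert the target frame's feature vector into dense-motion-field description parameters.
    return model.to_motion_params(features)
```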
According to one aspect of the invention, the step of obtaining the head motion coding sequence in the method comprises: generating the head motion coding sequence according to preset head motion data; or generating the head motion coding sequence according to the video matched with the target voice. According to another aspect of the present invention, the step of extracting the phoneme feature sequence and the speech feature sequence from the target speech in the method includes: splitting the target voice into a plurality of voice frames according to a preset period; respectively extracting the phoneme characteristic and the voice characteristic of each voice frame; and selecting a plurality of phoneme features to form the phoneme feature sequence according to a preset time sequence window, and selecting a plurality of voice features to form the voice feature sequence.
According to another aspect of the invention, the step of extracting structural features from the reference image in the method comprises: inputting the reference image into a pre-trained non-supervision key point detector, and extracting a characteristic diagram representation of the middle layer output of the non-supervision key point detector as the structural characteristic.
According to another aspect of the present invention, before splicing the speech feature sequence and the structural feature to obtain the second joint coding sequence, the method further comprises: and respectively modifying the channel dimension of each voice feature in the voice feature sequence by using an up-sampling convolution network to enable the channel dimension of the voice feature to be consistent with the structural feature.
According to another aspect of the invention, the step of converting the feature vector into the description parameters of the dense motion field in the method comprises: the feature vector is converted into the descriptive parameters using a fully connected layer.
According to another aspect of the present invention, the step of converting the feature vector of the target voice frame into the description parameters using the fully connected layer in the method includes: inputting the feature vector of the target voice frame into two fully connected models respectively, the two fully connected models respectively outputting the description parameters of the corresponding categories.
According to another aspect of the invention, the description parameters in the method include: key point parameters for composing the dense motion field, and local affine transformation parameters corresponding to the key points.
According to another aspect of the invention, after said converting said feature vector into a description parameter of the dense motion field, the method further comprises: and generating a video containing the target head portrait according to the description parameters.
Correspondingly, the invention provides an audio and video fitting association calculating device, which comprises:
The head motion data module is used for acquiring or generating a head motion coding sequence;
The voice processing module is used for extracting a phoneme characteristic sequence and a voice characteristic sequence from the target voice;
the key point detection module is used for extracting structural features from a reference image containing the target head portrait;
The first joint coding module is used for splicing the phoneme characteristic sequences and the head motion coding sequences to obtain a first joint coding sequence;
The second joint coding module is used for splicing the voice characteristic sequence and the structural characteristic to obtain a second joint coding sequence;
A neural network model based on an attention mechanism, the neural network model including an encoder and a decoder;
the encoder is used for receiving the first joint coding sequence as input and, after temporal modeling, outputting the hidden space representation of the target voice frame;
The decoder is configured to receive the hidden space representation and the second joint coding sequence as input and to output the feature vector of the target voice frame, where the hidden space representation is the key-value pair for the decoder's attention and the second joint coding sequence is the query vector of the decoder;
and the converter module is connected with the output end of the decoder and is used for converting the characteristic vector into the description parameter of the dense motion field.
Furthermore, the present invention provides one or more computer-readable media storing computer-executable instructions that, when used by one or more computer devices, cause the one or more computer devices to perform the method of computing an audio-video fit association as described hereinbefore.
The invention also provides a computer device comprising a memory and a processor, wherein: the memory stores a computer program, and the processor implements the audio and video fitting associated computing method when executing the computer program.
When the embodiments provided by the invention perform audio-video fitting association calculation, the selected input features include the phoneme features and the voice features carried by the voice of a single subject, together with the structural features of the reference image. Compared with the prior art, the selection of phoneme features and voice features improves the generalization of the audio-video fitting association result as well as the visual representation of the mouth shape; the selection of structural features lets the calculation attend more closely to the structural distribution of the character and the background in the reference image, which further improves generalization, so the quality of the face-talking animation synthesized from the calculation result improves correspondingly. Implemented as a preparatory stage for face-talking animation synthesis, the embodiments provided by the invention can significantly improve the synthesis quality and final effect of the facial speaking animation.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings, in which:
FIG. 1 is a flow diagram of one embodiment of a method of computing an audio video fit association in accordance with the present invention;
FIG. 2 is a flow chart of an alternative embodiment of step S120 shown in FIG. 1;
FIG. 3 is a schematic structural diagram of one embodiment of a computing device associated with audio-video fitting in accordance with the present invention;
FIG. 4 is a schematic diagram of an exemplary computer device for performing an embodiment of the audio-video fit-association calculation method of the present invention.
The same or similar reference numbers in the drawings refer to the same or similar parts.
Detailed Description
For a better understanding and explanation of the present invention, reference will be made to the following detailed description of the invention taken in conjunction with the accompanying drawings. The invention is not limited to these specific embodiments only. On the contrary, the invention is intended to cover modifications and equivalent arrangements included within the scope of the appended claims.
It should be noted that numerous specific details are set forth in the following detailed description. It will be understood by those skilled in the art that the present invention may be practiced without these specific details. In the description of the various embodiments, well-known principles, structures and components are not described in detail so as not to obscure the salient points of the invention.
The invention provides a computing method of audio and video fitting association, please refer to fig. 1, fig. 1 is a flow chart of a specific implementation of the computing method of audio and video fitting association according to the invention, the method comprises:
Step S110, a head motion coding sequence, target voice and a reference image containing a target head portrait are acquired;
Step S120, extracting a phoneme characteristic sequence and a voice characteristic sequence from the target voice;
step S130, extracting structural features from the reference image;
Step S200, splicing the phoneme characteristic sequence and the head motion coding sequence to obtain a first joint coding sequence, and splicing the voice characteristic sequence and the structural characteristic to obtain a second joint coding sequence;
Step S300, inputting the first joint coding sequence into an encoder of a neural network model based on an attention mechanism, and outputting a hidden space representation of a target voice frame by the encoder;
Step S400, the hidden space representation and the second joint coding sequence are jointly input into a decoder of the neural network model based on the attention mechanism, wherein the hidden space representation serves as the key-value pair for the decoder's attention, and the second joint coding sequence serves as the query vector of the decoder;
step S500, the decoder outputs the characteristic vector of the target voice frame;
Step S600, converting the feature vector into a description parameter of the dense motion field.
Specifically, the audio-video fitting association calculation method of the present invention is a preparatory step for synthesizing facial speech animation. The ultimate purpose of the synthesized facial speech animation is: given a still image containing a face and a driving source (e.g., an audio-video clip or a sound clip), animate the still face based on the driving source so that it presents, in sequence, speaking actions, head motions, expressions and so on that match the sound of the driving source; that is, to bring the still image to life. To make this animation of the still image look more natural, information describing the head motion is usually introduced, which is precisely the purpose of acquiring the head motion coding sequence in step S110. This embodiment does not limit the source of the head motion coding sequence: it may be generated from preset head motion data, or from a video matched with the voice.
In step S120, the term "voice" refers to the sound data used as the driving source to drive the still image, such as directly recorded sound data or audio track data extracted from a video. The features naturally carried by the voice that can be exploited include phoneme features and voice features. Phoneme features are the phoneme labels extracted from the voice with an automatic speech recognition tool, typically represented with one-hot coding. Voice features are spectral features calculated from the voice and are typically used to represent the emotion it carries; in this embodiment, the voice features are composed of MFCC (Mel-frequency cepstral coefficient) features, FBANK (filter bank) features and fundamental frequency features. As is well known to those skilled in the art, given the nature of speech processing, phoneme features and voice features are not extracted from the complete voice data but from voice frames obtained by framing the voice at a preset length (i.e., voice segments of a preset length). The phoneme feature sequence in step S120 is therefore composed of the phoneme features corresponding to a plurality of voice frames, and likewise the voice feature sequence is composed of the voice features corresponding to a plurality of voice frames.
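As a concrete illustration (not part of the claims), the spectral voice features described above might be computed per 40 ms frame with librosa as sketched below. The exact dimensions (13 MFCCs, 26 log-mel filter banks, one fundamental-frequency value) are assumptions for the sketch, and the phoneme one-hot labels would come separately from an automatic speech recognition aligner, which is not shown.

```python
import numpy as np
import librosa

def extract_voice_features(wav_path, frame_period_s=0.040):
    """Sketch: per-frame MFCC, FBANK (log-mel) and fundamental-frequency features."""
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(sr * frame_period_s)                      # 40 ms -> 25 frames per second

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)           # (13, T)
    fbank = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=26, hop_length=hop))   # (26, T)
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr, hop_length=hop)     # (T,)
    f0 = np.nan_to_num(f0)                              # unvoiced frames -> 0

    T = min(mfcc.shape[1], fbank.shape[1], f0.shape[0])  # align frame counts defensively
    return np.concatenate([mfcc[:, :T], fbank[:, :T], f0[None, :T]], axis=0).T   # (T, 40)
```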
In step S130, the term "reference image" refers to the driven still image, and the structural features are the usable structural information the reference image naturally carries. In an exemplary embodiment, extracting the structural features from the reference image may be implemented as: inputting the reference image into a pre-trained unsupervised key point detector and extracting the feature-map representation output by an intermediate layer of the unsupervised key point detector as the structural features. The structural features mainly contain the structural distribution information of the head, body, background and so on of the character in the reference image, which is the training data the deep neural network needs when driving the still image.
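One way such an intermediate feature map could be captured is sketched below, assuming a PyTorch implementation of the pre-trained keypoint detector; the layer name is a placeholder that depends on the detector actually used.

```python
import torch

def extract_structural_features(keypoint_detector, reference_image, layer_name="encoder.block3"):
    """Sketch: take an intermediate feature map of a pre-trained unsupervised
    keypoint detector as the structural feature f_r."""
    captured = {}

    def hook(module, inputs, output):
        captured["feature_map"] = output.detach()

    layer = dict(keypoint_detector.named_modules())[layer_name]   # placeholder layer name
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        keypoint_detector(reference_image)        # reference_image: (1, 3, H, W)
    handle.remove()
    return captured["feature_map"]                # e.g. (1, C, H', W')
```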
Referring to fig. 2, fig. 2 is a flow chart of an alternative embodiment of step S120 shown in fig. 1, where step S120 further includes the following steps:
step S121, splitting the voice into a plurality of voice frames according to a preset period;
step S122, extracting the phoneme characteristic and the voice characteristic of each voice frame respectively;
Step S123, selecting a plurality of the phoneme features to form the phoneme feature sequence and selecting a plurality of the voice features to form the voice feature sequence according to a preset time sequence window.
In this alternative embodiment, when the preset period in step S121 is set to 40 ms, each second of the voice data can be split into 25 voice frames, and the phoneme feature and the voice feature of each voice frame are extracted separately in step S122. To improve the stability of the motion field representation across consecutive frames, a preset timing window may be used in step S123 to select a plurality of phoneme features to form the phoneme feature sequence and a plurality of voice features to form the voice feature sequence, the timing window constraining the temporal extent of both sequences. If the phoneme feature and the voice feature of the i-th frame are denoted p_i and a_i respectively, and the length of the timing window is chosen as 2n+1, then the phoneme feature sequence is (p_{i-n}, ..., p_i, ..., p_{i+n}) and the voice feature sequence is (a_{i-n}, ..., a_i, ..., a_{i+n}).
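A minimal sketch of this windowing follows; edge padding at the sequence boundaries is an assumption of the sketch, not something the embodiment prescribes.

```python
import numpy as np

def select_window(phoneme_feats, voice_feats, i, n):
    """Sketch: select the (2n+1)-frame window centred on frame i.

    phoneme_feats: (T, P) array of per-frame phoneme features p_t
    voice_feats:   (T, A) array of per-frame voice features a_t
    Returns (p_{i-n..i+n}, a_{i-n..i+n}).
    """
    T = phoneme_feats.shape[0]
    idx = np.clip(np.arange(i - n, i + n + 1), 0, T - 1)   # edge-pad out-of-range frames
    return phoneme_feats[idx], voice_feats[idx]
```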
In step S200, the phoneme feature sequence and the head motion coding sequence are spliced to obtain a first joint coding sequence, and the speech feature sequence and the structural feature are spliced to obtain a second joint coding sequence, so as to construct a suitable input for a subsequent neural network.
To facilitate splicing the head motion coding sequence with the phoneme feature sequence, the head motion coding sequence should generally have a data arrangement similar to that of the phoneme feature sequence; for example, the i-th head motion code in the sequence is denoted h_i, and the head motion coding sequence is (h_{i-n}, ..., h_i, ..., h_{i+n}). To form the first joint coding of the i-th frame by splicing p_i and h_i together, the phoneme feature p_i may first be suitably preprocessed, for example converted into an embedding vector in the manner of a word vector, and finally all the spliced first joint codings are combined into the first joint coding sequence.
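A sketch of one way to build the first joint coding, assuming the phoneme label is embedded with a learned lookup table; the phoneme inventory size and embedding width are illustrative.

```python
import torch
import torch.nn as nn

class FirstJointCoding(nn.Module):
    """Sketch: embed the one-hot phoneme label of each frame (a word-vector style
    lookup) and splice it with the corresponding head motion code."""
    def __init__(self, num_phonemes=40, phone_dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, phone_dim)

    def forward(self, phoneme_ids, head_codes):
        # phoneme_ids: (B, 2n+1) integer labels; head_codes: (B, 2n+1, H)
        p = self.embed(phoneme_ids)                  # (B, 2n+1, phone_dim)
        return torch.cat([p, head_codes], dim=-1)    # (B, 2n+1, phone_dim + H)
```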
Before splicing the voice feature sequence and the structural features to obtain the second joint coding sequence, the method further includes: using an up-sampling convolution network to modify the channel dimension of each voice feature in the voice feature sequence so that it is consistent with the structural features. If the structural feature is denoted f_r, then for the voice feature sequence (a_{i-n}, ..., a_i, ..., a_{i+n}), the voice feature a_i of the i-th frame has its two-dimensional feature dimensions modified by the up-sampling convolution network to match the feature dimensions of the structural feature, so that the modified a_i and the structural feature f_r can be spliced along the channel dimension to obtain the second joint coding of the i-th frame; finally, all the spliced second joint codings are combined into the second joint coding sequence.
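A sketch of such a fusion follows, assuming for illustration a 40-dimensional voice feature per frame and a 16x16 structural feature map; both sizes, and the two-layer transposed-convolution up-sampler, are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class SecondJointCoding(nn.Module):
    """Sketch: up-sample each frame's voice feature to the spatial size of the
    structural feature map f_r and splice the two along the channel dimension."""
    def __init__(self, voice_dim=40, out_channels=32, base_size=4):
        super().__init__()
        self.out_channels, self.base_size = out_channels, base_size
        self.project = nn.Linear(voice_dim, out_channels * base_size * base_size)
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(out_channels, out_channels, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(out_channels, out_channels, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, voice_feat, structure):
        # voice_feat: (B, voice_dim); structure: (B, C, 16, 16) in this sketch
        x = self.project(voice_feat).view(-1, self.out_channels, self.base_size, self.base_size)
        x = self.upsample(x)                         # (B, out_channels, 16, 16)
        return torch.cat([x, structure], dim=1)      # channel-wise splice
```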
It will be understood by those skilled in the art that after step S200 is performed, the input constructed for the subsequent neural network for the i-th frame is (a_{i-n:i+n}, p_{i-n:i+n}, h_{i-n:i+n}, f_r), i.e., a conditional window input of length 2n+1, centred on the i-th frame, composed of the phoneme features, head motion codes, voice features and the structural feature. Here i and n are positive integers, the window covers frames i-n to i+n, and the value of n is not particularly limited.
In step S300, the first joint coding sequence is input into the encoder of the attention-mechanism-based neural network model, and the encoder outputs the hidden space representation of the target voice frame, i.e., a generic hidden-space mouth-shape representation of the target voice frame. When the input is (a_{i-n:i+n}, p_{i-n:i+n}, h_{i-n:i+n}, f_r), the target voice frame is the i-th frame described above. Since this stage models with the phoneme features of the voice, the hidden space representation has good timbre generalization capability.
In step S400, the hidden space representation and the second joint coding sequence are jointly input into the decoder of the attention-mechanism-based neural network model, where the hidden space representation serves as the key-value pair for the decoder's attention and the second joint coding sequence serves as the query vector of the decoder. Typically, the attention-mechanism-based neural network model may be a Transformer, the hidden space representation corresponding to the key-value input of the decoder and the second joint coding sequence to the query input of the decoder. Because the second joint coding sequence is used as the decoder's query vector, the data inherent in the voice features and structural features it contains is fully exploited, while the decoder attends over the hidden space representation as key and value; the output of the model can therefore further modulate the mouth shape while preserving its accuracy, making the mouth movements more natural.
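A sketch of this encoder-decoder arrangement using PyTorch's standard Transformer modules is given below. The projection layers, dimensions and layer counts are assumptions, and the per-frame second joint coding is assumed to have been flattened to a vector beforehand.

```python
import torch
import torch.nn as nn

class AudioVisualTransformer(nn.Module):
    """Sketch: the encoder consumes the first joint coding sequence; the decoder's
    cross-attention uses the encoder output as key/value (memory) and the second
    joint coding sequence as the query."""
    def __init__(self, first_dim=128, second_dim=512, d_model=256, nhead=4, num_layers=3):
        super().__init__()
        self.proj_first = nn.Linear(first_dim, d_model)
        self.proj_second = nn.Linear(second_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)

    def forward(self, first_joint_seq, second_joint_seq):
        # first_joint_seq: (B, 2n+1, first_dim); second_joint_seq: (B, 2n+1, second_dim)
        memory = self.encoder(self.proj_first(first_joint_seq))   # hidden space representation
        query = self.proj_second(second_joint_seq)
        # Cross-attention: query = second joint coding, key/value = hidden representation.
        return self.decoder(tgt=query, memory=memory)             # (B, 2n+1, d_model)
```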
In step S500, the decoder outputs the feature vector of the target voice frame; more precisely, it outputs a set of feature vectors that contains the feature vector of the target voice frame. For example, when the input is (a_{i-n:i+n}, p_{i-n:i+n}, h_{i-n:i+n}, f_r), the decoder outputs feature vectors for 2n+1 frames; the target voice frame is the i-th of these 2n+1 frames, and its feature vector is selected for step S600.
In step S600, the feature vector of the target voice frame is converted into the description parameters of the dense motion field; typically, the feature vector may be converted into the description parameters using a fully connected layer. More preferably, step S600 may be implemented as follows: the feature vector of the target voice frame is input into two fully connected models respectively, and the two fully connected models respectively output the description parameters of the corresponding categories. The description parameters include key point parameters for composing the dense motion field and local affine transformation parameters corresponding to the key points. The key point parameters include coordinate data and the like, and the local affine transformation parameters include the local affine transformation matrix of each key point and its first-order Jacobian.
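A sketch of the two fully connected heads, assuming for illustration 10 key points and a 256-dimensional decoder feature vector; both numbers are assumptions, not values prescribed by the embodiment.

```python
import torch
import torch.nn as nn

class MotionParameterHeads(nn.Module):
    """Sketch: two fully connected heads mapping the decoder's per-frame feature
    vector to key point coordinates and to 2x2 local affine (Jacobian) matrices."""
    def __init__(self, feat_dim=256, num_kp=10):
        super().__init__()
        self.num_kp = num_kp
        self.kp_head = nn.Linear(feat_dim, num_kp * 2)    # key point coordinates (x, y)
        self.jac_head = nn.Linear(feat_dim, num_kp * 4)   # local affine transformation matrices

    def forward(self, feature_vector):                    # (B, feat_dim)
        kp = self.kp_head(feature_vector).view(-1, self.num_kp, 2)
        jacobian = self.jac_head(feature_vector).view(-1, self.num_kp, 2, 2)
        return kp, jacobian
```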
After converting the feature vectors into description parameters of the dense motion field, it is obvious that subsequent steps related to facial speech animation synthesis may be performed using the description parameters, e.g. generating a video containing the target avatar from the description parameters.
It should be noted that although the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations be performed in that particular order or that all illustrated operations be performed to achieve desirable results. Rather, the steps depicted in the flowcharts may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform.
For example, although fig. 1 shows that step S120 and step S130 are sequentially performed, in other embodiments, step S120 and step S130 may be performed in parallel.
Correspondingly, the invention also provides a computing device associated with audio-video fitting, please refer to fig. 3, fig. 3 is a schematic structural diagram of a specific embodiment of the computing device associated with audio-video fitting according to the invention, the device comprises:
a head motion data module 110 for acquiring or generating a head motion coding sequence;
A voice processing module 120 for extracting a phoneme feature sequence and a voice feature sequence from the target voice 121;
a keypoint detection module 130 for extracting structural features from a reference image 131 containing the target avatar;
A first joint coding module 210, configured to splice the phoneme feature sequence and the head motion coding sequence to obtain a first joint coding sequence;
A second joint coding module 220, configured to splice the speech feature sequence and the structural feature to obtain a second joint coding sequence;
A neural network model 300 based on an attention mechanism, the neural network model comprising an encoder 310 and a decoder 320;
the encoder 310 is configured to receive the first joint coding sequence as input and, after temporal modeling, to output the hidden space representation of the target voice frame;
The decoder 320 is configured to receive the hidden space representation and the second joint coding sequence as input and to output the feature vector of the target voice frame, where the hidden space representation is the key-value pair for the attention of the decoder 320 and the second joint coding sequence is the query vector of the decoder 320;
a converter module 400 connected to the output of the decoder 320 for converting the feature vector into a description parameter of the dense motion field.
The terms appearing in this section, such as "coding sequence", "phoneme feature sequence" and "structural features", have the same meanings as in the foregoing; for their definitions and working principles, refer to the descriptions and explanations of the relevant sections above, which are not repeated here for brevity.
Optionally, the computing device further includes an upsampling convolutional network 140 for modifying the channel dimensions of each speech feature in the sequence of speech features, respectively, so that the channel dimensions of the speech features are consistent with the structural features.
Alternatively, the converter module 400 may be implemented with a fully connected layer comprising two fully connected models, such as the fully connected model 401 and the fully connected model 402 in fig. 3, which respectively output the description parameters of the corresponding categories; specifically, the description parameters include key point parameters for composing the dense motion field and local affine transformation parameters corresponding to the key points.
Typically, the attention-mechanism-based neural network model 300 is a Transformer; the hidden space representation corresponds to the key-value input of the decoder 320, and the second joint coding sequence corresponds to the query input of the decoder 320.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a typical computer device for executing an embodiment of the audio-video fitting association calculation method according to the present invention. More specifically, the audio-video fitting association computing device described above may be included as part of the computer device. The computer device comprises at least the following parts: a CPU (central processing unit) 501, a RAM (random access memory) 502, a ROM (read only memory) 503, a system bus 500, a hard disk control unit 504, a hard disk 505, a man-machine interaction external device control unit 506, a man-machine interaction external device 507, a serial interface control unit 508, a serial interface external device 509, a parallel interface control unit 510, a parallel interface external device 511, a display device control unit 512, and a display device 513. The CPU 501, the RAM 502, the ROM 503, the hard disk control unit 504, the man-machine interaction external device control unit 506, the serial interface control unit 508, the parallel interface control unit 510, and the display device control unit 512 are connected to the system bus 500 and communicate with one another through it. Further, the hard disk control unit 504 is connected to a hard disk 505; the man-machine interaction external device control unit 506 is connected to a man-machine interaction external device 507, typically a mouse, a trackball, a touch screen or a keyboard; the serial interface control unit 508 is connected to a serial interface external device 509; the parallel interface control unit 510 is connected to a parallel interface external device 511; the display device control unit 512 is connected to a display device 513.
The block diagram depicted in FIG. 4 illustrates the structure of one type of computer device capable of practicing the various embodiments of the invention and does not limit the environments in which the invention can be practiced. In some cases, components of the computer device may be added or removed as needed. For example, the device shown in fig. 4 may omit the man-machine interaction external device 507 and the display device 513, in which case the embodiment is simply a server accessed by external devices. The computer devices shown in fig. 4 may, of course, implement the operating environment of the present invention on their own, or may be interconnected by a network to provide an operating environment to which the various embodiments of the present invention are applicable; for example, the modules and/or steps of the present invention may be implemented in a distributed fashion across the interconnected computer devices.
Furthermore, one or more computer-readable media storing computer-executable instructions that, when used by one or more computer devices, cause the one or more computer devices to perform various embodiments of an audio-video fit-associated computing method as described above, such as the audio-video fit-associated computing method shown in fig. 1, are disclosed. Computer readable media can be any available media that can be accessed by the computer device and includes both volatile and nonvolatile media, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer-readable media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computing device. Combinations of any of the above should also be included within the scope of computer readable media.
Correspondingly, the invention also discloses a computer device, which comprises a memory and a processor, wherein: the memory stores a computer program which when executed by the processor implements various embodiments of the audio-video fit-related computing method as described above, such as the audio-video fit-related computing method shown in fig. 1.
The part related to software logic in the audio and video fitting associated calculation method provided by the invention can be realized by a programmable logic device, and can also be implemented as a computer program product, and the program product enables a computer to execute the method. The computer program product comprises a computer-readable storage medium having computer program logic or code portions embodied therein for carrying out the steps of the methods described above. The computer readable storage medium may be a built-in medium installed in a computer or a removable medium (e.g., a hot-pluggable storage device) detachable from a computer main body. The built-in medium includes, but is not limited to, rewritable nonvolatile memory such as RAM, ROM, and hard disk. The removable media includes, but is not limited to: optical storage media (e.g., CD-ROM and DVD), magneto-optical storage media (e.g., MO), magnetic storage media (e.g., magnetic tape or removable hard disk), media with built-in rewritable non-volatile memory (e.g., memory card), and media with built-in ROM (e.g., ROM cartridge).
It will be appreciated by those skilled in the art that any computer system having suitable programming means is capable of executing the steps of the method of the present invention embodied in a computer program product. Although most of the specific embodiments described in this specification focus on software programs, alternative embodiments that implement the methods provided by the present invention in hardware are also within the scope of the invention as claimed.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements, units or steps, and that the singular does not exclude a plurality. A plurality of components, units or means recited in the claims can also be implemented by means of one component, unit or means in software or hardware.
When the embodiments provided by the invention perform audio-video fitting association calculation, the selected input features include the phoneme features and the voice features carried by the voice of a single subject, together with the structural features of the reference image. Compared with the prior art, the selection of phoneme features and voice features improves the generalization of the audio-video fitting association result as well as the visual representation of the mouth shape; the selection of structural features lets the calculation attend more closely to the structural distribution of the character and the background in the reference image, which further improves generalization, so the quality of the face-talking animation synthesized from the calculation result improves correspondingly. Implemented as a preparatory stage for face-talking animation synthesis, the embodiments provided by the invention can significantly improve the synthesis quality and final effect of the facial speaking animation.
The above disclosure is intended to be illustrative of only and not limiting of the scope of the invention, which is to be given the full breadth of the appended claims and any and all equivalents thereof.

Claims (12)

1. A method for computing an audio-video fit association, the method comprising:
Acquiring a head motion coding sequence, target voice and a reference image containing a target head portrait;
extracting a phoneme characteristic sequence and a voice characteristic sequence from the target voice;
Extracting structural features from the reference image;
Splicing the phoneme characteristic sequence and the head motion coding sequence to obtain a first joint coding sequence, and splicing the voice characteristic sequence and the structural characteristic to obtain a second joint coding sequence;
Inputting the first joint coding sequence into an encoder of a neural network model based on an attention mechanism, wherein the encoder obtains a hidden space representation of a target voice frame;
Jointly inputting the hidden space representation and the second joint coding sequence into a decoder of the neural network model based on the attention mechanism, wherein the hidden space representation serves as the key-value pair for the decoder's attention and the second joint coding sequence serves as the query vector of the decoder;
The decoder outputs the feature vector of the target voice frame;
the feature vector is converted into a description parameter of the dense motion field.
2. The method of claim 1, wherein the step of obtaining a head motion encoding sequence comprises:
generating the head motion coding sequence according to preset head motion data; or
generating the head motion coding sequence according to a video matched with the target voice.
3. The method of claim 1, wherein the step of extracting a phoneme feature sequence and a speech feature sequence from a target speech comprises:
splitting the target voice into a plurality of voice frames according to a preset period;
respectively extracting the phoneme characteristic and the voice characteristic of each voice frame;
and selecting a plurality of phoneme features to form the phoneme feature sequence according to a preset time sequence window, and selecting a plurality of voice features to form the voice feature sequence.
4. The method of claim 1, wherein the step of extracting structural features from the reference image comprises:
inputting the reference image into a pre-trained non-supervision key point detector, and extracting a characteristic diagram representation of the middle layer output of the non-supervision key point detector as the structural characteristic.
5. The method of claim 1, further comprising, prior to concatenating the sequence of speech features with the structural feature to obtain the second joint coding sequence:
and respectively modifying the channel dimension of each voice feature in the voice feature sequence by using an up-sampling convolution network to enable the channel dimension of the voice feature to be consistent with the structural feature.
6. The method according to claim 1, wherein the step of converting the feature vector into a description parameter of the dense motion field comprises:
The feature vector is converted into the descriptive parameters using a fully connected layer.
7. The method of claim 6, wherein the step of converting feature vectors of the target speech frame into the description parameters using a full connection layer comprises:
And respectively inputting the feature vectors of the target voice frame into two full-connection models, and respectively outputting the description parameters of the corresponding categories through the two full-connection models.
8. The method of claim 1, wherein the descriptive parameters include:
key point parameters for composing the dense motion field, and local affine transformation parameters corresponding to the key points.
9. The method according to claim 1, wherein after said converting said feature vector into a description parameter of the dense motion field, the method further comprises:
And generating a video containing the target head portrait according to the description parameters.
10. A computing device associated with audio-video fits, the device comprising:
The head motion data module is used for acquiring or generating a head motion coding sequence;
The voice processing module is used for extracting a phoneme characteristic sequence and a voice characteristic sequence from the target voice;
The key point detection module is used for extracting structural features from a reference image containing the target head portrait;
The first joint coding module is used for splicing the phoneme characteristic sequences and the head motion coding sequences to obtain a first joint coding sequence;
The second joint coding module is used for splicing the voice characteristic sequence and the structural characteristic to obtain a second joint coding sequence;
A neural network model based on an attention mechanism, the neural network model including an encoder and a decoder;
the encoder is used for receiving the first joint coding sequence as input and, after temporal modeling, outputting the hidden space representation of the target voice frame;
The decoder is configured to receive the hidden space representation and the second joint coding sequence as input and to output the feature vector of the target voice frame, where the hidden space representation is the key-value pair for the decoder's attention and the second joint coding sequence is the query vector of the decoder;
and the converter module is connected with the output end of the decoder and is used for converting the characteristic vector into the description parameter of the dense motion field.
11. One or more computer-readable media storing computer-executable instructions that, when used by one or more computer devices, cause the one or more computer devices to perform the method of computing an audio-video fit association as recited in any one of claims 1-9.
12. A computer device comprising a memory and a processor, wherein:
The memory stores a computer program which when executed by the processor implements the audio-video fit-related computing method according to any one of claims 1 to 9.
CN202111442573.6A 2021-11-30 2021-11-30 Audio and video fitting associated computing method, device, medium and equipment Active CN113963092B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111442573.6A CN113963092B (en) 2021-11-30 2021-11-30 Audio and video fitting associated computing method, device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111442573.6A CN113963092B (en) 2021-11-30 2021-11-30 Audio and video fitting associated computing method, device, medium and equipment

Publications (2)

Publication Number Publication Date
CN113963092A CN113963092A (en) 2022-01-21
CN113963092B true CN113963092B (en) 2024-05-03

Family

ID=79472581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111442573.6A Active CN113963092B (en) 2021-11-30 2021-11-30 Audio and video fitting associated computing method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN113963092B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778041B (en) * 2023-08-22 2023-12-12 北京百度网讯科技有限公司 Multi-mode-based face image generation method, model training method and equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111489424A (en) * 2020-04-10 2020-08-04 网易(杭州)网络有限公司 Virtual character expression generation method, control method, device and terminal equipment
CN112017633A (en) * 2020-09-10 2020-12-01 北京地平线信息技术有限公司 Voice recognition method, voice recognition device, storage medium and electronic equipment
CN112581569A (en) * 2020-12-11 2021-03-30 中国科学院软件研究所 Adaptive emotion expression speaker facial animation generation method and electronic device
CN112712813A (en) * 2021-03-26 2021-04-27 北京达佳互联信息技术有限公司 Voice processing method, device, equipment and storage medium
CN112785667A (en) * 2021-01-25 2021-05-11 北京有竹居网络技术有限公司 Video generation method, device, medium and electronic equipment
CN113168826A (en) * 2018-12-03 2021-07-23 Groove X Inc. Robot, speech synthesis program, and speech output method
CN113269066A (en) * 2021-05-14 2021-08-17 网易(杭州)网络有限公司 Speaking video generation method and device and electronic equipment
CN113314094A (en) * 2021-05-28 2021-08-27 北京达佳互联信息技术有限公司 Lip-shaped model training method and device and voice animation synthesis method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019226964A1 (en) * 2018-05-24 2019-11-28 Warner Bros. Entertainment Inc. Matching mouth shape and movement in digital video to alternative audio
US11417041B2 (en) * 2020-02-12 2022-08-16 Adobe Inc. Style-aware audio-driven talking head animation from a single image

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113168826A (en) * 2018-12-03 2021-07-23 Groove X Inc. Robot, speech synthesis program, and speech output method
CN111489424A (en) * 2020-04-10 2020-08-04 网易(杭州)网络有限公司 Virtual character expression generation method, control method, device and terminal equipment
CN112017633A (en) * 2020-09-10 2020-12-01 北京地平线信息技术有限公司 Voice recognition method, voice recognition device, storage medium and electronic equipment
CN112581569A (en) * 2020-12-11 2021-03-30 中国科学院软件研究所 Adaptive emotion expression speaker facial animation generation method and electronic device
CN112785667A (en) * 2021-01-25 2021-05-11 北京有竹居网络技术有限公司 Video generation method, device, medium and electronic equipment
CN112712813A (en) * 2021-03-26 2021-04-27 北京达佳互联信息技术有限公司 Voice processing method, device, equipment and storage medium
CN113269066A (en) * 2021-05-14 2021-08-17 网易(杭州)网络有限公司 Speaking video generation method and device and electronic equipment
CN113314094A (en) * 2021-05-28 2021-08-27 北京达佳互联信息技术有限公司 Lip-shaped model training method and device and voice animation synthesis method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fine-grained grocery product recognition by one-shot learning; Weidong Geng et al.; ACM International Conference on Multimedia; 2018-12-31; pp. 1706-1714 *
Research on Expressions and Actions of Real-Time Speech-Synchronized 3D Virtual Humans; Wei Xueling; China Masters' Theses Full-text Database, Information Science and Technology; 2017-01-15 (No. 1); p. I138-641 *

Also Published As

Publication number Publication date
CN113963092A (en) 2022-01-21

Similar Documents

Publication Publication Date Title
CN110246488B (en) Voice conversion method and device of semi-optimized cycleGAN model
CN110600013B (en) Training method and device for non-parallel corpus voice conversion data enhancement model
CN112329451B (en) Sign language action video generation method, device, equipment and storage medium
CN113077537A (en) Video generation method, storage medium and equipment
CN113111812A (en) Mouth action driving model training method and assembly
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
Wang et al. HMM trajectory-guided sample selection for photo-realistic talking head
Thangthai et al. Synthesising visual speech using dynamic visemes and deep learning architectures
CN116309984A (en) Mouth shape animation generation method and system based on text driving
CN117349427A (en) Artificial intelligence multi-mode content generation system for public opinion event coping
CN113963092B (en) Audio and video fitting associated computing method, device, medium and equipment
CN115497448A (en) Method and device for synthesizing voice animation, electronic equipment and storage medium
Websdale et al. Speaker-independent speech animation using perceptual loss functions and synthetic data
Huang et al. Fine-grained talking face generation with video reinterpretation
Filntisis et al. Video-realistic expressive audio-visual speech synthesis for the Greek language
CN112185340A (en) Speech synthesis method, speech synthesis device, storage medium and electronic apparatus
CN117152285A (en) Virtual person generating method, device, equipment and medium based on audio control
CN114581570B (en) Three-dimensional face action generation method and system
Sun et al. Pre-avatar: An automatic presentation generation framework leveraging talking avatar
CN113990295A (en) Video generation method and device
Chu et al. CorrTalk: Correlation Between Hierarchical Speech and Facial Activity Variances for 3D Animation
CN115529500A (en) Method and device for generating dynamic image
CN116129852A (en) Training method of speech synthesis model, speech synthesis method and related equipment
Liu et al. Optimization of an image-based talking head system
Kolivand et al. Realistic lip syncing for virtual character using common viseme set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant