CN116363269A - Image driving method based on voice and data processing method of image driving - Google Patents

Image driving method based on voice and data processing method of image driving

Info

Publication number
CN116363269A
Authority
CN
China
Prior art keywords
image
feature
voice
sample
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310252857.1A
Other languages
Chinese (zh)
Inventor
王家喻
赵康
张士伟
张迎亚
沈宇军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd filed Critical Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202310252857.1A priority Critical patent/CN116363269A/en
Publication of CN116363269A publication Critical patent/CN116363269A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 Transforming into visible information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/54 Extraction of image or video features relating to texture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present specification provide a voice-based image driving method and an image-driven data processing method. The voice-based image driving method includes: acquiring a reference voice and a reference face image of a virtual object; performing voice encoding on the reference voice to obtain a target voice feature; performing image encoding on the reference face image to obtain a first image feature of a first region and a second image feature of a second region; performing feature transformation on the first image feature based on a facial prior feature and the target voice feature to determine a first target image feature, where the facial prior feature includes facial texture features; and generating a driven target image according to the first target image feature and the second image feature. Because the feature transformation of the first image feature is based on both the target voice feature and the facial prior feature before decoding, the resulting target image has high fidelity and high definition, which improves user experience.

Description

Image driving method based on voice and data processing method of image driving
Technical Field
The embodiment of the specification relates to the technical field of image processing, in particular to an image driving method based on voice.
Background
With the development of computer technology, generating a target image in which the face of a virtual object changes in accordance with a reference voice input by a user, that is, virtual object reconstruction, has found wide application in fields such as virtual live streaming, information broadcasting, and content recommendation video.
At present, the target image is generated by a deep-learning neural network model that reconstructs the virtual object's face from facial key points, so as to obtain a target image corresponding to the reference voice.
However, because the reconstruction is performed only from key points, the resulting face may merely share a similar facial structure with the virtual object, so that detail is lost and the definition and fidelity of the target image are insufficient, which harms user experience. A voice-based image driving method with high definition and high fidelity is therefore needed.
Disclosure of Invention
In view of this, the present embodiments provide a voice-based image driving method. One or more embodiments of the present specification relate to an image-driven data processing method, a voice-based image driving apparatus, an image-driven data processing apparatus, a computing device, a computer-readable storage medium, and a computer program to solve the technical drawbacks of the related art.
In one embodiment of the present disclosure, there is provided a voice-based image driving method including:
acquiring a reference voice and a reference face image of a virtual object;
performing voice coding on the reference voice to obtain target voice characteristics, and performing image coding on the reference facial image to obtain first image characteristics of a first area and second image characteristics of a second area, wherein the first area is an area of the reference facial image, which changes along with the voice, and the second area is an area of the reference facial image, which is other than the first area;
performing feature transformation on the first image feature based on a facial prior feature and the target voice feature to determine a first target image feature, wherein the facial prior feature comprises a facial texture feature;
and generating a driven target image according to the first target image characteristic and the second image characteristic.
In one or more embodiments of the present disclosure, a reference face image and a reference voice of a virtual object are acquired; the reference voice is voice-encoded to obtain a target voice feature, and the reference face image is image-encoded to obtain a first image feature of a first region and a second image feature of a second region, where the first region is the region of the reference face image that changes with the voice and the second region is the region of the reference face image other than the first region; the first image feature is feature-transformed based on a facial prior feature and the target voice feature to determine a first target image feature, where the facial prior feature includes facial texture features; and a driven target image is generated according to the first target image feature and the second image feature. Because the facial prior feature includes facial texture features, performing feature transformation on the first image feature of the voice-following first region based on the facial prior feature and the target voice feature yields a first target image feature that both corresponds to the voice feature and contains texture features. The complete driven target image generated from the second image feature and the first target image feature therefore corresponds to the reference voice, contains texture features, and has high fidelity and high definition, which improves user experience.
Drawings
FIG. 1 is a flow chart of a voice-based image driving method provided by one embodiment of the present disclosure;
FIG. 2 is a flow chart of another voice-based image driving method provided by one embodiment of the present disclosure;
FIG. 3 is a flow chart of another voice-based image driving method provided by one embodiment of the present disclosure;
FIG. 4 is a flow chart of an image-driven data processing method provided by one embodiment of the present disclosure;
FIG. 5 is a flow chart of constructing facial prior features in a voice-based image driving method according to one embodiment of the present disclosure;
FIG. 6 is a flow chart of a voice-based image driving method according to one embodiment of the present disclosure;
FIG. 7 is a process flow diagram of a voice-based image driving method applied to face animation generation according to one embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a voice-based image driving apparatus according to one embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of another voice-based image driving apparatus according to one embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of another voice-based image driving apparatus according to one embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of an image-driven data processing apparatus according to one embodiment of the present disclosure;
FIG. 12 is a block diagram of a computing device provided by one embodiment of the present disclosure.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present description. This description may be embodied in many forms other than those described herein, and those skilled in the art can make similar generalizations without departing from the spirit of the disclosure; therefore, the disclosure is not limited by the specific implementations disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second and, similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
First, terms related to one or more embodiments of the present specification will be explained.
MLP (Multilayer Perceptron, also called a fully connected neural network) model: a neural network model comprising an input layer, hidden layers, and an output layer, with the layers connected in a fully connected manner.
CNN (Convolutional Neural Network) model: a multi-layer neural network model trained by forward and backward propagation that uses convolution kernels (filters) to process feature data.
RNN (Recurrent Neural Network) model: a neural network model that processes vector representations recursively along the processing direction, with its intermediate layers linked in a chain.
PointNet model: a key-point spatial estimation model consisting of a classification network and a segmentation network. The classification network classifies the key points of the input image; the segmentation network, an extension of the classification network, outputs the spatial features of the key points under different categories from their global and local features.
PointNet++ model: a key-point spatial estimation model that strengthens local feature extraction on the basis of the PointNet model.
GCN (Graph Convolutional Network) model: a convolutional neural network model for processing graph data.
LSTM (Long Short-Term Memory) network model: a neural network model with the ability to memorize long- and short-term information through gated processing of feature data.
Transformer model: a neural network model based on the attention mechanism, which computes and analyzes data features through attention.
BERT (Bidirectional Encoder Representations from Transformers) model: a neural network model that produces bidirectional attention-based encoded representations.
GAN (Generative Adversarial Network) model: a deep-learning neural network model comprising a Generator and a Discriminator; the two are trained alternately to obtain a high-accuracy Generator.
Diffusion model: a neural network model obtained by training on a forward noise-adding process and a reverse denoising process.
The present specification provides a voice-based image driving method, and further relates to an image-driven data processing method, a voice-based image driving apparatus, an image-driven data processing apparatus, a computing device, a computer-readable storage medium, and a computer program, which are described in detail one by one in the following embodiments.
Fig. 1 shows a flowchart of a voice-based image driving method according to an embodiment of the present disclosure, including the following specific steps:
step 102: a reference facial image and a reference voice of the virtual object are acquired.
The embodiments of the present disclosure apply to a client or a server running a virtual object face image driving platform, where the image driving platform includes, but is not limited to: live-streaming applications, web pages, or applets with virtual anchors; information broadcast applications, web pages, or applets with virtual objects; and content recommendation applications, web pages, or applets with virtual objects.
The virtual object may be a model object constructed in advance by a model construction tool, for example a character model or an animal model built in a three-dimensional modeling tool, or it may be an image object obtained by mapping an image of a real object in the physical world, for example an image object of a real person or a real animal captured by an image acquisition device.
The reference face image is a visual image containing the face of the virtual object and comprises at least one visual image, typically in the form of a photo, a picture, or a video frame. The reference face image is a visual image in a particular color space, for example an RGB (Red-Green-Blue) image, an HSI (Hue-Saturation-Intensity) image, a YUV (luminance-chrominance) image, or a YCbCr (luminance and chrominance-offset) image. The reference face image may be a visual image captured by an image acquisition device, for example a person image or an animal image captured by an optical photographing device; an artificially generated visual image, for example an animation image or a drawing; or a visual image generated by an image driving algorithm, which is not limited herein. The reference face image may be a two-dimensional image or a three-dimensional image, which is likewise not limited herein.
The reference voice is voice data that guides the face transformation of the virtual object. It may be voice data in a natural language, for example Chinese speech, or voice data in a non-natural language, for example animal sounds, which is not limited herein.
The reference face image and the reference voice of the virtual object may be obtained directly from an image database and a voice database, where the databases may be local databases of the image driving platform or remote databases, for example a cloud database or an open-source database accessible to the platform. They may also be received directly as uploads from an image acquisition device and a voice acquisition device, or sent directly by a user through the front end of the image driving platform.
Illustratively, a character video of character A uploaded by a video capturing device is received (the first N video frames of the video have high definition and high fidelity, while the last M video frames do not); the Nth video frame in the character video is extracted as the reference face image, and the voice data corresponding to the last M video frames is determined as the reference voice.
Acquiring the reference face image and the reference voice of the virtual object lays a data foundation for subsequently encoding them into the first image feature, the second image feature, and the target voice feature, as sketched below.
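Purely as an illustrative sketch of this acquisition step (not part of the claimed method), the reference face image and reference voice could be pulled from an uploaded character video with OpenCV and ffmpeg as shown below; the file paths, frame index, and audio time range are hypothetical.

```python
import subprocess
import cv2

def extract_reference_inputs(video_path: str, n: int, audio_start_s: float, audio_end_s: float):
    """Take the Nth frame as the reference face image and a clip of the
    soundtrack as the reference voice (paths and indices are illustrative)."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, n)          # jump to the Nth frame
    ok, reference_face = cap.read()              # BGR image array
    cap.release()
    if not ok:
        raise RuntimeError(f"could not read frame {n} from {video_path}")

    # Cut the audio segment that the last M frames should be driven by.
    reference_voice_path = "reference_voice.wav"
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-ss", str(audio_start_s), "-to", str(audio_end_s),
         "-vn", "-acodec", "pcm_s16le", reference_voice_path],
        check=True,
    )
    return reference_face, reference_voice_path
```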
Step 104: and performing voice coding on the reference voice to obtain target voice characteristics, and performing image coding on the reference facial image to obtain first image characteristics of a first area and second image characteristics of a second area, wherein the first area is an area of the reference facial image, which changes along with the voice, and the second area is an area of the reference facial image, except the first area.
The first region is a face region in which the virtual object's face changes with the voice, for example the mouth region, the eye region, or the cheek region of a virtual character; the second region is the face region of the virtual object other than the first region, for example the forehead region or the ear region of a virtual character. Because different virtual objects have different vocalization habits, the first region and the second region may differ between objects, so the first region should not simply be assumed to be the mouth region, nor the second region simply the face region other than the mouth. The first region and the second region may come from the same reference face image or from different reference face images, which is not limited herein.
The first image feature is a coding feature vector obtained by image coding the first region, and the second image feature is a coding feature vector obtained by image coding the second region. The first image feature and the second image feature comprise information such as texture features, image space features and the like of the image.
The target speech feature is an encoding feature vector obtained by speech encoding the reference speech, and may be a text feature vector obtained by text recognition of the reference speech, or may be an audio feature vector obtained by frequency division of the reference speech, which is not limited herein.
The reference voice is voice-encoded to obtain the target voice feature, and the reference face image is image-encoded to obtain the first image feature of the first region and the second image feature of the second region, specifically as follows: the reference voice is voice-encoded with a preset encoding algorithm to obtain the target voice feature, and the reference face image is image-encoded to obtain the first image feature of the first region and the second image feature of the second region, where the preset encoding algorithm may be a statistical encoding algorithm, for example a hash encoding algorithm or a one-hot encoding algorithm, or the encoding layer of a neural network model may be used.
Illustratively, the reference voice is voice-encoded with a one-hot encoding algorithm to obtain the target voice feature, and the reference face image is image-encoded with a preset image hash encoding algorithm to obtain the first image feature of character A's mouth region and the second image feature of character A's other face regions.
The reference voice is voice-encoded to obtain the target voice feature, and the reference face image is image-encoded to obtain the first image feature of the first region and the second image feature of the second region, where the first region is the region of the reference face image that changes with the voice and the second region is the region other than the first region. This lays a feature foundation for the subsequent feature transformation and provides the image features from which the target image is later decoded.
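The following is a minimal PyTorch sketch of what such neural-network encoders could look like when the preset encoding algorithm is a learned encoding layer; the module names, layer sizes, and the assumption that the voice is represented as a mel-spectrogram are all hypothetical and not taken from the patent.

```python
import torch
import torch.nn as nn

class RegionImageEncoder(nn.Module):
    """Hypothetical CNN encoder mapping a face-region crop to a feature map."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, out_dim, 4, stride=2, padding=1),
        )
    def forward(self, img):            # img: (B, 3, H, W)
        return self.net(img)           # (B, out_dim, H/8, W/8)

class SpeechEncoder(nn.Module):
    """Hypothetical encoder mapping a mel-spectrogram window to a speech feature."""
    def __init__(self, n_mels=80, out_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(n_mels, out_dim), nn.ReLU(),
                                  nn.Linear(out_dim, out_dim))
    def forward(self, mel):            # mel: (B, T, n_mels)
        return self.proj(mel).mean(dim=1)   # (B, out_dim), pooled over time

# Usage: the mouth crop yields the first image feature, the rest of the face the second.
image_encoder, speech_encoder = RegionImageEncoder(), SpeechEncoder()
mouth_crop, rest_of_face = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 128, 128)
first_image_feat = image_encoder(mouth_crop)
second_image_feat = image_encoder(rest_of_face)
target_speech_feat = speech_encoder(torch.rand(1, 20, 80))
```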
Step 106: performing attention computation on the first image feature by using the facial prior feature to obtain an attention image feature, and performing feature optimization on the attention image feature according to the target voice feature to obtain the first target image feature of the first region, where the facial prior feature comprises facial texture features.
The facial prior feature is a prior over the facial texture features of the virtual object; performing feature transformation according to this prior guarantees the accuracy of the target image features obtained after the transformation. Facial texture features describe the surface properties of the virtual object's face regions in the image, such as the thickness and density of the image texture.
The first target image feature is a feature vector of the first region that corresponds to the target voice feature of the reference voice and contains texture features. Decoding based on the first target image feature yields a high-fidelity, high-definition visual image of the first region corresponding to the reference voice.
Performing feature transformation on the first image feature based on the facial prior feature and the target voice feature to determine the first target image feature is done specifically as follows: attention computation is performed on the first image feature using the facial prior feature to obtain an attention image feature, and feature optimization is performed on the attention image feature according to the target voice feature to obtain the first target image feature of the first region. Attention computation is a feature transformation method that computes an attention distribution over the input information and then takes a weighted average of the input information according to that distribution, so a feature vector focused on part of the information can be obtained. The facial prior feature contains facial texture features of a large number of virtual objects in sample images; performing attention computation with these global texture features, while focusing on the texture features most correlated with the first image feature, yields an attention image feature that contains texture features. For example, if the reference face image shows a virtual object with black hair, a wide face, and large facial features, and the facial prior feature contains the facial texture features of virtual objects with those characteristics, the attention computation concentrates on them and the resulting attention image feature weights the corresponding facial texture features heavily. Feature optimization is a feature transformation method that adaptively adjusts image features using other features, so that the resulting first target image feature is adapted to the target voice feature. For example, if the first region is the mouth region and the reference voice is the open sound "a", the corresponding first target image feature should represent a large mouth opening and a taut mouth texture.
Illustratively, attention computation is performed on the first image feature using the facial prior feature to obtain an attention image feature, and feature optimization is performed on the attention image feature according to the target voice feature to obtain the first target image feature of character A's mouth region.
By performing feature transformation on the first image feature based on the facial prior feature and the target voice feature to determine the first target image feature, the resulting first target image feature both corresponds to the voice feature and contains texture features, laying a foundation for the subsequent generation of the driven target image.
Step 108: and generating a driven target image according to the first target image characteristic and the second image characteristic.
The target image is a visual driving image that contains the virtual object's face and corresponds to the reference voice. The target image contains at least one visual image and may be a two-dimensional image or a three-dimensional image, which is not limited herein.
The driven target image is generated from the first target image feature and the second image feature specifically as follows: a target image of the virtual object is decoded with a preset decoding algorithm from the first target image feature and the second image feature. The preset decoding algorithm may be a statistical decoding algorithm, for example a hash decoding algorithm or a one-hot decoding algorithm, or the decoding layer of a neural network model may be used.
Illustratively, the last M video frames of character A corresponding to the reference voice are decoded with an image hash decoding algorithm from the first target image feature and the second image feature, yielding a complete character video with high definition and high fidelity.
In this embodiment of the present disclosure, a reference face image and a reference voice of a virtual object are obtained; the reference voice is voice-encoded to obtain a target voice feature and the reference face image is image-encoded to obtain a first image feature of a first region and a second image feature of a second region, where the first region is the region of the reference face image that changes with the voice and the second region is the region other than the first region; the first image feature is feature-transformed based on a facial prior feature and the target voice feature to determine a first target image feature, where the facial prior feature includes facial texture features; and a driven target image is generated from the first target image feature and the second image feature. Because the feature transformation of the first image feature of the voice-following first region is based on both the facial prior feature and the target voice feature, the resulting first target image feature corresponds to the voice feature and contains texture features, and the complete driven target image generated from the second image feature and the first target image feature corresponds to the reference voice, contains texture features, and has high fidelity and high definition, improving user experience.
Optionally, before step 106, the method further includes the following specific steps:
acquiring a calibration face image of a virtual object;
performing image coding on the calibration face image to obtain calibration image characteristics of a target area, wherein the target area corresponds to the second area;
and calibrating the first image feature based on the feature deviation between the second image feature and the calibration image feature to obtain the calibrated first image feature.
When the reference face image consists of several different visual images containing the first region and/or the second region of the virtual object's face, the poses of the virtual object differ; that is, when the spatial features of the face differ, the texture features change accordingly, and texture matching becomes problematic. For example, when the face orientation is inconsistent, texture features such as the facial features and face shape also change. The virtual object therefore needs to be calibrated so that the structural information of its face is taken into account, the geometric consistency of the target image is enhanced, the division into face regions does not cause texture-matching problems, and the accuracy of the resulting target image features is guaranteed.
The calibration face image is a visual image containing the virtual object's face whose spatial features differ from those of the first region; its form and color space are consistent with those of the reference face image. For example, the reference face image may be a visual image captured with the face turned 30 degrees to the side, while the calibration face image is captured facing forward; the two have different spatial features but correspond to the same virtual object. The virtual object's face in the calibration face image contains a target region that corresponds to, but does not necessarily completely coincide with, the second region. For example, the reference face image may be an animal image containing part of the virtual animal's facial features, while the calibration image contains all of them.
The feature bias is a spatial feature bias between the second image feature and the calibration image feature, and is represented as a feature vector having a spatial distribution.
The calibration face image is image-encoded to obtain the calibration image feature of the target region, specifically as follows: the calibration face image is image-encoded with a preset encoding algorithm to obtain the calibration image feature of the target region. The preset encoding algorithm is the same as that in step 104 and is not described here again.
The first image feature is calibrated based on the feature deviation between the second image feature and the calibration image feature to obtain the calibrated first image feature, specifically as follows: based on the feature deviation between the second image feature and the calibration image feature, the first image feature is calibrated with a spatial feature transformation algorithm. A spatial feature transformation algorithm realizes spatial feature transformation by interpolating feature vectors, for example bilinear interpolation or bilinear sampling.
Illustratively, the (N-10)th video frame is extracted from character A's video as the calibration face image and image-encoded with an image hash encoding algorithm to obtain the calibration image features of the other face regions; the second image feature is compared with the calibration image feature to obtain the feature deviation, and the first image feature is spatially calibrated with a bilinear sampling algorithm according to the feature deviation to obtain the calibrated first image feature.
A calibration face image of the virtual object is obtained and image-encoded to obtain the calibration image feature of a target region corresponding to the second region; the second image feature is compared with the calibration image feature to obtain the feature deviation, and the first image feature is calibrated according to the feature deviation to obtain the calibrated first image feature. This guarantees the accuracy of the first target image feature obtained subsequently.
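As an illustrative sketch of the bilinear-sampling calibration only, the first image feature could be warped by the feature deviation as follows; the assumption that the deviation is given as a normalized dense (x, y) displacement field, and the shapes used, are hypothetical.

```python
import torch
import torch.nn.functional as F

def calibrate_first_feature(first_feat, offset):
    """Warp the first image feature with bilinear sampling.

    first_feat: (B, C, H, W) feature map of the speech-driven region.
    offset:     (B, 2, H, W) feature deviation as a normalized (x, y)
                displacement field in [-1, 1] grid coordinates (an assumption).
    """
    B, C, H, W = first_feat.shape
    # Identity sampling grid in normalized coordinates, shape (B, H, W, 2).
    base = F.affine_grid(
        torch.eye(2, 3).unsqueeze(0).repeat(B, 1, 1),
        size=(B, C, H, W),
        align_corners=True,
    )
    grid = base + offset.permute(0, 2, 3, 1)      # shift each sampling location
    return F.grid_sample(first_feat, grid, mode="bilinear", align_corners=True)

# Usage: an all-zero deviation leaves the feature essentially unchanged.
feat = torch.rand(1, 256, 16, 16)
calibrated = calibrate_first_feature(feat, torch.zeros(1, 2, 16, 16))
```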
Optionally, comparing the second image feature with the calibration image feature to obtain a feature deviation, including the following specific steps:
performing key point coding on the second image feature to obtain a second key point space feature, and performing key point coding on the calibration image feature to obtain a calibration key point space feature;
determining feature deviation according to the feature difference between the second key point space feature and the calibration key point space feature;
and calibrating the first image features based on the feature deviation to obtain calibrated first image features.
The spatial features may be determined from the locations of key points by sampling and encoding the key points of the virtual object's face, for example the eyes, nose, and eyebrows of a virtual character. The feature deviation can thus be understood as follows: a spatial coordinate system is constructed for the key points in the second region of the reference face image, each key point having a coordinate position; the corresponding key points in the calibration image are mapped into the same coordinate system, each also having a coordinate position; and the spatial feature deviation is determined by comparing the differences between the coordinate positions.
The second keypoint spatial feature is a feature vector characterizing a keypoint spatial feature of the second region, and the calibration keypoint spatial feature is a feature vector characterizing a corresponding keypoint spatial feature of the second region in the calibration image.
The second image feature is key-point encoded to obtain the second key-point spatial feature, and the calibration image feature is key-point encoded to obtain the calibration key-point spatial feature, specifically as follows: the encoding layer of a pre-trained spatial adaptation model performs key-point encoding on the second image feature to obtain the second key-point spatial feature, and on the calibration image feature to obtain the calibration key-point spatial feature. The spatial adaptation model is a neural network model with a spatial adaptation function for image features; it comprises an encoding layer, a computing module, and a decoding layer, and can implement the spatial feature transformation algorithm of the above embodiment. The spatial adaptation model may be an MLP model, a CNN model, an RNN model, a GCN model, a PointNet model, a PointNet++ model, or the like.
Based on the feature deviation between the second key point space feature and the calibration key point space feature, calibrating the first image feature to obtain a calibrated first image feature, wherein the specific mode is as follows: and calculating the characteristic deviation by using a calculation module of the space adaptation model according to the characteristic difference between the second key point space characteristic and the calibration key point space characteristic, and calibrating the first image characteristic based on the characteristic deviation to obtain the calibrated first image characteristic.
Illustratively, the encoding layer of a pre-trained PointNet model performs key-point encoding on the second image feature to obtain the second key-point spatial feature and on the calibration image feature to obtain the calibration key-point spatial feature; the model's computing module computes the feature deviation from the degree of difference between the two spatial features, and the decoding layer of the spatial adaptation model performs bilinear sampling on the first image feature based on the feature deviation to obtain the calibrated first image feature.
The feature deviation is determined from the degree of difference between the second key-point spatial feature and the calibration key-point spatial feature, and the first image feature is calibrated based on that deviation to obtain the calibrated first image feature. Key-point-level feature processing improves calibration accuracy and the texture match between the calibrated first image feature and the second region, further guaranteeing the accuracy of the subsequently obtained target image features.
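A minimal sketch of the key-point side of this step is given below, assuming a hypothetical "encoding layer" head that regresses K normalized key-point coordinates from a region feature map; the class name, number of key points, and tensor shapes are all assumptions, not the patent's actual spatial adaptation model.

```python
import torch
import torch.nn as nn

class KeypointHead(nn.Module):
    """Hypothetical encoding layer of a spatial adaptation model: it maps a
    region feature map to K key-point coordinates in normalized space."""
    def __init__(self, in_dim=256, num_keypoints=10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_dim, num_keypoints * 2)
    def forward(self, feat):                      # feat: (B, C, H, W)
        x = self.pool(feat).flatten(1)            # (B, C)
        return self.fc(x).view(-1, self.fc.out_features // 2, 2)  # (B, K, 2)

second_image_feat = torch.rand(1, 256, 16, 16)    # from the reference face's second region
calib_image_feat = torch.rand(1, 256, 16, 16)     # from the calibration image's target region
head = KeypointHead()
second_kpts = head(second_image_feat)             # second key-point spatial feature
calib_kpts = head(calib_image_feat)               # calibration key-point spatial feature
feature_deviation = calib_kpts - second_kpts      # (B, K, 2) per-key-point deviation
# A dense offset field for bilinear sampling could then be interpolated from
# these per-key-point deviations, matching the grid_sample sketch given earlier.
```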
Optionally, step 106 includes the following specific steps:
performing attention calculation on the first image features by using the facial prior features to obtain first attention image features;
And carrying out normalization processing on the first attention image characteristic by taking the target voice characteristic as a constraint condition to obtain a first target image characteristic.
The first attention image feature is an image feature that incorporates the corresponding texture features from the facial prior feature. Guided by the facial prior feature, the first attention image feature focuses more strongly on particular facial texture features.
And carrying out attention calculation on the first image characteristic by using the facial prior characteristic to obtain a first attention image characteristic, wherein the specific mode is as follows: and determining the facial prior feature as a key vector and a value vector, determining the first image feature as a query vector, and performing attention calculation according to the query vector, the key vector and the value vector to obtain the first attention image feature. The specific formula of the attention calculation is shown in formula 1:
$H = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$    (Formula 1)

where H is the output feature vector, Q is the input feature vector, i.e. the Query vector, K is the Key vector, V is the Value vector, and $d_k$ is a preset temperature factor.
The normalization process limits the image features to a certain range (e.g., [0, 1] or [-1, 1]) subject to the target voice feature as a constraint, thereby eliminating the effect of singular values in the first attention image feature and ensuring that the resulting first target image feature is adapted to the target voice feature.
It should be noted that the above process is implemented within a pre-trained neural network model with an attention mechanism, for example a Transformer model or a BERT model.
Illustratively, the facial prior feature is taken as the key vector and the value vector and the first image feature as the query vector; attention computation is performed with Formula 1 to obtain the first attention image feature, which is then normalized with the target voice feature as a constraint condition to obtain the first target image feature of character A's mouth region.
Attention computation is performed on the first image feature using the facial prior feature to obtain the first attention image feature, and the first attention image feature is normalized with the target voice feature as a constraint condition to obtain the first target image feature of the first region. This makes the first target image feature more specific to the target voice feature and sharpens the texture features it contains.
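For illustration only, the following sketch combines Formula 1 with a speech-conditioned normalization of the kind described above; the function and class names, shapes, and the specific scale/shift form of the conditioning are assumptions rather than the patent's exact layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def prior_attention(first_feat, face_prior, d_k=256.0):
    """Formula 1: the first image feature is the Query, the facial prior
    feature supplies Key and Value. Assumed shapes: (B, N, D) and (B, M, D)."""
    scores = first_feat @ face_prior.transpose(1, 2) / d_k ** 0.5   # (B, N, M)
    return F.softmax(scores, dim=-1) @ face_prior                   # (B, N, D)

class ConditionalLayerNorm(nn.Module):
    """Sketch of conditional normalization: the target speech feature
    predicts the scale and shift applied after layer normalization."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale = nn.Linear(cond_dim, dim)
        self.to_shift = nn.Linear(cond_dim, dim)
    def forward(self, x, cond):                   # x: (B, N, D), cond: (B, cond_dim)
        scale = self.to_scale(cond).unsqueeze(1)
        shift = self.to_shift(cond).unsqueeze(1)
        return self.norm(x) * (1 + scale) + shift

# Usage: attention image feature -> speech-conditioned first target image feature.
first_feat, face_prior = torch.rand(1, 64, 256), torch.rand(1, 512, 256)
speech_feat = torch.rand(1, 256)
attn_feat = prior_attention(first_feat, face_prior)
first_target_feat = ConditionalLayerNorm(256, 256)(attn_feat, speech_feat)
```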
Optionally, the facial prior feature includes a first facial prior feature corresponding to the first region and a second facial prior feature corresponding to the second region;
Correspondingly, before step 106, the method further comprises the following specific steps:
performing discretization processing on the first image feature based on the first facial prior feature to obtain a discretized first image feature;
and discretizing the second image feature based on the second facial prior feature to obtain a discretized second image feature.
Because the first region and the second region are encoded separately, the facial prior features can likewise be constructed separately for each region, which makes the facial prior features more targeted and, in turn, the feature transformation more targeted.
The first facial prior feature is a set of facial texture features of the first region of the virtual object's face in sample images, and the second facial prior feature is a set of facial texture features of the second region. Both are sets of discrete feature vectors obtained by discretized feature extraction from the virtual object's face in sample images, and the discrete feature vectors contain discretized texture features; for example, discretized features may be extracted per discrete face region (such as the upper, middle, and lower thirds of the face) or per facial feature and its surrounding region.
The discretization process is a finite feature-vector mapping, i.e. a mapping of higher-dimensional feature vectors onto a limited set of feature vectors. It reduces the subsequent data processing load, improves processing efficiency, stabilizes the generated results, and lowers the training difficulty of the associated neural network model.
Based on the first facial prior feature, discretizing the first image feature to obtain the discretized first image feature, wherein the specific mode is as follows: and discretizing the first image features based on the feature similarity of each discrete feature in the first image features and the first facial prior features to obtain discretized first image features.
The second image feature is discretized based on the second facial prior feature to obtain the discretized second image feature, specifically as follows: the second image feature is discretized based on the feature similarity between the second image feature and each discrete feature in the second facial prior feature, yielding the discretized second image feature.
Illustratively, discretizing the first image feature based on the feature similarity of each discrete feature in the first image feature and the first facial prior feature to obtain a discretized first image feature, and discretizing the second image feature based on the feature similarity of each discrete feature in the second image feature and the second facial prior feature to obtain a discretized second image feature.
Based on the first facial prior feature, discretizing the first image feature to obtain a discretized first image feature, and based on the second facial prior feature, discretizing the second image feature to obtain a discretized second image feature, so that image driving efficiency is improved, image driving stability is ensured, and image driving complexity is reduced.
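A minimal sketch of similarity-based discretization is shown below, replacing each continuous feature with its nearest entry in a facial prior feature set (a codebook-style lookup); the codebook sizes, dimensions, and the use of Euclidean distance as the similarity measure are assumptions.

```python
import torch

def discretize(features, codebook):
    """Map each continuous feature vector to its most similar entry in a
    facial prior feature set. Assumed shapes: features (B, N, D), codebook (K, D)."""
    # Euclidean distance between every feature and every prior entry.
    dists = torch.cdist(features, codebook.unsqueeze(0).expand(features.size(0), -1, -1))
    idx = dists.argmin(dim=-1)                    # (B, N) index of the most similar prior
    return codebook[idx]                          # (B, N, D) discretized features

first_prior = torch.rand(512, 256)                # hypothetical first facial prior feature set
second_prior = torch.rand(512, 256)               # hypothetical second facial prior feature set
first_q = discretize(torch.rand(1, 64, 256), first_prior)
second_q = discretize(torch.rand(1, 196, 256), second_prior)
```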
Optionally, step 104 includes the following specific steps:
acquiring a pre-trained image driving model, wherein the image driving model comprises an image coding layer, a voice coding layer, a characteristic transformation layer and a decoding layer;
inputting the reference voice into a voice coding layer to obtain target voice characteristics, and inputting the reference facial image into an image coding layer to obtain first image characteristics of a first area and second image characteristics of a second area;
correspondingly, the step 106 includes the following specific steps:
inputting the first image features, the target voice features and the facial prior features into a feature transformation layer, carrying out feature transformation on the first image features based on the facial prior features and the target voice features, and determining the first target image features;
correspondingly, step 108 includes the following specific steps:
feature combination is carried out on the first target image feature and the second image feature, and combined image features are obtained;
And inputting the combined image characteristics into a decoding layer to obtain the driven target image.
The image driving model is a neural network model with the function of driving the virtual object's image; it is an image processing model including, but not limited to, an LSTM model, a Transformer model, a BERT model, a GAN model, or a Diffusion model. The image driving model comprises an image coding layer, a voice coding layer, a feature transformation layer, and a decoding layer, where the feature transformation layer comprises an attention computation layer and a conditional normalization layer (CLN, Conditional Layer Normalization); the feature transformation layer may be a Transformer model. The image driving model is trained in advance for the task of high-definition, high-fidelity virtual object face driving.
The merged image feature is a feature vector that spatially merges the first target image feature and the second image feature, for example, the first region is a mouth region of the avatar, the second region is another face region of the avatar, and the obtained merged image feature is an image feature that characterizes a full face of the avatar by spatially merging the first target image feature of the mouth region and the second image feature of the other face region.
Feature merging is a method for merging feature vectors in space, and is specifically realized through a merging function (Concat function).
Inputting the first image feature, the target voice feature and the facial prior feature into a feature transformation layer, carrying out feature transformation on the first image feature based on the facial prior feature and the target voice feature, and determining the first target image feature, wherein the specific mode is as follows: the first image feature and the facial prior feature are input into an attention calculating layer in a feature transformation layer, the first attention image feature is obtained through calculation, the first attention image feature and the target voice feature are input into a condition normalization layer in the feature transformation layer, and the first target image feature is obtained. The specific feature processing manner of the attention layer and the condition normalization layer is described in detail in the above embodiments, and is not described herein again.
Illustratively, a pre-trained GAN model with an image driving function for the virtual object is obtained; the GAN model comprises an image coding layer, a voice coding layer, an attention computation layer, a conditional normalization layer, and a decoding layer. The reference voice is input into the voice coding layer to obtain the target voice feature, and character A's reference face image is input into the image coding layer to obtain the first image feature of character A's mouth region and the second image feature of character A's other face regions. The first image feature and the facial prior feature are input into the attention computation layer to compute the first attention image feature, which is input together with the target voice feature into the conditional normalization layer to obtain the first target image feature. The first target image feature and the second image feature are merged with a Concat function to obtain the merged image feature, which is input into the decoding layer to obtain the last M video frames of character A corresponding to the reference voice, yielding a complete character A video with high definition and high fidelity.
A pre-trained image driving model comprising an image coding layer, a voice coding layer, a feature transformation layer, and a decoding layer is obtained; the reference voice is input into the voice coding layer to obtain the target voice feature, and the reference face image is input into the image coding layer to obtain the first image feature of the first region and the second image feature of the second region; the first image feature, the target voice feature, and the facial prior feature are input into the feature transformation layer, where the first image feature is feature-transformed based on the facial prior feature and the target voice feature to determine the first target image feature; the first target image feature and the second image feature are merged to obtain the merged image feature, which is input into the decoding layer to obtain the driven target image. This improves image driving efficiency, further improves the fidelity and definition of the target image, and further improves user experience.
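The end-to-end flow of this embodiment can be illustrated with the highly simplified model below: image and speech coding layers, a prior-guided attention plus speech-conditioned feature transformation layer, feature merging by concatenation, and a decoding layer. All layer choices and sizes are hypothetical; this is a sketch of the described structure, not the patent's actual network.

```python
import torch
import torch.nn as nn

class ImageDrivingModel(nn.Module):
    """Sketch of an image driving model: encoders, prior-guided feature
    transformation, Concat-style merging, and a decoder."""
    def __init__(self, dim=256, prior_size=512):
        super().__init__()
        self.image_enc = nn.Conv2d(3, dim, 8, stride=8)           # image coding layer
        self.speech_enc = nn.Linear(80, dim)                       # voice coding layer
        self.face_prior = nn.Parameter(torch.randn(prior_size, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.cond = nn.Linear(dim, 2 * dim)                        # scale/shift from speech
        self.decoder = nn.Sequential(                              # decoding layer
            nn.ConvTranspose2d(2 * dim, dim, 4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(dim, 3, 2, stride=2),
        )

    def forward(self, first_region, second_region, mel):
        f1 = self.image_enc(first_region)                          # (B, D, h, w)
        f2 = self.image_enc(second_region)                         # (B, D, h, w)
        speech = self.speech_enc(mel).mean(dim=1)                  # (B, D)
        B, D, h, w = f1.shape
        q = f1.flatten(2).transpose(1, 2)                          # (B, h*w, D)
        kv = self.face_prior.unsqueeze(0).expand(B, -1, -1)
        attn, _ = self.attn(q, kv, kv)                             # prior-guided attention
        scale, shift = self.cond(speech).chunk(2, dim=-1)
        target1 = attn * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        target1 = target1.transpose(1, 2).view(B, D, h, w)
        merged = torch.cat([target1, f2], dim=1)                   # feature merging (Concat)
        return self.decoder(merged)                                # driven target image

model = ImageDrivingModel()
out = model(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64), torch.rand(1, 20, 80))
```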
Optionally, before acquiring the pre-trained image driving model, the method further comprises the following specific steps:
acquiring a training sample set, wherein the training sample set comprises a plurality of training sample groups, and any training sample group comprises a sample image of a sample virtual object, a label image of the sample virtual object and sample voice corresponding to the label image;
Taking the prior feature of the face as the prior feature of feature transformation, and performing supervision training on the image driving model according to the sample image, the sample voice and the label image of each training sample group to obtain the trained image driving model.
The training sample set is a pre-constructed set of sample images for image-driven model training and comprises a plurality of training sample groups, any training sample group comprises sample images of sample virtual objects, label images of the sample virtual objects and sample voices corresponding to the label images, the sample images are visual sample images containing faces of the virtual objects, the label images are target images and visual sample images containing faces of the virtual objects, and the target images correspond to the reference voices. The sample image may be a visual sample image acquired by the image acquisition device, a visual sample image generated manually, or a visual sample image generated by an image generation algorithm, which is not limited herein. The label image is consistent with the generation or collection mode of the sample image. The sample speech is sample speech data of a guide face conversion corresponding to the tag image, and may be sample speech data of a natural language or sample speech data of a non-natural language, and is not limited herein. In order to ensure the training effect on the image driving model, the training sample set is large in scale and is generally obtained through an open source database, for example, a sample video database, a video database of an online video platform and the like. In order to ensure the training efficiency of the image-driven model, the training sample set is smaller in scale and is generally obtained from a local database.
The supervised training is a way of training the neural network model using the tag data. In the embodiment of the specification, the facial prior feature is taken as the prior feature, the sample image and the sample voice are input into the image driving model to generate the predicted image, the training loss value is calculated according to the predicted image and the label image, the model parameters of the image driving model are adjusted according to the training loss value, and the steps are repeated iteratively until the preset training ending condition is reached, so that the trained image driving model is obtained. The training loss values include, but are not limited to, cosine loss values, L1 loss values, L2 loss values, and cross entropy loss values. The training ending condition includes, but is not limited to, a preset loss value threshold, a preset iteration number, and a judgment condition for completing training of each sample group.
The facial prior feature is taken as the prior feature, and the sample image and the sample voice are input into the image driving model to generate a predicted image as follows: the sample image is input into the image coding layer of the image driving model to obtain the first image feature of the first region and the second image feature of the second region, and the sample voice is input into the voice coding layer to obtain the sample voice feature; the feature transformation layer then performs feature transformation on the first image feature according to the sample voice feature, with the facial prior feature as the prior feature, to determine the target image feature of the first region; finally, the decoding layer decodes the second image feature and the target image feature to obtain the predicted image.
The model parameters of the image driving model are adjusted according to the training loss value as follows: the model parameters of the image coding layer, the voice coding layer, the feature transformation layer and the decoding layer are adjusted by a gradient updating method according to the training loss value. It should be noted that, to keep the image coding consistent between training and application, the parameters of the image coding layer may be fixed (frozen).
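As a minimal PyTorch-style sketch of the training step described above: the image driving model is assumed to expose image coding, voice coding, feature transformation and decoding sub-modules under hypothetical attribute names, the L1 loss stands in for any of the listed loss options, and the image coding layer is kept frozen as noted.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, sample_image, sample_voice, label_image, face_prior):
    """One supervised update of the image driving model (illustrative only)."""
    # Image coding layer is frozen so that training-time and application-time
    # image features stay consistent (see the note on parameter fixing above).
    with torch.no_grad():
        first_feat, second_feat = model.image_encoder(sample_image)   # assumed interface

    voice_feat = model.voice_encoder(sample_voice)                    # assumed interface

    # Feature transformation of the voice-following (first) region, conditioned
    # on the facial prior feature and the sample voice feature.
    target_feat = model.feature_transform(first_feat, voice_feat, face_prior)

    # Decode the combined features of both regions into a predicted image.
    predicted = model.decoder(torch.cat([target_feat, second_feat], dim=1))

    loss = F.l1_loss(predicted, label_image)   # any of the listed losses could be used

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the image coding layer is wrapped in `torch.no_grad()`, the gradient update only reaches the voice coding layer, the feature transformation layer and the decoding layer, matching the parameter-fixing note above.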
The method comprises the steps of obtaining video data from an open-source video database, constructing a training sample set from the video data, taking the facial prior features as the prior features, inputting the sample images and sample voices into a GAN model to generate predicted images, calculating a cross entropy loss value from the predicted images and label images, adjusting the parameters of the voice coding layer, the feature transformation layer and the decoding layer in the GAN model with a gradient updating method according to the cross entropy loss value, and iterating these steps until a preset loss value threshold is reached, so that the trained GAN model is obtained.
The method comprises the steps of obtaining a training sample set, wherein the training sample set comprises a plurality of training sample groups and any training sample group comprises a sample image of a sample virtual object, a label image of the sample virtual object and a sample voice corresponding to the label image, and, taking the facial prior feature as the prior feature for feature transformation, performing supervised training on the image driving model according to the sample image, the sample voice and the label image of each training sample group to obtain the trained image driving model. The facial region features extracted in advance from the virtual objects in the sample image set by the pre-trained facial reconstruction model yield facial prior features containing texture features. Using these facial prior features as the prior features for feature transformation and performing supervised training on the image driving model according to the sample image, the sample voice and the label image to obtain the trained image driving model improves the pertinence of the image driving model to voice data and image data and its ability to extract texture features, so that the subsequently generated driven target image has the characteristics of high fidelity and high definition.
Optionally, the facial prior feature is obtained by extracting features of a facial area of the sample image by using a facial reconstruction model in advance, and correspondingly, the method further comprises the following specific steps:
obtaining a sample image set, wherein the sample image set comprises a plurality of sample image pairs, and any sample image pair comprises a first area sample image and a second area sample image corresponding to the same virtual object;
for any sample image pair, respectively extracting texture features of a first region sample image and a second region sample image in the sample image pair by utilizing an encoding layer of a pre-trained facial reconstruction model to obtain the texture features of the first region sample image and the texture features of the second region sample image;
and integrating the texture features of each first region sample image and the texture features of each second region sample image to obtain the facial prior features.
The facial prior feature may be a set of facial texture features of virtual objects in the sample image set, embodied as a set of feature vectors. The facial prior feature is constructed by performing feature extraction on the facial regions of a plurality of virtual objects in the sample image set with the pre-trained facial reconstruction model. It contains texture features of the faces of the plurality of virtual objects, such as hair color, and corresponds to spatial features of those faces, such as face shape and the layout of the facial features. The facial prior feature generalizes over the virtual object faces in the sample images and therefore has high transferability and generality. The facial prior feature is constructed in advance, supervised by the task of reconstructing the virtual object face with high definition and high fidelity.
The sample image set is a pre-constructed set of sample images for facial prior feature construction and comprises a plurality of sample image pairs; any sample image pair comprises a first region sample image and a second region sample image corresponding to the same virtual object. The first region sample image and the second region sample image are obtained by splitting the same sample image by region. The sample images corresponding to the first region sample image and the second region sample image may be visual sample images acquired by an image acquisition device, manually generated visual sample images, or visual sample images generated by an image driving algorithm, which is not limited herein. To ensure the high transferability, generality and accuracy of the constructed facial prior feature, the sample image set is large in scale and is generally obtained from an open-source database, for example a sample image database or the image database of an online image platform.
The facial reconstruction model is a neural network model with a virtual object face reconstruction function and comprises an encoding layer with a texture feature extraction function and a decoding layer with an image driving function. The facial reconstruction model may be an MLP model, a CNN model, an RNN model, a GCN model, a PointNet model, a PointNet++ model, or the like. The facial reconstruction model may be identical to or different from the image driving model.
The facial prior feature is obtained by integrating the texture features of the first region sample images and the texture features of the second region sample images. The texture features of the first region sample images and those of the second region sample images may be integrated separately to construct a first facial prior feature and a second facial prior feature, or they may be integrated together to construct a single facial prior feature, which is not limited herein.
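A sketch of the facial prior construction, under the assumption that "integrating" simply stacks the per-image texture features into a set of feature vectors and that the first and second priors are built separately (the first of the two options above); the encoder interface is hypothetical.

```python
import torch

@torch.no_grad()
def build_facial_priors(recon_encoder, sample_pairs):
    """Build first / second facial prior features from sample image pairs.

    recon_encoder: coding layer of the pre-trained facial reconstruction model.
    sample_pairs:  iterable of (first_region_img, second_region_img) tensors
                   for the same virtual object, each shaped (3, H, W).
    """
    first_feats, second_feats = [], []
    for first_img, second_img in sample_pairs:
        # Extract the texture feature of each region sample image.
        first_feats.append(recon_encoder(first_img.unsqueeze(0)))
        second_feats.append(recon_encoder(second_img.unsqueeze(0)))

    # "Integration" here is simply stacking into a set of feature vectors.
    first_prior = torch.cat(first_feats, dim=0)    # prior of the first region
    second_prior = torch.cat(second_feats, dim=0)  # prior of the second region
    return first_prior, second_prior
```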
The method comprises the steps of obtaining a sample image set from an open source image database, wherein the sample image set comprises 100000 sample image pairs, any sample image pair comprises a first region sample image and a second region sample image corresponding to the same virtual character, extracting texture features of the first region sample image and the second region sample image in the sample image pair respectively by utilizing an encoding layer of a pre-trained Diffusion model for any sample image pair to obtain texture features of the first region sample image and texture features of the second region sample image, and integrating the texture features of 100000 first region sample images and the texture features of 100000 second region sample images to obtain facial prior features.
The method comprises the steps of obtaining a sample image set, wherein the sample image set comprises a plurality of sample image pairs and any sample image pair comprises a first region sample image and a second region sample image corresponding to the same virtual object; for any sample image pair, respectively extracting texture features of the first region sample image and the second region sample image in the sample image pair by using the encoding layer of the pre-trained facial reconstruction model, to obtain the texture features of the first region sample image and the texture features of the second region sample image; and integrating the texture features of each first region sample image and the texture features of each second region sample image to obtain the facial prior features. Because the pre-trained facial reconstruction model is used to extract texture features from the first region sample images and the second region sample images in the sample image set, the resulting facial prior features contain texture features. The subsequently obtained target image features therefore not only correspond to the voice features but also contain the texture features, and decoding them yields a complete driven target image of the virtual object that corresponds to the reference voice and contains the texture features. The target image thus has the characteristics of high fidelity and high definition, which improves user experience, while the transferability and generality of image driving are improved at the same time.
Optionally, before the texture feature extraction is performed on the first region sample image and the second region sample image in the sample image pair by using the coding layer of the pre-trained facial reconstruction model, the method further comprises the following specific steps:
acquiring a pre-training set, wherein the pre-training set comprises a plurality of pre-training pairs, and any pre-training pair comprises a first training sample image corresponding to a first area and a second training sample image corresponding to a second area of the same sample virtual object;
and performing supervised training on the facial reconstruction model according to the first training sample image and the second training sample image of each pre-training pair to obtain a trained facial reconstruction model.
The pre-training set is a pre-constructed set of sample images for pre-training of the face reconstruction model, and comprises a plurality of pre-training pairs, any one of which comprises a first training sample image corresponding to a first region of the same sample virtual object and a second training sample image corresponding to a second region. The first training sample image and the second training sample image are obtained by splitting the same sample image in areas. The sample images corresponding to the first training sample image and the second training sample image may be visual sample images acquired by the image acquisition device, or may be manually generated visual sample images, or may be visual sample images generated by using an image driving algorithm, which is not limited herein. In order to ensure the capability of extracting texture features of the facial reconstruction model obtained by pre-training, the pre-training set is large in scale and is generally obtained through an open source database, for example, a sample image database, an image database of an online image platform and the like.
Supervised training is a way of training a neural network model using label data; here the first training sample image and the second training sample image in a pre-training pair serve as label data for each other. In this embodiment of the present disclosure, the facial reconstruction model is supervised-trained according to the first training sample image and the second training sample image of each pre-training pair in the following way. The first training sample image is input into the facial reconstruction model to generate a first predicted image, a training loss value is calculated from the first predicted image and the second training sample image, the model parameters of the facial reconstruction model are adjusted according to the training loss value, and these steps are iterated until a preset training end condition is reached, giving a facial reconstruction model that has completed the first training stage. Then the second training sample image is input into the facial reconstruction model to generate a second predicted image, a training loss value is calculated from the second predicted image and the first training sample image, the model parameters are adjusted according to the training loss value, and these steps are iterated until the preset training end condition is reached, giving a facial reconstruction model that has completed the second training stage. The training order of the first training sample image and the second training sample image may be swapped. The training loss values include, but are not limited to, cosine loss values, L1 loss values, L2 loss values, and cross entropy loss values. The training end condition includes, but is not limited to, a preset loss value threshold, a preset number of iterations, and a judgment condition that every sample group has been trained.
The first training sample image is input into the facial reconstruction model to generate the first predicted image as follows: the first training sample image is input into the coding layer of the facial reconstruction model to extract a first texture feature, and the first texture feature is input into the decoding layer of the facial reconstruction model to generate the first predicted image. Inputting the second training sample image into the facial reconstruction model to generate the second predicted image is done in the same way and is not repeated here.
The model parameters of the facial reconstruction model are adjusted according to the training loss value as follows: the model parameters of the coding layer and the decoding layer are adjusted by a gradient updating method according to the training loss value.
The method comprises the steps of obtaining a pre-training set from an open-source image database, wherein the pre-training set comprises 10000 pre-training pairs and any pre-training pair comprises a first training sample image of a first area and a second training sample image of a second area of the same sample virtual object. The first training sample image is input into a Diffusion model to generate a first predicted image, a cross entropy loss value is calculated from the first predicted image and the second training sample image, the model parameters of the Diffusion model are adjusted with a gradient updating method according to the cross entropy loss value, and these steps are iterated until a preset number of iterations is reached, giving a Diffusion model that has completed the first training stage. Then the second training sample image is input into the Diffusion model to generate a second predicted image, a cross entropy loss value is calculated from the second predicted image and the first training sample image, the model parameters of the Diffusion model are adjusted according to the cross entropy loss value, and these steps are iterated until the preset number of iterations is reached, giving the Diffusion model that has completed the second training stage.
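The two-stage cross-reconstruction pre-training described above might be sketched as follows, assuming a plain encoder-decoder facial reconstruction model with hypothetical `encode`/`decode` methods; the L1 loss again stands in for whichever listed loss is used.

```python
import torch.nn.functional as F

def pretrain_face_reconstruction(model, optimizer, pre_training_pairs, num_iters):
    """Two-stage cross-reconstruction pre-training (illustrative sketch).

    Stage 1 reconstructs the second-region image from the first-region image;
    stage 2 reverses the direction, as described above.
    """
    for stage in (1, 2):
        for step, (first_img, second_img) in enumerate(pre_training_pairs):
            if step >= num_iters:           # stand-in for the training end condition
                break
            src, tgt = (first_img, second_img) if stage == 1 else (second_img, first_img)

            texture_feat = model.encode(src)        # coding layer: texture feature
            predicted = model.decode(texture_feat)  # decoding layer: predicted image

            loss = F.l1_loss(predicted, tgt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```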
The method comprises the steps of obtaining a pre-training set, wherein the pre-training set comprises a plurality of pre-training pairs and any pre-training pair comprises a first training sample image corresponding to a first area and a second training sample image corresponding to a second area of the same sample virtual object, and performing supervised training on the facial reconstruction model according to the first training sample image and the second training sample image of each pre-training pair to obtain the trained facial reconstruction model. This improves the texture feature extraction capability of the coding layer of the facial reconstruction model and guarantees that facial prior features of high accuracy can subsequently be extracted and constructed.
Fig. 2 shows a flowchart of another voice-based image driving method according to an embodiment of the present disclosure, where the method is applied to a cloud-side device, and includes the following specific steps:
step 202: receiving an image driving request for a virtual object sent by an end-side device, wherein the image driving request carries a reference face image and a reference voice of the virtual object;
step 204: performing voice coding on the reference voice to obtain target voice characteristics, and performing image coding on the reference facial image to obtain first image characteristics of a first area and second image characteristics of a second area, wherein the first area is an area of the reference facial image, which changes along with the voice, and the second area is an area of the reference facial image, which is other than the first area;
step 206: performing attention calculation on the first image features by using the facial prior features to obtain attention image features, and performing feature optimization on the attention image features according to the target voice features to obtain first target image features of the first region, wherein the facial prior features comprise facial texture features;
step 208: generating a driven target image according to the first target image characteristic and the second image characteristic;
step 210: and sending the target image to the end-side equipment for rendering.
The cloud-side device is a network cloud-side device that provides the virtual object facial image driving function and is a virtual device. The end-side device is the terminal device on which a client or server of an application, web page or applet platform providing the virtual object facial image driving function runs, and is a physical device. The cloud-side device and the end-side device are connected through a network transmission channel for data transmission. The computing power of the cloud-side device is higher than that of the end-side device.
Steps 204 to 208 are described in detail in steps 104 to 108 of the embodiment of fig. 1 and are not repeated here.
And rendering and displaying the target image by the end-side equipment through the renderer.
In this embodiment of the present disclosure, an image driving request for a virtual object sent by an end-side device is received, where the image driving request carries a reference face image and a reference voice of the virtual object; the reference voice is voice-encoded to obtain a target voice feature, and the reference face image is image-encoded to obtain a first image feature of a first area and a second image feature of a second area, where the first area is the area of the reference face image that changes with the voice and the second area is the area of the reference face image other than the first area; feature transformation is performed on the first image feature based on a facial prior feature and the target voice feature to determine a first target image feature; a driven target image is generated from the first target image feature and the second image feature; and the target image is sent to the end-side device for rendering. Because the facial prior feature includes facial texture features, performing feature transformation on the first image feature of the voice-following first area based on the facial prior feature and the target voice feature means that the obtained first target image feature not only corresponds to the voice feature but also contains the texture features, and the complete driven target image generated from the second image feature and the first target image feature both corresponds to the reference voice and contains the texture features. The target image therefore has the characteristics of high fidelity and high definition, which improves user experience; at the same time, image driving is performed on the cloud-side device with higher computing power, which improves image driving efficiency and reduces the computing cost of the end-side device.
Fig. 3 shows a flowchart of another voice-based image driving method according to an embodiment of the present disclosure, where the method is applied to an augmented reality (AR) device and includes the following specific steps:
step 302: receiving an image driving request for a virtual object, wherein the image driving request carries a reference face image and a reference voice of the virtual object;
step 304: performing voice coding on the reference voice to obtain target voice characteristics, and performing image coding on the reference facial image to obtain first image characteristics of a first area and second image characteristics of a second area, wherein the first area is an area of the reference facial image, which changes along with the voice, and the second area is an area of the reference facial image, which is other than the first area;
step 306: performing attention calculation on the first image features by using the facial prior features to obtain attention image features, and performing feature optimization on the attention image features according to the target voice features to obtain first target image features of the first region, wherein the facial prior features comprise facial texture features;
step 308: generating a driven target image according to the first target image characteristic and the second image characteristic;
Step 310: and rendering the target image.
The embodiments of the present description apply to an augmented reality (AR) device that provides a platform with the virtual object facial image driving function, where the platform may be an AR gaming application, an AR live-streaming platform, an AR information broadcast application, an AR content recommendation application, and the like.
Steps 304 to 308 are described in detail in steps 104 to 108 of the embodiment of fig. 1, and are not described herein.
The target image is rendered as follows: augmented reality rendering is performed on the target image, where the augmented reality rendering is implemented by an AR renderer.
In this embodiment of the present disclosure, an image driving request for a virtual object is received, where the image driving request carries a reference face image and a reference voice of the virtual object; the reference voice is voice-encoded to obtain a target voice feature, and the reference face image is image-encoded to obtain a first image feature of a first area and a second image feature of a second area, where the first area is the area of the reference face image that changes with the voice and the second area is the area of the reference face image other than the first area; feature transformation is performed on the first image feature based on a facial prior feature and the target voice feature to determine a first target image feature, where the facial prior feature includes facial texture features; a driven target image is generated from the first target image feature and the second image feature and is rendered. Because the facial prior feature includes facial texture features, performing feature transformation on the first image feature of the voice-following first area based on the facial prior feature and the target voice feature means that the obtained first target image feature not only corresponds to the voice feature but also contains the texture features, and the complete driven target image generated from the second image feature and the first target image feature both corresponds to the reference voice and contains the texture features. The target image therefore has the characteristics of high fidelity and high definition, which improves user experience; at the same time, image driving is performed on the AR device, which has stronger rendering capability, improving the rendering effect of the driven target image and further improving user experience.
Fig. 4 shows a flowchart of an image-driven data processing method according to an embodiment of the present disclosure, where the method is applied to a cloud-side device, and includes the following specific steps:
step 402: acquiring a training sample set, wherein the training sample set comprises a plurality of training sample groups, and any training sample group comprises a sample image of a sample virtual object, a label image of the sample virtual object and sample voice corresponding to the label image;
step 404: taking the facial prior features as the prior features for feature transformation, performing supervised training on the image driving model according to the sample images, sample voices and label images of each training sample group to obtain a trained image driving model, wherein the facial prior features comprise facial region features extracted from the virtual objects in a sample image set by a pre-trained facial reconstruction model, and the facial region features comprise texture features;
step 406: and sending the model parameters of the trained image driving model to the end-side equipment.
The cloud-side device is a network cloud-side device that provides the model training function and is a virtual device. The end-side device is a terminal device that provides the virtual object facial image driving function and is a physical device. The end-side device and the cloud-side device are connected through a network channel for data transmission. The computing power of the cloud-side device is higher than that of the end-side device.
Steps 402 to 404 are described in detail in the embodiment of fig. 1, and are not described herein.
In this embodiment of the present disclosure, a training sample set is obtained, where the training sample set includes a plurality of training sample groups and any training sample group includes a sample image of a sample virtual object, a label image of the sample virtual object and a sample voice corresponding to the label image; taking the facial prior features as the prior features for feature transformation, the image driving model is supervised-trained according to the sample image, the sample voice and the label image of each training sample group to obtain a trained image driving model, where the facial prior features include facial region features extracted from the virtual objects in a sample image set using a pre-trained facial reconstruction model and the facial region features include texture features; the model parameters of the trained image driving model are then sent to the end-side device. Extracting facial region features from the virtual objects in the sample image set in advance with the pre-trained facial reconstruction model yields facial prior features containing texture features; using these as the prior features for feature transformation and performing supervised training on the image driving model with the sample image, the sample voice and the label image improves the pertinence of the image driving model to voice data and image data and its ability to extract texture features, so that the subsequently generated target image has the characteristics of high fidelity and high definition. At the same time, model training is performed on the cloud-side device with higher computing power, which improves model training efficiency and reduces the computing cost of the end-side device.
Fig. 5 is a schematic flow chart of constructing facial prior features in a voice-based image driving method according to an embodiment of the present disclosure.
As shown in fig. 5, a sample image set is obtained, any sample image is extracted, the sample image is split to obtain a first region sample image and a second region sample image, the first region sample image and the second region sample image are respectively subjected to texture feature extraction by using an encoding layer of a facial reconstruction model, the texture features of the first region sample image and the texture features of the second region sample image are correspondingly obtained, and then the texture features of the first region sample image and the texture features of the second region sample image are decoded by using a decoding layer of the facial reconstruction model to obtain a prediction image. And integrating the texture features of the sample images of the first areas to obtain the prior features of the first face, and integrating the texture features of the sample images of the second areas to obtain the prior features of the second face.
Fig. 6 is a schematic flow chart of a voice-based image driving method according to an embodiment of the present disclosure.
As shown in fig. 6, a calibration face image of the face regions other than the mouth region, a reference face image of those other face regions, and a reference face image of the mouth region are obtained; after encoding by the image encoding layer, a calibration image feature, a second image feature and a first image feature are obtained, and the reference voice is encoded by the speech encoding layer to obtain a target voice feature. The calibration image feature and the second image feature are discretized based on the facial prior feature of the other face regions to obtain a discretized calibration image feature and a discretized second image feature, and the first image feature is discretized based on the facial prior feature of the mouth region to obtain a discretized first image feature. The discretized calibration image feature and the discretized second image feature are input into the key point encoding layer of an adaptive face alignment model and encoded into a calibration key point spatial feature and a second key point spatial feature; based on the degree of feature difference between the calibration key point spatial feature and the second key point spatial feature, the key point decoding layer of the adaptive face alignment model decodes a feature deviation. The feature deviation and the discretized first image feature are input into a bilinear sampling module for interpolation to obtain the calibrated first image feature. Taking the calibrated first image feature as Q (query vector) and the facial prior feature of the mouth region as K (key vector) and V (value vector), the attention computing layer of a Transformer model produces an attention image feature; the target voice feature is input into a condition normalization layer as a constraint condition to normalize the attention image feature, and after processing by a feed-forward network layer (FFN) and a discrete smoothing approximation layer (Gumbel Softmax), the first target image feature corresponding to the mouth region is obtained. The second image feature and the first target image feature are combined to obtain a combined image feature, and the image decoding layer decodes the combined image feature to obtain the driven target image.
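The attention, condition normalization, FFN and Gumbel-Softmax stages described above could be sketched roughly as below. The layer dimensions, the scale/shift form of the condition normalization and the final soft lookup over the prior entries are assumptions for illustration, not the exact design of the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureTransformSketch(nn.Module):
    """Cross-attention with the mouth-region prior as key/value, speech-conditioned
    normalization, a feed-forward network and a Gumbel-Softmax over prior entries."""

    def __init__(self, dim: int, prior_size: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale = nn.Linear(dim, dim)   # speech-conditioned scale
        self.to_shift = nn.Linear(dim, dim)   # speech-conditioned shift
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.to_logits = nn.Linear(dim, prior_size)

    def forward(self, first_feat, speech_feat, mouth_prior):
        # first_feat:  (B, N, dim) calibrated first image features (query Q)
        # speech_feat: (B, dim)    target voice feature (constraint condition)
        # mouth_prior: (prior_size, dim) facial prior of the mouth region (K and V)
        kv = mouth_prior.unsqueeze(0).expand(first_feat.size(0), -1, -1)
        attn_feat, _ = self.attn(first_feat, kv, kv)

        # Condition normalization: the voice feature modulates scale and shift.
        h = self.norm(attn_feat)
        h = h * (1 + self.to_scale(speech_feat).unsqueeze(1)) + self.to_shift(speech_feat).unsqueeze(1)

        h = h + self.ffn(h)

        # Discrete smoothing approximation: soft selection over the prior entries.
        weights = F.gumbel_softmax(self.to_logits(h), tau=1.0, hard=False)
        return weights @ mouth_prior   # (B, N, dim) first target image features
```

In this sketch the voice feature enters only through the normalization layer; in practice the conditioning could equally be injected into the attention or the FFN.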
The following describes, with reference to fig. 7, an application of the voice-based image driving method provided in the present specification to face animation generation as an example. Fig. 7 is a flowchart of a processing procedure of a voice-based image driving method applied to face animation generation according to an embodiment of the present disclosure, where the processing procedure includes the following specific steps:
step 702: a sample image set is acquired, wherein the sample image set comprises a plurality of sample image pairs, any of the sample image pairs comprising sample images corresponding to a mouth region of a same sample animated character and sample images of other facial regions, wherein the other facial regions are other facial regions than the mouth region.
Step 704: for any sample image pair, respectively extracting texture features of the sample image of the mouth region and the sample images of other face regions in the sample image pair by utilizing an image coding layer of a pre-trained facial reconstruction model to obtain the texture features of the sample image of the mouth region and the texture features of the sample image of the other face regions.
In the present embodiment, the facial reconstruction model is a GAN model.
Step 706: And integrating the texture features of the sample images of each mouth region to obtain facial prior features of the mouth region, and integrating the texture features of the sample images of each of the other facial regions to obtain facial prior features of the other facial regions.
Step 708: a calibration face image of other face regions of the target animated character, a reference face image of other face regions, a reference face image of a mouth region, and a reference voice are acquired.
Step 710: and performing image coding on the calibration facial images of other facial areas by using an image coding layer of the facial reconstruction model to obtain calibration image features corresponding to the other facial areas, performing image coding on the reference facial images of the other facial areas by using an image coding layer of the facial reconstruction model to obtain second image features corresponding to the other facial areas, performing image coding on the reference facial images of the mouth areas by using an image coding layer of the facial reconstruction model to obtain first image features corresponding to the mouth areas, and performing voice coding on the reference voice by using a voice coding layer of the facial reconstruction model to obtain target voice features.
Step 712: discretizing the calibration image feature and the second image feature based on the facial prior features of other facial regions to obtain a discretized calibration image feature and a discretized second image feature, and discretizing the first image feature based on the facial prior features of the mouth region to obtain a discretized first image feature.
Step 714: and performing key point coding on the calibration image features to obtain calibration key point space features, and performing key point coding on the second image features to obtain second key point space features.
Step 716: and determining the characteristic deviation according to the characteristic difference degree between the spatial characteristic of the calibration key point and the spatial characteristic of the second key point.
Step 718: and calibrating the first image features according to the feature deviation to obtain calibrated first image features.
Step 720: and performing attention calculation on the calibrated first image feature by using the facial prior feature of the mouth region to obtain an attention image feature.
Step 722: and carrying out normalization processing on the attention image features by taking the target voice features as constraint conditions to obtain first target image features corresponding to the mouth region.
Step 724: and carrying out feature combination on the first target image feature and the second image feature to obtain a combined image feature.
Step 726: based on the combined image features, a video frame driven by the target animated character is generated by using a decoding layer of the facial reconstruction model.
Step 728: and sending the video frame driven by the target animation character to front-end rendering.
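One plausible reading of the calibration in steps 714 to 718 above is a flow-style warp: the deviation between the two key point spatial features is mapped to a per-pixel offset field, which is then applied to the mouth-region feature map by bilinear sampling. The sketch below assumes this reading; the offset prediction head and the grid construction are illustrative only.

```python
import torch
import torch.nn.functional as F

def calibrate_first_feature(first_feat, calib_kp_feat, second_kp_feat, offset_head):
    """Calibrate the mouth-region feature map by bilinear sampling (sketch).

    first_feat:     (B, C, H, W) discretized first image feature map.
    calib_kp_feat:  (B, C, H, W) calibration key point spatial feature.
    second_kp_feat: (B, C, H, W) second key point spatial feature.
    offset_head:    assumed small network mapping the feature difference to a
                    2-channel offset field in normalized coordinates.
    """
    B, _, H, W = first_feat.shape

    # Feature deviation derived from the difference between the key point features.
    offsets = offset_head(second_kp_feat - calib_kp_feat)      # (B, 2, H, W)

    # Identity sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=first_feat.device),
        torch.linspace(-1, 1, W, device=first_feat.device),
        indexing="ij",
    )
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)

    # Shift the grid by the predicted deviation and resample bilinearly.
    grid = base_grid + offsets.permute(0, 2, 3, 1)
    return F.grid_sample(first_feat, grid, mode="bilinear", align_corners=True)
```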
In the embodiment of the present disclosure, facial region features are extracted in advance from the sample animated characters in the sample image set using the pre-trained facial reconstruction model, so that facial prior features containing texture features are obtained. Feature transformation is performed on the first image feature of the mouth region, which follows the voice, according to the target voice feature and the facial prior features containing texture features, so that the obtained first target image feature not only corresponds to the voice feature but also contains the texture features. Finally, a complete driven video frame of the target animated character that corresponds to the reference voice and contains the texture features is generated from the first target image feature and the second image feature. The driven video frame of the target animated character has the characteristics of high fidelity and high definition, which improves user experience; combined with the adaptive face alignment processing, texture matching is ensured and generality is improved.
It should be noted that the embodiments of the present disclosure may involve the use of user data. In practical applications, user-specific personal data may be used in the schemes described herein only within the scope permitted by the applicable laws and regulations of the relevant country and where their requirements are met (for example, the user has given explicit consent and has been properly informed).
Corresponding to the above embodiment of the method of fig. 1, the present disclosure further provides an embodiment of a voice-based image driving device, and fig. 8 shows a schematic structural diagram of a voice-based image driving device according to one embodiment of the present disclosure. As shown in fig. 8, the apparatus includes:
a first acquisition module 802 configured to acquire a reference face image and a reference voice of a virtual object;
a first encoding module 804, configured to perform speech encoding on the reference speech to obtain a target speech feature, and perform image encoding on the reference face image to obtain a first image feature of a first region and a second image feature of a second region, where the first region is a region where the reference face image changes along with the speech, and the second region is a region of the reference face image except the first region;
A first feature transformation module 806 configured to perform feature transformation on the first image features based on the facial prior features and the target speech features, determining first target image features, wherein the facial prior features include facial texture features;
a first generation module 808 is configured to generate a driven target image from the first target image feature and the second image feature.
Optionally, the apparatus further comprises:
the calibration module is configured to acquire a calibration face image of the virtual object, perform image encoding on the calibration face image, and acquire calibration image features of a target area, wherein the target area corresponds to the second area, calibrate the first image features based on feature deviation between the second image features and the calibration image features, and acquire calibrated first image features.
Optionally, the calibration module is further configured to:
and performing key point coding on the second image feature to obtain a second key point space feature, performing key point coding on the calibration image feature to obtain a calibration key point space feature, and calibrating the first image feature based on feature deviation between the second key point space feature and the calibration key point space feature to obtain a calibrated first image feature.
Optionally, the first feature transformation module 806 is further configured to:
and carrying out attention calculation on the first image features by using the facial prior features to obtain first attention image features, and carrying out normalization processing on the first attention image features by taking the target voice features as constraint conditions to obtain first target image features.
Optionally, the facial prior feature includes a first facial prior feature corresponding to the first region and a second facial prior feature corresponding to the second region;
correspondingly, the device further comprises:
the discretization module is configured to perform discretization processing on the first image feature based on the first facial prior feature to obtain a discretized first image feature, and perform discretization processing on the second image feature based on the second facial prior feature to obtain a discretized second image feature.
Optionally, the first encoding module 804 is further configured to:
acquiring a pre-trained image driving model, wherein the image driving model comprises an image coding layer, a voice coding layer, a feature transformation layer and a decoding layer, inputting reference voice into the voice coding layer to obtain target voice features, and inputting a reference facial image into the image coding layer to obtain first image features of a first region and second image features of a second region;
Correspondingly, the first feature transformation module 806 is further configured to:
inputting the first image features, the target voice features and the facial prior features into a feature transformation layer, carrying out feature transformation on the first image features based on the facial prior features and the target voice features, and determining the first target image features;
correspondingly, the first generation module 808 is further configured to:
and carrying out feature combination on the first target image feature and the second image feature to obtain combined image features, and inputting the combined image features into a decoding layer to obtain the driven target image.
Optionally, the apparatus further comprises:
a training module configured to obtain a training sample set, wherein the training sample set comprises a plurality of training sample groups, any training sample group comprising a sample image of a sample virtual object, a label image of the sample virtual object, and sample speech corresponding to the label image; and, taking the facial prior feature as the prior feature for feature transformation, to perform supervised training on the image driving model according to the sample image, the sample voice and the label image of each training sample group to obtain the trained image driving model.
Optionally, the apparatus further comprises:
a facial prior feature construction module configured to obtain a sample image set, wherein the sample image set comprises a plurality of sample image pairs, any of the sample image pairs comprising a first region sample image and a second region sample image corresponding to a same virtual object; for any sample image pair, respectively extracting texture features of a first region sample image and a second region sample image in the sample image pair by utilizing an encoding layer of a pre-trained facial reconstruction model to obtain the texture features of the first region sample image and the texture features of the second region sample image; and integrating the texture features of each first region sample image and the texture features of each second region sample image to obtain the facial prior features.
Optionally, the apparatus further comprises:
a pre-training module configured to obtain a pre-training set, wherein the pre-training set comprises a plurality of pre-training pairs, any one of the pre-training pairs comprising a first training sample image corresponding to a first region of a same sample virtual object and a second training sample image of a second region; and performing supervised training on the facial reconstruction model according to the first training sample image and the second training sample image of each pre-training pair to obtain a trained facial reconstruction model.
In the embodiment of the present disclosure, since the facial prior feature includes a facial texture feature, and based on the facial prior feature and the target voice feature, the feature transformation is performed on the first image feature of the first region following the voice transformation, so that the obtained first target image feature not only corresponds to the voice feature and includes the texture feature, but also generates a complete target image which corresponds to the reference voice and includes the texture feature according to the second image feature and the first target image feature, and the target image has the characteristics of high fidelity and high definition, thereby improving the user experience.
The above is an exemplary scheme of a voice-based image driving apparatus of the present embodiment. It should be noted that, the technical solution of the voice-based image driving device and the technical solution of the foregoing voice-based image driving method belong to the same concept, and details of the technical solution of the voice-based image driving device that are not described in detail may be referred to the description of the technical solution of the foregoing voice-based image driving method.
Corresponding to the above embodiment of the method of fig. 2, the present disclosure further provides an embodiment of a voice-based image driving apparatus, and fig. 9 shows a schematic structural diagram of another voice-based image driving apparatus provided in one embodiment of the present disclosure. As shown in fig. 9, the apparatus is applied to cloud-side equipment, and the apparatus includes:
a first receiving module 902 configured to receive an image driving request for a virtual object sent by an end-side device, wherein the image driving request carries a reference face image and a reference voice of the virtual object;
a second encoding module 904 configured to perform speech encoding on the reference speech to obtain a target speech feature, and perform image encoding on the reference face image to obtain a first image feature of a first region and a second image feature of a second region, where the first region is a region where the reference face image changes along with the speech, and the second region is a region of the reference face image other than the first region;
a second feature transformation module 906 configured to perform feature transformation on the first image features based on the facial prior features and the target speech features, determining first target image features, wherein the facial prior features include facial texture features;
A second generation module 908 configured to generate a driven target image from the first target image feature and the second image feature;
the first rendering module 910 is configured to send the target image to the end-side device for rendering.
In the embodiment of the specification, because the facial prior feature includes facial texture features, feature transformation is performed on the first image feature of the first area following the voice transformation based on the facial prior feature and the target voice feature, so that the obtained first target image feature not only corresponds to the voice feature and contains texture features, but also generates a complete target image which corresponds to the reference voice and contains texture features according to the second image feature and the first target image feature, the target image has the characteristics of high fidelity and high definition, the user experience is improved, meanwhile, the image driving is realized on cloud side equipment with higher computing power, the image driving efficiency is improved, and the computing power cost of the end side equipment is reduced.
The above is another exemplary embodiment of the voice-based image driving apparatus of the present embodiment. It should be noted that, the technical solution of the voice-based image driving device and the technical solution of the foregoing voice-based image driving method belong to the same concept, and details of the technical solution of the voice-based image driving device that are not described in detail may be referred to the description of the technical solution of the foregoing voice-based image driving method.
Corresponding to the above embodiment of the method of fig. 3, the present disclosure further provides an embodiment of a voice-based image driving apparatus, and fig. 10 shows a schematic structural diagram of another voice-based image driving apparatus provided in one embodiment of the present disclosure. As shown in fig. 10, the apparatus is applied to an augmented reality (AR) device, and the apparatus includes:
a second receiving module 1002 configured to receive an image driving request for a virtual object, wherein the image driving request carries a reference face image and a reference voice of the virtual object;
a third encoding module 1004, configured to perform speech encoding on the reference speech to obtain a target speech feature, and perform image encoding on the reference face image to obtain a first image feature of a first region and a second image feature of a second region, where the first region is a region where the reference face image changes along with the speech, and the second region is a region of the reference face image except the first region;
a third feature transformation module 1006 configured to perform feature transformation on the first image features based on the facial prior features and the target speech features, determining first target image features, wherein the facial prior features include facial texture features;
A third generation module 1008 configured to generate a driven target image from the first target image feature and the second image feature;
a second rendering module 1010 configured to render the target image.
In the embodiment of the specification, because the facial prior feature includes facial texture features, performing feature transformation on the first image feature of the voice-following first region based on the facial prior feature and the target voice feature means that the obtained first target image feature not only corresponds to the voice feature but also contains the texture features, and the complete driven target image generated from the second image feature and the first target image feature both corresponds to the reference voice and contains the texture features. The target image therefore has the characteristics of high fidelity and high definition, which improves user experience; at the same time, image driving is performed on the AR device, which has stronger rendering capability, improving the rendering effect of the driven target image and further improving user experience.
The above is an exemplary scheme of a voice-based image driving apparatus of the present embodiment. It should be noted that, the technical solution of the voice-based image driving device and the technical solution of the foregoing voice-based image driving method belong to the same concept, and details of the technical solution of the voice-based image driving device that are not described in detail may be referred to the description of the technical solution of the foregoing voice-based image driving method.
Corresponding to the above embodiment of the method of fig. 4, the present disclosure further provides an embodiment of an image-driven data processing apparatus, and fig. 11 is a schematic structural diagram of an image-driven data processing apparatus according to one embodiment of the present disclosure. As shown in fig. 11, the apparatus is applied to cloud-side equipment, and the apparatus includes:
a second obtaining module 1102 configured to obtain a training sample set, wherein the training sample set comprises a plurality of training sample groups, any training sample group comprising a sample image of a sample virtual object, a label image of the sample virtual object, and sample speech corresponding to the label image;
the model training module 1104 is configured to perform supervised training on the image driving model according to the sample images, the sample voices and the label images of each training sample group by taking the prior facial features as prior features of feature transformation to obtain a trained image driving model, wherein the prior facial features comprise facial region features extracted from virtual objects in a sample image set by utilizing a pre-trained facial reconstruction model, and the facial region features comprise texture features;
a transmitting module 1106 configured to transmit model parameters of the trained image-driven model to the end-side device.
In the embodiment of the specification, facial region features are extracted in advance from the virtual objects in the sample image set using the pre-trained facial reconstruction model, so that facial prior features containing texture features are obtained. Taking these facial prior features as the prior features for feature transformation and performing supervised training on the image driving model according to the sample image, the sample voice and the label image yields the trained image driving model, which improves the pertinence of the image driving model to voice data and image data and its ability to extract texture features, so that the subsequently generated target image has the characteristics of high fidelity and high definition. At the same time, model training is performed on the cloud-side device with higher computing power, which improves model training efficiency and reduces the computing cost of the end-side device.
The above is a schematic scheme of an image-driven data processing apparatus of the present embodiment. It should be noted that, the technical solution of the image-driven data processing apparatus and the technical solution of the image-driven data processing method belong to the same concept, and details of the technical solution of the image-driven data processing apparatus that are not described in detail may be referred to the description of the technical solution of the image-driven data processing method.
FIG. 12 illustrates a block diagram of a computing device provided in accordance with one embodiment of the present description. The components of computing device 1200 include, but are not limited to, memory 1210 and processor 1220. Processor 1220 is coupled to memory 1210 by bus 1230 and database 1250 is used to store data.
The computing device 1200 also includes an access device 1240, which enables the computing device 1200 to communicate via one or more networks 1260. Examples of such networks include the PSTN (Public Switched Telephone Network), a LAN (Local Area Network), a WAN (Wide Area Network), a PAN (Personal Area Network), or a combination of communication networks such as the Internet. The access device 1240 may include one or more of any type of wired or wireless network interface (e.g., a NIC (Network Interface Controller)), such as an IEEE 802.11 WLAN (Wireless Local Area Network) wireless interface, a WiMAX (Worldwide Interoperability for Microwave Access) interface, an Ethernet interface, a USB (Universal Serial Bus) interface, a cellular network interface, a Bluetooth interface, an NFC (Near Field Communication) interface, and so forth.
In one embodiment of the present description, the above components of computing device 1200, as well as other components not shown in fig. 12, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 12 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 1200 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC (Personal Computer ). Computing device 1200 may also be a mobile or stationary server.
The processor 1220 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the above-described voice-based image driving method or image-driven data processing method.
The foregoing is a schematic description of the computing device of this embodiment. It should be noted that the technical solution of the computing device and the technical solutions of the above voice-based image driving method and image-driven data processing method belong to the same concept; for details of the computing device that are not described here, reference may be made to the description of the voice-based image driving method or the image-driven data processing method.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the above-described voice-based image driving method or image-driven data processing method.
The above is a schematic description of the computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and the technical solutions of the above voice-based image driving method and image-driven data processing method belong to the same concept; for details of the storage medium that are not described here, reference may be made to the description of the voice-based image driving method or the image-driven data processing method.
An embodiment of the present specification also provides a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the above-described voice-based image driving method or image-driven data processing method.
The above is a schematic description of the computer program of this embodiment. It should be noted that the technical solution of the computer program and the technical solutions of the above voice-based image driving method and image-driven data processing method belong to the same concept; for details of the computer program that are not described here, reference may be made to the description of the voice-based image driving method or the image-driven data processing method.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series of action combinations, but those skilled in the art should understand that the embodiments are not limited by the described order of actions, since some steps may be performed in another order or simultaneously according to the embodiments of the present specification. Further, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily all required by the embodiments of the specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to help explain the present specification. The optional embodiments are not exhaustive, nor do they limit the invention to the precise forms disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and their practical application, thereby enabling others skilled in the art to understand and utilize the invention. The specification is to be limited only by the claims and their full scope and equivalents.

Claims (14)

1. A voice-based image driving method, comprising:
acquiring a reference voice and a reference face image of a virtual object;
performing voice coding on the reference voice to obtain a target voice feature, and performing image coding on the reference face image to obtain a first image feature of a first area and a second image feature of a second area, wherein the first area is an area of the reference face image that changes with the voice, and the second area is an area of the reference face image other than the first area;
performing feature transformation on the first image feature based on a facial prior feature and the target voice feature to determine a first target image feature, wherein the facial prior feature comprises a facial texture feature;
and generating a driven target image according to the first target image feature and the second image feature.
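As an illustrative sketch only, the pipeline recited in claim 1 can be written as follows; every module name, tensor shape and the channel-wise concatenation are assumptions, not the claimed implementation.

```python
import torch
import torch.nn as nn

class VoiceDrivenImageModel(nn.Module):
    """Sketch of claim 1: encode, transform the voice-dependent area, then decode."""
    def __init__(self, voice_encoder, image_encoder, feature_transform, decoder):
        super().__init__()
        self.voice_encoder = voice_encoder          # reference voice -> target voice feature
        self.image_encoder = image_encoder          # reference face  -> (first-area, second-area) features
        self.feature_transform = feature_transform  # prior- and voice-conditioned transformation
        self.decoder = decoder                      # combined features -> driven target image

    def forward(self, reference_voice, reference_face, facial_prior):
        voice_feat = self.voice_encoder(reference_voice)
        first_feat, second_feat = self.image_encoder(reference_face)
        first_target = self.feature_transform(first_feat, facial_prior, voice_feat)
        combined = torch.cat([first_target, second_feat], dim=1)   # feature combination
        return self.decoder(combined)                              # driven target image
```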
2. The method of claim 1, further comprising, before the performing feature transformation on the first image feature based on the facial prior feature and the target voice feature to determine a first target image feature:
acquiring a calibration face image of the virtual object;
performing image coding on the calibration face image to obtain a calibration image feature of a target area, wherein the target area corresponds to the second area;
and calibrating the first image feature based on the feature deviation between the second image feature and the calibration image feature to obtain a calibrated first image feature.
3. The method of claim 2, wherein the calibrating the first image feature based on the feature deviation between the second image feature and the calibration image feature to obtain a calibrated first image feature comprises:
performing key point coding on the second image feature to obtain a second key point space feature, and performing key point coding on the calibration image feature to obtain a calibration key point space feature;
and calibrating the first image feature based on the feature deviation between the second key point space feature and the calibration key point space feature to obtain a calibrated first image feature.
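One hedged reading of claims 2 and 3 in code, assuming the key point coding is a learned projection, all features are (batch, dim) vectors, and the feature deviation is mapped back onto the first image feature as an additive offset:

```python
import torch
import torch.nn as nn

class KeypointCalibrator(nn.Module):
    """Sketch: calibrate the first-area feature by the key-point deviation of the second area."""
    def __init__(self, feat_dim, kp_dim):
        super().__init__()
        self.keypoint_encoder = nn.Linear(feat_dim, kp_dim)   # key point coding (assumed linear)
        self.offset_proj = nn.Linear(kp_dim, feat_dim)        # deviation -> offset in feature space

    def forward(self, first_feat, second_feat, calib_feat):
        # all inputs assumed shaped (batch, feat_dim)
        second_kp = self.keypoint_encoder(second_feat)        # second key point space feature
        calib_kp = self.keypoint_encoder(calib_feat)          # calibration key point space feature
        deviation = second_kp - calib_kp                      # feature deviation
        return first_feat + self.offset_proj(deviation)       # calibrated first image feature
```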
4. The method of claim 1, wherein the performing feature transformation on the first image feature based on the facial prior feature and the target voice feature to determine a first target image feature comprises:
performing attention calculation on the first image feature by using the facial prior feature to obtain a first attention image feature;
and carrying out normalization processing on the first attention image feature by taking the target voice feature as a constraint condition to obtain a first target image feature.
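Claim 4 could, for example, be realized with standard multi-head attention over the facial prior followed by a voice-conditioned affine normalization; the layer choices and dimensions below are assumptions.

```python
import torch
import torch.nn as nn

class PriorAttentionTransform(nn.Module):
    """Attention over the facial prior, then normalization constrained by the voice feature."""
    def __init__(self, dim, voice_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(voice_dim, 2 * dim)   # voice feature -> (gamma, beta)

    def forward(self, first_feat, facial_prior, voice_feat):
        # first_feat: (B, N, dim); facial_prior: (K, dim); voice_feat: (B, voice_dim)
        prior = facial_prior.unsqueeze(0).expand(first_feat.size(0), -1, -1)
        # attention: the first image feature queries the texture-bearing facial prior
        attended, _ = self.attn(query=first_feat, key=prior, value=prior)
        # normalization with the target voice feature as the constraint condition
        gamma, beta = self.to_scale_shift(voice_feat).chunk(2, dim=-1)
        return self.norm(attended) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)
```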
5. The method of claim 1, wherein the facial prior feature comprises a first facial prior feature corresponding to the first area and a second facial prior feature corresponding to the second area;
before the performing feature transformation on the first image feature based on the facial prior feature and the target voice feature, the method further comprises:
performing discretization processing on the first image feature based on the first facial prior feature to obtain a discretized first image feature;
and discretizing the second image feature based on the second facial prior feature to obtain a discretized second image feature.
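The discretization of claim 5 is read here, purely as an assumed interpretation, as a nearest-entry lookup in which the facial prior features act as codebooks; the shapes are likewise assumptions.

```python
import torch

def discretize(features, prior_codebook):
    """features: (B, N, D); prior_codebook: (K, D). Returns the nearest prior entry per vector."""
    diffs = features.unsqueeze(2) - prior_codebook.unsqueeze(0).unsqueeze(0)   # (B, N, K, D)
    dists = diffs.pow(2).sum(dim=-1)                                           # squared distances (B, N, K)
    indices = dists.argmin(dim=-1)                                             # closest prior entry per vector
    return prior_codebook[indices]                                             # discretized features (B, N, D)

# first_feat_q  = discretize(first_feat,  first_facial_prior)
# second_feat_q = discretize(second_feat, second_facial_prior)
```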
6. The method of claim 1, wherein the performing voice coding on the reference voice to obtain a target voice feature, and performing image coding on the reference face image to obtain a first image feature of a first area and a second image feature of a second area, includes:
acquiring a pre-trained image driving model, wherein the image driving model comprises an image coding layer, a voice coding layer, a feature transformation layer and a decoding layer;
inputting the reference voice into the voice coding layer to obtain the target voice feature, and inputting the reference face image into the image coding layer to obtain the first image feature of the first area and the second image feature of the second area;
the performing feature transformation on the first image feature based on the facial prior feature and the target voice feature to determine a first target image feature includes:
inputting the first image feature, the target voice feature and the facial prior feature into the feature transformation layer, and performing feature transformation on the first image feature based on the facial prior feature and the target voice feature to determine the first target image feature;
the generating a driven target image according to the first target image feature and the second image feature includes:
performing feature combination on the first target image feature and the second image feature to obtain a combined image feature;
and inputting the combined image feature into the decoding layer to obtain the driven target image.
7. The method of claim 6, further comprising, before the acquiring a pre-trained image driving model:
acquiring a training sample set, wherein the training sample set comprises a plurality of training sample groups, and any training sample group comprises a sample image of a sample virtual object, a label image of the sample virtual object and sample voice corresponding to the label image;
and performing supervised training on the image driving model according to the sample image, the sample voice and the label image of each training sample group, with the facial prior feature as the prior feature for feature transformation, to obtain a trained image driving model.
8. The method of any one of claims 1 to 7, wherein the facial prior feature is obtained by performing feature extraction on facial region features of a sample image in advance by using a facial reconstruction model, and the method further comprises:
obtaining a sample image set, wherein the sample image set comprises a plurality of sample image pairs, and any sample image pair comprises a first region sample image and a second region sample image corresponding to the same virtual object;
for any sample image pair, respectively extracting texture features of a first region sample image and a second region sample image in the sample image pair by utilizing an encoding layer of a pre-trained facial reconstruction model to obtain the texture features of the first region sample image and the texture features of the second region sample image;
and integrating the texture features of each first region sample image and the texture features of each second region sample image to obtain the facial prior features.
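A sketch of claim 8, under the assumptions that the reconstruction model's encoding layer returns one texture feature per image and that "integrating" amounts to stacking and averaging the features of each region:

```python
import torch

@torch.no_grad()
def build_facial_prior(face_recon_encoder, sample_pairs):
    """sample_pairs: iterable of (first_region_img, second_region_img) tensors for one virtual object."""
    first_feats, second_feats = [], []
    for first_img, second_img in sample_pairs:
        first_feats.append(face_recon_encoder(first_img))     # texture feature, first region
        second_feats.append(face_recon_encoder(second_img))   # texture feature, second region
    first_prior = torch.stack(first_feats).mean(dim=0)        # integrated first facial prior feature
    second_prior = torch.stack(second_feats).mean(dim=0)      # integrated second facial prior feature
    return first_prior, second_prior
```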
9. The method of claim 8, further comprising, before the extracting texture features of the first region sample image and the second region sample image in the sample image pair by using the encoding layer of the pre-trained facial reconstruction model:
obtaining a pre-training set, wherein the pre-training set comprises a plurality of pre-training pairs, and any pre-training pair comprises a first training sample image corresponding to a first area of the same sample virtual object and a second training sample image corresponding to a second area of the same sample virtual object;
and performing supervised training on the facial reconstruction model according to the first training sample image and the second training sample image of each pre-training pair to obtain a trained facial reconstruction model.
10. A voice-based image driving method, applied to a cloud-side device, comprising:
receiving an image driving request for a virtual object sent by an end-side device, wherein the image driving request carries a reference face image and a reference voice of the virtual object;
performing voice coding on the reference voice to obtain a target voice feature, and performing image coding on the reference face image to obtain a first image feature of a first area and a second image feature of a second area, wherein the first area is an area of the reference face image that changes with the voice, and the second area is an area of the reference face image other than the first area;
performing feature transformation on the first image feature based on a facial prior feature and the target voice feature to determine a first target image feature, wherein the facial prior feature comprises a facial texture feature;
generating a driven target image according to the first target image feature and the second image feature;
and sending the target image to the end-side device for rendering.
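A schematic cloud-side handler for claim 10; the request container, the transport and the facial prior handling are assumptions, and `model` stands for the trained image driving model sketched earlier.

```python
import torch

def handle_image_driving_request(request, model, facial_prior, device="cuda"):
    """request is assumed to carry the reference face image and reference voice as tensors."""
    reference_face = request["reference_face"].to(device)
    reference_voice = request["reference_voice"].to(device)
    with torch.no_grad():
        target_image = model(reference_voice, reference_face, facial_prior.to(device))
    # the driven target image is returned to the end-side device for rendering
    return {"target_image": target_image.cpu()}
```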
11. A voice-based image driving method, applied to an augmented reality (AR) device, comprising:
receiving an image driving request for a virtual object, wherein the image driving request carries a reference face image and a reference voice of the virtual object;
performing voice coding on the reference voice to obtain a target voice feature, and performing image coding on the reference face image to obtain a first image feature of a first area and a second image feature of a second area, wherein the first area is an area of the reference face image that changes with the voice, and the second area is an area of the reference face image other than the first area;
performing feature transformation on the first image feature based on a facial prior feature and the target voice feature to determine a first target image feature, wherein the facial prior feature comprises a facial texture feature;
generating a driven target image according to the first target image feature and the second image feature;
and rendering the target image.
12. An image-driven data processing method, applied to a cloud-side device, comprising:
acquiring a training sample set, wherein the training sample set comprises a plurality of training sample groups, and any training sample group comprises a sample image of a sample virtual object, a label image of the sample virtual object and sample voice corresponding to the label image;
performing supervised training on the image driving model according to the sample image, the sample voice and the label image of each training sample group, with the facial prior feature as the prior feature for feature transformation, to obtain a trained image driving model;
and sending model parameters of the trained image driving model to an end-side device.
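Claim 12 sketched end to end, for illustration: cloud-side training followed by delivery of the learned parameters to the end-side device. The serialization format, the `send_to_end_side` callable and `train_image_driving_model` (the earlier training sketch) are assumptions.

```python
import io
import torch

def train_and_export(model, facial_prior, loader, send_to_end_side):
    """Cloud-side training followed by parameter delivery to the end-side device."""
    trained = train_image_driving_model(model, facial_prior, loader)  # see the earlier training sketch
    buffer = io.BytesIO()
    torch.save(trained.state_dict(), buffer)     # model parameters only, not the full module
    send_to_end_side(buffer.getvalue())          # the end-side device restores them via load_state_dict
```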
13. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions that, when executed by the processor, implement the steps of the voice-based image driving method of any one of claims 1 to 11 or the image-driven data processing method of claim 12.
14. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the voice-based image driving method of any one of claims 1 to 11 or the image-driven data processing method of claim 12.
CN202310252857.1A 2023-03-08 2023-03-08 Image driving method based on voice and data processing method of image driving Pending CN116363269A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310252857.1A CN116363269A (en) 2023-03-08 2023-03-08 Image driving method based on voice and data processing method of image driving

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310252857.1A CN116363269A (en) 2023-03-08 2023-03-08 Image driving method based on voice and data processing method of image driving

Publications (1)

Publication Number Publication Date
CN116363269A true CN116363269A (en) 2023-06-30

Family

ID=86918409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310252857.1A Pending CN116363269A (en) 2023-03-08 2023-03-08 Image driving method based on voice and data processing method of image driving

Country Status (1)

Country Link
CN (1) CN116363269A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination