CN115345968A - Virtual object driving method, deep learning network training method and device


Info

Publication number
CN115345968A
Authority
CN
China
Prior art keywords
feature
sample
sub
voice
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211276271.0A
Other languages
Chinese (zh)
Other versions
CN115345968B (en)
Inventor
张展望
颜剑锋
梁柏荣
徐志良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211276271.0A priority Critical patent/CN115345968B/en
Publication of CN115345968A publication Critical patent/CN115345968A/en
Application granted granted Critical
Publication of CN115345968B publication Critical patent/CN115345968B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T 13/00 Animation
                    • G06T 13/20 3D [Three Dimensional] animation
                        • G06T 13/205 3D [Three Dimensional] animation driven by audio data
                        • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
                    • G06T 13/80 2D [Two Dimensional] animation, e.g. using sprites
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 20/00 Machine learning
            • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V 10/00 Arrangements for image or video recognition or understanding
                    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
                        • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
                • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
                    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
                        • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
                            • G06V 40/168 Feature extraction; Face representation
                            • G06V 40/172 Classification, e.g. identification
                    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00 Speech recognition
                    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit

Abstract

The present disclosure provides a virtual object driving method, a training method and apparatus for a deep learning network, a device, a medium, and a product, which relate to the field of artificial intelligence, in particular to the technical fields of deep learning, computer vision, virtual/augmented reality, and image processing, and can be applied to scenarios such as virtual digital humans and the metaverse. The implementation scheme includes: in response to acquiring voice data, determining initial voice features based on the voice data; performing time sequence enhancement processing on the initial voice features to obtain target voice features; generating a lip image sequence for a target virtual object based on the target voice features and a reference face image of the target virtual object; and driving the target virtual object according to the lip image sequence so that the target virtual object performs a lip action matched with the voice data.

Description

Virtual object driving method, deep learning network training method and device
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly to the technical fields of deep learning, computer vision, virtual/augmented reality, and image processing, and is applicable to scenarios such as virtual digital humans and the metaverse.
Background
Virtual objects are widely used in interactive experience platforms, and virtual object driving affects the visual effect and interaction performance of the computer. However, in some scenarios, the virtual object driving process suffers from poor driving stability and a poor driving effect.
Disclosure of Invention
The present disclosure provides a virtual object driving method and apparatus, and a training method and apparatus for a deep learning network, as well as a device, a medium, and a product.
According to an aspect of the present disclosure, there is provided a virtual object driving method, including: in response to acquiring voice data, determining initial voice features based on the voice data; performing time sequence enhancement processing on the initial voice features to obtain target voice features; generating a lip image sequence for a target virtual object based on the target voice features and a reference face image of the target virtual object; and driving the target virtual object according to the lip image sequence so that the target virtual object performs a lip action matched with the voice data.
According to another aspect of the present disclosure, there is provided a training method for a deep learning network, including: performing feature extraction on a sample voice segment to obtain sample voice features, and performing feature extraction on a sample face image that has a time-series association relationship with the sample voice segment to obtain sample face features; performing time sequence enhancement processing on the sample voice features to obtain training voice features; generating, through a deep learning network to be trained, a lip image sequence matched with the sample voice segment according to the training voice features and the sample face features; and adjusting model parameters of the deep learning network according to the timing synchronization between the lip image sequence and the sample voice segment to obtain a trained target deep learning network.
According to another aspect of the present disclosure, there is provided a virtual object driving apparatus, including: an initial voice feature determination module, a target voice feature determination module, a lip image sequence first determination module, and a driving module. The initial voice feature determination module is configured to determine, in response to acquired voice data, initial voice features based on the voice data; the target voice feature determination module is configured to perform time sequence enhancement processing on the initial voice features to obtain target voice features; the lip image sequence first determination module is configured to generate a lip image sequence for a target virtual object based on the target voice features and a reference face image of the target virtual object; and the driving module is configured to drive the target virtual object according to the lip image sequence so that the target virtual object performs a lip action matched with the voice data.
According to another aspect of the present disclosure, there is provided a training apparatus for a deep learning network, including: a sample feature determination module, a training voice feature determination module, a lip image sequence second determination module, and a target deep learning network determination module. The sample feature determination module is configured to perform feature extraction on a sample voice segment to obtain sample voice features, and to perform feature extraction on a sample face image that has a time-series association relationship with the sample voice segment to obtain sample face features; the training voice feature determination module is configured to perform time sequence enhancement processing on the sample voice features to obtain training voice features; the lip image sequence second determination module is configured to generate, through a deep learning network to be trained, a lip image sequence matched with the sample voice segment according to the training voice features and the sample face features; and the target deep learning network determination module is configured to adjust model parameters of the deep learning network according to the timing synchronization between the lip image sequence and the sample voice segment to obtain a trained target deep learning network.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. The memory stores instructions executable by the at least one processor, and the instructions are executable by the at least one processor to enable the at least one processor to perform the virtual object driving method or the training method of the deep learning network described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the virtual object driving method or the training method of the deep learning network described above.
According to another aspect of the present disclosure, there is provided a computer program product, stored on at least one of a readable storage medium and an electronic device, comprising a computer program which, when executed by a processor, implements the virtual object driving method or the training method of the deep learning network described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 schematically illustrates a system architecture of a training method and apparatus for a deep learning network according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow diagram of a virtual object driving method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a training method of a deep learning network according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow diagram of a training method of a deep learning network according to another embodiment of the present disclosure;
FIG. 5 schematically illustrates a schematic diagram of a training process for a deep learning network according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a block diagram of a virtual object driving apparatus according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a training apparatus of a deep learning network according to an embodiment of the present disclosure;
fig. 8 schematically illustrates a block diagram of an electronic device for performing deep learning network training in accordance with an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include, but not be limited to, systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
An embodiment of the present disclosure provides a training method for a deep learning network. The training method includes: performing feature extraction on a sample voice segment to obtain sample voice features, and performing feature extraction on a sample face image that has a time-series association relationship with the sample voice segment to obtain sample face features; performing time sequence enhancement processing on the sample voice features to obtain training voice features; generating, through a deep learning network to be trained, a lip image sequence matched with the sample voice segment according to the training voice features and the sample face features; and adjusting model parameters of the deep learning network according to the timing synchronization between the lip image sequence and the sample voice segment to obtain a trained target deep learning network.
Fig. 1 schematically illustrates a system architecture of a training method and apparatus for a deep learning network according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
The system architecture 100 according to this embodiment may include a data terminal 101, a network 102, and a server 103. The network 102 is the medium used to provide a communication link between the data terminal 101 and the server 103. The network 102 may include various connection types, such as wired or wireless communication links, or fiber optic cables. The server 103 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud computing, network services, and middleware services.
The server 103 may be a server providing various services, for example, a server performing deep learning network training according to the sample voice segments provided by the data terminal 101.
For example, the server 103 is configured to perform feature extraction on a sample voice segment to obtain a sample voice feature, and perform feature extraction on a sample face image having a time-series association relationship with the sample voice segment to obtain a sample face feature; performing time sequence enhancement processing on the sample voice features to obtain training voice features; generating a lip shape image sequence matched with the sample voice segment according to the training voice feature and the sample facial feature through a deep learning network to be trained; and adjusting model parameters of the deep learning network according to the time sequence synchronism between the lip shape image sequence and the sample voice fragment to obtain the trained target deep learning network.
It should be noted that the training method of the deep learning network provided by the embodiment of the present disclosure may be executed by the server 103. Accordingly, the training device of the deep learning network provided by the embodiment of the present disclosure may be disposed in the server 103. The training method of the deep learning network provided by the embodiment of the present disclosure may also be executed by a server or a server cluster that is different from the server 103 and is capable of communicating with the data terminal 101 and/or the server 103. Correspondingly, the training device of the deep learning network provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster that is different from the server 103 and is capable of communicating with the data terminal 101 and/or the server 103.
It should be understood that the number of data terminals, networks, and servers in fig. 1 is merely illustrative. There may be any number of data terminals, networks, and servers, as desired for an implementation.
Embodiments of the present disclosure provide a virtual object driving method and a training method for a deep learning network. The virtual object driving method according to an exemplary embodiment of the present disclosure is described with reference to fig. 2, and the training method of the deep learning network according to exemplary embodiments of the present disclosure is described with reference to fig. 3 to 5 in conjunction with the system architecture of fig. 1. The training method of the deep learning network of the embodiments of the present disclosure may be performed, for example, by the server 103 shown in fig. 1.
Fig. 2 schematically illustrates a flow diagram of a virtual object driving method according to an embodiment of the present disclosure.
As shown in FIG. 2, the virtual object driving method 200 according to the embodiment of the present disclosure may include operations S210 to S240.
In operation S210, in response to the acquired voice data 201, initial voice features 202 are determined based on the voice data 201.
In operation S220, a timing enhancement process is performed on the initial voice feature 202 to obtain a target voice feature FM.
The initial speech feature 202 includes a plurality of initial speech sub-features having a time-series relationship.
Illustratively, the voice data may be divided into a plurality of speech frames, and an initial speech sub-feature may be understood as the speech feature corresponding to any one speech frame. It will be appreciated that the speech frames obtained from the voice data have a timing relationship, and the corresponding plurality of initial speech sub-features therefore also have a timing relationship.
Illustratively, obtaining the target voice feature in operation S220 may be implemented as follows: for a target speech sub-feature among the plurality of initial speech sub-features, the adjacent speech sub-features matched with the target speech sub-feature are determined; the target speech sub-feature and the adjacent speech sub-features are fused to obtain a fused speech sub-feature; and the target voice feature is obtained from the fused speech sub-features matched with the respective initial speech sub-features among the plurality of initial speech sub-features.
The target speech sub-feature comprises any initial speech sub-feature of the plurality of initial speech sub-features.
An adjacent speech sub-feature may be, for example, an initial speech sub-feature whose speech frame is adjacent to that of the target speech sub-feature. In the schematic example of fig. 2, for a target speech sub-feature FT among the plurality of initial speech sub-features, an adjacent speech sub-feature FT-1 and/or an adjacent speech sub-feature FT-2 matched with the target speech sub-feature FT may be determined, and fig. 2 shows the case where a fused speech sub-feature FM-i is obtained from the target speech sub-feature FT, the adjacent speech sub-feature FT-1, and the adjacent speech sub-feature FT-2.
It will be appreciated that the time interval between speech frames is short, so the feature differences between adjacent speech frames are correspondingly small. In the virtual object driving method of the embodiments of the present disclosure, the target speech sub-feature is fused with its adjacent speech sub-features; the resulting fused speech sub-features alleviate large jitter between adjacent speech frames and are more stable and more continuous, which makes the lip shape changes of the subsequently driven virtual object steadier and more coherent.
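As a minimal illustration of the fusion in operation S220, the sketch below (Python/NumPy) averages each per-frame speech feature with its neighbouring frames. The one-frame window on each side and the fixed fusion weights are illustrative assumptions rather than values given by the disclosure, which in the training embodiments also describes a learned (self-attention) fusion.

```python
import numpy as np

def temporal_enhance(frame_feats: np.ndarray,
                     weights=(0.25, 0.5, 0.25)) -> np.ndarray:
    """Fuse each per-frame speech feature with its neighbours.

    frame_feats: array of shape (T, D), one D-dim feature per speech frame.
    weights: fusion weights for (previous, current, next); illustrative only.
    """
    frame_feats = np.asarray(frame_feats, dtype=float)
    padded = np.pad(frame_feats, ((1, 1), (0, 0)), mode="edge")  # repeat edge frames
    prev_f, cur_f, next_f = padded[:-2], padded[1:-1], padded[2:]
    return weights[0] * prev_f + weights[1] * cur_f + weights[2] * next_f
```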
In operation S230, a lip image sequence 205 for the target virtual object is generated based on the target voice feature FM and the reference face image 203 of the target virtual object.
Illustratively, obtaining the lip image sequence for the target virtual object in operation S230 may be implemented as follows: the target voice feature FM and the reference face image 203 are used as input data of the trained target deep learning network 204 to obtain the lip image sequence 205 for the target virtual object.
The target deep learning network 204 may be trained based on a sample voice segment and a sample face image that has a time-series association relationship with the sample voice segment: time sequence enhancement processing is performed on the sample voice features extracted from the sample voice segment to obtain training voice features, and deep learning network training is performed according to the training voice features and the sample face features extracted from the sample face image to obtain the target deep learning network 204.
For example, the target deep learning network 204 may be trained as follows: performing feature extraction on the sample voice segment to obtain sample voice features, and performing feature extraction on the sample face image that has a time-series association relationship with the sample voice segment to obtain sample face features; performing time sequence enhancement processing on the sample voice features to obtain training voice features; generating, through a deep learning network to be trained, a lip image sequence matched with the sample voice segment according to the training voice features and the sample face features; and adjusting model parameters of the deep learning network according to the timing synchronization between the lip image sequence and the sample voice segment to obtain the trained target deep learning network 204.
After the target voice feature FM and the reference face image 203 are input to the target deep learning network, the corresponding lip image sequence can be obtained; for the specific principle and technical effect, reference may be made to the description of the training method of the deep learning network.
In operation S240, the target virtual object 206 is driven according to the lip image sequence 205 so that the target virtual object 206 performs a lip action matching the voice data.
Illustratively, the target virtual object 206 is driven based on the lip image sequence 205 so that the target virtual object 206 presents a lip action matched with the voice data, for example to broadcast the voice data through the target virtual object 206. This effectively improves the coherence of the driven virtual object's lip movements, improves the stability and matching of lip control in avatar driving, and thus effectively improves the avatar driving effect.
Fig. 3 schematically illustrates a flow chart of a training method of a deep learning network according to an embodiment of the present disclosure.
As shown in FIG. 3, a training method 300 of a deep learning network according to an embodiment of the present disclosure may include operations S310 to S340.
In operation S310, feature extraction is performed on the sample voice segment to obtain a sample voice feature, and feature extraction is performed on the sample face image having a time-series association relationship with the sample voice segment to obtain a sample face feature.
In operation S320, a time sequence enhancement process is performed on the sample speech feature to obtain a training speech feature.
In operation S330, a lip image sequence matched with the sample voice segment is generated according to the training voice feature and the sample facial feature through the deep learning network to be trained.
In operation S340, model parameters of the deep learning network are adjusted according to timing synchronization between the lip-shaped image sequence and the sample voice segment, so as to obtain a trained target deep learning network.
The following illustrates exemplary operation flows of the training method of the deep learning network according to the present embodiment.
Illustratively, in response to the acquired training sample video, the training sample video is segmented to obtain a sample voice segment and a sample face image having a time-series association relationship with the sample voice segment. When there is more than one sample voice segment, the duration of each sample voice segment may be determined according to the number of frames per second of the training sample video.
The sample voice segment may be framed to obtain sample voice segment frames. A short-time Fourier transform is applied to the sample voice segment frames to convert the sample voice into a speech spectrogram. The speech spectrogram corresponding to each sample voice segment frame may be filtered with a Mel filter bank to obtain preprocessed sample voice segment frames. The preprocessed sample voice segment frames may then be convolved by the deep learning network to be trained to obtain the sample voice features.
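A sketch of this preprocessing chain (framing, short-time Fourier transform, Mel filtering) is shown below using librosa. The sample rate, window length, hop length, and number of Mel bands are common defaults chosen for illustration, not parameters specified by the disclosure.

```python
import librosa
import numpy as np

def speech_to_mel(wav: np.ndarray, sr: int = 16000, n_mels: int = 80) -> np.ndarray:
    """Frame the waveform, apply an STFT and a Mel filter bank, and return
    one spectral column per speech frame (shape: num_frames x n_mels)."""
    mel = librosa.feature.melspectrogram(
        y=wav,
        sr=sr,
        n_fft=int(0.025 * sr),      # 25 ms analysis window (assumed)
        hop_length=int(0.010 * sr), # 10 ms hop (assumed)
        n_mels=n_mels,
    )
    return librosa.power_to_db(mel).T  # log-Mel features, time-major
```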
A face alignment model may be called to perform face alignment on the sample face image to obtain a target key point set in the sample face image. The target key point set may include a plurality of target key points and annotation information of each target key point. Facial feature regions in the sample face image may be determined from the target key point set and may include, for example, eyebrow regions, eye regions, a nose region, a lip region, and ear regions. The object face region in the sample face image may be identified based on the facial feature regions. The deep learning network to be trained may then perform convolution processing on the object face region in the sample face image to obtain the sample face features.
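The following sketch illustrates one way the object face region could be cropped from the keypoints returned by a face alignment model. The bounding-box-plus-margin strategy and the margin value are assumptions for illustration; the disclosure does not fix how the face region is derived from the target key point set.

```python
import numpy as np

def crop_face_region(image: np.ndarray, keypoints: np.ndarray,
                     margin: float = 0.1) -> np.ndarray:
    """Crop the object face region as the bounding box of the aligned keypoints.

    image: H x W x C face image frame.
    keypoints: (N, 2) array of (x, y) landmark coordinates from any face
               alignment model (assumed input, not a specific library's output).
    """
    x_min, y_min = keypoints.min(axis=0)
    x_max, y_max = keypoints.max(axis=0)
    dx, dy = (x_max - x_min) * margin, (y_max - y_min) * margin
    h, w = image.shape[:2]
    x0, y0 = max(int(x_min - dx), 0), max(int(y_min - dy), 0)
    x1, y1 = min(int(x_max + dx), w), min(int(y_max + dy), h)
    return image[y0:y1, x0:x1]
```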
Time sequence enhancement processing may be performed on the sample voice features to obtain the training voice features. The sample voice features include a plurality of sample voice sub-features having a time-series association relationship; for a target sample voice sub-feature among the plurality of sample voice sub-features, the target sample voice sub-feature may be corrected using the adjacent sample voice sub-features matched with it. The corrected target sample voice sub-features may form the training voice features, which have more pronounced temporal continuity, so that the lip image sequence determined based on the training voice features is more coherent and more stable.
Lip-shaped image sequences matched with the sample voice segments can be generated according to the training voice features and the sample facial features through the deep learning network to be trained. Timing synchronization evaluation values between the lip-shaped image sequence and the sample voice segments can be determined, and model parameters of the deep learning network are adjusted according to the timing synchronization evaluation values to obtain the trained target deep learning network.
According to the embodiments of the present disclosure, time sequence enhancement processing is performed on the sample voice features to obtain training voice features, a lip image sequence matched with the sample voice segment is generated through the deep learning network to be trained according to the training voice features and the sample face features, and the model parameters of the deep learning network are adjusted according to the timing synchronization between the lip image sequence and the sample voice segment to obtain a trained target deep learning network. A deep learning network with lip-driving knowledge can thus be trained, the stability and coherence of lip control in avatar driving can be improved, and the avatar driving effect can be effectively improved.
Fig. 4 schematically shows a flowchart of a training method of a deep learning network according to another embodiment of the present disclosure.
As shown in FIG. 4, the method 400 may include operations S410 to S440, for example.
In operation S410, feature extraction is performed on the sample voice segment to obtain a sample voice feature, and feature extraction is performed on a sample face image having a time-series association relationship with the sample voice segment to obtain a sample face feature.
In operation S420, a timing enhancement process is performed on the sample voice feature to obtain a training voice feature, and a feature smoothing process is performed on the sample facial feature to obtain a training facial feature.
In operation S430, the training voice features and the training facial features are used as input data of the deep learning network to be trained, and a lip image sequence matched with the sample voice segment is obtained.
In operation S440, model parameters of the deep learning network are adjusted according to timing synchronization between the lip-shaped image sequence and the sample voice segment, so as to obtain a trained target deep learning network.
An example flow of each operation of the training method of the deep learning network of the present embodiment is illustrated below.
Illustratively, the sample voice features may be subjected to time sequence enhancement processing to obtain the training voice features. In one example, the sample voice features include a plurality of sample voice sub-features having a time-series association relationship; for a target sample voice sub-feature among the plurality of sample voice sub-features, the adjacent sample voice sub-features matched with the target sample voice sub-feature may be determined. The target sample voice sub-feature and the adjacent sample voice sub-features are fused to obtain a fused sample voice sub-feature, and the training voice features are obtained from the fused sample voice sub-features matched with the respective sample voice sub-features among the plurality of sample voice sub-features. The target sample voice sub-feature may be any sample voice sub-feature of the plurality of sample voice sub-features.
By performing time sequence enhancement on the sample voice features, unstable feature information in the sample voice features can be effectively corrected; by learning attention information over the time sequence, the temporal continuity of the sample voice features can be effectively improved, which helps the lip image sequence generated by the deep learning network to be more coherent and more stable.
For example, the target sample voice sub-feature and the adjacent sample voice sub-features may be fused by weighted summation using a self-attention network to obtain the fused sample voice sub-feature. The fused sample voice sub-features matched with the respective sample voice sub-features of the plurality of sample voice sub-features may constitute the training voice features.
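A minimal PyTorch sketch of such a self-attention fusion over the time axis is given below. The feature dimension, the number of heads, and the use of nn.MultiheadAttention are illustrative assumptions, not the network actually used in the disclosure.

```python
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Weighted summation of each sample speech sub-feature with the other
    frames via self-attention (a sketch under assumed layer sizes)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, dim) sample speech sub-features in time order
        fused, _ = self.attn(feats, feats, feats)  # each frame attends to the sequence
        return fused

# Usage sketch:
# x = torch.randn(2, 100, 256)
# y = TemporalSelfAttention()(x)   # fused features, shape (2, 100, 256)
```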
In another example, a Connectionist Temporal Classification (CTC) model may be used to determine the key sample voice sub-features among the plurality of sample voice sub-features. The CTC model may determine feature peak positions among the plurality of sample voice sub-features, and these peak positions may indicate the key sample voice sub-features. The adjacent sample voice sub-features matched with a key sample voice sub-feature are determined, and the key sample voice sub-feature and the adjacent sample voice sub-features are fused to obtain a fused sample voice sub-feature. The fused sample voice sub-features matched with the key sample voice sub-features may form part of the feature content of the training voice features.
In one example, the lip region of the sample face image may be removed to obtain a lip-removed face image. The sample voice segment and the lip-removed face image may be used as input data of the generator of the deep learning network, and the generator generates a lip image sequence matched with the sample voice segment so as to complete the lip region in the sample face image according to the sample voice segment. For example, the sample face image may include a plurality of sample face image frames, and the lip region in each sample face image frame may be set to 0 to remove the lip region from each sample face image frame.
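As an illustration of the lip-region removal, the sketch below zeroes out part of each face image frame. Treating the lower half of an aligned, cropped face as the lip region is an assumption made for the example; the disclosure only states that the lip region is set to 0.

```python
import numpy as np

def erase_lip_region(frame: np.ndarray) -> np.ndarray:
    """Set the lip region of a face image frame to 0 so the generator must
    complete it from the speech segment.

    frame: H x W x C aligned face crop; the lower half is used as the
    (assumed) lip region.
    """
    out = frame.copy()
    h = frame.shape[0]
    out[h // 2:, :, :] = 0
    return out
```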
The sample facial features include a plurality of sample facial sub-features having a time-series association relationship, and each sample facial sub-feature may correspond to a single sample facial image frame or a plurality of sample facial image frames, which is not limited in this embodiment.
Illustratively, feature smoothing processing may be performed on the sample facial features to obtain the training facial features. In one example, for a target sample facial sub-feature among the plurality of sample facial sub-features, the adjacent sample facial sub-features matched with the target sample facial sub-feature are determined; the target sample facial sub-feature and the adjacent sample facial sub-features are fused to obtain a fused sample facial sub-feature; and the training facial features are obtained from the fused sample facial sub-features matched with the respective sample facial sub-features among the plurality of sample facial sub-features. The target sample facial sub-feature may be any sample facial sub-feature of the plurality of sample facial sub-features.
For example, based on a preset smoothing parameter, moving-average processing may be applied to the target sample facial sub-feature and the adjacent sample facial sub-features to obtain the fused sample facial sub-feature. The fused sample facial sub-features matched with the respective sample face image frames may constitute the training facial features.
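The sketch below shows one possible moving-average smoothing of the per-frame facial sub-features, implemented as an exponential moving average; both this particular form and the value of the smoothing parameter are assumptions for illustration.

```python
import numpy as np

def smooth_face_features(face_feats: np.ndarray, alpha: float = 0.8) -> np.ndarray:
    """Feature smoothing over per-frame facial sub-features.

    face_feats: (T, D) array, one facial sub-feature per frame.
    alpha: preset smoothing parameter (illustrative value).
    """
    face_feats = np.asarray(face_feats, dtype=float)
    smoothed = np.empty_like(face_feats)
    smoothed[0] = face_feats[0]
    for t in range(1, len(face_feats)):
        # blend the current frame with the smoothed history of its neighbours
        smoothed[t] = alpha * smoothed[t - 1] + (1.0 - alpha) * face_feats[t]
    return smoothed
```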
The training voice features and the training facial features may be used as input data of the deep learning network to be trained to obtain the lip image sequence matched with the sample voice segment. For example, the generator of the deep learning network may determine a compression relationship matrix based on the training voice features and the training facial features, and generate the lip image sequence matched with the sample voice segment from the compression relationship matrix.
Determining the compression relationship matrix based on the training voice features and the training facial features effectively ensures timing synchronization between the generated lip image sequence and the sample voice segment, which helps improve the stability and coherence of the virtual object's lip driving.
Illustratively, the training voice features may be converted into a voice feature matrix. An encoding convolutional network in the generator learns the association between the training voice features and the training facial features to obtain the compression relationship matrix. A decoding convolutional network in the generator then decodes the compression relationship matrix to restore the voice feature matrix. Based on the learned association between the training voice features and the training facial features, the generator may generate the lip image sequence matched with the sample voice segment.
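A schematic encoder-decoder generator along these lines is sketched below in PyTorch. The channel sizes, the broadcast-and-concatenate fusion of speech and face features, and the output activation are illustrative assumptions; the disclosure only specifies an encoding convolutional network that produces the compression relationship matrix and a decoding convolutional network that restores it.

```python
import torch
import torch.nn as nn

class LipGenerator(nn.Module):
    """Encoder-decoder generator sketch under assumed shapes: speech feature
    of size speech_dim per step, face feature map of face_ch channels, with
    spatial size divisible by 4."""

    def __init__(self, speech_dim: int = 256, face_ch: int = 64, img_ch: int = 3):
        super().__init__()
        self.encode = nn.Sequential(            # -> compression relationship matrix
            nn.Conv2d(face_ch + speech_dim, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decode = nn.Sequential(            # restore spatial resolution
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, img_ch, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, speech_feat: torch.Tensor, face_feat: torch.Tensor) -> torch.Tensor:
        # speech_feat: (B, speech_dim); face_feat: (B, face_ch, H, W)
        b, _, h, w = face_feat.shape
        speech_map = speech_feat[:, :, None, None].expand(b, -1, h, w)
        z = self.encode(torch.cat([face_feat, speech_map], dim=1))
        return self.decode(z)                   # one generated lip image per step

# Usage sketch:
# g = LipGenerator()
# img = g(torch.randn(2, 256), torch.randn(2, 64, 96, 96))  # (2, 3, 96, 96)
```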
The deep learning network to be trained includes a discrimination network, which may include a lip synchronization discrimination network and an image quality discrimination network. The lip synchronization discrimination network judges the timing synchronization between the lip image sequence and the sample voice segment, and the image quality discrimination network judges the authenticity and quality of the generated lip image sequence.
The model parameters of the deep learning network may be adjusted according to the timing synchronization between the lip image sequence and the sample voice segment to obtain the trained target deep learning network. For example, feature extraction may be performed on the lip image sequence to obtain lip image features, a timing synchronization evaluation value between the lip image features and the sample voice features may be determined, and the model parameters of the deep learning network may be adjusted based on the timing synchronization evaluation value to obtain the trained target deep learning network.
The trained target deep learning network can generate lip image sequences with high robustness and a good smoothing effect, which reduces lip-driving distortion of the driven virtual object and noticeably improves the virtual object driving effect.
For example, the cosine distance between the lip image features and the sample voice features may be determined through the lip synchronization discrimination network, and the timing synchronization evaluation value may be obtained based on this cosine distance. When the cosine distance between the lip image features and the sample voice features is close to 1, the lip image sequence and the sample voice segment are determined to have good timing synchronization; when it is close to 0, the timing synchronization between the lip image sequence and the sample voice segment is determined to be poor.
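A minimal sketch of this evaluation is given below: it computes the cosine similarity between lip image features and sample voice features (the quantity the disclosure calls the cosine distance, with values near 1 indicating good synchronization). Both inputs are assumed to be (batch, dim) embeddings produced by the lip synchronization discrimination network.

```python
import torch
import torch.nn.functional as F

def sync_score(lip_feat: torch.Tensor, speech_feat: torch.Tensor) -> torch.Tensor:
    """Timing synchronization evaluation value per sample: near 1 means the
    lip image features and speech features are well synchronized, near 0
    means poorly synchronized."""
    return F.cosine_similarity(lip_feat, speech_feat, dim=-1)
```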
For example, a target loss function value may be determined according to the timing synchronization evaluation value and at least one of the following indicators, and the model parameters of the deep learning network may be adjusted according to the target loss function value to obtain the trained target deep learning network. The indicators may include, for example: the pixel distance between the lip region of the lip image sequence and that of the sample face image, the optical flow field variation value between adjacent lip images in the lip image sequence, and the feature loss value between the lip image features and the lip sub-features in the sample facial features.
For example, the target loss function may be constructed by the following formula:
Loss = loss_l1 + loss_feat * wt_feat + loss_sync * wt_sync + loss_flow * wt_flow,
where loss_l1 represents the pixel distance between the lip region of the sample face image and the lip image sequence, and may reflect the reconstruction loss of the lip image sequence; loss_feat represents the feature loss value between the lip image features and the lip sub-features in the sample facial features; loss_sync represents the timing synchronization evaluation value between the lip image features and the sample voice features; loss_flow represents the optical flow field variation value between adjacent lip images in the lip image sequence, which may indicate the coherence between adjacent lip images; and each wt term represents the weight of the corresponding loss sub-function.
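The formula above can be combined directly as in the sketch below; the default weights of 1.0 are placeholders, since the disclosure leaves the weight values of the loss sub-functions unspecified.

```python
def total_loss(loss_l1, loss_feat, loss_sync, loss_flow,
               wt_feat: float = 1.0, wt_sync: float = 1.0, wt_flow: float = 1.0):
    """Combine the loss terms exactly as in the formula above; weight values
    are illustrative defaults, not values given by the disclosure."""
    return loss_l1 + loss_feat * wt_feat + loss_sync * wt_sync + loss_flow * wt_flow
```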
By performing time sequence enhancement processing on the sample voice features, the temporal consistency of the sample voice features can be effectively improved, which helps the deep learning network regress more coherent lips. By performing feature smoothing processing on the sample facial features, the interference of unstable features with the training of the deep learning network can be effectively reduced. This helps improve the stability and coherence of lip driving in virtual object driving, and can effectively alleviate the facial jitter and lip discontinuity of virtual objects in the related art.
Fig. 5 schematically shows a schematic diagram of a training process of a deep learning network according to an embodiment of the present disclosure.
As shown in fig. 5, feature extraction is performed on a sample speech segment 501 to obtain a sample speech feature 502. The sample speech features 502 are subjected to timing enhancement processing to obtain training speech features 503.
The sample face image 504 having a time-series correlation with the sample voice fragment 501 is subjected to feature extraction, and a sample face feature 505 is obtained. The sample facial features 505 are subjected to feature smoothing processing to obtain training facial features 506.
Using the training voice features 503 and the training facial features 506 as input data of the deep learning network 507 to be trained, a lip image sequence 508 matched with the sample voice segment 501 is obtained. The model parameters of the deep learning network 507 are adjusted according to the timing synchronization between the lip image sequence 508 and the sample voice segment 501 to obtain the trained target deep learning network.
Fig. 6 schematically illustrates a block diagram of a virtual object driving apparatus according to an embodiment of the present disclosure.
As shown in fig. 6, the virtual object driving apparatus 600 of the embodiment of the present disclosure includes, for example, an initial voice feature determination module 610, a target voice feature determination module 620, a lip image sequence first determination module 630, and a driving module 640.
An initial speech feature determination module 610, configured to determine an initial speech feature based on the speech data in response to the acquired speech data.
And the target voice characteristic determining module 620 is configured to perform time sequence enhancement processing on the initial voice characteristic to obtain a target voice characteristic.
A lip image sequence first determining module 630, configured to generate a lip image sequence for the target virtual object based on the target voice feature and the reference face image of the target virtual object.
And the driving module 640 is used for driving the target virtual object according to the lip image sequence so that the target virtual object performs lip action matched with the voice data.
According to an embodiment of the present disclosure, the initial voice features include a plurality of initial voice sub-features having a time-series association relationship, and the target voice feature determination module includes: an adjacent voice sub-feature determination sub-module, a fused voice sub-feature determination sub-module, and a target voice feature determination sub-module.
And the adjacent voice sub-feature determining sub-module is used for determining the adjacent voice sub-features matched with the target voice sub-features aiming at the target voice sub-features in the plurality of initial voice sub-features.
And the fused voice sub-feature determining sub-module is used for fusing the target voice sub-feature and the adjacent voice sub-features to obtain a fused voice sub-feature.
And the target voice characteristic determining submodule is used for obtaining the target voice characteristic according to the fusion voice sub-characteristic matched with each initial voice sub-characteristic in the plurality of initial voice sub-characteristics.
The target speech sub-feature comprises any initial speech sub-feature of the plurality of initial speech sub-features.
According to an embodiment of the present disclosure, the lip image sequence first determination module includes a lip image sequence first determination sub-module, which is configured to use the target voice features and the reference face image as input data of the trained target deep learning network to obtain the lip image sequence for the target virtual object. The target deep learning network is trained based on a sample voice segment and a sample face image having a time-series association relationship with the sample voice segment: time sequence enhancement processing is performed on the sample voice features of the sample voice segment to obtain training voice features, and deep learning network training is performed according to the training voice features and the sample face features of the sample face image to obtain the target deep learning network.
Fig. 7 schematically illustrates a block diagram of a training apparatus of a deep learning network according to an embodiment of the present disclosure.
As shown in fig. 7, the training apparatus 700 of the deep learning network of the embodiment of the present disclosure includes, for example, a sample feature determination module 710, a training speech feature determination module 720, a lip image sequence second determination module 730, and a target deep learning network determination module 740.
The sample feature determining module 710 is configured to perform feature extraction on the sample voice segment to obtain a sample voice feature, and perform feature extraction on a sample face image having a time sequence association relationship with the sample voice segment to obtain a sample face feature.
And the training speech feature determining module 720 is configured to perform timing sequence enhancement processing on the sample speech features to obtain training speech features.
And a lip shape image sequence second determining module 730, configured to generate a lip shape image sequence matched with the sample voice segment according to the training voice feature and the sample facial feature through the deep learning network to be trained.
And the target deep learning network determining module 740 is configured to adjust model parameters of the deep learning network according to timing synchronization between the lip-shaped image sequence and the sample voice segment, so as to obtain a trained target deep learning network.
According to the embodiments of the present disclosure, time sequence enhancement processing is performed on the sample voice features to obtain training voice features, a lip image sequence matched with the sample voice segment is generated through the deep learning network to be trained according to the training voice features and the sample face features, and the model parameters of the deep learning network are adjusted according to the timing synchronization between the lip image sequence and the sample voice segment to obtain a trained target deep learning network. A deep learning network with lip-driving knowledge can thus be trained, the stability and coherence of lip control in avatar driving can be improved, and the avatar driving effect can be effectively improved.
According to an embodiment of the present disclosure, a sample speech feature includes a plurality of sample speech sub-features having a time-series association relationship; the training speech feature determination module comprises: an adjacent sample voice sub-feature determination sub-module, a fusion sample voice sub-feature determination sub-module and a training voice feature determination sub-module.
And the adjacent sample voice sub-feature determining sub-module is used for determining adjacent sample voice sub-features matched with the target sample voice sub-features aiming at the target sample voice sub-features in the plurality of sample voice sub-features.
And the fusion sample voice sub-feature determining sub-module is used for fusing the target sample voice sub-feature and the adjacent sample voice sub-feature to obtain a fusion sample voice sub-feature.
And the training voice characteristic determining sub-module is used for obtaining the training voice characteristics according to the fusion sample voice sub-characteristics matched with the sample voice sub-characteristics in the plurality of sample voice sub-characteristics.
The target sample speech sub-feature comprises any sample speech sub-feature of a plurality of sample speech sub-features.
According to an embodiment of the present disclosure, the lip image sequence second determination module includes: a training facial feature determination submodule and a lip image sequence second determination submodule.
And the training face feature determining submodule is used for carrying out feature smoothing processing on the sample face features to obtain training face features.
And the lip image sequence second determining submodule is used for taking the training voice features and the training face features as input data of the deep learning network to be trained to obtain the lip image sequence matched with the sample voice segment.
According to an embodiment of the present disclosure, the sample facial features include a plurality of sample facial sub-features having a time-series association relationship; the training facial feature determination sub-module includes: an adjacent sample face sub-feature determination unit, a fused sample face sub-feature determination unit, and a training face feature determination unit.
And an adjacent sample face sub-feature determination unit configured to determine, for a target sample face sub-feature of the plurality of sample face sub-features, an adjacent sample face sub-feature that matches the target sample face sub-feature.
And the fusion sample face sub-feature determining unit is used for fusing the target sample face sub-feature and the adjacent sample face sub-feature to obtain a fusion sample face sub-feature.
And the training face characteristic determining unit is used for obtaining training face characteristics according to the fused sample face sub-characteristics matched with the sample face sub-characteristics in the plurality of sample face sub-characteristics.
The target sample facial sub-feature includes any sample facial sub-feature of a plurality of sample facial sub-features.
According to an embodiment of the present disclosure, the lip image sequence second determination submodule includes: a compression relationship matrix determination unit and a lip-shaped image sequence determination unit.
And the compression relation matrix determining unit is used for determining a compression relation matrix based on the training voice characteristics and the training facial characteristics by using a generator of the deep learning network.
And the lip shape image sequence determining unit is used for generating a lip shape image sequence matched with the sample voice segment according to the compression relation matrix.
According to an embodiment of the present disclosure, the target deep learning network determination module includes: the device comprises a lip image characteristic determining submodule, a time sequence synchronization evaluation value determining submodule and a target deep learning network determining submodule.
And the lip image characteristic determining submodule is used for extracting the characteristics of the lip image sequence to obtain the lip image characteristics.
And the timing synchronization evaluation value determination submodule is used for determining a timing synchronization evaluation value between the lip shape image characteristic and the sample voice characteristic.
And the target deep learning network determining submodule is used for adjusting the model parameters of the deep learning network based on the time sequence synchronous evaluation value to obtain the trained target deep learning network.
According to the embodiment of the disclosure, the target deep learning network determining submodule includes: a target loss function value determining unit and a target deep learning network determining unit.
And a target loss function value determination unit for determining a target loss function value based on the timing synchronization evaluation value and at least one of the following multiple indices.
And the target deep learning network determining unit is used for adjusting the model parameters of the deep learning network according to the target loss function value to obtain the trained target deep learning network.
The multiple indexes include: a pixel distance between the lip image sequence and a lip region of the sample face image; an optical flow field variation value between adjacent lip images in the lip image sequence; and a feature loss value between the lip image feature and a lip sub-feature in the sample facial feature.
It should be noted that, in the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the information involved all comply with the relevant laws and regulations and do not violate public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 8 schematically illustrates a block diagram of an electronic device for performing deep learning network training in accordance with an embodiment of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. The electronic device 800 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807, such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809, such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 executes the various methods and processes described above, such as the training method of the deep learning network and the virtual object driving method. For example, in some embodiments, the training method of the deep learning network may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the training method of the deep learning network described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the training method of the deep learning network and/or the virtual object driving method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (22)

1. A virtual object driving method, comprising:
in response to the acquired voice data, determining an initial voice feature based on the voice data;
performing time sequence enhancement processing on the initial voice features to obtain target voice features;
generating a lip shape image sequence for a target virtual object based on the target voice feature and a reference face image of the target virtual object; and
driving the target virtual object according to the lip image sequence to cause the target virtual object to perform a lip action matching the voice data.
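For readers who prefer code to claim language, the order of operations in claim 1 can be illustrated by the following toy Python sketch. Every helper here (the random "speech features", the moving-average "enhancement", the dummy generator) is a hypothetical stand-in chosen only so the sketch runs; none of these names comes from the disclosure.

import numpy as np

def extract_initial_speech_features(voice_data, frame_dim=80):
    # Stand-in for a real acoustic front end (e.g. mel-spectrogram features).
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(voice_data) // 160, frame_dim))

def temporal_enhance(features, window=3):
    # Fuse each sub-feature with its neighbours via a simple moving average.
    kernel = np.ones(window) / window
    return np.apply_along_axis(lambda col: np.convolve(col, kernel, mode="same"), 0, features)

def generate_lip_sequence(target_features, reference_face):
    # Dummy generator: one "lip image" per enhanced speech sub-feature.
    return [reference_face.copy() for _ in target_features]

voice_data = np.zeros(16000)               # one second of placeholder audio
reference_face = np.zeros((96, 96, 3))     # placeholder reference face image
initial = extract_initial_speech_features(voice_data)   # initial voice feature
target = temporal_enhance(initial)                      # time sequence enhancement
lip_sequence = generate_lip_sequence(target, reference_face)
# Driving step: hand lip_sequence to a renderer so that the target virtual
# object performs lip actions matching voice_data (renderer not shown).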
2. The method of claim 1, wherein the initial speech feature comprises a plurality of initial speech sub-features having a time-series association relationship, and the performing time-series enhancement processing on the initial speech feature to obtain a target speech feature comprises:
for a target speech sub-feature of the plurality of initial speech sub-features, determining an adjacent speech sub-feature that matches the target speech sub-feature;
fusing the target voice sub-feature and the adjacent voice sub-features to obtain a fused voice sub-feature; and
obtaining the target voice feature according to the fusion voice sub-feature matched with each initial voice sub-feature in the plurality of initial voice sub-features,
wherein the target speech sub-feature comprises any initial speech sub-feature of the plurality of initial speech sub-features.
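A minimal sketch of the fusion in claim 2, assuming the fusion of a target sub-feature with its adjacent sub-features is a simple three-frame average; the claim does not fix the fusion operator, so the averaging below is only one illustrative choice.

import torch
import torch.nn.functional as F

def temporal_enhance(initial_speech_features):
    # initial_speech_features: tensor of shape (T, D), sub-features in time order.
    # Each sub-feature is fused (here: averaged) with its adjacent sub-features.
    padded = F.pad(initial_speech_features.t().unsqueeze(0), (1, 1), mode="replicate")  # (1, D, T+2)
    fused = F.avg_pool1d(padded, kernel_size=3, stride=1)                               # (1, D, T)
    return fused.squeeze(0).t()                                                         # (T, D)

target_feature = temporal_enhance(torch.rand(100, 80))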
3. The method of claim 1, wherein the generating a lip image sequence for a target virtual object based on the target voice features and a reference face image of the target virtual object comprises:
using the target voice feature and the reference facial image as input data of a trained target deep learning network to obtain the lip image sequence for the target virtual object,
wherein the target deep learning network is obtained by training based on a sample voice segment and a sample face image having a time sequence association relationship with the sample voice segment,
wherein training voice features are obtained by performing time sequence enhancement processing on sample voice features of the sample voice segment, and the target deep learning network is obtained by training the deep learning network according to the training voice features and sample facial features of the sample face image.
4. A training method of a deep learning network, comprising:
performing feature extraction on a sample voice segment to obtain a sample voice feature, and performing feature extraction on a sample facial image which has a time sequence association relationship with the sample voice segment to obtain a sample facial feature;
performing time sequence enhancement processing on the sample voice features to obtain training voice features;
generating a lip shape image sequence matched with the sample voice segment according to the training voice feature and the sample facial feature through a deep learning network to be trained; and
adjusting model parameters of the deep learning network according to the time sequence synchronism between the lip shape image sequence and the sample voice segment to obtain a trained target deep learning network.
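For illustration, one gradient step consistent with the training method of claim 4 is sketched below. The tiny linear generator, the cosine-similarity stand-in for "time sequence synchronism", and all tensor shapes are assumptions made so that the sketch runs end to end; they are not the networks or measures of the disclosure.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLipGenerator(nn.Module):
    # Maps one training speech sub-feature plus a facial feature to one
    # flattened "lip image" per time step (purely illustrative).
    def __init__(self, speech_dim=80, face_dim=128, img_dim=96 * 96):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(speech_dim + face_dim, 256),
                                 nn.ReLU(),
                                 nn.Linear(256, img_dim))

    def forward(self, speech_feats, face_feat):
        face = face_feat.expand(speech_feats.size(0), -1)
        return self.net(torch.cat([speech_feats, face], dim=-1))

generator = ToyLipGenerator()
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)

training_speech = torch.rand(25, 80)        # training voice features (after enhancement)
sample_face_feat = torch.rand(1, 128)       # sample facial feature
speech_embedding = torch.rand(25, 96 * 96)  # assumed speech embedding used for the sync score

lip_frames = generator(training_speech, sample_face_feat)
# "Time sequence synchronism" stand-in: cosine similarity between generated
# lip frames and a speech embedding of matching shape.
sync = F.cosine_similarity(lip_frames, speech_embedding, dim=-1).mean()
loss = 1.0 - sync                           # adjust model parameters to raise synchronism
optimizer.zero_grad()
loss.backward()
optimizer.step()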
5. The method of claim 4, wherein the sample speech feature comprises a plurality of sample speech sub-features having a time-series associative relationship; the time sequence enhancement processing is carried out on the sample voice features to obtain training voice features, and the method comprises the following steps:
for a target sample speech sub-feature of the plurality of sample speech sub-features, determining a neighboring sample speech sub-feature that matches the target sample speech sub-feature;
fusing the target sample voice sub-feature and the adjacent sample voice sub-feature to obtain a fused sample voice sub-feature; and
obtaining the training speech features according to the fusion sample speech sub-features matched with each sample speech sub-feature in the plurality of sample speech sub-features,
wherein the target sample speech sub-feature comprises any sample speech sub-feature of the plurality of sample speech sub-features.
6. The method of claim 4, wherein the generating, by the deep learning network to be trained, a lip image sequence matching the sample voice segment from the training voice features and the sample facial features comprises:
carrying out feature smoothing processing on the sample facial features to obtain training facial features; and
taking the training voice features and the training facial features as input data of the deep learning network to be trained to obtain the lip shape image sequence matched with the sample voice segment.
7. The method of claim 6, wherein the sample facial features comprise a plurality of sample facial sub-features having a time-series associative relationship; the performing feature smoothing processing on the sample facial features to obtain training facial features includes:
determining, for a target sample face sub-feature of the plurality of sample face sub-features, a neighboring sample face sub-feature that matches the target sample face sub-feature;
fusing the target sample face sub-feature and the adjacent sample face sub-feature to obtain a fused sample face sub-feature; and
obtaining the training facial features from the fused sample facial sub-features that match each of the plurality of sample facial sub-features,
wherein the target sample facial sub-feature comprises any of the plurality of sample facial sub-features.
8. The method of claim 6, wherein the using the training speech features and the training facial features as input data of the deep learning network to be trained to obtain the lip image sequence matching the sample speech segment comprises:
determining, with a generator of the deep learning network, a compressed relationship matrix based on the training speech features and the training facial features; and
generating the lip shape image sequence matched with the sample voice segment according to the compression relation matrix.
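The claim leaves open how the compression relation matrix is obtained. Purely as an assumed reading, the sketch below treats it as a compact per-frame joint projection of the training voice features and the training facial features, from which lip frames are then decoded; the class name and dimensions are hypothetical.

import torch
import torch.nn as nn

class ToyCompressionGenerator(nn.Module):
    # Illustrative only: projects concatenated speech/facial features into a
    # compact "compression relation matrix" of shape (T, K) and decodes lip frames.
    def __init__(self, speech_dim=80, face_dim=128, k=32, img_dim=96 * 96):
        super().__init__()
        self.compress = nn.Linear(speech_dim + face_dim, k)
        self.decode = nn.Linear(k, img_dim)

    def forward(self, speech_feats, face_feat):
        face = face_feat.expand(speech_feats.size(0), -1)
        relation = self.compress(torch.cat([speech_feats, face], dim=-1))  # (T, K)
        return self.decode(relation), relation

lip_frames, relation_matrix = ToyCompressionGenerator()(torch.rand(25, 80), torch.rand(1, 128))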
9. The method according to any one of claims 4 to 8, wherein the adjusting model parameters of the deep learning network according to the timing synchronization between the lip image sequence and the sample voice segment to obtain a trained target deep learning network comprises:
extracting the features of the lip-shaped image sequence to obtain lip-shaped image features;
determining a timing synchronization evaluation value between the lip image feature and the sample voice feature; and
adjusting model parameters of the deep learning network based on the timing synchronization evaluation value to obtain the trained target deep learning network.
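One common way to realize such a timing synchronization evaluation value, used here only as an assumed illustration (in the spirit of SyncNet-style scoring), is a cosine similarity between per-frame lip image embeddings and the corresponding speech feature embeddings.

import torch
import torch.nn.functional as F

def timing_sync_score(lip_image_feats, sample_speech_feats):
    # Both inputs: (T, D) embeddings aligned frame by frame (an assumption).
    # Returns a scalar in [-1, 1]; higher means better audio-visual synchrony.
    return F.cosine_similarity(lip_image_feats, sample_speech_feats, dim=-1).mean()

score = timing_sync_score(torch.rand(25, 256), torch.rand(25, 256))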
10. The method of claim 9, wherein the adjusting model parameters of the deep learning network based on the timing synchronization assessment value to obtain the trained target deep learning network comprises:
determining a target loss function value according to the timing synchronization evaluation value and at least one of the following multiple indexes; and
adjusting model parameters of the deep learning network according to the target loss function value to obtain the trained target deep learning network,
wherein the multiple indexes comprise:
a pixel distance between the lip image sequence and a lip region of the sample face image;
an optical flow field variation value between adjacent lip images in the lip image sequence; and
a feature loss value between the lip image feature and a lip sub-feature in the sample facial feature.
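Of the three indexes, the optical flow field variation value is the least self-describing. The sketch below shows one assumed way to compute it with OpenCV's dense Farneback optical flow, averaging the flow magnitude between adjacent generated lip frames; the choice of cv2.calcOpticalFlowFarneback and of mean magnitude as the summary statistic is illustrative, not required by the claim.

import cv2
import numpy as np

def optical_flow_variation(lip_images):
    # lip_images: list of H x W uint8 grayscale lip frames in time order.
    # Returns the mean dense-flow magnitude between adjacent frames.
    magnitudes = []
    for prev_img, next_img in zip(lip_images[:-1], lip_images[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev_img, next_img, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitudes.append(np.linalg.norm(flow, axis=-1).mean())
    return float(np.mean(magnitudes))

frames = [np.random.randint(0, 255, (96, 96), dtype=np.uint8) for _ in range(5)]
variation = optical_flow_variation(frames)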
11. A virtual object driving apparatus, comprising:
an initial voice feature determination module, configured to determine, in response to the acquired voice data, an initial voice feature based on the voice data;
a target voice feature determining module, used for performing time sequence enhancement processing on the initial voice feature to obtain a target voice feature;
a lip image sequence first determination module for generating a lip image sequence for a target virtual object based on the target voice feature and a reference face image of the target virtual object; and
a driving module, used for driving the target virtual object according to the lip image sequence so as to enable the target virtual object to execute a lip action matched with the voice data.
12. The apparatus of claim 11, wherein the initial speech feature comprises a plurality of initial speech sub-features having a time-series associative relationship, the target speech feature determination module comprising:
the adjacent voice sub-feature determining sub-module is used for determining an adjacent voice sub-feature matched with the target voice sub-feature aiming at the target voice sub-feature in the plurality of initial voice sub-features;
a fused voice sub-feature determining sub-module, configured to fuse the target voice sub-feature and the adjacent voice sub-features to obtain a fused voice sub-feature; and
a target voice feature determination sub-module, configured to obtain the target voice feature according to the fusion voice sub-feature matched with each initial voice sub-feature in the multiple initial voice sub-features,
wherein the target speech sub-feature comprises any initial speech sub-feature of the plurality of initial speech sub-features.
13. The apparatus of claim 11, wherein the lip image sequence first determination module comprises:
a lip image sequence first determination submodule for obtaining the lip image sequence for the target virtual object using the target speech feature and the reference face image as input data of a trained target deep learning network,
wherein the target deep learning network is obtained by training based on a sample voice segment and a sample face image having a time sequence association relationship with the sample voice segment,
wherein training voice features are obtained by performing time sequence enhancement processing on sample voice features of the sample voice segment, and the target deep learning network is obtained by training the deep learning network according to the training voice features and sample facial features of the sample face image.
14. A training apparatus for a deep learning network, comprising:
a sample feature determination module, used for extracting features of a sample voice segment to obtain sample voice features, and extracting features of a sample facial image having a time sequence association relationship with the sample voice segment to obtain sample facial features;
a training voice feature determining module, used for performing time sequence enhancement processing on the sample voice features to obtain training voice features;
a lip shape image sequence second determining module, used for generating a lip shape image sequence matched with the sample voice segment according to the training voice features and the sample facial features through a deep learning network to be trained; and
a target deep learning network determining module, used for adjusting the model parameters of the deep learning network according to the time sequence synchronism between the lip shape image sequence and the sample voice segment to obtain the trained target deep learning network.
15. The apparatus of claim 14, wherein the sample speech feature comprises a plurality of sample speech sub-features having a time-series associative relationship; the training speech feature determination module comprises:
the adjacent sample voice sub-feature determining sub-module is used for determining an adjacent sample voice sub-feature matched with the target sample voice sub-feature aiming at the target sample voice sub-feature in the plurality of sample voice sub-features;
a fusion sample voice sub-feature determining sub-module, configured to fuse the target sample voice sub-feature and the adjacent sample voice sub-feature to obtain a fusion sample voice sub-feature; and
a training speech feature determination sub-module for obtaining the training speech features according to the fusion sample speech sub-features matched with each sample speech sub-feature of the plurality of sample speech sub-features,
wherein the target sample speech sub-feature comprises any sample speech sub-feature of the plurality of sample speech sub-features.
16. The apparatus of claim 14, wherein the lip image sequence second determination module comprises:
the training face feature determining submodule is used for carrying out feature smoothing processing on the sample face features to obtain training face features; and
a lip image sequence second determining submodule, used for taking the training voice features and the training facial features as input data of the deep learning network to be trained to obtain the lip image sequence matched with the sample voice segment.
17. The apparatus of claim 16, wherein the sample facial features comprise a plurality of sample facial sub-features having a time-series associative relationship; the training facial feature determination sub-module includes:
a neighboring sample face sub-feature determination unit configured to determine, for a target sample face sub-feature of the plurality of sample face sub-features, a neighboring sample face sub-feature that matches the target sample face sub-feature;
a fused sample face sub-feature determining unit, configured to fuse the target sample face sub-feature and the adjacent sample face sub-feature to obtain a fused sample face sub-feature; and
a training face feature determination unit for obtaining the training face features from the fused sample face sub-features matched with each of the plurality of sample face sub-features,
wherein the target sample facial sub-feature comprises any of the plurality of sample facial sub-features.
18. The apparatus of claim 16, wherein the lip image sequence second determination submodule comprises:
a compression relation matrix determining unit, used for determining a compression relation matrix based on the training voice features and the training facial features by using a generator of the deep learning network; and
a lip image sequence determining unit, used for generating the lip image sequence matched with the sample voice segment according to the compression relation matrix.
19. The apparatus of any of claims 14 to 18, wherein the target deep learning network determination module comprises:
a lip image feature determining submodule, used for extracting features of the lip image sequence to obtain lip image features;
a timing synchronization evaluation value determination sub-module for determining a timing synchronization evaluation value between the lip image feature and the sample voice feature; and
a target deep learning network determining submodule, used for adjusting the model parameters of the deep learning network based on the timing synchronization evaluation value to obtain the trained target deep learning network.
20. The apparatus of claim 19, wherein the target deep learning network determination submodule comprises:
a target loss function value determination unit, configured to determine a target loss function value based on the timing synchronization evaluation value and at least one of the following multiple indexes; and
a target deep learning network determining unit, configured to adjust model parameters of the deep learning network according to the target loss function value to obtain the trained target deep learning network,
wherein the multiple indexes comprise:
a pixel distance between the lip image sequence and a lip region of the sample face image;
an optical flow field variation value between adjacent lip images in the lip image sequence; and
a feature loss value between the lip image feature and a lip sub-feature in the sample facial feature.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 3 or to perform the method of any one of claims 4 to 10.
22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 3 or to perform the method of any one of claims 4 to 10.
CN202211276271.0A 2022-10-19 2022-10-19 Virtual object driving method, deep learning network training method and device Active CN115345968B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211276271.0A CN115345968B (en) 2022-10-19 2022-10-19 Virtual object driving method, deep learning network training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211276271.0A CN115345968B (en) 2022-10-19 2022-10-19 Virtual object driving method, deep learning network training method and device

Publications (2)

Publication Number Publication Date
CN115345968A true CN115345968A (en) 2022-11-15
CN115345968B CN115345968B (en) 2023-02-07

Family

ID=83957562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211276271.0A Active CN115345968B (en) 2022-10-19 2022-10-19 Virtual object driving method, deep learning network training method and device

Country Status (1)

Country Link
CN (1) CN115345968B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10521946B1 (en) * 2017-11-21 2019-12-31 Amazon Technologies, Inc. Processing speech to drive animations on avatars
CN111370020A (en) * 2020-02-04 2020-07-03 清华珠三角研究院 Method, system, device and storage medium for converting voice into lip shape
CN112102448A (en) * 2020-09-14 2020-12-18 北京百度网讯科技有限公司 Virtual object image display method and device, electronic equipment and storage medium
US20210201912A1 (en) * 2020-09-14 2021-07-01 Beijing Baidu Netcom Science And Technology Co., Ltd. Virtual Object Image Display Method and Apparatus, Electronic Device and Storage Medium
CN112562720A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Lip-synchronization video generation method, device, equipment and storage medium
CN112770062A (en) * 2020-12-22 2021-05-07 北京奇艺世纪科技有限公司 Image generation method and device
CN113971828A (en) * 2021-10-28 2022-01-25 北京百度网讯科技有限公司 Virtual object lip driving method, model training method, related device and electronic equipment
CN114333840A (en) * 2021-11-23 2022-04-12 科大讯飞股份有限公司 Voice identification method and related device, electronic equipment and storage medium
CN114999441A (en) * 2022-05-24 2022-09-02 北京百度网讯科技有限公司 Avatar generation method, apparatus, device, storage medium, and program product

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116228895A (en) * 2023-01-16 2023-06-06 北京百度网讯科技有限公司 Video generation method, deep learning model training method, device and equipment
CN116228895B (en) * 2023-01-16 2023-11-17 北京百度网讯科技有限公司 Video generation method, deep learning model training method, device and equipment
CN116013354A (en) * 2023-03-24 2023-04-25 北京百度网讯科技有限公司 Training method of deep learning model and method for controlling mouth shape change of virtual image

Also Published As

Publication number Publication date
CN115345968B (en) 2023-02-07

Similar Documents

Publication Publication Date Title
CN115345968B (en) Virtual object driving method, deep learning network training method and device
CN110971929B (en) Cloud game video processing method, electronic equipment and storage medium
US11409794B2 (en) Image deformation control method and device and hardware device
CN115376211B (en) Lip driving method, lip driving model training method, device and equipment
EP3410302A1 (en) Graphic instruction data processing method, apparatus and system
CN113365146B (en) Method, apparatus, device, medium and article of manufacture for processing video
KR20220126264A (en) Video jitter detection method and device, electronic equipment and storage medium
CN113365110A (en) Model training method, video frame interpolation method, device, equipment and storage medium
CN113810765B (en) Video processing method, device, equipment and medium
CN113379877B (en) Face video generation method and device, electronic equipment and storage medium
CN114187392A (en) Virtual even image generation method and device and electronic equipment
CN116030185A (en) Three-dimensional hairline generating method and model training method
CN113033373A (en) Method and related device for training face recognition model and recognizing face
CN113988294A (en) Method for training prediction network, image processing method and device
CN113903071A (en) Face recognition method and device, electronic equipment and storage medium
CN114125498A (en) Video data processing method, device, equipment and storage medium
CN113556575A (en) Method, apparatus, device, medium and product for compressing data
CN113254712B (en) Video matching method, video processing device, electronic equipment and medium
CN113259745B (en) Video playing page processing method and device, electronic equipment and storage medium
CN116228895B (en) Video generation method, deep learning model training method, device and equipment
CN113362218B (en) Data processing method and device, electronic equipment and storage medium
CN117097955A (en) Video processing method, device, equipment and storage medium
CN113254712A (en) Video matching method, video processing device, electronic equipment and medium
CN113254703A (en) Video matching method, video processing device, electronic equipment and medium
CN113254706A (en) Video matching method, video processing device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant