CN114999440A - Avatar generation method, apparatus, device, storage medium, and program product - Google Patents

Avatar generation method, apparatus, device, storage medium, and program product

Info

Publication number
CN114999440A
CN114999440A (application CN202210572328.5A)
Authority
CN
China
Prior art keywords
voice
voice data
target
unit
facial
Prior art date
Legal status
Pending
Application number
CN202210572328.5A
Other languages
Chinese (zh)
Inventor
郭紫垣
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210572328.5A priority Critical patent/CN114999440A/en
Publication of CN114999440A publication Critical patent/CN114999440A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 21/0264 - Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the type of extracted parameters
    • G10L 25/24 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques characterised by the analysis technique, using neural networks

Abstract

The present disclosure provides an avatar generation method, apparatus, device, storage medium, and program product, relating to the technical field of artificial intelligence and, in particular, to deep learning, image processing, and computer vision. One implementation is as follows: noise audio included in initial voice data is filtered to obtain filtered first voice data; the voice unit duration of each voice unit included in the first voice data and the voice text corresponding to the first voice data are determined, where a voice unit duration represents the pronunciation duration of the corresponding voice unit; voice conversion is performed on the voice text to obtain second voice data; the voice unit duration of each corresponding voice unit in the second voice data is adjusted based on the voice unit duration of each voice unit in the first voice data to obtain target voice data; and an avatar is generated based on the target voice data.

Description

Avatar generation method, apparatus, device, storage medium, and program product
Technical Field
The present disclosure relates to the field of artificial intelligence and, in particular, to the fields of deep learning, image processing, and computer vision, and specifically to an avatar generation method, apparatus, device, storage medium, and program product.
Background
With the development of computer and internet technology, avatars can provide a variety of services for daily life, entertainment, and other domains. For example, some avatars provide audiovisual services, such as voice broadcasts, that combine a visual display with voice output. For such audiovisual services, keeping the facial expression made by the avatar synchronized with the output voice data is a problem that urgently needs to be solved.
Disclosure of Invention
The present disclosure provides an avatar generation method, apparatus, device, storage medium, and program product.
According to an aspect of the present disclosure, there is provided an avatar generation method, including: filtering noise audio included in the initial voice data to obtain filtered first voice data, wherein the initial voice data includes the noise audio; determining the voice unit duration of each voice unit included in the first voice data and the voice text corresponding to the first voice data, wherein the voice unit duration is used for representing the pronunciation duration corresponding to the voice unit; performing voice conversion on the voice text to obtain second voice data; adjusting the voice unit duration of the corresponding voice unit in the second voice data based on the voice unit duration of each voice unit in the first voice data to obtain target voice data; and generating an avatar based on the target speech data.
According to another aspect of the present disclosure, there is provided an avatar generation apparatus including: the device comprises a first voice data determining module, a voice unit duration and voice text determining module, a second voice data determining module, a target voice data determining module and an avatar generating module. The first voice data determining module is used for filtering noise audio contained in the initial voice data to obtain filtered first voice data, wherein the initial voice data comprises the noise audio; the voice unit duration and voice text determining module is used for determining the voice unit duration of each voice unit included in the first voice data and the voice text corresponding to the first voice data, and the voice unit duration is used for representing the pronunciation duration corresponding to the voice unit; the second voice data determining module is used for carrying out voice conversion on the voice text to obtain second voice data; the target voice data determining module is used for adjusting the voice unit duration of the corresponding voice unit in the second voice data based on the voice unit duration of each voice unit in the first voice data to obtain target voice data; and the virtual image generation module is used for generating a virtual image according to the target voice data.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods of the embodiments of the disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of the embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method of an embodiment of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 schematically illustrates a system architecture diagram of an avatar generation method and apparatus according to an embodiment of the present disclosure;
fig. 2 schematically shows a flow chart of an avatar generation method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a schematic diagram of generating an avatar and determining facial parameters according to an embodiment of the present disclosure;
FIG. 4 schematically shows a schematic diagram of obtaining facial pose features according to an embodiment of the present disclosure;
fig. 5 schematically shows a schematic diagram of an avatar generation method according to yet another embodiment of the present disclosure;
FIG. 6 schematically shows a schematic diagram of generating an avatar according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a schematic diagram of obtaining target speech data according to an embodiment of the present disclosure;
fig. 8 schematically shows a block diagram of an avatar generation apparatus according to an embodiment of the present disclosure; and
fig. 9 schematically shows a block diagram of an electronic device that can implement the avatar generation method of the embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.).
With the development of computer and internet technology, avatars can provide a variety of services for daily life, entertainment, and other domains. Some avatars provide audiovisual services, such as voice broadcasts, that combine a visual display with voice output.
For such audiovisual services, keeping the facial expression made by the avatar synchronized with the output voice is a problem that urgently needs to be solved. Whether the avatar's lip shape is synchronized with the voice is an important factor affecting how realistic the avatar appears.
In some approaches, when an avatar is driven by speech to make the corresponding expression, the lip-shape changes are not consistent with the speech output. The reason is that the speech-to-lip models take as input voice data that contains noise audio; although the noise audio can be filtered during processing, in practical applications it cannot be filtered out completely to obtain pure voice data, so the data output by the speech-to-lip models to drive the lip changes is inaccurate. For example, the lip shape produced for plosives may be abnormal, and the lip sequence across consecutive frames may be unstable. As a result, the driven lip changes do not match the speech output.
Fig. 1 schematically shows a system architecture of an avatar generation method and apparatus according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include clients 101, 102, 103, a network 104, and a server 105. Network 104 is the medium used to provide communication links between clients 101, 102, 103 and server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use clients 101, 102, 103 to interact with server 105 over network 104 to receive or send messages, etc. Various messaging client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (examples only) may be installed on the clients 101, 102, 103.
Clients 101, 102, 103 may be a variety of electronic devices having display screens and supporting web browsing, including but not limited to smart phones, tablets, laptop and desktop computers, and the like. The clients 101, 102, 103 of the disclosed embodiments may run applications, for example.
The server 105 may be a server that provides various services, for example a background management server (for example only) that supports websites browsed by users with the clients 101, 102, 103. The background management server may analyze and otherwise process received data such as user requests, and feed processing results (e.g., web pages, information, or data obtained or generated according to the user requests) back to the clients. In addition, the server 105 may also be a cloud server, i.e., a server with cloud computing capability.
It should be noted that the avatar generation method provided by the embodiment of the present disclosure may be executed by the server 105. Accordingly, the avatar generation apparatus provided by the embodiment of the present disclosure may be provided in the server 105. The avatar generation method provided by the embodiments of the present disclosure may also be performed by a server or server cluster that is different from the server 105 and is capable of communicating with the clients 101, 102, 103 and/or the server 105. Accordingly, the avatar generation apparatus provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the clients 101, 102, 103 and/or the server 105.
In one example, the server 105 may obtain initial voice data from the clients 101, 102, 103 over the network 104.
It should be understood that the number of clients, networks, and servers in FIG. 1 is merely illustrative. There may be any number of clients, networks, and servers, as desired for an implementation.
It should be noted that, in the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information all comply with the relevant laws and regulations and do not violate public order and good morals.
In the technical solution of the present disclosure, the user's authorization or consent is obtained before the user's personal information is acquired or collected.
An avatar generation method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 2 to 7 in conjunction with the system architecture of fig. 1. The avatar generation method of the disclosed embodiment may be performed by the server 105 shown in fig. 1, for example.
Fig. 2 schematically shows a flowchart of an avatar generation method according to an embodiment of the present disclosure.
As shown in fig. 2, the avatar generation method 200 of the embodiment of the present disclosure may include, for example, operations S210 to S250.
In operation S210, noise audio included in the initial voice data is filtered to obtain filtered first voice data.
The initial voice data includes noise audio. Noise audio can be understood as data that interferes with obtaining the first voice data. The first voice data is a kind of voice data, and voice data is the audio form of language. Noise audio may include, for example, ambient noise. For example, when the initial voice data is the voice data of a song, the song includes a singing voice and accompaniment audio; the accompaniment audio interferes with obtaining the first voice data, so it can be regarded as noise audio.
Illustratively, the initial voice data may also include non-noise speech, which can be understood as "clean" voice data. Still taking a song as the initial voice data, the non-noise speech may be, for example, the singing voice.
For example, a speech extraction model may be utilized to filter noise audio included in the initial speech data, resulting in filtered first speech data.
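For illustration, the sketch below stands in for such a speech extraction model with a simple spectral-gating filter written in Python; the disclosure does not specify the model, so the function, the noise-profile heuristic, and all parameter values here are assumptions rather than the actual implementation.

    import numpy as np
    from scipy.signal import stft, istft

    def filter_noise(initial_voice, sample_rate, noise_profile_sec=0.5, gain_floor=0.1):
        """Crude spectral-gating stand-in for a speech extraction model.

        Estimates a noise spectrum from the first `noise_profile_sec` seconds
        (assumed to contain little or no speech) and attenuates time-frequency
        bins that do not rise above that noise level.
        """
        freqs, times, spec = stft(initial_voice, fs=sample_rate, nperseg=512)
        noise_mag = np.abs(spec[:, times < noise_profile_sec]).mean(axis=1, keepdims=True)
        mag, phase = np.abs(spec), np.angle(spec)
        # Gain close to 1 where the signal clearly exceeds the noise estimate,
        # `gain_floor` where it does not.
        gain = np.clip((mag - noise_mag) / (mag + 1e-8), gain_floor, 1.0)
        _, first_voice = istft(gain * mag * np.exp(1j * phase), fs=sample_rate, nperseg=512)
        return first_voice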
In operation S220, a voice unit duration of each voice unit included in the first voice data and a voice text corresponding to the first voice data are determined.
The voice unit duration is used for representing the pronunciation duration corresponding to the voice unit.
Since the first voice data is a kind of voice data, and voice data is the audio form of language, the voice unit duration of each voice unit included in the first voice data and the voice text corresponding to the first voice data can both be determined from the first voice data. The voice unit duration can be understood as the length of time for which the voice unit is pronounced. A voice unit may be, for example, a word or a phrase.
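For illustration only, the sketch below shows one way the voice units, their durations, and the voice text could be represented; `recognizer.transcribe_with_timestamps` is a hypothetical interface standing in for whatever recognition or forced-alignment component supplies unit-level timestamps, which the disclosure does not name.

    from dataclasses import dataclass

    @dataclass
    class SpeechUnit:
        text: str       # the unit itself, e.g. a word or phrase
        start_s: float  # start time within the first voice data, in seconds
        end_s: float    # end time, in seconds

        @property
        def duration_s(self) -> float:
            # Pronunciation duration of this voice unit.
            return self.end_s - self.start_s

    def units_and_text(first_voice, sample_rate, recognizer):
        """Return the voice units, per-unit durations and the voice text."""
        units = [SpeechUnit(u["text"], u["start"], u["end"])
                 for u in recognizer.transcribe_with_timestamps(first_voice, sample_rate)]
        speech_text = " ".join(u.text for u in units)   # join rule depends on the language
        durations = [u.duration_s for u in units]
        return units, durations, speech_text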
In operation S230, the voice text is voice-converted to obtain second voice data.
For example, the voice conversion of the voice text may be implemented with a Text-To-Speech (TTS) model to obtain the second voice data.
In operation S240, the duration of the speech unit of the corresponding speech unit in the second speech data is adjusted based on the duration of the speech unit of each speech unit in the first speech data, so as to obtain the target speech data.
In operation S250, an avatar is generated according to the target voice data.
When an avatar is generated from the initial voice data, the first voice data contained in the initial voice data is strongly correlated with the lip-shape changes of the avatar.
In some cases, the noise audio included in the initial voice data cannot be completely filtered out, and the first voice data obtained from the initial voice data may still contain some noise audio; thus, the first voice data is not "clean" voice data.
With the avatar generation method of the embodiments of the present disclosure, the noise audio included in the initial voice data is filtered so that at least part of it is removed, yielding the filtered first voice data. Voice conversion is then performed on the voice text corresponding to the first voice data to obtain second voice data, which is "cleaner" than the first voice data. The voice unit duration of each corresponding voice unit in the second voice data is adjusted based on the voice unit duration of each voice unit in the first voice data, yielding target voice data that matches the voice unit durations of the first voice data. The target voice data obtained through the embodiments of the present disclosure is therefore both "clean" and matched to the voice unit durations of the first voice data.
The lip shape of an avatar generated from such "clean" target voice data is more accurate, which at least reduces abnormal lip shapes and mouth closures for plosives.
Further, since the voice unit durations are correlated with the lip-change sequence, generating the avatar from target voice data matched to the voice unit durations of the first voice data at least improves the stability of the lip sequence.
The avatar generation method of the embodiments of the present disclosure can be applied to scenarios such as facial mouth-shape capture for avatars, avatar singing, film and television animation, and interactive games and entertainment. Because the generated lip shapes are more accurate and the lip sequence is more stable, the method achieves a better avatar simulation effect and can improve the user's immersive experience. It can also replace complex and expensive facial mouth-shape capture equipment, for example in live-streaming scenarios, reducing equipment costs and the labor cost of fixing abnormal avatar lip shapes in post-production.
Fig. 3 schematically shows a schematic diagram of generating an avatar in an avatar generation method according to another embodiment of the present disclosure.
According to an avatar generation method of another embodiment of the present disclosure, a specific example of generating an avatar from target voice data may be realized by the following embodiments.
In operation S351, the face pose feature 303 is obtained from the voice feature 302 of the target voice data 301.
The speech feature of the target speech data may be understood as a feature that is obtained by extracting feature parameters of the target speech data and that supports computer processing, and the speech feature may be in the form of a feature vector, for example.
Facial pose features may be understood as features that characterize facial pose, which may for example map the facial expression of an avatar.
In operation S352, the face pose feature 303 is feature-split to obtain a plurality of pose split features 304.
Illustratively, the feature splitting may follow splitting logic configured by relevant personnel. For example, the facial pose feature may be split into equal parts, or split according to facial regions.
The split features may also be shuffled, for example by a random algorithm, to obtain randomly ordered pose split features.
In operation S353, facial parameters 305 are determined based on the plurality of pose split features 304.
In operation S354, the avatar 306 is generated according to the face parameters 305.
A pose split feature has a finer granularity than the facial pose feature, and the facial parameters determined from these finer-grained pose split features are more accurate. For the avatar generation scenarios of the embodiments of the present disclosure, the facial expression of the avatar subsequently generated from these facial parameters is therefore more accurate and realistic, and the avatar has a better simulation effect.
Fig. 3 also schematically shows a schematic diagram of determining facial parameters in an avatar generation method according to an embodiment of the present disclosure.
As shown in fig. 3, according to the avatar generation method of the embodiment of the present disclosure, a specific example of determining facial parameters based on a plurality of posture split features may be implemented with the following embodiments.
In operation S355, the split feature correlation parameter 307 is determined based on the plurality of pose split features 304.
The split feature relevance parameter is used to characterize the relevance between the plurality of pose split features.
In operation S356, from the split feature correlation parameter 307 and the facial pose feature 303, the facial parameters 305 are determined.
It will be appreciated that pronunciation causes corresponding changes in the face. For example, suppose the pose split features are split by facial region. Pronouncing "a" changes the lip shape and also expands both cheeks, so the lip shape has a relatively high correlation with the cheeks; pronouncing "b" changes the lip shape but does not expand the cheeks, so the correlation between the lip shape and cheek expansion is relatively low.
Through the split feature correlation parameter, the avatar generation method of the embodiments of the present disclosure can learn the degree of association between the pose split features for a given pronunciation and self-supervise the process of determining the facial parameters. As a result, the facial parameters determined from the split feature correlation parameter and the facial pose feature are more accurate.
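As a minimal sketch (one plausible formulation; the disclosure does not fix the exact formula), the split feature correlation parameter can be computed as a cosine-similarity matrix between the pose split features:

    import torch

    def split_feature_correlation(pose_splits: torch.Tensor) -> torch.Tensor:
        """pose_splits: (batch, num_splits, feat_dim), one row per pose split feature.

        Returns a (batch, num_splits, num_splits) matrix whose entry (i, j)
        measures how strongly split i co-varies with split j for the current
        pronunciation, e.g. lip region vs. cheek region.
        """
        normed = torch.nn.functional.normalize(pose_splits, dim=-1)
        return normed @ normed.transpose(1, 2)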
Fig. 4 schematically shows a schematic diagram of obtaining facial pose features in an avatar generation method according to still another embodiment of the present disclosure.
According to an avatar generation method of still another embodiment of the present disclosure, a specific example of obtaining a facial pose feature from a voice feature of target voice data can be realized by the following embodiment.
In operation S411, mel-frequency cepstral coefficients 402 of the target speech data 401 are acquired.
Mel-frequency cepstral coefficients are abbreviated MFCC. Parameters determined from MFCCs are robust, match the auditory characteristics of the human ear well, and retain good recognition performance even when the signal-to-noise ratio drops.
Illustratively, the MFCCs may be obtained as follows: pre-emphasis → framing → windowing → fast Fourier transform → triangular band-pass filtering with a mel filter bank → computing the logarithmic energy output by each filter → discrete cosine transform to obtain the MFCCs. Pre-emphasis can be implemented by passing the target voice data through a high-pass filter. It boosts the high-frequency part so that the spectrum of the signal becomes flatter and the signal is preserved across the whole band from low to high frequencies. Pre-emphasis also counteracts the effect of the vocal cords and lips during phonation, compensates for the high-frequency part of the speech signal suppressed by the articulatory system, and highlights the high-frequency formants.
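The sketch below follows this pipeline, assuming the target voice data is available as a mono waveform; the pre-emphasis coefficient, FFT size, and hop length are illustrative values rather than ones given in the disclosure, and the framing, windowing, mel filter bank, log-energy, and DCT steps are delegated to librosa.

    import numpy as np
    import librosa

    def mfcc_features(target_voice: np.ndarray, sample_rate: int,
                      n_mfcc: int = 32, preemph: float = 0.97) -> np.ndarray:
        """Pre-emphasis followed by MFCC extraction; returns (n_mfcc, n_frames)."""
        # High-pass pre-emphasis filter: y[t] = x[t] - preemph * x[t - 1]
        emphasized = np.append(target_voice[0],
                               target_voice[1:] - preemph * target_voice[:-1])
        # Framing, windowing, FFT, mel filter bank, log energy and DCT
        # are all performed inside librosa.feature.mfcc.
        return librosa.feature.mfcc(y=emphasized, sr=sample_rate, n_mfcc=n_mfcc,
                                    n_fft=512, hop_length=160)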
The speech features of the target speech data may include mel-frequency cepstral coefficients of the target speech data.
In operation S412, a phoneme feature 403 is obtained according to the mel cepstral coefficients 402.
The phoneme features are used to characterize pronunciation action units; a phoneme feature can be understood as a phoneme represented by a feature vector. A phoneme is the smallest unit of speech divided according to the natural properties of speech, and each pronunciation action within a syllable constitutes a phoneme. A phoneme is therefore the smallest unit, or smallest speech segment, that makes up a syllable, i.e., the smallest linear speech unit divided from the perspective of sound quality.
In operation S413, from the phoneme features 403, facial pose features 404 are obtained.
Since mel cepstral coefficients are strongly correlated with phonemes, and phonemes are strongly correlated with the lip shape, the avatar generation method of the embodiments of the present disclosure can obtain phoneme features strongly correlated with the lip shape from the mel cepstral coefficients of the target voice data, and the facial pose features obtained from those phoneme features can be mapped to accurate lip shapes.
Fig. 5 schematically shows a schematic diagram of an avatar generation method according to still another embodiment of the present disclosure. In the embodiment of the present disclosure, a specific example of generating an avatar from target speech data is realized by a convolutional neural network Net.
Illustratively, a voice window may be applied to the target voice data Audio to perform a preliminary division. Each voice window can be further divided into m voice segments, and n MFCC components can be extracted from each voice segment, yielding an m × n-dimensional model input Input. Illustratively, m may be 64 and n may be 32.
Because speech is continuous over short time spans, the avatar generation method of the embodiments of the present disclosure applies to the target voice data a voice window that covers multiple voice frames and extracts the features of multiple consecutive voice frames as input. The model can thus better learn the features of consecutive voice frames in a way that matches the short-time characteristics of speech, and therefore fits the facial parameters better.
Illustratively, the speech window may be set to 385 ms.
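Combining the two paragraphs above, the following sketch slices an MFCC sequence into the m × n (64 × 32) model inputs; how exactly the 385 ms window is segmented into 64 pieces, and the hop between successive windows, are assumptions the disclosure does not spell out.

    import numpy as np

    def model_inputs(mfcc: np.ndarray, frames_per_window: int = 64,
                     n_components: int = 32, hop_frames: int = 1) -> np.ndarray:
        """Slice an MFCC sequence into overlapping m x n model inputs.

        mfcc: (n_mfcc, total_frames) array, e.g. from mfcc_features() above.
        Each window keeps the first `n_components` (n = 32) coefficients of
        `frames_per_window` (m = 64) consecutive MFCC frames, giving one
        64 x 32 input per output frame of the avatar.
        """
        coeffs = mfcc[:n_components, :]
        windows = []
        for start in range(0, coeffs.shape[1] - frames_per_window + 1, hop_frames):
            win = coeffs[:, start:start + frames_per_window]   # (n, m)
            windows.append(win.T)                               # (m, n)
        return np.stack(windows)                                # (num_windows, m, n)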
The m × n-dimensional model input Input may be fed into a convolutional neural network Net, which may include a speech analysis network N1, a facial pose analysis network N2, a self-supervision network N3, fully-connected layers CF, and an output layer OL.
The speech analysis network N1 may be configured to perform speech feature extraction on the n-dimensional features of the model input Input to obtain phoneme features.
The facial pose analysis network N2 may perform feature extraction on the m-dimensional features of the model input Input, analyze how those features evolve over time, and output facial pose features.
The self-supervision network N3 may be configured to split the facial pose feature into a plurality of pose split features and to determine the split feature correlation parameter based on them. The fully-connected layers CF may be configured to fit the facial parameters from the facial pose feature and the split feature correlation parameter, where at least two fully-connected layers are used. It can be understood that a single fully-connected layer can only produce a simple (e.g., binary) numerical result, and a single value cannot represent the facial parameters, whereas at least two fully-connected layers can fit a multidimensional vector; the facial parameters can therefore be represented by the multidimensional vector fitted by the at least two fully-connected layers.
The output layer OL may be used to output the facial parameters. The avatar Vi, specifically a face model of the avatar, may be generated from the facial parameters.
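For orientation, the following is a compact PyTorch sketch of a network with the components N1, N2, N3, CF and OL described above. Every layer size, the number of pose splits, and the output dimension (52 blend-shape-style parameters) are illustrative assumptions, not values given in the disclosure.

    import torch
    import torch.nn as nn

    class AvatarFaceNet(nn.Module):
        """Sketch of the convolutional network Net for a 64 x 32 MFCC input."""

        def __init__(self, num_splits: int = 8, split_dim: int = 32,
                     num_face_params: int = 52):
            super().__init__()
            feat = num_splits * split_dim                      # facial pose feature size
            # N1: speech analysis over the MFCC (frequency) axis -> phoneme features
            self.speech_net = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=(1, 3), stride=(1, 2)), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=(1, 3), stride=(1, 2)), nn.ReLU(),
                nn.AdaptiveAvgPool2d((64, 1)),                 # keep the time axis
            )
            # N2: facial pose analysis over the time axis -> facial pose feature
            self.pose_net = nn.Sequential(
                nn.Conv2d(64, 128, kernel_size=(3, 1), stride=(2, 1)), nn.ReLU(),
                nn.Conv2d(128, 256, kernel_size=(3, 1), stride=(2, 1)), nn.ReLU(),
                nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
                nn.Linear(256, feat),
            )
            self.num_splits, self.split_dim = num_splits, split_dim
            # CF: at least two fully-connected layers fitting the facial parameters
            self.fc = nn.Sequential(
                nn.Linear(feat + num_splits * num_splits, 128), nn.ReLU(),
                nn.Linear(128, num_face_params), nn.Sigmoid(),  # OL: weights in [0, 1]
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, 1, 64, 32) window of MFCC features
            phoneme = self.speech_net(x)                        # (batch, 64, 64, 1)
            pose = self.pose_net(phoneme)                       # (batch, feat)
            # N3: split the facial pose feature and compute the correlation
            # parameter (same cosine-similarity form as the earlier sketch).
            splits = pose.view(-1, self.num_splits, self.split_dim)
            normed = nn.functional.normalize(splits, dim=-1)
            corr = (normed @ normed.transpose(1, 2)).flatten(1)
            return self.fc(torch.cat([pose, corr], dim=1))      # facial parameters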
Illustratively, the facial parameters may include blend shape coefficient weights (blendshape weights). Blend shape coefficients can be used to represent a parameterized initial face model; a blend shape coefficient weight is the weight value of a blend shape coefficient and lies between 0 and 1. By adjusting the values of the blend shape coefficient weights, the initial face model can be adjusted to obtain a face model with the corresponding expression.
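A minimal sketch of how such blend shape coefficient weights can deform a parameterized face model, assuming a standard linear blendshape rig; the specific rig layout is not described in the disclosure.

    import numpy as np

    def apply_blendshape_weights(base_vertices: np.ndarray,
                                 blendshape_deltas: np.ndarray,
                                 weights: np.ndarray) -> np.ndarray:
        """Deform the parameterized initial face model with blend shape weights.

        base_vertices:     (V, 3) neutral face mesh.
        blendshape_deltas: (K, V, 3) per-coefficient vertex offsets from the neutral mesh.
        weights:           (K,) facial parameters in [0, 1] output by the network.
        """
        weights = np.clip(weights, 0.0, 1.0)
        return base_vertices + np.tensordot(weights, blendshape_deltas, axes=1)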
The model parameters of the convolutional neural network Net include network weights. When the network regresses facial parameters with small numerical values, such as blend shape coefficient weights, the network weights have a relatively large influence on the result, and in some cases the regressed facial parameters are simply abnormal. Taking blend shape coefficient weights as the facial parameters as an example, the split feature correlation parameter can represent the correlation of the blend shape coefficient weights among the pose split features and can self-supervise the process of determining the facial parameters from the facial pose feature. This can be understood as label-free, self-supervised feature learning on the facial pose feature through the split feature correlation parameter. The facial parameters determined from the facial pose feature and the split feature correlation parameter are therefore more accurate and stable, the avatar generated from them is more lifelike, and the simulation effect is better.
For example, in the training phase of the convolutional neural network Net, the parts of the facial parameters output by the network that are unrelated to lip variation can be down-weighted. For example, a training sample Ts is passed through the convolutional neural network Net to output facial parameters Fd, and the loss value may be calculated from the label La of the training sample Ts and the facial parameters Fd. The parts unrelated to lip variation may include, for example, the eyebrow and eye regions.
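A sketch of such a down-weighted training loss; the L2 form, the mask layout, and the value of the down-weighting factor are assumptions, since the disclosure only states that the lip-unrelated parts are down-weighted.

    import torch

    def face_param_loss(pred: torch.Tensor, label: torch.Tensor,
                        lip_related: torch.Tensor, down_weight: float = 0.1) -> torch.Tensor:
        """Loss between facial parameters Fd (pred) and label La (label).

        pred, label:  (batch, K) predicted and ground-truth facial parameters,
                      e.g. blend shape coefficient weights.
        lip_related:  (K,) boolean mask; eyebrow- or eye-related parameters are False.
        """
        weights = torch.where(lip_related,
                              torch.ones_like(lip_related, dtype=pred.dtype),
                              torch.full_like(lip_related, down_weight, dtype=pred.dtype))
        return ((weights * (pred - label)) ** 2).mean()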
Fig. 6 schematically shows a schematic diagram of generating an avatar according to an avatar generation method according to still another embodiment of the present disclosure.
According to an avatar generation method of a further embodiment of the present disclosure, a specific example of generating an avatar according to face parameters may be realized by the following embodiments.
In operation S651, the initial face model 602 is acquired.
An initial face model 602 is generated from the initial face parameters 601.
In operation S652, the initial face parameters 601 of the initial face model 602 are updated according to the face parameters 603, and the target face model 604 is generated.
In operation S653, an avatar 605 is obtained according to the target face model 604.
With the avatar generation method of the embodiments of the present disclosure, by updating the initial face parameters with the facial parameters, the avatar can be generated quickly and efficiently on the basis of the initial face model.
Fig. 7 schematically shows a schematic diagram of obtaining target speech data in an avatar generation method according to still another embodiment of the present disclosure.
As shown in fig. 7, the duration of each speech unit of the second speech data Am may be determined from the second speech data Am, and each of these durations may be adjusted based on the duration of the corresponding speech unit of the first speech data Ai to obtain the target speech data At.
Fig. 7 schematically shows five speech units A, B, C, D and E; the corresponding five speech unit durations t1, t2, t3, t4 and t5 of the second speech data Am; and the corresponding five speech unit durations t1', t2', t3', t4' and t5' of the first speech data Ai. For each speech unit, the corresponding duration in the second speech data Am is adjusted based on the duration in the first speech data Ai until it matches, at which point the target speech data At is obtained.
By adjusting the speech unit durations of the corresponding speech units in the second speech data based on the speech unit durations of the speech units included in the first speech data, the avatar generation method of the embodiments of the present disclosure obtains target speech data whose speech unit durations match those of the first speech data. The lip sequence of the avatar generated from this target speech data is more stable and accurate, giving a better avatar simulation effect.
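As one straightforward realization of this adjustment (not necessarily the one used in the disclosure), each unit of the second voice data Am can be time-stretched until its duration matches the corresponding unit of the first voice data Ai:

    import numpy as np
    import librosa

    def match_unit_durations(second_voice: np.ndarray, sample_rate: int,
                             units_second, units_first) -> np.ndarray:
        """Stretch each speech unit of Am to the duration of the matching unit of Ai.

        units_second / units_first: aligned lists of (start_s, end_s) spans for the
        same speech units (A, B, C, D, E in Fig. 7).
        """
        pieces = []
        for (s2, e2), (s1, e1) in zip(units_second, units_first):
            segment = second_voice[int(s2 * sample_rate):int(e2 * sample_rate)]
            rate = (e2 - s2) / max(e1 - s1, 1e-6)   # rate > 1 shortens, < 1 lengthens
            pieces.append(librosa.effects.time_stretch(segment, rate=rate))
        return np.concatenate(pieces)               # target voice data At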
Illustratively, in an avatar generation method according to another embodiment of the present disclosure, generating an avatar from the target voice data may be implemented as follows: determining a target rhythm parameter; performing rhythm adjustment on the target voice data based on the target rhythm parameter to obtain rhythm-adjusted target voice data; and generating the avatar from the rhythm-adjusted target voice data.
Illustratively, the target rhythm parameter may include at least one of a melody, a frequency, and a pitch.
Illustratively, the target voice data may be rhythm-adjusted using a speech style conversion model.
With the avatar generation method of the embodiments of the present disclosure, the rhythm-adjusted target voice data has a rhythm consistent with the target rhythm parameter, and the avatar generated from the rhythm-adjusted target voice data has a better simulation effect.
Illustratively, according to an avatar generation method of yet another embodiment of the present disclosure, a specific example of determining a target tempo parameter may be implemented with the following embodiments: and acquiring the rhythm parameter in the initial voice data as a target rhythm parameter.
According to the avatar generation method of the embodiments of the present disclosure, by using the rhythm parameter of the initial voice data as the target rhythm parameter, the rhythm-adjusted target voice data becomes a kind of "clean" voice data that restores the voice unit durations and the rhythm parameter of the initial voice data.
For example, the initial voice data may be a piece of music containing rhythm parameters. Its accompaniment audio is independent of the determined voice text, so the accompaniment audio is a kind of noise audio, while the human singing voice determines the voice text, so it is a kind of first voice data. With the avatar generation method of the embodiments of the present disclosure, the generated avatar can provide audiovisual services such as singing the output music, with a more accurate lip shape.
Fig. 8 schematically shows a block diagram of an avatar generation apparatus according to an embodiment of the present disclosure.
As shown in fig. 8, the avatar generation apparatus 800 of the embodiment of the present disclosure includes, for example, a first voice data determination module 810, a voice unit duration and voice text determination module 820, a second voice data determination module 830, a target voice data determination module 840, and an avatar generation module 850.
The first voice data determining module 810 is configured to filter a noise audio included in the initial voice data to obtain filtered first voice data. Wherein the initial speech data comprises noisy audio.
The speech unit duration and speech text determining module 820 is configured to determine a speech unit duration of each speech unit included in the first speech data and a speech text corresponding to the first speech data, where the speech unit duration is used to represent a pronunciation duration corresponding to the speech unit.
The second voice data determining module 830 is configured to perform voice conversion on the voice text to obtain second voice data.
The target voice data determining module 840 is configured to adjust the duration of the voice unit of the corresponding voice unit in the second voice data based on the duration of the voice unit of each voice unit in the first voice data, so as to obtain target voice data.
And an avatar generation module 850 for generating an avatar according to the target voice data.
According to an embodiment of the present disclosure, the avatar generation module includes a facial pose feature determining submodule, a pose split feature determining submodule, a facial parameter determining submodule, and an avatar first generation submodule.
The facial pose feature determining submodule is used for obtaining facial pose features according to the voice features of the target voice data.
The pose split feature determining submodule is used for performing feature splitting on the facial pose features to obtain a plurality of pose split features.
The facial parameter determining submodule is used for determining facial parameters based on the plurality of pose split features.
The avatar first generation submodule is used for generating the avatar according to the facial parameters.
According to an embodiment of the present disclosure, the facial parameter determining submodule includes a split feature correlation parameter determining unit and a facial parameter determining unit.
The split feature correlation parameter determining unit is used for determining the split feature correlation parameter based on the plurality of pose split features, where the split feature correlation parameter is used to characterize the correlation between the plurality of pose split features.
The facial parameter determining unit is used for determining the facial parameters according to the split feature correlation parameter and the facial pose features.
According to an embodiment of the present disclosure, the facial pose feature determining submodule includes a mel cepstral coefficient determining unit, a phoneme feature determining unit, and a facial pose feature determining unit.
The mel cepstral coefficient determining unit is used for acquiring the mel cepstral coefficients of the target voice data;
the phoneme feature determining unit is used for obtaining phoneme features according to the mel cepstral coefficients; and
the facial pose feature determining unit is used for obtaining the facial pose features according to the phoneme features.
According to an embodiment of the present disclosure, the avatar first generation submodule includes an initial face model determining unit, a target face model determining unit, and an avatar determining unit.
An initial face model determination unit for obtaining an initial face model, wherein the initial face model is generated from the initial face parameters.
The target face model determining unit is used for updating the initial face parameters of the initial face model according to the facial parameters to generate the target face model.
An avatar determination unit for obtaining an avatar based on the target face model.
According to an embodiment of the present disclosure, the avatar determination module further includes: the target rhythm parameter determining submodule, the rhythm adjusting submodule and the virtual image second generating submodule.
And the target rhythm parameter determining submodule is used for determining the target rhythm parameter.
And the rhythm adjusting submodule is used for adjusting the rhythm of the target voice data based on the target rhythm parameter to obtain the target voice data with the adjusted rhythm.
And the second generation submodule of the virtual image is used for generating the virtual image according to the target voice data after the rhythm is adjusted.
According to an embodiment of the present disclosure, the target rhythm parameter determining submodule includes a target rhythm parameter determining unit.
The target rhythm parameter determining unit is used for acquiring the rhythm parameter in the initial voice data as the target rhythm parameter.
It should be understood that the embodiments of the apparatus part of the present disclosure are the same as or similar to the embodiments of the method part of the present disclosure, and the technical problems to be solved and the technical effects to be achieved are also the same as or similar to each other, and the detailed description of the present disclosure is omitted.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM)902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The calculation unit 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 901 performs the respective methods and processes described above, such as the avatar generation method. For example, in some embodiments, the avatar generation method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 900 via ROM 902 and/or communications unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the avatar generation method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the avatar generation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above, reordering, adding or deleting steps, may be used. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. An avatar generation method, comprising:
filtering noise audio included in initial voice data to obtain filtered first voice data, wherein the initial voice data includes the noise audio;
determining the voice unit duration of each voice unit included in the first voice data and the voice text corresponding to the first voice data, wherein the voice unit duration is used for representing the pronunciation duration corresponding to the voice unit;
performing voice conversion on the voice text to obtain second voice data;
based on the voice unit duration of each voice unit in the first voice data, adjusting the voice unit duration of a corresponding voice unit in the second voice data to obtain target voice data; and
generating an avatar according to the target voice data.
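For illustration only (not part of the claims): a minimal Python sketch of the duration-alignment step in claim 1, assuming the filtered first voice data and the synthesized second voice data have already been segmented into corresponding voice units. Each synthesized unit is time-stretched so its pronunciation duration matches the duration measured in the filtered original speech; the VoiceUnit type and align_durations helper are hypothetical names, and librosa is used only for time stretching.

    from dataclasses import dataclass
    from typing import List

    import numpy as np
    import librosa  # used here only for time stretching


    @dataclass
    class VoiceUnit:
        label: str           # e.g. a phoneme or syllable
        duration: float      # pronunciation duration in seconds
        samples: np.ndarray  # waveform of this unit


    def align_durations(first_units: List[VoiceUnit],
                        second_units: List[VoiceUnit]) -> np.ndarray:
        """Stretch each unit of the synthesized speech (second voice data) to the
        duration of the corresponding unit in the filtered speech (first voice data)."""
        aligned = []
        for ref, syn in zip(first_units, second_units):
            if ref.duration <= 0 or syn.duration <= 0:
                continue
            # rate > 1 shortens the unit, rate < 1 lengthens it
            rate = syn.duration / ref.duration
            aligned.append(librosa.effects.time_stretch(syn.samples.astype(float), rate=rate))
        return np.concatenate(aligned) if aligned else np.zeros(0)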
2. The method of claim 1, wherein generating an avatar from the target speech data comprises:
obtaining facial pose features according to voice features of the target voice data;
performing feature splitting on the facial pose features to obtain a plurality of split pose features;
determining facial parameters based on the plurality of split pose features; and
generating the avatar according to the facial parameters.
3. The method of claim 2, wherein determining the facial parameters based on the plurality of split pose features comprises:
determining a split feature correlation parameter based on the plurality of split pose features, wherein the split feature correlation parameter is used to characterize a correlation between the plurality of split pose features; and
determining the facial parameters according to the split feature correlation parameter and the facial pose features.
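For illustration only: one possible numpy reading of claims 2 and 3, with the learned output projection replaced by a random stand-in. The facial pose feature vector is split into several sub-features, an attention-style similarity matrix between the splits plays the role of the split feature correlation parameter, and facial parameters are produced from the correlation-weighted features. All function and variable names here are hypothetical.

    import numpy as np


    def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)


    def facial_params_from_pose(pose_feat: np.ndarray, n_splits: int, out_dim: int,
                                rng: np.random.Generator) -> np.ndarray:
        splits = np.stack(np.split(pose_feat, n_splits))               # (n_splits, d)
        corr = softmax(splits @ splits.T / np.sqrt(splits.shape[1]))   # split feature correlation
        mixed = (corr @ splits).reshape(-1)                            # correlation-weighted features
        weights = rng.standard_normal((mixed.size, out_dim)) * 0.01    # stand-in for a learned layer
        return mixed @ weights                                         # facial parameters


    rng = np.random.default_rng(0)
    params = facial_params_from_pose(rng.standard_normal(64), n_splits=4, out_dim=52, rng=rng)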
4. The method of claim 2, wherein obtaining the facial pose features according to the voice features of the target voice data comprises:
acquiring Mel cepstrum coefficients of the target voice data;
obtaining phoneme features according to the Mel cepstrum coefficients; and
obtaining the facial pose features according to the phoneme features.
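For illustration only: a sketch of the feature chain in claim 4, with the two learned mappings replaced by random linear stand-ins (an assumption; the claim does not specify the models). Mel cepstrum coefficients are extracted from the target voice data, mapped to per-frame phoneme features, and those are mapped to facial pose features.

    import numpy as np
    import librosa

    sr = 16000
    y = np.sin(2 * np.pi * 220.0 * np.arange(sr) / sr)      # 1 s placeholder for target voice data
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T    # (frames, 13) Mel cepstrum coefficients

    rng = np.random.default_rng(0)
    to_phoneme = rng.standard_normal((13, 40)) * 0.1        # stand-in for a phoneme recognition model
    to_pose = rng.standard_normal((40, 64)) * 0.1           # stand-in for a pose prediction model

    phoneme_feat = np.tanh(mfcc @ to_phoneme)               # per-frame phoneme features
    facial_pose_feat = np.tanh(phoneme_feat @ to_pose)      # per-frame facial pose features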
5. The method of claim 2, wherein generating the avatar according to the facial parameters comprises:
obtaining an initial face model, wherein the initial face model is generated according to initial face parameters;
updating the initial face parameters of the initial face model according to the facial parameters to generate a target face model; and
obtaining the avatar according to the target face model.
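For illustration only: a blendshape-style reading of claim 5 (an assumption; the claim does not name a specific face model). An initial face mesh is generated from initial face parameters, the parameters are then replaced by the facial parameters predicted from the speech, and the regenerated mesh stands in for the target face model that drives the avatar. The vertex and parameter counts below are arbitrary.

    import numpy as np

    n_vertices, n_params = 468, 52
    rng = np.random.default_rng(0)
    neutral = rng.standard_normal((n_vertices, 3))                  # placeholder neutral geometry
    basis = rng.standard_normal((n_params, n_vertices, 3)) * 0.01   # per-parameter vertex offsets

    initial_face_params = np.zeros(n_params)
    initial_face_model = neutral + np.tensordot(initial_face_params, basis, axes=1)

    facial_params = rng.uniform(0.0, 1.0, n_params)                 # predicted from the speech features
    target_face_model = neutral + np.tensordot(facial_params, basis, axes=1)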
6. The method of claim 1, wherein generating the avatar according to the target voice data further comprises:
determining a target rhythm parameter;
performing rhythm adjustment on the target voice data based on the target rhythm parameter to obtain rhythm-adjusted target voice data; and
generating the avatar according to the rhythm-adjusted target voice data.
7. The method of claim 6, wherein determining the target rhythm parameter comprises:
acquiring a rhythm parameter of the initial voice data as the target rhythm parameter.
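For illustration only: one plausible reading of claims 6 and 7 (an assumption), in which the rhythm parameter is approximated by an onset rate measured on the initial voice data and the target voice data is time-stretched until its onset rate matches. The function names are hypothetical; librosa supplies onset detection and time stretching.

    import numpy as np
    import librosa


    def onset_rate(y: np.ndarray, sr: int) -> float:
        """Onsets per second, used here as a crude rhythm parameter."""
        onsets = librosa.onset.onset_detect(y=y, sr=sr, units='time')
        return len(onsets) / (len(y) / sr + 1e-8)


    def match_rhythm(target: np.ndarray, initial: np.ndarray, sr: int) -> np.ndarray:
        """Stretch the target voice data so its onset rate matches the initial voice data."""
        target_rate = onset_rate(target, sr)
        initial_rate = onset_rate(initial, sr)   # the target rhythm parameter of claim 7
        if target_rate <= 0 or initial_rate <= 0:
            return target
        # rate > 1 speeds the audio up, rate < 1 slows it down
        return librosa.effects.time_stretch(target.astype(float), rate=initial_rate / target_rate)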
8. An avatar generation apparatus comprising:
a first voice data determining module, configured to filter noise audio included in initial voice data to obtain filtered first voice data, wherein the initial voice data includes the noise audio;
a voice unit duration and voice text determining module, configured to determine a voice unit duration of each voice unit included in the first voice data and a voice text corresponding to the first voice data, where the voice unit duration is used to represent a pronunciation duration corresponding to the voice unit;
a second voice data determining module, configured to perform voice conversion on the voice text to obtain second voice data;
a target voice data determining module, configured to adjust the voice unit durations of corresponding voice units in the second voice data based on the voice unit durations of the voice units in the first voice data, so as to obtain target voice data; and
an avatar generation module, configured to generate an avatar according to the target voice data.
9. The apparatus of claim 8, wherein the avatar generation module includes:
a facial pose feature determination submodule, configured to obtain facial pose features according to the voice features of the target voice data;
a split pose feature determination submodule, configured to perform feature splitting on the facial pose features to obtain a plurality of split pose features;
a facial parameter determination submodule, configured to determine facial parameters based on the plurality of split pose features; and
a first avatar generation submodule, configured to generate the avatar according to the facial parameters.
10. The apparatus of claim 9, wherein the facial parameter determination submodule comprises:
a split feature correlation parameter determination unit, configured to determine a split feature correlation parameter based on the plurality of split pose features, wherein the split feature correlation parameter is used to characterize a correlation between the plurality of split pose features; and
a facial parameter determination unit, configured to determine the facial parameters according to the split feature correlation parameter and the facial pose features.
11. The apparatus of claim 9, wherein the facial pose feature determination submodule comprises:
a Mel cepstrum coefficient determination unit, configured to acquire Mel cepstrum coefficients of the target voice data;
a phoneme feature determination unit, configured to obtain phoneme features according to the Mel cepstrum coefficients, wherein the phoneme features are used to characterize pronunciation action units; and
a facial pose feature determination unit, configured to obtain the facial pose features according to the phoneme features.
12. The apparatus of claim 9, wherein the first avatar generation submodule comprises:
an initial face model determination unit, configured to obtain an initial face model, wherein the initial face model is generated according to initial face parameters;
a target face model determination unit, configured to update the initial face parameters of the initial face model according to the facial parameters to generate a target face model; and
an avatar determination unit, configured to obtain the avatar according to the target face model.
13. The apparatus of claim 8, wherein the avatar generation module further comprises:
a target rhythm parameter determination submodule, configured to determine a target rhythm parameter;
a rhythm adjustment submodule, configured to perform rhythm adjustment on the target voice data based on the target rhythm parameter to obtain rhythm-adjusted target voice data; and
a second avatar generation submodule, configured to generate the avatar according to the rhythm-adjusted target voice data.
14. The apparatus of claim 13, wherein the target rhythm parameter determination submodule comprises:
a target rhythm parameter determination unit, configured to acquire a rhythm parameter of the initial voice data as the target rhythm parameter.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
CN202210572328.5A 2022-05-24 2022-05-24 Avatar generation method, apparatus, device, storage medium, and program product Pending CN114999440A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210572328.5A CN114999440A (en) 2022-05-24 2022-05-24 Avatar generation method, apparatus, device, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210572328.5A CN114999440A (en) 2022-05-24 2022-05-24 Avatar generation method, apparatus, device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN114999440A true CN114999440A (en) 2022-09-02

Family

ID=83029617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210572328.5A Pending CN114999440A (en) 2022-05-24 2022-05-24 Avatar generation method, apparatus, device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN114999440A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116320222A (en) * 2023-03-24 2023-06-23 北京生数科技有限公司 Audio processing method, device and storage medium
CN116320222B (en) * 2023-03-24 2024-01-30 北京生数科技有限公司 Audio processing method, device and storage medium

Similar Documents

Publication Publication Date Title
US10997764B2 (en) Method and apparatus for generating animation
CN107945786B (en) Speech synthesis method and device
CN111599343B (en) Method, apparatus, device and medium for generating audio
US20210335364A1 (en) Computer program, server, terminal, and speech signal processing method
CN109545192A (en) Method and apparatus for generating model
US20220328041A1 (en) Training neural networks to predict acoustic sequences using observed prosody info
CN114895817B (en) Interactive information processing method, network model training method and device
CN112786008A (en) Speech synthesis method, device, readable medium and electronic equipment
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN110136715A (en) Audio recognition method and device
CN113689837A (en) Audio data processing method, device, equipment and storage medium
JP2023059937A (en) Data interaction method and device, electronic apparatus, storage medium and program
CN113963679A (en) Voice style migration method and device, electronic equipment and storage medium
CN114999440A (en) Avatar generation method, apparatus, device, storage medium, and program product
CN112712783B (en) Method and device for generating music, computer equipment and medium
CN114495977A (en) Speech translation and model training method, device, electronic equipment and storage medium
CN114999441A (en) Avatar generation method, apparatus, device, storage medium, and program product
JP7412483B2 (en) Audio processing methods, devices, electronic devices and storage media
CN114708876B (en) Audio processing method, device, electronic equipment and storage medium
CN114446268B (en) Audio data processing method, device, electronic equipment, medium and program product
CN114783409A (en) Training method of speech synthesis model, speech synthesis method and device
CN111583945B (en) Method, apparatus, electronic device, and computer-readable medium for processing audio
CN113793598B (en) Training method of voice processing model, data enhancement method, device and equipment
US11915702B1 (en) Automated systems and methods that generate affect-annotated timelines
US20230081543A1 (en) Method for synthetizing speech and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination