CN114999441A

CN114999441A - Avatar generation method, apparatus, device, storage medium, and program product

Info

Publication number: CN114999441A
Application number: CN202210572336.XA
Authority: CN
Inventors: 郭紫垣
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-05-24
Filing date: 2022-05-24
Publication date: 2022-09-02

Abstract

The present disclosure provides an avatar generation method, apparatus, device, storage medium, and program product, relating to the technical field of artificial intelligence, and in particular to the technical fields of deep learning, image processing, and computer vision. The specific implementation scheme is as follows: acquiring target voice data; obtaining target timbre voice data based on the target voice data and target timbre parameters, wherein the target timbre parameters have a corresponding timbre identifier; fusing the timbre identifier with the voice features of the target timbre voice data to obtain a target feature; determining avatar parameters for the avatar according to the target feature; and generating the avatar according to the avatar parameters.

Description

Avatar generation method, apparatus, device, storage medium, and program product

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, in particular to the fields of deep learning, image processing, and computer vision, and more specifically to a method, an apparatus, a device, a storage medium, and a program product for generating an avatar.

Background

With the development of computer technology and internet technology, avatars can provide various functional services in areas such as daily life and entertainment. For example, some avatars may provide audiovisual services, such as voice announcements, by combining visual display with voice output.

Disclosure of Invention

The present disclosure provides an avatar generation method, apparatus, device, storage medium, and program product.

According to an aspect of the present disclosure, there is provided an avatar generation method, including: acquiring target voice data; obtaining target timbre voice data based on the target voice data and target timbre parameters, wherein the target timbre parameters have a corresponding timbre identifier; fusing the timbre identifier with the voice features of the target timbre voice data to obtain a target feature; determining avatar parameters for the avatar according to the target feature; and generating the avatar according to the avatar parameters.

According to another aspect of the present disclosure, there is provided an avatar generation apparatus including: a target voice data determining module, a target timbre voice data determining module, a target feature determining module, an avatar parameter determining module, and an avatar generating module. The target voice data determining module is configured to acquire target voice data; the target timbre voice data determining module is configured to obtain target timbre voice data based on the target voice data and target timbre parameters, wherein the target timbre parameters have a corresponding timbre identifier; the target feature determining module is configured to fuse the timbre identifier with the voice features of the target timbre voice data to obtain a target feature; the avatar parameter determining module is configured to determine avatar parameters for the avatar according to the target feature; and the avatar generating module is configured to generate the avatar according to the avatar parameters.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. The memory stores instructions executable by the at least one processor, so that the at least one processor can perform the method of the embodiments of the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of the embodiments of the present disclosure.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method of an embodiment of the present disclosure.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 schematically illustrates a system architecture diagram of an avatar generation method and apparatus according to an embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow chart of an avatar generation method according to an embodiment of the present disclosure;

FIG. 3 schematically shows a schematic diagram of obtaining target timbre speech data according to an embodiment of the present disclosure;

FIG. 4 schematically illustrates a schematic diagram of determining avatar parameters for an avatar according to an embodiment of the present disclosure;

FIG. 5 schematically shows a schematic diagram of obtaining facial pose features according to an embodiment of the present disclosure;

FIG. 6 schematically shows a schematic diagram of an avatar generation method according to yet another embodiment of the present disclosure;

FIG. 7 schematically illustrates a schematic diagram of generating an avatar according to an embodiment of the present disclosure;

FIG. 8 schematically illustrates a schematic diagram of obtaining target speech data according to an embodiment of the present disclosure;

FIG. 9 schematically shows a block diagram of an avatar generation apparatus according to an embodiment of the present disclosure; and

FIG. 10 schematically shows a block diagram of an electronic device that can implement the avatar generation method of the embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.

Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).

With the development of computer technology and internet technology, avatars can provide various functional services in areas such as daily life and entertainment. Some avatars may provide audiovisual services, such as voice announcements, by combining visual display with voice output.

For audiovisual services, in some embodiments, the avatar may also output voice of a particular timbre. How to keep the avatar's facial expression synchronized with the output voice across different timbres is a problem that urgently needs to be solved. Whether the avatar's lip shape is synchronized with the voice is one factor that affects how lifelike the avatar appears.

In some embodiments, when an avatar is driven by voice to make a corresponding expression, the lip-shape changes are inconsistent with the voice output. For example, some embodiments input voice data into a trained voice-to-facial-lip-shape model, which outputs data that can drive changes in the facial lip shape.

Some voice-to-facial-lip-shape models use relatively homogeneous voice data, e.g., voice data that includes only a single timbre. When such a model is applied in practice, the lip-driving data it outputs is inaccurate if the input voice data includes a plurality of timbres.

Some voice-to-facial-lip-shape models can filter noise audio out of voice data that contains it. In practice, however, the noise audio cannot be filtered out completely to yield clean voice data, so the data output by the model to drive facial lip-shape changes is inaccurate. This inaccuracy shows up, for example, as abnormal lip shapes on plosives and unstable lip-shape sequences across consecutive frames.

Fig. 1 schematically shows a system architecture of an avatar generation method and apparatus according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.

As shown in fig. 1, a system architecture 100 according to this embodiment may include clients 101, 102, 103, a network 104, and a server 105. The network 104 is the medium used to provide communication links between the clients 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables, to name a few.

A user may use the clients 101, 102, 103 to interact with the server 105 over the network 104 to receive or send messages, etc. Various client applications, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software (examples only), may be installed on the clients 101, 102, 103.

The clients 101, 102, 103 may be a variety of electronic devices having display screens and supporting web browsing, including but not limited to smartphones, tablets, laptop computers, and desktop computers. The clients 101, 102, 103 of the disclosed embodiments may, for example, run applications.

The server 105 may be a server that provides various services, such as a background management server (for example only) that provides support for websites browsed by users using the clients 101, 102, 103. The background management server may analyze and otherwise process received data such as user requests, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the client. In addition, the server 105 may also be a cloud server, i.e., the server 105 has a cloud computing function.

It should be noted that the avatar generation method provided by the embodiment of the present disclosure may be executed by the server 105. Accordingly, the avatar generation apparatus provided by the embodiment of the present disclosure may be provided in the server 105. The avatar generation method provided by the embodiments of the present disclosure may also be performed by a server or server cluster that is different from the server 105 and is capable of communicating with the clients 101, 102, 103 and/or the server 105. Accordingly, the avatar generation apparatus provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the clients 101, 102, 103 and/or the server 105.

In one example, the server 105 may obtain initial text data from the clients 101, 102, 103 over the network 104.

It should be understood that the number of clients, networks, and servers in FIG. 1 is merely illustrative. There may be any number of clients, networks, and servers, as desired for an implementation.

It should be noted that, in the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information all comply with relevant laws and regulations and do not violate public order and good customs.

In the technical solution of the present disclosure, the user's authorization or consent is obtained before the user's personal information is obtained or collected.

An avatar generation method according to an exemplary embodiment of the present disclosure is described below with reference to fig. 2 to 8 in conjunction with the system architecture of fig. 1. The avatar generation method of the disclosed embodiment may be performed by the server 105 shown in fig. 1, for example.

Fig. 2 schematically shows a flowchart of an avatar generation method according to an embodiment of the present disclosure.

As shown in fig. 2, the avatar generation method 200 of the embodiment of the present disclosure may include, for example, operations S210 to S250.

In operation S210, target voice data is acquired.

Speech data may be understood as an audio form of a language.

In operation S220, target timbre voice data is obtained based on the target voice data and the target timbre parameters.

The target timbre parameters have corresponding timbre identifications.

Note that timbre refers to the distinctive characteristics of the waveforms of different sounds, and is determined by the characteristics of the sound-producing body.

Taking singing as an example, because of the characteristics of each person's body structures, such as the oral cavity and throat, every person has a unique timbre, and because of this uniqueness, different people render the same song differently. The avatar generation method of the disclosed embodiment simulates a specific timbre through the target timbre parameters.

Illustratively, the target timbre parameter may comprise pitch.

It can be understood that the target timbre voice data carries the corresponding timbre characteristics in addition to the content of the target voice data.

In operation S230, the timbre identifier is fused with the voice features of the target timbre voice data to obtain a target feature.

Illustratively, the target feature may be obtained by fusing the timbre identifier and the voice features of the target timbre voice data along the same dimension.

The target feature may be understood, for example, as being represented as a feature vector, which supports computer processing.

In operation S240, an avatar parameter for the avatar is determined according to the target feature.

In operation S250, an avatar is generated according to the avatar parameters.

It can be understood that, for given target voice data and a specific target timbre parameter, the resulting target timbre voice data has a specific timbre, and the avatar likewise has specific characteristics in its visual representation. For example, for target voice data A, based on two different target timbre parameters B and C, corresponding target timbre voice data Ab and Ac can be obtained, respectively, and the avatar exhibits correspondingly different lip-shape changes Lb and Lc based on Ab and Ac, respectively.

The avatar generation method of the embodiment of the present disclosure introduces target timbre parameters. The target timbre voice data obtained from the target voice data and the target timbre parameters has the corresponding timbre characteristics, and the target feature obtained by fusing the timbre identifier with the voice features of the target timbre voice data can, for example, reflect the association between the target timbre voice data and the corresponding target timbre parameters, so the method can adapt to application scenarios with various timbres. The target feature may also, for example, characterize the specific timbre of the target timbre voice data, so that the avatar parameters determined from the target feature are associated with that specific timbre. The avatar generated from the avatar parameters is therefore a visual expression tied to a specific timbre: given the same target voice data and different target timbre parameters, the visual expression of the avatar differs, which makes the avatar more lifelike.
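The flow of operations S220 to S250 can be sketched as a chain of function calls. The following is a minimal illustration only, not the implementation of the disclosure: the function `generate_avatar`, its injected callables, the fusion by concatenation, and the toy stand-ins `scale` and `mean_feature` are all hypothetical names and choices introduced here.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def generate_avatar(
    target_voice: Sequence[float],
    timbre_params: Sequence[float],
    timbre_id: Vector,
    apply_timbre: Callable[[Sequence[float], Sequence[float]], List[float]],
    extract_features: Callable[[Sequence[float]], Vector],
    predict_avatar_params: Callable[[Vector], Vector],
    render: Callable[[Vector], dict],
) -> dict:
    """Chain S220 to S250; S210 (acquiring the voice data) is left to the caller."""
    timbre_voice = apply_timbre(target_voice, timbre_params)   # S220
    # S230: fusion is assumed here to be plain concatenation.
    target_feature = extract_features(timbre_voice) + list(timbre_id)
    avatar_params = predict_avatar_params(target_feature)      # S240
    return render(avatar_params)                               # S250


# Toy stand-ins (hypothetical) so the skeleton can be run end to end.
def scale(voice, params):
    return [sample * params[0] for sample in voice]


def mean_feature(voice):
    return [sum(voice) / len(voice)]


voice_a = [0.1, 0.2, 0.3]
# Same voice data A, two different timbre parameters and identifiers.
lb = generate_avatar(voice_a, [1.0], [1.0, 0.0], scale, mean_feature, list, lambda p: {"lip": p})
lc = generate_avatar(voice_a, [2.0], [0.0, 1.0], scale, mean_feature, list, lambda p: {"lip": p})
```

With these stand-ins, `lb` and `lc` differ even though the input voice data is identical, mirroring the Ab/Ac and Lb/Lc example in the text.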

Fig. 3 schematically shows a schematic diagram of obtaining target timbre voice data in an avatar generation method according to another embodiment of the present disclosure.

As shown in fig. 3, a specific example of obtaining target timbre voice data based on target voice data and target timbre parameters may be implemented by the following embodiment.

The target speech data 301 includes at least one speech unit, and the target timbre parameters 302 include a timbre parameter for each speech unit.

In operation S331, the timbre parameters are matched to the corresponding speech unit to obtain target timbre speech data 303.

In the example of fig. 3, the target speech data 301 includes a total of x speech units u1 through ux, and the target timbre parameters 302 include a total of x timbre parameters p1 through px.

According to the avatar generation method of the embodiment of the present disclosure, more accurate target timbre voice data can be obtained by matching the corresponding timbre parameter to each speech unit one by one.
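The one-to-one matching of timbre parameters p1..px to speech units u1..ux can be illustrated with a short sketch. It assumes, purely for illustration, that each timbre parameter is a multiplicative pitch adjustment (the text only says a timbre parameter may comprise pitch); the classes `SpeechUnit` and `TimbreParam` and the function `match_timbre` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SpeechUnit:
    text: str
    pitch_hz: float  # base pitch of this unit


@dataclass
class TimbreParam:
    pitch_scale: float  # assumed form: multiplicative pitch adjustment


def match_timbre(units: Sequence[SpeechUnit],
                 params: Sequence[TimbreParam]) -> List[SpeechUnit]:
    """Apply timbre parameter p_i to speech unit u_i, one by one."""
    if len(units) != len(params):
        raise ValueError("expected exactly one timbre parameter per speech unit")
    return [SpeechUnit(u.text, u.pitch_hz * p.pitch_scale)
            for u, p in zip(units, params)]


units = [SpeechUnit("ni", 200.0), SpeechUnit("hao", 220.0)]
params = [TimbreParam(1.5), TimbreParam(0.5)]
timbre_units = match_timbre(units, params)
```

Raising an error on a length mismatch makes the one-parameter-per-unit requirement explicit rather than silently truncating.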

Fig. 4 schematically shows a schematic diagram of determining an avatar parameter for an avatar in an avatar generation method according to still another embodiment of the present disclosure.

According to an avatar generation method of still another embodiment of the present disclosure, a specific example of determining an avatar parameter for an avatar according to a target feature may be implemented by the following embodiments.

In operation S441, the facial pose feature 402 is obtained from the target feature 401.

Facial pose features may be understood as features that characterize facial pose, which may for example map the facial expression of an avatar.

In operation S442, feature splitting is performed on the facial pose feature 402 to obtain a plurality of pose split features 403.

Illustratively, feature splitting may be performed according to splitting logic set by the relevant personnel. For example, the features may be split into equal parts or split by facial part.

The split features may also be shuffled, for example by a random algorithm, to obtain random pose split features.

In operation S443, a split feature correlation parameter 404 is determined based on the plurality of pose split features 403.

The split feature correlation parameter is used to characterize the correlation between the plurality of pose split features.

In operation S444, the facial parameter 405 is determined from the split feature correlation parameter 404 and the facial pose feature 402.

The avatar parameters include facial parameters.

A pose split feature is finer-grained than the facial pose feature, and facial parameters determined with the help of finer-grained pose split features are more accurate. Consequently, in the avatar generation scenarios of the embodiments of the present disclosure, the facial expression of an avatar subsequently generated from the facial parameters is more accurate and realistic, and the avatar is more lifelike.

It will be appreciated that pronunciation causes corresponding changes in the face. For example, when the pose split features are split by facial part: producing the sound "a" changes the lip shape and also expands both cheeks, so the lip shape and the cheeks are relatively highly correlated; producing the sound "B" changes the lip shape but does not expand the cheeks, so the correlation between the lip shape and the cheeks is relatively low.

Based on the split feature correlation parameter, the avatar generation method of the embodiment of the present disclosure can learn the degree of association between the pose split features for a given pronunciation, and thereby self-supervise the process of determining the facial parameters. Thus, the facial parameters determined from the split feature correlation parameter and the facial pose feature are more accurate.
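As a rough illustration of operations S442 and S443, the sketch below splits a feature vector into equal parts and computes a pairwise correlation matrix between the parts. The disclosure does not specify how the correlation parameter is computed (in the described network it is learned); the Pearson correlation used here, and the names `split_equal`, `pearson`, and `split_correlation`, are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def split_equal(feature: Sequence[float], parts: int) -> List[List[float]]:
    """Split a feature vector into `parts` equally sized pose split features."""
    if parts <= 0 or len(feature) % parts != 0:
        raise ValueError("feature length must be divisible by parts")
    size = len(feature) // parts
    return [list(feature[i * size:(i + 1) * size]) for i in range(parts)]


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                     sum((y - mean_b) ** 2 for y in b))
    return cov / norm if norm else 0.0


def split_correlation(feature: Sequence[float], parts: int) -> List[List[float]]:
    """Pairwise correlation matrix between the split features."""
    chunks = split_equal(feature, parts)
    return [[pearson(p, q) for q in chunks] for p in chunks]


# Three hypothetical split features: the second rises with the first
# (high correlation), the third falls as the first rises.
corr = split_correlation([1, 2, 3, 2, 4, 6, 3, 2, 1], parts=3)
```

In this toy input, the first two parts move together and the third moves in the opposite direction, which is the kind of relationship the lip-and-cheek example describes.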

According to an avatar generation method of still another embodiment of the present disclosure, a specific example of obtaining a facial pose feature from a voice feature of target voice data can be realized by the following embodiment.

In operation S511, mel cepstral coefficients 502 of the target speech data 501 are acquired.

Mel-frequency cepstral coefficients are abbreviated as MFCC. Parameters determined from the MFCC are more robust, better match the auditory characteristics of the human ear, and retain good recognition performance when the signal-to-noise ratio drops.

Illustratively, the MFCC may be obtained by: pre-emphasis → framing → windowing → fast Fourier transform → triangular band-pass filter → mel-frequency filter bank → calculating the logarithmic energy of each filter bank output → obtaining the MFCC by discrete cosine transform. Pre-emphasis can be achieved by passing the target speech data through a high-pass filter. Pre-emphasis boosts the high-frequency part so that the spectrum of the signal becomes flatter across the entire band from low to high frequencies. Pre-emphasis also offsets the effect of the vocal cords and lips during sound production, compensating for the high-frequency part of the voice signal that is suppressed by the vocal system and highlighting the high-frequency formants.
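The chain of steps above can be sketched in pure Python as follows. This is a minimal illustration, not the implementation of the disclosure: the pre-emphasis coefficient 0.97, the Hamming window, the mel-scale formula 2595 * log10(1 + f / 700), the frame sizes, and all function names are assumptions chosen for illustration, and a naive DFT stands in for the fast Fourier transform.

```python
import cmath
import math


def pre_emphasis(signal, alpha=0.97):
    """High-pass filter y[t] = x[t] - alpha * x[t-1]; boosts high frequencies."""
    return [signal[0]] + [signal[t] - alpha * signal[t - 1]
                          for t in range(1, len(signal))]


def frames(signal, frame_len, hop):
    """Split the signal into overlapping frames and apply a Hamming window."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    return [[signal[s + i] * window[i] for i in range(frame_len)]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def power_spectrum(frame):
    """Naive DFT (stand-in for the FFT step); power for bins 0..N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters, frame_len, sample_rate):
    """Triangular band-pass filters spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [int((frame_len + 1) * mel_to_hz(top * j / (num_filters + 1)) / sample_rate)
             for j in range(num_filters + 2)]
    bank = []
    for j in range(1, num_filters + 1):
        left, center, right = edges[j - 1], edges[j], edges[j + 1]
        filt = [0.0] * (frame_len // 2 + 1)
        for k in range(left, center):    # rising slope
            filt[k] = (k - left) / (center - left)
        for k in range(center, right):   # falling slope
            filt[k] = (right - k) / (right - center)
        bank.append(filt)
    return bank


def dct2(values, num_coeffs):
    """Unnormalized DCT-II, keeping the first num_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * q * (j + 0.5) / n)
                for j, v in enumerate(values))
            for q in range(num_coeffs)]


def mfcc(signal, sample_rate, frame_len=64, hop=32, num_filters=10, num_coeffs=5):
    """Return one list of num_coeffs cepstral coefficients per frame."""
    bank = mel_filterbank(num_filters, frame_len, sample_rate)
    features = []
    for frame in frames(pre_emphasis(signal), frame_len, hop):
        spectrum = power_spectrum(frame)
        # Log energy of each filter output; the small constant avoids log(0).
        log_energy = [math.log(sum(f * p for f, p in zip(filt, spectrum)) + 1e-10)
                      for filt in bank]
        features.append(dct2(log_energy, num_coeffs))
    return features


# Usage: a 440 Hz test tone sampled at 8 kHz (values chosen for illustration).
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(400)]
feats = mfcc(signal, 8000)
```

In practice a library routine with a real FFT would replace the naive DFT; the sketch favors readability over speed.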

The speech features of the target speech data may include mel-frequency cepstral coefficients of the target speech data.

In operation S512, a phoneme feature 503 is obtained according to the mel cepstral coefficients 502.

The phoneme features are used to characterize the pronunciation action units. A phoneme feature may be understood as a phoneme characterized by a feature vector. A phoneme is understood to be the smallest unit of speech that is divided according to the natural properties of the speech, and each pronunciation action in a syllable may constitute a phoneme. Therefore, a phoneme is the smallest unit or the smallest speech segment constituting a syllable, and is the smallest linear speech unit divided from the viewpoint of sound quality.

In operation S513, the facial pose feature 504 is obtained from the phoneme features 503.

Since the mel cepstrum coefficient is strongly correlated with the phoneme and the phoneme is strongly correlated with the lip shape, according to the avatar generation method of the embodiment of the present disclosure, the phoneme characteristics strongly correlated with the lip shape can be obtained according to the mel cepstrum coefficient of the target voice data, and the facial pose characteristics obtained from the phoneme characteristics can be mapped to an accurate lip shape.

Fig. 6 schematically shows a schematic diagram of an avatar generation method according to still another embodiment of the present disclosure. In the embodiment of the present disclosure, a specific example of generating an avatar from target speech data is realized by a convolutional neural network Net.

Fig. 6 also shows the timbre identifier inference module A-ID-Inf. Since the target timbre parameters have a corresponding timbre identifier, the timbre identifier inference module A-ID-Inf may be used to determine the corresponding target timbre parameters from the timbre identifier.

Illustratively, for a given timbre identifier A-ID, the corresponding target timbre parameters A-pa may be determined by the timbre identifier inference module A-ID-Inf, and the timbre parameters of A-pa may be matched to the corresponding speech units of the target voice data Aui to obtain the target timbre voice data Aut.

Illustratively, the target timbre voice data Aut may first be divided by applying voice windows; each voice window may be further divided into m voice segments, and n mel cepstral coefficient components may be extracted from each voice segment, yielding a voice feature vector MFCC of the target timbre voice data Aut, where the voice feature vector MFCC has m × n dimensions.

Illustratively, m may take the value of 64 and n may take the value of 32.

On this basis, the timbre identifier A-ID is fused with the voice feature MFCC of the target timbre voice data Aut to obtain the target feature, which serves as the model input Input of the convolutional neural network.

Illustratively, the target feature obtained by fusing the timbre identifier A-ID with the voice feature MFCC of the target timbre voice data Aut may have (m + 1) × n dimensions or m × (n + 1) dimensions.

Illustratively, the timbre identifier A-ID may be a one-hot code.
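The two fused shapes mentioned above can be illustrated by appending a one-hot timbre code to an m × n feature matrix either as an extra row or as an extra column. This sketch assumes the one-hot code is sized to match the row length n (or the column length m), which the text does not state; the helper names `one_hot`, `fuse_as_row`, and `fuse_as_column` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def one_hot(index: int, size: int) -> List[float]:
    """One-hot code of length `size` with a 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("timbre index out of range for the code length")
    return [1.0 if i == index else 0.0 for i in range(size)]


def fuse_as_row(features: Matrix, timbre_index: int) -> Matrix:
    """Append the one-hot code as an extra row: m x n -> (m + 1) x n."""
    n = len(features[0])
    return [list(row) for row in features] + [one_hot(timbre_index, n)]


def fuse_as_column(features: Matrix, timbre_index: int) -> Matrix:
    """Append the one-hot code as an extra column: m x n -> m x (n + 1)."""
    code = one_hot(timbre_index, len(features))
    return [list(row) + [code[i]] for i, row in enumerate(features)]


# m = 64 segments, n = 32 coefficients, as in the illustrative values above.
mfcc_features = [[0.0] * 32 for _ in range(64)]
row_fused = fuse_as_row(mfcc_features, timbre_index=3)        # 65 x 32
column_fused = fuse_as_column(mfcc_features, timbre_index=3)  # 64 x 33
```

Under this assumption the number of distinguishable timbres is bounded by n in the row layout and by m in the column layout.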

Because voice is continuous over short time spans, the avatar generation method of the embodiment of the present disclosure can apply to the target timbre voice data a voice window that covers a plurality of sound frames and extract features from the consecutive sound frames. This allows the features of consecutive sound frames to be learned better and matches the short-term characteristics of voice, so the facial parameters are fitted better.

Illustratively, the speech window may be set to 385 ms.

The model input Input may be fed into the convolutional neural network Net, which may include a speech analysis network N1, a facial pose analysis network N2, a self-supervised network N3, fully connected layers CF, and an output layer OL.

The speech analysis network N1 may be configured to perform speech feature extraction on the n-dimensional or (n + 1)-dimensional features of the model input Input to obtain phoneme features.

The facial pose analysis network N2 may perform feature extraction on the m-dimensional or (m + 1)-dimensional features of the model input Input, analyze the temporal evolution of the features, and output facial pose features.

The self-supervised network N3 may be configured to perform feature splitting on the facial pose feature to obtain a plurality of pose split features, and to determine a split feature correlation parameter based on the plurality of pose split features. The fully connected layers CF may be configured to fit the facial parameters according to the facial pose features and the split feature correlation parameter, where at least two fully connected layers are provided. It can be understood that a single fully connected layer yields only a binary-classification numerical result, and a single value cannot represent the facial parameters; at least two fully connected layers can fit a multidimensional vector, so the facial parameters can be represented by the multidimensional vector fitted by the at least two fully connected layers.

The output layer OL may be used to output the facial parameters. The avatar Vi, in particular the face model of the avatar, may be generated from the facial parameters.

Illustratively, the facial parameters may include blend shape coefficient weights. The blend shape coefficients can be used to represent a parameterized initial face model, and a blend shape coefficient weight is the weight value of a blend shape coefficient, lying between 0 and 1. Adjusting the values of the blend shape coefficient weights adjusts the initial face model into a face model with the corresponding expression.
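To make the role of the weights concrete, the sketch below uses the linear blend shape formulation common in computer graphics, in which each vertex of a neutral face is displaced by a weighted sum of per-shape offsets. The disclosure does not give a formula, so this formulation, the function `apply_blend_shapes`, and the shape name `jaw_open` are assumptions for illustration.

```python
from typing import Dict, List, Sequence

Vertex = List[float]  # [x, y, z]


def apply_blend_shapes(neutral: Sequence[Vertex],
                       deltas: Dict[str, Sequence[Vertex]],
                       weights: Dict[str, float]) -> List[Vertex]:
    """Return neutral + sum_i(w_i * delta_i), with each w_i clamped to [0, 1]."""
    result = [list(v) for v in neutral]  # copy so the initial model is untouched
    for name, weight in weights.items():
        w = min(1.0, max(0.0, weight))
        for vertex, offset in zip(result, deltas[name]):
            for axis in range(3):
                vertex[axis] += w * offset[axis]
    return result


# A two-vertex toy "face" with one hypothetical blend shape.
neutral_face = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
shape_deltas = {"jaw_open": [[0.0, -1.0, 0.0], [0.0, -2.0, 0.0]]}

half_open = apply_blend_shapes(neutral_face, shape_deltas, {"jaw_open": 0.5})
clamped = apply_blend_shapes(neutral_face, shape_deltas, {"jaw_open": 1.7})
```

Clamping reflects the statement that the weights lie between 0 and 1; an out-of-range regression output is pulled back into that interval instead of distorting the face.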

The model parameters of the convolutional neural network Net include network weights. When the convolutional neural network model regresses facial parameters with small values, such as blend shape coefficient weights, the network weights have a large influence on the facial parameters, and in some cases the regressed values are outright abnormal. Taking blend shape coefficient weights as the facial parameters, the split feature correlation parameter can represent the correlation of blend shape coefficient weights among the plurality of pose split features and self-supervises the process of determining the facial parameters from the facial pose feature. This can be understood as learning features of the facial pose feature from a label-free supervision signal provided by the split feature correlation parameter. The facial parameters determined from the facial pose feature and the split feature correlation parameter are therefore more accurate and stable, and the avatar generated from them is more vivid and lifelike.

For example, in the training phase of the convolutional neural network Net, the parts of the facial parameters output by the network that are unrelated to lip-shape changes may be down-weighted. For example, a training sample Ts is passed through the convolutional neural network Net to output facial parameters Fd, a loss value is calculated from the label La of the training sample Ts and the facial parameters Fd, and the weight of the parts unrelated to lip-shape changes is reduced when calculating the loss value. The parts unrelated to lip-shape changes may include, for example, the eyebrows and eyes.
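One simple way to realize such down-weighting is a per-parameter weighted squared error. The sketch below is an assumption-laden illustration: the loss form, the weight 0.1 for non-lip parameters, the function `weighted_mse`, and the parameter names are all hypothetical and are not taken from the disclosure.

```python
from typing import Sequence


def weighted_mse(predicted: Sequence[float],
                 label: Sequence[float],
                 weights: Sequence[float]) -> float:
    """Mean of w_i * (Fd_i - La_i)^2 over all facial parameters."""
    if not (len(predicted) == len(label) == len(weights)):
        raise ValueError("predicted, label and weights must have equal length")
    total = sum(w * (p - y) ** 2 for p, y, w in zip(predicted, label, weights))
    return total / len(predicted)


# Hypothetical facial parameters; only the first two relate to the lips.
names = ["jaw_open", "mouth_pucker", "brow_up", "eye_blink"]
lip_related = {"jaw_open", "mouth_pucker"}
loss_weights = [1.0 if name in lip_related else 0.1 for name in names]

fd = [0.5, 0.2, 0.9, 0.0]  # network output Fd
la = [0.5, 0.2, 0.1, 0.0]  # label La

weighted = weighted_mse(fd, la, loss_weights)
unweighted = weighted_mse(fd, la, [1.0] * len(names))
```

Here the only error is in the eyebrow parameter, so the weighted loss is one tenth of the unweighted loss and the error contributes correspondingly less to training.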

Fig. 7 schematically shows a schematic diagram of generating an avatar according to an avatar generation method of still another embodiment of the present disclosure.

According to an avatar generation method of still another embodiment of the present disclosure, a specific example of generating an avatar according to an avatar parameter may be realized by the following embodiments.

In operation S751, an initial face model 702 is acquired.

The initial face model 702 is generated from the initial face parameters 701.

In operation S752, the initial face parameters 701 of the initial face model 702 are updated according to the face parameters 703, generating the target face model 704.

In operation S753, an avatar 705 is obtained according to the target face model 704.

According to the avatar generation method of the embodiment of the present disclosure, by updating the initial face parameters with the face parameters, the avatar can be generated quickly and efficiently on the basis of the initial face model.
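Operations S751 to S753 can be sketched as follows. This is only an illustration under stated assumptions: the face model is reduced to a dictionary of named parameters, updated values are clamped to [0, 1], and rendering is a placeholder. The class `FaceModel` and all function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FaceModel:
    """Toy face model: a mapping from parameter name to value (assumption)."""
    params: dict = field(default_factory=dict)

def build_initial_face_model(initial_params):
    # S751: the initial face model is generated from the initial face parameters.
    return FaceModel(params=dict(initial_params))

def update_face_model(initial_model, face_params):
    # S752: update the initial face parameters with the predicted face
    # parameters. Clamping to [0, 1] is an assumption, not from the source.
    updated = dict(initial_model.params)
    for name, value in face_params.items():
        updated[name] = min(1.0, max(0.0, value))
    return FaceModel(params=updated)

def render_avatar(target_model):
    # S753: placeholder for rendering; a real system would drive a mesh here.
    return {"type": "avatar", "face": dict(target_model.params)}

initial = build_initial_face_model({"jaw_open": 0.0, "brow_up": 0.0})
target = update_face_model(initial, {"jaw_open": 1.4})
avatar = render_avatar(target)
```

Because only parameter values are overwritten, the initial face model is left untouched and can be reused for each new set of face parameters.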

Fig. 8 schematically shows a diagram of acquiring target voice data in an avatar generation method according to still another embodiment of the present disclosure. Specifically, Fig. 8 illustrates adjusting the initial voice unit duration of each initial voice unit based on the target voice unit duration.

As shown in fig. 8, a specific example of acquiring target voice data can be realized by the following embodiments: performing voice conversion on the initial text data to obtain initial voice data, wherein the initial voice data comprises initial voice units, and each initial voice unit of the initial voice data has an initial voice unit duration and an initial rhythm parameter; and adjusting the initial voice unit duration and the initial rhythm parameters of each initial voice unit based on the target voice unit duration and the target rhythm parameters to obtain target voice data.

The initial voice data is an audio version of the initial text data. An initial voice unit may be understood as a voice unit of the initial voice data, and the initial voice unit duration as the voice unit duration of the initial voice data. The voice unit duration can be understood as the pronunciation duration of the voice unit. The voice units may be, for example, words or phrases.

The initial rhythm parameter is to be understood as a rhythm parameter of the initial voice data.

Illustratively, the rhythm parameter may include at least one of a melody, a frequency, and a tone.

For example, the initial text data may be converted into the initial voice data by a text-to-speech (TTS) model.

Illustratively, the initial rhythm parameter of each initial voice unit may be adjusted using a speech style conversion model.

It is understood that the target voice data has the target voice unit duration and the target rhythm parameter.

The avatar generation method of the embodiment of the present disclosure supports generating the avatar driven by initial text data, which expands the application scenarios of the avatar. Because the initial voice data is obtained by performing voice conversion on the initial text data, it is free of interfering noise and is "clean" voice data. After the initial voice unit duration and the initial rhythm parameter of each initial voice unit are adjusted based on the target voice unit duration and the target rhythm parameter, the resulting target voice data is both clean and matched to the target voice unit duration and the target rhythm parameter.

The lip shapes of an avatar generated from such "clean" target voice data are more accurate, and abnormal lip shapes and abnormal mouth closure on plosives can at least be reduced.

Since the voice unit duration is associated with the lip-change sequence, an avatar generated from the target voice data matched to the target voice unit duration may at least improve lip-sequence stability.

An avatar generated from target voice data matched to the target rhythm parameter can output the target voice data with the target rhythm parameter, which alleviates the avatar producing mechanical, machine-like speech, improves the simulation effect of the avatar, and improves the user's immersive experience.

The avatar generation method of the embodiment of the present disclosure can be applied to scenarios such as avatar facial mouth shape capture, avatar singing, film and television animation, and interactive game entertainment. Because the lip shapes of the avatar generated by the method are more accurate and the lip shape sequence is more stable, the method achieves a better avatar simulation effect and can improve the user's immersive experience. In a live scenario, for example, the method can also replace complex and expensive facial mouth shape capture equipment, reducing equipment investment as well as the labor cost of correcting abnormal avatar lip shapes afterwards.

Fig. 8 schematically shows the adjustment of the initial voice unit duration of each initial voice unit. Five voice units A, B, C, D and E are shown, together with five target voice unit durations t1', t2', t3', t4' and t5', and five initial voice unit durations t1, t2, t3, t4 and t5 of the initial voice data Ai. It can be understood that, for each initial voice unit, the initial voice unit duration is adjusted based on the corresponding target voice unit duration until the two are consistent, whereby the target voice data At is obtained.
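The per-unit duration adjustment can be sketched as below. The sketch assumes each voice unit is a list of audio samples and uses plain linear-interpolation resampling, which is a stand-in technique: it also shifts pitch, so a practical system would more likely use a pitch-preserving time-stretching method. The functions `stretch_unit` and `adjust_durations` are hypothetical.

```python
def stretch_unit(samples, target_len):
    """Linearly resample one voice unit to exactly target_len samples."""
    if target_len <= 0 or not samples:
        raise ValueError("need a non-empty unit and a positive target length")
    if len(samples) == 1 or target_len == 1:
        return [samples[0]] * target_len
    scale = (len(samples) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out

def adjust_durations(units, target_durations, sample_rate):
    """Stretch or compress each unit so it lasts its target duration (seconds)."""
    if len(units) != len(target_durations):
        raise ValueError("one target duration is required per voice unit")
    adjusted = []
    for samples, seconds in zip(units, target_durations):
        target_len = max(1, round(seconds * sample_rate))
        adjusted.append(stretch_unit(samples, target_len))
    return adjusted

# Toy example at an assumed 10 Hz sample rate: unit A lasts 0.2 s, unit B 0.4 s.
unit_a = [0.0, 1.0]
unit_b = [1.0, 1.0, 1.0, 1.0]
target_units = adjust_durations([unit_a, unit_b], [0.3, 0.2], sample_rate=10)
```

Here unit A is lengthened from t1 = 0.2 s to t1' = 0.3 s and unit B is shortened from t2 = 0.4 s to t2' = 0.2 s; concatenating the adjusted units would give the duration-matched voice data.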

The avatar generation method of the embodiment of the present disclosure can also be applied to music audio such as songs. The initial voice unit duration and the initial rhythm parameter of each initial voice unit are adjusted based on the target voice unit duration and the target rhythm parameter of the music audio, yielding target voice data whose voice unit durations and rhythm parameters match the music audio. The lip shape sequence of the avatar generated from this rhythmic target voice data is more stable and accurate, giving a better avatar simulation effect. For example, the generated avatar can also be used to provide audiovisual services such as singing.

Fig. 9 schematically shows a block diagram of an avatar generation apparatus according to an embodiment of the present disclosure.

As shown in fig. 9, the avatar generation apparatus 900 of the embodiment of the present disclosure includes, for example, a target voice data determination module 910, a target tone color voice data determination module 920, a target feature determination module 930, an avatar parameter determination module 940, and an avatar generation module 950.

The target voice data determination module 910 is configured to acquire target voice data.

The target tone color voice data determination module 920 is configured to obtain target tone color voice data based on the target voice data and target tone color parameters, wherein the target tone color parameters have a corresponding tone color identifier.

The target feature determination module 930 is configured to fuse the tone color identifier with voice features of the target tone color voice data to obtain a target feature.

The avatar parameter determination module 940 is configured to determine an image parameter for the avatar according to the target feature.

The avatar generation module 950 is configured to generate the avatar according to the image parameter.
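The data flow between modules 910 to 950 can be sketched as a simple composition of callables. Everything below is a hypothetical illustration: the class `AvatarGenerationApparatus`, its field names, and the stand-in lambdas are assumptions that only show how each module's output feeds the next, not how any module is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class AvatarGenerationApparatus:
    acquire_voice: Callable[[Any], Any]          # module 910
    apply_tone_color: Callable[[Any, Any], Any]  # module 920
    fuse: Callable[[Any, Any], Any]              # module 930
    to_image_params: Callable[[Any], Any]        # module 940
    render: Callable[[Any], Any]                 # module 950

    def generate(self, source, tone_color_params, tone_color_id):
        voice = self.acquire_voice(source)
        toned = self.apply_tone_color(voice, tone_color_params)
        target_feature = self.fuse(tone_color_id, toned)
        image_params = self.to_image_params(target_feature)
        return self.render(image_params)

# Stand-in callables so the wiring can be exercised; none reflects a real model.
apparatus = AvatarGenerationApparatus(
    acquire_voice=lambda src: list(src),
    apply_tone_color=lambda voice, gain: [v * gain for v in voice],
    fuse=lambda tone_id, voice: [tone_id] + voice,
    to_image_params=lambda feat: {"jaw_open": min(1.0, sum(feat[1:]))},
    render=lambda params: {"avatar": params},
)

result = apparatus.generate([0.1, 0.2], 2.0, 7)
```

Keeping each module behind a callable interface means any stage, for example the tone color conversion, can be swapped without touching the others.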

According to an embodiment of the present disclosure, the tone color identifier is used to identify a specified pronunciation style and/or a specified pronunciation object.

According to an embodiment of the present disclosure, the target voice data includes at least one voice unit, the target tone color parameters include a tone color parameter for each voice unit, and the target tone color voice data determination module includes a target tone color voice data determination submodule.

The target tone color voice data determination submodule is configured to match the tone color parameters to the corresponding voice units to obtain the target tone color voice data.

According to an embodiment of the present disclosure, the image parameters include facial parameters, and the avatar parameter determination module includes a facial posture feature determination submodule, a posture split feature determination submodule, a split feature correlation parameter determination submodule, and a facial parameter determination submodule.

The facial posture feature determination submodule is configured to obtain facial posture features according to the target feature.

The posture split feature determination submodule is configured to perform feature splitting on the facial posture features to obtain a plurality of posture split features.

The split feature correlation parameter determination submodule is configured to determine a split feature correlation parameter based on the plurality of posture split features, wherein the split feature correlation parameter is used to characterize the correlation between the plurality of posture split features.

The facial parameter determination submodule is configured to determine the facial parameters according to the split feature correlation parameter and the facial posture features.
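Feature splitting and a correlation between the split features can be illustrated with a minimal sketch. This section does not state how the split feature correlation parameter is computed, so the Pearson correlation matrix used here is a swapped-in stand-in, and the equal-size splitting as well as the function names are assumptions.

```python
import math

def split_features(feature, num_parts):
    """Split a flat feature vector into num_parts equal-size chunks."""
    if num_parts <= 0 or len(feature) % num_parts != 0:
        raise ValueError("feature length must be divisible by num_parts")
    size = len(feature) // num_parts
    return [feature[i * size:(i + 1) * size] for i in range(num_parts)]

def pearson(a, b):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def correlation_matrix(parts):
    """Pairwise correlation between all split features."""
    return [[pearson(p, q) for q in parts] for p in parts]

# Toy facial posture feature, split into three posture split features.
pose_feature = [1.0, 2.0, 3.0, 2.0, 4.0, 6.0, 3.0, 2.0, 1.0]
parts = split_features(pose_feature, 3)
corr = correlation_matrix(parts)
```

In this toy input the first two split features move together and the third moves in the opposite direction, which the matrix reflects with entries near +1 and -1 respectively.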

According to an embodiment of the present disclosure, the facial posture feature determination submodule includes a Mel cepstrum coefficient determination unit, a phoneme feature determination unit, and a facial posture feature determination unit.

The Mel cepstrum coefficient determination unit is configured to acquire Mel cepstrum coefficients of the target feature.

The phoneme feature determination unit is configured to obtain phoneme features according to the Mel cepstrum coefficients, wherein the phoneme features represent pronunciation action units.

The facial posture feature determination unit is configured to obtain the facial posture features according to the phoneme features.

According to an embodiment of the present disclosure, the avatar generation module includes an initial face model determination submodule, a target face model determination submodule, and an avatar generation submodule.

The initial face model determination submodule is configured to acquire an initial face model, wherein the initial face model is generated from initial face parameters.

The target face model determination submodule is configured to update the initial face parameters of the initial face model according to the face parameters to generate a target face model.

The avatar generation submodule is configured to obtain the avatar according to the target face model.

According to an embodiment of the present disclosure, the target voice data determination module includes an initial voice data determination submodule and a target voice data determination submodule.

The initial voice data determination submodule is configured to perform voice conversion on initial text data to obtain initial voice data, wherein the initial voice data includes initial voice units, and each initial voice unit of the initial voice data has an initial voice unit duration and an initial rhythm parameter.

The target voice data determination submodule is configured to adjust the initial voice unit duration and the initial rhythm parameter of each initial voice unit based on the target voice unit duration and the target rhythm parameter to obtain the target voice data.

It should be understood that the embodiments of the apparatus part of the present disclosure are the same as or similar to the embodiments of the method part of the present disclosure, and the technical problems solved and the technical effects achieved are likewise the same or similar, so they are not described again here.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

Fig. 10 illustrates a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in Fig. 10, the device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a random access memory (RAM) 1003. The RAM 1003 can also store various programs and data necessary for the operation of the device 1000. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.

A number of components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and a communication unit 1009 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

The computing unit 1001 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1001 executes the methods and processes described above, such as the avatar generation method. For example, in some embodiments, the avatar generation method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the avatar generation method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the avatar generation method in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. An avatar generation method, comprising:

acquiring target voice data;

obtaining target tone color voice data based on the target voice data and target tone color parameters, wherein the target tone color parameters have a corresponding tone color identifier;

fusing the tone color identifier with voice features of the target tone color voice data to obtain a target feature;

determining an image parameter for the avatar according to the target feature; and

generating the avatar according to the image parameter.

2. The method of claim 1, wherein the tone color identifier is used to identify a specified pronunciation style and/or a specified pronunciation object.

3. The method of claim 1, wherein the target voice data comprises at least one voice unit, the target tone color parameters comprise a tone color parameter for each voice unit, and the obtaining target tone color voice data based on the target voice data and target tone color parameters comprises:

matching the tone color parameters to the corresponding voice units to obtain the target tone color voice data.

4. The method of claim 1, wherein the image parameter includes facial parameters, and wherein the determining an image parameter for the avatar according to the target feature comprises:

obtaining facial posture features according to the target feature;

performing feature splitting on the facial posture features to obtain a plurality of posture split features;

determining a split feature correlation parameter based on the plurality of posture split features, wherein the split feature correlation parameter is used to characterize a correlation between the plurality of posture split features; and

determining the facial parameters according to the split feature correlation parameter and the facial posture features.

5. The method of claim 4, wherein the obtaining facial posture features according to the target feature comprises:

acquiring Mel cepstrum coefficients of the target feature;

obtaining phoneme features according to the Mel cepstrum coefficients; and

obtaining the facial posture features according to the phoneme features.

6. The method of claim 4, wherein the generating the avatar according to the image parameter comprises:

acquiring an initial face model, wherein the initial face model is generated according to initial face parameters;

updating the initial face parameters of the initial face model according to the face parameters to generate a target face model; and

obtaining the avatar according to the target face model.

7. The method of any of claims 1-6, wherein the acquiring target voice data comprises:

performing voice conversion on initial text data to obtain initial voice data, wherein the initial voice data comprises initial voice units, and each initial voice unit of the initial voice data has an initial voice unit duration and an initial rhythm parameter; and

adjusting the initial voice unit duration and the initial rhythm parameter of each initial voice unit based on a target voice unit duration and a target rhythm parameter to obtain the target voice data.

8. An avatar generation apparatus, comprising:

a target voice data determination module, configured to acquire target voice data;

a target tone color voice data determination module, configured to obtain target tone color voice data based on the target voice data and target tone color parameters, wherein the target tone color parameters have a corresponding tone color identifier;

a target feature determination module, configured to fuse the tone color identifier with voice features of the target tone color voice data to obtain a target feature;

an avatar parameter determination module, configured to determine an image parameter for the avatar according to the target feature; and

an avatar generation module, configured to generate the avatar according to the image parameter.

9. The apparatus of claim 8, wherein the tone color identifier is used to identify a specified pronunciation style and/or a specified pronunciation object.

10. The apparatus of claim 8, wherein the target voice data comprises at least one voice unit, the target tone color parameters comprise a tone color parameter for each voice unit, and the target tone color voice data determination module comprises:

a target tone color voice data determination submodule, configured to match the tone color parameters to the corresponding voice units to obtain the target tone color voice data.

11. The apparatus of claim 8, wherein the image parameter includes facial parameters, and the avatar parameter determination module comprises:

a facial posture feature determination submodule, configured to obtain facial posture features according to the target feature;

a posture split feature determination submodule, configured to perform feature splitting on the facial posture features to obtain a plurality of posture split features;

a split feature correlation parameter determination submodule, configured to determine a split feature correlation parameter based on the plurality of posture split features, wherein the split feature correlation parameter is used to characterize a correlation between the plurality of posture split features; and

a facial parameter determination submodule, configured to determine the facial parameters according to the split feature correlation parameter and the facial posture features.

12. The apparatus of claim 11, wherein the facial posture feature determination submodule comprises:

a Mel cepstrum coefficient determination unit, configured to acquire Mel cepstrum coefficients of the target feature;

a phoneme feature determination unit, configured to obtain phoneme features according to the Mel cepstrum coefficients; and

a facial posture feature determination unit, configured to obtain the facial posture features according to the phoneme features.

13. The apparatus of claim 11, wherein the avatar generation module comprises:

an initial face model determination submodule, configured to acquire an initial face model, wherein the initial face model is generated according to initial face parameters;

a target face model determination submodule, configured to update the initial face parameters of the initial face model according to the face parameters to generate a target face model; and

an avatar generation submodule, configured to obtain the avatar according to the target face model.

14. The apparatus of any of claims 8-13, wherein the target voice data determination module comprises:

an initial voice data determination submodule, configured to perform voice conversion on initial text data to obtain initial voice data, wherein the initial voice data comprises initial voice units, and each initial voice unit of the initial voice data has an initial voice unit duration and an initial rhythm parameter; and

a target voice data determination submodule, configured to adjust the initial voice unit duration and the initial rhythm parameter of each initial voice unit based on a target voice unit duration and a target rhythm parameter to obtain the target voice data.

15. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.

16. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.

17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.