CN111369967A - Virtual character-based voice synthesis method, device, medium and equipment


Publication number
CN111369967A
Authority
CN
China
Prior art keywords
music
features
training
acoustic
text
Prior art date
Legal status
Granted
Application number
CN202010167707.7A
Other languages
Chinese (zh)
Other versions
CN111369967B (en)
Inventor
殷翔
顾宇
Current Assignee
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to CN202010167707.7A
Publication of CN111369967A
Application granted
Publication of CN111369967B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172 Classification, e.g. identification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10 Transforming into visible information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure relates to a virtual character-based speech synthesis method, apparatus, medium, and device. The method comprises: acquiring speech feature information corresponding to a text to be synthesized and acquiring music feature information for performing speech synthesis on the text to be synthesized; inputting the speech feature information and the music feature information into a speech synthesis model to obtain acoustic features and facial image features corresponding to the text to be synthesized, wherein the sequences of the acoustic features and the facial image features are aligned; obtaining audio information corresponding to the text to be synthesized according to the acoustic features; and outputting the audio information on the virtual character while controlling the display of the face state of the virtual character according to the facial image features. This effectively avoids the problem of the voice output state and the face state of the virtual character being displayed inconsistently, improves the accuracy of speech synthesis, and improves the user experience.

Description

Virtual character-based voice synthesis method, device, medium and equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a medium, and a device for synthesizing a voice based on a virtual character.
Background
Nowadays, with the rapid development of computer technology, applications of virtual characters are increasing. When a virtual character is driven to synthesize singing data, the synthesized speech is generally output directly through the virtual character while the virtual character is controlled to cycle through different facial expressions. In such a solution, however, the mouth of the facial expression may be closed at the very moment a voiced sound is being output, so that the facial expression state is inconsistent with the voice output state.
No satisfactory solution to this problem currently exists.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a method for virtual character-based speech synthesis, the method comprising:
acquiring voice characteristic information corresponding to a text to be synthesized and acquiring music characteristic information for performing voice synthesis on the text to be synthesized;
inputting the voice feature information and the music theory feature information into a voice synthesis model to obtain acoustic features and facial image features corresponding to the text to be synthesized, wherein the acoustic features and the facial image features are aligned in sequence;
obtaining audio information corresponding to the text to be synthesized according to the acoustic features;
and outputting the audio information on the virtual character, and controlling the face state display of the virtual character according to the face image characteristics.
In a second aspect, the present disclosure provides a virtual character-based speech synthesis apparatus, the apparatus comprising:
a first acquisition module, configured to acquire speech feature information corresponding to a text to be synthesized and to acquire music feature information for performing speech synthesis on the text to be synthesized;
a first input module, configured to input the speech feature information and the music feature information into a speech synthesis model and obtain acoustic features and facial image features corresponding to the text to be synthesized, wherein the sequences of the acoustic features and the facial image features are aligned;
the first processing module is used for obtaining audio information corresponding to the text to be synthesized according to the acoustic characteristics;
and the second processing module is used for outputting the audio information on the virtual character and controlling the face state of the virtual character to be displayed according to the face image characteristics.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, implements the steps of a virtual character-based speech synthesis method.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the avatar-based speech synthesis method.
In the above technical solution, speech feature information corresponding to a text to be synthesized and music feature information for performing speech synthesis on the text are acquired; the speech feature information and the music feature information are input into a speech synthesis model to obtain acoustic features and facial image features corresponding to the text to be synthesized, the sequences of which are aligned; audio information corresponding to the text to be synthesized is obtained according to the acoustic features; and the audio information is output on the virtual character while the display of the face state of the virtual character is controlled according to the facial image features. With this technical solution, the speech synthesis model synthesizes the acoustic features and the image features simultaneously, yielding acoustic features and facial image features that are aligned in sequence. This improves how well the facial image features match the acoustic features, effectively avoids inconsistency between the voice output state and the displayed face state of the virtual character, improves the accuracy of speech synthesis, and improves the user experience.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
In the drawings:
FIG. 1 is a flow diagram of a method for avatar-based speech synthesis provided in accordance with one embodiment of the present disclosure;
FIG. 2 is a flow diagram of a speech synthesis model training process provided in accordance with one embodiment of the present disclosure;
FIG. 3 is a block diagram of a virtual character-based speech synthesis apparatus provided in accordance with one embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device provided according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Fig. 1 is a flowchart illustrating a virtual character-based speech synthesis method according to an embodiment of the present disclosure, where as shown in fig. 1, the method includes:
in S11, speech feature information corresponding to the text to be synthesized and music feature information for performing speech synthesis on the text to be synthesized are acquired.
The speech feature information includes phoneme information, tone information, lyric melody information, lyric beat information, and vibrato information, and the music feature information includes music melody information and music beat information. For example, each piece of information can be represented by a label (Label).
A phoneme is the smallest speech unit divided according to the natural attributes of speech; it is analyzed according to the articulatory actions within a syllable, with one action forming one phoneme. Phonemes fall into two major categories: vowels and consonants. For Chinese, the phonemes include initials (consonants that precede a final and form a complete syllable together with it) and finals (i.e., vowels); for English, the phonemes include vowels and consonants. For example, the text "hello" corresponds to the phonemes "nihao", and each phoneme, for example n, may be labeled as "n: n 1", where the label also records the kind of phoneme, for example 1 for an initial, 2 for a final, and 3 for a zero initial. As another example, the lyric beat information can be labeled as "Ia: ia1", where the number following "ia" indicates the beat type (in the example above, that the current number of beats is 8). There may also be multiple beats within one phoneme, and the proportions of the phoneme occupied by different beats may be represented by "Ia, Ib". The remaining information is labeled in a similar manner and is not described here again.
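As a purely illustrative sketch of this labeling idea, the following snippet attaches such labels to the phonemes of "nihao"; the field names, the dictionary layout, and the serialization format are assumptions that only mirror the "n: n 1" and "Ia: ia1" examples above, not a Label format defined by this disclosure.

```python
# Hypothetical per-phoneme labels for "nihao" ("hello").
# kind: 1 = initial, 2 = final, 3 = zero initial (as in the example above).
phoneme_labels = [
    {"phoneme": "n",  "kind": 1, "tone": 3, "beat": "ia1"},
    {"phoneme": "i",  "kind": 2, "tone": 3, "beat": "ia1"},
    {"phoneme": "h",  "kind": 1, "tone": 3, "beat": "ib1"},
    {"phoneme": "ao", "kind": 2, "tone": 3, "beat": "ib1"},
]

def to_label_string(record):
    """Serialize one record into a compact 'phoneme: phoneme kind beat' label."""
    return f'{record["phoneme"]}: {record["phoneme"]} {record["kind"]} {record["beat"]}'

print([to_label_string(r) for r in phoneme_labels])
# ['n: n 1 ia1', 'i: i 2 ia1', 'h: h 1 ib1', 'ao: ao 2 ib1']
```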
Tone refers to the variation in pitch of a sound. Illustratively, Chinese has four tones: yin ping (high level), yang ping (rising), shang sheng (falling-rising), and qu sheng (falling); English distinguishes primary stress, secondary stress, and unstressed syllables; Japanese distinguishes stressed (accented) and unstressed syllables.
In the present disclosure, the speech feature information of the text to be synthesized may be acquired by an information extraction model. Optionally, the speech feature information further comprises one or more of: dynamics (strength) information, rhythm information, tempo information, lyric measure information, and lyric paragraph information; the music feature information further comprises one or more of: music measure information and music paragraph information.
Similarly, as described above, the dynamics information, the rhythm information, the tempo information, the lyric measure information, the lyric paragraph information, the music measure information, and the music paragraph information may all be represented by labels in a unified manner. The label for each piece of information may be preset, or the information may be extracted through the information extraction model. The information extraction model can be obtained by labeling training data for the text to be synthesized and training on that data with any machine learning approach, which is not limited in the present disclosure and is not described here again.
Optionally, an exemplary implementation of obtaining the music feature information for performing speech synthesis on the text to be synthesized is as follows; this step may include:
receiving a music selection instruction, and determining music corresponding to the music selection instruction as target music;
extracting the music characteristic information from the target music;
or inputting the music score data to be synthesized into an information extraction model to obtain the music theory characteristic information.
In one embodiment, the user may select the music to be synthesized through a music selection instruction, so that after the target music is determined according to the music selection instruction, the music feature information can be extracted from it. For example, the music melody information may be determined by identifying notes from the dominant frequency obtained via a short-time Fourier transform (STFT). The manner of extracting music feature information from music is known in the art and is not described here again.
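As an illustration of the note-from-dominant-frequency idea mentioned above, here is a minimal sketch assuming a mono waveform sampled at `sr` Hz; the framing parameters and the simple spectral-peak picking are assumptions made for clarity, and a production system would typically use a more robust pitch tracker.

```python
import numpy as np

def dominant_notes(waveform, sr, frame_len=2048, hop=512):
    """Estimate one MIDI note per frame from the dominant STFT frequency."""
    window = np.hanning(frame_len)
    notes = []
    for start in range(0, len(waveform) - frame_len, hop):
        frame = waveform[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))
        peak_bin = int(np.argmax(spectrum[1:])) + 1   # skip the DC bin
        freq = peak_bin * sr / frame_len              # dominant frequency in Hz
        midi = 69 + 12 * np.log2(freq / 440.0)        # map Hz to a MIDI note number
        notes.append(int(round(midi)))
    return notes

# Usage: a 440 Hz sine maps to MIDI note 69 (A4).
sr = 16000
t = np.arange(sr) / sr
print(dominant_notes(np.sin(2 * np.pi * 440 * t), sr)[:5])
```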
In another embodiment, the music feature information may be determined directly from music score data, which may be written by the user. For example, the information extraction model used for this purpose can be obtained by labeling training data corresponding to music scores and training on it with any machine learning approach.
Therefore, with this technical solution, the text to be synthesized can either be synthesized with existing music or have its music feature information determined from music score data, which further increases the diversity of the music feature information and meets diversified speech synthesis needs.
In S12, the speech feature information and the music feature information are input to the speech synthesis model, and the acoustic feature and the face image feature corresponding to the text to be synthesized are obtained, the acoustic feature and the face image feature being aligned in sequence.
Here, the acoustic features and the facial image features being "aligned in sequence" means that the sequence of acoustic features and the sequence of facial image features are aligned in time, so that the acoustic features and the facial image features match.
Optionally, the speech synthesis model is obtained by jointly training an image sub-model and an acoustic sub-model to align the acoustic features and the sequence of facial image features.
Illustratively, the acoustic feature may be a mel-frequency spectral feature. In this embodiment, through the speech synthesis model obtained by the joint training of the image sub-model and the acoustic sub-model, the acoustic features and the face image features aligned in sequence can be obtained, and the matching degree of the face image features and the acoustic features is improved.
In S13, audio information corresponding to the text to be synthesized is obtained from the acoustic features.
For example, the acoustic features may include a singing feature (e.g., a singing mel-spectrum feature) corresponding to the speech feature information and an accompaniment feature (e.g., an accompaniment mel-spectrum feature) corresponding to the music feature information. In this embodiment, a vocoder may first generate singing waveform data from the singing mel-spectrum feature and accompaniment waveform data from the accompaniment mel-spectrum feature, and the audio information is then obtained by mixing the two. For example, the singing waveform data and the accompaniment waveform data may be treated as separate channels; a data segment is taken from each channel, and the corresponding data segments are superimposed and stored, thereby obtaining the audio information.
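A minimal sketch of this mixing step, assuming the vocoder has already produced two mono waveforms at the same sample rate; the segment length and the plain additive mix with clipping are assumptions made for illustration rather than part of the disclosure.

```python
import numpy as np

def mix_singing_and_accompaniment(singing, accompaniment, segment=4096):
    """Superimpose two mono waveforms segment by segment into one track."""
    length = min(len(singing), len(accompaniment))
    mixed = np.zeros(length, dtype=np.float32)
    for start in range(0, length, segment):
        end = min(start + segment, length)
        # Take the corresponding segment from each "channel" and superimpose them.
        mixed[start:end] = singing[start:end] + accompaniment[start:end]
    return np.clip(mixed, -1.0, 1.0)   # keep the result in a valid audio range

# Usage with dummy waveforms in [-1, 1].
sing = np.random.uniform(-0.5, 0.5, 16000).astype(np.float32)
accomp = np.random.uniform(-0.5, 0.5, 16000).astype(np.float32)
audio = mix_singing_and_accompaniment(sing, accomp)
```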
In S14, the audio information is output on the virtual character, and the display of the face state of the virtual character is controlled in accordance with the face image feature.
The virtual character can be an animation character or a virtual simulation character. One embodiment of controlling the display of the face state of a virtual character based on facial image features is as follows: the facial image features are input to the image processing model, and data for face state display is obtained, whereby the face state display of the virtual character is controlled by the data. For example, the image processing model may be pre-constructed based on real-person facial expression data and virtual character data, wherein the image processing model may be constructed based on any existing machine learning manner, and will not be described herein again.
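Putting steps S11 to S14 together, the following is a minimal end-to-end sketch of the inference flow; every callable here (extract_speech_features, extract_music_features, speech_synthesis_model, vocoder, render_face) is a hypothetical placeholder standing in for the models described above, not an API defined by this disclosure.

```python
def synthesize_on_virtual_character(text, target_music,
                                    extract_speech_features, extract_music_features,
                                    speech_synthesis_model, vocoder, render_face):
    """End-to-end sketch of steps S11 to S14 with placeholder callables."""
    # S11: acquire speech feature information and music feature information.
    speech_feats = extract_speech_features(text)
    music_feats = extract_music_features(target_music)

    # S12: a single model yields sequence-aligned acoustic and facial image features.
    acoustic_feats, face_feats = speech_synthesis_model(speech_feats, music_feats)

    # S13: convert the acoustic features (e.g., mel spectra) into an audio waveform.
    audio = vocoder(acoustic_feats)

    # S14: the caller plays `audio` on the virtual character while driving the
    # face state display with the time-aligned facial image features.
    render_face(face_feats)
    return audio, face_feats
```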
In the above technical solution, speech feature information corresponding to a text to be synthesized and music feature information for performing speech synthesis on the text are acquired; the speech feature information and the music feature information are input into a speech synthesis model to obtain acoustic features and facial image features corresponding to the text to be synthesized, the sequences of which are aligned; audio information corresponding to the text to be synthesized is obtained according to the acoustic features; and the audio information is output on the virtual character while the display of the face state of the virtual character is controlled according to the facial image features. With this technical solution, the speech synthesis model synthesizes the acoustic features and the image features simultaneously, yielding acoustic features and facial image features that are aligned in sequence. This improves how well the facial image features match the acoustic features, effectively avoids inconsistency between the voice output state and the displayed face state of the virtual character, improves the accuracy of speech synthesis, and improves the user experience.
Optionally, the acoustic submodel is configured to jointly process the speech feature information and the music feature information to obtain acoustic features corresponding to the singing and to the accompaniment, respectively. The acoustic submodel may include a singing submodel and an accompaniment submodel, and is trained by feeding back a combined error determined from the error of the singing submodel and the error of the accompaniment submodel, so that the acoustic features corresponding to the singing match the acoustic features of the accompaniment, further improving the accuracy of speech synthesis.
In this technical solution, the speech feature information and the music feature information are processed jointly, so that the singing acoustic features and the accompaniment acoustic features can be generated together and can vary with different user styles, which improves the accuracy of speech synthesis. Moreover, this improves the completeness of singing data synthesis and widens the application range of the method.
Optionally, the speech synthesis model is obtained as follows, as shown in fig. 2:
in S21, an input sample set including a text sample and a music sample and an output sample set including recorded audio data and image data corresponding to the text sample and the music sample are obtained. For example, a video in which a plurality of persons sing a text sample may be prerecorded, so that audio data and image data may be separated from the video.
In S22, the text sample and the music sample are input into the information extraction model to obtain speech feature information corresponding to the text sample and music feature information corresponding to the music sample. The manner of determining the speech feature information and the music feature information is described in detail above, and is not described herein again.
In S23, the speech feature information corresponding to the text sample and the music feature information corresponding to the music sample are input into the speech synthesis model to obtain training acoustic features, training facial image features, and a stop flag, where the stop flag indicates the end of the sequence corresponding to the training acoustic features and the training facial image features. The speech synthesis model may include an acoustic submodel and an image submodel: the training acoustic features may be output by the acoustic submodel and the training facial image features by the image submodel, and because the image submodel and the acoustic submodel are jointly trained, the stop flag marking the end of both sequences can be output at the same time.
The acoustic submodel of the speech synthesis model may be a sequence-to-sequence (Seq2Seq) model with an attention mechanism, that is, a model whose input is a sequence and whose output is a sequence. Illustratively, the attention network may be a Gaussian Mixture Model (GMM) attention network. For example, a representation sequence corresponding to the speech feature information of the acquired text sample and a representation sequence corresponding to the facial image features can be obtained in advance through an encoding network, and the attention network can generate a fixed-length semantic representation from these representation sequences.
In particular, the encoding network may include an embedding layer, a pre-processing network (Pre-net) submodel, and a CBHG submodel. The speech feature information is first converted into a vector by the embedding layer; the vector is then fed into the Pre-net submodel, which applies a nonlinear transformation to improve the convergence and generalization of the speech synthesis model; finally, the CBHG submodel produces the corresponding representation sequence from the nonlinearly transformed vector.
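A minimal PyTorch sketch of such an encoding network, assuming the label sequence has already been mapped to integer ids; the layer sizes are assumptions, and the CBHG submodel (convolution bank, highway layers, and bidirectional GRU) is reduced to a single bidirectional GRU for brevity.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Embedding + Pre-net + (simplified) CBHG producing a representation sequence."""
    def __init__(self, vocab_size, embed_dim=256, prenet_dim=128, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.prenet = nn.Sequential(              # nonlinear transform with dropout
            nn.Linear(embed_dim, prenet_dim), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(prenet_dim, prenet_dim), nn.ReLU(), nn.Dropout(0.5),
        )
        # Stand-in for the CBHG submodel: a bidirectional GRU over the sequence.
        self.rnn = nn.GRU(prenet_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, label_ids):                 # (batch, time) integer label ids
        x = self.prenet(self.embedding(label_ids))
        representation, _ = self.rnn(x)           # (batch, time, 2 * hidden_dim)
        return representation
```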
The decoding network of the acoustic submodel of the above speech synthesis model may include a pre-processing network (Pre-net) submodel, an Attention-RNN (Recurrent Neural Network), and a Decoder-RNN. The Pre-net has the same structure as the Pre-net in the encoding network and applies a nonlinear transformation to the previous output frame (or the initial frame). The Attention-RNN may be an LSTM (Long Short-Term Memory) network that takes the output of the decoder Pre-net (i.e., the nonlinearly transformed previous frame) as input and passes its output through LSTM units to the Decoder-RNN. The Decoder-RNN can output a fixed-length semantic representation and then output the training acoustic features and the stop flags (stop tokens) through LSTM units.
Optionally, the speech synthesis model may further include a post-processing network (postnet), and the training acoustic features output by the Decoder-RNN may be input into the postnet. The postnet may be a convolutional network used to predict a residual for the training acoustic features output by the Decoder-RNN; the residual is then added to the vector originally input into the postnet (i.e., the training acoustic features) to obtain the final training acoustic features, which can further improve the accuracy of the acoustic features output by the model.
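The following sketch, under the same assumptions as the encoder sketch above (encoder output dimension enc_dim, e.g. 256), shows one autoregressive decoding step with an Attention-RNN, a Decoder-RNN, a stop-token head, and a postnet residual correction. Plain dot-product attention is used as a stand-in for the GMM attention mentioned above, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStep(nn.Module):
    """One decoding step: Pre-net -> Attention-RNN -> Decoder-RNN -> frame + stop token."""
    def __init__(self, mel_dim=80, enc_dim=256, prenet_dim=128, rnn_dim=256):
        super().__init__()
        self.prenet = nn.Sequential(
            nn.Linear(mel_dim, prenet_dim), nn.ReLU(), nn.Dropout(0.5))
        self.attention_rnn = nn.LSTMCell(prenet_dim + enc_dim, rnn_dim)
        self.query_proj = nn.Linear(rnn_dim, enc_dim)     # attention query
        self.decoder_rnn = nn.LSTMCell(rnn_dim + enc_dim, rnn_dim)
        self.frame_proj = nn.Linear(rnn_dim, mel_dim)     # acoustic frame (e.g., mel)
        self.stop_proj = nn.Linear(rnn_dim, 1)            # stop-token logit
        self.postnet = nn.Sequential(                     # residual correction
            nn.Conv1d(mel_dim, 256, kernel_size=5, padding=2), nn.Tanh(),
            nn.Conv1d(256, mel_dim, kernel_size=5, padding=2))

    def forward(self, prev_frame, memory, states):
        # prev_frame: (B, mel_dim); memory: (B, T_enc, enc_dim) from the encoder.
        # states: ((h_att, c_att), (h_dec, c_dec), context), zero-initialized at step 0.
        (h_att, c_att), (h_dec, c_dec), context = states
        x = torch.cat([self.prenet(prev_frame), context], dim=-1)
        h_att, c_att = self.attention_rnn(x, (h_att, c_att))
        # Dot-product attention over the encoder representation sequence.
        scores = torch.bmm(memory, self.query_proj(h_att).unsqueeze(-1)).squeeze(-1)
        context = torch.bmm(F.softmax(scores, dim=-1).unsqueeze(1), memory).squeeze(1)
        h_dec, c_dec = self.decoder_rnn(torch.cat([h_att, context], dim=-1), (h_dec, c_dec))
        frame = self.frame_proj(h_dec)
        stop_logit = self.stop_proj(h_dec)
        # Postnet residual; in practice it runs over the whole predicted sequence,
        # it is applied per frame here only to keep the sketch short.
        refined = frame + self.postnet(frame.unsqueeze(-1)).squeeze(-1)
        return refined, stop_logit, ((h_att, c_att), (h_dec, c_dec), context)
```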
Optionally, the image submodel may be implemented based on a VGG (Visual Geometry Group) network, and the training facial image features are features output by a hidden layer of the image submodel.
After the training acoustic features, the training facial image features, and the stop flag are determined, in S24 a target loss of the speech synthesis model is determined according to the training acoustic features, the target acoustic features obtained from the audio data, the training facial image features, the target facial image features obtained from the image data, and the stop flag, and training ends when the target loss is less than a preset threshold. The manner of extracting acoustic features from audio data has been described in detail above and is not repeated here. The target acoustic features are the acoustic features determined from the audio data recorded for the text sample and music sample corresponding to the training acoustic features, and the target facial image features are the facial image features determined from the image data recorded for the text sample and music sample corresponding to the training facial image features.
In this technical solution, the acoustic submodel and the image submodel are jointly trained, so that the facial image features output by the speech synthesis model match the acoustic features, improving the accuracy and applicability of the speech synthesis model. Moreover, marking the end of the facial image feature and acoustic feature sequences with the stop flag effectively guarantees the accuracy of synthesis: it avoids the gaps that may occur when the synthesized sequence is too short, avoids the resource waste caused by synthesizing a sequence that is too long, and improves the efficiency of speech synthesis.
Optionally, an exemplary embodiment of determining the target loss of the speech synthesis model according to the training acoustic features, the target acoustic features, the training facial image features, the target facial image features, and the stop flag may include:
determining the loss of the training acoustic features according to the training acoustic features and the corresponding target acoustic features. For example, this loss may be determined as the mean square error (MSE) computed between the vectors corresponding to the training acoustic features and the corresponding target acoustic features;
determining the loss of the training facial image features according to the training facial image features and the corresponding target facial image features. For example, this loss may be determined as the mean square error (MSE) computed between the vectors corresponding to the training facial image features and the corresponding target facial image features;
determining the loss of the stop flag based on a cross-entropy loss function (sigmoid cross-entropy loss); the calculation of the cross-entropy loss function is known in the art and is not described here again;
determining the target loss from the loss of the training acoustic features, the loss of the training facial image features, and the loss of the stop flag.
For example, the target loss may be determined as a result of weighted summation of the loss of the training acoustic features, the loss of the training face image features, and the loss of the stop flag, and the speech synthesis model may be updated by the target loss.
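A minimal sketch of this combined loss, assuming predicted and target tensors of matching shapes; the loss weights are hypothetical values that would be tuned in practice rather than values given by the disclosure.

```python
import torch
import torch.nn.functional as F

def speech_synthesis_loss(pred_acoustic, target_acoustic,
                          pred_face, target_face,
                          stop_logits, stop_targets,
                          w_acoustic=1.0, w_face=1.0, w_stop=1.0):
    """Weighted sum of MSE losses and a sigmoid cross-entropy stop-token loss."""
    acoustic_loss = F.mse_loss(pred_acoustic, target_acoustic)   # training acoustic features
    face_loss = F.mse_loss(pred_face, target_face)               # training facial image features
    stop_loss = F.binary_cross_entropy_with_logits(stop_logits, stop_targets)
    return w_acoustic * acoustic_loss + w_face * face_loss + w_stop * stop_loss

# Usage with dummy tensors: (batch, frames, dims); stop targets are 0/1 per frame.
pred_a = torch.randn(2, 100, 80, requires_grad=True)
pred_f = torch.randn(2, 100, 32, requires_grad=True)
stop_logits = torch.randn(2, 100, requires_grad=True)
loss = speech_synthesis_loss(pred_a, torch.randn(2, 100, 80),
                             pred_f, torch.randn(2, 100, 32),
                             stop_logits, torch.randint(0, 2, (2, 100)).float())
loss.backward()
```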
Thus, with this technical solution, the loss of the training acoustic features, the loss of the training facial image features, and the loss of the stop flag can be determined separately, and the overall loss of the speech synthesis model is then determined from them.
The present disclosure also provides a virtual character-based speech synthesis apparatus, as shown in fig. 3, the apparatus 10 includes:
a first obtaining module 100, configured to obtain speech feature information corresponding to a text to be synthesized and obtain music feature information for performing speech synthesis on the text to be synthesized;
a first input module 200, configured to input the speech feature information and the music feature information into a speech synthesis model and obtain acoustic features and facial image features corresponding to the text to be synthesized, wherein the sequences of the acoustic features and the facial image features are aligned;
the first processing module 300 is configured to obtain audio information corresponding to the text to be synthesized according to the acoustic features;
and the second processing module 400 is configured to output the audio information on the virtual character, and control the display of the face state of the virtual character according to the facial image features.
Optionally, the speech synthesis model is obtained by jointly training an image sub-model and an acoustic sub-model to align the acoustic features and the sequence of facial image features.
Optionally, the acoustic submodel is configured to perform joint processing on the speech feature information and the music feature information to obtain acoustic features corresponding to singing and accompaniment, respectively.
Optionally, the speech synthesis model is obtained by a training apparatus comprising:
a second obtaining module, configured to obtain an input sample set and an output sample set, where the input sample set includes a text sample and a music sample, and the output sample set includes recorded audio data and image data corresponding to the text sample and the music sample;
the second input module is used for inputting a text sample and a music sample into the information extraction model so as to obtain the voice characteristic information corresponding to the text sample and the music characteristic information corresponding to the music sample;
a third input module, configured to input speech feature information corresponding to the text sample and music feature information corresponding to the music sample into the speech synthesis model, and obtain a training acoustic feature, a training facial image feature, and a stop identifier, where the stop identifier is used to indicate that a sequence corresponding to the training acoustic feature and the training facial image feature stops;
a determining module, configured to determine a target loss of the speech synthesis model according to the training acoustic features, target acoustic features, training face image features, target face image features, and a stop flag, and end training when the target loss is smaller than a preset threshold, where the target acoustic features are obtained based on the audio data, and the target face image features are obtained based on the image data.
Optionally, the determining module includes:
the first determining submodule is used for determining the loss of the training acoustic features according to the training acoustic features and the target acoustic features corresponding to the training acoustic features;
a second determining sub-module, configured to determine a loss of the training facial image feature according to the training facial image feature and a target facial image feature corresponding to the training facial image feature;
a third determining submodule for determining a loss of the stop flag based on a cross entropy loss function;
a fourth determining sub-module to determine the target loss based on the loss of the training acoustic features, the loss of the training facial image features, and the loss of the stop flag.
Optionally, the image sub-model is implemented based on a VGG (Visual Geometry Group) network, and the training face image features are features output by a hidden layer of the image sub-model.
Optionally, the first obtaining module includes:
the receiving submodule is used for receiving a music selection instruction and determining music corresponding to the music selection instruction as target music;
the extraction submodule is used for extracting the music theory characteristic information from the target music;
or the fourth input submodule is used for inputting the music score data to be synthesized into the information extraction model to obtain the music theory characteristic information.
Referring now to FIG. 4, a block diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 4, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 4 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring voice characteristic information corresponding to a text to be synthesized and acquiring music characteristic information for performing voice synthesis on the text to be synthesized; inputting the voice feature information and the music theory feature information into a voice synthesis model to obtain acoustic features and facial image features corresponding to the text to be synthesized, wherein the acoustic features and the facial image features are aligned in sequence; obtaining audio information corresponding to the text to be synthesized according to the acoustic features; and outputting the audio information on the virtual character, and controlling the face state display of the virtual character according to the face image characteristics.
Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of the module does not constitute a limitation on the module itself in some cases, for example, the first obtaining module may also be described as "obtaining speech feature information corresponding to a text to be synthesized and obtaining music feature information for performing speech synthesis on the text to be synthesized".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Example 1 provides a virtual character-based speech synthesis method according to one or more embodiments of the present disclosure, including:
acquiring voice characteristic information corresponding to a text to be synthesized and acquiring music characteristic information for performing voice synthesis on the text to be synthesized;
inputting the voice feature information and the music theory feature information into a voice synthesis model to obtain acoustic features and facial image features corresponding to the text to be synthesized, wherein the acoustic features and the facial image features are aligned in sequence;
obtaining audio information corresponding to the text to be synthesized according to the acoustic features;
and outputting the audio information on the virtual character, and controlling the face state display of the virtual character according to the face image characteristics.
Example 2 provides the method of example 1, the speech synthesis model being obtained from joint training of an image sub-model and an acoustic sub-model to align the sequence of acoustic features and the facial image features, in accordance with one or more embodiments of the present disclosure.
Example 3 provides the method of example 2, the acoustic submodel being configured to jointly process the speech feature information and the music feature information to obtain acoustic features corresponding to singing and accompaniment, respectively, according to one or more embodiments of the present disclosure.
Example 4 provides the method of example 2, the speech synthesis model obtained by:
acquiring an input sample set and an output sample set, wherein the input sample set comprises a text sample and a music sample, and the output sample set comprises recorded audio data and image data corresponding to the text sample and the music sample;
inputting a text sample and a music sample into an information extraction model to obtain voice characteristic information corresponding to the text sample and music characteristic information corresponding to the music sample;
inputting the voice feature information corresponding to the text sample and the music feature information corresponding to the music sample into the voice synthesis model, and obtaining a training acoustic feature, a training facial image feature and a stop identifier, wherein the stop identifier is used for indicating that a sequence corresponding to the training acoustic feature and the training facial image feature stops;
determining a target loss of the speech synthesis model according to the training acoustic features, the target acoustic features, the training face image features, the target face image features and the stop identification, and ending the training when the target loss is smaller than a preset threshold value, wherein the target acoustic features are obtained based on the audio data, and the target face image features are obtained based on the image data.
Example 5 provides the method of example 4, the determining a target loss for the speech synthesis model from the training acoustic features, the target acoustic features, the training facial image features, the target facial image features, and the stop identification, comprising:
determining the loss of the training acoustic features according to the training acoustic features and target acoustic features corresponding to the training acoustic features;
determining loss of the training facial image features according to the training facial image features and target facial image features corresponding to the training facial image features;
determining a loss of the stop flag based on a cross-entropy loss function;
determining the target loss from the loss of training acoustic features, the loss of training facial image features, and the loss of stop identification.
Example 6 provides the method of example 4, the image sub-model being implemented based on a VGG (Visual Geometry Group) network, the training facial image features being features output by a hidden layer of the image sub-model, in accordance with one or more embodiments of the present disclosure.
Example 7 provides the method of example 1, wherein the obtaining of the musical-feature information for speech synthesis of the text to be synthesized includes:
receiving a music selection instruction, and determining music corresponding to the music selection instruction as target music;
extracting the music characteristic information from the target music;
or inputting the music score data to be synthesized into an information extraction model to obtain the music theory characteristic information.
Example 8 provides, in accordance with one or more embodiments of the present disclosure, an avatar-based speech synthesis apparatus, the apparatus comprising:
a first acquisition module, configured to acquire speech feature information corresponding to a text to be synthesized and to acquire music feature information for performing speech synthesis on the text to be synthesized;
a first input module, configured to input the speech feature information and the music feature information into a speech synthesis model and obtain acoustic features and facial image features corresponding to the text to be synthesized, wherein the sequences of the acoustic features and the facial image features are aligned;
the first processing module is used for obtaining audio information corresponding to the text to be synthesized according to the acoustic characteristics;
and the second processing module is used for outputting the audio information on the virtual character and controlling the face state of the virtual character to be displayed according to the face image characteristics.
Example 9 provides the apparatus of example 8, the speech synthesis model obtained from joint training of an image sub-model and an acoustic sub-model to align the sequence of acoustic features and the facial image features, according to one or more embodiments of the present disclosure.
Example 10 provides the apparatus of example 8, the acoustic submodel to jointly process the speech feature information and the music feature information to obtain acoustic features corresponding to singing and accompaniment, respectively, according to one or more embodiments of the present disclosure.
Example 11 provides the apparatus of example 8, wherein the speech synthesis model is obtained by a training apparatus comprising:
a second obtaining module, configured to obtain an input sample set and an output sample set, wherein the input sample set includes a text sample and a music sample, and the output sample set includes recorded audio data and image data corresponding to the text sample and the music sample;
a second input module, configured to input the text sample and the music sample into an information extraction model to obtain speech feature information corresponding to the text sample and music feature information corresponding to the music sample;
a third input module, configured to input the speech feature information corresponding to the text sample and the music feature information corresponding to the music sample into the speech synthesis model to obtain training acoustic features, training facial image features and a stop identifier, wherein the stop identifier indicates the end of the sequence corresponding to the training acoustic features and the training facial image features;
and a determining module, configured to determine a target loss of the speech synthesis model according to the training acoustic features, target acoustic features, the training facial image features, target facial image features and the stop identifier, and to end the training when the target loss is smaller than a preset threshold, wherein the target acoustic features are obtained based on the audio data, and the target facial image features are obtained based on the image data.
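A bare-bones training loop matching the module flow of Example 11 might look like the sketch below. It assumes the data loader already yields the features produced by the information extraction model together with targets derived from the recorded audio and image data; the optimizer, the threshold value and `target_loss_fn` (a sketch of which follows Example 12 below) are assumptions, not elements taken from the disclosure.

```python
def train_speech_synthesis_model(synthesis_model, optimizer, loader,
                                 target_loss_fn, loss_threshold=0.01):
    """Sketch of the training flow in Example 11; names and the threshold
    value are illustrative."""
    for speech_feats, music_feats, target_acoustic, target_face, target_stop in loader:
        # Forward pass: training acoustic features, training facial image
        # features and per-frame stop-identifier logits.
        train_acoustic, train_face, stop_logits = synthesis_model(speech_feats, music_feats)
        loss = target_loss_fn(train_acoustic, target_acoustic,
                              train_face, target_face,
                              stop_logits, target_stop)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < loss_threshold:   # end training once the target loss is small enough
            return
```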
Example 12 provides the apparatus of example 11, wherein the determining module comprises:
a first determining sub-module, configured to determine a loss of the training acoustic features according to the training acoustic features and the target acoustic features corresponding to the training acoustic features;
a second determining sub-module, configured to determine a loss of the training facial image features according to the training facial image features and the target facial image features corresponding to the training facial image features;
a third determining sub-module, configured to determine a loss of the stop identifier based on a cross-entropy loss function;
and a fourth determining sub-module, configured to determine the target loss according to the loss of the training acoustic features, the loss of the training facial image features and the loss of the stop identifier.
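One possible composition of the target loss of Example 12 is sketched below. The disclosure specifies a cross-entropy loss for the stop identifier; the use of mean-squared error for the acoustic and facial feature terms and the unweighted sum are assumptions made for the sketch.

```python
import torch.nn.functional as F

def target_loss_fn(train_acoustic, target_acoustic,
                   train_face, target_face,
                   stop_logits, target_stop):
    """Sketch of the target loss of Example 12."""
    acoustic_loss = F.mse_loss(train_acoustic, target_acoustic)  # assumed per-feature loss
    face_loss = F.mse_loss(train_face, target_face)              # assumed per-feature loss
    # Cross-entropy over the per-frame stop identifier (0 = continue, 1 = stop).
    stop_loss = F.binary_cross_entropy_with_logits(stop_logits, target_stop)
    return acoustic_loss + face_loss + stop_loss                 # assumed unweighted sum
```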
Example 13 provides the apparatus of example 11, in accordance with one or more embodiments of the present disclosure, wherein the image sub-model is implemented based on a Visual Geometry Group (VGG) network, and the training facial image features are features output by a hidden layer of the image sub-model.
Example 14 provides the apparatus of example 8, wherein the first obtaining module comprises:
a receiving sub-module, configured to receive a music selection instruction and determine music corresponding to the music selection instruction as target music;
an extraction sub-module, configured to extract the music feature information from the target music;
or, a fourth input sub-module, configured to input music score data to be synthesized into the information extraction model to obtain the music feature information.
Example 15 provides a computer readable medium having stored thereon a computer program that, when executed by a processing apparatus, performs the steps of a virtual character-based speech synthesis method according to one or more embodiments of the present disclosure.
Example 16 provides, in accordance with one or more embodiments of the present disclosure, an electronic device, comprising:
a storage device having a computer program stored thereon;
a processing device, configured to execute the computer program in the storage device to implement the steps of the virtual character-based speech synthesis method.
The foregoing description is merely an illustration of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combinations of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Claims (10)

1. A virtual character-based speech synthesis method, the method comprising:
acquiring speech feature information corresponding to a text to be synthesized, and acquiring music feature information for performing speech synthesis on the text to be synthesized;
inputting the speech feature information and the music feature information into a speech synthesis model to obtain acoustic features and facial image features corresponding to the text to be synthesized, wherein the acoustic features and the facial image features are aligned in sequence;
obtaining audio information corresponding to the text to be synthesized according to the acoustic features;
and outputting the audio information through the virtual character, and controlling display of the face state of the virtual character according to the facial image features.
2. The method of claim 1, wherein the speech synthesis model is obtained by jointly training an image sub-model and an acoustic sub-model, so that the acoustic features and the facial image features are aligned in sequence.
3. The method of claim 2, wherein the acoustic sub-model is configured to jointly process the speech feature information and the music feature information to obtain acoustic features corresponding to singing and accompaniment, respectively.
4. The method of claim 2, wherein the speech synthesis model is obtained by:
acquiring an input sample set and an output sample set, wherein the input sample set comprises a text sample and a music sample, and the output sample set comprises recorded audio data and image data corresponding to the text sample and the music sample;
inputting the text sample and the music sample into an information extraction model to obtain speech feature information corresponding to the text sample and music feature information corresponding to the music sample;
inputting the speech feature information corresponding to the text sample and the music feature information corresponding to the music sample into the speech synthesis model to obtain training acoustic features, training facial image features and a stop identifier, wherein the stop identifier indicates the end of the sequence corresponding to the training acoustic features and the training facial image features;
determining a target loss of the speech synthesis model according to the training acoustic features, target acoustic features, the training facial image features, target facial image features and the stop identifier, and ending the training when the target loss is smaller than a preset threshold, wherein the target acoustic features are obtained based on the audio data, and the target facial image features are obtained based on the image data.
5. The method of claim 4, wherein determining the target loss of the speech synthesis model according to the training acoustic features, the target acoustic features, the training facial image features, the target facial image features and the stop identifier comprises:
determining a loss of the training acoustic features according to the training acoustic features and the target acoustic features corresponding to the training acoustic features;
determining a loss of the training facial image features according to the training facial image features and the target facial image features corresponding to the training facial image features;
determining a loss of the stop identifier based on a cross-entropy loss function;
and determining the target loss according to the loss of the training acoustic features, the loss of the training facial image features and the loss of the stop identifier.
6. The method of claim 4, wherein the image sub-model is implemented based on a Visual Geometry Group (VGG) network, and wherein the training facial image features are features output by a hidden layer of the image sub-model.
7. The method according to claim 1, wherein the obtaining of the music feature information for performing speech synthesis on the text to be synthesized comprises:
receiving a music selection instruction, and determining music corresponding to the music selection instruction as target music;
extracting the music feature information from the target music;
or, inputting music score data to be synthesized into an information extraction model to obtain the music feature information.
8. A virtual character-based speech synthesis apparatus, the apparatus comprising:
a first obtaining module, configured to obtain speech feature information corresponding to a text to be synthesized and obtain music feature information for performing speech synthesis on the text to be synthesized;
a first input module, configured to input the speech feature information and the music feature information into a speech synthesis model to obtain acoustic features and facial image features corresponding to the text to be synthesized, wherein the acoustic features and the facial image features are aligned in sequence;
a first processing module, configured to obtain audio information corresponding to the text to be synthesized according to the acoustic features;
and a second processing module, configured to output the audio information through the virtual character and control display of the face state of the virtual character according to the facial image features.
9. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processing device, carries out the steps of the method of any one of claims 1 to 7.
10. An electronic device, comprising:
a storage device having one or more computer programs stored thereon;
one or more processing devices for executing the one or more computer programs in the storage device to implement the steps of the method of any one of claims 1-7.
CN202010167707.7A 2020-03-11 2020-03-11 Virtual character-based voice synthesis method, device, medium and equipment Active CN111369967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010167707.7A CN111369967B (en) 2020-03-11 2020-03-11 Virtual character-based voice synthesis method, device, medium and equipment

Publications (2)

Publication Number Publication Date
CN111369967A true CN111369967A (en) 2020-07-03
CN111369967B CN111369967B (en) 2021-03-05

Family

ID=71206776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010167707.7A Active CN111369967B (en) 2020-03-11 2020-03-11 Virtual character-based voice synthesis method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN111369967B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant