WO2022156464A1

WO2022156464A1 - Speech synthesis method and apparatus, readable medium, and electronic device

Info

Publication number: WO2022156464A1
Application number: PCT/CN2021/139987
Authority: WO
Inventors: 吴鹏飞; 伍林; 潘俊杰
Original assignee: 北京有竹居网络技术有限公司
Priority date: 2021-01-20
Filing date: 2021-12-21
Publication date: 2022-07-28
Also published as: CN112786008B; CN112786008A

Abstract

A speech synthesis method and apparatus (200), a readable medium, and an electronic device (300), relating to the technical field of electronic information processing. The method comprises: obtaining a text to be synthesized and a specified acoustic feature (101), the specified acoustic feature being used for indicating a prosodic feature of an audio; extracting a phoneme sequence corresponding to the text to be synthesized (102); extending the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence (103); and inputting the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain a target audio output by the speech synthesis model and corresponding to the text to be synthesized (104), an acoustic feature of the target audio matching the specified acoustic feature. Speech synthesis of a text is controlled by means of the specified acoustic feature, so that the target audio output by the speech synthesis model can correspond to the specified acoustic feature, explicit control of the acoustic feature during the process of speech synthesis can be implemented, and the expressiveness of the target audio is improved.

Description

Speech synthesis method, apparatus, readable medium and electronic device

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is based on and claims priority to CN application No. CN202110075977.X, filed on January 20, 2021, the disclosure of which is hereby incorporated into this application in its entirety.

TECHNICAL FIELD

The present disclosure relates to the technical field of electronic information processing, and in particular, to a speech synthesis method, apparatus, readable medium, and electronic device.

BACKGROUND

With the continuous development of electronic information processing technology, voice, as an important carrier for people to obtain information, has been widely used in daily life and work. In application scenarios involving speech, the processing of speech synthesis is usually included. Speech synthesis refers to the synthesis of user-specified text into audio. In the speech synthesis process in the related art, if it is necessary to synthesize a target speech conforming to a certain acoustic characteristic according to a specified text, a pre-prepared reference speech conforming to the acoustic characteristic needs to be used.

SUMMARY OF THE INVENTION

This Summary is provided to introduce concepts in a simplified form that are described in detail in the Detailed Description section that follows. This summary section is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to be used to limit the scope of the claimed technical solution.

In a first aspect, the present disclosure provides a speech synthesis method, the method comprising:

obtaining a text to be synthesized and a specified acoustic feature, the specified acoustic feature being used to indicate a prosodic feature of audio;

extracting a phoneme sequence corresponding to the text to be synthesized;

extending the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence; and

inputting the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain target audio that is output by the speech synthesis model and corresponds to the text to be synthesized, an acoustic feature of the target audio matching the specified acoustic feature.

In a second aspect, the present disclosure provides a speech synthesis device, the device comprising:

an acquisition module, configured to acquire a text to be synthesized and a specified acoustic feature, the specified acoustic feature being used to indicate a prosodic feature of audio;

an extraction module, configured to extract a phoneme sequence corresponding to the text to be synthesized;

an expansion module, configured to extend the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence; and

a synthesis module, configured to input the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain target audio that is output by the speech synthesis model and corresponds to the text to be synthesized, an acoustic feature of the target audio matching the specified acoustic feature.

In a third aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing apparatus, implements the steps of the method described in the first aspect of the present disclosure.

In a fourth aspect, the present disclosure provides an electronic device, comprising:

a storage device on which a computer program is stored;

a processing device configured to execute the computer program in the storage device to implement the steps of the method described in the first aspect of the present disclosure.

In a fifth aspect, the present disclosure provides a computer program comprising instructions that, when executed by a processor, cause the processor to perform one or more steps of the method described in the first aspect.

In a sixth aspect, the present disclosure provides a computer program product comprising instructions that, when executed by a processor, cause the processor to perform one or more steps of the method described in the first aspect.

Other features and advantages of the present disclosure will be described in detail in the detailed description that follows.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent when taken in conjunction with the accompanying drawings and with reference to the following detailed description. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale. In the drawings:

Fig. 1 is a flowchart of a speech synthesis method according to an exemplary embodiment;

Fig. 2 is a flowchart of another speech synthesis method according to an exemplary embodiment;

Fig. 3 is a processing flowchart of a speech synthesis model according to an exemplary embodiment;

Fig. 4 is a block diagram of a speech synthesis model according to an exemplary embodiment;

Fig. 5 is a flowchart of training a speech synthesis model according to an exemplary embodiment;

Fig. 6 is another flowchart of training a speech synthesis model according to an exemplary embodiment;

Fig. 7 is another flowchart of training a speech synthesis model according to an exemplary embodiment;

Fig. 8 is a block diagram of a speech synthesis apparatus according to an exemplary embodiment;

Fig. 9 is a block diagram of another speech synthesis apparatus according to an exemplary embodiment;

Fig. 10 is a block diagram of an electronic device according to an exemplary embodiment.

DETAILED DESCRIPTION

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are only for exemplary purposes, and are not intended to limit the protection scope of the present disclosure.

It should be understood that the various steps described in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this regard.

As used herein, the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.

It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules or units, and are not used to limit the order of, or the interdependence between, the functions performed by these devices, modules or units.

It should be noted that the modifiers "a" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".

The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are only for illustrative purposes, and are not intended to limit the scope of these messages or information.

In the related art, the duration of the reference speech is often quite different from the duration of the target speech, which may lead to unstable synthesis results. Moreover, it is also difficult to prepare reference speech corresponding to various acoustic features in advance for the diverse needs of users. Therefore, in the process of speech synthesis, effective control of acoustic features cannot be achieved.

The speech synthesis solution provided by the present disclosure can solve this problem.

Fig. 1 is a flowchart of a speech synthesis method according to an exemplary embodiment. As shown in Fig. 1, the method may include steps 101-104.

Step 101: Acquire the text to be synthesized and a specified acoustic feature, the specified acoustic feature being used to indicate a prosodic feature of the audio.

For example, the text to be synthesized is obtained first. The text to be synthesized may be, for example, one or more sentences, one or more paragraphs, or one or more chapters in a text file specified by the user, or one or more words in the text file. The text file may be, for example, an electronic book, or another type of file, such as a news item, an official-account article, or a blog post. The specified acoustic feature can be obtained at the same time. The specified acoustic feature can be understood as an acoustic feature specified by the user, who expects the text to be synthesized to be synthesized into audio that conforms to it (i.e., the target audio mentioned later). The specified acoustic feature may include multiple dimensions: for example, one or more of fundamental frequency (pitch), volume (energy), or speech rate (duration), and it may also include noise level, tone, timbre, loudness, and the like, where the noise level can be understood as a feature that reflects the level of noise in the audio.

Step 102: Extract the phoneme sequence corresponding to the text to be synthesized.

For example, the text to be synthesized may be input into a pre-trained recognition model to obtain the phoneme sequence that is output by the recognition model and corresponds to the text to be synthesized. Alternatively, the phonemes corresponding to each word in the text to be synthesized may be looked up in a pre-established dictionary, and the phonemes corresponding to the words are then combined into the phoneme sequence corresponding to the text to be synthesized. A phoneme can be understood as a phonetic unit divided according to the pronunciation of each word, or as the vowels and consonants in the pinyin corresponding to each word. The phoneme sequence includes the phonemes corresponding to each word in the text to be synthesized (a word may correspond to one or more phonemes). For example, in Chinese, if the text to be synthesized is "the weather is fine today", the phonemes corresponding to each word can be looked up in the dictionary in turn, so that the phoneme sequence is determined as "jintiantianqihenhao".
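The dictionary-lookup variant described above can be sketched as follows. This is only an illustration: the lexicon entries, the split of each character into a pinyin initial and final, the function name `text_to_phonemes`, and the assumption that the example sentence is 今天天气很好 (which is consistent with the phoneme sequence given above) are not taken from the disclosure.

```python
# Hypothetical pronunciation dictionary: each character maps to its phonemes
# (pinyin initial and final). The entries are illustrative only.
LEXICON = {
    "今": ["j", "in"],
    "天": ["t", "ian"],
    "气": ["q", "i"],
    "很": ["h", "en"],
    "好": ["h", "ao"],
}


def text_to_phonemes(text, lexicon=LEXICON):
    """Look up each character in turn and concatenate the phonemes found."""
    phonemes = []
    for char in text:
        if char not in lexicon:
            raise KeyError(f"no dictionary entry for {char!r}")
        phonemes.extend(lexicon[char])
    return phonemes


# Assumed original sentence for "the weather is fine today".
phoneme_sequence = text_to_phonemes("今天天气很好")
print("".join(phoneme_sequence))
```

A character may contribute more than one phoneme, so the length of the phoneme sequence is generally larger than the number of characters in the text.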

Step 103: Expand the specified acoustic feature according to the phoneme sequence to obtain the acoustic feature sequence.

Step 104: Input the phoneme sequence and the acoustic feature sequence into the pre-trained speech synthesis model to obtain the target audio output by the speech synthesis model and corresponding to the text to be synthesized, and the acoustic features of the target audio match the specified acoustic features.

For example, after the phoneme sequence is obtained, the specified acoustic feature may be expanded according to the phoneme sequence to obtain an acoustic feature sequence, and the acoustic feature sequence includes the acoustic feature corresponding to each phoneme in the phoneme sequence. In an implementation manner, an acoustic feature sequence may be generated according to the length of the phoneme sequence (ie, the number of phonemes included in the phoneme sequence), wherein the acoustic feature corresponding to each phoneme is a specified acoustic feature. In another implementation manner, the specified acoustic feature may also be taken as the mean value (or standard deviation), and the acoustic feature corresponding to each phoneme may be generated according to a preset distribution (eg, Gaussian distribution or uniform distribution).
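The second implementation manner above, in which the specified acoustic feature serves as the mean of a preset distribution, might look like the sketch below. It assumes a Gaussian distribution; the function name `expand_by_sampling` and the `std` and `seed` parameters are hypothetical, since the disclosure does not state how the spread of the distribution is chosen.

```python
import random


def expand_by_sampling(specified_feature, num_phonemes, std=0.1, seed=None):
    """Draw one acoustic feature per phoneme from a Gaussian centred on the
    specified feature. `std` and `seed` are illustrative parameters."""
    rng = random.Random(seed)
    return [
        [rng.gauss(value, std) for value in specified_feature]
        for _ in range(num_phonemes)
    ]


# Hypothetical two-dimensional specified feature, expanded over 12 phonemes.
acoustic_feature_sequence = expand_by_sampling([0.5, -0.2], 12, std=0.1, seed=0)
```

With `std=0.0` every sampled feature equals the specified feature, which reduces this variant to plain copying.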

Afterwards, the phoneme sequence and the acoustic feature sequence can be used as the input of the pre-trained speech synthesis model, and the audio output by the speech synthesis model is the target audio that corresponds to the text to be synthesized and matches the specified acoustic feature. The speech synthesis model can be understood as a TTS (Text To Speech) model, which can generate, according to the text to be synthesized and the specified acoustic feature, target audio that corresponds to the text to be synthesized and matches the specified acoustic feature. Specifically, the speech synthesis model may be obtained by training based on the Tacotron model, the Deep Voice 3 model, the Tacotron 2 model, the WaveNet model, etc., which is not specifically limited in the present disclosure. In this way, when speech synthesis is performed on the text to be synthesized, the specified acoustic feature is considered in addition to the semantics of the text, so that the target audio has the specified acoustic feature. Explicit control of acoustic features is thereby realized in the process of speech synthesis, without spending considerable time and labor to create reference speech for various acoustic features in advance, and the instability caused by a large difference between the durations of the reference speech and the target audio is also avoided. This improves the expressiveness of the target audio and the user's listening experience.

To sum up, the present disclosure first obtains the text to be synthesized and the specified acoustic feature used to indicate the prosodic feature of the audio; then extracts the corresponding phoneme sequence from the text to be synthesized and extends the specified acoustic feature according to the phoneme sequence to obtain the acoustic feature sequence; and finally inputs the phoneme sequence and the acoustic feature sequence into the pre-trained speech synthesis model, so as to obtain the target audio that is output by the speech synthesis model, corresponds to the text to be synthesized, and matches the specified acoustic feature. The present disclosure controls the speech synthesis of text by means of the specified acoustic feature, so that the target audio output by the speech synthesis model can correspond to the specified acoustic feature, which realizes explicit control of acoustic features in the process of speech synthesis and improves the expressiveness of the target audio.

Fig. 2 is a flowchart of another speech synthesis method according to an exemplary embodiment. As shown in Fig. 2, the implementation of step 103 may include steps 1031-1032.

Step 1031: Determine the acoustic feature corresponding to each phoneme in the phoneme sequence according to the specified acoustic feature.

Step 1032: Form the acoustic features corresponding to the phonemes into an acoustic feature sequence.

For example, in an implementation manner, the length of the phoneme sequence, that is, the number of phonemes included in the phoneme sequence, may be determined first. Then, the specified acoustic feature is copied to obtain an acoustic feature sequence with the same length as the phoneme sequence, in which every acoustic feature is identical to the specified acoustic feature; that is, the acoustic feature corresponding to each phoneme is the specified acoustic feature. For example, if the length of the phoneme sequence is 100 (that is, it includes 100 phonemes), the acoustic feature corresponding to each phoneme can be determined as the specified acoustic feature, and the acoustic features corresponding to the 100 phonemes can then be formed into an acoustic feature sequence. Taking a specified acoustic feature that is a 1*5-dimensional vector as an example, the acoustic feature sequence includes 100 1*5-dimensional vectors, which together form a 100*5-dimensional vector.
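The copy-based expansion in this example can be sketched in a few lines. The function name `expand_by_copying`, the feature values, and the placeholder phonemes are assumptions made purely for illustration.

```python
def expand_by_copying(specified_feature, phoneme_sequence):
    """Repeat the specified acoustic feature once per phoneme."""
    return [list(specified_feature) for _ in phoneme_sequence]


specified_feature = [0.2, 0.1, -0.3, 0.5, 0.0]   # hypothetical 1*5 vector
phoneme_sequence = ["a"] * 100                    # placeholder for 100 phonemes
acoustic_feature_sequence = expand_by_copying(specified_feature, phoneme_sequence)

# One row per phoneme, five values per row.
print(len(acoustic_feature_sequence), len(acoustic_feature_sequence[0]))
```

Each row is an independent copy, so later per-phoneme adjustments to one row would not affect the others.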

Fig. 3 is a processing flowchart of a speech synthesis model according to an exemplary embodiment. As shown in Fig. 3, the speech synthesis model can be used to execute steps A and B.

Step A: Determine a text feature sequence corresponding to the text to be synthesized according to the phoneme sequence, the text feature sequence including a text feature corresponding to each phoneme in the phoneme sequence.

Step B: Generate target audio according to the text feature sequence and the acoustic feature sequence.

For example, in the specific process by which the speech synthesis model synthesizes the target audio, the text feature sequence (i.e., Text Embedding) corresponding to the text to be synthesized can first be extracted according to the phoneme sequence. The text feature sequence includes the text feature corresponding to each phoneme in the phoneme sequence, and a text feature can be understood as a text vector that characterizes the phoneme. For example, if the phoneme sequence includes 100 phonemes and the text vector corresponding to each phoneme is a 1*80-dimensional vector, the text feature sequence may be a 100*80-dimensional vector.

After the text feature sequence is obtained, the text feature sequence can be combined with the acoustic feature sequence to generate target audio that matches the specified acoustic feature. For example, text feature sequences can be concatenated with acoustic feature sequences to obtain a combined sequence, and then target audio can be generated from the combined sequence. For example, the phoneme sequence includes 100 phonemes, the text feature sequence can be a 100*80-dimensional vector, and the corresponding acoustic feature sequence is a 100*5-dimensional vector, then the combined sequence can be a 100*85-dimensional vector. The target audio can be generated based on this 100*85-dimensional vector.
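The concatenation in this example can be illustrated with placeholder data. The function name `concatenate_sequences` and the zero-filled inputs are assumptions; in a real model the text features would come from the encoder.

```python
def concatenate_sequences(text_features, acoustic_features):
    """Concatenate per-phoneme text and acoustic features along the feature axis."""
    if len(text_features) != len(acoustic_features):
        raise ValueError("both sequences must contain one entry per phoneme")
    return [list(t) + list(a) for t, a in zip(text_features, acoustic_features)]


text_feature_sequence = [[0.0] * 80 for _ in range(100)]      # placeholder 100*80
acoustic_feature_sequence = [[0.0] * 5 for _ in range(100)]   # placeholder 100*5
combined_sequence = concatenate_sequences(
    text_feature_sequence, acoustic_feature_sequence
)
```

The combined sequence keeps one entry per phoneme and widens each entry from 80 to 85 values, which is why the two inputs must have the same length.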

Taking the speech synthesis model shown in Fig. 4 as an example, the speech synthesis model is a Tacotron model, which includes an encoder (Encoder), an attention network (Attention), a decoder (Decoder), and a post-processing network (Post-processing). The encoder can include an embedding layer (Character Embedding layer), a pre-processing network (Pre-net) sub-model, and a CBHG (Convolution Bank + Highway network + bidirectional Gated Recurrent Unit) sub-model. The phoneme sequence is fed into the encoder. First, the embedding layer converts the phoneme sequence into word vectors; the word vectors are then input into the Pre-net sub-model, which applies a nonlinear transformation to them, thereby improving the convergence and generalization capabilities of the speech synthesis model. Finally, the CBHG sub-model obtains, from the nonlinearly transformed word vectors, a text feature sequence that characterizes the text to be synthesized.

After that, the acoustic feature sequence and the text feature sequence output by the encoder can be concatenated to obtain a combined sequence, and the combined sequence is then input into the attention network, which adds an attention weight to each element in the combined sequence. Specifically, the attention network may be a location-sensitive attention (Location Sensitive Attention) network, a GMM (Gaussian Mixture Model) attention network, or a multi-head attention (Multi-Head Attention) network, which is not specifically limited in the present disclosure.

The output of the attention network is then used as the input of the decoder. The decoder may include a pre-processing network (Pre-net) sub-model (which may be the same as the one included in the encoder), an Attention-RNN, and a Decoder-RNN. The pre-processing network sub-model performs a nonlinear transformation on the input. The Attention-RNN is a single-layer, unidirectional, zoneout-based LSTM (Long Short-Term Memory) network, which takes the output of the pre-processing network sub-model as input and passes it through the LSTM unit to the Decoder-RNN. The Decoder-RNN is a two-layer, unidirectional, zoneout-based LSTM, which outputs mel spectrum information through the LSTM unit; the mel spectrum information can include one or more mel spectrum features. Finally, the mel spectrum information is input into the post-processing network, which can include a vocoder (e.g., a WaveNet vocoder or a Griffin-Lim vocoder) that converts the mel spectrum information to obtain the target audio.

Fig. 5 is a flowchart of training a speech synthesis model according to an exemplary embodiment. As shown in Fig. 5, the speech synthesis model is obtained by training in steps C-E.

Step C: Extract the real acoustic features of the training audio corresponding to the training text, the real acoustic features being used to indicate the prosodic features of the training audio.

Step D: Extend the real acoustic features according to the training phoneme sequence corresponding to the training text to obtain the training acoustic feature sequence.

Step E: Input the training phoneme sequence and the training acoustic feature sequence into the speech synthesis model, and train the speech synthesis model according to the output of the speech synthesis model and the training audio.

To train the speech synthesis model, it is necessary to first obtain the training text and the training audio corresponding to the training text. There can be multiple training texts and, correspondingly, multiple training audios. For example, a large amount of text can be collected from the Internet as training text, and the audio corresponding to the training text can then be used as training audio. For the training audio, the corresponding real acoustic features can be extracted, for example by means of signal processing or labeling. The real acoustic features are used to indicate the prosodic features of the training audio, and may include at least one of the fundamental frequency, volume, or speech rate of the training audio, and may also include noise level, tone, timbre, loudness, and the like. At the same time, a training phoneme sequence corresponding to the training text may also be extracted, and the training phoneme sequence may include the training phonemes corresponding to each word in the training text (a word may correspond to one or more training phonemes).

After that, the real acoustic features are extended according to the training phoneme sequence corresponding to the training text to obtain the training acoustic feature sequence. The training acoustic feature sequence includes the training acoustic feature corresponding to each training phoneme. For example, a training acoustic feature sequence may be generated according to the number of training phonemes included in the training phoneme sequence, wherein the training acoustic feature corresponding to each training phoneme is a real acoustic feature.

Finally, the training phoneme sequence and the training acoustic feature sequence are used as the input of the speech synthesis model, and the speech synthesis model is trained according to the output of the speech synthesis model and the training audio. For example, the difference (or mean square error) between the output of the speech synthesis model and the training audio can be used as the loss function of the speech synthesis model, and, with the goal of reducing the loss function, the back-propagation algorithm can be used to correct the parameters of the neurons in the speech synthesis model. The parameters of a neuron can be, for example, its weight and bias. The above steps are repeated until the loss function satisfies a preset condition, for example, until the loss function is smaller than a preset loss threshold.

Fig. 6 is another flowchart of training a speech synthesis model according to an exemplary embodiment. As shown in Fig. 6, the real acoustic features include at least one of fundamental frequency, volume, or speech rate. Correspondingly, the implementation of step C may include steps C1-C3.

Step C1, if the real acoustic feature includes the speech rate, determine the duration corresponding to each training phoneme in the training phoneme sequence according to the training audio and the training phoneme sequence, so as to determine the speech rate of the training audio.

An implementation manner is exemplarily introduced below.

First, the duration corresponding to each training phoneme is determined according to the training audio and the training phoneme sequence. For example, HTS (HMM-based Speech Synthesis System) can be used to segment the training audio according to the training phonemes included in the training phoneme sequence, so as to obtain the duration corresponding to each training phoneme. For example, duration_i denotes the duration corresponding to the i-th training phoneme.

After that, a logarithmic operation is performed on the duration corresponding to each training phoneme to obtain the logarithmic duration corresponding to each training phoneme. The logarithmic operation compresses the variation range of the duration, thereby amplifying the degree of variation of the duration. For example, log_duration_i denotes the logarithmic duration corresponding to the i-th training phoneme.

Finally, the statistical value of the logarithmic duration corresponding to each training phoneme in the training phoneme sequence is used as the speech rate of the training audio. For example, the average (or standard deviation, extreme value, etc.) of the logarithmic duration corresponding to each training phoneme may be used as the speech rate of the training audio, which may be expressed as log_duration_mean.
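The computation of log_duration_mean from per-phoneme durations can be sketched as below. The use of the natural logarithm, durations expressed in seconds, and the function name `speech_rate_from_durations` are assumptions; the disclosure does not fix the logarithm base or the unit.

```python
import math


def speech_rate_from_durations(durations):
    """Return log_duration_mean: the mean of the per-phoneme log durations."""
    if not durations:
        raise ValueError("at least one phoneme duration is required")
    log_durations = [math.log(d) for d in durations]  # log_duration_i
    return sum(log_durations) / len(log_durations)


# Hypothetical per-phoneme durations in seconds, e.g. from a segmentation tool.
durations = [0.08, 0.12, 0.10, 0.15]
log_duration_mean = speech_rate_from_durations(durations)
```

Replacing the mean with a standard deviation or an extreme value, as the text permits, only changes the final aggregation step.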

Step C2, if the real acoustic feature includes the fundamental frequency, extract the fundamental frequency of each audio frame included in the training audio to determine the fundamental frequency of the training audio.

An implementation manner is exemplarily introduced below. First, audio processing tools such as SoX, librosa, or STRAIGHT can be used to process the training audio to obtain the fundamental frequency of each audio frame in the training audio, and a logarithmic operation is then performed on the fundamental frequency corresponding to each audio frame to obtain the logarithmic fundamental frequency corresponding to each audio frame. The logarithmic operation compresses the variation range of the fundamental frequency, thereby amplifying the degree of variation of the fundamental frequency. For example, pitch_j denotes the fundamental frequency corresponding to the j-th audio frame, and correspondingly, the logarithmic fundamental frequency corresponding to the j-th audio frame can be expressed as log_pitch_j.

Then, the statistical value of the logarithmic fundamental frequency corresponding to each audio frame in the training audio is used as the fundamental frequency of the training audio. For example, the average value of the logarithmic fundamental frequency and the standard deviation of the logarithmic fundamental frequency corresponding to each audio frame may be used as the fundamental frequency of the training audio. For example, log_pitch_mean is used to represent the mean value, and log_pitch_std is used to represent the standard deviation, where log_pitch_mean can reflect the overall fundamental frequency of the training audio, and log_pitch_std can reflect the variation range of the fundamental frequency of the training audio.
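Given per-frame fundamental frequencies that have already been extracted, log_pitch_mean and log_pitch_std might be computed as in the sketch below. Skipping frames whose pitch is not positive (treated here as unvoiced), using the natural logarithm, and using the population standard deviation are all assumptions not stated in the disclosure; `pitch_statistics` is a hypothetical name.

```python
import math
import statistics


def pitch_statistics(frame_pitches_hz):
    """Return (log_pitch_mean, log_pitch_std) over frames with a positive pitch."""
    # Assumption: a pitch of 0 marks an unvoiced frame and is ignored.
    log_pitches = [math.log(p) for p in frame_pitches_hz if p > 0]
    if not log_pitches:
        raise ValueError("no voiced frames")
    return statistics.fmean(log_pitches), statistics.pstdev(log_pitches)


# Hypothetical per-frame pitch track in Hz.
log_pitch_mean, log_pitch_std = pitch_statistics([110.0, 0.0, 120.0, 130.0])
```

The mean summarizes how high the voice is overall, while the standard deviation summarizes how much the pitch moves within the utterance.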

Step C3, if the real acoustic feature includes volume, extract the volume of each audio frame included in the training audio to determine the volume of the training audio.

An implementation manner is exemplarily introduced below.

First, audio processing tools such as SoX, librosa, or STRAIGHT can be used to process the training audio to obtain the volume of each audio frame in the training audio, and a logarithmic operation is then performed on the volume corresponding to each audio frame to obtain the logarithmic volume corresponding to each audio frame. The logarithmic operation compresses the variation range of the volume, thereby amplifying the degree of variation of the volume. For example, energy_j denotes the volume corresponding to the j-th audio frame, and correspondingly, the logarithmic volume corresponding to the j-th audio frame can be expressed as log_energy_j.

Then, the statistical value of the logarithmic volume corresponding to each audio frame is used as the volume of the training audio. For example, the average value of the logarithmic volume corresponding to each audio frame can be used as the volume of the training audio, which can be expressed as log_energy_mean.
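As a rough illustration, the sketch below measures per-frame volume as root-mean-square (RMS) energy and then averages the logarithms. Using RMS as the volume measure, the non-overlapping framing, the `floor` guard against log(0), and both function names are assumptions; the disclosure only says that the volume of each frame is extracted with an audio processing tool.

```python
import math


def frame_energies(samples, frame_length=4, hop_length=4):
    """Split samples into frames and return the RMS energy of each frame."""
    energies = []
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frame = samples[start:start + frame_length]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_length))
    return energies


def volume_from_energies(energies, floor=1e-10):
    """Return log_energy_mean; `floor` avoids log(0) on silent frames."""
    if not energies:
        raise ValueError("at least one frame is required")
    logs = [math.log(max(e, floor)) for e in energies]  # log_energy_j
    return sum(logs) / len(logs)


# Hypothetical waveform samples; real frames would be far longer.
samples = [1.0, -1.0, 1.0, -1.0, 2.0, 2.0, -2.0, -2.0]
log_energy_mean = volume_from_energies(frame_energies(samples))
```

In practice the frame and hop lengths would match those used for the pitch track so that both features describe the same frames.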

Further, if the real acoustic feature also includes the noise level, then step C may further include: determining the noise level corresponding to the training audio according to the linear prediction coefficient corresponding to the training audio.

For example, the LPC (Linear Prediction Coefficient) coefficients of the training audio can be determined, a logarithmic operation is then performed on the first dimension of the LPC coefficients, and the result of the logarithmic operation is used as the noise level corresponding to the training audio. The logarithmic operation compresses the variation range of the noise level, thereby amplifying the degree of variation of the noise level. The noise level can be expressed as, for example, log_spectral_tilt.

In the case where the fundamental frequency, volume, speech rate and noise level are included in the real acoustic features, the fundamental frequency, volume, speech rate and noise level of the training audio can be formed into the real acoustic features of the training audio. For example, the real acoustic feature can be a 1*5 dimensional vector: {fundamental frequency: (log_pitch_mean, log_pitch_std), volume: log_energy_mean, speech rate: log_duration_mean, noise level: log_spectral_tilt}. Correspondingly, when the specified acoustic feature includes fundamental frequency, volume, speech rate and noise level, the specified acoustic feature acquired in step 101 may also include the above-mentioned five dimensions.
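As an illustration of how the five dimensions might be assembled, the following sketch builds the vector from per-frame and per-phoneme measurements. The function name and argument names are hypothetical, unvoiced frames are assumed to be marked with a fundamental frequency of 0 and skipped, and the noise level is assumed to have been computed separately.

```python
import numpy as np

def real_acoustic_feature(f0_per_frame, energy_per_frame,
                          duration_per_phoneme, log_spectral_tilt):
    """Return [log_pitch_mean, log_pitch_std, log_energy_mean,
    log_duration_mean, log_spectral_tilt] as a length-5 vector."""
    f0 = np.asarray(f0_per_frame, dtype=np.float64)
    # Assumption: unvoiced frames carry f0 == 0 and are excluded.
    log_pitch = np.log(f0[f0 > 0])
    # Energies and durations are assumed to be strictly positive.
    log_energy = np.log(np.asarray(energy_per_frame, dtype=np.float64))
    log_duration = np.log(np.asarray(duration_per_phoneme, dtype=np.float64))
    return np.array([
        log_pitch.mean(),         # fundamental frequency: mean
        log_pitch.std(),          # fundamental frequency: standard deviation
        log_energy.mean(),        # volume
        log_duration.mean(),      # speech rate
        float(log_spectral_tilt)  # noise level, computed elsewhere
    ])
```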

Fig. 7 is another flowchart of training a speech synthesis model according to an exemplary embodiment. As shown in Fig. 7 , the training process of the speech synthesis model may further include steps F and G.

Step F, according to the preset training set including the real acoustic features of the training audio, determine the statistical acoustic features of the training set.

In step G, the real acoustic features of each training audio are normalized according to the statistical acoustic features.

Correspondingly, in this embodiment, the implementation of step D may be:

The normalized real acoustic features are extended according to the training phoneme sequence to obtain the training acoustic feature sequence.

For example, the real acoustic features can be normalized before they are extended. The training set includes multiple training texts, and each training text corresponds to a training audio. The real acoustic features of each training audio can be determined in the manner of steps C1 to C4, and the statistical acoustic features of the training set can then be determined. The statistical acoustic feature can be, for example, the mean, standard deviation, variance, or extreme value of the real acoustic features. The real acoustic features of each training audio are then normalized according to the statistical acoustic features. For example, the mean μ and standard deviation σ of the real acoustic features can be used as the statistical acoustic features; real acoustic features within [μ-3σ, μ+3σ] are mapped to [-1, 1], and real acoustic features outside [μ-3σ, μ+3σ] are truncated to -1 or 1. The mean and standard deviation of each dimension of the real acoustic feature can also be obtained separately, so that each dimension is normalized on its own. Taking log_pitch_mean as an example, the mean pitch_μ and standard deviation pitch_σ of the log_pitch_mean of all training audios can be obtained; log_pitch_mean values within [pitch_μ-3pitch_σ, pitch_μ+3pitch_σ] are then mapped to [-1, 1], and values outside that interval are truncated to -1 or 1, which normalizes log_pitch_mean.
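A per-dimension version of this normalization can be sketched as follows. The text only states that the interval is mapped to [-1, 1]; the linear mapping used here is an assumption, and the function name is hypothetical.

```python
import numpy as np

def normalize_features(features):
    """Normalize each column of an (n_audios, n_dims) array.

    Values in [mu - 3*sigma, mu + 3*sigma] are mapped linearly (an
    assumption) to [-1, 1]; values outside are truncated to -1 or 1.
    Returns the normalized array together with mu and sigma.
    """
    features = np.asarray(features, dtype=np.float64)
    mu = features.mean(axis=0)      # statistical acoustic feature: mean
    sigma = features.std(axis=0)    # statistical acoustic feature: std
    # Guard against constant dimensions to avoid division by zero.
    sigma = np.where(sigma > 0, sigma, 1.0)
    normalized = (features - mu) / (3.0 * sigma)
    return np.clip(normalized, -1.0, 1.0), mu, sigma
```

Returning mu and sigma alongside the normalized values allows the same statistics to be reused later, for example to relate a specified value in [-1, 1] back to the original scale.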

Further, the normalized real acoustic features can be extended according to the training phoneme sequence to obtain the training acoustic feature sequence. For example, a training acoustic feature sequence may be generated according to the number of training phonemes included in the training phoneme sequence, wherein the training acoustic feature corresponding to each training phoneme is a normalized real acoustic feature.
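The extension step amounts to repeating one utterance-level feature once per phoneme. A minimal sketch, with a hypothetical function name and an illustrative phoneme list:

```python
import numpy as np

def extend_feature(feature, phoneme_sequence):
    """Repeat one acoustic feature vector for every phoneme.

    Returns an array of shape (len(phoneme_sequence), len(feature)) in
    which every row equals the given (normalized) feature.
    """
    feature = np.asarray(feature, dtype=np.float64)
    return np.tile(feature, (len(phoneme_sequence), 1))

# Illustrative values only: three phonemes, one five-dimensional feature.
sequence = extend_feature([-1, 1, 0, 1, 0], ["n", "i", "h"])
```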

It should be noted that the specified acoustic feature obtained in step 101 may also include the above-mentioned five normalized dimensions. Normalized acoustic features are easier to interpret. Take the specified acoustic feature {-1, 1, 0, 1, 0} as an example. The value of log_pitch_mean is -1, which means that the target audio generated by the speech synthesis model in conformity with the specified acoustic feature is low-pitched. The value of log_pitch_std is 1, indicating that the fundamental frequency of the target audio varies greatly. The value of log_energy_mean is 0, indicating that the target audio has normal volume. The value of log_duration_mean is 1, indicating that the speech rate of the target audio is slow (that is, the average duration of the phonemes is long). The value of log_spectral_tilt is 0, indicating that the noise level of the target audio is normal.

To sum up, the present disclosure first obtains the text to be synthesized and the specified acoustic feature used to indicate the prosodic feature of the audio; then extracts the corresponding phoneme sequence from the text to be synthesized and extends the specified acoustic feature according to the phoneme sequence to obtain the acoustic feature sequence; and finally inputs the phoneme sequence and the acoustic feature sequence into the pre-trained speech synthesis model, thereby obtaining the target audio that is output by the speech synthesis model, corresponds to the text to be synthesized, and matches the specified acoustic feature. The present disclosure controls the speech synthesis of text by means of the specified acoustic feature, so that the target audio output by the speech synthesis model corresponds to the specified acoustic feature, which realizes explicit control of acoustic features in the process of speech synthesis and improves the expressiveness of the target audio.
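The overall flow of steps 101 to 104 can be sketched as a single function. The phoneme extractor and the speech synthesis model are passed in as callables because the text does not prescribe a concrete interface for them; the names synthesize, extract_phonemes and model are hypothetical.

```python
import numpy as np

def synthesize(text, specified_feature, extract_phonemes, model):
    """Sketch of steps 101-104 with caller-supplied components.

    extract_phonemes: callable mapping text to a list of phonemes.
    model: callable taking (phonemes, feature_sequence) and returning audio.
    Both are placeholders for components the disclosure leaves open.
    """
    # Step 101: the text and the specified acoustic feature are the inputs.
    feature = np.asarray(specified_feature, dtype=np.float64)
    # Step 102: extract the phoneme sequence of the text.
    phonemes = extract_phonemes(text)
    # Step 103: extend the feature so that each phoneme gets one copy.
    feature_sequence = np.tile(feature, (len(phonemes), 1))
    # Step 104: let the pre-trained model produce the target audio.
    return model(phonemes, feature_sequence)
```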

FIG. 8 is a block diagram of a speech synthesis apparatus according to an exemplary embodiment. As shown in FIG. 8 , the apparatus 200 may include the following modules:

An acquisition module 201 is used to acquire the text to be synthesized and the specified acoustic feature, and the specified acoustic feature is used to indicate the prosody feature of the audio;

The extraction module 202 is used to extract the phoneme sequence corresponding to the text to be synthesized;

The expansion module 203 is used to expand the specified acoustic feature according to the phoneme sequence to obtain the acoustic feature sequence;

The synthesis module 204 is used to input the phoneme sequence and the acoustic feature sequence into the pre-trained speech synthesis model to obtain the target audio output by the speech synthesis model and corresponding to the text to be synthesized, and the acoustic features of the target audio match the specified acoustic features.

FIG. 9 is a block diagram of another speech synthesis apparatus according to an exemplary embodiment. As shown in FIG. 9 , the expansion module 203 may include:

Determining submodule 2031, for determining the acoustic feature corresponding to each phoneme in the phoneme sequence according to the specified acoustic feature;

The expansion submodule 2032 is used to form the acoustic feature sequence from the acoustic features corresponding to each phoneme.

In an application scenario, the speech synthesis model in the above embodiment can be used to perform the following steps:

Step A, determine the text feature sequence corresponding to the text to be synthesized according to the phoneme sequence, and the text feature sequence includes the text feature corresponding to each phoneme in the phoneme sequence;

In an application scenario, the specified acoustic feature includes at least one of fundamental frequency, volume, or speech rate.

In another application scenario, the speech synthesis model is trained as follows:

Step C, extracting the real acoustic features of the training audio corresponding to the training text, and the real acoustic features are used to indicate the prosodic features of the training audio;

Step D, extending the real acoustic feature according to the training phoneme sequence corresponding to the training text, to obtain the training acoustic feature sequence;

Specifically, the real acoustic features include: at least one of fundamental frequency, volume, or speech rate, and the implementation of step C may include steps C1-C3.

The specific implementation may include: first, determining the duration corresponding to each training phoneme according to the training audio and the training phoneme sequence; then, performing a logarithmic operation on the duration corresponding to each training phoneme to obtain the logarithmic duration corresponding to each training phoneme; finally, using the statistical value of the logarithmic duration corresponding to each training phoneme in the training phoneme sequence as the speech rate of the training audio.

The specific implementation may include: first, performing a logarithmic operation on the fundamental frequency corresponding to each audio frame to obtain the logarithmic fundamental frequency corresponding to each audio frame; then, using the statistical value of the logarithmic fundamental frequency corresponding to each audio frame in the training audio as the fundamental frequency of the training audio.

The specific implementation may include: first, performing a logarithmic operation on the volume corresponding to each audio frame to obtain the logarithmic volume corresponding to each audio frame; then, using the statistical value of the logarithmic volume corresponding to each audio frame in the training audio as the volume of the training audio.

In another application scenario, the speech synthesis model may also be obtained by training in the following manner:

Step F, according to the preset training set including the real acoustic features of a plurality of training audio, determine the statistical acoustic features of the training set;

Correspondingly, the implementation of step D may be:

Regarding the apparatus in the above-mentioned embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment of the method, and will not be described in detail here.

Referring to FIG. 10 below, it shows a schematic structural diagram of an electronic device 300 suitable for implementing an embodiment of the present disclosure (i.e., the execution body of the above-mentioned speech synthesis method). Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players), and vehicle-mounted terminals (e.g., in-vehicle navigation terminals), as well as stationary terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 10 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.

As shown in FIG. 10, the electronic device 300 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 301, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage device 308 into a random access memory (RAM) 303. The RAM 303 also stores various programs and data necessary for the operation of the electronic device 300. The processing device 301, the ROM 302, and the RAM 303 are connected to each other through a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.

Typically, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 307 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 308 including, for example, a magnetic tape, hard disk, etc.; and a communication device 309. The communication device 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 10 shows the electronic device 300 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from the network via the communication device 309, or from the storage device 308, or from the ROM 302. When the computer program is executed by the processing device 301, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.

It should be noted that the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the above. More specific examples of computer-readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.

In some embodiments, terminal devices and servers can communicate using any currently known or future developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.

The above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.

The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: acquire the text to be synthesized and a specified acoustic feature, the specified acoustic feature being used to indicate a prosodic feature of audio; extract the phoneme sequence corresponding to the text to be synthesized; extend the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence; and input the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain the target audio output by the speech synthesis model and corresponding to the text to be synthesized, the acoustic feature of the target audio matching the specified acoustic feature.

Computer program code for performing operations of the present disclosure may be written in one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It is also noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented in dedicated hardware-based systems that perform the specified functions or operations, or can be implemented in a combination of dedicated hardware and computer instructions.

The modules involved in the embodiments of the present disclosure may be implemented in software or hardware. The name of a module does not, under certain circumstances, constitute a limitation of the module itself; for example, the acquisition module can also be described as "a module for acquiring text to be synthesized and a specified acoustic feature".

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical Devices (CPLDs) and more.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

According to one or more embodiments of the present disclosure, Example 1 provides a speech synthesis method, including: acquiring text to be synthesized and a specified acoustic feature, where the specified acoustic feature is used to indicate a prosodic feature of audio; extracting the phoneme sequence corresponding to the text to be synthesized; extending the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence; and inputting the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain the target audio output by the speech synthesis model and corresponding to the text to be synthesized, where the acoustic feature of the target audio matches the specified acoustic feature.

According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, wherein extending the specified acoustic feature according to the phoneme sequence to obtain the acoustic feature sequence includes: determining, according to the specified acoustic feature, the acoustic feature corresponding to each phoneme in the phoneme sequence; and forming the acoustic feature sequence from the acoustic features corresponding to each phoneme.

According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 1, where the speech synthesis model is configured to: determine, according to the phoneme sequence, a text feature sequence corresponding to the text to be synthesized, the text feature sequence including the text feature corresponding to each phoneme in the phoneme sequence; and generate the target audio according to the text feature sequence and the acoustic feature sequence.

According to one or more embodiments of the present disclosure, Example 4 provides the method of Examples 1 to 3, wherein the specified acoustic feature includes at least one of fundamental frequency, volume, or speech rate.

According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 1, where the speech synthesis model is obtained by training in the following manner: extracting the real acoustic feature of the training audio corresponding to the training text, the real acoustic feature being used to indicate the prosodic feature of the training audio; extending the real acoustic feature according to the training phoneme sequence corresponding to the training text to obtain a training acoustic feature sequence; and inputting the training phoneme sequence and the training acoustic feature sequence into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio.

According to one or more embodiments of the present disclosure, Example 6 provides the method of Example 5, wherein the real acoustic feature includes at least one of fundamental frequency, volume, or speech rate, and extracting the real acoustic feature of the training audio corresponding to the training text includes: if the real acoustic feature includes speech rate, determining, according to the training audio and the training phoneme sequence, the duration corresponding to each training phoneme in the training phoneme sequence, so as to determine the speech rate of the training audio; if the real acoustic feature includes fundamental frequency, extracting the fundamental frequency of each audio frame included in the training audio to determine the fundamental frequency of the training audio; and if the real acoustic feature includes volume, extracting the volume of each audio frame included in the training audio to determine the volume of the training audio.

According to one or more embodiments of the present disclosure, Example 7 provides the method of Example 6, wherein determining the duration corresponding to each training phoneme in the training phoneme sequence according to the training audio and the training phoneme sequence, so as to determine the speech rate of the training audio, includes: determining the duration corresponding to each training phoneme according to the training audio and the training phoneme sequence; performing a logarithmic operation on the duration corresponding to each training phoneme to obtain the logarithmic duration corresponding to each training phoneme; and using the statistical value of the logarithmic duration corresponding to each training phoneme in the training phoneme sequence as the speech rate of the training audio; extracting the fundamental frequency of each audio frame included in the training audio to determine the fundamental frequency of the training audio includes: performing a logarithmic operation on the fundamental frequency corresponding to each audio frame to obtain the logarithmic fundamental frequency corresponding to each audio frame; and using the statistical value of the logarithmic fundamental frequency corresponding to each audio frame in the training audio as the fundamental frequency of the training audio; and extracting the volume of each audio frame included in the training audio to determine the volume of the training audio includes: performing a logarithmic operation on the volume corresponding to each audio frame to obtain the logarithmic volume corresponding to each audio frame; and using the statistical value of the logarithmic volume corresponding to each audio frame in the training audio as the volume of the training audio.

According to one or more embodiments of the present disclosure, Example 8 provides the method of Examples 5 to 7, wherein the speech synthesis model is further obtained by training in the following manner: determining the statistical acoustic features of a preset training set according to the real acoustic features of the plurality of training audios included in the training set; and normalizing the real acoustic feature of each training audio according to the statistical acoustic features; and wherein extending the real acoustic feature according to the training phoneme sequence corresponding to the training text to obtain a training acoustic feature sequence includes: extending the normalized real acoustic feature according to the training phoneme sequence to obtain the training acoustic feature sequence.

According to one or more embodiments of the present disclosure, Example 9 provides a speech synthesis apparatus, including: an acquisition module, used to acquire text to be synthesized and a specified acoustic feature, where the specified acoustic feature is used to indicate a prosodic feature of audio; an extraction module, used to extract the phoneme sequence corresponding to the text to be synthesized; an expansion module, used to extend the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence; and a synthesis module, used to input the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain the target audio output by the speech synthesis model and corresponding to the text to be synthesized, where the acoustic feature of the target audio matches the specified acoustic feature.

According to one or more embodiments of the present disclosure, Example 10 provides a computer-readable medium having stored thereon a computer program that, when executed by a processing apparatus, implements the steps of the methods described in Examples 1 to 8.

According to one or more embodiments of the present disclosure, Example 11 provides an electronic device, including: a storage device on which a computer program is stored; and a processing device for executing the computer program in the storage device to implement the steps of the methods described in Examples 1 to 8.

According to one or more embodiments of the present disclosure, Example 12 provides a computer program comprising instructions that, when executed by a processor, cause the processor to perform one or more steps of the methods of Examples 1 to 8.

According to one or more embodiments of the present disclosure, Example 13 provides a computer program product comprising instructions that, when executed by a processor, cause the processor to perform one or more steps of the methods of Examples 1 to 8.

The above description is merely a preferred embodiment of the present disclosure and an illustration of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to the technical solutions formed by the specific combination of the above-mentioned technical features, and should also cover other technical solutions formed by any combination of the above-mentioned technical features or their equivalent features without departing from the above-mentioned disclosed concept, for example, a technical solution formed by replacing the above features with technical features having similar functions that are disclosed in (but not limited to) the present disclosure.

Additionally, although operations are depicted in a particular order, this should not be construed as requiring that the operations be performed in the particular order shown or in a sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although the above discussion contains several implementation-specific details, these should not be construed as limitations on the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or logical acts of method, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims

A speech synthesis method, comprising:

Obtaining the text to be synthesized and a specified acoustic feature, the specified acoustic feature is used to indicate the prosody feature of the audio;

extracting the phoneme sequence corresponding to the text to be synthesized;

Extending the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence;

Inputting the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain the target audio output by the speech synthesis model and corresponding to the text to be synthesized, wherein the acoustic feature of the target audio matches the specified acoustic feature.
The method according to claim 1, wherein extending the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence comprises:

According to the specified acoustic feature, determine the acoustic feature corresponding to each phoneme in the phoneme sequence;

The acoustic features corresponding to each of the phonemes are formed into the acoustic feature sequence.
The method of claim 1, wherein the speech synthesis model is used to:

determine, according to the phoneme sequence, a text feature sequence corresponding to the text to be synthesized, the text feature sequence including a text feature corresponding to each phoneme in the phoneme sequence; and

generate the target audio according to the text feature sequence and the acoustic feature sequence.
The method according to any one of claims 1-3, wherein the specified acoustic feature comprises at least one of fundamental frequency, volume, or speech rate.
The method according to claim 1, wherein the speech synthesis model is obtained by training in the following manner:

extracting a real acoustic feature of a training audio corresponding to a training text, the real acoustic feature being used to indicate a prosodic feature of the training audio;

extending the real acoustic feature according to a training phoneme sequence corresponding to the training text to obtain a training acoustic feature sequence; and

inputting the training phoneme sequence and the training acoustic feature sequence into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio.
The method according to claim 5, wherein the real acoustic feature comprises at least one of fundamental frequency, volume, or speech rate, and extracting the real acoustic feature of the training audio corresponding to the training text comprises:

if the real acoustic feature includes the speech rate, determining, according to the training audio and the training phoneme sequence, the duration corresponding to each training phoneme in the training phoneme sequence, so as to determine the speech rate of the training audio;

if the real acoustic feature includes the fundamental frequency, extracting the fundamental frequency of each audio frame included in the training audio, so as to determine the fundamental frequency of the training audio; and

if the real acoustic feature includes the volume, extracting the volume of each audio frame included in the training audio, so as to determine the volume of the training audio.
The method of claim 6, wherein

determining, according to the training audio and the training phoneme sequence, the duration corresponding to each training phoneme in the training phoneme sequence, so as to determine the speech rate of the training audio, comprises:

determining, according to the training audio and the training phoneme sequence, the duration corresponding to each of the training phonemes;

performing a logarithmic operation on the duration corresponding to each of the training phonemes to obtain a logarithmic duration corresponding to each of the training phonemes; and

taking a statistical value of the logarithmic durations corresponding to the training phonemes in the training phoneme sequence as the speech rate of the training audio;

extracting the fundamental frequency of each audio frame included in the training audio, so as to determine the fundamental frequency of the training audio, comprises:

performing a logarithmic operation on the fundamental frequency corresponding to each of the audio frames to obtain a logarithmic fundamental frequency corresponding to each of the audio frames; and

taking a statistical value of the logarithmic fundamental frequencies corresponding to the audio frames in the training audio as the fundamental frequency of the training audio; and

extracting the volume of each audio frame included in the training audio, so as to determine the volume of the training audio, comprises:

performing a logarithmic operation on the volume corresponding to each of the audio frames to obtain a logarithmic volume corresponding to each of the audio frames; and

taking a statistical value of the logarithmic volumes corresponding to the audio frames in the training audio as the volume of the training audio.
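By way of illustration only, the logarithmic statistic shared by the three features can be sketched as follows. This is a minimal sketch under stated assumptions: the claim does not say which statistical value is used, so the arithmetic mean is assumed here; skipping non-positive values (for example, unvoiced frames whose fundamental frequency is reported as 0) is also an assumption, as are the function name `log_statistic` and the sample numbers.

```python
import math
from typing import Sequence


def log_statistic(values: Sequence[float]) -> float:
    """Take the logarithm of each value and return the mean of the logs.

    Assumptions: the "statistical value" is the arithmetic mean, and
    non-positive values are skipped because their logarithm is undefined.
    """
    logs = [math.log(v) for v in values if v > 0.0]
    if not logs:
        raise ValueError("no positive values to take the logarithm of")
    return sum(logs) / len(logs)


# Hypothetical measurements for one training audio.
phoneme_durations = [0.08, 0.12, 0.10]   # seconds per training phoneme
frame_f0 = [0.0, 200.0, 220.0, 0.0]      # Hz per audio frame, 0 = unvoiced
frame_volume = [0.02, 0.05, 0.04]        # per-frame energy, arbitrary units

speech_rate = log_statistic(phoneme_durations)
fundamental_frequency = log_statistic(frame_f0)
volume = log_statistic(frame_volume)
```

Each call reduces a per-phoneme or per-frame series to a single utterance-level number, so one training audio is described by three scalars.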
The method according to any one of claims 5-7, wherein

the method further comprises:

determining statistical acoustic features of a preset training set according to the real acoustic features of a plurality of training audios included in the training set; and

normalizing the real acoustic feature of each of the training audios according to the statistical acoustic features; and

extending the real acoustic feature according to the training phoneme sequence corresponding to the training text to obtain a training acoustic feature sequence comprises:

extending the normalized real acoustic feature according to the training phoneme sequence to obtain the training acoustic feature sequence.
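By way of illustration only, the normalization step can be sketched as follows. This is a minimal sketch under stated assumptions: the claim does not define the statistical acoustic features, so a per-dimension mean and population standard deviation (z-score normalization) are assumed; the function names, the `eps` guard, and the sample values are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def training_set_statistics(
    features: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Compute per-dimension mean and standard deviation over the training set.

    Assumption: the "statistical acoustic features" are the mean and the
    population standard deviation of each feature dimension.
    """
    columns = list(zip(*features))
    means = [statistics.mean(column) for column in columns]
    stds = [statistics.pstdev(column) for column in columns]
    return means, stds


def normalize(
    feature: Sequence[float],
    means: Sequence[float],
    stds: Sequence[float],
    eps: float = 1e-8,
) -> List[float]:
    """Z-score one real acoustic feature; eps avoids division by zero."""
    return [(x - m) / (s + eps) for x, m, s in zip(feature, means, stds)]


# Hypothetical training set: one (log F0, log volume) pair per training audio.
real_features = [[5.0, -3.0], [5.4, -2.0], [5.2, -2.5]]
means, stds = training_set_statistics(real_features)
normalized = [normalize(f, means, stds) for f in real_features]
```

Under this assumption a normalized value of 0 means "average for the training set", which gives the specified acoustic feature at synthesis time a speaker-independent scale.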
A speech synthesis device, comprising:

an acquisition module, configured to acquire a text to be synthesized and a specified acoustic feature, the specified acoustic feature being used to indicate a prosodic feature of audio;

an extraction module, configured to extract a phoneme sequence corresponding to the text to be synthesized;

an expansion module, configured to extend the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence; and

a synthesis module, configured to input the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain a target audio that is output by the speech synthesis model and corresponds to the text to be synthesized, an acoustic feature of the target audio matching the specified acoustic feature.
A computer-readable medium having a computer program stored thereon, the program implementing the method of any one of claims 1-8 when executed by a processing device.
An electronic device comprising:

a storage device on which a computer program is stored; and

a processing device, configured to execute the computer program in the storage device so as to implement the method of any one of claims 1-8.
A computer program comprising instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-8.
A computer program product comprising instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-8.