CN112786008B - Speech synthesis method and device, readable medium and electronic equipment - Google Patents


Info

Publication number
CN112786008B
CN112786008B (Application No. CN202110075977.XA)
Authority
CN
China
Prior art keywords
training
audio
acoustic
text
phoneme
Prior art date
Legal status
Active
Application number
CN202110075977.XA
Other languages
Chinese (zh)
Other versions
CN112786008A (en)
Inventor
吴鹏飞
伍林
潘俊杰
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110075977.XA priority Critical patent/CN112786008B/en
Publication of CN112786008A publication Critical patent/CN112786008A/en
Priority to PCT/CN2021/139987 priority patent/WO2022156464A1/en
Application granted granted Critical
Publication of CN112786008B publication Critical patent/CN112786008B/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/0335 Pitch control
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Abstract

The disclosure relates to a speech synthesis method and apparatus, a readable medium, and an electronic device, and belongs to the technical field of electronic information processing. The method comprises: obtaining a text to be synthesized and a specified acoustic feature, wherein the specified acoustic feature is used for indicating a prosodic feature of audio; extracting a phoneme sequence corresponding to the text to be synthesized; expanding the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence; and inputting the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain target audio output by the speech synthesis model, the acoustic features of the target audio matching the specified acoustic feature. Because speech synthesis is controlled by the specified acoustic feature, the target audio output by the speech synthesis model corresponds to the specified acoustic feature, explicit control over acoustic features during speech synthesis is achieved, and the expressiveness of the target audio is improved.

Description

Speech synthesis method and device, readable medium and electronic equipment
Technical Field
The present disclosure relates to the field of electronic information processing technologies, and in particular, to a method and apparatus for speech synthesis, a readable medium, and an electronic device.
Background
With the continuous development of electronic information processing technology, speech, as an important carrier through which people acquire information, is widely used in daily life and work. Application scenarios involving speech typically involve speech synthesis, i.e., synthesizing text specified by a user into audio. In the speech synthesis process, if target speech conforming to a certain acoustic feature is to be synthesized from a specified text, reference speech conforming to that acoustic feature must be prepared in advance. However, the duration of the reference speech often differs greatly from that of the target speech, which may lead to unstable synthesis results. In addition, it is difficult to prepare in advance reference speech corresponding to the wide variety of acoustic features demanded by users. Therefore, effective control of acoustic features cannot be achieved during speech synthesis.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a method of speech synthesis, the method comprising:
acquiring a text to be synthesized and a specified acoustic feature, wherein the specified acoustic feature is used for indicating a prosodic feature of audio;
extracting a phoneme sequence corresponding to the text to be synthesized;
expanding the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence;
inputting the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain target audio corresponding to the text to be synthesized and output by the speech synthesis model, wherein the acoustic features of the target audio match the specified acoustic feature.
In a second aspect, the present disclosure provides a speech synthesis apparatus, the apparatus comprising:
an acquisition module, configured to acquire a text to be synthesized and a specified acoustic feature, wherein the specified acoustic feature is used for indicating a prosodic feature of audio;
an extraction module, configured to extract a phoneme sequence corresponding to the text to be synthesized;
an expansion module, configured to expand the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence;
and a synthesis module, configured to input the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain target audio corresponding to the text to be synthesized and output by the speech synthesis model, the acoustic features of the target audio matching the specified acoustic feature.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the method of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method of the first aspect of the disclosure.
Through the above technical solution, the present disclosure first obtains a text to be synthesized and a specified acoustic feature for indicating a prosodic feature of audio, then extracts the phoneme sequence corresponding to the text to be synthesized, expands the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence, and finally inputs the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain target audio that is output by the speech synthesis model, corresponds to the text to be synthesized, and matches the specified acoustic feature. Because speech synthesis of the text is controlled by the specified acoustic feature, the target audio output by the speech synthesis model corresponds to the specified acoustic feature, explicit control over acoustic features during speech synthesis is achieved, and the expressiveness of the target audio is improved.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow chart illustrating a method of speech synthesis according to an exemplary embodiment;
FIG. 2 is a flow chart illustrating another method of speech synthesis according to an exemplary embodiment;
FIG. 3 is a flowchart illustrating a process of a speech synthesis model according to an exemplary embodiment;
FIG. 4 is a block diagram of a speech synthesis model, shown according to an exemplary embodiment;
FIG. 5 is a flowchart illustrating a training speech synthesis model according to an exemplary embodiment;
FIG. 6 is a flowchart illustrating another training speech synthesis model according to an exemplary embodiment;
FIG. 7 is a flowchart illustrating another training speech synthesis model according to an exemplary embodiment;
FIG. 8 is a block diagram of a speech synthesis apparatus according to an exemplary embodiment;
FIG. 9 is a block diagram of another speech synthesis apparatus according to an example embodiment;
fig. 10 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" and "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Fig. 1 is a flow chart illustrating a method of speech synthesis, as shown in fig. 1, according to an exemplary embodiment, which may include the steps of:
step 101, obtaining a text to be synthesized and designated acoustic features, wherein the designated acoustic features are used for indicating rhythm features of the audio.
For example, a text to be synthesized that needs to be synthesized is first obtained. The text to be synthesized may be, for example, one or more sentences in a text file specified by a user, one or more paragraphs in a text file, one or more chapters, or one or more words in a text file. The text file may be, for example, an electronic book, or may be other types of files, such as news, public number articles, blogs, etc. At the same time, specified acoustic features, which can be understood as specified by the user, are also acquired, and it is desired to synthesize the text to be synthesized into audio conforming to the specified acoustic features (i.e., target audio mentioned later). The specified acoustic features may include multiple dimensions, which may include, for example, one or more of a base frequency (English: pitch), a volume (English: energy), a velocity (English: duration), and may further include: noise level, pitch, timbre, loudness, etc. The noise level is understood to be a characteristic that reflects the noise level in the audio, among other things.
Step 102, extracting a phoneme sequence corresponding to the text to be synthesized.
For example, the text to be synthesized may be input into a pre-trained recognition model to obtain the phoneme sequence corresponding to the text to be synthesized output by the recognition model. Alternatively, the phonemes corresponding to each word in the text to be synthesized may be looked up in a pre-established dictionary, and the phonemes corresponding to each word are then combined into the phoneme sequence corresponding to the text to be synthesized. A phoneme can be understood as a speech unit divided according to the pronunciation of each word, for example the initials and finals in the pinyin corresponding to each word. The phoneme sequence includes the phonemes corresponding to each word in the text to be synthesized (one word may correspond to one or more phonemes). For example, if the text to be synthesized is "the weather is nice today", the phonemes corresponding to each word may be looked up in the dictionary in turn, thereby determining the phoneme sequence "jintiantianqihenhao".
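The following Python sketch illustrates the dictionary-lookup variant described above; the dictionary contents and function name are illustrative assumptions, not part of the embodiment:

    # Sketch: build a phoneme sequence by looking up each character in a
    # pre-established pronunciation dictionary (illustrative entries only).
    PHONEME_DICT = {
        "今": ["j", "in"],
        "天": ["t", "ian"],
        "气": ["q", "i"],
    }

    def text_to_phonemes(text):
        """Concatenate the phonemes of every character in the text."""
        phonemes = []
        for char in text:
            phonemes.extend(PHONEME_DICT.get(char, []))  # skip characters not in the dictionary
        return phonemes

    print(text_to_phonemes("今天天气"))  # ['j', 'in', 't', 'ian', 't', 'ian', 'q', 'i']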
Step 103, expanding the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence.
Step 104, inputting the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain target audio corresponding to the text to be synthesized and output by the speech synthesis model, wherein the acoustic features of the target audio match the specified acoustic feature.
For example, after the phoneme sequence is obtained, the specified acoustic feature may be expanded according to the phoneme sequence to obtain an acoustic feature sequence, where the acoustic feature sequence includes the acoustic feature corresponding to each phoneme in the phoneme sequence. In one implementation, the acoustic feature sequence may be generated according to the length of the phoneme sequence (i.e., the number of phonemes included in the phoneme sequence), with the acoustic feature corresponding to each phoneme being the specified acoustic feature. In another implementation, the specified acoustic feature may be used as a mean (or standard deviation), and the acoustic feature corresponding to each phoneme may be sampled from a preset distribution (such as a Gaussian distribution or a uniform distribution).
Then, the phoneme sequence and the acoustic feature sequence can be used as the input of a pre-trained speech synthesis model, and the speech synthesis model outputs target audio that corresponds to the text to be synthesized and matches the specified acoustic feature. The speech synthesis model is pre-trained and can be understood as a TTS (Text To Speech) model capable of generating, from a text to be synthesized and a specified acoustic feature, target audio that corresponds to the text to be synthesized and matches the specified acoustic feature. Specifically, the speech synthesis model may be trained based on a Tacotron model, a Deep Voice 3 model, a Tacotron 2 model, a WaveNet model, or the like, which is not specifically limited in this disclosure. In this way, in the process of synthesizing speech from the text to be synthesized, the specified acoustic feature is considered in addition to the semantics contained in the text to be synthesized, so that the target audio exhibits the specified acoustic feature. Explicit control over acoustic features during speech synthesis is thus achieved without spending a great deal of time and labor preparing reference speech corresponding to various acoustic features in advance; meanwhile, the instability caused by a large difference between the duration of the reference speech and that of the target audio is avoided, the expressiveness of the target audio is improved, and the listening experience of the user is improved.
In summary, the present disclosure first obtains a text to be synthesized and a specified acoustic feature for indicating a prosodic feature of audio, then extracts the corresponding phoneme sequence from the text to be synthesized, expands the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence, and finally inputs the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model, thereby obtaining target audio output by the speech synthesis model that corresponds to the text to be synthesized and matches the specified acoustic feature. Because speech synthesis of the text is controlled by the specified acoustic feature, the target audio output by the speech synthesis model corresponds to the specified acoustic feature, explicit control over acoustic features during speech synthesis is achieved, and the expressiveness of the target audio is improved.
FIG. 2 is a flow chart illustrating another speech synthesis method according to an exemplary embodiment. As shown in FIG. 2, step 103 may be implemented as follows:
Step 1031, determining, according to the specified acoustic feature, the acoustic feature corresponding to each phoneme in the phoneme sequence.
Step 1032, combining the acoustic features corresponding to each phoneme into the acoustic feature sequence.
For example, in one implementation, the length of the phoneme sequence, i.e., the number of phonemes included in the phoneme sequence, may be determined first. The specified acoustic feature is then replicated to obtain an acoustic feature sequence of the same length as the phoneme sequence, in which each acoustic feature is identical to the specified acoustic feature; that is, the acoustic feature corresponding to each phoneme in the acoustic feature sequence is the specified acoustic feature. For example, if the length of the phoneme sequence is 100 (i.e., it includes 100 phonemes), the acoustic feature corresponding to each phoneme may be set to the specified acoustic feature, and the acoustic features corresponding to the 100 phonemes are then combined into the acoustic feature sequence. Taking the specified acoustic feature as a 1×5-dimensional vector as an example, the acoustic feature sequence then includes 100 vectors of 1×5 dimensions, which can form a 100×5-dimensional matrix.
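A minimal Python sketch of this expansion step, assuming NumPy and a 5-dimensional specified acoustic feature; both strategies mentioned under step 103 are shown, and the names are illustrative:

    import numpy as np

    def expand_acoustic_feature(spec_feat, num_phonemes, sample_std=None):
        """Expand a 1x5 specified acoustic feature to a (num_phonemes, 5) sequence.

        If sample_std is None, the feature is replicated for every phoneme;
        otherwise each phoneme's feature is sampled from a Gaussian centred on
        the specified feature (the second strategy described above).
        """
        spec_feat = np.asarray(spec_feat)
        if sample_std is None:
            return np.tile(spec_feat, (num_phonemes, 1))
        return np.random.normal(loc=spec_feat, scale=sample_std,
                                size=(num_phonemes, spec_feat.shape[-1]))

    spec_feat = [-1.0, 1.0, 0.0, 1.0, 0.0]   # e.g. a normalized 5-dimensional feature
    acoustic_seq = expand_acoustic_feature(spec_feat, num_phonemes=100)
    print(acoustic_seq.shape)  # (100, 5)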
FIG. 3 is a processing flowchart of the speech synthesis model according to an exemplary embodiment. As shown in FIG. 3, the speech synthesis model may be used to perform the following steps:
and step A, determining a text feature sequence corresponding to the text to be synthesized according to the phoneme sequence, wherein the text feature sequence comprises text features corresponding to each phoneme in the phoneme sequence.
And B, generating target audio according to the text feature sequence and the acoustic feature sequence.
For example, in the specific process of synthesizing the target audio, the speech synthesis model may extract, according to the phoneme sequence, a text feature sequence (i.e., a text embedding) corresponding to the text to be synthesized, where the text feature sequence includes the text feature corresponding to each phoneme in the phoneme sequence, and a text feature can be understood as a text vector representing the phoneme. For example, if the phoneme sequence includes 100 phonemes and the text vector corresponding to each phoneme is a 1×80-dimensional vector, the text feature sequence may be a 100×80-dimensional matrix.
After the text feature sequence is obtained, it may be combined with the acoustic feature sequence to generate target audio that matches the specified acoustic feature. For example, the text feature sequence may be concatenated with the acoustic feature sequence to obtain a combined sequence, and the target audio is then generated from the combined sequence. For example, if the phoneme sequence includes 100 phonemes, the text feature sequence may be a 100×80-dimensional matrix and the corresponding acoustic feature sequence a 100×5-dimensional matrix, so the combined sequence may be a 100×85-dimensional matrix, from which the target audio is generated.
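A shape-level Python sketch of the concatenation described above, with the dimensions following the example in the text; the function name is an assumption for illustration:

    import numpy as np

    def combine_features(text_feats, acoustic_feats):
        """Concatenate per-phoneme text features with per-phoneme acoustic features."""
        assert text_feats.shape[0] == acoustic_feats.shape[0]  # one row per phoneme
        return np.concatenate([text_feats, acoustic_feats], axis=-1)

    text_feats = np.random.randn(100, 80)                                     # 100 phonemes x 80 dims
    acoustic_feats = np.tile(np.array([-1.0, 1.0, 0.0, 1.0, 0.0]), (100, 1))  # 100 x 5
    combined = combine_features(text_feats, acoustic_feats)
    print(combined.shape)  # (100, 85)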
Taking the speech synthesis model shown in fig. 4 as an example, the speech synthesis model is a Tacotron model, which includes: an Encoder, an Attention network, a Decoder, and a Post-processing network. The encoder may include an embedding layer (i.e., a Character Embedding layer), a pre-processing network (Pre-net) sub-model, and a CBHG (Convolution Bank + Highway network + bidirectional Gated Recurrent Unit) sub-model. The phoneme sequence is input into the encoder: first, the phoneme sequence is converted into word vectors by the embedding layer; then the word vectors are input into the Pre-net sub-model, which applies a nonlinear transformation to them so as to improve the convergence and generalization capability of the speech synthesis model; finally, the CBHG sub-model obtains, from the nonlinearly transformed word vectors, a text feature sequence capable of representing the text to be synthesized.
The acoustic feature sequence and the text feature sequence output by the encoder may then be concatenated to obtain a combined sequence, and the combined sequence is input into the attention network, which adds an attention weight to each element in the combined sequence. Specifically, the attention network may be a Location Sensitive Attention network, a GMM (Gaussian Mixture Model) attention network, or a Multi-Head Attention network, which is not specifically limited in this disclosure.
The output of the attention network is then taken as the input of the decoder. The decoder may include a pre-processing network sub-model (which may be the same as the pre-processing network sub-model included in the encoder), an Attention-RNN, and a Decoder-RNN. The pre-processing network sub-model applies a nonlinear transformation to its input; the Attention-RNN is a unidirectional zoneout-based LSTM (Long Short-Term Memory) network, which takes the output of the pre-processing network sub-model as input and outputs to the Decoder-RNN after passing through the LSTM units. The Decoder-RNN is a two-layer unidirectional zoneout-based LSTM, which outputs mel-spectrum information through the LSTM units; the mel-spectrum information may include one or more mel-spectrum features. Finally, the mel-spectrum information is input into the post-processing network, which may include a vocoder (e.g., a WaveNet vocoder or a Griffin-Lim vocoder) for converting the mel-spectrum information to obtain the target audio.
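The following shape-level Python sketch only illustrates where the expanded acoustic feature sequence enters the Tacotron-style pipeline described above; the encoder, attention/decoder, and vocoder are replaced by random stand-in projections with toy dimensions and are not the actual trained sub-models:

    import numpy as np

    rng = np.random.default_rng(0)

    def encoder(phoneme_ids):                 # stand-in for embedding + Pre-net + CBHG
        embed = rng.normal(size=(60, 80))     # toy vocabulary of 60 phonemes, 80-dim text features
        return embed[phoneme_ids]             # (num_phonemes, 80) text feature sequence

    def attention_decoder(combined):          # stand-in for Attention + Attention-RNN + Decoder-RNN
        num_frames = combined.shape[0] * 4    # toy ratio: 4 mel frames per phoneme
        proj = rng.normal(size=(combined.shape[1], 80))
        weights = np.full((num_frames, combined.shape[0]), 1.0 / combined.shape[0])  # uniform "attention"
        return weights @ (combined @ proj)    # (num_frames, 80) mel-spectrum stand-in

    def vocoder(mel):                         # stand-in for the post-processing network / vocoder
        return rng.normal(size=(mel.shape[0] * 200,))  # toy waveform, 200 samples per frame

    phoneme_ids = rng.integers(0, 60, size=100)
    acoustic_seq = np.tile([-1.0, 1.0, 0.0, 1.0, 0.0], (100, 1))               # expanded specified feature
    combined = np.concatenate([encoder(phoneme_ids), acoustic_seq], axis=-1)   # (100, 85)
    audio = vocoder(attention_decoder(combined))
    print(audio.shape)                        # (80000,)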
FIG. 5 is a flowchart illustrating the training of the speech synthesis model according to an exemplary embodiment. As shown in FIG. 5, the speech synthesis model is trained in the following manner:
Step C: extracting a real acoustic feature of training audio corresponding to a training text, wherein the real acoustic feature is used for indicating a prosodic feature of the training audio.
Step D: expanding the real acoustic feature according to a training phoneme sequence corresponding to the training text to obtain a training acoustic feature sequence.
Step E: inputting the training phoneme sequence and the training acoustic feature sequence into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio.
To train the speech synthesis model, training texts and the training audio corresponding to each training text are first acquired; there may be a plurality of training texts and a corresponding plurality of training audios. For example, a large amount of text can be crawled from the Internet as training texts, and the audio corresponding to each training text is used as training audio. For each training audio, the corresponding real acoustic feature may be extracted, for example through signal processing or manual annotation. The real acoustic feature is used to indicate a prosodic feature of the training audio and may include at least one of the fundamental frequency, volume, and speech rate of the training audio, and may further include noise level, pitch, timbre, loudness, etc. Meanwhile, the training phoneme sequence corresponding to the training text is extracted; the training phoneme sequence includes the training phonemes corresponding to each word in the training text (one word may correspond to one or more training phonemes).
Then, the real acoustic feature is expanded according to the training phoneme sequence corresponding to the training text to obtain the training acoustic feature sequence. The training acoustic feature sequence includes the training acoustic feature corresponding to each training phoneme. For example, the training acoustic feature sequence may be generated according to the number of training phonemes included in the training phoneme sequence, with the training acoustic feature corresponding to each training phoneme being the real acoustic feature.
Finally, the training phoneme sequence and the training acoustic feature sequence are used as the input of the speech synthesis model, and the speech synthesis model is trained according to the output of the speech synthesis model and the training audio. For example, the difference (or mean square error) between the output of the speech synthesis model and the training audio may be used as the loss function of the speech synthesis model, and the parameters of the neurons in the speech synthesis model, such as the weights and biases, may be updated by a back-propagation algorithm with the aim of reducing the loss function. The above steps are repeated until the loss function satisfies a preset condition, for example, until the loss function is smaller than a preset loss threshold.
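A schematic PyTorch-style training loop reflecting the procedure above (mean-squared-error loss on the model output, back-propagation to update weights and biases, stopping when the loss falls below a threshold); the model, data loader, and target representation are placeholders, not the patent's actual implementation:

    import torch

    def train(speech_synthesis_model, data_loader, loss_threshold=0.01, lr=1e-3):
        """Train until the MSE between the model output and the training audio
        representation (e.g. mel spectrograms) drops below a preset threshold."""
        optimizer = torch.optim.Adam(speech_synthesis_model.parameters(), lr=lr)
        criterion = torch.nn.MSELoss()
        while True:
            for phoneme_seq, train_acoustic_seq, target_mel in data_loader:
                optimizer.zero_grad()
                predicted_mel = speech_synthesis_model(phoneme_seq, train_acoustic_seq)
                loss = criterion(predicted_mel, target_mel)   # difference from the training audio
                loss.backward()                               # back-propagation
                optimizer.step()                              # update weights and biases
            if loss.item() < loss_threshold:                  # preset stopping condition
                return speech_synthesis_model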
FIG. 6 is a flowchart illustrating another way of training the speech synthesis model according to an exemplary embodiment. As shown in FIG. 6, the real acoustic feature includes at least one of fundamental frequency, volume, and speech rate; correspondingly, step C may be implemented as follows:
Step C1: if the real acoustic feature includes the speech rate, determining, according to the training audio and the training phoneme sequence, the duration corresponding to each training phoneme in the training phoneme sequence, so as to determine the speech rate of the training audio.
A specific implementation may include the following steps.
First, the duration corresponding to each training phoneme is determined according to the training audio and the training phoneme sequence. For example, the training audio may be segmented by the training phonemes included in the training phoneme sequence using HTS (HMM-based Speech Synthesis System) to obtain the duration corresponding to each training phoneme, which may be denoted as duration_i, the duration corresponding to the i-th training phoneme.
Then, a logarithm of the duration corresponding to each training phoneme is taken to obtain the logarithmic duration corresponding to each training phoneme; the logarithm compresses the dynamic range of the durations and makes their relative variation more prominent. The logarithmic duration corresponding to the i-th training phoneme may be denoted as log_duration_i.
Finally, a statistic of the logarithmic durations corresponding to the training phonemes in the training phoneme sequence is used as the speech rate of the training audio. For example, the mean (or standard deviation, extremum, etc.) of the logarithmic durations corresponding to the training phonemes may be used as the speech rate of the training audio, denoted as log_duration_mean.
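A minimal Python sketch of the speech-rate computation in step C1, assuming the per-phoneme durations (in seconds) have already been obtained from forced alignment (e.g. with HTS):

    import numpy as np

    def speech_rate(phoneme_durations):
        """log_duration_mean: mean of the log durations of the training phonemes."""
        log_durations = np.log(np.asarray(phoneme_durations))   # log_duration_i
        return float(log_durations.mean())                      # log_duration_mean

    print(speech_rate([0.08, 0.12, 0.10, 0.25]))  # larger value -> slower speech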
Step C2: if the real acoustic feature includes the fundamental frequency, extracting the fundamental frequency of each audio frame included in the training audio, so as to determine the fundamental frequency of the training audio.
A specific implementation may include the following steps.
First, the training audio may be processed by an audio processing tool such as sox, librosa, or STRAIGHT to obtain the fundamental frequency of each audio frame in the training audio, and a logarithm of the fundamental frequency corresponding to each audio frame is then taken to obtain the logarithmic fundamental frequency corresponding to each audio frame. The logarithm compresses the range of the fundamental frequency and makes its relative variation more prominent. The fundamental frequency corresponding to the j-th audio frame may be denoted as pitch_j, and correspondingly the logarithmic fundamental frequency corresponding to the j-th audio frame may be denoted as log_pitch_j.
Then, statistics of the logarithmic fundamental frequencies corresponding to the audio frames in the training audio are used as the fundamental frequency of the training audio. For example, the mean and the standard deviation of the logarithmic fundamental frequencies corresponding to the audio frames may be used as the fundamental frequency of the training audio, denoted as log_pitch_mean and log_pitch_std, where log_pitch_mean reflects the overall level of the fundamental frequency of the training audio and log_pitch_std reflects the variation amplitude of the fundamental frequency of the training audio.
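A Python sketch of the per-frame fundamental-frequency statistics, using librosa's pYIN tracker as one possible extraction tool; restricting the statistics to voiced frames and the pitch search range are assumptions made here for illustration:

    import librosa
    import numpy as np

    def pitch_stats(wav_path):
        """log_pitch_mean / log_pitch_std over the voiced frames of a training audio file."""
        y, sr = librosa.load(wav_path, sr=None)
        f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                          fmax=librosa.note_to_hz("C7"), sr=sr)
        log_f0 = np.log(f0[voiced_flag])          # log_pitch_j, voiced frames only
        return float(log_f0.mean()), float(log_f0.std())   # log_pitch_mean, log_pitch_std

    # log_pitch_mean, log_pitch_std = pitch_stats("training_audio.wav")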
Step C3: if the real acoustic feature includes the volume, extracting the volume of each audio frame included in the training audio, so as to determine the volume of the training audio.
A specific implementation may include the following steps.
First, the training audio may be processed by an audio processing tool such as sox, librosa, or STRAIGHT to obtain the volume of each audio frame in the training audio, and a logarithm of the volume corresponding to each audio frame is then taken to obtain the logarithmic volume corresponding to each audio frame. The logarithm compresses the range of the volume and makes its relative variation more prominent. The volume corresponding to the j-th audio frame may be denoted as energy_j, and correspondingly the logarithmic volume corresponding to the j-th audio frame may be denoted as log_energy_j.
Then, a statistic of the logarithmic volumes corresponding to the audio frames in the training audio is used as the volume of the training audio. For example, the mean of the logarithmic volumes corresponding to the audio frames may be used as the volume of the training audio, denoted as log_energy_mean.
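A Python sketch of the per-frame volume statistic; using RMS energy from librosa as the per-frame "volume" is an assumption made here for illustration:

    import librosa
    import numpy as np

    def volume_stat(wav_path):
        """log_energy_mean: mean log frame energy (RMS) of a training audio file."""
        y, _ = librosa.load(wav_path, sr=None)
        energy = librosa.feature.rms(y=y)[0]          # energy_j, one value per frame
        log_energy = np.log(energy + 1e-8)            # log_energy_j (epsilon avoids log(0))
        return float(log_energy.mean())               # log_energy_mean

    # log_energy_mean = volume_stat("training_audio.wav")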
Further, if the real acoustic feature also includes a noise level, step C may further include: determining the noise level corresponding to the training audio according to the linear prediction coefficients corresponding to the training audio.
For example, the LPC (Linear Prediction Coefficient) coefficients of the training audio may be determined, a logarithm of the first dimension of the LPC coefficients of the training audio is then taken, and the result of the logarithm is used as the noise level corresponding to the training audio; the logarithm compresses the range of the noise level and makes its relative variation more prominent. The noise level may be denoted, for example, as log_spectral_tilt.
In the case where the real acoustic feature includes the fundamental frequency, volume, speech rate, and noise level, the fundamental frequency, volume, speech rate, and noise level of the training audio may be combined into the real acoustic feature of the training audio. For example, the real acoustic feature may be a 1×5-dimensional vector: {fundamental frequency: (log_pitch_mean, log_pitch_std), volume: log_energy_mean, speech rate: log_duration_mean, noise level: log_spectral_tilt}. Accordingly, in the case where the specified acoustic feature includes the fundamental frequency, volume, speech rate, and noise level, the specified acoustic feature acquired in step 101 may also include the above five dimensions.
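A rough Python sketch of the noise-level computation and the assembly of the 1×5 real acoustic feature, reusing the sketches above. The LPC order and the use of the absolute value of the first-order coefficient are assumptions; the patent only states that the logarithm of the first dimension of the LPC coefficients is used:

    import librosa
    import numpy as np

    def noise_level(y, order=2):
        """log_spectral_tilt: log of the first-order LPC coefficient (absolute value assumed)."""
        lpc = librosa.lpc(y, order=order)      # lpc[0] is always 1.0
        return float(np.log(np.abs(lpc[1]) + 1e-8))

    def real_acoustic_feature(wav_path, phoneme_durations):
        """Assemble the 1x5 real acoustic feature of one training audio."""
        y, sr = librosa.load(wav_path, sr=None)
        log_pitch_mean, log_pitch_std = pitch_stats(wav_path)        # sketched above
        log_energy_mean = volume_stat(wav_path)                      # sketched above
        log_duration_mean = speech_rate(phoneme_durations)           # sketched above
        log_spectral_tilt = noise_level(y)
        return np.array([log_pitch_mean, log_pitch_std,
                         log_energy_mean, log_duration_mean, log_spectral_tilt])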
FIG. 7 is a flowchart illustrating another way of training the speech synthesis model according to an exemplary embodiment. As shown in FIG. 7, the speech synthesis model may also be trained in the following manner:
Step F: determining a statistical acoustic feature of a preset training set according to the real acoustic features of the plurality of training audios included in the training set.
Step G: normalizing the real acoustic feature of each training audio according to the statistical acoustic feature.
Accordingly, step D may be implemented as:
expanding the normalized real acoustic feature according to the training phoneme sequence to obtain the training acoustic feature sequence.
For example, the real acoustic features may be normalized before being expanded. For example, the training set includes a plurality of training texts, each corresponding to one training audio. The real acoustic feature of each training audio may be determined in the manner of steps C1 to C3 (and the noise-level step above), and the statistical acoustic feature of the training set is then determined. The statistical acoustic feature may be, for example, the mean, standard deviation, variance, or extremum of the real acoustic features. The real acoustic feature of each training audio is then normalized according to the statistical acoustic feature. For example, the mean μ and the standard deviation σ of the real acoustic features may be used as the statistical acoustic feature; real acoustic feature values within [μ-3σ, μ+3σ] are mapped into [-1, 1], and values outside [μ-3σ, μ+3σ] are truncated to -1 or 1. The mean and standard deviation of each dimension of the real acoustic features may be computed separately, and each dimension is normalized independently. For example, taking log_pitch_mean as an example, the mean pitch_μ and standard deviation pitch_σ of log_pitch_mean over the training audios can be computed; log_pitch_mean values within [pitch_μ-3pitch_σ, pitch_μ+3pitch_σ] are then mapped into [-1, 1], and values outside this interval are truncated to -1 or 1, thereby normalizing log_pitch_mean.
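A minimal Python sketch of the per-dimension normalization in steps F and G, mapping values within [μ-3σ, μ+3σ] linearly to [-1, 1] and truncating values outside that interval; the function name is illustrative:

    import numpy as np

    def normalize_features(real_feats):
        """Normalize each dimension of an (N, 5) matrix of real acoustic features.

        Values inside [mu - 3*sigma, mu + 3*sigma] map linearly to [-1, 1];
        values outside are clipped to -1 or 1.
        """
        mu = real_feats.mean(axis=0)                 # statistical acoustic features
        sigma = real_feats.std(axis=0) + 1e-8
        scaled = (real_feats - mu) / (3.0 * sigma)   # [mu-3s, mu+3s] -> [-1, 1]
        return np.clip(scaled, -1.0, 1.0), mu, sigma

    feats = np.random.randn(1000, 5)                 # real acoustic features of 1000 training audios
    normed, mu, sigma = normalize_features(feats)
    print(normed.min(), normed.max())                # within [-1, 1]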
Further, the normalized real acoustic feature can be expanded according to the training phoneme sequence to obtain the training acoustic feature sequence. For example, the training acoustic feature sequence may be generated according to the number of training phonemes included in the training phoneme sequence, with the training acoustic feature corresponding to each training phoneme being the normalized real acoustic feature.
It should be noted that the specified acoustic feature acquired in step 101 may likewise consist of the above five normalized dimensions. The normalized specified acoustic feature is easier to interpret. Taking the specified acoustic feature {-1, 1, 0, 1, 0} as an example: the value of -1 for log_pitch_mean indicates that the target audio generated by the speech synthesis model to match the specified acoustic feature has a low overall fundamental frequency (a deep voice). The value of 1 for log_pitch_std indicates a large variation in the fundamental frequency of the target audio. The value of 0 for log_energy_mean indicates that the target audio has a normal volume. The value of 1 for log_duration_mean indicates that the speech rate of the target audio is slow (i.e., the average duration of the phonemes is long). The value of 0 for log_spectral_tilt indicates that the noise level of the target audio is normal.
In summary, the present disclosure first obtains a text to be synthesized and a specified acoustic feature for indicating a prosodic feature of audio, then extracts the corresponding phoneme sequence from the text to be synthesized, expands the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence, and finally inputs the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model, thereby obtaining target audio output by the speech synthesis model that corresponds to the text to be synthesized and matches the specified acoustic feature. Because speech synthesis of the text is controlled by the specified acoustic feature, the target audio output by the speech synthesis model corresponds to the specified acoustic feature, explicit control over acoustic features during speech synthesis is achieved, and the expressiveness of the target audio is improved.
Fig. 8 is a block diagram of a speech synthesis apparatus according to an exemplary embodiment. As shown in fig. 8, the apparatus 200 may include the following modules:
An acquisition module 201, configured to acquire a text to be synthesized and a specified acoustic feature, where the specified acoustic feature is used to indicate a prosodic feature of audio.
An extraction module 202, configured to extract a phoneme sequence corresponding to the text to be synthesized.
An expansion module 203, configured to expand the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence.
A synthesis module 204, configured to input the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain target audio corresponding to the text to be synthesized and output by the speech synthesis model, where the acoustic features of the target audio match the specified acoustic feature.
Fig. 9 is a block diagram of another speech synthesis apparatus according to an exemplary embodiment, and as shown in fig. 9, the expansion module 203 may include:
a determining submodule 2031 is configured to determine, according to the specified acoustic feature, an acoustic feature corresponding to each phoneme in the phoneme sequence.
An expansion submodule 2032 is configured to compose the acoustic feature corresponding to each phoneme into an acoustic feature sequence.
In an application scenario, the speech synthesis model in the above embodiment may be used to perform the following steps:
and step A, determining a text feature sequence corresponding to the text to be synthesized according to the phoneme sequence, wherein the text feature sequence comprises text features corresponding to each phoneme in the phoneme sequence.
And B, generating target audio according to the text feature sequence and the acoustic feature sequence.
In one application scenario, the specified acoustic feature includes: at least one of fundamental frequency, volume, and speech rate.
In another application scenario, the speech synthesis model is trained by:
and C, extracting real acoustic characteristics of the training audio corresponding to the training text, wherein the real acoustic characteristics are used for indicating rhythm characteristics of the training audio.
And D, expanding the real acoustic features according to the training phoneme sequences corresponding to the training texts to obtain training acoustic feature sequences.
And E, inputting the training phoneme sequence and the training acoustic feature sequence into a speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and training audio.
Specifically, when the real acoustic feature includes at least one of the fundamental frequency, volume, and speech rate, the implementation of step C may include:
Step C1: if the real acoustic feature includes the speech rate, determining, according to the training audio and the training phoneme sequence, the duration corresponding to each training phoneme in the training phoneme sequence, so as to determine the speech rate of the training audio.
A specific implementation may include: first, determining the duration corresponding to each training phoneme according to the training audio and the training phoneme sequence; then taking the logarithm of the duration corresponding to each training phoneme to obtain the logarithmic duration corresponding to each training phoneme; and finally, using a statistic of the logarithmic durations corresponding to the training phonemes in the training phoneme sequence as the speech rate of the training audio.
Step C2: if the real acoustic feature includes the fundamental frequency, extracting the fundamental frequency of each audio frame included in the training audio, so as to determine the fundamental frequency of the training audio.
A specific implementation may include: first, taking the logarithm of the fundamental frequency corresponding to each audio frame to obtain the logarithmic fundamental frequency corresponding to each audio frame; then, using statistics of the logarithmic fundamental frequencies corresponding to the audio frames in the training audio as the fundamental frequency of the training audio.
Step C3: if the real acoustic feature includes the volume, extracting the volume of each audio frame included in the training audio, so as to determine the volume of the training audio.
A specific implementation may include: first, taking the logarithm of the volume corresponding to each audio frame to obtain the logarithmic volume corresponding to each audio frame; then, using a statistic of the logarithmic volumes corresponding to the audio frames in the training audio as the volume of the training audio.
In yet another application scenario, the speech synthesis model may also be obtained by training in the following way:
and F, determining the statistical acoustic characteristics of the training set according to the real acoustic characteristics of the training set, which comprise a plurality of training audios, in the preset training set.
And G, carrying out normalization processing on the real acoustic characteristics of each training audio according to the statistical acoustic characteristics.
Accordingly, the implementation manner of the step D may be:
and expanding the real acoustic features after normalization processing according to the training phoneme sequence to obtain a training acoustic feature sequence.
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in the method embodiments and will not be repeated here.
In summary, the present disclosure first obtains a text to be synthesized and a specified acoustic feature for indicating a prosodic feature of audio, then extracts the corresponding phoneme sequence from the text to be synthesized, expands the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence, and finally inputs the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model, thereby obtaining target audio output by the speech synthesis model that corresponds to the text to be synthesized and matches the specified acoustic feature. Because speech synthesis of the text is controlled by the specified acoustic feature, the target audio output by the speech synthesis model corresponds to the specified acoustic feature, explicit control over acoustic features during speech synthesis is achieved, and the expressiveness of the target audio is improved.
Referring now to fig. 10, a schematic diagram of a configuration of an electronic device (i.e., the execution body of the above-described speech synthesis method) 300 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 10 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 10, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage means 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the electronic apparatus 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
In general, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 308 including, for example, magnetic tape, hard disk, etc.; and communication means 309. The communication means 309 may allow the electronic device 300 to communicate with other devices wirelessly or by wire to exchange data. While fig. 10 shows an electronic device 300 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via a communication device 309, or installed from a storage device 308, or installed from a ROM 302. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing means 301.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, the terminal devices and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a text to be synthesized and a specified acoustic feature, wherein the specified acoustic feature is used for indicating a prosodic feature of audio; extract a phoneme sequence corresponding to the text to be synthesized; expand the specified acoustic feature according to the phoneme sequence to obtain an acoustic feature sequence; and input the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain target audio corresponding to the text to be synthesized and output by the speech synthesis model, wherein the acoustic features of the target audio match the specified acoustic feature.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or hardware. The name of a module does not, in some cases, constitute a limitation of the module itself; for example, the acquisition module may also be described as "a module that acquires a text to be synthesized and a specified acoustic feature".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, example 1 provides a speech synthesis method, comprising: acquiring a text to be synthesized and appointed acoustic features, wherein the appointed acoustic features are used for indicating rhythm features of audio; extracting a phoneme sequence corresponding to the text to be synthesized; expanding the appointed acoustic features according to the phoneme sequence to obtain an acoustic feature sequence; and inputting the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain target audio corresponding to the text to be synthesized output by the speech synthesis model, wherein the acoustic features of the target audio match the appointed acoustic features.
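By way of non-limiting illustration only, the four steps of example 1 could be strung together as in the following Python sketch. The helper names (`text_to_phonemes`, `expand_features`, the `model` callable) and the three-component prosody vector are assumptions made for the sketch; they are not components defined by this disclosure.

```python
# Illustrative sketch of the inference flow in example 1 (all helper names are hypothetical).
from dataclasses import dataclass
from typing import List

@dataclass
class AppointedAcousticFeatures:
    """Prosody controls supplied by the caller (see example 4); scalar values are assumed."""
    fundamental_frequency: float
    volume: float
    speech_rate: float

def text_to_phonemes(text: str) -> List[str]:
    # Placeholder grapheme-to-phoneme front end; a real system would use a proper G2P tool.
    return [ch for ch in text if not ch.isspace()]

def expand_features(features: AppointedAcousticFeatures, phonemes: List[str]) -> List[List[float]]:
    # Step 3: give every phoneme its own copy of the appointed acoustic features.
    vec = [features.fundamental_frequency, features.volume, features.speech_rate]
    return [list(vec) for _ in phonemes]

def synthesize(text: str, features: AppointedAcousticFeatures, model) -> object:
    phonemes = text_to_phonemes(text)                        # step 2: phoneme sequence
    feature_sequence = expand_features(features, phonemes)   # step 3: acoustic feature sequence
    return model(phonemes, feature_sequence)                 # step 4: pre-trained model -> target audio
```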
According to one or more embodiments of the present disclosure, example 2 provides the method of example 1, wherein expanding the appointed acoustic features according to the phoneme sequence to obtain the acoustic feature sequence includes: determining, according to the appointed acoustic features, the acoustic feature corresponding to each phoneme in the phoneme sequence; and composing the acoustic feature sequence from the acoustic features corresponding to the phonemes.
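A minimal numpy sketch of this expansion step, assuming the appointed acoustic features form a single global vector that is simply repeated for every phoneme (other mappings, e.g. per-word controls, are equally possible):

```python
import numpy as np

def expand_specified_features(specified: np.ndarray, num_phonemes: int) -> np.ndarray:
    """Tile one appointed acoustic feature vector so that each phoneme in the
    phoneme sequence gets its own copy (a sketch of example 2).

    specified: shape (feature_dim,), e.g. [f0_control, volume_control, rate_control]
    returns:   shape (num_phonemes, feature_dim)
    """
    specified = np.asarray(specified, dtype=np.float32)
    return np.tile(specified[None, :], (num_phonemes, 1))

# Example: three controls broadcast over a six-phoneme sequence.
feature_sequence = expand_specified_features(np.array([0.2, -0.5, 0.1]), num_phonemes=6)
assert feature_sequence.shape == (6, 3)
```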
In accordance with one or more embodiments of the present disclosure, example 3 provides the method of example 1, wherein the speech synthesis model is used to: determine a text feature sequence corresponding to the text to be synthesized according to the phoneme sequence, where the text feature sequence includes the text feature corresponding to each phoneme in the phoneme sequence; and generate the target audio according to the text feature sequence and the acoustic feature sequence.
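The following toy PyTorch module is one way such a model could be organized: a phoneme embedding plus encoder produces the text feature sequence, which is concatenated with the per-phoneme acoustic feature sequence and decoded. The layer choices, dimensions, and the shortcut of predicting one mel frame per phoneme are assumptions for illustration; a real system would also upsample phoneme-level features to frame level before predicting a spectrogram, and this is not the architecture claimed by the disclosure.

```python
import torch
import torch.nn as nn

class ToyProsodyConditionedSynthesizer(nn.Module):
    """Toy stand-in for the speech synthesis model of example 3 (assumed architecture)."""

    def __init__(self, phoneme_vocab: int = 100, text_dim: int = 128,
                 acoustic_dim: int = 3, mel_bins: int = 80):
        super().__init__()
        self.embedding = nn.Embedding(phoneme_vocab, text_dim)        # phoneme id -> vector
        self.encoder = nn.GRU(text_dim, text_dim, batch_first=True)   # text feature sequence
        self.decoder = nn.Linear(text_dim + acoustic_dim, mel_bins)   # fuse prosody, predict mel

    def forward(self, phoneme_ids: torch.Tensor, acoustic_seq: torch.Tensor) -> torch.Tensor:
        # phoneme_ids: (batch, seq_len); acoustic_seq: (batch, seq_len, acoustic_dim)
        text_features, _ = self.encoder(self.embedding(phoneme_ids))
        fused = torch.cat([text_features, acoustic_seq], dim=-1)
        # One mel frame per phoneme here; a real model would first expand to frame level.
        return self.decoder(fused)

model = ToyProsodyConditionedSynthesizer()
mel = model(torch.randint(0, 100, (1, 6)), torch.randn(1, 6, 3))
assert mel.shape == (1, 6, 80)
```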
In accordance with one or more embodiments of the present disclosure, example 4 provides the method of any one of examples 1 to 3, wherein the appointed acoustic features include at least one of fundamental frequency, volume, and speech rate.
Example 5 provides the method of example 1, according to one or more embodiments of the present disclosure, wherein the speech synthesis model is obtained by training in the following manner: extracting real acoustic features of training audio corresponding to a training text, wherein the real acoustic features are used for indicating rhythm features of the training audio; expanding the real acoustic features according to a training phoneme sequence corresponding to the training text to obtain a training acoustic feature sequence; and inputting the training phoneme sequence and the training acoustic feature sequence into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio.
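A rough sketch of one training step under this scheme is given below: the training phoneme sequence and the expanded real acoustic features go into the model, and the model output is compared against features of the training audio. The stand-in model, the L1 loss, and the mel-spectrogram target are assumptions of the sketch; the disclosure only requires training against the training audio.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model: maps concatenated (phoneme embedding, acoustic feature) to a mel frame.
phoneme_vocab, text_dim, acoustic_dim, mel_bins = 100, 64, 3, 80
embedding = nn.Embedding(phoneme_vocab, text_dim)
decoder = nn.Linear(text_dim + acoustic_dim, mel_bins)
optimizer = torch.optim.Adam(list(embedding.parameters()) + list(decoder.parameters()), lr=1e-3)

def training_step(phoneme_ids, real_acoustic_seq, target_mel):
    """One simplified step of the training scheme in example 5.

    phoneme_ids:       (batch, seq_len) training phoneme sequence
    real_acoustic_seq: (batch, seq_len, acoustic_dim) real acoustic features expanded per phoneme
    target_mel:        (batch, seq_len, mel_bins) features derived from the training audio
    """
    fused = torch.cat([embedding(phoneme_ids), real_acoustic_seq], dim=-1)
    predicted_mel = decoder(fused)
    loss = nn.functional.l1_loss(predicted_mel, target_mel)  # compare output with the training audio
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch standing in for (training phoneme sequence, training acoustic feature sequence, audio).
loss = training_step(torch.randint(0, phoneme_vocab, (2, 6)),
                     torch.randn(2, 6, acoustic_dim),
                     torch.randn(2, 6, mel_bins))
```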
In accordance with one or more embodiments of the present disclosure, example 6 provides the method of example 5, wherein the real acoustic features include at least one of fundamental frequency, volume, and speech rate, and extracting the real acoustic features of the training audio corresponding to the training text includes: if the real acoustic features include speech rate, determining the duration corresponding to each training phoneme in the training phoneme sequence according to the training audio and the training phoneme sequence, so as to determine the speech rate of the training audio; if the real acoustic features include fundamental frequency, extracting the fundamental frequency of each audio frame included in the training audio, so as to determine the fundamental frequency of the training audio; and if the real acoustic features include volume, extracting the volume of each audio frame included in the training audio, so as to determine the volume of the training audio.
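A self-contained numpy sketch of the frame-level measurements this example relies on is shown below: RMS energy per frame for volume and a naive autocorrelation pitch estimate per frame for fundamental frequency. Per-phoneme durations would come from forced alignment of the training audio against the training phoneme sequence, which is only stubbed here; the frame length, hop size, and pitch search range are assumed values.

```python
import numpy as np

def frame_signal(y: np.ndarray, frame_length: int = 1024, hop: int = 256) -> np.ndarray:
    """Split a mono waveform into overlapping frames (no padding, for brevity)."""
    n_frames = 1 + (len(y) - frame_length) // hop
    return np.stack([y[i * hop: i * hop + frame_length] for i in range(n_frames)])

def frame_volume(frames: np.ndarray) -> np.ndarray:
    """Per-frame volume measured as RMS energy."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def frame_f0(frames: np.ndarray, sr: int, fmin: float = 60.0, fmax: float = 500.0) -> np.ndarray:
    """Very rough per-frame F0 from the autocorrelation peak within [fmin, fmax]."""
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    f0 = np.zeros(len(frames))
    for i, frame in enumerate(frames):
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lag = lag_min + np.argmax(ac[lag_min:lag_max])
        f0[i] = sr / lag
    return f0

def phoneme_durations(audio_path: str, phonemes: list) -> np.ndarray:
    # Durations per training phoneme are assumed to come from a forced aligner (not shown).
    raise NotImplementedError("obtain per-phoneme durations from forced alignment")

sr = 16000
y = np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)  # 1 s, 220 Hz test tone
frames = frame_signal(y)
print(frame_volume(frames)[:3], frame_f0(frames, sr)[:3])
```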
Example 7 provides the method of example 6, according to one or more embodiments of the present disclosure, wherein determining the duration corresponding to each training phoneme in the training phoneme sequence according to the training audio and the training phoneme sequence so as to determine the speech rate of the training audio includes: determining the duration corresponding to each training phoneme according to the training audio and the training phoneme sequence; carrying out a logarithmic operation on the duration corresponding to each training phoneme to obtain the logarithmic duration corresponding to each training phoneme; and taking the statistical value of the logarithmic durations corresponding to the training phonemes in the training phoneme sequence as the speech rate of the training audio. Extracting the fundamental frequency of each audio frame included in the training audio so as to determine the fundamental frequency of the training audio includes: carrying out a logarithmic operation on the fundamental frequency corresponding to each audio frame to obtain the logarithmic fundamental frequency corresponding to each audio frame; and taking the statistical value of the logarithmic fundamental frequencies corresponding to the audio frames in the training audio as the fundamental frequency of the training audio. Extracting the volume of each audio frame included in the training audio so as to determine the volume of the training audio includes: carrying out a logarithmic operation on the volume corresponding to each audio frame to obtain the logarithmic volume corresponding to each audio frame; and taking the statistical value of the logarithmic volumes corresponding to the audio frames in the training audio as the volume of the training audio.
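How those per-phoneme durations and per-frame values might be collapsed into the utterance-level speech rate, fundamental frequency, and volume is sketched below. Using the mean as the "statistical value", restricting log-F0 to voiced frames, and adding a small epsilon to the volume are all assumptions of the sketch, not requirements of the disclosure.

```python
import numpy as np

def utterance_speech_rate(durations_sec: np.ndarray) -> float:
    """Speech rate as the mean of per-phoneme log-durations (example 7)."""
    return float(np.mean(np.log(durations_sec)))

def utterance_f0(frame_f0_hz: np.ndarray) -> float:
    """Fundamental frequency as the mean log-F0, here taken over voiced frames only."""
    voiced = frame_f0_hz[frame_f0_hz > 0]
    return float(np.mean(np.log(voiced)))

def utterance_volume(frame_rms: np.ndarray, eps: float = 1e-8) -> float:
    """Volume as the mean log-RMS over frames; eps guards against silent frames."""
    return float(np.mean(np.log(frame_rms + eps)))

# Toy per-phoneme durations (seconds) and per-frame measurements for one training audio.
print(utterance_speech_rate(np.array([0.08, 0.12, 0.05, 0.20])),
      utterance_f0(np.array([0.0, 180.0, 185.0, 190.0])),
      utterance_volume(np.array([0.01, 0.02, 0.015])))
```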
According to one or more embodiments of the present disclosure, example 8 provides the method of any one of examples 5 to 7, wherein the speech synthesis model is further obtained by training in the following manner: determining statistical acoustic features of the training set according to the real acoustic features of a plurality of training audios included in the preset training set; and normalizing the real acoustic features of each training audio according to the statistical acoustic features. Expanding the real acoustic features according to the training phoneme sequence corresponding to the training text to obtain the training acoustic feature sequence then includes: expanding the normalized real acoustic features according to the training phoneme sequence to obtain the training acoustic feature sequence.
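Example 8 can be pictured as a corpus-level z-score: compute statistics over the real acoustic features of every training audio in the preset training set, then normalize each utterance before expansion. Treating the "statistical acoustic features" as a per-dimension mean and standard deviation is an assumption of this sketch.

```python
import numpy as np

def train_set_statistics(utterance_features: np.ndarray):
    """Statistics over the real acoustic features of every training audio in the preset
    training set. Rows are utterances, columns are (speech_rate, f0, volume); using
    mean/std is an assumed choice of statistic."""
    return utterance_features.mean(axis=0), utterance_features.std(axis=0) + 1e-8

def normalize(features: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Normalize one training audio's real acoustic features before expansion."""
    return (features - mean) / std

corpus = np.array([[-2.3, 5.1, -4.0],
                   [-2.1, 5.3, -3.8],
                   [-2.6, 4.9, -4.2]])  # toy (speech_rate, log-F0, log-volume) rows
mean, std = train_set_statistics(corpus)
normalized_first = normalize(corpus[0], mean, std)  # fed to the expansion step instead of raw features
```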
According to one or more embodiments of the present disclosure, example 9 provides a speech synthesis apparatus, comprising: an acquisition module configured to acquire a text to be synthesized and appointed acoustic features, wherein the appointed acoustic features are used for indicating rhythm features of audio; an extraction module configured to extract a phoneme sequence corresponding to the text to be synthesized; an expansion module configured to expand the appointed acoustic features according to the phoneme sequence to obtain an acoustic feature sequence; and a synthesis module configured to input the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain target audio corresponding to the text to be synthesized output by the speech synthesis model, wherein the acoustic features of the target audio match the appointed acoustic features.
According to one or more embodiments of the present disclosure, example 10 provides a computer-readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the method of any one of examples 1 to 8.
Example 11 provides an electronic device according to one or more embodiments of the present disclosure, comprising: a storage device having a computer program stored thereon; and a processing device configured to execute the computer program in the storage device to implement the steps of the method of any one of examples 1 to 8.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to the specific combinations of the technical features described above, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by substituting the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims. The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method and will not be elaborated here.

Claims (9)

1. A method of speech synthesis, the method comprising:
acquiring a text to be synthesized and appointed acoustic features, wherein the appointed acoustic features are used for indicating rhythm features of audio;
extracting a phoneme sequence corresponding to the text to be synthesized;
expanding the appointed acoustic features according to the phoneme sequence to obtain an acoustic feature sequence;
inputting the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain target audio corresponding to the text to be synthesized output by the speech synthesis model, wherein the acoustic features of the target audio match the appointed acoustic features;
wherein the speech synthesis model is obtained by training in the following manner:
extracting real acoustic features of training audio corresponding to a training text, wherein the real acoustic features are used for indicating rhythm features of the training audio;
expanding the real acoustic features according to a training phoneme sequence corresponding to the training text to obtain a training acoustic feature sequence;
inputting the training phoneme sequence and the training acoustic feature sequence into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio;
the speech synthesis model is further obtained by training in the following manner:
determining statistical acoustic features of the training set according to the real acoustic features of a plurality of training audios included in the preset training set;
normalizing the real acoustic features of each training audio according to the statistical acoustic features;
wherein expanding the real acoustic features according to the training phoneme sequence corresponding to the training text to obtain the training acoustic feature sequence comprises:
expanding the normalized real acoustic features according to the training phoneme sequence to obtain the training acoustic feature sequence.
2. The method of claim 1, wherein expanding the appointed acoustic features according to the phoneme sequence to obtain the acoustic feature sequence comprises:
determining, according to the appointed acoustic features, the acoustic feature corresponding to each phoneme in the phoneme sequence;
composing the acoustic feature sequence from the acoustic features corresponding to the phonemes.
3. The method of claim 1, wherein the speech synthesis model is used to:
determining a text feature sequence corresponding to the text to be synthesized according to the phoneme sequence, wherein the text feature sequence comprises text features corresponding to each phoneme in the phoneme sequence;
and generating the target audio according to the text feature sequence and the acoustic feature sequence.
4. The method according to any one of claims 1-3, wherein the appointed acoustic features comprise at least one of fundamental frequency, volume, and speech rate.
5. The method of claim 1, wherein the real acoustic features comprise at least one of fundamental frequency, volume, and speech rate, and extracting the real acoustic features of the training audio corresponding to the training text comprises:
if the real acoustic features comprise speech rate, determining the duration corresponding to each training phoneme in the training phoneme sequence according to the training audio and the training phoneme sequence, so as to determine the speech rate of the training audio;
if the real acoustic features comprise fundamental frequency, extracting the fundamental frequency of each audio frame included in the training audio, so as to determine the fundamental frequency of the training audio;
if the real acoustic features comprise volume, extracting the volume of each audio frame included in the training audio, so as to determine the volume of the training audio.
6. The method of claim 5, wherein determining the duration corresponding to each training phoneme in the training phoneme sequence according to the training audio and the training phoneme sequence so as to determine the speech rate of the training audio comprises:
determining the duration corresponding to each training phoneme according to the training audio and the training phoneme sequence;
carrying out a logarithmic operation on the duration corresponding to each training phoneme to obtain the logarithmic duration corresponding to each training phoneme;
taking the statistical value of the logarithmic durations corresponding to the training phonemes in the training phoneme sequence as the speech rate of the training audio;
The extracting the fundamental frequency of each audio frame included in the training audio to determine the fundamental frequency of the training audio includes:
carrying out logarithmic operation on the fundamental frequency corresponding to each audio frame to obtain the logarithmic fundamental frequency corresponding to each audio frame;
taking the statistical value of the logarithmic fundamental frequencies corresponding to the audio frames in the training audio as the fundamental frequency of the training audio;
the extracting the volume of each audio frame included in the training audio to determine the volume of the training audio includes:
carrying out logarithmic operation on the volume corresponding to each audio frame to obtain logarithmic volume corresponding to each audio frame;
and taking the statistical value of the logarithmic volume corresponding to each audio frame in the training audio as the volume of the training audio.
7. A speech synthesis apparatus, the apparatus comprising:
an acquisition module, configured to acquire a text to be synthesized and appointed acoustic features, wherein the appointed acoustic features are used for indicating rhythm features of audio;
an extraction module, configured to extract a phoneme sequence corresponding to the text to be synthesized;
an expansion module, configured to expand the appointed acoustic features according to the phoneme sequence to obtain an acoustic feature sequence;
a synthesis module, configured to input the phoneme sequence and the acoustic feature sequence into a pre-trained speech synthesis model to obtain target audio corresponding to the text to be synthesized output by the speech synthesis model, wherein the acoustic features of the target audio match the appointed acoustic features;
the speech synthesis model is obtained by training in the following manner:
extracting real acoustic features of training audio corresponding to a training text, wherein the real acoustic features are used for indicating rhythm features of the training audio;
expanding the real acoustic features according to a training phoneme sequence corresponding to the training text to obtain a training acoustic feature sequence;
inputting the training phoneme sequence and the training acoustic feature sequence into the speech synthesis model, and training the speech synthesis model according to the output of the speech synthesis model and the training audio;
the speech synthesis model is further obtained by training in the following manner:
determining statistical acoustic features of the training set according to the real acoustic features of a plurality of training audios included in the preset training set;
normalizing the real acoustic features of each training audio according to the statistical acoustic features;
wherein expanding the real acoustic features according to the training phoneme sequence corresponding to the training text to obtain the training acoustic feature sequence comprises:
expanding the normalized real acoustic features according to the training phoneme sequence to obtain the training acoustic feature sequence.
8. A computer readable medium on which a computer program is stored, characterized in that the program, when executed by a processing device, carries out the steps of the method according to any one of claims 1-6.
9. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method according to any one of claims 1-6.
CN202110075977.XA 2021-01-20 2021-01-20 Speech synthesis method and device, readable medium and electronic equipment Active CN112786008B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110075977.XA CN112786008B (en) 2021-01-20 2021-01-20 Speech synthesis method and device, readable medium and electronic equipment
PCT/CN2021/139987 WO2022156464A1 (en) 2021-01-20 2021-12-21 Speech synthesis method and apparatus, readable medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110075977.XA CN112786008B (en) 2021-01-20 2021-01-20 Speech synthesis method and device, readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112786008A CN112786008A (en) 2021-05-11
CN112786008B true CN112786008B (en) 2024-04-12

Family

ID=75757402

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110075977.XA Active CN112786008B (en) 2021-01-20 2021-01-20 Speech synthesis method and device, readable medium and electronic equipment

Country Status (2)

Country Link
CN (1) CN112786008B (en)
WO (1) WO2022156464A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786008B (en) * 2021-01-20 2024-04-12 北京有竹居网络技术有限公司 Speech synthesis method and device, readable medium and electronic equipment
CN113345417B (en) * 2021-05-31 2024-03-01 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and storage medium
CN113096638B (en) * 2021-06-09 2021-09-07 北京世纪好未来教育科技有限公司 Speech synthesis model training method, speech synthesis method and device
CN115985282A (en) * 2021-10-14 2023-04-18 北京字跳网络技术有限公司 Method and device for adjusting speech rate, electronic equipment and readable storage medium
CN113724684A (en) * 2021-10-19 2021-11-30 南京航空航天大学 Voice synthesis method and system for air traffic control instruction
CN114566143B (en) * 2022-03-31 2022-10-11 北京帝派智能科技有限公司 Voice synthesis method and voice synthesis system capable of locally modifying content

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102385858A (en) * 2010-08-31 2012-03-21 国际商业机器公司 Emotional voice synthesis method and system
US10186252B1 (en) * 2015-08-13 2019-01-22 Oben, Inc. Text to speech synthesis using deep neural network with constant unit length spectrogram
CN111199724A (en) * 2019-12-31 2020-05-26 出门问问信息科技有限公司 Information processing method and device and computer readable storage medium
CN111583904A (en) * 2020-05-13 2020-08-25 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6810378B2 (en) * 2001-08-22 2004-10-26 Lucent Technologies Inc. Method and apparatus for controlling a speech synthesis system to provide multiple styles of speech
US8886538B2 (en) * 2003-09-26 2014-11-11 Nuance Communications, Inc. Systems and methods for text-to-speech synthesis using spoken example
US9728185B2 (en) * 2014-05-22 2017-08-08 Google Inc. Recognizing speech using neural networks
CN110992927B (en) * 2019-12-11 2024-02-20 广州酷狗计算机科技有限公司 Audio generation method, device, computer readable storage medium and computing equipment
CN112786008B (en) * 2021-01-20 2024-04-12 北京有竹居网络技术有限公司 Speech synthesis method and device, readable medium and electronic equipment

Also Published As

Publication number Publication date
CN112786008A (en) 2021-05-11
WO2022156464A1 (en) 2022-07-28

Similar Documents

Publication Publication Date Title
CN112786007B (en) Speech synthesis method and device, readable medium and electronic equipment
CN112786008B (en) Speech synthesis method and device, readable medium and electronic equipment
CN107945786B (en) Speech synthesis method and device
CN112489620B (en) Speech synthesis method, device, readable medium and electronic equipment
CN111402855B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
CN111933110B (en) Video generation method, generation model training method, device, medium and equipment
CN111583900B (en) Song synthesis method and device, readable medium and electronic equipment
CN109741732B (en) Named entity recognition method, named entity recognition device, equipment and medium
WO2022151931A1 (en) Speech synthesis method and apparatus, synthesis model training method and apparatus, medium, and device
CN112489621B (en) Speech synthesis method, device, readable medium and electronic equipment
CN111599343B (en) Method, apparatus, device and medium for generating audio
WO2022151930A1 (en) Speech synthesis method and apparatus, synthesis model training method and apparatus, and medium and device
WO2022105861A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN111292719A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111489735B (en) Voice recognition model training method and device
CN113327580A (en) Speech synthesis method, device, readable medium and electronic equipment
CN111354343B (en) Voice wake-up model generation method and device and electronic equipment
CN112259089A (en) Voice recognition method and device
CN114678032B (en) Training method, voice conversion method and device and electronic equipment
CN114255740A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN114495902A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN114242093A (en) Voice tone conversion method and device, computer equipment and storage medium
CN113421554A (en) Voice keyword detection model processing method and device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant