CN113450756A - Training method of voice synthesis model and voice synthesis method - Google Patents


Info

Publication number
CN113450756A
Authority
CN
China
Prior art keywords
text
sample
voice sample
synthesis model
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010175459.0A
Other languages
Chinese (zh)
Inventor
杨丽兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TCL Technology Group Co Ltd
Original Assignee
TCL Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TCL Technology Group Co Ltd filed Critical TCL Technology Group Co Ltd
Priority to CN202010175459.0A
Publication of CN113450756A
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047 - Architecture of speech synthesisers
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The application relates to the field of computer technology, and in particular to a method for training a speech synthesis model and a speech synthesis method. The training method of the speech synthesis model comprises the following steps: performing pronunciation labeling on a first voice sample to obtain a text sequence of the first voice sample; inputting the first voice sample and its text sequence in pairs into a preset speech synthesis model for processing, to obtain an output audio corresponding to the text sequence of the first voice sample and an audio feature of the first voice sample; and adjusting parameters of the speech synthesis model according to the audio feature of the first voice sample and the output audio until a preset training condition is met, to obtain a trained speech synthesis model. In the method and the device, pronunciation labeling can be performed based on sample speech of a dialect, and a speech synthesis model capable of synthesizing speech with the pronunciation characteristics of the dialect can then be obtained, so that the quality of dialect speech synthesis is improved.

Description

Training method of voice synthesis model and voice synthesis method
Technical Field
The present application belongs to the field of computer technology, and in particular, relates to a method for training a speech synthesis model and a speech synthesis method.
Background
Speech synthesis technology refers to technology that generates artificial speech by mechanical or electronic means. Text-To-Speech (TTS) technology is a branch of speech synthesis; it converts text information generated by a computer or input from outside into intelligible, fluent spoken output. At present, speech synthesis technology based on deep learning has gradually matured, and common languages such as Mandarin and English can be synthesized well. However, the speech synthesis quality for dialects is still not high.
Disclosure of Invention
The embodiment of the application provides a training method of a speech synthesis model, a readable storage medium and a terminal device, which can improve the quality of dialect speech synthesis.
In a first aspect, an embodiment of the present application provides a method for training a speech synthesis model, including:
carrying out pronunciation labeling on the first voice sample to obtain a text sequence of the first voice sample;
inputting the first voice sample and the text sequence of the first voice sample into a preset voice synthesis model in pairs to obtain an output audio corresponding to the text sequence of the first voice sample and an audio characteristic of the first voice sample;
and adjusting parameters of the speech synthesis model according to the audio characteristics of the first speech sample and the output audio to obtain a trained speech synthesis model.
Further, the speech synthesis model comprises an audio processing module, a text encoding module, a decoding module and a synthesis module;
the inputting the first voice sample and the text sequence of the first voice sample into a preset voice synthesis model in pairs for processing to obtain an output audio corresponding to the text sequence of the first voice sample and an audio feature of the first voice sample includes:
inputting the first voice sample into the audio processing module for processing to obtain the audio characteristics of the first voice sample;
inputting the text sequence of the first voice sample into the text coding module for processing to obtain a feature vector corresponding to the text sequence;
inputting the feature vector corresponding to the text sequence into the decoding module for processing to obtain a frequency spectrum corresponding to the feature vector;
and inputting the frequency spectrum to the synthesis module for processing to obtain the output audio.
Further, the performing pronunciation labeling on the first voice sample to obtain a text sequence of the first voice sample includes:
constructing a first data set from the first speech sample;
and carrying out pronunciation labeling on the first voice sample based on the first data set to obtain a text sequence of the first voice sample.
Specifically, since the pronunciation tag of each word of the dialect is already included in the first data set, the pronunciation tag of each word in the text corresponding to the first speech sample can be determined by traversing the first data set, and the text sequence of the first speech sample can be determined based on a preset encoding rule.
Further, the constructing the first data set from the first speech sample includes:
and carrying out pronunciation labeling on the text corresponding to the first voice sample according to the first voice sample and determining a pronunciation label of each word in the first data set.
Further, the performing pronunciation labeling on the first voice sample based on the first data set to obtain a text sequence of the first voice sample includes:
traversing the first data set, and labeling each character of the first voice sample with a pronunciation label;
and determining the pronunciation sequence number corresponding to the pronunciation label of each word according to a preset coding rule so as to obtain the text sequence of the first voice sample.
Further, the training method of the speech synthesis model further includes:
constructing a second data set from a second speech sample and the first data set;
performing pronunciation labeling on the second voice sample based on the second data set to obtain a text sequence of the second voice sample;
and training the preset voice synthesis model based on the second voice sample and the text sequence of the second voice sample.
Further, the constructing a second data set from a second speech sample and the first data set includes:
carrying out pronunciation labeling on the text corresponding to the second voice sample according to the second voice sample and determining a pronunciation label of each character in the second voice sample;
and adding the pronunciation label of each word in the text corresponding to the second voice sample into the first data set to obtain a second data set.
Further, the training the preset speech synthesis model based on the second speech sample and the text sequence of the second speech sample includes:
inputting the text sequence of the second voice sample into a preset voice synthesis model for processing to obtain corresponding output audio;
processing the second voice sample through the preset voice synthesis model to obtain a frequency spectrum characteristic corresponding to the second voice sample;
and adjusting parameters of the speech synthesis model according to the spectral feature corresponding to the second speech sample and the output audio to obtain a trained speech synthesis model.
In a second aspect, an embodiment of the present application provides a speech synthesis method, including:
acquiring a text to be synthesized;
and inputting the text to be synthesized into a trained speech synthesis model for processing, and acquiring the output audio output by the speech synthesis model, wherein the speech synthesis model is obtained through training with any one of the above training methods for a speech synthesis model.
Further, the speech synthesis model comprises a text encoding module, a decoding module and a synthesis module;
the inputting the text to be synthesized into a trained speech synthesis model for processing to obtain the output audio output by the speech synthesis model includes:
inputting the text to be synthesized into the text coding module for processing to obtain a feature vector corresponding to the text to be synthesized;
inputting the feature vector corresponding to the text to be synthesized into the decoding module for processing so as to obtain a frequency spectrum corresponding to the feature vector;
and inputting the frequency spectrum to the synthesis module for processing to obtain the output audio.
Further, the inputting the text to be synthesized into the text encoding module for processing to obtain a feature vector corresponding to the text to be synthesized includes:
generating a corresponding text sequence according to the text to be synthesized;
and performing feature extraction on the text sequence to obtain a feature vector corresponding to the text to be synthesized.
A third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, where the processor implements the steps of the training method of any one of the above-mentioned speech synthesis models or the steps of any one of the above-mentioned speech synthesis methods when executing the computer readable instructions.
In a fourth aspect, the present application provides a computer program product, which, when run on a terminal device, causes the terminal device to perform the steps of the training method for a speech synthesis model according to any one of the above first aspects or the steps of any one of the above speech synthesis methods.
It is understood that the beneficial effects of the second to fourth aspects can be seen from the description of the first aspect, and are not described herein again.
Compared with the prior art, the embodiments of the present application have the following advantages: pronunciation labeling can be performed on a speech sample of a dialect, and a speech synthesis model can be trained based on the resulting text sequence and the collected speech sample, so that a speech synthesis model capable of synthesizing speech with the pronunciation characteristics of the dialect is obtained and the quality of dialect speech synthesis is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
FIG. 1 is a flowchart illustrating an implementation of a method for training a speech synthesis model according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating an implementation of a first data set forming process in a training method of a speech synthesis model according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating an implementation of a text sequence determination process for a first speech sample in a method for training a speech synthesis model according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating an implementation of S103 in a method for training a speech synthesis model according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating a method for training a speech synthesis model according to another embodiment of the present application;
FIG. 6 is a flow chart of a speech synthesis method according to another embodiment of the present application;
FIG. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
The speech synthesis method provided by the embodiment of the application can be applied to terminal devices such as a mobile phone, a tablet personal computer, a wearable device, a vehicle-mounted device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA), and the like, and the embodiment of the application does not limit the specific type of the terminal device at all.
Referring to fig. 1, fig. 1 is a flowchart illustrating an implementation of a method for training a speech synthesis model according to an embodiment of the present application, where as shown in fig. 1, the method for training the speech synthesis model may include:
s101: and carrying out pronunciation labeling on the first voice sample to obtain a text sequence of the first voice sample.
Specifically, a first voice sample is collected, and pronunciation labeling is then performed on it according to the pronunciation of each character in the sample. The first voice sample may be a voice sample of the most common variety of a certain dialect; for example, more than 10 hours of speech in that variety may be recorded, covering as far as possible all pronunciations and words of the dialect as well as the commonly used Chinese characters. Because the first voice sample covers all pronunciations, words and common Chinese characters of the dialect, the pronunciation label of essentially every character of the dialect can be determined. A pronunciation label is a tag identifying how a character is pronounced in the dialect, and each character has a corresponding pronunciation label; it will be appreciated that, because of polyphonic characters, one character may also correspond to multiple pronunciation labels. It should be noted that performing pronunciation labeling on each character in the text corresponding to the first voice sample means labeling each character in that text according to its pronunciation in the audio of the first voice sample, including labeling the initial, the final, the tone and the pronunciation mode of the character. The initial and the final of each character can be labeled based on a self-defined initial table and final table, the tones can be self-defined according to the tones of pronunciation contained in the corresponding dialect, and the pronunciation modes can be set according to the pronunciation characteristics of the dialect, such as retroflex sounds, stressed sounds and prolonged sounds. The initial table and final table can take the initials and finals of Mandarin Pinyin as a basis and then add the initials and finals missing from Mandarin Pinyin (for example, set with reference to the pronunciation of English phonetic symbols).
Specifically, since the first voice sample already covers the pronunciation label of each character of the dialect, the pronunciation label of each character in the text corresponding to the first voice sample can be determined by traversing these labels, and the text sequence of the first voice sample can then be determined based on the preset encoding rule.
It should be noted that the preset encoding rule encodes the initials, finals, tones and pronunciation modes numerically. Specifically, the numeric range may be divided according to the types of initials, finals, tones and pronunciation modes, and values in specific ranges are used to represent each of them. Illustratively, under the preset encoding rule the values 0 to 39 are reserved for the sequence numbers of initials, 40 to 99 for the sequence numbers of finals, 100 to 105 for the sequence numbers of tones, and 106 to 110 for the sequence numbers of pronunciation modes. Representing the pronunciation of the dialect by numeric codes increases the extensibility of the representation: if a new initial, final, tone or pronunciation mode is found in the dialect, only one sequence number needs to be added; and if a certain dialect has more forms of initials and finals, only the corresponding reserved range needs to be enlarged.
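As an illustration only, the reserved numeric ranges described above can be captured in a small lookup, as in the following Python sketch; the ranges come from the description, while the helper function is illustrative.

# A minimal sketch of the reserved numeric ranges of the preset encoding rule.
# Only the ranges themselves come from the description; the helper is illustrative.
INITIAL_RANGE = range(0, 40)     # sequence numbers of initials
FINAL_RANGE = range(40, 100)     # sequence numbers of finals
TONE_RANGE = range(100, 106)     # sequence numbers of tones
MODE_RANGE = range(106, 111)     # sequence numbers of pronunciation modes

def symbol_kind(number: int) -> str:
    """Tell which kind of pronunciation symbol a sequence number encodes."""
    if number in INITIAL_RANGE:
        return "initial"
    if number in FINAL_RANGE:
        return "final"
    if number in TONE_RANGE:
        return "tone"
    if number in MODE_RANGE:
        return "pronunciation mode"
    raise ValueError(f"{number} is outside the reserved ranges")

Under this scheme, a newly discovered final, for example, only requires assigning it one unused number inside the 40 to 99 range.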
As a possible implementation manner, the above S101 may include the following steps:
constructing a first data set from the first speech sample;
and carrying out pronunciation labeling on the first voice sample based on the first data set to obtain a text sequence of the first voice sample.
Specifically, a first voice sample is collected, and a first data set of the dialect corresponding to the voice sample is then constructed based on the first voice sample. The first data set is a data set that includes each character in the first voice sample and the pronunciation label corresponding to each character, with each character stored together with its corresponding pronunciation label.
Specifically, the above-mentioned first data set constructing process may be obtained by a preprocessing device (including but not limited to a computer, a server, and other terminal devices with computing capability) through the processing process shown in fig. 2:
s1011: and carrying out pronunciation labeling on the text corresponding to the first voice sample according to the first voice sample and determining a pronunciation label of each word in the first data set.
Typically, the original data format of the voice samples is the WAV audio format, which is a lossless format, so its size is relatively large. In practical applications, the voice samples may be converted from the WAV audio format to the PCM audio format in advance in order to reduce the amount of subsequent computation. Considering that the voice samples may contain silence signals, which typically occur before the user speaks, after the user speaks and during pauses in the middle of speech and carry no useful information, these silence signals are preferably removed from the voice samples to reduce interference with the final result.
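A minimal preprocessing sketch along these lines is given below, assuming the librosa and soundfile libraries; the 30 dB silence threshold and the 16 kHz sampling rate are arbitrary illustrative choices, not values specified by the application.

import librosa
import soundfile as sf

def preprocess_sample(wav_path: str, out_path: str, sr: int = 16000) -> None:
    """Load a WAV sample, strip leading/trailing silence, and save 16-bit raw PCM."""
    y, _ = librosa.load(wav_path, sr=sr)                 # decode and resample
    y_trimmed, _ = librosa.effects.trim(y, top_db=30)    # remove silent edges
    # 16-bit PCM in a RAW container, to reduce later processing cost
    sf.write(out_path, y_trimmed, sr, subtype="PCM_16", format="RAW")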
When the first voice sample is collected, the first voice sample and the text corresponding to it are stored together. After collection, the text corresponding to the first voice sample is labeled character by character based on the pronunciation in the first voice sample, that is, each character in the text is given a pronunciation label according to its pronunciation in the first voice sample. The pronunciation label includes an initial, a final, a tone and a pronunciation mode, and each character in the text is labeled based on the self-defined initial table, final table, tone table and pronunciation mode table.
In practical applications, the customized initial table and final table can take the initials and finals of Mandarin Pinyin as a basis and then add the initials and finals missing from Mandarin Pinyin. The tone table can likewise be based on the tones of Mandarin Pinyin, with the missing tones added. The pronunciation mode table can be set according to the pronunciation characteristics of the dialect, such as retroflex sounds, stressed sounds and prolonged sounds.
For example, for the Shanxi dialect, the initial table, final table, tone table and pronunciation mode table of the Shanxi dialect are defined as follows:
Initial table: b p m f d t n l g k h j q x zh ch sh r z c s y ng (where ng can refer to the pronunciation of the English phonetic symbol /ŋ/);
Final table: a o e i u v ai i ui ao iu ie e er an en in un vn ang eng ong or (where the pronunciation of or can refer to the English phonetic symbol /ɔː(r)/);
Tone table: 100 (first tone), 101 (second tone), 102 (third tone), 103 (fourth tone), 104 (neutral tone);
Pronunciation mode table: 107 (stressed), 108 (prolonged), 109 (retroflex).
It should be noted that, because some pronunciations in the dialect differ from the pronunciations of Mandarin, an initial determined according to the pronunciation of an English phonetic symbol (for example, the "ng" shown in the example above) may be added to the customized initial table, and a final determined according to an English pronunciation (for example, the "or" shown in the example above) may be added to the customized final table, so as to ensure that the pronunciation of the dialect is represented more fully.
Because the common words of the dialect are covered as much as possible when the first voice sample is collected, the pronunciation label of each word in the first data set can be determined based on the pronunciation labeling process, and then the first data set corresponding to the dialect is constructed.
The pronunciation label of each word in the data set includes an initial, a final, a tone and a pronunciation mode. Illustratively, for the data set corresponding to the Shanxi dialect, the pronunciation labels of the characters in "I love Beijing Tiananmen" are as follows:
I: ng e 103 107
Love: ng ai 100
North: b ei 102
Beijing: j ing 102
Day: ti an 102
Safety: ng an 102
Door: m en 102
Note that when the pronunciation mode is omitted, normal pronunciation is assumed by default. It should also be noted that, because the characters "I", "love" and "safety" begin with a vowel or semi-vowel sound in the Shanxi dialect, their pronunciation differs from the Mandarin pronunciations "wo", "ai" and "an"; according to the Shanxi pronunciation, "I" is labeled "ng e", "love" is labeled "ng ai" and "safety" is labeled "ng an".
It should be noted that there are also polyphonic words in dialects, so a word may correspond to multiple pronunciation tags, for example: the pronunciation labels of the "back" word in Shanxi dialect are as follows:
{
    "behind": {
        "h er 103": ["the back"],
        "h or 103": ["the back head"],
        "h ou 100": ["rear"]
    }
}
Specifically, since the pronunciation tag of each word of the dialect is already included in the first data set, the pronunciation tag of each word in the text corresponding to the first speech sample can be determined by traversing the first data set, and the text sequence of the first speech sample can be determined based on a preset encoding rule.
Specifically, the text sequence determination process for obtaining the first speech sample may be obtained by a preprocessing device (including but not limited to a computer, a server, and other terminal devices with computing capabilities) through a processing process shown in fig. 3:
s1021: and traversing the first data set and carrying out pronunciation labels on all the characters of the first voice sample.
Specifically, by traversing the first data set, the pronunciation of each word in the text corresponding to the first voice sample is labeled by a pronunciation tag.
Illustratively, if the text corresponding to the first voice sample is "I love Beijing Tiananmen", the pronunciation label of "I" is "ng e 103 107"; the pronunciation label of "love" is "ng ai 100"; the pronunciation label of "north" is "b ei 102"; the pronunciation label of "Jing" is "j ing 102"; the pronunciation label of "day" is "ti an 102"; the pronunciation label of "an" is "ng an 102"; the pronunciation label of "door" is "m en 102". That is, the pronunciation label corresponding to "I love Beijing Tiananmen" is "ng e 103 107 ng ai 100 b ei 102 j ing 102 t i an 102 ng an 102 m en 102".
S1022: and determining the pronunciation sequence number corresponding to the pronunciation label of each word according to a preset coding rule so as to obtain the text sequence of the first voice sample.
Specifically, due to the complexity of dialects, and in order to represent dialect pronunciation more conveniently, no dedicated tone or vowel symbols are used; it is only necessary to know how many initials, finals and tones the dialect has, plus how many pronunciation modes, and different contents are then represented by different numbers. That is, the pronunciation sequence number corresponding to the pronunciation label of each character is determined according to the preset encoding rule. It should be noted that the preset encoding rule encodes the initials, finals, tones and pronunciation modes numerically: the numeric range may be divided according to the types of initials, finals, tones and pronunciation modes, with values in specific ranges representing each of them. Illustratively, the values 0 to 39 are reserved for the sequence numbers of initials, 40 to 99 for the sequence numbers of finals, 100 to 105 for the sequence numbers of tones, and 106 to 110 for the sequence numbers of pronunciation modes. Representing the pronunciation of the dialect by numeric codes increases the extensibility of the representation: if a new initial, final, tone or pronunciation mode is found in the dialect, only one sequence number needs to be added; and if a certain dialect has more forms of initials and finals, only the corresponding reserved range needs to be enlarged.
In a specific application, by specifying the sequence numbers corresponding to the initials, the finals, the tones and the pronunciation modes, the corresponding sequence number can be determined based on the pronunciation label, and then the text sequence corresponding to the first speech sample can be determined.
For example, the pronunciation label corresponding to "I love Beijing Tiananmen" in the Shanxi dialect is "ng e 103 107 ng ai 100 b ei 102 j ing 102 t i an 102 ng an 102 m en 102"; its corresponding text sequence is "234210310723461000471021162102543551022355102256102".
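A short sketch of how S1021 and S1022 could be realized is given below; the symbol-to-number table is a hypothetical placeholder, and only the reserved numeric ranges come from the description.

# A sketch of S1021-S1022: converting a pronunciation-labeled sentence into a
# numeric text sequence. The symbol-to-number assignments are hypothetical;
# only the reserved ranges (initials 0-39, finals 40-99, tones 100-105,
# pronunciation modes 106-110) come from the description.
SYMBOL_TO_ID = {
    "b": 0, "m": 2, "j": 10, "t": 14, "ng": 23,        # initials (hypothetical)
    "e": 42, "ai": 46, "ei": 47, "an": 53, "en": 56,   # finals (hypothetical)
    "100": 100, "102": 102, "103": 103,                # tones
    "107": 107,                                        # pronunciation modes
}

def labels_to_sequence(labels: str) -> list[int]:
    """Map a space-separated pronunciation label string to its number sequence."""
    return [SYMBOL_TO_ID[token] for token in labels.split()]

# For example: labels_to_sequence("ng e 103 107 ng ai 100")
# returns [23, 42, 103, 107, 23, 46, 100]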
S102: and inputting the first voice sample and the text sequence of the first voice sample into a preset voice synthesis model in pairs for processing to obtain an output audio corresponding to the text sequence of the first voice sample and the audio characteristics of the first voice sample.
In this embodiment, after the text sequence of the first voice sample is input into the preset speech synthesis model for processing, the corresponding output audio can be obtained; the speech synthesis model also obtains the audio feature corresponding to the first voice sample by processing the first voice sample, and uses the spectral feature of the first voice sample as the reference against which the output audio is checked, so as to verify whether the model has been trained. In application, the speech synthesis model may be Tacotron, an end-to-end deep-learning TTS model whose core structure is an attention-based encoder-decoder. During training, the model is trained on the collected first voice sample and the text sequence of the first voice sample. The speech synthesis model comprises an audio processing module, a text encoding module, a decoding module and a synthesis module.
Specifically, S102 may include the process shown in fig. 4:
s1031: and inputting the first voice sample into the audio processing module for processing to obtain the audio characteristics of the first voice sample.
Specifically, for the first voice sample, audio features are extracted by the audio processing module in order to capture the variation of its frequency-domain content more intuitively. The audio feature can be the Mel spectrum, a commonly used audio feature. A raw sound signal is a one-dimensional time-domain signal, from which it is difficult to see the frequency-domain behavior directly; the Fourier transform yields the frequency-domain information but loses the time-domain information, so the variation of the frequency content over time cannot be seen and the sound cannot be described well. The Mel spectrum is a spectrum used to represent short-term audio; its principle is based on a log spectrum represented on the nonlinear Mel scale. In one specific implementation, the voice sample may first be converted from the time domain to the frequency domain by a short-time Fourier transform, and its energy spectrum is then passed through a set of triangular filters distributed on the Mel scale to obtain the Mel spectrum; a cepstral analysis (taking the logarithm and applying a discrete cosine transform) of the Mel spectrum further yields the Mel cepstrum.
The short-time Fourier transform (STFT) applies the Fourier transform to short-time signals: the long signal is divided into short frames, each frame is windowed and Fourier-transformed (FFT), and the per-frame results are finally stacked along another dimension to obtain a two-dimensional, image-like signal. Since the first voice sample is a sound signal, the two-dimensional signal obtained by the STFT is a spectrogram. The spectrogram is often a large image, so in order to obtain sound features of a suitable size it is transformed into a Mel-scale spectrum by Mel-scale filter banks. The Mel cepstrum is obtained by performing cepstral analysis (taking the logarithm and applying a DCT) on the Mel spectrum.
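A sketch of the Mel-spectrum extraction described above, using librosa; the frame length, hop length and number of Mel bands are illustrative defaults, not values specified by the application.

import librosa
import numpy as np

def mel_spectrogram(wav_path: str) -> np.ndarray:
    """Extract a log-Mel spectrogram as the audio feature of a voice sample.

    The frame, hop and Mel-band sizes are illustrative defaults.
    """
    y, sr = librosa.load(wav_path, sr=22050)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
    )                                   # STFT + Mel-scale triangular filters
    return librosa.power_to_db(mel)     # log compression, shape (80, frames)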
S1032: and inputting the text sequence of the first voice sample into the text coding module for processing so as to obtain a feature vector corresponding to the text sequence.
Specifically, a one-hot vector corresponding to the text sequence of the first voice sample can be obtained by one-hot encoding the text sequence. One-hot encoding is a binary representation of a categorical variable; it maps discrete (text) features into a Euclidean space, so that each discrete value corresponds to a point in that space. This makes it easier to extract the features contained in the text sequence of the first voice sample and to obtain the feature vector corresponding to the text sequence.
One-hot encoding represents the text sequence by one-hot vectors whose elements are 0s and 1s. The one-hot vectors corresponding to the text sequence are then input into the text encoding module for processing.
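For illustration, one-hot encoding of a hypothetical text sequence can be done as follows, assuming PyTorch and a vocabulary of 111 symbols covering the reserved range 0 to 110 described earlier (an assumption, not a value stated by the application).

import torch
import torch.nn.functional as F

NUM_SYMBOLS = 111                                  # assumed: reserved range 0-110

sequence = torch.tensor([23, 42, 103, 107])        # a hypothetical text sequence
one_hot = F.one_hot(sequence, num_classes=NUM_SYMBOLS).float()
print(one_hot.shape)                               # torch.Size([4, 111])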
In this embodiment, the text encoding module includes a preprocessing module and a feature extraction module.
The pre-processing module (pre-net) comprises two fully connected layers with 256 and 128 neurons respectively; the number of hidden units of the first layer is consistent with the number of input units, and the number of hidden units of the second layer is half that of the first. A ReLU activation function and dropout of 0.5 are used, and the non-linear transformation introduced by the pre-processing module further speeds up the convergence of the model.
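A sketch of such a pre-processing module in PyTorch is shown below; the input dimension is an assumption, while the 256/128 layer sizes, ReLU activation and dropout of 0.5 follow the description.

import torch.nn as nn

class PreNet(nn.Module):
    """Two fully connected layers (256 and 128 units) with ReLU and dropout 0.5."""

    def __init__(self, in_dim: int = 256):   # input dimension assumed
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.5),
        )

    def forward(self, x):
        return self.net(x)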
The feature extraction module (CBHG) consists of a 1-D convolution bank, a highway network and a bidirectional GRU.
The 1-D convolution bank consists of one convolution layer, one pooling layer and two further one-dimensional convolution layers. The first convolution layer contains K one-dimensional filters of different sizes, where the filter sizes are 1, 2, 3, ..., K. Convolution kernels of different sizes extract context information of different lengths. The outputs of the K differently sized convolution kernels are stacked together, that is, the extracted context information is combined, and the result is input into the pooling layer for processing to increase local invariance; the pooling layer has a stride of 1 and a width of 2. The pooling output is then passed through the two one-dimensional convolution layers: the first has a filter size of 3, a stride of 1 and a ReLU activation function; the second has a filter size of 3 and a stride of 1 and acts only as a linear transformation. Finally, a residual connection is made between the output of the convolution layers and the vector input to the feature extraction module, that is, the vector input to the feature extraction module is added to the output of the convolution layers. Residual connections alleviate the vanishing-gradient problem of deep neural networks.
Each layer of the highway network is structured as follows: the input is fed into two single-layer fully connected networks at the same time, whose activation functions are ReLU and sigmoid respectively. Denoting the input by input, the output of the ReLU branch by output1 and the output of the sigmoid branch by output2, the output of the highway layer is output1 * output2 + input * (1 - output2).
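A sketch of one highway layer implementing the formula above in PyTorch (the layer width is arbitrary):

import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """One highway layer: a ReLU branch gated by a sigmoid branch, as described above."""

    def __init__(self, dim: int):
        super().__init__()
        self.h = nn.Linear(dim, dim)   # ReLU branch
        self.t = nn.Linear(dim, dim)   # sigmoid gate branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        output1 = torch.relu(self.h(x))
        output2 = torch.sigmoid(self.t(x))
        return output1 * output2 + x * (1.0 - output2)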
The bidirectional GRU takes this output as input and extracts the temporal structure of the text. A GRU is a variant of the RNN that, like the LSTM, uses a gating mechanism.
A feature vector corresponding to the text sequence can be obtained by a feature extraction module (CBHG).
S1033: and inputting the feature vector corresponding to the text sequence into the decoding module for processing to obtain a frequency spectrum corresponding to the feature vector.
Specifically, based on the attention mechanism, the feature vectors of the text sequence are sent to the decoding module for processing, so as to obtain the corresponding frequency spectrum.
The decoding module comprises a pre-net, an Attention-RNN and a Decoder-RNN. The structure of the pre-net is the same as that of the pre-processing module (pre-net) in the text encoding module. The Attention-RNN is a single RNN layer containing 256 GRU units; it takes the output of the pre-net and the attention context as input and, after the GRU units, passes its output to the Decoder-RNN. The Decoder-RNN is a two-layer residual GRU whose output is the sum of its input and the output of its GRU units. The initial input is a zero vector. During prediction, r frames of the Mel spectrum can be predicted at each step, which effectively reduces the amount of computation and speeds up model convergence.
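The following is a much-simplified sketch of a decoder step that predicts r Mel frames at a time, in PyTorch; the dimensions, and the omission of the attention mechanism and the residual structure, are simplifications for illustration only.

import torch
import torch.nn as nn

class MelFrameDecoder(nn.Module):
    """Simplified decoder step that predicts r Mel frames per step (dimensions illustrative)."""

    def __init__(self, context_dim: int = 256, hidden_dim: int = 256,
                 n_mels: int = 80, r: int = 3):
        super().__init__()
        self.r, self.n_mels = r, n_mels
        self.gru = nn.GRUCell(context_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, n_mels * r)   # r frames per step

    def forward(self, context: torch.Tensor, state: torch.Tensor):
        state = self.gru(context, state)
        frames = self.proj(state).view(-1, self.r, self.n_mels)
        return frames, state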
S1034: and inputting the frequency spectrum to the synthesis module for processing to obtain the output audio.
The synthesis module comprises a post-processing unit (post-processing net) and a synthesis unit. The output Mel spectrum is input into the post-processing unit to be converted into the corresponding linear spectrum, and the linear spectrum output by the post-processing net is finally synthesized into the output audio using the Griffin-Lim algorithm.
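A sketch of the final waveform reconstruction step with the Griffin-Lim algorithm, using librosa; the iteration count and the hop/window lengths are assumptions and must match those used when computing the linear spectrum.

import librosa
import numpy as np
import soundfile as sf

def linear_spectrum_to_audio(linear_spec: np.ndarray, sr: int = 22050) -> None:
    """Reconstruct a waveform from a magnitude (linear) spectrogram via Griffin-Lim."""
    audio = librosa.griffinlim(linear_spec, n_iter=60,
                               hop_length=256, win_length=1024)  # assumed settings
    sf.write("output.wav", audio, sr)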
It should be noted that, the sequence numbers of the steps in the foregoing embodiments do not mean the execution sequence, and the execution sequence of each process should be determined by the function and the inherent logic of the process, and should not constitute any limitation on the implementation process of the embodiments of the present application.
S103: and adjusting parameters of the speech synthesis model according to the audio characteristics of the first speech sample and the output audio to obtain a trained speech synthesis model.
Specifically, the spectral feature of the first voice sample may be compared with the spectral sequence of the output audio produced by the speech synthesis model. If the two are not identical, it is determined that the output of the model is still inaccurate; a training loss value of the speech synthesis model may then be calculated and the model parameters of the speech synthesis model adjusted.
In this embodiment, assuming the model parameters of the speech synthesis model are W1, the training loss value is propagated backward to modify the model parameters W1 of the speech synthesis model, so as to obtain modified parameters W2. After the parameters are modified, the steps of processing the first voice sample and the text sequence of the first voice sample are executed again, that is, the next round of training begins. In this round, a new set of first voice samples and their text sequences may be processed, the corresponding training loss value is calculated, and the training loss value is propagated backward to modify the model parameters W2 of the speech synthesis model, obtaining modified parameters W3, and so on. The above process is repeated continuously, with each round training on a new group of first voice samples and modifying the model parameters, until a preset training condition is met. The training condition may be that the number of training rounds reaches a preset threshold, and optionally the threshold may be 100000; the training condition may also be that the speech synthesis model converges. Since the model may converge before the number of rounds reaches the threshold, in which case further training would repeat unnecessary work, or the model may never converge, in which case training would loop indefinitely and never end, the training condition may also be that the number of rounds reaches the threshold or the speech synthesis model converges, whichever happens first. When the training condition is met, the trained speech synthesis model is obtained.
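A schematic training loop for the procedure above is sketched below. The L1 loss on Mel spectra, the Adam optimizer and the model/dataloader interfaces are assumptions for illustration; they are not specified by the application.

import torch
import torch.nn.functional as F

def train(model, dataloader, max_steps: int = 100000):
    # Assumed interfaces: model.audio_features() extracts the target Mel spectrum,
    # model() synthesizes a Mel spectrum from a text sequence.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    step = 0
    while step < max_steps:                            # preset training condition
        for audio, text_sequence in dataloader:        # paired sample and text sequence
            target_mel = model.audio_features(audio)
            predicted_mel = model(text_sequence)
            loss = F.l1_loss(predicted_mel, target_mel)
            optimizer.zero_grad()
            loss.backward()                            # back-propagate the training loss
            optimizer.step()                           # modify parameters W1 -> W2 -> ...
            step += 1
            if step >= max_steps:
                break
    return model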
In another embodiment, since the same dialect may have some differences due to different regions and the pronunciations of some words may be different, as shown in fig. 5, the method for training the speech synthesis model may further include the following steps:
s104: a second data set is constructed from a second speech sample and the first data set.
Specifically, the second voice sample is a voice sample of a variant of the same dialect spoken in a region different from that of the first voice sample. The dialect voice samples are supplemented by recording voice samples of the dialect in a plurality of other regions.
The second data set takes the first data set as a basis and supplements it with the pronunciation labels of the characters in the second voice sample, thereby ensuring that the second data set contains as many of the pronunciations of each character in the dialect as possible.
As a possible implementation manner of the present application, the above S104 may include the following steps:
carrying out pronunciation labeling on the text corresponding to the second voice sample according to the second voice sample and determining a pronunciation label of each character in the second voice sample;
and adding the pronunciation label of each word in the text corresponding to the second voice sample into the first data set to obtain a second data set.
Specifically, the second data set can be constructed by adding the pronunciation label of the word in the second speech sample on the basis of the first data set.
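A small sketch of this merging step is given below; the character-to-labels dictionary structure and the function name are assumptions modeled on the polyphone example shown earlier, not an actual data format of the application.

def build_second_data_set(first_data_set: dict[str, set[str]],
                          second_sample_labels: dict[str, set[str]]) -> dict[str, set[str]]:
    """Merge the pronunciation labels found in the second voice sample into the first data set."""
    second_data_set = {char: set(labels) for char, labels in first_data_set.items()}
    for char, labels in second_sample_labels.items():
        second_data_set.setdefault(char, set()).update(labels)  # add new pronunciations
    return second_data_set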
S105: and carrying out pronunciation labeling on the second voice sample based on the second data set to obtain a text sequence of the second voice sample.
Here, since the second data set also includes pronunciation tags of the words of the dialect, the specific pronunciation tagging process is similar to S102, see S102.
S106: and training the preset voice synthesis model based on the audio features of the second voice sample and the text sequence of the second voice sample.
The corresponding output audio can be obtained by inputting the text sequence of the second voice sample into a preset voice synthesis model for processing, and the voice synthesis model can also obtain the spectral feature corresponding to the second voice sample by processing the second voice sample, and the spectral feature of the second voice sample is used as a standard for verifying the output audio, so as to verify whether the model is trained.
As a possible implementation manner of the present application, the above S106 may include the following steps:
inputting the text sequence of the second voice sample into a preset voice synthesis model for processing to obtain corresponding output audio;
processing the second voice sample through the preset voice synthesis model to obtain a frequency spectrum characteristic corresponding to the second voice sample;
and adjusting parameters of the speech synthesis model according to the spectral feature corresponding to the second speech sample and the output audio to obtain a trained speech synthesis model.
Specifically, the specific process of adjusting the parameters of the speech synthesis model according to the spectral feature corresponding to the second speech sample and the output audio to obtain the trained speech synthesis model may refer to S103, which is not described herein again.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
In summary, the training method for a speech synthesis model provided by the embodiments of the present application can perform pronunciation labeling based on sample speech of a dialect and train the speech synthesis model on the resulting text sequences and the collected voice samples, so as to obtain a speech synthesis model capable of synthesizing speech with the pronunciation characteristics of the dialect, thereby improving the quality of dialect speech synthesis.
Referring to fig. 6, another embodiment of the present application provides a speech synthesis method, as shown in fig. 6, including:
s201: and acquiring a text to be synthesized.
The text to be synthesized can be text data which is instantly acquired by a user through an input device such as a keyboard of terminal equipment such as a mobile phone and a tablet personal computer. In a specific usage scenario of this embodiment, when a user wants to perform speech synthesis immediately, a speech synthesis mode of a terminal device may be opened by clicking a specific physical key or virtual key before a text to be synthesized is collected, and in this mode, the terminal device may process a text input by the user according to subsequent steps to obtain an output audio corresponding to the text, where a specific processing procedure will be described in detail later.
The text to be synthesized may also be text data originally stored in the terminal device, or text data acquired by the terminal device from a cloud server or other terminal devices through a network. In another specific use scenario of this embodiment, when a user wants to perform speech synthesis on an existing text to be synthesized, a speech synthesis mode of a terminal device may be opened by clicking a specific physical key or virtual key, and the text to be synthesized is selected (the order of clicking the key and selecting the text may be interchanged, that is, the text may be selected first, and then the speech synthesis mode of the terminal device is opened), so that the terminal device may process the text to be synthesized according to subsequent steps to obtain output audio corresponding to the text to be synthesized, and a specific processing procedure will be described in detail later.
S202: and inputting the text to be synthesized into a trained voice synthesis model for processing, and acquiring the output audio output by the voice synthesis model.
The speech synthesis model is the speech synthesis model described in the previous embodiment.
Specifically, the text to be synthesized is input into the speech synthesis model, which converts it into the corresponding text sequence, determines the feature vector corresponding to that text sequence, further obtains the frequency spectrum corresponding to the feature vector, and performs speech synthesis based on the frequency spectrum, so that the output audio corresponding to the text to be synthesized is obtained. In other words, after the text to be synthesized is input into the trained speech synthesis model for processing, the corresponding output audio can be obtained.
In this embodiment, the speech synthesis model includes a text encoding module, a decoding module and a synthesis module, and S202 includes:
inputting the text to be synthesized into the text coding module for processing to obtain a feature vector corresponding to the text to be synthesized;
inputting the feature vector corresponding to the text to be synthesized into the decoding module for processing so as to obtain a frequency spectrum corresponding to the feature vector;
and inputting the frequency spectrum to the synthesis module for processing to obtain the output audio.
In one embodiment, the inputting the text to be synthesized into the text encoding module for processing to obtain a feature vector corresponding to the text to be synthesized includes:
generating a corresponding text sequence according to the text to be synthesized;
and performing feature extraction on the text sequence to obtain a feature vector corresponding to the text to be synthesized.
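Putting the modules together, a schematic inference pipeline for S201 and S202 might look as follows; the model object and its method names are hypothetical stand-ins for the modules described above, not an actual API.

import soundfile as sf

def synthesize(text_to_synthesize: str, model, sr: int = 22050) -> None:
    sequence = model.text_to_sequence(text_to_synthesize)  # text -> numeric text sequence
    features = model.encode(sequence)                      # text encoding module
    spectrum = model.decode(features)                      # decoding module
    audio = model.vocode(spectrum)                         # synthesis module (e.g. Griffin-Lim)
    sf.write("synthesized.wav", audio, sr)                 # save the output audio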
In the speech synthesis method provided by this embodiment, the speech synthesis model obtained by the above training method likewise allows a corresponding data set to be constructed based on the dialect, pronunciation labeling to be performed based on the data set, and the model to be trained on the resulting text sequences and collected voice samples, so that a speech synthesis model capable of synthesizing speech with the pronunciation characteristics of the dialect is obtained and the quality of dialect speech synthesis is improved.
Fig. 7 shows a schematic block diagram of a terminal device provided in an embodiment of the present application, and only shows a part related to the embodiment of the present application for convenience of description.
As shown in fig. 7, the terminal device 7 of this embodiment includes: a processor 70, a memory 71 and a computer program 72 stored in the memory 71 and executable on the processor 70. When executing the computer program 72, the processor 70 implements the steps in the above embodiments of the training method for a speech synthesis model, such as steps S101 to S103 shown in fig. 1, or the steps in the speech synthesis method described above.
Illustratively, the computer program 72 may be partitioned into one or more modules/units that are stored in the memory 71 and executed by the processor 70 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 72 in the terminal device 7.
The terminal device 7 may be a mobile phone, a tablet computer, a desktop computer, a notebook computer, a palm computer, a cloud server, or other computing devices. It will be understood by those skilled in the art that fig. 7 is only an example of the terminal device 7, and does not constitute a limitation to the terminal device 7, and may include more or less components than those shown, or combine some components, or different components, for example, the terminal device 7 may further include an input-output device, a network access device, a bus, etc.
The Processor 70 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or a memory of the terminal device 7. The memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 7. Further, the memory 71 may also include both an internal storage unit and an external storage device of the terminal device 7. The memory 71 is used for storing the computer programs and other programs and data required by the terminal device 7. The memory 71 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
If the integrated modules/units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow in the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunication signals in accordance with legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (13)

1. A method for training a speech synthesis model, comprising:
carrying out pronunciation labeling on the first voice sample to obtain a text sequence of the first voice sample;
inputting the first voice sample and the text sequence of the first voice sample into a preset voice synthesis model in pairs to obtain an output audio corresponding to the text sequence of the first voice sample and an audio characteristic of the first voice sample;
and adjusting parameters of the speech synthesis model according to the audio characteristics of the first speech sample and the output audio to obtain a trained speech synthesis model.
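
Purely as an illustration of claim 1, the loop below sketches the training procedure in Python with PyTorch. Every name here (train, label_pronunciation, first_voice_samples, max_steps) is a hypothetical stand-in, and the L1 loss and step-count stopping condition are assumptions: the claim does not fix the loss function or the training condition, and it is assumed that the model returns the output audio and the audio features in a directly comparable form (for example, as spectrogram frames of the same shape).

import torch
import torch.nn.functional as F

def train(model, optimizer, first_voice_samples, label_pronunciation, max_steps=10000):
    """Hypothetical claim-1 loop: label pronunciation, feed the (voice sample,
    text sequence) pair to the model, and adjust parameters from the gap
    between the output audio and the sample's audio features."""
    step = 0
    while step < max_steps:                                 # assumed stopping condition
        for waveform in first_voice_samples:                # each first voice sample (a tensor)
            text_sequence = label_pronunciation(waveform)   # pronunciation labeling -> text sequence
            output_audio, audio_features = model(waveform, text_sequence)  # paired input
            loss = F.l1_loss(output_audio, audio_features)  # assumed: L1 between comparable frames
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= max_steps:
                return model
    return model
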
2. The method of claim 1, wherein the speech synthesis model comprises an audio processing module, a text encoding module, a decoding module, and a synthesis module;
the inputting the first voice sample and the text sequence of the first voice sample into a preset voice synthesis model in pairs to obtain an output audio corresponding to the text sequence of the first voice sample and an audio feature of the first voice sample includes:
inputting the first voice sample into the audio processing module for processing to obtain the audio characteristics of the first voice sample;
inputting the text sequence of the first voice sample into the text coding module for processing to obtain a feature vector corresponding to the text sequence;
inputting the feature vector corresponding to the text sequence into the decoding module for processing to obtain a frequency spectrum corresponding to the feature vector;
and inputting the frequency spectrum to the synthesis module for processing to obtain the output audio.
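
One possible reading of the four modules of claim 2 as code is sketched below. The class name, the layer choices, the mel-spectrogram front end and the Griffin-Lim vocoder are all assumptions made for illustration; the claim only names the modules and the data flowing between them.

import torch
import torch.nn as nn
import torchaudio

class SpeechSynthesisModel(nn.Module):
    """Illustrative stand-in for the four modules named in claim 2."""
    def __init__(self, vocab_size=100, hidden=256, n_fft=400, n_mels=80):
        super().__init__()
        # Audio processing module: first voice sample (waveform) -> audio features.
        self.audio_processing = torchaudio.transforms.MelSpectrogram(n_fft=n_fft, n_mels=n_mels)
        # Text encoding module: text sequence -> feature vector per position.
        self.embedding = nn.Embedding(vocab_size, hidden)
        self.text_encoder = nn.GRU(hidden, hidden, batch_first=True)
        # Decoding module: feature vectors -> frequency spectrum.
        self.decoder = nn.Linear(hidden, n_fft // 2 + 1)
        # Synthesis module: spectrum -> output audio (Griffin-Lim as a simple vocoder).
        self.synthesizer = torchaudio.transforms.GriffinLim(n_fft=n_fft)

    def forward(self, waveform, text_sequence):
        audio_features = self.audio_processing(waveform)               # step 1 of claim 2
        encoded, _ = self.text_encoder(self.embedding(text_sequence))  # step 2
        spectrum = torch.relu(self.decoder(encoded))                   # step 3 (non-negative magnitudes)
        output_audio = self.synthesizer(spectrum.transpose(1, 2))      # step 4
        return output_audio, audio_features

A (voice sample, text sequence) pair would then be consumed as model(waveform, text_sequence), as in the training loop sketched after claim 1.
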
3. The method for training a speech synthesis model according to claim 1, wherein the performing pronunciation labeling on the first speech sample to obtain a text sequence of the first speech sample comprises:
constructing a first data set from the first speech sample;
and carrying out pronunciation labeling on the first voice sample based on the first data set to obtain a text sequence of the first voice sample.
4. A method of training a speech synthesis model according to claim 3, wherein said constructing a first data set from first speech samples comprises:
carrying out pronunciation labeling on the text corresponding to the first voice sample according to the first voice sample, and determining a pronunciation label of each word in the first data set.
5. The method for training a speech synthesis model according to claim 4, wherein the performing pronunciation labeling on the first speech sample based on the first data set to obtain the text sequence of the first speech sample comprises:
traversing the first data set, and carrying out pronunciation labeling on each character of the first voice sample;
and determining the pronunciation sequence number corresponding to the pronunciation label of each word according to a preset coding rule so as to obtain the text sequence of the first voice sample.
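
Claims 3 to 5 can be pictured as a small dictionary pipeline. A plain-Python sketch follows; the pinyin-style labels, the example characters and the sorting-based coding rule are invented for illustration only, since the patent does not specify the label notation or the numbering rule.

# Hypothetical first data set (claims 3-4): each character of the first voice
# sample's text is mapped to a pronunciation label.
first_data_set = {"你": "ni3", "好": "hao3", "吗": "ma5"}

# Hypothetical preset coding rule (claim 5): each distinct pronunciation label
# is assigned a fixed pronunciation sequence number.
coding_rule = {label: idx for idx, label in
               enumerate(sorted(set(first_data_set.values())), start=1)}

def text_to_sequence(text, data_set, rule):
    """Traverse the data set, label the pronunciation of each character, then
    map every label to its sequence number to obtain the text sequence."""
    labels = [data_set[ch] for ch in text if ch in data_set]
    return [rule[label] for label in labels]

print(text_to_sequence("你好吗", first_data_set, coding_rule))  # prints [3, 1, 2]
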
6. The method for training a speech synthesis model according to any one of claims 1 to 5, further comprising:
constructing a second data set from a second speech sample and the first data set;
performing pronunciation labeling on the second voice sample based on the second data set to obtain a text sequence of the second voice sample;
and training the preset voice synthesis model based on the second voice sample and the text sequence of the second voice sample.
7. The method of training a speech synthesis model according to claim 6, wherein said constructing a second data set from a second speech sample and said first data set comprises:
carrying out pronunciation labeling on the text corresponding to the second voice sample according to the second voice sample and determining a pronunciation label of each word in the text corresponding to the second voice sample;
and adding the pronunciation label of each word in the text corresponding to the second voice sample into the first data set to obtain a second data set.
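
For claim 7, constructing the second data set amounts to merging the newly labeled words into the existing dictionary. A minimal sketch, reusing the hypothetical dictionary format from the previous example; the label values are placeholders, not labels from the patent.

def build_second_data_set(first_data_set, second_sample_labels):
    """Claim-7 sketch: add the pronunciation labels obtained from the text of
    the second voice sample to the first data set."""
    second_data_set = dict(first_data_set)        # keep the labels already collected
    second_data_set.update(second_sample_labels)  # add the second sample's labels
    return second_data_set

second_data_set = build_second_data_set({"你": "ni3", "好": "hao3"},
                                        {"早": "zao3", "安": "an1"})
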
8. The method for training a speech synthesis model according to claim 6, wherein the training the preset speech synthesis model based on the second speech sample and the text sequence of the second speech sample comprises:
inputting the text sequence of the second voice sample into a preset voice synthesis model for processing to obtain corresponding output audio;
processing the second voice sample through the preset voice synthesis model to obtain a frequency spectrum characteristic corresponding to the second voice sample;
and adjusting parameters of the speech synthesis model according to the spectral feature corresponding to the second speech sample and the output audio to obtain a trained speech synthesis model.
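
Claim 8 mirrors the loop of claim 1 but is driven by the second voice sample and its spectral features. The sketch below again assumes that the model returns the output audio and the spectral features in a comparable representation, and the Adam optimizer and L1 loss are assumptions rather than choices stated in the claim.

import torch
import torch.nn.functional as F

def adapt_on_second_sample(model, second_voice_samples, second_text_sequences,
                           lr=1e-4, epochs=1):
    """Claim-8 sketch: synthesize from the second sample's text sequence and
    adjust parameters toward the second sample's spectral features."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for waveform, text_sequence in zip(second_voice_samples, second_text_sequences):
            output_audio, spectral_features = model(waveform, text_sequence)
            loss = F.l1_loss(output_audio, spectral_features)  # assumed loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
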
9. A method of speech synthesis, comprising:
acquiring a text to be synthesized;
inputting the text to be synthesized into a trained speech synthesis model for processing, and obtaining an output audio output by the speech synthesis model, wherein the speech synthesis model is obtained based on the training method of the speech synthesis model according to any one of claims 1 to 8.
10. A speech synthesis method according to claim 9, wherein the speech synthesis model comprises a text encoding module, a decoding module and a synthesis module;
the inputting the text to be synthesized into a trained speech synthesis model for processing to obtain the output audio output by the speech synthesis model includes:
inputting the text to be synthesized into the text coding module for processing to obtain a feature vector corresponding to the text to be synthesized;
inputting the feature vector corresponding to the text to be synthesized into the decoding module for processing so as to obtain a frequency spectrum corresponding to the feature vector;
and inputting the frequency spectrum to the synthesis module for processing to obtain the output audio.
11. The speech synthesis method of claim 9, wherein the inputting the text to be synthesized into the text encoding module for processing to obtain the feature vector corresponding to the text to be synthesized comprises:
generating a corresponding text sequence according to the text to be synthesized;
and performing feature extraction on the text sequence to obtain a feature vector corresponding to the text to be synthesized.
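
At inference time (claims 9 to 11) only the text side of the model is exercised: text to be synthesized -> text sequence -> feature vector -> spectrum -> output audio. The sketch below reuses the hypothetical SpeechSynthesisModel and text_to_sequence helpers from the earlier sketches; the call to torchaudio.save, the output path and the sample rate are additional assumptions.

import torch
import torchaudio

def synthesize(model, text, data_set, coding_rule,
               out_path="output.wav", sample_rate=22050):
    """Claims 9-11 sketch: turn the text to be synthesized into output audio."""
    model.eval()
    with torch.no_grad():
        # Claim 11: generate the text sequence, then extract feature vectors.
        sequence = torch.tensor([text_to_sequence(text, data_set, coding_rule)])
        encoded, _ = model.text_encoder(model.embedding(sequence))
        # Claim 10: decode to a spectrum and synthesize the output audio.
        spectrum = torch.relu(model.decoder(encoded))
        output_audio = model.synthesizer(spectrum.transpose(1, 2))
    torchaudio.save(out_path, output_audio, sample_rate)
    return output_audio
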
12. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method of training a speech synthesis model according to any of claims 1 to 8 or the method of speech synthesis according to any of claims 9 to 11 when executing the computer program.
13. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a method for training a speech synthesis model according to any one of claims 1 to 8 or a method for speech synthesis according to any one of claims 9 to 11.
CN202010175459.0A 2020-03-13 2020-03-13 Training method of voice synthesis model and voice synthesis method Pending CN113450756A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010175459.0A CN113450756A (en) 2020-03-13 2020-03-13 Training method of voice synthesis model and voice synthesis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010175459.0A CN113450756A (en) 2020-03-13 2020-03-13 Training method of voice synthesis model and voice synthesis method

Publications (1)

Publication Number Publication Date
CN113450756A true CN113450756A (en) 2021-09-28

Family

ID=77806285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010175459.0A Pending CN113450756A (en) 2020-03-13 2020-03-13 Training method of voice synthesis model and voice synthesis method

Country Status (1)

Country Link
CN (1) CN113450756A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023082931A1 (en) * 2021-11-11 2023-05-19 北京有竹居网络技术有限公司 Method for punctuation recovery in speech recognition, and device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108899009A (en) * 2018-08-17 2018-11-27 百卓网络科技有限公司 A kind of Chinese Speech Synthesis System based on phoneme
CN109036375A (en) * 2018-07-25 2018-12-18 腾讯科技(深圳)有限公司 Phoneme synthesizing method, model training method, device and computer equipment
CN109754778A (en) * 2019-01-17 2019-05-14 平安科技(深圳)有限公司 Phoneme synthesizing method, device and the computer equipment of text
CN109767755A (en) * 2019-03-01 2019-05-17 广州多益网络股份有限公司 A kind of phoneme synthesizing method and system
CN110223705A (en) * 2019-06-12 2019-09-10 腾讯科技(深圳)有限公司 Phonetics transfer method, device, equipment and readable storage medium storing program for executing
US20200051583A1 (en) * 2018-08-08 2020-02-13 Google Llc Synthesizing speech from text using neural networks


Similar Documents

Publication Publication Date Title
Van Den Oord et al. Wavenet: A generative model for raw audio
CN109523989B (en) Speech synthesis method, speech synthesis device, storage medium, and electronic apparatus
CN111292720A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
WO2022156544A1 (en) Speech synthesis method and apparatus, and readable medium and electronic device
WO2022105553A1 (en) Speech synthesis method and apparatus, readable medium, and electronic device
CN111292719A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
Kaur et al. Conventional and contemporary approaches used in text to speech synthesis: A review
CN114038447A (en) Training method of speech synthesis model, speech synthesis method, apparatus and medium
CN112786008A (en) Speech synthesis method, device, readable medium and electronic equipment
CN112634865B (en) Speech synthesis method, apparatus, computer device and storage medium
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
CN112365878A (en) Speech synthesis method, device, equipment and computer readable storage medium
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN113345415A (en) Speech synthesis method, apparatus, device and storage medium
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
CN113362804A (en) Method, device, terminal and storage medium for synthesizing voice
CN113450756A (en) Training method of voice synthesis model and voice synthesis method
CN113327578B (en) Acoustic model training method and device, terminal equipment and storage medium
Gambhir et al. End-to-end multi-modal low-resourced speech keywords recognition using sequential Conv2D nets
Yu et al. Abstractive headline generation for spoken content by attentive recurrent neural networks with ASR error modeling
Kulkarni et al. Audio Recognition Using Deep Learning for Edge Devices
Paul et al. A Continuous Speech Recognition System for Bangla Language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination