CN112863483A - Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm - Google Patents

Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm

Info

Publication number
CN112863483A
CN112863483A
Authority
CN
China
Prior art keywords
speaker
text
predicted
duration
coding information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110008049.1A
Other languages
Chinese (zh)
Other versions
CN112863483B (en)
Inventor
盛乐园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yizhi Intelligent Technology Co ltd
Original Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yizhi Intelligent Technology Co ltd filed Critical Hangzhou Yizhi Intelligent Technology Co ltd
Priority to CN202110008049.1A priority Critical patent/CN112863483B/en
Publication of CN112863483A publication Critical patent/CN112863483A/en
Application granted granted Critical
Publication of CN112863483B publication Critical patent/CN112863483B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/26 - Pre-filtering or post-filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses a speech synthesis device supporting multi-speaker style and language switching with controllable prosody, belonging to the field of speech synthesis. The device comprises: a text preprocessing unit for preprocessing the acquired text data; a language switching unit for storing and displaying the speaker labels corresponding to training data of different language types and automatically identifying the language type of the text to be synthesized; a style switching unit for specifying a speech synthesis style according to the language type; a speaker switching unit for designating a speaker; an encoding-decoding unit for obtaining a predicted Mel spectrum; a training unit for training the encoding-decoding unit; and a speech synthesis unit for generating the predicted Mel spectrum and converting it into a sound signal for voice playback. The invention can generate speech with richer prosodic variation while controlling the speaker and the speaking style independently.

Description

Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
Technical Field
The invention belongs to the field of speech synthesis, and particularly relates to a speech synthesis device that supports multi-speaker style and language switching and has controllable prosody.
Background
With the development of deep learning in recent years, speech synthesis technology has improved greatly. Speech synthesis has moved from the traditional parametric and concatenative approaches towards end-to-end approaches, which typically first generate a Mel spectrum from text features and then synthesize the speech from that Mel spectrum with a vocoder. These end-to-end methods can be divided by structure into autoregressive and non-autoregressive models. Autoregressive models usually generate speech autoregressively with an Encoder-Attention-Decoder mechanism: to generate the current data point, all previous data points in the time series must be generated as model inputs, as in Tacotron, Tacotron 2, Deep Voice 3, ClariNet and Transformer TTS. Although autoregressive models can produce satisfactory results, the attention alignment may fail, leading to repeated or missing words in the synthesized speech. Non-autoregressive models such as ParaNet, FastSpeech, AlignTTS and FastSpeech 2 can generate the Mel spectrum from text features in parallel, much faster than autoregressive models.
Existing speech synthesis methods offer only limited control over the synthesized speech, cannot synthesize mixed speech in multiple languages, and cannot decouple the styles of multiple speakers so that a style can be applied to other speakers.
Therefore, how to make a speech synthesis system support multiple speakers while keeping prosody controllable, and how to decouple speaker styles so that they can be applied to other speakers, remains an unsolved problem in the field of computer intelligent speech synthesis.
Disclosure of Invention
The invention aims to solve the above problems in the prior art. On one hand, the prosody of the synthesized speech is controlled through four features: prosody-labeled text, duration, energy and pitch. On the other hand, the invention supports language transfer, giving a speaker who has data in only one language the ability to speak a second language, and it can also decouple the style of multiple speakers from the speaker characteristics and apply that style to other speakers (speaker style transfer). By optimizing the speech synthesis model, the invention overcomes the restriction of language and speaking style to a particular speaker, separates the multilingual styles of multiple speakers, and makes prosody comprehensively controllable.
In order to achieve the purpose, the invention adopts the following specific technical scheme:
a speech synthesis apparatus supporting multi-speaker style, language switching and controllable prosody, comprising:
the text acquisition unit is used for acquiring different text data according to the mode of the voice synthesis device, acquiring a mixed training text with a rhythm label and corresponding standard voice audio in the training mode, and marking a speaker label of each standard voice audio; acquiring a text to be synthesized in a prediction mode;
the text preprocessing unit is used for converting the text into a phoneme sequence with a prosodic tag, and outputting a real Mel frequency spectrum, a real energy, a real pitch, a real duration and a corresponding speaker tag according to a standard voice audio corresponding to the text during a training mode;
the language switching unit is used for storing and displaying speaker labels corresponding to training data of different language types and automatically identifying the language type of the text to be synthesized;
the style switching unit is used for reading the language type of the text displayed by the language switching unit and setting a first speaker label as a voice synthesis style according to the language type;
a speaker switching unit for setting a second speaker tag as a designated speaker;
in the training mode, the first speaker label and the second speaker label are both speaker labels marked in the mixed training sample; in the prediction mode, the first speaker tag and the second speaker tag are respectively specified by a user through a style switching unit and a speaker switching unit;
an encoding-decoding unit including an encoder for encoding a phoneme sequence with a prosody tag, a first speaker tag, and a second speaker tag; the prosody control unit is used for predicting and adjusting the duration, pitch and energy of voice synthesis; the decoder is used for combining the first speaker coding information, the second speaker coding information, the pitch and the energy which are regulated by the rhythm control unit, and decoding the combined coding information to obtain a predicted Mel frequency spectrum;
the training unit is used for training the coding-decoding unit and saving the coding-decoding unit as a model file after the training is finished;
and the voice synthesis unit is used for loading the model file generated by the training unit, reading the text to be synthesized in the text acquisition unit, the first speaker label set by the style switching unit and the second speaker label set by the speaker switching unit as the input of the model, generating a predicted Mel frequency spectrum, and converting the predicted Mel frequency spectrum into a voice signal for voice playing.
Further, the prosodic tags comprise prosodic words, prosodic phrases, intonation phrases, sentence ends and character boundaries. The prosodic tags are added by adopting a pre-trained prosodic phrase boundary prediction model, the text to be synthesized is input into the pre-trained prosodic phrase boundary prediction model, and the text to be synthesized with the prosodic tags is output.
Further, the multiple speaker tag refers to a natural number sequence that distinguishes each speaker.
Further, the decoder consists of a bi-directional LSTM and a linear affine transform.
Furthermore, the duration control unit, the pitch control unit and the energy control unit each consist of three one-dimensional convolution layers with regularization layers, a bidirectional gated recurrent unit GRU and a linear affine transformation.
The invention can utilize the coding-decoding unit, the duration/energy/pitch control unit and the neural network decoder to realize the control of the rhythm in the synthesized voice and further realize the support of the separation of the multi-speaker and the multi-speaker style. Compared with the prior art, the invention has the beneficial effects that:
(1) the encoding-decoding unit of the invention comprises an encoder, a prosody control unit and a decoder, wherein the encoder is used for encoding a phoneme sequence with a prosody label, a first speaker label and a second speaker label; the prosody control unit is used for predicting and adjusting the duration, pitch and energy of voice synthesis; the decoder is used for combining the first speaker coding information, the second speaker coding information, the pitch and the energy which are regulated by the prosody control unit, decoding the combined coding information and obtaining the predicted Mel frequency spectrum. By the method, the voice synthesis system supporting multiple speakers and multiple languages is realized, the model function is richer, the control of pitch, energy, duration and rhythm of voice synthesis can be supported, the speaker tag and the speaking style tag can be designated in the voice synthesis, and the migration of the language and the speaking style of the speaker is supported, for example, the speaker who only speaks in Chinese can be migrated to the same speaker who speaks in English in customer service style.
(2) Compared with the traditional method for separately constructing various models, the method adopts a mode of directly converting text to acoustic characteristics, avoids the influence of single model prediction error on the effect of the whole model, thereby improving the fault-tolerant capability of the model and reducing the deployment cost of the model; and the CBHG module is adopted to effectively model the current and context information, extract the features of higher level and extract the context features of the sequence, and the method can learn more text pronunciation features which are difficult to define by people through data, thereby effectively improving the pronunciation effect of the voice.
(3) The text of the invention also comprises a preprocessing process before being input into the speech synthesis model, namely a process of adding prosodic tags to the text in a prosodic phrase boundary prediction mode, thereby ensuring controllable text prosody and solving the defect of reduced naturalness of long sentence synthesis caused by unnatural prosodic pause of synthesized speech in the traditional speech synthesis method; through a supervision mode, the text, the duration, the energy and the pitch with rhythm labeling are finely controlled, so that the rhythm of the synthesized voice is more comprehensively controlled, and the voice with richer and more natural rhythm change is synthesized;
(4) the invention simplifies the complexity of the speech synthesis model training by introducing the time length control unit, and because the traditional process of dynamically aligning the text and the audio by adopting the attention module in the end-to-end speech synthesis model needs a large amount of computing resource consumption and time consumption, the invention avoids the process of aligning the text and the audio in the form of autoregressive attention, thereby reducing the requirements on the computing resources and saving the computing cost of the model. For example, the traditional Chinese and English two independent models can be replaced by a Chinese-English mixed model, and the method can be used for not only Chinese-English mixed speech synthesis, but also other mixed languages.
Drawings
FIG. 1 is an overall schematic diagram of a speech synthesis apparatus of the present invention;
FIG. 2 is a schematic diagram of a pitch/energy/duration control unit for use in the present invention;
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
Compared with the traditional scheme, the invention better controls the rhythm pause information in the synthesized voice by using a skip neural network encoder, and simultaneously controls the rhythm pronunciation information of each frame in the synthesized voice by using the predicted duration, energy and pitch, so that the characteristic of the synthesized voice can be more accurately controlled, and the voice with richer rhythm change can be generated. On the other hand, the support to multiple speakers is realized, the whole solution is completely completed by one model, the language of the text does not need to be distinguished, and the complexity of the model is reduced.
As shown in fig. 1, a speech synthesizer supporting multi-speaker style, language switching and controllable prosody of the present invention comprises the following units:
the text acquisition unit is used for acquiring different text data according to the mode of the voice synthesis device, acquiring a mixed training text with a rhythm label and corresponding standard voice audio in the training mode, and marking a speaker label of each standard voice audio; acquiring a text to be synthesized in a prediction mode;
the text preprocessing unit is used for converting the text into a phoneme sequence with a prosodic tag, and outputting a real Mel frequency spectrum, a real energy, a real pitch, a real duration and a corresponding speaker tag according to a standard voice audio corresponding to the text during a training mode;
the language switching unit is used for storing and displaying speaker labels corresponding to training data of different language types and automatically identifying the language type of the text to be synthesized;
the style switching unit is used for reading the language type of the text displayed by the language switching unit and setting a first speaker label as a voice synthesis style according to the language type;
a speaker switching unit for setting a second speaker tag as a designated speaker;
in the training mode, the first speaker label and the second speaker label are both speaker labels marked in the mixed training sample; in the prediction mode, the first speaker tag and the second speaker tag are respectively specified by a user through a style switching unit and a speaker switching unit;
an encoding-decoding unit including an encoder for encoding a phoneme sequence with a prosody tag, a first speaker tag, and a second speaker tag; the prosody control unit is used for predicting and adjusting the duration, pitch and energy of voice synthesis; the decoder is used for combining the first speaker coding information, the second speaker coding information, the pitch and the energy which are regulated by the rhythm control unit, and decoding the combined coding information to obtain a predicted Mel frequency spectrum;
the training unit is used for training the coding-decoding unit and saving the coding-decoding unit as a model file after the training is finished;
and the voice synthesis unit is used for loading the model file generated by the training unit, reading the text to be synthesized in the text acquisition unit, the first speaker label set by the style switching unit and the second speaker label set by the speaker switching unit as the input of the model, generating a predicted Mel frequency spectrum, and converting the predicted Mel frequency spectrum into a voice signal for voice playing.
In an embodiment of the present invention, the text preprocessing unit converts the text into a phoneme sequence with a prosodic tag, specifically:
respectively converting the different language types in the text into corresponding pronunciation phonemes to construct a mixed phoneme dictionary; the phonemes with prosodic labels are mapped to serialized data using the mixed phoneme dictionary, obtaining a phoneme sequence w1, w2, …, wU, where U is the length of the text.
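As an illustration of this mapping step (not part of the patent; the phoneme inventories shown, the reserved padding index and the function names are assumptions), a minimal Python sketch could be:

# Minimal sketch: build a mixed phoneme dictionary and map a prosody-tagged
# phoneme sequence to integer IDs. The phoneme inventories are toy examples.
PROSODY_TAGS = ["#1", "#2", "#3", "#4", "#S"]  # prosodic word/phrase, intonation phrase, sentence end, character boundary

def build_mixed_dictionary(chinese_phonemes, english_phonemes):
    """Merge the two phoneme inventories plus the prosody tags into one symbol table."""
    symbols = sorted(set(chinese_phonemes) | set(english_phonemes)) + PROSODY_TAGS
    return {sym: idx for idx, sym in enumerate(symbols, start=1)}  # 0 reserved for padding

def text_to_sequence(tagged_phonemes, sym2id):
    """Map the prosody-tagged phonemes w_1 ... w_U to serialized integer data."""
    return [sym2id[p] for p in tagged_phonemes]

sym2id = build_mixed_dictionary(["sh", "e4", "zh", "ong1"], ["HH", "AH0", "L", "OW1"])
ids = text_to_sequence(["sh", "e4", "#1", "HH", "AH0", "L", "OW1", "#4"], sym2id)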
In one embodiment of the present invention, the prosody control unit is described.
The prosody control unit includes:
the time length control unit is used for predicting the time length of the text coding information and the first speaker coding information output by the CBHG module, outputting the predicted time length and adjusting the predicted time length;
the alignment unit is used for aligning the text coding information which is output by the encoder and does not contain the prosodic tags according to the duration information output by the duration control unit, the length of the text coding information needs to be consistent with the length of a real Mel frequency spectrum in the training mode, the predicted duration of each phoneme is output by the trained duration control unit in the prediction mode, the length of each phoneme is expanded according to the predicted duration, and the text coding information after the duration adjustment is output after the expansion;
the energy control unit is used for reading the text coding information and the first speaker coding information which are output by the alignment unit and have the adjusted time length, generating predicted energy and adjusting the predicted energy;
and the pitch control unit is used for reading the duration-adjusted text coding information output by the alignment unit and the second speaker coding information, generating a predicted pitch and performing pitch adjustment on the predicted pitch.
In one embodiment of the present invention, the alignment unit is described.
The operation steps of the alignment unit are as follows: the skip-coded text coding information t1, t2, …, tU′ without prosodic-tag positions is length-expanded in combination with the duration information output by the duration control unit, the standard of the length expansion being: in the training stage, the length must be consistent with the length of the real Mel spectrum; in the prediction stage, the predicted duration of each phoneme output by the trained duration control unit is used and each phoneme is expanded according to its predicted duration; the duration-adjusted text coding information t1, t2, …, tT is obtained after the expansion, where T is the number of frames of the extracted real Mel spectrum.
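For illustration only (the tensor shapes and function name are assumptions, not the patent's implementation), the expansion performed by the alignment unit can be sketched in Python/PyTorch as:

import torch

def length_regulate(text_encodings, durations):
    """Expand phoneme-level encodings t_1..t_U' to frame level t_1..t_T by repeating
    each phoneme encoding according to its (ground-truth or predicted) duration.
    text_encodings: (U', D) tensor; durations: (U',) tensor of integer frame counts.
    Returns a (T, D) tensor with T = sum(durations)."""
    expanded = [enc.repeat(int(d), 1) for enc, d in zip(text_encodings, durations) if int(d) > 0]
    return torch.cat(expanded, dim=0)

# usage: three phonemes lasting 2, 3 and 1 frames give 6 frames
frames = length_regulate(torch.randn(3, 8), torch.tensor([2, 3, 1]))
assert frames.shape == (6, 8)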
In this embodiment, the encoder is provided with a phoneme Embedding layer, a speaker Embedding layer, a CBHG module, and a skip module;
for phoneme sequence w with rhythm label1,w2,…,wUConverted into phoneme vector sequence x through phoneme Embedding layer1,x2,…,xU
Speaker tag s for inputiI 1,2,3, which is converted into a speaker vector sequence S by a speaker Embedding layeri
Using the converted phoneme vector sequence as the input of the CBHG module to generate text coding information t1,t2,…,tU
Encoding information t from text1,t2,…,tUAnd speaker vector sequence SiPredicting the duration;
encoding text into information t1,t2,…,tUText coding information t without rhythm label position after generating jump coding by jump module1,t2,…,tU′Of which is U'<U, wherein U' is the text length after the prosodic tag is removed.
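The skip module itself can be illustrated with the following sketch (the tag IDs and the function name are assumptions); it simply drops the encoder outputs at prosodic-tag positions, after those tags have already influenced the neighbouring positions through the CBHG context:

import torch

def skip_state(encodings, phoneme_ids, prosody_tag_ids):
    """Drop encoder outputs at prosodic-tag positions: t_1..t_U -> t_1..t_U' (U' < U)."""
    keep = ~torch.isin(phoneme_ids, prosody_tag_ids)  # boolean mask over the U positions
    return encodings[keep]

# usage: positions holding the tag IDs 101 and 102 are removed
out = skip_state(torch.randn(6, 8),
                 torch.tensor([3, 7, 101, 9, 4, 102]),
                 torch.tensor([101, 102]))
assert out.shape == (4, 8)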
In this embodiment, the training of the encoding-decoding unit by the training unit specifically includes:
processing the phoneme sequence with the prosodic tag by a phoneme Embedding layer and a CBHG module in sequence to obtain text coding information, and removing the prosodic tag from the text coding information by a hopping module; respectively processing a first speaker tag and a second speaker tag through a speaker Embedding layer to obtain first speaker coding information and second speaker coding information;
the text coding information and the first speaker coding information are subjected to duration control to obtain the predicted duration with the speaker characteristics, and the predicted duration is multiplied by a duration adjustment factor, wherein the duration adjustment factor is 1;
aligning the text coding information without the prosodic tags according to the predicted duration after the duration adjustment to obtain the text coding information after the duration adjustment;
using the text coding information and the second speaker coding information after the duration adjustment as the input of a pitch control unit to obtain a predicted pitch with speaker characteristics, and multiplying the predicted pitch by a pitch adjustment factor, wherein the pitch adjustment factor is 1;
using the text coding information and the first speaker coding information after the duration adjustment as the input of an energy control unit to obtain predicted energy with speaker characteristics, and multiplying the predicted energy by an energy adjustment factor, wherein the energy adjustment factor is 1;
combining the predicted pitch, the predicted energy, the text coding information subjected to time length adjustment and the first speaker coding information to be used as the input of a decoder to obtain a predicted Mel frequency spectrum;
calculating a duration loss according to the predicted duration and the real duration, calculating a pitch loss according to the predicted pitch and the real pitch, calculating an energy loss according to the predicted energy and the real energy, and calculating a mel-frequency spectrum loss according to the predicted mel-frequency spectrum and the real mel-frequency spectrum; and combining various loss values to carry out end-to-end training on the encoder, the prosody control unit and the decoder.
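A minimal sketch of the combined objective described above (the equal weighting and the choice of L1 for the Mel loss and MSE for the others are assumptions; the patent only states that the loss values are combined):

import torch.nn.functional as F

def total_loss(pred, target):
    """Sum of the four losses used for end-to-end training of the encoder,
    prosody control unit and decoder."""
    mel_loss      = F.l1_loss(pred["mel"], target["mel"])
    duration_loss = F.mse_loss(pred["duration"], target["duration"])
    pitch_loss    = F.mse_loss(pred["pitch"], target["pitch"])
    energy_loss   = F.mse_loss(pred["energy"], target["energy"])
    return mel_loss + duration_loss + pitch_loss + energy_loss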
In this embodiment, the prosodic tags include prosodic words, prosodic phrases, intonation phrases, sentence ends, and character boundaries.
In this embodiment, after the voice synthesis unit reads the text to be synthesized in the text acquisition unit, a prosody tag needs to be added to the text to be synthesized, the addition of the prosody tag is implemented by using a pre-trained prosody phrase boundary prediction model, the text to be synthesized is input into the pre-trained prosody phrase boundary prediction model, and the text to be synthesized with the prosody tag is output.
The speech synthesis system must complete training before use. The training process calculates a duration loss from the predicted and real durations, a pitch loss from the predicted and real pitch, an energy loss from the predicted and real energy, and a Mel spectrum loss from the predicted and real Mel spectra; the mixed speech synthesis model is then trained end-to-end by combining these loss values.
In a specific Chinese-English implementation of the invention, the main functions of the text preprocessing module (front end) are to receive text data, normalize the text, parse XML tags, and map the prosody-labeled Chinese and English phonemes to serialized data using a Chinese-English mixed phoneme dictionary, obtaining a phoneme sequence w1, w2, …, wU, where U is the length of the text. The prosody labeling process is as follows: the prosodic tags comprise prosodic words, prosodic phrases, intonation phrases, sentence ends and character boundaries; the prosodic tags are added with a pre-trained prosodic phrase boundary prediction model, the text to be synthesized is input into the pre-trained model, and the text to be synthesized with prosodic tags is output. The training samples used in the training stage may be data with prosodic tags from an open-source database.
Specifically, the main function of the encoder is to learn, during training, the text features and speaker information of the phoneme sequence of the current sample, so that the phoneme sequence can be converted into a fixed-dimension vector representing the text and speaker features. Compared with traditional parametric speech synthesis algorithms, the encoder plays a role similar to the manual feature extraction step of the parametric method; however, the encoder learns representative feature vectors from data, whereas manual feature extraction and statistical design consume a large amount of manpower and greatly increase labor cost. Moreover, compared with the possibly incomplete feature information obtained by manual feature extraction, the learned feature vectors can capture sufficient feature information provided the data coverage is comprehensive. Compared with traditional decoders, the decoder of the invention has a simple structure, consisting of only a bidirectional LSTM and a linear affine transformation, which greatly improves decoding speed.
Specifically, the time length control unit and the alignment unit are used for carrying out length expansion on the coding information output by the coder, the introduction of the time length control unit simplifies the complexity of the training of the speech synthesis model, and the traditional end-to-end speech synthesis model adopts an attention module to dynamically align texts and audios, which requires a large amount of computing resource consumption and time consumption.
In addition, the introduction of the time length control unit, the pitch control unit and the energy control unit enables prosody to be adjusted in three aspects of time length, pitch and energy, and the specific adjustment mode can be realized by adding an adjustable parameter after the output value of each module and multiplying the output result by a coefficient.
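As a sketch of this adjustment mechanism (the factor values are illustrative), the three predictions are simply scaled before being used downstream; a factor of 1 reproduces the prediction unchanged, as in the training configuration described above:

def apply_prosody_factors(pred_duration, pred_pitch, pred_energy,
                          duration_factor=1.0, pitch_factor=1.0, energy_factor=1.0):
    """Multiply each predicted prosody value by its user-set adjustment factor."""
    return (pred_duration * duration_factor,   # > 1.0 slows the speech down
            pred_pitch * pitch_factor,         # < 1.0 lowers the pitch
            pred_energy * energy_factor)       # > 1.0 makes the speech louder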
In one embodiment of the present invention, the multi-speaker tags refer to a sequence of natural numbers that distinguishes the speakers, wherein the first speaker tag serves as the speech synthesis style and the second speaker tag designates the speaker. In the training mode, the first speaker tag and the second speaker tag are both speaker tags annotated in the mixed training samples; in the prediction mode, the first speaker tag and the second speaker tag are specified by the user through the style switching unit and the speaker switching unit respectively. For example, after training, to have a second speaker synthesize speech in the style of a first speaker, one only needs to set the first speaker tag as the speech synthesis style through the style switching unit and set the second speaker tag as the designated speaker through the speaker switching unit. In this way, a speaker who only has Chinese reading-style recordings can be transferred to speak English in a customer-service style.
The following describes a specific implementation method of the speech synthesis apparatus of the present invention.
Step one, preprocess the input Chinese-English text data sequence with prosody labels as the input of the skip neural network encoder; the skip neural network encoder consists of a phoneme Embedding layer, a CBHG module and a skip module;
step two, for the output of the skip neural network encoder, combine the output of the CBHG module with the output of the speaker Embedding layer and obtain the duration-adjusted text coding information through the duration adjustment;
step three, the text coding information after the duration adjustment and the output of the speaker Embedding layer are used as the input of a pitch control unit and an energy control unit together to obtain the predicted pitch and the predicted energy; combining the predicted pitch, the predicted energy and the text coding information subjected to time length adjustment, and using the combined information and the output of the speaker Embedding layer as the input of a decoder to obtain a predicted Mel frequency spectrum; the vocoder outputs the synthesized speech.
In one embodiment of the present invention (taking chinese and english as an example), the transmission and processing process of the input text in the speech synthesis device is as follows:
1) For the Chinese and the English in the text, convert each into its corresponding pronunciation phonemes and construct a Chinese-English mixed phoneme dictionary; use the Chinese-English mixed phoneme dictionary to map the prosody-labeled Chinese and English phonemes to serialized data, obtaining a phoneme sequence w1, w2, …, wU, where U is the length of the text and wi denotes the phoneme information corresponding to the i-th word in the text.
2) Construct a multi-speaker dictionary for the multiple speakers and derive the speaker tags s1, s2, …, sk, where k is the number of speakers. A speaker tag is converted into a speaker vector sequence Si through the speaker Embedding layer.
3) The serialized text data (the phoneme sequence w1, w2, …, wU) is converted into a phoneme vector sequence x1, x2, …, xU through the phoneme Embedding layer; the prosodic tags comprise prosodic words (#1), prosodic phrases (#2), intonation phrases (#3), sentence ends (#4) and character boundaries (#S).
x1, x2, …, xU = Embedding(w1, w2, …, wU);
where xi denotes the phoneme vector corresponding to the i-th word in the text and Embedding(·) denotes the embedding operation. For example, the text "She holds her shoes in her hands and, with bare feet, deliberately steps into the puddle." becomes, after prosody labeling, "she holds #1 shoe #1 in #1 hand #3 bare #1 foot #2 and steps on #1 puddle #4 deliberately #1."
4) The converted phoneme vector sequence x1, x2, …, xU is input to the CBHG module; together with the speaker vector sequence Si, the predicted duration is generated through the duration control unit, and the skip module then generates the skip-coded text coding information that does not contain prosodic-tag positions. The CBHG module employed in this embodiment comprises a bank of one-dimensional convolutional filters, which effectively models current and context information, followed by a multi-layer highway network that extracts higher-level features; finally, a bidirectional gated recurrent unit (GRU) recurrent neural network (RNN) extracts the contextual features of the sequence.
Expressed by the formula:
t1, t2, …, tU = CBHG(x1, x2, …, xU)
where ti is the coding information of the i-th word in the text;
5) Since prosodic tags are added to the input serialized text data but the tags have no explicit pronunciation duration, skip coding is needed to remove them and generate t1, t2, …, tU′, where U′ < U and U′ is the text length after the prosodic tags are removed.
The text coding information with the prosodic tags removed is:
t1, t2, …, tU′ = Skip_state(t1, t2, …, tU)
6) The skip-coded text coding information t1, t2, …, tU′ without prosodic-tag positions is length-expanded in combination with the duration control unit. The standard of the length expansion is: in the training stage, the length must be consistent with the length of the real Mel spectrum; in the prediction stage, the predicted duration of each phoneme output by the trained duration control unit is used and each phoneme is expanded according to its predicted duration. After expansion, the duration-adjusted text coding information t1, t2, …, tT is obtained, where T is the number of frames of the extracted real Mel spectrum.
The duration control unit and the energy and pitch control units share the same network structure: three one-dimensional convolution layers with regularization layers are used for feature separation; a bidirectional GRU learns the relationship between preceding and following phoneme features; finally, the duration/energy/pitch is predicted through a linear affine transformation.
t1, t2, …, tT = State_Expand(t1, t2, …, tU′)
7) Predict pitch and energy: the duration-adjusted text coding information with prosodic tags removed and the predicted duration information, together with the speaker vector sequence Si, are used as inputs to the pitch control unit and the energy control unit, yielding the predicted pitch and predicted energy used to control the pitch and energy of the generated audio.
8) Combine the predicted pitch and energy with the text coding information t1, t2, …, tT into the combined text encoding features E1, E2, …, ET. From the text coding information t1, t2, …, tT, the energy and pitch control units give, respectively:
e1, e2, …, eT = Energy_Predictor(t1, t2, …, tT; Si)
p1, p2, …, pT = Pitch_Predictor(t1, t2, …, tT; Si)
E1, E2, …, ET = (e1, e2, …, eT + p1, p2, …, pT) * t1, t2, …, tT + Si
where E1, E2, …, ET is the combined text coding information, e1, e2, …, eT is the output of the energy control unit, p1, p2, …, pT is the output of the pitch control unit, t1, t2, …, tT is the duration-adjusted text coding information, and Si denotes the speaker vector sequence (here the second speaker's vector sequence).
9) Decode the text encoding features E1, E2, …, ET to generate the predicted Mel spectrum.
In an embodiment of the present invention, the decoder consists of a bidirectional LSTM and a linear affine transformation, which can be expressed as follows.
Encoding by the BLSTM (a forward and a backward pass over the combined features):
h_t(fwd) = LSTM_fwd(E_t, h_{t-1}(fwd))
h_t(bwd) = LSTM_bwd(E_t, h_{t+1}(bwd))
The two final hidden states of the bidirectional pass are combined to obtain h*:
h* = [h(fwd); h(bwd)]
For the obtained h*, a predicted Mel spectrum is generated through a linear affine transformation:
M1, M2, …, MT = Linear(h*)
finally, the generated Mel frequency spectrum is synthesized into the voice with controllable rhythm by a common vocoder.
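The combination step of 8) and the decoder of 9) can be sketched as follows (hidden sizes are assumptions; the predicted energy and pitch are assumed to be already projected or broadcast to the dimension of the text coding information):

import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the decoder: combine E = (e + p) * t + S_i, then a bidirectional
    LSTM and a linear affine transformation produce the 80-dim predicted Mel spectrum."""

    def __init__(self, enc_dim=256, hidden=256, n_mels=80):
        super().__init__()
        self.blstm = nn.LSTM(enc_dim, hidden, batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * hidden, n_mels)

    def forward(self, text_enc, energy, pitch, speaker_emb):
        # text_enc, energy, pitch: (B, T, D); speaker_emb: (B, D)
        combined = (energy + pitch) * text_enc + speaker_emb.unsqueeze(1)
        h, _ = self.blstm(combined)       # (B, T, 2 * hidden)
        return self.linear(h)             # (B, T, n_mels) predicted Mel spectrum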
In one embodiment of the present invention, as shown in fig. 2, the duration control unit, the pitch control unit and the energy control unit each consist of three one-dimensional convolution layers with regularization layers, a bidirectional gated recurrent unit GRU and a linear affine transformation.
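A Python/PyTorch sketch of this shared structure (the kernel size, dropout and hidden width are assumptions not specified in the patent):

import torch.nn as nn

class ProsodyPredictor(nn.Module):
    """Shared structure of the duration, pitch and energy control units:
    three 1-D convolutions, each followed by layer normalization, then a
    bidirectional GRU and a linear affine transformation giving one value per position."""

    def __init__(self, in_dim=256, hidden=256, kernel_size=3, dropout=0.1):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(in_dim if i == 0 else hidden, hidden,
                      kernel_size, padding=kernel_size // 2)
            for i in range(3)])
        self.norms = nn.ModuleList([nn.LayerNorm(hidden) for _ in range(3)])
        self.act = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.gru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 1)

    def forward(self, x, speaker_emb):
        # x: (B, L, D) text encodings; speaker_emb: (B, D), broadcast over positions
        x = x + speaker_emb.unsqueeze(1)
        for conv, norm in zip(self.convs, self.norms):
            x = self.act(conv(x.transpose(1, 2))).transpose(1, 2)
            x = self.dropout(norm(x))
        x, _ = self.gru(x)
        return self.proj(x).squeeze(-1)    # (B, L) predicted duration, pitch or energy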
Prosodic tags are added to the text to be synthesized using a pre-trained prosodic phrase boundary prediction model: the text to be synthesized is input into the pre-trained model and the text to be synthesized with prosodic tags is output. The pre-trained prosodic phrase boundary prediction model uses a decision tree or a BLSTM-CRF to predict phrase boundaries and insert the prosodic tags.
During synthesis, the speaker and the style of any speaker can each be designated as model inputs.
Compared with the traditional approach of building several separate models, the invention builds directly from text to acoustic features and trains end to end: a duration loss is calculated from the predicted and real durations, a pitch loss from the predicted and real pitch, an energy loss from the predicted and real energy, and a Mel spectrum loss from the predicted and real Mel spectra; the mixed speech synthesis model is trained end-to-end by combining these loss values. This avoids a single model's prediction error degrading the whole system and therefore improves the fault tolerance of the model.
The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.
Examples
The invention was tested on a multi-speaker text dataset containing 32,500 utterances (30,000 Chinese, 2,000 English and 500 Chinese-English mixed) with corresponding prosody labels. The dataset is preprocessed as follows:
1) extracting Chinese and English phoneme files and corresponding audio, and extracting pronunciation duration of the phoneme by using an open source tool Montreal-forced-aligner.
2) A Mel spectrum is extracted for each audio, with a window size of 50 milliseconds, a frame shift of 12.5 milliseconds and 80 Mel dimensions.
3) For each audio, the pitch of the audio is extracted using the World vocoder.
4) The energy of the mel spectrum is obtained by summing the mel spectrum extracted from the audio in dimension.
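For reference only, the Mel spectrum and energy extraction described in 2) and 4) could be implemented as follows (the sampling rate, the use of librosa and the log compression are assumptions; the patent does not name a specific toolkit for these two steps):

import numpy as np
import librosa

def extract_mel_and_energy(wav_path, sr=22050):
    """80-dim Mel spectrum with a 50 ms window and 12.5 ms frame shift;
    energy is the per-frame sum of the Mel spectrum over the Mel dimension."""
    y, _ = librosa.load(wav_path, sr=sr)
    win = int(0.050 * sr)                  # 50 ms window
    hop = int(0.0125 * sr)                 # 12.5 ms frame shift
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=win,
                                         win_length=win, hop_length=hop, n_mels=80)
    energy = mel.sum(axis=0)               # frame-level energy, as in step 4)
    log_mel = np.log(np.clip(mel, 1e-5, None))
    return log_mel, energy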
The mixed voice synthesis system realizes the controllable operation of four dimensions of text rhythm, energy, duration and pitch in the voice synthesis process, and realizes the support of multiple languages; the support of multiple speakers is realized; the support of speaker style migration is realized, and the wide application of a voice synthesis system in an industrial scene is facilitated.
Various technical features of the above embodiments can be combined arbitrarily, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described in detail, however, as long as there is no contradiction between the combinations of the technical features, the scope of the present description should be considered as being described in the present specification.

Claims (10)

1. A speech synthesis apparatus supporting multi-speaker style, language switching and prosody control, comprising:
the text acquisition unit is used for acquiring different text data according to the mode of the voice synthesis device, acquiring a mixed training text with a rhythm label and corresponding standard voice audio in the training mode, and marking a speaker label of each standard voice audio; acquiring a text to be synthesized in a prediction mode;
the text preprocessing unit is used for converting the text into a phoneme sequence with a prosodic tag, and outputting a real Mel frequency spectrum, a real energy, a real pitch, a real duration and a corresponding speaker tag according to a standard voice audio corresponding to the text during a training mode;
the language switching unit is used for storing and displaying speaker labels corresponding to training data of different language types and automatically identifying the language type of the text to be synthesized;
the style switching unit is used for reading the language type of the text displayed by the language switching unit and setting a first speaker label as a voice synthesis style according to the language type;
a speaker switching unit for setting a second speaker tag as a designated speaker;
in the training mode, the first speaker label and the second speaker label are both speaker labels marked in the mixed training sample; in the prediction mode, the first speaker tag and the second speaker tag are respectively specified by a user through a style switching unit and a speaker switching unit;
an encoding-decoding unit including an encoder for encoding a phoneme sequence with a prosody tag, a first speaker tag, and a second speaker tag; the prosody control unit is used for predicting and adjusting the duration, pitch and energy of voice synthesis; the decoder is used for combining the first speaker coding information, the second speaker coding information, the pitch and the energy which are regulated by the rhythm control unit, and decoding the combined coding information to obtain a predicted Mel frequency spectrum;
the training unit is used for training the coding-decoding unit and saving the coding-decoding unit as a model file after the training is finished;
and the voice synthesis unit is used for loading the model file generated by the training unit, reading the text to be synthesized in the text acquisition unit, the first speaker label set by the style switching unit and the second speaker label set by the speaker switching unit as the input of the model, generating a predicted Mel frequency spectrum, and converting the predicted Mel frequency spectrum into a voice signal for voice playing.
2. The apparatus according to claim 1, wherein the text preprocessing unit converts the text into a phoneme sequence with prosodic tags, and specifically comprises:
respectively converting the different language types in the text into corresponding pronunciation phonemes to construct a mixed phoneme dictionary; the phonemes with prosodic labels are mapped to serialized data using the mixed phoneme dictionary, obtaining a phoneme sequence w1, w2, …, wU, where U is the length of the text.
3. The apparatus of claim 1, wherein the prosody control unit comprises:
the time length control unit is used for predicting the time length of the text coding information and the first speaker coding information output by the CBHG module, outputting the predicted time length and adjusting the predicted time length;
the alignment unit is used for aligning the text coding information which is output by the encoder and does not contain the prosodic tags according to the duration information output by the duration control unit, the length of the text coding information needs to be consistent with the length of a real Mel frequency spectrum in the training mode, the predicted duration of each phoneme is output by the trained duration control unit in the prediction mode, the length of each phoneme is expanded according to the predicted duration, and the text coding information after the duration adjustment is output after the expansion;
the energy control unit is used for reading the text coding information and the first speaker coding information which are output by the alignment unit and have the adjusted time length, generating predicted energy and adjusting the predicted energy;
and the pitch control unit is used for reading the duration-adjusted text coding information output by the alignment unit and the second speaker coding information, generating a predicted pitch and performing pitch adjustment on the predicted pitch.
4. The apparatus according to claim 3, wherein the alignment unit operates as follows: the skip-coded text coding information t1, t2, …, tU′ without prosodic-tag positions is length-expanded in combination with the duration information output by the duration control unit, the standard of the length expansion being: in the training stage, the length must be consistent with the length of the real Mel spectrum; in the prediction stage, the predicted duration of each phoneme output by the trained duration control unit is used and each phoneme is expanded according to its predicted duration; the duration-adjusted text coding information t1, t2, …, tT is obtained after the expansion, where T is the number of frames of the extracted real Mel spectrum.
5. The speech synthesis device of claim 3, wherein the encoder comprises a phoneme Embedding layer, a speaker Embedding layer, a CBHG module and a skip module;
the phoneme sequence with prosodic tags w1, w2, …, wU is converted by the phoneme Embedding layer into a phoneme vector sequence x1, x2, …, xU;
the input speaker tag si (i = 1, 2, 3, …) is converted by the speaker Embedding layer into a speaker vector sequence Si;
the converted phoneme vector sequence is used as the input of the CBHG module to generate the text coding information t1, t2, …, tU;
the duration is predicted from the text coding information t1, t2, …, tU and the speaker vector sequence Si;
the skip module generates, from the text coding information t1, t2, …, tU, the skip-coded text coding information t1, t2, …, tU′ that does not contain prosodic-tag positions, where U′ < U and U′ is the text length after the prosodic tags are removed.
6. The apparatus according to claim 3 or 5, wherein the encoding-decoding unit is trained by a training unit, specifically:
processing the phoneme sequence with the prosodic tag by a phoneme Embedding layer and a CBHG module in sequence to obtain text coding information, and removing the prosodic tag from the text coding information by a hopping module; respectively processing a first speaker tag and a second speaker tag through a speaker Embedding layer to obtain first speaker coding information and second speaker coding information;
the text coding information and the first speaker coding information are subjected to duration control to obtain the predicted duration with the speaker characteristics, and the predicted duration is multiplied by a duration adjustment factor, wherein the duration adjustment factor is 1;
aligning the text coding information without the prosodic tags according to the predicted duration after the duration adjustment to obtain the text coding information after the duration adjustment;
using the text coding information and the second speaker coding information after the duration adjustment as the input of a pitch control unit to obtain a predicted pitch with speaker characteristics, and multiplying the predicted pitch by a pitch adjustment factor, wherein the pitch adjustment factor is 1;
using the text coding information and the first speaker coding information after the duration adjustment as the input of an energy control unit to obtain predicted energy with speaker characteristics, and multiplying the predicted energy by an energy adjustment factor, wherein the energy adjustment factor is 1;
combining the predicted pitch, the predicted energy, the text coding information subjected to time length adjustment and the first speaker coding information to be used as the input of a decoder to obtain a predicted Mel frequency spectrum;
calculating a duration loss according to the predicted duration and the real duration, calculating a pitch loss according to the predicted pitch and the real pitch, calculating an energy loss according to the predicted energy and the real energy, and calculating a mel-frequency spectrum loss according to the predicted mel-frequency spectrum and the real mel-frequency spectrum; and combining various loss values to carry out end-to-end training on the encoder, the prosody control unit and the decoder.
7. The device of claim 1, wherein the prosodic tags comprise prosodic words, prosodic phrases, intonation phrases, sentence ends, and character boundaries.
8. The device of claim 1, wherein after the voice synthesis unit reads the text to be synthesized in the text acquisition unit, a prosodic tag needs to be added to the text to be synthesized, the prosodic tag is added by using a pre-trained prosodic phrase boundary prediction model, the text to be synthesized is input into the pre-trained prosodic phrase boundary prediction model, and the text to be synthesized with the prosodic tag is output.
9. The apparatus of claim 1, wherein the decoder is configured to combine the first speaker encoding information, the second speaker encoding information, and the pitch and energy adjusted by the prosody control unit, and the combination formula is:
E1, E2, …, ET = (e1, e2, …, eT + p1, p2, …, pT) * t1, t2, …, tT + Si
where E1, E2, …, ET is the combined text coding information, e1, e2, …, eT is the output of the energy control unit, p1, p2, …, pT is the output of the pitch control unit, t1, t2, …, tT is the duration-adjusted text coding information, and Si denotes the speaker vector sequence (here the second speaker's vector sequence).
10. The device for speech synthesis supporting multiple speaker styles, language switching and prosody control according to claim 1, wherein the decoder comprises a bidirectional LSTM and a linear affine transformation; the duration control unit, the pitch control unit and the energy control unit each consist of three one-dimensional convolution layers with regularization layers, a bidirectional gated recurrent unit GRU and a linear affine transformation.
CN202110008049.1A 2021-01-05 2021-01-05 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm Active CN112863483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110008049.1A CN112863483B (en) 2021-01-05 2021-01-05 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110008049.1A CN112863483B (en) 2021-01-05 2021-01-05 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm

Publications (2)

Publication Number Publication Date
CN112863483A true CN112863483A (en) 2021-05-28
CN112863483B CN112863483B (en) 2022-11-08

Family

ID=76003829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110008049.1A Active CN112863483B (en) 2021-01-05 2021-01-05 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm

Country Status (1)

Country Link
CN (1) CN112863483B (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223494A (en) * 2021-05-31 2021-08-06 平安科技(深圳)有限公司 Prediction method, device, equipment and storage medium of Mel frequency spectrum
CN113327574A (en) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113327575A (en) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113345412A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113345407A (en) * 2021-06-03 2021-09-03 广州虎牙信息科技有限公司 Style speech synthesis method and device, electronic equipment and storage medium
CN113380222A (en) * 2021-06-09 2021-09-10 广州虎牙科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113393829A (en) * 2021-06-16 2021-09-14 哈尔滨工业大学(深圳) Chinese speech synthesis method integrating rhythm and personal information
CN113409764A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis
CN113421571A (en) * 2021-06-22 2021-09-21 云知声智能科技股份有限公司 Voice conversion method and device, electronic equipment and storage medium
CN113488021A (en) * 2021-08-09 2021-10-08 杭州小影创新科技股份有限公司 Method for improving naturalness of speech synthesis
CN113658577A (en) * 2021-08-16 2021-11-16 腾讯音乐娱乐科技(深圳)有限公司 Speech synthesis model training method, audio generation method, device and medium
CN113707125A (en) * 2021-08-30 2021-11-26 中国科学院声学研究所 Training method and device for multi-language voice synthesis model
CN113763924A (en) * 2021-11-08 2021-12-07 北京优幕科技有限责任公司 Acoustic deep learning model training method, and voice generation method and device
CN114708876A (en) * 2022-05-11 2022-07-05 北京百度网讯科技有限公司 Audio processing method and device, electronic equipment and storage medium
US11580955B1 (en) * 2021-03-31 2023-02-14 Amazon Technologies, Inc. Synthetic speech processing
WO2023051155A1 (en) * 2021-09-30 2023-04-06 华为技术有限公司 Voice processing and training methods and electronic device
CN116895273A (en) * 2023-09-11 2023-10-17 南京硅基智能科技有限公司 Output method and device for synthesized audio, storage medium and electronic device
CN117476027A (en) * 2023-12-28 2024-01-30 南京硅基智能科技有限公司 Voice conversion method and device, storage medium and electronic device
CN117496944A (en) * 2024-01-03 2024-02-02 广东技术师范大学 Multi-emotion multi-speaker voice synthesis method and system
CN117711374A (en) * 2024-02-01 2024-03-15 广东省连听科技有限公司 Audio-visual consistent personalized voice synthesis system, synthesis method and training method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002108383A (en) * 2000-09-29 2002-04-10 Pioneer Electronic Corp Speech recognition system
JP2010128103A (en) * 2008-11-26 2010-06-10 Nippon Telegr & Teleph Corp <Ntt> Speech synthesizer, speech synthesis method and speech synthesis program
US20200082806A1 (en) * 2018-01-11 2020-03-12 Neosapience, Inc. Multilingual text-to-speech synthesis
TW202016921A (en) * 2018-10-24 2020-05-01 中華電信股份有限公司 Method for speech synthesis and system thereof
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder
CN110033755A (en) * 2019-04-23 2019-07-19 平安科技(深圳)有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
US20200402497A1 (en) * 2019-06-24 2020-12-24 Replicant Solutions, Inc. Systems and Methods for Speech Generation
CN111292720A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111862934A (en) * 2020-07-24 2020-10-30 苏州思必驰信息科技有限公司 Method for improving speech synthesis model and speech synthesis method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Xiaofeng, Xie Jun, Luo Jianxin, Yu Lu: "Research on Deep Learning Speech Synthesis Technology", Computer Era *
Zhang Yaxin, Zhang Lianhai: "A Voice Cloning Method Based on x-vector Speaker Features", Journal of Information Engineering University *

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11580955B1 (en) * 2021-03-31 2023-02-14 Amazon Technologies, Inc. Synthetic speech processing
CN113327575B (en) * 2021-05-31 2024-03-01 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113327574A (en) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113327575A (en) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113345412A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113223494B (en) * 2021-05-31 2024-01-30 平安科技(深圳)有限公司 Method, device, equipment and storage medium for predicting mel frequency spectrum
CN113223494A (en) * 2021-05-31 2021-08-06 平安科技(深圳)有限公司 Prediction method, device, equipment and storage medium of Mel frequency spectrum
CN113327574B (en) * 2021-05-31 2024-03-01 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113345407A (en) * 2021-06-03 2021-09-03 广州虎牙信息科技有限公司 Style speech synthesis method and device, electronic equipment and storage medium
CN113345407B (en) * 2021-06-03 2023-05-26 广州虎牙信息科技有限公司 Style speech synthesis method and device, electronic equipment and storage medium
CN113380222A (en) * 2021-06-09 2021-09-10 广州虎牙科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113409764B (en) * 2021-06-11 2024-04-26 北京搜狗科技发展有限公司 Speech synthesis method and device for speech synthesis
CN113409764A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis
CN113393829A (en) * 2021-06-16 2021-09-14 哈尔滨工业大学(深圳) Chinese speech synthesis method integrating rhythm and personal information
CN113393829B (en) * 2021-06-16 2023-08-29 哈尔滨工业大学(深圳) Chinese speech synthesis method integrating rhythm and personal information
CN113421571A (en) * 2021-06-22 2021-09-21 云知声智能科技股份有限公司 Voice conversion method and device, electronic equipment and storage medium
CN113488021A (en) * 2021-08-09 2021-10-08 杭州小影创新科技股份有限公司 Method for improving naturalness of speech synthesis
CN113658577A (en) * 2021-08-16 2021-11-16 腾讯音乐娱乐科技(深圳)有限公司 Speech synthesis model training method, audio generation method, device and medium
CN113707125A (en) * 2021-08-30 2021-11-26 中国科学院声学研究所 Training method and device for multi-language voice synthesis model
CN113707125B (en) * 2021-08-30 2024-02-27 中国科学院声学研究所 Training method and device for multi-language speech synthesis model
WO2023051155A1 (en) * 2021-09-30 2023-04-06 华为技术有限公司 Voice processing and training methods and electronic device
CN113763924B (en) * 2021-11-08 2022-02-15 北京优幕科技有限责任公司 Acoustic deep learning model training method, and voice generation method and device
CN113763924A (en) * 2021-11-08 2021-12-07 北京优幕科技有限责任公司 Acoustic deep learning model training method, and voice generation method and device
CN114708876B (en) * 2022-05-11 2023-10-03 北京百度网讯科技有限公司 Audio processing method, device, electronic equipment and storage medium
CN114708876A (en) * 2022-05-11 2022-07-05 北京百度网讯科技有限公司 Audio processing method and device, electronic equipment and storage medium
CN116895273B (en) * 2023-09-11 2023-12-26 南京硅基智能科技有限公司 Output method and device for synthesized audio, storage medium and electronic device
CN116895273A (en) * 2023-09-11 2023-10-17 南京硅基智能科技有限公司 Output method and device for synthesized audio, storage medium and electronic device
CN117476027A (en) * 2023-12-28 2024-01-30 南京硅基智能科技有限公司 Voice conversion method and device, storage medium and electronic device
CN117476027B (en) * 2023-12-28 2024-04-23 南京硅基智能科技有限公司 Voice conversion method and device, storage medium and electronic device
CN117496944A (en) * 2024-01-03 2024-02-02 广东技术师范大学 Multi-emotion multi-speaker voice synthesis method and system
CN117496944B (en) * 2024-01-03 2024-03-22 广东技术师范大学 Multi-emotion multi-speaker voice synthesis method and system
CN117711374A (en) * 2024-02-01 2024-03-15 广东省连听科技有限公司 Audio-visual consistent personalized voice synthesis system, synthesis method and training method

Also Published As

Publication number Publication date
CN112863483B (en) 2022-11-08

Similar Documents

Publication Publication Date Title
CN112863483B (en) Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN112802450B (en) Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof
CN112802448B (en) Speech synthesis method and system for generating new tone
CN108899009B (en) Chinese speech synthesis system based on phoneme
Mu et al. Review of end-to-end speech synthesis technology based on deep learning
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
CN111681641B (en) Phrase-based end-to-end text-to-speech (TTS) synthesis
CN113327574A (en) Speech synthesis method, device, computer equipment and storage medium
KR20200088263A (en) Method and system of text to multiple speech
Black et al. The Festival speech synthesis system, version 1.4.2
CN113470622B (en) Conversion method and device capable of converting any voice into multiple voices
Lorenzo-Trueba et al. Simple4All proposals for the Albayzin evaluations in speech synthesis
CN116092475B (en) Stuttering voice editing method and system based on context-aware diffusion model
CN113539268A (en) End-to-end voice-to-text rare word optimization method
Kang et al. Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion
CN115620699A (en) Speech synthesis method, speech synthesis system, speech synthesis apparatus, and storage medium
CN115359775A (en) End-to-end tone and emotion migration Chinese voice cloning method
Xia et al. HMM-based unit selection speech synthesis using log likelihood ratios derived from perceptual data
Krug et al. Articulatory synthesis for data augmentation in phoneme recognition
CN113628609A (en) Automatic audio content generation
CN116403562B (en) Speech synthesis method and system based on semantic information automatic prediction pause
CN113763924B (en) Acoustic deep learning model training method, and voice generation method and device
Zhang et al. Chinese speech synthesis system based on end to end
CN115424604B (en) Training method of voice synthesis model based on countermeasure generation network
CN113362803B (en) ARM side offline speech synthesis method, ARM side offline speech synthesis device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant