CN112786006B - Speech synthesis method, synthesis model training method, device, medium and equipment

Speech synthesis method, synthesis model training method, device, medium and equipment

Info

Publication number
CN112786006B
Authority
CN
China
Prior art keywords
flow
text
taking
submodel
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110042176.3A
Other languages
Chinese (zh)
Other versions
CN112786006A (en)
Inventor
殷翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110042176.3A priority Critical patent/CN112786006B/en
Publication of CN112786006A publication Critical patent/CN112786006A/en
Priority to PCT/CN2021/139988 priority patent/WO2022151931A1/en
Application granted granted Critical
Publication of CN112786006B publication Critical patent/CN112786006B/en


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a speech synthesis method, a synthesis model training method, a device, a medium and equipment. The method comprises the following steps: acquiring voice characteristic information corresponding to a text to be synthesized; inputting the voice characteristic information into a voice synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the voice synthesis model comprises an acoustic submodel and a vocoder, and the voice synthesis model is obtained by directly performing joint training on the acoustic submodel and the vocoder; and performing μ-law expansion on the predicted waveform point information to obtain audio information. In this way, the efficiency of voice synthesis can be improved, the error accumulation caused by training the acoustic submodel and the vocoder separately in the related art can be effectively reduced, and the accuracy of voice synthesis can be improved. In addition, the problem that the generated audio information cannot adapt to special pronunciation requirements because the acoustic features lack universality can be avoided, which improves the voice synthesis effect. Moreover, the training period of the model is short, and the rhythm fidelity of the model is better.

Description

Speech synthesis method, synthesis model training method, device, medium and equipment
Technical Field
The present disclosure relates to the field of speech synthesis technology, and in particular, to a speech synthesis method, a synthesis model training method, a device, a medium, and equipment.
Background
At present, when performing speech synthesis, acoustic features (such as the mel spectrum, linear spectrum, fundamental frequency, etc.) corresponding to a text to be synthesized are generally extracted by an acoustic submodel, and a vocoder is then used to generate audio information corresponding to the text to be synthesized from the acoustic features. However, when the acoustic submodel and the vocoder cooperate in this way, speech synthesis is slow and error accumulation easily occurs, which affects the accuracy of speech synthesis. In addition, the acoustic features extracted by the acoustic submodel may not be universal, so that audio information generated from them cannot adapt to special pronunciation requirements, such as high-pitched female voices or deep male voices.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a method of speech synthesis, comprising:
acquiring voice characteristic information corresponding to a text to be synthesized;
Inputting the voice characteristic information into a voice synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the voice synthesis model comprises an acoustic submodel and a vocoder, and the voice synthesis model is obtained by directly carrying out joint training on the acoustic submodel and the vocoder;
and performing μ-law expansion on the predicted waveform point information to obtain the audio information corresponding to the text to be synthesized.
Optionally, the acoustic submodel comprises an encoding network, a duration submodel, a Gaussian sampling module, a linear processing module and a flow-based reversible generation model Glow;
The coding network is used for generating a representation sequence corresponding to the text to be synthesized according to the voice characteristic information, wherein the representation sequence is formed by arranging codes of each phoneme in the text to be synthesized according to the sequence of the corresponding phoneme in the text to be synthesized;
the duration submodel is used for obtaining duration feature information corresponding to the text to be synthesized according to the voice feature information, wherein the duration feature information comprises the number of voice frames corresponding to each phoneme in the text to be synthesized;
The Gaussian sampling module is used for generating fixed-length semantic representation corresponding to the text to be synthesized according to the representation sequence and the duration characteristic information;
The linear processing module is used for carrying out linear transformation on the semantic representation to obtain first Mel spectrum information corresponding to the text to be synthesized;
the flow-based reversible generation model Glow is used for generating second mel spectrum information according to standard normal distribution;
The vocoder is used for generating predicted waveform point information corresponding to the text to be synthesized according to the first Mel spectrum information and the second Mel spectrum information.
Optionally, the vocoder is a Flow-based generation model Flow;
the speech synthesis model is trained by:
Acquiring marked voice characteristic information, marked waveform point information and marked mel spectrum information corresponding to a text training sample;
The method comprises the steps of respectively taking the marked voice characteristic information as the input of the coding network and the duration submodel, taking the output of the coding network and the output of the duration submodel as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the Flow-based generation model Flow, taking the marked waveform point information as the target output of the Flow-based generation model Flow, taking the marked Mel spectrum information as the input of the Flow-based reversible generation model Glow, and taking the standard normal distribution as the target output of the Flow-based reversible generation model Glow to directly perform joint training on the acoustic submodel and the vocoder so as to obtain the voice synthesis model.
Optionally, the voice characteristic information comprises phonemes, tones, word segmentation and prosody boundaries;
The obtaining the voice characteristic information corresponding to the text to be synthesized includes:
And inputting the text to be synthesized into an information extraction model to obtain voice characteristic information corresponding to the text to be synthesized.
Optionally, the method further comprises:
And synthesizing the audio information with background music.
In a second aspect, the present disclosure provides a method for training a speech synthesis model, where the speech synthesis model includes an acoustic sub-model and a vocoder, the acoustic sub-model includes a coding network, a duration sub-model, a gaussian sampling module, a linear processing module, and a Flow-based reversible generation model Glow, and the vocoder is a Flow-based generation model Flow;
The method comprises the following steps:
Acquiring marked voice characteristic information, marked waveform point information and marked mel spectrum information corresponding to a text training sample;
The method comprises the steps of respectively taking the marked voice characteristic information as the input of the coding network and the duration submodel, taking the output of the coding network and the output of the duration submodel as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the Flow-based generation model Flow, taking the marked waveform point information as the target output of the Flow-based generation model Flow, taking the marked Mel spectrum information as the input of the Flow-based reversible generation model Glow, and taking the standard normal distribution as the target output of the Flow-based reversible generation model Glow to directly perform joint training on the acoustic submodel and the vocoder so as to obtain the voice synthesis model.
In a third aspect, the present disclosure provides a speech synthesis apparatus comprising:
The first acquisition module is used for acquiring voice characteristic information corresponding to the text to be synthesized;
The voice synthesis module is used for inputting the voice characteristic information acquired by the first acquisition module into a voice synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the voice synthesis model comprises an acoustic submodel and a vocoder, and the voice synthesis model is obtained by directly carrying out joint training on the acoustic submodel and the vocoder;
And the expansion module is used for performing μ-law expansion on the predicted waveform point information obtained by the voice synthesis module to obtain the audio information corresponding to the text to be synthesized.
In a fourth aspect, the present disclosure provides a speech synthesis model training apparatus, where the speech synthesis model includes an acoustic sub-model and a vocoder, the acoustic sub-model includes a coding network, a duration sub-model, a gaussian sampling module, a linear processing module, and a Flow-based reversible generation model Glow, and the vocoder is a Flow-based generation model Flow;
The device comprises:
The second acquisition module is used for acquiring marked voice characteristic information, marked waveform point information and marked Mel spectrum information corresponding to the text training sample;
The training module is used for directly carrying out joint training on the acoustic submodel and the vocoder in a mode of taking the marked voice characteristic information as the input of the coding network and the duration submodel respectively, taking the output of the coding network and the output of the duration submodel as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the Flow-based generation model Flow, taking the marked waveform point information as the target output of the Flow-based generation model Flow, taking the marked Mel spectrum information as the input of the Flow-based reversible generation model Glow and taking the standard normal distribution as the target output of the Flow-based reversible generation model Glow so as to obtain the voice synthesis model.
In a fifth aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which when executed by a processing device implements the steps of the method provided by the first or second aspects of the present disclosure.
In a sixth aspect, the present disclosure provides an electronic device, comprising:
A storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method provided by the first aspect of the present disclosure.
In a seventh aspect, the present disclosure provides an electronic device, comprising:
A storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method provided by the second aspect of the present disclosure.
According to the technical scheme, the predicted waveform point information can be directly obtained according to the voice characteristic information corresponding to the text to be synthesized through the voice synthesis model, then, the audio information corresponding to the text to be synthesized can be obtained by performing simple mu-law expansion on the predicted waveform point information, and cooperation between the acoustic submodel and the vocoder is not needed, so that the efficiency of voice synthesis is improved, error accumulation generated by training the acoustic submodel and the vocoder respectively in the related technology can be effectively reduced, and the accuracy of voice synthesis is improved. In addition, because the predicted waveform point information can be directly generated according to the voice characteristic information corresponding to the text to be synthesized without the acoustic characteristic, the problem that the generated audio information cannot adapt to special pronunciation requirements due to the fact that the acoustic characteristic does not have universality can be avoided, and therefore the effect of voice synthesis is improved. In addition, the voice synthesis model can be obtained by directly carrying out combined training on the acoustic submodel and the vocoder, so that the model training period can be shortened, and the rhythm fidelity of the voice synthesis model is better; and moreover, the matching degree of the acoustic submodel and the vocoder in the voice synthesis model can be ensured, so that the problem that the accuracy of the obtained voice synthesis result is low even though the accuracy of the acoustic submodel and the vocoder is high is solved, and the accuracy of voice synthesis is further improved.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:
fig. 1 is a flow chart illustrating a method of speech synthesis according to an exemplary embodiment.
Fig. 2 is a schematic diagram illustrating a speech synthesis process according to an example embodiment.
FIG. 3 is a flowchart illustrating a method of training a speech synthesis model, according to an example embodiment.
FIG. 4 is a schematic diagram illustrating a speech synthesis model training process, according to an example embodiment.
Fig. 5 is a flowchart illustrating a method of speech synthesis according to another exemplary embodiment.
Fig. 6 is a block diagram illustrating a speech synthesis apparatus according to an exemplary embodiment.
FIG. 7 is a block diagram illustrating a speech synthesis model training apparatus, according to an example embodiment.
Fig. 8 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but are provided to provide a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that modifiers such as "a", "an" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Fig. 1 is a flow chart illustrating a method of speech synthesis according to an exemplary embodiment. Wherein, as shown in fig. 1, the method includes S101 to S103.
In S101, voice feature information corresponding to a text to be synthesized is acquired.
In the present disclosure, the text to be synthesized may be in a tonal language such as Chinese, Tibetan, Uyghur, Dai or Qiang. The speech feature information may be used to characterize relevant information such as phonemes, intonation and pauses of the text to be synthesized. In addition, the text to be synthesized may be various types of text, such as novels, lyrics and the like.
In S102, the speech feature information is input into the speech synthesis model, and predicted waveform point information corresponding to the text to be synthesized is obtained.
In the present disclosure, a speech synthesis model includes an acoustic submodel and a vocoder, wherein the speech synthesis model is obtained by directly performing joint training on the acoustic submodel and the vocoder.
In S103, μ-law expansion is performed on the predicted waveform point information to obtain the audio information corresponding to the text to be synthesized.
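For reference, μ-law companding follows well-known closed-form expressions; the sketch below is a minimal illustration assuming samples are represented as continuous values in [-1, 1] (the patent does not specify the value of μ or the quantization scheme):

```python
import numpy as np

MU = 255  # common choice for 8-bit mu-law; an assumption, not stated in the patent

def mu_law_compress(x, mu=MU):
    # maps audio samples in [-1, 1] to compressed waveform point values in [-1, 1]
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=MU):
    # inverse companding: recovers audio samples from (predicted) waveform point values
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu
```

Expansion is the exact inverse of compression, which is consistent with the training procedure described later, where labeled waveform point information is produced by μ-law compression of the reference audio.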
According to the technical scheme, the predicted waveform point information can be directly obtained according to the voice characteristic information corresponding to the text to be synthesized through the voice synthesis model, then, the audio information corresponding to the text to be synthesized can be obtained by performing simple mu-law expansion on the predicted waveform point information, and cooperation between the acoustic submodel and the vocoder is not needed, so that the efficiency of voice synthesis is improved, error accumulation generated by training the acoustic submodel and the vocoder respectively in the related technology can be effectively reduced, and the accuracy of voice synthesis is improved. In addition, because the predicted waveform point information can be directly generated according to the voice characteristic information corresponding to the text to be synthesized without the acoustic characteristic, the problem that the generated audio information cannot adapt to special pronunciation requirements due to the fact that the acoustic characteristic does not have universality can be avoided, and therefore the effect of voice synthesis is improved. In addition, the voice synthesis model can be obtained by directly carrying out combined training on the acoustic submodel and the vocoder, so that the model training period can be shortened, and the rhythm fidelity of the voice synthesis model is better; and moreover, the matching degree of the acoustic submodel and the vocoder in the voice synthesis model can be ensured, so that the problem that the accuracy of the obtained voice synthesis result is low even though the accuracy of the acoustic submodel and the vocoder is high is solved, and the accuracy of voice synthesis is further improved.
The following describes in detail the specific embodiment for acquiring the voice feature information corresponding to the text to be synthesized in S101.
In the present disclosure, the speech feature information may include phonemes, tones, word segmentation and prosody boundaries. A phoneme is the smallest phonetic unit divided according to the natural attributes of speech; it is analyzed in terms of the articulatory actions within a syllable, with one action forming one phoneme. Phonemes fall into two major classes, vowels and consonants. For Chinese, for example, the phonemes include initials (the consonants that precede a final and together with it form a complete syllable) and finals (i.e., vowels). Tone refers to the rise and fall of pitch. Illustratively, Chinese has four tones: the first (high level) tone, the second (rising) tone, the third (falling-rising) tone and the fourth (falling) tone. Prosodic boundaries indicate where pauses should be made when reading the text. Illustratively, prosody boundaries are divided into four pause levels "#1", "#2", "#3" and "#4", whose degree of pause increases in sequence.
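Purely as an illustration of this labeling convention (the example sentence and its markup below are an assumption, not taken from the patent), an annotated input could look like the following:

```python
# Illustrative assumption only: a short Chinese sentence annotated with the pause
# levels "#1".."#4" described above, plus its initial/final phonemes and tone numbers.
annotated_text = "今天#1天气#2很好#4"   # "The weather is nice today."
phonemes = ["j", "in", "t", "ian", "t", "ian", "q", "i", "h", "en", "h", "ao"]
tones = [1, 1, 1, 4, 3, 3]              # one tone per syllable
```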
Specifically, the above-mentioned voice feature information may be obtained in a plurality of ways, and in one embodiment, the voice feature information corresponding to the text to be synthesized may be labeled in advance by the user and stored in the corresponding storage module, so that the voice feature information corresponding to the text to be synthesized may be obtained by accessing the storage module.
In another embodiment, the text to be synthesized can be input into an information extraction model to obtain the speech feature information corresponding to the text to be synthesized, which is convenient and fast, requires no manual involvement, and saves labor.
In the present disclosure, the information extraction model may include a text regularization (Text Normalization, TN) model, a grapheme-to-phoneme (Grapheme-to-Phoneme, G2P) model, a word segmentation model, and a prosody model. The method comprises the steps of converting numbers, symbols, abbreviations and the like in a text to be synthesized into language characters through a TN model, obtaining phonemes in the text to be synthesized through a G2P model, segmenting the text to be synthesized through a word segmentation model, and obtaining prosodic boundaries and tones of the text to be synthesized through a prosodic model.
Illustratively, the G2P model may employ a recurrent neural network (Recurrent Neural Network, RNN) and a Long Short-Term Memory (LSTM) to effect the conversion from graphemes to phonemes.
The word segmentation model may be an n-gram model, a hidden Markov model, a naive Bayesian classification model, etc.
The prosody model may be, for example, the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers), a bidirectional LSTM-CRF (Conditional Random Field) model, or the like.
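As a non-authoritative sketch of how these four models could be chained (all class and method names below are hypothetical placeholders; the patent names only the model types):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SpeechFeatures:
    phonemes: List[str]
    tones: List[int]
    words: List[str]
    prosody_boundaries: List[str]   # pause levels such as "#1".."#4"

def extract_speech_features(text, tn_model, g2p_model, seg_model, prosody_model):
    """Hypothetical composition of the TN, G2P, word segmentation and prosody models."""
    normalized = tn_model.normalize(text)          # digits/symbols/abbreviations -> characters
    phonemes = g2p_model.to_phonemes(normalized)   # graphemes -> phonemes
    words = seg_model.segment(normalized)          # word segmentation
    tones, boundaries = prosody_model.predict(normalized)  # tones and prosody boundaries
    return SpeechFeatures(phonemes, tones, words, boundaries)
```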
In the above embodiment, by extracting the phoneme, tone, word segmentation and prosody boundary feature information of the text to be synthesized and performing speech synthesis based on this feature information, the text content of the text to be synthesized can be better attended to. Therefore, the resulting audio information corresponding to the text to be synthesized can pause according to the text content and word segmentation of the text to be synthesized, which improves the accuracy and intelligibility of the audio information and makes it convenient for a user to quickly understand the text content corresponding to the audio information. In addition, because pauses can be made at natural prosodic boundaries during speech synthesis, the naturalness and fluency of the audio information corresponding to the text to be synthesized can be improved.
The process of speech synthesis by the speech synthesis model is described in detail below in connection with the structure of the speech synthesis model. As shown in fig. 2, the acoustic submodel includes an encoding network, a duration submodel, a Gaussian sampling module, a linear processing module, and a flow-based reversible generation model Glow.
In the present disclosure, a coding network is used for generating a representation sequence corresponding to a text to be synthesized according to voice feature information corresponding to the text to be synthesized; the time length sub-model is used for obtaining time length characteristic information corresponding to the text to be synthesized according to the voice characteristic information corresponding to the text to be synthesized; the Gaussian sampling module is used for generating fixed-length semantic representation corresponding to the text to be synthesized according to the representation sequence and the duration characteristic information; the linear processing module is used for carrying out linear transformation on the fixed-length semantic representation to obtain first Mel spectrum information corresponding to the text to be synthesized; a reversible generation model Glow based on the flow is used for generating second mel spectrum information according to standard normal distribution; the vocoder is used for generating predicted waveform point information corresponding to the text to be synthesized according to the first Mel spectrum information and the second Mel spectrum information.
The duration characteristic information comprises the number of voice frames corresponding to each phoneme in the text to be synthesized. The representation sequence is formed by arranging codes of each phoneme in the text to be synthesized according to the sequence of the corresponding phonemes in the text to be synthesized.
For example, if the phoneme sequence corresponding to the text to be synthesized is "ab", where the encoding of phoneme "a" is "A" and the encoding of phoneme "b" is "B", then the representation sequence corresponding to the text to be synthesized is "AB".
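The patent does not spell out the internals of the Gaussian sampling module. One plausible reading, sketched below purely as an assumption, is a Gaussian-weighted upsampling that expands the phoneme-level representation sequence to a frame-level, fixed-length representation using the per-phoneme frame counts supplied by the duration submodel:

```python
import numpy as np

def gaussian_upsample(encodings, durations, sigma=1.0):
    """Expand a phoneme-level representation sequence to frame level.

    encodings: (num_phonemes, dim) array, the representation sequence.
    durations: (num_phonemes,) array of per-phoneme frame counts.
    The Gaussian weighting is an assumption about how the Gaussian sampling
    module might aggregate phoneme encodings for each output frame.
    """
    durations = np.asarray(durations, dtype=float)
    ends = np.cumsum(durations)                    # end frame index of each phoneme
    centers = ends - durations / 2.0               # Gaussian center of each phoneme
    total_frames = int(ends[-1])
    t = np.arange(total_frames) + 0.5              # frame time positions
    logits = -0.5 * ((t[:, None] - centers[None, :]) / sigma) ** 2
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # normalize over phonemes per frame
    return weights @ encodings                     # (total_frames, dim)
```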
As shown in fig. 2, the encoding network may include a pre-processing network (Pre-net) sub-model and a Transformer sub-model. The speech feature information corresponding to the text to be synthesized is first input into the Pre-net sub-model, which applies a nonlinear transformation to the speech feature information so as to improve the convergence and generalization of the speech synthesis model; the Transformer sub-model then obtains the representation sequence corresponding to the text to be synthesized from the nonlinearly transformed speech feature information.
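The patent does not give the internal structure of the Pre-net; a common design for such a nonlinear pre-processing block (and therefore only an assumption here) is a small stack of fully connected layers with ReLU and dropout:

```python
import torch
import torch.nn as nn

class PreNet(nn.Module):
    """A plausible Pre-net: two fully connected layers with ReLU and dropout.
    Layer sizes and dropout rate are assumptions, not values from the patent."""
    def __init__(self, in_dim, hidden_dim=256, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
        )

    def forward(self, x):        # x: (batch, seq_len, in_dim)
        return self.net(x)       # nonlinearly transformed features for the Transformer sub-model
```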
The duration sub-model may be, for example, a CBHG model, a long short-term memory network (Long Short Term Memory Network, LSTM) model, an LSTM-RNN (Recurrent Neural Network) model, a deep neural network (Deep Neural Networks, DNN) model, a Transformer model, a Flow-based generation model Flow, or the like. Preferably, the duration sub-model may employ the Flow-based generation model Flow, which is suitable for modeling uncertain information in speech, thereby further improving the accuracy of speech synthesis.
In addition, the duration submodel can determine the number of speech frames corresponding to each phoneme in the text to be synthesized through the following steps: (1) obtaining the pronunciation duration of each phoneme in the text to be synthesized; and (2) determining the number of speech frames corresponding to each phoneme according to its pronunciation duration.
For example, the pronunciation duration of a phoneme is 200ms, the time length of a speech frame is 5ms, and the number of speech frames corresponding to the phoneme is 40.
As another example, if the pronunciation duration of a phoneme is 203 ms and the time length of a speech frame is 5 ms, the number of speech frames corresponding to the phoneme is 41 (i.e., 203/5 rounded up to the next integer): the last segment of less than 5 ms is processed as one frame.
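In code form, the rounding rule used in the two examples above is simply a ceiling division (a minimal sketch assuming the 5 ms frame length stated above):

```python
import math

def frames_for_phoneme(duration_ms, frame_ms=5.0):
    # a trailing segment shorter than one frame still counts as a full frame
    return math.ceil(duration_ms / frame_ms)

assert frames_for_phoneme(200) == 40
assert frames_for_phoneme(203) == 41
```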
As shown in fig. 2, the flow-based reversible generation model Glow includes a compression layer, an activation normalization (Activation Normalization, ActNorm) layer, a reversible 1×1 convolution layer, and an affine coupling layer.
The ActNorm layer normalizes the data, applying per-channel scale and bias parameters to the activations so that a mini-batch has zero mean and unit variance after activation; this is equivalent to preprocessing the data and avoids degradation of model performance. The reversible 1×1 convolution layer shuffles the data across dimensions by matrix multiplication, so that information is mixed more thoroughly. The affine coupling layer realizes a reversible conversion of the data through an invertible function.
In the speech synthesis stage, a sample drawn from the standard normal distribution is input into the affine coupling layer to obtain inversely converted data; the inversely converted data is then input into the reversible 1×1 convolution layer to shuffle the data; the shuffled data is then input into the ActNorm layer for data normalization and decompressed by the compression layer to obtain the second Mel spectrum information, which is output to the vocoder.
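The following sketch illustrates, under stated assumptions, the three Glow operations named above (the coupling network, layer sizes and initialization are assumptions; only the inverse direction used at synthesis time is shown for the coupling layer):

```python
import torch
import torch.nn as nn

class ActNorm(nn.Module):
    """Per-channel scale and bias; normally data-initialized so the first batch
    has zero mean and unit variance (that initialization step is omitted here)."""
    def __init__(self, channels):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1, channels, 1))
        self.bias = nn.Parameter(torch.zeros(1, channels, 1))

    def forward(self, x):                 # training direction
        return (x - self.bias) * self.scale

    def inverse(self, z):                 # synthesis direction
        return z / self.scale + self.bias

class InvConv1x1(nn.Module):
    """Reversible 1x1 convolution: mixes channels by matrix multiplication."""
    def __init__(self, channels):
        super().__init__()
        w = torch.linalg.qr(torch.randn(channels, channels))[0]  # random orthogonal init
        self.weight = nn.Parameter(w)

    def inverse(self, z):                 # z: (batch, channels, time)
        return torch.einsum("ij,bjt->bit", torch.inverse(self.weight), z)

class AffineCoupling(nn.Module):
    """One half of the channels predicts a scale/shift for the other half;
    the small conv net used here is an assumption about its form."""
    def __init__(self, channels, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels // 2, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, channels, 3, padding=1),
        )

    def inverse(self, z):
        za, zb = z.chunk(2, dim=1)
        log_s, t = self.net(za).chunk(2, dim=1)
        xb = (zb - t) * torch.exp(-log_s)
        return torch.cat([za, xb], dim=1)
```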
The following describes the training method of the speech synthesis model in detail. In order to enhance the training effect of the speech synthesis model, the vocoder may be a Flow-based generation model Flow, and in particular, may be implemented through S301 and S302 shown in fig. 3.
In S301, labeled speech feature information, labeled waveform point information, and labeled mel-spectrum information corresponding to the text training sample are obtained.
In the present disclosure, the labeled waveform point information corresponding to the text training sample can be obtained by performing μ-law compression on the audio information corresponding to the text training sample.
In S302, the acoustic submodel and the vocoder are directly jointly trained by taking the labeled speech feature information as the input of the coding network and the duration submodel, respectively, taking the output of the coding network and the output of the duration submodel as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the Flow-based generation model Flow, taking the labeled waveform point information as the target output of the Flow-based generation model Flow, taking the labeled mel spectrum information as the input of the Flow-based reversible generation model Glow, and taking the standard normal distribution as the target output of the Flow-based reversible generation model Glow.
Specifically, as shown in fig. 4, the labeled speech feature information corresponding to the text training sample may be input into the coding network to obtain a representation sequence corresponding to the text training sample, and meanwhile, the labeled speech feature information corresponding to the text training sample is input into the duration sub-model to obtain duration feature information corresponding to the text training sample; then, inputting a representation sequence and duration characteristic information corresponding to the text training sample into a Gaussian sampling module to obtain a fixed-length semantic representation corresponding to the text training sample; then, inputting the fixed-length semantic representation into a linear processing module to perform linear transformation to generate predicted mel spectrum information corresponding to the text training sample; then, inputting the predicted Mel spectrum information into a vocoder (namely, generating model Flow based on Flow) to obtain predicted waveform point information corresponding to the text training sample; and updating model parameters of all modules except the flow-based reversible generation model Glow in the acoustic submodel according to a comparison result of the predicted waveform point information and the labeling waveform point information corresponding to the text training sample.
Meanwhile, the labeled mel spectrum information corresponding to the text training sample is input into the compression layer in the flow-based reversible generation model Glow so as to compress the labeled mel spectrum information; the compressed data is then input into the ActNorm layer for data normalization; the normalized data is then input into the reversible 1×1 convolution layer to shuffle the data; the shuffled data is then input into the affine coupling layer for inverse data conversion to obtain a simulated normal distribution; the model parameters of the flow-based reversible generation model Glow can then be updated according to the comparison result between the simulated normal distribution and the standard normal distribution.
Thus, the above-described speech synthesis model can be obtained.
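As a rough, non-authoritative illustration of the joint objective described above (the concrete loss functions are assumptions, since the patent only speaks of comparison results, and the model callables below are placeholders):

```python
import torch
import torch.nn.functional as F

def joint_training_step(batch, acoustic_submodel, glow, vocoder_flow, optimizer):
    feats, target_wave_points, target_mel = batch    # labeled training data

    # acoustic branch: features -> predicted mel -> predicted waveform points
    pred_mel = acoustic_submodel(feats)              # coding net + duration + sampling + linear
    pred_points = vocoder_flow(pred_mel)             # Flow-based vocoder
    wave_loss = F.l1_loss(pred_points, target_wave_points)

    # Glow branch: labeled mel should map to a standard normal latent
    z = glow(target_mel)
    prior_loss = 0.5 * (z ** 2).mean()               # stand-in for the Glow likelihood term

    loss = wave_loss + prior_loss                    # both branches trained jointly
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```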
In addition, in order to enhance the user experience, after the audio information corresponding to the text to be synthesized is obtained in step 103, background music may be added to the audio information, so that the user can more easily understand the corresponding text content according to the background music and the audio information. Specifically, as shown in fig. 5, the above method may further include S104.
In S104, the audio information is synthesized with the background music.
In an embodiment, the background music may be preset music, that is, any music set by the user, or default music.
In another embodiment, before the audio information and the background music are synthesized, usage scenario information corresponding to the text to be synthesized can be determined according to the text information and/or the speech feature information of the text to be synthesized, where the usage scenario information includes, but is not limited to, news broadcasting, military introduction, fairy tale, campus broadcasting and the like; then, based on the usage scenario information, background music matching the usage scenario information is determined.
In the present disclosure, the usage scenario information may be determined in various manners, and in an embodiment, the usage scenario information corresponding to the text to be synthesized may be determined according to the text information of the text to be synthesized, where the text information may be a keyword. For example, the text to be synthesized may be automatically identified by keywords, so as to intelligently pre-judge the usage scenario information of the text to be synthesized according to the keywords.
In another embodiment, the usage scenario information corresponding to the text to be synthesized may be determined according to the voice feature information of the text to be synthesized. Specifically, the scene description word may be identified from the words in the speech feature information determined in the step 101, where the scene description word may be identified by matching each word with a pre-stored scene description word table, and then the usage scene information of the text to be synthesized may be determined according to the scene description word.
In yet another embodiment, the usage scenario information corresponding to the text to be synthesized may be determined according to the text information and the voice feature information of the text to be synthesized. Specifically, the keyword automatic recognition can be performed on the text to be synthesized, the scene description word is recognized from the word segmentation in the voice characteristic information determined in the step 101, and then the usage scene information of the text to be synthesized is determined together according to the keyword and the scene description word. Thus, the accuracy of determination of the usage scenario information can be improved.
After the usage scenario information corresponding to the text to be synthesized is determined, background music matching the usage scenario information can be determined from a pre-stored correspondence between usage scenario information and background music. For example, if the usage scenario information is a military introduction, the corresponding background music may be stirring music; if the usage scenario information is a fairy tale, the corresponding background music may be light and lively music.
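A minimal sketch of such a pre-stored correspondence (the scene keys and file names are hypothetical; the patent only states that a mapping exists):

```python
# Hypothetical mapping from usage scenario information to background music.
SCENE_TO_MUSIC = {
    "news broadcasting": "neutral_bed.wav",
    "military introduction": "stirring_march.wav",
    "fairy tale": "light_lively.wav",
    "campus broadcasting": "upbeat_pop.wav",
}

def pick_background_music(scene, default="default_bed.wav"):
    return SCENE_TO_MUSIC.get(scene, default)
```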
The disclosure also provides a method for training a speech synthesis model, wherein the speech synthesis model comprises an acoustic sub-model and a vocoder, the acoustic sub-model comprises a coding network, a duration sub-model, a Gaussian sampling module, a linear processing module and a reversible generation model Glow based on Flow, and the vocoder is a generation model Flow based on Flow. Specifically, the speech synthesis model may be trained by S301 and S302 shown in fig. 3.
In S301, labeled speech feature information, labeled waveform point information, and labeled mel-spectrum information corresponding to the text training sample are obtained.
In the present disclosure, the labeled waveform point information corresponding to the text training sample can be obtained by performing μ-law compression on the audio information corresponding to the text training sample.
In S302, the acoustic submodel and the vocoder are directly jointly trained by taking the labeled speech feature information as the input of the coding network and the duration submodel, respectively, taking the output of the coding network and the output of the duration submodel as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the Flow-based generation model Flow, taking the labeled waveform point information as the target output of the Flow-based generation model Flow, taking the labeled mel spectrum information as the input of the Flow-based reversible generation model Glow, and taking the standard normal distribution as the target output of the Flow-based reversible generation model Glow.
Specifically, as shown in fig. 4, the labeled speech feature information corresponding to the text training sample may be input into the coding network to obtain a representation sequence corresponding to the text training sample, and meanwhile, the labeled speech feature information corresponding to the text training sample is input into the duration sub-model to obtain duration feature information corresponding to the text training sample; then, inputting a representation sequence and duration characteristic information corresponding to the text training sample into a Gaussian sampling module to obtain a fixed-length semantic representation corresponding to the text training sample; then, inputting the fixed-length semantic representation into a linear processing module to perform linear transformation to generate predicted mel spectrum information corresponding to the text training sample; then, inputting the predicted Mel spectrum information into a vocoder (namely, generating model Flow based on Flow) to obtain predicted waveform point information corresponding to the text training sample; and updating model parameters of all modules except the flow-based reversible generation model Glow in the acoustic submodel according to a comparison result of the predicted waveform point information and the labeling waveform point information corresponding to the text training sample.
Meanwhile, the labeled mel spectrum information corresponding to the text training sample is input into the compression layer in the flow-based reversible generation model Glow so as to compress the labeled mel spectrum information; the compressed data is then input into the ActNorm layer for data normalization; the normalized data is then input into the reversible 1×1 convolution layer to shuffle the data; the shuffled data is then input into the affine coupling layer for inverse data conversion to obtain a simulated normal distribution; the model parameters of the flow-based reversible generation model Glow can then be updated according to the comparison result between the simulated normal distribution and the standard normal distribution.
Thus, the above-described speech synthesis model can be obtained.
Fig. 6 is a block diagram illustrating a speech synthesis apparatus according to an exemplary embodiment. As shown in fig. 6, the apparatus 600 includes: the first obtaining module 601 is configured to obtain voice feature information corresponding to a text to be synthesized; the voice synthesis module 602 is configured to input the voice feature information acquired by the first acquisition module 601 into a voice synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, where the voice synthesis model includes an acoustic submodel and a vocoder, and the voice synthesis model is obtained by directly performing joint training on the acoustic submodel and the vocoder; and the expansion module 603 is configured to perform μ -law expansion on the predicted waveform point information obtained by the speech synthesis module 602, so as to obtain audio information corresponding to the text to be synthesized.
In the present disclosure, the text to be synthesized may be in a tonal language such as Chinese, Tibetan, Uyghur, Dai or Qiang. The speech feature information may be used to characterize relevant information such as phonemes, intonation and pauses of the text to be synthesized. In addition, the text to be synthesized may be various types of text, such as novels, lyrics and the like.
According to the technical scheme, the predicted waveform point information can be directly obtained according to the voice characteristic information corresponding to the text to be synthesized through the voice synthesis model, then, the audio information corresponding to the text to be synthesized can be obtained by performing simple mu-law expansion on the predicted waveform point information, and cooperation between the acoustic submodel and the vocoder is not needed, so that the efficiency of voice synthesis is improved, error accumulation generated by training the acoustic submodel and the vocoder respectively in the related technology can be effectively reduced, and the accuracy of voice synthesis is improved. In addition, because the predicted waveform point information can be directly generated according to the voice characteristic information corresponding to the text to be synthesized without the acoustic characteristic, the problem that the generated audio information cannot adapt to special pronunciation requirements due to the fact that the acoustic characteristic does not have universality can be avoided, and therefore the effect of voice synthesis is improved. In addition, the voice synthesis model can be obtained by directly carrying out combined training on the acoustic submodel and the vocoder, so that the model training period can be shortened, and the rhythm fidelity of the voice synthesis model is better; and moreover, the matching degree of the acoustic submodel and the vocoder in the voice synthesis model can be ensured, so that the problem that the accuracy of the obtained voice synthesis result is low even though the accuracy of the acoustic submodel and the vocoder is high is solved, and the accuracy of voice synthesis is further improved.
Optionally, the acoustic submodel comprises an encoding network, a time submodel, a gaussian sampling module, a linear processing module and a flow-based reversible generation model Glow;
The coding network is used for generating a representation sequence corresponding to the text to be synthesized according to the voice characteristic information, wherein the representation sequence is formed by arranging codes of each phoneme in the text to be synthesized according to the sequence of the corresponding phoneme in the text to be synthesized;
the duration submodel is used for obtaining duration feature information corresponding to the text to be synthesized according to the voice feature information, wherein the duration feature information comprises the number of voice frames corresponding to each phoneme in the text to be synthesized;
The Gaussian sampling module is used for generating fixed-length semantic representation corresponding to the text to be synthesized according to the representation sequence and the duration characteristic information;
The linear processing module is used for carrying out linear transformation on the semantic representation to obtain first Mel spectrum information corresponding to the text to be synthesized;
the flow-based reversible generation model Glow is used for generating second mel spectrum information according to standard normal distribution;
The vocoder is used for generating predicted waveform point information corresponding to the text to be synthesized according to the first Mel spectrum information and the second Mel spectrum information.
Optionally, the vocoder is a Flow-based generation model Flow;
The speech synthesis model is trained by a speech synthesis model training apparatus, wherein, as shown in fig. 7, the speech synthesis model training apparatus 700 comprises:
The second acquisition module is used for acquiring marked voice characteristic information, marked waveform point information and marked Mel spectrum information corresponding to the text training sample;
The training module is used for directly carrying out joint training on the acoustic submodel and the vocoder in a mode of taking the marked voice characteristic information as the input of the coding network and the duration submodel respectively, taking the output of the coding network and the output of the duration submodel as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the Flow-based generation model Flow, taking the marked waveform point information as the target output of the Flow-based generation model Flow, taking the marked Mel spectrum information as the input of the Flow-based reversible generation model Glow and taking the standard normal distribution as the target output of the Flow-based reversible generation model Glow so as to obtain the voice synthesis model.
Optionally, the voice characteristic information comprises phonemes, tones, word segmentation and prosody boundaries;
the first acquisition module is used for inputting the text to be synthesized into an information extraction model to obtain voice characteristic information corresponding to the text to be synthesized.
Optionally, the apparatus 600 further includes: and a background music synthesis module, configured to synthesize the audio information obtained by the expansion module 603 with background music.
The disclosure further provides a device for training the speech synthesis model, wherein the speech synthesis model comprises an acoustic submodel and a vocoder, the acoustic submodel comprises a coding network, a duration submodel, a Gaussian sampling module, a linear processing module and a Flow-based reversible generation model Glow, and the vocoder is a Flow-based generation model Flow. As shown in fig. 7, the apparatus 700 includes: the second obtaining module 701 is configured to obtain labeled speech feature information, labeled waveform point information, and labeled mel spectrum information corresponding to the text training sample; the training module 702 is configured to directly perform joint training on the acoustic submodel and the vocoder in a manner of taking the labeled speech feature information as the input of the coding network and the duration submodel, taking the output of the coding network and the output of the duration submodel as the input of the gaussian sampling module, taking the output of the gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the Flow-based generation model Flow, taking the labeled waveform point information as the target output of the Flow-based generation model Flow, taking the labeled mel spectrum information as the input of the Flow-based reversible generation model Glow, and taking the standard normal distribution as the target output of the Flow-based reversible generation model Glow, so as to obtain the speech synthesis model.
In the present disclosure, the labeled waveform point information corresponding to the text training sample can be obtained by performing μ-law compression on the audio information corresponding to the text training sample.
Specifically, as shown in fig. 4, the labeled speech feature information corresponding to the text training sample may be input into the coding network to obtain a representation sequence corresponding to the text training sample, and meanwhile, the labeled speech feature information corresponding to the text training sample is input into the duration sub-model to obtain duration feature information corresponding to the text training sample; then, inputting a representation sequence and duration characteristic information corresponding to the text training sample into a Gaussian sampling module to obtain a fixed-length semantic representation corresponding to the text training sample; then, inputting the fixed-length semantic representation into a linear processing module to perform linear transformation to generate predicted mel spectrum information corresponding to the text training sample; then, inputting the predicted Mel spectrum information into a vocoder (namely, generating model Flow based on Flow) to obtain predicted waveform point information corresponding to the text training sample; and updating model parameters of all modules except the flow-based reversible generation model Glow in the acoustic submodel according to a comparison result of the predicted waveform point information and the labeling waveform point information corresponding to the text training sample.
Meanwhile, the labeled mel spectrum information corresponding to the text training sample is input into the compression layer in the flow-based reversible generation model Glow so as to compress the labeled mel spectrum information; the compressed data is then input into the ActNorm layer for data normalization; the normalized data is then input into the reversible 1×1 convolution layer to shuffle the data; the shuffled data is then input into the affine coupling layer for inverse data conversion to obtain a simulated normal distribution; the model parameters of the flow-based reversible generation model Glow can then be updated according to the comparison result between the simulated normal distribution and the standard normal distribution.
Thus, the above-described speech synthesis model can be obtained.
Note that, the speech synthesis model training device 700 may be integrated into the speech synthesis device 600, or may be independent of the speech synthesis device 600, and is not particularly limited in the present disclosure. In addition, with respect to the apparatus in the above embodiments, the specific manner in which the respective modules perform the operations has been described in detail in the embodiments related to the method, and will not be described in detail herein.
The present disclosure also provides a computer-readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the above-described speech synthesis method or the steps of a speech synthesis model training method provided by the present disclosure.
Referring now to fig. 8, a schematic diagram of an electronic device (e.g., a terminal device or server) 800 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 8 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 8, the electronic device 800 may include a processing means (e.g., a central processor, a graphics processor, etc.) 801, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 are also stored. The processing device 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
In general, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, etc.; storage 808 including, for example, magnetic tape, hard disk, etc.; communication means 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 shows an electronic device 800 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication device 809, or installed from storage device 808, or installed from ROM 802. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 801.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire voice characteristic information corresponding to a text to be synthesized; input the voice characteristic information into a voice synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the voice synthesis model comprises an acoustic submodel and a vocoder, and the voice synthesis model is obtained by directly carrying out joint training on the acoustic submodel and the vocoder; and perform mu-law expansion on the predicted waveform point information to obtain audio information corresponding to the text to be synthesized.
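As a purely illustrative aside, the mu-law expansion in the last step is commonly implemented as follows; the disclosure does not fix the companding constant, so mu = 255 (8-bit companding) is an assumption of this sketch.

import numpy as np

def mu_law_expand(y, mu=255):
    # Map companded waveform points in [-1, 1] back to linear amplitudes.
    y = np.asarray(y, dtype=np.float64)
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

# e.g. audio = mu_law_expand(predicted_waveform_points)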
Alternatively, the computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
acquire marked voice characteristic information, marked waveform point information and marked mel spectrum information corresponding to a text training sample, wherein the voice synthesis model comprises an acoustic submodel and a vocoder, the acoustic submodel comprises a coding network, a duration submodel, a Gaussian sampling module, a linear processing module and a Flow-based reversible generation model Glow, and the vocoder is a Flow-based generation model Flow; and directly perform joint training on the acoustic submodel and the vocoder to obtain the voice synthesis model, by respectively taking the marked voice characteristic information as the input of the coding network and the duration submodel, taking the output of the coding network and the output of the duration submodel as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the Flow-based generation model Flow, taking the marked waveform point information as the target output of the Flow-based generation model Flow, taking the marked Mel spectrum information as the input of the Flow-based reversible generation model Glow, and taking the standard normal distribution as the target output of the Flow-based reversible generation model Glow.
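For illustration only, a single step of such direct joint training could be sketched as follows; the module interfaces (acoustic, vocoder_flow, glow) and the concrete losses are assumptions of this sketch, chosen so that the Glow parameters are driven only by the standard-normal target while the remaining modules are driven by the waveform-point comparison.

import torch
import torch.nn.functional as F

def joint_training_step(acoustic, vocoder_flow, glow, optimizer, feat, wave_target, mel_target):
    optimizer.zero_grad()
    mel_pred = acoustic(feat)                      # coding net -> duration -> sampling -> linear
    wave_pred = vocoder_flow(mel_pred)             # flow-based vocoder predicts waveform points
    loss_wave = F.l1_loss(wave_pred, wave_target)  # compare with marked waveform points
    z, logdet = glow(mel_target)                   # marked mel pushed towards N(0, I)
    loss_glow = 0.5 * (z ** 2).mean() - logdet / z.numel()
    (loss_wave + loss_glow).backward()             # one backward pass: direct joint training
    optimizer.step()
    return loss_wave.item(), loss_glow.item()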
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or hardware. In some cases, the name of a module does not constitute a limitation on the module itself; for example, the first obtaining module may also be described as "a module for obtaining voice characteristic information corresponding to the text to be synthesized".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, example 1 provides a speech synthesis method, comprising: acquiring voice characteristic information corresponding to a text to be synthesized; inputting the voice characteristic information into a voice synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the voice synthesis model comprises an acoustic submodel and a vocoder, and the voice synthesis model is obtained by directly carrying out joint training on the acoustic submodel and the vocoder; and performing mu-law expansion on the predicted waveform point information to obtain the audio information corresponding to the text to be synthesized.
Example 2 provides the method of example 1, according to one or more embodiments of the present disclosure, the acoustic submodel comprising a coding network, a duration submodel, a Gaussian sampling module, a linear processing module, and a flow-based reversible generation model Glow; the coding network is used for generating a representation sequence corresponding to the text to be synthesized according to the voice characteristic information, wherein the representation sequence is formed by arranging the code of each phoneme in the text to be synthesized according to the order of the corresponding phoneme in the text to be synthesized; the duration submodel is used for obtaining duration feature information corresponding to the text to be synthesized according to the voice characteristic information, wherein the duration feature information comprises the number of voice frames corresponding to each phoneme in the text to be synthesized; the Gaussian sampling module is used for generating a fixed-length semantic representation corresponding to the text to be synthesized according to the representation sequence and the duration feature information; the linear processing module is used for carrying out linear transformation on the semantic representation to obtain first Mel spectrum information corresponding to the text to be synthesized; the flow-based reversible generation model Glow is used for generating second mel spectrum information according to the standard normal distribution; the vocoder is used for generating predicted waveform point information corresponding to the text to be synthesized according to the first Mel spectrum information and the second Mel spectrum information.
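For illustration only, one plausible reading of the Gaussian sampling module of example 2 is Gaussian upsampling, in which each of the duration-derived frames receives a Gaussian-weighted mixture of the phoneme encodings; the disclosure does not spell out the module's internals, so the following sketch is an assumption.

import numpy as np

def gaussian_upsample(phoneme_enc, durations, sigma=1.0):
    # phoneme_enc: [N, D] codes from the coding network; durations: [N] frames per phoneme.
    phoneme_enc = np.asarray(phoneme_enc, dtype=np.float64)
    durations = np.asarray(durations, dtype=np.float64)
    ends = np.cumsum(durations)
    centers = ends - durations / 2.0                 # centre frame of each phoneme
    frames = np.arange(int(ends[-1])) + 0.5          # one position per output frame
    # Gaussian weight of every phoneme for every frame, normalised per frame
    w = np.exp(-0.5 * ((frames[:, None] - centers[None, :]) / sigma) ** 2)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ phoneme_enc                           # fixed-length representation [sum(durations), D]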
Example 3 provides the method of example 2 in accordance with one or more embodiments of the present disclosure, wherein the vocoder is a Flow-based generation model Flow; the speech synthesis model is trained by: acquiring marked voice characteristic information, marked waveform point information and marked mel spectrum information corresponding to a text training sample; and respectively taking the marked voice characteristic information as the input of the coding network and the duration submodel, taking the output of the coding network and the output of the duration submodel as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the Flow-based generation model Flow, taking the marked waveform point information as the target output of the Flow-based generation model Flow, taking the marked Mel spectrum information as the input of the Flow-based reversible generation model Glow, and taking the standard normal distribution as the target output of the Flow-based reversible generation model Glow, to directly perform joint training on the acoustic submodel and the vocoder so as to obtain the voice synthesis model.
In accordance with one or more embodiments of the present disclosure, example 4 provides the method of any one of examples 1-3, wherein the voice characteristic information includes phonemes, intonation, word segmentation and prosody boundaries; and the obtaining of the voice characteristic information corresponding to the text to be synthesized includes: inputting the text to be synthesized into an information extraction model to obtain the voice characteristic information corresponding to the text to be synthesized.
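Purely as an illustration of the voice characteristic information named in example 4, such features could be carried in a container like the following; the field names and types are assumptions of this sketch, not a format defined by the disclosure.

from dataclasses import dataclass
from typing import List

@dataclass
class SpeechFeatures:
    phonemes: List[str]            # phoneme sequence of the text to be synthesized
    intonation: List[int]          # intonation / tone label per phoneme
    word_boundaries: List[int]     # word segmentation positions (phoneme indices)
    prosody_boundaries: List[int]  # prosodic boundary positions (phoneme indices)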
According to one or more embodiments of the present disclosure, example 5 provides the method of any one of examples 1-3, the method further comprising: synthesizing the audio information with background music.
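As an illustrative aside, synthesizing the audio information with background music can be as simple as an amplitude-scaled overlay of the two waveforms; the gain value below is an assumption of this sketch.

import numpy as np

def mix_with_background(voice, music, music_gain=0.3):
    n = min(len(voice), len(music))                  # trim to the shorter signal
    mixed = np.asarray(voice[:n]) + music_gain * np.asarray(music[:n])
    return np.clip(mixed, -1.0, 1.0)                 # keep the mix in a valid amplitude range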
Example 6 provides a speech synthesis model training method according to one or more embodiments of the present disclosure, the speech synthesis model comprising an acoustic submodel and a vocoder, the acoustic submodel comprising a coding network, a duration submodel, a Gaussian sampling module, a linear processing module and a Flow-based reversible generation model Glow, and the vocoder being a Flow-based generation model Flow; the method comprises the following steps: acquiring marked voice characteristic information, marked waveform point information and marked mel spectrum information corresponding to a text training sample; and respectively taking the marked voice characteristic information as the input of the coding network and the duration submodel, taking the output of the coding network and the output of the duration submodel as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the Flow-based generation model Flow, taking the marked waveform point information as the target output of the Flow-based generation model Flow, taking the marked Mel spectrum information as the input of the Flow-based reversible generation model Glow, and taking the standard normal distribution as the target output of the Flow-based reversible generation model Glow, to directly perform joint training on the acoustic submodel and the vocoder so as to obtain the voice synthesis model.
According to one or more embodiments of the present disclosure, example 7 provides a speech synthesis apparatus, comprising: the first acquisition module is used for acquiring voice characteristic information corresponding to the text to be synthesized; the voice synthesis module is used for inputting the voice characteristic information acquired by the first acquisition module into a voice synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the voice synthesis model comprises an acoustic submodel and a vocoder, and the voice synthesis model is obtained by directly carrying out joint training on the acoustic submodel and the vocoder; and the expansion module is used for performing mu-law expansion on the predicted waveform point information obtained by the voice synthesis module to obtain the audio information corresponding to the text to be synthesized.
Example 8 provides a speech synthesis model training apparatus according to one or more embodiments of the present disclosure, the speech synthesis model comprising an acoustic submodel and a vocoder, the acoustic submodel comprising a coding network, a duration submodel, a Gaussian sampling module, a linear processing module and a Flow-based reversible generation model Glow, and the vocoder being a Flow-based generation model Flow; the apparatus comprises: the second acquisition module, used for acquiring marked voice characteristic information, marked waveform point information and marked Mel spectrum information corresponding to the text training sample; and the training module, used for directly carrying out joint training on the acoustic submodel and the vocoder, in a manner of respectively taking the marked voice characteristic information as the input of the coding network and the duration submodel, taking the output of the coding network and the output of the duration submodel as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the Flow-based generation model Flow, taking the marked waveform point information as the target output of the Flow-based generation model Flow, taking the marked Mel spectrum information as the input of the Flow-based reversible generation model Glow, and taking the standard normal distribution as the target output of the Flow-based reversible generation model Glow, so as to obtain the voice synthesis model.
According to one or more embodiments of the present disclosure, example 9 provides a computer-readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the method of any of examples 1-6.
In accordance with one or more embodiments of the present disclosure, example 10 provides an electronic device, comprising: a storage device having a computer program stored thereon; processing means for executing the computer program in the storage means to implement the steps of the method of any one of examples 1-5.
Example 11 provides an electronic device according to one or more embodiments of the present disclosure, comprising: a storage device having a computer program stored thereon; processing means for executing the computer program in the storage means to implement the steps of the method of example 6.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in the present disclosure (but not limited thereto).
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. The specific manner in which the various modules perform operations in the apparatuses of the above embodiments has been described in detail in connection with the embodiments of the method and will not be described in detail here.

Claims (10)

1. A method of speech synthesis, comprising:
acquiring voice characteristic information corresponding to a text to be synthesized;
Inputting the voice characteristic information into a voice synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the voice synthesis model comprises an acoustic submodel and a vocoder, and the voice synthesis model is obtained by directly carrying out joint training on the acoustic submodel and the vocoder;
performing mu-law expansion on the predicted waveform point information to obtain audio information corresponding to the text to be synthesized;
The acoustic submodel comprises a coding network, a duration submodel, a Gaussian sampling module, a linear processing module and a Flow-based reversible generation model Glow, wherein the vocoder is a Flow-based generation model Flow;
the speech synthesis model is trained by:
Acquiring marked voice characteristic information, marked waveform point information and marked mel spectrum information corresponding to a text training sample;
The method comprises the steps of respectively taking the marked voice characteristic information as the input of the coding network and the duration submodel, taking the output of the coding network and the output of the duration submodel as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the Flow-based generation model Flow, taking the marked waveform point information as the target output of the Flow-based generation model Flow, taking the marked Mel spectrum information as the input of the Flow-based reversible generation model Glow, and taking the standard normal distribution as the target output of the Flow-based reversible generation model Glow, so as to directly carry out joint training on the acoustic submodel and the vocoder to obtain the voice synthesis model.
2. The method according to claim 1, wherein the coding network is configured to generate a representation sequence corresponding to the text to be synthesized according to the voice characteristic information, wherein the representation sequence is formed by arranging the code of each phoneme in the text to be synthesized according to the order of the corresponding phoneme in the text to be synthesized;
the duration submodel is used for obtaining duration feature information corresponding to the text to be synthesized according to the voice characteristic information, wherein the duration feature information comprises the number of voice frames corresponding to each phoneme in the text to be synthesized;
The Gaussian sampling module is used for generating fixed-length semantic representation corresponding to the text to be synthesized according to the representation sequence and the duration characteristic information;
The linear processing module is used for carrying out linear transformation on the semantic representation to obtain first Mel spectrum information corresponding to the text to be synthesized;
the flow-based reversible generation model Glow is used for generating second mel spectrum information according to standard normal distribution;
The vocoder is used for generating predicted waveform point information corresponding to the text to be synthesized according to the first Mel spectrum information and the second Mel spectrum information.
3. The method according to claim 1 or 2, wherein the voice characteristic information comprises phonemes, tones, word segmentation and prosody boundaries;
The obtaining of the voice characteristic information corresponding to the text to be synthesized comprises:
inputting the text to be synthesized into an information extraction model to obtain the voice characteristic information corresponding to the text to be synthesized.
4. The method according to claim 1 or 2, characterized in that the method further comprises:
synthesizing the audio information with background music.
5. The voice synthesis model training method is characterized in that the voice synthesis model comprises an acoustic submodel and a vocoder, the acoustic submodel comprises a coding network, a duration submodel, a Gaussian sampling module, a linear processing module and a Flow-based reversible generation model Glow, and the vocoder is a Flow-based generation model Flow;
The method comprises the following steps:
Acquiring marked voice characteristic information, marked waveform point information and marked mel spectrum information corresponding to a text training sample;
The method comprises the steps of respectively taking the marked voice characteristic information as the input of the coding network and the duration submodel, taking the output of the coding network and the output of the duration submodel as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the Flow-based generation model Flow, taking the marked waveform point information as the target output of the Flow-based generation model Flow, taking the marked Mel spectrum information as the input of the Flow-based reversible generation model Glow, and taking the standard normal distribution as the target output of the Flow-based reversible generation model Glow, so as to directly carry out joint training on the acoustic submodel and the vocoder to obtain the voice synthesis model.
6. A speech synthesis apparatus, comprising:
The first acquisition module is used for acquiring voice characteristic information corresponding to the text to be synthesized;
The voice synthesis module is used for inputting the voice characteristic information acquired by the first acquisition module into a voice synthesis model to obtain predicted waveform point information corresponding to the text to be synthesized, wherein the voice synthesis model comprises an acoustic submodel and a vocoder, and the voice synthesis model is obtained by directly carrying out joint training on the acoustic submodel and the vocoder;
The expansion module is used for performing mu-law expansion on the predicted waveform point information obtained by the voice synthesis module to obtain audio information corresponding to the text to be synthesized;
The acoustic submodel comprises a coding network, a duration submodel, a Gaussian sampling module, a linear processing module and a Flow-based reversible generation model Glow, wherein the vocoder is a Flow-based generation model Flow;
the speech synthesis model is trained by:
Acquiring marked voice characteristic information, marked waveform point information and marked mel spectrum information corresponding to a text training sample;
The method comprises the steps of respectively taking the marked voice characteristic information as the input of the coding network and the duration submodel, taking the output of the coding network and the output of the duration submodel as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the Flow-based generation model Flow, taking the marked waveform point information as the target output of the Flow-based generation model Flow, taking the marked Mel spectrum information as the input of the Flow-based reversible generation model Glow, and taking the standard normal distribution as the target output of the Flow-based reversible generation model Glow, so as to directly carry out joint training on the acoustic submodel and the vocoder to obtain the voice synthesis model.
7. The voice synthesis model training device is characterized in that the voice synthesis model comprises an acoustic submodel and a vocoder, the acoustic submodel comprises a coding network, a duration submodel, a Gaussian sampling module, a linear processing module and a Flow-based reversible generation model Glow, and the vocoder is a Flow-based generation model Flow;
The device comprises:
The second acquisition module is used for acquiring marked voice characteristic information, marked waveform point information and marked Mel spectrum information corresponding to the text training sample;
The training module is used for directly carrying out joint training on the acoustic submodel and the vocoder, in a manner of respectively taking the marked voice characteristic information as the input of the coding network and the duration submodel, taking the output of the coding network and the output of the duration submodel as the input of the Gaussian sampling module, taking the output of the Gaussian sampling module as the input of the linear processing module, taking the output of the linear processing module as the input of the Flow-based generation model Flow, taking the marked waveform point information as the target output of the Flow-based generation model Flow, taking the marked Mel spectrum information as the input of the Flow-based reversible generation model Glow, and taking the standard normal distribution as the target output of the Flow-based reversible generation model Glow, so as to obtain the voice synthesis model.
8. A computer readable medium on which a computer program is stored, characterized in that the program, when being executed by a processing device, carries out the steps of the method according to any one of claims 1-5.
9. An electronic device, comprising:
A storage device having a computer program stored thereon;
Processing means for executing said computer program in said storage means to carry out the steps of the method according to any one of claims 1-4.
10. An electronic device, comprising:
A storage device having a computer program stored thereon;
Processing means for executing said computer program in said storage means to carry out the steps of the method of claim 5.
CN202110042176.3A 2021-01-13 2021-01-13 Speech synthesis method, synthesis model training method, device, medium and equipment Active CN112786006B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110042176.3A CN112786006B (en) 2021-01-13 2021-01-13 Speech synthesis method, synthesis model training method, device, medium and equipment
PCT/CN2021/139988 WO2022151931A1 (en) 2021-01-13 2021-12-21 Speech synthesis method and apparatus, synthesis model training method and apparatus, medium, and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110042176.3A CN112786006B (en) 2021-01-13 2021-01-13 Speech synthesis method, synthesis model training method, device, medium and equipment

Publications (2)

Publication Number Publication Date
CN112786006A CN112786006A (en) 2021-05-11
CN112786006B true CN112786006B (en) 2024-05-17

Family

ID=75755742

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110042176.3A Active CN112786006B (en) 2021-01-13 2021-01-13 Speech synthesis method, synthesis model training method, device, medium and equipment

Country Status (2)

Country Link
CN (1) CN112786006B (en)
WO (1) WO2022151931A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786006B (en) * 2021-01-13 2024-05-17 北京有竹居网络技术有限公司 Speech synthesis method, synthesis model training method, device, medium and equipment
CN113450761B (en) * 2021-06-17 2023-09-22 清华大学深圳国际研究生院 Parallel voice synthesis method and device based on variation self-encoder
CN113823257B (en) * 2021-06-18 2024-02-09 腾讯科技(深圳)有限公司 Speech synthesizer construction method, speech synthesis method and device
CN113421550A (en) * 2021-06-25 2021-09-21 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
CN113571047A (en) * 2021-07-20 2021-10-29 杭州海康威视数字技术股份有限公司 Audio data processing method, device and equipment
CN113593527B (en) * 2021-08-02 2024-02-20 北京有竹居网络技术有限公司 Method and device for generating acoustic features, training voice model and recognizing voice
CN113838452B (en) * 2021-08-17 2022-08-23 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and computer storage medium
CN114330328B (en) * 2021-12-13 2023-10-10 电子科技大学 Tibetan word segmentation method based on Transformer-CRF
CN115905819B (en) * 2023-03-09 2023-05-12 中国民用航空飞行学院 rPPG signal generation method and device based on generation countermeasure network
CN117292672B (en) * 2023-11-27 2024-01-30 厦门大学 High-quality speech synthesis method based on correction flow model


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786006B (en) * 2021-01-13 2024-05-17 北京有竹居网络技术有限公司 Speech synthesis method, synthesis model training method, device, medium and equipment

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3151239A1 (en) * 2015-09-29 2017-04-05 Yandex Europe AG Method and system for text-to-speech synthesis
CN111954903A (en) * 2018-12-11 2020-11-17 微软技术许可有限责任公司 Multi-speaker neural text-to-speech synthesis
WO2020190050A1 (en) * 2019-03-19 2020-09-24 휴멜로 주식회사 Speech synthesis apparatus and method therefor
CN111292719A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111402855A (en) * 2020-03-06 2020-07-10 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111369971A (en) * 2020-03-11 2020-07-03 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111402856A (en) * 2020-03-23 2020-07-10 北京字节跳动网络技术有限公司 Voice processing method and device, readable medium and electronic equipment
CN111583903A (en) * 2020-04-28 2020-08-25 北京字节跳动网络技术有限公司 Speech synthesis method, vocoder training method, device, medium, and electronic device
CN111627418A (en) * 2020-05-27 2020-09-04 携程计算机技术(上海)有限公司 Training method, synthesizing method, system, device and medium for speech synthesis model
CN111739508A (en) * 2020-08-07 2020-10-02 浙江大学 End-to-end speech synthesis method and system based on DNN-HMM bimodal alignment network

Also Published As

Publication number Publication date
CN112786006A (en) 2021-05-11
WO2022151931A1 (en) 2022-07-21

Similar Documents

Publication Publication Date Title
CN112786006B (en) Speech synthesis method, synthesis model training method, device, medium and equipment
CN111402855B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111292720B (en) Speech synthesis method, device, computer readable medium and electronic equipment
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
CN107945786B (en) Speech synthesis method and device
CN112489620B (en) Speech synthesis method, device, readable medium and electronic equipment
CN112786011B (en) Speech synthesis method, synthesis model training method, device, medium and equipment
CN111583900B (en) Song synthesis method and device, readable medium and electronic equipment
CN112786007B (en) Speech synthesis method and device, readable medium and electronic equipment
CN111899719A (en) Method, apparatus, device and medium for generating audio
CN111292719A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN112786008B (en) Speech synthesis method and device, readable medium and electronic equipment
CN111368559A (en) Voice translation method and device, electronic equipment and storage medium
CN111489735B (en) Voice recognition model training method and device
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN112331176B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111354343B (en) Voice wake-up model generation method and device and electronic equipment
CN111916053B (en) Voice generation method, device, equipment and computer readable medium
CN113327580A (en) Speech synthesis method, device, readable medium and electronic equipment
CN113257218B (en) Speech synthesis method, device, electronic equipment and storage medium
CN112365878A (en) Speech synthesis method, device, equipment and computer readable storage medium
CN111369968B (en) Speech synthesis method and device, readable medium and electronic equipment
CN114495902A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN110930975B (en) Method and device for outputting information
CN111968657B (en) Voice processing method and device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant