GB2591245A - An expressive text-to-speech system - Google Patents

An expressive text-to-speech system

Info

Publication number
GB2591245A
GB2591245A GB2000883.5A GB202000883A
Authority
GB
United Kingdom
Prior art keywords
module
speech
expressive
sub
expression vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB2000883.5A
Other versions
GB202000883D0 (en)
GB2591245B (en)
Inventor
Monge Alvarez Jesus
Francois Holly
Sung Hosang
Choi Seungdo
Choo Kihyun
Park Sangjun
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to GB2000883.5A priority Critical patent/GB2591245B/en
Publication of GB202000883D0 publication Critical patent/GB202000883D0/en
Priority to KR1020200062637A priority patent/KR20210095010A/en
Priority to US17/037,023 priority patent/US11830473B2/en
Publication of GB2591245A publication Critical patent/GB2591245A/en
Application granted granted Critical
Publication of GB2591245B publication Critical patent/GB2591245B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS; G10 MUSICAL INSTRUMENTS; ACOUSTICS; G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

An expressive text-to-speech synthesiser generates expression vectors, which represent prosodic information in a stored reference audio style file, at an expressivity characterisation module 104. These expression vectors then condition an acoustic expression model 106 comprising a deep convolutional neural network (CNN) which synthesises output speech 110 from text 102. The expression vectors may be compressed to a fixed length vector, and weighted. The system enables the style of the output expressive speech to be controlled, as well as other speech characteristics such as speaking rate and pitch.

Description

An Expressive Text-to-Speech System
Field
[001] The present application generally relates to a method and system for text-to-speech synthesis, and in particular to a system which enables expressive speech to be synthesised from input text.
Background
[2] Generally speaking, text-to-speech (TTS) systems involve automatically converting input language (written) text into speech. A TTS system is a computer-based system which can read text aloud, which is useful for people who are visually impaired or who have reading difficulties, or to aid people who have a speech impairment. Any TTS system is composed of two main parts: a front-end natural language processing (NLP) module and a back-end digital signal processing (DSP) module. The NLP module performs analysis of input text to identify linguistic features (by, for example, performing text normalisation, or converting graphemes to phonemes, etc.), whereas the DSP module performs speech waveform generation using the linguistic features identified by the NLP module.
[3] Currently, there are two main approaches to build a TTS system: concatenative synthesis, and statistical parametric speech synthesis (SPSS).
[4] Concatenative synthesis is a data-driven approach which generates speech by connecting natural, pre-recorded units (such as words, syllables, diphones, phonemes, etc.).
It provides very good speech quality, but it lacks flexibility since the inventory of pre-recorded units must be rebuilt every time the system needs to be updated. Additionally, the rigidity of the system makes it difficult to transfer certain speech characteristics into the synthesised speech, such as prosodic information. The prosodic information comprises the elements of speech that are not individual phonetic segments, but are properties of syllables or larger units of speech, such as intonation, tone, stress and rhythm. Prosodic information may reflect various features of the speaker or of the utterance itself, such as whether the utterance is a statement, question or command, or if the utterance includes irony or sarcasm.
[005] Statistical parametric speech synthesis (SPSS) aims at building a statistical model that converts the linguistic features identified by NLP into acoustic features that can be used by a vocoder to generate the speech waveform. Conventional SPSS systems comprise statistical acoustic models that are based on hidden Markov models (HMM). This approach has various advantages over concatenative synthesis, such as flexibility to change voice characteristics and robustness. However, a major limitation of SPSS is a degradation of speech quality due to the quality of the vocoder and/or the quality of the acoustic model based on HMM.
[6] Recently, deep neural networks (DNN) have emerged as a powerful tool due to their ability to perform statistical modelling and the availability of the required hardware and software to implement them. Figure 1 is a block diagram of a conventional deep neural network based text-to-speech system, in which a deep neural network is used within an SPSS system to build three modules: a front-end NLP module, an acoustic model module, and a vocoder module. Over the past years, several DNN architectures have outperformed classical signal processing based state-of-the-art TTS. For example, Tacotron 2 (Shen et al., "Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions"), Tacotron-GST (Wang et al., "Style tokens: unsupervised style modelling, control and transfer in end-to-end speech synthesis"), and DC-TTS (Tachibana et al., "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention") as acoustic models, or WaveNet (van den Oord et al., "WaveNet: a generative model for raw audio") and LPCNet (J-M Valin and J. Skoglund, "LPCNet: improving neural speech synthesis through linear prediction") as vocoders. There have also been proposals that cover the whole pipeline, e.g. Char2Wav (Sotelo et al., "Char2Wav: end-to-end speech synthesis").
[7] Many of the aforementioned acoustic models simply generate neutral style speech. However, a more natural interaction with the listener may require synthesised speech with different styles, speaking rates and pitch. Tacotron-GST is an expressive TTS engine that performs style transfer. In this system, the user provides not only the input text but also an audio reference file with the style that must be copied in the synthesised speech. Even though Tacotron-GST provides good style transfer without severely compromising speech quality, its main disadvantages are a long training time due to the presence of many recurrent layers in its architecture, and limited controllability of the style and other speech properties like pitch or speaking rate. Moreover, neural vocoders like WaveNet require billions of floating-point operations per second (GFLOPS). Therefore, putting together an acoustic model like Tacotron-GST and a vocoder like WaveNet makes the computational complexity of the DSP part in Figure 1 impracticable for deployment on devices whose computational capabilities and energy consumption are limited.
[8] The present applicant has recognised the need for an improved TTS system which is able to generate expressive synthesised speech.
Summary
[9] In a first approach of the present techniques, there is provided a system for synthesising expressive speech, comprising: an interface for receiving an input text for conversion to speech; at least one processor coupled to memory to: generate, using an expressivity characterisation module, a plurality of expression vectors, where each expression vector is a representation of prosodic information in a reference audio style file; and synthesise expressive speech from the input text, using an expressive acoustic model comprising a deep convolutional neural network that is conditioned by at least one of the plurality of generated expression vectors.
[010] As explained above, it is desirable to build a system for generating expressive speech which is able to be implemented on devices with limited computational capabilities or power constraints, such as consumer electronic devices (e.g. laptops and smartphones). The present techniques achieve this by providing a new acoustic model which is conditioned on expression or expressivity information, such that the output of the acoustic model includes expression.
[11] The standard DC-TTS model reaches high efficiency by using dilated convolutional neural network layers rather than recurrent neural network layers. For example, 15 hours of DC-TTS training provides the same speech quality as 12 days of Tacotron training. However, DC-TTS is not an expressive TTS system and can only reproduce neutral speech. It was also originally designed for a low-quality spectral vocoder (Griffin-Lim).
[12] Likewise, there exist other neural vocoders which are more efficient than WaveNet. For example, LPCNet combines classical signal processing techniques with neural networks to reduce its computational load while providing equivalent speech quality to WaveNet. It requires around 2.8 GFLOPS, so real-time synthesis can be achieved on small devices such as smartphones or tablets. However, it requires a different set of input features compared to spectral-based vocoders such as Griffin-Lim.
[013] The system of the present techniques comprises an autoregressive sequence-to-sequence acoustic model. The acoustic model is a modification of the deep convolutional text-to-speech (DC-TTS) acoustic model mentioned above, and is used with an LPCNet vocoder. The standard DC-TTS acoustic model is modified in the present techniques to produce the acoustic features required by the LPCNet vocoder. Specifically, since the standard DC-TTS acoustic model does not include expressivity and was designed for a spectral based vocoder, the present techniques have modified the standard DC-TTS architecture to generate cepstral based expressive acoustic features suitable for an LPCNet vocoder, so that real-time speech synthesis can be achieved even on computationally-limited or low-power devices. The modification comprises new techniques for feedback channel selection and re-shaping, a new reference encoder module, and a new expression injection method. Furthermore, the present techniques enable the style of the output expressive speech to be controlled, as well as other speech characteristics such as speaking rate and pitch. Thus, advantageously, the present techniques provide a controllable and customisable system for generating expressive synthetic speech from text.
[014] The expressivity characterisation module comprises a trainable neural network. The expressivity characterisation module may be part of the acoustic model, or may be separate to the acoustic model. In either case, the expressivity characterisation module is used to generate expression vectors that are used to condition the deep convolutional neural network layers of the acoustic model (i.e. of the modified DC-TTS acoustic model). The expressivity characterisation module comprises trainable sub-modules to characterise the expressivity of an input reference file and create a representation of this expressivity information. Its output is used as conditioning input to the expressive acoustic model, in particular the audio decoder and optionally the audio encoder sub-modules of the acoustic model, thereby copying the reference style into the synthesised speech.
[015] The expressivity characterisation module may comprise: an interface for receiving a reference audio style file; and a reference encoder sub-module for compressing prosodic information of the received reference audio style file into a fixed-length vector. The reference audio style file is a pre-recorded audio file that represents a particular style or speech characteristic. For example, the reference audio style file may represent a style such as "happy", "friendly", "angry", "stern", etc., and/or may represent a speech characteristic such as fast speaking rate, slow speaking rate, higher average pitch, lower average pitch, normal average pitch, normal speaking rate, etc.
[016] The reference encoder sub-module may comprise a plurality of two-dimensional convolutional layers for generating the fixed-length vector. The reference encoder sub-module may further comprise max pooling layers, residual connections, a gated recurrent unit (GRU) layer, and a fully-connected (FC) layer.
[17] The expressivity characterisation module may comprise: an attention sub-module for: receiving the fixed-length vector from the reference encoder sub-module; generating a set of weights corresponding to the prosodic information of the received reference audio style file; and outputting an expression vector comprising the set of weights, for the reference audio style file. The expression vector may, in some cases, be a 256-dimensional expression vector, but this is a non-limiting example vector size.
[18] In some cases, the attention sub-module may be a multi-head attention sub-module.
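The following is a minimal PyTorch sketch of one way such an attention sub-module could turn the fixed-length reference embedding into a 256-dimensional expression vector. The learned style-token bank, the token count and the head count are assumptions borrowed from the style-token literature cited in the background; the passage above only specifies that a set of weights is generated and an expression vector is output.

import torch
import torch.nn as nn

class StyleAttention(nn.Module):
    """Multi-head attention of the reference embedding over learned style tokens."""
    def __init__(self, ref_dim=128, expr_dim=256, num_tokens=10, num_heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, expr_dim))  # assumed token bank
        self.query_proj = nn.Linear(ref_dim, expr_dim)
        self.attn = nn.MultiheadAttention(expr_dim, num_heads, batch_first=True)

    def forward(self, ref_embedding):                          # (batch, ref_dim)
        q = self.query_proj(ref_embedding).unsqueeze(1)        # (batch, 1, expr_dim)
        kv = self.tokens.unsqueeze(0).expand(ref_embedding.size(0), -1, -1)
        expr, weights = self.attn(q, kv, kv)                   # weights: (batch, 1, num_tokens)
        return expr.squeeze(1), weights.squeeze(1)             # 256-d expression vector, weights

# e.g. expr_vec, w = StyleAttention()(torch.randn(8, 128))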
[19] Instead of an attention sub-module, the expressivity characterisation module may comprise the reference encoder sub-module and a variational autoencoder sub-module comprising a plurality of fully-connected layers for: receiving the fixed-length vector from the reference encoder sub-module; generating a latent space corresponding to the prosodic information of the received reference audio style file; and outputting an expression vector for the reference audio style file. The expression vector may, in some cases, be a 64-dimensional expression vector, but this is a non-limiting example vector size.
[20] The system may further comprise storage for storing the expression vectors for reference audio style files generated by the expressivity characterisation module.
[21] The expressive acoustic model comprises a trainable neural network. The expressive acoustic model learns the relationship between linguistic features (e.g. phonemes) in the input text and acoustic features (e.g. the sounds corresponding to the linguistic features). In other words, the expressive acoustic model performs sequence-to-sequence modelling. The expressive acoustic model comprises a number of sub-modules which may all be based on, or comprise, dilated convolutional layers.
[22] The expressive acoustic model may comprise: an audio encoder sub-module for: receiving pre-recorded or pre-synthesised speech features, and generating a vector corresponding to the received speech. The expressive acoustic model may be used in two ways. Firstly, the expressive acoustic model may be used for training, i.e. to learn the above-mentioned relationship between linguistic features and acoustic features. In this case, the expressive acoustic model may be trained using input text and pre-recorded speech (or pre-synthesised speech) corresponding to the input text. For example, the input text may be the following sentence: "This is a patent application for a text-to-speech synthesis system", and the pre-recorded or pre-synthesised speech is a human or computer voice reading/speaking this sentence. Secondly, the expressive acoustic model may be used in real-time to generate new synthesised speech from new input text. In this case, the expressive acoustic model may use a previous audio frame of generated speech (e.g. via auto-regression).
[023] In some cases, the audio encoder sub-module may: receive at least one of the plurality of generated expression vectors, and generate a vector corresponding to the received speech, conditioned by the received expression vector. Advantageously, this enables the audio encoder sub-module to take into account the expressivity represented by the received expression vector, so that the synthesised speech contains expressivity.
[024] The step of receiving the at least one of the plurality of generated expression vectors may comprise receiving a user-selected expression vector. That is, a user may specify that they wish the synthesised speech to have a particular style (e.g. "happy") or a particular speech characteristic (e.g. "slow speaking rate"). Alternatively, the step of receiving the at least one of the plurality of generated expression vectors comprises receiving an expression vector selected to suit a context from which the received input text is obtained. For example, if the input text is received from a news website, an expression vector may be selected to have a "sombre" or "neutral" style, so that the news is read out in an appropriate tone. In another example, if the input text is received from a story for children, expression vectors may be selected to represent a "happy" or "friendly" style and a "slow speaking rate". This selection may be automatic, based on the context.
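As a simple illustration of this selection step, the sketch below picks a stored expression vector either from an explicit user choice or automatically from the context of the input text. The style names, the context-to-style mapping and the random placeholder vectors are purely illustrative and are not taken from this passage.

import numpy as np

stored_expression_vectors = {
    "happy":   np.random.randn(256).astype(np.float32),   # placeholder vectors
    "neutral": np.random.randn(256).astype(np.float32),
    "sombre":  np.random.randn(256).astype(np.float32),
}
context_to_style = {"news": "sombre", "children_story": "happy"}   # assumed mapping

def select_expression_vector(user_choice=None, context=None):
    if user_choice is not None:                            # explicit user selection
        return stored_expression_vectors[user_choice]
    style = context_to_style.get(context, "neutral")       # automatic, context-based
    return stored_expression_vectors[style]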
[25] The speech received by the audio encoder sub-module may comprise a plurality of audio frames, each audio frame comprising a feature vector. The feature vector may comprise twenty Bark-based cepstrum features, a period feature and a correlation feature. The twenty Bark-based cepstrum features, the period feature and the correlation feature are required by an LPCNet vocoder.
[26] Before reaching the audio encoder, the feature vector of each audio frame is normalised and passed through a feedback channel selection module. The normalisation process may use a mean and standard deviation of the whole feature set (where the feature set is all the feature vectors that make up the input data). The original DC-TTS only generates one out of every four frames of the acoustic features per decoding step and, later, a secondary upsampling network (described as a Spectrogram Super Resolution Network or SSRN by Tachibana et al. ("Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention")) is employed to generate the remaining three frames. In the present techniques, a feedback channel selection module may be included in order to avoid this upsampling network, and so that all the frames can be generated using the sequence-to-sequence architecture only. Thus, the final input to the audio encoder may be a plurality of audio frames that each comprise a feature vector containing 25 features which represent four adjacent frames of the acoustic feature set. Specifically, the 25 features may comprise 22 features of the first audio frame, and a DC component (i.e. the first Bark-based cepstrum coefficient) of the second, third, and fourth adjacent frames.
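A minimal numpy sketch of this feedback channel selection is given below: each decoder step is fed 25 values built from four adjacent 22-feature frames (twenty Bark-based cepstra plus the period and correlation features), namely the full first frame plus the DC cepstral coefficient of the next three frames. The assumption that the DC coefficient sits at index 0 of each frame vector is made for illustration only.

import numpy as np

def normalise(features, mean, std):
    # mean/std computed over the whole training feature set
    return (features - mean) / std

def feedback_channel_selection(frames):
    """frames: (num_frames, 22) array of normalised acoustic features."""
    steps = []
    for i in range(0, len(frames) - 3, 4):        # one decoder step per four frames
        first = frames[i]                          # all 22 features of frame i
        dc_next = frames[i + 1:i + 4, 0]           # DC coefficient of frames i+1..i+3
        steps.append(np.concatenate([first, dc_next]))
    return np.stack(steps)                         # (num_steps, 25)

# e.g. feedback_channel_selection(np.random.randn(400, 22)).shape == (100, 25)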
[27] The expressive acoustic model may comprise a text encoder sub-module for generating the keys and values for a guided attention module based on the linguistic features of the input text. Thus, the text encoder sub-module may: receive phonemes or graphemes corresponding to the received input text; generate a first matrix V representing the value of each phoneme or grapheme in the received input; and generate a second matrix K representing the unique key associated with each value, as explained in, for example, Tachibana et al., "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention".
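The sketch below illustrates a text encoder of this kind: a phoneme embedding followed by a small stack of 1D convolutions whose output channels are split into the key matrix K and value matrix V. The embedding size, layer count and dilation pattern are illustrative assumptions; the original DC-TTS uses a considerably deeper stack of dilated convolutions.

import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, num_phonemes=60, d=256):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, d)
        self.convs = nn.Sequential(
            nn.Conv1d(d, 2 * d, kernel_size=1), nn.ReLU(),
            nn.Conv1d(2 * d, 2 * d, kernel_size=3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv1d(2 * d, 2 * d, kernel_size=3, padding=3, dilation=3),
        )
        self.d = d

    def forward(self, phoneme_ids):                   # (batch, N) integer ids
        x = self.embed(phoneme_ids).transpose(1, 2)   # (batch, d, N)
        kv = self.convs(x)                            # (batch, 2d, N)
        K, V = kv[:, :self.d, :], kv[:, self.d:, :]   # keys and values, each (batch, d, N)
        return K, V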
[28] The expressive acoustic model may comprise a guided attention sub-module for: comparing the generated matrix and the generated first and second matrices; and determining a similarity between each character in the received input text and a sound represented in the matrix. In other words, the guided attention sub-module evaluates how strongly the n-th phoneme and the t-th frame of the acoustic features are related. It is a dot-product-based attention mechanism. It is 'guided' because it applies a weighting function which exploits the fact that the relationship between the order of phonemes and the acoustic frames is nearly linear with time. This module is unchanged from the original DC-TTS as described in Tachibana et al., "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention".
[29] The expressive acoustic model may comprise an audio decoder sub-module for generating the acoustic features needed as an input to the vocoder, based on the output of the guided attention sub-module. The audio decoder sub-module may: receive the generated expression vector used by the audio encoder, and generate acoustic features corresponding to an output of the guided attention sub-module, conditioned by the received expression vector. Advantageously, this enables the audio decoder sub-module to take into account the expressivity represented by the received expression vector, so that the output of the expressive acoustic model comprises expressivity information before it is sent to the vocoder to produce synthesised speech.
[030] The acoustic features generated by the audio decoder sub-module may represent a plurality of audio frames, each audio frame comprising twenty Bark-based cepstrum features, a period feature and a correlation feature. That is, the output of the audio decoder sub-module may include the features required for the vocoder (as explained above with respect to the audio encoder). The output of the audio decoder sub-module may need to be reshaped so that it is in the right format for the vocoder. This is described in more detail below.
[031] The system may further comprise a vocoder for synthesising speech using the acoustic features generated by the audio decoder sub-module. The vocoder may comprise an LPCNet model, as described in J-M Valin and J. Skoglund, "LPCNet: improving neural speech synthesis through linear prediction".
[032] As mentioned above, the present techniques enable customisable speech to be synthesised that contains a style and speech characteristics desired by a user. However, there are many possible styles and speech characteristics, and it would be time-consuming to record reference audio style files for every style and characteristic and combination thereof, and then generate expression vectors for each such reference file. Therefore, the present techniques advantageously use interpolation and extrapolation to generate desired styles and characteristics from a set of existing expression vectors.
[33] The at least one processor coupled to memory of the system may be further configured to: generate, using an interpolation and extrapolation module, a user-defined expression vector for use by the expressive acoustic model to generate expressive speech from the input text.
[34] The interpolation and extrapolation module may be configured to: obtain, from storage, a first expression vector and a second expression vector, each representing a distinct style; perform a linear interpolation or extrapolation between the first expression vector and the second expression vector, using a user-defined scalar value; and generate the user-defined expression vector. Once the user-defined expression vector has been generated, the user-defined expression vector can be used in real time to convert new input text into expressive synthesised speech. Thus, the user-defined expression vector may be input into the expressive acoustic model (together with new received input text) to generate expressive speech from the received input text.
[35] In a second approach of the present techniques, there is provided a method for synthesising expressive speech, comprising: generating, using an expressivity characterisation module, a plurality of expression vectors, where each expression vector is a representation of prosodic information in a reference audio style file; and synthesising expressive speech from the input text, using an expressive acoustic model comprising a deep convolutional neural network that is conditioned by at least one of the plurality of generated expression vectors.
[36] The above features described with respect to the first approach apply equally to the second approach.
[37] In a related approach of the present techniques, there is provided a non-transitory data carrier carrying processor control code to implement the methods described herein.
[38] As will be appreciated by one skilled in the art, the present techniques may be embodied as a system, method or computer program product. Accordingly, the present techniques may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
[39] Furthermore, the present techniques may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
[40] Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object oriented programming languages and conventional procedural programming languages. Code components may be embodied as procedures, methods or the like, and may comprise subcomponents which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction set to high-level compiled or interpreted language constructs.
[041] Embodiments of the present techniques also provide a non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out any of the methods described herein.
[042] The techniques further provide processor control code to implement the above-described methods, for example on a general purpose computer system or on a digital signal processor (DSP). The techniques also provide a carrier carrying processor control code to, when running, implement any of the above methods, in particular on a non-transitory data carrier. The code may be provided on a carrier such as a disk, a microprocessor, CD- or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. Code (and/or data) to implement embodiments of the techniques described herein may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as Python, C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog (RTM) or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, such code and/or data may be distributed between a plurality of coupled components in communication with one another. The techniques may comprise a controller which includes a microprocessor, working memory and program memory coupled to one or more of the components of the system.
[043] It will also be clear to one of skill in the art that all or part of a logical method according to embodiments of the present techniques may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the above-described methods, and that such logic elements may comprise components such as logic gates in, for example a programmable logic array or application-specific integrated circuit. Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media.
[44] In an embodiment, the present techniques may be realised in the form of a data carrier having functional data thereon, said functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable said computer system to perform all the steps of the above-described method.
Brief description of drawings
[45] Implementations of the present techniques will now be described, by way of example only, with reference to the accompanying drawings, in which:
[46] Figure 1 is a block diagram of a conventional deep neural network based text-to-speech system;
[47] Figure 2 is a block diagram of an expressive text-to-speech synthesis system according to the present techniques;
[48] Figure 3 is a flowchart of example steps for synthesising expressive speech using the system of Figure 2;
[049] Figure 4A is a block diagram of an acoustic model and an example expressivity characterisation module of the system of Figure 2;
[050] Figure 4B is a more detailed block diagram of the system of Figure 2;
[051] Figure 5A is a block diagram of a conventional reference encoder, and Figure 5B is a block diagram of a reference encoder of the expressivity characterisation module;
[052] Figure 6 is a block diagram of another example expressivity characterisation module of the system of Figure 2;
[53] Figure 7 is a block diagram of a 1D dilated convolutional layer with conditioning weighting;
[54] Figure 8 is a schematic diagram illustrating a normal inference mode for generating expression vectors using the system of Figure 2, and an inference mode which uses interpolation/extrapolation to generate new expression vectors for further expression customisation;
[055] Figure 9 is a schematic diagram illustrating a normal inference mode for customising speech characteristics, and an inference mode which uses interpolation/extrapolation to further customise speech characteristics;
[056] Figure 10 shows boxplot representations of three features of an expression vector for low, normal and high pitch; and
[57] Figure 11 shows a representation of the offset computation for a linear feature of an expression vector.
Detailed description of drawings
[58] Broadly speaking, the present techniques relate to a method and system for text-to-speech synthesis, and in particular to a system which enables expressive speech to be synthesised from input text. The system enables the style of the output expressive speech to be controlled, as well as other speech characteristics such as speaking rate and pitch. Thus, advantageously, the present techniques provide a controllable and customisable system for generating expressive synthetic speech from text.
[59] Figure 2 is a block diagram of an expressive text-to-speech synthesis system 100 according to the present techniques. As mentioned above, the expressive speech synthesis system is a modification of existing deep neural-network based SPSS systems (as shown in Figure 1). The system 100 may be entirely implemented within an apparatus, such as a consumer electronic device. The consumer electronic device may be any user device, such as, but not limited to, a smartphone, tablet, laptop, computer or computing device, virtual assistant device, robot or robotic device, consumer good/appliance (e.g. a smart fridge), an internet of things device, or image capture system/device. Some parts or functions of the system 100 may be distributed across devices, e.g. cloud storage or cloud-based/remote servers.
[60] The system 100 comprises an interface 102 for receiving an input text for conversion to speech, and an interface 110 for outputting the synthesised speech. The system 100 may comprise other interfaces (not shown) that enable the system to receive inputs and/or generate outputs (e.g. user selections of expression vectors, etc.).
[61] The system 100 comprises at least one processor coupled to memory (not shown) to: generate, using an expressivity characterisation module 104, a plurality of expression vectors, where each expression vector is a representation of prosodic information in a reference audio style file; and synthesise expressive speech from the input text, using an expressive acoustic model 106 comprising a deep convolutional neural network that is conditioned by at least one of the plurality of generated expression vectors.
[62] The at least one processor or processing circuitry (not shown) controls various processing operations performed by the system 100, such as executing all or part of the neural network(s) of system 100. The processor may comprise processing logic to process data and generate output data in response to the processing. The processor may comprise one or more of: a microprocessor, a microcontroller, and an integrated circuit. The processor may itself comprise computing resources that are available to the system 100 for executing a neural network. That is, the system 100 may comprise one or more of: a central processing unit (CPU), a graphics processing unit (GPU), a neural processing unit (NPU), and a digital signal processor (DSP). Any of these computing resources may be used by the system 100 to execute all or part of the neural network(s).
[63] The system 100 may comprise a vocoder 108 for synthesising and outputting speech 110 using the acoustic features generated by the expressive acoustic model 106. The vocoder may comprise an LPCNet model, as described in J-M Valin and J. Skoglund, "LPCNet: improving neural speech synthesis through linear prediction".
[64] The system 100 may comprise storage (not shown) for storing expression vectors for reference audio style files generated by the expressivity characterisation module 104. Storage may comprise a volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory such as Flash, read only memory (ROM), or electrically erasable programmable ROM (EEPROM), for storing data, programs, or instructions, for example.
[65] Figure 3 is a flowchart of example steps for synthesising expressive speech using the system of Figure 2. As will be explained below, once the acoustic model of the system has been trained, the acoustic model can be used to synthesise expressive speech from input text in real-time or near-real-time. In order to include expressivity in the speech, expressivity information is injected into the acoustic model. This is achieved by generating, using the expressivity characterisation module, a plurality of expression vectors, where each expression vector is a representation of prosodic information in a reference audio style file (step S100).
The method comprises receiving input text that is to be synthesised into speech (step S102). The method comprises synthesising expressive speech from the input text, using an expressive acoustic model comprising a deep convolutional neural network that is conditioned by at least one of the plurality of generated expression vectors (steps S104 and S106). More specifically, the method comprises outputting acoustic features from the acoustic model (step S108) and inputting these acoustic features into a vocoder to synthesise expressive speech using the acoustic features (step S110).
[066] Figure 4A is a block diagram of an acoustic model 106 and an example expressivity characterisation module 104 of the system of Figure 2. (An alternative expressivity characterisation module is shown in Figure 6). In some of the conventional acoustic models mentioned above, expressivity information is combined with the speech synthesis using a concatenation of the vectors at the input of an attention module. This technique works well with complex recurrent acoustic models like Tacotron-GST; however, this technique is not effective when used with an efficient convolutional sequence-to-sequence model (such as the present acoustic model described herein), and no expressivity is obtained. To achieve expressive synthesis with this type of architecture, a novel method of expression injection was developed.
[067] The expressive acoustic model 106 of the present techniques learns the relationship between linguistic features (e.g. phonemes) in the input text and acoustic features (e.g. the sounds corresponding to the linguistic features). In other words, the expressive acoustic model performs sequence-to-sequence modelling. The expressive acoustic model comprises a number of sub-modules which may all be based on, or comprise, dilated convolutional layers.
[068] The novel expression injection technique comprises a conditioning weighting technique, which is applied to all convolutional layers of the audio encoder and/or audio decoder modules. In the example shown in Figure 4A, the conditioning is applied to both the audio encoder and audio decoder. This technique combines the conditioning input (i.e. the expression vector) and the original input of dilated convolutional layers at each layer of the audio encoder and audio decoder module (see Figure 7). In this way, the final output of the acoustic model 106 serves to solve both the basic sequence-to-sequence TTS problem, and to transfer the style in the reference file by means of the corresponding expression vector.
[069] Figure 7 is a block diagram of a 1D dilated convolutional layer with conditioning weighting. The output of the expressivity characterisation module is an expression vector that characterises the expressivity of a reference utterance. The expressive acoustic model 106 must somehow exploit this information to transfer the expressivity to the synthesised speech. As mentioned above, this new acoustic model 106 applies conditioning weighting to every dilated convolutional layer of the audio encoder and/or audio decoder of the acoustic model 106. Firstly, a fully-connected (FC) layer is used to convert the expression vector output by the expressivity characterisation module to a tensor of a required dimension (e.g. 256 dimensions). The tensor is tiled in the time axis to match the dimension of the output of the 1D dilated convolutional layer (of the audio decoder/encoder) in this axis. Both tensors (i.e. the output of the 1D dilated convolutional layer and the conditioning input) are element-wise added. Finally, the required activation function is applied to the result. In this way, the expressivity has been incorporated into the acoustic features generated by the audio decoder.
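A PyTorch sketch of this conditioning weighting is given below: a fully-connected layer maps the expression vector to the channel dimension of the 1D dilated convolution, the result is tiled along the time axis, added element-wise to the convolution output, and the activation is applied. The channel count, kernel size and dilation are illustrative values only.

import torch
import torch.nn as nn

class ConditionedDilatedConv1d(nn.Module):
    def __init__(self, channels=256, expr_dim=256, kernel_size=3, dilation=2,
                 activation=nn.ReLU()):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=pad, dilation=dilation)
        self.fc = nn.Linear(expr_dim, channels)          # expression vector -> channels
        self.activation = activation

    def forward(self, x, expression_vector):
        # x: (batch, channels, time); expression_vector: (batch, expr_dim)
        y = self.conv(x)                                  # (batch, channels, time)
        cond = self.fc(expression_vector).unsqueeze(-1)   # (batch, channels, 1)
        y = y + cond.expand_as(y)                         # tile along time and add
        return self.activation(y)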
[70] Figure 4B is a more detailed block diagram of the system of Figure 2. The system comprises the expressivity characterisation module 104 and the acoustic model 106.
[71] The expressivity characterisation module 104 comprises a trainable neural network. The expressivity characterisation module 104 may be part of the acoustic model 106, or may be separate to the acoustic model 106. The expressivity characterisation module may be used during a normal inference mode or may be used to generate expression vectors during a training mode. The expressivity characterisation module may not be required in an interpolation/extrapolation inference mode, as pre-saved expression vectors are used in this mode (see Figures 8 and 9 below). In either case, the expressivity characterisation module 104 is used to generate expression vectors that are used to condition the deep convolutional neural network layers of the acoustic model 106 (i.e. of the modified DC-TTS acoustic model).
The expressivity characterisation module 104 comprises trainable sub-modules to characterise the expressivity of an input reference file and create a representation of this expressivity information. Its output or a pre-saved vector is used as conditioning input to the expressive acoustic model 106, in particular the audio encoder and/or audio decoder sub-modules of the acoustic model 106, thereby copying the reference style into the synthesised speech.
[72] The expressivity characterisation module 104 may comprise: an interface (not shown) for receiving a reference audio style file; and a reference encoder sub-module for compressing prosodic information of the received reference audio style file into a fixed-length vector. The reference audio style file is a pre-recorded audio file that represents a particular style or speech characteristic. For example, the reference audio style file may represent a style such as "happy", "friendly", "angry", "stern", etc., and/or may represent a speech characteristic such as fast speaking rate, slow speaking rate, higher average pitch, lower average pitch, normal average pitch, normal speaking rate, etc.
[73] The reference encoder sub-module may comprise a plurality of two-dimensional convolutional layers for generating the fixed-length vector. The reference encoder sub-module may further comprise max pooling layers, residual connections, a gated recurrent unit (GRU) layer, and a fully-connected (FC) layer.
[74] The expressivity characterisation module 104 may comprise: an attention sub-module for: receiving the fixed-length vector from the reference encoder sub-module; generating a set of weights corresponding to the prosodic information of the received reference audio style file; and outputting an expression vector comprising the set of weights, for the reference audio style file. The expression vector may, in some cases, be a 256-dimensional expression vector, but this is a non-limiting example vector size. The attention sub-module may be a multi-head attention sub-module.
[75] Instead of an attention sub-module, the expressivity characterisation module may comprise the reference encoder sub-module and a variational autoencoder (VAE) sub-module. Figure 6 is a block diagram of another example expressivity characterisation module of the system of Figure 2. Here, the expressivity characterisation module comprises a variational autoencoder sub-module which comprises a plurality of fully-connected layers for: receiving the fixed-length vector from the reference encoder sub-module; generating a latent space corresponding to the prosodic information of the received reference audio style file; and outputting an expression vector for the reference audio style file. The expression vector may, in some cases, be a 64-dimensional expression vector, but this is a non-limiting example vector size. Thus, whether an attention sub-module or a VAE sub-module is used, the output is the same: an expression vector that is to be injected into the acoustic model 106.
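A minimal sketch of the variational autoencoder sub-module follows, assuming a standard reparameterised Gaussian latent space: fully-connected layers map the fixed-length reference embedding to a mean and log-variance, and a sample from the resulting distribution is used as the 64-dimensional expression vector. The 128-dimensional reference embedding is an assumed size.

import torch
import torch.nn as nn

class ExpressionVAE(nn.Module):
    def __init__(self, ref_dim=128, latent_dim=64):
        super().__init__()
        self.fc_mu = nn.Linear(ref_dim, latent_dim)
        self.fc_logvar = nn.Linear(ref_dim, latent_dim)

    def forward(self, ref_embedding):
        mu = self.fc_mu(ref_embedding)
        logvar = self.fc_logvar(ref_embedding)
        std = torch.exp(0.5 * logvar)
        expression_vector = mu + std * torch.randn_like(std)   # reparameterisation trick
        # KL term that can be added to the training loss (the L_KL term in paragraph [91])
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return expression_vector, kl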
[076] The expressive acoustic model 106 comprises a trainable neural network. The expressive acoustic model 106 learns the relationship between linguistic features (e.g. phonemes) in the input text and acoustic features (e.g. the sounds corresponding to the linguistic features). In other words, the expressive acoustic model 106 performs sequence-to-sequence modelling. The expressive acoustic model comprises a number of sub-modules which may all be based on, or comprise, dilated convolutional layers. In the example shown in Figure 4B, the expressive acoustic model 106 comprises four sub-modules: a text encoder, an audio encoder, a guided attention, and an audio decoder.
[077] The audio encoder sub-module may be configured for: receiving pre-recorded or pre-synthesised speech features, and generating a vector corresponding to the received speech. The expressive acoustic model 106 may be used in two ways. Firstly, the expressive acoustic model may be used for training, i.e. to learn the above-mentioned relationship between linguistic features and acoustic features. In this case, the expressive acoustic model may be trained using input text and pre-recorded speech (or pre-synthesised speech) corresponding to the input text. For example, the input text may be the following sentence: "This is a patent application for a text-to-speech synthesis system", and the pre-recorded or pre-synthesised speech is a human or computer voice reading/speaking this sentence. Secondly, the expressive acoustic model may be used in real-time to generate new synthesised speech from new input text. In this case, the expressive acoustic model may use a previous audio frame of generated speech (e.g. via auto-regression).
[78] In some cases, the audio encoder sub-module may: receive the at least one of the plurality of generated expression vectors, and generate a vector corresponding to the received speech, conditioned by the received expression vector. Advantageously, this enables the audio encoder sub-module to take into account the expressivity represented by the received expression vector, so that the synthesised speech contains expressivity.
[79] The step of receiving the at least one of the plurality of generated expression vectors may comprise receiving a user-selected expression vector. That is, a user may specify that they wish the synthesised speech to have a particular style (e.g. "happy") or a particular speech characteristic (e.g. "slow speaking rate"). Alternatively, the step of receiving the at least one of the plurality of generated expression vectors comprises receiving an expression vector selected to suit a context from which the received input text is obtained. For example, if the input text is received from a news website, an expression vector may be selected to have a "sombre" or "neutral" style, so that the news is read out in an appropriate tone. In another example, if the input text is received from a story for children, expression vectors may be selected to represent a "happy" or "friendly" style and a "slow speaking rate". This selection may be automatic, based on the context.
[080] The speech received by the audio encoder sub-module may comprise a plurality of audio frames, each audio frame comprising a feature vector. The feature vector may comprise twenty Bark-based cepstrum features, a period feature and a correlation feature. The twenty Bark-based cepstrum features, period feature and correlation feature are required by an LPCNet vocoder.
[81] Before reaching the audio encoder, the feature vector of each audio frame is normalised and passed through a feedback channel selection module. The normalisation process may use a mean and standard deviation of the whole feature set (where the feature set is all the feature vectors that make up the input data). The original DC-TTS only generates one out of every four frames of the acoustic features per decoding step and, later, a secondary upsampling network (described as a Spectrogram Super Resolution Network or SSRN by Tachibana et al. ("Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention")) is employed to generate the remaining three frames. In the present techniques, a feedback channel selection module may be included in order to avoid this upsampling network, and so that all the frames can be generated using the sequence-to-sequence architecture only. Thus, the final input to the audio encoder may be a plurality of audio frames that each comprise a feature vector containing 25 features which represent four adjacent frames of the acoustic feature set. Specifically, the 25 features may comprise 22 features of the first audio frame, and a DC component (i.e. the first Bark-based cepstrum coefficient) of the second, third, and fourth adjacent frames.
[82] The text encoder sub-module generates keys and values for a guided attention module based on the linguistic features of the input text. Thus, the text encoder sub-module may: receive phonemes or graphemes corresponding to the received input text; generate a first matrix V representing the value of each phoneme or grapheme in the received input; and generate a second matrix K representing the unique key associated with each value, as explained in, for example, Tachibana et al., "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention".
[083] The guided attention sub-module may be configured for: comparing the generated matrix and the generated first and second matrices; and determining a similarity between each character or phoneme in the received input text and a sound represented in the matrix. In other words, the guided attention sub-module evaluates how strongly the n-th phoneme and the t-th frame of the acoustic features are related. It is a dot-product-based attention mechanism. It is 'guided' because it applies a weighting function which exploits the fact that the relationship between the order of phonemes and the acoustic frames is nearly linear with time. This module is unchanged from the original DC-TTS as described in Tachibana et al., "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention".
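A hedged sketch of the guided attention computation is shown below. The scaled dot product between the audio-encoder output Q and the text-encoder keys K, the softmax over the phoneme axis, and the penalty matrix W with g = 0.2 follow the DC-TTS paper cited above rather than details spelled out in this patent.

import torch

def guided_attention(Q, K, V, g=0.2):
    # Q: (batch, d, T) audio-encoder output; K, V: (batch, d, N) text-encoder output
    d = K.size(1)
    A = torch.softmax(torch.bmm(K.transpose(1, 2), Q) / d ** 0.5, dim=1)   # (batch, N, T)
    R = torch.bmm(V, A)                                                    # attended values, (batch, d, T)
    N, T = A.size(1), A.size(2)
    n = torch.arange(N).float().unsqueeze(1) / N
    t = torch.arange(T).float().unsqueeze(0) / T
    W = 1.0 - torch.exp(-((n - t) ** 2) / (2 * g ** 2))                     # penalty away from the diagonal
    attention_loss = torch.mean(A * W.unsqueeze(0))                         # guided attention loss term
    return R, A, attention_loss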
[084] The audio decoder sub-module generates the acoustic features needed as an input to the vocoder, based on the output of the guided attention sub-module. The audio decoder sub-module may: receive the generated expression vector used by the audio encoder, and generate acoustic features corresponding to an output of the guided attention sub-module, conditioned by the received expression vector. Advantageously, this enables the audio decoder sub-module to take into account the expressivity represented by the received expression vector, so that the output of the expressive acoustic model comprises expressivity information before it is sent to the vocoder to produce synthesised speech.
[085] The acoustic features generated by the audio decoder sub-module may represent a plurality of audio frames, each audio frame comprising twenty Bark-based cepstrum features, a period feature and a correlation feature. That is, the output of the audio decoder sub-module may be in the required format for the vocoder (as explained above with respect to the audio encoder).
[086] The system may further comprise a vocoder 108 for synthesising speech using the acoustic features generated by the audio decoder sub-module. The vocoder may comprise an LPCNet model, as described in J-M Valin and J. Skoglund, "LPCNet: improving neural speech synthesis through linear prediction".
[087] Before the acoustic features output by the audio decoder are provided to the vocoder 108, reshaping may be performed. As mentioned above, the new acoustic model 106 must generate four frames of acoustic features per decoding step, so no upsampling is required between the acoustic model output and the vocoder input. To achieve this, the size (i.e. number of filters) of the last convolutional layer in the audio decoder was modified from 80 (in the conventional DC-TTS system) to 88. Later, this output is reshaped as shown in Figure 4B. In this way, the new acoustic model 106 is able to generate the required 22 features for the four frames of each decoding step.
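The reshaping step can be illustrated as below, under the assumption that the 88 output channels of the last audio-decoder convolution are laid out as four consecutive groups of 22 features (twenty Bark-based cepstra, a period feature and a correlation feature), one group per frame of the decoding step.

import numpy as np

def reshape_decoder_output(decoder_out):
    """decoder_out: (num_steps, 88) -> (num_steps * 4, 22) frames for the vocoder."""
    num_steps = decoder_out.shape[0]
    return decoder_out.reshape(num_steps * 4, 22)

# e.g. reshape_decoder_output(np.zeros((100, 88))).shape == (400, 22)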
[88] A number of other modifications have been made to the acoustic model of the conventional DC-TTS system in order for the modified new acoustic model 106 to work with the LPCNet vocoder.
[89] Training hyper-parameters: the original DC-TTS employed a fixed learning rate. However, the new acoustic model achieves better performance when an exponentially decaying learning rate is employed.
[90] Regularisation methods: the original DC-TTS does not use any regularisation methods. The new acoustic model employs two kinds of regularisation method: (1) dropout, which is applied at the output of all layers of the expressive acoustic model 106; and (2) L2 weight regularisation, which is applied to the weights of all layers of the expressive acoustic model 106.
[91] Training losses: the original DC-TTS was trained using the mean absolute error (MAE) between the output acoustic features and the input acoustic features, plus the loss of the guided attention module (GAtt), plus the binary divergence (BinDiv) loss: L_DC-TTS = L_MAE + L_GAtt + L_BinDiv. In the present techniques, the acoustic model 106 computes the mean squared error (MSE) between the output acoustic features and the input acoustic features, plus the loss of the guided attention module (GAtt), plus the L2 weight regularisation loss (L2reg), plus the KL loss: L_NEW = L_MSE + L_GAtt + L_L2reg + L_KL.
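An illustrative composition of the new training loss L_NEW = L_MSE + L_GAtt + L_L2reg + L_KL is sketched below. Equal weighting of the terms and the L2 regularisation coefficient are placeholder assumptions; relative weights are not stated in this passage.

import torch

def total_loss(pred_features, target_features, attention_loss, kl_loss, model,
               l2_lambda=1e-6):
    mse = torch.mean((pred_features - target_features) ** 2)                  # L_MSE
    l2_reg = l2_lambda * sum(p.pow(2).sum() for p in model.parameters())      # L_L2reg
    return mse + attention_loss + l2_reg + kl_loss                            # L_NEW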
[092] Figure 5A is a block diagram of a conventional reference encoder, and Figure 5B is a block diagram of a reference encoder of the expressivity characterisation module. A conventional reference encoder (such as that employed by Skerry-Ryan et al., "Toward end-to-end prosody transfer for expressive speech with Tacotron") requires a high-dimensional spectral input, e.g. 80 Mel spectral features. If this conventional reference encoder architecture is used with a low-dimensional cepstral input, the style transfer is severely degraded. The novel reference encoder of the present techniques is designed to maintain the style transfer when used with a low-dimensional cepstral input (e.g. 20 Bark cepstral features). This low-dimensional cepstral input is required to avoid a domain mismatch when used with vocoders requiring a cepstral input (e.g. LPCNet).
[093] The new reference encoder sub-module of the expressivity characterisation module 104 is composed of 2D convolutional layers, a GRU layer, and a fully-connected layer, as shown in Figure 5B. The reference encoder sub-module receives, as an input, the 20 Bark-based cepstrum features (normalised versions of these features are also employed as part of the input to the audio encoder), rather than the 80 Mel-based spectrum features used by Skerry-Ryan et al. Since this input is a very coarse representation of the spectral content of the signal, a more complex architecture was required to learn a meaningful prosody. Thus, to achieve this, the new reference encoder sub-module uses a combination of residual connections and both linear and ReLU activations to boost the characterisation capacity, and max-pooling layers to reduce the size of the feature maps. Finally, the GRU layer breaks the temporal dependency of the feature maps, so the expression of the whole utterance is represented by a single expression vector. The fully-connected layer is employed to adapt the output to the desired range.
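A minimal sketch of a reference encoder of the kind described above is given below, assuming illustrative channel counts, kernel sizes and a 128-dimensional expression vector. It is not the exact architecture of Figure 5B, only an indication of how 2D convolutions with residual connections, max pooling, a GRU and a fully-connected layer can be combined over a 20-dimensional Bark-based cepstral input.

```python
import torch
import torch.nn as nn

class ReferenceEncoderSketch(nn.Module):
    def __init__(self, n_cepstral=20, expression_dim=128):
        super().__init__()
        self.conv_in = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.conv_res = nn.Conv2d(32, 32, kernel_size=3, padding=1)  # linear residual branch
        self.pool = nn.MaxPool2d(kernel_size=2)                      # shrinks the feature maps
        self.gru = nn.GRU(input_size=32 * (n_cepstral // 2),
                          hidden_size=expression_dim, batch_first=True)
        self.fc = nn.Linear(expression_dim, expression_dim)          # adapts the output range

    def forward(self, cepstra):
        # cepstra: (batch, time, 20) Bark-based cepstrum features
        x = cepstra.unsqueeze(1)                        # (batch, 1, time, 20)
        x = torch.relu(self.conv_in(x))                 # ReLU activation
        x = x + self.conv_res(x)                        # residual connection, linear activation
        x = self.pool(x)                                # halve the time and feature axes
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)  # (batch, time, channels * features)
        _, h = self.gru(x)                              # final hidden state summarises the utterance
        return self.fc(h.squeeze(0))                    # (batch, expression_dim)

expression_vector = ReferenceEncoderSketch()(torch.randn(2, 100, 20))
print(expression_vector.shape)  # torch.Size([2, 128])
```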
[094] Figure 8 is a schematic diagram illustrating a normal inference mode for generating expression vectors using the system of Figure 2, and an inference mode which uses interpolation/extrapolation to generate new expression vectors for further expression customisation. As mentioned above, the present techniques enable customisable speech to be synthesised that contains a style and speech characteristics desired by a user. However, there are many possible styles and speech characteristics, and it would be time-consuming to record reference audio style files for every style and characteristic and combination thereof, and then generate expression vectors for each such reference file. Therefore, the present techniques advantageously use interpolation and extrapolation to generate desired styles and characteristics from a set of existing expression vectors.
[095] Thus, the at least one processor coupled to memory of the system may be further configured to: generate, using an interpolation and extrapolation module, a user-defined expression vector for use by the expressive acoustic model to generate expressive speech from the input text.
[096] The interpolation and extrapolation module may be configured to: obtain, from storage, a first expression vector and a second expression vector, each representing a distinct style; perform a linear interpolation or extrapolation between the first expression vector and the second expression vector, using a user-defined scalar value; and generate the user-defined expression vector. Once the user-defined expression vector has been generated, it can be used in real time to convert new input text into expressive synthesised speech. Thus, the user-defined expression vector may be input into the expressive acoustic model (together with new received input text) to generate expressive speech from the received input text.
[097] More specifically, a method was designed to allow the user to control the output expression by interpolating/extrapolating between two pre-saved expression vectors representing particular styles. As shown on the left-hand side of Figure 8, expression vectors (ExpVec), i.e., the outputs of the expressivity characterisation module 104, for pre-recorded speech representing various different styles or speech characteristics are generated and stored. The corresponding ExpVec are interpolated/extrapolated based on the following linear function:

ExpVec_final = alpha * (ExpVec_style2 - ExpVec_style1) + ExpVec_style1

[098] The parameter alpha plays the role of a scalar:
* If alpha=0, ExpVec_final equals ExpVec_style1
* If alpha=1, ExpVec_final equals ExpVec_style2
* If 0<alpha<1, interpolation between both styles
* If alpha>1, extrapolation beyond style2
* If alpha<0, extrapolation beyond style1

[099] The resulting ExpVec (ExpVec_final) is stored and, when required, is provided as the conditioning input to the acoustic model 106, as shown on the right-hand side of Figure 8.
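The interpolation/extrapolation rule above can be written as a small helper function, as in the following sketch; the stored expression vectors are replaced here by random placeholders.

```python
import numpy as np

def interpolate_styles(exp_vec_style1, exp_vec_style2, alpha):
    """alpha=0 gives style1, alpha=1 gives style2, 0<alpha<1 interpolates,
    and alpha<0 or alpha>1 extrapolates beyond the corresponding style."""
    return alpha * (exp_vec_style2 - exp_vec_style1) + exp_vec_style1

exp_vec_style1 = np.random.randn(128)   # e.g. a stored "neutral" expression vector
exp_vec_style2 = np.random.randn(128)   # e.g. a stored "happy" expression vector

exp_vec_final = interpolate_styles(exp_vec_style1, exp_vec_style2, alpha=0.5)   # halfway blend
exp_vec_beyond = interpolate_styles(exp_vec_style1, exp_vec_style2, alpha=1.3)  # exaggerate style2
```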
[100] Figure 9 is a schematic diagram illustrating a normal inference mode for customising speech characteristics (such as speaking rate and pitch variations), and an inference mode which uses interpolation/extrapolation to further customise speech characteristics.
[101] A data augmentation procedure was designed in order to promote the learning of speaking rate and pitch variations. Two groups, each comprising 10% of the training data, are evenly selected across the training data set; one of them is used for pitch variation and the other for speaking rate variation. A standard off-line tool is used to modify: (1) the tempo without changing the pitch, for speaking rate variation learning; and (2) the pitch without changing the tempo, for pitch variation learning. For speaking rate variation, the corresponding group is converted twice: 20% faster and 20% slower. For pitch variation, the corresponding group is also converted twice: 2 semitones up and 2 semitones down. Thus, the extra data amounts to 40% of the original dataset. Finally, the training metadata is adapted accordingly.

[102] In order to benefit from the data augmentation procedure, a method to control the average pitch and the speaking rate of synthesised speech was developed. This method consists of applying an offset to the expression vector of a normal pitch and speaking rate reference audio file to: keep the style, and modify the average pitch and/or speaking rate.
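One possible realisation of the augmentation step of paragraph [101] is sketched below using SoX as the off-line tool; the description does not name a particular tool, so SoX and the file paths are assumptions. The tempo is changed by plus/minus 20% without altering pitch, and the pitch by plus/minus 2 semitones (200 cents) without altering tempo.

```python
import subprocess

def change_tempo(src, dst, factor):
    # SoX "tempo" effect changes the speaking rate while preserving pitch.
    subprocess.run(["sox", src, dst, "tempo", str(factor)], check=True)

def change_pitch(src, dst, semitones):
    # SoX "pitch" effect takes a shift in cents (100 cents = 1 semitone).
    subprocess.run(["sox", src, dst, "pitch", str(int(semitones * 100))], check=True)

change_tempo("utt_001.wav", "utt_001_fast.wav", 1.2)   # 20% faster
change_tempo("utt_001.wav", "utt_001_slow.wav", 0.8)   # 20% slower
change_pitch("utt_002.wav", "utt_002_up.wav", 2)       # 2 semitones up
change_pitch("utt_002.wav", "utt_002_down.wav", -2)    # 2 semitones down
```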
[103] The first step is the offset computation. Using the normal inference mode, shown on the left-hand side of Figure 8, the expression vectors of five groups of reference files are extracted and saved: fast speaking rate, slow speaking rate, higher average pitch, lower average pitch, and normal average pitch and speaking rate.
[104] Then, the features, i.e., the individual components of an expression vector, which exhibit a linear behaviour in relation to the pitch or the speaking rate variation are detected. Figure 10 shows some examples by means of boxplots of the expression vectors in each group; specifically, boxplot representations of three features of an expression vector for low, normal and high pitch. The technique focuses only on linear features in order to account for the variability, as well as to be able to interpolate/extrapolate, so that the output average pitch or speaking rate can be modified progressively.
[105] The offsets of linear features are computed as the difference between the medians of the distributions generated by the aforesaid groups. This idea is depicted in Figure 11, which shows a representation of the offset computation for a linear feature of an expression vector.
For the remaining features, the offsets are zero. At the end of this step, four offset vectors are saved: an offset vector for lower pitch modification, an offset vector for higher pitch modification, an offset vector for slower speaking rate modification, and an offset vector for faster speaking rate modification. The offset vectors can be applied both at normal inference and at inference by interpolation/extrapolation of two styles, as shown in Figure 11. The application of the offsets is as follows:

Offset_final = beta * initial_offset, with 0 < beta < 1.50

where, by means of beta, the user can control the degree of change in the output pitch or speaking rate, and then:

ExpVec_final = ExpVec_original - Offset_final

[106] The style, pitch and speaking rate modification techniques may be used with either the MHA sub-module or the VAE sub-module of the expressivity characterisation module. More specifically, when the MHA sub-module is used, speaking style (or expression) may be modified together with pitch, or speaking style may be modified together with speaking rate.
When the VAE sub-module is used, all three of speaking style, pitch and speaking rate may be modified simultaneously and independently. In either case, the modification techniques are compatible with each other and do not degrade speech quality. Moreover, they increase efficiency, since by feeding in the pre-saved expression vectors, the new reference encoder and MHA modules are not executed at inference time. Therefore, at inference time, the main extra computational load of the new acoustic model in comparison to the original sequence-to-sequence model (called Text2Mel in the original DC-TTS) is due to the FC layers for expression injection.
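The offset computation and application of paragraphs [104] and [105] can be sketched as follows; the dimensionality, the mask of linear features and the order of the median subtraction are assumptions made for illustration.

```python
import numpy as np

def compute_offset(vectors_normal, vectors_modified, linear_mask):
    # Difference between the medians of the two groups; zero for non-linear features.
    offset = np.median(vectors_normal, axis=0) - np.median(vectors_modified, axis=0)
    return np.where(linear_mask, offset, 0.0)

def apply_offset(exp_vec_original, initial_offset, beta):
    assert 0.0 < beta < 1.5, "beta controls the degree of pitch/speaking-rate change"
    offset_final = beta * initial_offset
    return exp_vec_original - offset_final

dim = 128
normal_vecs = np.random.randn(20, dim)        # normal pitch / speaking rate group
higher_pitch_vecs = np.random.randn(20, dim)  # higher average pitch group
linear_mask = np.zeros(dim, dtype=bool)
linear_mask[:16] = True                       # pretend the first 16 features behave linearly

offset_higher_pitch = compute_offset(normal_vecs, higher_pitch_vecs, linear_mask)
exp_vec_final = apply_offset(np.random.randn(dim), offset_higher_pitch, beta=0.7)
```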
[107] Thus far, modification of the original DC-TTS acoustic model has been described to provide an expressive acoustic model that is compatible with an LPCNet vocoder. An alternative implementation is to directly modify Tacotron-GST (Wang et al., "Style tokens: unsupervised style modelling, control and transfer in end-to-end speech synthesis") to work with LPCNet. This will not give the faster training times obtained when using a convolution-based acoustic model, such as the one described above; however, it may be a useful modification if a Tacotron-based model is already in use and the LPCNet vocoder is required for efficiency gains.
[108] Like the original DC-TTS, Tacotron-GST was designed to be used with a spectral-based vocoder, so both the target and the output of the acoustic model are an 80-dimensional Mel-based spectrum. However, as described previously, LPCNet requires a 22-dimensional feature set which includes: 20 normalised Bark-based cepstrum features, a period feature, and a correlation feature. Therefore, the input and output of the recurrent audio decoder module of the Tacotron-GST must be modified in a similar way to the input of the audio encoder and the output of the audio decoder sub-modules of DC-TTS (as described above) to accommodate the LPCNet features.
[109] The Tacotron-GST reference encoder was also designed to receive an 80-dimensional Mel-based spectrum as its input. However, if the rest of the acoustic model is modified to use the LPCNet features, this will cause a mismatch that will affect style transfer and speech quality, so instead 20 non-normalised Bark-based cepstrum features must be used as the input to the reference encoder. Since this input is a very coarse representation of the spectral content of the signal, a more complex reference encoder is required to learn a meaningful prosody and obtain successful style transfer, even for styles associated with a tiny pitch variation (e.g., a warm style). Therefore, in order to successfully work with LPCNet, the original Tacotron-GST reference encoder should be replaced with the new reference encoder proposed herein and described above with reference to Figure 5B.
[110] Those skilled in the art will appreciate that while the foregoing has described what is considered to be the best mode and where appropriate other modes of performing present techniques, the present techniques should not be limited to the specific configurations and methods disclosed in this description of the preferred embodiment. Those skilled in the art will recognise that present techniques have a broad range of applications, and that the embodiments may take a wide range of modifications without departing from any inventive concept as defined in the appended claims.

Claims (23)

  1. A system for synthesising expressive speech, comprising: an interface for receiving an input text for conversion to speech; at least one processor coupled to memory to: generate, using an expressivity characterisation module, a plurality of expression vectors, where each expression vector is a representation of prosodic information in a reference audio style file; and synthesise expressive speech from the input text, using an expressive acoustic model comprising a deep convolutional neural network that is conditioned by at least one of the plurality of generated expression vectors.
  2. The system as claimed in claim 1 wherein the expressivity characterisation module comprises: an interface for receiving a reference audio style file; and a reference encoder sub-module for compressing prosodic information of the received reference audio style file into a fixed-length vector.
  3. The system as claimed in claim 2 wherein the reference encoder sub-module comprises a plurality of two-dimensional convolutional layers, max pooling layers and residual connections for generating the fixed-length vector.
  4. The system as claimed in claim 2 or 3 wherein the expressivity characterisation module comprises: an attention sub-module for: receiving the fixed-length vector from the reference encoder sub-module; generating a set of weights corresponding to the prosodic information of the received reference audio style file; and outputting an expression vector comprising the set of weights, for the reference audio style file.
  5. The system as claimed in claim 4 wherein the attention sub-module is a multi-head attention sub-module.
  6. The system as claimed in claim 2 or 3 wherein the expressivity characterisation module comprises: a variational autoencoder sub-module comprising a plurality of fully-connected layers for: receiving the fixed-length vector from the reference encoder sub-module; generating a latent space corresponding to the prosodic information of the received reference audio style file; and outputting an expression vector for the reference audio style file.
  7. The system as claimed in any of claims 4 to 6 further comprising: storage for storing expression vectors for reference audio style files.
  8. The system as claimed in any preceding claim wherein the expressive acoustic model comprises: an audio encoder sub-module for: receiving pre-recorded or pre-synthesised speech features, and generating a vector corresponding to the received speech.
  9. The system as claimed in claim 8 wherein the audio encoder sub-module is further configured to: receive the at least one of the plurality of generated expression vectors, and generate a vector corresponding to the received speech, conditioned by the received expression vector.
  10. The system as claimed in claim 9 wherein receiving the at least one of the plurality of generated expression vectors comprises receiving a user-selected expression vector.
  11. The system as claimed in claim 9 wherein receiving the at least one of the plurality of generated expression vectors comprises receiving an expression vector selected to suit a context from which the received input text is obtained.
  12. The system as claimed in claim 8, 9, 10 or 11 wherein the speech received by the audio encoder sub-module comprises a plurality of audio frames, each audio frame comprising twenty Bark-based cepstrum features, a period feature and a correlation feature.
  13. The system as claimed in any of claims 8 to 12 wherein the expressive acoustic model comprises: a text encoder sub-module for: receiving phonemes or graphemes corresponding to the received input text; generating a first matrix V representing the value of each phoneme or grapheme in the received input; and generating a second matrix K representing the unique key associated with each value.
  14. The system as claimed in claim 13 wherein the expressive acoustic model comprises: a guided attention sub-module for: comparing the generated matrix and the generated first and second matrices; and determining a similarity between each character in the received input text with a sound represented in the matrix.
  15. The system as claimed in claim 14 wherein the expressive acoustic model comprises: an audio decoder sub-module for: receiving the generated expression vector used by the audio encoder, and generating acoustic features corresponding to an output of the guided attention sub-module, conditioned by the received expression vector.
  16. The system as claimed in claim 15 wherein the acoustic features generated by the audio decoder sub-module represent a plurality of audio frames, each audio frame comprising twenty Bark-based cepstrum features, a period feature and a correlation feature.
  17. The system as claimed in claim 15 or 16 further comprising a vocoder for synthesising speech using the acoustic features generated by the audio decoder sub-module.
  18. The system as claimed in claim 17 wherein the vocoder comprises an LPCNet model.
  19. The system as claimed in any preceding claim wherein the at least one processor coupled to memory is further configured to: generate, using an interpolation and extrapolation module, a user-defined expression vector for use by the expressive acoustic model to generate expressive speech from the input text.
  20. The system as claimed in claim 19 wherein the interpolation and extrapolation module is configured to: obtain, from storage, a first expression vector and a second expression vector, each representing a distinct style; perform a linear interpolation or extrapolation between the first expression vector and the second expression vector, using a user-defined scalar value; and generate the user-defined expression vector.
  21. The system as claimed in claim 20 wherein the user-defined expression vector and received input text are input into the expressive acoustic model to generate expressive speech from the received input text.
  22. A method for synthesising expressive speech, comprising: generating, using an expressivity characterisation module, a plurality of expression vectors, where each expression vector is a representation of prosodic information in a reference audio style file; and synthesising expressive speech from the input text, using an expressive acoustic model comprising a deep convolutional neural network that is conditioned by at least one of the plurality of generated expression vectors.
  23. A non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out the method of claim 22.

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB2000883.5A GB2591245B (en) 2020-01-21 2020-01-21 An expressive text-to-speech system
KR1020200062637A KR20210095010A (en) 2020-01-21 2020-05-25 Expressive text-to-speech system and method
US17/037,023 US11830473B2 (en) 2020-01-21 2020-09-29 Expressive text-to-speech system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2000883.5A GB2591245B (en) 2020-01-21 2020-01-21 An expressive text-to-speech system

Publications (3)

Publication Number Publication Date
GB202000883D0 GB202000883D0 (en) 2020-03-04
GB2591245A true GB2591245A (en) 2021-07-28
GB2591245B GB2591245B (en) 2022-06-15

Family

ID=69636811

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2000883.5A Active GB2591245B (en) 2020-01-21 2020-01-21 An expressive text-to-speech system

Country Status (2)

Country Link
KR (1) KR20210095010A (en)
GB (1) GB2591245B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951202B (en) * 2021-03-11 2022-11-08 北京嘀嘀无限科技发展有限公司 Speech synthesis method, apparatus, electronic device and program product
CN113611309B (en) * 2021-07-13 2024-05-10 北京捷通华声科技股份有限公司 Tone conversion method and device, electronic equipment and readable storage medium
CN115985282A (en) * 2021-10-14 2023-04-18 北京字跳网络技术有限公司 Method and device for adjusting speech rate, electronic equipment and readable storage medium
CN114255737B (en) * 2022-02-28 2022-05-17 北京世纪好未来教育科技有限公司 Voice generation method and device and electronic equipment
CN115116431B (en) * 2022-08-29 2022-11-18 深圳市星范儿文化科技有限公司 Audio generation method, device, equipment and storage medium based on intelligent reading kiosk

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150186359A1 (en) * 2013-12-30 2015-07-02 Google Inc. Multilingual prosody generation
US20190172443A1 (en) * 2017-12-06 2019-06-06 International Business Machines Corporation System and method for generating expressive prosody for speech synthesis
WO2019139428A1 (en) * 2018-01-11 2019-07-18 네오사피엔스 주식회사 Multilingual text-to-speech synthesis method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220406293A1 (en) * 2021-06-22 2022-12-22 Samsung Electronics Co., Ltd. Electronic device and method for controlling thereof
US11848004B2 (en) 2021-06-22 2023-12-19 Samsung Electronics Co., Ltd. Electronic device and method for controlling thereof
US11996084B2 (en) 2021-08-17 2024-05-28 Beijing Baidu Netcom Science Technology Co., Ltd. Speech synthesis method and apparatus, device and computer storage medium
US11978475B1 (en) * 2021-09-03 2024-05-07 Wells Fargo Bank, N.A. Systems and methods for determining a next action based on a predicted emotion by weighting each portion of the action's reply

Also Published As

Publication number Publication date
GB202000883D0 (en) 2020-03-04
KR20210095010A (en) 2021-07-30
GB2591245B (en) 2022-06-15
