CN117894294A - Personification auxiliary language voice synthesis method and system - Google Patents

Personification auxiliary language voice synthesis method and system

Info

Publication number
CN117894294A
CN117894294A (application CN202410288143.0A)
Authority
CN
China
Prior art keywords
language
voice
audio
tone
target tone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410288143.0A
Other languages
Chinese (zh)
Other versions
CN117894294B (en)
Inventor
刘刚
苏江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DMAI Guangzhou Co Ltd
Original Assignee
DMAI Guangzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DMAI Guangzhou Co Ltd filed Critical DMAI Guangzhou Co Ltd
Priority to CN202410288143.0A priority Critical patent/CN117894294B/en
Priority claimed from CN202410288143.0A external-priority patent/CN117894294B/en
Publication of CN117894294A publication Critical patent/CN117894294A/en
Application granted granted Critical
Publication of CN117894294B publication Critical patent/CN117894294B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention provides a personified auxiliary language voice synthesis method and system: auxiliary language tags are labeled on original tone voice data containing the auxiliary language, and auxiliary language pronunciation units with the target tone are acquired from the labeled original tone voice data combined with reference audio of the target tone; language input text is received, wherein the language input text comprises TTS text and auxiliary language tags marked at the corresponding positions in the TTS text; the TTS text is synthesized into target tone TTS voice, the corresponding auxiliary language pronunciation unit with the target tone is selected according to the auxiliary language tag, and it is spliced with the target tone TTS voice to generate audio with the target tone. The invention enables the speakers in the voice library to have auxiliary language pronunciation capability at low cost, improves the naturalness and fidelity of the TTS speaker in dialogue, and lets the AI communicate with zero distance in man-machine interaction.

Description

Personification auxiliary language voice synthesis method and system
Technical Field
The invention belongs to the technical field of voice processing, and relates to a personification auxiliary language voice synthesis method and system.
Background
Current speech synthesis technology can synthesize audio with high naturalness and high sound quality, and can meet many demands in everyday applications such as video dubbing and broadcasting. However, there is still a considerable gap compared with real human speech, especially in conversation scenes: in real conversation, a speaker uses pauses and hesitations such as fillers ("um", "ah") to think about the content of the next sentence, or produces laughter, exhalations and other non-verbal sounds to express the speaker's current state. Such pronunciations without actual semantic content are called secondary-language (paralinguistic) phenomena in phonetics. An existing TTS (Text To Speech) model always speaks fluently at the same speaking rate, so the whole conversation process is relatively mechanical and stiff, and the lack of the secondary language makes it difficult for TTS to achieve an anthropomorphic effect in conversation.
TTS with a secondary language has not been studied much so far, and there are few related products on the market. In general, to achieve secondary language pronunciation, additional secondary language customization is needed: first define the secondary language tags, design the scenes and the corresponding texts, have a voice actor record the audio according to the texts and the tags in them, and finally train the TTS model on the customized data so that it has the capability of secondary language pronunciation. This approach is theoretically feasible but has several problems. First, the customization cost is higher than that of general TTS data, and the data recording period is longer; second, whether the secondary language can be transferred remains to be verified; moreover, whether the secondary language can be recorded well depends on the voice actor, who is required to have the capability of secondary language pronunciation in the first place.
Therefore, how to provide a highly anthropomorphic and generalizable secondary language speech synthesis method and system is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the present invention provides a personified secondary language speech synthesis method and system, which can give the speakers in a speech library the capability of secondary language pronunciation at low cost, and improve the naturalness and realism of TTS speakers in conversation, so that the AI communicates with zero distance in man-machine interaction.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the invention discloses a personification auxiliary language voice synthesis method, which comprises the following steps:
s1: labeling the auxiliary language label on the original tone voice data containing the auxiliary language, and acquiring an auxiliary language pronunciation unit with a target tone according to the labeled original tone voice data and the reference audio of the target tone;
s2: receiving language input text, wherein the language input text comprises TTS text and a secondary language label marked at a corresponding position in the TTS text; and synthesizing the TTS text into target tone TTS voice, selecting a corresponding secondary language pronunciation unit with the target tone according to the secondary language label, and splicing the secondary language pronunciation unit with the target tone TTS voice to generate audio with the target tone.
It should be noted that, in this embodiment, the original tone color voice data includes one or more tone colors, and at least one tone color includes a sub-language.
Preferably, the S1 includes:
s11: a voice recognition step: labeling the auxiliary language label on the original tone color voice data containing the auxiliary language, and performing voice recognition on the labeled original tone color voice data to extract PPG features and fundamental frequency features;
s12: a voice conversion step: and performing content coding on the PPG characteristics, performing intonation coding on the fundamental frequency characteristics, performing tone coding on the reference audio of the target tone, and uniformly decoding the coding result to obtain the secondary language pronunciation unit with the target tone.
Preferably, the method further comprises a training step of a speech recognition model, wherein the speech recognition model is used for executing the speech recognition step:
collecting a voice dialogue data set containing a secondary language, and labeling the secondary language in the voice dialogue data set with a secondary language label to obtain a labeled voice dialogue data set;
constructing a wenet model, wherein the wenet model comprises a Conformer Encoder and is used for receiving input audio and outputting PPG features of the input audio;
pre-training the wenet model using a Chinese speech recognition dataset;
and fine-tuning the weight of the pretrained wenet model by using the labeled voice dialogue data set to obtain a trained voice recognition model.
Preferably, the method further comprises a training step of a speech conversion model, wherein the speech conversion model is used for executing the speech conversion step:
pre-training the speech conversion model using the audio of the chinese speech recognition dataset and PPG features and fundamental frequency features of the corresponding audio;
and fine tuning the weight of the pre-trained voice conversion model by using the audio of the target tone, the audio of the marked voice dialogue data set, and the PPG characteristic and the fundamental frequency characteristic of the corresponding audio to obtain the trained voice conversion model.
Preferably, the step S12 further includes, after uniformly decoding the encoding result, obtaining target tone audio data corresponding to the original tone voice data content, and intercepting target tone sub-language field audio data, that is, a sub-language pronunciation unit with target tone, by combining the position of the sub-language tag.
Preferably, storing the secondary language pronunciation unit with the target tone corresponding to the secondary language label acquired in the step S1 to obtain a secondary language database; and S2, searching the corresponding secondary language pronunciation unit with the target tone from the secondary language database according to the secondary language label.
Preferably, the step of splicing the target timbre TTS voice with the secondary language pronunciation unit with the target timbre in S2 includes:
constructing a voice smoothing model, and training an autoregressive model using the audio of the target tone as a training set, wherein the autoregressive model comprises a decoder and predicts the next voice frame from the previous voice frames;
and inputting the auxiliary language pronunciation unit with the target tone and the target tone TTS voice into the trained voice smoothing model for voice smoothing, and outputting the audio with the target tone.
The invention also discloses a personified auxiliary language voice synthesis system according to the personified auxiliary language voice synthesis method, which comprises the following steps:
the auxiliary language unit extraction subsystem is used for labeling auxiliary language labels on the original tone voice data containing the auxiliary language, and acquiring an auxiliary language pronunciation unit with a target tone according to the labeled original tone voice data and the reference audio of the target tone;
the sub-language synthesis subsystem is used for receiving language input text, wherein the language input text comprises TTS text and sub-language labels marked at corresponding positions in the TTS text; and synthesizing the TTS text into target tone TTS voice, selecting a corresponding secondary language pronunciation unit with the target tone according to the secondary language label, and splicing the secondary language pronunciation unit with the target tone TTS voice to generate audio with the target tone.
Preferably, the secondary language unit extraction subsystem includes:
the voice recognition module is used for labeling the auxiliary language tag on the original tone voice data containing the auxiliary language, and performing voice recognition on the labeled original tone voice data to extract PPG features and fundamental frequency features;
and the voice conversion module is used for carrying out content coding on the PPG characteristics, carrying out intonation coding on the fundamental frequency characteristics, carrying out tone coding on the reference audio of the target tone, and uniformly decoding the coding result to obtain the auxiliary language pronunciation unit with the target tone.
Preferably, the speech recognition module comprises a speech recognition model, and the speech recognition model is trained according to the following steps:
collecting a voice dialogue data set containing a secondary language, and labeling the secondary language in the voice dialogue data set with a secondary language label to obtain a labeled voice dialogue data set;
constructing a wenet model, wherein the wenet model comprises a Conformer Encoder and is used for receiving input audio and outputting PPG features of the input audio;
pre-training the wenet model using a Chinese speech recognition dataset;
and fine-tuning the weight of the pretrained wenet model by using the labeled voice dialogue data set to obtain a trained voice recognition model.
Preferably, the voice conversion module comprises a voice conversion model, and the voice conversion model is trained according to the following steps:
pre-training the speech conversion model using the audio of the chinese speech recognition dataset and PPG features and fundamental frequency features of the corresponding audio;
and fine tuning the weight of the pre-trained voice conversion model by using the audio of the target tone, the audio of the marked voice dialogue data set, and the PPG characteristic and the fundamental frequency characteristic of the corresponding audio to obtain the trained voice conversion model.
Preferably, the system further comprises a secondary language pronunciation unit interception module, configured to perform an interception operation on the target tone audio data corresponding to the original tone voice data content obtained by uniformly decoding the encoding result, wherein the interception operation comprises: intercepting the target tone secondary language field audio data, i.e. the secondary language pronunciation unit with the target tone, by combining the position of the secondary language tag.
Preferably, the system further comprises a secondary language database for storing the secondary language pronunciation units with the target tone corresponding to the different secondary language labels; and the secondary language synthesis subsystem retrieves the corresponding secondary language pronunciation unit with the target tone from the secondary language database according to the secondary language label.
Preferably, the system further comprises a voice smoothing module, wherein a voice smoothing model is built in the voice smoothing module, the voice smoothing model comprises an autoregressive model trained by using the audio of the target tone as a training set, the autoregressive model comprises a decoder, and the next voice frame is predicted by using the previous voice frame;
the voice smoothing model is used for receiving the input auxiliary language pronunciation unit with the target tone and the target tone TTS voice, performing voice smoothing, and outputting audio with the target tone.
Compared with the prior art, the invention has the following beneficial effects:
the method and the system for synthesizing the auxiliary language voice can realize the capability of the target speaker for auxiliary language pronunciation with extremely low cost, on one hand, the method has high expansibility and is suitable for any new speaker in storage, and on the other hand, as the conversion model and the smooth model are adopted, the tone and the naturalness of the auxiliary language pronunciation unit and the TTS synthesized audio can be perfectly received, the pronunciation with high anthropomorphic degree of a TTS system is realized, and more immersive dialogue in a man-machine interaction scene is realized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method of personified secondary language speech synthesis in an embodiment of the present invention;
FIG. 2 is a schematic diagram of the operation of the sub-linguistic unit extraction subsystem in an embodiment of the present invention;
FIG. 3 is a flow chart of a sub-language optimization conversion in an embodiment of the invention;
FIG. 4 is a schematic diagram of a speech recognition model in an embodiment of the invention;
FIG. 5 is a schematic diagram of a speech conversion model in an embodiment of the invention;
FIG. 6 is a schematic diagram of the operation of the sub-language synthesis subsystem in an embodiment of the invention;
FIG. 7 is a schematic diagram of a speech smoothing model based on an autoregressive model in an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The first aspect of the embodiment of the invention provides a personified auxiliary language voice synthesis method. As shown in fig. 1, the method comprises the following steps:
s1: sub-language unit extraction sub-flow: and labeling the auxiliary language label on the original tone voice data containing the auxiliary language, and acquiring the auxiliary language pronunciation unit with the target tone according to the labeled original tone voice data and the reference audio of the target tone.
S2: sub-language synthesis sub-flow: receiving language input text, wherein the language input text comprises TTS text and a secondary language label marked at a corresponding position in the TTS text; and synthesizing the TTS text into target tone TTS voice, selecting a corresponding secondary language pronunciation unit with the target tone according to the secondary language label, and splicing the target tone TTS voice to generate audio with the target tone.
In one embodiment, obtaining the secondary language pronunciation units with the target tone is one of the core flows of the whole secondary language speech synthesis system, because it ensures that the timbre of the secondary language speech segments is consistent with that of the audio synthesized by the TTS model. Therefore, the speech carrying the secondary language in the data set must be converted to the target speaker so that it takes on the target speaker's timbre. However, direct conversion is not ideal: one of the cores of voice conversion is content extraction by speech recognition, current speech recognition models do not support recognizing the secondary language, and secondary language conversion therefore often produces wrong pronunciation content. Supporting voice conversion of the secondary language is thus one of the key technologies for realizing secondary language speech synthesis. As shown in FIG. 2, the invention uses the secondary language tags as recognition units for speech recognition, fine-tunes the speech recognition model on the labeled speech data carrying secondary language tags, and optimizes the voice conversion model based on the features produced by the optimized speech recognition. The core modules of the secondary language unit conversion technique and their optimization are described in detail below.
As shown in fig. 3, the sub-language conversion optimization process in S1 includes:
s11: a voice recognition step: labeling the auxiliary language label on the original tone color voice data containing the auxiliary language, and performing voice recognition on the labeled original tone color voice data to extract PPG features and fundamental frequency features;
s12: a voice conversion step: and performing content coding on the PPG characteristic, performing intonation coding on the fundamental frequency characteristic, performing tone coding on the reference audio of the target tone, and uniformly decoding the coding result to obtain the auxiliary language pronunciation unit with the target tone.
The training process of the conversion optimization model of this embodiment aims to obtain a secondary language pronunciation unit with a target tone from the voice dialogue data set for the subsequent secondary language synthesis subsystem to splice and synthesize TTS text.
The specific implementation process is as follows:
in this embodiment, the speech recognition model is used for executing a speech recognition step, and the training step of the speech recognition model includes:
S110: In general, a large amount of secondary language pronunciation occurs in the course of a conversation, so a speech data set of daily conversations is collected first, and the secondary language in the data set is labeled according to the predefined secondary language tags to obtain the labeled voice dialogue data set. The designed secondary language tags are shown in Table 1; that is, the text content corresponding to a speech segment with the semantics listed in the table is marked as the corresponding secondary language tag.
Table 1. Secondary language tag table:
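Since the contents of Table 1 are not reproduced in this text, the sketch below only illustrates one plausible form such a tag set could take in code; the specific tags and their descriptions are assumptions for illustration, not the actual label set defined in Table 1.

# Hypothetical secondary language tag set; the real Table 1 defines the actual tags.
SECONDARY_LANGUAGE_TAGS = {
    "[laugh]":  "laughter during or between utterances",
    "[sigh]":   "an audible sigh or exhalation",
    "[breath]": "a clearly audible in-breath",
    "[um]":     "a filled pause while thinking about the next sentence",
}

# Example of a labeled utterance: tags are inserted at the positions of the
# corresponding speech segments (the sentence itself is illustrative).
labeled_utterance = "[um] I think this plan is [laugh] quite interesting."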
S120: The wenet model is constructed. As shown in fig. 4, the input audio is passed through the Conformer Encoder to produce PPG features, where PPG stands for phonetic posteriorgram (phoneme posterior probabilities); these are the key features used by the subsequent speech conversion model. The CTC Decoder converts the PPG features into the corresponding text, but the text output of the CTC Decoder is not used in this embodiment.
S130: the web model is pre-trained using a large-scale chinese speech recognition dataset.
S140: the weight of the pre-trained wenet model is finely adjusted by using the labeled voice dialogue data set, so that the voice recognition model has better effect on the auxiliary language recognition, and the trained voice recognition model is obtained. Finally, the model is used to extract PPG features on the training data set for optimization of the subsequent speech conversion model.
In this embodiment, the speech conversion model is used to perform the speech conversion step. The speech conversion model is intended to convert the input audio so that it has the timbre of the target speaker while preserving the spoken content of the input audio. As shown in fig. 5, the speech conversion model mainly includes four sub-modules: a content encoder, an intonation encoder, a timbre encoder and a decoder. The training and optimization of each sub-module and of the whole model are described in detail below:
Content encoder:
The PPG features of the input audio mainly retain the content information of the input audio, which is essential for keeping the converted content consistent. The content encoder further processes the PPG features extracted by the speech recognition model so that they can be fed into the subsequent decoder to generate the converted audio. The entire content encoder consists of multiple layers of one-dimensional convolutions for fast inference; alternatively, higher-order feature processing structures such as Transformers may be used.
Intonation encoder:
The PPG features of the input audio contain essentially no style information such as intonation, whereas the pronunciation of the secondary language carries far more style information than ordinary text, so the fundamental frequency feature extracted from the input audio is indispensable as the intonation feature of the conversion process. The intonation encoder consists of multiple convolutional layers; it processes the fundamental frequency information extracted from the input audio, and its output serves as one of the inputs to the decoder.
Timbre encoder:
In order to give the converted audio the target timbre, the timbre encoder extracts the overall prosody and timbre information from the reference audio of the target speaker: a fixed-length one-dimensional vector is obtained from the reference audio through several convolution and pooling layers, and this vector is added to, or concatenated with, the content information for subsequent audio generation.
Decoder:
The decoder adopts a classical Transformer structure, integrates the content, intonation and timbre information output by the encoders, and finally outputs converted audio with the target timbre.
Alternatively, the decoder may employ other networks, such as a Conformer.
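Putting the four sub-modules together, a minimal PyTorch sketch of the conversion model could look like the following; the layer sizes, the additive fusion of the timbre vector, and the use of self-attention blocks in place of a full Transformer decoder are simplifying assumptions for illustration, not the concrete architecture of the embodiment.

import torch
import torch.nn as nn

class VoiceConversionModel(nn.Module):
    """Sketch: content encoder (1-D convolutions over PPG), intonation encoder
    (convolutions over F0), timbre encoder (convolutions + pooling over a reference
    mel-spectrogram), and a Transformer-style decoder producing converted frames."""
    def __init__(self, ppg_dim=256, mel_dim=80, hidden=256):
        super().__init__()
        self.content_enc = nn.Sequential(
            nn.Conv1d(ppg_dim, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 5, padding=2))
        self.pitch_enc = nn.Sequential(
            nn.Conv1d(1, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 5, padding=2))
        self.timbre_enc = nn.Sequential(
            nn.Conv1d(mel_dim, hidden, 5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))                   # fixed-length speaker vector
        dec_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                               batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=4)
        self.out = nn.Linear(hidden, mel_dim)

    def forward(self, ppg, f0, ref_mel):
        # ppg: (B, T, ppg_dim), f0: (B, T), ref_mel: (B, T_ref, mel_dim)
        c = self.content_enc(ppg.transpose(1, 2)).transpose(1, 2)   # content coding
        p = self.pitch_enc(f0.unsqueeze(1)).transpose(1, 2)         # intonation coding
        s = self.timbre_enc(ref_mel.transpose(1, 2)).squeeze(-1)    # timbre coding
        h = c + p + s.unsqueeze(1)            # add the timbre vector to every frame
        return self.out(self.decoder(h))      # converted target tone frames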
The training step of the speech conversion model also adopts a mode of pre-training and fine tuning, and comprises the following steps:
pre-training a voice conversion model by using the audio of the Chinese voice recognition data set and PPG characteristics and fundamental frequency characteristics of the corresponding audio;
and fine-tuning the weights of the pre-trained voice conversion model using the audio of the target tone, the audio of the labeled voice dialogue data set, and the PPG features and fundamental frequency features of the corresponding audio, so that the secondary language conversion effect is better, thereby obtaining the trained voice conversion model.
In one embodiment, after the optimized voice conversion model is obtained, the voices with the secondary languages in the dialogue data set are converted to obtain voices with target tone colors, and the audio data of the target tone color secondary language fields in the voices are extracted as a basic target speaker secondary language pronunciation unit by combining the positions of the secondary language labels.
In one embodiment, the secondary language pronunciation units with the target tone corresponding to the different secondary language labels acquired in step S1 are stored to obtain a secondary language database; in S2, the corresponding secondary language pronunciation unit with the target tone is retrieved from the secondary language database according to the secondary language label.
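A minimal sketch of building such a database, assuming each labeled and converted utterance records the tag together with its start and end time so that the target tone audio can be cut at those positions; the file layout and field names are assumptions for illustration.

import json
from pydub import AudioSegment

def build_secondary_language_db(entries, db_path="secondary_language_db.json"):
    """entries: list of dicts such as
    {"converted_wav": "utt01_converted.wav", "tag": "[laugh]", "start_ms": 1200, "end_ms": 1900}
    describing converted target tone utterances and the positions of their tags."""
    db = {}
    for e in entries:
        audio = AudioSegment.from_wav(e["converted_wav"])
        unit = audio[e["start_ms"]:e["end_ms"]]                 # cut out the tagged field
        unit_path = f'{e["tag"].strip("[]")}_{len(db.get(e["tag"], []))}.wav'
        unit.export(unit_path, format="wav")
        db.setdefault(e["tag"], []).append(unit_path)           # index units by tag
    with open(db_path, "w") as f:
        json.dump(db, f, ensure_ascii=False, indent=2)
    return db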
In one embodiment, as shown in fig. 6, to synthesize speech with the secondary language, S2 first separates, in order, the TTS text and the secondary language tags from the input text carrying secondary language tags, where the TTS text is the language text that does not include the secondary language.
For the TTS text, the corresponding speech is synthesized with the TTS model; for each secondary language tag, the corresponding secondary language pronunciation unit is retrieved from the secondary language database. When all the speech fragments are spliced in sequence, the volume of each audio segment needs to be unified; the common pydub audio tool is used for volume balancing. This basically achieves the goal of giving the target speaker the capability of secondary language pronunciation.
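A minimal sketch of this splicing step with pydub volume balancing; the target loudness value and the tag-parsing helper are illustrative assumptions.

import re
from pydub import AudioSegment

def splice_with_volume_balance(segment_paths, target_dbfs=-20.0):
    """segment_paths: wav files of the TTS pieces and secondary language units,
    already in playback order. Unify the loudness of each piece, then concatenate."""
    out = AudioSegment.empty()
    for path in segment_paths:
        piece = AudioSegment.from_wav(path)
        piece = piece.apply_gain(target_dbfs - piece.dBFS)      # volume balancing
        out += piece
    return out

def split_tagged_text(tagged_text):
    """Split input such as 'That is a good idea [laugh] let us try it.' into
    alternating text pieces and secondary language tags, in reading order."""
    return [p.strip() for p in re.split(r"(\[\w+\])", tagged_text) if p.strip()]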
In this embodiment, the junction between the target tone secondary language pronunciation unit and the TTS speech produced in S2 inevitably has unnatural transitions and needs to be smoothed, but it is difficult to smooth with rule-based methods, because the pronunciation conditions before and after the junction are varied and complex.
Therefore, this embodiment designs a speech smoothing process in the secondary language synthesis subsystem, which can automatically handle the inconsistency of the speech before and after the splice point. A decoder-only autoregressive model is trained using the audio of the target speaker, i.e. the target tone, as the training set. As shown in fig. 7, the model is a Transformer Decoder structure containing a total of 6 Transformer blocks: given the previous speech frames x_1, ..., x_i, it predicts the next speech frame x_{i+1}; the loss is computed with the mean squared error and the model parameters are optimized accordingly. The model is trained on the audio of the target speaker; during inference, the spliced speech consisting of the TTS speech and the secondary language units is fed in, and smoothed audio is output, which effectively alleviates the problem of unnatural transitions at the splice points.
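A minimal sketch of the decoder-only smoothing model and its next-frame training objective; only the 6-block decoder, the previous-frame conditioning and the mean squared error loss follow the text above, while the frame dimension and the other hyperparameters are assumptions.

import torch
import torch.nn as nn

class SmoothingModel(nn.Module):
    """Decoder-only autoregressive model: predicts frame x_{i+1} from x_1 ... x_i."""
    def __init__(self, frame_dim=80, hidden=256, n_layers=6, n_heads=4):
        super().__init__()
        self.inp = nn.Linear(frame_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(hidden, frame_dim)

    def forward(self, frames):                                   # frames: (B, T, frame_dim)
        T = frames.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"),
                                       device=frames.device), diagonal=1)
        h = self.blocks(self.inp(frames), mask=causal)           # causal self-attention
        return self.out(h)                                       # next-frame predictions

def train_step(model, frames, optimizer):
    """frames: (B, T, frame_dim) acoustic frames from the target speaker's audio."""
    pred = model(frames[:, :-1])                                 # condition on previous frames
    loss = nn.functional.mse_loss(pred, frames[:, 1:])           # mean squared error loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()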
A second aspect of the embodiment of the present invention proposes a personified secondary language speech synthesis system according to the first aspect of the embodiment, comprising:
the auxiliary language unit extraction subsystem is used for labeling auxiliary language labels on the original tone voice data containing the auxiliary language, and acquiring an auxiliary language pronunciation unit with a target tone according to the labeled original tone voice data and the reference audio of the target tone;
the sub-language synthesis subsystem is used for receiving language input texts, wherein the language input texts comprise TTS texts and sub-language labels marked at corresponding positions in the TTS texts; and synthesizing the TTS text into target tone TTS voice, selecting a corresponding secondary language pronunciation unit with the target tone according to the secondary language label, and splicing it with the target tone TTS voice to generate audio with the target tone.
In one embodiment, the secondary language unit extraction subsystem includes:
the voice recognition module is used for labeling the auxiliary language tag on the original tone voice data containing the auxiliary language, and performing voice recognition on the labeled original tone voice data to extract PPG features and fundamental frequency features;
and the voice conversion module is used for carrying out content coding on the PPG characteristics, carrying out intonation coding on the fundamental frequency characteristics, carrying out tone coding on the reference audio of the target tone, and uniformly decoding the coding result to obtain the auxiliary language pronunciation unit with the target tone.
In this embodiment, the speech recognition module includes a speech recognition model, and the speech recognition model is trained according to the following steps:
collecting a voice dialogue data set containing a secondary language, and labeling the secondary language in the voice dialogue data set with a secondary language label to obtain a labeled voice dialogue data set;
constructing a wenet model, wherein the wenet model comprises a Conformer Encoder and is used for receiving input audio and outputting PPG features of the input audio;
pre-training a wenet model by using a Chinese voice recognition data set;
and fine-tuning the weight of the pre-trained wenet model by using the labeled voice dialogue data set to obtain a trained voice recognition model.
In this embodiment, the voice conversion module includes a voice conversion model, and the voice conversion model is trained according to the following steps:
pre-training the voice conversion model using the audio of the Chinese voice recognition data set and the PPG features and fundamental frequency features of the corresponding audio;
and fine-tuning the weights of the pre-trained voice conversion model using the audio of the target tone, the audio of the labeled voice dialogue data set, and the PPG features and fundamental frequency features of the corresponding audio, to obtain the trained voice conversion model.
In one embodiment, the system further includes a secondary language pronunciation unit interception module, configured to perform an interception operation on the target tone audio data corresponding to the original tone voice data content obtained by uniformly decoding the encoding result, wherein the interception operation comprises: intercepting the target tone secondary language field audio data, i.e. the secondary language pronunciation unit with the target tone, by combining the position of the secondary language tag.
In one embodiment, the system further comprises a secondary language database for storing the secondary language pronunciation units with the target tone corresponding to the different secondary language labels; the secondary language synthesis subsystem retrieves the corresponding secondary language pronunciation unit with the target tone from the secondary language database according to the secondary language label.
In one embodiment, the system further comprises a voice smoothing module, wherein a voice smoothing model is built in the voice smoothing module, the voice smoothing model comprises an autoregressive model trained by using the audio of the target tone as a training set, the autoregressive model comprises a decoder, and the next voice frame is predicted by using the previous voice frame;
the voice smoothing model is used for receiving input auxiliary language pronunciation units with target tone colors and performing voice smoothing on target tone color TTS voices and outputting audio with the target tone colors.
The second aspect of the embodiment of the present invention is applicable to all execution procedures of the anthropomorphic sub-language speech synthesis method set forth in the first aspect of the embodiment.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other.
The personified auxiliary language speech synthesis method and system provided by the invention are described in detail above, and specific examples are used herein to illustrate the principle and implementation of the invention; the above description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, for those skilled in the art, the specific embodiments and the scope of application may vary according to the idea of the invention. In summary, the contents of this specification should not be construed as limiting the invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined in this embodiment may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A personification auxiliary language voice synthesis method is characterized in that: the method comprises the following steps:
s1: labeling the auxiliary language label on the original tone voice data containing the auxiliary language, and acquiring an auxiliary language pronunciation unit with a target tone according to the labeled original tone voice data and the reference audio of the target tone;
s2: receiving language input text, wherein the language input text comprises TTS text and a secondary language label marked at a corresponding position in the TTS text; and synthesizing the TTS text into target tone TTS voice, selecting a corresponding secondary language pronunciation unit with the target tone according to the secondary language label, and splicing the secondary language pronunciation unit with the target tone TTS voice to generate audio with the target tone.
2. The method of personified secondary language speech synthesis according to claim 1, wherein S1 comprises:
s11: a voice recognition step: labeling the auxiliary language label on the original tone color voice data containing the auxiliary language, and performing voice recognition on the labeled original tone color voice data to extract PPG features and fundamental frequency features;
s12: a voice conversion step: and performing content coding on the PPG characteristics, performing intonation coding on the fundamental frequency characteristics, performing tone coding on the reference audio of the target tone, and uniformly decoding the coding result to obtain the secondary language pronunciation unit with the target tone.
3. The personified bilingual speech synthesis method of claim 2, further comprising a training step of a speech recognition model, wherein the speech recognition model is configured to perform the speech recognition step:
collecting a voice dialogue data set containing a secondary language, and labeling the secondary language in the voice dialogue data set with a secondary language label to obtain a labeled voice dialogue data set;
constructing a wenet model, wherein the wenet model comprises a Conformer Encoder and is used for receiving input audio and outputting PPG features of the input audio;
pre-training the wenet model using a Chinese speech recognition dataset;
and fine-tuning the weight of the pretrained wenet model by using the labeled voice dialogue data set to obtain a trained voice recognition model.
4. The personified bilingual speech synthesis method of claim 2, further comprising a training step of a speech conversion model, wherein the speech conversion model is configured to perform the speech conversion step:
pre-training the speech conversion model using the audio of the chinese speech recognition dataset and PPG features and fundamental frequency features of the corresponding audio;
and fine tuning the weight of the pre-trained voice conversion model by using the audio of the target tone, the audio of the marked voice dialogue data set, and the PPG characteristic and the fundamental frequency characteristic of the corresponding audio to obtain the trained voice conversion model.
5. The personified secondary language speech synthesis method of claim 2, wherein S12 further comprises, after uniformly decoding the encoding result, obtaining target tone audio data corresponding to the original tone voice data content, and intercepting the target tone secondary language field audio data, i.e. the secondary language pronunciation unit with the target tone, by combining the position of the secondary language tag.
6. The personified secondary language speech synthesis method according to claim 1, wherein the secondary language pronunciation unit with the target tone corresponding to the secondary language label acquired in the step S1 is stored to obtain a secondary language database; and S2, searching the corresponding secondary language pronunciation unit with the target tone from the secondary language database according to the secondary language label.
7. The personified secondary language speech synthesis method of claim 1, wherein the step of splicing the secondary language pronunciation unit with the target tone and the target tone TTS voice in S2 comprises:
constructing a voice smoothing model, and training an autoregressive model using the audio of the target tone as a training set, wherein the autoregressive model comprises a decoder and predicts the next voice frame from the previous voice frames;
and inputting the secondary language pronunciation unit with the target tone and the target tone TTS voice into the trained voice smoothing model for voice smoothing, and outputting the audio with the target tone.
8. A personified secondary language speech synthesis system according to any one of claims 1 to 7, comprising:
the auxiliary language unit extraction subsystem is used for labeling auxiliary language labels on the original tone voice data containing the auxiliary language, and acquiring an auxiliary language pronunciation unit with a target tone according to the labeled original tone voice data and the reference audio of the target tone;
the sub-language synthesis subsystem is used for receiving language input text, wherein the language input text comprises TTS text and sub-language labels marked at corresponding positions in the TTS text; and synthesizing the TTS text into target tone TTS voice, selecting a corresponding secondary language pronunciation unit with the target tone according to the secondary language label, and splicing the secondary language pronunciation unit with the target tone TTS voice to generate audio with the target tone.
9. The personified secondary language speech synthesis system of claim 8, wherein the secondary language unit extraction subsystem comprises:
the voice recognition module is used for labeling the auxiliary language tag on the original tone voice data containing the auxiliary language, and performing voice recognition on the labeled original tone voice data to extract PPG features and fundamental frequency features;
and the voice conversion module is used for carrying out content coding on the PPG characteristics, carrying out intonation coding on the fundamental frequency characteristics, carrying out tone coding on the reference audio of the target tone, and uniformly decoding the coding result to obtain the auxiliary language pronunciation unit with the target tone.
10. The personified secondary language speech synthesis system of claim 9, wherein the speech recognition module comprises a speech recognition model that is trained by:
collecting a voice dialogue data set containing a secondary language, and labeling the secondary language in the voice dialogue data set with a secondary language label to obtain a labeled voice dialogue data set;
constructing a wenet model, wherein the wenet model comprises a Conformer Encoder and is used for receiving input audio and outputting PPG features of the input audio;
pre-training the wenet model using a Chinese speech recognition dataset;
and fine-tuning the weight of the pretrained wenet model by using the labeled voice dialogue data set to obtain a trained voice recognition model.
11. The personified secondary language speech synthesis system of claim 9, wherein the speech conversion module comprises a speech conversion model that is trained by:
pre-training the speech conversion model using the audio of the Chinese speech recognition dataset and the PPG features and fundamental frequency features of the corresponding audio;
and fine tuning the weight of the pre-trained voice conversion model by using the audio of the target tone, the audio of the marked voice dialogue data set, and the PPG characteristic and the fundamental frequency characteristic of the corresponding audio to obtain the trained voice conversion model.
12. The personified secondary language speech synthesis system of claim 9, further comprising a secondary language pronunciation unit interception module for performing an interception operation on the target tone audio data corresponding to the original tone speech data content obtained by uniformly decoding the encoding result, wherein the interception operation comprises: intercepting the target tone secondary language field audio data, i.e. the secondary language pronunciation unit with the target tone, by combining the position of the secondary language tag.
13. The personified secondary language speech synthesis system of claim 8, further comprising a secondary language database for storing the secondary language pronunciation units with the target tone corresponding to the different secondary language labels; and the secondary language synthesis subsystem retrieves the corresponding secondary language pronunciation unit with the target tone from the secondary language database according to the secondary language label.
14. The personified secondary language speech synthesis system of claim 8, further comprising a speech smoothing module within which a speech smoothing model is built, the speech smoothing model comprising an autoregressive model trained using audio of a target timbre as a training set, the autoregressive model comprising a decoder to predict a next speech frame using a previous speech frame;
the voice smoothing model is used for receiving the input secondary language pronunciation unit with the target tone and the target tone TTS voice, performing voice smoothing, and outputting audio with the target tone.
CN202410288143.0A 2024-03-14 Personification auxiliary language voice synthesis method and system Active CN117894294B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410288143.0A CN117894294B (en) 2024-03-14 Personification auxiliary language voice synthesis method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410288143.0A CN117894294B (en) 2024-03-14 Personification auxiliary language voice synthesis method and system

Publications (2)

Publication Number Publication Date
CN117894294A true CN117894294A (en) 2024-04-16
CN117894294B CN117894294B (en) 2024-07-05

Patent Citations (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003271171A (en) * 2002-03-14 2003-09-25 Matsushita Electric Ind Co Ltd Method, device and program for voice synthesis
US20050273338A1 (en) * 2004-06-04 2005-12-08 International Business Machines Corporation Generating paralinguistic phenomena via markup
US20060080098A1 (en) * 2004-09-30 2006-04-13 Nick Campbell Apparatus and method for speech processing using paralinguistic information in vector form
US20080167875A1 (en) * 2007-01-09 2008-07-10 International Business Machines Corporation System for tuning synthesized speech
US20080235024A1 (en) * 2007-03-20 2008-09-25 Itzhack Goldberg Method and system for text-to-speech synthesis with personalized voice
US20080243474A1 (en) * 2007-03-28 2008-10-02 Kentaro Furihata Speech translation apparatus, method and program
CN101373592A (en) * 2007-08-21 2009-02-25 株式会社东芝 Speech translation apparatus and method
US20090055158A1 (en) * 2007-08-21 2009-02-26 Kabushiki Kaisha Toshiba Speech translation apparatus and method
US10319365B1 (en) * 2016-06-27 2019-06-11 Amazon Technologies, Inc. Text-to-speech processing with emphasized output audio
US20180133900A1 (en) * 2016-11-15 2018-05-17 JIBO, Inc. Embodied dialog and embodied speech authoring tools for use with an expressive social robot
CN110288973A (en) * 2019-05-20 2019-09-27 平安科技(深圳)有限公司 Phoneme synthesizing method, device, equipment and computer readable storage medium
WO2021061484A1 (en) * 2019-09-27 2021-04-01 Amazon Technologies, Inc. Text-to-speech processing
US20230069908A1 (en) * 2020-02-21 2023-03-09 Nippon Telegraph And Telephone Corporation Recognition apparatus, learning apparatus, methods and programs for the same
CN113808576A (en) * 2020-06-16 2021-12-17 阿里巴巴集团控股有限公司 Voice conversion method, device and computer system
US20220406292A1 (en) * 2020-06-22 2022-12-22 Sri International Controllable, natural paralinguistics for text to speech synthesis
CN113345431A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Cross-language voice conversion method, device, equipment and medium
CN114255738A (en) * 2021-12-30 2022-03-29 北京有竹居网络技术有限公司 Speech synthesis method, apparatus, medium, and electronic device
CN114464162A (en) * 2022-04-12 2022-05-10 阿里巴巴达摩院(杭州)科技有限公司 Speech synthesis method, neural network model training method, and speech synthesis model
CN114945110A (en) * 2022-05-31 2022-08-26 深圳市优必选科技股份有限公司 Speaking head video synthesis method and device, terminal equipment and readable storage medium
CN115547288A (en) * 2022-09-19 2022-12-30 北京羽扇智信息科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN116092478A (en) * 2023-02-16 2023-05-09 平安科技(深圳)有限公司 Voice emotion conversion method, device, equipment and storage medium
CN116189652A (en) * 2023-02-20 2023-05-30 北京有竹居网络技术有限公司 Speech synthesis method and device, readable medium and electronic equipment
CN116343747A (en) * 2023-03-15 2023-06-27 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic device, and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KANE J et al.: "Voice Synthesis Improvement by Machine Learning of Natural Prosody", Sensors, vol. 24, no. 5, 1 March 2024 (2024-03-01), pages 1-22 *
徐杰 (XU Jie): "Low-data-resource emotional speech synthesis based on transfer learning", China Master's Theses Full-text Database, Information Science and Technology series, 15 February 2023 (2023-02-15), pages 136-384 *

Similar Documents

Publication Publication Date Title
US20200226327A1 (en) System and method for direct speech translation system
EP3994683B1 (en) Multilingual neural text-to-speech synthesis
CN105845125B (en) Phoneme synthesizing method and speech synthetic device
CN112037754B (en) Method for generating speech synthesis training data and related equipment
CN112863483A (en) Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
JP2002530703A (en) Speech synthesis using concatenation of speech waveforms
Meinedo et al. The L2F broadcast news speech recognition system
CN109036371A (en) Audio data generation method and system for speech synthesis
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
JPWO2007141993A1 (en) Speech synthesizer
CN112102811B (en) Optimization method and device for synthesized voice and electronic equipment
Luong et al. Bootstrapping non-parallel voice conversion from speaker-adaptive text-to-speech
CN111710326A (en) English voice synthesis method and system, electronic equipment and storage medium
WO2012164835A1 (en) Prosody generator, speech synthesizer, prosody generating method and prosody generating program
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN110808028B (en) Embedded voice synthesis method and device, controller and medium
CN110930975A (en) Method and apparatus for outputting information
Zhang et al. AccentSpeech: Learning accent from crowd-sourced data for target speaker TTS with accents
KR102072627B1 (en) Speech synthesis apparatus and method thereof
CN113851140A (en) Voice conversion correlation method, system and device
CN117894294B (en) Personification auxiliary language voice synthesis method and system
CN113870833A (en) Speech synthesis related system, method, device and equipment
CN113948062B (en) Data conversion method and computer storage medium
CN117894294A (en) Personification auxiliary language voice synthesis method and system
Cadic et al. Towards Optimal TTS Corpora.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant