CN118116363A - Speech synthesis method based on time perception position coding and model training method thereof - Google Patents
Abstract
The invention provides a speech synthesis method based on time-aware position coding and a model training method thereof, and relates to the technical field of speech synthesis. The synthesis method comprises: obtaining a prompt voice together with its text and the text to be synthesized; converting the texts into phoneme codes; encoding the prompt audio into audio codes over a plurality of codebooks; obtaining the pronunciation times of the words in the text and, in combination with the phoneme codes, deriving time-aware position codes and converting them into audio position codes; splicing the phoneme codes with the audio codes of the first codebook, aligning the result with the position codes, and inputting it into an autoregressive model to obtain the audio predictive codes of the first codebook of the voice to be synthesized; splicing the phoneme codes, the audio codes of all codebooks and the audio predictive codes, aligning the result with the position codes, and inputting it into a non-autoregressive model to obtain the audio predictive codes of the remaining codebooks; and splicing the audio predictive codes of all codebooks and decoding them to obtain the synthesized audio. The method trains faster and generates audio of higher quality.
Description
Technical Field
The invention relates to the technical field of speech synthesis, and in particular to a speech synthesis method based on time-aware position coding and a model training method thereof.
Background
Current zero-shot speech synthesis schemes largely divide into non-autoregressive models that model the mel spectrogram as an intermediate representation and autoregressive models that model discrete audio codes. The speech output by a mel-spectrogram-based non-autoregressive model is more stable, but sounds flatter because the speaker's style is not captured. An audio-coding-based autoregressive model can effectively capture the speaker's emotion, but its training efficiency is low and it cannot accurately capture the pronunciation time of each word.
These schemes have the following drawbacks: 1. Insufficient style capture: the non-autoregressive model performs poorly at generating personalized speech and cannot fully reproduce the speaker's style and emotion. 2. Low training efficiency: training an autoregressive model is time-consuming and resource-intensive, which limits the model's scalability and practicality. 3. Inaccurate pronunciation-time capture: the autoregressive model has difficulty modeling the pronunciation time of each word, which affects the natural fluency of the speech.
In view of the above, the applicant has studied the prior art and has made the present application.
Disclosure of Invention
The invention provides a speech synthesis method based on time-aware position coding and a model training method thereof, which are used for improving at least one of the technical problems.
In a first aspect, the present invention provides a speech synthesis method based on time-aware position coding, which includes steps S1 to S8.
S1, acquiring prompt voice, prompt voice text and voice text to be synthesized.
S2, respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes.
S3, the prompt voice is encoded into audio codes of a plurality of codebooks through an audio coder-decoder.
S4, according to the prompt voice and the prompt voice text, obtaining time stamp information of each word, and then obtaining time perception position codes of the phonemes by combining phoneme codes of the prompt voice text.
S5, according to the time perception position codes, the audio position codes of the words in the text at the corresponding positions of the audio are obtained.
S6, splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the time perception position code and the audio position code with the spliced codes, and then inputting the time perception position code and the audio position code into an autoregressive model to obtain the audio predictive code of the first codebook of the voice to be synthesized.
S7, splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the phoneme codes of the voice text to be synthesized and the audio prediction codes of the first codebook, aligning the time perception position codes and the audio position codes with the spliced codes, and then inputting the codes into a non-autoregressive model together to obtain the audio prediction codes of other codebooks of the voice to be synthesized.
S8, splicing the audio predictive coding of the first codebook and the audio predictive coding of other codebooks, and then decoding through an audio coder and decoder to obtain synthesized audio.
In an alternative embodiment, step S2 specifically includes steps S21 to S23.
S21, word segmentation is carried out on the prompt voice text and the voice text to be synthesized through a word segmentation tool.
S22, respectively converting the prompt voice text and the word segmentation of the voice text to be synthesized into phonemes through a phoneme conversion model.
S23, respectively converting phonemes of the prompt voice text and the voice text to be synthesized into phoneme codes through a mapping dictionary. Wherein the mapping dictionary contains discrete codes corresponding to the phonemes.
In an alternative embodiment, the step S4 of obtaining the timestamp information of each word in the text specifically includes a step S41. S41, acquiring the time stamp range of each word in the text in the voice audio. Wherein the range of time stamps includes a start time stamp and an end time stamp of each word in the audio.
In an alternative embodiment, the step S4 of obtaining the time-aware position code of each word in the text specifically includes steps S42 to S44.
S42, according to the phoneme codes of the words in the text, numbering the phonemes in the phoneme codes through an arithmetic progression to obtain the phoneme number range of the words, P = {p_1, p_2, ..., p_N}, wherein p_i is the number of the i-th phoneme and N is the total number of phonemes of the text.
S43, respectively calculating the phoneme position codes of each word according to the timestamp range and the phoneme number range. Wherein the calculation model of the phoneme position code PE is: PE_i = e_s + (p_i / (p_N + ε) + d/2) · (e_e − e_s), in which PE = {PE_1, ..., PE_N} is the phoneme position code, PE_i is the position code of the i-th phoneme, N is the total number of phonemes of the text, p_i is the number of the i-th phoneme, d is the common difference of the normalized arithmetic progression, and e_s and e_e are the start and end timestamps of the word containing the i-th phoneme.
S44, splicing the phoneme position codes of each word in the text, and unifying the coding scale of the spliced phoneme position codes and the coding scale of the audio coder-decoder to obtain the time perception position codes of the phonemes.
In an alternative embodiment, step S5 specifically includes step S51. S51, rounding downwards according to the time perception position codes of the words in the text, and obtaining the audio position codes of the words in the text.
In an alternative embodiment, the autoregressive model is: p(C_{:,1} | x, Ĉ_{:,1}, B; θ_AR) = ∏_{t=1}^{T} p(c_{t,1} | x, Ĉ_{:,1}, c_{<t,1}, B; θ_AR), in which p denotes probability, C_{:,1} denotes the audio predictive code of the first codebook of the voice to be synthesized, x denotes the phoneme codes, Ĉ_{:,1} denotes the audio code of the first codebook of the prompt voice, θ_AR denotes the parameters of the autoregressive model, B denotes the audio position codes, T is the number of audio coding positions, c_{t,1} is the audio code of the t-th position of the first codebook of the voice to be synthesized, and c_{<t,1} denotes the audio codes of the positions before the t-th position of the first codebook of the voice to be synthesized.
In an alternative embodiment, the non-autoregressive model is: p(C_{:,2:8} | x, Ĉ; θ_NAR) = ∏_{j=2}^{8} p(C_{:,j} | x, Ĉ, C_{:,<j}; θ_NAR), in which p denotes probability, C_{:,2:8} denotes the audio predictive codes of the 2nd to 8th codebooks of the voice to be synthesized, x denotes the phoneme codes, Ĉ denotes the audio codes of the prompt voice, θ_NAR denotes the parameters of the non-autoregressive model, j is the codebook index, C_{:,j} denotes the audio predictive code of the j-th codebook of the voice to be synthesized, and C_{:,<j} denotes the audio predictive codes of the codebooks before the j-th codebook of the voice to be synthesized.
In a second aspect, the present invention provides a method for training a speech synthesis model based on time-aware position coding, which includes steps M1 to M10.
M1, acquiring prompt voice, prompt voice text, voice to be synthesized and voice text to be synthesized.
And M2, respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes.
And M3, respectively encoding the prompt voice and the voice to be synthesized into audio codes of a plurality of codebooks through an audio coder-decoder.
And M4, respectively acquiring the time stamp information of each word in the prompt voice text and the voice text to be synthesized according to the corresponding relation between the voice and the text content, and then acquiring the time perception position code of the phonemes by combining the phoneme codes.
And M5, acquiring the audio position codes of the words in the text at the corresponding positions of the audio according to the time perception position codes.
M6, splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the corresponding time perception position code and the audio position code with the spliced code, and then inputting the codes into an autoregressive model together to obtain the audio predictive code of the first codebook of the voice to be synthesized.
And M7, acquiring the real audio coding of a first codebook of the voice to be synthesized, and then calculating an autoregressive loss with the audio predictive coding of the first codebook to train the autoregressive model.
M8, splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the audio codes of the j-th codebook of the voice to be synthesized and the phoneme codes of the voice text to be synthesized, aligning the corresponding time perception position codes and the audio position codes with the spliced codes, and then inputting the codes into a non-autoregressive model together to obtain the audio predictive codes of the j+1-th codebook of the voice to be synthesized until the audio predictive codes of other codebooks except the first codebook are generated.
M9, acquiring the real audio coding of the j+1th codebook of the voice to be synthesized, and then calculating a non-autoregressive loss with the audio predictive coding of the j+1th codebook to train the non-autoregressive model.
And M10, training the autoregressive model and the non-autoregressive model to obtain a speech synthesis model based on time perception position coding.
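Steps M7 and M9 both reduce to comparing the predicted code distribution at each position against the real audio code. A minimal sketch, assuming a cross-entropy loss over the codebook vocabulary (the patent does not name the loss function) and a toy 4-code vocabulary:

```python
import math

def cross_entropy(logits, target_index):
    """Cross-entropy of one softmax distribution against a real audio code."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]

# One toy position with a vocabulary of 4 codes: the model strongly
# prefers code 2. The logits and codes are illustrative only.
logits = [0.1, 0.2, 3.0, -1.0]
loss_correct = cross_entropy(logits, 2)  # real code matches the preference
loss_wrong = cross_entropy(logits, 3)    # real code does not
print(loss_correct < loss_wrong)
```

In training, such a loss would be averaged over all positions of the first codebook for the autoregressive model (M7), and over the positions of codebook j+1 for the non-autoregressive model (M9).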
In a third aspect, the present invention provides a speech synthesis apparatus based on time-aware position coding, which includes an initial data acquisition module, a phoneme coding module, an audio coding module, a time-aware position coding acquisition module, an audio position coding module, a first prediction module, a second prediction module, and a splicing module.
The initial data acquisition module is used for acquiring prompt voice, prompt voice text and voice text to be synthesized.
And the phoneme coding module is used for respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes.
And the audio coding module is used for coding the prompt voice into audio codes of a plurality of codebooks through an audio coder.
And the time perception position code acquisition module is used for acquiring the time stamp information of each word according to the prompt voice and the prompt voice text, and then acquiring the time perception position code of the phoneme by combining the phoneme code of the prompt voice text.
And the audio position coding module is used for acquiring the audio position codes of the words in the text at the corresponding positions of the audio according to the time perception position codes.
The first prediction module is used for splicing the audio coding of the first codebook of the prompt voice, the phoneme coding of the prompt voice text and the phoneme coding of the voice text to be synthesized, aligning the time perception position coding and the audio position coding with the spliced codes, and then inputting the time perception position coding and the audio position coding into the autoregressive model together to obtain the audio prediction coding of the first codebook of the voice to be synthesized.
The second prediction module is used for splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the phoneme codes of the voice text to be synthesized and the audio prediction codes of the first codebook, aligning the time perception position codes and the audio position codes with the spliced codes, and then inputting the codes into a non-autoregressive model together to obtain the audio prediction codes of other codebooks of the voice to be synthesized.
And the splicing module is used for splicing the audio predictive coding of the first codebook and the audio predictive coding of other codebooks, and then decoding the audio predictive coding by an audio coder and decoder to obtain synthesized audio.
In a fourth aspect, the present invention provides a training device for a speech synthesis model based on time-aware position coding, which includes a training data acquisition module, a phoneme coding module, an audio coding module, a time-aware position coding acquisition module, an audio position coding module, a first prediction module, a first training module, a second prediction module, a second training module, and a model acquisition module.
The training data acquisition module is used for acquiring prompt voice, prompt voice text, voice to be synthesized and voice text to be synthesized.
And the phoneme coding module is used for respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes.
And the audio coding module is used for respectively coding the prompt voice and the voice to be synthesized into audio codes of a plurality of codebooks through an audio coder-decoder.
And the time perception position code acquisition module is used for respectively acquiring the time stamp information of each word in the prompt voice text and the voice text to be synthesized according to the corresponding relation between the voice and the text content, and then combining the phoneme codes to acquire the time perception position codes of the phonemes.
And the audio position coding module is used for acquiring the audio position codes of the words in the text at the corresponding positions of the audio according to the time perception position codes.
The first prediction module is used for splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the corresponding time perception position code and the audio position code with the spliced code, and then inputting the time perception position code and the audio position code into the autoregressive model to obtain the audio prediction code of the first codebook of the voice to be synthesized.
And the first training module is used for acquiring the real audio codes of the first codebook of the voice to be synthesized, and then calculating the autoregressive loss with the audio predictive codes of the first codebook to train the autoregressive model.
The second prediction module is used for splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the audio codes of the j-th codebook of the voice to be synthesized and the phoneme codes of the voice text to be synthesized, aligning the corresponding time perception position codes and the audio position codes with the spliced codes, and then inputting the codes into the non-autoregressive model together to obtain the audio prediction codes of the j+1-th codebook of the voice to be synthesized until the audio prediction codes of other codebooks except the first codebook are generated.
And the second training module is used for acquiring the real audio codes of the j+1th codebook of the voice to be synthesized, and then calculating the non-autoregressive loss with the audio predictive codes of the j+1th codebook to train the non-autoregressive model.
The model acquisition module is used for obtaining a speech synthesis model based on time perception position coding after the autoregressive model and the non-autoregressive model are trained.
By adopting the technical scheme, the invention can obtain the following technical effects:
According to the speech synthesis method based on time-aware position coding, introducing time-aware position codes in place of the traditional sine-cosine codes effectively improves the convergence speed of the model: convergence is reached after about 10,000 training iterations instead of 30,000, and the quality of the generated audio is higher. Moreover, the time-aware position codes make it possible to control the pronunciation of each word at the inference stage and to change the speed of the audio, which has good practical significance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a speech synthesis method based on time-aware position coding.
Fig. 2 is a flow architecture diagram of an autoregressive model.
FIG. 3 is a flow architecture diagram of a non-autoregressive model.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1 to 3, a first embodiment of the present invention provides a speech synthesis method based on time-aware position coding, which can be performed by a speech synthesis apparatus based on time-aware position coding (hereinafter referred to as a speech synthesis apparatus). In particular, by one or more processors in the speech synthesis apparatus to implement steps S1 to S8.
S1, acquiring prompt voice, prompt voice text and voice text to be synthesized.
In particular, the prompt voice can be understood as speech uttered by the same speaker as the voice to be synthesized, but whose content is not the text of the voice to be synthesized.
S2, respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes. Preferably, step S2 specifically includes steps S21 to S23. S21, word segmentation is carried out on the prompt voice text and the voice text to be synthesized through a word segmentation tool. S22, respectively converting the prompt voice text and the word segmentation of the voice text to be synthesized into phonemes through a phoneme conversion model. S23, respectively converting phonemes of the prompt voice text and the voice text to be synthesized into phoneme codes through a mapping dictionary. Wherein the mapping dictionary contains discrete codes corresponding to the phonemes.
Specifically, the word segmentation may use a tool such as jieba, and the words may be converted into phonemes with a text-to-phoneme model such as G2P (Grapheme-to-Phoneme). The phonemes are then converted through a mapping dictionary into phoneme codes X = {x_1, x_2, ..., x_N}, where N represents the total number of phonemes.
In the following, a specific example describes how the texts corresponding to the prompt voice and the voice to be synthesized are converted into phoneme codes. First, a piece of text is segmented, for example with jieba. The text "should therefore respect each other and develop hand in hand in harmony" is segmented by jieba into ['so', 'should', 'mutual', 'harmony', 'respect', 'hand-in-hand', 'development']. Then, each word is converted into its corresponding phonemes by a model such as g2p. Finally, the phonemes are converted into phoneme codes through the mapping dictionary. For example, the word "so" is converted into the phonemes ['s', 'uɔ3', 'y', 'i3']; since the number of phonemes is finite, a mapping dictionary can be built over the phonemes and used to convert them into discrete codes, e.g. ['s', 'uɔ3', 'y', 'i3'] becomes [83, 100, 106, 55].
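The word-to-code conversion above can be sketched as follows. This is a minimal illustration: the toy phoneme inventory and the resulting integer codes are assumptions, not the patent's actual mapping dictionary.

```python
# Hypothetical sketch of step S2: phonemes -> discrete phoneme codes via a
# mapping dictionary. The inventory and codes below are illustrative only.

def build_mapping_dictionary(phoneme_inventory):
    """Assign a unique discrete code to every phoneme in the inventory."""
    return {ph: idx for idx, ph in enumerate(sorted(phoneme_inventory))}

def encode_phonemes(phonemes, mapping):
    """Convert a phoneme sequence into its discrete phoneme codes."""
    return [mapping[ph] for ph in phonemes]

# Toy inventory standing in for the full (finite) phoneme set.
inventory = ['s', 'uo3', 'y', 'i3', 'x', 'iang1']
mapping = build_mapping_dictionary(inventory)

codes = encode_phonemes(['s', 'uo3', 'y', 'i3'], mapping)
print(codes)  # four discrete integers, one per phoneme
```

A real system would obtain the phoneme lists from a G2P model, as the example in the text notes, and fix one dictionary shared by the prompt text and the text to be synthesized.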
S3, the prompt voice is encoded into audio codes of a plurality of codebooks through an audio coder-decoder.
Specifically, at the inference stage only the audio of the prompt voice is available, so only the prompt voice needs to be converted into audio codes. Concretely, it is turned into audio codes by the encoder of the audio codec Encodec. In this embodiment, Encodec encodes a piece of audio as a matrix of shape T × 8, wherein T represents the number of downsampled audio steps and 8 represents the number of codebooks.
For example: a 3 s piece of audio is converted by Encodec into an audio code of shape 225 × 8, because one second of audio corresponds to 75 discrete audio codes. In this embodiment, the autoregressive part only needs the information of the first codebook.
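The shape relation just described can be written down directly. A minimal sketch, assuming the stated 75-codes-per-second correspondence and 8 codebooks (names are illustrative):

```python
# Illustrative sketch of the Encodec code-shape relation described above.
CODES_PER_SECOND = 75  # stated correspondence: 1 s of audio -> 75 codes
NUM_CODEBOOKS = 8      # number of residual codebooks in this embodiment

def audio_code_shape(duration_seconds):
    """Return (T, 8): downsampled step count T and the codebook count."""
    T = int(duration_seconds * CODES_PER_SECOND)
    return (T, NUM_CODEBOOKS)

print(audio_code_shape(3.0))  # a 3 s clip -> (225, 8)
```

The same constant (75) reappears in step S4 below, where it is used to bring the time-aware position codes onto the codec's coding scale.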
S4, according to the prompt voice and the prompt voice text, obtaining time stamp information of each word, and then obtaining time perception position codes of the phonemes by combining phoneme codes of the prompt voice text.
On the basis of the above embodiment, in an alternative embodiment of the present invention, the step S4 of obtaining the timestamp information of each word in the text specifically includes a step S41. S41, acquiring the timestamp range of each word of the text in the voice audio. Wherein the timestamp range includes the start timestamp and the end timestamp of each word in the audio. It will be appreciated that the timestamps of the text within a piece of audio are obtained from a timestamp prediction model; many tools can be used for this, such as the MFA (Montreal Forced Aligner) tool.
Based on the above embodiment, in an alternative embodiment of the present invention, the step S4 of obtaining the time-aware position code of each word in the text specifically includes steps S42 to S44.
S42, according to the phoneme codes of the words in the text, numbering the phonemes in the phoneme codes through an arithmetic progression to obtain the phoneme number range of the words, P = {p_1, p_2, ..., p_N}, wherein p_i is the number of the i-th phoneme and N is the total number of phonemes of the text.
S43, respectively calculating the phoneme position codes of each word according to the timestamp range and the phoneme number range. Wherein the calculation model of the phoneme position code PE is: PE_i = e_s + (p_i / (p_N + ε) + d/2) · (e_e − e_s), in which PE = {PE_1, ..., PE_N} is the phoneme position code, PE_i is the position code of the i-th phoneme, N is the total number of phonemes of the text, p_i is the number of the i-th phoneme, d is the common difference of the normalized arithmetic progression, and e_s and e_e are the start and end timestamps of the word containing the i-th phoneme.
S44, splicing the phoneme position codes of each word in the text, and unifying the coding scale of the spliced phoneme position codes with the coding scale of the audio codec to obtain the time-aware position codes of the phonemes T = {t_1, t_2, ..., t_N}, where t_i is the time-aware position code of the i-th phoneme.
Specifically, the purpose of step S4 is to generate, from the number of phonemes of each word, a number sequence associated with the timestamps as the time-aware position codes {t_1, ..., t_n}, where n represents the number of phonemes corresponding to a word.
Firstly, an arithmetic progression is initialized, and each phoneme in the phoneme code is numbered according to the progression to obtain the phoneme number range of each word.
Then, according to the numbering, each element PE_i of the phoneme position code is calculated as: PE_i = p_i / (p_n + ε) + d/2, wherein dividing by p_n + ε normalizes each element of the arithmetic progression and d represents the common difference of the normalized progression.
By offsetting by half the common difference d/2, the position code of each phoneme is moved to the midpoint of the pronunciation interval in which the phoneme lies, which helps the model better capture pronunciation timing. The position codes of the phonemes are then mapped into the range between the word's start time e_s and end time e_e within the whole audio segment, giving the time-aware position codes. Finally, to unify with the coding scale of the audio codec Encodec, all time-aware position codes are multiplied by a fixed value to obtain the final position codes.
From the above calculation of the time-aware position codes it can be seen that t_i records the position at which each word's phonemes are pronounced. In this way, the model can capture the position information of the pronunciation more quickly, which speeds up the convergence of the model. In addition, at the inference stage the pronunciation duration of each word can be changed manually, thereby changing the speaking rate of the audio and making the generation result more controllable. The audio position codes are generated in the traditional position-coding manner.
A specific example will be described below of how to obtain a time-aware position coding of the text corresponding to the alert speech.
1. Obtain the timestamp range of each word of the text in the audio. For example: [[0.1, 0.26, '…'], [0.27, 0.46, '…'], [0.54, 0.73, '…'], …], where each entry holds a word's start timestamp, end timestamp, and the word itself.
2. Number the phonemes of each word according to an arithmetic progression to obtain the phoneme number range of each word, P = {p_1, ..., p_N}, where p_i is the number of the i-th phoneme and N is the total number of phonemes of the text. For example: the phoneme list of one word is ['s', 'uɔ3'], with an index range of [0, 1, 2].
3. Max-normalize the phoneme index range according to the formula q_i = p_i / (p_max + ε). For example: the index range [0, 1, 2] becomes [0, 0.499, 0.999].
4. Obtain the common difference of the arithmetic progression, d = q_{i+1} − q_i. For example: the difference of [0, 0.499, 0.999] is 0.499.
5. Shift each value in the position code backwards by half the difference, r_i = q_i + d/2, and keep the length corresponding to the phonemes. For example: [0, 0.499, 0.999] becomes [0.249, 0.749] after calculation.
6. Scale the number range into [e_s, e_e] according to the formula t_i = e_s + r_i · (e_e − e_s), where e_s is the start timestamp of the word and e_e is its end timestamp. Assuming e_s is 1.25 and e_e is 1.55, the above formula yields [1.3247, 1.4747].
7. Splice the position codes of the phonemes corresponding to the words of the whole sentence to form the initial time-aware position codes.
8. Unify the coding scale of the initial time-aware position codes with the coding scale of the audio codec Encodec. In this embodiment, all time-aware position codes are multiplied by 75 to obtain the final time-aware position codes. For example: [1.3247, 1.4747] becomes [99.352, 110.6024].
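The eight steps above can be sketched end to end. This is a hedged reconstruction: the epsilon in the max-normalization and the exact rounding are assumptions, so the printed values of the example ([99.352, 110.6024]) are reproduced only approximately.

```python
# Reconstruction of steps 2-8: time-aware position codes for one word.
SCALE = 75  # Encodec coding scale: 75 discrete codes per second of audio

def time_aware_position_codes(n_phonemes, start_ts, end_ts, eps=0.002):
    """Compute per-phoneme time-aware position codes for one word."""
    # Step 2: number the n+1 phoneme boundaries with an arithmetic progression.
    p = list(range(n_phonemes + 1))                  # e.g. [0, 1, 2]
    # Step 3: max-normalize the progression (eps is an assumed small constant).
    q = [pi / (max(p) + eps) for pi in p]            # ~[0, 0.499, 0.999]
    # Step 4: common difference of the normalized progression.
    d = q[1] - q[0]
    # Step 5: shift by half the difference and keep one value per phoneme,
    # so each code sits at the midpoint of its phoneme's interval.
    r = [qi + d / 2 for qi in q][:n_phonemes]        # ~[0.249, 0.749]
    # Step 6: map into the word's [start, end] timestamp range.
    t = [start_ts + ri * (end_ts - start_ts) for ri in r]
    # Step 8: unify with the audio codec's coding scale.
    return [ti * SCALE for ti in t]

codes = time_aware_position_codes(2, start_ts=1.25, end_ts=1.55)
print(codes)  # approximately [99.35, 110.60]
```

Step 7 (splicing the words of the whole sentence) would simply concatenate the outputs of this function over all words.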
S5, according to the time perception position codes, the audio position codes of the words in the text at the corresponding positions of the audio are obtained. Preferably, step S5 specifically includes step S51. S51, rounding downwards according to the time perception position codes of the words in the text, and obtaining the audio position codes of the words in the text at the corresponding audio positions.
In particular, the audio position codes form an arithmetic progression with a common difference of 1, i.e. b_{i+1} − b_i = 1. The audio position code set B of the audio is expressed as: B = {b_1, b_2, ..., b_T}, wherein b_1 is the audio position code of the 1st audio step and b_T is the audio position code of the T-th audio step.
The time-aware position code of a word is denoted as T = {t_1, …, t_n}; its position code at the corresponding audio is expressed as {b_⌊t_1⌋, …, b_⌊t_n⌋}, wherein b_⌊t_i⌋ is the audio position code in B corresponding to the integer value of t_i rounded down.
How the audio position codes are generated is described below with a specific example.
For the time-aware position code corresponding to each word, the audio position code at the corresponding audio position is acquired. For example: if the time-aware position code of a word is [99.352, 110.6024], then rounding down according to step S51 gives the audio position code [99, 110] for this word.
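Step S5 amounts to a floor operation over the time-aware position codes; a minimal sketch (the function name is an assumption for illustration):

```python
import math

def audio_position_codes(time_aware):
    # Round each time-aware position code down to the nearest integer;
    # the result indexes the audio position progression b_k = k.
    return [math.floor(t) for t in time_aware]

positions = audio_position_codes([99.352, 110.6024])  # -> [99, 110]
```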
S6, splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the time-aware position codes and the audio position codes with the spliced codes, and then inputting them into an autoregressive model to obtain the audio predictive code of the first codebook of the voice to be synthesized. Preferably, the autoregressive model is: p(c_{:,1} | x, c̃_{:,1}; θ_AR) = ∏_{t=1}^{T} p(c_{t,1} | x, c̃_{:,1}, c_{<t,1}; θ_AR), wherein p denotes probability, c_{:,1} denotes the audio predictive coding of the first codebook of the speech to be synthesized, x denotes the phoneme coding, c̃_{:,1} denotes the audio coding of the first codebook of the prompt speech, θ_AR denotes the parameters of the autoregressive model, t is the audio coding position, T is the number of audio coding positions, c_{t,1} is the audio coding of the t-th position of the first codebook of the speech to be synthesized, and c_{<t,1} is the audio coding of the positions of the first codebook of the speech to be synthesized smaller than t.
Specifically, the audio code of the first codebook of the prompt voice and the phoneme code of the prompt voice text are spliced, the corresponding time-aware position codes and audio position codes are aligned, and then these, together with the phoneme codes of the voice text to be synthesized, are input into the autoregressive model. The autoregressive model is prior art: it consists of multiple decoder layers, each consisting of a self-attention module, a feed-forward module and a layer normalization module, which the invention does not repeat.
S7, splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the phoneme codes of the voice text to be synthesized and the audio predictive codes of the first codebook, aligning the time-aware position codes and the audio position codes with the spliced codes, and then inputting them together into a non-autoregressive model to obtain the audio predictive codes of the other codebooks of the voice to be synthesized. Preferably, the non-autoregressive model is: p(c_{:,2:8} | x, c̃; θ_NAR) = ∏_{j=2}^{8} p(c_{:,j} | x, c̃, c_{:,<j}; θ_NAR), wherein p denotes probability, c_{:,2:8} denotes the audio predictive coding of the 2nd to 8th codebooks of the speech to be synthesized, x denotes the phoneme coding, c̃ denotes the audio coding of the prompt speech, θ_NAR denotes the parameters of the non-autoregressive model, j is the sequence number of the codebook, c_{:,j} denotes the audio predictive coding of the j-th codebook of the speech to be synthesized, and c_{:,<j} denotes the audio predictive coding of the codebooks of the speech to be synthesized with sequence number smaller than j.
Specifically, the audio codes of the prompt voice corresponding to all codebooks and the phoneme codes of the prompt voice text are spliced, the corresponding time-aware position codes and audio position codes are aligned, and then these, together with the phoneme codes of the voice text to be synthesized and the audio predictive codes of the first codebook of the voice to be synthesized, are input into the non-autoregressive model. The architecture of the non-autoregressive model is substantially identical to that of the autoregressive model, except that a separate embedding layer needs to be specified for the audio codes of each of the eight codebooks.
S8, splicing the audio predictive coding of the first codebook and the audio predictive coding of the other codebooks, and then decoding through the audio codec to obtain the synthesized audio.
Specifically, in the inference stage, for the autoregressive model, the audio codes of the first codebook are generated by sampling-based decoding and used as an input of the non-autoregressive model. For the non-autoregressive model, the highest-probability audio codes in the latter seven codebooks are generated by greedy decoding. Finally, the audio codes of all codebooks are spliced and input into the decoder of the audio codec Encodec to obtain the speech to be synthesized.
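The two decoding strategies of the inference stage can be sketched over raw logits as follows; this is an illustrative sketch with invented function names, not the patent's implementation:

```python
import math
import random

def softmax(logits):
    # Convert raw logits into a probability distribution (stable form).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_decode(logits, rng=random):
    # Autoregressive model (first codebook): draw a code id
    # from the softmax distribution over the codebook entries.
    r, acc = rng.random(), 0.0
    for i, p in enumerate(softmax(logits)):
        acc += p
        if r < acc:
            return i
    return len(logits) - 1

def greedy_decode(logits):
    # Non-autoregressive model (codebooks 2-8): take the
    # highest-probability code.
    return max(range(len(logits)), key=lambda i: logits[i])
```

Sampling keeps the first-codebook generation diverse, while greedy decoding keeps the remaining codebooks deterministic given the first.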
In this embodiment, the final output of the model is a combination of an autoregressive model and a non-autoregressive model. Wherein the autoregressive model first predicts the audio coding of the speech to be synthesized in the first codebook. Then, the discrete codes in the first codebook are regarded as one of the conditions of the non-autoregressive model input, and the audio codes of the speech to be synthesized, which correspond to the 2 nd to 8 th codebooks, are predicted from the non-autoregressive model. Finally, the discrete codes of the 8 codebooks corresponding to the voice to be synthesized are spliced and input into a decoder of the audio codec Encodec to obtain the final voice to be synthesized.
The mathematical formula of the speech to be synthesized is derived as: p(C | x, C̃; θ) = p(c_{:,1} | x, c̃_{:,1}; θ_AR) · p(c_{:,2:8} | x, C̃, c_{:,1}; θ_NAR), wherein p denotes probability, C denotes the audio coding of the synthesized speech, x denotes the phoneme coding, C̃ denotes the audio coding of the prompt speech, θ denotes the parameters of the model, c_{:,1} denotes the audio predictive coding of the first codebook of the speech to be synthesized, c̃_{:,1} denotes the audio coding of the first codebook of the prompt speech, θ_AR denotes the parameters of the autoregressive model, c_{:,2:8} denotes the audio predictive coding of the 2nd to 8th codebooks of the speech to be synthesized, and θ_NAR denotes the parameters of the non-autoregressive model.
According to the voice synthesis method based on time-aware position coding, time-aware position coding is introduced to replace the traditional sine-cosine position coding, which effectively improves the convergence speed of the model: the training needed to reach convergence is reduced from 30,000 iterations to 10,000 iterations, and the quality of the generated audio is higher. Moreover, the pronunciation of each word can be effectively controlled in the inference stage through the time-aware position coding, speed change of the audio is supported, and the method has good practical significance.
The second embodiment of the present invention provides a training method of a speech synthesis model based on time-aware position coding, which includes steps M1 to M10.
M1, acquiring prompt voice, prompt voice text, voice to be synthesized and voice text to be synthesized.
Specifically, step M1 is similar to step S1, except that the speech to be synthesized is additionally acquired.
And M2, respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes.
Specifically, since the texts corresponding to the prompt voice and the voice to be synthesized can be obtained in advance, the conversion into phoneme codes is applied to both texts. This is identical in the training phase and the inference phase, namely: step M2 is identical to step S2.
And M3, respectively encoding the prompt voice and the voice to be synthesized into audio codes of a plurality of codebooks through an audio coder-decoder.
Specifically, compared with the inference phase, the training phase additionally processes the speech to be synthesized. Both the prompt speech and the speech to be synthesized are converted into audio codes by the encoder of the audio codec Encodec.
In the training phase, both the prompt speech and the speech to be synthesized are converted into discrete codes, because the speech to be synthesized serves as the label for training the autoregressive model. In the inference stage, the speech to be synthesized is the result of inference, so its discrete codes cannot be acquired in advance, and it is neither possible nor necessary to audio-encode it.
And M4, respectively acquiring the time stamp information of each word in the prompt voice text and the voice text to be synthesized according to the corresponding relation between the voice and the text content, and then acquiring the time perception position code of the phonemes by combining the phoneme codes.
Specifically, in the training stage, the timestamp information corresponding to the prompt voice can naturally be acquired, so the time-aware position codes can also be generated. The speech to be synthesized is available because it serves as the training label, so its corresponding timestamp information can also be obtained. In the inference stage, only the audio of the prompt voice is available, so only the time-aware position codes corresponding to the prompt voice can be obtained.
That is, in the training stage, the timestamp information of the texts corresponding to the prompt voice and the voice to be synthesized is obtained from the two voices, and the time-aware position code of each word in the audio is obtained from the timestamp information. In the inference stage, only the prompt voice is available, so only the timestamp information of the text corresponding to the prompt voice can be obtained.
M5, acquiring the audio position codes of each word in the text at the corresponding audio positions according to the time-aware position codes. Specifically, step M5 is identical to step S5, except that the processing object additionally includes the voice text to be synthesized.
M6, splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the corresponding time perception position code and the audio position code with the spliced code, and then inputting the codes into an autoregressive model together to obtain the audio predictive code of the first codebook of the voice to be synthesized.
And M7, acquiring the real audio coding of a first codebook of the voice to be synthesized, and then calculating an autoregressive loss with the audio predictive coding of the first codebook to train the autoregressive model.
Specifically, a flow architecture diagram of the autoregressive model of the training phase is shown in fig. 2. The phoneme codes corresponding to the texts of the prompt voice and the voice to be synthesized and the discrete codes corresponding to the audio are spliced, the corresponding position code information is aligned, and the result is input into the autoregressive model. In the present embodiment, the autoregressive model is a Transformer-based autoregressive model for autoregressively generating the audio codes (discrete codes) of the audio to be synthesized. The optimization objective of the autoregressive model is to maximize the probability of the audio codes of the first codebook through a cross-entropy loss.
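The cross-entropy optimization objective can be sketched as follows (a numerically stable log-sum-exp form; the function names are illustrative assumptions, not the patent's code):

```python
import math

def cross_entropy(logits, target):
    # -log softmax(logits)[target], computed stably via log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def first_codebook_loss(logit_seq, target_codes):
    # Average cross-entropy over all positions of the first codebook;
    # minimizing it maximizes the probability of the ground-truth codes.
    losses = [cross_entropy(l, t) for l, t in zip(logit_seq, target_codes)]
    return sum(losses) / len(losses)
```

The non-autoregressive models of steps M8 and M9 use the same loss, applied per codebook rather than per time step.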
The arrow of the speech to be synthesized points downward in fig. 2 because the speech to be synthesized is a sound wave in the time domain, while the next token predicted by the autoregressive model is actually the corresponding discrete code of the speech to be synthesized, so it is necessary to convert it to the corresponding discrete codes through the encoder of the audio codec.
M8, splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the audio codes of the j-th codebook of the voice to be synthesized and the phoneme codes of the voice text to be synthesized, aligning the corresponding time perception position codes and the audio position codes with the spliced codes, and then inputting the codes into a non-autoregressive model together to obtain the audio predictive codes of the j+1-th codebook of the voice to be synthesized until the audio predictive codes of other codebooks except the first codebook are generated.
M9, acquiring the real audio coding of the j+1th codebook of the voice to be synthesized, and then calculating a non-autoregressive loss with the audio predictive coding of the j+1th codebook to train the non-autoregressive model.
Specifically, a flow architecture diagram of the non-autoregressive model of the training phase is shown in fig. 3. In this embodiment, the autoregressive model and the non-autoregressive model are trained in a decoupled manner. Namely: through the audio position codes, the audio codes of all codebooks corresponding to the prompt voice are aligned and spliced with the phoneme codes of the prompt voice text, the phoneme codes of the voice text to be synthesized are aligned and spliced with the audio codes of the voice to be synthesized, and then these are input together into the non-autoregressive model.
In the present embodiment, the non-autoregressive model is a Transformer-based non-autoregressive model for generating the audio codes (discrete codes) of the audio to be synthesized codebook by codebook. The optimization objective of the non-autoregressive model is to maximize the probability of each audio code in the next codebook through a cross-entropy loss.
And M10, training the autoregressive model and the non-autoregressive model to obtain a speech synthesis model based on time perception position coding.
Specifically, in the training stage, the autoregressive model is trained according to steps M6 and M7, the non-autoregressive model is trained according to steps M8 and M9, and the two stages can be decoupled for training.
The embodiment of the invention helps the model better capture the temporal information of audio pronunciation through time-aware position coding, accelerates the convergence rate of the model, and effectively improves the generation quality of the audio. In the inference stage, the speech rate of the audio can be changed by controlling the timestamp information, so that the pronunciation duration of each word is more controllable, which is a remarkable improvement.
The third embodiment of the invention provides a speech synthesis device based on time-aware position coding, which comprises an initial data acquisition module, a phoneme coding module, an audio coding module, a time-aware position coding acquisition module, an audio position coding module, a first prediction module, a second prediction module and a splicing module.
The initial data acquisition module is used for acquiring prompt voice, prompt voice text and voice text to be synthesized.
And the phoneme coding module is used for respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes.
And the audio coding module is used for coding the prompt voice into audio codes of a plurality of codebooks through an audio coder.
And the time perception position code acquisition module is used for acquiring the time stamp information of each word according to the prompt voice and the prompt voice text, and then acquiring the time perception position code of the phoneme by combining the phoneme code of the prompt voice text.
And the audio position coding module is used for acquiring the audio position codes of the words in the text at the corresponding positions of the audio according to the time perception position codes.
The first prediction module is used for splicing the audio coding of the first codebook of the prompt voice, the phoneme coding of the prompt voice text and the phoneme coding of the voice text to be synthesized, aligning the time perception position coding and the audio position coding with the spliced codes, and then inputting the time perception position coding and the audio position coding into the autoregressive model together to obtain the audio prediction coding of the first codebook of the voice to be synthesized.
The second prediction module is used for splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the phoneme codes of the voice text to be synthesized and the audio prediction codes of the first codebook, aligning the time perception position codes and the audio position codes with the spliced codes, and then inputting the codes into a non-autoregressive model together to obtain the audio prediction codes of other codebooks of the voice to be synthesized.
And the splicing module is used for splicing the audio predictive coding of the first codebook and the audio predictive coding of other codebooks, and then decoding the audio predictive coding by an audio coder and decoder to obtain synthesized audio.
The fourth embodiment of the invention provides a training device of a speech synthesis model based on time perception position coding, which comprises a training data acquisition module, a phoneme coding module, an audio coding module, a time perception position coding acquisition module, an audio position coding module, a first prediction module, a first training module, a second prediction module, a second training module and a model acquisition module.
The training data acquisition module is used for acquiring prompt voice, prompt voice text, voice to be synthesized and voice text to be synthesized.
And the phoneme coding module is used for respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes.
And the audio coding module is used for respectively coding the prompt voice and the voice to be synthesized into audio codes of a plurality of codebooks through an audio coder-decoder.
And the time perception position code acquisition module is used for respectively acquiring the time stamp information of each word in the prompt voice text and the voice text to be synthesized according to the corresponding relation between the voice and the text content, and then combining the phoneme codes to acquire the time perception position codes of the phonemes.
And the audio position coding module is used for acquiring the audio position codes of the words in the text at the corresponding positions of the audio according to the time perception position codes.
The first prediction module is used for splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the corresponding time perception position code and the audio position code with the spliced code, and then inputting the time perception position code and the audio position code into the autoregressive model to obtain the audio prediction code of the first codebook of the voice to be synthesized.
And the first training module is used for acquiring the real audio codes of the first codebook of the voice to be synthesized, and then calculating the autoregressive loss with the audio predictive codes of the first codebook to train the autoregressive model.
The second prediction module is used for splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the audio codes of the j-th codebook of the voice to be synthesized and the phoneme codes of the voice text to be synthesized, aligning the corresponding time perception position codes and the audio position codes with the spliced codes, and then inputting the codes into the non-autoregressive model together to obtain the audio prediction codes of the j+1-th codebook of the voice to be synthesized until the audio prediction codes of other codebooks except the first codebook are generated.
And the second training module is used for acquiring the real audio codes of the j+1th codebook of the voice to be synthesized, and then calculating the non-autoregressive loss with the audio predictive codes of the j+1th codebook to train the non-autoregressive model.
The model acquisition module is used for obtaining a speech synthesis model based on time perception position coding after the autoregressive model and the non-autoregressive model are trained.
It is understood that the speech synthesis device based on time-aware position coding may be an electronic device with computing capability, such as a portable notebook computer, a desktop computer, a server, a smart phone or a tablet computer.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus and method embodiments described above are merely illustrative, for example, flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present invention may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein is merely one relationship describing the association of the associated objects, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if determined" or "if detected (stated condition or event)" may be interpreted as "when determined" or "in response to determination" or "when detected (stated condition or event)" or "in response to detection (stated condition or event)", depending on the context.
References to "first\second" in the embodiments are merely to distinguish similar objects and do not represent a particular ordering for the objects, it being understood that "first\second" may interchange a particular order or precedence where allowed. It is to be understood that the "first\second" distinguishing aspects may be interchanged where appropriate, such that the embodiments described herein may be implemented in sequences other than those illustrated or described herein.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (9)
1. A method of speech synthesis based on time-aware position coding, comprising:
acquiring prompt voice, a prompt voice text and a voice text to be synthesized;
respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes;
encoding the prompt speech into audio codes of a plurality of codebooks through an audio codec;
according to the prompt voice and the prompt voice text, acquiring time stamp information of each word, and then acquiring time perception position codes of phonemes by combining phoneme codes of the prompt voice text;
Acquiring audio position codes of each word in the text at corresponding audio positions according to the time perception position codes;
Splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the time perception position code and the audio position code with the spliced code, and then inputting the aligned codes into an autoregressive model to obtain the audio predictive code of the first codebook of the voice to be synthesized;
Splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the phoneme codes of the voice text to be synthesized and the audio prediction codes of the first codebook, aligning the time perception position codes and the audio position codes with the spliced codes, and inputting the codes into a non-autoregressive model together to obtain the audio prediction codes of other codebooks of the voice to be synthesized;
splicing the audio predictive coding of the first codebook and the audio predictive coding of the other codebooks, and then decoding through the audio codec to obtain the synthesized audio.
2. The method for speech synthesis based on time-aware position coding according to claim 1, wherein the converting the prompt speech text and the speech text to be synthesized into phoneme codes, respectively, specifically comprises:
the word segmentation tool is used for respectively segmenting the prompt voice text and the voice text to be synthesized;
Respectively converting the prompt voice text and the word segmentation of the voice text to be synthesized into phonemes through a phoneme conversion model;
Respectively converting phonemes of the prompt voice text and the voice text to be synthesized into phoneme codes through a mapping dictionary; wherein the mapping dictionary contains discrete codes corresponding to the phonemes.
3. The speech synthesis method based on time-aware position coding according to claim 1, wherein the step of obtaining time stamp information of each word in the text comprises the steps of:
Acquiring a time stamp range of each word in the text in voice audio; wherein the timestamp range includes a start timestamp and an end timestamp of each word in audio;
obtaining the time-aware position coding of the phonemes, comprising the steps of:
According to the phoneme codes of each word in the text, numbering each phoneme in the phoneme codes through an arithmetic progression to obtain the phoneme number range of each word; wherein i = 0, 1, …, N, i is the number of the i-th phoneme, and N is the total number of phonemes of the text;
respectively calculating the phoneme position codes of each word according to the timestamp range and the phoneme number range; wherein the calculation model of the phoneme position code t_i is: t_i = s_k + (i/N + d/2)·(e_k − s_k), i = 0, 1, …, N−1, wherein t_i is the position code of the i-th phoneme, N is the total number of phonemes of the text, i is the number of the i-th phoneme, d is the common difference of the normalized arithmetic progression, e_k is the end timestamp of the k-th word, and s_k is the start timestamp of the k-th word;
splicing the phoneme position codes of each word in the text, and unifying the coding scale of the spliced phoneme position codes and the coding scale of the audio coder-decoder to obtain the time perception position codes of the phonemes.
4. The method for synthesizing speech based on time-aware position coding according to claim 1, wherein obtaining audio position codes of words in text at corresponding positions of audio according to the time-aware position coding specifically comprises:
And obtaining the audio position codes of the words in the text at the corresponding audio positions according to the time perception position codes of the words in the text by rounding downwards.
5. The method of claim 1, wherein the autoregressive model is: p(c_{:,1} | x, c̃_{:,1}; θ_AR) = ∏_{t=1}^{T} p(c_{t,1} | x, c̃_{:,1}, c_{<t,1}; θ_AR), wherein p denotes probability, c_{:,1} denotes the audio predictive coding of the first codebook of the speech to be synthesized, x denotes the phoneme coding, c̃_{:,1} denotes the audio coding of the first codebook of the prompt speech, θ_AR denotes the parameters of the autoregressive model, t is the audio coding position, T is the number of audio coding positions, c_{t,1} is the audio coding of the t-th position of the first codebook of the speech to be synthesized, and c_{<t,1} is the audio coding of the positions of the first codebook of the speech to be synthesized smaller than t.
6. The method of claim 1, wherein the non-autoregressive model is: p(c_{:,2:8} | x, c̃; θ_NAR) = ∏_{j=2}^{8} p(c_{:,j} | x, c̃, c_{:,<j}; θ_NAR), wherein p denotes probability, c_{:,2:8} denotes the audio predictive coding of the 2nd to 8th codebooks of the speech to be synthesized, x denotes the phoneme coding, c̃ denotes the audio coding of the prompt speech, θ_NAR denotes the parameters of the non-autoregressive model, j is the sequence number of the codebook, c_{:,j} denotes the audio predictive coding of the j-th codebook of the speech to be synthesized, and c_{:,<j} denotes the audio predictive coding of the codebooks of the speech to be synthesized with sequence number smaller than j.
7. A method for training a speech synthesis model based on time-aware position coding, comprising:
acquiring prompt voice, prompt voice text, voice to be synthesized and voice text to be synthesized;
converting the prompt voice text and the voice text to be synthesized into phoneme codes, respectively;
encoding the prompt voice and the voice to be synthesized into audio codes of a plurality of codebooks, respectively, through an audio codec;
acquiring time stamp information of each word in the prompt voice text and the voice text to be synthesized according to the correspondence between the voice and the text content, and then combining the phoneme codes to obtain the time-aware position codes of the phonemes;
acquiring the audio position codes of each word in the text at the corresponding audio positions according to the time-aware position codes;
splicing the audio codes of the first codebook of the prompt voice, the phoneme codes of the prompt voice text and the phoneme codes of the voice text to be synthesized, aligning the corresponding time-aware position codes and audio position codes with the spliced codes, and then inputting them together into an autoregressive model to obtain the audio predictive codes of the first codebook of the voice to be synthesized;
acquiring the real audio codes of the first codebook of the voice to be synthesized, and then calculating an autoregressive loss against the audio predictive codes of the first codebook to train the autoregressive model;
splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the audio codes of the j-th codebook of the voice to be synthesized and the phoneme codes of the voice text to be synthesized, aligning the corresponding time-aware position codes and audio position codes with the spliced codes, and then inputting them together into a non-autoregressive model to obtain the audio predictive codes of the (j+1)-th codebook of the voice to be synthesized, until the audio predictive codes of all codebooks other than the first codebook are generated;
acquiring the real audio codes of the (j+1)-th codebook of the voice to be synthesized, and then calculating a non-autoregressive loss against the audio predictive codes of the (j+1)-th codebook to train the non-autoregressive model;
and after the autoregressive model and the non-autoregressive model are trained, obtaining a speech synthesis model based on time-aware position coding.
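The autoregressive and non-autoregressive losses in claim 7 are, in practice, token-level cross-entropies between the predicted distributions and the ground-truth codec codes. A minimal sketch, with made-up toy distributions standing in for real model outputs:

```python
import math

def cross_entropy(pred_dists, targets):
    """Mean negative log-likelihood of the true codes under the predicted
    per-position distributions: the loss used to train both models."""
    return -sum(math.log(d[t]) for d, t in zip(pred_dists, targets)) / len(targets)

# Autoregressive loss: first-codebook predictions vs. the real first-codebook codes.
ar_preds = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]   # toy distributions, 2 positions
ar_loss = cross_entropy(ar_preds, [0, 1])

# Non-autoregressive loss: codebook-(j+1) predictions vs. its real codes.
nar_preds = [[0.25, 0.5, 0.25], [0.6, 0.3, 0.1]]
nar_loss = cross_entropy(nar_preds, [1, 0])
print(round(ar_loss, 4), round(nar_loss, 4))    # prints "0.2899 0.602"
```

In a real training run the two losses are computed over whole utterances and backpropagated through the respective Transformer, typically with a framework loss such as `torch.nn.CrossEntropyLoss`.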
8. A speech synthesis apparatus based on time-aware position coding, comprising:
the initial data acquisition module is used for acquiring prompt voice, prompt voice text and voice text to be synthesized;
the phoneme coding module is used for converting the prompt voice text and the voice text to be synthesized into phoneme codes, respectively;
the audio coding module is used for encoding the prompt voice into audio codes of a plurality of codebooks through an audio codec;
the time-aware position code acquisition module is used for acquiring time stamp information of each word according to the prompt voice and the prompt voice text, and then combining the phoneme codes of the prompt voice text to obtain the time-aware position codes of the phonemes;
the audio position coding module is used for acquiring the audio position codes of each word in the text at the corresponding positions of the audio according to the time-aware position codes;
the first prediction module is used for splicing the audio codes of the first codebook of the prompt voice, the phoneme codes of the prompt voice text and the phoneme codes of the voice text to be synthesized, aligning the time-aware position codes and the audio position codes with the spliced codes, and then inputting them together into the autoregressive model to obtain the audio predictive codes of the first codebook of the voice to be synthesized;
the second prediction module is used for splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the phoneme codes of the voice text to be synthesized and the audio predictive codes of the first codebook, aligning the time-aware position codes and the audio position codes with the spliced codes, and then inputting them together into a non-autoregressive model to obtain the audio predictive codes of the other codebooks of the voice to be synthesized;
and the splicing module is used for splicing the audio predictive codes of the first codebook and the audio predictive codes of the other codebooks, and then decoding them through the audio codec to obtain the synthesized audio.
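The splicing module of claim 8 amounts to stacking the first-codebook predictions with the remaining codebooks into the code matrix a neural codec decoder expects. The 8-codebook layout follows claims 5 and 6; the decoder itself (e.g. EnCodec) is out of scope here, so the function name and shapes are illustrative:

```python
def splice_codebooks(first_codebook, other_codebooks):
    """Stack the AR-predicted first codebook with the NAR-predicted
    codebooks 2..8 into the codebooks-by-frames matrix a neural codec
    decoder consumes. All codebooks must cover the same T frames."""
    assert all(len(cb) == len(first_codebook) for cb in other_codebooks), \
        "every codebook must have the same number of frames"
    return [list(first_codebook)] + [list(cb) for cb in other_codebooks]

# One AR codebook plus seven NAR codebooks over 3 toy frames.
codes = splice_codebooks([3, 1, 4], [[1, 5, 9]] * 7)
print(len(codes), len(codes[0]))  # prints "8 3"
```

Keeping the first codebook in row 0 matters: residual codecs decode the rows in order, with each later codebook refining the reconstruction of the earlier ones.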
9. A training device for a speech synthesis model based on time-aware position coding, comprising:
the training data acquisition module is used for acquiring prompt voice, prompt voice text, voice to be synthesized and voice text to be synthesized;
the phoneme coding module is used for converting the prompt voice text and the voice text to be synthesized into phoneme codes, respectively;
the audio coding module is used for encoding the prompt voice and the voice to be synthesized into audio codes of a plurality of codebooks, respectively, through an audio codec;
the time-aware position code acquisition module is used for acquiring the time stamp information of each word in the prompt voice text and the voice text to be synthesized according to the correspondence between the voice and the text content, and then combining the phoneme codes to obtain the time-aware position codes of the phonemes;
the audio position coding module is used for acquiring the audio position codes of each word in the text at the corresponding positions of the audio according to the time-aware position codes;
the first prediction module is used for splicing the audio codes of the first codebook of the prompt voice, the phoneme codes of the prompt voice text and the phoneme codes of the voice text to be synthesized, aligning the corresponding time-aware position codes and audio position codes with the spliced codes, and then inputting them together into the autoregressive model to obtain the audio predictive codes of the first codebook of the voice to be synthesized;
the first training module is used for acquiring the real audio codes of the first codebook of the voice to be synthesized, and then calculating an autoregressive loss against the audio predictive codes of the first codebook to train the autoregressive model;
the second prediction module is used for splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the audio codes of the j-th codebook of the voice to be synthesized and the phoneme codes of the voice text to be synthesized, aligning the corresponding time-aware position codes and audio position codes with the spliced codes, and then inputting them together into a non-autoregressive model to obtain the audio predictive codes of the (j+1)-th codebook of the voice to be synthesized, until the audio predictive codes of all codebooks other than the first codebook are generated;
the second training module is used for acquiring the real audio codes of the (j+1)-th codebook of the voice to be synthesized, and then calculating a non-autoregressive loss against the audio predictive codes of the (j+1)-th codebook to train the non-autoregressive model;
and the model acquisition module is used for obtaining a speech synthesis model based on time-aware position coding after the autoregressive model and the non-autoregressive model are trained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410512813.2A CN118116363A (en) | 2024-04-26 | 2024-04-26 | Speech synthesis method based on time perception position coding and model training method thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118116363A (en) | 2024-05-31
Family
ID=91208983
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410512813.2A Pending CN118116363A (en) | 2024-04-26 | 2024-04-26 | Speech synthesis method based on time perception position coding and model training method thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118116363A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102287499B1 (en) * | 2020-09-15 | 2021-08-09 | 주식회사 에이아이더뉴트리진 | Method and apparatus for synthesizing speech reflecting phonemic rhythm |
CN113257221A (en) * | 2021-07-06 | 2021-08-13 | 成都启英泰伦科技有限公司 | Voice model training method based on front-end design and voice synthesis method |
CN113870838A (en) * | 2021-09-27 | 2021-12-31 | 平安科技(深圳)有限公司 | Voice synthesis method, device, equipment and medium |
CN114387946A (en) * | 2020-10-20 | 2022-04-22 | 北京三星通信技术研究有限公司 | Training method of speech synthesis model and speech synthesis method |
CN117612512A (en) * | 2023-11-15 | 2024-02-27 | 腾讯音乐娱乐科技(深圳)有限公司 | Training method of voice model, voice generating method, equipment and storage medium |
CN117809620A (en) * | 2024-01-22 | 2024-04-02 | 网易有道信息技术(北京)有限公司 | Speech synthesis method, device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11295721B2 (en) | Generating expressive speech audio from text data | |
Arık et al. | Deep voice: Real-time neural text-to-speech | |
US11735162B2 (en) | Text-to-speech (TTS) processing | |
Mertens | The prosogram: Semi-automatic transcription of prosody based on a tonal perception model | |
JP2022531414A (en) | End-to-end automatic speech recognition of digit strings | |
Wang et al. | A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural $ F_0 $ Model for Statistical Parametric Speech Synthesis | |
US11763797B2 (en) | Text-to-speech (TTS) processing | |
CN110570876B (en) | Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium | |
CN116364055B (en) | Speech generation method, device, equipment and medium based on pre-training language model | |
JP7379756B2 (en) | Prediction of parametric vocoder parameters from prosodic features | |
WO2011080597A1 (en) | Method and apparatus for synthesizing a speech with information | |
EP4266306A1 (en) | A speech processing system and a method of processing a speech signal | |
US11322133B2 (en) | Expressive text-to-speech utilizing contextual word-level style tokens | |
CN114974218A (en) | Voice conversion model training method and device and voice conversion method and device | |
Guo et al. | MSMC-TTS: Multi-stage multi-codebook VQ-VAE based neural TTS | |
ES2366551T3 (en) | CODING AND DECODING DEPENDENT ON A SOURCE OF MULTIPLE CODE BOOKS. | |
CN118116363A (en) | Speech synthesis method based on time perception position coding and model training method thereof | |
JP5268731B2 (en) | Speech synthesis apparatus, method and program | |
CN114203151A (en) | Method, device and equipment for training speech synthesis model | |
CN115206281A (en) | Speech synthesis model training method and device, electronic equipment and medium | |
Alastalo | Finnish end-to-end speech synthesis with Tacotron 2 and WaveNet | |
US12033611B2 (en) | Generating expressive speech audio from text data | |
Oh et al. | DurFlex-EVC: Duration-Flexible Emotional Voice Conversion with Parallel Generation | |
CN114360492B (en) | Audio synthesis method, device, computer equipment and storage medium | |
Louw | Neural speech synthesis for resource-scarce languages |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination |