CN118116363A - Speech synthesis method based on time-aware position coding and model training method thereof

Speech synthesis method based on time-aware position coding and model training method thereof

Info

Publication number: CN118116363A
Application number: CN202410512813.2A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: audio, codes, synthesized, voice, text
Legal status: Pending
Inventors: 潘启正, 陈毅松, 杨洪进
Current and original assignee: Xiamen Chanyu Network Technology Co., Ltd.
Application filed by Xiamen Chanyu Network Technology Co., Ltd.
Priority to CN202410512813.2A
Publication of CN118116363A

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a speech synthesis method based on time-aware position coding and a model training method thereof, and relates to the technical field of speech synthesis. The synthesis method comprises: obtaining a prompt speech, its text, and the text to be synthesized; converting the texts into phoneme codes; and encoding the audio into audio codes of a plurality of codebooks. The pronunciation time of each word in the text is obtained and, combined with the phoneme codes, used to derive time-aware position codes, which are then converted into audio position codes. The phoneme codes and the audio codes of the first codebook are spliced, aligned with the position codes, and input into an autoregressive model to obtain the audio prediction codes of the first codebook of the speech to be synthesized. The phoneme codes, the audio codes of all codebooks and the audio prediction codes are then spliced, aligned with the position codes, and input into a non-autoregressive model to obtain the audio prediction codes of the other codebooks of the speech to be synthesized. Finally, the audio prediction codes of all codebooks are spliced and decoded to obtain the synthesized audio. The method trains faster and produces audio of higher quality.

Description

Speech synthesis method based on time-aware position coding and model training method thereof
Technical Field
The invention relates to the technical field of speech synthesis, and in particular to a speech synthesis method based on time-aware position coding and a model training method thereof.
Background
Current zero-shot speech synthesis schemes fall largely into two categories: non-autoregressive models that use the mel spectrogram as an intermediate representation, and autoregressive models that model discrete audio codes. The speech output by mel-spectrogram-based non-autoregressive models is more stable, but sounds flatter because the speaker's style is not fully captured. Autoregressive models based on audio coding can effectively capture the speaker's emotion, but their training efficiency is low and they cannot accurately capture the pronunciation time of each word.
The prior art therefore has the following drawbacks: 1. Insufficient style capture: non-autoregressive models have limited ability to generate personalized speech and cannot fully reproduce the speaker's style and emotion. 2. Low training efficiency: the training process of autoregressive models is time-consuming and resource-intensive, which limits the scalability and practicality of the models. 3. Inaccurate pronunciation timing: autoregressive models have difficulty modeling the pronunciation time of each word, which affects the natural fluency of the speech.
In view of the above, the applicant has studied the prior art and has made the present application.
Disclosure of Invention
The invention provides a speech synthesis method based on time-aware position coding and a model training method thereof, in order to address at least one of the above technical problems.
In a first aspect, the present invention provides a speech synthesis method based on time-aware position coding, which includes steps S1 to S8.
S1, acquiring prompt voice, prompt voice text and voice text to be synthesized.
S2, respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes.
S3, the prompt voice is encoded into audio codes of a plurality of codebooks through an audio coder-decoder.
S4, according to the prompt voice and the prompt voice text, obtaining time stamp information of each word, and then obtaining time perception position codes of the phonemes by combining phoneme codes of the prompt voice text.
S5, according to the time perception position codes, the audio position codes of the words in the text at the corresponding positions of the audio are obtained.
S6, splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the time perception position code and the audio position code with the spliced codes, and then inputting the time perception position code and the audio position code into an autoregressive model to obtain the audio predictive code of the first codebook of the voice to be synthesized.
S7, splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the phoneme codes of the voice text to be synthesized and the audio prediction codes of the first codebook, aligning the time perception position codes and the audio position codes with the spliced codes, and then inputting the codes into a non-autoregressive model together to obtain the audio prediction codes of other codebooks of the voice to be synthesized.
S8, splicing the audio predictive coding of the first codebook and the audio predictive coding of other codebooks, and then decoding through an audio coder and decoder to obtain synthesized audio.
In an alternative embodiment, step S2 specifically includes steps S21 to S23.
S21, word segmentation is carried out on the prompt voice text and the voice text to be synthesized through a word segmentation tool.
S22, respectively converting the prompt voice text and the word segmentation of the voice text to be synthesized into phonemes through a phoneme conversion model.
S23, respectively converting phonemes of the prompt voice text and the voice text to be synthesized into phoneme codes through a mapping dictionary. Wherein the mapping dictionary contains discrete codes corresponding to the phonemes.
In an alternative embodiment, the step S4 of obtaining the timestamp information of each word in the text specifically includes a step S41. S41, acquiring the time stamp range of each word in the text in the voice audio. Wherein the range of time stamps includes a start time stamp and an end time stamp of each word in the audio.
In an alternative embodiment, the step S4 of obtaining the time-aware position code of each word in the text specifically includes steps S42 to S44.
S42, according to the phoneme codes of the words in the text, numbering the phonemes in the phoneme codes through an arithmetic progression to obtain the phoneme number range of each word $A=\{a_1, a_2, \ldots, a_n\}$, where $a_i$ is the number of the $i$-th phoneme and $n$ is the total number of phonemes of the text.
S43, respectively calculating the phoneme position codes of each word according to the timestamp range and the phoneme number range. The phoneme position code of the $i$-th phoneme is calculated as $pe_i = s_k + \left(e_i + \frac{d}{2}\right)\left(t_k - s_k\right)$, where $pe_i$ is the position code of the $i$-th phoneme, $n$ is the total number of phonemes of the text, $e_i$ is the maximum-normalized phoneme number (the number $a_i$ divided by the largest number in the progression), $d$ is the common difference of the normalized arithmetic progression, $t_k$ is the end timestamp of the $k$-th word, and $s_k$ is the start timestamp of the $k$-th word.
S44, splicing the phoneme position codes of each word in the text, and unifying the coding scale of the spliced phoneme position codes and the coding scale of the audio coder-decoder to obtain the time perception position codes of the phonemes.
In an alternative embodiment, step S5 specifically includes step S51. S51, rounding downwards according to the time perception position codes of the words in the text, and obtaining the audio position codes of the words in the text.
In an alternative embodiment, the autoregressive model is: $p\left(c_{:,1} \mid x, \tilde{c}_{:,1}; \theta_{AR}\right) = \prod_{t=0}^{T} p\left(c_{t,1} \mid x, \tilde{c}_{:,1}, c_{<t,1}; \theta_{AR}\right)$, in which $p$ represents probability, $c_{:,1}$ represents the audio prediction codes of the first codebook of the speech to be synthesized, $x$ represents the phoneme codes, $\tilde{c}_{:,1}$ represents the audio codes of the first codebook of the prompt speech, $\theta_{AR}$ represents the parameters of the autoregressive model, $t$ is an audio coding position, $T$ is the number of audio coding positions, $c_{t,1}$ is the audio code at the $t$-th position of the first codebook of the speech to be synthesized, and $c_{<t,1}$ denotes the audio codes at positions before the $t$-th position of the first codebook of the speech to be synthesized.
In an alternative embodiment, the non-autoregressive model is: $p\left(c_{:,2:8} \mid x, \tilde{C}; \theta_{NAR}\right) = \prod_{j=2}^{8} p\left(c_{:,j} \mid x, \tilde{C}, c_{:,<j}; \theta_{NAR}\right)$, in which $p$ represents probability, $c_{:,2:8}$ represents the audio prediction codes of the 2nd to 8th codebooks of the speech to be synthesized, $x$ represents the phoneme codes, $\tilde{C}$ represents the audio codes of the prompt speech, $\theta_{NAR}$ represents the parameters of the non-autoregressive model, $j$ is the codebook index, $c_{:,j}$ represents the audio prediction codes of the $j$-th codebook of the speech to be synthesized, and $c_{:,<j}$ represents the audio prediction codes of the codebooks of the speech to be synthesized preceding the $j$-th codebook.
In a second aspect, the present invention provides a method for training a speech synthesis model based on time-aware position coding, which includes steps M1 to M10.
M1, acquiring prompt voice, prompt voice text, voice to be synthesized and voice text to be synthesized.
And M2, respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes.
And M3, respectively encoding the prompt voice and the voice to be synthesized into audio codes of a plurality of codebooks through an audio coder-decoder.
And M4, respectively acquiring the time stamp information of each word in the prompt voice text and the voice text to be synthesized according to the corresponding relation between the voice and the text content, and then acquiring the time perception position code of the phonemes by combining the phoneme codes.
And M5, acquiring the audio position codes of the words in the text at the corresponding positions of the audio according to the time perception position codes.
M6, splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the corresponding time perception position code and the audio position code with the spliced code, and then inputting the codes into an autoregressive model together to obtain the audio predictive code of the first codebook of the voice to be synthesized.
And M7, acquiring the real audio coding of a first codebook of the voice to be synthesized, and then calculating an autoregressive loss with the audio predictive coding of the first codebook to train the autoregressive model.
M8, splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the audio codes of the j-th codebook of the voice to be synthesized and the phoneme codes of the voice text to be synthesized, aligning the corresponding time perception position codes and the audio position codes with the spliced codes, and then inputting the codes into a non-autoregressive model together to obtain the audio predictive codes of the j+1-th codebook of the voice to be synthesized until the audio predictive codes of other codebooks except the first codebook are generated.
M9, acquiring the real audio coding of the j+1th codebook of the voice to be synthesized, and then calculating a non-autoregressive loss with the audio predictive coding of the j+1th codebook to train the non-autoregressive model.
And M10, training the autoregressive model and the non-autoregressive model to obtain a speech synthesis model based on time perception position coding.
In a third aspect, the present invention provides a speech synthesis apparatus based on time-aware position coding, which includes an initial data acquisition module, a phoneme coding module, an audio coding module, a time-aware position coding acquisition module, an audio position coding module, a first prediction module, a second prediction module, and a splicing module.
The initial data acquisition module is used for acquiring prompt voice, prompt voice text and voice text to be synthesized.
And the phoneme coding module is used for respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes.
And the audio coding module is used for coding the prompt voice into audio codes of a plurality of codebooks through an audio coder.
And the time perception position code acquisition module is used for acquiring the time stamp information of each word according to the prompt voice and the prompt voice text, and then acquiring the time perception position code of the phoneme by combining the phoneme code of the prompt voice text.
And the audio position coding module is used for acquiring the audio position codes of the words in the text at the corresponding positions of the audio according to the time perception position codes.
The first prediction module is used for splicing the audio coding of the first codebook of the prompt voice, the phoneme coding of the prompt voice text and the phoneme coding of the voice text to be synthesized, aligning the time perception position coding and the audio position coding with the spliced codes, and then inputting the time perception position coding and the audio position coding into the autoregressive model together to obtain the audio prediction coding of the first codebook of the voice to be synthesized.
The second prediction module is used for splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the phoneme codes of the voice text to be synthesized and the audio prediction codes of the first codebook, aligning the time perception position codes and the audio position codes with the spliced codes, and then inputting the codes into a non-autoregressive model together to obtain the audio prediction codes of other codebooks of the voice to be synthesized.
And the splicing module is used for splicing the audio predictive coding of the first codebook and the audio predictive coding of other codebooks, and then decoding the audio predictive coding by an audio coder and decoder to obtain synthesized audio.
In a fourth aspect, the present invention provides a training device for a speech synthesis model based on time-aware position coding, which includes a training data acquisition module, a phoneme coding module, an audio coding module, a time-aware position coding acquisition module, an audio position coding module, a first prediction module, a first training module, a second prediction module, a second training module, and a model acquisition module.
The training data acquisition module is used for acquiring prompt voice, prompt voice text, voice to be synthesized and voice text to be synthesized.
And the phoneme coding module is used for respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes.
And the audio coding module is used for respectively coding the prompt voice and the voice to be synthesized into audio codes of a plurality of codebooks through an audio coder-decoder.
And the time perception position code acquisition module is used for respectively acquiring the time stamp information of each word in the prompt voice text and the voice text to be synthesized according to the corresponding relation between the voice and the text content, and then combining the phoneme codes to acquire the time perception position codes of the phonemes.
And the audio position coding module is used for acquiring the audio position codes of the words in the text at the corresponding positions of the audio according to the time perception position codes.
The first prediction module is used for splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the corresponding time perception position code and the audio position code with the spliced code, and then inputting the time perception position code and the audio position code into the autoregressive model to obtain the audio prediction code of the first codebook of the voice to be synthesized.
And the first training module is used for acquiring the real audio codes of the first codebook of the voice to be synthesized, and then calculating the autoregressive loss with the audio predictive codes of the first codebook to train the autoregressive model.
The second prediction module is used for splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the audio codes of the j-th codebook of the voice to be synthesized and the phoneme codes of the voice text to be synthesized, aligning the corresponding time perception position codes and the audio position codes with the spliced codes, and then inputting the codes into the non-autoregressive model together to obtain the audio prediction codes of the j+1-th codebook of the voice to be synthesized until the audio prediction codes of other codebooks except the first codebook are generated.
And the second training module is used for acquiring the real audio codes of the j+1th codebook of the voice to be synthesized, and then calculating the non-autoregressive loss with the audio predictive codes of the j+1th codebook to train the non-autoregressive model.
The model acquisition module is used for obtaining a speech synthesis model based on time perception position coding after the autoregressive model and the non-autoregressive model are trained.
By adopting the technical scheme, the invention can obtain the following technical effects:
According to the speech synthesis method based on time-aware position coding, time-aware position coding is introduced in place of traditional sine-cosine position coding, which effectively improves the convergence speed of the model: convergence is reached in about 10,000 training iterations instead of 30,000, and the quality of the generated audio is higher. In addition, the time-aware position coding makes it possible to control the pronunciation of each word at the inference stage and to change the speed of the audio, which is of good practical significance.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a speech synthesis method based on time-aware position coding.
Fig. 2 is a flow architecture diagram of an autoregressive model.
FIG. 3 is a flow architecture diagram of a non-autoregressive model.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to Figs. 1 to 3, a first embodiment of the present invention provides a speech synthesis method based on time-aware position coding, which can be performed by a speech synthesis apparatus based on time-aware position coding (hereinafter referred to as a speech synthesis apparatus). Specifically, steps S1 to S8 are implemented by one or more processors in the speech synthesis apparatus.
S1, acquiring prompt voice, prompt voice text and voice text to be synthesized.
In particular, the prompt speech can be understood as speech uttered by the same speaker as the speech to be synthesized, but its content is not the words of the text to be synthesized.
S2, respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes. Preferably, step S2 specifically includes steps S21 to S23. S21, word segmentation is carried out on the prompt voice text and the voice text to be synthesized through a word segmentation tool. S22, respectively converting the prompt voice text and the word segmentation of the voice text to be synthesized into phonemes through a phoneme conversion model. S23, respectively converting phonemes of the prompt voice text and the voice text to be synthesized into phoneme codes through a mapping dictionary. Wherein the mapping dictionary contains discrete codes corresponding to the phonemes.
Specifically, the word segmentation may use a word segmentation tool such as jieba, and the segmented words may be converted into phonemes by a text-to-phoneme model such as G2P (Grapheme-to-Phoneme). After the conversion into phonemes, the phonemes are converted through a mapping dictionary into phoneme codes $X=\{x_1, x_2, \ldots, x_L\}$, where $L$ represents the total number of phonemes.
In the following, a specific example describes how the text corresponding to the prompt speech and the speech to be synthesized is converted into phoneme codes. First, a piece of text is segmented by a tool such as jieba. For example, the sentence "should therefore respect each other and develop hand in hand in harmony" is segmented by jieba into ['so', 'should', 'mutual', 'harmony', 'respect', 'hand-in-hand', 'development']. Then, each word is converted into its corresponding phonemes by a g2p model or the like. Finally, the phonemes are converted into phoneme codes through the mapping dictionary. For example, the word 'so' (a two-syllable Chinese word) is converted into the phonemes ['s', 'uɔ3', 'y', 'i3']; since the number of distinct phonemes is limited, a mapping dictionary can be built over the phonemes and used to convert them into discrete codes, e.g. ['s', 'uɔ3', 'y', 'i3'] becomes [83, 100, 106, 55].
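A minimal sketch of this text-to-phoneme-code conversion (step S2) is given below, assuming the jieba segmenter is available; the g2p callable and the handling of the mapping dictionary are illustrative assumptions rather than the patent's exact implementation.

```python
# Illustrative sketch of text -> phoneme codes (step S2); the g2p helper and
# the dictionary contents are assumptions, not the patent's exact implementation.
import jieba  # word segmentation tool, as suggested in the description


def text_to_phoneme_codes(text, g2p, phone2id):
    """Segment text, convert each word to phonemes, then map phonemes to discrete codes.

    g2p:      callable mapping a word to a list of phoneme symbols (e.g. a G2P model).
    phone2id: mapping dictionary from phoneme symbol to a discrete integer code.
    """
    words = list(jieba.cut(text))           # e.g. ['so', 'should', 'mutual', ...]
    phonemes = []
    for word in words:
        phonemes.extend(g2p(word))          # e.g. ['s', 'uɔ3', 'y', 'i3']
    codes = []
    for ph in phonemes:
        if ph not in phone2id:              # unknown phonemes get a new id in this sketch
            phone2id[ph] = len(phone2id)
        codes.append(phone2id[ph])          # e.g. [83, 100, 106, 55]
    return words, phonemes, codes
```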
S3, the prompt voice is encoded into audio codes of a plurality of codebooks through an audio coder-decoder.
Specifically, at the inference stage only the audio of the prompt speech is available, so only the prompt speech needs to be converted into audio codes. This is done by the encoder of the audio codec Encodec. In this embodiment, Encodec encodes a piece of audio as a code matrix $C^{T\times 8}$, where $T$ represents the downsampled audio length (the number of audio frames after downsampling) and 8 represents the number of codebooks.
For example, a 3 s piece of audio is converted by Encodec into a corresponding audio code matrix of size $225\times 8$, because the discrete audio codes correspond to the audio duration at 75 codes per second. In this embodiment, the autoregressive part only needs the information of the first codebook.
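A minimal sketch of this encoding step follows, assuming Meta's open-source encodec package and a 24 kHz mono prompt recording (the file name prompt.wav is a placeholder); the 6 kbps bandwidth setting yields 8 codebooks at 75 frames per second, matching the figures in this description.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# 24 kHz Encodec model; 6 kbps gives 8 codebooks at 75 frames/s (assumption matching the text).
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

wav, sr = torchaudio.load("prompt.wav")                       # placeholder prompt audio
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))                   # list of (codes, scale) tuples
codes = torch.cat([c for c, _ in frames], dim=-1)             # shape: (1, 8, T)
codes = codes[0].transpose(0, 1)                              # shape: (T, 8) as in the description
```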
S4, according to the prompt voice and the prompt voice text, obtaining time stamp information of each word, and then obtaining time perception position codes of the phonemes by combining phoneme codes of the prompt voice text.
On the basis of the above embodiment, in an alternative embodiment of the present invention, the step S4 of obtaining the timestamp information of each word in the text specifically includes a step S41. S41, acquiring the timestamp range of each word in the text in the speech audio. Wherein the timestamp range includes a start timestamp and an end timestamp of each word in the audio. It will be appreciated that the timestamps of the text within a piece of audio can be obtained in many ways, for example with a timestamp prediction model or a forced-alignment tool such as MFA (Montreal Forced Aligner).
Based on the above embodiment, in an alternative embodiment of the present invention, the step S4 of obtaining the time-aware position code of each word in the text specifically includes steps S42 to S44.
S42, according to the phoneme codes of the words in the text, numbering the phonemes in the phoneme codes through an arithmetic progression to obtain the phoneme number range of each word $A=\{a_1, a_2, \ldots, a_n\}$, where $a_i$ is the number of the $i$-th phoneme and $n$ is the total number of phonemes of the text.
S43, respectively calculating the phoneme position codes of each word according to the timestamp range and the phoneme number range. The phoneme position code of the $i$-th phoneme is calculated as $pe_i = s_k + \left(e_i + \frac{d}{2}\right)\left(t_k - s_k\right)$, where $pe_i$ is the position code of the $i$-th phoneme, $n$ is the total number of phonemes of the text, $e_i$ is the maximum-normalized phoneme number (the number $a_i$ divided by the largest number in the progression), $d$ is the common difference of the normalized arithmetic progression, $t_k$ is the end timestamp of the $k$-th word, and $s_k$ is the start timestamp of the $k$-th word.
S44, splicing the phoneme position codes of each word in the text, and unifying the coding scale of the spliced phoneme position codes with the coding scale of the audio codec, to obtain the time-aware position codes of the phonemes $PE=\{pe_1, pe_2, \ldots, pe_L\}$, where $pe_i$ is the time-aware position code of the $i$-th phoneme.
Specifically, the purpose of step S4 is to generate, for each word, a sequence of numbers associated with its timestamps as time-aware position codes $\{pe_1, pe_2, \ldots, pe_n\}$, where $n$ represents the number of phonemes corresponding to the word.
First, an arithmetic progression is initialized, and each phoneme in the phoneme code is numbered according to this progression to obtain the phoneme number range of each word $A=\{a_1, a_2, \ldots, a_n\}$.
Then, the phoneme position codes are calculated from these numbers. Each element is computed as $pe_i = s_k + \left(e_i + \frac{d}{2}\right)\left(t_k - s_k\right)$, where $e_i$ is the maximum-normalized element of the arithmetic progression (the number $a_i$ divided by the largest number in the progression) and $d$ represents the common difference of the normalized progression.
The offset of half the common difference, $d/2$, adjusts the position code of each phoneme to the middle of the pronunciation interval in which that phoneme lies, which helps the model capture pronunciation timing more accurately. The position codes of the phonemes are then mapped into the range between the start time $s_k$ and the end time $t_k$ of their word within the whole audio segment, giving the time-aware position codes. Finally, to unify the scale with the coding scale of the audio codec Encodec, all time-aware position codes are multiplied by a fixed value (75 in this embodiment) to obtain the final position codes.
From the above calculation of the time-aware position codes, it can be seen that they record the positions at which the phonemes of each word are pronounced. In this way the model can capture the positional information of the pronunciation more quickly, which accelerates model convergence. In addition, at the inference stage the pronunciation duration of each word can be changed manually, thereby changing the speaking speed of the audio and making the generation result more controllable. The audio position codes are generated in the traditional position coding manner.
A specific example of how to obtain the time-aware position codes of the text corresponding to the prompt speech is described below.
1. The timestamp range of each word of the text in the audio is obtained. For example: [[0.1, 0.26, word1], [0.27, 0.46, word2], [0.54, 0.73, word3], ...], where each entry contains the start timestamp, the end timestamp and the word.
2. The phonemes of each word are numbered according to an arithmetic progression to obtain the phoneme number range of each word $A=\{a_1, a_2, \ldots, a_n\}$, where $a_i$ is the number of the $i$-th phoneme and $n$ is the total number of phonemes of the text. For example, the phoneme list ['s', 'uɔ3'] has the index range [0, 1, 2].
3. Maximum normalization is applied to the phoneme index range. For example, the index range [0, 1, 2] becomes [0, 0.499, 0.999].
4. The common difference $d$ of the normalized arithmetic progression is obtained. For example, the common difference of [0, 0.499, 0.999] is calculated to be 0.499.
5. Each value in the position codes is shifted by half the common difference, $d/2$, and the result is truncated to the number of phonemes of the word. For example, [0, 0.499, 0.999] becomes [0.249, 0.749].
6. The shifted values are scaled into the interval between the start timestamp $s_k$ and the end timestamp $t_k$ of the $k$-th word. Assuming $s_k$ is 1.25 and $t_k$ is 1.55, the calculation gives [1.3247, 1.4747].
7. The position codes of the phonemes of all words in the sentence are spliced to form the initial time-aware position codes.
8. The coding scale of the initial time-aware position codes is unified with the coding scale of the audio codec Encodec. In this embodiment, all time-aware position codes are multiplied by 75 to obtain the final time-aware position codes. For example, [1.3247, 1.4747] becomes [99.352, 110.6024].
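The eight steps above can be sketched as follows. This is an illustrative reimplementation based on the worked example; details such as the exact normalization constant are assumptions, so the printed values differ slightly from the example figures.

```python
# Illustrative sketch of time-aware position coding (steps 1-8 above).
# Per-word timestamps and phoneme lists are assumed to come from the alignment
# and G2P stages; the exact normalization constant is an assumption.
CODES_PER_SECOND = 75  # Encodec frame rate used in this embodiment


def word_position_codes(phonemes, start, end):
    """Return one time-aware position code per phoneme of a word."""
    n = len(phonemes)
    indices = list(range(n + 1))                    # arithmetic progression, e.g. [0, 1, 2]
    e = [i / indices[-1] for i in indices]          # maximum normalization, e.g. [0, 0.5, 1.0]
    d = e[1] - e[0]                                 # common difference of the normalized progression
    centered = [v + d / 2 for v in e][:n]           # shift by d/2, keep one value per phoneme
    return [start + v * (end - start) for v in centered]


def time_aware_position_codes(aligned_words):
    """aligned_words: list of (start_s, end_s, phoneme_list) tuples, one per word."""
    codes = []
    for start, end, phonemes in aligned_words:
        codes.extend(word_position_codes(phonemes, start, end))
    return [c * CODES_PER_SECOND for c in codes]    # unify scale with the audio codec


# Word with phonemes ['s', 'uɔ3'] spanning 1.25 s to 1.55 s:
print(time_aware_position_codes([(1.25, 1.55, ['s', 'uɔ3'])]))
# -> [99.375, 110.625] (the description's example gives [99.352, 110.6024] with its exact normalization)
```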
S5, according to the time perception position codes, the audio position codes of the words in the text at the corresponding positions of the audio are obtained. Preferably, step S5 specifically includes step S51. S51, rounding downwards according to the time perception position codes of the words in the text, and obtaining the audio position codes of the words in the text at the corresponding audio positions.
In particular, the audio positions are encoded as an arithmetic progression with a common difference of 1, i.e. $0, 1, 2, \ldots$. The audio position coding set of the audio is expressed as $B=\{b_1, b_2, \ldots, b_T\}$, where $b_i$ is the $i$-th audio position code and $b_T$ is the $T$-th audio position code.
Given that the time-aware position codes of a word are denoted $t=\{t_1, t_2, \ldots, t_n\}$, its position codes at the corresponding audio positions are expressed as $\{b_{\lfloor t_1 \rfloor}, b_{\lfloor t_2 \rfloor}, \ldots, b_{\lfloor t_n \rfloor}\}$, where $b_{\lfloor t_1 \rfloor}$ is the audio position code in $B$ corresponding to the integer value obtained by rounding $t_1$ down, and $b_{\lfloor t_n \rfloor}$ is the audio position code in $B$ corresponding to the integer value obtained by rounding $t_n$ down.
How the audio position codes are generated is described below with a specific example.
For the time-aware position codes corresponding to each word, the audio position codes at the corresponding audio positions are acquired. For example, if the time-aware position codes of a word are [99.352, 110.6024], then the audio position codes corresponding to this word are obtained by the rounding rule above as [99, 100, 110].
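A short sketch of this rounding step is given below, continuing the previous example. Whether only the rounded codes or every frame of the spanned range is used is left ambiguous by the example above, so both readings are shown as assumptions.

```python
import math


def audio_position_codes(time_aware_codes, full_span=False):
    """Map time-aware position codes to integer audio frame positions (step S51).

    full_span=True returns every frame between the first and last rounded code,
    which matches the extra intermediate value in the example above; the plain
    per-code floor is the literal reading of step S51. Both are assumptions.
    """
    floors = [math.floor(c) for c in time_aware_codes]
    if full_span:
        return list(range(floors[0], floors[-1] + 1))
    return floors


print(audio_position_codes([99.352, 110.6024]))                  # -> [99, 110]
print(audio_position_codes([99.352, 110.6024], full_span=True))  # -> [99, 100, ..., 110]
```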
S6, splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the time-aware position code and the audio position code with the spliced codes, and then inputting them into an autoregressive model to obtain the audio prediction code of the first codebook of the voice to be synthesized. Preferably, the autoregressive model is: $p\left(c_{:,1} \mid x, \tilde{c}_{:,1}; \theta_{AR}\right) = \prod_{t=0}^{T} p\left(c_{t,1} \mid x, \tilde{c}_{:,1}, c_{<t,1}; \theta_{AR}\right)$, in which $p$ represents probability, $c_{:,1}$ represents the audio prediction codes of the first codebook of the speech to be synthesized, $x$ represents the phoneme codes, $\tilde{c}_{:,1}$ represents the audio codes of the first codebook of the prompt speech, $\theta_{AR}$ represents the parameters of the autoregressive model, $t$ is an audio coding position, $T$ is the number of audio coding positions, $c_{t,1}$ is the audio code at the $t$-th position of the first codebook of the speech to be synthesized, and $c_{<t,1}$ denotes the audio codes at positions before the $t$-th position of the first codebook of the speech to be synthesized.
Specifically, the audio codes of the first codebook of the prompt speech and the phoneme codes of the prompt speech text are spliced, the corresponding time-aware position codes and audio position codes are aligned with them, and these are input into the autoregressive model together with the phoneme codes of the speech text to be synthesized. The autoregressive model itself is known in the prior art: it consists of multiple decoder layers, each of which consists of a self-attention module, a feed-forward module and a layer normalization module, and is not described further here.
S7, splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the phoneme codes of the voice text to be synthesized and the audio prediction codes of the first codebook, aligning the time-aware position codes and the audio position codes with the spliced codes, and then inputting the codes into a non-autoregressive model together to obtain the audio prediction codes of the other codebooks of the voice to be synthesized. Preferably, the non-autoregressive model is: $p\left(c_{:,2:8} \mid x, \tilde{C}; \theta_{NAR}\right) = \prod_{j=2}^{8} p\left(c_{:,j} \mid x, \tilde{C}, c_{:,<j}; \theta_{NAR}\right)$, in which $p$ represents probability, $c_{:,2:8}$ represents the audio prediction codes of the 2nd to 8th codebooks of the speech to be synthesized, $x$ represents the phoneme codes, $\tilde{C}$ represents the audio codes of the prompt speech, $\theta_{NAR}$ represents the parameters of the non-autoregressive model, $j$ is the codebook index, $c_{:,j}$ represents the audio prediction codes of the $j$-th codebook of the speech to be synthesized, and $c_{:,<j}$ represents the audio prediction codes of the codebooks of the speech to be synthesized preceding the $j$-th codebook.
Specifically, the audio codes of all codebooks of the prompt speech and the phoneme codes of the prompt speech text are spliced, the corresponding time-aware position codes and audio position codes are aligned with them, and these are input into the non-autoregressive model together with the phoneme codes of the speech text to be synthesized and the audio prediction codes of the first codebook of the speech to be synthesized. The architecture of the non-autoregressive model is substantially identical to that of the autoregressive model, except that a separate embedding layer needs to be specified for the audio codes of each of the eight codebooks.
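As an illustration of the separate embedding layers for the eight codebooks mentioned above, a minimal PyTorch sketch follows; the codebook size, model dimension and the choice to sum the embeddings of the already-known codebooks are assumptions for illustration rather than details taken from the patent.

```python
import torch
import torch.nn as nn

NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 1024   # placeholder vocabulary size per codebook
D_MODEL = 512          # placeholder model dimension


class CodebookEmbeddings(nn.Module):
    """One embedding table per codebook; the codes of codebooks 1..j are combined
    (summed here, as an assumption) to condition the non-autoregressive model
    when predicting codebook j+1."""

    def __init__(self):
        super().__init__()
        self.tables = nn.ModuleList(
            [nn.Embedding(CODEBOOK_SIZE, D_MODEL) for _ in range(NUM_CODEBOOKS)]
        )

    def forward(self, codes):            # codes: (batch, T, j) holding the first j codebooks
        j = codes.shape[-1]
        return sum(self.tables[k](codes[..., k]) for k in range(j))  # (batch, T, D_MODEL)
```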
S8, splicing the audio predictive coding of the first codebook and the audio predictive coding of other codebooks, and then decoding through an audio coder and decoder to obtain synthesized audio.
Specifically, at the inference stage, the autoregressive model generates the audio codes of the first codebook using sampling-based decoding, and these are used as an input of the non-autoregressive model. The non-autoregressive model then generates the highest-probability audio codes of the remaining seven codebooks using greedy decoding. Finally, the audio codes of all codebooks are spliced and input into the decoder of the audio codec Encodec to obtain the speech to be synthesized.
In this embodiment, the final output of the model is produced by the combination of the autoregressive model and the non-autoregressive model. The autoregressive model first predicts the audio codes of the speech to be synthesized in the first codebook. Then, the discrete codes of the first codebook are used as one of the conditioning inputs of the non-autoregressive model, which predicts the audio codes of the speech to be synthesized corresponding to the 2nd to 8th codebooks. Finally, the discrete codes of the 8 codebooks corresponding to the speech to be synthesized are spliced and input into the decoder of the audio codec Encodec to obtain the final synthesized speech.
The mathematical formulation of the speech to be synthesized is derived as: $p\left(C \mid x, \tilde{C}; \theta\right) = p\left(c_{:,1} \mid x, \tilde{c}_{:,1}; \theta_{AR}\right)\, p\left(c_{:,2:8} \mid x, \tilde{C}; \theta_{NAR}\right)$, in which $p$ represents probability, $C$ represents the audio codes of the synthesized speech, $x$ represents the phoneme codes, $\tilde{C}$ represents the audio codes of the prompt speech, $\theta$ represents the parameters of the model, $c_{:,1}$ represents the audio prediction codes of the first codebook of the speech to be synthesized, $\tilde{c}_{:,1}$ represents the audio codes of the first codebook of the prompt speech, $\theta_{AR}$ represents the parameters of the autoregressive model, $c_{:,2:8}$ represents the audio prediction codes of the 2nd to 8th codebooks of the speech to be synthesized, and $\theta_{NAR}$ represents the parameters of the non-autoregressive model.
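The two-stage inference described above can be sketched as follows. The ar_model, nar_model and codec_decoder objects and their call signatures are placeholders for whatever implementation is used, so this is a structural sketch rather than code runnable against a specific library.

```python
import torch


@torch.no_grad()
def synthesize(ar_model, nar_model, codec_decoder, phoneme_codes, prompt_codes,
               time_pos, audio_pos, max_len=1500, eos_id=1024):
    """Two-stage inference: sample codebook 1 autoregressively, then fill
    codebooks 2-8 greedily with the non-autoregressive model.
    All model objects and their call signatures are illustrative placeholders."""
    # Stage 1: autoregressive, sampling-based decoding of the first codebook.
    first = []
    for _ in range(max_len):
        logits = ar_model(phoneme_codes, prompt_codes[:, :, 0], first,
                          time_pos, audio_pos)           # logits over codebook entries
        probs = torch.softmax(logits[-1], dim=-1)
        token = torch.multinomial(probs, num_samples=1).item()
        if token == eos_id:
            break
        first.append(token)

    # Stage 2: non-autoregressive, greedy decoding of codebooks 2-8.
    codes = [first]
    for j in range(2, 9):
        logits = nar_model(phoneme_codes, prompt_codes, codes, j,
                           time_pos, audio_pos)           # (T, codebook_size)
        codes.append(logits.argmax(dim=-1).tolist())

    # Splice the 8 codebooks and decode to a waveform with the codec decoder.
    code_matrix = torch.tensor(codes).unsqueeze(0)        # (1, 8, T)
    return codec_decoder(code_matrix)
```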
According to the speech synthesis method based on time-aware position coding, time-aware position coding is introduced in place of traditional sine-cosine position coding, which effectively improves the convergence speed of the model: convergence is reached in about 10,000 training iterations instead of 30,000, and the quality of the generated audio is higher. In addition, the time-aware position coding makes it possible to control the pronunciation of each word at the inference stage and to change the speed of the audio, which is of good practical significance.
The second embodiment of the present invention provides a training method of a speech synthesis model based on time-aware position coding, which includes steps M1 to M10.
M1, acquiring prompt voice, prompt voice text, voice to be synthesized and voice text to be synthesized.
Specifically, step M1 is similar to step S1, except that the speech to be synthesized is additionally acquired.
And M2, respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes.
Specifically, since the text corresponding to the prompt voice and the voice to be synthesized can be obtained in advance, the text is converted into the phoneme code to be aimed at the text corresponding to the prompt voice and the voice to be synthesized. Here, both in the training phase and in the reasoning phase are identical, namely: step M2 is identical to step S2.
And M3, respectively encoding the prompt voice and the voice to be synthesized into audio codes of a plurality of codebooks through an audio coder-decoder.
Specifically, in the training phase the speech to be synthesized is available in addition to what is available in the inference phase. Both the prompt speech and the speech to be synthesized are converted into audio codes by the encoder of the audio codec Encodec.
In the training phase, both the prompt speech and the speech to be synthesized are converted into discrete codes, because the speech to be synthesized serves as the training label of the autoregressive model. In the inference phase, the speech to be synthesized is the result of inference, so its discrete codes cannot be acquired in advance; audio coding of the speech to be synthesized is therefore neither possible nor needed.
And M4, respectively acquiring the time stamp information of each word in the prompt voice text and the voice text to be synthesized according to the corresponding relation between the voice and the text content, and then acquiring the time perception position code of the phonemes by combining the phoneme codes.
Specifically, in the training stage the timestamp information corresponding to the prompt speech can naturally be acquired, so its time-aware position codes can be generated. The speech to be synthesized is also available as a training label, so its timestamp information can likewise be obtained. In the inference stage, only the audio of the prompt speech is available, so only the time-aware position codes corresponding to the prompt speech can be obtained.
That is, in the training stage the timestamp information of the texts corresponding to both the prompt speech and the speech to be synthesized is obtained from the respective audio, and the time-aware position code of each word in the audio is obtained from this timestamp information. In the inference stage, only the prompt speech is available, so only the timestamp information of the text corresponding to the prompt speech can be obtained.
And M5, acquiring the audio position codes of the words in the text at the corresponding positions of the audio according to the time perception position codes. Specifically, step M5 is identical to step S5, except that the processing object has more speech texts to be synthesized.
M6, splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the corresponding time perception position code and the audio position code with the spliced code, and then inputting the codes into an autoregressive model together to obtain the audio predictive code of the first codebook of the voice to be synthesized.
And M7, acquiring the real audio coding of a first codebook of the voice to be synthesized, and then calculating an autoregressive loss with the audio predictive coding of the first codebook to train the autoregressive model.
Specifically, a flow architecture diagram of the autoregressive model in the training phase is shown in Fig. 2. The phoneme codes corresponding to the texts of the prompt speech and the speech to be synthesized are spliced with the discrete codes corresponding to the audio, the corresponding position code information is aligned, and the result is input into the autoregressive model. In this embodiment, the autoregressive model is a Transformer-based autoregressive model for autoregressively generating the audio codes (discrete codes) of the audio to be synthesized. The optimization objective of the autoregressive model is to maximize the probability of the audio codes of the first codebook through a cross-entropy loss.
The arrow of the speech to be synthesized points downward in Fig. 2 because the speech to be synthesized is a time-domain waveform, while the next token predicted by the autoregressive model is the corresponding discrete code of the speech to be synthesized, so the waveform must first be converted into the corresponding discrete codes by the audio codec.
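The following sketch illustrates one training step of the autoregressive stage as described above (teacher forcing with a cross-entropy loss on the first codebook); the model interface, batch keys and tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def ar_training_step(ar_model, optimizer, batch):
    """One optimization step for the autoregressive model (steps M6-M7).

    batch is assumed to contain already-aligned inputs:
      phoneme_codes: (B, L)  spliced phoneme codes of prompt + target text
      prompt_first:  (B, Tp) first-codebook codes of the prompt speech
      target_first:  (B, Tt) ground-truth first-codebook codes of the target speech
      time_pos, audio_pos:   aligned position codes
    """
    logits = ar_model(batch["phoneme_codes"], batch["prompt_first"],
                      batch["target_first"][:, :-1],          # teacher forcing: shift right
                      batch["time_pos"], batch["audio_pos"])  # (B, Tt-1, vocab)

    loss = F.cross_entropy(logits.transpose(1, 2),            # (B, vocab, Tt-1)
                           batch["target_first"][:, 1:])      # predict the next code

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```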
M8, splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the audio codes of the j-th codebook of the voice to be synthesized and the phoneme codes of the voice text to be synthesized, aligning the corresponding time perception position codes and the audio position codes with the spliced codes, and then inputting the codes into a non-autoregressive model together to obtain the audio predictive codes of the j+1-th codebook of the voice to be synthesized until the audio predictive codes of other codebooks except the first codebook are generated.
M9, acquiring the real audio coding of the j+1th codebook of the voice to be synthesized, and then calculating a non-autoregressive loss with the audio predictive coding of the j+1th codebook to train the non-autoregressive model.
Specifically, a flow architecture diagram of the non-autoregressive model in the training phase is shown in Fig. 3. In this embodiment, the autoregressive model and the non-autoregressive model are trained in a decoupled manner. That is, using the audio position codes, the audio codes of all codebooks corresponding to the prompt speech and the phoneme codes of the prompt speech text are aligned and spliced, the phoneme codes of the speech text to be synthesized and the audio codes of the speech to be synthesized are aligned and spliced, and the results are input into the non-autoregressive model together.
In this embodiment, the non-autoregressive model is a Transformer-based non-autoregressive model for generating the audio codes (discrete codes) of the audio to be synthesized. Its optimization objective is to maximize the probability of each audio code in the next codebook through a cross-entropy loss.
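A corresponding sketch of one non-autoregressive training step (steps M8-M9) is shown below; sampling a single codebook index j per step and the model interface are illustrative assumptions rather than details stated in the patent.

```python
import random

import torch
import torch.nn.functional as F


def nar_training_step(nar_model, optimizer, batch, num_codebooks=8):
    """One optimization step for the non-autoregressive model: condition on
    codebooks 1..j of the target speech and predict codebook j+1."""
    j = random.randint(1, num_codebooks - 1)                  # predict codebook j+1

    logits = nar_model(batch["phoneme_codes"],
                       batch["prompt_codes"],                 # all 8 codebooks of the prompt
                       batch["target_codes"][:, :, :j],       # codebooks 1..j of the target
                       j,
                       batch["time_pos"], batch["audio_pos"])  # (B, Tt, vocab)

    target = batch["target_codes"][:, :, j]                   # ground truth for codebook j+1
    loss = F.cross_entropy(logits.transpose(1, 2), target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```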
And M10, training the autoregressive model and the non-autoregressive model to obtain a speech synthesis model based on time perception position coding.
Specifically, in the training stage, the autoregressive model is trained according to steps M6 and M7, the non-autoregressive model is trained according to steps M8 and M9, and the two stages can be decoupled for training.
By means of time-aware position coding, the embodiment of the invention helps the model better capture the temporal information of the audio pronunciation, accelerates model convergence and effectively improves the quality of the generated audio. Moreover, by controlling the timestamp information at the inference stage, the speaking speed of the audio can be changed, making the pronunciation duration of each word more controllable, which constitutes a notable improvement.
The third embodiment of the invention provides a speech synthesis device based on time-aware position coding, which comprises an initial data acquisition module, a phoneme coding module, an audio coding module, a time-aware position coding acquisition module, an audio position coding module, a first prediction module, a second prediction module and a splicing module.
The initial data acquisition module is used for acquiring prompt voice, prompt voice text and voice text to be synthesized.
And the phoneme coding module is used for respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes.
And the audio coding module is used for coding the prompt voice into audio codes of a plurality of codebooks through an audio coder.
And the time perception position code acquisition module is used for acquiring the time stamp information of each word according to the prompt voice and the prompt voice text, and then acquiring the time perception position code of the phoneme by combining the phoneme code of the prompt voice text.
And the audio position coding module is used for acquiring the audio position codes of the words in the text at the corresponding positions of the audio according to the time perception position codes.
The first prediction module is used for splicing the audio coding of the first codebook of the prompt voice, the phoneme coding of the prompt voice text and the phoneme coding of the voice text to be synthesized, aligning the time perception position coding and the audio position coding with the spliced codes, and then inputting the time perception position coding and the audio position coding into the autoregressive model together to obtain the audio prediction coding of the first codebook of the voice to be synthesized.
The second prediction module is used for splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the phoneme codes of the voice text to be synthesized and the audio prediction codes of the first codebook, aligning the time perception position codes and the audio position codes with the spliced codes, and then inputting the codes into a non-autoregressive model together to obtain the audio prediction codes of other codebooks of the voice to be synthesized.
And the splicing module is used for splicing the audio predictive coding of the first codebook and the audio predictive coding of other codebooks, and then decoding the audio predictive coding by an audio coder and decoder to obtain synthesized audio.
The fourth embodiment of the invention provides a training device of a speech synthesis model based on time perception position coding, which comprises a training data acquisition module, a phoneme coding module, an audio coding module, a time perception position coding acquisition module, an audio position coding module, a first prediction module, a first training module, a second prediction module, a second training module and a model acquisition module.
The training data acquisition module is used for acquiring prompt voice, prompt voice text, voice to be synthesized and voice text to be synthesized.
And the phoneme coding module is used for respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes.
And the audio coding module is used for respectively coding the prompt voice and the voice to be synthesized into audio codes of a plurality of codebooks through an audio coder-decoder.
And the time perception position code acquisition module is used for respectively acquiring the time stamp information of each word in the prompt voice text and the voice text to be synthesized according to the corresponding relation between the voice and the text content, and then combining the phoneme codes to acquire the time perception position codes of the phonemes.
And the audio position coding module is used for acquiring the audio position codes of the words in the text at the corresponding positions of the audio according to the time perception position codes.
The first prediction module is used for splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the corresponding time perception position code and the audio position code with the spliced code, and then inputting the time perception position code and the audio position code into the autoregressive model to obtain the audio prediction code of the first codebook of the voice to be synthesized.
And the first training module is used for acquiring the real audio codes of the first codebook of the voice to be synthesized, and then calculating the autoregressive loss with the audio predictive codes of the first codebook to train the autoregressive model.
The second prediction module is used for splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the audio codes of the j-th codebook of the voice to be synthesized and the phoneme codes of the voice text to be synthesized, aligning the corresponding time perception position codes and the audio position codes with the spliced codes, and then inputting the codes into the non-autoregressive model together to obtain the audio prediction codes of the j+1-th codebook of the voice to be synthesized until the audio prediction codes of other codebooks except the first codebook are generated.
And the second training module is used for acquiring the real audio codes of the j+1th codebook of the voice to be synthesized, and then calculating the non-autoregressive loss with the audio predictive codes of the j+1th codebook to train the non-autoregressive model.
The model acquisition module is used for obtaining a speech synthesis model based on time perception position coding after the autoregressive model and the non-autoregressive model are trained.
It is understood that the speech synthesis device based on the time-aware position coding may be an electronic device with computing performance, such as a portable notebook computer, a desktop computer, a server, a smart phone or a tablet computer.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus and method embodiments described above are merely illustrative, for example, flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present invention may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein is merely one relationship describing the association of the associated objects, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
Depending on the context, the word "if" as used herein may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if determined" or "if detected (stated condition or event)" may be interpreted as "when determined" or "in response to determination" or "when detected (stated condition or event)" or "in response to detection (stated condition or event), depending on the context.
References to "first\second" in the embodiments are merely to distinguish similar objects and do not represent a particular ordering for the objects, it being understood that "first\second" may interchange a particular order or precedence where allowed. It is to be understood that the "first\second" distinguishing aspects may be interchanged where appropriate, such that the embodiments described herein may be implemented in sequences other than those illustrated or described herein.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A method of speech synthesis based on time-aware position coding, comprising:
acquiring prompt voice, a prompt voice text and a voice text to be synthesized;
respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes;
encoding the prompt voice into audio codes of a plurality of codebooks through an audio codec;
according to the prompt voice and the prompt voice text, acquiring time stamp information of each word, and then acquiring time perception position codes of phonemes by combining phoneme codes of the prompt voice text;
Acquiring audio position codes of each word in the text at corresponding audio positions according to the time perception position codes;
Splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the time perception position code and the audio position code with the spliced code, and then inputting the aligned codes into an autoregressive model to obtain the audio predictive code of the first codebook of the voice to be synthesized;
Splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the phoneme codes of the voice text to be synthesized and the audio prediction codes of the first codebook, aligning the time perception position codes and the audio position codes with the spliced codes, and inputting the codes into a non-autoregressive model together to obtain the audio prediction codes of other codebooks of the voice to be synthesized;
splicing the audio predictive codes of the first codebook and the audio predictive codes of the other codebooks, and then decoding through the audio codec to obtain the synthesized audio.
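For orientation, the following is a minimal Python sketch of the synthesis flow recited in claim 1. The codec object, the AR/NAR model objects with a generate method, and the helpers text_to_phoneme_codes and compute_time_aware_positions are hypothetical stand-ins introduced only for illustration; they are not APIs defined by this patent, and an 8-codebook neural codec is assumed.

```python
# Illustrative sketch only; object and helper names are assumptions.
def synthesize(prompt_wav, prompt_text, target_text, codec, ar_model, nar_model):
    # Phoneme codes for the prompt text and the text to be synthesized (claim 2)
    prompt_ph = text_to_phoneme_codes(prompt_text)
    target_ph = text_to_phoneme_codes(target_text)

    # Audio codes of the prompt speech, shape (T_prompt, num_codebooks)
    prompt_codes = codec.encode(prompt_wav)

    # Time perception position codes and audio position codes (claims 3 and 4)
    time_pos = compute_time_aware_positions(prompt_wav, prompt_text, prompt_ph)
    audio_pos = [int(p) for p in time_pos]   # floor for non-negative positions

    # First codebook of the speech to be synthesized (autoregressive model)
    first_cb = ar_model.generate(prompt_codes[:, 0], prompt_ph, target_ph,
                                 time_pos=time_pos, audio_pos=audio_pos)

    # Remaining codebooks (non-autoregressive model), spliced with the first
    all_cbs = nar_model.generate(prompt_codes, prompt_ph, target_ph, first_cb,
                                 time_pos=time_pos, audio_pos=audio_pos)

    # Decode the spliced codebooks into the synthesized waveform
    return codec.decode(all_cbs)
```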
2. The method for speech synthesis based on time-aware position coding according to claim 1, wherein the converting the prompt speech text and the speech text to be synthesized into phoneme codes, respectively, specifically comprises:
segmenting the prompt voice text and the voice text to be synthesized respectively through a word segmentation tool;
Respectively converting the prompt voice text and the word segmentation of the voice text to be synthesized into phonemes through a phoneme conversion model;
Respectively converting phonemes of the prompt voice text and the voice text to be synthesized into phoneme codes through a mapping dictionary; wherein the mapping dictionary contains discrete codes corresponding to the phonemes.
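A minimal sketch of the conversion recited in claim 2, and one possible implementation of the text_to_phoneme_codes helper used in the sketch after claim 1, with the segmentation tool, grapheme-to-phoneme (G2P) model and phoneme-to-id mapping dictionary passed explicitly; none of these concrete tools is prescribed by the patent.

```python
# Hypothetical segmenter, G2P model and mapping dictionary; illustration only.
def text_to_phoneme_codes(text, segmenter, g2p, phoneme_to_id):
    words = segmenter.cut(text)                     # word segmentation
    phonemes = [p for w in words for p in g2p(w)]   # words -> phonemes
    return [phoneme_to_id[p] for p in phonemes]     # phonemes -> discrete codes
```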
3. The speech synthesis method based on time-aware position coding according to claim 1, wherein the step of obtaining time stamp information of each word in the text comprises the steps of:
Acquiring a time stamp range of each word in the text in voice audio; wherein the timestamp range includes a start timestamp and an end timestamp of each word in audio;
obtaining the time-aware position coding of the phonemes, comprising the steps of:
According to the phoneme codes of each word in the text, each phoneme in the phoneme codes is numbered through an arithmetic progression to obtain the phoneme number range of each word; wherein $n_k \in \{1, 2, \dots, N\}$, $n_k$ is the number of the $k$-th phoneme, and $N$ is the total number of phonemes of the text;
respectively calculating the phoneme position codes of the words according to the time stamp range and the phoneme number range; wherein the phoneme position code $p_k$ of the $k$-th phoneme of the $i$-th word is computed from the position code of the preceding phoneme, the phoneme number $n_k$, the total number of phonemes $N$ of the text, the normalized arithmetic-progression difference, and the end timestamp $e_i$ and start timestamp $s_i$ of the $i$-th word, so that the position codes of a word's phonemes are distributed in equal steps between $s_i$ and $e_i$;
splicing the phoneme position codes of each word in the text, and unifying the coding scale of the spliced phoneme position codes and the coding scale of the audio coder-decoder to obtain the time perception position codes of the phonemes.
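One possible reading of claim 3, sketched below: phonemes are numbered by an arithmetic progression over the whole text, and the position codes of each word's phonemes advance in equal steps from that word's start timestamp to its end timestamp; the final scaling by a codec frame rate (75 frames per second here) is an assumption used only to illustrate unifying the coding scale with the audio codec.

```python
# Illustrative interpolation reading of the time perception position codes.
def time_aware_positions(word_spans, phonemes_per_word, frame_rate=75.0):
    """word_spans: (start_s, end_s) per word; phonemes_per_word: phoneme counts."""
    positions = []
    for (start, end), count in zip(word_spans, phonemes_per_word):
        step = (end - start) / count      # equal time step per phoneme in the word
        for i in range(count):
            positions.append(start + (i + 1) * step)
    # unify the coding scale with the audio codec (codec frames instead of seconds)
    return [p * frame_rate for p in positions]
```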
4. The method for synthesizing speech based on time-aware position coding according to claim 1, wherein obtaining audio position codes of words in text at corresponding positions of audio according to the time-aware position coding specifically comprises:
obtaining the audio position codes of the words in the text at the corresponding audio positions by rounding down the time perception position codes of the words in the text.
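Claim 4 amounts to a floor operation on the time perception position codes; a one-line sketch:

```python
import math

# Audio position code = time perception position code rounded down (claim 4).
def audio_positions(time_aware_positions):
    return [math.floor(p) for p in time_aware_positions]
```

For example, a phoneme whose time perception position code is 37.8 codec frames is aligned to audio position 37.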
5. The method of claim 1, wherein the autoregressive model is: $p\big(\mathbf{C}_{:,1} \mid \mathbf{X}, \widetilde{\mathbf{C}}_{:,1}; \theta_{AR}\big) = \prod_{t=0}^{T} p\big(c_{t,1} \mid \mathbf{C}_{<t,1}, \widetilde{\mathbf{C}}_{:,1}, \mathbf{X}; \theta_{AR}\big)$, where $p$ denotes probability, $\mathbf{C}_{:,1}$ denotes the audio predictive codes of the first codebook of the speech to be synthesized, $\mathbf{X}$ denotes the phoneme codes, $\widetilde{\mathbf{C}}_{:,1}$ denotes the audio codes of the first codebook of the prompt speech, $\theta_{AR}$ denotes the parameters of the autoregressive model, $t$ is the audio coding position, $T$ is the number of audio coding positions, $c_{t,1}$ is the audio code of the $t$-th position of the first codebook of the speech to be synthesized, and $\mathbf{C}_{<t,1}$ denotes the audio codes of the positions less than $t$ of the first codebook of the speech to be synthesized.
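The factorization above can be realized as a token-by-token sampling loop over the first codebook. The sketch below is illustrative only; ar_model, its call signature, the sampling strategy and the end-of-sequence token id are hypothetical stand-ins.

```python
import torch

# Illustrative autoregressive decoding loop for the first codebook.
@torch.no_grad()
def generate_first_codebook(ar_model, phonemes, prompt_cb1, time_pos, audio_pos,
                            eos_id=1024, max_len=1500):
    generated = []
    for _ in range(max_len):
        logits = ar_model(phonemes, prompt_cb1,
                          torch.tensor(generated, dtype=torch.long),
                          time_pos=time_pos, audio_pos=audio_pos)
        probs = torch.softmax(logits[-1], dim=-1)   # distribution for the next position
        token = int(torch.multinomial(probs, 1))
        if token == eos_id:                          # stop at end of sequence
            break
        generated.append(token)
    return torch.tensor(generated, dtype=torch.long)
```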
6. The method of claim 1, wherein the non-autoregressive model is: $p\big(\mathbf{C}_{:,2:8} \mid \mathbf{X}, \widetilde{\mathbf{C}}; \theta_{NAR}\big) = \prod_{j=2}^{8} p\big(\mathbf{C}_{:,j} \mid \mathbf{C}_{:,<j}, \mathbf{X}, \widetilde{\mathbf{C}}; \theta_{NAR}\big)$, where $p$ denotes probability, $\mathbf{C}_{:,2:8}$ denotes the audio predictive codes of the 2nd to 8th codebooks of the speech to be synthesized, $\mathbf{X}$ denotes the phoneme codes, $\widetilde{\mathbf{C}}$ denotes the audio codes of the prompt speech, $\theta_{NAR}$ denotes the parameters of the non-autoregressive model, $j$ is the codebook index, $\mathbf{C}_{:,j}$ denotes the audio predictive codes of the $j$-th codebook of the speech to be synthesized, and $\mathbf{C}_{:,<j}$ denotes the audio predictive codes of the codebooks preceding the $j$-th codebook of the speech to be synthesized.
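Analogously, the non-autoregressive factorization can be realized as a loop over codebooks 2 to 8, predicting all positions of one codebook in parallel per step; nar_model and its call signature are hypothetical stand-ins.

```python
import torch

# Illustrative non-autoregressive prediction of codebooks 2..8.
@torch.no_grad()
def generate_other_codebooks(nar_model, phonemes, prompt_codes, first_cb,
                             time_pos, audio_pos, num_codebooks=8):
    codebooks = [first_cb]                          # codebook 1 from the AR model
    for j in range(2, num_codebooks + 1):
        logits = nar_model(phonemes, prompt_codes, torch.stack(codebooks, dim=-1),
                           codebook_index=j, time_pos=time_pos, audio_pos=audio_pos)
        codebooks.append(logits.argmax(dim=-1))     # greedy choice per position
    return torch.stack(codebooks, dim=-1)           # (T, num_codebooks)
```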
7. A method for training a speech synthesis model based on time-aware position coding, comprising:
Acquiring prompt voice, prompt voice text, voice to be synthesized and voice text to be synthesized;
Respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes;
respectively encoding the prompt voice and the voice to be synthesized into audio codes of a plurality of codebooks through an audio coder-decoder;
Respectively acquiring time stamp information of each word in the prompt voice text and the voice text to be synthesized according to the corresponding relation between the voice and the text content, and then acquiring time perception position codes of phonemes by combining phoneme codes;
Acquiring audio position codes of each word in the text at corresponding audio positions according to the time perception position codes;
Splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the corresponding time perception position code and the audio position code with the spliced code, and then inputting the code and the code into an autoregressive model to obtain the audio predictive code of the first codebook of the voice to be synthesized;
Acquiring a real audio code of a first codebook of speech to be synthesized, and then calculating an autoregressive loss with the audio predictive code of the first codebook to train the autoregressive model;
Splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the audio codes of the j-th codebook of the voice to be synthesized and the phoneme codes of the voice text to be synthesized, aligning the corresponding time perception position codes and the audio position codes with the spliced codes, and inputting the codes into a non-autoregressive model together to obtain the audio predictive codes of the j+1th codebook of the voice to be synthesized until the audio predictive codes of other codebooks except the first codebook are generated;
Acquiring a real audio code of a j+1th codebook of a voice to be synthesized, and then calculating a non-autoregressive loss with the audio predictive code of the j+1th codebook to train the non-autoregressive model;
And training the autoregressive model and the non-autoregressive model to obtain a speech synthesis model based on time perception position coding.
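A sketch of one possible training step for claim 7, assuming both models expose logits over the codec vocabulary and that the autoregressive and non-autoregressive losses are token-level cross-entropy; the batch layout, field names and the choice of codebook index j per step are assumptions, not specifics fixed by the claim.

```python
import torch
import torch.nn.functional as F

# Illustrative joint training step for the AR and NAR models.
def training_step(ar_model, nar_model, batch, optimizer):
    ar_logits = ar_model(batch["ar_inputs"])                       # (T, vocab)
    ar_loss = F.cross_entropy(ar_logits, batch["target_cb1"])      # first codebook

    j = batch["codebook_index"]                                    # some j in 1..7
    nar_logits = nar_model(batch["nar_inputs"], codebook_index=j)  # (T, vocab)
    # column j of the 0-indexed (T, 8) target matrix is codebook j+1
    nar_loss = F.cross_entropy(nar_logits, batch["target_codes"][:, j])

    loss = ar_loss + nar_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```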
8. A speech synthesis apparatus based on time-aware position coding, comprising:
the initial data acquisition module is used for acquiring prompt voice, prompt voice text and voice text to be synthesized;
The phoneme coding module is used for respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes;
An audio encoding module for encoding the prompt voice into audio codes of a plurality of codebooks through an audio codec;
the time perception position code acquisition module is used for acquiring time stamp information of each word according to the prompt voice and the prompt voice text, and then combining phoneme codes of the prompt voice text to acquire time perception position codes of phonemes;
the audio position coding module is used for acquiring audio position codes of all words in the text at corresponding positions of the audio according to the time perception position codes;
The first prediction module is used for splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the time perception position code and the audio position code with the spliced codes, and then inputting the time perception position code and the audio position code into the autoregressive model to obtain the audio prediction code of the first codebook of the voice to be synthesized;
the second prediction module is used for splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the phoneme codes of the voice text to be synthesized and the audio prediction codes of the first codebook, aligning the time perception position codes and the audio position codes with the spliced codes, and then inputting the codes into a non-autoregressive model together to obtain the audio prediction codes of other codebooks of the voice to be synthesized;
and the splicing module is used for splicing the audio predictive codes of the first codebook and the audio predictive codes of the other codebooks, and then decoding them through the audio codec to obtain the synthesized audio.
9. A training device for a speech synthesis model based on time-aware position coding, comprising:
The training data acquisition module is used for acquiring prompt voice, prompt voice text, voice to be synthesized and voice text to be synthesized;
the phoneme coding module is used for respectively converting the prompt voice text and the voice text to be synthesized into phoneme codes;
The audio coding module is used for respectively coding the prompt voice and the voice to be synthesized into audio codes of a plurality of codebooks through an audio coder-decoder;
the time perception position code acquisition module is used for respectively acquiring the time stamp information of each word in the prompt voice text and the voice text to be synthesized according to the corresponding relation between the voice and the text content, and then combining the phoneme codes to acquire the time perception position codes of the phonemes;
the audio position coding module is used for acquiring audio position codes of all words in the text at corresponding positions of the audio according to the time perception position codes;
The first prediction module is used for splicing the audio code of the first codebook of the prompt voice, the phoneme code of the prompt voice text and the phoneme code of the voice text to be synthesized, aligning the corresponding time perception position code and the audio position code with the spliced code, and then inputting the codes into the autoregressive model to obtain the audio prediction code of the first codebook of the voice to be synthesized;
the first training module is used for acquiring the real audio coding of a first codebook of the voice to be synthesized, and then calculating the autoregressive loss with the audio predictive coding of the first codebook to train the autoregressive model;
The second prediction module is used for splicing the audio codes of all codebooks of the prompt voice, the phoneme codes of the prompt voice text, the audio codes of the j-th codebook of the voice to be synthesized and the phoneme codes of the voice text to be synthesized, aligning the corresponding time perception position codes and the audio position codes with the spliced codes, and then inputting the codes into a non-autoregressive model together to obtain the audio prediction codes of the j+1th codebook of the voice to be synthesized until the audio prediction codes of other codebooks except the first codebook are generated;
the second training module is used for acquiring the real audio codes of the j+1th codebook of the voice to be synthesized, and then calculating non-autoregressive loss with the audio predictive codes of the j+1th codebook so as to train the non-autoregressive model;
The model acquisition module is used for obtaining a speech synthesis model based on time perception position coding after the autoregressive model and the non-autoregressive model are trained.
CN202410512813.2A 2024-04-26 2024-04-26 Speech synthesis method based on time perception position coding and model training method thereof Pending CN118116363A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410512813.2A CN118116363A (en) 2024-04-26 2024-04-26 Speech synthesis method based on time perception position coding and model training method thereof

Publications (1)

Publication Number Publication Date
CN118116363A true CN118116363A (en) 2024-05-31

Family

ID=91208983

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410512813.2A Pending CN118116363A (en) 2024-04-26 2024-04-26 Speech synthesis method based on time perception position coding and model training method thereof

Country Status (1)

Country Link
CN (1) CN118116363A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102287499B1 (en) * 2020-09-15 2021-08-09 주식회사 에이아이더뉴트리진 Method and apparatus for synthesizing speech reflecting phonemic rhythm
CN113257221A (en) * 2021-07-06 2021-08-13 成都启英泰伦科技有限公司 Voice model training method based on front-end design and voice synthesis method
CN113870838A (en) * 2021-09-27 2021-12-31 平安科技(深圳)有限公司 Voice synthesis method, device, equipment and medium
CN114387946A (en) * 2020-10-20 2022-04-22 北京三星通信技术研究有限公司 Training method of speech synthesis model and speech synthesis method
CN117612512A (en) * 2023-11-15 2024-02-27 腾讯音乐娱乐科技(深圳)有限公司 Training method of voice model, voice generating method, equipment and storage medium
CN117809620A (en) * 2024-01-22 2024-04-02 网易有道信息技术(北京)有限公司 Speech synthesis method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11295721B2 (en) Generating expressive speech audio from text data
Arık et al. Deep voice: Real-time neural text-to-speech
US11735162B2 (en) Text-to-speech (TTS) processing
Mertens The prosogram: Semi-automatic transcription of prosody based on a tonal perception model
JP2022531414A (en) End-to-end automatic speech recognition of digit strings
Wang et al. A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural F0 Model for Statistical Parametric Speech Synthesis
US11763797B2 (en) Text-to-speech (TTS) processing
CN110570876B (en) Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
JP7379756B2 (en) Prediction of parametric vocoder parameters from prosodic features
WO2011080597A1 (en) Method and apparatus for synthesizing a speech with information
EP4266306A1 (en) A speech processing system and a method of processing a speech signal
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
Guo et al. MSMC-TTS: Multi-stage multi-codebook VQ-VAE based neural TTS
ES2366551T3 (en) CODING AND DECODING DEPENDENT ON A SOURCE OF MULTIPLE CODE BOOKS.
CN118116363A (en) Speech synthesis method based on time perception position coding and model training method thereof
JP5268731B2 (en) Speech synthesis apparatus, method and program
CN114203151A (en) Method, device and equipment for training speech synthesis model
CN115206281A (en) Speech synthesis model training method and device, electronic equipment and medium
Alastalo Finnish end-to-end speech synthesis with Tacotron 2 and WaveNet
US12033611B2 (en) Generating expressive speech audio from text data
Oh et al. DurFlex-EVC: Duration-Flexible Emotional Voice Conversion with Parallel Generation
CN114360492B (en) Audio synthesis method, device, computer equipment and storage medium
Louw Neural speech synthesis for resource-scarce languages

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination