CN116092479A - Text prosody generation method and system based on contrastive text-audio pairs


Info

Publication number: CN116092479A
Authority: CN (China)
Prior art keywords: text, phoneme, word, prosody, level
Legal status: Granted
Application number: CN202310361791.XA
Other languages: Chinese (zh)
Other versions: CN116092479B (en)
Inventors: 黄俊杰, 姜伟昊, 王志辉, 李烈锋, 孙清, 陈梓铭
Current Assignee: Hangzhou Dongshang Intelligent Technology Co., Ltd.
Original Assignee: Hangzhou Dongshang Intelligent Technology Co., Ltd.
Priority date: 2023-04-07
Filing date: 2023-04-07
Application filed by Hangzhou Dongshang Intelligent Technology Co., Ltd.
Priority to CN202310361791.XA
Publication of CN116092479A: 2023-05-09
Application granted; publication of CN116092479B: 2023-07-07
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a text prosody generation method and system based on contrastive text-audio pairs, belonging to the field of speech synthesis. Original speech audio and its corresponding texts are acquired as a training set; a prosody encoder encodes the prosodic features of each selected token (a phoneme or a word), and a text encoder encodes the textual features of the same token. A cosine similarity matrix is computed from the selected-token prosodic features and text features, a symmetric cross entropy loss is calculated, and contrastive training is carried out at the two scales of words and phonemes. For a given text, the trained text encoder encodes its phoneme sequence and byte pair encoding sequence to obtain the phoneme-level and/or word-level text feature encodings, from which the prosody corresponding to the text is generated. Through contrastive text-audio pair pre-training, the invention can fully learn prosody-related text representation information from the text context while ignoring semantic information, improving the quality of audio synthesized by downstream tasks.

Description

Text prosody generation method and system based on contrastive text-audio pairs
Technical Field
The invention relates to the field of speech synthesis, and in particular to a text prosody generation method and system based on contrastive text-audio pairs.
Background
Learning text representations for expressive speech synthesis has recently attracted a great deal of attention as an important artificial intelligence task. Its aim is to link a text context with its prosodic features and to provide prosody-related text representation information to the speech synthesis task, thereby improving the quality of the synthesized speech.
Existing approaches to prosody-rich speech synthesis mainly include: 1) predicting prosodic attributes with a reference encoder and a style token vocabulary; 2) predicting prosodic attributes with additional predictors; 3) modeling prosody with a variational autoencoder; 4) providing better text representations with prior knowledge; 5) pre-training to learn semantic information from the text space; and 6) pre-training by filling in masked portions of the speech (prosody).
The first method addresses the mismatch between training and inference caused by the fact that the real speech available during training is not available at inference time; the second, third and fourth methods bring clear gains in the expressiveness of the synthesized speech; the fifth method learns semantic information of the text space but ignores prosody variation in the speech space; the sixth method focuses on the speech space but also learns a large amount of phoneme information, so model training and speech synthesis are slow.
Disclosure of Invention
In order to overcome the defect that the prior art can learn a large amount of textual information but cannot adequately learn prosody-related text representation information, the invention provides a text prosody generation method and system based on contrastive text-audio pairs. Through contrastive text-audio pair pre-training, prosody-related text representation information can be fully learned from the text context while semantic information is ignored, providing rich prosody-related text representations for a speech synthesis model and improving the quality of the synthesized audio.
The specific technical solution adopted by the invention is as follows:
In a first aspect, the invention provides a text prosody generation method based on contrastive text-audio pairs, comprising the following steps:
Step 1: acquire original speech audio and its corresponding texts, extract the phoneme sequence and byte pair encoding sequence of each text, and obtain the mel spectrogram of the original speech audio;
Step 2: cut the mel spectrograms of all selected tokens out of the mel spectrograms of the original speech audio;
Step 3: encode the phoneme sequences and byte pair encoding sequences containing the selected tokens with a text encoder to obtain the phoneme-level and word-level text feature encodings, and obtain the selected-token text features by indexing;
Step 4: encode the cut mel spectrogram of each selected token with a prosody encoder to obtain the selected-token prosodic features;
Step 5: compute the cosine similarities between all selected-token text features and selected-token prosodic features in the combinations of original speech audio and corresponding texts to obtain a cosine similarity matrix;
Step 6: using the one-to-one correspondence between the original speech audio and the texts, compute a symmetric cross entropy loss from the cosine similarity matrix, and contrastively train the text encoder and the prosody encoder at the phoneme level and the word level respectively;
Step 7: for a given text, first extract its phoneme sequence and byte pair encoding sequence, then encode them with the trained text encoder to obtain the phoneme-level and/or word-level text feature encodings, and generate the prosody corresponding to the text.
Further, the text encoder comprises a phoneme embedding layer, a byte pair encoding embedding layer, feedforward neural network blocks, a byte-to-word converter, a word-to-phoneme converter and a phoneme-to-word converter;
the phoneme embedding layer and the byte pair encoding embedding layer take the phoneme sequence and the byte pair encoding sequence as input respectively, and the phoneme embedding result and the byte pair encoding embedding result are passed through two independent feedforward neural network blocks to extract features, producing the speech habit features and the byte pair encoding features;
the byte pair encoding features are converted into phoneme-level byte pair features by the byte-to-word converter and the word-to-phoneme converter in sequence; the phoneme-level byte pair features and the speech habit features are fused and passed through a third independent feedforward neural network block to extract features, producing the phoneme-level text feature encoding;
the phoneme-level text feature encoding is converted into the word-level text feature encoding by the phoneme-to-word converter.
Furthermore, the independent feedforward neural network blocks have the same structure and do not share parameters.
Further, the prosody encoder comprises, connected in sequence, M_1 feature extraction modules, M_2 residual blocks and a one-dimensional attention pooling layer; each feature extraction module consists of a one-dimensional convolution, layer normalization and a ReLU activation function, and each residual block consists of several feature extraction modules.
Further, in step 5, the cosine similarity matrix is computed as follows:
5.1) After layer normalization, linearly project the selected-token text features obtained in step 3 and the selected-token prosodic features obtained in step 4 into a multimodal embedding space, obtaining the selected-token text-modality embedding vectors and the selected-token audio-modality embedding vectors respectively;
5.2) For a training batch containing N text-audio pairs, pair the N selected-token text-modality embedding vectors with the N selected-token audio-modality embedding vectors and compute N^2 cosine similarities to form the similarity matrix:

$C_{ph/word} = T_{ph/word} \cdot S^{T}$

where C_ph/word denotes the cosine similarity matrix of dimension N x N, T_ph/word denotes the selected-token text-modality embedding vectors (the subscript ph denotes the phoneme level and the subscript word denotes the word level), S denotes the selected-token audio-modality embedding vectors, and the superscript T denotes the transpose.
Further, in step 6, the diagonal elements of the cosine similarity matrix are the positive-sample cosine similarities and the remaining elements are the negative-sample cosine similarities, from which the symmetric cross entropy loss is computed:

$L_{ph/word} = \frac{1}{2}\left(L_{text} + L_{speech}\right)$

where L_ph/word denotes the phoneme/word-level symmetric cross entropy loss, τ denotes a learnable scaling parameter applied to the cosine similarities, L_text denotes the cross entropy loss along the text dimension of the cosine similarity matrix, and L_speech denotes the cross entropy loss along the audio dimension of the cosine similarity matrix.
In a second aspect, the invention provides a text prosody generation system based on contrastive text-audio pairs, comprising:
a speech audio preprocessing module, configured to obtain the mel spectrogram of the original speech audio from the combination of original speech audio and corresponding text, and to cut the mel spectrograms of all selected tokens out of the mel spectrogram of the original speech audio;
a text preprocessing module, configured to extract the phoneme sequence and byte pair encoding sequence of the text from the combination of original speech audio and corresponding text;
a text encoder module, configured to encode the phoneme sequences and byte pair encoding sequences containing the selected tokens, obtain the phoneme-level and word-level text feature encodings, and obtain the selected-token text features by indexing;
a prosody encoder module, configured to encode the cut mel spectrogram of each selected token to obtain the selected-token prosodic features;
a multi-scale contrastive learning training module, configured to compute the cosine similarities between all selected-token text features and selected-token prosodic features in the combinations of original speech audio and corresponding texts to obtain a cosine similarity matrix, and, using the one-to-one correspondence between the original speech audio and the texts, compute a symmetric cross entropy loss from the cosine similarity matrix and contrastively train the text encoder and the prosody encoder at the phoneme level and the word level respectively.
Compared with the prior art, the invention has the following beneficial effects:
(1) Through the contrastive text-audio pair pre-training scheme, the text encoder can effectively process the text context and is dedicated to the prosody information carried by the corresponding speech segment; it can be conveniently plugged into an existing speech synthesis system, providing high-quality prosody-related text representation information for downstream speech synthesis tasks and improving the quality of the synthesized audio.
(2) Through multi-scale contrastive learning, the model captures prosody information at both the phoneme and word levels while ignoring semantic information, improving its ability to predict prosody-related text representation information.
Drawings
FIG. 1 is a flow chart of the text prosody generation method based on contrastive text-audio pairs according to an embodiment of the invention;
FIG. 2 is a schematic framework diagram of the text prosody generation system based on contrastive text-audio pairs according to an embodiment of the invention;
FIG. 3 is a schematic structural diagram of the text encoder according to an embodiment of the invention;
FIG. 4 is a schematic structural diagram of the prosody encoder according to an embodiment of the invention;
FIG. 5 is a schematic diagram of embedding the invention into an existing FastSpeech 2 speech synthesis system according to an embodiment of the invention;
FIG. 6 is a schematic diagram of an electronic device terminal for implementing the text prosody generation method based on contrastive text-audio pairs according to an embodiment of the invention.
Detailed Description
The invention is further illustrated and described below with reference to the drawings and specific embodiments. For convenience of illustration, the multi-scale contrastive learning process is described mainly at the phoneme level, and the steps specific to the word level are described additionally where they differ.
As shown in FIG. 1, the text prosody generation method based on contrastive text-audio pairs mainly includes the following steps:
Step 1: acquire original speech audio and its corresponding texts, extract the phoneme sequence and byte pair encoding sequence of each text, and obtain the mel spectrogram of the original speech audio.
Step 2: cut the mel spectrograms of all selected tokens out of the mel spectrograms of the original speech audio.
Step 3: encode the phoneme sequences and byte pair encoding sequences containing the selected tokens with a text encoder to obtain the phoneme-level and word-level text feature encodings, and obtain the selected-token text features by indexing.
Step 4: encode the cut mel spectrogram of each selected token with a prosody encoder to obtain the selected-token prosodic features.
Step 5: compute the cosine similarities between all selected-token text features and selected-token prosodic features in the combinations of original speech audio and corresponding texts to obtain a cosine similarity matrix.
Step 6: using the one-to-one correspondence between the original speech audio and the texts, compute a symmetric cross entropy loss from the cosine similarity matrix, and contrastively train the text encoder and the prosody encoder at the phoneme level and the word level respectively.
Step 7: for a given text, first extract its phoneme sequence and byte pair encoding sequence, then extract the text feature encoding with the trained text encoder and generate the prosody corresponding to the text.
In step 1, D_train original speech audio clips and the corresponding D_train original texts are acquired, each original speech audio corresponding to one original text.
For each given original text, the text first needs to be converted into a phoneme sequence and a byte pair encoding sequence by an open-source tool; such tools are well known to those skilled in the art and are not described in detail here. For simplicity of representation, the pair consisting of the phoneme sequence and the byte pair encoding sequence extracted from the same text is denoted X_text.
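The patent leaves the choice of open-source tool unspecified; the following minimal sketch assumes g2p_en for grapheme-to-phoneme conversion and the GPT-2 byte pair encoding tokenizer from the transformers library as illustrative choices for producing X_text.

```python
# Hedged sketch of the step 1 text preprocessing: the tools below (g2p_en, GPT-2 BPE)
# are illustrative assumptions, not tools prescribed by the patent.
from g2p_en import G2p
from transformers import GPT2TokenizerFast

g2p = G2p()
bpe_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def preprocess_text(text: str):
    # Phoneme sequence (ARPAbet symbols; spaces mark word boundaries and are dropped here).
    phoneme_seq = [p for p in g2p(text) if p != " "]
    # Byte pair encoding sequence of the same text.
    bpe_seq = bpe_tokenizer.encode(text)
    return phoneme_seq, bpe_seq  # together these form X_text

phonemes, bpe_ids = preprocess_text("The higher mountain is far away.")
```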
For each given original speech audio, framing, fast Fourier transform, taking the magnitude, mel filtering and taking the logarithm are performed in sequence to obtain the mel spectrogram of the original speech audio.
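As a concrete illustration of this framing, FFT, magnitude, mel filtering and logarithm pipeline, the sketch below uses librosa; the FFT size, hop length and number of mel filters are illustrative assumptions.

```python
# Hedged sketch of the mel spectrogram extraction; the parameter values are assumptions.
import numpy as np
import librosa

def extract_mel(wav_path: str, n_fft: int = 1024, hop_length: int = 256, n_mels: int = 80):
    audio, sr = librosa.load(wav_path, sr=None)
    # Framing, FFT, magnitude and mel filtering are performed inside melspectrogram.
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels, power=1.0
    )
    # Take the logarithm, with a small floor for numerical stability.
    return np.log(np.maximum(mel, 1e-5))  # shape: (F, T) = (n_mels, num_frames)
```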
In step 2, a token (a phoneme or a word) is selected, the segment of the mel spectrogram of the original speech audio that contains this token is located and cut out, and the resulting mel spectrogram of the token is denoted X_speech. The mel spectrogram X_speech of the selected token has dimension F x T, where F is the number of mel filters and T is the number of audio frames.
In step 3, the text encoder includes a phoneme embedding layer, a byte pair encoding embedding layer, feedforward neural network blocks, a byte-to-word converter, a word-to-phoneme converter and a phoneme-to-word converter. In this embodiment, the text encoding process shown in FIG. 3 specifically includes:
3.1) Use the phoneme embedding layer to embed the phoneme sequence containing the selected token, obtaining the phoneme embedding feature vector;
use the byte pair encoding embedding layer to embed the byte pair encoding sequence containing the selected token, obtaining the byte pair encoding embedding feature vector.
3.2) Pass the phoneme embedding feature vector through a first feedforward neural network block (FFT) to extract the speech habit features;
pass the byte pair encoding embedding feature vector through a second feedforward neural network block (FFT) to obtain the byte pair encoding feature sequence; then pass the byte pair encoding feature sequence through the byte-to-word converter, which averages the byte pair hidden states inside each word boundary to form word-level features, and apply the word-to-phoneme converter, which copies the hidden state of each word to all phonemes within that word boundary, to obtain the phoneme-level byte pair features.
3.3) Fuse the speech habit features and the phoneme-level byte pair features obtained in step 3.2) and pass them through a third feedforward neural network block (FFT) to obtain the phoneme-level text feature encoding.
For the word level, the phoneme-to-word converter converts the phoneme-level text feature encoding into the word-level text feature encoding. Note that the phoneme-to-word converter is required during training to obtain the word-level text feature encoding, while in actual use either phoneme-level or word-level prosody information can be obtained as required.
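The byte-to-word, word-to-phoneme and phoneme-to-word converters described above boil down to boundary-wise averaging and copying. The sketch below is a minimal PyTorch illustration under the assumption that word boundaries are supplied as an integer word index per BPE token and per phoneme; the function names are hypothetical.

```python
# Hedged sketch of the boundary converters; the word-index inputs are assumed to
# come from the alignment between BPE tokens, phonemes and words.
import torch

def pool_to_word_level(features: torch.Tensor, word_ids: torch.Tensor, num_words: int):
    """Byte-to-word (or phoneme-to-word) converter: average the hidden states that
    fall inside each word boundary. features: (L, C); word_ids: (L,) long tensor."""
    sums = torch.zeros(num_words, features.size(-1)).index_add_(0, word_ids, features)
    counts = torch.zeros(num_words).index_add_(0, word_ids, torch.ones(len(word_ids)))
    return sums / counts.clamp(min=1).unsqueeze(-1)             # (num_words, C)

def expand_to_phoneme_level(word_features: torch.Tensor, phoneme_word_ids: torch.Tensor):
    """Word-to-phoneme converter: copy each word's hidden state to all of its phonemes."""
    return word_features[phoneme_word_ids]                       # (num_phonemes, C)

# Example: 5 BPE features belonging to 2 words, expanded onto 7 phonemes.
bpe_feats = torch.randn(5, 8)
word_level = pool_to_word_level(bpe_feats, torch.tensor([0, 0, 0, 1, 1]), num_words=2)
phone_level = expand_to_phoneme_level(word_level, torch.tensor([0, 0, 0, 0, 1, 1, 1]))
```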
For simplicity of representation, the phoneme/word-level text feature encoding is denoted f_text(X_text); this text feature encoding is the prosody information extracted from the text, where f_text(·) denotes the text encoder.
3.4) From the phoneme/word-level text feature encoding f_text(X_text) obtained in step 3.3), the selected-token text features are obtained by indexing, specifically:
for the phoneme-level text feature encoding, the selected-phoneme-token text feature f_text(X_text)_i_ph is obtained with the index i_ph of the selected phoneme token, where ph in the subscript i_ph is an abbreviation of phoneme;
for the word-level text feature encoding, the selected-word-token text feature f_text(X_text)_i_word is obtained with the index i_word of the selected word token.
In step 4, the cut mel spectrogram of the selected token is encoded with the prosody encoder, following the flow shown in FIG. 4. In this embodiment the prosody encoder comprises three feature extraction modules, four residual blocks and a one-dimensional attention pooling layer, specifically:
The mel spectrograms of all selected tokens obtained in step 2 are first processed by three feature extraction modules, each consisting of a one-dimensional convolution (Conv1D), layer normalization (LN) and a ReLU activation function. Taking the first feature extraction module as an example, it is constructed as follows:

$X_{speech,1} = \mathrm{ReLU}(\mathrm{LN}(\mathrm{Conv1D}(X_{speech})))$

where X_speech is the mel spectrogram of the selected token and X_speech,1 is the output of the first feature extraction module; the outputs of the three feature extraction modules are denoted X_speech,1, X_speech,2 and X_speech,3 in sequence.
The result is then processed by four residual blocks, each built from one-dimensional convolution (Conv1D), layer normalization (LN) and ReLU activation functions. Taking the first residual block as an example, it is constructed as follows:

$X_{speech,4} = X_{speech,3} + \mathrm{ReLU}(\mathrm{LN}(\mathrm{Conv1D}(X_{speech,3})))$

where X_speech,4 is the output of the first residual block; the outputs of the four residual blocks are denoted X_speech,4, X_speech,5, X_speech,6 and X_speech,7 in sequence. At this point X_speech,7 has dimension N x C x T, where N denotes the batch size, T denotes the number of audio frames and C denotes the number of channels.
Finally, a one-dimensional attention pooling layer produces the final selected-token prosodic feature f_speech(X_speech), where f_speech(·) denotes the prosody encoder. The selected-token prosodic feature f_speech(X_speech) has dimension N x C.
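A minimal PyTorch sketch of this prosody encoder follows: three feature extraction modules (Conv1D + LN + ReLU), four residual blocks built from the same modules, and a one-dimensional attention pooling layer that collapses the time axis from N x C x T to N x C. The channel widths, kernel sizes and the exact form of the attention pooling are illustrative assumptions.

```python
# Hedged sketch of the prosody encoder of step 4; hyperparameters are assumptions.
import torch
import torch.nn as nn

class FeatureExtraction(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2)
        self.norm = nn.LayerNorm(out_ch)
        self.act = nn.ReLU()

    def forward(self, x):                        # x: (N, C_in, T)
        x = self.conv(x).transpose(1, 2)         # (N, T, C_out) so LayerNorm acts on channels
        return self.act(self.norm(x)).transpose(1, 2)

class ResidualBlock(nn.Module):
    """Residual block composed of feature extraction modules, as described above."""
    def __init__(self, ch: int):
        super().__init__()
        self.block = nn.Sequential(FeatureExtraction(ch, ch), FeatureExtraction(ch, ch))

    def forward(self, x):
        return x + self.block(x)

class AttentionPool1d(nn.Module):
    """Collapse the time axis with a learned query attending over all frames."""
    def __init__(self, ch: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(ch))

    def forward(self, x):                        # x: (N, C, T)
        weights = torch.softmax(torch.einsum("c,nct->nt", self.query, x), dim=-1)
        return torch.einsum("nt,nct->nc", weights, x)   # (N, C)

class ProsodyEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, ch: int = 256):
        super().__init__()
        self.features = nn.Sequential(
            FeatureExtraction(n_mels, ch), FeatureExtraction(ch, ch), FeatureExtraction(ch, ch)
        )
        self.residuals = nn.Sequential(*[ResidualBlock(ch) for _ in range(4)])
        self.pool = AttentionPool1d(ch)

    def forward(self, mel):                      # mel: (N, F, T)
        return self.pool(self.residuals(self.features(mel)))  # f_speech(X_speech): (N, C)

prosody_feat = ProsodyEncoder()(torch.randn(8, 80, 120))       # (8, 256)
```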
In step 5, the cosine similarities between all selected-token text features and selected-token prosodic features in the combinations of original speech audio and corresponding texts are computed, specifically:
5.1) The selected-token text features f_text(X_text)_i_ph/i_word obtained in step 3 and the selected-token prosodic features f_speech(X_speech) obtained in step 4 are layer-normalized (LN) and then linearly projected (L_text / L_speech) into the multimodal embedding space, constructed as follows:

$T_{ph/word} = L_{text}\big(\mathrm{LN}\big(f_{text}(X_{text})_{i\_ph/i\_word}\big)\big)$

$S = L_{speech}\big(\mathrm{LN}\big(f_{speech}(X_{speech})\big)\big)$

where L_text denotes the text linear projection, L_speech denotes the audio linear projection, T_ph/word denotes the selected-token text-modality embedding vectors of dimension N x C, S denotes the selected-token audio-modality embedding vectors of dimension N x C, N denotes the batch size and C denotes the number of channels.
5.2) For a training batch containing N text-audio pairs, the N selected-token text-modality embedding vectors and the N selected-token audio-modality embedding vectors are paired with each other and N^2 cosine similarities are computed to form the similarity matrix:

$C_{ph/word} = T_{ph/word} \cdot S^{T}$

where C_ph/word denotes the cosine similarity matrix of dimension N x N.
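The projections and similarity matrix of steps 5.1)-5.2) can be sketched as follows; the embedding width is an illustrative assumption, and unit-length normalization is applied so that the dot product T · S^T directly yields cosine similarities.

```python
# Hedged sketch of the multimodal projection and the N x N cosine similarity matrix.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalProjection(nn.Module):
    def __init__(self, text_dim: int = 256, speech_dim: int = 256, embed_dim: int = 256):
        super().__init__()
        self.text_norm = nn.LayerNorm(text_dim)
        self.speech_norm = nn.LayerNorm(speech_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)      # plays the role of L_text
        self.speech_proj = nn.Linear(speech_dim, embed_dim)  # plays the role of L_speech

    def forward(self, text_feat: torch.Tensor, speech_feat: torch.Tensor):
        # text_feat, speech_feat: (N, C) selected-token features
        t = F.normalize(self.text_proj(self.text_norm(text_feat)), dim=-1)        # T_ph/word
        s = F.normalize(self.speech_proj(self.speech_norm(speech_feat)), dim=-1)  # S
        return t @ s.t()   # C_ph/word: (N, N) matrix of cosine similarities

proj = MultiModalProjection()
similarity = proj(torch.randn(8, 256), torch.randn(8, 256))   # batch of N = 8 pairs
```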
The flow of steps 1-5 above is illustrated in FIG. 2, taking the selected token "higher" as an example: the text encoder encodes the phoneme sequences and byte pair encoding sequences containing the selected token "higher" to obtain the phoneme-level and word-level text feature encodings, and the text features of the selected token "higher" are obtained by indexing; the mel spectrograms of all occurrences of the selected token "higher" are cut out of the mel spectrograms of the original speech audio; and the prosody encoder encodes the cut mel spectrograms to obtain the prosodic features of the selected token "higher". For a training batch containing N text-audio pairs, the cosine similarities between the text features and the prosodic features of the N selected tokens "higher" are computed to obtain the cosine similarity matrix C_word for the selected token "higher".
In step 6, the contrastive training method is specifically as follows:
In this embodiment, the text and audio that truly belong to the same pair are treated as positive sample pairs (corresponding to the diagonal elements of the cosine similarity matrix), giving N positive samples in total, while the remaining N^2 - N text-audio pairs are negative samples. The training goal is to maximize the similarity of the N positive samples while minimizing the similarity of the N^2 - N negative samples, so the resulting symmetric cross entropy loss is:

$L_{ph/word} = \frac{1}{2}\left(L_{text} + L_{speech}\right)$

where L_ph/word denotes the symmetric cross entropy loss; evaluated at the phoneme level and the word level, it gives the phoneme-level symmetric cross entropy loss L_ph and the word-level symmetric cross entropy loss L_word respectively. τ is a learnable scaling parameter, L_text denotes the cross entropy loss along the text dimension of the cosine similarity matrix and L_speech denotes the cross entropy loss along the audio dimension, computed as follows:

$L_{text} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\left(\tau \cdot \mathrm{diag}(C_{ph/word})_{i}\right)}{\sum_{j=1}^{N} \exp\left(\tau \cdot (C_{ph/word})_{i,j}\right)}, \qquad L_{speech} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\left(\tau \cdot \mathrm{diag}(C_{ph/word})_{i}\right)}{\sum_{j=1}^{N} \exp\left(\tau \cdot (C_{ph/word})_{j,i}\right)}$

where diag(·) denotes the diagonal elements.
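A minimal sketch of this symmetric cross entropy loss follows; it treats the diagonal of the similarity matrix as the positive pairs and applies the learnable scaling parameter τ as a multiplicative factor on the similarities, matching the formulas above. The initial value of τ is an assumption.

```python
# Hedged sketch of the phoneme/word-level symmetric cross entropy loss of step 6.
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(similarity: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    logits = similarity * tau                            # scaled cosine similarities, (N, N)
    labels = torch.arange(similarity.size(0), device=similarity.device)
    loss_text = F.cross_entropy(logits, labels)          # along the text dimension (rows)
    loss_speech = F.cross_entropy(logits.t(), labels)    # along the audio dimension (columns)
    return 0.5 * (loss_text + loss_speech)

# Example: tau is registered as a learnable parameter and trained by gradient descent.
tau = torch.nn.Parameter(torch.tensor(10.0))
loss = symmetric_cross_entropy(torch.randn(8, 8), tau)
loss.backward()
```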
Using this loss function and a gradient-descent learning method, the text encoder and the prosody encoder are contrastively trained at the phoneme level and the word level respectively, realizing the extraction of prosody information from the text context, so that prosodic features can be provided to downstream speech synthesis tasks and the quality of the synthesized audio improved.
In step 7, for a given text, the phoneme sequence and the byte pair encoding sequence are first extracted, and the trained text encoder then extracts the text feature encoding to generate the prosody corresponding to the text. The extraction of the phoneme sequence and byte pair encoding sequence follows step 1 and yields the combination X_text; the extraction of the text feature encoding follows steps 3.1) to 3.3) and yields the phoneme/word-level text feature encoding f_text(X_text), which is the prosody information extracted from the text.
In one implementation of the invention, the text encoder of the invention can be plugged into an existing speech synthesis system; the phoneme/word-level prosody information f_text(X_text) it provides assists the downstream speech synthesis task and enables high-quality speech audio to be synthesized.
The text encoder of the invention is inserted into the existing speech synthesis system FastSpeech 2, which comprises a speech encoder, a fusion encoder, duration and pitch predictors and a mel spectrogram decoder. The text encoder of the invention serves as an auxiliary encoder for the speech encoder of the FastSpeech 2 speech synthesis system and may operate at the phoneme level or the word level. Taking the phoneme level as an example, as shown in FIG. 5, the speech encoder (phoneme encoder) produces a phoneme-level encoding, which is combined with the phoneme-level text feature encoding generated by the text encoder of the invention and then fused by the subsequent fusion encoder; the result then passes through the duration and pitch predictors and the mel spectrogram decoder in sequence to generate a mel spectrogram, from which speech is synthesized. In this case the phoneme-to-word converter in the text encoder of the invention need not be enabled.
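The sketch below illustrates one way the pre-trained text encoder could be attached to a FastSpeech 2 style pipeline as described above; the module interfaces, the additive fusion and the bridging linear layer are illustrative assumptions rather than the actual FastSpeech 2 implementation.

```python
# Hedged sketch of plugging the pre-trained text encoder into a FastSpeech 2 style system.
import torch
import torch.nn as nn

class ProsodyAugmentedEncoder(nn.Module):
    def __init__(self, phoneme_encoder: nn.Module, prosody_text_encoder: nn.Module,
                 fusion_encoder: nn.Module, text_dim: int = 256, model_dim: int = 256):
        super().__init__()
        self.phoneme_encoder = phoneme_encoder            # FastSpeech 2 speech (phoneme) encoder
        self.prosody_text_encoder = prosody_text_encoder  # pre-trained text encoder, kept frozen
        self.bridge = nn.Linear(text_dim, model_dim)      # maps f_text(X_text) into the model width
        self.fusion_encoder = fusion_encoder              # downstream fusion encoder

    def forward(self, phoneme_ids: torch.Tensor, bpe_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.phoneme_encoder(phoneme_ids)                          # (N, L, D)
        with torch.no_grad():
            prosody_repr = self.prosody_text_encoder(phoneme_ids, bpe_ids)  # (N, L, C)
        # Combine the two encodings and hand them to the fusion encoder, which is
        # followed by the duration/pitch predictors and the mel spectrogram decoder.
        return self.fusion_encoder(hidden + self.bridge(prosody_repr))
```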
The above method is applied in the following examples to demonstrate its technical effects; the specific steps within the examples are not described again in detail.
The invention is pre-trained on the speech recognition dataset LibriSpeech, and the quality of the synthesized speech is evaluated on the two speech synthesis datasets LJSpeech and LibriTTS. To evaluate the performance of the invention objectively, three metrics, MOS, DTW and DE, are used on the selected test sets, and the prosodic feature similarities of the same token taken from different texts are analyzed for comparison; the invention is compared with the following existing pre-trained models and speech synthesis models:
the BERT model is compared with the pre-training model 1, and abundant semantic information is learned from a large-scale text corpus in the Internet, so that prosody prediction and robustness of the existing speech synthesis system are improved.
A text encoder is introduced as auxiliary information on the basis of a MAM model to help the model to recover the missing Mel spectrogram, and a post-processing network based on a convolutional neural network is introduced to improve the audio quality.
The speech synthesis system 1. FastSpech model, which is based on a predicted speech synthesis task PB, utilizes several additional predictors to predict prosodic information, such as pitch, duration and energy, to complete expressive speech synthesis tasks.
The speech synthesis system 2. PortaSpech model is based on a varied speech synthesis task VB, and models prosody in potential space by using a variation self-encoder, thereby completing the expressive speech synthesis task.
Following the procedure described in the detailed description above, the experimental results obtained are shown in Tables 1 and 2; the model of the invention is denoted CLAPSpeech.
Table 1: speech synthesis test results of the invention on the LJSpeech and LibriTTS datasets
In Table 1, GT denotes the real speech audio, and GT (voc) denotes the real speech audio first converted into a mel spectrogram and then converted back into audio with a vocoder. As can be seen from Table 1, the CLAPSpeech method extracts prosody from the text and improves the speech synthesis models, and its performance is significantly better than the two most advanced existing pre-trained models, BERT and A3T: on the LJSpeech dataset the MOS score increases from 4.04 (4.13) to 4.11 (4.28) while DTW and DE decrease, and on the LibriTTS dataset the MOS score increases from 3.60 (3.95) to 3.71 (4.06), with DTW and DE likewise decreasing.
Table 2: text feature similarity test results of the invention for the same token in different contexts
As can be seen from Table 2, the average similarity of the text features of the same token in different contexts obtained by the CLAPSpeech method is smaller than that of the other models, which indicates that CLAPSpeech captures the prosodic differences of the same token across text contexts and provides prosody information efficiently. The contrastive text-audio pair pre-training method proposed by the invention helps the model attend only to the prosody space during training while ignoring semantic information, so that prosody-related text representations are obtained efficiently, which in turn assists the speech synthesis model in generating higher-quality audio.
Table 3: ablation results of the invention under different settings
In Table 3, CLAPSpeech denotes the present model, TTS baseline denotes the speech synthesis baseline model, w/o BPE denotes removing the byte pair encoding, w/o ph-level denotes removing the phoneme-level text feature encoding, and w/o word-level denotes removing the word-level text feature encoding. Table 3 shows that removing any component of the invention affects the final result. Adding byte pair encoding features to the CLAPSpeech pre-training and pre-training on multi-scale tokens improve model efficiency and capture the prosodic features of the text at each level, which shows that, for the speech synthesis task, the method of learning prosody from the text context through contrastive text-audio pair pre-training is effective.
This embodiment also provides a text prosody generation system based on contrastive text-audio pairs for implementing the above embodiments. The terms "module", "unit" and the like used below may refer to a combination of software and/or hardware that implements a predetermined function. Although the system described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible.
Specifically, the text prosody generation system based on contrastive text-audio pairs includes:
a speech audio preprocessing module, configured to obtain the mel spectrogram of the original speech audio from the combination of original speech audio and corresponding text, and to cut the mel spectrograms of all selected tokens out of the mel spectrogram of the original speech audio;
a text preprocessing module, configured to extract the phoneme sequence and byte pair encoding sequence of the text from the combination of original speech audio and corresponding text;
a text encoder module, configured to encode the phoneme sequences and byte pair encoding sequences containing the selected tokens, obtain the phoneme-level and word-level text feature encodings, and obtain the selected-token text features by indexing;
a prosody encoder module, configured to encode the cut mel spectrogram of each selected token and obtain the selected-token prosodic features;
a multi-scale contrastive learning training module, configured to compute the cosine similarities between all selected-token text features and selected-token prosodic features in the combinations of original speech audio and corresponding texts to obtain a cosine similarity matrix, and, using the one-to-one correspondence between the original speech audio and the texts, compute a symmetric cross entropy loss from the cosine similarity matrix and contrastively train the text encoder and the prosody encoder at the phoneme level and the word level respectively.
In one embodiment of the invention, the text encoder module includes:
a phoneme embedding unit, configured to take the phoneme sequence as input and generate the phoneme embedding result;
a byte pair encoding embedding unit, configured to take the byte pair encoding sequence as input and generate the byte pair encoding embedding result;
a feedforward neural network block unit, configured to extract features from the phoneme embedding result and the byte pair encoding embedding result respectively to generate the speech habit features and the byte pair encoding features, and to fuse the phoneme-level byte pair features with the speech habit features and extract features to generate the phoneme-level text feature encoding;
a byte-to-word converter unit, configured to convert the byte pair encoding features into word-level byte pair features;
a word-to-phoneme converter unit, configured to convert the word-level byte pair features into phoneme-level byte pair features;
a phoneme-to-word converter unit, configured to convert the phoneme-level text feature encoding into the word-level text feature encoding.
In one embodiment of the invention, the prosody encoder module includes, connected in sequence, M_1 feature extraction units, M_2 residual block units and a one-dimensional attention pooling layer unit; each feature extraction unit consists of a one-dimensional convolution, layer normalization and a ReLU activation function, and each residual block unit consists of several feature extraction units.
In one embodiment of the invention, the multi-scale contrastive learning training module includes:
a multimodal projection unit, configured to layer-normalize the selected-token text features generated by the text encoder module and the selected-token prosodic features generated by the prosody encoder module and linearly project them into the multimodal embedding space, obtaining the selected-token text-modality embedding vectors and the selected-token audio-modality embedding vectors respectively;
a cosine similarity matrix calculation unit, configured to pair the selected-token text-modality embedding vectors with the selected-token audio-modality embedding vectors within a training batch of text-audio pairs and compute the similarity matrix;
a symmetric cross entropy loss calculation unit, configured to treat the diagonal elements of the cosine similarity matrix as positive-sample cosine similarities and the remaining elements as negative-sample cosine similarities, compute the symmetric cross entropy loss, and update the parameters of the text encoder and the prosody encoder based on the symmetric cross entropy loss.
Since the system embodiment basically corresponds to the method embodiment, reference may be made to the description of the method embodiment for the relevant parts, and the implementation of the remaining modules is not repeated here. The system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the invention. Those of ordinary skill in the art can understand and implement it without inventive effort.
Embodiments of the system of the invention can be applied to any device having data processing capability, such as a computer. The system embodiment may be implemented in software, or in hardware, or in a combination of hardware and software. Taking a software implementation as an example, the system, as a device in the logical sense, is formed by the processor of the device with data processing capability reading the corresponding computer program instructions from non-volatile storage into memory and running them. In terms of the hardware level, FIG. 6 shows a hardware structure diagram provided by this embodiment; in addition to the processor, memory, network interface and non-volatile storage shown in FIG. 6, the device with data processing capability on which the system runs generally also includes other hardware according to the actual function of the device, which is not described here.
The foregoing list is only illustrative of specific embodiments of the invention. Obviously, the invention is not limited to the above embodiments, but many variations are possible. All modifications directly derived or suggested to one skilled in the art from the present disclosure should be considered as being within the scope of the present invention.

Claims (10)

1. A text prosody generation method based on contrastive text-audio pairs, comprising the following steps:
step 1, acquiring original speech audio and its corresponding texts, extracting the phoneme sequence and byte pair encoding sequence of each text, and obtaining the mel spectrogram of the original speech audio;
step 2, cutting the mel spectrograms of all selected tokens out of the mel spectrograms of the original speech audio;
step 3, encoding the phoneme sequences and byte pair encoding sequences containing the selected tokens with a text encoder to obtain phoneme-level and word-level text feature encodings, and obtaining the selected-token text features by indexing;
step 4, encoding the cut mel spectrogram of each selected token with a prosody encoder to obtain the selected-token prosodic features;
step 5, computing the cosine similarities between all selected-token text features and selected-token prosodic features in the combinations of original speech audio and corresponding texts to obtain a cosine similarity matrix;
step 6, using the one-to-one correspondence between the original speech audio and the texts, computing a symmetric cross entropy loss from the cosine similarity matrix, and contrastively training the text encoder and the prosody encoder at the phoneme level and the word level respectively;
and step 7, for a given text, first extracting its phoneme sequence and byte pair encoding sequence, then encoding them with the trained text encoder to obtain phoneme-level and/or word-level text feature encodings, and generating the prosody corresponding to the text.
2. The text prosody generation method based on contrastive text-audio pairs according to claim 1, wherein the text encoder comprises a phoneme embedding layer, a byte pair encoding embedding layer, feedforward neural network blocks, a byte-to-word converter, a word-to-phoneme converter and a phoneme-to-word converter;
the phoneme embedding layer and the byte pair encoding embedding layer take the phoneme sequence and the byte pair encoding sequence as input respectively, and the phoneme embedding result and the byte pair encoding embedding result are passed through two independent feedforward neural network blocks to extract features, producing speech habit features and byte pair encoding features;
the byte pair encoding features are converted into phoneme-level byte pair features by the byte-to-word converter and the word-to-phoneme converter in sequence; the phoneme-level byte pair features and the speech habit features are fused and passed through a third independent feedforward neural network block to extract features, producing the phoneme-level text feature encoding;
the phoneme-level text feature encoding is converted into the word-level text feature encoding by the phoneme-to-word converter.
3. The text prosody generation method based on contrastive text-audio pairs according to claim 1, wherein the independent feedforward neural network blocks have the same structure and do not share parameters.
4. The text prosody generation method based on contrastive text-audio pairs according to claim 1, wherein the prosody encoder comprises, connected in sequence, M_1 feature extraction modules, M_2 residual blocks and a one-dimensional attention pooling layer; each feature extraction module consists of a one-dimensional convolution, layer normalization and a ReLU activation function, and each residual block consists of several feature extraction modules.
5. The text prosody generation method based on contrastive text-audio pairs according to claim 1, wherein in step 5 the cosine similarity matrix is computed as follows:
5.1) after layer normalization, linearly projecting the selected-token text features obtained in step 3 and the selected-token prosodic features obtained in step 4 into a multimodal embedding space, obtaining selected-token text-modality embedding vectors and selected-token audio-modality embedding vectors respectively;
5.2) for a training batch containing N text-audio pairs, pairing the N selected-token text-modality embedding vectors with the N selected-token audio-modality embedding vectors and computing N^2 cosine similarities to form the similarity matrix:

$C_{ph/word} = T_{ph/word} \cdot S^{T}$

wherein C_ph/word denotes the cosine similarity matrix of dimension N x N, T_ph/word denotes the selected-token text-modality embedding vectors, the subscript ph denotes the phoneme level, the subscript word denotes the word level, S denotes the selected-token audio-modality embedding vectors, and the superscript T denotes the transpose.
6. The text prosody generation method based on contrastive text-audio pairs according to claim 1, wherein in step 6 the diagonal elements of the cosine similarity matrix are positive-sample cosine similarities and the remaining elements are negative-sample cosine similarities, and the symmetric cross entropy loss is computed as:

$L_{ph/word} = \frac{1}{2}\left(L_{text} + L_{speech}\right)$

wherein L_ph/word denotes the phoneme/word-level symmetric cross entropy loss, τ denotes a learnable scaling parameter applied to the cosine similarities, L_text denotes the cross entropy loss along the text dimension of the cosine similarity matrix, and L_speech denotes the cross entropy loss along the audio dimension of the cosine similarity matrix.
7. A text prosody generation system based on contrastive text-audio pairs, comprising:
a speech audio preprocessing module, configured to obtain the mel spectrogram of the original speech audio from the combination of original speech audio and corresponding text, and to cut the mel spectrograms of all selected tokens out of the mel spectrogram of the original speech audio;
a text preprocessing module, configured to extract the phoneme sequence and byte pair encoding sequence of the text from the combination of original speech audio and corresponding text;
a text encoder module, configured to encode the phoneme sequences and byte pair encoding sequences containing the selected tokens, obtain phoneme-level and word-level text feature encodings, and obtain the selected-token text features by indexing;
a prosody encoder module, configured to encode the cut mel spectrogram of each selected token and obtain the selected-token prosodic features;
a multi-scale contrastive learning training module, configured to compute the cosine similarities between all selected-token text features and selected-token prosodic features in the combinations of original speech audio and corresponding texts to obtain a cosine similarity matrix, and, using the one-to-one correspondence between the original speech audio and the texts, to compute a symmetric cross entropy loss from the cosine similarity matrix and contrastively train the text encoder and the prosody encoder at the phoneme level and the word level respectively.
8. The text prosody generation system based on contrastive text-audio pairs according to claim 7, wherein the text encoder module comprises:
a phoneme embedding unit, configured to take the phoneme sequence as input and generate the phoneme embedding result;
a byte pair encoding embedding unit, configured to take the byte pair encoding sequence as input and generate the byte pair encoding embedding result;
a feedforward neural network block unit, configured to extract features from the phoneme embedding result and the byte pair encoding embedding result respectively to generate speech habit features and byte pair encoding features, and to fuse the phoneme-level byte pair features with the speech habit features and extract features to generate the phoneme-level text feature encoding;
a byte-to-word converter unit, configured to convert the byte pair encoding features into word-level byte pair features;
a word-to-phoneme converter unit, configured to convert the word-level byte pair features into phoneme-level byte pair features;
a phoneme-to-word converter unit, configured to convert the phoneme-level text feature encoding into the word-level text feature encoding.
9. The text prosody generation system based on contrastive text-audio pairs according to claim 7, wherein the prosody encoder module comprises, connected in sequence, M_1 feature extraction units, M_2 residual block units and a one-dimensional attention pooling layer unit; each feature extraction unit consists of a one-dimensional convolution, layer normalization and a ReLU activation function, and each residual block unit consists of several feature extraction units.
10. The text prosody generation system based on contrastive text-audio pairs according to claim 7, wherein the multi-scale contrastive learning training module comprises:
a multimodal projection unit, configured to layer-normalize the selected-token text features generated by the text encoder module and the selected-token prosodic features generated by the prosody encoder module and linearly project them into the multimodal embedding space, obtaining selected-token text-modality embedding vectors and selected-token audio-modality embedding vectors respectively;
a cosine similarity matrix calculation unit, configured to pair the selected-token text-modality embedding vectors with the selected-token audio-modality embedding vectors within a training batch of text-audio pairs and compute the similarity matrix;
a symmetric cross entropy loss calculation unit, configured to treat the diagonal elements of the cosine similarity matrix as positive-sample cosine similarities and the remaining elements as negative-sample cosine similarities, compute the symmetric cross entropy loss, and update the parameters of the text encoder and the prosody encoder based on the symmetric cross entropy loss.
CN202310361791.XA (filed 2023-04-07, priority 2023-04-07): Text prosody generation method and system based on contrastive text-audio pairs. Status: Active. Granted as CN116092479B (en).

Priority Applications (1)

CN202310361791.XA (priority date 2023-04-07, filing date 2023-04-07): Text prosody generation method and system based on contrastive text-audio pairs


Publications (2)

CN116092479A (en): published 2023-05-09
CN116092479B (en): published 2023-07-07

Family

ID: 86199436

Family Applications (1)

CN202310361791.XA (priority date 2023-04-07, filing date 2023-04-07): Text prosody generation method and system based on contrastive text-audio pairs (Active; granted as CN116092479B (en))

Country Status (1)

CN: CN116092479B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120191457A1 (en) * 2011-01-24 2012-07-26 Nuance Communications, Inc. Methods and apparatus for predicting prosody in speech synthesis
US20180182373A1 (en) * 2016-12-23 2018-06-28 Soundhound, Inc. Parametric adaptation of voice synthesis
CN111046907A (en) * 2019-11-02 2020-04-21 国网天津市电力公司 Semi-supervised convolutional network embedding method based on multi-head attention mechanism
US20220051654A1 (en) * 2020-08-13 2022-02-17 Google Llc Two-Level Speech Prosody Transfer
CN112802450A (en) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof
WO2022167242A1 (en) * 2021-02-05 2022-08-11 Novoic Ltd. Method for obtaining de-identified data representations of speech for speech analysis
CN113112995A (en) * 2021-05-28 2021-07-13 思必驰科技股份有限公司 Word acoustic feature system, and training method and system of word acoustic feature system
CN114093342A (en) * 2022-01-24 2022-02-25 中国科学院自动化研究所 Fine-grained rhythm modeling voice generation model, equipment and storage medium
CN115116428A (en) * 2022-05-19 2022-09-27 腾讯科技(深圳)有限公司 Prosodic boundary labeling method, apparatus, device, medium, and program product
CN115359780A (en) * 2022-07-28 2022-11-18 平安科技(深圳)有限公司 Speech synthesis method, apparatus, computer device and storage medium
CN115273805A (en) * 2022-07-29 2022-11-01 平安科技(深圳)有限公司 Prosody-based speech synthesis method and apparatus, device, and medium
CN115687567A (en) * 2022-10-14 2023-02-03 中电万维信息技术有限责任公司 Method for searching similar long text by short text without marking data
CN115910026A (en) * 2023-02-02 2023-04-04 澳克多普有限公司 Rhythm migration speech synthesis method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
佚名: "CLAPSpeech", pages 3, Retrieved from the Internet <URL:https://github.com/clapspeech/clapspeech.github.io/blob/master/clapspeech/clapspeech/main.png> *

Also Published As

CN116092479B (en): published 2023-07-07


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant