CN116129876A - Training method and device for voice conversion model and voice generation method and device - Google Patents


Info

Publication number
CN116129876A
CN116129876A (application CN202210956115.2A)
Authority
CN
China
Prior art keywords
target
speaker
data
embedding
speech
Prior art date
Legal status
Pending
Application number
CN202210956115.2A
Other languages
Chinese (zh)
Inventor
刘鹏飞
蒋宁
吴海英
刘敏
Current Assignee
Mashang Xiaofei Finance Co Ltd
Original Assignee
Mashang Xiaofei Finance Co Ltd
Priority date
Filing date
Publication date
Application filed by Mashang Xiaofei Finance Co Ltd
Priority to CN202210956115.2A
Publication of CN116129876A
Legal status: Pending

Classifications

    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING (G PHYSICS; G10 MUSICAL INSTRUMENTS; ACOUSTICS)
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers (G10L13/00 Speech synthesis; Text to speech systems)
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit (G10L15/00 Speech recognition)
    • G10L15/063 Training (G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L2015/0631 Creating reference templates; Clustering
    • Y02T10/40 Engine management systems (Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION; Y02T10/10 Internal combustion engine [ICE] based vehicles)

Abstract

The disclosure provides a training method and device for a speech conversion model and a speech generation method and device. The training method comprises the following steps: acquiring first training data, wherein the first training data comprises voice data, phoneme data corresponding to the voice data and a standard mel spectrogram corresponding to the voice data; inputting the voice data into a pre-trained speaker recognition model to obtain a speaker embedding corresponding to the voice data, wherein the speaker embedding is used for representing the timbre of a speaker; and training an initial speech conversion model based on the phoneme data, the standard mel spectrogram and the speaker embedding to obtain the speech conversion model, wherein the speech conversion model is used for converting text into a mel spectrogram of speech.

Description

Training method and device for voice conversion model and voice generation method and device
Technical Field
The present invention relates to the field of speech synthesis, and in particular, to a method and apparatus for training a speech conversion model, and a method and apparatus for generating speech.
Background
Customized speech is attracting more and more attention in different application scenarios, such as personal assistants, news broadcasting and voice navigation, and is also widely supported in the business field. Existing methods of customizing a target speaker's speech achieve customization either by training a TTS model directly on the target speaker's speech, or by fine-tuning a trained base TTS model with a small amount of available adaptation data (typically only a few seconds or minutes of data). However, neither approach satisfies most scenarios in which the target speaker's voice must be usable immediately after it is collected.
Disclosure of Invention
The present disclosure provides a training method and apparatus for a speech conversion model and a speech generation method and apparatus.
According to a first aspect of embodiments of the present disclosure, there is provided a training method of a speech conversion model, where the training method includes: acquiring first training data, wherein the first training data comprises voice data, phoneme data corresponding to the voice data and a standard mel spectrogram corresponding to the voice data; inputting the voice data into a pre-trained speaker recognition model to obtain a speaker embedding corresponding to the voice data, wherein the speaker embedding is used for representing the timbre of a speaker; and training an initial speech conversion model based on the phoneme data, the standard mel spectrogram and the speaker embedding to obtain the speech conversion model, wherein the speech conversion model is used for converting text into a mel spectrogram of speech.
According to a second aspect of embodiments of the present disclosure, there is provided a speech generation method, wherein the speech generation method includes: acquiring target voice data and target text of a target speaker; converting the target text into target phoneme data; inputting the target voice data into a speaker recognition model to obtain a target speaker embedding, wherein the target speaker embedding is used for representing the timbre of the target speaker; inputting the target speaker embedding and the target phoneme data into a speech conversion model to generate a mel spectrogram of the speech corresponding to the target text; and generating the speech using the mel spectrogram, wherein the speech contains the timbre of the target speaker, and the speech conversion model is obtained by training based on speaker embeddings output by the speaker recognition model.
According to a third aspect of embodiments of the present disclosure, there is provided a training apparatus for a speech conversion model, wherein the training apparatus includes: a training data acquisition unit configured to acquire first training data, wherein the first training data includes voice data, phoneme data corresponding to the voice data, and a standard mel spectrogram corresponding to the voice data; and a model training unit configured to: input the voice data into a pre-trained speaker recognition model to obtain a speaker embedding corresponding to the voice data, and train an initial speech conversion model based on the phoneme data, the standard mel spectrogram and the speaker embedding to obtain the speech conversion model, wherein the speech conversion model is used for converting text into a mel spectrogram of speech, and the speaker embedding is used for representing the timbre of the speaker.
According to a fourth aspect of embodiments of the present disclosure, there is provided a speech generating apparatus, wherein the speech generating apparatus includes: a data acquisition unit configured to acquire target voice data and target text of a target speaker; a phoneme generating unit configured to convert the target text into target phoneme data; a speaker embedding obtaining unit configured to input the target voice data into a speaker recognition model to obtain a target speaker embedding, wherein the target speaker embedding is used for representing the timbre of the target speaker; and a speech generating unit configured to input the target speaker embedding and the target phoneme data into a speech conversion model, generate a mel spectrogram of the speech corresponding to the target text, and generate the speech using the mel spectrogram, wherein the speech contains the timbre of the target speaker, and the speech conversion model is obtained by training based on speaker embeddings output by the speaker recognition model.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the training method of a speech conversion model or the speech generation method according to the present disclosure.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to perform the training method of a speech conversion model or the speech generation method according to the present disclosure.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by at least one processor, implement the training method of a speech conversion model or the speech generation method according to the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
According to the method of the present disclosure, a target speaker embedding is obtained from target voice data of the target speaker through the speaker recognition model, and the target speaker embedding together with the target phoneme data corresponding to the target text is input into the speech conversion model to generate a mel spectrogram, from which speech having the timbre of the target speaker is generated. The target speaker's voice can therefore be used as soon as it is collected: voice customization for any target speaker (including speakers not present in the training data of the speech conversion model) can be realized immediately with only a small amount of target voice data. In addition, because there is no fine-tuning stage of the speech conversion model in this process, computing resources and model training time are saved, the time required to deploy a speech conversion model customized for a target speaker is greatly reduced, and the cost of purchasing/collecting model training data can also be saved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a schematic diagram illustrating a scenario in which a speech conversion model according to an exemplary embodiment of the present disclosure may be applied;
FIG. 2 illustrates a flowchart of a method of training a speech conversion model according to an exemplary embodiment of the present disclosure;
FIG. 3 is a diagram showing the structure of the RepVGG model during training and inference;
FIG. 4 is a flowchart illustrating a process of predicting a Mel-spectrogram based on phoneme data and speaker embedding in training an initial speech conversion model according to an exemplary embodiment of the present disclosure;
FIG. 5 illustrates a schematic diagram of a speech conversion model according to an exemplary embodiment of the present disclosure;
FIG. 6 is a diagram showing a Transformer network structure;
FIG. 7 is a diagram showing the internal structure of a multi-head attention layer in a Transformer network structure;
FIG. 8 is a flowchart illustrating a speech generation method according to an exemplary embodiment of the present disclosure;
FIG. 9 is a flowchart illustrating a process of generating a Mel spectrogram of speech using target speaker embedding and target phoneme data by a speech conversion model according to an exemplary embodiment of the present disclosure;
fig. 10 is a diagram illustrating one example of synthesized speech generated by a speech generating method according to an exemplary embodiment of the present disclosure;
FIG. 11 is a block diagram illustrating a training apparatus of a speech conversion model according to an exemplary embodiment of the present disclosure;
fig. 12 is a block diagram illustrating a speech generating apparatus according to an exemplary embodiment of the present disclosure;
FIG. 13 is a diagram illustrating an implementation environment of a training method and a speech generation method of a speech conversion model according to an exemplary embodiment of the present disclosure;
fig. 14 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present disclosure.
Hereinafter, the present application will be described in detail with reference to the drawings, wherein the same or similar elements will be designated with the same or similar reference numerals throughout the drawings.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the disclosure defined by the claims and their equivalents. Various specific details are included to aid understanding, but are merely to be considered exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
The terms and words used in the following description and claims are not limited to written meanings, but are used only by the inventors to achieve a clear and consistent understanding of the disclosure. Accordingly, it should be apparent to those skilled in the art that the following descriptions of the various embodiments of the present disclosure are provided for illustration only and not for the purpose of limiting the disclosure as defined by the claims and their equivalents.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
Before starting the description of the present disclosure, some technical content and terms that may be used in the specification of the present disclosure will be explained so that the present disclosure can be more easily understood:
Transformer model: a sequence model based on the self-attention mechanism. Its encoder part can effectively encode temporal information, and its sequence-processing capability is far better than that of an LSTM while also being faster. The model is widely applied in natural language processing, computer vision, machine translation, speech recognition and other fields.
Mel spectrogram: a spectrogram obtained by analyzing speech, in which the horizontal axis is time and the vertical axis is mel frequency; speech can be reconstructed from a mel spectrogram by vocoder processing. Mel-frequency cepstral coefficients (MFCC), which are part of the coefficients obtained by applying a discrete cosine transform (DCT) to a mel spectrum, are features widely used in speaker segmentation, voiceprint recognition, speech recognition and speech synthesis. The mel frequency is proposed based on the auditory characteristics of the human ear and has a non-linear correspondence with frequency in Hz. Mel-frequency cepstral coefficients are spectral features computed by exploiting this relationship between the mel frequency and the Hz frequency, and are mainly used for extracting features from voice data.
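As a concrete illustration of the mel spectrogram described above, the following sketch computes a log-mel spectrogram from a waveform using the librosa library; the parameters (sampling rate, n_fft, hop length, 80 mel bands) are illustrative assumptions and are not specified by the present disclosure.

```python
# A minimal sketch, assuming illustrative STFT/mel parameters, of how a mel
# spectrogram of the kind described above can be computed from a waveform.
import librosa
import numpy as np

def waveform_to_mel(path: str, sr: int = 22050, n_mels: int = 80) -> np.ndarray:
    """Load audio and return a log-mel spectrogram of shape (n_mels, frames)."""
    wav, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=n_mels)
    # Convert power to decibels; horizontal axis is time, vertical axis is mel frequency.
    return librosa.power_to_db(mel, ref=np.max)
```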
Speaker recognition model: speaker recognition, also known as voiceprint recognition, is one of the biometric techniques. Speaker recognition can be divided into two categories, speaker recognition and speaker verification. The theoretical basis of voiceprint recognition is that each sound has a unique feature by which the sounds of different people can be effectively distinguished. The speaker recognition model is a model for voiceprint recognition of speaker voices based on a neural network.
At present, when customizing a speech conversion model with the voice characteristics of a target speaker, the speech conversion model can be trained directly on the target speaker's data. However, this requires several hours (e.g., 7 or 8 hours) of the target speaker's speech, and the cost of purchasing each hour of speech plus annotation is very high, so the cost of training (customizing) speech conversion models for many target speakers is prohibitive. Moreover, this approach cannot satisfy scenarios in which the target speaker's voice must be usable as soon as it is collected. For example, in voice navigation with customized speech, the user hopes to immediately navigate with the voice of a target speaker (such as the user himself or a specific speaker), and in this situation the effect of immediately using speech with the target speaker's voice for navigation cannot be achieved.
In another implementation, a base speech conversion model is usually trained first, and the trained base speech conversion model is then fine-tuned with a small amount of available adaptation data (usually only a few seconds or minutes), yielding a speech conversion model capable of synthesizing the timbre of the target speaker. However, such methods usually require a long time to fine-tune the speech conversion model and consume a certain amount of computing resources, so they also cannot satisfy most scenarios in which the target speaker's voice must be usable immediately; for example, in voice navigation requiring customized speech, this method likewise cannot achieve the effect of immediately using speech with the target speaker's voice for navigation.
Therefore, in view of the above problems in the prior art, the present application provides a training method for a speech conversion model and a speech generation method that implement a small-sample (few-shot) speech generation scheme. The scheme first trains an initial speaker recognition model and an initial speech conversion model to obtain a trained speaker recognition model and a trained speech conversion model. Then, when speech customization is performed, a target speaker embedding capable of representing the timbre of the target speaker is obtained from a small amount of target voice data (for example, tens of seconds or a few minutes) of the target speaker through the speaker recognition model, and the speech conversion model generates a mel spectrogram based on the target speaker embedding and the target text. The mel spectrogram can be converted into speech by a vocoder, and the speech has the timbre of the target speaker. In this way the target speaker's voice can be used as soon as it is collected, that is, voice customization for any target speaker (including speakers not present in the training data of the initial speech conversion model and/or the initial speaker recognition model) can be realized immediately. In addition, because there is no fine-tuning stage in this process, computing resources and model training time are saved, the time required to apply a speech conversion model customized for a target speaker is greatly reduced, and the cost of purchasing/collecting model training data is also saved.
Fig. 1 is a schematic diagram illustrating a scenario in which a speech conversion model according to an exemplary embodiment of the present disclosure may be applied.
As shown in fig. 1, when performing speech customization based on the speech of a target speaker, the speaker recognition model 110 may receive the target speech of the target speaker in audio form and obtain a target speaker embedding corresponding to the target speech; the speech conversion model 120 may receive the target text in text form together with the target speaker embedding and generate a mel spectrogram corresponding to the target text, and a vocoder then generates speech from the mel spectrogram. Since the target speaker embedding represents the timbre of the speaker, the speech generated from the mel spectrogram contains the timbre of the target speaker.
The speech conversion model 120 of the present disclosure may be widely applied to a variety of scenarios in which the voice characteristics of a specific person are customized, such as personal assistants, news broadcasting and voice navigation.
For example, when the speech conversion model 120 is applied to voice navigation customized with the voice characteristics of a specific person, a small number of voice samples of the specific person are first input to the speaker recognition model 110 to obtain the speaker embedding of the specific person. The speech conversion model 120 may then receive text data related to road conditions together with the speaker embedding of the specific person from the speaker recognition model 110 and generate a mel spectrogram, from which a vocoder generates speech, namely road-condition broadcast speech having the timbre of the specific person.
As another example, when the speech conversion model 120 is applied to news broadcasting customized with the voice characteristics of a specific person, a small number of voice samples of the specific person are first input to the speaker recognition model 110 to obtain the speaker embedding of the specific person. The speech conversion model 120 may then receive text data related to news together with the speaker embedding of the specific person from the speaker recognition model 110 and generate a mel spectrogram, from which a vocoder generates speech, namely news broadcast speech having the timbre of the specific person.
Fig. 2 is a flowchart illustrating a training method of a speech conversion model according to an exemplary embodiment of the present disclosure.
Referring to fig. 2, in step S210, first training data may first be acquired. Here, the first training data includes voice data, phoneme data corresponding to the voice data, and a standard mel spectrogram corresponding to the voice data. For example, the voice data may be a plurality of speech segments in audio form, e.g., speech read by different people for the same text or different texts. The phoneme data are the phonemes (phones) corresponding to the voice data; for example, if the voice data is a reading of the Chinese sentence "Jeju Island, the largest island of Korea", the corresponding phoneme data is "han2 guo2 7zui4 da4 de5 7dao6 yu6 7ji3 zhou1 dao3". The voice data may be taken from various large corpora, or may be voice data prepared in advance for the speech conversion model training task.
Thereafter, in step S220, the voice data is input to the pre-trained speaker recognition model to obtain a speaker embedding corresponding to the voice data, wherein the speaker embedding is used to represent the timbre of the speaker.
Different speakers have different timbres. The pre-trained speaker recognition model can obtain a speaker embedding corresponding to each speaker from the speech data of different speakers, thereby capturing the timbre of each speaker, which can be represented by a vector. For example, even though two speakers, say Xiaohong and Xiaohuang, both read the same text "Jeju Island, the largest island of Korea", the speaker recognition model can obtain, from the two utterances, a speaker embedding representing Xiaohong's timbre and a speaker embedding representing Xiaohuang's timbre, respectively; moreover, even if Xiaohong and Xiaohuang read different texts, the speaker recognition model can still obtain the respective speaker embeddings from their utterances.
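The following minimal sketch illustrates the idea that a speaker embedding is a fixed-length vector representing timbre, so utterances can be compared by cosine similarity; the toy SpeakerEncoder below is a hypothetical stand-in, not the pre-trained speaker recognition model of the present disclosure.

```python
# A toy sketch, assuming mel-frame inputs of 80 bands and a 256-dim embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Toy encoder: mel frames (T x 80) -> 256-dim speaker embedding."""
    def __init__(self, n_mels: int = 80, emb_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(n_mels, emb_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # Average over time so the embedding is independent of utterance length.
        return F.normalize(self.proj(mel).mean(dim=0), dim=-1)

encoder = SpeakerEncoder()
emb_a = encoder(torch.randn(120, 80))   # utterance of speaker A
emb_b = encoder(torch.randn(95, 80))    # utterance of speaker B
similarity = F.cosine_similarity(emb_a, emb_b, dim=0)
print(float(similarity))  # timbre similarity between the two utterances
```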
In addition, in an exemplary embodiment of the present disclosure, before inputting the voice data into the pre-trained speaker recognition model to obtain the speaker embedding corresponding to the voice data in step S220, an initial speaker recognition model needs to be trained in advance to obtain the pre-trained speaker recognition model, and thus, the training method may further include: acquiring second training data, wherein the second training data comprises voice data which is different from the voice data in the first training data; and training the initial speaker recognition model based on the voice data in the first training data and the second training data or based on the second training data to obtain a pre-trained speaker recognition model. That is, in training the speech conversion model, a speaker recognition model has been obtained by training in advance, which only needs to provide speaker embedding for training of the speech conversion model.
In an exemplary embodiment of the present disclosure, the voice data included in the second training data may come from various large datasets, for example the CN-Celeb1/2 datasets commonly used in the field of speaker recognition, which contain speech data from thousands of speakers covering a variety of speech genres (11 different scenes). However, the voice data included in the second training data is not limited to these datasets; other datasets, such as the VoxCeleb1 or VoxCeleb2 dataset, may be used, as may voice data prepared in advance for the training task of the initial speaker recognition model. During training, the initial speaker recognition model predicts speaker embeddings using the voice data in the first training data and the second training data, or only the voice data in the second training data, and the parameters of the initial speaker recognition model are adjusted according to the predicted speaker embeddings and the ground-truth speaker embeddings until the model's predictions converge.
In addition, in exemplary embodiments of the present disclosure, the speaker recognition model may employ a RepVGG model. RepVGG uses structural re-parameterization: a multi-branch structure is used in the model training stage (multi-branch models have the advantage of high accuracy during training), and after training is completed, the multi-branch structure used in training is equivalently converted into a single-path structure for inference (single-path models have the advantage of fast inference and low memory usage). For example, as shown in (a) of fig. 3, during training the RepVGG model is composed of a series of RepVGG blocks, where each RepVGG block consists of a 3×3 convolution layer, a 1×1 convolution branch and an identity-mapping branch arranged in parallel. After training is completed, this structure is equivalently converted into the structure shown in (b) of fig. 3, in which the operators are connected in series and each operator consists of a 3×3 convolution layer followed by a ReLU activation function, so that the model can achieve high accuracy while using a single-path structure at inference time. In addition to using the RepVGG model as the speaker recognition model, the present application may also use other speaker recognition models, for example the ECAPA-TDNN model or the ResNet-TDNN model.
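The following simplified sketch illustrates the RepVGG-style structural re-parameterization just described: a training-time block with parallel 3×3, 1×1 and identity branches is equivalently fused into a single 3×3 convolution for inference. Batch-normalization folding is omitted for brevity, which is an assumption; the actual RepVGG block also folds batch normalization into the fused convolution.

```python
# A simplified sketch of structural re-parameterization (no batch norm).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepVGGBlock(nn.Module):
    """Training-time block: parallel 3x3 conv, 1x1 conv and identity branches."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return F.relu(self.conv3(x) + self.conv1(x) + x)

    def fuse(self) -> nn.Conv2d:
        """Equivalently convert the three branches into one 3x3 conv for inference."""
        c = self.conv3.out_channels
        kernel = self.conv3.weight.detach().clone()
        kernel += F.pad(self.conv1.weight.detach(), [1, 1, 1, 1])  # 1x1 kernel placed at the 3x3 center
        for i in range(c):
            kernel[i, i, 1, 1] += 1.0                              # identity branch as a 3x3 kernel
        fused = nn.Conv2d(c, c, kernel_size=3, padding=1)
        fused.weight.data.copy_(kernel)
        fused.bias.data.copy_(self.conv3.bias.detach() + self.conv1.bias.detach())
        return fused

block = RepVGGBlock(8)
x = torch.randn(1, 8, 16, 16)
print(torch.allclose(block(x), F.relu(block.fuse()(x)), atol=1e-5))  # True: same outputs
```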
In step S230, an initial speech conversion model is trained based on the phoneme data, the standard mel spectrogram and the speaker embedding to obtain a speech conversion model for converting text into a mel spectrogram of speech.
When training the initial speech conversion model, the phoneme data and the speaker embedding are first input into the initial speech conversion model for prediction to obtain a predicted mel spectrogram. In predicting the mel spectrogram, the speaker embedding obtained from the voice data in the first training data is added, at different positions of the initial speech conversion model, to intermediate quantities generated by the initial speech conversion model for use in the training process. In an exemplary embodiment of the present disclosure, the speaker embedding is input to a plurality of adders in the initial speech conversion model and is added to the outputs of other modules of the initial speech conversion model. Incorporating the speaker embedding, which represents the timbre of the speaker, into the model training process allows the trained speech conversion model to realize voice customization well. In an exemplary embodiment of the present disclosure, the initial speech conversion model employed in step S230 may be the initial speech conversion model 120 shown in fig. 5. The process of predicting a mel spectrogram is described below with reference to figs. 4 to 7.
Fig. 4 is a flowchart illustrating a process of predicting a mel-spectrogram based on phoneme data and speaker embedding in training an initial speech conversion model according to an exemplary embodiment of the present disclosure. Fig. 5 shows a schematic diagram of an initial speech conversion model according to an exemplary embodiment of the present disclosure.
As shown in fig. 5, the initial speech conversion model 120 may include a phoneme embedding layer, a first encoder, a first adder, a second encoder, a second adder, a variable adapter, a third encoder, a third adder, and a decoder. The initial speech conversion model 120 receives the speaker embedding from the speaker recognition model 110. As described above with reference to fig. 2, the speaker recognition model 110 may be a RepVGG model, or another speaker recognition model such as an ECAPA-TDNN model or a ResNet-TDNN model, which can produce effects similar to those of the RepVGG model. In addition, the speaker recognition model 110 has been trained in advance before the initial speech conversion model 120 is trained; therefore, in the following description with reference to fig. 4, the speaker embedding obtained from the voice data by the pre-trained speaker recognition model 110 is used directly when training the initial speech conversion model 120.
In step S410, the phoneme data is encoded by the phoneme embedding layer to obtain a phoneme embedding. As described above with reference to fig. 2, the phoneme data is the phoneme data corresponding to the voice data in the first training data. For example, if the phoneme data is represented by a feature vector X, the feature vector X is encoded (i.e., matrix-multiplied) by the phoneme embedding layer and converted into a phoneme embedding, i.e., a feature vector X_embedding.
In step S420, the phoneme embedding is position-encoded by the first encoder to obtain position-encoding information of the phoneme embedding; the position-encoding information is used to reflect the time order or position of the elements in the phoneme embedding, and it is a vector having the same dimension as the phoneme embedding.
In step S430, the position-encoding information of the phoneme embedding is added to the phoneme embedding by the first adder, resulting in a phoneme embedding on which the position-encoding information is superimposed. In an exemplary embodiment of the present disclosure, suppose the position-encoding information of the phoneme embedding is a vector X_pos. The first adder adds the phoneme embedding X_embedding output from the phoneme embedding layer and the position-encoding information X_pos to obtain the phoneme embedding with superimposed position-encoding information, which is also a vector, denoted X'_embedding, and X_embedding, X_pos and X'_embedding have the same dimension.
In step S440, the phoneme embedding with superimposed position-encoding information is encoded by the second encoder to obtain a first hidden sequence. For example, the second encoder encodes the phoneme embedding X'_embedding to obtain a first hidden sequence X_hidden. In an exemplary embodiment of the present disclosure, the second encoder may employ the Transformer network structure shown in fig. 6; specifically, the second encoder is a stack of N identical encoding modules, where the output of the previous encoding module is passed to the next encoding module for encoding. As shown in fig. 6, each encoding module includes four components: a multi-head attention layer, an add & normalize layer, a feed-forward layer, and another add & normalize layer. Step S440 is described below with reference to figs. 5, 6, and 7.
First, the phoneme embedding X'_embedding output from the first adder is input to the first of the N encoding modules of the second encoder, where it is linearly transformed with different weight matrices to obtain the three matrices Query (Q), Key (K) and Value (V).
Q, K and V are then input to the multi-head attention layer for processing to obtain an attention matrix. As shown in (a) of fig. 7, the multi-head attention layer includes a plurality of linear layers, a plurality of scaled dot-product attention layers, a concatenation layer and a final linear layer. When Q, K and V are input to the multi-head attention layer, each of them is first linearly transformed by its corresponding linear layer to obtain Q', K' and V', and the transformed results Q', K' and V' are then input to the scaled dot-product attention layer (Scaled Dot-Product Attention) for computation. Specifically, (b) of fig. 7 shows the internal structure of the scaled dot-product attention layer: as shown in (b) of fig. 7, Q' and K' are first matrix-multiplied (MatMul), the result is scaled (Scale), a mask operation (Mask) is optionally applied, and finally a normalization operation (SoftMax) is performed, yielding an attention matrix. Because position-encoding information has been added to the features input to the multi-head attention layer, the attention matrix obtained here is position-dependent. The attention matrix is then matrix-multiplied (MatMul) with V', yielding an output matrix z with attention information added. The operations described above with reference to (b) of fig. 7 are performed h times in parallel, the h results (i.e., h output matrices z) are concatenated (concat), and the concatenated result is input to the final linear layer for linear transformation, yielding the final output matrix Z of the multi-head attention layer, where the output matrix Z has the same dimension as the phoneme embedding X'_embedding, h denotes the number of scaled dot-product attention layers arranged in parallel, and h is configurable.
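The following minimal sketch reproduces the computation just described (MatMul, Scale, optional Mask, SoftMax, MatMul with V', concatenation of h heads and a final linear layer); the tensor dimensions and the head count h are illustrative assumptions.

```python
# A minimal sketch of scaled dot-product / multi-head attention, assuming
# batch=2, sequence length 11, model dimension 256 and h=4 heads.
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch*h, seq_len, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # MatMul + Scale
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # optional Mask
    attn = F.softmax(scores, dim=-1)                           # SoftMax -> attention matrix
    return attn @ v                                            # MatMul with V'

batch, seq_len, d_model, h = 2, 11, 256, 4
x = torch.randn(batch, seq_len, d_model)
w_q = torch.nn.Linear(d_model, d_model)
w_k = torch.nn.Linear(d_model, d_model)
w_v = torch.nn.Linear(d_model, d_model)
w_o = torch.nn.Linear(d_model, d_model)

d_k = d_model // h
def split_heads(t):  # (batch, seq, d_model) -> (batch*h, seq, d_k)
    return t.view(batch, seq_len, h, d_k).transpose(1, 2).reshape(batch * h, seq_len, d_k)

heads = scaled_dot_product_attention(split_heads(w_q(x)), split_heads(w_k(x)), split_heads(w_v(x)))
concat = heads.reshape(batch, h, seq_len, d_k).transpose(1, 2).reshape(batch, seq_len, d_model)
z = w_o(concat)  # final output Z, same dimension as the input embedding
print(z.shape)   # torch.Size([2, 11, 256])
```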
Thereafter, as shown in fig. 6, the add & normalize layer adds the input of the current encoding module, namely the phoneme embedding X'_embedding, to the output matrix Z of the multi-head attention layer and normalizes the result, obtaining an optimized output matrix Z'. The optimized output matrix Z' is then input to the feed-forward layer and transformed into an output matrix Z''. The subsequent add & normalize layer then adds and normalizes the optimized output matrix Z' and the transformed output matrix Z'' to obtain the encoding result of the current encoding module, which is then encoded by the next encoding module connected to the current one in a similar manner, until all N encoding modules have been processed and the first hidden sequence X_hidden is output.
In step S450, the first hidden sequence X is concealed by a second adder hidden Added with the speaker to obtain a second hidden sequence X' hidden . In an exemplary embodiment of the present disclosure, a speaker is embedded with a first concealment sequence X hidden Adding to obtain a second hidden sequence X 'added with the tone of the speaker' hidden The hidden sequence is then input to the variable adapter.
In step S460, the second hidden sequence is fused by the variable adapter, so as to obtain a third hidden sequence.
In an exemplary embodiment of the present disclosure, in the training phase, the variable adapter fuses the ground-truth values of the variable information (e.g., duration, pitch and energy) with the second hidden sequence as input to obtain the third hidden sequence; during this process the initial speech conversion model 120 trains a plurality of internal predictors to fit the ground-truth variable information, so that in the inference phase the variable information predicted by these predictors is used instead.
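The following schematic sketch, which is an assumption rather than the exact design of the variable adapter, shows how one such predictor can fit ground-truth variable information during training and substitute its own prediction at inference.

```python
# A schematic sketch of one variable predictor (e.g. pitch); the architecture is assumed.
import torch
import torch.nn as nn

class VariablePredictor(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.embed = nn.Linear(1, hidden)  # projects the scalar value back into the hidden space

    def forward(self, x, target=None):
        pred = self.net(x).squeeze(-1)                  # (batch, seq_len) predicted values
        value = target if target is not None else pred  # ground truth in training, prediction at inference
        return x + self.embed(value.unsqueeze(-1)), pred

predictor = VariablePredictor()
x = torch.randn(2, 11, 256)                    # second hidden sequence
gt_pitch = torch.rand(2, 11)                   # ground-truth pitch per frame (training)
fused, pred = predictor(x, target=gt_pitch)    # training-time fusion with ground truth
loss = nn.functional.mse_loss(pred, gt_pitch)  # the predictor learns to fit the ground truth
```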
In step S470, the third hidden sequence is position-encoded by the third encoder to obtain position-encoding information of the third hidden sequence, where the position-encoding information is used to reflect the time order or position of the elements in the third hidden sequence and is a vector having the same dimension as the third hidden sequence.
In step S480, the position-encoding information of the third hidden sequence, the speaker embedding and the third hidden sequence are added by the third adder to obtain a fourth hidden sequence, where the position-encoding information, the speaker embedding and the third hidden sequence have the same dimension.
In step S490, the fourth hidden sequence is decoded by the decoder to obtain a predicted mel spectrogram. Similar to the second encoder, the decoder also employs the Transformer network structure shown in fig. 6; since the data processing of the Transformer network structure has already been described above with reference to figs. 6 and 7, a detailed description thereof is omitted.
In an exemplary embodiment of the present disclosure, as shown in fig. 5 and unlike the related art, the first hidden sequence output from the second encoder is added to the speaker embedding output from the speaker recognition model, thereby obtaining a second hidden sequence into which the speaker's timbre is added; in addition, the third hidden sequence output from the variable adapter is added not only to the position-encoding information but also to the speaker embedding output from the speaker recognition model, thereby obtaining a fourth hidden sequence into which the speaker's timbre is added, after which the result of the addition is input to the decoder for decoding to obtain the predicted mel spectrogram.
Further, although not shown in fig. 5, in the inference stage the output of the speech conversion model in fig. 5 is input to a vocoder, which converts the mel spectrogram decoded by the decoder into sound.
In the process of training the initial speech conversion model, after the predicted mel spectrogram is obtained, the parameters of the initial speech conversion model are adjusted based on the predicted mel spectrogram and the standard mel spectrogram to obtain the trained speech conversion model. As described above with reference to fig. 2, the standard mel spectrogram is converted from the voice data in the first training data; therefore, for example, a loss function can be computed using the predicted mel spectrogram and the standard mel spectrogram to determine whether the initial speech conversion model meets the requirement; if not, the parameters of the initial speech conversion model are adjusted and training continues until the model meets the requirement.
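A minimal sketch of this parameter-adjustment step is shown below; the L1 loss and the optimizer are illustrative assumptions, since the present disclosure does not prescribe a specific loss function, and model, phonemes and speaker_emb are placeholders for the initial speech conversion model and its inputs.

```python
# A minimal sketch of one training step, assuming an L1 loss between the predicted
# and standard mel spectrograms; `model` is a placeholder for the conversion model.
import torch.nn.functional as F

def training_step(model, optimizer, phonemes, speaker_emb, standard_mel):
    optimizer.zero_grad()
    predicted_mel = model(phonemes, speaker_emb)     # forward pass of the conversion model
    loss = F.l1_loss(predicted_mel, standard_mel)    # distance to the standard mel spectrogram
    loss.backward()                                  # adjust the model parameters
    optimizer.step()
    return loss.item()
```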
This completes the training of the initial speech conversion model. Given a target speaker embedding, the trained speech conversion model can perform text-to-speech synthesis using the target speaker embedding and the target text; alternatively, the trained speech conversion model can perform text-to-speech synthesis together with the pre-trained speaker recognition model, that is, the speaker recognition model obtains the target speaker embedding from the target speaker's voice, and the speech conversion model performs text-to-speech synthesis using the target speaker embedding and the target text. A speech generation method using the trained speech conversion model is described below with reference to figs. 8 and 1.
Fig. 8 is a flowchart illustrating a speech generation method according to an exemplary embodiment of the present disclosure.
Referring to fig. 8, in step S810, target voice data and target text of a target speaker are acquired.
In an exemplary embodiment of the present disclosure, the target speaker is the speaker whose voice the user wants to customize; for example, when the user wants to customize speech with a certain singer's voice, that singer is the target speaker. Correspondingly, the target voice data is a piece of the target speaker's speech, for example ten-odd seconds, tens of seconds or a few minutes of speech, so the cost of purchasing/collecting model training data can be saved. The target text is the text that the user wants to convert into sound having the voice characteristics of the target speaker; for example, if the user wants to convert the text "Jeju Island, the largest island of Korea" into sound having the voice characteristics of the target speaker, that text is the target text.
In step S820, the target text is converted into target phoneme data. In an exemplary embodiment of the present disclosure, the target phoneme data corresponds to the target text, and any text-to-phoneme conversion method may be used for the conversion. Although fig. 1 shows the target text being input directly to the speech conversion model, the target text needs to be converted into target phoneme data by a text-to-phoneme converter before it is input to the speech conversion model; this may be done outside the speech conversion model (not shown in fig. 1), but the present application is not limited thereto, and the conversion may also be done by a text-to-phoneme converter provided inside the speech conversion model. For example, the text-to-phoneme converter may convert the target text into target phoneme data by looking up a phoneme table. In an exemplary embodiment of the present disclosure, if the target text is the Chinese sentence "Jeju Island, the largest island of Korea", the target phoneme data generated from it is "han2 guo2 7zui4 da4 de5 7dao6 yu6 7ji3 zhou1 dao3".
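The phoneme-table lookup mentioned above can be pictured as a simple lexicon lookup, as in the following sketch; the tiny lexicon and its pinyin entries are illustrative assumptions, not a real pronunciation dictionary.

```python
# A minimal sketch of text-to-phoneme conversion via a lookup table (assumed lexicon).
lexicon = {
    "韩国": ["han2", "guo2"],
    "最大": ["zui4", "da4"],
    "的": ["de5"],
    "岛屿": ["dao3", "yu3"],
    "济州岛": ["ji3", "zhou1", "dao3"],
}

def text_to_phonemes(words):
    phonemes = []
    for w in words:
        phonemes.extend(lexicon.get(w, ["<unk>"]))  # fall back to an unknown token
    return phonemes

print(text_to_phonemes(["韩国", "最大", "的", "岛屿", "济州岛"]))
```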
In step S830, the target voice data is input to the speaker recognition model to obtain a target speaker embedding, where the target speaker embedding is used to represent the timbre of the target speaker. In exemplary embodiments of the present disclosure, the speaker recognition model may employ a RepVGG model, or other speaker recognition models such as an ECAPA-TDNN model or a ResNet-TDNN model.
In step S840, the target speaker embedding and the target phoneme data are input to the speech conversion model to generate a mel spectrogram of the speech corresponding to the target text, as shown in fig. 1. The speech conversion model used in this step is obtained by training based on speaker embeddings output by the speaker recognition model; since the process of obtaining the speech conversion model by training the initial speech conversion model has been described above with reference to fig. 2, it is not described in more detail here.
In an exemplary embodiment of the present disclosure, when the speech conversion model is used to generate a mel spectrogram, the structure of the speech conversion model may still retain the structure in fig. 5. Therefore, in generating the mel spectrogram of the speech corresponding to the target text, a target speaker embedding of the target speaker is first obtained from the target voice data by the speaker recognition model, where the target speaker embedding is used to represent the timbre of the target speaker, and then the mel spectrogram of the speech corresponding to the target text is generated by the speech conversion model using the target speaker embedding and the target phoneme data.
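The overall inference flow just described can be summarized by the following high-level sketch, in which speaker_recognition_model, speech_conversion_model, vocoder and text_to_phonemes are hypothetical handles to already-trained components rather than APIs defined by the present disclosure.

```python
# A high-level sketch of inference with hypothetical component names.
import torch

@torch.no_grad()
def customize_speech(target_wav, target_text,
                     speaker_recognition_model, speech_conversion_model,
                     vocoder, text_to_phonemes):
    spk_emb = speaker_recognition_model(target_wav)   # target speaker embedding (timbre)
    phonemes = text_to_phonemes(target_text)          # target phoneme data
    mel = speech_conversion_model(phonemes, spk_emb)  # mel spectrogram for the target text
    return vocoder(mel)                               # speech with the target speaker's timbre
```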
In an exemplary embodiment of the present disclosure, in generating the mel spectrogram, the target speaker embedding is added, at different positions inside the speech conversion model, to intermediate quantities generated by the speech conversion model for use in the inference process; for example, the target speaker embedding is input to a plurality of adders in the speech conversion model and is added to the outputs of other modules of the speech conversion model for use in mel spectrogram generation. This process is described below with reference to figs. 5, 9, and 10.
Fig. 9 is a flowchart illustrating a process of generating a mel-spectrogram of a voice using target speaker embedding and target phoneme data by a voice conversion model according to an exemplary embodiment of the present disclosure.
Referring to fig. 9, in step S910, the target phoneme data is encoded by the phoneme embedding layer to obtain a target phoneme embedding.
In step S920, the target phoneme embedding is position-encoded by the first encoder to obtain position-encoding information of the target phoneme embedding.
In step S930, the position-encoding information of the target phoneme embedding is added to the target phoneme embedding by the first adder, resulting in a target phoneme embedding on which the position-encoding information is superimposed. In an exemplary embodiment of the present disclosure, the target phoneme embedding output from the phoneme embedding layer is added to the position-encoding information output from the first encoder, thereby obtaining the phoneme embedding with superimposed position-encoding information.
In step S940, the target phoneme embedding with superimposed position-encoding information is encoded by the second encoder to obtain a first target hidden sequence.
In step S950, the first target hidden sequence and the target speaker embedding are added by the second adder to obtain a second target hidden sequence.
In an exemplary embodiment of the present disclosure, the target speaker embedding characterizes the timbre of the target speaker, so adding the target speaker embedding to the first target hidden sequence yields a hidden sequence to which the target speaker's timbre has been added, and this hidden sequence is input to the variable adapter.
In step S960, fusion processing is performed on the second target hidden sequence through the variable adapter, so as to obtain a third target hidden sequence.
In step S970, the third target hidden sequence is position-encoded by the third encoder to obtain position-encoding information of the third target hidden sequence.
In step S980, the position coding information of the third target hidden sequence, the target speaker embedding and the third target hidden sequence are added by a third adder, so as to obtain a fourth target hidden sequence.
In step S990, the fourth target hidden sequence is decoded by the decoder to obtain a mel spectrogram of the speech corresponding to the target text. Since steps S910 to S990 are similar to the processes of S410 to S490 described above with reference to fig. 4, a detailed description thereof will not be provided here.
In an exemplary embodiment of the present disclosure, as shown in fig. 5, the hidden sequence output from the variable adapter, the position-encoding information output from the third encoder, and the target speaker embedding are added, and the result of the addition is then input to the decoder for decoding to obtain the mel spectrogram. Adding the target speaker embedding to the first target hidden sequence and the third target hidden sequence in steps S950 and S980, respectively, causes the speech finally generated from the above mel spectrogram to have the timbre of the target speaker.
In step S850, speech is generated using the mel spectrogram, where the speech contains the timbre of the target speaker. In an exemplary embodiment of the present disclosure, the output of the decoder is a mel spectrogram, from which synthetic speech can be generated by a vocoder, such as the synthetic speech shown in fig. 10.
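The following minimal sketch shows the final vocoder step of turning a mel spectrogram back into a waveform; a trained neural vocoder (e.g., HiFi-GAN) would normally be used, and the Griffin-Lim-based mel inversion from librosa is used here only as a self-contained stand-in whose STFT parameters (assumed values) must match those used when the mel spectrogram was computed.

```python
# A minimal sketch, assuming Griffin-Lim inversion as a stand-in for a neural vocoder.
import librosa
import numpy as np
import soundfile as sf

def mel_to_wav(mel_power: np.ndarray, sr: int = 22050) -> np.ndarray:
    """mel_power: (n_mels, frames) power-scale mel spectrogram."""
    return librosa.feature.inverse.mel_to_audio(mel_power, sr=sr, n_fft=1024, hop_length=256)

# Example: round-trip a synthetic waveform through a mel spectrogram and back.
wav = 0.5 * np.sin(2 * np.pi * 440 * np.arange(22050) / 22050).astype(np.float32)  # 1 s test tone
mel = librosa.feature.melspectrogram(y=wav, sr=22050, n_fft=1024, hop_length=256, n_mels=80)
sf.write("reconstructed.wav", mel_to_wav(mel), 22050)
```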
After the speech conversion model has been trained according to the above method, only tens of seconds (or a few minutes) of data from any target speaker need be provided to the speaker recognition model, and the speech conversion model can then use the resulting target speaker embedding to generate a mel spectrogram for a target text. This target speaker may be a speaker who does not appear in the training data used when training the speech conversion model, so the speech conversion model is suitable for various scenarios in which a voice must be usable as soon as it is collected.
Fig. 11 is a block diagram illustrating a training apparatus 1100 of a speech conversion model according to an exemplary embodiment of the present disclosure.
Referring to fig. 11, the apparatus 1100 includes a training data acquisition unit 1110 and a model training unit 1120.
In an exemplary embodiment of the present disclosure, the training data acquisition unit 1110 may be configured to acquire first training data, wherein the first training data includes voice data, phoneme data corresponding to the voice data, and a standard mel spectrogram corresponding to the voice data.
In an exemplary embodiment of the present disclosure, the model training unit 1120 may be configured to input the voice data into a pre-trained speaker recognition model to obtain a speaker embedding corresponding to the voice data, and to train an initial speech conversion model based on the phoneme data, the standard mel spectrogram and the speaker embedding to obtain the speech conversion model for converting text into a mel spectrogram of speech, where the speaker embedding is used to represent the timbre of the speaker.
In an exemplary embodiment of the present disclosure, the speaker recognition model needs to be trained in advance before using the speaker recognition model, and thus, the training data acquisition unit 1110 may be further configured to acquire second training data, wherein the second training data includes speech data different from speech data in the first training data. Meanwhile, the model training unit 1120 may be further configured to train the initial speaker recognition model based on the voice data in the first training data and the second training data or based on the second training data, to obtain a pre-trained speaker recognition model.
In an exemplary embodiment of the present disclosure, the model training unit 1120 may be configured to input the phoneme data and the speaker embedding into the initial speech conversion model for prediction to obtain a predicted mel spectrogram, and to adjust parameters of the initial speech conversion model based on the predicted mel spectrogram and the standard mel spectrogram to obtain the speech conversion model.
In an exemplary embodiment of the present disclosure, in predicting the mel spectrogram, the speaker embedding is added, at different positions of the initial speech conversion model, to intermediate quantities of the initial speech conversion model; for example, the speaker embedding is input to a plurality of adders in the initial speech conversion model and is added to the outputs of other modules of the initial speech conversion model.
In an exemplary embodiment of the present disclosure, the initial speech conversion model includes a phoneme embedding layer, a first encoder, a first adder, a second encoder, a second adder, a variable adapter, a third encoder, a third adder, and a decoder. The model training unit 1120 may be configured to obtain the predicted mel spectrogram by: encoding the phoneme data through the phoneme embedding layer to obtain a phoneme embedding; position-encoding the phoneme embedding through the first encoder to obtain position-encoding information of the phoneme embedding; adding the position-encoding information of the phoneme embedding to the phoneme embedding through the first adder to obtain a phoneme embedding superimposed with the position-encoding information; encoding the phoneme embedding superimposed with the position-encoding information through the second encoder to obtain a first hidden sequence; adding the first hidden sequence and the speaker embedding through the second adder to obtain a second hidden sequence; fusing the second hidden sequence through the variable adapter to obtain a third hidden sequence; position-encoding the third hidden sequence through the third encoder to obtain position-encoding information of the third hidden sequence; adding the position-encoding information of the third hidden sequence, the speaker embedding and the third hidden sequence through the third adder to obtain a fourth hidden sequence; and decoding the fourth hidden sequence through the decoder to obtain the predicted mel spectrogram.
The operations performed by the above respective units have been described in detail above with reference to fig. 2, and thus, for brevity, will not be explained here.
Fig. 12 is a block diagram illustrating a speech generating apparatus 1200 according to an exemplary embodiment of the present disclosure.
Referring to fig. 12, the speech generating apparatus 1200 may include a data acquisition unit 1210, a phoneme generating unit 1220, a speaker embedding obtaining unit 1230, and a speech generating unit 1240.
In an exemplary embodiment of the present disclosure, the data acquisition unit 1210 may be configured to acquire target voice data and target text of a target speaker.
In an exemplary embodiment of the present disclosure, the phoneme generating unit 1220 may be configured to convert the target text into target phoneme data.
In an exemplary embodiment of the present disclosure, the speaker embedding obtaining unit 1230 may be configured to input the target voice data into a speaker recognition model to obtain a target speaker embedding, wherein the target speaker embedding is used to represent a timbre of the target speaker.
In an exemplary embodiment of the present disclosure, the speech generating unit 1240 may be configured to input the target speaker embedding and the target phoneme data into a speech conversion model, generate a mel spectrogram of a speech corresponding to the target text, and generate the speech using the mel spectrogram, wherein the speech contains the timbre of the target speaker, and wherein the speech conversion model is obtained through training based on speaker embeddings output by the speaker recognition model.
In an exemplary embodiment of the present disclosure, in generating the speech, the target speaker embedding is added to intermediate quantities of the speech conversion model at different positions of the model; for example, the target speaker embedding is fed into a plurality of adders of the speech conversion model and added, by each adder, to the output of another module of the model.
In an exemplary embodiment of the present disclosure, the speech generating unit 1240 may be configured to generate the mel spectrogram of the speech corresponding to the target text by: encoding the target phoneme data through the phoneme embedding layer to obtain a target phoneme embedding; performing position coding on the target phoneme embedding through the first encoder to obtain position coding information of the target phoneme embedding; adding the position coding information of the target phoneme embedding to the target phoneme embedding through the first adder to obtain a target phoneme embedding superimposed with the position coding information; encoding the target phoneme embedding superimposed with the position coding information through the second encoder to obtain a first target hidden sequence; adding the first target hidden sequence and the target speaker embedding through the second adder to obtain a second target hidden sequence; performing fusion processing on the second target hidden sequence through the variable adapter to obtain a third target hidden sequence; performing position coding on the third target hidden sequence through the third encoder to obtain position coding information of the third target hidden sequence; adding the position coding information of the third target hidden sequence, the target speaker embedding, and the third target hidden sequence through the third adder to obtain a fourth target hidden sequence; and decoding the fourth target hidden sequence through the decoder to obtain the mel spectrogram.
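For completeness, a hedged end-to-end sketch of the generation flow is shown below; `text_to_phonemes` and `vocoder` are placeholders for a grapheme-to-phoneme front end and a vocoder (for example HiFi-GAN or Griffin-Lim), not components named by the disclosure, and the two model objects reuse the interfaces of the sketches above.

```python
# Hedged inference sketch: target text + target speaker's voice -> speech with the target timbre.
import torch

@torch.no_grad()
def generate_speech(text, target_mels, speaker_model, conversion_model,
                    text_to_phonemes, vocoder):
    phonemes = text_to_phonemes(text)             # target text -> target phoneme ids
    spk_emb, _ = speaker_model(target_mels)       # target speaker embedding (represents timbre)
    mel = conversion_model(phonemes, spk_emb)     # mel spectrogram of the speech
    return vocoder(mel)                           # waveform carrying the target speaker's timbre
```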
Since the operations performed by the above respective units have been described in detail above with reference to fig. 8, they are not repeated here for brevity.
Fig. 13 is a diagram illustrating an implementation environment of a training method and a speech generation method of a speech conversion model according to an exemplary embodiment of the present disclosure.
As shown in fig. 13, an implementation environment of a training method of a speech conversion model according to an exemplary embodiment of the present disclosure may include a terminal device 1310, or include the terminal device 1310 and a server device 1320 connected to the terminal device 1310 through a wired or wireless network or the like. The terminal device 1310 may be an electronic device capable of performing voice generation, such as a smart phone, a notebook computer, a desktop computer, or a smart watch. The server device 1320 may be a platform device composed of one or more servers, or a virtual server platform.
The terminal device 1310 may independently implement the foregoing method for training a speech conversion model, and then apply the trained speech conversion model to generate a mel-spectrogram and generate speech using the mel-spectrogram.
Alternatively, the terminal device 1310 may implement the training method of the speech conversion model together with the server device 1320. For example, the terminal device 1310 may obtain training data for training the model from the server device 1320, and the model training process may be implemented on the terminal device 1310. For another example, the terminal device 1310 may directly load, from the server device 1320, a speech conversion model trained on the server device 1320 or a trained speech conversion model acquired by the server device 1320 from another device, and then generate a mel spectrogram using the speech conversion model and generate speech using the mel spectrogram.
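As a brief sketch of the second option, the terminal device could pull a checkpoint trained on the server device and run inference locally; the URL and the plain `state_dict` checkpoint layout below are hypothetical.

```python
# Hedged sketch: load a server-trained speech conversion model onto the terminal device.
import torch

CHECKPOINT_URL = "https://server.example.com/models/voice_conversion.pt"  # hypothetical URL

model = ConversionModel()                         # architecture from the sketch above
state = torch.hub.load_state_dict_from_url(CHECKPOINT_URL, map_location="cpu")
model.load_state_dict(state)
model.eval()                                      # ready for mel-spectrogram generation
```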
Fig. 14 is a block diagram illustrating an electronic device 1400 according to an exemplary embodiment of the present disclosure.
According to an embodiment of the present disclosure, an electronic device 1400 may be provided. The electronic device 1400 comprises at least one memory 1401 and at least one processor 1402, the at least one memory 1401 having stored therein computer-executable instructions which, when executed by the at least one processor 1402, cause the at least one processor 1402 to perform the training method of a speech conversion model or the speech generation method according to an embodiment of the present disclosure.
By way of example, the electronic device 1400 may be a PC computer, a tablet device, a personal digital assistant, a smart phone, or any other device capable of executing the above-described set of instructions. Here, the electronic device 1400 is not necessarily a single electronic device, but may be any apparatus or collection of circuits capable of executing the above-described instructions (or instruction set) individually or in combination. The electronic device 1400 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 1400, the at least one processor 1402 may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, the at least one processor 1402 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The at least one processor 1402 may execute instructions or code stored in the memory, wherein the at least one memory 1401 may also store data. The instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The at least one memory 1401 may be integrated with the at least one processor 1402, e.g. with RAM or flash memory arranged within an integrated circuit microprocessor or the like. Further, the at least one memory 1401 may include a stand-alone device, such as an external disk drive, a storage array, or other storage device usable by any database system. The at least one memory 1401 and the at least one processor 1402 may be operatively coupled or may communicate with each other, e.g., through an I/O port, a network connection, etc., such that the at least one processor 1402 is capable of reading files stored in the at least one memory 1401.
In addition, electronic device 1400 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device may be connected to each other via a bus and/or a network.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium, wherein the instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the training method or the speech generation method of the speech conversion model of the embodiments of the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk storage, hard disk drives (HDD), solid-state disks (SSD), card memory (such as multimedia cards, Secure Digital (SD) cards, or eXtreme Digital (xD) cards), magnetic tape, floppy disks, magneto-optical data storage, hard disks, solid-state disks, and any other means configured to store computer programs and any associated data, data files, and data structures in a non-transitory manner and to provide the computer programs and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the programs. The computer programs in the computer-readable storage medium described above can be run in an environment deployed in a computer device such as a client, a host, a proxy device, or a server. Further, in one example, the computer programs and any associated data, data files, and data structures are distributed across networked computer systems such that the computer programs and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by at least one processor, implement a training method or a speech generation method of a speech conversion model of an embodiment of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of training a speech conversion model, the method comprising:
acquiring first training data, wherein the first training data comprises voice data, phoneme data corresponding to the voice data and a standard mel spectrogram corresponding to the voice data;
inputting the voice data into a pre-trained speaker recognition model to obtain speaker embedding corresponding to the voice data, wherein the speaker embedding is used for representing the tone of a speaker;
training an initial speech conversion model based on the phoneme data, the standard mel spectrogram and the speaker embedding to obtain the speech conversion model, wherein the speech conversion model is used for converting text into a mel spectrogram of speech.
2. The training method of claim 1, wherein before the inputting the voice data into a pre-trained speaker recognition model to obtain the speaker embedding corresponding to the voice data, the training method further comprises:
acquiring second training data, wherein the second training data comprises voice data different from the voice data in the first training data;
and training an initial speaker recognition model based on the voice data in the first training data and the second training data, or based on the second training data, to obtain the pre-trained speaker recognition model.
3. The training method of claim 1, wherein the training an initial speech conversion model based on the phoneme data, the standard mel spectrogram and the speaker embedding to obtain the speech conversion model comprises:
inputting the phoneme data and the speaker embedding into the initial speech conversion model for prediction to obtain a predicted mel spectrogram;
and adjusting parameters of the initial speech conversion model based on the predicted mel spectrogram and the standard mel spectrogram to obtain the speech conversion model.
4. The training method of claim 3, wherein the initial speech conversion model comprises a phoneme embedding layer, a first encoder, a first adder, a second encoder, a second adder, a variable adapter, a third encoder, a third adder, and a decoder, and wherein the inputting the phoneme data and the speaker embedding into the initial speech conversion model for prediction to obtain a predicted mel spectrogram comprises:
encoding the phoneme data through the phoneme embedding layer to obtain phoneme embedding;
performing position coding on the phoneme embedding by the first encoder to obtain position coding information of the phoneme embedding;
adding the position coding information of the phoneme embedding to the phoneme embedding through the first adder to obtain a phoneme embedding superimposed with the position coding information;
encoding the phoneme embedding superimposed with the position coding information through the second encoder to obtain a first hidden sequence;
adding the first hidden sequence and the speaker embedding through the second adder to obtain a second hidden sequence;
performing fusion processing on the second hidden sequence through the variable adapter to obtain a third hidden sequence;
performing position coding on the third hidden sequence through the third encoder to obtain position coding information of the third hidden sequence;
adding the position coding information of the third hidden sequence, the speaker embedding and the third hidden sequence through the third adder to obtain a fourth hidden sequence;
and decoding the fourth hidden sequence through the decoder to obtain the predicted mel spectrogram.
5. A speech generation method, characterized in that the speech generation method comprises:
acquiring target voice data and target text of a target speaker;
converting the target text into target phoneme data;
inputting the target voice data into a speaker recognition model to obtain target speaker embedding, wherein the target speaker embedding is used for representing the tone of the target speaker;
inputting the target speaker embedding and the target phoneme data into a speech conversion model to generate a mel spectrogram of a voice corresponding to the target text;
generating the voice by utilizing the mel spectrogram, wherein the voice contains the tone color of the target speaker;
wherein the speech conversion model is obtained through training based on speaker embeddings output by the speaker recognition model.
6. The speech generating method of claim 5, wherein the speech conversion model comprises a phoneme embedding layer, a first encoder, a first adder, a second encoder, a second adder, a variable adapter, a third encoder, a third adder, and a decoder,
and wherein the step of inputting the target speaker embedding and the target phoneme data into the speech conversion model to generate a mel spectrogram of the voice corresponding to the target text comprises:
encoding the target phoneme data through the phoneme embedding layer to obtain target phoneme embedding;
performing position coding on the target phoneme embedding through the first encoder to obtain position coding information of the target phoneme embedding;
adding the position coding information of the target phoneme embedding to the target phoneme embedding through the first adder to obtain a target phoneme embedding superimposed with the position coding information;
encoding the target phoneme embedding superimposed with the position coding information through the second encoder to obtain a first target hidden sequence;
adding the first target hidden sequence and the target speaker embedding through the second adder to obtain a second target hidden sequence;
performing fusion processing on the second target hidden sequence through the variable adapter to obtain a third target hidden sequence;
performing position coding on the third target hidden sequence through the third encoder to obtain position coding information of the third target hidden sequence;
adding the position coding information of the third target hidden sequence, the target speaker embedding and the third target hidden sequence through the third adder to obtain a fourth target hidden sequence;
and decoding the fourth target hidden sequence through the decoder to obtain the mel spectrogram.
7. A training device for a speech conversion model, the training device comprising:
a training data acquisition unit configured to acquire first training data, wherein the first training data includes speech data, phoneme data corresponding to the speech data, and a standard mel spectrogram corresponding to the speech data;
a model training unit configured to: input the voice data into a pre-trained speaker recognition model to obtain a speaker embedding corresponding to the voice data, and train an initial speech conversion model based on the phoneme data, the standard mel spectrogram and the speaker embedding to obtain the speech conversion model, wherein the speech conversion model is used for converting text into a mel spectrogram of speech, and the speaker embedding is used for representing the tone of the speaker.
8. A speech generating apparatus, characterized in that the speech generating apparatus comprises:
a data acquisition unit configured to acquire target voice data and target text of a target speaker;
a phoneme generating unit configured to convert the target text into target phoneme data;
a speaker embedding obtaining unit configured to input the target voice data to a speaker recognition model to obtain a target speaker embedding, wherein the target speaker embedding is used for representing a tone of the target speaker;
a speech generating unit configured to input the target speaker embedding and the target phoneme data into a speech conversion model, generate a mel spectrogram of a speech corresponding to the target text, and generate the speech using the mel spectrogram, wherein the speech contains the timbre of the target speaker,
and wherein the speech conversion model is obtained through training based on speaker embeddings output by the speaker recognition model.
9. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer executable instructions, when executed by the at least one processor, cause the at least one processor to perform the training method of any one of claims 1 to 4 or the speech generation method of any one of claims 5 and 6.
10. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the training method of any one of claims 1 to 4 or the speech generation method of any one of claims 5 and 6.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination