CN115206281A - Speech synthesis model training method and device, electronic equipment and medium - Google Patents

Speech synthesis model training method and device, electronic equipment and medium

Info

Publication number
CN115206281A
CN115206281A
Authority
CN
China
Prior art keywords
audio data
source
initial
target
predicted
Prior art date
Legal status
Pending
Application number
CN202110397569.6A
Other languages
Chinese (zh)
Inventor
张泽旺
张文杰
Current Assignee
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Cloud Computing Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Cloud Computing Beijing Co Ltd filed Critical Tencent Cloud Computing Beijing Co Ltd
Priority to CN202110397569.6A priority Critical patent/CN115206281A/en
Publication of CN115206281A publication Critical patent/CN115206281A/en
Pending legal-status Critical Current

Classifications

    • G10L13/02 Speech synthesis; Text to speech systems: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 Speech synthesis; Text to speech systems: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters: the extracted parameters being the cepstrum
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique: using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiment of the application discloses a method, an apparatus, an electronic device and a medium for training a speech synthesis model, which can be applied to the field of speech technology. The method comprises the following steps: obtaining source audio data and a target voice style feature; calling a voice conversion model to process the source audio data and the target voice style feature to obtain first predicted audio data, so as to obtain audio data corresponding to the target voice style feature; calling an initial voice synthesis model to process source text data and the target voice style feature to obtain second predicted audio data; and updating parameters in the initial voice synthesis model based on the first predicted audio data and the second predicted audio data, so as to train the initial voice synthesis model. By adopting the embodiment of the application, the generalization capability of the trained voice synthesis model is enhanced, and the voice style similarity of the synthesized audio data is improved.

Description

Speech synthesis model training method and device, electronic equipment and medium
Technical Field
The present application relates to the field of speech technology, and in particular, to a method and an apparatus for training a speech synthesis model, an electronic device, and a medium.
Background
With the rapid development of speech technology, speech synthesis (text-to-speech, TTS) is increasingly used in daily life. Speech style transfer is an important branch of speech synthesis; it refers to giving the synthesized audio data the speaking style of a particular speaking object. Currently, a speech synthesis model for speech style transfer is usually trained by first training an average model on a multi-speaker corpus and then fine-tuning it on data of the target speaker. In practice, however, it is found that a model trained in this way has limited generalization capability, and the audio data synthesized by such a model has low speech style similarity. How to improve the speech style similarity during speech style transfer is therefore an urgent problem to be solved.
Disclosure of Invention
The embodiment of the application provides a method, a device, electronic equipment and a medium for training a speech synthesis model, which are beneficial to enhancing the generalization capability of the speech synthesis model and improving the speech style similarity of synthesized audio data.
In one aspect, an embodiment of the present application provides a method for training a speech synthesis model, where the method includes:
acquiring source audio data and target voice style characteristics, wherein the target voice style characteristics are characteristics corresponding to the voice style of the target audio data;
calling a voice conversion model to process the source audio data and the target voice style characteristics to obtain first predicted audio data;
extracting source text data in the source audio data, and calling an initial speech synthesis model to process the source text data and the target speech style characteristics to obtain second predicted audio data;
updating parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data to train the initial speech synthesis model.
In another aspect, an embodiment of the present application provides a speech synthesis model training apparatus, where the apparatus includes:
the system comprises an acquisition module and a processing module, wherein the acquisition module is used for acquiring source audio data and target voice style characteristics, and the target voice style characteristics are characteristics corresponding to the voice style of the target audio data;
the processing module is used for calling a voice conversion model to process the source audio data and the target voice style characteristics to obtain first predicted audio data;
the processing module is further configured to extract source text data from the source audio data, and call an initial speech synthesis model to process the source text data and the target speech style characteristics to obtain second predicted audio data;
the processing module is further configured to update parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data to train the initial speech synthesis model.
In another aspect, an embodiment of the present application provides an electronic device, where the electronic device includes a processor and a storage device, where the processor is connected to the storage device, where the storage device is configured to store computer program instructions, and the processor is configured to perform the following steps:
acquiring source audio data and target voice style characteristics, wherein the target voice style characteristics are characteristics corresponding to the voice style of the target audio data;
calling a voice conversion model to process the source audio data and the target voice style characteristics to obtain first predicted audio data;
extracting source text data in the source audio data, and calling an initial speech synthesis model to process the source text data and the target speech style characteristics to obtain second predicted audio data;
updating parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data to train the initial speech synthesis model.
In yet another aspect, an embodiment of the present application provides a computer-readable storage medium, in which computer program instructions are stored, and when executed by a processor, the computer program instructions are configured to perform the following steps:
acquiring source audio data and target voice style characteristics, wherein the target voice style characteristics are characteristics corresponding to the voice style of the target audio data;
calling a voice conversion model to process the source audio data and the target voice style characteristics to obtain first predicted audio data;
extracting source text data in the source audio data, and calling an initial speech synthesis model to process the source text data and the target speech style characteristics to obtain second predicted audio data;
updating parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data to train the initial speech synthesis model.
According to the embodiment of the application, source audio data and a target voice style feature can be obtained; a voice conversion model is called to process the source audio data and the target voice style feature to obtain first predicted audio data, so that audio data corresponding to the target voice style feature is obtained; an initial voice synthesis model is called to process source text data and the target voice style feature to obtain second predicted audio data; and parameters in the initial voice synthesis model are updated based on the first predicted audio data and the second predicted audio data, so as to train the initial voice synthesis model. In this way, the generalization capability of the trained speech synthesis model can be enhanced, and the speech style similarity of the synthesized audio data can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flow chart of a method for training a speech synthesis model according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a speech conversion model provided in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a speech conversion model provided in an embodiment of the present application;
FIG. 4 is a flowchart illustrating a method for training a speech conversion model according to an embodiment of the present application;
FIG. 5 is a schematic flowchart of an audio data acquisition method according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of an audio data obtaining method according to an embodiment of the present application;
FIG. 7 is a flowchart illustrating a method for training a speech synthesis model according to an embodiment of the present application;
FIG. 8 is a schematic flowchart of an audio data acquisition method according to an embodiment of the present application;
FIG. 9 is a schematic flowchart of an audio data obtaining method according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a speech synthesis model training apparatus according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The embodiment of the application provides a speech synthesis model training scheme, which can obtain source audio data and a target speech style feature; call a speech conversion model to process the source audio data and the target speech style feature to obtain first predicted audio data, so as to obtain audio data corresponding to the target speech style feature; call an initial speech synthesis model to process source text data and the target speech style feature to obtain second predicted audio data; and update parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data, so as to train the initial speech synthesis model. In this way, the generalization capability of the trained speech synthesis model can be enhanced, and the speech style similarity of the synthesized audio data can be improved.
In this application, audio data refers to digitized sound data, such as the source audio data, the target audio data, the sample audio data, the first predicted audio data or the second predicted audio data. A sound spectrum (acoustic spectrum) in this application describes how the acoustic energy of a sound is distributed and which components the sound contains, such as the first predicted sound spectrum or the second predicted sound spectrum. In the present application, the sound spectrum may be a mel spectrum, a spectrogram, a mel cepstrum and the like, which is not limited herein. Sound spectra such as mel spectra and spectrograms are essentially spectral sequences; in other words, they can be formed from multiple frames of frequency-domain data.
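For concreteness, the following is a minimal sketch (not part of the patent) of how a mel spectrum can be computed from a waveform; the frame length, hop size and number of mel bands are assumed values, and librosa is used only as one possible tool.

```python
# Minimal sketch: computing a log-mel spectrum from an audio file.
# n_fft, hop_length and n_mels are assumed, commonly used values.
import librosa
import numpy as np

def mel_spectrum(wav_path: str, sr: int = 16000, n_mels: int = 80) -> np.ndarray:
    """Return a (n_mels, n_frames) log-mel spectrum."""
    y, sr = librosa.load(wav_path, sr=sr)              # time-domain audio data
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels
    )                                                   # short-time Fourier transform + mel filter bank
    return np.log(mel + 1e-6)                           # log compression, a common choice for TTS targets
```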
In the present application, the voice style may refer to characteristics such as tone, accent and fundamental pitch in the audio data, which is not limited herein. The voice style is represented by a voice style feature; for example, the target voice style feature is the feature corresponding to the voice style of the target audio data, and the sample voice style feature is the feature corresponding to the voice style of the sample audio data. Optionally, a voice style feature may represent the corresponding audio data as a vector or a matrix, or it may represent, as a group of vectors or a matrix, the identifier of the object corresponding to the audio data, where the identifier indicates the voice style of that object's audio data.
In this application, rhythm describes characteristics such as the relative pitch of syllables and the intervals between them in the audio data, and is not limited herein. Rhythm is represented by a rhythm feature; for example, the source rhythm feature represents the feature corresponding to the rhythm of the source audio data, and the sample rhythm feature represents the feature corresponding to the rhythm of the sample audio data. Optionally, a rhythm feature may be represented by a vector or a matrix, which is not limited herein.
The technical scheme of the application can be applied to electronic equipment, the electronic equipment can be a terminal or a server, and the application is not limited. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, and a big data and artificial intelligence platform. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like.
The technical solution of the present application can be applied to the field of speech technology. Key speech technologies include automatic speech recognition (ASR), speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is a development direction of future human-computer interaction, and speech is expected to become one of the most convenient human-computer interaction modes in the future.
The technical solution of the present application can also be applied to intelligent education. For example, one or more virtual teachers can be trained through the model training method provided in the present application, so that the virtual teachers speak fluently and even speak multiple languages. Specifically, multi-language audio data with the target style feature can be generated through the voice conversion model of this solution, so that the one or more virtual teachers acquire multi-language speaking capability. The technical solution of the present application can likewise be applied to other fields related to speech synthesis, such as smart speakers, smart homes, audiobook software, virtual broadcasters, virtual tour guides, virtual teaching aids and intelligent customer service.
In a possible implementation, the technical solution of the present application may also be used to train a cross-language speech synthesis model. Specifically, the speech conversion model can be called to process source audio data in the target language and the target speech style feature to obtain first predicted audio data in the target language, and the initial speech synthesis model is then trained with the first predicted audio data, the source text data and the target speech style feature. For example, when a speaking object can only speak Chinese but English audio data in that object's speech style is needed, source audio data whose text content is English can be acquired; the speech conversion model is called to process the target speech style feature and this source audio data to obtain first predicted audio data with English text content, and after the initial speech synthesis model is trained, English audio data in the target speech style can be obtained. Similarly, by adjusting the type of the target language, speech synthesis models for more target languages can be obtained, such as a Cantonese model, a Sichuanese model or a Northeastern Mandarin model, thereby greatly enriching the language ability of a speaking object.
In a possible implementation, the technical solution of the present application can also be applied to changing one speech style into the speech style of another speaking object. Specifically, a text data feature corresponding to audio data in one speech style and the target speech style feature of another speech style can be obtained, and the speech conversion model is called to process the text data feature and the target speech style feature to obtain target predicted audio data, where the speech style of the target predicted audio data is the other speech style and the audio content is the content of the audio data in the original speech style. In this way, one speech style can be changed into the speech style of another speaking object.
It should be understood that the foregoing scenarios are only examples, and do not constitute a limitation on application scenarios of the technical solutions provided in the embodiments of the present application, and the technical solutions of the present application may also be applied to other scenarios. For example, as a person having ordinary skill in the art can know, with the evolution of the system architecture and the emergence of new service scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.
Based on the above description, the embodiment of the present application provides a speech synthesis model training method. Referring to fig. 1, fig. 1 is a schematic flow chart of a speech synthesis model training method according to an embodiment of the present application, where the speech synthesis model training method may include steps S101 to S104.
S101, acquiring source audio data and target voice style characteristics.
The source audio data may be obtained from various audio databases, or an audio data set may be collected from audio data of various objects and the source audio data determined from that set. The target speech style feature represents the feature corresponding to the speech style of the target audio data and may be represented by a vector or a matrix.
In one possible implementation, the target speech style feature may be obtained by inputting target audio data of indefinite length into a pre-trained speaker recognition network, which maps the variable-length target audio data to a fixed-length target vector; this target vector is then used as the target speech style feature. In some scenarios, this target vector is referred to as an x-vector.
Optionally, after the target vector corresponding to the target audio data is obtained, it may be further processed, for example by interpolating the target vector or modifying its values, so that a new target vector is created and used as the target speech style feature.
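As an illustrative sketch (not the patent's implementation), the following PyTorch code shows an x-vector-style mapping from a variable-length input to a fixed-length style vector via statistics pooling, together with the interpolation of two target vectors mentioned above; the layer layout, dimensions and the interpolation weight are assumptions.

```python
# Illustrative sketch of a fixed-length speaker/style embedding (assumed sizes).
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Maps a variable-length mel spectrum (B, T, 80) to a fixed-length style vector."""
    def __init__(self, n_mels: int = 80, emb_dim: int = 256):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Linear(n_mels, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.proj = nn.Linear(512 * 2, emb_dim)   # projects mean + std statistics

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        h = self.frame_net(mels)                                      # (B, T, 512) frame-level features
        stats = torch.cat([h.mean(dim=1), h.std(dim=1)], dim=-1)      # length-independent pooling
        return self.proj(stats)                                       # (B, emb_dim) target vector

def interpolate(v1: torch.Tensor, v2: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Create a new style vector by interpolating two target vectors (alpha is assumed)."""
    return alpha * v1 + (1.0 - alpha) * v2
```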
S102, calling a voice conversion model to process the source audio data and the target voice style characteristics to obtain first predicted audio data.
The voice conversion model can be used for converting the voice style of the source audio data into the voice style corresponding to the target voice style characteristic, so as to obtain first predicted audio data. In other words, the speech style of the first predicted audio data is the speech style corresponding to the target speech style characteristic, and the audio content of the first predicted audio data is the audio content indicated by the source audio data.
In one possible implementation, the speech conversion model may include a first encoding network and a first decoding network. The first encoding network is used to determine a source coding feature of the source audio data, where the source coding feature is a feature vector or a feature matrix of the source audio data extracted by the first encoding network. The first encoding network may be an expert system, that is, an intelligent computer program system that contains a large amount of expert-level knowledge and experience in a certain field and can decouple the content of the source audio data. The first decoding network is used to receive the source coding feature output by the first encoding network and obtain the target speech style feature, so as to produce the first predicted audio data. The first decoding network may be a sequence model, such as a trained recurrent neural network (RNN) or a long short-term memory (LSTM) network. For example, please refer to fig. 2, which is a schematic structural diagram of a speech conversion model provided in an embodiment of the present application: the first encoding network of the speech conversion model receives the source audio data and generates the source coding feature from it, and the source coding feature and the target speech style feature are used as the input of the first decoding network, so as to obtain the first predicted audio data.
In a possible implementation, calling the speech conversion model to process the source audio data and the target speech style feature to obtain the first predicted audio data may specifically include the following steps: acquiring a source rhythm feature of the source audio data, and determining the source coding feature of the source audio data through the first encoding network; processing the source coding feature, the source rhythm feature and the target speech style feature through the first decoding network to obtain a first predicted sound spectrum of the source audio data under the speech style of the target audio data; and constructing the first predicted audio data based on the first predicted sound spectrum. The source rhythm feature is the feature corresponding to the rhythm of the source audio data. The rhythm of the first predicted audio data is the rhythm corresponding to the source rhythm feature, and its speech style is the speech style corresponding to the target speech style feature. The first predicted sound spectrum may be a sound spectrum corresponding to the first predicted audio data, such as a mel spectrum, a spectrogram or a mel cepstrum. When the first predicted audio data is constructed based on the first predicted sound spectrum, the first predicted sound spectrum may be input into a neural network synthesizer to obtain the first predicted audio data; the neural network synthesizer is a neural network model trained on training samples and can generate audio data from a sound spectrum. For example, please refer to fig. 3, which is a schematic structural diagram of a speech conversion model provided in an embodiment of the present application: the first encoding network of the speech conversion model receives the source audio data and generates the source coding feature from it, and the source coding feature, the source rhythm feature and the target speech style feature are used as the input of the first decoding network, so as to obtain the first predicted audio data. Feeding the source rhythm feature into the first decoding network makes the sound in the generated first predicted audio data more stable, thereby improving the speech quality of the generated first predicted audio data.
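As a minimal sketch of the structure just described (not the patent's exact design: the GRU layers, the bottleneck, and all dimensions, including an assumed per-frame rhythm feature of width 4, are illustrative assumptions), the first encoding network extracts content features and the first decoding network combines them with rhythm and style features to predict a sound spectrum:

```python
# Minimal sketch of the speech conversion model structure (assumed layer types and sizes).
import torch
import torch.nn as nn

class FirstEncodingNetwork(nn.Module):
    """Extracts a content (source coding) feature sequence from the source sound spectrum."""
    def __init__(self, n_mels: int = 80, content_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, content_dim, batch_first=True, bidirectional=True)
        self.bottleneck = nn.Linear(2 * content_dim, content_dim)   # narrow layer, drops style information

    def forward(self, mels: torch.Tensor) -> torch.Tensor:          # (B, T, n_mels)
        h, _ = self.rnn(mels)
        return self.bottleneck(h)                                    # (B, T, content_dim) source coding feature

class FirstDecodingNetwork(nn.Module):
    """Predicts a sound spectrum from content, rhythm and target style features."""
    def __init__(self, content_dim=128, rhythm_dim=4, style_dim=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(content_dim + rhythm_dim + style_dim, 512, batch_first=True)
        self.out = nn.Linear(512, n_mels)

    def forward(self, content, rhythm, style):
        # style is (B, style_dim); broadcast it over every frame of the sequence
        style_seq = style.unsqueeze(1).expand(-1, content.size(1), -1)
        h, _ = self.rnn(torch.cat([content, rhythm, style_seq], dim=-1))
        return self.out(h)                                           # (B, T, n_mels) predicted sound spectrum
```

The predicted sound spectrum would then be passed to a neural network synthesizer to produce the first predicted audio data.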
In a possible implementation, before the source coding feature of the source audio data is determined through the first encoding network, the following steps may also be performed: obtaining sample audio data, a sample rhythm feature and a sample speech style feature, and determining a sample coding feature of the sample audio data through a first initial encoding network; processing the sample coding feature, the sample rhythm feature and the sample speech style feature through a first initial decoding network to obtain sample predicted audio data; updating parameters in the first initial encoding network and the first initial decoding network based on the sample audio data and the sample predicted audio data; and if the first initial encoding network and the first initial decoding network after the parameter update satisfy a first preset condition, determining the updated first initial encoding network as the first encoding network and the updated first initial decoding network as the first decoding network. The sample rhythm feature is the feature corresponding to the rhythm of the sample audio data, and the sample speech style feature is the feature corresponding to the speech style of the sample audio data. The first initial encoding network is used to determine the sample coding feature of the sample audio data, where the sample coding feature is a feature vector or a feature matrix of the sample audio data extracted by the first initial encoding network. The first initial encoding network may be an expert system, that is, an intelligent computer program system that contains a large amount of expert-level knowledge and experience in a certain field and can decouple the content of the sample audio data. The first initial decoding network is used to receive the sample coding feature output by the first initial encoding network and obtain the sample rhythm feature and the sample speech style feature, so as to produce the sample predicted audio data. The first preset condition indicates that the parameters and structures of the first initial encoding network and the first initial decoding network have reached a target satisfaction degree, which may mean that the matching degree between the generated sample predicted audio data and the sample audio data satisfies a condition. For example, a loss value between the sample predicted audio data and the sample audio data is determined by a loss function; if the loss value satisfies a preset loss value condition, the first initial encoding network and the first initial decoding network after the parameter update satisfy the first preset condition. The preset loss value condition may be that the loss value is smaller than a preset value, or that the change between the loss values obtained in successive parameter updates of the first initial encoding network and the first initial decoding network is smaller than a preset value. Optionally, the sample audio data or the sample predicted audio data may be represented by an audio signal, and determining the loss value between the sample predicted audio data and the sample audio data by the loss function can also be described as computing the loss value between the audio signals corresponding to the sample predicted audio data and the sample audio data.
The preset loss value condition, the preset value and the like may be set by the user who trains the model, and may be set differently for different models and scenarios, which is not limited herein.
For example, please refer to fig. 4, which is a schematic flowchart of a method for training a speech conversion model according to an embodiment of the present application. As shown in the figure, after the sample audio data is input into the first initial encoding network, the sample coding feature corresponding to the sample audio data is obtained; the first initial decoding network processes the sample coding feature, the sample rhythm feature and the sample speech style feature to obtain the sample predicted audio data; and the loss value between the sample predicted audio data and the sample audio data is determined by a loss function, so as to update the parameter values of the first initial encoding network and the first initial decoding network. The loss function is the objective to be minimized during model training, such as a mean squared error loss (MSE loss), a mean absolute error loss (MAE loss) or a cross-entropy loss, which is not limited herein. When the loss value between the sample predicted audio data and the sample audio data satisfies the preset loss value condition, the parameter update stops; the updated first initial encoding network is determined as the first encoding network, and the updated first initial decoding network as the first decoding network. The preset loss value condition may be that the loss value is smaller than a preset value, or that the change between the loss values obtained in successive parameter updates of the first initial encoding network and the first initial decoding network is smaller than a preset value, which is not limited herein. By training the first initial encoding network and the first initial decoding network on the sample audio data, the resulting speech conversion model can construct predicted audio data according to a speech style feature.
Optionally, after the sample coding feature is obtained through the first initial encoding network, the sample coding feature, the sample rhythm feature and the sample speech style feature may be processed through the first initial decoding network to obtain a sample predicted sound spectrum; a sample sound spectrum of the sample audio data is acquired, and the parameters in the first initial encoding network and the first initial decoding network are updated based on the sample sound spectrum and the sample predicted sound spectrum; and if the first initial encoding network and the first initial decoding network after the parameter update satisfy the first preset condition, the updated first initial encoding network is determined as the first encoding network and the updated first initial decoding network as the first decoding network. The sample predicted sound spectrum is the sound spectrum corresponding to the sample predicted audio data; in other words, the neural network synthesizer may be called to synthesize the sample predicted sound spectrum into the sample predicted audio data. Here the first preset condition indicates that the parameters and structures of the two networks have reached the target satisfaction degree, which may mean that the matching degree between the generated sample predicted sound spectrum and the sample sound spectrum satisfies a condition; for example, a loss value between the sample predicted sound spectrum and the sample sound spectrum is determined by a loss function, and if the loss value satisfies the preset loss value condition, the first preset condition is satisfied. The preset loss value condition may be that the loss value is smaller than a preset value, or that the change between the loss values obtained in successive parameter updates is smaller than a preset value, which is not limited herein. The preset loss value condition or the preset value may be set by the user who trains the model and may differ for different models and scenarios, which is not limited herein.
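A minimal training-loop sketch for these two networks might look as follows; it assumes the encoder/decoder sketches above, a data loader yielding (sample sound spectrum, sample rhythm feature, sample speech style feature) tuples, an MSE loss on sound spectra, and an assumed loss threshold as the first preset condition.

```python
# Minimal training-loop sketch for the first initial encoding/decoding networks (assumed setup).
import torch
import torch.nn as nn

def train_conversion_model(encoder, decoder, dataloader, max_steps=100000, threshold=1e-3):
    criterion = nn.MSELoss()                                 # loss function to be minimized
    optim = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4
    )
    for step, (sample_spec, rhythm, style) in enumerate(dataloader):
        coding = encoder(sample_spec)                        # sample coding feature
        pred_spec = decoder(coding, rhythm, style)           # sample predicted sound spectrum
        loss = criterion(pred_spec, sample_spec)             # compare with the sample sound spectrum
        optim.zero_grad()
        loss.backward()
        optim.step()
        if loss.item() < threshold or step >= max_steps:     # assumed form of the first preset condition
            break
    return encoder, decoder                                  # used as the first encoding/decoding networks
```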
In one possible implementation, determining the source coding feature of the source audio data through the first encoding network may include the following steps: extracting, through the first encoding network, at least one frame of audio included in the source audio data and the phoneme corresponding to each frame of audio, and constructing a source speech posterior probability matrix based on the at least one frame of audio and the phoneme corresponding to each frame of audio; and performing dimension compression on the source speech posterior probability matrix through the first encoding network to obtain the source coding feature of the source audio data. The horizontal axis of the speech posterior probability matrix represents the number of frames in the time dimension, and the vertical axis represents the category of each symbol, which may be a phoneme category. The speech posterior probability matrix may also be referred to as a PPG matrix (Phonetic PosteriorGrams). For example, if the content corresponding to a segment of source audio data is "v3 in1 h e2 ch eng2", at least one frame of audio corresponding to the source audio data is extracted, and the phonemes corresponding to the frames of audio are: "v3", "in1", "h", "e2", "ch" and "eng2". The audio of each frame can then be represented as a column vector, and the column vectors are ordered in time sequence, so that the PPG matrix is obtained. The dimension compression of the source speech posterior probability matrix through the first encoding network may be compression through a bottleneck layer of the first encoding network, so that the speech style information in the posterior probability matrix is discarded when the first encoding network encodes the source audio data, and the dimension of the PPG matrix is compressed.
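As an illustrative sketch of the PPG representation and the bottleneck compression (the phoneme inventory size, frame count and bottleneck width are assumed, and the random matrix stands in for the output of a phoneme recognizer):

```python
# Illustrative sketch: shape of a PPG matrix and its bottleneck compression (assumed sizes).
import torch
import torch.nn as nn

n_frames, n_phonemes, bottleneck_dim = 320, 218, 64

# Each row is one frame; each column is the posterior probability of one phoneme class.
ppg = torch.softmax(torch.randn(n_frames, n_phonemes), dim=-1)    # stand-in for a recognizer's output

bottleneck = nn.Linear(n_phonemes, bottleneck_dim)                # dimension compression layer
source_coding_feature = bottleneck(ppg)                           # (n_frames, bottleneck_dim)
print(source_coding_feature.shape)                                # torch.Size([320, 64])
```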
S103, extracting source text data in the source audio data, and calling the initial voice synthesis model to process the source text data and the target voice style characteristics to obtain second predicted audio data.
The source text data in the source audio data is used to indicate the linguistic features of the corresponding text content in the source audio data; these linguistic features include, but are not limited to, Chinese phonemes, English phonemes, Chinese vowels, tone, word boundaries, phrase boundaries and sentence boundaries. The source text data is generally expressed as a phoneme sequence, where a phoneme is the smallest speech unit divided from the perspective of sound quality; a Chinese phoneme sequence is generally composed of initials, finals and silence, where the finals carry the tone. For example, if the corresponding text content in the source audio data is "speech synthesis" (语音合成 in Chinese), the tonal phoneme sequence corresponding to the text content is "v3 in1 h e2 ch eng2", so the source text data is "v3 in1 h e2 ch eng2". The initial speech synthesis model may be a variant of the Tacotron 2 system or another neural network model, which is not limited herein.
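As a toy sketch of how such a tonal initial/final sequence can be produced from text (the lexicon below is a hand-written stand-in for a real grapheme-to-phoneme front end such as a pinyin converter; its entries simply reproduce the example above):

```python
# Toy grapheme-to-phoneme sketch; the lexicon is an illustrative stand-in, not a real front end.
TOY_LEXICON = {
    "语": ["v3"], "音": ["in1"], "合": ["h", "e2"], "成": ["ch", "eng2"],
}

def text_to_phonemes(text: str) -> str:
    phones = []
    for ch in text:
        phones.extend(TOY_LEXICON.get(ch, ["<unk>"]))   # unknown characters map to a placeholder
    return " ".join(phones)

print(text_to_phonemes("语音合成"))   # -> "v3 in1 h e2 ch eng2" ("speech synthesis")
```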
In a possible implementation, the initial speech synthesis model includes a second initial encoding network and a second initial decoding network, and calling the initial speech synthesis model to process the source text data and the target speech style feature may include the following steps: determining a text coding feature of the source text data through the second initial encoding network; processing the text coding feature and the target speech style feature through the second initial decoding network to obtain a second predicted sound spectrum of the source text data under the speech style of the target audio data; and constructing the second predicted audio data based on the second predicted sound spectrum. The second initial encoding network is used to determine the text coding feature of the source text data, where the text coding feature is a feature vector or a feature matrix of the source text data extracted by the second initial encoding network. The second initial encoding network converts the input phoneme sequence (i.e. the source text data) into context-related abstract text features; it is usually composed of models with context dependency, such as a bidirectional recurrent neural network or a convolutional neural network, so the text coding feature it produces carries context information. The second initial decoding network is used to receive the text coding feature output by the second initial encoding network and the target speech style feature, so as to obtain the second predicted sound spectrum. The second predicted sound spectrum may be a sound spectrum corresponding to the second predicted audio data, such as a mel spectrum, a spectrogram or a mel cepstrum. When the second predicted audio data is constructed based on the second predicted sound spectrum, the second predicted sound spectrum may be input into the neural network synthesizer to obtain the second predicted audio data.
For example, please refer to fig. 5, which is a schematic flowchart of an audio data obtaining method according to an embodiment of the present application. As shown in the figure, after the source text data is input into the second initial encoding network, the text coding feature corresponding to the source text data is obtained; the second initial decoding network processes the text coding feature and the target speech style feature to obtain the second predicted sound spectrum, and the second predicted audio data is then determined from the second predicted sound spectrum.
In one possible implementation, the initial speech synthesis model may include a second initial encoding network, an initial attention mechanism and a second initial decoding network, and calling the initial speech synthesis model to process the source text data and the target speech style feature may include the following steps: determining a text coding feature of the source text data through the second initial encoding network; inputting the text coding feature and the target speech style feature into the initial attention mechanism; processing the output result of the initial attention mechanism through the second initial decoding network to obtain a second predicted sound spectrum of the source text data under the speech style of the target audio data; and constructing the second predicted audio data based on the second predicted sound spectrum. The second initial encoding network is used to determine the text coding feature of the source text data. The second predicted audio data is usually synthesized through a monotonic mapping from the phoneme sequence (i.e. the source text data) to a spectral sequence, and the spectral sequence constitutes the second predicted sound spectrum. Therefore, when each frame of the second predicted sound spectrum is generated, only the phoneme content corresponding to that frame is needed; the initial attention mechanism provides the phoneme content corresponding to each frame of spectrum to be generated, so that the second initial decoding network can better predict the next frame of spectrum and thereby obtain the second predicted sound spectrum. The second initial decoding network is used to receive the output result of the initial attention mechanism and process it to obtain the second predicted sound spectrum. For example, the second initial decoding network receives the output of the initial attention mechanism at time t0 together with the predicted spectrum of the frame before t0 (i.e. the previous frame's predicted spectrum) to predict the spectrum at time t0; in other words, each frame's predicted spectrum is generated conditioned on the previous frame's predicted spectrum, and the frame-level predicted spectra are then arranged in time order to obtain the second predicted sound spectrum. The second predicted audio data may be constructed from the second predicted sound spectrum by the neural network synthesizer; for example, the neural network synthesizer processes each frame of frequency-domain data of the second predicted sound spectrum to obtain the corresponding frames of time-domain data, and the second predicted audio data is synthesized from these frames of time-domain data. For example, please refer to fig. 6, which is a schematic flowchart of an audio data obtaining method according to an embodiment of the present application. As shown in the figure, after the source text data is input into the second initial encoding network, the text coding feature corresponding to the source text data is obtained; the target speech style feature and the text coding feature are then input into the initial attention mechanism to obtain its output result, which is processed by the second initial decoding network to obtain the second predicted sound spectrum; the second predicted audio data is then determined from the second predicted sound spectrum.
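A heavily simplified sketch of such an encoder / attention / autoregressive-decoder acoustic model follows (loosely Tacotron 2-like; the module choices, multi-head attention, and all sizes are assumptions rather than the patent's exact design):

```python
# Heavily simplified encoder-attention-decoder sketch (assumed modules and sizes).
import torch
import torch.nn as nn

class SecondInitialEncodingNetwork(nn.Module):
    """Phoneme sequence -> context-dependent text coding features."""
    def __init__(self, n_phonemes=100, dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        self.rnn = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids):                              # (B, L) integer phoneme ids
        h, _ = self.rnn(self.embed(phoneme_ids))
        return h                                                 # (B, L, dim) text coding features

class SecondInitialDecodingNetwork(nn.Module):
    """Predicts the sound spectrum one frame at a time, conditioned on the previous frame."""
    def __init__(self, dim=256, style_dim=256, n_mels=80):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.style_proj = nn.Linear(style_dim, dim)
        self.prenet = nn.Linear(n_mels, dim)
        self.rnn_cell = nn.GRUCell(dim * 2, dim)
        self.out = nn.Linear(dim, n_mels)

    def forward(self, text_feats, style, n_frames):
        B, dim = text_feats.size(0), text_feats.size(2)
        memory = text_feats + self.style_proj(style).unsqueeze(1)        # inject the target style
        prev = torch.zeros(B, self.out.out_features, device=text_feats.device)  # "go" frame
        state = torch.zeros(B, dim, device=text_feats.device)
        frames = []
        for _ in range(n_frames):                                        # one spectrum frame per step
            query = self.prenet(prev).unsqueeze(1)
            context, _ = self.attn(query, memory, memory)                # phoneme content for this frame
            state = self.rnn_cell(
                torch.cat([query.squeeze(1), context.squeeze(1)], dim=-1), state
            )
            prev = self.out(state)                                       # predicted spectrum frame
            frames.append(prev)
        return torch.stack(frames, dim=1)                                # (B, n_frames, n_mels)
```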
And S104, updating parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data so as to train the initial speech synthesis model.
The updating of the initial speech synthesis model may update parameters in the initial speech synthesis model according to a loss value between the first predicted audio data and the second predicted audio data, or update parameters in the initial speech synthesis model according to a loss value between a first predicted sound spectrum corresponding to the first predicted audio data and a second predicted sound spectrum corresponding to the second predicted audio data, which is not limited here.
In a possible implementation, updating the parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data may include the following steps: calling a target loss function to determine a target loss value between the first predicted audio data and the second predicted audio data; and if the target loss value satisfies the preset loss value condition, updating the parameters in the initial speech synthesis model based on the target loss value. The loss function is the objective to be minimized during model training, such as a mean squared error loss (MSE loss), a mean absolute error loss (MAE loss) or a cross-entropy loss, which is not limited herein. Here the preset loss value condition may be that the target loss value is greater than a preset value, or that the change between the target loss values obtained in successive parameter updates of the initial speech synthesis model is greater than a preset value, which is not limited herein. Optionally, the first predicted audio data or the second predicted audio data may be represented by an audio signal, and calculating the target loss value between the first predicted audio data and the second predicted audio data can also be described as calculating the loss value between the audio signals corresponding to the first predicted audio data and the second predicted audio data.
In a possible implementation, updating the parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data may include the following steps: acquiring the first predicted sound spectrum corresponding to the first predicted audio data and the second predicted sound spectrum corresponding to the second predicted audio data; calling a target loss function to determine a target loss value between the first predicted sound spectrum and the second predicted sound spectrum; and if the target loss value satisfies the preset loss value condition, updating the parameters in the initial speech synthesis model based on the target loss value. For example, when the predicted sound spectrum (the first predicted sound spectrum or the second predicted sound spectrum) of the audio data (the first predicted audio data or the second predicted audio data) to be obtained is a mel spectrum, the time-domain information of the audio data may be converted into frequency-domain information through a Fourier transform, and the frequency-domain information is mapped onto the mel scale to obtain the mel spectrum. The loss function is the objective to be minimized during model training, such as a mean squared error loss (MSE loss), a mean absolute error loss (MAE loss) or a cross-entropy loss, which is not limited herein. The preset loss value condition may be that the target loss value is greater than a preset value, or that the change between the target loss values obtained in successive parameter updates of the initial speech synthesis model is greater than a preset value, which is not limited herein.
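A minimal sketch of this update step follows; it assumes an MSE loss between the two predicted mel spectra with matching frame counts, an Adam optimizer over the synthesis model only, and a frozen conversion model (all function and module names are illustrative):

```python
# Minimal sketch of one parameter update of the initial speech synthesis model (assumed setup).
import torch
import torch.nn.functional as F

def synthesis_training_step(conversion_model, synthesis_model, optimizer,
                            source_audio, source_text, target_style):
    with torch.no_grad():                                                   # conversion model stays frozen
        first_pred_spec = conversion_model(source_audio, target_style)      # first predicted sound spectrum
    second_pred_spec = synthesis_model(source_text, target_style)           # second predicted sound spectrum
    # Assumes both spectra have the same number of frames.
    target_loss = F.mse_loss(second_pred_spec, first_pred_spec)             # target loss value
    optimizer.zero_grad()
    target_loss.backward()
    optimizer.step()                                                        # updates only the synthesis model
    return target_loss.item()
```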
According to the embodiment of the application, source audio data and a target voice style feature can be obtained; the voice conversion model is called to process the source audio data and the target voice style feature to obtain first predicted audio data, so that audio data corresponding to the target voice style feature is obtained; the initial voice synthesis model is called to process the source text data and the target voice style feature to obtain second predicted audio data; and parameters in the initial voice synthesis model are updated based on the first predicted audio data and the second predicted audio data, so as to train the initial voice synthesis model. In this way, the generalization capability of the trained speech synthesis model can be enhanced, and the speech style similarity of the synthesized audio data can be improved.
Fig. 7 is a schematic flowchart of a speech synthesis model training method according to an embodiment of the present application. The speech synthesis model training method may be performed by an electronic device. As shown in fig. 7, the speech synthesis model training method may include the following steps S701 to S707.
S701, acquiring source audio data and target voice style characteristics.
S702, calling a voice conversion model to process the source audio data and the target voice style characteristics to obtain first predicted audio data.
S703, extracting source text data in the source audio data, and calling an initial speech synthesis model to process the source text data and the target speech style characteristics to obtain second predicted audio data.
And S704, updating parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data so as to train the initial speech synthesis model.
Steps S701 to S704 may correspond to the related descriptions of steps S101 to S104, which are not repeated here.
It can be understood that steps S701 to S704 (i.e. steps S101 to S104) describe only one round of training in the training process of the speech synthesis model. When the updated initial speech synthesis model does not satisfy the second preset condition, the parameters in the initial speech synthesis model are updated again according to the first predicted audio data and the second predicted audio data. Optionally, another piece of first predicted audio data may be obtained as in step S702, and the parameters of the updated initial speech synthesis model are then updated according to this first predicted audio data, the corresponding source text data and the target speech style feature; that is, steps S702 to S704 are repeated until the updated initial speech synthesis model satisfies the second preset condition.
S705, if the initial speech synthesis model after the parameter updating meets a second preset condition, determining the initial speech synthesis model after the parameter updating as a speech synthesis model.
The second preset condition indicates that the parameters and structure of the initial speech synthesis model have reached a target satisfaction degree, which may mean that the matching degree between the second predicted audio data and the first predicted audio data satisfies a condition. For example, a target loss value between the second predicted audio data and the first predicted audio data is determined by a loss function; if the target loss value satisfies a preset loss value condition, the initial speech synthesis model satisfies the second preset condition. The preset loss value condition may be that the target loss value between the second predicted audio data and the first predicted audio data generated by the initial speech synthesis model is smaller than a preset value, or that the change in this target loss value between successive parameter updates is smaller than a preset value.
Optionally, the target satisfaction degree may also mean that the matching degree between the second predicted sound spectrum and the first predicted sound spectrum satisfies a condition. For example, a target loss value between the second predicted sound spectrum and the first predicted sound spectrum is determined by a loss function; if the target loss value satisfies the preset loss value condition, the initial speech synthesis model satisfies the second preset condition. The preset loss value condition may be that the target loss value between the second predicted sound spectrum generated by the initial speech synthesis model and the first predicted sound spectrum is smaller than a preset value, or that the change in this target loss value between successive parameter updates is smaller than a preset value.
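A small sketch of this stopping check (the threshold values and the use of both criteria together are assumptions):

```python
# Illustrative check of the second preset condition (threshold values are assumed).
def second_preset_condition_met(loss_history, loss_threshold=1e-3, change_threshold=1e-5):
    """loss_history: target loss values recorded after each parameter update."""
    if not loss_history:
        return False
    if loss_history[-1] < loss_threshold:                 # loss below a preset value
        return True
    if len(loss_history) >= 2 and abs(loss_history[-1] - loss_history[-2]) < change_threshold:
        return True                                       # change between updates below a preset value
    return False
```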
In a possible embodiment, if the initial speech synthesis model includes the second initial encoding network and the second initial decoding network, the updated second initial encoding network may be determined as the second encoding network, and the updated second initial decoding network as the second decoding network.
In a possible embodiment, if the initial speech synthesis model includes the second initial encoding network, the initial attention mechanism and the second initial decoding network, the updated second initial encoding network may be determined as the second encoding network, the initial attention mechanism as the attention mechanism, and the updated second initial decoding network as the second decoding network.
And S706, acquiring text data to be synthesized.
The text data to be synthesized can be represented in the form of a phoneme sequence. For example, a piece of text content, such as "speech synthesis" may be obtained, and then the text content is processed, so as to obtain a phoneme sequence "v3 in1 h e2 ch eng2" corresponding to the text content.
And S707, calling the speech synthesis model to process the text data to be synthesized and the target voice style characteristics to obtain target predicted audio data.
The voice style of the target predicted audio data is the voice style indicated by the target voice style characteristic, and the audio content of the target predicted audio data is the content indicated by the text data to be synthesized. The speech synthesis model may include a second encoding network and a second decoding network, or it may include a second encoding network, an attention mechanism and a second decoding network.
In a possible implementation, the speech synthesis model may include a second encoding network and a second decoding network, and calling the speech synthesis model to process the text data to be synthesized and the target speech style feature to obtain the target predicted audio data may include the following steps: determining a text coding feature to be synthesized of the text data to be synthesized through the second encoding network; processing the text coding feature to be synthesized and the target speech style feature through the second decoding network to obtain a target predicted sound spectrum of the text data to be synthesized under the speech style of the target audio data; and constructing the target predicted audio data based on the target predicted sound spectrum. The second encoding network is used to determine the text coding feature to be synthesized, which is a feature vector or a feature matrix of the text data to be synthesized extracted by the second encoding network. The second decoding network is used to receive the text coding feature to be synthesized output by the second encoding network and the target speech style feature, so as to obtain the target predicted sound spectrum. The target predicted sound spectrum may be a sound spectrum corresponding to the target predicted audio data, such as a mel spectrum, a spectrogram or a mel cepstrum. When the target predicted audio data is constructed based on the target predicted sound spectrum, the target predicted sound spectrum may be input into the neural network synthesizer to obtain the target predicted audio data. For example, please refer to fig. 8, which is a schematic flowchart of an audio data obtaining method provided in an embodiment of the present application. As shown in the figure, after the text data to be synthesized is input into the second encoding network, the text coding feature to be synthesized corresponding to the text data to be synthesized is obtained; the second decoding network processes the text coding feature to be synthesized and the target speech style feature to obtain the target predicted sound spectrum, and the target predicted audio data is then determined from the target predicted sound spectrum. The speech style of the target predicted audio data is the speech style indicated by the target speech style feature, and its audio content is the content indicated by the text data to be synthesized.
In another possible implementation, the speech synthesis model may include a second encoding network, an attention mechanism, and a second decoding network. In this case, calling the speech synthesis model to process the text data to be synthesized and the target speech style feature to obtain the target predicted audio data may include the following steps: determining the text encoding features to be synthesized of the text data to be synthesized through the second encoding network; inputting the text encoding features to be synthesized and the target speech style feature into the attention mechanism; processing the output result of the attention mechanism through the second decoding network to obtain a target predicted sound spectrum of the text data to be synthesized under the speech style of the target audio data; and constructing the target predicted audio data based on the target predicted sound spectrum. The second encoding network is used for determining the text encoding features to be synthesized, which are a feature vector or feature matrix of the text data to be synthesized extracted by the second encoding network. The attention mechanism receives the text encoding features to be synthesized output by the second encoding network together with the target speech style feature and produces an output result; the output result indicates, for each spectrum frame to be generated, the phoneme content that the frame should correspond to, so that the second decoding network can better predict the next frame of the spectrum. The second decoding network receives the output result of the attention mechanism and outputs the target predicted sound spectrum. The target predicted sound spectrum may be any sound spectrum corresponding to the target predicted audio data, such as a mel spectrum, a spectrogram, or a mel cepstrum; when the target predicted audio data is constructed based on the target predicted sound spectrum, the target predicted sound spectrum can be input into a neural network synthesizer to obtain the target predicted audio data. For example, referring to fig. 9, which is a schematic flowchart of an audio data obtaining method provided in an embodiment of the present application, after the text data to be synthesized is input into the second encoding network, the corresponding text encoding features to be synthesized are obtained; the target speech style feature and the text encoding features to be synthesized are then input into the attention mechanism, the second decoding network processes the output result of the attention mechanism to obtain the target predicted sound spectrum, and the target predicted audio data is determined from the target predicted sound spectrum. The speech style of the target predicted audio data is the speech style indicated by the target speech style feature, and its audio content is the content indicated by the text data to be synthesized.
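The only structural difference from the previous sketch is the attention mechanism between the second encoding network and the second decoding network. The short sketch below uses torch.nn.MultiheadAttention as a stand-in; the query construction is a deliberate simplification (in an autoregressive decoder the query would normally come from the previous decoder state rather than from the style vector), and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

# Stand-in for the attention mechanism: for each spectrum frame to be generated,
# a query attends over the text encoding features so the decoder knows which
# phoneme content that frame should realize.
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

text_feat = torch.randn(1, 6, 256)    # text encoding features (batch, text_len, dim)
style = torch.randn(1, 1, 256)        # target speech style feature projected to dim (toy query source)
query = style.expand(-1, 50, -1)      # one query per frame to generate (simplification)

context, weights = attn(query, text_feat, text_feat)   # (batch, frames, dim)
# `context` would then be fed to the second decoding network to predict each spectrum frame.
```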
According to the method and the device described in this embodiment, source audio data and a target speech style feature can be obtained; a voice conversion model is called to process the source audio data and the target speech style feature to obtain first predicted audio data; source text data is extracted from the source audio data, and an initial speech synthesis model is called to process the source text data and the target speech style feature to obtain second predicted audio data; parameters in the initial speech synthesis model are updated based on the first predicted audio data and the second predicted audio data to train the initial speech synthesis model; when the initial speech synthesis model after parameter updating meets the second preset condition, it is determined as the speech synthesis model; the text data to be synthesized is then acquired, and the speech synthesis model is called to process the text data to be synthesized and the target speech style feature to obtain the target predicted audio data. In this way, the generalization capability of the trained speech synthesis model can be enhanced, and the speech style similarity of the synthesized target predicted audio data can be improved.
Based on the description of the foregoing embodiments of the method for training a speech synthesis model, an embodiment of the present application further discloses a device for training a speech synthesis model, which can be configured in the foregoing electronic device; for example, the device can be a computer program (including program code) running in the electronic device. The apparatus may perform the methods shown in fig. 1 and fig. 7. Referring to fig. 10, the apparatus may operate as follows:
an obtaining module 1001, configured to obtain source audio data and a target voice style feature, where the target voice style feature is a feature corresponding to a voice style of the target audio data;
the processing module 1002 is configured to invoke a speech conversion model to process the source audio data and the target speech style characteristics, so as to obtain first predicted audio data;
the processing module 1002 is further configured to extract source text data in the source audio data, and call an initial speech synthesis model to process the source text data and the target speech style characteristics to obtain second predicted audio data;
the processing module 1002 is further configured to update parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data, so as to train the initial speech synthesis model.
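A hedged sketch of how the obtaining module 1001 and the processing module 1002 described above could fit together in one training step is given below. Every callable passed in (voice_conversion_model, extract_source_text, initial_tts_model, update_parameters) is a hypothetical placeholder for the corresponding component, and treating the voice conversion model as a frozen teacher is likewise an assumption.

```python
import torch

def training_step(source_audio, target_style, voice_conversion_model,
                  initial_tts_model, extract_source_text, update_parameters):
    """One training step mirroring modules 1001/1002; every callable is a placeholder."""
    # Obtaining module 1001: source audio data and the target speech style feature are the inputs.
    # Processing module 1002, step 1: voice conversion model -> first predicted audio data.
    with torch.no_grad():  # treating the voice conversion model as a fixed teacher is an assumption
        first_pred = voice_conversion_model(source_audio, target_style)

    # Processing module 1002, step 2: extract the source text data (e.g. a phoneme sequence).
    source_text = extract_source_text(source_audio)

    # Processing module 1002, step 3: initial speech synthesis model -> second predicted audio data.
    second_pred = initial_tts_model(source_text, target_style)

    # Processing module 1002, step 4: update the initial model based on the two predictions
    # (a concrete loss/update sketch appears after the loss-function embodiment further below).
    return update_parameters(first_pred, second_pred)
```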
In an embodiment, the speech conversion model includes a first encoding network and a first decoding network, and the processing module 1002 is configured to, when the speech conversion model is called to process the source audio data and the target speech style feature to obtain first predicted audio data, specifically:
obtaining source rhythm characteristics in the source audio data, and determining the source coding characteristics of the source audio data through a first coding network, wherein the source rhythm characteristics are characteristics corresponding to the rhythm in the source audio data;
processing the source coding feature, the source rhythm feature and the target voice style feature through a first decoding network to obtain a first prediction sound spectrum of the source audio data under the voice style of the target audio data;
and constructing first predicted audio data based on the first predicted sound spectrum.
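A minimal sketch of the first decoding network step described above, assuming frame-level source coding and rhythm features and an utterance-level style vector; the class name, layer types, and all dimensions are illustrative assumptions rather than the patent's architecture.

```python
import torch
import torch.nn as nn

class FirstDecodingNetwork(nn.Module):
    """Toy first decoding network: source coding + rhythm + style -> predicted sound spectrum."""
    def __init__(self, code_dim=128, rhythm_dim=4, style_dim=64, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(code_dim + rhythm_dim + style_dim, 256, batch_first=True)
        self.out = nn.Linear(256, n_mels)

    def forward(self, source_codes, source_rhythm, target_style):
        # Broadcast the utterance-level style vector to every frame (assumption).
        style = target_style.unsqueeze(1).expand(-1, source_codes.size(1), -1)
        hidden, _ = self.rnn(torch.cat([source_codes, source_rhythm, style], dim=-1))
        return self.out(hidden)                 # first predicted sound spectrum (batch, frames, n_mels)

decoder = FirstDecodingNetwork()
codes = torch.randn(1, 120, 128)    # source coding features from the first encoding network
rhythm = torch.randn(1, 120, 4)     # per-frame source rhythm features (e.g. pitch/energy; assumption)
style = torch.randn(1, 64)          # target speech style feature
first_spec = decoder(codes, rhythm, style)
# The first predicted audio data would then be constructed from first_spec with a vocoder.
```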
In one embodiment, before determining the source coding characteristics of the source audio data through the first coding network, the processing module 1002 is further configured to:
acquiring sample audio data, sample rhythm characteristics and sample voice style characteristics, wherein the sample rhythm characteristics are characteristics corresponding to rhythms in the sample audio data, and the sample voice style characteristics are characteristics corresponding to voice styles of the sample audio data;
determining, by a first initial coding network, sample coding characteristics of the sample audio data;
processing the sample coding features, the sample rhythm features and the sample voice style features through a first initial decoding network to obtain sample prediction audio data;
updating parameters in the first initial encoding network and the first initial decoding network based on the sample audio data and the sample prediction audio data;
and if the first initial coding network after the parameter updating and the first initial decoding network after the parameter updating meet a first preset condition, determining the first initial coding network after the parameter updating as a first coding network, and determining the first initial decoding network after the parameter updating as a first decoding network.
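The pretraining of the first initial encoding and decoding networks could look roughly like the loop below. Comparing predicted and reference sound spectra (rather than waveforms), using an L1 reconstruction loss, and modelling the "first preset condition" as a simple loss threshold are all assumptions; the patent leaves these details open, and the data loader format is hypothetical.

```python
import torch
import torch.nn.functional as F

def pretrain_voice_conversion(first_init_encoder, first_init_decoder, loader,
                              optimizer, loss_threshold=0.05, max_steps=100_000):
    """Toy pretraining loop for the first initial encoding/decoding networks (all assumptions)."""
    step = 0
    for sample_spec, sample_rhythm, sample_style in loader:       # sample audio spectrum + its features
        sample_codes = first_init_encoder(sample_spec)            # sample coding features
        pred_spec = first_init_decoder(sample_codes, sample_rhythm, sample_style)
        loss = F.l1_loss(pred_spec, sample_spec)                  # compare against the sample audio itself
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if loss.item() < loss_threshold or step >= max_steps:     # stand-in for the first preset condition
            break
    return first_init_encoder, first_init_decoder                 # now the first encoding/decoding networks
```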
In an embodiment, the initial speech synthesis model includes a second initial coding network and a second initial decoding network, and the processing module 1002 is configured to invoke the initial speech synthesis model to process the source text data and the target speech style feature to obtain second predicted audio data, and specifically configured to:
determining text encoding characteristics of the source text data through a second initial encoding network;
processing the text coding features and the target voice style features through a second initial decoding network to obtain a second predicted sound spectrum of the source text data under the voice style of the target audio data;
and constructing second prediction audio data based on the second prediction sound spectrum.
In an embodiment, the processing module 1002 is configured to, when updating the parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data, specifically:
acquiring a first prediction sound spectrum corresponding to the first prediction audio data and a second prediction sound spectrum corresponding to the second prediction audio data;
calling a target loss function to determine a target loss value between the first predicted sound spectrum and the second predicted sound spectrum;
and if the target loss value meets a preset loss value condition, updating parameters in the initial speech synthesis model based on the target loss value.
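A minimal sketch of this parameter update, assuming the target loss function is an L1 distance between the two predicted sound spectra and the "preset loss value condition" is a simple bound on the loss value; both choices are assumptions, since the patent does not fix them. With the optimizer bound separately (for example via functools.partial), this could serve as the update_parameters placeholder used in the earlier training-step sketch.

```python
import torch
import torch.nn.functional as F

def update_parameters(first_pred_spec, second_pred_spec, optimizer, max_loss=10.0):
    """Update the initial speech synthesis model from the two predicted sound spectra.
    L1 as the target loss function and the bound below as the preset loss value
    condition are assumptions; only the speech synthesis model receives gradients here."""
    target_loss = F.l1_loss(second_pred_spec, first_pred_spec.detach())
    if target_loss.item() <= max_loss:          # stand-in for the preset loss value condition
        optimizer.zero_grad()
        target_loss.backward()
        optimizer.step()
    return target_loss.item()
```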
In an embodiment, when determining the source coding characteristics of the source audio data through the first coding network, the processing module 1002 is specifically configured to:
extracting at least one frame of audio included in the source audio data and phonemes corresponding to each frame of audio through the first coding network, and constructing a source audio posterior probability matrix based on the at least one frame of audio and the phonemes corresponding to each frame of audio;
and performing dimension compression on the source audio posterior probability matrix through the first coding network to obtain the source coding characteristics of the source audio data.
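The sketch below illustrates one way the source audio posterior probability matrix could be built and compressed: a per-frame phoneme classifier yields a probability distribution over phonemes for each audio frame, and a linear layer then reduces the dimension to give the source coding features. The classifier, the layer sizes, and the phoneme inventory size are assumptions for illustration.

```python
import torch
import torch.nn as nn

n_mels, n_phonemes, code_dim = 80, 100, 128

# Per-frame phoneme classifier (assumption: any acoustic model producing per-frame
# phoneme probabilities could be used in its place).
frame_classifier = nn.Linear(n_mels, n_phonemes)
# Dimension compression of the posterior matrix into the source coding features.
compressor = nn.Linear(n_phonemes, code_dim)

source_spec = torch.randn(1, 120, n_mels)                           # 120 frames of source audio features
posterior_matrix = frame_classifier(source_spec).softmax(dim=-1)    # (1, 120, n_phonemes), each row sums to 1
source_codes = compressor(posterior_matrix)                         # (1, 120, code_dim) source coding features
```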
In one embodiment, the processing module 1002 is configured to, after the updating of the parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data, further:
if the initial speech synthesis model after the parameter updating meets a second preset condition, determining the initial speech synthesis model after the parameter updating as a speech synthesis model;
acquiring text data to be synthesized;
and calling the voice synthesis model to process the text data to be synthesized and the target voice style characteristics to obtain target prediction audio data.
The functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module may be implemented in the form of hardware, or in the form of a software functional module, which is not limited in this application.
According to the embodiments of the present application, the source audio data and the target voice style characteristics can be obtained; the voice conversion model is called to process the source audio data and the target voice style characteristics to obtain first predicted audio data, that is, audio data corresponding to the target voice style characteristics; source text data is extracted from the source audio data, and the initial speech synthesis model is called to process the source text data and the target voice style characteristics to obtain second predicted audio data; and parameters in the initial speech synthesis model are updated based on the first predicted audio data and the second predicted audio data to train the initial speech synthesis model. In this way, the generalization capability of the speech synthesis model obtained by training can be enhanced, and the speech style similarity of the synthesized audio data can be improved.
Referring to fig. 11, which is a schematic structural diagram of an electronic device provided in an embodiment of the present application, the electronic device includes a processor 1101 and a storage device 1102. Optionally, the electronic device may also include a network interface 1103. Data can be exchanged among the processor 1101, the storage device 1102 and the network interface 1103; the network interface 1103 is controlled by the processor 1101 to send and receive messages; the storage device 1102 is used for storing a computer program, and the computer program includes program instructions; and the processor 1101 is used for executing the program instructions stored in the storage device 1102, that is, the processor 1101 is configured to call the program instructions to perform the above method.
The storage device 1102 may include a volatile memory (volatile memory), such as a random-access memory (RAM); the storage device 1102 may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a solid-state drive (SSD), or the like; the storage 1102 may also comprise a combination of memories of the kind described above.
The processor 1101 may be a Central Processing Unit (CPU). In one embodiment, the processor 1101 may also be a Graphics Processing Unit (GPU). The processor 1101 may also be a combination of a CPU and a GPU.
In one embodiment, the storage device 1102 is used to store program instructions, and the processor 1101 may call the program instructions to perform the following steps:
acquiring source audio data and target voice style characteristics, wherein the target voice style characteristics are characteristics corresponding to the voice style of the target audio data;
calling a voice conversion model to process the source audio data and the target voice style characteristics to obtain first prediction audio data;
extracting source text data in the source audio data, and calling an initial speech synthesis model to process the source text data and the target speech style characteristics to obtain second predicted audio data;
updating parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data to train the initial speech synthesis model.
In an embodiment, the speech conversion model includes a first encoding network and a first decoding network, and when calling the speech conversion model to process the source audio data and the target speech style characteristic to obtain first predicted audio data, the processor 1101 is specifically configured to execute:
obtaining source rhythm characteristics in the source audio data, and determining the source coding characteristics of the source audio data through a first coding network, wherein the source rhythm characteristics are characteristics corresponding to the rhythm in the source audio data;
processing the source coding feature, the source rhythm feature and the target voice style feature through a first decoding network to obtain a first prediction sound spectrum of the source audio data under the voice style of the target audio data;
and constructing first predicted audio data based on the first predicted sound spectrum.
In one embodiment, before performing the determining the source coding characteristics of the source audio data through the first coding network, the processor 1101 is further configured to:
acquiring sample audio data, sample rhythm characteristics and sample voice style characteristics, wherein the sample rhythm characteristics are characteristics corresponding to the rhythm in the sample audio data, and the sample voice style characteristics are characteristics corresponding to the voice style of the sample audio data;
determining, by a first initial coding network, sample coding characteristics of the sample audio data;
processing the sample coding features, the sample rhythm features and the sample voice style features through a first initial decoding network to obtain sample predicted audio data;
updating parameters in the first initial encoding network and the first initial decoding network based on the sample audio data and the sample prediction audio data;
and if the first initial coding network after the parameter updating and the first initial decoding network after the parameter updating meet a first preset condition, determining the first initial coding network after the parameter updating as a first coding network, and determining the first initial decoding network after the parameter updating as a first decoding network.
In an embodiment, the initial speech synthesis model includes a second initial coding network and a second initial decoding network, and the processor 1101 is configured to execute the calling of the initial speech synthesis model to process the source text data and the target speech style characteristic to obtain second predicted audio data, and specifically to execute:
determining text encoding characteristics of the source text data through a second initial encoding network;
processing the text coding features and the target voice style features through a second initial decoding network to obtain a second predicted sound spectrum of the source text data under the voice style of the target audio data;
and constructing second prediction audio data based on the second prediction sound spectrum.
In an embodiment, the processor 1101 is configured to, when performing the updating of the parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data, specifically, perform:
acquiring a first prediction sound spectrum corresponding to the first prediction audio data and a second prediction sound spectrum corresponding to the second prediction audio data;
calling a target loss function to determine a target loss value between the first predicted sound spectrum and the second predicted sound spectrum;
and if the target loss value meets a preset loss value condition, updating parameters in the initial voice synthesis model based on the target loss value.
In an embodiment, when the processor 1101 is configured to execute the determining the source coding feature of the source audio data through the first coding network, specifically, the processor is configured to execute:
extracting at least one frame of audio included in the source audio data and phonemes corresponding to each frame of audio through the first coding network, and constructing a source audio posterior probability matrix based on the at least one frame of audio and the phonemes corresponding to each frame of audio;
and performing dimension compression on the source audio posterior probability matrix through the first coding network to obtain the source coding characteristics of the source audio data.
In one embodiment, the processor 1101 is configured to perform, after the updating the parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data, further:
if the initial speech synthesis model after the parameter updating meets a second preset condition, determining the initial speech synthesis model after the parameter updating as a speech synthesis model;
acquiring text data to be synthesized;
and calling the voice synthesis model to process the text data to be synthesized and the target voice style characteristics to obtain target prediction audio data.
In a specific implementation, the apparatus, the processor 1101, the storage device 1102 and the like described in this embodiment of the application may perform the implementation described in the above method embodiment, and may also perform the implementation described in this embodiment of the application, which is not described herein again.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, perform some or all of the steps performed in the above-described method embodiments. Alternatively, the computer storage medium may be volatile or non-volatile.
Embodiments of the present application also provide a computer program product or computer program comprising program instructions, which may be stored in a computer readable storage medium. The processor of the computer device reads the program instructions from the computer-readable storage medium, and the processor executes the program instructions, so that the computer performs part or all of the steps performed in the method, which are not described herein again.
Reference herein to "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium; when the program is executed, the processes of the above method embodiments may be performed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the present disclosure has been described with reference to particular embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present disclosure.

Claims (10)

1. A method for training a speech synthesis model, comprising:
acquiring source audio data and target voice style characteristics, wherein the target voice style characteristics are characteristics corresponding to the voice style of the target audio data;
calling a voice conversion model to process the source audio data and the target voice style characteristics to obtain first prediction audio data;
extracting source text data in the source audio data, and calling an initial speech synthesis model to process the source text data and the target speech style characteristics to obtain second predicted audio data;
updating parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data to train the initial speech synthesis model.
2. The method of claim 1, wherein the speech conversion model comprises a first encoding network and a first decoding network, and wherein the invoking the speech conversion model to process the source audio data and the target speech style feature to obtain first predicted audio data comprises:
obtaining source rhythm characteristics in the source audio data, and determining the source coding characteristics of the source audio data through a first coding network, wherein the source rhythm characteristics are characteristics corresponding to the rhythm in the source audio data;
processing the source coding feature, the source rhythm feature and the target voice style feature through a first decoding network to obtain a first predicted sound spectrum of the source audio data under the voice style of the target audio data;
and constructing first predicted audio data based on the first predicted sound spectrum.
3. The method of claim 2, wherein prior to determining the source coding characteristics of the source audio data over the first coding network, the method further comprises:
acquiring sample audio data, sample rhythm characteristics and sample voice style characteristics, wherein the sample rhythm characteristics are characteristics corresponding to rhythms in the sample audio data, and the sample voice style characteristics are characteristics corresponding to voice styles of the sample audio data;
determining, by a first initial coding network, sample coding characteristics of the sample audio data;
processing the sample coding features, the sample rhythm features and the sample voice style features through a first initial decoding network to obtain sample predicted audio data;
updating parameters in the first initial encoding network and the first initial decoding network based on the sample audio data and the sample prediction audio data;
and if the first initial coding network after the parameter updating and the first initial decoding network after the parameter updating meet a first preset condition, determining the first initial coding network after the parameter updating as a first coding network, and determining the first initial decoding network after the parameter updating as a first decoding network.
4. The method of claim 1, wherein the initial speech synthesis model comprises a second initial coding network and a second initial decoding network, and wherein invoking the initial speech synthesis model to process the source text data and the target speech style feature to obtain second predicted audio data comprises:
determining text encoding characteristics of the source text data through a second initial encoding network;
processing the text coding features and the target voice style features through a second initial decoding network to obtain a second predicted sound spectrum of the source text data under the voice style of the target audio data;
and constructing second prediction audio data based on the second prediction sound spectrum.
5. The method of claim 1, wherein the updating parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data comprises:
acquiring a first prediction sound spectrum corresponding to the first prediction audio data and a second prediction sound spectrum corresponding to the second prediction audio data;
calling a target loss function to determine a target loss value between the first predicted sound spectrum and the second predicted sound spectrum;
and if the target loss value meets a preset loss value condition, updating parameters in the initial speech synthesis model based on the target loss value.
6. The method of claim 2, wherein determining the source coding characteristics of the source audio data via the first coding network comprises:
extracting at least one frame of audio included in the source audio data and phonemes corresponding to each frame of audio through the first coding network, and constructing a source audio posterior probability matrix based on the at least one frame of audio and the phonemes corresponding to each frame of audio;
and performing dimension compression on the source audio posterior probability matrix through the first coding network to obtain the source coding characteristics of the source audio data.
7. The method of claim 1, wherein after the updating the parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data, the method further comprises:
if the initial speech synthesis model after the parameter updating meets a second preset condition, determining the initial speech synthesis model after the parameter updating as a speech synthesis model;
acquiring text data to be synthesized;
and calling the voice synthesis model to process the text data to be synthesized and the target voice style characteristics to obtain target prediction audio data.
8. A speech synthesis apparatus, comprising:
the acquisition module is used for acquiring source audio data and target voice style characteristics, wherein the target voice style characteristics are characteristics corresponding to the voice style of the target audio data;
the processing module is used for calling a voice conversion model to process the source audio data and the target voice style characteristics to obtain first predicted audio data;
the processing module is further configured to extract source text data from the source audio data, and call an initial speech synthesis model to process the source text data and the target speech style characteristics to obtain second predicted audio data;
the processing module is further configured to update parameters in the initial speech synthesis model based on the first predicted audio data and the second predicted audio data to train the initial speech synthesis model.
9. An electronic device comprising a processor and a storage device, the processor being interconnected with the storage device, wherein the storage device is configured to store computer program instructions, and the processor is configured to execute the program instructions to implement the method of any one of claims 1-7.
10. A computer-readable storage medium, having stored thereon computer program instructions, which, when executed by a processor, are adapted to perform the method of any one of claims 1-7.
CN202110397569.6A 2021-04-13 2021-04-13 Speech synthesis model training method and device, electronic equipment and medium Pending CN115206281A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110397569.6A CN115206281A (en) 2021-04-13 2021-04-13 Speech synthesis model training method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110397569.6A CN115206281A (en) 2021-04-13 2021-04-13 Speech synthesis model training method and device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN115206281A true CN115206281A (en) 2022-10-18

Family

ID=83571517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110397569.6A Pending CN115206281A (en) 2021-04-13 2021-04-13 Speech synthesis model training method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN115206281A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230260502A1 (en) * 2022-02-14 2023-08-17 Amazon Technologies, Inc. Voice adaptation using synthetic speech processing
US11915683B2 (en) * 2022-02-14 2024-02-27 Amazon Technologies, Inc. Voice adaptation using synthetic speech processing

Similar Documents

Publication Publication Date Title
US12033611B2 (en) Generating expressive speech audio from text data
US11929059B2 (en) Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
CN111754976B (en) Rhythm control voice synthesis method, system and electronic device
EP3614376B1 (en) Speech synthesis method, server and storage medium
US11450313B2 (en) Determining phonetic relationships
KR20230003056A (en) Speech recognition using non-speech text and speech synthesis
KR20220004737A (en) Multilingual speech synthesis and cross-language speech replication
US11763797B2 (en) Text-to-speech (TTS) processing
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
JP7379756B2 (en) Prediction of parametric vocoder parameters from prosodic features
CN112349289A (en) Voice recognition method, device, equipment and storage medium
JP2020034883A (en) Voice synthesizer and program
CN113178188B (en) Speech synthesis method, device, equipment and storage medium
EP4352724A1 (en) Two-level text-to-speech systems using synthetic training data
CN114242093A (en) Voice tone conversion method and device, computer equipment and storage medium
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
CN113327578A (en) Acoustic model training method and device, terminal device and storage medium
CN117219052A (en) Prosody prediction method, apparatus, device, storage medium, and program product
CN115206281A (en) Speech synthesis model training method and device, electronic equipment and medium
CN117133270A (en) Speech synthesis method, device, electronic equipment and storage medium
CN113628608A (en) Voice generation method and device, electronic equipment and readable storage medium
CN115359780A (en) Speech synthesis method, apparatus, computer device and storage medium
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
Yin An overview of speech synthesis technology
Alastalo Finnish end-to-end speech synthesis with Tacotron 2 and WaveNet

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination