CN116052638A - Training method of speech synthesis model, speech synthesis method and device - Google Patents

Training method of speech synthesis model, speech synthesis method and device

Info

Publication number
CN116052638A
Authority
CN
China
Prior art keywords
speaker
phoneme
phoneme sequence
text
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310138459.7A
Other languages
Chinese (zh)
Inventor
宋伟
张雅洁
岳杨皓
张政臣
吴友政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd
Priority to CN202310138459.7A
Publication of CN116052638A
Legal status: Pending

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 40/00 Handling natural language data
                    • G06F 40/10 Text processing
                        • G06F 40/12 Use of codes for handling textual entities
                            • G06F 40/126 Character encoding
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 13/00 Speech synthesis; Text to speech systems
                    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
                • G10L 15/00 Speech recognition
                    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
                        • G10L 15/063 Training
                            • G10L 2015/0631 Creating reference templates; Clustering
                • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
                    • G10L 21/003 Changing voice quality, e.g. pitch or formants
                        • G10L 21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
                            • G10L 21/013 Adapting to target pitch
                                • G10L 2021/0135 Voice conversion or morphing

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application provides a training method for a speech synthesis model, a speech synthesis method, and corresponding apparatuses, relating to artificial intelligence technologies such as deep learning and speech technology. The training method includes: acquiring sample acoustic features of a plurality of first sample audios, the corresponding first phoneme sequences, and the corresponding speaker identifiers, where at least one first sample audio corresponding to the same speaker identifier has a single style feature; inputting the first phoneme sequence corresponding to a first sample audio and the corresponding speaker identifier into the speech synthesis model to obtain predicted acoustic features of the first sample audio; and training the speech synthesis model based on the predicted acoustic features and the sample acoustic features of each first sample audio. This decouples the tone color features and the style features in the audio, so the speech synthesis model can be trained with audio in which each speaker has only a single style feature, which reduces the training cost of the speech synthesis model.

Description

Training method of speech synthesis model, speech synthesis method and device
Technical Field
The application relates to artificial intelligence technologies such as deep learning and speech technology, and in particular to a training method for a speech synthesis model, a speech synthesis method, and corresponding apparatuses.
Background
Speech synthesis technology is currently widely used in scenarios such as intelligent question answering, voice broadcasting, audiobooks, and virtual anchors. In some of these scenarios, it is desirable to synthesize audio in different styles for the same speaker.
In the related art, synthesizing audio in different styles for the same speaker requires that each speaker record audio in those different styles as training data for the speech synthesis model. Recording such audio is costly, so the training cost of the speech synthesis model is high.
Disclosure of Invention
The present application aims to solve, at least to some extent, one of the technical problems in the related art.
The application provides a training method for a speech synthesis model, a speech synthesis method, and corresponding apparatuses, aiming to solve the technical problem of the high training cost of speech synthesis models in the related art.
An embodiment of a first aspect of the present application provides a training method for a speech synthesis model, including: acquiring sample acoustic features of a plurality of first sample audios, corresponding first phoneme sequences and corresponding speaker identifications, wherein at least one first sample audio corresponding to the same speaker identification has a single style feature; inputting a first phoneme sequence corresponding to the first sample audio and a corresponding speaker identification into a coding layer of a speech synthesis model to determine prosodic features of the speaker corresponding to the speaker identification on each phoneme in the first phoneme sequence based on a style characterization corresponding to the speaker identification and the first phoneme sequence, and determining text coding of the first phoneme sequence on an audio frame level based on the first phoneme sequence and the prosodic features; inputting a text code and a corresponding speaker identifier of the first phoneme sequence at the audio frame level into a decoding layer of the speech synthesis model to decode based on a tone representation and the text code corresponding to the speaker identifier, so as to obtain a predicted acoustic feature of the first sample audio; the speech synthesis model is trained based on the predicted acoustic features of each of the first sample audio and the sample acoustic features.
An embodiment of a second aspect of the present application provides a speech synthesis method, including: acquiring a third phoneme sequence corresponding to a target text to be synthesized, and acquiring a first speaker identification and a second speaker identification from a candidate identification set; inputting the third phoneme sequence and the first speaker identification into an encoding layer of a speech synthesis model, determining prosodic features of a speaker corresponding to the first speaker identification on each phoneme in the third phoneme sequence based on style characterization corresponding to the first speaker identification and the third phoneme sequence, and determining text encoding of the third phoneme sequence on an audio frame level based on the third phoneme sequence and the prosodic features; inputting the text code of the third phoneme sequence on the audio frame level and the second speaker identification into a decoding layer of the speech synthesis model to decode based on the tone representation corresponding to the second speaker identification and the text code to obtain acoustic characteristics; and generating target audio corresponding to the target text based on the acoustic features.
An embodiment of a third aspect of the present application provides a model training apparatus for speech synthesis, including: the first acquisition module is used for acquiring sample acoustic characteristics of a plurality of first sample audios, corresponding first phoneme sequences and corresponding speaker identifications, and at least one first sample audio corresponding to the same speaker identification has a single style characteristic; the first processing module is used for inputting a first phoneme sequence corresponding to the first sample audio and a corresponding speaker identifier into an encoding layer of a speech synthesis model, determining prosodic features of the speaker corresponding to the speaker identifier on each phoneme in the first phoneme sequence based on a style characterization corresponding to the speaker identifier and the first phoneme sequence, and determining text encoding of the first phoneme sequence on an audio frame level based on the first phoneme sequence and the prosodic features; the second processing module is used for inputting the text codes of the first phoneme sequences on the audio frame level and the corresponding speaker identifications into a decoding layer of the speech synthesis model so as to decode based on the tone characterization corresponding to the speaker identifications and the text codes, and obtain the predicted acoustic characteristics of the first sample audio; and the first training module is used for training the voice synthesis model based on the predicted acoustic characteristics of each first sample audio and the sample acoustic characteristics.
An embodiment of a fourth aspect of the present application provides a speech synthesis apparatus, including: the third acquisition module is used for acquiring a third phoneme sequence corresponding to the target text to be synthesized and acquiring a first speaker identifier and a second speaker identifier from the candidate identifier set; a third processing module, configured to input the third phoneme sequence and the first speaker identifier into an encoding layer of a speech synthesis model, so as to determine prosodic features of a speaker corresponding to the first speaker identifier on each phoneme in the third phoneme sequence based on a style characterization corresponding to the first speaker identifier and the third phoneme sequence, and determine text encoding of the third phoneme sequence on an audio frame level based on the third phoneme sequence and the prosodic features; a fourth processing module, configured to input, to a decoding layer of the speech synthesis model, a text encoding of the third phoneme sequence at an audio frame level and the second speaker identifier, so as to decode based on a tone representation corresponding to the second speaker identifier and the text encoding, thereby obtaining an acoustic feature; and the generating module is used for generating target audio corresponding to the target text based on the acoustic characteristics.
An embodiment of a fifth aspect of the present application proposes an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a training method of a speech synthesis model as set forth in the embodiments of the first aspect of the application or to perform a speech synthesis method as set forth in the embodiments of the second aspect of the application.
An embodiment of a sixth aspect of the present application proposes a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform a training method of a speech synthesis model as set forth in the embodiment of the first aspect of the present application, or to perform a speech synthesis method as set forth in the embodiment of the second aspect of the present application.
An embodiment of a seventh aspect of the present application proposes a computer program product comprising a computer program which, when executed by a processor, implements a training method of a speech synthesis model as proposed by an embodiment of the first aspect of the present application, or performs a speech synthesis method as proposed by an embodiment of the second aspect of the present application.
One embodiment of the above invention has the following advantages or benefits:
The tone color features and the style features in the audio are decoupled, so the speech synthesis model can be trained with audio in which each speaker has only a single style feature, which reduces the cost of recording audio for the training data and therefore the training cost of the speech synthesis model. In addition, the same speech synthesis model can flexibly generate target audio with different tone colors and different styles, which improves the flexibility of the model, broadens its range of application, and reduces the number of models required for style transfer.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart of a training method for a speech synthesis model according to a first embodiment of the present application;
FIG. 2 is a flowchart of a training method for a speech synthesis model according to a second embodiment of the present application;
FIG. 3 is a schematic structural diagram of a speech synthesis model according to the second embodiment of the present application;
FIG. 4 is a schematic structural diagram of a prosody prediction module according to the second embodiment of the present application;
FIG. 5 is a flowchart of a speech synthesis method according to a third embodiment of the present application;
FIG. 6 is a schematic diagram of the pitch features of target audio a and target audio b according to the third embodiment of the present application;
FIG. 7 is a schematic diagram of the pitch features of target audio c and target audio d according to the third embodiment of the present application;
FIG. 8 is a schematic diagram of the pitch features of target audio e and target audio f according to the third embodiment of the present application;
FIG. 9 is a schematic structural diagram of a training apparatus for a speech synthesis model according to a fourth embodiment of the present application;
FIG. 10 is a schematic structural diagram of a speech synthesis apparatus according to a fifth embodiment of the present application;
FIG. 11 is a block diagram of an exemplary electronic device suitable for implementing embodiments of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present application and are not to be construed as limiting the present application.
It should be noted that in the technical solution of this application, the acquisition, storage, and use of any personal information involved comply with relevant laws and regulations and do not violate public order and good customs.
To address the technical problem of the high training cost of speech synthesis models in the related art, the application provides a training method for a speech synthesis model, a speech synthesis method, an apparatus, an electronic device, a storage medium, and a computer program product.
The training method of the speech synthesis model includes: acquiring sample acoustic features of a plurality of first sample audios, the corresponding first phoneme sequences, and the corresponding speaker identifiers, where at least one first sample audio corresponding to the same speaker identifier has a single style feature; inputting the first phoneme sequence corresponding to a first sample audio and the corresponding speaker identifier into an encoding layer of the speech synthesis model, so as to determine, based on the style characterization corresponding to the speaker identifier and the first phoneme sequence, the prosodic features of the corresponding speaker on each phoneme in the first phoneme sequence, and to determine, based on the first phoneme sequence and the prosodic features, the text encoding of the first phoneme sequence at the audio frame level; inputting the text encoding of the first phoneme sequence at the audio frame level and the corresponding speaker identifier into a decoding layer of the speech synthesis model, so as to decode based on the tone characterization corresponding to the speaker identifier and the text encoding and obtain the predicted acoustic features of the first sample audio; and training the speech synthesis model based on the predicted acoustic features and the sample acoustic features of each first sample audio. In this way, the tone color features and style features in the audio are decoupled, so the speech synthesis model can be trained with audio in which each speaker has only a single style feature, which reduces the cost of recording audio for the training data and therefore the training cost of the speech synthesis model.
For ease of understanding, technical terms involved in the embodiments of the present application will be explained first.
In the description of the present application, a tone color feature describes the characteristics of a speaker's tone color (timbre), and may include, for example but not limited to, the wavelength, frequency, intensity, and rhythm of the speaker's voice. Different speakers have different tone color features.
A style feature represents a speaker's speaking style, speaking habits, or expressiveness. For example, style features may include, but are not limited to, whether the speaker reads each character or word lightly, weakly, or heavily, and whether a character or word is prolonged or emphasized.
A speaker identifier uniquely identifies a speaker. It may be a number assigned to the speaker in advance, or an identifier such as the speaker's name; this application does not limit its form.
A phoneme sequence is a sequence composed of a plurality of phonemes, where a phoneme is the smallest phonetic unit divided according to the natural properties of speech. For example, for the text "they" (pronounced "ta men" in Chinese), the corresponding phoneme sequence "t a m en" consists of the phonemes "t", "a", "m", and "en". In the embodiments of this application, a number may be preset for each phoneme, so that each phoneme can be represented by its number and a phoneme sequence can be represented by the sequence of those numbers.
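As an illustration of this numbering scheme, the following minimal Python sketch maps phonemes to preset numbers; the phoneme inventory and the IDs are assumptions for illustration, not values from this application.

```python
# Minimal sketch: representing a phoneme sequence by preset phoneme IDs.
# The phoneme inventory and the example lookup are illustrative assumptions.
PHONEME_TO_ID = {"<pad>": 0, "t": 1, "a": 2, "m": 3, "en": 4}

def phonemes_to_ids(phonemes):
    """Convert a phoneme sequence such as ['t', 'a', 'm', 'en'] to integer IDs."""
    return [PHONEME_TO_ID[p] for p in phonemes]

print(phonemes_to_ids(["t", "a", "m", "en"]))  # [1, 2, 3, 4]
```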
Acoustic features characterize the acoustic properties of speech and may be, for example, Mel-spectrogram features.
A style characterization represents the style features of the corresponding speaker and may be a style embedding vector (Style embedding) for that speaker.
A tone characterization represents the tone color features of the corresponding speaker and may be a tone embedding vector (Speaker embedding) for that speaker.
The prosodic features of a speaker on each phoneme in a phoneme sequence represent the speaker's pronunciation style on each phoneme and may include features such as pitch, duration, and energy.
An audio frame may be 10 ms (milliseconds) long or any other length; the frame length can be set as needed and is not limited in this application.
The text encoding of a phoneme sequence at the audio frame level consists of the text encodings of the audio frames into which the phoneme sequence is divided. The text encoding of each audio frame includes the features of the text corresponding to that frame and the speaker's prosodic features on that frame, which represent the speaker's pronunciation style on the frame and may include features such as pitch, duration, and energy.
The following describes a training method of a speech synthesis model, a speech synthesis method, an apparatus, an electronic device, a storage medium, and a computer program product of the embodiments of the present application with reference to the accompanying drawings.
First, a training method of a speech synthesis model provided in the embodiment of the present application is described.
It should be noted that the training method of the speech synthesis model provided in the embodiments of this application is executed by a training apparatus for the speech synthesis model. The training apparatus may be an electronic device, or may be configured in an electronic device, so that by executing the training method provided in the embodiments of this application, the tone color features and style features in the audio are decoupled, the speech synthesis model can be trained with audio in which each speaker has only a single style feature, the cost of recording audio for the training data is reduced, and the training cost of the speech synthesis model is reduced accordingly.
The electronic device may be a personal computer (Personal Computer, abbreviated as PC), a cloud device, a mobile device, a server, etc., and the mobile device may be any hardware device such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, a vehicle-mounted device, etc., which is not limited in this application.
Fig. 1 is a flow chart of a training method of a speech synthesis model according to an embodiment of the present application. As shown in FIG. 1, the training method of the speech synthesis model may include the following steps 101-104.
Step 101, obtaining sample acoustic features of a plurality of first sample audios, corresponding first phoneme sequences and corresponding speaker identifiers, wherein at least one first sample audio corresponding to the same speaker identifier has a single style feature.
The sample acoustic features of a first sample audio may be, for example, Mel-spectrogram features, and are obtained by acoustic feature extraction from the first sample audio.
The first phoneme sequence is obtained by phoneme conversion of the text corresponding to the first sample audio. For example, if the text corresponding to the first sample audio is "they", phoneme conversion of that text yields the first phoneme sequence "t a m en", where "t", "a", "m", and "en" are phonemes.
The same speaker identifier may correspond to one first sample audio or to multiple first sample audios; this application does not limit this.
That at least one first sample audio corresponding to the same speaker identifier has a single style feature means that at least one speaker only needs to record audio in a single style as training data.
It should be noted that the training data used to train the speech synthesis model in the embodiments of the present disclosure may be provided and authorized by users, obtained from public data sets, or obtained by other means that comply with relevant laws and regulations; this application does not limit the source.
Step 102, inputting a first phoneme sequence corresponding to the first sample audio and a corresponding speaker identification into a coding layer of a speech synthesis model to determine prosodic features of the speaker corresponding to the speaker identification on each phoneme in the first phoneme sequence based on the style characterization corresponding to the speaker identification and the first phoneme sequence, and determining text coding of the first phoneme sequence on an audio frame level based on the first phoneme sequence and the prosodic features.
The voice synthesis model is a neural network model with a voice synthesis function.
The style characterization corresponding to a speaker identifier characterizes the style features of the speaker corresponding to that identifier, and may be the Style embedding corresponding to the speaker identifier.
The prosodic features of the speaker on each phoneme in the first phoneme sequence are used for representing the pronunciation style of the speaker on each phoneme in the first phoneme sequence, and may include features such as pitch, duration, energy and the like.
The text coding of the first phoneme sequence at the audio frame level comprises text coding corresponding to a plurality of audio frames, wherein the text coding corresponding to each audio frame comprises the characteristics of the text corresponding to the audio frame and the prosodic characteristics of a speaker on the audio frame.
In one embodiment of the application, the speech synthesis model may include an encoding layer. The first phoneme sequence corresponding to a first sample audio and the corresponding speaker identifier may be input into the encoding layer of the speech synthesis model. The encoding layer may query a characterization table (embedding table) with the speaker identifier to obtain the style characterization corresponding to the speaker identifier, then determine, based on that style characterization and the first phoneme sequence, the prosodic features of the corresponding speaker on each phoneme in the first phoneme sequence, and determine, based on the first phoneme sequence and the prosodic features, the text encoding of the first phoneme sequence at the audio frame level.
Step 103, inputting the text code and the corresponding speaker identification of the first phoneme sequence at the audio frame level into a decoding layer of the speech synthesis model to decode based on the tone representation and the text code corresponding to the speaker identification, thereby obtaining the predicted acoustic feature of the first sample audio.
The tone color characterization corresponding to the speaker identifier is used for characterizing the tone color characteristics of the speaker corresponding to the speaker identifier, and may be Speaker embedding corresponding to the speaker identifier.
The predicted acoustic features characterize the acoustic properties of the first sample audio as predicted by the speech synthesis model, and may be, for example, Mel-spectrogram features.
In one embodiment of the application, the speech synthesis model may include a decoding layer. After the text encoding of the first phoneme sequence at the audio frame level and the corresponding speaker identifier are input into the decoding layer, the decoding layer may query a characterization table (embedding table) with the speaker identifier to obtain the tone characterization corresponding to the speaker identifier, and then decode based on the tone characterization and the text encoding to obtain the predicted acoustic features. The predicted acoustic features carry the features of the text corresponding to the first sample audio, the style features of the speaker, and the tone color features of the speaker.
It should be noted that in the embodiments of this application, different characterization tables are used to obtain the tone characterization and the style characterization for the same speaker identifier, so even though the speaker identifier is the same, the resulting tone characterization and style characterization are different.
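A minimal PyTorch-style sketch of this arrangement is given below, assuming hypothetical table sizes; it only illustrates that the same speaker identifier indexes two independent embedding tables, one queried by the encoding layer for the style characterization and one queried by the decoding layer for the tone characterization.

```python
# Minimal sketch (an assumption, not the patent's exact implementation): the same
# speaker ID indexes two independent embedding tables, so the style characterization
# and the tone characterization are learned separately.
import torch
import torch.nn as nn

NUM_SPEAKERS, STYLE_DIM, TIMBRE_DIM = 10, 256, 256  # hypothetical sizes

style_table = nn.Embedding(NUM_SPEAKERS, STYLE_DIM)    # queried by the encoding layer
timbre_table = nn.Embedding(NUM_SPEAKERS, TIMBRE_DIM)  # queried by the decoding layer

speaker_id = torch.tensor([3])
style_repr = style_table(speaker_id)    # used only by the prosody prediction module
timbre_repr = timbre_table(speaker_id)  # used only by the decoder
print(style_repr.shape, timbre_repr.shape)  # torch.Size([1, 256]) torch.Size([1, 256])
```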
Step 104, training the speech synthesis model based on the predicted acoustic features and the sample acoustic features of each first sample audio.
In the embodiments of this application, the predicted acoustic features of each first sample audio and the corresponding sample acoustic features can be substituted into a loss function to determine a loss value; the model parameters of the speech synthesis model are then adjusted according to the loss value, and the trained speech synthesis model is obtained after multiple iterations of optimization.
The loss function may be set as needed; for example, it may be a mean squared error (MSE) loss function or another loss function, which is not limited in this application.
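As a sketch of one training iteration under these assumptions, the update could look as follows; the model interface (phoneme IDs and speaker IDs mapped to predicted Mel frames) is a hypothetical handle, not an interface defined by this application.

```python
# Minimal training-step sketch using the MSE loss mentioned above.
# The model signature (phoneme_ids, speaker_ids) -> predicted Mel frames is assumed.
import torch
import torch.nn as nn

def train_step(model, optimizer, phoneme_ids, speaker_ids, sample_mels):
    """One iteration: predict acoustic features and fit them to the sample features."""
    criterion = nn.MSELoss()
    optimizer.zero_grad()
    predicted_mels = model(phoneme_ids, speaker_ids)  # predicted acoustic features
    loss = criterion(predicted_mels, sample_mels)     # compare with sample acoustic features
    loss.backward()
    optimizer.step()
    return loss.item()
```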
In the embodiments of this application, the encoding layer of the speech synthesis model does not use the tone characterization when it determines the text encoding of the first phoneme sequence at the audio frame level; the tone characterization is used only by the decoding layer when it determines the predicted acoustic features. Only the decoding layer and the tone characterization therefore influence the tone color features carried in the predicted acoustic features, which decouples the text from the tone color features. In addition, the prosodic features determined from the style characterization depend only on the style features and not on the tone color features, so the encoding layer does not depend on the tone color features when it determines the text encoding of the first phoneme sequence at the audio frame level from the style characterization. This decouples the style features from the tone color features.
Because the speech synthesis model decouples the tone color features and style features in the audio, a single speaker can contribute audio in only one style as training data, and each speaker's data is treated as an independent style during training. Each speaker thus corresponds to an independent style, and prosodic features in different speakers' styles can be predicted from the corresponding style characterizations according to the different speaker identifiers. Because a speaker can use single-style audio as training data, the requirements on the training data are lowered: speakers are not required to record audio in multiple styles, which reduces the cost of recording audio for the training data and therefore the training cost of the speech synthesis model.
In addition, when the speech synthesis model is trained in the above manner, it learns each speaker's own tone color features and style features during training. The trained model can therefore synthesize, from any text and any two speaker identifiers, target audio that has the tone color features corresponding to one speaker identifier and the style features corresponding to the other. The same speech synthesis model can thus flexibly generate target audio with different tone colors and different styles, which improves the flexibility of the model, broadens its range of application, and reduces the number of models required for style transfer.
In summary, according to the training method of the speech synthesis model provided by the embodiment of the application, the sample acoustic features of a plurality of first sample audios, the corresponding first phoneme sequences and the corresponding speaker identifications are obtained, and at least one first sample audio corresponding to the same speaker identification has a single style feature; inputting a first phoneme sequence corresponding to the first sample audio and a corresponding speaker identification into a coding layer of a speech synthesis model to determine prosodic features of a speaker corresponding to the speaker identification on each phoneme in the first phoneme sequence based on style characterization corresponding to the speaker identification and the first phoneme sequence, and determining text coding of the first phoneme sequence on an audio frame level based on the first phoneme sequence and the prosodic features; inputting the text codes of the first phoneme sequence on the audio frame level and the corresponding speaker identifications into a decoding layer of a speech synthesis model to decode based on the tone characterization and the text codes corresponding to the speaker identifications, thereby obtaining the predicted acoustic characteristics of the first sample audio; the speech synthesis model is trained based on the predicted acoustic features and the sample acoustic features of each first sample audio. Therefore, decoupling of tone color characteristics and style characteristics in the audio is achieved, so that the voice synthesis model can be trained by utilizing the audio with single style characteristics corresponding to each speaker, recording cost of the audio in training data is reduced, and training cost of the voice synthesis model is further reduced.
The process of determining the predicted acoustic features of the first sample audio by the speech synthesis model in the training method of the speech synthesis model in the embodiment of the present application will be further described with reference to fig. 2.
Fig. 2 is a flow chart of a training method of a speech synthesis model according to a second embodiment of the present application. As shown in FIG. 2, the training method of the speech synthesis model may include the following steps 201-209.
Step 201, obtaining sample acoustic features of a plurality of first sample audios, corresponding first phoneme sequences and corresponding speaker identifications, wherein at least one first sample audio corresponding to the same speaker identification has a single style feature.
The specific implementation process and principle of step 201 may refer to the description of the foregoing embodiments, which is not repeated herein.
Step 202, inputting the speaker identification into a second embedding module to obtain the style characterization corresponding to the speaker identification.
Step 203, inputting the first phoneme sequence into a first embedding module to obtain text representations of each phoneme in the first phoneme sequence, and inputting the text representations of each phoneme into an encoder to obtain text codes of each phoneme.
Step 204, inputting the text codes and style characterization of each phoneme into a prosody prediction module to obtain prosody characteristics of the speaker on each phoneme in the first phoneme sequence corresponding to the speaker identification, and determining the text codes of the first phoneme sequence on the audio frame level based on the first phoneme sequence and the prosody characteristics.
In one embodiment of the application, referring to FIG. 3, the encoding layer in the speech synthesis model may include a first embedding module 31, an encoder 32, a prosody prediction module 33, and a second embedding module 34, connected in sequence. The encoder 32 may be a multi-layer feed-forward network (FFN) based on a self-attention mechanism.
Accordingly, the speaker identifier is input into the second embedding module 34, which outputs the style characterization corresponding to the speaker identifier. The first phoneme sequence is input into the first embedding module 31, which outputs a text representation of each phoneme in the first phoneme sequence; the text representation represents the corresponding phoneme and may be an embedding vector of the phoneme. The text representation of each phoneme and information such as the position embedding vector corresponding to each phoneme may then be input into the encoder 32 to obtain the text encoding of each phoneme, which represents the feature information of the corresponding phoneme. The text encodings and the style characterization may then be input into the prosody prediction module 33 to obtain the prosodic features of the speaker corresponding to the speaker identifier on each phoneme in the first phoneme sequence, and the prosody prediction module 33 determines the text encoding of the first phoneme sequence at the audio frame level based on the first phoneme sequence and the prosodic features.
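A minimal sketch of one self-attention feed-forward block of the kind the encoder 32 may use is shown below; the layer sizes and the convolutional feed-forward design are assumptions in the spirit of FastSpeech-style models, not details specified by this application.

```python
# Sketch of one self-attention FFN block; all hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    def __init__(self, dim=256, heads=2, ffn_dim=1024, kernel=9):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Conv1d(dim, ffn_dim, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(ffn_dim, dim, kernel, padding=kernel // 2),
        )
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                         # x: [batch, phonemes, dim]
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)              # residual + layer norm
        ffn_out = self.ffn(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + ffn_out)
```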
In one embodiment of the application, referring to FIG. 3, the prosody prediction module 33 may include a prosody prediction unit 331 and a prosody processing unit 332, and the prosodic features include a pitch feature, an energy feature, and a duration feature. Accordingly, step 204 may be implemented as follows: the text encoding and the style characterization of each phoneme are input into the prosody prediction unit 331 to obtain the pitch feature, energy feature, and duration feature of the speaker corresponding to the speaker identifier on each phoneme; the text encoding of each phoneme and the speaker's pitch, energy, and duration features on each phoneme are then input into the prosody processing unit 332, which fuses the pitch and energy features of each phoneme with its text encoding to obtain a fused encoding for each phoneme and expands the fused encodings to the audio frame level based on the speaker's duration feature on each phoneme, yielding the text encoding of the first phoneme sequence at the audio frame level.
The prosody prediction unit 331 may include a plurality of prosody prediction subunits, each of which predicts one type of prosodic feature based on the text encoding and the style characterization of each phoneme.
For example, the prosody prediction unit 331 and the prosody processing unit 332 may have the structures shown in FIG. 4. Referring to FIG. 4, the prosody prediction unit 331 may include three prosody prediction subunits, namely a duration prediction subunit 3311, a pitch prediction subunit 3312, and an energy prediction subunit 3313, and the prosody processing unit 332 may include a length adjustment subunit 3321. The "+" in FIG. 4 indicates element-wise addition of vectors. The structure of each prosody prediction subunit may follow the related art and is not described here.
In one embodiment of the application, referring to FIG. 4, the text encoding and the style characterization of each phoneme may be input into the duration prediction subunit 3311, the pitch prediction subunit 3312, and the energy prediction subunit 3313, so that the duration prediction subunit 3311 determines the duration feature of each phoneme, the pitch prediction subunit 3312 determines the pitch feature of each phoneme, and the energy prediction subunit 3313 determines the energy feature of each phoneme, in each case based on the text encoding and the style characterization. The pitch feature and energy feature of each phoneme may be fused with the text encoding of that phoneme to obtain a fused encoding; the fused encodings and the duration features are then input into the length adjustment subunit 3321, which expands the fused encoding of each phoneme to the audio frame level to obtain the text encoding of the first phoneme sequence at the audio frame level. The pitch, energy, and duration features of the phonemes can be added to the backbone network where the text encoding resides through a one-dimensional convolution.
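The following PyTorch-style sketch illustrates this flow under stated assumptions (all class names and layer sizes are hypothetical): three scalar predictors conditioned on the style characterization, fusion of the pitch and energy scalars into the phoneme encodings through one-dimensional convolutions, and expansion to the audio frame level according to the durations.

```python
# Minimal sketch of the prosody prediction module; sizes and names are assumptions.
import torch
import torch.nn as nn

class ScalarPredictor(nn.Module):
    """Predicts one scalar (duration, pitch, or energy) per phoneme."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, text_enc, style_repr):        # [B, P, D], [B, D]
        cond = text_enc + style_repr.unsqueeze(1)    # condition every phoneme on the style
        return self.net(cond).squeeze(-1)            # [B, P] scalars

class ProsodyPredictionModule(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.duration = ScalarPredictor(dim)
        self.pitch = ScalarPredictor(dim)
        self.energy = ScalarPredictor(dim)
        self.pitch_conv = nn.Conv1d(1, dim, kernel_size=1)   # 1-D conv onto the backbone
        self.energy_conv = nn.Conv1d(1, dim, kernel_size=1)

    def forward(self, text_enc, style_repr):
        # Durations in frames (at training time, forced-aligned durations could be used instead).
        dur = self.duration(text_enc, style_repr).clamp(min=0).round().long()
        pitch = self.pitch(text_enc, style_repr)
        energy = self.energy(text_enc, style_repr)
        # Fuse pitch and energy scalars into the phoneme encodings ("+" in FIG. 4).
        fused = (text_enc
                 + self.pitch_conv(pitch.unsqueeze(1)).transpose(1, 2)
                 + self.energy_conv(energy.unsqueeze(1)).transpose(1, 2))
        # Length adjustment: repeat each phoneme encoding by its duration in frames.
        frame_enc = [f.repeat_interleave(d.clamp(min=1), dim=0) for f, d in zip(fused, dur)]
        return frame_enc   # list of [num_frames, dim] tensors, one per utterance
```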
In one embodiment of the application, the pitch features and energy features of the phonemes in the first phoneme sequence may be Gaussian-normalized per speaker, so that the pitch and energy features no longer contain any speaker-specific information and only represent the overall prosody. The duration feature of each phoneme can be obtained by forced alignment.
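A minimal sketch of the per-speaker Gaussian normalization, written with NumPy purely for illustration:

```python
# Standardize pitch or energy values with each speaker's own mean and standard
# deviation so the values no longer carry speaker-specific level information.
import numpy as np

def gaussian_normalize_per_speaker(values, speaker_ids):
    """values: per-phoneme pitch or energy; speaker_ids: speaker of each value."""
    values = np.asarray(values, dtype=np.float64)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(values)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        mean, std = values[mask].mean(), values[mask].std() + 1e-8
        out[mask] = (values[mask] - mean) / std
    return out
```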
Step 205, inputting the speaker identification into a third embedding module to obtain the tone characterization corresponding to the speaker identification.
At step 206, the timbre representation and the text encoding of the first phoneme sequence at the audio frame level are input to a decoder for decoding based on the timbre representation and the text encoding to obtain predicted acoustic features.
In one embodiment of the application, referring to FIG. 3, the decoding layer of the speech synthesis model includes a decoder 35 connected to the encoding layer (specifically, to the prosody processing unit 332 in the encoding layer) and a third embedding module 36 connected to the decoder 35. The decoder 35 may be a multi-layer FFN network based on a self-attention mechanism.
Accordingly, the speaker identifier may be input into the third embedding module 36 to obtain the tone characterization corresponding to the speaker identifier, and the tone characterization and the text encoding of the first phoneme sequence at the audio frame level are then input into the decoder 35, which decodes based on the tone characterization and the text encoding to obtain the predicted acoustic features.
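A minimal sketch of such a decoding layer is given below; the class name is hypothetical, nn.TransformerEncoderLayer stands in for the self-attention FFN stack, and all sizes are assumptions.

```python
# Sketch of the decoding layer: the tone characterization looked up from the third
# embedding module is added to the frame-level text encoding, and a stack of
# self-attention FFN blocks maps the result to Mel frames. Sizes are assumptions.
import torch
import torch.nn as nn

class DecodingLayer(nn.Module):
    def __init__(self, num_speakers=10, dim=256, mel_bins=80, layers=4):
        super().__init__()
        self.timbre_table = nn.Embedding(num_speakers, dim)   # third embedding module
        self.decoder = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, nhead=2, dim_feedforward=1024,
                                        batch_first=True) for _ in range(layers)])
        self.mel_out = nn.Linear(dim, mel_bins)

    def forward(self, frame_text_enc, speaker_id):            # [B, T, dim], [B]
        timbre = self.timbre_table(speaker_id).unsqueeze(1)   # [B, 1, dim]
        x = frame_text_enc + timbre                           # tone characterization enters only here
        for layer in self.decoder:
            x = layer(x)
        return self.mel_out(x)                                # predicted Mel frames [B, T, mel_bins]
```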
Referring to FIG. 3, in the embodiments of this application the tone characterization output by the third embedding module 36 is fed only to the decoder 35; the encoder 32 and the prosody prediction module 33 are independent of the tone characterization. Only the tone characterization and the decoder 35 therefore affect the tone color features contained in the predicted acoustic features, which decouples the text from the tone color features. In addition, the style characterization output by the second embedding module 34 is not added directly to the output of the encoder 32 but is used inside the prosody prediction module 33. The prosody prediction module 33 predicts scalar prosodic features such as the duration, pitch, and energy features of each phoneme, and only these predicted scalars, not the style characterization itself, are added to the backbone network; the style characterization therefore never appears in the backbone network. As a result, even if the style characterization contains some tone color information because style features and tone color features are strongly correlated, the prosody prediction module only predicts the prosodic features corresponding to the style features and does not rely on any tone color information in the style characterization. This avoids confusion between the speaker's style characterization and tone color features: the prediction of the prosodic features does not depend on the tone color features, and the style characterization does not affect the tone color features, which achieves a high-quality decoupling of tone color features and style features.
Step 207, training the speech synthesis model based on the predicted acoustic features and the sample acoustic features of each first sample audio.
The specific implementation process and principle of step 207 may refer to the description of the foregoing embodiments, which is not repeated herein.
At step 208, at least one sample acoustic feature of the second sample audio containing noise, a corresponding second sequence of phonemes, and a corresponding speaker identification are obtained.
The sample acoustic features of a second sample audio may be, for example, Mel-spectrogram features, and are obtained by acoustic feature extraction from the second sample audio.
The second phoneme sequence is obtained by phoneme conversion of the text corresponding to the second sample audio.
The speaker identifier corresponding to a second sample audio may be the same as or different from the speaker identifiers corresponding to the first sample audios; this application does not limit this.
Step 209, training the prosody prediction module based on the at least one sample acoustic feature of the second sample audio comprising noise, the corresponding second phoneme sequence, and the corresponding speaker identification.
In one embodiment of the present application, the first sample audio when the speech synthesis model is trained as a whole is clean, noise-free audio.
It can further be understood that the pitch and energy features of phonemes are not strongly affected by noise, and the influence of noise on these features can be further reduced by Gaussian normalization; the duration features of phonemes are obtained by forced alignment and are likewise not affected by noise. The pitch, energy, and duration features in the embodiments of this application therefore do not have to be extracted from strictly noise-free audio: even if the audio contains noise, relatively accurate prosodic features can still be extracted by the prosody prediction module. Accordingly, after the speech synthesis model has been trained as a whole, training data containing noise can be used to train the prosody prediction module in the speech synthesis model so that it learns the style features in that training data.
For example, in the embodiments of this application, after the whole speech synthesis model has been trained with noise-free first sample audio, a large amount of live-streaming sales audio for a certain product can be obtained from the Internet and used as second sample audio corresponding to one speaker identifier. Only the model parameters of the prosody prediction module in the speech synthesis model are then updated, while the parameters of the other parts of the model are kept fixed, so that only the prosody prediction module learns the style features in the second sample audio. After the prosody prediction module has been trained on the second sample audio, a target text to be synthesized can be synthesized based on the speaker identifier of any noise-free first sample audio, the speaker identifier corresponding to the live-streaming audio, and the target text, to obtain target audio in the live-streaming sales style. Since the prosodic features are not affected by noise, the quality of the resulting target audio is not degraded and remains high.
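A sketch of this second training stage is shown below; the attribute name prosody_prediction is a hypothetical handle on the prosody prediction module inside the trained model, not a name from this application.

```python
# Only the prosody prediction module is updated on the noisy second sample audio;
# the rest of the speech synthesis model stays frozen.
import torch

def build_style_finetune_optimizer(speech_model, lr=1e-4):
    for param in speech_model.parameters():
        param.requires_grad = False                   # freeze encoder, decoder, embeddings
    for param in speech_model.prosody_prediction.parameters():
        param.requires_grad = True                    # train only the prosody prediction module
    trainable = [p for p in speech_model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr)
```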
Training the prosody prediction module on second sample audio containing noise therefore lowers the requirements on the training data of the speech synthesis model and enables style transfer.
Based on the training method of the speech synthesis model in the above embodiment, a speech synthesis method is provided in the embodiment of the present application. The following describes a speech synthesis method provided in the embodiment of the present application.
It should be noted that the speech synthesis method provided in the embodiments of this application is executed by a speech synthesis apparatus. The apparatus may be an electronic device, or may be configured in an electronic device, so that by executing the speech synthesis method provided in the embodiments of this application, the tone color features and style features in the audio are decoupled and the same speech synthesis model can flexibly generate target audio with different tone colors and different styles, which improves the flexibility of the model, broadens its range of application, and reduces the number of models required for style transfer.
The electronic device may be a personal computer (Personal Computer, abbreviated as PC), a cloud device, a mobile device, a server, etc., and the mobile device may be any hardware device such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, a vehicle-mounted device, etc., which is not limited in this application.
Fig. 5 is a flow chart of a speech synthesis method according to a third embodiment of the present application. As shown in fig. 5, the speech synthesis method may comprise the following steps 501-504.
Step 501, a third phoneme sequence corresponding to a target text to be synthesized is obtained, and a first speaker identifier and a second speaker identifier are obtained from a candidate identifier set.
The target text to be synthesized may be any text in any language, for example Chinese, English, or Japanese. The target text may be, for example, news text, entertainment text, or chat text, and may be in a single language or a mixture of languages; this disclosure does not limit this.
The third phoneme sequence is a phoneme sequence obtained by performing phoneme conversion on the target text.
The candidate identifier set consists of the speaker identifiers corresponding to the first sample audios used when training the speech synthesis model.
The first speaker identifier uniquely identifies the first speaker and may be the identifier of any speaker in the candidate identifier set; the second speaker identifier uniquely identifies the second speaker and may likewise be any speaker identifier in the candidate identifier set.
It should be noted that the data used in speech synthesis in the embodiments of the present disclosure may be provided and authorized by users, obtained from public data sets, or obtained by other means that comply with relevant laws and regulations; this application does not limit the source.
Step 502, inputting the third phoneme sequence and the first speaker identification into an encoding layer of the speech synthesis model to determine prosodic features of the speaker corresponding to the first speaker identification on each phoneme in the third phoneme sequence based on the style characterization corresponding to the first speaker identification and the third phoneme sequence, and determining text encoding of the third phoneme sequence on the audio frame level based on the third phoneme sequence and the prosodic features.
The style characterization corresponding to the first speaker identifier characterizes the style features of the first speaker and may be the Style embedding corresponding to the first speaker identifier.
The prosodic features of the speaker corresponding to the first speaker identifier (i.e., the first speaker) on each phoneme in the third phoneme sequence represent the first speaker's pronunciation style on each phoneme and may include features such as pitch, duration, and energy.
The text encoding of the third phoneme sequence at the audio frame level consists of the text encodings of the audio frames into which the third phoneme sequence is divided. The text encoding of each audio frame includes the features of the text corresponding to that frame and the first speaker's prosodic features on that frame, which represent the first speaker's pronunciation style on the frame and may include features such as pitch, duration, and energy.
In one embodiment of the application, the speech synthesis model may include an encoding layer. The first speaker identifier and the third phoneme sequence may be input into the encoding layer, which queries the embedding table with the first speaker identifier to obtain the corresponding style characterization, determines the prosodic features of the first speaker on each phoneme in the third phoneme sequence based on that style characterization and the third phoneme sequence, and determines the text encoding of the third phoneme sequence at the audio frame level based on the third phoneme sequence and the prosodic features.
In one embodiment of the present application, the coding layer includes a first embedding module, an encoder, a prosody prediction module, and a second embedding module connected in sequence. Accordingly, step 502 may be implemented by:
Inputting the first speaker identification into a second embedding module to obtain a style characterization corresponding to the first speaker identification; inputting the third phoneme sequence into a first embedding module to obtain text representation of each phoneme in the third phoneme sequence, and inputting the text representation of each phoneme into an encoder to obtain text codes of each phoneme; inputting the text codes and style characterization of each phoneme into a prosody prediction module, obtaining prosody characteristics of a speaker corresponding to the first speaker identification on each phoneme in the third phoneme sequence, and determining the text codes of the third phoneme sequence on the audio frame level based on the third phoneme sequence and the prosody characteristics.
In one embodiment of the present application, the prosody prediction module includes a prosody prediction unit and a prosody processing unit; prosodic features include pitch features, energy features, and duration features.
Correspondingly, inputting the text codes and the style characterization of each phoneme into the prosody prediction module to obtain the prosodic features of the speaker corresponding to the first speaker identification on each phoneme in the third phoneme sequence, and determining the text encoding of the third phoneme sequence at the audio frame level based on the third phoneme sequence and the prosodic features, includes: inputting the text codes and the style characterization of each phoneme in the third phoneme sequence into the prosody prediction unit to obtain the pitch features, energy features and duration features of the first speaker, corresponding to the first speaker identification, on each phoneme; and inputting the text codes of each phoneme together with the pitch features, energy features and duration features of the first speaker on each phoneme into the prosody processing unit, so as to fuse the pitch feature and the energy feature of the first speaker on each phoneme with the text code of that phoneme to obtain a fused code of each phoneme, and to expand the fused code of each phoneme to the audio frame level based on the duration feature of the first speaker on each phoneme, obtaining the text encoding of the third phoneme sequence at the audio frame level.
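The prosody prediction unit and the prosody processing unit can likewise be sketched as follows. The sketch assumes a log-duration parameterization, additive fusion of projected pitch and energy scalars, and a simple regressor architecture; these are illustrative assumptions rather than details taken from the embodiment.

import torch
import torch.nn as nn

class ProsodyPredictionModule(nn.Module):
    """Prosody prediction unit (pitch/energy/duration regressors) plus prosody processing unit
    (fusion of pitch/energy into the text codes and expansion to the audio frame level)."""

    def __init__(self, d_model=256):
        super().__init__()
        def regressor():  # one scalar output per phoneme
            return nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))
        self.pitch_head, self.energy_head, self.duration_head = regressor(), regressor(), regressor()
        self.pitch_proj = nn.Linear(1, d_model)    # project the scalars back to the model dimension
        self.energy_proj = nn.Linear(1, d_model)

    def forward(self, text_codes, style):
        # text_codes: (1, phonemes, d_model); style: (1, d_model); batch size 1 for clarity.
        cond = torch.cat([text_codes, style.unsqueeze(1).expand_as(text_codes)], dim=-1)
        pitch = self.pitch_head(cond)                          # (1, phonemes, 1), one scalar per phoneme
        energy = self.energy_head(cond)
        log_duration = self.duration_head(cond).squeeze(-1)    # assumed log-duration parameterization
        frames = log_duration.exp().round().clamp(min=1).long()   # audio frames per phoneme
        fused = text_codes + self.pitch_proj(pitch) + self.energy_proj(energy)  # fused code of each phoneme
        # Expand each phoneme's fused code to the audio frame level according to its duration.
        frame_codes = torch.repeat_interleave(fused[0], frames[0], dim=0).unsqueeze(0)
        return frame_codes, pitch, energy, frames

module = ProsodyPredictionModule()
frame_codes, pitch, energy, frames = module(torch.randn(1, 7, 256), torch.randn(1, 256))
print(frame_codes.shape)  # (1, total_frames, 256) with total_frames = frames.sum()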
The process of determining the text encoding of the third phoneme sequence at the audio frame level may refer to the process of determining the text encoding of the first phoneme sequence at the audio frame level, and the related description will not be repeated here.
Unlike approaches that encode prosody with complex representations such as multi-scale prosody features or bottleneck features of prosody (linguistic features of the audio), the prosody prediction module in the embodiment of the application predicts only the simplest prosodic features: pitch, energy, and duration. Because these style features, namely the duration, pitch, and energy of each phoneme, are scalars, the predicted prosodic features can be modified as needed during speech synthesis: the overall speech rate of the finally synthesized target audio can be controlled, the speech can be slowed down or sped up at will, the pitch and energy of the pronunciation can be adjusted at will, and control at the phoneme and phrase level can be realized.
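Because the predicted pitch, energy and duration are plain per-phoneme scalars, they can be adjusted before the prosody processing unit expands the codes to the frame level. A small illustrative helper is shown below; the control conventions (multiplicative speed factor, additive pitch offset, multiplicative energy gain) are assumptions chosen only to make the idea concrete.

import torch

def control_prosody(pitch, energy, frames, speed=1.0, pitch_shift=0.0, energy_scale=1.0):
    # speed > 1.0 shortens phoneme durations (faster speech); pitch_shift is an additive
    # offset in the model's pitch units; energy_scale multiplies the predicted energy.
    frames = (frames.float() / speed).round().clamp(min=1).long()
    return pitch + pitch_shift, energy * energy_scale, frames

# Example: synthesize 20% faster with a slightly raised pitch.
pitch = torch.randn(1, 7, 1)
energy = torch.randn(1, 7, 1)
frames = torch.randint(1, 10, (1, 7))
pitch, energy, frames = control_prosody(pitch, energy, frames, speed=1.2, pitch_shift=0.5)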
Step 503, inputting the text code and the second speaker identification of the third phoneme sequence at the audio frame level into a decoding layer of the speech synthesis model to decode based on the tone representation and the text code corresponding to the second speaker identification, thereby obtaining the acoustic feature.
The speech synthesis model is obtained by training according to the training method of the speech synthesis model in any one of the foregoing embodiments.
The tone color representation corresponding to the second speaker identifier is used for representing tone color characteristics of the second speaker corresponding to the second speaker identifier, and may be Speaker embedding corresponding to the second speaker identifier.
The acoustic features characterize the target audio and may be, for example, a Mel spectrogram.
In one embodiment of the present application, the speech synthesis model may include a decoding layer, and after inputting the text code of the third phoneme sequence at the audio frame level and the corresponding second speaker identifier into the decoding layer, the decoding layer may query the embedding table based on the second speaker identifier to obtain a tone representation corresponding to the second speaker identifier, and may further decode based on the tone representation and the text code to obtain the acoustic feature. The acoustic features carry the features of the text in the target text, the style features of the first speaker and the tone features of the second speaker.
In one embodiment of the present application, a decoding layer of the speech synthesis model includes a decoder connected to the encoding layer and a third embedding module connected to the decoder.
Accordingly, step 503 may be implemented by: inputting the second speaker identification into a third embedding module to obtain tone characterization corresponding to the second speaker identification; the timbre token and the text encoding of the third phoneme sequence at the audio frame level are input to a decoder for decoding based on the timbre token and the text encoding to obtain acoustic features.
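For illustration, a decoding layer of this shape might be sketched as follows; the convolutional decoder, the 80-bin mel setting and the additive timbre conditioning are assumptions, not details taken from the embodiment.

import torch
import torch.nn as nn

class DecodingLayerSketch(nn.Module):
    """Third embedding module (timbre table) plus a decoder that maps the frame-level
    text encoding, conditioned on the timbre characterization, to acoustic features."""

    def __init__(self, n_speakers=10, d_model=256, n_mels=80):
        super().__init__()
        self.timbre_embedding = nn.Embedding(n_speakers, d_model)   # third embedding module
        self.decoder = nn.Sequential(                               # decoder backbone (assumed)
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=5, padding=2), nn.ReLU())
        self.mel_out = nn.Linear(d_model, n_mels)                   # acoustic features, e.g. mel frames

    def forward(self, frame_codes, speaker_id):
        # frame_codes: (1, frames, d_model); speaker_id: (1,) is the second speaker identification.
        timbre = self.timbre_embedding(speaker_id).unsqueeze(1)     # timbre characterization
        hidden = self.decoder((frame_codes + timbre).transpose(1, 2)).transpose(1, 2)
        return self.mel_out(hidden)                                 # (1, frames, n_mels)

# Example: decode 40 frames of text encoding with the timbre of speaker identifier 5.
mel = DecodingLayerSketch()(torch.randn(1, 40, 256), torch.tensor([5]))
print(mel.shape)  # torch.Size([1, 40, 80])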
The process of obtaining the acoustic features through the decoding layer may refer to the process of obtaining the predicted acoustic features through the decoding layer in the training process of the speech synthesis model, and the related description will not be repeated here.
Step 504, generating target audio corresponding to the target text based on the acoustic features.
In one embodiment of the present application, the acoustic feature may be converted into a speech waveform to obtain a target audio corresponding to the target text. The target audio has style characteristics of a first speaker and tone characteristics of a second speaker.
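The description does not name the component that converts the acoustic features into a waveform; a neural vocoder or a classical inversion method could play this role. Purely as a stand-in, the sketch below inverts an assumed linear-magnitude mel spectrogram using torchaudio's mel inversion followed by Griffin-Lim; the sample rate, FFT size and magnitude convention are all assumptions.

import torch
import torchaudio

def mel_to_waveform(mel, sample_rate=22050, n_fft=1024, n_mels=80):
    # mel: (1, frames, n_mels), assumed to hold linear magnitudes (exponentiate first if log-mel).
    mel = mel.transpose(1, 2).clamp(min=1e-5)                       # -> (1, n_mels, frames)
    inverse_mel = torchaudio.transforms.InverseMelScale(
        n_stft=n_fft // 2 + 1, n_mels=n_mels, sample_rate=sample_rate)
    griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, power=1.0)
    return griffin_lim(inverse_mel(mel))                            # (1, samples)

# waveform = mel_to_waveform(mel)
# torchaudio.save("target_audio.wav", waveform, 22050)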
In the embodiment of the application, the encoding layer of the speech synthesis model does not use the timbre characterization when determining the text encoding of the third phoneme sequence at the audio frame level; the timbre characterization is used only by the decoding layer to determine the acoustic features, and therefore affects only the timbre features carried in the acoustic features, so that decoupling of the text and the timbre features is realized. In addition, when the encoding layer determines the text encoding of the third phoneme sequence at the audio frame level according to the style characterization, the prosodic features determined from the style characterization are related only to the style features and not to the timbre features, so the encoding layer does not depend on the timbre features, and decoupling of the style features and the timbre features is realized.
Because the speech synthesis model decouples the timbre features and the style features in the audio, the audio of each speaker used as training data may have only a single style, and the speech synthesis model learns each speaker's own timbre features and style features during training. Therefore, with the speech synthesis model trained in the embodiment of the application, target audio having the timbre features corresponding to one speaker identifier and the style features corresponding to another speaker identifier can be synthesized from any target text to be synthesized and any two speaker identifiers in the candidate identifier set. For example, when the training data consists of the first sample audio of 10 speakers, the trained speech synthesis model can synthesize target audio for 10×10 combinations of timbre features and style features. Therefore, the same speech synthesis model can be used to flexibly generate target audio of different timbres and different styles, which improves the flexibility of the speech synthesis model, expands its application range, and reduces the number of models required for style migration.
In addition, because the backbone network in the embodiment of the application does not adopt the attention module used in the related art, problems such as dropped sounds and repeated pronunciation caused by attention-mechanism failures are avoided, and the quality of the generated target audio is high.
For example, in the embodiment of the present application, the identifier of speaker A, the identifier of speaker B, and the target text to be synthesized may be input into the speech synthesis model, so as to synthesize, based on the style characteristics of speaker A and the timbre characteristics of speaker B, a target audio a having the style characteristics of speaker A and the timbre characteristics of speaker B. Speaker A is male and speaker B is female. The upper graph in fig. 6 is a schematic diagram of the pitch characteristics of the target audio a. The identifier of speaker A and the same target text to be synthesized may also be input into the speech synthesis model, so that synthesis is performed based on the style characteristics of speaker A and the timbre characteristics of speaker A, obtaining a target audio b having the style characteristics of speaker A and the timbre characteristics of speaker A. The lower graph in fig. 6 is a schematic diagram of the pitch characteristics of the target audio b.
Referring to the pitch contours of fig. 6, since speech synthesis is performed using the style characteristics of speaker A in both cases, the pitch characteristics of the target audio a and the target audio b are substantially identical, that is, their styles are consistent. Because speaker A is male and speaker B is female, the fundamental frequency of speaker B's timbre in the upper graph of fig. 6 is significantly higher than that of speaker A's timbre in the lower graph of fig. 6.
As another example, where the target text is a news text, in the embodiment of the present application the identifier of speaker B, the identifier of speaker C, and the target text to be synthesized may be input into the speech synthesis model, so as to synthesize, based on the style characteristics of speaker C and the timbre characteristics of speaker B, a target audio c having the style characteristics of speaker C and the timbre characteristics of speaker B. Both speaker B and speaker C are female, and the style of speaker C is a news style. The upper graph in fig. 7 is a schematic diagram of the pitch characteristics of the target audio c. The identifier of speaker C and the same target text to be synthesized may also be input into the speech synthesis model, so that synthesis is performed based on the style characteristics of speaker C and the timbre characteristics of speaker C, obtaining a target audio d having the style characteristics of speaker C and the timbre characteristics of speaker C. The lower graph in fig. 7 is a schematic diagram of the pitch characteristics of the target audio d.
Referring to the pitch contours of fig. 7, since speech synthesis is performed using the style characteristics of speaker C in both cases, the pitch characteristics of the target audio c and the target audio d are substantially identical, that is, their styles are consistent. Since both speaker B and speaker C are female, there is no significant difference between the fundamental frequency of speaker B's timbre in the upper graph of fig. 7 and that of speaker C's timbre in the lower graph of fig. 7.
As a further example, where the target text is a live-streaming e-commerce text, in the embodiment of the present application the identifier of speaker B, the identifier of speaker D, and the target text to be synthesized may be input into the speech synthesis model, so as to synthesize, based on the style characteristics of speaker D and the timbre characteristics of speaker B, a target audio e having the style characteristics of speaker D and the timbre characteristics of speaker B. Both speaker B and speaker D are female, and the style of speaker D is a live-streaming sales style. The upper graph in fig. 8 is a schematic diagram of the pitch characteristics of the target audio e. The identifier of speaker D and the same target text to be synthesized may also be input into the speech synthesis model, so that synthesis is performed based on the style characteristics of speaker D and the timbre characteristics of speaker D, obtaining a target audio f having the style characteristics of speaker D and the timbre characteristics of speaker D. The lower graph in fig. 8 is a schematic diagram of the pitch characteristics of the target audio f.
Referring to the pitch contours of fig. 8, since speech synthesis is performed using the style characteristics of speaker D in both cases, the pitch characteristics of the target audio e and the target audio f are substantially identical, that is, their styles are consistent. Since both speaker B and speaker D are female, there is no significant difference between the fundamental frequency of speaker B's timbre in the upper graph of fig. 8 and that of speaker D's timbre in the lower graph of fig. 8.
These three examples clearly demonstrate the effectiveness of the multi-timbre, multi-style speech synthesis model based on decoupling of timbre features and style features provided by the embodiment of the application, as well as the effectiveness of the decoupling of timbre features and style features itself.
In summary, the speech synthesis method provided in the embodiment of the present application obtains a third phoneme sequence corresponding to a target text to be synthesized, obtains a first speaker identifier and a second speaker identifier from a candidate identifier set, and inputs the third phoneme sequence and the first speaker identifier into the encoding layer of the speech synthesis model, so as to determine prosodic features of the speaker corresponding to the first speaker identifier on each phoneme in the third phoneme sequence based on the style characterization corresponding to the first speaker identifier and the third phoneme sequence, and to determine the text encoding of the third phoneme sequence at the audio frame level based on the third phoneme sequence and the prosodic features; the text encoding of the third phoneme sequence at the audio frame level and the second speaker identifier are then input into the decoding layer of the speech synthesis model to decode based on the timbre characterization corresponding to the second speaker identifier and the text encoding to obtain acoustic features, and target audio corresponding to the target text is generated based on the acoustic features. Decoupling of the timbre features and the style features in the audio is thereby realized, target audio of different timbres and different styles can be flexibly generated with the same speech synthesis model, the flexibility of the speech synthesis model is improved, its application range is expanded, and the number of models required for style migration is reduced.
Fig. 9 is a schematic structural diagram of a training device for a speech synthesis model according to a fourth embodiment of the present application.
As shown in fig. 9, the training apparatus 900 of the speech synthesis model may include: a first acquisition module 901, a first processing module 902, a second processing module 903, and a first training module 904.
The first obtaining module 901 is configured to obtain sample acoustic features of a plurality of first sample audios, corresponding first phoneme sequences, and corresponding speaker identifiers, where at least one first sample audio corresponding to the same speaker identifier has a single style feature;
the first processing module 902 is configured to input a first phoneme sequence corresponding to the first sample audio and a corresponding speaker identifier into an encoding layer of the speech synthesis model, determine prosodic features of a speaker corresponding to the speaker identifier on each phoneme in the first phoneme sequence based on a style characterization corresponding to the speaker identifier and the first phoneme sequence, and determine text encoding of the first phoneme sequence on an audio frame level based on the first phoneme sequence and the prosodic features;
a second processing module 903, configured to input a text code and a corresponding speaker identifier of the first phoneme sequence at an audio frame level into a decoding layer of the speech synthesis model, so as to decode based on a tone representation and the text code corresponding to the speaker identifier, and obtain a predicted acoustic feature of the first sample audio;
A first training module 904 for training the speech synthesis model based on the predicted acoustic features and the sample acoustic features of each first sample audio.
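The first training module trains the speech synthesis model based on the predicted acoustic features and the sample acoustic features of each first sample audio; the exact loss is not specified in this description, so the sketch below assumes a simple frame-wise L1 reconstruction loss on mel spectrograms, purely for illustration.

import torch
import torch.nn.functional as F

def acoustic_feature_loss(predicted_mel, sample_mel):
    # Frame-wise L1 distance between predicted and sample acoustic features (loss choice assumed).
    return F.l1_loss(predicted_mel, sample_mel)

# Example with one utterance of 40 frames and 80 mel bins.
print(acoustic_feature_loss(torch.randn(1, 40, 80), torch.randn(1, 40, 80)))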
It should be noted that, the training device for a speech synthesis model provided in the embodiment of the present application may perform the training method for a speech synthesis model of the foregoing embodiment, where the training device for a speech synthesis model may be an electronic device, or may be configured in an electronic device, so as to implement decoupling of tone color features and style features in audio, so that audio corresponding to multiple speakers and having a single style feature may be used to train a speech synthesis model, so as to reduce recording cost of audio in training data, and further reduce training cost of the speech synthesis model.
The electronic device may be a PC, a cloud device, a mobile device, a server, or the like, and the mobile device may be any hardware device such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or a vehicle-mounted device, which is not limited in this application.
In one possible implementation manner of the embodiment of the present application, the coding layer includes a first embedding module, an encoder, a prosody prediction module, and a second embedding module that are sequentially connected; a first processing module 902 comprising:
The first processing unit is used for inputting the speaker identification into the second embedding module to obtain the style characterization corresponding to the speaker identification;
the second processing unit is used for inputting the first phoneme sequence into the first embedding module to obtain text representation of each phoneme in the first phoneme sequence, and inputting the text representation of each phoneme into the encoder to obtain text codes of each phoneme;
and the third processing unit is used for inputting the text codes and style characterization of each phoneme into the prosody prediction module, obtaining prosody characteristics of the speaker corresponding to the speaker identification on each phoneme in the first phoneme sequence, and determining the text codes of the first phoneme sequence on the audio frame level based on the first phoneme sequence and the prosody characteristics.
In one possible implementation of the embodiment of the present application, the prosody prediction module includes a prosody prediction unit and a prosody processing unit; prosodic features include pitch features, energy features, and duration features; a third processing unit for:
inputting the text codes and style characterization of each phoneme into a prosody prediction unit to obtain the pitch characteristic, the energy characteristic and the duration characteristic of a speaker on each phoneme corresponding to the speaker identification;
inputting the text codes of each phoneme and the pitch characteristic, the energy characteristic and the duration characteristic of the speaker on each phoneme into a prosody processing unit to fuse the pitch characteristic and the energy characteristic of the speaker on each phoneme with the text codes of each phoneme to obtain fused codes of each phoneme, and expanding the fused codes of each phoneme to an audio frame level based on the duration characteristic of the speaker on each phoneme to obtain the text codes of the first phoneme sequence on the audio frame level.
In one possible implementation manner of the embodiment of the application, the decoding layer includes a decoder connected with the encoding layer and a third embedding module connected with the decoder; the second processing module 903 includes:
the fourth processing unit is used for inputting the speaker identification into the third embedding module to obtain tone characterization corresponding to the speaker identification;
a fifth processing unit for inputting the timbre representation and the text encoding of the first phoneme sequence at the audio frame level to a decoder for decoding based on the timbre representation and the text encoding to obtain the predicted acoustic feature.
In one possible implementation manner of the embodiment of the present application, the training apparatus 900 of the speech synthesis model further includes:
the second acquisition module is used for acquiring at least one sample acoustic feature of second sample audio containing noise, a corresponding second phoneme sequence and a corresponding speaker identifier;
and a second training module for training the prosody prediction module based on the at least one sample acoustic feature of the second sample audio including noise, the corresponding second phoneme sequence, and the corresponding speaker identification.
It should be noted that the explanation in the foregoing embodiment of the method for training a speech synthesis model is also applicable to the training device for a speech synthesis model in this embodiment, and will not be repeated here.
According to the training device of the speech synthesis model, the sample acoustic characteristics of a plurality of first sample audios, the corresponding first phoneme sequences and the corresponding speaker identifications are obtained, and at least one first sample audio corresponding to the same speaker identification has a single style characteristic; inputting a first phoneme sequence corresponding to the first sample audio and a corresponding speaker identification into a coding layer of a speech synthesis model to determine prosodic features of a speaker corresponding to the speaker identification on each phoneme in the first phoneme sequence based on style characterization corresponding to the speaker identification and the first phoneme sequence, and determining text coding of the first phoneme sequence on an audio frame level based on the first phoneme sequence and the prosodic features; inputting the text codes of the first phoneme sequence on the audio frame level and the corresponding speaker identifications into a decoding layer of a speech synthesis model to decode based on the tone characterization and the text codes corresponding to the speaker identifications, thereby obtaining the predicted acoustic characteristics of the first sample audio; the speech synthesis model is trained based on the predicted acoustic features and the sample acoustic features of each first sample audio. Therefore, decoupling of tone color characteristics and style characteristics in the audio is achieved, so that the voice synthesis model can be trained by utilizing the audio with single style characteristics corresponding to each speaker, recording cost of the audio in training data is reduced, and training cost of the voice synthesis model is further reduced.
Fig. 10 is a schematic structural diagram of a speech synthesis apparatus according to a fifth embodiment of the present application.
As shown in fig. 10, the voice synthesizing apparatus 1000 may include: a third acquisition module 1001, a third processing module 1002, a fourth processing module 1003, and a generation module 1004.
The third obtaining module 1001 is configured to obtain a third phoneme sequence corresponding to a target text to be synthesized, and obtain a first speaker identifier and a second speaker identifier from the candidate identifier set;
a third processing module 1002, configured to input a third phoneme sequence and a first speaker identifier into an encoding layer of a speech synthesis model, to determine prosodic features of a speaker corresponding to the first speaker identifier on each phoneme in the third phoneme sequence based on a style characterization corresponding to the first speaker identifier and the third phoneme sequence, and to determine text encoding of the third phoneme sequence on an audio frame level based on the third phoneme sequence and the prosodic features;
a fourth processing module 1003, configured to input the text encoding of the third phoneme sequence at the audio frame level and the second speaker identifier into a decoding layer of the speech synthesis model, so as to decode based on the tone representation and the text encoding corresponding to the second speaker identifier, and obtain an acoustic feature;
The generating module 1004 is configured to generate target audio corresponding to the target text based on the acoustic feature.
It should be noted that, the voice synthesis device provided in the embodiment of the present application may perform the voice synthesis method of the foregoing embodiment, where the voice synthesis device may be an electronic device, or may be configured in an electronic device, so as to implement decoupling of tone features and style features in audio, so that target audio with different tone colors and different styles is flexibly generated by using the same voice synthesis model, flexibility of the voice synthesis model is improved, application range of the voice synthesis model is expanded, and number of models required for style migration is reduced.
The electronic device may be a PC, a cloud device, a mobile device, a server, or the like, and the mobile device may be any hardware device such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or a vehicle-mounted device, which is not limited in this application.
It should be noted that the explanation in the foregoing embodiment of the speech synthesis method is also applicable to the speech synthesis apparatus of this embodiment, and will not be repeated here.
According to the speech synthesis apparatus of the embodiment of the application, a third phoneme sequence corresponding to a target text to be synthesized is obtained, a first speaker identifier and a second speaker identifier are obtained from a candidate identifier set, and the third phoneme sequence and the first speaker identifier are input into the encoding layer of the speech synthesis model, so that prosodic features of the speaker corresponding to the first speaker identifier on each phoneme in the third phoneme sequence are determined based on the style characterization corresponding to the first speaker identifier and the third phoneme sequence, and the text encoding of the third phoneme sequence at the audio frame level is determined based on the third phoneme sequence and the prosodic features; the text encoding of the third phoneme sequence at the audio frame level and the second speaker identifier are input into the decoding layer of the speech synthesis model, acoustic features are obtained by decoding based on the timbre characterization corresponding to the second speaker identifier and the text encoding, and target audio corresponding to the target text is generated based on the acoustic features. In this way, target audio of different timbres and different styles can be flexibly generated with the same speech synthesis model, the flexibility of the speech synthesis model is improved, its application range is expanded, and the number of models required for style migration is reduced.
In order to achieve the above embodiments, the present application further proposes an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for training a speech synthesis model as set forth in any of the foregoing embodiments of the application or to perform the method for speech synthesis as set forth in any of the foregoing embodiments of the application.
To achieve the above embodiments, the present application further proposes a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform a training method of a speech synthesis model as set forth in any of the foregoing embodiments of the present application, or to perform a speech synthesis method as set forth in any of the foregoing embodiments of the present application.
To achieve the above embodiments, the present application also proposes a computer program product comprising a computer program which, when executed by a processor, implements a training method of a speech synthesis model as set forth in any of the foregoing embodiments of the present application, or implements a speech synthesis method as set forth in any of the foregoing embodiments of the present application.
Fig. 11 illustrates a block diagram of an exemplary electronic device suitable for use in implementing embodiments of the present application. The electronic device 1100 shown in fig. 11 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present application.
As shown in fig. 11, the electronic device 1100 is embodied in the form of a general purpose computing device. Components of electronic device 1100 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic device 1100 typically includes a variety of computer system-readable media. Such media can be any available media that can be accessed by the electronic device 1100 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory; hereinafter: RAM) 30 and/or cache memory 32. The electronic device 1100 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 11, commonly referred to as a "hard disk drive"). Although not shown in fig. 11, a magnetic disk drive for reading from and writing to a removable nonvolatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable nonvolatile optical disk (e.g., a compact disk read only memory (Compact Disc Read Only Memory; hereinafter CD-ROM), digital versatile read only optical disk (Digital Video Disc Read Only Memory; hereinafter DVD-ROM), or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the present application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods in the embodiments described herein.
The electronic device 1100 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the electronic device 1100, and/or any devices (e.g., network card, modem, etc.) that enable the electronic device 1100 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Also, the electronic device 1100 may communicate with one or more networks, such as a local area network (Local Area Network; hereinafter: LAN), a wide area network (Wide Area Network; hereinafter: WAN) and/or a public network, such as the Internet, via the network adapter 20. As shown, the network adapter 20 communicates with other modules of the electronic device 1100 over the bus 18. It should be appreciated that although not shown in fig. 11, other hardware and/or software modules may be used in connection with electronic device 1100, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing the methods mentioned in the foregoing embodiments.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" is at least two, such as two, three, etc., unless explicitly defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like. Although embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that changes, modifications, substitutions, and variations may be made to the above embodiments by those of ordinary skill in the art within the scope of the application.

Claims (14)

1. A method of training a speech synthesis model, the method comprising:
acquiring sample acoustic features of a plurality of first sample audios, corresponding first phoneme sequences and corresponding speaker identifications, wherein at least one first sample audio corresponding to the same speaker identification has a single style feature;
inputting a first phoneme sequence corresponding to the first sample audio and a corresponding speaker identification into a coding layer of a speech synthesis model to determine prosodic features of the speaker corresponding to the speaker identification on each phoneme in the first phoneme sequence based on a style characterization corresponding to the speaker identification and the first phoneme sequence, and determining text coding of the first phoneme sequence on an audio frame level based on the first phoneme sequence and the prosodic features;
Inputting a text code and a corresponding speaker identifier of the first phoneme sequence at the audio frame level into a decoding layer of the speech synthesis model to decode based on a tone representation and the text code corresponding to the speaker identifier, so as to obtain a predicted acoustic feature of the first sample audio;
the speech synthesis model is trained based on the predicted acoustic features of each of the first sample audio and the sample acoustic features.
2. The method of claim 1, wherein the encoding layer comprises a first embedding module, an encoder, a prosody prediction module, and a second embedding module connected in sequence;
inputting a first phoneme sequence corresponding to the first sample audio and a corresponding speaker identifier into an encoding layer of a speech synthesis model to determine prosodic features of the speaker corresponding to the speaker identifier on each phoneme in the first phoneme sequence based on a style characterization corresponding to the speaker identifier and the first phoneme sequence, and determining text encoding of the first phoneme sequence on an audio frame level based on the first phoneme sequence and the prosodic features, wherein the method comprises the following steps:
Inputting the speaker identification into the second embedding module to obtain a style characterization corresponding to the speaker identification;
inputting the first phoneme sequence into the first embedding module to obtain text representations of phonemes in the first phoneme sequence, and inputting the text representations of the phonemes into the encoder to obtain text codes of the phonemes;
inputting the text codes of the phonemes and the style characterization into the prosody prediction module to obtain prosody characteristics of the speaker on each phoneme in the first phoneme sequence corresponding to the speaker identification, and determining the text codes of the first phoneme sequence on the audio frame level based on the first phoneme sequence and the prosody characteristics.
3. The method according to claim 2, wherein the prosody prediction module includes a prosody prediction unit and a prosody processing unit; the prosodic features include pitch features, energy features, and duration features;
inputting the text codes of the phonemes and the style characterization into the prosody prediction module to obtain prosody features of the speaker on each phoneme in the first phoneme sequence corresponding to the speaker identification, and determining the text codes of the first phoneme sequence on the audio frame level based on the first phoneme sequence and the prosody features, wherein the text codes comprise:
Inputting the text codes and the style characterization of each phoneme into the prosody prediction unit to obtain the pitch characteristic, the energy characteristic and the duration characteristic of the speaker on each phoneme, which correspond to the speaker identification;
inputting the text codes of the phonemes and the pitch characteristics, the energy characteristics and the duration characteristics of the speaker on the phonemes into a prosody processing unit to fuse the pitch characteristics and the energy characteristics of the speaker on the phonemes with the text codes of the phonemes to obtain fusion codes of the phonemes, and expanding the fusion codes of the phonemes to an audio frame level based on the duration characteristics of the speaker on the phonemes to obtain the text codes of the first phoneme sequence on the audio frame level.
4. A method according to any of claims 1-3, characterized in that the decoding layer comprises a decoder connected to the encoding layer and a third embedding module connected to the decoder;
inputting the text code and the corresponding speaker identifier of the first phoneme sequence at the audio frame level into a decoding layer of the speech synthesis model to decode based on the corresponding timbre representation and the text code of the speaker identifier to obtain the predicted acoustic feature of the first sample audio, wherein the method comprises the following steps:
Inputting the speaker identification into the third embedding module to obtain tone characterization corresponding to the speaker identification;
inputting the timbre representation and the text encoding of the first phoneme sequence at the audio frame level to the decoder for decoding based on the timbre representation and the text encoding to obtain the predicted acoustic feature.
5. A method according to claim 2 or 3, wherein after training the speech synthesis model based on the predicted acoustic features of each of the first sample audio and the sample acoustic features, further comprising:
acquiring at least one sample acoustic feature of a second sample audio containing noise, a corresponding second phoneme sequence and a corresponding speaker identification;
training the prosody prediction module based on the at least one sample acoustic feature of the second sample audio comprising noise, the corresponding second sequence of phonemes, and the corresponding speaker identification.
6. A method of speech synthesis, the method comprising:
acquiring a third phoneme sequence corresponding to a target text to be synthesized, and acquiring a first speaker identification and a second speaker identification from a candidate identification set;
Inputting the third phoneme sequence and the first speaker identification into an encoding layer of a speech synthesis model, determining prosodic features of a speaker corresponding to the first speaker identification on each phoneme in the third phoneme sequence based on style characterization corresponding to the first speaker identification and the third phoneme sequence, and determining text encoding of the third phoneme sequence on an audio frame level based on the third phoneme sequence and the prosodic features;
inputting the text code of the third phoneme sequence on the audio frame level and the second speaker identification into a decoding layer of the speech synthesis model to decode based on the tone representation corresponding to the second speaker identification and the text code to obtain acoustic characteristics;
and generating target audio corresponding to the target text based on the acoustic features.
7. A training device for a speech synthesis model, the device comprising:
the first acquisition module is used for acquiring sample acoustic characteristics of a plurality of first sample audios, corresponding first phoneme sequences and corresponding speaker identifications, and at least one first sample audio corresponding to the same speaker identification has a single style characteristic;
The first processing module is used for inputting a first phoneme sequence corresponding to the first sample audio and a corresponding speaker identifier into an encoding layer of a speech synthesis model, determining prosodic features of the speaker corresponding to the speaker identifier on each phoneme in the first phoneme sequence based on a style characterization corresponding to the speaker identifier and the first phoneme sequence, and determining text encoding of the first phoneme sequence on an audio frame level based on the first phoneme sequence and the prosodic features;
the second processing module is used for inputting the text codes of the first phoneme sequences on the audio frame level and the corresponding speaker identifications into a decoding layer of the speech synthesis model so as to decode based on the tone characterization corresponding to the speaker identifications and the text codes, and obtain the predicted acoustic characteristics of the first sample audio;
and the first training module is used for training the voice synthesis model based on the predicted acoustic characteristics of each first sample audio and the sample acoustic characteristics.
8. The apparatus of claim 7, wherein the encoding layer comprises a first embedding module, an encoder, a prosody prediction module, and a second embedding module connected in sequence; the first processing module includes:
The first processing unit is used for inputting the speaker identification into the second embedding module to obtain a style characterization corresponding to the speaker identification;
the second processing unit is used for inputting the first phoneme sequence into the first embedding module to obtain text representation of each phoneme in the first phoneme sequence, and inputting the text representation of each phoneme into the encoder to obtain text codes of each phoneme;
and the third processing unit is used for inputting the text codes of the phonemes and the style characterization into the prosody prediction module, obtaining prosody characteristics of the speaker corresponding to the speaker identification on each phoneme in the first phoneme sequence, and determining the text codes of the first phoneme sequence on the audio frame level based on the first phoneme sequence and the prosody characteristics.
9. The apparatus according to claim 8, wherein the prosody prediction module includes a prosody prediction unit and a prosody processing unit; the prosodic features include pitch features, energy features, and duration features; the third processing unit is configured to:
inputting the text codes and the style characterization of each phoneme into the prosody prediction unit to obtain the pitch characteristic, the energy characteristic and the duration characteristic of the speaker on each phoneme, which correspond to the speaker identification;
Inputting the text codes of the phonemes and the pitch characteristics, the energy characteristics and the duration characteristics of the speaker on the phonemes into a prosody processing unit to fuse the pitch characteristics and the energy characteristics of the speaker on the phonemes with the text codes of the phonemes to obtain fusion codes of the phonemes, and expanding the fusion codes of the phonemes to an audio frame level based on the duration characteristics of the speaker on the phonemes to obtain the text codes of the first phoneme sequence on the audio frame level.
10. The apparatus according to any of claims 7-9, wherein the decoding layer comprises a decoder connected to the encoding layer and a third embedding module connected to the decoder; the second processing module includes:
the fourth processing unit is used for inputting the speaker identification into the third embedding module to obtain tone characterization corresponding to the speaker identification;
a fifth processing unit for inputting the timbre representation and the text encoding of the first phoneme sequence at the audio frame level to the decoder for decoding based on the timbre representation and the text encoding to obtain the predicted acoustic feature.
11. The apparatus according to claim 8 or 9, characterized in that the apparatus further comprises:
the second acquisition module is used for acquiring at least one sample acoustic feature of second sample audio containing noise, a corresponding second phoneme sequence and a corresponding speaker identifier;
and a second training module for training the prosody prediction module based on the at least one sample acoustic feature of the second sample audio containing noise, the corresponding second phoneme sequence, and the corresponding speaker identification.
12. A speech synthesis apparatus, the apparatus comprising:
the third acquisition module is used for acquiring a third phoneme sequence corresponding to the target text to be synthesized and acquiring a first speaker identifier and a second speaker identifier from the candidate identifier set;
a third processing module, configured to input the third phoneme sequence and the first speaker identifier into an encoding layer of a speech synthesis model, so as to determine prosodic features of a speaker corresponding to the first speaker identifier on each phoneme in the third phoneme sequence based on a style characterization corresponding to the first speaker identifier and the third phoneme sequence, and determine text encoding of the third phoneme sequence on an audio frame level based on the third phoneme sequence and the prosodic features;
A fourth processing module, configured to input, to a decoding layer of the speech synthesis model, a text encoding of the third phoneme sequence at an audio frame level and the second speaker identifier, so as to decode based on a tone representation corresponding to the second speaker identifier and the text encoding, thereby obtaining an acoustic feature;
and the generating module is used for generating target audio corresponding to the target text based on the acoustic characteristics.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5 or to perform the method of claim 6.
14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-5 or to perform the method of claim 6.
CN202310138459.7A 2023-02-14 2023-02-14 Training method of speech synthesis model, speech synthesis method and device Pending CN116052638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310138459.7A CN116052638A (en) 2023-02-14 2023-02-14 Training method of speech synthesis model, speech synthesis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310138459.7A CN116052638A (en) 2023-02-14 2023-02-14 Training method of speech synthesis model, speech synthesis method and device

Publications (1)

Publication Number Publication Date
CN116052638A true CN116052638A (en) 2023-05-02

Family

ID=86133184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310138459.7A Pending CN116052638A (en) 2023-02-14 2023-02-14 Training method of speech synthesis model, speech synthesis method and device

Country Status (1)

Country Link
CN (1) CN116052638A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392972A (en) * 2023-10-19 2024-01-12 北京邮电大学 Speech synthesis model training method and device based on contrast learning and synthesis method
CN117912446A (en) * 2023-12-27 2024-04-19 暗物质(北京)智能科技有限公司 Voice style migration system and method for deep decoupling of tone and style


Similar Documents

Publication Publication Date Title
Zen et al. Statistical parametric speech synthesis based on speaker and language factorization
CN112634856B (en) Speech synthesis model training method and speech synthesis method
CN116052638A (en) Training method of speech synthesis model, speech synthesis method and device
US20100057435A1 (en) System and method for speech-to-speech translation
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
JP2007249212A (en) Method, computer program and processor for text speech synthesis
WO2019245916A1 (en) Method and system for parametric speech synthesis
CN108831437A (en) A kind of song generation method, device, terminal and storage medium
CN113781995B (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN111710326A (en) English voice synthesis method and system, electronic equipment and storage medium
CN110164413B (en) Speech synthesis method, apparatus, computer device and storage medium
JP7462739B2 (en) Structure-preserving attention mechanism in sequence-sequence neural models
JP2012141354A (en) Method, apparatus and program for voice synthesis
CN110930975A (en) Method and apparatus for outputting information
JP2000347681A (en) Regeneration method for voice control system utilizing voice synthesis of text base
CN112185340B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
KR102198597B1 (en) Neural vocoder and training method of neural vocoder for constructing speaker-adaptive model
KR102198598B1 (en) Method for generating synthesized speech signal, neural vocoder, and training method thereof
US11915714B2 (en) Neural pitch-shifting and time-stretching
CN117012177A (en) Speech synthesis method, electronic device, and storage medium
KR102626618B1 (en) Method and system for synthesizing emotional speech based on emotion prediction
CN116312476A (en) Speech synthesis method and device, storage medium and electronic equipment
CN116129859A (en) Prosody labeling method, acoustic model training method, voice synthesis method and voice synthesis device
CN114446278A (en) Speech synthesis method and apparatus, device and storage medium
CN114005428A (en) Speech synthesis method, apparatus, electronic device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination