CN114464162B - Speech synthesis method, neural network model training method, and speech synthesis model - Google Patents


Info

Publication number
CN114464162B
Authority
CN
China
Prior art keywords
speech
voice
decoder
posterior
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210377265.8A
Other languages
Chinese (zh)
Other versions
CN114464162A (en)
Inventor
柴萌鑫
林羽钦
黄智颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202211024404.5A (published as CN115294963A)
Priority to CN202210377265.8A (published as CN114464162B)
Publication of CN114464162A
Application granted
Publication of CN114464162B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

Embodiments of the present application provide a speech synthesis method, a neural network model training method, and a speech synthesis model. The speech synthesis method includes: acquiring a phoneme vector of a text to be synthesized; predicting, from the phoneme vector, speech features and a speech posterior graph corresponding to each phoneme, where the speech posterior graph carries accent information; generating a speech spectrum according to the speech features and the speech posterior graph; and outputting target speech corresponding to the text to be synthesized based on the speech spectrum, where the accent of the target speech matches the accent information. The method can synthesize accented speech.

Description

Speech synthesis method, neural network model training method, and speech synthesis model
Technical Field
Embodiments of the present application relate to the technical field of neural networks, and in particular to a speech synthesis method, a neural network model training method, and a speech synthesis model.
Background
At present, end-to-end models based on neural networks keep improving, and the modeling capability of speech synthesis models keeps increasing, so speech can be synthesized in less time, at higher speed, and with a more robust effect, and the synthesized speech sounds increasingly natural. However, existing speech synthesis models require a huge database and a large amount of computing resources. On the other hand, dialects with accents are widely used in daily life due to geographical influences, yet existing speech synthesis models have difficulty synthesizing accented speech audio.
Disclosure of Invention
In view of the above, embodiments of the present application provide a speech synthesis scheme to at least partially solve the above problems.
According to a first aspect of embodiments of the present application, there is provided a speech synthesis method, including: acquiring a phoneme vector of a text to be synthesized; predicting, from the phoneme vector, speech features and a speech posterior graph corresponding to each phoneme, where the speech posterior graph carries accent information; generating a speech spectrum according to the speech features and the speech posterior graph; and outputting target speech corresponding to the text to be synthesized based on the speech spectrum, where the accent of the target speech matches the accent information.
According to a second aspect of embodiments of the present application, there is provided a speech synthesis model including an encoder, a decoder, and a vocoder. The encoder is configured to predict speech features and a speech posterior graph from a phoneme vector of a text to be synthesized, where the speech posterior graph carries accent information; the decoder is configured to determine a speech spectrum based on the speech features and the speech posterior graph; and the vocoder is configured to generate target speech corresponding to the text to be synthesized according to the speech spectrum, where the accent of the target speech matches the accent information in the speech posterior graph.
According to a third aspect of embodiments of the present application, there is provided a neural network model training method, for training the above speech synthesis model, the method including: training the speech synthesis model by using an audio sample corresponding to the first accent to obtain an initially trained speech synthesis model; and training the initially trained speech synthesis model by using an audio sample corresponding to a second accent to obtain a secondarily trained speech synthesis model, wherein the duration of the audio sample corresponding to the first accent is longer than that of the audio sample corresponding to the second accent.
According to a fourth aspect of embodiments of the present application, there is provided an electronic apparatus, including: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete mutual communication through the communication bus; the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the corresponding operation of the method according to the first aspect.
According to a fifth aspect of embodiments herein, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to the first aspect.
According to a sixth aspect of embodiments of the present application, there is provided a computer program product comprising computer instructions for instructing a computing device to perform operations corresponding to the method as described above.
In this way, realistic target speech with a non-Mandarin accent can be generated, enriching the range of speech that can be synthesized. This embodiment creatively applies phonetic posteriorgrams (PPGs) to the synthesis of accented (i.e., non-Mandarin) speech, so that accented speech can be synthesized automatically using only a small amount of non-Mandarin audio, which solves the prior-art problem that accented speech cannot be synthesized because accented audio is insufficient.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings.
FIG. 1A is a schematic diagram of a speech synthesis model according to the first embodiment of the present application;
FIG. 1B is a schematic diagram of the encoder and the decoder in the speech synthesis model according to the first embodiment of the present application;
FIG. 1C is a schematic diagram of the variance adapter of the encoder in the speech synthesis model according to the first embodiment of the present application;
FIG. 2 is a flowchart illustrating steps of a speech synthesis method according to the first embodiment of the present application;
FIG. 3 is a flowchart illustrating sub-steps of step S204 of the speech synthesis method according to the first embodiment of the present application;
FIG. 4 is a flowchart illustrating sub-steps of step S206 of the speech synthesis method according to the first embodiment of the present application;
FIG. 5 is a flowchart illustrating steps of a neural network model training method according to the second embodiment of the present application;
FIG. 6 is a flowchart illustrating sub-steps of step S502 of the neural network model training method according to the second embodiment of the present application;
FIG. 7 is a flowchart illustrating further sub-steps of step S502 of the neural network model training method according to the second embodiment of the present application;
FIG. 8 is a block diagram of a speech synthesis apparatus according to the third embodiment of the present application;
FIG. 9 is a block diagram of a neural network model training apparatus according to the fourth embodiment of the present application;
FIG. 10 is a schematic structural diagram of an electronic device according to the fifth embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the technical solutions in the embodiments of the present application, these technical solutions are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only a part, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application shall fall within the protection scope of the embodiments of the present application.
The following further describes specific implementations of embodiments of the present application with reference to the drawings of the embodiments of the present application.
Embodiment One
This embodiment provides a new neural network model (which may also be referred to as a speech synthesis model) capable of synthesizing target speech with an accent (i.e., non-Mandarin speech). For ease of understanding, the speech synthesis model is explained before the implementation of the speech synthesis method.
Referring to FIG. 1A, a schematic diagram of a speech synthesis model is shown. The model includes an encoder, a decoder, and a vocoder.
The encoder (shown in fig. 1A) is configured to predict speech features and a speech posterior graph from a phoneme vector of the text to be synthesized, where the speech posterior graph carries accent information.
The speech features may include, but are not limited to, the fundamental frequency (F0) and energy information (energy) of each phoneme in the target speech to be synthesized. The encoder of this embodiment predicts not only the fundamental frequency and energy information but also a phonetic posteriorgram (PPG). The phonetic posteriorgram captures language-independent phonetic posterior probabilities; it retains sound-related information (such as accent information) while excluding the influence of the speaker, so it can serve as a bridge between the speaker and the speech. The accent corresponding to each phoneme and the duration of each phoneme indicated in the speech posterior graph can be used to control the accent of the subsequently synthesized target speech, which addresses the difficulty that phonemes and prosody differ across accents and that non-Mandarin accented speech is therefore hard to synthesize.
The decoder (shown in fig. 1A) is configured to determine a speech spectrum based on the speech features and the speech posterior graph. The speech spectrum may be a Mel spectrum.
The vocoder (the LPCNet shown in fig. 1A) is configured to generate target speech corresponding to the text to be synthesized according to the speech spectrum, where the accent of the target speech matches the accent information in the speech posterior graph.
The encoder, the decoder, and the vocoder are described below by way of example. As shown in fig. 1B, the encoder includes a plurality of encoding modules for extracting context information from the phoneme vector of the text to be synthesized, and a variance adapter for predicting the speech features and the speech posterior graph based on the output data of the encoding modules.
The number of encoding modules may be determined according to requirements. For example, the number of encoding modules may be 6, but this is only an example and other numbers may be used.
The encoding module may include, for example, an encoding multi-head self-attention layer (multi-head attention), an encoding normalization layer (add & norm), and an encoding one-dimensional convolutional layer (conv 1D).
The phoneme vector is spliced with the position information (position encoding) of each phoneme and then input, as encoding input data, to the encoding multi-head self-attention layer, which extracts first feature information.
In the encoding normalization layer connected to the encoding multi-head self-attention layer, the first feature information and the encoding input data are normalized to obtain a first normalization result.
The first normalization result is input into the encoding one-dimensional convolutional layer, which further extracts second feature information. In the encoding normalization layer connected to the encoding one-dimensional convolutional layer, the second feature information and the first normalization result are normalized to obtain the output data. The output data carries the context information of the text to be synthesized.
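For illustration only, the following is a minimal PyTorch sketch of one such encoding module. The hidden size, number of attention heads, and convolution kernel size are assumptions made for the example and are not specified in the description.

```python
import torch
import torch.nn as nn

class EncodingModule(nn.Module):
    """One encoding module: multi-head self-attention -> add & norm -> conv1d -> add & norm."""
    def __init__(self, hidden=256, heads=2, kernel=9):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden)
        self.conv = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2),
        )
        self.norm2 = nn.LayerNorm(hidden)

    def forward(self, x):                      # x: (batch, num_phonemes, hidden)
        first_feat, _ = self.attn(x, x, x)     # "first feature information"
        x = self.norm1(x + first_feat)         # first normalization result
        second_feat = self.conv(x.transpose(1, 2)).transpose(1, 2)  # conv1d expects (B, C, T)
        return self.norm2(x + second_feat)     # output data carrying context information
```

Stacking several of these modules gives the encoder trunk whose output is fed to the variance adapter described next.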
The output data from the multiple coding modules is input to a variance adapter (variance adapter), which, as shown in fig. 1C, includes a pitch prediction unit (pitch predictor), an energy prediction unit (energy predictor), and a speech posterior prediction unit (PPG predictor). The speech posterior prediction unit may use an LSTM network or other suitable neural network, without limitation.
The output data of the encoding modules is input into the variance adapter. On one hand, it is processed by the fundamental frequency prediction unit to output the fundamental frequency corresponding to each phoneme. To further address the fact that the pronunciation and prosody of phonemes in non-Mandarin speech differ from those in Mandarin speech, the fundamental frequency prediction unit in this embodiment outputs a normalized log-scale fundamental frequency (i.e., Log-F0).
On the other hand, the output data is processed by the energy prediction unit to output the energy corresponding to each phoneme, and by the speech posterior prediction unit to output the speech posterior graph, which includes the duration and accent information of each phoneme.
The variance adapter splices the predicted fundamental frequency, energy, and speech posterior graph into the output data of the encoding modules, forming the encoded data output by the encoder. In general, the variance adapter takes the hidden sequence (i.e., the output data of the encoding modules) as input and uses an MSE (mean square error) loss function to predict the fundamental frequency, energy, and speech posterior graph of each speech frame corresponding to each phoneme.
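As a sketch only, the variance adapter could be organized as below. The predictor structure, the number of posterior classes (ppg_dim), and the choice of adding the predicted quantities back into the hidden sequence are illustrative assumptions rather than details given in the description; only the three predictors, the optional LSTM for the PPG predictor, and the MSE training objective come from the text above.

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Assumed shared structure for the pitch and energy predictors."""
    def __init__(self, hidden=256, out_dim=1):
        super().__init__()
        self.conv = nn.Conv1d(hidden, hidden, 3, padding=1)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, x):                                  # x: (B, T, hidden)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        return self.proj(h)                                # (B, T, out_dim)

class VarianceAdapter(nn.Module):
    """Predicts Log-F0, energy and the speech posterior graph, then splices them into the hidden sequence."""
    def __init__(self, hidden=256, ppg_dim=218):           # ppg_dim: hypothetical number of phonetic classes
        super().__init__()
        self.pitch_predictor = Predictor(hidden, 1)
        self.energy_predictor = Predictor(hidden, 1)
        self.ppg_predictor = nn.LSTM(hidden, ppg_dim, batch_first=True)  # an LSTM, as the text allows
        self.pitch_embed = nn.Linear(1, hidden)
        self.energy_embed = nn.Linear(1, hidden)
        self.ppg_embed = nn.Linear(ppg_dim, hidden)

    def forward(self, x):                                  # x: hidden sequence (B, T, hidden)
        log_f0 = self.pitch_predictor(x)                   # normalized log-scale fundamental frequency
        energy = self.energy_predictor(x)
        ppg, _ = self.ppg_predictor(x)
        ppg = torch.softmax(ppg, dim=-1)                   # posterior over phonetic classes (B, T, ppg_dim)
        encoded = x + self.pitch_embed(log_f0) + self.energy_embed(energy) + self.ppg_embed(ppg)
        return encoded, log_f0, energy, ppg

# Training compares each prediction with its target using MSE, as stated above, e.g.:
# loss = nn.functional.mse_loss(log_f0, f0_target) + nn.functional.mse_loss(energy, energy_target) \
#        + nn.functional.mse_loss(ppg, ppg_target)
```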
Similar to the encoder, the decoder includes a plurality of decoding modules for generating a speech spectrum based on the input speech features, the speech posterior graph, and a preset speaker vector. For example, the number of decoding modules may equal the number of encoding modules, i.e., 6; this is only an illustration and other numbers may be used.
The decoding module includes a decoding multi-head self-attention layer, a decoding normalization layer, a decoding one-dimensional convolutional layer, and so on. The encoded data output by the variance adapter is spliced with position information and input into the decoding module as decoding input data. The decoding multi-head self-attention layer of the decoding module processes the decoding input data to obtain third feature information.
The third feature information and the decoding input data are input into the decoding normalization layer connected to the decoding multi-head self-attention layer, which outputs a third normalization result. The third normalization result is input into the decoding one-dimensional convolutional layer to obtain fourth feature information. The fourth feature information and the third normalization result are input into the decoding normalization layer connected to the decoding one-dimensional convolutional layer to obtain a fourth normalization result.
In this way, after processing by the plurality of decoding modules, the fourth normalization result output by the last decoding module is input to a linear layer, and the output speech spectrum is obtained.
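A minimal sketch of how the decoder side could be assembled, reusing the EncodingModule block sketched earlier for the decoding modules (the description states they share the same layer types). Adding a broadcast speaker vector to the decoder input, as well as the layer sizes, are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Stack of decoding modules followed by a linear layer that outputs the (Mel) speech spectrum."""
    def __init__(self, hidden=256, n_blocks=6, n_mels=80):
        super().__init__()
        # EncodingModule is the block sketched above; the decoding module has the same layer structure.
        self.blocks = nn.ModuleList([EncodingModule(hidden) for _ in range(n_blocks)])
        self.linear = nn.Linear(hidden, n_mels)     # the "linear layer" producing the speech spectrum

    def forward(self, encoded, speaker_vec):        # encoded: (B, T, hidden), speaker_vec: (hidden,)
        x = encoded + speaker_vec.view(1, 1, -1)    # splice in the speaker timbre (assumed additive)
        for block in self.blocks:
            x = block(x)
        return self.linear(x)                       # Mel spectrum, shape (B, T, n_mels)
```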
The vocoder generates the target speech from the speech spectrum. In this way, target speech with accents other than Mandarin can be synthesized, enriching the speech synthesis function. The vocoder may be LPCNet, or any other neural network capable of converting a speech spectrum into the target speech; this is not limited here.
The speech synthesis model may be an end-to-end neural network model, referred to as PPG_FS, whose encoder and decoder are non-autoregressive structures. Through training, the attention alignment mechanism can be distilled from an encoder-decoder based teacher model, thereby improving accuracy. The target speech is synthesized by converting acoustic features (such as the speech spectrum) into speech frames of the target speech using the LPCNet vocoder.
The speech synthesis method of this embodiment is described below with reference to the speech synthesis model. It should be noted that the method is not limited to the speech synthesis model illustrated in this embodiment and may also be applied to other models.
As shown in fig. 2, the method comprises the steps of:
step S202: and acquiring a phoneme vector of the text to be synthesized.
The text to be synthesized may be a sentence or a paragraph. When converted into speech, it may be divided into one or more phonemes (phones), and a phoneme vector (phone embedding) is generated based on the divided phonemes.
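The description does not specify a grapheme-to-phoneme front end. Purely as an illustration, Chinese text could be split into tone-marked pinyin units with a library such as pypinyin; the phoneme inventory and the phone_to_id mapping below are hypothetical.

```python
from pypinyin import lazy_pinyin, Style

def text_to_phoneme_ids(text, phone_to_id):
    """Split the text into tone-marked pinyin syllables and map them to phoneme IDs."""
    syllables = lazy_pinyin(text, style=Style.TONE3)   # e.g. "你好" -> ["ni3", "hao3"]
    # A fuller front end would further split initials/finals; here each syllable is one unit.
    return [phone_to_id[s] for s in syllables if s in phone_to_id]
```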
Step S204: and predicting the speech characteristics and the speech posterior graph corresponding to each phoneme from the phoneme vector.
In one example, a trained speech synthesis model is used, which includes an encoder, a decoder, and a vocoder, as previously described.
As shown in fig. 3, step S204 can be implemented by the following sub-steps:
substep S2041: input data is constructed based on the phoneme vectors.
For example, the phoneme vector is spliced with the position information of each phoneme in the text to be synthesized to form the input data.
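A sketch of this sub-step under the assumption of a standard sinusoidal positional encoding added to the phoneme embedding; the description only says position information is spliced in, so both the encoding form and the embedding sizes below are illustrative.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(length, dim):
    """Standard sinusoidal position encoding, used here as an assumed form of the position information."""
    pos = torch.arange(length).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

phone_embedding = nn.Embedding(num_embeddings=512, embedding_dim=256)  # hypothetical vocabulary/size
phoneme_ids = torch.tensor([[12, 57, 3, 98]])                          # one text, four phoneme IDs
x = phone_embedding(phoneme_ids)                                       # (1, 4, 256)
input_data = x + sinusoidal_positions(x.size(1), x.size(2))            # add the position information
```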
Substep S2042: and inputting the input data into an encoder of a trained speech synthesis model, and obtaining fundamental frequencies and the energy information corresponding to the phonemes output by the encoder as the speech features.
The input data is input to an encoder, which processes the input data to output a fundamental frequency (F0) and energy information (energy) corresponding to each phoneme.
The fundamental frequency affects the pitch of the synthesized target speech to a certain extent, and therefore also affects its tone to a certain extent.
The energy information influences the volume of the target speech and can also reflect the logical stress (emphasis) of the target speech, and the like.
Substep S2043: and obtaining the voice posterior graph which carries the accent information and is output by the encoder.
The speech posterior graph indicates the pronunciation and the duration corresponding to each phoneme, both of which represent accent information. It is the posterior graph of the desired accent, predicted by the model from the phonemes.
Step S206: and generating a voice frequency spectrum according to the voice characteristics and the voice posterior graph.
In a possible manner, as shown in fig. 4, step S206 may be implemented by the following sub-steps:
substep S2061: acquiring a speaker vector, where the speaker vector carries the timbre information of a speaker.
To make the generated target speech more realistic and its timbre closer to a real human voice, the decoder can take as input not only the speech features (including the fundamental frequency and energy corresponding to each phoneme) and the speech posterior graph, but also the speaker vector, which carries the timbre of the speaker. The speaker vector may be obtained during the decoder training phase.
Substep S2062: and inputting the voice features, the voice posterior graph and the speaker vector into a decoder of the voice synthesis model, and obtaining a Mel frequency spectrum output by the decoder as the voice frequency spectrum.
The Mel spectrum can accurately represent the acoustic characteristics of the target speech, which helps ensure the realism of the generated target speech.
Step S208: outputting target voice corresponding to the text to be synthesized based on the voice spectrum, wherein the accent of the target voice is matched with the accent information.
In a possible manner, step S208 can be implemented as: inputting the speech spectrum into the vocoder and obtaining a plurality of speech frames output by the vocoder as the target speech corresponding to the text to be synthesized.
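The description converts the speech spectrum into speech frames with an LPCNet vocoder. As a self-contained stand-in for illustration only, the sketch below inverts a Mel spectrogram with librosa's Griffin-Lim based mel_to_audio; this is not the LPCNet vocoder and will sound noticeably worse, but it shows the spectrum-to-waveform step. The dB-scale assumption and the parameter values are mine.

```python
import librosa
import soundfile as sf

def mel_to_wav(mel_db, sr=22050, n_fft=1024, hop_length=256):
    """Convert a predicted Mel spectrogram of shape (n_mels, frames) back to a waveform.
    Stand-in for the LPCNet vocoder used in the description."""
    mel_power = librosa.db_to_power(mel_db)     # assumes the model predicts a dB-scale Mel spectrum
    return librosa.feature.inverse.mel_to_audio(
        mel_power, sr=sr, n_fft=n_fft, hop_length=hop_length)

# Hypothetical usage, where `mel` is the decoder output transposed to (n_mels, frames):
# sf.write("target_speech.wav", mel_to_wav(mel), 22050)
```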
In this way, realistic target speech with a non-Mandarin accent can be generated, enriching the range of speech that can be synthesized. This embodiment innovatively applies phonetic posteriorgrams (PPGs) to the synthesis of accented (i.e., non-Mandarin) speech, so that accented speech can be synthesized automatically using only a small amount of non-Mandarin audio, which solves the prior-art problem that accented speech cannot be synthesized because accented audio is insufficient.
Embodiment Two
Referring to fig. 5, a schematic flowchart illustrating steps of a neural network model training method according to a second embodiment of the present application is shown.
The method is used for training the speech synthesis model and comprises the following steps:
step S502: the speech synthesis model is trained using audio samples corresponding to a first accent to obtain an initially trained speech synthesis model.
The first accent may be Mandarin or another accent type with a relatively large sample size (i.e., a relatively long total audio duration).
In this embodiment, the amount of non-Mandarin accented audio samples is insufficient, and it is difficult to train a usable accented speech synthesis model from them alone. To address this, the speech synthesis model is first trained using the larger set of Mandarin audio samples to obtain an initially trained speech synthesis model.
In one example, as shown in fig. 6, step S502 includes the following sub-steps:
Substep S5021: extracting the speech features and the speech posterior graph from the audio samples corresponding to the first accent.
The speech features include, but are not limited to, the fundamental frequency and energy of each phoneme in the audio samples corresponding to the first accent. These may be obtained in any suitable known manner, which is not limited here.
The speech posterior graph can be extracted using PyTorch-Kaldi, but this is not limiting; any other method capable of extracting a speech posterior graph may be used.
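As one hedged example of the feature-extraction part (the description leaves the exact method open), frame-level normalized log-F0 and RMS energy could be computed with librosa as below; the pyin range and hop length are illustrative choices.

```python
import numpy as np
import librosa

def extract_f0_energy(wav_path, sr=22050, hop_length=256):
    """Frame-level normalized log-F0 and RMS energy from one audio sample."""
    y, _ = librosa.load(wav_path, sr=sr)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C6"),
                            sr=sr, hop_length=hop_length)
    log_f0 = np.log(np.where(np.isnan(f0), 1.0, f0))           # unvoiced frames -> log(1) = 0
    log_f0 = (log_f0 - log_f0.mean()) / (log_f0.std() + 1e-8)  # normalized log-scale F0 (Log-F0)
    energy = librosa.feature.rms(y=y, hop_length=hop_length)[0]
    return log_f0, energy
```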
Substep S5022: a speaker vector is obtained.
During training, the speaker vector may be randomly initialized for the first training iteration; in subsequent iterations, the adjusted speaker vector from the previous iteration is used.
Substep S5023: and inputting the voice characteristics, the voice posterior graph and the speaker vector into a decoder of the voice synthesis model, and acquiring a voice frequency spectrum output by the decoder.
Substep S5024: generating target speech based on the speech spectrum using the vocoder of the speech synthesis model.
Substep S5025: and adjusting the speaker vector according to the target voice and the audio sample, taking the adjusted speaker vector as a new speaker vector, returning to input the voice characteristics, the voice posterior graph and the speaker vector into a decoder of the voice synthesis model to continue executing until a first termination condition is met, and obtaining a trained decoder and the speaker vector.
During training, the timbre and other characteristics represented by the speaker vector may differ from those of the real speaker, and the decoder may not yet extract features from the audio sample accurately, so the synthesized target speech may deviate from the audio sample. A loss value can therefore be calculated based on the target speech and the audio sample, and the speaker vector and the parameters of the decoder are adjusted according to the loss value.
The adjusted speaker vector is taken as the new speaker vector, and sub-step S5023 is returned to and executed again until the first termination condition is met. The termination condition may be that a set number of training iterations is reached, or that the decoder satisfies a convergence condition; this is not limited. A trained decoder is obtained through the above sub-steps.
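For concreteness, a minimal sketch of sub-steps S5023 to S5025: the speaker vector is a learnable parameter optimized together with the decoder. The Adam optimizer, the L1 loss, and computing the loss on the spectrum rather than on the vocoder output are simplifying assumptions; the description computes the loss from the generated target speech and the audio sample.

```python
import torch
import torch.nn as nn

def train_decoder(decoder, train_pairs, hidden=256, steps=10000, lr=1e-4):
    """train_pairs yields (encoded_features, target_mel) pairs extracted from the first-accent audio,
    where encoded_features already contains the speech features and the speech posterior graph."""
    speaker_vec = nn.Parameter(torch.randn(hidden))           # randomly initialized at the first training
    optim = torch.optim.Adam(list(decoder.parameters()) + [speaker_vec], lr=lr)
    for _, (encoded, target_mel) in zip(range(steps), train_pairs):
        pred_mel = decoder(encoded, speaker_vec)              # speech spectrum output by the decoder
        loss = nn.functional.l1_loss(pred_mel, target_mel)    # deviation between synthesis and the sample
        optim.zero_grad()
        loss.backward()
        optim.step()                                          # adjusts both the decoder and the speaker vector
    return decoder, speaker_vec.detach()                      # trained decoder and trained speaker vector
```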
As shown in fig. 7, the encoder may be trained based on the trained decoder through the following sub-steps.
Substep S5026: and acquiring a phoneme vector sample of the text sample corresponding to the audio sample of the first accent.
The phoneme vector samples may be obtained in any suitable, known manner, without limitation.
Substep S5027: and inputting the phoneme vector sample into an encoder of the speech synthesis model, and obtaining the speech characteristics and the speech posterior graph output by the encoder.
As previously described, the encoder predicts the fundamental frequency and energy of each phoneme as the speech features, and also predicts the speech posterior graph.
Substep S5028: and inputting the voice features, the voice posterior graph and the trained speaker vector into a trained decoder to obtain a voice frequency spectrum output by the trained decoder.
Substep S5029: generating target speech based on the speech spectrum using the vocoder of the speech synthesis model.
Substep S50210: and adjusting the encoder according to the target voice and the audio sample, and returning to the step of inputting the phoneme vector sample into the encoder of the voice synthesis model to continue execution until a termination condition is met so as to obtain a trained encoder.
A loss value is calculated based on the target speech and the audio sample, the encoder is adjusted according to the loss value, and sub-step S5027 is returned to and executed again until a second termination condition is met. The second termination condition may be that a set number of training iterations is reached, or that the encoder converges; this is not limited.
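Continuing the sketch above, sub-steps S5027 to S50210 could look as follows, assuming an encoder module that bundles the encoding modules and the variance adapter and returns (encoded, log_f0, energy, ppg); the trained decoder and speaker vector are held fixed while only the encoder is adjusted.

```python
import torch

def train_encoder(encoder, trained_decoder, speaker_vec, train_pairs, steps=10000, lr=1e-4):
    """train_pairs yields (phoneme_vector_sample, target_mel) for the first-accent text/audio samples."""
    for p in trained_decoder.parameters():
        p.requires_grad_(False)                                # keep the trained decoder fixed
    optim = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _, (phoneme_vecs, target_mel) in zip(range(steps), train_pairs):
        encoded, log_f0, energy, ppg = encoder(phoneme_vecs)   # speech features + speech posterior graph
        pred_mel = trained_decoder(encoded, speaker_vec)
        loss = torch.nn.functional.l1_loss(pred_mel, target_mel)
        optim.zero_grad()
        loss.backward()
        optim.step()                                           # only the encoder is adjusted
    return encoder
```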
Step S504: and training the initially trained speech synthesis model by using an audio sample corresponding to a second accent to obtain a secondarily trained speech synthesis model, wherein the duration of the audio sample corresponding to the first accent is longer than that of the audio sample corresponding to the second accent.
To enable the speech synthesis model to synthesize realistic non-Mandarin target speech, after the model has been trained using the Mandarin audio samples, the audio samples of the second accent may be used to further adjust the model, thereby obtaining a speech synthesis model corresponding to the second accent.
The second accent may be a non-Mandarin accent, and the duration of the audio samples required for the second accent may be shorter than that of the audio samples of the first accent, which allows a good speech synthesis model to be trained with a smaller amount of audio. The process of performing the secondary training on the initially trained speech synthesis model using the audio samples of the second accent is similar to the aforementioned sub-steps S5021 to S50210 and is not repeated here.
In this embodiment, the speech synthesis model may be obtained by improving the FastSpeech model so that the improved model can predict a phonetic posteriorgram (PPG) that crosses speaker and language boundaries. The speech posterior graph, together with the normalized log-scale fundamental frequency (Log-F0), energy, and other features, is then used to resolve the mismatch in phonemes and prosody between Mandarin and other accents, realizing an end-to-end speech synthesis model that can synthesize accented speech.
Thanks to the introduction of the speech posterior graph, the speech synthesis model adapts well to the situation in which audio samples of other accents are too scarce to train an accented speech synthesis model directly on a FastSpeech model. It overcomes the sparsity of accented audio samples: training the FastSpeech model requires nearly 24 hours of audio, whereas the improved speech synthesis model (which may also be called PPG_FS) needs only about two hours of accented audio samples. It also alleviates the difficulties that accented speech is hard to label and that accented phoneme pronunciation and tone differ from Mandarin, which would otherwise make the synthesis effect uncontrollable. The model is easily adapted to training on accented languages with limited data resources and realizes speaker voice conversion.
Experiments show that the model can synthesize intelligible, natural, and fluent accented speech.
In the stage of training the speech synthesis model, Mandarin audio samples (also called corpora) are used; when the accent data is used for fine-tuning, the trained Mandarin speaker vector is kept unchanged. As a result, when the speech synthesis model is finally used, it can produce the effect of the Mandarin speaker's voice speaking the accented language, further enriching the target speech.
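Putting the two stages together, a hedged driver for the training method might look like the sketch below, reusing the train_decoder and train_encoder sketches above: pre-train on the large Mandarin corpus, then fine-tune on the small accented corpus while keeping the Mandarin speaker vector unchanged. Fine-tuning only the encoder in stage 2, as well as the corpus variable names and step counts, are assumptions made for the example.

```python
def train_ppg_fs(encoder, decoder, mandarin_decoder_pairs, mandarin_encoder_pairs, accent_encoder_pairs):
    # Stage 1: initial training on the first accent (Mandarin, a large corpus of roughly 24 hours).
    decoder, speaker_vec = train_decoder(decoder, mandarin_decoder_pairs)
    encoder = train_encoder(encoder, decoder, speaker_vec, mandarin_encoder_pairs)

    # Stage 2: secondary training (fine-tuning) on the second accent (roughly two hours of audio).
    # The trained Mandarin speaker vector is kept unchanged, so the final model speaks the
    # accented language with the Mandarin speaker's timbre.
    encoder = train_encoder(encoder, decoder, speaker_vec, accent_encoder_pairs, steps=2000)
    return encoder, decoder, speaker_vec
```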
The speech posterior graph is incorporated into the model. Because it can remove the identity of the speaker while retaining the phonetic information, it serves as a bridge between the speaker and the linguistic information. Its language-independent property allows a phoneme posterior representation that spans Mandarin and the accent, which addresses the problem that models trained on audio samples of different accents convert differently because the mapping between phonemes and prosody differs across accents, and thus achieves a better training effect. In addition, the fundamental frequency further compensates for the mismatch between prosody and phonemes. Training on a large amount of Mandarin corpus and fine-tuning on a small amount of accented corpus realizes accented speech synthesis with speaker voice conversion, that is, target speech with another accent spoken in the Mandarin speaker's timbre. With this model and method, text can be converted into vivid speech and then used for artificial intelligence services.
Embodiment Three
Referring to fig. 8, a block diagram of a speech synthesis apparatus according to a third embodiment of the present application is shown.
In this embodiment, the apparatus includes:
an obtaining module 802, configured to obtain a phoneme vector of a text to be synthesized;
the prediction module 804 is configured to predict, from the phoneme vector, a speech feature and a speech posterior graph corresponding to each phoneme, where the speech posterior graph carries accent information;
a generating module 806, configured to generate a speech spectrum according to the speech feature and the speech posterior graph;
a synthesizing module 808, configured to output a target speech corresponding to the text to be synthesized based on the speech spectrum, where an accent of the target speech matches the accent information.
Optionally, the speech features include a fundamental frequency corresponding to each phoneme and energy information corresponding to each phoneme, and the prediction module 804 is configured to construct input data based on the phoneme vectors; inputting the input data into an encoder of a trained speech synthesis model, and obtaining fundamental frequencies and the energy information corresponding to phonemes output by the encoder as the speech features; and obtaining the voice posterior graph which carries the accent information and is output by the encoder.
Optionally, the generating module 806 is configured to obtain a speaker vector, where the speaker vector carries timbre information of a speaker; and inputting the voice features, the voice posterior graph and the speaker vector into a decoder of the voice synthesis model, and obtaining a Mel frequency spectrum output by the decoder as the voice frequency spectrum.
Optionally, the synthesis module 808 is configured to input the speech spectrum into a vocoder, and obtain a plurality of speech frames output by the vocoder as target speech corresponding to the text to be synthesized.
The apparatus of this embodiment is used to implement the corresponding method in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again. In addition, the functional implementation of each module in the apparatus of this embodiment can refer to the description of the corresponding part in the foregoing method embodiment, and is not described herein again.
Embodiment Four
Referring to fig. 9, a block diagram of a neural network model training apparatus according to a fourth embodiment of the present application is shown.
The device comprises:
the first training module 902 is configured to train the speech synthesis model using an audio sample corresponding to a first accent to obtain an initially trained speech synthesis model;
the second training module 904 is configured to train the initially trained speech synthesis model using an audio sample corresponding to a second accent, so as to obtain a secondarily trained speech synthesis model, where a duration of the audio sample corresponding to the first accent is longer than that of the audio sample corresponding to the second accent.
Optionally, the first training module 902 is configured to extract a speech feature and a speech posterior graph from an audio sample corresponding to the first accent; obtaining a speaker vector; inputting the voice features, the voice posterior graph and the speaker vector into a decoder of the voice synthesis model, and acquiring a voice frequency spectrum output by the decoder; generating a target speech based on the speech spectrum using a vocoder of a speech synthesis model; and adjusting the speaker vector according to the target voice and the audio sample, taking the adjusted speaker vector as a new speaker vector, returning to input the voice characteristics, the voice posterior graph and the speaker vector into a decoder of the voice synthesis model to continue executing until a first termination condition is met, and obtaining a trained decoder and the speaker vector.
Optionally, the first training module 902 is further configured to obtain a phoneme vector sample of a text sample corresponding to the audio sample of the first accent; inputting the phoneme vector sample into an encoder of the speech synthesis model, and obtaining speech characteristics and a speech posterior graph output by the encoder; inputting the speech features, the speech posterior graph and the trained speaker vector into a trained decoder to obtain a speech spectrum output by the trained decoder; generating a target speech based on the speech spectrum using a vocoder of a speech synthesis model; and adjusting the encoder according to the target voice and the audio sample, and returning to the step of inputting the phoneme vector sample into the encoder of the voice synthesis model to continue execution until a second termination condition is met so as to obtain a trained encoder.
The apparatus of this embodiment is used to implement the corresponding method in the foregoing method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again. In addition, the functional implementation of each module in the apparatus of this embodiment can refer to the description of the corresponding part in the foregoing method embodiment, and is not described herein again.
Embodiment Five
Referring to fig. 10, a schematic structural diagram of an electronic device according to a fifth embodiment of the present application is shown, and the specific embodiment of the present application does not limit a specific implementation of the electronic device.
As shown in fig. 10, the electronic device may include: a processor (processor)1002, a Communications Interface 1004, a memory 1006, and a Communications bus 1008.
Wherein:
the processor 1002, communication interface 1004, and memory 1006 communicate with each other via a communication bus 1008.
A communication interface 1004 for communicating with other electronic devices or servers.
The processor 1002 is configured to execute the program 1010, and may specifically perform the relevant steps in the foregoing method embodiments.
In particular, the program 1010 may include program code that includes computer operating instructions.
The processor 1002 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application. The smart device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
The memory 1006 is used for storing the program 1010. The memory 1006 may include high-speed RAM memory and may also include non-volatile memory, such as at least one disk memory.
The program 1010 may be specifically configured to enable the processor 1002 to execute operations corresponding to the foregoing methods.
For specific implementation of each step in the program 1010, reference may be made to corresponding steps and corresponding descriptions in units in the foregoing method embodiments, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
The embodiment of the present application further provides a computer program product, which includes computer instructions for instructing a computing device to execute an operation corresponding to any one of the methods in the foregoing method embodiments.
It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present application may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present application.
The above-described methods according to embodiments of the present application may be implemented in hardware, firmware, or as software or computer code storable in a recording medium such as a CD ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium downloaded through a network and to be stored in a local recording medium, so that the methods described herein may be stored in such software processes on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that a computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by a computer, processor, or hardware, implements the methods described herein. Further, when a general-purpose computer accesses code for implementing the methods illustrated herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing the methods illustrated herein.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The above embodiments are only used for illustrating the embodiments of the present application, and not for limiting the embodiments of the present application, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present application, so that all equivalent technical solutions also belong to the scope of the embodiments of the present application, and the scope of patent protection of the embodiments of the present application should be defined by the claims.

Claims (7)

1. A method of speech synthesis comprising:
acquiring a phoneme vector of a text to be synthesized;
predicting the speech characteristics and a speech posterior graph corresponding to each phoneme from the phoneme vector through an encoder of a speech synthesis model, wherein the speech posterior graph carries accent information and is used for indicating the accent corresponding to each phoneme and the duration of each phoneme;
generating a voice frequency spectrum according to the voice characteristics and the voice posterior graph;
outputting a target voice corresponding to the text to be synthesized based on the voice spectrum, wherein the accent of the target voice is matched with the accent information;
the voice synthesis model comprises an encoder, a decoder and a vocoder, and is obtained by firstly training the decoder and then training the encoder based on the trained decoder;
the training of the decoder comprises: extracting voice characteristics and a voice posterior graph from an audio sample corresponding to a first accent; obtaining a speaker vector; inputting the voice features, the voice posterior graph and the speaker vector into the decoder to obtain a voice frequency spectrum output by the decoder; generating, using the vocoder, a target speech based on the speech spectrum; adjusting the speaker vector according to the target voice and the audio sample, taking the adjusted speaker vector as a new speaker vector, returning to input the voice feature, the voice posterior graph and the speaker vector into the decoder to continue execution until a first termination condition is met, and obtaining a trained decoder and a speaker vector;
training the encoder based on the trained decoder comprises: acquiring a phoneme vector sample of a text sample corresponding to the audio sample of the first accent; inputting the phoneme vector sample into the encoder, and obtaining the speech characteristics and the speech posterior graph output by the encoder; inputting the speech features, the speech posterior graph and the trained speaker vector into a trained decoder to obtain a speech spectrum output by the trained decoder; generating, using the vocoder, a target speech based on the speech spectrum; and adjusting the encoder according to the target voice and the audio sample, and returning to the step of inputting the phoneme vector sample into the encoder to continue execution until a second termination condition is met so as to obtain a trained encoder.
2. The method of claim 1, wherein the phonetic features include a fundamental frequency corresponding to each phoneme and energy information corresponding to each phoneme, and predicting the phonetic features and the phonetic posterior graph corresponding to each phoneme from the phoneme vector comprises:
constructing input data based on the phoneme vector;
inputting the input data into an encoder of a trained speech synthesis model, and obtaining fundamental frequencies and the energy information corresponding to the phonemes output by the encoder as the speech features;
and obtaining the voice posterior graph which carries the accent information and is output by the encoder.
3. The method of claim 2, wherein said generating a speech spectrum from said speech features and said speech posterior map comprises:
obtaining a speaker vector, wherein the speaker vector carries tone information of a speaker;
and inputting the voice features, the voice posterior graph and the speaker vector into a decoder of the voice synthesis model, and obtaining a Mel frequency spectrum output by the decoder as the voice frequency spectrum.
4. The method of claim 1, wherein the outputting, based on the speech spectrum, target speech corresponding to the text to be synthesized comprises:
and inputting the voice spectrum into a vocoder, and obtaining a plurality of voice frames output by the vocoder as target voices corresponding to the text to be synthesized.
5. A neural network model training method for training the speech synthesis model of any one of claims 1-4, the method comprising:
training the speech synthesis model using audio samples corresponding to a first accent to obtain an initially trained speech synthesis model, comprising: firstly, training the decoder, and then training the encoder based on the trained decoder; the training of the decoder comprises: extracting voice characteristics and a voice posterior graph from an audio sample corresponding to the first accent; obtaining a speaker vector; inputting the voice features, the voice posterior graph and the speaker vector into the decoder to obtain a voice frequency spectrum output by the decoder; generating, using the vocoder, a target speech based on the speech spectrum; adjusting the speaker vector according to the target voice and the audio sample, taking the adjusted speaker vector as a new speaker vector, returning to input the voice feature, the voice posterior graph and the speaker vector into the decoder to continue execution until a first termination condition is met, and obtaining a trained decoder and a speaker vector; training the encoder based on the trained decoder comprises: acquiring a phoneme vector sample of a text sample corresponding to the audio sample of the first accent; inputting the phoneme vector sample into the encoder, and obtaining the speech characteristics and the speech posterior graph output by the encoder; inputting the speech features, the speech posterior graph and the trained speaker vector into a trained decoder to obtain a speech spectrum output by the trained decoder; generating, using the vocoder, a target speech based on the speech spectrum; adjusting the encoder according to the target voice and the audio sample, and returning to the step of inputting the phoneme vector sample into the encoder to continue execution until a second termination condition is met so as to obtain a trained encoder;
and training the initially trained speech synthesis model by using an audio sample corresponding to a second accent to obtain a secondarily trained speech synthesis model, wherein the duration of the audio sample corresponding to the first accent is longer than that of the audio sample corresponding to the second accent.
6. An electronic device, comprising: the system comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface are communicated with each other through the communication bus;
the memory is used for storing at least one executable instruction which causes the processor to execute the corresponding operation of the method according to any one of claims 1-5.
7. A computer storage medium having stored thereon a computer program which, when executed by a processor, carries out the method of any one of claims 1 to 5.
CN202210377265.8A 2022-04-12 2022-04-12 Speech synthesis method, neural network model training method, and speech synthesis model Active CN114464162B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211024404.5A CN115294963A (en) 2022-04-12 2022-04-12 Speech synthesis model product
CN202210377265.8A CN114464162B (en) 2022-04-12 2022-04-12 Speech synthesis method, neural network model training method, and speech synthesis model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210377265.8A CN114464162B (en) 2022-04-12 2022-04-12 Speech synthesis method, neural network model training method, and speech synthesis model

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202211024404.5A Division CN115294963A (en) 2022-04-12 2022-04-12 Speech synthesis model product

Publications (2)

Publication Number Publication Date
CN114464162A (en) 2022-05-10
CN114464162B (en) 2022-08-02

Family

ID=81418278

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202210377265.8A Active CN114464162B (en) 2022-04-12 2022-04-12 Speech synthesis method, neural network model training method, and speech synthesis model
CN202211024404.5A Pending CN115294963A (en) 2022-04-12 2022-04-12 Speech synthesis model product

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202211024404.5A Pending CN115294963A (en) 2022-04-12 2022-04-12 Speech synthesis model product

Country Status (1)

Country Link
CN (2) CN114464162B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116248974A (en) * 2022-12-29 2023-06-09 南京硅基智能科技有限公司 Video language conversion method and system
CN117894294B (en) * 2024-03-14 2024-07-05 暗物智能科技(广州)有限公司 Personification auxiliary language voice synthesis method and system
CN118116363A (en) * 2024-04-26 2024-05-31 厦门蝉羽网络科技有限公司 Speech synthesis method based on time perception position coding and model training method thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113808576A (en) * 2020-06-16 2021-12-17 阿里巴巴集团控股有限公司 Voice conversion method, device and computer system
CN113851140A (en) * 2020-06-28 2021-12-28 阿里巴巴集团控股有限公司 Voice conversion correlation method, system and device

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110223705B (en) * 2019-06-12 2023-09-15 腾讯科技(深圳)有限公司 Voice conversion method, device, equipment and readable storage medium
CN113314096A (en) * 2020-02-25 2021-08-27 阿里巴巴集团控股有限公司 Speech synthesis method, apparatus, device and storage medium
CN113628608A (en) * 2020-05-07 2021-11-09 北京三星通信技术研究有限公司 Voice generation method and device, electronic equipment and readable storage medium
CN114203147A (en) * 2020-08-28 2022-03-18 微软技术许可有限责任公司 System and method for text-to-speech cross-speaker style delivery and for training data generation
CN114299908A (en) * 2020-09-21 2022-04-08 华为技术有限公司 Voice conversion method and related equipment
CN112259072A (en) * 2020-09-25 2021-01-22 北京百度网讯科技有限公司 Voice conversion method and device and electronic equipment
CN112652318B (en) * 2020-12-21 2024-03-29 北京捷通华声科技股份有限公司 Tone color conversion method and device and electronic equipment
CN113012678B (en) * 2021-02-05 2024-01-19 江苏金陵科技集团有限公司 Label-free specific speaker voice synthesis method and device
CN113066511B (en) * 2021-03-16 2023-01-24 云知声智能科技股份有限公司 Voice conversion method and device, electronic equipment and storage medium
CN113345431B (en) * 2021-05-31 2024-06-07 平安科技(深圳)有限公司 Cross-language voice conversion method, device, equipment and medium
CN113409764B (en) * 2021-06-11 2024-04-26 北京搜狗科技发展有限公司 Speech synthesis method and device for speech synthesis
CN114038447A (en) * 2021-12-02 2022-02-11 深圳市北科瑞声科技股份有限公司 Training method of speech synthesis model, speech synthesis method, apparatus and medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113808576A (en) * 2020-06-16 2021-12-17 阿里巴巴集团控股有限公司 Voice conversion method, device and computer system
CN113851140A (en) * 2020-06-28 2021-12-28 阿里巴巴集团控股有限公司 Voice conversion correlation method, system and device

Also Published As

Publication number Publication date
CN115294963A (en) 2022-11-04
CN114464162A (en) 2022-05-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant