CN113421548A - Speech synthesis method, apparatus, computer device and storage medium - Google Patents

Speech synthesis method, apparatus, computer device and storage medium

Info

Publication number
CN113421548A
CN113421548A
Authority
CN
China
Prior art keywords
pronunciation
model
acoustic
voice
pronunciation duration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110742024.4A
Other languages
Chinese (zh)
Other versions
CN113421548B (en)
Inventor
孙奥兰
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110742024.4A priority Critical patent/CN113421548B/en
Publication of CN113421548A publication Critical patent/CN113421548A/en
Application granted granted Critical
Publication of CN113421548B publication Critical patent/CN113421548B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The application relates to the field of artificial intelligence and improves the accuracy of speech synthesis results by performing pronunciation duration constraint processing on the sound spectrum feature information output by an acoustic model according to a pronunciation duration statistical table. It relates to a speech synthesis method, apparatus, computer device and storage medium, the method comprising: acquiring a target text to be speech-synthesized and acquiring a pronunciation duration statistical table; inputting the target text into a second acoustic model in a second speech synthesis model for sound spectrum feature extraction to obtain sound spectrum feature information corresponding to the target text; performing pronunciation duration constraint processing on the sound spectrum feature information based on the pronunciation duration statistical table to obtain constrained sound spectrum feature information; and inputting the constrained sound spectrum feature information into a vocoder in the second speech synthesis model for speech synthesis to obtain a speech synthesis result corresponding to the target text. In addition, the application also relates to blockchain technology, and the pronunciation duration statistical table may be stored in a blockchain.

Description

Speech synthesis method, apparatus, computer device and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a speech synthesis method, apparatus, computer device, and storage medium.
Background
Speech synthesis is a technology that converts arbitrarily input text information into the corresponding speech. Currently, an end-to-end speech synthesis model such as the Tacotron model is mainly used, which aligns the input text information with the acoustic features of the target speech through an attention mechanism. However, during training of such a speech synthesis model, if the speaker's recording rhythm is inconsistent and the quality of the training samples is therefore low, the trained attention mechanism is unstable, which easily causes unnatural syllable pauses during speech synthesis and greatly reduces the accuracy of the speech synthesis result.
Therefore, how to improve the accuracy of speech synthesis becomes an urgent problem to be solved.
Disclosure of Invention
The application provides a speech synthesis method, apparatus, computer device and storage medium that perform pronunciation duration constraint processing on the sound spectrum feature information output by an acoustic model according to a pronunciation duration statistical table. This solves the problem of unnatural syllable pauses caused when the attention mechanism in existing speech synthesis models assigns an excessive attention weight to a certain character in a syllable, and thus improves the accuracy of the speech synthesis result.
In a first aspect, the present application provides a speech synthesis method, including:
acquiring a target text to be subjected to voice synthesis and acquiring a pronunciation duration statistical table, wherein the pronunciation duration statistical table is obtained by performing syllable pronunciation duration statistics on a standard text based on a first acoustic model in a first voice synthesis model;
inputting the target text into a second acoustic model in a second speech synthesis model for acoustic spectrum feature extraction to obtain acoustic spectrum feature information corresponding to the target text;
based on the pronunciation duration statistical table, carrying out pronunciation duration constraint processing on the voice spectrum characteristic information to obtain the voice spectrum characteristic information after constraint processing;
and inputting the voice spectrum characteristic information after constraint processing into a vocoder in the second voice synthesis model for voice synthesis to obtain a voice synthesis result corresponding to the target text.
In a second aspect, the present application also provides a speech synthesis apparatus, the apparatus comprising:
the statistical table acquisition module is used for acquiring a target text to be subjected to voice synthesis and acquiring a pronunciation duration statistical table, wherein the pronunciation duration statistical table is obtained by performing syllable pronunciation duration statistics on a standard text based on a first acoustic model in a first voice synthesis model;
the acoustic spectrum feature extraction module is used for inputting the target text into a second acoustic model in a second speech synthesis model to perform acoustic spectrum feature extraction so as to obtain acoustic spectrum feature information corresponding to the target text;
the constraint processing module is used for carrying out pronunciation duration constraint processing on the voice spectrum characteristic information based on the pronunciation duration statistical table to obtain the voice spectrum characteristic information after constraint processing;
and the voice synthesis module is used for inputting the voice spectrum characteristic information after the constraint processing into a vocoder in the second voice synthesis model for voice synthesis to obtain a voice synthesis result corresponding to the target text.
In a third aspect, the present application further provides a computer device comprising a memory and a processor;
the memory for storing a computer program;
the processor is configured to execute the computer program and to implement the speech synthesis method as described above when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the speech synthesis method as described above.
The application discloses a speech synthesis method, apparatus, computer device and storage medium. A target text to be speech-synthesized and a pronunciation duration statistical table are obtained, so that pronunciation duration constraint processing can subsequently be performed on the sound spectrum feature information corresponding to the target text according to the pronunciation duration statistical table. Inputting the target text into the second acoustic model in the second speech synthesis model for sound spectrum feature extraction yields the sound spectrum feature information corresponding to the target text and improves its accuracy. Performing pronunciation duration constraint processing on the sound spectrum feature information based on the pronunciation duration statistical table solves the problem of unnatural syllable pauses caused when the attention mechanism in existing speech synthesis models assigns an excessive attention weight to a certain character in a syllable. Inputting the constrained sound spectrum feature information into the vocoder in the second speech synthesis model for speech synthesis produces a speech synthesis result with natural syllable pauses and improves the accuracy of the speech synthesis result.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic flow chart of a speech synthesis method provided by an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a first speech synthesis model provided in an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram of a sub-step of generating a pronunciation duration statistics table according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a second speech synthesis model provided in the embodiments of the present application;
FIG. 5 is a schematic diagram of speech synthesis of a target text according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of a sub-step of performing sonographic feature extraction on a target text provided by an embodiment of the present application;
FIG. 7 is a schematic flow chart of a sub-step of performing pronunciation duration constraint processing on the sound spectrum feature information according to an embodiment of the present application;
fig. 8 is a schematic block diagram of a speech synthesis apparatus provided in an embodiment of the present application;
fig. 9 is a schematic block diagram of a structure of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiments of the application provide a speech synthesis method, apparatus, computer device and storage medium. The speech synthesis method can be applied to a server or a terminal. It performs pronunciation duration constraint processing on the sound spectrum feature information output by the acoustic model according to the pronunciation duration statistical table, solves the problem of unnatural syllable pauses caused when the attention mechanism in existing speech synthesis models assigns an excessive attention weight to a certain character in a syllable, and improves the accuracy of the speech synthesis result.
The server may be an independent server or a server cluster. The terminal can be an electronic device such as a smart phone, a tablet computer, a notebook computer, a desktop computer and the like.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
As shown in fig. 1, the speech synthesis method includes steps S10 through S40.
Step S10, obtaining a target text to be voice synthesized and obtaining a pronunciation duration statistical table, wherein the pronunciation duration statistical table is obtained by performing syllable pronunciation duration statistics on a standard text based on a first acoustic model in a first voice synthesis model.
Illustratively, the target text is a text which a user needs to perform speech synthesis; the target text may include a phoneme sequence or a character sequence corresponding to the text.
For example, the target text may be stored in a local disk or a local database, or may be stored in a blockchain node.
In some embodiments, a pre-generated pronunciation duration statistical table may be obtained, wherein the pronunciation duration statistical table is obtained by performing syllable pronunciation duration statistics on the standard text based on the first acoustic model in the first speech synthesis model.
It should be noted that the standard text is a text in a preset standard corpus; the standard corpus is a BIAOBEI corpus and comprises standard texts and standard voice information corresponding to the standard texts.
By acquiring the target text to be speech-synthesized and the pronunciation duration statistical table, pronunciation duration constraint processing can subsequently be performed on the sound spectrum feature information corresponding to the target text according to the pronunciation duration statistical table.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a first speech synthesis model according to an embodiment of the present application. As shown in fig. 2, the first speech synthesis model may be a Tacotron model, including a first acoustic model and a vocoder. The first acoustic model may include, among other things, an encoder, an attention mechanism layer, and a decoder. The vocoder is used for converting the sound spectrum characteristic information output by the acoustic model into sound waveforms, and the vocoder can comprise a WaveGlow model, a Parallel WaveNet model, a Parallel WaveGan model, a MelGan model and the like.
It should be noted that the encoder is used to convert the input sequence into a vector of fixed length; the attention mechanism layer, which may comprise a weight matrix, sits between the encoder and the decoder and is used to assign weights to the vectors output by the encoder and feed the weighted vectors into the decoder; the decoder is used to convert the weighted vectors output by the attention mechanism layer into an output sequence. Illustratively, the encoder and decoder may be recurrent neural network models, such as an LSTM (Long Short-Term Memory) network model or a GRU (Gated Recurrent Unit) network model.
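The following is a minimal PyTorch sketch of the encoder-attention-decoder structure described above. It is an illustrative assumption rather than the exact Tacotron configuration used in the patent: module sizes, the GRU choice, and the content-based attention scoring are placeholder design decisions.

```python
# Minimal sketch (assumed architecture, not the patent's exact model):
# encoder -> attention weight assignment -> autoregressive decoder producing mel frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleAcousticModel(nn.Module):
    def __init__(self, vocab_size=100, emb_dim=128, hid_dim=128, n_mels=80):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Encoder: converts the input character/phoneme sequence into hidden vectors.
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        # Attention: scores each encoder step against the current decoder state.
        self.attn_proj = nn.Linear(hid_dim, 2 * hid_dim)
        # Decoder: autoregressive GRU cell consuming [previous frame; attention context].
        self.decoder_cell = nn.GRUCell(n_mels + 2 * hid_dim, hid_dim)
        self.frame_proj = nn.Linear(hid_dim, n_mels)
        self.n_mels, self.hid_dim = n_mels, hid_dim

    def forward(self, tokens, n_frames):
        # tokens: (batch, seq_len) integer ids; n_frames: number of mel frames to generate.
        enc_out, _ = self.encoder(self.embedding(tokens))           # (B, T, 2H)
        B = tokens.size(0)
        state = enc_out.new_zeros(B, self.hid_dim)
        prev_frame = enc_out.new_zeros(B, self.n_mels)
        frames, alignments = [], []
        for _ in range(n_frames):
            # Content-based attention: weight encoder outputs by similarity to the state.
            scores = torch.bmm(enc_out, self.attn_proj(state).unsqueeze(2)).squeeze(2)
            weights = F.softmax(scores, dim=1)                       # (B, T)
            context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)  # (B, 2H)
            state = self.decoder_cell(torch.cat([prev_frame, context], dim=1), state)
            prev_frame = self.frame_proj(state)                      # one mel-spectrogram frame
            frames.append(prev_frame)
            alignments.append(weights)
        return torch.stack(frames, dim=1), torch.stack(alignments, dim=1)
```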
In the embodiment of the present application, the first speech synthesis model may be a pre-trained speech synthesis model. For example, the initial first speech synthesis model may be trained according to the above-mentioned standard corpus BIAOBEI until the first speech synthesis model converges, resulting in a trained first speech synthesis model. Since the standard corpus BIAOBEI includes high-quality standard text and standard speech information, the first acoustic model with good prosody can be obtained by training the first speech synthesis model according to the standard corpus. Therefore, syllable pronunciation duration statistics can be carried out on the standard text through the first acoustic model with good prosody, pronunciation duration information with good prosody is obtained, and accuracy of the pronunciation duration statistical table is improved.
For example, the training process of the first speech synthesis model may include: determining training sample data of each round and comparison voice information corresponding to the training sample data according to a standard text in a standard corpus and standard voice information corresponding to the standard text; inputting the sample data of the current training round into an initial first voice synthesis model for voice synthesis training to obtain a voice synthesis training result corresponding to the sample data of the current training round; determining a loss function value corresponding to the sample data of the current round of training according to the voice synthesis training result and the comparison voice information based on a preset loss function; and if the loss function value is larger than the preset loss value threshold, adjusting the parameters of the first voice synthesis model, carrying out the next round of training until the obtained loss function value is smaller than or equal to the loss value threshold, and finishing the training to obtain the trained first voice synthesis model.
Illustratively, the preset loss function may include, but is not limited to, a 0-1 loss function, an absolute value loss function, a logarithmic loss function, a quadratic loss function, an exponential loss function, and the like. The loss value threshold may be set according to actual conditions, and the specific value is not limited herein.
For example, the parameters of the first speech synthesis model may be adjusted by a gradient descent algorithm or a back propagation algorithm, and the specific process of adjusting the parameters is not limited herein.
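The following is a minimal sketch of the round-by-round training process described above. It assumes the SimpleAcousticModel sketched earlier and, for brevity, trains only the acoustic model against reference mel-spectrogram frames rather than the full model against comparison speech; the Adam optimizer, MSE (quadratic) loss, and stopping threshold are illustrative stand-ins, not the patent's settings.

```python
# Sketch of the convergence-based training loop (assumed data format: batches of
# (token_ids, reference_mel_frames) pairs drawn from the standard corpus).
import torch

def train_first_model(model, data_loader, loss_threshold=0.05, max_epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = torch.nn.MSELoss()            # a quadratic loss, one of the listed options
    for _ in range(max_epochs):
        for tokens, target_mels in data_loader:
            optimizer.zero_grad()
            pred_mels, _ = model(tokens, n_frames=target_mels.size(1))
            loss = criterion(pred_mels, target_mels)
            loss.backward()                    # back-propagation adjusts the parameters
            optimizer.step()
        if loss.item() <= loss_threshold:      # stop once the loss falls below the threshold
            break
    return model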
In the embodiment of the application, after the trained first speech synthesis model is obtained, syllable pronunciation duration statistics may be performed on the standard text based on the first acoustic model in the trained first speech synthesis model, so as to generate a pronunciation duration statistics table.
Referring to fig. 3, fig. 3 is a schematic flowchart of a sub-step of generating a pronunciation duration statistical table according to an embodiment of the present application, which may specifically include the following steps S101 to S103.
Step S101, inputting the standard text into the first acoustic model for sound spectrum feature extraction, and obtaining standard sound spectrum feature information corresponding to the standard text.
Illustratively, a standard text whose syllable pronunciation durations are to be counted can be obtained from the standard corpus BIAOBEI; the standard text is then input into the first acoustic model in the trained first speech synthesis model for sound spectrum feature extraction to obtain the standard sound spectrum feature information corresponding to the standard text.
For example, a standard text may be input into an encoder in the first acoustic model for encoding, and a word vector set corresponding to the standard text is obtained; then, inputting the word vector set into an attention mechanism layer in a first acoustic model for weight value distribution to obtain a word vector set after weight value distribution; and finally, inputting the word vector set after the weight value distribution into a decoder in the first acoustic model for decoding to obtain standard sound spectrum characteristic information.
And S102, extracting pronunciation duration of the standard sound spectrum characteristic information to obtain pronunciation duration information corresponding to the standard sound spectrum characteristic information.
For example, the standard sound spectrum feature information may include at least one syllable, each syllable corresponding to at least one character. The pronunciation time information may include at least one pronunciation time corresponding to each character.
For example, pronunciation duration extraction may be performed on the standard sound spectrum feature information output by the decoder, extracting the pronunciation duration corresponding to each character in each syllable so as to obtain at least one pronunciation duration for each character.
It should be noted that, in different syllables of the same character, the corresponding pronunciation time lengths may be different, so that at least one pronunciation time length corresponding to each character can be determined.
For example, for the character "a", in the syllable "an", the corresponding pronunciation time length may be 2; in the syllable "ang", the corresponding utterance duration may be 4.
And step S103, generating the pronunciation time length statistical table according to the pronunciation time length information.
In some embodiments, generating the pronunciation duration statistics table according to the pronunciation duration information may include: extracting at least one pronunciation time corresponding to each character in the pronunciation time information to obtain a pronunciation time array corresponding to each character; and storing the pronunciation time array corresponding to each character into a preset data table to obtain a pronunciation time statistical table.
For example, in the pronunciation duration statistical table, the pronunciation duration array corresponding to the character "a" may be [2, 3, 4, 5], the pronunciation duration array corresponding to the character "b" may be [2, 3, 4, 6], and the pronunciation duration array corresponding to the character "o" may be [4, 2, 3, 6].
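The following is a minimal sketch of building the pronunciation duration statistical table from the extracted duration information (the last part of steps S101-S103). The input format, a flat list of (character, pronunciation duration) pairs, is an assumption made for illustration.

```python
# Sketch: collect each character's observed pronunciation durations into an array.
from collections import defaultdict

def build_duration_table(duration_info):
    """duration_info: iterable of (character, pronunciation_duration) pairs."""
    table = defaultdict(list)
    for char, duration in duration_info:
        if duration not in table[char]:       # keep each observed duration once
            table[char].append(duration)
    return {char: sorted(durations) for char, durations in table.items()}

# Example consistent with the arrays above (values are illustrative):
durations = [("a", 2), ("a", 3), ("a", 4), ("a", 5),
             ("b", 2), ("b", 3), ("b", 4), ("b", 6)]
print(build_duration_table(durations))   # {'a': [2, 3, 4, 5], 'b': [2, 3, 4, 6]}
```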
It should be emphasized that, in order to further ensure the privacy and security of the pronunciation time length statistic table, the pronunciation time length statistic table can also be stored in a node of a block chain. Of course, the pronunciation duration statistics table may also be stored in a local database, a local disk, and an external storage device, which is not limited specifically.
It can be understood that a pronunciation duration statistical table containing standard syllable pronunciation duration can be obtained by performing syllable pronunciation duration statistics on a standard text, and pronunciation duration constraint processing can be subsequently performed on voice spectrum characteristic information according to the pronunciation duration statistical table, so that the problem that syllable pause is unnatural due to the fact that an overlarge attention weight value is given to a certain character in a syllable by an attention mechanism in the existing voice synthesis model can be solved.
And step S20, inputting the target text into a second acoustic model in a second speech synthesis model for acoustic spectrum feature extraction, and obtaining acoustic spectrum feature information corresponding to the target text.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a second speech synthesis model according to an embodiment of the present application. As shown in fig. 4, the second speech synthesis model may be a Tacotron model, including a second acoustic model and a vocoder. The second acoustic model may include, among other things, an encoder, an attention mechanism layer, and a decoder. It should be noted that the structure of the second speech synthesis model is the same as that of the first speech synthesis model, and the second speech synthesis model differs from the first speech synthesis model in that the training process is different.
In this embodiment of the present application, the second speech synthesis model may be trained in advance to obtain a trained speech synthesis model.
In some embodiments, before the inputting the target text into the second acoustic model in the second speech synthesis model for performing the extraction of the acoustic spectrum feature and obtaining the acoustic spectrum feature information corresponding to the target text, the method may further include: acquiring training sample data; and training the second voice synthesis model according to the training sample data and the pronunciation duration statistical table until the second voice synthesis model converges to obtain the trained second voice synthesis model.
For example, the training sample data may include text information and speech information corresponding to the text information. The text information can be a standard text in a standard corpus BIAOBEI, and can also be other types of texts except the standard corpus BIAOBEI; the voice information is obtained by recording the text information by a sound recorder.
Illustratively, the training process of the second acoustic model may include: determining training text data of each round and comparison voice information corresponding to the training text data according to training sample data; inputting the current round of training text data into a second acoustic model for acoustic spectrum feature extraction training to obtain acoustic spectrum feature training information corresponding to the current round of training text data; based on the pronunciation duration statistical table, carrying out pronunciation duration constraint processing on the sound spectrum characteristic training information to obtain the sound spectrum characteristic training information after constraint processing; inputting the constrained acoustic spectrum characteristic training information into a vocoder for voice synthesis to obtain a voice synthesis training result corresponding to the current round of training text data; determining a loss function value corresponding to the text data of the current round of training according to the voice synthesis training result and the comparison voice information based on a preset loss function; and if the loss function value is larger than the preset loss threshold value, adjusting the parameters of the second speech synthesis model, carrying out the next round of training until the obtained loss function value is smaller than or equal to the loss threshold value, and finishing the training to obtain the trained second speech synthesis model.
Illustratively, the preset loss function may include, but is not limited to, a 0-1 loss function, an absolute value loss function, a logarithmic loss function, a quadratic loss function, an exponential loss function, and the like.
For example, the loss threshold may be set according to actual conditions, and the specific value is not limited herein. The parameters of the second speech synthesis model may be adjusted by a convergence algorithm such as gradient descent, Newton's method, the conjugate gradient method, or the Cauchy-Newton method, and the specific process of adjusting the parameters is not limited herein. It can be understood that adjusting the parameters of the second speech synthesis model according to a convergence algorithm allows the second speech synthesis model to converge quickly, which improves training efficiency.
By training the second speech synthesis model to be convergent, the second speech synthesis model can learn pronunciation duration constraint processing, and therefore accuracy of speech synthesis of the trained second speech synthesis model is improved.
In some embodiments, to further ensure the privacy and security of the trained second speech synthesis model, the trained second speech synthesis model may be stored in a node of a blockchain. When the trained second speech synthesis model needs to be invoked, it can be invoked from a node of the blockchain.
Referring to fig. 5, fig. 5 is a schematic diagram of speech synthesis of a target text according to an embodiment of the present application. As shown in fig. 5, the target text may be input into a second acoustic model in the second speech synthesis model to perform acoustic spectrum feature extraction, so as to obtain acoustic spectrum feature information corresponding to the target text; secondly, carrying out pronunciation duration constraint processing on the sound spectrum characteristic information based on a pronunciation duration statistical table to obtain the sound spectrum characteristic information after constraint processing; and finally, inputting the constrained voice spectrum characteristic information into a vocoder in a second voice synthesis model for voice synthesis to obtain a voice synthesis result corresponding to the target text.
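The following is a minimal sketch of the synthesis flow shown in Fig. 5: feature extraction by the second acoustic model, pronunciation duration constraint using the statistical table, then vocoder synthesis. The names `acoustic_model`, `apply_duration_constraint`, and `vocoder` are assumed interfaces standing in for the components described in the text, not actual APIs.

```python
# Sketch of the Fig. 5 pipeline with assumed component interfaces.
def synthesize(target_text, duration_table, acoustic_model, apply_duration_constraint, vocoder):
    features = acoustic_model(target_text)                           # sound spectrum feature information
    features = apply_duration_constraint(features, duration_table)   # pronunciation duration constraint
    return vocoder(features)                                         # speech synthesis result (waveform)
```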
In some embodiments, inputting the target text into a second acoustic model in the second speech synthesis model for performing the extraction of the acoustic spectrum feature to obtain acoustic spectrum feature information corresponding to the target text may include: and inputting the target text into a second acoustic model in the trained second speech synthesis model for acoustic spectrum feature extraction to obtain acoustic spectrum feature information.
Referring to fig. 6, fig. 6 is a schematic flowchart of a sub-step of performing audio spectrum feature extraction on a target text according to an embodiment of the present application, and specifically includes the following steps S201 to S203.
Step S201, inputting the target text into an encoder in the second acoustic model for encoding, and obtaining a word vector set corresponding to the target text.
For example, the target text may be directly input to the encoder in the second acoustic model for encoding, or the target text may be first participled, and the obtained phrase set may be input to the encoder in the second acoustic model for encoding.
For example, the target text may be segmented according to a word segmentation model to obtain the phrase set corresponding to the target text. The word segmentation model may be a BiLSTM-CRF neural network model; of course, other neural network models may also be used, which is not limited herein.
Step S202, inputting the word vector set into the attention mechanism layer in the second acoustic model for weight value distribution, and obtaining the word vector set after weight value distribution.
Illustratively, the word vectors output by the encoder may be aggregated into a word vector set and input into the attention mechanism layer; the attention mechanism layer then assigns a weight value to each word vector in the set to obtain the weighted word vector set. The specific process of assigning the weight values is not limited herein.
Step S203, inputting the word vector set after the weight value assignment to a decoder in the second acoustic model for decoding, so as to obtain the audio spectrum feature information.
For example, the word vector set after the weighted value output by the attention mechanism layer is assigned may be input to a decoder for decoding processing, so as to obtain the acoustic spectrum feature information corresponding to the target text. The specific procedure of the decoding process is not limited herein.
By inputting the target text into the second acoustic model in the trained second speech synthesis model for sound spectrum feature extraction, the sound spectrum feature information corresponding to the target text can be obtained, and its accuracy is improved.
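Since the second acoustic model has the same structure as the first (encoder, attention mechanism layer, decoder), the SimpleAcousticModel sketched earlier can illustrate steps S201-S203. The token ids and frame count below are made-up values for demonstration only.

```python
# Usage sketch of steps S201-S203 with the assumed SimpleAcousticModel.
import torch

model = SimpleAcousticModel()
tokens = torch.randint(0, 100, (1, 12))        # encoded target text (assumed ids)
mel_frames, attn_weights = model(tokens, n_frames=50)
print(mel_frames.shape)    # torch.Size([1, 50, 80])  - the sound spectrum feature information
print(attn_weights.shape)  # torch.Size([1, 50, 12])  - per-frame attention weight distribution
```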
And step S30, based on the pronunciation duration statistical table, carrying out pronunciation duration constraint processing on the sound spectrum characteristic information to obtain the sound spectrum characteristic information after constraint processing.
In the embodiment of the application, the pronunciation duration constraint processing is performed on the voice spectrum characteristic information based on the pronunciation duration statistical table, so that the problem that the syllable pause is unnatural due to the fact that an attention mechanism in the existing voice synthesis model gives an overlarge attention weight value to a certain character can be solved, the character pronunciation duration in the voice spectrum characteristic information is constrained, and the voice spectrum characteristic information after constraint processing is obtained.
Referring to fig. 7, fig. 7 is a schematic flowchart illustrating sub-steps of performing pronunciation duration constraint processing on the sound spectrum feature information according to an embodiment of the present application, and specifically includes the following steps S301 to S303.
Step S301, obtaining a first pronunciation duration corresponding to each character to be constrained in the sound spectrum characteristic information.
Illustratively, the sound spectrum characteristic information comprises at least one character to be constrained and processed; for example, each character in the sound spectrum feature information may be determined as a character to be constrained.
In some embodiments, the first pronunciation duration corresponding to each character to be constrained in the sound spectrum feature information may be extracted. For example, the pronunciation duration corresponding to each character to be constrained may be extracted from the syllable in which that character is located, so as to obtain the first pronunciation duration corresponding to each character to be constrained, wherein the first pronunciation duration includes at least one pronunciation duration. For example, for the character "a" to be constrained, the pronunciation duration corresponding to phoneme A may be 1, and the pronunciation duration corresponding to phoneme B may be 6, so the first pronunciation duration (1, 6) corresponding to the character "a" to be constrained can be obtained. It should be noted that in the embodiments of the present application, the unit of pronunciation duration may be determined according to actual conditions, such as seconds, milliseconds, and the like.
Step S302, performing character matching on each character to be constrained and the pronunciation duration statistical table to obtain a target character corresponding to each character to be constrained.
For example, the character in the pronunciation duration statistical table that is the same as the character to be constrained may be determined as the target character corresponding to the character to be constrained. For example, for the character "a" to be constrained, the character "a" in the pronunciation duration statistical table may be determined as the target character corresponding to the character "a" to be constrained.
Step S303, according to the pronunciation time array corresponding to the target character, the first pronunciation time is subjected to constraint processing, and a second pronunciation time corresponding to each character to be subjected to constraint processing is obtained.
In some embodiments, performing constraint processing on the first pronunciation duration according to the pronunciation duration array corresponding to the target character to obtain a second pronunciation duration corresponding to each character to be constrained, includes: determining the maximum pronunciation time length and the minimum pronunciation time length in the pronunciation time length array; when the first pronunciation duration is greater than or equal to the maximum pronunciation duration, determining the maximum pronunciation duration as a second pronunciation duration; when the first pronunciation duration is less than or equal to the minimum pronunciation duration, determining the minimum pronunciation duration as a second pronunciation duration; and when the first pronunciation time length is greater than the minimum pronunciation time length and less than the maximum pronunciation time length, determining the first pronunciation time length as a second pronunciation time length.
Illustratively, for the character "a" to be constrained, the corresponding first pronunciation time length is (1, 6), and the pronunciation time length array corresponding to the target character "a" is [2, 3, 4, 5], where the maximum pronunciation time length is 5 and the minimum pronunciation time length is 2. For a first pronunciation duration 1 corresponding to the character "a" to be constrained, the first pronunciation duration 1 is smaller than a minimum pronunciation duration 2, so that the minimum pronunciation duration 2 can be determined as a second pronunciation duration; for the first pronunciation duration 6 corresponding to the character "a" to be constrained, since the first pronunciation duration 6 is greater than the maximum pronunciation duration 5, the maximum pronunciation duration 5 can be determined as the second pronunciation duration. Therefore, the second pronunciation time length corresponding to the character "a" to be constrained is (2, 5).
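The following is a minimal sketch of the constraint rule in steps S301-S303: each first pronunciation duration is clamped to the [minimum, maximum] range of the matching character's duration array from the statistical table. Function and variable names are illustrative, and the example reproduces the "a" case above.

```python
# Sketch: clamp each extracted duration to the range observed in the statistical table.
def constrain_durations(first_durations, duration_array):
    """first_durations: durations extracted for one character to be constrained;
    duration_array: that character's entry in the pronunciation duration statistical table."""
    min_d, max_d = min(duration_array), max(duration_array)
    return tuple(min(max(d, min_d), max_d) for d in first_durations)

# Character "a": first pronunciation duration (1, 6), table array [2, 3, 4, 5].
print(constrain_durations((1, 6), [2, 3, 4, 5]))   # (2, 5) - the second pronunciation duration
```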
Step S40, inputting the constrained voice spectrum feature information into the vocoder in the second speech synthesis model to perform speech synthesis, and obtaining a speech synthesis result corresponding to the target text.
In the embodiment of the present application, after performing pronunciation duration constraint processing on the voice spectrum feature information based on the pronunciation duration statistical table, the voice spectrum feature information after constraint processing may be input to a vocoder in the second voice synthesis model for voice synthesis, so that a voice synthesis result corresponding to the target text may be obtained. The voice synthesis is carried out by inputting the voice spectrum characteristic information after the constraint processing into a vocoder in the second voice synthesis model, so that the voice synthesis result with natural syllable pause can be obtained, and the accuracy of the voice synthesis result is improved.
It should be noted that the vocoder is a tool for converting acoustic parameters into a speech waveform. Illustratively, vocoders may include, but are not limited to, the WaveGlow model, the parallell WaveNet model, the parallell WaveGan model, and the MelGan model, among others.
Illustratively, to synthesize speech from the constrained sound spectrum feature information, the vocoder may perform phase recovery on the constrained sound spectrum feature information based on the Griffin-Lim algorithm and then restore the speech waveform through the inverse short-time Fourier transform (ISTFT), thereby obtaining the speech synthesis result corresponding to the target text. It should be noted that reconstructing the speech signal requires both a magnitude spectrum and a phase spectrum; the Griffin-Lim algorithm estimates the phase spectrum from the magnitude spectrum.
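The following is a minimal sketch of this reconstruction step using librosa's Griffin-Lim implementation. It assumes the constrained features have already been converted to a linear magnitude spectrogram `mag`; the STFT parameters and iteration count are illustrative choices, not the patent's settings.

```python
# Sketch: Griffin-Lim phase recovery followed by ISTFT-based waveform reconstruction.
import numpy as np
import librosa

def spectrogram_to_waveform(mag, n_iter=60, hop_length=256, win_length=1024):
    # librosa.griffinlim iteratively estimates the phase and applies the inverse STFT.
    return librosa.griffinlim(mag, n_iter=n_iter,
                              hop_length=hop_length, win_length=win_length)

# Example with a random magnitude spectrogram (513 frequency bins, 100 frames):
waveform = spectrogram_to_waveform(np.abs(np.random.randn(513, 100)).astype(np.float32))
print(waveform.shape)
```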
According to the speech synthesis method provided by this embodiment, the target text to be speech-synthesized and the pronunciation duration statistical table are obtained, so that pronunciation duration constraint processing can then be performed on the sound spectrum feature information corresponding to the target text according to the pronunciation duration statistical table. Training the first speech synthesis model on the standard corpus yields a first acoustic model with good prosody, so syllable pronunciation duration statistics can be performed on the standard text through this first acoustic model to obtain pronunciation duration information with good prosody, which improves the accuracy of the pronunciation duration statistical table. Performing syllable pronunciation duration statistics on the standard text produces a pronunciation duration statistical table containing standard syllable pronunciation durations, and the subsequent pronunciation duration constraint processing on the sound spectrum feature information according to this table solves the problem of unnatural syllable pauses caused when the attention mechanism in existing speech synthesis models assigns an excessive attention weight to a certain character in a syllable. Training the second speech synthesis model to convergence allows it to learn the pronunciation duration constraint processing, improving the accuracy of speech synthesis with the trained second speech synthesis model. Inputting the target text into the second acoustic model in the trained second speech synthesis model for sound spectrum feature extraction yields the sound spectrum feature information corresponding to the target text and improves its accuracy. Performing pronunciation duration constraint processing on the sound spectrum feature information based on the pronunciation duration statistical table constrains the character pronunciation durations in the sound spectrum feature information and obtains the constrained sound spectrum feature information. Finally, inputting the constrained sound spectrum feature information into the vocoder in the second speech synthesis model for speech synthesis produces a speech synthesis result with natural syllable pauses and improves the accuracy of the speech synthesis result.
Referring to fig. 8, fig. 8 is a schematic block diagram of a speech synthesis apparatus 1000 according to an embodiment of the present application, the speech synthesis apparatus being configured to perform the foregoing speech synthesis method. The speech synthesis apparatus may be configured in a server or a terminal.
As shown in fig. 8, the speech synthesis apparatus 1000 includes: a statistical table obtaining module 1001, a sound spectrum feature extracting module 1002, a constraint processing module 1003 and a speech synthesizing module 1004.
The statistical table obtaining module 1001 is configured to obtain a target text to be subjected to speech synthesis and obtain a pronunciation duration statistical table, where the pronunciation duration statistical table is obtained by performing syllable pronunciation duration statistics on a standard text based on a first acoustic model in a first speech synthesis model.
And a sound spectrum feature extraction module 1002, configured to input the target text into a second acoustic model in a second speech synthesis model to perform sound spectrum feature extraction, so as to obtain sound spectrum feature information corresponding to the target text.
A constraint processing module 1003, configured to perform constraint processing on the pronunciation duration for the sound spectrum feature information based on the pronunciation duration statistical table, to obtain the sound spectrum feature information after constraint processing.
A speech synthesis module 1004, configured to input the constrained sonogram feature information into the vocoder in the second speech synthesis model for speech synthesis, so as to obtain a speech synthesis result corresponding to the target text.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus and the modules described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The apparatus described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 9.
Referring to fig. 9, fig. 9 is a schematic block diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.
Referring to fig. 9, the computer device includes a processor and a memory connected by a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for running the computer program stored in the non-volatile storage medium; when executed by the processor, the computer program causes the processor to perform any of the speech synthesis methods described herein.
It should be understood that the processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring a target text to be subjected to voice synthesis and acquiring a pronunciation duration statistical table, wherein the pronunciation duration statistical table is obtained by performing syllable pronunciation duration statistics on a standard text based on a first acoustic model in a first voice synthesis model; inputting the target text into a second acoustic model in a second speech synthesis model for acoustic spectrum feature extraction to obtain acoustic spectrum feature information corresponding to the target text; based on the pronunciation duration statistical table, carrying out pronunciation duration constraint processing on the voice spectrum characteristic information to obtain the voice spectrum characteristic information after constraint processing; and inputting the voice spectrum characteristic information after constraint processing into a vocoder in the second voice synthesis model for voice synthesis to obtain a voice synthesis result corresponding to the target text.
In one embodiment, the processor, when being configured to obtain the pronunciation duration statistics, is configured to:
inputting the standard text into the first acoustic model for sound spectrum feature extraction to obtain standard sound spectrum feature information corresponding to the standard text; extracting pronunciation duration of the standard sound spectrum characteristic information to obtain pronunciation duration information corresponding to the standard sound spectrum characteristic information; and generating the pronunciation duration statistical table according to the pronunciation duration information.
In one embodiment, the standard sound spectrum feature information comprises at least one syllable, each syllable corresponds to at least one character, and the pronunciation duration information comprises at least one pronunciation duration corresponding to each character; when the processor generates the pronunciation duration statistical table according to the pronunciation duration information, the processor is used for realizing that:
extracting at least one pronunciation time corresponding to each character in the pronunciation time information to obtain a pronunciation time array corresponding to each character; and storing the pronunciation time array corresponding to each character into a preset data table to obtain the pronunciation time statistical table.
In one embodiment, before implementing the extraction of the acoustic spectrum feature of the second acoustic model in which the target text is input into the second speech synthesis model, and obtaining the acoustic spectrum feature information corresponding to the target text, the processor is further configured to implement:
acquiring training sample data; and training the second voice synthesis model according to the training sample data and the pronunciation duration statistical table until the second voice synthesis model is converged to obtain the trained second voice synthesis model.
In one embodiment, when implementing the extraction of the acoustic spectrum feature of the second acoustic model in which the target text is input into the second speech synthesis model to obtain the acoustic spectrum feature information corresponding to the target text, the processor is configured to implement:
and inputting the target text into a second acoustic model in the trained second speech synthesis model for acoustic spectrum feature extraction to obtain the acoustic spectrum feature information.
In one embodiment, when implementing the extraction of the acoustic spectrum feature of the second acoustic model in the second trained speech synthesis model to which the target text is input, and obtaining the acoustic spectrum feature information, the processor is configured to implement:
inputting the target text into an encoder in the second acoustic model for encoding to obtain a word vector set corresponding to the target text; inputting the word vector set into an attention mechanism layer in the second acoustic model for weight value distribution to obtain a word vector set after weight value distribution; and inputting the word vector set after the weight value distribution into a decoder in the second acoustic model for decoding to obtain the sound spectrum characteristic information.
In one embodiment, the sound spectrum characteristic information comprises at least one character to be constrained and processed; the processor is used for realizing that when the pronunciation duration constraint processing is carried out on the voice spectrum characteristic information based on the pronunciation duration statistical table, the processor is used for realizing that:
acquiring a first pronunciation duration corresponding to each character to be constrained in the sound spectrum characteristic information; performing character matching on each character to be constrained and the pronunciation duration statistical table to obtain a target character corresponding to each character to be constrained; and according to the pronunciation time array corresponding to the target character, carrying out constraint processing on the first pronunciation time to obtain a second pronunciation time corresponding to each character to be constrained.
In one embodiment, the processor is configured to perform constraint processing on the first pronunciation time according to the pronunciation time array corresponding to the target character to obtain a second pronunciation time corresponding to each character to be constrained, and is configured to:
determining the maximum pronunciation time length and the minimum pronunciation time length in the pronunciation time length array; when the first pronunciation time length is greater than or equal to the maximum pronunciation time length, determining the maximum pronunciation time length as the second pronunciation time length; when the first pronunciation duration is less than or equal to the minimum pronunciation duration, determining the minimum pronunciation duration as the second pronunciation duration; and when the first pronunciation time length is greater than the minimum pronunciation time length and less than the maximum pronunciation time length, determining the first pronunciation time length as the second pronunciation time length.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program includes program instructions, and the processor executes the program instructions to implement any one of the speech synthesis methods provided in the embodiments of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital Card (SD Card), a Flash memory Card (Flash Card), and the like provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The block chain referred by the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of speech synthesis, comprising:
acquiring a target text to be subjected to voice synthesis and acquiring a pronunciation duration statistical table, wherein the pronunciation duration statistical table is obtained by performing syllable pronunciation duration statistics on a standard text based on a first acoustic model in a first voice synthesis model;
inputting the target text into a second acoustic model in a second speech synthesis model for acoustic spectrum feature extraction to obtain acoustic spectrum feature information corresponding to the target text;
based on the pronunciation duration statistical table, carrying out pronunciation duration constraint processing on the voice spectrum characteristic information to obtain the voice spectrum characteristic information after constraint processing;
and inputting the voice spectrum characteristic information after constraint processing into a vocoder in the second voice synthesis model for voice synthesis to obtain a voice synthesis result corresponding to the target text.
2. The speech synthesis method of claim 1, wherein the obtaining of the pronunciation duration statistics comprises:
inputting the standard text into the first acoustic model for acoustic spectrum feature extraction to obtain standard acoustic spectrum feature information corresponding to the standard text;
performing pronunciation duration extraction on the standard acoustic spectrum feature information to obtain pronunciation duration information corresponding to the standard acoustic spectrum feature information;
and generating the pronunciation duration statistical table according to the pronunciation duration information.
3. The speech synthesis method according to claim 2, wherein the standard acoustic spectrum feature information includes at least one syllable, each syllable corresponds to at least one character, and the pronunciation duration information includes at least one pronunciation duration corresponding to each character;
generating the pronunciation duration statistical table according to the pronunciation duration information, including:
extracting at least one pronunciation duration corresponding to each character in the pronunciation duration information to obtain a pronunciation duration array corresponding to each character;
and storing the pronunciation duration array corresponding to each character into a preset data table to obtain the pronunciation duration statistical table.
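A minimal sketch of how such a per-character table might be assembled from (character, pronunciation duration) observations; the dictionary layout and names are assumptions for illustration, not the patent's data format.

```python
from collections import defaultdict


def build_duration_table(observations):
    """Group observed pronunciation durations (in frames) by character.

    `observations` is an iterable of (character, duration) pairs extracted
    from the standard acoustic spectrum feature information; the result maps
    each character to its pronunciation duration array.
    """
    table = defaultdict(list)
    for character, duration in observations:
        table[character].append(duration)
    return dict(table)


# Example observations for two characters.
obs = [("ni", 9.0), ("hao", 12.0), ("ni", 8.5), ("hao", 11.0)]
print(build_duration_table(obs))  # {'ni': [9.0, 8.5], 'hao': [12.0, 11.0]}
```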
4. The speech synthesis method according to claim 1, wherein before the inputting of the target text into the second acoustic model in the second speech synthesis model for acoustic spectrum feature extraction to obtain the acoustic spectrum feature information corresponding to the target text, the method further comprises:
acquiring training sample data;
training the second speech synthesis model according to the training sample data and the pronunciation duration statistical table until the second speech synthesis model converges, to obtain the trained second speech synthesis model;
the inputting the target text into a second acoustic model in a second speech synthesis model for performing acoustic spectrum feature extraction to obtain acoustic spectrum feature information corresponding to the target text includes:
and inputting the target text into a second acoustic model in the trained second speech synthesis model for acoustic spectrum feature extraction to obtain the acoustic spectrum feature information.
5. The method according to claim 4, wherein the inputting the target text into a second acoustic model of the trained second speech synthesis model for performing acoustic spectrum feature extraction to obtain the acoustic spectrum feature information comprises:
inputting the target text into an encoder in the second acoustic model for encoding to obtain a word vector set corresponding to the target text;
inputting the word vector set into an attention mechanism layer in the second acoustic model for weight value distribution to obtain a word vector set after weight value distribution;
and inputting the word vector set after the weight value distribution into a decoder in the second acoustic model for decoding to obtain the acoustic spectrum feature information.
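The encoder / attention / decoder split in this claim resembles common sequence-to-sequence acoustic models (e.g. Tacotron-style architectures). The NumPy sketch below is a deliberately tiny illustration of the three stages: character embeddings as the word vector set, softmaxed dot-product scores as the weight value distribution, and a linear projection of the weighted vectors as the decoding step. All dimensions and names are illustrative assumptions, not the patent's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Encoder": map each character of the target text to a vector (embedding lookup).
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}
embedding = rng.normal(size=(len(vocab), 16))               # 16-dimensional word vectors


def encode(text: str) -> np.ndarray:
    return embedding[[vocab[ch] for ch in text.lower() if ch in vocab]]


# "Attention mechanism layer": assign a weight value to each word vector for one decoder step.
def attend(word_vectors: np.ndarray, query: np.ndarray) -> np.ndarray:
    scores = word_vectors @ query                            # dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                 # softmax over input positions
    return weights[:, None] * word_vectors                   # word vector set after weight distribution


# "Decoder": project the weighted vectors to one frame of acoustic spectrum features (80 mel bins).
decoder_weights = rng.normal(size=(16, 80))


def decode(weighted_vectors: np.ndarray) -> np.ndarray:
    context = weighted_vectors.sum(axis=0)                   # summarize the weighted word vectors
    return context @ decoder_weights                         # one acoustic spectrum frame


vectors = encode("ni hao")
frame = decode(attend(vectors, query=rng.normal(size=16)))
print(frame.shape)  # (80,)
```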
6. The speech synthesis method according to claim 1, wherein the acoustic spectrum feature information includes at least one character to be constrained; and the performing, based on the pronunciation duration statistical table, pronunciation duration constraint processing on the acoustic spectrum feature information includes:
acquiring a first pronunciation duration corresponding to each character to be constrained in the acoustic spectrum feature information;
performing character matching on each character to be constrained and the pronunciation duration statistical table to obtain a target character corresponding to each character to be constrained;
and constraining the first pronunciation duration according to the pronunciation duration array corresponding to the target character to obtain a second pronunciation duration corresponding to each character to be constrained.
7. The speech synthesis method according to claim 6, wherein the constraining the first pronunciation duration according to the pronunciation duration array corresponding to the target character to obtain a second pronunciation duration corresponding to each character to be constrained, comprises:
determining the maximum pronunciation duration and the minimum pronunciation duration in the pronunciation duration array;
when the first pronunciation duration is greater than or equal to the maximum pronunciation duration, determining the maximum pronunciation duration as the second pronunciation duration;
when the first pronunciation duration is less than or equal to the minimum pronunciation duration, determining the minimum pronunciation duration as the second pronunciation duration;
and when the first pronunciation duration is greater than the minimum pronunciation duration and less than the maximum pronunciation duration, determining the first pronunciation duration as the second pronunciation duration.
8. A speech synthesis apparatus, comprising:
the statistical table acquisition module is used for acquiring a target text to be subjected to speech synthesis and acquiring a pronunciation duration statistical table, wherein the pronunciation duration statistical table is obtained by performing syllable pronunciation duration statistics on a standard text based on a first acoustic model in a first speech synthesis model;
the acoustic spectrum feature extraction module is used for inputting the target text into a second acoustic model in a second speech synthesis model to perform acoustic spectrum feature extraction so as to obtain acoustic spectrum feature information corresponding to the target text;
the constraint processing module is used for performing pronunciation duration constraint processing on the acoustic spectrum feature information based on the pronunciation duration statistical table to obtain constrained acoustic spectrum feature information;
and the speech synthesis module is used for inputting the constrained acoustic spectrum feature information into a vocoder in the second speech synthesis model for speech synthesis to obtain a speech synthesis result corresponding to the target text.
9. A computer device, wherein the computer device comprises a memory and a processor;
the memory for storing a computer program;
the processor for executing the computer program and, when executing the computer program, implementing the speech synthesis method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the speech synthesis method according to any one of claims 1 to 7.
CN202110742024.4A 2021-06-30 2021-06-30 Speech synthesis method, device, computer equipment and storage medium Active CN113421548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110742024.4A CN113421548B (en) 2021-06-30 2021-06-30 Speech synthesis method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110742024.4A CN113421548B (en) 2021-06-30 2021-06-30 Speech synthesis method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113421548A true CN113421548A (en) 2021-09-21
CN113421548B CN113421548B (en) 2024-02-06

Family

ID=77717714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110742024.4A Active CN113421548B (en) 2021-06-30 2021-06-30 Speech synthesis method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113421548B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010032080A1 (en) * 2000-03-31 2001-10-18 Toshiaki Fukada Speech information processing method and apparatus and storage medium
CN103065619A (en) * 2012-12-26 2013-04-24 安徽科大讯飞信息科技股份有限公司 Speech synthesis method and speech synthesis system
JP2013205638A (en) * 2012-03-28 2013-10-07 Yamaha Corp Voice synthesizer
US20160365085A1 (en) * 2015-06-11 2016-12-15 Interactive Intelligence Group, Inc. System and method for outlier identification to remove poor alignments in speech synthesis
CN111091807A (en) * 2019-12-26 2020-05-01 广州酷狗计算机科技有限公司 Speech synthesis method, speech synthesis device, computer equipment and storage medium
CN112735377A (en) * 2020-12-28 2021-04-30 平安科技(深圳)有限公司 Speech synthesis method, device, terminal equipment and storage medium
CN112802450A (en) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof

Also Published As

Publication number Publication date
CN113421548B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
US11295721B2 (en) Generating expressive speech audio from text data
US11664011B2 (en) Clockwork hierarchal variational encoder
US11355097B2 (en) Sample-efficient adaptive text-to-speech
US20220301543A1 (en) Unsupervised Parallel Tacotron Non-Autoregressive and Controllable Text-To-Speech
US20230343319A1 (en) speech processing system and a method of processing a speech signal
CN112735454A (en) Audio processing method and device, electronic equipment and readable storage medium
CN114333865A (en) Model training and tone conversion method, device, equipment and medium
CN113823257B (en) Speech synthesizer construction method, speech synthesis method and device
CN112580669B (en) Training method and device for voice information
CN113129864A (en) Voice feature prediction method, device, equipment and readable storage medium
CN110930975A (en) Method and apparatus for outputting information
CN113362804A (en) Method, device, terminal and storage medium for synthesizing voice
CN113421548B (en) Speech synthesis method, device, computer equipment and storage medium
CN113421571B (en) Voice conversion method and device, electronic equipment and storage medium
CN114464163A (en) Method, device, equipment, storage medium and product for training speech synthesis model
CN115700871A (en) Model training and speech synthesis method, device, equipment and medium
CN113345454A (en) Method, device, equipment and storage medium for training and applying voice conversion model
CN113066472A (en) Synthetic speech processing method and related device
CN114299910B (en) Training method, using method, device, equipment and medium of speech synthesis model
WO2024055752A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
Nema Automatic passkey generator using speech biometric features
CN112712789B (en) Cross-language audio conversion method, device, computer equipment and storage medium
US20220068256A1 (en) Building a Text-to-Speech System from a Small Amount of Speech Data
CN114141259A (en) Voice conversion method, device, equipment, storage medium and program product
CN117765898A (en) Data processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant