CN113450758B - Speech synthesis method, apparatus, device and medium - Google Patents

Speech synthesis method, apparatus, device and medium Download PDF

Info

Publication number
CN113450758B
CN113450758B (application CN202110996774.4A)
Authority
CN
China
Prior art keywords
phoneme
features
alignment
feature
target text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110996774.4A
Other languages
Chinese (zh)
Other versions
CN113450758A (en)
Inventor
郭少彤
陈昌滨
贺刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd
Priority to CN202110996774.4A
Publication of CN113450758A
Application granted
Publication of CN113450758B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The present disclosure provides a speech synthesis method, apparatus, device, and medium, wherein the method comprises: obtaining semantic features, phoneme features, and acoustic features of a target text; performing a first alignment operation on the semantic features and the acoustic features to obtain a first alignment result; performing a second alignment operation on the phoneme features and the acoustic features to obtain a second alignment result; performing feature fusion according to the first alignment result and the second alignment result to obtain fusion features; and generating synthesized speech corresponding to the target text based on the fusion features. The method and apparatus can effectively improve the speech synthesis effect.

Description

Speech synthesis method, apparatus, device and medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a speech synthesis method, apparatus, device, and medium.
Background
With the development of artificial intelligence, speech synthesis technology, which automatically converts text into speech (synthesized audio), has been widely applied in practical scenarios such as online customer service, online education, voice assistants, smart speakers, and audiobooks. However, the speech produced by prior-art speech synthesis technology sounds unnatural; that is, the speech synthesis effect is poor.
Disclosure of Invention
To solve the above technical problem or at least partially solve the above technical problem, the present disclosure provides a speech synthesis method, apparatus, device and medium.
According to an aspect of the present disclosure, there is provided a speech synthesis method including: obtaining semantic features, phoneme features and acoustic features of a target text; performing a first alignment operation on the semantic features and the acoustic features to obtain a first alignment result; performing a second alignment operation on the phoneme characteristics and the acoustic characteristics to obtain a second alignment result; performing feature fusion according to the first alignment result and the second alignment result to obtain fusion features; and obtaining the synthetic voice corresponding to the target text based on the fusion characteristics.
According to another aspect of the present disclosure, there is provided a speech synthesis apparatus including: a feature acquisition module for acquiring semantic features, phoneme features, and acoustic features of a target text; a first alignment module for performing a first alignment operation on the semantic features and the acoustic features to obtain a first alignment result; a second alignment module for performing a second alignment operation on the phoneme features and the acoustic features to obtain a second alignment result; a feature fusion module for performing feature fusion based on the first alignment result and the second alignment result to obtain fusion features; and a speech generation module for generating synthesized speech corresponding to the target text based on the fusion features.
According to another aspect of the present disclosure, there is provided an electronic device including: a processor; and a memory storing a program, wherein the program comprises instructions which, when executed by the processor, cause the processor to perform a speech synthesis method according to the above.
According to another aspect of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the above-described speech synthesis method.
According to the technical solution provided by the embodiments of the present disclosure, the semantic features, phoneme features, and acoustic features of the target text are first obtained; then a first alignment operation is performed on the semantic features and the acoustic features to obtain a first alignment result, and a second alignment operation is performed on the phoneme features and the acoustic features to obtain a second alignment result; feature fusion is then performed according to the first alignment result and the second alignment result to obtain fusion features, and the synthesized speech corresponding to the target text is obtained based on the fusion features. In this way, the coarse-grained semantic features and the fine-grained phoneme features are each aligned with the acoustic features, and the alignment results are fused to obtain the synthesized speech, so that features of different granularities can be fully utilized, which helps improve the speech synthesis effect.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flow chart of a speech synthesis method according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a speech synthesis system according to an embodiment of the present disclosure;
fig. 3 is a schematic flow chart of another speech synthesis method provided by the embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and its variants as used in this disclosure are intended to be inclusive, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" modifications in this disclosure are intended to be illustrative rather than limiting, and that those skilled in the art will recognize that "one or more" may be used unless the context clearly dictates otherwise.
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
Most existing speech synthesis technologies are implemented with an end-to-end speech synthesis system built on a sequence-to-sequence framework. Such a system mainly includes an encoder and a decoder: the encoder encodes the input text (performs feature extraction) and automatically learns context relationships from the text, such as the alignment between text features and acoustic features, while the decoder decodes the encoded text (the extracted features) to finally obtain the audio (synthesized speech) corresponding to the text. However, the encoder's ability to learn from the text is limited, which leads to poor final speech quality; the obtained speech typically suffers from problems of prosody and pronunciation accuracy, such as wrong intonation and stiff or erroneous liaison between words. To address this, research has proposed obtaining both the coarser-grained semantic-level text features and the finer-grained phoneme-level text features of a text sentence, splicing them, using the spliced features as the decoder input, and letting the decoder align the spliced features with the acoustic features; that is, the alignment information between text features and acoustic features is computed at the phoneme level and the semantic level at the same time. However, this approach improves the naturalness and prosody of the synthesized speech only to a limited degree. Through extensive research the inventors found that splicing and aligning the coarse-grained semantic-level text features (hereinafter, semantic features) with the fine-grained phoneme-level text features (hereinafter, phoneme features) forces the computation to be carried out on the fine-grained phoneme features, so part of the information carried by the coarse-grained semantic features is lost (ignored). As a result, the information in the coarse-grained semantic features cannot fully play its role, the speech synthesis system does not adequately understand the overall context of the text, and the finally synthesized speech performs poorly in naturalness, prosody, fluency, and so on.
It should be understood that the defects of the related-art speech synthesis schemes described above were identified only after careful practice and study by the applicant; therefore, both the discovery of these defects and the solutions to them proposed by the embodiments of the present application below should be regarded as the applicant's contribution to the present application. To solve the above problems, the inventors propose a speech synthesis method, apparatus, device, and medium that can make full use of semantic features and phoneme features of different granularities, thereby better implementing speech synthesis and improving the speech synthesis effect, so that it can be better applied in various occasions requiring speech synthesis, such as online customer service, online education, voice assistants, smart speakers, and audiobooks, without limitation. For ease of understanding, a detailed description follows:
fig. 1 is a flowchart of a speech synthesis method provided by an embodiment of the present disclosure, which may be executed by a speech synthesis apparatus, where the apparatus may be implemented by software and/or hardware, and may be generally integrated in an electronic device. As shown in fig. 1, the method mainly includes the following steps S102 to S110:
step S102, semantic features, phoneme features and acoustic features of the target text are obtained.
Acoustic features may also be referred to herein as speech features; semantic features may also be referred to as semantic-level text features, and phoneme features as phoneme-level text features. A phoneme is the smallest speech unit obtained by dividing speech according to its natural attributes; in other words, a phoneme is the smallest unit, or smallest speech segment, that makes up a syllable. Acoustically, a phoneme is the smallest speech unit divided from the viewpoint of sound quality; physiologically, one articulatory action forms one phoneme. It can be understood that semantic features are coarser-grained and phoneme features finer-grained. For example, a sentence containing 10 Chinese characters may correspond to only 10 characters (assuming one Chinese character is one character), yet may correspond to, say, 30 phonemes when converted into a phoneme sequence; the semantic features obtained from the character sequence of 10 characters therefore have a coarser granularity than the phoneme features obtained from the phoneme sequence of 30 phonemes.
In practical application, the semantic features, the phoneme features and the acoustic features of the target text can be obtained through the neural network obtained through pre-training, and in the embodiment of the disclosure, the manner of obtaining the semantic features, the phoneme features and the acoustic features of the target text is not specifically limited.
Step S104, a first alignment operation is executed on the semantic features and the acoustic features to obtain a first alignment result.
The main purpose of performing the alignment operation is to find a mapping relationship or an association relationship between the semantic features and the acoustic features, and in the embodiment of the present disclosure, the feature alignment operation is not limited. To achieve a better alignment effect, in some embodiments, the first alignment operation may be an attention-based alignment operation, and the first alignment result may be represented by means of an alignment matrix. That is, the semantic features and the acoustic features may be aligned based on an attention mechanism, resulting in a first alignment matrix.
And step S106, executing a second alignment operation on the phoneme characteristics and the acoustic characteristics to obtain a second alignment result.
In some embodiments, the second alignment operation may be an attention-based alignment operation, and the second alignment result may be represented by way of an alignment matrix. That is, the phoneme features and the acoustic features may be aligned based on an attention mechanism to obtain the second alignment matrix.
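The embodiments above do not fix a particular attention formulation. Purely for illustration, the following sketch (PyTorch; the function name and all shapes are hypothetical) shows one way an attention-based alignment could produce an alignment matrix between a linguistic feature sequence (semantic or phoneme features) and an acoustic feature sequence:

```python
# Illustrative sketch only: scaled dot-product attention used as an alignment operator.
# The disclosure does not specify this exact formulation; shapes are assumptions.
import torch
import torch.nn.functional as F

def align(linguistic: torch.Tensor, acoustic: torch.Tensor):
    """linguistic: [T_text, d], acoustic: [T_frames, d] (same feature dimension d)."""
    scores = acoustic @ linguistic.T / linguistic.shape[-1] ** 0.5   # [T_frames, T_text]
    alignment = F.softmax(scores, dim=-1)                            # each row sums to 1
    context = alignment @ linguistic                                 # aligned linguistic info per frame
    return alignment, context

# Usage with random placeholders: 12 characters of semantic features, 80 acoustic frames.
E_l = torch.randn(12, 256)
E_speech = torch.randn(80, 256)
first_alignment_matrix, _ = align(E_l, E_speech)   # the "first alignment result"
```

Each row of such an alignment matrix weights the linguistic positions that one acoustic frame attends to, which is one concrete sense in which an alignment result can be represented by means of an alignment matrix.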
Step S104 and step S106 may be executed sequentially (S104 first and then S106, or S106 first and then S104) or simultaneously; the execution order is not limited in the embodiments of the present disclosure. Regardless of the order, the semantic features and phoneme features of different granularities are each aligned with the acoustic features; that is, a dual-alignment mechanism is adopted to obtain two alignment results, so that the semantic features and phoneme features of different granularities are fully utilized.
And S108, performing feature fusion according to the first alignment result and the second alignment result to obtain fusion features.
The embodiment of the present disclosure does not limit the feature fusion method. By performing feature fusion on the first alignment result and the second alignment result, the obtained fusion feature fully contains both semantic information and phoneme information.
And step S110, generating synthetic voice corresponding to the target text based on the fusion characteristics.
The fusion features serve as decoding conditions and can participate in an autoregressive decoding process, so that the synthesized speech is obtained from the decoding result. In some embodiments, the fusion features may be autoregressively decoded by a pre-trained decoder to obtain a mel spectrum (which may also be referred to as a mel spectrogram); the mel spectrum is then converted into audio by a vocoder, and the audio is taken as the synthesized speech corresponding to the target text. Alternatively, the vocoder may be a Griffin-Lim vocoder; specifically, the mel spectrum may be converted into a magnitude spectrum, and the Griffin-Lim vocoder may then be used to obtain a speech signal (audio) from the magnitude spectrum. Griffin-Lim is an algorithm that can reconstruct speech even when only the magnitude spectrum, and not the phase spectrum, is known.
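For illustration only, a minimal sketch of this mel-spectrum-to-audio step using librosa's Griffin-Lim implementation is shown below; the sampling rate, FFT size, hop length, and iteration count are assumptions rather than values given in this disclosure:

```python
# Illustrative sketch: invert a predicted mel spectrum to audio with Griffin-Lim.
# All signal-processing parameters here are assumptions, not values from the disclosure.
import librosa

def mel_to_audio(mel, sr=22050, n_fft=1024, hop_length=256):
    # Invert the mel filter bank to an approximate magnitude spectrogram.
    magnitude = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=n_fft)
    # Griffin-Lim iteratively estimates the missing phase from the magnitude spectrum.
    return librosa.griffinlim(magnitude, n_iter=60, hop_length=hop_length)

# mel: numpy array of shape [n_mels, T_frames] produced by the decoder;
# the returned waveform is the synthesized speech for the target text.
```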
The speech synthesis method provided by the embodiments of the present disclosure aligns the coarse-grained semantic features and the fine-grained phoneme features with the acoustic features separately and then fuses the alignment results to obtain the synthesized speech, so that features of different granularities can be fully utilized; this helps improve the speech synthesis effect and the expressiveness of the synthesized speech in naturalness, fluency, prosody, and so on.
Further, an embodiment of the present disclosure provides a method for obtaining the semantic features of the target text, including: inputting the target text into a pre-trained semantic feature extraction model, and performing a semantic feature extraction operation on the target text through the semantic feature extraction model to obtain the semantic features of the target text. The semantic feature extraction model is a neural network model whose input is a text and whose output is the semantic features corresponding to that text; semantic feature extraction can thus be realized conveniently and quickly. The network structure of the semantic feature extraction model is not limited in the embodiments of the present disclosure, and its training may follow network training methods in the related art, for example unsupervised training on a large number of text samples; that is, the semantic feature extraction model may be an unsupervised pre-trained model, and may for example be implemented with a BERT (Bidirectional Encoder Representations from Transformers) model, which is not described further here.
In order to extract semantic features well, the semantic feature extraction operation performed by the semantic feature extraction model comprises the following steps (1) - (3):
(1) Perform character segmentation on the target text to obtain a character sequence. In some specific embodiments, the target text may be segmented according to a preset character table (or word table), and the segmented characters are then sorted according to their positions in the original target text to obtain the character sequence. Illustratively, if the target text is Chinese, each Chinese character is segmented as one character (token), and the segmented characters form the character sequence in their order in the text. Through this step, the target text is split into a sequence in units of characters, which facilitates subsequent processing.
(2) Acquire the character code corresponding to the character sequence.
That is, a character sequence is encoded (which may also be understood as character feature extraction), and a character code corresponding to the character sequence is extracted, where the character code may also be referred to as character embedding (embedding) or character feature.
(3) Extract semantic features based on the character code. In practical applications, the semantic features are extracted from the character code through the network layers of the semantic feature extraction model, and the output of a specified network layer is used as the finally extracted semantic features.
In some embodiments, the semantic feature extraction model includes a BERT model. The target text is input into the BERT model, and the BERT model outputs the semantic features corresponding to the target text. For example, if the target text is Chinese, the semantic features are character-level vectors that contain rich semantic information and can serve as semantic-level linguistic features. Illustratively, the output semantic features may be a two-dimensional matrix [seq_len, dims], where seq_len denotes the text length and dims denotes the dimension of the character-level vectors, for example 768. For ease of understanding, the embodiment of the present disclosure describes how the BERT model obtains the semantic features (semantic-level linguistic features) from the target text; the process is the implementation of (1) to (3) above and can be briefly described by the following formulas:
T_s = Tokenizer(S_text)
E_w = Embedding(T_s)
E_l = Bert_11(E_w)
wherein S istextTarget text representing the input, such as may be a Chinese text sequence; tokennizer () represents character segmentation of input target text to obtain character sequence T composed of multiple characters (token)s. Embedding () denotes the encoding process (which can also be understood as a feature extraction process), Embedding (T)s) I.e. obtaining the character code corresponding to the target text, wherein EwThe obtained character codes may also be referred to as character embedding (embedding) or character features. Bert11() Representing an output vector E for obtaining layer 11 (i.e., the second to last layer) of the BERT model l The output vector E l As semantic features acquired by the BERT model. Specifically, the BERT model includes a plurality of network layers (such as 12 network layers), and semantic features output by the layer 11 network can be finally selected by performing semantic feature extraction based on character coding through the plurality of network layers, where the semantic features can already more fully represent semantic information of a target text.
Further, the embodiment of the present disclosure further provides a method for obtaining a phoneme feature of a target text, which may be implemented by referring to the following steps a to c:
step a, inputting a target text into a preset grapheme-to-phoneme conversion unit to obtain a phoneme sequence output by the grapheme-to-phoneme conversion unit. The Grapheme-to-Phoneme unit may also be referred to as G2P (Grapheme-to-Phoneme), and may be implemented by using Network models such as Recurrent Neural Network (RNN), LSTM (Long Short-term memory unit), and the like, which are not limited herein. The grapheme-phoneme conversion unit can directly convert an input target text into a phoneme sequence, the target text is a Chinese text example, the grapheme-phoneme conversion unit can convert the Chinese text into a corresponding pinyin label according to a certain pinyin conversion rule, and the pinyin label sequence is the phoneme sequence. This step first converts the target text into a sequence of phonemes for subsequent processing on the phonemes.
And b, inputting the phoneme sequence into a coder obtained by pre-training. The structure of the encoder is not limited in the embodiments of the present disclosure, and the training mode of the encoder may also be implemented with reference to the related art, which is not described in detail herein.
And c, performing phoneme feature extraction operation on the phoneme sequence through an encoder to obtain a phoneme feature corresponding to the target text. The phoneme features may also be referred to as phoneme-level linguistic features.
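As noted in step a, a minimal grapheme-to-phoneme sketch for Chinese text follows; the pypinyin package and the initial/final split are illustrative assumptions, since the disclosure only requires that some pinyin conversion rule be applied:

```python
# Illustrative sketch of step a: rule-based Chinese grapheme-to-phoneme conversion.
# pypinyin and the initial/final split are assumptions; the disclosure does not name a tool.
from pypinyin import lazy_pinyin, Style

def text_to_phonemes(target_text: str):
    syllables = lazy_pinyin(target_text, style=Style.TONE3)            # e.g. "语" -> "yu3"
    initials = lazy_pinyin(target_text, style=Style.INITIALS, strict=False)
    phonemes = []
    for syllable, initial in zip(syllables, initials):
        # Split each syllable into initial + final so the phoneme sequence is
        # finer-grained than the character sequence.
        if initial and syllable.startswith(initial):
            phonemes.extend([initial, syllable[len(initial):]])
        else:
            phonemes.append(syllable)
    return phonemes

# text_to_phonemes("语音合成") might return ['y', 'u3', 'y', 'in1', 'h', 'e2', 'ch', 'eng2']
```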
In some embodiments, the phoneme feature extraction operation comprises the following steps 1) to 3):
1) Acquire a phoneme code corresponding to the phoneme sequence. The phoneme code may be obtained, for example, by encoding (embedding) the phoneme sequence.
2) Extract an intermediate feature vector from the phoneme code; the intermediate feature vector captures the local feature information and the context information in the phoneme code. It thus reflects local features as well as context dependencies, so that processing can be based on both the local and the overall phoneme context, which helps further improve the naturalness, prosody, and fluency of the resulting synthesized speech.
To extract an intermediate feature vector that adequately captures the local feature information and context information in the phoneme code, in some embodiments extracting the intermediate feature vector from the phoneme code comprises: performing a specified combination operation N consecutive times based on the phoneme code, and taking the feature vector output by the N-th combination operation as the intermediate feature vector, where the input of the 1st combination operation is the phoneme code and the input of the i-th combination operation is the output of the (i-1)-th combination operation; N is a natural number not less than 1, and i ranges over [2, N]. The combination operation includes a convolution operation and a nonlinear transformation operation. Illustratively, the nonlinear transformation may be implemented with the relu activation function. The value of N can be set flexibly according to the actual situation and is not limited here; for example, N may be 3. It can be understood that by performing the specified combination operation multiple times, the resulting intermediate feature vector can contain richer local feature information and context information.
3) Perform phoneme feature extraction based on the intermediate feature vector. In some embodiments, phoneme feature extraction may be performed on the intermediate feature vector through a preset long short-term memory network. The long short-term memory network (LSTM) is a special neural network that is applied recurrently over an input sequence; it differs from other networks in that its hidden layer is a self-connected hidden layer spanning time steps, which can continuously retain information and infer subsequent states from previous ones. Therefore, performing phoneme feature extraction on the intermediate feature vector through the long short-term memory network can effectively extract phoneme features that fully represent the phoneme information of the target text.
For convenience of understanding, the embodiment of the present disclosure provides an implementation manner in which an encoder performs a phoneme feature extraction operation on a phoneme sequence to obtain a phoneme feature corresponding to a target text, and may be briefly described by the following formula:
E_c = relu(Conv_3(relu(Conv_2(relu(Conv_1(Embedding(X_text)))))))
E_r = LSTM(E_c)
wherein X_text is the phoneme sequence; Embedding() denotes the encoding process, and Embedding(X_text) represents the phoneme code obtained for the phoneme sequence; Conv_i (i = 1, 2, 3) denotes a one-dimensional convolutional layer, which can be used to learn local features and context dependencies in the phoneme sequence; and relu denotes the activation function, mainly used to realize the nonlinear transformation. E_c is the intermediate feature vector, LSTM() denotes processing by the LSTM, and E_r is the phoneme features.
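For illustration, a PyTorch sketch of an encoder matching this description (phoneme embedding, N = 3 convolution-plus-relu combination operations, then an LSTM) is given below; the vocabulary size, channel width, kernel size, and the choice of a bidirectional LSTM are assumptions:

```python
# Illustrative sketch of the phoneme encoder: Embedding -> 3 x (Conv1d + ReLU) -> LSTM.
# Dimensions and hyperparameters are assumptions, not values from the disclosure.
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    def __init__(self, n_phonemes=100, dim=256, n_blocks=3, kernel_size=5):
        super().__init__()
        self.embedding = nn.Embedding(n_phonemes, dim)
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
                nn.ReLU(),
            )
            for _ in range(n_blocks)
        ])
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids):            # [batch, T_phonemes]
        x = self.embedding(phoneme_ids)        # phoneme code, [batch, T, dim]
        x = x.transpose(1, 2)                  # Conv1d expects [batch, dim, T]
        for block in self.blocks:              # N combination operations (conv + relu)
            x = block(x)
        E_c = x.transpose(1, 2)                # intermediate feature vector
        E_r, _ = self.lstm(E_c)                # phoneme features
        return E_r

# encoder = PhonemeEncoder()
# E_r = encoder(torch.randint(0, 100, (1, 30)))   # e.g. [1, 30, 256]
```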
In practical applications, the embodiment of the present disclosure may simulate a mapping relationship between a text and an acoustic feature, and may obtain an acoustic feature of a target text based on the mapping relationship, which may be implemented with reference to related technologies, and is not limited herein.
After the semantic features, phoneme features, and acoustic features of the target text are obtained, the first alignment operation is performed on the semantic features and the acoustic features, and the second alignment operation is performed on the phoneme features and the acoustic features. In some embodiments, both alignment operations are attention-based; attention-based alignment can better determine the relevance between the semantic features and the acoustic features and between the phoneme features and the acoustic features. Feature fusion is then performed based on the first alignment result and the second alignment result to obtain the fusion features, so that the fusion features fully contain feature information of different granularities, which facilitates better speech generation in the subsequent process. For ease of understanding, the above process can be briefly described as follows:
Align_phone = Attention(E_r, E_speech)
Align_sentence = Attention(E_l, E_speech)
Align_fusion = Conv(Concat(Align_phone, Align_sentence))
wherein Attention() represents aligning linguistic features (semantic-level linguistic features, i.e., semantic features, or phoneme-level linguistic features, i.e., phoneme features) with the acoustic features based on an attention mechanism. E_r is the phoneme features and E_speech is the acoustic features, so Attention(E_r, E_speech) represents aligning the phoneme features with the acoustic features based on the attention mechanism, and Align_phone is the alignment result of the phoneme features and the acoustic features (the aforementioned second alignment result). E_l is the semantic features, Attention(E_l, E_speech) represents aligning the semantic features with the acoustic features based on the attention mechanism, and Align_sentence is the alignment result of the semantic features and the acoustic features (the aforementioned first alignment result). Concat() represents a splicing (concatenation) operation, Conv() represents a convolution operation, and Align_fusion is the fusion feature. In practical applications, Align_phone and Align_sentence may each be an aligned feature matrix, and Align_fusion is the feature matrix after feature fusion.
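For illustration, the dual alignment and fusion step in the formulas above can be sketched in PyTorch as follows; the use of multi-head attention, the common feature dimension, and the convolution kernel size are assumptions, since the disclosure does not fix the attention variant:

```python
# Illustrative sketch of the dual-alignment + fusion step: each attention call aligns one
# granularity of linguistic features to the acoustic features; the two alignment results
# are concatenated along the channel axis and fused by a convolution.
import torch
import torch.nn as nn

class DualAlignFusion(nn.Module):
    def __init__(self, dim=256, n_heads=4, kernel_size=5):
        super().__init__()
        self.attn_phone = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.attn_sentence = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.fuse = nn.Conv1d(2 * dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, E_r, E_l, E_speech):     # all shaped [batch, T, dim]
        # Query with the acoustic frames so every frame attends over the linguistic sequence.
        align_phone, _ = self.attn_phone(E_speech, E_r, E_r)         # second alignment result
        align_sentence, _ = self.attn_sentence(E_speech, E_l, E_l)   # first alignment result
        concat = torch.cat([align_phone, align_sentence], dim=-1)    # Concat()
        return self.fuse(concat.transpose(1, 2)).transpose(1, 2)     # Conv() -> Align_fusion

# fusion = DualAlignFusion()
# Align_fusion = fusion(E_r, E_l, E_speech)   # [batch, T_frames, 256]
```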
Based on the foregoing speech synthesis method, an embodiment of the present disclosure provides a speech synthesis system capable of implementing it; see the schematic structural diagram shown in fig. 2, which illustrates the main functional modules of the system: a BERT model, a grapheme-to-phoneme unit, an encoder, a first alignment unit, a second alignment unit, a feature fusion unit, a decoder, and a vocoder. The input of both the BERT model and the grapheme-to-phoneme unit is the target text; the BERT model outputs the semantic features, and the grapheme-to-phoneme unit outputs the phoneme sequence. The encoder takes the phoneme sequence as input and outputs the phoneme features. The first alignment unit takes the semantic features and the acoustic features obtained by the pre-established mapping as input and outputs the first alignment result; the second alignment unit takes the phoneme features and the acoustic features as input and outputs the second alignment result. The feature fusion unit takes the first and second alignment results as input and outputs the fusion features. The decoder takes the fusion features as input, performs autoregressive decoding on them, and outputs a mel spectrum, which is then fed to the vocoder and converted into audio for output. For the specific process, reference may be made to the foregoing description, which is not repeated here.
Compared with a conventional speech synthesis system containing only an encoder and a decoder, this speech synthesis system not only additionally introduces a BERT model but also adopts a dual-alignment mechanism: semantic features and phoneme features of different granularities are aligned with the acoustic features separately and fused afterwards, so that both granularities can be fully exploited. The resulting fusion features contain richer linguistic feature information, and the synthesized audio obtained from that richer information can be further improved in naturalness, prosody, and fluency. In practical applications, the functional units of the speech synthesis system may be trained jointly; in other words, the system may be trained as a whole. During training, a training text and its corresponding audio are fed to the speech synthesis system, which is obtained through supervised training; the trained system can then directly output audio that meets expectations for an input target text. The processing flow of the system during training is essentially the same as in actual application; in other words, each functional unit performs essentially the same function during training as in actual use, so no further description is given here.
For understanding, based on the speech synthesis system shown in fig. 2, the embodiment of the present disclosure further provides a speech synthesis method, referring to a flow chart of another speech synthesis method shown in fig. 3, which mainly includes the following steps:
step S302, inputting the target text into a BERT model, and performing semantic feature extraction operation on the target text through the BERT model to obtain semantic features of the target text.
Step S304, inputting the target text into a preset grapheme-to-phoneme unit, and obtaining a phoneme sequence corresponding to the target text through the grapheme-to-phoneme unit.
Step S306, inputting the phoneme sequence into an encoder, and performing phoneme feature extraction operation on the phoneme sequence through the encoder to obtain a phoneme feature corresponding to the target text.
Step S308, acquiring the acoustic features of the target text based on the pre-established (simulated in advance) mapping relationship between text and acoustic features.
Step S310, performing an alignment operation on the semantic features and the acoustic features based on an attention mechanism to obtain a first alignment result.
In step S312, an alignment operation is performed on the phoneme feature and the acoustic feature based on the attention mechanism, so as to obtain a second alignment result.
Step S314, performing a feature splicing operation based on the first alignment result and the second alignment result to obtain spliced features.
And step S316, performing convolution operation on the spliced features to obtain fusion features.
And step S318, performing autoregressive decoding on the fusion characteristics through a decoder to obtain a Mel spectrum.
Step S320, converting the mel spectrum into audio by the vocoder, and using the audio as the synthesized voice corresponding to the target text. The vocoder may be a Griffin-Lim vocoder.
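Putting steps S302 to S320 together, an inference-time flow might look like the sketch below; every component name is hypothetical and stands in for the pre-trained modules described above:

```python
# Illustrative end-to-end sketch of steps S302-S320; all components are hypothetical
# stand-ins for the pre-trained modules described in this disclosure.
def synthesize(target_text, bert_model, g2p, phoneme_encoder,
               acoustic_predictor, dual_align_fusion, decoder, vocoder):
    E_l = bert_model(target_text)                    # S302: semantic features
    phonemes = g2p(target_text)                      # S304: phoneme sequence
    E_r = phoneme_encoder(phonemes)                  # S306: phoneme features
    E_speech = acoustic_predictor(target_text)       # S308: mapped acoustic features
    fused = dual_align_fusion(E_r, E_l, E_speech)    # S310-S316: dual alignment + fusion
    mel = decoder(fused)                             # S318: autoregressive decoding to a mel spectrum
    return vocoder(mel)                              # S320: Griffin-Lim conversion to audio
```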
With the above speech synthesis method, rich semantic features and phoneme features can be obtained, the alignments of the phoneme features and of the semantic features with the acoustic features are computed separately on linguistic features of different granularities, and feature fusion is then performed. Semantic features and phoneme features of different granularities can thus be fully utilized, the resulting fusion features contain rich linguistic feature information, and the synthesized audio obtained from this information can be further improved in naturalness, prosody, fluency, and so on; that is, on the basis of ensuring correct pronunciation of the synthesized speech, its naturalness is further improved and its prosody and fluency are further optimized.
Corresponding to the foregoing speech synthesis method, an embodiment of the present disclosure further provides a speech synthesis apparatus, and fig. 4 is a schematic structural diagram of a speech synthesis apparatus provided in an embodiment of the present disclosure, which may be implemented by software and/or hardware and may be generally integrated in an electronic device. As shown in fig. 4, the speech synthesis apparatus 400 includes:
a feature obtaining module 402, configured to obtain semantic features, phoneme features, and acoustic features of the target text;
a first alignment module 404, configured to perform a first alignment operation on the semantic features and the acoustic features to obtain a first alignment result;
a second alignment module 406, configured to perform a second alignment operation on the phoneme features and the acoustic features to obtain a second alignment result;
a feature fusion module 408, configured to perform feature fusion based on the first alignment result and the second alignment result to obtain a fusion feature;
and a speech generating module 410, configured to generate a synthesized speech corresponding to the target text based on the fusion feature.
The device can align the semantic features of the coarse granularity and the phoneme features of the fine granularity with the acoustic features respectively, then fuse the alignment results to obtain the synthesized voice, can make full use of the features of different granularities, and is favorable for better promoting the voice synthesis effect.
In some embodiments, the feature obtaining module 402 is specifically configured to: inputting the target text into a semantic feature extraction model obtained by pre-training; and executing semantic feature extraction operation on the target text through the semantic feature extraction model to obtain the semantic features of the target text.
In some embodiments, the semantic feature extraction operation comprises: performing character segmentation on a target text to obtain a character sequence; acquiring a character code corresponding to the character sequence; and extracting semantic features based on the character codes.
In some embodiments, the semantic feature extraction model comprises a BERT model.
In some embodiments, the feature obtaining module 402 is specifically configured to: inputting the target text into a preset grapheme-to-phoneme unit to obtain a phoneme sequence output by the grapheme-to-phoneme unit; inputting the phoneme sequence into a coder obtained by pre-training; and performing phoneme feature extraction operation on the phoneme sequence through the encoder to obtain a phoneme feature corresponding to the target text.
In some embodiments, the phoneme feature extraction operation comprises: acquiring a phoneme code corresponding to the phoneme sequence; extracting an intermediate feature vector according to the phoneme codes; wherein the intermediate feature vector embodies local feature information and context information in the phoneme coding; and performing phoneme feature extraction based on the intermediate feature vector.
In some embodiments, the phoneme feature extraction operation specifically includes: continuously executing N times of specified combination operation based on the phoneme coding, and taking a feature vector output by the combination operation at the Nth time as an intermediate feature vector; wherein, the input of the 1 st time of the combination operation is the phoneme coding, and the input of the ith time of the combination operation is the output of the (i-1) th time of the combination operation; n is a natural number not less than 1, and the value range of i is [2, N ]; the combining operation includes a convolution operation and a non-linear transformation operation.
In some embodiments, the phoneme feature extraction operation specifically includes: and extracting phoneme features of the intermediate feature vector through a preset long-term and short-term memory network.
In some embodiments, the first alignment operation and the second alignment operation are both attention-based alignment operations.
In some embodiments, the feature fusion module 408 is specifically configured to: executing a feature splicing operation based on the first alignment result and the second alignment result to obtain spliced features; and performing convolution operation on the spliced features to obtain fusion features.
In some embodiments, the speech generation module 410 is specifically configured to: performing autoregressive decoding on the fusion characteristics through a decoder obtained through pre-training to obtain a Mel spectrum; and converting the Mel spectrum into audio through a vocoder, and taking the audio as the synthetic voice corresponding to the target text.
The speech synthesis device provided by the embodiment of the disclosure can execute the speech synthesis method provided by any embodiment of the disclosure, and has corresponding functional modules and beneficial effects of the execution method.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatus embodiments may refer to corresponding processes in the method embodiments, and are not described herein again.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, the computer program, when executed by the at least one processor, is for causing the electronic device to perform a method according to an embodiment of the disclosure.
The disclosed exemplary embodiments also provide a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
The exemplary embodiments of the present disclosure also provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform a method according to an embodiment of the present disclosure.
The computer program product may write program code for carrying out operations for embodiments of the present disclosure in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform a speech synthesis method provided by embodiments of the present disclosure. The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Referring to fig. 5, a block diagram of a structure of an electronic device 500, which may be a server or a client of an embodiment of the present disclosure, which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of embodiments of the present disclosure described and/or claimed herein.
As shown in fig. 5, the electronic device 500 includes a computing unit 501, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic apparatus 500 can also be stored. The calculation unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
A number of components in the electronic device 500 are connected to the I/O interface 505, including: an input unit 506, an output unit 507, a storage unit 508, and a communication unit 509. The input unit 506 may be any type of device capable of inputting information to the electronic device 500, and the input unit 506 may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. Output unit 507 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 508 may include, but is not limited to, a magnetic disk, an optical disk. The communication unit 509 allows the electronic device 500 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as bluetooth (TM) devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 501 performs the respective methods and processes described above. For example, in some embodiments, the speech synthesis method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 500 via the ROM 502 and/or the communication unit 509. In some embodiments, the computing unit 501 may be configured to perform the speech synthesis method by any other suitable means (e.g., by means of firmware).
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing description presents merely exemplary embodiments of the present disclosure and is provided to enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (13)

1. A method of speech synthesis comprising:
obtaining semantic features, phoneme features and acoustic features of a target text; wherein the acoustic features are obtained based on a mapping relationship between text and acoustics that is modeled in advance;
performing a first alignment operation on the semantic features and the acoustic features to obtain a first alignment result; wherein the first alignment result is represented by means of an alignment matrix;
performing a second alignment operation on the phoneme features and the acoustic features to obtain a second alignment result; wherein the second alignment result is represented by means of an alignment matrix;
performing feature fusion according to the first alignment result and the second alignment result to obtain fusion features;
generating synthetic speech corresponding to the target text based on the fusion features;
wherein generating the synthetic speech corresponding to the target text based on the fusion features comprises:
performing autoregressive decoding on the fusion features through a decoder obtained by pre-training to obtain a Mel spectrum;
and converting the Mel spectrum into audio through a vocoder, and taking the audio as the synthetic speech corresponding to the target text.
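For illustration only, the following is a minimal sketch in Python of the last two steps of claim 1: a Mel spectrum (here a random placeholder standing in for the output of the pre-trained autoregressive decoder) is converted to a waveform using librosa's Griffin-Lim-based mel inversion as a stand-in for the vocoder. The sampling rate, FFT size and hop length are assumptions, not values taken from the patent.

import numpy as np
import librosa

# Placeholder (80, T) Mel spectrum standing in for the decoder output of claim 1.
mel = np.abs(np.random.randn(80, 200)).astype(np.float32)

# Griffin-Lim-based inversion as a stand-in for a neural vocoder
# (sr, n_fft and hop_length are illustrative assumptions).
audio = librosa.feature.inverse.mel_to_audio(
    mel, sr=22050, n_fft=1024, hop_length=256, power=1.0)
print(audio.shape)  # roughly 200 * 256 samples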
2. The speech synthesis method of claim 1, wherein obtaining semantic features of the target text comprises:
inputting the target text into a semantic feature extraction model obtained by pre-training;
and performing a semantic feature extraction operation on the target text through the semantic feature extraction model to obtain the semantic features of the target text.
3. The speech synthesis method of claim 2, wherein the semantic feature extraction operation comprises:
performing character segmentation on the target text to obtain a character sequence;
acquiring a character encoding corresponding to the character sequence;
and extracting the semantic features based on the character encoding.
4. The speech synthesis method of claim 2, wherein the semantic feature extraction model comprises a BERT model.
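As a rough illustration of claims 2 to 4, the snippet below uses a publicly available Chinese BERT checkpoint from the transformers library: the tokenizer performs character-level segmentation and maps characters to codes (input ids), and the model's last hidden states play the role of the semantic features. The checkpoint choice and the example sentence are assumptions for illustration, not part of the patent.

import torch
from transformers import BertTokenizer, BertModel

# Any character-level Chinese BERT checkpoint could play this role;
# "bert-base-chinese" is used here purely as an example.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

text = "今天天气真好"
encoded = tokenizer(text, return_tensors="pt")   # character segmentation + character codes
with torch.no_grad():
    outputs = model(**encoded)

semantic_features = outputs.last_hidden_state     # (1, num_chars + 2 special tokens, 768)
print(semantic_features.shape)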
5. The speech synthesis method of claim 1, wherein obtaining the phoneme features of the target text comprises:
inputting the target text into a preset grapheme-to-phoneme unit to obtain a phoneme sequence output by the grapheme-to-phoneme unit;
inputting the phoneme sequence into an encoder obtained by pre-training;
and performing a phoneme feature extraction operation on the phoneme sequence through the encoder to obtain the phoneme features corresponding to the target text.
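One possible reading of claim 5, sketched with pypinyin standing in for the preset grapheme-to-phoneme unit (the patent does not name a specific G2P tool, and pinyin syllables are used here only as a coarse stand-in for phonemes):

from pypinyin import lazy_pinyin, Style

text = "今天天气真好"
# pypinyin as a stand-in grapheme-to-phoneme unit; Style.TONE3 appends the tone digit
phoneme_sequence = lazy_pinyin(text, style=Style.TONE3)
print(phoneme_sequence)  # e.g. ['jin1', 'tian1', 'tian1', 'qi4', 'zhen1', 'hao3']
# This sequence would then be fed to the pre-trained encoder of claims 6-8.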
6. The speech synthesis method of claim 5, wherein the phoneme feature extraction operation comprises:
acquiring a phoneme encoding corresponding to the phoneme sequence;
extracting an intermediate feature vector from the phoneme encoding; wherein the intermediate feature vector captures local feature information and context information in the phoneme encoding;
and performing phoneme feature extraction based on the intermediate feature vector.
7. The speech synthesis method of claim 6, wherein extracting the intermediate feature vector from the phoneme encoding comprises:
performing a specified combination operation N consecutive times based on the phoneme encoding, and taking the feature vector output by the N-th combination operation as the intermediate feature vector; wherein the input of the 1st combination operation is the phoneme encoding, and the input of the i-th combination operation is the output of the (i-1)-th combination operation; N is a natural number not less than 1, and i ranges over [2, N]; the combination operation comprises a convolution operation and a non-linear transformation operation.
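The N chained combination operations of claim 7 can be pictured as a small stack of 1-D convolutions, each followed by a non-linearity, where every layer consumes the previous layer's output. The sketch below assumes N = 3, 256 channels and ReLU; none of these values come from the patent.

import torch
import torch.nn as nn

class CombinationStack(nn.Module):
    """N x (convolution + non-linear transformation), chained as in claim 7."""
    def __init__(self, channels=256, kernel_size=5, n=3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
            for _ in range(n))

    def forward(self, phoneme_encoding):      # (batch, channels, phoneme_len)
        x = phoneme_encoding                   # input of the 1st combination operation
        for conv in self.convs:
            x = torch.relu(conv(x))            # i-th op consumes the (i-1)-th output
        return x                               # N-th output = intermediate feature vector

intermediate = CombinationStack()(torch.randn(2, 256, 30))
print(intermediate.shape)                      # torch.Size([2, 256, 30])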
8. The speech synthesis method of claim 6, wherein performing phoneme feature extraction based on the intermediate feature vector comprises:
extracting the phoneme features from the intermediate feature vector through a preset long short-term memory (LSTM) network.
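Claim 8's extraction step could look like the short snippet below, which runs an LSTM over the intermediate feature vectors; the bidirectionality and hidden size are assumptions for illustration.

import torch
import torch.nn as nn

# Intermediate features laid out as (batch, phoneme_len, channels) for a batch_first LSTM.
intermediate = torch.randn(2, 30, 256)
lstm = nn.LSTM(input_size=256, hidden_size=128, batch_first=True, bidirectional=True)
phoneme_features, _ = lstm(intermediate)
print(phoneme_features.shape)   # torch.Size([2, 30, 256]) -- 2 directions x 128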
9. The speech synthesis method of any one of claims 1 to 8, wherein the first and second alignment operations are both attention-based alignment operations.
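Claim 9 states that both alignments are attention-based. One common way to realize this (a sketch assuming plain scaled dot-product attention; the patent does not fix the attention variant) is to treat the acoustic frames as queries and the text-side features as keys and values, with the softmax weights forming the alignment matrix of claims 1 and 11:

import torch
import torch.nn.functional as F

acoustic = torch.randn(1, 200, 256)     # (batch, num_frames, dim) -- queries
text_side = torch.randn(1, 30, 256)     # semantic or phoneme features -- keys/values

scores = acoustic @ text_side.transpose(1, 2) / 256 ** 0.5
alignment = F.softmax(scores, dim=-1)   # (1, 200, 30) alignment matrix
aligned = alignment @ text_side         # text features re-sampled to the frame rate
print(alignment.shape, aligned.shape)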
10. The speech synthesis method according to claim 1, wherein performing feature fusion according to the first alignment result and the second alignment result to obtain the fusion features comprises:
performing a feature splicing operation based on the first alignment result and the second alignment result to obtain spliced features;
and performing a convolution operation on the spliced features to obtain the fusion features.
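A minimal sketch of claim 10's fusion, assuming the two alignment results already share the same time axis: splice (concatenate) them along the channel dimension, then mix them with a convolution. The 1x1 kernel and the channel sizes are illustrative assumptions.

import torch
import torch.nn as nn

first_aligned = torch.randn(1, 200, 256)    # first alignment result
second_aligned = torch.randn(1, 200, 256)   # second alignment result

spliced = torch.cat([first_aligned, second_aligned], dim=-1)      # (1, 200, 512) spliced features
fuse = nn.Conv1d(512, 256, kernel_size=1)                         # convolution over the spliced features
fused = fuse(spliced.transpose(1, 2)).transpose(1, 2)             # (1, 200, 256) fusion features
print(fused.shape)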
11. A speech synthesis apparatus comprising:
the feature acquisition module is used for acquiring semantic features, phoneme features and acoustic features of a target text; wherein the acoustic features are obtained based on a mapping relationship between text and acoustics that is modeled in advance;
the first alignment module is used for executing a first alignment operation on the semantic features and the acoustic features to obtain a first alignment result; wherein the first alignment result is represented by means of an alignment matrix;
the second alignment module is used for executing a second alignment operation on the phoneme features and the acoustic features to obtain a second alignment result; wherein the second alignment result is represented by means of an alignment matrix;
the feature fusion module is used for carrying out feature fusion based on the first alignment result and the second alignment result to obtain fusion features;
the speech generation module is used for generating synthetic speech corresponding to the target text based on the fusion features;
the speech generation module is specifically configured to: perform autoregressive decoding on the fusion features through a decoder obtained by pre-training to obtain a Mel spectrum; and convert the Mel spectrum into audio through a vocoder, taking the audio as the synthetic speech corresponding to the target text.
12. An electronic device, comprising:
a processor; and
a memory for storing a program, wherein the program is stored in the memory,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the speech synthesis method according to any one of claims 1 to 10.
13. A computer-readable storage medium storing a computer program for executing the speech synthesis method according to any one of claims 1 to 10.
CN202110996774.4A 2021-08-27 2021-08-27 Speech synthesis method, apparatus, device and medium Active CN113450758B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110996774.4A CN113450758B (en) 2021-08-27 2021-08-27 Speech synthesis method, apparatus, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110996774.4A CN113450758B (en) 2021-08-27 2021-08-27 Speech synthesis method, apparatus, device and medium

Publications (2)

Publication Number Publication Date
CN113450758A CN113450758A (en) 2021-09-28
CN113450758B (en) 2021-11-16

Family

ID=77818843

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110996774.4A Active CN113450758B (en) 2021-08-27 2021-08-27 Speech synthesis method, apparatus, device and medium

Country Status (1)

Country Link
CN (1) CN113450758B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115831089B (en) * 2021-12-27 2023-12-01 北京百度网讯科技有限公司 Acoustic feature determination method, acoustic feature determination device, acoustic feature determination equipment, acoustic feature determination medium and acoustic feature determination product
CN114420089B (en) * 2022-03-30 2022-06-21 北京世纪好未来教育科技有限公司 Speech synthesis method, apparatus and computer-readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107945786A (en) * 2017-11-27 2018-04-20 北京百度网讯科技有限公司 Phoneme synthesizing method and device
JP2019215468A (en) * 2018-06-14 2019-12-19 日本放送協会 Learning device, speech synthesizing device and program
CN112735373A (en) * 2020-12-31 2021-04-30 科大讯飞股份有限公司 Speech synthesis method, apparatus, device and storage medium
CN112750419A (en) * 2020-12-31 2021-05-04 科大讯飞股份有限公司 Voice synthesis method and device, electronic equipment and storage medium
WO2021118604A1 (en) * 2019-12-13 2021-06-17 Google Llc Training speech synthesis to generate distinct speech sounds
CN113096640A (en) * 2021-03-08 2021-07-09 北京达佳互联信息技术有限公司 Voice synthesis method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473516B (en) * 2019-09-19 2020-11-27 百度在线网络技术(北京)有限公司 Voice synthesis method and device and electronic equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107945786A (en) * 2017-11-27 2018-04-20 北京百度网讯科技有限公司 Phoneme synthesizing method and device
JP2019215468A (en) * 2018-06-14 2019-12-19 日本放送協会 Learning device, speech synthesizing device and program
WO2021118604A1 (en) * 2019-12-13 2021-06-17 Google Llc Training speech synthesis to generate distinct speech sounds
CN112735373A (en) * 2020-12-31 2021-04-30 科大讯飞股份有限公司 Speech synthesis method, apparatus, device and storage medium
CN112750419A (en) * 2020-12-31 2021-05-04 科大讯飞股份有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN113096640A (en) * 2021-03-08 2021-07-09 北京达佳互联信息技术有限公司 Voice synthesis method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Pali Speech Synthesis using HMM;Kittikan Charoenrattana 等;《2021 13th International Conference on Knowledge and Smart Technology (KST)》;20210506;165-169 *
Patnet : A Phoneme-Level Autoregressive Transformer Network for Speech Synthesis;Shiming Wang 等;《2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)》;20210513;5684-5688 *

Also Published As

Publication number Publication date
CN113450758A (en) 2021-09-28

Similar Documents

Publication Publication Date Title
CN111292720B (en) Speech synthesis method, device, computer readable medium and electronic equipment
JP7464621B2 (en) Speech synthesis method, device, and computer-readable storage medium
CN111667816B (en) Model training method, speech synthesis method, device, equipment and storage medium
KR20210146368A (en) End-to-end automatic speech recognition for digit sequences
US11488577B2 (en) Training method and apparatus for a speech synthesis model, and storage medium
WO2022188734A1 (en) Speech synthesis method and apparatus, and readable storage medium
CN110706690A (en) Speech recognition method and device
WO2021189984A1 (en) Speech synthesis method and apparatus, and device and computer-readable storage medium
EP4029010B1 (en) Neural text-to-speech synthesis with multi-level context features
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN111292719A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN113808571B (en) Speech synthesis method, speech synthesis device, electronic device and storage medium
CN112309367B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN113450765A (en) Speech synthesis method, apparatus, device and storage medium
CN114242093A (en) Voice tone conversion method and device, computer equipment and storage medium
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN113314096A (en) Speech synthesis method, apparatus, device and storage medium
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN113744713A (en) Speech synthesis method and training method of speech synthesis model
CN114420089B (en) Speech synthesis method, apparatus and computer-readable storage medium
CN114373445B (en) Voice generation method and device, electronic equipment and storage medium
CN115424604B (en) Training method of voice synthesis model based on countermeasure generation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant