CN116312476A - Speech synthesis method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN116312476A
Authority
CN
China
Prior art keywords
model
prosody
prediction
vector
coding
Legal status
Pending
Application number
CN202310189613.3A
Other languages
Chinese (zh)
Inventor
岳杨皓
宋伟
张雅洁
张政臣
吴友政
Current Assignee
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202310189613.3A
Publication of CN116312476A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10: Prosody rules derived from text; Stress or intonation
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047: Architecture of speech synthesisers
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a speech synthesis method and device, a storage medium and electronic equipment, and relates to the technical field of information processing. The method comprises the following steps: acquiring a symbol sequence of a sentence to be synthesized, and performing acoustic feature prediction on the symbol sequence by using a pre-trained acoustic prediction model to obtain acoustic features corresponding to the sentence to be synthesized, wherein the acoustic prediction model comprises a prosody prediction model which learns the prosodic features of reference recorded audio in the model training stage so as to enhance the prosodic features of the sentence to be synthesized in the speech synthesis stage; and performing feature conversion and synthesis on the acoustic features to obtain the speech corresponding to the sentence to be synthesized. The method and device can address the problem that speech synthesis systems in the related art cannot meet the requirements of specific business scenarios for prosodic naturalness and expressiveness, resulting in poor speech synthesis quality.

Description

Speech synthesis method and device, storage medium and electronic equipment
Technical Field
The disclosure relates to the technical field of information processing, and in particular to a speech synthesis method and device, a storage medium and electronic equipment.
Background
Speech synthesis technology has been widely used in human-computer interaction and intelligent voice devices; a speech synthesis system implements the text-to-speech (TTS) function. Existing online TTS services already deliver a stable level of synthesized audio in terms of prosodic pauses, timbre similarity, pronunciation accuracy and sound quality.
However, as business scenarios diversify, in certain specific scenarios (such as colloquial speech, scenarios requiring a strong sense of rhythm, or live broadcasting), the synthesis results of existing TTS systems tend toward an averaged style and cannot meet the scenario's requirements for prosodic naturalness and expressiveness, so the speech synthesis quality is poor.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
Embodiments of the disclosure aim to provide a speech synthesis method and device, a storage medium and electronic equipment, thereby solving, at least to a certain extent, the problem that a TTS system in the related art cannot meet the requirements of a specific business scenario for prosodic naturalness and expressiveness and therefore produces a poor speech synthesis result.
According to a first aspect of the present disclosure, there is provided a speech synthesis method, the method comprising: acquiring a symbol sequence of a sentence to be synthesized, wherein the sentence to be synthesized comprises a text to be synthesized and a query result sentence aiming at a target object; carrying out acoustic feature prediction on the symbol sequence by utilizing a pre-trained acoustic prediction model to obtain acoustic features corresponding to the sentences to be synthesized; the acoustic prediction model comprises a prosody prediction model which enhances the prosody characteristics of the sentence to be synthesized in a speech synthesis stage by learning the prosody characteristics of the reference recording audio in a model training stage; and performing feature conversion and synthesis on the acoustic features to obtain the voice corresponding to the sentence to be synthesized.
Optionally, the acoustic prediction model further includes an encoding model and a decoding model, and the predicting the acoustic feature of the symbol sequence by using the pre-trained acoustic prediction model includes: performing primary coding processing on the symbol sequence by utilizing a pre-trained coding model to obtain a first coding vector; performing prosody feature prediction on the first coding vector by utilizing a pre-trained prosody prediction model to obtain a prosody feature vector; and predicting the acoustic features of the sentence to be synthesized according to a pre-trained decoding model, the first coding vector and the prosodic feature vector to obtain the acoustic features corresponding to the sentence to be synthesized.
Optionally, the predicting the acoustic feature of the sentence to be synthesized according to the pre-trained decoding model, the first encoding vector and the prosodic feature vector includes: performing variable prediction on the superposition result of the first coding vector and the prosody feature vector by using a pre-trained variable adaptation model to obtain a first variable prediction result; and carrying out decoding processing based on an attention mechanism on the first variable prediction result by utilizing a pre-trained decoding model to obtain acoustic features corresponding to the statement to be synthesized.
Optionally, the predicting the acoustic feature of the sentence to be synthesized according to the pre-trained decoding model, the first encoding vector and the prosodic feature vector includes: respectively inputting the first coding vector and the prosody feature vector into a pre-trained variable adaptation model to perform variable prediction to obtain a second variable prediction result; and carrying out decoding processing based on an attention mechanism on the second variable prediction result by utilizing a pre-trained decoding model to obtain acoustic features corresponding to the statement to be synthesized.
Optionally, the prosody prediction model includes a first prosody prediction model and a second prosody prediction model, and the performing prosody feature prediction on the first encoding vector using the pre-trained prosody prediction model includes: performing sentence-level prosody feature prediction on the first coding vector by using the first prosody prediction model to obtain a first prosody feature vector; performing phoneme-level prosodic feature prediction on the superposition features of the first coding vector and the first prosodic feature vector by using the second prosodic prediction model to obtain a second prosodic feature vector; the variable prediction of the superposition result of the first coding vector and the prosody feature vector includes: and carrying out variable prediction on the superposition result of the second prosodic feature vector and the superposition feature.
Optionally, the prosody prediction model includes a first prosody prediction model and a second prosody prediction model, and the performing prosody feature prediction on the first encoding vector using the pre-trained prosody prediction model includes: performing sentence-level prosody feature prediction on the first coding vector by using the first prosody prediction model to obtain a third prosody feature vector; performing phoneme-level prosody feature prediction on the first coding vector by using the second prosody prediction model to obtain a fourth prosody feature vector; the step of inputting the first coding vector and the prosody feature vector into a pre-trained variable adaptation model respectively to perform variable prediction comprises the following steps: and respectively inputting the first coding vector, the third prosody feature vector and the fourth prosody feature vector into a pre-trained variable adaptation model to perform variable prediction.
Optionally, the sentence-level prosodic feature prediction includes: and carrying out time sequence feature processing and linear transformation on the input data of the first prosody prediction model.
Optionally, the phoneme-level prosodic feature prediction includes: and carrying out convolution processing and linear transformation on the input data of the second prosody prediction model.
Optionally, the sentence to be synthesized further includes a recording sentence of the target object, and the method further includes: and carrying out secondary coding processing on the recording statement by adopting a pre-trained first reference coding model to obtain a second coding vector.
The variable prediction includes: performing variable prediction on the superposition result of the second prosodic feature vector and the superposition feature, together with the second coding vector, by using a pre-trained variable adaptation model; or performing variable prediction on the first coding vector, the prosodic feature vector and the second coding vector by using a pre-trained variable adaptation model.
Optionally, the method further comprises training the acoustic prediction model, and the training process comprises: acquiring a training sample, wherein the training sample comprises a recording sample, and a corresponding acoustic characteristic sample and symbol sequence sample; training the initial acoustic prediction model for one time by adopting the training sample to obtain an intermediate model; the initial acoustic prediction model includes a first reference coding model and a second reference coding model; and fixing model parameters of the first reference coding model and the second reference coding model of the intermediate model, and performing secondary training on the acoustic prediction model by using the training samples, the first reference coding model and the second reference coding model.
Optionally, the initial acoustic prediction model further includes an encoding model and a decoding model, and the training the initial acoustic prediction model with the training sample includes: coding the symbol sequence samples by adopting a coding model to obtain a first sample coding vector; adopting the first reference coding model to code the recording sample to obtain a second sample coding vector; adopting a second reference coding model to code the acoustic characteristic samples to obtain a third sample coding vector; after the first sample coding vector and the second sample coding vector are subjected to feature superposition, the first sample coding vector and the second sample coding vector are subjected to feature splicing with the third sample coding vector to obtain splicing features; decoding the spliced characteristic by adopting a decoding model, outputting the decoded characteristic, and calculating a first loss function value according to an output result and the acoustic characteristic sample; and adjusting model parameters of the coding model, the first reference coding model, the second reference coding model and the decoding model according to the first loss function value.
Optionally, the acoustic prediction model includes a first prosody prediction model and a second prosody prediction model, and the training the acoustic prediction model for the second time includes: adopting a coding model after one-time training to code the symbol sequence samples to obtain a fourth sample coding vector; adopting a first reference coding model after one training to code the recording sample to obtain a fifth sample coding vector; adopting a second reference coding model after one training to code the acoustic characteristic sample to obtain a sixth sample coding vector; superposing the fourth sample coding vector and the fifth sample coding vector and then respectively inputting the first prosody prediction model and the second prosody prediction model; calculating a second loss function value according to the fifth sample coding vector and the output result of the first prosody prediction model; adjusting model parameters of the first prosody prediction model based on the second loss function value; calculating a third loss function value according to the sixth sample coding vector and the output result of the second prosody prediction model; and adjusting model parameters of the second prosody prediction model based on the third loss function value.
According to a second aspect of the present disclosure, there is provided a speech synthesis apparatus, the apparatus comprising: the system comprises an acquisition module, a prediction module and a voice synthesis module; the system comprises an acquisition module, a synthesis module and a query module, wherein the acquisition module is used for acquiring a symbol sequence of a sentence to be synthesized, and the sentence to be synthesized comprises a text to be synthesized and a query result sentence aiming at a target object; the prediction module is used for predicting the acoustic characteristics of the symbol sequence by utilizing a pre-trained acoustic prediction model to obtain the acoustic characteristics corresponding to the statement to be synthesized; the acoustic prediction model comprises a prosody prediction model which enhances the prosody characteristics of the sentence to be synthesized in a speech synthesis stage by learning the prosody characteristics of the reference recording audio in a model training stage; and the voice synthesis module is used for carrying out feature conversion and synthesis on the acoustic features to obtain voices corresponding to the sentences to be synthesized.
According to a third aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the embodiments described above.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising: one or more processors; and storage means for one or more programs which, when executed by the one or more processors, cause the one or more processors to perform the method of any of the embodiments described above.
Exemplary embodiments of the present disclosure may have some or all of the following advantages:
in the speech synthesis method provided by the exemplary embodiments of the disclosure, on one hand, a prosody prediction model can be added to the acoustic prediction model and can learn the prosodic features of reference recorded audio in the model training stage, so that the prosodic features of the sentence to be synthesized are enhanced in the speech synthesis stage; after feature conversion and synthesis of the prosody-enhanced acoustic features, prosody-enhanced speech is obtained, which meets the requirements of specific business scenarios (scenarios with high demands on prosodic expression) and improves the accuracy and realism of the synthesized speech. On the other hand, by using the text to be synthesized and the query result sentence for the target object, the speech of the sentence to be synthesized can be synthesized for a specific target object through the pre-trained acoustic prediction model, realizing voice customization for the target object in a specific business scenario.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
Fig. 1 schematically illustrates an exemplary application scenario architecture diagram of a speech synthesis method and apparatus according to one embodiment of the present disclosure.
Fig. 2 schematically illustrates one of the flowcharts of the speech synthesis method in one embodiment according to the present disclosure.
Fig. 3 schematically illustrates one of the flowcharts of the acoustic feature prediction process in one embodiment according to the present disclosure.
FIG. 4 schematically illustrates a second flow chart of an acoustic feature prediction process in one embodiment in accordance with the present disclosure.
Fig. 5 schematically illustrates one of the schematics of variable prediction of the variable adaptation model in one embodiment according to the present disclosure.
Fig. 6 schematically illustrates one of the schematics of variable prediction of a pitch predictor in a variable adaptation model according to one embodiment of the present disclosure.
FIG. 7 schematically illustrates a third flowchart of an acoustic feature prediction process in one embodiment in accordance with the present disclosure.
FIG. 8 schematically illustrates a fourth flow chart of an acoustic feature prediction process in one embodiment in accordance with the present disclosure.
FIG. 9 schematically illustrates a second diagram of variable prediction of a variable adaptation model in one embodiment according to the present disclosure.
Fig. 10 schematically illustrates a second diagram of variable prediction of a pitch predictor in a variable adaptation model according to one embodiment of the present disclosure.
FIG. 11 schematically illustrates a training process flow diagram of an acoustic prediction model according to one embodiment of the present disclosure.
FIG. 12 schematically illustrates a first partial training process flow diagram of an acoustic prediction model according to one embodiment of the present disclosure.
FIG. 13 schematically illustrates a second partial training process flow diagram of an acoustic prediction model according to one embodiment of the present disclosure.
Fig. 14 schematically shows a block diagram of a speech synthesis apparatus in an embodiment according to the present disclosure.
Fig. 15 illustrates a block diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. However, those skilled in the art will recognize that the aspects of the present disclosure may be practiced with one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
As shown in fig. 1, an exemplary system diagram 100 of an application scenario of a speech synthesis method and apparatus is provided, and the system 100 may include a terminal 110 and a server 120. The embodiment is illustrated by the method applied to the server 120, and it is understood that the method may also be applied to a terminal, and may also be applied to a system including a terminal and a server, and implemented through interaction between the terminal and the server. The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms, or nodes in a blockchain. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted device, etc. For example, in the human-computer interaction process, the terminal may be an intelligent device, and the user performs speech synthesis through the intelligent device. When the voice synthesis method provided in the present embodiment is implemented through interaction between the terminal and the server, the terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
In this example, the user may input a text to be synthesized at the terminal 110, or the user may perform a specific operation on the terminal to generate the text to be synthesized (for example, clicking a tap-to-read button or a read-aloud button on a client page, etc.), and the terminal 110 sends the text to be synthesized to the server 120, so that the server obtains a query result sentence of the corresponding speaker according to the text to be synthesized, performs symbol serialization on the text to be synthesized and the query result sentence to form a symbol sequence, and performs acoustic feature prediction on the symbol sequence by using a pre-trained acoustic prediction model to obtain acoustic features corresponding to the sentence to be synthesized; the acoustic prediction model includes a prosody prediction model that enhances prosodic features of the sentence to be synthesized during the speech synthesis phase by learning prosodic features of the reference recorded audio during the model training phase. Feature conversion and synthesis are then performed on the acoustic features to obtain the speech corresponding to the sentence to be synthesized, and the synthesized speech is sent to the terminal for playback.
The voice synthesis method provided by the embodiments of the present disclosure may be performed in the server 120, and accordingly, the voice synthesis apparatus is generally disposed in the server 120. The voice synthesis method provided in the embodiments of the present disclosure may also be performed in the terminal 110, and accordingly, the voice synthesis apparatus is generally disposed in the terminal 110.
Next, a speech synthesis method disclosed in the embodiments of the present specification will be described with reference to specific embodiments.
Referring to fig. 2, the voice synthesis method of an example embodiment provided by the present disclosure may include the following steps S210 to S230.
Step S210, a symbol sequence of a sentence to be synthesized is acquired.
In this example embodiment, the symbol sequence may include a phoneme sequence or a pinyin-character sequence. The sentence to be synthesized may include the text to be synthesized and a query result sentence for the target object. The target object may be a specified speaker, and the speaker's sentences may be retrieved from the corpus based on the determined speaker. For example, identification information may be added for each speaker, and the corresponding sentence is queried through the identification information. The text to be synthesized or the query result sentence can be converted into pinyin, the pinyin converted into phonemes, and a symbol sequence formed in text order.
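For illustration only, the following Python sketch shows how such a symbol sequence might be built; the pinyin lookup table, phoneme inventory and field names are hypothetical placeholders rather than the front-end actually used by the disclosure, which would rely on a full pinyin/grapheme-to-phoneme dictionary.

```python
# Toy text front-end: text -> pinyin -> phonemes -> phoneme IDs (symbol sequence).
PINYIN = {"你": "ni3", "好": "hao3"}                              # hypothetical lookup table
SYLLABLE_TO_PHONEMES = {"ni3": ["n", "i3"], "hao3": ["h", "ao3"]}  # hypothetical phoneme split
PHONEME_IDS = {p: i for i, p in enumerate(["n", "i3", "h", "ao3"], start=1)}

def text_to_symbol_sequence(text: str, speaker_id: int) -> dict:
    syllables = [PINYIN[ch] for ch in text if ch in PINYIN]              # text -> pinyin
    phonemes = [p for s in syllables for p in SYLLABLE_TO_PHONEMES[s]]   # pinyin -> phonemes
    return {"speaker_id": speaker_id,                                    # identifies the target object
            "phoneme_ids": [PHONEME_IDS[p] for p in phonemes]}           # symbol sequence in text order

print(text_to_symbol_sequence("你好", speaker_id=7))
# {'speaker_id': 7, 'phoneme_ids': [1, 2, 3, 4]}
```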
In this exemplary embodiment, the text to be synthesized may be input by a user, may be formed according to an operation of the user, or may be obtained after a certain process is performed on the input of the user, for example, the text to be synthesized is searched for by using the voice input by the user, which is not limited in this example.
Step S220, carrying out acoustic feature prediction on the symbol sequence by utilizing a pre-trained acoustic prediction model to obtain acoustic features corresponding to the sentences to be synthesized; the acoustic prediction model includes a prosodic prediction model that enhances prosodic features of the sentence to be synthesized during the speech synthesis phase by learning prosodic features of the reference recorded audio during the model training phase.
In this example embodiment, the acoustic prediction model may include a Transformer-based acoustic model and a prosody prediction model. The Transformer-based acoustic model can predict the basic prosody, prosodic pauses, timbre, pronunciation and sound quality of the sentence to be synthesized. The backbone network of the acoustic prediction model is an end-to-end network based on the encoder-decoder architecture, such as FastSpeech 2. The prosody prediction model adds prosody-enhancement features to the coding features of the backbone network to strengthen prosodic expressiveness.
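As a rough illustration of the described architecture (not the patented implementation), the following PyTorch-style sketch wires a FastSpeech 2-like backbone together with a prosody prediction model whose output is superimposed on the encoder features; all module names and the superposition point are assumptions based on the description above.

```python
import torch.nn as nn

class AcousticPredictionModel(nn.Module):
    """Illustrative wiring only: encoder/variance-adaptor/decoder backbone plus a
    prosody prediction model whose output is superimposed on the encoder features."""
    def __init__(self, encoder, prosody_predictor, variance_adaptor, decoder):
        super().__init__()
        self.encoder = encoder                      # phoneme + speaker encoding (backbone)
        self.prosody_predictor = prosody_predictor  # learned from reference recordings
        self.variance_adaptor = variance_adaptor    # duration / pitch / energy prediction
        self.decoder = decoder                      # attention-based spectrogram decoder

    def forward(self, phoneme_ids, speaker_ids):
        enc = self.encoder(phoneme_ids, speaker_ids)   # first coding vector
        prosody = self.prosody_predictor(enc)          # prosody feature vector
        hidden = self.variance_adaptor(enc + prosody)  # superposition, then variable prediction
        return self.decoder(hidden)                    # acoustic features (e.g. mel-spectrogram)
```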
In the present exemplary embodiment, the acoustic feature may be a spectrogram, such as a mel-spectrogram or MFCCs (mel-frequency cepstral coefficients), or a linear spectrogram such as an STFT (short-time Fourier transform) spectrogram, which is not limited in this example.
In this exemplary embodiment, the reference audio may be recorded sentences with strong cadence and a clear prosodic style; the recorded audio may come from different speakers, including speakers of different genders and ages, which is not limited in this example. By learning a variety of prosodic styles in the training stage, the prosody prediction model can better predict the prosodic style of the sentence to be synthesized during speech synthesis.
And step S230, performing feature conversion and synthesis on the acoustic features to obtain the voice corresponding to the sentence to be synthesized.
In this exemplary embodiment, a vocoder may be used to convert the acoustic spectrogram into a waveform, synthesizing the speech corresponding to the sentence to be synthesized, and the resulting speech carries the prosodic features.
In the speech synthesis method provided in this example embodiment, on one hand, a prosody prediction model may be added to the acoustic prediction model to learn the prosodic features of reference recorded audio in the model training stage, so as to enhance the prosodic features of the sentence to be synthesized in the speech synthesis stage; after feature conversion and synthesis of the prosody-enhanced acoustic features, prosody-enhanced speech is obtained, meeting the requirements of specific business scenarios (scenarios with high demands on prosodic expression) and improving the accuracy and realism of the synthesized speech. On the other hand, by using the text to be synthesized and the query result sentence for the target object, the speech of the sentence to be synthesized can be synthesized for a specific target object through the pre-trained acoustic prediction model, realizing voice customization for the target object in a specific business scenario.
The various steps of the present disclosure are described in more detail below.
In some embodiments, referring to fig. 3, the acoustic prediction model may include an encoding model 310, a prosody prediction model 320, and a decoding model 330, and predicting the acoustic features of the symbol sequence using the pre-trained acoustic prediction model may include the following steps.
The first step, a symbol sequence is subjected to primary coding processing by utilizing a pre-trained coding model, and a first coding vector is obtained.
In this example embodiment, the encoding model 310 may include a phoneme embedding module (phoneme embedding), a multi-headed attention encoder, a speaker embedding module (speaker embedding), and so forth. The text to be synthesized may be converted into phonetic symbols, forming a first symbol sequence in text order. And inputting the first symbol sequence into a phoneme embedding module for dimension mapping to obtain a first embedding vector. The first embedded vector is processed by a multi-head attention encoder for encoding based on a multi-head attention mechanism to extract context semantic features. In this example, the first embedded vector may also be position encoded and then input into a multi-headed attention encoder to record the position of each symbol in the sequence.
In this example embodiment, text corresponding to the query result statement for the target object is converted into a phoneme symbol, forming a corresponding second symbol sequence. And inputting the second symbol sequence into a speaker embedding module for dimension mapping to obtain a second embedded vector.
In this example embodiment, the primary encoding process may include an encoding process for text to be synthesized, an encoding process based on a multi-head attention mechanism, and an encoding process for a query result statement of a target object, and may also include processes of feature superimposition, position encoding, and the like, which are not limited in this example.
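A minimal sketch of such a primary encoding step is given below, assuming a model dimension of 256 and using a speaker-ID embedding to stand in for the query-result-sentence branch; the class and parameter names are illustrative only, not those of the disclosure.

```python
import torch
import torch.nn as nn

class EncodingModel(nn.Module):
    """Sketch: phoneme embedding + positional embedding -> multi-head attention
    encoder, plus a speaker embedding superimposed on the encoder output."""
    def __init__(self, n_phonemes=100, n_speakers=16, d_model=256, max_len=1000):
        super().__init__()
        self.phoneme_embedding = nn.Embedding(n_phonemes, d_model)
        self.speaker_embedding = nn.Embedding(n_speakers, d_model)
        self.pos_embedding = nn.Embedding(max_len, d_model)   # records each symbol's position
        layer = nn.TransformerEncoderLayer(d_model, nhead=2, batch_first=True)
        self.attention_encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, phoneme_ids, speaker_ids):
        # phoneme_ids: (batch, seq_len), assumed seq_len <= max_len; speaker_ids: (batch,)
        pos = torch.arange(phoneme_ids.size(1), device=phoneme_ids.device)
        x = self.phoneme_embedding(phoneme_ids) + self.pos_embedding(pos)
        x = self.attention_encoder(x)                           # context semantic features
        spk = self.speaker_embedding(speaker_ids).unsqueeze(1)  # (batch, 1, d_model)
        return x + spk                                          # first coding vector
```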
And secondly, performing prosodic feature prediction on the first coding vector by utilizing a pre-trained prosodic prediction model to obtain a prosodic feature vector.
In this example embodiment, the prosody prediction model 320 may include a first prosody prediction model, which may be an utterance-level prosody predictor, and a second prosody prediction model, which may be a phoneme-level prosody predictor.
Illustratively, sentence-level prosodic feature prediction may be performed on the first encoded vector using the first prosodic prediction model to obtain a first prosodic feature vector; and performing phoneme-level prosody feature prediction on the superposition features of the first coding vector and the first prosody feature vector by using the second prosody prediction model to obtain a second prosody feature vector.
In this example embodiment, the output of the coding model may be input into the first prosody prediction model to perform sentence-level prosody enhancement; the output of the first prosody prediction model and the output of the coding model are then superimposed and input into the second prosody prediction model to perform prosody enhancement for each phoneme in the sentence; finally, the output of the second prosody prediction model is concatenated with its input data so that the various features are merged into the backbone network. The concatenated features can be input into a fully connected layer for feature integration.
In this example embodiment, the first prosody prediction model may include a recurrent neural network (e.g., a gated recurrent unit, GRU) and a linear layer. The second prosody prediction model may include a neural network based on one-dimensional convolution; illustratively, the second prosody prediction model may include a plurality of convolution units and a linear unit (linear layer), where each convolution unit may include a one-dimensional convolution layer and a normalization layer. A ReLU activation function may be added after the one-dimensional convolution layer, and dropout may be added after the normalization layer.
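The two prosody prediction models described above might look roughly as follows; this is an illustrative sketch in which LayerNorm is assumed as the normalization layer and all dimensions are placeholders.

```python
import torch.nn as nn

class SentenceLevelProsodyPredictor(nn.Module):
    """Sketch of the first prosody prediction model: GRU + linear layer,
    producing one sentence-level prosody vector."""
    def __init__(self, d_model=256, d_prosody=256):
        super().__init__()
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        self.linear = nn.Linear(d_model, d_prosody)

    def forward(self, enc):                       # enc: (batch, seq_len, d_model)
        _, h_n = self.gru(enc)                    # final hidden state summarises the sentence
        return self.linear(h_n[-1]).unsqueeze(1)  # (batch, 1, d_prosody), broadcastable over phonemes

class PhonemeLevelProsodyPredictor(nn.Module):
    """Sketch of the second prosody prediction model: 1-D convolution units
    (Conv1d + ReLU, normalization + dropout) followed by a linear layer."""
    def __init__(self, d_model=256, d_prosody=256, kernel_size=3, dropout=0.1):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Sequential(nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2),
                          nn.ReLU())
            for _ in range(2))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(2))
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(d_model, d_prosody)

    def forward(self, x):                         # x: (batch, seq_len, d_model)
        for conv, norm in zip(self.convs, self.norms):
            x = conv(x.transpose(1, 2)).transpose(1, 2)  # Conv1d expects (batch, channels, seq)
            x = self.dropout(norm(x))
        return self.linear(x)                     # (batch, seq_len, d_prosody)
```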
Thirdly, predicting the acoustic features of the sentences to be synthesized according to the pre-trained decoding model, the first coding vector and the prosodic feature vector to obtain the acoustic features corresponding to the sentences to be synthesized.
In this example embodiment, the decoding model 330 may be a multi-head attention-based decoding model corresponding to the multi-head attention encoder in the encoding model. The features may be position-encoded once before entering the decoding model to mark their positions in the sequence.
In the present exemplary embodiment, before the prosody-enhanced features (post-concatenation features) enter the decoding model, they are subjected to variable prediction so that optimized adjustment of prosody style and timbre is performed on the prosody-enhanced features. In this example, the acoustic variables may include a phoneme duration (duration), a pitch (pitch), an amplitude energy (energy), and the like, and may also include other acoustic variables, which are not limited in this example. And respectively carrying out variable prediction on each acoustic variable.
For example, the variable prediction may be performed on the superposition result of the first coding vector and the prosodic feature vector by using a pre-trained variable adaptation model, so as to obtain a first variable prediction result. And then, decoding the first variable prediction result based on an attention mechanism by utilizing a pre-trained decoding model to obtain acoustic features corresponding to the sentences to be synthesized.
In this example embodiment, the variable adaptation model may include a phoneme duration predictor, a pitch predictor, an amplitude energy predictor, etc. The duration is the length of each phoneme in the logarithmic domain, which can improve the accuracy of segmentation and reduce the information gap between input and output. For the pitch predictor, the output sequence is a frame-level F0 sequence; for the energy predictor, the output is an energy sequence for each mel-spectrogram frame. All predictors share the same model structure but have different model parameters. The predictor for each acoustic variable may include a fully connected layer, a plurality of convolution units and a linear unit (linear layer), where each convolution unit may include a one-dimensional convolution layer and a normalization layer. A ReLU activation function may be added after the one-dimensional convolution layer, and dropout may be added after the normalization layer. For example, the predictor for each acoustic variable may consist of a fully connected layer, two convolution units (one-dimensional convolution layer plus ReLU activation function) and a linear layer connected in sequence.
In this example embodiment, each variable predictor in the variable adaptation model has a similar model structure. Each takes the hidden sequence as input and predicts the variation of each phoneme (duration) or each frame (pitch and energy), trained with a mean-square-error loss.
In this example embodiment, a length regulation unit may further be included to match the length of the input phoneme sequence to the sequence length of the output mel-spectrogram, in preparation for output; the speaking rate and the like of the acoustic prediction model can also be controlled through the length regulation unit. The variable adaptation model may also include superposition processing and the like, which is not limited in this example.
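For reference, a predictor with the structure just described (fully connected layer, two one-dimensional convolution units, linear output) could be sketched as below; the hidden size, kernel size and dropout rate are assumptions, and LayerNorm again stands in for the normalization layer.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Sketch of one acoustic-variable predictor (duration / pitch / energy share
    this structure but not parameters)."""
    def __init__(self, d_model=256, kernel_size=3, dropout=0.5):
        super().__init__()
        self.fc = nn.Linear(d_model, d_model)
        self.convs = nn.ModuleList(
            nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2) for _ in range(2))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(2))
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(d_model, 1)           # one scalar per phoneme (or per frame)

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        x = self.fc(x)
        for conv, norm in zip(self.convs, self.norms):
            x = torch.relu(conv(x.transpose(1, 2))).transpose(1, 2)
            x = self.dropout(norm(x))
        return self.out(x).squeeze(-1)             # e.g. log-duration, F0 or energy sequence
```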
In the above embodiment (as shown in fig. 3), the present disclosure adds prosody prediction models (the first prosody prediction model and the second prosody prediction model) to the backbone network of the acoustic prediction model and superimposes the prosodic features they predict on the features of the backbone network, so as to enhance prosody at the sentence level and the phoneme level and strengthen the sense of cadence, making the tone of the synthesized audio richer.
The present disclosure solves the problem that the added prosody reference information may adversely affect the timbre of the backbone network, i.e., may cause the original timbre to change while increasing the prosody, by the following scheme.
In some embodiments, as shown in fig. 4, the prosody prediction model is decoupled from the backbone network, and the output result of the prosody prediction model is directly input into the variable adaptation model, without overlapping the prediction result of the prosody prediction model with the characteristics of the backbone network, so as to avoid the influence of the prosody prediction model on the timbre synthesis effect of the backbone network. Since the scheme is the same as the backbone network of the scheme of fig. 3, the present example describes in detail the different addition modes of the prosody prediction model including the first prosody prediction model and the second prosody prediction model, and predicting the prosody characteristic of the first encoded vector using the pre-trained prosody prediction model may include the following steps.
And a first step of predicting the prosody features of the sentence level on the first coding vector by using the first prosody prediction model to obtain a third prosody feature vector.
In this example embodiment, the sentence-level prosodic feature prediction may include temporal feature processing and linear transformation of the input data of the first prosody prediction model. Illustratively, a recurrent neural network (e.g., a GRU) may be employed to perform time-series feature processing on the input data, and the processing result is then linearly transformed. A GRU takes as input the data at the current time step and the hidden state from the previous time step, where the hidden state carries information about earlier steps; it outputs the data of the hidden node at the current step and the hidden state passed to the next step. The GRU computes its two gating states from the state passed on by the previous node and the input data of the current node.
In the present exemplary embodiment, prosodic feature prediction at the sentence level may be directly performed on the output result of the encoding model (first encoding vector).
And secondly, performing phoneme-level prosody feature prediction on the first coding vector by using a second prosody prediction model to obtain a fourth prosody feature vector.
In this example embodiment, the prosodic feature prediction at the phoneme level may include convolution processing and linear transformation of the input data of the second prosodic prediction model. For example, the input data may be convolved with a one-dimensional convolutional neural network, and then the processing result is linearly transformed. The convolution process may include a plurality of one-dimensional convolutions, after which an activation function may be added, and a normalization process, after which dropout may be set.
In the present exemplary embodiment, prosodic feature prediction at a phoneme level may be directly performed on the output result (first encoding vector) of the encoding model. The prosodic feature predictions of the two levels do not interfere with each other and belong to a parallel relationship.
And thirdly, respectively inputting the first coding vector and the prosody feature vector into a pre-trained variable adaptation model to perform variable prediction.
In this example embodiment, the first encoding vector, the third prosodic feature vector, and the fourth prosodic feature vector may be input into a pre-trained variable adaptation model, respectively, to perform variable prediction.
Illustratively, referring to fig. 5, the variable adaptation model includes predictors corresponding to a plurality of acoustic variables, such as a duration predictor, a pitch predictor, and an amplitude energy predictor, and an output result of the coding model (a first coding vector), an output result of the first prosodic feature prediction model (a third prosodic feature vector), and an output result of the second prosodic feature prediction model (a fourth prosodic feature vector) may be respectively input into the predictors corresponding to each acoustic variable for performing optimization adjustment of timbre, prosody, and the like. In other embodiments, a feature superposition process may also be added after the pitch predictor and the amplitude energy predictor to implement feature superposition of pitch and amplitude energy to adjust the reduced timbre. The variable adaptation model also includes length regularization, adjusting the output length of the module.
In the present exemplary embodiment, the processing procedure of the first encoding vector, the third prosodic feature vector, and the fourth prosodic feature vector in the predictor corresponding to each acoustic variable may be the same as the structure of the predictor corresponding to each acoustic variable as shown in fig. 6. As shown in fig. 6, the pitch predictor may include a full-connection layer, two one-dimensional convolution units and a linear layer, the one-dimensional convolution units are composed of a one-dimensional convolution module and a normalization module, the one-dimensional convolution module is composed of a one-dimensional convolution layer plus a ReLU activation function, and the normalization module is composed of a normalization layer plus random discard (dropout).
For example, as shown in fig. 6, the output result of the coding model (the first coding vector) and the output result of the first prosodic feature prediction model (the third prosodic feature vector) may be subjected to feature superposition, and then the superimposed features and the output result of the second prosodic feature prediction model (the fourth prosodic feature vector) may be subjected to vector splicing, where the spliced results are sequentially processed by the full-connection layer, the two one-dimensional convolution units and the linear layer, so as to complete optimization adjustment of pitch and prosody.
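The input combination described for fig. 6 can be summarized by the following sketch (shapes assumed): the sentence-level prosody vector is added onto the encoder output and the phoneme-level prosody vector is then concatenated along the feature dimension before the predictor's fully connected layer.

```python
import torch

def pitch_predictor_input(first_coding_vec, sentence_prosody_vec, phoneme_prosody_vec):
    """Illustrative wiring only (not the patented code): superpose, then concatenate."""
    # first_coding_vec:     (batch, seq_len, d_model)
    # sentence_prosody_vec: (batch, 1, d_model)   -- broadcast over the phoneme axis
    # phoneme_prosody_vec:  (batch, seq_len, d_prosody)
    superimposed = first_coding_vec + sentence_prosody_vec
    return torch.cat([superimposed, phoneme_prosody_vec], dim=-1)
```

The concatenated tensor would then pass through the fully connected layer, the two one-dimensional convolution units and the linear layer described above.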
In the present exemplary embodiment, prosody prediction models (a first prosody prediction model and a second prosody prediction model) are moved from a backbone network to respective acoustic variable predictors of a variable adaptation model (Variance adaptation) so that prosody prediction is completely separated from timbre prediction, decoupling of prosody enhancement and timbre synthesis is achieved, and prosody enhancement and control effects are achieved on the basis of ensuring that the quality of the original timbre is not changed.
In some embodiments, a recorded sentence of the target object exists before speech synthesis, and the sentence to be synthesized further includes this recorded sentence of the target object; the recorded sentence can be added into the variable adaptation model as reference information, so that prosody enhancement and timbre prediction are more accurate. Referring to fig. 7 or 8, the method further includes the following steps.
And performing secondary coding processing on the record sentence by adopting a pre-trained first reference coding model to obtain a second coding vector.
In this example embodiment, the first reference encoding model may include multiple layers of two-dimensional convolution units, where a two-dimensional convolution unit may include a two-dimensional convolution layer, a batch normalization layer and an activation function (ReLU), a recurrent neural network unit (e.g., a GRU), a multi-head attention unit over style tokens, and a linear layer. The corresponding secondary encoding process may include multiple rounds of two-dimensional convolution, batch normalization, activation, multi-head attention over the style tokens, and linear processing.
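A GST-style reference encoder of this shape could be sketched as follows; the channel counts, token count and head count are illustrative assumptions rather than values taken from the disclosure.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Sketch of the first reference coding model: 2-D convolution units over the
    reference recording's spectrogram, a GRU, multi-head attention over a bank of
    learned style tokens, and a linear layer."""
    def __init__(self, n_mels=80, d_model=256, n_tokens=10, n_heads=4):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.gru = nn.GRU(64 * (n_mels // 4), d_model, batch_first=True)
        self.style_tokens = nn.Parameter(torch.randn(n_tokens, d_model))
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.linear = nn.Linear(d_model, d_model)

    def forward(self, mel):                          # mel: (batch, frames, n_mels)
        x = self.convs(mel.unsqueeze(1))             # (batch, 64, frames/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        _, h_n = self.gru(x)                         # (1, batch, d_model)
        query = h_n[-1].unsqueeze(1)                 # reference summary as the attention query
        tokens = self.style_tokens.unsqueeze(0).expand(b, -1, -1)
        style, _ = self.attention(query, tokens, tokens)
        return self.linear(style)                    # second coding vector: (batch, 1, d_model)
```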
And inputting the second coding vector into a pre-trained variable adaptation model to perform variable prediction.
In the case where the first prosodic feature vector and the second prosodic feature vector are added into the backbone network, variable prediction is performed on the second coding vector together with the superposition result of the second prosodic feature vector and the superposition feature.
In this exemplary embodiment, the variable prediction process of the second encoding vector is similar to the variable prediction process of the prosodic feature vector, that is, the second encoding vector and the superposition result of the second prosodic feature vector and the superposition feature are respectively input into the variable adaptation model to perform corresponding processing.
And under the condition that the first prosodic feature vector and the second prosodic feature vector are respectively added into the variable adaptation model, performing variable prediction on the first coding vector, the prosodic feature vector and the second coding vector by utilizing the pre-trained variable adaptation model.
In the present exemplary embodiment, the processing procedure of the first encoding vector, prosodic feature vector, and second encoding vector in the variable adaptation model is as shown in fig. 9. The output result (second coding vector), the first coding vector and the prosody feature vector of the first reference coding model can be respectively input into predictors corresponding to each acoustic variable to perform optimization adjustment of timbre, prosody and the like. The variable adaptation model also includes length regularization, adjusting the output length of the module.
For example, the processing in the predictor corresponding to each acoustic variable is as shown in fig. 10, and the structure of the predictor corresponding to each acoustic variable may be the same. As shown in fig. 10, the pitch predictor may include a full-connection layer, two one-dimensional convolution units and a linear layer, the one-dimensional convolution units are composed of a one-dimensional convolution module and a normalization module, the one-dimensional convolution module is composed of a one-dimensional convolution layer plus a ReLU activation function, and the normalization module is composed of a normalization layer plus random discard (dropout).
In this example, the output result of the first reference coding model (the second coding vector), the output result of the first prosodic feature prediction model (the third prosodic feature vector) and the first coding vector may be subjected to feature superposition, and then the superimposed features and the output result of the second prosodic feature prediction model (the fourth prosodic feature vector) may be subjected to vector concatenation, where the concatenation result is sequentially processed by the full-connection layer, the two one-dimensional convolution units and the linear layer, so as to complete optimization adjustment of pitch and prosody.
In some embodiments, referring to fig. 11, the method further comprises training the acoustic prediction model, and the training process may comprise two parts, one part being training of the first reference coding model and the second reference coding model; another part is the training of the first prosody prediction model and the second prosody prediction model, which may include the following steps S1110 to S1130.
Step S1110, a training sample is obtained, where the training sample includes a recording sample, and a corresponding acoustic feature sample and symbol sequence sample.
In this example embodiment, the recording sample refers to real recorded audio, and the acoustic feature sample refers to an acoustic spectrogram (e.g., a mel-spectrogram) extracted from the recorded audio. The recorded audio has a corresponding training text, from which the corresponding phoneme sequence sample is determined; the real audio also typically carries speaker information (e.g., a speaker ID), with which the corpus can be queried to obtain the corresponding query result sentence sample. The symbol sequence sample is formed from the phoneme sequence sample and the symbol sequence corresponding to the query result sentence sample. One training sample may therefore include a segment of real recorded audio, the corresponding acoustic spectrogram, the corresponding phoneme sequence sample, and the symbol sequence sample corresponding to the query result sentence.
In this example embodiment, when model training is performed for a single speaker, the historical recorded audio of that speaker may be obtained, and the other categories of samples (phoneme sequences, mel-spectrograms, etc.) are derived from the recorded audio to form the training set together. For a general-purpose model, historical recorded audio of more speakers may be obtained, for example, recording samples covering men, women, the elderly, children, and so on.
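As an illustration, one training sample could be represented by a simple record such as the following; the field names are hypothetical and only mirror the description above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrainingSample:
    """Illustrative container for one training sample (not the patent's data format)."""
    recording_path: str                 # path to the real recorded audio
    mel_spectrogram: List[List[float]]  # acoustic feature sample extracted from the recording
    phoneme_ids: List[int]              # symbol sequence from the training text
    speaker_id: int                     # used to query the corpus for the query result sentence
```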
Step S1120, training the initial acoustic prediction model for one time by adopting a training sample to obtain an intermediate model; the initial acoustic prediction model includes a first reference coding model and a second reference coding model.
In this example embodiment, the first reference coding model may be a global style tokens reference encoder (GST reference encoder), and the second reference coding model is an acoustic feature reference encoder, such as a mel-spectrogram reference encoder; the recorded audio is input to the first reference coding model, the corresponding mel-spectrogram is input to the second reference coding model, and the two inputs serve as reference information for speech synthesis. The first training pass is used to train the first reference coding model, the second reference coding model and the backbone network.
Step S1130, fixing the model parameters of the first reference coding model and the second reference coding model of the intermediate model, and performing secondary training on the acoustic prediction model by using the training samples, the first reference coding model and the second reference coding model.
In the present exemplary embodiment, the model parameters of the first reference coding model and the second reference coding model obtained from the first training pass are fixed, and a second training pass is performed. The second pass uses the first reference coding model and the second reference coding model to train the two prosody prediction models, i.e., the first prosody prediction model (corresponding to the GST reference encoder) and the second prosody prediction model (corresponding to the mel-spectrogram reference encoder). The second training pass is used to train the first prosody prediction model, the second prosody prediction model and the backbone network.
In some embodiments, as shown in fig. 12, adding two reference coding models, namely a first reference coding model and a second reference coding model, in the first part of the training process of the acoustic prediction model, to form an initial acoustic prediction model, and training the initial acoustic prediction model by using a training sample may include the following steps:
the first step, coding the symbol sequence samples by adopting a coding model to obtain a first sample coding vector.
In this example embodiment, the model parameters of the initial acoustic prediction model may be initialized, and the symbol sequence samples are processed using the initialized coding model. This may include inputting the phoneme sequence corresponding to the text sample into the phoneme embedding module for dimension mapping (embedding) to obtain an embedding vector, inputting the speaker query result sentence sample into the speaker embedding module for dimension mapping (embedding), and superimposing the two mapping results to obtain the first sample coding vector.
And secondly, adopting a first reference coding model to code the recording sample, and obtaining a second sample coding vector.
In this example embodiment, the first reference encoding model performs the following processing on the recording sample: multiple rounds of two-dimensional convolution, batch normalization, activation, multi-head attention over style tokens, and linear processing.
And thirdly, adopting a second reference coding model to code the acoustic characteristic samples to obtain a third sample coding vector.
In this example embodiment, the second reference encoding model may include multiple layers of two-dimensional convolution units, a linear layer plus normalization, a multi-head attention layer, and another linear layer plus normalization, where a two-dimensional convolution unit may include a two-dimensional convolution layer, a batch normalization layer and an activation function (ReLU). Illustratively, the acoustic feature sample passes through the multi-layer two-dimensional convolution, a linear transformation with normalization, and a multi-head attention layer that extracts context information, and then undergoes a further linear transformation with normalization to obtain the third sample coding vector.
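A rough sketch of such a second reference encoding model is shown below; the layer sizes are assumptions, and the convolution stack is kept shape-preserving purely for brevity.

```python
import torch.nn as nn

class MelReferenceEncoder(nn.Module):
    """Sketch of the second reference coding model: 2-D convolution units over the
    acoustic-feature sample, linear layer + LayerNorm, multi-head attention for
    context, and another linear layer + LayerNorm."""
    def __init__(self, n_mels=80, d_model=256, n_heads=4):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.BatchNorm2d(1), nn.ReLU(),
        )
        self.proj = nn.Sequential(nn.Linear(n_mels, d_model), nn.LayerNorm(d_model))
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Sequential(nn.Linear(d_model, d_model), nn.LayerNorm(d_model))

    def forward(self, mel):                             # mel: (batch, frames, n_mels)
        x = self.convs(mel.unsqueeze(1)).squeeze(1)     # back to (batch, frames, n_mels)
        x = self.proj(x)                                # linear transformation + normalization
        x, _ = self.attention(x, x, x)                  # extract context information
        return self.out(x)                              # third sample coding vector (frame level)
```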
In the fourth step, the first sample coding vector and the second sample coding vector are superposed, and the superposition result is spliced with the third sample coding vector to obtain a spliced feature.
In the fifth step, the spliced feature is decoded with the decoding model, the decoded feature is output, and a first loss function value is calculated from the output result and the acoustic feature sample.
In this example embodiment, the first loss function value may be computed as the mean square error (MSE) between the output result and the acoustic feature sample. The decoding process is similar to the corresponding part of the speech synthesis process for the sentence to be synthesized.
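Computing this first loss is a direct mean-square-error comparison; a minimal sketch with dummy tensors (shapes are illustrative):

```python
import torch
from torch import nn

mse = nn.MSELoss()

# Illustrative shapes: (batch, frames, n_mels). In practice `predicted_mel`
# comes from the decoding model and `target_mel` is the acoustic feature
# (mel-spectrum) sample of the recording.
predicted_mel = torch.randn(2, 100, 80)
target_mel = torch.randn(2, 100, 80)
first_loss = mse(predicted_mel, target_mel)   # first loss function value
```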
In the sixth step, the model parameters of the coding model, the first reference coding model, the second reference coding model and the decoding model are adjusted according to the first loss function value.
In this example embodiment, the coding model and the decoding model belong to the backbone network. The model parameters of the variable adaptation model may also be adjusted.
In the primary training, the mel spectrum corresponding to the recorded audio is obtained, and the recorded audio and its mel spectrum are used as reference information for speech synthesis. The GST reference encoder outputs a sentence-level GST coding result, and the mel-spectrum reference encoder outputs a phoneme-level coding result. These are combined with the coding result of the phoneme sequence from the coding model of the backbone network and with the coding result representing the speaker's timbre information, and the combination is input to the variable adaptation model, where it influences the prediction learning of the duration, pitch and amplitude energy of each phoneme. After expansion to the frame level, the decoding model of the backbone network outputs a mel spectrum, which is compared with the real mel spectrum to compute the first loss. The loss is fed back to the network, and the model parameters are updated continuously until the first loss function value gradually converges and the GST reference encoder and the mel-spectrum reference encoder achieve a good reference coding effect. This process also trains the other modules of the backbone network, preparing for the second part of the training. A simplified primary training step is sketched below.
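The sketch uses hypothetical submodule names, and length regulation / frame-level expansion as well as the alignment between frame-level and phoneme-level sequences are glossed over:

```python
import torch

def primary_training_step(model, batch, optimizer, mse=torch.nn.MSELoss()):
    """One primary-training step following the flow described above (sketch)."""
    phonemes, speaker, recording, mel = batch             # symbol sequence, speaker id, audio, target mel

    text_enc = model.encoder(phonemes, speaker)           # backbone coding model output
    gst = model.gst_ref_encoder(recording)                # sentence-level reference code
    mel_ref = model.mel_ref_encoder(mel)                  # fine-grained reference code

    h = text_enc + gst.unsqueeze(1)                       # superpose the sentence-level style
    h = torch.cat([h, mel_ref], dim=-1)                   # splice in the fine-grained reference
    h = model.variance_adaptor(h)                         # duration / pitch / energy prediction
    mel_pred = model.decoder(h)                           # backbone decoding model

    loss = mse(mel_pred, mel)                             # first loss function value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```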
In some embodiments, as shown in fig. 13, in the second part of the training process two prosody prediction models, namely a first prosody prediction model and a second prosody prediction model, are added to the model; the two prosody prediction models are trained with the help of the first reference coding model and the second reference coding model obtained from the first part of the training. In this example, the secondary training of the acoustic prediction model may include the following steps:
In the first step, the once-trained coding model is used to encode the symbol sequence samples to obtain a fourth sample coding vector.
In this example embodiment, the training sample may be input again into the once-trained model.
In the second step, the once-trained first reference coding model is used to encode the recording sample to obtain a fifth sample coding vector.
The coding process in the above two steps is the same as the corresponding part in the speech synthesis process of the sentence to be synthesized, and will not be described here again.
In the third step, the once-trained second reference coding model is used to encode the acoustic feature samples to obtain a sixth sample coding vector.
In this example embodiment, the second reference coding model may be a mel-spectrum reference encoder, and the mel-spectrum samples are encoded with the once-trained mel-spectrum reference encoder. Illustratively, the acoustic feature sample passes through the once-trained multi-layer two-dimensional convolutions, a linear transformation with normalization, and the multi-head attention layer that extracts context information, and a further linear transformation with normalization is then applied to obtain the sixth sample coding vector.
In the fourth step, the fourth sample coding vector and the fifth sample coding vector are superposed, and the superposition result is input into the first prosody prediction model and the second prosody prediction model respectively.
In the fifth step, a second loss function value is calculated from the fifth sample coding vector and the output result of the first prosody prediction model, and the model parameters of the first prosody prediction model are adjusted based on the second loss function value.
In the sixth step, a third loss function value is calculated from the sixth sample coding vector and the output result of the second prosody prediction model, and the model parameters of the second prosody prediction model are adjusted based on the third loss function value.
The method may further include the following step: the sixth sample coding vector is feature-spliced into the backbone network and then processed by the variable adaptation model and the decoding model; a loss function is calculated from the output result and the real acoustic feature (the acoustic feature sample) and fed back to adjust the model parameters of the backbone network.
In the above embodiment, the second loss function value and the third loss function value may also be determined with the mean square error (MSE), respectively. As shown in fig. 13, during the secondary training the model parameters of the GST reference encoder and the mel-spectrum reference encoder are fixed so that they are not updated in this part of the training. A first prosody prediction model (a sentence-level prosody predictor) and a second prosody prediction model (a phoneme-level prosody predictor) are added to the initial acoustic model trained in the first part. In the secondary training, the recordings of the sentences and the mel spectrums of the corresponding texts are used as reference information, so that the GST reference encoder and the mel-spectrum reference encoder output their respective coding results, while the sentence-level prosody predictor and the phoneme-level prosody predictor output their respective prediction results. The model parameters of the sentence-level prosody predictor are adjusted by comparing the GST coding result with the prediction result of the sentence-level prosody predictor, and the model parameters of the phoneme-level prosody predictor are adjusted by comparing the coding result of the mel-spectrum reference encoder with the prediction result of the phoneme-level prosody predictor. Meanwhile, the coding result of the mel-spectrum reference encoder can be spliced into the backbone network, passed through a fully connected layer, and then processed by the variable adaptation model and the decoding model to output predicted acoustic features. The first loss function value can be calculated from the predicted acoustic features and the acoustic feature samples, and the model parameters of the backbone network and the variable adaptation model are adjusted backward with the first loss function value. This process is repeated and the parameters of the acoustic prediction model are updated continuously until all loss function values converge, at which point training is finished. A simplified secondary training step is sketched below.
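The sketch reuses the hypothetical module names from the earlier sketches; the three losses are simply summed here for brevity, whereas the disclosed method adjusts each predictor with its own loss:

```python
import torch

def secondary_training_step(model, batch, optimizer, mse=torch.nn.MSELoss()):
    """One secondary-training step (sketch): the frozen reference encoders supply
    targets for the two prosody predictors while the backbone keeps training on
    the reconstruction loss. Frame/phoneme alignment is glossed over."""
    phonemes, speaker, recording, mel = batch

    text_enc = model.encoder(phonemes, speaker)                  # fourth sample coding vector
    with torch.no_grad():                                        # reference encoders are fixed
        gst = model.gst_ref_encoder(recording)                   # fifth sample coding vector
        mel_ref = model.mel_ref_encoder(mel)                     # sixth sample coding vector

    h = text_enc + gst.unsqueeze(1)                              # superpose fourth and fifth
    sent_pred = model.sentence_prosody_predictor(h)              # first prosody prediction model
    phon_pred = model.phoneme_prosody_predictor(h)               # second prosody prediction model

    loss_sent = mse(sent_pred, gst)                              # second loss function value
    loss_phon = mse(phon_pred, mel_ref)                          # third loss function value

    h = torch.cat([h, mel_ref], dim=-1)                          # splice the reference code back in
    mel_pred = model.decoder(model.variance_adaptor(h))
    loss_recon = mse(mel_pred, mel)                              # keeps the backbone training

    loss = loss_recon + loss_sent + loss_phon                    # combined here for brevity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```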
It can be understood that the acoustic prediction model of the present disclosure is trained first, and the trained model is then used to perform the speech synthesis process for the sentence to be synthesized.
The present disclosure addresses specific requirements on prosody in specific scenarios. Prosody generally reflects characteristics of a speaker's voice such as pitch, loudness, speech rate and pauses. Because of prosody differences, a listener can perceive different moods of the speaker even when the text is the same. For example, a bright and full pronunciation with a rising tone in the second half of a sentence conveys enthusiasm to the audience; conversely, a fast pronunciation with a low tone in the second half conveys coldness. Prosody therefore plays a crucial role in the information conveyed by speech.
Although existing speech synthesis systems can output audio with high clarity and high timbre similarity, in a specific context or business scenario the synthesis result becomes relatively fixed once the model has converged during training, and the synthesized audio lacks expressiveness for the particular style or prosody the scenario requires. The final speech synthesis quality is therefore low and cannot reach the speaking effect of a real speaker in that scenario. Such business scenarios include cases where a rising intonation or a rising sentence-final tone is expected, where a more rhythmic reading is desired, or where other stylistic characteristics are wanted, such as a colloquial tone or a live-streaming sales style. Because existing models are trained on timbre data from a large number of different speakers, and consistency among recordings of the same speaker cannot be guaranteed, the prosody of the output tends toward an average.
To improve the model's handling of prosody and style, the method and the device introduce prosody feature prediction models to guide audio synthesis, achieving better prosody control and a better prosodic effect. Specifically, a sentence-level prosody predictor and a phoneme-level prosody predictor are introduced to guide learning of prosodic characteristics at the sentence and phoneme levels, so that the speaker's prosodic characteristics can be better reproduced and the pausing and naturalness of the synthesized audio improved. For complex prosody modeling scenarios with expressive speakers, such as enthusiastic speech or speech with emotional fluctuation, the method achieves prosody enhancement and improves the quality of the synthesized speech.
The prosodic style is characterized by the pitch, loudness, speech rate, rhythm and pauses of the speaker's voice and is in principle independent of the speaker's timbre; in other words, anyone can imitate the speaking style of any other speaker with his or her own timbre. Mixing the predicted prosodic features into the backbone network, however, may change the timbre of part of the synthesized audio. At the same time, the prosodic style features are closely related to the variable adaptation model's prediction learning of the duration, pitch and amplitude energy of each phoneme. Therefore, the prediction result of the prosodic feature prediction model is input into the variable adaptation model rather than being mixed with the timbre features of the backbone network. This avoids the influence of the prosodic style features on the speaker's timbre, decouples timbre from prosodic style, and ensures that the synthesized audio learns only the prosodic style features of the reference information without affecting the original speaker's timbre.
Within the variable adaptation model, the coding result of the GST reference encoder does not interact with the intermediate results on the backbone network and only influences the prediction of the prosody-related duration, pitch and amplitude energy. The model therefore learns only the prosodic style from the reference information, completing the decoupling of timbre and prosodic style and achieving better prosody control through the reference information.
In this method, the prediction result of the prosody prediction model is input into the variable adaptation model so that it does not affect the backbone network; the prosody enhancement performed by the prosody prediction model is thus completely separated from the timbre prediction, achieving both prosody enhancement and prosody control. A minimal sketch of this routing is given below.
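In the sketch the duration/pitch/energy predictors are reduced to single linear layers; all names and dimensions are assumptions, and the point is only that the prosody vector conditions the predictors without being added back into the hidden sequence that carries the timbre:

```python
import torch
from torch import nn

class VarianceAdaptorSketch(nn.Module):
    """Illustrative variance adaptation step: the prosody vector conditions the
    duration/pitch/energy predictors but is not mixed into the hidden sequence
    passed on to the decoder, keeping timbre and prosodic style decoupled."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.duration = nn.Linear(2 * dim, 1)
        self.pitch = nn.Linear(2 * dim, 1)
        self.energy = nn.Linear(2 * dim, 1)

    def forward(self, hidden: torch.Tensor, prosody: torch.Tensor):
        # hidden:  (batch, phonemes, dim) backbone encoding (timbre path)
        # prosody: (batch, phonemes, dim) prosody feature vector (style path)
        cond = torch.cat([hidden, prosody], dim=-1)      # prosody only influences the predictors
        durations = self.duration(cond).squeeze(-1)
        pitch = self.pitch(cond).squeeze(-1)
        energy = self.energy(cond).squeeze(-1)
        # `hidden` itself is passed on (after length regulation) to the decoder
        # unchanged, so the synthesized timbre is unaffected by the prosody vector.
        return hidden, durations, pitch, energy
```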
In the model training stage, the sentence-level and phoneme-level prosody features are learned under the guidance of the two reference encoders. In the speech synthesis stage, the reference encoders and the reference recording information can be dispensed with (TTS normally uses only text, without recorded audio), so speech synthesis is completed independently while prosody enhancement and control are still achieved, which increases practicality.
The GST reference encoder (the first reference coding model) is attached to the variable adaptation model, so it is used only to optimize and adjust the prosodic style, does not influence the synthesized timbre at all, and thus decouples timbre from prosodic style.
The method and the device are applicable to speech services with higher requirements on prosodic style in various business scenarios, such as human-machine interaction and novel reading.
Further, in the present exemplary embodiment, a speech synthesis apparatus 1400 is also provided. The speech synthesis apparatus 1400 may be applied to intelligent speech devices. Referring to fig. 14, the speech synthesis apparatus 1400 may include an acquisition module 1410, a prediction module 1420 and a speech synthesis module 1430. The acquisition module 1410 is used to acquire a symbol sequence of a sentence to be synthesized, where the sentence to be synthesized includes a text to be synthesized and a query result sentence for a target object. The prediction module 1420 is configured to perform acoustic feature prediction on the symbol sequence using a pre-trained acoustic prediction model to obtain acoustic features corresponding to the sentence to be synthesized; the acoustic prediction model includes a prosody prediction model, which learns the prosodic characteristics of reference recorded audio in the model training stage and thereby enhances the prosodic characteristics of the sentence to be synthesized in the speech synthesis stage. The speech synthesis module 1430 is configured to perform feature conversion and synthesis on the acoustic features to obtain speech corresponding to the sentence to be synthesized.
In one exemplary embodiment of the present disclosure, the acoustic prediction model further includes an encoding model and a decoding model, and the prediction module 1420 may be further configured to: performing primary coding processing on the symbol sequence by utilizing a pre-trained coding model to obtain a first coding vector; performing prosodic feature prediction on the first coded vector by using a pre-trained prosodic prediction model to obtain a prosodic feature vector; and predicting the acoustic features of the sentence to be synthesized according to the pre-trained decoding model, the first coding vector and the prosodic feature vector to obtain the acoustic features corresponding to the sentence to be synthesized.
In one exemplary embodiment of the present disclosure, the prediction module 1420 may also be used to: performing variable prediction on the superposition result of the first coding vector and the prosody feature vector by using a pre-trained variable adaptation model to obtain a first variable prediction result; and carrying out decoding processing based on an attention mechanism on the first variable prediction result by utilizing a pre-trained decoding model to obtain acoustic features corresponding to the sentences to be synthesized.
In one exemplary embodiment of the present disclosure, the prediction module 1420 may also be used to: respectively inputting the first coding vector and the prosody feature vector into a pre-trained variable adaptation model, and carrying out variable prediction to obtain a second variable prediction result; and carrying out decoding processing based on an attention mechanism on the second variable prediction result by utilizing a pre-trained decoding model to obtain acoustic features corresponding to the sentences to be synthesized.
In one exemplary embodiment of the present disclosure, the prosody prediction model includes a first prosody prediction model and a second prosody prediction model, and the prediction module 1420 may be further configured to: performing sentence-level prosody feature prediction on the first coding vector by using the first prosody prediction model to obtain a first prosody feature vector; performing phoneme-level prosody feature prediction on the superposition features of the first coding vector and the first prosody feature vector by using a second prosody prediction model to obtain a second prosody feature vector; and carrying out variable prediction on the superposition result of the second prosodic feature vector and the superposition feature.
In one exemplary embodiment of the present disclosure, the prosody prediction model includes a first prosody prediction model and a second prosody prediction model, and the prediction module 1420 may be further configured to: performing sentence-level prosodic feature prediction on the first coding vector by using the first prosodic prediction model to obtain a third prosodic feature vector; performing phoneme-level prosody feature prediction on the first coding vector by using the second prosody prediction model to obtain a fourth prosody feature vector; and respectively inputting the first coding vector, the third prosody feature vector and the fourth prosody feature vector into a pre-trained variable adaptation model to perform variable prediction.
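The two ways of combining the prosody vectors with the first coding vector can be sketched as follows (hypothetical predictor modules; the sentence-level predictor is assumed to return a per-sentence vector that is broadcast over the phoneme axis):

```python
import torch
from torch import nn

def predict_prosody(first_coding: torch.Tensor,
                    sentence_predictor: nn.Module,
                    phoneme_predictor: nn.Module):
    """Sketch of the two combination schemes described above.

    Scheme A: the sentence-level prosody vector is superposed onto the coding
    vector before phoneme-level prediction, and the final superposition result
    is handed to the variable adaptation model.
    Scheme B: both prosody vectors are given to the variable adaptation model
    separately, without being mixed into the coding vector.
    """
    # first_coding: (batch, phonemes, dim); sentence_predictor -> (batch, dim)

    # Scheme A (superposition)
    first_prosody = sentence_predictor(first_coding).unsqueeze(1)    # (batch, 1, dim)
    superposed = first_coding + first_prosody
    second_prosody = phoneme_predictor(superposed)                   # (batch, phonemes, dim)
    scheme_a_input = superposed + second_prosody                     # fed to the variance adaptor

    # Scheme B (separate inputs)
    third_prosody = sentence_predictor(first_coding).unsqueeze(1)
    fourth_prosody = phoneme_predictor(first_coding)
    scheme_b_inputs = (first_coding, third_prosody, fourth_prosody)

    return scheme_a_input, scheme_b_inputs
```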
In one exemplary embodiment of the present disclosure, the prediction module 1420 may also be used to: and performing time sequence feature processing and linear transformation on the input data of the first prosody prediction model.
In one exemplary embodiment of the present disclosure, the prediction module 1420 may also be used to: and carrying out convolution processing and linear transformation on the input data of the second prosody prediction model.
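Under these two descriptions, the predictors could be realized roughly as below; a GRU is an assumed realization of the time-sequence feature processing and a 1-D convolution of the convolution processing, with illustrative sizes:

```python
import torch
from torch import nn

class SentenceLevelProsodyPredictor(nn.Module):
    """Time-sequence feature processing followed by a linear transformation."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, phonemes, dim); the final hidden state summarizes the sentence
        _, h_n = self.rnn(x)
        return self.linear(h_n[-1])          # (batch, dim) sentence-level prosody vector


class PhonemeLevelProsodyPredictor(nn.Module):
    """Convolution processing followed by a linear transformation."""

    def __init__(self, dim: int = 256, kernel: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.act = nn.ReLU()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, phonemes, dim) -> Conv1d expects (batch, dim, phonemes)
        h = self.act(self.conv(x.transpose(1, 2))).transpose(1, 2)
        return self.linear(h)                # (batch, phonemes, dim) phoneme-level prosody
```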
In an exemplary embodiment of the present disclosure, the sentence to be synthesized further includes a recording sentence of the target object, and the apparatus 1400 further includes a reference module, where the reference module is configured to perform a secondary encoding process on the recording sentence by using a first reference encoding model trained in advance, to obtain a second encoding vector; the prediction module 1420 may also be used to: performing variable prediction on the superposition result of the second prosodic feature vector and the superposition feature and the second coding vector by using a pre-trained variable adaptation model; alternatively, the variable prediction is performed on the first encoded vector, the prosodic feature vector, and the second encoded vector using a pre-trained variable adaptation model.
In an exemplary embodiment of the present disclosure, the apparatus 1400 further comprises a training module that may be used to train the acoustic prediction model and that includes an acquisition sub-module, a first training sub-module and a second training sub-module. The acquisition sub-module may be used to acquire training samples, where the training samples comprise recording samples and the corresponding acoustic feature samples and symbol sequence samples. The first training sub-module may be configured to perform primary training of the initial acoustic prediction model with the training samples to obtain an intermediate model; the initial acoustic prediction model comprises a first reference coding model and a second reference coding model. The second training sub-module may be configured to fix the model parameters of the first reference coding model and the second reference coding model of the intermediate model and to perform secondary training of the acoustic prediction model using the training samples, the first reference coding model and the second reference coding model.
In an exemplary embodiment of the present disclosure, the initial acoustic prediction model further includes an encoding model and a decoding model, and the first training sub-module is further configured to: encode the symbol sequence samples with the encoding model to obtain a first sample coding vector; encode the recording sample with the first reference coding model to obtain a second sample coding vector; encode the acoustic feature samples with the second reference coding model to obtain a third sample coding vector; superpose the first sample coding vector and the second sample coding vector and splice the superposition result with the third sample coding vector to obtain a spliced feature; decode the spliced feature with the decoding model, output the decoded feature, and calculate a first loss function value from the output result and the acoustic feature sample; and adjust the model parameters of the encoding model, the first reference coding model, the second reference coding model and the decoding model according to the first loss function value.
In an exemplary embodiment of the present disclosure, the acoustic prediction model includes a first prosody prediction model and a second prosody prediction model, and the second training sub-module may be configured to: encode the symbol sequence samples with the once-trained encoding model to obtain a fourth sample coding vector; encode the recording sample with the once-trained first reference coding model to obtain a fifth sample coding vector; encode the acoustic feature samples with the once-trained second reference coding model to obtain a sixth sample coding vector; superpose the fourth sample coding vector and the fifth sample coding vector and input the result into the first prosody prediction model and the second prosody prediction model respectively; calculate a second loss function value from the fifth sample coding vector and the output result of the first prosody prediction model, and adjust the model parameters of the first prosody prediction model based on the second loss function value; and calculate a third loss function value from the sixth sample coding vector and the output result of the second prosody prediction model, and adjust the model parameters of the second prosody prediction model based on the third loss function value.
The specific details of each module or unit in the above-mentioned speech synthesis apparatus have been described in detail in the corresponding speech synthesis method, and thus will not be described here again.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiments, or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods in the embodiments described above. For example, the electronic device may implement the respective steps shown in fig. 2 to 13.
It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
An electronic device 1500 according to such an embodiment of the present disclosure is described below with reference to fig. 15. The electronic device 1500 shown in fig. 15 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 15, the electronic device 1500 is embodied in the form of a general purpose computing device. The components of electronic device 1500 may include, but are not limited to: the at least one processing unit 1510, the at least one storage unit 1520, a bus 1530 connecting the different system components (including the storage unit 1520 and the processing unit 1510), and a display unit 1540.
Wherein the storage unit stores program code that is executable by the processing unit 1510 such that the processing unit 1510 performs steps according to various exemplary embodiments of the present disclosure described in the above section of the "exemplary method" of the present specification.
Illustratively, the processing unit 1510 may perform the steps of: acquiring a symbol sequence of a sentence to be synthesized, wherein the sentence to be synthesized comprises a text to be synthesized and a query result sentence aiming at a target object; carrying out acoustic feature prediction on the symbol sequence by utilizing a pre-trained acoustic prediction model to obtain acoustic features corresponding to the sentences to be synthesized; the acoustic prediction model comprises a prosody prediction model which enhances the prosody characteristics of the sentence to be synthesized in the speech synthesis stage by learning the prosody characteristics of the reference recorded audio in the model training stage; and performing feature conversion and synthesis on the acoustic features to obtain the voice corresponding to the sentence to be synthesized.
The storage unit 1520 may include readable media in the form of volatile memory units such as Random Access Memory (RAM) 15201 and/or cache memory 15202, and may further include Read Only Memory (ROM) 15203.
The storage unit 1520 may also include a program/utility 15204 having a set (at least one) of program modules 15205, such program modules 15205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 1530 may be a bus representing one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 1500 may also communicate with one or more external devices 1570 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 1500, and/or any device (e.g., router, modem, etc.) that enables the electronic device 1500 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 1550. Also, the electronic device 1500 may communicate with one or more networks, such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet, through a network adapter 1560. As shown, the network adapter 1560 communicates with the other modules of the electronic device 1500 over the bus 1530. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with the electronic device 1500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Furthermore, the above-described figures are only schematic illustrations of processes included in the method according to the exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
It should be noted that although the steps of the methods of the present disclosure are illustrated in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order or that all of the illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc., all are considered part of the present disclosure.
It should be understood that the present disclosure disclosed and defined herein extends to all alternative combinations of two or more of the individual features mentioned or evident from the text and/or drawings. All of these different combinations constitute various alternative aspects of the present disclosure. Embodiments of the present disclosure describe the best mode known for carrying out the disclosure and will enable one skilled in the art to utilize the disclosure.

Claims (15)

1. A method of speech synthesis, the method comprising:
acquiring a symbol sequence of a sentence to be synthesized, wherein the sentence to be synthesized comprises a text to be synthesized and a query result sentence aiming at a target object;
carrying out acoustic feature prediction on the symbol sequence by utilizing a pre-trained acoustic prediction model to obtain acoustic features corresponding to the sentences to be synthesized; the acoustic prediction model comprises a prosody prediction model which enhances the prosody characteristics of the sentence to be synthesized in a speech synthesis stage by learning the prosody characteristics of the reference recording audio in a model training stage;
and performing feature conversion and synthesis on the acoustic features to obtain the voice corresponding to the sentence to be synthesized.
2. The method of speech synthesis according to claim 1, wherein the acoustic prediction model further comprises an encoding model and a decoding model, and wherein the predicting the acoustic features of the symbol sequence using the pre-trained acoustic prediction model comprises:
performing primary coding processing on the symbol sequence by utilizing a pre-trained coding model to obtain a first coding vector;
performing prosody feature prediction on the first coding vector by utilizing a pre-trained prosody prediction model to obtain a prosody feature vector;
and predicting the acoustic features of the sentence to be synthesized according to a pre-trained decoding model, the first coding vector and the prosodic feature vector to obtain the acoustic features corresponding to the sentence to be synthesized.
3. The speech synthesis method according to claim 2, wherein predicting the acoustic features of the sentence to be synthesized based on a pre-trained decoding model, the first encoding vector, and the prosodic feature vector comprises:
performing variable prediction on the superposition result of the first coding vector and the prosody feature vector by using a pre-trained variable adaptation model to obtain a first variable prediction result;
And carrying out decoding processing based on an attention mechanism on the first variable prediction result by utilizing a pre-trained decoding model to obtain acoustic features corresponding to the statement to be synthesized.
4. The speech synthesis method according to claim 2, wherein predicting the acoustic features of the sentence to be synthesized based on a pre-trained decoding model, the first encoding vector, and the prosodic feature vector comprises:
respectively inputting the first coding vector and the prosody feature vector into a pre-trained variable adaptation model to perform variable prediction to obtain a second variable prediction result;
and carrying out decoding processing based on an attention mechanism on the second variable prediction result by utilizing a pre-trained decoding model to obtain acoustic features corresponding to the statement to be synthesized.
5. The speech synthesis method according to claim 3, wherein the prosody prediction model includes a first prosody prediction model and a second prosody prediction model, and wherein the performing prosody feature prediction on the first encoded vector using the pre-trained prosody prediction model includes:
performing sentence-level prosody feature prediction on the first coding vector by using the first prosody prediction model to obtain a first prosody feature vector;
Performing phoneme-level prosodic feature prediction on the superposition features of the first coding vector and the first prosodic feature vector by using the second prosodic prediction model to obtain a second prosodic feature vector;
the variable prediction of the superposition result of the first coding vector and the prosody feature vector includes:
and carrying out variable prediction on the superposition result of the second prosodic feature vector and the superposition feature.
6. The speech synthesis method according to claim 4, wherein the prosody prediction model includes a first prosody prediction model and a second prosody prediction model, and wherein the performing prosody feature prediction on the first encoded vector using the pre-trained prosody prediction model includes:
performing sentence-level prosody feature prediction on the first coding vector by using the first prosody prediction model to obtain a third prosody feature vector;
performing phoneme-level prosody feature prediction on the first coding vector by using the second prosody prediction model to obtain a fourth prosody feature vector;
the step of inputting the first coding vector and the prosody feature vector into a pre-trained variable adaptation model respectively to perform variable prediction comprises the following steps:
And respectively inputting the first coding vector, the third prosody feature vector and the fourth prosody feature vector into a pre-trained variable adaptation model to perform variable prediction.
7. The method of claim 5 or 6, wherein the sentence-level prosodic feature prediction comprises:
and carrying out time sequence feature processing and linear transformation on the input data of the first prosody prediction model.
8. The method of speech synthesis according to claim 5 or 6, wherein the phoneme-level prosodic feature prediction comprises:
and carrying out convolution processing and linear transformation on the input data of the second prosody prediction model.
9. The method of speech synthesis according to claim 5, wherein the sentence to be synthesized further comprises a recorded sentence of the target object, the method further comprising:
performing secondary coding processing on the recording statement by adopting a pre-trained first reference coding model to obtain a second coding vector;
the variable prediction includes:
performing variable prediction on the superposition result of the second prosodic feature vector and the superposition feature and the second coding vector by using a pre-trained variable adaptation model;
or alternatively,
and carrying out variable prediction on the first coding vector, the prosody characteristic vector and the second coding vector by utilizing a pre-trained variable adaptation model.
10. The method of speech synthesis according to claim 1, further comprising training the acoustic prediction model, the training comprising:
acquiring a training sample, wherein the training sample comprises a recording sample, and a corresponding acoustic characteristic sample and symbol sequence sample;
training the initial acoustic prediction model for one time by adopting the training sample to obtain an intermediate model; the initial acoustic prediction model includes a first reference coding model and a second reference coding model;
and fixing model parameters of the first reference coding model and the second reference coding model of the intermediate model, and performing secondary training on the acoustic prediction model by using the training samples, the first reference coding model and the second reference coding model.
11. The method of speech synthesis according to claim 10, wherein the initial acoustic prediction model further comprises an encoding model and a decoding model, the training of the initial acoustic prediction model with the training sample comprising:
Coding the symbol sequence samples by adopting a coding model to obtain a first sample coding vector;
adopting the first reference coding model to code the recording sample to obtain a second sample coding vector;
adopting a second reference coding model to code the acoustic characteristic samples to obtain a third sample coding vector;
after the first sample coding vector and the second sample coding vector are subjected to feature superposition, the first sample coding vector and the second sample coding vector are subjected to feature splicing with the third sample coding vector to obtain splicing features;
decoding the spliced characteristic by adopting a decoding model, outputting the decoded characteristic, and calculating a first loss function value according to an output result and the acoustic characteristic sample;
and adjusting model parameters of the coding model, the first reference coding model, the second reference coding model and the decoding model according to the first loss function value.
12. The method of speech synthesis according to claim 11, wherein the acoustic prediction model comprises a first prosody prediction model and a second prosody prediction model, the secondarily training the acoustic prediction model comprising:
adopting a coding model after one-time training to code the symbol sequence samples to obtain a fourth sample coding vector;
Adopting a first reference coding model after one training to code the recording sample to obtain a fifth sample coding vector;
adopting a second reference coding model after one training to code the acoustic characteristic sample to obtain a sixth sample coding vector;
superposing the fourth sample coding vector and the fifth sample coding vector and then respectively inputting the first prosody prediction model and the second prosody prediction model;
calculating a second loss function value according to the fifth sample coding vector and the output result of the first prosody prediction model; adjusting model parameters of the first prosody prediction model based on the second loss function value;
calculating a third loss function value according to the sixth sample coding vector and the output result of the second prosody prediction model; and adjusting model parameters of the second prosody prediction model based on the third loss function value.
13. A speech synthesis apparatus, the apparatus comprising:
the system comprises an acquisition module, a synthesis module and a query module, wherein the acquisition module is used for acquiring a symbol sequence of a sentence to be synthesized, and the sentence to be synthesized comprises a text to be synthesized and a query result sentence aiming at a target object;
The prediction module is used for predicting the acoustic characteristics of the symbol sequence by utilizing a pre-trained acoustic prediction model to obtain the acoustic characteristics corresponding to the statement to be synthesized; the acoustic prediction model comprises a prosody prediction model which enhances the prosody characteristics of the sentence to be synthesized in a speech synthesis stage by learning the prosody characteristics of the reference recording audio in a model training stage;
and the voice synthesis module is used for carrying out feature conversion and synthesis on the acoustic features to obtain voices corresponding to the sentences to be synthesized.
14. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-12.
15. An electronic device, comprising: one or more processors; and
storage means for storing one or more programs which when executed by the one or more processors cause the one or more processors to implement the method of any of claims 1-12.
CN202310189613.3A 2023-02-27 2023-02-27 Speech synthesis method and device, storage medium and electronic equipment Pending CN116312476A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310189613.3A CN116312476A (en) 2023-02-27 2023-02-27 Speech synthesis method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310189613.3A CN116312476A (en) 2023-02-27 2023-02-27 Speech synthesis method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116312476A true CN116312476A (en) 2023-06-23

Family

ID=86828122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310189613.3A Pending CN116312476A (en) 2023-02-27 2023-02-27 Speech synthesis method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116312476A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117711374A (en) * 2024-02-01 2024-03-15 广东省连听科技有限公司 Audio-visual consistent personalized voice synthesis system, synthesis method and training method
CN117711374B (en) * 2024-02-01 2024-05-10 广东省连听科技有限公司 Audio-visual consistent personalized voice synthesis system, synthesis method and training method

Similar Documents

Publication Publication Date Title
US11929059B2 (en) Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
JP7355306B2 (en) Text-to-speech synthesis method, device, and computer-readable storage medium using machine learning
JP7445267B2 (en) Speech translation method and system using multilingual text-to-speech synthesis model
Robinson et al. Sequence-to-sequence modelling of f0 for speech emotion conversion
EP4118641A1 (en) Speech recognition using unspoken text and speech synthesis
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
CN112102811B (en) Optimization method and device for synthesized voice and electronic equipment
EP4191586A1 (en) Method and system for applying synthetic speech to speaker image
US11475874B2 (en) Generating diverse and natural text-to-speech samples
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
GB2603776A (en) Methods and systems for modifying speech generated by a text-to-speech synthesiser
CN117373431A (en) Audio synthesis method, training method, device, equipment and storage medium
CN116312476A (en) Speech synthesis method and device, storage medium and electronic equipment
JP5574344B2 (en) Speech synthesis apparatus, speech synthesis method and speech synthesis program based on one model speech recognition synthesis
CN113314097B (en) Speech synthesis method, speech synthesis model processing device and electronic equipment
CN116312471A (en) Voice migration and voice interaction method and device, electronic equipment and storage medium
CN117882131A (en) Multiple wake word detection
CN115376533A (en) Voice conversion method for personalized voice generation
CN113628609A (en) Automatic audio content generation
Hasanabadi An overview of text-to-speech systems and media applications
CN117636842B (en) Voice synthesis system and method based on prosody emotion migration
Hirose Modeling of fundamental frequency contours for HMM-based speech synthesis: Representation of fundamental frequency contours for statistical speech synthesis
CN117012182A (en) Voice data processing method and device, electronic equipment and storage medium
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination