CN111816158A - Voice synthesis method and device and storage medium - Google Patents
- Publication number
- CN111816158A (application CN201910878228.3A)
- Authority
- CN
- China
- Legal status: Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The embodiment of the invention discloses a voice synthesis method, a device and a storage medium. The method comprises the following steps: obtaining a symbol sequence of a sentence to be synthesized, wherein the sentence to be synthesized comprises a recording sentence representing a target object and a query result sentence for the target object; encoding the symbol sequence by using a preset encoding model to obtain a feature vector set; acquiring the recording acoustic features corresponding to the recording sentence; predicting the acoustic features corresponding to the sentence to be synthesized based on a preset decoding model, the feature vector set, a preset attention model and the recording acoustic features to obtain predicted acoustic features corresponding to the sentence to be synthesized, wherein the preset attention model is a model that generates context vectors for decoding from the feature vector set, and the predicted acoustic features consist of at least one associated acoustic feature; and performing feature conversion and synthesis on the predicted acoustic features to obtain the voice corresponding to the sentence to be synthesized.
Description
Technical Field
The embodiment of the invention relates to a voice processing technology in the field of electronic application, in particular to a voice synthesis method and device and a storage medium.
Background
At present, speech synthesis technology is applied in many intelligent devices, such as smart speakers, outbound telephone systems and number calling systems. After receiving a user's query request for a target object, the intelligent device generates a sentence to be synthesized that represents the target object and the query result, converts the sentence to be synthesized into a complete voice, and plays it to inform the user of the query result about the target object. When converting the sentence to be synthesized into the complete voice, a recording of the fixed target object in the sentence to be synthesized is made in advance, the dynamically updated query result in the sentence to be synthesized is synthesized into speech by speech synthesis, and the recording and the synthesized speech are spliced to obtain the complete voice of the sentence to be synthesized.
However, since the process of generating the recorded voice and the process of generating the synthesized voice are independent, the speaking rate, pitch and so on of the recorded voice and the synthesized voice differ. This can make the prosody of the complete voice spliced from the recorded voice and the synthesized voice inconsistent, and the transition duration between the recorded voice and the synthesized voice uncertain, resulting in poor voice quality.
Disclosure of Invention
The invention mainly aims to provide a voice synthesis method, a voice synthesis device and a storage medium, which can realize the prosody consistency of synthesized voice and improve the quality of the synthesized voice.
The technical scheme of the invention is realized as follows:
the embodiment of the invention provides a voice synthesis method, which comprises the following steps:
obtaining a symbol sequence of a sentence to be synthesized, wherein the sentence to be synthesized comprises a sound recording sentence representing a target object and a query result sentence aiming at the target object;
coding the symbol sequence by using a preset coding model to obtain a feature vector set;
acquiring a recording acoustic characteristic corresponding to the recording statement;
predicting the acoustic features corresponding to the sentence to be synthesized based on a preset decoding model, the feature vector set, a preset attention model and the acoustic features of the sound recording to obtain predicted acoustic features corresponding to the sentence to be synthesized, wherein the preset attention model is a model for generating a context vector for decoding by using the feature vector set, and the predicted acoustic features consist of at least one associated acoustic feature;
and performing feature conversion and synthesis on the predicted acoustic features to obtain the voice corresponding to the sentence to be synthesized.
In the foregoing solution, the predicting the acoustic features corresponding to the sentence to be synthesized based on the preset decoding model, the feature vector set, the preset attention model, and the acoustic features of the sound recording to obtain the predicted acoustic features corresponding to the sentence to be synthesized includes:
when i is equal to 1, acquiring initial acoustic features at the ith decoding moment, and predicting the 1 st acoustic feature based on the initial acoustic features, the preset decoding model, the feature vector set and the preset attention model, wherein i is an integer greater than 0;
under the condition that i is larger than 1, when the ith decoding time is the decoding time of the sound recording statement, taking the acoustic feature of a jth frame from the sound recording acoustic features, taking the acoustic feature of the jth frame as the acoustic feature of an i-1 th frame, and predicting the ith acoustic feature based on the acoustic feature of the i-1 th frame, the preset decoding model, the feature vector set and the preset attention model, wherein j is an integer larger than 0;
when the ith decoding time is the decoding time of the query result statement, taking one frame of acoustic features in the (i-1) th acoustic features as the (i-1) th frame of acoustic features, and predicting the ith acoustic features based on the (i-1) th frame of acoustic features, the preset decoding model, the feature vector set and the preset attention model;
continuing to execute the prediction process of the (i +1) th decoding moment until the decoding of the sentence to be synthesized is finished to obtain the nth acoustic feature, wherein n is the total frame number of the decoding moments of the sentence to be synthesized and is an integer greater than 1;
and taking the obtained ith acoustic feature to the nth acoustic feature as the predicted acoustic feature.
In the above scheme, the preset decoding model includes a first recurrent neural network and a second recurrent neural network; predicting an ith acoustic feature based on the i-1 th frame acoustic feature, the preset decoding model, the feature vector set and the preset attention model, including:
performing a nonlinear transformation on the acoustic features of the (i-1)-th frame to obtain an intermediate feature vector;
performing matrix operation and nonlinear transformation on the intermediate feature vector by using the first recurrent neural network to obtain an ith intermediate hidden variable;
performing context vector calculation on the feature vector set and the ith intermediate hidden variable by using the preset attention model to obtain an ith context vector;
performing matrix operation and nonlinear transformation on the ith context vector and the ith intermediate hidden variable by using the second recurrent neural network to obtain an ith hidden variable;
and performing linear transformation on the ith hidden variable according to a preset frame number to obtain the ith acoustic feature.
In the above scheme, the feature vector set includes a feature vector corresponding to each symbol in the symbol sequence; the performing context vector calculation on the feature vector set and the ith intermediate hidden variable by using the preset attention model to obtain an ith context vector includes:
performing attention calculation on the feature vector corresponding to each symbol in the symbol sequence and the ith intermediate hidden variable by using the preset attention model to obtain an ith group of attention values;
and according to the ith group of attention values, carrying out weighted summation on the feature vector set to obtain the ith context vector.
In the foregoing solution, after predicting the ith acoustic feature based on the i-1 th frame acoustic feature, the preset decoding model, the feature vector set, and the preset attention model, and before continuing to perform the prediction process at the i +1 th decoding time, the method further includes:
determining the ith target symbol corresponding to the maximum attention value from the ith group of attention values;
when the ith target symbol is a non-ending symbol of the sound recording statement, determining that the (i +1) th decoding time is the decoding time of the sound recording statement;
and/or when the ith target symbol is a non-end symbol of the query result statement, determining that the (i +1) th decoding time is the decoding time of the query result statement;
and/or when the ith target symbol is the end symbol of the sound recording statement and the end symbol of the sound recording statement is not the end symbol of the statement to be synthesized, determining the (i +1) th decoding time as the decoding time of the query result statement;
and/or when the ith target symbol is the end symbol of the query result statement and the end symbol of the query result statement is not the end symbol of the statement to be synthesized, determining that the (i +1) th decoding time is the decoding time of the sound recording statement;
and/or when the ith target symbol is the end symbol of the sentence to be synthesized, determining the (i +1) th decoding time as the decoding end time of the sentence to be synthesized.
In the foregoing scheme, the encoding the symbol sequence by using a preset encoding model to obtain a feature vector set includes:
performing vector conversion on the symbol sequence by using the preset coding model to obtain an initial feature vector set;
and performing nonlinear transformation and feature extraction on the initial feature vector set to obtain the feature vector set.
In the foregoing scheme, the performing feature conversion and synthesis on the predicted acoustic features to obtain the speech corresponding to the sentence to be synthesized includes:
performing feature conversion on the predicted acoustic features to obtain a linear spectrum;
and carrying out reconstruction synthesis on the linear spectrum to obtain the voice.
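For illustration only, the following is a minimal sketch of this reconstruction-synthesis step, assuming the predicted acoustic features have already been converted into a linear magnitude spectrogram; the Griffin-Lim call (mentioned later in connection with the speech reconstruction model), the sampling rate and the frame parameters are illustrative assumptions, not values taken from this disclosure.

```python
import numpy as np
import librosa

def reconstruct_waveform(linear_mag: np.ndarray,
                         sr: int = 16000,
                         n_fft: int = 1024,
                         hop_length: int = 200) -> np.ndarray:
    """Reconstruct a waveform from a linear magnitude spectrogram
    (shape: [n_fft // 2 + 1, frames]) with the Griffin-Lim algorithm."""
    return librosa.griffinlim(linear_mag,
                              n_iter=60,
                              hop_length=hop_length,
                              win_length=n_fft)

# usage sketch:
# wav = reconstruct_waveform(linear_mag)   # then write to file or play back
```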
In the above scheme, the symbol sequence is an alphabetical sequence or a phoneme sequence.
In the above scheme, before the obtaining of the symbol sequence of the sentence to be synthesized, the method further includes:
obtaining a sample symbol sequence corresponding to at least one sample synthesis statement, wherein each sample synthesis statement represents a sample object and a reference query result aiming at the sample object;
acquiring an initial voice synthesis model, initial acoustic features and sample acoustic features corresponding to the sample synthesis statements; the initial speech synthesis model is a model for encoding processing and prediction;
and training the initial speech synthesis model by using the sample symbol sequence, the initial acoustic features and the sample acoustic features to obtain the preset coding model, the preset decoding model and the preset attention model.
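As an illustration of the training step described above, here is a minimal sketch of one teacher-forced training iteration; the model object, its forward signature and the L1 loss are assumptions for illustration and are not prescribed by this disclosure.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, sample_symbols, sample_mels, initial_frame):
    """One training step (sketch): the model (encoder + attention + decoder,
    as described above) predicts mel frames from the sample symbol sequence,
    teacher-forced with the sample acoustic features, and is trained with an
    L1 loss. `model` and its forward signature are assumed for illustration."""
    optimizer.zero_grad()
    # predicted_mels: [batch, frames, n_mels], same layout as sample_mels
    predicted_mels = model(sample_symbols, sample_mels, initial_frame)
    loss = F.l1_loss(predicted_mels, sample_mels)
    loss.backward()
    optimizer.step()
    return loss.item()
```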
An embodiment of the present invention provides a speech synthesis apparatus, where the apparatus includes: the device comprises a sequence generation module, a voice synthesis module and an acquisition module; wherein,
the sequence generation module is used for acquiring a symbol sequence of a sentence to be synthesized, wherein the sentence to be synthesized comprises a sound recording sentence representing a target object and a query result sentence aiming at the target object;
the voice synthesis module is used for coding the symbol sequence by using a preset coding model to obtain a feature vector set;
the acquisition module is used for acquiring the sound recording acoustic characteristics corresponding to the sound recording sentences;
the speech synthesis module is further configured to predict, based on a preset decoding model, the feature vector set, a preset attention model and the acoustic features of the sound recording, the acoustic features corresponding to the sentence to be synthesized, so as to obtain predicted acoustic features corresponding to the sentence to be synthesized, where the preset attention model is a model that uses the feature vector set to generate a context vector for decoding, and the predicted acoustic features are composed of at least one associated acoustic feature; and performing feature conversion and synthesis on the predicted acoustic features to obtain the voice corresponding to the sentence to be synthesized.
In the foregoing solution, the speech synthesis module is specifically configured to, when i is equal to 1, obtain an initial acoustic feature at an ith decoding time, predict a 1 st acoustic feature based on the initial acoustic feature, the preset decoding model, the feature vector set, and the preset attention model, where i is an integer greater than 0;
under the condition that i is larger than 1, when the ith decoding time is the decoding time of the sound recording statement, taking the acoustic feature of a jth frame from the sound recording acoustic features, taking the acoustic feature of the jth frame as the acoustic feature of an i-1 th frame, and predicting the ith acoustic feature based on the acoustic feature of the i-1 th frame, the preset decoding model, the feature vector set and the preset attention model, wherein j is an integer larger than 0;
when the ith decoding time is the decoding time of the query result statement, taking one frame of acoustic features in the (i-1) th acoustic features as the (i-1) th frame of acoustic features, and predicting the ith acoustic features based on the (i-1) th frame of acoustic features, the preset decoding model, the feature vector set and the preset attention model;
continuing to execute the prediction process of the (i +1) th decoding moment until the decoding of the sentence to be synthesized is finished to obtain the nth acoustic feature, wherein n is the total frame number of the decoding moments of the sentence to be synthesized and is an integer greater than 1;
and using the obtained ith acoustic feature to the nth acoustic feature as the predicted acoustic feature.
In the above scheme, the preset decoding model includes a first recurrent neural network and a second recurrent neural network;
the speech synthesis module is specifically configured to perform a nonlinear transformation on the acoustic features of the (i-1)-th frame to obtain an intermediate feature vector; perform matrix operation and nonlinear transformation on the intermediate feature vector by using the first recurrent neural network to obtain an ith intermediate hidden variable; perform context vector calculation on the feature vector set and the ith intermediate hidden variable by using the preset attention model to obtain an ith context vector; perform matrix operation and nonlinear transformation on the ith context vector and the ith intermediate hidden variable by using the second recurrent neural network to obtain an ith hidden variable; and perform linear transformation on the ith hidden variable according to a preset frame number to obtain the ith acoustic feature.
In the above scheme, the feature vector set includes a feature vector corresponding to each symbol in the symbol sequence;
the speech synthesis module is specifically configured to perform attention calculation on a feature vector corresponding to each symbol in the symbol sequence and the ith intermediate hidden variable by using the preset attention model to obtain an ith group of attention values; and according to the ith group of attention values, carrying out weighted summation on the feature vector set to obtain the ith context vector.
In the foregoing solution, the speech synthesis module is further configured to determine, after predicting the ith acoustic feature based on the i-1 th frame acoustic feature, the preset decoding model, the feature vector set, and the preset attention model, an ith target symbol corresponding to a maximum attention value from the ith group of attention values before continuing to perform the prediction process at the i +1 th decoding time;
when the ith target symbol is a non-ending symbol of the sound recording statement, determining the (i +1) th decoding time as the decoding time of the sound recording statement;
and/or when the ith target symbol is a non-end symbol of the query result statement, determining that the (i +1) th decoding time is the decoding time of the query result statement;
and/or when the ith target symbol is the end symbol of the sound recording statement and the end symbol of the sound recording statement is not the end symbol of the statement to be synthesized, determining the (i +1) th decoding time as the decoding time of the query result statement;
and/or when the ith target symbol is the end symbol of the query result statement and the end symbol of the query result statement is not the end symbol of the statement to be synthesized, determining that the (i +1) th decoding time is the decoding time of the sound recording statement;
and/or when the ith target symbol is the end symbol of the sentence to be synthesized, determining the (i +1) th decoding time as the decoding end time of the sentence to be synthesized.
In the foregoing scheme, the speech synthesis module is specifically configured to perform vector conversion on the symbol sequence to obtain an initial feature vector set; and perform nonlinear transformation and feature extraction on the initial feature vector set to obtain the feature vector set.
In the above scheme, the speech synthesis module is specifically configured to perform feature conversion on the predicted acoustic features to obtain a linear spectrum; and carrying out reconstruction synthesis on the linear spectrum to obtain the voice.
In the above scheme, the symbol sequence is an alphabetical sequence or a phoneme sequence.
In the above scheme, the apparatus further comprises: a training module;
the training module is configured to, before the obtaining of the symbol sequence of the sentence to be synthesized, obtain a sample symbol sequence corresponding to each of at least one sample synthesis sentence, where each sample synthesis sentence represents a sample object and a reference query result for the sample object; acquiring an initial voice synthesis model, initial acoustic features and sample acoustic features corresponding to the sample synthesis statements; the initial speech synthesis model is a model for encoding processing and prediction; and training the initial speech synthesis model by using the sample symbol sequence, the initial acoustic features and the sample acoustic features to obtain the preset coding model, the preset decoding model and the preset attention model.
An embodiment of the present invention provides a speech synthesis apparatus, where the apparatus includes: a processor, a memory and a communication bus, the memory being in communication with the processor through the communication bus, the memory storing one or more programs executable by the processor, the one or more programs, when executed, causing the processor to perform the steps of any of the speech synthesis methods described above.
Embodiments of the present invention provide a computer-readable storage medium storing a program which, when executed by at least one processor, causes the at least one processor to perform the steps of any of the speech synthesis methods described above.
The embodiments of the invention provide a voice synthesis method and device and a storage medium. With the above technical implementation, the predicted acoustic features corresponding to the sentence to be synthesized are first obtained by prediction based on a preset decoding model, a feature vector set, a preset attention model and the recording acoustic features; the predicted acoustic features are then subjected to feature conversion and synthesis to obtain the voice. This avoids the uncertain transition duration that arises when a recording is spliced with synthesized speech, and improves the quality of the synthesized voice.
Drawings
Fig. 1 is a first schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a Tacotron model according to an embodiment of the present invention;
fig. 3 is a first flowchart of a speech synthesis method according to an embodiment of the present invention;
fig. 4 is a flowchart of a speech synthesis method according to an embodiment of the present invention;
fig. 5 is a schematic diagram illustrating a correspondence relationship between a phoneme sequence and attention values according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
In the following description, suffixes such as "module", "component" or "unit" used to denote elements are only intended to facilitate the description of the present invention and have no specific meaning in themselves. Thus, "module", "component" and "unit" may be used interchangeably.
Referring now to fig. 1, which is a schematic diagram of a speech synthesis apparatus 1 for implementing various embodiments of the present invention, the apparatus 1 may include: a sequence generation module 11, a voice synthesis module 12 and a playing module 13. The sequence generation module 11 receives a query request for a target object sent by a user, determines a sentence to be synthesized according to the query request (the sentence to be synthesized is the text of the query result about the target object), and transmits the symbol sequence of the sentence to be synthesized to the voice synthesis module 12; the voice synthesis module 12 performs voice synthesis on the symbol sequence to obtain the voice corresponding to the sentence to be synthesized and transmits the voice to the playing module 13; the playing module 13 plays the voice.
In some embodiments, the speech synthesis module 12 is built from an Attention model and an Encoder-Decoder model. For example, the speech synthesis module 12 is a Tacotron model, a Text-to-Speech (TTS) model based on deep learning. As shown in fig. 2, the Tacotron model mainly includes a coding model 21, an attention model 22 and a decoding model 23. The coding model 21 includes a character embedding model 211, a Pre-net model 212 and a CBHG model 213; the decoding model 23 includes a Pre-net model 231, a first Recurrent Neural Network (RNN) 232, a second recurrent neural network 233, a linear conversion model 234, a CBHG model 235 and a speech reconstruction model 236. The CBHG model 213 and the CBHG model 235 have the same structure and are composed of a convolution bank, a highway network and a Gated Recurrent Unit (GRU); the speech reconstruction model 236 comprises a model generated using the Griffin-Lim algorithm.
Illustratively, the Tacotron model receives the symbol sequence of the sentence to be synthesized and performs the encoding process as follows: the character embedding model 211 performs vector conversion on the symbol sequence to obtain a converted vector set and transmits it to the Pre-net model 212; the Pre-net model 212 applies a nonlinear transformation to the converted vector set to obtain an intermediate feature vector set and transmits it to the CBHG model 213; the CBHG model 213 performs a series of matrix operations and nonlinear transformations on the intermediate feature vector set to obtain the feature vector set, and the encoding ends.
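The following is a simplified structural sketch of this encoding path (character embedding → Pre-net → CBHG); for brevity the CBHG block is replaced by a bidirectional GRU, so this is an assumption-laden illustration rather than the exact model of this disclosure.

```python
import torch
import torch.nn as nn

class SimpleEncoder(nn.Module):
    """Character embedding -> Pre-net -> (CBHG simplified to a BiGRU)."""
    def __init__(self, n_symbols: int, emb_dim: int = 256, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, emb_dim)
        self.prenet = nn.Sequential(                 # nonlinear transformation
            nn.Linear(emb_dim, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.5),
        )
        self.bigru = nn.GRU(128, hidden, batch_first=True, bidirectional=True)

    def forward(self, symbol_ids: torch.Tensor) -> torch.Tensor:
        # symbol_ids: [batch, seq_len] -> feature vector set [batch, seq_len, 2 * hidden]
        x = self.prenet(self.embedding(symbol_ids))
        features, _ = self.bigru(x)
        return features
```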
Further, after the encoding process ends, the prediction process is performed as follows: at the current decoding time, the Pre-net model 231 performs a nonlinear transformation on the current frame's acoustic features to obtain an intermediate feature vector and transmits it to the first recurrent neural network 232; the first recurrent neural network 232 performs a series of matrix operations and nonlinear transformations on the intermediate feature vector to obtain the current intermediate hidden variable (hidden state), transmits it to the attention model 22 and the second recurrent neural network 233, and also stores it for use at the next decoding time; the attention model 22 performs context vector calculation on the current intermediate hidden variable and the feature vector set obtained by encoding to obtain the current context vector and transmits it to the second recurrent neural network 233; the second recurrent neural network 233 performs a series of matrix operations and nonlinear transformations on the current context vector and the current intermediate hidden variable to obtain the current hidden variable and transmits it to the linear conversion model 234; the linear conversion model 234 performs a linear transformation on the current hidden variable to obtain the current acoustic feature and transmits it to the CBHG model 235; the prediction process of the next decoding time continues until the decoding of the sentence to be synthesized ends and the last acoustic feature is obtained; the CBHG model 235 performs feature conversion on the first through last acoustic features to obtain a linear spectrum and transmits it to the speech reconstruction model 236; the speech reconstruction model 236 reconstructs and synthesizes the linear spectrum to generate speech.
It should be noted that, as indicated by the dashed line in fig. 2, the decoding model 23 may perform the prediction process in an autoregressive manner, that is, one frame of the acoustic features obtained at the current decoding time is used as the input of the next decoding time; the prediction process can also be performed in a non-autoregressive manner, in which the input of the next decoding time is not a frame of the acoustic features obtained at the current decoding time. Fig. 2 takes only three decoding times as an example; the number of decoding times is not limited in the embodiments of the present invention.
It will be appreciated by those skilled in the art that the configuration of the speech synthesis apparatus shown in fig. 1 or fig. 2 is not intended to be limiting, and the speech synthesis apparatus may comprise more or less components than those shown, or some components may be combined, or a different arrangement of components.
It should be noted that the embodiment of the present invention can be implemented based on the speech synthesis apparatus shown in fig. 1 or fig. 2, and a specific embodiment of speech synthesis is described below based on fig. 1 or fig. 2.
Example one
An embodiment of the present invention provides a speech synthesis method, as shown in fig. 3, the method includes:
s301, obtaining a symbol sequence of a sentence to be synthesized, wherein the sentence to be synthesized comprises a recording sentence representing a target object and a query result sentence aiming at the target object;
when receiving a query request aiming at a target object, the voice synthesis device generates a sentence to be synthesized according to the query request and then obtains a symbol sequence of the sentence to be synthesized, wherein the symbol sequence is an alphabetic sequence or a phoneme sequence.
In some embodiments, the speech synthesis apparatus obtains the query result information in one of the following ways: obtaining it from the query request, obtaining it from a storage module according to the query request, or requesting it from an external device. The text in the query result information is then organized into a query result sentence; in addition, a recording sentence matching the target object is obtained from a preset recording sentence library. The query result sentence and the recording sentence are spliced according to the dialog pattern to obtain the sentence to be synthesized, and the symbol sequence of the sentence to be synthesized is then generated, as sketched below.
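A minimal sketch of this splicing step, assuming (purely for illustration) that each recording sentence in the library carries a placeholder marking where the query result sentence is inserted:

```python
def build_sentence_to_synthesize(recording_template: str, query_result: str) -> str:
    """Splice a pre-recorded sentence template with the dynamically obtained
    query result. The '{result}' placeholder is a hypothetical convention for
    where the query result sentence is inserted (beginning, middle, or end)."""
    return recording_template.format(result=query_result)

# usage sketch
# build_sentence_to_synthesize("Today's weather in Beijing is {result}",
#                              "clear turning to cloudy")
```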
In some embodiments, the preset recording sentence library stores recordings, recording sentences and recording Mel spectra in one-to-one correspondence. Before step S301, the speech synthesis apparatus extracts at least one frame of Mel spectrum from each recording in advance and stores the recording, the recording sentence and the at least one frame of Mel spectrum in the preset recording sentence library, where the duration of one frame may be, for example, 10 ms or 15 ms.
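A minimal sketch of how the per-frame Mel spectra of a recording might be extracted and prepared for the preset recording sentence library; the librosa calls are standard, but the sampling rate, FFT size and 80 Mel bands are illustrative assumptions rather than values from this disclosure.

```python
import librosa
import numpy as np

def extract_recording_mels(wav_path: str, sr: int = 16000,
                           frame_ms: int = 10, n_mels: int = 80) -> np.ndarray:
    """Return the recording's Mel spectrum as [frames, n_mels], one row per
    frame_ms-long frame, for storage in the preset recording sentence library."""
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(sr * frame_ms / 1000)          # 10 ms hop -> one frame per 10 ms
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel).T        # [frames, n_mels]
```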
In some embodiments, the dialog patterns are largely divided into three types: first, the query result sentence is located in the middle of the recording sentence, for example, in "Dear Mr. X", "X" is the query result sentence; second, the query result sentence is located at the end of the recording sentence, for example, in "Today's weather in Beijing is turning cloudy", "turning cloudy" is the query result sentence; third, the query result sentence is located at the beginning of the recording sentence, for example, in "X's song", "X" is the query result sentence.
In some embodiments, the recorded sentences in the sentences to be synthesized are divided into a first sub-recorded sentence and a second sub-recorded sentence, and the first sub-recorded sentence precedes the second sub-recorded sentence.
For example, taking a smart speaker as the voice synthesis apparatus, the user sends a query request asking what the weather in Beijing is today. The smart speaker sends a request to query today's weather in Beijing to a weather query device, receives query result information of "clear turning to cloudy" returned by the weather query device, and takes "clear turning to cloudy" as the query result sentence; the smart speaker also obtains the recording sentence "Today's weather in Beijing is" from the preset recording sentence library and splices them to obtain the sentence to be synthesized: "Today's weather in Beijing is clear turning to cloudy".
In some embodiments, a phoneme sequence of the sentence to be synthesized is generated in the pronunciation order of the sentence to be synthesized; or generating the letter sequence of the sentence to be synthesized according to the letter spelling sequence of the sentence to be synthesized.
Illustratively, when the sentence to be synthesized is HelloEverybody, the corresponding letter sequence is { h, e, l, l, o, e, v, e, r, y, b, o, d, y }.
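A minimal sketch of generating the letter sequence in spelling order (a phoneme sequence would instead come from a grapheme-to-phoneme front end, which is not shown here):

```python
def letter_sequence(sentence: str):
    """Letter sequence of a sentence to be synthesized, in spelling order."""
    return [ch.lower() for ch in sentence if ch.isalpha()]

# letter_sequence("HelloEverybody")
# -> ['h', 'e', 'l', 'l', 'o', 'e', 'v', 'e', 'r', 'y', 'b', 'o', 'd', 'y']
```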
S302, coding the symbol sequence by using a preset coding model to obtain a feature vector set;
The coding model in the speech synthesis apparatus codes the symbol sequence to obtain the feature vector set, where the feature vector set consists of the feature vector of each symbol in the symbol sequence, and the coding model here is the preset coding model.
In some embodiments, the speech synthesis apparatus performs vector conversion on the symbol sequence to obtain an initial feature vector set, and performs nonlinear transformation and feature extraction on the initial feature vector set to obtain the feature vector set.
That is, the speech synthesis apparatus converts each symbol in the symbol sequence into a vector to obtain the initial feature vector set, and from it obtains the feature vector set.
S303, acquiring sound recording acoustic characteristics corresponding to the sound recording sentences;
the voice synthesis device acquires the recording acoustic characteristics corresponding to the recording sentences from a preset recording sentence library; wherein, the recording acoustic characteristic is at least one frame Mel spectrum corresponding to the recording statement.
In some embodiments, the recorded acoustic features characterize a plurality of frames of acoustic features ordered in the order of the sequence of symbols of the recorded statement.
S304, predicting acoustic features corresponding to the sentence to be synthesized based on a preset decoding model, a feature vector set, a preset attention model and sound recording acoustic features to obtain predicted acoustic features corresponding to the sentence to be synthesized, wherein the preset attention model is a model for generating a context vector for decoding by using the feature vector set, and the predicted acoustic features consist of at least one associated acoustic feature;
the voice synthesis device predicts the acoustic features corresponding to the sentences to be synthesized through a preset decoding model and a preset attention model to obtain predicted acoustic features; in the prediction process, the preset decoding model can also take out a frame of acoustic features from the acoustic features of the sound recording, and the frame of acoustic features are used as input of the prediction process; the preset decoding model is a decoding model, and the preset attention model is an attention model.
In some embodiments, when i is equal to 1, the speech synthesis apparatus obtains an initial acoustic feature at an ith decoding time, predicts the 1 st acoustic feature based on the initial acoustic feature, a preset decoding model, a feature vector set and a preset attention model, and i is an integer greater than 0; under the condition that i is larger than 1, when the ith decoding time is the decoding time of the sound recording statement, extracting the acoustic feature of a jth frame from the sound recording acoustic feature, taking the acoustic feature of the jth frame as the acoustic feature of an i-1 th frame, and predicting the ith acoustic feature based on the acoustic feature of the i-1 th frame, a preset decoding model, a feature vector set and a preset attention model, wherein j is an integer larger than 0; when the ith decoding moment is the decoding moment of the query result statement, taking one frame of acoustic features in the (i-1) th acoustic features as the (i-1) th frame of acoustic features, and predicting the ith acoustic features based on the (i-1) th frame of acoustic features, a preset decoding model, a feature vector set and a preset attention model; continuing to execute the prediction process of the (i +1) th decoding moment until the decoding of the sentence to be synthesized is finished to obtain the nth acoustic feature, wherein n is the total frame number of the decoding moment of the sentence to be synthesized and is an integer greater than 1; and taking the obtained ith acoustic feature to the nth acoustic feature as predicted acoustic features.
The voice synthesis device takes i as 1, and at the 1 st decoding moment, initial acoustic features are obtained from a preset recording sentence library, and the initial acoustic features are one frame of acoustic features; the method comprises the steps of taking initial acoustic features and a feature vector set as input, and predicting the 1 st acoustic feature by utilizing a preset decoding model and a preset attention model; taking i as 2, and starting from the 2 nd decoding moment, firstly judging the type of the 2 nd decoding moment, wherein the type comprises the decoding moment of the recording statement, the decoding moment of the query result statement and the decoding ending moment of the statement to be synthesized; taking out the acoustic features of the 1 st frame according to the type of the 2 nd decoding moment, taking the acoustic features of the 1 st frame as input, and predicting the 2 nd acoustic features by using a preset decoding model, a feature vector set and a preset attention model; and continuously judging the type of the 3 rd decoding moment until the decoding of the sentence to be synthesized is finished.
In some embodiments, the speech synthesis apparatus may set an all 0 vector of one frame in size as the initial acoustic feature.
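Putting the above together, the following is a minimal sketch of this prediction loop with the all-zero initial acoustic feature; `decode_step` and `classify_next_time` are assumed callables standing in for the preset decoding/attention models and the target-symbol judgment described later, and the indexing into the recording acoustic features is simplified to one frame per step rather than the k-frame indexing discussed below.

```python
import numpy as np

RECORDING, QUERY, END = "recording", "query", "end"

def predict_acoustic_features(decode_step, classify_next_time,
                              recording_mels: np.ndarray,
                              n_mels: int = 80, max_steps: int = 1000):
    """Autoregressive prediction loop (a sketch, not the exact model).

    decode_step(prev_frame) -> (acoustic_feature [k, n_mels], attention_values)
    classify_next_time(attention_values) -> RECORDING | QUERY | END
    recording_mels: [frames, n_mels] acoustic features of the recording sentence.
    """
    predicted = []
    prev_frame = np.zeros(n_mels)            # all-zero initial acoustic feature
    time_type = RECORDING                    # type of the 1st decoding time (assumed)
    j = 0                                    # next frame to take from the recording
    for _ in range(max_steps):
        feature, attention = decode_step(prev_frame)
        predicted.append(feature)
        time_type = classify_next_time(attention)
        if time_type == END:
            break
        if time_type == RECORDING and j < len(recording_mels):
            prev_frame = recording_mels[j]   # feed a frame of the real recording
            j += 1                           # (frame indexing simplified here)
        else:
            prev_frame = feature[-1]         # last frame of the previous prediction
    return predicted
```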
It should be noted that, because the recording sentence has recording acoustic features extracted from a real-person recording, when the ith decoding time is judged to be the decoding time of the recording sentence, a frame of acoustic features can be taken from the recording acoustic features and used to predict the ith acoustic feature; since a frame of the real-person recording is used for the prediction, the sound corresponding to the ith acoustic feature is more realistic.
In some embodiments, each of the ith through nth acoustic features includes one frame of acoustic feature or at least two frames of acoustic features, the number of frames of the acoustic feature corresponding to the ith acoustic feature may be set, and the at least two frames of acoustic features are non-overlapping, time-continuous, multi-frame acoustic features, so that predicting the multi-frame acoustic features at each decoding time can reduce the decoding time and reduce the complexity of the decoding model.
In some embodiments, the last frame of acoustic features in the (i-1)-th acoustic feature may be taken as the (i-1)-th frame acoustic feature; correspondingly, the ith acoustic feature comprises k frames of acoustic features and j = k × (i − 1), where k is the number of frames corresponding to each acoustic feature and is a positive integer greater than 0.
Illustratively, when k is 3, j takes on values of 3, 6, 9 ….
In some embodiments, in the prediction process at the ith decoding time, the speech synthesis apparatus extracts the (k × i)-th frame of acoustic features from the recording acoustic features and uses it as the (i − 1)-th frame acoustic feature; in the prediction process at the (i + 1)-th decoding time, it extracts the (k × (i + 1))-th frame of acoustic features from the recording acoustic features and uses it as the ith frame acoustic feature; the (k × i)-th and (k × (i + 1))-th frames of acoustic features are extracted in the order of the symbol sequence of the recording sentence.
In some embodiments, the preset decoding model comprises a first recurrent neural network and a second recurrent neural network. The speech synthesis apparatus performs a nonlinear transformation on the acoustic features of the (i-1)-th frame to obtain an intermediate feature vector; performs matrix operation and nonlinear transformation on the intermediate feature vector by using the first recurrent neural network to obtain the ith intermediate hidden variable; performs context vector calculation on the feature vector set and the ith intermediate hidden variable by using the preset attention model to obtain the ith context vector; performs matrix operation and nonlinear transformation on the ith context vector and the ith intermediate hidden variable by using the second recurrent neural network to obtain the ith hidden variable; and performs linear transformation on the ith hidden variable according to the preset frame number to obtain the ith acoustic feature.
And the speech synthesis device transmits the acoustic features of the (i-1) th frame to a preset decoding model at the ith decoding moment, and the preset decoding model predicts the ith acoustic features by using the acoustic features of the (i-1) th frame.
In some embodiments, at the ith decoding time, the speech synthesis apparatus transmits the acoustic features of the (i-1)-th frame to the Pre-net model in the decoding model; the Pre-net model performs a nonlinear transformation on the (i-1)-th frame acoustic features to obtain an intermediate feature vector and transmits it to the first recurrent neural network; the first recurrent neural network performs matrix operation and nonlinear transformation on the intermediate feature vector to obtain the ith intermediate hidden variable and transmits it to the attention model and the second recurrent neural network; the attention model performs context vector calculation on the feature vector set and the ith intermediate hidden variable to obtain the ith context vector and transmits it to the second recurrent neural network; the second recurrent neural network performs matrix operation and nonlinear transformation on the ith context vector and the ith intermediate hidden variable to obtain the ith hidden variable and transmits it to the linear transformation module; and the linear transformation module performs linear transformation on the ith hidden variable according to the preset frame number to obtain the ith acoustic feature. A minimal sketch of such a decoding step is given below.
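The following PyTorch sketch illustrates one such decoding step (Pre-net → first RNN → attention context → second RNN → linear projection to a preset number of frames); GRU cells and a dot-product-style attention are illustrative stand-ins, not the exact networks of this disclosure.

```python
import torch
import torch.nn as nn

class DecodeStep(nn.Module):
    """One decoding step: Pre-net -> RNN1 -> attention context -> RNN2 -> linear."""
    def __init__(self, n_mels: int = 80, enc_dim: int = 256,
                 hidden: int = 256, k_frames: int = 3):
        super().__init__()
        self.prenet = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU(), nn.Dropout(0.5),
                                    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.5))
        self.rnn1 = nn.GRUCell(128, hidden)
        self.attn_proj = nn.Linear(hidden, enc_dim)
        self.rnn2 = nn.GRUCell(enc_dim + hidden, hidden)
        self.linear = nn.Linear(hidden, n_mels * k_frames)   # preset frame number k
        self.k_frames, self.n_mels = k_frames, n_mels

    def forward(self, prev_frame, enc_features, h1, h2):
        # prev_frame: [batch, n_mels]; enc_features: [batch, seq_len, enc_dim]
        h1 = self.rnn1(self.prenet(prev_frame), h1)           # intermediate hidden variable
        scores = torch.bmm(enc_features, self.attn_proj(h1).unsqueeze(2)).squeeze(2)
        attn = torch.softmax(scores, dim=1)                   # attention value per symbol
        context = torch.bmm(attn.unsqueeze(1), enc_features).squeeze(1)
        h2 = self.rnn2(torch.cat([context, h1], dim=1), h2)   # hidden variable
        frames = self.linear(h2).view(-1, self.k_frames, self.n_mels)
        return frames, attn, h1, h2
```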
It should be noted that, in the prediction process, the speech synthesis apparatus obtains the ith acoustic feature by using the first recurrent neural network and the second recurrent neural network. Since a recurrent neural network is an artificial neural network whose nodes are connected in a directed cycle, the input at the current step is processed together with the hidden variables computed so far, so that the output at one position in the sequence is connected with the inputs of all previous positions. Thus, by using the first recurrent neural network and the second recurrent neural network, all acoustic features in the obtained predicted acoustic features are correlated with each other, and the transitions in the speech obtained from the predicted acoustic features are therefore natural.
In some embodiments, the first recurrent neural network may be replaced with a first Long Short-Term Memory network (LSTM) and, correspondingly, the second recurrent neural network with a second LSTM; the first and second recurrent neural networks may also be replaced by other neural networks, which is not limited in the embodiments of the present invention.
In some embodiments, the set of feature vectors comprises a feature vector corresponding to each symbol in the sequence of symbols; the speech synthesis device performs attention calculation on a feature vector corresponding to each symbol (letter or phoneme) in a symbol sequence and an ith intermediate hidden variable by using a preset attention model to obtain an ith group of attention numerical values; and carrying out weighted summation on the feature vector set according to the ith group of attention values to obtain the ith context vector.
The voice synthesis device transmits the ith intermediate hidden variable to the attention model, the attention model calculates an attention value (similarity) between a feature vector corresponding to each symbol in the symbol sequence and the ith intermediate hidden variable, each symbol and the attention value are correspondingly stored, the ith group of attention values are obtained, and the value range of the attention value is 0-1; and taking the attention value corresponding to each symbol as the weight of the feature vector corresponding to each symbol, and performing weighted summation on all feature vectors in the feature vector set to obtain the ith context vector.
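A minimal numerical sketch of this context-vector computation follows; the exact scoring function is not specified above, so a dot-product similarity followed by normalization is used purely for illustration.

```python
import numpy as np

def context_vector(feature_vectors: np.ndarray, hidden: np.ndarray):
    """feature_vectors: [seq_len, dim] (one per symbol); hidden: [dim].
    Returns (attention_values [seq_len], context [dim])."""
    scores = feature_vectors @ hidden          # similarity per symbol
    attention = np.exp(scores - scores.max())
    attention /= attention.sum()               # values in 0..1, used as weights
    context = attention @ feature_vectors      # weighted summation
    return attention, context
```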
It should be noted that, at the ith decoding time, the ith intermediate hidden variable is generated based on the (i-1)-th frame acoustic features and represents the symbol to be predicted at the ith decoding time. The attention model calculates the attention value between the feature vector corresponding to each symbol in the symbol sequence and the ith intermediate hidden variable; the magnitude of the attention value indicates how strongly the feature vector of each symbol is correlated with the symbol to be predicted. Besides the primary pronunciation symbol, the symbols to be predicted at the ith decoding time include secondary pronunciation symbols that are closely connected to the primary pronunciation symbol in pronunciation, so the attention values of several symbols in the symbol sequence are non-zero, and the symbol with the largest attention value is the primary pronunciation symbol.
In some embodiments, the speech synthesis apparatus determines an ith target symbol corresponding to the maximum attention value from the ith group of attention values after predicting the ith acoustic feature based on the (i-1) th frame acoustic feature, the preset decoding model, the feature vector set and the preset attention model and before continuing to perform the prediction process at the (i +1) th decoding time; when the ith target symbol is a non-ending symbol of the sound recording statement, determining the (i +1) th decoding time as the decoding time of the sound recording statement; and/or when the ith target symbol is a non-ending symbol of the query result statement, determining the (i +1) th decoding time as the decoding time of the query result statement; and/or when the ith target symbol is the end symbol of the sound recording statement and the end symbol of the sound recording statement is not the end symbol of the statement to be synthesized, determining the (i +1) th decoding time as the decoding time of the query result statement; and/or when the ith target symbol is the end symbol of the query result statement and the end symbol of the query result statement is not the end symbol of the statement to be synthesized, determining the (i +1) th decoding time as the decoding time of the sound recording statement; and/or when the ith target symbol is the end symbol of the sentence to be synthesized, determining the (i +1) th decoding time as the decoding end time of the sentence to be synthesized.
Before judging the type of the ith target symbol, the speech synthesis apparatus, when generating the symbol sequence of the sentence to be synthesized, determines the special symbols in the symbol sequence, the special symbols comprising at least one of the following: a non-end symbol of the recording sentence, a non-end symbol of the query result sentence, the end symbol of the recording sentence, the end symbol of the query result sentence, and the end symbol of the sentence to be synthesized. The symbol corresponding to the largest attention value in the ith group of attention values is taken as the ith target symbol, which is the primary pronunciation symbol at the ith decoding time; the ith target symbol is then compared with the special symbols in turn until its type is determined, as sketched below.
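A minimal sketch of this judgment, with the special-symbol sets passed in as illustrative parameters; the returned strings match the prediction-loop sketch given earlier.

```python
import numpy as np

def classify_next_decoding_time(attention_values, symbols,
                                recording_end, query_end, sentence_end,
                                recording_symbols, query_symbols):
    """Pick the i-th target symbol and decide the type of the (i+1)-th decoding time.
    The symbol sets and end markers are illustrative stand-ins for the 'special
    symbols' determined when the symbol sequence is generated."""
    target = symbols[int(np.argmax(attention_values))]   # i-th target symbol
    if target == sentence_end:
        return "end"          # decoding of the sentence to be synthesized ends
    if target == recording_end:
        return "query"        # switch to the query result sentence
    if target == query_end:
        return "recording"    # switch back to the recording sentence
    if target in recording_symbols:
        return "recording"    # still inside the recording sentence
    return "query"            # still inside the query result sentence
```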
It should be noted that, for i ≥ 2 (starting with i = 2), before predicting the ith acoustic feature based on the (i-1)-th frame acoustic features, the preset decoding model, the feature vector set and the preset attention model, the speech synthesis apparatus determines the type of the ith decoding time from the (i-1)-th target symbol, in the same manner as the type of the (i+1)-th decoding time is determined above.
In some embodiments, the speech synthesis apparatus determines, when generating a symbol sequence of the sentence to be synthesized, a start symbol and an end symbol of the sound recording sentence, a start symbol and an end symbol of the query result sentence, and an end symbol of the sentence to be synthesized from the symbol sequence before judging the type of the ith decoding time; the starting symbols and the ending symbols of the recording sentences are in one-to-one correspondence, the starting symbols and the ending symbols of the query result sentences are in one-to-one correspondence, the starting symbols of the recording sentences or the starting symbols of the query result sentences are the starting symbols of the sentences to be synthesized, and the ending symbols of the recording sentences or the ending symbols of the query result sentences are the ending symbols of the sentences to be synthesized.
Further, the speech synthesis apparatus takes the symbol corresponding to the largest attention value in the ith group of attention values as the ith target symbol and compares it in turn with the start symbol of the recording sentence, the start symbol of the sentence to be synthesized and the start symbol of the query result sentence. When the ith target symbol is the same as the start symbol of the recording sentence, the (i+1)-th decoding time is determined to be the decoding time of the recording sentence, and in the prediction process at the (i+1)-th decoding time the (i+1)-th target symbol is compared in turn with the end symbol of the recording sentence and the end symbol of the sentence to be synthesized; when the (i+1)-th target symbol differs from both the end symbol of the recording sentence and the end symbol of the sentence to be synthesized, it is determined to be a non-end symbol of the recording sentence, and the (i+2)-th decoding time is determined to be the decoding time of the recording sentence. When the ith target symbol is the same as the start symbol of the query result sentence, the (i+1)-th decoding time is determined to be the decoding time of the query result sentence, and in the prediction process at the (i+1)-th decoding time the (i+1)-th target symbol is compared in turn with the end symbol of the query result sentence and the end symbol of the sentence to be synthesized; when the (i+1)-th target symbol differs from both the end symbol of the query result sentence and the end symbol of the sentence to be synthesized, it is determined to be a non-end symbol of the query result sentence, and the (i+2)-th decoding time is determined to be the decoding time of the query result sentence.
In some embodiments, when the speech synthesis apparatus determines that the ith target symbol is the end symbol of the sound recording sentence but not the end symbol of the sentence to be synthesized, it increases the holding duration of the end symbol of the sound recording sentence by one frame decoding duration and judges whether the holding duration is less than a preset duration. When the holding duration of the ith target symbol is greater than or equal to the preset duration, the (i+1)th decoding time is determined to be a decoding time of the query result sentence; when it is less than the preset duration, the (i+1)th decoding time is determined to be a decoding time of the sound recording sentence and the (i+1)th target symbol continues to be judged, until at the mth decoding time the holding duration of the end symbol of the sound recording sentence is determined to be greater than or equal to the preset duration, whereupon the (m+1)th decoding time is determined to be a decoding time of the query result sentence, where m is the total number of frames of decoding times of the sound recording sentence and is an integer greater than 1. The preset duration is generally set to one or two frame decoding durations, which is not limited in the embodiments of the present invention.
In some embodiments, when the speech synthesis apparatus determines that the ith target symbol is the end symbol of the query result sentence but not the end symbol of the sentence to be synthesized, it increases the holding duration of the end symbol of the query result sentence by one frame decoding duration and judges whether the holding duration is less than the preset duration. When the holding duration of the ith target symbol is greater than or equal to the preset duration, the (i+1)th decoding time is determined to be a decoding time of the sound recording sentence; when it is less than the preset duration, the (i+1)th decoding time is determined to be a decoding time of the query result sentence and the (i+1)th target symbol continues to be judged, until at the hth decoding time the holding duration of the end symbol of the query result sentence is determined to be greater than or equal to the preset duration, whereupon the (h+1)th decoding time is determined to be a decoding time of the sound recording sentence, where h is the total number of frames of decoding times of the query result sentence and is an integer greater than 1.
It should be noted that the speech synthesis apparatus determines the type of the next decoding time by determining the target symbol at the current decoding time and comparing it with the special symbols in sequence, so that the type of each decoding time can be obtained without any special marking or symbol alignment operation on the sound recording sentence or the query result sentence in the sentence to be synthesized. Further, by judging the holding duration of the end symbol of one sentence and starting to decode the other sentence only when that duration is greater than or equal to the preset duration, the end symbol of the former sentence can be pronounced completely.
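For illustration only, the decision logic described above can be sketched as follows; the names SegmentType, special and hold_frames are illustrative and do not appear in the patent, and special is assumed to map each special symbol to its index in the symbol sequence:

```python
from enum import Enum

class SegmentType(Enum):
    RECORDING = 0      # decoding time of the sound recording sentence
    QUERY_RESULT = 1   # decoding time of the query result sentence
    END = 2            # decoding of the sentence to be synthesized is finished

def next_segment_type(target_idx, current_type, special, hold_frames, min_hold=1):
    """Decide the type of the (i+1)-th decoding time from the i-th target symbol.

    target_idx   : index of the symbol with the largest attention value at step i
    current_type : type of the i-th decoding time
    special      : indices of the special symbols in the symbol sequence
    hold_frames  : how many frames the current end symbol has been held so far
    min_hold     : preset duration, in frames (typically 1 or 2)
    """
    if target_idx == special["sentence_end"]:
        return SegmentType.END, 0
    if target_idx == special["recording_end"]:
        # keep decoding the recording sentence until its end symbol has been held
        # for at least the preset duration, then switch to the query result sentence
        hold_frames += 1
        if hold_frames >= min_hold:
            return SegmentType.QUERY_RESULT, 0
        return SegmentType.RECORDING, hold_frames
    if target_idx == special["query_end"]:
        hold_frames += 1
        if hold_frames >= min_hold:
            return SegmentType.RECORDING, 0
        return SegmentType.QUERY_RESULT, hold_frames
    if target_idx == special["recording_start"]:
        return SegmentType.RECORDING, 0
    if target_idx == special["query_start"]:
        return SegmentType.QUERY_RESULT, 0
    # a non-end symbol keeps the decoder in the current segment
    return current_type, 0
```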
In some embodiments, for the case where the sound recording sentence in the sentence to be synthesized precedes the query result sentence, the end symbol of the sound recording sentence may appear delayed in the obtained predicted acoustic features, which can be caused by the prediction process of the sound recording sentence and that of the query result sentence being connected too closely. Therefore, the symbol located before the end symbol of the sound recording sentence is set as its end symbol, which resolves the delay and makes the transition between the voice of the sound recording sentence and the voice of the query result sentence in the synthesized speech smoother.
S305, performing feature conversion and synthesis on the predicted acoustic features to obtain the voice corresponding to the sentence to be synthesized.
The speech synthesis apparatus performs feature conversion on each acoustic feature in the predicted acoustic features to obtain a linear spectrum, reconstructs and synthesizes all the obtained linear spectra to obtain the speech corresponding to the sentence to be synthesized, and transmits the speech to a playing module, through which the speech is played, so that the user obtains the query result for the target object by listening to the speech.
In some embodiments, the speech synthesis apparatus performs feature transformation on the predicted acoustic features to obtain a linear spectrum; and carrying out reconstruction synthesis on the linear spectrum to obtain the voice.
The speech synthesis apparatus may use the Griffin-Lim algorithm to reconstruct and synthesize the linear spectrum into the speech.
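As a hedged illustration only, the reconstruction step could be carried out with the Griffin-Lim routine available in librosa; the sampling rate, FFT size and iteration count below are placeholder values, not values specified in the patent:

```python
import numpy as np
import librosa
import soundfile as sf

def spectrum_to_speech(linear_spectrum, sr=22050, n_fft=1024, hop_length=256,
                       n_iter=60, out_path="synthesized.wav"):
    """Reconstruct a waveform from a predicted linear magnitude spectrogram.

    linear_spectrum: array of shape (1 + n_fft // 2, frames), magnitudes only.
    """
    # Griffin-Lim iteratively estimates the phase that was discarded
    # when only the magnitude spectrum was predicted.
    waveform = librosa.griffinlim(np.asarray(linear_spectrum),
                                  n_iter=n_iter,
                                  hop_length=hop_length,
                                  win_length=n_fft)
    sf.write(out_path, waveform, sr)
    return waveform
```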
It should be noted that, since the sound recording sentence is predicted by using the recording acoustic features extracted from a real-person recording, the part of the obtained speech corresponding to the sound recording sentence has better sound quality after the predicted acoustic features are subjected to feature conversion and synthesis.
In some embodiments, before step S301, the speech synthesis method further comprises:
S3001, obtaining a sample symbol sequence corresponding to each of at least one sample synthesis sentence, wherein each sample synthesis sentence represents a sample object and a reference query result for the sample object;
the speech synthesis device generates a sample symbol sequence aiming at each sample synthesis statement in at least one sample synthesis statement so as to obtain at least one sample symbol sequence; wherein the sample object in the at least one sample synthesis statement comprises a target object, and the at least one sample synthesis statement may further comprise a query result statement.
S3002, obtaining an initial speech synthesis model, initial acoustic features and sample acoustic features corresponding to the sample synthesis sentences; the initial speech synthesis model is a model for coding processing and prediction;
the voice synthesis device acquires an initial voice synthesis model, initial acoustic features and sample acoustic features corresponding to each sample synthesis statement; and obtaining the acoustic characteristics of the sample corresponding to each sample synthesized sentence from the sound recording of each sample synthesized sentence.
S3003, training the initial speech synthesis model by using the sample symbol sequence, the initial acoustic features and the sample acoustic features to obtain a preset coding model, a preset decoding model and a preset attention model.
The speech synthesis apparatus takes the sample symbol sequence as the input of the initial speech synthesis model, which encodes the sample symbol sequence to obtain a sample feature vector set; the initial acoustic features are then taken as the input of the initial speech synthesis model, which predicts reference acoustic features based on the sample feature vector set and the initial acoustic features; an error value is calculated from the reference acoustic features and the sample acoustic features by using a preset loss function; and when the error value is greater than a preset error threshold, prediction continues to be performed based on the sample feature vector set and the decoding model, until the error value is less than or equal to the preset error threshold, whereupon the preset coding model, the preset decoding model and the preset attention model are obtained.
In some embodiments, the preset loss function includes an absolute loss function (L1 Loss).
It should be noted that the process in which the initial speech synthesis model predicts the reference acoustic features based on the sample feature vector set and the initial acoustic features is the same as the process of predicting the ith acoustic feature by using the preset decoding model and the preset attention model with the (i-1)th frame acoustic feature and the feature vector set as input, and details are not repeated here.
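A minimal PyTorch-style sketch of one training step under the above description is given below; the model.encode/model.decode interface, the optimiser and the threshold value are assumptions for illustration, since the patent only requires that an L1 loss between the reference and sample acoustic features is driven below a preset error threshold:

```python
import torch
import torch.nn as nn

l1_loss = nn.L1Loss()

def train_step(model, optimizer, sample_symbol_ids, initial_frame, sample_mels,
               error_threshold=0.01):
    """One optimisation step on a single sample synthesis sentence.

    sample_symbol_ids : LongTensor (1, T_symbols), the sample symbol sequence
    initial_frame     : FloatTensor (1, n_mels), e.g. an all-zero first frame
    sample_mels       : FloatTensor (1, T_frames, n_mels), recorded acoustic features
    """
    optimizer.zero_grad()
    # encode the sample symbol sequence, then predict the reference acoustic features
    feature_vectors = model.encode(sample_symbol_ids)
    predicted_mels = model.decode(feature_vectors, initial_frame,
                                  target_length=sample_mels.size(1))
    loss = l1_loss(predicted_mels, sample_mels)
    loss.backward()
    optimizer.step()
    # training continues while the error value exceeds the preset threshold
    return loss.item(), loss.item() <= error_threshold
```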
In some embodiments, taking a telephone outbound system as an example, the speech synthesis method shown in fig. 4 includes:
S401, when the telephone outbound system receives a query request for leaving a voice message to a certain telephone number, namely 'NaZan two together watching TV bar', acquiring the phoneme sequence of 'NaZan two together watching TV bar', and determining, from the phoneme sequence, a start phoneme and an end phoneme of the sound recording sentence, a start phoneme and an end phoneme of the query result sentence, and an end phoneme of the sentence to be synthesized;
the telephone calling-out system determines that the recording statement of the target object is 'NaZan two together' and the query result statement is 'watching TV bar' from the query request of 'NaZan two together watching TV bar'; determining that the 'two watching TV bars together' conforms to the talk mode, taking the 'two watching TV bars together' as a sentence to be synthesized, and acquiring phoneme sequences of the sentence as { n, a4, ss, z, an2, i, ia3, ss, i4, q, i3, ss, k, an4, d, ian4, sh, iii4, ss, b, a5, ss, sil }; it is determined that the start phoneme and the end phoneme of the recording sentence are 'n' and 'q', respectively, the start phoneme and the end phoneme of the query result sentence are 'k' and 'b', respectively, and the end phoneme of the sentence to be synthesized is 'b' as is the end phoneme of the query result sentence.
It should be noted that 'ss' in the phoneme sequence is a symbol for controlling the prosody of the sentence to be synthesized; the symbol may also be another phoneme, a letter, or the like, and the phoneme sequence may or may not include this symbol, which is not limited in the embodiments of the present invention.
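For reference, the boundary phonemes chosen in S401 can be turned into indices into the phoneme sequence, which is how they are later matched against attention values; a minimal sketch, in which the dictionary keys are illustrative names only:

```python
# Phoneme sequence from S401 (prosody symbol 'ss' and trailing silence 'sil' included)
phonemes = ["n", "a4", "ss", "z", "an2", "i", "ia3", "ss", "i4", "q", "i3", "ss",
            "k", "an4", "d", "ian4", "sh", "iii4", "ss", "b", "a5", "ss", "sil"]

# Boundary phonemes as determined in S401
special_phonemes = {
    "recording_start": "n",
    "recording_end": "q",
    "query_start": "k",
    "query_end": "b",
    "sentence_end": "b",   # same as the end phoneme of the query result sentence
}

# Indices into the phoneme sequence, used later to interpret the attention values
special_indices = {name: phonemes.index(p) for name, p in special_phonemes.items()}
print(special_indices)
# {'recording_start': 0, 'recording_end': 9, 'query_start': 12, 'query_end': 19, 'sentence_end': 19}
```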
S402, the telephone outbound system encodes the phoneme sequence to obtain a feature vector set;
the telephone calling-out system obtains the feature vector corresponding to each phoneme in the phoneme sequence, and the feature vectors of all phonemes form feature vector combination.
S403, the telephone outbound system acquires an all-zero vector as the initial acoustic feature, and acquires the recorded Mel spectrum of 'NaZan two together' from a preset sound recording sentence library;
S404, the telephone outbound system predicts the predicted acoustic features corresponding to 'NaZan two together watching TV bar' based on the all-zero vector, the preset decoding model, the feature vector set, the preset attention model and the recording acoustic features;
Illustratively, fig. 5 shows the correspondence between the phoneme sequence and the attention values: the ordinate in fig. 5 is the phoneme sequence of 'NaZan two together watching TV bar', the abscissa is the decoding time, and the legend 51 on the right indicates the correspondence between attention value and color, where a lighter color indicates a larger attention value and 0.2, 0.4, 0.6 and 0.8 in the legend 51 are attention values. As can be seen from fig. 5, in the 12th group of attention values obtained at the 12th decoding time, the 12th target phoneme with the largest attention value is determined to be 'q', that is, the end phoneme of the sound recording sentence, so the 13th decoding time is a decoding time of the query result sentence.
S405, the telephone outbound system performs feature conversion and synthesis on the predicted acoustic features to obtain the speech corresponding to 'NaZan two together watching TV bar';
S406, the telephone outbound system dials the certain telephone number and plays the speech to the user after the call is answered.
It can be understood that the speech synthesis apparatus predicts the predicted acoustic features corresponding to the sentence to be synthesized based on the preset decoding model, the feature vector set, the preset attention model and the recording acoustic features; that is, for both the sound recording sentence and the query result sentence, the corresponding predicted acoustic features are obtained through prediction, and the predicted acoustic features are composed of a plurality of associated acoustic features, which resolves the differences in speech rate, tone and the like between the recorded speech and the synthesized speech, so that the speech obtained by using the predicted acoustic features has a consistent prosody. Secondly, the speech is obtained by performing feature conversion and synthesis on the predicted acoustic features corresponding to the sentence to be synthesized, which avoids the uncertainty in transition duration that exists when a recording is spliced with synthesized speech, thereby improving the quality of the synthesized speech.
Example two
Based on the same inventive concept as the first embodiment of the present invention, further description is provided below.
An embodiment of the present invention provides a speech synthesis apparatus 6, where the apparatus 6 includes: a sequence generation module 61, a speech synthesis module 62 and an acquisition module 63; wherein,
the sequence generating module 61 is configured to obtain a symbol sequence of a sentence to be synthesized, where the sentence to be synthesized includes a sound recording sentence representing a target object and a query result sentence for the target object;
the speech synthesis module 62 is configured to perform coding processing on the symbol sequence by using a preset coding model to obtain a feature vector set;
the acquisition module 63 is configured to acquire the recording acoustic feature corresponding to the sound recording sentence;
the speech synthesis module 62 is further configured to predict, based on a preset decoding model, a feature vector set, a preset attention model and a recording acoustic feature, an acoustic feature corresponding to a sentence to be synthesized, to obtain a predicted acoustic feature corresponding to the sentence to be synthesized, where the preset attention model is a model that generates a context vector for decoding by using the feature vector set, and the predicted acoustic feature is composed of at least one associated acoustic feature; and performing feature conversion and synthesis on the predicted acoustic features to obtain the voice corresponding to the sentence to be synthesized.
In some embodiments, the speech synthesis module 62 is specifically configured to, when i is equal to 1, obtain an initial acoustic feature at an ith decoding time, predict a 1 st acoustic feature based on the initial acoustic feature, a preset decoding model, a feature vector set, and a preset attention model, where i is an integer greater than 0;
under the condition that i is larger than 1, when the ith decoding time is the decoding time of the sound recording statement, extracting the acoustic feature of a jth frame from the sound recording acoustic feature, taking the acoustic feature of the jth frame as the acoustic feature of an i-1 th frame, and predicting the ith acoustic feature based on the acoustic feature of the i-1 th frame, a preset decoding model, a feature vector set and a preset attention model, wherein j is an integer larger than 0;
when the ith decoding moment is the decoding moment of the query result statement, taking one frame of acoustic features in the (i-1) th acoustic features as the (i-1) th frame of acoustic features, and predicting the ith acoustic features based on the (i-1) th frame of acoustic features, a preset decoding model, a feature vector set and a preset attention model;
continuing to execute the prediction process of the (i +1) th decoding moment until the decoding of the sentence to be synthesized is finished to obtain the nth acoustic feature, wherein n is the total frame number of the decoding moments of the sentence to be synthesized and is an integer greater than 1;
and taking the obtained ith acoustic feature to the nth acoustic feature as predicted acoustic features.
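A hedged sketch of this frame-selection loop is given below; the decode_step and get_segment_type callables and the string segment labels are illustrative assumptions, where decode_step stands for the preset decoding model plus the preset attention model:

```python
import torch

def predict_acoustic_features(decode_step, feature_vectors, recording_mels,
                              initial_frame, get_segment_type, max_frames=1000):
    """Run the decoder frame by frame.

    decode_step(prev_frame, feature_vectors) -> (frame_i, attention_i)
    get_segment_type(i) -> "recording", "query_result" or "end" for the i-th decoding time
    """
    predicted = []
    prev_frame = initial_frame   # i = 1: the all-zero initial acoustic feature
    j = 0                        # next frame to take from the recording acoustic features
    for i in range(1, max_frames + 1):
        segment = get_segment_type(i)
        if segment == "end":     # decoding of the sentence to be synthesized is finished
            break
        if i > 1 and segment == "recording" and j < recording_mels.size(0):
            # recording decoding time: the j-th recorded frame is used as the (i-1)-th frame
            prev_frame = recording_mels[j]
            j += 1
        # query-result decoding time: prev_frame stays the previously predicted frame
        frame_i, attention_i = decode_step(prev_frame, feature_vectors)
        # attention_i would also be used to decide the type of the next decoding time
        predicted.append(frame_i)
        prev_frame = frame_i     # becomes the (i-1)-th frame of the next step
    return torch.stack(predicted) if predicted else None
```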
In some embodiments, the preset decoding model comprises a first recurrent neural network and a second recurrent neural network;
the speech synthesis module 62 is specifically configured to perform nonlinear change on the acoustic features of the i-1 th frame to obtain an intermediate feature vector; performing matrix operation and nonlinear transformation on the intermediate eigenvector by using a first cyclic neural network to obtain an ith intermediate latent variable; performing context vector calculation on the characteristic vector set and the ith intermediate hidden variable by using a preset attention model to obtain an ith context vector; performing matrix operation and nonlinear transformation on the ith context vector and the ith intermediate hidden variable by using a second cyclic neural network to obtain an ith hidden variable; and according to the preset frame number, carrying out linear transformation on the ith hidden variable to obtain the ith acoustic feature.
In some embodiments, the set of feature vectors comprises a feature vector corresponding to each symbol in the sequence of symbols;
the speech synthesis module 62 is specifically configured to perform attention calculation on a feature vector and an ith intermediate hidden variable corresponding to each symbol in the symbol sequence by using a preset attention model to obtain an ith group of attention values; and according to the ith group of attention values, carrying out weighted summation on the feature vector set to obtain the ith context vector.
In some embodiments, the speech synthesis module 62 is further configured to determine an ith target symbol corresponding to the maximum attention value from the ith group of attention values after predicting the ith acoustic feature based on the (i-1) th frame acoustic feature, the preset decoding model, the feature vector set and the preset attention model and before continuing to perform the prediction process at the (i +1) th decoding time;
when the ith target symbol is a non-ending symbol of the sound recording statement, determining the (i +1) th decoding time as the decoding time of the sound recording statement;
and/or when the ith target symbol is a non-ending symbol of the query result statement, determining the (i +1) th decoding time as the decoding time of the query result statement;
and/or when the ith target symbol is the end symbol of the sound recording statement and the end symbol of the sound recording statement is not the end symbol of the statement to be synthesized, determining the (i +1) th decoding time as the decoding time of the query result statement;
and/or when the ith target symbol is the end symbol of the query result statement and the end symbol of the query result statement is not the end symbol of the statement to be synthesized, determining the (i +1) th decoding time as the decoding time of the sound recording statement;
and/or when the ith target symbol is the end symbol of the sentence to be synthesized, determining the (i +1) th decoding time as the decoding end time of the sentence to be synthesized.
In some embodiments, the speech synthesis module 62 is specifically configured to perform vector conversion on the symbol sequence to obtain an initial feature vector set; and carrying out nonlinear change and feature extraction on the initial feature vector set to obtain a feature vector set.
In some embodiments, the speech synthesis module 62 is specifically configured to perform feature transformation on the predicted acoustic features to obtain a linear spectrum; and carrying out reconstruction synthesis on the linear spectrum to obtain the voice.
In some embodiments, the sequence of symbols is a sequence of letters or a sequence of phonemes.
In some embodiments, the apparatus 6 further comprises: a training module 60;
the training module is used for acquiring a sample symbol sequence corresponding to at least one sample synthesis statement before acquiring the symbol sequence of the statement to be synthesized, wherein each sample synthesis statement represents a sample object and a reference query result aiming at the sample object; acquiring an initial voice synthesis model, initial acoustic features and sample acoustic features corresponding to sample synthesis statements; the initial speech synthesis model is a model for coding processing and prediction; and training the initial speech synthesis model by using the sample symbol sequence, the initial acoustic features and the sample acoustic features to obtain a preset coding model, a preset decoding model and a preset attention model.
In practical applications, the training module 60, the sequence generation module 61, the speech synthesis module 62 and the acquisition module 63 may be implemented by a processor 74 located on the speech synthesis apparatus 7, specifically by a Central Processing Unit (CPU), a Microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
An embodiment of the present invention further provides a speech synthesis apparatus 7, as shown in fig. 7, where the apparatus 7 includes: a processor 74, a memory 75 and a communication bus 76, the memory 75 communicating with the processor 74 via the communication bus 76, the memory 75 storing one or more speech synthesis programs executable by the processor 74, the one or more speech synthesis programs, when executed, causing the processor 74 to perform any of the speech synthesis methods as described in the previous embodiments.
In practical applications, the memory 75 may be a volatile memory, such as a Random-Access Memory (RAM); or a non-volatile memory, such as a Read-Only Memory (ROM), a flash memory, a Hard Disk Drive (HDD) or a Solid-State Drive (SSD); or a combination of the above kinds of memory, and provides programs and data to the processor 74.
An embodiment of the present invention provides a computer-readable storage medium storing a speech synthesis program which, when executed by the processor 74, causes the processor 74 to perform any of the speech synthesis methods described in the foregoing embodiments.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable speech synthesis apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable speech synthesis apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable speech synthesis apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable speech synthesis apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The methods disclosed in the several method embodiments provided by the present invention can be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided by the invention may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided by the present invention may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.
Claims (12)
1. A method of speech synthesis, the method comprising:
obtaining a symbol sequence of a sentence to be synthesized, wherein the sentence to be synthesized comprises a sound recording sentence representing a target object and a query result sentence aiming at the target object;
coding the symbol sequence by using a preset coding model to obtain a characteristic vector set;
acquiring a recording acoustic characteristic corresponding to the recording statement;
predicting the acoustic features corresponding to the sentence to be synthesized based on a preset decoding model, the feature vector set, a preset attention model and the acoustic features of the sound recording to obtain predicted acoustic features corresponding to the sentence to be synthesized, wherein the preset attention model is a model for generating a context vector for decoding by using the feature vector set, and the predicted acoustic features consist of at least one associated acoustic feature;
and performing feature conversion and synthesis on the predicted acoustic features to obtain the voice corresponding to the sentence to be synthesized.
2. The method of claim 1, wherein the predicting the acoustic features corresponding to the sentence to be synthesized based on a preset decoding model, the feature vector set, a preset attention model and the acoustic features of the sound recording to obtain the predicted acoustic features corresponding to the sentence to be synthesized comprises:
when i is equal to 1, acquiring initial acoustic features at the ith decoding moment, and predicting the 1 st acoustic feature based on the initial acoustic features, the preset decoding model, the feature vector set and the preset attention model, wherein i is an integer greater than 0;
under the condition that i is larger than 1, when the ith decoding time is the decoding time of the sound recording statement, taking the acoustic feature of a jth frame from the sound recording acoustic features, taking the acoustic feature of the jth frame as the acoustic feature of an i-1 th frame, and predicting the ith acoustic feature based on the acoustic feature of the i-1 th frame, the preset decoding model, the feature vector set and the preset attention model, wherein j is an integer larger than 0;
when the ith decoding time is the decoding time of the query result statement, taking one frame of acoustic features in the (i-1) th acoustic features as the (i-1) th frame of acoustic features, and predicting the ith acoustic features based on the (i-1) th frame of acoustic features, the preset decoding model, the feature vector set and the preset attention model;
continuing to execute the prediction process of the (i +1) th decoding moment until the decoding of the sentence to be synthesized is finished to obtain the nth acoustic feature, wherein n is the total frame number of the decoding moments of the sentence to be synthesized and is an integer greater than 1;
and taking the obtained ith acoustic feature to the nth acoustic feature as the predicted acoustic feature.
3. The method of claim 2, wherein the preset decoding model comprises a first recurrent neural network and a second recurrent neural network; predicting an ith acoustic feature based on the i-1 th frame acoustic feature, the preset decoding model, the feature vector set and the preset attention model, including:
carrying out nonlinear change on the acoustic features of the (i-1) th frame to obtain an intermediate feature vector;
performing matrix operation and nonlinear transformation on the intermediate eigenvector by using the first recurrent neural network to obtain an ith intermediate latent variable;
performing context vector calculation on the feature vector set and the ith intermediate hidden variable by using the preset attention model to obtain an ith context vector;
performing matrix operation and nonlinear transformation on the ith context vector and the ith intermediate hidden variable by using the second recurrent neural network to obtain an ith hidden variable;
and performing linear transformation on the ith hidden variable according to a preset frame number to obtain the ith acoustic feature.
4. The method of claim 3, wherein the set of feature vectors comprises a feature vector corresponding to each symbol in the sequence of symbols; the performing context vector calculation on the feature vector set and the ith intermediate hidden variable by using the preset attention model to obtain an ith context vector includes:
performing attention calculation on the feature vector corresponding to each symbol in the symbol sequence and the ith intermediate hidden variable by using the preset attention model to obtain an ith group of attention values;
and according to the ith group of attention values, carrying out weighted summation on the feature vector set to obtain the ith context vector.
5. The method according to claim 4, wherein after predicting the ith acoustic feature based on the i-1 th frame acoustic feature, the preset decoding model, the feature vector set and the preset attention model, the method further comprises, before continuing to perform the prediction process at the i +1 th decoding time:
determining the ith target symbol corresponding to the maximum attention value from the ith group of attention values;
when the ith target symbol is a non-ending symbol of the sound recording statement, determining that the (i +1) th decoding time is the decoding time of the sound recording statement;
and/or when the ith target symbol is a non-end symbol of the query result statement, determining that the (i +1) th decoding time is the decoding time of the query result statement;
and/or when the ith target symbol is the end symbol of the sound recording statement and the end symbol of the sound recording statement is not the end symbol of the statement to be synthesized, determining the (i +1) th decoding time as the decoding time of the query result statement;
and/or when the ith target symbol is the end symbol of the query result statement and the end symbol of the query result statement is not the end symbol of the statement to be synthesized, determining that the (i +1) th decoding time is the decoding time of the sound recording statement;
and/or when the ith target symbol is the end symbol of the sentence to be synthesized, determining the (i +1) th decoding time as the decoding end time of the sentence to be synthesized.
6. The method according to claim 1, wherein the encoding the symbol sequence by using a preset coding model to obtain a feature vector set comprises:
performing vector conversion on the symbol sequence by using the preset coding model to obtain an initial characteristic vector set;
and carrying out nonlinear change and feature extraction on the initial feature vector set to obtain the feature vector set.
7. The method according to claim 1, wherein the performing feature transformation and synthesis on the predicted acoustic features to obtain the speech corresponding to the sentence to be synthesized comprises:
performing feature conversion on the predicted acoustic features to obtain a linear spectrum;
and carrying out reconstruction synthesis on the linear spectrum to obtain the voice.
8. The method of claim 1, wherein the sequence of symbols is a sequence of letters or a sequence of phonemes.
9. The method of claim 1, wherein prior to said obtaining the sequence of symbols for the sentence to be synthesized, the method further comprises:
obtaining a sample symbol sequence corresponding to at least one sample synthesis statement, wherein each sample synthesis statement represents a sample object and a reference query result aiming at the sample object;
acquiring an initial voice synthesis model, initial acoustic features and sample acoustic features corresponding to the sample synthesis statements; the initial speech synthesis model is a model for encoding processing and prediction;
and training the initial speech synthesis model by using the sample symbol sequence, the initial acoustic features and the sample acoustic features to obtain the preset coding model, the preset decoding model and the preset attention model.
10. A speech synthesis apparatus, characterized in that the apparatus comprises: the device comprises a sequence generation module, a voice synthesis module and an acquisition module; wherein,
the sequence generation module is used for acquiring a symbol sequence of a sentence to be synthesized, wherein the sentence to be synthesized comprises a sound recording sentence representing a target object and a query result sentence aiming at the target object;
the voice synthesis module is used for coding the symbol sequence by using a preset coding model to obtain a characteristic vector set;
the acquisition module is used for acquiring the sound recording acoustic characteristics corresponding to the sound recording sentences;
the speech synthesis module is further configured to predict, based on a preset decoding model, the feature vector set, a preset attention model and the acoustic features of the sound recording, the acoustic features corresponding to the sentence to be synthesized, so as to obtain predicted acoustic features corresponding to the sentence to be synthesized, where the preset attention model is a model that uses the feature vector set to generate a context vector for decoding, and the predicted acoustic features are composed of at least one associated acoustic feature; and performing feature conversion and synthesis on the predicted acoustic features to obtain the voice corresponding to the sentence to be synthesized.
11. A speech synthesis apparatus, characterized in that the apparatus comprises: a processor, a memory and a communication bus, the memory in communication with the processor through the communication bus, the memory storing one or more programs executable by the processor, the one or more programs, when executed, causing the processor to perform the method of any of claims 1-9.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program which, when executed by at least one processor, causes the at least one processor to perform the method of any one of claims 1-9.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910878228.3A CN111816158B (en) | 2019-09-17 | 2019-09-17 | Speech synthesis method and device and storage medium |
US17/629,483 US20220270587A1 (en) | 2019-09-17 | 2020-03-18 | Speech synthesis method and apparatus, and storage medium |
PCT/CN2020/079930 WO2021051765A1 (en) | 2019-09-17 | 2020-03-18 | Speech synthesis method and apparatus, and storage medium |
KR1020227010595A KR102584299B1 (en) | 2019-09-17 | 2020-03-18 | Voice synthesis method and device, storage medium |
JP2022503851A JP7238204B2 (en) | 2019-09-17 | 2020-03-18 | Speech synthesis method and device, storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910878228.3A CN111816158B (en) | 2019-09-17 | 2019-09-17 | Speech synthesis method and device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111816158A true CN111816158A (en) | 2020-10-23 |
CN111816158B CN111816158B (en) | 2023-08-04 |
Family
ID=72844515
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910878228.3A Active CN111816158B (en) | 2019-09-17 | 2019-09-17 | Speech synthesis method and device and storage medium |
Country Status (5)
Country | Link |
---|---|
US (1) | US20220270587A1 (en) |
JP (1) | JP7238204B2 (en) |
KR (1) | KR102584299B1 (en) |
CN (1) | CN111816158B (en) |
WO (1) | WO2021051765A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117292694B (en) * | 2023-11-22 | 2024-02-27 | 中国科学院自动化研究所 | Time-invariant-coding-based few-token neural voice encoding and decoding method and system |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6795808B1 (en) * | 2000-10-30 | 2004-09-21 | Koninklijke Philips Electronics N.V. | User interface/entertainment device that simulates personal interaction and charges external database with relevant data |
JP2003295880A (en) * | 2002-03-28 | 2003-10-15 | Fujitsu Ltd | Speech synthesis system for connecting sound-recorded speech and synthesized speech together |
JP4414864B2 (en) * | 2004-11-08 | 2010-02-10 | 日本電信電話株式会社 | Recording / text-to-speech combined speech synthesizer, recording-editing / text-to-speech combined speech synthesis program, recording medium |
CN1945691A (en) * | 2006-10-16 | 2007-04-11 | 安徽中科大讯飞信息科技有限公司 | Method for improving template sentence synthetic effect in voice synthetic system |
JP4878538B2 (en) * | 2006-10-24 | 2012-02-15 | 株式会社日立製作所 | Speech synthesizer |
CN107871494B (en) * | 2016-09-23 | 2020-12-11 | 北京搜狗科技发展有限公司 | Voice synthesis method and device and electronic equipment |
US20190065486A1 (en) * | 2017-08-24 | 2019-02-28 | Microsoft Technology Licensing, Llc | Compression of word embeddings for natural language processing systems |
US10872596B2 (en) * | 2017-10-19 | 2020-12-22 | Baidu Usa Llc | Systems and methods for parallel wave generation in end-to-end text-to-speech |
JP6989951B2 (en) * | 2018-01-09 | 2022-01-12 | 国立大学法人 奈良先端科学技術大学院大学 | Speech chain device, computer program and DNN speech recognition / synthesis mutual learning method |
WO2019139430A1 (en) * | 2018-01-11 | 2019-07-18 | 네오사피엔스 주식회사 | Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium |
CN109036375B (en) * | 2018-07-25 | 2023-03-24 | 腾讯科技(深圳)有限公司 | Speech synthesis method, model training device and computer equipment |
CN112185337B (en) * | 2019-07-02 | 2024-04-26 | 微软技术许可有限责任公司 | Multilingual neural text-to-speech synthesis |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1256932A2 (en) * | 2001-05-11 | 2002-11-13 | Sony France S.A. | Method and apparatus for synthesising an emotion conveyed on a sound |
US20160365087A1 (en) * | 2015-06-12 | 2016-12-15 | Geulah Holdings Llc | High end speech synthesis |
CN105261355A (en) * | 2015-09-02 | 2016-01-20 | 百度在线网络技术(北京)有限公司 | Voice synthesis method and apparatus |
CN105355193A (en) * | 2015-10-30 | 2016-02-24 | 百度在线网络技术(北京)有限公司 | Speech synthesis method and device |
US20190122651A1 (en) * | 2017-10-19 | 2019-04-25 | Baidu Usa Llc | Systems and methods for neural text-to-speech using convolutional sequence learning |
CN109767755A (en) * | 2019-03-01 | 2019-05-17 | 广州多益网络股份有限公司 | A kind of phoneme synthesizing method and system |
CN110033755A (en) * | 2019-04-23 | 2019-07-19 | 平安科技(深圳)有限公司 | Phoneme synthesizing method, device, computer equipment and storage medium |
CN109979429A (en) * | 2019-05-29 | 2019-07-05 | 南京硅基智能科技有限公司 | A kind of method and system of TTS |
Non-Patent Citations (1)
Title |
---|
Cai Wenbin; Wei Yunlong; Xu Haihua; Pan Lin: "Target cost construction for a hybrid unit-selection speech synthesis system", Computer Engineering and Applications, no. 24 *
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022121179A1 (en) * | 2020-12-11 | 2022-06-16 | 平安科技(深圳)有限公司 | Speech synthesis method and apparatus, device, and storage medium |
WO2022135100A1 (en) * | 2020-12-23 | 2022-06-30 | 腾讯科技(深圳)有限公司 | Artificial intelligence-based audio signal generation method, apparatus, device, storage medium, and computer program product |
CN112786017A (en) * | 2020-12-25 | 2021-05-11 | 北京猿力未来科技有限公司 | Training method and device of speech rate detection model and speech rate detection method and device |
CN112786017B (en) * | 2020-12-25 | 2024-04-09 | 北京猿力未来科技有限公司 | Training method and device of speech speed detection model, and speech speed detection method and device |
CN112735377A (en) * | 2020-12-28 | 2021-04-30 | 平安科技(深圳)有限公司 | Speech synthesis method, device, terminal equipment and storage medium |
CN112735377B (en) * | 2020-12-28 | 2024-01-02 | 平安科技(深圳)有限公司 | Speech synthesis method, device, terminal equipment and storage medium |
CN112802444A (en) * | 2020-12-30 | 2021-05-14 | 科大讯飞股份有限公司 | Speech synthesis method, apparatus, device and storage medium |
WO2022141671A1 (en) * | 2020-12-30 | 2022-07-07 | 科大讯飞股份有限公司 | Speech synthesis method and apparatus, device, and storage medium |
CN112802450A (en) * | 2021-01-05 | 2021-05-14 | 杭州一知智能科技有限公司 | Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof |
CN112802450B (en) * | 2021-01-05 | 2022-11-18 | 杭州一知智能科技有限公司 | Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof |
CN113870838A (en) * | 2021-09-27 | 2021-12-31 | 平安科技(深圳)有限公司 | Voice synthesis method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
JP2022539914A (en) | 2022-09-13 |
CN111816158B (en) | 2023-08-04 |
JP7238204B2 (en) | 2023-03-13 |
KR102584299B1 (en) | 2023-09-27 |
US20220270587A1 (en) | 2022-08-25 |
WO2021051765A1 (en) | 2021-03-25 |
KR20220054655A (en) | 2022-05-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |