CN115359780A - Speech synthesis method, apparatus, computer device and storage medium - Google Patents

Speech synthesis method, apparatus, computer device and storage medium

Info

Publication number
CN115359780A
Authority
CN
China
Prior art keywords
vector
user
text
hidden
prosodic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210897499.5A
Other languages
Chinese (zh)
Inventor
张旭龙 (Zhang Xulong)
王健宗 (Wang Jianzong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210897499.5A
Publication of CN115359780A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a speech synthesis method, an apparatus, a computer device and a storage medium. The method comprises: processing a text sequence to obtain a text hidden vector; performing prosodic feature extraction on prosodic reference audio to obtain a prosody hidden vector; obtaining a user coding vector corresponding to a user identifier; synthesizing the text hidden vector, the prosody hidden vector and the user coding vector to obtain target acoustic features; and performing speech synthesis based on the target acoustic features to obtain a target audio file corresponding to the text sequence. The method ensures that the obtained target audio file is related not only to the text content of the text sequence but also to the prosodic style of the prosodic reference audio and the voice timbre of the user corresponding to the user identifier, which helps guarantee the speech synthesis effect of the target audio file and improves the naturalness of the synthesized speech.

Description

Speech synthesis method, speech synthesis device, computer equipment and storage medium
Technical Field
The present invention relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method and apparatus, a computer device, and a storage medium.
Background
With the development of computer technology and digital signal processing technology, speech synthesis has advanced rapidly. Current TTS technology is widely applied in information communication, broadcasting and other fields, and Tacotron is a representative end-to-end TTS generation model. The end-to-end approach synthesizes speech directly from character text and breaks down the barriers between traditional components: the model converts the text directly into acoustic features and a vocoder generates the corresponding audio file, and in some systems the text is input to the model to generate the audio file directly, skipping the intermediate vocoder stage. Existing speech synthesis generally performs only simple synthesis of the text content, so the speech synthesis effect is poor.
Disclosure of Invention
The embodiments of the invention provide a speech synthesis method, a speech synthesis apparatus, a computer device and a storage medium, aiming to solve the problem of poor speech synthesis effect.
A method of speech synthesis comprising:
processing the text sequence to obtain a text hidden vector;
performing prosodic feature extraction on the prosodic reference audio to acquire a prosodic latent vector;
acquiring a user coding vector corresponding to the user identifier;
synthesizing the text latent vector, the rhythm latent vector and the user coding vector by adopting an attention mechanism to obtain target acoustic features;
and performing voice synthesis based on the target acoustic features to obtain a target audio file corresponding to the text sequence.
A speech synthesis apparatus comprising:
the text hidden vector acquisition module is used for processing the text sequence to acquire a text hidden vector;
the rhythm hidden vector acquisition module is used for extracting rhythm characteristics of rhythm reference audio to acquire rhythm hidden vectors;
the user code vector acquisition module is used for acquiring a user code vector corresponding to the user identifier;
the target acoustic feature acquisition module is used for synthesizing the text hidden vector, the rhythm hidden vector and the user coding vector by adopting an attention mechanism to acquire target acoustic features;
and the target audio file acquisition module is used for carrying out voice synthesis based on the target acoustic characteristics to acquire a target audio file corresponding to the text sequence.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the above-mentioned speech synthesis method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the above-mentioned speech synthesis method.
The speech synthesis method, the speech synthesis device, the computer equipment and the storage medium determine the text hidden vector based on the text sequence, so that the text hidden vector not only contains text content, but also can be subjected to subsequent coding synthesis processing, and the feasibility of speech synthesis is guaranteed; determining a prosodic latent vector of the prosodic reference audio based on the prosodic reference audio so that the prosodic latent vector can learn the prosodic style in the prosodic reference audio; acquiring a user coding vector corresponding to the user identifier, so that the user coding vector can learn the voice tone of the user; and performing voice synthesis based on target acoustic features formed by synthesizing the text hidden vector, the rhythm hidden vector and the user coding vector, so that the obtained target audio file is not only related to text content corresponding to the text sequence, but also related to rhythm style in rhythm reference audio and user voice timbre corresponding to the user identification, thereby being beneficial to ensuring the voice synthesis effect of the target audio file and improving the naturalness of synthesized voice.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a diagram illustrating an application environment of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 3 is another flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 4 is another flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 5 is another flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 6 is another flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 7 is another flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 8 is another flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 9 is a diagram of a speech synthesis apparatus according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The speech synthesis method provided by the embodiment of the invention can be applied to the application environment shown in fig. 1. Specifically, the speech synthesis method is applied in a speech synthesis system, which includes a client and a server as shown in fig. 1, where the client and the server communicate through a network to implement multi-user speech synthesis. The client, also called the user side, is a program that corresponds to the server and provides local services for the user. The client may be installed on, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of multiple servers.
In an embodiment, as shown in fig. 2, a speech synthesis method is provided, which is described by taking the server in fig. 1 as an example, and includes the following steps:
S201: Processing the text sequence to obtain a text hidden vector;
S202: Performing prosodic feature extraction on the prosodic reference audio to obtain a prosody hidden vector;
S203: Acquiring a user coding vector corresponding to the user identifier;
S204: Synthesizing the text hidden vector, the prosody hidden vector and the user coding vector to obtain target acoustic features;
S205: Performing speech synthesis based on the target acoustic features to obtain a target audio file corresponding to the text sequence.
The text sequence is a sequence formed by text contents needing speech synthesis. The text hidden vector is a vector formed by vector conversion of a text sequence, and particularly is a hidden vector formed by encoding the text sequence by adopting an encoder network. A hidden vector refers to an intermediate vector output by a particular network.
As an example, in step S201, the server obtains a text sequence that needs to be subjected to speech synthesis, and performs parsing and vector conversion processing on the text sequence to obtain a corresponding text hidden vector. The text hidden vector can be understood as an intermediate vector related to the text content. For example, the server may analyze the text sequence to obtain a phoneme sequence, encode the phoneme sequence to obtain specific format data acceptable to the network, and then process the specific format data through the encoder network to obtain hidden layer feature representation, which is a text hidden vector. Understandably, the text hidden vector is a vector representation of a text sequence needing speech synthesis, and provides a text basis for subsequent speech synthesis.
Wherein the prosodic reference audio is preset audio that provides a prosodic style as a reference object. Prosody refers to cadence and rhythm; it can be understood as the tonal patterns and rhyming rules of speech, or as information about pauses and speaking rate. The prosody hidden vector is a hidden vector formed by extracting and encoding the prosodic style of the prosodic reference audio.
As an example, in step S202, the server may use the default prosodic reference audio or the prosodic reference audio selected by the user, and extract its prosodic style to obtain a prosody hidden vector. The prosody hidden vector refers to an intermediate vector associated with the prosodic style learned from the prosodic reference audio. In this example, when the server adopts the default prosodic reference audio, the prosody hidden vector extracted from it is a pre-extracted hidden vector; it may be a prosody hidden vector obtained by manual annotation, or a prosody hidden vector automatically extracted by a prosody encoder. For example, different prosodic reference audios may be labeled manually so that identical prosodic styles share the same value; or a prosody encoder may perform feature extraction on different prosodic reference audios, extract spectral features as prosodic style information, and encode the prosodic style information to obtain prosody hidden vectors. Understandably, the prosody hidden vector is a hidden vector related to prosody that is learned from the prosodic reference audio; it provides a prosodic basis for subsequent speech synthesis and ensures that the finally synthesized target audio file learns the prosodic style of the prosodic reference audio.
The user identifier is an identifier for distinguishing different users. A user coding vector is a vector that reflects the characteristics of a particular user's voice. As an example, the user coding vector may be derived from a simple number ID or from acoustic features, and may be set autonomously according to user requirements.
As an example, in step S203, the server may obtain a user coding vector corresponding to at least one user identifier, where the user coding vector can be understood as a vector formed by encoding audio data produced by the speech of the user corresponding to each user identifier. In this example, the user coding vector corresponding to the user identifier is determined so that the user coding vector reflects the voice timbre; this helps ensure that the target audio file formed by subsequent speech synthesis is related to the user corresponding to the user identifier, and improves the naturalness of the synthesized speech.
As an example, in step S204, after the server obtains the text hidden vector, the prosody hidden vector, and the user coding vector, since the text hidden vector is a vector formed by text content for performing speech synthesis, the prosody hidden vector is a vector formed by learning prosody style in prosody reference audio, and the user coding vector is a vector for learning speech timbre of different users, the text hidden vector, the prosody hidden vector, and the user coding vector may be fused to obtain a fused target vector, and the target vector is input to a decoder for decoding, so that the target acoustic feature may be obtained, so that the target acoustic feature is related to the text content, the prosody style, and the speech timbre of the user, which is helpful for ensuring the synthesis effect of speech synthesis.
In this example, the server may align the durations of the text hidden vector and the prosody hidden vector to ensure the consistency of the durations; performing first splicing and fusion on the text hidden vector and the rhythm hidden vector after the time length alignment to obtain a fusion hidden vector; secondly, splicing and fusing the fused implicit vector and the user coding vector for the second time to obtain a fused target vector; and finally, inputting the fused target vector into a decoder to obtain the target acoustic features.
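For illustration only, the two-stage splicing described above might look like the following sketch; all tensor names, dimensions and the stand-in decoder are assumptions rather than the patent's actual implementation, and the prosody hidden vector is assumed to be already length-aligned to the text.

```python
import torch
import torch.nn as nn

# Illustrative shapes: batch B, aligned length T, feature dim D (assumed values).
B, T, D, USER_D = 2, 50, 256, 128
text_hidden = torch.randn(B, T, D)      # from the phoneme encoder
prosody_hidden = torch.randn(B, T, D)   # already duration-aligned to length T
user_vector = torch.randn(B, USER_D)    # user coding vector (e.g. x-vector style)

# First fusion: splice text and prosody hidden vectors along the feature axis.
fused = torch.cat([text_hidden, prosody_hidden], dim=-1)          # (B, T, 2D)

# Second fusion: broadcast the user coding vector over time and splice it on.
user_expanded = user_vector.unsqueeze(1).expand(-1, T, -1)        # (B, T, USER_D)
target_vector = torch.cat([fused, user_expanded], dim=-1)         # (B, T, 2D + USER_D)

# Stand-in decoder producing, e.g., 80-bin mel-spectrogram frames as acoustic features.
decoder = nn.Sequential(nn.Linear(2 * D + USER_D, 512), nn.ReLU(), nn.Linear(512, 80))
target_acoustic = decoder(target_vector)                          # (B, T, 80)
```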
As an example, in step S205, after obtaining the target acoustic features, the server may use a preset vocoder to encode and synthesize the target acoustic features that fuse information such as text content, prosody style, and user voice timbre, and obtain the target audio file corresponding to the text sequence, so that the target audio file is not only related to the text content corresponding to the text sequence, but also related to the prosody style in the prosody reference audio and the user voice timbre corresponding to the user identifier, which is helpful to ensure the voice synthesis effect of the target audio file and improve the naturalness of the synthesized voice.
In the speech synthesis method provided by this embodiment, a text hidden vector is determined based on a text sequence, so that the text hidden vector contains text content, and subsequent encoding and synthesis processing can be performed, thereby ensuring the feasibility of speech synthesis; determining a prosodic latent vector of the prosodic reference audio based on the prosodic reference audio so that the prosodic latent vector can learn the prosodic style in the prosodic reference audio; acquiring a user coding vector corresponding to the user identifier, so that the user coding vector can learn the voice tone of the user; and performing voice synthesis on the basis of target acoustic features formed by synthesizing the text hidden vector, the rhythm hidden vector and the user coding vector, so that the obtained target audio file is not only related to text content corresponding to the text sequence, but also related to rhythm style in rhythm reference audio and user voice timbre corresponding to the user identification, the voice synthesis effect of the target audio file is favorably ensured, and the naturalness of synthesized voice is improved.
In an embodiment, as shown in fig. 3, step S201, namely, processing the text sequence to obtain a text hidden vector, includes:
S301: Analyzing the text sequence to obtain a phoneme sequence;
S302: Performing space vector conversion on the phoneme sequence to obtain a phoneme feature vector;
S303: Encoding the phoneme feature vector by adopting a phoneme encoder to obtain a text hidden vector.
The phoneme sequence is a sequence related to phonemes, specifically a sequence obtained by performing character-to-phoneme conversion on the text sequence. A phoneme (phone) is the smallest phonetic unit divided according to the natural attributes of speech; phonemes are analyzed according to the pronunciation actions within a syllable, and one action constitutes one phoneme.
As an example, in step S301, the server may parse the text sequence by using a preset model for text-to-phoneme conversion, including but not limited to a G2P (Grapheme-to-Phoneme) model, to obtain the phoneme sequence. In this example, the server may use a G2P model based on RNN and LSTM networks to convert the text sequence into the phoneme sequence.
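As a purely illustrative sketch of the text-to-phoneme step, one publicly available G2P implementation (the g2p_en package, an assumption; the patent only refers to G2P models based on RNN and LSTM in general) could be used as follows.

```python
# Sketch of grapheme-to-phoneme conversion; g2p_en is one open-source G2P
# implementation, used here purely as an illustration of the step above.
from g2p_en import G2p

g2p = G2p()
text_sequence = "Speech synthesis turns text into audio."
phoneme_sequence = g2p(text_sequence)   # list of phoneme symbols, e.g. ['S', 'P', 'IY1', 'CH', ...]
print(phoneme_sequence)
```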
Wherein, the phoneme feature vector represents each phoneme by a corresponding space vector.
As an example, in step S302, the server performs spatial embedding coding on the phoneme sequence, that is, each phoneme in the phoneme sequence is represented by a corresponding space vector to obtain the phoneme feature vector. In this example, the obtained phoneme feature vector may be a continuous variable or a discrete variable: a continuous variable may be represented in a floating-point data format, and a discrete variable may be represented as an integer vector or a floating-point data vector.
The phoneme encoder is a functional module for realizing phoneme encoding. As an example, the phoneme encoder may employ an encoder built from LSTM layers or transformer layers.
As an example, in step S303, the server may encode the phoneme feature vector by using a phoneme encoder, and determine the output of the phoneme encoder, i.e. the hidden layer features of the text sequence, as the text hidden vector. In this example, the server may employ a phoneme encoder built from transformer layers; the phoneme encoder is provided with four feedforward transformer layers, and the attention mechanism in the transformer layers is used to enhance the learning of temporal attention in the text sequence, thereby improving the accuracy of the text hidden vector obtained by encoding with the phoneme encoder.
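A minimal sketch of such a four-layer transformer phoneme encoder is given below; the vocabulary size, model dimension, head count and feedforward width are assumed values, not taken from the patent.

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Sketch of a phoneme encoder with four feedforward transformer layers (sizes assumed)."""
    def __init__(self, vocab_size=100, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)     # space-vector conversion of phonemes
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, phoneme_ids):                            # (B, T) integer phoneme ids
        x = self.embedding(phoneme_ids)                        # phoneme feature vectors (B, T, d_model)
        return self.encoder(x)                                 # text hidden vectors (B, T, d_model)

text_hidden = PhonemeEncoder()(torch.randint(0, 100, (2, 50)))
```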
In the speech synthesis method provided by this embodiment, the text sequence is first parsed to obtain a phoneme sequence formed by the smallest speech units (i.e., phonemes), providing a basis for speech synthesis; the phoneme sequence is then given spatial embedding coding to obtain a phoneme feature vector, so that the phoneme sequence is converted into a form on which encoding calculations can be performed, ensuring the feasibility of encoding; finally, a phoneme encoder is adopted to encode the phoneme feature vector to obtain the text hidden vector output by encoding the text sequence, providing a text basis for subsequent speech synthesis. Understandably, when the phoneme encoder adopts four feedforward transformer layers, the attention mechanism in the transformer layers enhances the learning of temporal attention in the text sequence and improves the accuracy of the text hidden vector obtained by encoding with the phoneme encoder.
In one embodiment, as shown in fig. 4, in step S202, performing prosodic feature extraction on the prosodic reference audio to obtain a prosodic latent vector, includes:
S401: Performing prosodic feature extraction on the prosodic reference audio to acquire a prosody style code;
S402: Encoding the prosody style code and the prosodic reference audio by adopting a prosody encoder to obtain a prosody feature vector;
S403: Performing duration alignment processing on the prosody feature vector by adopting a duration control module to obtain a prosody hidden vector.
Wherein the prosodic reference audio is preset to provide a prosodic style audio as a reference object. The prosodic style coding is coding after extracting prosodic features of prosodic reference audio.
As an example, in step S401, the server may adopt a default set prosody reference audio or a user-selected prosody reference audio to perform prosody feature extraction on the prosody reference audio, extract prosody style information related to a pronunciation manner of the voice from the prosody reference audio, encode the extracted prosody style information, and acquire prosody style encoding capable of performing subsequent model processing. The prosodic style information is a prosodic expression characteristic independent of text content. The prosody style coding is a coding result of coding the prosody style information. In this example, the server performs prosody feature extraction on the prosody reference audio by using an encoder-decoder manner, specifically, but not limited to, a Mel-GAN vocoder, to obtain prosody style coding.
Wherein, the Mel-GAN vocoder is a pre-trained vocoder, and the training process comprises the following steps: acquiring a training text and a training audio corresponding to the training text; carrying out spectrum extraction on the training audio to obtain a real spectrum of the training audio; generating audio according to the training text to obtain generated audio; carrying out spectrum extraction on the generated audio to obtain a generated spectrum corresponding to the generated audio; calculating a model loss value between the real frequency spectrum and the generated frequency spectrum by adopting a loss function; and if the model loss value is smaller than the preset value, determining that the Mel-GAN vocoder training is finished.
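The spectrum-comparison loss described in this training procedure might be sketched as follows; the STFT parameters, the L1 distance and the threshold value are assumptions, since the patent does not specify them, and a full Mel-GAN training setup would also include adversarial terms that are omitted here.

```python
import torch
import torch.nn.functional as F

def spectral_loss(real_audio, generated_audio, n_fft=1024, hop=256):
    """Sketch of the real-vs-generated spectrum loss described above (parameters assumed)."""
    window = torch.hann_window(n_fft)
    real_spec = torch.stft(real_audio, n_fft, hop, window=window, return_complex=True).abs()
    gen_spec = torch.stft(generated_audio, n_fft, hop, window=window, return_complex=True).abs()
    return F.l1_loss(gen_spec, real_spec)

# Training is considered finished once the loss drops below a preset value (value assumed).
PRESET_THRESHOLD = 0.05
loss = spectral_loss(torch.randn(2, 16000), torch.randn(2, 16000))
training_done = loss.item() < PRESET_THRESHOLD
```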
Wherein the prosody encoder is an encoder for implementing prosody encoding. As an example, the prosody encoder may employ a 2-layer two-dimensional convolutional network, which encodes output prosody feature vectors more accurately than a 1-layer two-dimensional convolutional network and has higher processing efficiency than a multi-layer two-dimensional convolutional network.
As an example, in step S402, the server may employ a prosody encoder to perform encoding processing on the input prosody style encoding and prosody reference audio, and specifically, may perform spectrum conversion on the prosody reference audio to obtain a prosody reference spectrum; processing the rhythm reference frequency spectrum by adopting a two-dimensional convolution network, and outputting frequency spectrum characteristic information; and splicing the frequency spectrum characteristic information and the prosody style codes or fusing the frequency spectrum characteristic information and the prosody style codes in other modes, and outputting prosody characteristic vectors expressed in a two-dimensional matrix form.
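A minimal sketch of such a 2-layer two-dimensional convolutional prosody encoder, with the prosody style code spliced onto the pooled spectral features, is shown below; all channel sizes and dimensions are assumed.

```python
import torch
import torch.nn as nn

class ProsodyEncoder(nn.Module):
    """Sketch of the 2-layer 2D-convolution prosody encoder (channel/dim values assumed)."""
    def __init__(self, style_dim=64, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(64 + style_dim, out_dim)

    def forward(self, ref_spectrum, style_code):
        # ref_spectrum: (B, 1, n_mels, frames) prosody reference spectrum; style_code: (B, style_dim)
        feat = self.conv(ref_spectrum)                          # spectral feature maps (B, 64, n_mels', frames')
        feat = feat.mean(dim=2).transpose(1, 2)                 # (B, frames', 64): two-dimensional matrix over time
        style = style_code.unsqueeze(1).expand(-1, feat.size(1), -1)
        fused = torch.cat([feat, style], dim=-1)                # splice spectral features with the style code
        return self.proj(fused)                                 # prosody feature vectors (B, frames', out_dim)

prosody_feat = ProsodyEncoder()(torch.randn(2, 1, 80, 200), torch.randn(2, 64))
```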
The duration control module is a module for realizing duration alignment processing.
As an example, in step S403, after acquiring the prosody feature vector represented in two-dimensional matrix form, the server performs duration alignment on the prosody feature vector by using the duration control module to acquire a duration-aligned prosody hidden vector. In this example, the prosody feature vector is a two-dimensional matrix reflecting the correspondence between prosodic phonemes and time, and when the duration control module performs duration alignment on the prosody feature vector, the column vectors can be expanded to obtain an expanded two-dimensional matrix as the prosody hidden vector. Generally, the duration of audio is measured as the size of the frame sequence; for example, framing 1 second of audio according to a frame block size of 20 ms yields a two-dimensional matrix of 500 rows, where the row dimension represents duration. After encoding based on the prosody style code and the prosodic reference audio, each row of the prosody feature vector in two-dimensional matrix form corresponds to multiple audio frames, so the rows of the prosody feature vector need to be copied according to the durations predicted by the duration control module, and the expanded matrix is output as the prosody hidden vector.
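The row-copying expansion performed by the duration control module can be illustrated as follows; the helper name, shapes and duration values are assumptions.

```python
import torch

def align_durations(prosody_feat, durations):
    """Sketch of the duration-control expansion: each prosody row is repeated
    according to its predicted frame count (helper and shapes are assumptions)."""
    # prosody_feat: (T, D) two-dimensional matrix; durations: (T,) integer frame counts.
    return torch.repeat_interleave(prosody_feat, durations, dim=0)   # (sum(durations), D)

feat = torch.randn(4, 256)
durs = torch.tensor([3, 1, 5, 2])
prosody_hidden = align_durations(feat, durs)    # 11 rows, aligned to the frame sequence
```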
In the speech synthesis method provided by this embodiment, prosodic feature extraction is first performed on the default prosodic reference audio or the prosodic reference audio selected by the user to obtain the prosody style code, providing a basis for speech synthesis; then a prosody encoder is adopted to encode the prosody style code and the prosodic reference audio to obtain the prosody feature vector, ensuring the feasibility of encoding; finally, a duration control module is adopted to perform duration alignment on the prosody feature vector to obtain the prosody hidden vector, providing a prosodic basis for subsequent speech synthesis. Understandably, the prosody encoder adopts a 2-layer two-dimensional convolutional network: compared with a 1-layer two-dimensional convolutional network its encoded prosody feature vector is more accurate, and compared with a multi-layer two-dimensional convolutional network it improves the efficiency of speech synthesis.
In an embodiment, as shown in fig. 5, in step S203, obtaining a user code vector corresponding to a user identifier includes:
S501: Querying an identification code table based on the user identifier, and judging whether the user identifier is a preset identifier in the identification code table;
S502: If the user identifier is a preset identifier, determining the preset coding vector corresponding to the preset identifier as the user coding vector corresponding to the user identifier;
S503: If the user identifier is not a preset identifier, acquiring user audio data corresponding to the user identifier, and determining the user coding vector corresponding to the user identifier based on the user audio data.
The identification coding table is an information table formed according to different preset identifications in the training set and corresponding preset coding vectors. The preset identifier is an identifier which is preset and stored in the identifier code table and is used for uniquely identifying a certain user.
As an example, in step S501, after acquiring the user identifier, the server may query a preset identifier code table based on the user identifier to determine whether the user identifier is a preset identifier in the identifier code table.
The preset coding vector is formed by performing feature extraction on preset audio data corresponding to the preset identifier, and reflects the speaking habit of the user corresponding to the preset identifier.
As an example, in step S502, when the user identifier is a preset identifier in the identification code table, this indicates that the preset audio data corresponding to the user identifier has already been processed before the current system time, and that the preset coding vector formed by encoding the preset audio data is stored in the identification code table in association with the preset identifier; therefore, the preset coding vector corresponding to the preset identifier can be directly determined as the user coding vector corresponding to the user identifier, which improves the efficiency of obtaining the user coding vector.
The user audio data corresponding to the user identifier is audio data formed by the speech of the user corresponding to that identifier.
As an example, in step S503, when the user identifier is not a preset identifier in the identification code table, this indicates that no corresponding preset coding vector exists in the identification code table; in this case, the user audio data corresponding to the user identifier needs to be obtained in real time, and the user audio data is then encoded to determine the user coding vector corresponding to the user identifier, thereby ensuring that the user coding vector can be obtained in real time.
In this example, a user coding module is provided in the server; the user coding module is a coding module for distinguishing speech synthesis of different users. The user coding module can work in several ways. The simplest is to use a numeric ID as the user identifier to distinguish different users, and to complete user coding through an embedding layer to obtain the user coding vector. The user coding module can also adopt acoustic features, such as x-vectors or d-vectors, to distinguish the voice timbres of different users and obtain the user coding vector, so that the user coding vector reflects the voice timbre; this ensures that the target audio file formed by subsequent speech synthesis is related to the user corresponding to the user identifier and improves the naturalness of the synthesized speech.
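The simplest option described above, a numeric ID looked up in an embedding layer, can be sketched as follows; the number of users and the embedding size are assumed values.

```python
import torch
import torch.nn as nn

# Sketch of the simplest user-coding option: a numeric user ID looked up in an
# embedding layer (number of users and embedding size are assumed values).
NUM_USERS, USER_DIM = 1000, 128
user_embedding = nn.Embedding(NUM_USERS, USER_DIM)

user_id = torch.tensor([42])                 # user identifier as a number ID
user_code_vector = user_embedding(user_id)   # (1, USER_DIM)

# The acoustic-feature alternative (x-vector / d-vector) replaces this lookup with a
# speaker-encoder network run on the user's audio, as described in the next embodiment.
```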
In the speech synthesis method provided in this embodiment, according to whether the user identifier is a preset identifier in the identifier coding table, when the user identifier is the preset identifier, the preset coding vector corresponding to the preset identifier may be directly determined as the user coding vector corresponding to the user identifier, which may improve the efficiency of obtaining the user coding vector; when the user identifier is not the preset identifier, the user audio data can be encoded to obtain the user code vector, so that the real-time property of obtaining the user code vector is guaranteed. Understandably, the user coding vector corresponding to the user identification is determined, so that the user coding vector can reflect the tone of the voice, the synthesis effect of a target audio file formed by subsequent voice synthesis can be guaranteed, the user corresponding to the user identification is related, and the naturalness of the voice synthesis is improved.
In one embodiment, as shown in fig. 6, the step S503 of determining the user code vector corresponding to the user identifier based on the user audio data includes:
S601: Performing feature extraction on the user audio data to obtain a first spectral feature;
S602: Segmenting the first spectral feature to obtain N second spectral features;
S603: Sequentially inputting the N second spectral features into a first convolutional neural network for processing to obtain first hidden vectors corresponding to the N second spectral features;
S604: Performing mean and variance calculation on the first hidden vectors corresponding to the N second spectral features to determine a hidden vector mean and a hidden vector variance corresponding to the N second spectral features;
S605: Splicing the hidden vector mean and the hidden vector variance corresponding to the N second spectral features to obtain a second hidden vector;
S606: Inputting the second hidden vector into a second convolutional neural network for processing to obtain a user coding vector corresponding to the user identifier.
The first spectral feature is a spectral feature obtained by extracting user audio data.
As an example, in step S601, the server obtains user audio data corresponding to the user identifier in real time, performs feature extraction on the user audio data, specifically, may extract a spectral feature corresponding to the user audio data, and uses the spectral feature as a first spectral feature to provide a basis for generating a subsequent coding vector.
The second spectral feature is a spectral feature obtained by dividing the first spectral feature.
As an example, in step S602, the server divides the extracted first spectral feature according to a preset spectral feature division policy, and uses each divided segment of spectral feature as a second spectral feature. The spectrum feature segmentation strategy is a preset strategy for segmenting spectrum features, and specifically can be segmentation according to a fixed time length or a user-defined segmentation standard.
Wherein the first convolutional neural network is a convolutional neural network for processing the second spectral features. In this example, the first convolutional neural network is a DNN (Deep Neural Network); the neural network layers inside a DNN can be divided into three types, namely an input layer, hidden layers and an output layer, and the hidden layers in the middle may comprise multiple layers. The first hidden vector is the output obtained by passing a second spectral feature through this convolutional neural network.
As an example, in step S603, the server sequentially inputs each of the segmented second spectral features into the first convolutional neural network, which may specifically be a network composed of 9 fully connected layers, and determines the output of the network as the first hidden vector corresponding to that second spectral feature. The first hidden vector can be understood as an intermediate variable output after the convolutional neural network processes the second spectral feature.
As an example, in step S604, after obtaining the first hidden vectors corresponding to the N second spectral features, the server may use a mean calculation formula and a variance calculation formula to process the first hidden vectors corresponding to the N second spectral features, so as to determine the hidden vector mean and the hidden vector variance corresponding to the N second spectral features.
As an example, in step S605, after obtaining the hidden vector mean and the hidden vector variance corresponding to the N second spectral features, the server may adopt a specific splicing strategy, or follow a preset splicing order, to splice the hidden vector mean and the hidden vector variance corresponding to the N second spectral features, and determine the splicing result as the second hidden vector.
Wherein the second convolutional neural network is a convolutional neural network for processing the second hidden vector. The user coding vector is a voiceprint recognition vector (x-vector), which can accept input of any length and convert it into a fixed-length feature representation.
As an example, in step S606, the server inputs the second hidden vector to the second convolutional neural network for processing, specifically to a 4-layer second convolutional neural network, to obtain a voiceprint recognition vector X-vector, and uses the voiceprint recognition vector X-vector as a user coding vector corresponding to the user identifier.
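A condensed sketch of this segment-wise network plus statistics-pooling pipeline is given below; the layer widths are assumed, and the 9-layer and 4-layer networks described in the text are shortened for brevity.

```python
import torch
import torch.nn as nn

class XVectorSketch(nn.Module):
    """Sketch of the segment-wise DNN + statistics-pooling pipeline above (all sizes assumed)."""
    def __init__(self, spec_dim=80, hidden=512, out_dim=256):
        super().__init__()
        # First network: processes each second spectral feature (9 fully connected layers
        # in the text, shortened here).
        self.frame_net = nn.Sequential(nn.Linear(spec_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden), nn.ReLU())
        # Second network: maps the pooled statistics to the user coding vector (4 layers
        # in the text, shortened here).
        self.segment_net = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                         nn.Linear(hidden, out_dim))

    def forward(self, segments):
        # segments: (N, spec_dim), one row per second spectral feature.
        first_hidden = self.frame_net(segments)                 # first hidden vectors (N, hidden)
        mean = first_hidden.mean(dim=0)                         # hidden-vector mean
        var = first_hidden.var(dim=0)                           # hidden-vector variance
        second_hidden = torch.cat([mean, var], dim=-1)          # spliced second hidden vector
        return self.segment_net(second_hidden)                  # user coding vector (x-vector style)

x_vector = XVectorSketch()(torch.randn(10, 80))
```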
In the speech synthesis method provided by this embodiment, feature extraction is performed on the user audio data to obtain the first spectral feature, which provides a basis for generating the coding vector; the first spectral feature is segmented so that each second spectral feature carries a small amount of information, which facilitates subsequent processing and ensures the feasibility of code generation; the second spectral features are then input into the first convolutional neural network, and the output first hidden vectors are subjected to mean and variance calculation and splicing to obtain the second hidden vector, providing a basis for subsequent coding vector synthesis; finally, the second hidden vector is passed through the second convolutional neural network to obtain the user coding vector. The generated user coding vector is a voiceprint recognition vector (x-vector), which can accept input of any length and convert it into a fixed-length feature representation; a data augmentation strategy involving noise and reverberation is introduced during the training of the convolutional neural networks, so that the model is more robust to interference such as noise and reverberation.
In an embodiment, as shown in fig. 7, in step S204, synthesizing the text hidden vector, the prosody hidden vector and the user encoding vector to obtain the target acoustic feature includes:
S701: Processing the text hidden vector and the prosody hidden vector by adopting an attention mechanism to obtain a fusion hidden vector;
S702: Synthesizing the fusion hidden vector and the user coding vector to obtain the target acoustic features.
The attention mechanism is cross attention, and it is understood that a weight is calculated based on the similarity to perform a weighted average. The fusion hidden vector is a vector synthesized by a text hidden vector and a prosody hidden vector through an attention mechanism.
As an example, in step S701, the server may adopt an attention mechanism such as cross attention to process the text hidden vector and the prosody hidden vector: the text hidden vector serves as the query of the attention mechanism and the prosody hidden vector serves as the key, an attention score is calculated, and the result of the attention calculation is determined as the fusion hidden vector.
As an example, in step S702, the server inputs the fused hidden vector and the user encoding vector to a decoder for decoding, so as to determine the output result of the decoder as the target acoustic feature.
In the speech synthesis method provided by this embodiment, an attention mechanism is first used to synthesize a text latent vector and a prosody latent vector, so as to obtain a fusion latent vector, which is prepared for obtaining target acoustic features. And finally, inputting the fusion vector and the user coding vector into a decoder for decoding to obtain target acoustic features, so that the obtained target acoustic features are not only related to the text hidden vector and the rhythm hidden vector, but also related to the user coding vector, the voice synthesis effect of the target audio file is favorably ensured, and the naturalness of the synthesized voice is improved.
In an embodiment, as shown in fig. 8, in step S701, processing the text hidden vector and the prosody hidden vector by using an attention mechanism to obtain a fused hidden vector, includes:
S801: Performing similarity calculation on the text hidden vector and the prosody hidden vector by adopting an attention mechanism to obtain a vector similarity;
S802: Normalizing the vector similarity by adopting a softmax layer to obtain a prosody weight value;
S803: Performing weighting processing based on the text hidden vector and the prosody weight value to obtain a fusion hidden vector.
The vector similarity is obtained by performing a vector similarity calculation on the text latent vector and the prosody latent vector. The Softmax layer is an activation function that normalizes a vector of values into a probability distribution vector.
As an example, in step S801, the server performs similarity calculation on the text hidden vector and the prosody hidden vector by using an attention mechanism, and obtains a vector similarity F (query, key), where query is the text hidden vector, key is the prosody hidden vector, and F is the similarity calculation between the text hidden vector and the prosody hidden vector, so as to obtain a vector similarity. The similarity calculation can be completed by using dot product, and can also be realized by using a full connection layer.
As an example, in step S802, after calculating the vector similarity F (query, key) corresponding to the text hidden vector and the prosody hidden vector, the server may perform normalization processing on the vector similarity F (query, key) by using a Softmax layer to normalize the vector similarity F (query, key) to a numerical value between 0 and 1, and determine the numerical value as a prosody weight value, i.e., softmax (F (query, key)).
As an example, in step S803, the server further performs a weighted summation over the text hidden vector α and the prosody weight value Softmax(F(query, key)), that is, Σ α × Softmax(F(query, key)), to obtain the fusion hidden vector. In this example, the text hidden vector and the prosody hidden vector need to be continuously optimized through training.
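A minimal sketch of this attention fusion is shown below; it uses a standard dot-product cross attention in which the prosody hidden vectors serve as the values and a residual connection produces the fused output, which is one possible reading of the formula above rather than the patent's definitive implementation, and all shapes are assumed.

```python
import torch

def prosody_cross_attention(text_hidden, prosody_hidden):
    """Sketch of the attention fusion: similarity F(query, key) via dot product,
    softmax normalisation into prosody weights, then a weighted average (shapes assumed)."""
    # text_hidden (query): (B, T_text, D); prosody_hidden (key/value): (B, T_pros, D)
    similarity = torch.matmul(text_hidden, prosody_hidden.transpose(1, 2))  # F(query, key)
    weights = torch.softmax(similarity, dim=-1)                             # prosody weight values
    attended = torch.matmul(weights, prosody_hidden)                        # weighted average
    return text_hidden + attended                                           # fused hidden vector

fused = prosody_cross_attention(torch.randn(2, 50, 256), torch.randn(2, 120, 256))
```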
In the speech synthesis method provided by this embodiment, the server performs similarity calculation on the text hidden vector and the prosody hidden vector by adopting an attention mechanism, which ensures the feasibility of the scheme; the vector similarity is then normalized by a softmax layer to obtain the prosody weight value, and finally the prosody weight value and the text hidden vector are weighted and summed to obtain the fusion hidden vector, which serves as the basis for synthesis with the user coding vector and prepares for enhanced prosodic style control in multi-user speech synthesis.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In one embodiment, a speech synthesis apparatus is provided, and the speech synthesis apparatus corresponds to the speech synthesis method in the above embodiment one to one. As shown in fig. 9, the speech synthesis apparatus includes a text hidden vector acquisition module 901, a prosodic hidden vector acquisition module 902, a user coding vector acquisition module 903, a target acoustic feature acquisition module 904, and a target audio file acquisition module 905. The functional modules are explained in detail as follows:
a text hidden vector obtaining module 901, configured to process a text sequence to obtain a text hidden vector;
a prosodic latent vector acquisition module 902, configured to perform prosodic feature extraction on the prosodic reference audio to acquire a prosodic latent vector;
a user code vector obtaining module 903, configured to obtain a user code vector corresponding to the user identifier;
a target acoustic feature obtaining module 904, configured to synthesize the text hidden vector, the prosody hidden vector, and the user coding vector by using an attention mechanism, so as to obtain a target acoustic feature;
and a target audio file obtaining module 905, configured to perform speech synthesis based on the target acoustic features, and obtain a target audio file corresponding to the text sequence.
In an embodiment, the text hidden vector obtaining module 901 includes:
a phoneme sequence obtaining unit, configured to analyze the text sequence to obtain a phoneme sequence;
the phoneme feature vector acquisition unit is used for carrying out space vector conversion on the phoneme sequence to acquire a phoneme feature vector;
and the text hidden vector acquisition unit is used for coding the phoneme feature vector to acquire a text hidden vector.
In one embodiment, the prosodic hidden vector obtaining module 902 includes:
the prosodic style coding acquisition unit is used for extracting prosodic features of the prosodic reference audio to acquire prosodic style codes;
the prosodic feature vector acquiring unit is used for encoding the prosodic style encoding and the prosodic reference audio to acquire prosodic feature vectors;
and the rhythm hidden vector acquisition unit is used for carrying out time length alignment processing on the rhythm feature vector to acquire a rhythm hidden vector.
In one embodiment, the user code vector obtaining module 903 includes:
the preset identification judging unit is used for inquiring the identification coding table based on the user identification and judging whether the user identification is a preset identification in the identification coding table;
the first user code vector determining unit is used for determining a preset code vector corresponding to a preset identifier as a user code vector corresponding to the user identifier if the user identifier is the preset identifier;
and the second user code vector determining unit is used for acquiring user audio data corresponding to the user identifier if the user identifier is not the preset identifier, and determining the user code vector corresponding to the user identifier based on the user audio data.
In an embodiment, the second user code vector determination unit includes:
the first spectral feature acquisition subunit is used for performing feature extraction on the user audio data to acquire a first spectral feature;
the second spectrum characteristic obtaining subunit is used for segmenting the first spectrum characteristic to obtain N second spectrum characteristics;
the first implicit vector acquisition subunit is used for outputting the N second spectral features to the first convolution neural network in sequence for processing to acquire first implicit vectors corresponding to the N second spectral features;
the mean variance determining subunit is configured to perform mean and variance calculation on the first hidden vectors corresponding to the N second spectral features, and determine hidden vector means and hidden vector variances corresponding to the N second spectral features;
the second implicit vector obtaining subunit is configured to splice implicit vector means and implicit vector variances corresponding to the N second spectral features to obtain a second implicit vector;
and the user coding vector acquisition subunit is used for inputting the second hidden vector into a second convolutional neural network for processing, and acquiring a user coding vector corresponding to the user identifier.
In an embodiment, the target acoustic feature obtaining module 904 includes:
the fusion hidden vector acquisition unit is used for processing the text hidden vector and the rhythm hidden vector by adopting an attention mechanism to acquire a fusion hidden vector;
and the target acoustic feature acquisition unit is used for synthesizing the fusion hidden vector and the user coding vector to acquire the target acoustic feature.
In one embodiment, the fused implicit vector obtaining unit includes:
the vector similarity obtaining subunit is configured to perform similarity calculation on the text latent vector and the prosody latent vector by using an attention mechanism to obtain vector similarity;
a rhythm weight value obtaining subunit, configured to perform normalization processing on the vector similarity to obtain a rhythm weight value;
and the fusion hidden vector obtaining subunit is used for performing weighting processing on the basis of the text hidden vector and the prosody weight value to obtain a fusion hidden vector.
For the specific limitations of the speech synthesis apparatus, reference may be made to the above limitations of the speech synthesis method, which are not described herein again. The various modules in the speech synthesis apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data used or generated during the execution of the speech synthesis method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement the speech synthesis method.
In an embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the speech synthesis method in the foregoing embodiments is implemented, for example, as shown in S201-S205 in fig. 2, or as shown in fig. 3 to fig. 8, which is not described herein again to avoid repetition. Alternatively, when executing the computer program, the processor implements functions of each module/unit in the embodiment of the speech synthesis apparatus, for example, functions of the text hidden vector acquisition module 901, the prosody hidden vector acquisition module 902, the user coding vector acquisition module 903, the target acoustic feature acquisition module 904, and the target audio file acquisition module 905 shown in fig. 9, which are not described herein again to avoid repetition.
In an embodiment, a computer-readable storage medium is provided, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the speech synthesis method in the foregoing embodiments is implemented, for example, S201 to S205 shown in fig. 2, or shown in fig. 3 to fig. 8, which are not described herein again to avoid repetition. Alternatively, when being executed by a processor, the computer program implements the functions of the modules/units in the embodiment of the speech synthesis apparatus, for example, the functions of the text hidden vector acquisition module 901, the prosody hidden vector acquisition module 902, the user coding vector acquisition module 903, the target acoustic feature acquisition module 904, and the target audio file acquisition module 905 shown in fig. 9, and are not described herein again to avoid repetition. The computer readable storage medium may be non-volatile or volatile.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein.

Claims (10)

1. A speech synthesis method, comprising:
processing the text sequence to obtain a text hidden vector;
performing prosodic feature extraction on the prosodic reference audio to acquire a prosodic latent vector;
acquiring a user coding vector corresponding to a user identifier;
synthesizing the text hidden vector, the rhythm hidden vector and the user coding vector to obtain target acoustic features;
and performing voice synthesis based on the target acoustic features to obtain a target audio file corresponding to the text sequence.
2. The speech synthesis method of claim 1, wherein the processing the text sequence to obtain text hidden vectors comprises:
analyzing the text sequence to obtain a phoneme sequence;
performing space vector conversion on the phoneme sequence to obtain a phoneme feature vector;
and coding the phoneme feature vector to obtain a text hidden vector.
3. The speech synthesis method of claim 1, wherein the performing prosodic feature extraction on the prosodic reference audio to acquire a prosody hidden vector comprises:
performing prosodic feature extraction on the prosodic reference audio to acquire a prosodic style code;
encoding the prosodic style code and the prosodic reference audio to obtain a prosody feature vector;
and performing duration alignment processing on the prosody feature vector to obtain the prosody hidden vector.
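
A rough sketch of claim 3 above under assumed shapes: mean pooling stands in for the prosodic style coding, an additive combination stands in for the joint encoding, and nearest-frame resampling stands in for the duration alignment. The actual models are not specified here.

    import numpy as np

    def prosody_hidden(reference_frames, text_len):
        style_code = reference_frames.mean(axis=0)        # prosodic style code (placeholder)
        feature_vectors = reference_frames + style_code   # joint encoding (placeholder)
        # duration alignment: resample the frame axis onto the text positions
        idx = np.linspace(0, len(feature_vectors) - 1, text_len).round().astype(int)
        return feature_vectors[idx]

    frames = np.random.randn(120, 80)               # 120 frames of assumed 80-dim features
    aligned = prosody_hidden(frames, text_len=8)    # shape: (8, 80)
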
4. The speech synthesis method of claim 1, wherein the acquiring a user coding vector corresponding to a user identifier comprises:
querying an identification code table based on the user identifier, and determining whether the user identifier is a preset identifier in the identification code table;
if the user identifier is the preset identifier, determining a preset coding vector corresponding to the preset identifier as the user coding vector corresponding to the user identifier;
and if the user identifier is not the preset identifier, acquiring user audio data corresponding to the user identifier, and determining the user coding vector corresponding to the user identifier based on the user audio data.
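
A hypothetical illustration of the branch logic in claim 4 above; the table contents, helper names, and the fallback derivation are invented for demonstration (a fuller derivation, corresponding to claim 5, is sketched further below).

    import numpy as np

    # Assumed identification code table: user identifier -> preset coding vector.
    ID_CODE_TABLE = {"user_001": np.ones(128), "user_002": np.zeros(128)}

    def derive_code_from_audio(audio):
        # Placeholder derivation; the claim-5 sketch below shows a fuller version.
        return np.resize(audio, 128)

    def user_code_vector(user_id, load_user_audio):
        if user_id in ID_CODE_TABLE:                       # preset identifier
            return ID_CODE_TABLE[user_id]
        return derive_code_from_audio(load_user_audio(user_id))  # fallback via user audio

    vec = user_code_vector("user_003", lambda uid: np.random.randn(16000))
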
5. The speech synthesis method of claim 4, wherein the determining the user coding vector corresponding to the user identifier based on the user audio data comprises:
performing feature extraction on the user audio data to acquire a first spectrum feature;
dividing the first spectrum feature to obtain N second spectrum features;
sequentially inputting the N second spectrum features into a first convolutional neural network for processing to obtain first hidden vectors corresponding to the N second spectrum features;
calculating the mean and the variance of the first hidden vectors corresponding to the N second spectrum features to obtain a hidden vector mean and a hidden vector variance corresponding to the N second spectrum features;
concatenating the hidden vector mean and the hidden vector variance corresponding to the N second spectrum features to obtain a second hidden vector;
and inputting the second hidden vector into a second convolutional neural network for processing to obtain a user coding vector corresponding to the user identifier.
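
A sketch of the claim 5 sequence using PyTorch as an assumed framework; the layer sizes, segment count N, and pooling choices are illustrative guesses, not the networks described in the application.

    import torch
    import torch.nn as nn

    # Assumed "first" and "second" convolutional networks; sizes are arbitrary.
    first_cnn = nn.Sequential(
        nn.Conv1d(80, 64, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool1d(1),          # pool each segment down to one hidden vector
    )
    second_cnn = nn.Sequential(nn.Conv1d(128, 128, kernel_size=1), nn.ReLU())

    def user_coding_vector(first_spectrum, n_segments=4):
        # divide the first spectrum feature (frames x 80 bins) into N segments
        segments = torch.chunk(first_spectrum, n_segments, dim=0)
        # first CNN: one first hidden vector per segment
        hiddens = torch.stack(
            [first_cnn(seg.T.unsqueeze(0)).squeeze() for seg in segments]
        )                                              # shape (N, 64)
        mean, var = hiddens.mean(dim=0), hiddens.var(dim=0)
        second_hidden = torch.cat([mean, var])         # concatenated mean and variance, (128,)
        # second CNN produces the user coding vector
        return second_cnn(second_hidden.view(1, 128, 1)).flatten()

    code = user_coding_vector(torch.randn(400, 80))    # e.g. 400 frames of 80-dim features
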
6. The speech synthesis method of claim 1, wherein the synthesizing the text hidden vector, the prosody hidden vector and the user coding vector by using an attention mechanism to obtain the target acoustic features comprises:
processing the text hidden vector and the prosody hidden vector by using an attention mechanism to obtain a fusion hidden vector;
and synthesizing the fusion hidden vector and the user coding vector to obtain the target acoustic features.
7. The speech synthesis method of claim 6, wherein the processing the text hidden vector and the prosody hidden vector by using an attention mechanism to obtain a fusion hidden vector comprises:
performing similarity calculation on the text hidden vector and the prosody hidden vector by using an attention mechanism to obtain a vector similarity;
normalizing the vector similarity to obtain a prosody weight value;
and performing weighting processing based on the text hidden vector and the prosody weight value to obtain the fusion hidden vector.
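
One common reading of claim 7 above is dot-product attention: a similarity matrix between the two hidden-vector sequences, softmax normalisation into prosody weights, and a weighted combination. The NumPy sketch below follows that reading with assumed dimensions; it is not asserted to be the patented fusion.

    import numpy as np

    def fuse(text_hidden, prosody_hidden):
        # similarity between every text step and every prosody frame
        similarity = text_hidden @ prosody_hidden.T                      # (T_text, T_prosody)
        # softmax normalisation -> prosody weight values
        weights = np.exp(similarity - similarity.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        # weighting processing: weighted prosody context added to the text hidden vectors
        prosody_context = weights @ prosody_hidden
        return text_hidden + prosody_context                             # fusion hidden vector

    fused = fuse(np.random.randn(8, 256), np.random.randn(20, 256))      # shape (8, 256)
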
8. A speech synthesis apparatus, comprising:
the text hidden vector acquisition module is used for processing the text sequence to acquire a text hidden vector;
the prosody hidden vector acquisition module is used for performing prosodic feature extraction on the prosodic reference audio to acquire a prosody hidden vector;
the user code vector acquisition module is used for acquiring a user code vector corresponding to the user identifier;
the target acoustic feature acquisition module is used for synthesizing the text hidden vector, the prosody hidden vector and the user coding vector by using an attention mechanism to acquire target acoustic features;
and the target audio file acquisition module is used for performing speech synthesis based on the target acoustic features to acquire a target audio file corresponding to the text sequence.
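
Purely as an assumed organisation, the five modules of claim 8 above could be held by a plain Python object whose attributes are callables; the class and attribute names below are invented here and carry no legal significance.

    class SpeechSynthesisApparatus:
        def __init__(self, text_module, prosody_module, user_module,
                     acoustic_module, audio_module):
            self.text_hidden_vector_acquisition = text_module
            self.prosody_hidden_vector_acquisition = prosody_module
            self.user_coding_vector_acquisition = user_module
            self.target_acoustic_feature_acquisition = acoustic_module
            self.target_audio_file_acquisition = audio_module

        def run(self, text_sequence, reference_audio, user_id):
            text_h = self.text_hidden_vector_acquisition(text_sequence)
            prosody_h = self.prosody_hidden_vector_acquisition(reference_audio)
            user_code = self.user_coding_vector_acquisition(user_id)
            acoustic = self.target_acoustic_feature_acquisition(text_h, prosody_h, user_code)
            return self.target_audio_file_acquisition(acoustic)
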
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the speech synthesis method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the speech synthesis method according to any one of claims 1 to 7.
CN202210897499.5A 2022-07-28 2022-07-28 Speech synthesis method, apparatus, computer device and storage medium Pending CN115359780A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210897499.5A CN115359780A (en) 2022-07-28 2022-07-28 Speech synthesis method, apparatus, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210897499.5A CN115359780A (en) 2022-07-28 2022-07-28 Speech synthesis method, apparatus, computer device and storage medium

Publications (1)

Publication Number Publication Date
CN115359780A true CN115359780A (en) 2022-11-18

Family

ID=84031759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210897499.5A Pending CN115359780A (en) 2022-07-28 2022-07-28 Speech synthesis method, apparatus, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN115359780A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116092479A (en) * 2023-04-07 2023-05-09 杭州东上智能科技有限公司 Text prosody generation method and system based on comparison text-audio pair

Similar Documents

Publication Publication Date Title
CN111754976B (en) Rhythm control voice synthesis method, system and electronic device
CN108573693B (en) Text-to-speech system and method, and storage medium therefor
CN111489734B (en) Model training method and device based on multiple speakers
CN112712813B (en) Voice processing method, device, equipment and storage medium
CN112735373A (en) Speech synthesis method, apparatus, device and storage medium
CN110570876A (en) Singing voice synthesis method and device, computer equipment and storage medium
WO2022252904A1 (en) Artificial intelligence-based audio processing method and apparatus, device, storage medium, and computer program product
CN111930900B (en) Standard pronunciation generating method and related device
CN112837669B (en) Speech synthesis method, device and server
CN112735389A (en) Voice training method, device and equipment based on deep learning and storage medium
CN113450765A (en) Speech synthesis method, apparatus, device and storage medium
KR20230133362A (en) Generate diverse and natural text-to-speech conversion samples
CN115827854A (en) Voice abstract generation model training method, voice abstract generation method and device
CN113178188B (en) Speech synthesis method, device, equipment and storage medium
WO2021134591A1 (en) Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium
CN114242093A (en) Voice tone conversion method and device, computer equipment and storage medium
CN115359780A (en) Speech synthesis method, apparatus, computer device and storage medium
CN113903326A (en) Speech synthesis method, apparatus, device and storage medium
CN114783407B (en) Speech synthesis model training method, device, computer equipment and storage medium
CN114387946A (en) Training method of speech synthesis model and speech synthesis method
CN113628608A (en) Voice generation method and device, electronic equipment and readable storage medium
CN116597858A (en) Voice mouth shape matching method and device, storage medium and electronic equipment
CN116129853A (en) Training method of speech synthesis model, speech synthesis method and related equipment
CN115376486A (en) Speech synthesis method, device, computer equipment and storage medium
CN115206281A (en) Speech synthesis model training method and device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination