CN115620702A - Speech synthesis method, speech synthesis device, electronic apparatus, and storage medium - Google Patents

Speech synthesis method, speech synthesis device, electronic apparatus, and storage medium

Info

Publication number
CN115620702A
Authority
CN
China
Prior art keywords
feature
features
prosody
acoustic
data
Prior art date
Legal status
Pending
Application number
CN202211102117.1A
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202211102117.1A priority Critical patent/CN115620702A/en
Publication of CN115620702A publication Critical patent/CN115620702A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application provides a speech synthesis method, a speech synthesis device, an electronic device, and a storage medium, and belongs to the technical field of artificial intelligence. The method comprises the following steps: acquiring original phoneme data to be processed and inputting the original phoneme data into a preset speech synthesis model, wherein the speech synthesis model comprises an acoustic network and a generation network; encoding the original phoneme data through the acoustic network to obtain a phoneme feature vector; performing prosody label recognition on the phoneme feature vector through the acoustic network to obtain prosody label features of the original phoneme data; performing acoustic feature extraction on the phoneme feature vector through the acoustic network to obtain vector quantization features of the original phoneme data; performing feature prediction according to the vector quantization features and the prosody label features to obtain target prosody features of the original phoneme data; and performing speech synthesis on the target prosody features and the vector quantization features through the generation network to obtain target speech data. The method and the device can improve the accuracy of speech synthesis.

Description

Speech synthesis method, speech synthesis device, electronic apparatus, and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a speech synthesis method, a speech synthesis apparatus, an electronic device, and a storage medium.
Background
Speech synthesis refers to the synthesis of intelligible, natural speech from text, and is also known as Text-To-Speech (TTS). Most common speech synthesis methods adopt the mel cepstrum as the acoustic feature for speech synthesis, yet the relationship of the mel cepstrum between the time domain and the frequency domain is often complex, and this complexity greatly affects the accuracy of speech synthesis.
Disclosure of Invention
The present disclosure provides a speech synthesis method, a speech synthesis apparatus, an electronic device, and a storage medium, which are used to improve accuracy of speech synthesis.
To achieve the above object, a first aspect of an embodiment of the present application provides a speech synthesis method, including:
acquiring original phoneme data to be processed, wherein the original phoneme data is text data;
inputting the original phoneme data into a preset speech synthesis model; wherein the speech synthesis model comprises an acoustic network and a generation network;
coding the original phoneme data through the acoustic network to obtain a phoneme feature vector;
performing prosodic tag recognition on the phoneme feature vector through the acoustic network to obtain prosodic tag features of the original phoneme data;
extracting acoustic features of the phoneme feature vectors through the acoustic network to obtain vector quantization features of the original phoneme data;
performing feature prediction according to the vector quantization feature and the prosody label feature to obtain a target prosody feature of the original phoneme data;
and performing voice synthesis on the target prosodic features and the vector quantization features through the generation network to obtain target voice data.
In some embodiments, the acoustic network includes a first LSTM layer and a decoder, and performing prosody label recognition on the phoneme feature vectors through the acoustic network to obtain prosody label features of the original phoneme data includes:
performing prosodic feature extraction on the phoneme feature vector through the first LSTM layer to obtain initial prosodic features corresponding to the phoneme feature vector;
clustering the initial prosodic features through a preset clustering algorithm and reference clustering labels to obtain target clustering labels of the original phoneme data;
and decoding the target clustering label through the decoder to obtain the prosodic label characteristic.
In some embodiments, the acoustic network includes a first LSTM layer, a decoder, and a second LSTM layer, and the performing acoustic feature extraction on the phoneme feature vector through the acoustic network to obtain a vector quantization feature of the original phoneme data includes:
performing prosodic feature extraction on the phoneme feature vector through the first LSTM layer to obtain an initial prosodic feature corresponding to the phoneme feature vector;
decoding the initial prosodic features through the decoder to obtain initial Mel cepstrum features;
and performing prediction processing on the initial Mel cepstrum characteristics through the second LSTM layer and a preset acoustic characteristic label to obtain the vector quantization characteristics.
In some embodiments, the performing feature prediction according to the vector quantization feature and the prosody label feature to obtain a target prosody feature of the raw phoneme data includes:
splicing the vector quantization features and the prosody label features to obtain predicted prosody features;
carrying out layer normalization processing on the predicted prosody features to obtain intermediate prosody features;
screening the intermediate prosodic features according to preset reference prosodic parameters to obtain three-dimensional prosodic features;
and performing standardization processing on the three-dimensional prosodic features to obtain the target prosodic features.
In some embodiments, the normalizing the three-dimensional prosodic feature to obtain a target prosodic feature includes:
calculating the mean value of the three-dimensional prosodic features to obtain a prosodic feature mean value;
performing variance calculation on the three-dimensional prosodic features to obtain a prosodic feature variance value;
and carrying out standardization processing on the three-dimensional prosody feature according to the prosody feature mean value and the prosody feature variance value to obtain the target prosody feature.
In some embodiments, the performing speech synthesis on the target prosodic features and the vector quantized features through the generation network to obtain target speech data includes:
performing convolution processing on the target prosody features through the convolution layer of the generation network to obtain candidate prosody feature vectors, and performing convolution processing on the vector quantization features through the convolution layer to obtain candidate quantization feature vectors;
splicing the candidate quantization feature vector and the candidate prosody feature vector to obtain target acoustic features;
smoothing the target acoustic features through a feature encoder of the generation network to obtain acoustic feature vectors;
and performing voice generation processing on the acoustic feature vector through the vocoder of the generation network to obtain the target voice data.
In some embodiments, the obtaining raw phoneme data to be processed includes:
acquiring an original text;
filtering the original text according to a preset language type to obtain an initial text;
carrying out format conversion on the initial text according to a preset format template to obtain a target text;
and performing data conversion on the target text through a preset text conversion model and a reference dictionary to obtain the original phoneme data.
To achieve the above object, a second aspect of an embodiment of the present application proposes a speech synthesis apparatus, including:
the data acquisition module is used for acquiring original phoneme data to be processed, wherein the original phoneme data is text data;
the input module is used for inputting the original phoneme data into a preset speech synthesis model; wherein the speech synthesis model comprises an acoustic network and a generation network;
the coding module is used for coding the original phoneme data through the acoustic network to obtain a phoneme feature vector;
a label identification module, configured to perform prosody label identification on the phoneme feature vector through the acoustic network to obtain prosody label features of the original phoneme data;
the acoustic feature extraction module is used for extracting acoustic features of the phoneme feature vectors through the acoustic network to obtain vector quantization features of the original phoneme data;
the feature prediction module is used for performing feature prediction according to the vector quantization feature and the prosody label feature to obtain a target prosody feature of the original phoneme data;
and the voice synthesis module is used for carrying out voice synthesis on the target prosody features and the vector quantization features through the generation network to obtain target voice data.
In order to achieve the above object, a third aspect of the embodiments of the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the method of the first aspect when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program, which when executed by a processor implements the method of the first aspect.
The application provides a speech synthesis method, a speech synthesis device, an electronic device and a computer readable storage medium, which are used for obtaining original phoneme data to be processed; the method comprises the steps of inputting original phoneme data into a preset speech synthesis model, wherein the speech synthesis model comprises an acoustic network and a generation network, processing the original phoneme data through the model can be achieved, and the efficiency of speech synthesis can be improved to a certain extent. Further, encoding the original phoneme data through an acoustic network to obtain a phoneme feature vector; the acoustic network is used for carrying out prosody label recognition on the phoneme feature vectors to obtain prosody label features of the original phoneme data, prosody information of the original phoneme data can be conveniently obtained, meanwhile, acoustic feature extraction is carried out on the phoneme feature vectors through the acoustic network to obtain vector quantization features of the original phoneme data, and therefore the acoustic feature content of the original phoneme data can be represented by adopting discontinuous vector quantization features. Furthermore, feature prediction is performed according to the vector quantization features and the prosody tag features to obtain target prosody features of the original phoneme data, more accurate prosody information can be obtained based on the vector quantization features and the prosody tag features, and the quality of feature information used for voice synthesis is improved. Finally, the target prosody feature and the vector quantization feature are subjected to voice synthesis through the generation network to obtain target voice data, so that the synthesized target voice data can simultaneously contain the prosody feature and the acoustic feature of the original phoneme data, and the accuracy of voice synthesis is effectively improved.
Drawings
Fig. 1 is a flowchart of a speech synthesis method provided in an embodiment of the present application;
fig. 2 is a flowchart of step S101 in fig. 1;
FIG. 3 is a flowchart of step S104 in FIG. 1;
fig. 4 is a flowchart of step S105 in fig. 1;
FIG. 5 is a flowchart of step S106 in FIG. 1;
FIG. 6 is a flowchart of step S504 in FIG. 5;
fig. 7 is a flowchart of step S107 in fig. 1;
fig. 8 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
fig. 9 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It is noted that while functional block divisions are provided in device diagrams and logical sequences are shown in flowcharts, in some cases, steps shown or described may be performed in sequences other than block divisions within devices or flowcharts. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
First, several terms referred to in the present application are explained:
artificial Intelligence (AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding human intelligence; artificial intelligence is a branch of computer science, which attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence, and research in this field includes robotics, language recognition, image recognition, natural language processing, expert systems, and the like. The artificial intelligence can simulate the information process of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.
Natural Language Processing (NLP): NLP uses computers to process, understand, and use human languages (such as Chinese and English); it is a branch of artificial intelligence and an interdisciplinary field of computer science and linguistics, also commonly called computational linguistics. Natural language processing includes syntactic analysis, semantic analysis, discourse understanding, and the like. Natural language processing is commonly used in technical fields such as machine translation, character recognition of handwriting and print, speech recognition and text-to-speech conversion, information intention recognition, information extraction and filtering, text classification and clustering, public opinion analysis and viewpoint mining; it relates to data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, linguistic research related to language computation, and other areas related to language processing.
Information Extraction (IE): a text processing technique that extracts factual information of specified types, such as entities, relations and events, from natural language text and outputs it as structured data. Information extraction is a technique for extracting specific information from text data. Text data is composed of specific units, such as sentences, paragraphs and chapters, and text information is composed of smaller specific units, such as words, phrases, sentences and paragraphs, or combinations of these units. Extracting noun phrases, person names, place names and so on from text data is text information extraction; of course, the information extracted by text information extraction techniques can be of various types.
Mel-Frequency Cepstral Coefficients (MFCC): a set of key coefficients used to build the mel cepstrum. From segments of a music signal, a set of cepstra that sufficiently represent the signal can be obtained, and the mel-frequency cepstral coefficients are derived from this cepstrum (i.e. the spectrum of a spectrum). Unlike the general cepstrum, the main feature of the mel cepstrum is that its frequency bands are evenly distributed on the mel scale, i.e. such bands are closer to the nonlinear human auditory system than those of the commonly seen linear cepstrum representation. For example, the mel cepstrum is often used in audio compression.
Phoneme (Phone): the smallest phonetic unit, divided according to the natural attributes of speech; it is analyzed in terms of the articulatory actions within a syllable, where one action forms one phoneme.
Pooling (Pooling): essentially a sampling operation that performs dimensionality reduction and compression on an input feature map in a chosen manner so as to speed up computation; the most commonly used pooling process is max pooling (Max Pooling).
Activation Function (Activation Function): is a function that runs on a neuron of an artificial neural network responsible for mapping the input of the neuron to the output.
Self-Supervised Vector quantization (Self-Supervised Vector-Quantized, VQ): the method clusters original continuous data into discrete data in a clustering-like mode, so that the data quantity required to be stored is reduced, and the aim of data compression is fulfilled.
Encoding (Encoder): the input sequence is converted into a vector of fixed length.
Decoding (Decoder): converting the fixed vector generated before into an output sequence; wherein, the input sequence can be characters, voice, images and videos; the output sequence may be text, images.
GRU (Gated Recurrent Unit): a variant of the Recurrent Neural Network (RNN); like LSTM (Long Short-Term Memory), it was proposed to address the problems of long-term memory and of gradients in back-propagation.
Softmax function: the Softmax function is a normalized exponential function that "compresses" a K-dimensional vector z containing arbitrary real numbers into another K-dimensional real vector σ (z) such that each element ranges between (0, 1) and the sum of all elements is 1, which is commonly used in multi-classification problems.
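Written as a formula (standard notation for the definition above; the symbols are ours rather than the patent's):

```latex
\sigma(z)_i \;=\; \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
```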
Speech synthesis refers to the synthesis of intelligible, natural speech from text, and is also known as Text-To-Speech (TTS). Most common speech synthesis methods adopt the mel cepstrum as the acoustic feature for speech synthesis, yet the relationship of the mel cepstrum between the time domain and the frequency domain is often complex, and this complexity greatly affects the accuracy of speech synthesis.
Based on this, embodiments of the present application provide a speech synthesis method, a speech synthesis apparatus, an electronic device, and a storage medium, aiming to improve accuracy of speech synthesis.
The speech synthesis method, the speech synthesis apparatus, the electronic device, and the computer-readable storage medium provided in the embodiments of the present application are specifically described in the following embodiments, and first, the speech synthesis method in the embodiments of the present application is described.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The embodiment of the application provides a voice synthesis method, and relates to the technical field of artificial intelligence. The speech synthesis method provided by the embodiment of the application can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smartphone, tablet, laptop, desktop computer, or the like; the server side can be configured into an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and cloud servers for providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN (content delivery network) and big data and artificial intelligence platforms; the software may be an application or the like that implements a speech synthesis method, but is not limited to the above form.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is an alternative flowchart of a speech synthesis method provided in an embodiment of the present application, and the method in fig. 1 may include, but is not limited to, steps S101 to S107.
Step S101, acquiring original phoneme data to be processed, wherein the original phoneme data is text data;
step S102, inputting original phoneme data into a preset speech synthesis model; the voice synthesis model comprises an acoustic network and a generation network;
step S103, encoding the original phoneme data through an acoustic network to obtain a phoneme feature vector;
step S104, performing prosody label recognition on the phoneme feature vector through an acoustic network to obtain prosody label features of the original phoneme data;
step S105, extracting acoustic features of the phoneme feature vectors through an acoustic network to obtain vector quantization features of the original phoneme data;
step S106, performing feature prediction according to the vector quantization features and the prosody label features to obtain target prosody features of the original phoneme data;
and S107, performing voice synthesis on the target prosody feature and the vector quantization feature through a generation network to obtain target voice data.
Steps S101 to S107 illustrated in the embodiment of the present application are performed by acquiring raw phoneme data to be processed; the original phoneme data are input into a preset speech synthesis model, wherein the speech synthesis model comprises an acoustic network and a generation network, the original phoneme data can be processed through the model, and the speech synthesis efficiency can be improved to a certain extent. Further, the original phoneme data is coded through an acoustic network to obtain phoneme feature vectors; the acoustic network is used for carrying out prosody label recognition on the phoneme feature vectors to obtain prosody label features of the original phoneme data, prosody information of the original phoneme data can be conveniently obtained, meanwhile, the acoustic network is used for carrying out acoustic feature extraction on the phoneme feature vectors to obtain vector quantization features of the original phoneme data, and therefore the acoustic feature content of the original phoneme data can be represented by adopting discontinuous vector quantization features. Furthermore, feature prediction is carried out according to the vector quantization features and the prosody tag features to obtain target prosody features of the original phoneme data, more accurate prosody information can be obtained based on the vector quantization features and the prosody tag features, and the quality of feature information used for voice synthesis is improved. Finally, the target prosody feature and the vector quantization feature are subjected to voice synthesis through the generation network to obtain target voice data, so that the synthesized target voice data can simultaneously contain the prosody feature and the acoustic feature of the original phoneme data, and the accuracy of voice synthesis is effectively improved.
Referring to fig. 2, in some embodiments, step S101 may include, but is not limited to, step S201 to step S204:
step S201, acquiring an original text;
step S202, filtering the original text according to a preset language type to obtain an initial text;
step S203, carrying out format conversion on the initial text according to a preset format template to obtain a target text;
and S204, performing data conversion on the target text through a preset text conversion model and a reference dictionary to obtain original phoneme data.
In step S201 of some embodiments, the original text may be obtained from a public data set, or may be obtained from an existing text database or a network platform, without limitation. For example, the public data set may be the LJSpeech data set, which contains a number of English speech recordings by a female speaker together with the text data corresponding to those recordings.
In step S202 of some embodiments, the preset language type includes multiple languages such as Chinese, English, and French. In the stage of preprocessing the original text, a language type meeting the requirements needs to be determined according to the current requirements, the original text is filtered according to the language type, and original text that does not belong to the language type is removed, so that the initial text is obtained. For example, if the original text includes a Chinese text, an English text, and a Japanese text, and the currently required language type is English, the Chinese text and the Japanese text are filtered out, and only the English text is retained as the initial text.
In step S203 of some embodiments, in order to improve the normalization of the text, after the initial text is obtained, format conversion is further performed on the initial text according to a preset format template, where the preset format template includes adjustments of font type, font size, font spacing, punctuation marks, and the like. For example, punctuation unification, full-width and half-width character conversion, font adjustment, and the like are performed on the initial text according to the format template; further, physical quantities, dates, currency expressions, and the like that appear in different formats in the initial text may be normalized to a standard form, and the adjusted initial text is used as the target text.
In step S204 of some embodiments, the preset text conversion model may be an open-source text-to-phoneme model, e.g., a g2p-seq2seq model or the like, and the reference dictionary may be a CMUSphinx data dictionary containing a plurality of words. A word list can be constructed through the text conversion model and the reference dictionary, the format of the word list being one word or character per line. The text content in the target text can then be converted through the word list, and the words or characters corresponding to the text content are converted into a phoneme sequence, thereby forming the original phoneme data corresponding to the target text.
Through the steps S201 to S204, the original text can be conveniently filtered, the initial text meeting the current requirement is screened out, the total text amount is reduced, and the text processing efficiency is improved.
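As an illustration of the word-list lookup described in step S204, the following is a minimal sketch assuming a toy CMUdict-style mapping; the function name, dictionary entries, and fallback handling are illustrative assumptions, not the patent's exact procedure.

```python
# A hedged sketch: convert a normalized target text into a phoneme sequence
# using a word list built from a reference dictionary. The dictionary format
# and the fallback handling are assumptions, not the patent's exact procedure.
def text_to_phonemes(target_text, reference_dict):
    """reference_dict maps a word to its phonemes, e.g. {"hello": ["HH", "AH", "L", "OW"]}."""
    phonemes = []
    for word in target_text.lower().split():
        word = word.strip(".,!?;:")              # punctuation left over after normalization
        if word in reference_dict:
            phonemes.extend(reference_dict[word])
        else:
            phonemes.append("<unk>")             # unknown words would go through a g2p model in practice
    return phonemes

raw_phoneme_data = text_to_phonemes(
    "hello world",
    {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]},
)
```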
In step S102 of some embodiments, the original phoneme data is input to a preset speech synthesis model; the speech synthesis model comprises an acoustic network and a generation network. The acoustic network can be constructed based on a txt2vec acoustic model and comprises an encoder, a first LSTM layer, a second LSTM layer and a decoder. It is mainly used for performing feature extraction on the input phoneme data to acquire the acoustic features and prosodic features corresponding to the phoneme data, wherein the acoustic features comprise tone information, fundamental frequency information, average speech power, duration and the like corresponding to the phoneme data, and the prosodic features comprise tone information, pitch information and the like corresponding to the phoneme data. The generation network can be constructed based on a vec2wav model and comprises a convolution layer, a feature encoder and a vocoder. It is mainly used for fusing the acoustic features and prosodic features corresponding to the phoneme data, generating a corresponding mel cepstrum according to the fused acoustic features and prosodic features, and finally converting the mel cepstrum from a spectrum form to a waveform form to obtain synthesized speech data, wherein the speech content of the speech data is consistent with the text content of the input phoneme data. The speech synthesis model can effectively improve the efficiency of speech synthesis, and meanwhile the prosodic features and the acoustic features of the original phoneme data are simultaneously blended into the speech synthesis process, so that the accuracy of speech synthesis can be improved.
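For orientation, a schematic sketch of the two-network layout described above is given below; the module choices, layer types, and dimensions are assumptions made for illustration and are not specified by the patent.

```python
import torch.nn as nn

# A hedged sketch of the two-network layout; module choices and dimensions are
# illustrative assumptions, not the architecture claimed by the patent.
class AcousticNetwork(nn.Module):
    """txt2vec-style: phoneme data -> prosody label features + vector quantization features."""
    def __init__(self, phoneme_vocab=100, hidden=256, mel_dim=80):
        super().__init__()
        self.encoder = nn.Embedding(phoneme_vocab, hidden)                                # phonemes -> feature vectors
        self.first_lstm = nn.LSTM(hidden, hidden, bidirectional=True, batch_first=True)   # prosodic features
        self.decoder = nn.Linear(2 * hidden, mel_dim)                                     # -> initial mel cepstral features
        self.second_lstm = nn.LSTM(mel_dim, hidden, batch_first=True)                     # -> VQ feature prediction

class GenerationNetwork(nn.Module):
    """vec2wav-style: prosody + VQ features -> waveform."""
    def __init__(self):
        super().__init__()
        # The convolution layers, feature encoder and vocoder (e.g. a HiFi-GAN-style
        # upsampling generator) would be defined here.
```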
In step S103 of some embodiments, the original phoneme data is encoded by an encoder of the acoustic network, and the input original phoneme data is converted into a vector with a fixed length to obtain a phoneme feature vector, so as to implement conversion of the phoneme data from a text form to a vector form, thereby facilitating a subsequent speech synthesis process.
Referring to fig. 3, in some embodiments where the acoustic network includes a first LSTM layer and a decoder, step S104 may include, but is not limited to including, steps S301 to S303:
step S301, performing prosody feature extraction on the phoneme feature vector through the first LSTM layer to obtain an initial prosody feature corresponding to the phoneme feature vector;
step S302, clustering the initial prosody features through a preset clustering algorithm and a reference clustering label to obtain a target clustering label of the original phoneme data;
and step S303, decoding the target clustering label through a decoder to obtain the rhythm label characteristic.
In step S301 of some embodiments, the phoneme feature vector is encoded from left to right by the first LSTM layer to obtain a first prosody feature vector, the phoneme feature vector is encoded from right to left by the first LSTM layer to obtain a second prosody feature vector, and finally the first prosody feature vector and the second prosody feature vector are vector-spliced or vector-added to obtain an initial prosody feature corresponding to the phoneme feature vector.
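A minimal sketch of the bidirectional encoding and concatenation described in step S301, assuming a PyTorch LSTM; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

# A hedged sketch: a bidirectional LSTM reads the phoneme feature vectors left-to-right
# and right-to-left; the two directions are concatenated into the initial prosodic features.
hidden = 128
first_lstm = nn.LSTM(input_size=hidden, hidden_size=hidden,
                     bidirectional=True, batch_first=True)

phoneme_vectors = torch.randn(1, 20, hidden)        # (batch, phoneme sequence length, feature dim)
initial_prosody, _ = first_lstm(phoneme_vectors)    # forward and backward outputs concatenated
print(initial_prosody.shape)                        # torch.Size([1, 20, 256])
```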
In step S302 of some embodiments, the preset clustering algorithm may include a K-Means clustering algorithm, the reference clustering label may be generated based on a clustering process on the sample phoneme data, and the reference clustering label may be a phoneme-level prosodic label, which includes various types of treble, accent, bass, and so on. And clustering the initial prosody features into N classes through a K-Means clustering algorithm and reference clustering labels, wherein N is an integer greater than 0, and taking the reference clustering label closest to the clustering center of each class of initial prosody features as the clustering label of the initial prosody feature, thereby obtaining the target clustering label of the original phoneme data.
In step S303 of some embodiments, the decoder decodes the target cluster labels, and converts the target cluster labels of the raw phoneme data into an output vector form, so as to obtain prosody label features, where the prosody label features are used for characterizing prosody information of the raw phoneme data.
Through the steps S301 to S303, the prosody feature of the original phoneme data can be more conveniently identified according to the preset clustering algorithm and the reference clustering label to obtain the corresponding prosody label feature, so that the prosody label feature can be used in the subsequent speech synthesis process, thereby improving the accuracy of speech synthesis.
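The clustering-and-labeling idea of steps S301 to S303 can be sketched as follows, assuming scikit-learn K-Means and a toy set of reference prosody labels; the label names, feature dimensions, and cluster count are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# A hedged sketch: cluster the initial prosodic features into N classes, then assign each
# class the reference label whose embedding lies nearest to the cluster centre.
# The reference labels and their embeddings are toy values.
initial_prosody = np.random.randn(200, 16)                     # 200 frames of 16-dim prosodic features
reference_labels = {"high": np.zeros(16), "accent": np.ones(16), "low": -np.ones(16)}

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(initial_prosody)
target_cluster_labels = []
for centre in kmeans.cluster_centers_:
    nearest = min(reference_labels, key=lambda name: np.linalg.norm(reference_labels[name] - centre))
    target_cluster_labels.append(nearest)
print(target_cluster_labels)                                   # e.g. ['accent', 'low', 'high']
```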
Referring to fig. 4, in some embodiments, the acoustic network includes a first LSTM layer, a decoder, and a second LSTM layer, and step S105 may include, but is not limited to include steps S401 to S403:
step S401, performing prosody feature extraction on the phoneme feature vectors through the first LSTM layer to obtain initial prosody features corresponding to the phoneme feature vectors;
step S402, decoding the initial prosody characteristics through a decoder to obtain initial Mel cepstrum characteristics;
and S403, performing prediction processing on the initial Mel cepstrum features through the second LSTM layer and a preset acoustic feature tag to obtain vector quantization features.
In step S401 of some embodiments, the phoneme feature vector is encoded from left to right by the first LSTM layer to obtain a first prosody feature vector, the phoneme feature vector is encoded from right to left by the first LSTM layer to obtain a second prosody feature vector, and finally the first prosody feature vector and the second prosody feature vector are vector-spliced or vector-added to obtain an initial prosody feature corresponding to the phoneme feature vector.
In step S402 of some embodiments, the initial prosodic feature and the speech feature information of the initial prosodic feature are captured by a decoder, and the speech feature information is converted into a mel cepstrum form, resulting in an initial mel cepstral feature, which can be used to characterize the speech content information of the original phoneme data.
In step S403 of some embodiments, the initial mel cepstrum features are subjected to prediction processing through a prediction function of the second LSTM layer and preset acoustic feature labels, where the prediction function may be a softmax function and the preset acoustic feature labels include a tone feature label, a duration feature label, and so on. A probability distribution of the initial mel cepstrum features over each preset acoustic feature label is created through the softmax function; this probability distribution clearly characterizes the degree of correlation between the initial mel cepstrum features and each acoustic feature label. According to the probability distribution, the acoustic feature label with the largest probability is selected as the feature label of the initial mel cepstrum features, so that the vector quantization features are obtained from the initial mel cepstrum features and their feature labels, where the vector quantization features are discrete VQ features.
Through the steps S401 to S403, the acoustic features corresponding to the original phoneme data can be conveniently predicted based on the prediction function and the preset acoustic feature tag, so as to obtain discontinuous vector quantization features, thereby eliminating errors caused by using a complex mel cepstrum as the acoustic features, and improving the effect of speech synthesis.
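A minimal sketch of the softmax-based label selection in step S403, assuming toy logits and an illustrative acoustic feature label set; in the real model the logits would come from the second LSTM layer.

```python
import torch
import torch.nn.functional as F

# A hedged sketch: predict a probability distribution over preset acoustic feature labels
# for each initial mel cepstral frame and keep the most probable label, giving a discrete
# (VQ-style) representation. The label set and the logits are illustrative.
acoustic_labels = ["tone", "duration", "energy"]
frame_logits = torch.randn(20, len(acoustic_labels))   # stands in for the second LSTM layer's output
probs = F.softmax(frame_logits, dim=-1)                # probability distribution per frame
label_indices = probs.argmax(dim=-1)                   # discrete label index per frame
```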
Referring to fig. 5, in some embodiments, step S106 may include, but is not limited to, step S501 to step S504:
Step S501, splicing the vector quantization features and the prosody label features to obtain predicted prosody features;
step S502, performing layer normalization processing on the predicted prosodic features to obtain intermediate prosodic features;
step S503, screening the intermediate prosody features according to preset reference prosody parameters to obtain three-dimensional prosody features;
step S504, the three-dimensional prosodic features are subjected to standardization processing to obtain target prosodic features.
In step S501 of some embodiments, when the vector quantization feature and the prosody tag feature are subjected to the stitching processing, the vector quantization feature and the prosody tag feature may be vectorized first, and the vector quantization feature and the prosody tag feature in the form of a vector are subjected to vector addition or vector stitching, so as to obtain the predicted prosody feature.
In step S502 of some embodiments, the predicted prosody features may be input into a convolutional network, the convolutional network including a plurality of layer normalization layers and a filter layer, and the predicted prosody features are layer-normalized in the layer normalization layers so that a mean and a variance of the predicted prosody features satisfy preset normalization conditions, thereby obtaining intermediate prosody features.
In step S503 of some embodiments, the preset reference prosody parameters include a logarithmic pitch threshold, an energy threshold, a speech probability threshold, and the like, the intermediate prosody features are screened according to the series of reference prosody parameters, and the intermediate prosody features with logarithmic pitch, energy, and speech probability satisfying the threshold requirements are selected as the three-dimensional prosody features.
In step S504 of some embodiments, when the three-dimensional prosodic features are standardized, the mean value of the three-dimensional prosodic features is first calculated to obtain a prosodic feature mean value; variance calculation is then performed on the three-dimensional prosodic features to obtain a prosodic feature variance value, and the three-dimensional prosodic features are standardized according to the prosodic feature mean value and the prosodic feature variance value to obtain the target prosodic features.
Through the steps S501 to S504, more accurate prosodic information can be acquired based on the vector quantization feature and the prosodic tag feature, so that the three-dimensional prosodic feature with logarithmic pitch, energy and speech probability meeting the requirement can be used for speech synthesis, the quality of feature information for speech synthesis is improved, and the accuracy of speech synthesis can be improved.
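The splice / layer-normalize / screen sequence of steps S501 to S503 can be sketched as follows; the feature dimensions and the reference prosody parameter thresholds are illustrative assumptions, since the patent does not give concrete values.

```python
import torch
import torch.nn.functional as F

# A hedged sketch: concatenate VQ and prosody label features, layer-normalize the result,
# then keep only the frames whose log pitch, energy and speech probability pass the
# thresholds. All dimensions and threshold values are illustrative assumptions.
vq_features = torch.randn(20, 32)
prosody_label_features = torch.randn(20, 16)

predicted_prosody = torch.cat([vq_features, prosody_label_features], dim=-1)    # splicing
intermediate = F.layer_norm(predicted_prosody, (predicted_prosody.shape[-1],))  # layer normalization

log_pitch, energy, voicing = torch.randn(20), torch.rand(20), torch.rand(20)
mask = (log_pitch > 0.0) & (energy > 0.2) & (voicing > 0.5)                      # assumed reference prosody parameters
three_dim_prosody = intermediate[mask]                                           # screened prosodic features
```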
Referring to fig. 6, in some embodiments, step S504 includes, but is not limited to, step S601 to step S603:
Step S601, calculating the mean value of the three-dimensional prosodic features to obtain a prosodic feature mean value;
Step S602, performing variance calculation on the three-dimensional prosodic features to obtain a prosodic feature variance value;
Step S603, performing standardization processing on the three-dimensional prosodic features according to the prosodic feature mean value and the prosodic feature variance value to obtain the target prosodic features.
In step S601 in some embodiments, a sum of the prosodic features may be obtained by summing the three-dimensional prosodic features, and then a division process may be performed on the sum of the prosodic features and the number of the three-dimensional prosodic features, so as to implement a mean calculation of the three-dimensional prosodic features and obtain a prosodic feature mean.
In step S602 of some embodiments, a unit variance calculation may be performed on the three-dimensional prosodic feature to obtain a prosodic feature variance value. For example, the difference between each three-dimensional prosodic feature and the prosodic feature mean value is calculated to obtain a prosodic feature difference value, and the unit variance is calculated based on a plurality of prosodic feature difference values to obtain a prosodic feature variance value.
In step S603 of some embodiments, when the three-dimensional prosodic features are standardized according to the prosodic feature mean value and the prosodic feature variance value, parameter fine-tuning needs to be performed on the three-dimensional prosodic features so that the prosodic feature mean value becomes 0 and the prosodic feature variance value becomes 1; that is, the fine-tuned three-dimensional prosodic features satisfy a distribution with a mean of 0 and a variance of 1, yielding the target prosodic features.
For example, the standardization of the three-dimensional prosodic features may be calculated using the following formula:

x_scale = (x - μ) / S

where x is the three-dimensional prosodic feature to be standardized, x_scale is the standardized three-dimensional prosodic feature, namely the target prosodic feature, μ is the prosodic feature mean value, and S is the prosodic feature variance value.
Through the steps S601 to S603, the three-dimensional prosodic features can be conveniently standardized, so that the logarithmic pitch, energy and speech probability of the three-dimensional prosodic features meet the speech synthesis requirement, and meanwhile, the three-dimensional prosodic features are subjected to parameter fine adjustment according to the prosodic feature mean value and the prosodic feature variance value, so that the obtained target prosodic features can more accurately reflect the prosodic information of the original phoneme data, thereby improving the quality of the prosodic feature information for speech synthesis and improving the accuracy of speech synthesis.
Referring to fig. 7, in some embodiments where the generation network includes convolutional layers, feature encoders and vocoders, step S107 may include, but is not limited to including, steps S701 through S704:
Step S701, performing convolution processing on the target prosody features through the convolution layer of the generation network to obtain candidate prosody feature vectors, and performing convolution processing on the vector quantization features through the convolution layer to obtain candidate quantization feature vectors;
Step S702, splicing the candidate quantization feature vectors and the candidate prosody feature vectors to obtain target acoustic features;
Step S703, smoothing the target acoustic features through the feature encoder of the generation network to obtain acoustic feature vectors;
Step S704, performing speech generation processing on the acoustic feature vectors through the vocoder of the generation network to obtain the target speech data.
In step S701 of some embodiments, the convolution layer of the generation network includes a group of 92 channels and a group of 32 channels. The target prosody features are convolved by the 92-channel group of the convolution layer, and the feature information of the target prosody features in the frequency-domain space is extracted to obtain candidate prosody feature vectors; the vector quantization features are convolved by the 32-channel group of the convolution layer, and the feature information of the vector quantization features in the frequency-domain space is extracted to obtain candidate quantization feature vectors.
In step S702 of some embodiments, when performing the splicing process on the candidate quantized feature vector and the candidate prosody feature vector, vector addition or vector concatenation may be performed on the candidate quantized feature vector and the candidate prosody feature vector, so as to obtain the target acoustic feature.
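A minimal sketch of steps S701 and S702, treating the 92- and 32-channel figures mentioned above as output channel counts of two 1-D convolutions (an assumption; the patent does not state whether they are input or output channels); shapes are illustrative.

```python
import torch
import torch.nn as nn

# A hedged sketch of the parallel convolutions and the channel-wise splicing.
prosody_conv = nn.Conv1d(in_channels=3, out_channels=92, kernel_size=3, padding=1)
vq_conv = nn.Conv1d(in_channels=80, out_channels=32, kernel_size=3, padding=1)

target_prosody = torch.randn(1, 3, 200)       # (batch, prosody dims, frames)
vq_features = torch.randn(1, 80, 200)         # (batch, VQ feature dims, frames)

candidate_prosody = prosody_conv(target_prosody)
candidate_vq = vq_conv(vq_features)
target_acoustic = torch.cat([candidate_vq, candidate_prosody], dim=1)   # (1, 124, 200)
```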
In step S703 of some embodiments, the feature encoder of the generation network performs smoothing processing on the discontinuous target acoustic features, so that the feature information of the target acoustic features falls within a range meeting the requirements; abnormal features in the target acoustic features can thus be effectively filtered out, reducing their interference with the overall target acoustic features and improving the stability of the generated acoustic feature vectors.
In step S704 of some embodiments, the vocoder may be a HiFi-GAN vocoder, which includes an upsampling module and a multi-receptive-field fusion residual module. The upsampling module performs upsampling (transposed convolution) on the acoustic feature vector to obtain initial speech features, and the residual module performs reconstruction processing on the initial speech features to obtain a reconstructed speech waveform, which is used as the target speech data.
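A toy sketch of the upsampling-plus-residual idea described for the vocoder, loosely following a HiFi-GAN-style layout; the upsampling factors, channel widths, and activation choices are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# A toy vocoder sketch: transposed-convolution upsampling followed by a residual
# refinement and a final projection to a single waveform channel.
class TinyVocoder(nn.Module):
    def __init__(self, feat_dim=124, hidden=64):
        super().__init__()
        self.upsample = nn.Sequential(
            nn.ConvTranspose1d(feat_dim, hidden, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(hidden, hidden, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
        )
        self.residual = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.to_wave = nn.Conv1d(hidden, 1, kernel_size=7, padding=3)

    def forward(self, features):                     # features: (batch, feat_dim, frames)
        x = self.upsample(features)                  # 64x temporal upsampling in total
        x = x + torch.tanh(self.residual(x))         # residual refinement
        return torch.tanh(self.to_wave(x))           # waveform in [-1, 1]

waveform = TinyVocoder()(torch.randn(1, 124, 200))   # shape (1, 1, 12800)
```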
Through the steps S701 to S704, the synthesized target speech data can contain the prosodic features and the acoustic features of the original phoneme data at the same time, thereby effectively improving the accuracy of speech synthesis.
The speech synthesis method of the embodiment of the application obtains original phoneme data to be processed; the method comprises the steps of inputting original phoneme data into a preset speech synthesis model, wherein the speech synthesis model comprises an acoustic network and a generation network, processing the original phoneme data through the model can be achieved, and the efficiency of speech synthesis can be improved to a certain extent. Further, encoding the original phoneme data through an acoustic network to obtain a phoneme feature vector; the acoustic network is used for carrying out prosody label recognition on the phoneme feature vectors to obtain prosody label features of the original phoneme data, prosody information of the original phoneme data can be conveniently obtained, meanwhile, acoustic feature extraction is carried out on the phoneme feature vectors through the acoustic network to obtain vector quantization features of the original phoneme data, and therefore the acoustic feature content of the original phoneme data can be represented by adopting discontinuous vector quantization features. Furthermore, feature prediction is performed according to the vector quantization features and the prosody tag features to obtain target prosody features of the original phoneme data, more accurate prosody information can be obtained based on the vector quantization features and the prosody tag features, and the quality of feature information used for voice synthesis is improved. Finally, the target prosody feature and the vector quantization feature are subjected to voice synthesis through the generation network to obtain target voice data, so that the synthesized target voice data can simultaneously contain the prosody feature and the acoustic feature of the original phoneme data, and the accuracy of voice synthesis is effectively improved.
Referring to fig. 8, an embodiment of the present application further provides a speech synthesis apparatus, which can implement the speech synthesis method, and the apparatus includes:
a data obtaining module 801, configured to obtain raw phoneme data to be processed, where the raw phoneme data is text data;
an input module 802, configured to input original phoneme data into a preset speech synthesis model; the voice synthesis model comprises an acoustic network and a generation network;
the encoding module 803 is configured to perform encoding processing on the original phoneme data through an acoustic network to obtain a phoneme feature vector;
a label identification module 804, configured to perform prosody label identification on the phoneme feature vector through an acoustic network to obtain prosody label features of the original phoneme data;
the acoustic feature extraction module 805 is configured to perform acoustic feature extraction on the phoneme feature vector through an acoustic network to obtain a vector quantization feature of the original phoneme data;
the feature prediction module 806 is configured to perform feature prediction according to the vector quantization feature and the prosody label feature to obtain a target prosody feature of the original phoneme data;
the speech synthesis module 807 is configured to perform speech synthesis on the target prosodic feature and the vector quantization feature through a generation network to obtain target speech data.
In some embodiments, the data acquisition module 801 comprises:
a text acquisition unit for acquiring an original text;
the filtering unit is used for filtering the original text according to a preset language type to obtain an initial text;
the format conversion unit is used for carrying out format conversion on the initial text according to a preset format template to obtain a target text;
and the data conversion unit is used for performing data conversion on the target text through a preset text conversion model and a reference dictionary to obtain original phoneme data.
In some embodiments, the acoustic network includes a first LSTM layer and a decoder, and the tag identification module 804 includes:
the first prosodic feature extraction unit is used for performing prosodic feature extraction on the phoneme feature vector through the first LSTM layer to obtain initial prosodic features corresponding to the phoneme feature vector;
the clustering unit is used for clustering the initial prosody characteristics through a preset clustering algorithm and a reference clustering label to obtain a target clustering label of the original phoneme data;
and the first decoding unit is used for decoding the target clustering label through the decoder to obtain the prosody label features.
In some embodiments, the acoustic network comprises a first LSTM layer, a decoder, a second LSTM layer, and the acoustic feature extraction module 805 comprises:
the second prosodic feature extracting unit is used for performing prosodic feature extraction on the phoneme feature vector through the first LSTM layer to obtain initial prosodic features corresponding to the phoneme feature vector;
the second decoding unit is used for decoding the initial prosodic features through a decoder to obtain initial mel cepstrum features;
and the prediction unit is used for performing prediction processing on the initial Mel cepstrum characteristics through the second LSTM layer and a preset acoustic characteristic label to obtain vector quantization characteristics.
In some embodiments, the feature prediction module 806 includes:
the first splicing unit is used for splicing the vector quantization features and the prosody label features to obtain predicted prosody features;
the normalization unit is used for carrying out layer normalization processing on the predicted prosody features to obtain intermediate prosody features;
the screening unit is used for screening the intermediate prosody features according to preset reference prosody parameters to obtain three-dimensional prosody features;
and the standardization unit is used for performing standardization processing on the three-dimensional prosodic features to obtain the target prosodic features.
In some embodiments, the normalization unit comprises:
the mean value calculating subunit is used for calculating the mean value of the three-dimensional prosodic features to obtain a prosodic feature mean value;
the variance calculating subunit is used for performing variance calculation on the three-dimensional prosodic features to obtain a prosodic feature variance value;
and the standardization subunit is used for carrying out standardization processing on the three-dimensional prosody feature according to the prosody feature mean value and the prosody feature variance value to obtain the target prosody feature.
In some embodiments, the speech synthesis module 807 includes:
the convolution unit is used for performing convolution processing on the target prosody features through the convolution layers of the generation network to obtain candidate prosody feature vectors, and performing convolution processing on the vector quantization features through the convolution layers to obtain candidate quantization feature vectors;
the second splicing unit is used for splicing the candidate quantization characteristic vectors and the candidate prosody characteristic vectors to obtain target acoustic characteristics;
the smoothing unit is used for smoothing the target acoustic features through a feature encoder of the generation network to obtain acoustic feature vectors;
and the voice generating unit is used for performing voice generation processing on the acoustic feature vectors through a vocoder of the generation network to obtain target voice data.
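For illustration, a simplified sketch of the generation network is given below; the convolution kernel sizes are assumptions, and the "vocoder" is reduced to a single linear projection to waveform samples, standing in for whatever neural vocoder the generation network actually uses.

```python
# Simplified, non-authoritative sketch of the generation network.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, prosody_dim=3, vq_dim=256, hidden=256, hop=256):
        super().__init__()
        self.prosody_conv = nn.Conv1d(prosody_dim, hidden, kernel_size=3, padding=1)
        self.vq_conv = nn.Conv1d(vq_dim, hidden, kernel_size=3, padding=1)
        self.feature_encoder = nn.Conv1d(2 * hidden, hidden, kernel_size=5, padding=2)  # smoothing
        self.vocoder = nn.Linear(hidden, hop)                      # placeholder vocoder head (assumed)

    def forward(self, target_prosody, vq_feats):                   # both (batch, time, channels)
        p = self.prosody_conv(target_prosody.transpose(1, 2))      # candidate prosody feature vectors
        q = self.vq_conv(vq_feats.transpose(1, 2))                 # candidate quantization feature vectors
        target_acoustic = torch.cat([q, p], dim=1)                 # spliced target acoustic features
        acoustic_vec = self.feature_encoder(target_acoustic)       # smoothed acoustic feature vectors
        frames = self.vocoder(acoustic_vec.transpose(1, 2))        # (batch, time, hop)
        return frames.flatten(1)                                   # (batch, time * hop) target speech samples

gen = Generator()
print(gen(torch.randn(2, 17, 3), torch.randn(2, 17, 256)).shape)   # torch.Size([2, 4352])
```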
The specific implementation of the speech synthesis apparatus is substantially the same as the specific implementation of the speech synthesis method, and is not described herein again.
An embodiment of the present application further provides an electronic device, where the electronic device includes: a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for enabling connection and communication between the processor and the memory, the program, when executed by the processor, implementing the above-described speech synthesis method. The electronic device can be any intelligent terminal, including a tablet computer, a vehicle-mounted computer, and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, where the electronic device includes:
the processor 901 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute a relevant program to implement the technical solution provided in the embodiment of the present application;
the memory 902 may be implemented in the form of a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a Random Access Memory (RAM). The memory 902 may store an operating system and other application programs; when the technical solution provided in the embodiments of the present application is implemented by software or firmware, the relevant program codes are stored in the memory 902 and called by the processor 901 to execute the speech synthesis method in the embodiments of the present application;
an input/output interface 903 for implementing information input and output;
a communication interface 904, configured to implement communication interaction between the device and another device, where communication may be implemented in a wired manner (e.g., USB, network cable, etc.), or in a wireless manner (e.g., mobile network, WIFI, bluetooth, etc.);
a bus 905 that transfers information between various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively connected to one another within the device through the bus 905.
Embodiments of the present application also provide a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the above-mentioned speech synthesis method.
The memory, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer-executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The speech synthesis method, the speech synthesis device, the electronic device and the computer-readable storage medium provided by the embodiments of the application acquire original phoneme data to be processed and input the original phoneme data into a preset speech synthesis model, where the speech synthesis model includes an acoustic network and a generation network; processing the original phoneme data through this model can improve speech synthesis efficiency to a certain extent. Further, the original phoneme data are encoded through the acoustic network to obtain a phoneme feature vector, and prosody label recognition is performed on the phoneme feature vector through the acoustic network to obtain the prosody label features of the original phoneme data, so that the prosody information of the original phoneme data can be conveniently obtained. Meanwhile, acoustic feature extraction is performed on the phoneme feature vector through the acoustic network to obtain the vector quantization features of the original phoneme data, so that the acoustic feature content of the original phoneme data is represented by discrete vector quantization features, eliminating the influence on speech synthesis accuracy of taking the Mel cepstrum directly as the acoustic features. Furthermore, feature prediction is performed according to the vector quantization features and the prosody label features to obtain the target prosody features of the original phoneme data; more accurate prosody information can be obtained based on the vector quantization features and the prosody label features, which improves the quality of the feature information used for speech synthesis. Finally, the target prosody features and the vector quantization features are subjected to speech synthesis through the generation network to obtain target speech data, so that the synthesized target speech data can simultaneously contain the prosody features and the acoustic features of the original phoneme data, thereby effectively improving the accuracy of speech synthesis.
The embodiments described in the embodiments of the present application are for more clearly illustrating the technical solutions of the embodiments of the present application, and do not constitute limitations on the technical solutions provided in the embodiments of the present application, and it is obvious to those skilled in the art that the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems with the evolution of technologies and the emergence of new application scenarios.
It will be appreciated by those skilled in the art that the embodiments shown in fig. 1-7 are not limiting of the embodiments of the present application and may include more or fewer steps than those shown, or some of the steps may be combined, or different steps may be included.
The above-described embodiments of the apparatus are merely illustrative, wherein the units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like (if any) in the description of the present application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the above-described units is only one type of logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes multiple instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing programs, such as a USB flash disk, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and the scope of the claims of the embodiments of the present application is not limited thereby. Any modifications, equivalents, and improvements that may occur to those skilled in the art without departing from the scope and spirit of the embodiments of the present application are intended to be within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method of speech synthesis, the method comprising:
acquiring original phoneme data to be processed, wherein the original phoneme data is text data;
inputting the original phoneme data into a preset speech synthesis model; wherein the speech synthesis model comprises an acoustic network and a generation network;
coding the original phoneme data through the acoustic network to obtain a phoneme feature vector;
performing prosody label recognition on the phoneme feature vector through the acoustic network to obtain prosody label features of the original phoneme data;
extracting acoustic features of the phoneme feature vectors through the acoustic network to obtain vector quantization features of the original phoneme data;
performing feature prediction according to the vector quantization features and the prosody label features to obtain target prosody features of the original phoneme data;
and performing voice synthesis on the target prosodic features and the vector quantization features through the generation network to obtain target voice data.
2. The speech synthesis method of claim 1, wherein the acoustic network comprises a first LSTM layer and a decoder, and wherein the performing prosody label recognition on the phoneme feature vector through the acoustic network to obtain prosody label features of the original phoneme data comprises:
performing prosodic feature extraction on the phoneme feature vector through the first LSTM layer to obtain an initial prosodic feature corresponding to the phoneme feature vector;
clustering the initial prosody features through a preset clustering algorithm and reference clustering labels to obtain target clustering labels of the original phoneme data;
and decoding the target clustering label through the decoder to obtain the prosody label feature.
3. The speech synthesis method of claim 1, wherein the acoustic network comprises a first LSTM layer, a decoder, and a second LSTM layer, and wherein the performing acoustic feature extraction on the phoneme feature vector through the acoustic network to obtain vector quantization features of the original phoneme data comprises:
performing prosodic feature extraction on the phoneme feature vector through the first LSTM layer to obtain an initial prosodic feature corresponding to the phoneme feature vector;
decoding the initial prosodic features through the decoder to obtain initial Mel cepstrum features;
and performing prediction processing on the initial Mel cepstrum features through the second LSTM layer and a preset acoustic feature label to obtain the vector quantization features.
4. The speech synthesis method of claim 1, wherein the performing feature prediction according to the vector quantization features and the prosody label features to obtain target prosody features of the original phoneme data comprises:
splicing the vector quantization features and the prosody label features to obtain predicted prosody features;
carrying out layer normalization processing on the predicted prosody features to obtain intermediate prosody features;
screening the intermediate prosodic features according to preset reference prosodic parameters to obtain three-dimensional prosodic features;
and carrying out standardization processing on the three-dimensional prosody features to obtain the target prosody features.
5. The speech synthesis method according to claim 4, wherein the carrying out standardization processing on the three-dimensional prosody features to obtain the target prosody features comprises:
calculating the mean value of the three-dimensional prosody features to obtain a prosody feature mean value;
performing variance calculation on the three-dimensional prosody features to obtain a prosody feature variance value;
and carrying out standardization processing on the three-dimensional prosody features according to the prosody feature mean value and the prosody feature variance value to obtain the target prosody features.
6. The speech synthesis method according to any one of claims 1 to 5, wherein the performing speech synthesis on the target prosody features and the vector quantization features through the generation network to obtain target speech data comprises:
performing convolution processing on the target prosody features through the convolution layers of the generation network to obtain candidate prosody feature vectors, and performing convolution processing on the vector quantization features through the convolution layers to obtain candidate quantization feature vectors;
splicing the candidate quantization feature vectors and the candidate prosody feature vectors to obtain target acoustic features;
smoothing the target acoustic features through a feature encoder of the generating network to obtain acoustic feature vectors;
and performing voice generation processing on the acoustic feature vector through the vocoder of the generation network to obtain the target voice data.
7. The speech synthesis method according to any one of claims 1 to 5, wherein the obtaining raw phoneme data to be processed comprises:
acquiring an original text;
filtering the original text according to a preset language type to obtain an initial text;
carrying out format conversion on the initial text according to a preset format template to obtain a target text;
and performing data conversion on the target text through a preset text conversion model and a reference dictionary to obtain the original phoneme data.
8. A speech synthesis apparatus, characterized in that the apparatus comprises:
the data acquisition module is used for acquiring original phoneme data to be processed, wherein the original phoneme data is text data;
the input module is used for inputting the original phoneme data into a preset speech synthesis model; wherein the speech synthesis model comprises an acoustic network and a generation network;
the coding module is used for coding the original phoneme data through the acoustic network to obtain a phoneme feature vector;
a label identification module, configured to perform prosody label identification on the phoneme feature vector through the acoustic network to obtain prosody label features of the original phoneme data;
the acoustic feature extraction module is used for extracting acoustic features of the phoneme feature vectors through the acoustic network to obtain vector quantization features of the original phoneme data;
the feature prediction module is used for performing feature prediction according to the vector quantization features and the prosody label features to obtain target prosody features of the original phoneme data;
and the voice synthesis module is used for carrying out voice synthesis on the target prosody features and the vector quantization features through the generation network to obtain target voice data.
9. An electronic device, characterized in that the electronic device comprises a memory, in which a computer program is stored, and a processor, which, when executing the computer program, implements the speech synthesis method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, implements the speech synthesis method according to any one of claims 1 to 7.
CN202211102117.1A 2022-09-09 2022-09-09 Speech synthesis method, speech synthesis device, electronic apparatus, and storage medium Pending CN115620702A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211102117.1A CN115620702A (en) 2022-09-09 2022-09-09 Speech synthesis method, speech synthesis device, electronic apparatus, and storage medium

Publications (1)

Publication Number Publication Date
CN115620702A true CN115620702A (en) 2023-01-17

Family

ID=84859280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211102117.1A Pending CN115620702A (en) 2022-09-09 2022-09-09 Speech synthesis method, speech synthesis device, electronic apparatus, and storage medium

Country Status (1)

Country Link
CN (1) CN115620702A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination