CN116386594A - Speech synthesis method, speech synthesis device, electronic device, and storage medium - Google Patents

Speech synthesis method, speech synthesis device, electronic device, and storage medium

Info

Publication number
CN116386594A
Authority
CN
China
Prior art keywords
data
spectrum data
target
initial
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310406038.8A
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
唐浩彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310406038.8A
Publication of CN116386594A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/02 Speech or audio signals analysis-synthesis techniques for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L 19/032 Quantisation or dequantisation of spectral components
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction using predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The embodiments of the present application provide a speech synthesis method, a speech synthesis apparatus, an electronic device, and a storage medium, belonging to the technical field of artificial intelligence. The method comprises: acquiring original phoneme data; inputting the original phoneme data into a speech synthesis model; performing a first spectrum prediction on the original phoneme data based on a first prediction network and a speaking object corpus to obtain initial spectrum data and an initial speech embedding vector; filtering the initial spectrum data based on the initial speech embedding vector to obtain candidate spectrum data; performing a second spectrum prediction on the candidate spectrum data based on a second prediction network and the speaking object corpus to obtain intermediate spectrum data and a target speech embedding vector; reconstructing the intermediate spectrum data based on the target speech embedding vector to obtain target spectrum data; and performing vocoder conversion on the target spectrum data based on a decoding network to obtain speech synthesis data. The embodiments of the present application can improve the accuracy of speech synthesis.

Description

Speech synthesis method, speech synthesis device, electronic device, and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a speech synthesis method, a speech synthesis apparatus, an electronic device, and a storage medium.
Background
Training the neural network models that current speech synthesis methods rely on typically requires a large amount of labeled data, and obtaining high-quality labeled data at scale in real-world scenarios is expensive. As a result, little labeled data is available for model training, which degrades the training effect of the model and, in turn, the accuracy of the speech synthesis data produced by the model. How to improve the accuracy of speech synthesis has therefore become an urgent technical problem.
Disclosure of Invention
The main purpose of the embodiments of the present application is to provide a speech synthesis method, a speech synthesis apparatus, an electronic device, and a storage medium, with the aim of improving the accuracy of speech synthesis.
To achieve the above object, a first aspect of an embodiment of the present application proposes a speech synthesis method, including:
acquiring original phoneme data, wherein the original phoneme data is text data;
inputting the original phoneme data into a preset speech synthesis model, wherein the speech synthesis model comprises a first prediction network, a second prediction network, and a decoding network;
performing a first spectrum prediction on the original phoneme data based on the first prediction network and a pre-acquired speaking object corpus to obtain initial spectrum data and an initial speech embedding vector, wherein the initial speech embedding vector is used to characterize the speaking style of a reference speaking object in the speaking object corpus;
filtering the initial spectrum data based on the initial speech embedding vector to obtain candidate spectrum data;
performing a second spectrum prediction on the candidate spectrum data based on the second prediction network and the speaking object corpus to obtain intermediate spectrum data and a target speech embedding vector, wherein the target speech embedding vector is used to characterize the speaking style of a target speaking object in the speaking object corpus;
reconstructing the intermediate spectrum data based on the target speech embedding vector to obtain target spectrum data, wherein the target spectrum data is a mel spectrogram;
and performing vocoder conversion on the target spectrum data based on the decoding network to obtain speech synthesis data.
In some embodiments, the first prediction network comprises a first encoding layer and a first decoding layer, and performing the first spectrum prediction on the original phoneme data based on the first prediction network and the pre-acquired speaking object corpus to obtain the initial spectrum data and the initial speech embedding vector comprises:
acquiring a first duration parameter of the original phoneme data;
encoding the original phoneme data through the first encoding layer and the first duration parameter to obtain a phoneme hidden layer representation vector;
performing feature extraction on the phoneme hidden layer representation vector based on the speaking object corpus to obtain the initial speech embedding vector;
and performing attention calculation on the phoneme hidden layer representation vector through the first decoding layer to obtain the initial spectrum data.
In some embodiments, the obtaining the first duration parameter of the original phoneme data includes:
inputting the original phoneme data into a preset time prediction model, wherein the time prediction model comprises a convolution layer, an activation layer, a normalization layer and a linear layer;
extracting features of the original phoneme data through the convolution layer to obtain a phoneme feature vector;
mapping the phoneme feature vector to a preset time domain space through the activation layer to obtain an initial prediction time feature;
normalizing the initial prediction time feature through the normalization layer to obtain a target prediction time feature;
and mapping the target prediction time feature to a preset data space through the linear layer to obtain the first duration parameter.
In some embodiments, the second prediction network comprises a second encoding layer and a second decoding layer, and performing the second spectrum prediction on the candidate spectrum data based on the second prediction network and the speaking object corpus to obtain the intermediate spectrum data and the target speech embedding vector comprises:
acquiring a second duration parameter of the candidate spectrum data;
encoding the candidate spectrum data through the second encoding layer and the second duration parameter to obtain a speech hidden layer representation vector;
performing feature extraction on the speech hidden layer representation vector based on the speaking object corpus to obtain the target speech embedding vector;
and performing attention calculation on the speech hidden layer representation vector through the second decoding layer to obtain the intermediate spectrum data.
In some embodiments, filtering the initial spectrum data based on the initial speech embedding vector to obtain the candidate spectrum data comprises:
performing feature recognition on the initial spectrum data according to the initial speech embedding vector to obtain background spectrum features and key spectrum features, wherein the key spectrum features contain the speech content corresponding to the original phoneme data and the background spectrum features do not;
and filtering out the background spectrum features in the initial spectrum data to obtain the candidate spectrum data.
In some embodiments, reconstructing the intermediate spectrum data based on the target speech embedding vector to obtain the target spectrum data comprises:
performing feature extraction on the target speech embedding vector to obtain the timbre features of the target speaking object;
and embedding the timbre features into the intermediate spectrum data to obtain the target spectrum data, wherein the target spectrum data contains the timbre features of the target speaking object.
In some embodiments, the decoding network comprises an up-sampling module and a residual module, and performing vocoder conversion on the target spectrum data based on the decoding network to obtain the speech synthesis data comprises:
up-sampling the target spectrum data through the up-sampling module to obtain initial speech features;
and performing vocoder conversion on the initial speech features through the residual module to obtain the speech synthesis data.
To achieve the above object, a second aspect of the embodiments of the present application proposes a speech synthesis apparatus, the apparatus comprising:
a data acquisition module, configured to acquire original phoneme data, wherein the original phoneme data is text data;
an input module, configured to input the original phoneme data into a preset speech synthesis model, wherein the speech synthesis model comprises a first prediction network, a second prediction network, and a decoding network;
a first spectrum prediction module, configured to perform a first spectrum prediction on the original phoneme data based on the first prediction network and a pre-acquired speaking object corpus to obtain initial spectrum data and an initial speech embedding vector, wherein the initial speech embedding vector is used to characterize the speaking style of a reference speaking object in the speaking object corpus;
a filtering module, configured to filter the initial spectrum data based on the initial speech embedding vector to obtain candidate spectrum data;
a second spectrum prediction module, configured to perform a second spectrum prediction on the candidate spectrum data based on the second prediction network and the speaking object corpus to obtain intermediate spectrum data and a target speech embedding vector, wherein the target speech embedding vector is used to characterize the speaking style of a target speaking object in the speaking object corpus;
a reconstruction module, configured to reconstruct the intermediate spectrum data based on the target speech embedding vector to obtain target spectrum data, wherein the target spectrum data is a mel spectrogram;
and a vocoder conversion module, configured to perform vocoder conversion on the target spectrum data based on the decoding network to obtain speech synthesis data.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, comprising a memory and a processor, wherein the memory stores a computer program and the processor implements the method described in the first aspect when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method of the first aspect.
According to the speech synthesis method, the speech synthesis apparatus, the electronic device, and the computer-readable storage medium of the embodiments of the present application, original phoneme data, which is text data, is acquired and input into a preset speech synthesis model comprising a first prediction network, a second prediction network, and a decoding network. A first spectrum prediction is performed on the original phoneme data based on the first prediction network and a pre-acquired speaking object corpus to obtain initial spectrum data and an initial speech embedding vector, so that a preliminary prediction of the original phoneme data is achieved and the phoneme feature information in the original phoneme data is extracted. The initial spectrum data is filtered based on the initial speech embedding vector to obtain candidate spectrum data, and a second spectrum prediction is performed on the candidate spectrum data based on the second prediction network and the speaking object corpus to obtain intermediate spectrum data and a target speech embedding vector, so that the original phoneme data is predicted again, the quality of the spectrum data is further optimized, and more accurate intermediate spectrum data is obtained. Further, the intermediate spectrum data is reconstructed based on the target speech embedding vector to obtain target spectrum data, and vocoder conversion is performed on the target spectrum data based on the decoding network to obtain speech synthesis data. In this way, the target speech embedding vector is fused into the intermediate spectrum data, the final target spectrum data contains the speaking style characteristics of the target speaking object, and the accuracy of speech synthesis is improved.
Drawings
FIG. 1 is a flow chart of a speech synthesis method provided in an embodiment of the present application;
fig. 2 is a flowchart of step S103 in fig. 1;
fig. 3 is a flowchart of step S201 in fig. 2;
fig. 4 is a flowchart of step S104 in fig. 1;
fig. 5 is a flowchart of step S105 in fig. 1;
fig. 6 is a flowchart of step S106 in fig. 1;
fig. 7 is a flowchart of step S107 in fig. 1;
fig. 8 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
fig. 9 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although functional modules are divided in the device diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order from the module division in the device or the order in the flowchart. The terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar elements and are not necessarily used to describe a particular sequence or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several terms referred to in this application are explained:
Artificial intelligence (AI): a technical science that researches and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Natural language processing (NLP): a branch of artificial intelligence at the intersection of computer science and linguistics, often referred to as computational linguistics, concerned with processing, understanding, and applying human languages (e.g., Chinese, English). Natural language processing includes syntactic parsing, semantic analysis, discourse understanding, and so on. It is commonly used in machine translation, handwritten and printed character recognition, speech recognition and text-to-speech conversion, intent recognition, information extraction and filtering, text classification and clustering, public opinion analysis and opinion mining, and related fields, and involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, and linguistic research related to language computation.
Information extraction: a text processing technique that extracts specified types of factual information, such as entities, relations, and events, from natural language text and outputs it as structured data. Text data is made up of specific units such as sentences, paragraphs, and chapters, and text information is made up of smaller units such as words, phrases, sentences, and paragraphs, or combinations of them. Extracting noun phrases, person names, place names, and the like from text data is text information extraction, and the extracted information can be of various types.
Mel-frequency cepstral coefficients (MFCC): a set of key coefficients used to create a mel-frequency cepstrum. From a segment of a music signal, a set of cepstral coefficients sufficient to represent the signal is obtained, and the mel-frequency cepstral coefficients are the coefficients derived from this cepstrum. Unlike the general cepstrum, the distinguishing feature of the mel-frequency cepstrum is that its frequency bands are evenly distributed on the mel scale, which approximates the nonlinear human auditory system more closely than the linearly spaced bands of the ordinary cepstrum. For example, the mel-frequency cepstrum is often used in audio compression.
Phoneme: the smallest unit of speech, divided according to the natural attributes of speech and analyzed according to the articulatory actions within a syllable; one articulatory action forms one phoneme.
Activation function: a function running on the neurons of an artificial neural network, responsible for mapping the neuron's inputs to its output.
Encoding (Encoder): converts an input sequence into a fixed-length vector.
Decoding (Decoder): converts the previously generated fixed-length vector into an output sequence; the input sequence may be text, speech, images, or video, and the output sequence may be text or images.
Speech synthesis refers to synthesizing intelligible, natural speech from text, and is also known as Text-To-Speech (TTS).
As noted above, training the neural network models that current speech synthesis methods rely on typically requires a large amount of labeled data, and obtaining high-quality labeled data at scale in real-world scenarios is expensive. As a result, little labeled data is available for model training, which degrades the training effect of the model and, in turn, the accuracy of the speech synthesis data produced by the model. How to improve the accuracy of speech synthesis has therefore become an urgent technical problem.
Based on this, the embodiments of the present application provide a speech synthesis method, a speech synthesis apparatus, an electronic device, and a storage medium, aiming to improve the accuracy of speech synthesis.
The speech synthesis method, the speech synthesis device, the electronic apparatus and the storage medium provided in the embodiments of the present application are specifically described by the following embodiments, and the speech synthesis method in the embodiments of the present application is first described.
The embodiments of the present application can acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiments of the present application provide a speech synthesis method that relates to the technical field of artificial intelligence. The speech synthesis method provided by the embodiments of the present application can be applied to a terminal, to a server, or to software running on a terminal or server. In some embodiments, the terminal may be a smart phone, a tablet computer, a notebook computer, a desktop computer, and so on; the server may be configured as an independent physical server, as a server cluster or distributed system composed of a plurality of physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms; the software may be an application that implements the speech synthesis method, but is not limited to the above forms.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In the embodiments of the present application, when related processing is required according to user information, user behavior data, user history data, user location information, and other data related to user identity or characteristics, permission or consent of the user is obtained first, and the collection, use, processing, and the like of the data comply with related laws and regulations and standards of related countries and regions. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through a popup window or a jump to a confirmation page or the like, and after the independent permission or independent consent of the user is explicitly acquired, necessary user related data for enabling the embodiment of the application to normally operate is acquired.
Fig. 1 is an optional flowchart of a speech synthesis method provided in an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S107.
Step S101, original phoneme data is obtained, wherein the original phoneme data is text data;
step S102, inputting original phoneme data into a preset speech synthesis model, wherein the speech synthesis model comprises a first prediction network, a second prediction network and a decoding network;
step S103, carrying out first frequency spectrum prediction on the original phoneme data based on a first prediction network and a pre-acquired speaking object corpus to obtain initial frequency spectrum data and an initial speech embedding vector, wherein the initial speech embedding vector is used for representing speaking style characteristics of a reference speaking object in the speaking object corpus;
step S104, filtering the initial spectrum data based on the initial voice embedding vector to obtain candidate spectrum data;
step S105, performing second spectrum prediction on the candidate spectrum data based on a second prediction network and a speaking object corpus to obtain intermediate spectrum data and a target speech embedding vector, wherein the target speech embedding vector is used for representing speaking style characteristics of a target speaking object in the speaking object corpus;
Step S106, reconstructing the intermediate frequency spectrum data based on the target voice embedded vector to obtain target frequency spectrum data, wherein the target frequency spectrum data is a Mel spectrogram;
step S107, performing sound code conversion on the target frequency spectrum data based on the decoding network to obtain voice synthesis data.
Through steps S101 to S107 of the embodiments of the present application, original phoneme data, which is text data, is acquired and input into a preset speech synthesis model comprising a first prediction network, a second prediction network, and a decoding network. A first spectrum prediction is performed on the original phoneme data based on the first prediction network and the pre-acquired speaking object corpus to obtain the initial spectrum data and the initial speech embedding vector, so that a preliminary prediction of the original phoneme data is achieved and the phoneme feature information in the original phoneme data is extracted. The initial spectrum data is filtered based on the initial speech embedding vector to obtain the candidate spectrum data, and a second spectrum prediction is performed on the candidate spectrum data based on the second prediction network and the speaking object corpus to obtain the intermediate spectrum data and the target speech embedding vector, so that the original phoneme data is predicted again, the quality of the spectrum data is further optimized, and more accurate intermediate spectrum data is obtained. Further, the intermediate spectrum data is reconstructed based on the target speech embedding vector to obtain the target spectrum data, and vocoder conversion is performed on the target spectrum data based on the decoding network to obtain the speech synthesis data. The target speech embedding vector is thus fused into the intermediate spectrum data, the final target spectrum data contains the speaking style characteristics of the target speaking object, and the accuracy of speech synthesis is improved.
In step S101 of some embodiments, original text data may be obtained from a public data set, from an existing text database, from a network platform, or from other sources, without limitation. For example, the public data set may be the THCHS-30 data set or the LJSpeech data set, and the original text data may include lecture documents, song lyrics, dubbing scripts, and so on. The original text data is then converted according to a preset reference dictionary to obtain the original phoneme data. The reference dictionary may be a data dictionary containing a plurality of words, from which a word list can be built with one word per row. The text content of the original text data is converted through this word list, and the words corresponding to the text content are mapped to a phoneme sequence, forming the original phoneme data in text form. In this way, the original phoneme data can be acquired conveniently and the efficiency of data acquisition is improved.
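As a purely illustrative sketch of the dictionary-based conversion described above, the following snippet maps words to phoneme sequences through a small lookup table; the dictionary entries and function name are hypothetical rather than taken from the patent, and a real system would also need a fallback for out-of-vocabulary words.

```python
# Illustrative only: dictionary-based text-to-phoneme conversion.
# The reference dictionary maps each word to its phoneme sequence; the entries below are made up.
reference_dictionary = {
    "speech": ["S", "P", "IY1", "CH"],
    "synthesis": ["S", "IH1", "N", "TH", "AH0", "S", "IH0", "S"],
}

def text_to_phonemes(text: str) -> list[str]:
    """Convert raw text into a flat phoneme sequence by word-level lookup."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(reference_dictionary.get(word, []))  # unknown words are skipped here
    return phonemes

print(text_to_phonemes("Speech synthesis"))
# ['S', 'P', 'IY1', 'CH', 'S', 'IH1', 'N', 'TH', 'AH0', 'S', 'IH0', 'S']
```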
In step S102 of some embodiments, the original phoneme data may be input into a preset speech synthesis model by a preset script or another computer program. The speech synthesis model comprises a first prediction network, a second prediction network, and a decoding network. The first prediction network mainly performs a preliminary spectrum prediction on the original phoneme data to obtain initial spectrum data and the initial speech embedding vector of each reference speaking object. The second prediction network mainly performs a second spectrum prediction from the initial spectrum data and the initial speech embedding vector to obtain more accurate intermediate spectrum data and the target speech embedding vector of the target speaking object; the target speech embedding vector, which characterizes the speaking characteristics of the target speaking object, is then embedded into the intermediate spectrum data to obtain the target spectrum data. The decoding network mainly decodes the features of the target spectrum data and extracts the speech synthesis data in waveform form. In this way, the speech synthesis model can better embed the speech characteristics of the target speaking object into the speech synthesis data, so that the speech synthesis data contains the required prosodic characteristics and speech content, which improves the accuracy of speech synthesis.
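For orientation only, the pipeline described above can be pictured as the following PyTorch-style skeleton. The module names, method signatures, and placeholder steps are assumptions made for illustration and do not reproduce the patented model.

```python
import torch
import torch.nn as nn

class SpeechSynthesisModel(nn.Module):
    """Illustrative skeleton: two prediction networks followed by a decoding network."""

    def __init__(self, first_net: nn.Module, second_net: nn.Module, decoder: nn.Module):
        super().__init__()
        self.first_net = first_net    # phonemes -> (initial spectrum, initial speech embedding)
        self.second_net = second_net  # candidate spectrum -> (intermediate spectrum, target speech embedding)
        self.decoder = decoder        # target spectrum -> waveform

    @staticmethod
    def filter_background(spectrum: torch.Tensor, embedding: torch.Tensor) -> torch.Tensor:
        # placeholder for the filtering step; a fuller sketch appears later in the description
        return spectrum

    @staticmethod
    def reconstruct(spectrum: torch.Tensor, embedding: torch.Tensor) -> torch.Tensor:
        # placeholder for the timbre-embedding step; a fuller sketch appears later in the description
        return spectrum

    def forward(self, phonemes: torch.Tensor, corpus_embeddings: torch.Tensor) -> torch.Tensor:
        init_spec, init_emb = self.first_net(phonemes, corpus_embeddings)  # first spectrum prediction
        cand_spec = self.filter_background(init_spec, init_emb)            # filtering
        mid_spec, tgt_emb = self.second_net(cand_spec, corpus_embeddings)  # second spectrum prediction
        tgt_spec = self.reconstruct(mid_spec, tgt_emb)                     # reconstruction
        return self.decoder(tgt_spec)                                      # vocoder conversion to waveform
```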
It should be noted that, during training of the speech synthesis model, the speaking object corpus may be constructed from the audio corpus collected for speech synthesis. The process may specifically include:
a. obtaining speech data of reference speaking objects by web crawling, extraction from a preset database, downloading from a network platform, or similar means, and dividing the obtained speech data into a single speaking object corpus and a multiple speaking object corpus, wherein the single speaking object corpus contains speech data of only one fixed speaking object and the multiple speaking object corpus contains speech data of a plurality of different speaking objects;
b. force-aligning all speech data at the phoneme level and dividing the speech data into speech segments of fixed length, that is, every speech segment has the same number of frames;
c. using the phoneme-aligned single speaking object corpus to train the speech synthesis model, performing speech synthesis on the multiple speaking object corpus with the trained speech synthesis model to obtain synthesized data consistent with the transcription and phoneme durations of the multiple speaking object corpus, and gathering the synthesized data into one set, which serves as the final speaking object corpus.
The speaking object corpus can assist model training so that the speech synthesis model achieves good performance, and it can also be applied in the subsequent speech synthesis process, improving the accuracy with which the speaking characteristics of the target speaking object are acquired, so that those characteristics are fused into the generated speech synthesis data and high-quality speech synthesis data can be produced.
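As an illustration of step b above, the following sketch cuts a phoneme-aligned feature sequence into fixed-length segments; the frame counts, feature dimension, and function name are assumptions, not values from the patent.

```python
import numpy as np

def split_into_fixed_segments(features: np.ndarray, segment_frames: int = 200) -> list[np.ndarray]:
    """Split a (num_frames, feat_dim) matrix into equal-length segments,
    dropping the trailing remainder so that every segment has the same number of frames."""
    num_segments = features.shape[0] // segment_frames
    return [features[i * segment_frames:(i + 1) * segment_frames] for i in range(num_segments)]

utterance = np.random.randn(1050, 80)            # e.g. 1050 frames of an 80-bin mel spectrogram
segments = split_into_fixed_segments(utterance)  # 5 segments of 200 frames each
```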
Referring to fig. 2, in some embodiments, the first prediction network includes a first encoding layer and a first decoding layer, and step S103 may include, but is not limited to, steps S201 to S204:
step S201, acquiring a first duration parameter of original phoneme data;
step S202, encoding the original phoneme data through the first encoding layer and the first duration parameter to obtain a phoneme hidden layer representation vector;
step S203, performing feature extraction on the phoneme hidden layer representation vector based on the speaking object corpus to obtain the initial speech embedding vector;
step S204, performing attention calculation on the phoneme hidden layer representation vector through the first decoding layer to obtain the initial spectrum data.
In step S201 of some embodiments, the original phoneme data may be input into a preset time prediction model, which performs feature extraction and feature mapping on the original phoneme data to obtain its temporal features in a time domain space. These temporal features are then normalized to obtain the first duration parameter, which represents the number of frames of each phoneme in the original phoneme data; the predicted duration represented by the first duration parameter affects the pronunciation length and prosodic characteristics of the generated speech synthesis data.
In step S202 of some embodiments, the original phoneme data is encoded through the first encoding layer and the first duration parameter to obtain the phoneme sequence features of the original phoneme data and to convert the textual phoneme sequence into vector form, yielding the phoneme hidden layer representation vector.
In step S203 of some embodiments, the speaking object corpus is traversed, the phoneme hidden layer representation vector is compared against the speech data of the reference speaking objects in the speaking object corpus, and the similarity between the phoneme hidden layer representation vector and the candidate speech embedding vectors of each reference speaking object's speech data is computed, producing a set of feature similarity values. A candidate speech embedding vector whose feature similarity value is greater than or equal to a preset feature threshold is taken as the initial speech embedding vector of the reference speaking object. A candidate speech embedding vector can be obtained by encoding the speech data of a reference speaking object with a preset feature encoder; the reference speaking objects include different lecturers, different singers, or speakers of other identities, and the initial speech embedding vector characterizes the speaking style of a reference speaking object in the speaking object corpus.
In step S204 of some embodiments, when the first decoding layer performs attention calculation on the phoneme hidden layer representation vector, the query, key, and value matrices of the phoneme hidden layer representation vector are computed first. A softmax function is then applied to these matrices to obtain the attention matrix corresponding to the phoneme hidden layer representation vector, and all feature vectors in the phoneme hidden layer representation vector are weighted according to the attention matrix to obtain the initial spectrum data.
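A minimal sketch of the attention computation described in step S204 is given below, using standard scaled dot-product attention with a softmax over the query-key scores; the projection matrices, dimensions, and scaling factor are assumptions about how the decoding layer could realize the described weighting, not the patent's exact formulation.

```python
import torch
import torch.nn.functional as F

def attention(hidden: torch.Tensor, w_q: torch.Tensor, w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    """hidden: (seq_len, d_model); w_q / w_k / w_v: (d_model, d_k) projection matrices."""
    q, k, v = hidden @ w_q, hidden @ w_k, hidden @ w_v
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)  # query-key similarity
    weights = F.softmax(scores, dim=-1)                      # attention matrix
    return weights @ v                                       # weighted sum over all positions

d_model, d_k, seq_len = 256, 64, 120
hidden = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
context = attention(hidden, w_q, w_k, w_v)  # shape (120, 64)
```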
Preliminary prediction of the original phoneme data can be achieved through the steps S201 to S204, and the phoneme characteristic information in the original phoneme data is extracted to obtain the initial spectrum data corresponding to the original phoneme data.
Referring to fig. 3, in some embodiments, step S201 may include, but is not limited to, steps S301 to S305:
step S301, inputting original phoneme data into a preset time prediction model, wherein the time prediction model comprises a convolution layer, an activation layer, a normalization layer and a linear layer;
step S302, extracting features of original phoneme data through a convolution layer to obtain a phoneme feature vector;
step S303, mapping the phoneme feature vector to a preset time domain space through an activation layer to obtain an initial prediction time feature;
step S304, normalizing the initial prediction time feature through the normalization layer to obtain a target prediction time feature;
step S305, mapping the target prediction time feature to a preset data space through the linear layer to obtain the first duration parameter.
In step S301 of some embodiments, the original phoneme data is input into a predetermined temporal prediction model including a convolution layer, an activation layer, a normalization layer, and a linear layer by a predetermined script program or a computer program.
In step S302 of some embodiments, convolution transposition processing may be performed on the original phoneme data by using a convolution layer, so as to extract important phoneme features in the original phoneme data, and obtain a phoneme feature vector.
In step S303 of some embodiments, the phoneme feature vector is mapped to a preset time domain space through an activation function of the activation layer to obtain the initial prediction time feature, where the activation function may be, without limitation, a sigmoid, ReLU, or tanh activation function.
In step S304 of some embodiments, the normalization layer maps the initial predicted time feature to a preset mean-variance distribution, where the mean of the preset mean-variance distribution is 0 and the variance is 1, so that zero-averaging of the initial predicted time feature can be achieved, and the target predicted time feature is obtained.
In step S305 of some embodiments, mapping the target prediction time feature to a preset data space through a linear layer, and mapping the target prediction time feature from a vector space to the data space is implemented, to obtain a first duration parameter, where the first duration parameter may represent a frame number of each phoneme.
Through steps S301 to S305, the predicted duration of the original phoneme data, that is, the first duration parameter, can be obtained conveniently, so that the total length of the phoneme hidden layer representation vector can be adjusted according to the first duration parameter. The speech duration it represents then becomes more accurate, and the consistency of the generated speech synthesis data is improved.
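A minimal sketch of a duration predictor with the four layers named above (convolution, activation, normalization, linear) follows, assuming 1-D convolution over the phoneme sequence; the hidden size, kernel size, and the absence of an output non-negativity constraint are illustrative simplifications rather than the patent's configuration.

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Predicts one duration value (number of frames) per phoneme."""

    def __init__(self, d_model: int = 256, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)  # convolution layer
        self.activation = nn.ReLU()                                                     # activation layer
        self.norm = nn.LayerNorm(d_model)                                               # normalization layer
        self.linear = nn.Linear(d_model, 1)                                             # linear layer

    def forward(self, phoneme_features: torch.Tensor) -> torch.Tensor:
        # phoneme_features: (batch, seq_len, d_model)
        x = self.conv(phoneme_features.transpose(1, 2)).transpose(1, 2)
        x = self.activation(x)
        x = self.norm(x)
        return self.linear(x).squeeze(-1)  # (batch, seq_len): predicted frames per phoneme

durations = DurationPredictor()(torch.randn(2, 17, 256))  # shape (2, 17)
```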
Referring to fig. 4, in some embodiments, step S104 may include, but is not limited to, steps S401 to S402:
step S401, performing feature recognition on the initial spectrum data according to the initial speech embedding vector to obtain background spectrum features and key spectrum features, wherein the key spectrum features contain the speech content corresponding to the original phoneme data and the background spectrum features do not;
step S402, filtering background spectrum features in the initial spectrum data to obtain candidate spectrum data.
In step S401 of some embodiments, the initial speech embedding vector is compared with each frame of the initial spectrum data. If a frame contains feature information identical or similar to the initial speech embedding vector, that frame is treated as a key spectrum feature of the initial spectrum data; the key spectrum features contain the speech content corresponding to the original phoneme data. Conversely, if a frame contains no feature information identical or similar to the initial speech embedding vector, it is treated as a background spectrum feature, which contains no speech content corresponding to the original phoneme data. The feature information here includes timbre, pitch, and other speech information.
In step S402 of some embodiments, the background spectrum features in the initial spectrum data are filtered out, and the retained key spectrum features are taken as the candidate spectrum data. For example, the initial spectrum data is divided into a plurality of spectrum segments, the segments containing background spectrum features are removed, and the remaining segments containing key spectrum features are spliced in chronological order to obtain the candidate spectrum data.
Through steps S401 and S402, the useful spectrum information in the initial spectrum data can be identified conveniently and the spectrum data representing background sound is removed, which reduces the interference of background noise with speech synthesis, improves the spectrum quality of the candidate spectrum data, and thereby improves the accuracy of speech synthesis.
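The per-frame filtering of steps S401 and S402 can be sketched as follows, assuming that each spectrum frame and the initial speech embedding vector share a feature dimension and that cosine similarity against a fixed threshold decides which frames are key frames; both assumptions are illustrative rather than taken from the patent.

```python
import torch
import torch.nn.functional as F

def filter_background(spectrum: torch.Tensor, speech_embedding: torch.Tensor,
                      threshold: float = 0.5) -> torch.Tensor:
    """spectrum: (num_frames, feat_dim); speech_embedding: (feat_dim,).
    Keep the frames whose similarity to the speech embedding reaches the threshold
    and concatenate them in their original time order."""
    similarity = F.cosine_similarity(spectrum, speech_embedding.unsqueeze(0), dim=-1)
    key_frames = similarity >= threshold  # key spectrum features
    return spectrum[key_frames]           # background frames are dropped

candidate = filter_background(torch.randn(300, 80), torch.randn(80))
```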
Referring to fig. 5, in some embodiments, the second prediction network includes a second encoding layer and a second decoding layer, and step S105 may include, but is not limited to, steps S501 to S504:
step S501, acquiring a second duration parameter of the candidate spectrum data;
step S502, encoding the candidate spectrum data through the second encoding layer and the second duration parameter to obtain a speech hidden layer representation vector;
step S503, performing feature extraction on the speech hidden layer representation vector based on the speaking object corpus to obtain the target speech embedding vector;
step S504, performing attention calculation on the speech hidden layer representation vector through the second decoding layer to obtain the intermediate spectrum data.
In step S501 of some embodiments, the candidate spectrum data may be input into the preset time prediction model, which performs feature extraction and feature mapping on the candidate spectrum data to obtain its temporal features in the time domain space. These temporal features are then normalized to obtain the second duration parameter, which represents the number of frames of each phoneme in the candidate spectrum data; the predicted duration represented by the second duration parameter affects the pronunciation length and prosodic characteristics of the generated speech synthesis data. The specific procedure for obtaining the second duration parameter is substantially the same as that of steps S301 to S305 and is not repeated here.
In step S502 of some embodiments, the candidate spectrum data is encoded through the second encoding layer and the second duration parameter to obtain the phoneme sequence features in the candidate spectrum data and the speech hidden layer representation vector.
In step S503 of some embodiments, the speaking object corpus is traversed, the speech hidden layer representation vector is compared against the speech data of the target speaking object in the speaking object corpus, and the similarity between the speech hidden layer representation vector and the candidate speech embedding vectors of the target speaking object's speech data is computed, producing speech similarity values. A candidate speech embedding vector whose speech similarity value is greater than or equal to a preset similarity threshold is taken as the target speech embedding vector of the target speaking object. A candidate speech embedding vector can be obtained by encoding the speech data of the target speaking object with the preset feature encoder; the target speaking object may be a lecturer, a singer, or a speaker of some other identity, and the target speech embedding vector characterizes the speaking style of the target speaking object in the speaking object corpus.
In step S504 of some embodiments, when the second decoding layer performs attention calculation on the speech hidden layer representation vector, the query, key, and value matrices of the speech hidden layer representation vector are computed first. A softmax function is then applied to them to obtain the attention matrix corresponding to the speech hidden layer representation vector, and all feature vectors in the speech hidden layer representation vector are weighted according to the attention matrix to obtain the intermediate spectrum data. Because the interference of background sound was removed in steps S401 and S402, the intermediate spectrum data is more accurate than the initial spectrum data.
Through steps S501 to S504, the original phoneme data is predicted again: the speech feature information in the candidate spectrum data and the speaking characteristics of the target speaking object are extracted, the quality of the spectrum data is further optimized to obtain more accurate intermediate spectrum data, and the target speech embedding vector characterizing the speaking style of the target speaking object is obtained. The speaking style of the target speaking object can therefore be fused into the intermediate spectrum data in the subsequent speech synthesis process, improving the accuracy of speech synthesis.
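The threshold-based selection of the target speech embedding vector described in step S503 might look like the following sketch, which assumes the candidate embeddings have been pre-computed by a feature encoder and uses cosine similarity; the names and the threshold value are illustrative.

```python
from typing import Optional

import torch
import torch.nn.functional as F

def select_target_embedding(hidden_vector: torch.Tensor, candidate_embeddings: torch.Tensor,
                            threshold: float = 0.7) -> Optional[torch.Tensor]:
    """hidden_vector: (dim,); candidate_embeddings: (num_candidates, dim).
    Return the candidate embedding most similar to the hidden representation,
    provided its similarity value reaches the preset threshold."""
    similarity = F.cosine_similarity(candidate_embeddings, hidden_vector.unsqueeze(0), dim=-1)
    best = int(similarity.argmax())
    return candidate_embeddings[best] if similarity[best] >= threshold else None

target_embedding = select_target_embedding(torch.randn(256), torch.randn(10, 256))
```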
Referring to fig. 6, in some embodiments, step S106 includes, but is not limited to, steps S601 to S602:
step S601, performing feature extraction on the target speech embedding vector to obtain the timbre features of the target speaking object;
step S602, embedding the timbre features into the intermediate spectrum data to obtain the target spectrum data, wherein the target spectrum data contains the timbre features of the target speaking object.
In step S601 of some embodiments, feature extraction may be performed on the target speech embedding vector by named entity recognition or similar means to obtain all entity features in the target speech embedding vector, where the entity features include timbre features, pitch features, and so on.
In step S602 of some embodiments, the target spectrum data may be obtained by embedding the timbre features into the intermediate spectrum data through vector concatenation or vector addition. For example, the timbre features and the intermediate spectrum data are vectorized, and the two vectors are then concatenated or added, so that the resulting target spectrum data contains the timbre features of the target speaking object.
Through steps S601 and S602, the timbre features of the target speaking object can be extracted conveniently from the target speech embedding vector and fused into the intermediate spectrum data, so that the final target spectrum data contains the timbre features of the target speaking object, improving the accuracy of speech synthesis.
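A minimal sketch of fusing the timbre features into the intermediate spectrum by vector addition or concatenation (the two options mentioned above) follows; broadcasting a single timbre vector across all frames is an assumption made for illustration.

```python
import torch

def embed_timbre(intermediate_spectrum: torch.Tensor, timbre: torch.Tensor,
                 mode: str = "add") -> torch.Tensor:
    """intermediate_spectrum: (num_frames, feat_dim); timbre: (feat_dim,) for "add",
    or any (timbre_dim,) for "concat". The timbre vector is broadcast to every frame."""
    timbre_frames = timbre.unsqueeze(0).expand(intermediate_spectrum.shape[0], -1)
    if mode == "add":
        return intermediate_spectrum + timbre_frames
    return torch.cat([intermediate_spectrum, timbre_frames], dim=-1)

target_spectrum = embed_timbre(torch.randn(300, 80), torch.randn(80), mode="add")  # (300, 80)
```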
Referring to fig. 7, in some embodiments, a decoding network may be constructed based on a HiFi-GAN vocoder, the decoding network including an upsampling module and a residual module, and step S107 may include, but is not limited to, steps S701 to S702:
step S701, up-sampling the target spectrum data through the up-sampling module to obtain initial speech features;
step S702, performing vocoder conversion on the initial speech features through the residual module to obtain the speech synthesis data.
In step S701 of some embodiments, the up-sampling module performs up-sampling on the target spectrum data, realized as transposed convolution, to obtain the initial speech features.
In step S702 of some embodiments, the residual module reconstructs the initial speech features to obtain a reconstructed speech waveform, which is used as the speech synthesis data.
Through steps S701 and S702, the target spectrum data can be converted conveniently from a spectral feature map into waveform features, realizing the conversion of the target spectrum data from the frequency domain to the time domain, yielding waveform speech synthesis data and improving both the accuracy and the efficiency of speech synthesis.
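A minimal sketch of a decoding network in the HiFi-GAN style mentioned above follows: transposed convolutions up-sample the mel spectrogram in time and residual convolution blocks refine it into a waveform. The channel sizes, strides, kernel sizes, and dilations are illustrative assumptions, not the configuration claimed in the patent.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation, padding=dilation * (kernel_size - 1) // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.conv(torch.relu(x))  # residual connection

class Decoder(nn.Module):
    def __init__(self, mel_bins: int = 80):
        super().__init__()
        self.pre = nn.Conv1d(mel_bins, 256, 7, padding=3)
        self.upsample = nn.Sequential(       # up-sampling module: two 4x transposed convolutions
            nn.ConvTranspose1d(256, 128, 8, stride=4, padding=2), nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(128, 64, 8, stride=4, padding=2), nn.LeakyReLU(0.1),
        )
        self.residual = nn.Sequential(       # residual module
            ResidualBlock(64, dilation=1), ResidualBlock(64, dilation=3),
        )
        self.post = nn.Conv1d(64, 1, 7, padding=3)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, mel_bins, num_frames) -> waveform: (batch, 1, num_frames * 16)
        x = self.residual(self.upsample(self.pre(mel)))
        return torch.tanh(self.post(x))

waveform = Decoder()(torch.randn(1, 80, 120))  # shape (1, 1, 1920)
```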
According to the speech synthesis method of the embodiments of the present application, original phoneme data, which is text data, is acquired and input into a preset speech synthesis model comprising a first prediction network, a second prediction network, and a decoding network. A first spectrum prediction is performed on the original phoneme data based on the first prediction network and a pre-acquired speaking object corpus to obtain initial spectrum data and an initial speech embedding vector, so that a preliminary prediction of the original phoneme data is achieved and the phoneme feature information in the original phoneme data is extracted. The initial spectrum data is filtered based on the initial speech embedding vector to obtain candidate spectrum data, which removes the spectrum data representing background sound, reduces the interference of background noise with speech synthesis, and improves the spectrum quality of the candidate spectrum data. A second spectrum prediction is performed on the candidate spectrum data based on the second prediction network and the speaking object corpus to obtain intermediate spectrum data and a target speech embedding vector, so that the original phoneme data is predicted again, the quality of the spectrum data is further optimized, more accurate intermediate spectrum data is obtained, and the target speech embedding vector characterizing the speaking style of the target speaking object is obtained. Further, the intermediate spectrum data is reconstructed based on the target speech embedding vector to obtain target spectrum data, fusing the target speech embedding vector into the intermediate spectrum data so that the final target spectrum data contains the speaking style characteristics of the target speaking object. Finally, vocoder conversion is performed on the target spectrum data based on the decoding network to obtain speech synthesis data, converting the target spectrum data conveniently from a spectral feature map into waveform features and from the frequency domain into the time domain, yielding waveform speech synthesis data and improving both the accuracy and the efficiency of speech synthesis.
Referring to fig. 8, an embodiment of the present application further provides a speech synthesis apparatus, which may implement the above speech synthesis method, where the apparatus includes:
a data obtaining module 801, configured to obtain original phoneme data, where the original phoneme data is text data;
the input module 802 is configured to input the original phoneme data into a preset speech synthesis model, where the speech synthesis model includes a first prediction network, a second prediction network, and a decoding network;
the first spectrum prediction module 803 is configured to perform first spectrum prediction on the original phoneme data based on the first prediction network and a pre-acquired speaking object corpus to obtain initial spectrum data and an initial speech embedding vector, where the initial speech embedding vector is used to characterize speaking style characteristics of a reference speaking object in the speaking object corpus;
the filtering module 804 is configured to filter the initial spectrum data based on the initial speech embedding vector to obtain candidate spectrum data (a minimal sketch of this filtering step follows the module list);
the second spectrum prediction module 805 is configured to perform second spectrum prediction on the candidate spectrum data based on the second prediction network and the speaking object corpus to obtain intermediate spectrum data and a target speech embedding vector, where the target speech embedding vector is used to characterize speaking style characteristics of a target speaking object in the speaking object corpus;
a reconstruction module 806, configured to reconstruct the intermediate frequency spectrum data based on the target speech embedding vector to obtain target frequency spectrum data, where the target frequency spectrum data is a mel spectrogram;
the sound code conversion module 807 is configured to perform sound code conversion on the target spectrum data based on the decoding network to obtain speech synthesis data.
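As referenced above for module 804, the following is a minimal sketch of how the filtering step might separate key spectral features from background spectral features using the initial speech embedding vector, assuming PyTorch and a learned soft mask; the module name, dimensions and masking scheme are assumptions made for illustration, not the structure disclosed by this application.

```python
import torch
import torch.nn as nn

class FilteringModule(nn.Module):
    """Sketch of module 804: retain key (speech) features, attenuate background features."""
    def __init__(self, n_mels: int = 80, emb_dim: int = 256):
        super().__init__()
        # predicts, per time-frequency bin, how likely the bin carries speech content
        self.mask_net = nn.Sequential(
            nn.Conv1d(n_mels + emb_dim, n_mels, kernel_size=3, padding=1),
            nn.Sigmoid(),   # close to 1 -> key spectral feature, close to 0 -> background
        )

    def forward(self, initial_spec, initial_emb):
        # initial_spec: (batch, n_mels, frames); initial_emb: (batch, emb_dim)
        emb = initial_emb.unsqueeze(-1).expand(-1, -1, initial_spec.size(-1))
        mask = self.mask_net(torch.cat([initial_spec, emb], dim=1))
        return initial_spec * mask   # candidate spectrum data with background suppressed

candidate = FilteringModule()(torch.randn(2, 80, 120), torch.randn(2, 256))
```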
The specific implementation of the speech synthesis apparatus is substantially the same as the specific embodiments of the speech synthesis method described above, and the details are not repeated here.
An embodiment of the present application also provides an electronic device comprising a memory and a processor, wherein the memory stores a computer program and the processor implements the above speech synthesis method when executing the computer program. The electronic device may be any intelligent terminal, including a tablet computer, a vehicle-mounted computer, and the like.
Referring to fig. 9, which illustrates the hardware structure of an electronic device according to another embodiment, the electronic device includes the following components (a brief, hypothetical usage sketch follows the list):
the processor 901 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 902 may be implemented in the form of Read-Only Memory (ROM), static storage, dynamic storage, or Random Access Memory (RAM). The memory 902 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program codes are stored in the memory 902 and are invoked by the processor 901 to perform the speech synthesis method of the embodiments of the present application;
an input/output interface 903 for inputting and outputting information;
the communication interface 904 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.) or in a wireless manner (e.g. mobile network, Wi-Fi, Bluetooth, etc.);
a bus 905 that transfers information between the various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively coupled to each other within the device via a bus 905.
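Purely to illustrate how these components might cooperate, the following is a hypothetical entry point for the stored program: it loads the synthesis model from memory, reads text through the input interface, runs the speech synthesis method and writes the resulting waveform out. The file names, the torch.jit.load-based loading, the sampling rate and the model(text) call are all assumptions for this sketch, not part of this application.

```python
import wave
import numpy as np
import torch

def main() -> None:
    # memory 902: the stored program and model parameters (assumed TorchScript export)
    model = torch.jit.load("speech_synthesis_model.pt", map_location="cpu")
    model.eval()
    text = input("Text to synthesize: ")                  # input/output interface 903
    with torch.no_grad():
        waveform = model(text)                            # processor 901 runs the method
    pcm = (waveform.squeeze().numpy() * 32767.0).astype(np.int16)
    with wave.open("output.wav", "wb") as f:              # result delivered via interface 903
        f.setnchannels(1)
        f.setsampwidth(2)                                 # 16-bit PCM
        f.setframerate(22050)                             # assumed sampling rate
        f.writeframes(pcm.tobytes())

if __name__ == "__main__":
    main()
```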
An embodiment of the present application also provides a computer readable storage medium storing a computer program, and the computer program implements the above speech synthesis method when executed by a processor.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments of the present application provide a speech synthesis method, a speech synthesis apparatus, an electronic device and a computer readable storage medium. Original phoneme data, which are text data, are acquired and input into a preset speech synthesis model comprising a first prediction network, a second prediction network and a decoding network. First spectrum prediction is performed on the original phoneme data based on the first prediction network and a pre-acquired speaking object corpus to obtain initial spectrum data and an initial speech embedding vector, so that a preliminary prediction of the original phoneme data is realized and the phoneme feature information contained in the original phoneme data is extracted. The initial spectrum data are then filtered based on the initial speech embedding vector to obtain candidate spectrum data; spectrum data representing background sound are thereby eliminated, the interference of background noise on speech synthesis is reduced, and the spectral quality of the candidate spectrum data is improved. Second spectrum prediction is performed on the candidate spectrum data based on the second prediction network and the speaking object corpus to obtain intermediate spectrum data and a target speech embedding vector, so that the original phoneme data are predicted again, the quality of the spectrum data is further optimized, intermediate spectrum data of higher accuracy are obtained, and a target speech embedding vector characterizing the speaking style characteristics of the target speaking object is also obtained. Further, the intermediate spectrum data are reconstructed based on the target speech embedding vector to obtain target spectrum data, so that the target speech embedding vector is fused into the intermediate spectrum data and the final target spectrum data contain the speaking style characteristics of the target speaking object. Finally, the target spectrum data are vocoded by the decoding network to obtain speech synthesis data; the target spectrum data are thereby converted from a spectral feature map into waveform features in a relatively convenient manner, the conversion from the frequency domain to the time domain is realized, speech synthesis data in waveform form are obtained, and both the accuracy and the efficiency of speech synthesis are improved.
The embodiments described herein are intended to describe the technical solutions of the embodiments of the present application more clearly and do not constitute a limitation on the technical solutions provided by the embodiments of the present application; as those skilled in the art will appreciate, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not constitute a limitation of the embodiments of the present application, and an implementation may include more or fewer steps than shown, may combine certain steps, or may use different steps.
The apparatus embodiments described above are merely illustrative, and the units illustrated as separate components may or may not be physically separate; that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent that only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of" the following items or similar expressions means any combination of these items, including any combination of a single item or plural items. For example, at least one (item) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be singular or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of units is merely a logical function division, and there may be other division manners in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the form of all or part of the technical solution, may be embodied as a software product stored in a storage medium, the software product including multiple instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing a program.
Preferred embodiments of the present application have been described above with reference to the accompanying drawings, which does not thereby limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method of speech synthesis, the method comprising:
acquiring original phoneme data, wherein the original phoneme data are text data;
inputting the original phoneme data into a preset voice synthesis model, wherein the voice synthesis model comprises a first prediction network, a second prediction network and a decoding network;
performing first spectrum prediction on the original phoneme data based on the first prediction network and a pre-acquired speaking object corpus to obtain initial spectrum data and initial speech embedding vectors, wherein the initial speech embedding vectors are used for representing speaking style characteristics of reference speaking objects in the speaking object corpus;
filtering the initial spectrum data based on the initial voice embedding vector to obtain candidate spectrum data;
performing second spectrum prediction on the candidate spectrum data based on the second prediction network and the speaking object corpus to obtain intermediate spectrum data and a target speech embedding vector, wherein the target speech embedding vector is used for representing speaking style characteristics of a target speaking object in the speaking object corpus;
reconstructing the intermediate frequency spectrum data based on the target voice embedded vector to obtain target frequency spectrum data, wherein the target frequency spectrum data is a Mel spectrogram;
and performing sound code conversion on the target frequency spectrum data based on the decoding network to obtain voice synthesis data.
2. The method according to claim 1, wherein the first prediction network comprises a first coding layer and a first decoding layer, and the performing first spectrum prediction on the original phoneme data based on the first prediction network and the pre-acquired speaking object corpus to obtain initial spectrum data and an initial speech embedding vector comprises:
acquiring a first duration parameter of the original phoneme data;
encoding the original phoneme data through the first coding layer and the first duration parameter to obtain a phoneme hidden layer representation vector;
extracting features of the phoneme hidden layer representation vector based on the speaking object corpus to obtain the initial speech embedding vector;
and performing attention calculation on the phoneme hidden layer representation vector through the first decoding layer to obtain the initial spectrum data.
3. The method of speech synthesis according to claim 2, wherein the obtaining the first duration parameter of the original phoneme data comprises:
inputting the original phoneme data into a preset time prediction model, wherein the time prediction model comprises a convolution layer, an activation layer, a normalization layer and a linear layer;
extracting features of the original phoneme data through the convolution layer to obtain a phoneme feature vector;
mapping the phoneme feature vector to a preset time domain space through the activation layer to obtain an initial prediction time feature;
normalizing the initial prediction time feature through the normalization layer to obtain a target prediction time feature;
and mapping the target prediction time feature to a preset data space through the linear layer to obtain the first duration parameter.
4. The method according to claim 1, wherein the second prediction network includes a second coding layer and a second decoding layer, and the performing, based on the second prediction network and the speaking object corpus, second spectrum prediction on the candidate spectrum data to obtain intermediate spectrum data and a target speech embedding vector includes:
acquiring a second duration parameter of the candidate spectrum data;
encoding the candidate spectrum data through the second coding layer and the second duration parameter to obtain a voice hidden layer representation vector;
extracting features of the voice hidden layer representation vector based on the speaking object corpus to obtain the target speech embedding vector;
and performing attention calculation on the voice hidden layer representation vector through the second decoding layer to obtain the intermediate spectrum data.
5. The method according to claim 1, wherein the filtering the initial spectrum data based on the initial speech embedding vector to obtain candidate spectrum data includes:
performing feature recognition on the initial spectrum data according to the initial voice embedding vector to obtain background spectrum features and key spectrum features; the key spectrum features comprise voice contents corresponding to the original phoneme data, and the background spectrum features do not comprise voice contents corresponding to the original phoneme data;
and filtering the background spectrum characteristics in the initial spectrum data to obtain the candidate spectrum data.
6. The method of claim 1, wherein the reconstructing the intermediate spectrum data based on the target speech embedding vector to obtain target spectrum data comprises:
extracting features of the target speech embedding vector to obtain a tone feature of the target speaking object;
and embedding the tone feature into the intermediate frequency spectrum data to obtain the target frequency spectrum data, wherein the target frequency spectrum data comprises the tone feature of the target speaking object.
7. The method according to any one of claims 1 to 6, wherein the decoding network includes an upsampling module and a residual module, and the performing vocoding on the target spectrum data based on the decoding network to obtain speech synthesis data includes:
up-sampling the target frequency spectrum data through the up-sampling module to obtain an initial speech feature;
and performing sound code conversion on the initial speech feature through the residual module to obtain the speech synthesis data.
8. A speech synthesis apparatus, the apparatus comprising:
the data acquisition module is used for acquiring original phoneme data, wherein the original phoneme data are text data;
the input module is used for inputting the original phoneme data into a preset voice synthesis model, and the voice synthesis model comprises a first prediction network, a second prediction network and a decoding network;
the first spectrum prediction module is used for performing first spectrum prediction on the original phoneme data based on the first prediction network and a pre-acquired speaking object corpus to obtain initial spectrum data and an initial speech embedding vector, wherein the initial speech embedding vector is used for representing speaking style characteristics of a reference speaking object in the speaking object corpus;
the filtering module is used for filtering the initial spectrum data based on the initial voice embedding vector to obtain candidate spectrum data;
the second spectrum prediction module is used for performing second spectrum prediction on the candidate spectrum data based on the second prediction network and the speaking object corpus to obtain intermediate spectrum data and a target speech embedding vector, wherein the target speech embedding vector is used for representing speaking style characteristics of a target speaking object in the speaking object corpus;
the reconstruction module is used for carrying out reconstruction processing on the intermediate frequency spectrum data based on the target voice embedded vector to obtain target frequency spectrum data, wherein the target frequency spectrum data is a Mel spectrogram;
and the sound code conversion module is used for performing sound code conversion on the target frequency spectrum data based on the decoding network to obtain voice synthesis data.
9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the speech synthesis method of any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the speech synthesis method of any one of claims 1 to 7.
CN202310406038.8A 2023-04-07 2023-04-07 Speech synthesis method, speech synthesis device, electronic device, and storage medium Pending CN116386594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310406038.8A CN116386594A (en) 2023-04-07 2023-04-07 Speech synthesis method, speech synthesis device, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310406038.8A CN116386594A (en) 2023-04-07 2023-04-07 Speech synthesis method, speech synthesis device, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN116386594A true CN116386594A (en) 2023-07-04

Family

ID=86976779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310406038.8A Pending CN116386594A (en) 2023-04-07 2023-04-07 Speech synthesis method, speech synthesis device, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN116386594A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116543749A (en) * 2023-07-05 2023-08-04 北京科技大学 Multi-mode voice synthesis method and system based on stack memory network
CN116543749B (en) * 2023-07-05 2023-09-15 北京科技大学 Multi-mode voice synthesis method and system based on stack memory network
CN117133270A (en) * 2023-09-06 2023-11-28 联通(广东)产业互联网有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN117219050A (en) * 2023-09-08 2023-12-12 中国人民解放军战略支援部队航天工程大学 Text generation video system based on depth generation countermeasure network

Similar Documents

Publication Publication Date Title
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116312463A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116543768A (en) Model training method, voice recognition method and device, equipment and storage medium
CN116682411A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
CN113555003B (en) Speech synthesis method, device, electronic equipment and storage medium
CN116665642A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
CN116665639A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116645961A (en) Speech recognition method, speech recognition device, electronic apparatus, and storage medium
CN116543797A (en) Emotion recognition method and device based on voice, electronic equipment and storage medium
CN116665638A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116541551A (en) Music classification method, music classification device, electronic device, and storage medium
CN116469370A (en) Target language voice synthesis method and device, electronic equipment and storage medium
CN116631434A (en) Video and voice synchronization method and device based on conversion system and electronic equipment
CN115995225A (en) Model training method and device, speech synthesis method and device and storage medium
CN115273805A (en) Prosody-based speech synthesis method and apparatus, device, and medium
CN115641860A (en) Model training method, voice conversion method and device, equipment and storage medium
CN113948061A (en) Speech synthesis method, system, speech synthesis model and training method thereof
Carson-Berndsen Multilingual time maps: portable phonotactic models for speech technology
CN116469371A (en) Speech synthesis method and device, electronic equipment and storage medium
CN116564274A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN118298807A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN115620702A (en) Speech synthesis method, speech synthesis device, electronic apparatus, and storage medium
Räsänen Context induced merging of synonymous word models in computational modeling of early language acquisition
CN115294961A (en) Voice synthesis method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination