CN112116903A - Method and device for generating speech synthesis model, storage medium and electronic equipment - Google Patents

Method and device for generating speech synthesis model, storage medium and electronic equipment

Info

Publication number
CN112116903A
Authority
CN
China
Prior art keywords
audio data
sample
training
speech synthesis
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010827835.XA
Other languages
Chinese (zh)
Inventor
杨惠
梁光
吴雨璇
舒景辰
周鼎皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dami Technology Co Ltd
Original Assignee
Beijing Dami Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dami Technology Co Ltd filed Critical Beijing Dami Technology Co Ltd
Priority to CN202010827835.XA priority Critical patent/CN112116903A/en
Publication of CN112116903A publication Critical patent/CN112116903A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers

Abstract

The embodiments of the present application disclose a method and an apparatus for generating a speech synthesis model, a storage medium and an electronic device, and belong to the field of computer technology. The method comprises the following steps: a server samples sample audio data at at least two different sampling rates to obtain corresponding training audio data, and trains a speech synthesis model based on the at least two pieces of training audio data. This extends the training data, provides sufficient training data for training the speech synthesis model, and thus allows a speech synthesis model of better quality to be generated.

Description

Method and device for generating speech synthesis model, storage medium and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a speech synthesis model, a storage medium, and an electronic device.
Background
The development of the Internet has gradually driven the development of artificial intelligence, within which intelligent speech technology is particularly prominent; converting text into speech and converting speech into text are both built on intelligent speech technology. Converting text into speech is speech synthesis, which requires a trained speech synthesis model. In the related art, such a model is generated by training it repeatedly on training data. However, when training data is missing or scarce, the speech synthesis model obtained from that data suffers from poor data processing accuracy, and the training process takes a long time.
Disclosure of Invention
The embodiments of the present application provide a method and an apparatus for generating a speech synthesis model, a storage medium and an electronic device, which can solve the problem that a speech synthesis model of good quality cannot be generated when training data is missing or scarce. The technical solution is as follows:
in a first aspect, an embodiment of the present application provides a method for generating a speech synthesis model, where the method includes:
respectively sampling the sample audio data based on at least two different sampling rates to obtain respective corresponding training audio data;
and training based on at least two pieces of training audio data to obtain the speech synthesis model.
In a second aspect, an embodiment of the present application provides an apparatus for generating a speech synthesis model, where the apparatus for generating a speech synthesis model includes:
the sampling module is used for respectively sampling and processing the sample audio data based on at least two different sampling rates to obtain respective corresponding training audio data;
and the training module is used for training based on at least two pieces of training audio data to obtain the speech synthesis model.
In a third aspect, embodiments of the present application provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.
In a fourth aspect, an embodiment of the present application provides an electronic device, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.
The beneficial effects brought by the technical solutions provided in some embodiments of the present application at least include:
when the scheme of the embodiment of the application is executed, the server respectively samples and processes the sample audio data based on at least two different sampling rates to obtain the corresponding training audio data, trains based on at least two training audio data to obtain the speech synthesis model, realizes the extension of the training data, provides sufficient training data for training the speech synthesis model, and ensures that the speech synthesis model with better quality can be generated.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a diagram of a system architecture provided by an embodiment of the present application;
FIG. 2 is a flow chart of a method for generating a speech synthesis model according to an embodiment of the present application;
FIG. 3 is another schematic flow chart of a method for generating a speech synthesis model according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an apparatus for generating a speech synthesis model provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a device for generating a speech synthesis model according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of an exemplary system architecture 100 to which a method for generating a speech synthesis model or a device for generating a speech synthesis model according to an embodiment of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as: video recording applications, video playing applications, voice interaction applications, search applications, instant messaging tools, mailbox clients, social platform software, etc. The network 104 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, portable computers, desktop computers, and the like. The network 104 may include various types of wired or wireless communication links, such as: the wired communication links include optical fiber, twisted pair, or coaxial cable, and the wireless communication links include Bluetooth communication links, Wireless Fidelity (Wi-Fi) communication links, microwave communication links, etc. The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, and 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module, which is not specifically limited here. When the terminal devices 101, 102, and 103 are hardware, they may further include a display device and a camera; the display device may be any of various devices capable of implementing a display function, and the camera is used to collect a video stream. For example, the display device may be a cathode ray tube (CRT) display, a light-emitting diode (LED) display, an electronic ink screen, a liquid crystal display (LCD), a plasma display panel (PDP), or the like. The user can use the display device on the terminal device 101, 102, 103 to view displayed text, pictures, videos, and other information.
It should be noted that the method for generating the speech synthesis model provided in the embodiment of the present application is generally executed by the server 105, and accordingly, the speech synthesis apparatus is generally disposed in the server 105. The server 105 may be a server that provides various services, and the server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as a plurality of software or software modules (for example, for providing distributed services), or may be implemented as a single software or software module, and is not limited in particular herein.
The server 105 in the present application may be a server providing various services, for example: the server samples sample audio data at at least two different sampling rates to obtain training audio data corresponding to each rate, and trains a speech synthesis model based on the at least two pieces of training audio data.
It should be noted that the method for generating the speech synthesis model provided in the embodiment of the present application may be executed by one or more of the terminal devices 101, 102, and 103, and/or the server 105, and accordingly, the speech synthesis apparatus provided in the embodiment of the present application is generally disposed in the corresponding terminal device, and/or the server 105, but the present application is not limited thereto.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The method for generating a speech synthesis model according to the embodiment of the present application will be described in detail below with reference to fig. 2 to 3. Referring to fig. 2, a flow chart of a method for generating a speech synthesis model according to an embodiment of the present application is schematically provided. As shown in fig. 2, the method of the embodiment of the present application may include the steps of:
s201, respectively sampling the sample audio data based on at least two different sampling rates to obtain respective corresponding training audio data.
The sampling rate is the number of samples extracted per second from the audio signal to form a discrete signal; for example, it may be 22050 samples per second or 16000 samples per second. The sample audio data is the sample data used to train the model, and it contains the sound features that the user needs to synthesize. The training audio data is obtained by sampling the sample audio data; different sampling rates yield different training audio data.
Generally, a speech synthesis model is generated by training on a large amount of sample data, which includes a large amount of sample text data and the sample audio data corresponding to the sample text data. When the amount of sample audio data is small or data is missing, new training audio data can be obtained by sampling the existing sample audio data multiple times (at least twice), so as to ensure a sufficient amount of audio data for training the model; an accurate speech synthesis model can then be obtained once the training audio data is sufficient. For example: if there is originally only one piece of sample audio data, sampling it at 22050 samples per second and at 16000 samples per second yields first training audio data and second training audio data, and model training can then be carried out based on the first training audio data, the second training audio data, and the sample audio data.
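By way of illustration only, the resampling step above can be sketched in Python as follows. The use of librosa, the function name, and the default rates are assumptions for the sketch; the embodiment does not prescribe a specific library, and the rates simply reuse the example values from the text.

```python
# Minimal sketch of step S201: resample one piece of sample audio at several
# sampling rates to obtain additional training audio data.
import librosa

def make_training_audio(sample_path, rates=(22050, 16000)):
    """Return the original audio plus one resampled copy per sampling rate."""
    audio, orig_sr = librosa.load(sample_path, sr=None)  # keep the native rate
    training_audio = [(sr, librosa.resample(audio, orig_sr=orig_sr, target_sr=sr))
                      for sr in rates]
    return audio, orig_sr, training_audio
```

The original audio and the resampled copies can then all be used as training audio data, as in the example above.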
S202, training is carried out based on at least two pieces of training audio data to obtain a speech synthesis model.
Generally, after sampling, the server obtains the sample text data, the at least two pieces of training audio data, and the sample audio data. It encodes and decodes the sample text data to obtain the Mel spectrum currently corresponding to the sample text data, computes the Mel spectra corresponding to the at least two pieces of training audio data and to the sample audio data, then computes the loss values between the Mel spectrum corresponding to the sample text data and each of those Mel spectra, and generates the pre-trained speech synthesis model when the loss values are less than or equal to a preset threshold.
Specifically, the training process of the speech synthesis model may include:
The sample text data is converted into at least one phoneme sequence by querying a phoneme table, and the at least one phoneme sequence is converted into at least one phoneme feature vector. Dimensionality reduction is performed on the at least one phoneme feature vector to obtain a first feature vector, and position coding is performed on the first feature vector based on the text sequence information of the sample text data to obtain a second feature vector, where the text sequence information indicates at least one of the order and the features of the words in the sample text data. The second feature vector is processed by a feed-forward Transformer (FFT) block to obtain a phoneme sequence vector. Probability evaluation is performed on the duration of the at least one phoneme sequence in the sample text data to obtain the pronunciation duration of the at least one phoneme sequence, and duration extraction is performed on the phoneme sequence vector based on the pronunciation duration of the at least one phoneme sequence to obtain a phoneme alignment vector. Position coding is then performed on the phoneme alignment vector based on the text sequence information of the sample text data to obtain a third feature vector, the third feature vector is processed by an FFT block to obtain a fourth feature vector, and the fourth feature vector is processed by a linear layer to obtain the Mel spectrum currently corresponding to the sample text data. Meanwhile, based on the spectral features corresponding to the at least two pieces of training audio data and the sample audio data, their corresponding Mel spectra are computed. Loss values are then computed between the Mel spectrum corresponding to the sample text data and the Mel spectra of the at least two pieces of training audio data, and between the Mel spectrum corresponding to the sample text data and the Mel spectrum of the sample audio data. When the loss values are less than or equal to a preset threshold, the pre-trained speech synthesis model is generated.
When the solution of the embodiments of the present application is executed, the server samples the sample audio data at at least two different sampling rates to obtain corresponding training audio data, and trains a speech synthesis model based on the at least two pieces of training audio data. This extends the training data, provides sufficient training data for training the speech synthesis model, and ensures that a speech synthesis model of better quality can be generated.
Referring to fig. 3, a flow chart of a method for generating a speech synthesis model according to an embodiment of the present application is provided, where the method for generating a speech synthesis model includes the following steps:
s301, sampling the sample audio data based on at least two different sampling rates to obtain respective corresponding training audio data.
The sampling rate is the number of samples extracted per second from the audio signal to form a discrete signal; for example, it may be 22050 samples per second or 16000 samples per second. The sample audio data is the sample data used to train the model, and it contains the sound features that the user needs to synthesize. The training audio data is obtained by sampling the sample audio data; different sampling rates yield different training audio data.
Generally, a speech synthesis model is generated by training on a large amount of sample data, which includes a large amount of sample text data and the sample audio data corresponding to the sample text data. When the amount of sample audio data is small or data is missing, new training audio data can be obtained by sampling the existing sample audio data multiple times, so as to ensure a sufficient amount of audio data for training the model; an accurate speech synthesis model can then be obtained once the training audio data is sufficient. For example: if there is originally only one piece of sample audio data, sampling it at 22050 samples per second and at 16000 samples per second yields first training audio data and second training audio data, and model training can then be carried out based on the first training audio data, the second training audio data, and the sample audio data.
S302, linear spectrums corresponding to the at least two training audio data and the sample audio data are determined.
The linear spectrum refers to the spectral density corresponding to each of the at least two pieces of training audio data and the sample audio data, and is distinct from the Mel spectrum; the Mel spectrum is a spectrum represented on the Mel scale and is a nonlinear spectrum.
Generally, after the server obtains the at least two pieces of training audio data and the sample audio data, it performs Fourier transform processing on each of them, converting the audio signals from the time domain into the frequency domain to obtain the corresponding linear spectra; these spectra are then converted to obtain the corresponding Mel spectra.
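A minimal sketch of this Fourier-transform step follows, again assuming librosa; the STFT parameters (n_fft, hop_length) are illustrative values, not details taken from the embodiment.

```python
# Sketch of step S302: a short-time Fourier transform turns each training or
# sample waveform into its linear (magnitude) spectrum.
import numpy as np
import librosa

def linear_spectrum(audio, n_fft=1024, hop_length=256):
    """Magnitude spectrogram (linear spectrum) of one audio signal."""
    stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)
    return np.abs(stft)  # shape: (1 + n_fft // 2, frames)
```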
And S303, converting the linear frequency spectrum into at least two Mel frequency spectrums corresponding to the training audio data and the sample audio data respectively.
Here, the mel spectrum is a spectrum expressed by the mel scale, and the mel spectrum includes characteristics of sound.
Generally, since the Mel spectrum of audio data better reflects the characteristics of sound, the original linear spectrum needs to be converted into a nonlinear Mel spectrum. This nonlinear conversion can be expressed by the Mel-scale formula (in its standard form, mel(f) = 2595 · log10(1 + f / 700), with f in Hz), which is applied to the spectrum to obtain the Mel spectrum of the audio data.
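As an illustration only, the linear-to-Mel conversion can be sketched in Python as below. The Mel filter-bank approach, the librosa helper, and the parameter values (n_fft, n_mels) are assumptions for the sketch rather than details from the embodiment; only the 2595 · log10(1 + f/700) mapping is the standard Mel-scale formula referenced above.

```python
# Sketch of step S303: convert a linear spectrum into a Mel spectrum.
import numpy as np
import librosa

def hz_to_mel(f):
    """Standard Mel-scale mapping of a frequency in Hz."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_spectrum(linear_spec, sr, n_fft=1024, n_mels=80):
    """Apply a Mel filter bank to a magnitude spectrogram."""
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return mel_basis @ linear_spec  # shape: (n_mels, frames)
```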
And S304, adding the Mel frequency spectrum to the set of sample Mel frequency spectrums.
The sample Mel spectrum set is a set comprising at least two Mel spectra, where those Mel spectra correspond to the training audio data; the set further comprises the Mel spectrum corresponding to the sample audio data, and the number of Mel spectra in the set depends on the number of pieces of training audio data and sample audio data. Usually, after the server computes the Mel spectra of the at least two pieces of training audio data and the sample audio data, it adds them to the sample Mel spectrum set, so that the loss values between the Mel spectrum corresponding to the sample text data and each Mel spectrum in the set can be computed later.
S305, obtaining sample text data, and coding the sample text data to obtain a phoneme sequence vector.
The sample text data is data presented in text form that contains the sample content information and corresponds to the content of the sample audio data. The phoneme sequence vector is a vector, obtained by converting the phoneme sequence several times, that is expressed in vector form. A phoneme sequence is a sequence of phoneme elements arranged in order. The sample text data may be text data of English words, with each English word corresponding to one phoneme sequence; the sample text data may also be Chinese words, with each character corresponding to one phoneme sequence.
Generally, a speech synthesis model capable of synthesizing speech data based on text data needs to be obtained by training a large amount of sample data, where the large amount of sample data includes a large amount of sample text data and sample audio data corresponding to the sample text data.
The process of encoding the sample text data to obtain the phoneme sequence vector may include the following. A phoneme sequence corresponding to the sample text data is obtained by querying a phoneme table, where each word or character in the sample text data corresponds to one phoneme sequence. The resulting phoneme sequences need to be converted into phoneme feature vectors for subsequent processing; the number of phoneme feature vectors equals the number of phoneme sequences, each phoneme sequence corresponds to one phoneme feature vector, and a phoneme feature vector is a vector, obtained by an initial conversion of the phoneme sequence, that contains the features of that phoneme sequence. Dimensionality reduction is performed on at least one phoneme feature vector to obtain a first feature vector, which differs in dimensionality from the phoneme feature vector of the original dimensionality. By position coding the first feature vector, the text sequence information of the sample text data can be added to it, yielding a second feature vector that reflects the temporal order; the text sequence information is information about the words or characters in the sample text data and can express at least one of their order and their features, and the second feature vector is distinguished from the first feature vector by this position coding. The second feature vector can then be processed by a Transformer feed-forward network consisting of FFT blocks that contain an attention mechanism and a convolution layer; the parameters in the second feature vector are trained, the information that needs attention is extracted, and the phoneme sequence vector is obtained.
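A rough PyTorch sketch of this encoder path is given below: phoneme embedding, positional encoding, and one FFT block combining self-attention with convolution. All sizes, layer counts, and hyper-parameters are assumptions chosen only to make the sketch runnable; they are not taken from the embodiment.

```python
# Sketch of the encoding step S305: phoneme feature vectors, position coding,
# and a feed-forward Transformer (FFT) block producing phoneme sequence vectors.
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """Self-attention followed by a 1-D convolutional feed-forward sub-layer."""
    def __init__(self, dim=256, heads=2, kernel_size=9):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (batch, time, dim)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + c)

def positional_encoding(length, dim):
    """Sinusoidal position codes carrying the text sequence information."""
    pos = torch.arange(length).unsqueeze(1)
    i = torch.arange(0, dim, 2)
    angles = pos / torch.pow(10000, i / dim)
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

class PhonemeEncoder(nn.Module):
    def __init__(self, n_phonemes=100, dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)   # phoneme feature vectors
        self.fft = FFTBlock(dim)

    def forward(self, phoneme_ids):                  # (batch, time)
        x = self.embed(phoneme_ids)                          # first feature vectors
        x = x + positional_encoding(x.size(1), x.size(2))    # second feature vectors
        return self.fft(x)                                   # phoneme sequence vectors
```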
S306, duration extraction processing is carried out on the phoneme sequence vector to obtain a phoneme alignment vector.
The phoneme alignment vector is obtained by performing phoneme alignment based on the pronunciation duration of the phoneme sequence.
Generally, probability evaluation is performed on the duration of at least one phoneme sequence in the sample text data to obtain the pronunciation duration of that phoneme sequence. The pronunciation duration refers to the sum of the pronunciation durations of the phonemes in the phoneme sequence; each phoneme sequence corresponds to one pronunciation duration, i.e., the duration information of the phoneme sequence. An existing method extracts the duration information of each phoneme in the phoneme sequence through a pre-trained model; it works poorly, only achieves sentence-level alignment, and does not achieve phoneme-to-phoneme alignment. In this solution, a statistical model (a classical decoder) is used to process the phoneme sequence and achieve forced alignment of phonemes, specifically: the pronunciation duration of the phoneme sequence corresponding to each word or character is collected statistically, probability evaluation is performed on each obtained pronunciation duration, and the phoneme sequence with the highest probability is selected from the evaluation results as the output, thereby achieving phoneme-to-phoneme alignment and obtaining the phoneme alignment vector.
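The alignment itself can be pictured as a length-regulation step: each phoneme vector is repeated for its number of frames so the phoneme axis lines up with the Mel-spectrum time axis. The sketch below assumes the durations are already available (from the statistical duration model the paragraph describes); the duration values shown are illustrative.

```python
# Sketch of the duration-extraction/alignment step S306 (length regulation).
import torch

def length_regulate(phoneme_vectors, durations):
    """phoneme_vectors: (time, dim); durations: (time,) frame counts per phoneme."""
    expanded = [phoneme_vectors[i].repeat(int(d), 1)
                for i, d in enumerate(durations) if int(d) > 0]
    return torch.cat(expanded, dim=0)            # phoneme alignment vectors

# Example: three phonemes lasting 2, 3 and 1 frames expand to 6 frames.
aligned = length_regulate(torch.randn(3, 256), torch.tensor([2, 3, 1]))
print(aligned.shape)  # torch.Size([6, 256])
```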
S307, decoding the phoneme alignment vector to obtain a Mel frequency spectrum corresponding to the sample text data at present.
Generally, by position coding the phoneme alignment vector, the text sequence information of the sample text data can be added to it, yielding a third feature vector that reflects the temporal order; the third feature vector is obtained from the phoneme alignment vector by position coding and is distinguished from the first and second feature vectors. The third feature vector can be processed by the Transformer feed-forward network consisting of FFT blocks that contain an attention mechanism and a convolution layer; the parameters in the third feature vector are trained and the information that needs attention is extracted to obtain a fourth feature vector, which is distinguished from the first, second, and third feature vectors. The fourth feature vector can then be processed by a linear layer to obtain the Mel spectrum alpha currently corresponding to the sample text data.
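A small decoder-side sketch is shown below. It reuses FFTBlock and positional_encoding from the encoder sketch above, and the dimensions (256-dimensional features, 80 Mel bins) are assumptions for illustration only.

```python
# Sketch of the decoding step S307: position coding, an FFT block, and a
# linear layer mapping the fourth feature vector to a Mel spectrum.
import torch
import torch.nn as nn

class MelDecoder(nn.Module):
    def __init__(self, dim=256, n_mels=80):
        super().__init__()
        self.fft = FFTBlock(dim)               # defined in the encoder sketch
        self.to_mel = nn.Linear(dim, n_mels)   # linear layer -> Mel spectrum

    def forward(self, aligned):                # aligned: (batch, frames, dim)
        x = aligned + positional_encoding(aligned.size(1), aligned.size(2))
        x = self.fft(x)                        # fourth feature vectors
        return self.to_mel(x)                  # predicted Mel spectrum alpha
```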
S308, loss values between the Mel frequency spectrum corresponding to the sample text data and at least one Mel frequency spectrum in the set of the sample Mel frequency spectrums are calculated respectively.
The loss value refers to the degree of inconsistency between the Mel spectrum alpha corresponding to the sample text data and a comparison label; the smaller the loss value, the better the robustness of the trained model.
Generally, the fourth feature vector may be processed by the linear layer to obtain the Mel spectrum alpha currently corresponding to the sample text data. Meanwhile, based on the spectral features of the at least two pieces of training audio data and the sample audio data, their corresponding Mel spectra are computed and added to the sample Mel spectrum set; these Mel spectra may be denoted beta and used as comparison labels. The Mel spectrum alpha corresponding to the sample text data is compared with each comparison label in the sample Mel spectrum set to compute a pairwise loss value. Until every loss value reaches the preset threshold, iterative training based on the above steps continues; when every loss value is less than or equal to the preset threshold, model training is finished and the trained speech synthesis model is generated.
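The loss comparison can be sketched as below. Using L1 loss is an assumption; the embodiment only speaks of a loss value, and the threshold value shown is illustrative.

```python
# Sketch of steps S308/S309: compare the predicted Mel spectrum (alpha) with
# every Mel spectrum in the sample set (the comparison labels, beta).
import torch
import torch.nn.functional as F

def mel_losses(predicted_mel, sample_mel_set):
    """predicted_mel: (frames, n_mels); sample_mel_set: list of same-shape tensors."""
    return [F.l1_loss(predicted_mel, beta) for beta in sample_mel_set]

def training_finished(losses, threshold=0.1):
    """Training stops once every loss value is at or below the preset threshold."""
    return all(loss.item() <= threshold for loss in losses)
```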
And S309, generating a speech synthesis model when the loss value is less than or equal to a preset threshold value.
The preset threshold is a preset maximum loss value based on the model to be trained, and when the loss value is smaller than or equal to the preset maximum loss value, the model training is finished.
Generally, after multiple rounds of iterative training, when the loss value between the Mel spectrum alpha currently corresponding to the sample text data and the comparison label falls to or below the preset threshold, the speech synthesis model can be considered trained, and the server can perform speech synthesis on input text data based on this model to obtain the speech data corresponding to that input text data.
S310, acquiring text data and converting the text data into at least one phoneme sequence.
The text data is data presented in text form that contains content information. A phoneme sequence is a sequence of phoneme elements arranged in order. The text data may be text data of English words, with each English word corresponding to one phoneme sequence; the text data may also be Chinese words, with each character corresponding to one phoneme sequence.
Generally, the server may obtain the text data by receiving text information sent by a terminal and parsing it, or by recognizing specified text information. After obtaining the text data, the server may convert each word or character in it into the corresponding phoneme sequence by querying the phoneme table, so that the at least one phoneme sequence corresponding to the text data can be processed subsequently.
For example: the text data consists of Chinese characters and reads "今天你吃饭了吗" ("Have you eaten today?"). After querying the phoneme table, the server may convert the text data into 7 phoneme sequences: { j, i, n }, { t, i, a, n }, { n, i }, { c, h, i }, { f, a, n }, { l, e }, { m, a }; each character in the text data corresponds to one phoneme sequence.
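A toy illustration of this phoneme-table lookup is given below; the hand-written table is a stand-in for the real phoneme table and covers only the example sentence above.

```python
# Sketch of step S310: convert text into phoneme sequences via a phoneme table.
PHONEME_TABLE = {
    "今": ["j", "i", "n"], "天": ["t", "i", "a", "n"], "你": ["n", "i"],
    "吃": ["c", "h", "i"], "饭": ["f", "a", "n"], "了": ["l", "e"], "吗": ["m", "a"],
}

def text_to_phoneme_sequences(text):
    """Look up each character and return its phoneme sequence."""
    return [PHONEME_TABLE[ch] for ch in text if ch in PHONEME_TABLE]

print(text_to_phoneme_sequences("今天你吃饭了吗"))
# [['j','i','n'], ['t','i','a','n'], ['n','i'], ['c','h','i'],
#  ['f','a','n'], ['l','e'], ['m','a']]
```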
S311, performing voice synthesis processing on at least one phoneme sequence based on the voice synthesis model to obtain a Mel frequency spectrum corresponding to the text data.
Generally, after the server obtains the at least one phoneme sequence corresponding to the text data, it uses the pre-trained speech synthesis model to perform speech synthesis on the at least one phoneme sequence to obtain the Mel spectrum corresponding to the text data; this Mel spectrum contains the acoustic features corresponding to the text data, and the speech corresponding to the text data can be determined from it.
And S312, obtaining the synthetic voice corresponding to the text data based on the Mel frequency spectrum corresponding to the text data.
The synthesized speech is speech that has undergone fine synthesis processing and can reflect the user's voice characteristics more realistically.
Generally, because the Mel spectrum corresponding to the text data contains the sound features corresponding to that text, Fourier transform processing can be performed on the Mel spectrum, based on the feature information in it, to obtain the synthesized speech corresponding to the text data. To make the final sound data more realistic, background noise data can be obtained based on a preset signal-to-noise ratio and added to the synthesized speech, yielding speech corresponding to the text data with a real background environment; the synthesized speech with added background noise sounds more real and natural.
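The noise-mixing step can be sketched as follows. The SNR value and the noise source are assumptions; the embodiment only states that a preset signal-to-noise ratio is used.

```python
# Sketch of mixing background noise into the synthesized speech at a preset SNR.
import numpy as np

def add_background_noise(speech, noise, snr_db=30.0):
    """Mix `noise` into `speech` at the requested SNR (both 1-D float arrays)."""
    noise = np.resize(noise, speech.shape)                 # match lengths
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that speech_power / scaled_noise_power == 10^(snr/10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + scale * noise
```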
When the solution of the embodiments of the present application is executed, the server samples the sample audio data at at least two different sampling rates to obtain the corresponding training audio data, determines the linear spectra corresponding to the at least two pieces of training audio data and the sample audio data, converts the linear spectra into the corresponding Mel spectra, and adds those Mel spectra to the sample Mel spectrum set. It obtains sample text data, encodes it to obtain a phoneme sequence vector, performs duration extraction on the phoneme sequence vector to obtain a phoneme alignment vector, and decodes the phoneme alignment vector to obtain the Mel spectrum currently corresponding to the sample text data. It then computes the loss values between that Mel spectrum and at least one Mel spectrum in the sample Mel spectrum set, and generates the speech synthesis model when the loss values are less than or equal to the preset threshold. The server then obtains text data, converts it into at least one phoneme sequence, performs speech synthesis on the at least one phoneme sequence based on the speech synthesis model to obtain the Mel spectrum corresponding to the text data, and obtains the synthesized speech corresponding to the text data based on that Mel spectrum.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Referring to fig. 4, a schematic structural diagram of an apparatus for generating a speech synthesis model according to an exemplary embodiment of the present application is shown. The apparatus, hereinafter referred to as apparatus 4, may be implemented as all or part of a terminal through software, hardware, or a combination of both. The apparatus 4 comprises a sampling module 401 and a training module 402.
The sampling module 401 is configured to sample the sample audio data based on at least two different sampling rates to obtain training audio data corresponding to each sample audio data;
a training module 402, configured to perform training based on at least two training audio data to obtain the speech synthesis model.
Optionally, the training module 402 further comprises:
and the training unit is used for training based on the sample audio data to obtain the speech synthesis model.
Optionally, the training module 402 includes:
a first obtaining unit, configured to obtain sample text data, and obtain the at least two training audio data and the sample audio data;
the first processing unit is used for respectively carrying out encoding processing and decoding processing on the sample text data to obtain a Mel frequency spectrum corresponding to the sample text data at present;
a first generation unit, configured to generate the pre-trained speech synthesis model when the loss value is less than or equal to a preset threshold; and the loss value is the loss value between the Mel frequency spectrum corresponding to the sample text data and the Mel frequency spectrums corresponding to the at least two training audio data and the sample audio data.
Optionally, the training module 402 includes:
a second obtaining unit, configured to obtain respective mel spectrums corresponding to the at least two training audio data and the sample audio data;
an adding unit for adding the Mel spectrum to a set of sample Mel spectra.
Optionally, the training module 402 includes:
a determining unit, configured to determine linear spectrums corresponding to the at least two training audio data and the sample audio data respectively;
a conversion unit, configured to convert the linear spectrum into the mel spectrum corresponding to each of the at least two training audio data and the sample audio data.
Optionally, the training module 402 includes:
a third acquiring unit configured to acquire sample text data;
the coding unit is used for coding the sample text data to obtain a phoneme sequence vector;
a duration extraction unit, configured to perform duration extraction processing on the phoneme sequence vector to obtain a phoneme alignment vector;
a decoding unit, configured to decode the phoneme alignment vector to obtain a mel spectrum corresponding to the sample text data;
a calculating unit, configured to calculate a loss value between a mel spectrum currently corresponding to the sample text data and at least one mel spectrum in the sample mel spectrum set respectively;
a second generating unit configured to generate the speech synthesis model when the loss value is less than or equal to a preset threshold.
Optionally, the training module 402 further comprises:
the second processing unit is used for acquiring text data and converting the text data into at least one phoneme sequence;
a third processing unit, configured to perform speech synthesis processing on the at least one phoneme sequence based on the speech synthesis model to obtain a mel spectrum corresponding to the text data;
and the fourth processing unit is used for obtaining the synthetic voice corresponding to the text data based on the Mel frequency spectrum corresponding to the text data.
It should be noted that, when the apparatus 4 provided in the foregoing embodiment executes the method for generating a speech synthesis model, only the division of the above functional modules is taken as an example, and in practical applications, the above functions may be distributed to different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions. In addition, the embodiments of the method for generating a speech synthesis model provided in the above embodiments belong to the same concept, and details of the implementation process are described in the embodiments of the method, which are not described herein again.
Fig. 5 is a schematic structural diagram of a device for generating a speech synthesis model according to an embodiment of the present application, which is hereinafter referred to as a device 5, where the device 5 may be integrated in the foregoing server or terminal device, as shown in fig. 5, the device includes: memory 502, processor 501, input device 503, output device 504, and communication interface.
The memory 502 may be a separate physical unit, and may be connected to the processor 501, the input device 503, and the output device 504 via a bus. The memory 502, processor 501, input device 503, and output device 504 may also be integrated, implemented in hardware, etc.
The memory 502 is used for storing a program for implementing the above method embodiment, or various modules of the apparatus embodiment, and the processor 501 calls the program to perform the operation of the above method embodiment.
The input device 503 includes, but is not limited to, a keyboard, a mouse, a touch panel, a camera, and a microphone; the output device 504 includes, but is not limited to, a display screen.
Communication interfaces are used to send and receive various types of messages and include, but are not limited to, wireless interfaces or wired interfaces.
Alternatively, when part or all of the methods of the above embodiments are implemented by software, the device may also include only a processor. The memory for storing the program is located outside the device, and the processor is connected to the memory through circuits/wires to read and execute the program stored in the memory.
The processor may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.
The processor may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof.
The memory may include volatile memory (volatile memory), such as random-access memory (RAM); the memory may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a Hard Disk Drive (HDD) or a solid-state drive (SSD); the memory may also comprise a combination of memories of the kind described above.
Wherein the processor 501 calls the program code in the memory 502 for executing the following steps:
respectively sampling the sample audio data based on at least two different sampling rates to obtain respective corresponding training audio data;
and training based on at least two pieces of training audio data to obtain the speech synthesis model.
In one or more embodiments, processor 501 is further configured to:
and training based on the sample audio data to obtain the speech synthesis model.
In one or more embodiments, processor 501 is further configured to:
obtaining sample text data, and obtaining the at least two training audio data and the sample audio data;
respectively carrying out encoding processing and decoding processing on the sample text data to obtain a Mel frequency spectrum corresponding to the sample text data at present;
generating the pre-trained speech synthesis model when the loss value is less than or equal to a preset threshold value; and the loss value is the loss value between the Mel frequency spectrum corresponding to the sample text data and the Mel frequency spectrums corresponding to the at least two training audio data and the sample audio data.
In one or more embodiments, processor 501 is further configured to:
respectively acquiring Mel frequency spectrums corresponding to the at least two training audio data and the sample audio data;
adding the Mel spectrum to a set of sample Mel spectra.
In one or more embodiments, processor 501 is further configured to:
determining linear spectrums corresponding to the at least two training audio data and the sample audio data respectively;
converting the linear spectrum into the Mel frequency spectrums corresponding to the at least two training audio data and the sample audio data respectively.
In one or more embodiments, processor 501 is further configured to:
acquiring sample text data;
coding the sample text data to obtain a phoneme sequence vector;
carrying out duration extraction processing on the phoneme sequence vector to obtain a phoneme alignment vector;
decoding the phoneme alignment vector to obtain a Mel frequency spectrum corresponding to the sample text data at present;
respectively calculating a loss value between a Mel frequency spectrum currently corresponding to the sample text data and at least one Mel frequency spectrum in the sample Mel frequency spectrum set;
and generating the speech synthesis model when the loss value is less than or equal to a preset threshold value.
In one or more embodiments, processor 501 is further configured to:
acquiring text data and converting the text data into at least one phoneme sequence;
performing speech synthesis processing on the at least one phoneme sequence based on the speech synthesis model to obtain a Mel frequency spectrum corresponding to the text data;
and obtaining the synthetic voice corresponding to the text data based on the Mel frequency spectrum corresponding to the text data.
It should be noted that, when the apparatus 5 provided in the foregoing embodiment executes the method for generating a speech synthesis model, only the division of the above functional modules is taken as an example, and in practical applications, the above functions may be distributed to different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions. In addition, the embodiments of the method for generating a speech synthesis model provided in the above embodiments belong to the same concept, and details of the implementation process are described in the embodiments of the method, which are not described herein again.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
An embodiment of the present application further provides a computer storage medium, where the computer storage medium may store a plurality of instructions, where the instructions are suitable for being loaded by a processor and executing the method steps in the embodiments shown in fig. 2 to fig. 3, and a specific execution process may refer to specific descriptions of the embodiments shown in fig. 2 to fig. 3, which is not described herein again.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (10)

1. A method for generating a speech synthesis model, the method comprising:
respectively sampling the sample audio data based on at least two different sampling rates to obtain respective corresponding training audio data;
and training based on at least two pieces of training audio data to obtain the speech synthesis model.
2. The method of claim 1, wherein the training based on at least two training audio data results in the speech synthesis model, further comprising:
and training based on the sample audio data to obtain the speech synthesis model.
3. The method of claim 2, wherein the training based on at least two training audio data results in the speech synthesis model comprising:
obtaining sample text data, and obtaining the at least two training audio data and the sample audio data;
respectively carrying out encoding processing and decoding processing on the sample text data to obtain a Mel frequency spectrum corresponding to the sample text data at present;
generating the pre-trained speech synthesis model when the loss value is less than or equal to a preset threshold value; and the loss value is the loss value between the Mel frequency spectrum corresponding to the sample text data and the Mel frequency spectrums corresponding to the at least two training audio data and the sample audio data.
4. The method of claim 3, wherein the obtaining the at least two training audio data and the sample audio data comprises:
respectively acquiring Mel frequency spectrums corresponding to the at least two training audio data and the sample audio data;
adding the Mel spectrum to a set of sample Mel spectra.
5. The method of claim 4, wherein the obtaining respective Mel spectra corresponding to the at least two training audio data and the sample audio data comprises:
determining linear spectrums corresponding to the at least two training audio data and the sample audio data respectively;
converting the linear spectrum into the Mel frequency spectrums corresponding to the at least two training audio data and the sample audio data respectively.
6. The method according to claim 4 or 5, wherein the performing encoding processing and decoding processing on the sample text data respectively to obtain the Mel frequency spectrum currently corresponding to the sample text data, and generating the pre-trained speech synthesis model when the loss value is less than or equal to a preset threshold value, wherein the loss value is a loss value between the Mel frequency spectrum currently corresponding to the sample text data and the Mel frequency spectra corresponding to the at least two training audio data and the sample audio data, comprises:
acquiring sample text data;
coding the sample text data to obtain a phoneme sequence vector;
carrying out duration extraction processing on the phoneme sequence vector to obtain a phoneme alignment vector;
decoding the phoneme alignment vector to obtain a Mel frequency spectrum corresponding to the sample text data at present;
respectively calculating a loss value between a Mel frequency spectrum currently corresponding to the sample text data and at least one Mel frequency spectrum in the sample Mel frequency spectrum set;
and generating the speech synthesis model when the loss value is less than or equal to a preset threshold value.
7. The method of claim 1, wherein after the training based on at least two training audio data to obtain the speech synthesis model, further comprising:
acquiring text data and converting the text data into at least one phoneme sequence;
performing speech synthesis processing on the at least one phoneme sequence based on the speech synthesis model to obtain a Mel frequency spectrum corresponding to the text data;
and obtaining the synthetic voice corresponding to the text data based on the Mel frequency spectrum corresponding to the text data.
8. An apparatus for generating a speech synthesis model, the apparatus comprising:
the sampling module is used for respectively sampling and processing the sample audio data based on at least two different sampling rates to obtain respective corresponding training audio data;
and the training module is used for training based on at least two pieces of training audio data to obtain the speech synthesis model.
9. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to carry out the method steps according to any one of claims 1 to 7.
10. An electronic device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1 to 7.
CN202010827835.XA 2020-08-17 2020-08-17 Method and device for generating speech synthesis model, storage medium and electronic equipment Pending CN112116903A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010827835.XA CN112116903A (en) 2020-08-17 2020-08-17 Method and device for generating speech synthesis model, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010827835.XA CN112116903A (en) 2020-08-17 2020-08-17 Method and device for generating speech synthesis model, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112116903A true CN112116903A (en) 2020-12-22

Family

ID=73803735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010827835.XA Pending CN112116903A (en) 2020-08-17 2020-08-17 Method and device for generating speech synthesis model, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112116903A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951202A (en) * 2021-03-11 2021-06-11 北京嘀嘀无限科技发展有限公司 Speech synthesis method, apparatus, electronic device and program product
CN113327586A (en) * 2021-06-01 2021-08-31 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113450760A (en) * 2021-06-07 2021-09-28 北京一起教育科技有限责任公司 Method and device for converting text into voice and electronic equipment
WO2022141842A1 (en) * 2020-12-29 2022-07-07 平安科技(深圳)有限公司 Deep learning-based speech training method and apparatus, device, and storage medium
US11887579B1 (en) * 2022-09-28 2024-01-30 Intuit Inc. Synthetic utterance generation

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5970453A (en) * 1995-01-07 1999-10-19 International Business Machines Corporation Method and system for synthesizing speech
WO2005083677A2 (en) * 2004-02-18 2005-09-09 Philips Intellectual Property & Standards Gmbh Method and system for generating training data for an automatic speech recogniser
CN104217713A (en) * 2014-07-15 2014-12-17 西北师范大学 Tibetan-Chinese speech synthesis method and device
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
US20180247636A1 (en) * 2017-02-24 2018-08-30 Baidu Usa Llc Systems and methods for real-time neural text-to-speech
CN109817198A (en) * 2019-03-06 2019-05-28 广州多益网络股份有限公司 Multiple sound training method, phoneme synthesizing method and device for speech synthesis
CN109979429A (en) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 A kind of method and system of TTS
CN110459205A (en) * 2019-09-24 2019-11-15 京东数字科技控股有限公司 Audio recognition method and device, computer can storage mediums
CN111048064A (en) * 2020-03-13 2020-04-21 同盾控股有限公司 Voice cloning method and device based on single speaker voice synthesis data set
CN111179905A (en) * 2020-01-10 2020-05-19 北京中科深智科技有限公司 Rapid dubbing generation method and device
CN111223475A (en) * 2019-11-29 2020-06-02 北京达佳互联信息技术有限公司 Voice data generation method and device, electronic equipment and storage medium
CN111261144A (en) * 2019-12-31 2020-06-09 华为技术有限公司 Voice recognition method, device, terminal and storage medium
CN111292720A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
US10692484B1 (en) * 2018-06-13 2020-06-23 Amazon Technologies, Inc. Text-to-speech (TTS) processing
US10699695B1 (en) * 2018-06-29 2020-06-30 Amazon Washington, Inc. Text-to-speech (TTS) processing
CN111489734A (en) * 2020-04-03 2020-08-04 支付宝(杭州)信息技术有限公司 Model training method and device based on multiple speakers

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5970453A (en) * 1995-01-07 1999-10-19 International Business Machines Corporation Method and system for synthesizing speech
WO2005083677A2 (en) * 2004-02-18 2005-09-09 Philips Intellectual Property & Standards Gmbh Method and system for generating training data for an automatic speech recogniser
CN101014997A (en) * 2004-02-18 2007-08-08 皇家飞利浦电子股份有限公司 Method and system for generating training data for an automatic speech recogniser
CN104217713A (en) * 2014-07-15 2014-12-17 西北师范大学 Tibetan-Chinese speech synthesis method and device
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
US20180247636A1 (en) * 2017-02-24 2018-08-30 Baidu Usa Llc Systems and methods for real-time neural text-to-speech
US10692484B1 (en) * 2018-06-13 2020-06-23 Amazon Technologies, Inc. Text-to-speech (TTS) processing
US10699695B1 (en) * 2018-06-29 2020-06-30 Amazon Washington, Inc. Text-to-speech (TTS) processing
CN109817198A (en) * 2019-03-06 2019-05-28 广州多益网络股份有限公司 Multiple sound training method, phoneme synthesizing method and device for speech synthesis
CN109979429A (en) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 A kind of method and system of TTS
CN110459205A (en) * 2019-09-24 2019-11-15 京东数字科技控股有限公司 Audio recognition method and device, computer can storage mediums
CN111223475A (en) * 2019-11-29 2020-06-02 北京达佳互联信息技术有限公司 Voice data generation method and device, electronic equipment and storage medium
CN111261144A (en) * 2019-12-31 2020-06-09 华为技术有限公司 Voice recognition method, device, terminal and storage medium
CN111179905A (en) * 2020-01-10 2020-05-19 北京中科深智科技有限公司 Rapid dubbing generation method and device
CN111292720A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111048064A (en) * 2020-03-13 2020-04-21 同盾控股有限公司 Voice cloning method and device based on single speaker voice synthesis data set
CN111489734A (en) * 2020-04-03 2020-08-04 支付宝(杭州)信息技术有限公司 Model training method and device based on multiple speakers

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022141842A1 (en) * 2020-12-29 2022-07-07 平安科技(深圳)有限公司 Deep learning-based speech training method and apparatus, device, and storage medium
CN112951202A (en) * 2021-03-11 2021-06-11 北京嘀嘀无限科技发展有限公司 Speech synthesis method, apparatus, electronic device and program product
CN112951202B (en) * 2021-03-11 2022-11-08 北京嘀嘀无限科技发展有限公司 Speech synthesis method, apparatus, electronic device and program product
CN113327586A (en) * 2021-06-01 2021-08-31 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113327586B (en) * 2021-06-01 2023-11-28 深圳市北科瑞声科技股份有限公司 Voice recognition method, device, electronic equipment and storage medium
CN113450760A (en) * 2021-06-07 2021-09-28 北京一起教育科技有限责任公司 Method and device for converting text into voice and electronic equipment
US11887579B1 (en) * 2022-09-28 2024-01-30 Intuit Inc. Synthetic utterance generation

Similar Documents

Publication Publication Date Title
CN112002305A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112116903A (en) Method and device for generating speech synthesis model, storage medium and electronic equipment
CN105976812B (en) A kind of audio recognition method and its equipment
CN110288980A (en) Audio recognition method, the training method of model, device, equipment and storage medium
CN110782882A (en) Voice recognition method and device, electronic equipment and storage medium
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN111368559A (en) Voice translation method and device, electronic equipment and storage medium
CN101872615A (en) System and method for distributed text-to-speech synthesis and intelligibility
CN107481715B (en) Method and apparatus for generating information
CN107705782B (en) Method and device for determining phoneme pronunciation duration
CN109660865B (en) Method and device for automatically labeling videos, medium and electronic equipment
CN109582825B (en) Method and apparatus for generating information
CN111883107B (en) Speech synthesis and feature extraction model training method, device, medium and equipment
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
CN112185363A (en) Audio processing method and device
CN112634858A (en) Speech synthesis method, speech synthesis device, computer equipment and storage medium
CN111698552A (en) Video resource generation method and device
CN112580669B (en) Training method and device for voice information
CN110930975A (en) Method and apparatus for outputting information
CN112364653A (en) Text analysis method, apparatus, server and medium for speech synthesis
CN114125506A (en) Voice auditing method and device
CN109213466B (en) Court trial information display method and device
CN115206321A (en) Voice keyword recognition method and device and electronic equipment
CN114093340A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN115967833A (en) Video generation method, device and equipment meter storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination