WO2020232860A1 - Speech synthesis method and apparatus, and computer readable storage medium - Google Patents

Speech synthesis method and apparatus, and computer readable storage medium Download PDF

Info

Publication number
WO2020232860A1
Authority
WO
WIPO (PCT)
Prior art keywords
mel
spectrogram
speaker
target
neural network
Prior art date
Application number
PCT/CN2019/102198
Other languages
French (fr)
Chinese (zh)
Inventor
彭话易
程宁
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020232860A1 publication Critical patent/WO2020232860A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013Adapting to target pitch
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013Adapting to target pitch
    • G10L2021/0135Voice conversion or morphing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a speech synthesis method, device and computer-readable storage medium.
  • This application provides a speech synthesis method, device, and computer-readable storage medium, the main purpose of which is to provide a solution that can realize timbre conversion in a speech synthesis system.
  • A speech synthesis method includes: receiving speech data of a source speaker, converting the speech data of the source speaker into text content, and converting the text content into a text vector; converting the text vector into a Mel spectrogram of the source speaker; obtaining a speech signal of a target speaker, and converting the speech signal of the target speaker into Mel frequency cepstral coefficient features of the target speaker;
  • inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, taking the target Mel spectrogram as a training value and inputting the Mel frequency cepstral coefficient features of the target speaker as a label value into a loss function; when the loss value output by the loss function is greater than or equal to a preset threshold, transforming and adjusting the target Mel spectrogram until the loss value output by the loss function is less than the preset threshold, and then outputting the target Mel spectrogram as the Mel spectrogram of the target speaker; and converting the Mel spectrogram of the target speaker into speech corresponding to the text content and outputting it.
  • the present application also provides a speech synthesis device, which includes a memory and a processor.
  • The memory stores a speech synthesis program that can run on the processor, and when the speech synthesis program is executed by the processor, the following steps are implemented: receiving the speech data of the source speaker, converting the speech data of the source speaker into text content, and converting the text content into a text vector; converting the text vector into a Mel spectrogram of the source speaker; obtaining the speech signal of the target speaker, and converting the speech signal of the target speaker into Mel frequency cepstral coefficient features of the target speaker; inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, and inputting the target Mel spectrogram as a training value and the Mel frequency cepstral coefficient features of the target speaker as a label value into a loss function.
  • When the loss value output by the loss function is greater than or equal to a preset threshold, the target Mel spectrogram is transformed and adjusted until the loss value is less than the preset threshold, after which the target Mel spectrogram is output as the Mel spectrogram of the target speaker, and the Mel spectrogram of the target speaker is converted into speech corresponding to the text content and output.
  • the present application also provides a computer-readable storage medium with a speech synthesis program stored on the computer-readable storage medium.
  • The speech synthesis program can be executed by one or more processors to implement the steps of the speech synthesis method described above.
  • The speech synthesis method, device, and computer-readable storage medium proposed in this application use a pre-trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into the Mel spectrogram of the target speaker, so that text content that would otherwise be output in the source speaker's timbre is instead output in the target speaker's timbre, realizing timbre conversion in the speech synthesis system.
  • FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of this application
  • FIG. 2 is a schematic diagram of converting text content into text vectors in a speech synthesis method provided by an embodiment of the application;
  • FIG. 3 is a schematic structural diagram of a spectral feature conversion model in a speech synthesis method provided by an embodiment of this application;
  • FIG. 4 is a schematic diagram of the internal structure of a speech synthesis device provided by an embodiment of the application.
  • FIG. 5 is a schematic diagram of modules of a speech synthesis program in a speech synthesis device provided by an embodiment of the application.
  • This application provides a speech synthesis method.
  • Referring to FIG. 1, it is a schematic flowchart of a speech synthesis method provided by an embodiment of this application.
  • the method can be executed by a device, and the device can be implemented by software and/or hardware.
  • the speech synthesis method includes:
  • This application uses a text embedding module to convert Chinese characters in the text content into text vectors.
  • This application uses the text embedding module to perform word segmentation on the Chinese characters in the input text content, and then transliterates the resulting segments into Chinese pinyin with tones (using the digits 1-5 to represent the four tones and the neutral tone of Mandarin); for example, the segmented word "您好" is converted to "nin2hao3".
  • Further, this application uses one-hot encoding to convert the pinyin letters and tone digits in the transliterated pinyin into one-dimensional text vectors, which are then arranged in time order into a two-dimensional text vector, as shown in FIG. 2.
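  • The following is a minimal Python sketch of the encoding just described. The symbol inventory and vector layout are illustrative assumptions rather than values taken from this application; only the idea of one one-hot vector per pinyin letter or tone digit, stacked in time order into a two-dimensional vector, follows the description.

```python
import numpy as np

# Assumed symbol set: lowercase pinyin letters plus the tone digits 1-5.
SYMBOLS = list("abcdefghijklmnopqrstuvwxyz") + list("12345")
SYM2IDX = {s: i for i, s in enumerate(SYMBOLS)}

def one_hot(symbol: str) -> np.ndarray:
    """Encode one pinyin letter or tone digit as a one-dimensional one-hot vector."""
    vec = np.zeros(len(SYMBOLS), dtype=np.float32)
    vec[SYM2IDX[symbol]] = 1.0
    return vec

def encode_pinyin(pinyin: str) -> np.ndarray:
    """Stack the one-hot vectors in time order to form a two-dimensional text vector."""
    return np.stack([one_hot(ch) for ch in pinyin])

# Example from the description: the segmented word "您好" transliterated as "nin2hao3".
text_vector = encode_pinyin("nin2hao3")
print(text_vector.shape)  # (8, 31): 8 time steps, each a 31-dimensional one-hot vector
```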
  • In a preferred embodiment of this application, the text vector is input into a Mel spectrogram generation module, which converts the text vector into the Mel spectrogram of the source speaker.
  • The Mel spectrogram generation module receives the text vector passed from the text embedding module and uses a trained sequence-to-sequence neural network model to convert the text vector into the Mel spectrogram of the source speaker.
  • The trained sequence-to-sequence neural network model described in this application adopts the Tacotron architecture and was trained on a non-public speech database.
  • The speech database contains about 30 hours of speech files recorded by one female speaker (i.e. the source speaker) in a quiet environment with dedicated recording equipment, together with the text file corresponding to each utterance. After being mapped by the trained sequence-to-sequence neural network model, the input text vector is converted into the Mel spectrogram of the source speaker.
  • The Mel spectrogram is a spectrogram based on Mel frequency cepstral coefficient (MFCC) features.
  • To obtain the Mel frequency cepstral coefficient features, this application first applies a pre-emphasis filter to boost the high-frequency components and the signal-to-noise ratio, using the formula y(t) = x(t) - αx(t-1), where x is the input signal, y is the output signal, x(t) is the signal at time t, x(t-1) is the signal at time t-1, and α is generally set to 0.97.
  • The pre-emphasis filter yields the signal output y(t) at time t with the high-frequency components and signal-to-noise ratio boosted; a short-time Fourier transform is then performed.
  • To simulate the human ear's suppression of high-frequency signals, this application processes the linear spectrum produced by the short-time Fourier transform with a filter bank composed of multiple triangular filters, obtaining low-dimensional features that emphasize the low-frequency part and attenuate the high-frequency part, thereby yielding the Mel frequency cepstral coefficient features.
  • Preferably, to prevent energy leakage, the preferred embodiment of this application applies a Hanning window function before performing the Fourier transform.
  • The Hanning window can be regarded as the sum of the spectra of three rectangular time windows, or as the sum of three sin(t)-type functions, where the two bracketed terms are shifted left and right by π/T relative to the first spectral window, so that the side lobes cancel each other out, suppressing high-frequency interference and leakage energy.
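  • The following is a hedged Python sketch of this feature-extraction chain (pre-emphasis, Hanning window, short-time Fourier transform, triangular mel filter bank). The use of librosa and the frame size, hop size, and number of mel bands are illustrative assumptions, not values given in this application.

```python
import numpy as np
import librosa

def mel_features(x: np.ndarray, sr: int = 16000, alpha: float = 0.97,
                 n_fft: int = 1024, hop: int = 256, n_mels: int = 80) -> np.ndarray:
    # Pre-emphasis filter y(t) = x(t) - alpha * x(t-1): boosts the high-frequency part.
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    # Short-time Fourier transform with a Hanning window to limit energy leakage.
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop, window="hann"))
    # Triangular mel filter bank: low-dimensional features that emphasize low
    # frequencies and attenuate high frequencies.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    # Log compression gives the cepstral-style features used downstream.
    return np.log(mel_fb @ spec + 1e-6)

# Hypothetical usage on a target-speaker recording (the file name is a placeholder).
audio, sr = librosa.load("target_speaker.wav", sr=16000)
features = mel_features(audio, sr)
```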
  • The spectral feature conversion model described in this application includes a convolutional neural network (CNN) model and a bidirectional LSTM-based recurrent neural network (RNN) model.
  • This application passes the Mel spectrogram of the source speaker through one layer of a pre-trained convolutional neural network for compression along the time axis, so that the features in the Mel spectrogram are better represented.
  • The processed Mel spectrogram is then divided into frames according to the time sequence.
  • The Mel frequency cepstral coefficient features of each frame are combined with the identity feature of the target speaker and then input into a two-layer bidirectional LSTM-based recurrent neural network for processing.
  • The bidirectional LSTM recurrent neural network converts the Mel spectrogram of the source speaker into the target Mel spectrogram frame by frame. Further, this application takes the converted target Mel spectrogram as the training value and the Mel frequency cepstral coefficient features of the target speaker obtained in step S3 above as the label value, and inputs them into a loss function. When the loss value output by the loss function is greater than or equal to the preset threshold, the target Mel spectrogram is transformed and adjusted; once the loss value is less than the preset threshold, the target Mel spectrogram is output as the Mel spectrogram of the target speaker. A hedged sketch of this conversion model is given below.
  • the structure of the spectral feature conversion model is shown in FIG. 3.
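  • The following PyTorch sketch illustrates the model as described: a convolutional front end that compresses the source Mel spectrogram along time, a speaker-identity embedding added to every frame, and a two-layer bidirectional LSTM that predicts the target Mel spectrogram frame by frame, trained against the target speaker's features with an MSE-style loss. All layer sizes and the embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpectralFeatureConverter(nn.Module):
    def __init__(self, n_mels: int = 80, n_speakers: int = 10,
                 spk_dim: int = 16, hidden: int = 256):
        super().__init__()
        # One convolutional layer compresses the source spectrogram along the time axis.
        self.conv = nn.Conv1d(n_mels, n_mels, kernel_size=3, stride=2, padding=1)
        # Speaker IDs 1-9 as in the description; index 0 is left unused here.
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)
        self.blstm = nn.LSTM(n_mels + spk_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, src_mel: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # src_mel: (batch, time, n_mels); speaker_id: (batch,)
        x = self.conv(src_mel.transpose(1, 2)).transpose(1, 2)      # compress over time
        spk = self.spk_emb(speaker_id).unsqueeze(1).expand(-1, x.size(1), -1)
        x = torch.cat([x, spk], dim=-1)       # add the speaker identity to every frame
        out, _ = self.blstm(x)
        return self.proj(out)                 # predicted target Mel spectrogram

# Training sketch: compare the prediction with the target speaker's reference features.
model = SpectralFeatureConverter()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```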
  • the convolutional neural network and the recurrent neural network based on bidirectional LSTM are also trained using a private speech data set.
  • The speech data set contains recordings of N female speakers (preferably, N is 10), each with about 1 hour of speech, and the text content recorded by the 10 speakers is identical.
  • One of the female speakers also recorded the speech database used to train the sequence-to-sequence neural network model above; this speaker is therefore taken as the source speaker.
  • The remaining nine speakers are regarded as target speakers and are assigned the ID numbers 1-9. During training of the convolutional neural network and the bidirectional LSTM-based recurrent neural network, and during subsequent inference, this number is embedded into the corresponding Mel frequency cepstral coefficient features as the target speaker identity vector.
  • The convolutional neural network is a feed-forward neural network whose artificial neurons respond to a portion of the surrounding units within their coverage area. Its basic structure includes two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, from which local features are extracted; once a local feature is extracted, its positional relationship to other features is also determined. The second is the feature mapping layer: each computing layer of the network is composed of multiple feature maps, each feature map is a plane, and all neurons on a plane share equal weights.
  • The feature mapping structure uses a sigmoid function with a small influence-function kernel as the activation function of the convolutional network, which makes the feature mapping shift-invariant.
  • Each convolutional layer in the convolutional neural network is followed by a computing layer for local averaging and secondary feature extraction; this two-stage feature extraction structure reduces the feature resolution.
  • Input layer: the only data input port of the entire convolutional neural network, mainly used to define different types of data input.
  • Convolutional layer: convolves the data fed into it and outputs the convolved feature map.
  • Down-sampling layer: the pooling layer down-samples the incoming data in the spatial dimensions, so that the length and width of the input feature map are halved.
  • Fully connected layer: the same as in an ordinary neural network; each neuron is connected to all input neurons, and the result is then passed through an activation function.
  • Output layer: also called the classification layer; the final output computes the classification score of each category.
  • The input to this network is the Mel spectrogram of the source speaker, which passes sequentially through a 7*7 convolutional layer and a 3*3 max-pooling layer, and then enters 4 convolution modules.
  • Each convolution module starts with a building block with a linear projection, followed by a varying number of building blocks with identity mapping, and the softmax layer finally outputs a Mel spectrogram compressed along the time axis; an illustrative sketch of such a front end follows.
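  • The following PyTorch sketch shows one way such a front end could be built: a 7*7 convolution, a 3*3 max-pooling layer, and four convolution modules, each opening with a linear-projection residual block followed by identity-mapping residual blocks. The channel widths and block counts are assumptions, and the softmax output layer mentioned above is omitted.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block; a 1x1 linear projection is used when the shape changes."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # Linear-projection shortcut when downsampling, identity mapping otherwise.
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1, stride=stride)
                         if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.body(x) + self.shortcut(x))

def conv_module(in_ch: int, out_ch: int, n_identity: int) -> nn.Module:
    """One convolution module: a projection block followed by identity blocks."""
    blocks = [ResBlock(in_ch, out_ch, stride=2)]
    blocks += [ResBlock(out_ch, out_ch) for _ in range(n_identity)]
    return nn.Sequential(*blocks)

frontend = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),   # 7*7 convolutional layer
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # 3*3 max-pooling layer
    conv_module(64, 64, 2), conv_module(64, 128, 3),
    conv_module(128, 256, 5), conv_module(256, 512, 2),      # 4 convolution modules
)
```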
  • The recurrent neural network is typically used to model dynamic sequence data; it dynamically adjusts its own network state as time progresses and performs this recurrent transmission continuously.
  • In a traditional neural network, neurons pass from the input layer to the hidden layer and then from the hidden layer to the output layer; the layers are fully or locally connected, and the feature information produced during the computation of one layer is lost as the data is passed on.
  • The RNN differs from the traditional neural network model in that the current output of a sequence is also related to the previous outputs: the network memorizes earlier information and applies it to the computation of the current output. In other words, the nodes between the hidden layers are no longer unconnected but linked, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step.
  • The Mel frequency cepstral coefficient features, framed according to the time sequence, are input into the two-layer LSTM-based recurrent neural network model, and gradient descent is used to minimize the loss function.
  • The loss function is used to evaluate the difference between the predicted value Ŷ output by the network model and the true value Y. It is denoted here by L(Y, Ŷ); it is a non-negative real-valued function, and the smaller the loss value, the better the performance of the network model.
  • Let w_i denote the weight of the i-th neuron, x_i the i-th neuron of the l-th layer of the network, and C_i the output value of each unit of the output layer; according to this input-output relationship, the mean squared error (MSE) is used to establish the loss function L = (1/n) Σ_{i=1}^{n} (Y_i - Ŷ_i)², where Y_i is the correct answer for the i-th item of data in a batch and Ŷ_i is the predicted value given by the neural network.
  • The activation function used here satisfies the sparsity found in biology: a neuron node is activated only when its input exceeds a certain value, inputs below 0 are suppressed, and above that value the independent variable and the dependent variable have a linear relationship.
  • The preferred embodiment of this application uses a gradient descent algorithm to minimize the loss function.
  • The gradient descent algorithm is the most commonly used optimization algorithm for training neural network models.
  • To find the minimum of the loss, the variable y needs to be updated in the direction opposite to the gradient vector, i.e. along -dL/dy, so that the loss decreases fastest until it converges to a minimum, as in the update rule below.
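  • A hedged restatement of this update rule is given below; the step size (learning rate) η is an assumed hyperparameter rather than a value given in this application.

```latex
% Gradient-descent update implied by the description above.
y_{k+1} = y_k - \eta \left.\frac{\mathrm{d}L}{\mathrm{d}y}\right|_{y = y_k},
\qquad \eta > 0
```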
  • Further, this application uses the Softmax function to produce the classification label.
  • Softmax is a generalization of logistic regression: logistic regression handles binary classification problems, and its generalization, Softmax regression, handles multi-class classification problems. According to the input Mel frequency cepstral coefficient features, the category with the maximum output probability among all categories is obtained through this activation function.
  • The core formula is p(k) = exp(x_k) / Σ_{j=1}^{K} exp(x_j), where there are K categories in total, x_k represents the sample of category k, and x_j represents the sample of category j; the target Mel spectrogram is thereby obtained.
  • The preferred embodiment of this application uses a speech generation module to synthesize the Mel spectrogram of the target speaker into speech.
  • The speech generation module is used to process the Mel spectrogram and generate speech of high fidelity and high naturalness.
  • After obtaining the Mel spectrogram of the target speaker, this application uses the speech generation module with the Mel spectrogram as the conditioning input to generate the target speaker's speech.
  • The speech generation module uses a vocoder called WaveNet. When the Mel spectrograms of different target speakers are input, the vocoder can generate high-fidelity speech of the different target speakers according to the Mel spectrograms.
  • The WaveNet vocoder used in the preferred embodiment of this application is also trained on a non-public speech data set, which is the same data set used for training the convolutional neural network.
  • WaveNet is an end-to-end TTS (text-to-speech) model whose main concept is causal convolution.
  • Causal convolution means that when WaveNet generates the element at time t, it can only use the element values from time 0 to t-1. Since an audio file is a one-dimensional array over time, a file with a sampling rate of 16 kHz contains 16,000 elements per second, while the receptive field of plain causal convolution is very small and remains limited even when many layers are stacked.
  • WaveNet therefore uses stacked multi-layer dilated convolutions to enlarge the receptive field of the network, so that when the network generates the next element it can use more of the preceding element values, as sketched below.
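  • The following PyTorch sketch illustrates the dilated causal convolution idea: each layer doubles its dilation, so the receptive field grows exponentially with depth, while left-only padding keeps the convolution causal (it never looks at samples after the current step). The layer count and channel width are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.pad = dilation   # pad only on the left so no future samples are used
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(F.pad(x, (self.pad, 0)))

# Ten layers with dilations 1, 2, 4, ..., 512.
stack = nn.Sequential(*[CausalDilatedConv1d(64, 2 ** i) for i in range(10)])
# Receptive field: 1 + (1 + 2 + ... + 512) = 1024 samples, versus only 11 samples
# for ten undilated causal convolutions with kernel size 2.
x = torch.zeros(1, 64, 16000)   # one second of dummy features at 16 kHz
y = stack(x)                    # same length as the input thanks to the left padding
```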
  • the application also provides a speech synthesis device.
  • Referring to FIG. 4, it is a schematic diagram of the internal structure of a speech synthesis device provided by an embodiment of this application.
  • the speech synthesis device 1 may be a PC (Personal Computer, personal computer), or a terminal device such as a smart phone, a tablet computer, or a portable computer.
  • the speech synthesis device 1 at least includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc.
  • the memory 11 may be an internal storage unit of the speech synthesis device 1 in some embodiments, such as a hard disk of the speech synthesis device 1.
  • The memory 11 may also be an external storage device of the speech synthesis device 1, such as a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card equipped on the speech synthesis device 1.
  • the memory 11 may also include both an internal storage unit of the speech synthesis apparatus 1 and an external storage device.
  • the memory 11 can be used not only to store application software and various data installed in the speech synthesis device 1, such as the code of the speech synthesis program 01, etc., but also to temporarily store data that has been output or will be output.
  • The processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run the program code stored in the memory 11 or to process data, for example to execute the speech synthesis program 01.
  • the communication bus 13 is used to realize the connection and communication between these components.
  • the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.
  • the device 1 may also include a user interface.
  • the user interface may include a display (Display) and an input unit such as a keyboard (Keyboard).
  • the optional user interface may also include a standard wired interface and a wireless interface.
  • The display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, etc.
  • the display can also be called a display screen or a display unit as appropriate, for displaying information processed in the speech synthesis device 1 and for displaying a visualized user interface.
  • FIG. 4 only shows the speech synthesis device 1 with the components 11-14 and the speech synthesis program 01. Those skilled in the art can understand that the structure shown in FIG. 4 does not constitute a limitation on the speech synthesis device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the speech synthesis program 01 is stored in the memory 11; when the processor 12 executes the speech synthesis program 01 stored in the memory 11, the following steps are implemented:
  • Step 1: Receive the speech data of the source speaker, convert the speech data of the source speaker into text content, and convert the text content into a text vector.
  • This application uses a text embedding module to convert Chinese characters in the text content into text vectors.
  • This application uses the text embedding module to perform word segmentation on the Chinese characters in the input text content, and then transliterates the resulting segments into Chinese pinyin with tones (using the digits 1-5 to represent the four tones and the neutral tone of Mandarin); for example, the segmented word "您好" is converted to "nin2hao3".
  • Further, this application uses one-hot encoding to convert the pinyin letters and tone digits in the transliterated pinyin into one-dimensional text vectors, which are then arranged in time order into a two-dimensional text vector, as shown in FIG. 2.
  • Step 2: Convert the text vector into the Mel spectrogram of the source speaker.
  • In a preferred embodiment of this application, the text vector is input into a Mel spectrogram generation module, which converts the text vector into the Mel spectrogram of the source speaker.
  • The Mel spectrogram generation module receives the text vector passed from the text embedding module and uses a trained sequence-to-sequence neural network model to convert the text vector into the Mel spectrogram of the source speaker.
  • The trained sequence-to-sequence neural network model described in this application adopts the Tacotron architecture and was trained on a non-public speech database.
  • The speech database contains about 30 hours of speech files recorded by one female speaker (i.e. the source speaker) in a quiet environment with dedicated recording equipment, together with the text file corresponding to each utterance. After being mapped by the trained sequence-to-sequence neural network model, the input text vector is converted into the Mel spectrogram of the source speaker.
  • The Mel spectrogram is a spectrogram based on Mel frequency cepstral coefficient (MFCC) features.
  • To obtain the Mel frequency cepstral coefficient features, this application first applies a pre-emphasis filter to boost the high-frequency components and the signal-to-noise ratio, using the formula y(t) = x(t) - αx(t-1), where α is generally set to 0.97.
  • The pre-emphasis filter yields the signal output y(t) at time t with the high-frequency components and signal-to-noise ratio boosted; a short-time Fourier transform is then performed.
  • To simulate the human ear's suppression of high-frequency signals, this application processes the linear spectrum produced by the short-time Fourier transform with a filter bank composed of multiple triangular filters, obtaining low-dimensional features that emphasize the low-frequency part and attenuate the high-frequency part, thereby yielding the Mel frequency cepstral coefficient features.
  • Preferably, to prevent energy leakage, the preferred embodiment of this application applies a Hanning window function before performing the Fourier transform.
  • The Hanning window can be regarded as the sum of the spectra of three rectangular time windows, or as the sum of three sin(t)-type functions, where the two bracketed terms are shifted left and right by π/T relative to the first spectral window, so that the side lobes cancel each other out, suppressing high-frequency interference and leakage energy.
  • Step 3: Obtain the speech signal of the target speaker, and convert the speech signal of the target speaker into the Mel frequency cepstral coefficient features of the target speaker.
  • Step 4: Input the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, take the target Mel spectrogram as the training value and the Mel frequency cepstral coefficient features of the target speaker as the label value, and input them into a loss function; when the loss value output by the loss function is greater than or equal to the preset threshold, transform and adjust the target Mel spectrogram until the loss value output by the loss function is less than the preset threshold, and then output the target Mel spectrogram as the Mel spectrogram of the target speaker.
  • The spectral feature conversion model described in this application includes a convolutional neural network (CNN) model and a bidirectional LSTM-based recurrent neural network (RNN) model.
  • This application passes the Mel spectrogram of the source speaker through one layer of a pre-trained convolutional neural network for compression along the time axis, so that the features in the Mel spectrogram are better represented. The processed Mel spectrogram is then divided into frames according to the time sequence, the Mel frequency cepstral coefficient features of each frame are combined with the identity feature of the target speaker, and the result is input into a two-layer bidirectional LSTM-based recurrent neural network for processing.
  • The bidirectional LSTM recurrent neural network converts the Mel spectrogram of the source speaker into the target Mel spectrogram frame by frame. Further, this application takes the converted target Mel spectrogram as the training value and the Mel frequency cepstral coefficient features of the target speaker obtained in step 3 above as the label value, and inputs them into a loss function. When the loss value output by the loss function is greater than or equal to the preset threshold, the target Mel spectrogram is transformed and adjusted; once the loss value is less than the preset threshold, the target Mel spectrogram is output as the Mel spectrogram of the target speaker.
  • the structure of the spectral feature conversion model is shown in FIG. 3.
  • the convolutional neural network and the recurrent neural network based on bidirectional LSTM are also trained using a private speech data set.
  • The speech data set contains recordings of N female speakers (preferably, N is 10), each with about 1 hour of speech, and the text content recorded by the 10 speakers is identical.
  • One of the female speakers also recorded the speech database used to train the sequence-to-sequence neural network model above; this speaker is therefore taken as the source speaker.
  • The remaining nine speakers are regarded as target speakers and are assigned the ID numbers 1-9. During training of the convolutional neural network and the bidirectional LSTM-based recurrent neural network, and during subsequent inference, this number is embedded into the corresponding Mel frequency cepstral coefficient features as the target speaker identity vector.
  • The convolutional neural network is a feed-forward neural network whose artificial neurons respond to a portion of the surrounding units within their coverage area. Its basic structure includes two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, from which local features are extracted; once a local feature is extracted, its positional relationship to other features is also determined. The second is the feature mapping layer: each computing layer of the network is composed of multiple feature maps, each feature map is a plane, and all neurons on a plane share equal weights.
  • The feature mapping structure uses a sigmoid function with a small influence-function kernel as the activation function of the convolutional network, which makes the feature mapping shift-invariant.
  • Each convolutional layer in the convolutional neural network is followed by a computing layer for local averaging and secondary feature extraction; this two-stage feature extraction structure reduces the feature resolution.
  • Input layer: the only data input port of the entire convolutional neural network, mainly used to define different types of data input.
  • Convolutional layer: convolves the data fed into it and outputs the convolved feature map.
  • Down-sampling layer: the pooling layer down-samples the incoming data in the spatial dimensions, so that the length and width of the input feature map are halved.
  • Fully connected layer: the same as in an ordinary neural network; each neuron is connected to all input neurons, and the result is then passed through an activation function.
  • Output layer: also called the classification layer; the final output computes the classification score of each category.
  • The input to this network is the Mel spectrogram of the source speaker, which passes sequentially through a 7*7 convolutional layer and a 3*3 max-pooling layer, and then enters 4 convolution modules.
  • Each convolution module starts with a building block with a linear projection, followed by a varying number of building blocks with identity mapping, and the softmax layer finally outputs a Mel spectrogram compressed along the time axis.
  • The recurrent neural network is typically used to model dynamic sequence data; it dynamically adjusts its own network state as time progresses and performs this recurrent transmission continuously.
  • In a traditional neural network, neurons pass from the input layer to the hidden layer and then from the hidden layer to the output layer; the layers are fully or locally connected, and the feature information produced during the computation of one layer is lost as the data is passed on.
  • The RNN differs from the traditional neural network model in that the current output of a sequence is also related to the previous outputs: the network memorizes earlier information and applies it to the computation of the current output. In other words, the nodes between the hidden layers are no longer unconnected but linked, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step.
  • The Mel frequency cepstral coefficient features, framed according to the time sequence, are input into the two-layer LSTM-based recurrent neural network model, and gradient descent is used to minimize the loss function.
  • The loss function is used to evaluate the difference between the predicted value Ŷ output by the network model and the true value Y. It is denoted here by L(Y, Ŷ); it is a non-negative real-valued function, and the smaller the loss value, the better the performance of the network model.
  • Let w_i denote the weight of the i-th neuron, x_i the i-th neuron of the l-th layer of the network, and C_j the output value of each unit of the output layer; according to this input-output relationship, the mean squared error (MSE) is used to establish the loss function L = (1/n) Σ_{i=1}^{n} (Y_i - Ŷ_i)², where Y_i is the correct answer for the i-th item of data in a batch and Ŷ_i is the predicted value given by the neural network.
  • The activation function used here satisfies the sparsity found in biology: a neuron node is activated only when its input exceeds a certain value, inputs below 0 are suppressed, and above that value the independent variable and the dependent variable have a linear relationship.
  • The preferred embodiment of this application uses a gradient descent algorithm to minimize the loss function.
  • The gradient descent algorithm is the most commonly used optimization algorithm for training neural network models.
  • To find the minimum of the loss, the variable y needs to be updated in the direction opposite to the gradient vector, i.e. along -dL/dy, so that the loss decreases fastest until it converges to a minimum.
  • Further, this application uses the Softmax function to produce the classification label.
  • Softmax is a generalization of logistic regression: logistic regression handles binary classification problems, and its generalization, Softmax regression, handles multi-class classification problems. According to the input Mel frequency cepstral coefficient features, the category with the maximum output probability among all categories is obtained through this activation function.
  • The core formula is p(k) = exp(x_k) / Σ_{j=1}^{K} exp(x_j), where there are K categories in total, x_k represents the sample of category k, and x_j represents the sample of category j; the target Mel spectrogram is thereby obtained.
  • Step 5: Convert the Mel spectrogram of the target speaker into speech corresponding to the text content and output it.
  • The preferred embodiment of this application uses a speech generation module to synthesize the Mel spectrogram of the target speaker into speech.
  • The speech generation module is used to process the Mel spectrogram and generate speech of high fidelity and high naturalness.
  • After obtaining the Mel spectrogram of the target speaker, this application uses the speech generation module with the Mel spectrogram as the conditioning input to generate the target speaker's speech.
  • The speech generation module uses a vocoder called WaveNet. When the Mel spectrograms of different target speakers are input, the vocoder can generate high-fidelity speech of the different target speakers according to the Mel spectrograms.
  • The WaveNet vocoder used in the preferred embodiment of this application is also trained on a non-public speech data set, which is the same data set used for training the convolutional neural network.
  • WaveNet is an end-to-end TTS (text-to-speech) model whose main concept is causal convolution.
  • Causal convolution means that when WaveNet generates the element at time t, it can only use the element values from time 0 to t-1. Since an audio file is a one-dimensional array over time, a file with a sampling rate of 16 kHz contains 16,000 elements per second, while the receptive field of plain causal convolution is very small and remains limited even when many layers are stacked.
  • WaveNet therefore uses stacked multi-layer dilated convolutions to enlarge the receptive field of the network, so that when the network generates the next element it can use more of the preceding element values.
  • The speech synthesis program 01 may also be divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to complete this application.
  • The module referred to in this application is a series of computer program instruction segments capable of completing a specific function, and is used to describe the execution process of the speech synthesis program in the speech synthesis device.
  • Referring to FIG. 5, it is a schematic diagram of the program modules of the speech synthesis program in an embodiment of the speech synthesis device of this application.
  • Exemplarily, the speech synthesis program can be divided into a text embedding module 10, a Mel spectrogram generation module 20, a spectral feature conversion module 30, and a speech generation module 40:
  • The text embedding module 10 is configured to receive the speech data of the source speaker, convert the speech data of the source speaker into text content, and convert the text content into a text vector.
  • Specifically, the text embedding module 10 is configured to perform word segmentation on the Chinese characters in the text content, transliterate the resulting segments into tonal Chinese pinyin, and use one-hot encoding to convert the pinyin letters and tone digits in the transliterated pinyin into one-dimensional text vectors, which are then arranged in time order into a two-dimensional text vector.
  • The Mel spectrogram generation module 20 is used to convert the text vector into the Mel spectrogram of the source speaker.
  • Specifically, the Mel spectrogram generation module 20 uses a trained sequence-to-sequence neural network model to convert the two-dimensional text vector into the Mel spectrogram of the source speaker, wherein the trained sequence-to-sequence neural network model adopts the Tacotron architecture and is trained on a preset speech database, and the preset speech database contains speech files recorded by multiple speakers in a quiet environment with recording equipment and a text file corresponding to each utterance.
  • The spectral feature conversion module 30 is used to obtain the speech signal of the target speaker, convert the speech signal of the target speaker into the Mel frequency cepstral coefficient features of the target speaker, and input the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram. The target Mel spectrogram is taken as a training value and the Mel frequency cepstral coefficient features of the target speaker are input as a label value into a loss function; when the loss value output by the loss function is greater than or equal to a preset threshold, the target Mel spectrogram is transformed and adjusted until the loss value output by the loss function is less than the preset threshold, after which the target Mel spectrogram is output as the Mel spectrogram of the target speaker.
  • Specifically, the spectral feature conversion module 30 passes the Mel spectrogram of the source speaker through the pre-trained convolutional neural network for compression along the time axis and divides the time-compressed Mel spectrogram into frames according to the time sequence. The Mel frequency cepstral coefficient features of each frame, together with the identity feature of the target speaker, are input into the recurrent neural network for processing, and the recurrent neural network converts the Mel frequency cepstral coefficient features of the source speaker into the Mel frequency cepstral coefficient features of the target speaker to obtain the training value.
  • The speech generation module 40 is used to convert the Mel spectrogram of the target speaker into speech corresponding to the text content and output it. An end-to-end sketch of how these modules could be chained is given below.
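  • The following is a hedged end-to-end sketch of chaining the four modules. The object and method names are hypothetical placeholders for the text embedding module 10, the Mel spectrogram generation module 20, the spectral feature conversion module 30, and the speech generation module 40; they are not an API defined in this application.

```python
def synthesize(source_audio, target_speaker_id: int,
               text_embedder, mel_generator, spectral_converter, vocoder):
    text = text_embedder.speech_to_text(source_audio)        # module 10: speech to text
    text_vector = text_embedder.encode(text)                 # pinyin one-hot text vector
    source_mel = mel_generator.text_to_mel(text_vector)      # module 20: Tacotron-style
    target_mel = spectral_converter.convert(source_mel,      # module 30: CNN + BiLSTM
                                            speaker_id=target_speaker_id)
    return vocoder.generate(target_mel)                      # module 40: WaveNet vocoder
```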
  • In addition, an embodiment of this application also proposes a computer-readable storage medium on which a speech synthesis program is stored, and the speech synthesis program can be executed by one or more processors to implement the operations of the speech synthesis method described above, including converting the Mel spectrogram of the target speaker into speech corresponding to the text content and outputting it.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the technical field of artificial intelligence. Disclosed is a speech synthesis method. The method comprises: converting speech data of a source speaker into text content, and converting the text content into a text vector; converting the text vector into a Mel spectrogram of the source speaker; acquiring a speech signal of a target speaker, and converting the speech signal of the target speaker into a Mel frequency cepstrum coefficient feature of the target speaker; inputting the Mel frequency cepstrum coefficient feature of the source speaker and the Mel frequency cepstrum coefficient feature of the target speaker into a trained spectral feature conversion model to obtain a Mel spectrogram of the target speaker; and converting the Mel spectrogram of the target speaker into a speech corresponding to the text content and outputting the speech. The present application also provides a speech synthesis apparatus and a computer readable storage medium. The present application can realize the tone conversion of a speech synthesis system.

Description

Speech synthesis method, device and computer-readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on May 22, 2019, with application number 201910438778.3 and entitled "Speech synthesis method, device and computer-readable storage medium", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to a speech synthesis method, device and computer-readable storage medium.
Background
With the development of technology, computers can already speak through speech synthesis systems, and ordinary users can easily understand and accept this. However, existing computers that can speak usually speak in only one mode or with one voice, whereas end users often have higher requirements; for example, a user may want the computer to read aloud in the user's own voice. In such cases, existing computers clearly can no longer meet this demand.
Summary of the invention
This application provides a speech synthesis method, device, and computer-readable storage medium, the main purpose of which is to provide a solution that can realize timbre conversion in a speech synthesis system.
To achieve the above objective, a speech synthesis method provided by this application includes: receiving the speech data of a source speaker, converting the speech data of the source speaker into text content, and converting the text content into a text vector; converting the text vector into a Mel spectrogram of the source speaker; obtaining the speech signal of a target speaker, and converting the speech signal of the target speaker into Mel frequency cepstral coefficient features of the target speaker; inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, taking the target Mel spectrogram as a training value and inputting the Mel frequency cepstral coefficient features of the target speaker as a label value into a loss function, and when the loss value output by the loss function is greater than or equal to a preset threshold, transforming and adjusting the target Mel spectrogram until the loss value output by the loss function is less than the preset threshold, then outputting the target Mel spectrogram as the Mel spectrogram of the target speaker; and converting the Mel spectrogram of the target speaker into speech corresponding to the text content and outputting it.
In addition, to achieve the above objective, this application also provides a speech synthesis device, which includes a memory and a processor. The memory stores a speech synthesis program that can run on the processor, and when the speech synthesis program is executed by the processor, the following steps are implemented: receiving the speech data of the source speaker, converting the speech data of the source speaker into text content, and converting the text content into a text vector; converting the text vector into a Mel spectrogram of the source speaker; obtaining the speech signal of the target speaker, and converting the speech signal of the target speaker into Mel frequency cepstral coefficient features of the target speaker; inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, taking the target Mel spectrogram as a training value and inputting the Mel frequency cepstral coefficient features of the target speaker as a label value into a loss function, and when the loss value output by the loss function is greater than or equal to a preset threshold, transforming and adjusting the target Mel spectrogram until the loss value output by the loss function is less than the preset threshold, then outputting the target Mel spectrogram as the Mel spectrogram of the target speaker; and converting the Mel spectrogram of the target speaker into speech corresponding to the text content and outputting it.
In addition, to achieve the above objective, this application also provides a computer-readable storage medium on which a speech synthesis program is stored, and the speech synthesis program can be executed by one or more processors to implement the steps of the speech synthesis method described above.
The speech synthesis method, device, and computer-readable storage medium proposed in this application use a pre-trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into the Mel spectrogram of the target speaker, so that text content that would otherwise be output in the source speaker's timbre is instead output in the target speaker's timbre, realizing timbre conversion in the speech synthesis system.
Description of the drawings
FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of this application;
FIG. 2 is a schematic diagram of converting text content into a text vector in a speech synthesis method provided by an embodiment of this application;
FIG. 3 is a schematic structural diagram of the spectral feature conversion model in a speech synthesis method provided by an embodiment of this application;
FIG. 4 is a schematic diagram of the internal structure of a speech synthesis device provided by an embodiment of this application;
FIG. 5 is a schematic diagram of the modules of the speech synthesis program in a speech synthesis device provided by an embodiment of this application.
The realization of the objectives, functional characteristics, and advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
This application provides a speech synthesis method. Referring to FIG. 1, it is a schematic flowchart of a speech synthesis method provided by an embodiment of this application. The method can be executed by a device, and the device can be implemented by software and/or hardware.
In this embodiment, the speech synthesis method includes:
S1. Receive the speech data of the source speaker, convert the speech data of the source speaker into text content, and convert the text content into a text vector.
This application uses a text embedding module to convert the Chinese characters in the text content into a text vector.
This application uses the text embedding module to perform word segmentation on the Chinese characters in the input text content, and then transliterates the resulting segments into Chinese pinyin with tones (using the digits 1-5 to represent the four tones and the neutral tone of Mandarin); for example, the segmented word "您好" is converted to "nin2hao3".
Further, this application uses one-hot encoding to convert the pinyin letters and tone digits in the transliterated pinyin into one-dimensional text vectors, which are then arranged in time order into a two-dimensional text vector, as shown in FIG. 2.
S2、将所述文本向量转化为源说话人的梅尔语谱图。S2. Convert the text vector into the Mel language spectrogram of the source speaker.
本申请较佳实施例通过将所述文本向量输入到一个梅尔语谱生成模块中,将所述文本向量转化为源说话人的梅尔语谱图。In a preferred embodiment of the present application, the text vector is input into a Mel spectrum generation module to convert the text vector into the Mel spectrum map of the source speaker.
本申请所述梅尔语谱生成模块接收所述文本嵌入模块传递来的文本向量,并利用经过训练的序列到序列的神经网络模型,将所述文本向量转化为源说话人的梅尔语谱图。The Mel language spectrum generation module of this application receives the text vector passed by the text embedding module, and uses the trained sequence-to-sequence neural network model to convert the text vector into the Mel language spectrum of the source speaker Figure.
本申请所述经过训练的序列到序列的神经网络模型采用Tacotron架构,并使用了一份不公开的语音数据库进行训练。该语音数据库包含了一位女性说话人(即源说话人)在安静环境下,用专用录音设备录制的总时长约30个小时的语音文件,以及每条语音所对应的文本文件。输入的文本向量经过训练过的序列到序列的神经网络模型映射之后,会被转换为源说话人的梅尔语谱图。The trained sequence-to-sequence neural network model described in this application adopts the Tacotron architecture and uses an undisclosed speech database for training. The voice database contains a female speaker (ie the source speaker) in a quiet environment, using a special recording device to record a total of about 30 hours of voice files, and the text file corresponding to each voice. After the input text vector is mapped from the trained sequence to the sequence neural network model, it will be converted into the Mel language spectrogram of the source speaker.
所述梅尔语谱图是一种基于梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC)特征的频谱图。为获得所述梅尔频率倒谱系数特征,本申请首先使用Preemphasis滤波器提高高频信号和信噪比,其公式为:y(t)=x(t)-αx(t-1),式中x为信号输入,y为信号输出,x(t)为t时刻的信号,x(t-1)为(t-1)的信号,α一般取0.97。根据所述Preemphasis滤波器得到提高了高频信号和信噪比之后的t时刻的信号输出y(t)。接着进行短时傅里叶变换。为了模拟人耳对高频信号的抑制,本申请利用一组包含多个三角滤波器的滤波组件(filterbank)对经过短时傅里叶变换的线性谱进行处理得到低维特征,并强调低频部分,弱化高频部分,从而得到所述梅尔频率倒谱系数特征。The Mel-language spectrogram is a spectrogram based on Mel Frequency Cepstrum Coefficient (MFCC) features. In order to obtain the characteristic of the Mel frequency cepstrum coefficient, this application first uses a Preemphasis filter to improve the high-frequency signal and the signal-to-noise ratio. The formula is: y(t)=x(t)-αx(t-1), Where x is the signal input, y is the signal output, x(t) is the signal at time t, x(t-1) is the signal at (t-1), and α is generally 0.97. According to the Preemphasis filter, the signal output y(t) at time t after the high-frequency signal and the signal-to-noise ratio are improved. Then perform short-time Fourier transform. In order to simulate the suppression of high-frequency signals by human ears, this application uses a set of filter banks containing multiple triangular filters to process the linear spectrum after short-time Fourier transform to obtain low-dimensional features and emphasize the low-frequency part. , Weaken the high frequency part, so as to obtain the Mel frequency cepstrum coefficient characteristic.
优选地,在进行傅里叶变换前,为了防止能量泄露本申请较佳实施例会使用汉宁窗函数。所述汉宁窗可以看作是3个矩形时间窗的频谱之和,或者说是3个sin(t)型函数之和,而括号中的两项相对于第一个谱窗向左、右各移动了π/T,从而使旁瓣互相抵消,消去高频干扰和漏能。Preferably, in order to prevent energy leakage, the preferred embodiment of the present application will use the Hanning window function before performing the Fourier transform. The Hanning window can be regarded as the sum of the spectrum of 3 rectangular time windows, or the sum of 3 sin(t)-type functions, and the two items in brackets are left and right relative to the first spectral window. Each moved by π/T, so that the side lobes cancel each other out, eliminating high-frequency interference and leakage energy.
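The feature-extraction chain described above (pre-emphasis, Hann-windowed short-time Fourier transform, triangular mel filterbank) can be sketched as follows; the sampling rate, FFT size, hop length and number of mel bands are illustrative assumptions rather than values taken from the application.

```python
import numpy as np
import librosa

def mel_features(signal: np.ndarray, sr: int = 16000, alpha: float = 0.97,
                 n_fft: int = 1024, hop: int = 256, n_mels: int = 80) -> np.ndarray:
    """Pre-emphasis -> Hann-windowed STFT -> triangular mel filterbank -> log."""
    # Pre-emphasis filter y(t) = x(t) - alpha * x(t-1): boosts the high-frequency part.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Short-time Fourier transform with a Hann window to suppress energy leakage.
    power = np.abs(librosa.stft(emphasized, n_fft=n_fft, hop_length=hop,
                                window="hann")) ** 2
    # Triangular filters spaced on the mel scale emphasize the low-frequency part.
    fbank = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.log(fbank @ power + 1e-6)          # (n_mels, frames)
```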
S3、获取目标说话人的语音信号,并将所述目标说话人的语音信号转换为目标说话人的梅尔频率倒谱系数特征。S3. Acquire the voice signal of the target speaker, and convert the voice signal of the target speaker into the Mel frequency cepstrum coefficient feature of the target speaker.
S4、将所述源说话人的梅尔语谱图输入至一个经过训练的语谱特征转换模型中,以将所述源说话人的梅尔语谱图转换为目标梅尔语谱图,并将所述目标梅尔语谱图作为训练值以及将所述目标说话人的梅尔频率倒谱系数特征作为标签值输入至一个损失函数中,当所述损失函数输出的损失值大于或等于预设阈值时,对所述目标梅尔语谱图进行变换调整,直到所述损失函数输 出的损失值小于所述预设阈值时,将所述目标梅尔语谱图作为所述目标说话人的梅尔语谱图输出。S4. Input the Mel spectrogram of the source speaker into a trained spectrogram feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, and The target Mel language spectrogram is used as the training value and the Mel frequency cepstral coefficient feature of the target speaker is input into a loss function as a label value. When the loss value output by the loss function is greater than or equal to the expected value When the threshold is set, the target Mel language spectrogram is transformed and adjusted, until the loss value output by the loss function is less than the preset threshold, the target Mel language spectrogram is used as the target speaker's Mel language spectrogram output.
The spectral feature conversion model described in this application includes a convolutional neural network (Convolutional Neural Network, CNN) model and a recurrent neural network (Recurrent Neural Network, RNN) model based on a bidirectional LSTM. In this application, the mel spectrogram of the source speaker is compressed in time by a layer of pre-trained convolutional neural network so that the features in the mel spectrogram are better represented. The processed mel spectrogram is divided into frames in time order, the mel-frequency cepstral coefficient features of each frame are combined with the identity feature of the target speaker, and the result is fed into a two-layer recurrent neural network based on a bidirectional LSTM, which converts the mel spectrogram of the source speaker into the target mel spectrogram frame by frame. Further, this application takes the converted target mel spectrogram as the training value and the mel-frequency cepstral coefficient features of the target speaker obtained in step S3 as the label value, and inputs them into a loss function. When the loss value output by the loss function is greater than or equal to the preset threshold, the target mel spectrogram is transformed and adjusted; once the loss value output by the loss function is less than the preset threshold, the target mel spectrogram is output as the mel spectrogram of the target speaker.
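A minimal PyTorch sketch of a conversion model of this kind (temporal compression by a convolutional layer, a speaker-identity embedding appended to every frame, and a two-layer bidirectional LSTM producing the target mel frames) is given below. The layer sizes and module names are assumptions for illustration; they are not the configuration actually used by the application.

```python
import torch
import torch.nn as nn

class SpectralConversionModel(nn.Module):
    """Sketch: a convolution compresses the source mel frames in time, a
    speaker-identity embedding is appended to every frame, and a two-layer
    bidirectional LSTM maps the result to target mel frames."""
    def __init__(self, n_mels: int = 80, n_speakers: int = 9,
                 spk_dim: int = 16, hidden: int = 256):
        super().__init__()
        self.compress = nn.Conv1d(n_mels, n_mels, kernel_size=3, stride=2, padding=1)
        self.spk_embed = nn.Embedding(n_speakers, spk_dim)
        self.rnn = nn.LSTM(n_mels + spk_dim, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, mel: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames); speaker_id: (batch,) with values 0..n_speakers-1
        x = self.compress(mel).transpose(1, 2)            # (batch, frames', n_mels)
        spk = self.spk_embed(speaker_id).unsqueeze(1)     # (batch, 1, spk_dim)
        x = torch.cat([x, spk.expand(-1, x.size(1), -1)], dim=-1)
        out, _ = self.rnn(x)
        return self.proj(out)                             # predicted target mel frames
```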
本申请较佳实施例中,所述语谱特征转换模型的结构如图3所示。In a preferred embodiment of the present application, the structure of the spectral feature conversion model is shown in FIG. 3.
所述卷积神经网络以及基于双向LSTM的循环神经网络也使用了一个非公开的语音数据集进行了训练。该语音数据集包含了N位(较佳的,N为10)位女性说话人的录音(每位说话人都有时长约1小时语音文件),并且10位说话人所录制的文本内容都是相同的。其中有一位女性说话人也录制了上述训练的序列到序列的神经网络模型所用的语音数据库。因此该位说话人被作为源说话人。而其余九位说话人则被当作目标说话人,并分别给予1-9的身份编号。该编号将在所述卷积神经网络以及基于双向LSTM的循环神经网络训练以及之后推理时,作为目标说话人身份向量嵌入相对应的梅尔频率倒谱系数特征中。The convolutional neural network and the recurrent neural network based on bidirectional LSTM are also trained using a private speech data set. The voice data set contains recordings of N (preferably, N is 10) female speakers (each speaker has a voice file of about 1 hour in length), and the text content recorded by 10 speakers is all identical. One of the female speakers also recorded the speech database used by the sequence-to-sequence neural network model trained above. Therefore, the speaker is taken as the source speaker. The remaining nine speakers were regarded as target speakers and were given ID numbers 1-9. This number will be embedded in the corresponding Mel frequency cepstral coefficient feature as the target speaker identity vector during the training of the convolutional neural network and the bidirectional LSTM-based recurrent neural network and subsequent inferences.
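The embedding of the identity numbers 1-9 into the per-frame features can be pictured with the short sketch below; the embedding dimension and feature sizes are arbitrary illustrative choices, not values from the application.

```python
import torch
import torch.nn as nn

spk_embed = nn.Embedding(9, 16)        # identity numbers 1-9 mapped to 16-dim vectors (assumed size)
mfcc = torch.randn(1, 120, 13)         # (batch, frames, MFCC dims) - toy values
spk = spk_embed(torch.tensor([4]))     # target speaker with identity number 5 (0-indexed here)
frames_with_id = torch.cat(
    [mfcc, spk.unsqueeze(1).expand(-1, mfcc.size(1), -1)], dim=-1)   # (1, 120, 29)
```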
所述卷积神经网络是一种前馈神经网络,它的人工神经元可以响应一部分覆盖范围内的周围单元,其基本结构包括两层,其一为特征提取层,每个神经元的输入与前一层的局部接受域相连,并提取该局部的特征。一旦该局部特征被提取后,它与其它特征间的位置关系也随之确定下来;其二是特征映射层,网络的每个计算层由多个特征映射组成,每个特征映射是一个平面,平面上所有神经元的权值相等。特征映射结构采用影响函数核小的sigmoid函数作为卷积网络的激活函数,使得特征映射具有位移不变性。此外,由于一个映射面上的神经元共享权值,因而减少了网络自由参数的个数。卷积神经网络中的每一个卷积层都紧跟着一个用来求局部平均与二次提取的计算层,这种特有的两次特征提取结构减小了特征分辨率。The convolutional neural network is a feed-forward neural network. Its artificial neurons can respond to a part of the surrounding units in the coverage area. Its basic structure includes two layers. One is the feature extraction layer. The input of each neuron is The local receptive fields of the previous layer are connected, and the local features are extracted. Once the local feature is extracted, the positional relationship between it and other features is also determined; the second is the feature mapping layer, each computing layer of the network is composed of multiple feature maps, and each feature map is a plane. The weights of all neurons on the plane are equal. The feature mapping structure uses a sigmoid function with a small influencing function core as the activation function of the convolutional network, which makes the feature mapping displacement invariant. In addition, since neurons on a mapping plane share weights, the number of free parameters of the network is reduced. Each convolutional layer in the convolutional neural network is followed by a calculation layer for local averaging and secondary extraction. This unique two-feature extraction structure reduces the feature resolution.
输入层:输入层是整个卷积神经网络唯一的数据输入口,主要用于定义不同类型的数据输入。Input layer: The input layer is the only data input port of the entire convolutional neural network, which is mainly used to define different types of data input.
卷积层:对输入卷积层的数据进行卷积操作,输出卷积后的特征图。Convolutional layer: convolve the data of the input convolutional layer, and output the convolutional feature map.
下采样层(Pooling层):Pooling层对传入数据在空间维度上进行下采样 操作,使得输入的特征图的长和宽变为原来的一半。Down-sampling layer (Pooling layer): The pooling layer performs down-sampling operations on the incoming data in spatial dimensions, so that the length and width of the input feature map become half of the original.
全连接层:全连接层和普通神经网络一样,每个神经元都与输入的所有神经元相互连接,然后经过激活函数进行计算。Fully connected layer: The fully connected layer is the same as an ordinary neural network. Each neuron is connected to all input neurons, and then calculated through an activation function.
输出层:输出层也被称为分类层,在最后输出时会计算每一类别的分类分值。Output layer: The output layer is also called the classification layer, and the classification score of each category will be calculated in the final output.
在本申请实施例中,输入层为源说话人梅尔语谱图,该梅尔语谱图依次进入一个7*7的卷积层,3*3的最大值池化层,随后进入4个卷积模块。每个卷积模块从具有线性投影的构建块开始,随后是具有本体映射的不同数量的构建块,最后在softmax层输出经过时序压缩的梅尔语谱。In the embodiment of this application, the input layer is the source speaker Mel's spectrogram, which sequentially enters a 7*7 convolutional layer, a 3*3 maximum pooling layer, and then enters 4 Convolution module. Each convolution module starts with a building block with linear projection, followed by a different number of building blocks with ontology mapping, and finally outputs a time-sequentially compressed Mel language spectrum in the softmax layer.
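Read literally, the stack just described might be sketched as follows in PyTorch; the channel widths, the number of building blocks per module and the exact residual form of the blocks are assumptions, since the application only names a 7*7 convolution, a 3*3 max-pooling layer and four convolution modules built from linear-projection and identity-mapping blocks.

```python
import torch.nn as nn

class Block(nn.Module):
    """Building block: 3x3 convolutions plus a shortcut; a 1x1 linear projection
    is used when the channel count changes, otherwise an identity mapping."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1))
        self.short = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.short(x))

class MelEncoder(nn.Module):
    """7x7 convolution, 3x3 max pooling, then four convolution modules."""
    def __init__(self, widths=(64, 128, 256, 512)):
        super().__init__()
        layers = [nn.Conv2d(1, widths[0], kernel_size=7, stride=2, padding=3),
                  nn.MaxPool2d(kernel_size=3, stride=2, padding=1)]
        in_ch = widths[0]
        for w in widths:
            # Two blocks per module: projection when the channels change, identity otherwise.
            layers += [Block(in_ch, w), Block(w, w)]
            in_ch = w
        self.body = nn.Sequential(*layers)

    def forward(self, mel):                    # mel: (batch, 1, mel bins, frames)
        return self.body(mel)
```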
所述循环神经网络通常用于描述动态的序列数据,随着时间的变化而动态调整自身的网络状态,并且不断进行循环传递。在传统的神经网络模型中,神经元从输入层到隐藏层,再从隐藏层到输出层,层与层之间是全连接或者局部连接的方式,且在数据的传递中,会丢失上一层计算过程中产生的特征信息,而RNN所不同于传统神经网络模型的地方在于一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用与当前输出的计算中,即隐藏层之间的解点不再是无连接的而是有链接的,并且隐藏层的输出不仅包括输入层的输出,还包括上一时刻隐藏层的输出。The cyclic neural network is usually used to describe dynamic sequence data, dynamically adjust its own network state as time changes, and continuously perform cyclic transmission. In the traditional neural network model, neurons go from the input layer to the hidden layer, and then from the hidden layer to the output layer. The layers are fully connected or locally connected, and the last one will be lost in the transmission of data. The feature information generated in the layer calculation process, and the RNN is different from the traditional neural network model in that the current output of a sequence is also related to the previous output. The specific form is that the network will memorize the previous information and apply it to the calculation of the current output, that is, the solution points between the hidden layers are no longer disconnected but linked, and the output of the hidden layer includes not only the input The output of the layer also includes the output of the hidden layer at the previous moment.
在本申请实施例中,将利用时序进行分帧的梅尔频率倒谱系数特征输入到两层的基于LSTM的循环神经网络模型中,利用梯度下降法求解损失函数。In the embodiment of the present application, the Mel frequency cepstral coefficient feature for framing using time sequence is input into the two-layer LSTM-based cyclic neural network model, and the gradient descent method is used to solve the loss function.
In a neural network, the loss function is used to evaluate the difference between the predicted value Ŷ output by the network model and the true value Y. The loss function is denoted here by L(Y, Ŷ); it is a non-negative real-valued function, and the smaller the loss value, the better the performance of the network model. According to the basic neuron formula in deep learning, the input and output of each layer are z_i = W·s_(i-1) + U·x_i and C_i = f(z_i), where s_(i-1) is the output passed on from the previous layer, W is the link weight from the i-th neuron of layer l to the j-th neuron of layer l+1, U is the weight of the i-th neuron of layer l, x_i is the input of the i-th neuron of layer l, and C_i is the output value of each unit of the output layer. Based on this input-output formula, the mean squared error (MSE) is used to build the loss function L(Y, Ŷ) = (1/n)·Σ_i (Y_i - Ŷ_i)², where n is the batch size, Y_i is the correct answer for the i-th sample in a batch, and Ŷ_i is the predicted value given by the neural network. At the same time, in order to alleviate the vanishing-gradient problem, the ReLU function relu(x) = max(0, x) is selected as the activation function, where x is the input value of the neuron. This function satisfies the sparsity found in bionics: the neuron node is activated only when its input exceeds a certain value, the output is clamped when the input is below 0, and once the input rises above that threshold the output is a linear function of the input.
The preferred embodiment of the present application uses a gradient descent algorithm to solve the loss function. The gradient descent algorithm is the most commonly used optimization algorithm for training neural network models. To find the minimum of the loss function L(Y, Ŷ), the variable y must be updated along the direction opposite to the gradient vector, i.e. along -dL/dy, which makes the loss decrease fastest until it converges to the minimum. The parameter update formula is y = y - α·dL/dy, where α denotes the learning rate. In this way the final neural network parameters for recognizing the mel spectrogram can be obtained.
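A toy gradient-descent step on the MSE loss, with a ReLU activation and the stop-at-threshold behaviour described above, might look like the following; the single linear unit, the synthetic data and the threshold are placeholders, not the CNN/BiLSTM actually trained by the application.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def train_step(w, x, y_true, lr=0.05):
    """One gradient-descent update w <- w - lr * dL/dw on the MSE loss of a
    single linear unit followed by a ReLU (a toy stand-in for the real model)."""
    z = x @ w
    y_pred = relu(z)
    loss = np.mean((y_true - y_pred) ** 2)
    # Chain rule: dL/dw, with the ReLU gradient equal to 1 where z > 0 and 0 elsewhere.
    grad = x.T @ ((y_pred - y_true) * (z > 0)) * (2.0 / len(y_true))
    return w - lr * grad, loss

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 1))
x = rng.normal(size=(32, 4))
y = relu(x @ np.ones((4, 1)))             # synthetic "correct answers" for the batch
threshold = 1e-3                          # preset threshold in the sense used above
for _ in range(5000):
    w, loss = train_step(w, x, y)
    if loss < threshold:                  # stop adjusting once the loss is small enough
        break
```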
进一步地,本申请利用Softmax函数输入分类标签。Further, this application uses the Softmax function to input the classification label.
The Softmax is a generalization of logistic regression: logistic regression is used to handle binary classification problems, while its generalization, Softmax regression, is used to handle multi-class classification problems. According to the input mel-frequency cepstral coefficient features, the maximum of the output probabilities over all classes is obtained through this activation function, whose core formula is P(k|x) = exp(x_k) / Σ_(j=1..K) exp(x_j). Assuming there are K classes in total, x_k denotes the sample whose class is k and x_j denotes the sample whose class is j, and the target mel spectrogram is thereby obtained.
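The Softmax formula above can be written directly in a few lines of Python; the example scores are arbitrary.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """P(k | x) = exp(x_k) / sum_j exp(x_j); subtracting the max keeps exp() stable."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))   # three class probabilities summing to 1
```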
S5、将所述目标说话人的梅尔语谱图转换为所述文本内容对应的语音并输出。S5. Convert the Mel language spectrogram of the target speaker into a voice corresponding to the text content and output it.
本申请较佳实施例利用语音生成模块将目标说话人的梅尔语谱图合成为语音。The preferred embodiment of the present application uses a voice generation module to synthesize the Mel language spectrogram of the target speaker into voice.
语音生成模块用于处理梅尔语谱图并生成高保真以及高自然度的语音。本申请在获得了目标说话人的梅尔语谱图后,使用一个语音生成模块,把梅尔语谱图作为条件输入,生成目标说话人的语音。该语音生成模块采用了一种叫做WaveNet的声码器。当输入不同目标说话人的梅尔语谱图时,该声码器可以根据所述梅尔语谱图生成不同目标说话人的的高保真声音。The speech generation module is used to process Mel's spectrogram and generate high-fidelity and high-naturalness speech. This application uses a voice generation module after obtaining the Mel spectrogram of the target speaker, and uses the Mel spectrogram as a conditional input to generate the target speaker's voice. The speech generation module uses a vocoder called WaveNet. When inputting Mel spectrograms of different target speakers, the vocoder can generate high-fidelity sounds of different target speakers according to the Mel spectrograms.
本申请较佳实施例中所使用的WaveNet声码器,也是由一个非公开的语音数据集训练而成,该数据集与训练卷积神经网络所用的语音数据集为同一数据集。所述WaveNet是一个端到端的TTS(text to speech)模型,其主要概念是因果卷积,所谓因果卷积的意义就是WaveNet在生成t时刻的元素时,只能使用0到t-1时刻的元素值。由于声音文件是时间上的一维数组,16KHz的采样率的文件,每秒钟就会有16000个元素,而上面所说的因果卷积的感受野非常小,即使堆叠很多层也只能使用到很少的数据来生成t时刻的的元素,为了扩大卷积的感受野,WaveNet采用了堆叠了多层带洞卷积来增到网络的感受野,使得网络生成下一个元素的时候,能够使用更多之前的元素数值。The WaveNet vocoder used in the preferred embodiment of the present application is also trained from a non-public speech data set, which is the same data set used for training the convolutional neural network. The WaveNet is an end-to-end TTS (text to speech) model. Its main concept is causal convolution. The meaning of the so-called causal convolution is that when WaveNet generates elements at time t, it can only use time from 0 to t-1. Element value. Since the sound file is a one-dimensional array in time, a file with a sampling rate of 16KHz will have 16,000 elements per second, and the receptive field of the causal convolution mentioned above is very small, and it can only be used even if many layers are stacked. To generate the element at time t with very little data, in order to expand the receptive field of convolution, WaveNet uses stacked multi-layer convolution with holes to increase the receptive field of the network, so that when the network generates the next element, it can Use more previous element values.
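The dilated causal convolution that this paragraph attributes to WaveNet can be sketched as follows in PyTorch; the channel count, kernel size and number of layers are illustrative assumptions, and this is not the vocoder actually used by the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """Causal convolution with dilation: the output at time t only depends on
    inputs at times 0..t, and stacking layers with dilations 1, 2, 4, ... grows
    the receptive field exponentially."""
    def __init__(self, channels: int, kernel_size: int = 2, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))  # left padding only keeps it causal

# Eight layers with doubling dilation give a receptive field of 256 time steps.
stack = nn.Sequential(*[CausalDilatedConv1d(64, dilation=2 ** i) for i in range(8)])
out = stack(torch.randn(1, 64, 16000))             # one second of 16 kHz features (toy)
```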
本申请还提供一种语音合成装置。参照图4所示,为本申请一实施例提供的语音合成装置的内部结构示意图。The application also provides a speech synthesis device. Referring to FIG. 4, it is a schematic diagram of the internal structure of a speech synthesis device provided by an embodiment of this application.
在本实施例中,语音合成装置1可以是PC(Personal Computer,个人电脑),也可以是智能手机、平板电脑、便携计算机等终端设备。该语音合成装置1至少包括存储器11、处理器12,通信总线13,以及网络接口14。In this embodiment, the speech synthesis device 1 may be a PC (Personal Computer, personal computer), or a terminal device such as a smart phone, a tablet computer, or a portable computer. The speech synthesis device 1 at least includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.
其中,存储器11至少包括一种类型的可读存储介质,所述可读存储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、磁性存储器、磁盘、光盘等。存储器11在一些实施例中可以是语音合成装置1的内部存储单元,例如该语音合成装置1的硬盘。存储器11在另一些实施例中也可以是语音合成装置1的外部存储设备,例如语音合成装置1上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,存储器11还可以既包括语音合成装置1的内部存储单元也包括外部存储设备。存储器11不仅可以用于存储安装于语音合成装置1的应用软件及各类数据,例如语音合成程序01的代码等,还可以用于暂时地存储已经输出或者将要输出的数据。Wherein, the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may be an internal storage unit of the speech synthesis device 1 in some embodiments, such as a hard disk of the speech synthesis device 1. In other embodiments, the memory 11 may also be an external storage device of the speech synthesis device 1, such as a plug-in hard disk equipped on the speech synthesis device 1, a smart media card (SMC), and a secure digital (Secure Digital, SD card, Flash Card, etc. Further, the memory 11 may also include both an internal storage unit of the speech synthesis apparatus 1 and an external storage device. The memory 11 can be used not only to store application software and various data installed in the speech synthesis device 1, such as the code of the speech synthesis program 01, etc., but also to temporarily store data that has been output or will be output.
处理器12在一些实施例中可以是一中央处理器(Central Processing Unit, CPU)、控制器、微控制器、微处理器或其他数据处理芯片,用于运行存储器11中存储的程序代码或处理数据,例如执行语音合成程序01等。In some embodiments, the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run the program code or processing stored in the memory 11 Data, such as execution of speech synthesis program 01, etc.
通信总线13用于实现这些组件之间的连接通信。The communication bus 13 is used to realize the connection and communication between these components.
网络接口14可选的可以包括标准的有线接口、无线接口(如WI-FI接口),通常用于在该装置1与其他电子设备之间建立通信连接。The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.
可选地,该装置1还可以包括用户接口,用户接口可以包括显示器(Display)、输入单元比如键盘(Keyboard),可选的用户接口还可以包括标准的有线接口、无线接口。可选地,在一些实施例中,显示器可以是LED显示器、液晶显示器、触控式液晶显示器以及OLED(Organic Light-Emitting Diode,有机发光二极管)触摸器等。其中,显示器也可以适当的称为显示屏或显示单元,用于显示在语音合成装置1中处理的信息以及用于显示可视化的用户界面。Optionally, the device 1 may also include a user interface. The user interface may include a display (Display) and an input unit such as a keyboard (Keyboard). The optional user interface may also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light emitting diode) touch device, etc. Among them, the display can also be called a display screen or a display unit as appropriate, for displaying information processed in the speech synthesis device 1 and for displaying a visualized user interface.
图4仅示出了具有组件11-14以及语音合成程序01的语音合成装置1,本领域技术人员可以理解的是,图4示出的结构并不构成对语音合成装置1的限定,可以包括比图示更少或者更多的部件,或者组合某些部件,或者不同的部件布置。FIG. 4 only shows the speech synthesis device 1 with components 11-14 and the speech synthesis program 01. Those skilled in the art can understand that the structure shown in FIG. 4 does not constitute a limitation on the speech synthesis device 1, and may include Fewer or more components than shown, or some combination of components, or different component arrangement.
在图4所示的装置1实施例中,存储器11中存储有语音合成程序01;处理器12执行存储器11中存储的语音合成程序01时实现如下步骤:In the embodiment of the device 1 shown in FIG. 4, the speech synthesis program 01 is stored in the memory 11; when the processor 12 executes the speech synthesis program 01 stored in the memory 11, the following steps are implemented:
步骤一、接收源说话人的语音数据,将所述源说话人的语音数据转换为文本内容,并将所述文本内容转化为文本向量。Step 1: Receive the voice data of the source speaker, convert the voice data of the source speaker into text content, and convert the text content into a text vector.
本申请通过一个文本嵌入模块将所述文本内容中的汉字转换为文本向量。This application uses a text embedding module to convert Chinese characters in the text content into text vectors.
This application uses the text embedding module to perform a word segmentation operation on the Chinese characters in the input text content, and then transliterates each resulting word segment into Chinese pinyin with tones (the digits 1-5 represent the four Mandarin tones and the neutral tone); for example, the word segment "您好" ("hello") is converted into "nin2hao3".
进一步地,本申请通过独热编码的方式,将转译得到的汉语拼音中的拼音字母和声调数字转换为一维文本向量,再按照时间序列将其转化为一个二维文本向量,参阅图2所示。Furthermore, this application uses one-hot encoding to convert the pinyin letters and tonal numbers in the translated Chinese Pinyin into a one-dimensional text vector, and then converts it into a two-dimensional text vector according to the time sequence, as shown in Figure 2. Show.
步骤二、将所述文本向量转化为源说话人的梅尔语谱图。Step 2: Convert the text vector into the Mel language spectrogram of the source speaker.
本申请较佳实施例通过将所述文本向量输入到一个梅尔语谱生成模块中,将所述文本向量转化为源说话人的梅尔语谱图。In a preferred embodiment of the present application, the text vector is input into a Mel spectrum generation module to convert the text vector into the Mel spectrum map of the source speaker.
本申请所述梅尔语谱生成模块接收所述文本嵌入模块传递来的文本向量,并利用经过训练的序列到序列的神经网络模型,将所述文本向量转化为源说话人的梅尔语谱图。The Mel language spectrum generation module of this application receives the text vector passed by the text embedding module, and uses the trained sequence-to-sequence neural network model to convert the text vector into the Mel language spectrum of the source speaker Figure.
本申请所述经过训练的序列到序列的神经网络模型采用Tacotron架构,并使用了一份不公开的语音数据库进行训练。该语音数据库包含了一位女性说话人(即源说话人)在安静环境下,用专用录音设备录制的总时长约30个小时的语音文件,以及每条语音所对应的文本文件。输入的文本向量经过训练过的序列到序列的神经网络模型映射之后,会被转换为源说话人的梅尔语谱图。The trained sequence-to-sequence neural network model described in this application adopts the Tacotron architecture and uses an undisclosed speech database for training. The voice database contains a female speaker (ie the source speaker) in a quiet environment, using a special recording device to record a total of about 30 hours of voice files, and the text file corresponding to each voice. After the input text vector is mapped from the trained sequence to the sequence neural network model, it will be converted into the Mel language spectrogram of the source speaker.
所述梅尔语谱图是一种基于梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC)特征的频谱图。为获得所述梅尔频率倒谱系数特征,本申请首先使用Preemphasis滤波器提高高频信号和信噪比,其公式为:y(t)=x(t)-αx(t-1),式中x为信号输入,y为信号输出,x(t)为t时刻的信号,x(t-1)为(t-1)的信号,α一般取0.97。根据所述Preemphasis滤波器得到提高了高频信号和信噪比之后的t时刻的信号输出y(t)。接着进行短时傅里叶变换。为了模拟人耳对高频信号的抑制,本申请利用一组包含多个三角滤波器的滤波组件(filterbank)对经过短时傅里叶变换的线性谱进行处理得到低维特征,并强调低频部分,弱化高频部分,从而得到所述梅尔频率倒谱系数特征。The Mel-language spectrogram is a spectrogram based on Mel Frequency Cepstrum Coefficient (MFCC) features. In order to obtain the characteristic of the Mel frequency cepstrum coefficient, this application first uses a Preemphasis filter to improve the high-frequency signal and the signal-to-noise ratio. The formula is: y(t)=x(t)-αx(t-1), Where x is the signal input, y is the signal output, x(t) is the signal at time t, x(t-1) is the signal at (t-1), and α is generally 0.97. According to the Preemphasis filter, the signal output y(t) at time t after the high-frequency signal and the signal-to-noise ratio are improved. Then a short-time Fourier transform is performed. In order to simulate the suppression of high-frequency signals by human ears, this application uses a set of filter banks containing multiple triangular filters to process the linear spectrum after short-time Fourier transform to obtain low-dimensional features and emphasize the low-frequency part. , Weaken the high frequency part, so as to obtain the Mel frequency cepstrum coefficient characteristic.
优选地,在进行傅里叶变换前,为了防止能量泄露本申请较佳实施例会使用汉宁窗函数。所述汉宁窗可以看作是3个矩形时间窗的频谱之和,或者说是3个sin(t)型函数之和,而括号中的两项相对于第一个谱窗向左、右各移动了π/T,从而使旁瓣互相抵消,消去高频干扰和漏能。Preferably, in order to prevent energy leakage, the preferred embodiment of the present application will use the Hanning window function before performing the Fourier transform. The Hanning window can be regarded as the sum of the spectrum of 3 rectangular time windows, or the sum of 3 sin(t)-type functions, and the two items in brackets are left and right relative to the first spectral window. Each moved by π/T, so that the side lobes cancel each other out, eliminating high-frequency interference and leakage energy.
步骤三、获取目标说话人的语音信号,并将所述目标说话人的语音信号转换为目标说话人的梅尔频率倒谱系数特征。Step 3: Obtain the voice signal of the target speaker, and convert the voice signal of the target speaker into the Mel frequency cepstrum coefficient feature of the target speaker.
步骤四、将所述源说话人的梅尔语谱图输入至一个经过训练的语谱特征转换模型中,以将所述源说话人的梅尔语谱图转换为目标梅尔语谱图,并将所述目标梅尔语谱图作为训练值以及将所述目标说话人的梅尔频率倒谱系数特征作为标签值输入至一个损失函数中,当所述损失函数输出的损失值大于或等于预设阈值时,对所述目标梅尔语谱图进行变换调整,直到所述损失函数输出的损失值小于所述预设阈值时,将所述目标梅尔语谱图作为所述目标说话人的梅尔语谱图输出。Step 4: Input the Mel spectrogram of the source speaker into a trained spectrogram feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, And use the target Mel language spectrogram as the training value and the Mel frequency cepstral coefficient feature of the target speaker as the label value and input into a loss function, when the loss value output by the loss function is greater than or equal to When the threshold is preset, the target Mel spectrogram is transformed and adjusted until the loss value output by the loss function is less than the preset threshold, the target Mel spectrogram is used as the target speaker The Mel language spectrogram output.
The spectral feature conversion model described in this application includes a convolutional neural network (Convolutional Neural Network, CNN) model and a recurrent neural network (Recurrent Neural Network, RNN) model based on a bidirectional LSTM. In this application, the mel spectrogram of the source speaker is compressed in time by a layer of pre-trained convolutional neural network so that the features in the mel spectrogram are better represented. The processed mel spectrogram is divided into frames in time order, the mel-frequency cepstral coefficient features of each frame are combined with the identity feature of the target speaker, and the result is fed into a two-layer recurrent neural network based on a bidirectional LSTM, which converts the mel spectrogram of the source speaker into the target mel spectrogram frame by frame. Further, this application takes the converted target mel spectrogram as the training value and the mel-frequency cepstral coefficient features of the target speaker obtained in step S3 above as the label value, and inputs them into a loss function. When the loss value output by the loss function is greater than or equal to the preset threshold, the target mel spectrogram is transformed and adjusted; once the loss value output by the loss function is less than the preset threshold, the target mel spectrogram is output as the mel spectrogram of the target speaker.
本申请较佳实施例中,所述语谱特征转换模型的结构如图3所示。In a preferred embodiment of the present application, the structure of the spectral feature conversion model is shown in FIG. 3.
所述卷积神经网络以及基于双向LSTM的循环神经网络也使用了一个非公开的语音数据集进行了训练。该语音数据集包含了N位(较佳的,N为10) 位女性说话人的录音(每位说话人都有时长约1小时语音文件),并且10位说话人所录制的文本内容都是相同的。其中有一位女性说话人也录制了上述训练的序列到序列的神经网络模型所用的语音数据库。因此该位说话人被作为源说话人。而其余九位说话人则被当作目标说话人,并分别给予1-9的身份编号。该编号将在所述卷积神经网络以及基于双向LSTM的循环神经网络训练以及之后推理时,作为目标说话人身份向量嵌入相对应的梅尔频率倒谱系数特征中。The convolutional neural network and the recurrent neural network based on bidirectional LSTM are also trained using a private speech data set. The voice data set contains recordings of N (preferably, N is 10) female speakers (each speaker has a voice file of about 1 hour in length), and the text content recorded by 10 speakers is identical. One of the female speakers also recorded the speech database used by the sequence-to-sequence neural network model trained above. Therefore the speaker is taken as the source speaker. The remaining nine speakers were regarded as target speakers and were given ID numbers 1-9. This number will be embedded in the corresponding Mel frequency cepstral coefficient feature as the target speaker identity vector during the training of the convolutional neural network and the bidirectional LSTM-based recurrent neural network and subsequent inferences.
所述卷积神经网络是一种前馈神经网络,它的人工神经元可以响应一部分覆盖范围内的周围单元,其基本结构包括两层,其一为特征提取层,每个神经元的输入与前一层的局部接受域相连,并提取该局部的特征。一旦该局部特征被提取后,它与其它特征间的位置关系也随之确定下来;其二是特征映射层,网络的每个计算层由多个特征映射组成,每个特征映射是一个平面,平面上所有神经元的权值相等。特征映射结构采用影响函数核小的sigmoid函数作为卷积网络的激活函数,使得特征映射具有位移不变性。此外,由于一个映射面上的神经元共享权值,因而减少了网络自由参数的个数。卷积神经网络中的每一个卷积层都紧跟着一个用来求局部平均与二次提取的计算层,这种特有的两次特征提取结构减小了特征分辨率。The convolutional neural network is a feed-forward neural network. Its artificial neurons can respond to a part of the surrounding units in the coverage area. Its basic structure includes two layers. One is the feature extraction layer. The input of each neuron is The local receptive fields of the previous layer are connected, and the local features are extracted. Once the local feature is extracted, the positional relationship between it and other features is also determined; the second is the feature mapping layer, each computing layer of the network is composed of multiple feature maps, and each feature map is a plane. The weights of all neurons on the plane are equal. The feature mapping structure uses a sigmoid function with a small influencing function core as the activation function of the convolutional network, which makes the feature mapping displacement invariant. In addition, since neurons on a mapping plane share weights, the number of free parameters of the network is reduced. Each convolutional layer in the convolutional neural network is followed by a calculation layer for local averaging and secondary extraction. This unique two-feature extraction structure reduces the feature resolution.
输入层:输入层是整个卷积神经网络唯一的数据输入口,主要用于定义不同类型的数据输入。Input layer: The input layer is the only data input port of the entire convolutional neural network, which is mainly used to define different types of data input.
卷积层:对输入卷积层的数据进行卷积操作,输出卷积后的特征图。Convolutional layer: convolve the data of the input convolutional layer, and output the convolutional feature map.
下采样层(Pooling层):Pooling层对传入数据在空间维度上进行下采样操作,使得输入的特征图的长和宽变为原来的一半。Down-sampling layer (Pooling layer): The Pooling layer performs down-sampling operations on the incoming data in spatial dimensions, so that the length and width of the input feature map become half of the original.
全连接层:全连接层和普通神经网络一样,每个神经元都与输入的所有神经元相互连接,然后经过激活函数进行计算。Fully connected layer: The fully connected layer is the same as an ordinary neural network. Each neuron is connected to all input neurons, and then calculated through an activation function.
输出层:输出层也被称为分类层,在最后输出时会计算每一类别的分类分值。Output layer: The output layer is also called the classification layer, and the classification score of each category will be calculated in the final output.
在本申请实施例中,输入层为源说话人梅尔语谱图,该梅尔语谱图依次进入一个7*7的卷积层,3*3的最大值池化层,随后进入4个卷积模块。每个卷积模块从具有线性投影的构建块开始,随后是具有本体映射的不同数量的构建块,最后在softmax层输出经过时序压缩的梅尔语谱。In the embodiment of this application, the input layer is the source speaker Mel's spectrogram, which sequentially enters a 7*7 convolutional layer, a 3*3 maximum pooling layer, and then enters 4 Convolution module. Each convolution module starts with a building block with linear projection, followed by a different number of building blocks with ontology mapping, and finally outputs a time-sequentially compressed Mel language spectrum in the softmax layer.
所述循环神经网络通常用于描述动态的序列数据,随着时间的变化而动态调整自身的网络状态,并且不断进行循环传递。在传统的神经网络模型中,神经元从输入层到隐藏层,再从隐藏层到输出层,层与层之间是全连接或者局部连接的方式,且在数据的传递中,会丢失上一层计算过程中产生的特征信息,而RNN所不同于传统神经网络模型的地方在于一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用与当前输出的计算中,即隐藏层之间的解点不再是无连接的而是有链接的,并且隐藏层的输出不仅包括输入层的输出,还包括上一时刻隐藏层的输出。The cyclic neural network is usually used to describe dynamic sequence data, dynamically adjust its own network state as time changes, and continuously perform cyclic transmission. In the traditional neural network model, neurons go from the input layer to the hidden layer, and then from the hidden layer to the output layer. The layers are fully connected or locally connected, and the last one will be lost in the transmission of data. The feature information generated in the layer calculation process, and the RNN is different from the traditional neural network model in that the current output of a sequence is also related to the previous output. The specific manifestation is that the network will memorize the previous information and apply it to the calculation of the current output, that is, the solution points between the hidden layers are no longer disconnected but linked, and the output of the hidden layer includes not only the input The output of the layer also includes the output of the hidden layer at the previous moment.
在本申请实施例中,将利用时序进行分帧的梅尔频率倒谱系数特征输入 到两层的基于LSTM的循环神经网络模型中,利用梯度下降法求解损失函数。In the embodiment of the present application, the Mel frequency cepstral coefficient feature for framing using time sequence is input into the two-layer LSTM-based cyclic neural network model, and the gradient descent method is used to solve the loss function.
In a neural network, the loss function is used to evaluate the difference between the predicted value Ŷ output by the network model and the true value Y. The loss function is denoted here by L(Y, Ŷ); it is a non-negative real-valued function, and the smaller the loss value, the better the performance of the network model. According to the basic neuron formula in deep learning, the input and output of each layer are z_i = W·s_(i-1) + U·x_i and C_i = f(z_i), where s_(i-1) is the output passed on from the previous layer, W is the link weight from the i-th neuron of layer l to the j-th neuron of layer l+1, U is the weight of the i-th neuron of layer l, x_i is the input of the i-th neuron of layer l, and C_i is the output value of each unit of the output layer. Based on this input-output formula, the mean squared error (MSE) is used to build the loss function L(Y, Ŷ) = (1/n)·Σ_i (Y_i - Ŷ_i)², where n is the batch size, Y_i is the correct answer for the i-th sample in a batch, and Ŷ_i is the predicted value given by the neural network. At the same time, in order to alleviate the vanishing-gradient problem, the ReLU function relu(x) = max(0, x) is selected as the activation function, where x is the input value of the neuron. This function satisfies the sparsity found in bionics: the neuron node is activated only when its input exceeds a certain value, the output is clamped when the input is below 0, and once the input rises above that threshold the output is a linear function of the input.
The preferred embodiment of the present application uses a gradient descent algorithm to solve the loss function. The gradient descent algorithm is the most commonly used optimization algorithm for training neural network models. To find the minimum of the loss function L(Y, Ŷ), the variable y must be updated along the direction opposite to the gradient vector, i.e. along -dL/dy, which makes the loss decrease fastest until it converges to the minimum. The parameter update formula is y = y - α·dL/dy, where α denotes the learning rate. In this way the final neural network parameters for recognizing the mel spectrogram can be obtained.
进一步地,本申请利用Softmax函数输入分类标签。Further, this application uses the Softmax function to input the classification label.
The Softmax is a generalization of logistic regression: logistic regression is used to handle binary classification problems, while its generalization, Softmax regression, is used to handle multi-class classification problems. According to the input mel-frequency cepstral coefficient features, the maximum of the output probabilities over all classes is obtained through this activation function, whose core formula is P(k|x) = exp(x_k) / Σ_(j=1..K) exp(x_j). Assuming there are K classes in total, x_k denotes the sample whose class is k and x_j denotes the sample whose class is j, and the target mel spectrogram is thereby obtained.
步骤五、将所述目标说话人的梅尔语谱图转换为所述文本内容对应的语音并输出。Step 5: Convert the Mel language spectrogram of the target speaker into a voice corresponding to the text content and output it.
本申请较佳实施例利用语音生成模块将目标说话人的梅尔语谱图合成为语音。The preferred embodiment of the present application uses a voice generation module to synthesize the Mel language spectrogram of the target speaker into voice.
语音生成模块用于处理梅尔语谱图并生成高保真以及高自然度的语音。本申请在获得了目标说话人的梅尔语谱图后,使用一个语音生成模块,把梅尔语谱图作为条件输入,生成目标说话人的语音。该语音生成模块采用了一种叫做WaveNet的声码器。当输入不同目标说话人的梅尔语谱图时,该声码器可以根据所述梅尔语谱图生成不同目标说话人的的高保真声音。The speech generation module is used to process Mel's spectrogram and generate high-fidelity and high-naturalness speech. This application uses a voice generation module after obtaining the Mel spectrogram of the target speaker, and uses the Mel spectrogram as a conditional input to generate the target speaker's voice. The speech generation module uses a vocoder called WaveNet. When inputting Mel spectrograms of different target speakers, the vocoder can generate high-fidelity sounds of different target speakers according to the Mel spectrograms.
本申请较佳实施例中所使用的WaveNet声码器,也是由一个非公开的语音数据集训练而成,该数据集与训练卷积神经网络所用的语音数据集为同一数据集。所述WaveNet是一个端到端的TTS(text to speech)模型,其主要概念是因果卷积,所谓因果卷积的意义就是WaveNet在生成t时刻的元素时,只 能使用0到t-1时刻的元素值。由于声音文件是时间上的一维数组,16KHz的采样率的文件,每秒钟就会有16000个元素,而上面所说的因果卷积的感受野非常小,即使堆叠很多层也只能使用到很少的数据来生成t时刻的的元素,为了扩大卷积的感受野,WaveNet采用了堆叠了多层带洞卷积来增到网络的感受野,使得网络生成下一个元素的时候,能够使用更多之前的元素数值。The WaveNet vocoder used in the preferred embodiment of the present application is also trained from a non-public speech data set, which is the same data set used for training the convolutional neural network. The WaveNet is an end-to-end TTS (text to speech) model. Its main concept is causal convolution. The meaning of the so-called causal convolution is that when WaveNet generates elements at time t, it can only use time from 0 to t-1. Element value. Since the sound file is a one-dimensional array in time, a file with a sampling rate of 16KHz will have 16,000 elements per second, and the receptive field of the causal convolution mentioned above is very small, and it can only be used even if many layers are stacked. To generate the element at time t with very little data, in order to expand the receptive field of convolution, WaveNet uses stacked multi-layer convolution with holes to increase the receptive field of the network, so that when the network generates the next element, it can Use more previous element values.
可选地,在其他实施例中,语音合成程序01还可以被分割为一个或者多个模块,一个或者多个模块被存储于存储器11中,并由一个或多个处理器(本实施例为处理器12)所执行以完成本申请,本申请所称的模块是指能够完成特定功能的一系列计算机程序指令段,用于描述语音合成程序在语音合成装置中的执行过程。Optionally, in other embodiments, the speech synthesis program 01 may also be divided into one or more modules, and the one or more modules are stored in the memory 11 and run by one or more processors (in this embodiment, The processor 12) is executed to complete the application. The module referred to in the application refers to a series of computer program instruction segments capable of completing specific functions, and is used to describe the execution process of the speech synthesis program in the speech synthesis device.
例如,参照图5所示,为本申请语音合成装置一实施例中的语音合成程序的程序模块示意图,该实施例中,语音合成程序可以被分割为文本嵌入模块10、梅尔语谱生成模块20、语谱特征转换模块30及语音生成模块40,示例性地:For example, referring to FIG. 5, it is a schematic diagram of the program modules of the speech synthesis program in an embodiment of the speech synthesis device of this application. In this embodiment, the speech synthesis program can be divided into a text embedding module 10 and a Mel language spectrum generating module 20. The spectrum feature conversion module 30 and the speech generation module 40, exemplarily:
所述文本嵌入模块10用于:接收源说话人的语音数据,将所述源说话人的语音数据转换为文本内容,并将所述文本内容转化为文本向量。The text embedding module 10 is configured to receive voice data of a source speaker, convert the voice data of the source speaker into text content, and convert the text content into a text vector.
可选地,所述文本嵌入模块10具体用于将所述文本内容中的汉字进行分词操作,然后将得到的分词转译为带有声调的汉语拼音,并通过独热编码的方式,将转译得到的汉语拼音中的拼音字母和声调数字转换为一维文本向量,再按照时间序列将其转化为一个二维的所述文本向量。Optionally, the text embedding module 10 is specifically configured to perform word segmentation operations on Chinese characters in the text content, and then translate the obtained word segmentation into tonal Chinese pinyin, and use one-hot encoding to translate The pinyin letters and tonal numbers in the Chinese Pinyin are converted into a one-dimensional text vector, and then converted into a two-dimensional text vector according to the time sequence.
所述梅尔语谱生成模块20用于:将所述文本向量转化为源说话人的梅尔语谱图。The Mel language spectrum generating module 20 is used for converting the text vector into the Mel language spectrum map of the source speaker.
可选地,梅尔语谱生成模块20利用经过训练的序列到序列的神经网络模型,将所述二维文本向量转化为源说话人的梅尔语谱图,其中,所述经过训练的序列到序列的神经网络模型采用Tacotron架构,并使用预设语音数据库进行训练,该预设语音数据库包含了多个说话人在安静环境下用录音设备录制的语音文件以及每条语音所对应的文本文件。Optionally, the Mel language spectrum generation module 20 uses a trained sequence-to-sequence neural network model to convert the two-dimensional text vector into the Mel language spectrum of the source speaker, wherein the trained sequence The neural network model to the sequence uses the Tacotron architecture and uses a preset voice database for training. The preset voice database contains voice files recorded by multiple speakers in a quiet environment with a recording device and a text file corresponding to each voice. .
所述语谱特征转换模块30用于:获取目标说话人的语音信号,并将所述目标说话人的语音信号转换为目标说话人的梅尔频率倒谱系数特征,将所述源说话人的梅尔语谱图输入至一个经过训练的语谱特征转换模型中,以将所述源说话人的梅尔语谱图转换为目标梅尔语谱图,并将所述目标梅尔语谱图作为训练值以及将所述目标说话人的梅尔频率倒谱系数特征作为标签值输入至一个损失函数中,当所述损失函数输出的损失值大于或等于预设阈值时,对所述目标梅尔语谱图进行变换调整,直到所述损失函数输出的损失值小于所述预设阈值时,将所述目标梅尔语谱图作为所述目标说话人的梅尔语谱图输出。The spectral feature conversion module 30 is used to obtain the voice signal of the target speaker, and convert the voice signal of the target speaker into the Mel frequency cepstrum coefficient feature of the target speaker, and convert the source speaker’s voice signal The Mel language spectrogram is input into a trained spectral feature conversion model to convert the Mel language spectrogram of the source speaker into a target Mel language spectrogram, and the target Mel language spectrogram As a training value and the Mel frequency cepstrum coefficient feature of the target speaker is input as a label value into a loss function, when the loss value output by the loss function is greater than or equal to a preset threshold, the target speaker The Er language spectrogram is transformed and adjusted until the loss value output by the loss function is less than the preset threshold, and the target Mel language spectrogram is output as the Mel language spectrogram of the target speaker.
可选地,所述语谱特征转换模块30将所述源说话人的梅尔语谱图通过所述预训练的卷积神经网络以进行时序压缩,对经过时序压缩的梅尔语谱图按 照时序进行分帧,每一帧的梅尔频率倒谱系数特征加上目标说话人的身份特征,并输入至所述循环神经网络中进行处理,该循环神经网络逐帧将源说话人的梅尔频率倒谱系数特征转换为目标说话人的梅尔频率倒谱系数特征,得到所述训练值。Optionally, the spectral feature conversion module 30 passes the source speaker’s Mel spectrogram through the pre-trained convolutional neural network for time-series compression, and performs time-series compression on the time-series compressed Mel spectrogram according to Framing is performed in time sequence. The Mel frequency cepstral coefficient feature of each frame plus the identity feature of the target speaker are input to the recurrent neural network for processing. The recurrent neural network divides the source speaker's Mel The frequency cepstral coefficient feature is converted into the Mel frequency cepstral coefficient feature of the target speaker to obtain the training value.
所述语音生成模块40用于:将所述目标说话人的梅尔语谱图转换为所述文本内容对应的语音并输出。The voice generation module 40 is used to convert the Mel language spectrogram of the target speaker into a voice corresponding to the text content and output it.
上述文本嵌入模块10、梅尔语谱生成模块20、语谱特征转换模块30和语音生成模块40等程序模块被执行时所实现的功能或操作步骤与上述实施例大体相同,在此不再赘述。The functions or operation steps implemented by the program modules such as the text embedding module 10, the Mel language spectrum generation module 20, the language spectrum feature conversion module 30, and the speech generation module 40 are substantially the same as those in the foregoing embodiment, and will not be repeated here. .
此外,本申请实施例还提出一种计算机可读存储介质,所述计算机可读存储介质上存储有语音合成程序,所述语音合成程序可被一个或多个处理器执行,以实现如下操作:In addition, an embodiment of the present application also proposes a computer-readable storage medium with a speech synthesis program stored on the computer-readable storage medium, and the speech synthesis program can be executed by one or more processors to implement the following operations:
接收源说话人的语音数据,将所述源说话人的语音数据转换为文本内容,并将所述文本内容转化为文本向量;Receiving voice data of a source speaker, converting the voice data of the source speaker into text content, and converting the text content into a text vector;
将所述文本向量转化为源说话人的梅尔语谱图;Converting the text vector into the Mel language spectrogram of the source speaker;
获取目标说话人的语音信号,并将所述目标说话人的语音信号转换为目标说话人的梅尔频率倒谱系数特征;Acquiring the voice signal of the target speaker, and converting the voice signal of the target speaker into the Mel frequency cepstrum coefficient feature of the target speaker;
将所述源说话人的梅尔语谱图输入至一个经过训练的语谱特征转换模型中,以将所述源说话人的梅尔语谱图转换为目标梅尔语谱图,并将所述目标梅尔语谱图作为训练值以及将所述目标说话人的梅尔频率倒谱系数特征作为标签值输入至一个损失函数中,当所述损失函数输出的损失值大于或等于预设阈值时,对所述目标梅尔语谱图进行变换调整,直到所述损失函数输出的损失值小于所述预设阈值时,将所述目标梅尔语谱图作为所述目标说话人的梅尔语谱图输出;及Input the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into the target Mel spectrogram, and then The target Mel language spectrogram is used as a training value and the Mel frequency cepstral coefficient feature of the target speaker is input as a label value into a loss function, when the loss value output by the loss function is greater than or equal to a preset threshold When the target Mel language spectrogram is transformed and adjusted, until the loss value output by the loss function is less than the preset threshold, the target Mel language spectrogram is used as the Mel of the target speaker. Spectrogram output; and
将所述目标说话人的梅尔语谱图转换为所述文本内容对应的语音并输出。The Mel language spectrogram of the target speaker is converted into a voice corresponding to the text content and output.
本申请计算机可读存储介质具体实施方式与上述语音合成装置和方法各实施例基本相同,在此不作累述。The specific implementations of the computer-readable storage medium of the present application are basically the same as those of the above-mentioned speech synthesis device and method, and will not be repeated here.
需要说明的是,上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。并且本文中的术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、装置、物品或者方法不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、装置、物品或者方法所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、装置、物品或者方法中还存在另外的相同要素。It should be noted that the serial numbers of the above embodiments of the present application are only for description, and do not represent the advantages and disadvantages of the embodiments. And the terms "include", "include" or any other variants thereof in this article are intended to cover non-exclusive inclusion, so that a process, device, article or method including a series of elements not only includes those elements, but also includes The other elements listed may also include elements inherent to the process, device, article, or method. If there are no more restrictions, the element defined by the sentence "including a..." does not exclude the existence of other identical elements in the process, device, article or method that includes the element.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk or an optical disk) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in the various embodiments of this application.
以上仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利保护范围内。The above are only preferred embodiments of this application, and do not limit the scope of this application. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or directly or indirectly used in other related technical fields , The same reason is included in the scope of patent protection of this application.

Claims (20)

  1. 一种语音合成方法,其特征在于,所述方法包括:A speech synthesis method, characterized in that the method includes:
    接收源说话人的语音数据,将所述源说话人的语音数据转换为文本内容,并将所述文本内容转化为文本向量;Receiving voice data of a source speaker, converting the voice data of the source speaker into text content, and converting the text content into a text vector;
    将所述文本向量转化为源说话人的梅尔语谱图;Converting the text vector into the Mel language spectrogram of the source speaker;
    获取目标说话人的语音信号,并将所述目标说话人的语音信号转换为目标说话人的梅尔频率倒谱系数特征;Acquiring the voice signal of the target speaker, and converting the voice signal of the target speaker into the Mel frequency cepstrum coefficient feature of the target speaker;
    将所述源说话人的梅尔语谱图输入至一个经过训练的语谱特征转换模型中,以将所述源说话人的梅尔语谱图转换为目标梅尔语谱图,并将所述目标梅尔语谱图作为训练值以及将所述目标说话人的梅尔频率倒谱系数特征作为标签值输入至一个损失函数中,当所述损失函数输出的损失值大于或等于预设阈值时,对所述目标梅尔语谱图进行变换调整,直到所述损失函数输出的损失值小于所述预设阈值时,将所述目标梅尔语谱图作为所述目标说话人的梅尔语谱图输出;及Input the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into the target Mel spectrogram, and then The target Mel language spectrogram is used as a training value and the Mel frequency cepstral coefficient feature of the target speaker is input as a label value into a loss function, when the loss value output by the loss function is greater than or equal to a preset threshold When the target Mel language spectrogram is transformed and adjusted, until the loss value output by the loss function is less than the preset threshold, the target Mel language spectrogram is used as the Mel of the target speaker. Spectrogram output; and
    将所述目标说话人的梅尔语谱图转换为所述文本内容对应的语音并输出。The Mel language spectrogram of the target speaker is converted into a voice corresponding to the text content and output.
  2. 如权利要求1所述的语音合成方法,其特征在于,所述将所述文本内容转化为文本向量包括:The speech synthesis method according to claim 1, wherein said converting said text content into a text vector comprises:
    将所述文本内容中的汉字进行分词操作,将得到的分词转译为带有声调的汉语拼音,通过独热编码的方式,将转译得到的汉语拼音中的拼音字母和声调数字转换为一维文本向量,再按照时间序列将所述一维文本向量转化为二维的所述文本向量。Perform word segmentation operations on the Chinese characters in the text content, translate the obtained word segmentation into tonal Chinese pinyin, and use one-hot encoding to convert the pinyin letters and tonal numbers in the translated Chinese pinyin into one-dimensional text Vector, and then convert the one-dimensional text vector into the two-dimensional text vector according to the time sequence.
  3. 如权利要求1所述的语音合成方法,其特征在于,所述将所述文本向量转化为源说话人的梅尔语谱图,包括:5. The speech synthesis method according to claim 1, wherein said converting the text vector into the Mel language spectrogram of the source speaker comprises:
    利用经过训练的序列到序列的神经网络模型,将所述二维文本向量转化为源说话人的梅尔语谱图,其中,所述经过训练的序列到序列的神经网络模型采用Tacotron架构,并使用预设语音数据库进行训练,该预设语音数据库包含了多个说话人在安静环境下用录音设备录制的语音文件以及每条语音所对应的文本文件。Use the trained sequence-to-sequence neural network model to transform the two-dimensional text vector into the Mel language spectrogram of the source speaker, where the trained sequence-to-sequence neural network model adopts the Tacotron architecture, and Use a preset voice database for training. The preset voice database contains voice files recorded by multiple speakers in a quiet environment with a recording device and a text file corresponding to each voice.
  4. 如权利要求2所述的语音合成方法,其特征在于,所述将所述文本向量转化为源说话人的梅尔语谱图,包括:3. The speech synthesis method according to claim 2, wherein the converting the text vector into the Mel language spectrogram of the source speaker comprises:
    利用经过训练的序列到序列的神经网络模型,将所述二维文本向量转化为源说话人的梅尔语谱图,其中,所述经过训练的序列到序列的神经网络模型采用Tacotron架构,并使用预设语音数据库进行训练,该预设语音数据库包含了多个说话人在安静环境下用录音设备录制的语音文件以及每条语音所对应的文本文件。Use the trained sequence-to-sequence neural network model to transform the two-dimensional text vector into the Mel language spectrogram of the source speaker, where the trained sequence-to-sequence neural network model adopts the Tacotron architecture, and Use a preset voice database for training. The preset voice database contains voice files recorded by multiple speakers in a quiet environment with a recording device and a text file corresponding to each voice.
  5. The speech synthesis method according to claim 1, wherein the spectral feature conversion model comprises a pre-trained convolutional neural network model and a two-layer recurrent neural network based on bidirectional LSTM, and wherein inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram comprises:
    passing the Mel spectrogram of the source speaker through the pre-trained convolutional neural network model for temporal compression;
    dividing the temporally compressed Mel spectrogram into frames according to the time sequence, adding the identity features of the target speaker to the Mel-frequency cepstral coefficient features of each frame, and inputting the result into the recurrent neural network for processing, the recurrent neural network converting the Mel-frequency cepstral coefficient features of the source speaker into the Mel-frequency cepstral coefficient features of the target speaker frame by frame to obtain the target Mel spectrogram.
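The frame-by-frame conversion network of claim 5 can be sketched as a two-layer bidirectional LSTM that receives, for every frame, the source MFCC features concatenated with a target-speaker identity embedding and emits target-speaker MFCC features. Feature and embedding dimensions below are assumptions.

    import torch
    import torch.nn as nn

    class SpectralConversionRNN(nn.Module):
        def __init__(self, n_mfcc: int = 13, spk_dim: int = 64, hidden: int = 256):
            super().__init__()
            self.rnn = nn.LSTM(n_mfcc + spk_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden, n_mfcc)

        def forward(self, src_mfcc: torch.Tensor, spk_embed: torch.Tensor) -> torch.Tensor:
            # src_mfcc: (batch, frames, n_mfcc); spk_embed: (batch, spk_dim)
            frames = src_mfcc.size(1)
            spk = spk_embed.unsqueeze(1).expand(-1, frames, -1)    # add the identity features to every frame
            out, _ = self.rnn(torch.cat([src_mfcc, spk], dim=-1))
            return self.proj(out)                                  # target-speaker MFCC features, frame by frame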
  6. The speech synthesis method according to any one of claims 2 to 4, wherein the spectral feature conversion model comprises a pre-trained convolutional neural network model and a two-layer recurrent neural network based on bidirectional LSTM, and wherein inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram comprises:
    passing the Mel spectrogram of the source speaker through the pre-trained convolutional neural network model for temporal compression;
    dividing the temporally compressed Mel spectrogram into frames according to the time sequence, adding the identity features of the target speaker to the Mel-frequency cepstral coefficient features of each frame, and inputting the result into the recurrent neural network for processing, the recurrent neural network converting the Mel-frequency cepstral coefficient features of the source speaker into the Mel-frequency cepstral coefficient features of the target speaker frame by frame to obtain the target Mel spectrogram.
  7. The speech synthesis method according to claim 6, wherein passing the Mel spectrogram of the source speaker through the pre-trained convolutional neural network model for temporal compression comprises:
    inputting the Mel spectrogram of the source speaker into the input layer of the convolutional neural network model, the Mel spectrogram passing successively through a 7×7 convolutional layer, a 3×3 max-pooling layer and four convolution modules, and finally outputting the temporally compressed Mel spectrogram at the softmax layer.
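The compression network of claim 7 can be sketched as follows: a 7×7 convolution, a 3×3 max-pooling layer, four convolution modules and a softmax output. The channel counts, strides and the internal layout of each convolution module are assumptions, since the claim fixes only the layer types.

    import torch
    import torch.nn as nn

    def conv_module(in_ch: int, out_ch: int) -> nn.Sequential:
        # one of the four "convolution modules"; its internal layout is an assumption
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=(1, 2), padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    class CompressionCNN(nn.Module):
        def __init__(self, channels=(32, 64, 128, 256, 256)):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv2d(1, channels[0], kernel_size=7, stride=2, padding=3),  # 7x7 convolutional layer
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),               # 3x3 max-pooling layer
            )
            self.blocks = nn.Sequential(*[
                conv_module(channels[i], channels[i + 1]) for i in range(4)     # 4 convolution modules
            ])

        def forward(self, mel: torch.Tensor) -> torch.Tensor:
            # mel: (batch, 1, n_mels, frames); output: temporally compressed map after softmax
            x = self.blocks(self.stem(mel))
            return torch.softmax(x, dim=-1)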
  8. A speech synthesis apparatus, wherein the apparatus comprises a memory and a processor, the memory storing a speech synthesis program executable on the processor, and the speech synthesis program, when executed by the processor, implementing the following steps:
    receiving speech data of a source speaker, converting the speech data of the source speaker into text content, and converting the text content into a text vector;
    converting the text vector into a Mel spectrogram of the source speaker;
    acquiring a speech signal of a target speaker, and converting the speech signal of the target speaker into Mel-frequency cepstral coefficient features of the target speaker;
    inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, inputting the target Mel spectrogram as a training value and the Mel-frequency cepstral coefficient features of the target speaker as a label value into a loss function, and, when the loss value output by the loss function is greater than or equal to a preset threshold, transforming and adjusting the target Mel spectrogram until the loss value output by the loss function is less than the preset threshold, and then outputting the target Mel spectrogram as the Mel spectrogram of the target speaker; and
    converting the Mel spectrogram of the target speaker into speech corresponding to the text content, and outputting the speech.
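Two of the steps restated in claim 8, extracting the target speaker's MFCC features and turning the final Mel spectrogram back into a waveform, can be sketched with librosa as below. The use of librosa, the Griffin-Lim-based mel inversion and all parameter values are assumptions; the claims do not name a particular signal-processing library or vocoder.

    import librosa
    import soundfile as sf

    def extract_target_mfcc(wav_path: str, n_mfcc: int = 13):
        # convert the target speaker's speech signal into MFCC features
        y, sr = librosa.load(wav_path, sr=None)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), sr

    def mel_to_speech(mel, sr: int, out_path: str = "synthesized.wav"):
        # convert the target speaker's Mel spectrogram into speech and output it
        y = librosa.feature.inverse.mel_to_audio(mel, sr=sr)   # Griffin-Lim based inversion
        sf.write(out_path, y, sr)
        return out_path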
  9. The speech synthesis apparatus according to claim 8, wherein converting the text content into a two-dimensional text vector comprises:
    performing a word segmentation operation on the Chinese characters in the text content, transliterating the resulting word segments into Chinese pinyin with tones, converting the pinyin letters and tone digits of the transliterated pinyin into one-dimensional text vectors by means of one-hot encoding, and then converting the one-dimensional text vectors into the two-dimensional text vector according to the time sequence.
  10. The speech synthesis apparatus according to claim 8, wherein converting the text vector into the Mel spectrogram of the source speaker comprises:
    converting the two-dimensional text vector into the Mel spectrogram of the source speaker by using a trained sequence-to-sequence neural network model, wherein the trained sequence-to-sequence neural network model adopts the Tacotron architecture and is trained with a preset speech database, the preset speech database containing speech files recorded by a plurality of speakers with recording equipment in a quiet environment and a text file corresponding to each speech file.
  11. The speech synthesis apparatus according to claim 9, wherein converting the text vector into the Mel spectrogram of the source speaker comprises:
    converting the two-dimensional text vector into the Mel spectrogram of the source speaker by using a trained sequence-to-sequence neural network model, wherein the trained sequence-to-sequence neural network model adopts the Tacotron architecture and is trained with a preset speech database, the preset speech database containing speech files recorded by a plurality of speakers with recording equipment in a quiet environment and a text file corresponding to each speech file.
  12. The speech synthesis apparatus according to claim 8, wherein the spectral feature conversion model comprises a pre-trained convolutional neural network model and a two-layer recurrent neural network based on bidirectional LSTM, and wherein inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram comprises:
    passing the Mel spectrogram of the source speaker through the pre-trained neural network model for temporal compression;
    dividing the temporally compressed Mel spectrogram into frames according to the time sequence, adding the identity features of the target speaker to the Mel-frequency cepstral coefficient features of each frame, and inputting the result into the recurrent neural network for processing, the recurrent neural network converting the Mel-frequency cepstral coefficient features of the source speaker into the Mel-frequency cepstral coefficient features of the target speaker frame by frame to obtain the target Mel spectrogram.
  13. The speech synthesis apparatus according to any one of claims 9 to 11, wherein the spectral feature conversion model comprises a pre-trained convolutional neural network model and a two-layer recurrent neural network based on bidirectional LSTM, and wherein inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram comprises:
    passing the Mel spectrogram of the source speaker through the pre-trained neural network model for temporal compression;
    dividing the temporally compressed Mel spectrogram into frames according to the time sequence, adding the identity features of the target speaker to the Mel-frequency cepstral coefficient features of each frame, and inputting the result into the recurrent neural network for processing, the recurrent neural network converting the Mel-frequency cepstral coefficient features of the source speaker into the Mel-frequency cepstral coefficient features of the target speaker frame by frame to obtain the target Mel spectrogram.
  14. The speech synthesis apparatus according to claim 13, wherein passing the Mel spectrogram of the source speaker through the pre-trained convolutional neural network model for temporal compression comprises:
    inputting the Mel spectrogram of the source speaker into the input layer of the convolutional neural network model, the Mel spectrogram passing successively through a 7×7 convolutional layer, a 3×3 max-pooling layer and four convolution modules, and finally outputting the temporally compressed Mel spectrogram at the softmax layer.
  15. A computer-readable storage medium, wherein a speech synthesis program is stored on the computer-readable storage medium, and the speech synthesis program is executable by one or more processors to implement the following steps:
    receiving speech data of a source speaker, converting the speech data of the source speaker into text content, and converting the text content into a text vector;
    converting the text vector into a Mel spectrogram of the source speaker;
    acquiring a speech signal of a target speaker, and converting the speech signal of the target speaker into Mel-frequency cepstral coefficient features of the target speaker;
    inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, inputting the target Mel spectrogram as a training value and the Mel-frequency cepstral coefficient features of the target speaker as a label value into a loss function, and, when the loss value output by the loss function is greater than or equal to a preset threshold, transforming and adjusting the target Mel spectrogram until the loss value output by the loss function is less than the preset threshold, and then outputting the target Mel spectrogram as the Mel spectrogram of the target speaker; and
    converting the Mel spectrogram of the target speaker into speech corresponding to the text content, and outputting the speech.
  16. The computer-readable storage medium according to claim 15, wherein converting the text content into a two-dimensional text vector comprises:
    performing a word segmentation operation on the Chinese characters in the text content, transliterating the resulting word segments into Chinese pinyin with tones, converting the pinyin letters and tone digits of the transliterated pinyin into one-dimensional text vectors by means of one-hot encoding, and then converting the one-dimensional text vectors into the two-dimensional text vector according to the time sequence.
  17. The computer-readable storage medium according to claim 15, wherein converting the text vector into the Mel spectrogram of the source speaker comprises:
    converting the two-dimensional text vector into the Mel spectrogram of the source speaker by using a trained sequence-to-sequence neural network model, wherein the trained sequence-to-sequence neural network model adopts the Tacotron architecture and is trained with a preset speech database, the preset speech database containing speech files recorded by a plurality of speakers with recording equipment in a quiet environment and a text file corresponding to each speech file.
  18. The computer-readable storage medium according to claim 16, wherein converting the text vector into the Mel spectrogram of the source speaker comprises:
    converting the two-dimensional text vector into the Mel spectrogram of the source speaker by using a trained sequence-to-sequence neural network model, wherein the trained sequence-to-sequence neural network model adopts the Tacotron architecture and is trained with a preset speech database, the preset speech database containing speech files recorded by a plurality of speakers with recording equipment in a quiet environment and a text file corresponding to each speech file.
  19. The computer-readable storage medium according to any one of claims 15 to 18, wherein the spectral feature conversion model comprises a pre-trained convolutional neural network model and a two-layer recurrent neural network based on bidirectional LSTM, and wherein inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram comprises:
    passing the Mel spectrogram of the source speaker through the pre-trained neural network model for temporal compression;
    dividing the temporally compressed Mel spectrogram into frames according to the time sequence, adding the identity features of the target speaker to the Mel-frequency cepstral coefficient features of each frame, and inputting the result into the recurrent neural network for processing, the recurrent neural network converting the Mel-frequency cepstral coefficient features of the source speaker into the Mel-frequency cepstral coefficient features of the target speaker frame by frame to obtain the target Mel spectrogram.
  20. The computer-readable storage medium according to claim 19, wherein passing the Mel spectrogram of the source speaker through the pre-trained convolutional neural network model for temporal compression comprises:
    inputting the Mel spectrogram of the source speaker into the input layer of the convolutional neural network model, the Mel spectrogram passing successively through a 7×7 convolutional layer, a 3×3 max-pooling layer and four convolution modules, and finally outputting the temporally compressed Mel spectrogram at the softmax layer.
PCT/CN2019/102198 2019-05-22 2019-08-23 Speech synthesis method and apparatus, and computer readable storage medium WO2020232860A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910438778.3A CN110136690B (en) 2019-05-22 2019-05-22 Speech synthesis method, device and computer readable storage medium
CN201910438778.3 2019-05-22

Publications (1)

Publication Number Publication Date
WO2020232860A1 true WO2020232860A1 (en) 2020-11-26

Family

ID=67572945

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102198 WO2020232860A1 (en) 2019-05-22 2019-08-23 Speech synthesis method and apparatus, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN110136690B (en)
WO (1) WO2020232860A1 (en)

Families Citing this family (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110136690B (en) * 2019-05-22 2023-07-14 平安科技(深圳)有限公司 Speech synthesis method, device and computer readable storage medium
CN111508466A (en) * 2019-09-12 2020-08-07 马上消费金融股份有限公司 Text processing method, device and equipment and computer readable storage medium
CN111048071B (en) * 2019-11-11 2023-05-30 京东科技信息技术有限公司 Voice data processing method, device, computer equipment and storage medium
CN111161702B (en) * 2019-12-23 2022-08-26 爱驰汽车有限公司 Personalized speech synthesis method and device, electronic equipment and storage medium
WO2021127811A1 (en) * 2019-12-23 2021-07-01 深圳市优必选科技股份有限公司 Speech synthesis method and apparatus, intelligent terminal, and readable medium
WO2021127978A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Speech synthesis method and apparatus, computer device and storage medium
CN111247584B (en) * 2019-12-24 2023-05-23 深圳市优必选科技股份有限公司 Voice conversion method, system, device and storage medium
WO2021128256A1 (en) * 2019-12-27 2021-07-01 深圳市优必选科技股份有限公司 Voice conversion method, apparatus and device, and storage medium
CN111433847B (en) * 2019-12-31 2023-06-09 深圳市优必选科技股份有限公司 Voice conversion method, training method, intelligent device and storage medium
CN110797002B (en) * 2020-01-03 2020-05-19 同盾控股有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN111179905A (en) * 2020-01-10 2020-05-19 北京中科深智科技有限公司 Rapid dubbing generation method and device
CN111261177A (en) * 2020-01-19 2020-06-09 平安科技(深圳)有限公司 Voice conversion method, electronic device and computer readable storage medium
CN111489734B (en) * 2020-04-03 2023-08-22 支付宝(杭州)信息技术有限公司 Model training method and device based on multiple speakers
CN111611431B (en) * 2020-04-16 2023-07-28 北京邮电大学 Music classification method based on deep learning
CN111710326B (en) * 2020-06-12 2024-01-23 携程计算机技术(上海)有限公司 English voice synthesis method and system, electronic equipment and storage medium
CN111785247A (en) * 2020-07-13 2020-10-16 北京字节跳动网络技术有限公司 Voice generation method, device, equipment and computer readable medium
CN111899715B (en) * 2020-07-14 2024-03-29 升智信息科技(南京)有限公司 Speech synthesis method
KR20230058401A (en) * 2020-07-31 2023-05-03 디티에스, 인코포레이티드 Signal transformation based on unique key-based network guidance and conditioning
CN111985231B (en) * 2020-08-07 2023-12-26 中移(杭州)信息技术有限公司 Unsupervised role recognition method and device, electronic equipment and storage medium
CN112071325B (en) * 2020-09-04 2023-09-05 中山大学 Many-to-many voice conversion method based on double voiceprint feature vector and sequence-to-sequence modeling
CN112037766B (en) * 2020-09-09 2022-03-04 广州方硅信息技术有限公司 Voice tone conversion method and related equipment
CN112634918B (en) * 2020-09-29 2024-04-16 江苏清微智能科技有限公司 System and method for converting voice of any speaker based on acoustic posterior probability
CN112309365B (en) * 2020-10-21 2024-05-10 北京大米科技有限公司 Training method and device of speech synthesis model, storage medium and electronic equipment
CN112289299B (en) * 2020-10-21 2024-05-14 北京大米科技有限公司 Training method and device of speech synthesis model, storage medium and electronic equipment
CN112562728B (en) * 2020-11-13 2024-06-18 百果园技术(新加坡)有限公司 Method for generating countermeasure network training, method and device for audio style migration
CN112509550A (en) * 2020-11-13 2021-03-16 中信银行股份有限公司 Speech synthesis model training method, speech synthesis device and electronic equipment
CN112562634B (en) * 2020-12-02 2024-05-10 平安科技(深圳)有限公司 Multi-style audio synthesis method, device, equipment and storage medium
CN112509600A (en) * 2020-12-11 2021-03-16 平安科技(深圳)有限公司 Model training method and device, voice conversion method and device and storage medium
CN112767918B (en) * 2020-12-30 2023-12-01 中国人民解放军战略支援部队信息工程大学 Russian Chinese language translation method, russian Chinese language translation device and storage medium
CN112908294B (en) * 2021-01-14 2024-04-05 杭州倒映有声科技有限公司 Speech synthesis method and speech synthesis system
CN112712813B (en) * 2021-03-26 2021-07-20 北京达佳互联信息技术有限公司 Voice processing method, device, equipment and storage medium
CN113178201B (en) * 2021-04-30 2024-06-28 平安科技(深圳)有限公司 Voice conversion method, device, equipment and medium based on non-supervision
CN113436607B (en) * 2021-06-12 2024-04-09 西安工业大学 Quick voice cloning method
CN113409759B (en) * 2021-07-07 2023-04-07 浙江工业大学 End-to-end real-time speech synthesis method
CN113470616B (en) * 2021-07-14 2024-02-23 北京达佳互联信息技术有限公司 Speech processing method and device, vocoder and training method of vocoder
CN113345416B (en) * 2021-08-02 2021-10-29 智者四海(北京)技术有限公司 Voice synthesis method and device and electronic equipment
CN114283822A (en) * 2021-12-24 2022-04-05 华东理工大学 Many-to-one voice conversion method based on gamma pass frequency cepstrum coefficient

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10186251B1 (en) * 2015-08-06 2019-01-22 Oben, Inc. Voice conversion using deep neural network with intermediate voice training
CN107481713B (en) * 2017-07-17 2020-06-02 清华大学 Mixed language voice synthesis method and device
US10796686B2 (en) * 2017-10-19 2020-10-06 Baidu Usa Llc Systems and methods for neural text-to-speech using convolutional sequence learning
CN109523993B (en) * 2018-11-02 2022-02-08 深圳市网联安瑞网络科技有限公司 Voice language classification method based on CNN and GRU fusion deep neural network
CN109584893B (en) * 2018-12-26 2021-09-14 南京邮电大学 VAE and i-vector based many-to-many voice conversion system under non-parallel text condition

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9082401B1 (en) * 2013-01-09 2015-07-14 Google Inc. Text-to-speech synthesis
CN105390141A (en) * 2015-10-14 2016-03-09 科大讯飞股份有限公司 Sound conversion method and sound conversion device
CN108108357A (en) * 2018-01-12 2018-06-01 京东方科技集团股份有限公司 Accent conversion method and device, electronic equipment
CN109473091A (en) * 2018-12-25 2019-03-15 四川虹微技术有限公司 A kind of speech samples generation method and device
CN110136690A (en) * 2019-05-22 2019-08-16 平安科技(深圳)有限公司 Phoneme synthesizing method, device and computer readable storage medium

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112652325A (en) * 2020-12-15 2021-04-13 平安科技(深圳)有限公司 Remote voice adjusting method based on artificial intelligence and related equipment
CN112652325B (en) * 2020-12-15 2023-12-15 平安科技(深圳)有限公司 Remote voice adjustment method based on artificial intelligence and related equipment
CN112652318A (en) * 2020-12-21 2021-04-13 北京捷通华声科技股份有限公司 Tone conversion method and device and electronic equipment
CN112652318B (en) * 2020-12-21 2024-03-29 北京捷通华声科技股份有限公司 Tone color conversion method and device and electronic equipment
CN112712812A (en) * 2020-12-24 2021-04-27 腾讯音乐娱乐科技(深圳)有限公司 Audio signal generation method, device, equipment and storage medium
CN112712812B (en) * 2020-12-24 2024-04-26 腾讯音乐娱乐科技(深圳)有限公司 Audio signal generation method, device, equipment and storage medium
CN113539231A (en) * 2020-12-30 2021-10-22 腾讯科技(深圳)有限公司 Audio processing method, vocoder, device, equipment and storage medium
CN112992177B (en) * 2021-02-20 2023-10-17 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of voice style migration model
CN112992177A (en) * 2021-02-20 2021-06-18 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of voice style migration model
CN113178200A (en) * 2021-04-28 2021-07-27 平安科技(深圳)有限公司 Voice conversion method, device, server and storage medium
CN113178200B (en) * 2021-04-28 2024-03-01 平安科技(深圳)有限公司 Voice conversion method, device, server and storage medium
CN113284499A (en) * 2021-05-24 2021-08-20 湖北亿咖通科技有限公司 Voice instruction recognition method and electronic equipment
CN113643687B (en) * 2021-07-08 2023-07-18 南京邮电大学 Non-parallel many-to-many voice conversion method integrating DSNet and EDSR networks
CN113643687A (en) * 2021-07-08 2021-11-12 南京邮电大学 Non-parallel many-to-many voice conversion method fusing DSNet and EDSR network
CN113611283A (en) * 2021-08-11 2021-11-05 北京工业大学 Voice synthesis method and device, electronic equipment and storage medium
CN113611283B (en) * 2021-08-11 2024-04-05 北京工业大学 Speech synthesis method, device, electronic equipment and storage medium
CN113658583B (en) * 2021-08-17 2023-07-25 安徽大学 Ear voice conversion method, system and device based on generation countermeasure network
CN113658583A (en) * 2021-08-17 2021-11-16 安徽大学 Method, system and device for converting ear voice based on generation countermeasure network
CN113488057B (en) * 2021-08-18 2023-11-14 山东新一代信息产业技术研究院有限公司 Conversation realization method and system for health care
CN113488057A (en) * 2021-08-18 2021-10-08 山东新一代信息产业技术研究院有限公司 Health-oriented conversation implementation method and system
CN113837299B (en) * 2021-09-28 2023-09-01 平安科技(深圳)有限公司 Network training method and device based on artificial intelligence and electronic equipment
CN113837299A (en) * 2021-09-28 2021-12-24 平安科技(深圳)有限公司 Network training method and device based on artificial intelligence and electronic equipment
CN117745904A (en) * 2023-12-14 2024-03-22 山东浪潮超高清智能科技有限公司 2D playground speaking portrait synthesizing method and device

Also Published As

Publication number Publication date
CN110136690B (en) 2023-07-14
CN110136690A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
WO2020232860A1 (en) Speech synthesis method and apparatus, and computer readable storage medium
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
US20240038218A1 (en) Speech model personalization via ambient context harvesting
US11205444B2 (en) Utilizing bi-directional recurrent encoders with multi-hop attention for speech emotion recognition
CN111833845B (en) Multilingual speech recognition model training method, device, equipment and storage medium
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
CN112712813B (en) Voice processing method, device, equipment and storage medium
Wu et al. Audio classification using attention-augmented convolutional neural network
CN112204653A (en) Direct speech-to-speech translation through machine learning
CN111081230B (en) Speech recognition method and device
WO2022142850A1 (en) Audio processing method and apparatus, vocoder, electronic device, computer readable storage medium, and computer program product
CN113837299B (en) Network training method and device based on artificial intelligence and electronic equipment
WO2021127982A1 (en) Speech emotion recognition method, smart device, and computer-readable storage medium
US20230087916A1 (en) Transforming text data into acoustic feature
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN111144124A (en) Training method of machine learning model, intention recognition method, related device and equipment
CN113539232B (en) Voice synthesis method based on lesson-admiring voice data set
US20230237993A1 (en) Systems and Methods for Training Dual-Mode Machine-Learned Speech Recognition Models
CN115602165A (en) Digital staff intelligent system based on financial system
JP2022037862A (en) Method, system, and computer readable storage media for distilling longitudinal section type spoken language understanding knowledge utilizing text-based pre-learning model
WO2023102931A1 (en) Method for predicting prosodic structure, and electronic device, program product and storage medium
Cornell et al. Implicit acoustic echo cancellation for keyword spotting and device-directed speech detection
JP2021157145A (en) Inference device and learning method of inference device
CN113948064A (en) Speech synthesis and speech recognition
CN113823271B (en) Training method and device for voice classification model, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19929919; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19929919; Country of ref document: EP; Kind code of ref document: A1)