WO2020232860A1 - Speech synthesis method and apparatus, and computer readable storage medium - Google Patents

Speech synthesis method and apparatus, and computer readable storage medium Download PDF

Info

Publication number
WO2020232860A1
Authority
WO
WIPO (PCT)
Prior art keywords
mel
spectrogram
speaker
target
neural network
Prior art date
Application number
PCT/CN2019/102198
Other languages
French (fr)
Chinese (zh)
Inventor
彭话易
程宁
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020232860A1 publication Critical patent/WO2020232860A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013Adapting to target pitch
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013Adapting to target pitch
    • G10L2021/0135Voice conversion or morphing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a speech synthesis method, device and computer-readable storage medium.
  • This application provides a speech synthesis method, device, and computer-readable storage medium, the main purpose of which is to provide a solution that can realize timbre conversion in a speech synthesis system.
  • A speech synthesis method includes: receiving speech data of a source speaker, converting the speech data of the source speaker into text content, and converting the text content into a text vector; converting the text vector into a Mel spectrogram of the source speaker; obtaining a speech signal of a target speaker, and converting the speech signal of the target speaker into Mel frequency cepstral coefficient features of the target speaker;
  • inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, taking the target Mel spectrogram as a training value and inputting the Mel frequency cepstral coefficient features of the target speaker as a label value into a loss function; when the loss value output by the loss function is greater than or equal to a preset threshold, transforming and adjusting the target Mel spectrogram until the loss value output by the loss function is less than the preset threshold, and then outputting the target Mel spectrogram as the Mel spectrogram of the target speaker; and converting the Mel spectrogram of the target speaker into speech corresponding to the text content and outputting it.
  • the present application also provides a speech synthesis device, which includes a memory and a processor.
  • The memory stores a speech synthesis program that can run on the processor, and when the speech synthesis program is executed by the processor, the following steps are implemented: receiving the speech data of the source speaker, converting the speech data of the source speaker into text content, and converting the text content into a text vector; converting the text vector into a Mel spectrogram of the source speaker; obtaining the speech signal of the target speaker, and converting the speech signal of the target speaker into Mel frequency cepstral coefficient features of the target speaker; inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, and inputting the target Mel spectrogram as a training value and the Mel frequency cepstral coefficient features of the target speaker as a label value into a loss function.
  • When the loss value output by the loss function is greater than or equal to a preset threshold, the target Mel spectrogram is transformed and adjusted until the loss value is less than the preset threshold, after which the target Mel spectrogram is output as the Mel spectrogram of the target speaker, and the Mel spectrogram of the target speaker is converted into speech corresponding to the text content and output.
  • the present application also provides a computer-readable storage medium with a speech synthesis program stored on the computer-readable storage medium.
  • The speech synthesis program can be executed by one or more processors to implement the steps of the speech synthesis method described above.
  • The speech synthesis method, device, and computer-readable storage medium proposed in this application use a pre-trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into the Mel spectrogram of the target speaker, so that text content that would otherwise be output in the source speaker's timbre is instead output in the target speaker's timbre, realizing timbre conversion in the speech synthesis system.
  • FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of this application
  • FIG. 2 is a schematic diagram of converting text content into text vectors in a speech synthesis method provided by an embodiment of the application;
  • FIG. 3 is a schematic structural diagram of a spectral feature conversion model in a speech synthesis method provided by an embodiment of this application;
  • FIG. 4 is a schematic diagram of the internal structure of a speech synthesis device provided by an embodiment of the application.
  • FIG. 5 is a schematic diagram of modules of a speech synthesis program in a speech synthesis device provided by an embodiment of the application.
  • This application provides a speech synthesis method.
  • Referring to FIG. 1, it is a schematic flowchart of a speech synthesis method provided by an embodiment of this application.
  • the method can be executed by a device, and the device can be implemented by software and/or hardware.
  • the speech synthesis method includes:
  • This application uses a text embedding module to convert Chinese characters in the text content into text vectors.
  • This application uses the text embedding module to perform word segmentation on the Chinese characters in the input text content, and then transliterates the resulting segments into Chinese pinyin with tones (using the digits 1-5 to represent the four tones and the neutral tone of Mandarin); for example, the segmented word "您好" is converted to "nin2hao3".
  • Further, this application uses one-hot encoding to convert the pinyin letters and tone digits in the transliterated pinyin into one-dimensional text vectors, which are then arranged in time order into a two-dimensional text vector, as shown in FIG. 2.
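  • The following is a minimal Python sketch of the encoding just described. The symbol inventory and vector layout are illustrative assumptions rather than values taken from this application; only the idea of one one-hot vector per pinyin letter or tone digit, stacked in time order into a two-dimensional vector, follows the description.

```python
import numpy as np

# Assumed symbol set: lowercase pinyin letters plus the tone digits 1-5.
SYMBOLS = list("abcdefghijklmnopqrstuvwxyz") + list("12345")
SYM2IDX = {s: i for i, s in enumerate(SYMBOLS)}

def one_hot(symbol: str) -> np.ndarray:
    """Encode one pinyin letter or tone digit as a one-dimensional one-hot vector."""
    vec = np.zeros(len(SYMBOLS), dtype=np.float32)
    vec[SYM2IDX[symbol]] = 1.0
    return vec

def encode_pinyin(pinyin: str) -> np.ndarray:
    """Stack the one-hot vectors in time order to form a two-dimensional text vector."""
    return np.stack([one_hot(ch) for ch in pinyin])

# Example from the description: the segmented word "您好" transliterated as "nin2hao3".
text_vector = encode_pinyin("nin2hao3")
print(text_vector.shape)  # (8, 31): 8 time steps, each a 31-dimensional one-hot vector
```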
  • In a preferred embodiment of this application, the text vector is input into a Mel spectrogram generation module, which converts the text vector into the Mel spectrogram of the source speaker.
  • The Mel spectrogram generation module receives the text vector passed from the text embedding module and uses a trained sequence-to-sequence neural network model to convert the text vector into the Mel spectrogram of the source speaker.
  • The trained sequence-to-sequence neural network model described in this application adopts the Tacotron architecture and was trained on a non-public speech database.
  • The speech database contains about 30 hours of speech files recorded by one female speaker (i.e. the source speaker) in a quiet environment with dedicated recording equipment, together with the text file corresponding to each utterance. After being mapped by the trained sequence-to-sequence neural network model, the input text vector is converted into the Mel spectrogram of the source speaker.
  • The Mel spectrogram is a spectrogram based on Mel frequency cepstral coefficient (MFCC) features.
  • To obtain the Mel frequency cepstral coefficient features, this application first applies a pre-emphasis filter to boost the high-frequency components and the signal-to-noise ratio, using the formula y(t) = x(t) - αx(t-1), where x is the input signal, y is the output signal, x(t) is the signal at time t, x(t-1) is the signal at time t-1, and α is generally set to 0.97.
  • The pre-emphasis filter yields the signal output y(t) at time t with the high-frequency components and signal-to-noise ratio boosted; a short-time Fourier transform is then performed.
  • To simulate the human ear's suppression of high-frequency signals, this application processes the linear spectrum produced by the short-time Fourier transform with a filter bank composed of multiple triangular filters, obtaining low-dimensional features that emphasize the low-frequency part and attenuate the high-frequency part, thereby yielding the Mel frequency cepstral coefficient features.
  • Preferably, to prevent energy leakage, the preferred embodiment of this application applies a Hanning window function before performing the Fourier transform.
  • The Hanning window can be regarded as the sum of the spectra of three rectangular time windows, or as the sum of three sin(t)-type functions, where the two bracketed terms are shifted left and right by π/T relative to the first spectral window, so that the side lobes cancel each other out, suppressing high-frequency interference and leakage energy.
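  • The following is a hedged Python sketch of this feature-extraction chain (pre-emphasis, Hanning window, short-time Fourier transform, triangular mel filter bank). The use of librosa and the frame size, hop size, and number of mel bands are illustrative assumptions, not values given in this application.

```python
import numpy as np
import librosa

def mel_features(x: np.ndarray, sr: int = 16000, alpha: float = 0.97,
                 n_fft: int = 1024, hop: int = 256, n_mels: int = 80) -> np.ndarray:
    # Pre-emphasis filter y(t) = x(t) - alpha * x(t-1): boosts the high-frequency part.
    y = np.append(x[0], x[1:] - alpha * x[:-1])
    # Short-time Fourier transform with a Hanning window to limit energy leakage.
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop, window="hann"))
    # Triangular mel filter bank: low-dimensional features that emphasize low
    # frequencies and attenuate high frequencies.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    # Log compression gives the cepstral-style features used downstream.
    return np.log(mel_fb @ spec + 1e-6)

# Hypothetical usage on a target-speaker recording (the file name is a placeholder).
audio, sr = librosa.load("target_speaker.wav", sr=16000)
features = mel_features(audio, sr)
```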
  • The spectral feature conversion model described in this application includes a convolutional neural network (CNN) model and a bidirectional LSTM-based recurrent neural network (RNN) model.
  • This application passes the Mel spectrogram of the source speaker through one layer of a pre-trained convolutional neural network for compression along the time axis, so that the features in the Mel spectrogram are better represented.
  • The processed Mel spectrogram is then divided into frames according to the time sequence.
  • The Mel frequency cepstral coefficient features of each frame are combined with the identity feature of the target speaker and then input into a two-layer bidirectional LSTM-based recurrent neural network for processing.
  • The bidirectional LSTM recurrent neural network converts the Mel spectrogram of the source speaker into the target Mel spectrogram frame by frame. Further, this application takes the converted target Mel spectrogram as the training value and the Mel frequency cepstral coefficient features of the target speaker obtained in step S3 above as the label value, and inputs them into a loss function. When the loss value output by the loss function is greater than or equal to the preset threshold, the target Mel spectrogram is transformed and adjusted; once the loss value is less than the preset threshold, the target Mel spectrogram is output as the Mel spectrogram of the target speaker. A hedged sketch of this conversion model is given below.
  • the structure of the spectral feature conversion model is shown in FIG. 3.
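  • The following PyTorch sketch illustrates the model as described: a convolutional front end that compresses the source Mel spectrogram along time, a speaker-identity embedding added to every frame, and a two-layer bidirectional LSTM that predicts the target Mel spectrogram frame by frame, trained against the target speaker's features with an MSE-style loss. All layer sizes and the embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpectralFeatureConverter(nn.Module):
    def __init__(self, n_mels: int = 80, n_speakers: int = 10,
                 spk_dim: int = 16, hidden: int = 256):
        super().__init__()
        # One convolutional layer compresses the source spectrogram along the time axis.
        self.conv = nn.Conv1d(n_mels, n_mels, kernel_size=3, stride=2, padding=1)
        # Speaker IDs 1-9 as in the description; index 0 is left unused here.
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)
        self.blstm = nn.LSTM(n_mels + spk_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, src_mel: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # src_mel: (batch, time, n_mels); speaker_id: (batch,)
        x = self.conv(src_mel.transpose(1, 2)).transpose(1, 2)      # compress over time
        spk = self.spk_emb(speaker_id).unsqueeze(1).expand(-1, x.size(1), -1)
        x = torch.cat([x, spk], dim=-1)       # add the speaker identity to every frame
        out, _ = self.blstm(x)
        return self.proj(out)                 # predicted target Mel spectrogram

# Training sketch: compare the prediction with the target speaker's reference features.
model = SpectralFeatureConverter()
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```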
  • the convolutional neural network and the recurrent neural network based on bidirectional LSTM are also trained using a private speech data set.
  • The speech data set contains recordings of N female speakers (preferably, N is 10), each with about 1 hour of speech, and the text content recorded by the 10 speakers is identical.
  • One of the female speakers also recorded the speech database used to train the sequence-to-sequence neural network model above; this speaker is therefore taken as the source speaker.
  • The remaining nine speakers are regarded as target speakers and are assigned the ID numbers 1-9. During training of the convolutional neural network and the bidirectional LSTM-based recurrent neural network, and during subsequent inference, this number is embedded into the corresponding Mel frequency cepstral coefficient features as the target speaker identity vector.
  • The convolutional neural network is a feed-forward neural network whose artificial neurons respond to a portion of the surrounding units within their coverage area. Its basic structure includes two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, from which local features are extracted; once a local feature is extracted, its positional relationship to other features is also determined. The second is the feature mapping layer: each computing layer of the network is composed of multiple feature maps, each feature map is a plane, and all neurons on a plane share equal weights.
  • The feature mapping structure uses a sigmoid function with a small influence-function kernel as the activation function of the convolutional network, which makes the feature mapping shift-invariant.
  • Each convolutional layer in the convolutional neural network is followed by a computing layer for local averaging and secondary feature extraction; this two-stage feature extraction structure reduces the feature resolution.
  • Input layer: the only data input port of the entire convolutional neural network, mainly used to define different types of data input.
  • Convolutional layer: convolves the data fed into it and outputs the convolved feature map.
  • Down-sampling layer: the pooling layer down-samples the incoming data in the spatial dimensions, so that the length and width of the input feature map are halved.
  • Fully connected layer: the same as in an ordinary neural network; each neuron is connected to all input neurons, and the result is then passed through an activation function.
  • Output layer: also called the classification layer; the final output computes the classification score of each category.
  • The input to this network is the Mel spectrogram of the source speaker, which passes sequentially through a 7*7 convolutional layer and a 3*3 max-pooling layer, and then enters 4 convolution modules.
  • Each convolution module starts with a building block with a linear projection, followed by a varying number of building blocks with identity mapping, and the softmax layer finally outputs a Mel spectrogram compressed along the time axis; an illustrative sketch of such a front end follows.
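  • The following PyTorch sketch shows one way such a front end could be built: a 7*7 convolution, a 3*3 max-pooling layer, and four convolution modules, each opening with a linear-projection residual block followed by identity-mapping residual blocks. The channel widths and block counts are assumptions, and the softmax output layer mentioned above is omitted.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block; a 1x1 linear projection is used when the shape changes."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # Linear-projection shortcut when downsampling, identity mapping otherwise.
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1, stride=stride)
                         if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.body(x) + self.shortcut(x))

def conv_module(in_ch: int, out_ch: int, n_identity: int) -> nn.Module:
    """One convolution module: a projection block followed by identity blocks."""
    blocks = [ResBlock(in_ch, out_ch, stride=2)]
    blocks += [ResBlock(out_ch, out_ch) for _ in range(n_identity)]
    return nn.Sequential(*blocks)

frontend = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),   # 7*7 convolutional layer
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # 3*3 max-pooling layer
    conv_module(64, 64, 2), conv_module(64, 128, 3),
    conv_module(128, 256, 5), conv_module(256, 512, 2),      # 4 convolution modules
)
```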
  • The recurrent neural network is typically used to model dynamic sequence data; it dynamically adjusts its own network state as time progresses and performs this recurrent transmission continuously.
  • In a traditional neural network, neurons pass from the input layer to the hidden layer and then from the hidden layer to the output layer; the layers are fully or locally connected, and the feature information produced during the computation of one layer is lost as the data is passed on.
  • The RNN differs from the traditional neural network model in that the current output of a sequence is also related to the previous outputs: the network memorizes earlier information and applies it to the computation of the current output. In other words, the nodes between the hidden layers are no longer unconnected but linked, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step.
  • The Mel frequency cepstral coefficient features, framed according to the time sequence, are input into the two-layer LSTM-based recurrent neural network model, and gradient descent is used to minimize the loss function.
  • The loss function is used to evaluate the difference between the predicted value Ŷ output by the network model and the true value Y. It is denoted here by L(Y, Ŷ); it is a non-negative real-valued function, and the smaller the loss value, the better the performance of the network model.
  • Let w_i denote the weight of the i-th neuron, x_i the i-th neuron of the l-th layer of the network, and C_i the output value of each unit of the output layer; according to this input-output relationship, the mean squared error (MSE) is used to establish the loss function L = (1/n) Σ_{i=1}^{n} (Y_i - Ŷ_i)², where Y_i is the correct answer for the i-th item of data in a batch and Ŷ_i is the predicted value given by the neural network.
  • The activation function used here satisfies the sparsity found in biology: a neuron node is activated only when its input exceeds a certain value, inputs below 0 are suppressed, and above that value the independent variable and the dependent variable have a linear relationship.
  • The preferred embodiment of this application uses a gradient descent algorithm to minimize the loss function.
  • The gradient descent algorithm is the most commonly used optimization algorithm for training neural network models.
  • To find the minimum of the loss, the variable y needs to be updated in the direction opposite to the gradient vector, i.e. along -dL/dy, so that the loss decreases fastest until it converges to a minimum, as in the update rule below.
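  • A hedged restatement of this update rule is given below; the step size (learning rate) η is an assumed hyperparameter rather than a value given in this application.

```latex
% Gradient-descent update implied by the description above.
y_{k+1} = y_k - \eta \left.\frac{\mathrm{d}L}{\mathrm{d}y}\right|_{y = y_k},
\qquad \eta > 0
```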
  • Further, this application uses the Softmax function to produce the classification label.
  • Softmax is a generalization of logistic regression: logistic regression handles binary classification problems, and its generalization, Softmax regression, handles multi-class classification problems. According to the input Mel frequency cepstral coefficient features, the category with the maximum output probability among all categories is obtained through this activation function.
  • The core formula is p(k) = exp(x_k) / Σ_{j=1}^{K} exp(x_j), where there are K categories in total, x_k represents the sample of category k, and x_j represents the sample of category j; the target Mel spectrogram is thereby obtained.
  • The preferred embodiment of this application uses a speech generation module to synthesize the Mel spectrogram of the target speaker into speech.
  • The speech generation module is used to process the Mel spectrogram and generate speech of high fidelity and high naturalness.
  • After obtaining the Mel spectrogram of the target speaker, this application uses the speech generation module with the Mel spectrogram as the conditioning input to generate the target speaker's speech.
  • The speech generation module uses a vocoder called WaveNet. When the Mel spectrograms of different target speakers are input, the vocoder can generate high-fidelity speech of the different target speakers according to the Mel spectrograms.
  • The WaveNet vocoder used in the preferred embodiment of this application is also trained on a non-public speech data set, which is the same data set used for training the convolutional neural network.
  • WaveNet is an end-to-end TTS (text-to-speech) model whose main concept is causal convolution.
  • Causal convolution means that when WaveNet generates the element at time t, it can only use the element values from time 0 to t-1. Since an audio file is a one-dimensional array over time, a file with a sampling rate of 16 kHz contains 16,000 elements per second, while the receptive field of plain causal convolution is very small and remains limited even when many layers are stacked.
  • WaveNet therefore uses stacked multi-layer dilated convolutions to enlarge the receptive field of the network, so that when the network generates the next element it can use more of the preceding element values, as sketched below.
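  • The following PyTorch sketch illustrates the dilated causal convolution idea: each layer doubles its dilation, so the receptive field grows exponentially with depth, while left-only padding keeps the convolution causal (it never looks at samples after the current step). The layer count and channel width are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.pad = dilation   # pad only on the left so no future samples are used
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(F.pad(x, (self.pad, 0)))

# Ten layers with dilations 1, 2, 4, ..., 512.
stack = nn.Sequential(*[CausalDilatedConv1d(64, 2 ** i) for i in range(10)])
# Receptive field: 1 + (1 + 2 + ... + 512) = 1024 samples, versus only 11 samples
# for ten undilated causal convolutions with kernel size 2.
x = torch.zeros(1, 64, 16000)   # one second of dummy features at 16 kHz
y = stack(x)                    # same length as the input thanks to the left padding
```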
  • the application also provides a speech synthesis device.
  • Referring to FIG. 4, it is a schematic diagram of the internal structure of a speech synthesis device provided by an embodiment of this application.
  • the speech synthesis device 1 may be a PC (Personal Computer, personal computer), or a terminal device such as a smart phone, a tablet computer, or a portable computer.
  • the speech synthesis device 1 at least includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc.
  • the memory 11 may be an internal storage unit of the speech synthesis device 1 in some embodiments, such as a hard disk of the speech synthesis device 1.
  • The memory 11 may also be an external storage device of the speech synthesis device 1, such as a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card equipped on the speech synthesis device 1.
  • the memory 11 may also include both an internal storage unit of the speech synthesis apparatus 1 and an external storage device.
  • the memory 11 can be used not only to store application software and various data installed in the speech synthesis device 1, such as the code of the speech synthesis program 01, etc., but also to temporarily store data that has been output or will be output.
  • The processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run the program code stored in the memory 11 or to process data, for example to execute the speech synthesis program 01.
  • the communication bus 13 is used to realize the connection and communication between these components.
  • the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.
  • the device 1 may also include a user interface.
  • the user interface may include a display (Display) and an input unit such as a keyboard (Keyboard).
  • the optional user interface may also include a standard wired interface and a wireless interface.
  • The display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, etc.
  • the display can also be called a display screen or a display unit as appropriate, for displaying information processed in the speech synthesis device 1 and for displaying a visualized user interface.
  • FIG. 4 only shows the speech synthesis device 1 with the components 11-14 and the speech synthesis program 01. Those skilled in the art can understand that the structure shown in FIG. 4 does not constitute a limitation on the speech synthesis device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the speech synthesis program 01 is stored in the memory 11; when the processor 12 executes the speech synthesis program 01 stored in the memory 11, the following steps are implemented:
  • Step 1: Receive the speech data of the source speaker, convert the speech data of the source speaker into text content, and convert the text content into a text vector.
  • This application uses a text embedding module to convert Chinese characters in the text content into text vectors.
  • This application uses the text embedding module to perform word segmentation on the Chinese characters in the input text content, and then transliterates the resulting segments into Chinese pinyin with tones (using the digits 1-5 to represent the four tones and the neutral tone of Mandarin); for example, the segmented word "您好" is converted to "nin2hao3".
  • Further, this application uses one-hot encoding to convert the pinyin letters and tone digits in the transliterated pinyin into one-dimensional text vectors, which are then arranged in time order into a two-dimensional text vector, as shown in FIG. 2.
  • Step 2: Convert the text vector into the Mel spectrogram of the source speaker.
  • In a preferred embodiment of this application, the text vector is input into a Mel spectrogram generation module, which converts the text vector into the Mel spectrogram of the source speaker.
  • The Mel spectrogram generation module receives the text vector passed from the text embedding module and uses a trained sequence-to-sequence neural network model to convert the text vector into the Mel spectrogram of the source speaker.
  • The trained sequence-to-sequence neural network model described in this application adopts the Tacotron architecture and was trained on a non-public speech database.
  • The speech database contains about 30 hours of speech files recorded by one female speaker (i.e. the source speaker) in a quiet environment with dedicated recording equipment, together with the text file corresponding to each utterance. After being mapped by the trained sequence-to-sequence neural network model, the input text vector is converted into the Mel spectrogram of the source speaker.
  • The Mel spectrogram is a spectrogram based on Mel frequency cepstral coefficient (MFCC) features.
  • To obtain the Mel frequency cepstral coefficient features, this application first applies a pre-emphasis filter to boost the high-frequency components and the signal-to-noise ratio, using the formula y(t) = x(t) - αx(t-1), where α is generally set to 0.97.
  • The pre-emphasis filter yields the signal output y(t) at time t with the high-frequency components and signal-to-noise ratio boosted; a short-time Fourier transform is then performed.
  • To simulate the human ear's suppression of high-frequency signals, this application processes the linear spectrum produced by the short-time Fourier transform with a filter bank composed of multiple triangular filters, obtaining low-dimensional features that emphasize the low-frequency part and attenuate the high-frequency part, thereby yielding the Mel frequency cepstral coefficient features.
  • Preferably, to prevent energy leakage, the preferred embodiment of this application applies a Hanning window function before performing the Fourier transform.
  • The Hanning window can be regarded as the sum of the spectra of three rectangular time windows, or as the sum of three sin(t)-type functions, where the two bracketed terms are shifted left and right by π/T relative to the first spectral window, so that the side lobes cancel each other out, suppressing high-frequency interference and leakage energy.
  • Step 3: Obtain the speech signal of the target speaker, and convert the speech signal of the target speaker into the Mel frequency cepstral coefficient features of the target speaker.
  • Step 4: Input the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, take the target Mel spectrogram as the training value and the Mel frequency cepstral coefficient features of the target speaker as the label value, and input them into a loss function; when the loss value output by the loss function is greater than or equal to the preset threshold, transform and adjust the target Mel spectrogram until the loss value output by the loss function is less than the preset threshold, and then output the target Mel spectrogram as the Mel spectrogram of the target speaker.
  • The spectral feature conversion model described in this application includes a convolutional neural network (CNN) model and a bidirectional LSTM-based recurrent neural network (RNN) model.
  • This application passes the Mel spectrogram of the source speaker through one layer of a pre-trained convolutional neural network for compression along the time axis, so that the features in the Mel spectrogram are better represented. The processed Mel spectrogram is then divided into frames according to the time sequence, the Mel frequency cepstral coefficient features of each frame are combined with the identity feature of the target speaker, and the result is input into a two-layer bidirectional LSTM-based recurrent neural network for processing.
  • The bidirectional LSTM recurrent neural network converts the Mel spectrogram of the source speaker into the target Mel spectrogram frame by frame. Further, this application takes the converted target Mel spectrogram as the training value and the Mel frequency cepstral coefficient features of the target speaker obtained in step 3 above as the label value, and inputs them into a loss function. When the loss value output by the loss function is greater than or equal to the preset threshold, the target Mel spectrogram is transformed and adjusted; once the loss value is less than the preset threshold, the target Mel spectrogram is output as the Mel spectrogram of the target speaker.
  • the structure of the spectral feature conversion model is shown in FIG. 3.
  • the convolutional neural network and the recurrent neural network based on bidirectional LSTM are also trained using a private speech data set.
  • The speech data set contains recordings of N female speakers (preferably, N is 10), each with about 1 hour of speech, and the text content recorded by the 10 speakers is identical.
  • One of the female speakers also recorded the speech database used to train the sequence-to-sequence neural network model above; this speaker is therefore taken as the source speaker.
  • The remaining nine speakers are regarded as target speakers and are assigned the ID numbers 1-9. During training of the convolutional neural network and the bidirectional LSTM-based recurrent neural network, and during subsequent inference, this number is embedded into the corresponding Mel frequency cepstral coefficient features as the target speaker identity vector.
  • The convolutional neural network is a feed-forward neural network whose artificial neurons respond to a portion of the surrounding units within their coverage area. Its basic structure includes two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, from which local features are extracted; once a local feature is extracted, its positional relationship to other features is also determined. The second is the feature mapping layer: each computing layer of the network is composed of multiple feature maps, each feature map is a plane, and all neurons on a plane share equal weights.
  • The feature mapping structure uses a sigmoid function with a small influence-function kernel as the activation function of the convolutional network, which makes the feature mapping shift-invariant.
  • Each convolutional layer in the convolutional neural network is followed by a computing layer for local averaging and secondary feature extraction; this two-stage feature extraction structure reduces the feature resolution.
  • Input layer: the only data input port of the entire convolutional neural network, mainly used to define different types of data input.
  • Convolutional layer: convolves the data fed into it and outputs the convolved feature map.
  • Down-sampling layer: the pooling layer down-samples the incoming data in the spatial dimensions, so that the length and width of the input feature map are halved.
  • Fully connected layer: the same as in an ordinary neural network; each neuron is connected to all input neurons, and the result is then passed through an activation function.
  • Output layer: also called the classification layer; the final output computes the classification score of each category.
  • The input to this network is the Mel spectrogram of the source speaker, which passes sequentially through a 7*7 convolutional layer and a 3*3 max-pooling layer, and then enters 4 convolution modules.
  • Each convolution module starts with a building block with a linear projection, followed by a varying number of building blocks with identity mapping, and the softmax layer finally outputs a Mel spectrogram compressed along the time axis.
  • The recurrent neural network is typically used to model dynamic sequence data; it dynamically adjusts its own network state as time progresses and performs this recurrent transmission continuously.
  • In a traditional neural network, neurons pass from the input layer to the hidden layer and then from the hidden layer to the output layer; the layers are fully or locally connected, and the feature information produced during the computation of one layer is lost as the data is passed on.
  • The RNN differs from the traditional neural network model in that the current output of a sequence is also related to the previous outputs: the network memorizes earlier information and applies it to the computation of the current output. In other words, the nodes between the hidden layers are no longer unconnected but linked, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous time step.
  • The Mel frequency cepstral coefficient features, framed according to the time sequence, are input into the two-layer LSTM-based recurrent neural network model, and gradient descent is used to minimize the loss function.
  • The loss function is used to evaluate the difference between the predicted value Ŷ output by the network model and the true value Y. It is denoted here by L(Y, Ŷ); it is a non-negative real-valued function, and the smaller the loss value, the better the performance of the network model.
  • Let w_i denote the weight of the i-th neuron, x_i the i-th neuron of the l-th layer of the network, and C_j the output value of each unit of the output layer; according to this input-output relationship, the mean squared error (MSE) is used to establish the loss function L = (1/n) Σ_{i=1}^{n} (Y_i - Ŷ_i)², where Y_i is the correct answer for the i-th item of data in a batch and Ŷ_i is the predicted value given by the neural network.
  • The activation function used here satisfies the sparsity found in biology: a neuron node is activated only when its input exceeds a certain value, inputs below 0 are suppressed, and above that value the independent variable and the dependent variable have a linear relationship.
  • The preferred embodiment of this application uses a gradient descent algorithm to minimize the loss function.
  • The gradient descent algorithm is the most commonly used optimization algorithm for training neural network models.
  • To find the minimum of the loss, the variable y needs to be updated in the direction opposite to the gradient vector, i.e. along -dL/dy, so that the loss decreases fastest until it converges to a minimum.
  • Further, this application uses the Softmax function to produce the classification label.
  • Softmax is a generalization of logistic regression: logistic regression handles binary classification problems, and its generalization, Softmax regression, handles multi-class classification problems. According to the input Mel frequency cepstral coefficient features, the category with the maximum output probability among all categories is obtained through this activation function.
  • The core formula is p(k) = exp(x_k) / Σ_{j=1}^{K} exp(x_j), where there are K categories in total, x_k represents the sample of category k, and x_j represents the sample of category j; the target Mel spectrogram is thereby obtained.
  • Step 5: Convert the Mel spectrogram of the target speaker into speech corresponding to the text content and output it.
  • The preferred embodiment of this application uses a speech generation module to synthesize the Mel spectrogram of the target speaker into speech.
  • The speech generation module is used to process the Mel spectrogram and generate speech of high fidelity and high naturalness.
  • After obtaining the Mel spectrogram of the target speaker, this application uses the speech generation module with the Mel spectrogram as the conditioning input to generate the target speaker's speech.
  • The speech generation module uses a vocoder called WaveNet. When the Mel spectrograms of different target speakers are input, the vocoder can generate high-fidelity speech of the different target speakers according to the Mel spectrograms.
  • The WaveNet vocoder used in the preferred embodiment of this application is also trained on a non-public speech data set, which is the same data set used for training the convolutional neural network.
  • WaveNet is an end-to-end TTS (text-to-speech) model whose main concept is causal convolution.
  • Causal convolution means that when WaveNet generates the element at time t, it can only use the element values from time 0 to t-1. Since an audio file is a one-dimensional array over time, a file with a sampling rate of 16 kHz contains 16,000 elements per second, while the receptive field of plain causal convolution is very small and remains limited even when many layers are stacked.
  • WaveNet therefore uses stacked multi-layer dilated convolutions to enlarge the receptive field of the network, so that when the network generates the next element it can use more of the preceding element values.
  • The speech synthesis program 01 may also be divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to complete this application.
  • The module referred to in this application is a series of computer program instruction segments capable of completing a specific function, and is used to describe the execution process of the speech synthesis program in the speech synthesis device.
  • Referring to FIG. 5, it is a schematic diagram of the program modules of the speech synthesis program in an embodiment of the speech synthesis device of this application.
  • Exemplarily, the speech synthesis program can be divided into a text embedding module 10, a Mel spectrogram generation module 20, a spectral feature conversion module 30, and a speech generation module 40:
  • The text embedding module 10 is configured to receive the speech data of the source speaker, convert the speech data of the source speaker into text content, and convert the text content into a text vector.
  • Specifically, the text embedding module 10 is configured to perform word segmentation on the Chinese characters in the text content, transliterate the resulting segments into tonal Chinese pinyin, and use one-hot encoding to convert the pinyin letters and tone digits in the transliterated pinyin into one-dimensional text vectors, which are then arranged in time order into a two-dimensional text vector.
  • The Mel spectrogram generation module 20 is used to convert the text vector into the Mel spectrogram of the source speaker.
  • Specifically, the Mel spectrogram generation module 20 uses a trained sequence-to-sequence neural network model to convert the two-dimensional text vector into the Mel spectrogram of the source speaker, wherein the trained sequence-to-sequence neural network model adopts the Tacotron architecture and is trained on a preset speech database, and the preset speech database contains speech files recorded by multiple speakers in a quiet environment with recording equipment and a text file corresponding to each utterance.
  • The spectral feature conversion module 30 is used to obtain the speech signal of the target speaker, convert the speech signal of the target speaker into the Mel frequency cepstral coefficient features of the target speaker, and input the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram. The target Mel spectrogram is taken as a training value and the Mel frequency cepstral coefficient features of the target speaker are input as a label value into a loss function; when the loss value output by the loss function is greater than or equal to a preset threshold, the target Mel spectrogram is transformed and adjusted until the loss value output by the loss function is less than the preset threshold, after which the target Mel spectrogram is output as the Mel spectrogram of the target speaker.
  • Specifically, the spectral feature conversion module 30 passes the Mel spectrogram of the source speaker through the pre-trained convolutional neural network for compression along the time axis and divides the time-compressed Mel spectrogram into frames according to the time sequence. The Mel frequency cepstral coefficient features of each frame, together with the identity feature of the target speaker, are input into the recurrent neural network for processing, and the recurrent neural network converts the Mel frequency cepstral coefficient features of the source speaker into the Mel frequency cepstral coefficient features of the target speaker to obtain the training value.
  • The speech generation module 40 is used to convert the Mel spectrogram of the target speaker into speech corresponding to the text content and output it. An end-to-end sketch of how these modules could be chained is given below.
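  • The following is a hedged end-to-end sketch of chaining the four modules. The object and method names are hypothetical placeholders for the text embedding module 10, the Mel spectrogram generation module 20, the spectral feature conversion module 30, and the speech generation module 40; they are not an API defined in this application.

```python
def synthesize(source_audio, target_speaker_id: int,
               text_embedder, mel_generator, spectral_converter, vocoder):
    text = text_embedder.speech_to_text(source_audio)        # module 10: speech to text
    text_vector = text_embedder.encode(text)                 # pinyin one-hot text vector
    source_mel = mel_generator.text_to_mel(text_vector)      # module 20: Tacotron-style
    target_mel = spectral_converter.convert(source_mel,      # module 30: CNN + BiLSTM
                                            speaker_id=target_speaker_id)
    return vocoder.generate(target_mel)                      # module 40: WaveNet vocoder
```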
  • In addition, an embodiment of this application also proposes a computer-readable storage medium on which a speech synthesis program is stored, and the speech synthesis program can be executed by one or more processors to implement the operations of the speech synthesis method described above, including converting the Mel spectrogram of the target speaker into speech corresponding to the text content and outputting it.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the technical field of artificial intelligence. Disclosed is a speech synthesis method. The method comprises: converting speech data of a source speaker into text content, and converting the text content into a text vector; converting the text vector into a Mel spectrogram of the source speaker; acquiring a speech signal of a target speaker, and converting the speech signal of the target speaker into a Mel frequency cepstrum coefficient feature of the target speaker; inputting the Mel frequency cepstrum coefficient feature of the source speaker and the Mel frequency cepstrum coefficient feature of the target speaker into a trained spectral feature conversion model to obtain a Mel spectrogram of the target speaker; and converting the Mel spectrogram of the target speaker into a speech corresponding to the text content and outputting the speech. The present application also provides a speech synthesis apparatus and a computer readable storage medium. The present application can realize the tone conversion of a speech synthesis system.

Description

Speech synthesis method, device and computer-readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on May 22, 2019, with application number 201910438778.3 and entitled "Speech synthesis method, device and computer-readable storage medium", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to a speech synthesis method, device and computer-readable storage medium.
Background
With the development of technology, computers can already speak through speech synthesis systems, and ordinary users can easily understand and accept this. However, existing computers that can speak usually speak in only one mode or with one voice, whereas end users often have higher requirements; for example, a user may want the computer to read aloud in the user's own voice. In such cases, existing computers clearly can no longer meet this demand.
Summary of the invention
This application provides a speech synthesis method, device, and computer-readable storage medium, the main purpose of which is to provide a solution that can realize timbre conversion in a speech synthesis system.
To achieve the above objective, a speech synthesis method provided by this application includes: receiving the speech data of a source speaker, converting the speech data of the source speaker into text content, and converting the text content into a text vector; converting the text vector into a Mel spectrogram of the source speaker; obtaining the speech signal of a target speaker, and converting the speech signal of the target speaker into Mel frequency cepstral coefficient features of the target speaker; inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, taking the target Mel spectrogram as a training value and inputting the Mel frequency cepstral coefficient features of the target speaker as a label value into a loss function, and when the loss value output by the loss function is greater than or equal to a preset threshold, transforming and adjusting the target Mel spectrogram until the loss value output by the loss function is less than the preset threshold, then outputting the target Mel spectrogram as the Mel spectrogram of the target speaker; and converting the Mel spectrogram of the target speaker into speech corresponding to the text content and outputting it.
In addition, to achieve the above objective, this application also provides a speech synthesis device, which includes a memory and a processor. The memory stores a speech synthesis program that can run on the processor, and when the speech synthesis program is executed by the processor, the following steps are implemented: receiving the speech data of the source speaker, converting the speech data of the source speaker into text content, and converting the text content into a text vector; converting the text vector into a Mel spectrogram of the source speaker; obtaining the speech signal of the target speaker, and converting the speech signal of the target speaker into Mel frequency cepstral coefficient features of the target speaker; inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, taking the target Mel spectrogram as a training value and inputting the Mel frequency cepstral coefficient features of the target speaker as a label value into a loss function, and when the loss value output by the loss function is greater than or equal to a preset threshold, transforming and adjusting the target Mel spectrogram until the loss value output by the loss function is less than the preset threshold, then outputting the target Mel spectrogram as the Mel spectrogram of the target speaker; and converting the Mel spectrogram of the target speaker into speech corresponding to the text content and outputting it.
In addition, to achieve the above objective, this application also provides a computer-readable storage medium on which a speech synthesis program is stored, and the speech synthesis program can be executed by one or more processors to implement the steps of the speech synthesis method described above.
The speech synthesis method, device, and computer-readable storage medium proposed in this application use a pre-trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into the Mel spectrogram of the target speaker, so that text content that would otherwise be output in the source speaker's timbre is instead output in the target speaker's timbre, realizing timbre conversion in the speech synthesis system.
Description of the drawings
FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of this application;
FIG. 2 is a schematic diagram of converting text content into a text vector in a speech synthesis method provided by an embodiment of this application;
FIG. 3 is a schematic structural diagram of the spectral feature conversion model in a speech synthesis method provided by an embodiment of this application;
FIG. 4 is a schematic diagram of the internal structure of a speech synthesis device provided by an embodiment of this application;
FIG. 5 is a schematic diagram of the modules of the speech synthesis program in a speech synthesis device provided by an embodiment of this application.
The realization of the objectives, functional characteristics, and advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description
It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
This application provides a speech synthesis method. Referring to FIG. 1, it is a schematic flowchart of a speech synthesis method provided by an embodiment of this application. The method can be executed by a device, and the device can be implemented by software and/or hardware.
In this embodiment, the speech synthesis method includes:
S1. Receive the speech data of the source speaker, convert the speech data of the source speaker into text content, and convert the text content into a text vector.
This application uses a text embedding module to convert the Chinese characters in the text content into a text vector.
This application uses the text embedding module to perform word segmentation on the Chinese characters in the input text content, and then transliterates the resulting segments into Chinese pinyin with tones (using the digits 1-5 to represent the four tones and the neutral tone of Mandarin); for example, the segmented word "您好" is converted to "nin2hao3".
Further, this application uses one-hot encoding to convert the pinyin letters and tone digits in the transliterated pinyin into one-dimensional text vectors, which are then arranged in time order into a two-dimensional text vector, as shown in FIG. 2.
S2、将所述文本向量转化为源说话人的梅尔语谱图。S2. Convert the text vector into the Mel language spectrogram of the source speaker.
本申请较佳实施例通过将所述文本向量输入到一个梅尔语谱生成模块中,将所述文本向量转化为源说话人的梅尔语谱图。In a preferred embodiment of the present application, the text vector is input into a Mel spectrum generation module to convert the text vector into the Mel spectrum map of the source speaker.
本申请所述梅尔语谱生成模块接收所述文本嵌入模块传递来的文本向量,并利用经过训练的序列到序列的神经网络模型,将所述文本向量转化为源说话人的梅尔语谱图。The Mel language spectrum generation module of this application receives the text vector passed by the text embedding module, and uses the trained sequence-to-sequence neural network model to convert the text vector into the Mel language spectrum of the source speaker Figure.
本申请所述经过训练的序列到序列的神经网络模型采用Tacotron架构,并使用了一份不公开的语音数据库进行训练。该语音数据库包含了一位女性说话人(即源说话人)在安静环境下,用专用录音设备录制的总时长约30个小时的语音文件,以及每条语音所对应的文本文件。输入的文本向量经过训练过的序列到序列的神经网络模型映射之后,会被转换为源说话人的梅尔语谱图。The trained sequence-to-sequence neural network model described in this application adopts the Tacotron architecture and uses an undisclosed speech database for training. The voice database contains a female speaker (ie the source speaker) in a quiet environment, using a special recording device to record a total of about 30 hours of voice files, and the text file corresponding to each voice. After the input text vector is mapped from the trained sequence to the sequence neural network model, it will be converted into the Mel language spectrogram of the source speaker.
所述梅尔语谱图是一种基于梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC)特征的频谱图。为获得所述梅尔频率倒谱系数特征,本申请首先使用Preemphasis滤波器提高高频信号和信噪比,其公式为:y(t)=x(t)-αx(t-1),式中x为信号输入,y为信号输出,x(t)为t时刻的信号,x(t-1)为(t-1)的信号,α一般取0.97。根据所述Preemphasis滤波器得到提高了高频信号和信噪比之后的t时刻的信号输出y(t)。接着进行短时傅里叶变换。为了模拟人耳对高频信号的抑制,本申请利用一组包含多个三角滤波器的滤波组件(filterbank)对经过短时傅里叶变换的线性谱进行处理得到低维特征,并强调低频部分,弱化高频部分,从而得到所述梅尔频率倒谱系数特征。The Mel-language spectrogram is a spectrogram based on Mel Frequency Cepstrum Coefficient (MFCC) features. In order to obtain the characteristic of the Mel frequency cepstrum coefficient, this application first uses a Preemphasis filter to improve the high-frequency signal and the signal-to-noise ratio. The formula is: y(t)=x(t)-αx(t-1), Where x is the signal input, y is the signal output, x(t) is the signal at time t, x(t-1) is the signal at (t-1), and α is generally 0.97. According to the Preemphasis filter, the signal output y(t) at time t after the high-frequency signal and the signal-to-noise ratio are improved. Then perform short-time Fourier transform. In order to simulate the suppression of high-frequency signals by human ears, this application uses a set of filter banks containing multiple triangular filters to process the linear spectrum after short-time Fourier transform to obtain low-dimensional features and emphasize the low-frequency part. , Weaken the high frequency part, so as to obtain the Mel frequency cepstrum coefficient characteristic.
优选地,在进行傅里叶变换前,为了防止能量泄露本申请较佳实施例会使用汉宁窗函数。所述汉宁窗可以看作是3个矩形时间窗的频谱之和,或者说是3个sin(t)型函数之和,而括号中的两项相对于第一个谱窗向左、右各移动了π/T,从而使旁瓣互相抵消,消去高频干扰和漏能。Preferably, in order to prevent energy leakage, the preferred embodiment of the present application will use the Hanning window function before performing the Fourier transform. The Hanning window can be regarded as the sum of the spectrum of 3 rectangular time windows, or the sum of 3 sin(t)-type functions, and the two items in brackets are left and right relative to the first spectral window. Each moved by π/T, so that the side lobes cancel each other out, eliminating high-frequency interference and leakage energy.
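The feature-extraction chain described above (pre-emphasis, Hann-windowed short-time Fourier transform, triangular mel filterbank) can be sketched as follows; the sampling rate, FFT size, hop length and number of mel bands are illustrative assumptions rather than values taken from the application.

```python
import numpy as np
import librosa

def mel_features(signal: np.ndarray, sr: int = 16000, alpha: float = 0.97,
                 n_fft: int = 1024, hop: int = 256, n_mels: int = 80) -> np.ndarray:
    """Pre-emphasis -> Hann-windowed STFT -> triangular mel filterbank -> log."""
    # Pre-emphasis filter y(t) = x(t) - alpha * x(t-1): boosts the high-frequency part.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Short-time Fourier transform with a Hann window to suppress energy leakage.
    power = np.abs(librosa.stft(emphasized, n_fft=n_fft, hop_length=hop,
                                window="hann")) ** 2
    # Triangular filters spaced on the mel scale emphasize the low-frequency part.
    fbank = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.log(fbank @ power + 1e-6)          # (n_mels, frames)
```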
S3、获取目标说话人的语音信号,并将所述目标说话人的语音信号转换为目标说话人的梅尔频率倒谱系数特征。S3. Acquire the voice signal of the target speaker, and convert the voice signal of the target speaker into the Mel frequency cepstrum coefficient feature of the target speaker.
S4、将所述源说话人的梅尔语谱图输入至一个经过训练的语谱特征转换模型中,以将所述源说话人的梅尔语谱图转换为目标梅尔语谱图,并将所述目标梅尔语谱图作为训练值以及将所述目标说话人的梅尔频率倒谱系数特征作为标签值输入至一个损失函数中,当所述损失函数输出的损失值大于或等于预设阈值时,对所述目标梅尔语谱图进行变换调整,直到所述损失函数输 出的损失值小于所述预设阈值时,将所述目标梅尔语谱图作为所述目标说话人的梅尔语谱图输出。S4. Input the Mel spectrogram of the source speaker into a trained spectrogram feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, and The target Mel language spectrogram is used as the training value and the Mel frequency cepstral coefficient feature of the target speaker is input into a loss function as a label value. When the loss value output by the loss function is greater than or equal to the expected value When the threshold is set, the target Mel language spectrogram is transformed and adjusted, until the loss value output by the loss function is less than the preset threshold, the target Mel language spectrogram is used as the target speaker's Mel language spectrogram output.
The spectral feature conversion model described in this application includes a convolutional neural network (Convolutional Neural Network, CNN) model and a recurrent neural network (Recurrent Neural Network, RNN) model based on a bidirectional LSTM. In this application, the mel spectrogram of the source speaker is compressed in time by a layer of pre-trained convolutional neural network so that the features in the mel spectrogram are better represented. The processed mel spectrogram is divided into frames in time order, the mel-frequency cepstral coefficient features of each frame are combined with the identity feature of the target speaker, and the result is fed into a two-layer recurrent neural network based on a bidirectional LSTM, which converts the mel spectrogram of the source speaker into the target mel spectrogram frame by frame. Further, this application takes the converted target mel spectrogram as the training value and the mel-frequency cepstral coefficient features of the target speaker obtained in step S3 as the label value, and inputs them into a loss function. When the loss value output by the loss function is greater than or equal to the preset threshold, the target mel spectrogram is transformed and adjusted; once the loss value output by the loss function is less than the preset threshold, the target mel spectrogram is output as the mel spectrogram of the target speaker.
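A minimal PyTorch sketch of a conversion model of this kind (temporal compression by a convolutional layer, a speaker-identity embedding appended to every frame, and a two-layer bidirectional LSTM producing the target mel frames) is given below. The layer sizes and module names are assumptions for illustration; they are not the configuration actually used by the application.

```python
import torch
import torch.nn as nn

class SpectralConversionModel(nn.Module):
    """Sketch: a convolution compresses the source mel frames in time, a
    speaker-identity embedding is appended to every frame, and a two-layer
    bidirectional LSTM maps the result to target mel frames."""
    def __init__(self, n_mels: int = 80, n_speakers: int = 9,
                 spk_dim: int = 16, hidden: int = 256):
        super().__init__()
        self.compress = nn.Conv1d(n_mels, n_mels, kernel_size=3, stride=2, padding=1)
        self.spk_embed = nn.Embedding(n_speakers, spk_dim)
        self.rnn = nn.LSTM(n_mels + spk_dim, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, mel: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames); speaker_id: (batch,) with values 0..n_speakers-1
        x = self.compress(mel).transpose(1, 2)            # (batch, frames', n_mels)
        spk = self.spk_embed(speaker_id).unsqueeze(1)     # (batch, 1, spk_dim)
        x = torch.cat([x, spk.expand(-1, x.size(1), -1)], dim=-1)
        out, _ = self.rnn(x)
        return self.proj(out)                             # predicted target mel frames
```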
本申请较佳实施例中,所述语谱特征转换模型的结构如图3所示。In a preferred embodiment of the present application, the structure of the spectral feature conversion model is shown in FIG. 3.
所述卷积神经网络以及基于双向LSTM的循环神经网络也使用了一个非公开的语音数据集进行了训练。该语音数据集包含了N位(较佳的,N为10)位女性说话人的录音(每位说话人都有时长约1小时语音文件),并且10位说话人所录制的文本内容都是相同的。其中有一位女性说话人也录制了上述训练的序列到序列的神经网络模型所用的语音数据库。因此该位说话人被作为源说话人。而其余九位说话人则被当作目标说话人,并分别给予1-9的身份编号。该编号将在所述卷积神经网络以及基于双向LSTM的循环神经网络训练以及之后推理时,作为目标说话人身份向量嵌入相对应的梅尔频率倒谱系数特征中。The convolutional neural network and the recurrent neural network based on bidirectional LSTM are also trained using a private speech data set. The voice data set contains recordings of N (preferably, N is 10) female speakers (each speaker has a voice file of about 1 hour in length), and the text content recorded by 10 speakers is all identical. One of the female speakers also recorded the speech database used by the sequence-to-sequence neural network model trained above. Therefore, the speaker is taken as the source speaker. The remaining nine speakers were regarded as target speakers and were given ID numbers 1-9. This number will be embedded in the corresponding Mel frequency cepstral coefficient feature as the target speaker identity vector during the training of the convolutional neural network and the bidirectional LSTM-based recurrent neural network and subsequent inferences.
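The embedding of the identity numbers 1-9 into the per-frame features can be pictured with the short sketch below; the embedding dimension and feature sizes are arbitrary illustrative choices, not values from the application.

```python
import torch
import torch.nn as nn

spk_embed = nn.Embedding(9, 16)        # identity numbers 1-9 mapped to 16-dim vectors (assumed size)
mfcc = torch.randn(1, 120, 13)         # (batch, frames, MFCC dims) - toy values
spk = spk_embed(torch.tensor([4]))     # target speaker with identity number 5 (0-indexed here)
frames_with_id = torch.cat(
    [mfcc, spk.unsqueeze(1).expand(-1, mfcc.size(1), -1)], dim=-1)   # (1, 120, 29)
```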
所述卷积神经网络是一种前馈神经网络,它的人工神经元可以响应一部分覆盖范围内的周围单元,其基本结构包括两层,其一为特征提取层,每个神经元的输入与前一层的局部接受域相连,并提取该局部的特征。一旦该局部特征被提取后,它与其它特征间的位置关系也随之确定下来;其二是特征映射层,网络的每个计算层由多个特征映射组成,每个特征映射是一个平面,平面上所有神经元的权值相等。特征映射结构采用影响函数核小的sigmoid函数作为卷积网络的激活函数,使得特征映射具有位移不变性。此外,由于一个映射面上的神经元共享权值,因而减少了网络自由参数的个数。卷积神经网络中的每一个卷积层都紧跟着一个用来求局部平均与二次提取的计算层,这种特有的两次特征提取结构减小了特征分辨率。The convolutional neural network is a feed-forward neural network. Its artificial neurons can respond to a part of the surrounding units in the coverage area. Its basic structure includes two layers. One is the feature extraction layer. The input of each neuron is The local receptive fields of the previous layer are connected, and the local features are extracted. Once the local feature is extracted, the positional relationship between it and other features is also determined; the second is the feature mapping layer, each computing layer of the network is composed of multiple feature maps, and each feature map is a plane. The weights of all neurons on the plane are equal. The feature mapping structure uses a sigmoid function with a small influencing function core as the activation function of the convolutional network, which makes the feature mapping displacement invariant. In addition, since neurons on a mapping plane share weights, the number of free parameters of the network is reduced. Each convolutional layer in the convolutional neural network is followed by a calculation layer for local averaging and secondary extraction. This unique two-feature extraction structure reduces the feature resolution.
输入层:输入层是整个卷积神经网络唯一的数据输入口,主要用于定义不同类型的数据输入。Input layer: The input layer is the only data input port of the entire convolutional neural network, which is mainly used to define different types of data input.
卷积层:对输入卷积层的数据进行卷积操作,输出卷积后的特征图。Convolutional layer: convolve the data of the input convolutional layer, and output the convolutional feature map.
下采样层(Pooling层):Pooling层对传入数据在空间维度上进行下采样 操作,使得输入的特征图的长和宽变为原来的一半。Down-sampling layer (Pooling layer): The pooling layer performs down-sampling operations on the incoming data in spatial dimensions, so that the length and width of the input feature map become half of the original.
全连接层:全连接层和普通神经网络一样,每个神经元都与输入的所有神经元相互连接,然后经过激活函数进行计算。Fully connected layer: The fully connected layer is the same as an ordinary neural network. Each neuron is connected to all input neurons, and then calculated through an activation function.
输出层:输出层也被称为分类层,在最后输出时会计算每一类别的分类分值。Output layer: The output layer is also called the classification layer, and the classification score of each category will be calculated in the final output.
在本申请实施例中,输入层为源说话人梅尔语谱图,该梅尔语谱图依次进入一个7*7的卷积层,3*3的最大值池化层,随后进入4个卷积模块。每个卷积模块从具有线性投影的构建块开始,随后是具有本体映射的不同数量的构建块,最后在softmax层输出经过时序压缩的梅尔语谱。In the embodiment of this application, the input layer is the source speaker Mel's spectrogram, which sequentially enters a 7*7 convolutional layer, a 3*3 maximum pooling layer, and then enters 4 Convolution module. Each convolution module starts with a building block with linear projection, followed by a different number of building blocks with ontology mapping, and finally outputs a time-sequentially compressed Mel language spectrum in the softmax layer.
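Read literally, the stack just described might be sketched as follows in PyTorch; the channel widths, the number of building blocks per module and the exact residual form of the blocks are assumptions, since the application only names a 7*7 convolution, a 3*3 max-pooling layer and four convolution modules built from linear-projection and identity-mapping blocks.

```python
import torch.nn as nn

class Block(nn.Module):
    """Building block: 3x3 convolutions plus a shortcut; a 1x1 linear projection
    is used when the channel count changes, otherwise an identity mapping."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1))
        self.short = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.short(x))

class MelEncoder(nn.Module):
    """7x7 convolution, 3x3 max pooling, then four convolution modules."""
    def __init__(self, widths=(64, 128, 256, 512)):
        super().__init__()
        layers = [nn.Conv2d(1, widths[0], kernel_size=7, stride=2, padding=3),
                  nn.MaxPool2d(kernel_size=3, stride=2, padding=1)]
        in_ch = widths[0]
        for w in widths:
            # Two blocks per module: projection when the channels change, identity otherwise.
            layers += [Block(in_ch, w), Block(w, w)]
            in_ch = w
        self.body = nn.Sequential(*layers)

    def forward(self, mel):                    # mel: (batch, 1, mel bins, frames)
        return self.body(mel)
```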
所述循环神经网络通常用于描述动态的序列数据,随着时间的变化而动态调整自身的网络状态,并且不断进行循环传递。在传统的神经网络模型中,神经元从输入层到隐藏层,再从隐藏层到输出层,层与层之间是全连接或者局部连接的方式,且在数据的传递中,会丢失上一层计算过程中产生的特征信息,而RNN所不同于传统神经网络模型的地方在于一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用与当前输出的计算中,即隐藏层之间的解点不再是无连接的而是有链接的,并且隐藏层的输出不仅包括输入层的输出,还包括上一时刻隐藏层的输出。The cyclic neural network is usually used to describe dynamic sequence data, dynamically adjust its own network state as time changes, and continuously perform cyclic transmission. In the traditional neural network model, neurons go from the input layer to the hidden layer, and then from the hidden layer to the output layer. The layers are fully connected or locally connected, and the last one will be lost in the transmission of data. The feature information generated in the layer calculation process, and the RNN is different from the traditional neural network model in that the current output of a sequence is also related to the previous output. The specific form is that the network will memorize the previous information and apply it to the calculation of the current output, that is, the solution points between the hidden layers are no longer disconnected but linked, and the output of the hidden layer includes not only the input The output of the layer also includes the output of the hidden layer at the previous moment.
在本申请实施例中,将利用时序进行分帧的梅尔频率倒谱系数特征输入到两层的基于LSTM的循环神经网络模型中,利用梯度下降法求解损失函数。In the embodiment of the present application, the Mel frequency cepstral coefficient feature for framing using time sequence is input into the two-layer LSTM-based cyclic neural network model, and the gradient descent method is used to solve the loss function.
In a neural network, the loss function is used to evaluate the difference between the predicted value Ŷ output by the network model and the true value Y. The loss function is denoted here by L(Y, Ŷ); it is a non-negative real-valued function, and the smaller the loss value, the better the performance of the network model. According to the basic neuron formula in deep learning, the input and output of each layer are z_i = W·s_(i-1) + U·x_i and C_i = f(z_i), where s_(i-1) is the output passed on from the previous layer, W is the link weight from the i-th neuron of layer l to the j-th neuron of layer l+1, U is the weight of the i-th neuron of layer l, x_i is the input of the i-th neuron of layer l, and C_i is the output value of each unit of the output layer. Based on this input-output formula, the mean squared error (MSE) is used to build the loss function L(Y, Ŷ) = (1/n)·Σ_i (Y_i - Ŷ_i)², where n is the batch size, Y_i is the correct answer for the i-th sample in a batch, and Ŷ_i is the predicted value given by the neural network. At the same time, in order to alleviate the vanishing-gradient problem, the ReLU function relu(x) = max(0, x) is selected as the activation function, where x is the input value of the neuron. This function satisfies the sparsity found in bionics: the neuron node is activated only when its input exceeds a certain value, the output is clamped when the input is below 0, and once the input rises above that threshold the output is a linear function of the input.
The preferred embodiment of the present application uses a gradient descent algorithm to solve the loss function. The gradient descent algorithm is the most commonly used optimization algorithm for training neural network models. To find the minimum of the loss function L(Y, Ŷ), the variable y must be updated along the direction opposite to the gradient vector, i.e. along -dL/dy, which makes the loss decrease fastest until it converges to the minimum. The parameter update formula is y = y - α·dL/dy, where α denotes the learning rate. In this way the final neural network parameters for recognizing the mel spectrogram can be obtained.
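A toy gradient-descent step on the MSE loss, with a ReLU activation and the stop-at-threshold behaviour described above, might look like the following; the single linear unit, the synthetic data and the threshold are placeholders, not the CNN/BiLSTM actually trained by the application.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def train_step(w, x, y_true, lr=0.05):
    """One gradient-descent update w <- w - lr * dL/dw on the MSE loss of a
    single linear unit followed by a ReLU (a toy stand-in for the real model)."""
    z = x @ w
    y_pred = relu(z)
    loss = np.mean((y_true - y_pred) ** 2)
    # Chain rule: dL/dw, with the ReLU gradient equal to 1 where z > 0 and 0 elsewhere.
    grad = x.T @ ((y_pred - y_true) * (z > 0)) * (2.0 / len(y_true))
    return w - lr * grad, loss

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 1))
x = rng.normal(size=(32, 4))
y = relu(x @ np.ones((4, 1)))             # synthetic "correct answers" for the batch
threshold = 1e-3                          # preset threshold in the sense used above
for _ in range(5000):
    w, loss = train_step(w, x, y)
    if loss < threshold:                  # stop adjusting once the loss is small enough
        break
```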
进一步地,本申请利用Softmax函数输入分类标签。Further, this application uses the Softmax function to input the classification label.
The Softmax is a generalization of logistic regression: logistic regression is used to handle binary classification problems, while its generalization, Softmax regression, is used to handle multi-class classification problems. According to the input mel-frequency cepstral coefficient features, the maximum of the output probabilities over all classes is obtained through this activation function, whose core formula is P(k|x) = exp(x_k) / Σ_(j=1..K) exp(x_j). Assuming there are K classes in total, x_k denotes the sample whose class is k and x_j denotes the sample whose class is j, and the target mel spectrogram is thereby obtained.
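The Softmax formula above can be written directly in a few lines of Python; the example scores are arbitrary.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """P(k | x) = exp(x_k) / sum_j exp(x_j); subtracting the max keeps exp() stable."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))   # three class probabilities summing to 1
```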
S5、将所述目标说话人的梅尔语谱图转换为所述文本内容对应的语音并输出。S5. Convert the Mel language spectrogram of the target speaker into a voice corresponding to the text content and output it.
本申请较佳实施例利用语音生成模块将目标说话人的梅尔语谱图合成为语音。The preferred embodiment of the present application uses a voice generation module to synthesize the Mel language spectrogram of the target speaker into voice.
语音生成模块用于处理梅尔语谱图并生成高保真以及高自然度的语音。本申请在获得了目标说话人的梅尔语谱图后,使用一个语音生成模块,把梅尔语谱图作为条件输入,生成目标说话人的语音。该语音生成模块采用了一种叫做WaveNet的声码器。当输入不同目标说话人的梅尔语谱图时,该声码器可以根据所述梅尔语谱图生成不同目标说话人的的高保真声音。The speech generation module is used to process Mel's spectrogram and generate high-fidelity and high-naturalness speech. This application uses a voice generation module after obtaining the Mel spectrogram of the target speaker, and uses the Mel spectrogram as a conditional input to generate the target speaker's voice. The speech generation module uses a vocoder called WaveNet. When inputting Mel spectrograms of different target speakers, the vocoder can generate high-fidelity sounds of different target speakers according to the Mel spectrograms.
本申请较佳实施例中所使用的WaveNet声码器,也是由一个非公开的语音数据集训练而成,该数据集与训练卷积神经网络所用的语音数据集为同一数据集。所述WaveNet是一个端到端的TTS(text to speech)模型,其主要概念是因果卷积,所谓因果卷积的意义就是WaveNet在生成t时刻的元素时,只能使用0到t-1时刻的元素值。由于声音文件是时间上的一维数组,16KHz的采样率的文件,每秒钟就会有16000个元素,而上面所说的因果卷积的感受野非常小,即使堆叠很多层也只能使用到很少的数据来生成t时刻的的元素,为了扩大卷积的感受野,WaveNet采用了堆叠了多层带洞卷积来增到网络的感受野,使得网络生成下一个元素的时候,能够使用更多之前的元素数值。The WaveNet vocoder used in the preferred embodiment of the present application is also trained from a non-public speech data set, which is the same data set used for training the convolutional neural network. The WaveNet is an end-to-end TTS (text to speech) model. Its main concept is causal convolution. The meaning of the so-called causal convolution is that when WaveNet generates elements at time t, it can only use time from 0 to t-1. Element value. Since the sound file is a one-dimensional array in time, a file with a sampling rate of 16KHz will have 16,000 elements per second, and the receptive field of the causal convolution mentioned above is very small, and it can only be used even if many layers are stacked. To generate the element at time t with very little data, in order to expand the receptive field of convolution, WaveNet uses stacked multi-layer convolution with holes to increase the receptive field of the network, so that when the network generates the next element, it can Use more previous element values.
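The dilated causal convolution that this paragraph attributes to WaveNet can be sketched as follows in PyTorch; the channel count, kernel size and number of layers are illustrative assumptions, and this is not the vocoder actually used by the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """Causal convolution with dilation: the output at time t only depends on
    inputs at times 0..t, and stacking layers with dilations 1, 2, 4, ... grows
    the receptive field exponentially."""
    def __init__(self, channels: int, kernel_size: int = 2, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                          # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))  # left padding only keeps it causal

# Eight layers with doubling dilation give a receptive field of 256 time steps.
stack = nn.Sequential(*[CausalDilatedConv1d(64, dilation=2 ** i) for i in range(8)])
out = stack(torch.randn(1, 64, 16000))             # one second of 16 kHz features (toy)
```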
本申请还提供一种语音合成装置。参照图4所示,为本申请一实施例提供的语音合成装置的内部结构示意图。The application also provides a speech synthesis device. Referring to FIG. 4, it is a schematic diagram of the internal structure of a speech synthesis device provided by an embodiment of this application.
在本实施例中,语音合成装置1可以是PC(Personal Computer,个人电脑),也可以是智能手机、平板电脑、便携计算机等终端设备。该语音合成装置1至少包括存储器11、处理器12,通信总线13,以及网络接口14。In this embodiment, the speech synthesis device 1 may be a PC (Personal Computer, personal computer), or a terminal device such as a smart phone, a tablet computer, or a portable computer. The speech synthesis device 1 at least includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.
其中,存储器11至少包括一种类型的可读存储介质,所述可读存储介质包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、磁性存储器、磁盘、光盘等。存储器11在一些实施例中可以是语音合成装置1的内部存储单元,例如该语音合成装置1的硬盘。存储器11在另一些实施例中也可以是语音合成装置1的外部存储设备,例如语音合成装置1上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,存储器11还可以既包括语音合成装置1的内部存储单元也包括外部存储设备。存储器11不仅可以用于存储安装于语音合成装置1的应用软件及各类数据,例如语音合成程序01的代码等,还可以用于暂时地存储已经输出或者将要输出的数据。Wherein, the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may be an internal storage unit of the speech synthesis device 1 in some embodiments, such as a hard disk of the speech synthesis device 1. In other embodiments, the memory 11 may also be an external storage device of the speech synthesis device 1, such as a plug-in hard disk equipped on the speech synthesis device 1, a smart media card (SMC), and a secure digital (Secure Digital, SD card, Flash Card, etc. Further, the memory 11 may also include both an internal storage unit of the speech synthesis apparatus 1 and an external storage device. The memory 11 can be used not only to store application software and various data installed in the speech synthesis device 1, such as the code of the speech synthesis program 01, etc., but also to temporarily store data that has been output or will be output.
处理器12在一些实施例中可以是一中央处理器(Central Processing Unit, CPU)、控制器、微控制器、微处理器或其他数据处理芯片,用于运行存储器11中存储的程序代码或处理数据,例如执行语音合成程序01等。In some embodiments, the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run the program code or processing stored in the memory 11 Data, such as execution of speech synthesis program 01, etc.
通信总线13用于实现这些组件之间的连接通信。The communication bus 13 is used to realize the connection and communication between these components.
网络接口14可选的可以包括标准的有线接口、无线接口(如WI-FI接口),通常用于在该装置1与其他电子设备之间建立通信连接。The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.
可选地,该装置1还可以包括用户接口,用户接口可以包括显示器(Display)、输入单元比如键盘(Keyboard),可选的用户接口还可以包括标准的有线接口、无线接口。可选地,在一些实施例中,显示器可以是LED显示器、液晶显示器、触控式液晶显示器以及OLED(Organic Light-Emitting Diode,有机发光二极管)触摸器等。其中,显示器也可以适当的称为显示屏或显示单元,用于显示在语音合成装置1中处理的信息以及用于显示可视化的用户界面。Optionally, the device 1 may also include a user interface. The user interface may include a display (Display) and an input unit such as a keyboard (Keyboard). The optional user interface may also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light emitting diode) touch device, etc. Among them, the display can also be called a display screen or a display unit as appropriate, for displaying information processed in the speech synthesis device 1 and for displaying a visualized user interface.
图4仅示出了具有组件11-14以及语音合成程序01的语音合成装置1,本领域技术人员可以理解的是,图4示出的结构并不构成对语音合成装置1的限定,可以包括比图示更少或者更多的部件,或者组合某些部件,或者不同的部件布置。FIG. 4 only shows the speech synthesis device 1 with components 11-14 and the speech synthesis program 01. Those skilled in the art can understand that the structure shown in FIG. 4 does not constitute a limitation on the speech synthesis device 1, and may include Fewer or more components than shown, or some combination of components, or different component arrangement.
在图4所示的装置1实施例中,存储器11中存储有语音合成程序01;处理器12执行存储器11中存储的语音合成程序01时实现如下步骤:In the embodiment of the device 1 shown in FIG. 4, the speech synthesis program 01 is stored in the memory 11; when the processor 12 executes the speech synthesis program 01 stored in the memory 11, the following steps are implemented:
步骤一、接收源说话人的语音数据,将所述源说话人的语音数据转换为文本内容,并将所述文本内容转化为文本向量。Step 1: Receive the voice data of the source speaker, convert the voice data of the source speaker into text content, and convert the text content into a text vector.
本申请通过一个文本嵌入模块将所述文本内容中的汉字转换为文本向量。This application uses a text embedding module to convert Chinese characters in the text content into text vectors.
This application uses the text embedding module to perform a word segmentation operation on the Chinese characters in the input text content, and then transliterates each resulting word segment into Chinese pinyin with tones (the digits 1-5 represent the four Mandarin tones and the neutral tone); for example, the word segment "您好" ("hello") is converted into "nin2hao3".
进一步地,本申请通过独热编码的方式,将转译得到的汉语拼音中的拼音字母和声调数字转换为一维文本向量,再按照时间序列将其转化为一个二维文本向量,参阅图2所示。Furthermore, this application uses one-hot encoding to convert the pinyin letters and tonal numbers in the translated Chinese Pinyin into a one-dimensional text vector, and then converts it into a two-dimensional text vector according to the time sequence, as shown in Figure 2. Show.
步骤二、将所述文本向量转化为源说话人的梅尔语谱图。Step 2: Convert the text vector into the Mel language spectrogram of the source speaker.
本申请较佳实施例通过将所述文本向量输入到一个梅尔语谱生成模块中,将所述文本向量转化为源说话人的梅尔语谱图。In a preferred embodiment of the present application, the text vector is input into a Mel spectrum generation module to convert the text vector into the Mel spectrum map of the source speaker.
本申请所述梅尔语谱生成模块接收所述文本嵌入模块传递来的文本向量,并利用经过训练的序列到序列的神经网络模型,将所述文本向量转化为源说话人的梅尔语谱图。The Mel language spectrum generation module of this application receives the text vector passed by the text embedding module, and uses the trained sequence-to-sequence neural network model to convert the text vector into the Mel language spectrum of the source speaker Figure.
本申请所述经过训练的序列到序列的神经网络模型采用Tacotron架构,并使用了一份不公开的语音数据库进行训练。该语音数据库包含了一位女性说话人(即源说话人)在安静环境下,用专用录音设备录制的总时长约30个小时的语音文件,以及每条语音所对应的文本文件。输入的文本向量经过训练过的序列到序列的神经网络模型映射之后,会被转换为源说话人的梅尔语谱图。The trained sequence-to-sequence neural network model described in this application adopts the Tacotron architecture and uses an undisclosed speech database for training. The voice database contains a female speaker (ie the source speaker) in a quiet environment, using a special recording device to record a total of about 30 hours of voice files, and the text file corresponding to each voice. After the input text vector is mapped from the trained sequence to the sequence neural network model, it will be converted into the Mel language spectrogram of the source speaker.
所述梅尔语谱图是一种基于梅尔频率倒谱系数(Mel Frequency Cepstrum Coefficient,MFCC)特征的频谱图。为获得所述梅尔频率倒谱系数特征,本申请首先使用Preemphasis滤波器提高高频信号和信噪比,其公式为:y(t)=x(t)-αx(t-1),式中x为信号输入,y为信号输出,x(t)为t时刻的信号,x(t-1)为(t-1)的信号,α一般取0.97。根据所述Preemphasis滤波器得到提高了高频信号和信噪比之后的t时刻的信号输出y(t)。接着进行短时傅里叶变换。为了模拟人耳对高频信号的抑制,本申请利用一组包含多个三角滤波器的滤波组件(filterbank)对经过短时傅里叶变换的线性谱进行处理得到低维特征,并强调低频部分,弱化高频部分,从而得到所述梅尔频率倒谱系数特征。The Mel-language spectrogram is a spectrogram based on Mel Frequency Cepstrum Coefficient (MFCC) features. In order to obtain the characteristic of the Mel frequency cepstrum coefficient, this application first uses a Preemphasis filter to improve the high-frequency signal and the signal-to-noise ratio. The formula is: y(t)=x(t)-αx(t-1), Where x is the signal input, y is the signal output, x(t) is the signal at time t, x(t-1) is the signal at (t-1), and α is generally 0.97. According to the Preemphasis filter, the signal output y(t) at time t after the high-frequency signal and the signal-to-noise ratio are improved. Then a short-time Fourier transform is performed. In order to simulate the suppression of high-frequency signals by human ears, this application uses a set of filter banks containing multiple triangular filters to process the linear spectrum after short-time Fourier transform to obtain low-dimensional features and emphasize the low-frequency part. , Weaken the high frequency part, so as to obtain the Mel frequency cepstrum coefficient characteristic.
优选地,在进行傅里叶变换前,为了防止能量泄露本申请较佳实施例会使用汉宁窗函数。所述汉宁窗可以看作是3个矩形时间窗的频谱之和,或者说是3个sin(t)型函数之和,而括号中的两项相对于第一个谱窗向左、右各移动了π/T,从而使旁瓣互相抵消,消去高频干扰和漏能。Preferably, in order to prevent energy leakage, the preferred embodiment of the present application will use the Hanning window function before performing the Fourier transform. The Hanning window can be regarded as the sum of the spectrum of 3 rectangular time windows, or the sum of 3 sin(t)-type functions, and the two items in brackets are left and right relative to the first spectral window. Each moved by π/T, so that the side lobes cancel each other out, eliminating high-frequency interference and leakage energy.
步骤三、获取目标说话人的语音信号,并将所述目标说话人的语音信号转换为目标说话人的梅尔频率倒谱系数特征。Step 3: Obtain the voice signal of the target speaker, and convert the voice signal of the target speaker into the Mel frequency cepstrum coefficient feature of the target speaker.
步骤四、将所述源说话人的梅尔语谱图输入至一个经过训练的语谱特征转换模型中,以将所述源说话人的梅尔语谱图转换为目标梅尔语谱图,并将所述目标梅尔语谱图作为训练值以及将所述目标说话人的梅尔频率倒谱系数特征作为标签值输入至一个损失函数中,当所述损失函数输出的损失值大于或等于预设阈值时,对所述目标梅尔语谱图进行变换调整,直到所述损失函数输出的损失值小于所述预设阈值时,将所述目标梅尔语谱图作为所述目标说话人的梅尔语谱图输出。Step 4: Input the Mel spectrogram of the source speaker into a trained spectrogram feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, And use the target Mel language spectrogram as the training value and the Mel frequency cepstral coefficient feature of the target speaker as the label value and input into a loss function, when the loss value output by the loss function is greater than or equal to When the threshold is preset, the target Mel spectrogram is transformed and adjusted until the loss value output by the loss function is less than the preset threshold, the target Mel spectrogram is used as the target speaker The Mel language spectrogram output.
The spectral feature conversion model described in this application includes a convolutional neural network (Convolutional Neural Network, CNN) model and a recurrent neural network (Recurrent Neural Network, RNN) model based on a bidirectional LSTM. In this application, the mel spectrogram of the source speaker is compressed in time by a layer of pre-trained convolutional neural network so that the features in the mel spectrogram are better represented. The processed mel spectrogram is divided into frames in time order, the mel-frequency cepstral coefficient features of each frame are combined with the identity feature of the target speaker, and the result is fed into a two-layer recurrent neural network based on a bidirectional LSTM, which converts the mel spectrogram of the source speaker into the target mel spectrogram frame by frame. Further, this application takes the converted target mel spectrogram as the training value and the mel-frequency cepstral coefficient features of the target speaker obtained in step S3 above as the label value, and inputs them into a loss function. When the loss value output by the loss function is greater than or equal to the preset threshold, the target mel spectrogram is transformed and adjusted; once the loss value output by the loss function is less than the preset threshold, the target mel spectrogram is output as the mel spectrogram of the target speaker.
本申请较佳实施例中,所述语谱特征转换模型的结构如图3所示。In a preferred embodiment of the present application, the structure of the spectral feature conversion model is shown in FIG. 3.
所述卷积神经网络以及基于双向LSTM的循环神经网络也使用了一个非公开的语音数据集进行了训练。该语音数据集包含了N位(较佳的,N为10) 位女性说话人的录音(每位说话人都有时长约1小时语音文件),并且10位说话人所录制的文本内容都是相同的。其中有一位女性说话人也录制了上述训练的序列到序列的神经网络模型所用的语音数据库。因此该位说话人被作为源说话人。而其余九位说话人则被当作目标说话人,并分别给予1-9的身份编号。该编号将在所述卷积神经网络以及基于双向LSTM的循环神经网络训练以及之后推理时,作为目标说话人身份向量嵌入相对应的梅尔频率倒谱系数特征中。The convolutional neural network and the recurrent neural network based on bidirectional LSTM are also trained using a private speech data set. The voice data set contains recordings of N (preferably, N is 10) female speakers (each speaker has a voice file of about 1 hour in length), and the text content recorded by 10 speakers is identical. One of the female speakers also recorded the speech database used by the sequence-to-sequence neural network model trained above. Therefore the speaker is taken as the source speaker. The remaining nine speakers were regarded as target speakers and were given ID numbers 1-9. This number will be embedded in the corresponding Mel frequency cepstral coefficient feature as the target speaker identity vector during the training of the convolutional neural network and the bidirectional LSTM-based recurrent neural network and subsequent inferences.
所述卷积神经网络是一种前馈神经网络,它的人工神经元可以响应一部分覆盖范围内的周围单元,其基本结构包括两层,其一为特征提取层,每个神经元的输入与前一层的局部接受域相连,并提取该局部的特征。一旦该局部特征被提取后,它与其它特征间的位置关系也随之确定下来;其二是特征映射层,网络的每个计算层由多个特征映射组成,每个特征映射是一个平面,平面上所有神经元的权值相等。特征映射结构采用影响函数核小的sigmoid函数作为卷积网络的激活函数,使得特征映射具有位移不变性。此外,由于一个映射面上的神经元共享权值,因而减少了网络自由参数的个数。卷积神经网络中的每一个卷积层都紧跟着一个用来求局部平均与二次提取的计算层,这种特有的两次特征提取结构减小了特征分辨率。The convolutional neural network is a feed-forward neural network. Its artificial neurons can respond to a part of the surrounding units in the coverage area. Its basic structure includes two layers. One is the feature extraction layer. The input of each neuron is The local receptive fields of the previous layer are connected, and the local features are extracted. Once the local feature is extracted, the positional relationship between it and other features is also determined; the second is the feature mapping layer, each computing layer of the network is composed of multiple feature maps, and each feature map is a plane. The weights of all neurons on the plane are equal. The feature mapping structure uses a sigmoid function with a small influencing function core as the activation function of the convolutional network, which makes the feature mapping displacement invariant. In addition, since neurons on a mapping plane share weights, the number of free parameters of the network is reduced. Each convolutional layer in the convolutional neural network is followed by a calculation layer for local averaging and secondary extraction. This unique two-feature extraction structure reduces the feature resolution.
输入层:输入层是整个卷积神经网络唯一的数据输入口,主要用于定义不同类型的数据输入。Input layer: The input layer is the only data input port of the entire convolutional neural network, which is mainly used to define different types of data input.
卷积层:对输入卷积层的数据进行卷积操作,输出卷积后的特征图。Convolutional layer: convolve the data of the input convolutional layer, and output the convolutional feature map.
下采样层(Pooling层):Pooling层对传入数据在空间维度上进行下采样操作,使得输入的特征图的长和宽变为原来的一半。Down-sampling layer (Pooling layer): The Pooling layer performs down-sampling operations on the incoming data in spatial dimensions, so that the length and width of the input feature map become half of the original.
全连接层:全连接层和普通神经网络一样,每个神经元都与输入的所有神经元相互连接,然后经过激活函数进行计算。Fully connected layer: The fully connected layer is the same as an ordinary neural network. Each neuron is connected to all input neurons, and then calculated through an activation function.
输出层:输出层也被称为分类层,在最后输出时会计算每一类别的分类分值。Output layer: The output layer is also called the classification layer, and the classification score of each category will be calculated in the final output.
在本申请实施例中,输入层为源说话人梅尔语谱图,该梅尔语谱图依次进入一个7*7的卷积层,3*3的最大值池化层,随后进入4个卷积模块。每个卷积模块从具有线性投影的构建块开始,随后是具有本体映射的不同数量的构建块,最后在softmax层输出经过时序压缩的梅尔语谱。In the embodiment of this application, the input layer is the source speaker Mel's spectrogram, which sequentially enters a 7*7 convolutional layer, a 3*3 maximum pooling layer, and then enters 4 Convolution module. Each convolution module starts with a building block with linear projection, followed by a different number of building blocks with ontology mapping, and finally outputs a time-sequentially compressed Mel language spectrum in the softmax layer.
所述循环神经网络通常用于描述动态的序列数据,随着时间的变化而动态调整自身的网络状态,并且不断进行循环传递。在传统的神经网络模型中,神经元从输入层到隐藏层,再从隐藏层到输出层,层与层之间是全连接或者局部连接的方式,且在数据的传递中,会丢失上一层计算过程中产生的特征信息,而RNN所不同于传统神经网络模型的地方在于一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用与当前输出的计算中,即隐藏层之间的解点不再是无连接的而是有链接的,并且隐藏层的输出不仅包括输入层的输出,还包括上一时刻隐藏层的输出。The cyclic neural network is usually used to describe dynamic sequence data, dynamically adjust its own network state as time changes, and continuously perform cyclic transmission. In the traditional neural network model, neurons go from the input layer to the hidden layer, and then from the hidden layer to the output layer. The layers are fully connected or locally connected, and the last one will be lost in the transmission of data. The feature information generated in the layer calculation process, and the RNN is different from the traditional neural network model in that the current output of a sequence is also related to the previous output. The specific manifestation is that the network will memorize the previous information and apply it to the calculation of the current output, that is, the solution points between the hidden layers are no longer disconnected but linked, and the output of the hidden layer includes not only the input The output of the layer also includes the output of the hidden layer at the previous moment.
在本申请实施例中,将利用时序进行分帧的梅尔频率倒谱系数特征输入 到两层的基于LSTM的循环神经网络模型中,利用梯度下降法求解损失函数。In the embodiment of the present application, the Mel frequency cepstral coefficient feature for framing using time sequence is input into the two-layer LSTM-based cyclic neural network model, and the gradient descent method is used to solve the loss function.
In a neural network, the loss function is used to evaluate the difference between the predicted value Ŷ output by the network model and the true value Y. The loss function is denoted here by L(Y, Ŷ); it is a non-negative real-valued function, and the smaller the loss value, the better the performance of the network model. According to the basic neuron formula in deep learning, the input and output of each layer are z_i = W·s_(i-1) + U·x_i and C_i = f(z_i), where s_(i-1) is the output passed on from the previous layer, W is the link weight from the i-th neuron of layer l to the j-th neuron of layer l+1, U is the weight of the i-th neuron of layer l, x_i is the input of the i-th neuron of layer l, and C_i is the output value of each unit of the output layer. Based on this input-output formula, the mean squared error (MSE) is used to build the loss function L(Y, Ŷ) = (1/n)·Σ_i (Y_i - Ŷ_i)², where n is the batch size, Y_i is the correct answer for the i-th sample in a batch, and Ŷ_i is the predicted value given by the neural network. At the same time, in order to alleviate the vanishing-gradient problem, the ReLU function relu(x) = max(0, x) is selected as the activation function, where x is the input value of the neuron. This function satisfies the sparsity found in bionics: the neuron node is activated only when its input exceeds a certain value, the output is clamped when the input is below 0, and once the input rises above that threshold the output is a linear function of the input.
The preferred embodiment of the present application uses a gradient descent algorithm to solve the loss function. The gradient descent algorithm is the most commonly used optimization algorithm for training neural network models. To find the minimum of the loss function L(Y, Ŷ), the variable y must be updated along the direction opposite to the gradient vector, i.e. along -dL/dy, which makes the loss decrease fastest until it converges to the minimum. The parameter update formula is y = y - α·dL/dy, where α denotes the learning rate. In this way the final neural network parameters for recognizing the mel spectrogram can be obtained.
进一步地,本申请利用Softmax函数输入分类标签。Further, this application uses the Softmax function to input the classification label.
The Softmax is a generalization of logistic regression: logistic regression is used to handle binary classification problems, while its generalization, Softmax regression, is used to handle multi-class classification problems. According to the input mel-frequency cepstral coefficient features, the maximum of the output probabilities over all classes is obtained through this activation function, whose core formula is P(k|x) = exp(x_k) / Σ_(j=1..K) exp(x_j). Assuming there are K classes in total, x_k denotes the sample whose class is k and x_j denotes the sample whose class is j, and the target mel spectrogram is thereby obtained.
步骤五、将所述目标说话人的梅尔语谱图转换为所述文本内容对应的语音并输出。Step 5: Convert the Mel language spectrogram of the target speaker into a voice corresponding to the text content and output it.
本申请较佳实施例利用语音生成模块将目标说话人的梅尔语谱图合成为语音。The preferred embodiment of the present application uses a voice generation module to synthesize the Mel language spectrogram of the target speaker into voice.
语音生成模块用于处理梅尔语谱图并生成高保真以及高自然度的语音。本申请在获得了目标说话人的梅尔语谱图后,使用一个语音生成模块,把梅尔语谱图作为条件输入,生成目标说话人的语音。该语音生成模块采用了一种叫做WaveNet的声码器。当输入不同目标说话人的梅尔语谱图时,该声码器可以根据所述梅尔语谱图生成不同目标说话人的的高保真声音。The speech generation module is used to process Mel's spectrogram and generate high-fidelity and high-naturalness speech. This application uses a voice generation module after obtaining the Mel spectrogram of the target speaker, and uses the Mel spectrogram as a conditional input to generate the target speaker's voice. The speech generation module uses a vocoder called WaveNet. When inputting Mel spectrograms of different target speakers, the vocoder can generate high-fidelity sounds of different target speakers according to the Mel spectrograms.
本申请较佳实施例中所使用的WaveNet声码器,也是由一个非公开的语音数据集训练而成,该数据集与训练卷积神经网络所用的语音数据集为同一数据集。所述WaveNet是一个端到端的TTS(text to speech)模型,其主要概念是因果卷积,所谓因果卷积的意义就是WaveNet在生成t时刻的元素时,只 能使用0到t-1时刻的元素值。由于声音文件是时间上的一维数组,16KHz的采样率的文件,每秒钟就会有16000个元素,而上面所说的因果卷积的感受野非常小,即使堆叠很多层也只能使用到很少的数据来生成t时刻的的元素,为了扩大卷积的感受野,WaveNet采用了堆叠了多层带洞卷积来增到网络的感受野,使得网络生成下一个元素的时候,能够使用更多之前的元素数值。The WaveNet vocoder used in the preferred embodiment of the present application is also trained from a non-public speech data set, which is the same data set used for training the convolutional neural network. The WaveNet is an end-to-end TTS (text to speech) model. Its main concept is causal convolution. The meaning of the so-called causal convolution is that when WaveNet generates elements at time t, it can only use time from 0 to t-1. Element value. Since the sound file is a one-dimensional array in time, a file with a sampling rate of 16KHz will have 16,000 elements per second, and the receptive field of the causal convolution mentioned above is very small, and it can only be used even if many layers are stacked. To generate the element at time t with very little data, in order to expand the receptive field of convolution, WaveNet uses stacked multi-layer convolution with holes to increase the receptive field of the network, so that when the network generates the next element, it can Use more previous element values.
可选地,在其他实施例中,语音合成程序01还可以被分割为一个或者多个模块,一个或者多个模块被存储于存储器11中,并由一个或多个处理器(本实施例为处理器12)所执行以完成本申请,本申请所称的模块是指能够完成特定功能的一系列计算机程序指令段,用于描述语音合成程序在语音合成装置中的执行过程。Optionally, in other embodiments, the speech synthesis program 01 may also be divided into one or more modules, and the one or more modules are stored in the memory 11 and run by one or more processors (in this embodiment, The processor 12) is executed to complete the application. The module referred to in the application refers to a series of computer program instruction segments capable of completing specific functions, and is used to describe the execution process of the speech synthesis program in the speech synthesis device.
例如,参照图5所示,为本申请语音合成装置一实施例中的语音合成程序的程序模块示意图,该实施例中,语音合成程序可以被分割为文本嵌入模块10、梅尔语谱生成模块20、语谱特征转换模块30及语音生成模块40,示例性地:For example, referring to FIG. 5, it is a schematic diagram of the program modules of the speech synthesis program in an embodiment of the speech synthesis device of this application. In this embodiment, the speech synthesis program can be divided into a text embedding module 10 and a Mel language spectrum generating module 20. The spectrum feature conversion module 30 and the speech generation module 40, exemplarily:
所述文本嵌入模块10用于:接收源说话人的语音数据,将所述源说话人的语音数据转换为文本内容,并将所述文本内容转化为文本向量。The text embedding module 10 is configured to receive voice data of a source speaker, convert the voice data of the source speaker into text content, and convert the text content into a text vector.
可选地,所述文本嵌入模块10具体用于将所述文本内容中的汉字进行分词操作,然后将得到的分词转译为带有声调的汉语拼音,并通过独热编码的方式,将转译得到的汉语拼音中的拼音字母和声调数字转换为一维文本向量,再按照时间序列将其转化为一个二维的所述文本向量。Optionally, the text embedding module 10 is specifically configured to perform word segmentation operations on Chinese characters in the text content, and then translate the obtained word segmentation into tonal Chinese pinyin, and use one-hot encoding to translate The pinyin letters and tonal numbers in the Chinese Pinyin are converted into a one-dimensional text vector, and then converted into a two-dimensional text vector according to the time sequence.
所述梅尔语谱生成模块20用于:将所述文本向量转化为源说话人的梅尔语谱图。The Mel language spectrum generating module 20 is used for converting the text vector into the Mel language spectrum map of the source speaker.
可选地,梅尔语谱生成模块20利用经过训练的序列到序列的神经网络模型,将所述二维文本向量转化为源说话人的梅尔语谱图,其中,所述经过训练的序列到序列的神经网络模型采用Tacotron架构,并使用预设语音数据库进行训练,该预设语音数据库包含了多个说话人在安静环境下用录音设备录制的语音文件以及每条语音所对应的文本文件。Optionally, the Mel language spectrum generation module 20 uses a trained sequence-to-sequence neural network model to convert the two-dimensional text vector into the Mel language spectrum of the source speaker, wherein the trained sequence The neural network model to the sequence uses the Tacotron architecture and uses a preset voice database for training. The preset voice database contains voice files recorded by multiple speakers in a quiet environment with a recording device and a text file corresponding to each voice. .
所述语谱特征转换模块30用于:获取目标说话人的语音信号,并将所述目标说话人的语音信号转换为目标说话人的梅尔频率倒谱系数特征,将所述源说话人的梅尔语谱图输入至一个经过训练的语谱特征转换模型中,以将所述源说话人的梅尔语谱图转换为目标梅尔语谱图,并将所述目标梅尔语谱图作为训练值以及将所述目标说话人的梅尔频率倒谱系数特征作为标签值输入至一个损失函数中,当所述损失函数输出的损失值大于或等于预设阈值时,对所述目标梅尔语谱图进行变换调整,直到所述损失函数输出的损失值小于所述预设阈值时,将所述目标梅尔语谱图作为所述目标说话人的梅尔语谱图输出。The spectral feature conversion module 30 is used to obtain the voice signal of the target speaker, and convert the voice signal of the target speaker into the Mel frequency cepstrum coefficient feature of the target speaker, and convert the source speaker’s voice signal The Mel language spectrogram is input into a trained spectral feature conversion model to convert the Mel language spectrogram of the source speaker into a target Mel language spectrogram, and the target Mel language spectrogram As a training value and the Mel frequency cepstrum coefficient feature of the target speaker is input as a label value into a loss function, when the loss value output by the loss function is greater than or equal to a preset threshold, the target speaker The Er language spectrogram is transformed and adjusted until the loss value output by the loss function is less than the preset threshold, and the target Mel language spectrogram is output as the Mel language spectrogram of the target speaker.
可选地,所述语谱特征转换模块30将所述源说话人的梅尔语谱图通过所述预训练的卷积神经网络以进行时序压缩,对经过时序压缩的梅尔语谱图按 照时序进行分帧,每一帧的梅尔频率倒谱系数特征加上目标说话人的身份特征,并输入至所述循环神经网络中进行处理,该循环神经网络逐帧将源说话人的梅尔频率倒谱系数特征转换为目标说话人的梅尔频率倒谱系数特征,得到所述训练值。Optionally, the spectral feature conversion module 30 passes the source speaker’s Mel spectrogram through the pre-trained convolutional neural network for time-series compression, and performs time-series compression on the time-series compressed Mel spectrogram according to Framing is performed in time sequence. The Mel frequency cepstral coefficient feature of each frame plus the identity feature of the target speaker are input to the recurrent neural network for processing. The recurrent neural network divides the source speaker's Mel The frequency cepstral coefficient feature is converted into the Mel frequency cepstral coefficient feature of the target speaker to obtain the training value.
所述语音生成模块40用于:将所述目标说话人的梅尔语谱图转换为所述文本内容对应的语音并输出。The voice generation module 40 is used to convert the Mel language spectrogram of the target speaker into a voice corresponding to the text content and output it.
上述文本嵌入模块10、梅尔语谱生成模块20、语谱特征转换模块30和语音生成模块40等程序模块被执行时所实现的功能或操作步骤与上述实施例大体相同,在此不再赘述。The functions or operation steps implemented by the program modules such as the text embedding module 10, the Mel language spectrum generation module 20, the language spectrum feature conversion module 30, and the speech generation module 40 are substantially the same as those in the foregoing embodiment, and will not be repeated here. .
此外,本申请实施例还提出一种计算机可读存储介质,所述计算机可读存储介质上存储有语音合成程序,所述语音合成程序可被一个或多个处理器执行,以实现如下操作:In addition, an embodiment of the present application also proposes a computer-readable storage medium with a speech synthesis program stored on the computer-readable storage medium, and the speech synthesis program can be executed by one or more processors to implement the following operations:
接收源说话人的语音数据,将所述源说话人的语音数据转换为文本内容,并将所述文本内容转化为文本向量;Receiving voice data of a source speaker, converting the voice data of the source speaker into text content, and converting the text content into a text vector;
将所述文本向量转化为源说话人的梅尔语谱图;Converting the text vector into the Mel language spectrogram of the source speaker;
获取目标说话人的语音信号,并将所述目标说话人的语音信号转换为目标说话人的梅尔频率倒谱系数特征;Acquiring the voice signal of the target speaker, and converting the voice signal of the target speaker into the Mel frequency cepstrum coefficient feature of the target speaker;
将所述源说话人的梅尔语谱图输入至一个经过训练的语谱特征转换模型中,以将所述源说话人的梅尔语谱图转换为目标梅尔语谱图,并将所述目标梅尔语谱图作为训练值以及将所述目标说话人的梅尔频率倒谱系数特征作为标签值输入至一个损失函数中,当所述损失函数输出的损失值大于或等于预设阈值时,对所述目标梅尔语谱图进行变换调整,直到所述损失函数输出的损失值小于所述预设阈值时,将所述目标梅尔语谱图作为所述目标说话人的梅尔语谱图输出;及Input the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into the target Mel spectrogram, and then The target Mel language spectrogram is used as a training value and the Mel frequency cepstral coefficient feature of the target speaker is input as a label value into a loss function, when the loss value output by the loss function is greater than or equal to a preset threshold When the target Mel language spectrogram is transformed and adjusted, until the loss value output by the loss function is less than the preset threshold, the target Mel language spectrogram is used as the Mel of the target speaker. Spectrogram output; and
将所述目标说话人的梅尔语谱图转换为所述文本内容对应的语音并输出。The Mel language spectrogram of the target speaker is converted into a voice corresponding to the text content and output.
本申请计算机可读存储介质具体实施方式与上述语音合成装置和方法各实施例基本相同,在此不作累述。The specific implementations of the computer-readable storage medium of the present application are basically the same as those of the above-mentioned speech synthesis device and method, and will not be repeated here.
需要说明的是,上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。并且本文中的术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、装置、物品或者方法不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、装置、物品或者方法所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、装置、物品或者方法中还存在另外的相同要素。It should be noted that the serial numbers of the above embodiments of the present application are only for description, and do not represent the advantages and disadvantages of the embodiments. And the terms "include", "include" or any other variants thereof in this article are intended to cover non-exclusive inclusion, so that a process, device, article or method including a series of elements not only includes those elements, but also includes The other elements listed may also include elements inherent to the process, device, article, or method. If there are no more restrictions, the element defined by the sentence "including a..." does not exclude the existence of other identical elements in the process, device, article or method that includes the element.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk or an optical disk) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in the various embodiments of this application.
以上仅为本申请的优选实施例,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利保护范围内。The above are only preferred embodiments of this application, and do not limit the scope of this application. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or directly or indirectly used in other related technical fields , The same reason is included in the scope of patent protection of this application.

Claims (20)

  1. 一种语音合成方法,其特征在于,所述方法包括:A speech synthesis method, characterized in that the method includes:
    接收源说话人的语音数据,将所述源说话人的语音数据转换为文本内容,并将所述文本内容转化为文本向量;Receiving voice data of a source speaker, converting the voice data of the source speaker into text content, and converting the text content into a text vector;
    将所述文本向量转化为源说话人的梅尔语谱图;Converting the text vector into the Mel language spectrogram of the source speaker;
    获取目标说话人的语音信号,并将所述目标说话人的语音信号转换为目标说话人的梅尔频率倒谱系数特征;Acquiring the voice signal of the target speaker, and converting the voice signal of the target speaker into the Mel frequency cepstrum coefficient feature of the target speaker;
    将所述源说话人的梅尔语谱图输入至一个经过训练的语谱特征转换模型中,以将所述源说话人的梅尔语谱图转换为目标梅尔语谱图,并将所述目标梅尔语谱图作为训练值以及将所述目标说话人的梅尔频率倒谱系数特征作为标签值输入至一个损失函数中,当所述损失函数输出的损失值大于或等于预设阈值时,对所述目标梅尔语谱图进行变换调整,直到所述损失函数输出的损失值小于所述预设阈值时,将所述目标梅尔语谱图作为所述目标说话人的梅尔语谱图输出;及Input the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into the target Mel spectrogram, and then The target Mel language spectrogram is used as a training value and the Mel frequency cepstral coefficient feature of the target speaker is input as a label value into a loss function, when the loss value output by the loss function is greater than or equal to a preset threshold When the target Mel language spectrogram is transformed and adjusted, until the loss value output by the loss function is less than the preset threshold, the target Mel language spectrogram is used as the Mel of the target speaker. Spectrogram output; and
    将所述目标说话人的梅尔语谱图转换为所述文本内容对应的语音并输出。The Mel language spectrogram of the target speaker is converted into a voice corresponding to the text content and output.
  2. 如权利要求1所述的语音合成方法,其特征在于,所述将所述文本内容转化为文本向量包括:The speech synthesis method according to claim 1, wherein said converting said text content into a text vector comprises:
    将所述文本内容中的汉字进行分词操作,将得到的分词转译为带有声调的汉语拼音,通过独热编码的方式,将转译得到的汉语拼音中的拼音字母和声调数字转换为一维文本向量,再按照时间序列将所述一维文本向量转化为二维的所述文本向量。Perform word segmentation operations on the Chinese characters in the text content, translate the obtained word segmentation into tonal Chinese pinyin, and use one-hot encoding to convert the pinyin letters and tonal numbers in the translated Chinese pinyin into one-dimensional text Vector, and then convert the one-dimensional text vector into the two-dimensional text vector according to the time sequence.
  3. 如权利要求1所述的语音合成方法,其特征在于,所述将所述文本向量转化为源说话人的梅尔语谱图,包括:5. The speech synthesis method according to claim 1, wherein said converting the text vector into the Mel language spectrogram of the source speaker comprises:
    利用经过训练的序列到序列的神经网络模型,将所述二维文本向量转化为源说话人的梅尔语谱图,其中,所述经过训练的序列到序列的神经网络模型采用Tacotron架构,并使用预设语音数据库进行训练,该预设语音数据库包含了多个说话人在安静环境下用录音设备录制的语音文件以及每条语音所对应的文本文件。Use the trained sequence-to-sequence neural network model to transform the two-dimensional text vector into the Mel language spectrogram of the source speaker, where the trained sequence-to-sequence neural network model adopts the Tacotron architecture, and Use a preset voice database for training. The preset voice database contains voice files recorded by multiple speakers in a quiet environment with a recording device and a text file corresponding to each voice.
  4. 如权利要求2所述的语音合成方法,其特征在于,所述将所述文本向量转化为源说话人的梅尔语谱图,包括:3. The speech synthesis method according to claim 2, wherein the converting the text vector into the Mel language spectrogram of the source speaker comprises:
    利用经过训练的序列到序列的神经网络模型,将所述二维文本向量转化为源说话人的梅尔语谱图,其中,所述经过训练的序列到序列的神经网络模型采用Tacotron架构,并使用预设语音数据库进行训练,该预设语音数据库包含了多个说话人在安静环境下用录音设备录制的语音文件以及每条语音所对应的文本文件。Use the trained sequence-to-sequence neural network model to transform the two-dimensional text vector into the Mel language spectrogram of the source speaker, where the trained sequence-to-sequence neural network model adopts the Tacotron architecture, and Use a preset voice database for training. The preset voice database contains voice files recorded by multiple speakers in a quiet environment with a recording device and a text file corresponding to each voice.
  5. The speech synthesis method according to claim 1, wherein the spectral feature conversion model comprises a pre-trained convolutional neural network model and a two-layer recurrent neural network based on bidirectional LSTM, and wherein inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram comprises:
    passing the Mel spectrogram of the source speaker through the pre-trained convolutional neural network model for temporal compression;
    dividing the temporally compressed Mel spectrogram into frames according to the time sequence, adding the identity features of the target speaker to the Mel-frequency cepstral coefficient features of each frame, and inputting the result into the recurrent neural network for processing, the recurrent neural network converting the Mel-frequency cepstral coefficient features of the source speaker into the Mel-frequency cepstral coefficient features of the target speaker frame by frame to obtain the target Mel spectrogram.
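The frame-by-frame conversion network of claim 5 can be sketched as a two-layer bidirectional LSTM that receives, for every frame, the source MFCC features concatenated with a target-speaker identity embedding and emits target-speaker MFCC features. Feature and embedding dimensions below are assumptions.

    import torch
    import torch.nn as nn

    class SpectralConversionRNN(nn.Module):
        def __init__(self, n_mfcc: int = 13, spk_dim: int = 64, hidden: int = 256):
            super().__init__()
            self.rnn = nn.LSTM(n_mfcc + spk_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden, n_mfcc)

        def forward(self, src_mfcc: torch.Tensor, spk_embed: torch.Tensor) -> torch.Tensor:
            # src_mfcc: (batch, frames, n_mfcc); spk_embed: (batch, spk_dim)
            frames = src_mfcc.size(1)
            spk = spk_embed.unsqueeze(1).expand(-1, frames, -1)    # add the identity features to every frame
            out, _ = self.rnn(torch.cat([src_mfcc, spk], dim=-1))
            return self.proj(out)                                  # target-speaker MFCC features, frame by frame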
  6. The speech synthesis method according to any one of claims 2 to 4, wherein the spectral feature conversion model comprises a pre-trained convolutional neural network model and a two-layer recurrent neural network based on bidirectional LSTM, and wherein inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram comprises:
    passing the Mel spectrogram of the source speaker through the pre-trained convolutional neural network model for temporal compression;
    dividing the temporally compressed Mel spectrogram into frames according to the time sequence, adding the identity features of the target speaker to the Mel-frequency cepstral coefficient features of each frame, and inputting the result into the recurrent neural network for processing, the recurrent neural network converting the Mel-frequency cepstral coefficient features of the source speaker into the Mel-frequency cepstral coefficient features of the target speaker frame by frame to obtain the target Mel spectrogram.
  7. The speech synthesis method according to claim 6, wherein passing the Mel spectrogram of the source speaker through the pre-trained convolutional neural network model for temporal compression comprises:
    inputting the Mel spectrogram of the source speaker into the input layer of the convolutional neural network model, the Mel spectrogram passing successively through a 7×7 convolutional layer, a 3×3 max-pooling layer and four convolution modules, and finally outputting the temporally compressed Mel spectrogram at the softmax layer.
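The compression network of claim 7 can be sketched as follows: a 7×7 convolution, a 3×3 max-pooling layer, four convolution modules and a softmax output. The channel counts, strides and the internal layout of each convolution module are assumptions, since the claim fixes only the layer types.

    import torch
    import torch.nn as nn

    def conv_module(in_ch: int, out_ch: int) -> nn.Sequential:
        # one of the four "convolution modules"; its internal layout is an assumption
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=(1, 2), padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    class CompressionCNN(nn.Module):
        def __init__(self, channels=(32, 64, 128, 256, 256)):
            super().__init__()
            self.stem = nn.Sequential(
                nn.Conv2d(1, channels[0], kernel_size=7, stride=2, padding=3),  # 7x7 convolutional layer
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),               # 3x3 max-pooling layer
            )
            self.blocks = nn.Sequential(*[
                conv_module(channels[i], channels[i + 1]) for i in range(4)     # 4 convolution modules
            ])

        def forward(self, mel: torch.Tensor) -> torch.Tensor:
            # mel: (batch, 1, n_mels, frames); output: temporally compressed map after softmax
            x = self.blocks(self.stem(mel))
            return torch.softmax(x, dim=-1)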
  8. A speech synthesis apparatus, wherein the apparatus comprises a memory and a processor, the memory storing a speech synthesis program executable on the processor, and the speech synthesis program, when executed by the processor, implementing the following steps:
    receiving speech data of a source speaker, converting the speech data of the source speaker into text content, and converting the text content into a text vector;
    converting the text vector into a Mel spectrogram of the source speaker;
    acquiring a speech signal of a target speaker, and converting the speech signal of the target speaker into Mel-frequency cepstral coefficient features of the target speaker;
    inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, inputting the target Mel spectrogram as a training value and the Mel-frequency cepstral coefficient features of the target speaker as a label value into a loss function, and, when the loss value output by the loss function is greater than or equal to a preset threshold, transforming and adjusting the target Mel spectrogram until the loss value output by the loss function is less than the preset threshold, and then outputting the target Mel spectrogram as the Mel spectrogram of the target speaker; and
    converting the Mel spectrogram of the target speaker into speech corresponding to the text content, and outputting the speech.
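Two of the steps restated in claim 8, extracting the target speaker's MFCC features and turning the final Mel spectrogram back into a waveform, can be sketched with librosa as below. The use of librosa, the Griffin-Lim-based mel inversion and all parameter values are assumptions; the claims do not name a particular signal-processing library or vocoder.

    import librosa
    import soundfile as sf

    def extract_target_mfcc(wav_path: str, n_mfcc: int = 13):
        # convert the target speaker's speech signal into MFCC features
        y, sr = librosa.load(wav_path, sr=None)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), sr

    def mel_to_speech(mel, sr: int, out_path: str = "synthesized.wav"):
        # convert the target speaker's Mel spectrogram into speech and output it
        y = librosa.feature.inverse.mel_to_audio(mel, sr=sr)   # Griffin-Lim based inversion
        sf.write(out_path, y, sr)
        return out_path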
  9. The speech synthesis apparatus according to claim 8, wherein converting the text content into a two-dimensional text vector comprises:
    performing a word segmentation operation on the Chinese characters in the text content, transliterating the resulting word segments into Chinese pinyin with tones, converting the pinyin letters and tone digits of the transliterated pinyin into one-dimensional text vectors by means of one-hot encoding, and then converting the one-dimensional text vectors into the two-dimensional text vector according to the time sequence.
  10. The speech synthesis apparatus according to claim 8, wherein converting the text vector into the Mel spectrogram of the source speaker comprises:
    converting the two-dimensional text vector into the Mel spectrogram of the source speaker by using a trained sequence-to-sequence neural network model, wherein the trained sequence-to-sequence neural network model adopts the Tacotron architecture and is trained with a preset speech database, the preset speech database containing speech files recorded by a plurality of speakers with recording equipment in a quiet environment and a text file corresponding to each speech file.
  11. The speech synthesis apparatus according to claim 9, wherein converting the text vector into the Mel spectrogram of the source speaker comprises:
    converting the two-dimensional text vector into the Mel spectrogram of the source speaker by using a trained sequence-to-sequence neural network model, wherein the trained sequence-to-sequence neural network model adopts the Tacotron architecture and is trained with a preset speech database, the preset speech database containing speech files recorded by a plurality of speakers with recording equipment in a quiet environment and a text file corresponding to each speech file.
  12. The speech synthesis apparatus according to claim 8, wherein the spectral feature conversion model comprises a pre-trained convolutional neural network model and a two-layer recurrent neural network based on bidirectional LSTM, and wherein inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram comprises:
    passing the Mel spectrogram of the source speaker through the pre-trained neural network model for temporal compression;
    dividing the temporally compressed Mel spectrogram into frames according to the time sequence, adding the identity features of the target speaker to the Mel-frequency cepstral coefficient features of each frame, and inputting the result into the recurrent neural network for processing, the recurrent neural network converting the Mel-frequency cepstral coefficient features of the source speaker into the Mel-frequency cepstral coefficient features of the target speaker frame by frame to obtain the target Mel spectrogram.
  13. The speech synthesis apparatus according to any one of claims 9 to 11, wherein the spectral feature conversion model comprises a pre-trained convolutional neural network model and a two-layer recurrent neural network based on bidirectional LSTM, and wherein inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram comprises:
    passing the Mel spectrogram of the source speaker through the pre-trained neural network model for temporal compression;
    dividing the temporally compressed Mel spectrogram into frames according to the time sequence, adding the identity features of the target speaker to the Mel-frequency cepstral coefficient features of each frame, and inputting the result into the recurrent neural network for processing, the recurrent neural network converting the Mel-frequency cepstral coefficient features of the source speaker into the Mel-frequency cepstral coefficient features of the target speaker frame by frame to obtain the target Mel spectrogram.
  14. The speech synthesis apparatus according to claim 13, wherein passing the Mel spectrogram of the source speaker through the pre-trained convolutional neural network model for temporal compression comprises:
    inputting the Mel spectrogram of the source speaker into the input layer of the convolutional neural network model, the Mel spectrogram passing successively through a 7×7 convolutional layer, a 3×3 max-pooling layer and four convolution modules, and finally outputting the temporally compressed Mel spectrogram at the softmax layer.
  15. A computer-readable storage medium, wherein a speech synthesis program is stored on the computer-readable storage medium, and the speech synthesis program is executable by one or more processors to implement the following steps:
    receiving speech data of a source speaker, converting the speech data of the source speaker into text content, and converting the text content into a text vector;
    converting the text vector into a Mel spectrogram of the source speaker;
    acquiring a speech signal of a target speaker, and converting the speech signal of the target speaker into Mel-frequency cepstral coefficient features of the target speaker;
    inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram, inputting the target Mel spectrogram as a training value and the Mel-frequency cepstral coefficient features of the target speaker as a label value into a loss function, and, when the loss value output by the loss function is greater than or equal to a preset threshold, transforming and adjusting the target Mel spectrogram until the loss value output by the loss function is less than the preset threshold, and then outputting the target Mel spectrogram as the Mel spectrogram of the target speaker; and
    converting the Mel spectrogram of the target speaker into speech corresponding to the text content, and outputting the speech.
  16. The computer-readable storage medium according to claim 15, wherein converting the text content into a two-dimensional text vector comprises:
    performing a word segmentation operation on the Chinese characters in the text content, transliterating the resulting word segments into Chinese pinyin with tones, converting the pinyin letters and tone digits of the transliterated pinyin into one-dimensional text vectors by means of one-hot encoding, and then converting the one-dimensional text vectors into the two-dimensional text vector according to the time sequence.
  17. The computer-readable storage medium according to claim 15, wherein converting the text vector into the Mel spectrogram of the source speaker comprises:
    converting the two-dimensional text vector into the Mel spectrogram of the source speaker by using a trained sequence-to-sequence neural network model, wherein the trained sequence-to-sequence neural network model adopts the Tacotron architecture and is trained with a preset speech database, the preset speech database containing speech files recorded by a plurality of speakers with recording equipment in a quiet environment and a text file corresponding to each speech file.
  18. The computer-readable storage medium according to claim 16, wherein converting the text vector into the Mel spectrogram of the source speaker comprises:
    converting the two-dimensional text vector into the Mel spectrogram of the source speaker by using a trained sequence-to-sequence neural network model, wherein the trained sequence-to-sequence neural network model adopts the Tacotron architecture and is trained with a preset speech database, the preset speech database containing speech files recorded by a plurality of speakers with recording equipment in a quiet environment and a text file corresponding to each speech file.
  19. The computer-readable storage medium according to any one of claims 15 to 18, wherein the spectral feature conversion model comprises a pre-trained convolutional neural network model and a two-layer recurrent neural network based on bidirectional LSTM, and wherein inputting the Mel spectrogram of the source speaker into a trained spectral feature conversion model to convert the Mel spectrogram of the source speaker into a target Mel spectrogram comprises:
    passing the Mel spectrogram of the source speaker through the pre-trained neural network model for temporal compression;
    dividing the temporally compressed Mel spectrogram into frames according to the time sequence, adding the identity features of the target speaker to the Mel-frequency cepstral coefficient features of each frame, and inputting the result into the recurrent neural network for processing, the recurrent neural network converting the Mel-frequency cepstral coefficient features of the source speaker into the Mel-frequency cepstral coefficient features of the target speaker frame by frame to obtain the target Mel spectrogram.
  20. The computer-readable storage medium according to claim 19, wherein passing the Mel spectrogram of the source speaker through the pre-trained convolutional neural network model for temporal compression comprises:
    inputting the Mel spectrogram of the source speaker into the input layer of the convolutional neural network model, the Mel spectrogram passing successively through a 7×7 convolutional layer, a 3×3 max-pooling layer and four convolution modules, and finally outputting the temporally compressed Mel spectrogram at the softmax layer.
PCT/CN2019/102198 2019-05-22 2019-08-23 Speech synthesis method and apparatus, and computer readable storage medium WO2020232860A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910438778.3A CN110136690B (en) 2019-05-22 2019-05-22 Speech synthesis method, device and computer readable storage medium
CN201910438778.3 2019-05-22

Publications (1)

Publication Number Publication Date
WO2020232860A1 true WO2020232860A1 (en) 2020-11-26

Family

ID=67572945

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102198 WO2020232860A1 (en) 2019-05-22 2019-08-23 Speech synthesis method and apparatus, and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN110136690B (en)
WO (1) WO2020232860A1 (en)

Families Citing this family (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110136690B (en) * 2019-05-22 2023-07-14 平安科技(深圳)有限公司 Speech synthesis method, device and computer readable storage medium
CN111508466A (en) * 2019-09-12 2020-08-07 马上消费金融股份有限公司 Text processing method, device and equipment and computer readable storage medium
CN111048071B (en) * 2019-11-11 2023-05-30 京东科技信息技术有限公司 Voice data processing method, device, computer equipment and storage medium
CN111161702B (en) * 2019-12-23 2022-08-26 爱驰汽车有限公司 Personalized speech synthesis method and device, electronic equipment and storage medium
WO2021127811A1 (en) * 2019-12-23 2021-07-01 深圳市优必选科技股份有限公司 Speech synthesis method and apparatus, intelligent terminal, and readable medium
WO2021127978A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Speech synthesis method and apparatus, computer device and storage medium
CN111247584B (en) * 2019-12-24 2023-05-23 深圳市优必选科技股份有限公司 Voice conversion method, system, device and storage medium
WO2021128256A1 (en) * 2019-12-27 2021-07-01 深圳市优必选科技股份有限公司 Voice conversion method, apparatus and device, and storage medium
CN111433847B (en) * 2019-12-31 2023-06-09 深圳市优必选科技股份有限公司 Voice conversion method, training method, intelligent device and storage medium
CN110797002B (en) * 2020-01-03 2020-05-19 同盾控股有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN111179905A (en) * 2020-01-10 2020-05-19 北京中科深智科技有限公司 Rapid dubbing generation method and device
CN111261177A (en) * 2020-01-19 2020-06-09 平安科技(深圳)有限公司 Voice conversion method, electronic device and computer readable storage medium
CN111489734B (en) * 2020-04-03 2023-08-22 支付宝(杭州)信息技术有限公司 Model training method and device based on multiple speakers
CN111611431B (en) * 2020-04-16 2023-07-28 北京邮电大学 Music classification method based on deep learning
CN111710326B (en) * 2020-06-12 2024-01-23 携程计算机技术(上海)有限公司 English voice synthesis method and system, electronic equipment and storage medium
CN111785247A (en) * 2020-07-13 2020-10-16 北京字节跳动网络技术有限公司 Voice generation method, device, equipment and computer readable medium
CN111899715B (en) * 2020-07-14 2024-03-29 升智信息科技(南京)有限公司 Speech synthesis method
KR20230058401A (en) * 2020-07-31 2023-05-03 디티에스, 인코포레이티드 Signal transformation based on unique key-based network guidance and conditioning
CN111985231B (en) * 2020-08-07 2023-12-26 中移(杭州)信息技术有限公司 Unsupervised role recognition method and device, electronic equipment and storage medium
CN112071325B (en) * 2020-09-04 2023-09-05 中山大学 Many-to-many voice conversion method based on double voiceprint feature vector and sequence-to-sequence modeling
CN112037766B (en) * 2020-09-09 2022-03-04 广州方硅信息技术有限公司 Voice tone conversion method and related equipment
CN112634918B (en) * 2020-09-29 2024-04-16 江苏清微智能科技有限公司 System and method for converting voice of any speaker based on acoustic posterior probability
CN112309365B (en) * 2020-10-21 2024-05-10 北京大米科技有限公司 Training method and device of speech synthesis model, storage medium and electronic equipment
CN112289299B (en) * 2020-10-21 2024-05-14 北京大米科技有限公司 Training method and device of speech synthesis model, storage medium and electronic equipment
CN112562728B (en) * 2020-11-13 2024-06-18 百果园技术(新加坡)有限公司 Method for generating countermeasure network training, method and device for audio style migration
CN112509550A (en) * 2020-11-13 2021-03-16 中信银行股份有限公司 Speech synthesis model training method, speech synthesis device and electronic equipment
CN112562634B (en) * 2020-12-02 2024-05-10 平安科技(深圳)有限公司 Multi-style audio synthesis method, device, equipment and storage medium
CN112509600A (en) * 2020-12-11 2021-03-16 平安科技(深圳)有限公司 Model training method and device, voice conversion method and device and storage medium
CN112767918B (en) * 2020-12-30 2023-12-01 中国人民解放军战略支援部队信息工程大学 Russian Chinese language translation method, russian Chinese language translation device and storage medium
CN112908294B (en) * 2021-01-14 2024-04-05 杭州倒映有声科技有限公司 Speech synthesis method and speech synthesis system
CN112712813B (en) * 2021-03-26 2021-07-20 北京达佳互联信息技术有限公司 Voice processing method, device, equipment and storage medium
CN113178201B (en) * 2021-04-30 2024-06-28 平安科技(深圳)有限公司 Voice conversion method, device, equipment and medium based on non-supervision
CN113436607B (en) * 2021-06-12 2024-04-09 西安工业大学 Quick voice cloning method
CN113409759B (en) * 2021-07-07 2023-04-07 浙江工业大学 End-to-end real-time speech synthesis method
CN113470616B (en) * 2021-07-14 2024-02-23 北京达佳互联信息技术有限公司 Speech processing method and device, vocoder and training method of vocoder
CN113345416B (en) * 2021-08-02 2021-10-29 智者四海(北京)技术有限公司 Voice synthesis method and device and electronic equipment
CN114283822A (en) * 2021-12-24 2022-04-05 华东理工大学 Many-to-one voice conversion method based on gamma pass frequency cepstrum coefficient

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10186251B1 (en) * 2015-08-06 2019-01-22 Oben, Inc. Voice conversion using deep neural network with intermediate voice training
CN107481713B (en) * 2017-07-17 2020-06-02 清华大学 Mixed language voice synthesis method and device
US10796686B2 (en) * 2017-10-19 2020-10-06 Baidu Usa Llc Systems and methods for neural text-to-speech using convolutional sequence learning
CN109523993B (en) * 2018-11-02 2022-02-08 深圳市网联安瑞网络科技有限公司 Voice language classification method based on CNN and GRU fusion deep neural network
CN109584893B (en) * 2018-12-26 2021-09-14 南京邮电大学 VAE and i-vector based many-to-many voice conversion system under non-parallel text condition

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9082401B1 (en) * 2013-01-09 2015-07-14 Google Inc. Text-to-speech synthesis
CN105390141A (en) * 2015-10-14 2016-03-09 科大讯飞股份有限公司 Sound conversion method and sound conversion device
CN108108357A (en) * 2018-01-12 2018-06-01 京东方科技集团股份有限公司 Accent conversion method and device, electronic equipment
CN109473091A (en) * 2018-12-25 2019-03-15 四川虹微技术有限公司 A kind of speech samples generation method and device
CN110136690A (en) * 2019-05-22 2019-08-16 平安科技(深圳)有限公司 Phoneme synthesizing method, device and computer readable storage medium

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112652325A (en) * 2020-12-15 2021-04-13 平安科技(深圳)有限公司 Remote voice adjusting method based on artificial intelligence and related equipment
CN112652325B (en) * 2020-12-15 2023-12-15 平安科技(深圳)有限公司 Remote voice adjustment method based on artificial intelligence and related equipment
CN112652318A (en) * 2020-12-21 2021-04-13 北京捷通华声科技股份有限公司 Tone conversion method and device and electronic equipment
CN112652318B (en) * 2020-12-21 2024-03-29 北京捷通华声科技股份有限公司 Tone color conversion method and device and electronic equipment
CN112712812A (en) * 2020-12-24 2021-04-27 腾讯音乐娱乐科技(深圳)有限公司 Audio signal generation method, device, equipment and storage medium
CN112712812B (en) * 2020-12-24 2024-04-26 腾讯音乐娱乐科技(深圳)有限公司 Audio signal generation method, device, equipment and storage medium
CN113539231A (en) * 2020-12-30 2021-10-22 腾讯科技(深圳)有限公司 Audio processing method, vocoder, device, equipment and storage medium
CN112992177B (en) * 2021-02-20 2023-10-17 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of voice style migration model
CN112992177A (en) * 2021-02-20 2021-06-18 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of voice style migration model
CN113178200A (en) * 2021-04-28 2021-07-27 平安科技(深圳)有限公司 Voice conversion method, device, server and storage medium
CN113178200B (en) * 2021-04-28 2024-03-01 平安科技(深圳)有限公司 Voice conversion method, device, server and storage medium
CN113284499A (en) * 2021-05-24 2021-08-20 湖北亿咖通科技有限公司 Voice instruction recognition method and electronic equipment
CN113643687B (en) * 2021-07-08 2023-07-18 南京邮电大学 Non-parallel many-to-many voice conversion method integrating DSNet and EDSR networks
CN113643687A (en) * 2021-07-08 2021-11-12 南京邮电大学 Non-parallel many-to-many voice conversion method fusing DSNet and EDSR network
CN113611283A (en) * 2021-08-11 2021-11-05 北京工业大学 Voice synthesis method and device, electronic equipment and storage medium
CN113611283B (en) * 2021-08-11 2024-04-05 北京工业大学 Speech synthesis method, device, electronic equipment and storage medium
CN113658583B (en) * 2021-08-17 2023-07-25 安徽大学 Ear voice conversion method, system and device based on generation countermeasure network
CN113658583A (en) * 2021-08-17 2021-11-16 安徽大学 Method, system and device for converting ear voice based on generation countermeasure network
CN113488057B (en) * 2021-08-18 2023-11-14 山东新一代信息产业技术研究院有限公司 Conversation realization method and system for health care
CN113488057A (en) * 2021-08-18 2021-10-08 山东新一代信息产业技术研究院有限公司 Health-oriented conversation implementation method and system
CN113837299B (en) * 2021-09-28 2023-09-01 平安科技(深圳)有限公司 Network training method and device based on artificial intelligence and electronic equipment
CN113837299A (en) * 2021-09-28 2021-12-24 平安科技(深圳)有限公司 Network training method and device based on artificial intelligence and electronic equipment
CN117745904A (en) * 2023-12-14 2024-03-22 山东浪潮超高清智能科技有限公司 2D playground speaking portrait synthesizing method and device

Also Published As

Publication number Publication date
CN110136690B (en) 2023-07-14
CN110136690A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
WO2020232860A1 (en) Speech synthesis method and apparatus, and computer readable storage medium
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
US20240038218A1 (en) Speech model personalization via ambient context harvesting
US11205444B2 (en) Utilizing bi-directional recurrent encoders with multi-hop attention for speech emotion recognition
CN111833845B (en) Multilingual speech recognition model training method, device, equipment and storage medium
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
CN112712813B (en) Voice processing method, device, equipment and storage medium
Wu et al. Audio classification using attention-augmented convolutional neural network
CN112204653A (en) Direct speech-to-speech translation through machine learning
CN111081230B (en) Speech recognition method and device
WO2022142850A1 (en) Audio processing method and apparatus, vocoder, electronic device, computer readable storage medium, and computer program product
CN113837299B (en) Network training method and device based on artificial intelligence and electronic equipment
WO2021127982A1 (en) Speech emotion recognition method, smart device, and computer-readable storage medium
US20230087916A1 (en) Transforming text data into acoustic feature
WO2023226239A1 (en) Object emotion analysis method and apparatus and electronic device
CN111144124A (en) Training method of machine learning model, intention recognition method, related device and equipment
CN113539232B (en) Voice synthesis method based on lesson-admiring voice data set
US20230237993A1 (en) Systems and Methods for Training Dual-Mode Machine-Learned Speech Recognition Models
CN115602165A (en) Digital staff intelligent system based on financial system
JP2022037862A (en) Method, system, and computer readable storage media for distilling longitudinal section type spoken language understanding knowledge utilizing text-based pre-learning model
WO2023102931A1 (en) Method for predicting prosodic structure, and electronic device, program product and storage medium
Cornell et al. Implicit acoustic echo cancellation for keyword spotting and device-directed speech detection
JP2021157145A (en) Inference device and learning method of inference device
CN113948064A (en) Speech synthesis and speech recognition
CN113823271B (en) Training method and device for voice classification model, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19929919; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19929919; Country of ref document: EP; Kind code of ref document: A1)