CN110767210A - Method and device for generating personalized voice - Google Patents
- Publication number
- CN110767210A (application CN201911046823.7A)
- Authority
- CN
- China
- Prior art keywords
- model
- voice
- personalized
- training
- vocoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention discloses a method and a device for generating personalized speech. The feature vector of the target voice is combined with the text feature vector, and the end-to-end text-feature-to-audio-feature unit performs adaptive learning on the trained hybrid end-to-end model, which is equivalent to adaptive learning on the input closest to the target voice characteristics. Through the personalized vocoder unit, the loss of vocoder synthesis is reduced and the naturalness of the synthesized speech is improved.
Description
Technical Field
The invention relates to the technical field of voice personalization, in particular to a method and a device for generating personalized voice.
Background
With the development of the smart home, voice personalization technology is applied in more and more fields. The development of voice broadcasting technology greatly facilitates people's lives and improves quality of life. Most existing voice personalization technologies extract voiceprint features from parallel corpora of the personalization target and the source speaker and then apply a matrix transformation; such methods, for example those based on DTW, demand large amounts of voice corpus data and are time-consuming.
Disclosure of Invention
In view of the above, the invention provides a small-corpus personalization method and device based on speaker characteristics, solving the problems that existing voice personalization algorithms require clean voice data and that training takes a long time.
The invention solves the problems through the following technical scheme: a method of generating personalized speech, the method comprising the steps of:
step a, collecting target sample speech and large-scale sample speech, and extracting the sample acoustic features corresponding to both;
step b, training a speech feature extraction model with the sample acoustic features of both corpora to generate the corresponding sample voice feature vectors;
step c, training a hybrid end-to-end model from text features to acoustic features using the voice feature vectors of the large-scale sample speech together with the texts corresponding to it;
step d, inputting the acoustic features generated by the hybrid end-to-end model into a neural network vocoder model, which outputs audio codes, and training a vocoder average model;
step e, performing adaptive model training on top of the hybrid end-to-end model using the voice feature vector of the target sample speech and its corresponding text, yielding a personalized end-to-end model;
step f, generating the target's acoustic features with the personalized end-to-end model and adaptively training the vocoder average model, yielding a personalized vocoder model;
step g, in the synthesis stage, combining the feature vector of the desired text with the target's voice feature vector as input, obtaining the target's acoustic features through the personalized end-to-end model, and outputting the desired target speech via the personalized vocoder model.
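As an informal illustration of the data flow in steps a-g, the pipeline can be sketched with linear maps standing in for the trained models; all dimensions, weights, and function names below are hypothetical, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not specified in the patent.
TEXT_DIM, SPK_DIM, ACOUSTIC_DIM = 64, 32, 80

# Steps c/e: the (personalized) end-to-end model is approximated here by one
# linear map from [text feature ; speaker feature] to acoustic features.
W_e2e = rng.standard_normal((TEXT_DIM + SPK_DIM, ACOUSTIC_DIM)) * 0.01

# Steps d/f: the (personalized) vocoder is approximated by a linear map from
# acoustic features to audio sample codes.
W_voc = rng.standard_normal((ACOUSTIC_DIM, 1)) * 0.01

def synthesize(text_vecs, spk_vec):
    """Step g: combine the text features with the target speaker vector,
    run the end-to-end model, then the vocoder."""
    joint = np.concatenate(
        [text_vecs, np.tile(spk_vec, (len(text_vecs), 1))], axis=1)
    acoustic = joint @ W_e2e             # personalized end-to-end model
    audio = (acoustic @ W_voc).ravel()   # personalized vocoder model
    return audio

text = rng.standard_normal((10, TEXT_DIM))  # 10 text feature frames
spk = rng.standard_normal(SPK_DIM)          # target voice feature vector
print(synthesize(text, spk).shape)          # (10,)
```

The real models are deep networks trained as in steps b-f; the sketch only shows how the intermediate representations are combined and passed along.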
Preferably, in step b, the speech feature extraction model inputs the sample acoustic features obtained in step a into a deep speech recognition model and then trains them with a deep learning network to obtain the sample voice feature vectors corresponding to the different sample acoustic features.
Preferably, the deep learning network comprises: a convolutional neural network for extracting features from the audio of the original speech; a weight calculation network for processing the convolutional information to obtain a weight for each convolutional feature and discarding the speech signal with the minimum weight; the original signal is turned into K feature vectors, which are matrix-multiplied with the obtained weights to yield the dimensionality-reduced features; and, in combination with the loss function, the corresponding N-dimensional target voice features are obtained.
Preferably, in step c, an end-to-end neural network is adopted: the feature vectors of the text and of the acoustic features are combined, a limited-range attention mechanism is used inside the end-to-end network, and the combined features are decoded with the weights obtained from the attention mechanism to produce the acoustic features at the output end.
Preferably, in step f, a recurrent neural network is used to predict the coded values of the audio from the target's acoustic features, and a personalized vocoder model is trained in combination with the output target audio; during training, the generated acoustic features are blurred and a small amount of interference spectrum is inserted.
Further, the present invention also provides an apparatus for generating personalized speech that uses the foregoing method. The apparatus comprises: a voice acquisition and extraction unit for collecting target sample speech and large-scale sample speech and extracting their acoustic features; a speaker audio feature extraction unit for training the speech feature extraction model and extracting voice feature vectors; an end-to-end text-feature-to-audio-feature unit for training the hybrid end-to-end model from text features to acoustic features; a vocoder unit for training the vocoder average model; a personalized end-to-end text-feature-to-audio-feature unit for training the personalized end-to-end model from target text features to acoustic features; a personalized vocoder unit for training the personalized vocoder model; and a speech synthesis unit, connected to the personalized vocoder unit, for producing the personalized speech.
The beneficial effect of the invention is that it can be applied to, but is not limited to, the field of speech personalization.
Drawings
FIG. 1 is a block diagram of a process for training models in generating personalized speech according to an embodiment of the present invention;
FIG. 2 is a main framework for generating a personalized voice network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a voice feature extraction network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an end-to-end network provided by an embodiment of the present invention;
fig. 5 is a block diagram of an apparatus for generating personalized speech according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
In a first embodiment, referring to fig. 1 and 2, the present invention provides a method for generating personalized speech, the method comprising the steps of:
Step a: collect target sample speech and large-scale sample speech, and extract the corresponding sample acoustic features. Speech from multiple speakers is recorded in a studio as the large-scale sample speech, preferably at a sampling rate of 16000 Hz or higher, and the collected target sample speech should, as far as possible, cover all combinations of Chinese initials, finals, and tones that form Chinese phonemes. The extracted sample acoustic features include Mel features, linear prediction coefficient features, and the like. The Mel features are extracted with a windowed, framed Fourier transform, which converts the time domain into the frequency domain; when the audio features are extracted, the Mel features have 40 to 80 dimensions, and the linear-prediction-coefficient input is limited to N-scale cepstrum coefficients and M pitch parameters (such as period and correlation).
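The windowed, framed Fourier transform and mel warping described above can be sketched as follows; the frame length, hop size, and filterbank construction are common defaults, not values fixed by the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters evenly spaced on the mel scale, 0 Hz .. Nyquist.
    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, c):
            fb[i, k] = (k - lo) / max(c - lo, 1)   # rising slope
        for k in range(c, hi):
            fb[i, k] = (hi - k) / max(hi - c, 1)   # falling slope
    return fb

def log_mel(wave, sr=16000, n_fft=512, hop=256, n_mels=80):
    # Windowed, framed Fourier transform (Hann window), then mel warping.
    n_frames = 1 + (len(wave) - n_fft) // hop
    win = np.hanning(n_fft)
    frames = np.stack(
        [wave[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # power spectrogram
    return np.log(spec @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)

sr = 16000
wave = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s of a 440 Hz tone
feats = log_mel(wave)
print(feats.shape)  # (61, 80): frames x mel dimensions
```

With `n_mels` between 40 and 80 this matches the 40-80-dimensional Mel features mentioned in the text; production systems typically use a library implementation (e.g. librosa) rather than this hand-rolled version.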
Step b: train a speech feature extraction model with the sample acoustic features of both corpora to generate the corresponding sample voice feature vectors. Referring to fig. 3, the speech feature extraction model inputs the sample acoustic features obtained in step a into a deep speech recognition model and then trains them with a deep learning network to obtain the sample voice feature vectors corresponding to the different sample acoustic features. The deep learning network comprises: a convolutional neural network for extracting features from the audio of the original speech; a weight calculation network for processing the convolutional information to obtain a weight for each convolutional feature and discarding the speech signal with the minimum weight; the original signal is turned into K feature vectors, which are matrix-multiplied with the obtained weights to yield the dimensionality-reduced features; and, in combination with the loss function, the corresponding N-dimensional target voice features are obtained.
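A minimal sketch of the weighting scheme described for the deep learning network, with random stand-ins for the convolutional features, the weight-computation network, and the trained projection (all names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def speaker_embedding(conv_feats, w_score, proj):
    """Attention-style pooling sketch of the extraction network.

    conv_feats: (K, D) -- K convolutional feature vectors from the audio.
    w_score:    (D,)   -- stand-in for the weight calculation network.
    proj:       (D, N) -- projection to the N-dim voice feature
                          (learned via the loss function in practice).
    """
    scores = conv_feats @ w_score
    weights = np.exp(scores) / np.exp(scores).sum()  # one weight per feature
    weights[np.argmin(weights)] = 0.0                # drop the minimum-weight signal
    weights /= weights.sum()                         # renormalize
    pooled = weights @ conv_feats                    # (K,)x(K,D)->(D,): reduced
    return pooled @ proj                             # N-dim target voice feature

K, D, N = 16, 64, 8
emb = speaker_embedding(rng.standard_normal((K, D)),
                        rng.standard_normal(D),
                        rng.standard_normal((D, N)) * 0.1)
print(emb.shape)  # (8,)
```

The weighted matrix multiplication is what performs the dimensionality reduction from K feature vectors to a single voice feature vector.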
Step c: train a hybrid end-to-end model from text features to acoustic features using the voice feature vectors of the large-scale sample speech together with the corresponding texts. Referring to figs. 2 and 4, an end-to-end neural network is adopted: the feature vectors of the text and of the acoustic features are combined, a limited-range attention mechanism is used inside the end-to-end network, and the combined features are decoded with the weights obtained from the attention mechanism to produce the acoustic features at the output end. The voice feature vectors of the large-scale sample speech are obtained from its audio acoustic features, preferably at an audio sampling rate of 22 kHz or higher.
Step d: feed the acoustic features generated by the hybrid end-to-end model into a neural network vocoder model, which outputs audio codes, and train a vocoder average model. A neural network predicts the coded values of the audio from the target acoustic features, and the target audio is assembled from these outputs.
Step e: perform adaptive model training on top of the hybrid end-to-end model using the voice feature vector of the target sample speech and its corresponding text, yielding a personalized end-to-end model. A feature vector of the target sample speech is generated through the feature extraction network; the target's voice feature vector is combined with the text vector of the target sample speech as the input of the hybrid model, adaptive learning is performed with the target's linear prediction coefficient features as the output, and the personalized end-to-end model is obtained.
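The adaptive (fine-tuning) step can be illustrated with a linear model standing in for the end-to-end network: start from weights trained on the large-scale corpus and take a few gradient steps on the small target corpus only. All dimensions and hyperparameters below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def adapt(W_base, X_target, Y_target, lr=0.05, steps=200):
    """Adaptive learning sketch: start from the hybrid (average) model
    weights and fine-tune on the small target-speaker corpus with plain
    gradient descent on the mean squared error."""
    W = W_base.copy()
    for _ in range(steps):
        err = X_target @ W - Y_target                 # error on target data
        W -= lr * X_target.T @ err / len(X_target)    # MSE gradient step
    return W

# Hybrid model trained elsewhere on large-scale data (random stand-in here).
W_base = rng.standard_normal((12, 4)) * 0.1
# Small target corpus whose true mapping differs from the base model.
W_true = rng.standard_normal((12, 4))
X = rng.standard_normal((50, 12))
Y = X @ W_true
W_pers = adapt(W_base, X, Y)
before = np.mean((X @ W_base - Y) ** 2)
after = np.mean((X @ W_pers - Y) ** 2)
print(after < before)  # adaptation reduces the target-speaker loss
```

Because only a few steps on a small corpus are needed, this is much cheaper than retraining the full model, which is the efficiency argument the patent makes.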
Step f: generate the target's acoustic features with the personalized end-to-end model and adaptively train the vocoder average model, yielding a personalized vocoder model. A recurrent neural network predicts the coded values of the audio from the target's acoustic features, and the personalized vocoder model is trained in combination with the output target audio; during training, the generated acoustic features are blurred and a small amount of interference spectrum is inserted.
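The blurring and interference-spectrum insertion used during vocoder adaptation might look like the following augmentation; the blur kernel, noise scale, and insertion probability are assumptions, as the patent gives no concrete values:

```python
import numpy as np

rng = np.random.default_rng(3)

def blur_and_perturb(mel, noise_scale=0.01, p=0.05):
    """Training-time augmentation sketch for the vocoder adaptation step:
    blur the generated acoustic features along time and insert a small
    amount of interference into a random fraction of bins."""
    # Simple 3-tap moving-average blur along the time axis.
    blurred = mel.copy()
    blurred[1:-1] = (mel[:-2] + mel[1:-1] + mel[2:]) / 3.0
    # Insert low-amplitude interference into roughly a fraction p of bins.
    mask = rng.random(mel.shape) < p
    blurred[mask] += rng.standard_normal(mask.sum()) * noise_scale
    return blurred

mel = rng.standard_normal((100, 80))  # generated acoustic features (frames x bins)
aug = blur_and_perturb(mel)
print(aug.shape)  # (100, 80)
```

The intent of such augmentation is to make the vocoder robust to the imperfect acoustic features the end-to-end model produces at synthesis time, rather than only to clean ground-truth features.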
In the personalization process, the speech feature extraction model needs no adaptive training; only the text-to-speech-feature personalized end-to-end model and the personalized vocoder model require adaptive learning.
Step g: in the synthesis stage, combine the feature vector of the desired text with the target's voice feature vector as input, obtain the target's acoustic features through the personalized end-to-end model, and output the desired target speech via the personalized vocoder model. The target's voice feature vector may be obtained from the speech feature extraction model at synthesis time, or it may be a feature vector previously generated for the target by that model.
The end-to-end network used by both the hybrid end-to-end model and the personalized end-to-end model is shown in fig. 4. Specifically, the text features of the phonemes obtained by converting the text are combined with the voice feature vectors generated in the previous step as the input of the end-to-end network.
The end-to-end network is divided into three parts: an encoder, a decoder, and a back-end processor. An attention mechanism is used within the decoder: a window is placed around the previous maximum-weight point and the next maximum-weight point is searched for inside it, which improves alignment efficiency. The network adopts a recurrent structure, increases the dimensionality of the voice features, adds feature-vector terms to the loss function to improve the training fit, and uses Mel features to improve the naturalness of the synthesized voice.
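The limited-range attention described here, a window around the previous maximum-weight point, can be sketched as a masking step before normalization; the window width is an illustrative hyperparameter:

```python
import numpy as np

def windowed_attention(scores, prev_peak, width=5):
    """Limited-range attention sketch: keep only the scores inside a
    window around the previous maximum-weight point, then renormalize
    with a softmax. Returns the weights and the new peak position."""
    T = len(scores)
    lo, hi = max(0, prev_peak - width), min(T, prev_peak + width + 1)
    masked = np.full(T, -np.inf)
    masked[lo:hi] = scores[lo:hi]          # everything outside the window is masked
    w = np.exp(masked - masked[lo:hi].max())
    w /= w.sum()
    return w, int(np.argmax(w))

scores = np.linspace(0.0, 1.0, 50)  # raw alignment scores for 50 encoder steps
w, peak = windowed_attention(scores, prev_peak=10)
print(peak)  # 15 -- the window [5, 15] caps how far the peak can jump
```

Restricting the search to a window both speeds up alignment and enforces the near-monotonic text-to-speech alignment that synthesis requires.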
In the second embodiment, the invention further provides a device for generating personalized voice, which is shown in fig. 5. The device can adopt the method for generating the personalized voice, and the device further comprises: the voice acquisition unit and the extraction unit are used for acquiring target sample voice and large-scale sample voice and extracting acoustic characteristics of the voice; the speaker audio feature extraction unit is used for training a voice feature extraction model and extracting a voice feature vector; an end-to-end text feature to audio feature unit for training a mixed end-to-end model from text features to acoustic features; a vocoder unit for training generation of a vocoder average model; the personalized end-to-end text feature to audio feature unit is used for training a personalized end-to-end model from the target text feature to the acoustic feature; a personalized vocoder unit for training a personalized vocoder model; and the voice synthesis unit is connected with the personalized vocoder unit and is used for realizing personalized voice.
With the device of the second embodiment, the voice audio is adaptively learned on top of the base model from a small amount of corpus data; the user can personalize the speech signal in a short time without further corpora, and the MOS (mean opinion score) of the synthesized voice reaches about 4.0.
In this method, the feature vector of the target voice is combined with the text feature vector, and the end-to-end text-feature-to-audio-feature unit performs adaptive learning on the trained hybrid end-to-end model, which is equivalent to adaptive learning on the input closest to the target voice characteristics. This reduces the time required for adaptive learning, reduces the feedback loss of the neural network fit, reduces the adjustment amplitude of the neural network, and improves the accuracy of adaptive learning. Through the personalized vocoder unit, the loss of vocoder synthesis is reduced and the naturalness of the synthesized speech is improved.
Although the present invention has been described with reference to the illustrated embodiments, which are preferred embodiments, the invention is not limited thereto; numerous other modifications and embodiments can be devised by those skilled in the art that fall within the spirit and scope of the principles of this disclosure.
Claims (6)
1. A method of generating personalized speech, the method comprising the steps of:
step a, collecting target sample speech and large-scale sample speech, and extracting the sample acoustic features corresponding to both;
step b, training a speech feature extraction model with the sample acoustic features of both corpora to generate the corresponding sample voice feature vectors;
step c, training a hybrid end-to-end model from text features to acoustic features using the voice feature vectors of the large-scale sample speech together with the texts corresponding to it;
step d, inputting the acoustic features generated by the hybrid end-to-end model into a neural network vocoder model, which outputs audio codes, and training a vocoder average model;
step e, performing adaptive model training on top of the hybrid end-to-end model using the voice feature vector of the target sample speech and its corresponding text, yielding a personalized end-to-end model;
step f, generating the target's acoustic features with the personalized end-to-end model and adaptively training the vocoder average model, yielding a personalized vocoder model;
step g, in the synthesis stage, combining the feature vector of the desired text with the target's voice feature vector as input, obtaining the target's acoustic features through the personalized end-to-end model, and outputting the desired target speech via the personalized vocoder model.
2. The method of claim 1, wherein in step b the speech feature extraction model inputs the sample acoustic features obtained in step a into a deep speech recognition model and trains them with a deep learning network to obtain the sample voice feature vectors corresponding to the different sample acoustic features.
3. The method of generating personalized speech according to claim 2, wherein the deep learning network comprises: a convolutional neural network for extracting features from the audio of the original speech; a weight calculation network for processing the convolutional information to obtain a weight for each convolutional feature and discarding the speech signal with the minimum weight; the original signal is turned into K feature vectors, which are matrix-multiplied with the obtained weights to yield the dimensionality-reduced features; and, in combination with the loss function, the corresponding N-dimensional target voice features are obtained.
4. The method of claim 1, wherein in step c an end-to-end neural network is adopted: the feature vectors of the text and of the acoustic features are combined, a limited-range attention mechanism is used inside the end-to-end neural network, and the combined features are decoded with the weights obtained from the attention mechanism to produce the acoustic features at the output end.
5. The method of claim 1, wherein in step f a recurrent neural network is used to predict the coded values of the audio from the target's acoustic features, a personalized vocoder model is trained in combination with the output target audio, and during training the generated acoustic features are blurred and a small amount of interference spectrum is inserted.
6. An apparatus for generating personalized speech, wherein the method of any one of claims 1-5 is employed, the apparatus comprising: a voice acquisition and extraction unit for collecting target sample speech and large-scale sample speech and extracting their acoustic features; a speaker audio feature extraction unit for training the speech feature extraction model and extracting voice feature vectors; an end-to-end text-feature-to-audio-feature unit for training the hybrid end-to-end model from text features to acoustic features; a vocoder unit for training the vocoder average model; a personalized end-to-end text-feature-to-audio-feature unit for training the personalized end-to-end model from target text features to acoustic features; a personalized vocoder unit for training the personalized vocoder model; and a speech synthesis unit, connected to the personalized vocoder unit, for producing the personalized speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911046823.7A CN110767210A (en) | 2019-10-30 | 2019-10-30 | Method and device for generating personalized voice |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911046823.7A CN110767210A (en) | 2019-10-30 | 2019-10-30 | Method and device for generating personalized voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110767210A true CN110767210A (en) | 2020-02-07 |
Family
ID=69334723
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911046823.7A Pending CN110767210A (en) | 2019-10-30 | 2019-10-30 | Method and device for generating personalized voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110767210A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111326138A (en) * | 2020-02-24 | 2020-06-23 | 北京达佳互联信息技术有限公司 | Voice generation method and device |
CN111462727A (en) * | 2020-03-31 | 2020-07-28 | 北京字节跳动网络技术有限公司 | Method, apparatus, electronic device and computer readable medium for generating speech |
CN111462728A (en) * | 2020-03-31 | 2020-07-28 | 北京字节跳动网络技术有限公司 | Method, apparatus, electronic device and computer readable medium for generating speech |
CN111739536A (en) * | 2020-05-09 | 2020-10-02 | 北京捷通华声科技股份有限公司 | Audio processing method and device |
CN111785258A (en) * | 2020-07-13 | 2020-10-16 | 四川长虹电器股份有限公司 | Personalized voice translation method and device based on speaker characteristics |
CN112687296A (en) * | 2021-03-10 | 2021-04-20 | 北京世纪好未来教育科技有限公司 | Audio disfluency identification method, device, equipment and readable storage medium |
WO2021169825A1 (en) * | 2020-02-25 | 2021-09-02 | 阿里巴巴集团控股有限公司 | Speech synthesis method and apparatus, device and storage medium |
CN113409767A (en) * | 2021-05-14 | 2021-09-17 | 北京达佳互联信息技术有限公司 | Voice processing method and device, electronic equipment and storage medium |
CN113488057A (en) * | 2021-08-18 | 2021-10-08 | 山东新一代信息产业技术研究院有限公司 | Health-oriented conversation implementation method and system |
WO2022094740A1 (en) * | 2020-11-03 | 2022-05-12 | Microsoft Technology Licensing, Llc | Controlled training and use of text-to-speech models and personalized model generated voices |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140114663A1 (en) * | 2012-10-19 | 2014-04-24 | Industrial Technology Research Institute | Guided speaker adaptive speech synthesis system and method and computer program product |
JP2015018080A (en) * | 2013-07-10 | 2015-01-29 | 日本電信電話株式会社 | Speech synthesis model learning device and speech synthesis device, and method and program thereof |
CN107481713A (en) * | 2017-07-17 | 2017-12-15 | 清华大学 | A kind of hybrid language phoneme synthesizing method and device |
CN107564511A (en) * | 2017-09-25 | 2018-01-09 | 平安科技(深圳)有限公司 | Electronic installation, phoneme synthesizing method and computer-readable recording medium |
CN109346056A (en) * | 2018-09-20 | 2019-02-15 | 中国科学院自动化研究所 | Phoneme synthesizing method and device based on depth measure network |
CN110148398A (en) * | 2019-05-16 | 2019-08-20 | 平安科技(深圳)有限公司 | Training method, device, equipment and the storage medium of speech synthesis model |
CN110379411A (en) * | 2018-04-11 | 2019-10-25 | 阿里巴巴集团控股有限公司 | For the phoneme synthesizing method and device of target speaker |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110767210A (en) | Method and device for generating personalized voice | |
Han et al. | Semantic-preserved communication system for highly efficient speech transmission | |
Kelly et al. | Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors | |
CN113450761B (en) | Parallel voice synthesis method and device based on variation self-encoder | |
CN114023300A (en) | Chinese speech synthesis method based on diffusion probability model | |
KR102272554B1 (en) | Method and system of text to multiple speech | |
CN114495969A (en) | Voice recognition method integrating voice enhancement | |
CN116994553A (en) | Training method of speech synthesis model, speech synthesis method, device and equipment | |
CN111326170A (en) | Method and device for converting ear voice into normal voice by combining time-frequency domain expansion convolution | |
KR20190135853A (en) | Method and system of text to multiple speech | |
CN111724809A (en) | Vocoder implementation method and device based on variational self-encoder | |
CN117041430B (en) | Method and device for improving outbound quality and robustness of intelligent coordinated outbound system | |
CN116994600B (en) | Method and system for driving character mouth shape based on audio frequency | |
Oura et al. | Deep neural network based real-time speech vocoder with periodic and aperiodic inputs | |
WO2022228704A1 (en) | Decoder | |
CN117409761A (en) | Method, device, equipment and storage medium for synthesizing voice based on frequency modulation | |
Mei et al. | A particular character speech synthesis system based on deep learning | |
CN112767912A (en) | Cross-language voice conversion method and device, computer equipment and storage medium | |
Fujiwara et al. | Data augmentation based on frequency warping for recognition of cleft palate speech | |
CN115359775A (en) | End-to-end tone and emotion migration Chinese voice cloning method | |
CN115359778A (en) | Confrontation and meta-learning method based on speaker emotion voice synthesis model | |
Nikitaras et al. | Fine-grained noise control for multispeaker speech synthesis | |
Nijhawan et al. | Real time speaker recognition system for hindi words | |
CN117909486B (en) | Multi-mode question-answering method and system based on emotion recognition and large language model | |
Avikal et al. | Estimation of age from speech using excitation source features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200207 |