CN112992118B - Speech model training and synthesis method with limited corpus data - Google Patents

Speech model training and synthesis method with limited corpus data

Info

Publication number
CN112992118B
Authority
CN
China
Prior art keywords
model
training
tone
sample
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110561416.0A
Other languages
Chinese (zh)
Other versions
CN112992118A (en)
Inventor
曹艳艳 (Cao Yanyan)
陈佩云 (Chen Peiyun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chipintelli Technology Co Ltd
Original Assignee
Chipintelli Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chipintelli Technology Co Ltd filed Critical Chipintelli Technology Co Ltd
Priority to CN202110561416.0A priority Critical patent/CN112992118B/en
Publication of CN112992118A publication Critical patent/CN112992118A/en
Application granted granted Critical
Publication of CN112992118B publication Critical patent/CN112992118B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 - Vocoder architecture
    • G10L19/18 - Vocoders using multiple modes

Abstract

A speech model training and synthesis method with limited corpus data, comprising model training and speech synthesis. The model training comprises the following steps: S1, collecting a training sample set; S2, phonemizing the text of each sample and extracting Mel features from its audio; S3, training a speech model to obtain a generalized model MA; S4, fine-tuning the reference-timbre samples on the basis of the generalized model MA to obtain a reference model MB; S5, classifying all samples of the training sample set by timbre and training a timbre conversion model MTR; and S6, training on all samples of the training sample set to obtain a personalized vocoder model MG for each timbre. The invention needs a large data volume only for the reference timbre; the other timbres can be trained with very little corpus data while still obtaining the models required to synthesize audio, which shortens model training time, and the conversion-model and personalized-vocoder training improves the subsequent speech synthesis quality.

Description

Speech model training and synthesis method with limited corpus data
Technical Field
The invention belongs to the technical field of speech processing, relates to speech synthesis technology, and in particular relates to a method for training and synthesizing a speech model with limited corpus data.
Background
In the field of artificial intelligence, speech enhancement and speech synthesis have long been topics of interest both to researchers and to the voice-interaction product market. In recent years, deep learning has driven rapid progress in artificial intelligence, and speech synthesis has made breakthrough advances: in certain scenarios the synthesized speech can be nearly indistinguishable from a real human voice. Speech synthesis technology is now widely applied in news broadcasting, audiobooks, dubbing and other fields.
Compared with traditional speech synthesis methods, synthesis based on deep learning requires little linguistic or signal-processing expertise and no manual linguistic annotation; an end-to-end pipeline takes text directly as input and produces the corresponding audio through deep model computation, with a synthesis quality superior to traditional speech synthesis algorithms.
However, the deep learning synthesis approach also has drawbacks: it is difficult to optimize specifically for texts that synthesize poorly, a large amount of high-quality original corpus data is required, and the dependence on the corpus is strong; a training set of insufficient quality or quantity cannot fit the large number of parameters of an end-to-end model. In practical applications, customers often have many requirements on timbre, including age and gender (male, female, old, young), timbre style (gentle, lovely, serious, etc.) and language (Chinese, English, Japanese, etc.), and the workload of collecting so many corpora is large. Moreover, mixed-language synthesis usually requires a speaker fluent in multiple languages, which is difficult to arrange.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention discloses a speech model training and synthesis method with limited corpus data.
The invention discloses a method for training and synthesizing a speech model with limited corpus data, which comprises model training and speech synthesis;
the model training comprises the following steps:
S1, collecting a training sample set, wherein the training sample set comprises samples of a plurality of timbres, each sample comprises a text and a corresponding audio file, the sample data of at least one timbre meets a reference-timbre standard, and the reference-timbre standard is that the corpus data of that timbre has a large sample volume and high quality;
S2, phonemizing the text of each sample to obtain a phonemized text; extracting Mel features from the audio file of each sample by the same method; and selecting one of the timbres meeting the reference-timbre standard as the reference timbre;
S3, training a speech model to obtain a generalized model, the training method being: with the phonemized texts of all samples as input and the Mel features of the corresponding audio as output, train the speech model;
S4, fine-tuning the reference-timbre samples on the basis of the generalized model to obtain a reference model;
S5, classifying all samples of the training sample set by timbre and training timbre conversion models, one conversion model per timbre;
and S6, training a generalized vocoder model with all samples of the training sample set, and then fine-tuning the generalized vocoder model with the samples of each timbre to obtain a personalized vocoder model for each timbre.
Preferably: the speech model in step S3 is either a Tacotron model or a FastSpeech model.
Preferably: the conversion model trained in step S5 is a StarGAN-VC model.
Preferably: the reference-timbre standard is that the audio data of the sample exceeds 10 hours in duration.
Preferably, the audio file of each sample exceeds 10 minutes in duration.
Preferably, the texts of the samples in the training sample set are all different.
Preferably, the speech synthesis comprises the steps of:
S7, preprocessing the text to be synthesized to obtain a phonemized text, and feeding it into the reference model MB to obtain the Mel features of the reference timbre for the text to be synthesized;
S8, feeding the Mel features obtained in step S7 into the conversion model MTR corresponding to the target timbre to obtain the Mel features of the target timbre;
and S9, feeding the Mel features of the target timbre obtained in step S8 into the personalized vocoder model MG of the corresponding timbre, thereby synthesizing speech with the designated timbre.
Compared with traditional speech synthesis methods, the method generates a reference model from the reference timbre; a large data volume is needed only for the reference timbre, while the other timbres can be trained with very little corpus data and still yield the models required to synthesize audio. This shortens model training time, and the conversion-model and personalized-vocoder training improves the subsequent speech synthesis quality.
Drawings
FIG. 1 is a flow diagram of an embodiment of the speech model training and synthesis method with limited corpus data according to the present invention.
Detailed Description
The following provides a more detailed description of the present invention.
The invention relates to a method for training and synthesizing a speech model with limited corpus data, which comprises the following steps:
S1, collecting a training sample set, wherein the training sample set comprises a plurality of timbres, each sample comprises a text and a corresponding audio file, the sample data of at least one timbre meets a reference-timbre standard, the reference-timbre standard being a preset standard;
S2, phonemizing the text of each sample to obtain a phonemized text; extracting Mel features from the audio file of each sample by the same method; and selecting one of the timbres meeting the reference-timbre standard as the reference timbre;
S3, training a speech model to obtain a generalized model MA, the training method being: with the phonemized texts of all samples as input and the Mel features of the corresponding audio as output, train the speech model;
S4, fine-tuning the reference-timbre samples on the basis of the generalized model MA to obtain a reference model MB;
S5, classifying all samples of the training sample set by timbre and training timbre conversion models MTR, one conversion model MTR per timbre;
and S6, training a generalized vocoder model MVN with all samples of the training sample set, and then fine-tuning the generalized vocoder model MVN with the samples of each timbre to obtain a personalized vocoder model MG for each timbre.
A specific implementation covering model training and speech synthesis is as follows:
1) Prepare corpus data of several target timbres as the training sample set for generalized-model training, with one timbre B serving as the reference timbre for fine-tuning the generalized model. The corpus of the reference timbre should have a large sample volume and high quality, generally more than 10 hours. If a multilingual model is to be trained, corpus data for each language must be provided.
As noted in the background, timbre in speech synthesis refers to different pronunciation types; strictly speaking, every person's timbre is different even at the same age, gender and language. Traditional speech synthesis requires a large amount of training data for every timbre in order to synthesize accurate pronunciation, whereas the scheme of the invention only needs a large amount of data for a single timbre, effectively saving training cost.
The corpus data of each sample in the training sample set consists of a text and the audio corresponding to that text.
2) Normalize the text of each sample of the training sample set: numbers, units, special characters and the like are normalized to phonemes; Chinese characters are converted to the corresponding pinyin, and other languages are converted to their corresponding phonemes, e.g. English is converted to phonetic symbols. The normalized text yields the phonemized text.
Phoneme extraction for each language must guarantee the uniqueness of the phonemes. If the same phoneme symbol occurs in several languages, it must be relabeled so the languages remain distinct; for example, if the same phoneme t exists in Chinese, English and Japanese, the Chinese, English and Japanese instances can be relabeled t1, t2 and t3 respectively. After phoneme extraction, the phonemized text of each training sample is obtained (a minimal sketch of such language-tagged phonemization is given below). The audio is then processed: the sampling rate is normalized, the optimal parameters are matched to the chosen sampling rate, and the Mel features are extracted.
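A minimal sketch of language-tagged phonemization, assuming tiny hypothetical lexicons; a real system would use a full pinyin converter and pronunciation dictionaries, and the lexicon entries and tag format shown here are illustrative, not taken from the patent:

```python
from typing import List

PINYIN = {"你": "ni3", "好": "hao3"}          # hypothetical Chinese lexicon
ENGLISH = {"hello": ["HH", "AH", "L", "OW"]}  # hypothetical English lexicon

def phonemize(text: str, lang: str) -> List[str]:
    """Convert text to phonemes and tag each phoneme with its language so that
    identical symbols from different languages stay distinct (e.g. t -> t_zh / t_en)."""
    if lang == "zh":
        phones = [PINYIN.get(ch, ch) for ch in text]
    elif lang == "en":
        phones = []
        for word in text.lower().split():
            phones.extend(ENGLISH.get(word, list(word)))
    else:
        raise ValueError(f"unsupported language: {lang}")
    return [f"{p}_{lang}" for p in phones]    # language tag guarantees uniqueness

print(phonemize("你好", "zh"))    # ['ni3_zh', 'hao3_zh']
print(phonemize("hello", "en"))   # ['HH_en', 'AH_en', 'L_en', 'OW_en']
```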
The Mel-feature extraction settings used in all model training of the invention must be consistent.
The time-domain signal (waveform amplitude) of speech is less stable than frequency-domain representations such as the Mel spectrogram: utterances that sound the same can have very different waveforms in the time domain yet remain consistent in the frequency domain, so the speech processing field usually converts the time-domain signal into a frequency-domain signal for processing. Typical Mel-spectrogram extraction comprises framing, pre-emphasis, windowing, short-time Fourier transform (STFT) and mapping to the Mel scale; it emphasizes the features to which the human ear is sensitive and is well suited to speech synthesis. A possible extraction pipeline is sketched below.
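A possible Mel-feature extraction pipeline, sketched with librosa; the sampling rate, frame length, hop length, number of Mel bands and pre-emphasis coefficient are illustrative values chosen here, not parameters specified by the patent:

```python
import librosa
import numpy as np

def extract_mel(wav_path: str, sr: int = 22050, n_fft: int = 1024,
                hop_length: int = 256, n_mels: int = 80) -> np.ndarray:
    y, _ = librosa.load(wav_path, sr=sr)            # load and resample to the unified rate
    y = librosa.effects.preemphasis(y, coef=0.97)   # pre-emphasis
    mel = librosa.feature.melspectrogram(           # framing + windowing + STFT + Mel filterbank
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
        n_mels=n_mels, window="hann")
    return librosa.power_to_db(mel, ref=np.max)     # log-Mel features, shape (n_mels, frames)
```

Whatever settings are chosen, the same extraction must be applied for the acoustic model, the conversion models and the vocoder, as required above.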
Model training in the present invention can use a model structure already disclosed in the art, such as Tacotron or FastSpeech. The procedure can be carried out according to the following steps:
3) With the phonemized texts of all samples of the training sample set as input and the Mel features of the corresponding audio as output, train a deep learning model; the resulting model is denoted the generalized model MA.
All samples are used as input to improve the generalization of the model; the trained model parameters then cover the conversion from phonemized text to Mel features for all timbres (see the simplified training sketch below).
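A heavily simplified stand-in for this step, assuming a toy sequence model in place of Tacotron/FastSpeech; it only illustrates the data flow "phonemized text in, Mel features out" and uses dummy tensors rather than real batches:

```python
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Toy acoustic model: phoneme-id sequence -> Mel-frame sequence."""
    def __init__(self, vocab_size: int, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, phoneme_ids):                   # (batch, time)
        x, _ = self.rnn(self.embed(phoneme_ids))
        return self.proj(x)                           # (batch, time, n_mels)

# One training step on a dummy batch drawn from all timbres; real training iterates a DataLoader.
model = TinyAcousticModel(vocab_size=100)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
phonemes = torch.randint(0, 100, (8, 50))             # dummy phoneme-id sequences
mels = torch.randn(8, 50, 80)                         # dummy target Mel frames
loss = nn.functional.mse_loss(model(phonemes), mels)
optim.zero_grad(); loss.backward(); optim.step()
```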
4) Fine-tune the reference-timbre B samples on the basis of the generalized model MA to obtain the acoustic model of timbre B, denoted the reference model MB.
Fine-tuning on the generalized model MA has the advantage that, because MA contains the data features of different timbres, the resulting model parameters are more stable and richer. In particular, for texts that do not appear in the training set of the reference timbre B, fine-tuning extracts better Mel features, and fits better, than training directly on the timbre-B data alone.
5) Classify all samples by timbre and train conversion models MTR for timbre conversion; each conversion model MTR converts the Mel features of the reference timbre B to those of another timbre, with one conversion model MTR per timbre.
The timbre conversion model converts between Mel features; it is independent of the specific text content of the training samples, so the conversion can be learned from a small amount of data. Published models such as StarGAN-VC have been verified to complete the conversion and achieve good results with only a few minutes of data; see the paper "StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks" (2018 IEEE Spoken Language Technology Workshop (SLT), Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo). A sketch of how the per-timbre converters can be organized follows.
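A sketch of how one Mel-to-Mel converter per non-reference timbre could be organized; the network below is a plain convolutional placeholder for illustration only and does not reproduce the StarGAN-VC training procedure cited above, and the timbre names are hypothetical:

```python
import torch
import torch.nn as nn

class MelConverter(nn.Module):
    """Maps reference-timbre (B) Mel frames to target-timbre Mel frames."""
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, n_mels, kernel_size=5, padding=2))

    def forward(self, mel):                 # (batch, n_mels, frames)
        return self.net(mel)

# One conversion model MTR per non-reference timbre, keyed by timbre name.
timbres = ["gentle_zh", "lovely_zh", "lovely_en", "serious_old_zh"]
mtr = {name: MelConverter() for name in timbres}
converted = mtr["lovely_en"](torch.randn(1, 80, 200))   # reference-B Mel -> target-timbre Mel
```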
6) Train the generalized vocoder model MVN on all corpus data, and then fine-tune MVN on each timbre to obtain the personalized vocoder model MG corresponding to that timbre.
Because a vocoder model is difficult to fit with a small corpus, the vocoder is first trained on all corpus data and only then fine-tuned on each timbre. This ensures that each timbre can synthesize correct audio even with a small amount of corpus data, and the per-timbre fine-tuning strengthens the personalized characteristics of each timbre. A sketch of this two-stage vocoder training is given below.
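A sketch of the two-stage vocoder training, assuming a toy upsampling network (not a real neural vocoder) and dummy batch samplers; the timbre names and training hyperparameters are illustrative:

```python
import copy
import torch
import torch.nn as nn

class ToyVocoder(nn.Module):
    """Toy vocoder: upsamples Mel frames to a waveform (illustrative only)."""
    def __init__(self, n_mels: int = 80, hop: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(n_mels, 64, kernel_size=hop, stride=hop), nn.Tanh(),
            nn.Conv1d(64, 1, kernel_size=1))

    def forward(self, mel):                 # (batch, n_mels, frames) -> (batch, 1, samples)
        return self.net(mel)

def train(model, next_batch, lr, steps):
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        mel, wav = next_batch()                               # (Mel, waveform) pair
        loss = nn.functional.l1_loss(model(mel), wav)
        optim.zero_grad(); loss.backward(); optim.step()
    return model

all_data = lambda: (torch.randn(4, 80, 100), torch.randn(4, 1, 100 * 256))   # dummy batches

mvn = train(ToyVocoder(), all_data, lr=1e-3, steps=10)        # generalized vocoder MVN on all data
mg = {}
for timbre in ["B", "gentle_zh", "lovely_en"]:
    timbre_data = all_data                                    # stand-in for that timbre's samples
    mg[timbre] = train(copy.deepcopy(mvn), timbre_data, lr=1e-4, steps=5)   # fine-tuned MG per timbre
```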
Through the above steps, model training is performed with the phonemized texts, Mel features and audio of the training sample set, yielding the generalized model MA, the reference model MB, the conversion models MTR, the generalized vocoder model MVN and the personalized vocoder models MG. With these models, target-timbre audio can be synthesized, specifically:
7) Preprocess the text to be synthesized and feed the result into the reference model MB to obtain the Mel features of timbre B.
8) Feed the Mel features obtained in step 7) into the conversion model MTR corresponding to the target timbre Y to obtain the Mel features of the target timbre.
9) Feed the Mel features of the target timbre from step 8) into the personalized vocoder model MG of that timbre, thereby synthesizing speech with the designated timbre.
The traditional end-to-end speech synthesis algorithm usually fine-tunes an acoustic model directly on a generalized model and then feeds its output into a vocoder; with few training samples the fine-tuning result is not ideal, and the vocoder cannot accurately distinguish the personalized differences between timbres. In the low-corpus synthesis training method of the present invention, the acoustic base model and the vocoder base model are both trained on all training corpora, so fine-tuning on the base model preserves the generalization of the large model while accurately fitting the parameters when timbre data are insufficient. The Mel features of the small-corpus timbres are obtained through the timbre conversion model rather than inferred directly from an acoustic model: because the Mel features of the large-corpus timbre B are predicted accurately, feeding them into the timbre conversion model yields the Mel features of the target timbre, which solves the problem of synthesizing unseen texts when the target-timbre corpus is insufficient and greatly reduces the amount of training corpus required. The scheme of the invention thus improves speech synthesis quality when timbre data are insufficient.
Fine-tuning (finetune) means first training a base model with a large amount of data and then continuing training on that base model with a small amount of data. Deep learning models usually have a large number of parameters; if the training sample set is small, the model easily fails to converge or overfits (good results on the training set but poor results on the test set), giving poor generalization. When training in the fine-tuning manner, the base model trained on a large number of samples can already express most of the characteristics of the target model, so continuing training on the base model allows the small dataset to be fitted without losing the many characteristics that the small dataset does not itself contain.
Fine-tuning uses the samples of a single timbre, with the normalized phonemized text as input and the Mel features as output, trained repeatedly. The specific training setup is adjustable during fine-tuning: for example, the batch size and learning rate (learning_rate) can be tuned, and the parameters of certain layers of the model can be frozen so that only the parameters of the specified layers are updated by the training set. A possible setup is sketched below.
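A possible fine-tuning setup, assuming the toy acoustic model from the earlier sketch, a hypothetical checkpoint path for the generalized model MA, and dummy timbre-B data; the choice of which layer to freeze, the batch size and the learning rate are illustrative:

```python
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):          # same toy architecture as in the earlier sketch
    def __init__(self, vocab_size: int = 100, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)
    def forward(self, ids):
        x, _ = self.rnn(self.embed(ids))
        return self.proj(x)

model = TinyAcousticModel()
# model.load_state_dict(torch.load("generalized_MA.pt"))   # hypothetical MA checkpoint path

for p in model.embed.parameters():           # freeze a lower layer: its parameters stay fixed
    p.requires_grad = False

optim = torch.optim.Adam((p for p in model.parameters() if p.requires_grad),
                         lr=1e-4)            # reduced learning rate for fine-tuning
batch_size = 8                               # smaller batch size for the small corpus

# Dummy timbre-B batch; a real setup iterates a DataLoader over the reference-timbre samples.
phonemes = torch.randint(0, 100, (batch_size, 50))
mels = torch.randn(batch_size, 50, 80)
loss = nn.functional.mse_loss(model(phonemes), mels)
optim.zero_grad(); loss.backward(); optim.step()
```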
The Mel feature, i.e. the Mel spectrogram, is extracted by applying the following processing to the speech time-domain signal (the raw waveform data): framing, pre-emphasis, windowing and short-time Fourier transform (STFT), followed by mapping to the Mel scale, which yields the Mel feature.
One specific embodiment is as follows:
First, prepare the training corpus: corpus data of 5 timbres are prepared as the training sample set, comprising 4 Chinese timbres and 1 English timbre, each with audio and the corresponding texts. The audio of one of the Chinese timbres, timbre B, should preferably exceed 10 hours; the data volume of each remaining timbre exceeds 10 minutes, and the texts of the training samples are all different, ensuring diversity of the text data.
Timbre refers to different pronunciation types and is also related to gender, age and so on; it can be defined by factors such as gender, age, speaking style and language. For example, (young, female, serious, Chinese) can be set as the reference timbre B, and the other four timbres are (young, female, gentle, Chinese), (young, female, lovely, Chinese), (child, female, lovely, English) and (old, female, serious, Chinese), respectively.
Text normalization: numbers, units, special characters and the like in the text are normalized; the Chinese characters of the Chinese texts are converted to the corresponding phonemes, i.e. pinyin, and the English texts to their corresponding phonemes, i.e. phonetic symbols, ensuring that the phoneme symbols of the Chinese and English languages do not overlap. The processed text is the phonemized text.
Process the audio of all samples, unify the sampling rate, and match the optimal parameters to the chosen sampling rate to extract the Mel features; the specific Mel-feature extraction is prior art in the field and is not described again here.
Model training:
A1) Train an acoustic model: the phonemized texts of all samples, uniquely encoded, serve as the model input, and the Mel features of the corresponding audio of the training samples serve as the model output; the trained model can generate the corresponding Mel features from text. The result is the generalized model MA.
B1) Fine-tune the large-sample reference-timbre B data on the basis of the generalized model MA to obtain the reference model MB.
C1) Classify all data by timbre and train the conversion models MTR for timbre conversion; each conversion model MTR converts the Mel features of the reference timbre B to those of another timbre, with one conversion model MTR per timbre.
D1) Train a vocoder model on all samples: the model input is the Mel features, consistent with the Mel features used in step A1), and the model output is the original audio file. The function realized by the model is the conversion of Mel features into audio; the trained model is the generalized vocoder model MVN.
Then fine-tune the generalized vocoder model MVN on each timbre to obtain the personalized vocoder model MG corresponding to that timbre.
Thus, model training is performed using the phonemized texts, Mel features and audio of the training sample set, and the generalized model MA, the reference model MB, the conversion models MTR, the generalized vocoder model MVN and the personalized vocoder models MG are obtained.
Fourth, synthesize the target-timbre audio: the goal is to synthesize a text to be synthesized into an audio file with the target timbre Y.
A2) Preprocess the text to be synthesized in the same way as the sample texts were processed during training, obtain the corresponding phonemized text, and use it as the input of the reference model MB, thereby obtaining the Mel features of the reference timbre B for the text to be synthesized.
B2) Feed the Mel features of the reference timbre B obtained in step A2) into the conversion model MTR corresponding to the target timbre Y to obtain the Mel features of the target timbre Y.
In the earlier synthesis approach, the Mel features are obtained directly from an acoustic model; because the target-timbre training data volume is small, the fit to unseen texts is comparatively poor, and texts in languages absent from the target-timbre training set cannot be synthesized at all (for example, if the target-timbre training set contains only Chinese data, English text cannot be synthesized). In the present invention, the reference timbre B is trained on the large training set of the reference timbre, so the prediction of the reference-timbre-B Mel features is more accurate; the conversion model MTR only converts Mel features and is independent of the text, so the Mel features of the target timbre adapt to different texts better than Mel features obtained directly from an acoustic model. For other languages, even if the corpus of that language is not in the training set of the target timbre, the Mel features of that language obtained from the reference timbre B can be converted into the corresponding features, realizing synthesis of different languages with the target timbre.
C2) Feed the Mel features of the target timbre Y obtained in step B2) into the personalized vocoder model corresponding to the target timbre Y, thereby synthesizing speech of that timbre. The vocoder input depends only on the Mel features, so when the Mel features of different timbres and different corpora are fed in, the corresponding synthesized audio is produced. The whole inference chain is sketched below.
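A sketch of the whole inference chain A2)-C2); the model objects mb, mtr and mg and their infer/convert/vocode methods are hypothetical stand-ins for the trained reference model, conversion models and personalized vocoders, and phonemize is the toy function from the phonemization sketch above:

```python
import numpy as np
import soundfile as sf

def synthesize(text: str, lang: str, target_timbre: str,
               mb, mtr: dict, mg: dict, sr: int = 22050) -> None:
    phonemes = phonemize(text, lang)                    # same preprocessing as in training
    mel_ref = mb.infer(phonemes)                        # Mel features of reference timbre B
    mel_target = mtr[target_timbre].convert(mel_ref)    # Mel features of target timbre Y
    wav = mg[target_timbre].vocode(mel_target)          # personalized vocoder MG for timbre Y
    sf.write(f"{target_timbre}.wav", np.asarray(wav), sr)

# synthesize("你好", "zh", "lovely_en", mb, mtr, mg)     # hypothetical usage
```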
The technical effect of the invention is demonstrated with MOS scoring, the most common evaluation in the speech synthesis field: different listeners assign MOS scores (maximum 5) to the original audio and to the synthesized audio, and the mean is taken. The score covers clarity, naturalness, intelligibility and other aspects.
About 10 hours of Chinese corpus are prepared for the reference timbre B; the other eight ordinary timbres, numbered 1-8, each have about 20 minutes of corpus; there is one English timbre E with about 10 hours of English corpus.
After training with the method described above, the set of personalized vocoder models obtained is denoted model 1; model 2 is the vocoder model obtained by training on the same corpora with the traditional method.
Multiple listeners score the audio synthesized from the input Chinese or English text by the personalized vocoder models, and the scores are averaged to obtain the final score. The international MOS (Mean Opinion Score) test is used. MOS is a subjective test method that studies and quantifies how users perceive speech quality: different evaluators subjectively compare the original reference speech with the synthesized or degraded speech and assign MOS scores. The specific scoring criteria are as follows:
[Table: MOS scoring criteria]
The values below were obtained from MOS scoring by 50 evaluators. Model 1-timbre B denotes the personalized vocoder model obtained with the training of the invention, where the personalized vocoder model corresponding to timbre B synthesizes the audio file for timbre B; model 2-timbre B denotes model 2, obtained with the traditional method, synthesizing the audio file of timbre B; and so on for the rest.
The specific values after statistics are as follows:
[Table: statistical MOS scores of model 1, model 2 and the original audio for each timbre on Chinese and English text]
As the table shows, the audio files synthesized by the personalized vocoder models obtained with the present invention score higher than the traditional model 2 for the same language; for both Chinese and English text, whether for the reference timbre or an ordinary timbre, every entry of model 1 exceeds the corresponding entry of model 2, and the score difference from the original audio is also small.
Because model 2 is trained with the traditional method, its Chinese vocoder model cannot synthesize English text, and vice versa, so those scores are zero; concretely, the synthesized cross-language audio is pure noise. The personalized vocoder models adopted by the invention can synthesize English text: the personalized vocoder model corresponding to the English timbre E scores highest on English text, the reference timbre B also scores well on English text thanks to its large training data volume, and even the ordinary timbre 1, with its small data volume, can synthesize audio for English text with a score close to the 'fair' level of 3.0.
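A minimal sketch of how the per-system averages in the table above could be computed from individual listener scores; the listener scores here are dummy numbers, not the patent's data:

```python
from statistics import mean

scores = {                                   # {(system, language): [listener scores]}
    ("model1_timbreB", "zh"): [4.5, 4.0, 4.5, 4.0],
    ("model2_timbreB", "zh"): [3.5, 3.0, 3.5, 3.0],
    ("model1_timbreB", "en"): [4.0, 3.5, 4.0, 4.0],
    ("model2_timbreB", "en"): [0.0, 0.0, 0.0, 0.0],   # traditional model cannot cross languages
}

for (system, lang), vals in scores.items():
    print(f"{system} / {lang}: MOS = {mean(vals):.2f}")
```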
The foregoing describes preferred embodiments of the present invention. The preferred features described above may be combined in any manner that is not obviously contradictory or dependent on a particular preferred embodiment. The specific parameters in the examples and embodiments serve only to clearly illustrate the inventors' verification process and are not intended to limit the scope of patent protection of the invention, which is defined by the claims; equivalent structural changes made according to the content of the description of the present invention likewise fall within the scope of protection of the invention.

Claims (7)

1. A speech model training and synthesis method with limited corpus data, comprising model training and speech synthesis;
the method being characterized in that the model training comprises the following steps:
S1, collecting a training sample set, wherein the training sample set comprises samples of a plurality of timbres, each sample comprises a text and a corresponding audio file, the sample data of at least one timbre meets a reference-timbre standard, and the reference-timbre standard is that the corpus data of that timbre has a large sample volume and high quality;
S2, phonemizing the text of each sample to obtain a phonemized text; extracting Mel features from the audio file of each sample by the same method; and selecting one of the timbres meeting the reference-timbre standard as the reference timbre;
S3, training a speech model to obtain a generalized model, the training method being: with the phonemized texts of all samples as input and the Mel features of the corresponding audio as output, train the speech model;
S4, fine-tuning the reference-timbre samples on the basis of the generalized model to obtain a reference model;
S5, classifying all samples of the training sample set by timbre and training timbre conversion models, one conversion model per timbre;
and S6, training a generalized vocoder model with all samples of the training sample set, and then fine-tuning the generalized vocoder model with the samples of each timbre to obtain a personalized vocoder model for each timbre.
2. The speech model training and synthesis method with limited corpus data according to claim 1, wherein: the speech model in step S3 is either a Tacotron model or a FastSpeech model.
3. The speech model training and synthesis method with limited corpus data according to claim 1, wherein: the conversion model trained in step S5 is a StarGAN-VC model.
4. The speech model training and synthesis method with limited corpus data according to claim 1, wherein: the reference-timbre standard is that the audio data of the sample exceeds 10 hours in duration.
5. The method of claim 1, wherein the audio file of each sample exceeds 10 minutes in duration.
6. The method of claim 1, wherein the texts of the samples in the training sample set are all different.
7. The method of claim 1, wherein the speech synthesis comprises the steps of:
S7, preprocessing the text to be synthesized to obtain a phonemized text, and feeding it into the reference model to obtain the Mel features of the reference timbre for the text to be synthesized;
S8, feeding the Mel features obtained in step S7 into the conversion model corresponding to the target timbre to obtain the Mel features of the target timbre;
and S9, feeding the Mel features of the target timbre obtained in step S8 into the personalized vocoder model of the corresponding timbre, thereby synthesizing speech with the designated timbre.
CN202110561416.0A 2021-05-22 2021-05-22 Speech model training and synthesis method with limited corpus data Active CN112992118B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110561416.0A CN112992118B (en) 2021-05-22 2021-05-22 Speech model training and synthesis method with limited corpus data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110561416.0A CN112992118B (en) 2021-05-22 2021-05-22 Speech model training and synthesis method with limited corpus data

Publications (2)

Publication Number Publication Date
CN112992118A CN112992118A (en) 2021-06-18
CN112992118B (en) 2021-07-23

Family

ID=76337137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110561416.0A Active CN112992118B (en) 2021-05-22 2021-05-22 Speech model training and synthesis method with limited corpus data

Country Status (1)

Country Link
CN (1) CN112992118B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI721516B (en) * 2019-07-31 2021-03-11 國立交通大學 Method of generating estimated value of local inverse speaking rate (isr) and device and method of generating predicted value of local isr accordingly

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270450A (en) * 2010-06-07 2011-12-07 株式会社曙飞电子 System and method of multi model adaptation and voice recognition
CN105575393A (en) * 2015-12-02 2016-05-11 中国传媒大学 Personalized song recommendation method based on voice timbre
CN108417228B (en) * 2018-02-02 2021-03-30 福州大学 Human voice tone similarity measurement method under musical instrument tone migration
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN111383627A (en) * 2018-12-28 2020-07-07 北京猎户星空科技有限公司 Voice data processing method, device, equipment and medium
CN110148394A (en) * 2019-04-26 2019-08-20 平安科技(深圳)有限公司 Song synthetic method, device, computer equipment and storage medium
CN111144128A (en) * 2019-12-26 2020-05-12 北京百度网讯科技有限公司 Semantic parsing method and device
CN111210803A (en) * 2020-04-21 2020-05-29 南京硅基智能科技有限公司 System and method for training clone timbre and rhythm based on Bottleneck characteristics
CN112037755B (en) * 2020-11-03 2021-02-02 北京淇瑀信息科技有限公司 Voice synthesis method and device based on timbre clone and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"StarGan-VC:non-parallel many-to-many Voice conversion using StarGenertive Adversarial Networks";Hirokazu Kameoka;《2018 IEEE Spoken Language Technology Workshop》;20190214;全文 *
"基于WaveNet结构的普通话歌声合成的研究";游于人;《中国优秀硕士学位论文全文数据库 信息科技辑》;20200115;全文 *
"面向音色转换的歌声合成方法研究";齐子铭;《中国优秀硕士学位论文全文数据库 信息科技辑》;20200315;全文 *

Also Published As

Publication number Publication date
CN112992118A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
CN102779508B (en) Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
CN111210803B (en) System and method for training clone timbre and rhythm based on Bottle sock characteristics
CN116018638A (en) Synthetic data enhancement using voice conversion and speech recognition models
Le et al. First steps in fast acoustic modeling for a new target language: application to Vietnamese
CN102473416A (en) Voice quality conversion device, method therefor, vowel information generating device, and voice quality conversion system
CN113436606A (en) Original sound speech translation method
CN112992118B (en) Speech model training and synthesis method with limited corpus data
CN113539236B (en) Speech synthesis method and device
CN113314109B (en) Voice generation method based on cycle generation network
CN113345416B (en) Voice synthesis method and device and electronic equipment
CN115359775A (en) End-to-end tone and emotion migration Chinese voice cloning method
CN115359778A (en) Confrontation and meta-learning method based on speaker emotion voice synthesis model
Chen et al. A Bilingual Speech Synthesis System of Standard Malay and Indonesian Based on HMM-DNN
Zheng et al. Improving the Efficiency of Dysarthria Voice Conversion System Based on Data Augmentation
Reddy et al. DNN-based bilingual (Telugu-Hindi) polyglot speech synthesis
Ravi et al. Text-to-speech synthesis system for Kannada language
Houidhek et al. Evaluation of speech unit modelling for HMM-based speech synthesis for Arabic
Zhao et al. A study on the speech timbre space based on subjective evaluation
Afonja et al. SautiLearn: improving online learning experience with accent translation
Lu et al. Unlocking the Potential: an evaluation of Text-to-Speech Models for the Bahnar Language
CN117711374B (en) Audio-visual consistent personalized voice synthesis system, synthesis method and training method
Yong et al. Low footprint high intelligibility Malay speech synthesizer based on statistical data
Özaydın Comparative analysis of early studies on Turkish whistle language and a case study on test conditions
Louw Cross-lingual transfer using phonological features for resource-scarce text-to-speech
Chen et al. A Cross-Lingual Speech Synthesis System of Malay and Indonesian Based on HMM–DNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant