CN113763924B - Acoustic deep learning model training method, and voice generation method and device


Info

Publication number
CN113763924B
CN113763924B
Authority
CN
China
Prior art keywords
data
module
text
text feature
audio
Prior art date
Legal status
Active
Application number
CN202111310778.9A
Other languages
Chinese (zh)
Other versions
CN113763924A (en)
Inventor
陈栋
Current Assignee
Beijing Youmu Technology Co ltd
Original Assignee
Beijing Youmu Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youmu Technology Co ltd
Priority to CN202111310778.9A
Publication of CN113763924A
Application granted
Publication of CN113763924B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/005 - Language recognition
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027 - Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/26 - Speech to text systems
    • G10L 2015/025 - Phonemes, fenemes or fenones being the recognition units

Abstract

The application provides an acoustic deep learning model training method, a voice generation method and a device. The voice generation method includes the following steps: acquiring text data and language information; converting the text data into phonemes according to the language information and adding a language label to each phoneme; and generating audio data by using a deep learning model. The deep learning model includes an audio generation module, a text feature extraction module, a stream mapping module and a duration prediction module. The text feature extraction module extracts text feature values from the phonemes; the duration prediction module generates duration information according to the text features extracted by the text feature extraction module; the stream mapping module calculates latent variables according to the text feature values and the duration information and generates spectral feature data according to the latent variables; and the audio generation module generates the audio data according to the spectral feature data.

Description

Acoustic deep learning model training method, and voice generation method and device
Technical Field
The invention relates to the field of voice analysis and synthesis, in particular to an acoustic deep learning model training method, a voice generation method and voice generation equipment.
Background
A micro-lecture (microlecture) is a structured digital resource that uses information technology to present fragmented learning content, processes and extension materials organized according to cognitive principles. One micro-lecture function generates a video of the person in a photo explaining content, based on the text or slides uploaded by the user together with the photo. The function first synthesizes speech from the text content, then drives the generation of the character's lip and head movements, and finally composites the whole video. Because many users from foreign organizations often submit text that mixes languages, clear and fluent Chinese-English mixed speech must be synthesized from the text.
Taking Chinese-English mixing as an example, developing a Chinese-English mixed-language TTS (Text To Speech) model requires Chinese-English mixed speech data as training data. TTS training data already has strict requirements on noise and on the speaker's accent and fluency, and the requirements on Chinese-English mixed TTS speech data are even higher: the speaker must speak both languages fluently and switch between them smoothly, and the recordings must be made in a professional recording studio. Such data is therefore difficult or very expensive to obtain.
On the other hand, existing speech synthesis technology uses a two-stage approach. The first stage trains an acoustic model that generates a Mel spectrum from text, using the Mel spectra of speech data and the corresponding text; the second stage converts the Mel spectrum generated in the first stage into a speech signal using a vocoder. Two-stage TTS requires training two models separately, the training period is long, and a problem in either stage can make the final result unsatisfactory. In the second stage in particular, it has long been debated whether the vocoder should be trained on the spectra of the original training speech or on the spectra generated by the acoustic model; if the latter, the vocoder must be retrained after every update of the acoustic model, which keeps driving up the training cost of the TTS model.
Therefore, with existing speech synthesis technology, synthesizing mixed-language speech requires training the model on mixed-language speech and text data. The cost of acquiring such data is very high: the speaker must be able to use multiple languages, pronounce each of them to a high standard, and switch between languages skilfully and smoothly. In addition, the prior art synthesizes speech in two stages, first synthesizing the Mel spectrum of the speech and then converting the Mel spectrum into speech, so the overall technology stack is long, model training is costly, and a problem in any one stage affects the overall result.
Disclosure of Invention
In view of the above, the present application provides an acoustic deep learning model training method, including:
acquiring original sample data of a plurality of languages, wherein the original sample data comprises audio data, text data and language information of a speaker;
converting the text data into phonemes according to the language information, and adding language labels to each phoneme;
extracting linear spectrum data of the audio data;
training the deep learning model by using a plurality of training data, the training data including the linear spectrum data, the phonemes and their language labels, and the text data, wherein the deep learning model includes an audio generation module, a text feature extraction module, a stream mapping module, an alignment search module and a duration prediction module; the audio generation module is configured to generate audio data according to the linear spectrum data, the text feature extraction module is configured to extract text feature values from the phonemes, the stream mapping module is configured to map the audio features extracted by the audio generation module into latent variables, the alignment search module is configured to establish mapping relationship data between the text feature values and the latent variables, and the duration prediction module is configured to determine duration information according to the text features extracted by the text feature extraction module and the mapping relationship data.
Optionally, the original sample data is original sample data of multiple speakers, at least some of the speakers use different languages, and the original sample data of the same speaker is in a single language.
Optionally, the audio generation module comprises a posterior spectrum encoder and a decoder, wherein the posterior spectrum encoder is configured to extract audio features from the linear spectrum data, and the decoder is configured to generate the audio data according to the audio features.
Optionally, the text feature extraction module includes a phoneme encoder and a mapping module, wherein the phoneme encoder is configured to extract text feature data from the phonemes, and the mapping module is configured to process the text feature data into the feature values.
Optionally, the feature values are the mean and variance of a multivariate Gaussian distribution.
Optionally, the training data further includes identification data of a speaker, and the identification data is sent to the audio generation module, the stream mapping module, and the duration prediction module in a process of training the deep learning model by using a plurality of training data.
The present application further provides a speech generating method, including:
acquiring text data and language information;
converting the text data into phonemes according to the language information, and adding language labels to each phoneme;
the method comprises the steps of generating audio data by utilizing a deep learning model, wherein the deep learning model comprises an audio generation module, a text feature extraction module, a stream mapping module and a duration prediction module, the text feature extraction module is used for extracting text feature values from phonemes, the duration prediction module is used for generating duration information according to the text features extracted by the text feature extraction module, the stream mapping module is used for calculating latent variables according to the text feature values and the duration information and generating frequency spectrum feature data according to the latent variables, and the audio generation module generates audio data according to the frequency spectrum feature data.
Optionally, the text feature extraction module includes a phoneme encoder and a mapping module, wherein the phoneme encoder is configured to extract text feature data from the phonemes, and the mapping module is configured to process the text feature data into the feature values.
Optionally, the feature values are the mean and variance of a multivariate Gaussian distribution.
Optionally, before generating the audio data by using the deep learning model, obtaining identification data of the speaker, where the identification data is sent to the audio generating module, the stream mapping module, and the duration prediction module during the process of generating the audio data by using the deep learning model.
Accordingly, the present application provides an acoustic deep learning model training device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the acoustic deep learning model training method described above.
Accordingly, the present application provides a speech generation device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the speech generation method described above.
The acoustic deep learning model training scheme provided by the application completes the entire text-to-speech processing flow within a single model framework and establishes the alignment between text features and spectral features (latent variables) through an alignment search algorithm, without requiring any auxiliary model. It turns the two-stage training method of the prior art into a single stage, which greatly improves training efficiency and achieves truly complete end-to-end training.
The acoustic deep learning model training scheme provided by the application adds language information when the text is encoded into phonemes, so no language embedding needs to be added inside the model. Different languages can thus be distinguished effectively, while the pronunciation characteristics shared across languages are left for the model to learn, allowing it to extract language features efficiently. When mixed-language synthesis is required, even if the speech data of a given speaker is in a single language only, the model has learned how other speakers pronounce the other languages, so it can synthesize the other-language portions in that speaker's voice. The scheme therefore does not need mixed-language speech data as training data.
With the voice generation scheme provided by the application, the speech is generated in parallel during model inference using the inverse Flow transform and the GAN generator, which can greatly increase the speed of speech generation. The whole process can run in parallel, fully exploiting the computing power of the GPU; the real-time factor of speech synthesis can reach 0.02, which makes engineering deployment very convenient.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic diagram of a model training scheme in an embodiment of the present invention;
FIG. 2 is a schematic diagram of a preferred model training scheme in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a speech synthesis scheme in an embodiment of the present invention;
fig. 4 is a schematic diagram of a preferred speech synthesis scheme in an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
This embodiment provides an acoustic deep learning model training method, which can be executed by an electronic device such as a computer or a server. The method includes the following operations:
the method comprises the steps of obtaining original sample data of a plurality of languages, wherein the original sample data comprises audio data, text data and language information of a speaker. The plurality of languages may be, for example, languages of different countries or regions such as chinese, english, japanese, etc., and the audio data and the text data may be in only a single language or may be mixed in multiple languages.
Single-language means that one piece of original data is the pure Chinese speech and pure Chinese text of a speaker who uses Chinese, or the pure English speech and pure English text of a speaker who uses English. The acquired original sample data as a whole must cover at least two languages.
Multi-language mixing means that a piece of raw data may be the mixed Chinese-English speech and Chinese-English text of a speaker who uses both Chinese and English, for example a mixed sentence such as "你好hello". Such mixed samples may, but need not, appear in the acquired original sample data. Given how difficult multi-language mixed sample data is to obtain, a specific embodiment uses original sample data from multiple speakers, at least some of whom use different languages, with the original sample data of any one speaker being in a single language.
The text data is converted into phonemes according to the language information, and a language tag is added to each phoneme. Phonemes are the smallest units of speech, divided according to the natural attributes of speech; each articulatory action within a syllable constitutes one phoneme. Different languages use different phoneme conversion methods, that is, the text is processed with the conversion rule corresponding to its language. For example, the English "hello" is converted into 4 phonemes, "HH", "AH0", "L" and "OW1", and the Chinese "你好" (hello) is converted into 4 phonemes, "N", "I3", "H" and "AO3".
This embodiment adds labels indicating the language of each phoneme, such as the label "EN" for English and "CN" for Chinese. The tagged phonemes can then be denoted "EN-HH", "EN-AH0", "EN-L", "EN-OW1", "CN-N", "CN-I3", "CN-H" and "CN-AO3".
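The following Python sketch illustrates this tagging step. It is only a minimal illustration: the per-language grapheme-to-phoneme conversion is replaced by a toy lookup table, since the patent only specifies the interface (convert with the rule of each language, then prefix each phoneme with its language tag).

```python
# Minimal sketch of language-tagged phoneme conversion. The lookup table is a
# stand-in for real per-language g2p back-ends, which the patent does not name.
TOY_LEXICON = {
    ("EN", "hello"): ["HH", "AH0", "L", "OW1"],
    ("CN", "你好"): ["N", "I3", "H", "AO3"],
}

def text_to_tagged_phonemes(segments):
    """segments: list of (text, lang) pairs, e.g. [("hello", "EN"), ("你好", "CN")]."""
    tagged = []
    for text, lang in segments:
        phonemes = TOY_LEXICON[(lang, text)]  # a real system would call a per-language g2p here
        tagged.extend(f"{lang}-{p}" for p in phonemes)
    return tagged

print(text_to_tagged_phonemes([("你好", "CN"), ("hello", "EN")]))
# ['CN-N', 'CN-I3', 'CN-H', 'CN-AO3', 'EN-HH', 'EN-AH0', 'EN-L', 'EN-OW1']
```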
Linear spectrum data of the audio data is extracted. Specifically, the linear spectrum of the speech can be obtained by processing the audio signal with, for example, a short-time Fourier transform.
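As a rough illustration of this step, the sketch below computes a linear magnitude spectrogram with a short-time Fourier transform in PyTorch; the FFT size, hop length and window are illustrative choices rather than values fixed by the patent.

```python
import torch

def linear_spectrogram(waveform, n_fft=1024, hop_length=256, win_length=1024):
    """waveform: 1-D float tensor of samples -> (n_fft // 2 + 1, frames) magnitude spectrum."""
    window = torch.hann_window(win_length)
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      win_length=win_length, window=window,
                      center=True, return_complex=True)
    return spec.abs()

spec = linear_spectrogram(torch.randn(16000))  # one second of 16 kHz audio
print(spec.shape)                              # torch.Size([513, 63])
```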
The training data used to train the deep learning network model is obtained through the above preprocessing. The training data used in this embodiment includes the audio data, the linear spectrum data, the phonemes and their language labels, and the text data.
The deep learning model in this embodiment is shown in fig. 1, and includes an audio generation module 1, a text feature extraction module 2, a stream mapping module 3, an alignment search module 4, and a duration prediction module 5.
The audio generation module 1 is configured to generate audio data according to the linear spectrum data. This module extracts feature vectors from the linear spectrum using a generative adversarial learning mechanism and then generates audio data from these feature vectors.
The text feature extraction module 2 is configured to extract text feature values from the phonemes, each of which carries a language label. This module can use the feature extraction network of any of various neural network models, and the form of the resulting feature values can be configured according to the network type; for example, a mapping layer can be provided that maps the feature vectors extracted by the network into corresponding values according to a set rule.
The stream mapping module 3 is used to map the audio features extracted by the audio generation module 1 into latent variables. The audio generation module 1 necessarily extracts feature vectors, that is, feature data of the linear spectrum, while generating audio; the latent variables may be interpreted directly as spectral features, or as higher- or lower-dimensional data obtained by further processing the spectral features.
The alignment search module 4 is configured to establish the mapping relationship data between text feature values and latent variables. Several alignment search algorithms can be chosen for this; a monotonic alignment search algorithm is adopted in a preferred embodiment.
The duration prediction module 5 is configured to predict duration information according to the text features extracted by the text feature extraction module 2 and the mapping relationship data. The duration information corresponds to all of the phonemes and represents the total speech duration used to utter them.
The model optimizes the parameters of each module according to the output results (including the audio data output by the audio generation module 1 and the duration information output by the duration prediction module 5) until the desired effect is achieved.
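To make the wiring of the five modules concrete, the following PyTorch sketch shows one way they could be connected at training time. The module names follow Fig. 1, but every internal layer (embedding, GRU, linear layers) and every dimension is a simplified stand-in chosen only so the sketch runs; it is not the patent's actual architecture.

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, n_phonemes=200, spec_bins=513, latent_dim=192):
        super().__init__()
        # text feature extraction module (2): tagged phonemes -> text features -> (mu, log sigma)
        self.phoneme_embed = nn.Embedding(n_phonemes, latent_dim)
        self.text_encoder = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.project = nn.Linear(latent_dim, 2 * latent_dim)
        # audio generation module (1): posterior spectrum encoder + decoder
        self.posterior_encoder = nn.Linear(spec_bins, latent_dim)
        self.decoder = nn.Linear(latent_dim, 256)          # stand-in for the waveform decoder
        # stream mapping module (3): audio features -> latent variables (stand-in for a normalizing flow)
        self.flow = nn.Linear(latent_dim, latent_dim)
        # duration prediction module (5)
        self.duration_predictor = nn.Linear(latent_dim, 1)

    def forward(self, phoneme_ids, linear_spec):
        # text side: phonemes -> features -> Gaussian parameters
        h, _ = self.text_encoder(self.phoneme_embed(phoneme_ids))
        mu, log_sigma = self.project(h).chunk(2, dim=-1)
        # audio side: linear spectrum -> audio features -> latent z and reconstructed audio
        audio_feat = self.posterior_encoder(linear_spec)
        z = self.flow(audio_feat)
        audio = self.decoder(audio_feat)
        # the alignment search module (4) would align (mu, sigma) with z here
        # (see the monotonic alignment search sketch further below); the duration
        # predictor sees detached text features (the gradient stop described later)
        log_duration = self.duration_predictor(h.detach()).squeeze(-1)
        return audio, mu, log_sigma, z, log_duration

model = AcousticModel()
out = model(torch.randint(0, 200, (2, 12)), torch.randn(2, 63, 513))
print([t.shape for t in out])
```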
The acoustic deep learning model training scheme provided by this embodiment of the invention completes the entire text-to-speech processing flow within a single model framework and establishes the alignment between text features and spectral features (latent variables) through the alignment search algorithm, without requiring any auxiliary model. It turns the two-stage training method of the prior art into a single stage, which greatly improves training efficiency and achieves truly complete end-to-end training.
In the acoustic deep learning model training scheme provided by this embodiment of the invention, language information is added when the text is encoded into phonemes, so no language embedding needs to be added inside the model. Different languages can thus be distinguished effectively, while the pronunciation characteristics shared across languages are left for the model to learn, allowing it to extract language features efficiently. When mixed-language synthesis is required, even if the speech data of a given speaker is in a single language only, the model has learned how other speakers pronounce the other languages, so it can synthesize the other-language portions in that speaker's voice. The scheme therefore does not need mixed-language speech data as training data.
Fig. 2 shows a preferred deep learning model structure, whose audio generation module includes a posterior spectrum encoder 11 and a decoder 12, and whose text feature extraction module includes a phoneme encoder 21 and a mapping module 22.
In a preferred embodiment, the training data further includes identification data of the speaker, specifically an embedding vector (embedding here refers to converting a discrete variable into a continuous vector); the identification is used to distinguish different speakers. For example, the training data may include audio and text for N speakers, each with a different embedding vector. During training, the embedding vector of the speaker corresponding to each text-audio pair is sent to the audio generation module, the stream mapping module and the duration prediction module, so that the model can distinguish different speakers while learning. After training, a user can select one of the available speakers, that is, choose the voice of the output speech.
Specifically, in this embodiment the embedding vector of the speaker is fed into the posterior spectrum encoder 11, the decoder 12, the stream mapping module 3 and the duration prediction module 5. The posterior spectrum encoder 11 extracts audio features from the linear spectrum data, and the decoder 12 generates audio data from the audio features. The phoneme encoder 21 converts the phonemes into corresponding embedding vectors (which can be interpreted as text feature data), and the mapping module 22 processes the text feature data into feature values; in this embodiment the mapping module 22 maps the text features to the mean μ and variance σ of a multivariate Gaussian distribution.
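A minimal sketch of this kind of speaker conditioning is shown below. The patent only states which modules receive the speaker's embedding vector; how it is injected (simple addition here, broadcast over time) is an illustrative assumption.

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    def __init__(self, n_speakers=10, dim=192):
        super().__init__()
        self.speaker_table = nn.Embedding(n_speakers, dim)  # one embedding vector per speaker

    def forward(self, features, speaker_id):
        """features: (batch, frames, dim); speaker_id: (batch,) integer speaker indices."""
        g = self.speaker_table(speaker_id).unsqueeze(1)      # (batch, 1, dim)
        return features + g                                  # broadcast the speaker vector over time

cond = SpeakerConditioning()
out = cond(torch.randn(2, 63, 192), torch.tensor([0, 3]))
print(out.shape)  # torch.Size([2, 63, 192])
```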
The stream mapping module 3 maps the audio features extracted by the encoder 11 to the latent variable z using a normalizing flow; the mapping is denoted Flow f. The alignment search module 4 establishes the mapping relationship between the mean μ and variance σ (the text features) and the latent variable z using the monotonic alignment search algorithm. Meanwhile, the duration prediction module 5 predicts the duration; to prevent the duration predictor from influencing the other modules during training, the duration prediction module is subjected to gradient-stop processing.
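The dynamic-programming sketch below shows the kind of monotonic alignment search named here (the variant popularised by Glow-TTS/VITS); the patent does not spell out its implementation, so treat the details as an assumption. Given the log-likelihood of every latent frame z_t under every phoneme's Gaussian N(μ_j, σ_j), it returns a monotonic alignment, and summing the alignment over frames yields the per-phoneme durations that supervise the duration predictor.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """log_p: (n_phonemes, n_frames) log-likelihood matrix -> 0/1 alignment of the same shape."""
    J, T = log_p.shape
    value = np.full((J, T), -np.inf)
    value[0, 0] = log_p[0, 0]
    for t in range(1, T):
        for j in range(min(t + 1, J)):                 # frame t can reach at most phoneme t
            stay = value[j, t - 1]
            advance = value[j - 1, t - 1] if j > 0 else -np.inf
            value[j, t] = max(stay, advance) + log_p[j, t]
    # backtrack from the last phoneme at the last frame
    alignment = np.zeros((J, T), dtype=np.int64)
    j = J - 1
    for t in range(T - 1, -1, -1):
        alignment[j, t] = 1
        if t > 0 and j > 0 and value[j - 1, t - 1] > value[j, t - 1]:
            j -= 1
    return alignment

log_p = np.log(np.random.rand(4, 10))   # 4 phonemes, 10 latent frames
align = monotonic_alignment_search(log_p)
durations = align.sum(axis=1)           # frames assigned to each phoneme
print(durations, durations.sum())       # the durations sum to the number of frames
```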
The application also provides a speech generation method, in which the model obtained by the above training method is used to generate speech from text. In this embodiment only some of the model's modules are used rather than all of them. The speech generation method includes the following operations:
and acquiring text data and language information. Converting the text data into phonemes according to the language information, and adding language tags to each phoneme respectively. Reference may be made to the above embodiments, which are not described herein again.
As shown in Fig. 3, the audio data is generated using a deep learning model, which in this embodiment includes an audio generation module 1, a text feature extraction module 2, a stream mapping module 3, an alignment search module 4 and a duration prediction module 5. The text feature extraction module 2 extracts text feature values from the phonemes, the duration prediction module 5 generates duration information according to the text features extracted by the text feature extraction module 2, the alignment search module 4 calculates latent variables according to the text feature values and the duration information, the stream mapping module 3 generates spectral feature data according to the latent variables, and the audio generation module 1 generates audio data according to the spectral feature data.
Fig. 4 shows a preferred model structure corresponding to Fig. 2. When the model makes predictions, the posterior spectrum encoder and the discriminator of the GAN (generative adversarial network) can be discarded, and the remaining parts are kept for prediction. In the speech generation method, the stream mapping module 3 performs the inverse transformation, denoted Flow f⁻¹. The text feature extraction module includes a phoneme encoder 21 and a mapping module 22; the phoneme encoder 21 extracts text feature data from the phonemes, and the mapping module 22 processes the text feature data into feature values, preferably the mean μ and variance σ of a multivariate Gaussian distribution. The duration prediction module 5 generates duration information from the text feature data; a latent variable is calculated from the mean μ, the variance σ and the duration information; the stream mapping module 3 applies the inverse transformation to generate the spectral features in parallel; and finally the decoder 12 generates speech from the spectral features.
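A minimal sketch of this inference path is given below, with the trained modules treated as black-box callables (the stand-ins in the usage example are toys). The data flow is what is described above: (μ, σ) from the text side, predicted durations, an expanded latent z, the inverse flow, then the decoder; the noise scale and the exact conditioning signatures are illustrative assumptions, not values fixed by the patent.

```python
import torch

@torch.no_grad()
def synthesize(phoneme_ids, text_encoder, duration_predictor, flow_inverse, decoder,
               speaker_embedding=None, noise_scale=0.667):
    mu, sigma = text_encoder(phoneme_ids)                    # (1, n_phonemes, dim) each
    durations = duration_predictor(mu, speaker_embedding)    # integer frames per phoneme, (1, n_phonemes)
    # expand each phoneme's (mu, sigma) to its predicted number of frames
    mu_f = torch.repeat_interleave(mu, durations[0], dim=1)
    sigma_f = torch.repeat_interleave(sigma, durations[0], dim=1)
    z = mu_f + sigma_f * torch.randn_like(mu_f) * noise_scale  # sample the latent variables
    spec_features = flow_inverse(z, speaker_embedding)       # inverse transform Flow f^-1, parallel over frames
    return decoder(spec_features, speaker_embedding)         # waveform

# usage with toy stand-ins for the trained modules
dim, n_ph = 192, 8
wav = synthesize(
    torch.randint(0, 200, (1, n_ph)),
    text_encoder=lambda ids: (torch.randn(1, n_ph, dim), torch.rand(1, n_ph, dim)),
    duration_predictor=lambda mu, spk: torch.randint(1, 5, (1, n_ph)),
    flow_inverse=lambda z, spk: z,
    decoder=lambda s, spk: s.flatten(),
)
print(wav.shape)
```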
With the voice generation method provided by this embodiment of the invention, the speech is generated in parallel during model inference using the inverse Flow transform and the GAN generator, which can greatly increase the speed of speech generation. The whole process can run in parallel, fully exploiting the computing power of the GPU; the real-time factor of speech synthesis can reach 0.02, which makes engineering deployment very convenient.
When the speaker embedding vectors were used in the training process, this embodiment allows the user to select one speaker. The identification data of the speaker selected by the user is obtained before the audio data is generated with the deep learning model, and during generation this identification data is sent to the audio generation module, the stream mapping module and the duration prediction module, so that the voice of the audio data finally generated by the model is the voice of the speaker selected by the user.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are provided only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to list all embodiments exhaustively here. Obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (8)

1. A training method of an acoustic deep learning model is characterized by comprising the following steps:
acquiring original sample data of a plurality of languages, wherein the original sample data comprises audio data, text data and language information of a speaker;
converting the text data into phonemes according to the language information, and adding language labels to each phoneme;
extracting linear spectrum data of the audio data;
training the deep learning model by using a plurality of training data, the training data comprising the linear spectrum data, the phonemes and language labels thereof, and the text data, wherein the deep learning model comprises an audio generation module, a text feature extraction module, a stream mapping module, an alignment search module and a duration prediction module; the audio generation module comprises a posterior spectrum encoder and a decoder, the posterior spectrum encoder being configured to extract audio features from the linear spectrum data and the decoder being configured to generate the audio data according to the audio features; the text feature extraction module comprises a phoneme encoder and a mapping module, the phoneme encoder being configured to extract text feature data from the phonemes and the mapping module being configured to process the text feature data into text feature values, the text feature values being the mean and variance of a multivariate Gaussian distribution; the stream mapping module is configured to map the audio features extracted by the audio generation module into latent variables; the alignment search module is configured to establish mapping relationship data between the text feature values and the latent variables; and the duration prediction module is configured to determine duration information according to the text feature data extracted by the text feature extraction module and the mapping relationship data.
2. The method of claim 1, wherein the original sample data is original sample data of multiple speakers, at least some of the speakers use different languages, and the original sample data of the same speaker is in a single language.
3. The method according to claim 1 or 2, wherein the training data further comprises identification data of a speaker, and the identification data is sent to the audio generation module, the stream mapping module and the duration prediction module during the training of the deep learning model by using a plurality of training data.
4. A method of speech generation, comprising:
acquiring text data and language information;
converting the text data into phonemes according to the language information, and adding language labels to each phoneme;
generating audio data by using a deep learning model trained by the training method according to any one of claims 1 to 3, wherein the deep learning model comprises an audio generation module, a text feature extraction module, a stream mapping module and a duration prediction module; the text feature extraction module is configured to extract text feature values from the phonemes, the text feature values being the mean and variance of a multivariate Gaussian distribution; the duration prediction module is configured to generate duration information according to the text feature data extracted by the text feature extraction module; the stream mapping module is configured to calculate latent variables according to the text feature values and the duration information and to generate spectral feature data according to the latent variables; and the audio generation module generates the audio data according to the spectral feature data.
5. The method of claim 4, wherein the text feature extraction module comprises a phoneme encoder and a mapping module, wherein the phoneme encoder is configured to extract text feature data from the phonemes, and the mapping module is configured to process the text feature data into the text feature values.
6. The method of claim 4, further comprising obtaining identification data of the speaker before generating the audio data using the deep learning model, the identification data being fed into the audio generation module, the stream mapping module, and the duration prediction module during the generating of the audio data using the deep learning model.
7. An acoustic deep learning model training apparatus, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the acoustic deep learning model training method of any one of claims 1-3.
8. A speech generating device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of speech generation according to any of claims 4-6.
CN202111310778.9A 2021-11-08 2021-11-08 Acoustic deep learning model training method, and voice generation method and device Active CN113763924B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111310778.9A CN113763924B (en) 2021-11-08 2021-11-08 Acoustic deep learning model training method, and voice generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111310778.9A CN113763924B (en) 2021-11-08 2021-11-08 Acoustic deep learning model training method, and voice generation method and device

Publications (2)

Publication Number Publication Date
CN113763924A (en) 2021-12-07
CN113763924B (en) 2022-02-15

Family

ID=78784718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111310778.9A Active CN113763924B (en) 2021-11-08 2021-11-08 Acoustic deep learning model training method, and voice generation method and device

Country Status (1)

Country Link
CN (1) CN113763924B (en)


Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8571871B1 (en) * 2012-10-02 2013-10-29 Google Inc. Methods and systems for adaptation of synthetic speech in an environment
WO2018084305A1 (en) * 2016-11-07 2018-05-11 ヤマハ株式会社 Voice synthesis method
CN108492818B (en) * 2018-03-22 2020-10-30 百度在线网络技术(北京)有限公司 Text-to-speech conversion method and device and computer equipment
CN108777140B (en) * 2018-04-27 2020-07-28 南京邮电大学 Voice conversion method based on VAE under non-parallel corpus training
CN108806665A (en) * 2018-09-12 2018-11-13 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
CN111402855B (en) * 2020-03-06 2021-08-27 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112002305A (en) * 2020-07-29 2020-11-27 北京大米科技有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112767914A (en) * 2020-12-31 2021-05-07 科大讯飞股份有限公司 Singing voice synthesis method and equipment, computer storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110189741A (en) * 2018-07-05 2019-08-30 腾讯数码(天津)有限公司 Audio synthetic method, device, storage medium and computer equipment
CN110473515A (en) * 2019-08-29 2019-11-19 郝洁 A kind of end-to-end speech synthetic method based on WaveRNN
CN112634856A (en) * 2020-12-10 2021-04-09 苏州思必驰信息科技有限公司 Speech synthesis model training method and speech synthesis method
CN112802448A (en) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 Speech synthesis method and system for generating new tone
CN112863483A (en) * 2021-01-05 2021-05-28 杭州一知智能科技有限公司 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN113327574A (en) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113327576A (en) * 2021-06-03 2021-08-31 多益网络有限公司 Speech synthesis method, apparatus, device and storage medium
CN113380222A (en) * 2021-06-09 2021-09-10 广州虎牙科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113450757A (en) * 2021-06-25 2021-09-28 马上消费金融股份有限公司 Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于WaveNet的端到端语音合成方法 (End-to-end speech synthesis method based on WaveNet); 邱泽宇 et al.; 《计算机应用》 (Journal of Computer Applications); 2019-01-21 (No. 05); full text *

Also Published As

Publication number Publication date
CN113763924A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
CN112863483B (en) Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN112802450B (en) Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof
CN102779508B (en) Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
KR20230034423A (en) 2-level speech rhyme transmission
CN112151005B (en) Chinese and English mixed speech synthesis method and device
CN112365878A (en) Speech synthesis method, device, equipment and computer readable storage medium
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
CN112802446A (en) Audio synthesis method and device, electronic equipment and computer-readable storage medium
CN113838448A (en) Voice synthesis method, device, equipment and computer readable storage medium
KR20230133362A (en) Generate diverse and natural text-to-speech conversion samples
CN110930975A (en) Method and apparatus for outputting information
CN113823256A (en) Self-generated text-to-speech (TTS) synthesis
CN113763924B (en) Acoustic deep learning model training method, and voice generation method and device
CN112242134A (en) Speech synthesis method and device
CN115620699A (en) Speech synthesis method, speech synthesis system, speech synthesis apparatus, and storage medium
CN113314097B (en) Speech synthesis method, speech synthesis model processing device and electronic equipment
KR20190135853A (en) Method and system of text to multiple speech
CN115700871A (en) Model training and speech synthesis method, device, equipment and medium
Um et al. Facetron: A Multi-Speaker Face-to-Speech Model Based on Cross-Modal Latent Representations
CN113628609A (en) Automatic audio content generation
KR102382191B1 (en) Cyclic Learning Method and Apparatus for Speech Emotion Recognition and Synthesis
KR20080011859A (en) Method for predicting sentence-final intonation and text-to-speech system and method based on the same
CN117316139A (en) Method and device for training speech synthesis model and speech synthesis
Nguyen et al. AAT: An Efficient Model for Synthesizing Long Sequences on a Small Dataset

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant