CN113362836B - Vocoder training method, terminal and storage medium - Google Patents

Vocoder training method, terminal and storage medium

Info

Publication number
CN113362836B
CN113362836B (application number CN202110615714.3A)
Authority
CN
China
Prior art keywords
sample
audio
acoustic model
vocoder
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110615714.3A
Other languages
Chinese (zh)
Other versions
CN113362836A (en)
Inventor
徐东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202110615714.3A priority Critical patent/CN113362836B/en
Publication of CN113362836A publication Critical patent/CN113362836A/en
Application granted granted Critical
Publication of CN113362836B publication Critical patent/CN113362836B/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis, using predictive techniques
    • G10L 19/16: Vocoder architecture
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application discloses a vocoder training method, a terminal and a storage medium, and belongs to the technical field of the Internet. The method comprises the following steps: acquiring time domain data of sample audio as reference time domain data; determining first spectrum data corresponding to the reference time domain data, and inputting the first spectrum data into a self-attention learning module in a trained acoustic model to obtain second spectrum data; inputting the second spectrum data into a vocoder to obtain predicted time domain data; and training the vocoder based on the predicted time domain data and the reference time domain data. A vocoder trained with this method matches the trained acoustic model better than the trained vocoder matches the trained acoustic model in the prior art, which reduces the hissing ("sandy") artifacts in the synthesized speech to a certain degree.

Description

Vocoder training method, terminal and storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a vocoder training method, a terminal, and a storage medium.
Background
With the continuous development of internet technology, people increasingly listen to novels being read aloud by an AI model rather than reading the text themselves.
In the related art, the AI model is actually composed of a phoneme conversion model, a pause prediction model, an acoustic model and a vocoder. The specific process of synthesizing the audio of a target text with these models is as follows: the target text is input into the phoneme conversion model and the pause prediction model respectively to obtain a phoneme sequence and pause information, where the pause information comprises pause positions and pause durations; the phoneme sequence and the pause information are input into the trained acoustic model to obtain spectrum data; the spectrum data are input into the trained vocoder to obtain target time domain data corresponding to the target text, and the terminal plays the target time domain data.
Because the vocoder is trained on spectrum data extracted from real speech, while in actual use the spectrum data fed into the trained vocoder is only an approximation of real speech predicted by the acoustic model from the phoneme sequence and the pause information, the trained acoustic model does not match the trained vocoder. The vocoder therefore cannot properly interpret the spectrum data produced by the acoustic model, which causes hissing ("sandy") artifacts in the synthesized speech.
Disclosure of Invention
The embodiments of the application provide a vocoder training method, a terminal and a storage medium. A vocoder trained with this method matches the trained acoustic model better than the trained vocoder matches the trained acoustic model in the prior art, so the hissing ("sandy") artifacts in the synthesized speech are reduced to a certain extent. The technical scheme is as follows:
in one aspect, a method of training a vocoder is provided, the method comprising:
Acquiring time domain data of sample audio as reference time domain data;
Determining first frequency spectrum data corresponding to the reference time domain data, and inputting the first frequency spectrum data into a self-attention learning module in the trained acoustic model to obtain second frequency spectrum data;
inputting the second frequency spectrum data into a vocoder to obtain predicted time domain data;
The vocoder is trained based on the predicted time domain data and the reference time domain data.
Optionally, the method further comprises:
obtaining a phoneme sequence and pause information corresponding to a target text, wherein the pause information comprises pause positions and pause time;
inputting the phoneme sequence and the pause information into a trained acoustic model to obtain third spectrum data;
And inputting the third frequency spectrum data into a trained vocoder to obtain target time domain data corresponding to the target text.
Optionally, the obtaining the phoneme sequence and the pause information corresponding to the target text includes:
inputting the target text into a phoneme conversion model to obtain a phoneme sequence corresponding to the target text;
and inputting the target text into a pause prediction model to obtain pause information corresponding to the target text.
Optionally, the inputting the phoneme sequence and the pause information into the trained acoustic model to obtain third spectrum data includes:
inputting the phoneme sequence and the pause information into a frequency spectrum prediction module in the trained acoustic model to obtain fourth frequency spectrum data;
and inputting the fourth frequency spectrum data into a self-attention learning module in the trained acoustic model to obtain the third frequency spectrum data.
Optionally, the method further comprises:
determining the speech rate of each sample audio in a sample library, wherein the sample library stores a plurality of sample audios and sample texts corresponding to the plurality of sample audios respectively;
Determining first sample audio whose speech rate is within a preset numerical range and first sample text corresponding to the first sample audio;
and training the acoustic model based on the first sample text and the corresponding first sample audio to obtain a trained acoustic model.
Optionally, the training the acoustic model based on the first sample text and the corresponding first sample audio to obtain a trained acoustic model includes:
acquiring spectrum data corresponding to the first sample audio as reference spectrum data;
Determining a sample phoneme sequence and sample pause information corresponding to the first sample text, wherein the sample pause information comprises a sample pause position and a sample pause time;
inputting the sample phoneme sequence and the sample pause information into an acoustic model to obtain predicted spectrum data;
and training the acoustic model based on the reference spectrum data and the predicted spectrum data to obtain a trained acoustic model.
Optionally, the method further comprises:
determining the speech rate of each sample audio in a sample library, wherein the sample library stores a plurality of sample audios and sample texts corresponding to the plurality of sample audios respectively;
Determining a first sample audio with a speech rate within a preset numerical range, a first sample text corresponding to the first sample audio, a second sample audio with a speech rate outside the preset numerical range and a second sample text corresponding to the second sample audio;
training the acoustic model based on the second sample text and the corresponding second sample audio to obtain a preliminarily trained acoustic model;
And training the preliminarily trained acoustic model based on the first sample text and the corresponding first sample audio to obtain a trained acoustic model.
Optionally, the determining the speech rate of each sample audio in the sample library includes:
For each sample audio, determining a first duration corresponding to the sample audio, a second duration corresponding to the noise audio segments in the sample audio and a third duration corresponding to the silent audio segments in the sample audio, determining the sum of the second duration and the third duration, and taking the difference between the first duration and the sum as the effective duration corresponding to the sample audio; determining the text corresponding to the sample audio, and taking the number of words included in the text with punctuation marks removed as the number of effective words included in the sample audio; and taking the ratio of the number of effective words corresponding to the sample audio to the corresponding effective duration as the speech rate corresponding to the sample audio.
In one aspect, there is provided a training vocoder apparatus comprising:
an acquisition module configured to acquire time domain data of the sample audio as reference time domain data;
the first input module is configured to determine first spectrum data corresponding to the reference time domain data, input the first spectrum data into the self-attention learning module in the trained acoustic model, and obtain second spectrum data;
a second input module configured to input the second spectral data into a vocoder to obtain predicted time domain data;
a first training module configured to train the vocoder based on the predicted time domain data and the reference time domain data.
Optionally, the apparatus further comprises a usage module configured to:
obtaining a phoneme sequence and pause information corresponding to a target text, wherein the pause information comprises pause positions and pause time;
inputting the phoneme sequence and the pause information into a trained acoustic model to obtain third spectrum data;
And inputting the third frequency spectrum data into a trained vocoder to obtain target time domain data corresponding to the target text.
Optionally, the usage module is configured to:
inputting the target text into a phoneme conversion model to obtain a phoneme sequence corresponding to the target text;
and inputting the target text into a pause prediction model to obtain pause information corresponding to the target text.
Optionally, the usage module is configured to:
inputting the phoneme sequence and the pause information into a frequency spectrum prediction module in the trained acoustic model to obtain fourth frequency spectrum data;
and inputting the fourth frequency spectrum data into a self-attention learning module in the trained acoustic model to obtain the third frequency spectrum data.
Optionally, the apparatus further comprises a second training module configured to:
determining the speech rate of each sample audio in a sample library, wherein the sample library stores a plurality of sample audios and sample texts corresponding to the plurality of sample audios respectively;
Determining first sample audio whose speech rate is within a preset numerical range and first sample text corresponding to the first sample audio;
and training the acoustic model based on the first sample text and the corresponding first sample audio to obtain a trained acoustic model.
Optionally, the second training module is configured to:
acquiring spectrum data corresponding to the first sample audio as reference spectrum data;
Determining a sample phoneme sequence and sample pause information corresponding to the first sample text, wherein the sample pause information comprises a sample pause position and a sample pause time;
inputting the sample phoneme sequence and the sample pause information into an acoustic model to obtain predicted spectrum data;
and training the acoustic model based on the reference spectrum data and the predicted spectrum data to obtain a trained acoustic model.
Optionally, the second training module is further configured to:
determining the speech rate of each sample audio in a sample library, wherein the sample library stores a plurality of sample audios and sample texts corresponding to the plurality of sample audios respectively;
Determining a first sample audio with a speech rate within a preset numerical range, a first sample text corresponding to the first sample audio, a second sample audio with a speech rate outside the preset numerical range and a second sample text corresponding to the second sample audio;
training the acoustic model based on the second sample text and the corresponding second sample audio to obtain a preliminarily trained acoustic model;
And training the preliminarily trained acoustic model based on the first sample text and the corresponding first sample audio to obtain a trained acoustic model.
Optionally, the second training module is configured to:
For each sample audio, determining a first duration corresponding to the sample audio, a second duration corresponding to the noise audio segments in the sample audio and a third duration corresponding to the silent audio segments in the sample audio, determining the sum of the second duration and the third duration, and taking the difference between the first duration and the sum as the effective duration corresponding to the sample audio; determining the text corresponding to the sample audio, and taking the number of words included in the text with punctuation marks removed as the number of effective words included in the sample audio; and taking the ratio of the number of effective words corresponding to the sample audio to the corresponding effective duration as the speech rate corresponding to the sample audio.
In one aspect, a terminal is provided that includes a processor and a memory having stored therein at least one program code that is loaded and executed by the processor to implement the above-described method of training a vocoder.
In one aspect, a computer readable storage medium having at least one program code stored therein is provided, the at least one program code loaded and executed by a processor to implement the above-described method of training a vocoder.
In one aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer program code stored in a computer readable storage medium, the computer program code being read from the computer readable storage medium by a processor of a computer device, the computer program code being executed by the processor, causing the computer device to perform the above-described training vocoder method.
In the process of training the vocoder, the spectrum data of real speech are input into the self-attention learning module of the trained acoustic model to obtain the second spectrum data, and the vocoder is trained based on the second spectrum data to obtain the trained vocoder. Because the second spectrum data are generated by the self-attention learning module in the trained acoustic model, the difference between the spectrum data of real speech and the spectrum data predicted by the acoustic model is reduced. The vocoder obtained with the training method of the embodiments of the application therefore matches the trained acoustic model better, which improves the sound quality of the synthesized speech output by the vocoder, reduces the hissing ("sandy") artifacts in the synthesized speech, and thus improves the quality of the synthesized speech.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an implementation environment of a method for training a vocoder according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for training a vocoder provided by an embodiment of the present application;
FIG. 3 is a flow chart of a method for training an acoustic model provided by an embodiment of the present application;
FIG. 4 is a flow chart of another method for training an acoustic model provided by an embodiment of the present application;
FIG. 5 is a flow chart of a method for synthesizing sound using a trained acoustic model and a trained vocoder provided by an embodiment of the present application;
Fig. 6 is a schematic diagram of a training vocoder device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation environment of a vocoder training method according to an embodiment of the present application. As shown in fig. 1, the method may be implemented by the terminal 101 or the server 102.
The terminal 101 may include a processor, a memory and the like. The processor, which may be a CPU (Central Processing Unit) or the like, may be configured to acquire time domain data of the sample audio, determine first spectrum data corresponding to the reference time domain data, input the first spectrum data into the self-attention learning module in the trained acoustic model to obtain second spectrum data, input the second spectrum data into the vocoder to obtain predicted time domain data, train the vocoder based on the predicted time domain data and the reference time domain data, and the like. The memory may be RAM (Random Access Memory), Flash memory or the like, and may be used to store sample audio and the like. The terminal 101 may further include a transceiver, an image detection component, a screen, an audio output component, an audio input component and the like. The audio output component may be a speaker, headphones or the like, and the audio input component may be a microphone or the like.
The server 102 may include a processor, a memory and the like. The processor, which may be a CPU (Central Processing Unit) or the like, may be configured to acquire time domain data of the sample audio, determine first spectrum data corresponding to the reference time domain data, input the first spectrum data into the self-attention learning module in the trained acoustic model to obtain second spectrum data, input the second spectrum data into the vocoder to obtain predicted time domain data, train the vocoder based on the predicted time domain data and the reference time domain data, and perform similar processing. The memory may be RAM (Random Access Memory), Flash memory or the like, and may be used to store sample audio and the like.
Fig. 2 is a flow chart of a method for training a vocoder according to an embodiment of the present application. Referring to fig. 2, this embodiment includes:
step 201, time domain data of sample audio is obtained as reference time domain data.
Wherein the sample audio is audio including human voice. The time domain data of the sample audio may be time domain waveform data corresponding to the sample audio.
And randomly selecting time domain data corresponding to the sample audio from the sample library, and taking the time domain data as reference time domain data.
Step 202, determining first spectrum data corresponding to the reference time domain data, and inputting the first spectrum data into the self-attention learning module in the trained acoustic model to obtain second spectrum data.
Wherein the spectral data may be mel spectrum.
In implementation, based on the method for obtaining the mel frequency spectrum in the related technology, a first mel frequency spectrum corresponding to the reference time domain data is obtained, and the first mel frequency spectrum is input into a self-attention learning module in the trained acoustic model to obtain a second mel frequency spectrum.
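As an illustration of this step, the following is a minimal sketch of extracting a mel spectrum from reference time domain data with librosa. The sampling rate, FFT size, hop length and number of mel bands are illustrative assumptions; the application does not fix these parameters.

```python
import librosa
import numpy as np

def extract_mel(waveform: np.ndarray, sr: int = 22050, n_fft: int = 1024,
                hop_length: int = 256, n_mels: int = 80) -> np.ndarray:
    """Compute a log-mel spectrogram (shape: [frames, n_mels]) from time domain data."""
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    # Log compression is commonly applied to mel spectra fed to acoustic models and vocoders.
    return np.log(np.clip(mel, 1e-5, None)).T

# Usage: load one sample audio and obtain its first mel spectrum.
# waveform, sr = librosa.load("sample.wav", sr=22050)
# first_mel = extract_mel(waveform, sr)
```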
It should be noted that the self-attention learning module in the embodiment of the present application predicts the second mel spectrum based on the first mel spectrum, where the first mel spectrum and the second mel spectrum are both composed of the mel spectra of a plurality of audio frames. The principle of predicting the second mel spectrum is as follows: the mel spectrum of the 1st audio frame of the second mel spectrum is the mel spectrum of the 1st audio frame of the first mel spectrum; the mel spectrum of the 2nd audio frame of the second mel spectrum is predicted from the mel spectrum of the 1st audio frame of the first mel spectrum; and so on, until the mel spectrum of the n-th audio frame of the second mel spectrum is predicted from the mel spectrum of the (n-1)-th audio frame of the first mel spectrum, where n denotes the number of frames contained in the second mel spectrum.
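The one-frame-shifted prediction described above can be sketched as follows. This is only an illustration of the data flow, assuming the self-attention learning module is available as a callable that maps the preceding reference frames to the next predicted frame; the module name and interface are assumptions, not part of the application.

```python
import numpy as np

def predict_second_mel(first_mel: np.ndarray, attention_module) -> np.ndarray:
    """Frame-shifted prediction: frame 1 of the second mel spectrum copies frame 1 of
    the first mel spectrum, and frame k (k >= 2) is predicted from the preceding
    frames of the first mel spectrum."""
    second_mel = np.empty_like(first_mel)
    second_mel[0] = first_mel[0]
    for k in range(1, first_mel.shape[0]):
        # In the simplest reading only frame k-1 is used; a self-attention module
        # may attend over all frames up to k-1, which is what is passed here.
        second_mel[k] = attention_module(first_mel[:k])
    return second_mel
```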
Step 203, inputting the second spectrum data into the vocoder to obtain the predicted time domain data.
In an implementation, the second mel-frequency spectrum is input into a vocoder to obtain the predicted time domain data.
It should be noted that the vocoder in the embodiment of the present application may be an untrained vocoder, i.e. an initial vocoder, or may be a pre-trained vocoder, i.e. a preliminarily trained vocoder. The preliminarily trained vocoder is obtained by the following specific steps: time domain data corresponding to sample audio and spectrum data corresponding to the sample audio are acquired, and the time domain data corresponding to the sample audio are taken as reference time domain data. The spectrum data of the sample audio are input into the initial vocoder to obtain predicted time domain data. The predicted time domain data and the reference time domain data are input into a first loss function to obtain first loss information, and the initial vocoder is trained based on the first loss information. Time domain data and spectrum data corresponding to other sample audio are then acquired, and the vocoder obtained in the previous iteration is trained with them until a preset training schedule is completed, so as to obtain the preliminarily trained vocoder; alternatively, training continues until the first loss information obtained from the first loss function satisfies a first preset condition. The first preset condition is that the first loss information tends to be stable, or that the difference between the current first loss information and the previously obtained first loss information is smaller than a first preset value.
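A minimal PyTorch-style sketch of this preliminary pre-training loop and of the first preset condition is given below. The vocoder is assumed to be an nn.Module that maps a mel spectrum to a waveform, the dataset is assumed to yield (mel, reference waveform) pairs, and the L1 loss, learning rate and thresholds are illustrative assumptions rather than values fixed by the application.

```python
import torch
from torch import nn

def pretrain_vocoder(vocoder: nn.Module, dataset, max_steps: int = 100_000,
                     first_preset_value: float = 1e-4, lr: float = 1e-4) -> nn.Module:
    """Pre-train an initial vocoder on (mel, waveform) pairs of sample audio."""
    optimizer = torch.optim.Adam(vocoder.parameters(), lr=lr)
    first_loss_fn = nn.L1Loss()
    last_loss = None
    for step, (mel, reference_waveform) in enumerate(dataset):
        if step >= max_steps:           # the preset training schedule is completed
            break
        predicted_waveform = vocoder(mel)
        loss = first_loss_fn(predicted_waveform, reference_waveform)  # first loss information
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # First preset condition: the loss has stabilised, i.e. the difference between
        # the current and the previous first loss information is below a preset value.
        if last_loss is not None and abs(loss.item() - last_loss) < first_preset_value:
            break
        last_loss = loss.item()
    return vocoder
```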
The vocoder in the embodiments of the present application may be a neural network vocoder, which may include, but is not limited to, an autoregressive neural network vocoder (e.g., a WaveRNN vocoder) or a GAN-based neural network vocoder (e.g., a MelGAN or HiFi-GAN vocoder).
Step 204, training the vocoder based on the predicted time domain data and the reference time domain data.
In implementation, the predicted time domain data obtained in step 203 and the reference time domain data obtained in step 201 are input into a second loss function to obtain second loss information. The parameters to be adjusted in the vocoder are adjusted based on the second loss information, which completes one training iteration. Time domain data of other sample audio are then acquired and the vocoder obtained in the previous iteration is trained with them until a preset training schedule is completed, so as to obtain the trained vocoder; alternatively, training continues until the second loss information obtained from the second loss function satisfies a second preset condition. The second preset condition is that the second loss information tends to be stable, or that the difference between the current second loss information and the previously obtained second loss information is smaller than a second preset value.
The first loss function and the second loss function may be the same loss function or different loss functions. The first preset value and the second preset value may be the same value or different values.
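Putting steps 201 to 204 together, one training iteration of the vocoder can be sketched as follows. The self-attention learning module of the trained acoustic model is assumed to be available as a frozen nn.Module and the vocoder as an nn.Module mapping mel spectra to waveforms; the L1 loss is an illustrative choice, since the application leaves the second loss function open.

```python
import torch
from torch import nn

def train_vocoder_step(vocoder: nn.Module, attention_module: nn.Module,
                       first_mel: torch.Tensor, reference_waveform: torch.Tensor,
                       optimizer: torch.optim.Optimizer,
                       second_loss_fn: nn.Module = None) -> float:
    """One iteration of steps 202-204: the first mel spectrum of real sample audio is
    passed through the (frozen) self-attention learning module, and the vocoder is
    trained on the resulting second mel spectrum against the reference waveform."""
    second_loss_fn = second_loss_fn or nn.L1Loss()
    with torch.no_grad():                      # the acoustic model is already trained
        second_mel = attention_module(first_mel)
    predicted_waveform = vocoder(second_mel)   # step 203: predicted time domain data
    second_loss = second_loss_fn(predicted_waveform, reference_waveform)  # step 204
    optimizer.zero_grad()
    second_loss.backward()
    optimizer.step()
    return second_loss.item()
```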
In the process of training the vocoder, the spectrum data of real speech are input into the self-attention learning module of the trained acoustic model to obtain the second spectrum data, and the vocoder is trained based on the second spectrum data to obtain the trained vocoder. Because the second spectrum data are generated by the self-attention learning module in the trained acoustic model, the difference between the spectrum data of real speech and the spectrum data predicted by the acoustic model is reduced, so the vocoder obtained with this training method matches the trained acoustic model better. When the vocoder is used to interpret the spectrum data output by the acoustic model, the sound quality of the synthesized speech output by the vocoder is improved, the hissing ("sandy") artifacts in the synthesized speech are reduced, and the quality of the synthesized speech is further improved.
When there is enough normal-speech-rate sample audio in the sample library, the acoustic model can be trained directly on the normal-speech-rate sample audio to obtain the trained acoustic model. FIG. 3 is a flow chart of a method for training an acoustic model according to an embodiment of the present application. Referring to fig. 3, this embodiment includes:
Step 301, determining the speech rate of each sample audio in the sample library.
The sample library stores a plurality of sample audios and sample texts corresponding to the plurality of sample audios respectively.
Optionally, for each sample audio, a first duration corresponding to the sample audio, a second duration corresponding to the noise audio segments in the sample audio and a third duration corresponding to the silent audio segments in the sample audio are determined; the sum of the second duration and the third duration is determined, and the difference between the first duration and that sum is taken as the effective duration corresponding to the sample audio. The text corresponding to the sample audio is determined, and the number of words included in the text with punctuation marks removed is taken as the number of effective words included in the sample audio. The ratio of the number of effective words corresponding to the sample audio to the corresponding effective duration is taken as the speech rate corresponding to the sample audio.
The noise audio segments are audio segments that contain only noise, and the silent audio segments are audio segments that contain no sound. The speech rate is defined as: speech rate S = number of effective words N / effective voiced duration T.
In implementation, for each sample audio in the sample library, the first duration of the sample audio is acquired, and the noise audio segments and silent audio segments in the sample audio are determined, so as to obtain the second duration corresponding to the noise audio segments and the third duration corresponding to the silent audio segments. The second duration and the third duration are added to determine their sum, and the difference between the first duration and the sum is determined as the effective duration corresponding to the sample audio. The sample audio is converted into text, the text is compared with each punctuation mark in a punctuation library, and the punctuation marks in the text are determined. The punctuation marks are removed from the text, the number of words contained in the remaining text is determined and taken as the number of effective words contained in the sample audio, and the ratio of the number of effective words corresponding to the sample audio to the corresponding effective duration is taken as the speech rate corresponding to the sample audio.
It should be noted that, in the embodiment of the present application, the noise audio segments and silent audio segments are determined as follows: the non-human-voice audio frames in the sample audio are determined, and from them the non-human-voice audio segments in the sample audio; the non-human-voice audio segments that contain sound are taken as noise audio segments, and the non-human-voice audio segments that contain no sound are taken as silent audio segments. The non-human-voice audio frames in the sample audio may be determined using a method for determining non-human-voice audio frames in the related art. Which audio segments within the non-human-voice segments contain sound and which do not may be determined by detecting whether each audio frame in the non-human-voice segments includes spectral data.
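A minimal sketch of the split into noise and silent segments, assuming the non-human-voice frames have already been identified (for example with an off-the-shelf voice activity detector) and that "includes spectral data" is interpreted as the frame energy exceeding a small threshold; the threshold value is an illustrative assumption.

```python
import numpy as np

def noise_and_silence_durations(non_voice_frames: np.ndarray, hop_seconds: float,
                                energy_threshold: float = 1e-4) -> tuple:
    """non_voice_frames: [frames, bins] spectral magnitudes of the non-human-voice frames.
    Frames whose energy exceeds the threshold are counted as noise, the rest as silence."""
    frame_energy = (non_voice_frames ** 2).sum(axis=1)
    noise_frames = int((frame_energy > energy_threshold).sum())
    silence_frames = len(frame_energy) - noise_frames
    # Second duration (noise) and third duration (silence) used in the speech rate formula.
    return noise_frames * hop_seconds, silence_frames * hop_seconds
```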
For example, suppose the sample audio says "Today, the temperature is 37°C"; the text corresponding to the audio is "Today, the temperature is thirty-seven degrees Celsius", and the text with the punctuation marks removed is "Today the temperature is thirty-seven degrees Celsius". The effective word count of the text with the punctuation removed is 11 words (counted over the characters of the original Chinese text).
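The speech rate computation itself then reduces to the formula S = N / T above. A sketch, assuming the durations have been measured as described and using an illustrative punctuation set:

```python
import string

# Illustrative punctuation set; the application only requires a punctuation library.
PUNCTUATION = set(string.punctuation) | set("，。！？、；：（）《》“”‘’…")

def speech_rate(first_duration_s: float, second_duration_s: float,
                third_duration_s: float, text: str) -> float:
    """Speech rate S = effective word count N / effective duration T."""
    effective_duration = first_duration_s - (second_duration_s + third_duration_s)
    # Effective words: characters remaining after punctuation (and whitespace) are removed.
    effective_words = sum(1 for ch in text if ch not in PUNCTUATION and not ch.isspace())
    return effective_words / effective_duration

# Example: 11 effective words over an effective duration of, say, 3 seconds give a
# speech rate of about 3.67 words/second, inside the 3.3-4.3 "normal" range used below.
```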
Step 302, determining a first sample audio with a speech rate within a preset numerical range and a first sample text corresponding to the first sample audio.
The preset numerical range may be a range bounded by a first speed threshold and a second speed threshold, such as 3.3 words/second to 4.3 words/second. Audio with a speech rate of less than 3.3 words/second is regarded as slow-speech-rate audio, audio with a speech rate of more than 4.3 words/second is regarded as fast-speech-rate audio, and audio with a speech rate between 3.3 and 4.3 words/second is regarded as normal-speech-rate audio. The values 3.3 words/second and 4.3 words/second are only one possible pair of thresholds for measuring speech rate; other values are also possible.
In an implementation, first sample audio at a speech rate between 3.3 words/second and 4.3 words/second is determined, and first sample text corresponding to each first sample audio is determined.
Step 303, training the acoustic model based on the first sample text and the corresponding first sample audio to obtain a trained acoustic model.
Optionally, spectrum data corresponding to the first sample audio is obtained as the reference spectrum data. And determining a sample phoneme sequence and sample pause information corresponding to the first sample text, wherein the sample pause information comprises a sample pause position and a sample pause time length. And inputting the sample phoneme sequence and the sample pause information into an acoustic model to obtain predicted spectrum data. And training the acoustic model based on the reference spectrum data and the predicted spectrum data to obtain a trained acoustic model.
In implementation, based on the first sample audio, the mel spectrum corresponding to the first sample audio is determined and taken as the first reference mel spectrum. The first sample text is input into the phoneme conversion model to obtain the first sample phoneme sequence corresponding to the first sample text. The first sample text is also input into a word segmentation model to obtain a first sample word segmentation result, the first sample word segmentation result is compared with the first sample text, and the first sample pause positions in the first sample text are determined. The first sample phoneme sequence, the first sample pause positions and a preset pause duration are input into the spectrum prediction module in the acoustic model to obtain a first predicted mel spectrum. The first predicted mel spectrum is input into the self-attention learning module in the acoustic model to obtain a second predicted mel spectrum. The second predicted mel spectrum and the first reference mel spectrum are input into a third loss function to obtain third loss information. The parameters to be adjusted in the acoustic model are trained based on the third loss information, which completes one training iteration. The acoustic model obtained in the previous iteration is then trained with other first sample audio and the corresponding first sample texts until a preset training schedule is completed, so as to obtain the trained acoustic model; alternatively, training continues until the third loss information output by the third loss function satisfies a third preset condition. The third preset condition is that the third loss information tends to be stable, or that the difference between the current third loss information and the previously obtained third loss information is smaller than a third preset value.
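One iteration of this acoustic model training loop might look like the following sketch. The spectrum prediction module and self-attention learning module are assumed to be nn.Modules with the shown (hypothetical) interfaces, and the MSE loss is an illustrative choice for the third loss function.

```python
import torch
from torch import nn

def train_acoustic_model_step(spectrum_predictor: nn.Module, attention_module: nn.Module,
                              phonemes: torch.Tensor, pause_positions: torch.Tensor,
                              pause_duration: torch.Tensor, reference_mel: torch.Tensor,
                              optimizer: torch.optim.Optimizer,
                              third_loss_fn: nn.Module = None) -> float:
    """One iteration of step 303: predict a mel spectrum from the sample phoneme sequence
    and sample pause information, and pull it towards the first reference mel spectrum."""
    third_loss_fn = third_loss_fn or nn.MSELoss()
    first_predicted_mel = spectrum_predictor(phonemes, pause_positions, pause_duration)
    second_predicted_mel = attention_module(first_predicted_mel)
    third_loss = third_loss_fn(second_predicted_mel, reference_mel)
    optimizer.zero_grad()
    third_loss.backward()
    optimizer.step()
    return third_loss.item()
```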
It should be noted that the acoustic model in the embodiment of the present application is trained on normal-speech-rate audio. Therefore, the speech rate of the audio output by the trained acoustic model is normal, which avoids the problem in the prior art that the speech rate of the audio output by the trained acoustic model fluctuates between fast and slow.
Alternatively, if the trained acoustic model that is required is a fast-speech-rate acoustic model, the fast-speech-rate acoustic model may be obtained based on the following steps: sample audio with a speech rate greater than the second speed threshold, such as 4.3 words/second, is determined, together with the sample text corresponding to that sample audio; the acoustic model is trained based on these sample texts and the corresponding sample audio to obtain the fast-speech-rate acoustic model.
Alternatively, if the trained acoustic model that is required is a slow-speech-rate acoustic model, the slow-speech-rate acoustic model may be obtained based on the following steps: sample audio with a speech rate less than the first speed threshold, such as 3.3 words/second, is determined, together with the sample text corresponding to that sample audio; the acoustic model is trained based on these sample texts and the corresponding sample audio to obtain the slow-speech-rate acoustic model.
The speech rate of an acoustic model obtained by such training is relatively stable, which avoids the unstable-speech-rate problem of the audio produced by acoustic models trained in the prior art.
When the amount of normal-speech-rate sample audio in the sample library is insufficient, an acoustic model trained only on normal-speech-rate sample audio cannot converge. Therefore, in the embodiment of the application, the acoustic model may first be trained based on sample audio with abnormal speech rates to obtain a preliminarily trained acoustic model, and the preliminarily trained acoustic model is then trained based on normal-speech-rate sample audio to obtain the trained acoustic model. FIG. 4 is a flow chart of another method for training an acoustic model provided by an embodiment of the present application, see FIG. 4, which includes:
step 401, determining the speech rate of each sample audio in the sample library.
The sample library stores a plurality of sample audios and sample texts corresponding to the plurality of sample audios respectively.
This step is similar to the implementation of step 301 and will not be described here again.
Step 402, determining a first sample audio with a speech rate within a preset numerical range, a first sample text corresponding to the first sample audio, a second sample audio with a speech rate outside the preset numerical range, and a second sample text corresponding to the second sample audio.
Sample audio with a speech rate within the preset numerical range is regarded as normal-speech-rate audio, for example sample audio with a speech rate between 3.3 words/second and 4.3 words/second. Sample audio with a speech rate outside the preset numerical range is regarded as abnormal-speech-rate audio, which includes fast-speech-rate audio (speech rate greater than 4.3 words/second) and slow-speech-rate audio (speech rate less than 3.3 words/second).
Step 403, training the acoustic model based on the second sample text and the corresponding second sample audio to obtain a preliminarily trained acoustic model.
In implementation, based on the second sample audio, the mel spectrum corresponding to the second sample audio is determined and taken as the second reference mel spectrum. The second sample text is input into the phoneme conversion model to obtain the second sample phoneme sequence corresponding to the second sample text. The second sample text is also input into the word segmentation model to obtain a second sample word segmentation result, the second sample word segmentation result is compared with the second sample text, and the second sample pause positions in the second sample text are determined. The second sample phoneme sequence, the second sample pause positions and the preset pause duration are input into the spectrum prediction module in the acoustic model to obtain a third predicted mel spectrum. The third predicted mel spectrum is input into the self-attention learning module in the acoustic model to obtain a fourth predicted mel spectrum. The fourth predicted mel spectrum and the second reference mel spectrum are input into a fourth loss function to obtain fourth loss information. The parameters to be adjusted in the acoustic model are trained based on the fourth loss information, which completes one training iteration. The acoustic model obtained in the previous iteration is then trained with other second sample audio and the corresponding second sample texts until a preset training schedule is completed, so as to obtain the preliminarily trained acoustic model; alternatively, training continues until the fourth loss information output by the fourth loss function satisfies a fourth preset condition. The fourth preset condition is that the fourth loss information tends to be stable, or that the difference between the current fourth loss information and the previously obtained fourth loss information is smaller than a fourth preset value.
The first and second loss functions in embodiment 1, the third loss function in embodiment 2, and the fourth loss function in this embodiment may be the same loss function or different loss functions. The first preset value and the second preset value in embodiment 1, the third preset value in embodiment 2, and the fourth preset value in this embodiment may be the same value or different values.
In the embodiment of the present application, there are three methods for training the acoustic model based on the second sample text and the corresponding second sample audio, as follows:
1. Without distinguishing the fast-speech-rate audio from the slow-speech-rate audio in the second sample audio, the acoustic model is trained directly with the second sample text and the corresponding second sample audio to obtain the preliminarily trained acoustic model.
2. The fast-speech-rate audio and the slow-speech-rate audio in the second sample audio are distinguished. The acoustic model is first trained based on the fast-speech-rate audio and its corresponding sample text to obtain a first acoustic model. The first acoustic model is then trained based on the slow-speech-rate audio and its corresponding sample text to obtain the preliminarily trained acoustic model.
3. The fast-speech-rate audio and the slow-speech-rate audio in the second sample audio are distinguished. The acoustic model is first trained based on the slow-speech-rate audio and its corresponding sample text to obtain a second acoustic model. The second acoustic model is then trained based on the fast-speech-rate audio and its corresponding sample text to obtain the preliminarily trained acoustic model.
Step 404, training the preliminarily trained acoustic model based on the first sample text and the corresponding first sample audio to obtain the trained acoustic model.
The training method is similar to the training method mentioned in step 403 or step 303, and will not be described here.
In the embodiment of the application, the trained acoustic model that is finally required is a normal-speech-rate acoustic model, so the speech rate of the audio output by the trained acoustic model does not fluctuate between fast and slow. Because the acoustic model is pre-trained with abnormal-speech-rate sample audio, the problem that the trained acoustic model performs poorly due to an insufficient amount of normal-speech-rate sample audio is avoided to a certain extent. If the trained acoustic model that is required is instead a fast-speech-rate or slow-speech-rate acoustic model, the acoustic model can be trained with fast-speech-rate or slow-speech-rate sample audio and the corresponding sample text.
For example, the fast-speech-rate acoustic model may be obtained as follows: sample audio with a speech rate greater than 4.3 words/second and the corresponding sample text, and sample audio with a speech rate less than 4.3 words/second and the corresponding sample text, are determined. The acoustic model is trained based on the sample audio with a speech rate less than 4.3 words/second and the corresponding sample text to obtain a third acoustic model. The third acoustic model is then trained based on the sample audio with a speech rate greater than 4.3 words/second and the corresponding sample text to obtain the fast-speech-rate acoustic model.
The slow-speech-rate acoustic model may be obtained as follows: sample audio with a speech rate less than 3.3 words/second and the corresponding sample text, and sample audio with a speech rate greater than 3.3 words/second and the corresponding sample text, are determined. The acoustic model is trained based on the sample audio with a speech rate greater than 3.3 words/second and the corresponding sample text to obtain a fourth acoustic model. The fourth acoustic model is then trained based on the sample audio with a speech rate less than 3.3 words/second and the corresponding sample text to obtain the slow-speech-rate acoustic model.
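The two-stage schedules of this embodiment (steps 401-404 and the fast and slow variants above) all follow the same pattern: pre-train on the samples outside the target speech-rate range, then fine-tune on the samples inside it. A sketch, where the sample objects, their speech_rate attribute and the train_fn callback are assumptions:

```python
def train_in_two_stages(acoustic_model, sample_library, train_fn, in_target_range):
    """Pre-train on samples whose speech rate is outside the target range, then fine-tune
    on samples inside it. in_target_range is a predicate on the speech rate, e.g.
    lambda s: 3.3 <= s <= 4.3 for a normal-speech-rate model, or lambda s: s > 4.3
    for a fast-speech-rate model."""
    inside = [s for s in sample_library if in_target_range(s.speech_rate)]
    outside = [s for s in sample_library if not in_target_range(s.speech_rate)]
    train_fn(acoustic_model, outside)   # preliminarily trained acoustic model
    train_fn(acoustic_model, inside)    # trained acoustic model for the target range
    return acoustic_model
```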
Fig. 5 is a flow chart of a method for synthesizing sound using a trained acoustic model and a trained vocoder, provided by an embodiment of the present application. Referring to fig. 5, this embodiment includes:
step 501, obtaining a phoneme sequence and pause information corresponding to a target text.
Wherein the pause information includes a pause location and a pause duration.
Optionally, inputting the target text into a phoneme conversion model to obtain a phoneme sequence corresponding to the target text. And inputting the target text into a pause prediction model to obtain pause information corresponding to the target text.
The phoneme conversion model and the pause prediction model in the embodiment of the application can be a machine learning model or a non-machine learning model. Specifically, the pause prediction model may be a word segmentation model. When the phoneme conversion model and the word segmentation model are non-machine learning models, a large number of corresponding relations between words and phonemes are stored in the phoneme conversion model, and a large number of words are stored in the word segmentation model.
In implementation, the target text is input into the phoneme conversion model, and the target text is converted into a phoneme sequence based on the correspondence between words and phonemes stored in the phoneme conversion model. The target text is compared with the words stored in the word segmentation model to obtain a word segmentation result, the word segmentation result is compared with the target text, and the pause positions in the target text are determined. Pause information corresponding to the target text is obtained from the pause positions and a preset pause duration.
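For the non-machine-learning variant described above, the text front end might be sketched as follows, assuming a phoneme dictionary and a stored word list; the greedy longest-match segmentation and the placement of a pause after every segmented word are simplifying assumptions rather than requirements of the application.

```python
def text_front_end(target_text: str, phoneme_dict: dict, word_list: set,
                   preset_pause_duration: float = 0.3):
    """Step 501: convert the target text into a phoneme sequence and pause information
    (pause position, pause duration)."""
    # Greedy longest-match word segmentation against the stored word list.
    words, i = [], 0
    while i < len(target_text):
        for j in range(len(target_text), i, -1):
            if target_text[i:j] in word_list or j == i + 1:
                words.append(target_text[i:j])
                i = j
                break
    # Phoneme sequence from the stored word-to-phoneme correspondences.
    phoneme_sequence = [p for w in words for p in phoneme_dict.get(w, [])]
    # Pause information: a pause of the preset duration after each segmented word.
    pause_info = [(len("".join(words[:k + 1])), preset_pause_duration)
                  for k in range(len(words))]
    return phoneme_sequence, pause_info
```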
Step 502, inputting the phoneme sequence and the pause information into the trained acoustic model to obtain third spectrum data.
The trained acoustic model is an acoustic model obtained by training based on the method provided in embodiment 2 or embodiment 3.
Optionally, inputting the phoneme sequence and the pause information into a spectrum prediction module in the trained acoustic model to obtain fourth spectrum data. And inputting the fourth frequency spectrum data into a self-attention learning module in the trained acoustic model to obtain third frequency spectrum data.
In an implementation, the phoneme sequence, the pause position and the pause duration are input into a spectrum prediction module in the trained acoustic model to obtain fourth spectrum data. And inputting the fourth frequency spectrum data into a self-attention learning module in the trained acoustic model to obtain third frequency spectrum data.
Step 503, inputting the third spectrum data into the trained vocoder to obtain the target time domain data corresponding to the target text.
The trained vocoder is a vocoder trained based on the method provided in embodiment 1. In the embodiment of the present application, in order to make the trained acoustic model and the trained vocoder in step 502 match better, the trained acoustic model and the trained vocoder may be of the same speech-rate type. For example, the trained acoustic model in step 502 is a normal-speech-rate acoustic model and the trained vocoder is a normal-speech-rate vocoder; or the trained acoustic model in step 502 is a slow-speech-rate acoustic model and the trained vocoder is a slow-speech-rate vocoder; or the trained acoustic model in step 502 is a fast-speech-rate acoustic model and the trained vocoder is a fast-speech-rate vocoder.
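End to end, steps 501-503 amount to the following inference sketch, reusing the (hypothetical) module interfaces from the training sketches above; the trained modules are assumed to run without gradient tracking.

```python
import torch

@torch.no_grad()
def synthesize(spectrum_predictor, attention_module, vocoder,
               phonemes, pause_positions, pause_duration) -> torch.Tensor:
    """Steps 502-503: the trained acoustic model produces the third spectrum data and the
    trained vocoder converts it into the target time domain data."""
    fourth_spectrum = spectrum_predictor(phonemes, pause_positions, pause_duration)
    third_spectrum = attention_module(fourth_spectrum)   # self-attention learning module
    target_waveform = vocoder(third_spectrum)            # target time domain data
    return target_waveform
```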
The steps of obtaining the normal-speech-rate vocoder are as follows. First, according to the step of obtaining the preliminarily trained vocoder in step 203, the initial vocoder is trained based on the time domain data and spectrum data of normal-speech-rate sample audio to obtain a preliminarily trained vocoder whose speech rate is normal. Second, according to the vocoder training method provided in embodiment 1, the preliminarily trained normal-speech-rate vocoder is trained based on the time domain data of normal-speech-rate sample audio to obtain the normal-speech-rate vocoder.
The sample audio of the normal speech rate used in the first step and the sample audio of the normal speech rate used in the second step may be the same sample audio or may be different sample audio.
Similarly, the method of obtaining a slow-speech-rate vocoder or a fast-speech-rate vocoder is similar to the method of obtaining the normal-speech-rate vocoder, and will not be described in detail here.
Optionally, the trained acoustic model and the trained vocoder are provided in the user terminal. When the user terminal needs to play the audio of the target text, the method provided by the embodiment of the application can be used for acquiring the target time domain data corresponding to the target text, and playing the audio of the target text based on the target time domain data after acquiring the target time domain data of the target text.
The trained acoustic model and the trained vocoder are stored in the server in advance, and when the server receives the synthesized voice request of the terminal, the trained acoustic model and the trained vocoder are sent to the user terminal, so that time domain data corresponding to the target text is synthesized in the user terminal.
Optionally, the trained acoustic model and the trained vocoder are located in a server. When the user terminal needs to play the audio of the target text, the user terminal sends an audio acquisition request to the server, wherein the audio acquisition request carries the target text. After receiving an audio acquisition request sent by a user terminal, a server acquires a target text in the audio acquisition request, acquires target time domain data corresponding to the target text based on the method provided by the embodiment of the application, and sends the target time domain data to the user terminal, so that audio corresponding to the target text is played on the user terminal.
Fig. 6 is a schematic structural diagram of a vocoder training apparatus according to an embodiment of the present application, referring to fig. 6, the apparatus includes:
An acquisition module 610 configured to acquire time domain data of the sample audio as reference time domain data;
a first input module 620 configured to determine first spectrum data corresponding to the reference time domain data, and input the first spectrum data into the self-attention learning module in the trained acoustic model to obtain second spectrum data;
A second input module 630 configured to input the second spectral data into a vocoder to obtain predicted time domain data;
a first training module 640 configured to train the vocoder based on the predicted time domain data and the reference time domain data.
Optionally, the apparatus further comprises a usage module configured to:
obtaining a phoneme sequence and pause information corresponding to a target text, wherein the pause information comprises pause positions and pause time;
inputting the phoneme sequence and the pause information into a trained acoustic model to obtain third spectrum data;
And inputting the third frequency spectrum data into a trained vocoder to obtain target time domain data corresponding to the target text.
Optionally, the usage module is configured to:
inputting the target text into a phoneme conversion model to obtain a phoneme sequence corresponding to the target text;
and inputting the target text into a pause prediction model to obtain pause information corresponding to the target text.
Optionally, the usage module is configured to:
inputting the phoneme sequence and the pause information into a frequency spectrum prediction module in the trained acoustic model to obtain fourth frequency spectrum data;
and inputting the fourth frequency spectrum data into a self-attention learning module in the trained acoustic model to obtain the third frequency spectrum data.
Optionally, the apparatus further comprises a second training module configured to:
determining the speech rate of each sample audio in a sample library, wherein the sample library stores a plurality of sample audios and sample texts corresponding to the plurality of sample audios respectively;
Determining first sample audio whose speech rate is within a preset numerical range and first sample text corresponding to the first sample audio;
and training the acoustic model based on the first sample text and the corresponding first sample audio to obtain a trained acoustic model.
Optionally, the second training module is configured to:
acquiring spectrum data corresponding to the first sample audio as reference spectrum data;
Determining a sample phoneme sequence and sample pause information corresponding to the first sample text, wherein the sample pause information comprises a sample pause position and a sample pause time;
inputting the sample phoneme sequence and the sample pause information into an acoustic model to obtain predicted spectrum data;
and training the acoustic model based on the reference spectrum data and the predicted spectrum data to obtain a trained acoustic model.
Optionally, the second training module is further configured to:
determining the speech rate of each sample audio in a sample library, wherein the sample library stores a plurality of sample audios and sample texts corresponding to the plurality of sample audios respectively;
Determining a first sample audio with a speech rate within a preset numerical range, a first sample text corresponding to the first sample audio, a second sample audio with a speech rate outside the preset numerical range and a second sample text corresponding to the second sample audio;
training the acoustic model based on the second sample text and the corresponding second sample audio to obtain a preliminarily trained acoustic model;
And training the preliminarily trained acoustic model based on the first sample text and the corresponding first sample audio to obtain a trained acoustic model.
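The two-stage schedule can be sketched as below, reusing the train_acoustic_step() helper from the previous sketch; the epoch count is an arbitrary placeholder.

```python
def two_stage_training(acoustic_model, optimizer, second_samples, first_samples, epochs=10):
    # Stage 1: samples whose speech rate falls outside the preset range (preliminary training).
    for _ in range(epochs):
        for phonemes, pauses, ref_spec in second_samples:
            train_acoustic_step(acoustic_model, optimizer, phonemes, pauses, ref_spec)
    # Stage 2: fine-tune on samples inside the preset range to obtain the trained acoustic model.
    for _ in range(epochs):
        for phonemes, pauses, ref_spec in first_samples:
            train_acoustic_step(acoustic_model, optimizer, phonemes, pauses, ref_spec)
    return acoustic_model
```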
Optionally, the second training module is configured to:
For each sample audio, determining a first duration of the sample audio, a second duration of the noise audio segments in the sample audio, and a third duration of the silent audio segments in the sample audio, and determining the difference between the first duration and the sum of the second duration and the third duration as the effective duration of the sample audio; determining the text corresponding to the sample audio, and determining the number of words contained in the text after punctuation marks are removed as the number of effective words contained in the sample audio; and determining the ratio of the effective word count of the sample audio to its effective duration as the speech rate of the sample audio.
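A small Python sketch of this speech-rate computation follows. The field names of the sample record and the punctuation handling (whitespace-split word counting, which would be character counting for Chinese text) are simplifying assumptions.

```python
import string

def speech_rate(sample):
    """Effective word count divided by effective duration, as described above."""
    # Effective duration: total duration minus the noise and silence segment durations (seconds).
    effective_duration = sample["duration"] - (sample["noise_duration"] + sample["silence_duration"])
    # Effective word count: words left after punctuation marks are removed from the text.
    stripped = sample["text"].translate(str.maketrans("", "", string.punctuation + "，。！？、；："))
    effective_words = len(stripped.split())
    return effective_words / effective_duration
```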
It should be noted that the vocoder training apparatus provided in the above embodiment is described only with the above division of functional modules as an example. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. In addition, the vocoder training apparatus and the vocoder training method provided in the above embodiments belong to the same concept; the specific implementation process is detailed in the method embodiments and is not repeated here.
Fig. 7 shows a block diagram of a terminal 700 according to an exemplary embodiment of the present application. The terminal 700 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 700 may also be referred to by other names such as user device, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal 700 includes: a processor 701 and a memory 702.
Processor 701 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 701 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor. The main processor is a processor for processing data in an awake state and is also called a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 701 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 701 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 702 may include one or more computer-readable storage media, which may be non-transitory. The memory 702 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 702 is used to store at least one program code, which is executed by the processor 701 to implement the vocoder training method provided by the method embodiments of the present application.
In some embodiments, the terminal 700 may further optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 703 via buses, signal lines or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 704, a display 705, a camera assembly 706, audio circuitry 707, a positioning assembly 708, and a power supply 709.
The peripheral interface 703 may be used to connect at least one I/O (Input/Output) related peripheral to the processor 701 and the memory 702. In some embodiments, the processor 701, the memory 702, and the peripheral interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702, and the peripheral interface 703 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 704 is configured to receive and transmit RF (Radio Frequency) signals, also referred to as electromagnetic signals. The radio frequency circuit 704 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 704 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 704 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 704 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 704 may further include NFC (Near Field Communication) related circuits, which is not limited by the present application.
The display screen 705 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 705 is a touch display screen, the display screen 705 also has the ability to collect touch signals on or above its surface. The touch signal may be input to the processor 701 as a control signal for processing. In this case, the display screen 705 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 705, disposed on the front panel of the terminal 700; in other embodiments, there may be at least two display screens 705, respectively disposed on different surfaces of the terminal 700 or in a folded design; in still other embodiments, the display screen 705 may be a flexible display screen disposed on a curved surface or a folded surface of the terminal 700. The display screen 705 may even be arranged in a non-rectangular irregular pattern, that is, an irregularly shaped screen. The display screen 705 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 706 is used to capture images or video. Optionally, the camera assembly 706 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, the camera assembly 706 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuit 707 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 701 for processing, or inputting the electric signals to the radio frequency circuit 704 for voice communication. For the purpose of stereo acquisition or noise reduction, a plurality of microphones may be respectively disposed at different portions of the terminal 700. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 701 or the radio frequency circuit 704 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 707 may also include a headphone jack.
The positioning component 708 is used to locate the current geographic location of the terminal 700 to implement navigation or LBS (Location Based Service). The positioning component 708 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
A power supply 709 is used to power the various components in the terminal 700. The power supply 709 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power supply 709 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 700 further includes one or more sensors 710. The one or more sensors 710 include, but are not limited to: acceleration sensor 711, gyroscope sensor 712, pressure sensor 713, fingerprint sensor 714, optical sensor 715, and proximity sensor 716.
The acceleration sensor 711 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 700. For example, the acceleration sensor 711 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 701 may control the display screen 705 to display a user interface in a landscape view or a portrait view based on the gravitational acceleration signal acquired by the acceleration sensor 711. The acceleration sensor 711 may also be used for the acquisition of motion data of a game or a user.
The gyroscope sensor 712 may detect the body direction and rotation angle of the terminal 700, and may cooperate with the acceleration sensor 711 to collect the user's 3D actions on the terminal 700. Based on the data collected by the gyroscope sensor 712, the processor 701 may implement the following functions: motion sensing (for example, changing the UI according to a tilting operation by the user), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 713 may be disposed at a side frame of the terminal 700 and/or at a lower layer of the display screen 705. When the pressure sensor 713 is disposed at a side frame of the terminal 700, a grip signal of the user to the terminal 700 may be detected, and the processor 701 performs left-right hand recognition or quick operation according to the grip signal collected by the pressure sensor 713. When the pressure sensor 713 is disposed at the lower layer of the display screen 705, the processor 701 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 705. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 714 is used to collect a fingerprint of the user, and the processor 701 identifies the identity of the user based on the fingerprint collected by the fingerprint sensor 714, or the fingerprint sensor 714 identifies the identity of the user based on the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 701 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying for and changing settings, etc. The fingerprint sensor 714 may be provided on the front, back, or side of the terminal 700. When a physical key or vendor Logo is provided on the terminal 700, the fingerprint sensor 714 may be integrated with the physical key or vendor Logo.
The optical sensor 715 is used to collect the ambient light intensity. In one embodiment, the processor 701 may control the display brightness of the display screen 705 based on the ambient light intensity collected by the optical sensor 715. Specifically, when the intensity of the ambient light is high, the display brightness of the display screen 705 is turned up; when the ambient light intensity is low, the display brightness of the display screen 705 is turned down. In another embodiment, the processor 701 may also dynamically adjust the shooting parameters of the camera assembly 706 based on the ambient light intensity collected by the optical sensor 715.
A proximity sensor 716, also referred to as a distance sensor, is typically provided on the front panel of the terminal 700. The proximity sensor 716 is used to collect the distance between the user and the front of the terminal 700. In one embodiment, when the proximity sensor 716 detects that the distance between the user and the front face of the terminal 700 gradually decreases, the processor 701 controls the display 705 to switch from the bright screen state to the off screen state; when the proximity sensor 716 detects that the distance between the user and the front surface of the terminal 700 gradually increases, the processor 701 controls the display screen 705 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 7 is not limiting of the terminal 700 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
The computer device provided by the embodiment of the present application may be provided as a server. Fig. 8 is a schematic diagram of a server according to an embodiment of the present application. The server 800 may vary greatly in configuration or performance, and may include one or more processors (central processing units, CPU) 801 and one or more memories 802, where at least one program code is stored in the memory 802, and the at least one program code is loaded and executed by the processor 801 to implement the vocoder training method provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for implementing the functions of the device, which are not described herein.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory comprising program code, where the program code is executable by a processor in a terminal or server to perform the vocoder training method in the above embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing description is merely of preferred embodiments of the present application and is not intended to limit the present application, whose scope of protection is defined by the appended claims.

Claims (10)

1. A method of training a vocoder, the method comprising:
Acquiring time domain data of sample audio as reference time domain data;
Determining first frequency spectrum data corresponding to the reference time domain data, and inputting the first frequency spectrum data into a self-attention learning module in the trained acoustic model to obtain second frequency spectrum data;
inputting the second frequency spectrum data into a vocoder to obtain predicted time domain data;
The vocoder is trained based on the predicted time domain data and the reference time domain data.
2. The method according to claim 1, wherein the method further comprises:
determining the speech rate of each sample audio in a sample library, wherein the sample library stores a plurality of sample audios and sample texts corresponding to the plurality of sample audios respectively;
Determining a first sample audio with a speech rate within a preset numerical range and a first sample text corresponding to the first sample audio;
and training the acoustic model based on the first sample text and the corresponding first sample audio to obtain a trained acoustic model.
3. The method of claim 2, wherein training the acoustic model based on the first sample text and the corresponding first sample audio to obtain a trained acoustic model comprises:
acquiring spectrum data corresponding to the first sample audio as reference spectrum data;
Determining a sample phoneme sequence and sample pause information corresponding to the first sample text, wherein the sample pause information comprises a sample pause position and a sample pause time;
inputting the sample phoneme sequence and the sample pause information into an acoustic model to obtain predicted spectrum data;
and training the acoustic model based on the reference spectrum data and the predicted spectrum data to obtain a trained acoustic model.
4. The method according to claim 1, wherein the method further comprises:
determining the speech rate of each sample audio in a sample library, wherein the sample library stores a plurality of sample audios and sample texts corresponding to the plurality of sample audios respectively;
Determining a first sample audio with a speech rate within a preset numerical range, a first sample text corresponding to the first sample audio, a second sample audio with a speech rate outside the preset numerical range and a second sample text corresponding to the second sample audio;
training the acoustic model based on the second sample text and the corresponding second sample audio to obtain a preliminarily trained acoustic model;
And training the preliminarily trained acoustic model based on the first sample text and the corresponding first sample audio to obtain a trained acoustic model.
5. The method of claim 2 or 4, wherein determining the speech rate of each sample audio in the sample library comprises:
For each sample audio, determining a first duration of the sample audio, a second duration of the noise audio segments in the sample audio, and a third duration of the silent audio segments in the sample audio, and determining the difference between the first duration and the sum of the second duration and the third duration as the effective duration of the sample audio; determining the text corresponding to the sample audio, and determining the number of words included in the text with punctuation marks removed as the number of effective words included in the sample audio; and determining the ratio of the effective word count of the sample audio to its effective duration as the speech rate of the sample audio.
6. A terminal comprising a processor and a memory, the memory having stored therein at least one program code, the at least one program code being loaded and executed by the processor to perform the operations performed by the vocoder training method of any one of claims 1 to 5.
7. A computer-readable storage medium having stored therein at least one program code, the at least one program code being loaded and executed by a processor to perform the operations performed by the vocoder training method of any one of claims 1 to 5.
8. A method of synthesizing sound, the method comprising:
obtaining a phoneme sequence and pause information corresponding to a target text, wherein the pause information comprises pause positions and pause time;
inputting the phoneme sequence and the pause information into a trained acoustic model to obtain third spectrum data;
and inputting the third frequency spectrum data into a trained vocoder to obtain target time domain data corresponding to the target text, wherein the vocoder is trained by using the method for training the vocoder according to any one of claims 1 to 5.
9. The method of claim 8, wherein the obtaining the phoneme sequence and the pause information corresponding to the target text comprises:
inputting the target text into a phoneme conversion model to obtain a phoneme sequence corresponding to the target text;
and inputting the target text into a pause prediction model to obtain pause information corresponding to the target text.
10. The method of claim 8, wherein inputting the phoneme sequence and the pause information into a trained acoustic model to obtain third spectral data comprises:
inputting the phoneme sequence and the pause information into a frequency spectrum prediction module in the trained acoustic model to obtain fourth frequency spectrum data;
and inputting the fourth frequency spectrum data into a self-attention learning module in the trained acoustic model to obtain the third frequency spectrum data.
CN202110615714.3A 2021-06-02 2021-06-02 Vocoder training method, terminal and storage medium Active CN113362836B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110615714.3A CN113362836B (en) 2021-06-02 2021-06-02 Vocoder training method, terminal and storage medium


Publications (2)

Publication Number Publication Date
CN113362836A CN113362836A (en) 2021-09-07
CN113362836B true CN113362836B (en) 2024-06-11

Family

ID=77531652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110615714.3A Active CN113362836B (en) 2021-06-02 2021-06-02 Vocoder training method, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN113362836B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115188365B (en) * 2022-09-09 2022-12-27 中邮消费金融有限公司 Pause prediction method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111247585A (en) * 2019-12-27 2020-06-05 深圳市优必选科技股份有限公司 Voice conversion method, device, equipment and storage medium
CN111489734A (en) * 2020-04-03 2020-08-04 支付宝(杭州)信息技术有限公司 Model training method and device based on multiple speakers

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6733644B2 (en) * 2017-11-29 2020-08-05 ヤマハ株式会社 Speech synthesis method, speech synthesis system and program
CN108172224B (en) * 2017-12-19 2019-08-27 浙江大学 Method based on the defence of machine learning without vocal command control voice assistant
CN111667816B (en) * 2020-06-15 2024-01-23 北京百度网讯科技有限公司 Model training method, speech synthesis method, device, equipment and storage medium
CN112037760B (en) * 2020-08-24 2022-01-07 北京百度网讯科技有限公司 Training method and device of voice spectrum generation model and electronic equipment
CN112037757B (en) * 2020-09-04 2024-03-15 腾讯音乐娱乐科技(深圳)有限公司 Singing voice synthesizing method, singing voice synthesizing equipment and computer readable storage medium
CN112735389A (en) * 2020-12-29 2021-04-30 平安科技(深圳)有限公司 Voice training method, device and equipment based on deep learning and storage medium
CN112802450B (en) * 2021-01-05 2022-11-18 杭州一知智能科技有限公司 Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof


Also Published As

Publication number Publication date
CN113362836A (en) 2021-09-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant