CN116312466B - Speaker adaptation method, voice translation method and system based on small amount of samples - Google Patents

Speaker adaptation method, voice translation method and system based on small amount of samples

Info

Publication number
CN116312466B
CN116312466B (application CN202310580319.5A)
Authority
CN
China
Prior art keywords
speaker
text
voice
frequency spectrum
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310580319.5A
Other languages
Chinese (zh)
Other versions
CN116312466A (en)
Inventor
柯登峰
佟运佳
徐艳艳
王运峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocdop Ltd
Original Assignee
Ocdop Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocdop Ltd filed Critical Ocdop Ltd
Priority to CN202310580319.5A priority Critical patent/CN116312466B/en
Publication of CN116312466A publication Critical patent/CN116312466A/en
Application granted granted Critical
Publication of CN116312466B publication Critical patent/CN116312466B/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047 - Architecture of speech synthesisers
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18 - Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • Y02T 10/40 - Engine management systems (Y02T: climate change mitigation technologies related to transportation; Y02T 10/10: internal combustion engine [ICE] based vehicles)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of speech synthesis, and in particular discloses a speaker adaptation method, a speech translation method and corresponding systems based on a small number of samples. The method comprises: acquiring speech data with text labels and preprocessing the speech data to generate a mel spectrum; constructing a personalized speech synthesis model and inputting the mel spectrum and the text into it to obtain a predicted mel spectrum; pre-training the personalized speech synthesis model based on the mel spectrum and the predicted mel spectrum, and then fine-tuning it; acquiring the voice of a target speaker and arbitrary text information, and preprocessing the target speaker's voice to obtain a mel spectrum; inputting this mel spectrum and the text information into the trained personalized speech synthesis model to obtain a predicted mel spectrum; and generating the target speech corresponding to the text information from the predicted mel spectrum. The method separates the content features from the speaker features in the speech and alleviates the problem of low speaker similarity in few-sample speech synthesis.

Description

Speaker adaptation method, voice translation method and system based on small amount of samples
Technical Field
The application relates to the technical field of speech synthesis, in particular to a speaker adaptation method, a speech translation method and a speech translation system based on a small number of samples.
Background
Small-sample speaker adaptation aims to synthesize arbitrary speech of a target speaker using only a small number of target-speaker voice-text pairs. Training an end-to-end TTS (text-to-speech) system, however, requires a large amount of paired text-audio data and high-quality recordings, so the cost of collecting sufficient speech data is very high. Adapting TTS models to new speakers with a small number of samples has therefore been a research hotspot in academia and industry in recent years. Current mainstream methods include speaker adaptation and speaker encoding.
The speaker-adaptive method fine-tunes a trained multi-speaker TTS model with a small number of enrollment samples, but it usually needs at least thousands of fine-tuning steps to reach a high-quality adaptation effect and is difficult to deploy on mobile devices. The speaker-encoding method extracts a speaker vector from the enrollment samples, and the trained TTS model, conditioned on this speaker vector, can then output speech of the specified user; however, the speaker encoder often generalizes poorly because of the gap between seen and unseen speakers, so the similarity between the synthesized speech and the target speaker's voice is low.
Disclosure of Invention
In view of the above problems, an object of the present application is to provide a speaker adaptation method based on a small number of samples, which realizes few-sample speaker adaptation with a multi-granularity coding structure and uses this structure to extract the corresponding content features and speaker features from the user's speech signal; the method separates the content features in the speech from the speaker features, including timbre, pronunciation characteristics and pauses, and can alleviate the problem of low speaker similarity in few-sample speech synthesis.
It is a second object of the application to provide a speaker adaptation system based on a small number of samples.
The third object of the present application is to provide a speech translation method, which extracts acoustic features from the native-language speech of a target speaker, separates the content features in the native-language speech from the speaker identity features including timbre, pronunciation characteristics and pauses, combines the translated target-language text with the speaker features, and performs personalized speech synthesis on the target-language text to achieve the effect of personalized speech translation.
It is a fourth object of the present application to provide a speech translation system.
The first technical scheme adopted by the application is as follows: a speaker adaptation method based on a small number of samples, comprising the steps of:
s100: acquiring voice data with text labels, and preprocessing the voice data to generate a Mel frequency spectrum;
s200: constructing a personalized speech synthesis model, and inputting the Mel frequency spectrum and the text into the personalized speech synthesis model so as to obtain a predicted Mel frequency spectrum;
s300: pre-training the personalized speech synthesis model based on the Mel frequency spectrum and the predicted Mel frequency spectrum, and fine-tuning the pre-trained personalized speech synthesis model by using the speech data of the target unknown speaker with text labels to obtain a trained personalized speech synthesis model;
s400: acquiring the voice and any text information of a target speaker, and preprocessing the voice of the target speaker to obtain a Mel frequency spectrum; inputting the mel frequency spectrum and any text information into the trained personalized speech synthesis model to obtain a predicted mel frequency spectrum; generating target voice corresponding to any text information based on the predicted Mel frequency spectrum;
wherein, the step S200 includes the following substeps:
s210: inputting the Mel frequency spectrum into a preprocessing network to obtain a preprocessing result; encoding the preprocessing result through a GRU module so as to obtain hidden layer characteristics;
s220: inputting the preprocessing result into a multi-granularity speaker encoder so as to obtain speaker characteristics; and inputting the hidden layer feature into a multi-granularity content encoder to obtain a content feature;
s230: inputting the content features and the speaker features into a voice feature reconstruction module, thereby obtaining reconstructed voice features;
s240: inputting the text into a phoneme encoder to obtain text features; inputting the reconstructed voice feature, the text feature and the speaker feature into a reference attention module to obtain an output result;
s250: the output result and the text feature are spliced and then input into a variable adapter, so that a first hidden feature is obtained;
s260: the first hidden feature is input into a mel-spectrum decoder, thereby obtaining a predicted mel-spectrum.
Preferably, the preprocessing in step S100 includes: generating a mel frequency spectrum by performing a short-time Fourier transform and mel-spectrum conversion on the voice waveform of the voice data.
Preferably, the multi-granularity content encoder and the multi-granularity speaker encoder in step S220 each comprise a multi-granularity feature encoder comprising 4 convolutions of different scales, namely 1×1, 3×3, 5×5 and 7×7; after each of the 3×3, 5×5 and 7×7 convolutions, a group normalization layer, a GeLU activation function and a statistics pooling layer with an attention mechanism are sequentially connected.
Preferably, the step S230 includes:
the content features pass through an instance normalization layer in the voice feature reconstruction module to obtain content features with mean and variance removed;
the speaker characteristics pass through a full connection layer in the voice feature reconstruction module to obtain a new mean value and a new variance;
and substituting the new mean value and the new variance into the content features with the mean value and the variance removed, so as to obtain the reconstructed voice features.
Preferably, the step S240 includes:
the reconstructed voice features are taken as the K and V of the reference attention module; the text features and the speaker features are spliced and taken as the Q of the reference attention module; Q, K and V are input into the reference attention module to obtain the output result of the reference attention module.
Preferably, the step S300 includes:
the mean square error between the predicted mel frequency spectrum and the mel frequency spectrum is used as the loss, and the personalized speech synthesis model is pre-trained based on this loss until convergence to obtain a pre-trained personalized speech synthesis model.
The second technical scheme adopted by the application is as follows: a speaker adaptation system based on a small number of samples comprises a preprocessing module, a model construction module, a model training module and a personalized speech synthesis module;
the preprocessing module is used for acquiring voice data with text labels, and preprocessing the voice data to generate a Mel frequency spectrum;
the model construction module is used for constructing a personalized voice synthesis model, and the personalized voice synthesis model comprises a preprocessing network, a GRU module, a multi-granularity content encoder, a multi-granularity speaker encoder, a voice characteristic reconstruction module, a reference attention module, a phoneme encoder, a variable adapter and a Mel spectrogram decoder; inputting the mel spectrum and text into the personalized speech synthesis model, thereby obtaining a predicted mel spectrum;
the model training module is used for pre-training the personalized speech synthesis model based on the Mel frequency spectrum and the predicted Mel frequency spectrum, and fine-tuning the pre-trained personalized speech synthesis model by using the speech data of the target unknown speaker with the text label so as to obtain a trained personalized speech synthesis model;
the personalized speech synthesis module is used for acquiring the speech of the target speaker and any text information, and preprocessing the speech of the target speaker to obtain a Mel frequency spectrum; inputting the mel frequency spectrum and any text information into the trained personalized speech synthesis model to obtain a predicted mel frequency spectrum; and generating target voice corresponding to any text information based on the predicted Mel frequency spectrum.
Preferably, the personalized speech synthesis model performs the following steps to obtain a predicted mel spectrum:
s210: inputting the Mel frequency spectrum into the preprocessing network to obtain preprocessing results; encoding the preprocessing result through the GRU module so as to obtain hidden layer characteristics;
s220: inputting the preprocessing result into the multi-granularity speaker encoder so as to obtain speaker characteristics; and inputting the hidden layer feature into the multi-granularity content encoder to obtain a content feature;
s230: inputting the content features and the speaker features into a voice feature reconstruction module, thereby obtaining reconstructed voice features;
s240: inputting the text into a phoneme encoder to obtain text features; inputting the reconstructed voice feature, the text feature and the speaker feature into a reference attention module to obtain an output result;
s250: the output result and the text feature are spliced and then input into a variable adapter, so that a first hidden feature is obtained;
s260: the first hidden feature is input into a mel-spectrum decoder, thereby obtaining a predicted mel-spectrum.
The third technical scheme adopted by the application is as follows: a speech translation method comprising the steps of:
s10: acquiring a user voice signal of a text to be translated;
s20: translating the text to be translated into a target language text;
s30: the user speech signal and the target language text are input into the speaker adaptation system based on a small number of samples as described in the second technical scheme, thereby obtaining the target speech.
The fourth technical scheme adopted by the application is as follows: a speech translation system comprises a data acquisition module, a text translation module and a speech synthesis module;
the data acquisition module is used for acquiring a user voice signal of a text to be translated;
the text translation module is used for translating the text to be translated into a target language text;
the voice synthesis module comprises the speaker adaptation system based on a small number of samples in the second technical scheme, and is used for obtaining target voice according to the user voice signal and the target language text.
The beneficial effects of the technical scheme are that:
(1) The application discloses a speaker adaptation method based on a small number of samples, which realizes few-sample speaker adaptation with a multi-granularity coding structure and uses this structure to extract the corresponding content features and speaker features from the user's speech signal; it further uses the IN-processed content embedding (i.e., the content features with mean and variance removed) and the Linear-processed speaker embedding (i.e., the new mean and new variance) to reconstruct the speech features, so as to enhance and verify the feature-extraction ability of the multi-granularity coding structure.
(2) Training the multi-granularity content encoder and the multi-granularity speaker encoder jointly yields a good speaker feature extraction effect; with these speaker features, the speech recognition and speech synthesis results can be optimized, the text is expressed more accurately, and the fluency and speaker similarity of the synthesized speech are improved, enabling a simultaneous interpretation function; the method can be applied to, but is not limited to, the field of personalized speech translation.
(3) The application discloses a voice translation method, which is used for extracting acoustic characteristics of native language voice of a target speaker, separating content characteristics in the native language voice from speaker characteristics comprising tone, pronunciation characteristics and pause, combining target language text obtained by translation with the speaker characteristics, and performing personalized voice synthesis on the target language text to achieve the effect of personalized voice translation.
(4) Traditional speech translation cannot capture the speaker's prosodic variation well, so the speaker's original meaning may not be expressed correctly; moreover, traditional synthesized speech cannot change the pronunciation of the same word according to the language features of different regions, so the pronunciation is monotonous. In the speech synthesis stage of translation, the application performs personalized speech synthesis of the target language according to the speaker's speaking rate, pauses, pitch and timbre, thereby achieving the effect of personalized speech translation.
Drawings
FIG. 1 is a block flow diagram of a speaker adaptation method based on a small number of samples according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a personalized speech synthesis model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a multi-granularity encoding structure according to an embodiment of the present application;
FIG. 4 is a comparison result of different speech synthesis methods according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a speaker adaptation system based on a small number of samples according to an embodiment of the present application;
FIG. 6 is a flowchart of a speech translation method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a speech translation system according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in further detail below with reference to the accompanying drawings and examples. The following detailed description and drawings are provided to illustrate the principles of the application and are not intended to limit its scope, which is defined by the claims; that is, the application is not limited to the preferred embodiments described.
In the description of the present application, it is to be noted that, unless otherwise indicated, the meaning of "plurality" means two or more; the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance; the specific meaning of the above terms in the present application can be understood as appropriate by those of ordinary skill in the art.
Example 1
As shown in fig. 1, one embodiment of the present application provides a speaker adaptation method based on a small number of samples, including the steps of:
s100: acquiring voice data with text labels, and preprocessing the voice data to generate a Mel frequency spectrum (namely acoustic characteristics);
(1) Collect speech data with text labels from an open-source clean speech dataset;
(2) Preprocess the speech data: the speech waveform of the speech data is passed through a short-time Fourier transform (STFT) and mel-spectrum conversion to generate a mel spectrum.
The speech data are resampled according to the parameters in Table 1: the sampling rate of all speech data is converted to 22050 Hz, and the resampled speech data are pre-emphasized with a coefficient of 0.97; the short-time Fourier transform (STFT) is then applied with a frame shift of 256 and a window length and frame length of 1024; finally, mel-spectrum conversion is performed to obtain the mel spectrum, using a bank of 80 mel filters covering frequencies from a minimum of 0 Hz to a maximum of 8000 Hz, to stay consistent with the settings of the HiFi-GAN vocoder.
TABLE 1 Audio parameters: sampling rate 22050 Hz; pre-emphasis coefficient 0.97; frame shift (hop length) 256; window length / frame length 1024; number of mel filter banks 80; minimum frequency 0 Hz; maximum frequency 8000 Hz.
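As a non-authoritative illustration of this preprocessing, the following sketch computes such a mel spectrum with librosa under the Table 1 parameters; the function name wav_to_mel and the log compression step are assumptions not specified in the application.

```python
# Minimal mel-spectrum extraction sketch: resample to 22050 Hz, apply 0.97
# pre-emphasis, take the STFT (hop 256, window/FFT 1024), then convert to an
# 80-band mel spectrogram covering 0-8000 Hz, matching the Table 1 settings.
import librosa
import numpy as np

def wav_to_mel(path, sr=22050, preemph=0.97, n_fft=1024, hop_length=256,
               win_length=1024, n_mels=80, fmin=0.0, fmax=8000.0):
    wav, _ = librosa.load(path, sr=sr)                      # resample to 22050 Hz
    wav = np.append(wav[0], wav[1:] - preemph * wav[:-1])   # pre-emphasis
    spec = np.abs(librosa.stft(wav, n_fft=n_fft,
                               hop_length=hop_length,
                               win_length=win_length))      # STFT magnitude
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels,
                                    fmin=fmin, fmax=fmax)
    mel = np.log(np.clip(mel_basis @ spec, 1e-5, None))     # log-mel spectrogram
    return mel.T                                             # (frames, 80)
```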
S200: constructing a personalized voice synthesis model; inputting the mel spectrum and text into the personalized speech synthesis model, thereby obtaining a predicted mel spectrum;
as shown in fig. 2, the personalized speech synthesis model includes a preprocessing network, a GRU module, a multiparticulate content encoder, a multiparticulate speaker encoder, a speech feature reconstruction module, a reference attention module, a phoneme encoder, a variable adapter, and a mel-spectrogram decoder;
s210: The mel spectrum (i.e., the mel spectrogram) is input into the preprocessing network to obtain the preprocessing result; the preprocessing network consists of two-dimensional convolution layers containing 512 filters of shape 5×1. The preprocessing result is then encoded by 3 GRU modules (i.e., gated recurrent units) to obtain the hidden-layer features.
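A minimal PyTorch sketch of such a pre-net and GRU encoder follows; it only fixes what is stated above (a 2-D convolution with 512 filters of shape 5×1 and 3 GRU layers), while the second convolution back to one channel, the padding and the hidden size of 256 are illustrative assumptions.

```python
# Sketch of the mel-spectrum pre-net and GRU encoder: a 2-D convolutional
# pre-processing network with 512 filters of shape 5x1, followed by 3 stacked
# GRU layers that produce the hidden-layer features.
import torch
import torch.nn as nn

class MelPreNetGRU(nn.Module):
    def __init__(self, n_mels=80, channels=512, hidden=256):
        super().__init__()
        self.prenet = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=(5, 1), padding=(2, 0)),  # 512 filters, 5x1
            nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=(5, 1), padding=(2, 0)),  # back to one channel (assumption)
        )
        self.gru = nn.GRU(n_mels, hidden, num_layers=3, batch_first=True)  # 3 GRU modules

    def forward(self, mel):                              # mel: (batch, frames, n_mels)
        pre = self.prenet(mel.unsqueeze(1)).squeeze(1)   # pre-processing result
        hidden_feats, _ = self.gru(pre)                  # hidden-layer features
        return pre, hidden_feats
```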
S220: inputting the preprocessing result into a multi-granularity speaker encoder so as to obtain speaker characteristics; and inputting the hidden layer feature into a multiparticulate content encoder to obtain a content feature.
To take both local and global information into account, the multi-granularity content encoder and the multi-granularity speaker encoder share the same structure and comprise the multi-granularity feature encoder shown in fig. 3. The multi-granularity feature encoder contains 4 convolutions of different scales, namely 1×1, 3×3, 5×5 and 7×7; except for the 1×1 convolution, each of the other three scale convolutions is followed in sequence by a group normalization layer (GroupNorm), a GeLU activation function (i.e., Gaussian Error Linear Unit) and a statistics pooling layer with an attention mechanism (attentive statistics pooling). The outputs of the other three scale convolutions are given by the following formulas:
$$X_{i} = \mathrm{ASPooling}\big(\mathrm{GeLU}(\mathrm{GroupNorm}(\mathrm{Conv}_{i}(X_{\mathrm{in}})))\big), \quad i = 2, 3, 4,$$
$$X_{\mathrm{multi}} = X_{1} + X_{2} + X_{3} + X_{4},$$
where $X_{i}$ ($i = 1,\dots,4$) is the output of the $i$-th scale convolution branch; $X_{\mathrm{in}}$ is the input to the multi-granularity content encoder or the multi-granularity speaker encoder; and $X_{\mathrm{multi}}$ is the output of the multi-granularity content encoder or the multi-granularity speaker encoder.
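The branch structure can be sketched as follows; the channel width, the GroupNorm group count and the treatment of the 1×1 branch (pooled the same way as the other branches) are assumptions, and the attentive statistics pooling module passed in as `pooling` is sketched after the pooling formulas below.

```python
# Sketch of the multi-granularity feature encoder: four parallel convolution
# branches with kernel scales 1, 3, 5 and 7; the 3/5/7 branches pass through
# GroupNorm and GeLU before pooling, and the branch outputs are summed.
import torch
import torch.nn as nn

class MultiGranularityEncoder(nn.Module):
    def __init__(self, in_dim=80, channels=256, groups=8, pooling=None):
        super().__init__()
        self.branch1 = nn.Conv1d(in_dim, channels, kernel_size=1)        # 1x1 branch
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_dim, channels, kernel_size=k, padding=k // 2),
                nn.GroupNorm(groups, channels),
                nn.GELU(),
            )
            for k in (3, 5, 7)                                            # 3x3, 5x5, 7x7 branches
        ])
        # default: plain temporal averaging; pass an AttentiveStatsPooling instance
        # (sketched after the pooling formulas below) to match the described design
        self.pool = pooling if pooling is not None else (lambda h: h.mean(dim=-1))

    def forward(self, x):                        # x: (batch, frames, in_dim)
        x = x.transpose(1, 2)                    # (batch, in_dim, frames)
        out = self.pool(self.branch1(x))         # X_1
        for branch in self.branches:             # add X_2, X_3, X_4
            out = out + self.pool(branch(x))
        return out                               # summed multi-granularity feature
```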
The present application uses attentive statistics pooling for the pooling operation. Attentive statistics pooling contains an attention model that assigns different weights to different frames and simultaneously produces a weighted mean and a weighted standard deviation. In a segment of speech, the frame-level features of some frames are more distinctive and important than those of others, so each frame is given its own weight. The attention score is computed as
$$e_{t} = \boldsymbol{v}^{\mathrm{T}} f(\boldsymbol{W}\boldsymbol{h}_{t} + \boldsymbol{b}) + k,$$
where $e_{t}$ is the output of the attention model for frame $t$; $f(\cdot)$ is a nonlinear activation function such as tanh or ReLU; $\boldsymbol{W}$ and $\boldsymbol{b}$ are the parameters of the linear transformation; $\boldsymbol{h}_{t}$ is the hidden-layer feature of frame $t$, $t = 1,\dots,T$, with $T$ the total number of frames; $k$ and $\boldsymbol{v}$ are parameters of the attention model; and $\mathrm{T}$ denotes transposition.
The scores are then normalized with softmax:
$$\alpha_{t} = \frac{\exp(e_{t})}{\sum_{\tau=1}^{T}\exp(e_{\tau})},$$
where $\alpha_{t}$ is the normalized weight corresponding to the frame-level feature of frame $t$; $t$ is the current frame; and $T$ is the total number of frames.
Finally, the features of all frames are weighted and summed:
$$\tilde{\boldsymbol{\mu}} = \sum_{t=1}^{T}\alpha_{t}\boldsymbol{h}_{t},$$
where $\tilde{\boldsymbol{\mu}}$ is the weighted mean; $\alpha_{t}$ is the weight of frame $t$; and $\boldsymbol{h}_{t}$ is the hidden-layer feature of frame $t$.
Attentive statistics pooling considers both the weighted mean and the weighted standard deviation; the latter is
$$\tilde{\boldsymbol{\sigma}} = \sqrt{\sum_{t=1}^{T}\alpha_{t}\,\boldsymbol{h}_{t}\odot\boldsymbol{h}_{t} - \tilde{\boldsymbol{\mu}}\odot\tilde{\boldsymbol{\mu}}},$$
where $\tilde{\boldsymbol{\sigma}}$ is the weighted standard deviation and $\odot$ denotes the Hadamard product.
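A possible PyTorch implementation of these pooling formulas follows; tanh is chosen for f(·), and returning the concatenation of the weighted mean and weighted standard deviation is an assumption about the output format.

```python
# Sketch of attentive statistics pooling: per-frame scores e_t = v^T tanh(W h_t + b) + k,
# softmax weights alpha_t, then a weighted mean and weighted standard deviation over frames.
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    def __init__(self, channels, attn_hidden=128):
        super().__init__()
        self.proj = nn.Linear(channels, attn_hidden)   # W, b
        self.score = nn.Linear(attn_hidden, 1)         # v, k

    def forward(self, h):                              # h: (batch, channels, frames)
        h = h.transpose(1, 2)                          # (batch, frames, channels)
        e = self.score(torch.tanh(self.proj(h)))       # e_t, shape (batch, frames, 1)
        alpha = torch.softmax(e, dim=1)                # alpha_t
        mu = torch.sum(alpha * h, dim=1)               # weighted mean
        var = torch.sum(alpha * h * h, dim=1) - mu * mu
        sigma = torch.sqrt(var.clamp(min=1e-8))        # weighted standard deviation
        return torch.cat([mu, sigma], dim=-1)          # (batch, 2 * channels)
```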
S230: inputting the content features and the speaker features into a voice feature reconstruction module so as to obtain reconstructed voice features;
the content features are normalized by an instance normalization (i.e., IN, Instance Norm) layer in the speech feature reconstruction module, which removes their mean and variance; the normalization is
$$\mu = \frac{1}{m}\sum_{i=1}^{m}x_{i}, \qquad \delta^{2} = \frac{1}{m}\sum_{i=1}^{m}(x_{i}-\mu)^{2}, \qquad \hat{x}_{i} = \frac{x_{i}-\mu}{\sqrt{\delta^{2}+\varepsilon}},$$
where $\mu$ is the mean of the content features; $m$ is the dimension of the content features; $x_{i}$ is the $i$-th content feature value; $i$ is the dimension index; $\delta^{2}$ is the variance of the content features; $\hat{x}_{i}$ is the content feature with mean and variance removed; and $\varepsilon$ is a small positive number, e.g. 0.001, added to avoid a zero variance.
After the speaker characteristics undergo the linear transformation of a fully connected layer (i.e., a Linear layer) in the speech feature reconstruction module, a new mean and a new variance are obtained:
$$[\mu_{\mathrm{new}},\,\delta^{2}_{\mathrm{new}}] = \boldsymbol{W}s + \boldsymbol{b},$$
where $\mu_{\mathrm{new}}$ is the new mean; $\delta^{2}_{\mathrm{new}}$ is the new variance; $s$ denotes the speaker features; and $\boldsymbol{W}$, $\boldsymbol{b}$ are the parameters of the fully connected layer.
The new mean and new variance are substituted into the content features with mean and variance removed to obtain the reconstructed speech features:
$$\hat{y}_{i} = \sqrt{\delta^{2}_{\mathrm{new}}+\varepsilon}\;\hat{x}_{i} + \mu_{\mathrm{new}},$$
where $\hat{y}_{i}$ is the reconstructed speech feature; $\hat{x}_{i}$ is the content feature with mean and variance removed; $\delta^{2}_{\mathrm{new}}$ is the new variance; $\varepsilon$ is a small positive number; and $\mu_{\mathrm{new}}$ is the new mean.
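The reconstruction step can be sketched as follows; the feature dimensions, the softplus used to keep the predicted variance positive, and the normalization axis are assumptions made only for illustration.

```python
# Sketch of the speech-feature reconstruction: instance-normalize the content
# features (remove mean and variance), map the speaker embedding through a
# Linear layer to a new mean and new variance, and substitute them back in.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechFeatureReconstruction(nn.Module):
    def __init__(self, content_dim=256, speaker_dim=256, eps=1e-3):
        super().__init__()
        self.eps = eps
        self.to_stats = nn.Linear(speaker_dim, 2 * content_dim)  # -> new mean, new variance

    def forward(self, content, speaker):      # content: (B, T, C), speaker: (B, C_spk)
        mu = content.mean(dim=-1, keepdim=True)
        var = content.var(dim=-1, unbiased=False, keepdim=True)
        normed = (content - mu) / torch.sqrt(var + self.eps)        # mean/variance removed
        new_mu, new_var = self.to_stats(speaker).chunk(2, dim=-1)   # speaker -> new statistics
        new_sigma = torch.sqrt(F.softplus(new_var) + self.eps)      # keep variance positive
        return new_sigma.unsqueeze(1) * normed + new_mu.unsqueeze(1)  # reconstructed features
```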
To verify and enhance the ability of the multi-granularity encoder to extract features, the present application reconstructs speech features based on the output results of the multi-granularity content encoder and the multi-granularity speaker encoder.
S240: inputting the text into a phoneme encoder to obtain text features; inputting the reconstructed voice feature, the text feature and the speaker feature into a reference attention module to obtain an output result;
the reconstructed speech features are taken as the K and V of the reference attention module; the text (i.e., the text sequence) is input into the phoneme encoder, which outputs the text features; the text features and the speaker features are spliced and taken as the Q of the reference attention module; Q, K and V are input into the reference attention module to obtain the output result.
The reference attention module comprises a first MatMul layer, a Scale layer, a Softmax layer and a second MatMul layer. Q and K are first input into the first MatMul layer, passed through the Scale layer and the Softmax layer, and then input, together with V, into the second MatMul layer; the output of the second MatMul layer is the output feature of the reference attention module.
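A minimal sketch of this reference attention is given below; the linear projections of Q, K and V to a shared attention dimension are assumptions added so the dot products are well defined, while the MatMul, Scale, Softmax, MatMul chain follows the description above.

```python
# Sketch of the reference attention module: Q from the concatenated text and
# speaker features, K and V from the reconstructed speech features.
import math
import torch
import torch.nn as nn

class ReferenceAttention(nn.Module):
    def __init__(self, text_dim, speaker_dim, recon_dim, attn_dim=256):
        super().__init__()
        self.q_proj = nn.Linear(text_dim + speaker_dim, attn_dim)
        self.k_proj = nn.Linear(recon_dim, attn_dim)
        self.v_proj = nn.Linear(recon_dim, attn_dim)

    def forward(self, text_feat, speaker_feat, recon_feat):
        # broadcast the utterance-level speaker embedding over the text positions
        spk = speaker_feat.unsqueeze(1).expand(-1, text_feat.size(1), -1)
        q = self.q_proj(torch.cat([text_feat, spk], dim=-1))     # Q
        k = self.k_proj(recon_feat)                              # K
        v = self.v_proj(recon_feat)                              # V
        scores = q @ k.transpose(1, 2) / math.sqrt(q.size(-1))   # first MatMul + Scale
        weights = torch.softmax(scores, dim=-1)                  # Softmax
        return weights @ v                                        # second MatMul
```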
S250: and after the output result output by the reference attention module is spliced with the text feature output by the phoneme encoder, inputting the text feature into the variable adapter, so as to obtain a first hidden feature.
S260: the first concealment feature is input to a mel-spectrum decoder to obtain a predicted mel-spectrum (i.e., a predicted mel-spectrum).
S300: pre-training the personalized speech synthesis model based on the Mel frequency spectrum and the predicted Mel frequency spectrum, and fine-tuning the pre-trained personalized speech synthesis model by using the speech data of the target unknown speaker with text labels to obtain a trained personalized speech synthesis model;
(1) Pre-training;
the mean square error (MSE) between the predicted mel spectrum and the ground-truth mel spectrum is computed as the loss, with a loss weight coefficient of 1; the personalized speech synthesis model is pre-trained on this loss until convergence, yielding the pre-trained personalized speech synthesis model.
In the present application, the personalized speech synthesis model is trained for 250K iterations on an NVIDIA GeForce GTX 1080 GPU with a batch size of 16, using the Adam optimizer with β1 = 0.9 and β2 = 0.98, and a warm-up learning-rate strategy for the first 4000 iterations. A total of about 12000 speech-text pairs participate in the pre-training; 8 speakers serve as target unknown speakers that do not participate in the pre-training and provide the data for subsequent fine-tuning and testing.
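A hedged sketch of one pre-training step under these settings is shown below; the base learning rate, the exact shape of the warm-up schedule and the model call signature `model(mel, text)` are assumptions.

```python
# One pre-training step: MSE loss with weight 1, Adam with beta1=0.9 and
# beta2=0.98, and a 4000-step linear learning-rate warm-up.
import torch
import torch.nn.functional as F

def make_optimizer(model, base_lr=1e-3, warmup_steps=4000):
    opt = torch.optim.Adam(model.parameters(), lr=base_lr, betas=(0.9, 0.98))
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda step: min((step + 1) / warmup_steps, 1.0))  # linear warm-up
    return opt, sched

def pretrain_step(model, mel, text, opt, sched):
    pred_mel = model(mel, text)            # predicted mel spectrum
    loss = F.mse_loss(pred_mel, mel)       # MSE against the ground-truth mel
    opt.zero_grad()
    loss.backward()
    opt.step()
    sched.step()
    return loss.item()
```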
(2) Fine tuning the pre-trained personalized speech synthesis model by using the speech data of the target unknown speaker with the text labels to obtain a trained personalized speech synthesis model;
a small amount of text-labeled speech data of the target unknown speaker is acquired and preprocessed to obtain a mel spectrum; the mel spectrum and the text are input into the pre-trained personalized speech synthesis model, which is fine-tuned to obtain a fine-tuned model for the few-sample target speaker, i.e., the trained personalized speech synthesis model.
Further, in one embodiment, the method further comprises testing and analyzing the trained personalized speech synthesis model;
reference audio of 8 unknown speakers and the text data to be synthesized are selected for testing, and 15 native speakers are invited to give subjective evaluations from two aspects: synthesized speech quality (i.e., the naturalness subjective opinion score) and speaker similarity (i.e., the similarity subjective opinion score). The international standard 5-point scale is used, from 0 to 5: very bad, the emotion is not close to the target emotion at all and the emotional expressiveness is very poor; poor, the emotion is only roughly close to the target emotion and the emotional expressiveness is weak; moderate, the emotion fits the target emotion reasonably well and the emotional expressiveness is fair; good, the emotion is close to the target emotion and the emotional expressiveness is sufficient; excellent, the emotion matches the target emotion closely and the emotional expressiveness is outstanding; every 0.5 forms one interval.
As shown in fig. 4, in the experimental evaluation the method of the present application (i.e., OURS) is compared with two typical fixed-length speaker-embedding methods (GMVAE and CDFSE), which are likewise based on FastSpeech 2; speech is synthesized with the three models for both known and unknown speakers. As can be seen from fig. 4, the method of the present application outperforms the other two baselines in terms of speaker similarity, achieving a similarity subjective opinion score of 4.15 for known speakers and 3.73 for unknown speakers.
S400: acquiring the voice and any text information of a target speaker, and preprocessing the voice of the target speaker to obtain a Mel frequency spectrum; inputting the mel frequency spectrum and any text information into the trained personalized speech synthesis model to obtain a predicted mel frequency spectrum; generating target voice corresponding to any text information based on the predicted Mel frequency spectrum;
the vocoder is used to convert the predicted mel spectrum into a waveform sequence, thereby generating the target speech with the characteristics of the original speaker and realizing personalized speech synthesis; the vocoder uses the fully trained universal version of HiFi-GAN to generate the waveform sequence.
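For reference, waveform generation with a pre-trained universal HiFi-GAN generator could look roughly like the sketch below; it assumes the open-source HiFi-GAN implementation (its `Generator` class, `AttrDict` helper, JSON config and universal checkpoint), and the file names are illustrative.

```python
# Hedged sketch: convert the predicted mel spectrum into a waveform with a
# pre-trained universal HiFi-GAN generator.
import json
import torch
from env import AttrDict        # helper shipped with the HiFi-GAN repository
from models import Generator    # generator from the HiFi-GAN repository

def mel_to_wav(pred_mel, config_path="config.json",
               ckpt_path="g_universal.pth", device="cpu"):
    with open(config_path) as f:
        h = AttrDict(json.load(f))
    generator = Generator(h).to(device)
    state = torch.load(ckpt_path, map_location=device)
    generator.load_state_dict(state["generator"])
    generator.eval()
    generator.remove_weight_norm()
    with torch.no_grad():
        # pred_mel: (batch, 80, frames) log-mel spectrogram
        wav = generator(pred_mel.to(device)).squeeze(1)   # (batch, samples)
    return wav
```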
Example two
As shown in fig. 5, one embodiment of the present application provides a speaker adaptation system based on a small number of samples, including a preprocessing module, a model construction module, a model training module, and a personalized speech synthesis module;
the preprocessing module is used for acquiring voice data with text labels, and preprocessing the voice data to generate a Mel frequency spectrum;
the model construction module is used for constructing a personalized speech synthesis model, and the personalized speech synthesis model comprises a preprocessing network, a GRU module (i.e., a gated recurrent unit), a multi-granularity content encoder, a multi-granularity speaker encoder, a speech feature reconstruction module, a reference attention module, a phoneme encoder, a variable adapter and a mel-spectrogram decoder; the mel spectrum and the text are input into the personalized speech synthesis model, thereby obtaining a predicted mel spectrum;
the model training module is used for pre-training the personalized speech synthesis model based on the Mel frequency spectrum and the predicted Mel frequency spectrum, and fine-tuning the pre-trained personalized speech synthesis model by using the speech data of the target unknown speaker with the text label so as to obtain a trained personalized speech synthesis model;
the personalized speech synthesis module is used for acquiring the speech of the target speaker and any text information, and preprocessing the speech of the target speaker to obtain a Mel frequency spectrum; inputting the mel frequency spectrum and any text information into the trained personalized speech synthesis model to obtain a predicted mel frequency spectrum; and generating target voice corresponding to any text information based on the predicted Mel frequency spectrum.
Example III
As shown in fig. 6, one embodiment of the present application provides a speech translation method, which includes the steps of:
s10: acquiring a user voice signal of a text to be translated;
a microphone array in the speech acquisition device collects the audio signal of the user for the text to be translated, and the user speech signal is obtained by applying speech enhancement processing to the collected audio signal.
S20: translating the text to be translated into a target language text;
the text to be translated is input into a speech translation model (i.e., a text translation unit) to obtain the target-language text; the speech translation model is implemented with a neural machine translation algorithm, for example an end-to-end Transformer model, an attention-based seq2seq model, a Helsinki-NLP model, and the like.
For example, the speech translation model adopts an end-to-end Transformer model: positional information is added to the text to be translated, the text is input into the Transformer model, and translation from the text to be translated to the target-language text is realized through the FFT modules and the self-attention mechanism in the Transformer model.
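As a hedged illustration of this translation step (not the model used in the application), a public Helsinki-NLP MarianMT checkpoint can be called through the Hugging Face transformers library; the checkpoint name below is an illustrative assumption.

```python
# Hedged example of the text-translation step using a public Helsinki-NLP
# MarianMT checkpoint; "Helsinki-NLP/opus-mt-zh-en" is only an example name.
from transformers import MarianMTModel, MarianTokenizer

def translate(text, model_name="Helsinki-NLP/opus-mt-zh-en"):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer([text], return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.decode(generated[0], skip_special_tokens=True)

# e.g. target_text = translate("source-language sentence to be translated")
```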
S30: and inputting the user voice signal and the target language text into the speaker adaptation system based on a small number of samples, so as to obtain target voice.
The user speech signal is passed through the personalized speech synthesis module of the few-sample speaker adaptation system of Embodiment 2 to obtain a mel spectrum; the mel spectrum and the target-language text are input into the trained personalized speech synthesis model of that system to obtain a predicted mel spectrum; and the target speech corresponding to the target-language text is generated from the predicted mel spectrum. That is, the personalized speech synthesis module separates the content features in the native-language speech (i.e., the user speech signal) from the speaker features, including timbre, pronunciation characteristics and pauses; the translated target-language text is combined with the speaker features, and personalized speech synthesis is performed on the target-language text to obtain the corresponding target speech, thereby achieving the effect of personalized speech translation.
Example IV
As shown in fig. 7, one embodiment of the present application provides a speech translation system, which includes a data acquisition module, a text translation module, and a speech synthesis module;
the data acquisition module is used for acquiring a user voice signal of a text to be translated;
the text translation module is used for translating the text to be translated into a target language text;
the speech synthesis module includes the small sample based speaker adaptation system described in embodiment 2 for obtaining target speech from the user speech signal and target language text.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, etc.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (9)

1. A speaker adaptation method based on a small number of samples, comprising the steps of:
s100: acquiring voice data with text labels, and preprocessing the voice data to generate a Mel frequency spectrum;
s200: constructing a personalized speech synthesis model, and inputting the Mel frequency spectrum and the text into the personalized speech synthesis model so as to obtain a predicted Mel frequency spectrum;
s300: pre-training the personalized speech synthesis model based on the Mel frequency spectrum and the predicted Mel frequency spectrum, and fine-tuning the pre-trained personalized speech synthesis model by using the speech data of the target unknown speaker with text labels to obtain a trained personalized speech synthesis model;
s400: acquiring the voice and any text information of a target speaker, and preprocessing the voice of the target speaker to obtain a Mel frequency spectrum; inputting the mel frequency spectrum and any text information into the trained personalized speech synthesis model to obtain a predicted mel frequency spectrum; generating target voice corresponding to any text information based on the predicted Mel frequency spectrum;
wherein, the step S200 includes the following substeps:
s210: inputting the Mel frequency spectrum into a preprocessing network to obtain a preprocessing result; encoding the preprocessing result through a GRU module so as to obtain hidden layer characteristics;
s220: inputting the preprocessing result into a multi-granularity speaker encoder so as to obtain speaker characteristics; and inputting the hidden layer feature into a multi-granularity content encoder to obtain a content feature;
s230: inputting the content features and the speaker features into a voice feature reconstruction module, thereby obtaining reconstructed voice features;
s240: inputting text into a phoneme encoder to obtain text features; inputting the reconstructed voice feature, the text feature and the speaker feature into a reference attention module to obtain an output result;
s250: the output result and the text feature are spliced and then input into a variable adapter, so that a first hidden feature is obtained;
s260: the first hidden feature is input into a mel-spectrum decoder, thereby obtaining a predicted mel-spectrum.
2. The speaker adaptation method according to claim 1, wherein the preprocessing in step S100 comprises: generating a mel frequency spectrum by performing a short-time Fourier transform and mel-spectrum conversion on the voice waveform of the voice data.
3. The speaker adaptation method according to claim 1, wherein the multi-granularity content encoder and the multi-granularity speaker encoder in step S220 each comprise a multi-granularity feature encoder comprising 4 convolutions of different scales, namely 1×1, 3×3, 5×5 and 7×7; after each of the 3×3, 5×5 and 7×7 convolutions, a group normalization layer, a GeLU activation function and a statistics pooling layer with an attention mechanism are sequentially connected.
4. The speaker adaptation method according to claim 1, wherein said step S230 comprises:
the content features pass through an instance normalization layer in a voice feature reconstruction module to obtain content features with mean and variance removed;
the speaker characteristics pass through a full connection layer in the voice feature reconstruction module to obtain a new mean value and a new variance;
and substituting the new mean value and the new variance into the content features with the mean value and the variance removed, so as to obtain the reconstructed voice features.
5. The speaker adaptation method according to claim 1, wherein said step S240 comprises:
k and V taking the reconstructed voice features as a reference attention module; splicing the text features and the speaker features, and taking the spliced text features and the speaker features as Q of a reference attention module; q, K and V are input to the reference attention module to obtain an output result output by the reference attention module.
6. The speaker adaptation method according to claim 1, wherein said step S300 comprises:
and carrying out loss calculation on the predicted mel frequency spectrum and the mel frequency spectrum by using the mean square error, and pre-training the personalized speech synthesis model based on the loss until convergence to obtain a pre-trained personalized speech synthesis model.
7. The speaker adaptation system based on a small number of samples is characterized by comprising a preprocessing module, a model construction module, a model training module and a personalized speech synthesis module;
the preprocessing module is used for acquiring voice data with text labels, and preprocessing the voice data to generate a Mel frequency spectrum;
the model construction module is used for constructing a personalized voice synthesis model, and the personalized voice synthesis model comprises a preprocessing network, a GRU module, a multi-granularity content encoder, a multi-granularity speaker encoder, a voice characteristic reconstruction module, a reference attention module, a phoneme encoder, a variable adapter and a Mel spectrogram decoder; inputting the mel spectrum and text into the personalized speech synthesis model, thereby obtaining a predicted mel spectrum;
the model training module is used for pre-training the personalized speech synthesis model based on the Mel frequency spectrum and the predicted Mel frequency spectrum, and fine-tuning the pre-trained personalized speech synthesis model by using the speech data of the target unknown speaker with the text label so as to obtain a trained personalized speech synthesis model;
the personalized speech synthesis module is used for acquiring the speech of the target speaker and any text information, and preprocessing the speech of the target speaker to obtain a Mel frequency spectrum; inputting the mel frequency spectrum and any text information into the trained personalized speech synthesis model to obtain a predicted mel frequency spectrum; generating target voice corresponding to any text information based on the predicted Mel frequency spectrum;
wherein the personalized speech synthesis model performs the following steps to obtain a predicted mel spectrum:
s210: inputting the Mel frequency spectrum into the preprocessing network to obtain preprocessing results; encoding the preprocessing result through the GRU module so as to obtain hidden layer characteristics;
s220: inputting the preprocessing result into the multi-granularity speaker encoder so as to obtain speaker characteristics; and inputting the hidden layer feature into the multi-granularity content encoder to obtain a content feature;
s230: inputting the content features and the speaker features into a voice feature reconstruction module, thereby obtaining reconstructed voice features;
s240: inputting the text into a phoneme encoder to obtain text features; inputting the reconstructed voice feature, the text feature and the speaker feature into a reference attention module to obtain an output result;
s250: the output result and the text feature are spliced and then input into a variable adapter, so that a first hidden feature is obtained;
s260: the first hidden feature is input into a mel-spectrum decoder, thereby obtaining a predicted mel-spectrum.
8. A method of speech translation comprising the steps of:
s10: acquiring a user voice signal of a text to be translated;
s20: translating the text to be translated into a target language text;
s30: the user speech signal and target language text are input into the speaker adaptation system of claim 7 to obtain target speech.
9. The voice translation system is characterized by comprising a data acquisition module, a text translation module and a voice synthesis module;
the data acquisition module is used for acquiring a user voice signal of a text to be translated;
the text translation module is used for translating the text to be translated into a target language text;
the speech synthesis module comprises the speaker adaptation system of claim 7 for obtaining a target speech from the user speech signal and a target language text.
CN202310580319.5A 2023-05-23 2023-05-23 Speaker adaptation method, voice translation method and system based on small amount of samples Active CN116312466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310580319.5A CN116312466B (en) 2023-05-23 2023-05-23 Speaker adaptation method, voice translation method and system based on small amount of samples

Publications (2)

Publication Number Publication Date
CN116312466A CN116312466A (en) 2023-06-23
CN116312466B (en) 2023-08-15

Family

ID=86820730

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310580319.5A Active CN116312466B (en) 2023-05-23 2023-05-23 Speaker adaptation method, voice translation method and system based on small amount of samples

Country Status (1)

Country Link
CN (1) CN116312466B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019222591A1 (en) * 2018-05-17 2019-11-21 Google Llc Synthesis of speech from text in a voice of a target speaker using neural networks
CN114360493A (en) * 2021-12-15 2022-04-15 腾讯科技(深圳)有限公司 Speech synthesis method, apparatus, medium, computer device and program product
CN115713933A (en) * 2022-11-15 2023-02-24 南京邮电大学 Cross-language voice conversion method based on mutual information quantity and SE attention mechanism
CN116030786A (en) * 2023-02-02 2023-04-28 澳克多普有限公司 Speech synthesis method and system based on self-adaptive attention mechanism
CN116030792A (en) * 2023-03-30 2023-04-28 澳克多普有限公司 Method, apparatus, electronic device and readable medium for converting voice tone

Also Published As

Publication number Publication date
CN116312466A (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN111462769B (en) End-to-end accent conversion method
Song et al. ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems
CN110767210A (en) Method and device for generating personalized voice
Ranjard et al. Unsupervised bird song syllable classification using evolving neural networks
CN108198566B (en) Information processing method and device, electronic device and storage medium
Rammo et al. Detecting the speaker language using CNN deep learning algorithm
Murugappan et al. DWT and MFCC based human emotional speech classification using LDA
KR102272554B1 (en) Method and system of text to multiple speech
CN110570842B (en) Speech recognition method and system based on phoneme approximation degree and pronunciation standard degree
KR20190135853A (en) Method and system of text to multiple speech
Zahner et al. Conversion from facial myoelectric signals to speech: a unit selection approach
Yu et al. Reconstructing speech from real-time articulatory MRI using neural vocoders
CN114283822A (en) Many-to-one voice conversion method based on gamma pass frequency cepstrum coefficient
Koizumi et al. Miipher: A robust speech restoration model integrating self-supervised speech and text representations
CN113744715A (en) Vocoder speech synthesis method, device, computer equipment and storage medium
Diener et al. Investigating Objective Intelligibility in Real-Time EMG-to-Speech Conversion.
CN116364096B (en) Electroencephalogram signal voice decoding method based on generation countermeasure network
Hsu Synthesizing personalized non-speech vocalization from discrete speech representations
CN116312466B (en) Speaker adaptation method, voice translation method and system based on small amount of samples
Wand et al. Towards Speaker-adaptive Speech Recognition based on Surface Electromyography.
Kwon et al. Effective parameter estimation methods for an excitnet model in generative text-to-speech systems
CN113314109B (en) Voice generation method based on cycle generation network
CN112951256B (en) Voice processing method and device
CN114550701A (en) Deep neural network-based Chinese electronic larynx voice conversion device and method
CN115035904A (en) High-quality vocoder model based on generative antagonistic neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20230623

Assignee: Shenzhen Weiou Technology Co.,Ltd.

Assignor: Ocdop Ltd.

Contract record no.: X2023980048769

Denomination of invention: Speaker adaptation methods, speech translation methods, and systems based on a small number of samples

Granted publication date: 20230815

License type: Common License

Record date: 20231128
