CN116453502A - Cross-language speech synthesis method and system based on double-speaker embedding - Google Patents

Cross-language speech synthesis method and system based on double-speaker embedding Download PDF

Info

Publication number
CN116453502A
Authority
CN
China
Prior art keywords
language
speaker
cross
features
native language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310572407.0A
Other languages
Chinese (zh)
Inventor
俞凯
刘森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202310572407.0A priority Critical patent/CN116453502A/en
Publication of CN116453502A publication Critical patent/CN116453502A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation

Abstract

The embodiment of the invention provides a cross-language speech synthesis method based on dual-speaker embedding. The method comprises the following steps: inputting a text and a native-language speaker embedding into a txt2vec acoustic model, determining a phoneme sequence encoding of the text through a text encoder, and determining vector-quantized acoustic features and auxiliary features from the phoneme sequence encoding and the native-language speaker embedding through a decoder; inputting a target-language speaker embedding, the vector-quantized acoustic features and the auxiliary features into a vec2wav vocoder, extracting the X-vector feature of the target-language speaker embedding, and inputting the X-vector feature, the vector-quantized acoustic features and the auxiliary features into a feature encoder to obtain cross-language acoustic features; and determining, with a generator, cross-language synthesized speech from the cross-language acoustic features. The embodiment of the invention constructs a cross-language TTS model based on VQTTS in which the language speaking style and the speaker timbre are modeled separately, thereby achieving cross-language speech synthesis with high naturalness and a timbre similar to that of the target speaker.

Description

Cross-language speech synthesis method and system based on double-speaker embedding
Technical Field
The invention relates to the field of intelligent speech, and in particular to a cross-language speech synthesis method and system based on double-speaker embedding.
Background
With the development of technology, TTS (Text To Speech) models have made great progress in synthesizing high-fidelity and prosodically rich speech. However, in multi-lingual TTS scenarios, the quality of cross-language synthesis is still unsatisfactory, because it is difficult for synthesized speech in such a scenario to accurately preserve the speaker's timbre while eliminating the accent of the speaker's first language. More specifically, cross-language synthesis (speech synthesis from a speaker's native language into a non-native language) struggles to achieve nativeness in the non-native language while maintaining speaker similarity, where nativeness refers to how close the speech sounds to that of a native speaker (i.e., synthesis that lacks it sounds accented and unnatural).
To solve the above problems, the prior art has adopted domain adversarial training so that the speech synthesis model can transfer the voice characteristics of different speakers across different languages; mutual-information minimization has also been used to maintain speaker consistency during cross-language synthesis; and loss functions have been designed to encourage the speech synthesis model to learn language-independent speaker representations.
In the process of implementing the present invention, the inventor finds that at least the following problems exist in the related art:
These prior art methods typically rely on mel spectrograms as acoustic features, which are highly correlated along the time and frequency axes and contain rich speaker-dependent information. However, in cross-language synthesis the speaker information and the language information are difficult to decouple completely, so it is hard to obtain naturalness in the non-native language while preserving speaker similarity.
Disclosure of Invention
In order to at least solve the problem in the prior art that it is difficult to obtain naturalness in the non-native language while preserving speaker similarity during cross-language synthesis, the following technical solutions are provided.
In a first aspect, an embodiment of the present invention provides a cross-language speech synthesis method based on dual speaker embedding, including:
inputting a text and a native-language speaker embedding into a txt2vec acoustic model, determining a phoneme sequence encoding of the text through a text encoder, and determining, by a decoder in the txt2vec acoustic model, vector-quantized acoustic features and auxiliary features of the native-language speaker's pronunciation style from the phoneme sequence encoding and the native-language speaker embedding;
inputting a target-language speaker embedding serving as the non-native-language speaker, the vector-quantized acoustic features and the auxiliary features into a vec2wav vocoder, extracting, in the vec2wav vocoder, an X-vector feature from the target-language speaker embedding, and inputting the X-vector feature, the vector-quantized acoustic features and the auxiliary features into a feature encoder to obtain cross-language acoustic features that imitate the timbre of the target-language speaker on the basis of the native-language speaker's pronunciation style;
determining, with a generator, cross-language synthesized speech from the cross-language acoustic features.
In a second aspect, an embodiment of the present invention provides a cross-language speech synthesis system based on dual speaker embedding, including:
a pronunciation style feature determination program module, configured to input a text and a native-language speaker embedding into a txt2vec acoustic model, in which a phoneme sequence encoding of the text is determined by a text encoder, and vector-quantized acoustic features and auxiliary features of the native-language speaker's pronunciation style are determined by a decoder from the phoneme sequence encoding and the native-language speaker embedding;
a timbre feature determination program module, configured to input a target-language speaker embedding serving as the non-native-language speaker, the vector-quantized acoustic features and the auxiliary features into a vec2wav vocoder, extract, in the vec2wav vocoder, an X-vector feature from the target-language speaker embedding, and input the X-vector feature, the vector-quantized acoustic features and the auxiliary features into a feature encoder to obtain cross-language acoustic features that imitate the timbre of the target-language speaker on the basis of the native-language speaker's pronunciation style;
and a speech synthesis program module, configured to determine, with a generator, cross-language synthesized speech from the cross-language acoustic features.
In a third aspect, there is provided an electronic device, comprising: the system comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the dual speaker embedding-based cross-language speech synthesis method of any of the embodiments of the present invention.
In a fourth aspect, embodiments of the present invention provide a storage medium having a computer program stored thereon, wherein the program when executed by a processor implements the steps of the cross-language speech synthesis method based on dual speaker embedding of any of the embodiments of the present invention.
The embodiment of the invention has the beneficial effects that: the method constructs a cross-language TTS model based on VQTTS that uses dual speaker embeddings to model the language speaking style and the speaker timbre separately. VQ features carry little speaker-dependent information; building on this finding, cross-language speech synthesis with high naturalness and a timbre similar to the target speaker is achieved. Experiments show that the speech produced by the method in both intra-lingual and cross-lingual synthesis is superior to the prior art.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a cross-language speech synthesis method based on dual speaker embedding according to an embodiment of the present invention;
FIG. 2 is a DSE-TTS framework diagram of a cross-language speech synthesis method based on dual speaker embedding according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of training data set information of a cross-language speech synthesis method based on double speaker embedding according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of speaker classification accuracy for different acoustic features of a cross-language speech synthesis method based on dual speaker embedding according to an embodiment of the present invention;
FIG. 5 is a diagram of naturalness MOS and ASR results for intra-lingual synthesis of the cross-language speech synthesis method based on dual speaker embedding according to an embodiment of the present invention;
FIG. 6 is a diagram of Chinese-English cross-language speech synthesis results of the cross-language speech synthesis method based on double speaker embedding according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a cross-language speech synthesis system based on dual speaker embedding according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an embodiment of an electronic device for cross-language speech synthesis based on dual speaker embedding according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of a cross-language speech synthesis method based on double-speaker embedding according to an embodiment of the present invention, including the following steps:
s11: embedding and inputting a text and a native language speaker into a txt2vec acoustic model, determining a phoneme sequence code of the text through a text encoder, and determining vector quantization acoustic features and auxiliary features of a pronunciation style of the native language speaker from the phoneme sequence code and the native language speaker by a decoder in the txt2vec acoustic model;
s12: the method comprises the steps of embedding a target language speaker serving as a non-native language, inputting the vector quantized acoustic features and auxiliary features into a vec2wav vocoder, extracting X-vector features embedded by the target language speaker in the vec2wav vocoder, and inputting the X-vector features, the vector quantized acoustic features and the auxiliary features into a feature encoder to obtain cross-language acoustic features simulating the tone of the target language speaker on the basis of the pronunciation style of the native language speaker;
s13: a cross-language synthesized speech of the cross-language acoustic features is determined with a generator.
In this embodiment, the method experimentally finds that vector-quantized acoustic features contain little speaker information. Thus, quantized features separate timbre from language information more easily than the conventional mel spectrogram. Based on this finding, the method adopts dual speaker embeddings (a native-language speaker embedding and a non-native target-language speaker embedding) to control the timbre and the speaking style of the synthesized speech separately, so as to solve the cross-language speech synthesis problem.
For step S11, the method improves on the VQTTS (vector-quantized Text-To-Speech) synthesis model. The improved acoustic model architecture and the dual-speaker-embedding framework are shown in FIG. 2. In the speech synthesis step, the text to be synthesized is input into this architecture. Of the two speaker embeddings, the native-language speaker embedding is input into the acoustic model txt2vec, while the non-native target-language speaker embedding is input into the vocoder vec2wav.
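To make the routing of the two embeddings concrete, the following Python sketch shows the dual-embedding flow; the callables txt2vec, vec2wav and extract_xvector and their signatures are illustrative assumptions, not the patent's API.

# Illustrative sketch of dual-speaker-embedding routing (all names are assumptions).
def synthesize(text, native_speaker_id, target_speaker_wav,
               txt2vec, vec2wav, extract_xvector):
    """Native-language speaker embedding controls speaking style in the acoustic
    model; the target speaker's x-vector controls timbre in the vocoder."""
    # Acoustic model: text + native-language speaker -> VQ acoustic + auxiliary features.
    vq_feats, aux_feats = txt2vec(text, speaker_id=native_speaker_id)
    # Vocoder: features + target-speaker x-vector -> waveform with the target timbre.
    xvec = extract_xvector(target_speaker_wav)
    return vec2wav(vq_feats, aux_feats, speaker_embedding=xvec)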
For the acoustic model txt2vec, the method takes the speaker ID and the language ID as input. The speaker ID is embedded into a 256-dimensional vector, which is then projected and added to the encoder output. The language ID is processed in a similar manner and embedded into a 128-dimensional vector to support multiple languages. These two types of embedding are used to learn the characteristics of different languages.
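As a sketch of this conditioning (256-dimensional speaker embeddings and 128-dimensional language embeddings, each projected and added to the encoder output), the following PyTorch-style module is illustrative; the encoder width and class name are assumptions.

import torch.nn as nn

class SpeakerLanguageConditioning(nn.Module):
    """Embed speaker and language IDs and add them to the encoder output."""
    def __init__(self, n_speakers, n_languages, enc_dim=384):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, 256)    # 256-dim speaker vectors
        self.lang_emb = nn.Embedding(n_languages, 128)  # 128-dim language vectors
        self.spk_proj = nn.Linear(256, enc_dim)
        self.lang_proj = nn.Linear(128, enc_dim)

    def forward(self, enc_out, speaker_id, language_id):
        # enc_out: (batch, time, enc_dim); speaker_id / language_id: (batch,)
        s = self.spk_proj(self.spk_emb(speaker_id)).unsqueeze(1)
        l = self.lang_proj(self.lang_emb(language_id)).unsqueeze(1)
        return enc_out + s + l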
In the txt2vec acoustic model, the input text to be synthesized is normalized and converted into IPA (International Phonetic Alphabet) phoneme sequences using a phonemization toolkit.
To facilitate alignment between text and speech, the tones and accents of different languages are preserved in the input sequence. Cross-language shared punctuation marks are also used, divided into four groups according to pause length and denoted "sp1", "sp2", "sp3" and "sp4", respectively. Further, "sil" is used as the start and end mark of each sentence. Before the input sequence is fed to the text encoder in txt2vec, a 384-dimensional vector is assigned to each phoneme (token) using an embedding table.
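For illustration, a small Python sketch of this token preparation (IPA phonemes plus the shared pause symbols "sp1"-"sp4" and the "sil" boundary mark, each looked up in a 384-dimensional embedding table); the toy vocabulary here is an assumption.

import torch
import torch.nn as nn

PAUSES = ["sp1", "sp2", "sp3", "sp4"]   # shared punctuation, grouped by pause length
BOUNDARY = "sil"                         # sentence start/end marker

def build_input_sequence(ipa_phonemes):
    """Wrap an IPA phoneme list with boundary marks, keeping tones/accents as-is."""
    return [BOUNDARY] + list(ipa_phonemes) + [BOUNDARY]

# 384-dimensional embedding table over the phoneme vocabulary (toy vocabulary for illustration).
phoneme_vocab = {p: i for i, p in enumerate([BOUNDARY] + PAUSES + ["a", "i", "u"])}
embedding_table = nn.Embedding(len(phoneme_vocab), 384)

tokens = build_input_sequence(["a", "sp1", "i"])
ids = torch.tensor([phoneme_vocab[t] for t in tokens])
vectors = embedding_table(ids)           # shape: (len(tokens), 384)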
The IPA phoneme sequence and the native-language speaker embedding are processed by an auxiliary controller to obtain PL (phone-level) auxiliary labels, which help predict the index of each codebook more accurately (these steps prepare for the subsequent extraction of VQ (vector-quantized) acoustic features; accurately predicting the indices allows high-fidelity speech to be constructed).
As an embodiment, after the determining, by the text encoder, the phoneme sequence encoding of the text, the method further comprises:
the phoneme sequence and the native language speaker insert are aligned with a length adjuster to preserve the pitch and accent of the text being spoken by the native language speaker.
In this embodiment, since the preceding step uses cross-language shared punctuation marks divided into four groups according to pause length, the phoneme sequence and the native-language speaker embedding can now be aligned with the length adjuster, so that the tone and accent of the text as spoken by the native-language speaker are preserved.
After processing by the length adjuster, decoding is performed to extract VQ acoustic features from the phoneme sequence and native language speaker embedding.
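As a generic illustration of what a length adjuster typically does (a FastSpeech-style length regulator; this mechanism is an assumption, not the patent's exact module), each phoneme encoding is repeated for its predicted number of frames:

import torch

def length_regulate(phoneme_encodings, durations):
    """phoneme_encodings: (n_phonemes, dim); durations: (n_phonemes,) frame counts."""
    # Repeat each phoneme vector 'duration' times along the time axis.
    return torch.repeat_interleave(phoneme_encodings, durations, dim=0)

# Example: 3 phonemes lasting 2, 5 and 3 frames -> a 10-frame sequence.
enc = torch.randn(3, 384)
frames = length_regulate(enc, torch.tensor([2, 5, 3]))  # shape (10, 384)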
As one embodiment, the determining, by the decoder, of the vector-quantized acoustic features and auxiliary features of the native-language speaker's pronunciation style from the phoneme sequence encoding and the native-language speaker embedding comprises:
quantizing the phoneme sequence encoding into a plurality of speech frames, and independently predicting, with an auxiliary controller, the codebook index of each speech frame under the native-language speaker embedding, so as to construct native-language high-fidelity speech;
determining, by a decoder, vector-quantized acoustic features of the native-language speaker's pronunciation style and auxiliary features from the native-language high-fidelity speech, wherein the auxiliary features include the predicted codebook index probabilities of the speech frames.
In this embodiment, the method extracts VQ acoustic features using a wav2vec 2.0 model that has two quantization codebooks, each containing 320 codewords. The wav2vec 2.0 model was pre-trained on 10,000 hours of Mandarin data. It quantizes each input utterance into frames at a 20 ms step, and each frame is represented by concatenating one 256-dimensional codeword from each of the two codebooks. In the mixed-language dataset of the present method, the number of possible index combinations is about 28.8k. The goal is to predict these index pairs accurately in order to construct high-fidelity speech. Furthermore, the method predicts the index of each codebook separately rather than their combination, yielding two 320-class classification problems. The method selects wav2vec 2.0 as the VQ feature extractor because it provides a more robust speech representation with less speaker information than other VQ features; the rationale for this choice is explained further in the experimental section.
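Since VQ prediction is framed as two independent 320-way classification problems per 20 ms frame, the following sketch shows one possible pair of prediction heads with a frame-level cross-entropy loss; the hidden size and module names are assumptions.

import torch.nn as nn

class VQIndexHead(nn.Module):
    """Predict the index of each of the two codebooks independently per frame."""
    def __init__(self, hidden_dim=384, n_codewords=320):
        super().__init__()
        self.head_a = nn.Linear(hidden_dim, n_codewords)  # codebook 1
        self.head_b = nn.Linear(hidden_dim, n_codewords)  # codebook 2

    def forward(self, decoder_states):
        # decoder_states: (batch, frames, hidden_dim)
        return self.head_a(decoder_states), self.head_b(decoder_states)

def vq_index_loss(logits_a, logits_b, target_a, target_b):
    """Sum of per-codebook cross-entropy; targets are (batch, frames) index tensors."""
    ce = nn.CrossEntropyLoss()
    return (ce(logits_a.transpose(1, 2), target_a) +
            ce(logits_b.transpose(1, 2), target_b))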
Like VQTTS, the present method uses logarithmic pitch (log pitch), energy, and POV (probability of voicing) as auxiliary features. First, the phone-level representations of the mixed-language dataset are computed and normalized. These representations are then grouped into 128 classes using k-means clustering, and the resulting cluster index is used as the PL (phone-level) auxiliary label.
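A sketch of how these phone-level auxiliary labels could be produced as described (normalize the phone-level log pitch, energy and POV, then cluster into 128 classes); scikit-learn's KMeans is used here purely for illustration.

import numpy as np
from sklearn.cluster import KMeans

def make_pl_aux_labels(phone_level_feats, n_classes=128, seed=0):
    """phone_level_feats: (n_phones, 3) array of [log_pitch, energy, pov]."""
    feats = np.asarray(phone_level_feats, dtype=np.float64)
    # Mean/variance normalization before clustering.
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
    km = KMeans(n_clusters=n_classes, random_state=seed, n_init=10).fit(feats)
    return km.labels_, km  # cluster index per phone serves as the PL auxiliary label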
Cross-language TTS in the prior art has difficulty accurately preserving the speaker's timbre while eliminating the accent of the speaker's first language, resulting in unnatural synthesized speech. The main cause is usually the entanglement between speaker and language, which manifests in traditional acoustic features such as mel spectrograms. However, experiments with the present method show that self-supervised VQ features contain far less speaker information than traditional acoustic features. Thus, in a VQ-based TTS method, no additional technique is needed inside the acoustic model to disentangle speaker and language. This allows the model to focus only on modeling text and language characteristics, while delegating the task of controlling speaker timbre to the vocoder. As a result, a VQ-based TTS model naturally learns how to speak different languages in a native way in the voice of a non-native speaker.
For step S12, the DSE-TTS (Dual Speaker Embedding TTS) framework of the method improves nativeness and speaker similarity in the cross-language TTS scenario. The vector-quantized acoustic features and auxiliary features determined in step S11 are input, together with the non-native target-language speaker embedding, into the vec2wav vocoder. In the speech synthesis stage, whether synthesizing within a language or across languages, the speaker embedding input to txt2vec is always that of a native speaker of the language of the input text. In contrast, the speaker embedding input to vec2wav is set to the target speaker. In the cross-language case this means that the selected native-speaker embedding fed to txt2vec represents the language speaking style, while the target-speaker embedding fed to vec2wav controls the timbre. In this way, the speaking style of a particular language and the timbre of a speaker are naturally separated by the dual embedding.
For vec2wav, an X-vector extracted from a pre-trained speaker recognition model is used as the speaker embedding to control timbre. Further, to bring the timbre closer to the target speaker during cross-language synthesis, the distribution of the native-language speaker's pitch predicted by txt2vec is shifted to match the pitch of the target speaker, as follows:
p̂ = (p − μ_ntv) / σ_ntv · σ_tgt + μ_tgt
where the subscripts "tgt" and "ntv" denote the target and native-language speakers, respectively, p is the pitch value predicted by txt2vec, and μ and σ are the mean and standard deviation of the pitch values of the target or native speaker in the training set. The method performs this pitch shift before sending the auxiliary features to vec2wav for synthesis.
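In code, this pitch shift is plain mean-variance matching; a minimal sketch (the function name and epsilon guard are ours):

import numpy as np

def shift_pitch(pitch_ntv, mu_ntv, sigma_ntv, mu_tgt, sigma_tgt):
    """Map native-speaker pitch values onto the target speaker's pitch statistics."""
    pitch = np.asarray(pitch_ntv, dtype=np.float64)
    return (pitch - mu_ntv) / (sigma_ntv + 1e-8) * sigma_tgt + mu_tgt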
By the method, the cross-language acoustic characteristics for simulating the tone of the target language speaker on the basis of the pronunciation style of the native language speaker can be obtained.
For step S13, the final cross-language synthesized speech is generated from the cross-language acoustic features by a HiFi-GAN (High-Fidelity Generative Adversarial Network) generator, enabling a specified non-native speaker to speak a foreign language natively.
As one embodiment, the txt2vec acoustic model and the vec2wav vocoder are pre-trained from a training dataset comprising native language speaker embeddings and target language speaker embeddings of non-native languages.
In this embodiment, specifically, in the model training phase the native-language speaker embedding and the non-native target-language speaker embedding in the training data set correspond to the same speaker.
The speaker embedding of a native speaker of the target language is selected as the speaker embedding of the acoustic model txt2vec; this speaker embedding and the corresponding language ID are input to the auxiliary controller, which outputs a predicted phone-level auxiliary label. Training is performed on the error between the pre-prepared reference phone-level auxiliary label and the predicted phone-level auxiliary label, so that the auxiliary controller's predictions approach the reference labels. In the same way, in vec2wav the X-vector is extracted from the non-native target-language speaker embedding for training, so that the timbre of the non-native target-language speaker is captured more accurately.
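Assuming the error between reference and predicted phone-level auxiliary labels is measured with cross-entropy over the 128 clustered classes (the patent only states that training minimizes this error), a sketch of the corresponding loss term:

import torch.nn as nn

def pl_aux_label_loss(pred_logits, ref_labels):
    """pred_logits: (batch, n_phones, 128) predicted PL auxiliary label scores;
    ref_labels: (batch, n_phones) reference cluster indices from k-means."""
    # CrossEntropyLoss expects class scores on dim 1, hence the transpose.
    return nn.CrossEntropyLoss()(pred_logits.transpose(1, 2), ref_labels)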
According to the above implementation, the method constructs a cross-language TTS model based on VQTTS that uses dual speaker embeddings to model the language speaking style and the speaker timbre independently. VQ features carry little speaker-dependent information; building on this finding, cross-language speech synthesis with high naturalness and a timbre similar to the target speaker is achieved. Experiments show that the speech produced by the method in both intra-lingual and cross-lingual synthesis is superior to the prior art.
The method is described experimentally as follows. Regarding the dataset, the dataset of the method comprises four languages: Mandarin (ZH), English (EN), Spanish (ES), and German (DE). The German and Spanish data are obtained from the M-AILABS dataset, while the English and Mandarin data are from LibriTTS and AISHELL-3, respectively. In practice, it may be difficult to collect enough data for some languages. To simulate this and test the language adaptability of the method, only a few hours of data were randomly selected from German and Spanish to serve as low-resource languages. The total duration and number of utterances are listed in FIG. 3. During training, all speech was resampled to 24 kHz, and 5% of the utterances were held out for the validation and test sets. To extract ground-truth phoneme durations, forced alignment was performed using Kaldi.
Regarding the experimental setup, the method trains txt2vec for 200 epochs and vec2wav for 100 epochs, using batch sizes of 16 and 8, respectively. The two models are trained separately on an NVIDIA 2080Ti GPU. A publicly available pre-trained wav2vec 2.0 model is used for VQ acoustic feature extraction. To evaluate the performance of the method's model, SANE-TTS from the prior art is used as the baseline. The method trains the SANE-TTS model for 200 epochs with a batch size of 16, while keeping all other parameters consistent.
To study the relationship between different acoustic features and the speaker, a speaker classification model is first constructed to evaluate the classification accuracy obtained with each feature. The method compares the mel spectrogram (an acoustic feature widely used in TTS models) with four different VQ features extracted from open-source pre-trained models, including vq-wav2vec, wav2vec 2.0, XLSR-53 and EnCodec. The classification model of the method uses an X-vector architecture with two additional linear layers to predict the speaker. The model is trained on the LibriTTS training set, which includes more than 2000 speakers. After training the model for 80 epochs, the speaker classification accuracy on the test set is analyzed. As shown in FIG. 4, the mel spectrogram contains enough speaker information to allow high speaker classification accuracy. In contrast, VQ features carry significantly less speaker information, resulting in lower accuracy than the mel spectrogram. Based on these experimental results, wav2vec 2.0 is chosen as the acoustic feature of the method, since its relatively low speaker recognition performance indicates that it contains less speaker-dependent information.
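The probe described above (a speaker classifier trained on top of each acoustic feature) can be approximated by the following simplified sketch; the actual model uses an X-vector backbone plus two linear layers, whereas this mean-pooling version is only illustrative.

import torch
import torch.nn as nn

class SpeakerProbe(nn.Module):
    """Classify speakers from a sequence of acoustic features (mel or VQ)."""
    def __init__(self, feat_dim, n_speakers, hidden=512):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim, hidden)
        self.fc2 = nn.Linear(hidden, n_speakers)

    def forward(self, feats):
        # feats: (batch, frames, feat_dim); mean-pool over time before classifying.
        pooled = feats.mean(dim=1)
        return self.fc2(torch.relu(self.fc1(pooled)))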
The method uses subjective and objective metrics to evaluate the quality of intra-lingual and cross-lingual synthesis. Subjective metrics include NMOS (naturalness MOS) and SMOS (similarity MOS), where MOS stands for Mean Opinion Score. NMOS evaluates how natural the synthesized speech sounds, and SMOS evaluates the degree of speaker similarity; the higher the NMOS score, the closer the synthesized speech is to native speech. MOS ratings were given on a 1-5 scale in 0.5-point increments, with 95% confidence intervals. For each language, 30 speech samples were synthesized from random text in the test set and evaluated by recruited raters: 15 Mandarin-English bilinguals rated the English and Mandarin speech, and 15 raters proficient in Mandarin, English and German or in English, Mandarin and Spanish rated the synthesized German and Spanish speech, respectively. For objective metrics, WER (word error rate), CER (character error rate) and SECS (speaker embedding cosine similarity) between synthesized speech and real speech were calculated. WER is used for Spanish, German and English, and CER for Mandarin. Pre-trained ASR models were used: Whisper for Spanish, German and English, and a Transducer-based ASR model for Mandarin. For speaker similarity, a separately trained ResNet-based r-vector speaker verification model was used, and a cosine similarity score between 0 and 1 was calculated; the higher the score, the better the speaker similarity. To compare the method's model with the baseline model, 100 speech samples per language were synthesized from sentences randomly selected from the test set.
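For the SECS metric, a sketch of the cosine similarity between the speaker embeddings of synthesized and real speech; clipping negative values to 0 is our assumption for keeping the score in [0, 1].

import numpy as np

def secs(emb_synth, emb_real):
    """Speaker-embedding cosine similarity between synthesized and real speech."""
    a = np.asarray(emb_synth, dtype=np.float64)
    b = np.asarray(emb_real, dtype=np.float64)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(0.0, cos)  # clipping to [0, 1] is an assumption, not the patent's definition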
The average NMOS and WER (CER) for the intra-lingual evaluation are shown in FIG. 5. It is clear that DSE-TTS achieves NMOS scores close to the ground truth and outperforms the baseline model on all metrics and in all languages. Specifically, DSE-TTS achieves an NMOS score above 4.3 in every language and a lower WER (CER) than the baseline model.
The cross-language evaluation results of the method are shown in FIG. 6. It can be observed that the results are consistent with those obtained in intra-lingual synthesis, as DSE-TTS is much better than SANE-TTS in terms of both NMOS and WER. Specifically, in NMOS scoring, the raters' preference for DSE-TTS over the baseline exceeds 0.3 in all speaker-language combinations. In addition, the SMOS and SECS scores indicate that DSE-TTS maintains speaker similarity comparable to SANE-TTS. These findings indicate that DSE-TTS can synthesize high-quality German and Spanish speech in the voices of non-native speakers, and that this speech sounds closer to native speech than that of the baseline model.
The method also performs an ablation study to examine the effect of dual speaker embedding (DSE) on model performance. The results, shown in FIG. 6, demonstrate that after integrating DSE the nativeness of the synthesized speech is significantly enhanced and the WER is significantly reduced. The observations also show that using DSE leads to a slightly lower speaker similarity score than not using it. This may be because different languages have unique speaking styles, and a non-native speaker may sound slightly different when fluently speaking a foreign language. This is also evidence that DSE-TTS produces native-sounding speech even though it is trained without any bilingual speakers.
In summary, the method proposes DSE-TTS, a VQTTS-based cross-language TTS model built on dual speaker embedding, which models the language speaking style and the speaker timbre separately. VQ features carry little speaker-dependent information; building on this finding, the method improves its model through dual speaker embedding, thereby achieving cross-language speech synthesis with high naturalness and a timbre similar to the target speaker. Experiments show that DSE-TTS is superior to SANE-TTS in both intra-lingual and cross-lingual synthesis, especially in naturalness. The effectiveness of the dual speaker embedding is also verified through the ablation study.
Fig. 7 is a schematic structural diagram of a cross-language speech synthesis system based on dual speaker embedding according to an embodiment of the present invention, where the system may execute the cross-language speech synthesis method based on dual speaker embedding according to any of the above embodiments and be configured in a terminal.
The cross-language speech synthesis system 10 based on the double-speaker embedding provided in this embodiment includes: a pronunciation style feature determination program module 11, a timbre feature determination program module 12 and a speech synthesis program module 13.
Wherein the pronunciation style feature determination program module 11 is configured to input a text and a native-language speaker embedding into a txt2vec acoustic model, in which a phoneme sequence encoding of the text is determined by a text encoder, and vector-quantized acoustic features and auxiliary features of the native-language speaker's pronunciation style are determined by a decoder from the phoneme sequence encoding and the native-language speaker embedding; the timbre feature determination program module 12 is configured to input the target-language speaker embedding serving as the non-native-language speaker, the vector-quantized acoustic features and the auxiliary features into a vec2wav vocoder, extract, in the vec2wav vocoder, an X-vector feature from the target-language speaker embedding, and input the X-vector feature, the vector-quantized acoustic features and the auxiliary features into a feature encoder to obtain cross-language acoustic features that imitate the timbre of the target-language speaker on the basis of the native-language speaker's pronunciation style; and the speech synthesis program module 13 is configured to determine, with a generator, cross-language synthesized speech from the cross-language acoustic features.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions, and the computer executable instructions can execute the cross-language voice synthesis method based on the double-speaker embedding in any of the method embodiments;
as one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
inputting a text and a native-language speaker embedding into a txt2vec acoustic model, determining a phoneme sequence encoding of the text through a text encoder, and determining, by a decoder in the txt2vec acoustic model, vector-quantized acoustic features and auxiliary features of the native-language speaker's pronunciation style from the phoneme sequence encoding and the native-language speaker embedding;
inputting a target-language speaker embedding serving as the non-native-language speaker, the vector-quantized acoustic features and the auxiliary features into a vec2wav vocoder, extracting, in the vec2wav vocoder, an X-vector feature from the target-language speaker embedding, and inputting the X-vector feature, the vector-quantized acoustic features and the auxiliary features into a feature encoder to obtain cross-language acoustic features that imitate the timbre of the target-language speaker on the basis of the native-language speaker's pronunciation style;
determining, with a generator, cross-language synthesized speech from the cross-language acoustic features.
As a non-volatile computer readable storage medium, it may be used to store a non-volatile software program, a non-volatile computer executable program, and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in a non-transitory computer readable storage medium that, when executed by a processor, perform the cross-language speech synthesis method based on double speaker embedding in any of the method embodiments described above.
Fig. 8 is a schematic hardware structure of an electronic device based on a cross-language speech synthesis method with dual speaker embedding according to another embodiment of the present application, as shown in fig. 8, where the device includes:
one or more processors 810, and a memory 820, one processor 810 being illustrated in fig. 8. The device based on the cross-language speech synthesis method of double speaker embedding may further include: an input device 830 and an output device 840.
Processor 810, memory 820, input device 830, and output device 840 may be connected by a bus or other means, for example in fig. 8.
The memory 820 is used as a non-volatile computer readable storage medium for storing non-volatile software programs, non-volatile computer executable programs, and modules, such as program instructions/modules corresponding to the cross-language speech synthesis method based on double speaker embedding in the embodiments of the present application. The processor 810 performs various functional applications of the server and data processing by running non-volatile software programs, instructions and modules stored in the memory 820, i.e., implements the cross-language speech synthesis method based on the double speaker embedding of the above-described method embodiments.
Memory 820 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data, etc. In addition, memory 820 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 820 may optionally include memory located remotely from processor 810, which may be connected to the mobile device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 830 may receive input numerical or character information. The output device 840 may include a display device such as a display screen.
The one or more modules are stored in the memory 820 that, when executed by the one or more processors 810, perform the cross-language speech synthesis method based on dual speaker embedding in any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present application.
The non-transitory computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, etc. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium may optionally include memory remotely located relative to the processor, which may be connected to the apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiment of the invention also provides electronic equipment, which comprises: the system comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the dual speaker embedding-based cross-language speech synthesis method of any of the embodiments of the present invention.
The electronic devices of the embodiments of the present application exist in a variety of forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication functionality and are aimed primarily at providing voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, low-end phones, and the like.
(2) Ultra mobile personal computer equipment, which belongs to the category of personal computers, has the functions of calculation and processing and generally has the characteristic of mobile internet surfing. Such terminals include PDA, MID, and UMPC devices, etc., such as tablet computers.
(3) Portable entertainment devices such devices can display and play multimedia content. The device comprises an audio player, a video player, a palm game machine, an electronic book, an intelligent toy and a portable vehicle navigation device.
(4) Other electronic devices with data processing functions.
In this document, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other like elements in the process, method, article or apparatus that comprises the element.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A cross-language speech synthesis method based on double-speaker embedding comprises the following steps:
inputting a text and a native-language speaker embedding into a txt2vec acoustic model, determining a phoneme sequence encoding of the text through a text encoder, and determining, by a decoder in the txt2vec acoustic model, vector-quantized acoustic features and auxiliary features of the native-language speaker's pronunciation style from the phoneme sequence encoding and the native-language speaker embedding;
inputting a target-language speaker embedding serving as the non-native-language speaker, the vector-quantized acoustic features and the auxiliary features into a vec2wav vocoder, extracting, in the vec2wav vocoder, an X-vector feature from the target-language speaker embedding, and inputting the X-vector feature, the vector-quantized acoustic features and the auxiliary features into a feature encoder to obtain cross-language acoustic features that imitate the timbre of the target-language speaker on the basis of the native-language speaker's pronunciation style;
determining, with a generator, cross-language synthesized speech from the cross-language acoustic features.
2. The method of claim 1, wherein after the determining, by a text encoder, a phoneme sequence encoding of the text, the method further comprises:
the phoneme sequence and the native language speaker insert are aligned with a length adjuster to preserve the pitch and accent of the text being spoken by the native language speaker.
3. The method of claim 1, wherein the determining, by a decoder, of vector-quantized acoustic features and auxiliary features of the native-language speaker's pronunciation style from the phoneme sequence encoding and the native-language speaker embedding comprises:
quantizing the phoneme sequence encoding into a plurality of speech frames, and independently predicting, with an auxiliary controller, the codebook index of each speech frame under the native-language speaker embedding, so as to construct native-language high-fidelity speech;
determining, by a decoder, vector-quantized acoustic features of the native-language speaker's pronunciation style and auxiliary features from the native-language high-fidelity speech, wherein the auxiliary features include the predicted codebook index probabilities of the speech frames.
4. The method of claim 1, wherein the txt2vec acoustic model and the vec2wav vocoder are pre-trained from a training dataset comprising native language speaker embeddings and target language speaker embeddings of non-native languages.
5. A dual speaker embedding-based cross-language speech synthesis system, comprising:
a pronunciation style feature determination program module, configured to input a text and a native-language speaker embedding into a txt2vec acoustic model, in which a phoneme sequence encoding of the text is determined by a text encoder, and vector-quantized acoustic features and auxiliary features of the native-language speaker's pronunciation style are determined by a decoder from the phoneme sequence encoding and the native-language speaker embedding;
a timbre feature determination program module, configured to input a target-language speaker embedding serving as the non-native-language speaker, the vector-quantized acoustic features and the auxiliary features into a vec2wav vocoder, extract, in the vec2wav vocoder, an X-vector feature from the target-language speaker embedding, and input the X-vector feature, the vector-quantized acoustic features and the auxiliary features into a feature encoder to obtain cross-language acoustic features that imitate the timbre of the target-language speaker on the basis of the native-language speaker's pronunciation style;
and a speech synthesis program module, configured to determine, with a generator, cross-language synthesized speech from the cross-language acoustic features.
6. The system of claim 5, wherein the system further comprises a length adjustment module configured to:
align the phoneme sequence and the native-language speaker embedding with a length adjuster so as to preserve the tone and accent of the text as spoken by the native-language speaker.
7. The system of claim 5, wherein the pronunciation style feature determination program module is configured to:
quantize the phoneme sequence encoding into a plurality of speech frames, and independently predict, with an auxiliary controller, the codebook index of each speech frame under the native-language speaker embedding, so as to construct native-language high-fidelity speech;
determine, by a decoder, vector-quantized acoustic features of the native-language speaker's pronunciation style and auxiliary features from the native-language high-fidelity speech, wherein the auxiliary features include the predicted codebook index probabilities of the speech frames.
8. The system of claim 5, wherein the txt2vec acoustic model and the vec2wav vocoder are pre-trained from a training dataset comprising native language speaker embeddings and target language speaker embeddings of non-native languages.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-4.
10. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method according to any of claims 1-4.
CN202310572407.0A 2023-05-19 2023-05-19 Cross-language speech synthesis method and system based on double-speaker embedding Pending CN116453502A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310572407.0A CN116453502A (en) 2023-05-19 2023-05-19 Cross-language speech synthesis method and system based on double-speaker embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310572407.0A CN116453502A (en) 2023-05-19 2023-05-19 Cross-language speech synthesis method and system based on double-speaker embedding

Publications (1)

Publication Number Publication Date
CN116453502A true CN116453502A (en) 2023-07-18

Family

ID=87125798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310572407.0A Pending CN116453502A (en) 2023-05-19 2023-05-19 Cross-language speech synthesis method and system based on double-speaker embedding

Country Status (1)

Country Link
CN (1) CN116453502A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727288A (en) * 2024-02-07 2024-03-19 翌东寰球(深圳)数字科技有限公司 Speech synthesis method, device, equipment and storage medium
CN117727288B (en) * 2024-02-07 2024-04-30 翌东寰球(深圳)数字科技有限公司 Speech synthesis method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination