WO2022140966A1 - Cross-language voice conversion method, computer device, and storage medium - Google Patents

Cross-language voice conversion method, computer device, and storage medium

Info

Publication number
WO2022140966A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
training
converted
speech
vector
Prior art date
Application number
PCT/CN2020/140344
Other languages
French (fr)
Chinese (zh)
Inventor
赵之源
王若童
黄东延
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司
Priority to PCT/CN2020/140344
Publication of WO2022140966A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • the present application relates to the field of computer technology, and in particular, to a cross-language voice conversion method, computer device and storage medium.
  • Machine learning and deep learning rely on massive data and the powerful processing power of computers, and have made major breakthroughs in the fields of image, speech, and text. Since the same type of framework can achieve good results in different fields, neural network algorithm models that have been used to solve text and image problems are all applied to the field of speech.
  • the existing neural network models applied in the field of speech can capture the characteristics of the target speaker's voice, so as to stably synthesize other utterances of that speaker, and they approach the level of real people in timbre similarity and naturalness of the language, but
  • the synthesized speech can only be in the same language as the target speaker's own speech;
  • the target speaker's voice cannot be synthesized into speech uttered by that speaker in another language. If the target speaker can only speak Chinese, only Chinese speech can be synthesized in that voice, and speech in other languages cannot be synthesized.
  • an embodiment of the present application provides a method for cross-language voice conversion, the method comprising:
  • the language used by the voice content of the voice to be converted is different from the language used by the voice content of the sample voice;
  • the voice content of the target voice is the same as the voice content of the to-be-converted voice.
  • an embodiment of the present application provides a computer device, comprising a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to perform the following steps:
  • the language used by the voice content of the voice to be converted is different from the language used by the voice content of the sample voice;
  • the voice content of the target voice is the same as the voice content of the to-be-converted voice.
  • an embodiment of the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, causes the processor to perform the following steps:
  • the language used by the voice content of the voice to be converted is different from the language used by the voice content of the sample voice;
  • the voice content of the target voice is the same as the voice content of the to-be-converted voice.
  • in the embodiments of the present application, a voice to be converted and a sample voice whose voice contents use different languages are obtained, and both are input into a pre-trained voice conversion model to obtain a target voice whose voice content is the same as that of the voice to be converted while simulating the sample voice;
  • this solves the problem that the target speaker's voice cannot be synthesized into speech uttered by the target speaker in another language, and achieves the beneficial effect of synthesizing the target user's voice across languages.
  • Fig. 1 is the application environment diagram of the cross-language speech conversion method in one embodiment
  • Fig. 2 is the flow chart of the cross-language speech conversion method in one embodiment
  • Fig. 3 is the flow chart of step S130 in the cross-language speech conversion method in one embodiment
  • Fig. 4 is a flowchart of step S110 in the cross-language speech conversion method in one embodiment
  • Fig. 5 is the flow chart of step S120 in the cross-language speech conversion method in one embodiment
  • Fig. 6 is a flowchart of step S410 in the cross-language speech conversion method in one embodiment
  • Fig. 7 is the flow chart of the speech conversion model training method in one embodiment
  • FIG. 8 is a structural block diagram of a computer device in one embodiment.
  • FIG. 1 is an application environment diagram of a method for cross-language speech conversion in one embodiment.
  • the cross-language voice conversion method is applied to a cross-language voice conversion system.
  • the cross-language voice conversion system includes a terminal 110 and a server 120 .
  • the terminal 110 and the server 120 are connected through a network, and the terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like.
  • the server 120 can be implemented by an independent server or a server cluster composed of multiple servers.
  • the terminal 110 is used to acquire the voice to be converted and the sample voice of the target user and upload them to the server 120.
  • the language used for the voice content of the voice to be converted is different from the language used for the voice content of the sample voice.
  • the server 120 is used to receive the voice to be converted and the sample voice of the target user; preprocess the voice to be converted to obtain the voice feature to be converted, and preprocess the sample voice to obtain the sample voice feature; take the voice feature to be converted and the sample voice feature as input and use a pre-trained voice conversion model to obtain the target voice feature; and convert the target voice feature into a target voice that simulates the sample voice, the voice content of the target voice being the same as the voice content of the voice to be converted.
  • the above-mentioned cross-language voice conversion method can also be applied directly to the terminal 110. In that case the terminal 110 obtains the voice to be converted and the sample voice of the target user, the language used by the voice content of the voice to be converted being different from the language used by the voice content of the sample voice; preprocesses the voice to be converted to obtain the voice feature to be converted and preprocesses the sample voice to obtain the sample voice feature;
  • takes the voice feature to be converted and the sample voice feature as input and uses a pre-trained voice conversion model to obtain the target voice feature; and converts the target voice feature into a target voice that simulates the sample voice, the voice content of the target voice being the same as the voice content of the voice to be converted.
  • a method for cross-language speech conversion is provided.
  • the method can be applied to both a terminal and a server, and this embodiment is described by taking the application to a terminal as an example.
  • the cross-language voice conversion method specifically includes the following steps:
  • when executing the cross-language voice conversion method, the user may run it on a mobile device, such as a mobile phone.
  • first, the user needs to input the voice to be converted and the sample voice of the target user, wherein the voice content of the voice to be converted is the content the user ultimately wants to obtain,
  • and the sample voice of the target user carries the vocal characteristics in which the user ultimately wants that content spoken.
  • in addition, the language used by the voice content of the voice to be converted is different from the language used by the voice content of the sample voice; for example, the voice to be converted may be Chinese while the sample voice is English, or the voice to be converted may mix English and Chinese while the sample voice is English. As long as the two languages are partially different, or not entirely the same, they are regarded as different. For instance, to make a target speaker who only speaks Chinese say "Yes", the user simply records "Yes" as the voice to be converted and obtains any piece of that speaker's Chinese speech as the sample voice.
  • S120 Preprocess the speech to be converted to obtain speech features to be converted, and perform preprocessing on the sample speech to obtain sample speech features.
  • the voice conversion model is a neural network model, which is pre-trained with a large number of user voices.
  • the input and output during the training process are also voice features.
  • the voice conversion model can extract the voice content from the voice features to be converted and the vocal characteristics from the sample voice features and combine them; therefore, after the voice features to be converted and the sample voice features are input into the pre-trained voice conversion model, the target voice features are obtained.
  • finally, the target voice feature needs to be converted into the target voice by another preset neural network model; the target voice obtained from the target voice feature produced by the voice conversion model simulates the vocal characteristics of the sample voice, and the voice content it utters is the voice content of the voice to be converted, so cross-language voice conversion is completed.
  • the other preset neural network model may be a WaveNet neural network model, a WaveRNN neural network model, and so on.
  • in the embodiments of the present application, a voice to be converted and a sample voice whose voice contents use different languages are obtained, and both are input into a pre-trained voice conversion model to obtain a target voice whose voice content is the same as that of the voice to be converted while simulating the sample voice;
  • this solves the problem that the target speaker's voice cannot be synthesized into speech uttered by the target speaker in another language, and achieves the beneficial effect of synthesizing the target user's voice across languages.
  • step S130 specifically includes:
  • the speech feature to be converted is the Mel cepstrum to be converted
  • the example speech feature is the example Mel cepstrum.
  • the speech conversion model includes a first encoder, a second encoder, a length regulator and a decoder.
  • the first encoder is built based on the FastSpeech framework, and the first encoder includes FFT Block (Feed-Forward Transformer Block, FFT block), which is based on a non-autoregressive self-attention mechanism and a one-dimensional convolutional neural network.
  • the network is generated, so that the first encoder does not depend on the output of the previous frame, and can perform parallel operations, thereby greatly speeding up the generation of target speech features.
  • the first encoder includes a CNN (Convolutional Neural Network) model, a Positional Encoding (position-based embedding) model and an FFT Block
  • the second encoder includes an LSTM (Long Short-Term Memory, long short-term memory network) model , Linear (linear regression algorithm) model, as well as pooling layer and normalization layer
  • length adjuster includes CNN model and Linear model
  • decoder includes FFT Block, Linear model, Post-Net and output layer.
  • the Mel cepstrum to be converted is input into the first encoder, and the CNN model in the first encoder is used to compress the Mel cepstrum to be converted to obtain Bottle-neck features, so as to better extract speech content, and then based on the parallel operation of the FFT Block, the first vector is quickly output.
  • the vector length of the first vector is the maximum input sequence length in the batch (Batch), and the remaining sequences that are not long enough are zero-padded at the end.
  • the first vector is used as the extracted speech content.
  • the partial example Mel cepstrum is input to the second encoder, and the second encoder outputs a second vector, wherein the partial example Mel cepstrum is randomly intercepted from the example speech feature, i.e. the example Mel cepstrum.
  • specifically, after the example speech is converted into the example Mel cepstrum, a preset number of segments are randomly cut from the target user's example Mel cepstrum and spliced together as the partial example Mel cepstrum; the resulting second vector serves as the extracted vocal characteristics.
  • the length regulator can also obtain, through its own two convolutional layers, the predicted extension length of each frame in the third vector according to the third vector, which is equivalent to predicting the duration of each frame in the Mel cepstrum, and extend the third vector into the fourth vector according to the predicted extension lengths.
  • for example, if the speech content corresponding to the third vector is "你好吗" ("How are you") and its feature length is 3, and the predicted extension lengths obtained by the length regulator from the third vector are [4, 2, 3], then in the resulting fourth vector the feature length of "你" is 4, that of "好" is 2, and that of "吗" is 3.
  • the fourth vector is input to the decoder to obtain the predicted Mel cepstrum, and the predicted Mel cepstrum is used as the target speech feature.
  • the first encoder does not depend on the output of the previous frame and can perform parallel operations, thereby greatly speeding up the generation of target speech features.
  • step S110 specifically includes:
  • S310: Acquire the text to be converted and the example voice of the target user.
  • S320: Convert the text to be converted into synthesized speech as the voice to be converted, the language used by the speech content of the voice to be converted being different from the language used by the speech content of the example voice.
  • if the voice read aloud by the user were used directly as the input voice feature of the subsequent voice conversion model, factors on the user's side, such as coughing or slurred speech, could interfere with the input voice feature; to avoid this,
  • the text to be converted is obtained, its text content being the same as the voice content of the voice to be converted, and then TTS (Text To Speech) technology is used to convert the text to be converted into synthesized speech, which serves as the voice to be converted. By converting text with the same content into clear and accurate synthesized speech, interference caused by the user is eliminated.
  • the function of the first encoder in the speech conversion model is to remove the vocal characteristic s_i of the input speech from the input sequence and keep only the speech content c; each frame of the input sequence can then be regarded as carrying a content component together with the speaker component, as formulated in the detailed description below.
  • step S120 specifically includes:
  • when the voice to be converted is preprocessed to obtain the voice features to be converted, a short-time Fourier transform is first performed on the voice to be converted to obtain the amplitude spectrum and the phase spectrum, which converts the waveform of the voice to be converted from the time domain to the frequency domain and facilitates the extraction of speech features; only the amplitude spectrum is then filtered to obtain the Mel spectrum.
  • the filter used for filtering can be a Filter Bank (Mel filter bank), which is designed around the way human hearing resolves different frequencies:
  • the filters at low frequencies are denser while the filters at high frequencies are sparser, so the filtering result better matches human auditory perception.
  • cepstral analysis is then performed on the Mel spectrum, and the resulting Mel cepstrum to be converted is used as the speech feature to be converted. It should be noted that the example voice needs to be processed in the same way as the voice to be converted, which is not repeated in this embodiment of the present application (a preprocessing sketch in code follows this list).
  • by converting the speech to be converted into a Mel cepstrum, the embodiment of the present application not only approximates the characteristics of the human vocal mechanism and the nonlinear auditory system, but also facilitates the training and the input and output of the neural network model.
  • step S410 specifically includes:
  • since there are blank (silent) portions at the head and tail of the speech to be converted, in order for the speech conversion model to align, learn and convert better, the blanks at the beginning and end of the speech to be converted are removed before the short-time Fourier transform is performed to obtain the amplitude spectrum, giving the first modified speech to be converted. In addition, to better suit the short-time Fourier transform, after the first modified speech to be converted is obtained it is also pre-emphasized, framed and windowed to obtain the second modified speech to be converted.
  • steps S510 and S520 in this embodiment of the present application may be selectively executed according to user requirements.
  • a method for training a speech conversion model is provided.
  • the method can be applied to both a terminal and a server, and this embodiment is described by taking the application to a terminal as an example.
  • the speech conversion model training specifically includes the following steps:
  • Preprocess the training speech to obtain training speech features, preprocess the first training example speech to obtain first training example speech features, and preprocess the second training example speech to obtain second training example speech features.
  • the training example speech includes the first training example speech and the second training example speech, wherein the speech content of the first training example speech is the same as the speech content of the training speech,
  • and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech.
  • the first training example speech is the predicted speech we ultimately want to obtain,
  • while the second training example speech provides the speech features used as model input.
  • the training speech needs to be preprocessed to obtain the training speech features
  • the first training example speech is preprocessed to obtain the first training example speech features
  • the second training example speech is preprocessed to obtain the second training example speech features
  • the training speech feature is the training Mel cepstrum
  • the first training example speech feature is the first training example Mel cepstrum
  • the second training example speech feature is the second training example Mel cepstrum.
  • after the training predicted Mel cepstrum is obtained, the training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum, i.e. the loss between the predicted value and the actual value, also needs to be calculated; finally, backpropagation is performed according to the training loss to update the training weights of the speech conversion model until the speech conversion model converges (a training-step sketch in code follows this list).
  • for example, if the training speech includes "YES",
  • the first training example speech with the same content, i.e. "YES" uttered by the training user, needs to be obtained at the same time,
  • together with a second training example speech whose content is in a different language,
  • i.e. speech uttered by the training user in another language, such as "good". When there is enough data in the training set, the "good" uttered by the training user serves as the first training example speech whenever the training speech includes "good", and in that case there is no need to additionally acquire a second training example speech.
  • the languages used by the speech content of the training speech include the language used by the speech content of the speech to be converted in actual use, that is, the language of the speech to be converted participates in the training of the speech conversion model; the training users also include the target user, that is, the target user participates in the training of the speech conversion model as a training user, so that cross-language conversion can be achieved more accurately.
  • since the first encoder does not depend on the output of the previous frame, the training speed of the speech conversion model is greatly accelerated.
  • Figure 8 shows an internal structure diagram of a computer device in one embodiment.
  • the computer device may be a terminal or a server.
  • the computer device includes a processor, memory, and a network interface connected by a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system, and also stores a computer program, which, when executed by the processor, enables the processor to implement a method for cross-language voice conversion.
  • a computer program can also be stored in the internal memory, and when the computer program is executed by the processor, can cause the processor to execute the cross-language speech conversion method.
  • a computer device comprising a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor performs the following steps:
  • the voice feature to be converted is a Mel cepstrum to be converted
  • the example voice feature is an example Mel cepstrum
  • the voice conversion model includes a first encoder, a second encoder, a length regulator and a decoder
  • the first encoder includes an FFT Block
  • the voice feature to be converted and the sample voice feature are used as input
  • obtaining the target voice feature using the pre-trained voice conversion model includes: inputting the Mel cepstrum to be converted into the first encoder to obtain a first vector;
  • inputting a partial example Mel cepstrum into the second encoder to obtain a second vector, the partial example Mel cepstrum being obtained by random interception from the example Mel cepstrum;
  • splicing the first vector and the second vector to obtain a third vector;
  • inputting the third vector into the length regulator to obtain a fourth vector;
  • inputting the fourth vector into the decoder to obtain a predicted Mel cepstrum as the target voice feature.
  • the first encoder is configured to compress the Mel cepstrum to obtain the first vector
  • the length regulator is configured to obtain the predicted extension length of each frame in the third vector according to the third vector and to extend the third vector into the fourth vector according to the predicted extension lengths.
  • the training of the speech conversion model includes: acquiring a training speech, a first training example speech and a second training example speech of a training user, wherein the speech content of the first training example speech is the same as the speech content of the training speech,
  • and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech;
  • the training speech is preprocessed to obtain training speech features;
  • the first training example speech is preprocessed to obtain first training example speech features;
  • the second training example speech is preprocessed to obtain the second training example speech feature
  • the training speech feature is the training Mel cepstrum
  • the first training example speech feature is the first training example Mel cepstrum
  • the training Mel cepstrum is input to the first encoder to obtain a first vector;
  • part of the second training example Mel cepstrum is input to the second encoder to obtain a second vector, the partial second training example Mel cepstrum being obtained by random interception from the second training example Mel cepstrum;
  • the third vector is obtained by splicing the first vector and the second vector;
  • the third vector is input to the length adjuster to obtain the fourth vector;
  • the fourth vector is input to the decoder to obtain the training predicted Mel cepstrum; the training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum is computed; and backpropagation is performed based on the training loss to update the training weights of the speech conversion model until the speech conversion model converges.
  • the acquiring the speech to be converted includes: acquiring the text to be converted; and converting the text to be converted into synthesized speech as the speech to be converted.
  • the preprocessing of the speech to be converted to obtain the speech features to be converted includes: performing a short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum; filtering the amplitude spectrum to obtain a Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain the Mel cepstrum to be converted, which is used as the speech feature to be converted.
  • the performing of the short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum includes: removing the blank parts at the beginning and end of the speech to be converted to obtain a first modified speech to be converted; performing pre-emphasis, framing and windowing on the first modified speech to be converted to obtain a second modified speech to be converted; and performing a short-time Fourier transform on the second modified speech to be converted to obtain the amplitude spectrum.
  • a computer-readable storage medium which stores a computer program, and when the computer program is executed by a processor, causes the processor to perform the following steps:
  • the voice feature to be converted is a Mel cepstrum to be converted
  • the example voice feature is an example Mel cepstrum
  • the voice conversion model includes a first encoder, a second encoder, a length regulator and a decoder
  • the first encoder includes an FFT Block
  • the voice feature to be converted and the sample voice feature are used as input
  • obtaining the target voice feature using the pre-trained voice conversion model includes: inputting the Mel cepstrum to be converted into the first encoder to obtain a first vector;
  • inputting a partial example Mel cepstrum into the second encoder to obtain a second vector, the partial example Mel cepstrum being obtained by random interception from the example Mel cepstrum;
  • splicing the first vector and the second vector to obtain a third vector;
  • inputting the third vector into the length regulator to obtain a fourth vector;
  • inputting the fourth vector into the decoder to obtain a predicted Mel cepstrum as the target voice feature.
  • the first encoder is configured to compress the Mel cepstrum to obtain the first vector
  • the length regulator is configured to obtain the predicted extension length of each frame in the third vector according to the third vector and to extend the third vector into the fourth vector according to the predicted extension lengths.
  • the training of the speech conversion model includes: acquiring a training speech, a first training example speech and a second training example speech of a training user, wherein the speech content of the first training example speech is the same as the speech content of the training speech,
  • and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech;
  • the training speech is preprocessed to obtain training speech features;
  • the first training example speech is preprocessed to obtain first training example speech features;
  • the second training example speech is preprocessed to obtain the second training example speech feature
  • the training speech feature is the training Mel cepstrum
  • the first training example speech feature is the first training example Mel cepstrum
  • the training Mel cepstrum is input to the first encoder to obtain a first vector;
  • part of the second training example Mel cepstrum is input to the second encoder to obtain a second vector, the partial second training example Mel cepstrum being obtained by random interception from the second training example Mel cepstrum;
  • the third vector is obtained by splicing the first vector and the second vector;
  • the third vector is input to the length adjuster to obtain the fourth vector;
  • the fourth vector is input to the decoder to obtain the training predicted Mel cepstrum; the training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum is computed; and backpropagation is performed based on the training loss to update the training weights of the speech conversion model until the speech conversion model converges.
  • the acquiring the speech to be converted includes: acquiring the text to be converted; and converting the text to be converted into synthesized speech as the speech to be converted.
  • the preprocessing of the speech to be converted to obtain the speech features to be converted includes: performing a short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum; filtering the amplitude spectrum to obtain a Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain the Mel cepstrum to be converted, which is used as the speech feature to be converted.
  • the performing of the short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum includes: removing the blank parts at the beginning and end of the speech to be converted to obtain a first modified speech to be converted; performing pre-emphasis, framing and windowing on the first modified speech to be converted to obtain a second modified speech to be converted; and performing a short-time Fourier transform on the second modified speech to be converted to obtain the amplitude spectrum.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM), among others.
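The preprocessing chain referenced in the list above (trim leading and trailing silence, pre-emphasis, framing and windowing, short-time Fourier transform, Mel filtering, cepstral analysis) can be sketched as follows. This is a minimal illustration assuming librosa and commonly used frame and band sizes; the patent does not prescribe these values or this library.

```python
# Sketch of the preprocessing described above (steps S510/S520 and S120).
# Hop/window sizes, 80 Mel bands and 40 cepstral coefficients are illustrative
# assumptions; the patent does not fix these values.

import librosa
import numpy as np
import scipy.fftpack

def mel_cepstrum(path: str, sr: int = 16000, n_fft: int = 1024,
                 hop: int = 256, n_mels: int = 80, n_ceps: int = 40) -> np.ndarray:
    wav, _ = librosa.load(path, sr=sr)
    wav, _ = librosa.effects.trim(wav)                 # remove leading/trailing blanks
    wav = librosa.effects.preemphasis(wav)             # pre-emphasis
    # framing + windowing + short-time Fourier transform; keep only the magnitude
    mag = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop, window="hann"))
    # Mel filter bank: denser filters at low frequencies, sparser at high ones
    mel = librosa.feature.melspectrogram(S=mag ** 2, sr=sr, n_mels=n_mels)
    # cepstral analysis: log followed by a DCT along the frequency axis
    cep = scipy.fftpack.dct(np.log(mel + 1e-6), axis=0, norm="ortho")[:n_ceps]
    return cep.T                                       # shape: (frames, n_ceps)
```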
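Likewise, the training procedure described above (predict a Mel cepstrum from the training speech plus intercepted second-training-example segments, compare it with the first training example Mel cepstrum, and backpropagate the loss) can be sketched as one training step. The optimizer and the L1 reconstruction loss are illustrative assumptions, not choices stated in the patent.

```python
# One training step of the speech conversion model as described above.
# The optimizer choice and the L1 reconstruction loss are assumptions.

import torch

def train_step(model, optimizer, training_mel, example_segment_mel, target_mel):
    """training_mel: Mel cepstrum of the training speech (model input, language A)
    example_segment_mel: randomly intercepted second-training-example segments
    target_mel: first training example Mel cepstrum (same content, ground truth)"""
    optimizer.zero_grad()
    predicted_mel = model(training_mel, example_segment_mel)
    loss = torch.nn.functional.l1_loss(predicted_mel, target_mel)
    loss.backward()                 # backpropagate the training loss
    optimizer.step()                # update the training weights
    return loss.item()
```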

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A cross-language voice conversion method, comprising: obtaining a voice to be converted and an example voice of a target user, wherein the language used by voice content of the voice to be converted is different from the language used by voice content of the example voice (S110); preprocessing the voice to be converted to obtain a voice feature to be converted, and preprocessing the example voice to obtain an example voice feature (S120); taking the voice feature to be converted and the example voice feature as inputs, and using a pre-trained voice conversion model to obtain a target voice feature (S130); and converting the target voice feature into a target voice simulating the example voice, wherein voice content of the target voice is the same as the voice content of the voice to be converted (S140). Thus, the cross-language synthesis of the voice of the target user is implemented. Also provided are a computer device and a storage medium.

Description

Cross-language voice conversion method, computer device and storage medium
Technical Field
The present application relates to the field of computer technology, and in particular to a cross-language voice conversion method, a computer device and a storage medium.
Background Art
Machine learning and deep learning rely on massive data and the powerful processing capability of computers, and have made major breakthroughs in the fields of images, speech and text. Since the same type of framework can achieve good results in different fields, neural network models that were originally used to solve text and image problems have also been applied to speech.
Existing neural network models applied to speech can capture the characteristics of a target speaker's voice and thus stably synthesize other utterances of that speaker, approaching human level in timbre similarity and naturalness of the language. However, the synthesized speech can only be in the same language as the target speaker's own speech; the target speaker's voice cannot be synthesized into speech in another language. If the target speaker only speaks Chinese, only Chinese speech can be synthesized in that voice, not speech in other languages.
Content of the Application
In view of this, it is necessary to provide a cross-language voice conversion method, a computer device and a storage medium that address the above problem.
In a first aspect, an embodiment of the present application provides a cross-language voice conversion method, the method comprising:
acquiring a voice to be converted and an example voice of a target user, wherein the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice;
preprocessing the voice to be converted to obtain a voice feature to be converted, and preprocessing the example voice to obtain an example voice feature;
taking the voice feature to be converted and the example voice feature as input, and using a pre-trained voice conversion model to obtain a target voice feature;
converting the target voice feature into a target voice that simulates the example voice, the voice content of the target voice being the same as the voice content of the voice to be converted.
In a second aspect, an embodiment of the present application provides a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
acquiring a voice to be converted and an example voice of a target user, wherein the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice;
preprocessing the voice to be converted to obtain a voice feature to be converted, and preprocessing the example voice to obtain an example voice feature;
taking the voice feature to be converted and the example voice feature as input, and using a pre-trained voice conversion model to obtain a target voice feature;
converting the target voice feature into a target voice that simulates the example voice, the voice content of the target voice being the same as the voice content of the voice to be converted.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
acquiring a voice to be converted and an example voice of a target user, wherein the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice;
preprocessing the voice to be converted to obtain a voice feature to be converted, and preprocessing the example voice to obtain an example voice feature;
taking the voice feature to be converted and the example voice feature as input, and using a pre-trained voice conversion model to obtain a target voice feature;
converting the target voice feature into a target voice that simulates the example voice, the voice content of the target voice being the same as the voice content of the voice to be converted.
In the embodiments of the present application, a voice to be converted and an example voice whose voice contents use different languages are obtained and input into a pre-trained voice conversion model, yielding a target voice whose content is the same as that of the voice to be converted while simulating the example voice. This solves the problem that a target speaker's voice cannot be synthesized into speech in another language and achieves cross-language synthesis of the target user's voice.
Description of the Drawings
In order to describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
In the drawings:
Fig. 1 is an application environment diagram of the cross-language voice conversion method in one embodiment;
Fig. 2 is a flowchart of the cross-language voice conversion method in one embodiment;
Fig. 3 is a flowchart of step S130 of the cross-language voice conversion method in one embodiment;
Fig. 4 is a flowchart of step S110 of the cross-language voice conversion method in one embodiment;
Fig. 5 is a flowchart of step S120 of the cross-language voice conversion method in one embodiment;
Fig. 6 is a flowchart of step S410 of the cross-language voice conversion method in one embodiment;
Fig. 7 is a flowchart of the voice conversion model training method in one embodiment;
Fig. 8 is a structural block diagram of a computer device in one embodiment.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
Fig. 1 is an application environment diagram of the cross-language voice conversion method in one embodiment. Referring to Fig. 1, the cross-language voice conversion method is applied to a cross-language voice conversion system. The cross-language voice conversion system includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer and the like. The server 120 may be implemented by an independent server or by a server cluster composed of multiple servers. The terminal 110 is used to acquire the voice to be converted and the example voice of the target user and upload them to the server 120, the language used by the voice content of the voice to be converted being different from the language used by the voice content of the example voice. The server 120 is used to receive the voice to be converted and the example voice of the target user; preprocess the voice to be converted to obtain the voice feature to be converted, and preprocess the example voice to obtain the example voice feature; take the voice feature to be converted and the example voice feature as input and use a pre-trained voice conversion model to obtain the target voice feature; and convert the target voice feature into a target voice that simulates the example voice, the voice content of the target voice being the same as the voice content of the voice to be converted.
In another embodiment, the above cross-language voice conversion method may also be applied directly to the terminal 110. The terminal 110 is used to acquire the voice to be converted and the example voice of the target user, the language used by the voice content of the voice to be converted being different from the language used by the voice content of the example voice; preprocess the voice to be converted to obtain the voice feature to be converted, and preprocess the example voice to obtain the example voice feature; take the voice feature to be converted and the example voice feature as input and use a pre-trained voice conversion model to obtain the target voice feature; and convert the target voice feature into a target voice that simulates the example voice, the voice content of the target voice being the same as the voice content of the voice to be converted.
As shown in Fig. 2, in one embodiment a cross-language voice conversion method is provided. The method can be applied to a terminal or to a server; this embodiment takes application to a terminal as an example. The cross-language voice conversion method specifically includes the following steps:
S110: Acquire the voice to be converted and the example voice of the target user, the language used by the voice content of the voice to be converted being different from the language used by the voice content of the example voice.
In this embodiment, the user may execute the cross-language voice conversion method on a mobile device such as a mobile phone. First, the user needs to input the voice to be converted and the example voice of the target user, where the voice content of the voice to be converted is the content the user ultimately wants to obtain and the example voice of the target user carries the vocal characteristics in which the user ultimately wants that content spoken. In addition, the language used by the voice content of the voice to be converted is different from the language used by the voice content of the example voice: the voice to be converted may be Chinese while the example voice is English, or the voice to be converted may mix English and Chinese while the example voice is English. It should be noted that as long as the languages of the two voice contents are partially different, or not entirely the same, they are regarded as different. For example, if the user wants a target speaker A, who only speaks Chinese, to say "Yes", the user only needs to record "Yes" as the voice to be converted and obtain an example voice of A, which can be any piece of Chinese speech spoken by A.
S120: Preprocess the voice to be converted to obtain the voice feature to be converted, and preprocess the example voice to obtain the example voice feature.
S130: Take the voice feature to be converted and the example voice feature as input, and use a pre-trained voice conversion model to obtain the target voice feature.
S140: Convert the target voice feature into a target voice that simulates the example voice, the voice content of the target voice being the same as the voice content of the voice to be converted.
In this embodiment, after the voice to be converted and the example voice are obtained, the voice to be converted is preprocessed to obtain the voice feature to be converted and the example voice is preprocessed to obtain the example voice feature, so that they can be fed to the voice conversion model. The voice conversion model is a neural network model trained in advance on the voices of a large number of training users, and its inputs and outputs during training are also voice features. The voice conversion model can extract the voice content from the voice feature to be converted and the vocal characteristics from the example voice feature and combine them, so inputting the voice feature to be converted and the example voice feature into the pre-trained voice conversion model yields the target voice feature. Finally, the target voice feature is converted into the target voice by another preset neural network model. The target voice obtained from the target voice feature produced by the voice conversion model simulates the vocal characteristics of the example voice, while its voice content is that of the voice to be converted; since the two voice contents use different languages, cross-language voice conversion is completed. The other preset neural network model may be a WaveNet neural network model, a WaveRNN neural network model, or the like.
In the embodiments of the present application, a voice to be converted and an example voice whose voice contents use different languages are obtained and input into a pre-trained voice conversion model, yielding a target voice whose content is the same as that of the voice to be converted while simulating the example voice. This solves the problem that a target speaker's voice cannot be synthesized into speech in another language and achieves cross-language synthesis of the target user's voice.
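Putting steps S110 to S140 together, the inference flow can be sketched as follows. The helper names and call signatures are assumptions made for illustration only; the conversion model and the WaveNet/WaveRNN-style vocoder are treated as black boxes here.

```python
# Hypothetical end-to-end inference sketch of steps S110-S140. The helper names
# (extract_mel_cepstrum, conversion_model, vocoder) are illustrative assumptions,
# not APIs defined by the patent; the vocoder stands in for a WaveNet/WaveRNN model.

import numpy as np

def extract_mel_cepstrum(waveform: np.ndarray, sr: int) -> np.ndarray:
    """Preprocessing placeholder for S120 (STFT -> Mel filtering -> cepstral analysis)."""
    raise NotImplementedError("see the preprocessing sketch earlier in this document")

def cross_language_convert(src_wav, example_wav, sr, conversion_model, vocoder):
    src_feat = extract_mel_cepstrum(src_wav, sr)            # S120: content carrier
    example_feat = extract_mel_cepstrum(example_wav, sr)     # S120: timbre carrier
    target_feat = conversion_model(src_feat, example_feat)   # S130: pre-trained model
    return vocoder(target_feat)                              # S140: neural vocoder -> waveform
```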
In one embodiment, as shown in Fig. 3, step S130 specifically includes:
S210: Input the Mel cepstrum to be converted into the first encoder to obtain a first vector.
S220: Input a partial example Mel cepstrum into the second encoder to obtain a second vector, the partial example Mel cepstrum being obtained by random interception from the example Mel cepstrum.
In this embodiment, the voice feature to be converted is the Mel cepstrum to be converted and the example voice feature is the example Mel cepstrum. After the voice feature to be converted and the example voice feature are obtained, they can be input into the pre-trained voice conversion model, which includes a first encoder, a second encoder, a length regulator and a decoder. The first encoder is built on the FastSpeech framework and includes FFT Blocks (Feed-Forward Transformer Blocks). An FFT Block is built from a non-autoregressive self-attention mechanism and a one-dimensional convolutional neural network, so the first encoder does not depend on the output of the previous frame and can run in parallel, which greatly speeds up the generation of the target voice feature. Specifically, the first encoder includes a CNN (convolutional neural network) model, a positional encoding model and FFT Blocks; the second encoder includes an LSTM (Long Short-Term Memory) model, a Linear model, a pooling layer and a normalization layer; the length regulator includes a CNN model and a Linear model; and the decoder includes FFT Blocks, a Linear model, a Post-Net and an output layer.
Specifically, the Mel cepstrum to be converted is input into the first encoder, where the CNN model compresses it into bottleneck features so that the speech content can be extracted better, and the FFT Blocks then output the first vector quickly through parallel computation. The vector length of the first vector is the maximum input sequence length in the batch, and the remaining shorter sequences are zero-padded at the end; the resulting first vector serves as the extracted speech content. A partial example Mel cepstrum is then input into the second encoder, which outputs a second vector; the partial example Mel cepstrum is obtained by random interception from the example voice feature, i.e. the example Mel cepstrum. Specifically, after the example voice is converted into the example Mel cepstrum, a preset number of segments are randomly cut from the target user's example Mel cepstrum and spliced together as the partial example Mel cepstrum, and the resulting second vector serves as the extracted vocal characteristics.
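The module layout just described (a FastSpeech-style content encoder, an LSTM-based example-voice encoder, a length regulator and an FFT-block decoder) can be pictured with a much-simplified PyTorch sketch. Dimensions, the number of FFT Blocks, and the omission of positional encoding, pooling/normalization details and the decoder with its Post-Net are simplifications for illustration, not the patent's actual configuration.

```python
# Much-simplified sketch of the encoder modules described above. Sizes and the
# FFT-block internals are illustrative assumptions; positional encoding and the
# decoder/Post-Net are omitted for brevity.

import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """Feed-Forward Transformer block: self-attention + 1-D convolutions."""
    def __init__(self, dim=256, heads=2, kernel=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                        # x: (batch, frames, dim)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + c)

class ContentEncoder(nn.Module):
    """First encoder: CNN bottleneck + FFT blocks over the source Mel cepstrum."""
    def __init__(self, n_mel=80, dim=256):
        super().__init__()
        self.bottleneck = nn.Conv1d(n_mel, dim, kernel_size=1)
        self.blocks = nn.Sequential(FFTBlock(dim), FFTBlock(dim))

    def forward(self, mel):                      # mel: (batch, frames, n_mel)
        x = self.bottleneck(mel.transpose(1, 2)).transpose(1, 2)
        return self.blocks(x)                    # per-frame "first vector"

class SpeakerEncoder(nn.Module):
    """Second encoder: LSTM + linear projection + mean pooling over example segments."""
    def __init__(self, n_mel=80, dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mel, dim, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, example_mel):              # (batch, frames, n_mel)
        h, _ = self.lstm(example_mel)
        return torch.nn.functional.normalize(self.proj(h.mean(dim=1)), dim=-1)
```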
S230: Splice the first vector and the second vector to obtain a third vector.
S240: Input the third vector into the length regulator to obtain a fourth vector.
S250: Input the fourth vector into the decoder to obtain a predicted Mel cepstrum as the target voice feature.
In this embodiment, after the first vector and the second vector are obtained, they are spliced into a third vector, which is input into the length regulator. Because the first vector has been compressed by the first encoder, the length regulator uses its own two convolutional layers to obtain, from the third vector, a predicted extension length for each frame of the third vector, which amounts to predicting the duration of each frame in the Mel cepstrum, and then extends the third vector into the fourth vector according to the predicted extension lengths. For example, if the speech content corresponding to the third vector is "你好吗" ("How are you") with a feature length of 3 and the length regulator predicts the extension lengths [4, 2, 3] from the third vector, then in the resulting fourth vector the feature length of "你" is 4, that of "好" is 2 and that of "吗" is 3. Finally, the fourth vector is input into the decoder to obtain the predicted Mel cepstrum, which is used as the target voice feature.
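The length-regulation step in the "你好吗" example can be reproduced in a few lines; the duration predictor itself (the regulator's two convolutional layers) is omitted and the example durations [4, 2, 3] are hard-coded.

```python
# Sketch of length regulation: each frame of the third vector is repeated by its
# predicted extension length. Durations are the example values from the text.

import torch

third_vector = torch.randn(3, 256)          # 3 frames ("你", "好", "吗")
durations = torch.tensor([4, 2, 3])         # predicted extension lengths

fourth_vector = torch.repeat_interleave(third_vector, durations, dim=0)
print(fourth_vector.shape)                  # torch.Size([9, 256])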
In this embodiment of the present invention, because the FFT Blocks are generated from a non-autoregressive self-attention mechanism and a one-dimensional convolutional neural network, the first encoder does not depend on the output of the previous frame and can run in parallel, which greatly speeds up the generation of the target speech feature.
In one embodiment, as shown in FIG. 4, step S110 specifically includes:
S310. Acquire the text to be converted and the example speech of the target user.
S320. Convert the text to be converted into synthesized speech as the speech to be converted, where the language used by the speech content of the speech to be converted is different from the language used by the speech content of the example speech.
In this embodiment, if speech read aloud by the user were directly acquired as the speech to be converted and used as the input speech feature of the subsequent voice conversion model, factors attributable to the user, such as coughing or slurred pronunciation, could interfere with the input speech feature. To avoid this, the text to be converted is acquired in this embodiment, where the text content of the text to be converted is the same as the speech content of the speech to be converted, and TTS (Text-To-Speech) technology is then used to convert the text to be converted into synthesized speech, which serves as the speech to be converted. By converting text with the same content into clear and accurate synthesized speech, interference arising from the user's own behavior is eliminated.
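A small sketch of the input pipeline this embodiment describes: the text to be converted is rendered by a TTS engine, and that synthesized waveform, rather than a user recording, is what gets preprocessed. `synthesize_speech` is a placeholder for whatever TTS system is used; it is not an API named in the patent.

```python
def build_input_speech(text_to_convert, synthesize_speech):
    """Return the 'speech to be converted' from text via a caller-supplied TTS function.

    `synthesize_speech(text) -> waveform` is a stand-in for any TTS engine. Using a
    fixed synthetic voice keeps the input voice characteristic constant (the s_0 of
    the derivation below), so user-dependent artefacts such as coughs never reach
    the voice conversion model.
    """
    return synthesize_speech(text_to_convert)
```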
Further, to illustrate that using synthesized speech as the input to the voice conversion model removes interference attributable to the user, consider the model in use. Assume the feature sequence of the speech feature to be converted is x = (x_1, x_2, …, x_n), where n denotes the n-th frame on the time axis of the Mel cepstrum to be converted, and the feature sequence of the target speech feature predicted by the voice conversion model is y = (y_1, y_2, …, y_m), where m likewise denotes the m-th frame on the time axis of the predicted Mel cepstrum. The aim is for the feature sequence predicted by the voice conversion model to be as close as possible to the target feature sequence ŷ of the actual speech feature. (The mathematical symbols in this passage appear only as images in the published text; ŷ, s_i and ŝ_t are used here to stand for them.)
It is assumed that every frame of the input feature sequence contains two latent variables: one is the speech content of the input speech, c = (c_1, c_2, …, c_n), and the other is the voice characteristic of the input speech, s = (s_1, s_2, …, s_i); the target sequence ŷ likewise contains the voice characteristic of the target user. Here i denotes the input speech and t denotes the target user, with i ∈ {1, 2, …, j} and t ∈ {1, 2, …, k}, where j is the number of input speech samples in the whole input data set and k is the number of target users in the whole input data set.
The role of the first encoder in the voice conversion model is to remove the voice characteristic s_i of the input speech from the input sequence and keep only the speech content c, so that the input sequence can be expressed in the form of Equation (1) (published as an image).
Because the method converts TTS-synthesized speech into real human speech in order to separate the user's voice characteristic from the speech content, the input speech carries only a single voice characteristic, namely that of the synthesized speech, denoted s_0, which can be regarded as a constant. By Bayes' theorem, Equation (1) can then be rewritten as Equation (2) (published as an image). For the predicted sequence y, the same reasoning gives Equation (3) (published as an image).
Here ŝ_t is the output of the second encoder and c is the output of the first encoder; the two are combined, adjusted by the length regulator, and used as the input of the decoder, which finally outputs the predicted sequence y. Since c and ŝ_t come from two different sequences, they can be regarded as mutually independent. Combining Equations (2) and (3) therefore gives Equation (4) (published as an image).
Equation (4) shows that when the input speech is a fixed synthesized speech, the predicted sequence y depends only on the input sequence x, the training user's voice characteristic ŝ_t, and the speech content c. This removes the interference that directly acquiring speech read aloud by the user as the input speech would otherwise introduce into the extraction of speech content by the voice conversion model.
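The four equations referenced above are published only as images, so they cannot be reproduced exactly here. Under the assumption that the argument is the usual probabilistic one, a plausible reading of Equations (1)-(4) is sketched below; the exact notation in the original filing may differ.

```latex
% Hedged reconstruction only -- the original equations appear as images in the filing.
\begin{align}
  x &\sim p\bigl(x \mid c,\; s_i\bigr)
      && \text{(1) input frames generated from content and input voice}\\
  p\bigl(c \mid x,\, s_0\bigr)
    &= \frac{p\bigl(x \mid c,\, s_0\bigr)\,p(c)}{p\bigl(x \mid s_0\bigr)}
       \;\propto\; p\bigl(c \mid x\bigr)
      && \text{(2) Bayes' rule with the constant synthetic voice } s_0\\
  y &\sim p\bigl(y \mid c,\; \hat{s}_t\bigr)
      && \text{(3) target frames from content and target voice}\\
  p\bigl(y \mid x,\, \hat{s}_t\bigr)
    &= \sum_{c} p\bigl(y \mid c,\, \hat{s}_t\bigr)\, p\bigl(c \mid x\bigr)
      && \text{(4) combine (2) and (3), assuming } c \perp \hat{s}_t
\end{align}
```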
In one embodiment, as shown in FIG. 5, step S120 specifically includes:
S410. Perform a short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum.
S420. Filter the amplitude spectrum to obtain a Mel spectrum.
S430. Perform cepstral analysis on the Mel spectrum to obtain the Mel cepstrum to be converted as the speech feature to be converted.
In this embodiment, when the speech to be converted is preprocessed to obtain the speech feature to be converted, a short-time Fourier transform is first performed on the speech to be converted, yielding an amplitude spectrum and a phase spectrum. This converts the waveform of the speech to be converted from the time domain to the frequency domain and facilitates the extraction of speech features; only the amplitude spectrum is taken and filtered to obtain the Mel spectrum. The filtering may be performed with a Filter Bank. Reflecting the principle that human hearing resolves low-frequency sounds more finely than high-frequency sounds, the filters are denser with larger threshold values at low frequencies and sparser with smaller threshold values at high frequencies, so the filtering result better matches human auditory characteristics. Finally, to obtain features closer to the human vocal mechanism and to the nonlinear human auditory system, cepstral analysis is performed on the Mel spectrum to obtain the Mel cepstrum (MFC, Mel-Frequency Cepstrum), which is used as the speech feature to be converted. It should be noted that the target speech needs to undergo the same processing as the speech to be converted, which is not repeated in this embodiment of the present application.
By converting the speech to be converted into a Mel cepstrum, this embodiment of the present application not only obtains features closer to the human vocal mechanism and the nonlinear auditory system, but also facilitates the training and the input and output of the neural network model.
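The three-step feature extraction described above (STFT → Mel filtering → cepstral analysis) can be sketched as follows with librosa and SciPy. The FFT size, hop length and number of Mel filters are illustrative assumptions, not values given in the patent.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def extract_mel_cepstrum(waveform, sr=16000, n_fft=1024, hop=256, n_mels=80):
    """S410-S430: STFT amplitude spectrum -> Mel filter bank -> cepstral analysis."""
    spec = np.abs(librosa.stft(waveform, n_fft=n_fft, hop_length=hop))   # amplitude spectrum only
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)      # denser filters at low frequencies
    mel_spec = mel_fb @ spec                                             # Mel spectrum
    # Cepstral analysis: log followed by a discrete cosine transform.
    return dct(np.log(mel_spec + 1e-6), axis=0, norm="ortho").T          # (frames, coefficients)
```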
In one embodiment, as shown in FIG. 6, step S410 specifically includes:
S510. Remove the leading and trailing blank portions of the speech to be converted to obtain a first corrected speech to be converted.
S520. Perform pre-emphasis, framing and windowing on the first corrected speech to be converted to obtain a second corrected speech to be converted.
S530. Perform a short-time Fourier transform on the second corrected speech to be converted to obtain an amplitude spectrum.
In this embodiment, because blank portions exist at the beginning and end of the speech to be converted, the leading and trailing blank portions are removed before the short-time Fourier transform is performed, yielding a first corrected speech to be converted; this helps the voice conversion model align, learn and convert better. In addition, to better suit the short-time Fourier transform, after the first corrected speech to be converted is obtained, pre-emphasis, framing and windowing are performed on it to obtain a second corrected speech to be converted. Pre-emphasis boosts the high-frequency content of the speech to be converted and filters out part of the noise, while framing and windowing make the speech to be converted smoother and more continuous. Finally, a short-time Fourier transform is performed on the second corrected speech to be converted to obtain the amplitude spectrum. Steps S510 and S520 in this embodiment of the present application can be selectively executed according to user requirements.
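The optional conditioning steps S510 and S520 can be sketched as below. The trim threshold, pre-emphasis coefficient, frame length and hop length are illustrative assumptions.

```python
import numpy as np
import librosa

def condition_waveform(waveform, frame_len=1024, hop=256, coef=0.97):
    """S510: trim leading/trailing silence. S520: pre-emphasis, framing, windowing."""
    trimmed, _ = librosa.effects.trim(waveform, top_db=30)                  # first corrected speech
    emphasized = np.append(trimmed[0], trimmed[1:] - coef * trimmed[:-1])   # boost high frequencies
    frames = librosa.util.frame(emphasized, frame_length=frame_len, hop_length=hop).T
    return frames * np.hanning(frame_len)                                   # second corrected speech, windowed
```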
As shown in FIG. 7, in one embodiment a method for training a voice conversion model is provided. The method can be applied to a terminal or to a server; this embodiment is described by taking application to a terminal as an example. The training of the voice conversion model specifically includes the following steps:
S610. Acquire a training speech, and a first training example speech and a second training example speech of a training user.
S620. Preprocess the training speech to obtain a training speech feature, preprocess the first training example speech to obtain a first training example speech feature, and preprocess the second training example speech to obtain a second training example speech feature.
S630. Input the training Mel cepstrum into the first encoder to obtain a first vector.
S640. Input a partial second training example Mel cepstrum into the second encoder to obtain a second vector, where the partial second training example Mel cepstrum is obtained by random excerpting from the second training example Mel cepstrum.
S650. Splice the first vector and the second vector to obtain a third vector.
S660. Input the third vector into the length regulator to obtain a fourth vector.
S670. Input the fourth vector into the decoder to obtain a training predicted Mel cepstrum.
S680. Calculate the training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum.
S690. Perform backpropagation according to the training loss to update the training weights of the voice conversion model until the voice conversion model converges.
In this embodiment, when the voice conversion model is trained, the training speech and the training user's training example speeches are first acquired. The training example speeches include a first training example speech and a second training example speech, where the speech content of the first training example speech is the same as the speech content of the training speech, and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech. The first training example speech is the predicted speech that is ultimately needed, and the second training example speech serves as the speech feature input to the model. The training speech is then preprocessed to obtain the training speech feature, the first training example speech is preprocessed to obtain the first training example speech feature, and the second training example speech is preprocessed to obtain the second training example speech feature, where the training speech feature is the training Mel cepstrum, the first training example speech feature is the first training example Mel cepstrum, and the second training example speech feature is the second training example Mel cepstrum. The subsequent operations are the same as those of steps S210-S250 of the embodiments of the present application and are not repeated here. After the training predicted Mel cepstrum is obtained, the training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum, i.e. the loss between the predicted value and the actual value, is calculated, and backpropagation is finally performed according to the training loss to update the training weights of the voice conversion model until the voice conversion model converges.
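A condensed sketch of one pass of the training loop in steps S630-S690 follows. The loss function (L1 on the Mel cepstrum), the optimizer and the model signature are assumptions; the patent only states that a training loss between the predicted Mel cepstrum and the first training example Mel cepstrum is computed and backpropagated until convergence.

```python
import torch.nn.functional as F

def train_step(model, optimizer, train_mel, partial_second_example_mel, first_example_mel):
    """One S630-S690 pass: forward through the conversion model, loss, backpropagation."""
    optimizer.zero_grad()
    predicted_mel = model(train_mel, partial_second_example_mel)   # S630-S670
    loss = F.l1_loss(predicted_mel, first_example_mel)             # S680 (L1 is an assumption)
    loss.backward()                                                # S690: backpropagation
    optimizer.step()                                               # update training weights
    return loss.item()
```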
Although two kinds of training example speech need to be acquired, no additional data collection is required when the training set is large enough. For example, if the training speech includes "YES", then a first training example speech with the same speech content, i.e. "YES" uttered by the training user, needs to be acquired, and in addition a second training example speech whose speech content uses a different language, i.e. speech uttered by the training user in another language, such as "好" ("good"), also needs to be acquired. When the training set is large enough, the "好" uttered by the training user already serves as the first training example speech for the case where the training speech includes "好", so no extra second training example speech needs to be acquired.
Preferably, the languages used by the speech content of the training speech include the language used by the speech content of the speech to be converted in actual use, i.e. speech in the language of the speech to be converted participates in the training of the voice conversion model, and the training users include the target user, i.e. the target user participates in the training of the voice conversion model as a training user; in this way, cross-language conversion can be achieved more accurately. In addition, because the first encoder does not depend on the output of the previous frame, the training speed of the voice conversion model is greatly accelerated.
FIG. 8 shows an internal structure diagram of a computer device in one embodiment. The computer device may specifically be a terminal or a server. As shown in FIG. 8, the computer device includes a processor, a memory and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the cross-language voice conversion method. A computer program may also be stored in the internal memory, and when executed by the processor, it causes the processor to execute the cross-language voice conversion method. Those skilled in the art can understand that the structure shown in FIG. 8 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor, where the memory stores a computer program which, when executed by the processor, causes the processor to perform the following steps:
acquiring a speech to be converted and an example speech of a target user, where the language used by the speech content of the speech to be converted is different from the language used by the speech content of the example speech; preprocessing the speech to be converted to obtain a speech feature to be converted, and preprocessing the example speech to obtain an example speech feature; taking the speech feature to be converted and the example speech feature as input and using a pre-trained voice conversion model to obtain a target speech feature; and converting the target speech feature into a target speech simulating the example speech, where the speech content of the target speech is the same as the speech content of the speech to be converted.
In one embodiment, the speech feature to be converted is a Mel cepstrum to be converted, the example speech feature is an example Mel cepstrum, and the voice conversion model includes a first encoder, a second encoder, a length regulator and a decoder, where the first encoder includes an FFT Block. Taking the speech feature to be converted and the example speech feature as input and using the pre-trained voice conversion model to obtain the target speech feature includes: inputting the Mel cepstrum into the first encoder to obtain a first vector; inputting a partial example Mel cepstrum into the second encoder to obtain a second vector, the partial example Mel cepstrum being obtained by random excerpting from the example Mel cepstrum; splicing the first vector and the second vector to obtain a third vector; inputting the third vector into the length regulator to obtain a fourth vector; and inputting the fourth vector into the decoder to obtain a predicted Mel cepstrum as the target speech feature.
In one embodiment, the first encoder is configured to compress the Mel cepstrum to obtain the first vector, and the length regulator is configured to obtain, from the third vector, a predicted expansion length of each frame in the third vector and to expand the third vector into the fourth vector according to the predicted expansion length.
In one embodiment, the training of the voice conversion model includes: acquiring a training speech and a first training example speech and a second training example speech of a training user, where the speech content of the first training example speech is the same as the speech content of the training speech, and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech; preprocessing the training speech to obtain a training speech feature, preprocessing the first training example speech to obtain a first training example speech feature, and preprocessing the second training example speech to obtain a second training example speech feature, where the training speech feature is a training Mel cepstrum, the first training example speech feature is a first training example Mel cepstrum, and the second training example speech feature is a second training example Mel cepstrum; inputting the training Mel cepstrum into the first encoder to obtain a first vector; inputting a partial second training example Mel cepstrum into the second encoder to obtain a second vector, the partial second training example Mel cepstrum being obtained by random excerpting from the second training example Mel cepstrum; splicing the first vector and the second vector to obtain a third vector; inputting the third vector into the length regulator to obtain a fourth vector; inputting the fourth vector into the decoder to obtain a training predicted Mel cepstrum; calculating a training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum; and performing backpropagation according to the training loss to update the training weights of the voice conversion model until the voice conversion model converges.
In one embodiment, acquiring the speech to be converted includes: acquiring a text to be converted; and converting the text to be converted into synthesized speech as the speech to be converted.
In one embodiment, preprocessing the speech to be converted to obtain the speech feature to be converted includes: performing a short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum; filtering the amplitude spectrum to obtain a Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain the Mel cepstrum to be converted as the speech feature to be converted.
In one embodiment, performing the short-time Fourier transform on the speech to be converted to obtain the amplitude spectrum includes: removing the leading and trailing blank portions of the speech to be converted to obtain a first corrected speech to be converted; performing pre-emphasis, framing and windowing on the first corrected speech to be converted to obtain a second corrected speech to be converted; and performing a short-time Fourier transform on the second corrected speech to be converted to obtain the amplitude spectrum.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
acquiring a speech to be converted and an example speech of a target user, where the language used by the speech content of the speech to be converted is different from the language used by the speech content of the example speech; preprocessing the speech to be converted to obtain a speech feature to be converted, and preprocessing the example speech to obtain an example speech feature; taking the speech feature to be converted and the example speech feature as input and using a pre-trained voice conversion model to obtain a target speech feature; and converting the target speech feature into a target speech simulating the example speech, where the speech content of the target speech is the same as the speech content of the speech to be converted.
In one embodiment, the speech feature to be converted is a Mel cepstrum to be converted, the example speech feature is an example Mel cepstrum, and the voice conversion model includes a first encoder, a second encoder, a length regulator and a decoder, where the first encoder includes an FFT Block. Taking the speech feature to be converted and the example speech feature as input and using the pre-trained voice conversion model to obtain the target speech feature includes: inputting the Mel cepstrum into the first encoder to obtain a first vector; inputting a partial example Mel cepstrum into the second encoder to obtain a second vector, the partial example Mel cepstrum being obtained by random excerpting from the example Mel cepstrum; splicing the first vector and the second vector to obtain a third vector; inputting the third vector into the length regulator to obtain a fourth vector; and inputting the fourth vector into the decoder to obtain a predicted Mel cepstrum as the target speech feature.
In one embodiment, the first encoder is configured to compress the Mel cepstrum to obtain the first vector, and the length regulator is configured to obtain, from the third vector, a predicted expansion length of each frame in the third vector and to expand the third vector into the fourth vector according to the predicted expansion length.
In one embodiment, the training of the voice conversion model includes: acquiring a training speech and a first training example speech and a second training example speech of a training user, where the speech content of the first training example speech is the same as the speech content of the training speech, and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech; preprocessing the training speech to obtain a training speech feature, preprocessing the first training example speech to obtain a first training example speech feature, and preprocessing the second training example speech to obtain a second training example speech feature, where the training speech feature is a training Mel cepstrum, the first training example speech feature is a first training example Mel cepstrum, and the second training example speech feature is a second training example Mel cepstrum; inputting the training Mel cepstrum into the first encoder to obtain a first vector; inputting a partial second training example Mel cepstrum into the second encoder to obtain a second vector, the partial second training example Mel cepstrum being obtained by random excerpting from the second training example Mel cepstrum; splicing the first vector and the second vector to obtain a third vector; inputting the third vector into the length regulator to obtain a fourth vector; inputting the fourth vector into the decoder to obtain a training predicted Mel cepstrum; calculating a training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum; and performing backpropagation according to the training loss to update the training weights of the voice conversion model until the voice conversion model converges.
In one embodiment, acquiring the speech to be converted includes: acquiring a text to be converted; and converting the text to be converted into synthesized speech as the speech to be converted.
In one embodiment, preprocessing the speech to be converted to obtain the speech feature to be converted includes: performing a short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum; filtering the amplitude spectrum to obtain a Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain the Mel cepstrum to be converted as the speech feature to be converted.
In one embodiment, performing the short-time Fourier transform on the speech to be converted to obtain the amplitude spectrum includes: removing the leading and trailing blank portions of the speech to be converted to obtain a first corrected speech to be converted; performing pre-emphasis, framing and windowing on the first corrected speech to be converted to obtain a second corrected speech to be converted; and performing a short-time Fourier transform on the second corrected speech to be converted to obtain the amplitude spectrum.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program, and the program can be stored in a non-volatile computer-readable storage medium; when executed, the program may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope described in this specification.
The above embodiments only express several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent of the present application. It should be pointed out that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the patent of the present application shall be subject to the appended claims.

Claims (19)

  1. A cross-language voice conversion method, characterized in that the method comprises:
    acquiring a speech to be converted and an example speech of a target user, wherein the language used by the speech content of the speech to be converted is different from the language used by the speech content of the example speech;
    preprocessing the speech to be converted to obtain a speech feature to be converted, and preprocessing the example speech to obtain an example speech feature;
    taking the speech feature to be converted and the example speech feature as input, and using a pre-trained voice conversion model to obtain a target speech feature;
    converting the target speech feature into a target speech simulating the example speech, wherein the speech content of the target speech is the same as the speech content of the speech to be converted.
  2. The method according to claim 1, characterized in that the speech feature to be converted is a Mel cepstrum to be converted, the example speech feature is an example Mel cepstrum, the voice conversion model comprises a first encoder, a second encoder, a length regulator and a decoder, the first encoder comprises an FFT Block, and taking the speech feature to be converted and the example speech feature as input and using the pre-trained voice conversion model to obtain the target speech feature comprises:
    inputting the Mel cepstrum into the first encoder to obtain a first vector;
    inputting a partial example Mel cepstrum into the second encoder to obtain a second vector, the partial example Mel cepstrum being obtained by random excerpting from the example Mel cepstrum;
    splicing the first vector and the second vector to obtain a third vector;
    inputting the third vector into the length regulator to obtain a fourth vector;
    inputting the fourth vector into the decoder to obtain a predicted Mel cepstrum as the target speech feature.
  3. The method according to claim 2, characterized in that the first encoder is configured to compress the Mel cepstrum to obtain the first vector, and the length regulator is configured to obtain, from the third vector, a predicted expansion length of each frame in the third vector, and to expand the third vector into the fourth vector according to the predicted expansion length.
  4. The method according to claim 1, characterized in that the training of the voice conversion model comprises:
    acquiring a training speech and a first training example speech and a second training example speech of a training user, wherein the speech content of the first training example speech is the same as the speech content of the training speech, and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech;
    preprocessing the training speech to obtain a training speech feature, preprocessing the first training example speech to obtain a first training example speech feature, and preprocessing the second training example speech to obtain a second training example speech feature, wherein the training speech feature is a training Mel cepstrum, the first training example speech feature is a first training example Mel cepstrum, and the second training example speech feature is a second training example Mel cepstrum;
    inputting the training Mel cepstrum into the first encoder to obtain a first vector;
    inputting a partial second training example Mel cepstrum into the second encoder to obtain a second vector, the partial second training example Mel cepstrum being obtained by random excerpting from the second training example Mel cepstrum;
    splicing the first vector and the second vector to obtain a third vector;
    inputting the third vector into the length regulator to obtain a fourth vector;
    inputting the fourth vector into the decoder to obtain a training predicted Mel cepstrum;
    calculating a training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum;
    performing backpropagation according to the training loss to update the training weights of the voice conversion model until the voice conversion model converges.
  5. The method according to claim 1, characterized in that acquiring the speech to be converted comprises:
    acquiring a text to be converted;
    converting the text to be converted into synthesized speech as the speech to be converted.
  6. The method according to claim 2, characterized in that preprocessing the speech to be converted to obtain the speech feature to be converted comprises:
    performing a short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum;
    filtering the amplitude spectrum to obtain a Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel cepstrum to be converted as the speech feature to be converted.
  7. The method according to claim 6, characterized in that performing the short-time Fourier transform on the speech to be converted to obtain the amplitude spectrum comprises:
    removing the leading and trailing blank portions of the speech to be converted to obtain a first corrected speech to be converted;
    performing pre-emphasis, framing and windowing on the first corrected speech to be converted to obtain a second corrected speech to be converted;
    performing a short-time Fourier transform on the second corrected speech to be converted to obtain the amplitude spectrum.
  8. A computer device, comprising a memory and a processor, wherein the memory stores a computer program, and the computer program, when executed by the processor, performs the following steps:
    acquiring a speech to be converted and an example speech of a target user, wherein the language used by the speech content of the speech to be converted is different from the language used by the speech content of the example speech;
    preprocessing the speech to be converted to obtain a speech feature to be converted, and preprocessing the example speech to obtain an example speech feature;
    taking the speech feature to be converted and the example speech feature as input, and using a pre-trained voice conversion model to obtain a target speech feature;
    converting the target speech feature into a target speech simulating the example speech, wherein the speech content of the target speech is the same as the speech content of the speech to be converted.
  9. The device according to claim 8, characterized in that the speech feature to be converted is a Mel cepstrum to be converted, the example speech feature is an example Mel cepstrum, the voice conversion model comprises a first encoder, a second encoder, a length regulator and a decoder, the first encoder comprises an FFT Block, and taking the speech feature to be converted and the example speech feature as input and using the pre-trained voice conversion model to obtain the target speech feature comprises:
    inputting the Mel cepstrum into the first encoder to obtain a first vector;
    inputting a partial example Mel cepstrum into the second encoder to obtain a second vector, the partial example Mel cepstrum being obtained by random excerpting from the example Mel cepstrum;
    splicing the first vector and the second vector to obtain a third vector;
    inputting the third vector into the length regulator to obtain a fourth vector;
    inputting the fourth vector into the decoder to obtain a predicted Mel cepstrum as the target speech feature.
  10. The device according to claim 9, characterized in that the first encoder is configured to compress the Mel cepstrum to obtain the first vector, and the length regulator is configured to obtain, from the third vector, a predicted expansion length of each frame in the third vector, and to expand the third vector into the fourth vector according to the predicted expansion length.
  11. The device according to claim 8, characterized in that the training of the voice conversion model comprises:
    acquiring a training speech and a first training example speech and a second training example speech of a training user, wherein the speech content of the first training example speech is the same as the speech content of the training speech, and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech;
    preprocessing the training speech to obtain a training speech feature, preprocessing the first training example speech to obtain a first training example speech feature, and preprocessing the second training example speech to obtain a second training example speech feature, wherein the training speech feature is a training Mel cepstrum, the first training example speech feature is a first training example Mel cepstrum, and the second training example speech feature is a second training example Mel cepstrum;
    inputting the training Mel cepstrum into the first encoder to obtain a first vector;
    inputting a partial second training example Mel cepstrum into the second encoder to obtain a second vector, the partial second training example Mel cepstrum being obtained by random excerpting from the second training example Mel cepstrum;
    splicing the first vector and the second vector to obtain a third vector;
    inputting the third vector into the length regulator to obtain a fourth vector;
    inputting the fourth vector into the decoder to obtain a training predicted Mel cepstrum;
    calculating a training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum;
    performing backpropagation according to the training loss to update the training weights of the voice conversion model until the voice conversion model converges.
  12. The device according to claim 8, characterized in that acquiring the speech to be converted comprises:
    acquiring a text to be converted;
    converting the text to be converted into synthesized speech as the speech to be converted.
  13. The device according to claim 9, characterized in that preprocessing the speech to be converted to obtain the speech feature to be converted comprises:
    performing a short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum;
    filtering the amplitude spectrum to obtain a Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel cepstrum to be converted as the speech feature to be converted.
  14. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, performs the following steps:
    acquiring a speech to be converted and an example speech of a target user, wherein the language used by the speech content of the speech to be converted is different from the language used by the speech content of the example speech;
    preprocessing the speech to be converted to obtain a speech feature to be converted, and preprocessing the example speech to obtain an example speech feature;
    taking the speech feature to be converted and the example speech feature as input, and using a pre-trained voice conversion model to obtain a target speech feature;
    converting the target speech feature into a target speech simulating the example speech, wherein the speech content of the target speech is the same as the speech content of the speech to be converted.
  15. The storage medium according to claim 14, characterized in that the speech feature to be converted is a Mel cepstrum to be converted, the example speech feature is an example Mel cepstrum, the voice conversion model comprises a first encoder, a second encoder, a length regulator and a decoder, the first encoder comprises an FFT Block, and taking the speech feature to be converted and the example speech feature as input and using the pre-trained voice conversion model to obtain the target speech feature comprises:
    inputting the Mel cepstrum into the first encoder to obtain a first vector;
    inputting a partial example Mel cepstrum into the second encoder to obtain a second vector, the partial example Mel cepstrum being obtained by random excerpting from the example Mel cepstrum;
    splicing the first vector and the second vector to obtain a third vector;
    inputting the third vector into the length regulator to obtain a fourth vector;
    inputting the fourth vector into the decoder to obtain a predicted Mel cepstrum as the target speech feature.
  16. The storage medium according to claim 15, characterized in that the first encoder is configured to compress the Mel cepstrum to obtain the first vector, and the length regulator is configured to obtain, from the third vector, a predicted expansion length of each frame in the third vector, and to expand the third vector into the fourth vector according to the predicted expansion length.
  17. The storage medium according to claim 14, characterized in that the training of the voice conversion model comprises:
    acquiring a training speech and a first training example speech and a second training example speech of a training user, wherein the speech content of the first training example speech is the same as the speech content of the training speech, and the language used by the speech content of the training speech is different from the language used by the speech content of the second training example speech;
    preprocessing the training speech to obtain a training speech feature, preprocessing the first training example speech to obtain a first training example speech feature, and preprocessing the second training example speech to obtain a second training example speech feature, wherein the training speech feature is a training Mel cepstrum, the first training example speech feature is a first training example Mel cepstrum, and the second training example speech feature is a second training example Mel cepstrum;
    inputting the training Mel cepstrum into the first encoder to obtain a first vector;
    inputting a partial second training example Mel cepstrum into the second encoder to obtain a second vector, the partial second training example Mel cepstrum being obtained by random excerpting from the second training example Mel cepstrum;
    splicing the first vector and the second vector to obtain a third vector;
    inputting the third vector into the length regulator to obtain a fourth vector;
    inputting the fourth vector into the decoder to obtain a training predicted Mel cepstrum;
    calculating a training loss between the training predicted Mel cepstrum and the first training example Mel cepstrum;
    performing backpropagation according to the training loss to update the training weights of the voice conversion model until the voice conversion model converges.
  18. The storage medium according to claim 14, characterized in that acquiring the speech to be converted comprises:
    acquiring a text to be converted;
    converting the text to be converted into synthesized speech as the speech to be converted.
  19. The storage medium according to claim 15, characterized in that preprocessing the speech to be converted to obtain the speech feature to be converted comprises:
    performing a short-time Fourier transform on the speech to be converted to obtain an amplitude spectrum;
    filtering the amplitude spectrum to obtain a Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel cepstrum to be converted as the speech feature to be converted.
PCT/CN2020/140344 2020-12-28 2020-12-28 Cross-language voice conversion method, computer device, and storage medium WO2022140966A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/140344 WO2022140966A1 (en) 2020-12-28 2020-12-28 Cross-language voice conversion method, computer device, and storage medium


Publications (1)

Publication Number Publication Date
WO2022140966A1 true WO2022140966A1 (en) 2022-07-07

Family

ID=82259001

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/140344 WO2022140966A1 (en) 2020-12-28 2020-12-28 Cross-language voice conversion method, computer device, and storage medium

Country Status (1)

Country Link
WO (1) WO2022140966A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018084604A (en) * 2016-11-21 2018-05-31 日本電信電話株式会社 Cross lingual voice synthesis model learning device, cross lingual voice synthesis device, cross lingual voice synthesis model learning method, and program
CN110808034A (en) * 2019-10-31 2020-02-18 北京大米科技有限公司 Voice conversion method, device, storage medium and electronic equipment
CN110866410A (en) * 2019-11-15 2020-03-06 深圳市赛为智能股份有限公司 Multi-language conversion method, device, computer equipment and storage medium
CN111247584A (en) * 2019-12-24 2020-06-05 深圳市优必选科技股份有限公司 Voice conversion method, system, device and storage medium
CN111785261A (en) * 2020-05-18 2020-10-16 南京邮电大学 Cross-language voice conversion method and system based on disentanglement and explanatory representation


Similar Documents

Publication Publication Date Title
JP7280386B2 (en) Multilingual speech synthesis and cross-language voice cloning
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
WO2020118521A1 (en) Multi-speaker neural text-to-speech synthesis
CN109767778B (en) Bi-L STM and WaveNet fused voice conversion method
US11355097B2 (en) Sample-efficient adaptive text-to-speech
US11335324B2 (en) Synthesized data augmentation using voice conversion and speech recognition models
Xu et al. Voice conversion based on Gaussian processes by coherent and asymmetric training with limited training data
CN112735454A (en) Audio processing method and device, electronic equipment and readable storage medium
JP2024505076A (en) Generate diverse, natural-looking text-to-speech samples
Wu et al. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations
JP7393585B2 (en) WaveNet self-training for text-to-speech
CN113963679A (en) Voice style migration method and device, electronic equipment and storage medium
Mariya Celin et al. Data augmentation techniques for transfer learning-based continuous dysarthric speech recognition
Viacheslav et al. System of methods of automated cognitive linguistic analysis of speech signals with noise
JP7423056B2 (en) Reasoners and how to learn them
WO2023279976A1 (en) Speech synthesis method, apparatus, device, and storage medium
CN112712789B (en) Cross-language audio conversion method, device, computer equipment and storage medium
Kumar et al. Towards building text-to-speech systems for the next billion users
WO2022140966A1 (en) Cross-language voice conversion method, computer device, and storage medium
Wang et al. Learning explicit prosody models and deep speaker embeddings for atypical voice conversion
WO2022133630A1 (en) Cross-language audio conversion method, computer device and storage medium
Kiran Reddy et al. DNN-based cross-lingual voice conversion using Bottleneck Features
Jamal et al. Exploring Transfer Learning for Urdu Speech Synthesis
Kotani et al. Voice Conversion Based on Deep Neural Networks for Time-Variant Linear Transformations
Ali et al. Arabic voice system to help illiterate or blind for using computer

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20967317

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20967317

Country of ref document: EP

Kind code of ref document: A1