WO2022133630A1 - Cross-language audio conversion method, computer device, and storage medium - Google Patents

Cross-language audio conversion method, computer device, and storage medium Download PDF

Info

Publication number
WO2022133630A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
audio
target
synthetic
text
Prior art date
Application number
PCT/CN2020/137868
Other languages
English (en)
French (fr)
Inventor
赵之源
黄东延
Original Assignee
深圳市优必选科技股份有限公司
Priority date: 2020-12-21
Filing date: 2020-12-21
Publication date
Application filed by 深圳市优必选科技股份有限公司 filed Critical 深圳市优必选科技股份有限公司
Priority to PCT/CN2020/137868 priority Critical patent/WO2022133630A1/zh
Publication of WO2022133630A1 publication Critical patent/WO2022133630A1/zh

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • the present application relates to the field of computer technology, and in particular, to a cross-language audio conversion method, computer device and storage medium.
  • Machine learning and deep learning rely on massive data and the powerful processing power of computers, and have made major breakthroughs in the fields of image, speech, and text. Since the same type of framework can achieve good results in different fields, neural network algorithm models that have been used to solve text and image problems are all applied to the field of speech.
  • the existing neural network algorithm models applied in the field of speech can capture the characteristics of the target speaker's voice, so as to stably synthesize other utterances of the target speaker, and are close to the level of real people in terms of timbre similarity and language naturalness; however, the synthesized speech can only be in the same language as the target speaker's
  • the target speaker's voice cannot be synthesized into the target speaker's speech in other languages; if the target speaker can only speak Chinese, only Chinese speech can be synthesized, and speech in other languages cannot
  • an embodiment of the present application provides a method for cross-language audio conversion, the method comprising: obtaining text to be converted and a target voice of a target user, the text to be converted including at least one language; converting the text to be converted into synthetic audio; preprocessing the synthetic audio to obtain synthetic audio features; using the synthetic audio features and the target voice as input and obtaining target audio features with a pre-trained audio conversion model; and converting the target audio features into target text speech that simulates the target voice
  • an embodiment of the present application provides a computer device, including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above method
  • an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above method
  • the text to be converted, including at least one language, is obtained and converted into synthetic audio to serve as the source audio of the target text speech; the target voice of the target user is obtained as the speaker feature of the target text speech; and both are fed into a pre-trained audio conversion model to obtain target text speech that simulates the target voice uttered by the user, which solves the problem that the target speaker's voice cannot be synthesized into the target speaker's speech in other languages and achieves the beneficial effect of cross-language synthesis of the target user's voice
  • FIG. 1 is an application environment diagram of a method for cross-language audio conversion in an embodiment of the present application
  • FIG. 2 is a flowchart of a method for cross-language audio conversion in an embodiment of the present application
  • FIG. 3 is a flowchart of step S130 in the method for cross-language audio conversion in an embodiment of the present application
  • FIG. 4 is a flowchart of step S210 in the method for cross-language audio conversion in an embodiment of the present application
  • FIG. 5 is a flowchart of step S210 in the method for cross-language audio conversion in an embodiment of the present application
  • FIG. 6 is a flowchart of an audio conversion model training method in an embodiment of the present application
  • FIG. 7 is a flowchart of step S550 in the audio conversion model training method in an embodiment of the present application
  • FIG. 8 is a structural block diagram of a cross-language audio conversion apparatus in an embodiment of the present application
  • FIG. 9 is a structural block diagram of a computer device in an embodiment of the present application
  • FIG. 1 is an application environment diagram of a method for cross-language audio conversion in one embodiment.
  • the cross-language audio conversion method is applied to a cross-language audio conversion system.
  • the cross-language audio conversion system includes a terminal 110 and a server 120.
  • the terminal 110 and the server 120 are connected through a network, and the terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like.
  • the server 120 can be implemented by an independent server or a server cluster composed of multiple servers.
  • the terminal 110 is configured to acquire the text to be converted and the target voice of the target user and upload them to the server 120, where the text to be converted includes at least one language; the server 120 is configured to receive the text to be converted and the target voice of the target user, convert the text to be converted into synthetic audio, preprocess the synthetic audio to obtain synthetic audio features, use the synthetic audio features and the target voice as input to obtain target audio features with a pre-trained audio conversion model, and convert the target audio features into target text speech that simulates the target voice
  • the above-mentioned cross-language audio conversion method can also be applied directly on the terminal 110, which is configured to acquire the text to be converted and the target voice of the target user, the text to be converted including at least one language, convert the text to be converted into synthetic audio, preprocess the synthetic audio to obtain synthetic audio features, use the synthetic audio features and the target voice as input to obtain target audio features with a pre-trained audio conversion model, and convert the target audio features into target text speech that simulates the target voice
  • a method for cross-language audio conversion is provided.
  • the method can be applied to both a terminal and a server, and this embodiment is described by taking the application to a terminal as an example.
  • the cross-language audio conversion method specifically includes the following steps:
  • when executing the method for cross-language audio conversion, the user can execute it on a mobile device, such as a mobile phone
  • the user needs to input the text to be converted and the target voice of the target user, where the text to be converted is the content of the speech that the user ultimately wishes to obtain, and the target voice of the target user provides the voice characteristics of the speech the user ultimately wishes to obtain
  • the text to be converted includes at least one language, that is, the text to be converted may be Chinese, English, English plus Chinese, and so on; for example, if the user wants to obtain target text speech in which A, who can only speak Chinese, says "Yes", the user only needs to input the text to be converted, "Yes", and a target voice of A, which can be any segment of Chinese speech spoken by A
  • the text to be converted needs to be processed first and converted into synthesized audio
  • the text to be converted is converted into synthetic audio, and the synthetic audio is then preprocessed to obtain synthetic audio features, where the synthetic audio features are a synthetic Mel cepstrum; the obtained synthetic audio needs to be converted into a synthetic Mel cepstrum to facilitate input into the neural network model
  • after the synthesized audio features, i.e. the synthetic Mel cepstrum, are obtained, the target speech also needs to be converted into a Mel cepstrum and then input, together with the synthetic Mel cepstrum, into the pre-trained audio conversion model
  • the audio conversion model outputs the target audio feature, where the target audio feature is the target Mel cepstrum; the audio conversion model is a neural network model that has been pre-trained with a large amount of training users' speech and training text
  • the final target Mel cepstrum also needs to be converted into the target text speech through another preset neural network model; the target text speech is speech of the content of the text to be converted, uttered by simulating the sound characteristics of the target speech; the other preset neural network model can be a WaveNet neural network model, a WaveRNN neural network model, or the like
  • the text to be converted, including at least one language, is obtained and converted into synthetic audio to serve as the source audio of the target text speech; the target voice of the target user is obtained as the speaker feature of the target text speech; and both are fed into a pre-trained audio conversion model to obtain target text speech that simulates the target voice uttered by the user, which solves the problem that the target speaker's voice cannot be synthesized into the target speaker's speech in other languages and achieves the beneficial effect of cross-language synthesis of the target user's voice
  • step S130 specifically includes:
  • when the synthesized audio is preprocessed to obtain the synthesized audio features, the synthesized audio first needs to undergo a short-time Fourier transform, which yields an amplitude spectrum and a phase spectrum and converts the waveform of the synthesized audio from the time domain to the frequency domain, facilitating the extraction of speech features
  • the Mel spectrum can be obtained by filtering only the amplitude spectrum
  • the filter used for filtering can be a filter bank designed around the frequency sensitivity of human hearing: the filters are denser at low frequencies with larger threshold values and sparser at high frequencies with smaller threshold values, so the filtering result better matches the human voice
  • cepstral analysis is then performed on the Mel spectrum to obtain the Mel-frequency cepstrum (MFC), which is taken as the synthesized audio feature; the target speech needs to be processed in the same way as the synthesized audio, which is not repeated here in this embodiment of the present application
  • by converting the synthesized audio into a Mel cepstrum, the embodiment of the present application not only obtains features closer to the human vocal mechanism and the nonlinear auditory system, but also facilitates the training, input, and output of the neural network model
  • step S210 specifically includes:
  • the modified synthetic audio is obtained by subtracting the leading and trailing blanks in the synthetic audio, and then short-time Fourier transform is performed on the modified synthetic audio to obtain the amplitude spectrum.
  • step S210 may further include:
  • steps S410 and S420 in this embodiment of the present application may be executed jointly after step S310.
  • an audio conversion model training method is provided.
  • the method can be applied to both a terminal and a server, and this embodiment is described by taking the application to a terminal as an example.
  • the audio conversion model training specifically includes the following steps:
  • when training the audio conversion model, it is first necessary to obtain the training text and the training voice of the training user, where the training text and the training voice correspond one to one and the training text includes at least one language
  • if cross-language conversion is desired in actual use, the training text includes at least two languages, and the language of the text to be converted in actual use is also included in the training text; if the training text has only one language, the target text speech obtained when the audio conversion model is used will be in the language of the target speech that is closest to the text to be converted; for example, if the training text includes only English and the corresponding training voice includes only English, and the text to be converted is Chinese while the target voice is English, the final target text speech is English speech whose pronunciation is closest to the Chinese of the text to be converted
  • if the training text includes "YES", the training speech also includes the training user uttering "YES"; if the training text includes "YES先生" ("Mr. YES"), the training speech also includes the training user uttering "YES先生"
  • the training voices of multiple training users can be obtained
  • preferably, the training users include the target user in step S110, so that the audio conversion model uses the target user in its training data set, and the accuracy is greatly improved when the model is used to obtain target text speech based on the target user; even if the training users do not include the target user in step S110, when the training data set is large enough, the audio conversion model will produce its output based on the training user whose voice characteristics are closest to the target user's, so similarity is still guaranteed
  • the training target speech features include the training target Mel cepstrum, and the specific conversion and preprocessing methods are the same as those of steps S120 and S130; the training text and the training speech correspond one to one, that is, the training synthetic audio and the training speech have the same spoken content but different speech features
  • the audio conversion model can then be trained based on the training synthetic audio features and the training target speech features, with the training synthetic audio features as input and the training target speech features as output
  • step S550 may further include:
  • S620: Input part of the training target Mel cepstrum to the second encoder to obtain a second vector, where the part of the training target Mel cepstrum is randomly intercepted from the training target Mel cepstrum
  • the audio conversion model includes a first encoder, a second encoder and a decoder; when training the audio conversion model based on the training synthetic audio feature and the training target voice feature, the training synthetic audio feature, that is, the training synthetic Mel cepstrum, is first input to the first encoder, which outputs the first vector; the vector length of the first vector takes the maximum input sequence length in the batch, and the remaining shorter sequences are padded with zeros at the end
  • part of the training target Mel cepstrum is input to the second encoder, which outputs a second vector, where the part of the training target Mel cepstrum is the training target speech feature obtained by random interception from the training target Mel cepstrum
  • a preset number of intercepted segments of the training user's Mel cepstrum are randomly selected, and these segments are spliced together as the partial training target Mel cepstrum that serves as the target speech feature
  • the target of the random interception may be the training target voice feature corresponding to the training synthetic audio feature, that is, the speech content corresponding to the training synthetic audio feature and the training target voice feature may be the same, or it may not correspond, which is not limited in the embodiments of the present application
  • the first vector and the second vector are concatenated and input to the decoder to obtain the training predicted Mel cepstrum; the training loss between the training predicted Mel cepstrum and the training target Mel cepstrum is calculated, and back-propagation is performed according to the training loss to update the training weights of the audio conversion model until the audio conversion model converges
  • the first encoder includes a 2-layer CNN model, a 5-layer Bi-LSTM model, a Linear Projection layer and a batch normalization layer
  • the second encoder includes a 3-layer LSTM model, a 1-layer Linear model, as well as pooling and normalization layers
  • the decoder includes a Pre-Net, an Attention model, an LSTM model, a Linear model, a Post-Net, a pooling layer and an output layer
  • the function of the first encoder in the audio conversion model is to remove the speech feature s_i of the input audio from the input sequence and keep only the speech content c; the input sequence can then be expressed in the form given in the description
  • a cross-language audio conversion apparatus is provided; the cross-language audio conversion apparatus provided in this embodiment can execute the cross-language audio conversion method provided by any embodiment of the present application, and has the corresponding functional modules and beneficial effects for executing the method
  • the cross-language audio conversion apparatus includes a text acquisition module 100 , a text conversion module 200 , a feature acquisition module 300 , a feature conversion module 400 , and a speech simulation module 500 .
  • the text acquisition module 100 is used to acquire the text to be converted and the target voice of the target user, the text to be converted includes at least one language; the text conversion module 200 is used to convert the text to be converted into synthetic audio; feature acquisition The module 300 is used to preprocess the synthetic audio to obtain synthetic audio features; the feature conversion module 400 is used to use the synthetic audio features and the target voice as input, and use the pre-trained audio conversion model to obtain the target audio features; voice simulation The module 500 is used for converting the target audio feature into a target text speech simulating the target speech.
  • the above-mentioned apparatus further includes a model training module 600, which is used for acquiring training text and the training user's training voice, the training text including at least one language; converting the training text into training synthetic audio; preprocessing the training synthetic audio to obtain training synthetic audio features; generating training target voice features based on the training voice; and training an audio conversion model based on the training synthetic audio features and the training target voice features
  • the training synthetic audio feature is a training synthetic Mel cepstrum, the training target speech feature is a training target Mel cepstrum, and the audio conversion model includes a first encoder, a second encoder and a decoder
  • the model training module 600 is specifically configured to input the training synthesized Mel cepstrum to the first encoder to obtain a first vector; input part of the training target Mel cepstrum to the second encoder to obtain a second vector, the part of the training target Mel cepstrum being randomly intercepted from the training target Mel cepstrum; concatenate the first vector and the second vector and input the result to the decoder to obtain a training predicted Mel cepstrum; calculate the training loss between the training predicted Mel cepstrum and the training target Mel cepstrum; and perform back-propagation according to the training loss to update the training weights of the audio conversion model until the audio conversion model converges
  • the synthetic audio feature is a synthetic Mel cepstrum
  • the target audio feature is a target Mel cepstrum
  • the feature acquisition module 300 is specifically configured to perform a short-time Fourier transform on the synthesized audio to obtain an amplitude spectrum; filter the amplitude spectrum to obtain a Mel spectrum; and perform cepstral analysis on the Mel spectrum to obtain a synthetic Mel cepstrum as the synthetic audio feature
  • the feature acquisition module 300 is further configured to subtract the blanks at the beginning and the end of the synthesized audio to obtain a modified synthesized audio; perform short-time Fourier transform on the modified synthesized audio to obtain an amplitude spectrum.
  • the feature acquisition module 300 is further configured to perform pre-emphasis, framing and windowing on the synthesized audio to obtain modified synthesized audio; and to perform short-time Fourier transform on the modified synthesized audio to obtain an amplitude spectrum.
  • Figure 9 shows an internal structure diagram of a computer device in one embodiment.
  • the computer device may be a terminal or a server.
  • the computer device includes a processor, memory, and a network interface connected by a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system, and also stores a computer program, which, when executed by the processor, can cause the processor to implement the method for cross-language audio conversion.
  • a computer program may also be stored in the internal memory, and when the computer program is executed by the processor, the processor may execute the cross-language audio conversion method.
  • FIG. 9 is only a block diagram of a partial structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components
  • a computer device comprising a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the processor performs the following steps:
  • acquiring text to be converted and a target voice of a target user, the text to be converted including at least one language; converting the text to be converted into synthetic audio; preprocessing the synthetic audio to obtain synthetic audio features;
  • using the synthetic audio features and the target voice as input, obtaining target audio features with a pre-trained audio conversion model; converting the target audio features into target text speech that simulates the target voice
  • the training of the audio conversion model includes:
  • acquiring training text and a training user's training voice, the training text including at least one language; converting the training text into training synthetic audio; preprocessing the training synthetic audio to obtain training synthetic audio features; generating training target voice features based on the training voice; training an audio conversion model based on the training synthetic audio features and the training target voice features
  • the training synthetic audio feature is a training synthetic Mel cepstrum
  • the training target speech feature is a training target Mel cepstrum
  • the audio conversion model includes a first encoder, a second encoder and a decoder, and training the audio conversion model based on the training synthetic audio features and the training target voice features includes:
  • the synthetic audio feature is a synthetic Mel cepstrum
  • the target audio feature is a target Mel cepstrum
  • the preprocessing of the synthesized audio to obtain synthesized audio features includes:
  • the performing a short-time Fourier transform on the synthesized audio to obtain an amplitude spectrum includes:
  • a modified synthetic audio is obtained by subtracting the blank parts at the beginning and the end of the synthetic audio; and an amplitude spectrum is obtained by performing a short-time Fourier transform on the modified synthetic audio.
  • the performing a short-time Fourier transform on the synthesized audio to obtain an amplitude spectrum includes:
  • the synthetic audio is pre-emphasized, framed and windowed to obtain a modified synthetic audio; and a short-time Fourier transform is performed on the modified synthetic audio to obtain an amplitude spectrum.
  • a computer-readable storage medium which stores a computer program, and when the computer program is executed by a processor, causes the processor to perform the following steps:
  • acquiring text to be converted and a target voice of a target user, the text to be converted including at least one language; converting the text to be converted into synthetic audio; preprocessing the synthetic audio to obtain synthetic audio features;
  • using the synthetic audio features and the target voice as input, obtaining target audio features with a pre-trained audio conversion model; converting the target audio features into target text speech that simulates the target voice
  • the training of the audio conversion model includes:
  • acquiring training text and a training user's training voice, the training text including at least one language; converting the training text into training synthetic audio; preprocessing the training synthetic audio to obtain training synthetic audio features; generating training target voice features based on the training voice; training an audio conversion model based on the training synthetic audio features and the training target voice features
  • the training synthetic audio feature is a training synthetic Mel cepstrum
  • the training target speech feature is a training target Mel cepstrum
  • the audio conversion model includes a first encoder, a second encoder and a decoder, and training the audio conversion model based on the training synthetic audio features and the training target voice features includes:
  • the synthetic audio feature is a synthetic Mel cepstrum
  • the target audio feature is a target Mel cepstrum
  • the preprocessing of the synthesized audio to obtain synthesized audio features includes:
  • the performing a short-time Fourier transform on the synthesized audio to obtain an amplitude spectrum includes:
  • a modified synthetic audio is obtained by subtracting the blank parts at the beginning and the end of the synthetic audio; and an amplitude spectrum is obtained by performing a short-time Fourier transform on the modified synthetic audio.
  • the performing a short-time Fourier transform on the synthesized audio to obtain an amplitude spectrum includes:
  • the synthetic audio is pre-emphasized, framed and windowed to obtain a modified synthetic audio; and a short-time Fourier transform is performed on the modified synthetic audio to obtain an amplitude spectrum.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM)

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A cross-language audio conversion method, a computer device, and a storage medium. The method includes: acquiring text to be converted and a target speech of a target user, the text to be converted including at least one language (S110); converting the text to be converted into synthetic audio (S120); preprocessing the synthetic audio to obtain synthetic audio features (S130); using the synthetic audio features and the target speech as input and obtaining target audio features with a pre-trained audio conversion model (S140); and converting the target audio features into target text speech that simulates the target speech (S150). The method achieves cross-language synthesis of the target user's voice.

Description

Cross-language audio conversion method, computer device, and storage medium
Technical Field
The present application relates to the field of computer technology, and in particular to a cross-language audio conversion method, a computer device, and a storage medium.
Background
Machine learning and deep learning rely on massive data and the powerful processing capability of computers, and have achieved major breakthroughs in fields such as image, speech, and text. Because the same type of framework can achieve good results in different fields, neural network models that were originally used to solve text and image problems have been applied to the speech field.
Existing neural network models applied to the speech field can capture the characteristics of a target speaker's voice and thereby stably synthesize other utterances of that speaker, approaching human level in timbre similarity and naturalness. However, the synthesized speech can only be in the same language as the target speaker's: the target speaker's voice cannot be synthesized into speech of that speaker in another language. If the target speaker can only speak Chinese, only Chinese speech can be synthesized, and speech in other languages cannot.
Summary
In view of this, it is necessary to provide a cross-language audio conversion method, a computer device, and a storage medium to address the above problem.
In a first aspect, an embodiment of the present application provides a cross-language audio conversion method, the method including:
acquiring text to be converted and a target speech of a target user, the text to be converted including at least one language;
converting the text to be converted into synthetic audio;
preprocessing the synthetic audio to obtain synthetic audio features;
using the synthetic audio features and the target speech as input, and obtaining target audio features with a pre-trained audio conversion model;
converting the target audio features into target text speech that simulates the target speech.
In a second aspect, an embodiment of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program that, when executed by the processor, causes the processor to perform the following steps:
acquiring text to be converted and a target speech of a target user, the text to be converted including at least one language;
converting the text to be converted into synthetic audio;
preprocessing the synthetic audio to obtain synthetic audio features;
using the synthetic audio features and the target speech as input, and obtaining target audio features with a pre-trained audio conversion model;
converting the target audio features into target text speech that simulates the target speech.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the following steps:
acquiring text to be converted and a target speech of a target user, the text to be converted including at least one language;
converting the text to be converted into synthetic audio;
preprocessing the synthetic audio to obtain synthetic audio features;
using the synthetic audio features and the target speech as input, and obtaining target audio features with a pre-trained audio conversion model;
converting the target audio features into target text speech that simulates the target speech.
Implementing the embodiments of the present application has the following beneficial effects:
By acquiring text to be converted that includes at least one language and converting it into synthetic audio to serve as the source audio of the target text speech, acquiring the target speech of the target user as the speaker feature of the target text speech, and feeding both into a pre-trained audio conversion model to obtain target text speech that simulates the target speech uttered by the user, the embodiments of the present application solve the problem that a target speaker's voice cannot be synthesized into speech of that speaker in another language, achieving the beneficial effect of cross-language synthesis of the target user's voice.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
In the drawings:
FIG. 1 is an application environment diagram of a cross-language audio conversion method in an embodiment of the present application;
FIG. 2 is a flowchart of a cross-language audio conversion method in an embodiment of the present application;
FIG. 3 is a flowchart of step S130 in the cross-language audio conversion method in an embodiment of the present application;
FIG. 4 is a flowchart of step S210 in the cross-language audio conversion method in an embodiment of the present application;
FIG. 5 is a flowchart of step S210 in the cross-language audio conversion method in an embodiment of the present application;
FIG. 6 is a flowchart of an audio conversion model training method in an embodiment of the present application;
FIG. 7 is a flowchart of step S550 in the audio conversion model training method in an embodiment of the present application;
FIG. 8 is a structural block diagram of a cross-language audio conversion apparatus in an embodiment of the present application;
FIG. 9 is a structural block diagram of a computer device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
FIG. 1 is a diagram of an application environment of a cross-language audio conversion method in one embodiment. Referring to FIG. 1, the cross-language audio conversion method is applied to a cross-language audio conversion system. The system includes a terminal 110 and a server 120, which are connected through a network. The terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as an independent server or as a server cluster composed of multiple servers. The terminal 110 is configured to acquire the text to be converted and the target speech of the target user and upload them to the server 120, the text to be converted including at least one language. The server 120 is configured to receive the text to be converted and the target speech of the target user; convert the text to be converted into synthetic audio; preprocess the synthetic audio to obtain synthetic audio features; use the synthetic audio features and the target speech as input and obtain target audio features with a pre-trained audio conversion model; and convert the target audio features into target text speech that simulates the target speech.
In another embodiment, the above cross-language audio conversion method may also be applied directly on the terminal 110. The terminal 110 is configured to acquire the text to be converted and the target speech of the target user, the text to be converted including at least one language; convert the text to be converted into synthetic audio; preprocess the synthetic audio to obtain synthetic audio features; use the synthetic audio features and the target speech as input and obtain target audio features with a pre-trained audio conversion model; and convert the target audio features into target text speech that simulates the target speech.
As shown in FIG. 2, in one embodiment, a cross-language audio conversion method is provided. The method may be applied to either a terminal or a server; this embodiment is described using application to a terminal as an example. The cross-language audio conversion method specifically includes the following steps:
S110: Acquire text to be converted and a target speech of a target user, the text to be converted including at least one language.
In this embodiment, when the cross-language audio conversion method is executed, the user may execute it on a mobile device such as a mobile phone. The user first needs to input the text to be converted and the target speech of the target user, where the text to be converted is the content of the speech that the user ultimately wishes to obtain, and the target speech of the target user provides the voice characteristics of the speech the user ultimately wishes to obtain. The text to be converted includes at least one language; that is, it may be Chinese, English, English plus Chinese, and so on. For example, if the user wants to obtain target text speech in which speaker A, who can only speak Chinese, says "Yes", the user only needs to input the text to be converted, "Yes", and a target speech of A, which may be any segment of Chinese speech uttered by A.
S120: Convert the text to be converted into synthetic audio.
S130: Preprocess the synthetic audio to obtain synthetic audio features.
In this embodiment, after the text to be converted and the target speech of the target user are obtained, the text to be converted is processed first and converted into synthetic audio. Specifically, TTS (Text To Speech) technology is used to convert the text to be converted into synthetic audio, and the synthetic audio is then preprocessed to obtain synthetic audio features, where the synthetic audio features are a synthetic Mel cepstrum; the obtained synthetic audio needs to be converted into a synthetic Mel cepstrum to facilitate input into the neural network model.
It should be noted that if audio of the user reading the text to be converted aloud were used directly as the input audio of the subsequent audio conversion model, the input audio could be disturbed by factors on the user's side, such as coughing or slurred articulation. By converting the text to be converted into clear and accurate synthetic audio, the embodiments of the present application eliminate such user-induced interference; the acquired target speech is used only to extract the voice characteristics of the target user.
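As a minimal illustration of step S120, the sketch below renders the text to be converted into synthetic audio with an off-the-shelf TTS library; the gTTS package, language code, and file name are assumptions made for illustration and are not part of the original disclosure. Any TTS front end that produces clean, consistent audio could be substituted.

```python
# Sketch of step S120: text to be converted -> synthetic audio.
# Assumes the third-party gTTS package; any TTS front end could be used instead.
from gtts import gTTS

def text_to_synthetic_audio(text: str, lang: str = "en", out_path: str = "synthetic.mp3") -> str:
    """Render the text to be converted as synthetic audio and return the file path."""
    tts = gTTS(text=text, lang=lang)
    tts.save(out_path)  # clean, speaker-neutral audio later used as model input
    return out_path

# Example: the text to be converted may be "Yes", Chinese text, or a mix of languages.
text_to_synthetic_audio("Yes", lang="en")
```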
S140: Use the synthetic audio features and the target speech as input, and obtain target audio features with a pre-trained audio conversion model.
S150: Convert the target audio features into target text speech that simulates the target speech.
In this embodiment, after the synthetic audio features, i.e. the synthetic Mel cepstrum, are obtained, the target speech also needs to be converted into a Mel cepstrum, and both are then fed into the pre-trained audio conversion model, which outputs the target audio features, where the target audio features are a target Mel cepstrum. The audio conversion model is a neural network model that has been trained in advance on a large amount of training users' speech and training text. Finally, the obtained target Mel cepstrum needs to be converted into the target text speech by another preset neural network model; the target text speech is speech of the content of the text to be converted, uttered with the voice characteristics of the target speech simulated. The other preset neural network model may be a WaveNet neural network model, a WaveRNN neural network model, or the like.
By acquiring text to be converted that includes at least one language and converting it into synthetic audio to serve as the source audio of the target text speech, acquiring the target speech of the target user as the speaker feature of the target text speech, and feeding both into a pre-trained audio conversion model to obtain target text speech that simulates the target speech uttered by the user, the embodiments of the present application solve the problem that a target speaker's voice cannot be synthesized into speech of that speaker in another language, achieving the beneficial effect of cross-language synthesis of the target user's voice.
In one embodiment, as shown in FIG. 3, step S130 specifically includes:
S210: Perform a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum.
S220: Filter the amplitude spectrum to obtain a Mel spectrum.
S230: Perform cepstral analysis on the Mel spectrum to obtain a synthetic Mel cepstrum as the synthetic audio feature.
In this embodiment, when the synthetic audio is preprocessed to obtain the synthetic audio features, specifically, a short-time Fourier transform is first applied to the synthetic audio, which yields an amplitude spectrum and a phase spectrum and converts the waveform of the synthetic audio from the time domain to the frequency domain, facilitating the extraction of speech features. Only the amplitude spectrum is taken and filtered to obtain the Mel spectrum. The filter used may be a filter bank designed around the frequency sensitivity of human hearing: the filters are denser at low frequencies with larger threshold values and sparser at high frequencies with smaller threshold values, so the filtering result better matches the human voice. To obtain features closer to the human vocal mechanism and the nonlinear human auditory system, cepstral analysis is finally performed on the Mel spectrum to obtain the Mel-frequency cepstrum (MFC), and this synthetic Mel cepstrum is taken as the synthetic audio feature. It should be noted that the target speech needs to undergo the same processing as the synthetic audio, which is not repeated here.
By converting the synthetic audio into a Mel cepstrum, the embodiments of the present application not only obtain features closer to the human vocal mechanism and the nonlinear auditory system, but also facilitate the training, input, and output of the neural network model.
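A minimal sketch of steps S210 to S230 is given below, assuming the librosa, numpy, and scipy packages are available; the frame length, hop length, and numbers of filters and coefficients are illustrative choices rather than values taken from the original disclosure.

```python
# Sketch of steps S210-S230: STFT -> amplitude spectrum -> Mel spectrum -> Mel cepstrum.
import numpy as np
import librosa
from scipy.fftpack import dct

def synthetic_mel_cepstrum(wav_path: str, n_fft: int = 1024, hop: int = 256,
                           n_mels: int = 80, n_ceps: int = 40) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=None)
    # S210: short-time Fourier transform; keep only the amplitude spectrum.
    amplitude = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    # S220: filter the amplitude spectrum with a Mel filter bank to obtain the Mel spectrum.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_spec = mel_fb @ amplitude
    # S230: cepstral analysis (log followed by DCT) gives the Mel cepstrum used as the feature.
    log_mel = np.log(mel_spec + 1e-6)
    mel_cepstrum = dct(log_mel, type=2, axis=0, norm="ortho")[:n_ceps]
    return mel_cepstrum.T  # shape: (frames, coefficients)
```

The target speech would be passed through the same function so that both inputs of the audio conversion model share one feature representation.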
In one embodiment, as shown in FIG. 4, step S210 specifically includes:
S310: Remove the leading and trailing blank portions of the synthetic audio to obtain modified synthetic audio.
S320: Perform a short-time Fourier transform on the modified synthetic audio to obtain an amplitude spectrum.
In this embodiment, because blank portions exist at the beginning and end of the synthetic audio, in order to let the audio conversion model align, learn, and convert better, the leading and trailing blanks of the synthetic audio are removed to obtain modified synthetic audio before the short-time Fourier transform is performed, and the short-time Fourier transform is then applied to the modified synthetic audio to obtain the amplitude spectrum.
In one embodiment, as shown in FIG. 5, step S210 may alternatively include:
S410: Perform pre-emphasis, framing, and windowing on the synthetic audio to obtain modified synthetic audio.
S420: Perform a short-time Fourier transform on the modified synthetic audio to obtain an amplitude spectrum.
In this embodiment, to better suit the short-time Fourier transform, the synthetic audio is pre-emphasized, framed, and windowed to obtain modified synthetic audio before the transform is performed. Pre-emphasis boosts the high-frequency information of the synthetic audio and filters out part of the noise, and framing and windowing make the synthetic audio smoother and more continuous; the short-time Fourier transform is then performed on the modified synthetic audio to obtain the amplitude spectrum. Steps S410 and S420 in this embodiment of the present application may also be performed together after step S310.
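The sketch below illustrates steps S310 and S410 under the same assumptions as the previous example (librosa and numpy available); the trim threshold, pre-emphasis coefficient, window choice, and file name are illustrative, not values from the original disclosure.

```python
# Sketch of S310 (trim leading/trailing silence) and S410 (pre-emphasis, framing, windowing).
import numpy as np
import librosa

def modify_synthetic_audio(y: np.ndarray, frame_len: int = 1024, hop: int = 256) -> np.ndarray:
    # S310: remove leading and trailing blanks so the model aligns more easily.
    y_trimmed, _ = librosa.effects.trim(y, top_db=30)
    # S410: pre-emphasis boosts high frequencies and suppresses part of the noise.
    y_pre = np.append(y_trimmed[0], y_trimmed[1:] - 0.97 * y_trimmed[:-1])
    # Framing and windowing make each analysis segment smoother and more continuous.
    frames = librosa.util.frame(y_pre, frame_length=frame_len, hop_length=hop)
    window = np.hanning(frame_len)
    return frames * window[:, None]  # windowed frames ready for the STFT

y, sr = librosa.load("synthetic.wav", sr=None)  # assumed path of the synthetic audio
windowed_frames = modify_synthetic_audio(y)
```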
As shown in FIG. 6, in one embodiment, an audio conversion model training method is provided. The method may be applied to either a terminal or a server; this embodiment is described using application to a terminal as an example. The audio conversion model training specifically includes the following steps:
S510: Acquire training text and a training speech of a training user, the training text including at least one language.
S520: Convert the training text into training synthetic audio.
S530: Preprocess the training synthetic audio to obtain training synthetic audio features.
S540: Generate training target speech features based on the training speech.
S550: Train an audio conversion model based on the training synthetic audio features and the training target speech features.
In this embodiment, when the audio conversion model is trained, the training text and the training speech of the training user are acquired first, where the training text and the training speech correspond one to one and the training text includes at least one language. If cross-language voice conversion is to be achieved when the audio conversion model is used, the training text includes at least two languages, and the language of the text to be converted in actual use is also included in the training text. If the training text has only one language, the target text speech obtained when the audio conversion model is used will be in the language of the target speech that is closest to the text to be converted. For example, if the training text includes only English and the corresponding training speech includes only English, and the text to be converted is Chinese while the target speech is English, the resulting target text speech will be English speech whose pronunciation is closest to the Chinese of the text to be converted.
For example, if the training text includes "YES", the training speech also includes the training user uttering "YES"; if the training text includes "YES先生" ("Mr. YES"), the training speech also includes the training user uttering "YES先生". In addition, training speech of multiple training users may be acquired during training. Preferably, the training users include the target user of step S110, so that the audio conversion model uses the target user in its training data set; in this way, the accuracy is greatly improved when the audio conversion model is used to obtain target text speech based on the target user. Even if the training users do not include the target user of step S110, when the training data set of the audio conversion model is large enough, the audio conversion model will produce its output based on the training user whose voice characteristics are closest to the target user's, so similarity is still guaranteed.
Further, after the training text and the training speech are obtained, the training text is converted into training synthetic audio, the training synthetic audio is preprocessed to obtain training synthetic audio features, and training target speech features are generated based on the training speech, where the training synthetic audio features are a training synthetic Mel cepstrum and the training target speech features include a training target Mel cepstrum. The specific conversion and preprocessing methods are the same as in steps S120 and S130 and are not repeated here. The training text and the training speech correspond one to one; that is, the training synthetic audio and the training speech have the same spoken content but different voice characteristics. Finally, the audio conversion model can be trained based on the training synthetic audio features and the training target speech features, with the training synthetic audio features as input and the training target speech features as output.
In one embodiment, as shown in FIG. 7, step S550 may specifically include:
S610: Input the training synthetic Mel cepstrum to the first encoder to obtain a first vector.
S620: Input part of the training target Mel cepstrum to the second encoder to obtain a second vector, the part of the training target Mel cepstrum being randomly intercepted from the training target Mel cepstrum.
S630: Concatenate the first vector and the second vector and input the result to the decoder to obtain a training predicted Mel cepstrum.
S640: Calculate the training loss between the training predicted Mel cepstrum and the training target Mel cepstrum.
S650: Perform back-propagation according to the training loss to update the training weights of the audio conversion model until the audio conversion model converges.
In this embodiment, the audio conversion model includes a first encoder, a second encoder, and a decoder. Specifically, when the audio conversion model is trained based on the training synthetic audio features and the training target speech features, the obtained training synthetic audio features, i.e. the training synthetic Mel cepstrum, are first input to the first encoder, which outputs a first vector. The vector length of the first vector takes the maximum input sequence length in the batch, and the remaining shorter sequences are padded with zeros at the end. Part of the training target Mel cepstrum is then input to the second encoder, which outputs a second vector, where the part of the training target Mel cepstrum is the training target speech feature obtained by random interception from the training target Mel cepstrum. Specifically, after the training speech is converted into a Mel cepstrum, a preset number of intercepted segments of the training user's Mel cepstrum are randomly selected and spliced together as the partial training target Mel cepstrum that serves as the target speech feature. It should be noted that the target of the random interception may be the training target speech features corresponding to the training synthetic audio features (that is, the two may have the same spoken content) or may not correspond to them; this is not limited in the embodiments of the present application. Further, after the first vector and the second vector are obtained in the audio conversion model, they are concatenated and input to the decoder to obtain the training predicted Mel cepstrum, the training loss between the training predicted Mel cepstrum and the training target Mel cepstrum is calculated, and back-propagation is performed according to the training loss to update the training weights of the audio conversion model until the model converges.
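The sketch below illustrates the random interception and splicing of step S620 and the zero-padding of shorter sequences to the longest length in a batch described above; the segment length, segment count, and array shapes are assumptions made for illustration.

```python
# Sketch of the speaker reference built in S620 and of the batch zero-padding
# described for the first encoder's input; arrays are (frames, coefficients).
import numpy as np

def random_spliced_reference(target_cepstrum: np.ndarray, n_segments: int = 4,
                             segment_len: int = 32) -> np.ndarray:
    """Randomly intercept segments of the training user's Mel cepstrum and splice them."""
    pieces = []
    for _ in range(n_segments):
        start = np.random.randint(0, max(1, len(target_cepstrum) - segment_len))
        pieces.append(target_cepstrum[start:start + segment_len])
    return np.concatenate(pieces, axis=0)  # partial training target Mel cepstrum

def pad_batch(cepstra: list[np.ndarray]) -> np.ndarray:
    """Pad every sequence with zeros at the end up to the longest length in the batch."""
    max_len = max(c.shape[0] for c in cepstra)
    n_coef = cepstra[0].shape[1]
    batch = np.zeros((len(cepstra), max_len, n_coef), dtype=np.float32)
    for i, c in enumerate(cepstra):
        batch[i, :c.shape[0]] = c
    return batch
```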
Specifically, the first encoder includes a 2-layer CNN model, a 5-layer Bi-LSTM model, a Linear Projection layer, and a batch normalization layer; the second encoder includes a 3-layer LSTM model, a 1-layer Linear model, and pooling and normalization layers; and the decoder includes a Pre-Net, an Attention model, an LSTM model, a Linear model, a Post-Net, a pooling layer, and an output layer.
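A highly simplified PyTorch sketch of the two encoders and of the concatenation that feeds the decoder is given below, assuming 80-dimensional Mel cepstra and illustrative hidden sizes; the decoder with Pre-Net, attention, and Post-Net is omitted. This is only a structural reading of the paragraph above, not the patented implementation.

```python
# Structural sketch: content encoder (first encoder), speaker encoder (second encoder),
# and concatenation of their outputs as the decoder input.
import torch
import torch.nn as nn

class FirstEncoder(nn.Module):  # 2-layer CNN + 5-layer Bi-LSTM + linear projection + batch norm
    def __init__(self, n_ceps: int = 80, hidden: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_ceps, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU())
        self.blstm = nn.LSTM(hidden, hidden, num_layers=5, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, hidden)
        self.norm = nn.BatchNorm1d(hidden)

    def forward(self, x):  # x: (batch, frames, n_ceps)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        h, _ = self.blstm(h)
        h = self.proj(h)  # first vector: content representation per frame
        return self.norm(h.transpose(1, 2)).transpose(1, 2)

class SecondEncoder(nn.Module):  # 3-layer LSTM + linear layer + pooling
    def __init__(self, n_ceps: int = 80, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(n_ceps, hidden, num_layers=3, batch_first=True)
        self.linear = nn.Linear(hidden, hidden)

    def forward(self, ref):  # ref: spliced partial training target Mel cepstrum
        h, _ = self.lstm(ref)
        return self.linear(h.mean(dim=1))  # pooled second vector: speaker representation

# The decoder (Pre-Net, attention, LSTM, Post-Net) consumes the concatenation of the two:
content = FirstEncoder()(torch.randn(2, 120, 80))
speaker = SecondEncoder()(torch.randn(2, 128, 80))
decoder_input = torch.cat([content, speaker.unsqueeze(1).expand(-1, content.size(1), -1)], dim=-1)
```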
Further, to show that using synthetic audio as the input of the audio conversion model removes the interference caused by the user, assume during training that the feature sequence of the input training synthetic audio features is x = (x_1, x_2, ..., x_n), where n denotes the n-th frame on the time axis of the training synthetic Mel cepstrum, and that the feature sequence of the training predicted features output by the audio conversion model is y = (y_1, y_2, ..., y_m), where m likewise denotes the m-th frame on the time axis of the training predicted Mel cepstrum. The feature sequence predicted by the audio conversion model should be as close as possible to the target feature sequence of the training target speech features, denoted ŷ.
Assume that every frame of the input feature sequence contains two latent variables: one is the speech content of the input audio, c = (c_1, c_2, ..., c_n), and the other is the speech feature of the input audio, s = (s_1, s_2, ..., s_i); the target sequence ŷ likewise contains the target speech feature of the target user, denoted ŝ_t, where i indexes the input audio and t indexes the target user, i ∈ {1, 2, ..., j} and t ∈ {1, 2, ..., k}, with j being the number of input audio samples and k the number of target users in the whole training data set.
The role of the first encoder in the audio conversion model is to remove the speech feature s_i of the input audio from the input sequence and keep only the speech content c; the input sequence can then be expressed in the form of formula (1). [Formula (1) appears as an image in the original publication.]
Because a TTS synthetic-speech-to-real-speech approach is used to separate the user's speech feature from the speech content, the input audio has only one speech feature, namely that of the synthetic audio, denoted s_0, which can be regarded as a constant. By Bayes' theorem, formula (1) can be rewritten as formula (2). [Formula (2) appears as an image in the original publication.]
For the predicted sequence y, the same reasoning gives formula (3). [Formula (3) appears as an image in the original publication.]
Here ŝ_t is the output of the second encoder and c is the output of the first encoder; the two are combined as the input of the decoder, and the decoder finally outputs the predicted sequence y. Since c and ŝ_t come from two different sequences, they can be regarded as mutually independent. Combining formulas (2) and (3) therefore yields formula (4). [Formula (4) appears as an image in the original publication.]
Formula (4) shows that when the input audio is fixed synthetic audio, the predicted sequence y depends only on the input sequence x, the training user's ŝ_t, and the speech content c. This removes the interference that directly using audio of the user reading the text to be converted aloud as the input audio would introduce into the extraction of speech content by the audio conversion model.
As shown in FIG. 8, in one embodiment, a cross-language audio conversion apparatus is provided. The apparatus provided in this embodiment can execute the cross-language audio conversion method provided by any embodiment of the present application, and has the corresponding functional modules and beneficial effects for executing the method. The cross-language audio conversion apparatus includes a text acquisition module 100, a text conversion module 200, a feature acquisition module 300, a feature conversion module 400, and a speech simulation module 500.
Specifically, the text acquisition module 100 is configured to acquire the text to be converted and the target speech of the target user, the text to be converted including at least one language; the text conversion module 200 is configured to convert the text to be converted into synthetic audio; the feature acquisition module 300 is configured to preprocess the synthetic audio to obtain synthetic audio features; the feature conversion module 400 is configured to use the synthetic audio features and the target speech as input and obtain target audio features with a pre-trained audio conversion model; and the speech simulation module 500 is configured to convert the target audio features into target text speech that simulates the target speech.
In one embodiment, the above apparatus further includes a model training module 600, configured to acquire training text and a training speech of a training user, the training text including at least one language; convert the training text into training synthetic audio; preprocess the training synthetic audio to obtain training synthetic audio features; generate training target speech features based on the training speech; and train an audio conversion model based on the training synthetic audio features and the training target speech features.
In one embodiment, the training synthetic audio feature is a training synthetic Mel cepstrum, the training target speech feature is a training target Mel cepstrum, and the audio conversion model includes a first encoder, a second encoder, and a decoder. The model training module 600 is specifically configured to input the training synthetic Mel cepstrum to the first encoder to obtain a first vector; input part of the training target Mel cepstrum to the second encoder to obtain a second vector, the part of the training target Mel cepstrum being randomly intercepted from the training target Mel cepstrum; concatenate the first vector and the second vector and input the result to the decoder to obtain a training predicted Mel cepstrum; calculate the training loss between the training predicted Mel cepstrum and the training target Mel cepstrum; and perform back-propagation according to the training loss to update the training weights of the audio conversion model until the audio conversion model converges.
In one embodiment, the synthetic audio feature is a synthetic Mel cepstrum, and the target audio feature is a target Mel cepstrum.
In one embodiment, the feature acquisition module 300 is specifically configured to perform a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum; filter the amplitude spectrum to obtain a Mel spectrum; and perform cepstral analysis on the Mel spectrum to obtain a synthetic Mel cepstrum as the synthetic audio feature.
In one embodiment, the feature acquisition module 300 is further configured to remove the leading and trailing blank portions of the synthetic audio to obtain modified synthetic audio, and perform a short-time Fourier transform on the modified synthetic audio to obtain the amplitude spectrum.
In one embodiment, the feature acquisition module 300 is further configured to perform pre-emphasis, framing, and windowing on the synthetic audio to obtain modified synthetic audio, and perform a short-time Fourier transform on the modified synthetic audio to obtain the amplitude spectrum.
FIG. 9 shows an internal structure diagram of a computer device in one embodiment. The computer device may be a terminal or a server. As shown in FIG. 9, the computer device includes a processor, a memory, and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the cross-language audio conversion method. A computer program may also be stored in the internal memory which, when executed by the processor, causes the processor to execute the cross-language audio conversion method. A person skilled in the art will understand that the structure shown in FIG. 9 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor, where the memory stores a computer program that, when executed by the processor, causes the processor to perform the following steps:
acquiring text to be converted and a target speech of a target user, the text to be converted including at least one language; converting the text to be converted into synthetic audio; preprocessing the synthetic audio to obtain synthetic audio features; using the synthetic audio features and the target speech as input and obtaining target audio features with a pre-trained audio conversion model; and converting the target audio features into target text speech that simulates the target speech.
In one embodiment, the training of the audio conversion model includes:
acquiring training text and a training speech of a training user, the training text including at least one language; converting the training text into training synthetic audio; preprocessing the training synthetic audio to obtain training synthetic audio features; generating training target speech features based on the training speech; and training an audio conversion model based on the training synthetic audio features and the training target speech features.
In one embodiment, the training synthetic audio feature is a training synthetic Mel cepstrum, the training target speech feature is a training target Mel cepstrum, the audio conversion model includes a first encoder, a second encoder, and a decoder, and training the audio conversion model based on the training synthetic audio features and the training target speech features includes:
inputting the training synthetic Mel cepstrum to the first encoder to obtain a first vector; inputting part of the training target Mel cepstrum to the second encoder to obtain a second vector, the part of the training target Mel cepstrum being randomly intercepted from the training target Mel cepstrum; concatenating the first vector and the second vector and inputting the result to the decoder to obtain a training predicted Mel cepstrum; calculating the training loss between the training predicted Mel cepstrum and the training target Mel cepstrum; and performing back-propagation according to the training loss to update the training weights of the audio conversion model until the audio conversion model converges.
In one embodiment, the synthetic audio feature is a synthetic Mel cepstrum, and the target audio feature is a target Mel cepstrum.
In one embodiment, preprocessing the synthetic audio to obtain synthetic audio features includes:
performing a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum; filtering the amplitude spectrum to obtain a Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain a synthetic Mel cepstrum as the synthetic audio feature.
In one embodiment, performing a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum includes:
removing the leading and trailing blank portions of the synthetic audio to obtain modified synthetic audio; and performing a short-time Fourier transform on the modified synthetic audio to obtain the amplitude spectrum.
In one embodiment, performing a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum includes:
performing pre-emphasis, framing, and windowing on the synthetic audio to obtain modified synthetic audio; and performing a short-time Fourier transform on the modified synthetic audio to obtain the amplitude spectrum.
In one embodiment, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, causes the processor to perform the following steps:
acquiring text to be converted and a target speech of a target user, the text to be converted including at least one language; converting the text to be converted into synthetic audio; preprocessing the synthetic audio to obtain synthetic audio features; using the synthetic audio features and the target speech as input and obtaining target audio features with a pre-trained audio conversion model; and converting the target audio features into target text speech that simulates the target speech.
In one embodiment, the training of the audio conversion model includes:
acquiring training text and a training speech of a training user, the training text including at least one language; converting the training text into training synthetic audio; preprocessing the training synthetic audio to obtain training synthetic audio features; generating training target speech features based on the training speech; and training an audio conversion model based on the training synthetic audio features and the training target speech features.
In one embodiment, the training synthetic audio feature is a training synthetic Mel cepstrum, the training target speech feature is a training target Mel cepstrum, the audio conversion model includes a first encoder, a second encoder, and a decoder, and training the audio conversion model based on the training synthetic audio features and the training target speech features includes:
inputting the training synthetic Mel cepstrum to the first encoder to obtain a first vector; inputting part of the training target Mel cepstrum to the second encoder to obtain a second vector, the part of the training target Mel cepstrum being randomly intercepted from the training target Mel cepstrum; concatenating the first vector and the second vector and inputting the result to the decoder to obtain a training predicted Mel cepstrum; calculating the training loss between the training predicted Mel cepstrum and the training target Mel cepstrum; and performing back-propagation according to the training loss to update the training weights of the audio conversion model until the audio conversion model converges.
In one embodiment, the synthetic audio feature is a synthetic Mel cepstrum, and the target audio feature is a target Mel cepstrum.
In one embodiment, preprocessing the synthetic audio to obtain synthetic audio features includes:
performing a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum; filtering the amplitude spectrum to obtain a Mel spectrum; and performing cepstral analysis on the Mel spectrum to obtain a synthetic Mel cepstrum as the synthetic audio feature.
In one embodiment, performing a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum includes:
removing the leading and trailing blank portions of the synthetic audio to obtain modified synthetic audio; and performing a short-time Fourier transform on the modified synthetic audio to obtain the amplitude spectrum.
In one embodiment, performing a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum includes:
performing pre-emphasis, framing, and windowing on the synthetic audio to obtain modified synthetic audio; and performing a short-time Fourier transform on the modified synthetic audio to obtain the amplitude spectrum.
A person of ordinary skill in the art will understand that all or part of the procedures in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the procedures of the embodiments of the above methods. Any reference to memory, storage, a database, or another medium used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features of the above embodiments are described; however, as long as such combinations are not contradictory, they shall be regarded as falling within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they shall not therefore be construed as limiting the scope of the patent of the present application. It should be pointed out that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (21)

  1. A cross-language audio conversion method, characterized in that the method comprises:
    acquiring text to be converted and a target speech of a target user, the text to be converted comprising at least one language;
    converting the text to be converted into synthetic audio;
    preprocessing the synthetic audio to obtain synthetic audio features;
    using the synthetic audio features and the target speech as input, and obtaining target audio features with a pre-trained audio conversion model;
    converting the target audio features into target text speech that simulates the target speech.
  2. The method according to claim 1, characterized in that the training of the audio conversion model comprises:
    acquiring training text and a training speech of a training user, the training text comprising at least one language;
    converting the training text into training synthetic audio;
    preprocessing the training synthetic audio to obtain training synthetic audio features;
    generating training target speech features based on the training speech;
    training an audio conversion model based on the training synthetic audio features and the training target speech features.
  3. The method according to claim 2, characterized in that the training synthetic audio feature is a training synthetic Mel cepstrum, the training target speech feature is a training target Mel cepstrum, the audio conversion model comprises a first encoder, a second encoder, and a decoder, and training the audio conversion model based on the training synthetic audio features and the training target speech features comprises:
    inputting the training synthetic Mel cepstrum to the first encoder to obtain a first vector;
    inputting part of the training target Mel cepstrum to the second encoder to obtain a second vector, the part of the training target Mel cepstrum being randomly intercepted from the training target Mel cepstrum;
    concatenating the first vector and the second vector and inputting the result to the decoder to obtain a training predicted Mel cepstrum;
    calculating a training loss between the training predicted Mel cepstrum and the training target Mel cepstrum;
    performing back-propagation according to the training loss to update training weights of the audio conversion model until the audio conversion model converges.
  4. The method according to claim 1, characterized in that the synthetic audio feature is a synthetic Mel cepstrum and the target audio feature is a target Mel cepstrum.
  5. The method according to claim 4, characterized in that preprocessing the synthetic audio to obtain synthetic audio features comprises:
    performing a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum;
    filtering the amplitude spectrum to obtain a Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain a synthetic Mel cepstrum as the synthetic audio feature.
  6. The method according to claim 5, characterized in that performing a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum comprises:
    removing the leading and trailing blank portions of the synthetic audio to obtain modified synthetic audio;
    performing a short-time Fourier transform on the modified synthetic audio to obtain the amplitude spectrum.
  7. The method according to claim 5, characterized in that performing a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum comprises:
    performing pre-emphasis, framing, and windowing on the synthetic audio to obtain modified synthetic audio;
    performing a short-time Fourier transform on the modified synthetic audio to obtain the amplitude spectrum.
  8. A computer device, comprising a memory and a processor, wherein the memory stores a computer program that, when executed by the processor, causes the processor to perform the following steps:
    acquiring text to be converted and a target speech of a target user, the text to be converted comprising at least one language;
    converting the text to be converted into synthetic audio;
    preprocessing the synthetic audio to obtain synthetic audio features;
    using the synthetic audio features and the target speech as input, and obtaining target audio features with a pre-trained audio conversion model;
    converting the target audio features into target text speech that simulates the target speech.
  9. The computer device according to claim 8, characterized in that the training of the audio conversion model comprises:
    acquiring training text and a training speech of a training user, the training text comprising at least one language;
    converting the training text into training synthetic audio;
    preprocessing the training synthetic audio to obtain training synthetic audio features;
    generating training target speech features based on the training speech;
    training an audio conversion model based on the training synthetic audio features and the training target speech features.
  10. The computer device according to claim 9, characterized in that the training synthetic audio feature is a training synthetic Mel cepstrum, the training target speech feature is a training target Mel cepstrum, the audio conversion model comprises a first encoder, a second encoder, and a decoder, and training the audio conversion model based on the training synthetic audio features and the training target speech features comprises:
    inputting the training synthetic Mel cepstrum to the first encoder to obtain a first vector;
    inputting part of the training target Mel cepstrum to the second encoder to obtain a second vector, the part of the training target Mel cepstrum being randomly intercepted from the training target Mel cepstrum;
    concatenating the first vector and the second vector and inputting the result to the decoder to obtain a training predicted Mel cepstrum;
    calculating a training loss between the training predicted Mel cepstrum and the training target Mel cepstrum;
    performing back-propagation according to the training loss to update training weights of the audio conversion model until the audio conversion model converges.
  11. The computer device according to claim 8, characterized in that the synthetic audio feature is a synthetic Mel cepstrum and the target audio feature is a target Mel cepstrum.
  12. The computer device according to claim 11, characterized in that preprocessing the synthetic audio to obtain synthetic audio features comprises:
    performing a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum;
    filtering the amplitude spectrum to obtain a Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain a synthetic Mel cepstrum as the synthetic audio feature.
  13. The computer device according to claim 12, characterized in that performing a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum comprises:
    removing the leading and trailing blank portions of the synthetic audio to obtain modified synthetic audio;
    performing a short-time Fourier transform on the modified synthetic audio to obtain the amplitude spectrum.
  14. The computer device according to claim 12, characterized in that performing a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum comprises:
    performing pre-emphasis, framing, and windowing on the synthetic audio to obtain modified synthetic audio;
    performing a short-time Fourier transform on the modified synthetic audio to obtain the amplitude spectrum.
  15. A computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to perform the following steps:
    acquiring text to be converted and a target speech of a target user, the text to be converted comprising at least one language;
    converting the text to be converted into synthetic audio;
    preprocessing the synthetic audio to obtain synthetic audio features;
    using the synthetic audio features and the target speech as input, and obtaining target audio features with a pre-trained audio conversion model;
    converting the target audio features into target text speech that simulates the target speech.
  16. The storage medium according to claim 15, characterized in that the training of the audio conversion model comprises:
    acquiring training text and a training speech of a training user, the training text comprising at least one language;
    converting the training text into training synthetic audio;
    preprocessing the training synthetic audio to obtain training synthetic audio features;
    generating training target speech features based on the training speech;
    training an audio conversion model based on the training synthetic audio features and the training target speech features.
  17. The storage medium according to claim 16, characterized in that the training synthetic audio feature is a training synthetic Mel cepstrum, the training target speech feature is a training target Mel cepstrum, the audio conversion model comprises a first encoder, a second encoder, and a decoder, and training the audio conversion model based on the training synthetic audio features and the training target speech features comprises:
    inputting the training synthetic Mel cepstrum to the first encoder to obtain a first vector;
    inputting part of the training target Mel cepstrum to the second encoder to obtain a second vector, the part of the training target Mel cepstrum being randomly intercepted from the training target Mel cepstrum;
    concatenating the first vector and the second vector and inputting the result to the decoder to obtain a training predicted Mel cepstrum;
    calculating a training loss between the training predicted Mel cepstrum and the training target Mel cepstrum;
    performing back-propagation according to the training loss to update training weights of the audio conversion model until the audio conversion model converges.
  18. The storage medium according to claim 15, characterized in that the synthetic audio feature is a synthetic Mel cepstrum and the target audio feature is a target Mel cepstrum.
  19. The storage medium according to claim 18, characterized in that preprocessing the synthetic audio to obtain synthetic audio features comprises:
    performing a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum;
    filtering the amplitude spectrum to obtain a Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain a synthetic Mel cepstrum as the synthetic audio feature.
  20. The storage medium according to claim 19, characterized in that performing a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum comprises:
    removing the leading and trailing blank portions of the synthetic audio to obtain modified synthetic audio;
    performing a short-time Fourier transform on the modified synthetic audio to obtain the amplitude spectrum.
  21. The storage medium according to claim 19, characterized in that performing a short-time Fourier transform on the synthetic audio to obtain an amplitude spectrum comprises:
    performing pre-emphasis, framing, and windowing on the synthetic audio to obtain modified synthetic audio;
    performing a short-time Fourier transform on the modified synthetic audio to obtain the amplitude spectrum.
PCT/CN2020/137868 2020-12-21 2020-12-21 Cross-language audio conversion method, computer device, and storage medium WO2022133630A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/137868 WO2022133630A1 (zh) 2020-12-21 2020-12-21 跨语言音频转换方法、计算机设备和存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/137868 WO2022133630A1 (zh) 2020-12-21 2020-12-21 跨语言音频转换方法、计算机设备和存储介质

Publications (1)

Publication Number Publication Date
WO2022133630A1 WO2022133630A1 (zh)

Family

ID=82157045

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/137868 WO2022133630A1 (zh) 2020-12-21 2020-12-21 跨语言音频转换方法、计算机设备和存储介质

Country Status (1)

Country Link
WO (1) WO2022133630A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150340034A1 (en) * 2014-05-22 2015-11-26 Google Inc. Recognizing speech using neural networks
CN111247585A (zh) * 2019-12-27 2020-06-05 深圳市优必选科技股份有限公司 语音转换方法、装置、设备及存储介质
CN111247581A (zh) * 2019-12-23 2020-06-05 深圳市优必选科技股份有限公司 一种多语言文本合成语音方法、装置、设备及存储介质
CN111316352A (zh) * 2019-12-24 2020-06-19 深圳市优必选科技股份有限公司 语音合成方法、装置、计算机设备和存储介质
CN111667814A (zh) * 2020-05-26 2020-09-15 北京声智科技有限公司 一种多语种的语音合成方法及装置
CN111899719A (zh) * 2020-07-30 2020-11-06 北京字节跳动网络技术有限公司 用于生成音频的方法、装置、设备和介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20966211; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20966211; Country of ref document: EP; Kind code of ref document: A1)