WO2024109375A1 - Training method, apparatus, device and medium for speech conversion model - Google Patents

Training method, apparatus, device and medium for speech conversion model

Info

Publication number
WO2024109375A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
audio
model
conversion
accent
Prior art date
Application number
PCT/CN2023/124162
Other languages
English (en)
French (fr)
Inventor
杨培基
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2024109375A1


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • the embodiments of the present application relate to the field of audio processing technology, and in particular to a training method, device, equipment and medium for a speech conversion model.
  • accent conversion is usually implemented using a voice conversion model, and a large amount of parallel corpus is required in the process of training the voice conversion model.
  • the parallel corpus is audio with different accents of the same voice content.
  • the embodiment of the present application provides a method, device, equipment and medium for training a speech conversion model, which can ensure the training quality of the speech conversion model while reducing the demand for manually recorded parallel corpus.
  • the technical solution is as follows:
  • an embodiment of the present application provides a method for training a speech conversion model, the method being executed by a computer device, comprising:
  • a speech conversion model is generated based on the first ASR model, the second conversion model, and the third conversion model obtained through training, and the speech conversion model is used to convert audio in a first accent into audio in a second accent.
  • an embodiment of the present application provides a speech conversion method, the method is performed by a computer device, a speech conversion model is set in the computer device, the speech conversion model includes a first ASR model, a second conversion model and a third conversion model, the method includes:
  • the second content feature is converted into audio through the third conversion model to obtain second accent audio.
  • an embodiment of the present application provides a training device for a speech conversion model, the device comprising:
  • a training module configured to train a first ASR model based on a first sample audio, and to train a second ASR model based on a second sample audio, wherein the first sample audio corresponds to a first accent, and the second sample audio corresponds to a second accent;
  • the training module is further used to train a first conversion model based on a first sample text and a first sample content feature corresponding to the first sample audio, wherein the first sample content feature is obtained by extracting the first sample audio by the first ASR model, and the first conversion model is used to convert the text into content features of the first accent;
  • the training module is further used to construct parallel sample data based on the first conversion model, a second sample text corresponding to the second sample audio, and a second sample content feature, wherein the second sample content feature is extracted by the second ASR model from the second sample audio, and the parallel sample data includes different content features, wherein different content features correspond to different accents, and different content features correspond to the same text; a second conversion model is trained based on the parallel sample data, wherein the second conversion model is used to convert content features between the first accent and the second accent;
  • the training module is further used to train a third conversion model based on sample content features of different sample audios, wherein the third conversion model is used to convert the content features into audio;
  • a generation module is used to generate a speech conversion model based on the first ASR model, the second conversion model and the third conversion model obtained through training, wherein the speech conversion model is used to convert audio in a first accent into audio in a second accent.
  • an embodiment of the present application provides a speech conversion device, wherein the device includes:
  • An acquisition module configured to acquire a first accent audio, where the first accent audio corresponds to the first accent
  • an extraction module configured to extract a first content feature from the first accent audio by using the first ASR model, wherein the first content feature corresponds to the first accent;
  • a content feature conversion module configured to convert the first content feature into a second content feature by using the second conversion model, wherein the second content feature corresponds to a second accent
  • An audio conversion module is used to perform audio conversion on the second content feature through the third conversion model to obtain a second accent audio.
  • an embodiment of the present application provides a computer device, which includes a processor and a memory, wherein the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the training method of the speech conversion model as described in the above aspects, or the speech conversion method as described in the above aspects.
  • an embodiment of the present application provides a computer-readable storage medium, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the training method of the speech conversion model as described in the above aspects, or the speech conversion method as described in the above aspects.
  • an embodiment of the present application provides a computer program product, which includes computer instructions, and the computer instructions are stored in a computer-readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the training method of the speech conversion model as described in the above aspects, or the speech conversion method as described in the above aspects.
  • a first conversion model for converting text into content features is trained, so that the first conversion model and the second sample text corresponding to the second sample audio can be used to construct parallel sample data containing the same text content but corresponding to different accents; the parallel sample data is then used to train a second conversion model for converting content features between different accents, and a third conversion model for converting content features into audio, thereby completing the training of the speech conversion model. During model training, the intermediate model obtained through training is used to construct the parallel corpus, so there is no need to record audio of different accents before model training. While ensuring the quality of model training, this reduces the demand for manually recorded parallel corpora, which helps to improve the efficiency of model training and improve the training quality of the model when samples are insufficient.
  • FIG1 shows a schematic diagram of a speech conversion system provided by an exemplary embodiment of the present application
  • FIG2 shows a flow chart of a method for training a speech conversion model provided by an exemplary embodiment of the present application
  • FIG3 shows a flow chart of an accent conversion method provided by an exemplary embodiment of the present application
  • FIG4 is a schematic diagram of a voice setting interface shown in an exemplary embodiment of the present application.
  • FIG5 is a schematic diagram of an implementation of an accent conversion process provided by an exemplary embodiment of the present application.
  • FIG6 is a flowchart of a text-to-content feature process shown in an exemplary embodiment of the present application.
  • FIG7 is a diagram of an FFT structure provided by an exemplary embodiment of the present application.
  • FIG8 is a schematic structural diagram of a first conversion model shown in an exemplary embodiment of the present application.
  • FIG9 is a flow chart of a second conversion model training process shown in an exemplary embodiment of the present application.
  • FIG10 is a schematic diagram of the structure of a second conversion model shown in an exemplary embodiment of the present application.
  • FIG11 is a schematic diagram of the structure of a third conversion model shown in an exemplary embodiment of the present application.
  • FIG12 is a flowchart of a third conversion model training process shown in an exemplary embodiment of the present application.
  • FIG13 is a schematic diagram of an implementation of an accent conversion process provided by another exemplary embodiment of the present application.
  • FIG14 is a structural block diagram of a training device for a speech conversion model provided by an exemplary embodiment of the present application.
  • FIG15 is a structural block diagram of a speech conversion device provided by an exemplary embodiment of the present application.
  • FIG16 shows a schematic diagram of the structure of a computer device provided by an exemplary embodiment of the present application.
  • the speech conversion model is composed of a first ASR model (for converting audio to text), a second conversion model (for converting content features between different accents) and a third conversion model (for converting content features to audio).
  • the first conversion model for converting text to content features is trained, thereby constructing parallel sample data with the help of the first conversion model for subsequent training of the second conversion model and the third conversion model.
  • parallel corpora are constructed with the help of the conversion models obtained through training, without the need to manually record a large amount of parallel corpora in advance, thereby reducing the dependence of the training process on parallel corpora and ensuring the quality of model training.
  • the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data comply with the relevant laws, regulations and standards of the relevant countries and regions.
  • the audio, accent and text involved in this application are all obtained with full authorization.
  • the speech conversion model trained by the training method provided in the embodiment of the present application can be applied to various scenarios requiring accent conversion.
  • Referring to FIG1, a schematic diagram of a speech conversion system provided by an exemplary embodiment of the present application is shown.
  • the speech conversion system includes: an audio acquisition device 110, a terminal 120 and a server 130.
  • the audio acquisition device 110 is a device for collecting user voice.
  • the audio acquisition device 110 can be an earphone, a microphone, or an AR/VR device with a sound receiving function, etc., which is not limited in this embodiment of the present application.
  • the audio collection device 110 is connected to the terminal 120 by wire or wireless means, and is used to transmit the collected user voice to the terminal 120, and the terminal 120 further performs accent conversion processing on the user voice.
  • the terminal 120 can be an electronic device such as a smart phone, a tablet computer, a personal computer, or a vehicle-mounted terminal.
  • an application with an accent conversion function is provided in the terminal 120. Through the application, the user can set an accent conversion target, thereby converting the user's voice from an original accent to a target accent.
  • the accent conversion may be implemented locally by the terminal 120 (the voice conversion model is set in the terminal 120); in another possible implementation, the accent conversion may be implemented by the terminal 120 with the aid of the server 130 (the voice conversion model is set in the server 130, and the terminal 120 transmits the accent conversion requirement to the server 130).
  • the server 130 may be an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
  • the server 130 may be a background server that implements the accent conversion function, and is used to provide conversion services between different accents.
  • multiple speech conversion models are provided in the server 130, and different speech conversion models are used to achieve conversion between different accents. For example, when supporting conversion of Mandarin into n local accents, n speech conversion models are provided in the server 130.
  • the server 130 obtains accent corpora of different accents, where the accent corpora are composed of audio and corresponding text, so as to train a corresponding speech conversion model based on the accent corpora.
  • Before performing accent conversion, the user sets, through the terminal 120, the first accent to be converted into the second accent; the terminal 120 sends an accent conversion request to the server 130, requesting the server 130 to use the corresponding speech conversion model (which converts the first accent into the second accent) to perform accent conversion.
  • the audio acquisition device 110 transmits the collected user voice in the first accent to the terminal 120, and the terminal 120 transmits the user voice in the first accent to the server 130.
  • the server 130 converts it into the user voice in the second accent through the speech conversion model and feeds it back to the terminal 120, which further processes it.
  • the terminal 120 processes the user voice in different ways.
  • the following uses several exemplary application scenarios for illustration.
  • After the terminal obtains the converted user voice, it merges the user voice with the produced content (such as a virtual human short video or a virtual human long video) to obtain the virtual human content.
  • the mouth of the virtual human can be controlled according to the converted user voice to improve the matching degree between the virtual human mouth movement and the voice.
  • the terminal obtains the first accent audio corresponding to the real user, the first accent audio corresponds to the first accent of the real user, and the terminal extracts the first accent audio through the first ASR model in the speech conversion model to obtain the first content feature under the first accent; the terminal converts the first content feature into the second content feature through the second conversion model in the speech conversion model, and the second content feature corresponds to the second accent; after converting the accent, the terminal performs audio conversion on the second content feature under the second accent through the third conversion model in the speech conversion model to obtain the second accent audio corresponding to the virtual human.
  • the virtual anchor can pre-set the live broadcast accent through the accent setting interface.
  • the terminal sends the user voice collected by the microphone to the server, and the server converts the user voice with the original accent into the user voice with the live broadcast accent and feeds it back to the terminal.
  • the terminal merges the user voice with the live broadcast accent with the video stream containing the virtual anchor image, and then pushes the merged audio and video stream to each viewer client in the live broadcast room through the push stream server.
  • the accent of the virtual anchor during live broadcast can be preset.
  • the terminal collects the first accent audio corresponding to the real user through a microphone, and the first accent audio corresponds to the first accent of the real user.
  • the terminal extracts the first accent audio through the first ASR model in the voice conversion model to obtain the first content feature under the first accent;
  • the terminal converts the first content feature into the second content feature through the second conversion model in the voice conversion model, and the second content feature corresponds to the second accent; after the accent is converted, the terminal performs audio conversion on the second content feature under the second accent through the third conversion model in the voice conversion model to obtain the second accent audio corresponding to the virtual anchor, that is, the virtual anchor broadcasts live with the second accent audio.
  • users can set the accent to be used when interacting in the Metaverse.
  • the user's voice is collected by headsets, AR/VR and other devices and transmitted to the terminal, which is then handed over to the server for accent conversion.
  • the server controls the virtual character in the Metaverse to play the converted accent audio to achieve voice interaction with other virtual characters.
  • the second accent for interacting with other virtual characters can be pre-selected.
  • the terminal collects the first accent audio corresponding to the real user through the microphone, and the first accent audio corresponds to the first accent of the real user.
  • the terminal extracts the first accent audio through the first ASR model in the voice conversion model to obtain the first content feature under the first accent; the terminal converts the first content feature into the second content feature through the second conversion model in the voice conversion model, and the second content feature corresponds to the second accent; after the accent is converted, the terminal performs audio conversion on the second content feature under the second accent through the third conversion model in the voice conversion model to obtain the second accent audio corresponding to the virtual character in the metaverse, that is, the virtual character in the metaverse interacts with other virtual characters with the second accent audio.
  • the above application scenarios are only exemplary descriptions.
  • the speech conversion model trained by the method provided in the embodiments of the present application can also be used in real-world application scenarios such as voice calls (to facilitate voice communication between callers with different accents) and translation, which is not limited in the embodiments of the present application.
  • the training and use of the speech conversion model are performed in a computer device (which may be a terminal or a server), and a speech conversion model trained to convert a first accent into a second accent is used as an example for explanation (schemes for converting other source accents into target accents are similar), but this is not intended to be limiting.
  • Fig. 2 shows a flow chart of a method for training a speech conversion model provided by an exemplary embodiment of the present application. The method is executed by a computer device and includes the following steps.
  • Step 201 training a first ASR model based on a first sample audio, and training a second ASR model based on a second sample audio, wherein the first sample audio corresponds to a first accent, and the second sample audio corresponds to a second accent.
  • the first accent is the source accent and the second accent is the target accent; that is, the trained speech conversion model is used to convert speech in the first accent into speech in the second accent.
  • the first sample audio corresponds to a first sample text
  • the second sample audio corresponds to a second sample text.
  • the first sample text does not need to be the same as the second sample text, so a public speech data set can be directly used for model training.
  • a computer device uses the WenetSpeech dataset as the first sample audio and the KeSpeech dataset as the second sample audio, wherein the WenetSpeech dataset includes 10,000 hours of ASR data (see the introduction at https://zhuanlan.zhihu.com/p/424118791), and the KeSpeech dataset contains ASR data of dialects from different regions (see the introduction at https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/0336dcbab05b9d5ad24f4333c7658a0e-Abstract-round2.html).
  • a computer device inputs sample audio into the ASR model to obtain a predicted text output by the ASR model, thereby training the ASR model based on the predicted text and the sample text corresponding to the sample audio.
  • the model architecture of the ASR model includes but is not limited to Wenet, wav2vec2, Kaldi, etc., which is not limited in the embodiments of the present application.
  • Wenet is a speech recognition toolkit for industrial applications that was open-sourced by the Mobvoi (Chumen Wenwen) voice team and the speech laboratory of Northwestern Polytechnical University. The toolkit provides a simple, one-stop solution covering speech recognition from training to deployment. For an introduction, see: https://zhuanlan.zhihu.com/p/349586567. Wav2vec was proposed in a paper published at Interspeech 2019.
  • Kaldi is an open source speech recognition tool that uses WFST to implement the decoding algorithm.
  • the main code of Kaldi is written in C++, and some tools are written as bash and Python scripts. For an introduction, see: https://zhuanlan.zhihu.com/p/84050431.
  • the ASR model can be retrained based on the sample audio (applicable to situations where the number of sample audios is large), or it can be obtained by fine-tuning the pre-trained ASR model based on the sample audio (applicable to situations where the number of sample audios is small).
  • the first ASR model is retrained based on the first sample audio, and the second ASR model is fine-tuned based on the second sample audio on the basis of the first ASR model.
  • the trained ASR model is used to extract content features in speech.
  • the content features are called BN (BottleNeck) features; they are usually the features of the last layer of the ASR model, which retain the content of the speech while eliminating other characteristics such as timbre and pitch.
  • the training process of the first ASR model includes: the computer device inputs the first sample audio into the first ASR model for text extraction to obtain a first predicted text; the computer device calculates the loss function value between the first predicted text and the first sample text corresponding to the first sample audio; the computer device updates the model parameters of the first ASR model based on the loss function value between the first predicted text and the first sample text, thereby realizing the training of the first ASR model.
  • the training process of the second ASR model includes: the computer device inputs the second sample audio into the second ASR model for text extraction to obtain a second predicted text; the computer device calculates the loss function value between the second predicted text and the second sample text corresponding to the second sample audio; the computer device updates the model parameters of the second ASR model based on the loss function value between the second predicted text and the second sample text, thereby realizing the training of the second ASR model.
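  • A minimal sketch of one such ASR training step is given below; the patent does not prescribe a loss function or framework, so this sketch assumes a PyTorch model that maps acoustic features to per-frame token log-probabilities and uses a CTC objective, and names such as asr_model, audio_feats and token_ids are illustrative. The same routine would apply to both the first ASR model (trained from scratch) and the second ASR model (fine-tuned).

```python
import torch
import torch.nn.functional as F

def asr_training_step(asr_model, optimizer, audio_feats, feat_lens, token_ids, token_lens):
    """One gradient step for an ASR model, sketched with a CTC loss.

    audio_feats: (batch, frames, feat_dim) acoustic features of the sample audio
    token_ids:   (batch, max_tokens) integer-encoded sample text (padded)
    """
    # Per-frame log-probabilities over the token vocabulary: (batch, frames, vocab)
    log_probs = asr_model(audio_feats).log_softmax(dim=-1)

    # F.ctc_loss expects (frames, batch, vocab)
    loss = F.ctc_loss(log_probs.transpose(0, 1), token_ids,
                      feat_lens, token_lens, blank=0, zero_infinity=True)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```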
  • Step 202 training a first conversion model based on a first sample text corresponding to a first sample audio and a first sample content feature, wherein the first sample content feature is extracted by a first ASR model from the first sample audio, and the first conversion model is used to convert the text into content features of a first accent.
  • a data augmentation scheme is adopted to realize content feature conversion between non-parallel corpora (that is, corpora that correspond to different accents and to different texts).
  • the computer device extracts features from the first sample audio through the trained first ASR model to obtain first sample content features of the first sample audio, thereby training a first conversion model based on the first sample text corresponding to the first sample audio and the first sample content features.
  • the first conversion model can be called a text content feature conversion model (Text2BN model), which is used to realize the conversion between text and source accent content features.
  • the training process of the first conversion model includes: the computer device inputs the first sample text into the first conversion model to obtain the first predicted content feature; the computer device extracts the first sample audio through the first ASR model to obtain the first sample content feature; the computer device calculates the loss function value between the first predicted content feature and the first sample content feature; the computer device updates the model parameters of the first conversion model based on the loss function value between the first predicted content feature and the first sample content feature, thereby realizing the training of the first conversion model.
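  • A minimal sketch of one Text2BN training step, assuming the first ASR model is already trained and frozen; the return_bn flag used to obtain its last-layer BN features is a hypothetical interface, not something specified by the patent.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_bn_features(asr_model, audio_feats):
    # Hypothetical helper: run the frozen ASR model and keep only its
    # last-layer BottleNeck (BN) features as the content representation.
    return asr_model(audio_feats, return_bn=True)

def text2bn_training_step(text2bn_model, asr1_model, optimizer, text_ids, audio_feats):
    bn_target = extract_bn_features(asr1_model, audio_feats)  # first sample content feature
    bn_pred = text2bn_model(text_ids)                         # first predicted content feature
    loss = F.mse_loss(bn_pred, bn_target)                     # Loss_Text2BN
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```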
  • Step 203 constructing parallel sample data based on the first conversion model, the second sample text corresponding to the second sample audio, and the second sample content feature, wherein the second sample content feature is extracted by the second ASR model from the second sample audio, and the parallel sample data includes different content features, different content features correspond to different accents, and different content features correspond to the same text.
  • the computer device performs text conversion on the second sample text corresponding to the second sample audio based on the first conversion model to obtain content features of the first accent corresponding to the second sample text; the computer device aggregates the content features of the first accent corresponding to the second sample text and the content features of the second accent corresponding to the second sample text to obtain parallel sample data.
  • the content features of the second accent corresponding to the second sample text are extracted by a second ASR model.
  • After the first conversion model is trained, the computer device performs data augmentation based on the second sample text corresponding to the second sample audio and the first conversion model, thereby constructing parallel sample data from the second sample content features and the content features of the first accent obtained through data augmentation.
  • the parallel sample data includes the content features of the first accent corresponding to the same text (generated by the first conversion model) and the content features of the second accent (extracted by the second ASR model).
  • For example, the computer device can construct parallel sample data corresponding to text A based on the first conversion model, text A, and the dialect sample content features of the dialect sample audio corresponding to text A; the parallel sample data then includes the Mandarin content features and the dialect content features corresponding to text A.
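  • A sketch of how the parallel sample data could be assembled, under the assumption that text2bn_model and asr2_model are already trained and that each dataset item carries both the second sample audio features and its text ids; all names and the return_bn flag are illustrative.

```python
import torch

@torch.no_grad()
def build_parallel_samples(text2bn_model, asr2_model, second_accent_dataset):
    """Pair first-accent and second-accent content features for the same text."""
    parallel_samples = []
    for item in second_accent_dataset:  # each item: second sample audio + its text
        # Content features of the FIRST accent, generated from the second sample text
        bn_first = text2bn_model(item["text_ids"])
        # Content features of the SECOND accent, extracted from the second sample audio
        bn_second = asr2_model(item["audio_feats"], return_bn=True)
        parallel_samples.append((bn_first, bn_second))  # same text, different accents
    return parallel_samples
```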
  • Step 204 training a second conversion model based on the parallel sample data, where the second conversion model is used to convert content features between the first accent and the second accent.
  • the computer device trains a second conversion model based on parallel sample data corresponding to the same text.
  • the second conversion model can be called a content feature conversion model (BN2BN model), which is used to convert the content features of the source accent into the content features of the target accent.
  • The BN2BN model is used to implement the accent transfer task; for an introduction, see: https://zhuanlan.zhihu.com/p/586037409.
  • the computer device trains a second conversion model for converting content features of Mandarin into content features of the dialect.
  • the sample content features corresponding to the first sample audio and the second sample audio can be directly used to train the second conversion model.
  • the training process of the second conversion model includes: the computer device extracts the second sample audio through the second ASR model to obtain the second sample content feature; the computer device converts the second sample text corresponding to the second sample audio through the first conversion model to obtain the third sample content feature, and the third sample content feature refers to the content feature of the audio generated by expressing the second sample text in the first accent; the computer device inputs the third sample content feature into the second conversion model to obtain the second predicted content feature; the computer device calculates the loss function value between the second predicted content feature and the second sample content feature; the computer device updates the model parameters of the second conversion model based on the loss function value between the second predicted content feature and the second sample content feature, thereby realizing the training of the second conversion model.
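  • Continuing the same illustrative assumptions, one BN2BN training step on a parallel pair could look as follows (a sketch, not the patent's own implementation):

```python
import torch.nn.functional as F

def bn2bn_training_step(bn2bn_model, optimizer, bn_first, bn_second):
    bn_pred = bn2bn_model(bn_first)        # second predicted content feature
    loss = F.mse_loss(bn_pred, bn_second)  # Loss_BN2BN against the second sample content feature
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```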
  • Step 205 training a third conversion model based on sample content features of different sample audios, where the third conversion model is used to convert the content features into audios.
  • the third conversion model can be called a content-audio conversion model, which is used to convert content features of the target accent into audio of the target accent.
  • the third conversion model may include an acoustic model and a vocoder, wherein the acoustic model is used to generate an audio spectrum based on the content feature, and the vocoder is used to generate audio based on the audio spectrum.
  • the samples for training the third conversion model may be sample audios of various accents.
  • the training of the third conversion model can be performed once the ASR model training is completed; that is, the third conversion model can be trained in parallel with the first conversion model and the second conversion model.
  • the embodiment of the present application does not limit the training sequence of the model.
  • the training process of the third conversion model includes: the computer device inputs the sample content features and the speaker identifier corresponding to the sample audio into the third conversion model to generate audio and obtain predicted audio; the computer device calculates the loss function value between the predicted audio and the sample audio; the computer device updates the model parameters of the third conversion model based on the loss function value between the predicted audio and the sample audio, thereby realizing the training of the third conversion model.
  • Step 206 Generate a speech conversion model based on the trained first ASR model, the second conversion model, and the third conversion model, where the speech conversion model is used to convert the audio of the first accent into the audio of the second accent.
  • After the first ASR model, the second conversion model and the third conversion model are trained through the above steps, the computer device combines these models to obtain the final speech conversion model.
  • the models are spliced in the order: first ASR model → second conversion model → third conversion model; that is, the output of the first ASR model is input into the second conversion model, and the output of the second conversion model is input into the third conversion model.
  • For example, the speech conversion model trained to convert Mandarin into a dialect consists of a Mandarin ASR model, a Mandarin-to-dialect content conversion model, and a content-to-audio conversion model.
  • a first conversion model for converting text into content features is trained, thereby using the first conversion model and the second sample text corresponding to the second sample audio to construct parallel sample data containing the same text content but corresponding to different accents, and then using the parallel sample data to train a second conversion model for converting content features between different accents, and a third conversion model for converting content features into audio, to complete the training of the speech conversion model; during the model training process, the intermediate model obtained by training is used to construct parallel corpora, and there is no need to record parallel corpora of different accents before model training. While ensuring the quality of model training, the demand for manually recorded parallel corpora for model training can be reduced, which helps to improve the efficiency of model training and improve the training quality of the model when samples are insufficient.
  • the speech conversion method can be implemented by using the speech conversion model, and the speech conversion method is executed by a computer device, and the speech conversion model includes a first ASR model, a second conversion model, and a third conversion model.
  • When performing speech conversion, the computer device obtains a first accent audio corresponding to the first accent; the computer device extracts the first accent audio through the first ASR model to obtain a first content feature corresponding to the first accent; the computer device converts the first content feature into a second content feature through the second conversion model, where the second content feature corresponds to the second accent; the computer device then performs audio conversion on the second content feature through the third conversion model to obtain a second accent audio, thereby completing the speech conversion.
  • After receiving the first accent audio in the first accent, the computer device extracts content features through the first ASR model in the speech conversion model to obtain the first content features.
  • the computer device inputs the first content features extracted by the first ASR model into the second conversion model, and the second conversion model converts the content features between the first accent and the second accent to obtain the second content features in the second accent.
  • the first content feature and the second content feature correspond to the same text (both are texts corresponding to the first accent audio).
  • the second conversion model includes a convolution layer and N stacked FFT layers; after the computer device performs convolution processing on the first content feature through the convolution layer in the second conversion model, the convolution result is input into the N stacked FFT layers for conversion to obtain the second content feature.
  • the computer device inputs the second content feature and the speaker identifier of the speaker corresponding to the target timbre into a third conversion model to obtain the second accent audio.
  • the third conversion model includes a third conversion sub-model and a vocoder, wherein the third conversion sub-model is used to convert the content features into audio spectrum features, and the vocoder is used to generate audio based on the audio spectrum features.
  • the third conversion sub-model includes a convolution layer and N layers of stacked FFT
  • the audio spectrum features may be Mel spectrum features, MFCC (Mel Frequency Cepstrum Coefficient) features, etc., which is not limited in the embodiments of the present application.
  • the vocoder may be an autoregressive Wavenet or WaveRNN, or a non-autoregressive Hifigan or Melgan, etc., which is not limited in the embodiments of the present application.
  • In the following, the audio spectrum feature is described by taking the Mel spectrum feature as an example and the vocoder is described by taking hifigan as an example, but this is not a limitation.
  • the computer device inputs the second content feature and the speaker identifier into a third conversion sub-model to obtain an audio spectrum feature; the computer device inputs the audio spectrum feature into a vocoder to obtain a second accent audio.
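  • As a rough end-to-end sketch of the conversion flow described above, assuming the same hypothetical PyTorch interfaces as the earlier sketches (return_bn=True yielding BN features, speaker_id selecting the target timbre):

```python
import torch

@torch.no_grad()
def convert_accent(asr1_model, bn2bn_model, bn2mel_model, vocoder,
                   first_accent_audio_feats, speaker_id):
    # 1) First ASR model: first accent audio -> first content feature (BN)
    bn_first = asr1_model(first_accent_audio_feats, return_bn=True)
    # 2) Second conversion model: first accent BN -> second accent BN
    bn_second = bn2bn_model(bn_first)
    # 3) Third conversion model: BN + speaker id -> Mel spectrum -> waveform
    mel = bn2mel_model(bn_second, speaker_id)
    second_accent_audio = vocoder(mel)
    return second_accent_audio
```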
  • Figure 3 shows a flow chart of an accent conversion method provided by an exemplary embodiment of the present application. The method is executed by a computer device and includes the following steps.
  • Step 301 in response to an accent conversion instruction, extracting a first content feature of a first accent audio through a first ASR model, wherein the first content feature corresponds to a first accent, and the accent conversion instruction is used to instruct conversion of the audio from the first accent to a second accent.
  • the accent conversion instruction is triggered after the accent setting is completed.
  • In the Metaverse virtual character setting interface 41, in addition to the virtual character image setting option, a voice setting option is also included.
  • the user can set the timbre and accent of the virtual character through the voice setting option.
  • The user can then enter the Metaverse by triggering the enter button 42.
  • the computer device receives the accent conversion instruction, which includes the accent identifiers of the source accent and the target accent.
  • In this embodiment, the case where the source accent is the first accent and the target accent is the second accent is taken as an example for explanation.
  • After receiving the first accent audio in the first accent, the computer device extracts content features through the first ASR model in the speech conversion model to obtain a first content feature, which eliminates interference such as timbre and pitch and only retains features at the level of the expressed content.
  • the computer device uses the last layer BN feature of the first ASR model as the first content feature.
  • When it is necessary to convert Mandarin into a dialect, the computer device extracts features of the Mandarin audio 51 through a Mandarin ASR model 52 to obtain Mandarin content features 53.
  • Step 302 Convert the first content feature into a second content feature using a second conversion model, where the second content feature corresponds to a second accent.
  • the computer device inputs the first content feature extracted by the first ASR model into the second conversion model, and the second conversion model converts the content feature between the first accent and the second accent to obtain the second content feature under the second accent.
  • the first content feature and the second content feature correspond to the same text (both are texts corresponding to the audio of the first accent).
  • the BN2BN model 54 is used to convert content features between Mandarin and dialect. After obtaining the Mandarin content features 53 , the computer device further performs feature conversion on the Mandarin content features 53 through the BN2BN model 54 to obtain the dialect content features 55 .
  • Step 303 Perform audio conversion on the second content feature through a third conversion model to obtain second accent audio.
  • the computer device inputs the second content feature into a third conversion model, and the third conversion model generates a second accent audio based on the content feature.
  • the computer device inputs the dialect content feature 55 into the BN2Wav model 56 to obtain the dialect audio 57 output by the BN2Wav model 56 .
  • the first conversion model serves as a key model for constructing parallel sample data.
  • the computer device inputs the first sample text into the first conversion model to obtain the first predicted content feature output by the first conversion model, thereby training the first conversion model with the first sample content feature as the supervision of the first predicted content feature.
  • the computer device uses the first sample content feature as the supervision of the first predicted content feature, determines the first conversion model loss based on the feature difference between the first predicted content feature and the first sample content feature, and trains the first conversion model based on the first conversion model loss.
  • the loss may be an MSE (Mean Square Error) loss or other types of losses, which are not limited in this embodiment.
  • the mean square error refers to the average of the sum of squares of feature difference values between the first predicted content feature and the first sample content feature, that is, the average of the sum of squares of errors.
  • the loss of the first conversion model Text2BN can be expressed as: Loss_Text2BN = MSE(BN_na, BN_na'), where BN_na is the first sample content feature extracted by the first ASR model, and BN_na' is the first predicted content feature output by the first conversion model.
  • the first conversion model includes a first conversion sub-model, a duration prediction sub-model and a second conversion sub-model, wherein the first conversion sub-model is used to realize the conversion between text and text encoding features, the duration prediction sub-model is used to predict the expression duration of the text, and the second conversion sub-model is used to convert the text encoding features into content features.
  • Step 601 encode a first sample text through a first conversion sub-model to obtain a first text encoding feature.
  • an N-layer stacked FFT (Feed Forward Transformer) is used to form the first conversion sub-model.
  • the FFT is used to map the data to a high-dimensional space and then to a low-dimensional space through linear transformation, so as to extract deeper features.
  • the FFT includes a multi-head attention mechanism layer and a convolution layer.
  • the FFT structure is shown in FIG7 .
  • the original input is first processed by the multi-head attention layer 701; the multi-head results obtained by the multi-head attention layer 701 and the original input are then processed by the weighting and normalization 702, and the result is input into the convolution layer 703 for convolution processing.
  • the input and output of the convolution layer 703 are added together and then processed by the weighting and normalization 702 to obtain the final output.
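  • A possible PyTorch rendering of one FFT block as described (multi-head self-attention followed by a residual add and normalization, then a convolutional sub-layer followed by another add and normalization); the dimension, head count and kernel size are illustrative choices, not values specified by the patent.

```python
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """One Feed-Forward Transformer (FFT) block: attention -> add & norm -> conv -> add & norm."""

    def __init__(self, dim=256, num_heads=4, conv_kernel=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, conv_kernel, padding=conv_kernel // 2),
            nn.ReLU(),
            nn.Conv1d(dim, dim, conv_kernel, padding=conv_kernel // 2),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                    # x: (batch, seq_len, dim)
        attn_out, _ = self.attn(x, x, x)     # multi-head self-attention
        x = self.norm1(x + attn_out)         # residual add & norm
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)  # convolve over the time axis
        x = self.norm2(x + conv_out)         # residual add & norm; output keeps the input size
        return x
```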
  • the first conversion sub-model obtained by superimposing multiple layers of FFT is used for text encoding to improve the quality of text encoding.
  • the first conversion sub-model can also be implemented using other types of modules such as LSTM (Long Short-Term Memory) (which needs to include an attention mechanism and keep the input and output sizes consistent), which is not limited in the embodiments of the present application.
  • Step 602 perform duration prediction on the first text encoding feature through the duration prediction sub-model to obtain a predicted duration, and the predicted duration is used to represent the pronunciation duration of the first sample text.
  • Since spoken text has a certain duration, in order to improve the authenticity of the audio obtained through subsequent conversion (so that the converted speech conforms to a real person's speaking speed), the computer device performs duration prediction through the duration prediction sub-model to obtain the pronunciation duration of the first sample text.
  • the predicted duration includes the pronunciation sub-durations corresponding to each sub-text in the first sample text. For example, if the first sample text is "今天天气真好" ("The weather is really good today"), the predicted duration includes a pronunciation sub-duration for each of the characters "今", "天", "天", "气", "真" and "好".
  • Step 603 Expand the first text encoding feature based on the predicted duration to obtain a second text encoding feature.
  • the computer device performs feature expansion on the first text encoding feature based on the predicted duration, and copies the sub-features in the first text encoding feature so that the duration corresponding to the copied sub-features is consistent with the pronunciation sub-duration of the corresponding sub-text.
  • the first text coding feature is "abcd”
  • the second text coding feature after feature expansion is "aabbbcdddd”.
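  • The expansion can be sketched as follows; a FastSpeech-style length regulator is one common way to realize it, and the toy example reproduces the "abcd" → "aabbbcdddd" case with per-position durations [2, 3, 1, 4].

```python
import torch

def expand_by_duration(encodings, durations):
    """Repeat each time step of the text encoding according to its predicted duration.

    encodings: (seq_len, dim) first text encoding feature
    durations: (seq_len,) integer pronunciation sub-durations
    """
    return torch.repeat_interleave(encodings, durations, dim=0)

# Example: one toy feature vector per sub-text "a", "b", "c", "d" with durations 2, 3, 1, 4
enc = torch.eye(4)
dur = torch.tensor([2, 3, 1, 4])
expanded = expand_by_duration(enc, dur)   # 10 frames: a a b b b c d d d d
```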
  • Step 604 Convert the second text encoding feature into a first predicted content feature through a second conversion sub-model.
  • the feature size of the first predicted content feature output by the second conversion sub-model is consistent with the feature size of the second text encoding feature input into the second conversion sub-model.
  • the second conversion sub-model includes N layers of FFT to improve the conversion quality of text encoding features to content features.
  • the first conversion sub-model 81 first performs feature encoding on the first sample text to obtain the first text encoding feature, and inputs the first text encoding feature into the duration prediction sub-model 82 to obtain the predicted duration, and performs feature expansion processing on the first text encoding feature based on the predicted duration to obtain the second text encoding feature.
  • the second conversion sub-model 83 performs feature conversion on the second text encoding feature to obtain the first predicted content feature.
  • the process may include the following steps.
  • Step 901 converting the second sample text through the first conversion model to obtain a third sample content feature, where the third sample content feature refers to a content feature of the audio generated by expressing the second sample text in the first accent.
  • When constructing parallel sample data based on the second sample audio, the computer device performs content feature conversion on the second sample text corresponding to the second sample audio to obtain the third sample content feature. Since the first conversion model is used to convert text into content features of the first accent, the first conversion model is used to perform this conversion on the second sample text to obtain the third sample content feature.
  • the content feature is the content feature of the audio generated by expressing the second sample text in the first accent.
  • the content features of the parallel corpus can be generated, eliminating the process of manually recording the parallel corpus and extracting content features from the parallel corpus.
  • Step 902 construct parallel sample data based on the second sample content feature and the third sample content feature.
  • Step 903 Input the third sample content feature into the second conversion model to obtain a second predicted content feature.
  • the second conversion model includes a convolution layer and N layers of stacked FFT, wherein the specific structure of FFT can refer to FIG7, and this embodiment is not repeated here.
  • the content feature is first processed by the convolution layer, and then processed by the N layers of FFT to obtain the converted content feature.
  • the convolution result is input into the N-layer FFT 1002 to obtain the second predicted content feature.
  • Step 904 training a second conversion model using the second sample content feature as supervision for the second predicted content feature.
  • the computer device determines the second conversion model loss based on the difference between the second sample content feature and the second predicted content feature, thereby training the second conversion model based on the second conversion model loss.
  • the loss may be an MSE loss or other types of losses, which is not limited in this embodiment.
  • the loss of the second conversion model BN2BN can be expressed as: Loss_BN2BN = MSE(BN_ac, BN_ac'), where BN_ac is the second sample content feature extracted by the second ASR model, and BN_ac' is the second predicted content feature output by the second conversion model.
  • the speaker identifier of the sample audio needs to be taken as part of the input so that the trained third conversion model can output audio with a specific timbre.
  • the computer device inputs the sample content feature and the speaker identifier corresponding to the sample audio into the third conversion model to obtain the predicted audio, thereby training the third conversion model based on the predicted audio and the sample audio.
  • the predicted audio and the sample audio correspond to the same audio content and have the same timbre.
  • in some embodiments, different speakers correspond to different speaker identifiers.
  • in other embodiments, speakers are pre-classified by timbre, so that different speakers with the same timbre are assigned the same speaker identifier.
  • the third conversion model includes a third conversion sub-model and a vocoder, wherein the third conversion sub-model is used to convert content features into audio spectrum features, and the vocoder is used to generate audio based on the audio spectrum features.
  • the third conversion sub-model includes a convolution layer and N layers of stacked FFT.
  • the audio spectrum features may be Mel spectrum features, MFCC (Mel Frequency Cepstrum Coefficient) features, etc., which is not limited in the embodiments of the present application.
  • the vocoder may be an autoregressive Wavenet or WaveRNN, or a non-autoregressive hifigan or melgan, etc., which is not limited in the embodiments of the present application.
  • In the following, the audio spectrum feature is described by taking the Mel spectrum feature as an example and the vocoder is described by taking hifigan as an example, but this is not a limitation.
  • the computer device inputs the sample content features and the speaker identifier into the third conversion sub-model to obtain the predicted audio spectrum features, and then inputs the predicted audio spectrum features into the vocoder to obtain the predicted audio.
  • the BN2Wav model includes a BN2Mel sub-model 1101 and a hifigan sub-model 1102, wherein the BN2Mel sub-model 1101 includes a convolution layer 11011 and an N-layer stacked FFT 11012.
  • the computer device inputs the sample content feature BN and the speaker identifier spk_id of the sample audio into the BN2Mel sub-model 1101.
  • the BN2Mel sub-model 1101 inputs the converted Mel spectrum into the hifigan sub-model 1102, which converts the Mel spectrum into the predicted audio.
  • the computer device jointly trains the third conversion sub-model and the vocoder.
  • the computer device first trains the third conversion sub-model and then trains the vocoder based on the trained third conversion sub-model, so as to improve the training efficiency.
  • the training process of the third conversion model may include the following steps.
  • Step 1201 Input sample content features and speaker identification into a third conversion sub-model to obtain predicted audio spectrum features.
  • the computer device inputs the sample content feature and the speaker identifier into the third conversion sub-model to obtain a predicted Mel spectrum corresponding to the sample audio.
  • Step 1202 train the third conversion sub-model using the sample audio spectrum features of the sample audio as supervision for the predicted audio spectrum features.
  • the computer device extracts audio spectrum features from the sample audio to obtain sample audio spectrum features, determines a third conversion sub-model loss based on the difference between the predicted audio spectrum features and the sample audio spectrum features, and trains the third conversion sub-model based on this loss.
  • the loss may be an MSE loss or other types of losses, which is not limited in this embodiment.
  • the loss of the third conversion sub-model BN2Mel can be expressed as: Loss_BN2Mel = MSE(Mel, Mel'), where Mel is the sample audio spectrum feature extracted directly from the sample audio, and Mel' is the predicted audio spectrum feature output by the third conversion sub-model.
  • Step 1203 when the training of the third conversion sub-model is completed, the predicted audio spectrum features output by the third conversion sub-model after the training are input into the vocoder to obtain the predicted audio.
  • After completing the training of the third conversion sub-model, the computer device inputs the sample content features and the speaker identifier into the trained third conversion sub-model to obtain the predicted audio spectrum features, and then inputs the predicted audio spectrum features into the vocoder to obtain the predicted audio output by the vocoder.
  • a computer device inputs the predicted Mel spectrum features output by the trained BN2Mel sub-model into hifigan to obtain the predicted audio output by hifigan.
  • Step 1204 train a vocoder in a third conversion model based on the predicted audio and the sample audio.
  • the computer device determines a conversion loss of a vocoder using the sample audio as supervision for the predicted audio, thereby training the vocoder based on the loss.
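  • A sketch of this two-stage schedule (train the BN2Mel sub-model on an MSE objective first, then train the vocoder on the Mel spectra predicted by the frozen sub-model); the interfaces are the same hypothetical ones as in the earlier sketches, and the vocoder loss is left abstract (vocoder_loss_fn) because its adversarial form is described next.

```python
import torch
import torch.nn.functional as F

def train_bn2mel_step(bn2mel_model, optimizer, bn_feats, speaker_id, mel_target):
    mel_pred = bn2mel_model(bn_feats, speaker_id)   # predicted audio spectrum feature
    loss = F.mse_loss(mel_pred, mel_target)         # Loss_BN2Mel
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def train_vocoder_step(bn2mel_model, vocoder, vocoder_optimizer,
                       bn_feats, speaker_id, audio_target, vocoder_loss_fn):
    with torch.no_grad():                           # BN2Mel is already trained and frozen here
        mel_pred = bn2mel_model(bn_feats, speaker_id)
    audio_pred = vocoder(mel_pred)
    loss = vocoder_loss_fn(audio_pred, audio_target)
    vocoder_optimizer.zero_grad()
    loss.backward()
    vocoder_optimizer.step()
    return loss.item()
```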
  • When the vocoder adopts an adversarial network (taking hifigan as an example), the computer device adopts the idea of adversarial training, in which the generator and the discriminator are trained against each other.
  • the Mel spectrum features obtained by reconverting the audio G(s) generated by the generator is the Mel spectrum feature extracted from the sample audio;
  • L FM (G; D) is the feature matching loss between the generated audio and the sample audio;
  • LG (G; D) is the discriminant loss of the generated audio.
  • LD (G;D) (D(x)-1) 2+ (D(G(s))) 2
  • D(x) is the discriminator's discrimination result for the sample audio
  • D(G(s)) is the discriminator's discrimination result for the predicted audio.
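A minimal sketch of these two losses, assuming the least-squares form written above, a single discriminator and L1 distances for the feature matching and Mel terms; the relative weighting of the generator terms is omitted, and none of this is presented as the exact hifigan configuration.

    import torch
    import torch.nn.functional as F

    def generator_loss(d_fake, fmaps_fake, fmaps_real, mel_fake, mel_real):
        # L_G = L_G(G;D) + L_FM(G;D) + L_mel(G)
        l_adv = torch.mean((d_fake - 1.0) ** 2)              # discriminant loss of the generated audio
        l_fm = sum(F.l1_loss(ff, fr)                         # feature matching loss over discriminator
                   for ff, fr in zip(fmaps_fake, fmaps_real))  # feature maps of generated vs. sample audio
        l_mel = F.l1_loss(mel_fake, mel_real)                # Mel loss: Mel of G(s) vs. Mel of the sample audio
        return l_adv + l_fm + l_mel

    def discriminator_loss(d_real, d_fake):
        # L_D(G;D) = (D(x) - 1)^2 + (D(G(s)))^2
        return torch.mean((d_real - 1.0) ** 2) + torch.mean(d_fake ** 2)

    # Toy example with random stand-ins for discriminator scores, feature maps and Mel spectra.
    d_real, d_fake = torch.rand(4), torch.rand(4)
    fmaps_real, fmaps_fake = [torch.randn(4, 16, 100)], [torch.randn(4, 16, 100)]
    mel_real, mel_fake = torch.randn(4, 200, 80), torch.randn(4, 200, 80)
    print(generator_loss(d_fake, fmaps_fake, fmaps_real, mel_fake, mel_real).item(),
          discriminator_loss(d_real, d_fake).item())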
  • the third conversion model trained in the above manner can not only convert content features into audio, but also add a specific timbre to the converted audio. Accordingly, during application, in addition to selecting a target accent, the user can also select a target timbre.
  • when the accent conversion instruction includes a target timbre, the computer device inputs the second content feature and the speaker identifier of the speaker corresponding to the target timbre into the third conversion model to obtain second accent audio, wherein the second accent audio has the second accent and the target timbre.
  • when Mandarin needs to be converted into a dialect with a target timbre, the computer device extracts features of Mandarin audio 1301 through the Mandarin ASR model 1302 to obtain Mandarin content features 1303.
  • the computer device further performs feature conversion on Mandarin content features 1303 through BN2BN model 1304 to obtain dialect content features 1305.
  • the computer device inputs the dialect content features 1305 and the speaker identifier corresponding to the target timbre 1306 into BN2Wav model 1307 to obtain dialect audio 1308 with the target timbre output by BN2Wav model 1307.
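Putting the trained pieces together, the Mandarin-to-dialect conversion with a target timbre shown in FIG. 13 amounts to chaining the three models. The sketch below is schematic: asr_model, bn2bn_model and bn2wav_model are placeholders for the trained Mandarin ASR model, BN2BN model and BN2Wav model, and extract_bn_features is a hypothetical helper that would return the last-layer BN features of the ASR model for the input audio.

    import torch

    def extract_bn_features(asr_model, wave):
        # Hypothetical helper: in practice this returns the last-layer bottleneck (BN)
        # features of the trained ASR model rather than its recognition result.
        return asr_model(wave)

    def mandarin_to_dialect(wave, target_spk_id, asr_model, bn2bn_model, bn2wav_model):
        # Schematic accent + timbre conversion pipeline (FIG. 13).
        with torch.no_grad():
            bn_mandarin = extract_bn_features(asr_model, wave)      # 1301 -> 1303
            bn_dialect = bn2bn_model(bn_mandarin)                   # 1303 -> 1305
            dialect_wave = bn2wav_model(bn_dialect, target_spk_id)  # 1305 + 1306 -> 1308
        return dialect_wave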
  • in this embodiment, during the training of the third conversion model, in addition to the content features of the sample audio, the speaker identifier corresponding to the sample audio is also taken as input, so that the third conversion model can perform audio conversion based on both the content features and the speaker's timbre features during training.
  • in subsequent use, for the same text content, by inputting different speaker identifiers, the third conversion model can output audio with different timbres, thereby achieving dual conversion of accent and timbre.
  • FIG. 14 is a structural block diagram of a training device for a speech conversion model provided by an exemplary embodiment of the present application, the device comprising:
  • a training module 1401 configured to train a first ASR model based on a first sample audio, and to train a second ASR model based on a second sample audio, wherein the first sample audio corresponds to a first accent, and the second sample audio corresponds to a second accent;
  • the training module 1401 is further used to train a first conversion model based on a first sample text and a first sample content feature corresponding to the first sample audio, wherein the first sample content feature is obtained by extracting the first sample audio by the first ASR model, and the first conversion model is used to convert the text into content features of the first accent;
  • the training module 1401 is further used to construct parallel sample data based on the first conversion model, a second sample text corresponding to the second sample audio, and a second sample content feature, wherein the second sample content feature is extracted by the second ASR model from the second sample audio, and the parallel sample data includes different content features, and different content features correspond to different accents, and different content features correspond to the same text; train a second conversion model based on the parallel sample data, and the second conversion model is used to convert content features between the first accent and the second accent;
  • the training module 1401 is further used to train a third conversion model based on sample content features of different sample audios, wherein the third conversion model is used to convert the content features into audio;
  • the generation module 1402 is used to generate a speech conversion model based on the first ASR model, the second conversion model and the third conversion model obtained through training, wherein the speech conversion model is used to convert audio in a first accent into audio in a second accent.
  • the training module 1401 is used to:
  • the second sample text is converted by the first conversion model to obtain a third sample content feature, where the third sample content feature refers to a content feature of an audio generated by expressing the second sample text in the first accent;
  • the parallel sample data is constructed based on the second sample content feature and the third sample content feature.
  • the training module 1401 is used to: input the third sample content feature into the second conversion model to obtain a second predicted content feature;
  • the second conversion model is trained by using the second sample content feature as supervision of the second predicted content feature.
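As an illustration of how such parallel sample data could drive the second conversion model, the sketch below builds one parallel pair and applies one MSE-supervised training step; text2bn_model, asr2_model and bn2bn_model are placeholders for the trained first conversion model, the trained second ASR model and the second conversion model being trained, and the alignment of the generated and extracted feature sequences to a common number of frames is glossed over.

    import torch
    import torch.nn.functional as F

    def build_parallel_pair(second_text, second_wave, text2bn_model, asr2_model):
        # One entry of the parallel sample data: the same text, two accents.
        with torch.no_grad():
            bn_first_accent = text2bn_model(second_text)   # third sample content feature (generated)
            bn_second_accent = asr2_model(second_wave)     # second sample content feature (extracted)
        return bn_first_accent, bn_second_accent

    def bn2bn_step(pair, bn2bn_model, optimizer):
        bn_first_accent, bn_second_accent = pair
        pred = bn2bn_model(bn_first_accent)                # second predicted content feature
        loss_bn2bn = F.mse_loss(pred, bn_second_accent)    # supervised by the extracted feature
        optimizer.zero_grad()
        loss_bn2bn.backward()
        optimizer.step()
        return loss_bn2bn.item()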
  • the training module 1401 is used to: input the first sample text into the first conversion model to obtain a first predicted content feature output by the first conversion model;
  • the first conversion model is trained by using the first sample content feature as supervision of the first predicted content feature.
  • the first conversion model includes a first conversion sub-model, a duration prediction sub-model and a second conversion sub-model;
  • the training module 1401 is used to:
  • encode the first sample text through the first conversion sub-model to obtain a first text encoding feature;
  • perform duration prediction on the first text encoding feature through the duration prediction sub-model to obtain a predicted duration, wherein the predicted duration is used to represent the pronunciation duration of the first sample text;
  • perform feature expansion on the first text encoding feature based on the predicted duration to obtain a second text encoding feature;
  • convert the second text encoding feature into the first predicted content feature through the second conversion sub-model.
  • the first conversion sub-model and the second conversion sub-model include FFT, and the FFT includes a multi-head attention mechanism layer and a convolution layer.
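The two building blocks referred to here, the FFT block (multi-head self-attention followed by a convolution, each with a residual connection and normalization) and the duration-based feature expansion, can be sketched as follows; the kernel size, head count and the rounding of predicted durations to whole frames are illustrative assumptions.

    import torch
    import torch.nn as nn

    class FFTBlock(nn.Module):
        # Feed-forward transformer block: multi-head self-attention + convolution,
        # each followed by a residual connection and layer normalization.
        def __init__(self, dim=256, heads=4, kernel_size=3):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, x):                                # x: (batch, frames, dim)
            attn_out, _ = self.attn(x, x, x)
            x = self.norm1(x + attn_out)                     # residual + normalization
            conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
            return self.norm2(x + conv_out)                  # residual + normalization

    def expand_by_duration(text_encoding, durations):
        # Repeat each sub-text encoding by its predicted pronunciation duration in frames,
        # e.g. 'abcd' with durations (2, 3, 1, 4) -> 'aabbbcdddd'.
        frames = durations.round().clamp(min=1).long()
        return torch.repeat_interleave(text_encoding, frames, dim=0)

    # Example: a 4-token first text encoding feature expanded to 10 frames, then refined.
    enc = torch.randn(4, 256)
    expanded = expand_by_duration(enc, torch.tensor([2.0, 3.0, 1.0, 4.0]))  # (10, 256)
    refined = FFTBlock()(expanded.unsqueeze(0))                              # (1, 10, 256)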
  • the training module 1401 is used to:
  • input the sample content features and the speaker identifier corresponding to the sample audio into the third conversion model to obtain predicted audio, wherein the predicted audio corresponds to the same audio content as the sample audio and has the same timbre, and different speakers correspond to different speaker identifiers;
  • train the third conversion model based on the predicted audio and the sample audio.
  • the third conversion model includes a third conversion sub-model and a vocoder
  • the training module 1401 is used to:
  • input the sample content features and the speaker identifier into the third conversion sub-model to obtain the predicted audio spectrum features;
  • input the predicted audio spectrum features into the vocoder to obtain the predicted audio.
  • the training module 1401 is used to:
  • train the third conversion sub-model using the sample audio spectrum features of the sample audio as supervision for the predicted audio spectrum features;
  • when the training of the third conversion sub-model is completed, input the predicted audio spectrum features output by the trained third conversion sub-model into the vocoder to obtain the predicted audio;
  • train the vocoder in the third conversion model based on the predicted audio and the sample audio.
  • the device further comprises:
  • a conversion module configured to extract a first content feature of the first accent audio through the first ASR model in response to an accent conversion instruction, wherein the first content feature corresponds to the first accent, and the accent conversion instruction is used to instruct to convert the audio from the first accent to the second accent;
  • convert the first content feature into a second content feature through the second conversion model, wherein the second content feature corresponds to the second accent;
  • convert the second content feature into audio through the third conversion model to obtain second accent audio.
  • the accent conversion instruction includes a target timbre
  • the conversion module is used to:
  • the second content feature and the speaker identifier of the speaker corresponding to the target timbre are input into the third conversion model to obtain the second accent audio, wherein different speakers correspond to different speaker identifiers.
  • in the embodiments of the present application, in the absence of parallel corpora corresponding to the second sample audio of the second accent, a first conversion model for converting text into content features is first trained based on the first sample audio of the first accent; the first conversion model and the second sample text corresponding to the second sample audio are then used to construct parallel sample data containing the same text content but corresponding to different accents, and the parallel sample data are used to train a second conversion model for converting content features between different accents, together with a third conversion model for converting content features into audio, thereby completing the training of the speech conversion model. During model training, the intermediate models obtained by training are used to construct parallel corpora, so there is no need to record parallel corpora of different accents before training. While ensuring the quality of model training, this reduces the demand for manually recorded parallel corpora, helps to improve training efficiency, and improves the training quality of the model when samples are insufficient.
  • the device provided in the above embodiment is illustrated only by way of the division into the above functional modules.
  • in practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
  • the device and method embodiments provided in the above embodiment belong to the same concept, and the implementation process thereof is detailed in the method embodiment, which will not be repeated here.
  • FIG. 15 is a structural block diagram of a speech conversion device provided by an exemplary embodiment of the present application, the device comprising:
  • An acquisition module 1501 is used to acquire a first accent audio, where the first accent audio corresponds to a first accent;
  • An extraction module 1502 is configured to extract a first content feature from the first accent audio by using the first ASR model, where the first content feature corresponds to the first accent;
  • a content feature conversion module 1503 configured to convert the first content feature into a second content feature by using the second conversion model, wherein the second content feature corresponds to a second accent;
  • the audio conversion module 1504 is used to perform audio conversion on the second content feature through the third conversion model to obtain a second accent audio.
  • the content feature conversion module 1503 is further used to input the first content feature extracted by the first ASR model into a second conversion model, and the second conversion model performs content feature conversion between the first accent and the second accent to obtain a second content feature in the second accent.
  • the second conversion model includes a convolution layer and N layers of stacked FFT.
  • the content feature conversion module 1503 is further used to perform convolution processing on the first content feature through the convolution layer in the second conversion model, and to input the convolution result into the N layers of stacked FFT for conversion to obtain the second content feature.
  • the third conversion model includes a third conversion sub-model and a vocoder, wherein the third conversion sub-model is used to convert the content features into audio spectrum features, and the vocoder is used to generate audio based on the audio spectrum features.
  • the audio conversion module 1504 is further configured to input the second content feature and the speaker identifier into a third conversion sub-model to obtain an audio spectrum feature.
  • the audio conversion module 1504 is further used to input the audio spectrum features into the vocoder to obtain the second accent audio.
  • the third conversion sub-model includes a convolution layer and N layers of stacked FFT.
  • FIG. 16 shows a schematic diagram of the structure of a computer device provided by an exemplary embodiment of the present application.
  • the computer device can be a screen projection device or terminal in the above-mentioned embodiment.
  • the computer device 1600 includes a central processing unit (CPU) 1601, a system memory 1604 including a random access memory 1602 and a read-only memory 1603, and a system bus 1605 connecting the system memory 1604 and the central processing unit 1601.
  • the computer device 1600 also includes a basic input/output system (I/O system) 1606 that helps transmit information between various devices in the computer, and a large-capacity storage device 1607 for storing an operating system 1613, an application program 1614 and other program modules 1615.
  • the basic input/output system 1606 includes a display 1608 for displaying information and an input device 1609, such as a mouse or a keyboard, for the user to input information.
  • the display 1608 and the input device 1609 are connected to the central processing unit 1601 through an input/output controller 1610 connected to the system bus 1605.
  • the basic input/output system 1606 may also include an input/output controller 1610 for receiving and processing inputs from a plurality of other devices such as a keyboard, a mouse, or an electronic stylus.
  • the input/output controller 1610 also provides output to a display screen, a printer, or other types of output devices.
  • the mass storage device 1607 is connected to the central processing unit 1601 through a mass storage controller (not shown) connected to the system bus 1605.
  • the mass storage device 1607 and its associated computer readable media provide non-volatile storage for the computer device 1600. That is, the mass storage device 1607 may include a computer readable medium (not shown) such as a hard disk or drive.
  • the computer-readable medium may include computer storage media and communication media.
  • Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data.
  • the computer storage medium may be a random access memory (RAM), a read-only memory (ROM), a flash memory or other solid-state storage technology, a compact disc read-only memory (CD-ROM), a digital versatile disc (DVD) or other optical storage, a cassette, a magnetic tape, a disk storage or other magnetic storage device.
  • the computer storage medium is not limited to the above.
  • the system memory 1604 and the mass storage device 1607 may be collectively referred to as a memory.
  • the memory stores one or more programs, and the one or more programs are configured to be executed by one or more central processing units 1601.
  • the one or more programs contain instructions for implementing the above-mentioned methods.
  • the central processing unit 1601 executes the one or more programs to implement the methods provided by the above-mentioned various method embodiments.
  • the computer device 1600 can also be connected to a remote computer on the network through a network such as the Internet. That is, the computer device 1600 can be connected to the network 1612 through the network interface unit 1611 connected to the system bus 1605, or the network interface unit 1611 can be used to connect to other types of networks or remote computer systems (not shown).
  • An embodiment of the present application also provides a computer-readable storage medium, which stores at least one instruction, and the at least one instruction is loaded and executed by a processor to implement the training method of the speech conversion model described in the above embodiment, or the speech conversion method described in the above aspect.
  • the computer readable storage medium may include: ROM, RAM, solid state drives (SSD, Solid State Drives) or optical disks, etc.
  • RAM may include resistance random access memory (ReRAM, Resistance Random Access Memory) and dynamic random access memory (DRAM, Dynamic Random Access Memory).
  • the embodiment of the present application provides a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the training method of the speech conversion model described in the above embodiment, or the speech conversion method described in the above aspect.

Abstract

A training method, apparatus, device and medium for a speech conversion model, comprising: training a first ASR model based on first sample audio, and training a second ASR model based on second sample audio (201); training a first conversion model based on first sample text and first sample content features corresponding to the first sample audio (202); constructing parallel sample data based on the first conversion model, second sample text corresponding to the second sample audio, and second sample content features (203); training a second conversion model based on the parallel sample data, the second conversion model being used for content feature conversion between a first accent and a second accent (204); training a third conversion model based on sample content features of different sample audio (205); and generating a speech conversion model based on the trained first ASR model, second conversion model and third conversion model (206).

Description

语音转换模型的训练方法、装置、设备及介质
本申请要求于2022年11月21日提交的申请号为202211455842.7、发明名称为“语音转换模型的训练方法、装置、设备及介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及音频处理技术领域,特别涉及一种语音转换模型的训练方法、装置、设备及介质。
背景技术
随着网络技术的不断发展,越来越多用户开始使用虚拟形象在网络中进行直播、游戏、社交或者在线会议。
为了保护个人隐私安全,用户在使用虚拟形象过程中,可以设置虚拟形象的口音,使原始口音的用户语音被转换为所设置口音后播放,并保证用户语音内容保持不变。相关技术中,口音转换通常使用语音转换模型实现,而在训练语音转换模型过程中需要大量的平行语料。其中,该平行语料为相同语音内容的不同口音音频。
然而,平行语料通常需要人工录制,导致平行语料的获取难度较高,在平行语料不足的情况下,训练得到语音转换模型的质量较差,进而影响口音转换效果。
发明内容
本申请实施例提供了一种语音转换模型的训练方法、装置、设备及介质,能够在降低对人工录制的平行语料的需求的前提下,保证语音转换模型的训练质量。所述技术方案如下:
一方面,本申请实施例提供了一种语音转换模型的训练方法,该方法由计算机设备执行,包括:
基于第一样本音频训练第一ASR(Automatic Speech Recognition,自动语音识别)模型,以及基于第二样本音频训练第二ASR模型,所述第一样本音频对应第一口音,所述第二样本音频对应第二口音;
基于所述第一样本音频对应的第一样本文本以及第一样本内容特征,训练第一转换模型,所述第一样本内容特征由所述第一ASR模型对所述第一样本音频进行提取得到,所述第一转换模型用于将文本转换为所述第一口音的内容特征;
基于所述第一转换模型、所述第二样本音频对应的第二样本文本以及第二样本内容特征,构建平行样本数据,所述第二样本内容特征由所述第二ASR模型对所述第二样本音频进行提取得到,所述平行样本数据包括不同内容特征,不同内容特征对应不同口音,且不同内容特征对应相同文本;
基于所述平行样本数据训练第二转换模型,所述第二转换模型用于对所述第一口音和所述第二口音间进行内容特征转换;
基于不同样本音频的样本内容特征训练第三转换模型,所述第三转换模型用于将内容特征转换为音频;
基于训练得到的所述第一ASR模型、所述第二转换模型和所述第三转换模型生成语音转换模型,所述语音转换模型用于将第一口音的音频转换为第二口音的音频。
一方面,本申请实施例提供了一种语音转换方法,所述方法由计算机设备执行,所述计算机设备中设置有语音转换模型,所述语音转换模型包括第一ASR模型、第二转换模型和第三转换模型,所述方法包括:
获取第一口音音频,所述第一口音音频对应第一口音;
通过所述第一ASR模型对所述第一口音音频提取得到第一内容特征,所述第一内容特征对应所述第一口音;
通过所述第二转换模型将所述第一内容特征转换为第二内容特征,所述第二内容特征对应第二口音;
通过所述第三转换模型对所述第二内容特征进行音频转换,得到第二口音音频。
另一方面,本申请实施例提供了一种语音转换模型的训练装置,所述装置包括:
训练模块,用于基于第一样本音频训练第一ASR模型,以及基于第二样本音频训练第二ASR模型,所述第一样本音频对应第一口音,所述第二样本音频对应第二口音;
所述训练模块,还用于基于所述第一样本音频对应的第一样本文本以及第一样本内容特征,训练第一转换模型,所述第一样本内容特征由所述第一ASR模型对所述第一样本音频进行提取得到,所述第一转换模型用于将文本转换为所述第一口音的内容特征;
所述训练模块,还用于基于所述第一转换模型、所述第二样本音频对应的第二样本文本以及第二样本内容特征,构建平行样本数据,所述第二样本内容特征由所述第二ASR模型对所述第二样本音频进行提取得到,所述平行样本数据包括不同内容特征,不同内容特征对应不同口音,且不同内容特征对应相同文本;基于所述平行样本数据训练第二转换模型,所述第二转换模型用于对所述第一口音和所述第二口音间进行内容特征转换;
所述训练模块,还用于基于不同样本音频的样本内容特征训练第三转换模型,所述第三转换模型用于将内容特征转换为音频;
生成模块,用于基于训练得到的所述第一ASR模型、所述第二转换模型和所述第三转换模型生成语音转换模型,所述语音转换模型用于将第一口音的音频转换为第二口音的音频。
另一方面,本申请实施例提供了一种语音转换装置,其中,所述装置包括:
获取模块,用于获取第一口音音频,所述第一口音音频对应第一口音;
提取模块,用于通过所述第一ASR模型对所述第一口音音频提取得到第一内容特征,所述第一内容特征对应所述第一口音;
内容特征转换模块,用于通过所述第二转换模型将所述第一内容特征转换为第二内容特征,所述第二内容特征对应第二口音;
音频转换模块,用于通过所述第三转换模型对所述第二内容特征进行音频转换,得到第二口音音频。
另一方面,本申请实施例提供了一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条指令,所述至少一条指令由所述处理器加载并执行以实现如上述方面所述的语音转换模型的训练方法,或,如上述方面所述的语音转换方法。
另一方面,本申请实施例提供了一种计算机可读存储介质,所述可读存储介质中存储有至少一条指令,所述至少一条指令由处理器加载并执行以实现如上述方面所述的语音转换模型的训练方法,或,如上述方面所述的语音转换方法。
另一方面,本申请实施例提供了一种计算机程序产品,所述计算机程序产品包括计算机指令,所述计算机指令存储在计算机可读存储介质中;计算机设备的处理器从所述计算机可读存储介质读取所述计算机指令,所述处理器执行所述计算机指令,使得所述计算机设备执行如上述方面所述的语音转换模型的训练方法,或,如上述方面所述的语音转换方法。
本申请实施例中,在缺少第二口音的第二样本音频对应平行语料的情况下,首先基于第一口音的第一样本音频,训练用于将文本转换为内容特征的第一转换模型,从而利用该第一转换模型以及第二样本音频对应的第二样本文本,构建得到包含对应相同文本内容但对应不同口音的平行样本数据,进而利用该平行样本数据训练在不同口音间进行内容特征转换的第二转换模型,以及用于将内容特征转换为音频的第三转换模型,完成语音转换模型训练;模型训练过程中,利用训练得到的中间模型构建平行语料,无需在模型训练前录制不同口音的 平行语料,能够在保证模型训练质量的情况下,降低模型训练对人工录制的平行语料的需求,有助于提高模型训练效率,并提高样本不足情况下模型的训练质量。
附图说明
图1示出了本申请一个示例性实施例提供的语音转换系统的示意图;
图2示出了本申请一个示例性实施例提供的语音转换模型的训练方法的流程图;
图3示出了本申请一个示例性实施例提供的口音转换方法的流程图;
图4是本申请一个示例性实施例示出的语音设置界面的示意图;
图5是本申请一个示例性实施例提供的口音转换过程的实施示意图;
图6是本申请一个示例性实施例示出的文本转内容特征过程的流程图;
图7是本申请一个示例性实施例提供的FFT结构图;
图8是本申请一个示例性实施例示出的第一转换模型的结构示意图;
图9是本申请一个示例性实施例示出的第二转换模型训练过程的流程图;
图10是本申请一个示例性实施例示出的第二转换模型的结构示意图;
图11是本申请一个示例性实施例示出的第三转换模型的结构示意图;
图12是本申请一个示例性实施例示出的第三转换模型训练过程的流程图;
图13是本申请另一个示例性实施例提供的口音转换过程的实施示意图;
图14是本申请一个示例性实施例提供的语音转换模型的训练装置的结构框图;
图15是本申请一个示例性实施例提供的语音转换装置的结构框图;
图16示出了本申请一个示例性实施例提供的计算机设备的结构示意图。
具体实施方式
为了降低模型训练过程对预先录制的平行语料的依赖,本申请实施例中,语音转换模型由第一ASR模型(用于将音频转换为文本)、第二转换模型(用于在不同口音间进行内容特征转换)以及第三转换模型(用于将内容特征转换为音频)构成。并且,在训练过程中,完成第一ASR模型训练后,训练用于将文本转换为内容特征的第一转换模型,从而借助第一转换模型构建平行样本数据,以用于后续第二转换模型以及第三转换模型训练。训练过程中,借助训练得到的转换模型构建平行语料,无需预先人工录制大量平行语料,从而降低训练过程对平行语料的依赖,保证模型训练质量。
需要说明的是,本申请所涉及的信息(包括但不限于用户设备信息、用户个人信息等)、数据(包括但不限于用于分析的数据、存储的数据、展示的数据等)以及信号,均为经用户授权或者经过各方充分授权的,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。例如,本申请中涉及到的音频、口音、文本都是在充分授权的情况下获取的。
采用本申请实施例提供的训练方法所训练得到的语音转换模型,能够适用于各种需要进行口音转换的场景。如图1所示,其示出了本申请一个示例性实施例示出的语音转换系统的示意图。该语音转换系统中包括:音频采集设备110、终端120以及服务器130。
音频采集设备110是用于采集用户语音的设备,该音频采集设备110可以是耳麦、麦克风或者具有收音功能的AR/VR设备等等,本申请实施例对此不作限定。
音频采集设备110与终端120之间通过有线或无线方式相连,用于将采集到的用户语音传输至终端120,由终端120进一步对用户语音进行口音转换处理。其中,终端120可以是具有智能手机、平板电脑、个人计算机、车载终端等电子设备。
在一些实施例中,终端120中设置有具有口音转换功能的应用程序。通过该应用程序,用户可以设置口音转换目标,从而实现将用户语音由原始语音转换为目标语音。
在一种可能的实施方式中,口音转换可以由终端120在本地实现(语音转换模型设置在终端120中);在另一种可能的实施方式中,口音转换可以由终端120借助服务器130实现(语音转换模型设置在服务器130中,终端120向服务器130传输口音转换需求)。
服务器130可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(Content Delivery Network,CDN)、以及大数据和人工智能平台等基础云计算服务的云服务器。本申请实施例中,服务器130可以为实现口音转换功能的后台服务器,用于提供不同口音间的转换服务。
在一些实施例中,服务器130中设置有多个语音转换模型,不同语音转换模型用于实现不同口音间的转换。比如,当支持将普通话转换为n种地方口音时,服务器130中设置有n个语音转换模型。
并且,在实现口音转换功能前,服务器130获取不同口音的口音语料,该口音语料由音频以及对应的文本构成,从而基于口音语料训练相应的语音转换模型。
如图1所示,在进行口音转换前,用户通过终端120设置将第一口音转换为第二口音,由终端120向服务器130发送口音转换请求,请求服务器130采用相应的语音转换模型(将第一口音转换为第二口音)进行口音转换。
音频采集设备110将采集到的第一口音的用户语音传输至终端120,由终端120将第一口音的用户语音传输至服务器130。服务器130通过语音转换模型将其转换为第二口音的用户语音,并反馈至终端120,由终端120进行进一步处理。
其中,不同应用场景下,终端120对用户语音的处理方式不同。下面采用集中示例性的应用场景进行说明。
1、虚拟人内容制作场景
虚拟人内容制作场景下,终端获取转换得到的用户语音后,将该用户语音与制作的内容(比如虚拟人短视频、虚拟人长视频等等)进行融合,得到虚拟人内容。其中,在进行融合时,可以根据转换得到的用户语音对虚拟人的嘴部进行控制,提高虚拟人嘴部动作与语音的匹配度。
示例性地,在虚拟人内容制作场景下,以制作虚拟人的短视频为例,终端获取真实用户对应的第一口音音频,第一口音音频对应真实用户的第一口音,终端通过语音转换模型中的第一ASR模型对第一口音音频提取得到第一口音下的第一内容特征;终端通过语音转换模型中的第二转换模型将第一内容特征转换为第二内容特征,第二内容特征对应第二口音;在转换完口音后,终端通过语音转换模型中的第三转换模型对第二口音下的第二内容特征进行音频转换,得到虚拟人对应的第二口音音频。
2、虚拟主播直播场景
虚拟主播直播场景下,虚拟主播可以通过口音设置界面预先设置直播口音。直播过程中,终端将通过麦克风采集的用户语音发送至服务器,由服务器将原始口音的用户语音转换为直播口音的用户语音,并反馈至终端。终端将直播口音的用户语音与包含虚拟主播形象的视频流进行合并,从而通过推流服务器将合并得到的音视频流推送至直播间内的各个观众客户端。
示例性地,在虚拟主播直播场景下,可预先设置虚拟主播直播时的口音。终端通过麦克风采集真实用户对应的第一口音音频,第一口音音频对应真实用户的第一口音,终端通过语音转换模型中的第一ASR模型对第一口音音频提取得到第一口音下的第一内容特征;终端通过语音转换模型中的第二转换模型将第一内容特征转换为第二内容特征,第二内容特征对应第二口音;在转换完口音后,终端通过语音转换模型中的第三转换模型对第二口音下的第二内容特征进行音频转换,得到虚拟主播对应的第二口音音频,即,虚拟主播以第二口音音频进行直播。
3、元宇宙场景
元宇宙场景下,用户可以设置在元宇宙中进行交互时采用的口音。用户控制虚拟角色在元宇宙中与其他虚拟角色进行交互时,用户语音由耳麦、AR/VR等设备采集并传输至终端,由终端进一步交由服务器进行口音转换,并在元宇宙中控制虚拟角色播放转换得到的口音音频,实现与其他虚拟角色之间的语音互动。
示例性地,在元宇宙场景下,可预先选择与其他虚拟角色进行交互时的第二口音。在交互时,终端通过麦克风采集真实用户对应的第一口音音频,第一口音音频对应真实用户的第一口音,终端通过语音转换模型中的第一ASR模型对第一口音音频提取得到第一口音下的第一内容特征;终端通过语音转换模型中的第二转换模型将第一内容特征转换为第二内容特征,第二内容特征对应第二口音;在转换完口音后,终端通过语音转换模型中的第三转换模型对第二口音下的第二内容特征进行音频转换,得到元宇宙中的虚拟角色对应的第二口音音频,即,元宇宙中的虚拟角色以第二口音音频与其他虚拟角色进行交互。
上述应用场景仅作为示例性的说明,采用本申请实施例提供的方法训练得到的语音转换模型,还可以用于语音通话(方便不同口音的通话对象之间进行语音交流)、翻译等真实世界应用场景,本申请实施例并不对此构成限定。
并且,为了方便表述,下述各个实施例中,以语音转换模型的训练以及使用均用于计算机设备(可以为终端或者服务器),且训练用于将第一口音转换为第二口音的语音转换模型为例进行说明(其他由源语音转换为目标语音的方案类似),但并不对此构成限定。
图2示出了本申请一个示例性实施例提供的语音转换模型的训练方法的流程图。该方法由计算机设备执行,该方法包括如下步骤。
步骤201,基于第一样本音频训练第一ASR模型,以及基于第二样本音频训练第二ASR模型,第一样本音频对应第一口音,第二样本音频对应第二口音。
其中,第一口音为源口音,第二口音为目标语音,即训练得到的语音转换模型用于将第一口音的语音转换为第二口音的语音。
在一些实施例中,第一样本音频对应有第一样本文本,第二样本音频对应有第二样本文本。本申请实施例中,第一样本文本无需与第二样本文本相同,因此可以直接采用公开语音数据集用于模型训练。
在一个示意性的例子中,计算机设备采用Wenet Speech数据集作为第一样本音频,采用KeSpeech数据集作为第二样本音频,其中,Wenet Speech数据集包括1万小时的ASR数据,可参见网址:https://zhuanlan.zhihu.com/p/424118791中的介绍;KeSpeech数据集包含不同地区方言的ASR数据,可参见网址:https://datasets-benchmarks-proceedings.neurips.cc/pap er/2021/hash/0336dcbab05b9d5ad24f4333c7658a0e-Abstract-round2.html中的介绍。
关于ASR模型的训练方式,在一种可能的实施方式中,计算机设备将样本音频输入ASR模型,得到ASR模型输出的预测文本,从而基于预测文本以及样本音频对应的样本文本对ASR模型进行训练。
可选的,ASR模型的模型架构包括但不限于Wenet、wav2vec2、Kaldi等等,本申请实施例对此不作限定。Wenet是出门问问语音团队联合西工大语音实验室开源的一款面向工业落地应用的语音识别工具包,该工具用一套简洁的方案提供了语音识别从训练到部署的一条龙服务,可参见网址:https://zhuanlan.zhihu.com/p/349586567中的介绍。wav2vec是Interspe ech 2019收录的文章中提出的,作者用无监督预训练的卷积神经网络来改善语音识别任务,提出了一种噪声对比学习二分类任务,从而使得wav2vec可以在大量未标注的数据上进行训练,可参见网址:https://zhuanlan.zhihu.com/p/302463174中的介绍。Kaldi是一种开源语音识别工具,它使用WFST来实现解码算法,Kaldi的主要代码是C++编写,在此之上使用bash和python脚本做了一些工具,可参见网址:https://zhuanlan.zhihu.com/p/84050431中的介 绍。
在一些实施例中,ASR模型可以基于样本音频重新训练得到(适用于样本音频数量较多的情况),也可以基于样本音频对预训练的ASR模型进行微调得到(适用于样本音频数量较少的情况)。
比如,当第一口音为普通话,第二口音为方言时,第一ASR模型基于第一样本音频重新训练得到,第二ASR模型在第一ASR模型的基础上,基于第二样本音频进行微调得到。
本申请实施例中,训练得到的ASR模型用于提取语音中的内容特征。在一些实施例中,该内容特征被称为BN(BottleNeck,瓶颈)特征,通常为ASR模型的最后一层特征,其保留了语音的内容特征,并剔除了诸如音色、音调等其他特征。
在一些实施例中,第一ASR模型的训练过程包括:计算机设备将第一样本音频输入第一ASR模型进行文本提取,得到第一预测文本;计算机设备计算得到第一预测文本与第一样本音频对应的第一样本文本之间的损失函数值;计算机设备基于第一预测文本与第一样本文本之间的损失函数值对第一ASR模型的模型参数进行更新,从而实现第一ASR模型的训练。
在一些实施例中,第二ASR模型的训练过程包括:计算机设备将第二样本音频输入第二ASR模型进行文本提取,得到第二预测文本;计算机设备计算得到第二预测文本与第二样本音频对应的第二样本文本之间的损失函数值;计算机设备基于第二预测文本与第二样本文本之间的损失函数值对第二ASR模型的模型参数进行更新,从而实现第二ASR模型的训练。
步骤202,基于第一样本音频对应的第一样本文本以及第一样本内容特征,训练第一转换模型,第一样本内容特征由第一ASR模型对第一样本音频进行提取得到,第一转换模型用于将文本转换为第一口音的内容特征。
由于采用不同口音表述相同文本内容时的发音并不相同,因此对相同文本对应的不同口音语音进行内容特征提取所得到的内容特征也不同,相应的,在不同口音间实现内容特征转换成为实现口音转换的关键。
本申请实施例中采用了数据增强方案实现非平行语料(即对应不同口音且对应不同文本的语料)之间的内容特征转换。
在一些实施例中,计算机设备通过训练得到的第一ASR模型对第一样本音频进行特征提取,得到第一样本音频的第一样本内容特征,从而基于第一样本音频对应的第一样本文本以及第一样本内容特征,训练第一转换模型。其中,第一转换模型可以被称为文本内容特征转换模型(Text2BN模型),用于实现文本至源口音内容特征之间的转换。
在一些实施例中,第一转换模型的训练过程包括:计算机设备将第一样本文本输入第一转换模型,得到第一预测内容特征;计算机设备通过第一ASR模型对第一样本音频进行提取得到第一样本内容特征;计算机设备计算得到第一预测内容特征与第一样本内容特征之间的损失函数值;计算机设备基于第一预测内容特征与第一样本内容特征之间的损失函数值对第一转换模型的模型参数进行更新,从而实现第一转换模型的训练。
步骤203,基于第一转换模型、第二样本音频对应的第二样本文本以及第二样本内容特征,构建平行样本数据,第二样本内容特征由第二ASR模型对第二样本音频进行提取得到,平行样本数据包括不同内容特征,不同内容特征对应不同口音,且不同内容特征对应相同文本。
在一些实施例中,计算机设备基于第一转换模型对第二样本音频对应的第二样本文本进行文本转化,得到第二样本文本对应的第一口音的内容特征;计算机设备将第二样本文本对应的第一口音的内容特征和第二样本文本对应的第二口音的内容特征进行汇总,得到平行样本数据。
可选地,第二样本文本对应的第二口音的内容特征的由第二ASR模型提取得到的。
训练得到第一转换模型后,计算机设备基于第二样本音频对应的第二样本文本以及第一转换模型进行数据增强,从而根据第二样本内容特征以及数据增强得到的第一口音的内容特征,构建得到平行样本数据。其中,平行样本数据包括相同文本对应的第一口音的内容特征(由第一转换模型生成)以及第二口音的内容特征(由第二ASR模型提取得到)。
比如,在包含对应文本A的方言样本音频,而不包含对应文本A的普通话样本音频的情况下,计算机设备可以基于第一转换模型、文本A以及对应文本A的方言样本音频的方言样本内容特征,构建得到文本A对应的平行样本数据,该平行样本数据包含文本A对应的普通话以及方言的内容特征。
步骤204,基于平行样本数据训练第二转换模型,第二转换模型用于对第一口音和第二口音间进行内容特征转换。
进一步的,计算机设备基于对应相同文本的平行样本数据,训练第二转换模型。其中,该第二转换模型可以被称为内容特征转换模型(BN2BN模型),用于将源口音的内容特征转换为目标口音的内容特征。BN2BN模型用于实现口音迁移任务,可参见网址:https://zhuanlan.zhihu.com/p/586037409中的介绍。
比如,当第一口音为普通话,第二口音为方言时,计算机设备训练得到第二转换模型用于将普通话的内容特征转换为方言的内容特征。
需要说明的是,当第一样本音频与第二样本音频对应相同样本文本时,第一样本音频以及第二样本音频对应的样本内容特征可以直接被用于训练第二转换模型。
在一些实施例中,第二转换模型的训练过程包括:计算机设备通过第二ASR模型对第二样本音频进行提取得到第二样本内容特征;计算机设备通过第一转换模型对第二样本音频对应的第二样本文本进行转换得到第三样本内容特征,第三样本内容特征指采用第一口音表述第二样本文本所产生音频的内容特征;计算机设备将第三样本内容特征输入第二转换模型,得到第二预测内容特征;计算机设备计算得到第二预测内容特征与第二样本内容特征之间的损失函数值;计算机设备基于第二预测内容特征与第二样本内容特征之间的损失函数值对第二转换模型的模型参数进行更新,从而实现第二转换模型的训练。
步骤205,基于不同样本音频的样本内容特征训练第三转换模型,第三转换模型用于将内容特征转换为音频。
其中,第三转换模型可以被称为内容音频转换模型,用于基于目标语音的内容特征转换得到目标语音的音频。
在一些实施例中,该第三转换模型可以包括声学模型以及声码器,其中,声学模型用于基于内容特征生成音频频谱,而声码器则用于基于音频频谱生成音频。
在一些实施例中,训练第三转换模型的样本可以为各种口音的样本音频。
需要说明的是,第三转换模型可以在ASR模型训练完成后执行,即第三转换模型可以与第一以及第二转换模型同步训练。本申请实施例并不对模型的训练时序构成限定。
在一些实施例中,第三转换模型的训练过程包括:计算机设备将样本内容特征以及样本音频对应的说话者标识输入第三转换模型进行音频的生成,得到预测音频;计算机设备计算得到预测音频与样本音频之间的损失函数值;计算机设备基于预测音频与样本音频之间的损失函数值对第三转换模型的模型参数进行更新,从而实现第三转换模型的训练。
步骤206,基于训练得到的第一ASR模型、第二转换模型和第三转换模型生成语音转换模型,语音转换模型用于将第一口音的音频转换为第二口音的音频。
通过上述步骤训练得到第一ASR模型、第二转换模型和第三转换模型后,计算机设备将上述模型组合得到最终的语音转换模型。其中,模型之间的拼接顺序为第一ASR模型→第二转换模型→第三转换模型,即第一ASR模型的输出被输入第二转换模型,第二转换模型的输出被输入第三转换模型。
在一个示意性的例子中,训练得到的用于将普通话转换为方言的语音转换模型由普通话 ASR模型,普通话-方言内容转换模型以及内容音频转换模型构成。
综上所述,本申请实施例中,在缺少第二口音的第二样本音频对应平行语料的情况下,首先基于第一口音的第一样本音频,训练用于将文本转换为内容特征的第一转换模型,从而利用该第一转换模型以及第二样本音频对应的第二样本文本,构建得到包含对应相同文本内容但对应不同口音的平行样本数据,进而利用该平行样本数据训练在不同口音间进行内容特征转换的第二转换模型,以及用于将内容特征转换为音频的第三转换模型,完成语音转换模型训练;模型训练过程中,利用训练得到的中间模型构建平行语料,无需在模型训练前录制不同口音的平行语料,能够在保证模型训练质量的情况下,降低模型训练对人工录制的平行语料的需求,有助于提高模型训练效率,并提高样本不足情况下模型的训练质量。
下面对采用上述方案训练得到的语音转换模型的应用过程进行说明。通过利用该语音转换模型可实现语音转换方法,该语音转换方法由计算机设备执行,该语音转换模型包括第一ASR模型、第二转换模型和第三转换模型。在进行语音转换时,计算机设备获取第一口音音频,第一口音音频对应第一口音;计算机设备通过第一ASR模型对第一口音音频提取得到第一内容特征,第一内容特征对应第一口音;计算机设备通过第二转换模型将第一内容特征转换为第二内容特征,第二内容特征对应第二口音;计算机设备通过第三转换模型对第二内容特征进行音频转换,得到第二口音音频,从而完成语音转换。
示例性地,计算机设备接收到第一口音的第一口音音频后,通过语音转换模型中的第一ASR模型进行内容特征提取,得到第一内容特征。
在一些实施例中,计算机设备将第一ASR模型提取到的第一内容特征输入第二转换模型,由第二转换模型在第一口音和第二口音之间进行内容特征转换,得到第二口音下的第二内容特征。
其中,第一内容特征和第二内容特征对应相同文本(均为第一口音音频对应的文本)。
可选地,第二转换模型包括卷积层以及N层堆叠的FFT;计算机设备通过第二转换模型中的卷积层对第一内容特征进行卷积处理后,将卷积结果输入至N层堆叠的FFT进行转换得到第二内容特征。
在一些实施例中,计算机设备将第二内容特征以及目标音色对应说话者的说话者标识输入第三转换模型,得到第二口音音频。
其中,不同说话者对应不同说话者标识。
可选地,第三转换模型包括第三转换子模型以及声码器,其中,第三转换子模型用于将内容特征转换为音频谱特征,声码器则用于基于音频谱特征生成音频。
可选的,该第三转换子模型包括卷积层以及N层堆叠的FFT,该音频谱特征可以为Mel(梅尔)谱特征、MFCC(Mel Frequency Cepstrum Coefficient,梅尔频率倒谱系数)特征等等,本申请实施例对此不作限定。
可选的,该声码器可以是采用自回归的Wavenet或WaveRNN,或者采用非自回归的Hifigan或Melgan等等,本申请实施例对此不作限定。
为了方便表述,下述实施例中以音频谱特征为Mel谱特征,声码器为hifigan为例进行说明,但并不对此构成限定。
在一些实施例中,计算机设备将第二内容特征以及说话者标识输入第三转换子模型,得到音频谱特征;计算机设备将音频谱特征输入声码器,得到第二口音音频。
下面对采用上述方案训练得到的语音转换模型的应用过程进行说明。图3示出了本申请一个示例性实施例提供的口音转换方法的流程图。该方法由计算机设备执行,该方法包括如下步骤。
步骤301,响应于口音转换指令,通过第一ASR模型提取第一口音音频的第一内容特征,第一内容特征对应第一口音,口音转换指令用于指示将音频由第一口音转换为第二口 音。
在一些实施例中,该口音转换指令在完成口音设置后触发。在一种可能的场景中,如图4所示,在元宇宙虚拟角色设置界面41中,除了包含虚拟角色形象设置选项外,还包括语音设置选项。用户可以通过语音设置选项设置虚拟角色的音色以及口音。当完成语音以及形象设置后,通过触发进入按键42即可进入元宇宙。其中,触发进入按键42后,计算机设备接收到口音转换指令,该口音转换指令中包含源口音以及目标口音的口音标识。本实施例中,以源口音为第一口音,目标口音为第二口音为例进行说明。
计算机设备接收到第一口音的第一口音音频后,通过语音转换模型中的第一ASR模型进行内容特征提取,得到第一内容特征,该第一内容特征提出了音色、音调等干扰,仅保留所表达内容层面的特征。
在一些实施例中,计算机设备将第一ASR模型最后一层BN特征作为第一内容特征。
示意性的,如图5所示,当需要将普通话转换为方言时,计算机设备通过普通话ASR模型52对普通话音频51进行特征提取,得到普通话内容特征53。
步骤302,通过第二转换模型将第一内容特征转换为第二内容特征,第二内容特征对应第二口音。
进一步的,计算机设备将第一ASR模型提取到的第一内容特征输入第二转换模型,由第二转换模型在第一口音和第二口音之间进行内容特征转换,得到第二口音下的第二内容特征。其中,第一内容特征和第二内容特征对应相同文本(均为第一口音音频对应的文本)。
示意性的,如图5所示,BN2BN模型54用于在普通话和方言之间进行内容特征转换。得到普通话内容特征53后,计算机设备进一步通过BN2BN模型54对普通话内容特征53进行特征转换,得到方言内容特征55。
步骤303,通过第三转换模型对第二内容特征进行音频转换,得到第二口音音频。
进一步的,计算机设备将第二内容特征输入第三转换模型,由第三转换模型基于内容特征生成第二口音音频。
示意性的,如图5所示,计算机设备将方言内容特征55输入BN2Wav模型56,得到BN2Wav模型56输出的方言音频57。
第一转换模型作为构建平行样本数据的关键模型,在训练第一转换模型的过程中,计算机设备将第一样本文本输入第一转换模型,得到第一转换模型输出的第一预测内容特征,从而以第一样本内容特征为第一预测内容特征的监督,训练第一转换模型。
在一些实施例中,计算机设备以第一样本内容特征为第一预测内容特征的监督,基于第一预测内容特征与第一样本内容特征之间的特征差异确定第一转换模型损失,从而基于该第一转换模型损失训练第一转换模型。其中,该损失可以为MSE(Mean Square Error,均方误差)损失或其他类型损失,本实施例对此不作限定。
均方误差是指第一预测内容特征与第一样本内容特征之间的特征差异值的平方和的平均数,也就是误差平方和的平均数。
可选的,第一转换模型的损失lossText2BN可以表示为:
其中,BNna为第一ASR模型提取到的第一样本内容特征,为第一转换模型输出的第一预测内容特征。
为了提升文本到内容特征的转换质量,在一种可能的设计中,该第一转换模型包括第一转换子模型、时长预测子模型以及第二转换子模型,其中,第一转换子模型用于实现文本到文本编码特征之间的转换,时长预测子模型用于预测文本的表述时长,而第二转换子模型则用于将文本编码特征转换为内容特征。
相应的,第一转换模型将文本转换为内容特征的过程如图6所示。
步骤601,通过第一转换子模型对第一样本文本进行编码,得到第一文本编码特征。
由于文本表述具有前后关联性,因此本申请实施例中,为了提高后续特征转换质量,在一种可能的设计中,采用N层堆叠的FFT(Feed Forward Transformer,前馈转换)构成第一转换子模型。其中,该FFT用于通过线性变换,先将数据映射到高纬度的空间再映射到低纬度的空间,以此提取更深层次的特征。
并且,FFT包括多头注意力机制层和卷积层。在一个示意性的例子中,该FFT结构如图7所示。原始输入首先经过多头注意力层701处理,多头注意力层701处理得到的多路结果和原始输入共同经过加权和标准化702处理后,输入卷积层703进行卷积处理。卷积层703的输入与输出相加后继续进行加权和标准化702处理最终输出。
由于FFT通过多头注意力机制和卷积层实现,且使用了残差网络思想,因此利用多层FFT叠加得到的第一转换子模型进行文本编码,能够提高文本编码质量。
当然,第一转换子模型除了可以采用堆叠的FFT外,还可以采用LSTM(Long Short-Term Memory,长短期记忆)等其他类型的模块(需要包含注意力机制,且保持输入与输出的尺寸一致)实现,本申请实施例对此不作限定。
步骤602,通过时长预测子模型对第一文本编码特征进行时长预测,得到预测时长,预测时长用于表征第一样本文本的发音时长。
由于采用口语表述文本时,具有一定的表述时长,因此为了提高后续转换得到的音频的真实性(使转换的得到的语音符合真人语速),计算机设备通过时长预测子模型进行时长预测,得到第一样本文本的发音时长。
在一些实施例中,该预测时长包括第一样本文本中各个子文本对应的发音子时长。比如,第一样本文本为“今天天气真好”,该预测时长包括“今”、“天”、“天”、“气”、“真”、“好”各自对应的发音时长。
步骤603,基于预测时长对第一文本编码特征进行特征扩充,得到第二文本编码特征。
进一步的,计算机设备基于预测时长对第一文本编码特征进行特征扩充,复制第一文本编码特征中的子特征,使复制后子特征对应的持续时长与对应子文本的发音子时长保持一致。
在一个示意性的例子中,第一文本编码特征为“abcd”,经过特征扩充后的第二文本编码特征为“aabbbcdddd”。
步骤604,通过第二转换子模型将第二文本编码特征转换得到第一预测内容特征。
在一些实施例中,第二转换子模型输出的第一预测内容特征与输出的第二文本编码特征的特征尺寸保持一致。
在一些实施例中,第二转换子模型包括N层FFT,以此提升文本编码特征到内容特征的转换质量。
在一个示意性的例子中,如图8所示,第一转换子模型81首先对第一样本文本进行特征编码,得到第一文本编码特征,并将第一文本编码特征输入时长预测子模型82,得到预测时长,并基于预测时长对第一文本编码特征进行特征扩充处理,得到第二文本编码特征。最终通过第二转换子模型83对第二文本编码特征进行特征转换,得到第一预测内容特征。
针对上述平行样本数据的构建,以及基于平行样本数据训练第二转换模型的过程,在一种可能的实施方式中,如图9所示,该过程可以包括如下步骤。
步骤901,通过第一转换模型将第二样本文本转换得到第三样本内容特征,第三样本内容特征指采用第一口音表述第二样本文本所产生音频的内容特征。
在基于第二样本音频构建平行样本数据时,计算机设备对第二样本音频对应的第二样本文本进行内容特征转换,得到第三样本内容特征。由于第一转换模型用于将文本转换为第一口音的内容特征,因此利用第一转换模型对第二样本文本进行内容特征转换,得到第三样本 内容特征即为采用第一口音表述第二样本文本所产生音频的内容特征。
借助第一转换模型,即便缺少第二样本音频对应的平行语料,也能够生成平行语料的内容特征,免去了人工录制平行语料,以及对平行语料进行内容特征提取的流程。
步骤902,基于第二样本内容特征和第三样本内容特征构建平行样本数据。
由于第三样本内容特征与第二样本内容特征对应不同口音,且对应相同文本,因此两者组合集合构建得到平行样本数据。
步骤903,将第三样本内容特征输入第二转换模型,得到第二预测内容特征。
在一种可能的设计中,为了提高内容特征转换质量,第二转换模型包括卷积层以及N层堆叠的FFT,其中,FFT的具体结构可以参考图7,本实施例在此不做赘述。进行内容特征转换时,内容特征首先经过卷积层卷积处理,然后经过N层FFT处理,得到转换后的内容特征。
示意性的,如图10所示,计算机设备通过卷积层1001对第三样本内容特征进行卷积处理后,将卷积结果输入N层FFT 1002,得到第二预测内容特征。
步骤904,以第二样本内容特征为第二预测内容特征的监督,训练第二转换模型。
为了使第二转换模型的内容转换结果接近第二ASR模型输出的第二样本音频的第二样本内容特征,在一种可能的实施方式中,计算机设备基于第二样本内容特征与第二预测内容特征的差异确定第二转换模型损失,从而基于该第二转换模型损失训练第二转换模型。
其中,该损失可以为MSE损失或其他类型损失,本实施例对此不作限定。
可选的,第二转换模型的损失lossBN2BN可以表示为:
其中,BNac为第二ASR模型提取到的第二样本内容特征,为第二转换模型输出的第二预测内容特征。
由于内容特征剔除了音色等因素的影响,而样本音频具有音色特征,因此在训练第三转换模型过程中,需要将样本音频的说话者标识作为输入的一部分,使训练得到的第三转换模型能够输出具有特定音色的音频。
在一种可能的实施方式中,计算机设备将样本内容特征以及样本音频对应的说话者标识输入第三转换模型,得到预测音频,从而基于预测音频以及样本音频,训练第三转换模型。其中,预测音频与样本音频对应相同音频内容,且具有相同音色。
可选的,不同说话者对应不同说话者标识。在一些实施例中,预先将说话者划分为不同音色,从而为同一音色对应的不同说话者分配相同说话者标识。
在一种可能的设计中,第三转换模型包括第三转换子模型以及声码器,其中,第三转换子模型用于将内容特征转换为音频谱特征,声码器则用于基于音频谱特征生成音频。
可选的,该第三转换模型包括卷积层以及N层堆叠的FFT,该音频谱特征可以为Mel(梅尔)谱特征、MFCC(Mel Frequency Cepstrum Coefficient,梅尔频率倒谱系数)特征等等,本申请实施例对此不作限定。
可选的,该声码器可以是采用自回归的Wavenet或WaveRNN,或者采用非自回归的hifigan或melgan等等,本申请实施例对此不作限定。
为了方便表述,下述实施例中以音频谱特征为Mel谱特征,声码器为hifigan为例进行说明,但并不对此构成限定。
相应的,训练过程中,计算机设备将样本内容特征以及说话者标识输入第三转换子模型,得到预测音频谱特征,从而将预测音频谱特征输入声码器,得到预测音频。
示意性的,如图11所示,BN2Wav模型包括BN2Mel子模型1101以及hifigan子模型1102,其中,BN2Mel子模型1101包括卷积层11011以及N层堆叠的FFT 11012。模型训练过程中,计算机设备将样本音频的样本内容特征BN以及说话者标识spk_id输入BN2Mel子 模型1101。BN2Mel子模型1101将转换得到的Mel频谱输入hifigan子模型1102,由hifigan子模型1102转换得到预测音频。
在一种可能的实施方式中,计算机设备联合训练第三转换子模型和声码器。
在另一种可能的实施方式中,计算机设备首先训练第三转换子模型,然后基于训练完成的第三转换子模型训练声码器,以此提高训练效率。
如图12所示,第三转换模型的训练过程可以包括如下步骤。
步骤1201,将样本内容特征以及说话者标识输入第三转换子模型,得到预测音频谱特征。
在一种可能的实施方式中,计算机设备将样本内容特征以及说话者标识输入第三转换子模型,得到样本音频对应的预测Mel频谱。
步骤1202,以样本音频的样本音频谱特征为预测音频谱特征的监督,训练第三转换子模型。
在一些实施例中,计算机设备对样本音频进行音频谱特征提取,得到样本音频谱特征,从而基于预测音频谱特征与样本音频谱特征的差异确定第三转换子模型损失,从而基于该第三转换子模型损失训练第三转换子模型。
其中,该损失可以为MSE损失或其他类型损失,本实施例对此不作限定。
可选的,第三转换子模型的损失lossBN2Mel可以表示为:
其中,Mel为直接对样本音频进行音频谱特征提取到的样本音频谱特征,为第三转换子模型输出的预测音频谱特征。
步骤1203,在第三转换子模型训练完成的情况下,将训练完成后第三转换子模型输出的预测音频谱特征输入声码器,得到预测音频。
完成第三转换子模型训练后,计算机设备将样本内容特征以及说话者标识输入训练得到的第三转化子模型,得到预测音频谱特征,然后将该预测音频谱特征输入声码器,得到声码器输出的预测音频。
在一个示意性的例子中,计算机设备将训练完成的BN2Mel子模型输出的预测Mel频谱特征输入hifigan,得到hifigan输出的预测音频。
步骤1204,基于预测音频以及样本音频,训练第三转换模型中的声码器。
可选的,计算机设备以样本音频为预测音频的监督,确定声码器的转换损失,从而基于该损失训练声码器。
在一些实施例中,当声码器采用对抗网络时,以hifigan为例,计算机设备采用对抗训练思想,通过生成器(Generator)和判别器(Discriminator)对抗训练。对抗训练过程中生成器的损失可以表示为:
LG=LG(G;D)+LFM(G;D)+Lmel(G)
其中,为生成器生成的音频G(s)重新转换得到的Mel谱特征,为从样本音频中提取到的Mel谱特征;LFM(G;D)为生成音频与样本音频的特征匹配损失;LG(G;D)为生成音频的判别损失。
对抗训练过程中判别器的损失可以表示为:
LD(G;D)=(D(x)-1)2+(D(G(s)))2
其中,D(x)为判别器对样本音频的判别结果,D(G(s))为判别器对预测音频的判别结果。
显然,通过上述方式训练得到的第三转换模型,处理能够将内容特征转换为音频外,还能够在转换得到的音频中添加特定的音色。相应的,在应用过程中,用户除了选择目标口音 外,还可以选择目标音色。
在一种可能的实施方式中,在口音转换指令中包含目标音色的情况下,计算机设备将第二内容特征以及目标音色对应说话者的说话者标识输入第三转换模型,得到第二口音音频,其中,第二口音音频具有第二口音以及目标音色。
示意性的,如图13所示,当需要将普通话转换为方言,且具有目标音色时,计算机设备通过普通话ASR模型1302对普通话音频1301进行特征提取,得到普通话内容特征1303。计算机设备进一步通过BN2BN模型1304对普通话内容特征1303进行特征转换,得到方言内容特征1305。计算机设备将方言内容特征1305以及目标音色1306对应的说话者标识输入BN2Wav模型1307,得到BN2Wav模型1307输出的带有目标音色的方言音频1308。
需要说明的是,当需要保持口音转换前后音色一致时,需要预先获取当前用户的语料数据(比如累计30分钟时长的语音数据),并为当前用户分配说话者标识,从而基于当前用户的语料数据以及说话者标识训练第三转换模型,本实施例在此不作赘述。
本实施例中,在训练第三转换模型的过程中,除了将样本音频的内容特征作为输入外,还将样本音频对应说话者标识作为输入,使第三转换模型在训练中能够基于内容特征以及说话者的音色特征进行音频转换。后续使用过程中,对于相同文本内容,通过输入不同的说话者标识,第三转换模型能够输出不同音色的音频,实现口音以及音色的双重转换。
图14是本申请一个示例性实施例提供的语音转换模型的训练装置的结构框图,该装置包括:
训练模块1401,用于基于第一样本音频训练第一ASR模型,以及基于第二样本音频训练第二ASR模型,所述第一样本音频对应第一口音,所述第二样本音频对应第二口音;
所述训练模块1401,还用于基于所述第一样本音频对应的第一样本文本以及第一样本内容特征,训练第一转换模型,所述第一样本内容特征由所述第一ASR模型对所述第一样本音频进行提取得到,所述第一转换模型用于将文本转换为所述第一口音的内容特征;
所述训练模块1401,还用于基于所述第一转换模型、所述第二样本音频对应的第二样本文本以及第二样本内容特征,构建平行样本数据,所述第二样本内容特征由所述第二ASR模型对所述第二样本音频进行提取得到,所述平行样本数据包括不同内容特征,不同内容特征对应不同口音,且不同内容特征对应相同文本;基于所述平行样本数据训练第二转换模型,所述第二转换模型用于对所述第一口音和所述第二口音间进行内容特征转换;
所述训练模块1401,还用于基于不同样本音频的样本内容特征训练第三转换模型,所述第三转换模型用于将内容特征转换为音频;
生成模块1402,用于基于训练得到的所述第一ASR模型、所述第二转换模型和所述第三转换模型生成语音转换模型,所述语音转换模型用于将第一口音的音频转换为第二口音的音频。
可选的,所述训练模块1401,用于:
通过所述第一转换模型将所述第二样本文本转换得到第三样本内容特征,所述第三样本内容特征指采用所述第一口音表述第二样本文本所产生音频的内容特征;
基于所述第二样本内容特征和所述第三样本内容特征构建所述平行样本数据。
可选的,所述训练模块1401,用于:将所述第三样本内容特征输入所述第二转换模型,得到第二预测内容特征;
以所述第二样本内容特征为所述第二预测内容特征的监督,训练所述第二转换模型。
可选的,所述训练模块1401,用于:将所述第一样本文本输入所述第一转换模型,得到所述第一转换模型输出的第一预测内容特征;
以所述第一样本内容特征为所述第一预测内容特征的监督,训练所述第一转换模型。
可选的,所述第一转换模型中包括第一转换子模型、时长预测子模型以及第二转换子模型;
所述训练模块1401,用于:
通过所述第一转换子模型对所述第一样本文本进行编码,得到第一文本编码特征;
通过所述时长预测子模型对所述第一文本编码特征进行时长预测,得到预测时长,所述预测时长用于表征所述第一样本文本的发音时长;
基于所述预测时长对所述第一文本编码特征进行特征扩充,得到第二文本编码特征;
通过所述第二转换子模型将所述第二文本编码特征转换得到所述第一预测内容特征。
可选的,所述第一转换子模型和所述第二转换子模型包括FFT,所述FFT包括多头注意力机制层和卷积层。
可选的,所述训练模块1401,用于:
将所述样本内容特征以及所述样本音频对应的说话者标识输入所述第三转换模型,得到预测音频,所述预测音频与所述样本音频对应相同音频内容,且具有相同音色,其中,不同说话者对应不同说话者标识;
基于所述预测音频以及所述样本音频,训练所述第三转换模型。
可选的,所述第三转换模型包括第三转换子模型以及声码器;
所述训练模块1401,用于:
将所述样本内容特征以及所述说话者标识输入所述第三转换子模型,得到预测音频谱特征;
将所述预测音频谱特征输入所述声码器,得到所述预测音频。
可选的,所述训练模块1401,用于:
以所述样本音频的样本音频谱特征为所述预测音频谱特征的监督,训练所述第三转换子模型;
在所述第三转换子模型训练完成的情况下,将训练完成后所述第三转换子模型输出的所述预测音频谱特征输入所述声码器,得到所述预测音频;
基于所述预测音频以及所述样本音频,训练所述第三转换模型中的所述声码器。
可选的,所述装置还包括:
转换模块,用于响应于口音转换指令,通过所述第一ASR模型提取第一口音音频的第一内容特征,第一内容特征对应所述第一口音,所述口音转换指令用于指示将音频由所述第一口音转换为所述第二口音;
通过所述第二转换模型将所述第一内容特征转换为第二内容特征,所述第二内容特征对应所述第二口音;
通过所述第三转换模型对所述第二内容特征进行音频转换,得到第二口音音频。
可选的,所述口音转换指令中包含目标音色;
所述转换模块,用于:
将所述第二内容特征以及所述目标音色对应说话者的说话者标识输入所述第三转换模型,得到所述第二口音音频,其中,不同说话者对应不同说话者标识。
综上所述,本申请实施例中,在缺少第二口音的第二样本音频对应平行语料的情况下,首先基于第一口音的第一样本音频,训练用于将文本转换为内容特征的第一转换模型,从而利用该第一转换模型以及第二样本音频对应的第二样本文本,构建得到包含对应相同文本内容但对应不同口音的平行样本数据,进而利用该平行样本数据训练在不同口音间进行内容特征转换的第二转换模型,以及用于将内容特征转换为音频的第三转换模型,完成语音转换模型训练;模型训练过程中,利用训练得到的中间模型构建平行语料,无需在模型训练前录制不同口音的平行语料,能够在保证模型训练质量的情况下,降低模型训练对人工录制的平行语料的需求,有助于提高模型训练效率,并提高样本不足情况下模型的训练质量。
需要说明的是:上述实施例提供的装置,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的装置与方法实施例属于同一构思,其实现过程详见方法实施例,这里不再赘述。
图15是本申请一个示例性实施例提供的语音转换装置的结构框图,该装置包括:
获取模块1501,用于获取第一口音音频,所述第一口音音频对应第一口音;
提取模块1502,用于通过所述第一ASR模型对所述第一口音音频提取得到第一内容特征,所述第一内容特征对应所述第一口音;
内容特征转换模块1503,用于通过所述第二转换模型将所述第一内容特征转换为第二内容特征,所述第二内容特征对应第二口音;
音频转换模块1504,用于通过所述第三转换模型对所述第二内容特征进行音频转换,得到第二口音音频。
在一些实施例中,内容特征转换模块1503,还用于将第一ASR模型提取到的第一内容特征输入第二转换模型,由第二转换模型在第一口音和第二口音之间进行内容特征转换,得到第二口音下的第二内容特征。
在一些实施例中,第二转换模型包括卷积层以及N层堆叠的FFT。
在一些实施例中,内容特征转换模块1503,还用于通过第二转换模型中的卷积层对第一内容特征进行卷积处理,将卷积结果输入至N层堆叠的FFT转换得到第二内容特征。
在一些实施例中,第三转换模型包括第三转换子模型以及声码器,其中,第三转换子模型用于将内容特征转换为音频谱特征,声码器则用于基于音频谱特征生成音频。
在一些实施例中,音频转换模块1504,还用于将第二内容特征以及说话者标识输入第三转换子模型,得到音频谱特征。
在一些实施例中,音频转换模块1504,还用于将音频谱特征输入声码器,得到第二口音音频。
在一些实施例中,第三转换子模型包括卷积层以及N层堆叠的FFT。
请参考图16,其示出了本申请一个示例性实施例提供的计算机设备的结构示意图,该计算机设备可以为上述实施例中的投屏设备或终端。具体来讲:所述计算机设备1600包括中央处理单元(Central Processing Unit,CPU)1601、包括随机存取存储器1602和只读存储器1603的系统存储器1604,以及连接系统存储器1604和中央处理单元1601的系统总线1605。所述计算机设备1600还包括帮助计算机内的各个器件之间传输信息的基本输入/输出系统(Input/Output,I/O系统)1606,和用于存储操作系统1613、应用程序1614和其他程序模块1615的大容量存储设备1607。
所述基本输入/输出系统1606包括有用于显示信息的显示器1608和用于用户输入信息的诸如鼠标、键盘之类的输入设备1609。其中所述显示器1608和输入设备1609都通过连接到系统总线1605的输入输出控制器1610连接到中央处理单元1601。所述基本输入/输出系统1606还可以包括输入输出控制器1610以用于接收和处理来自键盘、鼠标、或电子触控笔等多个其他设备的输入。类似地,输入输出控制器1610还提供输出到显示屏、打印机或其他类型的输出设备。
所述大容量存储设备1607通过连接到系统总线1605的大容量存储控制器(未示出)连接到中央处理单元1601。所述大容量存储设备1607及其相关联的计算机可读介质为计算机设备1600提供非易失性存储。也就是说,所述大容量存储设备1607可以包括诸如硬盘或者驱动器之类的计算机可读介质(未示出)。
不失一般性,所述计算机可读介质可以包括计算机存储介质和通信介质。计算机存储介质包括以用于存储诸如计算机可读指令、数据结构、程序模块或其他数据等信息的任何方法或技术实现的易失性和非易失性、可移动和不可移动介质。计算机存储介质包括随机存取记 忆体(RAM,Random Access Memory)、只读存储器(ROM,Read Only Memory)、闪存或其他固态存储其技术,只读光盘(Compact Disc Read-Only Memory,CD-ROM)、数字通用光盘(Digital Versatile Disc,DVD)或其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。当然,本领域技术人员可知所述计算机存储介质不局限于上述几种。上述的系统存储器1604和大容量存储设备1607可以统称为存储器。
存储器存储有一个或多个程序,一个或多个程序被配置成由一个或多个中央处理单元1601执行,一个或多个程序包含用于实现上述方法的指令,中央处理单元1601执行该一个或多个程序实现上述各个方法实施例提供的方法。
根据本申请的各种实施例,所述计算机设备1600还可以通过诸如因特网等网络连接到网络上的远程计算机运行。也即计算机设备1600可以通过连接在所述系统总线1605上的网络接口单元1611连接到网络1612,或者说,也可以使用网络接口单元1611来连接到其他类型的网络或远程计算机系统(未示出)。
本申请实施例还提供一种计算机可读存储介质,该可读存储介质中存储有至少一条指令,至少一条指令由处理器加载并执行以实现上述实施例所述的语音转换模型的训练方法,或,如上述方面所述的语音转换方法。
可选地,该计算机可读存储介质可以包括:ROM、RAM、固态硬盘(SSD,Solid State Drives)或光盘等。其中,RAM可以包括电阻式随机存取记忆体(ReRAM,Resistance Random Access Memory)和动态随机存取存储器(DRAM,Dynamic Random Access Memory)。
本申请实施例提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述实施例所述的语音转换模型的训练方法,或,如上述方面所述的语音转换方法。

Claims (20)

  1. 一种语音转换模型的训练方法,所述方法由计算机设备执行,所述方法包括:
    基于第一样本音频训练第一ASR模型,以及基于第二样本音频训练第二ASR模型,所述第一样本音频对应第一口音,所述第二样本音频对应第二口音;
    基于所述第一样本音频对应的第一样本文本以及第一样本内容特征,训练第一转换模型,所述第一样本内容特征由所述第一ASR模型对所述第一样本音频进行提取得到,所述第一转换模型用于将文本转换为所述第一口音的内容特征;
    基于所述第一转换模型、所述第二样本音频对应的第二样本文本以及第二样本内容特征,构建平行样本数据,所述第二样本内容特征由所述第二ASR模型对所述第二样本音频进行提取得到,所述平行样本数据包括不同内容特征,不同内容特征对应不同口音,且不同内容特征对应相同文本;
    基于所述平行样本数据训练第二转换模型,所述第二转换模型用于对所述第一口音和所述第二口音间进行内容特征转换;
    基于不同样本音频的样本内容特征训练第三转换模型,所述第三转换模型用于将内容特征转换为音频;
    基于训练得到的所述第一ASR模型、所述第二转换模型和所述第三转换模型生成语音转换模型,所述语音转换模型用于将第一口音的音频转换为第二口音的音频。
  2. 根据权利要求1所述的方法,其中,所述基于所述第一转换模型、所述第二样本音频对应的第二样本文本以及第二样本内容特征,构建平行样本数据,包括:
    通过所述第一转换模型将所述第二样本文本转换得到第三样本内容特征,所述第三样本内容特征指采用所述第一口音表述第二样本文本所产生音频的内容特征;
    基于所述第二样本内容特征和所述第三样本内容特征构建所述平行样本数据。
  3. 根据权利要求2所述的方法,其中,所述基于所述平行样本数据训练第二转换模型,包括:
    将所述第三样本内容特征输入所述第二转换模型,得到第二预测内容特征;
    以所述第二样本内容特征为所述第二预测内容特征的监督,训练所述第二转换模型。
  4. 根据权利要求3所述的方法,其中,所述以所述第二样本内容特征为所述第二预测内容特征的监督,训练所述第二转换模型,包括:
    根据所述第二样本内容特征与所述第二预测内容特征之间的差异确定第二转换模型损失,基于所述第二转换模型损失训练所述第二转换模型。
  5. 根据权利要求1至4任一所述的方法,其中,所述基于所述第一样本音频对应的第一样本文本以及第一样本内容特征,训练第一转换模型,包括:
    将所述第一样本文本输入所述第一转换模型,得到所述第一转换模型输出的第一预测内容特征;
    以所述第一样本内容特征为所述第一预测内容特征的监督,训练所述第一转换模型。
  6. 根据权利要求5所述的方法,其中,所述第一转换模型中包括第一转换子模型、时长预测子模型以及第二转换子模型;
    所述将所述第一样本文本输入所述第一转换模型,得到所述第一转换模型输出的第一预测内容特征,包括:
    通过所述第一转换子模型对所述第一样本文本进行编码,得到第一文本编码特征;
    通过所述时长预测子模型对所述第一文本编码特征进行时长预测,得到预测时长,所述预测时长用于表征所述第一样本文本的发音时长;
    基于所述预测时长对所述第一文本编码特征进行特征扩充,得到第二文本编码特征;
    通过所述第二转换子模型将所述第二文本编码特征转换得到所述第一预测内容特征。
  7. 根据权利要求6所述的方法,其中,所述第一转换子模型和所述第二转换子模型包括FFT,所述FFT包括多头注意力机制层和卷积层。
  8. 根据权利要求5所述的方法,其中,所述以所述第一样本内容特征为所述第一预测内容特征的监督,训练所述第一转换模型,包括:
    根据所述第一样本内容特征与所述第一预测内容特征之间的差异确定第一转换模型损失,基于所述第一转换模型损失训练所述第一转换模型。
  9. 根据权利要求1至8任一所述的方法,其中,所述基于不同样本音频的样本内容特征训练第三转换模型,包括:
    将所述样本内容特征以及所述样本音频对应的说话者标识输入所述第三转换模型,得到预测音频,所述预测音频与所述样本音频对应相同音频内容,且具有相同音色,其中,不同说话者对应不同说话者标识;
    基于所述预测音频以及所述样本音频,训练所述第三转换模型。
  10. 根据权利要求9所述的方法,其中,所述第三转换模型包括第三转换子模型以及声码器;
    所述将所述样本内容特征以及所述样本音频对应的说话者标识输入所述第三转换模型,得到预测音频,包括:
    将所述样本内容特征以及所述说话者标识输入所述第三转换子模型,得到预测音频谱特征;
    将所述预测音频谱特征输入所述声码器,得到所述预测音频。
  11. 根据权利要求10所述的方法,其中,所述将所述预测音频谱特征输入所述声码器,得到所述预测音频之前,所述方法还包括:
    以所述样本音频的样本音频谱特征为所述预测音频谱特征的监督,训练所述第三转换子模型;
    所述将所述预测音频谱特征输入所述声码器,得到所述预测音频,包括:
    在所述第三转换子模型训练完成的情况下,将训练完成后所述第三转换子模型输出的所述预测音频谱特征输入所述声码器,得到所述预测音频;
    所述基于所述预测音频以及所述样本音频,训练所述第三转换模型,包括:
    基于所述预测音频以及所述样本音频,训练所述第三转换模型中的所述声码器。
  12. 根据权利要求11所述的方法,其中,所述以所述样本音频的样本音频谱特征为所述预测音频谱特征的监督,训练所述第三转换子模型,包括:
    对所述样本音频进行音频谱特征提取,得到样本音频谱特征;
    根据所述预测音频谱特征与所述样本音频谱特征的差异确定第三转换子模型损失,基于所述第三转换子模型损失训练所述第三转换子模型。
  13. 根据权利要求1至12任一所述的方法,其中,所述方法包括:
    响应于口音转换指令,通过所述第一ASR模型提取第一口音音频的第一内容特征,第一内容特征对应所述第一口音,所述口音转换指令用于指示将音频由所述第一口音转换为所述第二口音;
    通过所述第二转换模型将所述第一内容特征转换为第二内容特征,所述第二内容特征对应所述第二口音;
    通过所述第三转换模型对所述第二内容特征进行音频转换,得到第二口音音频。
  14. 根据权利要求13所述的方法,其中,所述口音转换指令中包含目标音色;
    所述通过所述第三转换模型对所述第二内容特征进行音频转换,得到第二口音音频,包括:
    将所述第二内容特征以及所述目标音色对应说话者的说话者标识输入所述第三转换模型,得到所述第二口音音频,其中,不同说话者对应不同说话者标识。
  15. 一种语音转换方法,所述方法由计算机设备执行,所述计算机设备中设置有语音转换模型,所述语音转换模型包括第一ASR模型、第二转换模型和第三转换模型,所述方法包括:
    获取第一口音音频,所述第一口音音频对应第一口音;
    通过所述第一ASR模型对所述第一口音音频提取得到第一内容特征,所述第一内容特征对应所述第一口音;
    通过所述第二转换模型将所述第一内容特征转换为第二内容特征,所述第二内容特征对应第二口音;
    通过所述第三转换模型对所述第二内容特征进行音频转换,得到第二口音音频。
  16. 一种语音转换模型的训练装置,其中,所述装置包括:
    训练模块,用于基于第一样本音频训练第一ASR模型,以及基于第二样本音频训练第二ASR模型,所述第一样本音频对应第一口音,所述第二样本音频对应第二口音;
    所述训练模块,还用于基于所述第一样本音频对应的第一样本文本以及第一样本内容特征,训练第一转换模型,所述第一样本内容特征由所述第一ASR模型对所述第一样本音频进行提取得到,所述第一转换模型用于将文本转换为所述第一口音的内容特征;
    所述训练模块,还用于基于所述第一转换模型、所述第二样本音频对应的第二样本文本以及第二样本内容特征,构建平行样本数据,所述第二样本内容特征由所述第二ASR模型对所述第二样本音频进行提取得到,所述平行样本数据包括不同内容特征,不同内容特征对应不同口音,且不同内容特征对应相同文本;基于所述平行样本数据训练第二转换模型,所述第二转换模型用于对所述第一口音和所述第二口音间进行内容特征转换;
    所述训练模块,还用于基于不同样本音频的样本内容特征训练第三转换模型,所述第三转换模型用于将内容特征转换为音频;
    生成模块,用于基于训练得到的所述第一ASR模型、所述第二转换模型和所述第三转换模型生成语音转换模型,所述语音转换模型用于将第一口音的音频转换为第二口音的音频。
  17. 一种语音转换装置,其中,所述装置包括:
    获取模块,用于获取第一口音音频,所述第一口音音频对应第一口音;
    提取模块,用于通过所述第一ASR模型对所述第一口音音频提取得到第一内容特征,所述第一内容特征对应所述第一口音;
    内容特征转换模块,用于通过所述第二转换模型将所述第一内容特征转换为第二内容特征,所述第二内容特征对应第二口音;
    音频转换模块,用于通过所述第三转换模型对所述第二内容特征进行音频转换,得到第二口音音频。
  18. 一种计算机设备,其中,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条指令,所述至少一条指令由所述处理器加载并执行以实现如权利要求1至14任一所述的语音转换模型的训练方法,或,如权利要求15所述的语音转换方法。
  19. 一种计算机可读存储介质,其中,所述可读存储介质中存储有至少一条指令,所述至少一条指令由处理器加载并执行以实现如权利要求1至14任一所述的语音转换模型的训练方法,或,如权利要求15所述的语音转换方法。
  20. 一种计算机程序产品,其中,所述计算机程序产品包括计算机指令,所述计算机指令存储在计算机可读存储介质中;计算机设备的处理器从所述计算机可读存储介质读取所述计算机指令,所述处理器执行所述计算机指令,使得所述计算机设备执行如权利要求1至14任一所述的语音转换模型的训练方法,或,如权利要求15所述的语音转换方法。
PCT/CN2023/124162 2022-11-21 2023-10-12 语音转换模型的训练方法、装置、设备及介质 WO2024109375A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211455842.7 2022-11-21
CN202211455842.7A CN116959447A (zh) 2022-11-21 2022-11-21 语音转换模型的训练方法、装置、设备及介质

Publications (1)

Publication Number Publication Date
WO2024109375A1 true WO2024109375A1 (zh) 2024-05-30

Family

ID=88453595

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/124162 WO2024109375A1 (zh) 2022-11-21 2023-10-12 语音转换模型的训练方法、装置、设备及介质

Country Status (2)

Country Link
CN (1) CN116959447A (zh)
WO (1) WO2024109375A1 (zh)

Citations (8)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160064033A1 (en) * 2014-08-26 2016-03-03 Microsoft Corporation Personalized audio and/or video shows
CN110085244A (zh) * 2019-05-05 2019-08-02 广州虎牙信息科技有限公司 直播互动方法、装置、电子设备及可读存储介质
CN112767912A (zh) * 2020-12-28 2021-05-07 深圳市优必选科技股份有限公司 跨语言语音转换方法、装置、计算机设备和存储介质
CN113223542A (zh) * 2021-04-26 2021-08-06 北京搜狗科技发展有限公司 音频的转换方法、装置、存储介质及电子设备
US20220382998A1 (en) * 2021-05-25 2022-12-01 Compal Electronics, Inc. Translation method and translation device
CN113838448A (zh) * 2021-06-16 2021-12-24 腾讯科技(深圳)有限公司 一种语音合成方法、装置、设备及计算机可读存储介质
CN113450759A (zh) * 2021-06-22 2021-09-28 北京百度网讯科技有限公司 语音生成方法、装置、电子设备以及存储介质
CN114038484A (zh) * 2021-12-16 2022-02-11 游密科技(深圳)有限公司 语音数据处理方法、装置、计算机设备和存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG YONGMAO; WANG ZHICHAO; YANG PEIJI; SUN HONGSHEN; WANG ZHISHENG; XIE LEI: "AccentSpeech: Learning Accent from Crowd-sourced Data for Target Speaker TTS with Accents", 2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), IEEE, 11 December 2022 (2022-12-11), pages 76 - 80, XP034290712, DOI: 10.1109/ISCSLP57327.2022.10037914 *

Also Published As

Publication number Publication date
CN116959447A (zh) 2023-10-27

Similar Documents

Publication Publication Date Title
CN110223705B (zh) 语音转换方法、装置、设备及可读存储介质
CN113408385B (zh) 一种音视频多模态情感分类方法及系统
JP4246790B2 (ja) 音声合成装置
JP6876752B2 (ja) 応答方法及び装置
EP3855340B1 (en) Cross-lingual voice conversion system and method
US20140244252A1 (en) Method for preparing a transcript of a conversion
JP2014519082A (ja) 文字に基づく映像生成
JP2014519082A5 (zh)
WO2019116889A1 (ja) 信号処理装置および方法、学習装置および方法、並びにプログラム
CN112687259A (zh) 一种语音合成方法、装置以及可读存储介质
JP2012181358A (ja) テキスト表示時間決定装置、テキスト表示システム、方法およびプログラム
CN113205793B (zh) 音频生成方法、装置、存储介质及电子设备
CN112530400A (zh) 基于深度学习的文本生成语音的方法、系统、装置及介质
US11687576B1 (en) Summarizing content of live media programs
Zhang et al. Promptspeaker: Speaker generation based on text descriptions
WO2021159734A1 (zh) 一种数据处理方法、装置、设备及介质
WO2024109375A1 (zh) 语音转换模型的训练方法、装置、设备及介质
WO2023142590A1 (zh) 手语视频的生成方法、装置、计算机设备及存储介质
US20220383850A1 (en) System and method for posthumous dynamic speech synthesis using neural networks and deep learning
CN116469369A (zh) 虚拟声音合成方法、装置及相关设备
KR102605178B1 (ko) 가족 관계에 기초하여 음성 데이터를 생성하는 장치, 방법 및 컴퓨터 프로그램
CN113889130A (zh) 一种语音转换方法、装置、设备及介质
KR102426020B1 (ko) 한 화자의 적은 음성 데이터로 감정 운율을 담은 음성 합성 방법 및 장치
CN117373463A (zh) 用于语音处理的模型训练方法、设备、介质及程序产品
US20230186899A1 (en) Incremental post-editing and learning in speech transcription and translation services