WO2023045954A1 - Speech synthesis method and apparatus, electronic device, and readable storage medium - Google Patents


Info

Publication number: WO2023045954A1
Authority: WIPO (PCT)
Prior art keywords: text, model, processed, feature, audio
Application number: PCT/CN2022/120120
Other languages: French (fr), Chinese (zh)
Inventors: 代东洋, 黄雷, 陈彦洁, 李鑫, 陈远哲, 王玉平
Original Assignee: 北京字跳网络技术有限公司
Application filed by 北京字跳网络技术有限公司
Publication of WO2023045954A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 — Speech synthesis; Text to speech systems
    • G10L 13/02 — Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 — Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/04 — Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 — Changing voice quality, e.g. pitch or formants
    • G10L 21/007 — Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 21/013 — Adapting to target pitch
    • G10L 25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18 — Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band

Definitions

  • the present disclosure relates to the technical field of artificial intelligence, and in particular to a speech synthesis method, device, electronic equipment and readable storage medium.
  • the present disclosure provides a speech synthesis method, device, electronic equipment and readable storage medium.
  • the present disclosure provides a speech synthesis method, including:
  • the text to be processed is input into the speech synthesis model, and the spectral features corresponding to the text to be processed output by the speech synthesis model are obtained;
  • the speech synthesis model includes: a prosody sub-model and a timbre sub-model, and the prosody sub-model is used to output the first acoustic feature corresponding to the text to be processed according to the input text to be processed, and the first acoustic feature includes a bottleneck feature for characterizing the target rap style;
  • the timbre sub-model is used to output the spectral feature corresponding to the text to be processed according to the input first acoustic feature, and the spectral feature corresponding to the text to be processed includes the spectral feature used to characterize the target timbre;
  • according to the spectral feature corresponding to the text to be processed, the target audio corresponding to the text to be processed is acquired, and the target audio has the target timbre and the target rap style.
  • the prosody sub-model is obtained by training according to the labeled text corresponding to the first sample audio and the second acoustic feature corresponding to the first sample audio;
  • the first sample audio includes at least one audio of the target rap style; the second acoustic feature includes a first labeled bottleneck feature corresponding to the first sample audio.
  • the timbre sub-model is obtained by training according to the third acoustic feature corresponding to the second sample audio, the first labeled spectral feature corresponding to the second sample audio, the fourth acoustic feature corresponding to the third sample audio, and the second labeled spectral feature corresponding to the third sample audio;
  • the third acoustic feature includes the second labeled bottleneck feature corresponding to the second sample audio; the third sample audio includes at least one audio with the target timbre, and the fourth acoustic feature corresponding to the third sample audio includes a third labeled bottleneck feature corresponding to the third sample audio.
  • the first labeled bottleneck feature corresponding to the first sample audio, the second labeled bottleneck feature corresponding to the second sample audio, and the third labeled bottleneck feature corresponding to the third sample audio are obtained by an encoder of an end-to-end speech recognition model performing bottleneck feature extraction on the input first sample audio, second sample audio, and third sample audio, respectively.
  • the second acoustic feature further includes: a first labeled fundamental frequency feature corresponding to the first sample audio;
  • the third acoustic feature also includes: a second labeled fundamental frequency feature corresponding to the second sample audio; the fourth acoustic feature further includes: a third labeled fundamental frequency feature corresponding to the third sample audio;
  • the first acoustic feature further includes: a fundamental frequency feature corresponding to the text to be processed.
  • the method also includes: adding the target audio corresponding to the text to be processed to the target multimedia content.
  • the present disclosure provides a speech synthesis device, including:
  • an obtaining module, configured to obtain the text to be processed;
  • a processing module configured to input the text to be processed into a speech synthesis model, and obtain spectral features corresponding to the text to be processed output by the speech synthesis model;
  • the speech synthesis model includes: a prosody sub-model and a timbre sub-model, the prosody sub-model is used to output the first acoustic feature corresponding to the text to be processed according to the input text to be processed, and the first acoustic feature includes a bottleneck feature for characterizing the target rap style;
  • the timbre sub-model is used to output the spectral features corresponding to the text to be processed according to the input first acoustic feature, and the spectral features corresponding to the text to be processed include spectral features used to characterize the target timbre;
  • the processing module is configured to acquire target audio corresponding to the text to be processed according to the frequency spectrum feature corresponding to the text to be processed, the target audio having the target timbre and the target rap style.
  • the present disclosure provides an electronic device, including: a memory, a processor, and a computer program;
  • said memory is configured to store said computer program
  • the processor is configured to execute the computer program to implement the speech synthesis method according to any one of the first aspect.
  • the present disclosure provides a readable storage medium, including: a computer program; when the computer program is executed, the speech synthesis method according to any one of the first aspect can be realized.
  • the present disclosure provides a program product, the program product including: a computer program; the computer program is stored in a readable storage medium, an electronic device acquires the computer program from the readable storage medium, and at least one processor of the electronic device executes the computer program to implement the speech synthesis method according to any one of the first aspect.
  • the present disclosure provides a speech synthesis method, device, electronic equipment and readable storage medium, wherein the text to be processed is analyzed based on the speech synthesis model and the spectral features corresponding to the text to be processed are output; the speech synthesis model includes a prosody sub-model and a timbre sub-model;
  • the prosody sub-model is used to receive the text to be processed as input and output the first acoustic feature corresponding to the text to be processed, wherein the first acoustic feature includes a bottleneck feature for characterizing the target rap style;
  • the timbre sub-model receives the first acoustic feature as input and outputs the spectral feature corresponding to the text to be processed, and the spectral feature includes the spectral feature used to characterize the target timbre; by converting the spectral feature output by the speech synthesis model, rap audio with the target rap style and the target timbre can be obtained, meeting the user's personalized needs for synthesized audio; and the speech synthesis model supports the conversion of any text to be processed, which reduces the requirements on the user's music creation ability and is conducive to improving the user's enthusiasm for creating multimedia content.
  • FIG. 1a to 1c are structural schematic diagrams of a speech synthesis model provided by an embodiment of the present disclosure
  • FIG. 2 is a flowchart of a speech synthesis method provided by an embodiment of the present disclosure
  • FIG. 3 is a flowchart of a speech synthesis method provided by another embodiment of the present disclosure.
  • FIG. 4 is a schematic structural diagram of a speech synthesis device provided by an embodiment of the present disclosure.
  • Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • the present disclosure provides a speech synthesis method, device, electronic equipment, readable storage medium, and program product, wherein the method converts text into audio with the target rap style and the target timbre through a pre-trained speech synthesis model, and the speech synthesis model can control the target rap style and the timbre relatively independently during speech synthesis, so as to meet the user's demand for personalized speech synthesis.
  • the target rap style mentioned in the present disclosure may include any type of rap style, and the present disclosure does not limit the specific rap style of the target rap style.
  • the target rap style may be any rap style among popular rap, alternative rap, comedy rap, jazz rap, and hip-hop rap.
  • the speech synthesis method provided by the present disclosure can be executed by electronic equipment.
  • the electronic device can be a tablet computer, a mobile phone (such as a folding-screen mobile phone or a large-screen mobile phone), a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), a smart TV, a smart screen, a high-definition TV, a 4K TV, a smart speaker, a smart projector, or another Internet of Things (IoT) device; this disclosure does not place any restrictions on the specific type of the electronic device.
  • the electronic device that trains and obtains the speech synthesis model and the electronic device that uses the speech synthesis model to execute the speech synthesis service may be different electronic devices or the same electronic device, which is not limited in the present disclosure.
  • For example, the speech synthesis model is obtained through training by a server device, and the server device sends the trained speech synthesis model to a terminal device/server device, which executes the speech synthesis service according to the speech synthesis model; for another example, the speech synthesis model is trained by a server device, the trained speech synthesis model is then deployed on the server device, and the server device invokes the speech synthesis model to process the speech synthesis service.
  • the present disclosure does not limit this, and it can be set flexibly in practical applications.
  • the speech synthesis model in this solution is decoupled into two sub-models by introducing acoustic features that include bottleneck features, namely the prosody sub-model and the timbre sub-model, wherein the prosody sub-model is used to establish a deep mapping between text and the acoustic features including bottleneck features, and the timbre sub-model is used to establish a deep mapping between the acoustic features including bottleneck features and the spectral features.
  • the two decoupled feature extraction sub-models can be trained using different sample audio.
  • the prosody sub-model is used to establish a deep mapping between the text sequence and the acoustic features containing bottleneck features.
  • the prosody sub-model needs to use high-quality first sample audio with the target rap style, together with the annotated text corresponding to the first sample audio, as the sample data for training the prosody sub-model.
  • the timbre sub-model is used to establish the depth mapping between the acoustic features including bottleneck features and the spectral features.
  • the timbre sub-model can be trained using second sample audio whose corresponding text has not been annotated; since there is no need to label the text corresponding to the second sample audio, this can greatly reduce the cost of acquiring the second sample audio.
  • the acoustic features output by the prosody sub-model include the bottleneck features used to characterize the target rap style, realizing rap-style control over speech synthesis.
  • the acoustic features output by the prosody sub-model may also include fundamental frequency features used to characterize pitch, realizing pitch control over speech synthesis.
  • the spectral features corresponding to the text output by the timbre sub-model include the spectral features used to characterize the target timbre, realizing timbre control over speech synthesis.
  • the spectral features output by the timbre sub-model also include the spectral features used to represent the target rap style, and the spectral features representing the target timbre and the spectral features representing the target rap style are the same spectral features. If the acoustic features output by the prosody sub-model also include fundamental frequency features, the spectral features output by the timbre sub-model also include spectral features representing the corresponding fundamental frequency, and the spectral features representing the target timbre, the spectral features representing the target rap style, and the spectral features representing the fundamental frequency are the same spectral features.
  • the speech synthesis model can be trained with a relatively small amount of third sample audio with the target timbre, so that the final speech synthesis model can synthesize audio with the target timbre; even if the quality of the third sample audio is not high, for example if the pronunciation is non-standard or the speech is not fluent, the speech synthesis model can still stably synthesize audio with the target timbre.
  • Since the timbre sub-model has already been trained on the second sample audio, it already has a strong ability to control timbre in speech synthesis; therefore, even if the timbre sub-model learns from only a small amount of third sample audio, it can still master the target timbre well.
  • FIG. 1a shows the overall frame diagram of the training and acquisition of the speech synthesis model
  • Fig. 1b and Fig. 1c respectively exemplarily show the structural diagrams of the prosody sub-model and the timbre sub-model included in the speech synthesis model.
  • the speech synthesis model 100 includes: a prosody sub-model 101 and a timbre sub-model 102 .
  • the process of training the speech synthesis model 100 includes the process of training the prosody sub-model 101 and the process of training the timbre sub-model 102 .
  • the prosody sub-model 101 is trained according to the labeled text corresponding to the first sample audio and the labeled acoustic features (hereinafter the labeled acoustic features corresponding to the first sample audio are referred to as the second acoustic feature); by learning the relationship between the labeled text corresponding to the first sample audio and the second acoustic feature, the prosody sub-model 101 obtains the ability to establish a deep mapping between text and acoustic features that include bottleneck features.
  • the aforementioned marked text may specifically be a text sequence.
  • the prosody sub-model 101 is specifically used to analyze the labeled text corresponding to the input first sample audio, model the intermediate feature sequence, perform feature conversion and dimensionality reduction on the intermediate feature sequence, and output the fifth acoustic features.
  • the loss function information of the current round of training is calculated, and the coefficient values of the parameters included in the prosody sub-model 101 are adjusted according to the loss function information of the current round of training.
  • By iterating in this way over the labeled text corresponding to the first sample audios and the second acoustic features (including the first labeled bottleneck features) corresponding to the first sample audios, the first feature extraction model 101 satisfying the corresponding convergence condition is finally obtained.
  • the second acoustic feature corresponding to the first audio sample can be understood as the learning objective of the prosody sub-model 101 .
  • the first sample audio can be a high-quality audio file (high-quality audio can also be understood as clean audio), and the annotation text corresponding to the first sample audio can include one or more characters or one or more phonemes corresponding to the first sample audio, which is not limited in this disclosure.
  • the first audio sample can be obtained by recording and cleaning multiple times according to actual needs, or can also be obtained by filtering from an audio database and cleaning multiple times. The present disclosure does not limit the acquisition method of the first sample audio.
  • the annotation text corresponding to the first audio sample may also be obtained through repeated annotation and correction, so as to ensure the accuracy of the annotation text.
  • the first sample audio mentioned in this disclosure is audio with the target rap style. This disclosure does not limit the duration, file format, quantity and other parameters of the first sample audio, and the first sample audio can be pieces of music sung by the same singer or by different singers.
  • the fifth acoustic feature corresponding to the labeled text can be understood as the predicted acoustic feature corresponding to the labeled text output by the prosodic sub-model 101, and the fifth acoustic feature corresponding to the labeled text can also be understood as the fifth acoustic feature corresponding to the first sample audio .
  • the second acoustic feature includes: a first labeled bottleneck feature corresponding to the first audio sample.
  • the bottleneck feature comes from a nonlinear feature transformation that is also an effective dimensionality reduction technique.
  • the bottleneck feature may include information of dimensions such as prosody and content.
  • the first labeled bottleneck feature corresponding to the first sample audio may be obtained by the encoder of an end-to-end automatic speech recognition (ASR) model, hereinafter referred to as the ASR model.
  • the first sample audio can be input to the ASR model 104, and the first labeled bottleneck feature corresponding to the first sample audio output by the encoder of the ASR model 104 is obtained, wherein the encoder of the ASR model 104 serves as a pre-trained bottleneck feature extractor and can be used in this solution to prepare sample data.
  • the ASR model 104 may also include other modules.
  • the ASR model 104 also includes a decoder (decoder) and an attention network (attention network).
  • Using the encoder of the ASR model 104 is only an example, and does not limit the manner of obtaining the first labeled bottleneck feature corresponding to the first sample audio; in practical applications, it can also be obtained in other ways, which is not limited in the present disclosure.
  • If a database stores the first sample audio and the first labeled bottleneck feature corresponding to the first sample audio, the electronic device may also acquire the first sample audio and the first labeled bottleneck feature from the database.
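  • The following is an illustrative sketch, not taken from this disclosure, of how frame-level bottleneck features might be extracted from sample audio with the encoder of a pretrained end-to-end ASR model; `load_pretrained_asr_encoder` is a hypothetical helper, and any encoder that maps a waveform to a compact hidden-feature sequence could play this role.

```python
# Illustrative sketch, assuming a pretrained ASR encoder is available.
import torch
import torchaudio

def extract_bottleneck_features(wav_path: str, encoder: torch.nn.Module) -> torch.Tensor:
    waveform, sample_rate = torchaudio.load(wav_path)    # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)         # mix down to mono
    with torch.no_grad():
        # Encoder output: (1, num_frames, bottleneck_dim), used as the
        # labeled bottleneck feature paired with this sample audio.
        features = encoder(waveform)
    return features.squeeze(0)

# encoder = load_pretrained_asr_encoder()   # hypothetical pretrained extractor
# bn_feat = extract_bottleneck_features("first_sample_audio.wav", encoder)
```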
  • the second acoustic feature corresponding to the first sample audio includes: a first labeled bottleneck feature corresponding to the first sample audio and a first labeled fundamental frequency feature corresponding to the first sample audio.
  • For the first labeled bottleneck feature, refer to the detailed description of the foregoing examples; for the sake of brevity, details are not repeated here.
  • pitch represents the human ear's subjective perception of how high or low a sound is.
  • the pitch mainly depends on the fundamental frequency of the sound. The higher the fundamental frequency, the higher the pitch, and the lower the fundamental frequency, the lower the pitch.
  • pitch is also one of the important factors affecting the effect of speech synthesis.
  • this solution introduces the fundamental frequency feature alongside the bottleneck feature, so that the final prosody sub-model 101 can output the corresponding bottleneck feature and fundamental frequency feature according to the input text.
  • the prosody sub-model 101 is specifically used to analyze the labeled text corresponding to the input first sample audio, model the intermediate feature sequence, perform feature conversion and dimensionality reduction on the intermediate feature sequence, and output the fifth acoustic features.
  • the fifth acoustic feature corresponding to the tagged text may be understood as the predicted acoustic feature corresponding to the tagged text output by the prosody sub-model 101 .
  • the fifth acoustic feature corresponding to the marked text may also be understood as the fifth acoustic feature corresponding to the first audio sample.
  • When the second acoustic feature corresponding to the first sample audio includes the first labeled bottleneck feature and the first labeled fundamental frequency feature, then during the training process the fifth acoustic feature corresponding to the first sample audio output by the prosody sub-model 101 also includes: a predicted bottleneck feature and a predicted fundamental frequency feature corresponding to the first sample audio.
  • the loss function information of the current round of training is calculated, and the coefficient values of the parameters included in the prosody sub-model 101 are adjusted according to the loss function information.
  • By iterating over the labeled text corresponding to the first sample audio and the second acoustic feature corresponding to the first sample audio (including the first labeled bottleneck feature and the first labeled fundamental frequency feature), the first feature extraction model 101 satisfying the corresponding convergence condition is obtained.
  • the second acoustic feature corresponding to the first audio sample can be understood as the learning objective of the prosody sub-model 101 .
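  • As a rough illustration only, one training round of the prosody sub-model might look like the sketch below, where the hypothetical `prosody_model` maps a token sequence to predicted bottleneck and fundamental-frequency sequences and the labeled features of the first sample audio serve as the learning objective; the choice of an L1 regression loss is an assumption, since the disclosure only states that loss function information is computed per round.

```python
import torch
import torch.nn.functional as F

def prosody_training_step(prosody_model, optimizer, text_ids,
                          labeled_bottleneck, labeled_f0):
    # Predicted (fifth) acoustic features for the labeled text: the
    # hypothetical model returns a bottleneck sequence and an F0 sequence.
    pred_bottleneck, pred_f0 = prosody_model(text_ids)
    # Assumed regression loss against the labeled (second) acoustic features.
    loss = F.l1_loss(pred_bottleneck, labeled_bottleneck) + \
           F.l1_loss(pred_f0, labeled_f0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # adjust the coefficient values of the model parameters
    return loss.item()
```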
  • the first labeled fundamental frequency feature corresponding to the first sample audio can be obtained by analyzing the first sample audio by a digital signal processing (DSP) method.
  • digital signal processing may be performed on the first sample audio by the digital signal processor 105 to obtain the first labeled fundamental frequency feature corresponding to the first sample audio.
  • the specific implementation manner of the digital signal processor 105 is not limited, as long as it can extract the first marked fundamental frequency feature corresponding to the input first sample audio.
  • the first marked fundamental frequency feature corresponding to the first sample audio is not limited to be obtained by digital signal processing, and the present disclosure does not limit the implementation manner of obtaining the first marked fundamental frequency feature.
  • some databases store the first sample audio and the first labeled fundamental frequency feature corresponding to the first sample audio, and the first sample audio and the first labeled fundamental frequency feature may also be acquired from the database.
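  • As an illustration of such digital signal processing, the sketch below extracts a labeled fundamental-frequency (F0) contour from sample audio using librosa's pYIN estimator; this is only one possible implementation, since the disclosure does not prescribe a specific algorithm.

```python
import librosa
import numpy as np

def extract_f0(wav_path: str, hop_length: int = 256) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=None)
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),   # ~65 Hz lower bound (assumed range)
        fmax=librosa.note_to_hz("C6"),   # ~1047 Hz upper bound
        sr=sr,
        hop_length=hop_length,
    )
    return np.nan_to_num(f0)             # unvoiced frames -> 0
```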
  • the convergence condition corresponding to the prosody sub-model may include, but is not limited to, evaluation indicators such as the number of iterations and loss threshold.
  • the present disclosure does not limit the convergence conditions corresponding to the training prosodic sub-models.
  • Depending on whether the electronic device performs training according to the first labeled bottleneck feature corresponding to the first sample audio alone, or according to both the first labeled bottleneck feature and the first labeled fundamental frequency feature corresponding to the first sample audio, the convergence conditions may differ.
  • Similarly, when the electronic device performs training according to the first labeled bottleneck feature corresponding to the first sample audio alone, or according to both the first labeled bottleneck feature and the first labeled fundamental frequency feature corresponding to the first sample audio, the loss functions corresponding to the pre-built prosody sub-model can be the same or different.
  • the present disclosure does not limit the implementation manner of the loss function corresponding to the pre-built prosody sub-model.
  • the network structure of the prosodic sub-model is exemplarily shown below.
  • FIG. 1 b exemplarily shows an implementation of the prosodic sub-model 101 .
  • the prosodic sub-model 101 may include: a text encoding network (text encoder) 1011 , an attention network (attention) 1012 and a decoding network (decoder) 1013 .
  • the text coding network 1011 is used to receive text as input, analyze the context and time sequence relationship of the input text, and model an intermediate feature sequence, which contains context information and time sequence relationship.
  • the decoding network 1013 can adopt an autoregressive network structure, by using the output of the previous time step as the input of the next time step.
  • the attention network 1012 is mainly used to output attention coefficients.
  • the attention coefficient and the intermediate feature sequence output by the text encoding network 1011 are weighted and averaged to obtain a weighted average result, which is used as another conditional input for each time step of the decoding network 1013 .
  • the decoding network 1013 outputs the predicted acoustic features corresponding to the text by performing feature conversion on the input (ie, the weighted average result and the output of the previous time step).
  • the predicted acoustic features corresponding to the text output by the decoding network 1013 may include: predicted bottleneck features corresponding to the text; or, the predicted acoustic features corresponding to the text output by the decoding network 1013 may include: predicted bottleneck features corresponding to the text The predicted fundamental frequency features corresponding to the text.
  • the initial values of the coefficients of the parameters included in the prosody sub-model 101 may be randomly generated, preset, or determined in other ways, which is not limited in the present disclosure.
  • the prosody sub-model 101 is iteratively trained through the labeled texts corresponding to the plurality of first sample audios and the second acoustic features respectively corresponding to the first sample audios, and the coefficient values of the parameters included in the prosody sub-model 101 are continuously optimized until the convergence condition of the prosody sub-model 101 is met, at which point the training of the prosody sub-model 101 is stopped.
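  • As a rough, non-authoritative illustration of the structure in Fig. 1b, the sketch below wires together a text encoding network, an attention network, and an autoregressive decoding network that emits a bottleneck frame (optionally with a fundamental frequency value) per time step; the layer types, dimensions, and fixed number of decoding steps are assumptions rather than details stated in the disclosure.

```python
import torch
import torch.nn as nn

class ProsodySubModel(nn.Module):
    def __init__(self, vocab_size=100, emb_dim=256, hidden_dim=256,
                 bottleneck_dim=64, use_f0=True):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Text encoding network: models context and temporal relationships
        # of the input text as an intermediate feature sequence.
        self.text_encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                                    bidirectional=True)
        # Attention network: outputs attention coefficients over encoder states.
        self.attention = nn.MultiheadAttention(2 * hidden_dim, num_heads=4,
                                               batch_first=True)
        self.query_proj = nn.Linear(hidden_dim, 2 * hidden_dim)
        # Decoding network: autoregressive, feeding back the previous step's output.
        self.out_dim = bottleneck_dim + (1 if use_f0 else 0)
        self.decoder_cell = nn.LSTMCell(2 * hidden_dim + self.out_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, self.out_dim)

    def forward(self, text_ids, num_frames):
        enc, _ = self.text_encoder(self.embedding(text_ids))  # (B, T, 2H)
        batch = text_ids.size(0)
        h = enc.new_zeros(batch, self.decoder_cell.hidden_size)
        c = torch.zeros_like(h)
        prev = enc.new_zeros(batch, self.out_dim)
        outputs = []
        for _ in range(num_frames):
            # Weighted average of the intermediate feature sequence, used as an
            # additional conditional input for each decoding time step.
            ctx, _ = self.attention(self.query_proj(h).unsqueeze(1), enc, enc)
            h, c = self.decoder_cell(torch.cat([ctx.squeeze(1), prev], dim=-1),
                                     (h, c))
            prev = self.proj(h)             # bottleneck (+ F0) frame prediction
            outputs.append(prev)
        return torch.stack(outputs, dim=1)  # first acoustic feature: (B, frames, out_dim)
```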
  • Training the timbre sub-model 102 includes two stages: the first stage is to train the timbre sub-model based on the second sample audio to obtain an intermediate model; the second stage is to fine-tune the intermediate model based on the third sample audio to obtain the final timbre sub-model.
  • the present disclosure does not limit the timbre of the second sample audio; in addition, the third sample audio is a sample audio with a target timbre.
  • the spectral features output by the above-mentioned timbre sub-model may be Mel spectral features, or other types of spectral features.
  • For example, the first labeled spectral feature corresponding to the second sample audio input to the timbre sub-model is a first labeled Mel spectral feature, and the second labeled spectral feature corresponding to the third sample audio is a second labeled Mel spectral feature.
  • In the following, the Mel spectral feature is taken as an example, and the predicted spectral feature output by the timbre sub-model is referred to as the predicted Mel spectral feature.
  • The first stage:
  • the timbre sub-model 102 is used to perform iterative training according to the second sample audio to obtain an intermediate model.
  • the timbre sub-model 102 learns the mapping relationship between the third acoustic feature corresponding to the second sample audio and the first labeled mel spectrum feature of the second sample audio, and obtains an intermediate model with certain speech synthesis control capabilities for timbre, wherein , the first marked Mel spectral feature includes: a spectral feature used to characterize the timbre of the corresponding second sample audio.
  • the present disclosure does not limit parameters of the second sample audio such as its timbre, duration, storage format, and quantity.
  • the second sample audio may include the audio of the specific target tone, or may include the audio of the non-target tone, or the second sample audio may include both the audio of the target tone and the audio of the non-target tone.
  • During training, the timbre sub-model 102 is used to analyze the third acoustic feature corresponding to the input second sample audio and output the predicted Mel spectrum feature corresponding to the second sample audio; then, based on the first labeled Mel spectrum feature corresponding to the second sample audio and the predicted Mel spectrum feature corresponding to the second sample audio, the coefficient values of the parameters included in the timbre sub-model 102 are adjusted; iterative training continues in this way to obtain an intermediate model.
  • the first marked Mel spectrum feature can be understood as the learning goal of the timbre sub-model 102 in the first stage.
  • the second sample audio does not need to be labeled with corresponding text, which can greatly reduce the time and labor cost of obtaining the second sample audio. Moreover, a large amount of audio can be obtained at a relatively low cost as second sample audio for iterative training of the timbre sub-model 102; by training the timbre sub-model 102 with a large amount of second sample audio, the intermediate model obtains a strong speech synthesis control capability for timbre.
  • In the second stage, the intermediate model is trained based on the third sample audio, so that the intermediate model learns the target timbre and obtains the speech synthesis control ability for the target timbre.
  • Because the intermediate model already has a strong ability to control timbre in speech synthesis, the requirements on the third sample audio, such as its duration and quality, are reduced; even if the duration of the third sample audio is short or its pronunciation is not clear, the final timbre sub-model 102 obtained through training can still achieve a strong speech synthesis control ability for the target timbre.
  • the third sample audio has the target timbre.
  • the third sample audio may be audio recorded by a user, or may be audio with the desired timbre uploaded by a user, and the disclosure does not limit the source and acquisition method of the third sample audio.
  • the fourth acoustic feature corresponding to the third sample audio is input to the intermediate model, and the predicted Mel spectrum feature corresponding to the third sample audio output by the intermediate model is obtained; then, based on the second labeled Mel spectrum feature corresponding to the third sample audio and the predicted Mel spectrum feature corresponding to the third sample audio, the loss function information corresponding to the current round of training is calculated; the coefficient values of the parameters included in the intermediate model are adjusted according to the loss function information, so as to obtain the final timbre sub-model 102.
  • the second labeled mel spectrum feature corresponding to the third audio sample can be understood as the learning target of the intermediate model.
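  • A compact sketch of this two-stage schedule, under assumed optimizer settings, might look as follows: stage one pre-trains the timbre sub-model on (acoustic feature, labeled Mel spectrum) pairs derived from the second sample audio to obtain the intermediate model, and stage two fine-tunes the intermediate model on the smaller set of third sample audio carrying the target timbre.

```python
import torch
import torch.nn.functional as F

def train_epoch(model, loader, optimizer):
    for acoustic_feat, labeled_mel in loader:
        pred_mel = model(acoustic_feat)
        loss = F.l1_loss(pred_mel, labeled_mel)   # assumed spectral regression loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def train_timbre_submodel(model, stage1_loader, stage2_loader,
                          stage1_epochs=100, stage2_epochs=20):
    # Stage 1: large amount of second sample audio (no text labels needed).
    opt1 = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(stage1_epochs):
        train_epoch(model, stage1_loader, opt1)
    # Stage 2: fine-tune the intermediate model on third sample audio with the
    # target timbre, typically with a smaller learning rate (an assumption).
    opt2 = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(stage2_epochs):
        train_epoch(model, stage2_loader, opt2)
    return model
```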
  • If the fifth acoustic feature output by the prosody sub-model 101 according to the input labeled text of the first sample audio includes the predicted bottleneck feature, that is, the prosody sub-model 101 realizes the mapping from text to bottleneck features, then the third acoustic feature corresponding to the second sample audio input into the timbre sub-model 102 includes the second labeled bottleneck feature corresponding to the second sample audio, and the fourth acoustic feature corresponding to the third sample audio includes a third labeled bottleneck feature corresponding to the third sample audio.
  • the second labeled bottleneck feature and the third labeled bottleneck feature can be obtained by the encoder of the ASR model extracting bottleneck features from the second sample audio and the third sample audio respectively, which is similar to the implementation of obtaining the first labeled bottleneck feature; for the sake of brevity, details are not repeated here.
  • If the output fifth acoustic feature includes the predicted bottleneck feature and the predicted fundamental frequency feature, that is, the prosody sub-model 101 realizes the mapping from text to bottleneck features and fundamental frequency features, then the third acoustic feature corresponding to the second sample audio input into the timbre sub-model 102 includes the second labeled bottleneck feature and the second labeled fundamental frequency feature corresponding to the second sample audio, and the fourth acoustic feature corresponding to the third sample audio includes a third labeled bottleneck feature and a third labeled fundamental frequency feature corresponding to the third sample audio.
  • the second labeled bottleneck feature and the third labeled bottleneck feature can be obtained by extracting the bottleneck feature of the second sample audio and the third sample audio respectively by the encoder of the ASR model, which is similar to the implementation method of obtaining the first labeled bottleneck feature;
  • the second labeled fundamental frequency feature and the third labeled fundamental frequency feature can be obtained by analyzing the second sample audio and the third sample audio respectively through digital signal processing, which is similar to the implementation of obtaining the first labeled fundamental frequency feature; for the sake of brevity, details are not repeated here.
  • the input of the timbre sub-model 102 is consistent with the output of the prosody sub-model 101 .
  • the initial values of the coefficients corresponding to the parameters included in the timbre sub-model 102 may be preset or initialized randomly, which is not limited in the present disclosure.
  • the loss functions used for the timbre sub-model in the two training stages may be the same or different, which is not limited in the present disclosure.
  • FIG. 1 c exemplarily shows an implementation manner of the timbre sub-model 102 .
  • the timbre sub-model 102 can be implemented using a self-attention network structure.
  • the timbre sub-model 102 includes: a convolutional network 1021 and one or more residual networks 1022 .
  • each residual network 1022 includes: a self-attention network 1022a and a linear network 1022b.
  • the convolution network 1021 is mainly used to perform convolution processing on the acoustic features corresponding to the input sample audio, and to model local feature information.
  • the convolutional network 1021 may include one or more convolutional layers, and this disclosure does not limit the number of convolutional layers included in the convolutional network 1021 .
  • the convolutional network 1021 inputs the local feature information to the connected residual network 1022 .
  • the local feature information is converted into spectral features (such as Mel spectral features) after passing through the one or more residual networks 1022.
  • the structure of the intermediate model is the same as that of the timbre sub-model 102 shown in Fig. 1c; the difference is that the weight coefficients of the included parameters are not completely the same.
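  • The sketch below illustrates one possible reading of the structure in Fig. 1c: a convolutional network for local feature information, followed by residual blocks that each contain a self-attention network and a linear network, and a final projection to Mel spectral frames; all dimensions and block counts are assumptions.

```python
import torch
import torch.nn as nn

class ResidualSelfAttentionBlock(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.linear = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):                      # x: (B, frames, dim)
        attn_out, _ = self.self_attention(x, x, x)
        x = x + attn_out                       # residual connection
        return x + self.linear(x)              # residual connection

class TimbreSubModel(nn.Module):
    def __init__(self, in_dim=65, dim=256, n_blocks=4, n_mels=80):
        super().__init__()
        # Convolutional network: models local feature information.
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.blocks = nn.ModuleList(ResidualSelfAttentionBlock(dim) for _ in range(n_blocks))
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, acoustic_feat):          # (B, frames, in_dim): bottleneck (+ F0)
        x = self.conv(acoustic_feat.transpose(1, 2)).transpose(1, 2)
        for block in self.blocks:
            x = block(x)
        return self.to_mel(x)                  # predicted Mel spectral features
```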
  • the first feature extraction model (the prosody sub-model) and the second feature extraction model (the timbre sub-model) that meet the requirements of speech synthesis are finally obtained; the two models are spliced to obtain a speech synthesis model capable of synthesizing audio with the target timbre.
  • the speech synthesis model 100 may further include: a vocoder (vocoder) 103 .
  • the vocoder 103 is used to convert the spectral features (such as Mel spectral features) output by the timbre sub-model 102 into audio.
  • the vocoder can also be used as an independent module that is not bound together with the speech synthesis model, and this solution does not limit the specific type of the vocoder.
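  • Because the vocoder type is not limited, the conversion from Mel spectral features to a waveform can be illustrated with a simple signal-processing stand-in such as Griffin-Lim, as sketched below; a neural vocoder would normally be preferred in practice.

```python
import librosa
import numpy as np

def mel_to_wav(mel: np.ndarray, sr: int = 22050, n_fft: int = 1024,
               hop_length: int = 256) -> np.ndarray:
    # mel: (n_mels, frames), power Mel spectrogram; parameters are assumptions.
    wav = librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length)
    return wav

# The returned waveform could then be written to disk, e.g. with soundfile.
```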
  • the target speech synthesis model finally obtained through training has the ability to stably synthesize the audio of the target timbre. Based on this, the target speech synthesis model can be used to process corresponding speech synthesis services.
  • Fig. 2 is a flowchart of a speech synthesis method provided by an embodiment of the present disclosure. As shown in Figure 2, the speech synthesis method provided by this embodiment includes:
  • the text to be processed may include one or more characters, or the text to be processed may also include one or more phonemes.
  • the text to be processed is used to synthesize audio with the target rap style and target timbre.
  • the present disclosure does not limit the manner in which the electronic device obtains the text to be processed.
  • the electronic device can display a text input window and a soft keyboard to the user, and the user can enter the text to be processed into the text input window by operating the soft keyboard displayed on the electronic device; or the user can also copy and paste the text to be processed into the text input window; or the user can also input a piece of audio to the electronic device by voice, and the electronic device obtains the text to be processed by performing speech recognition on the audio input by the user; or a file containing the text to be processed can also be imported to the electronic device, so that the electronic device obtains the text to be processed.
  • the user may, but is not limited to, input the text to be processed into the electronic device by means of the above examples.
  • the operation is simple and convenient, and the user's enthusiasm for creating multimedia content can be enhanced.
  • In one possible implementation, the text to be processed is input into the speech synthesis model; the prosody sub-model outputs the first acoustic feature corresponding to the text to be processed by performing feature extraction on the text to be processed, and the first acoustic feature includes the bottleneck feature corresponding to the text to be processed, wherein the bottleneck feature included in the first acoustic feature is used to characterize the target rap style; the timbre sub-model receives the first acoustic feature corresponding to the text to be processed as input, and outputs the spectral feature corresponding to the text to be processed.
  • In another possible implementation, the text to be processed is input into the speech synthesis model, and the prosody sub-model outputs the first acoustic feature corresponding to the text to be processed by performing feature extraction on the text to be processed, and the first acoustic feature includes the bottleneck feature corresponding to the text to be processed and the fundamental frequency feature corresponding to the text to be processed, wherein the bottleneck feature included in the first acoustic feature is used to characterize the target rap style, and the fundamental frequency feature included in the first acoustic feature is used to characterize the pitch; the timbre sub-model receives the first acoustic feature corresponding to the text to be processed as input, and outputs the spectral feature (such as the Mel spectral feature) corresponding to the text to be processed.
  • the speech synthesis model can be obtained through the embodiment shown in Figures 1a to 1c; for the network structure of the speech synthesis model and the manner of training the speech synthesis model, reference may be made to the embodiment shown in the aforementioned Figures 1a to 1c, and for the sake of brevity, the detailed description is not repeated here.
  • the text encoding network included in the prosody sub-model can receive the text to be processed as input and model an intermediate feature sequence by analyzing the context and temporal relationships of the text to be processed; the attention coefficients output by the attention network included in the prosody sub-model are then weighted and averaged with the intermediate feature sequence to obtain a weighted average result; the decoding network included in the prosody sub-model performs feature conversion on its input (the weighted average result and the output of the previous time step) and outputs the first acoustic feature corresponding to the text to be processed, where the first acoustic feature may include the bottleneck feature corresponding to the text to be processed, or the bottleneck feature corresponding to the text to be processed and the fundamental frequency feature corresponding to the text to be processed.
  • the convolutional network included in the timbre sub-model receives the first acoustic feature corresponding to the text to be processed as input, performs convolution processing on the first acoustic feature corresponding to the text to be processed, and models Local feature information; the convolutional network inputs the local feature information into the connected residual network, and after passing through one or more residual networks, outputs the spectral features (such as Mel spectral features) corresponding to the text to be processed.
  • the electronic device may perform digital signal processing on the spectral features corresponding to the text to be processed based on the vocoder, so as to convert the spectral features corresponding to the text to be processed (such as the Mel spectrum feature corresponding to the text to be processed) into audio with the target timbre and the target rap style, i.e., the target audio.
  • In some cases, the vocoder can be used as a part of the speech synthesis model, and the speech synthesis model can directly output audio with the target timbre and the target rap style; in other cases, the vocoder can be used as an independent module that receives as input the spectral features corresponding to the text to be processed and converts them into audio with the target timbre and the target rap style.
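  • Putting the pieces together, an end-to-end inference pass might be sketched as follows, where `text_to_ids`, `prosody_model`, `timbre_model`, and `vocoder` are hypothetical stand-ins for the trained components and the fixed frame count is a simplification of how the decoder actually decides when to stop.

```python
import torch

@torch.no_grad()
def synthesize(text, text_to_ids, prosody_model, timbre_model, vocoder,
               num_frames=400):
    # All components here are hypothetical stand-ins for the trained sub-models.
    text_ids = text_to_ids(text).unsqueeze(0)               # (1, tokens)
    first_acoustic = prosody_model(text_ids, num_frames)    # bottleneck (+ F0) features
    mel = timbre_model(first_acoustic)                       # spectral features with target timbre
    wav = vocoder(mel)                                       # target audio: target rap style + timbre
    return wav
```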
  • In summary, the speech synthesis method analyzes the text to be processed based on the speech synthesis model and outputs the spectral features corresponding to the text to be processed, wherein the speech synthesis model includes a prosody sub-model and a timbre sub-model; the prosody sub-model receives the text to be processed as input and outputs the first acoustic feature corresponding to the text to be processed, wherein the first acoustic feature includes the bottleneck feature used to characterize the target rap style; the timbre sub-model receives the first acoustic feature as input and outputs the spectral features corresponding to the text to be processed, which include the information of the target timbre. By converting the spectral features output by the speech synthesis model, rap audio with the target rap style and the target timbre can be obtained, which meets the user's individual needs for audio; and the speech synthesis model supports the conversion of any text to be processed, which reduces the requirements on the user's music creation ability.
  • Fig. 3 is a schematic flowchart of a speech synthesis method provided by another embodiment of the present disclosure.
  • the speech synthesis method provided by this embodiment is based on the embodiment shown in Fig. 2; after step S203, that is, after obtaining the target audio corresponding to the text to be processed according to the spectral features corresponding to the text to be processed, the method may also include: adding the target audio to the target multimedia content.
  • the present disclosure does not limit the implementation manner of adding the target audio to the target multimedia content.
  • For example, when the electronic device adds the target audio to the target multimedia content, it can combine the duration of the target multimedia content and the duration of the target audio to speed up or slow down the playback speed of the target audio; it can also add subtitles corresponding to the target audio to the playback interface of the target multimedia content, or choose not to add them; if the subtitles corresponding to the target audio are added on the playback interface of the target multimedia content, display parameters such as the color, font size, and font of the subtitles can also be set.
  • the method provided in this embodiment analyzes the text to be processed based on the speech synthesis model and outputs the spectral features corresponding to the text to be processed, wherein the speech synthesis model includes a prosody sub-model and a timbre sub-model; the prosody sub-model receives the text to be processed as input and outputs the first acoustic feature corresponding to the text to be processed, wherein the first acoustic feature includes a bottleneck feature used to characterize the target rap style; the timbre sub-model receives the first acoustic feature as input and outputs the spectral features corresponding to the text to be processed, including spectral features used to characterize the target timbre; by converting the spectral features output by the speech synthesis model, audio with the target rap style and the target timbre can be obtained, which meets the user's individual needs for audio; and the speech synthesis model supports the conversion of any text to be processed, which reduces the requirements on the user's music creation ability.
  • adding the target audio to the target multimedia content makes the target multimedia content more interesting, thereby satisfying the user's demand for creative video creation.
  • the present disclosure also provides a speech synthesis device.
  • Fig. 4 is a schematic structural diagram of a speech synthesis device provided by an embodiment of the present disclosure.
  • the speech synthesis device 400 provided in this embodiment includes:
  • An acquisition module 401 configured to acquire text to be processed.
  • the processing module 402 is configured to input the text to be processed into the speech synthesis model, and obtain the spectral features corresponding to the text to be processed output by the speech synthesis model; wherein the speech synthesis model includes: a prosody sub-model and a timbre sub-model, the prosody sub-model is used to output the first acoustic feature corresponding to the text to be processed according to the input text to be processed, and the first acoustic feature includes a bottleneck feature for characterizing the target rap style; the timbre sub-model is used to output the spectral features corresponding to the text to be processed according to the input first acoustic feature, and the spectral features corresponding to the text to be processed include spectral features used to characterize the target timbre.
  • the processing module 402 is further configured to acquire the target audio corresponding to the text to be processed according to the spectral features corresponding to the text to be processed, the target audio having the target timbre and the target rap style.
  • the prosody sub-model is obtained by training according to the labeled text corresponding to the first sample audio and the second acoustic feature corresponding to the first sample audio;
  • the first sample audio includes at least one audio of the target rap style; the second acoustic feature includes a first labeled bottleneck feature corresponding to the first sample audio.
  • the timbre sub-model is obtained by training according to the third acoustic feature corresponding to the second sample audio, the first labeled spectral feature corresponding to the second sample audio, the fourth acoustic feature corresponding to the third sample audio, and the second labeled spectral feature corresponding to the third sample audio;
  • the third acoustic feature includes the second labeled bottleneck feature corresponding to the second sample audio; the third sample audio includes at least one audio with the target timbre, and the fourth acoustic feature corresponding to the third sample audio includes a third labeled bottleneck feature corresponding to the third sample audio.
  • the first labeled bottleneck feature corresponding to the first sample audio, the second labeled bottleneck feature corresponding to the second sample audio, and the third labeled bottleneck feature corresponding to the third sample audio are obtained by an encoder of an end-to-end speech recognition model performing bottleneck feature extraction on the input first sample audio, second sample audio, and third sample audio, respectively.
  • the second acoustic feature further includes: a first labeled fundamental frequency feature corresponding to the first sample audio;
  • the third acoustic feature also includes: a second labeled fundamental frequency feature corresponding to the second sample audio; the fourth acoustic feature further includes: a third labeled fundamental frequency feature corresponding to the third sample audio;
  • the first acoustic feature further includes: a fundamental frequency feature corresponding to the text to be processed.
  • the processing module 402 is further configured to add the target audio corresponding to the text to be processed to the target multimedia content.
  • the speech synthesis device provided in this embodiment can be used to implement the technical method of any of the above method embodiments, and its implementation principle and technical effect are similar. For details, please refer to the detailed description of the foregoing method embodiments.
  • the present disclosure also provides an electronic device.
  • Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • the electronic device provided in this embodiment includes: a memory 501 and a processor 502 .
  • the memory 501 may be an independent physical unit, and may be connected with the processor 502 through the bus 503 .
  • the memory 501 and the processor 502 may also be integrated together, implemented by hardware, and the like.
  • the memory 501 is used to store program instructions, and the processor 502 invokes the program instructions to execute the operations of any one of the above method embodiments.
  • the foregoing electronic device 500 may also include only the processor 502 .
  • the memory 501 for storing programs is located outside the electronic device 500, and the processor 502 is connected to the memory through circuits/wires, and is used to read and execute the programs stored in the memory.
  • the processor 502 may be a central processing unit (central processing unit, CPU), a network processor (network processor, NP) or a combination of CPU and NP.
  • the processor 502 may further include a hardware chip.
  • the aforementioned hardware chip may be an application-specific integrated circuit (application-specific integrated circuit, ASIC), a programmable logic device (programmable logic device, PLD) or a combination thereof.
  • the aforementioned PLD may be a complex programmable logic device (complex programmable logic device, CPLD), a field-programmable gate array (field-programmable gate array, FPGA), a general array logic (generic array logic, GAL) or any combination thereof.
  • the memory 501 may include a volatile memory, such as a random-access memory (RAM); the memory may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory may also include a combination of the above-mentioned types of memory.
  • the present disclosure also provides a readable storage medium, including: computer program instructions; when the computer program instructions are executed by at least one processor of the electronic device, the speech synthesis method shown in any one of the above method embodiments is implemented.
  • the present disclosure also provides a program product. The program product includes a computer program, the computer program is stored in a readable storage medium, and at least one processor of the electronic device can read the computer program from the readable storage medium; the at least one processor executes the computer program to enable the electronic device to implement the speech synthesis method shown in any one of the above method embodiments.

Abstract

Provided are a speech synthesis method and apparatus, an electronic device, a readable storage medium, and a program product. The method comprises: acquiring a text to be processed (S201); inputting the text to be processed into a speech synthesis model, so as to obtain an outputted spectral feature corresponding to the text to be processed (S202), wherein the speech synthesis model comprises a prosody sub-model and a timbre sub-model, the prosody sub-model being used to output a corresponding first acoustic feature according to the inputted text to be processed, the first acoustic feature comprising a bottleneck feature for representing a target rap style, and the timbre sub-model being used to output, according to the inputted first acoustic feature, a spectral feature for representing a target timbre; and acquiring, according to the spectral feature corresponding to the text to be processed, a target audio corresponding to the text to be processed, the target audio having the target timbre and the target rap style (S203).

Description

语音合成方法、装置、电子设备及可读存储介质Speech synthesis method, device, electronic device and readable storage medium
相关申请的交叉引用Cross References to Related Applications
本公开要求于2021年9月22日提交的,申请名称为“语音合成方法、装置、电子设备及可读存储介质”的、中国专利申请号为“202111107875.8”的优先权,该中国专利申请的全部内容通过引用结合在本公开中。This disclosure claims the priority of the Chinese patent application number "202111107875.8" filed on September 22, 2021 with the title of "speech synthesis method, device, electronic equipment and readable storage medium". The entire contents are incorporated by reference in this disclosure.
技术领域technical field
本公开涉及人工智能技术领域,尤其涉及一种语音合成方法、装置、电子设备及可读存储介质。The present disclosure relates to the technical field of artificial intelligence, and in particular to a speech synthesis method, device, electronic equipment and readable storage medium.
背景技术Background technique
随着互联网技术的不断发展,应用程序能够支持用户合成创意视频,在合成创意视频时,通常需要为视频添加配乐。目前,为视频添加配乐通常是从音乐库中选择音乐,这样的方式添加的配乐无法满足用户个性化的需求。With the continuous development of Internet technology, application programs can support users to synthesize creative videos. When synthesizing creative videos, it is usually necessary to add soundtracks to the videos. Currently, adding a soundtrack to a video usually involves selecting music from a music library, and the soundtrack added in this way cannot meet the personalized needs of users.
技术解决方案technical solution
为了解决上述技术问题或者至少部分地解决上述技术问题,本公开提供了一种语音合成方法、装置、电子设备及可读存储介质。In order to solve the above technical problems or at least partly solve the above technical problems, the present disclosure provides a speech synthesis method, device, electronic equipment and readable storage medium.
第一方面,本公开提供了一种语音合成方法,包括:In a first aspect, the present disclosure provides a speech synthesis method, including:
获取待处理文本;Get the text to be processed;
将所述待处理文本输入至语音合成模型,获取所述语音合成模型输出的所述待处理文本对应的频谱特征;其中,所述语音合成模型包括:韵律子模型和音色子模型,所述韵律子模型用于根据输入的待处理文本,输出所述待处理文本对应的第一声学特征,所述第一声学特征包括用于表征目标说唱风格的瓶颈特征;所述音色子模型用于根据输入的第一声学特征,输出所述待处理文本对应的频谱特征,所述待处理文本对应的频谱特征包括用于表征目标音色的频谱特征;Inputting the text to be processed into a speech synthesis model, and acquiring spectral features corresponding to the text to be processed output by the speech synthesis model; wherein the speech synthesis model includes a prosody sub-model and a timbre sub-model, the prosody sub-model is used to output, according to the input text to be processed, a first acoustic feature corresponding to the text to be processed, the first acoustic feature including a bottleneck feature for characterizing a target rap style; the timbre sub-model is used to output, according to the input first acoustic feature, the spectral features corresponding to the text to be processed, the spectral features corresponding to the text to be processed including spectral features for characterizing a target timbre;
根据所述待处理文本对应的频谱特征,获取所述待处理文本对应的目标音频,所述目标音频具有所述目标音色以及所述目标说唱风格。According to the spectrum feature corresponding to the text to be processed, the target audio corresponding to the text to be processed is acquired, and the target audio has the target timbre and the target rap style.
在一些可能的实施方式中,所述韵律子模型是根据第一样本音频对应的标注文本以及所述第一样本音频对应的第二声学特征,进行训练获得的;In some possible implementation manners, the prosody sub-model is obtained by training according to the labeled text corresponding to the first sample audio and the second acoustic feature corresponding to the first sample audio;
所述第一样本音频包括至少一个所述目标说唱风格的音频;所述第二声学特征包括所述第一样本音频对应的第一标注瓶颈特征。The first sample audio includes at least one audio of the target rap style; the second acoustic feature includes a first labeled bottleneck feature corresponding to the first sample audio.
在一些可能的实施方式中,所述音色子模型是根据第二样本音频对应的第三声学特征、第二样本音频对应的第一标注频谱特征、第三样本音频对应的第四声学特征以及第三样本音频对应的第二标注频谱特征进行训练获得的;In some possible implementation manners, the timbre sub-model is obtained by training according to a third acoustic feature corresponding to second sample audio, a first labeled spectral feature corresponding to the second sample audio, a fourth acoustic feature corresponding to third sample audio, and a second labeled spectral feature corresponding to the third sample audio;
其中,所述第三声学特征包括所述第二样本音频对应的第二标注瓶颈特征;所述第三样本音频包括至少一个具有所述目标音色的音频,所述第三样本音频对应的第四声学特征包括第三样本音频对应的第三标注瓶颈特征。Wherein, the third acoustic feature includes a second labeled bottleneck feature corresponding to the second sample audio; the third sample audio includes at least one audio with the target timbre, and the fourth acoustic feature corresponding to the third sample audio includes a third labeled bottleneck feature corresponding to the third sample audio.
在一些可能的实施方式中,所述第一样本音频对应的第一标注瓶颈特征、所述第二样本音频对应的第二标注瓶颈特征以及所述第三样本音频对应的第三标注瓶颈特征是通过端到端语音识别模型的编码器分别对输入的所述第一样本音频、所述第二样本音频和所述第三样本音频进行瓶颈特征提取获得的。In some possible implementation manners, the first labeled bottleneck feature corresponding to the first sample audio, the second labeled bottleneck feature corresponding to the second sample audio, and the third labeled bottleneck feature corresponding to the third sample audio It is obtained by performing bottleneck feature extraction on the input first sample audio, the second sample audio and the third sample audio respectively by an encoder of an end-to-end speech recognition model.
在一些可能的实施方式中,所述第二声学特征还包括:所述第一样本音频对应的第一标注基频特征;In some possible implementation manners, the second acoustic feature further includes: a first labeled fundamental frequency feature corresponding to the first sample audio;
所述第三声学特征还包括:所述第二样本音频对应的第二标注基频特征;所述第四声学特征还包括:所述第三样本音频对应的第三标注基频特征;The third acoustic feature also includes: a second labeled fundamental frequency feature corresponding to the second sample audio; the fourth acoustic feature further includes: a third labeled fundamental frequency feature corresponding to the third sample audio;
所述第一声学特征还包括:所述待处理文本对应的基频特征。The first acoustic feature further includes: a fundamental frequency feature corresponding to the text to be processed.
在一些可能的实施方式中,所述方法还包括:In some possible implementation manners, the method also includes:
将所述待处理文本对应的目标音频添加至目标多媒体内容。Add the target audio corresponding to the text to be processed to the target multimedia content.
第二方面,本公开提供了一种语音合成装置,包括:In a second aspect, the present disclosure provides a speech synthesis device, including:
获取模块,用于获取待处理文本;Obtaining module, used to obtain the text to be processed;
处理模块,用于将所述待处理文本输入至语音合成模型,获取所述语音合成模型输出的所述待处理文本对应的频谱特征;其中,所述语音合成模型包括:韵律子模型和音色子模型,所述韵律子模型用于根据输入的待处理文本,输出所述待处理文本对应的第一声学特征,所述第一声学特征包括用于表征目标说唱风格的瓶颈特征;所述音色子模型用于根据输入的第一声学特征,输出所述待处理文本对应的频谱特征,所述待处理文本对应的频谱特征包括用于表征目标音色的频谱特征;A processing module, configured to input the text to be processed into a speech synthesis model, and acquire spectral features corresponding to the text to be processed output by the speech synthesis model; wherein the speech synthesis model includes a prosody sub-model and a timbre sub-model, the prosody sub-model is used to output, according to the input text to be processed, a first acoustic feature corresponding to the text to be processed, the first acoustic feature including a bottleneck feature for characterizing a target rap style; the timbre sub-model is used to output, according to the input first acoustic feature, the spectral features corresponding to the text to be processed, the spectral features corresponding to the text to be processed including spectral features for characterizing a target timbre;
所述处理模块,用于根据所述待处理文本对应的频谱特征,获取所述待处理文本对应的目标音频,所述目标音频具有所述目标音色以及所述目标说唱风格。The processing module is configured to acquire target audio corresponding to the text to be processed according to the frequency spectrum feature corresponding to the text to be processed, the target audio having the target timbre and the target rap style.
第三方面,本公开提供了一种电子设备,包括:存储器、处理器以及计算机程序;In a third aspect, the present disclosure provides an electronic device, including: a memory, a processor, and a computer program;
所述存储器被配置为存储所述计算机程序;said memory is configured to store said computer program;
所述处理器被配置为执行所述计算机程序,以实现如第一方面任一项所述的语音合成方法。The processor is configured to execute the computer program to implement the speech synthesis method according to any one of the first aspect.
第四方面,本公开提供一种可读存储介质,包括:计算机程序;In a fourth aspect, the present disclosure provides a readable storage medium, including: a computer program;
所述计算机程序被电子设备的至少一个处理器执行时,以实现如第一方面任一项所述的语音合成方法。When the computer program is executed by at least one processor of the electronic device, the speech synthesis method according to any one of the first aspect can be realized.
第五方面,本公开提供一种程序产品,所述程序产品包括:计算机程序;所述计算机程序存储在可读存储介质中,电子设备从所述可读存储介质获取所述计算机程序,所述电子设备的至少一个处理器质性所述计算机程序时,以实现如第一方面任一项所述的语音合成方法。In a fifth aspect, the present disclosure provides a program product, the program product including: a computer program; the computer program is stored in a readable storage medium, and an electronic device acquires the computer program from the readable storage medium, the At least one processor of the electronic device executes the computer program to implement the speech synthesis method according to any one of the first aspect.
本公开提供一种语音合成方法、装置、电子设备及可读存储介质,其中,本公开基于语音合成模型对待处理文本进行分析,输出待处理文本对应的频谱特征,其中,语音合成模型包括韵律子模型和音色子模型,韵律子模型用于接收待处理文本作为输入,输出待处理文本对应的第一声学特征,其中,第一声学特征包括用于表征目标说唱风格的瓶颈特征;音色子模型接收第一声学特征作为输入,输出待处理文本对应的频谱特征,频谱特征包括用于表征目标音色的频谱特征;通过对语音合成模型输出的频谱特征进行转换,能够获得具有目标说唱风格以及目标音色的说唱音频,满足了用户对于合成音频的个性化需求;且语音合成模型支持对任意待处理文本的转换,降低了对用户的音乐创作能力的要求,有利于提升用户创作多媒体内容的积极性。The present disclosure provides a speech synthesis method, device, electronic equipment and readable storage medium. The present disclosure analyzes the text to be processed based on a speech synthesis model and outputs spectral features corresponding to the text to be processed, wherein the speech synthesis model includes a prosody sub-model and a timbre sub-model; the prosody sub-model is used to receive the text to be processed as input and output a first acoustic feature corresponding to the text to be processed, where the first acoustic feature includes a bottleneck feature for characterizing a target rap style; the timbre sub-model receives the first acoustic feature as input and outputs spectral features corresponding to the text to be processed, where the spectral features include spectral features for characterizing a target timbre. By converting the spectral features output by the speech synthesis model, rap audio with the target rap style and the target timbre can be obtained, which meets the user's personalized needs for synthesized audio; moreover, the speech synthesis model supports the conversion of any text to be processed, which reduces the requirements on the user's music creation ability and helps to improve the user's enthusiasm for creating multimedia content.
附图说明Description of drawings
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本公开的实施例,并与说明书一起用于解释本公开的原理。The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description serve to explain the principles of the disclosure.
为了更清楚地说明本公开实施例或相关技术中的技术方案,下面将对实施例或相关技术描述中所需要使用的附图作简单地介绍,显而易见地,对于本领域普通技术人员而言,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure or related technologies, the following will briefly introduce the drawings that need to be used in the descriptions of the embodiments or related technologies. Obviously, for those of ordinary skill in the art, Other drawings can also be obtained from these drawings without any creative effort.
图1a至图1c为本公开一实施例提供的语音合成模型的结构示意图;1a to 1c are structural schematic diagrams of a speech synthesis model provided by an embodiment of the present disclosure;
图2为本公开一实施例提供的语音合成方法的流程图;FIG. 2 is a flowchart of a speech synthesis method provided by an embodiment of the present disclosure;
图3为本公开另一实施例提供的语音合成方法的流程图;FIG. 3 is a flowchart of a speech synthesis method provided by another embodiment of the present disclosure;
图4为本公开一实施例提供的语音合成装置的结构示意图;FIG. 4 is a schematic structural diagram of a speech synthesis device provided by an embodiment of the present disclosure;
图5为本公开一实施例提供的电子设备的结构示意图。Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
具体实施方式Detailed ways
为了能够更清楚地理解本公开的上述目的、特征和优点,下面将对本公开的方案进行进一步描述。需要说明的是,在不冲突的情况下,本公开的实施例及实施例中的特征可以相互组合。In order to more clearly understand the above objects, features and advantages of the present disclosure, the solutions of the present disclosure will be further described below. It should be noted that, in the case of no conflict, the embodiments of the present disclosure and the features in the embodiments can be combined with each other.
在下面的描述中阐述了很多具体细节以便于充分理解本公开,但本公开还可以采用其他不同于在此描述的方式来实施;显然,说明书中的实施例只是本公开的一部分实施例,而不是全部的实施例。In the following description, many specific details are set forth in order to fully understand the present disclosure, but the present disclosure can also be implemented in other ways than described here; obviously, the embodiments in the description are only some of the embodiments of the present disclosure, and Not all examples.
本公开提供一种语音合成方法、装置、电子设备、可读存储介质及程序产品,其中,该方法通过预先训练的语音合成模型实现文本到具有目标说唱风格以及目标音色的音频的转换,该语音合成模型能够实现目标说唱风格和音色相对独立地对语音合成的控制,从而满足用户对于个性化语音合成的需求。The present disclosure provides a speech synthesis method, device, electronic equipment, readable storage medium and program product, wherein the method realizes the conversion of text into audio with a target rap style and a target timbre through a pre-trained speech synthesis model, and the speech synthesis model enables the target rap style and the timbre to control speech synthesis relatively independently, thereby meeting users' needs for personalized speech synthesis.
本公开提及的目标说唱风格可以包括任意类别的说唱风格,本公开对于目标说唱风格具体为何种说说唱风格不做限定。例如,目标说唱风格可以为流行说唱、另类说唱、喜剧说唱、爵士说唱、嘻哈说唱中的任一种说唱风格。The target rap style mentioned in the present disclosure may include any type of rap style, and the present disclosure does not limit the specific rap style of the target rap style. For example, the target rap style may be any rap style among popular rap, alternative rap, comedy rap, jazz rap, and hip-hop rap.
本公开提供的语音合成方法,可以由电子设备来执行。其中,电子设备可以是平板电脑、手机(如折叠屏手机、大屏手机等)、可穿戴设备、车载设备、增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本、个人数字助理(personaldigital assistant,PDA)、智能电视、智慧屏、高清电视、4K电视、智能音箱、智能投影仪等物联网(the internet of things,IOT)设备,本公开对电子设备的具体类型不作任何限制。The speech synthesis method provided by the present disclosure can be executed by electronic equipment. Among them, the electronic device can be a tablet computer, a mobile phone (such as a folding screen mobile phone, a large-screen mobile phone, etc.), a wearable device, a vehicle-mounted device, an augmented reality (augmented reality, AR)/virtual reality (virtual reality, VR) device, a notebook computer, etc. , ultra-mobile personal computer (ultra-mobile personal computer, UMPC), netbook, personal digital assistant (personal digital assistant, PDA), smart TV, smart screen, high-definition TV, 4K TV, smart speaker, smart projector and other Internet of Things (the Internet of things, IOT) equipment, this disclosure does not make any restrictions on the specific type of electronic equipment.
需要说明的是,训练获取语音合成模型的电子设备和利用语音合成模型执行语音合成业务的电子设备,可以是不同的电子设备,也可以是相同的电子设备,本公开对此不作限定。例如,由服务端设备训练获取语音合成模型,服务端设备将训练好的语音合成模型下发至终端设备/服务端设备,由终端设备/服务端设备根据语音合成模型执行语音合成业务;又如,由服务端设备训练获取语音合成模型,之后,将训练好的语音合成模型部署在该服务端设备,之后,服务端设备调用语音合成模型处理语音合成业务。本公开对此不做限制,实际应用中, 可灵活设置。It should be noted that the electronic device that trains and obtains the speech synthesis model and the electronic device that uses the speech synthesis model to execute the speech synthesis service may be different electronic devices or the same electronic device, which is not limited in the present disclosure. For example, the speech synthesis model is obtained through the training of the server device, and the server device sends the trained speech synthesis model to the terminal device/server device, and the terminal device/server device executes the speech synthesis service according to the speech synthesis model; another example The speech synthesis model is trained by the server device, and then the trained speech synthesis model is deployed on the server device, and then the server device invokes the speech synthesis model to process the speech synthesis service. The present disclosure does not limit this, and it can be set flexibly in practical applications.
下面,首先对本方案中的语音合成模型进行介绍。In the following, the speech synthesis model in this solution is firstly introduced.
本方案中的语音合成模型通过引入包括瓶颈(bottleneck)特征的声学特征,将语音合成模型解耦成两个子模型,分别为:韵律子模型和音色子模型,其中,韵律子模型用于建立文本到包含瓶颈特征的声学特征之间的深度映射;音色子模型用于建立包含瓶颈特征的声学特征到频谱特征之间的深度映射。By introducing acoustic features including bottleneck features, the speech synthesis model in this solution is decoupled into two sub-models: a prosody sub-model and a timbre sub-model, where the prosody sub-model is used to establish a deep mapping from text to acoustic features containing bottleneck features, and the timbre sub-model is used to establish a deep mapping from acoustic features containing bottleneck features to spectral features.
在此基础上,至少具有以下有益效果:On this basis, it has at least the following beneficial effects:
1、解耦后的两个特征提取子模型可以使用不同的样本音频进行训练。1. The two decoupled feature extraction sub-models can be trained using different sample audio.
韵律子模型,用于建立文本序列到包含瓶颈特征的声学特征之间的深度映射,韵律子模型需要使用高质量的具有目标说唱风格的第一样本音频以及第一样本音频对应的标注文本,共同作为样本数据对韵律子模型进行训练。The prosody sub-model is used to establish a deep mapping between the text sequence and the acoustic features containing bottleneck features. The prosody sub-model needs to use high-quality first sample audio with the target rap style and the annotated text corresponding to the first sample audio , together as the sample data to train the prosody sub-model.
音色子模型,用于建立包含瓶颈特征的声学特征到频谱特征之间的深度映射,音色子模型可以使用未标注相应文本的第二样本音频进行训练,由于无需标注第二样本音频对应的文本,这样可以大大降低获取第二样本音频的成本。The timbre sub-model is used to establish the depth mapping between the acoustic features including bottleneck features and the spectral features. The timbre sub-model can be trained using the second sample audio that has not marked the corresponding text. Since there is no need to label the text corresponding to the second sample audio, This can greatly reduce the cost of acquiring a second sample of audio.
2、通过解耦语音合成模型,实现了说唱风格和音色相对独立地对语音合成的控制。2. By decoupling the speech synthesis model, the relatively independent control of speech synthesis by rap style and timbre is realized.
韵律子模型输出的声学特征包括用于表征目标说唱风格的瓶颈特征,实现说唱风格对语音合成的控制。此外,韵律子模型输出的声学特征还可以包括用于表征音调的基频特征,实现音调对语音合成的控制。The acoustic features output by the prosody sub-model include the bottleneck features used to characterize the target rap style, and realize the control of rap style on speech synthesis. In addition, the acoustic features output by the prosody sub-model may also include fundamental frequency features used to characterize pitch, so as to realize the control of speech synthesis by pitch.
音色子模型输出的文本对应的频谱特征包括用于表征目标音色的频谱特征,从而实现音色对语音合成的控制。The spectral features corresponding to the text output by the timbre sub-model include the spectral features used to characterize the target timbre, so as to realize the control of the timbre over speech synthesis.
此外,需要说明的是,音色子模型输出的频谱特征还包括用于表征目标说唱风格的频谱特征,且表征目标音色的频谱特征和表征目标说唱风格的频谱特征为相同的频谱特征。若韵律子模型输出的声学特征还包括基频特征,则音色子模型输出的频谱特征还包括用于表征相应基频的频谱特征,且表征目标音色的频谱特征、表征目标说唱风格的频谱特征以及表征基频的频谱特征为相同的频谱特征。In addition, it should be noted that the spectral features output by the timbre sub-model also include spectral features used to characterize the target rap style, and the spectral features characterizing the target timbre and the spectral features characterizing the target rap style are the same spectral features. If the acoustic features output by the prosody sub-model also include fundamental frequency features, the spectral features output by the timbre sub-model also include spectral features used to characterize the corresponding fundamental frequency, and the spectral features characterizing the target timbre, the spectral features characterizing the target rap style, and the spectral features characterizing the fundamental frequency are the same spectral features.
3、降低了对具有目标音色的第三样本音频的要求3. Reduced requirements on the third sample audio with the target timbre
该语音合成模型可以通过较少的目标音色的第三样本音频进行训练,即可使最终的语音合成模型合成具有目标音色的音频,且即使第三样本音频的质量不高,如发音不标准、说话不流利等,语音合成模型依然可以稳定地合成具有目标音色的音频。The speech synthesis model can be trained with a relatively small amount of third sample audio of the target timbre, so that the final speech synthesis model can synthesize audio with the target timbre; even if the quality of the third sample audio is not high, for example the pronunciation is non-standard or the speech is not fluent, the speech synthesis model can still stably synthesize audio with the target timbre.
由于通过第二样本音频已对音色子模型进行训练,使得音色子模型已经具备了较高的针对音色的语音合成控制能力,因此,即使音色子模型学习少量的第三样本音频,也能够较好的掌握目标音色。Since the timbre sub-model has already been trained with the second sample audio, the timbre sub-model already has a high speech-synthesis control capability for timbre; therefore, even if the timbre sub-model learns only a small amount of third sample audio, it can still master the target timbre well.
下面通过几个具体实施例对语音合成模型的结构以及如何训练获取语音合成模型进行详细介绍。下述实施例中,以电子设备为例,结合附图,进行详细介绍。The structure of the speech synthesis model and how to train and obtain the speech synthesis model will be introduced in detail below through several specific embodiments. In the following embodiments, an electronic device is taken as an example to describe in detail with reference to the accompanying drawings.
其中,图1a示出了训练获取语音合成模型的整体框架图;图1b和图1c分别示例性地示出了语音合成模型包括的韵律子模型和音色子模型的结构示意图。Among them, Fig. 1a shows the overall frame diagram of the training and acquisition of the speech synthesis model; Fig. 1b and Fig. 1c respectively exemplarily show the structural diagrams of the prosody sub-model and the timbre sub-model included in the speech synthesis model.
参照图1a所示,语音合成模型100包括:韵律子模型101和音色子模型102。对于语音合成模型100进行训练的过程包括针对韵律子模型101进行训练的过程和对音色子模型102进行训练的过程。Referring to FIG. 1 a , the speech synthesis model 100 includes: a prosody sub-model 101 and a timbre sub-model 102 . The process of training the speech synthesis model 100 includes the process of training the prosody sub-model 101 and the process of training the timbre sub-model 102 .
下面分别介绍对韵律子模型101进行训练的过程和对音色子模型102进行训练的过程。The process of training the prosody sub-model 101 and the process of training the timbre sub-model 102 are respectively introduced below.
一、对韵律子模型101进行训练1. Training the prosodic sub-model 101
韵律子模型101用于根据第一样本音频对应的标注文本以及标注声学特征(以下将第一样本音频对应的标注声学特征称为第二声学特征)进行训练,通过学习第一样本音频对应的标注文本以及第二声学特征之间的关系,使得韵律子模型101获得建立文本到包含瓶颈特征的声学特征之间的深度映射的能力。The prosody sub-model 101 is trained according to the labeled text corresponding to the first sample audio and the labeled acoustic features (hereinafter, the labeled acoustic features corresponding to the first sample audio are referred to as the second acoustic features). By learning the relationship between the labeled text corresponding to the first sample audio and the second acoustic features, the prosody sub-model 101 obtains the ability to establish a deep mapping from text to acoustic features containing bottleneck features.
可选地,前述标注文本具体可以为文本序列。Optionally, the aforementioned marked text may specifically be a text sequence.
具体地,韵律子模型101具体用于对输入的第一样本音频对应的标注文本进行分析,建模中间特征序列,并对中间特征序列进行特征转换以及降维,输出标注文本对应的第五声学特征。Specifically, the prosody sub-model 101 is specifically used to analyze the labeled text corresponding to the input first sample audio, model the intermediate feature sequence, perform feature conversion and dimensionality reduction on the intermediate feature sequence, and output the fifth acoustic features.
之后,再基于第一样本音频对应的第二声学特征、第一样本音频对应的第五声学特征以及预先构建的损失函数,计算本轮训练的损失函数信息,并根据本轮训练的损失函数信息对韵律子模型101包括的参数的系数值进行调整。Afterwards, based on the second acoustic feature corresponding to the first sample audio, the fifth acoustic feature corresponding to the first sample audio, and a pre-built loss function, the loss function information of the current round of training is calculated, and the coefficient values of the parameters included in the prosody sub-model 101 are adjusted according to the loss function information of the current round of training.
通过多个第一样本音频、第一样本音频对应的标注文本、第一样本音频对应的第二声学特征(包括第一标注瓶颈特征)的不断迭代训练,最终获得满足相应收敛条件的第一特征提取模型101。Through continuous iterative training with multiple first sample audios, the labeled text corresponding to the first sample audios, and the second acoustic features (including the first labeled bottleneck features) corresponding to the first sample audios, the first feature extraction model 101 that satisfies the corresponding convergence condition is finally obtained.
在训练过程中,第一样本音频对应的第二声学特征,可以理解为韵律子模型101的学习目标。During the training process, the second acoustic feature corresponding to the first audio sample can be understood as the learning objective of the prosody sub-model 101 .
其中,第一样本音频可以包括高质量的音频文件(高质量的音频也可以理解为干净的音频),第一样本音频对应的标注文本可以包括第一样本音频对应的一个或多个字符或者一个或多个音素,本公开对此不做限定。第一样本音频可以根据实际需求进行录制、多次的清理获得的,或者,也可以从音频数据库中筛选并多次清理获得,本公开对于第一样本音频的获取方式不做限制。类似地,第一样本音频对应的标注文本,也可以是通过反复的标注、校正获得的,从而保证标注文本的准确性。Wherein, the first sample audio may include high-quality audio files (high-quality audio may also be understood as clean audio), and the labeled text corresponding to the first sample audio may include one or more characters or one or more phonemes corresponding to the first sample audio, which is not limited in the present disclosure. The first sample audio may be recorded and cleaned multiple times according to actual needs, or may be selected from an audio database and cleaned multiple times; the present disclosure does not limit the manner of acquiring the first sample audio. Similarly, the labeled text corresponding to the first sample audio may also be obtained through repeated labeling and correction, so as to ensure the accuracy of the labeled text.
此外,本公开提及的第一样本音频为具有目标说唱风格的音频,本公开对于第一样本音频的时长、文件格式、数量等等参数不做限定,且第一样本音频可以是相同或者不同歌手演唱的音乐片段。In addition, the first sample audio mentioned in this disclosure is audio with the target rap style. This disclosure does not limit the duration, file format, quantity and other parameters of the first sample audio, and the first sample audio can be A piece of music sung by the same or a different singer.
此外,标注文本对应的第五声学特征可以理解为韵律子模型101输出的标注文本对应的预测声学特征,标注文本对应的第五声学特征也可以理解为第一样本音频对应的第五声学特征。In addition, the fifth acoustic feature corresponding to the labeled text can be understood as the predicted acoustic feature corresponding to the labeled text output by the prosodic sub-model 101, and the fifth acoustic feature corresponding to the labeled text can also be understood as the fifth acoustic feature corresponding to the first sample audio .
一些实施例中,第二声学特征包括:第一样本音频对应的第一标注瓶颈特征。In some embodiments, the second acoustic feature includes: a first labeled bottleneck feature corresponding to the first audio sample.
其中,瓶颈(bottleneck)是一种非线性的特征转换技术以及有效的降维技术。在本方案所提及的针对特定音色的语音合成场景中,瓶颈特征可以包括韵律、内容等维度的信息。Among them, the bottleneck (bottleneck) is a nonlinear feature transformation technology and an effective dimension reduction technology. In the speech synthesis scenario for a specific timbre mentioned in this solution, the bottleneck feature may include information of dimensions such as prosody and content.
一种可能的实现方式,第一样本音频对应的第一标注瓶颈特征可以通过端到端语音识别(ASR)模型的编码器(encoder)获得。In a possible implementation manner, the first labeled bottleneck feature corresponding to the first audio sample may be obtained by an encoder (encoder) of an end-to-end speech recognition (ASR) model.
下文中,端到端ASR模型简称为:ASR模型。Hereinafter, the end-to-end ASR model is referred to as: ASR model for short.
示例性地,参照图1a所示,可将第一样本音频输入至ASR模型104,获取ASR模型104的编码器输出的第一样本音频对应的第一标注瓶颈特征,其中,ASR模型104的编码器相当于提前准备的瓶颈特征的提取器,在本方案中ASR模型104的编码器可以用于准备样本数据。Exemplarily, as shown in FIG. 1a, the first sample audio may be input into the ASR model 104, and the first labeled bottleneck feature corresponding to the first sample audio output by the encoder of the ASR model 104 may be obtained, where the encoder of the ASR model 104 serves as a bottleneck feature extractor prepared in advance; in this solution, the encoder of the ASR model 104 can be used to prepare the sample data.
需要说明的是,ASR模型104还可以包括其他模块,例如图1a所示,ASR模型104还包括解码器(decoder)以及注意力网络(attention network)。针对ASR模型104中除编码器以外的其他模块输出的处理结果,可以不做任何处理,且本公开对于ASR模型中除编码器以外的其他模块或者网络的功能、实现方式不作限定。It should be noted that the ASR model 104 may also include other modules. For example, as shown in FIG. 1a, the ASR model 104 also includes a decoder (decoder) and an attention network (attention network). No processing may be performed on the processing results output by modules other than the encoder in the ASR model 104 , and this disclosure does not limit the functions and implementations of modules or networks other than the encoder in the ASR model.
其中,通过ASR模型104的编码器获得第一样本音频对应的第一标注瓶颈特征仅是示例,并不是对获得第一样本音频对应的第一标注瓶颈特征的实现方式的限制。实际应用中,也可以通过其他方式获得,本公开对此不做限制。例如,数据库中存储第一样本音频以及第一样本音频对应的第一标注瓶颈特征,电子设备也可以从数据库中获取第一样本音频以及第一标注瓶颈特征。Obtaining the first marked bottleneck feature corresponding to the first sample audio by the encoder of the ASR model 104 is only an example, and is not a limitation to the implementation manner of obtaining the first marked bottleneck feature corresponding to the first sample audio. In practical applications, it can also be obtained in other ways, which is not limited in the present disclosure. For example, the database stores the first sample audio and the first labeled bottleneck feature corresponding to the first sample audio, and the electronic device may also acquire the first sample audio and the first labeled bottleneck feature from the database.
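For concreteness, the following Python sketch illustrates how labeled bottleneck features of the kind described above might be prepared with the encoder of an ASR model. The ASREncoder class, its layer sizes, and the placeholder mel-frame tensor are assumptions introduced only for illustration; the disclosure does not prescribe a particular encoder architecture, and in practice the encoder weights would come from a trained end-to-end ASR model.

```python
import torch
import torch.nn as nn

class ASREncoder(nn.Module):
    """Stand-in for the encoder of an end-to-end ASR model; layer sizes are hypothetical."""
    def __init__(self, n_mels=80, hidden=256, bottleneck_dim=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2, batch_first=True, bidirectional=True)
        self.bottleneck = nn.Linear(2 * hidden, bottleneck_dim)  # low-dimensional "bottleneck" layer

    def forward(self, mel_frames):                  # (batch, frames, n_mels)
        hidden_seq, _ = self.rnn(mel_frames)
        return self.bottleneck(hidden_seq)          # (batch, frames, bottleneck_dim)

encoder = ASREncoder()                              # weights would come from a trained ASR model
encoder.eval()
mel_frames = torch.randn(1, 200, 80)                # placeholder acoustic frames of one sample audio
with torch.no_grad():
    labeled_bottleneck = encoder(mel_frames)        # labeled bottleneck feature for that sample audio
print(labeled_bottleneck.shape)                     # torch.Size([1, 200, 64])
```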
另一些实施例中,第一样本音频对应的第二声学特征包括:第一样本音频对应的第一标注瓶颈特征和第一样本音频对应的第一标注基频特征。In some other embodiments, the second acoustic feature corresponding to the first sample audio includes: a first labeled bottleneck feature corresponding to the first sample audio and a first labeled fundamental frequency feature corresponding to the first sample audio.
其中,第一标注瓶颈特征可参照前述示例的详细描述,简明起见,此处不再赘述。Wherein, the first marked bottleneck feature can refer to the detailed description of the foregoing examples, and for the sake of brevity, details are not repeated here.
其中,音调表示人耳对于声音的音调高低的主观感受,音调的高低主要取决于声音的基频,基频频率越高则音调越高,基频频率越低则音调越低。在语音合成过程中,音调也是影响语音合成效果的重要因素之一。为了使得最终的语音合成模型具备对音调的语音合成控制能力,本方案在引入瓶颈特征的同时,还引入基频特征,使得最终的韵律子模型101具有根据输入的文本,输出相对应的瓶颈特征和基频特征的能力。Among them, the pitch represents the subjective feeling of the human ear for the pitch of the sound. The pitch mainly depends on the fundamental frequency of the sound. The higher the fundamental frequency, the higher the pitch, and the lower the fundamental frequency, the lower the pitch. In the process of speech synthesis, pitch is also one of the important factors affecting the effect of speech synthesis. In order to enable the final speech synthesis model to have the ability to control the speech synthesis of pitch, this solution introduces the fundamental frequency feature while introducing the bottleneck feature, so that the final prosodic sub-model 101 can output the corresponding bottleneck feature according to the input text and fundamental frequency features.
具体地,韵律子模型101具体用于对输入的第一样本音频对应的标注文本进行分析,建模中间特征序列,并对中间特征序列进行特征转换以及降维,输出标注文本对应的第五声学特征。Specifically, the prosody sub-model 101 is specifically used to analyze the labeled text corresponding to the input first sample audio, model the intermediate feature sequence, perform feature conversion and dimensionality reduction on the intermediate feature sequence, and output the fifth acoustic features.
其中,标注文本对应第五声学特征可以理解为韵律子模型101输出的标注文本对应的预测声学特征。标注文本对应的第五声学特征也可以理解为第一样本音频对应的第五声学特征。Wherein, the fifth acoustic feature corresponding to the tagged text may be understood as the predicted acoustic feature corresponding to the tagged text output by the prosody sub-model 101 . The fifth acoustic feature corresponding to the marked text may also be understood as the fifth acoustic feature corresponding to the first audio sample.
需要说明的是,第一样本音频对应的第二声学特征包括:第一标注瓶颈特征和第一标注基频特征,则在训练的过程中,韵律子模型101输出的第一样本音频对应的第五声学特征也包括:第一样本音频对应的预测瓶颈特征和预测基频特征。It should be noted that, when the second acoustic feature corresponding to the first sample audio includes the first labeled bottleneck feature and the first labeled fundamental frequency feature, during training, the fifth acoustic feature corresponding to the first sample audio output by the prosody sub-model 101 also includes a predicted bottleneck feature and a predicted fundamental frequency feature corresponding to the first sample audio.
之后,再基于第一样本音频对应的第二声学特征、第一样本音频对应的第五声学特征以及预先构建的损失函数,计算本轮训练的损失函数信息,并根据损失函数信息对韵律子模型101包括的参数的系数值进行调整。Afterwards, based on the second acoustic feature corresponding to the first sample audio, the fifth acoustic feature corresponding to the first sample audio, and the pre-built loss function, the loss function information of the current round of training is calculated, and the coefficient values of the parameters included in the prosody sub-model 101 are adjusted according to the loss function information.
通过海量的第一样本音频、第一样本音频对应的标注文本、第一样本音频对应的第二声学特征(包括第一标注瓶颈特征和第一标注基频特征)的不断迭代训练,最终获得满足相应收敛条件的第一特征提取模型101。Through the continuous iterative training of the massive first sample audio, the labeled text corresponding to the first sample audio, and the second acoustic feature corresponding to the first sample audio (including the first labeled bottleneck feature and the first labeled fundamental frequency feature), Finally, the first feature extraction model 101 satisfying the corresponding convergence condition is obtained.
在训练过程中,第一样本音频对应的第二声学特征,可以理解为韵律子模型101的学习目标。During the training process, the second acoustic feature corresponding to the first audio sample can be understood as the learning objective of the prosody sub-model 101 .
一种可能的实现方式,第一样本音频对应的第一标注基频特征可 以通过数字信号处理(DSP)的方法对第一样本音频进行分析获得。示例性地,如图1a中所示,可以通过数字信号处理器105对第一样本音频进行数字信号处理,获取第一样本音频对应的第一标注基频特征。其中,数字信号处理器105的具体实现方式不作限定,其只要能够提取输入的第一样本音频对应的第一标注基频特征即可。In a possible implementation manner, the first labeled fundamental frequency feature corresponding to the first sample audio can be obtained by analyzing the first sample audio by a digital signal processing (DSP) method. Exemplarily, as shown in FIG. 1 a , digital signal processing may be performed on the first sample audio by the digital signal processor 105 to obtain the first labeled fundamental frequency feature corresponding to the first sample audio. Wherein, the specific implementation manner of the digital signal processor 105 is not limited, as long as it can extract the first marked fundamental frequency feature corresponding to the input first sample audio.
此外,第一样本音频对应的第一标注基频特征并不限于通过数字信号处理的方法获得,本公开对于获取第一标注基频特征的实现方式不作限定。例如,一些数据库中存储第一样本音频以及第一样本音频对应的第一标注基频特征,也可以从数据库中获取第一样本音频以及第一标注基频特征。In addition, the first marked fundamental frequency feature corresponding to the first sample audio is not limited to be obtained by digital signal processing, and the present disclosure does not limit the implementation manner of obtaining the first marked fundamental frequency feature. For example, some databases store the first sample audio and the first labeled fundamental frequency feature corresponding to the first sample audio, and the first sample audio and the first labeled fundamental frequency feature may also be acquired from the database.
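As one hedged illustration of the digital-signal-processing route described above, the following sketch estimates a frame-level fundamental frequency with librosa's pYIN implementation. The file name, sampling rate, frame parameters, and the choice of pYIN itself are assumptions; the disclosure only requires that a fundamental frequency feature be extracted from the sample audio by some DSP method.

```python
import librosa
import numpy as np

# Estimate a frame-level fundamental frequency (F0) for one sample audio with pYIN.
y, sr = librosa.load("first_sample_audio.wav", sr=16000)   # hypothetical file name
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),    # ~65 Hz lower bound (assumed search range)
    fmax=librosa.note_to_hz("C6"),    # ~1047 Hz upper bound
    sr=sr,
    frame_length=1024,
    hop_length=256,
)
f0 = np.nan_to_num(f0)                # unvoiced frames become 0 so the feature stays frame-aligned
print(f0.shape)                        # one F0 value per analysis frame
```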
需要说明的是,韵律子模型对应的收敛条件可以但不限于包括迭代次数、损失阈值等评价指标。本公开对于训练韵律子模型对应的收敛条件不做限制。且电子设备根据第一样本音频对应的第一标注瓶颈特征进行训练,或者,根据第一样本音频对应的第一标注瓶颈特征和第一标注基频特征进行训练,收敛条件可以具备差异。It should be noted that the convergence condition corresponding to the prosody sub-model may include, but is not limited to, evaluation indicators such as the number of iterations and loss threshold. The present disclosure does not limit the convergence conditions corresponding to the training prosodic sub-models. And the electronic device performs training according to the first labeled bottleneck feature corresponding to the first sample audio, or, according to the first labeled bottleneck feature and the first labeled fundamental frequency feature corresponding to the first sample audio, the convergence conditions may have differences.
此外,电子设备根据第一样本音频对应的第一标注瓶颈特征进行训练,或者,根据第一样本音频对应的第一标注瓶颈特征和第一标注基频特征进行训练,预先构建的韵律子模型对应的损失函数可以相同,也可以具备差异。本公开对于预先构建的韵律子模型对应的损失函数的实现方式不做限定。In addition, whether the electronic device performs training according to the first labeled bottleneck feature corresponding to the first sample audio, or according to the first labeled bottleneck feature and the first labeled fundamental frequency feature corresponding to the first sample audio, the loss functions corresponding to the pre-built prosody sub-model may be the same or different. The present disclosure does not limit the implementation of the loss function corresponding to the pre-built prosody sub-model.
下面示例性地示出韵律子模型的网络结构。The network structure of the prosodic sub-model is exemplarily shown below.
图1b示例性地示出了韵律子模型101的一种实现方式。参照图1b所示,韵律子模型101可以包括:文本编码网络(text encoder)1011、注意力网络(attention)1012以及解码网络(decoder)1013。FIG. 1 b exemplarily shows an implementation of the prosodic sub-model 101 . As shown in FIG. 1 b , the prosodic sub-model 101 may include: a text encoding network (text encoder) 1011 , an attention network (attention) 1012 and a decoding network (decoder) 1013 .
其中,文本编码网络1011,用于接收文本作为输入,并对输入的文本的上下文以及时序关系进行分析,建模中间特征序列,该中间特征序列包含上下文信息以及时序关系。Among them, the text coding network 1011 is used to receive text as input, analyze the context and time sequence relationship of the input text, and model an intermediate feature sequence, which contains context information and time sequence relationship.
解码网络1013,可以采用自回归网络结构,通过使用上一个时间 步的输出作为下一个时间步的输入。The decoding network 1013 can adopt an autoregressive network structure, by using the output of the previous time step as the input of the next time step.
注意力网络1012主要用于输出的注意力系数。将注意力系数与文本编码网络1011输出的中间特征序列进行加权平均,获得加权平均结果,该加权平均结果作为解码网络1013每个时间步的另一个条件输入。解码网络1013通过对输入(即加权平均结果以及上一个时间步的输出)进行特征转换,输出文本对应的预测声学特征。The attention network 1012 is mainly used to output attention coefficients. The attention coefficient and the intermediate feature sequence output by the text encoding network 1011 are weighted and averaged to obtain a weighted average result, which is used as another conditional input for each time step of the decoding network 1013 . The decoding network 1013 outputs the predicted acoustic features corresponding to the text by performing feature conversion on the input (ie, the weighted average result and the output of the previous time step).
结合前述两种实施方式,解码网络1013输出的文本对应的预测声学特征可以包括:文本对应的预测瓶颈特征;或者,解码网络1013输出的文本对应的预测声学特征可以包括:文本对应的预测瓶颈特征和文本对应的预测基频特征。In combination with the foregoing two implementation manners, the predicted acoustic features corresponding to the text output by the decoding network 1013 may include the predicted bottleneck features corresponding to the text; or, the predicted acoustic features corresponding to the text output by the decoding network 1013 may include the predicted bottleneck features corresponding to the text and the predicted fundamental frequency features corresponding to the text.
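The following simplified PyTorch sketch mirrors the structure of Fig. 1b (text encoding network, attention network, autoregressive decoding network). All dimensions, the GRU cells, and the single-head dot-product attention are illustrative assumptions rather than the patented implementation; the decoder here emits a 64-dimensional bottleneck feature plus one F0 value per time step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProsodySubModel(nn.Module):
    """Text -> (bottleneck feature, F0) per decoder time step. Sizes are illustrative only."""
    def __init__(self, vocab_size=100, emb=128, enc_hidden=128, dec_hidden=256,
                 bottleneck_dim=64, use_f0=True):
        super().__init__()
        out_dim = bottleneck_dim + (1 if use_f0 else 0)
        self.embed = nn.Embedding(vocab_size, emb)
        self.text_encoder = nn.GRU(emb, enc_hidden, batch_first=True, bidirectional=True)
        self.attn_query = nn.Linear(dec_hidden, 2 * enc_hidden)
        self.decoder_cell = nn.GRUCell(2 * enc_hidden + out_dim, dec_hidden)
        self.proj = nn.Linear(dec_hidden, out_dim)

    def forward(self, token_ids, n_frames):
        memory, _ = self.text_encoder(self.embed(token_ids))      # intermediate feature sequence
        batch = token_ids.size(0)
        h = memory.new_zeros(batch, self.decoder_cell.hidden_size)
        prev_out = memory.new_zeros(batch, self.proj.out_features)
        outputs = []
        for _ in range(n_frames):                                  # autoregressive decoding
            scores = torch.bmm(memory, self.attn_query(h).unsqueeze(-1)).squeeze(-1)
            weights = F.softmax(scores, dim=-1)                    # attention coefficients
            context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)  # weighted average of memory
            h = self.decoder_cell(torch.cat([context, prev_out], dim=-1), h)
            prev_out = self.proj(h)                                # predicted bottleneck (+ F0) per step
            outputs.append(prev_out)
        return torch.stack(outputs, dim=1)                         # (batch, n_frames, out_dim)

model = ProsodySubModel()
acoustic = model(torch.randint(0, 100, (2, 12)), n_frames=50)
print(acoustic.shape)                                              # torch.Size([2, 50, 65])
```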
此外,韵律子模型101包括的参数的系数的初始值可以是随机生成的,也可以是预设的,或者,还可以是通过其他方式确定的,本公开对此不作限定。In addition, the initial values of the coefficients of the parameters included in the prosody sub-model 101 may be randomly generated, preset, or determined in other ways, which is not limited in the present disclosure.
通过多个第一样本音频分别对应的标注文本、以及第一样本音频分别对应的第二声学特征,对韵律子模型101进行迭代训练,不断优化韵律子模型101包括的参数的系数值,直至满足韵律子模型101的收敛条件,则停止针对韵律子模型101的训练。The prosodic sub-model 101 is iteratively trained through the marked texts corresponding to the plurality of first sample audios and the second acoustic features respectively corresponding to the first sample audios, and the coefficient values of the parameters included in the prosody sub-model 101 are continuously optimized, Until the convergence condition of the prosody sub-model 101 is met, the training for the prosody sub-model 101 is stopped.
应理解,上述描述的第一样本音频与相应的标注文本之间一一对应,是成对的样本数据。It should be understood that the one-to-one correspondence between the first sample audio described above and the corresponding annotation text is a pair of sample data.
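Building on the ProsodySubModel sketch above (which is assumed to be in scope), one training iteration over a pair of (labeled text, second acoustic feature) might look as follows. The L1 loss and Adam optimizer are assumptions; the disclosure only specifies a pre-built loss function between the labeled and predicted acoustic features.

```python
import torch
import torch.nn as nn

model = ProsodySubModel()                           # class from the preceding sketch (assumed in scope)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.L1Loss()                             # assumed loss function

labeled_text = torch.randint(0, 100, (2, 12))       # labeled text of first sample audios, as token ids
second_acoustic = torch.randn(2, 50, 65)            # labeled bottleneck (+ F0) features, frame-aligned

predicted_acoustic = model(labeled_text, n_frames=second_acoustic.size(1))  # the "fifth acoustic feature"
loss = criterion(predicted_acoustic, second_acoustic)
optimizer.zero_grad()
loss.backward()
optimizer.step()                                    # adjust coefficient values of the model parameters
print(float(loss))
```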
二、对音色子模型102进行训练2. Train the timbre sub-model 102
针对音色子模型102进行训练包括两个阶段,其中,第一阶段是基于第二样本音频对音色子模型进行训练,获得中间模型;第二阶段是基于第三样本音频对中间模型进行微调,获得最终的音色子模型。Training the timbre sub-model 102 includes two stages: in the first stage, the timbre sub-model is trained based on the second sample audio to obtain an intermediate model; in the second stage, the intermediate model is fine-tuned based on the third sample audio to obtain the final timbre sub-model.
其中,本公开对于第二样本音频的音色不作限定;此外,第三样本音频为具有目标音色的样本音频。Wherein, the present disclosure does not limit the timbre of the second sample audio; in addition, the third sample audio is a sample audio with a target timbre.
需要说明的是,上述音色子模型输出的频谱特征可以是梅尔频谱特征,或者,也可以是其他类型的频谱特征。在接下来的示例中,以输入至音色子模型的第二样本音频对应的第一标注频谱特征为第一标注梅尔频谱特征、第三样本音频对应的第二标注频谱特征为第二标注梅尔频谱特征、音色子模型输出的预测频谱特征为预测梅尔频谱特征为例进行举例说明。It should be noted that the spectral features output by the above timbre sub-model may be Mel spectral features, or may be other types of spectral features. In the following examples, the first labeled spectral feature corresponding to the second sample audio input into the timbre sub-model is taken to be a first labeled Mel spectral feature, the second labeled spectral feature corresponding to the third sample audio is taken to be a second labeled Mel spectral feature, and the predicted spectral feature output by the timbre sub-model is taken to be a predicted Mel spectral feature, for the purpose of illustration.
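As an illustration of how a labeled Mel spectral feature might be prepared for a sample audio, the following sketch uses librosa; the file name, STFT parameters, number of Mel bins, and the log compression step are assumptions made for this example.

```python
import librosa
import numpy as np

# Compute a labeled Mel spectral feature for one sample audio (placeholder file name and parameters).
y, sr = librosa.load("second_sample_audio.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))   # log compression is a common choice, assumed here
print(log_mel.shape)                          # (80 Mel bins, number of frames)
```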
下面对音色子模型102的训练过程进行详细介绍:The training process of the timbre sub-model 102 is described in detail below:
第一阶段:The first stage:
在第一阶段的训练中,音色子模型102,用于根据第二样本音频进行迭代训练,获得中间模型。In the first stage of training, the timbre sub-model 102 is used to perform iterative training according to the second sample audio to obtain an intermediate model.
音色子模型102通过学习第二样本音频对应的第三声学特征和第二样本音频的第一标注梅尔频谱特征之间的映射关系,获得针对音色具有一定的语音合成控制能力的中间模型,其中,第一标注梅尔频谱特征包括:用于表征相应第二样本音频的音色的频谱特征。The timbre sub-model 102 learns the mapping relationship between the third acoustic feature corresponding to the second sample audio and the first labeled mel spectrum feature of the second sample audio, and obtains an intermediate model with certain speech synthesis control capabilities for timbre, wherein , the first marked Mel spectral feature includes: a spectral feature used to characterize the timbre of the corresponding second sample audio.
本公开对于第二样本音频的音色、时长、存储格式、第二样本音频的数量等等参数不作限定。第二样本音频可以包括具体目标音色的音频,也可以包括非目标音色的音频,或者,第二样本音频同时包括目标音色的音频和非目标音色的音频。The present disclosure does not limit parameters such as the timbre, duration, storage format, and quantity of the second sample audio of the second sample audio. The second sample audio may include the audio of the specific target tone, or may include the audio of the non-target tone, or the second sample audio may include both the audio of the target tone and the audio of the non-target tone.
在第一阶段的训练过程中,音色子模型102,用于对输入的第二样本音频对应的第三声学特征进行分析,并输出第二样本音频对应的预测梅尔频谱特征;再基于第二样本音频对应的第一标注梅尔频谱特征以及第二样本音频对应的预测梅尔频谱特征,对音色子模型102包括的参数的系数值进行调整;通过海量的第二样本音频对音色子模型102的不断迭代训练,获得中间模型。In the training process of the first stage, the timbre sub-model 102 is configured to analyze the third acoustic feature corresponding to the input second sample audio and output a predicted Mel spectral feature corresponding to the second sample audio; then, based on the first labeled Mel spectral feature corresponding to the second sample audio and the predicted Mel spectral feature corresponding to the second sample audio, the coefficient values of the parameters included in the timbre sub-model 102 are adjusted; through continuous iterative training of the timbre sub-model 102 with a large amount of second sample audio, an intermediate model is obtained.
在第一阶段的训练过程中,第一标注梅尔频谱特征可以理解为音色子模型102在第一阶段的学习目标。In the training process of the first stage, the first marked Mel spectrum feature can be understood as the learning goal of the timbre sub-model 102 in the first stage.
由于音色子模型102的输入是第二样本音频对应的第三声学特征,因此,第二样本音频无需标注对应的文本,从而可大大降低获取第二样本音频带来的时间及人力成本。且能够通过较低的成本获得大量的音频作为第二样本音频,用于音色子模型102的迭代训练,进而通过大量的第二样本音频对音色子模型102进行训练,使得中间模型具备较高的针对音色的语音合成控制能力。Since the input of the timbre sub-model 102 is the third acoustic feature corresponding to the second sample audio, the second sample audio does not need to be labeled with corresponding text, which can greatly reduce the time and labor cost of acquiring the second sample audio. Moreover, a large amount of audio can be obtained at a low cost as the second sample audio for iterative training of the timbre sub-model 102; by training the timbre sub-model 102 with a large amount of second sample audio, the intermediate model obtains a high speech-synthesis control capability for timbre.
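A minimal sketch of this first training stage is given below, assuming placeholder tensors for the acoustic features and labeled Mel features of a batch of second sample audio. A frame-wise multilayer perceptron stands in for the timbre sub-model here (the Fig. 1c structure is sketched separately later); the feature dimensions, loss, and optimizer are assumptions.

```python
import torch
import torch.nn as nn

# Stage 1: train on second sample audio (no transcripts needed). A frame-wise MLP stands in
# for the timbre sub-model; the 65 -> 80 dimensions and the loss/optimizer are assumptions.
timbre_model = nn.Sequential(nn.Linear(65, 256), nn.ReLU(), nn.Linear(256, 80))
optimizer = torch.optim.Adam(timbre_model.parameters(), lr=1e-4)
criterion = nn.L1Loss()

third_acoustic = torch.randn(8, 200, 65)     # labeled bottleneck (+ F0) features of second sample audio
labeled_mel = torch.randn(8, 200, 80)        # first labeled Mel spectral features (placeholders)

predicted_mel = timbre_model(third_acoustic)
loss = criterion(predicted_mel, labeled_mel)
optimizer.zero_grad()
loss.backward()
optimizer.step()
torch.save(timbre_model.state_dict(), "intermediate_model.pt")   # the stage-1 "intermediate model"
```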
第二阶段:second stage:
第二阶段中,是基于第三样本对中间模型进行训练,使中间模型学习目标音色,获得针对目标音色的语音合成控制能力。In the second stage, the intermediate model is trained based on the third sample, so that the intermediate model learns the target timbre and obtains the speech synthesis control ability for the target timbre.
需要说明的是,由于中间模型已经具备较高的针对音色的语音合成控制能力,因此,降低了对于第三样本音频的要求,例如,降低了对于第三样本音频的时长、第三样本音频的质量的要求,即使第三样本音频的时长较短、发音不清晰等等情况下,训练获得的最终的音色子模型102依然能够获得较高的针对目标音色的语音合成控制能力。It should be noted that, since the intermediate model already has a high speech-synthesis control capability for timbre, the requirements on the third sample audio are reduced, for example, the requirements on the duration and quality of the third sample audio; even if the duration of the third sample audio is short, the pronunciation is unclear, and so on, the final timbre sub-model 102 obtained through training can still achieve a high speech-synthesis control capability for the target timbre.
此外,第三样本音频具有目标音色,第三样本音频可以是用户录制的音频,也可以是用户上传的想要的音色的音频,本公开对于第三样本音频的来源以及获取方式不作限定。In addition, the third sample audio has a target tone, and the third sample audio may be an audio recorded by a user, or may be an audio of a desired tone uploaded by a user, and the disclosure does not limit the source and acquisition method of the third sample audio.
具体地,将第三样本音频对应的第四声学特征输入至中间模型,获取中间模型输出的第三样本音频对应的预测梅尔频谱特征;再基于第三样本音频对应的第二标注梅尔频谱特征以及第三样本音频对应的预测梅尔频谱特征,计算本轮训练对应的损失函数信息;根据损失函数信息,对中间模型包括的参数的系数值进行调整,从而获得最终的音色子模型102。Specifically, the fourth acoustic feature corresponding to the third sample audio is input into the intermediate model, and the predicted Mel spectral feature corresponding to the third sample audio output by the intermediate model is obtained; then, based on the second labeled Mel spectral feature corresponding to the third sample audio and the predicted Mel spectral feature corresponding to the third sample audio, the loss function information corresponding to the current round of training is calculated; according to the loss function information, the coefficient values of the parameters included in the intermediate model are adjusted, so as to obtain the final timbre sub-model 102.
在第二阶段的训练过程中,第三样本音频对应的第二标注梅尔频谱特征可以理解为中间模型的学习目标。During the training process of the second stage, the second labeled mel spectrum feature corresponding to the third audio sample can be understood as the learning target of the intermediate model.
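Continuing the stage-one sketch above, the second stage could fine-tune the saved intermediate model on a small amount of target-timbre audio roughly as follows. The reduced learning rate and the fixed number of iterations are assumptions, not requirements of the disclosure, and the model layout matches the stage-one stand-in.

```python
import torch
import torch.nn as nn

# Stage 2: fine-tune the saved intermediate model on a small amount of third sample audio
# (audio with the target timbre).
timbre_model = nn.Sequential(nn.Linear(65, 256), nn.ReLU(), nn.Linear(256, 80))
timbre_model.load_state_dict(torch.load("intermediate_model.pt"))
optimizer = torch.optim.Adam(timbre_model.parameters(), lr=1e-5)   # lower learning rate is an assumption
criterion = nn.L1Loss()

fourth_acoustic = torch.randn(2, 150, 65)    # labeled bottleneck (+ F0) features of target-timbre audio
labeled_mel = torch.randn(2, 150, 80)        # second labeled Mel spectral features (placeholders)

for _ in range(10):                           # a few fine-tuning iterations on the small data set
    predicted_mel = timbre_model(fourth_acoustic)
    loss = criterion(predicted_mel, labeled_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```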
结合前述关于韵律子模型101的介绍,在训练过程中,若韵律子模型101根据输入的第一样本音频的标注文本,输出的第五声学特征包括预测瓶颈特征,即韵律子模型101能够实现文本到瓶颈特征的映射,则输入音色子模型102的第二样本音频对应的第三声学特征包括第二样本音频对应的第二标注瓶颈特征,且输入中间模型的第三样本音频对应的第四声学特征包括第三样本音频对应的第三标注瓶颈特征。In combination with the foregoing introduction to the prosody sub-model 101, during training, if the fifth acoustic feature output by the prosody sub-model 101 according to the input labeled text of the first sample audio includes the predicted bottleneck feature, that is, the prosody sub-model 101 can realize the mapping from text to bottleneck features, then the third acoustic feature corresponding to the second sample audio input into the timbre sub-model 102 includes the second labeled bottleneck feature corresponding to the second sample audio, and the fourth acoustic feature corresponding to the third sample audio input into the intermediate model includes the third labeled bottleneck feature corresponding to the third sample audio.
其中,第二标注瓶颈特征和第三标注瓶颈特征可以通过ASR模型的编码器分别对第二样本音频和第三样本音频进行瓶颈特征提取获得,与获取第一标注瓶颈特征的实现方式类似,简明起见,此处不再赘述。The second labeled bottleneck feature and the third labeled bottleneck feature may be obtained by performing bottleneck feature extraction on the second sample audio and the third sample audio respectively with the encoder of the ASR model, in a manner similar to that of obtaining the first labeled bottleneck feature; for brevity, details are not repeated here.
在训练过程中,若韵律子模型101根据输入的第一样本音频的标注文本,输出的第五声学特征包括预测瓶颈特征和预测基频特征,即韵律子模型101能够实现文本到瓶颈特征和基频特征的映射,则输入音色子模型102的第二样本音频对应的第三声学特征包括第二样本音频对应的第二标注瓶颈特征和第二标注基频特征,且输入中间模型的第三样本音频对应的第四声学特征包括第三样本音频对应的第三标注瓶颈特征和第三标注基频特征。During training, if the fifth acoustic feature output by the prosody sub-model 101 according to the input labeled text of the first sample audio includes the predicted bottleneck feature and the predicted fundamental frequency feature, that is, the prosody sub-model 101 can realize the mapping from text to bottleneck features and fundamental frequency features, then the third acoustic feature corresponding to the second sample audio input into the timbre sub-model 102 includes the second labeled bottleneck feature and the second labeled fundamental frequency feature corresponding to the second sample audio, and the fourth acoustic feature corresponding to the third sample audio input into the intermediate model includes the third labeled bottleneck feature and the third labeled fundamental frequency feature corresponding to the third sample audio.
其中,第二标注瓶颈特征和第三标注瓶颈特征可以通过ASR模型的编码器分别对第二样本音频和第三样本音频进行瓶颈特征提取获得,与获取第一标注瓶颈特征的实现方式类似;第二标注基频特征和第三标注基频特征可以通过数字信号处理技术,分别对第二样本音频和第三样本音频进行分析获得,与获取第一标注基频特征的实现方式类似,简明起见,此处不再赘述。The second labeled bottleneck feature and the third labeled bottleneck feature may be obtained by performing bottleneck feature extraction on the second sample audio and the third sample audio respectively with the encoder of the ASR model, in a manner similar to that of obtaining the first labeled bottleneck feature; the second labeled fundamental frequency feature and the third labeled fundamental frequency feature may be obtained by analyzing the second sample audio and the third sample audio respectively with digital signal processing technology, in a manner similar to that of obtaining the first labeled fundamental frequency feature; for brevity, details are not repeated here.
综上,在训练过程中,音色子模型102的输入和韵律子模型101的输出保持一致。To sum up, during the training process, the input of the timbre sub-model 102 is consistent with the output of the prosody sub-model 101 .
此外,在对音色子模型102进行训练时,音色子模型102包括的各参数对应的系数的初始值可以是预先设定的,也可以是随机初始化的,本公开对此不作限定。In addition, when the timbre sub-model 102 is trained, the initial values of the coefficients corresponding to the parameters included in the timbre sub-model 102 may be preset or initialized randomly, which is not limited in the present disclosure.
且在第一阶段的训练过程中和第二阶段的训练过程中,分别采用的音色子模型对应的损失函数可以相同,也可以不同,本公开对此不作限定。Moreover, in the training process of the first stage and the training process of the second stage, the loss functions corresponding to the timbre sub-models used respectively may be the same or different, which is not limited in the present disclosure.
其中,图1c示例性地示出了音色子模型102的一种实现方式。参照图1c所示,音色子模型102可以采用自注意力(self-attention)的网络结构实现。Wherein, FIG. 1 c exemplarily shows an implementation manner of the timbre sub-model 102 . Referring to FIG. 1 c , the timbre sub-model 102 can be implemented using a self-attention network structure.
图1c中,音色子模型102包括:卷积网络1021、一个或者多个残差网络1022。其中,每个残差网络1022包括:自注意力网络1022a以 及线性网络1022b。In FIG. 1 c , the timbre sub-model 102 includes: a convolutional network 1021 and one or more residual networks 1022 . Wherein, each residual network 1022 includes: a self-attention network 1022a and a linear network 1022b.
卷积网络1021,主要用于对输入的样本音频对应的声学特征进行卷积处理,建模局部特征信息。其中,卷积网络1021可以包括一个或者多个卷积层,本公开对于卷积网络1021包括的卷积层的数量不做限制。且卷积网络1021将局部特征信息输入至相连接的残差网络1022。The convolution network 1021 is mainly used to perform convolution processing on the acoustic features corresponding to the input sample audio, and to model local feature information. Wherein, the convolutional network 1021 may include one or more convolutional layers, and this disclosure does not limit the number of convolutional layers included in the convolutional network 1021 . And the convolutional network 1021 inputs the local feature information to the connected residual network 1022 .
上述一个或多个残差网络1022用于对输入的局部特征信息进行处理,局部特征信息在经过上述一个或者多个残差网络1022之后,转换为频谱特征(如梅尔频谱特征)。The above one or more residual networks 1022 process the input local feature information; after passing through the one or more residual networks 1022, the local feature information is converted into spectral features (such as Mel spectral features).
应理解,中间模型与图1c所示的音色子模型102的结构相同,区别在于包括的参数的权重系数不完全相同。It should be understood that the structure of the intermediate model is the same as that of the timbre sub-model 102 shown in FIG. 1c, the difference lies in that the weight coefficients of the parameters included are not completely the same.
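A simplified PyTorch sketch of the Fig. 1c layout (a convolution network followed by residual blocks of self-attention plus a linear layer, then a projection to Mel bins) is given below; all sizes, the number of blocks, and the use of a multi-head attention module are illustrative assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Self-attention + linear layer with a residual connection (sizes are illustrative)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (batch, frames, dim)
        attn_out, _ = self.attn(x, x, x)           # self-attention over the frame sequence
        return x + self.linear(attn_out)           # residual connection

class TimbreSubModel(nn.Module):
    """Acoustic features (bottleneck + F0) -> Mel spectral features, loosely after Fig. 1c."""
    def __init__(self, in_dim=65, dim=256, n_blocks=4, n_mels=80):
        super().__init__()
        self.conv = nn.Sequential(                 # convolution network modelling local information
            nn.Conv1d(in_dim, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.blocks = nn.ModuleList(ResidualBlock(dim) for _ in range(n_blocks))
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, acoustic):                   # (batch, frames, in_dim)
        x = self.conv(acoustic.transpose(1, 2)).transpose(1, 2)
        for block in self.blocks:
            x = block(x)
        return self.to_mel(x)                      # (batch, frames, n_mels)

model = TimbreSubModel()
mel = model(torch.randn(2, 200, 65))
print(mel.shape)                                   # torch.Size([2, 200, 80])
```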
通过前述对韵律子模型101和音色子模型102分别进行训练,最终获得满足语音合成要求的第一特征提取模型和第二特征提取模型;再将最终获得的第一特征提取模型和第二特征提取模型进行拼接,即获得能够合成目标音色的语音合成模型。Through the foregoing separate training of the prosody sub-model 101 and the timbre sub-model 102, a first feature extraction model and a second feature extraction model that meet the speech synthesis requirements are finally obtained; the finally obtained first feature extraction model and second feature extraction model are then spliced together to obtain a speech synthesis model capable of synthesizing the target timbre.
一些可能的实施方式中,语音合成模型100还可以包括:声码器(vocoder)103。声码器103用于将音色子模型102输出的频谱特征(如梅尔频谱特征)转换为音频。当然,声码器也可以作为独立的模块,不与语音合成模型绑定在一起。且本方案对于声码器的具体类型不做限制。In some possible implementation manners, the speech synthesis model 100 may further include: a vocoder (vocoder) 103 . The vocoder 103 is used to convert the spectral features (such as Mel spectral features) output by the timbre sub-model 102 into audio. Of course, the vocoder can also be used as an independent module, not bound together with the speech synthesis model. And this solution does not limit the specific type of the vocoder.
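As a stand-in for the vocoder 103, the following sketch inverts Mel spectral features back to a waveform with Griffin-Lim via librosa. A trained neural vocoder would normally be used instead; the sine-wave test signal and the STFT parameters are assumptions, and the parameters must match those used when the Mel features were computed.

```python
import numpy as np
import librosa
import soundfile as sf

# Invert Mel spectral features back to a waveform with Griffin-Lim (vocoder stand-in).
sr, n_fft, hop_length, n_mels = 16000, 1024, 256, 80
t = np.arange(0, 2.0, 1.0 / sr)
y = 0.5 * np.sin(2 * np.pi * 220.0 * t)          # a 220 Hz tone stands in for synthesized content
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
waveform = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=n_fft, hop_length=hop_length)
sf.write("synthesized.wav", waveform, sr)         # the recovered audio corresponding to the Mel features
```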
在上述图1a至图1c所示实施例的基础上,通过训练最终获得的目标语音合成模型具有稳定合成目标音色的音频的能力,基于此,可使用目标语音合成模型处理相应的语音合成业务。On the basis of the above-mentioned embodiments shown in FIG. 1a to FIG. 1c, the target speech synthesis model finally obtained through training has the ability to stably synthesize the audio of the target timbre. Based on this, the target speech synthesis model can be used to process corresponding speech synthesis services.
FIG. 2 is a flowchart of a speech synthesis method provided by an embodiment of the present disclosure. Referring to FIG. 2, the speech synthesis method provided by this embodiment includes:

S201: Obtain text to be processed.

The text to be processed may include one or more characters, or one or more phonemes. The text to be processed is used to synthesize audio with a target rap style and a target timbre.

The present disclosure does not limit the manner in which the electronic device obtains the text to be processed.

For example, the electronic device may present a text input window and a soft keyboard, and the user inputs the text to be processed into the text input window by operating the soft keyboard displayed on the electronic device; alternatively, the user may paste the text to be processed into the text input window; alternatively, the user may input a piece of audio to the electronic device by voice, and the electronic device obtains the text to be processed by performing speech recognition on the input audio; alternatively, a file containing the text to be processed may be imported into the electronic device, so that the electronic device obtains the text to be processed.

The user may, but is not limited to, input the text to be processed into the electronic device in the manners exemplified above. For the user, the operation is simple and convenient, which helps increase the user's enthusiasm for creating multimedia content.
S202: Input the text to be processed into a speech synthesis model, and obtain the spectral features, corresponding to the text to be processed, that are output by the speech synthesis model.

In some embodiments, the text to be processed is input into the speech synthesis model; the prosody sub-model performs feature extraction on the text to be processed and outputs a first acoustic feature corresponding to the text to be processed, where the first acoustic feature includes a bottleneck feature corresponding to the text to be processed, and the bottleneck feature is used to characterize the target rap style; the timbre sub-model receives the first acoustic feature corresponding to the text to be processed as input and outputs the spectral features corresponding to the text to be processed.

In some other embodiments, the text to be processed is input into the speech synthesis model; the prosody sub-model performs feature extraction on the text to be processed and outputs a first acoustic feature corresponding to the text to be processed, where the first acoustic feature includes a bottleneck feature corresponding to the text to be processed and a fundamental frequency feature corresponding to the text to be processed, the bottleneck feature is used to characterize the target rap style, and the fundamental frequency feature is used to characterize pitch; the timbre sub-model receives the first acoustic feature corresponding to the text to be processed as input and outputs the spectral features (for example, Mel spectral features) corresponding to the text to be processed.

The speech synthesis model may be obtained through the implementations of the embodiments shown in FIG. 1a to FIG. 1c. For the network structure of the speech synthesis model and the implementation of training it, reference may be made to the detailed description of the embodiments shown in FIG. 1a to FIG. 1c; for brevity, details are not repeated here.

With reference to the embodiments shown in FIG. 1a and FIG. 1b, the text encoding network included in the prosody sub-model may receive the text to be processed as input and model an intermediate feature sequence by analyzing the context and temporal relations of the text to be processed; the attention coefficients output by the attention network included in the prosody sub-model are then weighted-averaged with the intermediate feature sequence to obtain a weighted average result; the decoding network included in the prosody sub-model performs feature conversion on the input weighted average result and the output of the previous time step, and outputs the first acoustic feature corresponding to the text to be processed, where the first acoustic feature may include the bottleneck feature corresponding to the text to be processed, or may include the bottleneck feature and the fundamental frequency feature corresponding to the text to be processed.
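A minimal PyTorch sketch of such an encoder-attention-decoder structure is given below. The embedding layer, GRU encoder, dot-product attention, and autoregressive GRU-cell decoder are assumptions chosen to mirror the description (intermediate feature sequence, attention coefficients, weighted average, and a decoder that consumes the previous time step's output); the actual network types and dimensions are not specified by the disclosure.

```python
import torch
import torch.nn as nn

class ProsodySubModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, bn_dim=256, use_f0=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Text encoding network: models context / temporal relations of the text.
        self.encoder = nn.GRU(emb_dim, emb_dim, batch_first=True, bidirectional=True)
        self.attn_query = nn.Linear(bn_dim, 2 * emb_dim)
        self.decoder = nn.GRUCell(2 * emb_dim + bn_dim, bn_dim)
        out_dim = bn_dim + (1 if use_f0 else 0)     # bottleneck frame (+ one F0 value)
        self.proj = nn.Linear(bn_dim, out_dim)

    def forward(self, token_ids, n_frames):
        memory, _ = self.encoder(self.embed(token_ids))  # intermediate feature sequence
        h = memory.new_zeros(token_ids.size(0), self.decoder.hidden_size)
        prev = memory.new_zeros(token_ids.size(0), self.decoder.hidden_size)
        outputs = []
        for _ in range(n_frames):
            # Attention coefficients over the intermediate sequence, then a weighted average.
            scores = torch.bmm(memory, self.attn_query(prev).unsqueeze(-1)).squeeze(-1)
            weights = torch.softmax(scores, dim=-1)
            context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)
            # Decoder consumes the weighted average and the previous time step's output.
            h = self.decoder(torch.cat([context, prev], dim=-1), h)
            prev = h
            outputs.append(self.proj(h))
        return torch.stack(outputs, dim=1)           # (batch, n_frames, bn_dim [+ 1])
```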
With reference to the embodiments shown in FIG. 1a and FIG. 1c, the convolutional network included in the timbre sub-model receives the first acoustic feature corresponding to the text to be processed as input, performs convolution processing on it, and models local feature information; the convolutional network feeds the local feature information into the connected residual network, and after one or more residual networks, the spectral features (for example, Mel spectral features) corresponding to the text to be processed are output.

S203: Obtain, according to the spectral features corresponding to the text to be processed, target audio corresponding to the text to be processed, where the target audio has the target timbre and the target rap style.

In a possible implementation, the electronic device may perform, based on a vocoder, digital signal processing on the spectral features corresponding to the text to be processed (for example, the Mel spectral features corresponding to the text to be processed), thereby converting them into audio with the target timbre and the target rap style, that is, the target audio.

It should be noted that the vocoder may be part of the speech synthesis model, in which case the speech synthesis model can directly output audio with the target timbre and the target rap style; in other cases, the vocoder may be an independent module outside the speech synthesis model, which receives the spectral features corresponding to the text to be processed as input and converts them into audio with the target timbre and the target rap style.
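As a self-contained illustration of this last conversion step, the sketch below inverts the Mel spectral features into a waveform with Griffin-Lim via librosa; in practice a neural vocoder would typically be used, and the sample rate, FFT parameters, and tensor layout here are assumptions rather than values taken from the disclosure.

```python
import librosa
import soundfile as sf

def mel_to_waveform(mel, sr=22050, n_fft=1024, hop_length=256):
    # mel: (n_mels, frames) power Mel spectrogram produced by the timbre sub-model.
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length)

# End-to-end use of the pipeline described above (model objects are assumed placeholders):
# first_feat = prosody_model(token_ids, n_frames)      # bottleneck (+ F0) features
# mel = timbre_model(first_feat)                        # (batch, frames, n_mels)
# wav = mel_to_waveform(mel[0].T.detach().cpu().numpy())
# sf.write("target_audio.wav", wav, 22050)
```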
In the speech synthesis method provided by this embodiment, the text to be processed is analyzed based on the speech synthesis model, which outputs the spectral features corresponding to the text to be processed. The speech synthesis model includes a prosody sub-model and a timbre sub-model: the prosody sub-model receives the text to be processed as input and outputs the first acoustic feature corresponding to the text to be processed, where the first acoustic feature includes a bottleneck feature used to characterize the target rap style; the timbre sub-model receives the first acoustic feature as input and outputs the spectral features corresponding to the text to be processed, which carry the information of the target timbre. By converting the spectral features output by the speech synthesis model, rap audio with the target rap style and the target timbre can be obtained, satisfying the user's personalized requirements for audio. Moreover, the speech synthesis model supports the conversion of arbitrary text to be processed, which lowers the requirements on the user's music creation ability and helps increase the user's enthusiasm for creating multimedia content.
FIG. 3 is a schematic flowchart of a speech synthesis method provided by another embodiment of the present disclosure. Referring to FIG. 3, on the basis of the embodiment shown in FIG. 2, after step S203 of obtaining the target audio corresponding to the text to be processed according to the spectral features corresponding to the text to be processed, the speech synthesis method provided by this embodiment may further include:

S204: Add the target audio corresponding to the text to be processed to target multimedia content.

The present disclosure does not limit the implementation of adding the target audio to the target multimedia content. For example, when adding the target audio to the target multimedia content, the electronic device may speed up or slow down the playback of the target audio according to the duration of the target multimedia content and the duration of the target audio; it may also add subtitles corresponding to the target audio to the playback interface of the target multimedia content, or omit them; and if subtitles corresponding to the target audio are added to the playback interface of the target multimedia content, display parameters of the subtitles such as color, font size, and font may also be set.
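One possible way to carry out this step is to mux the synthesized audio into the video with ffmpeg, using the atempo filter to speed the audio up or slow it down and, optionally, rendering a subtitle file onto the frames. The sketch below assumes ffmpeg is installed; the file names, tempo factor, and subtitle file are placeholders, and this is not an implementation mandated by the disclosure.

```python
import subprocess

def add_audio_to_video(video_path, audio_path, out_path, tempo=1.0, srt_path=None):
    cmd = ["ffmpeg", "-y", "-i", video_path, "-i", audio_path,
           "-map", "0:v", "-map", "1:a",
           "-filter:a", f"atempo={tempo}",           # speed the audio up or down
           "-shortest"]
    if srt_path:
        cmd += ["-vf", f"subtitles={srt_path}"]      # render subtitles onto the video
    else:
        cmd += ["-c:v", "copy"]                       # no re-encode needed without subtitles
    subprocess.run(cmd + [out_path], check=True)

# add_audio_to_video("content.mp4", "target_audio.wav", "out.mp4", tempo=1.1, srt_path="lyrics.srt")
```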
In the method provided by this embodiment, the text to be processed is analyzed based on the speech synthesis model, which outputs the spectral features corresponding to the text to be processed. The speech synthesis model includes a prosody sub-model and a timbre sub-model: the prosody sub-model receives the text to be processed as input and outputs the first acoustic feature corresponding to the text to be processed, where the first acoustic feature includes a bottleneck feature used to characterize the target rap style; the timbre sub-model receives the first acoustic feature as input and outputs the spectral features corresponding to the text to be processed, including spectral features used to characterize the target timbre. By converting the spectral features output by the speech synthesis model, audio with the target rap style and the target timbre can be obtained, satisfying the user's personalized requirements for audio; and the speech synthesis model supports the conversion of arbitrary text to be processed, which lowers the requirements on the user's music creation ability and helps increase the user's enthusiasm for creating multimedia content.

In addition, adding the target audio to the target multimedia content makes the target multimedia content more engaging, thereby satisfying the user's need to create creative videos.
Exemplarily, the present disclosure further provides a speech synthesis apparatus.

FIG. 4 is a schematic structural diagram of a speech synthesis apparatus provided by an embodiment of the present disclosure. Referring to FIG. 4, the speech synthesis apparatus 400 provided by this embodiment includes:

an obtaining module 401, configured to obtain text to be processed; and

a processing module 402, configured to input the text to be processed into a speech synthesis model and obtain the spectral features, corresponding to the text to be processed, that are output by the speech synthesis model; where the speech synthesis model includes a prosody sub-model and a timbre sub-model, the prosody sub-model is configured to output, according to the input text to be processed, a first acoustic feature corresponding to the text to be processed, the first acoustic feature including a bottleneck feature used to characterize a target rap style, and the timbre sub-model is configured to output, according to the input first acoustic feature, the spectral features corresponding to the text to be processed, the spectral features corresponding to the text to be processed including spectral features used to characterize a target timbre.

The processing module 402 is further configured to obtain, according to the spectral features corresponding to the text to be processed, target audio corresponding to the text to be processed, where the target audio has the target timbre and the target rap style.
In some possible implementations, the prosody sub-model is obtained by training according to annotated text corresponding to first sample audio and a second acoustic feature corresponding to the first sample audio;

the first sample audio includes at least one audio of the target rap style, and the second acoustic feature includes a first annotated bottleneck feature corresponding to the first sample audio.

In some possible implementations, the timbre sub-model is obtained by training according to a third acoustic feature corresponding to second sample audio, a first annotated spectral feature corresponding to the second sample audio, a fourth acoustic feature corresponding to third sample audio, and a second annotated spectral feature corresponding to the third sample audio;

where the third acoustic feature includes a second annotated bottleneck feature corresponding to the second sample audio, the third sample audio includes at least one audio with the target timbre, and the fourth acoustic feature corresponding to the third sample audio includes a third annotated bottleneck feature corresponding to the third sample audio.

In some possible implementations, the first annotated bottleneck feature corresponding to the first sample audio, the second annotated bottleneck feature corresponding to the second sample audio, and the third annotated bottleneck feature corresponding to the third sample audio are obtained by performing bottleneck feature extraction on the input first sample audio, second sample audio, and third sample audio, respectively, with an encoder of an end-to-end speech recognition model.
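As an illustration of how such annotated bottleneck features could be produced, the sketch below runs the acoustic features of a sample audio through the encoder of an end-to-end speech recognition model and keeps the frame-level hidden sequence. The asr_encoder object, the Mel front-end, and the tensor shapes are assumptions; any ASR encoder exposing an intermediate representation could play this role.

```python
import torch

@torch.no_grad()
def extract_bottleneck(asr_encoder, mel_frames):
    # mel_frames: (1, frames, n_mels) acoustic features of one sample audio.
    asr_encoder.eval()
    hidden = asr_encoder(mel_frames)     # (1, frames, hidden_dim) encoder outputs
    return hidden.squeeze(0)             # frame-level bottleneck features

# bn_first  = extract_bottleneck(asr_encoder, mel_of_first_sample)
# bn_second = extract_bottleneck(asr_encoder, mel_of_second_sample)
# bn_third  = extract_bottleneck(asr_encoder, mel_of_third_sample)
```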
In some possible implementations, the second acoustic feature further includes a first annotated fundamental frequency feature corresponding to the first sample audio;

the third acoustic feature further includes a second annotated fundamental frequency feature corresponding to the second sample audio; the fourth acoustic feature further includes a third annotated fundamental frequency feature corresponding to the third sample audio; and

the first acoustic feature further includes a fundamental frequency feature corresponding to the text to be processed.

In some possible implementations, the processing module 402 is further configured to add the target audio corresponding to the text to be processed to target multimedia content.

The speech synthesis apparatus provided by this embodiment can be used to execute the technical solutions of any of the foregoing method embodiments; its implementation principles and technical effects are similar, and reference may be made to the detailed description of the foregoing method embodiments. For brevity, details are not repeated here.
Exemplarily, the present disclosure further provides an electronic device.

FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure. Referring to FIG. 5, the electronic device provided by this embodiment includes a memory 501 and a processor 502.

The memory 501 may be an independent physical unit connected to the processor 502 through a bus 503. The memory 501 and the processor 502 may also be integrated together and implemented in hardware, or the like.

The memory 501 is used to store program instructions, and the processor 502 invokes the program instructions to execute the operations of any of the foregoing method embodiments.

Optionally, when part or all of the methods in the foregoing embodiments are implemented in software, the electronic device 500 may also include only the processor 502. In that case, the memory 501 for storing the program is located outside the electronic device 500, and the processor 502 is connected to the memory through circuits/wires to read and execute the program stored in the memory.

The processor 502 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.

The processor 502 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

The memory 501 may include volatile memory, such as random-access memory (RAM); it may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); and it may also include a combination of the above types of memory.
The present disclosure further provides a readable storage medium including computer program instructions; when the computer program instructions are executed by at least one processor of an electronic device, the speech synthesis method shown in any of the foregoing method embodiments is implemented.

The present disclosure further provides a program product including a computer program, where the computer program is stored in a readable storage medium, at least one processor of the electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the electronic device implements the speech synthesis method shown in any of the foregoing method embodiments.

It should be noted that, herein, relational terms such as "first" and "second" are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.

The above descriptions are only specific implementations of the present disclosure, enabling those skilled in the art to understand or implement the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present disclosure. Therefore, the present disclosure will not be limited to the embodiments described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

  1. A speech synthesis method, comprising:

    obtaining text to be processed;

    inputting the text to be processed into a speech synthesis model, and obtaining spectral features, corresponding to the text to be processed, that are output by the speech synthesis model; wherein the speech synthesis model comprises a prosody sub-model and a timbre sub-model, the prosody sub-model is configured to output, according to the input text to be processed, a first acoustic feature corresponding to the text to be processed, the first acoustic feature comprising a bottleneck feature used to characterize a target rap style; and the timbre sub-model is configured to output, according to the input first acoustic feature, the spectral features corresponding to the text to be processed, the spectral features corresponding to the text to be processed comprising spectral features used to characterize a target timbre; and

    obtaining, according to the spectral features corresponding to the text to be processed, target audio corresponding to the text to be processed, the target audio having the target timbre and the target rap style.
  2. The method according to claim 1, wherein the prosody sub-model is obtained by training according to annotated text corresponding to first sample audio and a second acoustic feature corresponding to the first sample audio;

    the first sample audio comprises at least one audio of the target rap style; and the second acoustic feature comprises a first annotated bottleneck feature corresponding to the first sample audio.

  3. The method according to claim 2, wherein the timbre sub-model is obtained by training according to a third acoustic feature corresponding to second sample audio, a first annotated spectral feature corresponding to the second sample audio, a fourth acoustic feature corresponding to third sample audio, and a second annotated spectral feature corresponding to the third sample audio;

    wherein the third acoustic feature comprises a second annotated bottleneck feature corresponding to the second sample audio; the third sample audio comprises at least one audio having the target timbre; and the fourth acoustic feature corresponding to the third sample audio comprises a third annotated bottleneck feature corresponding to the third sample audio.

  4. The method according to claim 3, wherein the first annotated bottleneck feature corresponding to the first sample audio, the second annotated bottleneck feature corresponding to the second sample audio, and the third annotated bottleneck feature corresponding to the third sample audio are obtained by performing bottleneck feature extraction on the input first sample audio, second sample audio, and third sample audio, respectively, with an encoder of an end-to-end speech recognition model.
  5. The method according to claim 3, wherein the second acoustic feature further comprises a first annotated fundamental frequency feature corresponding to the first sample audio;

    the third acoustic feature further comprises a second annotated fundamental frequency feature corresponding to the second sample audio; the fourth acoustic feature further comprises a third annotated fundamental frequency feature corresponding to the third sample audio; and

    the first acoustic feature further comprises a fundamental frequency feature corresponding to the text to be processed.

  6. The method according to claim 1, further comprising:

    adding the target audio corresponding to the text to be processed to target multimedia content.
  7. A speech synthesis apparatus, comprising:

    an obtaining module, configured to obtain text to be processed; and

    a processing module, configured to input the text to be processed into a speech synthesis model and obtain spectral features, corresponding to the text to be processed, that are output by the speech synthesis model; wherein the speech synthesis model comprises a prosody sub-model and a timbre sub-model, the prosody sub-model is configured to output, according to the input text to be processed, a first acoustic feature corresponding to the text to be processed, the first acoustic feature comprising a bottleneck feature, corresponding to the text to be processed, that is used to characterize a target rap style; and the timbre sub-model is configured to output, according to the input first acoustic feature, the spectral features corresponding to the text to be processed, the spectral features corresponding to the text to be processed comprising spectral features used to characterize a target timbre;

    the processing module being further configured to obtain, according to the spectral features corresponding to the text to be processed, target audio corresponding to the text to be processed, the target audio having the target timbre and the target rap style.
  8. An electronic device, comprising: a memory, a processor, and a computer program;

    the memory being configured to store the computer program; and

    the processor being configured to execute the computer program to implement the speech synthesis method according to any one of claims 1 to 6.

  9. A readable storage medium, comprising computer program instructions;

    wherein, when the computer program instructions are executed by at least one processor of an electronic device, the speech synthesis method according to any one of claims 1 to 6 is implemented.

  10. A program product, comprising computer program instructions;

    wherein the computer program instructions are stored in a readable storage medium, at least one processor of an electronic device reads the computer program instructions from the readable storage medium, and the at least one processor executes the computer program instructions to implement the speech synthesis method according to any one of claims 1 to 6.
PCT/CN2022/120120 2021-09-22 2022-09-21 Speech synthesis method and apparatus, electronic device, and readable storage medium WO2023045954A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111107875.8A CN115938338A (en) 2021-09-22 2021-09-22 Speech synthesis method, device, electronic equipment and readable storage medium
CN202111107875.8 2021-09-22

Publications (1)

Publication Number Publication Date
WO2023045954A1 true WO2023045954A1 (en) 2023-03-30

Family

ID=85720073

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/120120 WO2023045954A1 (en) 2021-09-22 2022-09-21 Speech synthesis method and apparatus, electronic device, and readable storage medium

Country Status (2)

Country Link
CN (1) CN115938338A (en)
WO (1) WO2023045954A1 (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105261355A (en) * 2015-09-02 2016-01-20 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus
US10692484B1 (en) * 2018-06-13 2020-06-23 Amazon Technologies, Inc. Text-to-speech (TTS) processing
CN111326138A (en) * 2020-02-24 2020-06-23 北京达佳互联信息技术有限公司 Voice generation method and device
CN111402855A (en) * 2020-03-06 2020-07-10 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111508469A (en) * 2020-04-26 2020-08-07 北京声智科技有限公司 Text-to-speech conversion method and device
CN112509552A (en) * 2020-11-27 2021-03-16 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112365882A (en) * 2020-11-30 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, model training method, device, equipment and storage medium
CN113409764A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727288A (en) * 2024-02-07 2024-03-19 翌东寰球(深圳)数字科技有限公司 Speech synthesis method, device, equipment and storage medium
CN117727288B (en) * 2024-02-07 2024-04-30 翌东寰球(深圳)数字科技有限公司 Speech synthesis method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN115938338A (en) 2023-04-07

Similar Documents

Publication Publication Date Title
US10789290B2 (en) Audio data processing method and apparatus, and computer storage medium
CN106898340B (en) Song synthesis method and terminal
US9552807B2 (en) Method, apparatus and system for regenerating voice intonation in automatically dubbed videos
US9318100B2 (en) Supplementing audio recorded in a media file
CN108831437B (en) Singing voice generation method, singing voice generation device, terminal and storage medium
WO2020098115A1 (en) Subtitle adding method, apparatus, electronic device, and computer readable storage medium
KR20210048441A (en) Matching mouth shape and movement in digital video to alternative audio
CN110675886A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
TW200821862A (en) RSS content administration for rendering RSS content on a digital audio player
CN110599998B (en) Voice data generation method and device
CN113053357B (en) Speech synthesis method, apparatus, device and computer readable storage medium
CN111465982A (en) Signal processing device and method, training device and method, and program
WO2023045954A1 (en) Speech synthesis method and apparatus, electronic device, and readable storage medium
WO2022126904A1 (en) Voice conversion method and apparatus, computer device, and storage medium
US11462207B1 (en) Method and apparatus for editing audio, electronic device and storage medium
CN113012678A (en) Method and device for synthesizing voice of specific speaker without marking
CN112580669B (en) Training method and device for voice information
TWI223231B (en) Digital audio with parameters for real-time time scaling
CN116013274A (en) Speech recognition method, device, computer equipment and storage medium
CN115171645A (en) Dubbing method and device, electronic equipment and storage medium
CN115910021A (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN113990295A (en) Video generation method and device
CN113870833A (en) Speech synthesis related system, method, device and equipment
JP2020173776A (en) Method and device for generating video
CN117423329B (en) Model training and voice generating method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22872000

Country of ref document: EP

Kind code of ref document: A1