WO2023045954A1 - Speech synthesis method and apparatus, electronic device and readable storage medium - Google Patents

Speech synthesis method and apparatus, electronic device and readable storage medium

Info

Publication number
WO2023045954A1
WO2023045954A1 (PCT/CN2022/120120; CN2022120120W)
Authority
WO
WIPO (PCT)
Prior art keywords
text
model
processed
feature
audio
Prior art date
Application number
PCT/CN2022/120120
Other languages
English (en)
Chinese (zh)
Inventor
代东洋
黄雷
陈彦洁
李鑫
陈远哲
王玉平
Original Assignee
北京字跳网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-09-22
Filing date
2022-09-21
Publication date
Application filed by 北京字跳网络技术有限公司
Publication of WO2023045954A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 - Changing voice quality, e.g. pitch or formants
    • G10L 21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L 21/013 - Adapting to target pitch
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band

Definitions

  • the present disclosure relates to the technical field of artificial intelligence, and in particular to a speech synthesis method, device, electronic equipment and readable storage medium.
  • the present disclosure provides a speech synthesis method, device, electronic equipment and readable storage medium.
  • the present disclosure provides a speech synthesis method, including:
  • the text to be processed is input into the speech synthesis model, and the spectral features corresponding to the text to be processed output by the speech synthesis model are obtained;
  • the speech synthesis model includes: a prosody sub-model and a timbre sub-model, and the prosody sub-model is used to output the first acoustic feature corresponding to the text to be processed according to the input text to be processed, and the first acoustic feature includes a bottleneck feature for characterizing the target rap style;
  • the timbre sub-model is used to output, according to the input first acoustic feature, the spectral feature corresponding to the text to be processed, and the spectral feature corresponding to the text to be processed includes the spectral feature used to characterize the target timbre;
  • according to the spectral feature corresponding to the text to be processed, the target audio corresponding to the text to be processed is acquired, and the target audio has the target timbre and the target rap style.
  • the prosody sub-model is obtained by training according to the labeled text corresponding to the first sample audio and the second acoustic feature corresponding to the first sample audio;
  • the first sample audio includes at least one audio of the target rap style; the second acoustic feature includes a first labeled bottleneck feature corresponding to the first sample audio.
  • the timbre sub-model is obtained through training according to the third acoustic feature corresponding to the second sample audio, the first labeled spectral feature corresponding to the second sample audio, the fourth acoustic feature corresponding to the third sample audio, and the second labeled spectral feature corresponding to the third sample audio;
  • the third acoustic feature includes the second labeled bottleneck feature corresponding to the second sample audio; the third sample audio includes at least one audio with the target timbre, and the fourth acoustic feature corresponding to the third sample audio includes a third labeled bottleneck feature corresponding to the third sample audio.
  • the first labeled bottleneck feature corresponding to the first sample audio, the second labeled bottleneck feature corresponding to the second sample audio, and the third labeled bottleneck feature corresponding to the third sample audio are obtained by an encoder of an end-to-end speech recognition model performing bottleneck feature extraction on the input first sample audio, second sample audio and third sample audio, respectively.
  • the second acoustic feature further includes: a first labeled fundamental frequency feature corresponding to the first sample audio;
  • the third acoustic feature also includes: a second labeled fundamental frequency feature corresponding to the second sample audio; the fourth acoustic feature further includes: a third labeled fundamental frequency feature corresponding to the third sample audio;
  • the first acoustic feature further includes: a fundamental frequency feature corresponding to the text to be processed.
  • the method also includes: adding the target audio corresponding to the text to be processed to target multimedia content.
  • the present disclosure provides a speech synthesis device, including:
  • an acquisition module, configured to acquire the text to be processed;
  • a processing module configured to input the text to be processed into a speech synthesis model, and obtain spectral features corresponding to the text to be processed output by the speech synthesis model;
  • the speech synthesis model includes: a prosody sub-model and a timbre sub-model, the prosody sub-model is used to output the first acoustic feature corresponding to the text to be processed according to the input text to be processed, and the first acoustic feature includes a bottleneck feature for characterizing the target rap style;
  • the timbre sub-model is used to output the spectral features corresponding to the text to be processed according to the input first acoustic feature, and the spectral features corresponding to the text to be processed include spectral features used to characterize the target timbre;
  • the processing module is configured to acquire target audio corresponding to the text to be processed according to the spectral feature corresponding to the text to be processed, the target audio having the target timbre and the target rap style.
  • the present disclosure provides an electronic device, including: a memory, a processor, and a computer program;
  • said memory is configured to store said computer program
  • the processor is configured to execute the computer program to implement the speech synthesis method according to any one of the first aspect.
  • the present disclosure provides a readable storage medium, including: a computer program; when the computer program is executed by at least one processor of an electronic device, the speech synthesis method according to any one of the first aspect is implemented.
  • the present disclosure provides a program product, the program product including: a computer program; the computer program is stored in a readable storage medium, an electronic device acquires the computer program from the readable storage medium, and at least one processor of the electronic device executes the computer program to implement the speech synthesis method according to any one of the first aspect.
  • the present disclosure provides a speech synthesis method, device, electronic equipment and readable storage medium, wherein the text to be processed is analyzed based on the speech synthesis model and the spectral features corresponding to the text to be processed are output. The speech synthesis model includes a prosody sub-model and a timbre sub-model: the prosody sub-model receives the text to be processed as input and outputs the first acoustic feature corresponding to the text to be processed, the first acoustic feature including a bottleneck feature for characterizing the target rap style; the timbre sub-model receives the first acoustic feature as input and outputs the spectral feature corresponding to the text to be processed, which includes the spectral feature used to characterize the target timbre. By converting the spectral feature output by the speech synthesis model, rap audio with the target rap style and the target timbre can be obtained, meeting the user's personalized needs for synthesized audio; moreover, the speech synthesis model supports the conversion of any text to be processed, which reduces the requirements on the user's music creation ability and is conducive to improving the user's enthusiasm for creating multimedia content.
  • FIG. 1a to 1c are structural schematic diagrams of a speech synthesis model provided by an embodiment of the present disclosure
  • FIG. 2 is a flowchart of a speech synthesis method provided by an embodiment of the present disclosure
  • FIG. 3 is a flowchart of a speech synthesis method provided by another embodiment of the present disclosure.
  • FIG. 4 is a schematic structural diagram of a speech synthesis device provided by an embodiment of the present disclosure.
  • Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • the present disclosure provides a speech synthesis method, device, electronic equipment, readable storage medium, and program product, wherein the method converts text into audio with a target rap style and target timbre through a pre-trained speech synthesis model, and the speech synthesis model realizes relatively independent control of speech synthesis by the target rap style and by the timbre, so as to meet the user's demand for personalized speech synthesis.
  • the target rap style mentioned in the present disclosure may include any type of rap style, and the present disclosure does not limit the specific rap style of the target rap style.
  • the target rap style may be any rap style among popular rap, alternative rap, comedy rap, jazz rap, and hip-hop rap.
  • the speech synthesis method provided by the present disclosure can be executed by electronic equipment.
  • the electronic device can be a tablet computer, a mobile phone (such as a folding-screen mobile phone or a large-screen mobile phone), a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), a smart TV, a smart screen, a high-definition TV, a 4K TV, a smart speaker, a smart projector, or another Internet of Things (IoT) device; this disclosure does not make any restrictions on the specific type of electronic equipment.
  • the electronic device that trains and obtains the speech synthesis model and the electronic device that uses the speech synthesis model to execute the speech synthesis service may be different electronic devices or the same electronic device, which is not limited in the present disclosure.
  • for example, the speech synthesis model is obtained through training on a server device, and the server device sends the trained speech synthesis model to a terminal device/server device, which executes the speech synthesis service according to the speech synthesis model; as another example, the speech synthesis model is trained by the server device, the trained speech synthesis model is then deployed on the server device, and the server device invokes the speech synthesis model to process the speech synthesis service.
  • the present disclosure does not limit this, and it can be set flexibly in practical applications.
  • the speech synthesis model in this solution is decoupled into two sub-models by introducing acoustic features that include bottleneck features, namely the prosody sub-model and the timbre sub-model, wherein the prosody sub-model is used to establish the deep mapping between text and the acoustic features including the bottleneck features, and the timbre sub-model is used to establish the deep mapping between the acoustic features including the bottleneck features and the spectral features.
  • the two decoupled feature extraction sub-models can be trained using different sample audio.
  • the prosody sub-model is used to establish a deep mapping between the text sequence and the acoustic features containing bottleneck features.
  • the prosody sub-model needs to use high-quality first sample audio with the target rap style, together with the annotated text corresponding to the first sample audio, as the sample data to train the prosody sub-model.
  • the timbre sub-model is used to establish the depth mapping between the acoustic features including bottleneck features and the spectral features.
  • the timbre sub-model can be trained using second sample audio whose corresponding text has not been annotated; since there is no need to label the text corresponding to the second sample audio, the cost of acquiring the second sample audio can be greatly reduced.
  • the acoustic features output by the prosody sub-model include the bottleneck features used to characterize the target rap style, enabling the rap style to control speech synthesis.
  • the acoustic features output by the prosody sub-model may also include fundamental frequency features used to characterize pitch, enabling pitch to control speech synthesis.
  • the spectral features corresponding to the text output by the timbre sub-model include the spectral features used to characterize the target timbre, enabling the timbre to control speech synthesis.
  • the spectral features output by the timbre sub-model also include the spectral features used to represent the target rap style, and the spectral features representing the target timbre and the spectral features representing the target rap style are the same spectral features. If the acoustic features output by the prosody sub-model also include fundamental frequency features, the spectral features output by the timbre sub-model also include spectral features representing the corresponding fundamental frequency, and the spectral features representing the target timbre, the target rap style, and the fundamental frequency are the same spectral features.
  • the speech synthesis model can be trained with a small amount of third sample audio having the target timbre, so that the final speech synthesis model can synthesize audio with the target timbre; even if the quality of the third sample audio is not high, for example the pronunciation is non-standard or the speech is not fluent, the speech synthesis model can still synthesize audio with the target timbre stably.
  • since the timbre sub-model has already been trained on the second sample audio, it already has a high ability to control timbre in speech synthesis; therefore, even if the timbre sub-model learns from only a small amount of third sample audio, it can master the target timbre well.
  • FIG. 1a shows the overall frame diagram of the training and acquisition of the speech synthesis model
  • Fig. 1b and Fig. 1c respectively exemplarily show the structural diagrams of the prosody sub-model and the timbre sub-model included in the speech synthesis model.
  • the speech synthesis model 100 includes: a prosody sub-model 101 and a timbre sub-model 102 .
  • the process of training the speech synthesis model 100 includes the process of training the prosody sub-model 101 and the process of training the timbre sub-model 102 .
  • the prosody sub-model 101 is trained according to the labeled text corresponding to the first sample audio and the labeled acoustic features (hereinafter the labeled acoustic features corresponding to the first sample audio are referred to as the second acoustic feature); by learning the relationship between the labeled text corresponding to the first sample audio and the second acoustic feature, the prosody sub-model 101 obtains the ability to establish a deep mapping between text and acoustic features including bottleneck features.
  • the aforementioned marked text may specifically be a text sequence.
  • the prosody sub-model 101 is specifically used to analyze the labeled text corresponding to the input first sample audio, model the intermediate feature sequence, perform feature conversion and dimensionality reduction on the intermediate feature sequence, and output the fifth acoustic features.
  • based on the fifth acoustic feature and the second acoustic feature, the loss function information of the current round of training is calculated, and the coefficient values of the parameters included in the prosody sub-model 101 are adjusted according to the loss function information of the current round of training.
  • through iterative training with the labeled texts corresponding to the first sample audios and the second acoustic features (including the first labeled bottleneck features) corresponding to the first sample audios, the prosody sub-model 101 satisfying the corresponding convergence condition is finally obtained.
  • the second acoustic feature corresponding to the first audio sample can be understood as the learning objective of the prosody sub-model 101 .
  • the first sample audio can be a high-quality audio file (high-quality audio can also be understood as clean audio), and the annotation text corresponding to the first sample audio can include one or more characters or one or more phonemes corresponding to the first sample audio, which is not limited in this disclosure.
  • the first audio sample can be obtained by recording and cleaning multiple times according to actual needs, or can also be obtained by filtering from an audio database and cleaning multiple times. The present disclosure does not limit the acquisition method of the first sample audio.
  • the annotation text corresponding to the first audio sample may also be obtained through repeated annotation and correction, so as to ensure the accuracy of the annotation text.
  • the first sample audio mentioned in this disclosure is audio with the target rap style. This disclosure does not limit the duration, file format, quantity and other parameters of the first sample audio, and the first sample audio may be pieces of music sung by the same singer or by different singers.
  • the fifth acoustic feature corresponding to the labeled text can be understood as the predicted acoustic feature corresponding to the labeled text output by the prosodic sub-model 101, and the fifth acoustic feature corresponding to the labeled text can also be understood as the fifth acoustic feature corresponding to the first sample audio .
  • the second acoustic feature includes: a first labeled bottleneck feature corresponding to the first audio sample.
  • the bottleneck is a nonlinear feature transformation technique and an effective dimensionality reduction technique.
  • the bottleneck feature may include information of dimensions such as prosody and content.
  • the first labeled bottleneck feature corresponding to the first audio sample may be obtained by an encoder (encoder) of an end-to-end speech recognition (ASR) model.
  • the first sample audio can be input into the ASR model 104, and the first labeled bottleneck feature corresponding to the first sample audio output by the encoder of the ASR model 104 is obtained; the encoder of the ASR model 104 is equivalent to a pre-trained extractor of bottleneck features, and it can be used to prepare sample data in this solution.
  • the ASR model 104 may also include other modules.
  • the ASR model 104 also includes a decoder (decoder) and an attention network (attention network).
  • the encoder of the ASR model 104 is only an example, and is not a limitation to the implementation manner of obtaining the first marked bottleneck feature corresponding to the first sample audio. In practical applications, it can also be obtained in other ways, which is not limited in the present disclosure.
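
To make the labeling step concrete, the following minimal PyTorch sketch shows how the narrow layer of an ASR-style encoder can be read out as per-frame bottleneck features. The toy two-layer LSTM encoder, the 80-bin mel input, the 64-dimensional bottleneck, and the `extract_bottleneck` helper are illustrative assumptions, not the disclosure's actual model.

```python
import torch
import torch.nn as nn

class TinyASREncoder(nn.Module):
    """Toy stand-in for the encoder of an end-to-end ASR model.

    The narrow projection after the recurrent layers plays the role of the
    'bottleneck' layer; its per-frame activations are the bottleneck features
    used as training labels in this scheme.
    """
    def __init__(self, n_mels: int = 80, hidden: int = 256, bottleneck_dim: int = 64):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True, bidirectional=True)
        self.bottleneck = nn.Linear(2 * hidden, bottleneck_dim)  # dimensionality reduction

    def forward(self, mel_frames: torch.Tensor) -> torch.Tensor:
        # mel_frames: (batch, time, n_mels) -> bottleneck features: (batch, time, bottleneck_dim)
        hidden_states, _ = self.rnn(mel_frames)
        return self.bottleneck(hidden_states)

def extract_bottleneck(encoder: TinyASREncoder, mel_frames: torch.Tensor) -> torch.Tensor:
    """Run a (pre-trained) ASR encoder in inference mode to label sample audio."""
    encoder.eval()
    with torch.no_grad():
        return encoder(mel_frames)

if __name__ == "__main__":
    encoder = TinyASREncoder()          # in practice: load pre-trained ASR weights
    mel = torch.randn(1, 200, 80)       # 200 frames of an 80-bin mel spectrogram
    bottleneck_features = extract_bottleneck(encoder, mel)
    print(bottleneck_features.shape)    # torch.Size([1, 200, 64])
```
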
  • the database stores the first sample audio and the first labeled bottleneck feature corresponding to the first sample audio
  • the electronic device may also acquire the first sample audio and the first labeled bottleneck feature from the database.
  • the second acoustic feature corresponding to the first sample audio includes: a first labeled bottleneck feature corresponding to the first sample audio and a first labeled fundamental frequency feature corresponding to the first sample audio.
  • the first marked bottleneck feature can refer to the detailed description of the foregoing examples, and for the sake of brevity, details are not repeated here.
  • the pitch represents the subjective feeling of the human ear for the pitch of the sound.
  • the pitch mainly depends on the fundamental frequency of the sound. The higher the fundamental frequency, the higher the pitch, and the lower the fundamental frequency, the lower the pitch.
  • pitch is also one of the important factors affecting the effect of speech synthesis.
  • this solution introduces the fundamental frequency feature in addition to the bottleneck feature, so that the final prosody sub-model 101 can output the corresponding bottleneck feature and fundamental frequency feature according to the input text.
  • the prosody sub-model 101 is specifically used to analyze the labeled text corresponding to the input first sample audio, model the intermediate feature sequence, perform feature conversion and dimensionality reduction on the intermediate feature sequence, and output the fifth acoustic features.
  • the fifth acoustic feature corresponding to the tagged text may be understood as the predicted acoustic feature corresponding to the tagged text output by the prosody sub-model 101 .
  • the fifth acoustic feature corresponding to the marked text may also be understood as the fifth acoustic feature corresponding to the first audio sample.
  • if the second acoustic feature corresponding to the first sample audio includes the first labeled bottleneck feature and the first labeled fundamental frequency feature, then during the training process the fifth acoustic feature corresponding to the first sample audio output by the prosody sub-model 101 also includes: a predicted bottleneck feature and a predicted fundamental frequency feature corresponding to the first sample audio.
  • the loss function information of the current round of training is calculated, and the coefficient values of the parameters included in the prosody sub-model 101 are adjusted according to the loss function information.
  • through iterative training with the labeled text corresponding to the first sample audio and the second acoustic feature corresponding to the first sample audio (including the first labeled bottleneck feature and the first labeled fundamental frequency feature), the prosody sub-model 101 satisfying the corresponding convergence condition is obtained.
  • the second acoustic feature corresponding to the first audio sample can be understood as the learning objective of the prosody sub-model 101 .
  • the first labeled fundamental frequency feature corresponding to the first sample audio can be obtained by analyzing the first sample audio by a digital signal processing (DSP) method.
  • digital signal processing may be performed on the first sample audio by the digital signal processor 105 to obtain the first labeled fundamental frequency feature corresponding to the first sample audio.
  • the specific implementation manner of the digital signal processor 105 is not limited, as long as it can extract the first marked fundamental frequency feature corresponding to the input first sample audio.
  • the first marked fundamental frequency feature corresponding to the first sample audio is not limited to be obtained by digital signal processing, and the present disclosure does not limit the implementation manner of obtaining the first marked fundamental frequency feature.
  • some databases store the first sample audio and the first labeled fundamental frequency feature corresponding to the first sample audio, and the first sample audio and the first labeled fundamental frequency feature may also be acquired from the database.
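
As one concrete DSP route for preparing the labeled fundamental frequency features, the sketch below uses librosa's pYIN pitch tracker. The sampling rate, hop length, pitch range, and the zero-filling of unvoiced frames are illustrative assumptions rather than requirements of the disclosure.

```python
import numpy as np
import librosa

def extract_f0(wav_path: str, sr: int = 22050, hop_length: int = 256) -> np.ndarray:
    """Return a per-frame fundamental-frequency (F0) contour in Hz."""
    audio, _ = librosa.load(wav_path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(
        audio,
        fmin=librosa.note_to_hz("C2"),   # ~65 Hz lower bound
        fmax=librosa.note_to_hz("C6"),   # ~1047 Hz upper bound
        sr=sr,
        hop_length=hop_length,
    )
    # pYIN marks unvoiced frames as NaN; replace them with 0 so the
    # contour can be used directly as a regression target.
    return np.nan_to_num(f0, nan=0.0)
```
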
  • the convergence condition corresponding to the prosody sub-model may include, but is not limited to, evaluation indicators such as the number of iterations and loss threshold.
  • the present disclosure does not limit the convergence conditions corresponding to the training prosodic sub-models.
  • depending on whether the electronic device performs training according to the first labeled bottleneck feature corresponding to the first sample audio only, or according to both the first labeled bottleneck feature and the first labeled fundamental frequency feature corresponding to the first sample audio, the convergence conditions may differ.
  • likewise, whether the electronic device performs training according to the first labeled bottleneck feature only, or according to both the first labeled bottleneck feature and the first labeled fundamental frequency feature corresponding to the first sample audio, the loss functions corresponding to the pre-built prosody sub-model can be the same or different.
  • the present disclosure does not limit the implementation manner of the loss function corresponding to the pre-built prosody sub-model.
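
One plausible form of such a loss, assuming the prosody sub-model regresses both the bottleneck features and, optionally, the fundamental frequency features, is a weighted sum of mean-squared errors; the weighting factor below is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def prosody_loss(pred_bottleneck, target_bottleneck, pred_f0=None, target_f0=None, f0_weight=0.5):
    """MSE on predicted bottleneck features, plus an optional weighted MSE on predicted F0."""
    loss = F.mse_loss(pred_bottleneck, target_bottleneck)
    if pred_f0 is not None and target_f0 is not None:
        loss = loss + f0_weight * F.mse_loss(pred_f0, target_f0)
    return loss
```
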
  • the network structure of the prosodic sub-model is exemplarily shown below.
  • FIG. 1 b exemplarily shows an implementation of the prosodic sub-model 101 .
  • the prosodic sub-model 101 may include: a text encoding network (text encoder) 1011 , an attention network (attention) 1012 and a decoding network (decoder) 1013 .
  • the text encoding network 1011 is used to receive text as input, analyze the context and temporal relationships of the input text, and model an intermediate feature sequence that contains the context information and temporal relationships.
  • the decoding network 1013 can adopt an autoregressive network structure, by using the output of the previous time step as the input of the next time step.
  • the attention network 1012 is mainly used to output attention coefficients.
  • the attention coefficient and the intermediate feature sequence output by the text encoding network 1011 are weighted and averaged to obtain a weighted average result, which is used as another conditional input for each time step of the decoding network 1013 .
  • the decoding network 1013 outputs the predicted acoustic features corresponding to the text by performing feature conversion on the input (ie, the weighted average result and the output of the previous time step).
  • the predicted acoustic features corresponding to the text output by the decoding network 1013 may include: predicted bottleneck features corresponding to the text; or, the predicted acoustic features corresponding to the text output by the decoding network 1013 may include: predicted bottleneck features corresponding to the text and predicted fundamental frequency features corresponding to the text.
  • the initial values of the coefficients of the parameters included in the prosody sub-model 101 may be randomly generated, preset, or determined in other ways, which is not limited in the present disclosure.
  • the prosody sub-model 101 is iteratively trained with the labeled texts corresponding to the plurality of first sample audios and the second acoustic features respectively corresponding to the first sample audios, and the coefficient values of the parameters included in the prosody sub-model 101 are continuously optimized, until the convergence condition of the prosody sub-model 101 is met and the training of the prosody sub-model 101 is stopped.
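
For illustration only, the sketch below shows one way the text-encoder / attention / autoregressive-decoder structure of Fig. 1b could be realized in PyTorch; the layer types, sizes, duration handling, and the 65-dimensional output (64 bottleneck dimensions plus one F0 value) are assumptions, not the disclosed model.

```python
import torch
import torch.nn as nn

class ProsodySubModel(nn.Module):
    """Sketch of a text-encoder / attention / autoregressive-decoder prosody model."""
    def __init__(self, vocab_size=100, emb=128, enc_hidden=128, dec_hidden=256, out_dim=65):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.text_encoder = nn.GRU(emb, enc_hidden, batch_first=True, bidirectional=True)
        self.attn_query = nn.Linear(dec_hidden, 2 * enc_hidden)
        self.decoder_cell = nn.GRUCell(2 * enc_hidden + out_dim, dec_hidden)
        self.proj = nn.Linear(dec_hidden, out_dim)

    def forward(self, token_ids: torch.Tensor, n_frames: int) -> torch.Tensor:
        # token_ids: (batch, text_len) -> acoustic features: (batch, n_frames, out_dim)
        memory, _ = self.text_encoder(self.embed(token_ids))          # intermediate feature sequence
        batch = token_ids.size(0)
        state = memory.new_zeros(batch, self.decoder_cell.hidden_size)
        prev_frame = memory.new_zeros(batch, self.proj.out_features)  # autoregressive input
        outputs = []
        for _ in range(n_frames):
            # attention: score encoder frames against the decoder state, then weight-average
            scores = torch.bmm(memory, self.attn_query(state).unsqueeze(-1)).squeeze(-1)
            weights = torch.softmax(scores, dim=-1)
            context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)
            state = self.decoder_cell(torch.cat([context, prev_frame], dim=-1), state)
            prev_frame = self.proj(state)
            outputs.append(prev_frame)
        return torch.stack(outputs, dim=1)

if __name__ == "__main__":
    model = ProsodySubModel()
    tokens = torch.randint(0, 100, (2, 20))   # a batch of two token sequences
    feats = model(tokens, n_frames=50)        # e.g. 64-dim bottleneck + 1-dim F0 per frame
    print(feats.shape)                        # torch.Size([2, 50, 65])
```
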
  • Training the timbre sub-model 102 includes two stages: the first stage is to train the timbre sub-model based on the second sample audio to obtain an intermediate model; the second stage is to fine-tune the intermediate model based on the third sample audio to obtain the final timbre sub-model.
  • the present disclosure does not limit the timbre of the second sample audio; in addition, the third sample audio is a sample audio with a target timbre.
  • the spectral features output by the above-mentioned timbre sub-model may be Mel spectral features, or other types of spectral features.
  • the first labeled spectral feature corresponding to the second sample audio input to the timbre sub-model is the first labeled Mel spectral feature
  • the second labeled spectral feature corresponding to the third sample audio is the second labeled Mel spectral feature.
  • in the following, the labeled spectral features are taken to be Mel spectral features and the predicted spectral feature output by the timbre sub-model is taken to be a predicted Mel spectral feature, as an example for illustration.
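
A common way to prepare such labeled Mel spectral features from sample audio is sketched below; the 80-band configuration, FFT/hop sizes, and log compression are illustrative assumptions.

```python
import numpy as np
import librosa

def extract_mel(wav_path: str, sr: int = 22050, n_fft: int = 1024,
                hop_length: int = 256, n_mels: int = 80) -> np.ndarray:
    """Return a (frames, n_mels) log-Mel spectrogram to use as a training label."""
    audio, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return np.log(np.clip(mel, 1e-5, None)).T  # time-major, log-compressed
```
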
  • The first stage is as follows.
  • the timbre sub-model 102 is used to perform iterative training according to the second sample audio to obtain an intermediate model.
  • the timbre sub-model 102 learns the mapping relationship between the third acoustic feature corresponding to the second sample audio and the first labeled mel spectrum feature of the second sample audio, and obtains an intermediate model with certain speech synthesis control capabilities for timbre, wherein , the first marked Mel spectral feature includes: a spectral feature used to characterize the timbre of the corresponding second sample audio.
  • the present disclosure does not limit parameters of the second sample audio such as its timbre, duration, storage format, and quantity.
  • the second sample audio may include audio with the target timbre, audio with a non-target timbre, or both audio with the target timbre and audio with a non-target timbre.
  • the timbre sub-model 102 is used to analyze the third acoustic feature corresponding to the input second sample audio and output the predicted Mel spectrum feature corresponding to the second sample audio; then, based on the first labeled Mel spectrum feature corresponding to the second sample audio and the predicted Mel spectrum feature corresponding to the second sample audio, the coefficient values of the parameters included in the timbre sub-model 102 are adjusted; through continuous iterative training, an intermediate model is obtained.
  • the first marked Mel spectrum feature can be understood as the learning goal of the timbre sub-model 102 in the first stage.
  • the second sample audio does not need to be annotated with corresponding text, which can greatly reduce the time and labor cost of obtaining the second sample audio. Moreover, a large amount of audio can be obtained at a lower cost as the second sample audio for iterative training of the timbre sub-model 102; by training the timbre sub-model 102 with a large amount of second sample audio, the intermediate model acquires a high ability to control timbre in speech synthesis.
  • in the second stage, the intermediate model is trained based on the third sample audio, so that the intermediate model learns the target timbre and obtains the speech synthesis control ability for the target timbre.
  • since the intermediate model already has a high ability to control timbre in speech synthesis, the requirements for the third sample audio (for example, its duration and quality) are reduced; even if the duration of the third sample audio is short or its pronunciation is not clear, the final timbre sub-model 102 obtained through training can still achieve a high speech synthesis control ability for the target timbre.
  • the third sample audio has the target timbre.
  • the third sample audio may be audio recorded by a user, or audio with a desired timbre uploaded by a user, and the disclosure does not limit the source and acquisition method of the third sample audio.
  • the fourth acoustic feature corresponding to the third sample audio is input into the intermediate model, and the predicted Mel spectrum feature corresponding to the third sample audio output by the intermediate model is obtained; then, based on the second labeled Mel spectrum feature corresponding to the third sample audio and the predicted Mel spectrum feature corresponding to the third sample audio, the loss function information corresponding to the current round of training is calculated; the coefficient values of the parameters included in the intermediate model are adjusted according to the loss function information, so as to obtain the final timbre sub-model 102.
  • the second labeled mel spectrum feature corresponding to the third audio sample can be understood as the learning target of the intermediate model.
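
Putting the two stages together, the fine-tuning procedure can be sketched as below, where `model`, the data iterables, and the learning rates are placeholders; only the pattern of training on plentiful second sample audio and then fine-tuning on a small amount of target-timbre third sample audio follows the description.

```python
import torch
import torch.nn as nn

def train_timbre_model(model: nn.Module,
                       stage1_data,            # iterable of (acoustic_feats, labeled_mel) from the 2nd sample audio
                       stage2_data,            # iterable of (acoustic_feats, labeled_mel) from the 3rd sample audio
                       stage1_epochs: int = 10,
                       stage2_epochs: int = 3) -> nn.Module:
    loss_fn = nn.MSELoss()

    # Stage 1: learn a generic acoustic-feature -> mel mapping (the "intermediate model").
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(stage1_epochs):
        for feats, mel in stage1_data:
            opt.zero_grad()
            loss_fn(model(feats), mel).backward()
            opt.step()

    # Stage 2: fine-tune the intermediate model on the target-timbre audio,
    # typically with a smaller learning rate so the learned mapping is only nudged.
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(stage2_epochs):
        for feats, mel in stage2_data:
            opt.zero_grad()
            loss_fn(model(feats), mel).backward()
            opt.step()
    return model
```
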
  • if the prosody sub-model 101 outputs, according to the input annotation text of the first sample audio, a fifth acoustic feature that includes the predicted bottleneck feature, that is, if the prosody sub-model 101 realizes the mapping from text to bottleneck features, then the third acoustic feature corresponding to the second sample audio input into the timbre sub-model 102 includes the second labeled bottleneck feature corresponding to the second sample audio, and the fourth acoustic feature corresponding to the third sample audio includes the third labeled bottleneck feature corresponding to the third sample audio.
  • the second labeled bottleneck feature and the third labeled bottleneck feature can be obtained by the encoder of the ASR model performing bottleneck feature extraction on the second sample audio and the third sample audio respectively, which is similar to the implementation of obtaining the first labeled bottleneck feature; for the sake of brevity, details are not repeated here.
  • if the output fifth acoustic feature includes the predicted bottleneck feature and the predicted fundamental frequency feature, that is, if the prosody sub-model 101 realizes the mapping from text to bottleneck features and fundamental frequency features, then the third acoustic feature corresponding to the second sample audio input into the timbre sub-model 102 includes the second labeled bottleneck feature and the second labeled fundamental frequency feature corresponding to the second sample audio, and the fourth acoustic feature corresponding to the third sample audio includes the third labeled bottleneck feature and the third labeled fundamental frequency feature corresponding to the third sample audio.
  • the second labeled bottleneck feature and the third labeled bottleneck feature can be obtained by extracting the bottleneck feature of the second sample audio and the third sample audio respectively by the encoder of the ASR model, which is similar to the implementation method of obtaining the first labeled bottleneck feature;
  • the second labeled fundamental frequency feature and the third labeled fundamental frequency feature can be obtained by analyzing the second sample audio and the third sample audio respectively through digital signal processing technology, which is similar to the implementation of obtaining the first labeled fundamental frequency feature; for the sake of brevity, details are not repeated here.
  • the input of the timbre sub-model 102 is consistent with the output of the prosody sub-model 101 .
  • the initial values of the coefficients corresponding to the parameters included in the timbre sub-model 102 may be preset or initialized randomly, which is not limited in the present disclosure.
  • the loss functions used for the timbre sub-model in the two training stages may be the same or different, which is not limited in the present disclosure.
  • FIG. 1 c exemplarily shows an implementation manner of the timbre sub-model 102 .
  • the timbre sub-model 102 can be implemented using a self-attention network structure.
  • the timbre sub-model 102 includes: a convolutional network 1021 and one or more residual networks 1022 .
  • each residual network 1022 includes: a self-attention network 1022a and a linear network 1022b.
  • the convolution network 1021 is mainly used to perform convolution processing on the acoustic features corresponding to the input sample audio, and to model local feature information.
  • the convolutional network 1021 may include one or more convolutional layers, and this disclosure does not limit the number of convolutional layers included in the convolutional network 1021 .
  • the convolutional network 1021 inputs the local feature information to the connected residual network 1022 .
  • the local feature information is converted into spectral features (such as Mel spectral features) after passing through the one or more residual networks 1022.
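
As an illustration of the convolution-plus-residual-self-attention structure of Fig. 1c, the following PyTorch sketch maps per-frame acoustic features to Mel-spectrogram frames; the dimensions, number of blocks, and activation choices are assumptions made for the example.

```python
import torch
import torch.nn as nn

class ResidualSelfAttentionBlock(nn.Module):
    """Self-attention followed by a linear layer, each wrapped in a residual connection."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.linear = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)
        x = x + attn_out                        # residual around self-attention
        return x + torch.relu(self.linear(x))  # residual around the linear layer

class TimbreSubModel(nn.Module):
    """Sketch of a conv front-end plus stacked residual self-attention blocks."""
    def __init__(self, in_dim: int = 65, dim: int = 256, n_mels: int = 80, n_blocks: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, dim, kernel_size=5, padding=2)  # local feature modelling
        self.blocks = nn.ModuleList(ResidualSelfAttentionBlock(dim) for _ in range(n_blocks))
        self.out = nn.Linear(dim, n_mels)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, in_dim) -> predicted mel: (batch, time, n_mels)
        x = self.conv(feats.transpose(1, 2)).transpose(1, 2)
        for block in self.blocks:
            x = block(x)
        return self.out(x)

if __name__ == "__main__":
    model = TimbreSubModel()
    feats = torch.randn(2, 50, 65)
    print(model(feats).shape)  # torch.Size([2, 50, 80])
```
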
  • the structure of the intermediate model is the same as that of the timbre sub-model 102 shown in FIG. 1c; the difference is that the weight coefficients of the included parameters are not completely the same.
  • after the prosody sub-model and the timbre sub-model that meet the requirements of speech synthesis are finally obtained, the two sub-models are spliced to obtain a speech synthesis model capable of synthesizing audio with the target timbre.
  • the speech synthesis model 100 may further include: a vocoder (vocoder) 103 .
  • the vocoder 103 is used to convert the spectral features (such as Mel spectral features) output by the timbre sub-model 102 into audio.
  • the vocoder can also be used as an independent module that is not bound to the speech synthesis model, and this solution does not limit the specific type of vocoder.
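
As one simple vocoder-like conversion (neural vocoders are the usual alternative in practice), a log-Mel spectrogram can be inverted with librosa's Griffin-Lim-based utility; the parameters below are assumed to match the feature-extraction settings of the earlier sketches.

```python
import numpy as np
import librosa
import soundfile as sf

def mel_to_wav(log_mel: np.ndarray, out_path: str, sr: int = 22050,
               n_fft: int = 1024, hop_length: int = 256) -> None:
    """Invert a (frames, n_mels) log-Mel spectrogram to a waveform and save it."""
    mel = np.exp(log_mel).T                    # undo log compression, back to (n_mels, frames)
    audio = librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length
    )
    sf.write(out_path, audio, sr)
```
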
  • the target speech synthesis model finally obtained through training has the ability to stably synthesize the audio of the target timbre. Based on this, the target speech synthesis model can be used to process corresponding speech synthesis services.
  • Fig. 2 is a flowchart of a speech synthesis method provided by an embodiment of the present disclosure. As shown in Figure 2, the speech synthesis method provided by this embodiment includes:
  • the text to be processed may include one or more characters, or the text to be processed may also include one or more phonemes.
  • the text to be processed is used to synthesize audio with the target rap style and target timbre.
  • the present disclosure does not limit the manner in which the electronic device obtains the text to be processed.
  • the electronic device can display a text input window and a soft keyboard to the user, and the user can input the text to be processed into the text input window by operating the soft keyboard displayed on the electronic device; or, the user can also input the text to be processed by copying and pasting it into the text input window; or, the user can also input a piece of audio to the electronic device by voice, and the electronic device obtains the text to be processed by performing speech recognition on the audio input by the user; or, a file containing the text to be processed can be imported into the electronic device, so that the electronic device obtains the text to be processed.
  • the user may, but is not limited to, input the text to be processed into the electronic device by means of the above examples.
  • the operation is simple and convenient, and the user's enthusiasm for creating multimedia content can be enhanced.
  • the text to be processed is input into the speech synthesis model
  • the prosody sub-model outputs the first acoustic feature corresponding to the text to be processed by performing feature extraction on the text to be processed, and the first acoustic feature includes the bottleneck feature corresponding to the text to be processed, wherein the bottleneck feature included in the first acoustic feature is used to characterize the target rap style; the timbre sub-model receives the first acoustic feature corresponding to the text to be processed as input, and outputs the spectral feature corresponding to the text to be processed.
  • in another implementation, the text to be processed is input into the speech synthesis model, and the prosody sub-model outputs the first acoustic feature corresponding to the text to be processed by performing feature extraction on the text to be processed; the first acoustic feature includes the bottleneck feature corresponding to the text to be processed and the fundamental frequency feature corresponding to the text to be processed, wherein the bottleneck feature included in the first acoustic feature is used to characterize the target rap style, and the fundamental frequency feature included in the first acoustic feature is used to characterize the pitch; the timbre sub-model receives the first acoustic feature corresponding to the text to be processed as input, and outputs the spectral feature (such as a Mel spectral feature) corresponding to the text to be processed.
  • the speech synthesis model can be obtained through the implementation of the embodiments shown in Figures 1a to 1c; the network structure of the speech synthesis model and the way of training the speech synthesis model can refer to the implementations shown in the aforementioned Figures 1a to 1c, and for the sake of brevity, the detailed description is not repeated here.
  • the text encoding network included in the prosody sub-model can receive the text to be processed as input and model an intermediate feature sequence by analyzing the context and temporal relationships of the text to be processed; the attention coefficient output by the attention network included in the prosody sub-model is then weighted and averaged with the intermediate feature sequence to obtain a weighted average result; the decoding network included in the prosody sub-model performs feature conversion on the input weighted average result and the output of the previous time step, and outputs the first acoustic feature corresponding to the text to be processed, where the first acoustic feature may include the bottleneck feature corresponding to the text to be processed, or the bottleneck feature corresponding to the text to be processed together with the fundamental frequency feature corresponding to the text to be processed.
  • the convolutional network included in the timbre sub-model receives the first acoustic feature corresponding to the text to be processed as input, performs convolution processing on the first acoustic feature corresponding to the text to be processed, and models local feature information; the convolutional network inputs the local feature information into the connected residual network, and after passing through one or more residual networks, the spectral features (such as Mel spectral features) corresponding to the text to be processed are output.
  • the electronic device may perform digital signal processing on the spectral features corresponding to the text to be processed based on the vocoder, so as to convert the spectral features corresponding to the text to be processed (such as the Mel spectrum feature corresponding to the text to be processed) into audio with the target timbre and the target rap style, i.e., the target audio.
  • in some cases, the vocoder can be used as a part of the speech synthesis model, and the speech synthesis model can directly output audio with the target timbre and target rap style; in other cases, the vocoder can be used as an independent module that receives the spectral features corresponding to the text to be processed as input and converts them into audio with the target timbre and the target rap style.
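
The end-to-end inference path described above, from text to rap-style audio, can be strung together as in the sketch below; the character-level tokenizer, the frame-count heuristic, and the injected model/vocoder callables (for example, the illustrative models and `mel_to_wav` helper from the earlier sketches) are all assumptions.

```python
import torch

def synthesize(text: str, prosody_model, timbre_model, vocoder, out_path: str = "target_audio.wav") -> str:
    """Text -> first acoustic feature (bottleneck [+ F0]) -> mel spectrogram -> waveform."""
    # Toy character-level "tokenizer"; a real system would map text to phonemes.
    token_ids = torch.tensor([[ord(c) % 100 for c in text]])
    n_frames = max(10, 10 * token_ids.size(1))          # crude duration heuristic

    prosody_model.eval()
    timbre_model.eval()
    with torch.no_grad():
        acoustic = prosody_model(token_ids, n_frames)   # prosody sub-model output
        mel = timbre_model(acoustic)                    # spectral feature with the target timbre
    vocoder(mel.squeeze(0).numpy(), out_path)           # vocoder-style conversion to audio
    return out_path
```
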
  • the speech synthesis method analyzes the text to be processed based on the speech synthesis model and outputs the spectral features corresponding to the text to be processed, wherein the speech synthesis model includes a prosody sub-model and a timbre sub-model: the prosody sub-model receives the text to be processed as input and outputs the first acoustic feature corresponding to the text to be processed, the first acoustic feature including the bottleneck feature used to characterize the target rap style; the timbre sub-model receives the first acoustic feature as input and outputs the spectral features corresponding to the text to be processed, which include the information of the target timbre. By converting the spectral features output by the speech synthesis model, rap audio with the target rap style and target timbre can be obtained, which meets the user's individual needs for audio; and the speech synthesis model supports the conversion of any text to be processed, which reduces the requirements on the user's music creation ability and is conducive to improving the user's enthusiasm for creating multimedia content.
  • Fig. 3 is a schematic flowchart of a speech synthesis method provided by another embodiment of the present disclosure.
  • the speech synthesis method provided by this embodiment is based on the embodiment shown in Fig. 2; after step S203, that is, after obtaining the target audio corresponding to the text to be processed according to the spectral features corresponding to the text to be processed, the method may further include:
  • the present disclosure does not limit the implementation manner of adding the target audio to the target multimedia content.
  • when the electronic device adds the target audio to the target multimedia content, it can combine the duration of the target multimedia content and the duration of the target audio to speed up or slow down the playback speed of the target audio; it can also add subtitles corresponding to the target audio on the playback interface of the target multimedia content, or omit the subtitles corresponding to the target audio; if the subtitles corresponding to the target audio are added on the playback interface of the target multimedia content, display parameters such as the color, font size, and font of the subtitles can also be set.
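
As one possible way to realize the tempo adjustment and muxing described here, the synthesized target audio could be fitted to the video with ffmpeg invoked from Python; this assumes ffmpeg is installed and that a single atempo factor within the commonly supported 0.5-2.0 range is sufficient.

```python
import subprocess

def add_audio_to_video(video_path: str, audio_path: str, out_path: str,
                       video_duration: float, audio_duration: float) -> None:
    """Speed the target audio up or down to match the video, then mux it in."""
    tempo = audio_duration / video_duration          # >1 speeds up, <1 slows down
    tempo = min(max(tempo, 0.5), 2.0)                # clamp to the range a single atempo filter accepts
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", audio_path,
        "-filter_complex", f"[1:a]atempo={tempo:.3f}[a]",
        "-map", "0:v", "-map", "[a]",
        "-shortest",
        out_path,
    ], check=True)
```
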
  • the method provided in this embodiment analyzes the text to be processed based on the speech synthesis model and outputs the spectral features corresponding to the text to be processed, wherein the speech synthesis model includes a prosody sub-model and a timbre sub-model: the prosody sub-model receives the text to be processed as input and outputs the first acoustic feature corresponding to the text to be processed, the first acoustic feature including a bottleneck feature used to characterize the target rap style; the timbre sub-model receives the first acoustic feature as input and outputs the spectral features corresponding to the text to be processed, including spectral features used to characterize the target timbre. By converting the spectral features output by the speech synthesis model, audio with the target rap style and target timbre can be obtained, which meets the user's individual needs for audio; and the speech synthesis model supports the conversion of any text to be processed, which reduces the requirements on the user's music creation ability.
  • adding the target audio to the target multimedia content makes the target multimedia content more interesting, thereby satisfying the user's demand for creative video creation.
  • the present disclosure also provides a speech synthesis device.
  • Fig. 4 is a schematic structural diagram of a speech synthesis device provided by an embodiment of the present disclosure.
  • the speech synthesis device 400 provided in this embodiment includes:
  • An acquisition module 401 configured to acquire text to be processed.
  • the processing module 402 is configured to input the text to be processed into the speech synthesis model, and obtain the spectral features corresponding to the text to be processed output by the speech synthesis model; wherein the speech synthesis model includes: a prosody sub-model and a timbre sub-model, the prosody sub-model is used to output the first acoustic feature corresponding to the text to be processed according to the input text to be processed, and the first acoustic feature includes a bottleneck feature for characterizing the target rap style; the timbre sub-model is used to output the spectral features corresponding to the text to be processed according to the input first acoustic feature, and the spectral features corresponding to the text to be processed include spectral features used to characterize the target timbre.
  • the processing module 402 is further configured to acquire the target audio corresponding to the text to be processed according to the spectral features corresponding to the text to be processed, the target audio having the target timbre and the target rap style.
  • the prosody sub-model is obtained by training according to the labeled text corresponding to the first sample audio and the second acoustic feature corresponding to the first sample audio;
  • the first sample audio includes at least one audio of the target rap style; the second acoustic feature includes a first labeled bottleneck feature corresponding to the first sample audio.
  • the timbre sub-model is obtained through training according to the third acoustic feature corresponding to the second sample audio, the first labeled spectral feature corresponding to the second sample audio, the fourth acoustic feature corresponding to the third sample audio, and the second labeled spectral feature corresponding to the third sample audio;
  • the third acoustic feature includes the second labeled bottleneck feature corresponding to the second sample audio; the third sample audio includes at least one audio with the target timbre, and the fourth acoustic feature corresponding to the third sample audio includes a third labeled bottleneck feature corresponding to the third sample audio.
  • the first labeled bottleneck feature corresponding to the first sample audio, the second labeled bottleneck feature corresponding to the second sample audio, and the third labeled bottleneck feature corresponding to the third sample audio are obtained by an encoder of an end-to-end speech recognition model performing bottleneck feature extraction on the input first sample audio, second sample audio and third sample audio, respectively.
  • the second acoustic feature further includes: a first labeled fundamental frequency feature corresponding to the first sample audio;
  • the third acoustic feature also includes: a second labeled fundamental frequency feature corresponding to the second sample audio; the fourth acoustic feature further includes: a third labeled fundamental frequency feature corresponding to the third sample audio;
  • the first acoustic feature further includes: a fundamental frequency feature corresponding to the text to be processed.
  • the processing module 402 is further configured to add the target audio corresponding to the text to be processed to the target multimedia content.
  • the speech synthesis device provided in this embodiment can be used to implement the technical method of any of the above method embodiments, and its implementation principle and technical effect are similar. For details, please refer to the detailed description of the foregoing method embodiments.
  • the present disclosure also provides an electronic device.
  • Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • the electronic device provided in this embodiment includes: a memory 501 and a processor 502 .
  • the memory 501 may be an independent physical unit, and may be connected with the processor 502 through the bus 503 .
  • the memory 501 and the processor 502 may also be integrated together, implemented by hardware, and the like.
  • the memory 501 is used to store program instructions, and the processor 502 invokes the program instructions to execute the operations of any one of the above method embodiments.
  • the foregoing electronic device 500 may also include only the processor 502 .
  • the memory 501 for storing programs is located outside the electronic device 500, and the processor 502 is connected to the memory through circuits/wires, and is used to read and execute the programs stored in the memory.
  • the processor 502 may be a central processing unit (central processing unit, CPU), a network processor (network processor, NP) or a combination of CPU and NP.
  • the processor 502 may further include a hardware chip.
  • the aforementioned hardware chip may be an application-specific integrated circuit (application-specific integrated circuit, ASIC), a programmable logic device (programmable logic device, PLD) or a combination thereof.
  • the aforementioned PLD may be a complex programmable logic device (complex programmable logic device, CPLD), a field-programmable gate array (field-programmable gate array, FPGA), a general array logic (generic array logic, GAL) or any combination thereof.
  • the memory 501 may include a volatile memory (volatile memory), such as a random-access memory (random-access memory, RAM); the memory may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a hard disk drive (hard disk drive, HDD) or a solid-state drive (solid-state drive, SSD); the memory may also include a combination of the above types of memory.
  • the present disclosure also provides a readable storage medium, including: computer program instructions; when the computer program instructions are executed by at least one processor of the electronic device, the speech synthesis method shown in any one of the above method embodiments is implemented.
  • the present disclosure also provides a program product; the program product includes a computer program, the computer program is stored in a readable storage medium, at least one processor of the electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program to enable the electronic device to implement the speech synthesis method shown in any one of the above method embodiments.
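
By way of illustration only, the following is a minimal sketch of how the labeled bottleneck features and labeled fundamental frequency features for the sample audios might be produced. The AsrEncoder class, the label_sample_audio function, the 80-dimensional mel input and the 256-dimensional bottleneck are assumptions made for this example and are not taken from the disclosure; the disclosure only specifies that an encoder of an end-to-end speech recognition model performs the bottleneck feature extraction.

```python
import librosa
import torch
import torch.nn as nn


class AsrEncoder(nn.Module):
    """Stand-in for the encoder of an end-to-end speech recognition model (assumed architecture)."""

    def __init__(self, n_mels: int = 80, bottleneck_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, bottleneck_dim, batch_first=True)

    def forward(self, mel_frames: torch.Tensor) -> torch.Tensor:
        # mel_frames: (batch, frames, n_mels) -> bottleneck: (batch, frames, bottleneck_dim)
        bottleneck, _ = self.rnn(mel_frames)
        return bottleneck


def label_sample_audio(waveform, sr: int, encoder: AsrEncoder):
    """Produce the labeled bottleneck feature and labeled fundamental frequency feature for one sample audio."""
    # Mel spectrogram as the encoder input (the actual front-end is not specified in the disclosure).
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=80)   # (n_mels, frames)
    mel_t = torch.from_numpy(mel.T).unsqueeze(0).float()                 # (1, frames, n_mels)
    with torch.no_grad():
        bottleneck = encoder(mel_t)                                      # labeled bottleneck feature
    # Fundamental frequency track as the labeled F0 feature (pYIN used here purely as an example).
    f0, _, _ = librosa.pyin(waveform,
                            fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"),
                            sr=sr)
    return bottleneck, f0
```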

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)

Abstract

Provided are a speech synthesis method and apparatus, an electronic device, a readable storage medium and a program product. The method comprises: acquiring a text to be processed (S201); inputting the text to be processed into a speech synthesis model to obtain an output spectral feature corresponding to the text to be processed (S202), the speech synthesis model comprising a rhythm sub-model and a timbre sub-model, the rhythm sub-model being used to output a corresponding first acoustic feature according to the input text to be processed, the first acoustic feature comprising a bottleneck feature representing a target rap style, and the timbre sub-model being used to output, according to the input first acoustic feature, a spectral feature representing a target timbre; and obtaining, according to the spectral feature corresponding to the text to be processed, a target audio corresponding to the text to be processed, the target audio having the target timbre and the target rap style (S203).
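
As a reading aid only, the following minimal sketch illustrates the pipeline summarized in the abstract: the rhythm sub-model maps the text to be processed to a first acoustic feature (including a bottleneck feature representing the target rap style), the timbre sub-model maps that feature to a spectral feature representing the target timbre, and a vocoder turns the spectral feature into the target audio. SpeechSynthesisModel, synthesize and the vocoder argument are hypothetical names introduced for illustration; the actual sub-model architectures are not specified here.

```python
import torch
import torch.nn as nn


class SpeechSynthesisModel(nn.Module):
    """Wrapper combining the two sub-models described in the abstract (names assumed)."""

    def __init__(self, rhythm_submodel: nn.Module, timbre_submodel: nn.Module):
        super().__init__()
        self.rhythm_submodel = rhythm_submodel  # text -> first acoustic feature (target rap style)
        self.timbre_submodel = timbre_submodel  # first acoustic feature -> spectral feature (target timbre)

    def forward(self, text_to_be_processed: torch.Tensor) -> torch.Tensor:
        first_acoustic_feature = self.rhythm_submodel(text_to_be_processed)
        spectral_feature = self.timbre_submodel(first_acoustic_feature)
        return spectral_feature


def synthesize(text_to_be_processed: torch.Tensor,
               model: SpeechSynthesisModel,
               vocoder: nn.Module) -> torch.Tensor:
    # S202: obtain the spectral feature corresponding to the text to be processed.
    spectral_feature = model(text_to_be_processed)
    # S203: obtain the target audio, which has the target timbre and the target rap style.
    target_audio = vocoder(spectral_feature)
    return target_audio
```

The point of the split is that the rap-style prosody and the target timbre can, in principle, be learned from different training data, which is consistent with the separate sample-audio sets described in the embodiments above.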
PCT/CN2022/120120 2021-09-22 2022-09-21 Procédé et appareil de synthèse de la parole, dispositif électronique et support de stockage lisible WO2023045954A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111107875.8A CN115938338A (zh) 2021-09-22 2021-09-22 语音合成方法、装置、电子设备及可读存储介质
CN202111107875.8 2021-09-22

Publications (1)

Publication Number Publication Date
WO2023045954A1 true WO2023045954A1 (fr) 2023-03-30

Family

ID=85720073

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/120120 WO2023045954A1 (fr) 2021-09-22 2022-09-21 Procédé et appareil de synthèse de la parole, dispositif électronique et support de stockage lisible

Country Status (2)

Country Link
CN (1) CN115938338A (fr)
WO (1) WO2023045954A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727288A (zh) * 2024-02-07 2024-03-19 翌东寰球(深圳)数字科技有限公司 一种语音合成方法、装置、设备及存储介质
CN117912446A (zh) * 2023-12-27 2024-04-19 暗物质(北京)智能科技有限公司 一种音色和风格深度解耦的语音风格迁移系统及方法

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117975933B (zh) * 2023-12-29 2024-08-27 北京稀宇极智科技有限公司 音色混合方法和装置、音频处理方法和装置、电子设备、存储介质

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105261355A (zh) * 2015-09-02 2016-01-20 百度在线网络技术(北京)有限公司 一种语音合成方法和装置
US10692484B1 (en) * 2018-06-13 2020-06-23 Amazon Technologies, Inc. Text-to-speech (TTS) processing
CN111326138A (zh) * 2020-02-24 2020-06-23 北京达佳互联信息技术有限公司 语音生成方法及装置
CN111402855A (zh) * 2020-03-06 2020-07-10 北京字节跳动网络技术有限公司 语音合成方法、装置、存储介质和电子设备
CN111508469A (zh) * 2020-04-26 2020-08-07 北京声智科技有限公司 一种文语转换方法及装置
CN112365882A (zh) * 2020-11-30 2021-02-12 北京百度网讯科技有限公司 语音合成方法及模型训练方法、装置、设备及存储介质
CN112509552A (zh) * 2020-11-27 2021-03-16 北京百度网讯科技有限公司 语音合成方法、装置、电子设备和存储介质
CN113409764A (zh) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 一种语音合成方法、装置和用于语音合成的装置

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117912446A (zh) * 2023-12-27 2024-04-19 暗物质(北京)智能科技有限公司 一种音色和风格深度解耦的语音风格迁移系统及方法
CN117727288A (zh) * 2024-02-07 2024-03-19 翌东寰球(深圳)数字科技有限公司 一种语音合成方法、装置、设备及存储介质
CN117727288B (zh) * 2024-02-07 2024-04-30 翌东寰球(深圳)数字科技有限公司 一种语音合成方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN115938338A (zh) 2023-04-07

Similar Documents

Publication Publication Date Title
WO2023045954A1 (fr) Procédé et appareil de synthèse de la parole, dispositif électronique et support de stockage lisible
CN106898340B (zh) 一种歌曲的合成方法及终端
US9552807B2 (en) Method, apparatus and system for regenerating voice intonation in automatically dubbed videos
CN108831437B (zh) 一种歌声生成方法、装置、终端和存储介质
WO2020113733A1 (fr) Procédé et appareil de génération d'animation, dispositif électronique, et support d'informations lisible par ordinateur
KR20210048441A (ko) 디지털 비디오에서의 입 모양과 움직임을 대체 오디오에 매칭
WO2020098115A1 (fr) Procédé d'ajout de sous-titres, appareil, dispositif électronique et support de stockage lisible par ordinateur
CN108780643A (zh) 自动配音方法和装置
CN110675886A (zh) 音频信号处理方法、装置、电子设备及存储介质
US11462207B1 (en) Method and apparatus for editing audio, electronic device and storage medium
CN114173067B (zh) 一种视频生成方法、装置、设备及存储介质
WO2022126904A1 (fr) Procédé et appareil de conversion vocale, dispositif informatique et support de stockage
CN110599998A (zh) 一种语音数据生成方法及装置
CN113012678A (zh) 一种免标注的特定说话人语音合成方法及装置
CN116013274A (zh) 语音识别的方法、装置、计算机设备和存储介质
CN117496944B (zh) 一种多情感多说话人语音合成方法和系统
WO2024146338A1 (fr) Procédé et appareil de génération de vidéo, dispositif électronique et support de stockage
CN112580669B (zh) 一种对语音信息的训练方法及装置
CN113870833A (zh) 语音合成相关系统、方法、装置及设备
Zhang et al. From Speaker to Dubber: Movie Dubbing with Prosody and Duration Consistency Learning
CN116844562A (zh) 一种基于深度学习的短视频背景音乐剪辑方法
JP2020173776A (ja) 映像を生成するための方法および装置
CN109300472A (zh) 一种语音识别方法、装置、设备及介质
US20240274120A1 (en) Speech synthesis method and apparatus, electronic device, and readable storage medium
CN113990295A (zh) 一种视频生成方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22872000

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 10.07.2024)