WO2023045954A1 - Speech synthesis method and apparatus, electronic device, and readable storage medium - Google Patents


Info

Publication number: WO2023045954A1
Authority: WIPO (PCT)
Prior art keywords: text, model, processed, feature, audio
Application number: PCT/CN2022/120120
Other languages: French (fr), Chinese (zh)
Inventors: 代东洋, 黄雷, 陈彦洁, 李鑫, 陈远哲, 王玉平
Original Assignee: 北京字跳网络技术有限公司
Application filed by 北京字跳网络技术有限公司
Publication of WO2023045954A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 — Speech synthesis; Text to speech systems
    • G10L 13/02 — Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 — Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/04 — Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 — Changing voice quality, e.g. pitch or formants
    • G10L 21/007 — Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L 21/013 — Adapting to target pitch
    • G10L 25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18 — Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band

Definitions

  • the present disclosure relates to the technical field of artificial intelligence, and in particular to a speech synthesis method, device, electronic equipment and readable storage medium.
  • the present disclosure provides a speech synthesis method, device, electronic equipment and readable storage medium.
  • the present disclosure provides a speech synthesis method, including:
  • the text to be processed is input into the speech synthesis model, and the spectral features corresponding to the text to be processed output by the speech synthesis model are obtained;
  • the speech synthesis model includes: a prosody sub-model and a timbre sub-model, and the prosody sub-model is used to output the first acoustic feature corresponding to the text to be processed according to the input text to be processed, and the first acoustic feature includes a bottleneck feature for characterizing the target rap style;
  • the timbre sub-model is used to output the spectral feature corresponding to the text to be processed according to the input first acoustic feature, and the spectral feature corresponding to the text to be processed includes the spectral feature used to characterize the target timbre;
  • according to the spectral feature corresponding to the text to be processed, the target audio corresponding to the text to be processed is acquired, and the target audio has the target timbre and the target rap style.
  • the prosody sub-model is obtained by training according to the labeled text corresponding to the first sample audio and the second acoustic feature corresponding to the first sample audio;
  • the first sample audio includes at least one audio of the target rap style; the second acoustic feature includes a first labeled bottleneck feature corresponding to the first sample audio.
  • the timbre sub-model is obtained by training according to the third acoustic feature corresponding to the second sample audio, the first labeled spectral feature corresponding to the second sample audio, the fourth acoustic feature corresponding to the third sample audio, and the second labeled spectral feature corresponding to the third sample audio;
  • the third acoustic feature includes the second labeled bottleneck feature corresponding to the second sample audio; the third sample audio includes at least one audio with the target timbre, and the fourth acoustic feature corresponding to the third sample audio includes a third labeled bottleneck feature corresponding to the third sample audio.
  • the first labeled bottleneck feature corresponding to the first sample audio, the second labeled bottleneck feature corresponding to the second sample audio, and the third labeled bottleneck feature corresponding to the third sample audio are obtained by an encoder of an end-to-end speech recognition model performing bottleneck feature extraction on the input first sample audio, second sample audio, and third sample audio, respectively.
  • the second acoustic feature further includes: a first labeled fundamental frequency feature corresponding to the first sample audio;
  • the third acoustic feature also includes: a second labeled fundamental frequency feature corresponding to the second sample audio; the fourth acoustic feature further includes: a third labeled fundamental frequency feature corresponding to the third sample audio;
  • the first acoustic feature further includes: a fundamental frequency feature corresponding to the text to be processed.
  • the method also includes: adding the target audio corresponding to the text to be processed to the target multimedia content.
  • the present disclosure provides a speech synthesis device, including:
  • an obtaining module, configured to obtain the text to be processed;
  • a processing module configured to input the text to be processed into a speech synthesis model, and obtain spectral features corresponding to the text to be processed output by the speech synthesis model;
  • the speech synthesis model includes: a prosody sub-model and a timbre sub-model, the prosody sub-model is used to output the first acoustic feature corresponding to the text to be processed according to the input text to be processed, and the first acoustic feature includes a bottleneck feature for characterizing the target rap style;
  • the timbre sub-model is used to output the spectral features corresponding to the text to be processed according to the input first acoustic feature, and the spectral features corresponding to the text to be processed include spectral features used to characterize the target timbre;
  • the processing module is configured to acquire target audio corresponding to the text to be processed according to the frequency spectrum feature corresponding to the text to be processed, the target audio having the target timbre and the target rap style.
  • the present disclosure provides an electronic device, including: a memory, a processor, and a computer program;
  • said memory is configured to store said computer program
  • the processor is configured to execute the computer program to implement the speech synthesis method according to any one of the first aspect.
  • the present disclosure provides a readable storage medium, including: a computer program; when the computer program is executed, the speech synthesis method according to any one of the first aspect can be realized.
  • the present disclosure provides a program product, the program product including: a computer program; the computer program is stored in a readable storage medium, an electronic device acquires the computer program from the readable storage medium, and at least one processor of the electronic device executes the computer program to implement the speech synthesis method according to any one of the first aspect.
  • the present disclosure provides a speech synthesis method, device, electronic equipment and readable storage medium, wherein the text to be processed is analyzed based on the speech synthesis model and the spectral features corresponding to the text to be processed are output; the speech synthesis model includes a prosody sub-model and a timbre sub-model;
  • the prosody sub-model is used to receive the text to be processed as input and output the first acoustic feature corresponding to the text to be processed, wherein the first acoustic feature includes a bottleneck feature for characterizing the target rap style;
  • the timbre sub-model receives the first acoustic feature as input and outputs the spectral feature corresponding to the text to be processed, and the spectral feature includes the spectral feature used to characterize the target timbre; by converting the spectral feature output by the speech synthesis model, rap audio with the target rap style and the target timbre can be obtained, meeting the user's personalized needs for synthesized audio; and the speech synthesis model supports the conversion of any text to be processed, which reduces the requirements on the user's music creation ability and is conducive to improving the user's enthusiasm for creating multimedia content.
  • FIG. 1a to 1c are structural schematic diagrams of a speech synthesis model provided by an embodiment of the present disclosure
  • FIG. 2 is a flowchart of a speech synthesis method provided by an embodiment of the present disclosure
  • FIG. 3 is a flowchart of a speech synthesis method provided by another embodiment of the present disclosure.
  • FIG. 4 is a schematic structural diagram of a speech synthesis device provided by an embodiment of the present disclosure.
  • Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • the present disclosure provides a speech synthesis method, device, electronic equipment, readable storage medium, and program product, wherein the method converts text into audio with the target rap style and the target timbre through a pre-trained speech synthesis model, and the speech synthesis model can control the target rap style and the timbre relatively independently during speech synthesis, so as to meet the user's demand for personalized speech synthesis.
  • the target rap style mentioned in the present disclosure may include any type of rap style, and the present disclosure does not limit the specific rap style of the target rap style.
  • the target rap style may be any rap style among popular rap, alternative rap, comedy rap, jazz rap, and hip-hop rap.
  • the speech synthesis method provided by the present disclosure can be executed by electronic equipment.
  • the electronic device can be a tablet computer, a mobile phone (such as a folding-screen mobile phone or a large-screen mobile phone), a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), a smart TV, a smart screen, a high-definition TV, a 4K TV, a smart speaker, a smart projector, or another Internet of Things (IoT) device; this disclosure does not place any restrictions on the specific type of the electronic device.
  • the electronic device that trains and obtains the speech synthesis model and the electronic device that uses the speech synthesis model to execute the speech synthesis service may be different electronic devices or the same electronic device, which is not limited in the present disclosure.
  • For example, the speech synthesis model is obtained through training by a server device, and the server device sends the trained speech synthesis model to a terminal device/server device, which executes the speech synthesis service according to the speech synthesis model; for another example, the speech synthesis model is trained by a server device, the trained speech synthesis model is then deployed on the server device, and the server device invokes the speech synthesis model to process the speech synthesis service.
  • the present disclosure does not limit this, and it can be set flexibly in practical applications.
  • the speech synthesis model in this solution is decoupled into two sub-models by introducing acoustic features that include bottleneck features, namely the prosody sub-model and the timbre sub-model, wherein the prosody sub-model is used to establish a deep mapping between text and the acoustic features including bottleneck features, and the timbre sub-model is used to establish a deep mapping between the acoustic features including bottleneck features and the spectral features.
  • the two decoupled feature extraction sub-models can be trained using different sample audio.
  • the prosody sub-model is used to establish a deep mapping between the text sequence and the acoustic features containing bottleneck features.
  • the prosody sub-model needs to use high-quality first sample audio with the target rap style, together with the annotated text corresponding to the first sample audio, as the sample data for training the prosody sub-model.
  • the timbre sub-model is used to establish the depth mapping between the acoustic features including bottleneck features and the spectral features.
  • the timbre sub-model can be trained using second sample audio whose corresponding text has not been annotated; since there is no need to label the text corresponding to the second sample audio, this can greatly reduce the cost of acquiring the second sample audio.
  • the acoustic features output by the prosody sub-model include the bottleneck features used to characterize the target rap style, realizing rap-style control over speech synthesis.
  • the acoustic features output by the prosody sub-model may also include fundamental frequency features used to characterize pitch, realizing pitch control over speech synthesis.
  • the spectral features corresponding to the text output by the timbre sub-model include the spectral features used to characterize the target timbre, realizing timbre control over speech synthesis.
  • the spectral features output by the timbre sub-model also include the spectral features used to represent the target rap style, and the spectral features representing the target timbre and the spectral features representing the target rap style are the same spectral features. If the acoustic features output by the prosody sub-model also include fundamental frequency features, the spectral features output by the timbre sub-model also include spectral features representing the corresponding fundamental frequency, and the spectral features representing the target timbre, the spectral features representing the target rap style, and the spectral features representing the fundamental frequency are the same spectral features.
  • the speech synthesis model can be trained with a relatively small amount of third sample audio with the target timbre, so that the final speech synthesis model can synthesize audio with the target timbre; even if the quality of the third sample audio is not high, for example if the pronunciation is non-standard or the speech is not fluent, the speech synthesis model can still stably synthesize audio with the target timbre.
  • Since the timbre sub-model has already been trained on the second sample audio, it already has a strong ability to control timbre in speech synthesis; therefore, even if the timbre sub-model learns from only a small amount of third sample audio, it can still master the target timbre well.
  • FIG. 1a shows the overall frame diagram of the training and acquisition of the speech synthesis model
  • Fig. 1b and Fig. 1c respectively exemplarily show the structural diagrams of the prosody sub-model and the timbre sub-model included in the speech synthesis model.
  • the speech synthesis model 100 includes: a prosody sub-model 101 and a timbre sub-model 102 .
  • the process of training the speech synthesis model 100 includes the process of training the prosody sub-model 101 and the process of training the timbre sub-model 102 .
  • the prosody sub-model 101 is trained according to the labeled text corresponding to the first sample audio and the labeled acoustic features (hereinafter the labeled acoustic features corresponding to the first sample audio are referred to as the second acoustic feature); by learning the relationship between the labeled text corresponding to the first sample audio and the second acoustic feature, the prosody sub-model 101 obtains the ability to establish a deep mapping between text and acoustic features that include bottleneck features.
  • the aforementioned marked text may specifically be a text sequence.
  • the prosody sub-model 101 is specifically used to analyze the labeled text corresponding to the input first sample audio, model the intermediate feature sequence, perform feature conversion and dimensionality reduction on the intermediate feature sequence, and output the fifth acoustic features.
  • the loss function information of the current round of training is calculated, and the coefficient values of the parameters included in the prosody sub-model 101 are adjusted according to the loss function information of the current round of training.
  • By iterating in this way over the labeled text corresponding to the first sample audios and the second acoustic features (including the first labeled bottleneck features) corresponding to the first sample audios, the first feature extraction model 101 satisfying the corresponding convergence condition is finally obtained.
  • the second acoustic feature corresponding to the first audio sample can be understood as the learning objective of the prosody sub-model 101 .
  • the first sample audio can be a high-quality audio file (high-quality audio can also be understood as clean audio), and the annotation text corresponding to the first sample audio can include one or more characters or one or more phonemes corresponding to the first sample audio, which is not limited in this disclosure.
  • the first audio sample can be obtained by recording and cleaning multiple times according to actual needs, or can also be obtained by filtering from an audio database and cleaning multiple times. The present disclosure does not limit the acquisition method of the first sample audio.
  • the annotation text corresponding to the first audio sample may also be obtained through repeated annotation and correction, so as to ensure the accuracy of the annotation text.
  • the first sample audio mentioned in this disclosure is audio with the target rap style. This disclosure does not limit the duration, file format, quantity and other parameters of the first sample audio, and the first sample audio can be pieces of music sung by the same singer or by different singers.
  • the fifth acoustic feature corresponding to the labeled text can be understood as the predicted acoustic feature corresponding to the labeled text output by the prosodic sub-model 101, and the fifth acoustic feature corresponding to the labeled text can also be understood as the fifth acoustic feature corresponding to the first sample audio .
  • the second acoustic feature includes: a first labeled bottleneck feature corresponding to the first audio sample.
  • the bottleneck feature comes from a nonlinear feature transformation that is also an effective dimensionality reduction technique.
  • the bottleneck feature may include information of dimensions such as prosody and content.
  • the first labeled bottleneck feature corresponding to the first sample audio may be obtained by the encoder of an end-to-end automatic speech recognition (ASR) model, hereinafter referred to as the ASR model.
  • the first sample audio can be input to the ASR model 104, and the first labeled bottleneck feature corresponding to the first sample audio output by the encoder of the ASR model 104 is obtained, wherein the encoder of the ASR model 104 serves as a pre-trained bottleneck feature extractor and can be used in this solution to prepare sample data.
  • the ASR model 104 may also include other modules.
  • the ASR model 104 also includes a decoder (decoder) and an attention network (attention network).
  • Using the encoder of the ASR model 104 is only an example, and does not limit the manner of obtaining the first labeled bottleneck feature corresponding to the first sample audio; in practical applications, it can also be obtained in other ways, which is not limited in the present disclosure.
  • If a database stores the first sample audio and the first labeled bottleneck feature corresponding to the first sample audio, the electronic device may also acquire the first sample audio and the first labeled bottleneck feature from the database.
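  • The following is an illustrative sketch, not taken from this disclosure, of how frame-level bottleneck features might be extracted from sample audio with the encoder of a pretrained end-to-end ASR model; `load_pretrained_asr_encoder` is a hypothetical helper, and any encoder that maps a waveform to a compact hidden-feature sequence could play this role.

```python
# Illustrative sketch, assuming a pretrained ASR encoder is available.
import torch
import torchaudio

def extract_bottleneck_features(wav_path: str, encoder: torch.nn.Module) -> torch.Tensor:
    waveform, sample_rate = torchaudio.load(wav_path)    # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)         # mix down to mono
    with torch.no_grad():
        # Encoder output: (1, num_frames, bottleneck_dim), used as the
        # labeled bottleneck feature paired with this sample audio.
        features = encoder(waveform)
    return features.squeeze(0)

# encoder = load_pretrained_asr_encoder()   # hypothetical pretrained extractor
# bn_feat = extract_bottleneck_features("first_sample_audio.wav", encoder)
```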
  • the second acoustic feature corresponding to the first sample audio includes: a first labeled bottleneck feature corresponding to the first sample audio and a first labeled fundamental frequency feature corresponding to the first sample audio.
  • For the first labeled bottleneck feature, refer to the detailed description of the foregoing examples; for the sake of brevity, details are not repeated here.
  • pitch represents the human ear's subjective perception of how high or low a sound is.
  • the pitch mainly depends on the fundamental frequency of the sound. The higher the fundamental frequency, the higher the pitch, and the lower the fundamental frequency, the lower the pitch.
  • pitch is also one of the important factors affecting the effect of speech synthesis.
  • this solution introduces the fundamental frequency feature alongside the bottleneck feature, so that the final prosody sub-model 101 can output the corresponding bottleneck feature and fundamental frequency feature according to the input text.
  • the prosody sub-model 101 is specifically used to analyze the labeled text corresponding to the input first sample audio, model the intermediate feature sequence, perform feature conversion and dimensionality reduction on the intermediate feature sequence, and output the fifth acoustic features.
  • the fifth acoustic feature corresponding to the tagged text may be understood as the predicted acoustic feature corresponding to the tagged text output by the prosody sub-model 101 .
  • the fifth acoustic feature corresponding to the marked text may also be understood as the fifth acoustic feature corresponding to the first audio sample.
  • When the second acoustic feature corresponding to the first sample audio includes the first labeled bottleneck feature and the first labeled fundamental frequency feature, then during the training process the fifth acoustic feature corresponding to the first sample audio output by the prosody sub-model 101 also includes: a predicted bottleneck feature and a predicted fundamental frequency feature corresponding to the first sample audio.
  • the loss function information of the current round of training is calculated, and the coefficient values of the parameters included in the prosody sub-model 101 are adjusted according to the loss function information.
  • By iterating over the labeled text corresponding to the first sample audio and the second acoustic feature corresponding to the first sample audio (including the first labeled bottleneck feature and the first labeled fundamental frequency feature), the first feature extraction model 101 satisfying the corresponding convergence condition is obtained.
  • the second acoustic feature corresponding to the first audio sample can be understood as the learning objective of the prosody sub-model 101 .
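  • As a rough illustration only, one training round of the prosody sub-model might look like the sketch below, where the hypothetical `prosody_model` maps a token sequence to predicted bottleneck and fundamental-frequency sequences and the labeled features of the first sample audio serve as the learning objective; the choice of an L1 regression loss is an assumption, since the disclosure only states that loss function information is computed per round.

```python
import torch
import torch.nn.functional as F

def prosody_training_step(prosody_model, optimizer, text_ids,
                          labeled_bottleneck, labeled_f0):
    # Predicted (fifth) acoustic features for the labeled text: the
    # hypothetical model returns a bottleneck sequence and an F0 sequence.
    pred_bottleneck, pred_f0 = prosody_model(text_ids)
    # Assumed regression loss against the labeled (second) acoustic features.
    loss = F.l1_loss(pred_bottleneck, labeled_bottleneck) + \
           F.l1_loss(pred_f0, labeled_f0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # adjust the coefficient values of the model parameters
    return loss.item()
```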
  • the first labeled fundamental frequency feature corresponding to the first sample audio can be obtained by analyzing the first sample audio by a digital signal processing (DSP) method.
  • digital signal processing may be performed on the first sample audio by the digital signal processor 105 to obtain the first labeled fundamental frequency feature corresponding to the first sample audio.
  • the specific implementation manner of the digital signal processor 105 is not limited, as long as it can extract the first marked fundamental frequency feature corresponding to the input first sample audio.
  • the first marked fundamental frequency feature corresponding to the first sample audio is not limited to be obtained by digital signal processing, and the present disclosure does not limit the implementation manner of obtaining the first marked fundamental frequency feature.
  • some databases store the first sample audio and the first labeled fundamental frequency feature corresponding to the first sample audio, and the first sample audio and the first labeled fundamental frequency feature may also be acquired from the database.
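  • As an illustration of such digital signal processing, the sketch below extracts a labeled fundamental-frequency (F0) contour from sample audio using librosa's pYIN estimator; this is only one possible implementation, since the disclosure does not prescribe a specific algorithm.

```python
import librosa
import numpy as np

def extract_f0(wav_path: str, hop_length: int = 256) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=None)
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),   # ~65 Hz lower bound (assumed range)
        fmax=librosa.note_to_hz("C6"),   # ~1047 Hz upper bound
        sr=sr,
        hop_length=hop_length,
    )
    return np.nan_to_num(f0)             # unvoiced frames -> 0
```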
  • the convergence condition corresponding to the prosody sub-model may include, but is not limited to, evaluation indicators such as the number of iterations and loss threshold.
  • the present disclosure does not limit the convergence conditions corresponding to the training prosodic sub-models.
  • Depending on whether the electronic device performs training according to the first labeled bottleneck feature corresponding to the first sample audio alone, or according to both the first labeled bottleneck feature and the first labeled fundamental frequency feature corresponding to the first sample audio, the convergence conditions may differ.
  • Similarly, when the electronic device performs training according to the first labeled bottleneck feature corresponding to the first sample audio alone, or according to both the first labeled bottleneck feature and the first labeled fundamental frequency feature corresponding to the first sample audio, the loss functions corresponding to the pre-built prosody sub-model can be the same or different.
  • the present disclosure does not limit the implementation manner of the loss function corresponding to the pre-built prosody sub-model.
  • the network structure of the prosodic sub-model is exemplarily shown below.
  • FIG. 1 b exemplarily shows an implementation of the prosodic sub-model 101 .
  • the prosodic sub-model 101 may include: a text encoding network (text encoder) 1011 , an attention network (attention) 1012 and a decoding network (decoder) 1013 .
  • the text coding network 1011 is used to receive text as input, analyze the context and time sequence relationship of the input text, and model an intermediate feature sequence, which contains context information and time sequence relationship.
  • the decoding network 1013 can adopt an autoregressive network structure, by using the output of the previous time step as the input of the next time step.
  • the attention network 1012 is mainly used to output attention coefficients.
  • the attention coefficient and the intermediate feature sequence output by the text encoding network 1011 are weighted and averaged to obtain a weighted average result, which is used as another conditional input for each time step of the decoding network 1013 .
  • the decoding network 1013 outputs the predicted acoustic features corresponding to the text by performing feature conversion on the input (ie, the weighted average result and the output of the previous time step).
  • the predicted acoustic features corresponding to the text output by the decoding network 1013 may include: predicted bottleneck features corresponding to the text; or, the predicted acoustic features corresponding to the text output by the decoding network 1013 may include: predicted bottleneck features corresponding to the text The predicted fundamental frequency features corresponding to the text.
  • the initial values of the coefficients of the parameters included in the prosody sub-model 101 may be randomly generated, preset, or determined in other ways, which is not limited in the present disclosure.
  • the prosody sub-model 101 is iteratively trained through the labeled texts corresponding to the plurality of first sample audios and the second acoustic features respectively corresponding to the first sample audios, and the coefficient values of the parameters included in the prosody sub-model 101 are continuously optimized until the convergence condition of the prosody sub-model 101 is met, at which point the training of the prosody sub-model 101 is stopped.
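  • As a rough, non-authoritative illustration of the structure in Fig. 1b, the sketch below wires together a text encoding network, an attention network, and an autoregressive decoding network that emits a bottleneck frame (optionally with a fundamental frequency value) per time step; the layer types, dimensions, and fixed number of decoding steps are assumptions rather than details stated in the disclosure.

```python
import torch
import torch.nn as nn

class ProsodySubModel(nn.Module):
    def __init__(self, vocab_size=100, emb_dim=256, hidden_dim=256,
                 bottleneck_dim=64, use_f0=True):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Text encoding network: models context and temporal relationships
        # of the input text as an intermediate feature sequence.
        self.text_encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                                    bidirectional=True)
        # Attention network: outputs attention coefficients over encoder states.
        self.attention = nn.MultiheadAttention(2 * hidden_dim, num_heads=4,
                                               batch_first=True)
        self.query_proj = nn.Linear(hidden_dim, 2 * hidden_dim)
        # Decoding network: autoregressive, feeding back the previous step's output.
        self.out_dim = bottleneck_dim + (1 if use_f0 else 0)
        self.decoder_cell = nn.LSTMCell(2 * hidden_dim + self.out_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, self.out_dim)

    def forward(self, text_ids, num_frames):
        enc, _ = self.text_encoder(self.embedding(text_ids))  # (B, T, 2H)
        batch = text_ids.size(0)
        h = enc.new_zeros(batch, self.decoder_cell.hidden_size)
        c = torch.zeros_like(h)
        prev = enc.new_zeros(batch, self.out_dim)
        outputs = []
        for _ in range(num_frames):
            # Weighted average of the intermediate feature sequence, used as an
            # additional conditional input for each decoding time step.
            ctx, _ = self.attention(self.query_proj(h).unsqueeze(1), enc, enc)
            h, c = self.decoder_cell(torch.cat([ctx.squeeze(1), prev], dim=-1),
                                     (h, c))
            prev = self.proj(h)             # bottleneck (+ F0) frame prediction
            outputs.append(prev)
        return torch.stack(outputs, dim=1)  # first acoustic feature: (B, frames, out_dim)
```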
  • Training the timbre sub-model 102 includes two stages: the first stage is to train the timbre sub-model based on the second sample audio to obtain an intermediate model; the second stage is to fine-tune the intermediate model based on the third sample audio to obtain the final timbre sub-model.
  • the present disclosure does not limit the timbre of the second sample audio; in addition, the third sample audio is a sample audio with a target timbre.
  • the spectral features output by the above-mentioned timbre sub-model may be Mel spectral features, or other types of spectral features.
  • For example, the first labeled spectral feature corresponding to the second sample audio input to the timbre sub-model is a first labeled Mel spectral feature, and the second labeled spectral feature corresponding to the third sample audio is a second labeled Mel spectral feature.
  • In the following, the Mel spectral feature is taken as an example, and the predicted spectral feature output by the timbre sub-model is referred to as the predicted Mel spectral feature.
  • The first stage:
  • the timbre sub-model 102 is used to perform iterative training according to the second sample audio to obtain an intermediate model.
  • the timbre sub-model 102 learns the mapping relationship between the third acoustic feature corresponding to the second sample audio and the first labeled mel spectrum feature of the second sample audio, and obtains an intermediate model with certain speech synthesis control capabilities for timbre, wherein , the first marked Mel spectral feature includes: a spectral feature used to characterize the timbre of the corresponding second sample audio.
  • the present disclosure does not limit parameters of the second sample audio such as its timbre, duration, storage format, and quantity.
  • the second sample audio may include the audio of the specific target tone, or may include the audio of the non-target tone, or the second sample audio may include both the audio of the target tone and the audio of the non-target tone.
  • During training, the timbre sub-model 102 is used to analyze the third acoustic feature corresponding to the input second sample audio and output the predicted Mel spectrum feature corresponding to the second sample audio; then, based on the first labeled Mel spectrum feature corresponding to the second sample audio and the predicted Mel spectrum feature corresponding to the second sample audio, the coefficient values of the parameters included in the timbre sub-model 102 are adjusted; iterative training continues in this way to obtain an intermediate model.
  • the first marked Mel spectrum feature can be understood as the learning goal of the timbre sub-model 102 in the first stage.
  • the second sample audio does not need to be labeled with corresponding text, which can greatly reduce the time and labor cost of obtaining the second sample audio. Moreover, a large amount of audio can be obtained at a relatively low cost as second sample audio for iterative training of the timbre sub-model 102; by training the timbre sub-model 102 with a large amount of second sample audio, the intermediate model obtains a strong speech synthesis control capability for timbre.
  • In the second stage, the intermediate model is trained based on the third sample audio, so that the intermediate model learns the target timbre and obtains the speech synthesis control ability for the target timbre.
  • Because the intermediate model already has a strong ability to control timbre in speech synthesis, the requirements on the third sample audio, such as its duration and quality, are reduced; even if the duration of the third sample audio is short or its pronunciation is not clear, the final timbre sub-model 102 obtained through training can still achieve a strong speech synthesis control ability for the target timbre.
  • the third sample audio has the target timbre.
  • the third sample audio may be audio recorded by a user, or may be audio with the desired timbre uploaded by a user, and the disclosure does not limit the source and acquisition method of the third sample audio.
  • the fourth acoustic feature corresponding to the third sample audio is input to the intermediate model, and the predicted Mel spectrum feature corresponding to the third sample audio output by the intermediate model is obtained; then, based on the second labeled Mel spectrum feature corresponding to the third sample audio and the predicted Mel spectrum feature corresponding to the third sample audio, the loss function information corresponding to the current round of training is calculated; the coefficient values of the parameters included in the intermediate model are adjusted according to the loss function information, so as to obtain the final timbre sub-model 102.
  • the second labeled mel spectrum feature corresponding to the third audio sample can be understood as the learning target of the intermediate model.
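  • A compact sketch of this two-stage schedule, under assumed optimizer settings, might look as follows: stage one pre-trains the timbre sub-model on (acoustic feature, labeled Mel spectrum) pairs derived from the second sample audio to obtain the intermediate model, and stage two fine-tunes the intermediate model on the smaller set of third sample audio carrying the target timbre.

```python
import torch
import torch.nn.functional as F

def train_epoch(model, loader, optimizer):
    for acoustic_feat, labeled_mel in loader:
        pred_mel = model(acoustic_feat)
        loss = F.l1_loss(pred_mel, labeled_mel)   # assumed spectral regression loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def train_timbre_submodel(model, stage1_loader, stage2_loader,
                          stage1_epochs=100, stage2_epochs=20):
    # Stage 1: large amount of second sample audio (no text labels needed).
    opt1 = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(stage1_epochs):
        train_epoch(model, stage1_loader, opt1)
    # Stage 2: fine-tune the intermediate model on third sample audio with the
    # target timbre, typically with a smaller learning rate (an assumption).
    opt2 = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(stage2_epochs):
        train_epoch(model, stage2_loader, opt2)
    return model
```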
  • If the fifth acoustic feature output by the prosody sub-model 101 according to the input labeled text of the first sample audio includes the predicted bottleneck feature, that is, the prosody sub-model 101 realizes the mapping from text to bottleneck features, then the third acoustic feature corresponding to the second sample audio input into the timbre sub-model 102 includes the second labeled bottleneck feature corresponding to the second sample audio, and the fourth acoustic feature corresponding to the third sample audio includes a third labeled bottleneck feature corresponding to the third sample audio.
  • the second labeled bottleneck feature and the third labeled bottleneck feature can be obtained by the encoder of the ASR model extracting bottleneck features from the second sample audio and the third sample audio respectively, which is similar to the implementation of obtaining the first labeled bottleneck feature; for the sake of brevity, details are not repeated here.
  • If the output fifth acoustic feature includes the predicted bottleneck feature and the predicted fundamental frequency feature, that is, the prosody sub-model 101 realizes the mapping from text to bottleneck features and fundamental frequency features, then the third acoustic feature corresponding to the second sample audio input into the timbre sub-model 102 includes the second labeled bottleneck feature and the second labeled fundamental frequency feature corresponding to the second sample audio, and the fourth acoustic feature corresponding to the third sample audio includes a third labeled bottleneck feature and a third labeled fundamental frequency feature corresponding to the third sample audio.
  • the second labeled bottleneck feature and the third labeled bottleneck feature can be obtained by extracting the bottleneck feature of the second sample audio and the third sample audio respectively by the encoder of the ASR model, which is similar to the implementation method of obtaining the first labeled bottleneck feature;
  • the second labeled fundamental frequency feature and the third labeled fundamental frequency feature can be obtained by analyzing the second sample audio and the third sample audio respectively through digital signal processing, which is similar to the implementation of obtaining the first labeled fundamental frequency feature; for the sake of brevity, details are not repeated here.
  • the input of the timbre sub-model 102 is consistent with the output of the prosody sub-model 101 .
  • the initial values of the coefficients corresponding to the parameters included in the timbre sub-model 102 may be preset or initialized randomly, which is not limited in the present disclosure.
  • the loss functions used for the timbre sub-model in the two training stages may be the same or different, which is not limited in the present disclosure.
  • FIG. 1 c exemplarily shows an implementation manner of the timbre sub-model 102 .
  • the timbre sub-model 102 can be implemented using a self-attention network structure.
  • the timbre sub-model 102 includes: a convolutional network 1021 and one or more residual networks 1022 .
  • each residual network 1022 includes: a self-attention network 1022a and a linear network 1022b.
  • the convolution network 1021 is mainly used to perform convolution processing on the acoustic features corresponding to the input sample audio, and to model local feature information.
  • the convolutional network 1021 may include one or more convolutional layers, and this disclosure does not limit the number of convolutional layers included in the convolutional network 1021 .
  • the convolutional network 1021 inputs the local feature information to the connected residual network 1022 .
  • the local feature information is converted into spectral features (such as Mel spectral features) after passing through the one or more residual networks 1022.
  • the structure of the intermediate model is the same as that of the timbre sub-model 102 shown in Fig. 1c; the difference is that the weight coefficients of the included parameters are not completely the same.
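  • The sketch below illustrates one possible reading of the structure in Fig. 1c: a convolutional network for local feature information, followed by residual blocks that each contain a self-attention network and a linear network, and a final projection to Mel spectral frames; all dimensions and block counts are assumptions.

```python
import torch
import torch.nn as nn

class ResidualSelfAttentionBlock(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.linear = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):                      # x: (B, frames, dim)
        attn_out, _ = self.self_attention(x, x, x)
        x = x + attn_out                       # residual connection
        return x + self.linear(x)              # residual connection

class TimbreSubModel(nn.Module):
    def __init__(self, in_dim=65, dim=256, n_blocks=4, n_mels=80):
        super().__init__()
        # Convolutional network: models local feature information.
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.blocks = nn.ModuleList(ResidualSelfAttentionBlock(dim) for _ in range(n_blocks))
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, acoustic_feat):          # (B, frames, in_dim): bottleneck (+ F0)
        x = self.conv(acoustic_feat.transpose(1, 2)).transpose(1, 2)
        for block in self.blocks:
            x = block(x)
        return self.to_mel(x)                  # predicted Mel spectral features
```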
  • the first feature extraction model (the prosody sub-model) and the second feature extraction model (the timbre sub-model) that meet the requirements of speech synthesis are finally obtained; the two models are spliced to obtain a speech synthesis model capable of synthesizing audio with the target timbre.
  • the speech synthesis model 100 may further include: a vocoder (vocoder) 103 .
  • the vocoder 103 is used to convert the spectral features (such as Mel spectral features) output by the timbre sub-model 102 into audio.
  • the vocoder can also be used as an independent module that is not bound together with the speech synthesis model, and this solution does not limit the specific type of the vocoder.
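  • Because the vocoder type is not limited, the conversion from Mel spectral features to a waveform can be illustrated with a simple signal-processing stand-in such as Griffin-Lim, as sketched below; a neural vocoder would normally be preferred in practice.

```python
import librosa
import numpy as np

def mel_to_wav(mel: np.ndarray, sr: int = 22050, n_fft: int = 1024,
               hop_length: int = 256) -> np.ndarray:
    # mel: (n_mels, frames), power Mel spectrogram; parameters are assumptions.
    wav = librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length)
    return wav

# The returned waveform could then be written to disk, e.g. with soundfile.
```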
  • the target speech synthesis model finally obtained through training has the ability to stably synthesize the audio of the target timbre. Based on this, the target speech synthesis model can be used to process corresponding speech synthesis services.
  • Fig. 2 is a flowchart of a speech synthesis method provided by an embodiment of the present disclosure. As shown in Figure 2, the speech synthesis method provided by this embodiment includes:
  • the text to be processed may include one or more characters, or the text to be processed may also include one or more phonemes.
  • the text to be processed is used to synthesize audio with the target rap style and target timbre.
  • the present disclosure does not limit the manner in which the electronic device obtains the text to be processed.
  • the electronic device can display a text input window and a soft keyboard to the user, and the user can enter the text to be processed into the text input window by operating the soft keyboard displayed on the electronic device; or the user can also copy and paste the text to be processed into the text input window; or the user can also input a piece of audio to the electronic device by voice, and the electronic device obtains the text to be processed by performing speech recognition on the audio input by the user; or a file containing the text to be processed can also be imported to the electronic device, so that the electronic device obtains the text to be processed.
  • the user may, but is not limited to, input the text to be processed into the electronic device by means of the above examples.
  • the operation is simple and convenient, and the user's enthusiasm for creating multimedia content can be enhanced.
  • In one possible implementation, the text to be processed is input into the speech synthesis model; the prosody sub-model outputs the first acoustic feature corresponding to the text to be processed by performing feature extraction on the text to be processed, and the first acoustic feature includes the bottleneck feature corresponding to the text to be processed, wherein the bottleneck feature included in the first acoustic feature is used to characterize the target rap style; the timbre sub-model receives the first acoustic feature corresponding to the text to be processed as input, and outputs the spectral feature corresponding to the text to be processed.
  • In another possible implementation, the text to be processed is input into the speech synthesis model, and the prosody sub-model outputs the first acoustic feature corresponding to the text to be processed by performing feature extraction on the text to be processed, and the first acoustic feature includes the bottleneck feature corresponding to the text to be processed and the fundamental frequency feature corresponding to the text to be processed, wherein the bottleneck feature included in the first acoustic feature is used to characterize the target rap style, and the fundamental frequency feature included in the first acoustic feature is used to characterize the pitch; the timbre sub-model receives the first acoustic feature corresponding to the text to be processed as input, and outputs the spectral feature (such as the Mel spectral feature) corresponding to the text to be processed.
  • the speech synthesis model can be obtained through the embodiment shown in Figures 1a to 1c; for the network structure of the speech synthesis model and the manner of training the speech synthesis model, reference may be made to the embodiment shown in the aforementioned Figures 1a to 1c, and for the sake of brevity, the detailed description is not repeated here.
  • the text encoding network included in the prosody sub-model can receive the text to be processed as input and model an intermediate feature sequence by analyzing the context and temporal relationships of the text to be processed; the attention coefficients output by the attention network included in the prosody sub-model are then weighted and averaged with the intermediate feature sequence to obtain a weighted average result; the decoding network included in the prosody sub-model performs feature conversion on its input (the weighted average result and the output of the previous time step) and outputs the first acoustic feature corresponding to the text to be processed, where the first acoustic feature may include the bottleneck feature corresponding to the text to be processed, or the bottleneck feature corresponding to the text to be processed and the fundamental frequency feature corresponding to the text to be processed.
  • the convolutional network included in the timbre sub-model receives the first acoustic feature corresponding to the text to be processed as input, performs convolution processing on the first acoustic feature corresponding to the text to be processed, and models Local feature information; the convolutional network inputs the local feature information into the connected residual network, and after passing through one or more residual networks, outputs the spectral features (such as Mel spectral features) corresponding to the text to be processed.
  • the electronic device may perform digital signal processing on the spectral features corresponding to the text to be processed based on the vocoder, so as to convert the spectral features corresponding to the text to be processed (such as the Mel spectrum feature corresponding to the text to be processed) into audio with the target timbre and the target rap style, i.e., the target audio.
  • In some cases, the vocoder can be used as a part of the speech synthesis model, and the speech synthesis model can directly output audio with the target timbre and the target rap style; in other cases, the vocoder can be used as an independent module that receives as input the spectral features corresponding to the text to be processed and converts them into audio with the target timbre and the target rap style.
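  • Putting the pieces together, an end-to-end inference pass might be sketched as follows, where `text_to_ids`, `prosody_model`, `timbre_model`, and `vocoder` are hypothetical stand-ins for the trained components and the fixed frame count is a simplification of how the decoder actually decides when to stop.

```python
import torch

@torch.no_grad()
def synthesize(text, text_to_ids, prosody_model, timbre_model, vocoder,
               num_frames=400):
    # All components here are hypothetical stand-ins for the trained sub-models.
    text_ids = text_to_ids(text).unsqueeze(0)               # (1, tokens)
    first_acoustic = prosody_model(text_ids, num_frames)    # bottleneck (+ F0) features
    mel = timbre_model(first_acoustic)                       # spectral features with target timbre
    wav = vocoder(mel)                                       # target audio: target rap style + timbre
    return wav
```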
  • In summary, the speech synthesis method analyzes the text to be processed based on the speech synthesis model and outputs the spectral features corresponding to the text to be processed, wherein the speech synthesis model includes a prosody sub-model and a timbre sub-model; the prosody sub-model receives the text to be processed as input and outputs the first acoustic feature corresponding to the text to be processed, wherein the first acoustic feature includes the bottleneck feature used to characterize the target rap style; the timbre sub-model receives the first acoustic feature as input and outputs the spectral features corresponding to the text to be processed, which include the information of the target timbre. By converting the spectral features output by the speech synthesis model, rap audio with the target rap style and the target timbre can be obtained, which meets the user's individual needs for audio; and the speech synthesis model supports the conversion of any text to be processed, which reduces the requirements on the user's music creation ability.
  • Fig. 3 is a schematic flowchart of a speech synthesis method provided by another embodiment of the present disclosure.
  • the speech synthesis method provided by this embodiment is based on the embodiment shown in Fig. 2; after step S203, that is, after obtaining the target audio corresponding to the text to be processed according to the spectral features corresponding to the text to be processed, the method may also include: adding the target audio to the target multimedia content.
  • the present disclosure does not limit the implementation manner of adding the target audio to the target multimedia content.
  • For example, when the electronic device adds the target audio to the target multimedia content, it can combine the duration of the target multimedia content and the duration of the target audio to speed up or slow down the playback speed of the target audio; it can also add subtitles corresponding to the target audio to the playback interface of the target multimedia content, or choose not to add them; if the subtitles corresponding to the target audio are added on the playback interface of the target multimedia content, display parameters such as the color, font size, and font of the subtitles can also be set.
  • the method provided in this embodiment analyzes the text to be processed based on the speech synthesis model and outputs the spectral features corresponding to the text to be processed, wherein the speech synthesis model includes a prosody sub-model and a timbre sub-model; the prosody sub-model receives the text to be processed as input and outputs the first acoustic feature corresponding to the text to be processed, wherein the first acoustic feature includes a bottleneck feature used to characterize the target rap style; the timbre sub-model receives the first acoustic feature as input and outputs the spectral features corresponding to the text to be processed, including spectral features used to characterize the target timbre; by converting the spectral features output by the speech synthesis model, audio with the target rap style and the target timbre can be obtained, which meets the user's individual needs for audio; and the speech synthesis model supports the conversion of any text to be processed, which reduces the requirements on the user's music creation ability.
  • adding the target audio to the target multimedia content makes the target multimedia content more interesting, thereby satisfying the user's demand for creative video creation.
  • the present disclosure also provides a speech synthesis device.
  • Fig. 4 is a schematic structural diagram of a speech synthesis device provided by an embodiment of the present disclosure.
  • the speech synthesis device 400 provided in this embodiment includes:
  • An acquisition module 401 configured to acquire text to be processed.
  • the processing module 402 is configured to input the text to be processed into the speech synthesis model, and obtain the spectral features corresponding to the text to be processed output by the speech synthesis model; wherein the speech synthesis model includes: a prosody sub-model and a timbre sub-model, the prosody sub-model is used to output the first acoustic feature corresponding to the text to be processed according to the input text to be processed, and the first acoustic feature includes a bottleneck feature for characterizing the target rap style; the timbre sub-model is used to output the spectral features corresponding to the text to be processed according to the input first acoustic feature, and the spectral features corresponding to the text to be processed include spectral features used to characterize the target timbre.
  • the processing module 402 is further configured to acquire the target audio corresponding to the text to be processed according to the spectral features corresponding to the text to be processed, the target audio having the target timbre and the target rap style.
  • the prosody sub-model is obtained by training according to the labeled text corresponding to the first sample audio and the second acoustic feature corresponding to the first sample audio;
  • the first sample audio includes at least one audio of the target rap style; the second acoustic feature includes a first labeled bottleneck feature corresponding to the first sample audio.
  • the timbre sub-model is obtained by training according to the third acoustic feature corresponding to the second sample audio, the first labeled spectral feature corresponding to the second sample audio, the fourth acoustic feature corresponding to the third sample audio, and the second labeled spectral feature corresponding to the third sample audio;
  • the third acoustic feature includes the second labeled bottleneck feature corresponding to the second sample audio; the third sample audio includes at least one audio with the target timbre, and the fourth acoustic feature corresponding to the third sample audio includes a third labeled bottleneck feature corresponding to the third sample audio.
  • the first labeled bottleneck feature corresponding to the first sample audio, the second labeled bottleneck feature corresponding to the second sample audio, and the third labeled bottleneck feature corresponding to the third sample audio are obtained by an encoder of an end-to-end speech recognition model performing bottleneck feature extraction on the input first sample audio, second sample audio, and third sample audio, respectively.
  • the second acoustic feature further includes: a first labeled fundamental frequency feature corresponding to the first sample audio;
  • the third acoustic feature also includes: a second labeled fundamental frequency feature corresponding to the second sample audio; the fourth acoustic feature further includes: a third labeled fundamental frequency feature corresponding to the third sample audio;
  • the first acoustic feature further includes: a fundamental frequency feature corresponding to the text to be processed.
  • the processing module 402 is further configured to add the target audio corresponding to the text to be processed to the target multimedia content.
  • the speech synthesis device provided in this embodiment can be used to implement the technical method of any of the above method embodiments, and its implementation principle and technical effect are similar. For details, please refer to the detailed description of the foregoing method embodiments.
  • the present disclosure also provides an electronic device.
  • Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • the electronic device provided in this embodiment includes: a memory 501 and a processor 502 .
  • the memory 501 may be an independent physical unit, and may be connected with the processor 502 through the bus 503 .
  • the memory 501 and the processor 502 may also be integrated together, implemented by hardware, and the like.
  • the memory 501 is used to store program instructions, and the processor 502 invokes the program instructions to execute the operations of any one of the above method embodiments.
  • the foregoing electronic device 500 may also include only the processor 502 .
  • the memory 501 for storing programs is located outside the electronic device 500, and the processor 502 is connected to the memory through circuits/wires, and is used to read and execute the programs stored in the memory.
  • the processor 502 may be a central processing unit (central processing unit, CPU), a network processor (network processor, NP) or a combination of CPU and NP.
  • the processor 502 may further include a hardware chip.
  • the aforementioned hardware chip may be an application-specific integrated circuit (application-specific integrated circuit, ASIC), a programmable logic device (programmable logic device, PLD) or a combination thereof.
  • the aforementioned PLD may be a complex programmable logic device (complex programmable logic device, CPLD), a field-programmable gate array (field-programmable gate array, FPGA), a general array logic (generic array logic, GAL) or any combination thereof.
  • the memory 501 may include a volatile memory, such as a random-access memory (RAM); the memory may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory may also include a combination of the above-mentioned types of memory.
  • the present disclosure also provides a readable storage medium, including: computer program instructions; when the computer program instructions are executed by at least one processor of the electronic device, the speech synthesis method shown in any one of the above method embodiments is implemented.
  • the present disclosure also provides a program product. The program product includes a computer program, the computer program is stored in a readable storage medium, and at least one processor of the electronic device can read the computer program from the readable storage medium; the at least one processor executes the computer program to enable the electronic device to implement the speech synthesis method shown in any one of the above method embodiments.

Abstract

Provided are a speech synthesis method and apparatus, an electronic device, a readable storage medium, and a program product. The method comprises: acquiring a text to be processed (S201); inputting the text to be processed into a speech synthesis model, so as to obtain an outputted spectral feature corresponding to the text to be processed (S202), wherein the speech synthesis model comprises a prosody sub-model and a timbre sub-model, the prosody sub-model being used to output a corresponding first acoustic feature according to the inputted text to be processed, the first acoustic feature comprising a bottleneck feature for representing a target rap style, and the timbre sub-model being used to output, according to the inputted first acoustic feature, a spectral feature for representing a target timbre; and acquiring, according to the spectral feature corresponding to the text to be processed, a target audio corresponding to the text to be processed, the target audio having the target timbre and the target rap style (S203).

Description

语音合成方法、装置、电子设备及可读存储介质Speech synthesis method, device, electronic device and readable storage medium
相关申请的交叉引用Cross References to Related Applications
本公开要求于2021年9月22日提交的,申请名称为“语音合成方法、装置、电子设备及可读存储介质”的、中国专利申请号为“202111107875.8”的优先权,该中国专利申请的全部内容通过引用结合在本公开中。This disclosure claims the priority of the Chinese patent application number "202111107875.8" filed on September 22, 2021 with the title of "speech synthesis method, device, electronic equipment and readable storage medium". The entire contents are incorporated by reference in this disclosure.
技术领域technical field
本公开涉及人工智能技术领域,尤其涉及一种语音合成方法、装置、电子设备及可读存储介质。The present disclosure relates to the technical field of artificial intelligence, and in particular to a speech synthesis method, device, electronic equipment and readable storage medium.
背景技术Background technique
随着互联网技术的不断发展,应用程序能够支持用户合成创意视频,在合成创意视频时,通常需要为视频添加配乐。目前,为视频添加配乐通常是从音乐库中选择音乐,这样的方式添加的配乐无法满足用户个性化的需求。With the continuous development of Internet technology, application programs can support users to synthesize creative videos. When synthesizing creative videos, it is usually necessary to add soundtracks to the videos. Currently, adding a soundtrack to a video usually involves selecting music from a music library, and the soundtrack added in this way cannot meet the personalized needs of users.
技术解决方案technical solution
为了解决上述技术问题或者至少部分地解决上述技术问题,本公开提供了一种语音合成方法、装置、电子设备及可读存储介质。In order to solve the above technical problems or at least partly solve the above technical problems, the present disclosure provides a speech synthesis method, device, electronic equipment and readable storage medium.
第一方面,本公开提供了一种语音合成方法,包括:In a first aspect, the present disclosure provides a speech synthesis method, including:
获取待处理文本;Get the text to be processed;
将所述待处理文本输入至语音合成模型,获取所述语音合成模型输出的所述待处理文本对应的频谱特征;其中,所述语音合成模型包括:韵律子模型和音色子模型,所述韵律子模型用于根据输入的待处理文本,输出所述待处理文本对应的第一声学特征,所述第一声学特征包括用于表征目标说唱风格的瓶颈特征;所述音色子模型用于根据输入的第一声学特征,输出所述待处理文本对应的频谱特征,所述待处理文本对应的频谱特征包括用于表征目标音色的频谱特征;Inputting the text to be processed into a speech synthesis model, and acquiring spectral features corresponding to the text to be processed output by the speech synthesis model; wherein the speech synthesis model includes a prosody sub-model and a timbre sub-model, the prosody sub-model is used to output, according to the input text to be processed, a first acoustic feature corresponding to the text to be processed, the first acoustic feature including a bottleneck feature for characterizing a target rap style; the timbre sub-model is used to output, according to the input first acoustic feature, the spectral features corresponding to the text to be processed, the spectral features corresponding to the text to be processed including spectral features for characterizing a target timbre;
根据所述待处理文本对应的频谱特征,获取所述待处理文本对应的目标音频,所述目标音频具有所述目标音色以及所述目标说唱风格。According to the spectrum feature corresponding to the text to be processed, the target audio corresponding to the text to be processed is acquired, and the target audio has the target timbre and the target rap style.
在一些可能的实施方式中,所述韵律子模型是根据第一样本音频对应的标注文本以及所述第一样本音频对应的第二声学特征,进行训练获得的;In some possible implementation manners, the prosody sub-model is obtained by training according to the labeled text corresponding to the first sample audio and the second acoustic feature corresponding to the first sample audio;
所述第一样本音频包括至少一个所述目标说唱风格的音频;所述第二声学特征包括所述第一样本音频对应的第一标注瓶颈特征。The first sample audio includes at least one audio of the target rap style; the second acoustic feature includes a first labeled bottleneck feature corresponding to the first sample audio.
在一些可能的实施方式中,所述音色子模型是根据第二样本音频对应的第三声学特征、第二样本音频对应的第一标注频谱特征、第三样本音频对应的第四声学特征以及第三样本音频对应的第二标注频谱特征进行训练获得的;In some possible implementation manners, the timbre sub-model is obtained by training according to a third acoustic feature corresponding to second sample audio, a first labeled spectral feature corresponding to the second sample audio, a fourth acoustic feature corresponding to third sample audio, and a second labeled spectral feature corresponding to the third sample audio;
其中,所述第三声学特征包括所述第二样本音频对应的第二标注瓶颈特征;所述第三样本音频包括至少一个具有所述目标音色的音频,所述第三样本音频对应的第四声学特征包括第三样本音频对应的第三标注瓶颈特征。Wherein, the third acoustic feature includes a second labeled bottleneck feature corresponding to the second sample audio; the third sample audio includes at least one audio with the target timbre, and the fourth acoustic feature corresponding to the third sample audio includes a third labeled bottleneck feature corresponding to the third sample audio.
在一些可能的实施方式中,所述第一样本音频对应的第一标注瓶颈特征、所述第二样本音频对应的第二标注瓶颈特征以及所述第三样本音频对应的第三标注瓶颈特征是通过端到端语音识别模型的编码器分别对输入的所述第一样本音频、所述第二样本音频和所述第三样本音频进行瓶颈特征提取获得的。In some possible implementation manners, the first labeled bottleneck feature corresponding to the first sample audio, the second labeled bottleneck feature corresponding to the second sample audio, and the third labeled bottleneck feature corresponding to the third sample audio It is obtained by performing bottleneck feature extraction on the input first sample audio, the second sample audio and the third sample audio respectively by an encoder of an end-to-end speech recognition model.
在一些可能的实施方式中,所述第二声学特征还包括:所述第一样本音频对应的第一标注基频特征;In some possible implementation manners, the second acoustic feature further includes: a first labeled fundamental frequency feature corresponding to the first sample audio;
所述第三声学特征还包括:所述第二样本音频对应的第二标注基频特征;所述第四声学特征还包括:所述第三样本音频对应的第三标注基频特征;The third acoustic feature also includes: a second labeled fundamental frequency feature corresponding to the second sample audio; the fourth acoustic feature further includes: a third labeled fundamental frequency feature corresponding to the third sample audio;
所述第一声学特征还包括:所述待处理文本对应的基频特征。The first acoustic feature further includes: a fundamental frequency feature corresponding to the text to be processed.
在一些可能的实施方式中,所述方法还包括:In some possible implementation manners, the method also includes:
将所述待处理文本对应的目标音频添加至目标多媒体内容。Add the target audio corresponding to the text to be processed to the target multimedia content.
第二方面,本公开提供了一种语音合成装置,包括:In a second aspect, the present disclosure provides a speech synthesis device, including:
获取模块,用于获取待处理文本;Obtaining module, used to obtain the text to be processed;
处理模块,用于将所述待处理文本输入至语音合成模型,获取所述语音合成模型输出的所述待处理文本对应的频谱特征;其中,所述语音合成模型包括:韵律子模型和音色子模型,所述韵律子模型用于根据输入的待处理文本,输出所述待处理文本对应的第一声学特征,所述第一声学特征包括用于表征目标说唱风格的瓶颈特征;所述音色子模型用于根据输入的第一声学特征,输出所述待处理文本对应的频谱特征,所述待处理文本对应的频谱特征包括用于表征目标音色的频谱特征;A processing module, configured to input the text to be processed into a speech synthesis model, and acquire spectral features corresponding to the text to be processed output by the speech synthesis model; wherein the speech synthesis model includes a prosody sub-model and a timbre sub-model, the prosody sub-model is used to output, according to the input text to be processed, a first acoustic feature corresponding to the text to be processed, the first acoustic feature including a bottleneck feature for characterizing a target rap style; the timbre sub-model is used to output, according to the input first acoustic feature, the spectral features corresponding to the text to be processed, the spectral features corresponding to the text to be processed including spectral features for characterizing a target timbre;
所述处理模块,用于根据所述待处理文本对应的频谱特征,获取所述待处理文本对应的目标音频,所述目标音频具有所述目标音色以及所述目标说唱风格。The processing module is configured to acquire target audio corresponding to the text to be processed according to the frequency spectrum feature corresponding to the text to be processed, the target audio having the target timbre and the target rap style.
第三方面,本公开提供了一种电子设备,包括:存储器、处理器以及计算机程序;In a third aspect, the present disclosure provides an electronic device, including: a memory, a processor, and a computer program;
所述存储器被配置为存储所述计算机程序;said memory is configured to store said computer program;
所述处理器被配置为执行所述计算机程序,以实现如第一方面任一项所述的语音合成方法。The processor is configured to execute the computer program to implement the speech synthesis method according to any one of the first aspect.
第四方面,本公开提供一种可读存储介质,包括:计算机程序;In a fourth aspect, the present disclosure provides a readable storage medium, including: a computer program;
所述计算机程序被电子设备的至少一个处理器执行时,以实现如第一方面任一项所述的语音合成方法。When the computer program is executed by at least one processor of the electronic device, the speech synthesis method according to any one of the first aspect can be realized.
第五方面,本公开提供一种程序产品,所述程序产品包括:计算机程序;所述计算机程序存储在可读存储介质中,电子设备从所述可读存储介质获取所述计算机程序,所述电子设备的至少一个处理器质性所述计算机程序时,以实现如第一方面任一项所述的语音合成方法。In a fifth aspect, the present disclosure provides a program product, the program product including: a computer program; the computer program is stored in a readable storage medium, and an electronic device acquires the computer program from the readable storage medium, the At least one processor of the electronic device executes the computer program to implement the speech synthesis method according to any one of the first aspect.
本公开提供一种语音合成方法、装置、电子设备及可读存储介质,其中,本公开基于语音合成模型对待处理文本进行分析,输出待处理文本对应的频谱特征,其中,语音合成模型包括韵律子模型和音色子模型,韵律子模型用于接收待处理文本作为输入,输出待处理文本对应的第一声学特征,其中,第一声学特征包括用于表征目标说唱风格的瓶颈特征;音色子模型接收第一声学特征作为输入,输出待处理文本对应的频谱特征,频谱特征包括用于表征目标音色的频谱特征;通过对语音合成模型输出的频谱特征进行转换,能够获得具有目标说唱风格以及目标音色的说唱音频,满足了用户对于合成音频的个性化需求;且语音合成模型支持对任意待处理文本的转换,降低了对用户的音乐创作能力的要求,有利于提升用户创作多媒体内容的积极性。The present disclosure provides a speech synthesis method, device, electronic equipment and readable storage medium. The present disclosure analyzes the text to be processed based on a speech synthesis model and outputs spectral features corresponding to the text to be processed, wherein the speech synthesis model includes a prosody sub-model and a timbre sub-model; the prosody sub-model is used to receive the text to be processed as input and output a first acoustic feature corresponding to the text to be processed, where the first acoustic feature includes a bottleneck feature for characterizing a target rap style; the timbre sub-model receives the first acoustic feature as input and outputs spectral features corresponding to the text to be processed, where the spectral features include spectral features for characterizing a target timbre. By converting the spectral features output by the speech synthesis model, rap audio with the target rap style and the target timbre can be obtained, which meets the user's personalized needs for synthesized audio; moreover, the speech synthesis model supports the conversion of any text to be processed, which reduces the requirements on the user's music creation ability and helps to improve the user's enthusiasm for creating multimedia content.
附图说明Description of drawings
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本公开的实施例,并与说明书一起用于解释本公开的原理。The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description serve to explain the principles of the disclosure.
为了更清楚地说明本公开实施例或相关技术中的技术方案,下面将对实施例或相关技术描述中所需要使用的附图作简单地介绍,显而易见地,对于本领域普通技术人员而言,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure or related technologies, the following will briefly introduce the drawings that need to be used in the descriptions of the embodiments or related technologies. Obviously, for those of ordinary skill in the art, Other drawings can also be obtained from these drawings without any creative effort.
图1a至图1c为本公开一实施例提供的语音合成模型的结构示意图;1a to 1c are structural schematic diagrams of a speech synthesis model provided by an embodiment of the present disclosure;
图2为本公开一实施例提供的语音合成方法的流程图;FIG. 2 is a flowchart of a speech synthesis method provided by an embodiment of the present disclosure;
图3为本公开另一实施例提供的语音合成方法的流程图;FIG. 3 is a flowchart of a speech synthesis method provided by another embodiment of the present disclosure;
图4为本公开一实施例提供的语音合成装置的结构示意图;FIG. 4 is a schematic structural diagram of a speech synthesis device provided by an embodiment of the present disclosure;
图5为本公开一实施例提供的电子设备的结构示意图。Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
具体实施方式Detailed ways
为了能够更清楚地理解本公开的上述目的、特征和优点,下面将对本公开的方案进行进一步描述。需要说明的是,在不冲突的情况下,本公开的实施例及实施例中的特征可以相互组合。In order to more clearly understand the above objects, features and advantages of the present disclosure, the solutions of the present disclosure will be further described below. It should be noted that, in the case of no conflict, the embodiments of the present disclosure and the features in the embodiments can be combined with each other.
在下面的描述中阐述了很多具体细节以便于充分理解本公开,但本公开还可以采用其他不同于在此描述的方式来实施;显然,说明书中的实施例只是本公开的一部分实施例,而不是全部的实施例。In the following description, many specific details are set forth in order to fully understand the present disclosure, but the present disclosure can also be implemented in other ways than described here; obviously, the embodiments in the description are only some of the embodiments of the present disclosure, and Not all examples.
本公开提供一种语音合成方法、装置、电子设备、可读存储介质及程序产品,其中,该方法通过预先训练的语音合成模型实现文本到具有目标说唱风格以及目标音色的音频的转换,该语音合成模型能够实现目标说唱风格和音色相对独立地对语音合成的控制,从而满足用户对于个性化语音合成的需求。The present disclosure provides a speech synthesis method, device, electronic equipment, readable storage medium and program product, wherein the method realizes the conversion of text into audio with a target rap style and a target timbre through a pre-trained speech synthesis model, and the speech synthesis model enables the target rap style and the timbre to control speech synthesis relatively independently, thereby meeting users' needs for personalized speech synthesis.
本公开提及的目标说唱风格可以包括任意类别的说唱风格,本公开对于目标说唱风格具体为何种说说唱风格不做限定。例如,目标说唱风格可以为流行说唱、另类说唱、喜剧说唱、爵士说唱、嘻哈说唱中的任一种说唱风格。The target rap style mentioned in the present disclosure may include any type of rap style, and the present disclosure does not limit the specific rap style of the target rap style. For example, the target rap style may be any rap style among popular rap, alternative rap, comedy rap, jazz rap, and hip-hop rap.
本公开提供的语音合成方法,可以由电子设备来执行。其中,电子设备可以是平板电脑、手机(如折叠屏手机、大屏手机等)、可穿戴设备、车载设备、增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本、个人数字助理(personaldigital assistant,PDA)、智能电视、智慧屏、高清电视、4K电视、智能音箱、智能投影仪等物联网(the internet of things,IOT)设备,本公开对电子设备的具体类型不作任何限制。The speech synthesis method provided by the present disclosure can be executed by electronic equipment. Among them, the electronic device can be a tablet computer, a mobile phone (such as a folding screen mobile phone, a large-screen mobile phone, etc.), a wearable device, a vehicle-mounted device, an augmented reality (augmented reality, AR)/virtual reality (virtual reality, VR) device, a notebook computer, etc. , ultra-mobile personal computer (ultra-mobile personal computer, UMPC), netbook, personal digital assistant (personal digital assistant, PDA), smart TV, smart screen, high-definition TV, 4K TV, smart speaker, smart projector and other Internet of Things (the Internet of things, IOT) equipment, this disclosure does not make any restrictions on the specific type of electronic equipment.
需要说明的是,训练获取语音合成模型的电子设备和利用语音合成模型执行语音合成业务的电子设备,可以是不同的电子设备,也可以是相同的电子设备,本公开对此不作限定。例如,由服务端设备训练获取语音合成模型,服务端设备将训练好的语音合成模型下发至终端设备/服务端设备,由终端设备/服务端设备根据语音合成模型执行语音合成业务;又如,由服务端设备训练获取语音合成模型,之后,将训练好的语音合成模型部署在该服务端设备,之后,服务端设备调用语音合成模型处理语音合成业务。本公开对此不做限制,实际应用中, 可灵活设置。It should be noted that the electronic device that trains and obtains the speech synthesis model and the electronic device that uses the speech synthesis model to execute the speech synthesis service may be different electronic devices or the same electronic device, which is not limited in the present disclosure. For example, the speech synthesis model is obtained through the training of the server device, and the server device sends the trained speech synthesis model to the terminal device/server device, and the terminal device/server device executes the speech synthesis service according to the speech synthesis model; another example The speech synthesis model is trained by the server device, and then the trained speech synthesis model is deployed on the server device, and then the server device invokes the speech synthesis model to process the speech synthesis service. The present disclosure does not limit this, and it can be set flexibly in practical applications.
下面,首先对本方案中的语音合成模型进行介绍。In the following, the speech synthesis model in this solution is firstly introduced.
本方案中的语音合成模型通过引入包括瓶颈(bottleneck)特征的声学特征,将语音合成模型解耦成两个子模型,分别为:韵律子模型和音色子模型,其中,韵律子模型用于建立文本到包含瓶颈特征的声学特征之间的深度映射;音色子模型用于建立包含瓶颈特征的声学特征到频谱特征之间的深度映射。By introducing acoustic features including bottleneck features, the speech synthesis model in this solution is decoupled into two sub-models: a prosody sub-model and a timbre sub-model, where the prosody sub-model is used to establish a deep mapping from text to acoustic features containing bottleneck features, and the timbre sub-model is used to establish a deep mapping from acoustic features containing bottleneck features to spectral features.
在此基础上,至少具有以下有益效果:On this basis, it has at least the following beneficial effects:
1、解耦后的两个特征提取子模型可以使用不同的样本音频进行训练。1. The two decoupled feature extraction sub-models can be trained using different sample audio.
韵律子模型,用于建立文本序列到包含瓶颈特征的声学特征之间的深度映射,韵律子模型需要使用高质量的具有目标说唱风格的第一样本音频以及第一样本音频对应的标注文本,共同作为样本数据对韵律子模型进行训练。The prosody sub-model is used to establish a deep mapping between the text sequence and the acoustic features containing bottleneck features. The prosody sub-model needs to use high-quality first sample audio with the target rap style and the annotated text corresponding to the first sample audio , together as the sample data to train the prosody sub-model.
音色子模型,用于建立包含瓶颈特征的声学特征到频谱特征之间的深度映射,音色子模型可以使用未标注相应文本的第二样本音频进行训练,由于无需标注第二样本音频对应的文本,这样可以大大降低获取第二样本音频的成本。The timbre sub-model is used to establish the depth mapping between the acoustic features including bottleneck features and the spectral features. The timbre sub-model can be trained using the second sample audio that has not marked the corresponding text. Since there is no need to label the text corresponding to the second sample audio, This can greatly reduce the cost of acquiring a second sample of audio.
2、通过解耦语音合成模型,实现了说唱风格和音色相对独立地对语音合成的控制。2. By decoupling the speech synthesis model, the relatively independent control of speech synthesis by rap style and timbre is realized.
韵律子模型输出的声学特征包括用于表征目标说唱风格的瓶颈特征,实现说唱风格对语音合成的控制。此外,韵律子模型输出的声学特征还可以包括用于表征音调的基频特征,实现音调对语音合成的控制。The acoustic features output by the prosody sub-model include the bottleneck features used to characterize the target rap style, and realize the control of rap style on speech synthesis. In addition, the acoustic features output by the prosody sub-model may also include fundamental frequency features used to characterize pitch, so as to realize the control of speech synthesis by pitch.
音色子模型输出的文本对应的频谱特征包括用于表征目标音色的频谱特征,从而实现音色对语音合成的控制。The spectral features corresponding to the text output by the timbre sub-model include the spectral features used to characterize the target timbre, so as to realize the control of the timbre over speech synthesis.
此外,需要说明的是,音色子模型输出的频谱特征还包括用于表征目标说唱风格的频谱特征,且表征目标音色的频谱特征和表征目标说唱风格的频谱特征为相同的频谱特征。若韵律子模型输出的声学特征还包括基频特征,则音色子模型输出的频谱特征还包括用于表征相应基频的频谱特征,且表征目标音色的频谱特征、表征目标说唱风格的频谱特征以及表征基频的频谱特征为相同的频谱特征。In addition, it should be noted that the spectral features output by the timbre sub-model also include spectral features used to characterize the target rap style, and the spectral features characterizing the target timbre and the spectral features characterizing the target rap style are the same spectral features. If the acoustic features output by the prosody sub-model also include fundamental frequency features, the spectral features output by the timbre sub-model also include spectral features used to characterize the corresponding fundamental frequency, and the spectral features characterizing the target timbre, the spectral features characterizing the target rap style, and the spectral features characterizing the fundamental frequency are the same spectral features.
3、降低了对具有目标音色的第三样本音频的要求3. Reduced requirements on the third sample audio with the target timbre
该语音合成模型可以通过较少的目标音色的第三样本音频进行训练,即可使最终的语音合成模型合成具有目标音色的音频,且即使第三样本音频的质量不高,如发音不标准、说话不流利等,语音合成模型依然可以稳定地合成具有目标音色的音频。The speech synthesis model can be trained with a relatively small amount of third sample audio of the target timbre, so that the final speech synthesis model can synthesize audio with the target timbre; even if the quality of the third sample audio is not high, for example the pronunciation is non-standard or the speech is not fluent, the speech synthesis model can still stably synthesize audio with the target timbre.
由于通过第二样本音频已对音色子模型进行训练,使得音色子模型已经具备了较高的针对音色的语音合成控制能力,因此,即使音色子模型学习少量的第三样本音频,也能够较好的掌握目标音色。Since the timbre sub-model has already been trained with the second sample audio, the timbre sub-model already has a high speech-synthesis control capability for timbre; therefore, even if the timbre sub-model learns only a small amount of third sample audio, it can still master the target timbre well.
下面通过几个具体实施例对语音合成模型的结构以及如何训练获取语音合成模型进行详细介绍。下述实施例中,以电子设备为例,结合附图,进行详细介绍。The structure of the speech synthesis model and how to train and obtain the speech synthesis model will be introduced in detail below through several specific embodiments. In the following embodiments, an electronic device is taken as an example to describe in detail with reference to the accompanying drawings.
其中,图1a示出了训练获取语音合成模型的整体框架图;图1b和图1c分别示例性地示出了语音合成模型包括的韵律子模型和音色子模型的结构示意图。Among them, Fig. 1a shows the overall frame diagram of the training and acquisition of the speech synthesis model; Fig. 1b and Fig. 1c respectively exemplarily show the structural diagrams of the prosody sub-model and the timbre sub-model included in the speech synthesis model.
参照图1a所示,语音合成模型100包括:韵律子模型101和音色子模型102。对于语音合成模型100进行训练的过程包括针对韵律子模型101进行训练的过程和对音色子模型102进行训练的过程。Referring to FIG. 1 a , the speech synthesis model 100 includes: a prosody sub-model 101 and a timbre sub-model 102 . The process of training the speech synthesis model 100 includes the process of training the prosody sub-model 101 and the process of training the timbre sub-model 102 .
下面分别介绍对韵律子模型101进行训练的过程和对音色子模型102进行训练的过程。The process of training the prosody sub-model 101 and the process of training the timbre sub-model 102 are respectively introduced below.
一、对韵律子模型101进行训练1. Training the prosodic sub-model 101
韵律子模型101用于根据第一样本音频对应的标注文本以及标注声学特征(以下将第一样本音频对应的标注声学特征称为第二声学特征)进行训练,通过学习第一样本音频对应的标注文本以及第二声学特征之间的关系,使得韵律子模型101获得建立文本到包含瓶颈特征的声学特征之间的深度映射的能力。The prosody sub-model 101 is trained according to the labeled text corresponding to the first sample audio and the labeled acoustic features (hereinafter, the labeled acoustic features corresponding to the first sample audio are referred to as the second acoustic features). By learning the relationship between the labeled text corresponding to the first sample audio and the second acoustic features, the prosody sub-model 101 obtains the ability to establish a deep mapping from text to acoustic features containing bottleneck features.
可选地,前述标注文本具体可以为文本序列。Optionally, the aforementioned marked text may specifically be a text sequence.
具体地,韵律子模型101具体用于对输入的第一样本音频对应的标注文本进行分析,建模中间特征序列,并对中间特征序列进行特征转换以及降维,输出标注文本对应的第五声学特征。Specifically, the prosody sub-model 101 is specifically used to analyze the labeled text corresponding to the input first sample audio, model the intermediate feature sequence, perform feature conversion and dimensionality reduction on the intermediate feature sequence, and output the fifth acoustic features.
之后,再基于第一样本音频对应的第二声学特征、第一样本音频对应的第五声学特征以及预先构建的损失函数,计算本轮训练的损失函数信息,并根据本轮训练的损失函数信息对韵律子模型101包括的参数的系数值进行调整。Afterwards, based on the second acoustic feature corresponding to the first sample audio, the fifth acoustic feature corresponding to the first sample audio, and a pre-built loss function, the loss function information of the current round of training is calculated, and the coefficient values of the parameters included in the prosody sub-model 101 are adjusted according to the loss function information of the current round of training.
通过多个第一样本音频、第一样本音频对应的标注文本、第一样本音频对应的第二声学特征(包括第一标注瓶颈特征)的不断迭代训练,最终获得满足相应收敛条件的第一特征提取模型101。Through continuous iterative training with multiple first sample audios, the labeled text corresponding to the first sample audios, and the second acoustic features (including the first labeled bottleneck features) corresponding to the first sample audios, the first feature extraction model 101 that satisfies the corresponding convergence condition is finally obtained.
在训练过程中,第一样本音频对应的第二声学特征,可以理解为韵律子模型101的学习目标。During the training process, the second acoustic feature corresponding to the first audio sample can be understood as the learning objective of the prosody sub-model 101 .
其中,第一样本音频可以包括高质量的音频文件(高质量的音频也可以理解为干净的音频),第一样本音频对应的标注文本可以包括第一样本音频对应的一个或多个字符或者一个或多个音素,本公开对此不做限定。第一样本音频可以根据实际需求进行录制、多次的清理获得的,或者,也可以从音频数据库中筛选并多次清理获得,本公开对于第一样本音频的获取方式不做限制。类似地,第一样本音频对应的标注文本,也可以是通过反复的标注、校正获得的,从而保证标注文本的准确性。Wherein, the first sample audio may include high-quality audio files (high-quality audio may also be understood as clean audio), and the labeled text corresponding to the first sample audio may include one or more characters or one or more phonemes corresponding to the first sample audio, which is not limited in the present disclosure. The first sample audio may be recorded and cleaned multiple times according to actual needs, or may be selected from an audio database and cleaned multiple times; the present disclosure does not limit the manner of acquiring the first sample audio. Similarly, the labeled text corresponding to the first sample audio may also be obtained through repeated labeling and correction, so as to ensure the accuracy of the labeled text.
此外,本公开提及的第一样本音频为具有目标说唱风格的音频,本公开对于第一样本音频的时长、文件格式、数量等等参数不做限定,且第一样本音频可以是相同或者不同歌手演唱的音乐片段。In addition, the first sample audio mentioned in this disclosure is audio with the target rap style. This disclosure does not limit the duration, file format, quantity and other parameters of the first sample audio, and the first sample audio can be A piece of music sung by the same or a different singer.
此外,标注文本对应的第五声学特征可以理解为韵律子模型101输出的标注文本对应的预测声学特征,标注文本对应的第五声学特征也可以理解为第一样本音频对应的第五声学特征。In addition, the fifth acoustic feature corresponding to the labeled text can be understood as the predicted acoustic feature corresponding to the labeled text output by the prosodic sub-model 101, and the fifth acoustic feature corresponding to the labeled text can also be understood as the fifth acoustic feature corresponding to the first sample audio .
一些实施例中,第二声学特征包括:第一样本音频对应的第一标注瓶颈特征。In some embodiments, the second acoustic feature includes: a first labeled bottleneck feature corresponding to the first audio sample.
其中,瓶颈(bottleneck)是一种非线性的特征转换技术以及有效的降维技术。在本方案所提及的针对特定音色的语音合成场景中,瓶颈特征可以包括韵律、内容等维度的信息。Among them, the bottleneck (bottleneck) is a nonlinear feature transformation technology and an effective dimension reduction technology. In the speech synthesis scenario for a specific timbre mentioned in this solution, the bottleneck feature may include information of dimensions such as prosody and content.
一种可能的实现方式,第一样本音频对应的第一标注瓶颈特征可以通过端到端语音识别(ASR)模型的编码器(encoder)获得。In a possible implementation manner, the first labeled bottleneck feature corresponding to the first audio sample may be obtained by an encoder (encoder) of an end-to-end speech recognition (ASR) model.
下文中,端到端ASR模型简称为:ASR模型。Hereinafter, the end-to-end ASR model is referred to as: ASR model for short.
示例性地,参照图1a所示,可将第一样本音频输入至ASR模型104,获取ASR模型104的编码器输出的第一样本音频对应的第一标注瓶颈特征,其中,ASR模型104的编码器相当于提前准备的瓶颈特征的提取器,在本方案中ASR模型104的编码器可以用于准备样本数据。Exemplarily, as shown in FIG. 1a, the first sample audio may be input into the ASR model 104, and the first labeled bottleneck feature corresponding to the first sample audio output by the encoder of the ASR model 104 may be obtained, where the encoder of the ASR model 104 serves as a bottleneck feature extractor prepared in advance; in this solution, the encoder of the ASR model 104 can be used to prepare the sample data.
需要说明的是,ASR模型104还可以包括其他模块,例如图1a所示,ASR模型104还包括解码器(decoder)以及注意力网络(attention network)。针对ASR模型104中除编码器以外的其他模块输出的处理结果,可以不做任何处理,且本公开对于ASR模型中除编码器以外的其他模块或者网络的功能、实现方式不作限定。It should be noted that the ASR model 104 may also include other modules. For example, as shown in FIG. 1a, the ASR model 104 also includes a decoder (decoder) and an attention network (attention network). No processing may be performed on the processing results output by modules other than the encoder in the ASR model 104 , and this disclosure does not limit the functions and implementations of modules or networks other than the encoder in the ASR model.
其中,通过ASR模型104的编码器获得第一样本音频对应的第一标注瓶颈特征仅是示例,并不是对获得第一样本音频对应的第一标注瓶颈特征的实现方式的限制。实际应用中,也可以通过其他方式获得,本公开对此不做限制。例如,数据库中存储第一样本音频以及第一样本音频对应的第一标注瓶颈特征,电子设备也可以从数据库中获取第一样本音频以及第一标注瓶颈特征。Obtaining the first marked bottleneck feature corresponding to the first sample audio by the encoder of the ASR model 104 is only an example, and is not a limitation to the implementation manner of obtaining the first marked bottleneck feature corresponding to the first sample audio. In practical applications, it can also be obtained in other ways, which is not limited in the present disclosure. For example, the database stores the first sample audio and the first labeled bottleneck feature corresponding to the first sample audio, and the electronic device may also acquire the first sample audio and the first labeled bottleneck feature from the database.
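For concreteness, the following Python sketch illustrates how labeled bottleneck features of the kind described above might be prepared with the encoder of an ASR model. The ASREncoder class, its layer sizes, and the placeholder mel-frame tensor are assumptions introduced only for illustration; the disclosure does not prescribe a particular encoder architecture, and in practice the encoder weights would come from a trained end-to-end ASR model.

```python
import torch
import torch.nn as nn

class ASREncoder(nn.Module):
    """Stand-in for the encoder of an end-to-end ASR model; layer sizes are hypothetical."""
    def __init__(self, n_mels=80, hidden=256, bottleneck_dim=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2, batch_first=True, bidirectional=True)
        self.bottleneck = nn.Linear(2 * hidden, bottleneck_dim)  # low-dimensional "bottleneck" layer

    def forward(self, mel_frames):                  # (batch, frames, n_mels)
        hidden_seq, _ = self.rnn(mel_frames)
        return self.bottleneck(hidden_seq)          # (batch, frames, bottleneck_dim)

encoder = ASREncoder()                              # weights would come from a trained ASR model
encoder.eval()
mel_frames = torch.randn(1, 200, 80)                # placeholder acoustic frames of one sample audio
with torch.no_grad():
    labeled_bottleneck = encoder(mel_frames)        # labeled bottleneck feature for that sample audio
print(labeled_bottleneck.shape)                     # torch.Size([1, 200, 64])
```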
另一些实施例中,第一样本音频对应的第二声学特征包括:第一样本音频对应的第一标注瓶颈特征和第一样本音频对应的第一标注基频特征。In some other embodiments, the second acoustic feature corresponding to the first sample audio includes: a first labeled bottleneck feature corresponding to the first sample audio and a first labeled fundamental frequency feature corresponding to the first sample audio.
其中,第一标注瓶颈特征可参照前述示例的详细描述,简明起见,此处不再赘述。Wherein, the first marked bottleneck feature can refer to the detailed description of the foregoing examples, and for the sake of brevity, details are not repeated here.
其中,音调表示人耳对于声音的音调高低的主观感受,音调的高低主要取决于声音的基频,基频频率越高则音调越高,基频频率越低则音调越低。在语音合成过程中,音调也是影响语音合成效果的重要因素之一。为了使得最终的语音合成模型具备对音调的语音合成控制能力,本方案在引入瓶颈特征的同时,还引入基频特征,使得最终的韵律子模型101具有根据输入的文本,输出相对应的瓶颈特征和基频特征的能力。Among them, the pitch represents the subjective feeling of the human ear for the pitch of the sound. The pitch mainly depends on the fundamental frequency of the sound. The higher the fundamental frequency, the higher the pitch, and the lower the fundamental frequency, the lower the pitch. In the process of speech synthesis, pitch is also one of the important factors affecting the effect of speech synthesis. In order to enable the final speech synthesis model to have the ability to control the speech synthesis of pitch, this solution introduces the fundamental frequency feature while introducing the bottleneck feature, so that the final prosodic sub-model 101 can output the corresponding bottleneck feature according to the input text and fundamental frequency features.
具体地,韵律子模型101具体用于对输入的第一样本音频对应的标注文本进行分析,建模中间特征序列,并对中间特征序列进行特征转换以及降维,输出标注文本对应的第五声学特征。Specifically, the prosody sub-model 101 is specifically used to analyze the labeled text corresponding to the input first sample audio, model the intermediate feature sequence, perform feature conversion and dimensionality reduction on the intermediate feature sequence, and output the fifth acoustic features.
其中,标注文本对应第五声学特征可以理解为韵律子模型101输出的标注文本对应的预测声学特征。标注文本对应的第五声学特征也可以理解为第一样本音频对应的第五声学特征。Wherein, the fifth acoustic feature corresponding to the tagged text may be understood as the predicted acoustic feature corresponding to the tagged text output by the prosody sub-model 101 . The fifth acoustic feature corresponding to the marked text may also be understood as the fifth acoustic feature corresponding to the first audio sample.
需要说明的是,第一样本音频对应的第二声学特征包括:第一标注瓶颈特征和第一标注基频特征,则在训练的过程中,韵律子模型101输出的第一样本音频对应的第五声学特征也包括:第一样本音频对应的预测瓶颈特征和预测基频特征。It should be noted that, when the second acoustic feature corresponding to the first sample audio includes the first labeled bottleneck feature and the first labeled fundamental frequency feature, during training, the fifth acoustic feature corresponding to the first sample audio output by the prosody sub-model 101 also includes a predicted bottleneck feature and a predicted fundamental frequency feature corresponding to the first sample audio.
之后,再基于第一样本音频对应的第二声学特征、第一样本音频对应的第五声学特征以及预先构建的损失函数,计算本轮训练的损失函数信息,并根据损失函数信息对韵律子模型101包括的参数的系数值进行调整。Afterwards, based on the second acoustic feature corresponding to the first sample audio, the fifth acoustic feature corresponding to the first sample audio, and the pre-built loss function, the loss function information of the current round of training is calculated, and the coefficient values of the parameters included in the prosody sub-model 101 are adjusted according to the loss function information.
通过海量的第一样本音频、第一样本音频对应的标注文本、第一样本音频对应的第二声学特征(包括第一标注瓶颈特征和第一标注基频特征)的不断迭代训练,最终获得满足相应收敛条件的第一特征提取模型101。Through the continuous iterative training of the massive first sample audio, the labeled text corresponding to the first sample audio, and the second acoustic feature corresponding to the first sample audio (including the first labeled bottleneck feature and the first labeled fundamental frequency feature), Finally, the first feature extraction model 101 satisfying the corresponding convergence condition is obtained.
在训练过程中,第一样本音频对应的第二声学特征,可以理解为韵律子模型101的学习目标。During the training process, the second acoustic feature corresponding to the first audio sample can be understood as the learning objective of the prosody sub-model 101 .
一种可能的实现方式,第一样本音频对应的第一标注基频特征可 以通过数字信号处理(DSP)的方法对第一样本音频进行分析获得。示例性地,如图1a中所示,可以通过数字信号处理器105对第一样本音频进行数字信号处理,获取第一样本音频对应的第一标注基频特征。其中,数字信号处理器105的具体实现方式不作限定,其只要能够提取输入的第一样本音频对应的第一标注基频特征即可。In a possible implementation manner, the first labeled fundamental frequency feature corresponding to the first sample audio can be obtained by analyzing the first sample audio by a digital signal processing (DSP) method. Exemplarily, as shown in FIG. 1 a , digital signal processing may be performed on the first sample audio by the digital signal processor 105 to obtain the first labeled fundamental frequency feature corresponding to the first sample audio. Wherein, the specific implementation manner of the digital signal processor 105 is not limited, as long as it can extract the first marked fundamental frequency feature corresponding to the input first sample audio.
此外,第一样本音频对应的第一标注基频特征并不限于通过数字信号处理的方法获得,本公开对于获取第一标注基频特征的实现方式不作限定。例如,一些数据库中存储第一样本音频以及第一样本音频对应的第一标注基频特征,也可以从数据库中获取第一样本音频以及第一标注基频特征。In addition, the first marked fundamental frequency feature corresponding to the first sample audio is not limited to be obtained by digital signal processing, and the present disclosure does not limit the implementation manner of obtaining the first marked fundamental frequency feature. For example, some databases store the first sample audio and the first labeled fundamental frequency feature corresponding to the first sample audio, and the first sample audio and the first labeled fundamental frequency feature may also be acquired from the database.
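As one hedged illustration of the digital-signal-processing route described above, the following sketch estimates a frame-level fundamental frequency with librosa's pYIN implementation. The file name, sampling rate, frame parameters, and the choice of pYIN itself are assumptions; the disclosure only requires that a fundamental frequency feature be extracted from the sample audio by some DSP method.

```python
import librosa
import numpy as np

# Estimate a frame-level fundamental frequency (F0) for one sample audio with pYIN.
y, sr = librosa.load("first_sample_audio.wav", sr=16000)   # hypothetical file name
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),    # ~65 Hz lower bound (assumed search range)
    fmax=librosa.note_to_hz("C6"),    # ~1047 Hz upper bound
    sr=sr,
    frame_length=1024,
    hop_length=256,
)
f0 = np.nan_to_num(f0)                # unvoiced frames become 0 so the feature stays frame-aligned
print(f0.shape)                        # one F0 value per analysis frame
```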
需要说明的是,韵律子模型对应的收敛条件可以但不限于包括迭代次数、损失阈值等评价指标。本公开对于训练韵律子模型对应的收敛条件不做限制。且电子设备根据第一样本音频对应的第一标注瓶颈特征进行训练,或者,根据第一样本音频对应的第一标注瓶颈特征和第一标注基频特征进行训练,收敛条件可以具备差异。It should be noted that the convergence condition corresponding to the prosody sub-model may include, but is not limited to, evaluation indicators such as the number of iterations and loss threshold. The present disclosure does not limit the convergence conditions corresponding to the training prosodic sub-models. And the electronic device performs training according to the first labeled bottleneck feature corresponding to the first sample audio, or, according to the first labeled bottleneck feature and the first labeled fundamental frequency feature corresponding to the first sample audio, the convergence conditions may have differences.
此外,电子设备根据第一样本音频对应的第一标注瓶颈特征进行训练,或者,根据第一样本音频对应的第一标注瓶颈特征和第一标注基频特征进行训练,预先构建的韵律子模型对应的损失函数可以相同,也可以具备差异。本公开对于预先构建的韵律子模型对应的损失函数的实现方式不做限定。In addition, whether the electronic device performs training according to the first labeled bottleneck feature corresponding to the first sample audio, or according to the first labeled bottleneck feature and the first labeled fundamental frequency feature corresponding to the first sample audio, the loss functions corresponding to the pre-built prosody sub-model may be the same or different. The present disclosure does not limit the implementation of the loss function corresponding to the pre-built prosody sub-model.
下面示例性地示出韵律子模型的网络结构。The network structure of the prosodic sub-model is exemplarily shown below.
图1b示例性地示出了韵律子模型101的一种实现方式。参照图1b所示,韵律子模型101可以包括:文本编码网络(text encoder)1011、注意力网络(attention)1012以及解码网络(decoder)1013。FIG. 1 b exemplarily shows an implementation of the prosodic sub-model 101 . As shown in FIG. 1 b , the prosodic sub-model 101 may include: a text encoding network (text encoder) 1011 , an attention network (attention) 1012 and a decoding network (decoder) 1013 .
其中,文本编码网络1011,用于接收文本作为输入,并对输入的文本的上下文以及时序关系进行分析,建模中间特征序列,该中间特征序列包含上下文信息以及时序关系。Among them, the text coding network 1011 is used to receive text as input, analyze the context and time sequence relationship of the input text, and model an intermediate feature sequence, which contains context information and time sequence relationship.
解码网络1013,可以采用自回归网络结构,通过使用上一个时间 步的输出作为下一个时间步的输入。The decoding network 1013 can adopt an autoregressive network structure, by using the output of the previous time step as the input of the next time step.
注意力网络1012主要用于输出的注意力系数。将注意力系数与文本编码网络1011输出的中间特征序列进行加权平均,获得加权平均结果,该加权平均结果作为解码网络1013每个时间步的另一个条件输入。解码网络1013通过对输入(即加权平均结果以及上一个时间步的输出)进行特征转换,输出文本对应的预测声学特征。The attention network 1012 is mainly used to output attention coefficients. The attention coefficient and the intermediate feature sequence output by the text encoding network 1011 are weighted and averaged to obtain a weighted average result, which is used as another conditional input for each time step of the decoding network 1013 . The decoding network 1013 outputs the predicted acoustic features corresponding to the text by performing feature conversion on the input (ie, the weighted average result and the output of the previous time step).
结合前述两种实施方式,解码网络1013输出的文本对应的预测声学特征可以包括:文本对应的预测瓶颈特征;或者,解码网络1013输出的文本对应的预测声学特征可以包括:文本对应的预测瓶颈特征和文本对应的预测基频特征。In combination with the foregoing two implementation manners, the predicted acoustic features corresponding to the text output by the decoding network 1013 may include the predicted bottleneck features corresponding to the text; or, the predicted acoustic features corresponding to the text output by the decoding network 1013 may include the predicted bottleneck features corresponding to the text and the predicted fundamental frequency features corresponding to the text.
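The following simplified PyTorch sketch mirrors the structure of Fig. 1b (text encoding network, attention network, autoregressive decoding network). All dimensions, the GRU cells, and the single-head dot-product attention are illustrative assumptions rather than the patented implementation; the decoder here emits a 64-dimensional bottleneck feature plus one F0 value per time step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProsodySubModel(nn.Module):
    """Text -> (bottleneck feature, F0) per decoder time step. Sizes are illustrative only."""
    def __init__(self, vocab_size=100, emb=128, enc_hidden=128, dec_hidden=256,
                 bottleneck_dim=64, use_f0=True):
        super().__init__()
        out_dim = bottleneck_dim + (1 if use_f0 else 0)
        self.embed = nn.Embedding(vocab_size, emb)
        self.text_encoder = nn.GRU(emb, enc_hidden, batch_first=True, bidirectional=True)
        self.attn_query = nn.Linear(dec_hidden, 2 * enc_hidden)
        self.decoder_cell = nn.GRUCell(2 * enc_hidden + out_dim, dec_hidden)
        self.proj = nn.Linear(dec_hidden, out_dim)

    def forward(self, token_ids, n_frames):
        memory, _ = self.text_encoder(self.embed(token_ids))      # intermediate feature sequence
        batch = token_ids.size(0)
        h = memory.new_zeros(batch, self.decoder_cell.hidden_size)
        prev_out = memory.new_zeros(batch, self.proj.out_features)
        outputs = []
        for _ in range(n_frames):                                  # autoregressive decoding
            scores = torch.bmm(memory, self.attn_query(h).unsqueeze(-1)).squeeze(-1)
            weights = F.softmax(scores, dim=-1)                    # attention coefficients
            context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)  # weighted average of memory
            h = self.decoder_cell(torch.cat([context, prev_out], dim=-1), h)
            prev_out = self.proj(h)                                # predicted bottleneck (+ F0) per step
            outputs.append(prev_out)
        return torch.stack(outputs, dim=1)                         # (batch, n_frames, out_dim)

model = ProsodySubModel()
acoustic = model(torch.randint(0, 100, (2, 12)), n_frames=50)
print(acoustic.shape)                                              # torch.Size([2, 50, 65])
```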
此外,韵律子模型101包括的参数的系数的初始值可以是随机生成的,也可以是预设的,或者,还可以是通过其他方式确定的,本公开对此不作限定。In addition, the initial values of the coefficients of the parameters included in the prosody sub-model 101 may be randomly generated, preset, or determined in other ways, which is not limited in the present disclosure.
通过多个第一样本音频分别对应的标注文本、以及第一样本音频分别对应的第二声学特征,对韵律子模型101进行迭代训练,不断优化韵律子模型101包括的参数的系数值,直至满足韵律子模型101的收敛条件,则停止针对韵律子模型101的训练。The prosodic sub-model 101 is iteratively trained through the marked texts corresponding to the plurality of first sample audios and the second acoustic features respectively corresponding to the first sample audios, and the coefficient values of the parameters included in the prosody sub-model 101 are continuously optimized, Until the convergence condition of the prosody sub-model 101 is met, the training for the prosody sub-model 101 is stopped.
应理解,上述描述的第一样本音频与相应的标注文本之间一一对应,是成对的样本数据。It should be understood that the one-to-one correspondence between the first sample audio described above and the corresponding annotation text is a pair of sample data.
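Building on the ProsodySubModel sketch above (which is assumed to be in scope), one training iteration over a pair of (labeled text, second acoustic feature) might look as follows. The L1 loss and Adam optimizer are assumptions; the disclosure only specifies a pre-built loss function between the labeled and predicted acoustic features.

```python
import torch
import torch.nn as nn

model = ProsodySubModel()                           # class from the preceding sketch (assumed in scope)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.L1Loss()                             # assumed loss function

labeled_text = torch.randint(0, 100, (2, 12))       # labeled text of first sample audios, as token ids
second_acoustic = torch.randn(2, 50, 65)            # labeled bottleneck (+ F0) features, frame-aligned

predicted_acoustic = model(labeled_text, n_frames=second_acoustic.size(1))  # the "fifth acoustic feature"
loss = criterion(predicted_acoustic, second_acoustic)
optimizer.zero_grad()
loss.backward()
optimizer.step()                                    # adjust coefficient values of the model parameters
print(float(loss))
```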
二、对音色子模型102进行训练2. Train the timbre sub-model 102
针对音色子模型102进行训练包括两个阶段,其中,第一阶段是基于第二样本音频对音色子模型进行训练,获得中间模型;第二阶段是基于第三样本音频对中间模型进行微调,获得最终的音色子模型。Training the timbre sub-model 102 includes two stages: in the first stage, the timbre sub-model is trained based on the second sample audio to obtain an intermediate model; in the second stage, the intermediate model is fine-tuned based on the third sample audio to obtain the final timbre sub-model.
其中,本公开对于第二样本音频的音色不作限定;此外,第三样本音频为具有目标音色的样本音频。Wherein, the present disclosure does not limit the timbre of the second sample audio; in addition, the third sample audio is a sample audio with a target timbre.
需要说明的是,上述音色子模型输出的频谱特征可以是梅尔频谱特征,或者,也可以是其他类型的频谱特征。在接下来的示例中,以输入至音色子模型的第二样本音频对应的第一标注频谱特征为第一标注梅尔频谱特征、第三样本音频对应的第二标注频谱特征为第二标注梅尔频谱特征、音色子模型输出的预测频谱特征为预测梅尔频谱特征为例进行举例说明。It should be noted that the spectral features output by the above timbre sub-model may be Mel spectral features, or may be other types of spectral features. In the following examples, the first labeled spectral feature corresponding to the second sample audio input into the timbre sub-model is taken to be a first labeled Mel spectral feature, the second labeled spectral feature corresponding to the third sample audio is taken to be a second labeled Mel spectral feature, and the predicted spectral feature output by the timbre sub-model is taken to be a predicted Mel spectral feature, for the purpose of illustration.
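As an illustration of how a labeled Mel spectral feature might be prepared for a sample audio, the following sketch uses librosa; the file name, STFT parameters, number of Mel bins, and the log compression step are assumptions made for this example.

```python
import librosa
import numpy as np

# Compute a labeled Mel spectral feature for one sample audio (placeholder file name and parameters).
y, sr = librosa.load("second_sample_audio.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))   # log compression is a common choice, assumed here
print(log_mel.shape)                          # (80 Mel bins, number of frames)
```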
下面对音色子模型102的训练过程进行详细介绍:The training process of the timbre sub-model 102 is described in detail below:
第一阶段:The first stage:
在第一阶段的训练中,音色子模型102,用于根据第二样本音频进行迭代训练,获得中间模型。In the first stage of training, the timbre sub-model 102 is used to perform iterative training according to the second sample audio to obtain an intermediate model.
音色子模型102通过学习第二样本音频对应的第三声学特征和第二样本音频的第一标注梅尔频谱特征之间的映射关系,获得针对音色具有一定的语音合成控制能力的中间模型,其中,第一标注梅尔频谱特征包括:用于表征相应第二样本音频的音色的频谱特征。The timbre sub-model 102 learns the mapping relationship between the third acoustic feature corresponding to the second sample audio and the first labeled mel spectrum feature of the second sample audio, and obtains an intermediate model with certain speech synthesis control capabilities for timbre, wherein , the first marked Mel spectral feature includes: a spectral feature used to characterize the timbre of the corresponding second sample audio.
本公开对于第二样本音频的音色、时长、存储格式、第二样本音频的数量等等参数不作限定。第二样本音频可以包括具体目标音色的音频,也可以包括非目标音色的音频,或者,第二样本音频同时包括目标音色的音频和非目标音色的音频。The present disclosure does not limit parameters such as the timbre, duration, storage format, and quantity of the second sample audio of the second sample audio. The second sample audio may include the audio of the specific target tone, or may include the audio of the non-target tone, or the second sample audio may include both the audio of the target tone and the audio of the non-target tone.
在第一阶段的训练过程中,音色子模型102,用于对输入的第二样本音频对应的第三声学特征进行分析,并输出第二样本音频对应的预测梅尔频谱特征;再基于第二样本音频对应的第一标注梅尔频谱特征以及第二样本音频对应的预测梅尔频谱特征,对音色子模型102包括的参数的系数值进行调整;通过海量的第二样本音频对音色子模型102的不断迭代训练,获得中间模型。In the training process of the first stage, the timbre sub-model 102 is configured to analyze the third acoustic feature corresponding to the input second sample audio and output a predicted Mel spectral feature corresponding to the second sample audio; then, based on the first labeled Mel spectral feature corresponding to the second sample audio and the predicted Mel spectral feature corresponding to the second sample audio, the coefficient values of the parameters included in the timbre sub-model 102 are adjusted; through continuous iterative training of the timbre sub-model 102 with a large amount of second sample audio, an intermediate model is obtained.
在第一阶段的训练过程中,第一标注梅尔频谱特征可以理解为音色子模型102在第一阶段的学习目标。In the training process of the first stage, the first marked Mel spectrum feature can be understood as the learning goal of the timbre sub-model 102 in the first stage.
由于音色子模型102的输入是第二样本音频对应的第三声学特征,因此,第二样本音频无需标注对应的文本,从而可大大降低获取第二样本音频带来的时间及人力成本。且能够通过较低的成本获得大量的音频作为第二样本音频,用于音色子模型102的迭代训练,进而通过大量的第二样本音频对音色子模型102进行训练,使得中间模型具备较高的针对音色的语音合成控制能力。Since the input of the timbre sub-model 102 is the third acoustic feature corresponding to the second sample audio, the second sample audio does not need to be labeled with corresponding text, which can greatly reduce the time and labor cost of acquiring the second sample audio. Moreover, a large amount of audio can be obtained at a low cost as the second sample audio for iterative training of the timbre sub-model 102; by training the timbre sub-model 102 with a large amount of second sample audio, the intermediate model obtains a high speech-synthesis control capability for timbre.
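A minimal sketch of this first training stage is given below, assuming placeholder tensors for the acoustic features and labeled Mel features of a batch of second sample audio. A frame-wise multilayer perceptron stands in for the timbre sub-model here (the Fig. 1c structure is sketched separately later); the feature dimensions, loss, and optimizer are assumptions.

```python
import torch
import torch.nn as nn

# Stage 1: train on second sample audio (no transcripts needed). A frame-wise MLP stands in
# for the timbre sub-model; the 65 -> 80 dimensions and the loss/optimizer are assumptions.
timbre_model = nn.Sequential(nn.Linear(65, 256), nn.ReLU(), nn.Linear(256, 80))
optimizer = torch.optim.Adam(timbre_model.parameters(), lr=1e-4)
criterion = nn.L1Loss()

third_acoustic = torch.randn(8, 200, 65)     # labeled bottleneck (+ F0) features of second sample audio
labeled_mel = torch.randn(8, 200, 80)        # first labeled Mel spectral features (placeholders)

predicted_mel = timbre_model(third_acoustic)
loss = criterion(predicted_mel, labeled_mel)
optimizer.zero_grad()
loss.backward()
optimizer.step()
torch.save(timbre_model.state_dict(), "intermediate_model.pt")   # the stage-1 "intermediate model"
```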
第二阶段:second stage:
第二阶段中,是基于第三样本对中间模型进行训练,使中间模型学习目标音色,获得针对目标音色的语音合成控制能力。In the second stage, the intermediate model is trained based on the third sample, so that the intermediate model learns the target timbre and obtains the speech synthesis control ability for the target timbre.
需要说明的是,由于中间模型已经具备较高的针对音色的语音合成控制能力,因此,降低了对于第三样本音频的要求,例如,降低了对于第三样本音频的时长、第三样本音频的质量的要求,即使第三样本音频的时长较短、发音不清晰等等情况下,训练获得的最终的音色子模型102依然能够获得较高的针对目标音色的语音合成控制能力。It should be noted that, since the intermediate model already has a high speech-synthesis control capability for timbre, the requirements on the third sample audio are reduced, for example, the requirements on the duration and quality of the third sample audio; even if the duration of the third sample audio is short, the pronunciation is unclear, and so on, the final timbre sub-model 102 obtained through training can still achieve a high speech-synthesis control capability for the target timbre.
此外,第三样本音频具有目标音色,第三样本音频可以是用户录制的音频,也可以是用户上传的想要的音色的音频,本公开对于第三样本音频的来源以及获取方式不作限定。In addition, the third sample audio has a target tone, and the third sample audio may be an audio recorded by a user, or may be an audio of a desired tone uploaded by a user, and the disclosure does not limit the source and acquisition method of the third sample audio.
具体地,将第三样本音频对应的第四声学特征输入至中间模型,获取中间模型输出的第三样本音频对应的预测梅尔频谱特征;再基于第三样本音频对应的第二标注梅尔频谱特征以及第三样本音频对应的预测梅尔频谱特征,计算本轮训练对应的损失函数信息;根据损失函数信息,对中间模型包括的参数的系数值进行调整,从而获得最终的音色子模型102。Specifically, the fourth acoustic feature corresponding to the third sample audio is input into the intermediate model, and the predicted Mel spectral feature corresponding to the third sample audio output by the intermediate model is obtained; then, based on the second labeled Mel spectral feature corresponding to the third sample audio and the predicted Mel spectral feature corresponding to the third sample audio, the loss function information corresponding to the current round of training is calculated; according to the loss function information, the coefficient values of the parameters included in the intermediate model are adjusted, so as to obtain the final timbre sub-model 102.
在第二阶段的训练过程中,第三样本音频对应的第二标注梅尔频谱特征可以理解为中间模型的学习目标。During the training process of the second stage, the second labeled mel spectrum feature corresponding to the third audio sample can be understood as the learning target of the intermediate model.
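Continuing the stage-one sketch above, the second stage could fine-tune the saved intermediate model on a small amount of target-timbre audio roughly as follows. The reduced learning rate and the fixed number of iterations are assumptions, not requirements of the disclosure, and the model layout matches the stage-one stand-in.

```python
import torch
import torch.nn as nn

# Stage 2: fine-tune the saved intermediate model on a small amount of third sample audio
# (audio with the target timbre).
timbre_model = nn.Sequential(nn.Linear(65, 256), nn.ReLU(), nn.Linear(256, 80))
timbre_model.load_state_dict(torch.load("intermediate_model.pt"))
optimizer = torch.optim.Adam(timbre_model.parameters(), lr=1e-5)   # lower learning rate is an assumption
criterion = nn.L1Loss()

fourth_acoustic = torch.randn(2, 150, 65)    # labeled bottleneck (+ F0) features of target-timbre audio
labeled_mel = torch.randn(2, 150, 80)        # second labeled Mel spectral features (placeholders)

for _ in range(10):                           # a few fine-tuning iterations on the small data set
    predicted_mel = timbre_model(fourth_acoustic)
    loss = criterion(predicted_mel, labeled_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```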
结合前述关于韵律子模型101的介绍,在训练过程中,若韵律子模型101根据输入的第一样本音频的标注文本,输出的第五声学特征包括预测瓶颈特征,即韵律子模型101能够实现文本到瓶颈特征的映射,则输入音色子模型102的第二样本音频对应的第三声学特征包括第二样本音频对应的第二标注瓶颈特征,且输入中间模型的第三样本音频对应的第四声学特征包括第三样本音频对应的第三标注瓶颈特征。In combination with the foregoing introduction to the prosody sub-model 101, during training, if the fifth acoustic feature output by the prosody sub-model 101 according to the input labeled text of the first sample audio includes the predicted bottleneck feature, that is, the prosody sub-model 101 can realize the mapping from text to bottleneck features, then the third acoustic feature corresponding to the second sample audio input into the timbre sub-model 102 includes the second labeled bottleneck feature corresponding to the second sample audio, and the fourth acoustic feature corresponding to the third sample audio input into the intermediate model includes the third labeled bottleneck feature corresponding to the third sample audio.
其中,第二标注瓶颈特征和第三标注瓶颈特征可以通过ASR模型的编码器分别对第二样本音频和第三样本音频进行瓶颈特征提取获得,与获取第一标注瓶颈特征的实现方式类似,简明起见,此处不再赘述。The second labeled bottleneck feature and the third labeled bottleneck feature may be obtained by performing bottleneck feature extraction on the second sample audio and the third sample audio respectively with the encoder of the ASR model, in a manner similar to that of obtaining the first labeled bottleneck feature; for brevity, details are not repeated here.
在训练过程中,若韵律子模型101根据输入的第一样本音频的标注文本,输出的第五声学特征包括预测瓶颈特征和预测基频特征,即韵律子模型101能够实现文本到瓶颈特征和基频特征的映射,则输入音色子模型102的第二样本音频对应的第三声学特征包括第二样本音频对应的第二标注瓶颈特征和第二标注基频特征,且输入中间模型的第三样本音频对应的第四声学特征包括第三样本音频对应的第三标注瓶颈特征和第三标注基频特征。During training, if the fifth acoustic feature output by the prosody sub-model 101 according to the input labeled text of the first sample audio includes the predicted bottleneck feature and the predicted fundamental frequency feature, that is, the prosody sub-model 101 can realize the mapping from text to bottleneck features and fundamental frequency features, then the third acoustic feature corresponding to the second sample audio input into the timbre sub-model 102 includes the second labeled bottleneck feature and the second labeled fundamental frequency feature corresponding to the second sample audio, and the fourth acoustic feature corresponding to the third sample audio input into the intermediate model includes the third labeled bottleneck feature and the third labeled fundamental frequency feature corresponding to the third sample audio.
其中,第二标注瓶颈特征和第三标注瓶颈特征可以通过ASR模型的编码器分别对第二样本音频和第三样本音频进行瓶颈特征提取获得,与获取第一标注瓶颈特征的实现方式类似;第二标注基频特征和第三标注基频特征可以通过数字信号处理技术,分别对第二样本音频和第三样本音频进行分析获得,与获取第一标注基频特征的实现方式类似,简明起见,此处不再赘述。The second labeled bottleneck feature and the third labeled bottleneck feature may be obtained by performing bottleneck feature extraction on the second sample audio and the third sample audio respectively with the encoder of the ASR model, in a manner similar to that of obtaining the first labeled bottleneck feature; the second labeled fundamental frequency feature and the third labeled fundamental frequency feature may be obtained by analyzing the second sample audio and the third sample audio respectively with digital signal processing technology, in a manner similar to that of obtaining the first labeled fundamental frequency feature; for brevity, details are not repeated here.
综上,在训练过程中,音色子模型102的输入和韵律子模型101的输出保持一致。To sum up, during the training process, the input of the timbre sub-model 102 is consistent with the output of the prosody sub-model 101 .
此外,在对音色子模型102进行训练时,音色子模型102包括的各参数对应的系数的初始值可以是预先设定的,也可以是随机初始化的,本公开对此不作限定。In addition, when the timbre sub-model 102 is trained, the initial values of the coefficients corresponding to the parameters included in the timbre sub-model 102 may be preset or initialized randomly, which is not limited in the present disclosure.
且在第一阶段的训练过程中和第二阶段的训练过程中,分别采用的音色子模型对应的损失函数可以相同,也可以不同,本公开对此不作限定。Moreover, in the training process of the first stage and the training process of the second stage, the loss functions corresponding to the timbre sub-models used respectively may be the same or different, which is not limited in the present disclosure.
其中,图1c示例性地示出了音色子模型102的一种实现方式。参照图1c所示,音色子模型102可以采用自注意力(self-attention)的网络结构实现。Wherein, FIG. 1 c exemplarily shows an implementation manner of the timbre sub-model 102 . Referring to FIG. 1 c , the timbre sub-model 102 can be implemented using a self-attention network structure.
图1c中,音色子模型102包括:卷积网络1021、一个或者多个残差网络1022。其中,每个残差网络1022包括:自注意力网络1022a以 及线性网络1022b。In FIG. 1 c , the timbre sub-model 102 includes: a convolutional network 1021 and one or more residual networks 1022 . Wherein, each residual network 1022 includes: a self-attention network 1022a and a linear network 1022b.
卷积网络1021,主要用于对输入的样本音频对应的声学特征进行卷积处理,建模局部特征信息。其中,卷积网络1021可以包括一个或者多个卷积层,本公开对于卷积网络1021包括的卷积层的数量不做限制。且卷积网络1021将局部特征信息输入至相连接的残差网络1022。The convolution network 1021 is mainly used to perform convolution processing on the acoustic features corresponding to the input sample audio, and to model local feature information. Wherein, the convolutional network 1021 may include one or more convolutional layers, and this disclosure does not limit the number of convolutional layers included in the convolutional network 1021 . And the convolutional network 1021 inputs the local feature information to the connected residual network 1022 .
上述一个或多个残差网络1022用于对输入的局部特征信息进行处理,局部特征信息在经过上述一个或者多个残差网络1022之后,转换为频谱特征(如梅尔频谱特征)。The above one or more residual networks 1022 process the input local feature information; after passing through the one or more residual networks 1022, the local feature information is converted into spectral features (such as Mel spectral features).
应理解,中间模型与图1c所示的音色子模型102的结构相同,区别在于包括的参数的权重系数不完全相同。It should be understood that the structure of the intermediate model is the same as that of the timbre sub-model 102 shown in FIG. 1c, the difference lies in that the weight coefficients of the parameters included are not completely the same.
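A simplified PyTorch sketch of the Fig. 1c layout (a convolution network followed by residual blocks of self-attention plus a linear layer, then a projection to Mel bins) is given below; all sizes, the number of blocks, and the use of a multi-head attention module are illustrative assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Self-attention + linear layer with a residual connection (sizes are illustrative)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (batch, frames, dim)
        attn_out, _ = self.attn(x, x, x)           # self-attention over the frame sequence
        return x + self.linear(attn_out)           # residual connection

class TimbreSubModel(nn.Module):
    """Acoustic features (bottleneck + F0) -> Mel spectral features, loosely after Fig. 1c."""
    def __init__(self, in_dim=65, dim=256, n_blocks=4, n_mels=80):
        super().__init__()
        self.conv = nn.Sequential(                 # convolution network modelling local information
            nn.Conv1d(in_dim, dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.blocks = nn.ModuleList(ResidualBlock(dim) for _ in range(n_blocks))
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, acoustic):                   # (batch, frames, in_dim)
        x = self.conv(acoustic.transpose(1, 2)).transpose(1, 2)
        for block in self.blocks:
            x = block(x)
        return self.to_mel(x)                      # (batch, frames, n_mels)

model = TimbreSubModel()
mel = model(torch.randn(2, 200, 65))
print(mel.shape)                                   # torch.Size([2, 200, 80])
```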
通过前述对韵律子模型101和音色子模型102分别进行训练,最终获得满足语音合成要求的第一特征提取模型和第二特征提取模型;再将最终获得的第一特征提取模型和第二特征提取模型进行拼接,即获得能够合成目标音色的语音合成模型。Through the foregoing separate training of the prosody sub-model 101 and the timbre sub-model 102, a first feature extraction model and a second feature extraction model that meet the speech synthesis requirements are finally obtained; the finally obtained first feature extraction model and second feature extraction model are then spliced together to obtain a speech synthesis model capable of synthesizing the target timbre.
一些可能的实施方式中,语音合成模型100还可以包括:声码器(vocoder)103。声码器103用于将音色子模型102输出的频谱特征(如梅尔频谱特征)转换为音频。当然,声码器也可以作为独立的模块,不与语音合成模型绑定在一起。且本方案对于声码器的具体类型不做限制。In some possible implementation manners, the speech synthesis model 100 may further include: a vocoder (vocoder) 103 . The vocoder 103 is used to convert the spectral features (such as Mel spectral features) output by the timbre sub-model 102 into audio. Of course, the vocoder can also be used as an independent module, not bound together with the speech synthesis model. And this solution does not limit the specific type of the vocoder.
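As a stand-in for the vocoder 103, the following sketch inverts Mel spectral features back to a waveform with Griffin-Lim via librosa. A trained neural vocoder would normally be used instead; the sine-wave test signal and the STFT parameters are assumptions, and the parameters must match those used when the Mel features were computed.

```python
import numpy as np
import librosa
import soundfile as sf

# Invert Mel spectral features back to a waveform with Griffin-Lim (vocoder stand-in).
sr, n_fft, hop_length, n_mels = 16000, 1024, 256, 80
t = np.arange(0, 2.0, 1.0 / sr)
y = 0.5 * np.sin(2 * np.pi * 220.0 * t)          # a 220 Hz tone stands in for synthesized content
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
waveform = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=n_fft, hop_length=hop_length)
sf.write("synthesized.wav", waveform, sr)         # the recovered audio corresponding to the Mel features
```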
在上述图1a至图1c所示实施例的基础上,通过训练最终获得的目标语音合成模型具有稳定合成目标音色的音频的能力,基于此,可使用目标语音合成模型处理相应的语音合成业务。On the basis of the above-mentioned embodiments shown in FIG. 1a to FIG. 1c, the target speech synthesis model finally obtained through training has the ability to stably synthesize the audio of the target timbre. Based on this, the target speech synthesis model can be used to process corresponding speech synthesis services.
FIG. 2 is a flowchart of a speech synthesis method provided by an embodiment of the present disclosure. Referring to FIG. 2, the speech synthesis method provided by this embodiment includes:

S201: Obtain text to be processed.

The text to be processed may include one or more characters, or one or more phonemes. The text to be processed is used to synthesize audio with a target rap style and a target timbre.

The present disclosure does not limit the manner in which the electronic device obtains the text to be processed.

For example, the electronic device may present a text input window and a soft keyboard, and the user inputs the text to be processed into the text input window by operating the soft keyboard displayed on the electronic device; alternatively, the user may paste the text to be processed into the text input window; alternatively, the user may input a piece of audio to the electronic device by voice, and the electronic device obtains the text to be processed by performing speech recognition on the input audio; alternatively, a file containing the text to be processed may be imported into the electronic device, so that the electronic device obtains the text to be processed.

The user may, but is not limited to, input the text to be processed into the electronic device in the manners exemplified above. For the user, the operation is simple and convenient, which helps increase the user's enthusiasm for creating multimedia content.
S202: Input the text to be processed into a speech synthesis model, and obtain the spectral features, corresponding to the text to be processed, that are output by the speech synthesis model.

In some embodiments, the text to be processed is input into the speech synthesis model; the prosody sub-model performs feature extraction on the text to be processed and outputs a first acoustic feature corresponding to the text to be processed, where the first acoustic feature includes a bottleneck feature corresponding to the text to be processed, and the bottleneck feature is used to characterize the target rap style; the timbre sub-model receives the first acoustic feature corresponding to the text to be processed as input and outputs the spectral features corresponding to the text to be processed.

In some other embodiments, the text to be processed is input into the speech synthesis model; the prosody sub-model performs feature extraction on the text to be processed and outputs a first acoustic feature corresponding to the text to be processed, where the first acoustic feature includes a bottleneck feature corresponding to the text to be processed and a fundamental frequency feature corresponding to the text to be processed, the bottleneck feature is used to characterize the target rap style, and the fundamental frequency feature is used to characterize pitch; the timbre sub-model receives the first acoustic feature corresponding to the text to be processed as input and outputs the spectral features (for example, Mel spectral features) corresponding to the text to be processed.

The speech synthesis model may be obtained through the implementations of the embodiments shown in FIG. 1a to FIG. 1c. For the network structure of the speech synthesis model and the implementation of training it, reference may be made to the detailed description of the embodiments shown in FIG. 1a to FIG. 1c; for brevity, details are not repeated here.

With reference to the embodiments shown in FIG. 1a and FIG. 1b, the text encoding network included in the prosody sub-model may receive the text to be processed as input and model an intermediate feature sequence by analyzing the context and temporal relations of the text to be processed; the attention coefficients output by the attention network included in the prosody sub-model are then weighted-averaged with the intermediate feature sequence to obtain a weighted average result; the decoding network included in the prosody sub-model performs feature conversion on the input weighted average result and the output of the previous time step, and outputs the first acoustic feature corresponding to the text to be processed, where the first acoustic feature may include the bottleneck feature corresponding to the text to be processed, or may include the bottleneck feature and the fundamental frequency feature corresponding to the text to be processed.
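A minimal PyTorch sketch of such an encoder-attention-decoder structure is given below. The embedding layer, GRU encoder, dot-product attention, and autoregressive GRU-cell decoder are assumptions chosen to mirror the description (intermediate feature sequence, attention coefficients, weighted average, and a decoder that consumes the previous time step's output); the actual network types and dimensions are not specified by the disclosure.

```python
import torch
import torch.nn as nn

class ProsodySubModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, bn_dim=256, use_f0=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Text encoding network: models context / temporal relations of the text.
        self.encoder = nn.GRU(emb_dim, emb_dim, batch_first=True, bidirectional=True)
        self.attn_query = nn.Linear(bn_dim, 2 * emb_dim)
        self.decoder = nn.GRUCell(2 * emb_dim + bn_dim, bn_dim)
        out_dim = bn_dim + (1 if use_f0 else 0)     # bottleneck frame (+ one F0 value)
        self.proj = nn.Linear(bn_dim, out_dim)

    def forward(self, token_ids, n_frames):
        memory, _ = self.encoder(self.embed(token_ids))  # intermediate feature sequence
        h = memory.new_zeros(token_ids.size(0), self.decoder.hidden_size)
        prev = memory.new_zeros(token_ids.size(0), self.decoder.hidden_size)
        outputs = []
        for _ in range(n_frames):
            # Attention coefficients over the intermediate sequence, then a weighted average.
            scores = torch.bmm(memory, self.attn_query(prev).unsqueeze(-1)).squeeze(-1)
            weights = torch.softmax(scores, dim=-1)
            context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1)
            # Decoder consumes the weighted average and the previous time step's output.
            h = self.decoder(torch.cat([context, prev], dim=-1), h)
            prev = h
            outputs.append(self.proj(h))
        return torch.stack(outputs, dim=1)           # (batch, n_frames, bn_dim [+ 1])
```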
With reference to the embodiments shown in FIG. 1a and FIG. 1c, the convolutional network included in the timbre sub-model receives the first acoustic feature corresponding to the text to be processed as input, performs convolution processing on it, and models local feature information; the convolutional network feeds the local feature information into the connected residual network, and after one or more residual networks, the spectral features (for example, Mel spectral features) corresponding to the text to be processed are output.

S203: Obtain, according to the spectral features corresponding to the text to be processed, target audio corresponding to the text to be processed, where the target audio has the target timbre and the target rap style.

In a possible implementation, the electronic device may perform, based on a vocoder, digital signal processing on the spectral features corresponding to the text to be processed (for example, the Mel spectral features corresponding to the text to be processed), thereby converting them into audio with the target timbre and the target rap style, that is, the target audio.

It should be noted that the vocoder may be part of the speech synthesis model, in which case the speech synthesis model can directly output audio with the target timbre and the target rap style; in other cases, the vocoder may be an independent module outside the speech synthesis model, which receives the spectral features corresponding to the text to be processed as input and converts them into audio with the target timbre and the target rap style.
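As a self-contained illustration of this last conversion step, the sketch below inverts the Mel spectral features into a waveform with Griffin-Lim via librosa; in practice a neural vocoder would typically be used, and the sample rate, FFT parameters, and tensor layout here are assumptions rather than values taken from the disclosure.

```python
import librosa
import soundfile as sf

def mel_to_waveform(mel, sr=22050, n_fft=1024, hop_length=256):
    # mel: (n_mels, frames) power Mel spectrogram produced by the timbre sub-model.
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length)

# End-to-end use of the pipeline described above (model objects are assumed placeholders):
# first_feat = prosody_model(token_ids, n_frames)      # bottleneck (+ F0) features
# mel = timbre_model(first_feat)                        # (batch, frames, n_mels)
# wav = mel_to_waveform(mel[0].T.detach().cpu().numpy())
# sf.write("target_audio.wav", wav, 22050)
```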
In the speech synthesis method provided by this embodiment, the text to be processed is analyzed based on the speech synthesis model, which outputs the spectral features corresponding to the text to be processed. The speech synthesis model includes a prosody sub-model and a timbre sub-model: the prosody sub-model receives the text to be processed as input and outputs the first acoustic feature corresponding to the text to be processed, where the first acoustic feature includes a bottleneck feature used to characterize the target rap style; the timbre sub-model receives the first acoustic feature as input and outputs the spectral features corresponding to the text to be processed, which carry the information of the target timbre. By converting the spectral features output by the speech synthesis model, rap audio with the target rap style and the target timbre can be obtained, satisfying the user's personalized requirements for audio. Moreover, the speech synthesis model supports the conversion of arbitrary text to be processed, which lowers the requirements on the user's music creation ability and helps increase the user's enthusiasm for creating multimedia content.
FIG. 3 is a schematic flowchart of a speech synthesis method provided by another embodiment of the present disclosure. Referring to FIG. 3, on the basis of the embodiment shown in FIG. 2, after step S203 of obtaining the target audio corresponding to the text to be processed according to the spectral features corresponding to the text to be processed, the speech synthesis method provided by this embodiment may further include:

S204: Add the target audio corresponding to the text to be processed to target multimedia content.

The present disclosure does not limit the implementation of adding the target audio to the target multimedia content. For example, when adding the target audio to the target multimedia content, the electronic device may speed up or slow down the playback of the target audio according to the duration of the target multimedia content and the duration of the target audio; it may also add subtitles corresponding to the target audio to the playback interface of the target multimedia content, or omit them; and if subtitles corresponding to the target audio are added to the playback interface of the target multimedia content, display parameters of the subtitles such as color, font size, and font may also be set.
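One possible way to carry out this step is to mux the synthesized audio into the video with ffmpeg, using the atempo filter to speed the audio up or slow it down and, optionally, rendering a subtitle file onto the frames. The sketch below assumes ffmpeg is installed; the file names, tempo factor, and subtitle file are placeholders, and this is not an implementation mandated by the disclosure.

```python
import subprocess

def add_audio_to_video(video_path, audio_path, out_path, tempo=1.0, srt_path=None):
    cmd = ["ffmpeg", "-y", "-i", video_path, "-i", audio_path,
           "-map", "0:v", "-map", "1:a",
           "-filter:a", f"atempo={tempo}",           # speed the audio up or down
           "-shortest"]
    if srt_path:
        cmd += ["-vf", f"subtitles={srt_path}"]      # render subtitles onto the video
    else:
        cmd += ["-c:v", "copy"]                       # no re-encode needed without subtitles
    subprocess.run(cmd + [out_path], check=True)

# add_audio_to_video("content.mp4", "target_audio.wav", "out.mp4", tempo=1.1, srt_path="lyrics.srt")
```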
In the method provided by this embodiment, the text to be processed is analyzed based on the speech synthesis model, which outputs the spectral features corresponding to the text to be processed. The speech synthesis model includes a prosody sub-model and a timbre sub-model: the prosody sub-model receives the text to be processed as input and outputs the first acoustic feature corresponding to the text to be processed, where the first acoustic feature includes a bottleneck feature used to characterize the target rap style; the timbre sub-model receives the first acoustic feature as input and outputs the spectral features corresponding to the text to be processed, including spectral features used to characterize the target timbre. By converting the spectral features output by the speech synthesis model, audio with the target rap style and the target timbre can be obtained, satisfying the user's personalized requirements for audio; and the speech synthesis model supports the conversion of arbitrary text to be processed, which lowers the requirements on the user's music creation ability and helps increase the user's enthusiasm for creating multimedia content.

In addition, adding the target audio to the target multimedia content makes the target multimedia content more engaging, thereby satisfying the user's need to create creative videos.
Exemplarily, the present disclosure further provides a speech synthesis apparatus.

FIG. 4 is a schematic structural diagram of a speech synthesis apparatus provided by an embodiment of the present disclosure. Referring to FIG. 4, the speech synthesis apparatus 400 provided by this embodiment includes:

an obtaining module 401, configured to obtain text to be processed; and

a processing module 402, configured to input the text to be processed into a speech synthesis model and obtain the spectral features, corresponding to the text to be processed, that are output by the speech synthesis model; where the speech synthesis model includes a prosody sub-model and a timbre sub-model, the prosody sub-model is configured to output, according to the input text to be processed, a first acoustic feature corresponding to the text to be processed, the first acoustic feature including a bottleneck feature used to characterize a target rap style, and the timbre sub-model is configured to output, according to the input first acoustic feature, the spectral features corresponding to the text to be processed, the spectral features corresponding to the text to be processed including spectral features used to characterize a target timbre.

The processing module 402 is further configured to obtain, according to the spectral features corresponding to the text to be processed, target audio corresponding to the text to be processed, where the target audio has the target timbre and the target rap style.
In some possible implementations, the prosody sub-model is obtained by training according to annotated text corresponding to first sample audio and a second acoustic feature corresponding to the first sample audio;

the first sample audio includes at least one audio of the target rap style, and the second acoustic feature includes a first annotated bottleneck feature corresponding to the first sample audio.

In some possible implementations, the timbre sub-model is obtained by training according to a third acoustic feature corresponding to second sample audio, a first annotated spectral feature corresponding to the second sample audio, a fourth acoustic feature corresponding to third sample audio, and a second annotated spectral feature corresponding to the third sample audio;

where the third acoustic feature includes a second annotated bottleneck feature corresponding to the second sample audio, the third sample audio includes at least one audio with the target timbre, and the fourth acoustic feature corresponding to the third sample audio includes a third annotated bottleneck feature corresponding to the third sample audio.

In some possible implementations, the first annotated bottleneck feature corresponding to the first sample audio, the second annotated bottleneck feature corresponding to the second sample audio, and the third annotated bottleneck feature corresponding to the third sample audio are obtained by performing bottleneck feature extraction on the input first sample audio, second sample audio, and third sample audio, respectively, with an encoder of an end-to-end speech recognition model.
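As an illustration of how such annotated bottleneck features could be produced, the sketch below runs the acoustic features of a sample audio through the encoder of an end-to-end speech recognition model and keeps the frame-level hidden sequence. The asr_encoder object, the Mel front-end, and the tensor shapes are assumptions; any ASR encoder exposing an intermediate representation could play this role.

```python
import torch

@torch.no_grad()
def extract_bottleneck(asr_encoder, mel_frames):
    # mel_frames: (1, frames, n_mels) acoustic features of one sample audio.
    asr_encoder.eval()
    hidden = asr_encoder(mel_frames)     # (1, frames, hidden_dim) encoder outputs
    return hidden.squeeze(0)             # frame-level bottleneck features

# bn_first  = extract_bottleneck(asr_encoder, mel_of_first_sample)
# bn_second = extract_bottleneck(asr_encoder, mel_of_second_sample)
# bn_third  = extract_bottleneck(asr_encoder, mel_of_third_sample)
```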
In some possible implementations, the second acoustic feature further includes a first annotated fundamental frequency feature corresponding to the first sample audio;

the third acoustic feature further includes a second annotated fundamental frequency feature corresponding to the second sample audio; the fourth acoustic feature further includes a third annotated fundamental frequency feature corresponding to the third sample audio; and

the first acoustic feature further includes a fundamental frequency feature corresponding to the text to be processed.

In some possible implementations, the processing module 402 is further configured to add the target audio corresponding to the text to be processed to target multimedia content.

The speech synthesis apparatus provided by this embodiment can be used to execute the technical solutions of any of the foregoing method embodiments; its implementation principles and technical effects are similar, and reference may be made to the detailed description of the foregoing method embodiments. For brevity, details are not repeated here.
Exemplarily, the present disclosure further provides an electronic device.

FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure. Referring to FIG. 5, the electronic device provided by this embodiment includes a memory 501 and a processor 502.

The memory 501 may be an independent physical unit connected to the processor 502 through a bus 503. The memory 501 and the processor 502 may also be integrated together and implemented in hardware, or the like.

The memory 501 is used to store program instructions, and the processor 502 invokes the program instructions to execute the operations of any of the foregoing method embodiments.

Optionally, when part or all of the methods in the foregoing embodiments are implemented in software, the electronic device 500 may also include only the processor 502. In that case, the memory 501 for storing the program is located outside the electronic device 500, and the processor 502 is connected to the memory through circuits/wires to read and execute the program stored in the memory.

The processor 502 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.

The processor 502 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

The memory 501 may include volatile memory, such as random-access memory (RAM); it may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); and it may also include a combination of the above types of memory.
The present disclosure further provides a readable storage medium including computer program instructions; when the computer program instructions are executed by at least one processor of an electronic device, the speech synthesis method shown in any of the foregoing method embodiments is implemented.

The present disclosure further provides a program product including a computer program, where the computer program is stored in a readable storage medium, at least one processor of the electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the electronic device implements the speech synthesis method shown in any of the foregoing method embodiments.

It should be noted that, herein, relational terms such as "first" and "second" are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.

The above descriptions are only specific implementations of the present disclosure, enabling those skilled in the art to understand or implement the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present disclosure. Therefore, the present disclosure will not be limited to the embodiments described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

  1. A speech synthesis method, comprising:

    obtaining text to be processed;

    inputting the text to be processed into a speech synthesis model, and obtaining spectral features, corresponding to the text to be processed, that are output by the speech synthesis model; wherein the speech synthesis model comprises a prosody sub-model and a timbre sub-model, the prosody sub-model is configured to output, according to the input text to be processed, a first acoustic feature corresponding to the text to be processed, the first acoustic feature comprising a bottleneck feature used to characterize a target rap style; and the timbre sub-model is configured to output, according to the input first acoustic feature, the spectral features corresponding to the text to be processed, the spectral features corresponding to the text to be processed comprising spectral features used to characterize a target timbre; and

    obtaining, according to the spectral features corresponding to the text to be processed, target audio corresponding to the text to be processed, the target audio having the target timbre and the target rap style.
  2. The method according to claim 1, wherein the prosody sub-model is obtained by training according to annotated text corresponding to first sample audio and a second acoustic feature corresponding to the first sample audio;

    the first sample audio comprises at least one audio of the target rap style; and the second acoustic feature comprises a first annotated bottleneck feature corresponding to the first sample audio.

  3. The method according to claim 2, wherein the timbre sub-model is obtained by training according to a third acoustic feature corresponding to second sample audio, a first annotated spectral feature corresponding to the second sample audio, a fourth acoustic feature corresponding to third sample audio, and a second annotated spectral feature corresponding to the third sample audio;

    wherein the third acoustic feature comprises a second annotated bottleneck feature corresponding to the second sample audio; the third sample audio comprises at least one audio having the target timbre; and the fourth acoustic feature corresponding to the third sample audio comprises a third annotated bottleneck feature corresponding to the third sample audio.

  4. The method according to claim 3, wherein the first annotated bottleneck feature corresponding to the first sample audio, the second annotated bottleneck feature corresponding to the second sample audio, and the third annotated bottleneck feature corresponding to the third sample audio are obtained by performing bottleneck feature extraction on the input first sample audio, second sample audio, and third sample audio, respectively, with an encoder of an end-to-end speech recognition model.
  5. The method according to claim 3, wherein the second acoustic feature further comprises a first annotated fundamental frequency feature corresponding to the first sample audio;

    the third acoustic feature further comprises a second annotated fundamental frequency feature corresponding to the second sample audio; the fourth acoustic feature further comprises a third annotated fundamental frequency feature corresponding to the third sample audio; and

    the first acoustic feature further comprises a fundamental frequency feature corresponding to the text to be processed.

  6. The method according to claim 1, further comprising:

    adding the target audio corresponding to the text to be processed to target multimedia content.
  7. A speech synthesis apparatus, comprising:

    an obtaining module, configured to obtain text to be processed; and

    a processing module, configured to input the text to be processed into a speech synthesis model and obtain spectral features, corresponding to the text to be processed, that are output by the speech synthesis model; wherein the speech synthesis model comprises a prosody sub-model and a timbre sub-model, the prosody sub-model is configured to output, according to the input text to be processed, a first acoustic feature corresponding to the text to be processed, the first acoustic feature comprising a bottleneck feature, corresponding to the text to be processed, that is used to characterize a target rap style; and the timbre sub-model is configured to output, according to the input first acoustic feature, the spectral features corresponding to the text to be processed, the spectral features corresponding to the text to be processed comprising spectral features used to characterize a target timbre;

    the processing module being further configured to obtain, according to the spectral features corresponding to the text to be processed, target audio corresponding to the text to be processed, the target audio having the target timbre and the target rap style.
  8. An electronic device, comprising: a memory, a processor, and a computer program;

    the memory being configured to store the computer program; and

    the processor being configured to execute the computer program to implement the speech synthesis method according to any one of claims 1 to 6.

  9. A readable storage medium, comprising computer program instructions;

    wherein, when the computer program instructions are executed by at least one processor of an electronic device, the speech synthesis method according to any one of claims 1 to 6 is implemented.

  10. A program product, comprising computer program instructions;

    wherein the computer program instructions are stored in a readable storage medium, at least one processor of an electronic device reads the computer program instructions from the readable storage medium, and the at least one processor executes the computer program instructions to implement the speech synthesis method according to any one of claims 1 to 6.
PCT/CN2022/120120 2021-09-22 2022-09-21 Speech synthesis method and apparatus, electronic device, and readable storage medium WO2023045954A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111107875.8A CN115938338A (en) 2021-09-22 2021-09-22 Speech synthesis method, device, electronic equipment and readable storage medium
CN202111107875.8 2021-09-22

Publications (1)

Publication Number Publication Date
WO2023045954A1 true WO2023045954A1 (en) 2023-03-30

Family

ID=85720073

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/120120 WO2023045954A1 (en) 2021-09-22 2022-09-21 Speech synthesis method and apparatus, electronic device, and readable storage medium

Country Status (2)

Country Link
CN (1) CN115938338A (en)
WO (1) WO2023045954A1 (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105261355A (en) * 2015-09-02 2016-01-20 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus
US10692484B1 (en) * 2018-06-13 2020-06-23 Amazon Technologies, Inc. Text-to-speech (TTS) processing
CN111326138A (en) * 2020-02-24 2020-06-23 北京达佳互联信息技术有限公司 Voice generation method and device
CN111402855A (en) * 2020-03-06 2020-07-10 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111508469A (en) * 2020-04-26 2020-08-07 北京声智科技有限公司 Text-to-speech conversion method and device
CN112509552A (en) * 2020-11-27 2021-03-16 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112365882A (en) * 2020-11-30 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, model training method, device, equipment and storage medium
CN113409764A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727288A (en) * 2024-02-07 2024-03-19 翌东寰球(深圳)数字科技有限公司 Speech synthesis method, device, equipment and storage medium
CN117727288B (en) * 2024-02-07 2024-04-30 翌东寰球(深圳)数字科技有限公司 Speech synthesis method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN115938338A (en) 2023-04-07

Similar Documents

Publication Publication Date Title
US10789290B2 (en) Audio data processing method and apparatus, and computer storage medium
CN106898340B (en) Song synthesis method and terminal
US9552807B2 (en) Method, apparatus and system for regenerating voice intonation in automatically dubbed videos
US9318100B2 (en) Supplementing audio recorded in a media file
CN108831437B (en) Singing voice generation method, singing voice generation device, terminal and storage medium
WO2020098115A1 (en) Subtitle adding method, apparatus, electronic device, and computer readable storage medium
KR20210048441A (en) Matching mouth shape and movement in digital video to alternative audio
CN110675886A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
TW200821862A (en) RSS content administration for rendering RSS content on a digital audio player
CN110599998B (en) Voice data generation method and device
CN113053357B (en) Speech synthesis method, apparatus, device and computer readable storage medium
CN111465982A (en) Signal processing device and method, training device and method, and program
WO2023045954A1 (en) Speech synthesis method and apparatus, electronic device, and readable storage medium
WO2022126904A1 (en) Voice conversion method and apparatus, computer device, and storage medium
US11462207B1 (en) Method and apparatus for editing audio, electronic device and storage medium
CN113012678A (en) Method and device for synthesizing voice of specific speaker without marking
CN112580669B (en) Training method and device for voice information
TWI223231B (en) Digital audio with parameters for real-time time scaling
CN116013274A (en) Speech recognition method, device, computer equipment and storage medium
CN115171645A (en) Dubbing method and device, electronic equipment and storage medium
CN115910021A (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN113990295A (en) Video generation method and device
CN113870833A (en) Speech synthesis related system, method, device and equipment
JP2020173776A (en) Method and device for generating video
CN117423329B (en) Model training and voice generating method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22872000

Country of ref document: EP

Kind code of ref document: A1