CN113593520B - Singing voice synthesizing method and device, electronic equipment and storage medium - Google Patents

Singing voice synthesizing method and device, electronic equipment and storage medium

Info

Publication number
CN113593520B
Authority
CN
China
Prior art keywords
phoneme
singing voice
sequence
audio
singing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111048649.7A
Other languages
Chinese (zh)
Other versions
CN113593520A (en)
Inventor
周阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huya Technology Co Ltd
Original Assignee
Guangzhou Huya Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huya Technology Co Ltd filed Critical Guangzhou Huya Technology Co Ltd
Priority to CN202111048649.7A priority Critical patent/CN113593520B/en
Publication of CN113593520A publication Critical patent/CN113593520A/en
Application granted granted Critical
Publication of CN113593520B publication Critical patent/CN113593520B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

Embodiments of this application relate to the field of speech synthesis and provide a singing voice synthesis method and apparatus, an electronic device, and a storage medium. A phoneme sequence of the song to be synthesized is obtained and input into a singing voice synthesis model; meanwhile, Mel-spectrum features, obtained in advance by processing a reference audio, are input at the decoding stage of the singing voice synthesis model, so that the synthesized singing audio more closely resembles a real person's singing and the user's listening experience is improved.

Description

Singing voice synthesizing method and device, electronic equipment and storage medium
Technical Field
Embodiments of this application relate to the field of speech synthesis, and in particular to a singing voice synthesis method and apparatus, an electronic device, and a storage medium.
Background
In recent years, singing voice synthesis has become a popular research topic; this technology synthesizes a music score into audio of vocal singing. However, existing singing voice synthesis has low naturalness and a strong mechanical feel, and cannot achieve a convincing human-like effect. How to synthesize, from a music score, songs with high naturalness that sound like a real person singing is therefore a technical problem that researchers still need to solve.
Disclosure of Invention
Embodiments of this application aim to provide a singing voice synthesis method and apparatus, an electronic device, and a storage medium that improve the naturalness of singing voice synthesis so that the result approaches a real person's singing.
To achieve the above object, the technical solutions adopted in the embodiments of this application are as follows:
In a first aspect, an embodiment of the present application provides a singing voice synthesizing method, including:
obtaining a phoneme sequence of a song to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes and a pitch and a phoneme duration corresponding to each phoneme;
inputting the phoneme sequence into a singing voice synthesis model, and encoding the phoneme sequence by using an encoding network of the singing voice synthesis model to obtain a coding vector;
inputting the coding vector and Mel-spectrum features into a decoding network of the singing voice synthesis model to obtain singing audio of the song to be synthesized, wherein the Mel-spectrum features are obtained by processing a reference audio in advance.
Further, the encoding network comprises an embedding unit, a preprocessing unit and a feature extraction unit, wherein the feature extraction unit comprises a convolution layer and a highway network;
The step of encoding the phoneme sequence by using the encoding network of the singing voice synthesis model to obtain an encoding vector comprises the following steps:
processing the phoneme sequence by using the embedding unit to obtain an embedding sequence;
The embedded sequence is input into the preprocessing unit to perform nonlinear transformation, and then input into the feature extraction unit to generate the coding vector.
Further, the decoding network comprises a position-sensitive attention layer, a prediction unit, a decoding unit, a CBHG unit and a vocoder, wherein the CBHG unit comprises a convolution layer, a highway network and a bidirectional recurrent neural network;
The step of inputting the coding vector and the Mel-spectrum features into the decoding network of the singing voice synthesis model to obtain the singing audio of the song to be synthesized comprises the following steps:
inputting the coding vector into the position-sensitive attention layer to learn the correspondence between acoustic features and the phoneme sequence, and outputting a context vector;
inputting the Mel-spectrum features into the prediction unit, and performing a linear transformation on the Mel-spectrum features with the prediction unit to obtain a prediction output;
splicing the context vector and the prediction output and inputting the result into the decoding unit for decoding, to obtain a decoding sequence and a stop flag, wherein the stop flag is used for indicating whether the decoding process stops;
inputting the decoding sequence into the CBHG unit to extract context features, to obtain an acoustic feature sequence;
inputting the acoustic feature sequence into the vocoder to synthesize the singing audio.
Further, the step of obtaining the phoneme sequence of the song to be synthesized includes:
obtaining a music score of the song to be synthesized, wherein the music score comprises lyrics and notes;
analyzing the music score, and dividing each syllable in the lyrics into at least one phoneme;
And according to the notes, obtaining the pitch and the phoneme duration corresponding to each phoneme, and obtaining the phoneme sequence.
Further, the step of dividing each syllable in the lyrics into at least one phoneme comprises:
judging, for any target syllable in the lyrics, whether the target syllable is a pinyin syllable without an initial consonant;
if not, dividing the target syllable according to its initial consonant and final to obtain at least one phoneme;
if yes, adding a universal initial consonant or segmentation information to the target syllable and then dividing it, to obtain the at least one phoneme.
Further, the phoneme sequence also includes a breathing phoneme for characterizing a breathing sound in the reference audio.
Further, the phoneme sequence further comprises a continuous tone identifier, wherein the continuous tone identifier is used for indicating that a continuous tone exists for a target phoneme among the plurality of phonemes.
Further, the singing voice synthesis model is trained by:
Acquiring sample audio;
Analyzing the sample audio to obtain a plurality of sample phoneme sequences, wherein one sample phoneme sequence corresponds to one audio frame in the sample audio, and the sample phoneme sequence comprises a plurality of phonemes corresponding to the audio frame and the pitch and the phoneme duration of each phoneme;
Training a preset model based on the plurality of sample phoneme sequences to obtain the singing voice synthesis model.
Further, the step of training a preset model based on the plurality of sample phoneme sequences to obtain the singing voice synthesis model includes:
inputting the plurality of sample phoneme sequences into the preset model, and outputting synthesized audio;
analyzing the synthesized audio to obtain predicted phoneme duration information;
Based on the predicted phoneme duration information and the real phoneme duration information of the sample audio, a loss function is utilized:
align_loss=tf.reduce_mean(tf.abs(align_targets-align_outputs))
and updating parameters of the preset model to obtain the singing voice synthesis model, wherein tf.abs denotes taking the absolute value, align_targets denotes the real phoneme duration information, align_outputs denotes the predicted phoneme duration information, and tf.reduce_mean denotes taking the mean.
In a second aspect, an embodiment of the present application further provides a singing voice synthesizing apparatus, including:
The obtaining module is used for obtaining a phoneme sequence of the song to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes, and a pitch and a phoneme duration corresponding to each phoneme;
The coding module is used for inputting the phoneme sequence into a singing voice synthesis model, and coding the phoneme sequence by utilizing a coding network of the singing voice synthesis model to obtain a coding vector;
And the decoding module is used for inputting the coding vector and the Mel-spectrum features into a decoding network of the singing voice synthesis model to obtain singing audio of the song to be synthesized, wherein the Mel-spectrum features are obtained by processing a reference audio in advance.
In a third aspect, an embodiment of the present application further provides an electronic device, including:
one or more processors;
And a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the singing voice synthesizing method described above.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the singing voice synthesizing method described above.
Compared with the prior art, in the singing voice synthesis method and apparatus, electronic device and storage medium provided by the embodiments of this application, a phoneme sequence of the song to be synthesized is obtained and input into the singing voice synthesis model; because the phoneme sequence includes the phonemes, pitches and phoneme durations, the synthesized singing audio can reflect the pronunciation duration of each phoneme, improving the naturalness of singing voice synthesis. Meanwhile, Mel-spectrum features, obtained in advance by processing a reference audio, are input at the decoding stage of the singing voice synthesis model, so that the synthesized singing audio more closely resembles a real person's singing and the user's listening experience is improved.
Drawings
Fig. 1 shows a flow chart of a singing voice synthesizing method according to an embodiment of the present application.
Fig. 2 is a flow chart of step S101 in the singing voice synthesizing method shown in fig. 1.
Fig. 3 shows an exemplary diagram of a score provided by an embodiment of the present application.
Fig. 4 shows an exemplary diagram of a parsing process of a score according to an embodiment of the present application.
Fig. 5 shows a schematic structural diagram of a singing voice synthesis model provided by an embodiment of the present application.
Fig. 6 is a flowchart of step S102 in the singing voice synthesizing method shown in fig. 1.
Fig. 7 is a flowchart of step S103 in the singing voice synthesizing method shown in fig. 1.
Fig. 8 shows a schematic diagram of a training flow of the singing voice synthesis model provided by the embodiment of the application.
Fig. 9 is a block diagram of a singing voice synthesizing apparatus according to an embodiment of the present application.
Fig. 10 shows a block schematic diagram of an electronic device according to an embodiment of the present application.
Reference numerals: 10 - electronic device; 11 - processor; 12 - memory; 13 - bus; 100 - singing voice synthesizing apparatus; 110 - obtaining module; 120 - encoding module; 130 - decoding module.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The singing voice synthesis method provided by the embodiments of this application can imitate a human voice, thereby providing users with artificial-intelligence singing functions such as a virtual singer. Moreover, the method can synthesize various types of vocal audio, such as Chinese songs, English songs, narrated storytelling, and traditional folk performance audio (including ballad singing, story telling, comic dialogues, clapper talks, crosstalk, etc.), and the like.
Referring to fig. 1, fig. 1 shows a flow chart of a singing voice synthesizing method according to an embodiment of the application, where the singing voice synthesizing method is applied to an electronic device and may include the following steps S101 to S103.
S101, obtaining a phoneme sequence of a song to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes, and a pitch and a phoneme duration corresponding to each phoneme.
A phoneme is the smallest phonetic unit obtained by dividing speech according to its natural attributes; analyzed in terms of the articulatory actions within a syllable, one action constitutes one phoneme. Phonemes are classified into vowels and consonants, and the set of phonemes differs between pronunciation systems. For English, the phonemes include vowel phonemes and consonant phonemes. For Chinese, the syllable (i.e., the pinyin) of each Chinese character can be decomposed into an initial consonant and a final, so the phonemes include initials and finals. The following description takes Chinese as an example.
The pitch of a phoneme indicates how high or low the phoneme sounds during pronunciation. The pitch range of singing is much larger than that of normal speech; since pitch is essentially frequency, the singing pitch can be divided into 36 classes, e.g., C3, F4, etc., with reference to the international standard pitch-to-frequency (hertz) table.
The phoneme duration of a phoneme indicates how long the phoneme lasts during pronunciation. For example, if the phoneme is the final "i" and the corresponding phoneme duration is 200 ms, the phoneme "i" lasts 200 ms during pronunciation.
Singing involves rhythm, and rhythm is in essence phoneme duration. The synthesized singing audio is output in the form of audio frames, each at least 32 ms long; the minimum rhythm resolution can therefore be set to 32 ms, i.e., the shortest phoneme duration is 32 ms, and durations are divided incrementally into 300 levels.
The phoneme sequence is obtained by parsing the music score of the song to be synthesized. In general, the music score comprises lyrics and notes; one line of lyrics corresponds to one phoneme sequence, a phoneme sequence input into the singing voice synthesis model yields corresponding audio frames, and one phoneme sequence corresponds to one or more consecutive audio frames in the synthesized singing audio.
The phoneme sequence may comprise a plurality of elements, each element comprising a phoneme and the pitch and phoneme duration corresponding to that phoneme, which may be expressed as (phoneme, pitch, phoneme duration). For example, in (iou, C3, 10), iou represents the phoneme, C3 the pitch, and 10 the phoneme duration.
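For illustration only, a phoneme sequence of this kind can be thought of as a list of (phoneme, pitch, duration) elements. The sketch below uses a simple Python data class; the class name, the field names, and the concrete duration values for "y" and "ue" (63 ms and 197 ms, taken from the initial/final split of a 260 ms syllable discussed later) are assumptions made for illustration rather than part of this application's implementation.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class PhonemeElement:
        phoneme: str    # e.g. "y", "ue", "iou", or special tokens such as "break_6" / "con"
        pitch: str      # one of the 36 pitch classes, e.g. "C3", "Db4"
        duration: int   # phoneme duration value, as in the (phoneme, pitch, duration) examples

    # One line of lyrics corresponds to one phoneme sequence of such elements.
    phoneme_sequence: List[PhonemeElement] = [
        PhonemeElement("y", "Db4", 63),    # initial of the syllable "yue"
        PhonemeElement("ue", "Db4", 197),  # final of the syllable "yue"
        PhonemeElement("iou", "C3", 10),   # the (iou, C3, 10) example from the text
    ]

    for element in phoneme_sequence:
        print(element.phoneme, element.pitch, element.duration)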
The process of parsing the score into a phoneme sequence is described in detail below. Referring to fig. 2 on the basis of fig. 1, step S101 may include sub-steps S1011 to S1013.
S1011, obtaining a music score of the song to be synthesized, wherein the music score comprises lyrics and notes.
The song to be synthesized may be any song that needs to be synthesized using the singing voice synthesis model, for example, a Chinese song, an English song, etc.
In this embodiment, song audio is synthesized directly from a music score, which typically includes lyrics and notes in text form. For example, fig. 3 shows a music score segment of a Chinese song, including the Chinese lyrics "yue er yuan ..." (roughly, "the moon is round, I make a wish to the night sky") and the corresponding notes.
S1012, analyzing the music score, and dividing each syllable in the lyrics into at least one phoneme.
The following takes part of the music score segment shown in fig. 3 as an example to describe the parsing process. This partial score segment comprises the Chinese lyric syllables "yue" ("moon"), "er", "yuan" ("round") and the corresponding notes.
By parsing the score, each syllable in the lyrics can be divided into at least one phoneme. Referring to fig. 4, the lyrics contain five syllables "yue", "er", "yuan", "you", "yuan", and each of them can be divided into at least one phoneme. For example, the syllable "yue" can be divided into two phonemes "y" and "ue", where the phoneme "y" corresponds to the initial consonant and the phoneme "ue" to the final.
In singing voice synthesis, pinyin syllables without an initial consonant (zero-initial syllables) tend to be pronounced ambiguously, for example "er" in fig. 4. Such syllables therefore need to be preprocessed during syllable division and are only divided into phonemes after this preprocessing, so that the ambiguous pronunciation is avoided.
As an embodiment, the process of dividing each syllable in the lyrics into at least one phoneme may include:
judging, for any target syllable in the lyrics, whether the target syllable is a pinyin syllable without an initial consonant;
if not, dividing the target syllable according to its initial consonant and final to obtain at least one phoneme;
if yes, adding a universal initial consonant or segmentation information to the target syllable and then dividing it, to obtain at least one phoneme.
In this embodiment, a pinyin syllable without an initial consonant may be handled in either of the following ways:
In one alternative embodiment, a universal initial consonant may be added to the zero-initial syllable. For example, the final "er" has no initial consonant; a universal initial "al" can be added for this pronunciation, and the syllable is then divided into the phonemes "al" and "er".
In another alternative embodiment, segmentation information may be added to the zero-initial syllable. For example, a word pronounced "ku" plus the zero-initial "ai" is written in pinyin as "kuai"; adding the segmentation marker sep turns it into "sep ku sep ai sep", which is then divided into the phonemes "sep", "k", "u", "sep", "ai" and "sep".
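As a rough illustration of the two options above, the following sketch splits a pinyin syllable into initial and final phonemes and applies either a universal initial or sep markers to a syllable without an initial consonant; the reduced initial list, the helper name and the fallback choices are simplifying assumptions, not the actual implementation of this application.

    # Simplified list of pinyin initials; a real system would use the complete initial set.
    INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
                "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

    def split_syllable(syllable, zero_initial_mode="universal"):
        """Divide one pinyin syllable into phonemes.

        zero_initial_mode: "universal" adds a universal initial (here "al"),
        "sep" wraps the zero-initial syllable with segmentation markers.
        """
        for initial in INITIALS:
            if syllable.startswith(initial) and len(syllable) > len(initial):
                return [initial, syllable[len(initial):]]      # initial + final
        # Syllable without an initial consonant, e.g. "er", "ai"
        if zero_initial_mode == "universal":
            return ["al", syllable]
        return ["sep", syllable, "sep"]

    print(split_syllable("yue"))                           # ['y', 'ue']
    print(split_syllable("er"))                            # ['al', 'er']
    print(split_syllable("ai", zero_initial_mode="sep"))   # ['sep', 'ai', 'sep']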
S1013, according to the notes, obtaining the pitch and the phoneme duration corresponding to each phoneme, and obtaining a phoneme sequence.
Since the score includes notes and notes have corresponding pitches, after each syllable in the lyrics has been divided into at least one phoneme as in sub-step S1012, the pitch corresponding to each phoneme can be obtained from the notes. Typically, in a score one syllable corresponds to one pitch and all phonemes in the same syllable share that pitch; e.g., the syllable "yue" in fig. 4 corresponds to pitch Db4, so the phonemes "y" and "ue" both correspond to pitch Db4.
Meanwhile, the pinyin duration of each syllable can also be obtained from the notes in the score; for example, the pinyin duration of the syllable "yue" in fig. 4 is 260 ms. However, the singing voice synthesis model takes the phoneme duration of each phoneme as input, so the phoneme duration of each phoneme within a syllable must also be determined from the syllable's pinyin duration, e.g., the durations of the phonemes "y" and "ue" are determined from the 260 ms pinyin duration of "yue".
The phoneme durations may be determined in either of the following ways:
In one alternative embodiment, a bidirectional multi-layer LSTM network may be constructed and trained to learn, given the pinyin duration, the percentage of that duration taken by each phoneme, as in the sketch below.
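For illustration, a duration-fraction predictor of this kind could look like the Keras sketch below; the layer sizes, feature layout and training target are assumptions, the only fixed idea being a bidirectional multi-layer LSTM whose per-phoneme outputs sum to 1 over the syllable.

    import tensorflow as tf

    def build_duration_fraction_model(max_phonemes=4, feature_dim=8):
        """Toy bidirectional multi-layer LSTM predicting each phoneme's share of the pinyin duration."""
        inputs = tf.keras.Input(shape=(max_phonemes, feature_dim))   # per-phoneme input features
        x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(inputs)
        x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True))(x)
        logits = tf.keras.layers.Dense(1)(x)
        fractions = tf.keras.layers.Softmax(axis=1)(logits)          # fractions over the phoneme axis sum to 1
        return tf.keras.Model(inputs, fractions)

    model = build_duration_fraction_model()
    model.compile(optimizer="adam", loss="mse")   # trained against observed duration fractions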
In another alternative embodiment, the inventors have found through extensive experiments that the same person pronounces initial consonants quite consistently. Therefore, a number of songs sung by the same person, for example 100 songs, may be selected in advance and the average pronunciation duration of each initial consonant in those songs counted; for example, if the average pronunciation duration of the initial "y" is 63 ms, that average is used as the phoneme duration of the corresponding initial, e.g., the phoneme duration of the phoneme "y" in fig. 4 is set to 63 ms, and the phoneme duration of the final is the pinyin duration minus the phoneme duration of the initial.
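A minimal sketch of this second option: given a syllable's pinyin duration and a precomputed table of average initial-consonant durations, the initial receives its average duration and the final receives the remainder. The table contents, the default value and the function name are illustrative assumptions.

    # Hypothetical average initial durations in ms, e.g. statistics over ~100 songs by one singer.
    AVG_INITIAL_MS = {"y": 63, "x": 63, "k": 55}

    def phoneme_durations(initial, final, pinyin_duration_ms):
        """Split a syllable's pinyin duration between its initial and its final."""
        initial_ms = AVG_INITIAL_MS.get(initial, 60)       # fall back to an assumed default average
        initial_ms = min(initial_ms, pinyin_duration_ms)   # never exceed the syllable duration
        return {initial: initial_ms, final: pinyin_duration_ms - initial_ms}

    # Syllable "yue" with a pinyin duration of 260 ms -> "y": 63 ms, "ue": 197 ms
    print(phoneme_durations("y", "ue", 260))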
In one possible scenario, the audio of a person singing also contains breathing sounds; to give the synthesized singing audio a more human-like effect, the handling of breathing sounds must also be considered in singing synthesis.
Breathing audio may be pre-classified into 30 classes by duration, e.g., class 6 for durations near 200 ms and class 10 for durations near 300 ms. A breathing phoneme may then take the form break_<duration level>; for example, break_6 denotes duration level 6 and break_10 denotes duration level 10.
The handling of breathing sounds is divided into a model training stage and a model application stage.
In the model training stage, breathing sounds can be fed into the model in the form of breathing phonemes, so that the model learns the breathing patterns of the singers in the sample audio. A breathing phoneme is obtained by measuring the duration of the breathing audio in the sample audio and determining the corresponding duration level from that duration.
In the model application stage, the duration of the breathing audio in the reference audio can be obtained, a breathing phoneme is formed from that duration, and the breathing phoneme is input into the singing voice synthesis model. Thus, the phoneme sequence may also include breathing phonemes that characterize the breathing sounds in the reference audio.
It should be noted that, in the model application stage, whether a phoneme sequence includes a breathing phoneme, and which phoneme sequences include one, may be flexibly chosen by the user; this is not limited herein.
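For illustration, a breathing phoneme token of the form break_<duration level> could be built from a measured breath duration as follows; the uniform binning used here is an assumption, since the text only states that breath durations fall into 30 levels (roughly level 6 near 200 ms and level 10 near 300 ms).

    def breath_phoneme(duration_ms, bin_ms=32, num_levels=30):
        """Map a breath duration to a breathing phoneme token such as 'break_6'."""
        level = max(1, min(num_levels, round(duration_ms / bin_ms)))   # assumed uniform bins
        return f"break_{level}"

    print(breath_phoneme(200))   # about level 6 under this assumed binning
    print(breath_phoneme(300))   # about level 9-10 under this assumed binning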
In another possible scenario, when a person sings a song it is difficult to follow the given score exactly, and continuous (slurred) tones are almost unavoidable; to achieve a more human-like effect, continuous tones therefore also need to be handled in singing synthesis.
The handling of continuous tones is likewise divided into a model training stage and a model application stage.
In the model training stage, the continuous tones in the sample audio can be found by analyzing the sample audio and are fed into the model for training in the form of continuous tone identifiers, where a continuous tone identifier indicates that a continuous tone exists for a target phoneme among the plurality of phonemes; the model can thus learn continuous tone techniques.
In the model application stage, the continuous tones used by the singer of the reference audio can be analyzed, continuous tone identifiers are constructed from them, and the identifiers are input into the model. Thus, the phoneme sequence may further comprise a continuous tone identifier indicating that a continuous tone exists for a target phoneme among the plurality of phonemes.
It should be noted that, in the model application stage, whether a phoneme sequence includes a continuous tone identifier, and which phoneme sequences include one, may be flexibly chosen by the user; this is not limited herein.
Because of the particular way Chinese is pronounced, continuous tones occur only on finals, so the related finals can simply be joined: several finals carrying a continuous tone are treated as one phoneme, which becomes the target phoneme and is given the continuous tone identifier. For example, the syllable "xiao" includes the initial "x" and the finals "i" and "ao", but "i" and "ao" form a continuous tone in pronunciation, so "xiao" is divided into the two phonemes "x" and "iao" and the continuous tone identifier con is added. The identifier con can also be treated as a phoneme itself, in which case its pitch and phoneme duration are set to 0, e.g., (con, 0, 0).
Meanwhile, a continuous tone may span several pitches during pronunciation; for example, the pitch of "iao" includes Eb4, F4 and E4. In this case, the phoneme duration corresponding to each pitch can be obtained and separate elements of the phoneme sequence constructed, e.g., (x, Eb4, 63), (iao, Eb4, 220), (con, 0, 0), (iao, F4, 500), (con, 0, 0), (iao, E4, 30).
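To make the element construction above concrete, the small sketch below expands a slurred final over several (pitch, duration) pairs and inserts the con marker between consecutive pitches; the tuple layout follows the example, while the function name is an assumption.

    def legato_elements(initial, initial_pitch, initial_dur, final, pitch_dur_pairs):
        """Build phoneme-sequence elements for a syllable whose final slurs across several pitches."""
        elements = [(initial, initial_pitch, initial_dur)]
        for i, (pitch, dur) in enumerate(pitch_dur_pairs):
            if i > 0:
                elements.append(("con", 0, 0))   # continuous tone marker with pitch and duration set to 0
            elements.append((final, pitch, dur))
        return elements

    # Syllable "xiao": initial "x", final "iao" slurred over Eb4 -> F4 -> E4
    print(legato_elements("x", "Eb4", 63, "iao", [("Eb4", 220), ("F4", 500), ("E4", 30)]))
    # [('x', 'Eb4', 63), ('iao', 'Eb4', 220), ('con', 0, 0), ('iao', 'F4', 500), ('con', 0, 0), ('iao', 'E4', 30)]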
S102, inputting the phoneme sequence into a singing voice synthesis model, and encoding the phoneme sequence by utilizing an encoding network of the singing voice synthesis model to obtain an encoding vector.
S103, inputting the coding vector and the Mel-spectrum features into a decoding network of the singing voice synthesis model to obtain the singing audio of the song to be synthesized, wherein the Mel-spectrum features are obtained by processing the reference audio in advance.
Referring to fig. 5, fig. 5 is a schematic structural diagram of the singing voice synthesis model according to an embodiment of the application. The singing voice synthesis model can be obtained by improving the conventional spectrogram-prediction network Tacotron. As shown in fig. 5, the singing voice synthesis model includes an encoding network and a decoding network; after the phoneme sequence is input into the model, the encoding network encodes the phoneme sequence into a fixed-length coding vector, and the decoding network decodes the coding vector to generate the singing audio.
Meanwhile, the input of the decoding network includes, in addition to the coding vector, the Mel-spectrum features of a reference audio. The reference audio may be a person's speaking audio or singing audio and can be chosen flexibly by the user; for example, the user may record his or her own voice as the reference audio or select a celebrity's voice as the reference audio, which is not limited herein. By incorporating the Mel-spectrum features of the reference audio in the decoding stage, the synthesized singing audio can be made closer to a real person's singing.
As shown in fig. 5, the encoding network may include an embedding unit, a preprocessing unit and a feature extraction unit, and the feature extraction unit may include a convolution layer and a highway network.
Thus, referring to FIG. 6 on the basis of FIG. 1, step S102 may include sub-steps S1021-S1022.
S1021, processing the phoneme sequence with the embedding unit to obtain an embedding sequence.
S1022, inputting the embedding sequence into the preprocessing unit for nonlinear transformation, and then into the feature extraction unit to generate the coding vector.
In this embodiment, the embedding unit may be a character embedding layer and the preprocessing unit may be a Pre-net. Since feature extraction over phonemes, pitches and phoneme durations does not need to take context information into account, the bidirectional recurrent neural network is removed from the CBHG module (Convolution Bank + Highway network + Bidirectional Gated Recurrent Unit, i.e., convolution layers + highway network + bidirectional recurrent neural network) of the Tacotron model to obtain the feature extraction unit.
On the basis of the encoding network shown in fig. 5, the process of generating the coding vector with the encoding network is as follows:
1. The phoneme sequence is input into the embedding unit (character embedding), which processes the phoneme sequence to obtain an embedding sequence;
2. The embedding sequence is input into the preprocessing unit (Pre-net), which applies a nonlinear transformation to it so as to improve the convergence and generalization ability of the singing voice synthesis model;
3. The nonlinearly transformed embedding sequence is input into the feature extraction unit (convolution bank + highway network), which extracts the features of the phonemes, pitches and phoneme durations simultaneously and outputs the coding vector.
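A minimal TensorFlow/Keras sketch of such an encoder (embedding, then Pre-net, then convolution bank plus a highway layer, with no recurrent layer); the layer sizes, the single highway layer and all names are illustrative assumptions and do not reproduce the exact architecture of this application.

    import tensorflow as tf

    def build_encoder(vocab_size, embed_dim=256, prenet_units=(256, 128), bank_size=8):
        """Toy encoder: phoneme-element embedding -> Pre-net -> convolution bank + highway layer."""
        inputs = tf.keras.Input(shape=(None,), dtype=tf.int32)          # IDs of phoneme-sequence elements
        x = tf.keras.layers.Embedding(vocab_size, embed_dim)(inputs)    # embedding unit

        for units in prenet_units:                                      # Pre-net: nonlinear transformation
            x = tf.keras.layers.Dense(units, activation="relu")(x)
            x = tf.keras.layers.Dropout(0.5)(x)

        # Convolution bank: parallel 1-D convolutions with kernel sizes 1..bank_size
        convs = [tf.keras.layers.Conv1D(128, k, padding="same", activation="relu")(x)
                 for k in range(1, bank_size + 1)]
        x = tf.keras.layers.Concatenate()(convs)
        x = tf.keras.layers.Dense(128, activation="relu")(x)

        # One highway layer: gated mix of a transformed path and the identity path (no BiGRU, per the text)
        transform = tf.keras.layers.Dense(128, activation="relu")(x)
        gate = tf.keras.layers.Dense(128, activation="sigmoid")(x)
        one_minus_gate = tf.keras.layers.Lambda(lambda g: 1.0 - g)(gate)
        encoded = tf.keras.layers.Add()([
            tf.keras.layers.Multiply()([gate, transform]),
            tf.keras.layers.Multiply()([one_minus_gate, x]),
        ])
        return tf.keras.Model(inputs, encoded, name="encoder")

    encoder = build_encoder(vocab_size=100)
    encoder.summary()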
As shown in fig. 5, the decoding network may include a position-sensitive attention layer, a prediction unit, a decoding unit, a CBHG unit and a vocoder, where the CBHG unit comprises a convolution layer, a highway network and a bidirectional recurrent neural network.
Thus, referring to fig. 7 on the basis of fig. 1, step S103 may include sub-steps S1031-S1035.
S1031, inputting the coding vector into the position-sensitive attention layer to learn the correspondence between acoustic features and the phoneme sequence, and outputting a context vector.
S1032, inputting the Mel-spectrum features into the prediction unit, and performing a linear transformation on them with the prediction unit to obtain a prediction output.
S1033, splicing the context vector and the prediction output and inputting the result into the decoding unit for decoding, to obtain a decoding sequence and a stop flag, where the stop flag indicates whether the decoding process stops.
S1034, inputting the decoding sequence into the CBHG unit to extract context features, obtaining an acoustic feature sequence.
S1035, inputting the acoustic feature sequence into the vocoder to synthesize the singing audio.
In this embodiment, the prediction unit may be a Pre-net, the decoding unit may include an Attention RNN and a Decoder RNN, and the position-sensitive attention layer may be Location Sensitive Attention. The CBHG unit may be a Convolution Bank + Highway network + Bidirectional Gated Recurrent Unit, i.e., convolution layers + highway network + bidirectional recurrent neural network.
On the basis of the decoding network shown in fig. 5, the process by which the decoding network outputs the singing audio is as follows:
1. The coding vector is input into the position-sensitive attention layer (Location Sensitive Attention), which is essentially a matrix composed of context weight vectors; it can automatically learn the correspondence between acoustic features and the phoneme sequence and output a context vector.
The position-sensitive attention layer performs the attention computation at every time step and accumulates the attention weights to obtain the learned position information, so that the singing voice synthesis model processes the contents of the phoneme sequence in order and avoids repeated predictions or omissions. That is, a position-sensitive attention layer is adopted in the singing voice synthesis model and used to attend to different parts of the coding vector and automatically learn the correspondence between acoustic features and the phoneme sequence.
The position-sensitive attention layer can therefore further improve the stability of the singing voice synthesis result and avoid missed phonemes, repeated phonemes, or a decoding process that fails to stop.
2. The previous frame and the Mel-spectrum features are input into the prediction unit (Pre-net), which applies a linear transformation to them to obtain the prediction output.
Decoding is a cyclic process whose current time step in each cycle is t. Inputting the previous frame and the Mel-spectrum features into the prediction unit corresponds to the first cycle, so t = 1 at that point. When t > 1, as shown in fig. 5, the decoding sequence of time step t-1 is input into the prediction unit instead. The prediction unit applies a linear transformation to its input.
At the current time step t = 1, the previous frame and the Mel-spectrum features are input into the prediction unit, so the previous frame (initial frame) is the decoding sequence of time step 0, in which every element is 0; i.e., the previous frame is an all-zero frame.
3. After the position-sensitive attention layer outputs the context vector and the prediction unit produces the prediction output, the two are spliced and decoded by the decoding unit, which outputs the decoding sequence of the current time step t and a stop flag (stop token), where the stop token indicates whether to stop the loop.
4. The decoding sequence is input into the CBHG unit to extract context features, yielding the acoustic feature sequence of the current time step t.
5. If the stop token indicates that the loop should stop, the decoding process ends here, and the acoustic feature sequence of the current time step t is taken as the final acoustic feature sequence;
If the stop token indicates that the loop should not stop, the current time step is updated to t = t + 1 and the procedure returns to step 1 and continues until the stop token indicates that the loop should stop, yielding the final acoustic feature sequence.
6. The final acoustic feature sequence is input into the vocoder to synthesize the singing audio.
The vocoder converts the acoustic features generated by the decoding network into an audio waveform; it may be any vocoder that generates a waveform from Mel-spectrum parameters, such as WaveGlow, Griffin-Lim or WaveNet.
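The decoding loop above can be summarized in schematic Python as follows; every callable passed in (attention, prenet, decoder_step, cbhg, vocoder) is a stand-in for the corresponding module in fig. 5, and the function is a sketch of the control flow rather than a real implementation.

    def decode(encoder_outputs, mel_reference, attention, prenet, decoder_step, cbhg, vocoder,
               max_steps=1000):
        """Schematic autoregressive decoding loop of the singing voice synthesis model."""
        prev = None                                            # time step 0: all-zero previous frame
        acoustic_features = []

        for t in range(1, max_steps + 1):
            # t = 1: Pre-net sees the all-zero previous frame plus the reference Mel-spectrum features;
            # t > 1: Pre-net sees the decoding sequence of time step t - 1.
            prediction = prenet(mel_reference if prev is None else prev)
            context = attention(encoder_outputs)               # context vector from the attention layer
            decoded, stop = decoder_step(context, prediction)  # splice and decode -> sequence, stop flag
            acoustic_features.append(cbhg(decoded))            # CBHG extracts context features
            if stop:                                           # stop flag ends the loop
                break
            prev = decoded

        return vocoder(acoustic_features)                      # final acoustic feature sequence -> waveform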
The following describes the training process of the singing voice synthesis model in detail.
In this embodiment, the training process of the singing voice synthesis model may also be performed on an electronic device; the singing voice synthesis method and the training of the singing voice synthesis model may be implemented by the same electronic device or by different electronic devices.
Referring to fig. 8, fig. 8 is a schematic diagram illustrating a training flow of a singing voice synthesis model according to an embodiment of the present application, and a training process of the singing voice synthesis model may include steps S201 to S203.
S201, acquiring sample audio.
The sample audio may be pre-recorded audio, or the audio of a specified Chinese song, English song, storytelling recording, or traditional folk performance (including ballad singing, story telling, comic dialogues, clapper talks, crosstalk, etc.), and so on.
S202, analyzing the sample audio to obtain a plurality of sample phoneme sequences, wherein one sample phoneme sequence corresponds to one audio frame in the sample audio, and the sample phoneme sequence comprises a plurality of phonemes corresponding to the audio frame and the pitch and the phoneme duration of each phoneme.
The sample audio may comprise a plurality of audio frames, and typically the audio frames and the sample phoneme sequences are in one-to-one correspondence. Any audio frame is analyzed to obtain each phoneme in that frame together with the pitch and phoneme duration of each phoneme, yielding the sample phoneme sequence corresponding to that audio frame.
Taking a Chinese song as an example, the phonemes of the sample audio are obtained in a manner similar to that described in step S101, which is not repeated here.
The pitch of each phoneme may be obtained as follows: the pitch of each audio frame in the sample audio is identified by software, the identified pitch is mapped to one of the 36 preset pitch classes, e.g., C3, F4, etc., and the pitch of the audio frame is then used as the pitch of every phoneme in that frame.
There are several ways to obtain the phoneme duration of each phoneme in the sample audio; for example, the duration of each phoneme during pronunciation can be identified by software, or it can be determined by manual annotation, thereby obtaining the phoneme duration of each phoneme.
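As an illustration of the kind of automatic analysis mentioned here, a per-frame pitch could be estimated with an off-the-shelf pitch tracker and snapped to the nearest note name; librosa is used below purely as an assumed tool, not as the software this application refers to.

    import numpy as np
    import librosa

    def frame_pitch_classes(wav_path, fmin_note="C2", fmax_note="C7"):
        """Estimate a pitch class (note name) for each analysis frame of an audio file."""
        y, sr = librosa.load(wav_path, sr=None)
        f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz(fmin_note),
                                     fmax=librosa.note_to_hz(fmax_note), sr=sr)
        notes = []
        for hz, is_voiced in zip(f0, voiced):
            notes.append(librosa.hz_to_note(hz) if is_voiced and not np.isnan(hz) else None)
        return notes   # e.g. ['C3', 'C3', None, ...]; unvoiced frames are left as None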
S203, training a preset model based on a plurality of sample phoneme sequences to obtain a singing voice synthesis model.
After the plurality of sample phoneme sequences of the sample audio have been obtained, the sample audio can be used as the label and the sample phoneme sequences as the input, and the preset model is trained until the loss value of the chosen loss function meets a given requirement or the number of iterations reaches a set number, yielding the singing voice synthesis model. The model structure of the preset model is identical to the structure shown in fig. 5; only the model parameters differ.
As an embodiment, the training process of the preset model may include:
1. Inputting a plurality of sample phoneme sequences into a preset model, and outputting synthesized audio;
2. analyzing the synthesized audio to obtain predicted phoneme duration information;
3. Based on the predicted phoneme duration information and the real phoneme duration information of the sample audio, a loss function is utilized:
align_loss=tf.reduce_mean(tf.abs(align_targets-align_outputs))
And the parameters of the preset model are updated to obtain the singing voice synthesis model, where tf.abs denotes taking the absolute value, align_targets denotes the real phoneme duration information, align_outputs denotes the predicted phoneme duration information, and tf.reduce_mean denotes taking the mean.
The predicted phoneme duration information comprises the phoneme duration of each phoneme in the synthesized audio; the real phoneme duration information comprises the real phoneme duration of each phoneme in the sample audio and can be determined by manual annotation.
For singing synthesis, model training is essentially learning rhythm durations, and learning rhythm durations is essentially learning the duration of each phoneme; the core of model training is therefore to learn each phoneme's duration accurately.
The model provided by the embodiments of this application adopts an attention mechanism, which learns the phoneme duration information in an unsupervised manner; owing to the complexity of singing voice synthesis, however, the duration information learned this way deviates considerably. Therefore, during training, the phoneme duration information learned by the attention mechanism, i.e., the predicted phoneme duration information of the synthesized audio, together with the real phoneme duration information of the sample audio, is added to the loss function as a mean-absolute-error term, turning the unsupervised learning into supervised learning and greatly improving the accuracy of rhythm-duration control.
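A hedged sketch of how such a duration (alignment) loss could be combined with the rest of the training loss in TensorFlow; only the align_loss line mirrors the formula given above, while the spectrogram term, the weighting and the dummy values are assumptions.

    import tensorflow as tf

    def total_loss(mel_targets, mel_outputs, align_targets, align_outputs, duration_weight=1.0):
        """Assumed spectrogram reconstruction loss plus the supervised phoneme-duration loss."""
        mel_loss = tf.reduce_mean(tf.abs(mel_targets - mel_outputs))         # assumed spectrogram term
        align_loss = tf.reduce_mean(tf.abs(align_targets - align_outputs))   # formula from the text
        return mel_loss + duration_weight * align_loss

    # Toy check with dummy tensors
    align_targets = tf.constant([63.0, 197.0, 220.0])   # real phoneme durations
    align_outputs = tf.constant([60.0, 210.0, 200.0])   # phoneme durations predicted via attention
    print(total_loss(tf.zeros([2, 80]), tf.zeros([2, 80]), align_targets, align_outputs).numpy())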
Compared with the prior art, the embodiments of this application have the following beneficial effects:
First, phonemes, pitches and phoneme durations are used as the input of the singing voice synthesis model, so the synthesized singing audio can reflect the pronunciation duration of each phoneme, improving the naturalness of singing voice synthesis; meanwhile, the Mel-spectrum features of the reference audio are input at the decoding stage, so the synthesized singing audio more closely resembles a real person's singing;
Second, a position-sensitive attention layer is introduced and used to attend to different parts of the coding vector and automatically learn the correspondence between acoustic features and the phoneme sequence, improving the stability of the singing voice synthesis result and avoiding missed phonemes, repeated phonemes, or a decoder that fails to stop;
Third, a stop flag is introduced to indicate whether the decoding process should stop, preventing the decoding process from falling into an infinite loop;
Fourth, breathing sounds and continuous tones are taken into account in singing voice synthesis, so that the synthesized singing audio achieves a more human-like effect.
In order to carry out the steps of the above singing voice synthesis method embodiment and of each of its possible implementations, an implementation of a singing voice synthesizing apparatus is given below.
Referring to fig. 9, fig. 9 is a block diagram of a singing voice synthesizing apparatus 100 according to an embodiment of the application. The singing voice synthesizing apparatus 100 is applied to an electronic device and may include: an obtaining module 110, an encoding module 120, and a decoding module 130.
The obtaining module 110 is configured to obtain a phoneme sequence of a song to be synthesized, where the phoneme sequence includes a plurality of phonemes, and a pitch and a phoneme duration corresponding to each phoneme.
The encoding module 120 is configured to input the phoneme sequence into the singing voice synthesis model, and encode the phoneme sequence by using an encoding network of the singing voice synthesis model to obtain an encoding vector.
The decoding module 130 is configured to input the coding vector and the Mel-spectrum features into the decoding network of the singing voice synthesis model to obtain the singing audio of the song to be synthesized, where the Mel-spectrum features are obtained by processing the reference audio in advance.
Optionally, the obtaining module 110 is specifically configured to:
Obtaining a music score of a song to be synthesized, wherein the music score comprises lyrics and notes;
analyzing the music score, and dividing each syllable in the lyrics into at least one phoneme;
and obtaining the pitch and the phoneme duration corresponding to each phoneme according to the notes, and obtaining a phoneme sequence.
Optionally, the manner in which the obtaining module 110 divides each syllable in the lyrics into at least one phoneme includes:
judging, for any target syllable in the lyrics, whether the target syllable is a pinyin syllable without an initial consonant;
if not, dividing the target syllable according to its initial consonant and final to obtain at least one phoneme;
if yes, adding a universal initial consonant or segmentation information to the target syllable and then dividing it, to obtain at least one phoneme.
Optionally, the encoding module 120 performs encoding of the phoneme sequence using an encoding network of the singing voice synthesis model to obtain an encoded vector, including:
processing the phoneme sequence by using an embedding unit to obtain an embedding sequence;
the embedded sequence is input into a preprocessing unit for nonlinear transformation, and then is input into a feature extraction unit for generating a coding vector.
Optionally, the manner in which the decoding module 130 inputs the coding vector and the Mel-spectrum features into the decoding network of the singing voice synthesis model to obtain the singing audio of the song to be synthesized includes:
inputting the coding vector into the position-sensitive attention layer to learn the correspondence between acoustic features and the phoneme sequence, and outputting a context vector;
inputting the Mel-spectrum features into the prediction unit, and performing a linear transformation on them with the prediction unit to obtain a prediction output;
splicing the context vector and the prediction output and inputting the result into the decoding unit for decoding, to obtain a decoding sequence and a stop flag, where the stop flag indicates whether the decoding process stops;
inputting the decoding sequence into the CBHG unit to extract context features, obtaining an acoustic feature sequence;
inputting the acoustic feature sequence into the vocoder to synthesize the singing audio.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the singing voice synthesizing apparatus 100 described above may refer to the corresponding process in the foregoing method embodiment, and will not be described in detail herein.
Referring to fig. 10, fig. 10 is a block diagram of an electronic device 10 according to an embodiment of the application. The electronic device 10 may be any electronic device having voice processing functions, such as a server, a mobile terminal, a general purpose computer, or a special purpose computer, and the mobile terminal may be a smart phone, a notebook computer, a tablet computer, a desktop computer, a smart television, or the like.
The electronic device 10 may include a processor 11, a memory 12, and a bus 13, the processor 11 being connected to the memory 12 through the bus 13.
The memory 12 is used for storing a program, such as the singing voice synthesizing apparatus 100 shown in fig. 9. The singing voice synthesizing apparatus 100 includes at least one software functional module that may be stored in the memory 12 in the form of software or firmware, and the processor 11, after receiving an execution instruction, executes the program to implement the singing voice synthesis method disclosed in the above embodiments.
The memory 12 may include a high-speed random access memory (Random Access Memory, RAM) and may also include a non-volatile memory (NVM).
The processor 11 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 11 or by instructions in the form of software. The processor 11 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a micro control unit (Microcontroller Unit, MCU), a complex programmable logic device (Complex Programmable Logic Device, CPLD), a field programmable gate array (Field Programmable Gate Array, FPGA), an embedded ARM, and the like.
The embodiment of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by the processor 11, implements the singing voice synthesizing method disclosed in the above embodiment.
In summary, in the singing voice synthesis method and apparatus, electronic device and storage medium provided by the embodiments of this application, a phoneme sequence of the song to be synthesized is obtained and input into the singing voice synthesis model; because the phoneme sequence includes phonemes, pitches and phoneme durations, the synthesized singing audio can reflect the pronunciation duration of each phoneme, improving the naturalness of singing voice synthesis. Meanwhile, Mel-spectrum features, obtained in advance by processing a reference audio, are input at the decoding stage of the singing voice synthesis model, so that the synthesized singing audio more closely resembles a real person's singing and the user's listening experience is improved.
The above description covers only preferred embodiments of the present application and is not intended to limit the present application; various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application shall be included in the protection scope of the present application.

Claims (12)

1. A singing voice synthesizing method, characterized in that the method comprises:
Obtaining a phoneme sequence of a song to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes and a pitch and a phoneme duration corresponding to each phoneme, the pitch being used for indicating how high or low the phoneme sounds during pronunciation, and the phoneme duration being used for indicating how long the phoneme lasts during pronunciation;
inputting the phoneme sequence into a singing voice synthesis model, and encoding the phoneme sequence by utilizing an encoding network of the singing voice synthesis model to obtain an encoding vector;
Inputting the coding vector and Mel-spectrum features into a decoding network of the singing voice synthesis model to obtain singing audio of the song to be synthesized, wherein the Mel-spectrum features are obtained by processing a reference audio in advance, and the reference audio is speaking audio or singing audio of a person.
2. The method of claim 1, wherein the encoding network comprises an embedding unit, a preprocessing unit, and a feature extraction unit, the feature extraction unit comprising a convolution layer and a highway network;
The step of encoding the phoneme sequence by using the encoding network of the singing voice synthesis model to obtain an encoding vector comprises the following steps:
processing the phoneme sequence by using the embedding unit to obtain an embedding sequence;
The embedded sequence is input into the preprocessing unit to perform nonlinear transformation, and then input into the feature extraction unit to generate the coding vector.
3. The method of claim 1, wherein the decoding network comprises a position-sensitive attention layer, a prediction unit, a decoding unit, a CBHG unit, and a vocoder, the CBHG unit comprising a convolution layer, a highway network, and a bidirectional recurrent neural network;
The step of inputting the coding vector and the Mel-spectrum features into the decoding network of the singing voice synthesis model to obtain the singing audio of the song to be synthesized comprises the following steps:
inputting the coding vector into the position-sensitive attention layer to learn the correspondence between acoustic features and the phoneme sequence, and outputting a context vector;
inputting the Mel-spectrum features into the prediction unit, and performing a linear transformation on the Mel-spectrum features with the prediction unit to obtain a prediction output;
splicing the context vector and the prediction output and inputting the result into the decoding unit for decoding, to obtain a decoding sequence and a stop flag, wherein the stop flag is used for indicating whether the decoding process stops;
inputting the decoding sequence into the CBHG unit to extract context features, to obtain an acoustic feature sequence;
inputting the acoustic feature sequence into the vocoder to synthesize the singing audio.
4. The method of claim 1, wherein the step of obtaining a phoneme sequence for a song to be synthesized comprises:
obtaining a music score of the song to be synthesized, wherein the music score comprises lyrics and notes;
analyzing the music score, and dividing each syllable in the lyrics into at least one phoneme;
And according to the notes, obtaining the pitch and the phoneme duration corresponding to each phoneme, and obtaining the phoneme sequence.
5. The method of claim 4, wherein the step of dividing each syllable in the lyrics into at least one phoneme comprises:
judging, for any target syllable in the lyrics, whether the target syllable is a pinyin syllable without an initial consonant;
if not, dividing the target syllable according to its initial consonant and final to obtain at least one phoneme;
if yes, adding a universal initial consonant or segmentation information to the target syllable and then dividing it, to obtain the at least one phoneme.
6. The method of claim 1, wherein the phoneme sequence further comprises a breathing phoneme that characterizes a breathing sound in the reference audio.
7. The method of claim 1, wherein the phoneme sequence further comprises a continuous tone identifier for indicating that a continuous tone exists for a target phoneme of the plurality of phonemes.
8. The method of claim 1, wherein the singing voice synthesis model is trained by:
Acquiring sample audio;
Analyzing the sample audio to obtain a plurality of sample phoneme sequences, wherein one sample phoneme sequence corresponds to one audio frame in the sample audio, and the sample phoneme sequence comprises a plurality of phonemes corresponding to the audio frame and the pitch and the phoneme duration of each phoneme;
Training a preset model based on the plurality of sample phoneme sequences to obtain the singing voice synthesis model.
9. The method of claim 8, wherein the step of training a preset model based on the plurality of sample phoneme sequences to obtain the singing voice synthesis model comprises:
inputting the plurality of sample phoneme sequences into the preset model, and outputting synthesized audio;
analyzing the synthesized audio to obtain predicted phoneme duration information;
Based on the predicted phoneme duration information and the real phoneme duration information of the sample audio, a loss function is utilized:
align_loss=tf.reduce_mean(tf.abs(align_targets-align_outputs))
and updating parameters of the preset model to obtain the singing voice synthesis model, wherein tf.abs denotes taking the absolute value, align_targets denotes the real phoneme duration information, align_outputs denotes the predicted phoneme duration information, and tf.reduce_mean denotes taking the mean.
10. A singing voice synthesizing apparatus, characterized in that the apparatus comprises:
an obtaining module, configured to obtain a phoneme sequence of a song to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes and a pitch and a phoneme duration corresponding to each phoneme, the pitch indicating the pitch level of the phoneme during pronunciation, and the phoneme duration indicating the duration of the phoneme during pronunciation;
a coding module, configured to input the phoneme sequence into a singing voice synthesis model and encode the phoneme sequence by using a coding network of the singing voice synthesis model to obtain a coding vector; and
a decoding module, configured to input the coding vector and Mel spectrum characteristics into a decoding network of the singing voice synthesis model to obtain the singing audio of the song to be synthesized, wherein the Mel spectrum characteristics are obtained by processing reference audio in advance, and the reference audio is speaking audio or singing audio of a person.
11. An electronic device, the electronic device comprising:
one or more processors;
a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the singing voice synthesizing method according to any one of claims 1 to 9.
12. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the singing voice synthesizing method according to any one of claims 1 to 9.
CN202111048649.7A 2021-09-08 2021-09-08 Singing voice synthesizing method and device, electronic equipment and storage medium Active CN113593520B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111048649.7A CN113593520B (en) 2021-09-08 2021-09-08 Singing voice synthesizing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111048649.7A CN113593520B (en) 2021-09-08 2021-09-08 Singing voice synthesizing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113593520A CN113593520A (en) 2021-11-02
CN113593520B true CN113593520B (en) 2024-05-17

Family

ID=78241302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111048649.7A Active CN113593520B (en) 2021-09-08 2021-09-08 Singing voice synthesizing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113593520B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114783406B (en) * 2022-06-16 2022-10-21 深圳比特微电子科技有限公司 Speech synthesis method, apparatus and computer-readable storage medium
WO2023245389A1 (en) * 2022-06-20 2023-12-28 北京小米移动软件有限公司 Song generation method, apparatus, electronic device, and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108831437A (en) * 2018-06-15 2018-11-16 百度在线网络技术(北京)有限公司 A kind of song generation method, device, terminal and storage medium
CN110709922A (en) * 2017-06-28 2020-01-17 雅马哈株式会社 Singing voice generating device, method and program
CN111354332A (en) * 2018-12-05 2020-06-30 北京嘀嘀无限科技发展有限公司 Singing voice synthesis method and device
CN111583900A (en) * 2020-04-27 2020-08-25 北京字节跳动网络技术有限公司 Song synthesis method and device, readable medium and electronic equipment
CN111899720A (en) * 2020-07-30 2020-11-06 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating audio
CN112466313A (en) * 2020-11-27 2021-03-09 四川长虹电器股份有限公司 Method and device for synthesizing singing voices of multiple singers
CN112802446A (en) * 2019-11-14 2021-05-14 腾讯科技(深圳)有限公司 Audio synthesis method and device, electronic equipment and computer-readable storage medium
WO2021101665A1 (en) * 2019-11-22 2021-05-27 Microsoft Technology Licensing, Llc Singing voice synthesis
CN112908302A (en) * 2021-01-26 2021-06-04 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, device and equipment and readable storage medium

Also Published As

Publication number Publication date
CN113593520A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
JP7280386B2 (en) Multilingual speech synthesis and cross-language voice cloning
JP7395792B2 (en) 2-level phonetic prosody transcription
CN115485766A (en) Speech synthesis prosody using BERT models
CN112005298A (en) Clock type level variation coder
KR20230043084A (en) Method and computer readable storage medium for performing text-to-speech synthesis using machine learning based on sequential prosody feature
CN111274807B (en) Text information processing method and device, computer equipment and readable storage medium
CN113593520B (en) Singing voice synthesizing method and device, electronic equipment and storage medium
JP7379756B2 (en) Prediction of parametric vocoder parameters from prosodic features
CN110782880B (en) Training method and device for prosody generation model
CN114746935A (en) Attention-based clock hierarchy variation encoder
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
WO2023245389A1 (en) Song generation method, apparatus, electronic device, and storage medium
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN114242093A (en) Voice tone conversion method and device, computer equipment and storage medium
Caballero-Morales Estimation of Phoneme‐Specific HMM Topologies for the Automatic Recognition of Dysarthric Speech
CN112802451A (en) Prosodic boundary prediction method and computer storage medium
Ronanki Prosody generation for text-to-speech synthesis
KR102668866B1 (en) Method and computer readable storage medium for performing text-to-speech synthesis using machine learning
US20230018384A1 (en) Two-Level Text-To-Speech Systems Using Synthetic Training Data
Raitio Voice source modelling techniques for statistical parametric speech synthesis
Kayte Text-To-Speech Synthesis System for Marathi Language Using Concatenation Technique
KR20240078628A (en) Method and computer readable storage medium for performing text-to-speech synthesis using machine learning
Kim et al. SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer
CN115376483A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
CN117854474A (en) Speech data set synthesis method and system with expressive force and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant