CN113593520A - Singing voice synthesis method and device, electronic equipment and storage medium - Google Patents

Info

Publication number: CN113593520A
Application number: CN202111048649.7A
Authority: CN (China)
Prior art keywords: phoneme, singing voice, sequence, voice synthesis, audio
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN113593520B (en)
Inventor: 周阳
Current Assignee / Original Assignee: Guangzhou Huya Technology Co Ltd
Application filed by Guangzhou Huya Technology Co Ltd
Priority to CN202111048649.7A (priority date 2021-09-08)
Publication of CN113593520A; application granted; publication of CN113593520B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The embodiments of the application relate to the field of speech synthesis and provide a singing voice synthesis method and apparatus, an electronic device, and a storage medium. A phoneme sequence of a song to be synthesized is obtained and input into a singing voice synthesis model. Because the phoneme sequence includes the phonemes together with the pitch and phoneme duration of each phoneme, the synthesized singing audio can reflect the pronunciation duration of each phoneme, which improves the naturalness of the singing voice synthesis. Meanwhile, mel-spectrum features, obtained by processing a reference audio in advance, are input at the decoding stage of the singing voice synthesis model, so that the synthesized singing audio is closer to a real singing effect and the listening experience of the user is improved.

Description

Singing voice synthesis method and device, electronic equipment and storage medium
Technical Field
The embodiment of the application relates to the field of voice synthesis, in particular to a singing voice synthesis method and device, electronic equipment and a storage medium.
Background
In recent years, singing voice synthesis has been a popular research topic; the technology synthesizes a musical score into singing audio. However, existing singing voice synthesis produces results of low naturalness with a strong mechanical quality, and cannot achieve a convincing human-like effect. Therefore, how to synthesize, from a musical score, a song with high naturalness and an effect close to a real human voice is a technical problem that researchers urgently need to solve.
Disclosure of Invention
An object of the embodiments of the present application is to provide a singing voice synthesis method and apparatus, an electronic device, and a storage medium, so as to improve the naturalness of singing voice synthesis and bring it close to a real singing effect.
In order to achieve the above purpose, the embodiments of the present application employ the following technical solutions:
in a first aspect, an embodiment of the present application provides a singing voice synthesis method, including:
obtaining a phoneme sequence of a song to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes, and a pitch and a phoneme duration corresponding to each phoneme;
inputting the phoneme sequence into a singing voice synthesis model, and coding the phoneme sequence by using a coding network of the singing voice synthesis model to obtain a coding vector;
and inputting the coding vector and the Mel spectrum characteristics into a decoding network of the singing voice synthesis model to obtain the singing audio of the song to be synthesized, wherein the Mel spectrum characteristics are obtained by processing a reference audio in advance.
Furthermore, the coding network comprises an embedding unit, a preprocessing unit and a feature extraction unit, wherein the feature extraction unit comprises a convolutional layer and a highway network;
the step of encoding the phoneme sequence by using the encoding network of the singing voice synthesis model to obtain an encoding vector includes:
processing the phoneme sequence by using the embedding unit to obtain an embedded sequence;
and inputting the embedded sequence into the preprocessing unit for nonlinear transformation, and then inputting the embedded sequence into the feature extraction unit to generate the coding vector.
Further, the decoding network comprises a position-sensitive attention layer, a prediction unit, a decoding unit, a CBHG unit and a vocoder, wherein the CBHG unit comprises a convolutional layer, a highway network and a bidirectional recurrent neural network;
the step of inputting the coding vector and mel spectrum characteristics into a decoding network of the singing voice synthesis model to obtain the singing audio of the song to be synthesized comprises the following steps:
inputting the coding vector into the position-sensitive attention layer to learn the corresponding relation between acoustic features and the phoneme sequence and outputting a context vector;
inputting the Mel spectrum characteristics into the prediction unit, and performing linear transformation on the Mel spectrum characteristics by using the prediction unit to obtain prediction output;
splicing the context vector and the prediction output, and inputting the spliced context vector and the prediction output into the decoding unit for decoding to obtain a decoding sequence and a stop flag bit, wherein the stop flag bit is used for representing whether the decoding process is stopped or not;
inputting the decoding sequence into the CBHG unit to extract context characteristics to obtain an acoustic characteristic sequence;
inputting the sequence of acoustic features into the vocoder to synthesize the singing audio.
Further, the step of obtaining a phoneme sequence of a song to be synthesized includes:
obtaining a music score of the song to be synthesized, wherein the music score comprises lyrics and notes;
analyzing the music score, and dividing each syllable of the lyrics into at least one phoneme;
and acquiring the pitch and the phoneme duration corresponding to each phoneme according to the notes to obtain the phoneme sequence.
Further, the step of dividing each syllable of the lyrics into at least one phoneme comprises:
judging, for any target syllable in the lyrics, whether the target syllable is an initial-less pinyin, that is, a pinyin without an initial consonant;
if not, dividing the target syllable into its initial and final to obtain at least one phoneme;
if yes, adding a universal initial consonant or segmentation information to the target syllable and then dividing it to obtain the at least one phoneme.
Further, the phoneme sequence further includes a breath phoneme for characterizing breath sounds in the reference audio.
Further, the phoneme sequence further comprises a liaison flag, and the liaison flag is used for indicating that a target phoneme in the plurality of phonemes is sung with liaison, i.e., connected to an adjacent note.
Further, the singing voice synthesis model is trained by the following method:
acquiring sample audio;
analyzing the sample audio to obtain a plurality of sample phoneme sequences, wherein one sample phoneme sequence corresponds to one audio frame in the sample audio, and the sample phoneme sequence comprises a plurality of phonemes corresponding to the audio frame and the pitch and phoneme duration of each phoneme;
and training a preset model based on the plurality of sample phoneme sequences to obtain the singing voice synthesis model.
Further, the step of training a preset model based on the plurality of sample phoneme sequences to obtain the singing voice synthesis model includes:
inputting the plurality of sample phoneme sequences into the preset model, and outputting synthesized audio;
analyzing the synthesized audio to obtain predicted phoneme duration information;
based on the predicted phoneme duration information and the real phoneme duration information of the sample audio, utilizing a loss function:
align_loss=tf.reduce_mean(tf.abs(align_targets-align_outputs))
and updating parameters of the preset model to obtain the singing voice synthesis model, wherein tf.abs takes the absolute value, align_targets represents the real phoneme duration information, align_outputs represents the predicted phoneme duration information, and tf.reduce_mean takes the mean, so that align_loss is the mean absolute error between the two.
In a second aspect, an embodiment of the present application further provides a singing voice synthesizing apparatus, including:
the device comprises an obtaining module, a coding module and a decoding module, wherein the obtaining module is used for obtaining a phoneme sequence of a song to be synthesized, and the phoneme sequence comprises a plurality of phonemes, and a pitch and a phoneme duration corresponding to each phoneme;
the coding module is used for inputting the phoneme sequence into a singing voice synthesis model and coding the phoneme sequence by utilizing a coding network of the singing voice synthesis model to obtain a coding vector;
and the decoding module is used for inputting the coding vector and the Mel spectral characteristics into a decoding network of the singing voice synthesis model to obtain the singing audio of the song to be synthesized, wherein the Mel spectral characteristics are obtained by processing reference audio in advance.
In a third aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes:
one or more processors;
a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the singing voice synthesis method described above.
In a fourth aspect, the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the singing voice synthesizing method described above.
Compared with the prior art, in the singing voice synthesis method and apparatus, the electronic device, and the storage medium provided by the embodiments of the application, the phoneme sequence of the song to be synthesized is obtained and input into the singing voice synthesis model. Because the phoneme sequence includes the phonemes together with the pitch and phoneme duration of each phoneme, the synthesized singing audio can reflect the pronunciation duration of each phoneme, which improves the naturalness of the singing voice synthesis. Meanwhile, mel-spectrum features, obtained by processing a reference audio in advance, are input at the decoding stage of the singing voice synthesis model, so that the synthesized singing audio is closer to a real singing effect and the listening experience of the user is improved.
Drawings
Fig. 1 is a schematic flow chart illustrating a singing voice synthesizing method according to an embodiment of the present application.
Fig. 2 is a flowchart illustrating step S101 in the singing voice synthesizing method illustrated in fig. 1.
Fig. 3 illustrates an exemplary diagram of a music score provided in an embodiment of the present application.
Fig. 4 is a diagram illustrating an example of a parsing process of a score provided in an embodiment of the present application.
Fig. 5 is a schematic structural diagram of a singing voice synthesis model provided in an embodiment of the present application.
Fig. 6 is a flowchart illustrating step S102 in the singing voice synthesizing method illustrated in fig. 1.
Fig. 7 is a flowchart illustrating step S103 of the singing voice synthesizing method shown in fig. 1.
Fig. 8 is a schematic diagram illustrating a training process of a singing voice synthesis model provided in an embodiment of the present application.
Fig. 9 is a block diagram schematically illustrating a singing voice synthesizing apparatus according to an embodiment of the present application.
Fig. 10 shows a block schematic diagram of an electronic device provided in an embodiment of the present application.
Icon: 10-an electronic device; 11-a processor; 12-a memory; 13-a bus; 100-singing voice synthesizing means; 110-an obtaining module; 120-an encoding module; 130-decoding module.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The singing voice synthesis method provided by the embodiments of the application can simulate a human voice, thereby providing the user with artificial-intelligence singing functions such as virtual singing. Moreover, with the singing voice synthesis method, various types of human vocal audio can be synthesized, such as Chinese songs, English songs, narration, and other vocal performance audio.
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a singing voice synthesizing method provided in an embodiment of the present application, where the singing voice synthesizing method is applied to an electronic device and may include the following steps S101 to S103.
S101, obtaining a phoneme sequence of a song to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes, and a pitch and a phoneme duration corresponding to each phoneme.
Phonemes are the smallest units of speech, divided according to the natural properties of speech; they are obtained by analyzing the pronunciation actions within a syllable, with one action constituting one phoneme. Phonemes fall into two categories, vowels and consonants, and the phoneme inventory differs across pronunciation systems. For example, for English, the phonemes include vowel phonemes and consonant phonemes. For Chinese, the syllable (i.e. pinyin) of each Chinese character can be decomposed into an initial and a final, so the phonemes include initials and finals. The following embodiments are presented in the context of Chinese.
The pitch of a phoneme indicates how high the phoneme is pitched during pronunciation. The pitch range of singing is much larger than that of ordinary speech, and since pitch is in essence frequency, the singing pitch can be divided into 36 levels, e.g. C3, F4, with reference to the international standard mapping between pitch names and frequencies in hertz.
The phoneme duration of a phoneme is used to indicate the duration of the phoneme during pronunciation. For example, the phoneme is the final "i", and the corresponding phoneme duration is 200ms, which indicates that the phoneme "i" lasts 200ms during the pronunciation process.
Singing involves rhythm, and rhythm is in essence phoneme duration; the synthesized singing audio is output in the form of audio frames of at least 32 ms. Therefore, the minimum resolution of the rhythm can be set to 32 ms, that is, the minimum phoneme duration is 32 ms, and the phoneme duration is divided into 300 incremental levels.
The phoneme sequence is obtained by parsing the musical score of the song to be synthesized. Generally, the score includes lyrics and notes, one line of lyrics corresponds to one phoneme sequence, the phoneme sequence is input into the singing voice synthesis model to output the corresponding audio frames, and one phoneme sequence corresponds to one or more consecutive audio frames in the synthesized singing audio.
The phoneme sequence may include a plurality of elements, each element including a phoneme and a pitch and phoneme duration corresponding to the phoneme, which may be expressed as (phoneme, pitch, phoneme duration). For example, (iou, C3, 10), where iou represents a phoneme, C3 represents a pitch, and 10 represents a phoneme duration.
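For illustration only, the following minimal Python sketch shows one possible representation of such an element; the class name and field comments are assumptions for readability and do not limit the embodiments.

from dataclasses import dataclass

@dataclass
class PhonemeElement:
    phoneme: str   # e.g. "iou"
    pitch: str     # one of the 36 pitch levels, e.g. "C3" or "F4"
    duration: int  # phoneme duration, e.g. 10 as in the (iou, C3, 10) example

# One line of lyrics corresponds to one phoneme sequence, i.e. a list of elements:
sequence = [PhonemeElement("iou", "C3", 10)]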
The process of parsing the score into a sequence of phonemes is described in detail below. Referring to fig. 2, step S101 may include sub-steps S1011 to S1013 based on fig. 1.
S1011, obtaining a music score of the song to be synthesized, wherein the music score comprises lyrics and notes.
The song to be synthesized may be any one of songs that need to be synthesized using the singing voice synthesis model, for example, a chinese song, an english song, and the like.
In the present embodiment, the audio of a song is synthesized directly from a musical score, which usually includes lyrics and notes in text form. For example, fig. 3 shows a musical score segment of a Chinese song, including the Chinese lyrics "round moon and round, I make a wish to the night sky" and the corresponding notes.
S1012, parsing the musical score to divide each syllable of the lyrics into at least one phoneme.
The following embodiment illustrates the parsing process using the partial score fragment shown in fig. 3, which includes the lyric syllables "yue" (moon), "er" (a diminutive particle), "yuan" (round), "you" (again), "yuan" (round) and the corresponding notes.
Each syllable of the lyrics may be divided into at least one phoneme by parsing the score. Referring to FIG. 4, the lyrics include the 5 syllables "yue", "er", "yuan", "you", "yuan", and each syllable may be divided into at least one phoneme. For example, the syllable "yue" can be divided into 2 phonemes: "y", corresponding to the initial, and "ue", corresponding to the final.
In singing voice synthesis, an initial-less pinyin may suffer from pronunciation ambiguity, such as "er" in fig. 4, so initial-less pinyins need to be preprocessed during syllable division and divided into phonemes only after the preprocessing, in order to avoid pronunciation ambiguity.
As an embodiment, the process of dividing each syllable of the lyrics into at least one phoneme may include:
judging, for any target syllable in the lyrics, whether the target syllable is an initial-less pinyin;
if not, dividing the target syllable into its initial and final to obtain at least one phoneme;
if yes, adding a universal initial consonant or segmentation information to the target syllable and then dividing it to obtain at least one phoneme.
In this embodiment, initial-less pinyins may be handled as follows:
In an alternative embodiment, a universal initial may be added to the initial-less pinyin. For example, "er" has only a final and no initial, so a universal initial al may be added uniformly for this pronunciation, after which it is divided into the phonemes "al" and "er".
In another alternative embodiment, segmentation information may be added. For example, for the pinyin "kuai", the segmentation mark sep is added to give "sep ku sep ai sep", which is then divided into the phonemes "sep", "k", "u", "sep", "ai" and "sep".
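For illustration only, the following minimal Python sketch shows one possible implementation of the above division, covering both the universal-initial and the segmentation-mark treatment of an initial-less pinyin; the initial table and the function name are assumptions and do not limit the embodiments.

# Pinyin initials, longest first so that "zh"/"ch"/"sh" are matched before "z"/"c"/"s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_syllable(pinyin, use_universal_initial=True):
    for initial in INITIALS:
        if pinyin.startswith(initial) and len(pinyin) > len(initial):
            return [initial, pinyin[len(initial):]]   # initial + final
    # Initial-less pinyin such as "er": pre-process before dividing.
    if use_universal_initial:
        return ["al", pinyin]                         # add the universal initial
    return ["sep", pinyin, "sep"]                     # or wrap with segmentation marks

print(split_syllable("yue"))   # ['y', 'ue']
print(split_syllable("er"))    # ['al', 'er']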
And S1013, acquiring the pitch and phoneme duration corresponding to each phoneme according to the notes to obtain a phoneme sequence.
Since the score contains notes with corresponding pitches, after each syllable of the lyrics is divided into at least one phoneme in the manner of sub-step S1012, the pitch corresponding to each phoneme can be obtained from the notes. In general, in a musical score, one syllable corresponds to one pitch, and all phonemes in the same syllable correspond to that same pitch; for example, the syllable "yue" in FIG. 4 corresponds to the pitch Db4, so the phonemes "y" and "ue" both correspond to the pitch Db4.
Meanwhile, the pinyin duration of each syllable can also be obtained from the notes in the score; for example, the pinyin duration of the syllable "yue" in fig. 4 is 260 ms. However, the singing voice synthesis model takes the phoneme duration of each phoneme as input, so the phoneme duration of each phoneme in a syllable needs to be determined from the pinyin duration of the syllable, e.g. the phoneme durations of the phonemes "y" and "ue" are determined from the 260 ms pinyin duration of "yue".
For the determination of the phoneme duration, the following manner may be adopted:
in an alternative embodiment, a bi-directional multi-layered LSTM network may be constructed and trained to learn the percentage of each phoneme's duration given the pinyin duration.
In another alternative embodiment, the inventor has found through extensive experimental verification that the same person pronounces initials fairly consistently. Therefore, a plurality of songs sung by the same person, for example 100 songs, may be selected in advance and the average pronunciation duration of each initial in those songs counted; for example, the average pronunciation duration of the initial "y" is 63 ms. This average duration is taken as the phoneme duration of the corresponding initial, e.g. the phoneme duration of the phoneme "y" in fig. 4 is set to 63 ms, and the phoneme duration of the final is the pinyin duration minus the phoneme duration of the initial.
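For illustration only, the following minimal Python sketch implements the second scheme; apart from the 63 ms average of the initial "y" mentioned above, the values in the table are assumptions and do not limit the embodiments.

# Average pronunciation duration of each initial, counted over songs by one singer.
AVG_INITIAL_MS = {"y": 63, "x": 70, "k": 55}   # illustrative values

def phoneme_durations(initial, final, pinyin_ms):
    initial_ms = AVG_INITIAL_MS.get(initial, 60)          # fallback value is an assumption
    return {initial: initial_ms, final: pinyin_ms - initial_ms}

print(phoneme_durations("y", "ue", 260))   # {'y': 63, 'ue': 197}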
In one possible situation, a real person's singing contains breath sounds, so to make the synthesized singing voice more lifelike, breath sounds are also handled in singing voice synthesis.
Breath audio may be pre-classified into 30 levels by duration, e.g. level 6 for durations close to 200 ms and level 10 for durations close to 300 ms. The breath phoneme then takes the form break_<duration level>; for example, break_6 denotes duration level 6 and break_10 denotes duration level 10.
Breath-sound processing is described separately for the model training stage and the model application stage.
In the model training stage, the breathing sounds can be input into the model in the form of breathing phonemes, so that the model can well learn the breathing information of a singer of sample audio. The breath phoneme is obtained by obtaining the duration of the breath audio in the sample audio and then determining the corresponding duration grade according to the duration.
In the model application stage, the duration of the respiratory audio in the reference audio can be obtained, the respiratory phoneme is formed according to the duration, and the respiratory phoneme is input into the singing voice synthesis model. Thus, the phoneme sequence may further comprise a breathing phoneme for characterizing breathing sounds in the reference audio.
It should be noted that, in the model application stage, whether the phoneme sequence includes the breath phoneme and which phoneme sequence includes the breath phoneme may be flexibly selected by the user, and is not limited herein.
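For illustration only, the following minimal Python sketch forms a breath phoneme of the form break_<duration level> from the duration of a breath segment; the text does not give the exact level boundaries, so the 30 ms bucket width and the rounding rule are purely assumptions.

def breath_phoneme(duration_ms, ms_per_level=30, max_level=30):
    # Quantize the breath duration into one of the 30 pre-defined levels.
    level = min(max(int(round(duration_ms / ms_per_level)), 1), max_level)
    return f"break_{level}"

print(breath_phoneme(250))   # break_8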
In another possible situation, when a real person sings a song, it is difficult to follow the given score exactly, and liaison (connected, slurred pronunciation across notes) is almost unavoidable. Therefore, to achieve a more anthropomorphic effect, liaison is also handled in singing voice synthesis.
Liaison processing is likewise described separately for the model training stage and the model application stage.
In the model training stage, the sample audio is analyzed, the liaisons in the sample audio are found and input into the model for training in the form of liaison flags, where a liaison flag indicates that a target phoneme among the plurality of phonemes is sung with liaison, so that the model can learn liaison skills.
In the model application stage, the reference audio is analyzed, liaison flags are constructed according to the liaison habits of the singer of the reference audio, and the flags are input into the model. Therefore, the phoneme sequence may further include a liaison flag for indicating that a target phoneme of the plurality of phonemes has liaison.
It should be noted that, in the model application stage, whether a phoneme sequence includes a liaison flag, and which phoneme sequences include one, may be flexibly selected by the user and is not limited herein.
Because of the way Chinese is pronounced, liaison occurs only within finals, so the related finals are connected: several finals joined by liaison are treated as one phoneme, that phoneme is the target phoneme, and the liaison identifier is added. For example, the syllable "xiao" includes the initial "x" and the finals "i" and "ao", but "i" and "ao" are connected in pronunciation, so the syllable "xiao" is divided into the two phonemes "x" and "iao"; meanwhile, a liaison identifier con is added. The identifier con can also be used as a phoneme, but its pitch and phoneme duration must be set to 0, e.g. (con, 0, 0).
Meanwhile, a phoneme with liaison may span several pitches during pronunciation; for example, the pitches of "iao" include Eb4, F4 and E4. In this case, the phoneme duration corresponding to each pitch is obtained, and the elements of the phoneme sequence are constructed accordingly, e.g. (x, Eb4, 63), (iao, Eb4, 220), (con, 0, 0), (iao, F4, 500), (con, 0, 0), (iao, E4, 30).
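For illustration only, the following minimal Python sketch builds the sequence elements of the "xiao" example above, inserting the liaison identifier con (with pitch and phoneme duration set to 0) between consecutive pitches of the same final; the helper name is an assumption.

def liaison_elements(initial, initial_ms, final, pitched_segments):
    # pitched_segments: list of (pitch, duration) pairs for the final sung with liaison.
    elements = [(initial, pitched_segments[0][0], initial_ms)]
    for i, (pitch, duration) in enumerate(pitched_segments):
        if i > 0:
            elements.append(("con", 0, 0))            # liaison mark, pitch/duration 0
        elements.append((final, pitch, duration))
    return elements

print(liaison_elements("x", 63, "iao", [("Eb4", 220), ("F4", 500), ("E4", 30)]))
# [('x', 'Eb4', 63), ('iao', 'Eb4', 220), ('con', 0, 0), ('iao', 'F4', 500),
#  ('con', 0, 0), ('iao', 'E4', 30)]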
S102, inputting the phoneme sequence into the singing voice synthesis model, and coding the phoneme sequence by using a coding network of the singing voice synthesis model to obtain a coding vector.
S103, inputting the coding vector and the Mel-frequency-spectrum characteristic into a decoding network of the singing voice synthesis model to obtain the singing voice frequency of the song to be synthesized, wherein the Mel-frequency-spectrum characteristic is obtained by processing the reference voice frequency in advance.
Referring to fig. 5, fig. 5 is a schematic structural diagram of the singing voice synthesis model provided in an embodiment of the present application. The singing voice synthesis model can be obtained by improving on the conventional spectrogram-prediction network Tacotron 1. As shown in fig. 5, the singing voice synthesis model includes an encoding network and a decoding network; after the phoneme sequence is input into the singing voice synthesis model, the encoding network encodes the phoneme sequence into a fixed-length coding vector, and the decoding network decodes the coding vector to generate the singing audio.
Meanwhile, the input of the decoding network includes mel-spectrum features of reference audio in addition to the encoding vectors. The reference audio may be a speaking audio or a singing audio of a person, and may be flexibly selected by the user, for example, the user records a voice of the user as the reference audio, or selects a voice of a star as the reference audio, and the like, which is not limited herein. By incorporating the mel-spectrum features of the reference audio at the decoding stage, the synthesized singing audio can be made to more closely approximate the real singing effect.
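For illustration only, the following minimal Python sketch extracts mel-spectrum features from a reference audio in advance; the use of librosa, the 80 mel bands and the 32 ms hop (matching the minimum audio-frame length mentioned earlier) are assumptions and do not limit the embodiments.

import librosa
import numpy as np

def reference_mel(wav_path, n_mels=80, hop_ms=32):
    y, sr = librosa.load(wav_path, sr=None)
    hop_length = int(sr * hop_ms / 1000)                  # one audio frame per hop
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    return np.log(mel + 1e-6).T                           # shape: (frames, n_mels)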
As shown in fig. 5, the encoding network may include an embedding unit, a preprocessing unit, and a feature extraction unit, where the feature extraction unit may include a convolutional layer and a highway network.
Therefore, referring to fig. 6, step S102 may include sub-steps S1021 to S1022 based on fig. 1.
And S1021, processing the phoneme sequence by using the embedding unit to obtain an embedded sequence.
S1022, the embedded sequence is input into the preprocessing unit for nonlinear transformation, and then input into the feature extraction unit to generate the coding vector.
In this embodiment, the embedding unit may be Character Embedding and the preprocessing unit may be Pre-net. Since feature extraction for phoneme, pitch and phoneme duration does not need to consider context information, the bidirectional recurrent neural network is removed from the CBHG module (Convolution Bank + Highway network + bidirectional Gated Recurrent Unit, i.e. convolutional layers + highway network + bidirectional recurrent neural network) of the Tacotron 1 model, which yields the feature extraction unit.
On the basis of the coding network shown in fig. 5, the process of generating the code vector by using the coding network is as follows:
1. inputting the phoneme sequence into an Embedding unit (Character Embedding), and processing the phoneme sequence by using the Embedding unit (Character Embedding) to obtain an embedded sequence;
2. inputting the embedded sequence into a preprocessing unit (Pre-net), and carrying out nonlinear transformation on the embedded sequence by using the preprocessing unit (Pre-net) so as to improve the convergence and generalization capability of the singing voice synthesis model;
3. inputting the embedded sequence after nonlinear transformation into the feature extraction unit (Convolution Bank + Highway network), extracting the features of phonemes, pitches and phoneme durations simultaneously with the feature extraction unit, and outputting the coding vector.
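For illustration only, the following TensorFlow/Keras sketch shows one possible form of such an encoding network (embedding unit, Pre-net, and a feature extraction unit made of a convolution bank plus a highway network, without a bidirectional RNN); all layer sizes and depths are assumptions and not the actual hyper-parameters of the embodiments.

import tensorflow as tf
from tensorflow.keras import layers

def build_encoder(num_symbols=400, embed_dim=256, prenet_units=(256, 128),
                  bank_k=8, bank_filters=128, highway_layers=4):
    # Input: integer ids of the phoneme-sequence elements (phoneme/pitch/duration).
    inputs = layers.Input(shape=(None,), dtype="int32")
    x = layers.Embedding(num_symbols, embed_dim)(inputs)          # embedding unit

    # Pre-net: nonlinear transformation to improve convergence and generalization.
    for units in prenet_units:
        x = layers.Dense(units, activation="relu")(x)
        x = layers.Dropout(0.5)(x)

    # Convolution bank: 1-D convolutions with kernel sizes 1..bank_k.
    convs = [layers.Conv1D(bank_filters, k, padding="same", activation="relu")(x)
             for k in range(1, bank_k + 1)]
    x = layers.Concatenate()(convs)
    proj = prenet_units[-1]
    x = layers.Dense(proj, activation="relu")(x)                  # project back down

    # Highway network: gated mix of a transformed path and the carry path.
    for _ in range(highway_layers):
        transform = layers.Dense(proj, activation="relu")(x)
        gate = layers.Dense(proj, activation="sigmoid")(x)
        x = layers.Lambda(lambda a: a[0] * a[1] + (1.0 - a[0]) * a[2])(
            [gate, transform, x])

    # No bidirectional RNN here: phoneme/pitch/duration features need no context.
    return tf.keras.Model(inputs, x, name="encoder")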
As shown in fig. 5, the decoding network may include a position-sensitive attention layer, a prediction unit, a decoding unit, a CBHG unit and a vocoder, the CBHG unit including a convolutional layer, a highway network and a bidirectional recurrent neural network.
Therefore, referring to fig. 7 on the basis of fig. 1, step S103 may include sub-steps S1031 to S1035.
And S1031, inputting the coding vector into the position sensitive attention layer to learn the corresponding relation between the acoustic features and the phoneme sequence, and outputting a context vector.
And S1032, inputting the Mel spectrum characteristics into a prediction unit, and performing linear transformation on the Mel spectrum characteristics by using the prediction unit to obtain prediction output.
S1033, the context vector and the prediction output are spliced and then input into a decoding unit for decoding to obtain a decoding sequence and a stop flag bit, wherein the stop flag bit is used for representing whether the decoding process is stopped or not.
S1034, inputting the decoding sequence into the CBHG unit to extract the context characteristics to obtain an acoustic characteristic sequence.
S1035, inputting the acoustic feature sequence into the vocoder to synthesize the singing audio.
In this embodiment, the prediction unit may be Pre-net, the decoding unit may include an Attention RNN and a Decoder RNN, and the position-sensitive attention layer may be Location Sensitive Attention. The CBHG unit may be Convolution Bank + Highway network + bidirectional Gated Recurrent Unit, i.e. convolutional layers + a highway network + a bidirectional recurrent neural network.
On the basis of the decoding network shown in fig. 5, the process of outputting singing audio by the decoding network is as follows:
1. The coding vector is input into the position-sensitive attention layer (Location Sensitive Attention), which is essentially a matrix of context weights; it can automatically learn the correspondence between the acoustic features and the phoneme sequence, and it outputs a context vector.
The position Sensitive Attention layer (Location Sensitive Attention) carries out Attention calculation at each time step, and the learned position Sensitive information is obtained by accumulating Attention weights, so that the singing voice synthesis model carries out sequential processing on the contents in the phoneme sequence, and repeated prediction or omission is avoided. That is, a position Sensitive Attention layer (Location Sensitive Attention) is used in the singing voice synthesis model, and the position Sensitive Attention layer is used to focus on different parts of the coding vector, so that the corresponding relation between the acoustic features and the phoneme sequence is automatically learned.
Therefore, the use of the Location Sensitive Attention layer (Location Sensitive Attention) can further improve the stability of the singing voice synthesis effect, and avoid the situations of missing phonemes, repeated phonemes or incapability of stopping.
2. The previous frame and Mel spectral characteristics are input into a prediction unit (Pre-net), and the prediction unit (Pre-net) is used for carrying out linear transformation on the previous frame and Mel spectral characteristics to obtain prediction output.
Decoding is a cyclic process; let t denote the current time step of each loop. The previous frame and the mel-spectrum feature are input into the prediction unit (Pre-net) in the 1st loop, i.e. t = 1 at this time. When t > 1, as shown in fig. 5, the decoded sequence of time step t-1 is input into the prediction unit (Pre-net). The prediction unit (Pre-net) linearly transforms its input.
When the current time step t is 1, the previous frame and mel-frequency spectrum features are input to the prediction unit (Pre-net), so the previous frame (Initial frame) is a decoded sequence of time step 0, and each element in the decoded sequence is 0 at this time, that is, the previous frame is a full zero frame.
3. After the position-sensitive attention layer (Location Sensitive Attention) outputs the context vector and the prediction unit (Pre-net) produces the prediction output, the context vector and the prediction output are concatenated and decoded by the decoding unit, which outputs the decoded sequence and the Stop flag bit (Stop token) of the current time step t; the Stop flag bit (Stop token) indicates whether to stop the loop.
4. And inputting the decoding sequence into the CBHG unit to extract the context characteristics to obtain the acoustic characteristic sequence of the current time step t.
5. If the Stop flag bit (Stop token) indicates stopping the loop, the decoding process takes the acoustic feature sequence of the current time step t as the final acoustic feature sequence;
if the Stop flag bit (Stop token) indicates not stopping the loop, the current time step t is updated to t+1 and the process returns to step 1, continuing until the Stop flag bit (Stop token) indicates stopping the loop, which yields the final acoustic feature sequence.
6. The final acoustic feature sequence is input into the vocoder to synthesize the singing audio.
The vocoder may convert the sequence of acoustic features generated by the decoding network into an audio waveform, and the vocoder may be a vocoder that generates an audio waveform based on mel-spectrum parameters, such as WaveGlow, Griffin-Lim, WaveNet, and the like.
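For illustration only, the following minimal Python sketch reproduces the control flow of steps 1 to 6 above; the five components are passed in as callables whose internals are assumptions standing in for the actual network, so only the loop structure, the all-zero initial frame and the stop-flag handling are depicted.

import numpy as np

def decode(encoder_outputs, mel_feature, attention, prenet, decoder_step,
           cbhg, vocoder, n_mels=80, max_steps=1000):
    # At time step t = 1 the "previous frame" is an all-zero frame, fed to the
    # Pre-net together with the mel-spectrum feature of the reference audio.
    prenet_input = np.concatenate([np.zeros(n_mels, dtype=np.float32), mel_feature])
    attention_state = None
    acoustic_features = []

    for t in range(1, max_steps + 1):
        # Step 1: location-sensitive attention over the coding vectors.
        context, attention_state = attention(encoder_outputs, attention_state)
        # Step 2: the prediction unit linearly transforms its input.
        pred = prenet(prenet_input)
        # Step 3: concatenate context vector and prediction output, decode one step.
        decoded, stop_token = decoder_step(np.concatenate([context, pred]))
        # Step 4: the CBHG unit extracts context features for time step t.
        acoustic_features.append(cbhg(decoded))
        # For t > 1 the decoded sequence of step t-1 becomes the Pre-net input.
        prenet_input = decoded
        # Step 5: the stop flag decides whether the loop terminates.
        if stop_token > 0.5:
            break
    # Step 6: the vocoder turns the acoustic feature sequence into singing audio.
    return vocoder(np.stack(acoustic_features))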
The following describes the training process of the singing voice synthesis model in detail.
In this embodiment, the training process of the singing voice synthesis model may be applied to an electronic device, and the singing voice synthesis method and the training process of the singing voice synthesis model may be implemented by the same electronic device or by different electronic devices.
Referring to fig. 8, fig. 8 is a schematic diagram illustrating a training process of a singing voice synthesis model according to an embodiment of the present application, where the training process of the singing voice synthesis model may include steps S201 to S203.
S201, sample audio is obtained.
The sample audio may be one or more pre-recorded audios, such as the audio of a designated Chinese song or English song, or designated narration or other vocal performance audio.
S202, analyzing the sample audio to obtain a plurality of sample phoneme sequences, wherein one sample phoneme sequence corresponds to one audio frame in the sample audio, and the sample phoneme sequence comprises a plurality of phonemes corresponding to the audio frame, and the pitch and phoneme duration of each phoneme.
The sample audio may include a plurality of audio frames, and typically, the plurality of audio frames and the plurality of sample phoneme sequences have a one-to-one correspondence. Analyzing any one audio frame to obtain each phoneme in the audio frame, and the pitch and phoneme duration of each phoneme, so as to obtain a sample phoneme sequence corresponding to the audio frame.
Taking a chinese song as an example, for the sample audio, the obtaining manner of each phoneme is similar to that described in step S101, and is not described herein again.
The way to obtain the pitch of each phoneme may be: the pitch of each audio frame in the sample audio is identified through software, the identified pitch of the audio frame is determined according to 36 preset pitch levels, for example, C3, F4 and the like, and the pitch of the audio frame is taken as the pitch of each phoneme in the audio frame.
The phoneme duration of each phoneme may be obtained, for example, by identifying through software the duration of each phoneme in the sample audio during pronunciation, or by determining it through manual annotation.
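For illustration only, the following minimal Python sketch shows one possible way to identify the frame-level pitch described above and map it to a note name such as C3 or F4; the choice of librosa and its pyin estimator is an assumption, since the embodiments do not name the software used.

import librosa

def frame_pitches(wav_path):
    y, sr = librosa.load(wav_path, sr=None)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                      fmax=librosa.note_to_hz("C7"), sr=sr)
    # Map each voiced frame's fundamental frequency to the nearest note name.
    return [librosa.hz_to_note(f) if voiced else None
            for f, voiced in zip(f0, voiced_flag)]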
S203, training the preset model based on the multiple sample phoneme sequences to obtain a singing voice synthesis model.
After the plurality of sample phoneme sequences of the sample audio are obtained, the sample audio may be used as the label and the plurality of sample phoneme sequences as the input, and the preset model is trained until the loss value of the set loss function meets a certain requirement or the number of iterations reaches a set number, thereby obtaining the singing voice synthesis model. The model structure of the preset model is identical to that shown in fig. 5; only the model parameters differ.
As an embodiment, the process of training the preset model may include:
1. inputting a plurality of sample phoneme sequences into a preset model, and outputting synthesized audio;
2. analyzing the synthesized audio to obtain predicted phoneme duration information;
3. based on the predicted phoneme duration information and the true phoneme duration information of the sample audio, using a loss function:
align_loss=tf.reduce_mean(tf.abs(align_targets-align_outputs))
and updating parameters of the preset model to obtain the singing voice synthesis model, wherein tf.abs takes the absolute value, align_targets represents the real phoneme duration information, align_outputs represents the predicted phoneme duration information, and tf.reduce_mean takes the mean, so that align_loss is the mean absolute error between the two.
The predicted phoneme duration information comprises the phoneme duration of each phoneme in the synthetic audio, the real phoneme duration information comprises the real phoneme duration of each phoneme in the sample audio, and the real phoneme duration information can be determined in a manual calibration mode.
For singing voice synthesis, model training is to learn the rhythm duration, and the learning essence of the rhythm duration is to learn the phoneme duration of each phoneme, so the core of model training is to accurately learn the phoneme duration of each phoneme.
The model provided by the embodiments of the application adopts an attention mechanism, which can learn phoneme duration information in an unsupervised manner; however, due to the complexity of singing voice synthesis, the phoneme duration information learned this way deviates considerably. Therefore, during training, the phoneme duration information learned by the attention mechanism, i.e. the predicted phoneme duration information of the synthesized audio, and the real phoneme duration information of the sample audio are added into the loss function in the form of a mean absolute error term, which turns the unsupervised learning into supervised learning and greatly improves the accuracy of rhythm duration control.
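For illustration only, the following minimal TensorFlow sketch adds the alignment loss given above to an overall training loss; the reconstruction term and the 0.1 weight are assumptions and do not limit the embodiments.

import tensorflow as tf

def total_loss(mel_targets, mel_outputs, align_targets, align_outputs,
               align_weight=0.1):
    # Reconstruction loss on the predicted acoustic features (assumed term).
    mel_loss = tf.reduce_mean(tf.abs(mel_targets - mel_outputs))
    # Supervised duration loss: mean absolute error between the phoneme durations
    # learned by the attention mechanism and the real phoneme durations.
    align_loss = tf.reduce_mean(tf.abs(align_targets - align_outputs))
    return mel_loss + align_weight * align_loss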
Compared with the prior art, the embodiment of the application has the following beneficial effects:
firstly, the phoneme, the pitch and the phoneme duration are used as the input of the singing voice synthesis model, so that the synthesized singing audio can reflect the pronunciation duration of each phoneme, improving the naturalness of the singing voice synthesis; meanwhile, the mel-spectrum features of the reference audio are input at the decoding stage, so that the synthesized singing audio can be closer to the real singing effect;
secondly, introducing a position sensitive attention layer, and automatically learning the corresponding relation between the acoustic features and the phoneme sequence by using the position sensitive attention layer to pay attention to different parts of the coding vector, thereby improving the stability of the singing voice synthesis effect and avoiding the situations of missing phonemes, repeated phonemes or incapability of stopping;
thirdly, a stop flag bit is introduced to indicate whether the decoding process is stopped or not, so that the decoding process is prevented from falling into endless loop;
fourthly, the processing of breath sounds and liaison is considered in the singing voice synthesis, so that the synthesized singing audio achieves a more anthropomorphic effect.
In order to perform the corresponding steps in the above-described singing voice synthesizing method embodiment and various possible embodiments, an implementation applied to a singing voice synthesizing apparatus is given below.
Referring to fig. 9, fig. 9 is a block diagram illustrating a singing voice synthesizing apparatus 100 according to an embodiment of the present application. The singing voice synthesizing apparatus 100 is applied to an electronic device, and may include: an obtaining module 110, an encoding module 120, and a decoding module 130.
An obtaining module 110, configured to obtain a phoneme sequence of a song to be synthesized, where the phoneme sequence includes a plurality of phonemes and a pitch and a phoneme duration corresponding to each phoneme.
And the coding module 120 is configured to input the phoneme sequence into the singing voice synthesis model, and code the phoneme sequence by using a coding network of the singing voice synthesis model to obtain a coding vector.
And a decoding module 130, configured to input the coding vector and mel spectrum features into a decoding network of the singing voice synthesis model, so as to obtain the singing audio of the song to be synthesized, where the mel spectrum features are obtained by processing the reference audio in advance.
Optionally, the obtaining module 110 is specifically configured to:
obtaining a music score of a song to be synthesized, wherein the music score comprises lyrics and notes;
analyzing the music score, and dividing each syllable of the lyrics into at least one phoneme;
and acquiring the pitch and phoneme duration corresponding to each phoneme according to the notes to obtain a phoneme sequence.
Optionally, the obtaining module 110 performs a manner of dividing each syllable in the lyrics into at least one phoneme, including:
judging, for any target syllable in the lyrics, whether the target syllable is an initial-less pinyin;
if not, dividing the target syllable into its initial and final to obtain at least one phoneme;
if yes, adding a universal initial consonant or segmentation information to the target syllable and then dividing it to obtain at least one phoneme.
Optionally, the encoding module 120 performs a manner of encoding the phoneme sequence by using the encoding network of the singing voice synthesis model to obtain an encoding vector, including:
processing the phoneme sequence by using an embedding unit to obtain an embedded sequence;
the embedded sequence is firstly input into a preprocessing unit to carry out nonlinear transformation, and then input into a feature extraction unit to generate a coding vector.
Optionally, the decoding module 130 executes a step of inputting the coding vector and mel spectrum feature into a decoding network of the singing voice synthesis model to obtain the singing audio of the song to be synthesized, including:
inputting the coding vector into a position sensitive attention layer to learn the corresponding relation between the acoustic features and the phoneme sequence and outputting a context vector;
inputting the Mel spectrum characteristics into a prediction unit, and performing linear transformation on the Mel spectrum characteristics by using the prediction unit to obtain prediction output;
splicing the context vector and the prediction output, and inputting the spliced context vector and the prediction output into a decoding unit for decoding to obtain a decoding sequence and a stop flag bit, wherein the stop flag bit is used for representing whether the decoding process is stopped or not;
inputting the decoding sequence into a CBHG unit to extract context characteristics to obtain an acoustic characteristic sequence;
the acoustic signature sequence is input to a vocoder to synthesize singing audio.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the singing voice synthesizing apparatus 100 described above may refer to the corresponding process in the foregoing method embodiment, and will not be described herein again.
Referring to fig. 10, fig. 10 is a block diagram illustrating an electronic device 10 according to an embodiment of the present disclosure. The electronic device 10 may be any electronic device with a speech processing function, such as a server, a mobile terminal, a general-purpose computer, a special-purpose computer, or the like, and the mobile terminal may be a smart phone, a notebook computer, a tablet computer, a desktop computer, a smart television, or the like.
Electronic device 10 may include a processor 11, a memory 12, and a bus 13, with processor 11 being coupled to memory 12 via bus 13.
The memory 12 is used for storing a program, such as the singing voice synthesizing apparatus 100 shown in fig. 9. The singing voice synthesizing apparatus 100 includes at least one software functional module that can be stored in the memory 12 in the form of software or firmware, and the processor 11 executes the program after receiving an execution instruction so as to implement the singing voice synthesis method disclosed in the above embodiments.
The Memory 12 may include a Random Access Memory (RAM) and may also include a non-volatile Memory (NVM).
The processor 11 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 11. The processor 11 may be a general-purpose processor, and includes a Central Processing Unit (CPU), a Micro Control Unit (MCU), a Complex Programmable Logic Device (CPLD), a Field Programmable Gate Array (FPGA), and an embedded ARM.
The embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by the processor 11, implements the singing voice synthesizing method disclosed in the above embodiment.
In summary, in the singing voice synthesis method and apparatus, the electronic device, and the storage medium provided by the embodiments of the application, the phoneme sequence of the song to be synthesized is obtained and input into the singing voice synthesis model. Because the phoneme sequence includes phonemes, pitches and phoneme durations, the synthesized singing audio can reflect the pronunciation duration of each phoneme, which improves the naturalness of the singing voice synthesis. Meanwhile, mel-spectrum features, obtained by processing a reference audio in advance, are input at the decoding stage of the singing voice synthesis model, so that the synthesized singing audio is closer to a real singing effect and the listening experience of the user is improved.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (12)

1. A method of synthesizing singing voice, the method comprising:
obtaining a phoneme sequence of a song to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes, and a pitch and a phoneme duration corresponding to each phoneme;
inputting the phoneme sequence into a singing voice synthesis model, and coding the phoneme sequence by using a coding network of the singing voice synthesis model to obtain a coding vector;
and inputting the coding vector and the Mel spectrum characteristics into a decoding network of the singing voice synthesis model to obtain the singing audio of the song to be synthesized, wherein the Mel spectrum characteristics are obtained by processing a reference audio in advance.
2. The method of claim 1, wherein the coding network comprises an embedding unit, a preprocessing unit, and a feature extraction unit, the feature extraction unit comprising a convolutional layer and a highway network;
the step of encoding the phoneme sequence by using the encoding network of the singing voice synthesis model to obtain an encoding vector includes:
processing the phoneme sequence by using the embedding unit to obtain an embedded sequence;
and inputting the embedded sequence into the preprocessing unit for nonlinear transformation, and then inputting the embedded sequence into the feature extraction unit to generate the coding vector.
3. The method of claim 1, wherein the decoding network comprises a position-sensitive attention layer, a prediction unit, a decoding unit, a CBHG unit, and a vocoder, the CBHG unit comprising a convolutional layer, a highway network, and a bidirectional recurrent neural network;
the step of inputting the coding vector and mel spectrum characteristics into a decoding network of the singing voice synthesis model to obtain the singing audio of the song to be synthesized comprises the following steps:
inputting the coding vector into the position-sensitive attention layer to learn the corresponding relation between acoustic features and the phoneme sequence and outputting a context vector;
inputting the Mel spectrum characteristics into the prediction unit, and performing linear transformation on the Mel spectrum characteristics by using the prediction unit to obtain prediction output;
splicing the context vector and the prediction output, and inputting the spliced context vector and the prediction output into the decoding unit for decoding to obtain a decoding sequence and a stop flag bit, wherein the stop flag bit is used for representing whether the decoding process is stopped or not;
inputting the decoding sequence into the CBHG unit to extract context characteristics to obtain an acoustic characteristic sequence;
inputting the sequence of acoustic features into the vocoder to synthesize the singing audio.
4. The method of claim 1, wherein the step of obtaining a sequence of phonemes for a song to be synthesized comprises:
obtaining a music score of the song to be synthesized, wherein the music score comprises lyrics and notes;
analyzing the music score, and dividing each syllable of the lyrics into at least one phoneme;
and acquiring the pitch and the phoneme duration corresponding to each phoneme according to the notes to obtain the phoneme sequence.
5. The method of claim 4, wherein the step of dividing each syllable of the lyrics into at least one phoneme comprises:
judging, for any target syllable in the lyrics, whether the target syllable is an initial-less pinyin, that is, a pinyin without an initial consonant;
if not, dividing the target syllable into its initial and final to obtain at least one phoneme;
if yes, adding a universal initial consonant or segmentation information to the target syllable and then dividing it to obtain the at least one phoneme.
6. The method of claim 1, wherein the sequence of phonemes further comprises a breath phoneme for characterizing breath sounds in the reference audio.
7. The method of claim 1, wherein the phoneme sequence further comprises a liaison flag indicating that a target phoneme of the plurality of phonemes is sung with liaison.
8. The method of claim 1, wherein said singing voice synthesis model is trained by:
acquiring sample audio;
analyzing the sample audio to obtain a plurality of sample phoneme sequences, wherein one sample phoneme sequence corresponds to one audio frame in the sample audio, and the sample phoneme sequence comprises a plurality of phonemes corresponding to the audio frame and the pitch and phoneme duration of each phoneme;
and training a preset model based on the plurality of sample phoneme sequences to obtain the singing voice synthesis model.
9. The method of claim 8, wherein said step of training a predetermined model based on said plurality of sample phoneme sequences to obtain said singing voice synthesis model comprises:
inputting the plurality of sample phoneme sequences into the preset model, and outputting synthesized audio;
analyzing the synthesized audio to obtain predicted phoneme duration information;
based on the predicted phoneme duration information and the real phoneme duration information of the sample audio, utilizing a loss function:
align_loss=tf.reduce_mean(tf.abs(align_targets-align_outputs))
and updating parameters of the preset model to obtain the singing voice synthesis model, wherein tf.abs takes the absolute value, align_targets represents the real phoneme duration information, align_outputs represents the predicted phoneme duration information, and tf.reduce_mean takes the mean, so that align_loss is the mean absolute error between the two.
10. A singing voice synthesizing apparatus, characterized in that the apparatus comprises:
an obtaining module, configured to obtain a phoneme sequence of a song to be synthesized, wherein the phoneme sequence comprises a plurality of phonemes, and a pitch and a phoneme duration corresponding to each phoneme;
a coding module, configured to input the phoneme sequence into a singing voice synthesis model and code the phoneme sequence by using a coding network of the singing voice synthesis model to obtain a coding vector;
a decoding module, configured to input the coding vector and Mel spectrum features into a decoding network of the singing voice synthesis model to obtain singing audio of the song to be synthesized, wherein the Mel spectrum features are obtained by processing reference audio in advance.
11. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the singing voice synthesis method of any one of claims 1-9.
12. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the singing voice synthesis method according to any one of claims 1 to 9.
CN202111048649.7A 2021-09-08 Singing voice synthesizing method and device, electronic equipment and storage medium Active CN113593520B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111048649.7A CN113593520B (en) 2021-09-08 Singing voice synthesizing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111048649.7A CN113593520B (en) 2021-09-08 Singing voice synthesizing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113593520A true CN113593520A (en) 2021-11-02
CN113593520B CN113593520B (en) 2024-05-17

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114783406A (en) * 2022-06-16 2022-07-22 深圳比特微电子科技有限公司 Speech synthesis method, apparatus and computer-readable storage medium
WO2023245389A1 (en) * 2022-06-20 2023-12-28 北京小米移动软件有限公司 Song generation method, apparatus, electronic device, and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108831437A (en) * 2018-06-15 2018-11-16 百度在线网络技术(北京)有限公司 A kind of song generation method, device, terminal and storage medium
CN110709922A (en) * 2017-06-28 2020-01-17 雅马哈株式会社 Singing voice generating device, method and program
CN111354332A (en) * 2018-12-05 2020-06-30 北京嘀嘀无限科技发展有限公司 Singing voice synthesis method and device
CN111583900A (en) * 2020-04-27 2020-08-25 北京字节跳动网络技术有限公司 Song synthesis method and device, readable medium and electronic equipment
CN111899720A (en) * 2020-07-30 2020-11-06 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating audio
CN112466313A (en) * 2020-11-27 2021-03-09 四川长虹电器股份有限公司 Method and device for synthesizing singing voices of multiple singers
CN112802446A (en) * 2019-11-14 2021-05-14 腾讯科技(深圳)有限公司 Audio synthesis method and device, electronic equipment and computer-readable storage medium
WO2021101665A1 (en) * 2019-11-22 2021-05-27 Microsoft Technology Licensing, Llc Singing voice synthesis
CN112908302A (en) * 2021-01-26 2021-06-04 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, device and equipment and readable storage medium

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant