CN116072143A - Singing voice synthesizing method and related device


Info

Publication number
CN116072143A
Authority
CN
China
Prior art keywords
model
sample audio
audio
formant
information
Prior art date
Legal status
Pending
Application number
CN202310126243.9A
Other languages
Chinese (zh)
Inventor
庄晓滨
陈梦
宗旋
Current Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202310126243.9A
Publication of CN116072143A


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/06 Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signal analysis-synthesis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/315 Sound category-dependent sound synthesis processes [Gensound] for musical use; Sound category-specific synthesis-controlling parameters or control means therefor
    • G10H2250/455 Gensound singing voices, i.e. generation of human voices for musical applications, vocal singing sounds or intelligible words at a desired pitch or with desired vocal effects, e.g. by phoneme synthesis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The embodiment of the application provides a singing voice synthesis method and a related device. The method includes: inputting a syllable sequence and a fundamental-frequency marker sequence of audio to be synthesized into a formant model in a target acoustic model to obtain formant characterization information of the audio to be synthesized, where the formant characterization information contains no timbre information; inputting the formant characterization information and pitch information of the audio to be synthesized into a timbre conversion model in the target acoustic model to obtain mel-spectrum features that contain the timbre information of a target object, the timbre conversion model having been trained on sample audio of the target object; and inputting the mel-spectrum features into a vocoder to obtain a synthesized audio signal. By adopting the embodiment of the application, singing voices of any timbre can be synthesized across languages.

Description

Singing voice synthesizing method and related device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a singing voice synthesis method and related devices.
Background
With the continuous development of artificial intelligence in the music field, singing voice synthesis technology in music applications is attracting more and more attention. Singing voice synthesis is a new application of speech synthesis technology: a computer program synthesizes music-score information and lyrics into singing that is close to real singing. At present, singing voice synthesis is generally implemented with a data-driven neural network model. The languages supported by such a method (i.e., language varieties such as Mandarin, Cantonese and Hokkien) are tied to the singing voice data of the target timbre, where the singing voice data consist of the singing of a designated person (a professional singer) together with accompanying instruments (or only the singing of the designated person when there is no accompaniment). Because language and target timbre cannot be decoupled, such a method cannot synthesize singing of arbitrary timbre across languages. Therefore, how to synthesize singing voice in different languages has become a problem to be solved.
Disclosure of Invention
The embodiment of the application provides a singing voice synthesizing method and a related device, which can synthesize singing voice of any tone cross languages.
In a first aspect, an embodiment of the present application provides a singing voice synthesizing method, including:
inputting the syllable sequence and the fundamental frequency mark sequence of the audio to be synthesized into a formant model in the target acoustic model to obtain formant characterization information of the audio to be synthesized, wherein the formant characterization information is characterization information without tone information;
inputting formant characterization information and pitch information of the audio to be synthesized into a tone color conversion model in a target acoustic model to obtain a Mel spectrum feature, wherein the Mel spectrum feature comprises tone color information of a target object, and the tone color conversion model is obtained based on sample audio training of the target object;
the mel-spectrum features are input into a vocoder to obtain a synthesized audio signal.
In the embodiment of the application, the formant model is utilized to decouple tone information and languages of the audio to be synthesized, and formant characterization information irrelevant to tone is obtained; inputting formant characterization information into a tone conversion model in a target acoustic model corresponding to a target object, and obtaining Mel spectrum characteristics comprising tone information of the target object; the mel-spectrum feature is input to a vocoder to obtain a synthesized audio signal. Therefore, by adopting the embodiment of the application, singing voice of any tone cross-language can be synthesized.
In an alternative embodiment, before inputting the syllable sequence and the base frequency marker sequence of the audio to be synthesized into the formant model in the target acoustic model, the method further comprises:
acquiring a training sample audio set, wherein the training sample audio set comprises sample audio of a plurality of objects;
training the initialized acoustic model based on each sample audio in the training sample audio set to obtain a trained acoustic model;
taking the formant model in the trained acoustic model as the formant model in the target acoustic model;
wherein the initialized acoustic model includes an initialized formant model and an initialized timbre conversion model.
In an alternative embodiment, after obtaining the trained acoustic model, the method further comprises:
fixing parameters of a formant model in the trained acoustic model, inputting sample audio corresponding to a target object into the trained acoustic model, and retraining a tone conversion model in the trained acoustic model to obtain a trained tone conversion model;
and taking the trained timbre conversion model as a timbre conversion model in the target acoustic model.
In an alternative embodiment, training the initialized acoustic model based on each sample audio in the training set of sample audio to obtain a trained acoustic model includes:
Acquiring syllable sequences, fundamental frequency mark sequences, pitch information, tone information and acoustic characteristics of each training sample audio in a training sample audio set, wherein the acoustic characteristics are Mel spectrum characteristics;
and training the initialized acoustic model by utilizing the syllable sequence, the fundamental frequency mark sequence, the pitch information, the tone information and the acoustic characteristics of each sample audio frequency to obtain a trained acoustic model.
In an alternative embodiment, training the initialized acoustic model using the syllable sequence, the base frequency mark sequence, the pitch information, the timbre information and the acoustic features of each sample audio to obtain a trained acoustic model includes:
inputting syllable sequences and fundamental frequency mark sequences of each sample audio into an initialized formant model to obtain formant characterization information of each sample audio;
inputting formant characterization information, pitch information and tone information of each sample audio into an initialized tone conversion model to obtain predicted Mel spectrum characteristics of each sample audio;
determining a loss value for each sample audio based on the predicted mel-spectrum characteristics for each sample audio;
and when the loss value does not meet the training stopping condition, adjusting parameters of the initialized formant model and the initialized tone conversion model included in the initialized acoustic model according to the loss value to obtain a trained acoustic model.
In an alternative embodiment, determining a loss value for each sample audio based on the predicted mel-spectrum characteristics for each sample audio comprises:
determining a minimum mean square error between the predicted mel-spectrum feature and the acoustic feature of each sample audio, and determining a discriminator error corresponding to the predicted mel-spectrum feature of each sample audio;
a loss value for each sample audio is determined based on the minimum mean square error and the discriminator error.
In an alternative embodiment, determining a discriminator error corresponding to the predicted mel-spectrum feature for each sample audio includes:
inputting the predicted mel-spectrum features of each sample audio into an initialized discriminator to obtain a discrimination result for the predicted mel-spectrum features of each sample audio, wherein the discrimination result indicates whether the predicted mel-spectrum features are real mel-spectrum features or synthesized mel-spectrum features;
and determining a discriminator error corresponding to the predicted mel-spectrum characteristic of each sample audio based on the discrimination result of the predicted mel-spectrum characteristic of each sample audio.
In an alternative embodiment, after determining the loss value for each sample audio, the method further comprises:
And when the loss value does not meet the training stopping condition, adjusting parameters in the initialized discriminant according to the loss value to obtain the trained discriminant.
In a second aspect, an embodiment of the present application provides a singing voice synthesizing apparatus, including:
the processing unit is used for inputting the syllable sequence and the fundamental frequency mark sequence of the audio to be synthesized into a formant model in the target acoustic model to obtain formant characterization information of the audio to be synthesized, wherein the formant characterization information is characterization information without tone information;
the processing unit is also used for inputting formant characterization information and pitch information of the audio to be synthesized into a timbre conversion model in the target acoustic model to obtain mel-spectrum features, wherein the mel-spectrum features comprise timbre information of a target object, and the timbre conversion model is obtained based on sample audio training of the target object;
and the processing unit is also used for inputting the mel-spectrum characteristics into the vocoder to obtain the synthesized audio signal.
Optional embodiments of each unit in the singing voice synthesizing apparatus may be referred to the description in the foregoing first aspect, and will not be described herein.
In a third aspect, embodiments of the present application further provide a computer device, including: the system comprises a memory and a processor, wherein the memory stores a computer program which realizes the method of the first aspect when being executed by the processor.
In a fourth aspect, the present application further provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of the first aspect described above.
In a fifth aspect, embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method according to the first aspect provided in the embodiment of the present application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of a sound spectrum provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of a singing voice synthesizing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of syllable boundary labeling provided in an embodiment of the present application;
FIG. 4a is a schematic structural view of an acoustic model according to an embodiment of the present application;
FIG. 4b is a schematic structural diagram of a discriminator according to an embodiment of the present disclosure;
FIG. 5a is a flow chart of a method for training a target acoustic model provided in an embodiment of the present application;
FIG. 5b is a flow chart of a method of determining a trained acoustic model provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of formant characterization information provided by an embodiment of the present application;
fig. 7 is a schematic diagram of a singing voice synthesizing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
To facilitate an understanding of the embodiments disclosed herein, some concepts to which the embodiments of the present application relate are first described. The description of these concepts includes, but is not limited to, the following.
1. Syllables
A syllable is the smallest phonetic unit formed by the combined pronunciation of a single vowel phoneme and a consonant phoneme in a language, where a single vowel phoneme may also form a syllable by itself. For example, Chinese syllables are phonetic units composed of initials and finals, where individual finals can also form syllables on their own, such as "o" and "a".
A phoneme is the smallest unit of pronunciation. For example, the pronunciation units of the Chinese character meaning "good" (hao) are the initial "h" and the final "ao". Similarly, a grapheme is the smallest unit of Chinese writing.
2. Fundamental frequency
In speech, the fundamental frequency refers to the lowest frequency component of the voice. When a body vibrates and produces sound, the sound can generally be decomposed into many simple sine waves; that is, all natural sounds are essentially composed of sine waves of many different frequencies, and these components make up the sound spectrum. Referring to fig. 1, fig. 1 is a schematic diagram of a sound spectrum according to an embodiment of the present application. As shown in fig. 1, the lowest-frequency sine wave is the fundamental tone, while the other, higher-frequency sine waves are the overtones. In the field of music, the fundamental is the main element determining pitch; its frequency may also be called the pitch, and it determines the melody, while the overtones determine the timbre of the voice. Whether a singer is off-key usually refers to whether the singer's fundamental frequency matches the pitch information in the song's score.
3. Music score
A music score is a regular combination of written symbols recording the pitch and rhythm of music; common score formats are MIDI and MusicXML. In the singing voice synthesis stage, the lyric content and pitch information of the audio to be synthesized need to be given. Optionally, the computer device may extract the lyric content and pitch information of the audio to be synthesized from the music score.
At present, singing voice synthesis is generally implemented with a data-driven neural network model. The languages supported by such a method are tied to the singing voice data of the target timbre, and language and target timbre cannot be decoupled, so the method cannot synthesize singing of arbitrary timbre across languages.
On this basis, the embodiments of the application provide a singing voice synthesis method and a related device, which can synthesize singing voices of arbitrary timbre across languages. The method includes: a computer device inputs the syllable sequence and the fundamental-frequency marker sequence of the audio to be synthesized into a formant model in a target acoustic model to obtain formant characterization information of the audio to be synthesized, where the formant characterization information contains no timbre information; the formant characterization information and pitch information of the audio to be synthesized are input into a timbre conversion model in the target acoustic model to obtain mel-spectrum features that contain the timbre information of a target object, the timbre conversion model having been trained on sample audio of the target object; and the mel-spectrum features are input into a vocoder to obtain a synthesized audio signal. By adopting the embodiments of the application, singing voices of any timbre can be synthesized across languages.
It should be noted that, the above-mentioned computer device may be a terminal device or a server. The terminal equipment comprises, but is not limited to, a smart phone, a tablet personal computer, a notebook computer, a desktop computer, an intelligent sound box, an intelligent watch, an intelligent vehicle-mounted device and the like. The server may be an independent physical server, a server cluster formed by a plurality of physical servers, a distributed system, or the like, but is not limited thereto.
In order to facilitate understanding of the embodiments of the present application, a detailed description will be given below of a specific implementation of the singing voice synthesizing method described above with a computer device as an execution body.
Referring to fig. 2, fig. 2 is a flow chart of a singing voice synthesizing method according to an embodiment of the present application. As shown in fig. 2, the singing voice synthesizing method may include, but is not limited to, the following steps:
s201, inputting the syllable sequence and the base frequency mark sequence of the audio to be synthesized into a formant model in the target acoustic model to obtain formant characterization information of the audio to be synthesized.
Wherein, the formant characterization information is characterization information without tone information.
The syllable sequence is the result of a word embedding representation of the syllables of the audio to be synthesized. Optionally, the syllable sequence may be a frame-level syllable sequence. Optionally, the word embedding representation of the syllables of the audio to be synthesized yields a syllable sequence of dimension [T, 128], where T represents the number of audio frames and 128 is the dimension of the embedding vector.
The fundamental-frequency marker sequence (VUV sequence, marking voiced/unvoiced frames) is a sequence that marks the fundamental-frequency sequence of the audio to be synthesized. Optionally, the dimension of the fundamental-frequency marker sequence is [T, 1], where T represents the number of frames. Optionally, when marking the fundamental-frequency sequence of the audio to be synthesized, if the fundamental-frequency value of the i-th frame is greater than 0, the value of VUV[i, 1] is 1; otherwise it is 0.
Inputting the syllable sequence and the fundamental-frequency marker sequence into the formant model together implicitly provides consonant and vowel information, allows the durations of the consonant and vowel within each syllable to adapt automatically, and reduces mispronunciation.
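For illustration only, the following Python sketch shows how the two formant-model inputs described above could be assembled: a frame-level syllable embedding of dimension [T, 128] and a VUV marker sequence of dimension [T, 1]. The vocabulary size and the use of PyTorch are assumptions; only the dimensions and the greater-than-zero marking rule come from the description.

```python
import torch
import torch.nn as nn

def build_formant_inputs(frame_syllable_ids, f0, vocab_size=512, embed_dim=128):
    # frame_syllable_ids: LongTensor [T], one syllable id per audio frame
    # f0: FloatTensor [T], fundamental frequency per frame (0 for unvoiced frames)
    syllable_embedding = nn.Embedding(vocab_size, embed_dim)  # word-embedding table (vocab size assumed)
    syllable_seq = syllable_embedding(frame_syllable_ids)     # syllable sequence, [T, 128]
    vuv = (f0 > 0).float().unsqueeze(-1)                      # VUV marker sequence, [T, 1]
    return syllable_seq, vuv

syl, vuv = build_formant_inputs(torch.randint(0, 512, (1200,)), torch.rand(1200) * 300.0)
print(syl.shape, vuv.shape)  # torch.Size([1200, 128]) torch.Size([1200, 1])
```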
In an alternative embodiment, before step S201, the computer device may further perform preprocessing on the audio to be synthesized to obtain a syllable sequence and a base frequency sequence of the audio to be synthesized.
In this embodiment, the computer device preprocessing the audio to be synthesized to obtain the syllable sequence of the audio to be synthesized may include: inputting the audio to be synthesized and the lyric text corresponding to the audio to be synthesized into a trained singing voice alignment model for processing to obtain first syllable labeling information corresponding to the audio to be synthesized; calling phonetics software to align the boundaries of the first syllable labeling information of the audio to be synthesized based on the lyric text corresponding to the audio to be synthesized, obtaining second syllable labeling information of the audio to be synthesized; and performing a word embedding representation of the second syllable labeling information of the audio to be synthesized to obtain the syllable sequence of the audio to be synthesized.
Optionally, the phonetics software may be Praat.
Optionally, different syllable dictionaries are used when performing syllable labeling for audio in different languages. For example, for a Mandarin song and a Cantonese song, the syllable dictionaries used are different, because each Chinese character is pronounced differently in the two languages.
Referring to fig. 3, fig. 3 is a schematic diagram of syllable boundary labeling provided in the embodiment of the present application, and as shown in fig. 3, the computer device performs alignment processing on the boundary of syllable labeling information corresponding to an audio by calling Praat software.
In this embodiment, the computer device preprocessing the audio to be synthesized to obtain the fundamental-frequency marker sequence of the audio to be synthesized may include: performing fundamental-frequency extraction on the audio to be synthesized to obtain a fundamental-frequency sequence of the audio to be synthesized; and marking the fundamental-frequency sequence of the audio to be synthesized to obtain the fundamental-frequency marker sequence of the audio to be synthesized.
Optionally, the computer device may use the pYIN algorithm for fundamental-frequency extraction of the audio to be synthesized. Optionally, the computer device may also perform fundamental-frequency extraction using the DIO or Harvest algorithm from the WORLD vocoder, or the YIN algorithm, which is not limited herein.
Optionally, the computer device marking the fundamental-frequency sequence of the audio to be synthesized to obtain the fundamental-frequency marker sequence may include: marking the frames whose fundamental-frequency value is not 0 in the fundamental-frequency sequence as 1; marking the frames whose fundamental-frequency value is 0 as 0; and obtaining the fundamental-frequency marker sequence of the audio to be synthesized based on the marked values.
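As a hedged sketch of this extraction-and-marking step, the snippet below uses the pYIN implementation in librosa (one possible realization of the pYIN algorithm mentioned above); the file name and the frequency search range are illustrative assumptions.

```python
import librosa
import numpy as np

y, sr = librosa.load("to_synthesize.wav", sr=None)          # illustrative file name
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)
f0 = np.nan_to_num(f0)                                       # pYIN returns NaN for unvoiced frames
vuv = (f0 > 0).astype(np.float32)[:, None]                   # fundamental-frequency marker sequence, [T, 1]
```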
In an alternative embodiment, before step S201, the computer device further obtains a training sample audio set, the training sample audio set including sample audio of a plurality of objects; training the initialized acoustic model based on each sample audio in the training sample audio set to obtain a trained acoustic model; taking the formant model in the trained acoustic model as the formant model in the target acoustic model; wherein the initialized acoustic model includes an initialized formant model and an initialized timbre conversion model.
In an alternative embodiment, after the computer device obtains the trained acoustic model, the method further comprises: fixing parameters of a formant model in the trained acoustic model, inputting sample audio corresponding to a target object into the trained acoustic model, and training a tone conversion model in the trained acoustic model to obtain a trained tone conversion model; and taking the trained timbre conversion model as a timbre conversion model in the target acoustic model.
S202, formant representation information and pitch information of the audio to be synthesized are input into a tone color conversion model in the target acoustic model, and Mel spectrum characteristics are obtained, wherein the Mel spectrum characteristics comprise tone color information of a target object.
Wherein the timbre conversion model in the target acoustic model is obtained based on sample audio training of the target object. Alternatively, the timbre conversion model may also be regarded as a style migration module.
In an alternative embodiment, the computer device also obtains pitch information of the audio to be synthesized. Optionally, obtaining the pitch information of the audio to be synthesized may include: acquiring the music score of the audio to be synthesized; and extracting the pitch information of the audio to be synthesized from the music score. Optionally, the format of the music score includes, but is not limited to, Musical Instrument Digital Interface (MIDI), MusicXML (an XML-based music notation file format), etc.
In an alternative embodiment, the computer device inputting the formant characterization information and pitch information of the audio to be synthesized into the timbre conversion model in the target acoustic model to obtain the mel-spectrum features may include: inputting the pitch information of the audio to be synthesized into the pitch-information word embedding module to obtain a word embedding representation of the pitch information corresponding to the audio to be synthesized; and inputting the formant characterization information and the word embedding representation of the pitch information of the audio to be synthesized into the timbre conversion model in the target acoustic model to obtain the mel-spectrum features.
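The following sketch, under stated assumptions, illustrates how frame-level pitch information might be read from a MIDI score before being passed through the pitch-information word embedding module. The pretty_midi library, the 10 ms hop and the choice of track 0 are assumptions, not part of the original description.

```python
import numpy as np
import pretty_midi

def frame_level_pitch(midi_path, n_frames, hop_seconds=0.01):
    pm = pretty_midi.PrettyMIDI(midi_path)
    pitch = np.zeros(n_frames, dtype=np.int64)
    for note in pm.instruments[0].notes:           # assume the vocal melody is on track 0
        start = int(note.start / hop_seconds)
        end = min(int(note.end / hop_seconds), n_frames)
        pitch[start:end] = note.pitch              # one MIDI note number per frame
    return pitch                                   # later mapped to [T, 128] by the pitch embedding

pitch_ids = frame_level_pitch("score.mid", n_frames=1200)    # illustrative score file
```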
S203, inputting the Mel spectrum characteristics into a vocoder to obtain a synthesized audio signal.
That is, the vocoder model is used to convert mel-spectrum features into an audio signal. Optionally, the vocoder may be a mel-spectrogram generative adversarial network (MelGAN) or a high-fidelity generative adversarial network (HiFi-GAN).
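A minimal sketch of this final step is shown below, assuming a pre-exported neural vocoder generator (e.g., a HiFi-GAN-style model). The checkpoint path and the use of TorchScript are placeholders and do not appear in the description.

```python
import torch

generator = torch.jit.load("hifigan_generator.pt")   # assumed pre-exported vocoder generator
generator.eval()

with torch.no_grad():
    mel = torch.randn(1, 128, 1200)                   # [batch, mel bins, frames], stand-in for predicted mels
    audio = generator(mel).squeeze()                  # synthesized waveform, [n_samples]
```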
In the embodiment of the application, the formant model is utilized to decouple tone information and languages of the audio to be synthesized, and formant characterization information irrelevant to tone is obtained; inputting formant characterization information into a tone conversion model in a target acoustic model corresponding to a target object, and obtaining Mel spectrum characteristics comprising tone information of the target object; the mel-spectrum feature is input to a vocoder to obtain a synthesized audio signal. Therefore, by adopting the embodiment of the application, singing voice of any tone cross-language can be synthesized.
Optionally, in the music production process, the synthesized audio signal obtained by adopting the embodiment of the application may be mixed to obtain a more specialized musical composition.
Referring to fig. 4a, fig. 4a is a schematic structural diagram of an acoustic model provided in an embodiment of the present application, and as shown in fig. 4a, the acoustic model includes a formant representation model, a timbre conversion model, a word embedding module for pitch information, and a word embedding module for timbre information.
The formant model consists of N feed-forward Transformer blocks (FFT Blocks). Optionally, N may be 5. The inputs of the formant model are the syllable sequence and the fundamental-frequency marker sequence of the audio, and its output is the formant characterization information of the audio. That is, the formant model is used to learn, from the syllable sequence and the fundamental-frequency marker sequence of the audio, formant characterization information (also referred to as timbre-independent syllable labeling information) that represents the basic pronunciation content of the audio.
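A minimal, assumption-laden PyTorch sketch of an FFT block and the N-block formant model is given below; the hidden size of 128 and N = 5 follow the description, while the head count, convolution channels and layer-normalization placement are assumptions.

```python
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    def __init__(self, dim=128, n_heads=2, conv_channels=256, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.conv = nn.Sequential(                       # 1-D convolutional feed-forward part
            nn.Conv1d(dim, conv_channels, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(conv_channels, dim, kernel_size, padding=kernel_size // 2),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                                # x: [B, T, dim]
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + conv_out)

class FormantModel(nn.Module):
    def __init__(self, n_blocks=5, dim=128):
        super().__init__()
        self.in_proj = nn.Linear(dim + 1, dim)           # syllable embedding [T,128] + VUV [T,1]
        self.blocks = nn.ModuleList(FFTBlock(dim) for _ in range(n_blocks))

    def forward(self, syllable_seq, vuv):                # [B, T, 128], [B, T, 1]
        x = self.in_proj(torch.cat([syllable_seq, vuv], dim=-1))
        for block in self.blocks:
            x = block(x)
        return x                                         # formant characterization information, [B, T, 128]
```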
The word embedding module of the pitch information is used for processing the pitch information of the audio to obtain word embedding representation of the pitch information of the input audio.
The word embedding module of tone color information is used for embedding and representing tone color information indicated by Singer Identity (Singer id).
The timbre conversion model converts the characterization features that carry no timbre information (namely, the formant characterization information) into mel-spectrum features with a specific person's timbre. The inputs of the timbre conversion model are the formant characterization information, the pitch information and the singer ID, where the pitch information is obtained by first converting the fundamental-frequency sequence of dimension [T, 1] into the logarithmic domain and then mapping it, via word embedding, into [T, 128]. That is, the pitch information here is the information processed by the pitch-information word embedding module. The output of the timbre conversion model is the predicted mel-spectrum features with dimensions [T, 128]. In other words, the timbre conversion model converts the formant characterization information without timbre information into mel-spectrum features that carry timbre information.
The timbre conversion model uses a style migration model similar to StyleGAN and consists of M timbre blocks. Optionally, M may be 6. Each timbre block is composed of a one-dimensional convolution module and an adaptive instance normalization (AdaIN) module. The input of the AdaIN module further includes the singer identity (singer ID), which indicates the timbre information of the singer. AdaIN aligns the mean and variance of the formant characterization information c of the audio (also referred to as the content feature c of the audio) to the mean and variance of the timbre feature s of the target object (also referred to as the style feature s of the target object), and its calculation formula is as follows (1):
AdaIN(c, s) = σ(s) * ((c - μ(c)) / σ(c)) + μ(s)    (1)
in the formula (1), c represents formant characterization information (content characteristics) of the audio; s represents a tone characteristic (style characteristic) of the target object; σ(s) represents the variance of the style feature; σ (c) represents the variance of the content features; μ(s) represents the mean of the style feature; μ (c) represents the mean value of the content feature.
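For illustration, a direct PyTorch implementation of formula (1) could look as follows. The description does not fully specify how the style statistics are derived from the singer-ID embedding, so treating the style feature s as a sequence whose mean and variance are taken over time is an assumption.

```python
import torch

def adain(content, style, eps=1e-5):
    # content: formant characterization information c, [B, T, C]
    # style: timbre/style features s, [B, T_s, C]; statistics are taken over the time axis
    mu_c = content.mean(dim=1, keepdim=True)
    sigma_c = content.std(dim=1, keepdim=True) + eps
    mu_s = style.mean(dim=1, keepdim=True)
    sigma_s = style.std(dim=1, keepdim=True) + eps
    return sigma_s * (content - mu_c) / sigma_c + mu_s   # formula (1)
```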
In fig. 4a, the mean square error loss (MSE Loss) and the discriminator loss (Discriminator Loss) are used to determine the loss value of the mel-spectrum features output by the timbre conversion model. The mean square error is the error between the predicted mel-spectrum features and the target mel-spectrum features; the discriminator error is the real/fake decision error on the predicted mel-spectrum features.
The discriminator error is determined by a discriminator, which is a multi-sub-band discriminator model. The computer device may divide the 128 mel bins of the mel-spectrum features into three frequency bands (low, middle and high), whose ranges are [0, 64], [32, 96] and [96, 128], and input each sub-band into the discriminator to judge whether the mel-spectrum features are real.
Referring to fig. 4b, fig. 4b is a schematic structural diagram of a discriminator according to an embodiment of the application. As shown in fig. 4b, the discriminator is mainly composed of a 1-dimensional convolution module and residual connections. The computer device may input the mel spectrum of each sub-band of the mel-spectrum features into a 1-dimensional convolution module to obtain a 1-dimensional convolution result, where in the 1-dimensional convolution module the convolution kernel size is 3 and the number of input and output channels is 64; process the 1-dimensional convolution result with an activation function to obtain processed features; add the processed features to the mel spectrum of the sub-band through a residual connection; after passing through 5 such residual connection structures, map the features to [T, 1] through a linear layer; and output the discrimination result of the mel-spectrum features based on [T, 1]. The discrimination result indicates whether the mel-spectrum features input to the discriminator are real (True) or synthesized (Fake).
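A rough sketch of one sub-band discriminator consistent with this description is given below (kernel size 3, 64 channels, 5 residual connections, linear mapping to [T, 1]); using the band width as the input channel count and LeakyReLU as the activation are assumptions.

```python
import torch
import torch.nn as nn

class SubbandDiscriminator(nn.Module):
    def __init__(self, band_bins, channels=64):
        super().__init__()
        self.in_conv = nn.Conv1d(band_bins, channels, kernel_size=3, padding=1)
        self.res_blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, padding=1),
                nn.LeakyReLU(0.2),
            )
            for _ in range(5)                      # 5 residual connection structures
        )
        self.out = nn.Linear(channels, 1)

    def forward(self, mel_band):                   # mel_band: [B, band_bins, T]
        x = self.in_conv(mel_band)
        for block in self.res_blocks:
            x = x + block(x)                       # residual connection
        return self.out(x.transpose(1, 2))         # per-frame real/fake score, [B, T, 1]

bands = [(0, 64), (32, 96), (96, 128)]             # low / middle / high bin ranges from the description
discriminators = nn.ModuleList(SubbandDiscriminator(hi - lo) for lo, hi in bands)
```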
In pre-training the acoustic model, the computer device may adjust parameters of the formant characterization model, the timbre conversion model and the discriminator of the acoustic model based on the loss values to obtain the pre-trained acoustic model. Optionally, when the computer device adjusts these parameters based on the loss values, the learning rate may be 0.001, and the optimizer may be an adaptive moment estimation (Adam) optimizer.
After the computer device obtains the pre-trained acoustic model, a trained formant model that is independent of timbre is obtained. Because this formant model decouples timbre from language, it is the main reason the method can synthesize cross-language singing. For the singing voice data of the target object, the computer device may fix the parameters of the formant model and of the pitch-information word embedding module of the pre-trained model, and train only the parameters of the timbre conversion model, thereby obtaining the acoustic model corresponding to the target object (also referred to as the target acoustic model).
Based on the acoustic model and the vocoder corresponding to the target object, the singing voice synthesis model corresponding to the target object can be obtained. That is, a complete singing voice synthesis model includes both the acoustic model and the vocoder.
The process of training the acoustic model corresponding to the target object (i.e., the target acoustic model) is described in detail below.
Referring to fig. 5a, fig. 5a is a flowchart of a method for training an acoustic model of a target according to an embodiment of the present application. As shown in fig. 5a, the method may include, but is not limited to, the steps of:
s501, acquiring a training sample audio set.
Wherein the training sample audio set comprises sample audio of a plurality of objects. Sample audio of the plurality of objects includes sample audio of at least two languages. Alternatively, the sample audio corresponding to each object may be in a single language.
Taking Mandarin songs and Cantonese songs as an example, the training sample audio set may be a set of Mandarin songs and Cantonese songs sung by a plurality of objects, where each object may sing songs in one language.
S502, training the initialized acoustic model based on each sample audio in the training sample audio set to obtain a trained acoustic model.
The trained acoustic model comprises a formant model, a tone conversion model and a word embedding module of pitch information.
In an alternative embodiment, the computer device trains the initialized acoustic model based on each sample audio in the training sample audio set, resulting in a trained acoustic model, and may include: acquiring syllable sequences, fundamental frequency mark sequences, pitch information, tone information and acoustic characteristics of each training sample audio in a training sample audio set, wherein the acoustic characteristics are Mel spectrum characteristics; and training the initialized acoustic model by utilizing the syllable sequence, the fundamental frequency mark sequence, the pitch information, the tone information and the acoustic characteristics of each sample audio frequency to obtain a trained acoustic model.
In this embodiment, the computer device also obtains the lyric text corresponding to each sample audio, and obtaining the syllable sequence of each sample audio may include: inputting each sample audio and the corresponding lyric text into a pre-trained singing voice alignment model for processing to obtain a first syllable labeling sequence of each sample audio; calling phonetics software to align the boundaries of the first syllable labeling sequence of each sample audio based on the corresponding lyrics, obtaining second syllable labeling information of each sample audio; and performing an embedding representation of the second syllable labeling information of each sample audio to obtain the syllable sequence of each sample audio. Optionally, the dimensions of the syllable sequence may be [T, 128].
Optionally, the phonetics software may be Praat. Optionally, different syllable dictionaries are used when performing syllable labeling for audio in different languages. For example, for a Mandarin song and a Cantonese song, the syllable dictionaries used are different, because each Chinese character is pronounced differently in the two languages.
Optionally, the computer device processing the syllable labeling information of each sample audio to obtain the syllable sequence of each sample audio may include: expanding the syllable labeling information of each sample audio according to the corresponding number of audio frames to obtain the syllable sequence of each sample audio. For example, assuming that the syllable labeling information of one sample audio includes the three syllables "na shi wo", and the numbers of audio frames corresponding to the syllables are [5, 5, 4], the three syllables can be expanded into a 14-frame syllable sequence, namely "na na na na na shi shi shi shi shi wo wo wo wo". That is, the syllable sequence is a frame-level syllable sequence.
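A small sketch of this frame-level expansion, using the example above, might look like this:

```python
def expand_to_frames(syllables, frame_counts):
    frames = []
    for syllable, count in zip(syllables, frame_counts):
        frames.extend([syllable] * count)          # repeat each syllable for its frame count
    return frames

print(expand_to_frames(["na", "shi", "wo"], [5, 5, 4]))
# ['na', 'na', 'na', 'na', 'na', 'shi', 'shi', 'shi', 'shi', 'shi', 'wo', 'wo', 'wo', 'wo']
```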
In this embodiment, the computer device acquiring the baseband-labeled sequence for each sample audio may include: extracting fundamental frequency from each sample audio to obtain a fundamental frequency sequence of each sample audio; and carrying out marking processing on the base frequency sequence of each sample audio to obtain a base frequency marking sequence of each sample audio.
Optionally, the computer device may perform fundamental-frequency extraction on each sample audio using the DIO or Harvest algorithm from the WORLD vocoder, or the YIN or pYIN algorithm, etc., which is not limited herein.
Optionally, the computer device performs a labeling process on the baseband sequence of each sample audio to obtain a baseband label sequence of each sample audio, which may include: marking a frame where a fundamental frequency with a fundamental frequency value not being 0 is located in a fundamental frequency sequence of each sample audio as 1; marking a frame where a fundamental frequency with a fundamental frequency value of 0 in a fundamental frequency sequence is positioned as 0; based on the marked fundamental frequency values, a fundamental frequency marking sequence of each sample audio is obtained. Optionally, the dimension of the baseband-marker sequence is [ T,1], where T represents the number of frames. That is, when marking the baseband sequence of each sample audio, if the baseband value of the i-th frame is greater than 0, the value of VUV [ i,1] is 1, otherwise, the value is 0.
In this embodiment, the computer device obtains pitch information and tone information of each sample audio, which may include: acquiring a music score of each sample audio; pitch information and tone information of each sample audio are extracted from the melody of each sample audio, respectively.
In this embodiment, the acoustic features are mel-spectrum features because the frequency range audible to the human ear is 20-20000 Hz, but human perception of frequency in Hz is not linear; mel-spectrum features therefore better match how the human ear works. The computer device thus acquires the acoustic features of each sample audio, i.e., the mel-spectrum features of each sample audio.
Optionally, the computer device obtaining the acoustic features (mel-spectrum features) of each sample audio may include: performing framing and windowing on each sample audio to obtain the framed and windowed sample audio; performing a Fourier transform on each framed and windowed sample audio to obtain the linear spectrum corresponding to each sample audio; and processing the linear spectrum corresponding to each sample audio with a mel-scale filter bank to obtain the mel-spectrum features of each sample audio.
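A hedged sketch of this feature-extraction step using librosa is shown below; the FFT size and hop length are assumptions, and only the 128 mel bins follow the description.

```python
import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=None)                   # illustrative file name
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128        # windowed framing + FFT + mel filter bank
)
log_mel = np.log(np.clip(mel, 1e-5, None)).T                  # [T, 128] mel-spectrum features
```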
In this embodiment, the initialized acoustic model includes an initialized formant model, an initialized timbre conversion model, and a word embedding module for pitch information.
In this embodiment, the computer device trains the initialized acoustic model using syllable sequence, base frequency label sequence, pitch information, timbre information, and acoustic features of each sample audio, and the process of obtaining the trained acoustic model is described below with reference to fig. 5 b. Referring to fig. 5b, fig. 5b is a schematic flow chart of a method for determining a trained acoustic model according to an embodiment of the present application, which may also be considered as a schematic flow chart of a method for training a target acoustic model according to an embodiment of the present application. As shown in fig. 5b, the method includes, but is not limited to, the steps of:
s5021, inputting syllable sequences and fundamental frequency mark sequences of each sample audio into an initialized formant model to obtain formant characterization information of each sample audio.
Wherein, formant characterization information of each sample audio has no tone information of the sample audio. Thus, the dimensions of the formant characterization information are the same as the dimensions of the syllable sequence. For example, if the syllable sequence dimension of any one sample audio is [ T,128], and the fundamental frequency mark sequence dimension is [ T,1], the formant characterization information of the sample audio obtained after formant model processing has the dimension of [ T,128], where T represents the frame number.
The syllable sequence and the base frequency mark sequence are input into the formant model together, consonant information and vowel information can be provided implicitly, the self-adaption of the duration of the syllable in the syllable is realized, and the problem of mispronounced sound is reduced.
Referring to fig. 6, fig. 6 is a schematic diagram of formant characterization information according to an embodiment of the present application. As shown in fig. 6, the total frame number of the formants is 1200 frames, i.e., t=1200, and thus the dimension of the formant characterization information is [1200, 128].
S5022, inputting formant characterization information, pitch information and tone information of each sample audio into an initialized tone conversion model to obtain predicted Mel spectrum characteristics of each sample audio.
Wherein the predicted mel-spectrum feature of each sample audio includes timbre information.
In an alternative embodiment, the computer device inputting the formant characterization information, pitch information and timbre information of each sample audio into the initialized timbre conversion model to obtain the predicted mel-spectrum features of each sample audio may include: inputting the pitch information of each sample audio into the pitch-information word embedding module to obtain a word embedding representation of the pitch information of each sample audio; inputting the timbre information of each sample audio into the timbre-information word embedding module to obtain a word embedding representation of the timbre information of each sample audio; and inputting the formant characterization information, the word embedding representation of the pitch information and the word embedding representation of the timbre information of each sample audio into the initialized timbre conversion model to obtain the predicted mel-spectrum features of each sample audio.
S5023, determining a loss value of each sample audio based on the predicted Mel spectrum characteristic of each sample audio.
In an alternative embodiment, the computer device determining the loss value for each sample audio based on the predicted mel-spectrum features for each sample audio may include: determining a minimum mean square error between the predicted mel-spectrum feature and the acoustic feature of each sample audio, and determining a discriminator error corresponding to the predicted mel-spectrum feature of each sample audio; a loss value for each sample audio is determined based on the minimum mean square error and the discriminator error.
In this embodiment, the computer device determining the discriminator error corresponding to the predicted mel-spectrum features of each sample audio may include: inputting the predicted mel-spectrum features of each sample audio into an initialized discriminator to obtain a discrimination result for the predicted mel-spectrum features of each sample audio, wherein the discrimination result indicates whether the predicted mel-spectrum features are real mel-spectrum features or synthesized mel-spectrum features; and determining the discriminator error corresponding to the predicted mel-spectrum features of each sample audio based on the discrimination result.
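As a loose sketch, the per-sample loss described above could be assembled as follows in PyTorch; the LSGAN-style adversarial term and the weighting factor are assumptions, while the MSE term between predicted and ground-truth mel-spectrum features follows the description.

```python
import torch
import torch.nn.functional as F

def generator_loss(pred_mel, target_mel, disc_scores, adv_weight=1.0):
    mse = F.mse_loss(pred_mel, target_mel)                       # error against the ground-truth mel features
    # discriminator error: push the discriminator's scores on predicted mels toward "real" (1)
    adv = F.mse_loss(disc_scores, torch.ones_like(disc_scores))
    return mse + adv_weight * adv                                # loss value for the sample
```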
And S5024, when the loss value does not meet the training stopping condition, adjusting parameters of the initialized formant model and the initialized timbre conversion model included in the initialized acoustic model according to the loss value to obtain a trained acoustic model.
In an alternative embodiment, when the loss value satisfies the stop-training condition, the computer device may obtain the trained acoustic model based on the initialized formant model and the initialized timbre conversion model. The trained acoustic model further comprises a word embedding module for pitch information and a word embedding module for timbre information.
In an alternative embodiment, the penalty values include a first penalty value and a second penalty value; the computer device adjusts parameters of an initialized formant model and an initialized timbre conversion model included in the initialized acoustic model according to the loss value to obtain a trained acoustic model, and the method comprises the following steps: adjusting parameters of the initialized formant model and the initialized tone color conversion model according to the first loss value to obtain an acoustic model with the adjusted parameters, wherein the adjusted acoustic model comprises the formant model with the adjusted parameters, the tone color conversion model with the adjusted parameters and a word embedding module of pitch information; inputting syllable sequences, fundamental frequency mark sequences and pitch information of each sample audio into an acoustic model with parameters adjusted, and determining a second loss value; and when the second loss value meets the training stopping condition, obtaining the trained acoustic model.
It will be appreciated that the process of determining the trained acoustic model is a continually recurring process.
In an alternative embodiment, the computer device further adjusts parameters in the initialized discriminators according to the loss value to obtain trained discriminators when the loss value does not meet the stop training condition.
S503, fixing parameters of a formant model in the trained acoustic model, inputting sample audio corresponding to the target object into the trained acoustic model, and retraining a tone conversion model in the trained acoustic model to obtain the trained tone conversion model.
For example, assuming that the training sample set includes 5 sample audios for each of 100 singers, i.e., 500 sample audios in total, the computer device first trains the initialized acoustic model using the 500 sample audios to obtain a trained acoustic model. At this time, if an acoustic model exclusive to singer 5 is to be obtained, the computer device may freeze the formant model and the pitch-information word embedding module in the trained acoustic model, then input the 5 sample audios corresponding to singer 5 into the trained acoustic model, and train the timbre conversion model in the trained acoustic model to obtain the trained timbre conversion model.
Alternatively, the process of training the timbre conversion model in the trained acoustic model by the computer device may be understood as a process of fine-tuning the trained acoustic model. In this way, the performance of the model can be further improved.
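An illustrative sketch of this fine-tuning setup is given below: the formant model and the pitch-information word embedding module are frozen, and only the timbre conversion model's parameters are optimized (with the learning rate of 0.001 mentioned earlier). The module attribute names are placeholders and do not come from the patent text.

```python
import torch

def build_finetune_optimizer(acoustic_model, lr=0.001):
    # Freeze the formant model and the pitch-information word embedding module.
    for p in acoustic_model.formant_model.parameters():
        p.requires_grad = False
    for p in acoustic_model.pitch_embedding.parameters():
        p.requires_grad = False
    # Only the timbre conversion model is updated during fine-tuning.
    return torch.optim.Adam(acoustic_model.timbre_conversion_model.parameters(), lr=lr)
```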
S504, determining a target acoustic model based on a formant model in the trained acoustic model and the trained timbre conversion model.
The target acoustic model further comprises a word embedding module of pitch information and a word embedding module of tone information.
Wherein the computer device uses the formant model in the trained acoustic model as the formant model in the target acoustic model; and taking the trained timbre conversion model as a timbre conversion model in the target acoustic model.
That is, the formant model in the trained acoustic model, the trained timbre conversion model, and the word embedding module of the pitch information together constitute a language independent acoustic model specific to the singer 5.
Therefore, by adopting the embodiment of the application, the acoustic model corresponding to the target object can be obtained by training the acoustic model, so that a singer who sings in language A can, using the acoustic model corresponding to the target object together with the vocoder, sing songs in language B, for example a Mandarin-singing singer singing a Cantonese song or a Cantonese-singing singer singing a Mandarin song. On the one hand, this solves the problem that a single singer can hardly cover multiple languages; on the other hand, it can help ordinary users sing cross-language songs.
Referring to fig. 7, fig. 7 is a schematic diagram of a singing voice synthesizing apparatus according to an embodiment of the present application. The singing voice synthesizing apparatus described in this embodiment may include a processing unit 701 and a training unit 702.
The processing unit 701 is configured to input a syllable sequence and a fundamental frequency mark sequence of audio to be synthesized into a formant model in the target acoustic model to obtain formant characterization information of the audio to be synthesized, where the formant characterization information is characterization information without timbre information;
the processing unit 701 is further configured to input formant characterization information and pitch information of audio to be synthesized into a timbre conversion model in the target acoustic model, to obtain mel spectrum features, where the mel spectrum features include timbre information of the target object, and the timbre conversion model is obtained based on sample audio training of the target object;
the processing unit 701 is further configured to input mel-spectrum features into the vocoder to obtain a synthesized audio signal.
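Taken together, the three operations of the processing unit 701 correspond to the following inference sketch, reusing the hypothetical `TargetAcousticModel` container above and assuming the vocoder is a callable that maps a Mel spectrum to a waveform; the `soundfile` output step and the sample rate are illustrative assumptions.

```python
import torch
import soundfile as sf

@torch.no_grad()
def synthesize(target_acoustic_model, vocoder, syllables, f0_marks, pitch,
               out_path="synth.wav", sample_rate=24000):
    """Formant model -> timbre conversion model -> vocoder."""
    target_acoustic_model.eval()
    # Mel spectrum feature with the target singer's timbre.
    mel = target_acoustic_model(syllables, f0_marks, pitch)
    # The vocoder converts the Mel spectrum feature into a waveform.
    wav = vocoder(mel).squeeze().cpu().numpy()
    sf.write(out_path, wav, sample_rate)
    return wav
```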
In an alternative embodiment, before the processing unit 701 inputs the syllable sequence and the fundamental frequency mark sequence of the audio to be synthesized into the formant model in the target acoustic model, the training unit 702 is configured to:
acquiring a training sample audio set, wherein the training sample audio set comprises sample audio of a plurality of objects;
training the initialized acoustic model based on each sample audio in the training sample audio set to obtain a trained acoustic model;
taking the formant model in the trained acoustic model as the formant model in the target acoustic model;
wherein the initialized acoustic model includes an initialized formant model and an initialized timbre conversion model.
In an alternative embodiment, training unit 702, after obtaining the trained acoustic model, is further configured to:
fixing parameters of a formant model in the trained acoustic model, inputting sample audio corresponding to a target object into the trained acoustic model, and retraining a timbre conversion model in the trained acoustic model to obtain a trained timbre conversion model;
and taking the trained timbre conversion model as a timbre conversion model in the target acoustic model.
In an alternative embodiment, when the training unit 702 is configured to train the initialized acoustic model based on each sample audio in the training sample audio set to obtain the trained acoustic model, the training unit 702 is specifically configured to:
acquiring syllable sequences, fundamental frequency mark sequences, pitch information, timbre information and acoustic characteristics of each training sample audio in the training sample audio set, wherein the acoustic characteristics are Mel spectrum characteristics (an extraction sketch is given below);
and training the initialized acoustic model by utilizing the syllable sequence, the fundamental frequency mark sequence, the pitch information, the timbre information and the acoustic characteristics of each sample audio to obtain a trained acoustic model.
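The feature-extraction sketch referenced above is given here. The use of librosa for the Mel spectrum and pyworld for the fundamental frequency, as well as the window sizes and Mel dimension, are assumptions rather than requirements of this application; the syllable sequence and fundamental frequency mark sequence would come from the aligned lyrics and score and are not shown.

```python
import numpy as np
import librosa
import pyworld as pw

def extract_features(wav_path, sr=24000, hop=256, n_mels=80):
    """Extract the Mel spectrum characteristics and frame-level pitch (F0)
    of one sample audio."""
    wav, _ = librosa.load(wav_path, sr=sr)

    # Log-compressed Mel spectrum, shape: (frames, n_mels).
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=1024, hop_length=hop, n_mels=n_mels)
    log_mel = np.log(np.clip(mel, 1e-5, None)).T

    # Frame-level fundamental frequency as the pitch information.
    x = wav.astype(np.float64)
    f0, t = pw.dio(x, sr, frame_period=1000.0 * hop / sr)
    f0 = pw.stonemask(x, f0, t, sr)

    return log_mel, f0
```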
In an alternative embodiment, when the training unit 702 is configured to train the initialized acoustic model by using the syllable sequence, the fundamental frequency mark sequence, the pitch information, the timbre information and the acoustic features of each sample audio to obtain the trained acoustic model, the training unit 702 is specifically configured to:
inputting syllable sequences and fundamental frequency mark sequences of each sample audio into an initialized formant model to obtain formant characterization information of each sample audio;
inputting formant characterization information, pitch information and timbre information of each sample audio into an initialized timbre conversion model to obtain predicted Mel spectrum characteristics of each sample audio;
determining a loss value for each sample audio based on the predicted mel-spectrum characteristics for each sample audio;
and when the loss value does not meet the training stopping condition, adjusting parameters of the initialized formant model and the initialized timbre conversion model included in the initialized acoustic model according to the loss value to obtain a trained acoustic model.
In an alternative embodiment, when the training unit 702 is configured to determine the loss value of each sample audio based on the predicted mel-spectrum feature of each sample audio, the training unit 702 is specifically configured to:
determining a minimum mean square error between the predicted mel-spectrum feature and the acoustic feature of each sample audio, and determining a discriminator error corresponding to the predicted mel-spectrum feature of each sample audio;
a loss value for each sample audio is determined based on the minimum mean square error and the discriminator error.
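For example, the combination of the minimum mean square error and the discriminator error could take the following form; the least-squares (LSGAN-style) adversarial term and the unit weighting are assumptions, not details stated in this application.

```python
import torch
import torch.nn.functional as F

def generator_loss(mel_pred, mel_target, discriminator, adv_weight=1.0):
    """Loss value of one sample audio: mean square error between the predicted
    and real Mel spectrum plus a discriminator (adversarial) error."""
    mse = F.mse_loss(mel_pred, mel_target)
    # Discriminator error: how far the discriminator is from judging the
    # predicted Mel spectrum as real (target 1 under an LSGAN-style objective).
    d_fake = discriminator(mel_pred)
    adv = F.mse_loss(d_fake, torch.ones_like(d_fake))
    return mse + adv_weight * adv
```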
In an alternative embodiment, when the training unit 702 is configured to determine the discriminator error corresponding to the predicted mel-spectrum feature of each sample audio, the training unit 702 is specifically configured to:
inputting the predicted Mel spectrum characteristics of each sample audio into an initialized discriminator to obtain a discrimination result of the predicted Mel spectrum characteristics of each sample audio, wherein the discrimination result indicates whether the predicted Mel spectrum characteristics are real Mel spectrum characteristics or synthesized Mel spectrum characteristics;
and determining a discriminator error corresponding to the predicted mel-spectrum characteristic of each sample audio based on the discrimination result of the predicted mel-spectrum characteristic of each sample audio.
In an alternative embodiment, training unit 702, after being used to determine the loss value for each sample audio, is also used to:
and when the loss value does not meet the training stopping condition, adjusting parameters in the initialized discriminator according to the loss value to obtain the trained discriminator.
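Correspondingly, one parameter adjustment of the discriminator might look as follows; the LSGAN-style real/fake targets are an assumption, not the exact objective of this application.

```python
import torch
import torch.nn.functional as F

def discriminator_step(discriminator, d_optimizer, mel_pred, mel_target):
    """One parameter adjustment of the discriminator: push it to judge real
    Mel spectra as real and predicted (synthesized) ones as synthesized."""
    d_real = discriminator(mel_target)
    d_fake = discriminator(mel_pred.detach())  # do not backpropagate into the acoustic model
    loss_d = (F.mse_loss(d_real, torch.ones_like(d_real))
              + F.mse_loss(d_fake, torch.zeros_like(d_fake)))
    d_optimizer.zero_grad()
    loss_d.backward()
    d_optimizer.step()
    return loss_d.item()
```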
It may be understood that the specific implementation of each unit and the beneficial effects that can be achieved in the singing voice synthesizing apparatus according to the embodiments of the present application may refer to the description of the related method embodiments, which is not repeated herein.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device described in the embodiment of the present application includes: a processor 801, a user interface 802, a communication interface 803, and a memory 804. The processor 801, the user interface 802, the communication interface 803, and the memory 804 may be connected by a bus or in other manners; the embodiment of the present application takes a bus connection as an example.
The processor 801 (or CPU, Central Processing Unit) is the computing core and control core of the computer device; it can parse various instructions in the computer device and process various data of the computer device. For example, the CPU can parse a startup/shutdown instruction sent by a user to the computer device and control the computer device to perform the startup/shutdown operation; for another example, the CPU may transmit various types of interaction data between internal structures of the computer device; and so on. The user interface 802 is a medium for implementing interaction and information exchange between a user and the computer device, and may specifically include a display screen (Display) for output, a keyboard (Keyboard) for input, and the like, where the keyboard may be a physical keyboard, a touch-screen virtual keyboard, or a keyboard combining a physical keyboard and a touch-screen virtual keyboard. The communication interface 803 may optionally include a standard wired interface and a wireless interface (e.g., Wi-Fi or a mobile communication interface), and is controlled by the processor 801 to receive and transmit data.
The memory 804 (Memory) is a memory device in the computer device, and is used for storing programs and data. It will be appreciated that the memory 804 herein may include a built-in memory of the computer device and may also include an extended memory supported by the computer device. The memory 804 provides storage space that stores the operating system of the computer device, which may include, but is not limited to, an Android system, an iOS system, a Windows Phone system, and the like; this is not limited in the present application.
In the present embodiment, the processor 801 may execute the following operations by executing executable program codes in the memory 804:
inputting the syllable sequence and the fundamental frequency mark sequence of the audio to be synthesized into the formant model in the target acoustic model to obtain formant characterization information of the audio to be synthesized, wherein the formant characterization information is characterization information without timbre information;
inputting the formant characterization information and pitch information of the audio to be synthesized into a timbre conversion model in the target acoustic model to obtain a Mel spectrum feature, wherein the Mel spectrum feature comprises timbre information of a target object, and the timbre conversion model is obtained based on sample audio training of the target object;
and inputting the Mel spectrum feature into a vocoder to obtain a synthesized audio signal.
In an alternative embodiment, before executing the inputting of the syllable sequence and the fundamental frequency mark sequence of the audio to be synthesized into the formant model in the target acoustic model, the processor 801 further executes:
acquiring a training sample audio set, wherein the training sample audio set comprises sample audio of a plurality of objects;
training the initialized acoustic model based on each sample audio in the training sample audio set to obtain a trained acoustic model;
taking the formant model in the trained acoustic model as the formant model in the target acoustic model;
wherein the initialized acoustic model includes an initialized formant model and an initialized timbre conversion model.
In an alternative embodiment, after performing the training of the initialized acoustic model based on each sample audio in the training sample audio set to obtain the trained acoustic model, the processor 801 further performs:
fixing parameters of a formant model in the trained acoustic model, inputting sample audio corresponding to a target object into the trained acoustic model, and retraining a timbre conversion model in the trained acoustic model to obtain a trained timbre conversion model;
and taking the trained timbre conversion model as a timbre conversion model in the target acoustic model.
In an alternative embodiment, when the processor 801 performs the training of the initialized acoustic model based on each sample audio in the training sample audio set to obtain the trained acoustic model, the processor 801 specifically performs:
acquiring syllable sequences, fundamental frequency mark sequences, pitch information, timbre information and acoustic characteristics of each training sample audio in the training sample audio set, wherein the acoustic characteristics are Mel spectrum characteristics;
and training the initialized acoustic model by utilizing the syllable sequence, the fundamental frequency mark sequence, the pitch information, the timbre information and the acoustic characteristics of each sample audio to obtain a trained acoustic model.
In an alternative embodiment, when the processor 801 performs the training of the initialized acoustic model by using the syllable sequence, the fundamental frequency mark sequence, the pitch information, the timbre information and the acoustic features of each sample audio, the processor 801 specifically performs:
inputting syllable sequences and fundamental frequency mark sequences of each sample audio into an initialized formant model to obtain formant characterization information of each sample audio;
inputting formant characterization information, pitch information and timbre information of each sample audio into an initialized timbre conversion model to obtain predicted Mel spectrum characteristics of each sample audio;
determining a loss value for each sample audio based on the predicted mel-spectrum characteristics for each sample audio;
and when the loss value does not meet the training stopping condition, adjusting parameters of the initialized formant model and the initialized timbre conversion model included in the initialized acoustic model according to the loss value to obtain a trained acoustic model.
In an alternative embodiment, when the processor 801 performs the determining of the loss value of each sample audio based on the predicted mel-spectrum feature of each sample audio, the processor 801 specifically performs:
determining a minimum mean square error between the predicted mel-spectrum feature and the acoustic feature of each sample audio, and determining a discriminator error corresponding to the predicted mel-spectrum feature of each sample audio;
a loss value for each sample audio is determined based on the minimum mean square error and the discriminator error.
In an alternative embodiment, when the processor 801 performs the determining of the discriminator error corresponding to the predicted mel-spectrum feature of each sample audio, the processor 801 specifically performs:
inputting the predicted Mel spectrum characteristics of each sample audio into an initialized discriminator to obtain a discrimination result of the predicted Mel spectrum characteristics of each sample audio, wherein the discrimination result indicates whether the predicted Mel spectrum characteristics are real Mel spectrum characteristics or synthesized Mel spectrum characteristics;
and determining a discriminator error corresponding to the predicted mel-spectrum characteristic of each sample audio based on the discrimination result of the predicted mel-spectrum characteristic of each sample audio.
In an alternative embodiment, the processor 801, after executing the determination of the loss value for each sample audio, further executes:
and when the loss value does not meet the training stopping condition, adjusting parameters in the initialized discriminator according to the loss value to obtain the trained discriminator.
In a specific implementation, the processor 801, the user interface 802, the communication interface 803, and the memory 804 described in the embodiments of the present application may execute an implementation of a computer device described in the singing voice synthesizing method provided in the embodiments of the present application, or may execute an implementation described in the singing voice synthesizing apparatus provided in the embodiments of the present application, which is not described herein again.
The embodiment of the present application further provides a computer readable storage medium, where the computer readable storage medium stores a computer program, where the computer program includes program instructions, and when the program instructions are executed by a processor, the program instructions implement the singing voice synthesis method provided in the embodiment of the present application, and specifically, reference may be made to implementation manners provided by the foregoing steps, which are not described herein again.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the methods as described in embodiments of the present application. The specific implementation manner may refer to the foregoing description, and will not be repeated here.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but those skilled in the art should understand that the present application is not limited by the described order of actions, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program to instruct related hardware, the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The foregoing disclosure is only illustrative of some of the embodiments of the present application and is, of course, not to be construed as limiting the scope of the appended claims; therefore, all changes that come within the meaning and range of equivalency of the claims are intended to be embraced therein.

Claims (10)

1. A singing voice synthesizing method, characterized in that the method comprises:
inputting a syllable sequence and a fundamental frequency mark sequence of audio to be synthesized into a formant model in a target acoustic model to obtain formant characterization information of the audio to be synthesized, wherein the formant characterization information is characterization information without timbre information;
inputting the formant characterization information and pitch information of the audio to be synthesized into a timbre conversion model in the target acoustic model to obtain a Mel spectrum feature, wherein the Mel spectrum feature comprises timbre information of a target object, and the timbre conversion model is obtained based on sample audio training of the target object;
and inputting the Mel spectrum feature into a vocoder to obtain a synthesized audio signal.
2. The method of claim 1, wherein before the inputting of the syllable sequence and the fundamental frequency mark sequence of the audio to be synthesized into the formant model in the target acoustic model, the method further comprises:
acquiring a training sample audio set, wherein the training sample audio set comprises sample audio of a plurality of objects;
training the initialized acoustic model based on each sample audio in the training sample audio set to obtain a trained acoustic model;
taking the formant model in the trained acoustic model as the formant model in the target acoustic model;
wherein the initialized acoustic model includes an initialized formant model and an initialized timbre conversion model.
3. The method of claim 2, wherein after the obtaining the trained acoustic model, the method further comprises:
fixing parameters of a formant model in the trained acoustic model, inputting sample audio corresponding to a target object into the trained acoustic model, and training a timbre conversion model in the trained acoustic model to obtain a trained timbre conversion model;
and taking the trained timbre conversion model as a timbre conversion model in the target acoustic model.
4. The method of claim 2, wherein training the initialized acoustic model based on each sample audio in the training set of sample audio results in a trained acoustic model, comprising:
acquiring syllable sequences, fundamental frequency mark sequences, pitch information, timbre information and acoustic features of each training sample audio in the training sample audio set, wherein the acoustic features are Mel spectrum features;
and training the initialized acoustic model by using the syllable sequence, the fundamental frequency mark sequence, the pitch information, the timbre information and the acoustic features of each sample audio to obtain the trained acoustic model.
5. The method of claim 4, wherein the training of the initialized acoustic model by using the syllable sequence, the fundamental frequency mark sequence, the pitch information, the timbre information and the acoustic features of each sample audio to obtain the trained acoustic model comprises:
inputting the syllable sequence and the fundamental frequency mark sequence of each sample audio into the initialized formant model to obtain formant characterization information of each sample audio;
inputting formant characterization information, pitch information and timbre information of each sample audio into an initialized timbre conversion model to obtain predicted mel spectrum characteristics of each sample audio;
determining a loss value of each sample audio based on the predicted mel-spectrum feature of each sample audio;
and when the loss value does not meet the training stopping condition, adjusting parameters of the initialized formant model and the initialized timbre conversion model included in the initialized acoustic model according to the loss value to obtain the trained acoustic model.
6. The method of claim 5, wherein determining the loss value for each sample audio based on the predicted mel-spectrum feature for each sample audio comprises:
determining a minimum mean square error between the predicted mel-spectrum feature and the acoustic feature of each sample audio, and determining a discriminator error corresponding to the predicted mel-spectrum feature of each sample audio;
and determining a loss value of each sample audio based on the minimum mean square error and the discriminator error.
7. The method of claim 6, wherein determining the discriminator error corresponding to the predicted mel-spectrum feature for each sample audio comprises:
inputting the predicted Mel spectrum characteristics of each sample audio into an initialized discriminator to obtain a discrimination result of the predicted Mel spectrum characteristics of each sample audio, wherein the discrimination result indicates whether the predicted Mel spectrum characteristics are real Mel spectrum characteristics or synthesized Mel spectrum characteristics;
and determining a discriminator error corresponding to the predicted Mel spectrum characteristics of each sample audio based on the discrimination result of the predicted Mel spectrum characteristics of each sample audio.
8. The method of any one of claims 5 to 7, wherein after the determining the loss value for each sample audio, the method further comprises:
and when the loss value does not meet the training stopping condition, adjusting parameters in the initialized discriminator according to the loss value to obtain a trained discriminator.
9. A computer device, comprising: a processor, a communication interface and a memory, the processor, the communication interface and the memory being interconnected, wherein the memory stores executable program code, the processor being adapted to invoke the executable program code to perform the method of any of claims 1-8.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1 to 8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310126243.9A CN116072143A (en) 2023-02-07 2023-02-07 Singing voice synthesizing method and related device

Publications (1)

Publication Number Publication Date
CN116072143A true CN116072143A (en) 2023-05-05

Family

ID=86183580

Country Status (1)

Country Link
CN (1) CN116072143A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination