WO2023245389A1 - Song generation method, apparatus, electronic device, and storage medium - Google Patents

Song generation method, apparatus, electronic device, and storage medium

Info

Publication number
WO2023245389A1
Authority
WO
WIPO (PCT)
Prior art keywords
song
timbre
model
text
target
Prior art date
Application number
PCT/CN2022/099965
Other languages
French (fr)
Chinese (zh)
Inventor
吴洁
Original Assignee
北京小米移动软件有限公司
北京小米松果电子有限公司
Priority date
Filing date
Publication date
Application filed by 北京小米移动软件有限公司 and 北京小米松果电子有限公司
Priority to PCT/CN2022/099965
Publication of WO2023245389A1

Definitions

  • The present disclosure relates to the field of computer technology, and specifically to a song generation method and apparatus, an electronic device, and a storage medium.
  • Song synthesis refers to generating corresponding singing audio based on lyrics and musical scores.
  • Song synthesis algorithms have developed from the initial synthesis technology based on unit concatenation, through statistical parametric synthesis technology, to the current synthesis technology based on deep learning.
  • Song synthesis technology can make machines sing, further increasing the fun of human-computer interaction, and therefore has high commercial value.
  • The embodiments of the present disclosure propose a song generation method, apparatus, electronic device and storage medium, which can be applied in the field of data processing technology and can, during song generation, effectively combine the target user's real Mel spectrum features with the song template corresponding to the target song, so as to effectively reduce the dependence on the amount of user voice data, thereby improving the convenience of song generation while effectively improving the song generation effect.
  • An embodiment of the present disclosure provides a song generation method, including:
  • The training set comes from multiple sampling users.
  • The training set includes multiple samples.
  • One sampling user corresponds to at least one of the samples.
  • Each of the samples includes: the singing audio picked up while the sampling user sings a certain song, and the lyric text corresponding to the singing audio;
  • Embodiments of the present disclosure provide a training method for a song generation model, including:
  • The training set comes from multiple sampling users.
  • The training set includes multiple samples.
  • One sampling user corresponds to at least one of the samples.
  • Each of the samples includes: the singing audio picked up while the sampling user sings a certain song, and the lyric text corresponding to the singing audio;
  • The real Mel spectrum features represent the Mel spectrum features of the singing audio in the first sample;
  • the predicted Mel spectrum features represent the Mel spectrum features predicted by the initial neural network model;
  • Subsequent samples are obtained one by one from the training set and repeatedly input into the latest neural network model until the loss function converges, and a trained song generation model is obtained.
  • An embodiment of the present disclosure proposes a song generation device, including: a first acquisition module, used to acquire the voice audio input by the target user and the unique identification number of the target song;
  • a first processing module, used to extract Mel spectrum features from the voice audio to obtain the real Mel spectrum features of the target user;
  • a second acquisition module, configured to acquire a song template corresponding to the unique identification number according to the unique identification number of the target song;
  • a second processing module, used to input the target user's real Mel spectrum features and the song template into the preset song generation model to obtain the target Mel spectrum features output by the song generation model, wherein:
  • the song generation model is obtained through machine learning training using a training set;
  • the training set comes from multiple sampling users;
  • the training set includes multiple samples;
  • one sampling user corresponds to at least one of the samples;
  • each of the samples includes: the singing audio picked up while the sampling user sings a certain song, and the lyric text corresponding to the singing audio;
  • a generating module, configured to generate a target song according to the target Mel spectrum features.
  • An embodiment of the present disclosure provides a training device for a song generation model, including:
  • a third acquisition module, used to acquire a training set;
  • the training set comes from multiple sampling users;
  • the training set includes multiple samples;
  • one sampling user corresponds to at least one of the samples;
  • each of the samples includes: the singing audio picked up while the sampling user sings a certain song, and the lyric text corresponding to the singing audio;
  • a fourth acquisition module, used to acquire a pre-built initial neural network model, where the initial neural network model includes initial weight parameters and a loss function;
  • a fifth acquisition module, used to acquire the first sample from the training set and input the first sample into the initial neural network model to obtain real Mel spectrum features and predicted Mel spectrum features, where the real Mel spectrum features represent the Mel spectrum features of the singing audio in the first sample, and the predicted Mel spectrum features represent the Mel spectrum features predicted by the initial neural network model;
  • a third processing module, configured to calculate the error between the predicted Mel spectrum features and the real Mel spectrum features according to the loss function;
  • a fourth processing module, configured to adjust the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model;
  • a sixth acquisition module, used to acquire subsequent samples one by one from the training set and repeatedly input the subsequent samples into the latest neural network model until the loss function converges, to obtain the trained song generation model.
  • An embodiment of the present disclosure provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the program, the song generation method proposed in the embodiment of the first aspect of the present disclosure is implemented.
  • Embodiments of the present disclosure provide a non-transitory computer-readable storage medium on which a computer program is stored.
  • When the program is executed by a processor, the song generation method proposed in the embodiment of the first aspect of the present disclosure is implemented, or the training method of the song generation model proposed in the embodiment of the second aspect of the present disclosure is implemented.
  • An embodiment of the present disclosure provides a computer program product.
  • When the instructions in the computer program product are executed by a processor, the song generation method proposed in the embodiment of the first aspect of the present disclosure is executed, or the training method of the song generation model proposed in the embodiment of the second aspect of the present disclosure is executed.
  • Figure 1 is a schematic flowchart of a song generation method proposed by an embodiment of the present disclosure
  • Figure 2 is a schematic flowchart of a song generation method proposed by another embodiment of the present disclosure.
  • Figure 3 is a schematic diagram of a song template generation process proposed by an embodiment of the present disclosure
  • Figure 4 is a schematic flowchart of a song generation method proposed by another embodiment of the present disclosure.
  • Figure 5 is a schematic structural diagram of a timbre encoding sub-model proposed by an embodiment of the present disclosure
  • Figure 6 is a schematic diagram of a song generation process proposed by an embodiment of the present disclosure.
  • Figure 7 is a schematic flowchart of a training method for a song generation model proposed by an embodiment of the present disclosure
  • Figure 8 is a training flow chart of an initial neural network model proposed by an embodiment of the present disclosure.
  • Figure 9 is a schematic structural diagram of a song generation device according to an embodiment of the present disclosure.
  • Figure 10 is a schematic structural diagram of a song generation device according to another embodiment of the present disclosure.
  • Figure 11 is a schematic structural diagram of a training device for a song generation model proposed by an embodiment of the present disclosure
  • Figure 12 is a schematic structural diagram of a training device for a song generation model proposed by another embodiment of the present disclosure.
  • FIG. 13 illustrates a block diagram of an exemplary electronic device suitable for implementing embodiments of the present disclosure.
  • Although the terms first, second, third, etc. may be used to describe various information in the embodiments of the present disclosure, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other.
  • First information may also be called second information, and similarly, the second information may also be called first information.
  • Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
  • The Mel spectrum is a commonly used feature in deep learning for speech.
  • Ordinary spectrograms use a linear frequency scale, while the Mel spectrum is based on the characteristics of human hearing (more sensitive to low-frequency sounds, poorer at resolving high-frequency sounds) and converts the frequency axis of an ordinary spectrogram from a linear scale to the Mel scale.
  • The Mel scale is a logarithmic scale, and human perception of frequency differences is more uniform on the Mel scale.
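For reference, one standard formulation of the linear-to-Mel frequency conversion (a well-known convention; the patent itself does not give a formula) is:

$$m = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right)$$

where $f$ is the frequency in Hz and $m$ is the corresponding Mel value; equal steps in $m$ approximate equal perceived pitch steps.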
  • A phoneme is the smallest unit of speech divided according to the natural properties of speech. It can be analyzed based on the articulatory movements within a syllable: one movement constitutes one phoneme. Phonemes are divided into two categories: vowels and consonants.
  • Timbre refers to the distinctive character that different sounds exhibit in their waveforms; different objects vibrate with their own distinct characteristics.
  • Figure 1 is a schematic flowchart of a song generation method proposed by an embodiment of the present disclosure.
  • The execution subject of the song generation method in this embodiment is a song generation device.
  • The device can be implemented by software and/or hardware.
  • The device can be configured in an electronic device.
  • The electronic device may include but is not limited to a terminal, a server, etc.
  • Terminals can be smartphones, smart TVs, smart watches, smart cars, etc.
  • In some embodiments, the song generation method may include but is not limited to the following steps:
  • Step S101 Obtain the voice audio input by the target user and the unique identification number of the target song.
  • The target user refers to the user who wants to use the song generation method.
  • The voice audio refers to the audio data input by the target user.
  • The voice audio can be the audio data of the target user or the audio data of other users.
  • The target song refers to the song to be generated by the song generation method.
  • The unique identification number refers to identification information corresponding to the target song, such as a number or a name.
  • There may be multiple target songs.
  • Based on the unique identification number, the target song can be accurately located during the song generation process.
  • When acquiring the voice audio input by the target user, an audio acquisition device may be configured in advance in the execution body of the embodiment of the present disclosure, and the audio acquisition device then picks up the target user's voice audio; alternatively,
  • a data interface may be configured in the execution subject of the embodiment of the present disclosure, a song generation request is received through the data interface, and the voice audio is then obtained by parsing the song generation request. There is no limitation on this.
  • When obtaining the unique identification number of the target song, a relationship table may be used.
  • The relationship table records the unique identification number corresponding to the target song; alternatively,
  • a database may be built in advance from the mapping relationship between multiple target songs and their unique identification numbers, and the corresponding unique identification number is then obtained from the database based on the target song. There is no limitation on this.
  • Step S102 Extract Mel spectrum features from the voice audio to obtain the real Mel spectrum features of the target user.
  • The Mel spectrum refers to a spectrogram extracted from audio data; the Mel spectrum is a logarithmic spectrum.
  • The Mel spectrum features refer to the feature information corresponding to the Mel spectrum. It is understandable that the pitch level heard by the human ear does not have a linear relationship with the actual frequency (Hz), and Mel spectrum features better match the auditory characteristics of the human ear.
  • The real Mel spectrum features refer to the Mel spectrum features extracted from the above voice audio.
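As a concrete illustration of step S102 (not part of the patent text), the Mel feature extraction could be done with the librosa library; the sampling rate, frame size, hop length and number of Mel bands below are assumed values:

```python
import librosa
import numpy as np

def extract_mel_features(audio_path: str, sr: int = 22050, n_fft: int = 1024,
                         hop_length: int = 256, n_mels: int = 80) -> np.ndarray:
    """Extract log-Mel spectrum features from the target user's voice audio."""
    y, sr = librosa.load(audio_path, sr=sr)      # load and resample the waveform
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)              # logarithmic spectrum, shape (n_mels, n_frames)

# real_mel = extract_mel_features("target_user_voice.wav")  # hypothetical file name
```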
  • Step S103 Obtain the song template corresponding to the unique identification number according to the unique identification number of the target song.
  • The song template refers to a template that describes information related to the target song.
  • The song template may include lyric text information and song melody information.
  • The lyric text information includes a phoneme sequence and phoneme durations.
  • The song melody information includes a song note sequence and a song energy sequence.
  • In this way, the representation content of the song template can be enriched to a large extent, providing the song generation model with more comprehensive reference information about the target song and thereby effectively improving the applicability of the song template in the song generation process.
  • The lyric text information refers to the text information corresponding to the lyrics of the target song.
  • The song melody information can be used to describe the information related to the melody of the target song.
  • A phoneme is the smallest phonetic unit divided according to the natural properties of speech.
  • A phoneme sequence refers to a sequence composed of multiple phonemes.
  • The phoneme duration refers to the duration information corresponding to a phoneme. Notes are symbols used to record sounds of different lengths.
  • The song note sequence refers to the sequence composed of the notes corresponding to the song audio.
  • Energy can refer to the energy contained in the song audio, such as sound intensity.
  • The song energy sequence can be used to describe how the energy of the song audio changes at different time points.
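One plausible in-memory representation of such a song template, for illustration only (all field names and values are assumptions, not taken from the patent):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SongTemplate:
    song_id: str                  # unique identification number of the target song
    phoneme_sequence: List[str]   # lyric text information: phonemes parsed from the lyrics
    phoneme_durations: List[int]  # frames each phoneme occupies in the song audio
    note_sequence: List[int]      # song melody information: quantized note symbols
    energy_sequence: List[int]    # quantized song energy values

TEMPLATES: Dict[str, SongTemplate] = {
    "song_0001": SongTemplate("song_0001", ["n", "i", "h", "ao"],
                              [5, 8, 4, 12], [60, 60, 62, 64], [3, 4, 4, 2]),
}

def get_song_template(song_id: str) -> SongTemplate:
    """Step S103: look up the song template by its unique identification number."""
    return TEMPLATES[song_id]
```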
  • In some embodiments, multiple song templates may be obtained in advance, and matching is then performed between the unique identification number and the multiple song templates to obtain the song template corresponding to the unique identification number; alternatively, a third-party retrieval device can obtain the song template corresponding to the unique identification number according to the unique identification number of the target song. There is no limitation on this.
  • Step S104 Input the target user's real Mel spectrum features and the song template into the preset song generation model to obtain the target Mel spectrum features output by the song generation model, where the song generation model is obtained through machine learning training using a training set and the training set comes from multiple sampling users.
  • The training set includes multiple samples.
  • One sampling user corresponds to at least one sample.
  • Each sample includes: the singing audio picked up when the sampling user sings a certain song, and the lyric text corresponding to the singing audio.
  • The song generation model refers to a model used to process the real Mel spectrum features and the song template and to output the target Mel spectrum features.
  • The song generation model may be a neural network model.
  • The target Mel spectrum features refer to the Mel spectrum features obtained after the song generation model processes the target user's real Mel spectrum features and the song template.
  • The training set refers to the sample set used by the song generation model during the training process.
  • Sampling users refer to the users who provide samples for the training process of the song generation model.
  • A sample can refer to the singing audio and lyric text used for model training.
  • Singing audio refers to the audio picked up when a sampling user sings a certain song.
  • In some embodiments, the song generation model includes: a timbre encoding sub-model, a text encoding sub-model and an acoustic decoding sub-model.
  • The song generation model is obtained by jointly training the timbre encoding sub-model, the text encoding sub-model and the acoustic decoding sub-model on the same training set.
  • This can effectively improve the structural rationality of the song generation model.
  • Because the same training set is used to jointly train the timbre encoding sub-model, text encoding sub-model and acoustic decoding sub-model, the consistency between the sub-models can be effectively improved, thereby effectively improving the output accuracy of the resulting song generation model.
  • Timbre refers to the distinctive character that different sounds exhibit in their waveforms; different objects vibrate with different characteristics. In other words, different users have different timbres.
  • The timbre encoding sub-model refers to a model used to process the real Mel spectrum features to obtain the target user's timbre feature vector.
  • The text encoding sub-model refers to a model used to process the phoneme sequence to obtain the text feature vector corresponding to the target song.
  • The acoustic decoding sub-model refers to a model used to process multiple kinds of feature information to obtain the target Mel spectrum features.
  • The acoustic decoding sub-model can be the decoder of a fast, end-to-end, non-autoregressive synthesis system.
  • When the target user's real Mel spectrum features and the song template are input into the preset song generation model to obtain the target Mel spectrum features output by the model, the real Mel spectrum features and the relevant information of the song template can be integrated quickly and accurately based on the song generation model, thereby effectively improving the efficiency of generation.
  • Step S105 Generate a target song based on the target mel spectrum characteristics.
  • In some embodiments, the target Mel spectrum features may be input into a vocoder, and the vocoder analyzes and processes the target Mel spectrum features to obtain the target song.
  • For example, the target Mel spectrum features can be input into a preset vocoder model to obtain target linear spectrum features; the linear spectrum features are then subjected to an inverse Fourier transform to obtain the audio data of the target song.
  • The vocoder model is a neural network model, which is also trained through machine learning, using a training set different from that of the song generation model.
  • The vocoder model can be based on a Generative Adversarial Network (GAN), a distillation-free adversarial generation network, etc.
  • The training set can also be a training set commonly used in the existing technology.
  • The training process of the vocoder model can be: input the real Mel spectrum features of a sample into the built initial model to obtain predicted linear spectrum features, and calculate the error between the predicted linear spectrum features and the sample's real linear spectrum features through the loss function;
  • the initial model weights are modified according to the error, and samples are input in this way until the loss function converges, yielding the trained vocoder model.
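For illustration only: the patent's vocoder is a trained neural model, but the mel-to-linear-spectrum-to-audio path it describes can be sketched with classical signal processing, using librosa's Griffin-Lim phase recovery as a non-neural stand-in (parameters assumed to match the feature extraction above):

```python
import librosa
import numpy as np

def mel_to_waveform(mel_db: np.ndarray, sr: int = 22050,
                    n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
    """Approximate step S105: Mel features -> linear spectrum -> audio."""
    mel_power = librosa.db_to_power(mel_db)           # undo the logarithmic scale
    linear = librosa.feature.inverse.mel_to_stft(     # target linear spectrum features
        mel_power, sr=sr, n_fft=n_fft)
    return librosa.griffinlim(linear, hop_length=hop_length)  # phase recovery + inverse transform

# audio = mel_to_waveform(target_mel)  # target_mel: output of the song generation model
```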
  • In the embodiments of the present disclosure, the voice audio input by the target user and the unique identification number of the target song are obtained; Mel spectrum feature extraction is performed on the voice audio to obtain the real Mel spectrum features of the target user; and the song template corresponding to the unique identification number is obtained according to the unique identification number of the target song.
  • The target user's real Mel spectrum features and the song template are input into the preset song generation model to obtain the target Mel spectrum features output by the song generation model, and the target song is generated based on the target Mel spectrum features.
  • In this way, the target user's real Mel spectrum features and the song template corresponding to the target song can be effectively combined during song generation, which effectively reduces the dependence on the amount of user voice data, improves the convenience of song generation, and effectively improves the song generation effect.
  • Figure 2 is a schematic flowchart of a song generation method proposed by another embodiment of the present disclosure.
  • In some embodiments, the song generation method may include but is not limited to the following steps:
  • Step S201 Obtain the voice audio input by the target user and the unique identification number of the target song.
  • Step S202 Extract Mel spectrum features from the voice audio to obtain the real Mel spectrum features of the target user.
  • Step S203 Obtain the song template corresponding to the unique identification number according to the unique identification number of the target song.
  • Step S204 Input the target user's real mel spectrum features into the timbre coding sub-model to obtain the target user's timbre feature vector.
  • The timbre feature vector refers to the vector used to characterize the timbre features of the target user.
  • Step S205 Input the phoneme sequence into the text encoding sub-model to obtain the text feature vector of the lyrics text in the song template.
  • The lyric text refers to the text data in the song template that describes the lyric information of the target song.
  • The text feature vector refers to the vector used to characterize the text features corresponding to the lyric text.
  • In some embodiments, the song template is composed of the phoneme sequence, the phoneme durations, the song note sequence, the song energy sequence, and the unique identification number of the target song, where the phoneme sequence and phoneme durations of the target song are determined from the song audio and song lyrics of the target song,
  • and the song note sequence and song energy sequence of the target song are determined from the song audio. Therefore, the target song can be quickly located based on the unique identification number, which effectively improves the practicality of the obtained song template and, at the same time, effectively improves the accuracy with which the song template represents the lyric-related information of the target song.
  • The song audio refers to the singing audio corresponding to the target song.
  • The song lyrics refer to the lyric information corresponding to the target song.
  • In some embodiments, the phoneme sequence includes multiple phonemes obtained by parsing the song lyrics, and the phoneme duration includes the first frame number that each phoneme occupies in the song audio.
  • In this way, the consistency between the obtained phoneme sequence and the song lyrics can be effectively improved, while the accuracy of the first frame number obtained for each phoneme is also effectively improved.
  • The first frame number refers to the number of audio frames corresponding to a phoneme in the song audio.
  • In some embodiments, the song energy sequence is obtained by quantizing the song energy features of the song audio,
  • and the song note sequence is obtained by quantizing the song fundamental frequency features of the song audio.
  • Based on the quantization process, the clarity with which the obtained song energy sequence and song note sequence represent the song energy features and song fundamental frequency features can be effectively improved,
  • and the song energy sequence and song note sequence obtained through quantization can provide reliable reference data for subsequent computation.
  • The song energy features can be used to describe the features related to the energy of the song.
  • The song fundamental frequency features can be used to describe the features related to the fundamental frequency of the song.
  • In some embodiments, the song energy features include multiple energy values; the song energy sequence is formed from multiple range encoding values, and a range encoding value is obtained by one-hot encoding the energy range corresponding to an energy value. The one-hot encoding effectively expands the song energy features so that multiple energy values can be distinguished by the obtained range encoding values; when the song energy sequence is formed from these range encoding values, the ability of the obtained song energy sequence to represent the song energy features is effectively improved.
  • The energy value may refer to the value corresponding to the energy of the song.
  • The energy range refers to the value range corresponding to the energy value, such as 0-10.
  • One-hot encoding can also be called one-bit-effective encoding.
  • One-hot encoding uses an N-bit status register to encode N states; each state has its own independent register bit, and at any time only one bit is valid.
  • The range encoding value refers to the encoding value obtained by one-hot encoding the energy range.
  • For example, the one-hot encoding can be configured as: 000001, 000010, 000100, 001000, 010000, 100000.
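A minimal sketch of this quantization, assuming six equal-width energy bands over the illustrative 0-10 range (the band boundaries are not specified in the patent):

```python
import numpy as np

def energy_to_range_code(energy: float, n_bands: int = 6,
                         e_min: float = 0.0, e_max: float = 10.0) -> np.ndarray:
    """One-hot encode the energy range (band) into which an energy value falls."""
    band = int((energy - e_min) / (e_max - e_min) * n_bands)
    band = min(max(band, 0), n_bands - 1)   # clamp to a valid band index
    code = np.zeros(n_bands, dtype=np.int64)
    code[band] = 1                          # at any time only one bit is valid
    return code

# The song energy sequence is then the list of range encoding values, one per frame:
energy_sequence = [energy_to_range_code(e) for e in (0.7, 4.2, 9.8)]
```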
  • In some embodiments, the song fundamental frequency features include multiple fundamental frequency values, and the song note sequence includes a note symbol corresponding to each fundamental frequency value. The song note sequence can thus effectively use the correspondence between fundamental frequency values and note symbols and can be adapted to personalized application scenarios, thereby effectively improving the applicability of the resulting song note sequence in the song generation process.
  • The fundamental frequency value refers to the value corresponding to the fundamental frequency of the song.
  • Note symbols refer to the numbers corresponding to notes, which can be obtained from relevant databases in the music field.
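The example mapping given later in the training description (261.63 Hz → note symbol 60, 277.18 Hz → 61) matches the standard MIDI note-number convention, which can be computed as follows (the MIDI interpretation is an inference, not stated in the patent):

```python
import math

def f0_to_note_symbol(f0_hz: float) -> int:
    """Map a fundamental frequency value to a note symbol (MIDI convention: A4 = 440 Hz = 69)."""
    return round(69 + 12 * math.log2(f0_hz / 440.0))

assert f0_to_note_symbol(261.63) == 60   # C4, as in the patent's example
assert f0_to_note_symbol(277.18) == 61   # C#4
```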
  • Figure 3 is a schematic diagram of a song template generation process proposed by an embodiment of the present disclosure.
  • The initial data for the song template may include the song audio and song lyrics corresponding to the target song.
  • The song template generation process may include: (1) process the song lyrics with a text transcription method to obtain the phoneme sequence corresponding to the target song; (2) process the obtained phoneme sequence and the song audio with forced alignment to obtain the phoneme durations of the target song;
  • a forced alignment method can be used on its own, or manual calibration can be performed after the forced alignment operation to improve the accuracy of the phoneme durations; (3) process the song audio with an acoustic feature extraction method to obtain the song energy features and song fundamental frequency features corresponding to the target song, and then apply energy trajectory translation and fundamental frequency trajectory translation to change the energy and pitch values of the song, improving the flexibility of the song template; (4) quantize the song energy features and song fundamental frequency features to obtain the song energy sequence and song note sequence; (5) generate the song template from the phoneme sequence, phoneme durations, song energy sequence and song note sequence; (6) after the song template is generated, a unique identification number of the target song can be generated for the song template, so that the song template can be retrieved based on the unique identification number during song generation.
  • Step S206 Perform duration regularization on the text feature vector and the timbre feature vector according to the phoneme duration to obtain a frame-level text feature vector and a frame-level timbre feature vector.
  • The frame-level text feature vector refers to a vector describing the text features corresponding to multiple audio frames.
  • The frame-level timbre feature vector refers to a vector describing the timbre features corresponding to multiple audio frames.
  • The same phoneme may span multiple audio frames, and the audio frames corresponding to the same phoneme are highly similar.
  • The phoneme-level text feature vector and timbre feature vector can therefore be converted into frame-level text and timbre feature vectors by copying, which facilitates the subsequent step in which
  • the frame-level text feature vector and the frame-level timbre feature vector are added together.
  • Step S207 Add the frame-level text feature vector, frame-level timbre feature vector and song melody information and then input them into the acoustic decoding sub-model to obtain the target mel spectrum feature.
  • Here, addition refers to element-wise addition across dimensions. Assume the frame-level text feature vector, the frame-level timbre feature vector, the song note sequence, and the song energy sequence are all 10-dimensional; addition then means adding the values in the corresponding dimensions.
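A minimal sketch of this fusion step (all shapes assumed; the melody sequences are taken to be already embedded to the common dimension):

```python
import torch

n_frames, dim = 120, 10                     # illustrative sizes
text_feats   = torch.randn(n_frames, dim)   # frame-level text feature vectors
timbre_feats = torch.randn(n_frames, dim)   # frame-level timbre feature vectors
note_seq     = torch.randn(n_frames, dim)   # embedded song note sequence
energy_seq   = torch.randn(n_frames, dim)   # embedded song energy sequence

# Element-wise addition over corresponding dimensions, then acoustic decoding.
decoder_input = text_feats + timbre_feats + note_seq + energy_seq
# target_mel = acoustic_decoder(decoder_input)  # acoustic decoding sub-model (not defined here)
```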
  • In the embodiments of the present disclosure, the target user's real Mel spectrum features can be input into the timbre encoding sub-model to obtain the target user's timbre feature vector,
  • the phoneme sequence is input into the text encoding sub-model to obtain the text feature vector of the lyric text in the song template, and duration regularization is performed on the text feature vector and the timbre feature vector according to the phoneme durations to obtain the frame-level text feature vector and the frame-level timbre feature vector;
  • the frame-level text feature vector, the frame-level timbre feature vector and the song melody information are then added together and input into the acoustic decoding sub-model to obtain the target Mel spectrum features.
  • Step S208 Generate a target song based on the target mel spectrum characteristics.
  • For the description of step S208, reference may be made to the above embodiments; details are not repeated here.
  • In this way, the timbre feature vector of the target user is obtained, and the phoneme sequence is input into the text encoding sub-model to obtain the text feature vector of the lyric text in the song template.
  • Duration regularization is performed on the text feature vector and the timbre feature vector according to the phoneme durations to obtain the frame-level text feature vector and the frame-level timbre feature vector.
  • The frame-level text feature vector, the frame-level timbre feature vector and the song melody information are added together and then input into the acoustic decoding sub-model to obtain the target Mel spectrum features.
  • The real Mel spectrum features and the phoneme sequence can thus be quickly processed by the timbre encoding sub-model and the text encoding sub-model, and the corresponding timbre and text features can be quantified in vector form. Performing duration regularization on the text and timbre feature vectors based on the phoneme durations can effectively improve the consistency between the obtained frame-level text feature vector and frame-level timbre feature vector,
  • thereby effectively improving how well the acoustic decoding sub-model processes the frame-level text feature vector and the frame-level timbre feature vector.
  • Figure 4 is a schematic flowchart of a song generation method proposed by another embodiment of the present disclosure.
  • In some embodiments, the song generation method may include but is not limited to the following steps:
  • Step S401 Obtain the voice audio input by the target user and the unique identification number of the target song.
  • Step S402 Extract Mel spectrum features from the voice audio to obtain the real Mel spectrum features of the target user.
  • Step S403 Obtain the song template corresponding to the unique identification number according to the unique identification number of the target song.
  • Step S404 Input the target user's real mel spectrum features into the reference encoder to obtain the target user's timbre latent space distribution vector.
  • The reference encoder refers to an encoder used to process the real Mel spectrum features to obtain the timbre latent space distribution vector.
  • The timbre latent space distribution vector output by the reference encoder can be regarded as the latent variables corresponding to the real Mel spectrum features.
  • The timbre latent space distribution vector follows a spherical Gaussian distribution.
  • The reference encoder can also output the mean and variance corresponding to that spherical Gaussian distribution.
  • Step S405 Input the timbre latent space distribution vector into the autoregressive encoder to obtain the timbre distribution vector of the target user, where the timbre distribution vector is obtained by sampling the timbre latent space distribution vector by the autoregressive encoder.
  • The autoregressive encoder refers to an encoder used to process the timbre latent space distribution vector to obtain the timbre distribution vector.
  • The structures of the above reference encoder and autoregressive encoder can be multiple linear layers or convolutional layers; there is no limitation on this.
  • Step S406 Use the timbre distribution vector as the timbre feature vector of the target user.
  • In some embodiments, the timbre encoding sub-model can include: a reference encoder and an autoregressive encoder.
  • The target user's real Mel spectrum features can be input into the reference encoder to obtain the target user's timbre latent space distribution vector, and the timbre latent space distribution vector is input into the autoregressive encoder to obtain the target user's timbre distribution vector, where the timbre distribution vector is obtained by the autoregressive encoder
  • sampling from the timbre latent space distribution vector; the timbre distribution vector is then used as the target user's timbre feature vector. This can effectively reduce the redundant information in the obtained timbre feature vector while converting the relatively complex real Mel spectrum features into vector form, thereby effectively improving the practicality of the obtained timbre feature vector.
  • Figure 5 is a schematic structural diagram of a timbre encoding sub-model proposed by an embodiment of the present disclosure, in which the random sampling point ε is a random sampling point of a Gaussian distribution, which can be expressed as ε ~ N(0, I).
  • The timbre encoding sub-model can obtain the timbre latent space distribution vector h and two parameters through the processing of the reference encoder.
  • The two parameters can be used as the mean a1 and variance b1 of the Gaussian distribution, respectively.
  • The timbre encoding sub-model can perform sampling based on an inverse autoregressive flow (IAF), which is a kind of normalizing flow.
  • Normalizing flows produce distributions that are easy to sample from.
  • A normalizing flow can convert a complex input distribution into a tractable probability distribution through a series of invertible transformation operations.
  • The output distribution is usually chosen to be an isotropic unit Gaussian distribution, that is, a spherical unit Gaussian distribution, which allows smooth interpolation and efficient sampling.
  • The timbre feature vector is learned using the inverse autoregressive flow method.
  • The generated timbre latent space distribution vector h can follow a spherical Gaussian distribution, so that the timbre feature vector can be obtained by sampling from this distribution.
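A highly simplified sketch of this structure, under stated assumptions (the layer types, sizes and the single affine IAF-style step are illustrative; the patent does not fix them):

```python
import torch
import torch.nn as nn

class TimbreEncoder(nn.Module):
    """Reference encoder -> (mean a1, variance b1); sample with eps ~ N(0, I); one IAF-style step."""
    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.reference_encoder = nn.GRU(n_mels, dim, batch_first=True)
        self.to_stats = nn.Linear(dim, 2 * dim)   # predicts mean and log-variance
        self.iaf_step = nn.Linear(dim, 2 * dim)   # one affine, invertible refinement

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) real Mel spectrum features
        _, h = self.reference_encoder(mel)              # timbre latent space summary h
        a1, log_b1 = self.to_stats(h[-1]).chunk(2, -1)  # mean a1, log-variance b1
        eps = torch.randn_like(a1)                      # random sampling point eps ~ N(0, I)
        z = a1 + eps * torch.exp(0.5 * log_b1)          # sample from the latent distribution
        m, log_s = self.iaf_step(z).chunk(2, -1)        # autoregressive-flow-style transform
        return m + z * torch.exp(log_s)                 # timbre distribution (feature) vector

timbre_vec = TimbreEncoder()(torch.randn(1, 120, 80))   # -> (1, 256)
```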
  • Step S407 Input the phoneme sequence into the text encoding sub-model to obtain the text feature vector of the lyrics text in the song template.
  • For the description of step S407, reference may be made to the above embodiments; details are not repeated here.
  • Step S408 Determine the initial text code corresponding to each phoneme in the phoneme sequence from the text feature vector.
  • The initial text encoding refers to the text encoding contained in the text feature vector.
  • In this way, reliable reference data can be provided for the subsequent determination of the target text encoding.
  • Step S409 Determine the first frame number corresponding to the phoneme based on the phoneme duration.
  • The first frame number refers to the number of audio frames corresponding to each phoneme.
  • For example, the phoneme duration corresponding to one phoneme may be 25 ms; if one audio frame is set to 5 ms, then this phoneme corresponds to 5 frames.
  • Step S410 Copy the initial text code, and perform splicing processing on the copied initial text code of the first frame number to obtain the target text code.
  • The target text encoding refers to the text encoding obtained by splicing together the first frame number of copies of the initial text encoding.
  • The duration corresponding to a single phoneme may be short, and the multiple audio frames corresponding to the same phoneme may carry considerable redundant information.
  • Step S411 Form a frame-level text feature vector according to multiple target text codes.
  • In this way, the initial text encoding corresponding to each phoneme in the phoneme sequence can be determined from the text feature vector;
  • the first frame number corresponding to each phoneme is determined based on the phoneme duration, the initial text encoding is copied, the first frame number of copies are spliced together to obtain the target text encoding, and a frame-level text feature vector is formed from the multiple target text encodings.
  • Since the time range corresponding to each phoneme is small and the different audio frames within the same phoneme represent highly similar content, obtaining the target text encoding by copying the initial text encoding and splicing the first frame number of copies
  • can greatly reduce the computational cost, thereby effectively improving the efficiency of determining the frame-level text feature vector.
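A compact sketch of this duration regularization (length regulation) step; tensor shapes are assumptions:

```python
import torch

def length_regulate(phoneme_encodings: torch.Tensor,
                    frame_counts: torch.Tensor) -> torch.Tensor:
    """Copy each phoneme's initial text encoding `first frame number` times and
    splice the copies into a frame-level sequence."""
    # phoneme_encodings: (n_phonemes, dim); frame_counts: (n_phonemes,) integer frames
    return torch.repeat_interleave(phoneme_encodings, frame_counts, dim=0)

enc = torch.randn(4, 256)                       # one initial text encoding per phoneme
durs = torch.tensor([5, 8, 4, 12])              # first frame number per phoneme
frame_level_text = length_regulate(enc, durs)   # -> (29, 256) frame-level text features

# The single timbre feature vector is broadcast the same way, using the total
# (second) frame number of the voice audio:
timbre = torch.randn(1, 256)
frame_level_timbre = timbre.expand(int(durs.sum()), -1)   # -> (29, 256)
```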
  • Step S412 Determine the second frame number of the speech audio based on the phoneme duration.
  • The second frame number refers to the number of frames of the voice audio determined based on the phoneme durations.
  • Step S413 Copy the timbre feature vector, and splice the copied timbre feature vectors of the second frame number to obtain a frame-level timbre feature vector.
  • The embodiment of the present disclosure can determine the second frame number of the voice audio based on the phoneme durations, copy the timbre feature vector, and splice
  • the second frame number of copies of the timbre feature vector to obtain the frame-level timbre feature vector. The obtained frame-level timbre feature vector can thus effectively represent the relevant feature information of the speech to be processed at the audio-frame level, which effectively improves
  • the compatibility between the frame-level timbre feature vector and the frame-level text feature vector and effectively improves the representation effect of the obtained frame-level timbre feature vector.
  • Step S414 Add the frame-level text feature vector, the frame-level timbre feature vector and the song melody information and then input them into the acoustic decoding sub-model to obtain the target mel spectrum feature.
  • Step S415 Generate the target song according to the target mel spectrum characteristics.
  • For the description of steps S414 to S415, reference may be made to the above embodiments; details are not repeated here.
  • Figure 6 is a schematic diagram of a song generation process proposed by an embodiment of the present disclosure.
  • In some embodiments, the operation process corresponding to the song generation model may include: (1) process the voice audio with an acoustic feature extraction method to obtain the real Mel spectrum features; (2) input the real Mel spectrum features into the timbre encoding sub-model to obtain the timbre feature vector; (3) input the phoneme sequence in the song template into the text encoding sub-model to obtain the text feature vector of the target song; (4) input the phoneme durations, the timbre feature vector and the text feature vector into the duration regularization sub-module to obtain the frame-level text feature vector and the frame-level timbre feature vector; (5) add the frame-level text feature vector, the frame-level timbre feature vector, the song note sequence and the song energy sequence together and input the result into the acoustic decoding sub-model to obtain the target Mel spectrum features; (6) input the obtained target Mel spectrum features into the vocoder to obtain the target song.
  • In this way, multiple users can share one pre-trained song generation model, and a song performed in a given user's voice can be obtained from a single piece of that user's audio, which effectively improves convenience during song generation while reducing computing resources and storage costs.
  • In the embodiments of the present disclosure, the target user's timbre latent space distribution vector is obtained, and the timbre latent space distribution vector is input into the autoregressive encoder to obtain the target user's timbre distribution vector, where the timbre distribution vector is obtained by the autoregressive encoder sampling from the timbre latent space distribution vector.
  • The timbre distribution vector is used as the target user's timbre feature vector. This can effectively reduce the redundant information in the resulting timbre feature vector while converting the relatively complex real Mel spectrum features into vector form, thereby effectively improving the practicality of the obtained timbre feature vector.
  • By determining the initial text encoding corresponding to each phoneme in the phoneme sequence from the text feature vector, determining the first frame number corresponding to each phoneme according to the phoneme duration, copying the initial text encoding, and splicing the first frame number of copies together to obtain the target text encoding, a frame-level text feature vector is formed from multiple target text encodings. Since the time range corresponding to each phoneme is small and the different audio frames within the same phoneme represent highly similar content, obtaining the target text encoding in this way greatly reduces the computational cost, thereby effectively improving the efficiency of determining the frame-level text feature vector.
  • Likewise, the copied timbre feature vectors of the second frame number are spliced to obtain the frame-level timbre feature vector, which can effectively represent the relevant feature information of the speech to be processed at the audio-frame level, effectively improving the compatibility between the frame-level timbre feature vector and the frame-level text feature vector and improving the representation effect of the obtained frame-level timbre feature vector.
  • FIG. 7 is a schematic flowchart of a training method for a song generation model proposed by an embodiment of the present disclosure.
  • The execution subject of the training method of the song generation model in this embodiment is a training device of the song generation model.
  • The device can be implemented by software and/or hardware, and the device can be configured in an electronic device.
  • Electronic devices may include but are not limited to terminals, servers, etc.
  • Terminals may be smartphones, smart TVs, smart watches, smart cars, etc.
  • In some embodiments, the training method of the song generation model may include but is not limited to the following steps:
  • Step S701 Obtain a training set.
  • The training set comes from multiple sampling users.
  • The training set includes multiple samples.
  • One sampling user corresponds to at least one sample.
  • Each sample includes: the singing audio picked up when the sampling user sings a certain song, and the lyric text corresponding to the singing audio.
  • In some embodiments, a communication link between the execution subject of the embodiment of the present disclosure and a big data server may be established in advance, and the training set is then obtained from the big data server; alternatively,
  • the training set may be collected from multiple sampling users with a sample collection device. There is no limitation on this.
  • Step S702 Obtain a pre-built initial neural network model.
  • The initial neural network model includes initial weight parameters and a loss function.
  • A neural network model is a complex network system formed by a large number of simple processing units (called neurons) that are widely interconnected; it reflects many basic characteristics of human brain function.
  • The initial neural network model refers to the neural network model to be trained.
  • The initial weight parameters refer to the weight parameters to be iteratively updated during model training.
  • The loss function can be used to describe the error between the predicted Mel spectrum features output by the initial neural network model during training and the real Mel spectrum features.
  • Based on the loss function, model performance can be evaluated in real time during training, and whether the model has converged can be judged in a timely manner.
  • Step S703 Obtain the first sample from the training set and input the first sample into the initial neural network model to obtain the real Mel spectrum features and the predicted Mel spectrum features.
  • The real Mel spectrum features represent the Mel spectrum features of the singing audio in the first sample,
  • and the predicted Mel spectrum features represent the Mel spectrum features predicted by the initial neural network model.
  • The first sample refers to the first one of the multiple samples in the training set used for model training.
  • In some embodiments, a sample may be randomly selected from the training set as the first sample, or the first sample may be obtained from the training set based on the numbering information of the multiple samples in the training set. There is no limitation on this.
  • In some embodiments, the lyric text in the first sample may be transcribed to obtain the phoneme sequence, and the singing audio in the first sample is aligned against the phoneme sequence to obtain the phoneme durations; acoustic feature extraction is performed on the singing audio in the first sample to obtain the real Mel spectrum features, audio energy and fundamental frequency trajectory of the first sample; the phoneme sequence is input into the initial text encoding sub-model to obtain the text feature vector of the first sample, and the real Mel spectrum features of the first sample are input into the initial timbre encoding sub-model to obtain the timbre feature vector of the first sample; the text feature vector, the timbre feature vector, and the quantized audio energy and fundamental frequency trajectories are then added together and input into the initial acoustic decoding sub-model to obtain the predicted Mel spectrum features of the first sample.
  • The initial text encoding sub-model refers to the text encoding sub-model to be trained.
  • The initial timbre encoding sub-model refers to the timbre encoding sub-model to be trained.
  • The initial acoustic decoding sub-model refers to the acoustic decoding sub-model to be trained.
  • Audio energy refers to the energy information corresponding to the singing audio in the first sample.
  • The fundamental frequency trajectory refers to the trajectory information corresponding to the fundamental frequency of the singing audio in the first sample.
  • Step S704 Calculate the error between the predicted mel spectrum feature and the real mel spectrum feature according to the loss function.
  • The error can be used to describe the difference between the predicted Mel spectrum features and the real Mel spectrum features.
  • In this way, the output accuracy of the initial neural network model can be evaluated in real time to determine model performance, and the resulting error can provide reliable reference data for determining the direction of model optimization.
  • Step S705 Adjust the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model.
  • When the initial weight parameters of the initial neural network model are adjusted based on the error, the weight parameters can be adjusted accurately, thereby effectively improving the training effect of the neural network model.
  • Step S706 Obtain subsequent samples one by one from the training set, and repeatedly input the subsequent samples into the latest neural network model until the loss function converges, and obtain the trained song generation model.
  • Subsequent samples refer to the samples in the training set other than the first sample.
  • FIG. 8 is a training flow chart of an initial neural network model proposed by an embodiment of the present disclosure.
  • In some embodiments, the initial neural network model may include an initial timbre encoding sub-model, an initial text encoding sub-model, and an initial acoustic decoding sub-model.
  • The training process may include: (1) the song lyrics are processed through text transcription to obtain the corresponding phoneme sequence, and the resulting phoneme sequence is processed through the initial text encoding sub-model to obtain the corresponding text feature vector; (2) the text feature vector and the song phoneme durations are processed based on forced alignment to obtain the initial text encoding; (3) the song audio is processed based on acoustic feature extraction to obtain the real Mel spectrum features, song energy features and song fundamental frequency features; (4) the real Mel spectrum features are processed through the initial timbre encoding sub-model to obtain the timbre feature vector; (5) the multiple energy values in the song energy features are divided into different energy bands (for example, energy values in the range 0-10 can be divided into 10 or 20 energy bands), and the bands are encoded to obtain the song energy sequence; (6) the fundamental frequency values in the song fundamental frequency features are converted into the corresponding note symbols to obtain the song note sequence; for example,
  • the note symbol corresponding to the fundamental frequency 261.63 Hz is 60, and the note symbol corresponding to the fundamental frequency 277.18 Hz is 61; (7) the duration regularization method is used to process the initial text encoding based on the song phoneme durations to obtain the frame-level text feature vector; (8) the duration regularization sub-model processes the timbre feature vector based on the song phoneme durations to obtain the frame-level timbre feature vector; (9) the above frame-level text feature vector, frame-level timbre feature vector, song energy sequence and song note sequence are input into the acoustic decoding sub-model to obtain the predicted Mel spectrum features; (10) the loss function of the song generation model is determined based on the real Mel spectrum features and the predicted Mel spectrum features; through this loss function, each weight parameter in the song generation model can be iteratively updated by gradient backpropagation so that the loss function tends to converge, as sketched below.
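A minimal training-loop sketch of steps S703 to S706 (the MSE form of the loss, the optimizer and the epoch-based convergence proxy are assumptions; the patent only requires an error between predicted and real Mel features followed by iterative weight updates):

```python
import torch
import torch.nn as nn

def train_song_generation_model(model: nn.Module, training_set,
                                epochs: int = 100, lr: float = 1e-4) -> nn.Module:
    """Feed samples one by one, compare predicted vs. real Mel features, update weights."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                         # assumed form of the loss function
    for _ in range(epochs):                        # proxy for "until the loss converges"
        for sample in training_set:                # (singing audio, lyric text) pairs
            real_mel, pred_mel = model(sample)     # forward pass through the sub-models
            loss = loss_fn(pred_mel, real_mel)     # error between predicted and real features
            optimizer.zero_grad()
            loss.backward()                        # gradient backpropagation
            optimizer.step()                       # adjust the weight parameters
    return model
```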
  • In the embodiments of the present disclosure, the training set is obtained from multiple sampling users.
  • The training set includes multiple samples.
  • One sampling user corresponds to at least one sample.
  • Each sample includes: the singing audio picked up when the sampling user sings a certain song, and the lyric text corresponding to the singing audio.
  • The pre-built initial neural network model, which includes initial weight parameters and a loss function, is obtained; the first sample is obtained from the training set and input into the initial neural network model
  • to obtain the real Mel spectrum features and the predicted Mel spectrum features.
  • The real Mel spectrum features represent the Mel spectrum features of the singing audio in the first sample,
  • and the predicted Mel spectrum features represent the Mel spectrum features predicted by the initial neural network model.
  • The error between the predicted Mel spectrum features and the real Mel spectrum features is calculated according to the loss function, the initial weight parameters of the initial neural network model are adjusted based on the error to obtain an updated neural network model, subsequent samples are obtained one by one from the training set, and the subsequent samples are repeatedly input into the latest neural network model until the loss function converges, yielding the trained song generation model. In this way, the error between the model's predicted Mel spectrum features and the real Mel spectrum features can be determined in real time during training based on the loss function, providing a reliable basis for judging model convergence and thereby effectively improving the output accuracy of the song generation model.
  • FIG. 9 is a schematic structural diagram of a song generation device according to an embodiment of the present disclosure.
  • The song generation device 90 includes:
  • the first acquisition module 901, used to acquire the voice audio input by the target user and the unique identification number of the target song;
  • the first processing module 902, used to extract Mel spectrum features from the voice audio to obtain the real Mel spectrum features of the target user;
  • the second acquisition module 903, used to acquire the song template corresponding to the unique identification number according to the unique identification number of the target song;
  • the second processing module 904, used to input the target user's real Mel spectrum features and the song template into the preset song generation model to obtain the target Mel spectrum features output by the song generation model, where the song generation model is obtained through machine learning training using a training set.
  • The training set comes from multiple sampling users.
  • The training set includes multiple samples.
  • One sampling user corresponds to at least one sample.
  • Each sample includes: the singing audio picked up when the sampling user sings a certain song, and the lyric text corresponding to the singing audio;
  • the generation module 905, used to generate the target song according to the target Mel spectrum features.
  • the song generation model includes: timbre encoding sub-model, text encoding sub-model and acoustic decoding sub-model; the song generation model is obtained by jointly training the timbre encoding sub-model, text encoding sub-model and acoustic decoding sub-model using the same training set.
  • the song template includes lyric text information and song melody information; the lyric text information includes phoneme sequences and phoneme durations; and the song melody information includes song note sequences and song energy sequences.
  • the second processing module 904 includes: a first processing sub-module 9041, which is used to input the target user's real mel spectrum characteristics into the timbre encoding sub-model to obtain the target user's timbre feature vector;
  • the second processing sub-module 9042 is used to input the phoneme sequence into the text encoding sub-model to obtain the text feature vector of the lyrics text in the song template;
  • the third processing sub-module 9043 is used to perform duration processing on the text feature vector and timbre feature vector according to the phoneme duration.
  • frame-level text feature vectors and frame-level timbre feature vectors are obtained; the fourth processing sub-module 9044 is used to add the frame-level text feature vectors, frame-level timbre feature vectors and song melody information and then input them to the acoustic decoding sub-model. , obtain the target mel spectrum characteristics.
  • the first processing sub-module 9041 is specifically used to: input the real mel spectrum characteristics of the target user into the reference encoder to obtain the timbre latent space distribution vector of the target user; Input the timbre latent space distribution vector into the autoregressive encoder to obtain the timbre distribution vector of the target user, where the timbre distribution vector is obtained by sampling the timbre latent space distribution vector by the autoregressive encoder; use the timbre distribution vector as the timbre of the target user Feature vector.
  • the third processing sub-module 9043 is specifically used to: determine the initial text code corresponding to each phoneme in the phoneme sequence from the text feature vector; determine the first text code corresponding to the phoneme according to the phoneme duration. One frame number; copy the initial text encoding, and perform splicing processing on the copied initial text encoding of the first frame number to obtain the target text encoding; form a frame-level text feature vector based on multiple target text encodings.
  • the third processing sub-module 9043 is also used to: determine the second frame number of the speech audio according to the phoneme duration; copy the timbre feature vector, and copy the timbre feature vector of the second frame number Perform splicing processing to obtain frame-level timbre feature vectors.
  • the song template is configured by the phoneme sequence, phoneme duration, song note sequence, song energy sequence, and the unique identification number of the target song, wherein the phoneme sequence and phoneme duration of the target song are configured by The song audio and song lyrics of the target song are determined, and the song note sequence and song energy sequence of the target song are determined by the song audio.
  • the phoneme sequence includes: multiple phonemes obtained by parsing the song lyrics, and the phoneme duration includes: the first frame number occupied by each phoneme in the song audio.
  • the song energy sequence is obtained by quantizing the song energy characteristics of the song audio
  • the song note sequence is obtained by quantizing the song fundamental frequency characteristics of the song audio.
  • the song energy characteristics include: multiple energy values; the song energy sequence is formed based on multiple range encoding values, and the range encoding values are processed by one-hot encoding of the energy range corresponding to the energy value. get.
  • the song fundamental frequency feature includes: a plurality of fundamental frequency values; and the song note sequence includes a note symbol corresponding to each fundamental frequency value.
  • Figure 11 is a schematic structural diagram of a training device for a song generation model proposed by an embodiment of the present disclosure.
  • the training device 110 of the song generation model includes: a third acquisition module 1101, used to obtain a training set.
  • the training set comes from multiple sampling users.
  • the training set includes multiple samples.
  • One sampling user at least corresponds to A sample, each sample includes: the singing audio picked up when the user sings a certain song and the lyrics text corresponding to the singing audio;
  • the fourth acquisition module 1102 is used to obtain the pre-built initial neural network model, the initial neural network model Including initial weight parameters and loss functions;
  • the fifth acquisition module 1103 is used to obtain the first sample from the training set, and input the first sample into the initial neural network model to obtain real Mel spectrum features and predicted Mel spectrum features,
  • the real mel spectrum feature represents the mel spectrum feature of the singing audio in the first sample, and the predicted mel spectrum feature represents the mel spectrum feature predicted by the initial neural network model;
  • the third processing module 1104 is used to calculate predictions based on the loss function The error between the Mel spectrum feature and the real Mel spectrum feature;
  • the fourth processing module 1105 is used to
  • the initial neural network model includes: an initial timbre encoding sub-model, Initial text encoding sub-model, and initial acoustic decoding sub-model;
  • the fifth acquisition module 1103 includes: the fifth processing sub-module 11031, which is used to transcribe the lyrics text in the first sample to obtain the phoneme sequence, and according to the phoneme The sequence aligns the singing audio pairs in the first sample to obtain the phoneme duration;
  • the sixth processing sub-module 11032 is used to extract the acoustic features of the singing audio in the first sample to obtain the real mel spectrum characteristics of the first sample.
  • the seventh processing sub-module 11033 is used to input the phoneme sequence into the initial text encoding sub-model to obtain the text feature vector of the first sample; the eighth processing sub-module 11034 is used to convert the first sample
  • the real mel spectrum features are input into the initial timbre encoding sub-model to obtain the timbre feature vector of the first sample;
  • the ninth processing sub-module 11035 is used to duration regularize the text feature vector and timbre feature vector according to the phoneme duration to obtain the frame level text feature vector and frame level timbre feature vector;
  • the tenth processing submodule 11036 is used to add the frame level text feature vector, frame level timbre feature vector, audio energy, and fundamental frequency trajectory and input them into the initial acoustic decoding sub-module model to obtain the predicted Mel spectrum characteristics of the first sample.
  • a training set is obtained.
  • the training set comes from multiple sampling users.
  • the training set includes multiple samples.
  • One sampling user corresponds to at least one sample.
  • Each sample includes: singing songs picked up when the sampling user sings a certain song. Audio and lyric text corresponding to the singing audio, obtain the pre-built initial neural network model, the initial neural network model includes initial weight parameters and loss function, obtain the first sample from the training set, and input the first sample into the initial neural network model , the real mel spectrum feature and the predicted mel spectrum feature are obtained.
  • the real mel spectrum feature represents the mel spectrum feature of the singing audio in the first sample
  • the predicted mel spectrum feature represents the mel spectrum predicted by the initial neural network model.
  • FIG. 13 illustrates a block diagram of an exemplary electronic device suitable for implementing embodiments of the present disclosure.
  • the electronic device 12 shown in FIG. 13 is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present disclosure.
  • electronic device 12 is embodied in the form of a general computing device.
  • Components of electronic device 12 may include, but are not limited to: one or more processors or processing units 16, system memory 28, and a bus 18 connecting various system components (including system memory 28 and processing unit 16).
  • Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a graphics accelerated port, a processor, or a local bus using any of a variety of bus structures.
  • these architectures include but are not limited to Industry Standard Architecture (hereinafter referred to as: ISA) bus, Micro Channel Architecture (Micro Channel Architecture; hereafter referred to as: MAC) bus, enhanced ISA bus, video electronics Standards Association (Video Electronics Standards Association; hereinafter referred to as: VESA) local bus and Peripheral Component Interconnection (hereinafter referred to as: PCI) bus.
  • ISA Industry Standard Architecture
  • MAC Micro Channel Architecture
  • VESA Video Electronics Standards Association
  • PCI Peripheral Component Interconnection
  • Electronic device 12 typically includes a variety of computer system readable media. These media may be any available media that can be accessed by electronic device 12, including volatile and nonvolatile media, removable and non-removable media.
  • the memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory; hereinafter referred to as: RAM) 30 and/or cache memory 32.
  • Electronic device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • storage system 34 may be used to read and write to non-removable, non-volatile magnetic media (not shown in Figure 13, commonly referred to as a "hard drive").
  • a disk drive for reading and writing a removable non-volatile disk e.g., a "floppy disk” and a removable non-volatile optical disk (e.g., a compact disk read-only memory)
  • a removable non-volatile disk e.g., a "floppy disk”
  • a removable non-volatile optical disk e.g., a compact disk read-only memory
  • CD-ROM Disc Read Only Memory
  • DVD-ROM Digital Video Disc Read Only Memory
  • each drive may be connected to bus 18 through one or more data media interfaces.
  • Memory 28 may include at least one program product having a set (eg, at least one) of program modules configured to perform the functions of embodiments of the present disclosure.
  • a program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in memory 28 , each of these examples or some combination may include the implementation of a network environment.
  • Program modules 42 generally perform functions and/or methods in the embodiments described in this disclosure.
  • Electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), may also communicate with one or more devices that enable human interaction with electronic device 12, and/or with Any device (eg, network card, modem, etc.) that enables the electronic device 12 to communicate with one or more other computing devices. This communication may occur through input/output (I/O) interface 22.
  • the electronic device 12 can also communicate with one or more networks (such as a local area network (Local Area Network; hereinafter referred to as: LAN), a wide area network (Wide Area Network; hereinafter referred to as: WAN)) and/or a public network, such as the Internet, through the network adapter 20 ) communication.
  • networks such as a local area network (Local Area Network; hereinafter referred to as: LAN), a wide area network (Wide Area Network; hereinafter referred to as: WAN)
  • a public network such as the Internet
  • network adapter 20 communicates with other modules of electronic device 12 via bus 18 .
  • other hardware and/or software modules may be used in conjunction with electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives And data backup storage system, etc.
  • the processing unit 16 executes programs stored in the system memory 28 to perform various functional applications and data processing, such as implementing the song generation method and the song generation model training method mentioned in the previous embodiments.
  • the present disclosure also proposes a non-transitory computer-readable storage medium on which a computer program is stored.
  • the program is executed by a processor, the song generation method and song generation method as proposed in the previous embodiments of the present disclosure are implemented. Model training method.
  • the present disclosure also proposes a computer program product.
  • the instruction processor in the computer program product is executed, the song generation method and the song generation model training method proposed in the previous embodiments of the present disclosure are executed.
  • the above embodiments it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • software it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer programs.
  • the computer program When the computer program is loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present disclosure are generated in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer program may be stored in or transferred from one computer-readable storage medium to another, for example, the computer program may be transferred from a website, computer, server, or data center Transmission to another website, computer, server or data center through wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.) means.
  • the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more available media integrated.
  • the usable media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., high-density digital video discs (DVD)), or semiconductor media (e.g., solid state disks, SSD)) etc.
  • magnetic media e.g., floppy disks, hard disks, magnetic tapes
  • optical media e.g., high-density digital video discs (DVD)
  • DVD digital video discs
  • semiconductor media e.g., solid state disks, SSD
  • At least one in the present disclosure can also be described as one or more, and the plurality can be two, three, four or more, and the present disclosure is not limited.
  • the technical feature is distinguished by “first”, “second”, “third”, “A”, “B”, “C” and “D” etc.
  • the technical features described in “first”, “second”, “third”, “A”, “B”, “C” and “D” are in no particular order or order.
  • each table in this disclosure can be configured or predefined.
  • the values of the information in each table are only examples and can be configured as other values, which is not limited by this disclosure.
  • it is not necessarily required to configure all the correspondences shown in each table.
  • the corresponding relationships shown in some rows may not be configured.
  • appropriate deformation adjustments can be made based on the above table, such as splitting, merging, etc.
  • the names of the parameters shown in the titles of the above tables may also be other names understandable by the communication device, and the values or expressions of the parameters may also be other values or expressions understandable by the communication device.
  • other data structures can also be used, such as arrays, queues, containers, stacks, linear lists, pointers, linked lists, trees, graphs, structures, classes, heaps, hash tables or hash tables. wait.
  • Predefinition in this disclosure may be understood as definition, pre-definition, storage, pre-storage, pre-negotiation, pre-configuration, solidification, or pre-burning.

Landscapes

  • Auxiliary Devices For Music (AREA)

Abstract

Provided in the present disclosure are a song generation method, an apparatus, an electronic device, and a storage medium, the method comprising: acquiring a voice audio input by a target user and a unique identification number of a target song; performing Mel spectrogram feature extraction on the voice audio to obtain a real Mel spectrogram feature of the target user; according to the unique identification number of the target song, acquiring a song template corresponding to the unique identification number; inputting the real Mel spectrogram feature of the target user and the song template into a preset song generation model to obtain a target Mel spectrogram feature output by the song generation model; and, according to the target Mel spectrogram feature, generating the target song. The present disclosure can effectively combine the real Mel spectrogram feature of the target user and the song template corresponding to the target song during the song generation process, so as to effectively lower the degree of dependence on the data volume of user voice data, thereby effectively improving a song generation effect while improving song generation convenience.

Description

歌曲生成方法、装置、电子设备和存储介质Song generation method, device, electronic device and storage medium 技术领域Technical field
本公开涉及计算机技术领域,具体涉及一种歌曲生成方法、装置、电子设备和存储介质。The present disclosure relates to the field of computer technology, and specifically to a song generation method, device, electronic device and storage medium.
背景技术Background technique
歌曲合成,是指基于歌词和乐谱生成对应的歌唱音频。对应的歌曲合成算法从最初的基于单元拼接的合成技术,发展为统计参数合成技术直至当前的基于深度学习的合成技术。歌曲合成技术可以让机器唱歌,进一步增加了人机交互的趣味性,因此具有较高的商业价值。Song synthesis refers to generating corresponding singing audio based on lyrics and musical scores. The corresponding song synthesis algorithm has developed from the initial synthesis technology based on unit splicing to statistical parameter synthesis technology to the current synthesis technology based on deep learning. Song synthesis technology can make machines sing, further increasing the fun of human-computer interaction, and therefore has high commercial value.
相关技术中,在进行歌曲合成时,通常对训练语料的数量以及质量要求较高,导致歌曲生成过程较为繁琐且无法保证歌曲生成效果。In related technologies, when performing song synthesis, the quantity and quality of training corpus are usually required to be relatively high, which makes the song generation process cumbersome and the song generation effect cannot be guaranteed.
发明内容Contents of the invention
本公开实施例提出一种歌曲生成方法、装置、电子设备和存储介质,可以应用于数据处理技术领域中,可以在歌曲生成过程中有效结合目标用户的真实梅尔谱特征和目标歌曲对应的歌曲模板,以有效降低对用户语音数据的数据量的依赖程度,从而在提升歌曲生成便捷性的同时,有效提升歌曲生成效果。The embodiments of the present disclosure propose a song generation method, device, electronic device and storage medium, which can be applied in the field of data processing technology and can effectively combine the real Mel spectrum characteristics of the target user and the songs corresponding to the target song during the song generation process. Templates to effectively reduce the dependence on the amount of user voice data, thereby effectively improving the song generation effect while improving the convenience of song generation.
第一方面,本公开实施例提供一种歌曲生成方法,包括:In a first aspect, an embodiment of the present disclosure provides a song generation method, including:
获取目标用户输入的语音音频和目标歌曲的唯一识别号;Obtain the voice audio input by the target user and the unique identification number of the target song;
对所述语音音频进行梅尔谱特征提取,得到所述目标用户的真实梅尔谱特征;Perform mel spectrum feature extraction on the speech audio to obtain the real mel spectrum features of the target user;
根据所述目标歌曲的唯一识别号获取与所述唯一识别号对应的歌曲模板;Obtain the song template corresponding to the unique identification number according to the unique identification number of the target song;
将所述目标用户的真实梅尔谱特征和所述歌曲模板输入至预设的歌曲生成模型中,得到所述歌曲生成模型输出的目标梅尔谱特征,其中,所述歌曲生成模型为使用训练集通过机器学习训练得到,所述训练集来自于多个采样用户,所述训练集包括多个样本,一个采样用户至少对应一个所述样本,每个所述样本包括:所述采样用户歌唱某一歌曲时所拾取的歌唱音频和与所述歌唱音频对应的歌词文本;Input the real mel spectrum features of the target user and the song template into the preset song generation model to obtain the target mel spectrum features output by the song generation model, wherein the song generation model is trained using The set is obtained through machine learning training. The training set comes from multiple sampling users. The training set includes multiple samples. One sampling user corresponds to at least one of the samples. Each of the samples includes: the sampling user sings a certain song. The singing audio picked up during a song and the lyric text corresponding to the singing audio;
根据所述目标梅尔谱特征生成目标歌曲。Generate a target song based on the target mel spectrum characteristics.
第二方面,本公开实施例提供一种歌曲生成模型的训练方法,包括:In a second aspect, embodiments of the present disclosure provide a training method for a song generation model, including:
获取训练集,所述训练集来自于多个采样用户,所述训练集包括多个样本,一个所述采样用户至少对应一个所述样本,每个所述样本包括:所述采样用户歌唱某一歌曲时所拾取的歌唱音频和与所述歌唱音频对应的歌词文本;Obtain a training set. The training set comes from multiple sampling users. The training set includes multiple samples. One sampling user corresponds to at least one of the samples. Each of the samples includes: the sampling user sings a certain song. The singing audio picked up during the song and the lyric text corresponding to the singing audio;
获取预先搭建的初始神经网络模型,所述初始神经网络模型包括初始权重参数和损失函数;Obtain a pre-built initial neural network model, which includes initial weight parameters and a loss function;
从所述训练集中获取首个样本,并将所述首个样本输入至所述初始神经网络模型中,得到真实梅尔谱特征和预测梅尔谱特征,所述真实梅尔谱特征表示所述首个样本中的歌唱音频的梅尔谱特征,所述预测梅尔谱特征表示所述初始神经网络模型所预测的梅尔谱特征;Obtain the first sample from the training set and input the first sample into the initial neural network model to obtain real mel spectrum features and predicted mel spectrum features. The real mel spectrum features represent the The Mel spectrum feature of the singing audio in the first sample, the predicted Mel spectrum feature represents the Mel spectrum feature predicted by the initial neural network model;
根据所述损失函数计算所述预测梅尔谱特征和所述真实梅尔谱特征之间的误差;Calculate the error between the predicted mel spectrum feature and the true mel spectrum feature according to the loss function;
根据所述误差对所述初始神经网络模型的初始权重参数进行调整,得到更新的神经网络模型;Adjust the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model;
从所述训练集中逐一获取后续样本,并将所述后续样本重复输入至最新的神经网络模型,直至所述损失函数收敛,得到训练完成的歌曲生成模型。Subsequent samples are obtained one by one from the training set, and the subsequent samples are repeatedly input into the latest neural network model until the loss function converges, and a trained song generation model is obtained.
第三方面,本公开实施例提出一种歌曲生成装置,包括:第一获取模块,用于获取目标用户输入的语音音频和目标歌曲的唯一识别号;In a third aspect, an embodiment of the present disclosure proposes a song generation device, including: a first acquisition module for acquiring the voice audio input by the target user and the unique identification number of the target song;
第一处理模块,用于对所述语音音频进行梅尔谱特征提取,得到所述目标用户的真实梅尔谱特征;The first processing module is used to extract mel spectrum features from the speech audio to obtain the real mel spectrum features of the target user;
第二获取模块,用于根据所述目标歌曲的唯一识别号获取与所述唯一识别号对应的歌曲模板;a second acquisition module, configured to acquire a song template corresponding to the unique identification number according to the unique identification number of the target song;
第二处理模块,用于将所述目标用户的真实梅尔谱特征和所述歌曲模板输入至预设的歌曲生成模型中,得到所述歌曲生成模型输出的目标梅尔谱特征,其中,所述歌曲生成模型为使用训练集通过机器学习训练得到,所述训练集来自于多个采样用户,所述训练集包括多个样本,一个采样用户至少对应一个所述样本,每个所述样本包括:所述采样用户歌唱某一歌曲时所拾取的歌唱音频和与所述歌唱音频对应的歌词文本;The second processing module is used to input the real mel spectrum features of the target user and the song template into the preset song generation model to obtain the target mel spectrum features output by the song generation model, wherein: The song generation model is obtained through machine learning training using a training set. The training set comes from multiple sampling users. The training set includes multiple samples. One sampling user corresponds to at least one of the samples. Each of the samples includes : The sampling audio picked up when the user sings a certain song and the lyric text corresponding to the singing audio;
生成模块,用于根据所述目标梅尔谱特征生成目标歌曲。A generating module, configured to generate a target song according to the target Mel spectrum characteristics.
第四方面,本公开实施例提供一种歌曲生成模型的训练装置,其特征在于,包括:In a fourth aspect, an embodiment of the present disclosure provides a training device for a song generation model, which is characterized by including:
第三获取模块,用于获取训练集,所述训练集来自于多个采样用户,所述训练集包括多个样本,一个所述采样用户至少对应一个所述样本,每个所述样本包括:所述采样用户歌唱某一歌曲时所拾取的歌唱音频和与所述歌唱音频对应的歌词文本;The third acquisition module is used to acquire a training set. The training set comes from multiple sampling users. The training set includes multiple samples. One sampling user corresponds to at least one of the samples. Each of the samples includes: The sampled singing audio picked up when the user sings a certain song and the lyric text corresponding to the singing audio;
第四获取模块,用于获取预先搭建的初始神经网络模型,所述初始神经网络模型包括初始权重参数和损失函数;The fourth acquisition module is used to acquire a pre-built initial neural network model, where the initial neural network model includes initial weight parameters and a loss function;
第五获取模块,用于从所述训练集中获取首个样本,并将所述首个样本输入至所述初始神经网络模型中,得到真实梅尔谱特征和预测梅尔谱特征,所述真实梅尔谱特征表示所述首个样本中的歌唱音频的梅尔谱特征,所述预测梅尔谱特征表示所述初始神经网络模型所预测的梅尔谱特征;The fifth acquisition module is used to acquire the first sample from the training set, and input the first sample into the initial neural network model to obtain real Mel spectrum features and predicted Mel spectrum features, the real The Mel spectrum feature represents the Mel spectrum feature of the singing audio in the first sample, and the predicted Mel spectrum feature represents the Mel spectrum feature predicted by the initial neural network model;
第三处理模块,用于根据所述损失函数计算所述预测梅尔谱特征和所述真实梅尔谱特征之间的误差;A third processing module, configured to calculate the error between the predicted Mel spectrum feature and the real Mel spectrum feature according to the loss function;
第四处理模块,用于根据所述误差对所述初始神经网络模型的初始权重参数进行调整,得到更新的神经网络模型;A fourth processing module, configured to adjust the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model;
第六获取模块,用于从所述训练集中逐一获取后续样本,并将所述后续样本重复输入至最新的神经网络模型,直至所述损失函数收敛,得到训练完成的歌曲生成模型。The sixth acquisition module is used to acquire subsequent samples one by one from the training set, and repeatedly input the subsequent samples into the latest neural network model until the loss function converges to obtain the trained song generation model.
第五方面,本公开实施例提供一种电子设备,包括:存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述程序时实现如本公开第一方面实施例提出的歌曲生成方法,或者实现如本公开第二方面实施例提出的歌曲生成模型的训练方法。In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the first aspect of the present disclosure. The song generation method proposed in the embodiment of one aspect, or the training method of the song generation model proposed in the embodiment of the second aspect of the present disclosure.
第六方面,本公开实施例提供一种非临时性计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现如本公开第一方面实施例提出的歌曲生成方法,或者实现如本公开第二方面实施例提出的歌曲生成模型的训练方法。In a sixth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the song generation method as proposed in the embodiment of the first aspect of the disclosure is implemented, or Implement the training method of the song generation model as proposed in the embodiment of the second aspect of the present disclosure.
第七方面,本公开实施例提供一种计算机程序产品,当所述计算机程序产品中的指令由处理器执行时,执行如本公开第一方面实施例提出的歌曲生成方法,或者执行如本公开第二方面实施例提出的歌曲生成模型的训练方法。In a seventh aspect, an embodiment of the present disclosure provides a computer program product. When the instructions in the computer program product are executed by a processor, the song generation method proposed in the embodiment of the first aspect of the present disclosure is executed, or the method of the present disclosure is executed. The second aspect embodiment proposes a training method for a song generation model.
综上所述,在本公开实施例提供的歌曲生成方法、装置、电子设备、存储介质、计算机程序及计算机程序产品,可以实现以下技术效果:To sum up, the song generation methods, devices, electronic devices, storage media, computer programs and computer program products provided in the embodiments of the present disclosure can achieve the following technical effects:
通过获取目标用户输入的语音音频和目标歌曲的唯一识别号,对语音音频进行梅尔谱特征提取,得到目标用户的真实梅尔谱特征,根据目标歌曲的唯一识别号获取与唯一识别号对应的歌曲模板,将目标 用户的真实梅尔谱特征和歌曲模板输入至预设的歌曲生成模型中,得到歌曲生成模型输出的目标梅尔谱特征,根据目标梅尔谱特征生成目标歌曲,可以在歌曲生成过程中有效结合目标用户的真实梅尔谱特征和目标歌曲对应的歌曲模板,以有效降低对用户语音数据的数据量的依赖程度,从而在提升歌曲生成便捷性的同时,有效提升歌曲生成效果。By obtaining the voice audio input by the target user and the unique identification number of the target song, perform Mel spectrum feature extraction on the voice audio to obtain the real Mel spectrum features of the target user, and obtain the corresponding unique identification number based on the unique identification number of the target song. Song template, input the target user's real mel spectrum features and song template into the preset song generation model, obtain the target mel spectrum features output by the song generation model, and generate the target song based on the target mel spectrum features, which can be used in the song During the generation process, the real Mel spectrum characteristics of the target user and the song template corresponding to the target song are effectively combined to effectively reduce the dependence on the amount of user voice data, thus effectively improving the song generation effect while improving the convenience of song generation. .
附图说明Description of the drawings
为了更清楚地说明本公开实施例或背景技术中的技术方案,下面将对本公开实施例或背景技术中所需要使用的附图进行说明。In order to more clearly illustrate the technical solutions in the embodiments of the disclosure or the background technology, the drawings required to be used in the embodiments or the background technology of the disclosure will be described below.
图1是本公开一实施例提出的歌曲生成方法的流程示意图;Figure 1 is a schematic flowchart of a song generation method proposed by an embodiment of the present disclosure;
图2是本公开另一实施例提出的歌曲生成方法的流程示意图;Figure 2 is a schematic flowchart of a song generation method proposed by another embodiment of the present disclosure;
图3是本公开实施例提出的一歌曲模板生成过程示意图;Figure 3 is a schematic diagram of a song template generation process proposed by an embodiment of the present disclosure;
图4是本公开另一实施例提出的歌曲生成方法的流程示意图;Figure 4 is a schematic flowchart of a song generation method proposed by another embodiment of the present disclosure;
图5是本公开实施例提出的一音色编码子模型结构示意图;Figure 5 is a schematic structural diagram of a timbre encoding sub-model proposed by an embodiment of the present disclosure;
图6是本公开实施例提出的一歌曲生成流程示意图;Figure 6 is a schematic diagram of a song generation process proposed by an embodiment of the present disclosure;
图7是本公开一实施例提出的歌曲生成模型的训练方法的流程示意图;Figure 7 is a schematic flowchart of a training method for a song generation model proposed by an embodiment of the present disclosure;
图8是本公开实施例提出的一初始神经网络模型的训练流程图;Figure 8 is a training flow chart of an initial neural network model proposed by an embodiment of the present disclosure;
图9是本公开一实施例提出的歌曲生成装置的结构示意图;Figure 9 is a schematic structural diagram of a song generation device according to an embodiment of the present disclosure;
图10是本公开另一实施例提出的歌曲生成装置的结构示意图;Figure 10 is a schematic structural diagram of a song generation device according to another embodiment of the present disclosure;
图11是本公开一实施例提出的歌曲生成模型的训练装置的结构示意图;Figure 11 is a schematic structural diagram of a training device for a song generation model proposed by an embodiment of the present disclosure;
图12是本公开另一实施例提出的歌曲生成模型的训练装置的结构示意图;Figure 12 is a schematic structural diagram of a training device for a song generation model proposed by another embodiment of the present disclosure;
图13示出了适于用来实现本公开实施方式的示例性电子设备的框图。13 illustrates a block diagram of an exemplary electronic device suitable for implementing embodiments of the present disclosure.
具体实施方式Detailed ways
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本公开实施例相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本公开实施例的一些方面相一致的装置和方法的例子。Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with embodiments of the present disclosure. Rather, they are merely examples of apparatus and methods consistent with aspects of embodiments of the present disclosure as detailed in the appended claims.
在本公开实施例使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本公开实施例。在本公开实施例和所附权利要求书中所使用的单数形式的“一种”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。The terminology used in the embodiments of the present disclosure is for the purpose of describing specific embodiments only and is not intended to limit the embodiments of the present disclosure. As used in the embodiments of the present disclosure and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It will also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
应当理解,尽管在本公开实施例可能采用术语第一、第二、第三等来描述各种信息,但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本公开实施例范围的情况下,第一信息也可以被称为第二信息,类似地,第二信息也可以被称为第一信息。取决于语境,如在此所使用的词语“如果”及“若”可以被解释成为“在……时”或“当……时”或“响应于确定”。It should be understood that although the terms first, second, third, etc. may be used to describe various information in the embodiments of the present disclosure, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of the embodiments of the present disclosure, the first information may also be called second information, and similarly, the second information may also be called first information. Depending on the context, the words "if" and "if" as used herein may be interpreted as "when" or "when" or "in response to determining."
为了便于理解,首先介绍本公开涉及的术语。For ease of understanding, terminology involved in this disclosure is first introduced.
1、梅尔谱1. Mel spectrum
梅尔谱,是语音深度学习过程中的常用特征。普通语谱图是线性的,而梅尔谱基于人类听觉的特性(对低频声音比较敏感,对高频声音的分辨能力较差),将普通语谱图的频率从线性转换为梅尔尺度,而梅尔尺度是一种对数尺度,人类对于频率的感知在梅尔尺度上更为敏感。Mel spectrum is a commonly used feature in the deep speech learning process. Ordinary spectrograms are linear, while Mel spectroscopy is based on the characteristics of human hearing (more sensitive to low-frequency sounds and poorer in resolving high-frequency sounds), converting the frequency of ordinary spectrograms from linear to Mel scale, The Mel scale is a logarithmic scale, and human perception of frequency is more sensitive on the Mel scale.
2、音素2. Phonemes
音素,是根据语音的自然属性划分出来的最小语音单位,可以依据音节里的发音动作进行分析,一个动作构成一个音素。音素分为元音与辅音两大类。Phoneme is the smallest unit of speech divided according to the natural properties of speech. It can be analyzed based on the pronunciation movements in syllables. One movement constitutes a phoneme. Phonemes are divided into two categories: vowels and consonants.
3、音色3. Tone
音色,是指不同声音表现在波形方面总是有与众不同的特性,不同的物体振动都有不同的特点。Timbre means that different sounds always have distinctive characteristics in terms of waveforms, and different objects vibrate with different characteristics.
图1是本公开一实施例提出的歌曲生成方法的流程示意图。Figure 1 is a schematic flowchart of a song generation method proposed by an embodiment of the present disclosure.
其中,需要说明的是,本实施例的歌曲生成方法的执行主体为歌曲生成装置,该装置可以由软件和/或硬件的方式实现,该装置可以配置在电子设备中,电子设备可以包括但不限于终端、服务器端等,如终端可为智能手机、智能电视、智能手表、智能汽车等。It should be noted that the execution subject of the song generation method in this embodiment is a song generation device. The device can be implemented by software and/or hardware. The device can be configured in an electronic device. The electronic device can include but not Limited to terminals, servers, etc. For example, terminals can be smartphones, smart TVs, smart watches, smart cars, etc.
如图1所示,该歌曲生成方法,可以包括但不限于如下步骤:As shown in Figure 1, the song generation method may include but is not limited to the following steps:
步骤S101:获取目标用户输入的语音音频和目标歌曲的唯一识别号。Step S101: Obtain the voice audio input by the target user and the unique identification number of the target song.
其中,目标用户,是指待使用该歌曲生成方法的用户。而语音音频,是指目标用户所输入的音频数据,该语音音频可以是目标用户的音频数据,可以是其他用户的音频数据,对此不做限制。而目标歌曲,是指该歌曲生成方法待生成的歌曲。唯一识别号,是指目标歌曲对应的标识信息,例如编号或名称。Among them, the target user refers to the user who wants to use the song generation method. The voice audio refers to the audio data input by the target user. The voice audio can be the audio data of the target user or the audio data of other users. There is no limit to this. The target song refers to the song to be generated by the song generation method. The unique identification number refers to the identification information corresponding to the target song, such as the number or name.
可以理解的是,目标歌曲的数量可能是多个,当获取目标歌曲的唯一识别号,可以实现在歌曲生成过程中对目标歌曲的准确定位。It is understandable that the number of target songs may be multiple. When the unique identification number of the target song is obtained, the target song can be accurately positioned during the song generation process.
本公开实施例中,在获取目标用户输入的语音音频时,可以是预先在本公开实施例的执行主体中配置音频获取装置,而后由音频获取装置获取目标用户的语音音频,或者,还可以预先在本公开实施例的执行主体中配置数据接口,经由该数据接口接收歌曲生成请求,而后从歌曲生成请求中解析得到语音音频,对此不做限制。In the embodiment of the present disclosure, when acquiring the voice audio input by the target user, the audio acquisition device may be configured in advance in the execution body of the embodiment of the present disclosure, and then the audio acquisition device acquires the voice audio of the target user, or the audio acquisition device may be configured in advance. A data interface is configured in the execution subject of the embodiment of the present disclosure, a song generation request is received through the data interface, and then the voice audio is obtained by parsing the song generation request, and there is no limit to this.
本公开实施例中,在获取目标歌曲的唯一识别号时,可以是采用关系表,该关系表中可以记载目标歌曲对应的唯一识别号,或者,还可以预先基于多个目标歌曲与对象唯一识别号之间的映射关系建立数据库,而后基于目标歌曲才从数据库中获取对应的唯一识别号,对此不做限制。In the embodiment of the present disclosure, when obtaining the unique identification number of the target song, a relationship table may be used. The relationship table may record the unique identification number corresponding to the target song, or the unique identification number of the target song may be uniquely identified based on multiple target songs in advance. The mapping relationship between the numbers establishes a database, and then the corresponding unique identification number is obtained from the database based on the target song. There is no restriction on this.
步骤S102:对语音音频进行梅尔谱特征提取,得到目标用户的真实梅尔谱特征。Step S102: Extract mel spectrum features from the speech audio to obtain the real mel spectrum features of the target user.
其中,梅尔谱,是指基于音频数据所提取得到的频谱图,该梅尔谱属于对数谱。而梅尔谱特征,是指梅尔谱对应的特征信息。可以理解的是,人耳听到的声音高低和实际(Hz)频率不呈线性关系,用梅尔谱特征更符合人耳的听觉特性。而真实梅尔谱特征,是指基于上述语音数据所获取的梅尔谱特征。Among them, the Mel spectrum refers to the spectrogram extracted based on the audio data, and the Mel spectrum is a logarithmic spectrum. The Mel spectrum feature refers to the feature information corresponding to the Mel spectrum. It is understandable that the sound level heard by the human ear does not have a linear relationship with the actual frequency (Hz), and the Mel spectrum feature is more in line with the auditory characteristics of the human ear. The real Mel spectrum features refer to the Mel spectrum features obtained based on the above speech data.
本公开实施例中,当对语音音频进行梅尔谱特征提取,得到目标用户的真实梅尔谱特征时,可以实现对语音音频的特征提取,从而为歌曲生成过程提供可靠的参考数据。In the embodiment of the present disclosure, when mel spectrum feature extraction is performed on the voice audio to obtain the real mel spectrum feature of the target user, feature extraction of the voice audio can be achieved, thereby providing reliable reference data for the song generation process.
步骤S103:根据目标歌曲的唯一识别号获取与唯一识别号对应的歌曲模板。Step S103: Obtain the song template corresponding to the unique identification number according to the unique identification number of the target song.
其中,歌曲模板,是指描述目标歌曲相关信息的模板。Among them, the song template refers to a template that describes information related to the target song.
可选的,一些实施例中,该歌曲模板可以包括歌词文本信息和歌曲旋律信息,歌词文本信息包括音素序列和音素时长,歌曲旋律信息包括歌曲音符序列和歌曲能量序列,由此,可以较大程度得丰富歌曲模板中的表征内容,从而为歌曲生成模型提供目标歌曲较为全面的参考信息,以有效提升歌曲模板在歌 曲生成过程中的适用性。Optionally, in some embodiments, the song template may include lyric text information and song melody information. The lyric text information includes phoneme sequences and phoneme durations. The song melody information includes song note sequences and song energy sequences. Therefore, it can be larger The representation content in the song template can be enriched to a certain extent, thereby providing the song generation model with more comprehensive reference information of the target song, thereby effectively improving the applicability of the song template in the song generation process.
其中,歌词文本信息,是指目标歌曲对应歌词的文本信息。歌曲旋律信息,可以被用于描述目标歌曲旋律对应的相关信息。音素,是指根据语音的自然属性所划分得到的最小语音单位。而音素序列,是指多个音素所组成的序列。而音素时长,是指音素所对应的时长信息。音符,是指被用于记录不同长短的音的进行符号。而歌曲音符序列,是指歌曲音频对应音符所组成的序列。Among them, the lyrics text information refers to the text information corresponding to the lyrics of the target song. Song melody information can be used to describe relevant information corresponding to the melody of the target song. Phoneme refers to the smallest phonetic unit divided according to the natural properties of speech. A phoneme sequence refers to a sequence composed of multiple phonemes. The phoneme duration refers to the duration information corresponding to the phoneme. Notes refer to progressive symbols used to record sounds of different lengths. The song note sequence refers to the sequence composed of the corresponding notes of the song audio.
其中,能量,可以是指歌曲音频中所包含的能量,例如声音强度,而歌曲能量序列,可以被用于描述不同时间点对应的歌曲音频对应的能量变化情况。Among them, energy can refer to the energy contained in the song audio, such as sound intensity, and the song energy sequence can be used to describe the energy changes corresponding to the song audio corresponding to different time points.
本公开实施例中,在根据目标歌曲的唯一识别号获取与唯一识别号对应的歌曲模板时,可以是预先获取多个歌曲模板,而后基于唯一识别号和多个歌曲模板进行匹配处理,以得到与唯一识别号对应的歌曲模板,或者,还可以由第三方检索装置根据目标歌曲的唯一识别号获取与唯一识别号对应的歌曲模板,对此不做限制。In the embodiment of the present disclosure, when obtaining the song template corresponding to the unique identification number according to the unique identification number of the target song, multiple song templates may be obtained in advance, and then matching processing is performed based on the unique identification number and the multiple song templates to obtain The song template corresponding to the unique identification number, or a third-party retrieval device can also obtain the song template corresponding to the unique identification number according to the unique identification number of the target song, and there is no limit to this.
步骤S104:将目标用户的真实梅尔谱特征和歌曲模板输入至预设的歌曲生成模型中,得到歌曲生成模型输出的目标梅尔谱特征,其中,歌曲生成模型为使用训练集通过机器学习训练得到,训练集来自于多个采样用户,训练集包括多个样本,一个采样用户至少对应一个样本,每个样本包括:采样用户歌唱某一歌曲时所拾取的歌唱音频和与歌唱音频对应的歌词文本。Step S104: Input the target user's real mel spectrum features and song template into the preset song generation model to obtain the target mel spectrum features output by the song generation model, where the song generation model is trained by machine learning using a training set It is obtained that the training set comes from multiple sampling users. The training set includes multiple samples. One sampling user corresponds to at least one sample. Each sample includes: the singing audio picked up when the sampling user sings a certain song and the lyrics corresponding to the singing audio. text.
其中,歌曲生成模型,是指被用于处理真实梅尔谱特征和歌曲模板,并输出目标梅尔谱特征的模型。该歌曲生成模型,可以是神经网络模型。而目标梅尔谱特征,是指由歌曲生成模型处理目标用户的真实梅尔谱特征和歌曲模板所得到的梅尔谱特征。训练集,是指歌曲生成模型在训练过程中所使用的样本集。Among them, the song generation model refers to a model used to process real mel spectrum features and song templates, and output target mel spectrum features. The song generation model may be a neural network model. The target mel spectrum feature refers to the mel spectrum feature obtained by processing the real mel spectrum feature of the target user and the song template by the song generation model. The training set refers to the sample set used by the song generation model during the training process.
其中,采样用户,是指为歌曲生成模型的训练过程提供样本的用户。而样本,可以是指被用于进行模型训练的歌唱音频和歌词文本。而歌唱音频,是指采样用户歌唱某一歌曲时所拾取的音频。Among them, sampling users refer to users who provide samples for the training process of the song generation model. The sample can refer to the singing audio and lyric text used for model training. Singing audio refers to sampling the audio picked up when the user sings a certain song.
可选的,一些实施例中,歌曲生成模型包括:音色编码子模型、文本编码子模型和声学解码子模型,歌曲生成模型为采用同一个训练集对音色编码子模型、文本编码子模型和声学解码子模型进行联合训练得到,由此,可以有效提升歌曲生成模型的结构合理性,当采用同一个训练集对音色编码子模型、文本编码子模型和声学解码子模型进行联合训练时,能够有效提升各个子模型之间的一致性,从而有效提升所得歌曲生成模型的输出准确性。Optionally, in some embodiments, the song generation model includes: timbre encoding sub-model, text encoding sub-model and acoustic decoding sub-model. The song generation model uses the same training set to combine the timbre encoding sub-model, text encoding sub-model and acoustic decoding sub-model. The decoding sub-model is obtained through joint training. This can effectively improve the structural rationality of the song generation model. When the same training set is used to jointly train the timbre encoding sub-model, text encoding sub-model and acoustic decoding sub-model, it can effectively improve the structural rationality of the song generation model. Improve the consistency between each sub-model, thereby effectively improving the output accuracy of the resulting song generation model.
其中,音色,是指不同声音表现在波形方面所表现出来的不同特性,不同的物体振动存在不同的特点,也即是说,不同用户的音色存在差异。而音色编码子模型,是指被用于处理真实梅尔谱特征,以得到目标用户音色特征向量的模型。Among them, timbre refers to the different characteristics of different sounds in terms of waveforms. Different objects vibrate with different characteristics. In other words, the timbre of different users is different. The timbre coding sub-model refers to a model used to process real mel spectrum features to obtain the target user's timbre feature vector.
其中,文本编码子模型,是指被用于处理音素序列,以得到目标歌曲对应文本特征向量的模型。Among them, the text encoding sub-model refers to the model used to process the phoneme sequence to obtain the text feature vector corresponding to the target song.
其中,声学解码子模型,是指被用于处理多个特征信息以得到目标梅尔谱特征的模型,该声学解码子模型可以是快速端到端且非自回归合成系统的解码器。Among them, the acoustic decoding sub-model refers to a model used to process multiple feature information to obtain target mel spectrum features. The acoustic decoding sub-model can be a decoder of a fast end-to-end and non-autoregressive synthesis system.
本公开实施例中,当将目标用户的真实梅尔谱特征和歌曲模板输入至预设的歌曲生成模型中,得到歌曲生成模型输出的目标梅尔谱特征时,可以基于歌曲生成模型快速、准确地融合真实梅尔谱特征和歌曲模板的相关信息,从而有效提升模型生成效率。In the embodiment of the present disclosure, when the real mel spectrum features and song templates of the target user are input into the preset song generation model, and the target mel spectrum features output by the song generation model are obtained, it can be quickly and accurately based on the song generation model. It effectively integrates real Mel spectrum features and relevant information of song templates, thereby effectively improving the efficiency of model generation.
步骤S105:根据目标梅尔谱特征生成目标歌曲。Step S105: Generate a target song based on the target mel spectrum characteristics.
本公开实施例在根据目标梅尔谱特征生成目标歌曲时,可以是将目标梅尔谱特征输入至声码器中,由声码器解析处理目标频谱特征,以得到目标歌曲。When generating a target song based on the target mel spectrum features in the embodiment of the present disclosure, the target mel spectrum features may be input into the vocoder, and the vocoder analyzes and processes the target spectrum features to obtain the target song.
举例而言,在根据目标梅尔谱特征生成目标歌曲时,可以是将目标梅尔谱特征输入到预设的声码器模型中,得到目标线性谱特征;将线性谱特征进行逆傅里叶变换,得到目标歌曲的音频数据。For example, when generating a target song based on the target mel spectrum features, the target mel spectrum features can be input into the preset vocoder model to obtain the target linear spectrum features; the linear spectrum features are subjected to inverse Fourier Transform to obtain the audio data of the target song.
其中,声码器模型为神经网络模型,该声码器模型也是使用与不同于歌曲生成模型的训练集通过机器学习训练得到。声码器模型可以基于生成对抗网络(Generative Adversial Networks,GAN)、无蒸馏的对抗生成网络等,训练集也可以采用现有技术中常用的训练集。Among them, the vocoder model is a neural network model, which is also trained through machine learning using a training set different from the song generation model. The vocoder model can be based on Generative Adversial Networks (GAN), adversarial generation network without distillation, etc. The training set can also be a training set commonly used in the existing technology.
该声码器模型的训练过程可以是将一个样本中的真实的梅尔谱特征输入到搭建好的初始模型中,得到预测的线性谱特征,通过损失函数计算预测的线性谱特征和样本中的真实的线性谱特征的误差,根据误差修改初始模型权重,如此往复输入样本,直至损失函数收敛,得到训练好的声码器模型。The training process of the vocoder model can be to input the real mel spectrum feature in a sample into the built initial model, obtain the predicted linear spectrum feature, and calculate the predicted linear spectrum feature and the sample's linear spectrum feature through the loss function. For the error of the real linear spectrum feature, the initial model weight is modified according to the error, and the samples are input in this way until the loss function converges, and the trained vocoder model is obtained.
本公开实施例中,通过获取目标用户输入的语音音频和目标歌曲的唯一识别号,对语音音频进行梅尔谱特征提取,得到目标用户的真实梅尔谱特征,根据目标歌曲的唯一识别号获取与唯一识别号对应的歌曲模板,将目标用户的真实梅尔谱特征和歌曲模板输入至预设的歌曲生成模型中,得到歌曲生成模型输出的目标梅尔谱特征,根据目标梅尔谱特征生成目标歌曲,可以在歌曲生成过程中有效结合目标用户的真实梅尔谱特征和目标歌曲对应的歌曲模板,以有效降低对用户语音数据的数据量的依赖程度,从而在提升歌曲生成便捷性的同时,有效提升歌曲生成效果。In the embodiment of the present disclosure, by obtaining the voice audio input by the target user and the unique identification number of the target song, Mel spectrum feature extraction is performed on the voice audio to obtain the real Mel spectrum feature of the target user, which is obtained according to the unique identification number of the target song. For the song template corresponding to the unique identification number, input the target user's real mel spectrum features and song template into the preset song generation model to obtain the target mel spectrum features output by the song generation model, and generate the target mel spectrum features based on the target mel spectrum features. The target song can effectively combine the target user's real mel spectrum characteristics and the song template corresponding to the target song during the song generation process to effectively reduce the dependence on the amount of user voice data, thereby improving the convenience of song generation. , effectively improving the song generation effect.
图2是本公开另一实施例提出的歌曲生成方法的流程示意图。Figure 2 is a schematic flowchart of a song generation method proposed by another embodiment of the present disclosure.
如图2所示,该歌曲生成方法,可以包括但不限于如下步骤:As shown in Figure 2, the song generation method may include but is not limited to the following steps:
步骤S201:获取目标用户输入的语音音频和目标歌曲的唯一识别号。Step S201: Obtain the voice audio input by the target user and the unique identification number of the target song.
步骤S202:对语音音频进行梅尔谱特征提取,得到目标用户的真实梅尔谱特征。Step S202: Extract mel spectrum features from the speech audio to obtain the real mel spectrum features of the target user.
步骤S203:根据目标歌曲的唯一识别号获取与唯一识别号对应的歌曲模板。Step S203: Obtain the song template corresponding to the unique identification number according to the unique identification number of the target song.
步骤S201-步骤S203的描述说明可以具体参见上述实施例,在此不再赘述。For descriptions of steps S201 to S203, specific reference may be made to the above embodiments, and details will not be described again here.
步骤S204:将目标用户的真实梅尔谱特征输入音色编码子模型,得到目标用户的音色特征向量。Step S204: Input the target user's real mel spectrum features into the timbre coding sub-model to obtain the target user's timbre feature vector.
其中,音色特征向量,是指被用于表征目标用户对应音色特征的向量。Among them, the timbre feature vector refers to the vector used to characterize the corresponding timbre characteristics of the target user.
步骤S205:将音素序列输入文本编码子模型,得到歌曲模板中歌词文本的文本特征向量。Step S205: Input the phoneme sequence into the text encoding sub-model to obtain the text feature vector of the lyrics text in the song template.
其中,歌词文本,是指歌曲模板中描述目标歌曲相应歌词信息的文本数据。Among them, the lyrics text refers to the text data in the song template that describes the corresponding lyrics information of the target song.
其中,文本特征向量,是指被用于表征歌词文本对应文本特征的向量。Among them, the text feature vector refers to the vector used to characterize the text features corresponding to the lyrics text.
可选的,一些实施例中,歌曲模板由目标歌曲的音素序列、音素时长、歌曲音符序列、歌曲能量序列,以及目标歌曲的唯一识别号配置得到,其中,目标歌曲的音素序列和音素时长由目标歌曲的歌曲音频和歌曲歌词确定,目标歌曲的歌曲音符序列和歌曲能量序列由歌曲音频确定,由此,可以基于唯一识别号实现对目标歌曲的快速定位,以有效提升所得歌曲模板的实用性,同时可以有效提升歌曲模板对目标歌词相关信息的表征准确性。Optionally, in some embodiments, the song template is configured by the phoneme sequence, phoneme duration, song note sequence, song energy sequence, and the unique identification number of the target song, where the phoneme sequence and phoneme duration of the target song are configured by The song audio and song lyrics of the target song are determined. The song note sequence and song energy sequence of the target song are determined by the song audio. Therefore, the target song can be quickly positioned based on the unique identification number to effectively improve the practicality of the obtained song template. , and at the same time, it can effectively improve the accuracy of the song template's representation of target lyrics-related information.
其中,歌曲音频,是指目标歌曲对应的演唱音频。而歌曲歌词,是指目标歌曲对应的歌词信息。Among them, the song audio refers to the singing audio corresponding to the target song. The song lyrics refer to the lyric information corresponding to the target song.
可选的,一些实施例中,音素序列包括:解析歌曲歌词得到的多个音素,音素时长包括:每个音素在歌曲音频中所占据的第一帧数,由此,可以有效提升所得音素序列与歌曲歌词之间的适配性,同时有效提升所得第一帧数对相应音素的准确性。Optionally, in some embodiments, the phoneme sequence includes: multiple phonemes obtained by parsing song lyrics, and the phoneme duration includes: the number of first frames each phoneme occupies in the song audio. Thus, the obtained phoneme sequence can be effectively improved The compatibility with the lyrics of the song, while effectively improving the accuracy of the obtained first frame number for the corresponding phoneme.
其中,第一帧数,是指音素在歌曲音频中对应的视频帧的数量。Among them, the first frame number refers to the number of video frames corresponding to the phoneme in the song audio.
Optionally, in some embodiments, the song energy sequence is obtained by quantizing the song energy features of the song audio, and the song note sequence is obtained by quantizing the song fundamental frequency features of the song audio. Quantization effectively improves the clarity with which the resulting song energy sequence and song note sequence represent the song energy features and the song fundamental frequency features, and the quantized song energy sequence and song note sequence provide reliable reference data for subsequent computation.

Here, the song energy features can be used to describe characteristics related to the energy of the song, and the song fundamental frequency features can be used to describe characteristics related to the fundamental frequency of the song.

Optionally, in some embodiments, the song energy features include multiple energy values, the song energy sequence is formed from multiple range encoding values, and each range encoding value is obtained by one-hot encoding the energy range corresponding to an energy value. The one-hot encoding process effectively expands the song energy features so that the multiple energy values can be distinguished by the resulting range encoding values; when the song energy sequence is formed from the multiple range encoding values, the sequence represents the song energy features more effectively.

Here, an energy value may refer to a numerical value of the song energy, and an energy range refers to the value range in which an energy value falls, such as 0-10.

One-hot encoding, also called one-bit-effective encoding, uses an N-bit status register to encode N states; each state has its own independent register bit, and at any given time only one bit is valid.

Here, a range encoding value refers to the encoded value obtained by one-hot encoding an energy range.
For example, to encode six states, let the natural sequence codes of the six states be: 000, 001, 010, 011, 100, 101.

The corresponding one-hot encodings can then be configured as: 000001, 000010, 000100, 001000, 010000, 100000.
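For illustration only, a minimal sketch of one-hot encoding energy values by the energy range they fall into might look as follows; the equal-width bands over the 0-10 range are an assumption:

```python
import numpy as np

def one_hot_energy(energy_values, num_bands=10, max_energy=10.0):
    """One-hot encode each energy value by the energy range it falls into.

    Assumes energy values in [0, max_energy] split into num_bands equal ranges.
    """
    codes = np.zeros((len(energy_values), num_bands), dtype=np.float32)
    for i, e in enumerate(energy_values):
        band = min(int(e / max_energy * num_bands), num_bands - 1)
        codes[i, band] = 1.0  # only one bit is valid at any given time
    return codes

# one_hot_energy([0.3, 9.8]) -> two rows with a single 1, in bands 0 and 9
```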
Optionally, in some embodiments, the song fundamental frequency features include multiple fundamental frequency values, and the song note sequence includes a note number corresponding to each fundamental frequency value. The song note sequence can thus effectively exploit the correspondence between fundamental frequency values and note numbers to adapt to personalized application scenarios, effectively improving the applicability of the resulting song note sequence in the song generation process.

Here, a fundamental frequency value refers to a numerical value of the fundamental frequency of the song, and a note number is the number corresponding to a musical note, which can be obtained from a relevant database in the music field.
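As a sketch only: the fundamental-frequency-to-note-number mapping quoted later in this disclosure (261.63 Hz to 60, 277.18 Hz to 61) matches the standard MIDI convention, so the quantization might be implemented as follows; the rounding choice is an assumption:

```python
import math

def f0_to_note(f0_hz: float) -> int:
    """Quantize a fundamental frequency value to a MIDI-style note number.

    69 is the note number of A4 = 440 Hz; there are 12 semitones per octave.
    """
    return round(69 + 12 * math.log2(f0_hz / 440.0))

# f0_to_note(261.63) -> 60, f0_to_note(277.18) -> 61
```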
For example, as shown in Figure 3, which is a schematic diagram of a song template generation process proposed by an embodiment of the present disclosure, the initial data of the song template may include the song audio and song lyrics corresponding to the target song, and the song template generation process may include: (1) processing the song lyrics with a text transcription method to obtain the phoneme sequence corresponding to the target song; (2) applying forced alignment to the resulting phoneme sequence and the song audio to obtain the phoneme durations of the target song, where the forced alignment may be followed by manual calibration to improve the accuracy of the resulting phoneme durations; (3) processing the song audio with an acoustic feature extraction method to obtain the song energy features and song fundamental frequency features corresponding to the target song, and then shifting the energy trajectory and the fundamental frequency trajectory to change the energy and pitch values of the song, improving the flexibility of the song template; (4) quantizing the song energy features and the song fundamental frequency features to obtain the song energy sequence and the song note sequence; (5) generating the song template from the phoneme sequence, the phoneme durations, the song energy sequence, and the song note sequence; and (6) after generating the song template, generating a unique identification number of the target song for the song template, so that the song template can be retrieved based on the unique identification number during song generation.
Step S206: Perform duration regularization on the text feature vector and the timbre feature vector according to the phoneme durations to obtain a frame-level text feature vector and a frame-level timbre feature vector.

Here, the frame-level text feature vector is a vector describing the text features corresponding to multiple audio frames, and the frame-level timbre feature vector is a vector describing the timbre features corresponding to multiple audio frames.

It can be understood that the same phoneme may span multiple audio frames, and the audio frames corresponding to the same phoneme are highly similar. When the text feature vector and the timbre feature vector are duration-regularized according to the phoneme durations to obtain the frame-level text feature vector and the frame-level timbre feature vector, the phoneme-level text feature vector and timbre feature vector can be converted to frame-level vectors by copying, which facilitates the subsequent addition of the frame-level text feature vector and the frame-level timbre feature vector.
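A minimal sketch of this copy-based duration regularization, assuming phoneme-level features of shape (num_phonemes, dim) and per-phoneme durations given as frame counts:

```python
import numpy as np

def duration_regularize(phoneme_feats, durations):
    """Expand phoneme-level features to frame level by copying.

    phoneme_feats: (num_phonemes, dim) array, one row per phoneme.
    durations:     per-phoneme frame counts (the first frame numbers).
    Returns a (total_frames, dim) frame-level feature matrix.
    """
    return np.repeat(phoneme_feats, durations, axis=0)

# A single timbre feature vector of shape (dim,) is copied to every frame:
# np.repeat(timbre_vec[None, :], total_frames, axis=0)
```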
Step S207: Add the frame-level text feature vector, the frame-level timbre feature vector, and the song melody information, and input the result into the acoustic decoding sub-model to obtain the target mel spectrum features.

Here, addition refers to element-wise addition over dimensions. Assuming the frame-level text feature vector, the frame-level timbre feature vector, the song note sequence, and the song energy sequence are all 10-dimensional, addition means summing the values in corresponding dimensions.
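For illustration, with hypothetical 10-dimensional frame-level features this is a plain element-wise sum:

```python
import numpy as np

dim = 10  # assumed common dimension of the four feature streams
text_f, timbre_f = np.random.rand(dim), np.random.rand(dim)
notes, energy = np.random.rand(dim), np.random.rand(dim)

decoder_input = text_f + timbre_f + notes + energy  # values summed per dimension
```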
That is to say, in the embodiments of the present disclosure, after the song template corresponding to the unique identification number is obtained according to the unique identification number of the target song, the real mel spectrum features of the target user can be input into the timbre encoding sub-model to obtain the timbre feature vector of the target user; the phoneme sequence is input into the text encoding sub-model to obtain the text feature vector of the lyric text in the song template; the text feature vector and the timbre feature vector are duration-regularized according to the phoneme durations to obtain the frame-level text feature vector and the frame-level timbre feature vector; and the frame-level text feature vector, the frame-level timbre feature vector, and the song melody information are added and input into the acoustic decoding sub-model to obtain the target mel spectrum features. Feature extraction on the real mel spectrum features and the phoneme sequence can thus be performed quickly based on the timbre encoding sub-model and the text encoding sub-model, with the corresponding timbre features and text features quantified in vector form; duration-regularizing the text feature vector and the timbre feature vector based on the phoneme durations then effectively improves the consistency between the resulting frame-level text feature vector and frame-level timbre feature vector, effectively improving how well the acoustic decoding sub-model processes them.
Step S208: Generate the target song according to the target mel spectrum features.

For the description of step S208, reference may be made to the foregoing embodiments, and details are not repeated here.
In this embodiment, the real mel spectrum features of the target user are input into the timbre encoding sub-model to obtain the timbre feature vector of the target user; the phoneme sequence is input into the text encoding sub-model to obtain the text feature vector of the lyric text in the song template; the text feature vector and the timbre feature vector are duration-regularized according to the phoneme durations to obtain the frame-level text feature vector and the frame-level timbre feature vector; and the frame-level text feature vector, the frame-level timbre feature vector, and the song melody information are added and input into the acoustic decoding sub-model to obtain the target mel spectrum features. Feature extraction on the real mel spectrum features and the phoneme sequence can thus be performed quickly based on the timbre encoding sub-model and the text encoding sub-model, with the corresponding timbre features and text features quantified in vector form, and duration regularization based on the phoneme durations effectively improves the consistency between the resulting frame-level text and timbre feature vectors, thereby effectively improving how well the acoustic decoding sub-model processes them.
Figure 4 is a schematic flowchart of a song generation method proposed by another embodiment of the present disclosure.

As shown in Figure 4, the song generation method may include, but is not limited to, the following steps:

Step S401: Obtain the voice audio input by the target user and the unique identification number of the target song.

Step S402: Perform mel spectrum feature extraction on the voice audio to obtain the real mel spectrum features of the target user.

Step S403: Obtain the song template corresponding to the unique identification number according to the unique identification number of the target song.

For descriptions of steps S401 to S403, reference may be made to the foregoing embodiments, and details are not repeated here.
Step S404: Input the real mel spectrum features of the target user into a reference encoder to obtain a timbre latent space distribution vector of the target user.

Here, the reference encoder is an encoder used to process the real mel spectrum features to obtain the timbre latent space distribution vector; the timbre latent space distribution vector output by the reference encoder can be regarded as a hidden-layer variable corresponding to the real mel spectrum features.

It can be understood that the timbre latent space distribution vector follows a spherical Gaussian distribution. In the embodiments of the present disclosure, when outputting the timbre latent space distribution vector of the target user, the reference encoder can also output the mean and variance of the spherical Gaussian distribution.
Step S405: Input the timbre latent space distribution vector into an autoregressive encoder to obtain a timbre distribution vector of the target user, where the timbre distribution vector is obtained by the autoregressive encoder sampling from the timbre latent space distribution vector.

Here, the autoregressive encoder is an encoder used to process the timbre latent space distribution vector to obtain the timbre distribution vector.

It can be understood that the reference encoder and the autoregressive encoder may be structured as multiple linear layers or convolutional layers, which is not limited here.

Step S406: Use the timbre distribution vector as the timbre feature vector of the target user.
That is to say, the timbre encoding sub-model may include a reference encoder and an autoregressive encoder. After the song template corresponding to the unique identification number is obtained according to the unique identification number of the target song, the real mel spectrum features of the target user can be input into the reference encoder to obtain the timbre latent space distribution vector of the target user, and the timbre latent space distribution vector can be input into the autoregressive encoder to obtain the timbre distribution vector of the target user, where the timbre distribution vector is obtained by the autoregressive encoder sampling from the timbre latent space distribution vector; the timbre distribution vector is then used as the timbre feature vector of the target user. This effectively reduces redundant information in the resulting timbre feature vector while converting the relatively complex real mel spectrum features into vector form, effectively improving the practicality of the resulting timbre feature vector.
For example, as shown in Figure 5, which is a schematic structural diagram of a timbre encoding sub-model proposed by an embodiment of the present disclosure, the random sampling point ε is a random sampling point of a Gaussian distribution, which can be expressed as ε ~ N(0, I).

After receiving the real mel spectrum features, the timbre encoding sub-model can process them through the reference encoder to obtain the timbre latent space distribution vector h and two parameters, which serve respectively as the mean a1 and the variance b1 of a Gaussian distribution. The random sampling point ε is combined with the mean a1 and the variance b1 to obtain a random sampling point z of the approximate posterior distribution (z = b1 ⊙ ε + a1, where ⊙ denotes element-wise multiplication), and the random sampling point z and the timbre latent space distribution vector h are processed by the autoregressive encoder to obtain a mean a2 and a variance b2 corresponding to the random sampling point z; the timbre feature vector s is then obtained from the random sampling point z, the mean a2, and the variance b2 as s = b2 ⊙ z + a2.
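A minimal sketch of this two-step sampling, assuming the reference encoder and autoregressive encoder are given as callables returning (h, a1, b1) and (a2, b2) respectively; the interfaces are placeholders, not the disclosed implementation:

```python
import torch

def sample_timbre_vector(ref_encoder, ar_encoder, mel):
    """IAF-style sampling of a timbre feature vector from real mel features.

    ref_encoder(mel) -> (h, a1, b1): latent vector plus Gaussian mean/std.
    ar_encoder(z, h) -> (a2, b2):    mean/std conditioned on z and h.
    """
    h, a1, b1 = ref_encoder(mel)
    eps = torch.randn_like(a1)   # random sampling point eps ~ N(0, I)
    z = b1 * eps + a1            # z = b1 (.) eps + a1, element-wise
    a2, b2 = ar_encoder(z, h)
    s = b2 * z + a2              # timbre feature vector s = b2 (.) z + a2
    return s
```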
It can be understood that the timbre encoding sub-model may follow a sampling process based on an inverse autoregressive flow (IAF), which belongs to the family of normalizing flows. Normalizing flows produce distributions that are easy to sample: through a series of invertible transformations, a normalizing flow can convert a complex input distribution into a tractable probability distribution, and the output distribution is usually chosen to be an isotropic unit Gaussian, i.e., a spherical unit Gaussian, allowing smooth interpolation and efficient sampling. By learning the timbre feature vector with an inverse autoregressive flow, the generated timbre latent space distribution vector h follows a spherical Gaussian distribution, so the timbre feature vector can be obtained by sampling from this distribution, and a more accurate vector distribution can be learned even for users not processed before. In both the training and inference stages, samples are drawn from the spherical Gaussian distribution to represent the timbre feature vector, which ensures consistency between training and inference and better suits users outside the training set. At the same time, sampling the user's timbre latent space distribution vector h rather than averaging it further increases the convergence of the user space, allowing smoother interpolation between timbre feature vectors; that is, the timbre feature vector of a user outside the training set can be learned from a single sentence of that user's audio.
Step S407: Input the phoneme sequence into the text encoding sub-model to obtain the text feature vector of the lyric text in the song template.

For the description of step S407, reference may be made to the foregoing embodiments, and details are not repeated here.
Step S408: From the text feature vector, determine an initial text encoding corresponding to each phoneme in the phoneme sequence.

Here, an initial text encoding refers to a text encoding contained in the text feature vector.

In the embodiments of the present disclosure, determining from the text feature vector the initial text encoding corresponding to each phoneme in the phoneme sequence provides reliable reference data for the subsequent determination of the target text encodings.
Step S409: Determine the first frame number corresponding to each phoneme according to the phoneme durations.

Here, the first frame number refers to the number of audio frames corresponding to each phoneme. For example, the phoneme duration of a phoneme may be 25 ms; if an audio frame is set to 5 ms, the phoneme corresponds to 5 frames of information.
Step S410: Copy the initial text encoding, and splice the first-frame-number of copies of the initial text encoding to obtain a target text encoding.

Here, a target text encoding refers to the text encoding obtained by splicing the first frame number of copies of the initial text encoding.

It can be understood that the phoneme duration of a phoneme may be short, so there may be considerable redundant information among the multiple audio frames corresponding to the same phoneme; copying the initial text encoding and splicing the copied first-frame-number of initial text encodings into the target text encoding effectively improves the practicality of the resulting target text encoding.

Step S411: Form the frame-level text feature vector from the multiple target text encodings.

That is to say, in the embodiments of the present disclosure, after the phoneme sequence is input into the text encoding sub-model to obtain the text feature vector of the lyric text in the song template, the initial text encoding corresponding to each phoneme in the phoneme sequence can be determined from the text feature vector; the first frame number corresponding to each phoneme is determined according to the phoneme durations; the initial text encoding is copied and the copied first-frame-number of initial text encodings are spliced to obtain the target text encoding; and the frame-level text feature vector is formed from the multiple target text encodings. Since the time range corresponding to each phoneme is small and the content represented by different audio frames of the same phoneme is highly similar, obtaining the target text encoding by copying and splicing greatly reduces the computational cost, effectively improving the efficiency of determining the frame-level text feature vector.
Step S412: Determine a second frame number of the voice audio according to the phoneme durations.

Here, the second frame number refers to the number of frames of the voice audio determined based on the phoneme durations.

Step S413: Copy the timbre feature vector, and splice the second-frame-number of copies of the timbre feature vector to obtain the frame-level timbre feature vector.

That is to say, in the embodiments of the present disclosure, after the frame-level text feature vector is formed from the multiple target text encodings, the second frame number of the voice audio can be determined according to the phoneme durations, and the timbre feature vector can be copied and the copied second-frame-number of timbre feature vectors spliced to obtain the frame-level timbre feature vector. The resulting frame-level timbre feature vector can effectively represent the relevant feature information of the speech to be processed at the granularity of audio frames, effectively improving the fit between the resulting frame-level timbre feature vector and the frame-level text feature vector while effectively improving the representation effect of the resulting frame-level timbre feature vector.
Step S414: Add the frame-level text feature vector, the frame-level timbre feature vector, and the song melody information, and input the result into the acoustic decoding sub-model to obtain the target mel spectrum features.

Step S415: Generate the target song according to the target mel spectrum features.

For descriptions of steps S414 to S415, reference may be made to the foregoing embodiments, and details are not repeated here.
For example, as shown in Figure 6, which is a schematic diagram of a song generation flow proposed by an embodiment of the present disclosure, after a new user provides a piece of voice audio, the operation flow of the song generation model may include: (1) processing the voice audio with an acoustic feature extraction method to obtain the real mel spectrum features; (2) inputting the real mel spectrum features into the timbre encoding sub-model to obtain the timbre feature vector; (3) inputting the phoneme sequence in the song template into the text encoding sub-model to obtain the text feature vector of the target song; (4) inputting the phoneme durations, the timbre feature vector, and the text feature vector into the duration regularization sub-module to obtain the frame-level text feature vector and the frame-level timbre feature vector; (5) adding the frame-level text feature vector, the frame-level timbre feature vector, the song note sequence, and the song energy sequence, and inputting the result into the acoustic decoding sub-model to obtain the target mel spectrum features; and (6) inputting the resulting target spectrum features into a vocoder to obtain the target song. The vocoder may be a neural network vocoder.

That is to say, in the song generation process of the embodiments of the present disclosure, multiple users can share the pre-trained song generation model, and a song performed in a user's voice can be obtained from a single piece of that user's audio, effectively improving the convenience of the song generation process while reducing computing resources and storage costs.
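A minimal end-to-end sketch of this inference flow, assuming the sub-models and template fields are available under hypothetical names (none of these interfaces are disclosed APIs):

```python
import numpy as np

def generate_song(voice_audio, template, models):
    """Sketch of the Figure 6 flow; every callable here is an assumption."""
    mel = models.extract_mel(voice_audio)                 # (1) acoustic features
    timbre = models.timbre_encoder(mel)                   # (2) timbre feature vector
    text = models.text_encoder(template.phonemes)         # (3) text feature vectors
    text_f = np.repeat(text, template.durations, axis=0)  # (4) duration regularization
    timbre_f = np.repeat(timbre[None, :], len(text_f), axis=0)
    dec_in = text_f + timbre_f + template.note_sequence + template.energy_sequence  # (5)
    target_mel = models.acoustic_decoder(dec_in)
    return models.vocoder(target_mel)                     # (6) waveform of target song
```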
In this embodiment, the real mel spectrum features of the target user are input into the reference encoder to obtain the timbre latent space distribution vector of the target user, and the timbre latent space distribution vector is input into the autoregressive encoder to obtain the timbre distribution vector of the target user, where the timbre distribution vector is obtained by the autoregressive encoder sampling from the timbre latent space distribution vector; the timbre distribution vector is used as the timbre feature vector of the target user, which effectively reduces redundant information in the resulting timbre feature vector while converting the relatively complex real mel spectrum features into vector form, effectively improving the practicality of the resulting timbre feature vector. From the text feature vector, the initial text encoding corresponding to each phoneme in the phoneme sequence is determined; the first frame number corresponding to each phoneme is determined according to the phoneme durations; the initial text encoding is copied and the copied first-frame-number of initial text encodings are spliced to obtain the target text encoding; and the frame-level text feature vector is formed from the multiple target text encodings. Since the time range corresponding to each phoneme is small and the content represented by different audio frames of the same phoneme is highly similar, obtaining the target text encoding by copying and splicing greatly reduces the computational cost and effectively improves the efficiency of determining the frame-level text feature vector. The second frame number of the voice audio is determined according to the phoneme durations, and the timbre feature vector is copied and the copied second-frame-number of timbre feature vectors are spliced to obtain the frame-level timbre feature vector, so the resulting frame-level timbre feature vector can effectively represent the relevant feature information of the speech to be processed at the granularity of audio frames, effectively improving the fit between the frame-level timbre feature vector and the frame-level text feature vector while improving the representation effect of the resulting frame-level timbre feature vector.
Figure 7 is a schematic flowchart of a training method for a song generation model proposed by an embodiment of the present disclosure.

It should be noted that the execution subject of the training method for the song generation model in this embodiment is a training apparatus for the song generation model. The apparatus may be implemented in software and/or hardware and may be configured in an electronic device. The electronic device may include, but is not limited to, a terminal, a server, and the like; for example, the terminal may be a smartphone, a smart TV, a smart watch, a smart car, or the like.

As shown in Figure 7, the training method for the song generation model may include, but is not limited to, the following steps:
Step S701: Obtain a training set. The training set comes from multiple sampling users and includes multiple samples; one sampling user corresponds to at least one sample, and each sample includes the singing audio picked up when the sampling user sings a certain song and the lyric text corresponding to the singing audio.

In the embodiments of the present disclosure, the training set may be obtained by establishing in advance a communication link between the execution subject of the embodiments of the present disclosure and a big data server and then obtaining the training set from the big data server, or the training set may be obtained from multiple sampling users via a sample collection apparatus, which is not limited here.
Step S702: Obtain a pre-built initial neural network model, where the initial neural network model includes initial weight parameters and a loss function.

Here, a neural network model is a complex network system formed by a large number of simple processing units (called neurons) that are widely interconnected, reflecting many basic characteristics of human brain function. The initial neural network model refers to the neural network model to be trained. The initial weight parameters refer to the weight parameters to be iteratively updated during model training. The loss function can be used to describe the error between the predicted mel spectrum features output by the initial neural network model during training and the real mel spectrum features.

In the embodiments of the present disclosure, model performance can be evaluated in real time during training based on the loss function, so as to judge in a timely manner whether the model has converged.
Step S703: Obtain the first sample from the training set, and input the first sample into the initial neural network model to obtain real mel spectrum features and predicted mel spectrum features, where the real mel spectrum features represent the mel spectrum features of the singing audio in the first sample, and the predicted mel spectrum features represent the mel spectrum features predicted by the initial neural network model.

Here, the first sample refers to the first of the multiple samples in the training set to be used for model training.

In the embodiments of the present disclosure, when obtaining the first sample from the training set, a sample may be taken at random from the training set as the first sample, or the first sample may be obtained from the training set based on the numbering information of the multiple samples in the training set, which is not limited here.
Optionally, in some embodiments, when the first sample is input into the initial neural network model to obtain the real mel spectrum features and the predicted mel spectrum features, the lyric text in the first sample may be text-transcribed to obtain a phoneme sequence, and the singing audio in the first sample may be aligned according to the phoneme sequence to obtain phoneme durations; acoustic feature extraction is performed on the singing audio in the first sample to obtain the real mel spectrum features, the audio energy, and the fundamental frequency trajectory of the first sample; the phoneme sequence is input into the initial text encoding sub-model to obtain the text feature vector of the first sample; the real mel spectrum features of the first sample are input into the initial timbre encoding sub-model to obtain the timbre feature vector of the first sample; the text feature vector and the timbre feature vector are duration-regularized according to the phoneme durations to obtain the frame-level text feature vector and the frame-level timbre feature vector; and the frame-level text feature vector, the frame-level timbre feature vector, the audio energy, and the fundamental frequency trajectory are added and input into the initial acoustic decoding sub-model to obtain the predicted mel spectrum features of the first sample. In this way, different methods can be used during model training to extract multiple features from the lyric text and the singing audio, and the resulting features can be converted into vector form, which quantifies the features while making it convenient to add the resulting feature vectors for feature fusion, effectively improving the accuracy with which the resulting predicted mel spectrum features describe the sample features.

Here, the initial text encoding sub-model refers to the text encoding sub-model to be trained, the initial timbre encoding sub-model refers to the timbre encoding sub-model to be trained, and the initial acoustic decoding sub-model refers to the acoustic decoding sub-model to be trained.

Here, the audio energy refers to the energy information corresponding to the singing audio in the first sample.

Here, the fundamental frequency trajectory refers to the trajectory information of the fundamental frequency of the singing audio in the first sample.
Step S704: Calculate the error between the predicted mel spectrum features and the real mel spectrum features according to the loss function.

Here, the error can be used to describe the difference between the predicted mel spectrum features and the real mel spectrum features.

In the embodiments of the present disclosure, calculating the error between the predicted mel spectrum features and the real mel spectrum features according to the loss function allows the output accuracy of the initial neural network model to be evaluated in real time to determine the model performance, and the resulting error provides reliable reference data for determining the direction of model optimization.
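The disclosure does not fix a particular form for the loss; as a sketch only, a common choice for mel spectrum regression is an L1 (or L2) distance over frames and mel bins:

```python
import torch
import torch.nn.functional as F

def mel_loss(pred_mel: torch.Tensor, real_mel: torch.Tensor) -> torch.Tensor:
    """Error between predicted and real mel spectrum features.

    Both tensors have shape (frames, n_mels); L1 is one common choice here.
    """
    return F.l1_loss(pred_mel, real_mel)
```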
Step S705: Adjust the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model.

In the embodiments of the present disclosure, adjusting the initial weight parameters of the initial neural network model according to the error allows the initial weight parameters to be adjusted accurately based on the error, effectively improving the training effect of the neural network model.

Step S706: Obtain subsequent samples one by one from the training set, and repeatedly input the subsequent samples into the latest neural network model until the loss function converges, obtaining the trained song generation model.

Here, the subsequent samples refer to the samples in the training set other than the first sample.
For example, as shown in Figure 8, which is a training flowchart of an initial neural network model proposed by an embodiment of the present disclosure, the initial neural network model may include an initial timbre encoding sub-model, an initial text encoding sub-model, and an initial acoustic decoding sub-model, and the training flow may include: (1) the song lyrics are text-transcribed to obtain the corresponding phoneme sequence, and the resulting phoneme sequence is processed by the initial text encoding sub-model to obtain the corresponding text feature vector; (2) the text feature vector and the song phoneme durations are processed based on forced alignment to obtain the initial text encodings; (3) the song audio is processed based on acoustic feature extraction to obtain the real mel spectrum features, the song energy features, and the song fundamental frequency features; (4) the real mel spectrum features are processed by the initial timbre encoding sub-model to obtain the timbre feature vector; (5) the multiple energy values in the song audio energy features are divided into different energy bands (for example, energy values in the range 0-10 may be divided into 10 or 20 energy bands depending on the application environment), and the song energy features are processed based on the one-hot encoding method to obtain the song energy sequence; (6) based on relevant data in the music field, the song fundamental frequency features are quantized to obtain the song note sequence, for example, the note number corresponding to the fundamental frequency 261.63 Hz is 60, and the note number corresponding to the fundamental frequency 277.18 Hz is 61; (7) the duration regularization method is used to process the initial text encodings based on the song phoneme durations to obtain the frame-level text feature vector; (8) the duration regularization sub-model processes the timbre feature vector based on the song phoneme durations to obtain the frame-level timbre feature vector; (9) the frame-level text feature vector, the frame-level timbre feature vector, the song energy sequence, and the song note sequence are input into the acoustic decoding sub-model to obtain the predicted mel spectrum features; and (10) the loss function of the song generation model is determined based on the real mel spectrum features and the predicted spectrum features. With this loss function, each weight parameter in the song generation model can be iteratively updated by gradient backpropagation so that the loss function tends to converge.
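A minimal sketch of the sample-by-sample update loop described above; the model interface, the L1 loss, the optimizer choice, and the convergence test are all assumptions:

```python
import torch
import torch.nn.functional as F

def train_song_model(model, samples, lr=1e-4, tol=1e-4):
    """Iterate over samples, backpropagating until the loss converges (sketch)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    prev_loss = float("inf")
    for sample in samples:  # the first sample, then subsequent samples one by one
        real_mel, pred_mel = model(sample)       # assumed forward interface
        loss = F.l1_loss(pred_mel, real_mel)     # error per the loss function
        opt.zero_grad()
        loss.backward()                          # gradient backpropagation
        opt.step()                               # update the weight parameters
        if abs(prev_loss - loss.item()) < tol:   # crude convergence check
            break
        prev_loss = loss.item()
    return model
```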
In the embodiments of the present disclosure, a training set is obtained, where the training set comes from multiple sampling users and includes multiple samples, one sampling user corresponds to at least one sample, and each sample includes the singing audio picked up when the sampling user sings a certain song and the lyric text corresponding to the singing audio; a pre-built initial neural network model including initial weight parameters and a loss function is obtained; the first sample is obtained from the training set and input into the initial neural network model to obtain real mel spectrum features and predicted mel spectrum features, where the real mel spectrum features represent the mel spectrum features of the singing audio in the first sample and the predicted mel spectrum features represent the mel spectrum features predicted by the initial neural network model; the error between the predicted mel spectrum features and the real mel spectrum features is calculated according to the loss function; the initial weight parameters of the initial neural network model are adjusted according to the error to obtain an updated neural network model; and subsequent samples are obtained one by one from the training set and repeatedly input into the latest neural network model until the loss function converges, obtaining the trained song generation model. The error between the predicted mel spectrum features output by the model and the real mel spectrum features can thus be determined in real time based on the loss function during model training, providing a reliable basis for judging model convergence and effectively improving the output accuracy of the song generation model.
Figure 9 is a schematic structural diagram of a song generation apparatus proposed by an embodiment of the present disclosure.

As shown in Figure 9, the song generation apparatus 90 includes:

a first acquisition module 901, configured to obtain the voice audio input by the target user and the unique identification number of the target song;

a first processing module 902, configured to perform mel spectrum feature extraction on the voice audio to obtain the real mel spectrum features of the target user;

a second acquisition module 903, configured to obtain the song template corresponding to the unique identification number according to the unique identification number of the target song;

a second processing module 904, configured to input the real mel spectrum features of the target user and the song template into a preset song generation model to obtain the target mel spectrum features output by the song generation model, where the song generation model is obtained through machine learning training using a training set, the training set comes from multiple sampling users and includes multiple samples, one sampling user corresponds to at least one sample, and each sample includes the singing audio picked up when the sampling user sings a certain song and the lyric text corresponding to the singing audio; and

a generation module 905, configured to generate the target song according to the target mel spectrum features.
In some embodiments of the present disclosure, as shown in Figure 10, which is a schematic structural diagram of a song generation apparatus proposed by another embodiment of the present disclosure, the song generation model includes a timbre encoding sub-model, a text encoding sub-model, and an acoustic decoding sub-model; the song generation model is obtained by jointly training the timbre encoding sub-model, the text encoding sub-model, and the acoustic decoding sub-model with the same training set.

In some embodiments of the present disclosure, the song template includes lyric text information and song melody information; the lyric text information includes the phoneme sequence and the phoneme durations; and the song melody information includes the song note sequence and the song energy sequence.

In some embodiments of the present disclosure, the second processing module 904 includes: a first processing sub-module 9041, configured to input the real mel spectrum features of the target user into the timbre encoding sub-model to obtain the timbre feature vector of the target user; a second processing sub-module 9042, configured to input the phoneme sequence into the text encoding sub-model to obtain the text feature vector of the lyric text in the song template; a third processing sub-module 9043, configured to perform duration regularization on the text feature vector and the timbre feature vector according to the phoneme durations to obtain the frame-level text feature vector and the frame-level timbre feature vector; and a fourth processing sub-module 9044, configured to add the frame-level text feature vector, the frame-level timbre feature vector, and the song melody information, and input the result into the acoustic decoding sub-model to obtain the target mel spectrum features.
In some embodiments of the present disclosure, the first processing sub-module 9041 is specifically configured to: input the real mel spectrum features of the target user into the reference encoder to obtain the timbre latent space distribution vector of the target user; input the timbre latent space distribution vector into the autoregressive encoder to obtain the timbre distribution vector of the target user, where the timbre distribution vector is obtained by the autoregressive encoder sampling from the timbre latent space distribution vector; and use the timbre distribution vector as the timbre feature vector of the target user.

In some embodiments of the present disclosure, the third processing sub-module 9043 is specifically configured to: determine, from the text feature vector, the initial text encoding corresponding to each phoneme in the phoneme sequence; determine the first frame number corresponding to each phoneme according to the phoneme durations; copy the initial text encoding and splice the copied first-frame-number of initial text encodings to obtain the target text encoding; and form the frame-level text feature vector from the multiple target text encodings.

In some embodiments of the present disclosure, the third processing sub-module 9043 is further configured to: determine the second frame number of the voice audio according to the phoneme durations; and copy the timbre feature vector and splice the copied second-frame-number of timbre feature vectors to obtain the frame-level timbre feature vector. In some embodiments of the present disclosure, the song template is configured from the phoneme sequence, phoneme durations, song note sequence, and song energy sequence of the target song, together with the unique identification number of the target song, where the phoneme sequence and phoneme durations of the target song are determined from the song audio and song lyrics of the target song, and the song note sequence and song energy sequence of the target song are determined from the song audio.
In some embodiments of the present disclosure, the phoneme sequence includes multiple phonemes obtained by parsing the song lyrics, and the phoneme durations include the first frame number occupied by each phoneme in the song audio.

In some embodiments of the present disclosure, the song energy sequence is obtained by quantizing the song energy features of the song audio, and the song note sequence is obtained by quantizing the song fundamental frequency features of the song audio.

In some embodiments of the present disclosure, the song energy features include multiple energy values; the song energy sequence is formed from multiple range encoding values, and each range encoding value is obtained by one-hot encoding the energy range corresponding to an energy value.

In some embodiments of the present disclosure, the song fundamental frequency features include multiple fundamental frequency values, and the song note sequence includes a note number corresponding to each fundamental frequency value.
It should be noted that the foregoing explanation of the song generation method also applies to the song generation apparatus of this embodiment, and details are not repeated here.

In this embodiment, the voice audio input by the target user and the unique identification number of the target song are obtained; mel spectrum feature extraction is performed on the voice audio to obtain the real mel spectrum features of the target user; the song template corresponding to the unique identification number is obtained according to the unique identification number of the target song; the real mel spectrum features of the target user and the song template are input into a preset song generation model to obtain the target mel spectrum features output by the song generation model; and the target song is generated according to the target mel spectrum features. The real mel spectrum features of the target user and the song template corresponding to the target song can thus be effectively combined in the song generation process to effectively reduce the dependence on the amount of user voice data, improving the convenience of song generation while effectively improving the song generation effect.
Figure 11 is a schematic structural diagram of a training apparatus for a song generation model proposed by an embodiment of the present disclosure.

As shown in Figure 11, the training apparatus 110 for the song generation model includes: a third acquisition module 1101, configured to obtain a training set, where the training set comes from multiple sampling users and includes multiple samples, one sampling user corresponds to at least one sample, and each sample includes the singing audio picked up when the sampling user sings a certain song and the lyric text corresponding to the singing audio; a fourth acquisition module 1102, configured to obtain a pre-built initial neural network model, where the initial neural network model includes initial weight parameters and a loss function; a fifth acquisition module 1103, configured to obtain the first sample from the training set and input the first sample into the initial neural network model to obtain real mel spectrum features and predicted mel spectrum features, where the real mel spectrum features represent the mel spectrum features of the singing audio in the first sample and the predicted mel spectrum features represent the mel spectrum features predicted by the initial neural network model; a third processing module 1104, configured to calculate the error between the predicted mel spectrum features and the real mel spectrum features according to the loss function; a fourth processing module 1105, configured to adjust the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model; and a sixth acquisition module 1106, configured to obtain subsequent samples one by one from the training set and repeatedly input the subsequent samples into the latest neural network model until the loss function converges, obtaining the trained song generation model.
在本公开的一些实施例中,如图12所示,图12是本公开另一实施例提出的歌曲生成模型的训练装置的结构示意图,其中,初始神经网络模型包括:初始音色编码子模型、初始文本编码子模型,以及初始声学解码子模型;第五获取模块1103,包括:第五处理子模块11031,用于对首个样本中的歌词文本进行文本转写,得到音素序列,并根据音素序列对首个样本中的歌唱音频对进行对齐,得到音素时长;第六处理子模块11032,用于对首个样本中的歌唱音频进行声学特征提取,得到首个样本的真实梅尔谱特征、音频能量和基频轨迹;第七处理子模块11033,用于将音素序列输入至初始文本编码子模型中,得到首个样本的文本特征向量;第八处理子模块11034,用于将首个样本的真实梅尔谱特征输入至初始音色编码子模型中,得到首个样本的音色特征向量;第九处理子模块11035,用于根据音素时长对文本特征向量和音色特征向量进行时长规整,得到帧级文本特征向量和帧级音色特征向量;第十处理子模块11036,用于将帧级文本特征向量、帧级音色特征向量、音频能量,以及基频轨迹进行相加后输入至初始声学解码子模型,得到首个样本的预测梅尔谱特征。In some embodiments of the present disclosure, as shown in Figure 12, which is a schematic structural diagram of a training device for a song generation model proposed by another embodiment of the present disclosure, the initial neural network model includes: an initial timbre encoding sub-model, Initial text encoding sub-model, and initial acoustic decoding sub-model; the fifth acquisition module 1103 includes: the fifth processing sub-module 11031, which is used to transcribe the lyrics text in the first sample to obtain the phoneme sequence, and according to the phoneme The sequence aligns the singing audio pairs in the first sample to obtain the phoneme duration; the sixth processing sub-module 11032 is used to extract the acoustic features of the singing audio in the first sample to obtain the real mel spectrum characteristics of the first sample. Audio energy and fundamental frequency trajectory; the seventh processing sub-module 11033 is used to input the phoneme sequence into the initial text encoding sub-model to obtain the text feature vector of the first sample; the eighth processing sub-module 11034 is used to convert the first sample The real mel spectrum features are input into the initial timbre encoding sub-model to obtain the timbre feature vector of the first sample; the ninth processing sub-module 11035 is used to duration regularize the text feature vector and timbre feature vector according to the phoneme duration to obtain the frame level text feature vector and frame level timbre feature vector; the tenth processing submodule 11036 is used to add the frame level text feature vector, frame level timbre feature vector, audio energy, and fundamental frequency trajectory and input them into the initial acoustic decoding sub-module model to obtain the predicted Mel spectrum characteristics of the first sample.
It should be noted that the foregoing explanation of the training method for the song generation model also applies to the training apparatus for the song generation model in this embodiment, and is not repeated here.
In this embodiment, a training set is acquired, where the training set comes from multiple sampling users and includes multiple samples, each sampling user corresponds to at least one sample, and each sample includes the singing audio picked up while the sampling user sings a certain song and the lyric text corresponding to that singing audio; a pre-built initial neural network model including initial weight parameters and a loss function is acquired; the first sample is obtained from the training set and input into the initial neural network model to obtain a real Mel spectrum feature, representing the Mel spectrum feature of the singing audio in the first sample, and a predicted Mel spectrum feature, representing the Mel spectrum feature predicted by the initial neural network model; the error between the predicted and real Mel spectrum features is calculated according to the loss function; the initial weight parameters are adjusted according to the error to obtain an updated neural network model; and subsequent samples are obtained one by one from the training set and repeatedly input into the latest neural network model until the loss function converges, yielding the trained song generation model. In this way, the error between the model's predicted Mel spectrum feature and the real Mel spectrum feature can be determined in real time during training based on the loss function, providing a reliable basis for judging model convergence and effectively improving the output accuracy of the song generation model.
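The loop below is a minimal sketch of this procedure, reusing the hypothetical InitialSongModel above. The Adam optimizer, L1 loss, and stagnant-loss convergence test are illustrative choices: the embodiment only requires that the error between predicted and real Mel spectrum features drive the weight updates until the loss function converges.

```python
import torch

def train_song_model(model, training_set, lr=1e-4, tol=1e-4, max_epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.L1Loss()        # assumed loss between predicted and real Mel
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for sample in training_set:      # samples obtained one by one from the set
            pred_mel = model(sample["phonemes"], sample["durations"],
                             sample["real_mel"], sample["energy"], sample["f0"])
            loss = criterion(pred_mel, sample["real_mel"])  # error vs. real Mel feature
            optimizer.zero_grad()
            loss.backward()              # adjust weights according to the error
            optimizer.step()
            epoch_loss += loss.item()
        if abs(prev_loss - epoch_loss) < tol:  # treat a stagnant loss as convergence
            break
        prev_loss = epoch_loss
    return model
```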
Figure 13 shows a block diagram of an exemplary electronic device suitable for implementing embodiments of the present disclosure. The electronic device 12 shown in Figure 13 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in Figure 13, the electronic device 12 takes the form of a general-purpose computing device. The components of the electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the various system components (including the system memory 28 and the processing unit 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The electronic device 12 typically includes a variety of computer-system-readable media. These media may be any available media accessible by the electronic device 12, including volatile and non-volatile media, and removable and non-removable media.
The memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, a storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Figure 13, commonly referred to as a "hard drive").
Although not shown in Figure 13, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk") may be provided, as well as an optical disc drive for reading from and writing to a removable non-volatile optical disc (e.g., a Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disc Read-Only Memory (DVD-ROM), or other optical media). In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the present disclosure.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present disclosure.
The electronic device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the electronic device 12, and/or with any device (e.g., a network card, a modem, etc.) that enables the electronic device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, the electronic device 12 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown, the network adapter 20 communicates with the other modules of the electronic device 12 through the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes programs stored in the system memory 28 so as to perform various functional applications and data processing, for example, implementing the song generation method and the training method for the song generation model mentioned in the foregoing embodiments.
To implement the above embodiments, the present disclosure further provides a non-transitory computer-readable storage medium having a computer program stored thereon, where the program, when executed by a processor, implements the song generation method and the training method for the song generation model proposed in the foregoing embodiments of the present disclosure.
To implement the above embodiments, the present disclosure further provides a computer program product; when the instructions in the computer program product are executed by a processor, the song generation method and the training method for the song generation model proposed in the foregoing embodiments of the present disclosure are performed.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs. When the computer program is loaded and executed on a computer, the processes or functions described in the embodiments of the present disclosure are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer program may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer program may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., digital video discs (DVD)), or semiconductor media (e.g., solid state disks (SSD)), and the like.
Those of ordinary skill in the art will understand that the various ordinal terms such as "first" and "second" used in the present disclosure are merely for convenience of description; they are not intended to limit the scope of the embodiments of the present disclosure, nor do they indicate any particular order.
"At least one" in the present disclosure may also be described as one or more, and "multiple" may be two, three, four, or more, which is not limited by the present disclosure. In the embodiments of the present disclosure, a technical feature is distinguished from others of its kind by terms such as "first", "second", "third", "A", "B", "C", and "D"; the technical features described by these terms carry no order of precedence or magnitude.
The correspondences shown in the tables of the present disclosure may be configured or predefined. The values of the information in the tables are merely examples and may be configured to other values, which is not limited by the present disclosure. When configuring the correspondence between information and parameters, it is not necessarily required that all the correspondences illustrated in the tables be configured. For example, the correspondences shown in some rows of the tables in the present disclosure may not be configured. As another example, appropriate adjustments such as splitting and merging may be made on the basis of the above tables. The names of the parameters shown in the headers of the above tables may also be other names understandable by a communication apparatus, and the values or representations of the parameters may also be other values or representations understandable by a communication apparatus. When the above tables are implemented, other data structures may also be used, such as arrays, queues, containers, stacks, linear lists, pointers, linked lists, trees, graphs, structures, classes, heaps, or hash tables.
"Predefined" in the present disclosure may be understood as defined, pre-defined, stored, pre-stored, pre-negotiated, pre-configured, solidified, or pre-burned.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present disclosure.
Those skilled in the art can clearly understand that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the systems, apparatuses, and units described above, which are not repeated here.
The above are merely specific embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed by the present disclosure, and all such changes or substitutions shall fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (31)

  1. A song generation method, comprising:
    acquiring voice audio input by a target user and a unique identification number of a target song;
    performing Mel spectrum feature extraction on the voice audio to obtain a real Mel spectrum feature of the target user;
    acquiring, according to the unique identification number of the target song, a song template corresponding to the unique identification number;
    inputting the real Mel spectrum feature of the target user and the song template into a preset song generation model to obtain a target Mel spectrum feature output by the song generation model, wherein the song generation model is obtained through machine learning training using a training set, the training set comes from multiple sampling users and includes multiple samples, each sampling user corresponds to at least one of the samples, and each of the samples includes: singing audio picked up while the sampling user sings a certain song and lyric text corresponding to the singing audio; and
    generating the target song according to the target Mel spectrum feature.
  2. The method according to claim 1, wherein the song generation model comprises a timbre encoding sub-model, a text encoding sub-model, and an acoustic decoding sub-model, and the song generation model is obtained by jointly training the timbre encoding sub-model, the text encoding sub-model, and the acoustic decoding sub-model using the same training set.
  3. The method according to claim 2, wherein:
    the song template comprises lyric text information and song melody information;
    the lyric text information comprises a phoneme sequence and phoneme durations; and
    the song melody information comprises a song note sequence and a song energy sequence.
  4. The method according to claim 3, wherein inputting the real Mel spectrum feature of the target user and the song template into the preset song generation model to obtain the target Mel spectrum feature output by the song generation model comprises:
    inputting the real Mel spectrum feature of the target user into the timbre encoding sub-model to obtain a timbre feature vector of the target user;
    inputting the phoneme sequence into the text encoding sub-model to obtain a text feature vector of the lyric text in the song template;
    performing duration regularization on the text feature vector and the timbre feature vector according to the phoneme durations to obtain a frame-level text feature vector and a frame-level timbre feature vector; and
    summing the frame-level text feature vector, the frame-level timbre feature vector, and the song melody information, and inputting the result into the acoustic decoding sub-model to obtain the target Mel spectrum feature.
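For illustration only and not part of the claims: a minimal sketch of the inference pipeline of claim 4, assuming the hypothetical sub-module attributes of the training sketch in the description above and a song template carrying phonemes, durations, notes, and energy as tensors.

```python
import torch

@torch.no_grad()
def generate_target_mel(model, user_mel, template):
    # user_mel: the target user's real Mel spectrum feature, shape (T_user, n_mels).
    text_vec = model.text_encoder(template["phonemes"])       # phoneme-level text features
    _, h = model.timbre_encoder(user_mel.unsqueeze(0))        # timbre from the user's voice
    timbre_vec = h[-1].squeeze(0)
    # Duration regularization to frame level (detailed in claims 6 and 7 below).
    frame_text = torch.repeat_interleave(text_vec, template["durations"], dim=0)
    frame_timbre = timbre_vec.expand_as(frame_text)
    # Song melody information: note sequence plus energy sequence from the template.
    melody = template["notes"].unsqueeze(-1) + template["energy"].unsqueeze(-1)
    return model.decoder(frame_text + frame_timbre + melody)  # target Mel feature
```

A vocoder would then convert the returned Mel feature into the target song waveform, matching the final step of claim 1.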
  5. The method according to claim 4, wherein the timbre encoding sub-model comprises a reference encoder and an autoregressive encoder, and inputting the real Mel spectrum feature of the target user into the timbre encoding sub-model to obtain the timbre feature vector of the target user comprises:
    inputting the real Mel spectrum feature of the target user into the reference encoder to obtain a timbre latent space distribution vector of the target user;
    inputting the timbre latent space distribution vector into the autoregressive encoder to obtain a timbre distribution vector of the target user, wherein the timbre distribution vector is obtained by the autoregressive encoder sampling the timbre latent space distribution vector; and
    using the timbre distribution vector as the timbre feature vector of the target user.
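For illustration only and not part of the claims: claim 5's two-stage timbre encoding could be sketched as below. The Gaussian parameterization and the simple reparameterized sampling are assumptions introduced here; they stand in for the reference encoder's latent distribution and for the autoregressive encoder's sampling step, neither of which the claim specifies in detail.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Assumed reference encoder: Mel feature -> latent timbre distribution."""
    def __init__(self, n_mels=80, d_latent=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True)
        self.mu = nn.Linear(256, d_latent)       # mean of the latent distribution
        self.logvar = nn.Linear(256, d_latent)   # log-variance of the distribution

    def forward(self, mel):                      # mel: (1, T, n_mels)
        _, h = self.rnn(mel)
        return self.mu(h[-1]), self.logvar(h[-1])

def sample_timbre(mu, logvar):
    # Stands in for the autoregressive encoder sampling the latent distribution
    # to produce the timbre distribution vector used as the timbre feature vector.
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
```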
  6. The method according to claim 4, wherein performing duration regularization on the text feature vector according to the phoneme durations to obtain the frame-level text feature vector comprises:
    determining, from the text feature vector, an initial text encoding corresponding to each phoneme in the phoneme sequence;
    determining, according to the phoneme durations, a first frame count corresponding to the phoneme;
    copying the initial text encoding, and concatenating the copied initial text encodings of the first frame count to obtain a target text encoding; and
    forming the frame-level text feature vector from multiple target text encodings.
  7. The method according to claim 4, wherein performing duration regularization on the timbre feature vector according to the phoneme durations to obtain the frame-level timbre feature vector comprises:
    determining a second frame count of the voice audio according to the phoneme durations; and
    copying the timbre feature vector, and concatenating the copied timbre feature vectors of the second frame count to obtain the frame-level timbre feature vector.
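For illustration only and not part of the claims: a minimal sketch of the duration regularization recited in claims 6 and 7, where each phoneme's initial text encoding is copied for its first frame count and the copies are concatenated, while the single timbre feature vector is copied once per frame of the second frame count. Tensor shapes are assumptions.

```python
import torch

def regulate_text(text_vec, durations):
    # text_vec: (P, D), one initial text encoding per phoneme;
    # durations: (P,) first frame counts per phoneme.
    target_codes = [code.repeat(int(n), 1) for code, n in zip(text_vec, durations)]
    return torch.cat(target_codes, dim=0)        # frame-level text feature vector (T, D)

def regulate_timbre(timbre_vec, durations):
    # timbre_vec: (D,), a single utterance-level timbre feature vector.
    total_frames = int(durations.sum())          # second frame count of the voice audio
    return timbre_vec.repeat(total_frames, 1)    # frame-level timbre feature vector (T, D)
```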
  8. The method according to claim 3, wherein the song template is configured from the phoneme sequence, the phoneme durations, the song note sequence, and the song energy sequence of the target song, together with the unique identification number of the target song, wherein the phoneme sequence and the phoneme durations of the target song are determined from the song audio and song lyrics of the target song, and the song note sequence and the song energy sequence of the target song are determined from the song audio.
  9. The method according to claim 8, wherein the phoneme sequence comprises multiple phonemes obtained by parsing the song lyrics, and the phoneme durations comprise the first frame count occupied by each of the phonemes in the song audio.
  10. The method according to claim 8, wherein the song energy sequence is obtained by quantizing a song energy feature of the song audio, and the song note sequence is obtained by quantizing a song fundamental frequency feature of the song audio.
  11. The method according to claim 10, wherein the song energy feature comprises multiple energy values, and the song energy sequence is formed from multiple range encoding values, each range encoding value being obtained by one-hot encoding the energy range corresponding to an energy value.
  12. The method according to claim 10, wherein the song fundamental frequency feature comprises multiple fundamental frequency values, and the song note sequence comprises a note symbol corresponding to each of the fundamental frequency values.
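For illustration only and not part of the claims: a sketch of the quantization recited in claims 10 to 12. The 256 equal-width energy ranges and the MIDI-style note formula are assumptions introduced here; the disclosure fixes neither the bin layout nor the note numbering.

```python
import numpy as np

def energy_to_onehot(energy, n_ranges=256):
    # Map each frame energy value to an energy range, then one-hot encode the range.
    edges = np.linspace(energy.min(), energy.max(), n_ranges + 1)
    idx = np.clip(np.digitize(energy, edges) - 1, 0, n_ranges - 1)
    return np.eye(n_ranges, dtype=np.float32)[idx]   # song energy sequence (T, n_ranges)

def f0_to_notes(f0_hz):
    # Map each fundamental frequency value to a note symbol; 0 marks unvoiced frames.
    notes = np.zeros(f0_hz.shape, dtype=np.int64)
    voiced = f0_hz > 0
    notes[voiced] = np.round(69 + 12 * np.log2(f0_hz[voiced] / 440.0)).astype(np.int64)
    return notes                                     # song note sequence (T,)
```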
  13. A training method for a song generation model, comprising:
    acquiring a training set, wherein the training set comes from multiple sampling users and includes multiple samples, each sampling user corresponds to at least one of the samples, and each of the samples includes: singing audio picked up while the sampling user sings a certain song and lyric text corresponding to the singing audio;
    acquiring a pre-built initial neural network model, wherein the initial neural network model includes initial weight parameters and a loss function;
    obtaining a first sample from the training set, and inputting the first sample into the initial neural network model to obtain a real Mel spectrum feature and a predicted Mel spectrum feature, wherein the real Mel spectrum feature represents the Mel spectrum feature of the singing audio in the first sample, and the predicted Mel spectrum feature represents the Mel spectrum feature predicted by the initial neural network model;
    calculating an error between the predicted Mel spectrum feature and the real Mel spectrum feature according to the loss function;
    adjusting the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model; and
    obtaining subsequent samples one by one from the training set, and repeatedly inputting the subsequent samples into the latest neural network model until the loss function converges, to obtain a trained song generation model.
  14. The method according to claim 13, wherein the initial neural network model comprises an initial timbre encoding sub-model, an initial text encoding sub-model, and an initial acoustic decoding sub-model, and inputting the first sample into the initial neural network model to obtain the real Mel spectrum feature and the predicted Mel spectrum feature comprises:
    transcribing the lyric text in the first sample to obtain a phoneme sequence, and aligning the singing audio in the first sample against the phoneme sequence to obtain phoneme durations;
    performing acoustic feature extraction on the singing audio in the first sample to obtain the real Mel spectrum feature, audio energy, and fundamental frequency trajectory of the first sample;
    inputting the phoneme sequence into the initial text encoding sub-model to obtain a text feature vector of the first sample;
    inputting the real Mel spectrum feature of the first sample into the initial timbre encoding sub-model to obtain a timbre feature vector of the first sample;
    performing duration regularization on the text feature vector and the timbre feature vector according to the phoneme durations to obtain a frame-level text feature vector and a frame-level timbre feature vector; and
    summing the frame-level text feature vector, the frame-level timbre feature vector, the audio energy, and the fundamental frequency trajectory, and inputting the result into the initial acoustic decoding sub-model to obtain the predicted Mel spectrum feature of the first sample.
  15. A song generation apparatus, comprising:
    a first acquisition module, configured to acquire voice audio input by a target user and a unique identification number of a target song;
    a first processing module, configured to perform Mel spectrum feature extraction on the voice audio to obtain a real Mel spectrum feature of the target user;
    a second acquisition module, configured to acquire, according to the unique identification number of the target song, a song template corresponding to the unique identification number;
    a second processing module, configured to input the real Mel spectrum feature of the target user and the song template into a preset song generation model to obtain a target Mel spectrum feature output by the song generation model, wherein the song generation model is obtained through machine learning training using a training set, the training set comes from multiple sampling users and includes multiple samples, each sampling user corresponds to at least one of the samples, and each of the samples includes: singing audio picked up while the sampling user sings a certain song and lyric text corresponding to the singing audio; and
    a generation module, configured to generate the target song according to the target Mel spectrum feature.
  16. The apparatus according to claim 15, wherein the song generation model comprises a timbre encoding sub-model, a text encoding sub-model, and an acoustic decoding sub-model, and the song generation model is obtained by jointly training the timbre encoding sub-model, the text encoding sub-model, and the acoustic decoding sub-model using the same training set.
  17. The apparatus according to claim 16, wherein:
    the song template comprises lyric text information and song melody information;
    the lyric text information comprises a phoneme sequence and phoneme durations; and
    the song melody information comprises a song note sequence and a song energy sequence.
  18. The apparatus according to claim 17, wherein the second processing module comprises:
    a first processing sub-module, configured to input the real Mel spectrum feature of the target user into the timbre encoding sub-model to obtain a timbre feature vector of the target user;
    a second processing sub-module, configured to input the phoneme sequence into the text encoding sub-model to obtain a text feature vector of the lyric text in the song template;
    a third processing sub-module, configured to perform duration regularization on the text feature vector and the timbre feature vector according to the phoneme durations to obtain a frame-level text feature vector and a frame-level timbre feature vector; and
    a fourth processing sub-module, configured to sum the frame-level text feature vector, the frame-level timbre feature vector, and the song melody information, and input the result into the acoustic decoding sub-model to obtain the target Mel spectrum feature.
  19. The apparatus according to claim 18, wherein the first processing sub-module is specifically configured to:
    input the real Mel spectrum feature of the target user into the reference encoder to obtain a timbre latent space distribution vector of the target user;
    input the timbre latent space distribution vector into the autoregressive encoder to obtain a timbre distribution vector of the target user, wherein the timbre distribution vector is obtained by the autoregressive encoder sampling the timbre latent space distribution vector; and
    use the timbre distribution vector as the timbre feature vector of the target user.
  20. The apparatus according to claim 18, wherein the third processing sub-module is specifically configured to:
    determine, from the text feature vector, an initial text encoding corresponding to each phoneme in the phoneme sequence;
    determine, according to the phoneme durations, a first frame count corresponding to the phoneme;
    copy the initial text encoding, and concatenate the copied initial text encodings of the first frame count to obtain a target text encoding; and
    form the frame-level text feature vector from multiple target text encodings.
  21. The apparatus according to claim 18, wherein the third processing sub-module is further configured to:
    determine a second frame count of the voice audio according to the phoneme durations; and
    copy the timbre feature vector, and concatenate the copied timbre feature vectors of the second frame count to obtain the frame-level timbre feature vector.
  22. The apparatus according to claim 17, wherein the song template is configured from the phoneme sequence, the phoneme durations, the song note sequence, and the song energy sequence of the target song, together with the unique identification number of the target song, wherein the phoneme sequence and the phoneme durations of the target song are determined from the song audio and song lyrics of the target song, and the song note sequence and the song energy sequence of the target song are determined from the song audio.
  23. The apparatus according to claim 22, wherein the phoneme sequence comprises multiple phonemes obtained by parsing the song lyrics, and the phoneme durations comprise the first frame count occupied by each of the phonemes in the song audio.
  24. The apparatus according to claim 22, wherein the song energy sequence is obtained by quantizing a song energy feature of the song audio, and the song note sequence is obtained by quantizing a song fundamental frequency feature of the song audio.
  25. The apparatus according to claim 24, wherein the song energy feature comprises multiple energy values, and the song energy sequence is formed from multiple range encoding values, each range encoding value being obtained by one-hot encoding the energy range corresponding to an energy value.
  26. The apparatus according to claim 24, wherein the song fundamental frequency feature comprises multiple fundamental frequency values, and the song note sequence comprises a note symbol corresponding to each of the fundamental frequency values.
  27. A training apparatus for a song generation model, comprising:
    a third acquisition module, configured to acquire a training set, wherein the training set comes from multiple sampling users and includes multiple samples, each sampling user corresponds to at least one of the samples, and each of the samples includes: singing audio picked up while the sampling user sings a certain song and lyric text corresponding to the singing audio;
    a fourth acquisition module, configured to acquire a pre-built initial neural network model, wherein the initial neural network model includes initial weight parameters and a loss function;
    a fifth acquisition module, configured to obtain a first sample from the training set and input the first sample into the initial neural network model to obtain a real Mel spectrum feature and a predicted Mel spectrum feature, wherein the real Mel spectrum feature represents the Mel spectrum feature of the singing audio in the first sample, and the predicted Mel spectrum feature represents the Mel spectrum feature predicted by the initial neural network model;
    a third processing module, configured to calculate an error between the predicted Mel spectrum feature and the real Mel spectrum feature according to the loss function;
    a fourth processing module, configured to adjust the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model; and
    a sixth acquisition module, configured to obtain subsequent samples one by one from the training set and repeatedly input the subsequent samples into the latest neural network model until the loss function converges, to obtain a trained song generation model.
  28. The apparatus according to claim 27, wherein the initial neural network model comprises an initial timbre encoding sub-model, an initial text encoding sub-model, and an initial acoustic decoding sub-model, and the fifth acquisition module comprises:
    a fifth processing sub-module, configured to transcribe the lyric text in the first sample to obtain a phoneme sequence, and to align the singing audio in the first sample against the phoneme sequence to obtain phoneme durations;
    a sixth processing sub-module, configured to perform acoustic feature extraction on the singing audio in the first sample to obtain the real Mel spectrum feature, audio energy, and fundamental frequency trajectory of the first sample;
    a seventh processing sub-module, configured to input the phoneme sequence into the initial text encoding sub-model to obtain a text feature vector of the first sample;
    an eighth processing sub-module, configured to input the real Mel spectrum feature of the first sample into the initial timbre encoding sub-model to obtain a timbre feature vector of the first sample;
    a ninth processing sub-module, configured to perform duration regularization on the text feature vector and the timbre feature vector according to the phoneme durations to obtain a frame-level text feature vector and a frame-level timbre feature vector; and
    a tenth processing sub-module, configured to sum the frame-level text feature vector, the frame-level timbre feature vector, the audio energy, and the fundamental frequency trajectory and input the result into the initial acoustic decoding sub-model to obtain the predicted Mel spectrum feature of the first sample.
  29. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-12, or to perform the method according to any one of claims 13-14.
  30. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause a computer to perform the method according to any one of claims 1-12, or to perform the method according to any one of claims 13-14.
  31. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-12, or the steps of the method according to any one of claims 13-14.
PCT/CN2022/099965 2022-06-20 2022-06-20 Song generation method, apparatus, electronic device, and storage medium WO2023245389A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/099965 WO2023245389A1 (en) 2022-06-20 2022-06-20 Song generation method, apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023245389A1 true WO2023245389A1 (en) 2023-12-28

Family

ID=89378987

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/099965 WO2023245389A1 (en) 2022-06-20 2022-06-20 Song generation method, apparatus, electronic device, and storage medium

Country Status (1)

Country Link
WO (1) WO2023245389A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111354332A (en) * 2018-12-05 2020-06-30 北京嘀嘀无限科技发展有限公司 Singing voice synthesis method and device
CN112562633A (en) * 2020-11-30 2021-03-26 北京有竹居网络技术有限公司 Singing synthesis method and device, electronic equipment and storage medium
CN113838443A (en) * 2021-07-19 2021-12-24 腾讯科技(深圳)有限公司 Audio synthesis method and device, computer-readable storage medium and electronic equipment
CN113593520A (en) * 2021-09-08 2021-11-02 广州虎牙科技有限公司 Singing voice synthesis method and device, electronic equipment and storage medium
CN113963717A (en) * 2021-10-27 2022-01-21 广州酷狗计算机科技有限公司 Cross-language song synthesis method and device, equipment, medium and product thereof

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117528151A (en) * 2024-01-04 2024-02-06 深圳和成视讯科技有限公司 Data encryption transmission method and device based on recorder
CN117528151B (en) * 2024-01-04 2024-04-05 深圳和成视讯科技有限公司 Data encryption transmission method and device based on recorder
CN117710543A (en) * 2024-02-04 2024-03-15 淘宝(中国)软件有限公司 Digital person-based video generation and interaction method, device, storage medium, and program product
CN117809621A (en) * 2024-02-29 2024-04-02 暗物智能科技(广州)有限公司 Speech synthesis method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
JP7280386B2 (en) Multilingual speech synthesis and cross-language voice cloning
JP6777768B2 (en) Word vectorization model learning device, word vectorization device, speech synthesizer, their methods, and programs
WO2023245389A1 (en) Song generation method, apparatus, electronic device, and storage medium
WO2022188734A1 (en) Speech synthesis method and apparatus, and readable storage medium
CN115516552A (en) Speech recognition using synthesis of unexplained text and speech
JP2023535230A (en) Two-level phonetic prosodic transcription
WO2022178969A1 (en) Voice conversation data processing method and apparatus, and computer device and storage medium
US11908448B2 (en) Parallel tacotron non-autoregressive and controllable TTS
CN112349289B (en) Voice recognition method, device, equipment and storage medium
CN114038447A (en) Training method of speech synthesis model, speech synthesis method, apparatus and medium
WO2022222757A1 (en) Method for converting text data into acoustic feature, electronic device, and storage medium
US11475874B2 (en) Generating diverse and natural text-to-speech samples
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
Kumar et al. Machine learning based speech emotions recognition system
Basak et al. Challenges and Limitations in Speech Recognition Technology: A Critical Review of Speech Signal Processing Algorithms, Tools and Systems.
CN113593520B (en) Singing voice synthesizing method and device, electronic equipment and storage medium
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
CN113823265A (en) Voice recognition method and device and computer equipment
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
KR20230120790A (en) Speech Recognition Healthcare Service Using Variable Language Model
CN113870827A (en) Training method, device, equipment and medium of speech synthesis model
WO2022039636A1 (en) Method for synthesizing speech and transmitting the authentic intonation of a clonable sample
CN112951270A (en) Voice fluency detection method and device and electronic equipment
WO2024055752A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22947170

Country of ref document: EP

Kind code of ref document: A1