WO2023245389A1 - Song generation method, apparatus, electronic device, and storage medium - Google Patents

Song generation method, apparatus, electronic device, and storage medium

Info

Publication number
WO2023245389A1
Authority
WO
WIPO (PCT)
Prior art keywords
song
timbre
model
text
target
Prior art date
Application number
PCT/CN2022/099965
Other languages
French (fr)
Chinese (zh)
Inventor
吴洁
Original Assignee
北京小米移动软件有限公司
北京小米松果电子有限公司
Priority date
Filing date
Publication date
Application filed by 北京小米移动软件有限公司 and 北京小米松果电子有限公司
Priority to PCT/CN2022/099965
Publication of WO2023245389A1

Definitions

  • The present disclosure relates to the field of computer technology, and specifically to a song generation method and apparatus, an electronic device, and a storage medium.
  • Song synthesis refers to generating corresponding singing audio based on lyrics and musical scores.
  • Song synthesis algorithms have developed from the initial synthesis technology based on unit concatenation, through statistical parametric synthesis technology, to the current synthesis technology based on deep learning.
  • Song synthesis technology can make machines sing, further increasing the fun of human-computer interaction, and therefore has high commercial value.
  • The embodiments of the present disclosure propose a song generation method, apparatus, electronic device and storage medium, which can be applied in the field of data processing technology and can, during song generation, effectively combine the target user's real Mel spectrum features with the song template corresponding to the target song, so as to effectively reduce the dependence on the amount of user voice data, thereby improving the convenience of song generation while effectively improving the song generation effect.
  • An embodiment of the present disclosure provides a song generation method, including:
  • The training set comes from multiple sampling users.
  • The training set includes multiple samples.
  • One sampling user corresponds to at least one of the samples.
  • Each of the samples includes: the singing audio picked up while the sampling user sings a certain song, and the lyric text corresponding to the singing audio;
  • Embodiments of the present disclosure provide a training method for a song generation model, including:
  • The training set comes from multiple sampling users.
  • The training set includes multiple samples.
  • One sampling user corresponds to at least one of the samples.
  • Each of the samples includes: the singing audio picked up while the sampling user sings a certain song, and the lyric text corresponding to the singing audio;
  • The real Mel spectrum features represent the Mel spectrum features of the singing audio in the first sample;
  • the predicted Mel spectrum features represent the Mel spectrum features predicted by the initial neural network model;
  • Subsequent samples are obtained one by one from the training set and repeatedly input into the latest neural network model until the loss function converges, and a trained song generation model is obtained.
  • An embodiment of the present disclosure proposes a song generation device, including: a first acquisition module, used to acquire the voice audio input by the target user and the unique identification number of the target song;
  • a first processing module, used to extract Mel spectrum features from the voice audio to obtain the real Mel spectrum features of the target user;
  • a second acquisition module, configured to acquire a song template corresponding to the unique identification number according to the unique identification number of the target song;
  • a second processing module, used to input the target user's real Mel spectrum features and the song template into the preset song generation model to obtain the target Mel spectrum features output by the song generation model, wherein:
  • the song generation model is obtained through machine learning training using a training set;
  • the training set comes from multiple sampling users;
  • the training set includes multiple samples;
  • one sampling user corresponds to at least one of the samples;
  • each of the samples includes: the singing audio picked up while the sampling user sings a certain song, and the lyric text corresponding to the singing audio;
  • a generating module, configured to generate a target song according to the target Mel spectrum features.
  • An embodiment of the present disclosure provides a training device for a song generation model, including:
  • a third acquisition module, used to acquire a training set;
  • the training set comes from multiple sampling users;
  • the training set includes multiple samples;
  • one sampling user corresponds to at least one of the samples;
  • each of the samples includes: the singing audio picked up while the sampling user sings a certain song, and the lyric text corresponding to the singing audio;
  • a fourth acquisition module, used to acquire a pre-built initial neural network model, where the initial neural network model includes initial weight parameters and a loss function;
  • a fifth acquisition module, used to acquire the first sample from the training set and input the first sample into the initial neural network model to obtain real Mel spectrum features and predicted Mel spectrum features, where the real Mel spectrum features represent the Mel spectrum features of the singing audio in the first sample, and the predicted Mel spectrum features represent the Mel spectrum features predicted by the initial neural network model;
  • a third processing module, configured to calculate the error between the predicted Mel spectrum features and the real Mel spectrum features according to the loss function;
  • a fourth processing module, configured to adjust the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model;
  • a sixth acquisition module, used to acquire subsequent samples one by one from the training set and repeatedly input the subsequent samples into the latest neural network model until the loss function converges, to obtain the trained song generation model.
  • An embodiment of the present disclosure provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the program, the song generation method proposed in the embodiment of the first aspect of the present disclosure is implemented.
  • Embodiments of the present disclosure provide a non-transitory computer-readable storage medium on which a computer program is stored.
  • When the program is executed by a processor, the song generation method proposed in the embodiment of the first aspect of the present disclosure is implemented, or the training method of the song generation model proposed in the embodiment of the second aspect of the present disclosure is implemented.
  • An embodiment of the present disclosure provides a computer program product.
  • When the instructions in the computer program product are executed by a processor, the song generation method proposed in the embodiment of the first aspect of the present disclosure is executed, or the training method of the song generation model proposed in the embodiment of the second aspect of the present disclosure is executed.
  • Figure 1 is a schematic flowchart of a song generation method proposed by an embodiment of the present disclosure
  • Figure 2 is a schematic flowchart of a song generation method proposed by another embodiment of the present disclosure.
  • Figure 3 is a schematic diagram of a song template generation process proposed by an embodiment of the present disclosure
  • Figure 4 is a schematic flowchart of a song generation method proposed by another embodiment of the present disclosure.
  • Figure 5 is a schematic structural diagram of a timbre encoding sub-model proposed by an embodiment of the present disclosure
  • Figure 6 is a schematic diagram of a song generation process proposed by an embodiment of the present disclosure.
  • Figure 7 is a schematic flowchart of a training method for a song generation model proposed by an embodiment of the present disclosure
  • Figure 8 is a training flow chart of an initial neural network model proposed by an embodiment of the present disclosure.
  • Figure 9 is a schematic structural diagram of a song generation device according to an embodiment of the present disclosure.
  • Figure 10 is a schematic structural diagram of a song generation device according to another embodiment of the present disclosure.
  • Figure 11 is a schematic structural diagram of a training device for a song generation model proposed by an embodiment of the present disclosure
  • Figure 12 is a schematic structural diagram of a training device for a song generation model proposed by another embodiment of the present disclosure.
  • FIG. 13 illustrates a block diagram of an exemplary electronic device suitable for implementing embodiments of the present disclosure.
  • Although the terms first, second, third, etc. may be used to describe various information in the embodiments of the present disclosure, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other.
  • First information may also be called second information, and similarly, the second information may also be called first information.
  • Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
  • The Mel spectrum is a commonly used feature in deep learning for speech.
  • Ordinary spectrograms use a linear frequency scale, while the Mel spectrum is based on the characteristics of human hearing (more sensitive to low-frequency sounds, poorer at resolving high-frequency sounds) and converts the frequency axis of an ordinary spectrogram from a linear scale to the Mel scale.
  • The Mel scale is a logarithmic scale, and human perception of frequency differences is more uniform on the Mel scale.
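For reference, one standard formulation of the linear-to-Mel frequency conversion (a well-known convention; the patent itself does not give a formula) is:

$$m = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right)$$

where $f$ is the frequency in Hz and $m$ is the corresponding Mel value; equal steps in $m$ approximate equal perceived pitch steps.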
  • A phoneme is the smallest unit of speech divided according to the natural properties of speech. It can be analyzed based on the articulatory movements within a syllable: one movement constitutes one phoneme. Phonemes are divided into two categories: vowels and consonants.
  • Timbre refers to the distinctive character that different sounds exhibit in their waveforms; different objects vibrate with their own distinct characteristics.
  • Figure 1 is a schematic flowchart of a song generation method proposed by an embodiment of the present disclosure.
  • The execution subject of the song generation method in this embodiment is a song generation device.
  • The device can be implemented by software and/or hardware.
  • The device can be configured in an electronic device.
  • The electronic device may include but is not limited to a terminal, a server, etc.
  • Terminals can be smartphones, smart TVs, smart watches, smart cars, etc.
  • In some embodiments, the song generation method may include but is not limited to the following steps:
  • Step S101 Obtain the voice audio input by the target user and the unique identification number of the target song.
  • The target user refers to the user who wants to use the song generation method.
  • The voice audio refers to the audio data input by the target user.
  • The voice audio can be the audio data of the target user or the audio data of other users.
  • The target song refers to the song to be generated by the song generation method.
  • The unique identification number refers to identification information corresponding to the target song, such as a number or a name.
  • There may be multiple target songs.
  • Based on the unique identification number, the target song can be accurately located during the song generation process.
  • When acquiring the voice audio input by the target user, an audio acquisition device may be configured in advance in the execution body of the embodiment of the present disclosure, and the audio acquisition device then picks up the target user's voice audio; alternatively,
  • a data interface may be configured in the execution subject of the embodiment of the present disclosure, a song generation request is received through the data interface, and the voice audio is then obtained by parsing the song generation request. There is no limitation on this.
  • When obtaining the unique identification number of the target song, a relationship table may be used.
  • The relationship table records the unique identification number corresponding to the target song; alternatively,
  • a database may be built in advance from the mapping relationship between multiple target songs and their unique identification numbers, and the corresponding unique identification number is then obtained from the database based on the target song. There is no limitation on this.
  • Step S102 Extract Mel spectrum features from the voice audio to obtain the real Mel spectrum features of the target user.
  • The Mel spectrum refers to a spectrogram extracted from audio data; the Mel spectrum is a logarithmic spectrum.
  • The Mel spectrum features refer to the feature information corresponding to the Mel spectrum. It is understandable that the pitch level heard by the human ear does not have a linear relationship with the actual frequency (Hz), and Mel spectrum features better match the auditory characteristics of the human ear.
  • The real Mel spectrum features refer to the Mel spectrum features extracted from the above voice audio.
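As a concrete illustration of step S102 (not part of the patent text), the Mel feature extraction could be done with the librosa library; the sampling rate, frame size, hop length and number of Mel bands below are assumed values:

```python
import librosa
import numpy as np

def extract_mel_features(audio_path: str, sr: int = 22050, n_fft: int = 1024,
                         hop_length: int = 256, n_mels: int = 80) -> np.ndarray:
    """Extract log-Mel spectrum features from the target user's voice audio."""
    y, sr = librosa.load(audio_path, sr=sr)      # load and resample the waveform
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)              # logarithmic spectrum, shape (n_mels, n_frames)

# real_mel = extract_mel_features("target_user_voice.wav")  # hypothetical file name
```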
  • Step S103 Obtain the song template corresponding to the unique identification number according to the unique identification number of the target song.
  • The song template refers to a template that describes information related to the target song.
  • The song template may include lyric text information and song melody information.
  • The lyric text information includes a phoneme sequence and phoneme durations.
  • The song melody information includes a song note sequence and a song energy sequence.
  • In this way, the representation content of the song template can be enriched to a large extent, providing the song generation model with more comprehensive reference information about the target song and thereby effectively improving the applicability of the song template in the song generation process.
  • The lyric text information refers to the text information corresponding to the lyrics of the target song.
  • The song melody information can be used to describe the information related to the melody of the target song.
  • A phoneme is the smallest phonetic unit divided according to the natural properties of speech.
  • A phoneme sequence refers to a sequence composed of multiple phonemes.
  • The phoneme duration refers to the duration information corresponding to a phoneme. Notes are symbols used to record sounds of different lengths.
  • The song note sequence refers to the sequence composed of the notes corresponding to the song audio.
  • Energy can refer to the energy contained in the song audio, such as sound intensity.
  • The song energy sequence can be used to describe how the energy of the song audio changes at different time points.
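One plausible in-memory representation of such a song template, for illustration only (all field names and values are assumptions, not taken from the patent):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SongTemplate:
    song_id: str                  # unique identification number of the target song
    phoneme_sequence: List[str]   # lyric text information: phonemes parsed from the lyrics
    phoneme_durations: List[int]  # frames each phoneme occupies in the song audio
    note_sequence: List[int]      # song melody information: quantized note symbols
    energy_sequence: List[int]    # quantized song energy values

TEMPLATES: Dict[str, SongTemplate] = {
    "song_0001": SongTemplate("song_0001", ["n", "i", "h", "ao"],
                              [5, 8, 4, 12], [60, 60, 62, 64], [3, 4, 4, 2]),
}

def get_song_template(song_id: str) -> SongTemplate:
    """Step S103: look up the song template by its unique identification number."""
    return TEMPLATES[song_id]
```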
  • In some embodiments, multiple song templates may be obtained in advance, and matching is then performed between the unique identification number and the multiple song templates to obtain the song template corresponding to the unique identification number; alternatively, a third-party retrieval device can obtain the song template corresponding to the unique identification number according to the unique identification number of the target song. There is no limitation on this.
  • Step S104 Input the target user's real Mel spectrum features and the song template into the preset song generation model to obtain the target Mel spectrum features output by the song generation model, where the song generation model is obtained through machine learning training using a training set and the training set comes from multiple sampling users.
  • The training set includes multiple samples.
  • One sampling user corresponds to at least one sample.
  • Each sample includes: the singing audio picked up when the sampling user sings a certain song, and the lyric text corresponding to the singing audio.
  • The song generation model refers to a model used to process the real Mel spectrum features and the song template and to output the target Mel spectrum features.
  • The song generation model may be a neural network model.
  • The target Mel spectrum features refer to the Mel spectrum features obtained after the song generation model processes the target user's real Mel spectrum features and the song template.
  • The training set refers to the sample set used by the song generation model during the training process.
  • Sampling users refer to the users who provide samples for the training process of the song generation model.
  • A sample can refer to the singing audio and lyric text used for model training.
  • Singing audio refers to the audio picked up when a sampling user sings a certain song.
  • In some embodiments, the song generation model includes: a timbre encoding sub-model, a text encoding sub-model and an acoustic decoding sub-model.
  • The song generation model is obtained by jointly training the timbre encoding sub-model, the text encoding sub-model and the acoustic decoding sub-model on the same training set.
  • This can effectively improve the structural rationality of the song generation model.
  • Because the same training set is used to jointly train the timbre encoding sub-model, text encoding sub-model and acoustic decoding sub-model, the consistency between the sub-models can be effectively improved, thereby effectively improving the output accuracy of the resulting song generation model.
  • Timbre refers to the distinctive character that different sounds exhibit in their waveforms; different objects vibrate with different characteristics. In other words, different users have different timbres.
  • The timbre encoding sub-model refers to a model used to process the real Mel spectrum features to obtain the target user's timbre feature vector.
  • The text encoding sub-model refers to a model used to process the phoneme sequence to obtain the text feature vector corresponding to the target song.
  • The acoustic decoding sub-model refers to a model used to process multiple kinds of feature information to obtain the target Mel spectrum features.
  • The acoustic decoding sub-model can be the decoder of a fast, end-to-end, non-autoregressive synthesis system.
  • When the target user's real Mel spectrum features and the song template are input into the preset song generation model to obtain the target Mel spectrum features output by the model, the real Mel spectrum features and the relevant information of the song template can be integrated quickly and accurately based on the song generation model, thereby effectively improving the efficiency of generation.
  • Step S105 Generate a target song based on the target mel spectrum characteristics.
  • In some embodiments, the target Mel spectrum features may be input into a vocoder, and the vocoder analyzes and processes the target Mel spectrum features to obtain the target song.
  • For example, the target Mel spectrum features can be input into a preset vocoder model to obtain target linear spectrum features; the linear spectrum features are then subjected to an inverse Fourier transform to obtain the audio data of the target song.
  • The vocoder model is a neural network model, which is also trained through machine learning, using a training set different from that of the song generation model.
  • The vocoder model can be based on a Generative Adversarial Network (GAN), a distillation-free adversarial generation network, etc.
  • The training set can also be a training set commonly used in the existing technology.
  • The training process of the vocoder model can be: input the real Mel spectrum features of a sample into the built initial model to obtain predicted linear spectrum features, and calculate the error between the predicted linear spectrum features and the sample's real linear spectrum features through the loss function;
  • the initial model weights are modified according to the error, and samples are input in this way until the loss function converges, yielding the trained vocoder model.
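For illustration only: the patent's vocoder is a trained neural model, but the mel-to-linear-spectrum-to-audio path it describes can be sketched with classical signal processing, using librosa's Griffin-Lim phase recovery as a non-neural stand-in (parameters assumed to match the feature extraction above):

```python
import librosa
import numpy as np

def mel_to_waveform(mel_db: np.ndarray, sr: int = 22050,
                    n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
    """Approximate step S105: Mel features -> linear spectrum -> audio."""
    mel_power = librosa.db_to_power(mel_db)           # undo the logarithmic scale
    linear = librosa.feature.inverse.mel_to_stft(     # target linear spectrum features
        mel_power, sr=sr, n_fft=n_fft)
    return librosa.griffinlim(linear, hop_length=hop_length)  # phase recovery + inverse transform

# audio = mel_to_waveform(target_mel)  # target_mel: output of the song generation model
```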
  • In the embodiments of the present disclosure, the voice audio input by the target user and the unique identification number of the target song are obtained; Mel spectrum feature extraction is performed on the voice audio to obtain the real Mel spectrum features of the target user; and the song template corresponding to the unique identification number is obtained according to the unique identification number of the target song.
  • The target user's real Mel spectrum features and the song template are input into the preset song generation model to obtain the target Mel spectrum features output by the song generation model, and the target song is generated based on the target Mel spectrum features.
  • In this way, the target user's real Mel spectrum features and the song template corresponding to the target song can be effectively combined during song generation, which effectively reduces the dependence on the amount of user voice data, improves the convenience of song generation, and effectively improves the song generation effect.
  • Figure 2 is a schematic flowchart of a song generation method proposed by another embodiment of the present disclosure.
  • In some embodiments, the song generation method may include but is not limited to the following steps:
  • Step S201 Obtain the voice audio input by the target user and the unique identification number of the target song.
  • Step S202 Extract Mel spectrum features from the voice audio to obtain the real Mel spectrum features of the target user.
  • Step S203 Obtain the song template corresponding to the unique identification number according to the unique identification number of the target song.
  • Step S204 Input the target user's real mel spectrum features into the timbre coding sub-model to obtain the target user's timbre feature vector.
  • The timbre feature vector refers to the vector used to characterize the timbre features of the target user.
  • Step S205 Input the phoneme sequence into the text encoding sub-model to obtain the text feature vector of the lyrics text in the song template.
  • The lyric text refers to the text data in the song template that describes the lyric information of the target song.
  • The text feature vector refers to the vector used to characterize the text features corresponding to the lyric text.
  • In some embodiments, the song template is composed of the phoneme sequence, the phoneme durations, the song note sequence, the song energy sequence, and the unique identification number of the target song, where the phoneme sequence and phoneme durations of the target song are determined from the song audio and song lyrics of the target song,
  • and the song note sequence and song energy sequence of the target song are determined from the song audio. Therefore, the target song can be quickly located based on the unique identification number, which effectively improves the practicality of the obtained song template and, at the same time, effectively improves the accuracy with which the song template represents the lyric-related information of the target song.
  • The song audio refers to the singing audio corresponding to the target song.
  • The song lyrics refer to the lyric information corresponding to the target song.
  • In some embodiments, the phoneme sequence includes multiple phonemes obtained by parsing the song lyrics, and the phoneme duration includes the first frame number that each phoneme occupies in the song audio.
  • In this way, the consistency between the obtained phoneme sequence and the song lyrics can be effectively improved, while the accuracy of the first frame number obtained for each phoneme is also effectively improved.
  • The first frame number refers to the number of audio frames corresponding to a phoneme in the song audio.
  • In some embodiments, the song energy sequence is obtained by quantizing the song energy features of the song audio,
  • and the song note sequence is obtained by quantizing the song fundamental frequency features of the song audio.
  • Based on the quantization process, the clarity with which the obtained song energy sequence and song note sequence represent the song energy features and song fundamental frequency features can be effectively improved,
  • and the song energy sequence and song note sequence obtained through quantization can provide reliable reference data for subsequent computation.
  • The song energy features can be used to describe the features related to the energy of the song.
  • The song fundamental frequency features can be used to describe the features related to the fundamental frequency of the song.
  • In some embodiments, the song energy features include multiple energy values; the song energy sequence is formed from multiple range encoding values, and a range encoding value is obtained by one-hot encoding the energy range corresponding to an energy value. The one-hot encoding effectively expands the song energy features so that multiple energy values can be distinguished by the obtained range encoding values; when the song energy sequence is formed from these range encoding values, the ability of the obtained song energy sequence to represent the song energy features is effectively improved.
  • The energy value may refer to the value corresponding to the energy of the song.
  • The energy range refers to the value range corresponding to the energy value, such as 0-10.
  • One-hot encoding can also be called one-bit-effective encoding.
  • One-hot encoding uses an N-bit status register to encode N states; each state has its own independent register bit, and at any time only one bit is valid.
  • The range encoding value refers to the encoding value obtained by one-hot encoding the energy range.
  • For example, the one-hot encoding can be configured as: 000001, 000010, 000100, 001000, 010000, 100000.
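A minimal sketch of this quantization, assuming six equal-width energy bands over the illustrative 0-10 range (the band boundaries are not specified in the patent):

```python
import numpy as np

def energy_to_range_code(energy: float, n_bands: int = 6,
                         e_min: float = 0.0, e_max: float = 10.0) -> np.ndarray:
    """One-hot encode the energy range (band) into which an energy value falls."""
    band = int((energy - e_min) / (e_max - e_min) * n_bands)
    band = min(max(band, 0), n_bands - 1)   # clamp to a valid band index
    code = np.zeros(n_bands, dtype=np.int64)
    code[band] = 1                          # at any time only one bit is valid
    return code

# The song energy sequence is then the list of range encoding values, one per frame:
energy_sequence = [energy_to_range_code(e) for e in (0.7, 4.2, 9.8)]
```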
  • In some embodiments, the song fundamental frequency features include multiple fundamental frequency values, and the song note sequence includes a note symbol corresponding to each fundamental frequency value. The song note sequence can thus effectively use the correspondence between fundamental frequency values and note symbols and can be adapted to personalized application scenarios, thereby effectively improving the applicability of the resulting song note sequence in the song generation process.
  • The fundamental frequency value refers to the value corresponding to the fundamental frequency of the song.
  • Note symbols refer to the numbers corresponding to notes, which can be obtained from relevant databases in the music field.
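The example mapping given later in the training description (261.63 Hz → note symbol 60, 277.18 Hz → 61) matches the standard MIDI note-number convention, which can be computed as follows (the MIDI interpretation is an inference, not stated in the patent):

```python
import math

def f0_to_note_symbol(f0_hz: float) -> int:
    """Map a fundamental frequency value to a note symbol (MIDI convention: A4 = 440 Hz = 69)."""
    return round(69 + 12 * math.log2(f0_hz / 440.0))

assert f0_to_note_symbol(261.63) == 60   # C4, as in the patent's example
assert f0_to_note_symbol(277.18) == 61   # C#4
```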
  • Figure 3 is a schematic diagram of a song template generation process proposed by an embodiment of the present disclosure.
  • The initial data for the song template may include the song audio and song lyrics corresponding to the target song.
  • The song template generation process may include: (1) process the song lyrics with a text transcription method to obtain the phoneme sequence corresponding to the target song; (2) process the obtained phoneme sequence and the song audio with forced alignment to obtain the phoneme durations of the target song;
  • a forced alignment method can be used on its own, or manual calibration can be performed after the forced alignment operation to improve the accuracy of the phoneme durations; (3) process the song audio with an acoustic feature extraction method to obtain the song energy features and song fundamental frequency features corresponding to the target song, and then apply energy trajectory translation and fundamental frequency trajectory translation to change the energy and pitch values of the song, improving the flexibility of the song template; (4) quantize the song energy features and song fundamental frequency features to obtain the song energy sequence and song note sequence; (5) generate the song template from the phoneme sequence, phoneme durations, song energy sequence and song note sequence; (6) after the song template is generated, a unique identification number of the target song can be generated for the song template, so that the song template can be retrieved based on the unique identification number during song generation.
  • Step S206 Perform duration regularization on the text feature vector and the timbre feature vector according to the phoneme duration to obtain a frame-level text feature vector and a frame-level timbre feature vector.
  • The frame-level text feature vector refers to a vector describing the text features corresponding to multiple audio frames.
  • The frame-level timbre feature vector refers to a vector describing the timbre features corresponding to multiple audio frames.
  • The same phoneme may span multiple audio frames, and the audio frames corresponding to the same phoneme are highly similar.
  • The phoneme-level text feature vector and timbre feature vector can therefore be converted into frame-level text and timbre feature vectors by copying, which facilitates the subsequent step in which
  • the frame-level text feature vector and the frame-level timbre feature vector are added together.
  • Step S207 Add the frame-level text feature vector, frame-level timbre feature vector and song melody information and then input them into the acoustic decoding sub-model to obtain the target mel spectrum feature.
  • Here, addition refers to element-wise addition across dimensions. Assume the frame-level text feature vector, the frame-level timbre feature vector, the song note sequence, and the song energy sequence are all 10-dimensional; addition then means adding the values in the corresponding dimensions.
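A minimal sketch of this fusion step (all shapes assumed; the melody sequences are taken to be already embedded to the common dimension):

```python
import torch

n_frames, dim = 120, 10                     # illustrative sizes
text_feats   = torch.randn(n_frames, dim)   # frame-level text feature vectors
timbre_feats = torch.randn(n_frames, dim)   # frame-level timbre feature vectors
note_seq     = torch.randn(n_frames, dim)   # embedded song note sequence
energy_seq   = torch.randn(n_frames, dim)   # embedded song energy sequence

# Element-wise addition over corresponding dimensions, then acoustic decoding.
decoder_input = text_feats + timbre_feats + note_seq + energy_seq
# target_mel = acoustic_decoder(decoder_input)  # acoustic decoding sub-model (not defined here)
```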
  • In the embodiments of the present disclosure, the target user's real Mel spectrum features can be input into the timbre encoding sub-model to obtain the target user's timbre feature vector,
  • the phoneme sequence is input into the text encoding sub-model to obtain the text feature vector of the lyric text in the song template, and duration regularization is performed on the text feature vector and the timbre feature vector according to the phoneme durations to obtain the frame-level text feature vector and the frame-level timbre feature vector;
  • the frame-level text feature vector, the frame-level timbre feature vector and the song melody information are then added together and input into the acoustic decoding sub-model to obtain the target Mel spectrum features.
  • Step S208 Generate a target song based on the target mel spectrum characteristics.
  • For the description of step S208, reference may be made to the above embodiments; details are not repeated here.
  • In this way, the timbre feature vector of the target user is obtained, and the phoneme sequence is input into the text encoding sub-model to obtain the text feature vector of the lyric text in the song template.
  • Duration regularization is performed on the text feature vector and the timbre feature vector according to the phoneme durations to obtain the frame-level text feature vector and the frame-level timbre feature vector.
  • The frame-level text feature vector, the frame-level timbre feature vector and the song melody information are added together and then input into the acoustic decoding sub-model to obtain the target Mel spectrum features.
  • The real Mel spectrum features and the phoneme sequence can thus be quickly processed by the timbre encoding sub-model and the text encoding sub-model, and the corresponding timbre and text features can be quantified in vector form. Performing duration regularization on the text and timbre feature vectors based on the phoneme durations can effectively improve the consistency between the obtained frame-level text feature vector and frame-level timbre feature vector,
  • thereby effectively improving how well the acoustic decoding sub-model processes the frame-level text feature vector and the frame-level timbre feature vector.
  • Figure 4 is a schematic flowchart of a song generation method proposed by another embodiment of the present disclosure.
  • In some embodiments, the song generation method may include but is not limited to the following steps:
  • Step S401 Obtain the voice audio input by the target user and the unique identification number of the target song.
  • Step S402 Extract Mel spectrum features from the voice audio to obtain the real Mel spectrum features of the target user.
  • Step S403 Obtain the song template corresponding to the unique identification number according to the unique identification number of the target song.
  • Step S404 Input the target user's real mel spectrum features into the reference encoder to obtain the target user's timbre latent space distribution vector.
  • The reference encoder refers to an encoder used to process the real Mel spectrum features to obtain the timbre latent space distribution vector.
  • The timbre latent space distribution vector output by the reference encoder can be regarded as the latent variables corresponding to the real Mel spectrum features.
  • The timbre latent space distribution vector follows a spherical Gaussian distribution.
  • The reference encoder can also output the mean and variance corresponding to that spherical Gaussian distribution.
  • Step S405 Input the timbre latent space distribution vector into the autoregressive encoder to obtain the timbre distribution vector of the target user, where the timbre distribution vector is obtained by sampling the timbre latent space distribution vector by the autoregressive encoder.
  • The autoregressive encoder refers to an encoder used to process the timbre latent space distribution vector to obtain the timbre distribution vector.
  • The structures of the above reference encoder and autoregressive encoder can be multiple linear layers or convolutional layers; there is no limitation on this.
  • Step S406 Use the timbre distribution vector as the timbre feature vector of the target user.
  • In some embodiments, the timbre encoding sub-model can include: a reference encoder and an autoregressive encoder.
  • The target user's real Mel spectrum features can be input into the reference encoder to obtain the target user's timbre latent space distribution vector, and the timbre latent space distribution vector is input into the autoregressive encoder to obtain the target user's timbre distribution vector, where the timbre distribution vector is obtained by the autoregressive encoder
  • sampling from the timbre latent space distribution vector; the timbre distribution vector is then used as the target user's timbre feature vector. This can effectively reduce the redundant information in the obtained timbre feature vector while converting the relatively complex real Mel spectrum features into vector form, thereby effectively improving the practicality of the obtained timbre feature vector.
  • Figure 5 is a schematic structural diagram of a timbre encoding sub-model proposed by an embodiment of the present disclosure, in which the random sampling point ε is a random sampling point of a Gaussian distribution, which can be expressed as ε ~ N(0, I).
  • The timbre encoding sub-model can obtain the timbre latent space distribution vector h and two parameters through the processing of the reference encoder.
  • The two parameters can be used as the mean a1 and variance b1 of the Gaussian distribution, respectively.
  • The timbre encoding sub-model can perform sampling based on an inverse autoregressive flow (IAF), which is a kind of normalizing flow.
  • Normalizing flows produce distributions that are easy to sample from.
  • A normalizing flow can convert a complex input distribution into a tractable probability distribution through a series of invertible transformation operations.
  • The output distribution is usually chosen to be an isotropic unit Gaussian distribution, that is, a spherical unit Gaussian distribution, which allows smooth interpolation and efficient sampling.
  • The timbre feature vector is learned using the inverse autoregressive flow method.
  • The generated timbre latent space distribution vector h can follow a spherical Gaussian distribution, so that the timbre feature vector can be obtained by sampling from this distribution.
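A highly simplified sketch of this structure, under stated assumptions (the layer types, sizes and the single affine IAF-style step are illustrative; the patent does not fix them):

```python
import torch
import torch.nn as nn

class TimbreEncoder(nn.Module):
    """Reference encoder -> (mean a1, variance b1); sample with eps ~ N(0, I); one IAF-style step."""
    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.reference_encoder = nn.GRU(n_mels, dim, batch_first=True)
        self.to_stats = nn.Linear(dim, 2 * dim)   # predicts mean and log-variance
        self.iaf_step = nn.Linear(dim, 2 * dim)   # one affine, invertible refinement

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) real Mel spectrum features
        _, h = self.reference_encoder(mel)              # timbre latent space summary h
        a1, log_b1 = self.to_stats(h[-1]).chunk(2, -1)  # mean a1, log-variance b1
        eps = torch.randn_like(a1)                      # random sampling point eps ~ N(0, I)
        z = a1 + eps * torch.exp(0.5 * log_b1)          # sample from the latent distribution
        m, log_s = self.iaf_step(z).chunk(2, -1)        # autoregressive-flow-style transform
        return m + z * torch.exp(log_s)                 # timbre distribution (feature) vector

timbre_vec = TimbreEncoder()(torch.randn(1, 120, 80))   # -> (1, 256)
```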
  • Step S407 Input the phoneme sequence into the text encoding sub-model to obtain the text feature vector of the lyrics text in the song template.
  • For the description of step S407, reference may be made to the above embodiments; details are not repeated here.
  • Step S408 Determine the initial text code corresponding to each phoneme in the phoneme sequence from the text feature vector.
  • The initial text encoding refers to the text encoding contained in the text feature vector.
  • In this way, reliable reference data can be provided for the subsequent determination of the target text encoding.
  • Step S409 Determine the first frame number corresponding to the phoneme based on the phoneme duration.
  • The first frame number refers to the number of audio frames corresponding to each phoneme.
  • For example, the phoneme duration corresponding to one phoneme may be 25 ms; if one audio frame is set to 5 ms, then this phoneme corresponds to 5 frames.
  • Step S410 Copy the initial text code, and perform splicing processing on the copied initial text code of the first frame number to obtain the target text code.
  • The target text encoding refers to the text encoding obtained by splicing together the first frame number of copies of the initial text encoding.
  • The duration corresponding to a single phoneme may be short, and the multiple audio frames corresponding to the same phoneme may carry considerable redundant information.
  • Step S411 Form a frame-level text feature vector according to multiple target text codes.
  • In this way, the initial text encoding corresponding to each phoneme in the phoneme sequence can be determined from the text feature vector;
  • the first frame number corresponding to each phoneme is determined based on the phoneme duration, the initial text encoding is copied, the first frame number of copies are spliced together to obtain the target text encoding, and a frame-level text feature vector is formed from the multiple target text encodings.
  • Since the time range corresponding to each phoneme is small and the different audio frames within the same phoneme represent highly similar content, obtaining the target text encoding by copying the initial text encoding and splicing the first frame number of copies
  • can greatly reduce the computational cost, thereby effectively improving the efficiency of determining the frame-level text feature vector.
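A compact sketch of this duration regularization (length regulation) step; tensor shapes are assumptions:

```python
import torch

def length_regulate(phoneme_encodings: torch.Tensor,
                    frame_counts: torch.Tensor) -> torch.Tensor:
    """Copy each phoneme's initial text encoding `first frame number` times and
    splice the copies into a frame-level sequence."""
    # phoneme_encodings: (n_phonemes, dim); frame_counts: (n_phonemes,) integer frames
    return torch.repeat_interleave(phoneme_encodings, frame_counts, dim=0)

enc = torch.randn(4, 256)                       # one initial text encoding per phoneme
durs = torch.tensor([5, 8, 4, 12])              # first frame number per phoneme
frame_level_text = length_regulate(enc, durs)   # -> (29, 256) frame-level text features

# The single timbre feature vector is broadcast the same way, using the total
# (second) frame number of the voice audio:
timbre = torch.randn(1, 256)
frame_level_timbre = timbre.expand(int(durs.sum()), -1)   # -> (29, 256)
```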
  • Step S412 Determine the second frame number of the speech audio based on the phoneme duration.
  • The second frame number refers to the number of frames of the voice audio determined based on the phoneme durations.
  • Step S413 Copy the timbre feature vector, and splice the copied timbre feature vectors of the second frame number to obtain a frame-level timbre feature vector.
  • The embodiment of the present disclosure can determine the second frame number of the voice audio based on the phoneme durations, copy the timbre feature vector, and splice
  • the second frame number of copies of the timbre feature vector to obtain the frame-level timbre feature vector. The obtained frame-level timbre feature vector can thus effectively represent the relevant feature information of the speech to be processed at the audio-frame level, which effectively improves
  • the compatibility between the frame-level timbre feature vector and the frame-level text feature vector and effectively improves the representation effect of the obtained frame-level timbre feature vector.
  • Step S414 Add the frame-level text feature vector, the frame-level timbre feature vector and the song melody information and then input them into the acoustic decoding sub-model to obtain the target mel spectrum feature.
  • Step S415 Generate the target song according to the target mel spectrum characteristics.
  • For the description of steps S414 to S415, reference may be made to the above embodiments; details are not repeated here.
  • Figure 6 is a schematic diagram of a song generation process proposed by an embodiment of the present disclosure.
  • In some embodiments, the operation process corresponding to the song generation model may include: (1) process the voice audio with an acoustic feature extraction method to obtain the real Mel spectrum features; (2) input the real Mel spectrum features into the timbre encoding sub-model to obtain the timbre feature vector; (3) input the phoneme sequence in the song template into the text encoding sub-model to obtain the text feature vector of the target song; (4) input the phoneme durations, the timbre feature vector and the text feature vector into the duration regularization sub-module to obtain the frame-level text feature vector and the frame-level timbre feature vector; (5) add the frame-level text feature vector, the frame-level timbre feature vector, the song note sequence and the song energy sequence together and input the result into the acoustic decoding sub-model to obtain the target Mel spectrum features; (6) input the obtained target Mel spectrum features into the vocoder to obtain the target song.
  • In this way, multiple users can share one pre-trained song generation model, and a song performed in a given user's voice can be obtained from a single piece of that user's audio, which effectively improves convenience during song generation while reducing computing resources and storage costs.
  • In the embodiments of the present disclosure, the target user's timbre latent space distribution vector is obtained, and the timbre latent space distribution vector is input into the autoregressive encoder to obtain the target user's timbre distribution vector, where the timbre distribution vector is obtained by the autoregressive encoder sampling from the timbre latent space distribution vector.
  • The timbre distribution vector is used as the target user's timbre feature vector. This can effectively reduce the redundant information in the resulting timbre feature vector while converting the relatively complex real Mel spectrum features into vector form, thereby effectively improving the practicality of the obtained timbre feature vector.
  • By determining the initial text encoding corresponding to each phoneme in the phoneme sequence from the text feature vector, determining the first frame number corresponding to each phoneme according to the phoneme duration, copying the initial text encoding, and splicing the first frame number of copies together to obtain the target text encoding, a frame-level text feature vector is formed from multiple target text encodings. Since the time range corresponding to each phoneme is small and the different audio frames within the same phoneme represent highly similar content, obtaining the target text encoding in this way greatly reduces the computational cost, thereby effectively improving the efficiency of determining the frame-level text feature vector.
  • Likewise, the copied timbre feature vectors of the second frame number are spliced to obtain the frame-level timbre feature vector, which can effectively represent the relevant feature information of the speech to be processed at the audio-frame level, effectively improving the compatibility between the frame-level timbre feature vector and the frame-level text feature vector and improving the representation effect of the obtained frame-level timbre feature vector.
  • FIG. 7 is a schematic flowchart of a training method for a song generation model proposed by an embodiment of the present disclosure.
  • The execution subject of the training method of the song generation model in this embodiment is a training device of the song generation model.
  • The device can be implemented by software and/or hardware, and the device can be configured in an electronic device.
  • Electronic devices may include but are not limited to terminals, servers, etc.
  • Terminals may be smartphones, smart TVs, smart watches, smart cars, etc.
  • In some embodiments, the training method of the song generation model may include but is not limited to the following steps:
  • Step S701 Obtain a training set.
  • The training set comes from multiple sampling users.
  • The training set includes multiple samples.
  • One sampling user corresponds to at least one sample.
  • Each sample includes: the singing audio picked up when the sampling user sings a certain song, and the lyric text corresponding to the singing audio.
  • In some embodiments, a communication link between the execution subject of the embodiment of the present disclosure and a big data server may be established in advance, and the training set is then obtained from the big data server; alternatively,
  • the training set may be collected from multiple sampling users with a sample collection device. There is no limitation on this.
  • Step S702 Obtain a pre-built initial neural network model.
  • The initial neural network model includes initial weight parameters and a loss function.
  • A neural network model is a complex network system formed by a large number of simple processing units (called neurons) that are widely interconnected; it reflects many basic characteristics of human brain function.
  • The initial neural network model refers to the neural network model to be trained.
  • The initial weight parameters refer to the weight parameters to be iteratively updated during model training.
  • The loss function can be used to describe the error between the predicted Mel spectrum features output by the initial neural network model during training and the real Mel spectrum features.
  • Based on the loss function, model performance can be evaluated in real time during training, and whether the model has converged can be judged in a timely manner.
  • Step S703 Obtain the first sample from the training set and input the first sample into the initial neural network model to obtain the real Mel spectrum features and the predicted Mel spectrum features.
  • The real Mel spectrum features represent the Mel spectrum features of the singing audio in the first sample,
  • and the predicted Mel spectrum features represent the Mel spectrum features predicted by the initial neural network model.
  • The first sample refers to the first one of the multiple samples in the training set used for model training.
  • In some embodiments, a sample may be randomly selected from the training set as the first sample, or the first sample may be obtained from the training set based on the numbering information of the multiple samples in the training set. There is no limitation on this.
  • In some embodiments, the lyric text in the first sample may be transcribed to obtain the phoneme sequence, and the singing audio in the first sample is aligned against the phoneme sequence to obtain the phoneme durations; acoustic feature extraction is performed on the singing audio in the first sample to obtain the real Mel spectrum features, audio energy and fundamental frequency trajectory of the first sample; the phoneme sequence is input into the initial text encoding sub-model to obtain the text feature vector of the first sample, and the real Mel spectrum features of the first sample are input into the initial timbre encoding sub-model to obtain the timbre feature vector of the first sample; the text feature vector, the timbre feature vector, and the quantized audio energy and fundamental frequency trajectories are then added together and input into the initial acoustic decoding sub-model to obtain the predicted Mel spectrum features of the first sample.
  • The initial text encoding sub-model refers to the text encoding sub-model to be trained.
  • The initial timbre encoding sub-model refers to the timbre encoding sub-model to be trained.
  • The initial acoustic decoding sub-model refers to the acoustic decoding sub-model to be trained.
  • Audio energy refers to the energy information corresponding to the singing audio in the first sample.
  • The fundamental frequency trajectory refers to the trajectory information corresponding to the fundamental frequency of the singing audio in the first sample.
  • Step S704 Calculate the error between the predicted mel spectrum feature and the real mel spectrum feature according to the loss function.
  • The error can be used to describe the difference between the predicted Mel spectrum features and the real Mel spectrum features.
  • In this way, the output accuracy of the initial neural network model can be evaluated in real time to determine model performance, and the resulting error can provide reliable reference data for determining the direction of model optimization.
  • Step S705 Adjust the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model.
  • When the initial weight parameters of the initial neural network model are adjusted based on the error, the weight parameters can be adjusted accurately, thereby effectively improving the training effect of the neural network model.
  • Step S706 Obtain subsequent samples one by one from the training set, and repeatedly input the subsequent samples into the latest neural network model until the loss function converges, and obtain the trained song generation model.
  • Subsequent samples refer to the samples in the training set other than the first sample.
  • FIG. 8 is a training flow chart of an initial neural network model proposed by an embodiment of the present disclosure.
  • In some embodiments, the initial neural network model may include an initial timbre encoding sub-model, an initial text encoding sub-model, and an initial acoustic decoding sub-model.
  • The training process may include: (1) the song lyrics are processed through text transcription to obtain the corresponding phoneme sequence, and the resulting phoneme sequence is processed through the initial text encoding sub-model to obtain the corresponding text feature vector; (2) the text feature vector and the song phoneme durations are processed based on forced alignment to obtain the initial text encoding; (3) the song audio is processed based on acoustic feature extraction to obtain the real Mel spectrum features, song energy features and song fundamental frequency features; (4) the real Mel spectrum features are processed through the initial timbre encoding sub-model to obtain the timbre feature vector; (5) the multiple energy values in the song energy features are divided into different energy bands (for example, energy values in the range 0-10 can be divided into 10 or 20 energy bands), and the bands are encoded to obtain the song energy sequence; (6) the fundamental frequency values in the song fundamental frequency features are converted into the corresponding note symbols to obtain the song note sequence; for example,
  • the note symbol corresponding to the fundamental frequency 261.63 Hz is 60, and the note symbol corresponding to the fundamental frequency 277.18 Hz is 61; (7) the duration regularization method is used to process the initial text encoding based on the song phoneme durations to obtain the frame-level text feature vector; (8) the duration regularization sub-model processes the timbre feature vector based on the song phoneme durations to obtain the frame-level timbre feature vector; (9) the above frame-level text feature vector, frame-level timbre feature vector, song energy sequence and song note sequence are input into the acoustic decoding sub-model to obtain the predicted Mel spectrum features; (10) the loss function of the song generation model is determined based on the real Mel spectrum features and the predicted Mel spectrum features; through this loss function, each weight parameter in the song generation model can be iteratively updated by gradient backpropagation so that the loss function tends to converge, as sketched below.
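A minimal training-loop sketch of steps S703 to S706 (the MSE form of the loss, the optimizer and the epoch-based convergence proxy are assumptions; the patent only requires an error between predicted and real Mel features followed by iterative weight updates):

```python
import torch
import torch.nn as nn

def train_song_generation_model(model: nn.Module, training_set,
                                epochs: int = 100, lr: float = 1e-4) -> nn.Module:
    """Feed samples one by one, compare predicted vs. real Mel features, update weights."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                         # assumed form of the loss function
    for _ in range(epochs):                        # proxy for "until the loss converges"
        for sample in training_set:                # (singing audio, lyric text) pairs
            real_mel, pred_mel = model(sample)     # forward pass through the sub-models
            loss = loss_fn(pred_mel, real_mel)     # error between predicted and real features
            optimizer.zero_grad()
            loss.backward()                        # gradient backpropagation
            optimizer.step()                       # adjust the weight parameters
    return model
```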
  • In the embodiments of the present disclosure, the training set is obtained from multiple sampling users.
  • The training set includes multiple samples.
  • One sampling user corresponds to at least one sample.
  • Each sample includes: the singing audio picked up when the sampling user sings a certain song, and the lyric text corresponding to the singing audio.
  • The pre-built initial neural network model, which includes initial weight parameters and a loss function, is obtained; the first sample is obtained from the training set and input into the initial neural network model
  • to obtain the real Mel spectrum features and the predicted Mel spectrum features.
  • The real Mel spectrum features represent the Mel spectrum features of the singing audio in the first sample,
  • and the predicted Mel spectrum features represent the Mel spectrum features predicted by the initial neural network model.
  • The error between the predicted Mel spectrum features and the real Mel spectrum features is calculated according to the loss function, the initial weight parameters of the initial neural network model are adjusted based on the error to obtain an updated neural network model, subsequent samples are obtained one by one from the training set, and the subsequent samples are repeatedly input into the latest neural network model until the loss function converges, yielding the trained song generation model. In this way, the error between the model's predicted Mel spectrum features and the real Mel spectrum features can be determined in real time during training based on the loss function, providing a reliable basis for judging model convergence and thereby effectively improving the output accuracy of the song generation model.
  • FIG. 9 is a schematic structural diagram of a song generation device according to an embodiment of the present disclosure.
  • The song generation device 90 includes:
  • the first acquisition module 901, used to acquire the voice audio input by the target user and the unique identification number of the target song;
  • the first processing module 902, used to extract Mel spectrum features from the voice audio to obtain the real Mel spectrum features of the target user;
  • the second acquisition module 903, used to acquire the song template corresponding to the unique identification number according to the unique identification number of the target song;
  • the second processing module 904, used to input the target user's real Mel spectrum features and the song template into the preset song generation model to obtain the target Mel spectrum features output by the song generation model, where the song generation model is obtained through machine learning training using a training set.
  • The training set comes from multiple sampling users.
  • The training set includes multiple samples.
  • One sampling user corresponds to at least one sample.
  • Each sample includes: the singing audio picked up when the sampling user sings a certain song, and the lyric text corresponding to the singing audio;
  • the generation module 905, used to generate the target song according to the target Mel spectrum features.
  • the song generation model includes: timbre encoding sub-model, text encoding sub-model and acoustic decoding sub-model; the song generation model is obtained by jointly training the timbre encoding sub-model, text encoding sub-model and acoustic decoding sub-model using the same training set.
  • the song template includes lyric text information and song melody information; the lyric text information includes phoneme sequences and phoneme durations; and the song melody information includes song note sequences and song energy sequences.
  • the second processing module 904 includes: a first processing sub-module 9041, which is used to input the target user's real mel spectrum characteristics into the timbre encoding sub-model to obtain the target user's timbre feature vector;
  • the second processing sub-module 9042 is used to input the phoneme sequence into the text encoding sub-model to obtain the text feature vector of the lyrics text in the song template;
  • the third processing sub-module 9043 is used to perform duration processing on the text feature vector and timbre feature vector according to the phoneme duration.
  • frame-level text feature vectors and frame-level timbre feature vectors are obtained; the fourth processing sub-module 9044 is used to add the frame-level text feature vectors, frame-level timbre feature vectors and song melody information and then input them to the acoustic decoding sub-model. , obtain the target mel spectrum characteristics.
  • the first processing sub-module 9041 is specifically used to: input the real mel spectrum characteristics of the target user into the reference encoder to obtain the timbre latent space distribution vector of the target user; Input the timbre latent space distribution vector into the autoregressive encoder to obtain the timbre distribution vector of the target user, where the timbre distribution vector is obtained by sampling the timbre latent space distribution vector by the autoregressive encoder; use the timbre distribution vector as the timbre of the target user Feature vector.
  • the third processing sub-module 9043 is specifically used to: determine the initial text code corresponding to each phoneme in the phoneme sequence from the text feature vector; determine the first text code corresponding to the phoneme according to the phoneme duration. One frame number; copy the initial text encoding, and perform splicing processing on the copied initial text encoding of the first frame number to obtain the target text encoding; form a frame-level text feature vector based on multiple target text encodings.
  • the third processing sub-module 9043 is also used to: determine the second frame number of the speech audio according to the phoneme duration; copy the timbre feature vector, and copy the timbre feature vector of the second frame number Perform splicing processing to obtain frame-level timbre feature vectors.
  • the song template is configured by the phoneme sequence, phoneme duration, song note sequence, song energy sequence, and the unique identification number of the target song, wherein the phoneme sequence and phoneme duration of the target song are configured by The song audio and song lyrics of the target song are determined, and the song note sequence and song energy sequence of the target song are determined by the song audio.
  • the phoneme sequence includes: multiple phonemes obtained by parsing the song lyrics, and the phoneme duration includes: the first frame number occupied by each phoneme in the song audio.
  • the song energy sequence is obtained by quantizing the song energy characteristics of the song audio
  • the song note sequence is obtained by quantizing the song fundamental frequency characteristics of the song audio.
  • the song energy characteristics include: multiple energy values; the song energy sequence is formed based on multiple range encoding values, and the range encoding values are processed by one-hot encoding of the energy range corresponding to the energy value. get.
  • the song fundamental frequency feature includes: a plurality of fundamental frequency values; and the song note sequence includes a note symbol corresponding to each fundamental frequency value.
  • Figure 11 is a schematic structural diagram of a training device for a song generation model proposed by an embodiment of the present disclosure.
  • the training device 110 of the song generation model includes: a third acquisition module 1101, used to obtain a training set.
  • the training set comes from multiple sampling users.
  • the training set includes multiple samples.
  • One sampling user at least corresponds to A sample, each sample includes: the singing audio picked up when the user sings a certain song and the lyrics text corresponding to the singing audio;
  • the fourth acquisition module 1102 is used to obtain the pre-built initial neural network model, the initial neural network model Including initial weight parameters and loss functions;
  • the fifth acquisition module 1103 is used to obtain the first sample from the training set, and input the first sample into the initial neural network model to obtain real Mel spectrum features and predicted Mel spectrum features,
  • the real mel spectrum feature represents the mel spectrum feature of the singing audio in the first sample, and the predicted mel spectrum feature represents the mel spectrum feature predicted by the initial neural network model;
  • the third processing module 1104 is used to calculate predictions based on the loss function The error between the Mel spectrum feature and the real Mel spectrum feature;
  • the fourth processing module 1105 is used to
  • the initial neural network model includes: an initial timbre encoding sub-model, Initial text encoding sub-model, and initial acoustic decoding sub-model;
  • the fifth acquisition module 1103 includes: the fifth processing sub-module 11031, which is used to transcribe the lyrics text in the first sample to obtain the phoneme sequence, and according to the phoneme The sequence aligns the singing audio pairs in the first sample to obtain the phoneme duration;
  • the sixth processing sub-module 11032 is used to extract the acoustic features of the singing audio in the first sample to obtain the real mel spectrum characteristics of the first sample.
  • the seventh processing sub-module 11033 is used to input the phoneme sequence into the initial text encoding sub-model to obtain the text feature vector of the first sample; the eighth processing sub-module 11034 is used to convert the first sample
  • the real mel spectrum features are input into the initial timbre encoding sub-model to obtain the timbre feature vector of the first sample;
  • the ninth processing sub-module 11035 is used to duration regularize the text feature vector and timbre feature vector according to the phoneme duration to obtain the frame level text feature vector and frame level timbre feature vector;
  • the tenth processing submodule 11036 is used to add the frame level text feature vector, frame level timbre feature vector, audio energy, and fundamental frequency trajectory and input them into the initial acoustic decoding sub-module model to obtain the predicted Mel spectrum characteristics of the first sample.
  • a training set is obtained.
  • the training set comes from multiple sampling users.
  • the training set includes multiple samples.
  • One sampling user corresponds to at least one sample.
  • Each sample includes: singing songs picked up when the sampling user sings a certain song. Audio and lyric text corresponding to the singing audio, obtain the pre-built initial neural network model, the initial neural network model includes initial weight parameters and loss function, obtain the first sample from the training set, and input the first sample into the initial neural network model , the real mel spectrum feature and the predicted mel spectrum feature are obtained.
  • the real mel spectrum feature represents the mel spectrum feature of the singing audio in the first sample
  • the predicted mel spectrum feature represents the mel spectrum predicted by the initial neural network model.
  • FIG. 13 illustrates a block diagram of an exemplary electronic device suitable for implementing embodiments of the present disclosure.
  • the electronic device 12 shown in FIG. 13 is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present disclosure.
  • electronic device 12 is embodied in the form of a general computing device.
  • Components of electronic device 12 may include, but are not limited to: one or more processors or processing units 16, system memory 28, and a bus 18 connecting various system components (including system memory 28 and processing unit 16).
  • Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a graphics accelerated port, a processor, or a local bus using any of a variety of bus structures.
  • these architectures include but are not limited to Industry Standard Architecture (hereinafter referred to as: ISA) bus, Micro Channel Architecture (Micro Channel Architecture; hereafter referred to as: MAC) bus, enhanced ISA bus, video electronics Standards Association (Video Electronics Standards Association; hereinafter referred to as: VESA) local bus and Peripheral Component Interconnection (hereinafter referred to as: PCI) bus.
  • ISA Industry Standard Architecture
  • MAC Micro Channel Architecture
  • VESA Video Electronics Standards Association
  • PCI Peripheral Component Interconnection
  • Electronic device 12 typically includes a variety of computer system readable media. These media may be any available media that can be accessed by electronic device 12, including volatile and nonvolatile media, removable and non-removable media.
  • the memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory; hereinafter referred to as: RAM) 30 and/or cache memory 32.
  • Electronic device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • storage system 34 may be used to read and write to non-removable, non-volatile magnetic media (not shown in Figure 13, commonly referred to as a "hard drive").
  • a disk drive for reading and writing a removable non-volatile disk e.g., a "floppy disk” and a removable non-volatile optical disk (e.g., a compact disk read-only memory)
  • a removable non-volatile disk e.g., a "floppy disk”
  • a removable non-volatile optical disk e.g., a compact disk read-only memory
  • CD-ROM Disc Read Only Memory
  • DVD-ROM Digital Video Disc Read Only Memory
  • each drive may be connected to bus 18 through one or more data media interfaces.
  • Memory 28 may include at least one program product having a set (eg, at least one) of program modules configured to perform the functions of embodiments of the present disclosure.
  • a program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in memory 28 , each of these examples or some combination may include the implementation of a network environment.
  • Program modules 42 generally perform functions and/or methods in the embodiments described in this disclosure.
  • Electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), may also communicate with one or more devices that enable human interaction with electronic device 12, and/or with Any device (eg, network card, modem, etc.) that enables the electronic device 12 to communicate with one or more other computing devices. This communication may occur through input/output (I/O) interface 22.
  • the electronic device 12 can also communicate with one or more networks (such as a local area network (Local Area Network; hereinafter referred to as: LAN), a wide area network (Wide Area Network; hereinafter referred to as: WAN)) and/or a public network, such as the Internet, through the network adapter 20 ) communication.
  • networks such as a local area network (Local Area Network; hereinafter referred to as: LAN), a wide area network (Wide Area Network; hereinafter referred to as: WAN)
  • a public network such as the Internet
  • network adapter 20 communicates with other modules of electronic device 12 via bus 18 .
  • other hardware and/or software modules may be used in conjunction with electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives And data backup storage system, etc.
  • the processing unit 16 executes programs stored in the system memory 28 to perform various functional applications and data processing, such as implementing the song generation method and the song generation model training method mentioned in the previous embodiments.
  • the present disclosure also proposes a non-transitory computer-readable storage medium on which a computer program is stored.
  • the program is executed by a processor, the song generation method and song generation method as proposed in the previous embodiments of the present disclosure are implemented. Model training method.
  • the present disclosure also proposes a computer program product.
  • the instruction processor in the computer program product is executed, the song generation method and the song generation model training method proposed in the previous embodiments of the present disclosure are executed.
  • the above embodiments it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • software it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer programs.
  • the computer program When the computer program is loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present disclosure are generated in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer program may be stored in or transferred from one computer-readable storage medium to another, for example, the computer program may be transferred from a website, computer, server, or data center Transmission to another website, computer, server or data center through wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.) means.
  • the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more available media integrated.
  • the usable media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., high-density digital video discs (DVD)), or semiconductor media (e.g., solid state disks, SSD)) etc.
  • magnetic media e.g., floppy disks, hard disks, magnetic tapes
  • optical media e.g., high-density digital video discs (DVD)
  • DVD digital video discs
  • semiconductor media e.g., solid state disks, SSD
  • At least one in the present disclosure can also be described as one or more, and the plurality can be two, three, four or more, and the present disclosure is not limited.
  • the technical feature is distinguished by “first”, “second”, “third”, “A”, “B”, “C” and “D” etc.
  • the technical features described in “first”, “second”, “third”, “A”, “B”, “C” and “D” are in no particular order or order.
  • each table in this disclosure can be configured or predefined.
  • the values of the information in each table are only examples and can be configured as other values, which is not limited by this disclosure.
  • it is not necessarily required to configure all the correspondences shown in each table.
  • the corresponding relationships shown in some rows may not be configured.
  • appropriate deformation adjustments can be made based on the above table, such as splitting, merging, etc.
  • the names of the parameters shown in the titles of the above tables may also be other names understandable by the communication device, and the values or expressions of the parameters may also be other values or expressions understandable by the communication device.
  • other data structures can also be used, such as arrays, queues, containers, stacks, linear lists, pointers, linked lists, trees, graphs, structures, classes, heaps, hash tables or hash tables. wait.
  • Predefinition in this disclosure may be understood as definition, pre-definition, storage, pre-storage, pre-negotiation, pre-configuration, solidification, or pre-burning.

Landscapes

  • Auxiliary Devices For Music (AREA)

Abstract

Provided in the present disclosure are a song generation method, an apparatus, an electronic device, and a storage medium, the method comprising: acquiring a voice audio input by a target user and a unique identification number of a target song; performing Mel spectrogram feature extraction on the voice audio to obtain a real Mel spectrogram feature of the target user; according to the unique identification number of the target song, acquiring a song template corresponding to the unique identification number; inputting the real Mel spectrogram feature of the target user and the song template into a preset song generation model to obtain a target Mel spectrogram feature output by the song generation model; and, according to the target Mel spectrogram feature, generating the target song. The present disclosure can effectively combine the real Mel spectrogram feature of the target user and the song template corresponding to the target song during the song generation process, so as to effectively lower the degree of dependence on the data volume of user voice data, thereby effectively improving a song generation effect while improving song generation convenience.

Description

歌曲生成方法、装置、电子设备和存储介质Song generation method, device, electronic device and storage medium 技术领域Technical field
本公开涉及计算机技术领域,具体涉及一种歌曲生成方法、装置、电子设备和存储介质。The present disclosure relates to the field of computer technology, and specifically to a song generation method, device, electronic device and storage medium.
背景技术Background technique
歌曲合成,是指基于歌词和乐谱生成对应的歌唱音频。对应的歌曲合成算法从最初的基于单元拼接的合成技术,发展为统计参数合成技术直至当前的基于深度学习的合成技术。歌曲合成技术可以让机器唱歌,进一步增加了人机交互的趣味性,因此具有较高的商业价值。Song synthesis refers to generating corresponding singing audio based on lyrics and musical scores. The corresponding song synthesis algorithm has developed from the initial synthesis technology based on unit splicing to statistical parameter synthesis technology to the current synthesis technology based on deep learning. Song synthesis technology can make machines sing, further increasing the fun of human-computer interaction, and therefore has high commercial value.
相关技术中,在进行歌曲合成时,通常对训练语料的数量以及质量要求较高,导致歌曲生成过程较为繁琐且无法保证歌曲生成效果。In related technologies, when performing song synthesis, the quantity and quality of training corpus are usually required to be relatively high, which makes the song generation process cumbersome and the song generation effect cannot be guaranteed.
发明内容Contents of the invention
本公开实施例提出一种歌曲生成方法、装置、电子设备和存储介质,可以应用于数据处理技术领域中,可以在歌曲生成过程中有效结合目标用户的真实梅尔谱特征和目标歌曲对应的歌曲模板,以有效降低对用户语音数据的数据量的依赖程度,从而在提升歌曲生成便捷性的同时,有效提升歌曲生成效果。The embodiments of the present disclosure propose a song generation method, device, electronic device and storage medium, which can be applied in the field of data processing technology and can effectively combine the real Mel spectrum characteristics of the target user and the songs corresponding to the target song during the song generation process. Templates to effectively reduce the dependence on the amount of user voice data, thereby effectively improving the song generation effect while improving the convenience of song generation.
第一方面,本公开实施例提供一种歌曲生成方法,包括:In a first aspect, an embodiment of the present disclosure provides a song generation method, including:
获取目标用户输入的语音音频和目标歌曲的唯一识别号;Obtain the voice audio input by the target user and the unique identification number of the target song;
对所述语音音频进行梅尔谱特征提取,得到所述目标用户的真实梅尔谱特征;Perform mel spectrum feature extraction on the speech audio to obtain the real mel spectrum features of the target user;
根据所述目标歌曲的唯一识别号获取与所述唯一识别号对应的歌曲模板;Obtain the song template corresponding to the unique identification number according to the unique identification number of the target song;
将所述目标用户的真实梅尔谱特征和所述歌曲模板输入至预设的歌曲生成模型中,得到所述歌曲生成模型输出的目标梅尔谱特征,其中,所述歌曲生成模型为使用训练集通过机器学习训练得到,所述训练集来自于多个采样用户,所述训练集包括多个样本,一个采样用户至少对应一个所述样本,每个所述样本包括:所述采样用户歌唱某一歌曲时所拾取的歌唱音频和与所述歌唱音频对应的歌词文本;Input the real mel spectrum features of the target user and the song template into the preset song generation model to obtain the target mel spectrum features output by the song generation model, wherein the song generation model is trained using The set is obtained through machine learning training. The training set comes from multiple sampling users. The training set includes multiple samples. One sampling user corresponds to at least one of the samples. Each of the samples includes: the sampling user sings a certain song. The singing audio picked up during a song and the lyric text corresponding to the singing audio;
根据所述目标梅尔谱特征生成目标歌曲。Generate a target song based on the target mel spectrum characteristics.
第二方面,本公开实施例提供一种歌曲生成模型的训练方法,包括:In a second aspect, embodiments of the present disclosure provide a training method for a song generation model, including:
获取训练集,所述训练集来自于多个采样用户,所述训练集包括多个样本,一个所述采样用户至少对应一个所述样本,每个所述样本包括:所述采样用户歌唱某一歌曲时所拾取的歌唱音频和与所述歌唱音频对应的歌词文本;Obtain a training set. The training set comes from multiple sampling users. The training set includes multiple samples. One sampling user corresponds to at least one of the samples. Each of the samples includes: the sampling user sings a certain song. The singing audio picked up during the song and the lyric text corresponding to the singing audio;
获取预先搭建的初始神经网络模型,所述初始神经网络模型包括初始权重参数和损失函数;Obtain a pre-built initial neural network model, which includes initial weight parameters and a loss function;
从所述训练集中获取首个样本,并将所述首个样本输入至所述初始神经网络模型中,得到真实梅尔谱特征和预测梅尔谱特征,所述真实梅尔谱特征表示所述首个样本中的歌唱音频的梅尔谱特征,所述预测梅尔谱特征表示所述初始神经网络模型所预测的梅尔谱特征;Obtain the first sample from the training set and input the first sample into the initial neural network model to obtain real mel spectrum features and predicted mel spectrum features. The real mel spectrum features represent the The Mel spectrum feature of the singing audio in the first sample, the predicted Mel spectrum feature represents the Mel spectrum feature predicted by the initial neural network model;
根据所述损失函数计算所述预测梅尔谱特征和所述真实梅尔谱特征之间的误差;Calculate the error between the predicted mel spectrum feature and the true mel spectrum feature according to the loss function;
根据所述误差对所述初始神经网络模型的初始权重参数进行调整,得到更新的神经网络模型;Adjust the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model;
从所述训练集中逐一获取后续样本,并将所述后续样本重复输入至最新的神经网络模型,直至所述损失函数收敛,得到训练完成的歌曲生成模型。Subsequent samples are obtained one by one from the training set, and the subsequent samples are repeatedly input into the latest neural network model until the loss function converges, and a trained song generation model is obtained.
第三方面,本公开实施例提出一种歌曲生成装置,包括:第一获取模块,用于获取目标用户输入的语音音频和目标歌曲的唯一识别号;In a third aspect, an embodiment of the present disclosure proposes a song generation device, including: a first acquisition module for acquiring the voice audio input by the target user and the unique identification number of the target song;
第一处理模块,用于对所述语音音频进行梅尔谱特征提取,得到所述目标用户的真实梅尔谱特征;The first processing module is used to extract mel spectrum features from the speech audio to obtain the real mel spectrum features of the target user;
第二获取模块,用于根据所述目标歌曲的唯一识别号获取与所述唯一识别号对应的歌曲模板;a second acquisition module, configured to acquire a song template corresponding to the unique identification number according to the unique identification number of the target song;
第二处理模块,用于将所述目标用户的真实梅尔谱特征和所述歌曲模板输入至预设的歌曲生成模型中,得到所述歌曲生成模型输出的目标梅尔谱特征,其中,所述歌曲生成模型为使用训练集通过机器学习训练得到,所述训练集来自于多个采样用户,所述训练集包括多个样本,一个采样用户至少对应一个所述样本,每个所述样本包括:所述采样用户歌唱某一歌曲时所拾取的歌唱音频和与所述歌唱音频对应的歌词文本;The second processing module is used to input the real mel spectrum features of the target user and the song template into the preset song generation model to obtain the target mel spectrum features output by the song generation model, wherein: The song generation model is obtained through machine learning training using a training set. The training set comes from multiple sampling users. The training set includes multiple samples. One sampling user corresponds to at least one of the samples. Each of the samples includes : The sampling audio picked up when the user sings a certain song and the lyric text corresponding to the singing audio;
生成模块,用于根据所述目标梅尔谱特征生成目标歌曲。A generating module, configured to generate a target song according to the target Mel spectrum characteristics.
第四方面,本公开实施例提供一种歌曲生成模型的训练装置,其特征在于,包括:In a fourth aspect, an embodiment of the present disclosure provides a training device for a song generation model, which is characterized by including:
第三获取模块,用于获取训练集,所述训练集来自于多个采样用户,所述训练集包括多个样本,一个所述采样用户至少对应一个所述样本,每个所述样本包括:所述采样用户歌唱某一歌曲时所拾取的歌唱音频和与所述歌唱音频对应的歌词文本;The third acquisition module is used to acquire a training set. The training set comes from multiple sampling users. The training set includes multiple samples. One sampling user corresponds to at least one of the samples. Each of the samples includes: The sampled singing audio picked up when the user sings a certain song and the lyric text corresponding to the singing audio;
第四获取模块,用于获取预先搭建的初始神经网络模型,所述初始神经网络模型包括初始权重参数和损失函数;The fourth acquisition module is used to acquire a pre-built initial neural network model, where the initial neural network model includes initial weight parameters and a loss function;
第五获取模块,用于从所述训练集中获取首个样本,并将所述首个样本输入至所述初始神经网络模型中,得到真实梅尔谱特征和预测梅尔谱特征,所述真实梅尔谱特征表示所述首个样本中的歌唱音频的梅尔谱特征,所述预测梅尔谱特征表示所述初始神经网络模型所预测的梅尔谱特征;The fifth acquisition module is used to acquire the first sample from the training set, and input the first sample into the initial neural network model to obtain real Mel spectrum features and predicted Mel spectrum features, the real The Mel spectrum feature represents the Mel spectrum feature of the singing audio in the first sample, and the predicted Mel spectrum feature represents the Mel spectrum feature predicted by the initial neural network model;
第三处理模块,用于根据所述损失函数计算所述预测梅尔谱特征和所述真实梅尔谱特征之间的误差;A third processing module, configured to calculate the error between the predicted Mel spectrum feature and the real Mel spectrum feature according to the loss function;
第四处理模块,用于根据所述误差对所述初始神经网络模型的初始权重参数进行调整,得到更新的神经网络模型;A fourth processing module, configured to adjust the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model;
第六获取模块,用于从所述训练集中逐一获取后续样本,并将所述后续样本重复输入至最新的神经网络模型,直至所述损失函数收敛,得到训练完成的歌曲生成模型。The sixth acquisition module is used to acquire subsequent samples one by one from the training set, and repeatedly input the subsequent samples into the latest neural network model until the loss function converges to obtain the trained song generation model.
第五方面,本公开实施例提供一种电子设备,包括:存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述程序时实现如本公开第一方面实施例提出的歌曲生成方法,或者实现如本公开第二方面实施例提出的歌曲生成模型的训练方法。In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the first aspect of the present disclosure. The song generation method proposed in the embodiment of one aspect, or the training method of the song generation model proposed in the embodiment of the second aspect of the present disclosure.
第六方面,本公开实施例提供一种非临时性计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现如本公开第一方面实施例提出的歌曲生成方法,或者实现如本公开第二方面实施例提出的歌曲生成模型的训练方法。In a sixth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the song generation method as proposed in the embodiment of the first aspect of the disclosure is implemented, or Implement the training method of the song generation model as proposed in the embodiment of the second aspect of the present disclosure.
第七方面,本公开实施例提供一种计算机程序产品,当所述计算机程序产品中的指令由处理器执行时,执行如本公开第一方面实施例提出的歌曲生成方法,或者执行如本公开第二方面实施例提出的歌曲生成模型的训练方法。In a seventh aspect, an embodiment of the present disclosure provides a computer program product. When the instructions in the computer program product are executed by a processor, the song generation method proposed in the embodiment of the first aspect of the present disclosure is executed, or the method of the present disclosure is executed. The second aspect embodiment proposes a training method for a song generation model.
综上所述,在本公开实施例提供的歌曲生成方法、装置、电子设备、存储介质、计算机程序及计算机程序产品,可以实现以下技术效果:To sum up, the song generation methods, devices, electronic devices, storage media, computer programs and computer program products provided in the embodiments of the present disclosure can achieve the following technical effects:
通过获取目标用户输入的语音音频和目标歌曲的唯一识别号,对语音音频进行梅尔谱特征提取,得到目标用户的真实梅尔谱特征,根据目标歌曲的唯一识别号获取与唯一识别号对应的歌曲模板,将目标 用户的真实梅尔谱特征和歌曲模板输入至预设的歌曲生成模型中,得到歌曲生成模型输出的目标梅尔谱特征,根据目标梅尔谱特征生成目标歌曲,可以在歌曲生成过程中有效结合目标用户的真实梅尔谱特征和目标歌曲对应的歌曲模板,以有效降低对用户语音数据的数据量的依赖程度,从而在提升歌曲生成便捷性的同时,有效提升歌曲生成效果。By obtaining the voice audio input by the target user and the unique identification number of the target song, perform Mel spectrum feature extraction on the voice audio to obtain the real Mel spectrum features of the target user, and obtain the corresponding unique identification number based on the unique identification number of the target song. Song template, input the target user's real mel spectrum features and song template into the preset song generation model, obtain the target mel spectrum features output by the song generation model, and generate the target song based on the target mel spectrum features, which can be used in the song During the generation process, the real Mel spectrum characteristics of the target user and the song template corresponding to the target song are effectively combined to effectively reduce the dependence on the amount of user voice data, thus effectively improving the song generation effect while improving the convenience of song generation. .
附图说明Description of the drawings
为了更清楚地说明本公开实施例或背景技术中的技术方案,下面将对本公开实施例或背景技术中所需要使用的附图进行说明。In order to more clearly illustrate the technical solutions in the embodiments of the disclosure or the background technology, the drawings required to be used in the embodiments or the background technology of the disclosure will be described below.
图1是本公开一实施例提出的歌曲生成方法的流程示意图;Figure 1 is a schematic flowchart of a song generation method proposed by an embodiment of the present disclosure;
图2是本公开另一实施例提出的歌曲生成方法的流程示意图;Figure 2 is a schematic flowchart of a song generation method proposed by another embodiment of the present disclosure;
图3是本公开实施例提出的一歌曲模板生成过程示意图;Figure 3 is a schematic diagram of a song template generation process proposed by an embodiment of the present disclosure;
图4是本公开另一实施例提出的歌曲生成方法的流程示意图;Figure 4 is a schematic flowchart of a song generation method proposed by another embodiment of the present disclosure;
图5是本公开实施例提出的一音色编码子模型结构示意图;Figure 5 is a schematic structural diagram of a timbre encoding sub-model proposed by an embodiment of the present disclosure;
图6是本公开实施例提出的一歌曲生成流程示意图;Figure 6 is a schematic diagram of a song generation process proposed by an embodiment of the present disclosure;
图7是本公开一实施例提出的歌曲生成模型的训练方法的流程示意图;Figure 7 is a schematic flowchart of a training method for a song generation model proposed by an embodiment of the present disclosure;
图8是本公开实施例提出的一初始神经网络模型的训练流程图;Figure 8 is a training flow chart of an initial neural network model proposed by an embodiment of the present disclosure;
图9是本公开一实施例提出的歌曲生成装置的结构示意图;Figure 9 is a schematic structural diagram of a song generation device according to an embodiment of the present disclosure;
图10是本公开另一实施例提出的歌曲生成装置的结构示意图;Figure 10 is a schematic structural diagram of a song generation device according to another embodiment of the present disclosure;
图11是本公开一实施例提出的歌曲生成模型的训练装置的结构示意图;Figure 11 is a schematic structural diagram of a training device for a song generation model proposed by an embodiment of the present disclosure;
图12是本公开另一实施例提出的歌曲生成模型的训练装置的结构示意图;Figure 12 is a schematic structural diagram of a training device for a song generation model proposed by another embodiment of the present disclosure;
图13示出了适于用来实现本公开实施方式的示例性电子设备的框图。13 illustrates a block diagram of an exemplary electronic device suitable for implementing embodiments of the present disclosure.
具体实施方式Detailed ways
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本公开实施例相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本公开实施例的一些方面相一致的装置和方法的例子。Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with embodiments of the present disclosure. Rather, they are merely examples of apparatus and methods consistent with aspects of embodiments of the present disclosure as detailed in the appended claims.
在本公开实施例使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本公开实施例。在本公开实施例和所附权利要求书中所使用的单数形式的“一种”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。The terminology used in the embodiments of the present disclosure is for the purpose of describing specific embodiments only and is not intended to limit the embodiments of the present disclosure. As used in the embodiments of the present disclosure and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It will also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
应当理解,尽管在本公开实施例可能采用术语第一、第二、第三等来描述各种信息,但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本公开实施例范围的情况下,第一信息也可以被称为第二信息,类似地,第二信息也可以被称为第一信息。取决于语境,如在此所使用的词语“如果”及“若”可以被解释成为“在……时”或“当……时”或“响应于确定”。It should be understood that although the terms first, second, third, etc. may be used to describe various information in the embodiments of the present disclosure, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of the embodiments of the present disclosure, the first information may also be called second information, and similarly, the second information may also be called first information. Depending on the context, the words "if" and "if" as used herein may be interpreted as "when" or "when" or "in response to determining."
为了便于理解,首先介绍本公开涉及的术语。For ease of understanding, terminology involved in this disclosure is first introduced.
1、梅尔谱1. Mel spectrum
梅尔谱,是语音深度学习过程中的常用特征。普通语谱图是线性的,而梅尔谱基于人类听觉的特性(对低频声音比较敏感,对高频声音的分辨能力较差),将普通语谱图的频率从线性转换为梅尔尺度,而梅尔尺度是一种对数尺度,人类对于频率的感知在梅尔尺度上更为敏感。Mel spectrum is a commonly used feature in the deep speech learning process. Ordinary spectrograms are linear, while Mel spectroscopy is based on the characteristics of human hearing (more sensitive to low-frequency sounds and poorer in resolving high-frequency sounds), converting the frequency of ordinary spectrograms from linear to Mel scale, The Mel scale is a logarithmic scale, and human perception of frequency is more sensitive on the Mel scale.
2、音素2. Phonemes
音素,是根据语音的自然属性划分出来的最小语音单位,可以依据音节里的发音动作进行分析,一个动作构成一个音素。音素分为元音与辅音两大类。Phoneme is the smallest unit of speech divided according to the natural properties of speech. It can be analyzed based on the pronunciation movements in syllables. One movement constitutes a phoneme. Phonemes are divided into two categories: vowels and consonants.
3、音色3. Tone
音色,是指不同声音表现在波形方面总是有与众不同的特性,不同的物体振动都有不同的特点。Timbre means that different sounds always have distinctive characteristics in terms of waveforms, and different objects vibrate with different characteristics.
图1是本公开一实施例提出的歌曲生成方法的流程示意图。Figure 1 is a schematic flowchart of a song generation method proposed by an embodiment of the present disclosure.
其中,需要说明的是,本实施例的歌曲生成方法的执行主体为歌曲生成装置,该装置可以由软件和/或硬件的方式实现,该装置可以配置在电子设备中,电子设备可以包括但不限于终端、服务器端等,如终端可为智能手机、智能电视、智能手表、智能汽车等。It should be noted that the execution subject of the song generation method in this embodiment is a song generation device. The device can be implemented by software and/or hardware. The device can be configured in an electronic device. The electronic device can include but not Limited to terminals, servers, etc. For example, terminals can be smartphones, smart TVs, smart watches, smart cars, etc.
如图1所示,该歌曲生成方法,可以包括但不限于如下步骤:As shown in Figure 1, the song generation method may include but is not limited to the following steps:
步骤S101:获取目标用户输入的语音音频和目标歌曲的唯一识别号。Step S101: Obtain the voice audio input by the target user and the unique identification number of the target song.
其中,目标用户,是指待使用该歌曲生成方法的用户。而语音音频,是指目标用户所输入的音频数据,该语音音频可以是目标用户的音频数据,可以是其他用户的音频数据,对此不做限制。而目标歌曲,是指该歌曲生成方法待生成的歌曲。唯一识别号,是指目标歌曲对应的标识信息,例如编号或名称。Among them, the target user refers to the user who wants to use the song generation method. The voice audio refers to the audio data input by the target user. The voice audio can be the audio data of the target user or the audio data of other users. There is no limit to this. The target song refers to the song to be generated by the song generation method. The unique identification number refers to the identification information corresponding to the target song, such as the number or name.
可以理解的是,目标歌曲的数量可能是多个,当获取目标歌曲的唯一识别号,可以实现在歌曲生成过程中对目标歌曲的准确定位。It is understandable that the number of target songs may be multiple. When the unique identification number of the target song is obtained, the target song can be accurately positioned during the song generation process.
本公开实施例中,在获取目标用户输入的语音音频时,可以是预先在本公开实施例的执行主体中配置音频获取装置,而后由音频获取装置获取目标用户的语音音频,或者,还可以预先在本公开实施例的执行主体中配置数据接口,经由该数据接口接收歌曲生成请求,而后从歌曲生成请求中解析得到语音音频,对此不做限制。In the embodiment of the present disclosure, when acquiring the voice audio input by the target user, the audio acquisition device may be configured in advance in the execution body of the embodiment of the present disclosure, and then the audio acquisition device acquires the voice audio of the target user, or the audio acquisition device may be configured in advance. A data interface is configured in the execution subject of the embodiment of the present disclosure, a song generation request is received through the data interface, and then the voice audio is obtained by parsing the song generation request, and there is no limit to this.
本公开实施例中,在获取目标歌曲的唯一识别号时,可以是采用关系表,该关系表中可以记载目标歌曲对应的唯一识别号,或者,还可以预先基于多个目标歌曲与对象唯一识别号之间的映射关系建立数据库,而后基于目标歌曲才从数据库中获取对应的唯一识别号,对此不做限制。In the embodiment of the present disclosure, when obtaining the unique identification number of the target song, a relationship table may be used. The relationship table may record the unique identification number corresponding to the target song, or the unique identification number of the target song may be uniquely identified based on multiple target songs in advance. The mapping relationship between the numbers establishes a database, and then the corresponding unique identification number is obtained from the database based on the target song. There is no restriction on this.
步骤S102:对语音音频进行梅尔谱特征提取,得到目标用户的真实梅尔谱特征。Step S102: Extract mel spectrum features from the speech audio to obtain the real mel spectrum features of the target user.
其中,梅尔谱,是指基于音频数据所提取得到的频谱图,该梅尔谱属于对数谱。而梅尔谱特征,是指梅尔谱对应的特征信息。可以理解的是,人耳听到的声音高低和实际(Hz)频率不呈线性关系,用梅尔谱特征更符合人耳的听觉特性。而真实梅尔谱特征,是指基于上述语音数据所获取的梅尔谱特征。Among them, the Mel spectrum refers to the spectrogram extracted based on the audio data, and the Mel spectrum is a logarithmic spectrum. The Mel spectrum feature refers to the feature information corresponding to the Mel spectrum. It is understandable that the sound level heard by the human ear does not have a linear relationship with the actual frequency (Hz), and the Mel spectrum feature is more in line with the auditory characteristics of the human ear. The real Mel spectrum features refer to the Mel spectrum features obtained based on the above speech data.
本公开实施例中,当对语音音频进行梅尔谱特征提取,得到目标用户的真实梅尔谱特征时,可以实现对语音音频的特征提取,从而为歌曲生成过程提供可靠的参考数据。In the embodiment of the present disclosure, when mel spectrum feature extraction is performed on the voice audio to obtain the real mel spectrum feature of the target user, feature extraction of the voice audio can be achieved, thereby providing reliable reference data for the song generation process.
步骤S103:根据目标歌曲的唯一识别号获取与唯一识别号对应的歌曲模板。Step S103: Obtain the song template corresponding to the unique identification number according to the unique identification number of the target song.
其中,歌曲模板,是指描述目标歌曲相关信息的模板。Among them, the song template refers to a template that describes information related to the target song.
可选的,一些实施例中,该歌曲模板可以包括歌词文本信息和歌曲旋律信息,歌词文本信息包括音素序列和音素时长,歌曲旋律信息包括歌曲音符序列和歌曲能量序列,由此,可以较大程度得丰富歌曲模板中的表征内容,从而为歌曲生成模型提供目标歌曲较为全面的参考信息,以有效提升歌曲模板在歌 曲生成过程中的适用性。Optionally, in some embodiments, the song template may include lyric text information and song melody information. The lyric text information includes phoneme sequences and phoneme durations. The song melody information includes song note sequences and song energy sequences. Therefore, it can be larger The representation content in the song template can be enriched to a certain extent, thereby providing the song generation model with more comprehensive reference information of the target song, thereby effectively improving the applicability of the song template in the song generation process.
其中,歌词文本信息,是指目标歌曲对应歌词的文本信息。歌曲旋律信息,可以被用于描述目标歌曲旋律对应的相关信息。音素,是指根据语音的自然属性所划分得到的最小语音单位。而音素序列,是指多个音素所组成的序列。而音素时长,是指音素所对应的时长信息。音符,是指被用于记录不同长短的音的进行符号。而歌曲音符序列,是指歌曲音频对应音符所组成的序列。Among them, the lyrics text information refers to the text information corresponding to the lyrics of the target song. Song melody information can be used to describe relevant information corresponding to the melody of the target song. Phoneme refers to the smallest phonetic unit divided according to the natural properties of speech. A phoneme sequence refers to a sequence composed of multiple phonemes. The phoneme duration refers to the duration information corresponding to the phoneme. Notes refer to progressive symbols used to record sounds of different lengths. The song note sequence refers to the sequence composed of the corresponding notes of the song audio.
其中,能量,可以是指歌曲音频中所包含的能量,例如声音强度,而歌曲能量序列,可以被用于描述不同时间点对应的歌曲音频对应的能量变化情况。Among them, energy can refer to the energy contained in the song audio, such as sound intensity, and the song energy sequence can be used to describe the energy changes corresponding to the song audio corresponding to different time points.
本公开实施例中,在根据目标歌曲的唯一识别号获取与唯一识别号对应的歌曲模板时,可以是预先获取多个歌曲模板,而后基于唯一识别号和多个歌曲模板进行匹配处理,以得到与唯一识别号对应的歌曲模板,或者,还可以由第三方检索装置根据目标歌曲的唯一识别号获取与唯一识别号对应的歌曲模板,对此不做限制。In the embodiment of the present disclosure, when obtaining the song template corresponding to the unique identification number according to the unique identification number of the target song, multiple song templates may be obtained in advance, and then matching processing is performed based on the unique identification number and the multiple song templates to obtain The song template corresponding to the unique identification number, or a third-party retrieval device can also obtain the song template corresponding to the unique identification number according to the unique identification number of the target song, and there is no limit to this.
步骤S104:将目标用户的真实梅尔谱特征和歌曲模板输入至预设的歌曲生成模型中,得到歌曲生成模型输出的目标梅尔谱特征,其中,歌曲生成模型为使用训练集通过机器学习训练得到,训练集来自于多个采样用户,训练集包括多个样本,一个采样用户至少对应一个样本,每个样本包括:采样用户歌唱某一歌曲时所拾取的歌唱音频和与歌唱音频对应的歌词文本。Step S104: Input the target user's real mel spectrum features and song template into the preset song generation model to obtain the target mel spectrum features output by the song generation model, where the song generation model is trained by machine learning using a training set It is obtained that the training set comes from multiple sampling users. The training set includes multiple samples. One sampling user corresponds to at least one sample. Each sample includes: the singing audio picked up when the sampling user sings a certain song and the lyrics corresponding to the singing audio. text.
其中,歌曲生成模型,是指被用于处理真实梅尔谱特征和歌曲模板,并输出目标梅尔谱特征的模型。该歌曲生成模型,可以是神经网络模型。而目标梅尔谱特征,是指由歌曲生成模型处理目标用户的真实梅尔谱特征和歌曲模板所得到的梅尔谱特征。训练集,是指歌曲生成模型在训练过程中所使用的样本集。Among them, the song generation model refers to a model used to process real mel spectrum features and song templates, and output target mel spectrum features. The song generation model may be a neural network model. The target mel spectrum feature refers to the mel spectrum feature obtained by processing the real mel spectrum feature of the target user and the song template by the song generation model. The training set refers to the sample set used by the song generation model during the training process.
其中,采样用户,是指为歌曲生成模型的训练过程提供样本的用户。而样本,可以是指被用于进行模型训练的歌唱音频和歌词文本。而歌唱音频,是指采样用户歌唱某一歌曲时所拾取的音频。Among them, sampling users refer to users who provide samples for the training process of the song generation model. The sample can refer to the singing audio and lyric text used for model training. Singing audio refers to sampling the audio picked up when the user sings a certain song.
可选的,一些实施例中,歌曲生成模型包括:音色编码子模型、文本编码子模型和声学解码子模型,歌曲生成模型为采用同一个训练集对音色编码子模型、文本编码子模型和声学解码子模型进行联合训练得到,由此,可以有效提升歌曲生成模型的结构合理性,当采用同一个训练集对音色编码子模型、文本编码子模型和声学解码子模型进行联合训练时,能够有效提升各个子模型之间的一致性,从而有效提升所得歌曲生成模型的输出准确性。Optionally, in some embodiments, the song generation model includes: timbre encoding sub-model, text encoding sub-model and acoustic decoding sub-model. The song generation model uses the same training set to combine the timbre encoding sub-model, text encoding sub-model and acoustic decoding sub-model. The decoding sub-model is obtained through joint training. This can effectively improve the structural rationality of the song generation model. When the same training set is used to jointly train the timbre encoding sub-model, text encoding sub-model and acoustic decoding sub-model, it can effectively improve the structural rationality of the song generation model. Improve the consistency between each sub-model, thereby effectively improving the output accuracy of the resulting song generation model.
其中,音色,是指不同声音表现在波形方面所表现出来的不同特性,不同的物体振动存在不同的特点,也即是说,不同用户的音色存在差异。而音色编码子模型,是指被用于处理真实梅尔谱特征,以得到目标用户音色特征向量的模型。Among them, timbre refers to the different characteristics of different sounds in terms of waveforms. Different objects vibrate with different characteristics. In other words, the timbre of different users is different. The timbre coding sub-model refers to a model used to process real mel spectrum features to obtain the target user's timbre feature vector.
其中,文本编码子模型,是指被用于处理音素序列,以得到目标歌曲对应文本特征向量的模型。Among them, the text encoding sub-model refers to the model used to process the phoneme sequence to obtain the text feature vector corresponding to the target song.
其中,声学解码子模型,是指被用于处理多个特征信息以得到目标梅尔谱特征的模型,该声学解码子模型可以是快速端到端且非自回归合成系统的解码器。Among them, the acoustic decoding sub-model refers to a model used to process multiple feature information to obtain target mel spectrum features. The acoustic decoding sub-model can be a decoder of a fast end-to-end and non-autoregressive synthesis system.
本公开实施例中,当将目标用户的真实梅尔谱特征和歌曲模板输入至预设的歌曲生成模型中,得到歌曲生成模型输出的目标梅尔谱特征时,可以基于歌曲生成模型快速、准确地融合真实梅尔谱特征和歌曲模板的相关信息,从而有效提升模型生成效率。In the embodiment of the present disclosure, when the real mel spectrum features and song templates of the target user are input into the preset song generation model, and the target mel spectrum features output by the song generation model are obtained, it can be quickly and accurately based on the song generation model. It effectively integrates real Mel spectrum features and relevant information of song templates, thereby effectively improving the efficiency of model generation.
步骤S105:根据目标梅尔谱特征生成目标歌曲。Step S105: Generate a target song based on the target mel spectrum characteristics.
本公开实施例在根据目标梅尔谱特征生成目标歌曲时,可以是将目标梅尔谱特征输入至声码器中,由声码器解析处理目标频谱特征,以得到目标歌曲。When generating a target song based on the target mel spectrum features in the embodiment of the present disclosure, the target mel spectrum features may be input into the vocoder, and the vocoder analyzes and processes the target spectrum features to obtain the target song.
举例而言,在根据目标梅尔谱特征生成目标歌曲时,可以是将目标梅尔谱特征输入到预设的声码器模型中,得到目标线性谱特征;将线性谱特征进行逆傅里叶变换,得到目标歌曲的音频数据。For example, when generating a target song based on the target mel spectrum features, the target mel spectrum features can be input into the preset vocoder model to obtain the target linear spectrum features; the linear spectrum features are subjected to inverse Fourier Transform to obtain the audio data of the target song.
其中,声码器模型为神经网络模型,该声码器模型也是使用与不同于歌曲生成模型的训练集通过机器学习训练得到。声码器模型可以基于生成对抗网络(Generative Adversial Networks,GAN)、无蒸馏的对抗生成网络等,训练集也可以采用现有技术中常用的训练集。Among them, the vocoder model is a neural network model, which is also trained through machine learning using a training set different from the song generation model. The vocoder model can be based on Generative Adversial Networks (GAN), adversarial generation network without distillation, etc. The training set can also be a training set commonly used in the existing technology.
该声码器模型的训练过程可以是将一个样本中的真实的梅尔谱特征输入到搭建好的初始模型中,得到预测的线性谱特征,通过损失函数计算预测的线性谱特征和样本中的真实的线性谱特征的误差,根据误差修改初始模型权重,如此往复输入样本,直至损失函数收敛,得到训练好的声码器模型。The training process of the vocoder model can be to input the real mel spectrum feature in a sample into the built initial model, obtain the predicted linear spectrum feature, and calculate the predicted linear spectrum feature and the sample's linear spectrum feature through the loss function. For the error of the real linear spectrum feature, the initial model weight is modified according to the error, and the samples are input in this way until the loss function converges, and the trained vocoder model is obtained.
本公开实施例中,通过获取目标用户输入的语音音频和目标歌曲的唯一识别号,对语音音频进行梅尔谱特征提取,得到目标用户的真实梅尔谱特征,根据目标歌曲的唯一识别号获取与唯一识别号对应的歌曲模板,将目标用户的真实梅尔谱特征和歌曲模板输入至预设的歌曲生成模型中,得到歌曲生成模型输出的目标梅尔谱特征,根据目标梅尔谱特征生成目标歌曲,可以在歌曲生成过程中有效结合目标用户的真实梅尔谱特征和目标歌曲对应的歌曲模板,以有效降低对用户语音数据的数据量的依赖程度,从而在提升歌曲生成便捷性的同时,有效提升歌曲生成效果。In the embodiment of the present disclosure, by obtaining the voice audio input by the target user and the unique identification number of the target song, Mel spectrum feature extraction is performed on the voice audio to obtain the real Mel spectrum feature of the target user, which is obtained according to the unique identification number of the target song. For the song template corresponding to the unique identification number, input the target user's real mel spectrum features and song template into the preset song generation model to obtain the target mel spectrum features output by the song generation model, and generate the target mel spectrum features based on the target mel spectrum features. The target song can effectively combine the target user's real mel spectrum characteristics and the song template corresponding to the target song during the song generation process to effectively reduce the dependence on the amount of user voice data, thereby improving the convenience of song generation. , effectively improving the song generation effect.
图2是本公开另一实施例提出的歌曲生成方法的流程示意图。Figure 2 is a schematic flowchart of a song generation method proposed by another embodiment of the present disclosure.
如图2所示,该歌曲生成方法,可以包括但不限于如下步骤:As shown in Figure 2, the song generation method may include but is not limited to the following steps:
步骤S201:获取目标用户输入的语音音频和目标歌曲的唯一识别号。Step S201: Obtain the voice audio input by the target user and the unique identification number of the target song.
步骤S202:对语音音频进行梅尔谱特征提取,得到目标用户的真实梅尔谱特征。Step S202: Extract mel spectrum features from the speech audio to obtain the real mel spectrum features of the target user.
步骤S203:根据目标歌曲的唯一识别号获取与唯一识别号对应的歌曲模板。Step S203: Obtain the song template corresponding to the unique identification number according to the unique identification number of the target song.
步骤S201-步骤S203的描述说明可以具体参见上述实施例,在此不再赘述。For descriptions of steps S201 to S203, specific reference may be made to the above embodiments, and details will not be described again here.
步骤S204:将目标用户的真实梅尔谱特征输入音色编码子模型,得到目标用户的音色特征向量。Step S204: Input the target user's real mel spectrum features into the timbre coding sub-model to obtain the target user's timbre feature vector.
其中,音色特征向量,是指被用于表征目标用户对应音色特征的向量。Among them, the timbre feature vector refers to the vector used to characterize the corresponding timbre characteristics of the target user.
步骤S205:将音素序列输入文本编码子模型,得到歌曲模板中歌词文本的文本特征向量。Step S205: Input the phoneme sequence into the text encoding sub-model to obtain the text feature vector of the lyrics text in the song template.
其中,歌词文本,是指歌曲模板中描述目标歌曲相应歌词信息的文本数据。Among them, the lyrics text refers to the text data in the song template that describes the corresponding lyrics information of the target song.
其中,文本特征向量,是指被用于表征歌词文本对应文本特征的向量。Among them, the text feature vector refers to the vector used to characterize the text features corresponding to the lyrics text.
可选的,一些实施例中,歌曲模板由目标歌曲的音素序列、音素时长、歌曲音符序列、歌曲能量序列,以及目标歌曲的唯一识别号配置得到,其中,目标歌曲的音素序列和音素时长由目标歌曲的歌曲音频和歌曲歌词确定,目标歌曲的歌曲音符序列和歌曲能量序列由歌曲音频确定,由此,可以基于唯一识别号实现对目标歌曲的快速定位,以有效提升所得歌曲模板的实用性,同时可以有效提升歌曲模板对目标歌词相关信息的表征准确性。Optionally, in some embodiments, the song template is configured by the phoneme sequence, phoneme duration, song note sequence, song energy sequence, and the unique identification number of the target song, where the phoneme sequence and phoneme duration of the target song are configured by The song audio and song lyrics of the target song are determined. The song note sequence and song energy sequence of the target song are determined by the song audio. Therefore, the target song can be quickly positioned based on the unique identification number to effectively improve the practicality of the obtained song template. , and at the same time, it can effectively improve the accuracy of the song template's representation of target lyrics-related information.
其中,歌曲音频,是指目标歌曲对应的演唱音频。而歌曲歌词,是指目标歌曲对应的歌词信息。Among them, the song audio refers to the singing audio corresponding to the target song. The song lyrics refer to the lyric information corresponding to the target song.
可选的,一些实施例中,音素序列包括:解析歌曲歌词得到的多个音素,音素时长包括:每个音素在歌曲音频中所占据的第一帧数,由此,可以有效提升所得音素序列与歌曲歌词之间的适配性,同时有效提升所得第一帧数对相应音素的准确性。Optionally, in some embodiments, the phoneme sequence includes: multiple phonemes obtained by parsing song lyrics, and the phoneme duration includes: the number of first frames each phoneme occupies in the song audio. Thus, the obtained phoneme sequence can be effectively improved The compatibility with the lyrics of the song, while effectively improving the accuracy of the obtained first frame number for the corresponding phoneme.
其中,第一帧数,是指音素在歌曲音频中对应的视频帧的数量。Among them, the first frame number refers to the number of video frames corresponding to the phoneme in the song audio.
Optionally, in some embodiments, the song energy sequence is obtained by quantizing the song energy features of the song audio, and the song note sequence is obtained by quantizing the song fundamental frequency features of the song audio. Quantization effectively improves the clarity with which the resulting song energy sequence and song note sequence represent the song energy features and the song fundamental frequency features, and the quantized song energy sequence and song note sequence provide reliable reference data for subsequent computation.

Here, the song energy features can be used to describe characteristics related to the energy of the song, and the song fundamental frequency features can be used to describe characteristics related to the fundamental frequency of the song.

Optionally, in some embodiments, the song energy features include multiple energy values, the song energy sequence is formed from multiple range encoding values, and each range encoding value is obtained by one-hot encoding the energy range corresponding to an energy value. The one-hot encoding process effectively expands the song energy features so that the multiple energy values can be distinguished by the resulting range encoding values; when the song energy sequence is formed from the multiple range encoding values, the sequence represents the song energy features more effectively.

Here, an energy value may refer to a numerical value of the song energy, and an energy range refers to the value range in which an energy value falls, such as 0-10.

One-hot encoding, also called one-bit-effective encoding, uses an N-bit status register to encode N states; each state has its own independent register bit, and at any given time only one bit is valid.

Here, a range encoding value refers to the encoded value obtained by one-hot encoding an energy range.
For example, to encode six states, let the natural sequence codes of the six states be: 000, 001, 010, 011, 100, 101.

The corresponding one-hot encodings can then be configured as: 000001, 000010, 000100, 001000, 010000, 100000.
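For illustration only, a minimal sketch of one-hot encoding energy values by the energy range they fall into might look as follows; the equal-width bands over the 0-10 range are an assumption:

```python
import numpy as np

def one_hot_energy(energy_values, num_bands=10, max_energy=10.0):
    """One-hot encode each energy value by the energy range it falls into.

    Assumes energy values in [0, max_energy] split into num_bands equal ranges.
    """
    codes = np.zeros((len(energy_values), num_bands), dtype=np.float32)
    for i, e in enumerate(energy_values):
        band = min(int(e / max_energy * num_bands), num_bands - 1)
        codes[i, band] = 1.0  # only one bit is valid at any given time
    return codes

# one_hot_energy([0.3, 9.8]) -> two rows with a single 1, in bands 0 and 9
```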
Optionally, in some embodiments, the song fundamental frequency features include multiple fundamental frequency values, and the song note sequence includes a note number corresponding to each fundamental frequency value. The song note sequence can thus effectively exploit the correspondence between fundamental frequency values and note numbers to adapt to personalized application scenarios, effectively improving the applicability of the resulting song note sequence in the song generation process.

Here, a fundamental frequency value refers to a numerical value of the fundamental frequency of the song, and a note number is the number corresponding to a musical note, which can be obtained from a relevant database in the music field.
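As a sketch only: the fundamental-frequency-to-note-number mapping quoted later in this disclosure (261.63 Hz to 60, 277.18 Hz to 61) matches the standard MIDI convention, so the quantization might be implemented as follows; the rounding choice is an assumption:

```python
import math

def f0_to_note(f0_hz: float) -> int:
    """Quantize a fundamental frequency value to a MIDI-style note number.

    69 is the note number of A4 = 440 Hz; there are 12 semitones per octave.
    """
    return round(69 + 12 * math.log2(f0_hz / 440.0))

# f0_to_note(261.63) -> 60, f0_to_note(277.18) -> 61
```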
For example, as shown in Figure 3, which is a schematic diagram of a song template generation process proposed by an embodiment of the present disclosure, the initial data of the song template may include the song audio and song lyrics corresponding to the target song, and the song template generation process may include: (1) processing the song lyrics with a text transcription method to obtain the phoneme sequence corresponding to the target song; (2) applying forced alignment to the resulting phoneme sequence and the song audio to obtain the phoneme durations of the target song, where the forced alignment may be followed by manual calibration to improve the accuracy of the resulting phoneme durations; (3) processing the song audio with an acoustic feature extraction method to obtain the song energy features and song fundamental frequency features corresponding to the target song, and then shifting the energy trajectory and the fundamental frequency trajectory to change the energy and pitch values of the song, improving the flexibility of the song template; (4) quantizing the song energy features and the song fundamental frequency features to obtain the song energy sequence and the song note sequence; (5) generating the song template from the phoneme sequence, the phoneme durations, the song energy sequence, and the song note sequence; and (6) after generating the song template, generating a unique identification number of the target song for the song template, so that the song template can be retrieved based on the unique identification number during song generation.
Step S206: Perform duration regularization on the text feature vector and the timbre feature vector according to the phoneme durations to obtain a frame-level text feature vector and a frame-level timbre feature vector.

Here, the frame-level text feature vector is a vector describing the text features corresponding to multiple audio frames, and the frame-level timbre feature vector is a vector describing the timbre features corresponding to multiple audio frames.

It can be understood that the same phoneme may span multiple audio frames, and the audio frames corresponding to the same phoneme are highly similar. When the text feature vector and the timbre feature vector are duration-regularized according to the phoneme durations to obtain the frame-level text feature vector and the frame-level timbre feature vector, the phoneme-level text feature vector and timbre feature vector can be converted to frame-level vectors by copying, which facilitates the subsequent addition of the frame-level text feature vector and the frame-level timbre feature vector.
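A minimal sketch of this copy-based duration regularization, assuming phoneme-level features of shape (num_phonemes, dim) and per-phoneme durations given as frame counts:

```python
import numpy as np

def duration_regularize(phoneme_feats, durations):
    """Expand phoneme-level features to frame level by copying.

    phoneme_feats: (num_phonemes, dim) array, one row per phoneme.
    durations:     per-phoneme frame counts (the first frame numbers).
    Returns a (total_frames, dim) frame-level feature matrix.
    """
    return np.repeat(phoneme_feats, durations, axis=0)

# A single timbre feature vector of shape (dim,) is copied to every frame:
# np.repeat(timbre_vec[None, :], total_frames, axis=0)
```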
Step S207: Add the frame-level text feature vector, the frame-level timbre feature vector, and the song melody information, and input the result into the acoustic decoding sub-model to obtain the target mel spectrum features.

Here, addition refers to element-wise addition over dimensions. Assuming the frame-level text feature vector, the frame-level timbre feature vector, the song note sequence, and the song energy sequence are all 10-dimensional, addition means summing the values in corresponding dimensions.
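For illustration, with hypothetical 10-dimensional frame-level features this is a plain element-wise sum:

```python
import numpy as np

dim = 10  # assumed common dimension of the four feature streams
text_f, timbre_f = np.random.rand(dim), np.random.rand(dim)
notes, energy = np.random.rand(dim), np.random.rand(dim)

decoder_input = text_f + timbre_f + notes + energy  # values summed per dimension
```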
That is to say, in the embodiments of the present disclosure, after the song template corresponding to the unique identification number is obtained according to the unique identification number of the target song, the real mel spectrum features of the target user can be input into the timbre encoding sub-model to obtain the timbre feature vector of the target user; the phoneme sequence is input into the text encoding sub-model to obtain the text feature vector of the lyric text in the song template; the text feature vector and the timbre feature vector are duration-regularized according to the phoneme durations to obtain the frame-level text feature vector and the frame-level timbre feature vector; and the frame-level text feature vector, the frame-level timbre feature vector, and the song melody information are added and input into the acoustic decoding sub-model to obtain the target mel spectrum features. Feature extraction on the real mel spectrum features and the phoneme sequence can thus be performed quickly based on the timbre encoding sub-model and the text encoding sub-model, with the corresponding timbre features and text features quantified in vector form; duration-regularizing the text feature vector and the timbre feature vector based on the phoneme durations then effectively improves the consistency between the resulting frame-level text feature vector and frame-level timbre feature vector, effectively improving how well the acoustic decoding sub-model processes them.
Step S208: Generate the target song according to the target mel spectrum features.

For the description of step S208, reference may be made to the foregoing embodiments, and details are not repeated here.
In this embodiment, the real mel spectrum features of the target user are input into the timbre encoding sub-model to obtain the timbre feature vector of the target user; the phoneme sequence is input into the text encoding sub-model to obtain the text feature vector of the lyric text in the song template; the text feature vector and the timbre feature vector are duration-regularized according to the phoneme durations to obtain the frame-level text feature vector and the frame-level timbre feature vector; and the frame-level text feature vector, the frame-level timbre feature vector, and the song melody information are added and input into the acoustic decoding sub-model to obtain the target mel spectrum features. Feature extraction on the real mel spectrum features and the phoneme sequence can thus be performed quickly based on the timbre encoding sub-model and the text encoding sub-model, with the corresponding timbre features and text features quantified in vector form, and duration regularization based on the phoneme durations effectively improves the consistency between the resulting frame-level text and timbre feature vectors, thereby effectively improving how well the acoustic decoding sub-model processes them.
Figure 4 is a schematic flowchart of a song generation method proposed by another embodiment of the present disclosure.

As shown in Figure 4, the song generation method may include, but is not limited to, the following steps:

Step S401: Obtain the voice audio input by the target user and the unique identification number of the target song.

Step S402: Perform mel spectrum feature extraction on the voice audio to obtain the real mel spectrum features of the target user.

Step S403: Obtain the song template corresponding to the unique identification number according to the unique identification number of the target song.

For descriptions of steps S401 to S403, reference may be made to the foregoing embodiments, and details are not repeated here.
Step S404: Input the real mel spectrum features of the target user into a reference encoder to obtain a timbre latent space distribution vector of the target user.

Here, the reference encoder is an encoder used to process the real mel spectrum features to obtain the timbre latent space distribution vector; the timbre latent space distribution vector output by the reference encoder can be regarded as a hidden-layer variable corresponding to the real mel spectrum features.

It can be understood that the timbre latent space distribution vector follows a spherical Gaussian distribution. In the embodiments of the present disclosure, when outputting the timbre latent space distribution vector of the target user, the reference encoder can also output the mean and variance of the spherical Gaussian distribution.
Step S405: Input the timbre latent space distribution vector into an autoregressive encoder to obtain a timbre distribution vector of the target user, where the timbre distribution vector is obtained by the autoregressive encoder sampling from the timbre latent space distribution vector.

Here, the autoregressive encoder is an encoder used to process the timbre latent space distribution vector to obtain the timbre distribution vector.

It can be understood that the reference encoder and the autoregressive encoder may be structured as multiple linear layers or convolutional layers, which is not limited here.

Step S406: Use the timbre distribution vector as the timbre feature vector of the target user.
That is to say, the timbre encoding sub-model may include a reference encoder and an autoregressive encoder. After the song template corresponding to the unique identification number is obtained according to the unique identification number of the target song, the real mel spectrum features of the target user can be input into the reference encoder to obtain the timbre latent space distribution vector of the target user, and the timbre latent space distribution vector can be input into the autoregressive encoder to obtain the timbre distribution vector of the target user, where the timbre distribution vector is obtained by the autoregressive encoder sampling from the timbre latent space distribution vector; the timbre distribution vector is then used as the timbre feature vector of the target user. This effectively reduces redundant information in the resulting timbre feature vector while converting the relatively complex real mel spectrum features into vector form, effectively improving the practicality of the resulting timbre feature vector.
For example, as shown in Figure 5, which is a schematic structural diagram of a timbre encoding sub-model proposed by an embodiment of the present disclosure, the random sampling point ε is a random sampling point of a Gaussian distribution, which can be expressed as ε ~ N(0, I).

After receiving the real mel spectrum features, the timbre encoding sub-model can process them through the reference encoder to obtain the timbre latent space distribution vector h and two parameters, which serve respectively as the mean a1 and the variance b1 of a Gaussian distribution. The random sampling point ε is combined with the mean a1 and the variance b1 to obtain a random sampling point z of the approximate posterior distribution (z = b1 ⊙ ε + a1, where ⊙ denotes element-wise multiplication), and the random sampling point z and the timbre latent space distribution vector h are processed by the autoregressive encoder to obtain a mean a2 and a variance b2 corresponding to the random sampling point z; the timbre feature vector s is then obtained from the random sampling point z, the mean a2, and the variance b2 as s = b2 ⊙ z + a2.
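A minimal sketch of this two-step sampling, assuming the reference encoder and autoregressive encoder are given as callables returning (h, a1, b1) and (a2, b2) respectively; the interfaces are placeholders, not the disclosed implementation:

```python
import torch

def sample_timbre_vector(ref_encoder, ar_encoder, mel):
    """IAF-style sampling of a timbre feature vector from real mel features.

    ref_encoder(mel) -> (h, a1, b1): latent vector plus Gaussian mean/std.
    ar_encoder(z, h) -> (a2, b2):    mean/std conditioned on z and h.
    """
    h, a1, b1 = ref_encoder(mel)
    eps = torch.randn_like(a1)   # random sampling point eps ~ N(0, I)
    z = b1 * eps + a1            # z = b1 (.) eps + a1, element-wise
    a2, b2 = ar_encoder(z, h)
    s = b2 * z + a2              # timbre feature vector s = b2 (.) z + a2
    return s
```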
It can be understood that the timbre encoding sub-model may follow a sampling process based on an inverse autoregressive flow (IAF), which belongs to the family of normalizing flows. Normalizing flows produce distributions that are easy to sample: through a series of invertible transformations, a normalizing flow can convert a complex input distribution into a tractable probability distribution, and the output distribution is usually chosen to be an isotropic unit Gaussian, i.e., a spherical unit Gaussian, allowing smooth interpolation and efficient sampling. By learning the timbre feature vector with an inverse autoregressive flow, the generated timbre latent space distribution vector h follows a spherical Gaussian distribution, so the timbre feature vector can be obtained by sampling from this distribution, and a more accurate vector distribution can be learned even for users not processed before. In both the training and inference stages, samples are drawn from the spherical Gaussian distribution to represent the timbre feature vector, which ensures consistency between training and inference and better suits users outside the training set. At the same time, sampling the user's timbre latent space distribution vector h rather than averaging it further increases the convergence of the user space, allowing smoother interpolation between timbre feature vectors; that is, the timbre feature vector of a user outside the training set can be learned from a single sentence of that user's audio.
Step S407: Input the phoneme sequence into the text encoding sub-model to obtain the text feature vector of the lyric text in the song template.

For the description of step S407, reference may be made to the foregoing embodiments, and details are not repeated here.
Step S408: From the text feature vector, determine an initial text encoding corresponding to each phoneme in the phoneme sequence.

Here, an initial text encoding refers to a text encoding contained in the text feature vector.

In the embodiments of the present disclosure, determining from the text feature vector the initial text encoding corresponding to each phoneme in the phoneme sequence provides reliable reference data for the subsequent determination of the target text encodings.
Step S409: Determine the first frame number corresponding to each phoneme according to the phoneme durations.

Here, the first frame number refers to the number of audio frames corresponding to each phoneme. For example, the phoneme duration of a phoneme may be 25 ms; if an audio frame is set to 5 ms, the phoneme corresponds to 5 frames of information.
Step S410: Copy the initial text encoding, and splice the first-frame-number of copies of the initial text encoding to obtain a target text encoding.

Here, a target text encoding refers to the text encoding obtained by splicing the first frame number of copies of the initial text encoding.

It can be understood that the phoneme duration of a phoneme may be short, so there may be considerable redundant information among the multiple audio frames corresponding to the same phoneme; copying the initial text encoding and splicing the copied first-frame-number of initial text encodings into the target text encoding effectively improves the practicality of the resulting target text encoding.

Step S411: Form the frame-level text feature vector from the multiple target text encodings.

That is to say, in the embodiments of the present disclosure, after the phoneme sequence is input into the text encoding sub-model to obtain the text feature vector of the lyric text in the song template, the initial text encoding corresponding to each phoneme in the phoneme sequence can be determined from the text feature vector; the first frame number corresponding to each phoneme is determined according to the phoneme durations; the initial text encoding is copied and the copied first-frame-number of initial text encodings are spliced to obtain the target text encoding; and the frame-level text feature vector is formed from the multiple target text encodings. Since the time range corresponding to each phoneme is small and the content represented by different audio frames of the same phoneme is highly similar, obtaining the target text encoding by copying and splicing greatly reduces the computational cost, effectively improving the efficiency of determining the frame-level text feature vector.
Step S412: Determine a second frame number of the voice audio according to the phoneme durations.

Here, the second frame number refers to the number of frames of the voice audio determined based on the phoneme durations.

Step S413: Copy the timbre feature vector, and splice the second-frame-number of copies of the timbre feature vector to obtain the frame-level timbre feature vector.

That is to say, in the embodiments of the present disclosure, after the frame-level text feature vector is formed from the multiple target text encodings, the second frame number of the voice audio can be determined according to the phoneme durations, and the timbre feature vector can be copied and the copied second-frame-number of timbre feature vectors spliced to obtain the frame-level timbre feature vector. The resulting frame-level timbre feature vector can effectively represent the relevant feature information of the speech to be processed at the granularity of audio frames, effectively improving the fit between the resulting frame-level timbre feature vector and the frame-level text feature vector while effectively improving the representation effect of the resulting frame-level timbre feature vector.
Step S414: Add the frame-level text feature vector, the frame-level timbre feature vector, and the song melody information, and input the result into the acoustic decoding sub-model to obtain the target mel spectrum features.

Step S415: Generate the target song according to the target mel spectrum features.

For descriptions of steps S414 to S415, reference may be made to the foregoing embodiments, and details are not repeated here.
For example, as shown in Figure 6, which is a schematic diagram of a song generation flow proposed by an embodiment of the present disclosure, after a new user provides a piece of voice audio, the operation flow of the song generation model may include: (1) processing the voice audio with an acoustic feature extraction method to obtain the real mel spectrum features; (2) inputting the real mel spectrum features into the timbre encoding sub-model to obtain the timbre feature vector; (3) inputting the phoneme sequence in the song template into the text encoding sub-model to obtain the text feature vector of the target song; (4) inputting the phoneme durations, the timbre feature vector, and the text feature vector into the duration regularization sub-module to obtain the frame-level text feature vector and the frame-level timbre feature vector; (5) adding the frame-level text feature vector, the frame-level timbre feature vector, the song note sequence, and the song energy sequence, and inputting the result into the acoustic decoding sub-model to obtain the target mel spectrum features; and (6) inputting the resulting target spectrum features into a vocoder to obtain the target song. The vocoder may be a neural network vocoder.

That is to say, in the song generation process of the embodiments of the present disclosure, multiple users can share the pre-trained song generation model, and a song performed in a user's voice can be obtained from a single piece of that user's audio, effectively improving the convenience of the song generation process while reducing computing resources and storage costs.
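A minimal end-to-end sketch of this inference flow, assuming the sub-models and template fields are available under hypothetical names (none of these interfaces are disclosed APIs):

```python
import numpy as np

def generate_song(voice_audio, template, models):
    """Sketch of the Figure 6 flow; every callable here is an assumption."""
    mel = models.extract_mel(voice_audio)                 # (1) acoustic features
    timbre = models.timbre_encoder(mel)                   # (2) timbre feature vector
    text = models.text_encoder(template.phonemes)         # (3) text feature vectors
    text_f = np.repeat(text, template.durations, axis=0)  # (4) duration regularization
    timbre_f = np.repeat(timbre[None, :], len(text_f), axis=0)
    dec_in = text_f + timbre_f + template.note_sequence + template.energy_sequence  # (5)
    target_mel = models.acoustic_decoder(dec_in)
    return models.vocoder(target_mel)                     # (6) waveform of target song
```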
In this embodiment, the real mel spectrum features of the target user are input into the reference encoder to obtain the timbre latent space distribution vector of the target user, and the timbre latent space distribution vector is input into the autoregressive encoder to obtain the timbre distribution vector of the target user, where the timbre distribution vector is obtained by the autoregressive encoder sampling from the timbre latent space distribution vector; the timbre distribution vector is used as the timbre feature vector of the target user, which effectively reduces redundant information in the resulting timbre feature vector while converting the relatively complex real mel spectrum features into vector form, effectively improving the practicality of the resulting timbre feature vector. From the text feature vector, the initial text encoding corresponding to each phoneme in the phoneme sequence is determined; the first frame number corresponding to each phoneme is determined according to the phoneme durations; the initial text encoding is copied and the copied first-frame-number of initial text encodings are spliced to obtain the target text encoding; and the frame-level text feature vector is formed from the multiple target text encodings. Since the time range corresponding to each phoneme is small and the content represented by different audio frames of the same phoneme is highly similar, obtaining the target text encoding by copying and splicing greatly reduces the computational cost and effectively improves the efficiency of determining the frame-level text feature vector. The second frame number of the voice audio is determined according to the phoneme durations, and the timbre feature vector is copied and the copied second-frame-number of timbre feature vectors are spliced to obtain the frame-level timbre feature vector, so the resulting frame-level timbre feature vector can effectively represent the relevant feature information of the speech to be processed at the granularity of audio frames, effectively improving the fit between the frame-level timbre feature vector and the frame-level text feature vector while improving the representation effect of the resulting frame-level timbre feature vector.
Figure 7 is a schematic flowchart of a training method for a song generation model proposed by an embodiment of the present disclosure.

It should be noted that the execution subject of the training method for the song generation model in this embodiment is a training apparatus for the song generation model. The apparatus may be implemented in software and/or hardware and may be configured in an electronic device. The electronic device may include, but is not limited to, a terminal, a server, and the like; for example, the terminal may be a smartphone, a smart TV, a smart watch, a smart car, or the like.

As shown in Figure 7, the training method for the song generation model may include, but is not limited to, the following steps:
Step S701: Obtain a training set. The training set comes from multiple sampling users and includes multiple samples; one sampling user corresponds to at least one sample, and each sample includes the singing audio picked up when the sampling user sings a certain song and the lyric text corresponding to the singing audio.

In the embodiments of the present disclosure, the training set may be obtained by establishing in advance a communication link between the execution subject of the embodiments of the present disclosure and a big data server and then obtaining the training set from the big data server, or the training set may be obtained from multiple sampling users via a sample collection apparatus, which is not limited here.
Step S702: Obtain a pre-built initial neural network model, where the initial neural network model includes initial weight parameters and a loss function.

Here, a neural network model is a complex network system formed by a large number of simple processing units (called neurons) that are widely interconnected, reflecting many basic characteristics of human brain function. The initial neural network model refers to the neural network model to be trained. The initial weight parameters refer to the weight parameters to be iteratively updated during model training. The loss function can be used to describe the error between the predicted mel spectrum features output by the initial neural network model during training and the real mel spectrum features.

In the embodiments of the present disclosure, model performance can be evaluated in real time during training based on the loss function, so as to judge in a timely manner whether the model has converged.
Step S703: Obtain the first sample from the training set, and input the first sample into the initial neural network model to obtain real mel spectrum features and predicted mel spectrum features, where the real mel spectrum features represent the mel spectrum features of the singing audio in the first sample, and the predicted mel spectrum features represent the mel spectrum features predicted by the initial neural network model.

Here, the first sample refers to the first of the multiple samples in the training set to be used for model training.

In the embodiments of the present disclosure, when obtaining the first sample from the training set, a sample may be taken at random from the training set as the first sample, or the first sample may be obtained from the training set based on the numbering information of the multiple samples in the training set, which is not limited here.
Optionally, in some embodiments, when the first sample is input into the initial neural network model to obtain the real mel spectrum features and the predicted mel spectrum features, the lyric text in the first sample may be text-transcribed to obtain a phoneme sequence, and the singing audio in the first sample may be aligned according to the phoneme sequence to obtain phoneme durations; acoustic feature extraction is performed on the singing audio in the first sample to obtain the real mel spectrum features, the audio energy, and the fundamental frequency trajectory of the first sample; the phoneme sequence is input into the initial text encoding sub-model to obtain the text feature vector of the first sample; the real mel spectrum features of the first sample are input into the initial timbre encoding sub-model to obtain the timbre feature vector of the first sample; the text feature vector and the timbre feature vector are duration-regularized according to the phoneme durations to obtain the frame-level text feature vector and the frame-level timbre feature vector; and the frame-level text feature vector, the frame-level timbre feature vector, the audio energy, and the fundamental frequency trajectory are added and input into the initial acoustic decoding sub-model to obtain the predicted mel spectrum features of the first sample. In this way, different methods can be used during model training to extract multiple features from the lyric text and the singing audio, and the resulting features can be converted into vector form, which quantifies the features while making it convenient to add the resulting feature vectors for feature fusion, effectively improving the accuracy with which the resulting predicted mel spectrum features describe the sample features.

Here, the initial text encoding sub-model refers to the text encoding sub-model to be trained, the initial timbre encoding sub-model refers to the timbre encoding sub-model to be trained, and the initial acoustic decoding sub-model refers to the acoustic decoding sub-model to be trained.

Here, the audio energy refers to the energy information corresponding to the singing audio in the first sample.

Here, the fundamental frequency trajectory refers to the trajectory information of the fundamental frequency of the singing audio in the first sample.
Step S704: Calculate the error between the predicted mel spectrum features and the real mel spectrum features according to the loss function.

Here, the error can be used to describe the difference between the predicted mel spectrum features and the real mel spectrum features.

In the embodiments of the present disclosure, calculating the error between the predicted mel spectrum features and the real mel spectrum features according to the loss function allows the output accuracy of the initial neural network model to be evaluated in real time to determine the model performance, and the resulting error provides reliable reference data for determining the direction of model optimization.
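The disclosure does not fix a particular form for the loss; as a sketch only, a common choice for mel spectrum regression is an L1 (or L2) distance over frames and mel bins:

```python
import torch
import torch.nn.functional as F

def mel_loss(pred_mel: torch.Tensor, real_mel: torch.Tensor) -> torch.Tensor:
    """Error between predicted and real mel spectrum features.

    Both tensors have shape (frames, n_mels); L1 is one common choice here.
    """
    return F.l1_loss(pred_mel, real_mel)
```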
Step S705: Adjust the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model.

In the embodiments of the present disclosure, adjusting the initial weight parameters of the initial neural network model according to the error allows the initial weight parameters to be adjusted accurately based on the error, effectively improving the training effect of the neural network model.

Step S706: Obtain subsequent samples one by one from the training set, and repeatedly input the subsequent samples into the latest neural network model until the loss function converges, obtaining the trained song generation model.

Here, the subsequent samples refer to the samples in the training set other than the first sample.
For example, as shown in Figure 8, which is a training flowchart of an initial neural network model proposed by an embodiment of the present disclosure, the initial neural network model may include an initial timbre encoding sub-model, an initial text encoding sub-model, and an initial acoustic decoding sub-model, and the training flow may include: (1) the song lyrics are text-transcribed to obtain the corresponding phoneme sequence, and the resulting phoneme sequence is processed by the initial text encoding sub-model to obtain the corresponding text feature vector; (2) the text feature vector and the song phoneme durations are processed based on forced alignment to obtain the initial text encodings; (3) the song audio is processed based on acoustic feature extraction to obtain the real mel spectrum features, the song energy features, and the song fundamental frequency features; (4) the real mel spectrum features are processed by the initial timbre encoding sub-model to obtain the timbre feature vector; (5) the multiple energy values in the song audio energy features are divided into different energy bands (for example, energy values in the range 0-10 may be divided into 10 or 20 energy bands depending on the application environment), and the song energy features are processed based on the one-hot encoding method to obtain the song energy sequence; (6) based on relevant data in the music field, the song fundamental frequency features are quantized to obtain the song note sequence, for example, the note number corresponding to the fundamental frequency 261.63 Hz is 60, and the note number corresponding to the fundamental frequency 277.18 Hz is 61; (7) the duration regularization method is used to process the initial text encodings based on the song phoneme durations to obtain the frame-level text feature vector; (8) the duration regularization sub-model processes the timbre feature vector based on the song phoneme durations to obtain the frame-level timbre feature vector; (9) the frame-level text feature vector, the frame-level timbre feature vector, the song energy sequence, and the song note sequence are input into the acoustic decoding sub-model to obtain the predicted mel spectrum features; and (10) the loss function of the song generation model is determined based on the real mel spectrum features and the predicted spectrum features. With this loss function, each weight parameter in the song generation model can be iteratively updated by gradient backpropagation so that the loss function tends to converge.
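A minimal sketch of the sample-by-sample update loop described above; the model interface, the L1 loss, the optimizer choice, and the convergence test are all assumptions:

```python
import torch
import torch.nn.functional as F

def train_song_model(model, samples, lr=1e-4, tol=1e-4):
    """Iterate over samples, backpropagating until the loss converges (sketch)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    prev_loss = float("inf")
    for sample in samples:  # the first sample, then subsequent samples one by one
        real_mel, pred_mel = model(sample)       # assumed forward interface
        loss = F.l1_loss(pred_mel, real_mel)     # error per the loss function
        opt.zero_grad()
        loss.backward()                          # gradient backpropagation
        opt.step()                               # update the weight parameters
        if abs(prev_loss - loss.item()) < tol:   # crude convergence check
            break
        prev_loss = loss.item()
    return model
```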
In the embodiments of the present disclosure, a training set is obtained, where the training set comes from multiple sampling users and includes multiple samples, one sampling user corresponds to at least one sample, and each sample includes the singing audio picked up when the sampling user sings a certain song and the lyric text corresponding to the singing audio; a pre-built initial neural network model including initial weight parameters and a loss function is obtained; the first sample is obtained from the training set and input into the initial neural network model to obtain real mel spectrum features and predicted mel spectrum features, where the real mel spectrum features represent the mel spectrum features of the singing audio in the first sample and the predicted mel spectrum features represent the mel spectrum features predicted by the initial neural network model; the error between the predicted mel spectrum features and the real mel spectrum features is calculated according to the loss function; the initial weight parameters of the initial neural network model are adjusted according to the error to obtain an updated neural network model; and subsequent samples are obtained one by one from the training set and repeatedly input into the latest neural network model until the loss function converges, obtaining the trained song generation model. The error between the predicted mel spectrum features output by the model and the real mel spectrum features can thus be determined in real time based on the loss function during model training, providing a reliable basis for judging model convergence and effectively improving the output accuracy of the song generation model.
Figure 9 is a schematic structural diagram of a song generation apparatus proposed by an embodiment of the present disclosure.

As shown in Figure 9, the song generation apparatus 90 includes:

a first acquisition module 901, configured to obtain the voice audio input by the target user and the unique identification number of the target song;

a first processing module 902, configured to perform mel spectrum feature extraction on the voice audio to obtain the real mel spectrum features of the target user;

a second acquisition module 903, configured to obtain the song template corresponding to the unique identification number according to the unique identification number of the target song;

a second processing module 904, configured to input the real mel spectrum features of the target user and the song template into a preset song generation model to obtain the target mel spectrum features output by the song generation model, where the song generation model is obtained through machine learning training using a training set, the training set comes from multiple sampling users and includes multiple samples, one sampling user corresponds to at least one sample, and each sample includes the singing audio picked up when the sampling user sings a certain song and the lyric text corresponding to the singing audio; and

a generation module 905, configured to generate the target song according to the target mel spectrum features.
In some embodiments of the present disclosure, as shown in Figure 10, which is a schematic structural diagram of a song generation apparatus proposed by another embodiment of the present disclosure, the song generation model includes a timbre encoding sub-model, a text encoding sub-model, and an acoustic decoding sub-model; the song generation model is obtained by jointly training the timbre encoding sub-model, the text encoding sub-model, and the acoustic decoding sub-model with the same training set.

In some embodiments of the present disclosure, the song template includes lyric text information and song melody information; the lyric text information includes the phoneme sequence and the phoneme durations; and the song melody information includes the song note sequence and the song energy sequence.

In some embodiments of the present disclosure, the second processing module 904 includes: a first processing sub-module 9041, configured to input the real mel spectrum features of the target user into the timbre encoding sub-model to obtain the timbre feature vector of the target user; a second processing sub-module 9042, configured to input the phoneme sequence into the text encoding sub-model to obtain the text feature vector of the lyric text in the song template; a third processing sub-module 9043, configured to perform duration regularization on the text feature vector and the timbre feature vector according to the phoneme durations to obtain the frame-level text feature vector and the frame-level timbre feature vector; and a fourth processing sub-module 9044, configured to add the frame-level text feature vector, the frame-level timbre feature vector, and the song melody information, and input the result into the acoustic decoding sub-model to obtain the target mel spectrum features.
In some embodiments of the present disclosure, the first processing sub-module 9041 is specifically configured to: input the real mel spectrum features of the target user into the reference encoder to obtain the timbre latent space distribution vector of the target user; input the timbre latent space distribution vector into the autoregressive encoder to obtain the timbre distribution vector of the target user, where the timbre distribution vector is obtained by the autoregressive encoder sampling from the timbre latent space distribution vector; and use the timbre distribution vector as the timbre feature vector of the target user.

In some embodiments of the present disclosure, the third processing sub-module 9043 is specifically configured to: determine, from the text feature vector, the initial text encoding corresponding to each phoneme in the phoneme sequence; determine the first frame number corresponding to each phoneme according to the phoneme durations; copy the initial text encoding and splice the copied first-frame-number of initial text encodings to obtain the target text encoding; and form the frame-level text feature vector from the multiple target text encodings.

In some embodiments of the present disclosure, the third processing sub-module 9043 is further configured to: determine the second frame number of the voice audio according to the phoneme durations; and copy the timbre feature vector and splice the copied second-frame-number of timbre feature vectors to obtain the frame-level timbre feature vector. In some embodiments of the present disclosure, the song template is configured from the phoneme sequence, phoneme durations, song note sequence, and song energy sequence of the target song, together with the unique identification number of the target song, where the phoneme sequence and phoneme durations of the target song are determined from the song audio and song lyrics of the target song, and the song note sequence and song energy sequence of the target song are determined from the song audio.
In some embodiments of the present disclosure, the phoneme sequence includes multiple phonemes obtained by parsing the song lyrics, and the phoneme durations include the first frame number occupied by each phoneme in the song audio.

In some embodiments of the present disclosure, the song energy sequence is obtained by quantizing the song energy features of the song audio, and the song note sequence is obtained by quantizing the song fundamental frequency features of the song audio.

In some embodiments of the present disclosure, the song energy features include multiple energy values; the song energy sequence is formed from multiple range encoding values, and each range encoding value is obtained by one-hot encoding the energy range corresponding to an energy value.

In some embodiments of the present disclosure, the song fundamental frequency features include multiple fundamental frequency values, and the song note sequence includes a note number corresponding to each fundamental frequency value.
It should be noted that the foregoing explanation of the song generation method also applies to the song generation apparatus of this embodiment, and details are not repeated here.

In this embodiment, the voice audio input by the target user and the unique identification number of the target song are obtained; mel spectrum feature extraction is performed on the voice audio to obtain the real mel spectrum features of the target user; the song template corresponding to the unique identification number is obtained according to the unique identification number of the target song; the real mel spectrum features of the target user and the song template are input into a preset song generation model to obtain the target mel spectrum features output by the song generation model; and the target song is generated according to the target mel spectrum features. The real mel spectrum features of the target user and the song template corresponding to the target song can thus be effectively combined in the song generation process to effectively reduce the dependence on the amount of user voice data, improving the convenience of song generation while effectively improving the song generation effect.
Figure 11 is a schematic structural diagram of a training apparatus for a song generation model proposed by an embodiment of the present disclosure.

As shown in Figure 11, the training apparatus 110 for the song generation model includes: a third acquisition module 1101, configured to obtain a training set, where the training set comes from multiple sampling users and includes multiple samples, one sampling user corresponds to at least one sample, and each sample includes the singing audio picked up when the sampling user sings a certain song and the lyric text corresponding to the singing audio; a fourth acquisition module 1102, configured to obtain a pre-built initial neural network model, where the initial neural network model includes initial weight parameters and a loss function; a fifth acquisition module 1103, configured to obtain the first sample from the training set and input the first sample into the initial neural network model to obtain real mel spectrum features and predicted mel spectrum features, where the real mel spectrum features represent the mel spectrum features of the singing audio in the first sample and the predicted mel spectrum features represent the mel spectrum features predicted by the initial neural network model; a third processing module 1104, configured to calculate the error between the predicted mel spectrum features and the real mel spectrum features according to the loss function; a fourth processing module 1105, configured to adjust the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model; and a sixth acquisition module 1106, configured to obtain subsequent samples one by one from the training set and repeatedly input the subsequent samples into the latest neural network model until the loss function converges, obtaining the trained song generation model.
在本公开的一些实施例中,如图12所示,图12是本公开另一实施例提出的歌曲生成模型的训练装置的结构示意图,其中,初始神经网络模型包括:初始音色编码子模型、初始文本编码子模型,以及初始声学解码子模型;第五获取模块1103,包括:第五处理子模块11031,用于对首个样本中的歌词文本进行文本转写,得到音素序列,并根据音素序列对首个样本中的歌唱音频对进行对齐,得到音素时长;第六处理子模块11032,用于对首个样本中的歌唱音频进行声学特征提取,得到首个样本的真实梅尔谱特征、音频能量和基频轨迹;第七处理子模块11033,用于将音素序列输入至初始文本编码子模型中,得到首个样本的文本特征向量;第八处理子模块11034,用于将首个样本的真实梅尔谱特征输入至初始音色编码子模型中,得到首个样本的音色特征向量;第九处理子模块11035,用于根据音素时长对文本特征向量和音色特征向量进行时长规整,得到帧级文本特征向量和帧级音色特征向量;第十处理子模块11036,用于将帧级文本特征向量、帧级音色特征向量、音频能量,以及基频轨迹进行相加后输入至初始声学解码子模型,得到首个样本的预测梅尔谱特征。In some embodiments of the present disclosure, as shown in Figure 12, which is a schematic structural diagram of a training device for a song generation model proposed by another embodiment of the present disclosure, the initial neural network model includes: an initial timbre encoding sub-model, Initial text encoding sub-model, and initial acoustic decoding sub-model; the fifth acquisition module 1103 includes: the fifth processing sub-module 11031, which is used to transcribe the lyrics text in the first sample to obtain the phoneme sequence, and according to the phoneme The sequence aligns the singing audio pairs in the first sample to obtain the phoneme duration; the sixth processing sub-module 11032 is used to extract the acoustic features of the singing audio in the first sample to obtain the real mel spectrum characteristics of the first sample. Audio energy and fundamental frequency trajectory; the seventh processing sub-module 11033 is used to input the phoneme sequence into the initial text encoding sub-model to obtain the text feature vector of the first sample; the eighth processing sub-module 11034 is used to convert the first sample The real mel spectrum features are input into the initial timbre encoding sub-model to obtain the timbre feature vector of the first sample; the ninth processing sub-module 11035 is used to duration regularize the text feature vector and timbre feature vector according to the phoneme duration to obtain the frame level text feature vector and frame level timbre feature vector; the tenth processing submodule 11036 is used to add the frame level text feature vector, frame level timbre feature vector, audio energy, and fundamental frequency trajectory and input them into the initial acoustic decoding sub-module model to obtain the predicted Mel spectrum characteristics of the first sample.
It should be noted that the foregoing explanation of the training method for the song generation model also applies to the training apparatus for the song generation model in this embodiment, and is not repeated here.
In this embodiment, a training set is acquired, where the training set comes from multiple sampling users and includes multiple samples, each sampling user corresponds to at least one sample, and each sample includes the singing audio picked up while the sampling user sings a certain song and the lyric text corresponding to that singing audio; a pre-built initial neural network model including initial weight parameters and a loss function is acquired; the first sample is obtained from the training set and input into the initial neural network model to obtain a real Mel spectrum feature, representing the Mel spectrum feature of the singing audio in the first sample, and a predicted Mel spectrum feature, representing the Mel spectrum feature predicted by the initial neural network model; the error between the predicted and real Mel spectrum features is calculated according to the loss function; the initial weight parameters are adjusted according to the error to obtain an updated neural network model; and subsequent samples are obtained one by one from the training set and repeatedly input into the latest neural network model until the loss function converges, yielding the trained song generation model. In this way, the error between the model's predicted Mel spectrum feature and the real Mel spectrum feature can be determined in real time during training based on the loss function, providing a reliable basis for judging model convergence and effectively improving the output accuracy of the song generation model.
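The loop below is a minimal sketch of this procedure, reusing the hypothetical InitialSongModel above. The Adam optimizer, L1 loss, and stagnant-loss convergence test are illustrative choices: the embodiment only requires that the error between predicted and real Mel spectrum features drive the weight updates until the loss function converges.

```python
import torch

def train_song_model(model, training_set, lr=1e-4, tol=1e-4, max_epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.L1Loss()        # assumed loss between predicted and real Mel
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for sample in training_set:      # samples obtained one by one from the set
            pred_mel = model(sample["phonemes"], sample["durations"],
                             sample["real_mel"], sample["energy"], sample["f0"])
            loss = criterion(pred_mel, sample["real_mel"])  # error vs. real Mel feature
            optimizer.zero_grad()
            loss.backward()              # adjust weights according to the error
            optimizer.step()
            epoch_loss += loss.item()
        if abs(prev_loss - epoch_loss) < tol:  # treat a stagnant loss as convergence
            break
        prev_loss = epoch_loss
    return model
```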
Figure 13 shows a block diagram of an exemplary electronic device suitable for implementing embodiments of the present disclosure. The electronic device 12 shown in Figure 13 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in Figure 13, the electronic device 12 takes the form of a general-purpose computing device. The components of the electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the various system components (including the system memory 28 and the processing unit 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The electronic device 12 typically includes a variety of computer-system-readable media. These media may be any available media accessible by the electronic device 12, including volatile and non-volatile media, and removable and non-removable media.
The memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, a storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Figure 13, commonly referred to as a "hard drive").
Although not shown in Figure 13, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk") may be provided, as well as an optical disc drive for reading from and writing to a removable non-volatile optical disc (e.g., a Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disc Read-Only Memory (DVD-ROM), or other optical media). In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the embodiments of the present disclosure.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present disclosure.
The electronic device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the electronic device 12, and/or with any device (e.g., a network card, a modem, etc.) that enables the electronic device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, the electronic device 12 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown, the network adapter 20 communicates with the other modules of the electronic device 12 through the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes programs stored in the system memory 28 so as to perform various functional applications and data processing, for example, implementing the song generation method and the training method for the song generation model mentioned in the foregoing embodiments.
To implement the above embodiments, the present disclosure further provides a non-transitory computer-readable storage medium having a computer program stored thereon, where the program, when executed by a processor, implements the song generation method and the training method for the song generation model proposed in the foregoing embodiments of the present disclosure.
To implement the above embodiments, the present disclosure further provides a computer program product; when the instructions in the computer program product are executed by a processor, the song generation method and the training method for the song generation model proposed in the foregoing embodiments of the present disclosure are performed.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs. When the computer program is loaded and executed on a computer, the processes or functions described in the embodiments of the present disclosure are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatus. The computer program may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer program may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., digital video discs (DVD)), or semiconductor media (e.g., solid state disks (SSD)), and the like.
Those of ordinary skill in the art will understand that the various ordinal terms such as "first" and "second" used in the present disclosure are merely for convenience of description; they are not intended to limit the scope of the embodiments of the present disclosure, nor do they indicate any particular order.
"At least one" in the present disclosure may also be described as one or more, and "multiple" may be two, three, four, or more, which is not limited by the present disclosure. In the embodiments of the present disclosure, a technical feature is distinguished from others of its kind by terms such as "first", "second", "third", "A", "B", "C", and "D"; the technical features described by these terms carry no order of precedence or magnitude.
The correspondences shown in the tables of the present disclosure may be configured or predefined. The values of the information in the tables are merely examples and may be configured to other values, which is not limited by the present disclosure. When configuring the correspondence between information and parameters, it is not necessarily required that all the correspondences illustrated in the tables be configured. For example, the correspondences shown in some rows of the tables in the present disclosure may not be configured. As another example, appropriate adjustments such as splitting and merging may be made on the basis of the above tables. The names of the parameters shown in the headers of the above tables may also be other names understandable by a communication apparatus, and the values or representations of the parameters may also be other values or representations understandable by a communication apparatus. When the above tables are implemented, other data structures may also be used, such as arrays, queues, containers, stacks, linear lists, pointers, linked lists, trees, graphs, structures, classes, heaps, or hash tables.
"Predefined" in the present disclosure may be understood as defined, pre-defined, stored, pre-stored, pre-negotiated, pre-configured, solidified, or pre-burned.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present disclosure.
Those skilled in the art can clearly understand that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the systems, apparatuses, and units described above, which are not repeated here.
The above are merely specific embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed by the present disclosure, and all such changes or substitutions shall fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (31)

  1. A song generation method, comprising:
    acquiring voice audio input by a target user and a unique identification number of a target song;
    performing Mel spectrum feature extraction on the voice audio to obtain a real Mel spectrum feature of the target user;
    acquiring, according to the unique identification number of the target song, a song template corresponding to the unique identification number;
    inputting the real Mel spectrum feature of the target user and the song template into a preset song generation model to obtain a target Mel spectrum feature output by the song generation model, wherein the song generation model is obtained through machine learning training using a training set, the training set comes from multiple sampling users and includes multiple samples, each sampling user corresponds to at least one of the samples, and each of the samples includes: singing audio picked up while the sampling user sings a certain song and lyric text corresponding to the singing audio; and
    generating the target song according to the target Mel spectrum feature.
  2. The method according to claim 1, wherein the song generation model comprises a timbre encoding sub-model, a text encoding sub-model, and an acoustic decoding sub-model, and the song generation model is obtained by jointly training the timbre encoding sub-model, the text encoding sub-model, and the acoustic decoding sub-model using the same training set.
  3. The method according to claim 2, wherein:
    the song template comprises lyric text information and song melody information;
    the lyric text information comprises a phoneme sequence and phoneme durations; and
    the song melody information comprises a song note sequence and a song energy sequence.
  4. The method according to claim 3, wherein inputting the real Mel spectrum feature of the target user and the song template into the preset song generation model to obtain the target Mel spectrum feature output by the song generation model comprises:
    inputting the real Mel spectrum feature of the target user into the timbre encoding sub-model to obtain a timbre feature vector of the target user;
    inputting the phoneme sequence into the text encoding sub-model to obtain a text feature vector of the lyric text in the song template;
    performing duration regularization on the text feature vector and the timbre feature vector according to the phoneme durations to obtain a frame-level text feature vector and a frame-level timbre feature vector; and
    summing the frame-level text feature vector, the frame-level timbre feature vector, and the song melody information, and inputting the result into the acoustic decoding sub-model to obtain the target Mel spectrum feature.
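For illustration only and not part of the claims: a minimal sketch of the inference pipeline of claim 4, assuming the hypothetical sub-module attributes of the training sketch in the description above and a song template carrying phonemes, durations, notes, and energy as tensors.

```python
import torch

@torch.no_grad()
def generate_target_mel(model, user_mel, template):
    # user_mel: the target user's real Mel spectrum feature, shape (T_user, n_mels).
    text_vec = model.text_encoder(template["phonemes"])       # phoneme-level text features
    _, h = model.timbre_encoder(user_mel.unsqueeze(0))        # timbre from the user's voice
    timbre_vec = h[-1].squeeze(0)
    # Duration regularization to frame level (detailed in claims 6 and 7 below).
    frame_text = torch.repeat_interleave(text_vec, template["durations"], dim=0)
    frame_timbre = timbre_vec.expand_as(frame_text)
    # Song melody information: note sequence plus energy sequence from the template.
    melody = template["notes"].unsqueeze(-1) + template["energy"].unsqueeze(-1)
    return model.decoder(frame_text + frame_timbre + melody)  # target Mel feature
```

A vocoder would then convert the returned Mel feature into the target song waveform, matching the final step of claim 1.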
  5. The method according to claim 4, wherein the timbre encoding sub-model comprises a reference encoder and an autoregressive encoder, and inputting the real Mel spectrum feature of the target user into the timbre encoding sub-model to obtain the timbre feature vector of the target user comprises:
    inputting the real Mel spectrum feature of the target user into the reference encoder to obtain a timbre latent space distribution vector of the target user;
    inputting the timbre latent space distribution vector into the autoregressive encoder to obtain a timbre distribution vector of the target user, wherein the timbre distribution vector is obtained by the autoregressive encoder sampling the timbre latent space distribution vector; and
    using the timbre distribution vector as the timbre feature vector of the target user.
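For illustration only and not part of the claims: claim 5's two-stage timbre encoding could be sketched as below. The Gaussian parameterization and the simple reparameterized sampling are assumptions introduced here; they stand in for the reference encoder's latent distribution and for the autoregressive encoder's sampling step, neither of which the claim specifies in detail.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Assumed reference encoder: Mel feature -> latent timbre distribution."""
    def __init__(self, n_mels=80, d_latent=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True)
        self.mu = nn.Linear(256, d_latent)       # mean of the latent distribution
        self.logvar = nn.Linear(256, d_latent)   # log-variance of the distribution

    def forward(self, mel):                      # mel: (1, T, n_mels)
        _, h = self.rnn(mel)
        return self.mu(h[-1]), self.logvar(h[-1])

def sample_timbre(mu, logvar):
    # Stands in for the autoregressive encoder sampling the latent distribution
    # to produce the timbre distribution vector used as the timbre feature vector.
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
```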
  6. The method according to claim 4, wherein performing duration regularization on the text feature vector according to the phoneme durations to obtain the frame-level text feature vector comprises:
    determining, from the text feature vector, an initial text encoding corresponding to each phoneme in the phoneme sequence;
    determining, according to the phoneme durations, a first frame count corresponding to the phoneme;
    copying the initial text encoding, and concatenating the copied initial text encodings of the first frame count to obtain a target text encoding; and
    forming the frame-level text feature vector from multiple target text encodings.
  7. The method according to claim 4, wherein performing duration regularization on the timbre feature vector according to the phoneme durations to obtain the frame-level timbre feature vector comprises:
    determining a second frame count of the voice audio according to the phoneme durations; and
    copying the timbre feature vector, and concatenating the copied timbre feature vectors of the second frame count to obtain the frame-level timbre feature vector.
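For illustration only and not part of the claims: a minimal sketch of the duration regularization recited in claims 6 and 7, where each phoneme's initial text encoding is copied for its first frame count and the copies are concatenated, while the single timbre feature vector is copied once per frame of the second frame count. Tensor shapes are assumptions.

```python
import torch

def regulate_text(text_vec, durations):
    # text_vec: (P, D), one initial text encoding per phoneme;
    # durations: (P,) first frame counts per phoneme.
    target_codes = [code.repeat(int(n), 1) for code, n in zip(text_vec, durations)]
    return torch.cat(target_codes, dim=0)        # frame-level text feature vector (T, D)

def regulate_timbre(timbre_vec, durations):
    # timbre_vec: (D,), a single utterance-level timbre feature vector.
    total_frames = int(durations.sum())          # second frame count of the voice audio
    return timbre_vec.repeat(total_frames, 1)    # frame-level timbre feature vector (T, D)
```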
  8. The method according to claim 3, wherein the song template is configured from the phoneme sequence, the phoneme durations, the song note sequence, and the song energy sequence of the target song, together with the unique identification number of the target song, wherein the phoneme sequence and the phoneme durations of the target song are determined from the song audio and song lyrics of the target song, and the song note sequence and the song energy sequence of the target song are determined from the song audio.
  9. The method according to claim 8, wherein the phoneme sequence comprises multiple phonemes obtained by parsing the song lyrics, and the phoneme durations comprise the first frame count occupied by each of the phonemes in the song audio.
  10. The method according to claim 8, wherein the song energy sequence is obtained by quantizing a song energy feature of the song audio, and the song note sequence is obtained by quantizing a song fundamental frequency feature of the song audio.
  11. The method according to claim 10, wherein the song energy feature comprises multiple energy values, and the song energy sequence is formed from multiple range encoding values, each range encoding value being obtained by one-hot encoding the energy range corresponding to an energy value.
  12. The method according to claim 10, wherein the song fundamental frequency feature comprises multiple fundamental frequency values, and the song note sequence comprises a note symbol corresponding to each of the fundamental frequency values.
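For illustration only and not part of the claims: a sketch of the quantization recited in claims 10 to 12. The 256 equal-width energy ranges and the MIDI-style note formula are assumptions introduced here; the disclosure fixes neither the bin layout nor the note numbering.

```python
import numpy as np

def energy_to_onehot(energy, n_ranges=256):
    # Map each frame energy value to an energy range, then one-hot encode the range.
    edges = np.linspace(energy.min(), energy.max(), n_ranges + 1)
    idx = np.clip(np.digitize(energy, edges) - 1, 0, n_ranges - 1)
    return np.eye(n_ranges, dtype=np.float32)[idx]   # song energy sequence (T, n_ranges)

def f0_to_notes(f0_hz):
    # Map each fundamental frequency value to a note symbol; 0 marks unvoiced frames.
    notes = np.zeros(f0_hz.shape, dtype=np.int64)
    voiced = f0_hz > 0
    notes[voiced] = np.round(69 + 12 * np.log2(f0_hz[voiced] / 440.0)).astype(np.int64)
    return notes                                     # song note sequence (T,)
```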
  13. A training method for a song generation model, comprising:
    acquiring a training set, wherein the training set comes from multiple sampling users and includes multiple samples, each sampling user corresponds to at least one of the samples, and each of the samples includes: singing audio picked up while the sampling user sings a certain song and lyric text corresponding to the singing audio;
    acquiring a pre-built initial neural network model, wherein the initial neural network model includes initial weight parameters and a loss function;
    obtaining a first sample from the training set, and inputting the first sample into the initial neural network model to obtain a real Mel spectrum feature and a predicted Mel spectrum feature, wherein the real Mel spectrum feature represents the Mel spectrum feature of the singing audio in the first sample, and the predicted Mel spectrum feature represents the Mel spectrum feature predicted by the initial neural network model;
    calculating an error between the predicted Mel spectrum feature and the real Mel spectrum feature according to the loss function;
    adjusting the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model; and
    obtaining subsequent samples one by one from the training set, and repeatedly inputting the subsequent samples into the latest neural network model until the loss function converges, to obtain a trained song generation model.
  14. The method according to claim 13, wherein the initial neural network model comprises an initial timbre encoding sub-model, an initial text encoding sub-model, and an initial acoustic decoding sub-model, and inputting the first sample into the initial neural network model to obtain the real Mel spectrum feature and the predicted Mel spectrum feature comprises:
    transcribing the lyric text in the first sample to obtain a phoneme sequence, and aligning the singing audio in the first sample against the phoneme sequence to obtain phoneme durations;
    performing acoustic feature extraction on the singing audio in the first sample to obtain the real Mel spectrum feature, audio energy, and fundamental frequency trajectory of the first sample;
    inputting the phoneme sequence into the initial text encoding sub-model to obtain a text feature vector of the first sample;
    inputting the real Mel spectrum feature of the first sample into the initial timbre encoding sub-model to obtain a timbre feature vector of the first sample;
    performing duration regularization on the text feature vector and the timbre feature vector according to the phoneme durations to obtain a frame-level text feature vector and a frame-level timbre feature vector; and
    summing the frame-level text feature vector, the frame-level timbre feature vector, the audio energy, and the fundamental frequency trajectory, and inputting the result into the initial acoustic decoding sub-model to obtain the predicted Mel spectrum feature of the first sample.
  15. A song generation apparatus, comprising:
    a first acquisition module, configured to acquire voice audio input by a target user and a unique identification number of a target song;
    a first processing module, configured to perform Mel spectrum feature extraction on the voice audio to obtain a real Mel spectrum feature of the target user;
    a second acquisition module, configured to acquire, according to the unique identification number of the target song, a song template corresponding to the unique identification number;
    a second processing module, configured to input the real Mel spectrum feature of the target user and the song template into a preset song generation model to obtain a target Mel spectrum feature output by the song generation model, wherein the song generation model is obtained through machine learning training using a training set, the training set comes from multiple sampling users and includes multiple samples, each sampling user corresponds to at least one of the samples, and each of the samples includes: singing audio picked up while the sampling user sings a certain song and lyric text corresponding to the singing audio; and
    a generation module, configured to generate the target song according to the target Mel spectrum feature.
  16. The apparatus according to claim 15, wherein the song generation model comprises a timbre encoding sub-model, a text encoding sub-model, and an acoustic decoding sub-model, and the song generation model is obtained by jointly training the timbre encoding sub-model, the text encoding sub-model, and the acoustic decoding sub-model using the same training set.
  17. The apparatus according to claim 16, wherein:
    the song template comprises lyric text information and song melody information;
    the lyric text information comprises a phoneme sequence and phoneme durations; and
    the song melody information comprises a song note sequence and a song energy sequence.
  18. The apparatus according to claim 17, wherein the second processing module comprises:
    a first processing sub-module, configured to input the real Mel spectrum feature of the target user into the timbre encoding sub-model to obtain a timbre feature vector of the target user;
    a second processing sub-module, configured to input the phoneme sequence into the text encoding sub-model to obtain a text feature vector of the lyric text in the song template;
    a third processing sub-module, configured to perform duration regularization on the text feature vector and the timbre feature vector according to the phoneme durations to obtain a frame-level text feature vector and a frame-level timbre feature vector; and
    a fourth processing sub-module, configured to sum the frame-level text feature vector, the frame-level timbre feature vector, and the song melody information, and input the result into the acoustic decoding sub-model to obtain the target Mel spectrum feature.
  19. The apparatus according to claim 18, wherein the first processing sub-module is specifically configured to:
    input the real Mel spectrum feature of the target user into the reference encoder to obtain a timbre latent space distribution vector of the target user;
    input the timbre latent space distribution vector into the autoregressive encoder to obtain a timbre distribution vector of the target user, wherein the timbre distribution vector is obtained by the autoregressive encoder sampling the timbre latent space distribution vector; and
    use the timbre distribution vector as the timbre feature vector of the target user.
  20. The apparatus according to claim 18, wherein the third processing sub-module is specifically configured to:
    determine, from the text feature vector, an initial text encoding corresponding to each phoneme in the phoneme sequence;
    determine, according to the phoneme durations, a first frame count corresponding to the phoneme;
    copy the initial text encoding, and concatenate the copied initial text encodings of the first frame count to obtain a target text encoding; and
    form the frame-level text feature vector from multiple target text encodings.
  21. The apparatus according to claim 18, wherein the third processing sub-module is further configured to:
    determine a second frame count of the voice audio according to the phoneme durations; and
    copy the timbre feature vector, and concatenate the copied timbre feature vectors of the second frame count to obtain the frame-level timbre feature vector.
  22. The apparatus according to claim 17, wherein the song template is configured from the phoneme sequence, the phoneme durations, the song note sequence, and the song energy sequence of the target song, together with the unique identification number of the target song, wherein the phoneme sequence and the phoneme durations of the target song are determined from the song audio and song lyrics of the target song, and the song note sequence and the song energy sequence of the target song are determined from the song audio.
  23. The apparatus according to claim 22, wherein the phoneme sequence comprises multiple phonemes obtained by parsing the song lyrics, and the phoneme durations comprise the first frame count occupied by each of the phonemes in the song audio.
  24. The apparatus according to claim 22, wherein the song energy sequence is obtained by quantizing a song energy feature of the song audio, and the song note sequence is obtained by quantizing a song fundamental frequency feature of the song audio.
  25. The apparatus according to claim 24, wherein the song energy feature comprises multiple energy values, and the song energy sequence is formed from multiple range encoding values, each range encoding value being obtained by one-hot encoding the energy range corresponding to an energy value.
  26. The apparatus according to claim 24, wherein the song fundamental frequency feature comprises multiple fundamental frequency values, and the song note sequence comprises a note symbol corresponding to each of the fundamental frequency values.
  27. A training apparatus for a song generation model, comprising:
    a third acquisition module, configured to acquire a training set, wherein the training set comes from multiple sampling users and includes multiple samples, each sampling user corresponds to at least one of the samples, and each of the samples includes: singing audio picked up while the sampling user sings a certain song and lyric text corresponding to the singing audio;
    a fourth acquisition module, configured to acquire a pre-built initial neural network model, wherein the initial neural network model includes initial weight parameters and a loss function;
    a fifth acquisition module, configured to obtain a first sample from the training set and input the first sample into the initial neural network model to obtain a real Mel spectrum feature and a predicted Mel spectrum feature, wherein the real Mel spectrum feature represents the Mel spectrum feature of the singing audio in the first sample, and the predicted Mel spectrum feature represents the Mel spectrum feature predicted by the initial neural network model;
    a third processing module, configured to calculate an error between the predicted Mel spectrum feature and the real Mel spectrum feature according to the loss function;
    a fourth processing module, configured to adjust the initial weight parameters of the initial neural network model according to the error to obtain an updated neural network model; and
    a sixth acquisition module, configured to obtain subsequent samples one by one from the training set and repeatedly input the subsequent samples into the latest neural network model until the loss function converges, to obtain a trained song generation model.
  28. The apparatus according to claim 27, wherein the initial neural network model comprises an initial timbre encoding sub-model, an initial text encoding sub-model, and an initial acoustic decoding sub-model, and the fifth acquisition module comprises:
    a fifth processing sub-module, configured to transcribe the lyric text in the first sample to obtain a phoneme sequence, and to align the singing audio in the first sample against the phoneme sequence to obtain phoneme durations;
    a sixth processing sub-module, configured to perform acoustic feature extraction on the singing audio in the first sample to obtain the real Mel spectrum feature, audio energy, and fundamental frequency trajectory of the first sample;
    a seventh processing sub-module, configured to input the phoneme sequence into the initial text encoding sub-model to obtain a text feature vector of the first sample;
    an eighth processing sub-module, configured to input the real Mel spectrum feature of the first sample into the initial timbre encoding sub-model to obtain a timbre feature vector of the first sample;
    a ninth processing sub-module, configured to perform duration regularization on the text feature vector and the timbre feature vector according to the phoneme durations to obtain a frame-level text feature vector and a frame-level timbre feature vector; and
    a tenth processing sub-module, configured to sum the frame-level text feature vector, the frame-level timbre feature vector, the audio energy, and the fundamental frequency trajectory and input the result into the initial acoustic decoding sub-model to obtain the predicted Mel spectrum feature of the first sample.
  29. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-12, or to perform the method according to any one of claims 13-14.
  30. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause a computer to perform the method according to any one of claims 1-12, or to perform the method according to any one of claims 13-14.
  31. A computer program product, comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-12, or the steps of the method according to any one of claims 13-14.
PCT/CN2022/099965 2022-06-20 2022-06-20 Song generation method, apparatus, electronic device, and storage medium WO2023245389A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/099965 WO2023245389A1 (en) 2022-06-20 2022-06-20 Song generation method, apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023245389A1 true WO2023245389A1 (en) 2023-12-28

Family

ID=89378987

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/099965 WO2023245389A1 (en) 2022-06-20 2022-06-20 Song generation method, apparatus, electronic device, and storage medium

Country Status (1)

Country Link
WO (1) WO2023245389A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111354332A (en) * 2018-12-05 2020-06-30 北京嘀嘀无限科技发展有限公司 Singing voice synthesis method and device
CN112562633A (en) * 2020-11-30 2021-03-26 北京有竹居网络技术有限公司 Singing synthesis method and device, electronic equipment and storage medium
CN113838443A (en) * 2021-07-19 2021-12-24 腾讯科技(深圳)有限公司 Audio synthesis method and device, computer-readable storage medium and electronic equipment
CN113593520A (en) * 2021-09-08 2021-11-02 广州虎牙科技有限公司 Singing voice synthesis method and device, electronic equipment and storage medium
CN113963717A (en) * 2021-10-27 2022-01-21 广州酷狗计算机科技有限公司 Cross-language song synthesis method and device, equipment, medium and product thereof

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117528151A (en) * 2024-01-04 2024-02-06 深圳和成视讯科技有限公司 Data encryption transmission method and device based on recorder
CN117528151B (en) * 2024-01-04 2024-04-05 深圳和成视讯科技有限公司 Data encryption transmission method and device based on recorder
CN117710543A (en) * 2024-02-04 2024-03-15 淘宝(中国)软件有限公司 Digital person-based video generation and interaction method, device, storage medium, and program product
CN117809621A (en) * 2024-02-29 2024-04-02 暗物智能科技(广州)有限公司 Speech synthesis method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
JP7280386B2 (en) Multilingual speech synthesis and cross-language voice cloning
JP6777768B2 (en) Word vectorization model learning device, word vectorization device, speech synthesizer, their methods, and programs
WO2023245389A1 (en) Song generation method, apparatus, electronic device, and storage medium
WO2022188734A1 (en) Speech synthesis method and apparatus, and readable storage medium
CN115516552A (en) Speech recognition using synthesis of unexplained text and speech
JP2023535230A (en) Two-level phonetic prosodic transcription
WO2022178969A1 (en) Voice conversation data processing method and apparatus, and computer device and storage medium
US11908448B2 (en) Parallel tacotron non-autoregressive and controllable TTS
CN112349289B (en) Voice recognition method, device, equipment and storage medium
CN114038447A (en) Training method of speech synthesis model, speech synthesis method, apparatus and medium
WO2022222757A1 (en) Method for converting text data into acoustic feature, electronic device, and storage medium
US11475874B2 (en) Generating diverse and natural text-to-speech samples
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
Kumar et al. Machine learning based speech emotions recognition system
Basak et al. Challenges and Limitations in Speech Recognition Technology: A Critical Review of Speech Signal Processing Algorithms, Tools and Systems.
CN113593520B (en) Singing voice synthesizing method and device, electronic equipment and storage medium
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
CN113823265A (en) Voice recognition method and device and computer equipment
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
KR20230120790A (en) Speech Recognition Healthcare Service Using Variable Language Model
CN113870827A (en) Training method, device, equipment and medium of speech synthesis model
WO2022039636A1 (en) Method for synthesizing speech and transmitting the authentic intonation of a clonable sample
CN112951270A (en) Voice fluency detection method and device and electronic equipment
WO2024055752A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22947170

Country of ref document: EP

Kind code of ref document: A1