CN116597797A - Song adaptation method, computer device and storage medium - Google Patents

Song adaptation method, computer device and storage medium

Info

Publication number: CN116597797A
Application number: CN202310475334.3A
Authority: CN (China)
Prior art keywords: track, signal, dividing, song, rendering
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 何礼
Current Assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd

Classifications

    • G10H 1/0025: Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece (G: Physics; G10: Musical instruments; acoustics; G10H: Electrophonic musical instruments)
    • G10H 1/0091: Means for obtaining special acoustic effects
    • G10L 25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters (G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding)
    • G10L 25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10H 2210/101: Music composition or musical creation; tools or processes therefor
    • G10H 2210/131: Morphing, i.e. transformation of a musical piece into a new different one, e.g. remix

Abstract

The application relates to a song adaptation method, a computer device and a storage medium. The method comprises: obtaining split-track signals of a song to be adapted and melody information of each split-track signal; performing timbre rendering on the melody information of a split-track signal to obtain a rendered split-track signal of the song to be adapted, the timbre of the rendered split-track signal being different from the timbre of the original split-track signal; performing loudness calibration on the rendered split-track signal according to the original split-track signal to obtain an adapted split-track signal of the song to be adapted; and mixing the adapted split-track signals to obtain a target adapted song of the song to be adapted. By adopting the method, song adaptation efficiency can be improved.

Description

Song adaptation method, computer device and storage medium
Technical Field
The present application relates to the field of audio processing technology, and in particular, to a song adaptation method, a computer device, a storage medium, and a computer program product.
Background
As the barrier to music creation gradually lowers, more and more users want to create songs of their own, and adapting existing songs has become one of the most common ways to do so.
At present, adaptation is usually done manually: a person listens to the original piece, analyzes and notates how it is performed, and then re-performs it to produce the adaptation. This manual approach is costly and its production cycle is long, so its efficiency is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a song adaptation method, a computer device, and a computer-readable storage medium that can improve song adaptation efficiency.
In a first aspect, the present application provides a song adaptation method. The method comprises the following steps:
obtaining a split-track signal of a song to be adapted and melody information of the split-track signal;
performing timbre rendering on the melody information of the split-track signal to obtain a rendered split-track signal of the song to be adapted, the timbre of the rendered split-track signal being different from the timbre of the split-track signal;
performing loudness calibration on the rendered split-track signal according to the split-track signal to obtain an adapted split-track signal of the song to be adapted; and
mixing the adapted split-track signals to obtain a target adapted song of the song to be adapted.
In one embodiment, performing loudness calibration on the rendered split-track signal according to the split-track signal to obtain the adapted split-track signal of the song to be adapted includes:
summing the squares of the split-track signal over the time dimension to obtain a split-track signal to be processed;
summing the squares of the rendered split-track signal over the time dimension to obtain a rendered split-track signal to be processed;
dividing the split-track signal to be processed by the rendered split-track signal to be processed through a calibration parameter prediction model to obtain a loudness calibration parameter of the rendered split-track signal; and
performing loudness calibration on the rendered split-track signal according to the loudness calibration parameter to obtain the adapted split-track signal of the song to be adapted.
In one embodiment, performing timbre rendering on the melody information of the split-track signal to obtain the rendered split-track signal of the song to be adapted includes:
determining a target instrument timbre for the song to be adapted according to the instrument type corresponding to the split-track signal, where the instrument type corresponding to the target instrument timbre is the same as the instrument type corresponding to the split-track signal; and
performing timbre rendering on the melody information of the split-track signal according to the target instrument timbre to obtain the rendered split-track signal of the song to be adapted.
In one embodiment, performing timbre rendering on the melody information of the split-track signal according to the target instrument timbre to obtain the rendered split-track signal of the song to be adapted includes:
configuring a renderer according to the target instrument timbre to obtain a target renderer; and
synthesizing, by the target renderer, an audio signal from the melody information of the split-track signal using the target instrument timbre to obtain the rendered split-track signal of the song to be adapted.
In one embodiment, obtaining the split-track signal of the song to be adapted and the melody information of the split-track signal includes:
inputting the song to be adapted into a trained track separation model to obtain the split-track signal of the song to be adapted; and
extracting the melody information of the split-track signal from the split-track signal.
In one embodiment, the trained track separation model is trained as follows:
fusing a plurality of sample split-track signals to obtain a sample mixed signal;
inputting the sample mixed signal into a track separation model to be trained to obtain predicted split-track signals of the sample mixed signal;
obtaining a loss value of the track separation model to be trained according to the time-domain norm between each sample split-track signal and the corresponding predicted split-track signal and the number of predicted split-track signals; and
iteratively training the track separation model to be trained according to the loss value to obtain the trained track separation model.
In one embodiment, inputting the sample mixed signal into the track separation model to be trained to obtain the predicted split-track signals of the sample mixed signal includes:
convolving the sample mixed signal in the time dimension through a time-domain convolutional network in the track separation model to be trained to obtain a time-domain convolution feature of the sample mixed signal;
convolving the sample mixed signal in the frequency dimension through a frequency-domain convolutional network in the track separation model to be trained to obtain a frequency-domain convolution feature of the sample mixed signal;
fusing the time-domain convolution feature and the frequency-domain convolution feature to obtain a fused convolution feature;
convolving the fused convolution feature in the time dimension through the time-domain convolutional network to obtain a target time-domain feature corresponding to the fused convolution feature;
convolving the fused convolution feature in the frequency dimension through the frequency-domain convolutional network to obtain a target frequency-domain feature corresponding to the fused convolution feature; and
fusing the target time-domain feature and the target frequency-domain feature to obtain the predicted split-track signals.
In one embodiment, extracting the melody information of the split-track signal from the split-track signal includes:
performing Fourier transform processing on the split-track signal to obtain a Mel spectrogram of the split-track signal;
performing multi-task multitrack music transcription on the Mel spectrogram through a music transcription model to obtain a vocabulary token sequence of the Mel spectrogram and time information of the vocabulary token sequence; and
obtaining the melody information of the split-track signal according to the vocabulary token sequence and the time information of the vocabulary token sequence.
In one embodiment, performing multi-task multitrack music transcription on the Mel spectrogram through the music transcription model to obtain the vocabulary token sequence of the Mel spectrogram and the time information of the vocabulary token sequence includes:
taking the Mel spectrogram and the next vocabulary token whose occurrence probability meets a preset probability condition as the input sequence of the music transcription model; and
encoding and decoding the input sequence through the music transcription model, and outputting the vocabulary token sequence of the Mel spectrogram and the time information of the vocabulary token sequence, the vocabulary token sequence comprising a plurality of vocabulary tokens.
In a second aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor that, when executing the computer program, implements the following steps:
obtaining a split-track signal of a song to be adapted and melody information of the split-track signal;
performing timbre rendering on the melody information of the split-track signal to obtain a rendered split-track signal of the song to be adapted, the timbre of the rendered split-track signal being different from the timbre of the split-track signal;
performing loudness calibration on the rendered split-track signal according to the split-track signal to obtain an adapted split-track signal of the song to be adapted; and
mixing the adapted split-track signals to obtain a target adapted song of the song to be adapted.
In a third aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program that, when executed by a processor, implements the following steps:
obtaining a split-track signal of a song to be adapted and melody information of the split-track signal;
performing timbre rendering on the melody information of the split-track signal to obtain a rendered split-track signal of the song to be adapted, the timbre of the rendered split-track signal being different from the timbre of the split-track signal;
performing loudness calibration on the rendered split-track signal according to the split-track signal to obtain an adapted split-track signal of the song to be adapted; and
mixing the adapted split-track signals to obtain a target adapted song of the song to be adapted.
In a fourth aspect, the present application also provides a computer program product. The computer program product comprises a computer program that, when executed by a processor, implements the following steps:
obtaining a split-track signal of a song to be adapted and melody information of the split-track signal;
performing timbre rendering on the melody information of the split-track signal to obtain a rendered split-track signal of the song to be adapted, the timbre of the rendered split-track signal being different from the timbre of the split-track signal;
performing loudness calibration on the rendered split-track signal according to the split-track signal to obtain an adapted split-track signal of the song to be adapted; and
mixing the adapted split-track signals to obtain a target adapted song of the song to be adapted.
With the song adaptation method, computer device, and storage medium described above, a split-track signal of the song to be adapted and melody information of the split-track signal are obtained; timbre rendering is performed on the melody information of the split-track signal to obtain a rendered split-track signal of the song to be adapted; loudness calibration is performed on the rendered split-track signal according to the split-track signal to obtain an adapted split-track signal of the song to be adapted; and the adapted split-track signals are mixed to obtain the target adapted song of the song to be adapted. Timbre rendering based on the melody information of the split-track signal simulates the process of manually re-performing and adapting a song, so no manual adaptation is needed, which improves adaptation efficiency while reducing adaptation cost. In addition, the loudness calibration keeps each adapted split-track signal close to the original split-track signal while still differing from it; mixing the adapted split-track signals yields the target adapted song, so the song is adapted automatically without manual work, again improving adaptation efficiency while reducing adaptation cost.
Drawings
FIG. 1 is a flow diagram of a song adaptation method in one embodiment;
FIG. 2 is a flowchart illustrating the steps for obtaining an adapted split-track signal of a song to be adapted in one embodiment;
FIG. 3 is a schematic diagram of a track separation model in one embodiment;
FIG. 4 is a diagram illustrating melody information obtained by multi-task multitrack music transcription in one embodiment;
FIG. 5 is a flow chart of a song adaptation method according to another embodiment;
FIG. 6 is a flow chart of a song adaptation method according to another embodiment;
FIG. 7 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in FIG. 1, a song adaptation method is provided. The method is described here as applied to a terminal by way of illustration; it can also be applied to a server, or to a system comprising a terminal and a server and implemented through interaction between the two. The terminal may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, an Internet of Things device, or a portable wearable device. The server may be implemented as a stand-alone server or as a server cluster composed of multiple servers. In this embodiment, the method includes the following steps:
Step S101: obtain a split-track signal of the song to be adapted and melody information of the split-track signal.
The song to be adapted is a musical piece that needs to be adapted or re-performed. The piece may be an accompaniment extracted from a song or a piece of pure instrumental music. A split-track signal is the audio signal of a single (instrument) track. Melody information describes melodic aspects of the split-track signal such as pitch, notes, note control parameters, onset times, and end times; the melody information may be a MIDI (Musical Instrument Digital Interface) file, a standard machine-readable music format.
Specifically, the terminal separates the tracks of the song to be adapted to obtain the split-track signals of the different instruments in the song, and then obtains the melody information of each split-track signal.
Step S102: perform timbre rendering on the melody information of the split-track signal to obtain a rendered split-track signal of the song to be adapted, where the timbre of the rendered split-track signal is different from the timbre of the split-track signal.
The rendered split-track signal is an audio signal rendered with the same instrument type as the split-track signal but with a different timbre.
Specifically, the terminal may perform timbre rendering by synthesizing audio from the melody information of the split-track signal using the same instrument type but a different timbre, and use the newly synthesized signal as the rendered split-track signal of the song to be adapted. Because the rendered split-track signal is generated from the melody information of the split-track signal, the two share the same melody information, so the rendered split-track signal stays similar to the original split-track signal to a certain degree.
For example, suppose the song to be adapted is performed by several instruments such as guitar, piano, and bass. The split-track signal of each instrument can be extracted from the song (to distinguish them, they may be called the guitar split-track signal, piano split-track signal, bass split-track signal, and so on). Taking the guitar split-track signal as an example: the original guitar part is played on a classical guitar, whose timbre is rounder and brighter. The terminal can select an electric guitar timbre and render the melody of the split-track signal with it; the rendered split-track signal then has the same melody as the guitar split-track signal, but its timbre is thicker and more metallic, while the original remains rounder and brighter.
Step S103: perform loudness calibration on the rendered split-track signal according to the split-track signal to obtain an adapted split-track signal of the song to be adapted.
The adapted split-track signal is the audio signal obtained by adapting the split-track signal with this method. It approximates the original split-track signal, but the two remain different. Both the adapted split-track signal and the split-track signal are audio signals of a single (instrument) track.
Specifically, the terminal calibrates the loudness of the rendered split-track signal according to the original split-track signal, for example by scaling the loudness of the rendered split-track signal toward that of the original. This prevents the timbre change from causing a large loudness difference between the rendered signal and the original split-track signal. The terminal thus obtains an adapted split-track signal for each split-track signal, i.e. the adapted split-track signals of the song to be adapted, whose loudness is consistent with the original loudness of the corresponding split-track signals, which keeps the adapted song similar to the original song to be adapted.
Step S104: mix the adapted split-track signals to obtain the target adapted song of the song to be adapted.
The target adapted song is the song obtained by adapting the song to be adapted. It is close to the song to be adapted, but the two remain different.
Specifically, after obtaining the adapted split-track signal corresponding to each split-track signal, the terminal performs mixing on all adapted split-track signals, for example by superposing their tracks, and thereby obtains the target adapted song of the song to be adapted.
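As an illustrative sketch (not part of the original disclosure), the mixing step can be approximated by sample-wise superposition of the adapted split-track signals; the function name and the optional peak-normalization step are assumptions added for illustration, since the text only describes superposing the tracks.

```python
import numpy as np

def mix_adapted_tracks(adapted_tracks, normalize=True):
    """Superpose adapted split-track signals (same sample rate and length)
    into a single adapted-song waveform.

    The peak-normalization step is an illustrative assumption; the text only
    describes superposing (mixing) the adapted split-track signals.
    """
    mix = np.sum(np.stack(adapted_tracks, axis=0), axis=0)
    if normalize:
        peak = np.max(np.abs(mix))
        if peak > 1.0:          # prevent clipping after superposition
            mix = mix / peak
    return mix

# Example: mix two adapted stems (e.g. guitar and piano)
# guitar = np.load("adapted_guitar.npy"); piano = np.load("adapted_piano.npy")
# target_song = mix_adapted_tracks([guitar, piano])
```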
It should be understood that the purpose of this method is to adapt or re-perform the original song, so the adapted song (or adapted split-track signal) remains similar to the song to be adapted (or the split-track signal) in melody information (such as pitch and rhythm), loudness, and so on, while still differing from it, for example with a fuller or more transparent timbre; the method does not rewrite the original melody, loudness, and timbre so heavily that the adapted song (or adapted split-track signal) differs too much from the song to be adapted. Taking a karaoke scenario as an example, whether the accompaniment of a song is covered by the song's copyright often has to be considered, so it is often inconvenient to use the original accompaniment directly. With this song adaptation method, the original accompaniment can be adapted into a target adapted accompaniment that is similar to, but different from, the original; using the adapted accompaniment for karaoke does not affect the user's singing experience and can reduce operating costs.
In the song adaptation method above, a split-track signal of the song to be adapted and melody information of the split-track signal are obtained; timbre rendering is performed on the melody information of the split-track signal to obtain a rendered split-track signal of the song to be adapted; loudness calibration is performed on the rendered split-track signal according to the split-track signal to obtain an adapted split-track signal of the song to be adapted; and the adapted split-track signals are mixed to obtain the target adapted song of the song to be adapted. Timbre rendering based on the melody information of the split-track signal, together with loudness calibration, simulates the process of manually re-performing and adapting a song, so no manual adaptation is needed, which improves song adaptation efficiency while reducing adaptation cost. In addition, the loudness calibration ensures that each adapted split-track signal stays close to the original split-track signal while still differing from it; mixing the adapted split-track signals yields the target adapted song, so song adaptation is fully automatic, again improving efficiency while reducing cost.
In one embodiment, as shown in FIG. 2, the above step S103 of performing loudness calibration on the rendered split-track signal according to the split-track signal to obtain the adapted split-track signal of the song to be adapted specifically includes the following:
Step S201: sum the squares of the split-track signal over the time dimension to obtain a split-track signal to be processed.
Step S202: sum the squares of the rendered split-track signal over the time dimension to obtain a rendered split-track signal to be processed.
Step S203: divide the split-track signal to be processed by the rendered split-track signal to be processed through a calibration parameter prediction model to obtain a loudness calibration parameter of the rendered split-track signal.
The calibration parameter prediction model is a model for calculating the loudness calibration parameter of the rendered split-track signal. The loudness calibration parameter is the parameter used to adjust the loudness of the rendered split-track signal.
To make the adapted split-track signal more similar to the split-track signal of the song to be adapted, loudness calibration can be performed on the rendered split-track signal so that the loudness of each adapted split-track signal is consistent with that of the original split-track signal, which in turn makes the target adapted song more similar to the song to be adapted. Specifically, the terminal inputs the split-track signal and the rendered split-track signal into the calibration parameter prediction model; the squares of the split-track signal are summed over the time dimension to obtain the split-track signal to be processed, and the squares of the rendered split-track signal are summed over the time dimension to obtain the rendered split-track signal to be processed; the calibration parameter prediction model then computes the loudness calibration parameter of the rendered split-track signal from the split-track signal to be processed and the rendered split-track signal to be processed.
In practice, the calibration parameter prediction model may take the form of formula (1): the terminal substitutes the split-track signal and the rendered split-track signal into formula (1) and computes the loudness calibration parameter of the rendered split-track signal.
α = ( Σ_{t=1}^{M} y²(t) ) / ( Σ_{t=1}^{M} x²(t) )    (1)
where α denotes the loudness calibration parameter; y(t) denotes the split-track signal; x(t) denotes the rendered split-track signal; M denotes the total duration of the split-track signal (or the rendered split-track signal); and t denotes the time dimension.
Step S204: perform loudness calibration on the rendered split-track signal according to the loudness calibration parameter to obtain the adapted split-track signal of the song to be adapted.
Specifically, the terminal performs loudness calibration on the rendered split-track signal according to the loudness calibration parameter, for example by multiplying the rendered split-track signal by the loudness calibration parameter, thereby scaling its loudness; the terminal then obtains the adapted split-track signal of the song to be adapted. In practice, the adapted split-track signal x_new(t) can be calculated by the following formula (2):
x_new(t) = αx(t)    (2)
In this embodiment, the loudness calibration parameter of the rendered split-track signal is first computed from the split-track signal and the rendered split-track signal; loudness calibration is then performed on the rendered split-track signal according to the loudness calibration parameter to obtain the adapted split-track signal of the song to be adapted. This prevents the loudness of the processed signal (the adapted split-track signal) from deviating too much from the loudness of the original split-track signal, keeps the loudness of each adapted split-track signal consistent with the original loudness of the corresponding split-track signal, and thus preserves the similarity between the adapted split-track signal and the split-track signal of the song to be adapted while improving adaptation efficiency.
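The loudness calibration above can be sketched in Python as follows. This is an illustrative reconstruction from the described steps, not the patent's reference code: alpha is computed as the ratio of the summed squares of the original and rendered split-track signals, and the optional square-root variant for RMS matching is an added assumption.

```python
import numpy as np

def loudness_calibrate(y, x, use_rms_matching=False):
    """Scale the rendered split-track signal x(t) so its loudness matches
    the original split-track signal y(t).

    alpha follows the steps described in the text: sum the squares of each
    signal over the time dimension, then divide (formula (1)).
    `use_rms_matching=True` applies a square root so that the scaled signal
    matches the RMS energy of the original; this variant is an assumption,
    not stated in the text.
    """
    energy_y = np.sum(y.astype(np.float64) ** 2)
    energy_x = np.sum(x.astype(np.float64) ** 2)
    alpha = energy_y / max(energy_x, 1e-12)
    if use_rms_matching:
        alpha = np.sqrt(alpha)
    return alpha * x        # formula (2): x_new(t) = alpha * x(t)
```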
In one embodiment, the above step S102 of performing timbre rendering on the melody information of the split-track signal to obtain the rendered split-track signal of the song to be adapted, where the timbre of the rendered split-track signal differs from that of the split-track signal, specifically includes the following: determining a target instrument timbre for the song to be adapted according to the instrument type corresponding to the split-track signal, where the instrument type corresponding to the target instrument timbre is the same as the instrument type corresponding to the split-track signal; and performing timbre rendering on the melody information of the split-track signal according to the target instrument timbre to obtain the rendered split-track signal of the song to be adapted.
The target instrument timbre is the timbre selected for rendering the melody information.
Specifically, the terminal determines the instrument type corresponding to the split-track signal and then selects an instrument of the same type but with a different timbre as the target instrument; the timbre of that instrument is the target instrument timbre of the song to be adapted. The terminal renders the melody information of the split-track signal with the target instrument timbre, for example by synthesizing audio from the melody information with that timbre, thereby obtaining a rendered split-track signal that has the same melody information as the split-track signal but a different timbre.
In this embodiment, the target instrument timbre of the song to be adapted is determined according to the instrument type corresponding to the split-track signal; timbre rendering is then performed on the melody information of the split-track signal with the target instrument timbre, producing a rendered split-track signal that stays close to the split-track signal while keeping a timbre difference, which achieves a reasonable adaptation of the split-track signal of the song to be adapted.
In one embodiment, performing timbre rendering on the melody information of the split-track signal according to the target instrument timbre to obtain the rendered split-track signal of the song to be adapted specifically includes the following steps: configuring a renderer according to the target instrument timbre to obtain a target renderer; and synthesizing, by the target renderer, an audio signal from the melody information of the split-track signal using the target instrument timbre to obtain the rendered split-track signal of the song to be adapted.
The target renderer is a tool for rendering, or secondarily processing, an audio signal. The renderer and the target renderer may be FluidSynth (a real-time MIDI synthesizer), which can synthesize audio and freely control and adjust the effect of the synthesized audio.
Specifically, after determining the target instrument timbre, the terminal sets the parameters of the target instrument timbre in the renderer; once configured, a usable target renderer is obtained. The melody information of the split-track signal is then fed to the target renderer, which synthesizes an audio signal with the target instrument timbre based on the melody information, and the terminal thus obtains the rendered split-track signal of the song to be adapted.
For example, assuming the target instrument timbre is the timbre of electric guitar A, the terminal can have FluidSynth select the timbre of electric guitar A and synthesize an audio signal based on the MIDI file of the split-track signal, thereby obtaining a rendered split-track signal whose timbre is that of electric guitar A.
In this embodiment, the renderer is configured according to the target instrument timbre to obtain the target renderer; the target renderer then synthesizes an audio signal from the melody information of the split-track signal with the target instrument timbre to obtain the rendered split-track signal of the song to be adapted. A reasonable adaptation of the split-track signal can thus be achieved automatically by the renderer, without manually playing instruments of different timbres, which improves the adaptation efficiency of the split-track signal.
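A minimal sketch of the rendering step, assuming the FluidSynth command-line tool is installed and a SoundFont file containing the target instrument timbre is available; the file paths, SoundFont name, and parameter values are placeholders, not values from the patent.

```python
import subprocess

def render_midi_with_timbre(midi_path, soundfont_path, output_wav, sample_rate=44100):
    """Render a split-track MIDI file to audio with the timbre provided by
    the given SoundFont, using the FluidSynth command-line renderer.

    The SoundFont (e.g. an electric-guitar bank) plays the role of the
    target instrument timbre; choosing which bank/preset to load is left to
    the caller and is an assumption of this sketch.
    """
    subprocess.run(
        [
            "fluidsynth",
            "-ni",                  # non-interactive, no MIDI input driver
            soundfont_path,         # SoundFont with the target timbre
            midi_path,              # melody information of the split track
            "-F", output_wav,       # fast-render to a WAV file
            "-r", str(sample_rate),
        ],
        check=True,
    )

# render_midi_with_timbre("guitar_track.mid", "electric_guitar.sf2", "rendered_guitar.wav")
```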
In one embodiment, the above step S101 of obtaining the split-track signal of the song to be adapted and the melody information of the split-track signal specifically includes the following: inputting the song to be adapted into a trained track separation model to obtain the split-track signals of the song to be adapted; and extracting the melody information of each split-track signal from the split-track signal.
The track separation model is a model for separating out the split-track signals of different instruments.
Specifically, the track separation model is trained in advance to obtain the trained track separation model. After obtaining the song to be adapted, the terminal can input it into the trained track separation model, which separates out the split-track signals of the different instruments; the terminal then extracts the melody information of each split-track signal from the split-track signals of the different instruments.
In this embodiment, the split-track signals of the song to be adapted are obtained by inputting the song to be adapted into the trained track separation model, and the melody information of each split-track signal is then extracted from it, so the split-track signals and melody information of the song to be adapted are obtained in a reasonable way and can serve as the basis for the subsequent song adaptation steps.
In one embodiment, the trained track separation model is trained as follows: fusing a plurality of sample split-track signals to obtain a sample mixed signal; inputting the sample mixed signal into a track separation model to be trained to obtain predicted split-track signals of the sample mixed signal; obtaining a loss value of the track separation model to be trained according to the time-domain norm between each sample split-track signal and the corresponding predicted split-track signal and the number of predicted split-track signals; and iteratively training the track separation model to be trained according to the loss value to obtain the trained track separation model.
The track separation model can be built on an end-to-end time-frequency-domain model, for example a Hybrid Demucs model (a music source separation model).
A sample split-track signal is likewise the audio signal of a single (instrument) track, but it is training data used to train the track separation model. A sample mixed signal is an audio signal in which multiple (instrument) tracks are mixed, and is also training data for the track separation model.
The terminal may first construct the training data for the track separation model. Specifically, the terminal may obtain multiple sample split-track signals from a data set or database, and then fuse them, either sample split-track signals of the same instrument type but different timbres, or sample split-track signals of different instrument types, so that the terminal obtains a sample mixed signal. The terminal then inputs the sample mixed signal into the track separation model to be trained, which separates out the predicted split-track signals of the sample mixed signal. The terminal obtains the loss value of the track separation model to be trained according to the time-domain norm between each sample split-track signal and the corresponding predicted split-track signal and the number of predicted split-track signals, iteratively updates the model parameters of the track separation model to be trained with the loss value, and obtains the trained track separation model once a preset training termination condition is met. The loss value of the track separation model to be trained can be calculated by the following formula (3):
L = (1/J) Σ_{j=1}^{J} || ŷ_j(t) - y_j(t) ||_1    (3)
where J denotes the number of predicted split-track signals separated by the track separation model; the subscript 1 denotes the time-domain first-order (L1) norm; ŷ_j(t) denotes the j-th predicted split-track signal; y_j(t) denotes the sample split-track signal corresponding to the j-th predicted split-track signal; and t denotes the time dimension.
In this embodiment, the sample mixed signal obtained by fusion is input into the track separation model to be trained to obtain the predicted split-track signals of the sample mixed signal; the loss value of the track separation model to be trained is then obtained from the time-domain norm between the sample split-track signals and the predicted split-track signals and from the number of predicted split-track signals; the track separation model to be trained is then iteratively trained with the loss value to obtain the trained track separation model. Training of the track separation model is thus achieved, so the trained model can be used in the subsequent song adaptation steps to obtain split-track signals.
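The training loss above (time-domain L1 norm between each predicted and sample split-track signal, combined over the number of predicted tracks) can be sketched as follows; the PyTorch formulation and the choice to average over J rather than sum are assumptions consistent with the description of formula (3).

```python
import torch

def separation_loss(predicted_tracks: torch.Tensor, target_tracks: torch.Tensor) -> torch.Tensor:
    """Sketch of formula (3): the time-domain first-order (L1) norm between each
    predicted split-track signal and its sample split-track signal, combined
    over the J predicted tracks.

    predicted_tracks, target_tracks: tensors of shape (J, T) or (batch, J, T).
    Averaging by J (rather than summing) is an assumption based on the reference
    to "the number of predicted split-track signals".
    """
    assert predicted_tracks.shape == target_tracks.shape
    j = predicted_tracks.shape[-2]                                       # number of split tracks J
    per_track_l1 = torch.sum(torch.abs(predicted_tracks - target_tracks), dim=-1)  # L1 over time
    return per_track_l1.sum(dim=-1).mean() / j                           # combine over tracks (and batch)
```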
In one embodiment, inputting the sample mixed signal into the track separation model to be trained to obtain the predicted split-track signals of the sample mixed signal specifically includes the following: convolving the sample mixed signal in the time dimension through a time-domain convolutional network in the track separation model to be trained to obtain a time-domain convolution feature of the sample mixed signal; convolving the sample mixed signal in the frequency dimension through a frequency-domain convolutional network in the track separation model to be trained to obtain a frequency-domain convolution feature of the sample mixed signal; fusing the time-domain convolution feature and the frequency-domain convolution feature to obtain a fused convolution feature; convolving the fused convolution feature in the time dimension through the time-domain convolutional network to obtain a target time-domain feature corresponding to the fused convolution feature; convolving the fused convolution feature in the frequency dimension through the frequency-domain convolutional network to obtain a target frequency-domain feature corresponding to the fused convolution feature; and fusing the target time-domain feature and the target frequency-domain feature to obtain the predicted split-track signals.
FIG. 3 is a schematic diagram of a track separation model. The model includes multiple time-domain convolutional networks (TEncoder, TDecoder), multiple frequency-domain convolutional networks (ZEncoder, ZDecoder), a shared encoder (Encoder6), and a shared decoder (Decoder6); the time-domain and frequency-domain convolutional networks can each be built from an encoder and a decoder. In FIG. 3, Time Steps denotes the time step, Freq denotes the sampling frequency, C_in denotes the size of the input data, and C_out denotes the size of the output data.
Specifically, as shown in FIG. 3, the terminal may first apply a short-time Fourier transform (STFT) to the sample mixed signal to obtain its spectrogram. The spectrogram is convolved in the time dimension by the time-domain convolutional networks built from encoders to obtain the time-domain convolution feature of the spectrogram; at the same time, the spectrogram is convolved in the frequency dimension by the frequency-domain convolutional networks built from encoders to obtain the frequency-domain convolution feature of the spectrogram. The time-domain convolution feature and the frequency-domain convolution feature are then fused to obtain a fused convolution feature, which is further processed by the shared encoder and shared decoder in the track separation model to be trained to obtain a shared convolution feature. The terminal convolves the shared convolution feature in the time dimension through the time-domain convolutional networks built from decoders and, at the same time, in the frequency dimension through the frequency-domain convolutional networks built from decoders, so that it obtains the target time-domain feature output by the last time-domain convolutional network and the target frequency-domain feature output by the last frequency-domain convolutional network. The terminal fuses the target time-domain feature and the target frequency-domain feature to obtain a predicted spectrogram, and finally applies an inverse short-time Fourier transform (ISTFT) to the predicted spectrogram to obtain the predicted split-track signal. It should be understood that the trained track separation model processes the song to be adapted with the same steps with which the track separation model to be trained processes the sample mixed signal.
In this embodiment, the sample mixed signal is convolved in the time dimension by the time-domain convolutional network in the track separation model to be trained and in the frequency dimension by the frequency-domain convolutional network, so richer features of the sample mixed signal can be mined in both the frequency domain and the time domain, which helps improve the track separation capability of the trained track separation model.
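A highly simplified sketch of the dual-branch idea (convolutions over the time axis and over the frequency axis of the spectrogram, followed by feature fusion). The layer sizes, kernel shapes, and the fusion-by-addition choice are assumptions made for illustration and do not reproduce the full encoder/decoder stack of FIG. 3.

```python
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    """Toy illustration of the time-domain / frequency-domain convolution idea.

    Input: a (batch, channels, freq, time) spectrogram of the mixed signal.
    One branch convolves along the time axis, the other along the frequency
    axis; the two feature maps are fused by element-wise addition (the fusion
    method is an assumption of this sketch).
    """

    def __init__(self, channels: int = 16):
        super().__init__()
        self.time_conv = nn.Conv2d(channels, channels, kernel_size=(1, 5), padding=(0, 2))
        self.freq_conv = nn.Conv2d(channels, channels, kernel_size=(5, 1), padding=(2, 0))

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        time_feat = torch.relu(self.time_conv(spec))   # convolution in the time dimension
        freq_feat = torch.relu(self.freq_conv(spec))   # convolution in the frequency dimension
        return time_feat + freq_feat                   # fused convolution feature

# Usage sketch: several such blocks would sit between an STFT front end and an
# ISTFT back end, mapping the mixed spectrogram to J predicted split-track spectrograms.
# fused = DualBranchBlock(16)(torch.randn(1, 16, 257, 400))
```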
In one embodiment, extracting the melody information of the split-track signal from the split-track signal specifically includes the following: performing Fourier transform processing on the split-track signal to obtain a Mel spectrogram of the split-track signal; performing multi-task multitrack music transcription on the Mel spectrogram through a music transcription model to obtain a vocabulary token sequence of the Mel spectrogram and time information of the vocabulary token sequence; and obtaining the melody information of the split-track signal according to the vocabulary token sequence and the time information of the vocabulary token sequence.
Specifically, the terminal obtains the Mel spectrogram of the split-track signal by performing Fourier transform processing on the split-track signal. Multi-Task Multitrack Music Transcription (MT3) is then applied to the Mel spectrogram through a music transcription model: the Mel spectrogram may be input into a music transcription model trained on the basis of MT3, which outputs the vocabulary token sequence of the Mel spectrogram and the time information of the vocabulary token sequence. A preset mapping between vocabulary tokens and notes is obtained, and the vocabulary token sequence and its time information are converted into the melody information of the split-track signal according to this mapping. The melody information may be a MIDI file containing the instrument type, notes, pitches, note onsets and offsets, time information, and tie information. It should be understood that the melody information itself does not contain waveform data; rather, it records a set of instructions telling the target renderer described above how to reproduce the music with the target instrument timbre.
FIG. 4 is a schematic diagram of melody information obtained by multi-task multitrack transcription. As shown in FIG. 4, six data sets of different sizes, recording processes, instruments, and genres can be used to verify the performance of the MT3-trained music transcription model. The six data sets are: MAESTRO, whose audio and detailed MIDI data are collected from performers playing on a Disklavier piano; Slakh2100, composed of audio generated by rendering MIDI files with professional-grade, sample-based synthesis software; Cerberus4, derived from the Slakh2100 data set by mixing four instruments (guitar, bass, drums, piano) in tracks where those instruments are active; GuitarSet, which consists of live guitar performances of various genres, tempos, and styles, recorded with a high-precision hexaphonic pickup that captures the sound of each guitar string independently; MusicNet, consisting of recordings of classical solo and ensemble instruments; and URMP, composed of classical works performed on a variety of instruments such as the violin. The terminal inputs the six data sets MAESTRO, Cerberus4, GuitarSet, MusicNet, Slakh2100, and URMP into the music transcription model, and the model outputs the MIDI files corresponding to each data set.
In this embodiment, the Mel spectrogram of the split-track signal is obtained by performing Fourier transform processing on the split-track signal; multi-task multitrack music transcription is performed on the Mel spectrogram through the music transcription model to obtain the vocabulary token sequence of the Mel spectrogram and its time information; and the melody information of the split-track signal is then obtained from the vocabulary token sequence and its time information. The melody information can then serve as the basis for the subsequent timbre rendering step and, combined with the target renderer, for adapting the split-track signal.
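As a sketch of the first step only (the transcription model itself is not reproduced here), the Mel spectrogram of a split-track signal can be computed with librosa; the frame and Mel-band parameters are illustrative defaults, not values taken from the patent.

```python
import librosa
import numpy as np

def split_track_mel_spectrogram(wav_path, sr=16000, n_fft=2048, hop_length=512, n_mels=128):
    """Load a split-track signal and compute its log-Mel spectrogram, which is
    the input representation fed to the music transcription model.
    All parameter values here are illustrative assumptions.
    """
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)   # log-compressed Mel spectrogram

# mel = split_track_mel_spectrogram("guitar_track.wav")
```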
In one embodiment, performing multi-task multitrack music transcription on the Mel spectrogram through the music transcription model to obtain the vocabulary token sequence of the Mel spectrogram and the time information of the vocabulary token sequence specifically includes the following steps: taking the Mel spectrogram and the next vocabulary token whose occurrence probability meets a preset probability condition as the input sequence of the music transcription model; and encoding and decoding the input sequence through the music transcription model, and outputting the vocabulary token sequence of the Mel spectrogram and the time information of the vocabulary token sequence, where the vocabulary token sequence comprises a plurality of vocabulary tokens.
The preset probability condition is a condition set on the occurrence probability of the next vocabulary token; it may be, for example, that the occurrence probability is the highest. A vocabulary token is a token from a vocabulary constructed based on the MIDI format, and the corresponding melody information can be represented by vocabulary tokens. The time information is the start and end time of the corresponding vocabulary token.
Specifically, for the first input, the terminal takes the Mel spectrogram as the input sequence and encodes and decodes it with the music transcription model, so that the model outputs a vocabulary token and its time information. For each subsequent input, the terminal selects, from the output vocabulary tokens, the one with the highest occurrence probability as the next vocabulary token, and then takes the Mel spectrogram together with this next vocabulary token as the input sequence of the music transcription model; the input sequence is again encoded and decoded by the model, which outputs a vocabulary token and its time information. The vocabulary tokens output at each step form the vocabulary token sequence, and the time information of the vocabulary token sequence is determined accordingly.
In this embodiment, the Mel spectrogram and the next vocabulary token whose occurrence probability meets the preset probability condition are used as the input sequence of the music transcription model; the input sequence is encoded and decoded by the music transcription model, which outputs the vocabulary token sequence of the Mel spectrogram and its time information. The vocabulary token sequence is thus obtained accurately through the music transcription model MT3, and the subsequent steps can use the vocabulary token sequence and its time information to obtain the melody information of the split-track signal.
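A schematic sketch of the greedy autoregressive decoding described above; `transcription_model`, its `decode_step` interface, and the token vocabulary are hypothetical placeholders, since the actual MT3 API is not reproduced here.

```python
def greedy_transcribe(mel_spectrogram, transcription_model, bos_token, eos_token, max_len=1024):
    """Greedy decoding loop: at each step the Mel spectrogram plus the tokens
    produced so far form the input sequence, and the token with the highest
    predicted probability (the "occurrence probability meeting the preset
    probability condition") is appended as the next vocabulary token.

    `transcription_model.decode_step` is a hypothetical interface returning a
    probability distribution over the vocabulary and the timing information
    associated with the next token.
    """
    tokens, times = [bos_token], []
    for _ in range(max_len):
        probs, time_info = transcription_model.decode_step(mel_spectrogram, tokens)
        next_token = max(range(len(probs)), key=lambda k: probs[k])  # highest occurrence probability
        if next_token == eos_token:
            break
        tokens.append(next_token)
        times.append(time_info)
    return tokens[1:], times   # vocabulary token sequence and its time information
```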
In one embodiment, as shown in fig. 5, another song adaptation method is provided, and the method is applied to a terminal for illustration, and includes the following steps:
step S501, inputting the song to be adapted into the track separation model after training, and obtaining the track separation signal of the song to be adapted.
Step S502, carrying out Fourier transform processing on the split-track signal to obtain a Mel spectrogram of the split-track signal; and taking the Mel spectrogram and the next vocabulary mark with occurrence probability meeting the preset probability condition as an input sequence of the music transcription model.
Step S503, through a music transcription model, the input sequence is processed by encoding and decoding, and the vocabulary marking sequence of the Mel spectrogram and the time information of the vocabulary marking sequence are output.
The vocabulary mark sequence includes a plurality of vocabulary marks.
Step S504, obtaining melody information of the track-divided signal according to the vocabulary marking sequence and the time information of the vocabulary marking sequence.
In step S505, the target instrument tone color of the song to be adapted is determined according to the instrument type corresponding to the track-dividing signal.
The instrument type corresponding to the target instrument tone color is the same as the instrument type corresponding to the track-dividing signal.
Step S506, setting a renderer according to the target instrument tone color to obtain a target renderer; and synthesizing, by the target renderer, an audio signal from the melody information of the track-dividing signal in the target instrument tone color, to obtain the rendered track-dividing signal of the song to be adapted.
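For illustration only, one possible form of the target renderer is sketched below, in which melody information held as a MIDI file is synthesized in the target instrument tone color through a General MIDI SoundFont; the use of pretty_midi with FluidSynth, the SoundFont path and the instrument name are assumptions of this sketch rather than requirements of the embodiment.

import pretty_midi

def render_track(midi_path: str, instrument_name: str,
                 sf2_path: str = "example.sf2", sample_rate: int = 44100):
    midi = pretty_midi.PrettyMIDI(midi_path)
    # Map the instrument type to a General MIDI program, i.e. the target instrument tone color.
    program = pretty_midi.instrument_name_to_program(instrument_name)
    for instrument in midi.instruments:
        instrument.program = program
    # Synthesize the audio signal to obtain the rendered track-dividing signal.
    return midi.fluidsynth(fs=sample_rate, sf2_path=sf2_path)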
Step S507, summing the squares of the track-dividing signal over different time dimensions to obtain the track-dividing signal to be processed; and summing the squares of the rendered track-dividing signal over different time dimensions to obtain the rendered track-dividing signal to be processed.
Step S508, performing division between the track-dividing signal to be processed and the rendered track-dividing signal to be processed through a calibration parameter prediction model, to obtain a loudness calibration parameter of the rendered track-dividing signal.
Step S509, performing loudness calibration on the rendered track-dividing signal according to the loudness calibration parameter, to obtain the adapted track-dividing signal of the song to be adapted.
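For illustration only, the loudness calibration of steps S507 to S509 is sketched below under the assumption that the loudness calibration parameter reduces to the ratio of the summed squared amplitudes; the calibration parameter prediction model of the embodiment is replaced here by a direct division, and taking the square root of the ratio as an amplitude gain is likewise an assumption of the sketch.

import numpy as np

def calibrate_loudness(track_signal: np.ndarray,
                       rendered_signal: np.ndarray) -> np.ndarray:
    # Step S507: sum of squares over the time dimension for both signals.
    energy_original = np.sum(np.square(track_signal))
    energy_rendered = np.sum(np.square(rendered_signal))
    # Step S508: divide the two quantities to obtain the loudness calibration parameter.
    calibration = energy_original / (energy_rendered + 1e-12)
    # Step S509: apply the calibration to the rendered track-dividing signal
    # (square root so that the gain acts on amplitude rather than energy).
    return rendered_signal * np.sqrt(calibration)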
Step S510, performing audio mixing processing on the adapted track-dividing signal to obtain the target adapted song of the song to be adapted.
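For illustration only, the audio mixing of step S510 may be as simple as the sketch below, in which the adapted track-dividing signals are summed sample by sample; the length alignment and peak normalization are assumptions, since the embodiment does not specify the mixing algorithm.

import numpy as np

def mix_tracks(tracks):
    # Align the adapted track-dividing signals to a common length and sum them.
    length = min(len(t) for t in tracks)
    mix = np.sum([t[:length] for t in tracks], axis=0)
    # Simple peak normalization to avoid clipping in the target adapted song.
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix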
The song adaptation method described above can achieve the following beneficial effects: timbre rendering is performed on the basis of the melody information of the track-dividing signal, and loudness calibration is then performed, so that the process of manually playing and adapting a song is simulated without manual adaptation, which improves song adaptation efficiency and reduces adaptation cost. In addition, the loudness calibration processing ensures that the adapted track-dividing signal differs from the original track-dividing signal while maintaining a degree of similarity to it; the adapted track-dividing signal is then mixed to obtain the target adapted song of the song to be adapted, so that automatic adaptation of the song is achieved without manual re-recording.
In order to illustrate the song adaptation method provided by the embodiments of the present disclosure more clearly, the method is described below with a specific embodiment. As shown in fig. 6, still another song adaptation method is provided, which may be applied to a terminal and includes the following:
If the music to be adapted is the accompaniment of a song, the terminal may perform track separation on the accompaniment to obtain a piano track-dividing signal and a guitar track-dividing signal of the accompaniment. The terminal then extracts a piano MIDI file from the piano track-dividing signal and a guitar MIDI file from the guitar track-dividing signal. The terminal further determines a target piano timbre and a target guitar timbre, performs timbre rendering on the piano MIDI file according to the target piano timbre to obtain a rendered piano track-dividing signal of the accompaniment, and performs timbre rendering on the guitar MIDI file according to the target guitar timbre to obtain a rendered guitar track-dividing signal of the accompaniment. The terminal performs loudness calibration on the rendered piano track-dividing signal according to the piano track-dividing signal to obtain an adapted piano track-dividing signal of the accompaniment, and likewise performs loudness calibration on the rendered guitar track-dividing signal according to the guitar track-dividing signal to obtain an adapted guitar track-dividing signal of the accompaniment. Finally, the terminal mixes the adapted guitar track-dividing signal and the adapted piano track-dividing signal to obtain the target adapted accompaniment of the accompaniment.
In this embodiment, the terminal automatically completes the adaptation of the accompaniment based on the extracted track-dividing signals and melody information of the accompaniment, without the accompaniment having to be played again manually, thereby improving song adaptation efficiency and reducing adaptation cost. In addition, the loudness calibration processing ensures that the target adapted accompaniment differs from the original accompaniment while maintaining a degree of similarity to it.
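For illustration only, the overall flow of this accompaniment example can be outlined as follows; the helper callables for track separation, MIDI extraction, rendering, loudness calibration and mixing are passed in as parameters because the embodiment does not fix their concrete interfaces, so every name in this sketch is an assumption.

def adapt_accompaniment(accompaniment, separate, extract_midi, render, calibrate, mix):
    # Track separation of the accompaniment into piano and guitar track-dividing signals.
    piano, guitar = separate(accompaniment)
    # Timbre rendering of each extracted MIDI file in its target instrument tone color.
    rendered_piano = render(extract_midi(piano), "piano")
    rendered_guitar = render(extract_midi(guitar), "guitar")
    # Loudness calibration of each rendered track against its original track-dividing signal.
    adapted_piano = calibrate(piano, rendered_piano)
    adapted_guitar = calibrate(guitar, rendered_guitar)
    # Mixing the adapted track-dividing signals yields the target adapted accompaniment.
    return mix([adapted_piano, adapted_guitar])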
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in these flowcharts may include a plurality of steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; the order of execution of these steps or stages is also not necessarily sequential, and they may be performed in rotation or alternately with at least a part of the other steps or stages.
In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in fig. 7. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used for exchanging information between the processor and an external device. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication may be implemented through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program, when executed by the processor, implements a song adaptation method. The display unit of the computer device is used for forming a visible picture and may be a display screen, a projection device or a virtual reality imaging device; the display screen may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, a key, a trackball or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad or mouse.
It will be appreciated by those skilled in the art that the structure shown in fig. 7 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution of the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
Those skilled in the art will appreciate that implementing all or part of the methods described above may be accomplished by instructing the relevant hardware by means of a computer program, which may be stored on a non-volatile computer-readable storage medium and which, when executed, may include the procedures of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database; the non-relational database may include, but is not limited to, a blockchain-based distributed database and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features of the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this description.
The foregoing examples represent only a few embodiments of the application and are described in relative detail, but they are not to be construed as limiting the scope of the application. It should be noted that several variations and modifications may be made by those skilled in the art without departing from the concept of the application, and these all fall within the protection scope of the application. Accordingly, the protection scope of the application shall be subject to the appended claims.

Claims (11)

1. A method of song adaptation, the method comprising:
acquiring a track dividing signal of a song to be adapted and melody information of the track dividing signal;
performing tone rendering on the melody information of the track-dividing signal to obtain a rendered track-dividing signal of the song to be adapted; the tone color of the rendered track-divided signal is different from the tone color of the track-divided signal;
according to the track dividing signals, loudness calibration is carried out on the rendering track dividing signals, and the recomposition track dividing signals of the songs to be recomposited are obtained;
and mixing the recomposition track-dividing signal to obtain the target recomposition song of the song to be recomposited.
2. The method of claim 1, wherein loudness calibrating the rendered track-divided signal based on the track-divided signal results in an adapted track-divided signal for the song to be adapted, comprising:
summing squares of the track-dividing signals in different time dimensions to obtain a track-dividing signal to be processed;
summing squares of the rendering track-dividing signals under different time dimensions to obtain rendering track-dividing signals to be processed;
dividing the to-be-processed track division signal and the to-be-processed rendering track division signal through a calibration parameter prediction model to obtain a loudness calibration parameter of the rendering track division signal;
and according to the loudness calibration parameters, carrying out loudness calibration on the rendering track-dividing signals to obtain the recomposition track-dividing signals of the songs to be recomposited.
3. The method according to claim 1, wherein the performing timbre rendering on the melody information of the track-divided signal to obtain the rendered track-divided signal of the song to be adapted includes:
determining the tone color of the target musical instrument of the song to be adapted according to the musical instrument type corresponding to the track dividing signal; the instrument type corresponding to the tone color of the target instrument is the same as the instrument type corresponding to the track separation signal;
and performing tone rendering on the melody information of the track-dividing signal according to the tone of the target musical instrument to obtain a rendering track-dividing signal of the song to be adapted.
4. The method of claim 3, wherein performing timbre rendering on the melody information of the track-divided signal according to the timbre of the target musical instrument to obtain the rendered track-divided signal of the song to be adapted, comprises:
setting a renderer according to the timbre of the target musical instrument to obtain a target renderer;
and synthesizing an audio signal according to the melody information of the track-dividing signal according to the tone of the target musical instrument by a target renderer to obtain a rendering track-dividing signal of the song to be adapted.
5. The method according to claim 1, wherein the acquiring the track-divided signal of the song to be adapted and the melody information of the track-divided signal includes:
inputting the song to be adapted into a trained track separation model to obtain a track separation signal of the song to be adapted;
and extracting melody information of the track division signal from the track division signal.
6. The method of claim 5, wherein the trained soundtrack separation model is trained by:
fusing a plurality of sample track dividing signals to obtain a sample mixed signal;
inputting the sample mixed signal into a track separation model to be trained to obtain a predicted track separation signal of the sample mixed signal;
obtaining a loss value of the track separation model to be trained according to the time domain norm between the sample track separation signal and the predicted track separation signal and the number of the predicted track separation signals;
and carrying out iterative training on the to-be-trained track separation model according to the loss value to obtain the trained track separation model.
7. The method of claim 6, wherein inputting the sample mixed signal into a track separation model to be trained to obtain a predicted split signal of the sample mixed signal comprises:
carrying out convolution processing on the sample mixed signal in a time domain dimension through a time domain convolution network in the audio track separation model to be trained to obtain a time domain convolution characteristic of the sample mixed signal;
carrying out convolution processing on the sample mixed signal in a frequency domain dimension through a frequency domain convolution network in the track separation model to be trained to obtain a frequency domain convolution characteristic of the sample mixed signal;
fusing the time domain convolution feature and the frequency domain convolution feature to obtain a fused convolution feature;
performing convolution processing on the fused convolution characteristics in a time domain dimension through the time domain convolution network to obtain target time domain characteristics corresponding to the fused convolution characteristics;
performing convolution processing on the fused convolution characteristics in a frequency domain dimension through the frequency domain convolution network to obtain target frequency domain characteristics corresponding to the fused convolution characteristics;
and fusing the target time domain features and the target frequency domain features to obtain the prediction track-dividing signal.
8. The method of claim 5, wherein the extracting melody information of the split signal from the split signal comprises:
performing Fourier transform processing on the track division signals to obtain Mel spectrograms of the track division signals;
multitasking multitrack music transcription is carried out on the Mel spectrogram through a music transcription model to obtain a vocabulary marking sequence of the Mel spectrogram and time information of the vocabulary marking sequence;
and obtaining melody information of the track-dividing signal according to the vocabulary marking sequence and the time information of the vocabulary marking sequence.
9. The method of claim 8, wherein the multitasking multitrack music transcription of the mel-spectrogram by the music transcription model to obtain a vocabulary tag sequence of the mel-spectrogram and time information of the vocabulary tag sequence, comprising:
taking the Mel spectrogram and the next vocabulary mark with occurrence probability meeting the preset probability condition as an input sequence of the music transcription model;
encoding and decoding the input sequence through the music transcription model, and outputting a vocabulary mark sequence of the Mel spectrogram and time information of the vocabulary mark sequence; the vocabulary mark sequence comprises a plurality of vocabulary marks.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 9 when the computer program is executed.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 9.
CN202310475334.3A 2023-04-26 2023-04-26 Song adaptation method, computer device and storage medium Pending CN116597797A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310475334.3A CN116597797A (en) 2023-04-26 2023-04-26 Song adaptation method, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310475334.3A CN116597797A (en) 2023-04-26 2023-04-26 Song adaptation method, computer device and storage medium

Publications (1)

Publication Number Publication Date
CN116597797A true CN116597797A (en) 2023-08-15

Family

ID=87598353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310475334.3A Pending CN116597797A (en) 2023-04-26 2023-04-26 Song adaptation method, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN116597797A (en)

Similar Documents

Publication Publication Date Title
US9508330B2 (en) System and method for generating a rhythmic accompaniment for a musical performance
JP7243052B2 (en) Audio extraction device, audio playback device, audio extraction method, audio playback method, machine learning method and program
US9263018B2 (en) System and method for modifying musical data
US9251773B2 (en) System and method for determining an accent pattern for a musical performance
US10325581B2 (en) Singing voice edit assistant method and singing voice edit assistant device
US11948542B2 (en) Systems, devices, and methods for computer-generated musical note sequences
US10497347B2 (en) Singing voice edit assistant method and singing voice edit assistant device
JP2019219638A (en) Music synthesis method, system, terminal and computer-readable storage medium
CN111667803A (en) Audio processing method and related product
Jackson Digital audio editing fundamentals
WO2023092368A1 (en) Audio separation method and apparatus, and device, storage medium and program product
Sha’ath Estimation of key in digital music recordings
CN116597797A (en) Song adaptation method, computer device and storage medium
CN115375806A (en) Dance motion prediction model training method, dance synthesis equipment and dance motion prediction model product
CN113781989A (en) Audio animation playing and rhythm stuck point identification method and related device
Pan et al. Musical instruments simulation on mobile platform
CN115004294A (en) Composition creation method, composition creation device, and creation program
CN112685000A (en) Audio processing method and device, computer equipment and storage medium
Driedger Processing music signals using audio decomposition techniques
WO2024139162A1 (en) Audio processing method and apparatus
JP2013041128A (en) Discriminating device for plurality of sound sources and information processing device interlocking with plurality of sound sources
JP2018146682A (en) Electronic sound device and timbre setting method
CN116758883A (en) Method for setting up mixed sound processing model, computer equipment and storage medium
Han Digitally Processed Music Creation (DPMC): Music composition approach utilizing music technology
CN115862587A (en) Audio processing method, computer device and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination