WO2023092368A1 - Audio separation method, apparatus, device, storage medium and program product - Google Patents

Audio separation method, apparatus, device, storage medium and program product

Info

Publication number
WO2023092368A1
WO2023092368A1 PCT/CN2021/132977 CN2021132977W WO2023092368A1 WO 2023092368 A1 WO2023092368 A1 WO 2023092368A1 CN 2021132977 W CN2021132977 W CN 2021132977W WO 2023092368 A1 WO2023092368 A1 WO 2023092368A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
feature
frequency
features
separated
Prior art date
Application number
PCT/CN2021/132977
Other languages
English (en)
French (fr)
Inventor
黄杰雄
万景轩
漆原
陈传艺
Original Assignee
广州酷狗计算机科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州酷狗计算机科技有限公司 filed Critical 广州酷狗计算机科技有限公司
Priority to PCT/CN2021/132977 priority Critical patent/WO2023092368A1/zh
Priority to CN202180005209.5A priority patent/CN114365219A/zh
Publication of WO2023092368A1 publication Critical patent/WO2023092368A1/zh

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • G10L21/028 - Voice signal separating using properties of sound source
    • G10L21/0308 - Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the present application relates to the technical field of audio processing, and in particular to an audio separation method, device, equipment, storage medium and program product.
  • Music is typically an audio file in which the human voice and various musical instruments are mixed together. Separating such a file yields multiple independent audio tracks, which has important applications in music mixing and accompaniment extraction.
  • In the related art, an audio separation method based on a convolutional neural network is used to perform audio separation on the audio to be separated: convolution is performed on the audio to obtain vocal features and accompaniment features, and a separated vocal track and accompaniment track are generated based on these features.
  • an audio separation method comprising:
  • the audio files respectively corresponding to the n audio track sets are generated.
  • the training data includes audio samples to be separated and n label tracks corresponding to the audio samples to be separated, the audio samples to be separated include at least two audio tracks, and n is a positive integer;
  • the time-domain feature is used to characterize the harmonic correlation of the audio sample to be separated, and the texture feature is used to characterize the harmonic continuity of the audio sample to be separated;
  • the spectral features are used to characterize the frequency and amplitude information of the audio track sets, and each audio track set includes an audio track or a combination of multiple audio tracks in the audio sample to be separated;
  • An audio acquisition module configured to acquire the audio to be separated, the audio to be separated includes at least two audio tracks;
  • a feature extraction module configured to obtain time-domain features and texture features of the audio to be separated, the time-domain features being used to characterize the harmonic correlation of the audio to be separated, and the texture features being used to characterize the harmonic continuity of the audio to be separated;
  • a spectrum generation module configured to obtain spectral features respectively corresponding to n audio track sets according to the time-domain features and the texture features, the spectral features being used to characterize the frequency and amplitude information of the audio track sets, each audio track set including an audio track or a combination of multiple audio tracks in the audio to be separated, and n being a positive integer;
  • the audio track generating module is configured to generate audio files respectively corresponding to the n audio track sets according to the frequency spectrum characteristics corresponding to the n audio track sets.
  • a training device for an audio separation model comprising:
  • a data acquisition module configured to acquire training data of the audio separation model, the training data including audio samples to be separated and n label tracks corresponding to the audio samples to be separated, the audio samples to be separated including at least two audio tracks, and n being a positive integer;
  • a feature extraction module configured to obtain the time-domain features and texture features of the audio sample to be separated through the audio separation model, the time-domain features being used to characterize the harmonic correlation of the audio sample to be separated, and the texture features being used to characterize the harmonic continuity of the audio sample to be separated;
  • a spectrum generation module configured to obtain spectral features respectively corresponding to n audio track sets according to the time-domain features and the texture features, the spectral features being used to characterize the frequency and amplitude information of the audio track sets, and each audio track set including an audio track or a combination of multiple audio tracks in the audio sample to be separated;
  • a model training module used to calculate the training loss of the audio separation model according to the spectral features corresponding to the n track sets and the spectral features corresponding to the n label tracks, and to train the audio separation model based on the training loss.
  • A computer device includes a processor and a memory, where the memory stores a computer program, and the computer program is loaded and executed by the processor to implement the above audio separation method or the above training method for an audio separation model.
  • A computer-readable storage medium is provided, in which a computer program is stored, and the computer program is loaded and executed by a processor to implement the above audio separation method or the above training method for an audio separation model.
  • A computer program product or computer program includes computer instructions stored in a computer-readable storage medium; a processor reads the computer instructions from the computer-readable storage medium to implement the audio separation method or the training method of the audio separation model as described above.
  • By obtaining the time-domain features and texture features of the audio to be separated and then performing audio separation based on these two features, and because the time-domain features and texture features only contain harmonic-related features and do not include features related to factors such as phase in the audio to be separated, the amount of calculation needed to obtain the time-domain features and texture features of the audio to be separated is small.
  • The time-domain features and texture features obtained by this method are lower-dimensional than audio features obtained by directly convolving the audio to be separated, so this method requires less calculation when performing audio separation, and the audio separation speed is fast.
  • Fig. 1 is a schematic diagram of a scheme implementation environment provided by an embodiment of the present application.
  • Fig. 2 is a flowchart of an audio separation method provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an audio separation process provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an audio separation process provided by another embodiment of the present application.
  • FIG. 5 is a flowchart of an audio separation method provided in another embodiment of the present application.
  • FIG. 6 is a schematic diagram of a network structure of an audio separation model provided by an embodiment of the present application.
  • Fig. 7 is a schematic diagram of another network structure of the audio separation model provided by the embodiment of the present application.
  • Fig. 8 is a schematic diagram of an audio separation method provided by another embodiment of the present application.
  • Fig. 9 is a flowchart of a training method of an audio separation model provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a training method for an audio separation model provided by an embodiment of the present application.
  • Figure 11 is a block diagram of an audio separation device provided by an embodiment of the application.
  • Fig. 12 is a block diagram of a training device for an audio separation model provided by an embodiment of the application.
  • Fig. 13 is a schematic diagram of computer equipment provided by an embodiment of the application.
  • FIG. 1 shows a schematic diagram of a solution implementation environment provided by an embodiment of the present application.
  • the solution implementation environment may include a model training device 10 and an audio processing device 20 .
  • the model training device 10 is an electronic device for training an audio separation model, and the model training device 10 may be an electronic device such as a PC (Personal Computer, personal computer), a server, or the like.
  • the audio separation model trained by the model training device 10 can be deployed in the audio processing device 20 for use.
  • the audio processing device 20 is an electronic device for processing the audio to be separated, and the audio processing device 20 may be an electronic device such as a mobile phone, a tablet computer, an intelligent robot, or a server.
  • the audio processing device 20 may perform audio separation processing on the audio to be separated through an audio separation model to generate n audio track sets, and then obtain audio files respectively corresponding to the n audio track sets, where n is a positive integer.
  • the audio processing device 20 also has functions such as music playback and audio synthesis, which are not limited in this application.
  • the audio processing system may include a terminal device and a server.
  • the terminal device has functions such as audio data transmission, audio playback and data storage, and the server can provide background support for audio processing functions for the terminal device.
  • the audio separation system is mounted on the terminal device, and the audio separation process is performed on the terminal device.
  • After the terminal device obtains the audio to be separated, it performs feature extraction on the audio to be separated to obtain the time-domain features and texture features of the audio to be separated.
  • Based on the time-domain features and texture features of the audio to be separated, the spectral features respectively corresponding to the n audio track sets are obtained.
  • The terminal device then obtains the audio files of the n audio track sets according to the n spectral features, completing the audio separation process.
  • Optionally, the user operating the mobile phone selects only the k audio files that meet the requirements for use.
  • the audio separation system is installed on the server, and the audio separation process is performed on the server.
  • After the terminal device obtains the audio to be separated, it sends the audio to be separated to the server.
  • The server receives the audio to be separated from the terminal device, extracts the time-domain features and texture features of the audio to be separated, obtains the spectral features of the n audio track sets based on these features, and generates the audio files corresponding to the n audio track sets.
  • the server sends n audio files to the terminal device, completing the audio separation process.
  • FIG. 2 shows a flowchart of an audio separation method provided by an embodiment of the present application.
  • The execution subject of each step of the method may be the audio processing device 20 in the solution implementation environment shown in FIG. 1, and the method may include at least one of the following steps (210-240):
  • Step 210: Acquire the audio to be separated, where the audio to be separated includes at least two audio tracks.
  • the audio to be separated refers to an audio file for audio separation.
  • An audio file refers to information obtained by sampling loudness in the time and frequency domains.
  • An audio track records the relationship between a class of audio signals with the same attributes and time.
  • the attributes of an audio track include timbre, timbre library, and input and output channels.
  • Audio tracks include single-track and multi-track.
  • a mono track is also called a mono signal track.
  • the recorded performance audio of a musical instrument belongs to a mono track, and a person singing a cappella also belongs to a mono track.
  • the multi-audio track includes a multi-audio track obtained by superimposing multiple identical audio tracks, or a multi-audio track obtained by superimposing a plurality of different audio tracks.
  • the audio to be separated includes at least two audio tracks.
  • the audio to be separated is audio related to musical instrument ensembles, and the audio to be separated includes audio tracks corresponding to piano, violin, cello, flute, clarinet, euphonium, and timpani.
  • the audio to be separated is song-like audio.
  • the audio to be separated includes vocal tracks and accompaniment tracks.
  • the type of the audio track and the type of audio tracks contained in the audio to be separated are determined according to actual needs, and are not limited here.
  • the audio processing device acquires the audio to be separated, and the audio to be separated is mixed audio composed of multiple audio tracks, and the audio processing device can perform audio separation on the audio to be separated.
  • Step 220: Obtain the time-domain features and texture features of the audio to be separated, where the time-domain features are used to characterize the harmonic correlation of the audio to be separated, and the texture features are used to characterize the harmonic continuity of the audio to be separated.
  • the audio processing device analyzes the audio to be separated to obtain time-domain features and texture features.
  • the time-domain feature contains a plurality of different time-domain feature information, and different audio tracks have different time-domain feature information
  • the texture feature includes a plurality of different texture feature information, and different audio tracks have different texture feature information.
  • Different musical instruments and human voices have different attributes such as timbre and frequency, so different musical instruments and human voices have different harmonics.
  • Texture features are used to represent the continuity of harmonics, that is, the variation patterns and characteristics of the harmonics along the time axis; time-domain features are used to represent the correlation of harmonics, that is, the correlation characteristics among the harmonics and their variation patterns along the time axis.
  • the audio processing device acquires time-domain features and texture features based on frequency-amplitude features of the audio to be separated, and the extracted time-domain features and texture features are used to obtain spectral features of the audio track set.
  • The harmonics of the audio to be separated and the characteristic information of the different track sets can be captured through the time-domain features and texture features of the audio to be separated, which helps the subsequent network generate the spectral features of the n track sets from these features.
  • Step 230: According to the time-domain features and texture features, obtain the spectral features respectively corresponding to the n audio track sets, where the spectral features are used to represent the frequency and amplitude information of the audio track sets, each audio track set includes an audio track or a combination of multiple audio tracks in the audio to be separated, and n is a positive integer.
  • the audio track set refers to the audio tracks obtained after the audio processing device separates the audio to be separated.
  • the set of tracks is a track for a single instrument or vocal.
  • the audio track set is a mixed audio track obtained by mixing multiple audio tracks, for example, the audio track set is a mixed audio track obtained by mixing a vocal track and at least one corresponding audio track of an instrument.
  • the audio track set is a mixed audio track obtained by superimposing the audio tracks corresponding to at least two musical instruments. Spectral features capture how the amplitude information of the track set varies with its frequency information.
  • For example, the audio to be separated is a song containing 5 audio tracks, specifically the audio tracks corresponding to vocals, guitar, bass, electronic synthesizer and drum kit. The spectral features of 4 audio track sets are obtained, specifically the spectral features respectively corresponding to the vocal track set, the guitar track set, the drum kit track set and a mixed track set, where the spectral features corresponding to the mixed track set are those of the combination of the bass track and the electronic synthesizer track. This is only one example of a track combination.
  • Step 240: Generate the audio files respectively corresponding to the n audio track sets according to the spectral features respectively corresponding to the n audio track sets.
  • FIG. 3 shows a schematic diagram of an audio separation process.
  • the audio to be separated is divided into a vocal track and an accompaniment track.
  • FIG. 4 shows a schematic diagram of another audio separation process.
  • the audio to be separated is more finely divided into vocal tracks, piano tracks, bass tracks and other instrument tracks.
  • Other instrument tracks include instrument sounds other than the human voice track, piano track, and guitar track in the audio to be separated.
  • The audio processing device processes the audio to be separated to obtain the spectral features of the n audio track sets, obtains the audio files corresponding to the n audio track sets from these spectral features together with the phase information of the audio to be separated, and completes the audio separation process.
  • That is, the audio processing device obtains the audio file corresponding to an audio track set by processing the spectral features of that audio track set together with the phase information of the audio to be separated.
  • The technical solution provided by the embodiments of this application obtains the time-domain features and texture features of the audio to be separated and then performs audio separation based on these two features. Since the time-domain features and texture features only contain harmonic-related features and do not include features related to factors such as phase in the audio to be separated, the amount of calculation needed to obtain them during audio separation is small.
  • The time-domain features and texture features obtained by this method are lower-dimensional than audio features obtained by directly convolving the audio to be separated; therefore, this method requires less calculation for audio separation and has a faster audio separation speed.
  • the audio separation method provided in this application can extract the audio track sets corresponding to human voice, string accompaniment and drum sound respectively from the audio to be separated.
  • The audio separation method provided by the present application can also separate instrumental-ensemble audio to be separated, obtaining the audio tracks corresponding to each musical instrument, which satisfies the need of music lovers to obtain the audio file of a certain type of musical instrument from the audio to be separated.
  • FIG. 5 shows a schematic diagram of an audio separation method provided by another embodiment of the present application.
  • Step 520: Acquire the time-domain features and texture features of the audio to be separated, where the time-domain features are used to characterize the harmonic correlation of the audio to be separated, and the texture features are used to characterize the harmonic continuity of the audio to be separated.
  • step 520 includes the following sub-steps:
  • the frequency amplitude information of the audio to be separated is referred to as a spectrogram of the audio to be separated.
  • the frequency amplitude information and phase information of the audio to be separated are obtained by performing Fourier transform on the audio to be separated.
  • the audio processing device processes a to-be-separated audio through short-time Fourier transform, and obtains the time-domain feature and texture feature corresponding to the to-be-separated audio.
  • the music signal is not a stationary signal.
  • the signals with differences in the time domain may have very similar frequency spectrums.
  • Direct Fourier transform of the audio to be processed will cause distortion.
  • The audio to be separated is therefore divided into small segments; the signal within each small segment is relatively stationary, and each segment undergoes a Fourier transform to obtain the frequency and amplitude information of the audio to be separated. Using the short-time Fourier transform in this way avoids distortion of the audio to be separated.
  • The time-domain information of the audio to be separated contains a large amount of information, and the phase-related information in it plays a small role in the audio separation process. Therefore, applying a short-time Fourier transform to the audio to be separated, or another method that can separate frequency-domain feature information from the time-domain information, yields the frequency-amplitude information of the audio to be separated; extracting the time-domain features and texture features based on this frequency-amplitude information, as in the sketch below, helps reduce the amount of calculation in the audio separation process and improve the audio separation speed.
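  • The following is a minimal, illustrative sketch (not the implementation described in this application) of obtaining the frequency-amplitude information and phase information via a short-time Fourier transform; the window length, hop size and test signal are assumptions made for the example.

```python
import numpy as np
from scipy.signal import stft

def frequency_amplitude_and_phase(waveform, sample_rate, n_fft=2048, hop=512):
    # Split the audio into short, approximately stationary segments and
    # Fourier-transform each segment (the short-time Fourier transform).
    _, _, spectrum = stft(waveform, fs=sample_rate,
                          nperseg=n_fft, noverlap=n_fft - hop)
    magnitude = np.abs(spectrum)   # frequency-amplitude information (spectrogram)
    phase = np.angle(spectrum)     # phase information, kept for later reconstruction
    return magnitude, phase

# Usage: one second of a 440 Hz tone at 16 kHz as a stand-in "audio to be separated".
sr = 16000
t = np.arange(sr) / sr
mag, ph = frequency_amplitude_and_phase(np.sin(2 * np.pi * 440 * t), sr)
print(mag.shape, ph.shape)  # (frequency bins, time frames)
```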
  • an audio separation model is used to separate the audio to be separated and output a separated tagged audio track.
  • the audio separation model is a neural network model with audio separation functions.
  • the audio separation model is a neural network such as a recurrent neural network or a convolutional neural network, or a combination thereof.
  • the audio separation model includes a frequency-amplitude coding network, a time-domain extraction network, and a texture extraction network.
  • the frequency-amplitude encoding network is used to sort out the characteristics of the frequency-amplitude information to obtain the frequency-amplitude features of the audio to be separated.
  • the frequency-amplitude encoding network is used to convolve the frequency-amplitude information to obtain the frequency-amplitude features.
  • the time domain extraction network is used to extract time domain features, for example, the time domain extraction network is used to extract time domain features based on the first frequency amplitude feature.
  • the texture extraction network is used to extract texture features, for example, the texture extraction network is used to extract texture features based on the second frequency amplitude feature.
  • the frequency-amplitude feature refers to a type of feature information related to frequency and amplitude extracted from the frequency-amplitude information of the audio to be separated.
  • the frequency-amplitude encoding network in the audio separation model sorts out the features of the frequency-amplitude information of the audio to be separated by convolution, and extracts the frequency-amplitude features from the frequency-amplitude information of the audio to be separated.
  • a larger-sized convolution kernel is used in the frequency-amplitude coding network to extract features of the frequency-amplitude information of the audio to be separated.
  • the audio separation model inputs the frequency-amplitude information of the audio to be separated into the frequency-amplitude coding network.
  • The frequency-amplitude coding network includes three convolutional layers; each convolutional layer uses a convolution kernel with a size of 7*7 to convolve the feature information input to that layer, and the output of the last convolutional layer is the frequency-amplitude feature.
  • the size of the convolution kernel in the frequency-amplitude coding network is greater than or equal to 3*3.
  • the number of convolutional layers and the size of the convolution kernel are set according to actual needs, and are not limited here .
  • Using a large-size convolution kernel can abstract the input frequency-amplitude information into spectral features of multiple dimensions, which helps enlarge the receptive field of the convolution process and reduce the coupling of the frequency-amplitude features, so that the subsequent network can better learn the specific features of the audio to be separated from the frequency-amplitude features. A minimal sketch of such an encoder is given below.
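  • A minimal PyTorch sketch of such a frequency-amplitude coding network: three convolutional layers with 7*7 kernels, as in the example above. The channel counts and activation functions are illustrative assumptions, not values specified in this application.

```python
import torch
import torch.nn as nn

class FrequencyAmplitudeEncoder(nn.Module):
    def __init__(self, in_channels=1, out_channels=32):
        super().__init__()
        self.layers = nn.Sequential(
            # Large 7*7 kernels enlarge the receptive field over the spectrogram.
            nn.Conv2d(in_channels, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv2d(32, out_channels, kernel_size=7, padding=3), nn.ReLU(),
        )

    def forward(self, freq_amp_info):
        # freq_amp_info: (batch, 1, frequency bins, time frames)
        return self.layers(freq_amp_info)  # frequency-amplitude feature

# Usage: a magnitude spectrogram with 1025 frequency bins and 128 frames.
encoder = FrequencyAmplitudeEncoder()
feature = encoder(torch.randn(1, 1, 1025, 128))
print(feature.shape)  # torch.Size([1, 32, 1025, 128])
```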
  • In some embodiments, the structure of the frequency-amplitude coding network of the audio separation model is improved so that the frequency-amplitude features extracted through it have a larger number of channels; that is, the first frequency-amplitude feature and the second frequency-amplitude feature obtained after splitting have more channels, and the more data they contain, the better the accuracy of the results separated by the audio separation model.
  • increasing the number of channels of the frequency-amplitude feature will increase the calculation amount of the audio separation model and slow down the audio separation speed.
  • the number of channels of the frequency-amplitude feature, the number of channels of the first frequency-amplitude feature, and the number of channels of the second frequency-amplitude feature can be comprehensively determined according to requirements such as audio separation accuracy and audio separation speed, which are not limited in this application.
  • the audio separation model inputs the first frequency-amplitude feature to the time-domain extraction network, and the time-domain extraction network extracts the time-domain feature of the audio to be separated from the first frequency-amplitude feature.
  • In some embodiments, the audio processing device uses a recurrent neural network as the time-domain extraction network, such as a BiLSTM (Bi-directional Long Short-Term Memory) neural network or a BiGRU (Bi-directional Gated Recurrent Unit) neural network.
  • the audio separation model inputs the second frequency-amplitude feature to the texture extraction network, and the texture extraction network extracts the texture feature of the audio to be separated from the second frequency-amplitude feature.
  • In some embodiments, the audio processing device uses a convolutional neural network as the texture extraction network, for example, a convolutional neural network in which each convolutional layer uses a 3*3 convolution kernel to convolve the second frequency-amplitude feature and obtain the texture features of the audio to be separated.
  • The number of convolutional layers and the size of the convolution kernel of the convolutional neural network are set according to the actual situation, such as the computing power of the device, and are not limited here; a sketch of the two extraction branches is given below.
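  • A minimal PyTorch sketch of the two extraction branches: the frequency-amplitude feature is split into a first and a second subset, a BiLSTM extracts the time-domain features from the first subset, and stacked 3*3 convolutions extract the texture features from the second subset. All sizes and the splitting rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TimeDomainExtractor(nn.Module):
    """BiLSTM over the time axis; models the correlation of harmonics."""
    def __init__(self, channels, freq_bins, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(channels * freq_bins, hidden,
                            batch_first=True, bidirectional=True)

    def forward(self, x):                                    # x: (B, C, F, T)
        b, c, f, t = x.shape
        frames = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # one vector per frame
        out, _ = self.lstm(frames)                           # (B, T, 2 * hidden)
        return out

class TextureExtractor(nn.Module):
    """Stacked 3*3 convolutions; models the continuity of harmonics."""
    def __init__(self, channels):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, x):                                    # x: (B, C, F, T)
        return self.convs(x)

# Usage: split a 32-channel frequency-amplitude feature into two 16-channel subsets.
freq_amp_feature = torch.randn(1, 32, 256, 128)
first, second = freq_amp_feature.chunk(2, dim=1)
time_domain_features = TimeDomainExtractor(16, 256)(first)
texture_features = TextureExtractor(16)(second)
print(time_domain_features.shape, texture_features.shape)
```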
  • the audio separation model also includes an audio track feature extraction network.
  • Step 530 includes the following sub-steps:
  • the dimensions of the time-domain features extracted based on the first frequency-amplitude features are not equal to the dimensions of the texture features extracted based on the second frequency-amplitude features.
  • When the time-domain extraction network uses a recurrent neural network, the number of channels of the time-domain features is smaller than the number of channels of the texture features, because the recurrent neural network performs dimension reduction before outputting the time-domain features.
  • In this case, the time-domain features need to be copied in the channel dimension so that the numbers of channels of the time-domain features and the texture features are equal.
  • the number of channels of time-domain features is 1, and the number of channels of texture features is 2.
  • the audio separation model copies the time-domain features to obtain the copied time-domain features, and uses the copied time-domain features to expand the number of channels of the time-domain features.
  • the number of channels of the time domain feature is changed to 2, which is the same as the number of channels of the texture feature.
  • The time-domain features and texture features, now of the same dimension, are added together at corresponding positions to obtain the mixed feature, as in the sketch below.
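  • A minimal sketch of this dimension matching and fusion, under the assumption that both features are laid out as (batch, channels, frequency, time) tensors:

```python
import torch

def fuse_features(time_domain_feat, texture_feat):
    # time_domain_feat: (B, C_t, F, T); texture_feat: (B, C_x, F, T) with C_x a
    # multiple of C_t. Copy the time-domain features along the channel dimension
    # until the channel counts match, then add the two element-wise.
    repeat = texture_feat.shape[1] // time_domain_feat.shape[1]
    expanded = time_domain_feat.repeat(1, repeat, 1, 1)
    return expanded + texture_feat  # mixed feature

# Usage: 1-channel time-domain features matched to 2-channel texture features.
time_feat = torch.randn(4, 1, 256, 128)
texture_feat = torch.randn(4, 2, 256, 128)
mixed_feature = fuse_features(time_feat, texture_feat)
print(mixed_feature.shape)  # torch.Size([4, 2, 256, 128])
```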
  • the audio separation model includes a frequency-amplitude encoding network, a temporal extraction network, a texture extraction network and a track acquisition network.
  • In other embodiments, the audio separation model includes a combined time-domain extraction network, a combined texture extraction network and an audio track acquisition network, where the combined time-domain extraction network has the capabilities of both the frequency-amplitude coding network and the time-domain extraction network, and the combined texture extraction network has the capabilities of both the frequency-amplitude coding network and the texture extraction network.
  • Step 540: Generate the audio files respectively corresponding to the n audio track sets according to the spectral features respectively corresponding to the n audio track sets.
  • Generating the audio files respectively corresponding to the n audio track sets includes: obtaining the phase information of the audio to be separated, where the phase information is used to characterize the phase of the audio to be separated; and performing an inverse Fourier transform on the spectral features corresponding to each audio track set according to the phase information to generate the audio file corresponding to that audio track set.
  • After the audio processing device performs the short-time Fourier transform on the audio to be separated to generate the frequency-amplitude information, the phase information of the audio to be separated can also be obtained.
  • The audio processing device performs an inverse Fourier transform on the spectral features corresponding to the n audio track sets together with the phase information, generates n audio files, and outputs the n audio files respectively; a sketch of this reconstruction step is given below.
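  • A minimal sketch of this reconstruction step, with the same assumed window and hop sizes as the earlier STFT sketch: the predicted spectral features (a magnitude spectrogram) are combined with the phase information of the audio to be separated and passed through an inverse short-time Fourier transform.

```python
import numpy as np
from scipy.signal import stft, istft

def reconstruct_track(track_magnitude, mixture_phase, sample_rate,
                      n_fft=2048, hop=512):
    # Recombine magnitude and phase into a complex spectrum, then invert it.
    complex_spectrum = track_magnitude * np.exp(1j * mixture_phase)
    _, waveform = istft(complex_spectrum, fs=sample_rate,
                        nperseg=n_fft, noverlap=n_fft - hop)
    return waveform

# Usage: reusing the mixture's own magnitude and phase simply reconstructs the input.
sr = 16000
mixture = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
_, _, spec = stft(mixture, fs=sr, nperseg=2048, noverlap=2048 - 512)
audio = reconstruct_track(np.abs(spec), np.angle(spec), sr)
print(audio.shape)
```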
  • FIG. 8 shows a schematic diagram of an audio separation method provided by an embodiment of the present application.
  • After acquiring the audio to be separated, the audio processing device performs a short-time Fourier transform on the audio to be separated to obtain the frequency and amplitude information of the audio to be separated.
  • The audio processing device inputs the frequency-amplitude information into the audio separation model; the feature encoding network organizes the features of the frequency-amplitude information using large convolution kernels and obtains the high-level features in the frequency-amplitude information, that is, the frequency-amplitude features of the audio to be separated.
  • The audio separation model divides the frequency-amplitude feature into the first frequency-amplitude feature and the second frequency-amplitude feature, where the first frequency-amplitude feature and the second frequency-amplitude feature are subsets of the frequency-amplitude feature;
  • the first frequency-amplitude feature is processed through the time-domain extraction network to obtain the time-domain features;
  • the second frequency-amplitude feature is processed through the texture extraction network to obtain the texture features;
  • the audio separation model performs dimension matching on the time-domain features and the texture features and performs fusion processing to obtain the mixed feature;
  • The audio track feature generation network convolves the mixed feature and finally outputs the spectral features of track set 1 and the spectral features of track set 2; the audio file corresponding to track set 1 is obtained by performing an inverse Fourier transform on the spectral features of track set 1 together with the phase information of the audio to be separated, and the audio file corresponding to track set 2 is obtained by performing an inverse Fourier transform on the spectral features of track set 2 together with the phase information of the audio to be separated.
  • The audio separation model first separates the audio to be separated to obtain n audio track sets (n is greater than or equal to 1), and then selects the audio track required by the user for output. Using this method ensures that the user obtains an audio file corresponding to a specific audio track with better quality.
  • In other embodiments, when the audio separation model separates the audio to be separated, only the audio track required by the user is generated. Using this method reduces the amount of calculation in the audio separation process, speeds up the separation of the audio to be separated, and separates a specific kind of audio track from the audio to be separated in a targeted manner.
  • the training process of the audio separation model is introduced and explained through an embodiment.
  • The content involved in the use of the audio separation model and the content involved in its training process correspond to each other and can be cross-referenced; for details, refer to the description of the other part.
  • FIG. 9 shows a flowchart of a training method for an audio separation model provided by an embodiment of the present application.
  • The execution subject of each step of the method may be the model training device 10 in the solution implementation environment; with the model training device 10 as the execution subject below, the method may include at least one of the following steps (910-940):
  • Step 910: Acquire the training data of the audio separation model, where the training data includes audio samples to be separated and n label tracks corresponding to the audio samples to be separated, the audio samples to be separated include at least two audio tracks, and n is a positive integer.
  • Step 920: Obtain the time-domain features and texture features of the audio sample to be separated through the audio separation model, where the time-domain features are used to characterize the harmonic correlation of the audio sample to be separated, and the texture features are used to characterize the harmonic continuity of the audio sample to be separated.
  • Step 930: According to the time-domain features and texture features, obtain the spectral features respectively corresponding to the n audio track sets, where the spectral features are used to represent the frequency and amplitude information of the audio track sets, and each audio track set includes an audio track or a combination of multiple audio tracks in the audio sample to be separated.
  • Step 940: Calculate the training loss of the audio separation model according to the spectral features corresponding to the n audio track sets and the spectral features corresponding to the n label tracks, and train the audio separation model based on the training loss.
  • Obtaining the training data of the audio separation model includes: obtaining an audio data set, where the audio data set includes a plurality of source track audios; selecting m source track audios from the plurality of source track audios, where m is a positive integer greater than or equal to n; performing audio mixing processing on the m source track audios to obtain the audio samples to be separated; and generating the n label tracks corresponding to the audio samples to be separated based on the m source track audios.
  • Source track audio refers to audio files obtained by recording, electronic synthesis, etc. The source track audio can be obtained from the audio data set, and the source and type of the source track audio are not limited here.
  • Audio mixing processing refers to an operation of mixing m source audio tracks to obtain mixed audio.
  • the model training device 10 aligns the time axes of the m source audio tracks, plays them in a unified manner, completes sound mixing processing, and obtains audio samples to be separated.
  • the label track refers to the type of track that the audio separation model can separate from the audio to be separated, and the trained audio separation model has the ability to separate n label tracks from the audio to be separated.
  • In some embodiments, source track audios with shorter playback durations are played repeatedly to prolong their playback time, and source track audios with longer playback durations are truncated to shorten theirs; the m source track audios, with equal playback durations after processing, are mixed to obtain the audio samples to be separated, as in the sketch below.
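  • A minimal sketch of this sample construction, with toy waveforms and an assumed target length standing in for real source track audios:

```python
import numpy as np

def align_length(track, target_len):
    if len(track) >= target_len:                   # longer: truncate
        return track[:target_len]
    repeats = int(np.ceil(target_len / len(track)))
    return np.tile(track, repeats)[:target_len]    # shorter: play repeatedly

def mix_sources(source_tracks, target_len):
    aligned = [align_length(t, target_len) for t in source_tracks]
    return np.sum(aligned, axis=0), aligned        # mixture plus aligned sources

# Usage: three toy source tracks of different lengths mixed into one sample.
rng = np.random.default_rng(0)
sources = [rng.standard_normal(n) for n in (8000, 12000, 16000)]
audio_sample, aligned_sources = mix_sources(sources, target_len=16000)
print(audio_sample.shape, len(aligned_sources))
```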
  • each label track has a corresponding source track audio.
  • For example, the model training device 10 obtains two source track audios from the audio data set, namely the source track audio corresponding to the human voice and the source track audio corresponding to the guitar sound; the audio separation model includes 2 label tracks, namely the label track corresponding to the human voice and the label track corresponding to the guitar sound. The label track corresponding to the human voice can be obtained directly from the source track audio corresponding to the human voice, and the label track corresponding to the guitar sound can be obtained directly from the source track audio corresponding to the guitar sound.
  • some label audio tracks are obtained by mixing multiple source audio tracks.
  • For example, the model training device 10 obtains 5 source track audios from the audio data set, namely the source track audios corresponding to the piano sound, guitar sound, human voice, drum sound and triangle sound respectively.
  • the audio separation model includes 4 label tracks, which are the label tracks corresponding to piano, guitar, human voice and percussion respectively.
  • The label tracks corresponding to the piano, guitar and vocals can be determined directly from the corresponding source track audios, while the label track corresponding to percussion is determined from the mixed audio obtained by mixing the source track audio corresponding to the drum sound with the source track audio corresponding to the triangle; a sketch of this label construction is given below.
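  • A minimal sketch of this label construction; the grouping dictionary and toy waveforms are illustrative assumptions:

```python
import numpy as np

def build_label_tracks(sources, label_groups):
    """sources: name -> waveform; label_groups: label track name -> source names."""
    return {label: np.sum([sources[name] for name in names], axis=0)
            for label, names in label_groups.items()}

# Usage: piano, guitar and vocal labels come directly from one source each,
# while the percussion label is the mix of the drum and triangle sources.
rng = np.random.default_rng(1)
sources = {name: rng.standard_normal(16000)
           for name in ("piano", "guitar", "vocal", "drum", "triangle")}
label_groups = {"piano": ["piano"], "guitar": ["guitar"],
                "vocal": ["vocal"], "percussion": ["drum", "triangle"]}
label_tracks = build_label_tracks(sources, label_groups)
print(sorted(label_tracks), label_tracks["percussion"].shape)
```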
  • The audio separation model includes a frequency-amplitude encoding network, a time-domain extraction network and a texture extraction network. Obtaining the time-domain features and texture features of the audio sample to be separated through the audio separation model includes: obtaining the frequency-amplitude information of the audio sample to be separated, where the frequency-amplitude information is used to represent the frequency and amplitude information of the audio sample to be separated; convolving the frequency-amplitude information through the frequency-amplitude encoding network to obtain the frequency-amplitude feature; dividing the frequency-amplitude feature to obtain the first frequency-amplitude feature and the second frequency-amplitude feature, where the first frequency-amplitude feature and the second frequency-amplitude feature are subsets of the frequency-amplitude feature, and the frequency-amplitude feature can be obtained by superimposing the two; extracting the time-domain features based on the first frequency-amplitude feature through the time-domain extraction network; and extracting the texture features based on the second frequency-amplitude feature through the texture extraction network.
  • The audio separation model further includes an audio track feature extraction network. Obtaining the spectral features respectively corresponding to the n audio track sets according to the texture features and the time-domain features includes: performing fusion processing on the time-domain features and the texture features to obtain the mixed feature, where the fusion processing refers to unifying the dimensions of the time-domain features and the texture features and adding the features at corresponding dimensions of the dimension-unified time-domain features and the texture features; and processing the mixed feature through the track feature extraction network to generate the spectral features respectively corresponding to the n track sets.
  • the training loss of the audio separation model is calculated according to the spectral features corresponding to the n audio track sets and the spectral features corresponding to the n label audio tracks respectively, including:
  • For each track set in the n track sets, calculate the degree of difference between the spectral features of the track set and the spectral features of the label track corresponding to the track set, obtaining n degrees of difference; according to the n degrees of difference, determine the training loss of the audio separation model.
  • The degree of difference between the spectral features of a track set and the spectral features of the label track corresponding to the track set is used to characterize how much the track set deviates from the corresponding label track.
  • the spectral features of the audio track set and the spectral features of the label audio track have the same dimension, and the degree of difference between the spectral features of a certain audio track set and the spectral features of the label audio track corresponding to the audio track set , obtained by calculating the absolute value of the data difference at the corresponding position in the two spectral features and calculating the average.
  • The degree of difference between the spectral features of a track set and the spectral features of the label track corresponding to the track set can also be calculated by other distance measures, such as the sum of the absolute values of the differences between the spectral features of the track set and those of the corresponding label track; the calculation method of the degree of difference is not limited here.
  • The audio separation model determines its training loss according to the n degrees of difference, for example by calculating the average of the n degrees of difference or by calculating their sum; a sketch of this loss computation is given below.
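  • A minimal PyTorch sketch of this loss computation: the degree of difference for each track set is taken as the mean absolute difference between its predicted spectral features and those of the corresponding label track, and the training loss is the average of the n degrees of difference. Shapes are illustrative assumptions.

```python
import torch

def separation_loss(predicted_specs, label_specs):
    # predicted_specs, label_specs: lists of n tensors with matching shapes.
    differences = [torch.mean(torch.abs(pred - label))
                   for pred, label in zip(predicted_specs, label_specs)]
    return torch.mean(torch.stack(differences))  # or torch.sum(...) instead

# Usage: two track sets with 513-bin, 100-frame spectral features.
preds = [torch.rand(1, 513, 100, requires_grad=True) for _ in range(2)]
labels = [torch.rand(1, 513, 100) for _ in range(2)]
loss = separation_loss(preds, labels)
loss.backward()
print(loss.item())
```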
  • the computer device adjusts the network parameters of each part in the audio separation model. In some embodiments, the computer device adjusts the parameters in the audio separation model using a gradient descent method.
  • By acquiring the training data of the audio separation model, obtaining the time-domain features and texture features of the audio samples to be separated through the audio separation model, obtaining the spectral features respectively corresponding to the n audio track sets according to the time-domain features and texture features, calculating the training loss of the audio separation model according to the spectral features corresponding to the n track sets and the spectral features corresponding to the n label tracks, and training the audio separation model based on the training loss, the trained audio separation model acquires the ability to generate the n label tracks.
  • the spectral features of n audio track sets are obtained by using the time-domain features and texture features of the audio to be separated, the calculation amount in the audio separation process is small, and the audio separation speed is fast.
  • For example, the label tracks of an audio separation model include a piano label track, a guitar label track, a bass label track and a vocal label track.
  • During training, the parameters in the model responsible for separating the piano label track, guitar label track, bass label track and vocal label track interact with each other, achieving an effect similar to transfer learning; this improves the effect of model training, so the audio separation model has a good separation effect and the audio track sets obtained through separation are of better quality.
  • FIG. 10 shows a schematic diagram of an audio separation model training process of the present application.
  • After the model training device 10 acquires the training data, it performs a short-time Fourier transform on the audio samples to be separated in the training data to obtain the frequency-amplitude information of the audio samples to be separated, and inputs the frequency-amplitude information into the audio separation model.
  • the feature encoding network uses a large-scale convolution kernel to sort out the features of the frequency-amplitude information, and obtains the high-level features in the frequency-amplitude information, that is, the frequency-amplitude features of the audio samples to be separated.
  • the audio separation model divides the frequency-amplitude feature into the first frequency-amplitude feature and the second-frequency-amplitude feature; the first frequency-amplitude feature is extracted through the time-domain extraction network to obtain the time-domain feature; the second frequency-amplitude feature is obtained through the texture extraction network Features are extracted to obtain texture features; the audio separation model matches the dimensions of time-domain features and texture features, and performs fusion processing to obtain mixed features.
  • The track feature generation network convolves the mixed feature and finally outputs the spectral features of the track sets; optionally, when the audio separation model has n label tracks, the track feature generation network outputs n spectral features, and the n spectral features correspond to the n audio track sets respectively.
  • For example, the track separation network outputs the spectral features corresponding to track set 1, track set 2 and track set 3; the differences between the spectral features of the three track sets and the spectral features of the corresponding label tracks are calculated to obtain the training loss of the audio separation model; based on the training loss, the parameters of the audio separation model are adjusted, and the above steps are repeated until the training loss of the audio separation model converges to the target value, completing the training of the audio separation model. A sketch of this training loop is given below.
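  • A minimal PyTorch sketch of this training loop, with a tiny stand-in model and synthetic spectra in place of the real audio separation model and training data:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 3, kernel_size=3, padding=1)       # stand-in separation model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

mixture_spec = torch.rand(8, 1, 64, 64)                  # frequency-amplitude input
label_specs = 0.5 * mixture_spec.repeat(1, 3, 1, 1)      # spectra of 3 label tracks
target_loss = 0.05

for step in range(500):
    predicted_specs = model(mixture_spec)                # spectra of the 3 track sets
    loss = torch.mean(torch.abs(predicted_specs - label_specs))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() <= target_loss:                       # training loss converged
        break
print(step, loss.item())
```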
  • FIG. 11 shows a block diagram of an audio separation device provided by an embodiment of the present application.
  • the device has the function of realizing the above-mentioned audio separation method, and the function can be realized by hardware, and can also be realized by executing corresponding software by hardware.
  • the apparatus may be the audio processing device introduced above, or may be set in the audio processing device.
  • the apparatus 1100 may include: an audio acquisition module 1110 , a feature extraction module 1120 , a frequency spectrum generation module 1130 and an audio track generation module 1140 .
  • the audio acquisition module 1110 is configured to acquire the audio to be separated, and the audio to be separated includes at least two audio tracks.
  • The feature extraction module 1120 is configured to acquire the time-domain features and texture features of the audio to be separated, where the time-domain features are used to characterize the harmonic correlation of the audio to be separated, and the texture features are used to characterize the harmonic continuity of the audio to be separated.
  • The spectrum generation module 1130 is configured to obtain the spectral features respectively corresponding to the n audio track sets according to the time-domain features and the texture features, where the spectral features are used to characterize the frequency and amplitude information of the audio track sets, each audio track set includes one audio track or a combination of multiple audio tracks in the audio to be separated, and n is a positive integer.
  • the audio track generation module 1140 is configured to generate audio files respectively corresponding to the n audio track sets according to the frequency spectrum characteristics respectively corresponding to the n audio track sets.
  • the feature extraction module 1120 includes: a frequency amplitude information acquisition submodule and a feature extraction submodule.
  • the frequency-amplitude information acquisition submodule is configured to acquire the frequency-amplitude information of the audio to be separated, and the frequency-amplitude information is used to characterize the frequency and amplitude information of the audio to be separated.
  • the feature extraction submodule is configured to extract the time-domain feature and the texture feature based on the frequency-amplitude information.
  • The feature extraction submodule is configured to convolve the frequency-amplitude information to obtain a frequency-amplitude feature; divide the frequency-amplitude feature to obtain a first frequency-amplitude feature and a second frequency-amplitude feature, where the first frequency-amplitude feature and the second frequency-amplitude feature are subsets of the frequency-amplitude feature and can be superimposed to obtain the frequency-amplitude feature; extract the time-domain features based on the first frequency-amplitude feature; and extract the texture features based on the second frequency-amplitude feature.
  • The audio track generation module 1140 is configured to obtain the phase information of the audio to be separated, where the phase information is used to characterize the phase of the audio to be separated; and to perform an inverse Fourier transform on the spectral features corresponding to each audio track set according to the phase information to generate the audio file corresponding to that audio track set.
  • FIG. 12 shows a block diagram of an audio separation model training device provided by an embodiment of the present application.
  • the device has the function of realizing the above-mentioned training method of the audio separation model, and the function can be realized by hardware, and can also be realized by hardware executing corresponding software.
  • the apparatus may be the model training device 10 introduced above, or may be set in the model training device 10 .
  • the apparatus 1200 may include: a data acquisition module 1210 , a feature extraction module 1220 , a spectrum generation module 1230 and a model training module 1240 .
  • the data acquisition module 1210 is configured to acquire training data of the audio separation model, the training data includes audio samples to be separated and n label tracks corresponding to the audio samples to be separated, and the audio samples to be separated include at least two tracks, n is a positive integer.
  • the feature extraction module 1220 is configured to obtain the time-domain features and texture features of the audio samples to be separated through the audio separation model, the time-domain features are used to characterize the harmonic correlation of the audio samples to be separated, the The texture feature is used to characterize the harmonic continuity of the audio sample to be separated.
  • The spectrum generation module 1230 is configured to obtain the spectral features respectively corresponding to the n audio track sets according to the time-domain features and the texture features, where the spectral features are used to characterize the frequency and amplitude information of the audio track sets, and each audio track set includes an audio track or a combination of multiple audio tracks in the audio samples to be separated.
  • the model training module 1240 is used to calculate the training loss of the audio separation model according to the spectral features corresponding to the n audio track sets and the spectral features corresponding to the n label audio tracks, and based on the training The loss trains the audio separation model.
  • The data acquisition module 1210 is configured to select m source track audios from the plurality of source track audios, where m is a positive integer greater than or equal to n; perform audio mixing on the m source track audios to obtain the audio samples to be separated; and generate the n label tracks corresponding to the audio samples to be separated based on the m source track audios.
  • the audio separation model includes a frequency-amplitude encoding network, a temporal extraction network, and a texture extraction network.
  • the feature extraction module 1220 is configured to obtain frequency-amplitude information of the audio sample to be separated, the frequency-amplitude information being used to characterize the frequency and amplitude information of the audio sample to be separated; to convolve the frequency-amplitude information through the frequency-amplitude encoding network to obtain a frequency-amplitude feature; to divide the frequency-amplitude feature into a first frequency-amplitude feature and a second frequency-amplitude feature, where the first and second frequency-amplitude features are subsets of the frequency-amplitude feature and can be superimposed to recover the frequency-amplitude feature; to extract the time-domain feature based on the first frequency-amplitude feature through the time-domain extraction network; and to extract the texture feature based on the second frequency-amplitude feature through the texture extraction network.
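The three sub-networks named above might be sketched as follows. The large (e.g. 7x7) encoder kernels, the 3x3 texture kernels and the recurrent time-domain branch are taken loosely from the description, while the exact depths, channel counts, tensor shapes and the use of PyTorch are illustrative assumptions rather than the claimed implementation.

```python
import torch
import torch.nn as nn

class FreqAmpEncoder(nn.Module):
    """Frequency-amplitude encoding network: a few large-kernel convolutions."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=7, padding=3), nn.ReLU(),
        )

    def forward(self, freq_amp_info):            # (batch, 1, freq, time)
        return self.net(freq_amp_info)           # frequency-amplitude feature

class TextureExtractor(nn.Module):
    """Texture extraction network: small 3x3 kernels over the second subset."""
    def __init__(self, channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, second_half):
        return self.net(second_half)

class TimeDomainExtractor(nn.Module):
    """Time-domain extraction network: a recurrent layer over the time axis."""
    def __init__(self, freq_bins=512, channels=32, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(freq_bins * channels, hidden,
                           batch_first=True, bidirectional=True)

    def forward(self, first_half):                # (batch, channels, freq, time)
        b, c, f, t = first_half.shape
        seq = first_half.permute(0, 3, 1, 2).reshape(b, t, c * f)
        out, _ = self.rnn(seq)                    # (batch, time, 2 * hidden)
        return out
```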
  • the spectrum generation module 1230 is configured to perform fusion processing on the time-domain feature and the texture feature to obtain a mixed feature, where the fusion processing refers to unifying the dimensions of the time-domain feature and the texture feature and then adding the features of corresponding dimensions in the dimension-unified time-domain feature and texture feature; and to process the mixed feature through the track feature extraction network to generate the spectral features respectively corresponding to the n track sets.
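A hedged sketch of the fusion step follows: when the time-domain feature has fewer channels than the texture feature, it is replicated along the channel axis until the dimensions match, after which the two features are added element-wise. The tensor layout and the helper name `fuse` are assumptions.

```python
import torch

def fuse(time_domain_feature, texture_feature):
    """Hypothetical fusion: replicate the time-domain feature along the channel
    axis until its dimensions match the texture feature, then add element-wise."""
    if time_domain_feature.shape[1] != texture_feature.shape[1]:
        repeats = texture_feature.shape[1] // time_domain_feature.shape[1]
        time_domain_feature = time_domain_feature.repeat(1, repeats, 1, 1)
    return time_domain_feature + texture_feature          # mixed feature

# The mixed feature would then be passed to the track feature extraction
# network (a fully convolutional network such as U-Net, per the description).
```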
  • the model training module 1240 is configured to calculate, for each of the n track sets, the degree of difference between the spectral feature of the track set and the spectral feature of the label track corresponding to the track set, obtaining n degrees of difference, and to determine the training loss of the audio separation model according to the n degrees of difference.
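The loss described here can be sketched as one degree of difference per track set, computed as a mean absolute gap between spectral features and then reduced over the n track sets; whether the n values are averaged or summed is left open in the description, so the reduction below is only one option.

```python
import torch

def separation_loss(predicted_specs, label_specs):
    """Hypothetical loss: one degree of difference per track set, taken here as
    the mean absolute gap between predicted and label spectral features."""
    diffs = [torch.mean(torch.abs(p - y)) for p, y in zip(predicted_specs, label_specs)]
    return torch.stack(diffs).mean()   # the description also allows summing the n values
```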
  • it should be noted that when the apparatus provided by the above embodiments implements its functions, the division into the above functional modules is only used as an example; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
  • the apparatus provided by the above embodiments and the corresponding method embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which will not be repeated here.
  • FIG. 13 shows a schematic diagram of a computer device provided by an embodiment of the present application.
  • the computer device 1300 may be the audio processing device 20 in the implementation environment shown in FIG. 1, used to implement the above-mentioned audio separation method, or the model training device 10 in the implementation environment shown in FIG. 1, used to implement the above-mentioned training method of the audio separation model.
  • a computer device 1300 includes: a processor 1301 and a memory 1302 .
  • the processor 1301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 1301 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field Programmable Gate Array), or PLA (Programmable Logic Array).
  • the processor 1301 may also include a main processor and a coprocessor; the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in the standby state.
  • in some embodiments, the processor 1301 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that needs to be displayed on the display screen.
  • the processor 1301 may also include an AI (Artificial Intelligence) processor, where the AI processor is used to process computing operations related to machine learning.
  • Memory 1302 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 1302 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
  • those skilled in the art can understand that the structure shown in FIG. 13 does not constitute a limitation on the computer device 1300, which may include more or fewer components than shown in the figure, combine certain components, or adopt a different component arrangement.
  • a computer program is stored in the memory of the computer device, and the computer program is loaded and executed by the processor to implement the audio separation method or the training method of the audio separation model as described above.
  • the present application also provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program is loaded and executed by a processor to implement the audio separation method or the training method of the audio separation model as described above.
  • the computer storage medium includes RAM, ROM, flash memory or other solid-state storage technologies, optical storage such as CD-ROM, magnetic tape cassettes, magnetic tape, magnetic disk storage, and the like.
  • the present application also provides a computer program product or computer program, the computer program product or computer program includes computer instructions, the computer instructions are stored in a computer-readable storage medium, and the processor reads the computer instructions from the computer-readable storage medium , so as to realize the audio separation method or the training method of the audio separation model provided by the above method embodiments.
  • the "plurality” mentioned herein refers to two or more than two.
  • "and/or" describes the association relationship of associated objects, indicating that three relationships may exist; for example, A and/or B may indicate: A exists alone, both A and B exist, or B exists alone.
  • the character "/” generally indicates that the contextual objects are an "or” relationship.
  • the numbering of the steps described herein only exemplarily shows one possible execution order among the steps. In some other embodiments, the above steps may not be executed in the numbered order; for example, two steps with different numbers may be executed at the same time, or two steps with different numbers may be executed in an order opposite to that shown in the figures, which is not limited in the embodiments of the present application.

Abstract

本申请公开了一种音频分离方法、装置、设备、存储介质及程序产品,属于音频处理技术领域。所述方法包括:获取待分离音频,待分离音频包括至少两个音轨;获取待分离音频的时域特征和纹理特征,时域特征和纹理特征分别用于表征待分离音频的谐波相关性和谐波连续性;根据时域特征和纹理特征,获得n个音轨集分别对应的频谱特征,频谱特征用于表征音轨集的频率和振幅信息,每个音轨集包括待分离音频中的若干个音轨的组合;根据n个音轨集分别对应的频谱特征,生成n个对应的音频文件。本申请提供了一种分离效果好的音频分离方法,通过时域特征和纹理特征生成的音轨集分别对应的频谱特征,减小了音频分离过程中设备的计算量,提高了音频分离的速度。

Description

音频分离方法、装置、设备、存储介质及程序产品 技术领域
本申请涉及音频处理技术领域,特别涉及一种音频分离方法、装置、设备、存储介质及程序产品。
背景技术
音乐是混合着人声和各种不同乐器声音的音频文件,将音频文件进行分离,获得音频文件中的多个独立音轨,在音乐混音、伴奏提取等方面有重要应用。
相关技术中,使用基于卷积神经网络的音频分离方法对待分离音频进行音频分离,使用此方法进行人声和伴奏声的分离时,先将待分离音频输入音频分离模型,通过音频分离模型对待分离音频进行卷积处理,分别获取人声特征和伴奏特征,基于分离出的人声特征和伴奏特征生成分离后的人声音轨和伴奏音轨。
然而,上述音频分离方法对音轨进行分离时,分离过程的计算量较大,分离速度较慢。
发明内容
本申请实施例提供了一种音频分离方法、装置、设备、存储介质及程序产品,在对待分离音频进行分离获得多个音轨集的过程中,计算量小,分离速度快。技术方案如下:
根据本申请实施例的一个方面,提供了一种音频分离方法,所述方法包括:
获取待分离音频,所述待分离音频包括至少两个音轨;
获取所述待分离音频的时域特征和纹理特征,所述时域特征用于表征所述待分离音频的谐波相关性,所述纹理特征用于表征所述待分离音频的谐波连续性;
根据所述时域特征和所述纹理特征,获得n个音轨集分别对应的频谱特征,所述频谱特征用于表征所述音轨集的频率和振幅信息,每个音轨集包括所述待分离音频中的一个音轨或者多个音轨的组合,n为正整数;
根据所述n个音轨集分别对应的频谱特征,生成所述n个音轨集分别对应的音频文件。
根据本申请实施例的一个方面,提供了一种音频分离模型的训练方法,所述方法包括:
获取所述音频分离模型的训练数据,所述训练数据包括待分离音频样本和所述待分离音频样本对应的n个标签音轨,所述待分离音频样本包括至少两个音轨,n为正整数;
通过所述音频分离模型获取所述待分离音频样本的时域特征和纹理特征,所述时域特征用于表征所述待分离音频样本的谐波相关性,所述纹理特征用于表征所述待分离音频样本的谐波连续性;
根据所述时域特征和所述纹理特征,获得n个音轨集分别对应的频谱特征,所述频谱特征用于表征所述音轨集的频率和振幅信息,每个音轨集包括所述待分离音频样本中的一个音轨或者多个音轨的组合;
根据所述n个音轨集分别对应的频谱特征,以及所述n个标签音轨分别对应的频谱特征,计算所述音频分离模型的训练损失,并基于所述训练损失对所述音频分离模型进行训练。
根据本申请实施例的一个方面,提供了一种音频分离装置,所述装置包括:
音频获取模块,用于获取待分离音频,所述待分离音频包括至少两个音轨;
特征提取模块,用于获取所述待分离音频的时域特征和纹理特征,所述时域特征用于表征所述待分离音频的谐波相关性,所述纹理特征用于表征所述待分离音频的谐波连续性;
频谱生成模块,用于根据所述时域特征和所述纹理特征,获得n个音轨集分别对应的频谱特征,所述频谱特征用于表征所述音轨集的频率和振幅信息,每个音轨集包括所述待分离音频中的一个音轨或者多个音轨的组合,n为正整数;
音轨生成模块,用于根据所述n个音轨集分别对应的频谱特征,生成所述n个音轨集分别对应的音频文件。
根据本申请实施例的一个方面,提供了一种音频分离模型的训练装置,所述装置包括:
数据获取模块,用于获取所述音频分离模型的训练数据,所述训练数据包括待分离音频样本和所述待分离音频样本对应的n个标签音轨,所述待分离音频样本包括至少两个音轨,n为正整数;
特征提取模块,用于通过所述音频分离模型获取所述待分离音频样本的时域特征和纹理特征,所述时域特征用于表征所述待分离音频样本的谐波相关性,所述纹理特征用于表征所述待分离音频样本的谐波连续性;
频谱生成模块,用于根据所述时域特征和所述纹理特征,获得n个音轨集分别对应的频谱特征,所述频谱特征用于表征所述音轨集的频率和振幅信息,每个音轨集包括所述待分离音频样本中的一个音轨或者多个音轨的组合;
模型训练模块,用于根据所述n个音轨集分别对应的频谱特征,以及所述n个标签音轨分别对应的频谱特征,计算所述音频分离模型的训练损失,并基于所述训练损失对所述音频分离模型进行训练。
根据本申请实施例的一个方面,提供了一种计算机设备,上述计算机设备包括:处理器和存储器,上述存储器存储有计算机程序,上述计算机程序由上述处理器加载并执行以实现上述音频分离方法或音频分离模型的训练方法。
根据本申请实施例的一个方面,提供了一种计算机可读存储介质,上述计算机可读存储介质中存储有计算机程序,上述计算机程序由处理器加载并执行以实现上述音频分离方法或音频分离模型的训练方法。
根据本申请实施例的一个方面,提供了一种计算机程序产品或计算机程序,上述计算机程序产品或计算机程序包括计算机指令,上述计算机指令存储在计 算机可读存储介质中,处理器从上述计算机可读存储介质读取上述计算机指令,以实现如上所述音频分离方法或音频分离模型的训练方法。
本申请实施例提供的技术方案可以带来如下有益效果:
通过获取待分离音频的时域特征和纹理特征,然后基于这两方面特征进行音频分离,由于时域特征和纹理特征中只含有与谐波的相关特征,不包含待分离音频中与相位等因素相关的特征,因此在音频分离的过程中,获取待分离音频的时域特征和频域特征的计算量小,本方法获取待分离音频的时域特征和频域特征比直接通过待分离音频进行卷积获得的音频特征的维度更小,因此,本方法进行音频分离时的计算量较小,音频分离速度快。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请一个实施例提供的方案实施环境的示意图;
图2是本申请一个实施例提供的音频分离方法的流程图;
图3是本申请一个实施例提供的音频分离过程的示意图;
图4是本申请另一个实施例提供的音频分离过程的示意图;
图5是本申请另一个实施例提供的音频分离方法的流程图;
图6是本申请实施例提供的音频分离模型的一种网络结构的示意图;
图7是本申请实施例提供的音频分离模型的另一种网络结构的示意图;
图8是本申请另一个实施例提供的音频分离方法的示意图;
图9是本申请一个实施例提供的音频分离模型的训练方法的流程图;
图10是本申请一个实施例提供的音频分离模型的训练方法的示意图;
图11是本申请一个实施例提供的音频分离装置的框图；
图12是本申请一个实施例提供的音频分离模型的训练装置的框图；
图13是本申请一个实施例提供的计算机设备的示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
请参考图1,其示出了本申请一个实施例提供的方案实施环境的示意图。该方案实施环境可以包括模型训练设备10和音频处理设备20。
模型训练设备10是用于训练音频分离模型的电子设备,该模型训练设备10可以是诸如PC(Personal Computer,个人计算机)、服务器等电子设备。模型训练设备10训练得到的音频分离模型可以部署在音频处理设备20中使用。
音频处理设备20是用于对待分离音频进行处理的电子设备,该音频处理设备20可以是诸如手机、平板电脑、智能机器人、服务器等电子设备。音频处理设备20可以通过音频分离模型对待分离音频进行音频分离处理,生成n个音轨集,进而获得n个音轨集分别对应的音频文件,n为正整数。可选地,音频处理 设备20还具有音乐播放,音频合成等功能,本申请对此不作限定。
本申请实施例提供的技术方案,可以应用于任何有需求对音频文件进行分离处理的实际应用场景中。
音频处理系统可以包括终端设备和服务器,终端设备具有音频数据传输,音频播放和数据存储等功能,服务器能够为终端设备提供音频处理功能的后台支持。
在一个示例中,音频分离系统搭载在终端设备上,音频分离过程在终端设备上进行,终端设备获取待分离音频后,对待分离音频进行特征提取得到待分离音频的时域特征和纹理特征,根据待分离音频的时域特征和纹理特征获得n个音轨集分别对应的频谱特征。终端设备根据n个频幅特征分别得到n个音轨集的音频文件,完成音频分离过程。可选地,n个音轨集对应的音频文件中有k个音频文件满足手机操纵者的需求,k是小于等于n的正整数,则手机操纵者只挑选满足需求的k个音频文件进行使用。
在另一个示例中，音频分离系统搭载在服务器上，音频分离过程在服务器上进行，终端设备获取待分离音频后，将待分离音频发送给服务器，服务器接收终端设备发送的待分离音频，并提取待分离音频的时域特征和纹理特征，基于时域特征和纹理特征得到n个音轨集的频谱特征，并生成n个音轨集对应的音频文件。服务器将n个音频文件发送给终端设备，完成了音频分离过程。
当然,上文介绍的示例性应用场景,仅是为了便于理解本申请技术方案而介绍的一些典型的应用场景,本申请技术方案还可应用于其他有需求对音频文件进行分离的实际应用场景中,本申请实施例对此不作限定。
请参考图2,其示出了本申请一个实施例提供的音频分离方法的流程图,该方法各步骤的执行主体可以是图1所示方案实施环境中的音频处理设备20,该方法可以包括如下几个步骤(210-240)中的至少一个步骤:
步骤210,获取待分离音频,待分离音频包括至少两个音轨。
待分离音频是指用于进行音频分离的音频文件。音频文件是指在时域和频域上对响度进行采样获得的信息。音轨记录了一类具有相同属性的音频信号与时间的关系,音轨的属性包括音色、音色库和输入输出通道等。音轨包括单音轨和多音轨。单音轨又称为单声道信号音轨,例如,录制的一种乐器的演奏音频属于一个单音轨,某个人物清唱也属于一个单音轨。多音轨包括对多个相同音轨进行叠加得到的多音轨,或者将多个不同的音轨进行叠加得到的多音轨。待分离音频中包含至少两个音轨,例如待分离音频是乐器合奏相关的音频,待分离音频中包含钢琴、小提琴、大提琴、长笛、单簧管、低音号和定音鼓分别对应的音轨。又例如,待分离音频是歌曲类音频,待分离音频中包含人声音轨和伴奏音轨,伴奏音轨中又包含吉他、贝斯、电子合成器和架子鼓分别对应的音轨,待分离音频的类型及待分离音频中包含的音轨种类根据实际需要确定,在此不进行限定。音频处理设备获取待分离音频,待分离音频是多个音轨组成的混合音频,音频处理设备能够对待分离音频进行音频分离。
步骤220,获取待分离音频的时域特征和纹理特征,时域特征用于表征待分 离音频的谐波相关性,纹理特征用于表征待分离音频的谐波连续性。
音频处理设备对待分离音频进行分析,获取时域特征和纹理特征。时域特征中包含多个不同的时域特征信息,不同的音轨具有不同的时域特征信息,纹理特征中包含多个不同的纹理特征信息,不同的音轨具有不同的纹理特征信息。声音在时间轴上表现为上下震动的轨迹,这些轨迹称为谐波。不同的乐器以及人声具有的音色、频率等属性不相同,因此不同的乐器和人声具有不同的谐波。纹理特征用于表示谐波的连续性,即谐波沿着时间轴方向的变化规律和特征,时域特征用于表示谐波的相关性,即时域特征中包括谐波上下震动的变化规律和特征,以及时间轴方向的变化规律和特征。
音频处理设备基于待分离音频的频幅特征获取时域特征和纹理特征,提取出的时域特征和纹理特征用于得到音轨集的频谱特征。
通过待分离音频的时域特征和纹理特征能够掌握待分离音频的谐波特性和不同音轨集的特征信息,有利于后续网络通过待分离音频的时域特征和纹理特征生成n个音轨集的频谱特征。
步骤230,根据时域特征和纹理特征,获得n个音轨集分别对应的频谱特征,频谱特征用于表征音轨集的频率和振幅信息,每个音轨集包括待分离音频中的一个音轨或者多个音轨的组合,n为正整数。
音轨集是指音频处理设备对待分离音频进行分离后得到的音轨。在一些实施例中,音轨集是单个乐器或人声对应的音轨。在另一些实施例中,音轨集是多个音轨混合后得到的混合音轨,例如音轨集是人声音轨和至少一个乐器对应的音轨混合得到的混合音轨。又例如,音轨集是由至少两个乐器对应的音轨进行叠加得到的混合音轨。频谱特征包含音轨集的振幅信息随频率信息变化而变化的特征。例如,待分离音频是歌曲,待分离音频中包含5个音轨,具体为人声、吉他、贝斯、电子合成器和架子鼓分别对应的音轨对待分离音频进行音频分离后获得的4个音轨集,具体为人声音轨集、吉他音轨集、架子鼓音轨集和混合音轨集分别对应的频谱特征,其中混合音轨集对应的频谱特征中包括由贝斯音轨和电子合成器音轨组合对应的频谱特征。
步骤240,根据n个音轨集分别对应的频谱特征,生成n个音轨集分别对应的音频文件。
请参考图3,其示出了一种音频分离过程的示意图。例如,根据实际需要,在一些实施例中,待分离音频被分成人声音轨和伴奏音轨。
请参考图4,其示出了另一种音频分离过程的示意图。在另一些实施例中,待分离音频被更细致地划分,分成了人声音轨、钢琴音轨、贝斯音轨和其他乐器音轨。其他乐器音轨中包含待分离音频中除了人声音轨、钢琴音轨、吉他音轨之外的乐器声音。
音频处理设备根据音频分离获得n个音轨集的频谱特征进行处理,通过n个音轨集的频谱特征与待分离音频的相位信息分别获得n个音轨集对应的音频文件,完成音频分离过程。以某一个音轨集为例,音频处理设备通过对该音轨集的频谱特征和待分离音频的相位信息进行处理,获得该音轨集对应的频谱文 件。
综上所述,本申请实施例提供的技术方案,通过获取待分离音频的时域特征和纹理特征,然后基于这两方面特征进行音频分离,由于时域特征和纹理特征中只含有与谐波的相关特征,不包含待分离音频中与相位等因素相关的特征,因此在音频分离的过程中,获取待分离音频的时域特征和频域特征的计算量小,本方法获取待分离音频的时域特征和频域特征比直接通过待分离音频进行卷积获得的音频特征的维度更小,因此,本方法进行音频分离时的计算量较小,音频分离速度快。
此外,通过改变参数n的大小,能够获得多种音轨集,解决了相关技术中只能获取人声音轨和伴奏音轨的限制。例如,本申请提供的音频分离方法能从待分离音频中提取出人声、弦乐伴奏和鼓声分别对应的音轨集。又例如,本申请提供的音频分离方法还能对器乐合奏类的待分离音频进行分离,得到各个乐器分别对应的音轨,满足了音乐爱好者从待分离音频中获取某一类乐器音频文件的需求。
下面通过两个实施例对获取待分离音频的时域特征和纹理特征的过程进行介绍。
请参考图5,其示出了本申请另一个实施例提供的音频分离方法的示意图。
步骤510,获取待分离音频,待分离音频包括至少两个音轨。
步骤520,获取待分离音频的时域特征和纹理特征,时域特征用于表征待分离音频的谐波相关性,纹理特征用于表征待分离音频的谐波连续性。
在一些实施例中,步骤520包括以下几个子步骤:
步骤522,获取待分离音频的频幅信息,频幅信息用于表征待分离音频的频率和振幅信息。
可选地,待分离音频的频幅信息称为待分离音频的频谱图。在一些实施例中,通过对待分离音频进行傅里叶变换,获得该待分离音频的频幅信息和相位信息。例如,音频处理设备通过短时傅里叶变换对某个待分离音频进行处理,获得该待分离音频对应的时域特征和纹理特征。
由音乐信号的波形图可知,音乐信号不属于平稳的信号,在一些情况下,在时域上有差异的信号,频谱之间可能十分相似,直接对待处理音频进行傅里叶变换会导致失真。采用短时傅里叶变换对待处理音频进行处理,通过加窗的方式,对待分离音频进行时域上的分割,获得若干个小片段,这些小的片段中的信号比较平稳,对小频段中的信号进行傅里叶变换,得到待分离音频的频幅信息,使用短时傅里叶变换能够避免造成待分离音频的失真。由于待分离音频的时域信息中包含的信息量较大,并且时域信息中与相位相关的信息在音频分离过程中起到的作用较小,因此,通过对待分离音频进行短时傅里叶变换或其他能从待分离音频的时域信息中分离出频域特征信息的方法,获取待分离音频的频幅信息,并基于待分离音频的频幅信息提取时域特征和纹理特征,有助于减少音频分离过程中的计算量,提高音频分离的速度。
步骤524,基于频幅信息提取时域特征和纹理特征。
在一些实施例中,基于频幅信息提取时域特征和纹理特征,包括:对频幅信息进行卷积,得到频幅特征;对频幅特征进行划分,得到第一频幅特征和第二频幅特征;其中,第一频幅特征和第二频幅特征是频幅特征的子集,将第一频幅特征和第二频幅特征进行叠加得到频幅特征;基于第一频幅特征提取时域特征;基于第二频幅特征提取纹理特征。
在一些实施例中,采用音频分离模型对待分离音频进行分离并输出分离后的标签音轨。音频分离模型是具有音频分离功能的神经网络模型,例如音频分离模型是递归神经网络、卷积神经网络和循环神经网络等神经网络及其之间的相互组合。
可选地,音频分离模型包括频幅编码网络、时域提取网络和纹理提取网络。其中,频幅编码网络用于对频幅信息进行特征梳理,获得待分离音频的频幅特征,例如频幅编码网络用于对频幅信息进行卷积,得到频幅特征。时域提取网络用于提取时域特征,例如时域提取网络用于基于第一频幅特征提取时域特征。纹理提取网络用于提取纹理特征,例如纹理提取网络用于基于第二频幅特征提取纹理特征。
频幅特征是指一类从待分离音频的频幅信息中提取出的与频率和振幅相关的特征信息。音频分离模型中的频幅编码网络通过卷积的方式,对待分离音频的频幅信息进行特征梳理,从待分离音频的频幅信息中提取频幅特征。在一些实施例中,频幅编码网络中使用尺寸较大的卷积核对待分离音频的频幅信息进行特征提取。例如,音频分离模型将待分离音频的频幅信息输入频幅编码网络中,频幅编码网络中包括三个卷积层,每个卷积层中使用尺寸为7*7的卷积核对输入该卷积层的特征信息进行卷积,最后一个卷积层输出即为频幅特征。可选地,频幅编码网络中的卷积核尺寸大于等于3*3,频幅编码网络中,卷积层的层数和卷积核的大小根据实际需要进行设定,在此不进行限定。
在频幅编码网络的卷积层中使用大尺寸卷积核进行卷积能够将输入的频幅信息抽象成多个维度的频谱特征,有利于增加卷积过程的感受野的范围,减少频幅特征的耦合,有助于后续网络更好地从频幅特征中学习待分离音频的具体特征。
音频分离模型对频幅编码网络提取的频幅特征进行划分,通过划分得到第一频幅特征和第二频幅特征。例如,某个频幅特征是具有宽度、时间和通道数(channel)三个维度的矩阵,该频幅特征的通道数为64,每个通道上对应的时间和宽度大小相等,音频分离模型将该频幅特征的前32个通道作为第一频幅特征,后32个通道作为第二频幅特征。在一些实施例中,出于保证音频分离后得到的音轨集的准确度更高等目的,对音频分离模型的频幅编码网络进行结构改进,使得通过频幅编码网络提取的频幅特征的通道数更大,也即分离后得到的第一频幅特征和第二频幅特征的通道数更大,第一频幅特征和第二频幅特征中包含的数据越多,能够使得音频分离模型分离出的结果的准确度越好。但是增大频幅特征的通道数,会导致音频分离模型的计算量增大,音频分离速度减慢。频幅特征的通道数、第一频幅特征的通道数和第二频幅特征的通道数可以根据 音频分离的准确度、音频分离速度等要求综合确定,本申请不进行限定。
音频分离模型将第一频幅特征输入到时域提取网络，时域提取网络从第一频幅特征中提取出待分离音频的时域特征。在一些实施例中，音频处理设备使用递归神经网络作为时域提取网络，例如BiLSTM(Bi-directional Long Short-Term Memory,双向长短期记忆)神经网络、BiGRU(Bi-directional Gated Recurrent Unit,双向门控循环单元)神经网络等。
音频分离模型将第二频幅特征输入到纹理提取网络，纹理提取网络从第二频幅特征中提取出待分离音频的纹理特征。在一些实施例中，音频处理设备使用卷积神经网络作为纹理提取网络，例如使用各个卷积层中卷积核的大小为3*3的卷积神经网络对第二频幅特征进行卷积，得到待分离音频的纹理特征。卷积神经网络的卷积层数和卷积核大小根据设备计算能力等实际情况进行设定，在此不进行限定。
在另一些实施例中,基于频幅信息提取时域特征和纹理特征,包括:对频幅信息进行卷积,得到第三频幅特征,基于频幅特征提取时域特征;对频幅信息进行卷积,得到第四频幅特征,基于频幅特征提取纹理特征。
可选地,音频分离模型中包括组合时域提取网络和组合纹理提取网络。音频处理设备将待分离音频的频幅信息输入音频分离模型,组合时域提取网络对频幅信息进行卷积处理,并提取待分离音频的时域特征;组合纹理提取网络对频幅信息进行卷积,并提取待分离音频的纹理特征。
步骤530,根据时域特征和纹理特征,获得n个音轨集分别对应的频谱特征,频谱特征用于表征音轨集的频率和振幅信息,每个音轨集包括待分离音频中的一个音轨或者多个音轨的组合,n为正整数。
在一些实施例中,音频分离模型还包括音轨特征提取网络。步骤530包括以下几个子步骤:
步骤532,对时域特征和纹理特征进行融合处理,得到混合特征;其中,融合处理是指统一时域特征和纹理特征之间的维度,并将统一维度后的时域特征和纹理特征中对应维度的特征相加。
步骤534,通过音轨特征提取网络对混合特征进行处理,生成n个音轨集分别对应的频谱特征。
在一些实施例中,基于第一频幅特征提取出的时域特征的维度和基于第二频幅特征的维度相同。在对时域特征和纹理特征进行融合处理的过程中,只需要将时域特征和纹理特征对应位置上的特征值相加,即可获得混合特征;可选地,时域特征、纹理特征和混合特征具有相同的维度。
在一些实施例中，基于第一频幅特征提取出的时域特征的维度和基于第二频幅特征提取出的纹理特征的维度不相等，在对时域特征和纹理特征进行融合处理前，需要对时域特征和纹理特征进行维度匹配，使得时域特征和纹理特征的维度相等。在一些实施例中，在时域提取网络使用递归神经网络的情况下，由于递归神经网络在输出时域特征之前会进行维度缩减，使得时域特征的通道数小于纹理特征的通道数。在对时域特征和纹理特征进行融合前，需要在通道维度上对时域特征进行复制，使得时域特征与纹理特征的通道数相等。例如，时域特征的通道数为1，纹理特征的通道数为2，音频分离模型将时域特征进行复制，得到复制时域特征，使用复制时域特征对时域特征的通道数进行扩展，使得时域特征的通道数变为2，与纹理特征的通道数相同。将维度相同的时域特征和纹理特征，对应位置上的数据相加，获得混合特征。
音轨集的频谱特征中包含音轨集的频率以及对应的振幅信息,可选地,音轨集的频谱特征中包括音轨集的频幅信息,音轨集的频幅信息称为音轨集的频谱图。音轨特征提取网络将混合特征进行卷积,提取出音轨集的频谱特征,在一些实施例中,音频分离模型中使用全卷积网络作为音轨特征提取网络,例如,U-Net(U型网络)全卷积神经网络。
请参考图6,其示出了音频分离模型的一种网络结构。音频分离模型中包括频幅编码网络,时域提取网络,纹理提取网络和音轨获取网络。各个网络的类型以及具体作用请参考上文,在此不进行赘述。
请参考图7,其示出了音频分离模型的另一种网络结构,音频分离模型中包括组合时域提取网络,组合纹理提取网络和音轨获取网络,组合时域提取网络同时具有频幅编码网络和时域提取网络的能力,组合纹理提取网络同时具有频幅编码网络和纹理提取网络的能力。
步骤540,根据n个音轨集分别对应的频谱特征,生成n个音轨集分别对应的音频文件。
在一些实施例中,根据n个音轨集分别对应的频谱特征,生成n个音轨集分别对应的音频文件,包括:获取待分离音频的相位信息,相位信息用于表征待分离音频的相位;根据相位信息对音轨集对应的频谱特征进行反傅里叶变换,生成音轨集对应的音频文件。
音频处理设备对待分离音频进行短时傅里叶变换生成频幅信息后,可以根据频幅信息获得待分离音频的相位信息。音频处理设备将n个音轨集对应的频谱特征分别与相位信息进行反傅里叶变换,生成n个音频文件,并将n个音频文件分别输出。
请参考图8,其示出本申请一个实施例提供的音频分离方法的示意图。
音频处理设备获取待分离音频后，对待分离音频进行短时傅里叶变换，获取待分离音频的频幅信息。音频处理设备将频幅信息输入音频分离模型，通过特征编码网络对频幅信息采用尺寸较大的卷积进行特征梳理，得到频幅信息中的高层特征，即待分离音频的频幅特征。音频分离模型将频幅特征划分成为第一频幅特征和第二频幅特征；其中，第一频幅特征和第二频幅特征是频幅特征的子集；通过时域提取网络对第一频幅特征进行特征提取，获得时域特征；通过纹理提取网络对第二频幅特征进行特征提取，获得纹理特征；音频分离模型对时域特征和纹理特征进行维度匹配，并进行融合处理获得混合特征，音轨特征生成网络对混合特征进行卷积，最后输出音轨集1的频谱特征和音轨集2的频谱特征，通过对音轨集1的频谱特征和待分离音频的相位信息进行反傅里叶变换获得音轨集1对应的音频文件，通过对音轨集2的频谱特征和待分离音频的相位信息进行反傅里叶变换获得音轨集2对应的音频文件。
在实际应用过程中,在用户只需要从待分离音频中分离出某一种特定音轨的情况下,可选地,音频分离模型先将待分离音频进行分离,获得n个音轨集(n大于等于1),再选择用户需要的音轨进行输出,使用此方法能够保证用户获得质量更好的特定音轨对应的音频文件。可选地,音频分离模型将待分离音频进行分离后,只生成用户需要的音轨。使用此方法能够减少音频分离过程中的计算量,加快待分离音频的分离速度,有针对性地从待分离音频中分离出一种音轨。
下面,通过实施例对音频分离模型的训练过程进行介绍说明,有关该音频分离模型的使用过程中涉及的内容和训练过程中涉及的内容是相互对应的,两者互通,如在一侧未作详细说明的地方,可以参考另一侧的描述说明。
请参考图9，其示出了本申请一个实施例提供的音频分离模型的训练方法的流程图，本方法各步骤的执行主体可以是图1所示方案实施环境中的模型训练设备10，下面以模型训练设备10作为执行主体，该方法可以包括如下几个步骤(910-940)中的至少一个步骤：
步骤910,获取音频分离模型的训练数据,训练数据包括待分离音频样本和待分离音频样本对应的n个标签音轨,待分离音频样本包括至少两个音轨,n为正整数。
步骤920,通过音频分离模型获取待分离音频样本的时域特征和纹理特征,时域特征用于表征待分离音频样本的谐波相关性,纹理特征用于表征待分离音频样本的谐波连续性。
步骤930,根据时域特征和纹理特征,获得n个音轨集分别对应的频谱特征,频谱特征用于表征音轨集的频率和振幅信息,每个音轨集包括待分离音频样本中的一个音轨或者多个音轨的组合。
步骤940,根据n个音轨集分别对应的频谱特征,以及n个标签音轨分别对应的频谱特征,计算音频分离模型的训练损失,并基于训练损失对音频分离模型进行训练。
在一些实施例中,获取音频分离模型的训练数据,包括:获取音频数据集,音频数据集中包括多个源音轨音频;从多个源音轨音频中,选取m个源音轨音频,m为大于或等于n的正整数;对m个源音轨音频进行混音处理,得到待分离音频样本;基于m个源音轨音频生成待分离音频样本对应的n个标签音轨。源音轨音频是指通过录制、电子合成等方式得到的音频文件,源音轨音频可以从音频数据集中获取,源音轨音频的来源和类型在此不进行限定。混音处理是指将m个源音频音轨进行混合,得到混合音频的操作。在一些实施例中,模型训练设备10将m个源音轨音频的时间轴对齐,统一进行播放,完成混音处理,获取待分离音频样本。
标签音轨是指音频分离模型能够从待分离音频中分离出的音轨的种类,训练完成后的音频分离模型具有从待分离音频中分离出n个标签音轨的能力。
在m个源音轨音频的播放时长不同的情况下,可选地,对于播放时长较短 的源音轨音频进行重复播放,延长播放时间;对于播放时长较长的源音轨音频进行截取,缩短其播放时间。将处理后播放时长相等的m个源音轨音频进行混合,获得待分离音频样本。
在m等于n的情况下,每一个标签音轨拥有对应的一个源音轨音频,例如,模型训练设备10从音频数据集中获取2个源音轨音频,分别是人声对应的源音轨音源和吉他声对应的源音轨音源,音频分离模型中包括2个标签音轨分别是人声对应的音轨集和吉他声对应的音轨集,人声对应的音轨集能够直接根据人声对应的音轨集音源获得;吉他声对应的音轨集能够直接从吉他声对应的音轨集音源获得。在m大于n的情况下,存在一些标签音轨是通过混合多个源音轨音频得到的,例如,模型训练设备10从音频数据集中获取5个源音轨音频,分别是钢琴声、吉他声、人声,鼓声和三角铁声分别对应的源音轨音频,音频分离模型中包括4个标签音轨,分别是钢琴、吉他、人声和打击乐分别对应的标签音轨,钢琴、吉他和人声分别对应的标签音轨能够分别直接从对应的源音轨音频中确定,打击乐对应的标签音轨需要通过将鼓声对应的源音轨音频和三角铁对应的源音轨音频进行混合,根据混合得到的音轨音频确定。
在一些实施例中,音频分离模型包括频幅编码网络、时域提取网络和纹理提取网络;通过音频分离模型获取待分离音频样本的时域特征和纹理特征,包括:获取待分离音频样本的频幅信息,频幅信息用于表征待分离音频样本的频率和振幅信息;通过频幅编码网络对频幅信息进行卷积,得到频幅特征;对频幅特征进行划分,得到第一频幅特征和第二频幅特征;其中,第一频幅特征和第二频幅特征是频幅特征的子集,将第一频幅特征和第二频幅特征进行叠加能够得到频幅特征;通过时域提取网络基于第一频幅特征提取时域特征;通过纹理提取网络基于第二频幅特征提取纹理特征。
通过音频分离模型获取待分离音频样本的时域特征和纹理特征的详细过程请参考上一个实施例,在此不进行赘述。
在一些实施例中,音频分离模型还包括:音轨特征提取网络,根据纹理特征和时域特征,获得n个音轨集分别对应的频谱特征,包括:对时域特征和纹理特征进行融合处理,得到混合特征;其中,融合处理是指统一时域特征和纹理特征之间的维度,并将统一维度后的时域特征和纹理特征中对应维度的特征相加;通过音轨特征提取网络对混合特征进行处理,生成n个音轨集分别对应的频谱特征。
对时域特征和纹理特征进行融合处理,得到混合特征的详细过程请参考上一个实施例,在此不进行赘述。
在一些实施例中,根据n个音轨集分别对应的频谱特征,以及n个标签音轨分别对应的频谱特征,计算音频分离模型的训练损失,包括:
对于n个音轨集中的每一个音轨集,计算音轨集的频谱特征与音轨集对应的标签音轨的频谱特征之间的区别度,得到n个区别度;根据n个区别度,确定音频分离模型的训练损失。
某个音轨集与该与音轨集对应的标签音轨的频谱特征之间的区别度用于表 征该音轨集和对应的标签音轨之间的区别程度。在一些实施例中,音轨集的频谱特征和标签音轨的频谱特征具有相同的维度,某个音轨集的频谱特征与该音轨集对应的标签音轨的频谱特征之间的区别度,通过计算两个频谱特征中对应位置上的数据差的绝对值并计算平均数获得。
音轨集的频谱特征与音轨集对应的标签音轨的频谱特征之间的区别度可以通过其他计算距离的方式计算得出,例如计算音轨集的频谱特征与音轨集对应的标签音轨的频谱特征之间差的绝对值之和等,区别度的计算方式在此不进行限定。
音频分离模型根据n个区别度确定音频分离模型的损失,包括计算n个区别度的平均数,确定音频分离模型的损失,或者,计算n个区别度之和,确定音频分离模型的损失。确定音频分离模型的损失后,计算机设备对音频分离模型中各部分的网络参数进行调整,在一些实施例中,计算机设备使用梯度下降法对音频分离模型中的参数进行调整。
在音频分离模型的损失收敛于目标数值后,完成音频分离模型的训练。
综上所述,通过获取音频分离模型的训练数据,并从音频分离模型中获取待分离音频样本的时域特征和纹理特征;根据时域特征和纹理特征得到n个音轨集分别对应的频谱特征,根据n个音轨集分别对应的频谱特征,以及n个标签音轨分别对应的频谱特征,计算音频分离模型的训练损失,并基于训练损失对音频分离模型进行训练使得训练后的音频分离模型具备生成n个标签音轨的能力。使用待分离音频的时域特征和纹理特征获得n个音轨集的频谱特征,音频分离过程中的计算量小,音频分离速度快。此外,在音频分离模型训练的过程中,标签音轨的种类越多,音频分离模型最终的分离能力越强,通过音频分离得到的音轨集的质量越好。N越大,代表训练过程中能够对音频分离模型的训练损失造成影响的因素越多,例如,某个音频分离模型的标签音轨包括:钢琴标签音轨、吉他标签音轨、贝斯标签音轨和人声标签音轨,在该音频分离模型的训练过程中,模型中负责分离钢琴标签音轨、吉他标签音轨、贝斯标签音轨和人声标签音轨的参数之间相互影响制约,实现了迁移学习,提高了模型训练的效果,使得音频分离模型分离效果好,通过分离得到的音轨集的质量较好。
请参考图10,其示出本申请一个音频分离模型训练过程的示意图。
模型训练设备10获取训练数据后，将训练数据中的待分离音频样本进行短时傅里叶变换获取待分离音频样本的频幅信息，并将频幅信息输入音频分离模型。特征编码网络采用尺寸较大的卷积核对频幅信息进行特征梳理，得到频幅信息中的高层特征，即待分离音频样本的频幅特征。音频分离模型将频幅特征划分成为第一频幅特征和第二频幅特征；通过时域提取网络对第一频幅特征进行特征提取，获得时域特征；通过纹理提取网络对第二频幅特征进行特征提取，获得纹理特征；音频分离模型将时域特征和纹理特征进行维度匹配，并进行融合处理获得混合特征，音轨特征生成网络对混合特征进行卷积，最后输出音轨集的频谱特征，可选地，音频分离模型具有n个标签音轨，则音轨特征生成网络最后输出n个频谱特征，n个频谱特征分别对应n个音轨集。如图10所示，音轨分离网络输出了音轨集1、音轨集2和音轨集3对应的频谱特征，并计算得到的3个音轨集的频谱特征与对应的标签音轨的频谱特征的区别度，得到音频分离模型的训练损失；基于训练损失对音频分离模型中的参数进行调整，不断重复上述步骤，直至音频分离模型的训练损失收敛于目标数值，完成对音频分离模型的训练。
下述为本申请装置实施例,可以用于执行本申请方法实施例。对于本申请装置实施例中未披露的细节,请参照本申请方法实施例。
请参考图11,其示出了本申请一个实施例提供的音频分离装置的框图。该装置具有实现上述音频分离方法的功能,所述功能可以由硬件实现,也可以由硬件执行相应的软件实现。该装置可以是上文介绍的音频处理设备,也可以设置在音频处理设备中。该装置1100可以包括:音频获取模块1110、特征提取模块1120、频谱生成模块1130和音轨生成模块1140。
音频获取模块1110,用于获取待分离音频,所述待分离音频包括至少两个音轨。
特征提取模块1120,用于获取所述待分离音频的时域特征和纹理特征,所述时域特征用于表征所述待分离音频的谐波相关性,所述纹理特征用于表征所述待分离音频的谐波连续性。
频谱生成模块1130,用于根据所述时域特征和所述纹理特征,获得n个音轨集分别对应的频谱特征,所述频谱特征用于表征所述音轨集的频率和振幅信息,每个音轨集包括所述待分离音频中的一个音轨或者多个音轨的组合,n为正整数。
音轨生成模块1140,用于根据所述n个音轨集分别对应的频谱特征,生成所述n个音轨集分别对应的音频文件。
在一些实施例中,所述特征提取模块1120包括:频幅信息获取子模块和特征提取子模块。
所述频幅信息获取子模块,用于获取所述待分离音频的频幅信息,所述频幅信息用于表征所述待分离音频的频率和振幅信息。
所述特征提取子模块,用于基于所述频幅信息提取所述时域特征和所述纹理特征。
在一些实施例中,所述特征提取子模块用于对所述频幅信息进行卷积,得到频幅特征;对所述频幅特征进行划分,得到第一频幅特征和第二频幅特征;其中,所述第一频幅特征和所述第二频幅特征是所述频幅特征的子集,将所述第一频幅特征和所述第二频幅特征进行叠加能够得到所述频幅特征;基于所述第一频幅特征提取所述时域特征;基于所述第二频幅特征提取所述纹理特征。
在一些实施例中,音频分离模型包括频幅编码网络、时域提取网络和纹理提取网络;其中,所述频幅编码网络用于对所述频幅信息进行卷积,得到所述频幅特征;所述时域提取网络用于基于所述第一频幅特征提取所述时域特征;所述纹理提取网络用于基于所述第二频幅特征提取所述纹理特征。
在一些实施例中,所述音频分离模型还包括:音轨特征提取网络。频谱生 成模块1130,用于对所述时域特征和所述纹理特征进行融合处理,得到混合特征;其中,所述融合处理是指统一所述时域特征和所述纹理特征之间的维度,并将统一维度后的所述时域特征和所述纹理特征中对应维度的特征相加;通过所述音轨特征提取网络对所述混合特征进行处理,生成所述n个音轨集分别对应的频谱特征。
在一些实施例中,所述音轨生成模块1140,用于获取所述待分离音频的相位信息,所述相位信息用于表征所述待分离音频的相位;根据所述相位信息对所述音轨集对应的频谱特征进行反傅里叶变换,生成所述音轨集对应的音频文件。
请参考图12,其示出了本申请一个实施例提供的音频分离模型的训练装置的框图。该装置具有实现上述音频分离模型的训练方法的功能,所述功能可以由硬件实现,也可以由硬件执行相应的软件实现。该装置可以是上文介绍的模型训练设备10,也可以设置在模型训练设备10中。该装置1200可以包括:数据获取模块1210、特征提取模块1220、频谱生成模块1230和模型训练模块1240。
数据获取模块1210,用于获取所述音频分离模型的训练数据,所述训练数据包括待分离音频样本和所述待分离音频样本对应的n个标签音轨,所述待分离音频样本包括至少两个音轨,n为正整数。
特征提取模块1220,用于通过所述音频分离模型获取所述待分离音频样本的时域特征和纹理特征,所述时域特征用于表征所述待分离音频样本的谐波相关性,所述纹理特征用于表征所述待分离音频样本的谐波连续性。
频谱生成模块1230,用于根据所述时域特征和所述纹理特征,获得n个音轨集分别对应的频谱特征,所述频谱特征用于表征所述音轨集的频率和振幅信息,每个音轨集包括所述待分离音频样本中的一个音轨或者多个音轨的组合。
模型训练模块1240,用于根据所述n个音轨集分别对应的频谱特征,以及所述n个标签音轨分别对应的频谱特征,计算所述音频分离模型的训练损失,并基于所述训练损失对所述音频分离模型进行训练。
在一些实施例中,所述数据获取模块1210,用于从所述多个源音轨音频中,选取m个源音轨音频,m为大于或等于n的正整数;对所述m个源音轨音频进行混音处理,得到所述待分离音频样本;基于所述m个源音轨音频生成所述待分离音频样本对应的n个标签音轨。
在一些实施例中,所述音频分离模型包括频幅编码网络、时域提取网络和纹理提取网络。所述特征提取模块1220,用于获取所述待分离音频样本的频幅信息,所述频幅信息用于表征所述待分离音频样本的频率和振幅信息;通过所述频幅编码网络对所述频幅信息进行卷积,得到频幅特征;对所述频幅特征进行划分,得到第一频幅特征和第二频幅特征;其中,所述第一频幅特征和所述第二频幅特征是所述频幅特征的子集,将所述第一频幅特征和所述第二频幅特征进行叠加得到所述频幅特征;通过所述时域提取网络基于所述第一频幅特征提取所述时域特征;通过所述纹理提取网络基于所述第二频幅特征提取所述纹理特征。
在一些实施例中,所述频谱生成模块1230,用于对所述时域特征和所述纹理特征进行融合处理,得到混合特征;其中,所述融合处理是指统一所述时域特征和所述纹理特征之间的维度,并将统一维度后的所述时域特征和所述纹理特征中对应维度的特征相加;通过所述音轨特征提取网络对所述混合特征进行处理,生成所述n个音轨集分别对应的频谱特征。
在一些实施例中,所述模型训练模块1240,用于对于所述n个音轨集中的每一个音轨集,计算所述音轨集的频谱特征与所述音轨集对应的标签音轨的频谱特征之间的区别度,得到n个区别度;根据所述n个区别度,确定所述音频分离模型的训练损失。
需要说明的是,上述实施例提供的装置,在实现其功能时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内容结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的装置与方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
请参考图13,其示出了本申请一个实施例提供的计算机设备的示意图。该计算机设备1300可以是图1所示实施环境中的音频处理设备20,用于实施上述音频分离方法;也可以是图1所示实施环境中的模型训练设备10,用于实施上述音频分离模型的训练方法。
通常,计算机设备1300包括有:处理器1301和存储器1302。
处理器1301可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器1301可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器1301也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器1301可以在集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器1301还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器1302可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器1302还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。
本领域技术人员可以理解,图13中示出的结构并不构成对设备1300的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
在一些实施例中,计算机设备的存储器中存储有计算机程序,该计算机程序由处理器加载并执行以实现如上所述的音频分离方法或音频分离模型的训练方法。
本申请还提供一种计算机可读存储介质,该计算机可读存储介质中存储有计算机程序,该计算机程序由处理器加载并执行以实现如上所述的音频分离方法或音频分离模型的训练方法。
可选地,计算机存储介质包括RAM、ROM、闪存或其他固态存储技术,CD-ROM等其他光学存储、磁带盒、磁带、磁盘存储等。
本申请还提供一种计算机程序产品或计算机程序,上述计算机程序产品或计算机程序包括计算机指令,上述计算机指令存储在计算机可读存储介质中,处理器从上述计算机可读存储介质读取上述计算机指令,以实现上述各方法实施例提供的音频分离方法或音频分离模型的训练方法。
应当理解的是,在本文中提及的“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。字符“/”一般表示前后关联对象是一种“或”的关系。另外,本文中描述的步骤编号,仅示例性示出了步骤间的一种可能的执行先后顺序,在一些其它实施例中,上述步骤也可以不按照编号顺序来执行,如两个不同编号的步骤同时执行,或者两个不同编号的步骤按照与图示相反的顺序执行,本申请实施例对此不作限定。
以上所述仅为本申请的示例性实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (16)

  1. 一种音频分离方法,其特征在于,所述方法包括:
    获取待分离音频,所述待分离音频包括至少两个音轨;
    获取所述待分离音频的时域特征和纹理特征,所述时域特征用于表征所述待分离音频的谐波相关性,所述纹理特征用于表征所述待分离音频的谐波连续性;
    根据所述时域特征和所述纹理特征,获得n个音轨集分别对应的频谱特征,所述频谱特征用于表征所述音轨集的频率和振幅信息,每个音轨集包括所述待分离音频中的一个音轨或者多个音轨的组合,n为正整数;
    根据所述n个音轨集分别对应的频谱特征,生成所述n个音轨集分别对应的音频文件。
  2. 根据权利要求1所述的方法,其特征在于,所述获取所述待分离音频的时域特征和纹理特征,包括:
    获取所述待分离音频的频幅信息,所述频幅信息用于表征所述待分离音频的频率和振幅信息;
    基于所述频幅信息提取所述时域特征和所述纹理特征。
  3. 根据权利要求2所述的方法,其特征在于,所述基于所述频幅信息提取所述时域特征和所述纹理特征,包括:
    对所述频幅信息进行卷积,得到频幅特征;
    对所述频幅特征进行划分,得到第一频幅特征和第二频幅特征;其中,所述第一频幅特征和所述第二频幅特征是所述频幅特征的子集,将所述第一频幅特征和所述第二频幅特征进行叠加得到所述频幅特征;
    基于所述第一频幅特征提取所述时域特征;
    基于所述第二频幅特征提取所述纹理特征。
  4. 根据权利要求3所述的方法,其特征在于,音频分离模型包括频幅编码网络、时域提取网络和纹理提取网络;其中,
    所述频幅编码网络用于对所述频幅信息进行卷积,得到所述频幅特征;
    所述时域提取网络用于基于所述第一频幅特征提取所述时域特征;
    所述纹理提取网络用于基于所述第二频幅特征提取所述纹理特征。
  5. 根据权利要求4所述的方法,其特征在于,所述音频分离模型还包括:音轨特征提取网络,所述根据所述时域特征和所述纹理特征,获得n个音轨集分别对应的频谱特征,包括:
    对所述时域特征和所述纹理特征进行融合处理,得到混合特征;其中,所述融合处理是指统一所述时域特征和所述纹理特征之间的维度,并将统一维度 后的所述时域特征和所述纹理特征中对应维度的特征相加;
    通过所述音轨特征提取网络对所述混合特征进行处理,生成所述n个音轨集分别对应的频谱特征。
  6. 根据权利要求1所述的方法,其特征在于,所述根据所述n个音轨集分别对应的频谱特征,生成所述n个音轨集分别对应的音频文件,包括:
    获取所述待分离音频的相位信息,所述相位信息用于表征所述待分离音频的相位;
    根据所述相位信息对所述音轨集对应的频谱特征进行反傅里叶变换,生成所述音轨集对应的音频文件。
  7. 一种音频分离模型的训练方法,其特征在于,所述方法包括:
    获取所述音频分离模型的训练数据,所述训练数据包括待分离音频样本和所述待分离音频样本对应的n个标签音轨,所述待分离音频样本包括至少两个音轨,n为正整数;
    通过所述音频分离模型获取所述待分离音频样本的时域特征和纹理特征,所述时域特征用于表征所述待分离音频样本的谐波相关性,所述纹理特征用于表征所述待分离音频样本的谐波连续性;
    根据所述时域特征和所述纹理特征,获得n个音轨集分别对应的频谱特征,所述频谱特征用于表征所述音轨集的频率和振幅信息,每个音轨集包括所述待分离音频样本中的一个音轨或者多个音轨的组合;
    根据所述n个音轨集分别对应的频谱特征,以及所述n个标签音轨分别对应的频谱特征,计算所述音频分离模型的训练损失,并基于所述训练损失对所述音频分离模型进行训练。
  8. 根据权利要求7所述的方法,其特征在于,所述获取所述音频分离模型的训练数据,包括:
    获取音频数据集,所述音频数据集中包括多个源音轨音频;
    从所述多个源音轨音频中,选取m个源音轨音频,m为大于或等于n的正整数;
    对所述m个源音轨音频进行混音处理,得到所述待分离音频样本;
    基于所述m个源音轨音频生成所述待分离音频样本对应的n个标签音轨。
  9. 根据权利要求7所述的方法,其特征在于,所述音频分离模型包括频幅编码网络、时域提取网络和纹理提取网络;所述通过所述音频分离模型获取所述待分离音频样本的时域特征和纹理特征,包括:
    获取所述待分离音频样本的频幅信息,所述频幅信息用于表征所述待分离音频样本的频率和振幅信息;
    通过所述频幅编码网络对所述频幅信息进行卷积,得到频幅特征;
    对所述频幅特征进行划分,得到第一频幅特征和第二频幅特征;其中,所述第一频幅特征和所述第二频幅特征是所述频幅特征的子集,将所述第一频幅特征和所述第二频幅特征进行叠加得到所述频幅特征;
    通过所述时域提取网络基于所述第一频幅特征提取所述时域特征;
    通过所述纹理提取网络基于所述第二频幅特征提取所述纹理特征。
  10. 根据权利要求9所述的方法,其特征在于,所述音频分离模型还包括:音轨特征提取网络,所述根据所述纹理特征和所述时域特征,获得n个音轨集分别对应的频谱特征,包括:
    对所述时域特征和所述纹理特征进行融合处理,得到混合特征;其中,所述融合处理是指统一所述时域特征和所述纹理特征之间的维度,并将统一维度后的所述时域特征和所述纹理特征中对应维度的特征相加;
    通过所述音轨特征提取网络对所述混合特征进行处理,生成所述n个音轨集分别对应的频谱特征。
  11. 根据权利要求7所述的方法,其特征在于,所述根据所述n个音轨集分别对应的频谱特征,以及所述n个标签音轨分别对应的频谱特征,计算所述音频分离模型的训练损失,包括:
    对于所述n个音轨集中的每一个音轨集,计算所述音轨集的频谱特征与所述音轨集对应的标签音轨的频谱特征之间的区别度,得到n个区别度;
    根据所述n个区别度,确定所述音频分离模型的训练损失。
  12. 一种音频分离装置,其特征在于,所述装置包括:
    音频获取模块,用于获取待分离音频,所述待分离音频包括至少两个音轨;
    特征提取模块,用于获取所述待分离音频的时域特征和纹理特征,所述时域特征用于表征所述待分离音频的谐波相关性,所述纹理特征用于表征所述待分离音频的谐波连续性;
    频谱生成模块,用于根据所述时域特征和所述纹理特征,获得n个音轨集分别对应的频谱特征,所述频谱特征用于表征所述音轨集的频率和振幅信息,每个音轨集包括所述待分离音频中的一个音轨或者多个音轨的组合,n为正整数;
    音轨生成模块,用于根据所述n个音轨集分别对应的频谱特征,生成所述n个音轨集分别对应的音频文件。
  13. 一种音频分离模型的训练装置,其特征在于,所述装置包括:
    数据获取模块,用于获取所述音频分离模型的训练数据,所述训练数据包括待分离音频样本和所述待分离音频样本对应的n个标签音轨,所述待分离音频样本包括至少两个音轨,n为正整数;
    特征提取模块,用于通过所述音频分离模型获取所述待分离音频样本的时 域特征和纹理特征,所述时域特征用于表征所述待分离音频样本的谐波相关性,所述纹理特征用于表征所述待分离音频样本的谐波连续性;
    频谱生成模块,用于根据所述时域特征和所述纹理特征,获得n个音轨集分别对应的频谱特征,所述频谱特征用于表征所述音轨集的频率和振幅信息,每个音轨集包括所述待分离音频样本中的一个音轨或者多个音轨的组合;
    模型训练模块,用于根据所述n个音轨集分别对应的频谱特征,以及所述n个标签音轨分别对应的频谱特征,计算所述音频分离模型的训练损失,并基于所述训练损失对所述音频分离模型进行训练。
  14. 一种计算机设备,其特征在于,所述计算机设备包括处理器和存储器,所述存储器中存储有计算机程序,所述计算机程序由所述处理器加载并执行以实现如权利要1至6任一项所述的音频分离方法,或实现如权利要求7至11任一项所述的音频分离模型的训练方法。
  15. 一种计算机可读存储介质,其特征在于,所述存储介质中存储有计算机程序,所述计算机程序由处理器加载并执行以实现如权利要求1至6任一项所述的音频分离方法,或实现如权利要求7至11任一项所述的音频分离模型的训练方法。
  16. 一种计算机程序产品或计算机程序,其特征在于,所述计算机程序产品或计算机程序包括计算机指令,所述计算机指令存储在计算机可读存储介质中,处理器从所述计算机可读存储介质读取并执行所述计算机指令,以实现如权利要求1至6任一项所述的音频分离方法,或实现如权利要求7至11任一项所述的音频分离模型的训练方法。
PCT/CN2021/132977 2021-11-25 2021-11-25 音频分离方法、装置、设备、存储介质及程序产品 WO2023092368A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/132977 WO2023092368A1 (zh) 2021-11-25 2021-11-25 音频分离方法、装置、设备、存储介质及程序产品
CN202180005209.5A CN114365219A (zh) 2021-11-25 2021-11-25 音频分离方法、装置、设备、存储介质及程序产品

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/132977 WO2023092368A1 (zh) 2021-11-25 2021-11-25 音频分离方法、装置、设备、存储介质及程序产品

Publications (1)

Publication Number Publication Date
WO2023092368A1 true WO2023092368A1 (zh) 2023-06-01

Family

ID=81104575

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/132977 WO2023092368A1 (zh) 2021-11-25 2021-11-25 音频分离方法、装置、设备、存储介质及程序产品

Country Status (2)

Country Link
CN (1) CN114365219A (zh)
WO (1) WO2023092368A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110064233A1 (en) * 2003-10-09 2011-03-17 James Edwin Van Buskirk Method, apparatus and system for synthesizing an audio performance using Convolution at Multiple Sample Rates
CN102054480A (zh) * 2009-10-29 2011-05-11 北京理工大学 一种基于分数阶傅立叶变换的单声道混叠语音分离方法
US20180047372A1 (en) * 2016-08-10 2018-02-15 Red Pill VR, Inc. Virtual music experiences
CN111724807A (zh) * 2020-08-05 2020-09-29 字节跳动有限公司 音频分离方法、装置、电子设备及计算机可读存储介质
CN113573136A (zh) * 2021-09-23 2021-10-29 腾讯科技(深圳)有限公司 视频处理方法、装置、计算机设备和存储介质

Also Published As

Publication number Publication date
CN114365219A (zh) 2022-04-15

Similar Documents

Publication Publication Date Title
JP7243052B2 (ja) オーディオ抽出装置、オーディオ再生装置、オーディオ抽出方法、オーディオ再生方法、機械学習方法及びプログラム
US8198525B2 (en) Collectively adjusting tracks using a digital audio workstation
Lin et al. A unified model for zero-shot music source separation, transcription and synthesis
Schulze-Forster et al. Unsupervised music source separation using differentiable parametric source models
Pereira et al. Moisesdb: A dataset for source separation beyond 4-stems
Jackson Digital audio editing fundamentals
WO2023092368A1 (zh) 音频分离方法、装置、设备、存储介质及程序产品
Tachibana et al. A real-time audio-to-audio karaoke generation system for monaural recordings based on singing voice suppression and key conversion techniques
Sha’ath Estimation of key in digital music recordings
Hinrichs et al. Convolutional neural networks for the classification of guitar effects and extraction of the parameter settings of single and multi-guitar effects from instrument mixes
WO2022143530A1 (zh) 音频处理方法、装置、计算机设备及存储介质
Benetos et al. Multiple-F0 estimation and note tracking for Mirex 2015 using a sound state-based spectrogram factorization model
Chen et al. Improving choral music separation through expressive synthesized data from sampled instruments
Zhu et al. A Survey of AI Music Generation Tools and Models
Munoz-Montoro et al. Online/offline score informed music signal decomposition: application to minus one
Bittner Data-driven fundamental frequency estimation
Mounir et al. Musical note onset detection based on a spectral sparsity measure
Xinhao The practice of string sound source in computer music production--take pop music production as an example
Colonel Autoencoding neural networks as musical audio synthesizers
Patel et al. Karaoke Generation from songs: recent trends and opportunities
Grumiaux et al. Efficient bandwidth extension of musical signals using a differentiable harmonic plus noise model
Mina et al. Musical note onset detection based on a spectral sparsity measure
Roads A conversation with james a. moorer
Mangal et al. Music Source Separation with Deep Convolution Neural Network
Walczyński et al. Comparison of selected acoustic signal parameterization methods in the problem of machine recognition of classical music styles

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21965106

Country of ref document: EP

Kind code of ref document: A1