Disclosure of Invention
The invention aims to overcome the poor voice-separation performance of prior-art neural network structures, and provides a music source separation method based on a stacked hourglass network. Based on the up-sampling and down-sampling paths of the hourglass module, and with four stages of different hourglass modules stacked end to end, the voice feature information learned by the hourglass module of the previous stage serves as the input of the next hourglass module, so that the later-stage module obtains richer feature information and the context of the voice signal is more fully exploited, thereby improving the separation performance of the network. Meanwhile, to address the shortcomings of the hourglass network's encoding part, an arithmetic-progression channel-widening structure is designed inside the hourglass module to compensate for the information loss incurred when down-sampling the voice spectrogram, further improving the music source separation performance.
The purpose of the invention is mainly realized by the following technical scheme:
The music source separation method based on the stacked hourglass network comprises the following steps: S1, performing framing, windowing and Fourier transform on the original mixed voice signal to obtain an original mixed voice signal spectrogram, which comprises an original mixed signal amplitude spectrum and an original mixed signal phase spectrum; S2, inputting the original mixed signal amplitude spectrum into a stacked hourglass network comprising four hourglass modules stacked end to end, whereby the original mixed signal amplitude spectrum yields a first voice predicted value and a first accompaniment predicted value after passing through the stacked hourglass network; within each hourglass module, the number of output channels after the first convolution of the down-sampling path is increased in an arithmetic-progression manner; S3, combining the first voice predicted value and the first accompaniment predicted value with a time-frequency mask to obtain a masked second voice predicted value and a masked second accompaniment predicted value; then combining the second voice predicted value and the second accompaniment predicted value with the phase spectrum of the original mixed signal, and obtaining a predicted voice signal and a predicted accompaniment signal respectively through the inverse Fourier transform.
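Step S1 (framing, windowing, Fourier transform, and the magnitude/phase split reused in S3) can be sketched in NumPy as below. The 1024-sample window and 256-sample hop match the parameters given later in this disclosure; the Hann window type is an assumption, as the disclosure does not specify one. Note that a 1024-point FFT yields 513 frequency bins, so the 512×64 resolution quoted later presumably drops one bin.

```python
import numpy as np

def stft(x, win=1024, hop=256):
    # Frame the signal, apply a window, and take the FFT of each frame.
    n_frames = 1 + (len(x) - win) // hop
    w = np.hanning(win)  # window type assumed; the disclosure does not specify it
    frames = np.stack([x[i * hop:i * hop + win] * w for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T  # (freq_bins, time_frames)

x = np.random.randn(1024 + 63 * 256)   # just long enough for 64 time frames
S = stft(x)                            # complex spectrogram
mag, phase = np.abs(S), np.angle(S)    # magnitude feeds the network ...
S_rec = mag * np.exp(1j * phase)       # ... phase is reused when inverting (S3)
```

Because the network never alters the phase, the predicted magnitude recombined with the original phase can be inverted back to a waveform, exactly as S3 describes.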
The stacked hourglass network is a neural network originally proposed for human pose estimation. In a stacked hourglass network, the hourglass module at each stage is a simple lightweight network containing its own down-sampling and up-sampling paths, and successive hourglass modules are chained end to end to form the stacked network. Intermediate supervision ensures that the parameters of every layer of the network are updated normally. In a single hourglass module, many configurations use channels of equal width for the repeated down-sampling and up-sampling; although this symmetrical design is topologically appealing, its performance falls well short of mainstream networks such as ResNet. The present technical solution applies the stacked hourglass network to music source separation: based on the up-sampling and down-sampling paths of the hourglass modules and the end-to-end stacking of four stages of different hourglass modules, the voice feature information learned by the previous-stage module serves as the input of the next-stage module, so that later stages obtain richer feature information, the context of the voice signal is more fully exploited, and the separation performance of the network is improved.
According to the technical solution, four hourglass modules stacked end to end are used for separation. With the end-to-end stacking of the four stages of different hourglass modules, the voice feature information learned by the previous-stage hourglass module serves as the input of the next-stage module, so that later stages obtain richer feature information, the context of the voice signal is more fully exploited, and the separation performance of the network is improved. The stacking also makes the network deeper, which helps it learn deeper semantic features. Since the time-frequency mask can impose constraints between the input mixture signal and the output prediction signals that reflect the relationship between the different sources in the mixture, it produces smooth prediction results; we therefore use the time-frequency mask as the output for each separated source, and multiply it with the input spectrogram of the mixed signal to obtain the voice spectrogram estimated by the network. Each hourglass module has its own loss, so the final loss function is the sum of four losses, and this intermediate supervision ensures that the parameters of every layer are updated normally, improving separation performance. The stacked hourglass network of this technical solution does not change the phase of the original signal, so the signal of each predicted source can be obtained by the inverse STFT, i.e. by combining the predicted magnitude spectrogram with the original phase.
In addition, to address the shortcomings of the hourglass network's encoding part, an arithmetic-progression channel-widening structure is designed inside the hourglass module for down-sampling the voice spectrogram of the mixed signal: after the first convolution of each hourglass module's down-sampling path, the number of output channels is increased in arithmetic progression, constructing a strong feature encoder that reduces information loss, compensates for the loss incurred during down-sampling of the voice spectrogram, and further improves the music source separation performance.
Further, the stacked hourglass network also comprises an initial convolution module consisting of five consecutive convolutional layers; this module is arranged before the four hourglass modules and does not change the size of the input image, only increasing its number of output channels.
In this technical solution, the original mixed voice signal is first converted into a spectrogram by Fourier transform and then fed into the first-stage hourglass module. Specifically, the sliding window length of the Fourier transform is set to 1024 and the distance between adjacent windows to 256. Voice signals whose time-frame length is less than 64 are zero-padded. The voice spectrogram obtained after the Fourier transform has a resolution of 512×64, corresponding to the height and width of the image respectively. The minimum number of feature channels in the hourglass modules of the four different stages is 256. Because the voice spectrogram obtained by the Fourier transform is a single-channel grayscale image, and to avoid unstable network performance caused by too large a gap in feature dimensions, the spectrogram is passed through an initial convolution module before entering the first-stage hourglass module in order to increase its number of feature channels. The initial convolution module consists of five consecutive convolutional layers that do not change the size of the input image but only increase its number of output channels. Specifically, after a voice spectrogram of dimension 512×64×1 passes in sequence through five convolutional layers built from 7×7×64, 3×3×128 and 3×3×256 kernels, the resulting input spectrogram of the mixed signal has dimension 512×64×256, where the last factor denotes the number of output channels.
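The size-preserving behavior of the initial convolution module can be checked with the standard convolution output-size formula. The padding values and the exact five-layer channel split below are assumptions for illustration (the disclosure lists only 7×7×64, 3×3×128 and 3×3×256 kernels for the five layers):

```python
def conv_out(size, kernel, pad, stride=1):
    # Standard convolution output-size formula.
    return (size + 2 * pad - kernel) // stride + 1

# Hypothetical five-layer split; "same" padding keeps the 512x64 spatial size.
# (kernel, padding, out_channels)
layers = [(7, 3, 64), (3, 1, 128), (3, 1, 128), (3, 1, 256), (3, 1, 256)]
shape = (512, 64, 1)  # input mixed-signal spectrogram: height x width x channels
for k, pad, ch in layers:
    shape = (conv_out(shape[0], k, pad), conv_out(shape[1], k, pad), ch)
```

With stride 1 and padding (kernel − 1)/2, each layer leaves the 512×64 grid untouched while the channel count grows from 1 to 256.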
Furthermore, the four hourglass modules are fourth-order hourglass modules, and within each hourglass module the input spectrogram undergoes four successive down-samplings, each halving its resolution.
In this technical solution, in the encoding part of a single hourglass module, the 512×64×256 input spectrogram of the mixed signal undergoes four successive down-samplings, each halving the resolution of the input spectrogram.
Furthermore, an attention layer is arranged after the convolutional layers in each hourglass module, and batch normalization and a Leaky ReLU activation function are applied in the convolutional layers of each hourglass module to improve gradient backpropagation and parameter updating.
This technical solution keeps the pooling and convolution structure unchanged during down-sampling, adds an attention layer after the convolutional layer, and, for each convolutional layer, adds batch normalization and a Leaky ReLU activation function to improve gradient backpropagation and parameter updating. With the channel attention mechanism added, different channels carry different degrees of importance. Incorporating the channel attention mechanism into the stacked hourglass network as a basic component allows the whole network to be modeled according to the importance of different channels: the weights of channels carrying redundant feature information are correspondingly reduced, improving the representational capacity of the whole hourglass network. In addition, this technical solution also attends to details of the network structure, adding batch normalization and the Leaky ReLU activation function to optimize the deep structure of the stacked hourglass network and further improve music source separation performance.
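The disclosure does not spell out the attention layer's internals; a common realization of channel attention is the squeeze-and-excitation pattern (global pooling, bottleneck MLP, sigmoid gating), sketched below in NumPy with random weights purely for illustration. The reduction ratio `r = 16` is an assumption:

```python
import numpy as np

def channel_attention(feat, w1, w2):
    # feat: (C, H, W). Squeeze: one scalar per channel via global average pooling.
    z = feat.mean(axis=(1, 2))
    # Excite: bottleneck MLP + sigmoid gives per-channel weights in (0, 1).
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z, 0.0))))
    # Reweight: important channels keep their magnitude, redundant ones shrink.
    return feat * s[:, None, None]

rng = np.random.default_rng(0)
C, r = 256, 16                          # channel count and reduction ratio (assumed)
feat = rng.standard_normal((C, 8, 8))
w1 = 0.1 * rng.standard_normal((C // r, C))
w2 = 0.1 * rng.standard_normal((C, C // r))
out = channel_attention(feat, w1, w2)
```

Because each gate lies in (0, 1), a channel is never amplified, only preserved or attenuated, which is how redundant-channel weights get reduced.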
Further, during the first convolution of each hourglass module's down-sampling path the number of output channels is kept unchanged, so the feature information of the original mixed signal's amplitude spectrum is learned at a 1:1 channel ratio; after the first convolution, 128 channels are added at each stage in turn, so the output channel sizes of the encoding part in each hourglass module are 384, 512, 640 and 768 in sequence.
This technical solution uses an arithmetic-progression structure to increase the number of output channels of the convolutional layers during down-sampling. Specifically, after the first convolution of the input spectrogram of the mixed signal, the number of channels is kept unchanged at C, with C = 256; this avoids unnecessary information loss caused by a channel-count mismatch before and after the convolution and lets the feature information of the input spectrogram be learned at a 1:1 channel ratio. After the first convolution, N is added to the output channels at each stage, with N = 128, so the output channel sizes of the encoding part in a single fourth-order hourglass module are 384, 512, 640 and 768 in sequence. After the whole down-sampling operation, the resulting feature map has 1/16 the resolution of the input spectrogram of the mixed signal, and the number of feature channels reaches its maximum of C + 4N. In the up-sampling of the subsequent decoding part, since there is no information loss comparable to that of down-sampling, the feature channels of the decoding part are restored to 256, i.e. 1/3 of the maximum channel count.
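The channel schedule and resolution arithmetic above can be verified in a few lines:

```python
C, N = 256, 128   # base channel count and arithmetic increment from the disclosure

# The first convolution keeps C channels; each later down-sampling stage adds N.
encoder_channels = [C + (i + 1) * N for i in range(4)]

resolution = (512, 64)
for _ in range(4):                       # four down-samplings halve H and W each time
    resolution = (resolution[0] // 2, resolution[1] // 2)
```

The maximum channel count C + 4N = 768 is exactly 3C, which is why the decoder's return to 256 channels is described as 1/3 of the maximum.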
Further, in the down-sampling (encoding) and up-sampling (decoding) parts of each hourglass module, all convolutional layers use a 3×3 kernel.
Further, the hourglass module adopts the L1 norm between the real spectrogram and the predicted spectrogram as the loss function. Specifically, given an input spectrogram X, the i-th real music source Y_i, and the mask M_{i,j} generated for the i-th music source by the j-th hourglass module, the loss of the i-th source is defined as:

L_{i,j} = \| M_{i,j} \odot X - Y_i \|_1

where \odot denotes element-wise multiplication and the L1 norm is the sum of the absolute values of the matrix elements.
Further, the overall loss function of the stacked hourglass network is:

L = \sum_{j=1}^{4} \sum_{i=1}^{C} L_{i,j}

where C is the number of sources the network is to separate.
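The masked-L1 per-source loss and its sum over the four hourglass modules and C sources can be rendered minimally in NumPy; the tiny spectrogram shapes below are illustrative only:

```python
import numpy as np

def source_loss(mask, X, Y):
    # L1 norm between the predicted spectrogram (mask ⊙ X) and the true source Y.
    return np.abs(mask * X - Y).sum()

def total_loss(masks, X, sources):
    # Sum over the 4 hourglass modules (j) and the C sources (i),
    # giving the intermediate supervision described above.
    return sum(source_loss(masks[j][i], X, sources[i])
               for j in range(len(masks)) for i in range(len(sources)))

X = np.array([[2.0, 4.0]])                # toy input spectrogram
Y = [np.array([[1.0, 1.0]])]              # C = 1 source for illustration
masks = [[np.full_like(X, 0.5)]] * 4      # same mask from each of 4 modules
```

Because every module contributes a loss term, gradients flow into every stage directly rather than only through the final output.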
In this technical solution, for the vocal/accompaniment separation task on the MIR-1K and DSD100 data sets, C is set to 2, corresponding to the vocals and the accompaniment respectively. For the multi-source separation task on the DSD100 data set, C is set to 4, outputting time-frequency masks for drums, bass, vocals and other, respectively.
Further, the second voice predicted value and the second accompaniment predicted value are calculated as follows:

\tilde{y}_v = m_v \odot x_t, \quad \tilde{y}_a = m_a \odot x_t

where \odot denotes element-wise multiplication, \tilde{y}_v and \tilde{y}_a are the second vocal and second accompaniment prediction values respectively, x_t is the amplitude spectrum of the original mixed signal, and m_v, m_a are the time-frequency masks:

m_v = \frac{|\hat{y}_v|}{|\hat{y}_v| + |\hat{y}_a|}, \quad m_a = \frac{|\hat{y}_a|}{|\hat{y}_v| + |\hat{y}_a|}

where \hat{y}_v and \hat{y}_a are the first vocal and first accompaniment prediction values respectively.
This technical solution uses the time-frequency masking technique to further smooth the source separation results, imposing the constraint that the sum of the prediction results equals the original mixture.
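The mixture constraint can be checked numerically: the soft masks sum to 1 at every time-frequency bin, so the two masked predictions sum exactly to the mixture magnitude. The random magnitudes below are illustrative stand-ins for the network's first predictions:

```python
import numpy as np

rng = np.random.default_rng(1)
y_v = np.abs(rng.standard_normal((4, 4)))   # first vocal prediction (magnitude)
y_a = np.abs(rng.standard_normal((4, 4)))   # first accompaniment prediction
x_t = np.abs(rng.standard_normal((4, 4)))   # mixture amplitude spectrum

denom = np.abs(y_v) + np.abs(y_a)
m_v, m_a = np.abs(y_v) / denom, np.abs(y_a) / denom   # time-frequency masks
y2_v, y2_a = m_v * x_t, m_a * x_t                      # second predictions
```

This is what makes the masked outputs "smooth": no energy is invented or lost relative to the input mixture.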
In conclusion, compared with the prior art, the invention has the following beneficial effects:
1. In the music source separation method based on the stacked hourglass network, owing to the up-sampling and down-sampling paths of the hourglass modules and the end-to-end stacking of four stages of different hourglass modules, the voice feature information learned by the previous-stage hourglass module serves as the input of the next-stage module, so that later stages obtain richer feature information, the context of the voice signal is more fully exploited, and the separation performance of the network is improved. Meanwhile, to address the shortcomings of the hourglass network's encoding part, an arithmetic-progression channel-widening structure is designed inside the hourglass module to compensate for the information loss incurred when down-sampling the voice spectrogram, further improving the music source separation performance.
2. In the music source separation method based on the stacked hourglass network, the channel attention mechanism is added to the stacked hourglass network as a basic component, so that the whole network can be modeled according to the importance of different channels: the weights of channels carrying redundant feature information are correspondingly reduced, improving the representational capacity of the whole hourglass network. In addition, attention is paid to details of the network structure: batch normalization and the Leaky ReLU activation function are added, optimizing the deep structure of the stacked hourglass network and further improving music source separation performance.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Example 1:
As shown in Fig. 1, the present embodiment comprises the following steps:
s1, performing framing, windowing and Fourier transform on the original mixed voice signal to obtain an original mixed voice signal spectrogram, wherein the original mixed voice signal spectrogram comprises an original mixed signal amplitude spectrum and an original mixed signal phase spectrum;
s2, inputting the original mixed signal amplitude spectrum into a stacked hourglass network, wherein the stacked hourglass network comprises four hourglass modules which are stacked end to end, and the original mixed signal amplitude spectrum obtains a first human voice predicted value and a first accompaniment predicted value after passing through the stacked hourglass network; each hourglass module downsamples an output channel subjected to first convolution in an equal difference type increasing mode;
s3, combining the first voice predicted value and the first accompaniment predicted value with a time frequency mask to obtain a second voice predicted value after the time frequency mask and a second accompaniment predicted value after the time frequency mask; and respectively combining the second voice predicted value and the second accompaniment predicted value with the phase spectrum of the original mixed signal, and respectively obtaining a predicted voice signal and a predicted accompaniment signal through inverse Fourier transform.
In the music source separation method based on the stacked hourglass network, based on the up-sampling and down-sampling paths of the hourglass modules and the end-to-end stacking of four stages of different hourglass modules, the voice feature information learned by the previous-stage hourglass module serves as the input of the next-stage module, so that later stages obtain richer feature information and the context of the voice signal is more fully exploited, improving the separation performance of the network; meanwhile, to address the shortcomings of the hourglass network's encoding part, an arithmetic-progression channel-widening structure is designed inside the hourglass module to compensate for the information loss incurred when down-sampling the voice spectrogram, further improving the music source separation performance.
Example 2:
the present embodiment further includes, on the basis of embodiment 1: the stacked hourglass network further comprises an initial convolution module consisting of five successive convolution layers, which is arranged before the four hourglass modules, and which does not change the size of the input image, but only increases the number of output channels of the image.
Preferably, the four hourglass modules are fourth-order hourglass modules, and within each hourglass module the input spectrogram undergoes four successive down-samplings, each halving its resolution.
Preferably, an attention layer is further provided after the convolutional layers in each hourglass module, and batch normalization and a Leaky ReLU activation function are further provided in the convolutional layers of each hourglass module to improve gradient backpropagation and parameter updating.
Preferably, during the first convolution of each hourglass module's down-sampling path the number of output channels is kept unchanged, so the feature information of the original mixed signal's amplitude spectrum is learned at a 1:1 channel ratio; after the first convolution, 128 channels are added at each stage in turn, so the output channel sizes of the encoding part in each hourglass module are 384, 512, 640 and 768 in sequence.
Preferably, in the down-sampling (encoding) and up-sampling (decoding) parts of each hourglass module, all convolutional layers use a 3×3 kernel.
Preferably, the hourglass module adopts the L1 norm between the real spectrogram and the predicted spectrogram as the loss function. Specifically, given an input spectrogram X, the i-th real music source Y_i, and the mask M_{i,j} generated for the i-th music source by the j-th hourglass module, the loss of the i-th source is defined as:

L_{i,j} = \| M_{i,j} \odot X - Y_i \|_1

where \odot denotes element-wise multiplication and the L1 norm is the sum of the absolute values of the matrix elements.
Preferably, the total loss function of the stacked hourglass network is:

L = \sum_{j=1}^{4} \sum_{i=1}^{C} L_{i,j}

where C is the number of sources the network is to separate.
Preferably, the second voice predicted value and the second accompaniment predicted value are calculated as follows:

\tilde{y}_v = m_v \odot x_t, \quad \tilde{y}_a = m_a \odot x_t

where \odot denotes element-wise multiplication, \tilde{y}_v and \tilde{y}_a are the second vocal and second accompaniment prediction values respectively, x_t is the amplitude spectrum of the original mixed signal, and m_v, m_a are the time-frequency masks:

m_v = \frac{|\hat{y}_v|}{|\hat{y}_v| + |\hat{y}_a|}, \quad m_a = \frac{|\hat{y}_a|}{|\hat{y}_v| + |\hat{y}_a|}

where \hat{y}_v and \hat{y}_a are the first vocal and first accompaniment prediction values respectively.
In the music source separation method based on the stacked hourglass network, the channel attention mechanism is added to the stacked hourglass network as a basic component, so that the whole network can be modeled according to the importance of different channels: the weights of channels carrying redundant feature information are correspondingly reduced, improving the representational capacity of the whole hourglass network. In addition, attention is paid to details of the network structure: batch normalization and the Leaky ReLU activation function are added, optimizing the deep structure of the stacked hourglass network and further improving music source separation performance. The music source separation method of this embodiment is based on a stacked hourglass network, designs an arithmetic-progression channel-widening structure for the hourglass module, and adds an attention mechanism, giving it stronger feature-extraction and multi-scale integration capabilities, attention among channels, and better music source separation performance.
Verification and comparative test: to verify the separation effect of the method of Embodiment 2, the inventors compared, on the MIR-1K data set, the accompaniment separation performance of a control group using the existing RPCA (robust principal component analysis)-based voice separation method, the separation method of Embodiment 1, and the separation method of Embodiment 2. The test conditions were: a single Tesla P100 GPU and a TensorFlow deep-learning environment; the networks were trained with the Adam optimizer, using the same initial learning rate and the same batch size, iteration count and other parameter settings.
(1) Performance evaluation indexes are as follows:
the indexes for evaluating the separation effect select a signal-to-noise ratio (SDR), a Source Interference Ratio (SIR) and a Source Artifact Ratio (SAR) based on BSS-EVAL as evaluation indexes, and specifically comprise the following steps:
Source-to-distortion ratio (SDR):

SDR = 10 \log_{10} \frac{\| e_{target} \|^2}{\| e_{interf} + e_{noise} + e_{artif} \|^2}

Source-to-interference ratio (SIR):

SIR = 10 \log_{10} \frac{\| e_{target} \|^2}{\| e_{interf} \|^2}

Source-to-artifact ratio (SAR):

SAR = 10 \log_{10} \frac{\| e_{target} + e_{interf} + e_{noise} \|^2}{\| e_{artif} \|^2}

Here e_{target}(t) is the prediction (target) signal, e_{interf}(t) is the interference signal, e_{noise}(t) is the noise signal, and e_{artif}(t) is the artifact introduced by the algorithm. SDR evaluates the separation effect from a relatively comprehensive angle, SIR from the angle of interference, SNR from the angle of noise, and SAR from the angle of artifacts; the larger the SDR, SIR and SAR values, the better the separation of voice and background music. Global NSDR (GNSDR), global SIR (GSIR) and global SAR (GSAR) are computed as weighted averages of NSDR, SIR and SAR respectively, weighted by source length, where the normalized SDR (NSDR) is NSDR(T_e, T_o, T_m) = SDR(T_e, T_o) - SDR(T_m, T_o), with T_e the voice/background music predicted by the hourglass network, T_o the pure voice/background music in the original signal, and T_m the original mixed signal.
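Given the BSS-EVAL decomposition of an estimate into target, interference, noise and artifact components, the three ratios are simple energy ratios in decibels; a minimal sketch with toy two-sample signals:

```python
import numpy as np

def _ratio_db(num, den):
    # Energy ratio of two signal components, in decibels.
    return 10.0 * np.log10(np.sum(num ** 2) / np.sum(den ** 2))

def sdr(e_t, e_i, e_n, e_a):  # overall distortion: everything but the target
    return _ratio_db(e_t, e_i + e_n + e_a)

def sir(e_t, e_i):            # interference only
    return _ratio_db(e_t, e_i)

def sar(e_t, e_i, e_n, e_a):  # artifacts only
    return _ratio_db(e_t + e_i + e_n, e_a)

e_t = np.array([2.0, 0.0])    # toy target component
e_i = np.array([1.0, 0.0])    # toy interference component
e_n = np.array([0.0, 0.0])    # toy noise component
e_a = np.array([1.0, 0.0])    # toy artifact component
```

Larger values of all three metrics indicate better separation, consistent with the discussion above.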
(2) The test results are as follows:
As can be seen from the above table, the accompaniment separation obtained by the separation methods of Embodiments 1 and 2 of the present invention is significantly better than that of the control group using the existing music source separation method in terms of distortion ratio, interference ratio and artifact ratio. In particular, the stacked hourglass network of Embodiment 2, after adding the attention mechanism on top of the arithmetic-progression channel-widening structure, has stronger feature-extraction and multi-scale integration capabilities and attention among channels, and achieves better results on all three ratios. The music source separation method is based on the stacked hourglass network, designs an arithmetic-progression channel-widening structure for the hourglass module, and adds an attention mechanism, giving it stronger feature-extraction and multi-scale integration capabilities, attention among channels, and better music source separation performance.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.