CN112259119B - Music source separation method based on stacked hourglass network - Google Patents

Music source separation method based on stacked hourglass network

Info

Publication number
CN112259119B
CN112259119B (application CN202011118473.3A)
Authority
CN
China
Prior art keywords
hourglass
network
stacked
module
accompaniment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011118473.3A
Other languages
Chinese (zh)
Other versions
CN112259119A (en)
Inventor
孙超 (Sun Chao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Cehui Technology Co., Ltd
Original Assignee
Shenzhen Cehui Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Cehui Technology Co Ltd filed Critical Shenzhen Cehui Technology Co Ltd
Priority to CN202011118473.3A priority Critical patent/CN112259119B/en
Publication of CN112259119A publication Critical patent/CN112259119A/en
Application granted
Publication of CN112259119B publication Critical patent/CN112259119B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G10L21/028: Voice signal separating using properties of sound source
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention discloses a music source separation method based on a stacked hourglass network, comprising the following steps: S1, framing, windowing and Fourier-transforming the original mixed speech signal to obtain its spectrogram; S2, inputting the magnitude spectrum of the original mixture into a stacked hourglass network to obtain a first vocal prediction value and a first accompaniment prediction value; during downsampling, each hourglass module increases the number of output channels after its first convolution in an arithmetic progression; S3, applying a time-frequency mask to obtain a second vocal prediction value and a second accompaniment prediction value, from which the predicted vocal signal and predicted accompaniment signal are recovered. Compared with the prior art, the proposed method uses the hourglass modules to fully exploit the contextual relationships within the speech signal, improving the network's separation performance, and designs an arithmetic channel-increase structure to compensate for the information lost while downsampling the speech spectrogram, further improving the music source separation effect.

Description

Music source separation method based on stacked hourglass network
Technical Field
The invention relates to a music source separation method, in particular to a music source separation method based on a stacked hourglass network.
Background
Music source separation is an important branch of speech and audio signal processing. Depending on the requirements of different fields, its purpose may be to separate vocals or accompaniment from a mixed signal, or to isolate the sound of a single instrument. The separated sources can be further applied to instrument identification, pitch statistics, music transcription, lyric synchronization, and singer and lyric identification in the field of music retrieval. In the speech field, they support applications such as speech recognition, keyword spotting, and speech emotion recognition. With ongoing research into machine learning and deep learning, neural network architectures have been continuously enriched and evolved. To apply a one-dimensional speech signal effectively to a neural network such as a CNN, the signal can first be converted into a two-dimensional magnitude spectrogram through the Fourier transform, or into a Mel spectrogram or logarithmic Mel spectrogram through a Mel-scale filter bank. The resulting two-dimensional image can then be trained with a CNN or another neural network suited to signal processing. However, such CNNs are often shallow and cannot exploit the depth advantage of deep learning to extract deeper speech features; their simple structures cannot handle more complex separation tasks, and their separation results are unsatisfactory.
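A minimal Python sketch of this 1-D-signal-to-2-D-spectrogram conversion is shown below; the patent names no library, so the use of librosa and the file name are illustrative assumptions (the window and hop sizes match those given later in the description).

```python
# Minimal sketch: converting a 1-D audio signal into 2-D spectrogram inputs.
# librosa and "mixture.wav" are illustrative assumptions, not named in the patent.
import librosa
import numpy as np

y, sr = librosa.load("mixture.wav", sr=None, mono=True)  # hypothetical input file
stft = librosa.stft(y, n_fft=1024, hop_length=256)       # window 1024, hop 256
magnitude, phase = np.abs(stft), np.angle(stft)          # 2-D magnitude/phase images

# Mel / log-Mel alternatives mentioned above:
mel = librosa.feature.melspectrogram(S=magnitude**2, sr=sr)
log_mel = librosa.power_to_db(mel)
```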
The stacked hourglass network is a neural network originally developed for human pose estimation. In a stacked hourglass network, the hourglass module of each stage is a simple, lightweight network containing its own downsampling and upsampling paths, and the modules of successive stages are stacked end to end to form the full network. Intermediate supervision ensures that the parameters of every layer are updated normally. Although the network was originally designed for human pose problems, its iterative structure lets the hourglass modules handle features of body joints at different scales and capture the various spatial relationships among them. It not only solved human pose estimation effectively but, more importantly, provided a new idea and a new backbone for other image processing fields; many leading network structures are variants built on the stacked hourglass network.
Disclosure of Invention
The invention aims to overcome the poor speech separation performance of prior-art neural network structures and provides a music source separation method based on a stacked hourglass network. Building on the upsampling and downsampling paths of the hourglass module, as the four stages of hourglass modules are stacked end to end, the speech feature information learned by each hourglass module serves as the input of the next, so that later modules receive richer feature information and the contextual relationships within the speech signal are exploited more fully, improving the network's separation performance. Meanwhile, to address the weakness of the hourglass network's encoding part, an arithmetic channel-increase structure is designed inside the hourglass module to compensate for the information lost while downsampling the speech spectrogram, further improving the music source separation effect.
The purpose of the invention is mainly realized by the following technical scheme:
the music source separation method based on the stacked hourglass network comprises the following steps: s1, performing framing, windowing and Fourier transform on the original mixed voice signal to obtain an original mixed voice signal spectrogram, wherein the original mixed voice signal spectrogram comprises an original mixed signal amplitude spectrum and an original mixed signal phase spectrum; s2, inputting the original mixed signal amplitude spectrum into a stacked hourglass network, wherein the stacked hourglass network comprises four hourglass modules which are stacked end to end, and the original mixed signal amplitude spectrum obtains a first human voice predicted value and a first accompaniment predicted value after passing through the stacked hourglass network; each hourglass module downsamples an output channel subjected to first convolution in an equal difference type increasing mode; s3, combining the first voice predicted value and the first accompaniment predicted value with a time frequency mask to obtain a second voice predicted value after the time frequency mask and a second accompaniment predicted value after the time frequency mask; and respectively combining the second voice predicted value and the second accompaniment predicted value with the phase spectrum of the original mixed signal, and respectively obtaining a predicted voice signal and a predicted accompaniment signal through inverse Fourier transform.
The stacked hourglass network is a neural network originally developed for human pose estimation; the hourglass module of each stage is a simple, lightweight network containing its own downsampling and upsampling paths, and the modules of successive stages are stacked end to end. Intermediate supervision ensures that the parameters of every layer are updated normally. In a single hourglass module, many designs use channels of equal width for the repeated downsampling and upsampling. Although this appears to be a topologically attractive symmetric structure, its effect falls far short of mainstream networks such as ResNet. This technical scheme applies the stacked hourglass network to music source separation: based on the upsampling and downsampling paths of the hourglass modules, as the four stages are stacked end to end, the speech feature information learned by each module serves as the input of the next, so later modules receive richer feature information, the contextual relationships within the speech signal are exploited more fully, and the network's separation performance improves. The stacking also deepens the network, helping it learn deeper semantic features. Because a time-frequency mask can impose constraints between the input mixture and the output predictions that reflect the relationship between the different sources in the mixture, it yields smoother prediction results; the time-frequency mask is therefore used as the output for each separated source, and multiplying it with the input mixture spectrogram gives the network's estimated source spectrogram. Each hourglass module contributes one loss term, so the final loss function is the sum of four losses; this intermediate supervision guarantees normal parameter updates in every layer and improves separation performance. Since the stacked hourglass network does not alter the phase of the original signal, the predicted source signal is obtained by the inverse STFT, i.e. by combining the predicted magnitude spectrogram with the original phase.
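As a minimal sketch of the reconstruction step just described, the predicted magnitude can be recombined with the mixture's original phase and inverted; the function below is an illustrative assumption, not code from the patent.

```python
# Minimal sketch: inverse STFT using the original mixture phase (the patent
# states the principle; this exact implementation is an assumption).
import numpy as np
import librosa

def reconstruct(pred_magnitude, mix_phase, hop_length=256):
    """Predicted magnitude spectrogram + original phase -> time-domain signal."""
    complex_spec = pred_magnitude * np.exp(1j * mix_phase)
    return librosa.istft(complex_spec, hop_length=hop_length)
```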
In addition, to address the weakness of the hourglass network's encoding part, an arithmetic channel-increase structure is designed inside the hourglass module for downsampling the mixture's speech spectrogram: after the first convolution, the number of output channels of each downsampling stage increases in an arithmetic progression, building a strong feature encoder that reduces information loss, compensates for the information lost while downsampling the speech spectrogram, and further improves the music source separation effect.
Further, the stacked hourglass network also comprises an initial convolution module consisting of five consecutive convolution layers, arranged before the four hourglass modules; it does not change the size of the input image and only increases its number of output channels.
In this technical scheme, the original mixed speech signal is first converted into a spectrogram through the Fourier transform and then fed to the first-stage hourglass module. Specifically, the sliding window length of the Fourier transform is set to 1024 and the hop between adjacent windows to 256. Speech signals shorter than 64 time frames are zero-padded. The speech spectrogram obtained after the Fourier transform has a resolution of 512x64, corresponding to the height and width of the image respectively. The minimum number of feature channels in the four hourglass modules is 256. Since the spectrogram obtained through the Fourier transform is a single-channel grayscale image, it is passed through an initial convolution module before entering the first-stage hourglass module, in order to increase its number of feature channels and avoid the unstable network performance that an overly large gap in feature dimensions would cause. The initial convolution module consists of five consecutive convolutional layers that do not change the size of the input image and only increase its number of output channels. Specifically, after a 512x64x1 speech spectrogram passes in turn through five convolutional layers built from 7x7x64, 3x3x128 and 3x3x256 kernels, the resulting input spectrogram of the mixed signal has dimensions 512x64x256, where the last factor is the number of output channels.
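A minimal tf.keras sketch of this initial convolution module follows; the patent lists three kernel/channel specifications for five layers, so the assumed split is one 7x7x64 layer, two 3x3x128 layers and two 3x3x256 layers, and the activation choice is also an assumption.

```python
# Minimal sketch of the initial convolution module: 512x64x1 -> 512x64x256,
# spatial size preserved. Layer split and activations are assumptions.
import tensorflow as tf

def initial_conv_module():
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 7, padding="same", activation=tf.nn.leaky_relu),
        tf.keras.layers.Conv2D(128, 3, padding="same", activation=tf.nn.leaky_relu),
        tf.keras.layers.Conv2D(128, 3, padding="same", activation=tf.nn.leaky_relu),
        tf.keras.layers.Conv2D(256, 3, padding="same", activation=tf.nn.leaky_relu),
        tf.keras.layers.Conv2D(256, 3, padding="same", activation=tf.nn.leaky_relu),
    ])

x = tf.zeros([1, 512, 64, 1])          # one mixture magnitude spectrogram
print(initial_conv_module()(x).shape)  # (1, 512, 64, 256)
```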
Furthermore, the four hourglass modules are fourth-order hourglass modules, and within each module the input spectrogram undergoes four consecutive downsampling operations, each halving its resolution.
In this technical scheme, in the encoding part of a single hourglass module, the 512x64x256 input spectrogram of the mixed signal undergoes four consecutive downsampling operations, each halving its resolution.
Furthermore, an attention layer is placed after the convolution layers in each hourglass module, and batch normalization and a Leaky ReLU activation function are applied in each convolution layer to improve backward gradient propagation and parameter updates.
This technical scheme keeps the pooling and convolution structure unchanged during downsampling, adds an attention layer after the convolutional layers, and applies batch normalization and a Leaky ReLU activation function to each convolutional layer to improve backward gradient propagation and parameter updates. With the channel attention mechanism added, different channels receive different degrees of importance. Incorporated into the stacked hourglass network as a basic component, the channel attention mechanism lets the whole network model the importance of different channels, correspondingly reducing the weights of channels carrying redundant feature information and improving the expressive capacity of the whole hourglass network. In addition, this technical scheme attends to details of the network structure: batch normalization and the Leaky ReLU activation function optimize the deep structure of the stacked hourglass network and further improve music source separation performance.
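The sketch below shows one common realization of such per-channel weighting, a squeeze-and-excitation style block; the patent does not spell out its exact attention formulation, so the structure, the reduction ratio and the class name are assumptions.

```python
# Minimal sketch of a squeeze-and-excitation style channel-attention layer.
# The exact formulation used in the patent is not given; this is one common design.
import tensorflow as tf

class ChannelAttention(tf.keras.layers.Layer):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = tf.keras.layers.GlobalAveragePooling2D()
        self.fc1 = tf.keras.layers.Dense(channels // reduction, activation="relu")
        self.fc2 = tf.keras.layers.Dense(channels, activation="sigmoid")

    def call(self, x):
        w = self.fc2(self.fc1(self.squeeze(x)))            # per-channel weights in (0, 1)
        return x * tf.reshape(w, [-1, 1, 1, x.shape[-1]])  # reweight the feature maps
```

Channels judged redundant receive weights near zero, which realizes the down-weighting of redundant feature information described above.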
Further, the number of output channels is kept unchanged during the first convolution of each hourglass module's downsampling path, so the feature information of the original mixture magnitude spectrum is learned at a 1:1 channel ratio; after the first convolution, 128 channels are added at each stage, making the encoder output channel sizes in each hourglass module 384, 512, 640 and 768 in turn.
This technical scheme uses an arithmetic increase structure for the number of output channels of the convolution layers during downsampling. Specifically, after the first convolution of the input mixture spectrogram, the number of channels is kept unchanged at C = 256, avoiding the needless information loss that a channel-count mismatch before and after the convolution would cause, and learning the feature information of the input spectrogram at a 1:1 channel ratio. After the first convolution, N = 128 channels are added at each stage, so the encoder output channel sizes within a single fourth-order hourglass module are 384, 512, 640 and 768 in turn. Once the whole downsampling operation completes, the resulting feature map has 1/16 the resolution of the input mixture spectrogram, and the number of feature channels reaches its maximum, C + 4N = 768. In the upsampling of the subsequent decoding part, since there is no information loss comparable to downsampling, the feature channels of the decoding part are restored to 256, i.e. 1/3 of the maximum channel count.
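A minimal tf.keras sketch of this arithmetic channel-increase encoder is given below; C = 256, N = 128, the four stages and the 3x3 kernels follow the description, while the pooling type, the placement of batch normalization and the helper names are assumptions.

```python
# Minimal sketch of the hourglass encoder with arithmetic channel increase:
# the first convolution keeps C=256 channels, then each downsampling stage
# halves the resolution and adds N=128 channels (384, 512, 640, 768).
import tensorflow as tf

def conv_bn_lrelu(filters):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(filters, 3, padding="same"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.LeakyReLU(),
    ])

def hourglass_encoder(C=256, N=128):
    blocks = [conv_bn_lrelu(C)]            # first convolution: 1:1 channel ratio
    for k in range(1, 5):                  # four downsampling stages
        blocks.append(tf.keras.Sequential([
            tf.keras.layers.MaxPool2D(2),  # halve the resolution
            conv_bn_lrelu(C + k * N),      # 384, 512, 640, 768 channels
        ]))
    return tf.keras.Sequential(blocks)

x = tf.zeros([1, 512, 64, 256])            # output of the initial conv module
print(hourglass_encoder()(x).shape)        # (1, 32, 4, 768): 1/16 resolution, C+4N channels
```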
Further, in the downsampling of the encoding part and the upsampling of the decoding part of each hourglass module, the convolution kernel size of all convolution layers is 3x3.
Further, the hourglass module adopts the L1 norm between the real spectrogram and the predicted spectrogram as the loss function. Specifically, given an input spectrogram $X$, the $i$-th real music source $Y_i$, and the mask $\hat{M}_i^j$ generated for the $i$-th music source by the $j$-th hourglass module, the loss for the $i$-th source is defined as:

$D_i^j = \lVert \hat{M}_i^j \odot X - Y_i \rVert_1$

where $\odot$ denotes element-wise multiplication and the L1 norm is the sum of the absolute values of the matrix elements.
Further, the overall loss function of the stacked hourglass network is:

$D = \sum_{j=1}^{4} \sum_{i=1}^{C} D_i^j$

where $C$ is the number of sources the network is to separate.
In this technical scheme, for the common vocal-accompaniment separation task on the MIR-1K and DSD100 datasets, C is set to 2, corresponding to vocals and accompaniment respectively. For the multi-source separation task on the DSD100 dataset, C is set to 4, outputting time-frequency masks for drums, bass, vocals and others respectively.
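A minimal sketch of this loss follows, summing the masked-prediction L1 error over all four hourglass modules (intermediate supervision) and all C sources; the tensor shapes and the function name are assumptions.

```python
# Minimal sketch of the L1 loss with intermediate supervision:
# one mask tensor per hourglass module, C source channels each.
import tensorflow as tf

def separation_loss(X, masks, Y):
    """X: [B,H,W] mixture magnitude; masks: list of 4 [B,H,W,C] module outputs;
    Y: [B,H,W,C] ground-truth source magnitudes."""
    loss = 0.0
    for mask_j in masks:                           # j-th hourglass module
        pred_j = mask_j * X[..., tf.newaxis]       # mask applied element-wise to X
        loss += tf.reduce_sum(tf.abs(pred_j - Y))  # L1 norms summed over sources i
    return loss
```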
Further, the second vocal prediction value and the second accompaniment prediction value are calculated as:

$\tilde{y}_v = \hat{m} \odot x_t$
$\tilde{y}_a = (1 - \hat{m}) \odot x_t$

where $\odot$ denotes element-wise multiplication, $\tilde{y}_v$ and $\tilde{y}_a$ are the second vocal prediction value and the second accompaniment prediction value respectively, $x_t$ is the magnitude spectrum of the original mixed signal, and $\hat{m}$ is the time-frequency mask,

$\hat{m} = \dfrac{|\hat{y}_v|}{|\hat{y}_v| + |\hat{y}_a|}$

where $\hat{y}_v$ and $\hat{y}_a$ are the first vocal prediction value and the first accompaniment prediction value respectively.
This technical scheme uses the time-frequency masking technique to further smooth the source separation result, so that the predictions satisfy the constraint that their sum equals the original mixture.
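A minimal NumPy sketch of this masking step follows, assuming the soft mask built from the two first-stage magnitude predictions as defined above; the function and variable names are illustrative.

```python
# Minimal sketch of time-frequency mask smoothing: the two second predictions
# sum exactly to the mixture magnitude, enforcing the constraint above.
import numpy as np

def apply_tf_mask(y_vocal1, y_accomp1, x_mix, eps=1e-8):
    """First-stage magnitude predictions -> mask-smoothed second predictions."""
    mask = np.abs(y_vocal1) / (np.abs(y_vocal1) + np.abs(y_accomp1) + eps)
    y_vocal2 = mask * x_mix           # second vocal prediction
    y_accomp2 = (1.0 - mask) * x_mix  # second accompaniment prediction
    return y_vocal2, y_accomp2
```

The small eps guards against division by zero in silent time-frequency bins; it is an implementation assumption, not part of the patent's formula.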
In conclusion, compared with the prior art, the invention has the following beneficial effects:
1. In the music source separation method based on the stacked hourglass network, thanks to the upsampling and downsampling paths of the hourglass modules, as the four stages of different hourglass modules are stacked end to end, the speech feature information learned by each hourglass module serves as the input of the next; later modules therefore receive richer feature information, the contextual relationships within the speech signal are exploited more fully, and the network's separation performance improves. Meanwhile, to address the weakness of the hourglass network's encoding part, an arithmetic channel-increase structure is designed inside the hourglass module to compensate for the information lost while downsampling the speech spectrogram, further improving the music source separation effect.
2. In the music source separation method based on the stacked hourglass network, the channel attention mechanism is added to the stacked hourglass network as a basic component, letting the whole network model the importance of different channels, correspondingly reducing the weights of channels carrying redundant feature information and improving the expressive capacity of the whole hourglass network. In addition, attention is paid to the details of the network structure: batch normalization and the Leaky ReLU activation function are added, optimizing the deep structure of the stacked hourglass network and further improving music source separation performance.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
fig. 1 is a flow chart of a music source separation method based on a stacked hourglass network.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Example 1:
As shown in fig. 1, the present embodiment includes the following steps:
s1, performing framing, windowing and Fourier transform on the original mixed voice signal to obtain an original mixed voice signal spectrogram, wherein the original mixed voice signal spectrogram comprises an original mixed signal amplitude spectrum and an original mixed signal phase spectrum;
s2, inputting the original mixed signal amplitude spectrum into a stacked hourglass network, wherein the stacked hourglass network comprises four hourglass modules which are stacked end to end, and the original mixed signal amplitude spectrum obtains a first human voice predicted value and a first accompaniment predicted value after passing through the stacked hourglass network; each hourglass module downsamples an output channel subjected to first convolution in an equal difference type increasing mode;
s3, combining the first voice predicted value and the first accompaniment predicted value with a time frequency mask to obtain a second voice predicted value after the time frequency mask and a second accompaniment predicted value after the time frequency mask; and respectively combining the second voice predicted value and the second accompaniment predicted value with the phase spectrum of the original mixed signal, and respectively obtaining a predicted voice signal and a predicted accompaniment signal through inverse Fourier transform.
In the music source separation method based on the stacked hourglass network, based on the upsampling and downsampling paths of the hourglass modules, as the four stages of different hourglass modules are stacked end to end, the speech feature information learned by each hourglass module serves as the input of the next; later modules therefore receive richer feature information, the contextual relationships within the speech signal are exploited more fully, and the network's separation performance improves. Meanwhile, to address the weakness of the hourglass network's encoding part, an arithmetic channel-increase structure is designed inside the hourglass module to compensate for the information lost while downsampling the speech spectrogram, further improving the music source separation effect.
Example 2:
the present embodiment further includes, on the basis of embodiment 1: the stacked hourglass network further comprises an initial convolution module consisting of five successive convolution layers, which is arranged before the four hourglass modules, and which does not change the size of the input image, but only increases the number of output channels of the image.
Preferably, the four hourglass modules are fourth-order hourglass modules, and within each module the input spectrogram undergoes four consecutive downsampling operations, each halving its resolution.
Preferably, an attention layer is further placed after the convolution layers in each hourglass module, and batch normalization and a Leaky ReLU activation function are further applied in each convolution layer to improve backward gradient propagation and parameter updates.
Preferably, the number of output channels is kept unchanged during the first convolution of each hourglass module's downsampling path, so the feature information of the original mixture magnitude spectrum is learned at a 1:1 channel ratio; after the first convolution, 128 channels are added at each stage, making the encoder output channel sizes in each hourglass module 384, 512, 640 and 768 in turn.
Preferably, in the downsampling of the encoding part and the upsampling of the decoding part of each hourglass module, the convolution kernel size of all convolutional layers is 3x3.
Preferably, the hourglass module adopts the L1 norm between the real spectrogram and the predicted spectrogram as the loss function. Specifically, given an input spectrogram $X$, the $i$-th real music source $Y_i$, and the mask $\hat{M}_i^j$ generated for the $i$-th music source by the $j$-th hourglass module, the loss for the $i$-th source is defined as:

$D_i^j = \lVert \hat{M}_i^j \odot X - Y_i \rVert_1$

where $\odot$ denotes element-wise multiplication and the L1 norm is the sum of the absolute values of the matrix elements.
Preferably, the total loss function of the stacked hourglass network is:

$D = \sum_{j=1}^{4} \sum_{i=1}^{C} D_i^j$

where $C$ is the number of sources the network is to separate.
Preferably, the second vocal prediction value and the second accompaniment prediction value are calculated as:

$\tilde{y}_v = \hat{m} \odot x_t$
$\tilde{y}_a = (1 - \hat{m}) \odot x_t$

where $\odot$ denotes element-wise multiplication, $\tilde{y}_v$ and $\tilde{y}_a$ are the second vocal prediction value and the second accompaniment prediction value respectively, $x_t$ is the magnitude spectrum of the original mixed signal, and $\hat{m}$ is the time-frequency mask,

$\hat{m} = \dfrac{|\hat{y}_v|}{|\hat{y}_v| + |\hat{y}_a|}$

where $\hat{y}_v$ and $\hat{y}_a$ are the first vocal prediction value and the first accompaniment prediction value respectively.
In the music source separation method based on the stacked hourglass network, the channel attention mechanism is added to the stacked hourglass network as a basic component, letting the whole network model the importance of different channels, correspondingly reducing the weights of channels carrying redundant feature information and improving the expressive capacity of the whole hourglass network. Attention is also paid to the details of the network structure: batch normalization and the Leaky ReLU activation function are added, optimizing the deep structure of the stacked hourglass network and further improving music source separation performance. In short, the music source separation method of this embodiment builds on the stacked hourglass network, designs an arithmetic channel-increase structure for the hourglass module, and adds an attention mechanism, giving it stronger feature extraction and multi-scale integration capability, attention across channels, and better music source separation performance.
Verification and comparative test: to verify the separation effect of the method of Embodiment 2, the inventors compared the accompaniment separation results on the MIR-1K dataset among a control group using an existing RPCA (robust principal component analysis) based speech separation method, the separation method of Embodiment 1, and the separation method of Embodiment 2. Test conditions: a single Tesla P100 GPU with a TensorFlow deep-learning environment; the networks were trained with the Adam optimizer under the same initial learning rate, batch size, iteration count and other parameter settings.
(1) Performance evaluation indexes are as follows:
the indexes for evaluating the separation effect select a signal-to-noise ratio (SDR), a Source Interference Ratio (SIR) and a Source Artifact Ratio (SAR) based on BSS-EVAL as evaluation indexes, and specifically comprise the following steps:
signal-to-noise ratio (SDR):
Figure GDA0003012277270000077
source interference ratio ((SIR):
Figure GDA0003012277270000078
source-artifact ratio (SAR):
Figure GDA0003012277270000079
etarget(t) is a prediction signal, einterf(t) is an interference signal, enoise(t) is a noise signal, eartif(t) is an algorithm-induced artifact; SDR evaluates the separation effect of the separation algorithm from a relatively comprehensive angle, SIR analyzes the separation effect from the angle of interference, SNR analyzes the separation effect from the angle of noise, and SAR analyzes the separation effect from the angle of artifact; the larger the values of SDR, SIR and SAR are, the better the separation effect of human voice and background music is. Global NSDR (gnsdr), global SIR (gsir) and global SAR (gsar) are computed as weighted averages of NSDR, SIR and SAR, respectively, weighted by source length. Wherein, normalized sdr (nsdr): NSDR (T)e,To,Tm)=SDR(Te,To)-SDR(Tm,To) Wherein T iseDefinition of human voice/background music, T, predicted for hourglass networkoFor pure human voice/background music, T, in the original signalmIs the original mixed signal.
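As a minimal sketch of how these indexes can be computed, the mir_eval implementation of BSS-EVAL is shown below; the patent does not name a toolkit, and the stand-in signals are synthetic placeholders.

```python
# Minimal sketch: BSS-EVAL metrics via mir_eval (toolkit choice is an assumption).
import numpy as np
import mir_eval

rng = np.random.default_rng(0)
vocal_true, accomp_true = rng.standard_normal((2, 16000))      # stand-in sources
mixture = vocal_true + accomp_true
vocal_pred, accomp_pred = 0.9 * vocal_true, 0.9 * accomp_true  # stand-in estimates

reference = np.stack([vocal_true, accomp_true])
estimated = np.stack([vocal_pred, accomp_pred])
sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(reference, estimated)

# NSDR(T_e, T_o, T_m) = SDR(T_e, T_o) - SDR(T_m, T_o): improvement over the raw mix.
sdr_mix, _, _, _ = mir_eval.separation.bss_eval_sources(
    reference, np.stack([mixture, mixture]))
nsdr = sdr - sdr_mix
```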
(2) The test results are as follows:
(The test results table, comparing GNSDR, GSIR and GSAR for the control group and Embodiments 1 and 2, appears in the original only as an image and is not reproduced here.)
as can be seen from the above table, the accompaniment separation effect obtained by the separation methods of the embodiments 1 and 2 of the present invention is significantly better than that of the control group using the existing music source separation method in terms of signal-to-noise ratio, interference ratio, and artifact ratio, and particularly, the stacked hourglass network of the embodiment 2 has stronger feature extraction capability and multi-scale integration capability after adding an attention mechanism and further deepening the arithmetic channel incremental structure, and has attention among channels, which has better separation effect on signal-to-noise ratio, interference ratio, and artifact ratio. The music source separation method is based on the stacked hourglass network, an equal-difference type channel increasing structure is designed for the hourglass module, and an attention mechanism is added, so that the music source separation method has stronger feature extraction capability and multi-scale integration capability, has attention among channels, and has better music source separation performance.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (6)

1. The music source separation method based on the stacked hourglass network is characterized by comprising the following steps of:
s1, performing framing, windowing and Fourier transform on the original mixed voice signal to obtain an original mixed voice signal spectrogram, wherein the original mixed voice signal spectrogram comprises an original mixed signal amplitude spectrum and an original mixed signal phase spectrum;
s2, inputting the original mixed signal amplitude spectrum into a stacked hourglass network, wherein the stacked hourglass network comprises four hourglass modules which are stacked end to end, and the original mixed signal amplitude spectrum obtains a first human voice predicted value and a first accompaniment predicted value after passing through the stacked hourglass network; each hourglass module downsamples an output channel subjected to first convolution in an equal difference type increasing mode;
s3, combining the first voice predicted value and the first accompaniment predicted value with a time frequency mask to obtain a second voice predicted value after the time frequency mask and a second accompaniment predicted value after the time frequency mask; respectively combining the second voice predicted value and the second accompaniment predicted value with the phase spectrum of the original mixed signal, and respectively obtaining a predicted voice signal and a predicted accompaniment signal through inverse Fourier transform;
wherein the stacked hourglass network further comprises an initial convolution module consisting of five consecutive convolution layers, arranged before the four hourglass modules, which does not change the size of the input image and only increases its number of output channels;
the four hourglass modules are fourth-order hourglass modules, and within each module the input spectrogram undergoes four consecutive downsampling operations, each halving its resolution;
and the number of output channels is kept unchanged during the first convolution of each hourglass module's downsampling path, so the feature information of the original mixture magnitude spectrum is learned at a 1:1 channel ratio; after the first convolution, 128 channels are added at each stage, making the encoder output channel sizes in each hourglass module 384, 512, 640 and 768 in turn.
2. The music source separation method based on the stacked hourglass network of claim 1, wherein an attention layer is further placed after the convolution layers in each hourglass module, and batch normalization and a Leaky ReLU activation function are further applied in each convolution layer to improve backward gradient propagation and parameter updates.
3. The music source separation method based on the stacked hourglass network of claim 1, wherein, in the downsampling of the encoding part and the upsampling of the decoding part of each hourglass module, the convolution kernel size of all convolution layers is 3x3.
4. The music source separation method based on the stacked hourglass network of claim 1, wherein the hourglass module adopts the L1 norm between the real spectrogram and the predicted spectrogram as the loss function, specifically: given an input spectrogram $X$, the $i$-th real music source $Y_i$, and the mask $\hat{M}_i^j$ generated for the $i$-th music source by the $j$-th hourglass module, the loss for the $i$-th source is defined as:

$D_i^j = \lVert \hat{M}_i^j \odot X - Y_i \rVert_1$

where $\odot$ denotes element-wise multiplication and the L1 norm is the sum of the absolute values of the matrix elements.
5. The music source separation method based on the stacked hourglass network of claim 4, wherein the total loss function of the stacked hourglass network is:

$D = \sum_{j=1}^{4} \sum_{i=1}^{C} D_i^j$

where $C$ is the number of sources the network is to separate.
6. The music source separation method based on the stacked hourglass network of claim 1, wherein the second vocal prediction value and the second accompaniment prediction value are calculated as:

$\tilde{y}_v = \hat{m} \odot x_t$
$\tilde{y}_a = (1 - \hat{m}) \odot x_t$

where $\odot$ denotes element-wise multiplication, $\tilde{y}_v$ and $\tilde{y}_a$ are the second vocal prediction value and the second accompaniment prediction value respectively, $x_t$ is the magnitude spectrum of the original mixed signal, and $\hat{m}$ is the time-frequency mask,

$\hat{m} = \dfrac{|\hat{y}_v|}{|\hat{y}_v| + |\hat{y}_a|}$

where $\hat{y}_v$ and $\hat{y}_a$ are the first vocal prediction value and the first accompaniment prediction value respectively.
CN202011118473.3A 2020-10-19 2020-10-19 Music source separation method based on stacked hourglass network Active CN112259119B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011118473.3A CN112259119B (en) 2020-10-19 2020-10-19 Music source separation method based on stacked hourglass network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011118473.3A CN112259119B (en) 2020-10-19 2020-10-19 Music source separation method based on stacked hourglass network

Publications (2)

Publication Number Publication Date
CN112259119A CN112259119A (en) 2021-01-22
CN112259119B true CN112259119B (en) 2021-11-16

Family

ID=74244866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011118473.3A Active CN112259119B (en) 2020-10-19 2020-10-19 Music source separation method based on stacked hourglass network

Country Status (1)

Country Link
CN (1) CN112259119B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113129920B (en) * 2021-04-15 2021-08-17 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Music and human voice separation method based on U-shaped network and audio fingerprint
CN113241092A (en) * 2021-06-15 2021-08-10 新疆大学 Sound source separation method based on double-attention mechanism and multi-stage hybrid convolution network
CN114492522B (en) * 2022-01-24 2023-04-28 四川大学 Automatic modulation classification method based on improved stacked hourglass neural network

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108038471A (en) * 2017-12-27 2018-05-15 哈尔滨工程大学 A kind of underwater sound communication signal type Identification method based on depth learning technology
CN109815893A (en) * 2019-01-23 2019-05-28 中山大学 The normalized method in colorized face images illumination domain of confrontation network is generated based on circulation

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104464727B (en) * 2014-12-11 2018-02-09 福州大学 A kind of song separation method of the single channel music based on depth belief network
US10679046B1 (en) * 2016-11-29 2020-06-09 MAX-PLANCK-Gesellschaft zur Förderung der Wissenschaften e.V. Machine learning systems and methods of estimating body shape from images
CN108229490B (en) * 2017-02-23 2021-01-05 北京市商汤科技开发有限公司 Key point detection method, neural network training method, device and electronic equipment
CN108427927B (en) * 2018-03-16 2020-11-27 深圳市商汤科技有限公司 Object re-recognition method and apparatus, electronic device, program, and storage medium
CN108710830B (en) * 2018-04-20 2020-08-28 浙江工商大学 Human body 3D posture estimation method combining dense connection attention pyramid residual error network and isometric limitation
CN109376571B (en) * 2018-08-03 2022-04-08 西安电子科技大学 Human body posture estimation method based on deformation convolution
CN109409288B (en) * 2018-10-25 2022-02-01 北京市商汤科技开发有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN109830245B (en) * 2019-01-02 2021-03-12 北京大学 Multi-speaker voice separation method and system based on beam forming
CN110085251B (en) * 2019-04-26 2021-06-25 腾讯音乐娱乐科技(深圳)有限公司 Human voice extraction method, human voice extraction device and related products
CN110097895B (en) * 2019-05-14 2021-03-16 腾讯音乐娱乐科技(深圳)有限公司 Pure music detection method, pure music detection device and storage medium
CN110136067B (en) * 2019-05-27 2022-09-06 商丘师范学院 Real-time image generation method for super-resolution B-mode ultrasound image
CN110458001A (en) * 2019-06-28 2019-11-15 南昌大学 A kind of convolutional neural networks gaze estimation method and system based on attention mechanism
CN110503976B (en) * 2019-08-15 2021-11-23 广州方硅信息技术有限公司 Audio separation method and device, electronic equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108038471A (en) * 2017-12-27 2018-05-15 哈尔滨工程大学 A kind of underwater sound communication signal type Identification method based on depth learning technology
CN109815893A (en) * 2019-01-23 2019-05-28 中山大学 The normalized method in colorized face images illumination domain of confrontation network is generated based on circulation

Also Published As

Publication number Publication date
CN112259119A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
CN112259119B (en) Music source separation method based on stacked hourglass network
CN107680611B (en) Single-channel sound separation method based on convolutional neural network
CN108766419B (en) Abnormal voice distinguishing method based on deep learning
CN107703486B (en) Sound source positioning method based on convolutional neural network CNN
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN105957537B (en) One kind being based on L1/2The speech de-noising method and system of sparse constraint convolution Non-negative Matrix Factorization
CN108172238A (en) A kind of voice enhancement algorithm based on multiple convolutional neural networks in speech recognition system
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN113314140A (en) Sound source separation algorithm of end-to-end time domain multi-scale convolutional neural network
CN112331216A (en) Speaker recognition system and method based on composite acoustic features and low-rank decomposition TDNN
CN113053407B (en) Single-channel voice separation method and system for multiple speakers
Guzhov et al. Esresne (x) t-fbsp: Learning robust time-frequency transformation of audio
CN112259120A (en) Single-channel human voice and background voice separation method based on convolution cyclic neural network
CN112633175A (en) Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment
CN114446314A (en) Voice enhancement method for deeply generating confrontation network
CN113241092A (en) Sound source separation method based on double-attention mechanism and multi-stage hybrid convolution network
CN115101085A (en) Multi-speaker time-domain voice separation method for enhancing external attention through convolution
Qi et al. Exploring deep hybrid tensor-to-vector network architectures for regression based speech enhancement
Fan et al. Utterance-level permutation invariant training with discriminative learning for single channel speech separation
CN116246644A (en) Lightweight speech enhancement system based on noise classification
Jin et al. Speech separation and emotion recognition for multi-speaker scenarios
CN114283829A (en) Voice enhancement method based on dynamic gate control convolution cyclic network
CN116863965A (en) Improved pathological voice generation model and construction method thereof
CN116013339A (en) Single-channel voice enhancement method based on improved CRN
CN113282718B (en) Language identification method and system based on self-adaptive center anchor

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210203

Address after: No. 1418, 14th floor, building 1, No. 1166, Tianfu 3rd Street, Chengdu hi tech Zone, China (Sichuan) pilot Free Trade Zone, Chengdu, Sichuan 610000

Applicant after: Chengdu Yuejian Technology Co.,Ltd.

Address before: 610000 Chengdu, Sichuan, Shuangliu District, Dongsheng Street, long bridge 6, 129, 1 units, 9 level 902.

Applicant before: CHENGDU MINGJIE TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20211026

Address after: 518000 1201 Jiayu building, Hongxing community, Songgang street, Bao'an District, Shenzhen City, Guangdong Province

Applicant after: Shenzhen Cehui Technology Co., Ltd

Address before: No. 1418, 14th floor, building 1, No. 1166, Tianfu 3rd Street, Chengdu hi tech Zone, China (Sichuan) pilot Free Trade Zone, Chengdu, Sichuan 610000

Applicant before: Chengdu Yuejian Technology Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant