KR20120130908A - Apparatus for separating vocal signal - Google Patents
Apparatus for separating vocal signal
- Publication number
- KR20120130908A (application number KR1020110048969A)
- Authority
- KR
- South Korea
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S1/00—Two-channel systems
- H04S1/007—Two-channel systems in which the audio signals are in digital form
Abstract
Description
The present invention relates to an apparatus and method for separating a speech signal from a music signal, and more particularly, to an apparatus and method for separating a vocal signal from stereo polyphonic music.
Stereo music generally consists of various instrument sounds together with vocals. Techniques for separating the vocal signal from stereo music can be applied in many fields, such as karaoke using vocal-free accompaniment, music mood control, automatic sheet-music generation, singer/album identification, and automatic lyrics generation. For this reason, various separation techniques have been proposed; among them, the sinusoidal-model method, the center-channel extraction method, and the non-negative matrix decomposition method are noteworthy.
The sinusoidal-model method estimates the parameters (magnitude, frequency, phase, and so on) of the sinusoidal components that make up vocal and instrument notes, and separates the vocal by using vocal feature factors to decide whether each sinusoid belongs to the vocal signal. Its drawback is that estimating and tracking the sinusoidal parameters while several instruments are mixed is very complicated.
The center-channel extraction method exploits the panning effect used when recording stereo music, in which each instrument sound is placed in a virtual space to increase the sense of space; the vocal is usually placed at the center. The method calculates the magnitude difference between the channels in the short-time Fourier transform (STFT) domain (and the phase difference if necessary); ideally this difference is zero for the vocal signal. The vocal can therefore be separated by treating time-frequency samples whose difference falls within a certain range as vocal. However, instruments other than the vocal may also be panned to the center, so separation performance varies with the music. In addition, because a binary decision is applied to each sample, the separated vocal can suffer severe distortion of sound quality.
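The binary-mask variant described above can be sketched as follows: keep only the STFT bins whose inter-channel magnitude ratio is close to one and whose phase difference is close to zero, then average the two channels on those bins. The function name and both tolerance values are illustrative assumptions, not values from the patent.

```python
import numpy as np

def extract_center(XL, XR, ratio_tol=0.2, phase_tol=0.3):
    """Estimate the center-panned component from left/right STFT matrices.

    XL, XR: complex STFT matrices (freq x time). Bins whose inter-channel
    magnitude ratio is near 1 and whose phase difference is near 0 are
    kept; all other bins are zeroed (the binary decision described in the
    text). Both tolerances are illustrative, not from the patent.
    """
    eps = 1e-12
    ratio = np.abs(XL) / (np.abs(XR) + eps)      # magnitude ratio per bin
    phase = np.angle(XL * np.conj(XR))           # phase difference per bin
    mask = (np.abs(ratio - 1.0) < ratio_tol) & (np.abs(phase) < phase_tol)
    # Averaging both channels on the selected bins approximates the center.
    return 0.5 * (XL + XR) * mask
```

The hard zero/one mask is exactly what causes the sound-quality distortion mentioned above; the soft Wiener-style gain used later in this document replaces it.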
Non-negative matrix decomposition can be applied to mono music as well as stereo music. The magnitude spectrogram in the STFT domain is regarded as a non-negative matrix and decomposed into two non-negative matrices: one containing the harmonic components and one containing the change of their magnitudes over time. The vocal signal is separated by deciding whether each separated harmonic component originates from the vocal. The disadvantage of non-negative matrix decomposition is that separation performance degrades when the vocal signal is relatively weak in the music.
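For orientation, a generic non-negative matrix decomposition with the classic Lee-Seung multiplicative updates (Euclidean cost) looks like the sketch below. Note this is only a minimal illustration of the technique; the decomposition used later in this document minimizes an Itakura-Saito cost instead.

```python
import numpy as np

def nmf(V, k, n_iter=500, seed=0):
    """Decompose a nonnegative matrix V (freq x time) into W (freq x k)
    and H (k x time) with Lee-Seung multiplicative updates for the
    Euclidean cost. A generic sketch, not the patent's own variant."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, k)) + 1e-3   # strictly positive initialization
    H = rng.random((k, T)) + 1e-3
    for _ in range(n_iter):
        # Multiplicative updates preserve non-negativity automatically.
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)
    return W, H
```

On a spectrogram, the columns of W act as spectral templates and the rows of H as their time activations, which is what the harmonic/temporal split above refers to.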
SUMMARY OF THE INVENTION. The present invention has been made to solve the above problems. By separating the center-channel signal from stereo music and then performing fundamental-frequency estimation, vocal determination, and voiced/unvoiced determination on the separated signal, an object of the present invention is to provide a speech signal separation apparatus and method that separates vocal signals with better sound quality than applying the non-negative matrix decomposition technique alone.
To achieve this object, a first embodiment of the present invention comprises: a voice feature estimator for estimating features of the voice signal included in an input music signal; a contribution calculator for calculating the contribution of the voice signal to the music signal using the estimated features; and a voice signal separator for separating the voice signal from the music signal with a filter gain for the voice signal based on the calculated contribution.
Preferably, the voice feature estimator may include: a channel separator configured to separate a channel signal related to the voice signal from the music signal by analyzing a panning reflected in the music signal; A type determination unit for dividing the separated channel signal into frame units and determining a signal type for each frame; And a frequency estimator for estimating a frequency component of a main melody using the separated channel signal.
Preferably, the voice signal separation unit comprises: a hangover calculator configured to calculate a hangover using a result according to the determination; A filter gain calculator configured to calculate a filter gain for the speech signal based on the calculated contribution and the calculated hangover; And a separator for separating the voice signal from the music signal based on the calculated gain.
Preferably, the voice signal separation apparatus further comprises: a first signal converter for converting the music signal into a frequency-domain signal to enable the estimation; a second signal converter which determines the form of the music signal and converts it into a digital signal when the music signal is an analog signal; and a third signal converter for converting the separated voice signal into a time-domain signal.
Preferably, the channel separator comprises: a channel component calculator for calculating the magnitude ratio and phase difference between the channel signals included in the music signal; a panning calculator configured to calculate the panning using the calculated magnitude ratio and phase difference; and a component extractor for extracting, as the channel signal, the component of the music signal related to the calculated panning.
Preferably, the frequency estimator comprises: a first non-negative matrix decomposition unit for decomposing the spectrogram of the separated channel signal into non-negative matrices; and a component tracking unit which performs the estimation by temporally tracking the component with the largest magnitude in the decomposed non-negative matrix.
Preferably, the type determination unit determines, as the signal type of each frame, whether a vocal component is detected, using feature factors associated with the channel signal; and, when a vocal component is detected, it determines whether the component is a voiced or an unvoiced component.
The present invention also provides a second embodiment, comprising: a speech feature estimating step of estimating a feature of a speech signal included in an input music signal; A contribution calculation step of calculating a contribution of the speech signal to the music signal using the estimated feature; And a voice signal separation step of separating the voice signal from the music signal with a filter gain for the voice signal based on the calculated contribution.
Preferably, the voice feature estimating step may include: a channel separation step of analyzing a panning reflected in the music signal to separate a channel signal related to the voice signal from the music signal; A type determination step of dividing the separated channel signal into frame units to determine a signal type for each frame; And estimating a frequency component of a main melody using the separated channel signal.
Preferably, the speech signal separation step includes: a hangover calculation step of calculating a hangover using a result according to the determination; A filter gain calculating step of calculating a filter gain for the speech signal based on the calculated contribution and the calculated hangover; And a separation step of separating the voice signal from the music signal based on the calculated gain.
Preferably, the voice signal separation method includes: a first signal conversion step of converting the music signal into a frequency-domain signal to enable the estimation; a second signal conversion step of determining the form of the music signal and converting it into a digital signal when it is an analog signal; and a third signal conversion step of converting the separated speech signal into a time-domain signal.
Preferably, the channel separation step includes: a channel component calculation step of calculating the magnitude ratio and phase difference between the channel signals included in the music signal; a panning calculation step of calculating the panning using the calculated magnitude ratio and phase difference; and a component extraction step of extracting, as the channel signal, the component of the music signal related to the calculated panning.
Preferably, the frequency estimating step comprises: a first non-negative matrix decomposition step of decomposing the spectrogram of the separated channel signal into non-negative matrices; and a component tracking step of performing the estimation by temporally tracking the largest component in the decomposed nonnegative matrix.
Preferably, in the type determining step, whether a vocal component is detected is determined for each frame as the signal type, using feature factors associated with the channel signal; and, when a vocal component is detected, it is determined whether the vocal component is voiced or unvoiced.
The present invention also provides a third embodiment, comprising: a module for receiving stereo music and performing a short-time Fourier transform; a module for separating the center channel component from the converted signal using the panning in the converted signal; a module for calculating the MFCC, panning index, and zero-crossing rate for each frame to determine the signal type of each frame; a module for decomposing the spectrogram into non-negative matrices using a non-negative matrix decomposition technique; a module for estimating the fundamental frequency of the main melody from the decomposed non-negative matrix; a module for calculating the contribution of the vocal component included in the mixed signal by applying the estimated fundamental frequency and the signal type determination as constraints to the non-negative matrix decomposition; a module for calculating a hangover using the signal type determination; a module for separating the vocal signal by calculating the Wiener filter gain from the hangover and the vocal contribution; and a module for converting the separated frequency-domain signal into a time-domain vocal signal using an inverse short-time Fourier transform.
Preferably, the signal type determination module calculates the MFCC, panning index, and zero-crossing rate frame by frame from the center-channel-separated signal, and determines each frame signal as silent, unvoiced vocal, or voiced vocal.
Preferably, the module for decomposing the spectrogram into non-negative matrices applies a non-negative matrix decomposition based on a source-filter model to the spectrogram of the center-channel-separated signal and decomposes it into non-negative matrices.
Preferably, the module for estimating the fundamental frequency of the main melody performs the estimation by temporally tracking the largest component in the non-negative matrix obtained from the decomposition.
Preferably, the module for calculating the contribution of the vocal component included in the mixed signal applies the estimated fundamental frequency and the signal type determination as constraints to the non-negative matrix decomposition: using the estimated fundamental frequency of the main melody, the non-negative matrix is modified as shown in Equation 22 and used as an initial value; a DC term is added; and the W K calculated in the non-negative matrix decomposition module is used as the initial value. Preferably, the Wiener filter module calculates the contribution of the vocal signal from the decomposed non-negative matrices and separates the vocal signal with the resulting filter gain.
Advantageously, the inverse STFT module converts the separated frequency-domain vocal signal into the time domain.
According to the present invention, the following effects can be obtained. First, the separation of the center channel signal component reduces the distortion of the separated signal while sufficiently removing the instrumental sounds other than the vocal, thereby improving the estimation performance of the fundamental frequency of the main melody obtained by the non-negative matrix decomposition technique. Second, by applying the vocal discrimination function, improved vocal signal separation performance can be obtained in various stereo music.
FIG. 1 is a block diagram schematically illustrating an apparatus for separating voice signals according to a preferred embodiment of the present invention.
FIGS. 2 and 3 are block diagrams showing the internal configuration of the voice signal separation apparatus according to the present embodiment in detail.
FIG. 4 is a block diagram illustrating the signal flow of a vocal signal separation apparatus according to an embodiment of the present invention.
FIG. 5 is a block diagram of a computer system implementing the vocal signal separation apparatus according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating each step of the vocal signal separation method according to an embodiment of the present invention.
FIG. 7 compares vocal signal separation results according to an embodiment of the present invention.
FIG. 8 is a flowchart schematically illustrating a voice signal separation method according to a preferred embodiment of the present invention.
Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the drawings, the same reference numerals are used to designate the same or similar components throughout the drawings. In the following description of the present invention, a detailed description of known functions and configurations incorporated herein will be omitted when it may make the subject matter of the present invention rather unclear. In addition, the preferred embodiments of the present invention will be described below, but it is needless to say that the technical idea of the present invention is not limited thereto and can be variously modified by those skilled in the art.
FIG. 1 is a block diagram schematically illustrating an apparatus for separating voice signals according to a preferred embodiment of the present invention. FIGS. 2 and 3 are block diagrams showing the internal configuration of the voice signal separation apparatus in detail. The following description refers to FIGS. 1 to 3.
The speech signal separation apparatus estimates the features of the voice signal included in an input music signal, calculates the contribution of the voice signal to the music signal using the estimated features, and separates the voice signal from the music signal with a filter gain based on the calculated contribution. The voice feature estimator includes a channel separator, a type determiner, and a frequency estimator. The channel separator analyzes the panning reflected in the music signal and separates the channel signal related to the voice signal from the music signal. To this end, it calculates the magnitude ratio and phase difference between the channel signals, calculates the panning from them, and extracts the component related to the calculated panning as the channel signal.
The type determiner 112 divides the separated channel signal into frame units and determines a signal type for each frame. The type determiner 112 determines, as the signal type of each frame, whether a vocal component is detected, using feature factors associated with the channel signal; when a vocal component is detected, it determines whether the component is voiced or unvoiced. Accordingly, the type determiner 112 classifies each frame signal as a voiced vocal signal, an unvoiced vocal signal, or a silent signal. When determining the signal type, the type determiner 112 may use Mel-frequency cepstral coefficients (MFCC), linear predictive coefficients (LPC), the panning index, the zero-crossing rate, and the like.
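A toy version of such a three-way frame decision, using only energy and the zero-crossing rate, might look like the following; both thresholds are illustrative assumptions, and a real implementation would add MFCC/panning-index features and a trained classifier as described above.

```python
import numpy as np

def classify_frame(frame, energy_thresh=1e-3, zcr_thresh=0.3):
    """Toy silent / unvoiced-vocal / voiced-vocal decision for one frame.

    Low energy -> silent; otherwise a high zero-crossing rate (noise-like
    signal) -> unvoiced, else voiced. Thresholds are illustrative only.
    """
    energy = np.mean(frame ** 2)
    if energy < energy_thresh:
        return "silent"
    s = np.sign(frame)
    s[s == 0] = 1                       # treat exact zeros as positive
    zcr = np.mean(s[1:] != s[:-1])      # fraction of sign changes
    return "unvoiced" if zcr > zcr_thresh else "voiced"
```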
The frequency estimator estimates the frequency components of the main melody using the separated channel signal. It decomposes the spectrogram of the separated channel signal into non-negative matrices and performs the estimation by temporally tracking the component with the largest magnitude in the decomposed non-negative matrix.
The contribution calculator 120 calculates a contribution of the voice signal to the music signal using the estimated feature of the voice signal.
The voice signal separator comprises a hangover calculator, a filter gain calculator, and a separator.
The hangover calculator 131 calculates a hangover by using the determination result of the type determiner 112. The hangover calculator 131 computes a hangover function that exponentially attenuates the signal for frames in which no vocal component is detected. The filter gain calculator 132 calculates a filter gain for the speech signal based on the calculated contribution and the calculated hangover. The separator then separates the voice signal from the music signal based on the calculated gain.
The power supply unit 140 supplies power to each unit constituting the voice signal separation apparatus.
The
The voice signal separation apparatus may further include a first signal converter, a second signal converter, and a third signal converter.
The first signal converter 161 converts the music signal into a frequency-domain signal so that the features of the voice signal can be estimated; it does so using a short-time Fourier transform (STFT). The second signal converter 162 determines the form of the music signal and converts it into a digital signal when the music signal is analog. The third signal converter 163 converts the separated voice signal into a time-domain signal using an inverse short-time Fourier transform (ISTFT).
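A minimal numpy sketch of the first and third converters (STFT analysis with a periodic Hann window, and inverse STFT by overlap-add) is shown below; the frame length and hop size are illustrative assumptions, not the patent's values.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Analysis: frame the signal with a periodic Hann window and take the
    FFT of each frame. Returns a (freq x time) complex matrix."""
    win = np.hanning(n_fft + 1)[:-1]                 # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def istft(X, n_fft=512, hop=256):
    """Synthesis: inverse FFT per frame and overlap-add. With 50% overlap
    the periodic Hann window sums to 1, so interior samples need no
    extra scaling (frame edges of the signal remain attenuated)."""
    frames = np.fft.irfft(X.T, n=n_fft, axis=1)
    out = np.zeros((frames.shape[0] - 1) * hop + n_fft)
    for i, f in enumerate(frames):
        out[i * hop : i * hop + n_fft] += f
    return out
```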
Next, an embodiment of the voice signal separation apparatus of FIG. 1 will be described. FIG. 4 is a block diagram illustrating the signal flow of the vocal signal separation apparatus as an embodiment of the voice signal separation apparatus of FIG. 1.
The vocal signal separation apparatus of FIG. 4 consists of the modules enumerated in the third embodiment above.
In FIG. 4, the left and right channel signals of stereo music are first transformed into frequency-domain spectra in the short-time Fourier transform (STFT) module.
The center channel separation module then separates the center-panned component from the converted signal using the panning information.
The stereo music signal processed by the center channel separation module is passed to the subsequent modules.
The fundamental frequency estimation module estimates the fundamental frequency of the main melody from the decomposed non-negative matrix.
The signal type determination module determines each frame as silent, unvoiced vocal, or voiced vocal.
The contribution calculation module calculates the contribution of the vocal component included in the mixed signal by applying the estimated fundamental frequency and the signal type determination as constraints to the non-negative matrix decomposition.
The hangover calculation module calculates a hangover using the signal type determination, and the Wiener filter module calculates the filter gain from the hangover and the vocal contribution to separate the vocal signal.
The inverse STFT module converts the separated frequency-domain vocal signal into a time-domain vocal signal.
The modules described in FIG. 4 may be implemented in hardware or software. FIG. 5 shows a computer system implementing the vocal signal separation apparatus according to an embodiment of the present invention.
The vocal signal separation method according to FIG. 6 includes: receiving stereo music and performing STFT conversion; Calculating a magnitude difference and a phase difference between the channels in the converted music signal to separate the center channel component from the converted signal; Estimating the fundamental frequency of the main melody by applying a non-negative matrix decomposition technique to the center channel separation signal; Determining the signal for each frame as voiced vocal, unvoiced vocal, and silent; Calculating the contribution of the vocal component from the spectrogram of the STFT transformed stereo music; Calculating a hangover using the signal type determination result; Determining a gain of the Wiener filter using the contribution of the vocal component and the hangover; Separating vocal signal components included in each channel of stereo music; And converting the separated frequency domain vocal signal into a time domain vocal signal by applying an inverse STFT.
Using FIG. 6, which is a flowchart illustrating each step of the vocal separation method according to an embodiment of the present invention, each module of FIG. 4 is described in detail as follows.
The left channel signal x L (n) (right channel signal x R (n)) of stereo music consists of the left channel vocal signal v L (n) (right channel vocal signal v R (n)) plus the left channel accompaniment signal a L (n) (right channel accompaniment signal a R (n)). In the STFT step (S201), each channel signal is converted into the frequency domain.
Where f is the frequency index, t is the frame index, and X L (f, t), V L (f, t), A L (f, t) (or X R (f, t), V R (f , t), A R (f, t)) are STFT transform representations of left (or right) channel mixed, vocal, and accompaniment signals, respectively.
Center Channel Separation Steps
The first step of vocal feature estimation is the center channel separation step (S202). In this step, it is assumed that the vocal signal is panned to the center and the remaining instrument sounds are panned to positions other than the center, and the panning component is extracted in each (f, t) region. Panning is obtained by calculating the magnitude ratio and the phase difference of the two channels in each (f, t) region; a logarithmic magnitude ratio may also be applied. First, the magnitude ratio and the phase difference are defined, and then, in each (f, t), the spectrum is assumed to follow a two-component Gaussian mixture model:
Where N (x (f, t) | μ k , Σ k ) is a Gaussian probability density function with mean of μ k and covariance of Σ k (k = 1,2). π k (f, t), k = 1,2 satisfies 0 ≦ π k ≦ 1 and π 1 + π 2 = 1 as the mixing ratio of two random variables at (f, t). A method of estimating the parameter {μ k , Σ k } is to apply a conventional Expectation-Maximization (EM) algorithm that maximizes the Log-Likelihood Function (Equation 4).
The EM algorithm first initializes the parameters and then optimizes the log-likelihood function by alternating the E-step, which computes the posterior probabilities of the vocal/accompaniment mixing using the currently estimated parameters, and the M-step, which re-estimates θ = {μ k , Σ k } from the current posterior probabilities, until the log-likelihood function converges. The posterior probability can be interpreted as a panning value at each (f, t). When the log-likelihood function has converged, the vocal signal is estimated by selecting the component k = k* having the smaller Σ k , as shown in Equation 6. On the other hand, the posterior probability is as shown in the following equation.
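A compact sketch of such an EM loop for a two-component one-dimensional Gaussian mixture (for example, over per-bin panning values) is shown below; the percentile-based initialization is an assumption of this sketch, not the patent's.

```python
import numpy as np

def em_gmm2(x, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture to the samples x by EM.

    E-step: posterior responsibility of each component for each sample.
    M-step: re-estimate means, variances, and mixing weights.
    Initialization via a percentile split is illustrative only.
    """
    mu = np.percentile(x, [25, 75]).astype(float)
    var = np.array([np.var(x), np.var(x)]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: weighted Gaussian densities, normalized per sample
        p = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
            / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: closed-form updates from the responsibilities
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-9
        pi = nk / len(x)
    return mu, var, pi
```

In the patent's setting, the component with the smaller variance would be taken as the center-panned (vocal) cluster, as Equation 6 describes.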
In this case, the bass portion included in the mixed music signal should be excluded by appropriately limiting the frequency range of the vocal signal. Inverse STFT conversion of the estimated vocal signal in the frequency domain yields the vocal signals x L v (m, t) and x R v (m, t) in the time domain.
Other parameters that are calculated in the center channel separation step are Mel Frequency Cepstrum Coefficients (MFCC), panning index, and zero-crossing rate required for vocal discrimination. In this example, 24 MFCC coefficients were calculated by applying a filter bank of 40 channels. In order to apply a panning index to determine whether the frame signal is vocal, a limited frequency range is calculated as in Equation (7).
The frequency range may vary with the music, but a range of approximately 300 Hz to 10 kHz is appropriate. The zero-crossing rate is calculated as shown in the following equation.
where m = 0, 1, … is the sample index in frame t.
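The zero-crossing rate itself is a one-liner; the sketch below uses the conventional definition (fraction of adjacent sample pairs with differing sign), which is assumed to match the equation referenced above.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive-sample pairs whose signs differ.

    Unvoiced (noise-like) frames give markedly higher values than voiced
    frames, which is why the rate serves as a voiced/unvoiced cue."""
    s = np.sign(frame)
    s[s == 0] = 1            # treat exact zeros as positive
    return np.mean(s[1:] != s[:-1])
```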
Vocal Harmonic Component Extraction with Non-negative Matrix Decomposition
When the center channel component separation step is completed, the separated signal is input to the vocal harmonic component extraction step (S203), which applies the non-negative matrix decomposition technique. Only one channel signal could be used at this point, but here the average of the two channels is used. Although x v (m, t) is the component separated as center-panned from the mixed music signal, instrument sound components still remain, so it is regarded as the sum of the vocal component x v v (m, t) and the instrument sound component x a v (m, t).
When the signal is STFT-converted, the power spectra of the vocal component and the instrumental sound component can be expressed by decomposing them into non-negative matrices, respectively, as shown in Equation (10).
where N F , N F0 , K, and T are the number of frequency bins, the number of fundamental frequencies, the number of filter shapes, and the number of time frames, respectively; the nonnegative matrix W V is N F × N F0 , H V is N F0 × T, W A is N F × K, and H A is K × T.
When the spectrum has a complex value and assumes that the real part and the imaginary part each follow a Gaussian probability distribution, the spectrum can be assumed to be a probability distribution as shown in Equation (11).
The vocal signal is assumed to follow a source-filter model. Given F 0 , the range of fundamental frequencies a person can utter, the source-filter model represents the signal as an excitation component containing the harmonic groups with fundamental frequency f 0 ∈ F 0 and a filter component representing the spectral shape. When the excitation and the filter are each decomposed into non-negative matrices, the model can be expressed as in Equation 12.
Here, the non-negative matrices W A and H A , together with the component matrices W F0 , H F0 , W K , and H K , constitute power spectra whose corresponding spectra are assumed to follow a complex Gaussian distribution N c (0, σ 2 ). The entire power spectrum is expressed by Equation 13.
The entire power spectrum represented by Equation 13 is likewise assumed to have a complex Gaussian probability distribution in polar coordinates.
W F0 is predetermined based on the vocal range of a person. The remaining parameters to be estimated are θ = {H F0 , W K , H K , W M , H M }. These parameters may be determined to satisfy equation (15).
Parameter estimation to maximize the log-likelihood function is equivalent to minimizing the following equation (16).
This expression is equivalent to the Itakura-Saito cost function, and non-negative matrix decomposition is one of the useful methods for computing the parameters that minimize it. The update formulas of the non-negative matrix decomposition algorithm have a multiplicative rather than additive form, which is known to ensure fast convergence while preserving non-negativity; they are derived as in Equation 17.
Here, .* and ./ denote element-wise multiplication and element-wise division, respectively. The non-negative matrices except W F0 are initialized to non-negative values using Gaussian random variables; that is, a Gaussian random variable x is generated and its absolute value |x| is taken. Starting from these initial values, the updates are iterated until the cost function of Equation 16 converges, updating only one parameter at a time. Normally, convergence is reached after about 50 iterations over each parameter.
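The multiplicative updates for the Itakura-Saito cost can be sketched as follows. The update forms follow the standard IS-NMF derivation and are assumed to match Equation 17 up to notation; the element-wise `.*` and `./` of the text become numpy's `*` and `/`.

```python
import numpy as np

def is_divergence(V, WH):
    """Itakura-Saito divergence d(V || WH), summed over all entries."""
    R = V / WH
    return np.sum(R - np.log(R) - 1.0)

def is_nmf_update(V, W, H, eps=1e-12):
    """One round of multiplicative updates minimizing the Itakura-Saito
    divergence between V and W @ H (the cost of Equation 16)."""
    WH = W @ H + eps
    H = H * (W.T @ (V / WH**2)) / (W.T @ (1.0 / WH) + eps)
    WH = W @ H + eps
    W = W * ((V / WH**2) @ H.T) / ((1.0 / WH) @ H.T + eps)
    return W, H
```

Because each factor is multiplied by a non-negative ratio, non-negativity is preserved automatically, which is the property the text highlights.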
Estimation of the fundamental frequency of the main melody
The parameter H F0 estimated in Equation 17 represents the change of the harmonic groups' magnitudes over time, and the component with the larger magnitude corresponds to the main melody of the music. The fundamental frequency of the main melody is estimated by applying the Viterbi algorithm, which temporally tracks the largest component in the calculated non-negative matrix H F0 (S204). The Viterbi algorithm maximizes the cost function of Equation 18.
Here, the transition probability of moving the fundamental frequency from f 0 i to f 0 j is expressed as a function of the distance between them, with β set to an appropriate value through experimentation; in general, β = 10 is appropriate. In addition, the normalized value of H F0 (k) is used, as expressed in Equation 19. The initial condition is set as shown in the following equation.
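A small sketch of such Viterbi tracking over the rows of H F0 is shown below. The distance-based transition penalty is an illustrative stand-in for the patent's transition probability, with `beta` playing the role of the β described above.

```python
import numpy as np

def track_melody(H, beta=10.0):
    """Viterbi tracking of the dominant fundamental-frequency row of H.

    H: non-negative matrix (n_f0 x n_frames). Per-frame score is log H;
    the transition penalty -beta*|i-j|/n_f0 discourages large pitch
    jumps (an illustrative form, not the patent's exact Equation 18).
    Returns the index of the tracked row per frame."""
    n_f0, n_t = H.shape
    logH = np.log(H + 1e-12)
    idx = np.arange(n_f0)
    trans = -beta * np.abs(idx[:, None] - idx[None, :]) / n_f0
    score = logH[:, 0].copy()
    back = np.zeros((n_f0, n_t), dtype=int)
    for t in range(1, n_t):
        cand = score[:, None] + trans        # cand[i, j]: predecessor i -> j
        back[:, t] = np.argmax(cand, axis=0)
        score = cand[back[:, t], idx] + logH[:, t]
    path = np.zeros(n_t, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(n_t - 1, 0, -1):          # backtrack the best path
        path[t - 1] = back[path[t], t]
    return path
```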
Vocal judgment
Feature factors used in vocal determination include Mel-frequency cepstral coefficients (MFCC), linear predictive coefficients (LPC), the fundamental frequency, the zero-crossing rate, energy, and the panning index. To determine whether a frame signal includes a vocal, a deterministic method such as a support vector machine (SVM) or a statistical method such as a hidden Markov model (HMM) or a likelihood function may be applied; these schemes provide nearly identical decision performance.
In one embodiment of the present invention, the MFCC and the panning index were used. In addition, to decide whether a vocal frame is voiced or unvoiced, the frame is determined to be unvoiced vocal if the zero-crossing rate calculated in step S202 exceeds a predetermined threshold. This is expressed as follows.
Calculation of the Vocal Component Contribution Using Non-negative Matrix Decomposition
If the vocal determination result and the estimated fundamental frequency information are applied to the non-negative matrix decomposition technique, the contribution of the vocal signal included in the music mixtures X L (f, t) and X R (f, t) can be estimated directly. First, a DC term is added to the corresponding non-negative matrix, which is modified as shown in Equation 22 and used as an initial value.
The nonnegative matrix W K uses the value estimated in step S203 as an initial value. The remaining nonnegative matrices H K , W M , and H M are initialized to non-negative numbers using Gaussian random variables in the same manner as in step S204. Using the initialized nonnegative matrices, Equation 17 is repeatedly calculated and converged.
Vocal Signal Separation Using the Wiener Filter
When the algorithm calculation of Equation 17 is completed, the contribution to the main melody component in the music mixture is calculated as follows.
This contribution is the gain of the Wiener filter. In order to remove a section in which the main melody is not a vocal, a hangover that is exponentially decreased as shown in Equation 24 is calculated from the time point t 0 at which the vocal ends using the result of the vocal determination decision step S206 (S207).
Herein, the
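The Wiener gain and hangover described above can be sketched as follows. This is a NumPy illustration under stated assumptions: the decay constant 0.5 is illustrative, since the exact form of Equation 24 is not reproduced in this text.

```python
import numpy as np

def wiener_gain(vocal_power, music_power, eps=1e-12):
    """Wiener-filter gain: ratio of the estimated vocal contribution
    to the total power in each time-frequency bin."""
    return vocal_power / (vocal_power + music_power + eps)

def apply_hangover(gain, vocal_active, decay=0.5):
    """Attenuate the gain exponentially after the vocal ends (t_0).
    `vocal_active` is the per-frame decision from step S206; the
    decay constant is an illustrative choice, not the patent's."""
    g = gain.copy()
    h = 0.0
    for t in range(g.shape[1]):
        h = 1.0 if vocal_active[t] else h * decay  # restart or decay
        g[:, t] *= h
    return g

# Flat gain over 5 frames; the vocal ends after frame 1 (t_0 = 2)
gain = np.ones((3, 5))
active = [True, True, False, False, False]
g = apply_hangover(gain, active, decay=0.5)
```

Multiplying the mixture spectrogram by the hangover-weighted gain suppresses non-vocal main-melody sections while letting vocal frames pass at full gain.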
The vocal signal is converted into signals v L (n) and v R (n) in the time domain using an inverse STFT.
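A minimal STFT round trip in SciPy illustrates the final conversion step; the window length and sampling rate here are illustrative, since the patent does not specify them.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
n = np.arange(fs)                      # one second of audio
x = np.sin(2 * np.pi * 440 * n / fs)   # stand-in for a separated vocal

# Forward STFT -> (the Wiener-filter gain would be applied to X here)
# -> inverse STFT back to the time domain
f, t, X = stft(x, fs=fs, nperseg=1024)
_, x_rec = istft(X, fs=fs, nperseg=1024)
x_rec = x_rec[:len(x)]
```

With the default Hann window at 50% overlap, the inverse STFT reconstructs the time-domain signal essentially exactly.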
FIG. 7 compares vocal signal separation results according to an embodiment of the present invention. 7a is a vocal signal, 7b is a music signal in which the vocal is mixed with accompaniment, 7c is the result of vocal signal separation using only center channel separation, 7d is the result using only non-negative matrix decomposition, and 7e is the result of the embodiment of the present invention, which combines center channel separation, signal-type determination, and non-negative matrix decomposition. As shown in FIG. 7, accompaniment still remains when only channel separation or only non-negative matrix decomposition is applied to the music mixture, whereas the combined approach separates the vocal signal with improved sound quality.
Next, a preferred voice signal separation method performed by the voice signal separation apparatus of FIG. 1 is described. FIG. 8 is a flowchart schematically illustrating a voice signal separation method according to a preferred embodiment of the present invention. The following description refers to FIG. 8.
First, a feature of the voice signal included in the input music signal is estimated (voice feature estimation step S800). The speech feature estimation step S800 may include a channel separation step, a type determination step, and a frequency estimation step.
In the channel separation step, the panning reflected in the music signal is analyzed to separate the channel signal related to the voice signal from the music signal. The channel separation step may include a channel component calculation step, a panning calculation step, and a component extraction step. In the channel component calculating step, the magnitude ratio and the phase difference between the channel signals included in the music signal are calculated. In the panning calculation step, panning is calculated using the calculated magnitude ratio and the phase difference. In the component extraction step, a component signal related to panning is extracted from the music signal as a channel signal using the calculated panning.
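The panning calculation above can be sketched with an Avendano-style panning index computed from the inter-channel spectra. This is an illustrative formulation of the same idea, not the patent's exact formula; the 0.95 center threshold is an assumption.

```python
import numpy as np

def panning_index(XL, XR, eps=1e-12):
    """Similarity-based panning index: 1 for a source panned dead
    center, falling toward 0 for hard-panned sources. The patent
    derives panning from the inter-channel magnitude ratio and
    phase difference; this sketch captures the same idea."""
    cross = np.abs(XL * np.conj(XR))
    return 2.0 * cross / (np.abs(XL) ** 2 + np.abs(XR) ** 2 + eps)

def center_mask(XL, XR, threshold=0.95):
    """Keep only time-frequency bins panned close to the center,
    where the vocal typically resides (threshold is illustrative)."""
    return panning_index(XL, XR) >= threshold

# Three bins: center-panned, left-heavy, right-heavy
XL = np.array([1.0 + 0j, 1.0 + 0j, 0.1 + 0j])
XR = np.array([1.0 + 0j, 0.2 + 0j, 1.0 + 0j])
```

Applying the mask to the mixture spectrogram extracts the center-panned component signal used as the channel signal in the following steps.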
In the type determination step, the separated channel signal is classified in units of frames to determine a signal type for each frame. More specifically, it is determined for each frame, using feature factors of the channel signal, whether a vocal component is present; when a vocal component is detected, it is further determined whether that component is voiced or unvoiced.
In the frequency estimating step, the frequency component of the main melody is estimated using the separated channel signal. The frequency estimation step may consist of a first non-negative matrix decomposition step and a component tracking step. In the first non-negative matrix decomposition step, the spectrogram of the separated channel signal is decomposed into a non-negative matrix. In the component tracking step, feature estimation of a speech signal is performed by temporally tracking the largest component in the decomposed nonnegative matrix.
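The component tracking step can be sketched as follows: per frame, the basis with the largest activation is selected and mapped to the frequency of its spectral peak. This is a simplified stand-in for the patent's temporal tracking; the toy bases and frequency grid are assumptions for the example.

```python
import numpy as np

def track_main_melody(W, H, freqs):
    """Per frame, pick the basis with the largest activation in H
    and report the frequency of that basis spectrum's peak."""
    dominant = np.argmax(H, axis=0)    # strongest component per frame
    peak_bins = np.argmax(W, axis=0)   # spectral peak bin of each basis
    return freqs[peak_bins[dominant]]

# Toy example: two bases peaking at 100 Hz and 300 Hz
freqs = np.array([0.0, 100.0, 200.0, 300.0])
W = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
H = np.array([[2.0, 0.1], [0.5, 3.0]])  # frame 0 -> basis 0, frame 1 -> basis 1
```

In practice one would also smooth the per-frame choice over time to avoid octave and component-switching errors.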
After the speech feature estimation step S800, the contribution of the speech signal to the music signal is calculated using the estimated features of the speech signal (contribution calculation step S810).
After the contribution calculation step S810, the speech signal is separated from the music signal by the filter gain for the speech signal based on the calculated contribution (voice signal separation step S820). The speech signal separation step S820 may include a hangover calculation step, a filter gain calculation step, and a separation step. In the hangover calculation step, a hangover is calculated by using the result according to the signal type determination. In the filter gain calculation step, the filter gain for the speech signal is calculated based on the calculated contribution and the calculated hangover. In the separating step, the speech signal is separated from the music signal based on the calculated gain.
The speech signal separation method may further include at least one of a first to a third signal conversion step. In the first signal conversion step, the music signal is converted into a frequency domain signal to enable speech feature estimation. In the second signal conversion step, the format of the music signal is checked, and the music signal is converted into a digital signal when it is an analog signal. In the third signal conversion step, the separated speech signal is converted back into a time domain signal.
The first signal conversion step and the second signal conversion step are performed before the speech feature estimation step. On the other hand, the third signal conversion step is performed after the speech signal separation step.
It will be apparent to those skilled in the art that various modifications, alterations, and substitutions are possible without departing from the scope and spirit of the invention as disclosed in the accompanying claims. Accordingly, the embodiments disclosed in the present invention and the accompanying drawings are intended not to limit but to describe the technical spirit of the present invention, and the scope of that technical spirit is not limited by these embodiments and drawings. The scope of protection of the present invention should be construed according to the following claims, and all technical ideas within the scope of their equivalents should be construed as falling within the scope of the present invention.
The present invention relates to an apparatus and method for separating a vocal signal from a stereo music signal, and may be applied to a music content service, a realistic / 3D broadcast service, and the like.
100: speech signal separation unit 110: speech feature estimation unit
111: channel separation unit 112: type determination unit
113: frequency estimation unit 120: contribution calculation unit
130: voice signal separation unit 131: hangover calculation unit
132: filter gain calculation unit 133: separation unit
161 to 163: first signal converter to third signal converter
171: channel component calculator 172: panning calculator
173: component extraction unit 181: first non-negative matrix decomposition unit
182: component tracking unit
Claims (1)
A voice signal separator for separating the voice signal from the music signal with a filter gain for the voice signal based on the calculated contribution
Voice signal separation device comprising a.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020110048969A KR20120130908A (en) | 2011-05-24 | 2011-05-24 | Apparatus for separating vocal signal |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20120130908A true KR20120130908A (en) | 2012-12-04 |
Family
ID=47514880
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020110048969A KR20120130908A (en) | 2011-05-24 | 2011-05-24 | Apparatus for separating vocal signal |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20120130908A (en) |
2011-05-24: KR application KR1020110048969A filed (status: not active, application discontinuation)
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20180079975A (en) * | 2017-01-03 | 2018-07-11 | 한국전자통신연구원 | Sound source separation method using spatial position of the sound source and non-negative matrix factorization and apparatus performing the method |
CN111968669A (en) * | 2020-07-28 | 2020-11-20 | 安徽大学 | Multi-element mixed sound signal separation method and device |
CN111968669B (en) * | 2020-07-28 | 2024-02-20 | 安徽大学 | Multi-element mixed sound signal separation method and device |
CN113393857A (en) * | 2021-06-10 | 2021-09-14 | 腾讯音乐娱乐科技(深圳)有限公司 | Method, device and medium for eliminating human voice of music signal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WITN | Withdrawal due to no request for examination |