KR20120130908A - Apparatus for separating vocal signal - Google Patents

Apparatus for separating vocal signal

Info

Publication number
KR20120130908A
Authority
KR
South Korea
Prior art keywords
signal
vocal
music
channel
voice signal
Prior art date
Application number
KR1020110048969A
Other languages
Korean (ko)
Inventor
장인선
김민제
백승권
강경옥
남승현
Original Assignee
한국전자통신연구원
배재대학교 산학협력단
Priority date
Filing date
Publication date
Application filed by 한국전자통신연구원, 배재대학교 산학협력단 filed Critical 한국전자통신연구원
Priority to KR1020110048969A priority Critical patent/KR20120130908A/en
Publication of KR20120130908A publication Critical patent/KR20120130908A/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S1/00 Two-channel systems
    • H04S1/007 Two-channel systems in which the audio signals are in digital form

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

PURPOSE: A voice signal separating device is provided that sufficiently removes the sounds of instruments other than the vocal, thereby improving the estimation of the fundamental frequency of the main melody obtained by the non-negative matrix factorization method. CONSTITUTION: A voice feature estimator (110) estimates a feature of the voice signal included in an input music signal. A contribution level calculator (120) calculates the contribution level of the voice signal to the music signal using the estimated feature. A voice signal separator (130) separates the voice signal from the music signal with a filter gain for the voice signal based on the calculated contribution level. [Reference numerals] (110) Voice feature estimator; (120) Contribution level calculator; (130) Voice signal separator; (140) Power supply unit; (150) Main control unit

Description

Apparatus for separating vocal signal

The present invention relates to an apparatus and method for separating a speech signal from a music signal. More particularly, the present invention relates to an apparatus and method for separating a vocal signal from stereo polyphonic music.

Stereo music generally consists of various musical instrument sounds together with vocals. Techniques for separating vocal signals from stereo music can be applied to fields such as karaoke using vocal-free accompaniment, music mood control, automatic sheet-music generation, singer/album identification, and automatic lyrics generation. For this reason, various separation techniques have been proposed; among them, the sinusoidal-model method, the center channel extraction method, and the non-negative matrix decomposition method are noteworthy.

The sinusoidal-model method estimates the parameters (magnitude, frequency, phase, etc.) of the sinusoidal components that make up the vocal and instrument notes, and separates the vocal by determining, with vocal feature factors, whether each sinusoid belongs to the vocal signal. Its problem is that estimating and tracking the sinusoidal parameters while several instruments are mixed is very complicated.

The center channel extraction method exploits the panning effect used, when recording stereo music, to place each instrument sound at a proper position in virtual space in order to increase the sense of space; the vocal is usually placed at the center. Ideally, when the magnitude difference between the two channels (and, if necessary, the phase difference) is computed in the short-time Fourier transform (STFT) domain, it is zero for the vocal signal. Thus the vocal signal can be separated by treating time-frequency samples whose difference falls within a certain range as vocal. However, instruments other than the vocal can also be panned to the center, so separation performance varies with the music. In addition, since a binary decision is applied to the vocal determination, severe distortion of sound quality can occur.
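As a rough illustration of the center channel extraction idea above, a minimal numpy sketch (not the patent's method; the synthetic spectra and the 3 dB / 0.3 rad thresholds are invented for the demo) attributes time-frequency bins where the two channel spectra nearly coincide to the center-panned vocal:

```python
import numpy as np

# Synthetic two-channel spectrogram: a center-panned source appears
# identically in both channels; a left-only source appears only in XL.
rng = np.random.default_rng(0)
F, T = 64, 10
center = rng.normal(size=(F, T)) + 1j * rng.normal(size=(F, T))
side = rng.normal(size=(F, T)) + 1j * rng.normal(size=(F, T))
side[:, 5:] = 0                       # side instrument silent in the last frames
XL = center + side
XR = center.copy()

# Inter-channel magnitude ratio (dB) and phase difference per bin
eps = 1e-12
mag_ratio_db = 20 * np.log10((np.abs(XL) + eps) / (np.abs(XR) + eps))
phase_diff = np.angle(XL * np.conj(XR))

# Binary decision: bins with near-zero ratio and phase difference -> vocal
mask = (np.abs(mag_ratio_db) < 3.0) & (np.abs(phase_diff) < 0.3)
V_est = 0.5 * (XL + XR) * mask
```

Where the side source is silent the two channels match exactly, so the mask passes those bins; the hard binary mask is also what produces the sound-quality distortion the paragraph above mentions.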

Non-negative matrix decomposition can be applied to mono music as well as stereo music. The STFT spectrogram is regarded as a non-negative matrix and is decomposed into two non-negative matrices: one containing the harmonic components and one containing the change of each component's magnitude over time. The vocal signal can then be separated by determining whether each separated harmonic component originates from the vocal. The disadvantage of non-negative matrix decomposition is that separation performance degrades when the relative magnitude of the vocal signal in the music is weak.
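The two-matrix decomposition just described can be sketched with the classical Lee-Seung multiplicative updates (an illustrative, generic NMF, not the patent's constrained variant; the synthetic rank-4 spectrogram is invented for the demo):

```python
import numpy as np

# V ≈ W @ H with all entries >= 0: W holds spectral (harmonic) templates,
# H holds each template's magnitude over time.
rng = np.random.default_rng(1)
F, T, K = 30, 40, 4
V = rng.random((F, K)) @ rng.random((K, T))   # synthetic non-negative spectrogram

W = rng.random((F, K)) + 1e-3
H = rng.random((K, T)) + 1e-3
eps = 1e-12
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)      # multiplicative update for activations
    W *= (V @ H.T) / (W @ H @ H.T + eps)      # multiplicative update for bases

err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The multiplicative form keeps both factors non-negative at every step, which is what makes the spectrogram interpretation (additive parts, no cancellation) possible.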

SUMMARY OF THE INVENTION The present invention has been made to solve the above problems. An object of the present invention is to propose a speech signal separation apparatus and method that separates a center channel signal from stereo music, performs fundamental frequency estimation, vocal determination, and voiced/unvoiced determination on the center-channel-separated signal, and thereby separates vocal signals with better sound quality than applying the non-negative matrix decomposition technique alone.

To achieve the above object, the present invention provides, as a first embodiment, an apparatus comprising: a voice feature estimator for estimating a feature of the voice signal included in an input music signal; a contribution calculator for calculating a contribution of the voice signal to the music signal using the estimated feature; and a voice signal separator for separating the voice signal from the music signal with a filter gain for the voice signal based on the calculated contribution.

Preferably, the voice feature estimator may include: a channel separator configured to separate a channel signal related to the voice signal from the music signal by analyzing a panning reflected in the music signal; A type determination unit for dividing the separated channel signal into frame units and determining a signal type for each frame; And a frequency estimator for estimating a frequency component of a main melody using the separated channel signal.

Preferably, the voice signal separation unit comprises: a hangover calculator configured to calculate a hangover using a result according to the determination; A filter gain calculator configured to calculate a filter gain for the speech signal based on the calculated contribution and the calculated hangover; And a separator for separating the voice signal from the music signal based on the calculated gain.

Preferably, the voice signal separation apparatus further includes: a first signal converter for converting the music signal into a frequency domain signal to enable the estimation; a second signal converter which determines the form of the music signal and converts it into a digital signal when it is an analog signal; and a third signal converter converting the separated voice signal into a time domain signal.

Preferably, the channel separator comprises: a channel component calculator for calculating a magnitude ratio and a phase difference between the channel signals included in the music signal; a panning calculator configured to calculate the panning using the calculated magnitude ratio and phase difference; and a component extractor for extracting, as the channel signal, the component of the music signal related to the calculated panning.

Preferably, the frequency estimator comprises: a first non-negative matrix decomposition unit for decomposing the spectrogram of the separated channel signal into non-negative matrices; and a component tracking unit which performs the estimation by temporally tracking the component having the largest magnitude in the decomposed non-negative matrix.

Preferably, the type determination unit determines, as the signal type for each frame, whether a vocal component is detected, using feature factors calculated in the course of the channel separation and associated with the channel signal, and, when a vocal component is detected, determines whether that component is a voiced or an unvoiced component.

The present invention also provides a second embodiment, comprising: a speech feature estimating step of estimating a feature of a speech signal included in an input music signal; A contribution calculation step of calculating a contribution of the speech signal to the music signal using the estimated feature; And a voice signal separation step of separating the voice signal from the music signal with a filter gain for the voice signal based on the calculated contribution.

Preferably, the voice feature estimating step may include: a channel separation step of analyzing a panning reflected in the music signal to separate a channel signal related to the voice signal from the music signal; A type determination step of dividing the separated channel signal into frame units to determine a signal type for each frame; And estimating a frequency component of a main melody using the separated channel signal.

Preferably, the speech signal separation step includes: a hangover calculation step of calculating a hangover using a result according to the determination; A filter gain calculating step of calculating a filter gain for the speech signal based on the calculated contribution and the calculated hangover; And a separation step of separating the voice signal from the music signal based on the calculated gain.

Preferably, the voice signal separation method includes: a first signal conversion step of converting the music signal into a frequency domain signal to enable the estimation; a second signal conversion step of determining the form of the music signal and converting it into a digital signal when it is an analog signal; and a third signal conversion step of converting the separated speech signal into a time domain signal.

Preferably, the channel separation step may include: a channel component calculation step of calculating a magnitude ratio and a phase difference between the channel signals included in the music signal; a panning calculation step of calculating the panning using the calculated magnitude ratio and phase difference; and a step of extracting, as the channel signal, the component of the music signal related to the calculated panning.

Preferably, the frequency estimating step comprises: a first non-negative matrix decomposition step of decomposing the spectrogram of the separated channel signal into non-negative matrices; and a component tracking step of performing the estimation by temporally tracking the largest component in the decomposed non-negative matrix.

Preferably, the type determination step determines, as the signal type for each frame, whether a vocal component is detected, using feature factors calculated in the course of the channel separation and associated with the channel signal, and, when a vocal component is detected, determines whether it is voiced or unvoiced.

The present invention also provides, as a third embodiment, an apparatus comprising: a module for receiving stereo music and performing a short-time Fourier transform; a module for separating a center channel component from the transformed signal using the panning in that signal; a module for calculating the MFCC, panning index, and zero-crossing rate for each frame to determine the signal type for each frame; a module for decomposing the spectrogram into non-negative matrices using a non-negative matrix decomposition technique; a module for estimating the fundamental frequency of the main melody from the decomposed non-negative matrix; a module for calculating the contribution of the vocal component included in the mixed signal by applying the estimated fundamental frequency and the signal type determination as constraints to the non-negative matrix decomposition technique; a module for calculating a hangover using the signal type determination; a module for separating the vocal signal by calculating the Wiener filter gain from the hangover and the contribution of the vocal signal; and a module for converting the separated frequency domain signal into a time domain vocal signal using an inverse short-time Fourier transform.

Preferably, the signal type determination module determines each frame signal as silent, unvoiced vocal, or voiced vocal by calculating the MFCC, panning index, and zero-crossing rate frame by frame from the center-channel-separated signal.

Preferably, the module for decomposing a spectrogram into non-negative matrices applies a source-filter (excitation-filter) model-based non-negative matrix decomposition technique to the spectrogram of the center-channel-separated signal and decomposes it into non-negative matrices.

Preferably, the module for estimating the fundamental frequency of the main melody estimates it from the non-negative matrix produced by the non-negative matrix decomposition 1 module, using a Viterbi algorithm.

Preferably, the module for calculating the contribution of the vocal component included in the mixed signal applies the estimated fundamental frequency and the signal type determination as constraints to the non-negative matrix decomposition technique: using the estimated fundamental frequency of the main melody, the non-negative matrix is modified as shown in Equation 22 and used as an initial value; a DC term is added, and W_K calculated in the non-negative matrix decomposition 1 module is used as the initial value.

Preferably, the Wiener filter module calculates the contribution of the vocal signal from the non-negative matrices produced by the non-negative matrix decomposition 2 module as shown in Equation 23, calculates a hangover as shown in Equation 24, and calculates the Wiener filter gain as shown in Equation 25.

Advantageously, said inverse STFT transform module converts said separated frequency domain vocal signal into a time domain.

According to the present invention, the following effects can be obtained. First, separating the center channel signal component sufficiently removes instrument sounds other than the vocal while reducing distortion of the separated signal, thereby improving the estimation performance of the fundamental frequency of the main melody obtained by the non-negative matrix decomposition technique. Second, by applying the vocal determination function, improved vocal signal separation performance can be obtained across various stereo music.

FIG. 1 is a block diagram schematically illustrating an apparatus for separating voice signals according to a preferred embodiment of the present invention.
FIGS. 2 and 3 are block diagrams showing the internal configuration of the voice signal separation apparatus according to the present embodiment in detail.
FIG. 4 is a block diagram illustrating the signal flow of a vocal signal separation apparatus according to an embodiment of the present invention.
FIG. 5 is a block diagram of a computer system implementing the vocal signal separation apparatus according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating each step of the vocal signal separation method according to an embodiment of the present invention.
FIG. 7 compares vocal signal separation results according to an embodiment of the present invention.
FIG. 8 is a flowchart schematically illustrating a voice signal separation method according to a preferred embodiment of the present invention.

Hereinafter, exemplary embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the drawings, the same reference numerals are used to designate the same or similar components throughout the drawings. In the following description of the present invention, a detailed description of known functions and configurations incorporated herein will be omitted when it may make the subject matter of the present invention rather unclear. In addition, the preferred embodiments of the present invention will be described below, but it is needless to say that the technical idea of the present invention is not limited thereto and can be variously modified by those skilled in the art.

FIG. 1 is a block diagram schematically illustrating an apparatus for separating voice signals according to a preferred embodiment of the present invention, and FIGS. 2 and 3 are block diagrams showing its internal configuration in detail. The following description refers to FIGS. 1 to 3.

The speech signal separation apparatus 100 according to the present embodiment is a signal processing apparatus; more specifically, it is a vocal signal separator that separates the main melody of music from stereo music using a panning index and the non-negative matrix factorization (NMF) technique, and determines whether each separated frame includes a vocal in order to separate the vocal signal. According to FIG. 1, the voice signal separation apparatus 100 includes a contribution calculation unit 120, a voice signal separation unit 130, a power supply unit 140, and a main control unit 150. The voice signal separation apparatus 100 may further include a voice feature estimator 110.

The voice feature estimator 110 estimates a feature of the voice signal included in the input music signal. The voice feature estimator 110 may include a channel separator 111, a type determiner 112, and a frequency estimator 113 as illustrated in FIG. 2 (a).

The channel separator 111 separates the channel signal related to the voice signal from the music signal by analyzing the panning reflected in the music signal; that is, it separates the center channel component as the channel component related to the voice signal. In doing so, the channel separator 111 calculates the magnitude difference and the phase difference between the channels of the music signal.

The channel separator 111 may include a channel component calculator 171, a panning calculator 172, and a component extractor 173, as illustrated in FIG. 3A. The channel component calculator 171 calculates a magnitude ratio and a phase difference between the channel signals included in the music signal. The panning calculator 172 calculates the panning using the calculated magnitude ratio and phase difference. The component extractor 173 uses the calculated panning to extract, as the channel signal, the component of the music signal related to that panning.
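The per-bin quantities computed by the channel component calculator can be sketched as follows (a hedged illustration: the similarity-based panning index shown is one conventional definition, not necessarily the formula in the patent's equations, and the three sample bins are invented):

```python
import numpy as np

# Three example time-frequency bins of the left/right channel spectra
XL = np.array([1.0 + 1.0j, 2.0 + 0.0j, 0.5 - 0.5j])
XR = np.array([1.0 + 1.0j, 1.0 + 0.0j, 0.5 - 0.5j])

eps = 1e-12
mag_ratio = np.abs(XL) / (np.abs(XR) + eps)       # inter-channel magnitude ratio
phase_diff = np.angle(XL * np.conj(XR))           # inter-channel phase difference

# Similarity-based panning index in [0, 1]: 1 when both channels match
pan = 2 * np.abs(XL * np.conj(XR)) / (np.abs(XL) ** 2 + np.abs(XR) ** 2 + eps)
```

Bins 0 and 2 are identical in both channels (center-panned, index 1), while bin 1 is twice as strong on the left and gets a lower index.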

The type determiner 112 divides the separated channel signal into frame units and determines a signal type for each frame. The type determiner 112 determines, as the signal type, whether a vocal component is detected in each frame using feature factors associated with the channel signal, and, when a vocal component is detected, determines whether it is a voiced or an unvoiced component. Accordingly, the type determiner 112 determines the signal type of each frame as one of a voiced vocal signal, an unvoiced vocal signal, and a silent signal. When determining the signal type, the type determiner 112 may use Mel-frequency cepstral coefficients (MFCC), linear predictive coefficients (LPC), the panning index, the zero-crossing rate, and the like.
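Two of the cheapest frame features mentioned above can be sketched directly (a minimal illustration, assuming a 16 kHz sample rate and 512-sample frames; any thresholds a real classifier would use are not taken from the patent):

```python
import numpy as np

# Zero-crossing rate: high for noisy/unvoiced frames, low for voiced tones
def zero_crossing_rate(frame):
    signs = np.sign(frame)
    signs[signs == 0] = 1                  # treat exact zeros as positive
    return np.mean(signs[1:] != signs[:-1])

n = np.arange(512)
voiced = np.sin(2 * np.pi * 200 * n / 16000)            # 200 Hz tone at 16 kHz
unvoiced = np.random.default_rng(2).normal(size=512)     # white noise
silent = np.zeros(512)

zcr_voiced = zero_crossing_rate(voiced)
zcr_unvoiced = zero_crossing_rate(unvoiced)
energy_silent = float(np.mean(silent ** 2))              # frame energy flags silence
```

A voiced vocal frame yields a low zero-crossing rate, an unvoiced (fricative-like) frame a high one, and near-zero energy indicates a silent frame.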

The frequency estimator 113 estimates the frequency component of the main melody using the separated channel signal. When the separated channel signal is decomposed into a non-negative matrix, the frequency estimator 113 estimates the frequency component of the main melody based on the decomposed non-negative matrix; in consideration of this, the voice feature estimator 110 may further include a first non-negative matrix decomposition unit 181.

The frequency estimator 113 may include a first non-negative matrix decomposition unit 181 and a component tracking unit 182, as shown in FIG. 3 (b). The first non-negative matrix decomposition unit 181 decomposes the spectrogram of the separated channel signal into non-negative matrices using a non-negative matrix decomposition technique. The component tracking unit 182 performs the estimation by temporally tracking the component having the largest magnitude in the decomposed non-negative matrix, and may use a Viterbi algorithm for this purpose.
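Temporal tracking of the dominant component with a Viterbi pass can be sketched as follows (a hedged illustration: the activation matrix, the use of raw activations as emission scores, and the fixed jump penalty are all invented for the demo, not the patent's actual scores):

```python
import numpy as np

# Track the dominant component through the activation matrix H (K x T):
# emission score = activation, and switching components costs a penalty.
def viterbi_track(H, jump_penalty=1.0):
    K, T = H.shape
    score = H[:, 0].astype(float).copy()
    back = np.zeros((K, T), dtype=int)
    for t in range(1, T):
        # trans[i, j]: score of being in component j at t-1 and moving to i
        trans = score[None, :] - jump_penalty * (np.arange(K)[:, None] != np.arange(K)[None, :])
        back[:, t] = np.argmax(trans, axis=1)
        score = trans[np.arange(K), back[:, t]] + H[:, t]
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[path[t], t]
    return path

# One brief outlier frame where component 1 momentarily dominates
H = np.array([[5.0, 5.0, 0.1, 5.0, 5.0],
              [0.1, 4.9, 5.2, 4.9, 0.1]])
greedy = H.argmax(axis=0)                  # frame-wise argmax jumps at frame 2
path = viterbi_track(H, jump_penalty=3.0)  # Viterbi stays on component 0
```

The point of the Viterbi pass is exactly this smoothing: a per-frame argmax flickers on brief outliers, while the penalized path stays on the temporally consistent component.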

The contribution calculator 120 calculates a contribution of the voice signal to the music signal using the estimated feature of the voice signal.

The voice signal separator 130 separates the voice signal from the music signal with a filter gain for the voice signal based on the calculated contribution. The voice signal separator 130 may include a hangover calculator 131, a filter gain calculator 132, and a separator 133, as shown in FIG. 2 (b).

The hangover calculator 131 calculates a hangover by using the result of the determination of the type determiner 112. The hangover calculator 131 calculates a hangover function for exponentially reducing the signal with respect to the frame signal in which no vocal component is detected. The filter gain calculator 132 calculates a filter gain for the speech signal based on the calculated contribution and the calculated hangover. The separating unit 133 separates the voice signal from the music signal based on the calculated filter gain.
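The two ingredients above, an exponentially decaying hangover for frames without a detected vocal and a Wiener gain from the vocal contribution, can be sketched as follows (a minimal illustration: the 0.5 decay factor, the detection sequence, and the power values are invented, and the classical power-ratio Wiener gain stands in for the patent's Equation 25):

```python
import numpy as np

# Per-frame vocal detection result from the type determiner
vocal_detected = np.array([1, 1, 0, 0, 0, 1], dtype=bool)

# Hangover: decays exponentially while no vocal is detected, resets on detection
hangover = np.ones(len(vocal_detected))
for t in range(len(vocal_detected)):
    if not vocal_detected[t]:
        prev = hangover[t - 1] if t > 0 else 1.0
        hangover[t] = 0.5 * prev

# Classical Wiener gain from estimated vocal/accompaniment contributions
vocal_power = np.array([4.0, 1.0, 1.0, 1.0, 1.0, 4.0])
accomp_power = np.ones(6)
wiener_gain = vocal_power / (vocal_power + accomp_power)

gain = wiener_gain * hangover   # final per-frame filter gain
```

Multiplying the gain by the hangover suppresses residual accompaniment in vocal-free frames without the hard on/off switching that a binary decision would cause.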

The power supply unit 140 performs a function of supplying power to each unit constituting the voice signal separation device 100.

The main controller 150 controls the overall operation of each unit constituting the voice signal separation device 100.

The voice signal separation apparatus 100 may further include at least one of a first signal converter 161, a second signal converter 162, and a third signal converter 163, as shown in FIG. 2 (c).

The first signal converter 161 converts the music signal into a frequency domain signal so that the feature of the voice signal can be estimated; it uses a short-time Fourier transform (STFT). The second signal converter 162 determines the form of the music signal and converts it into a digital signal when it is an analog signal. The third signal converter 163 converts the separated voice signal into a time domain signal using an inverse short-time Fourier transform (ISTFT).
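The STFT/ISTFT pair used by the first and third converters can be sketched as a round trip (a minimal numpy version with a Hann window, 50% overlap, and overlap-add; the frame length and hop are common defaults, not values from the patent):

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    # Windowed frames -> one-sided FFT per frame, shape (freq, time)
    w = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * w for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames]).T

def istft(X, n_fft=256, hop=128):
    # Overlap-add with squared-window normalization for exact reconstruction
    w = np.hanning(n_fft)
    T = X.shape[1]
    out = np.zeros(n_fft + hop * (T - 1))
    norm = np.zeros_like(out)
    for t in range(T):
        frame = np.fft.irfft(X[:, t], n=n_fft)
        out[t * hop:t * hop + n_fft] += frame * w
        norm[t * hop:t * hop + n_fft] += w ** 2
    norm[norm < 1e-12] = 1.0
    return out / norm

x = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)
X = stft(x)
y = istft(X)
err = np.max(np.abs(x[256:3840] - y[256:3840]))   # compare away from the edges
```

Any per-bin filter gain (such as the Wiener gain above) would be applied to X between the two transforms; away from the frame edges the round trip is exact up to floating-point error.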

Next, an embodiment of the voice signal separation apparatus of FIG. 1 will be described. FIG. 4 is a block diagram illustrating the signal flow of the vocal signal separation apparatus as an embodiment of the voice signal separation apparatus 100 according to the present embodiment. FIG. 5 is a block diagram of a computer system implementing the vocal signal separation apparatus of FIG. 4. FIG. 6 is a flowchart illustrating each step of the vocal signal separation method as an embodiment of a method of driving the voice signal separation apparatus according to the present embodiment. FIG. 7 illustrates the result of vocal signal separation according to FIG. 6. The following description refers to FIGS. 4 to 7.

The vocal signal separation apparatus 10 according to FIG. 4 includes: a module for receiving stereo music, performing the STFT, and separating a center channel component from the transformed music signal; a module for estimating the fundamental frequency of the main melody by applying a non-negative matrix decomposition technique to the center-channel-separated signal; a module for determining the signal of each frame as voiced vocal, unvoiced vocal, or silent; a module for calculating the contribution of the vocal component from the spectrogram of the STFT-transformed stereo music using a non-negative matrix decomposition technique in which the estimated fundamental frequency of the main melody and the signal type determination result are applied as constraints; a module for calculating a hangover using the signal type determination result; a module for determining the Wiener filter gain using the contribution of the vocal component and the hangover; a module for separating the vocal signal components included in each channel of the stereo music; and a module for applying an inverse STFT to the separated frequency domain vocal signal to convert it into the time domain. The overall structure and contents of this embodiment are described with reference to FIG. 4.

In FIG. 4, the left and right channel signals of stereo music are each transformed into frequency domain spectra in the short-time Fourier transform (STFT) module 20. In a preferred embodiment the stereo music signal is in digital form; an analog signal is first converted to digital form using well-known techniques. The STFT module 20 functions as the first signal converter of FIG. 2 (c). The short-time Fourier transformed stereo music signal is input to the vocal feature estimation module 30, which functions as the voice feature estimator of FIG. 1 and may include a central channel separation module 31, a non-negative matrix decomposition 1 module 32, a fundamental frequency estimation module 33, and a signal type determination module 34.

The central channel separation module 31 functions as the channel separation unit of FIG. 2 (a). It estimates the panning applied to the stereo music during recording in order to estimate and separate the vocal components included in the signal. If the vocal signal is panned to the center, the magnitude and phase of the two channel spectra almost match. Another output of the central channel separation module 31 is a feature useful for vocal determination.

The stereo music signal processed by the center channel separation module 31 is input to the non-negative matrix decomposition 1 module 32. Non-negative matrix decomposition is a technique that decomposes a signal into non-negative matrices representing the harmonic groups and the change of each harmonic group's magnitude over time; module 32 performs this decomposition. The non-negative matrix decomposition 1 module 32 functions as the first non-negative matrix decomposition unit of FIG. 3 (b).

The fundamental frequency estimation module 33 functions as the frequency estimator of FIG. 2 (a) and estimates the fundamental frequency of the main melody from the magnitude change of the harmonic groups output by the non-negative matrix decomposition 1 module 32. In music, the main melody may at times be a vocal or another musical instrument.

The signal type determination module 34 functions as the type determination unit of FIG. 2 (a); using the feature factors calculated in the center channel separation module 31, it determines whether the current signal contains a vocal and, if so, whether it is unvoiced or voiced.

The non-negative matrix decomposition 2 module 40 functions as the contribution calculation unit of FIG. 1; applying the fundamental frequency of the main melody estimated by the vocal feature estimation module 30 as a constraint, it decomposes the music spectrum output by the STFT module 20 into non-negative matrices. The Wiener filter module 50 functions as the voice signal separator of FIG. 1 and calculates the Wiener filter gain for the main melody using the non-negative matrices produced by the non-negative matrix decomposition 2 module 40.

The Wiener filter module 50 may include a hangover calculation module 51 and a filter module 52. The hangover calculation module 51 functions as the hangover calculation unit of FIG. 2 (b); using the result of the vocal feature estimation module 30, it computes a hangover function that exponentially attenuates the signal when no vocal signal is included in the corresponding frame. The calculated hangover function is applied in the filter module 52 to calculate the Wiener filter gain. The filter module 52 functions as the filter gain calculator of FIG. 2 (b).

The inverse STFT module 60 converts the separated vocal signals from the frequency domain into the time domain and functions as the third signal converter of FIG. 2 (c).

The modules described in FIG. 4 may be implemented in hardware or software. FIG. 5 shows a computer system composed of a central processing unit 101, a ROM 104, a RAM 105 capable of holding the vocal signal separation program, an auxiliary memory 106, input/output devices 102 and 103, a bus 107 for transferring data between these devices, and peripheral devices for implementing the present invention in software. The operations of FIGS. 4 and 6 may be performed by a digital signal processor (DSP) under the control of the central processing unit (CPU) 101. The central processing unit 101 may include a controller 101a, a processor 101b, a register 101c, and the like.

The vocal signal separation method according to FIG. 6 includes: receiving stereo music and performing STFT conversion; Calculating a magnitude difference and a phase difference between the channels in the converted music signal to separate the center channel component from the converted signal; Estimating the fundamental frequency of the main melody by applying a non-negative matrix decomposition technique to the center channel separation signal; Determining the signal for each frame as voiced vocal, unvoiced vocal, and silent; Calculating the contribution of the vocal component from the spectrogram of the STFT transformed stereo music; Calculating a hangover using the signal type determination result; Determining a gain of the Wiener filter using the contribution of the vocal component and the hangover; Separating vocal signal components included in each channel of stereo music; And converting the separated frequency domain vocal signal into a time domain vocal signal by applying an inverse STFT.

Each module of FIG. 4 is described in detail below with reference to FIG. 6, which is a flowchart illustrating each step of the vocal separation method according to an embodiment of the present invention.

The left channel signal x_L(n) (right channel signal x_R(n)) of stereo music consists of the left channel vocal signal v_L(n) (right channel vocal signal v_R(n)) and the left channel accompaniment signal a_L(n) (right channel accompaniment signal a_R(n)). In the STFT conversion module 20, the stereo music signal is converted into the frequency domain as shown in Equation 1 (S201).

[Equation 1]

X_L(f, t) = V_L(f, t) + A_L(f, t),  X_R(f, t) = V_R(f, t) + A_R(f, t)

Here f is the frequency index, t is the frame index, and X_L(f, t), V_L(f, t), A_L(f, t) (or X_R(f, t), V_R(f, t), A_R(f, t)) are the STFT representations of the left (or right) channel mixture, vocal, and accompaniment signals, respectively.
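As a concrete illustration of step S201, the channel-wise STFT of Equation 1 can be sketched as follows. The function name, window, frame size, and hop size below are illustrative choices, not values specified in the patent.

```python
import numpy as np

def stft_stereo(x_l, x_r, n_fft=1024, hop=512):
    """Return X_L(f, t) and X_R(f, t) for a stereo signal (step S201).
    Frame size and hop are illustrative defaults."""
    win = np.hanning(n_fft)

    def stft(x):
        frames = 1 + (len(x) - n_fft) // hop
        X = np.empty((n_fft // 2 + 1, frames), dtype=complex)
        for t in range(frames):
            # Window one frame and take the real-input FFT.
            X[:, t] = np.fft.rfft(x[t * hop:t * hop + n_fft] * win)
        return X

    return stft(x_l), stft(x_r)
```

Because the STFT is linear, the mixture model of Equation 1 holds frame by frame: transforming v(n) + a(n) gives the sum of the transforms of v(n) and a(n).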

Center Channel Separation Step

The first step of vocal feature estimation is the center channel separation step (S202). In this step it is assumed that the vocal signal is panned to the center while the remaining instrument sounds are panned away from the center, and the panning component is extracted in each (f, t) region. Panning is obtained by calculating, in each (f, t) region, the magnitude ratio of the two channel signals,

g(f, t) = |X_L(f, t)| / |X_R(f, t)|,

and their phase difference,

φ(f, t) = ∠X_L(f, t) − ∠X_R(f, t).

An alternative form of the channel magnitude ratio may also be applied. First, the magnitude ratio and the phase difference are defined as a two-dimensional Gaussian random variable, as in Equation 2.

[Equation 2]

Now, at each (f, t), the feature vector of Equation 2 is modeled as in Equation 3: a mixture of two random variables, one belonging to the vocal signal and one to the remaining musical sources.

[Equation 3]

p(x(f, t)) = π_1(f, t) N(x(f, t) | μ_1, Σ_1) + π_2(f, t) N(x(f, t) | μ_2, Σ_2)

Here N(x(f, t) | μ_k, Σ_k) is a Gaussian probability density function with mean μ_k and covariance Σ_k (k = 1, 2), where x(f, t) is the two-dimensional feature vector of Equation 2. The mixing ratios π_k(f, t), k = 1, 2, of the two random variables at (f, t) satisfy 0 ≤ π_k ≤ 1 and π_1 + π_2 = 1. The parameters {μ_k, Σ_k} are estimated by applying the conventional Expectation-Maximization (EM) algorithm, which maximizes the log-likelihood function of Equation 4.

[Equation 4]

L(θ) = Σ_{f,t} log [ π_1(f, t) N(x(f, t) | μ_1, Σ_1) + π_2(f, t) N(x(f, t) | μ_2, Σ_2) ]

The EM algorithm first initializes the parameters and then optimizes the log-likelihood function by alternating two steps until it converges: the E-step, which calculates the posterior probabilities of the mixing ratios of the vocal and accompaniment signals using the currently estimated parameters, and the M-step, which recalculates θ = {μ_k, Σ_k} from the current posterior probabilities. The posterior probability, given in Equation 5, can be interpreted as the panning value at each (f, t). When the log-likelihood function has converged, the vocal signal is estimated by selecting the component k = k* with the smaller |Σ_k|, as shown in Equation 6.

[Equation 5]

γ_k(f, t) = π_k(f, t) N(x(f, t) | μ_k, Σ_k) / [ π_1(f, t) N(x(f, t) | μ_1, Σ_1) + π_2(f, t) N(x(f, t) | μ_2, Σ_2) ]

[Equation 6]

V̂(f, t) = γ_{k*}(f, t) X(f, t),  where γ_{k*} is the posterior probability of Equation 5 and k* = argmin_k |Σ_k|

In this case, the bass portion contained in the mixed music signal needs to be excluded by appropriately limiting the frequency range of the vocal signal. Inverse STFT conversion of the estimated frequency-domain vocal signal yields the time-domain vocal signals x_L^v(m, t) and x_R^v(m, t).
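The EM iteration described above can be sketched for a two-component, two-dimensional Gaussian mixture as follows. The patent fits the mixture to the (magnitude ratio, phase difference) features over all (f, t) bins; the helper names, the initialization strategy, and the regularization constant below are illustrative assumptions.

```python
import numpy as np

def log_gauss(x, mu, cov):
    """Log of the Gaussian density N(x | mu, cov) for each row of x."""
    d = x.shape[1]
    diff = x - mu
    inv = np.linalg.inv(cov)
    maha = np.einsum('ij,jk,ik->i', diff, inv, diff)
    return -0.5 * (maha + np.log(np.linalg.det(cov)) + d * np.log(2 * np.pi))

def em_two_gaussians(x, n_iter=50, seed=0):
    """Fit the two-component mixture of Equation 3 by EM.

    x: (N, 2) array of (magnitude ratio, phase difference) features,
    one row per (f, t) bin.  Returns (pi, mu, cov, gamma), where gamma
    holds the posterior probabilities of Equation 5.
    """
    rng = np.random.default_rng(seed)
    N, d = x.shape
    # Initialize the two means at mutually distant data points.
    i0 = int(rng.integers(N))
    i1 = int(np.argmax(((x - x[i0]) ** 2).sum(1)))
    mu = x[[i0, i1]].astype(float).copy()
    cov = np.array([np.cov(x.T) + 1e-6 * np.eye(d)] * 2)
    pi = np.full(2, 0.5)
    for _ in range(n_iter):
        # E-step: posterior responsibilities (Equation 5).
        logp = np.stack([np.log(pi[k]) + log_gauss(x, mu[k], cov[k])
                         for k in range(2)])
        logp -= logp.max(axis=0)
        gamma = np.exp(logp)
        gamma /= gamma.sum(axis=0)
        # M-step: re-estimate {pi_k, mu_k, Sigma_k}.
        for k in range(2):
            w = gamma[k]
            s = w.sum()
            pi[k] = s / N
            mu[k] = (w[:, None] * x).sum(axis=0) / s
            diff = x - mu[k]
            cov[k] = (w[:, None] * diff).T @ diff / s + 1e-6 * np.eye(d)
    return pi, mu, cov, gamma
```

The returned posteriors gamma play the role of the panning value at each observation, and the vocal component would then be chosen as the one with the smaller covariance determinant, as in Equation 6.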

Other parameters calculated in the center channel separation step are the Mel Frequency Cepstral Coefficients (MFCC), the panning index, and the zero-crossing rate required for vocal discrimination. In this embodiment, 24 MFCC coefficients were calculated by applying a 40-channel filter bank. To apply a panning index for determining whether a frame signal is vocal, it is calculated over a limited frequency range, as in Equation 7.

[Equation 7]

The frequency range may vary with the music, but a range of approximately 300 Hz to 10 kHz is appropriate. The zero-crossing rate is calculated from the signal of one selected channel, as shown in Equation 8.

[Equation 8]

zcr(t) = (1 / 2M) Σ_m |sgn(x(m, t)) − sgn(x(m−1, t))|,  where M is the number of samples in a frame

Here m = 0, … is the sample index within frame t.
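The zero-crossing rate of Equation 8 can be computed in a few lines; normalizing by the number of sample pairs (i.e., taking the mean) is one common convention and is assumed here.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Zero-crossing rate in the spirit of Equation 8: half the mean
    absolute difference of adjacent sample signs (0 = no crossings,
    1 = a crossing between every adjacent pair)."""
    s = np.sign(frame)
    return float(np.mean(np.abs(np.diff(s))) / 2.0)
```

Unvoiced (noise-like) frames cross zero far more often than voiced frames, which is what the later voiced/unvoiced decision exploits.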

Vocal Harmonic Component Extraction Using Nonnegative Matrix Decomposition

When the center channel component separation step is completed, the separated signal is input to the vocal harmonic component extraction step (S203), which applies the non-negative matrix decomposition technique. A single channel signal could be used at this point, but here the average of the two channels,

x_v(m, t) = (x_L^v(m, t) + x_R^v(m, t)) / 2,

is used. Although x_v(m, t) is the component panned to the center of the mixed music signal, instrument sound components still remain, so it is regarded as the sum of the vocal signal x_v^v(m, t) and the instrument sound component x_a^v(m, t), as in Equation 9.

[Equation 9]

x_v(m, t) = x_v^v(m, t) + x_a^v(m, t)

When this signal is STFT-converted, the power spectra of the vocal component and the instrument sound component can each be expressed as a decomposition into non-negative matrices, as shown in Equation 10.

[Equation 10]

Here N_F, N_F0, K, and N_t denote the number of frequencies, the number of fundamental frequencies, the number of filter shapes, and the number of time frames, respectively; the non-negative matrix W_V is N_F × N_F0, H_V is N_F0 × N_t, W_A is N_F × K, and H_A is K × N_t.

When the spectrum takes complex values and the real part and the imaginary part are each assumed to follow a Gaussian probability distribution, the spectrum can be modeled with the probability distribution of Equation 11.

[Equation 11]

The vocal signal is assumed to follow an excitation-filter model. Let F_0 be the range of fundamental frequencies that can be uttered. The excitation-filter model represents the signal as an excitation component, a group of harmonic spectra with fundamental frequencies f_0 ∈ F_0, shaped by a filter component representing the spectral envelope. When the excitation and the filter are each decomposed into non-negative matrices, the model can be expressed as in Equation 12.

[Equation 12]

Here, in the same way as the excitation and filter components built from the non-negative matrices W_F0, H_F0, W_K, and H_K, the remaining (accompaniment) component is modeled with non-negative matrices, and each component is assumed to follow a complex Gaussian distribution of the form N_c(0, σ²). The entire power spectrum is then expressed by Equation 13.

[Equation 13]

Σ̂(f, t) = [(W_F0 H_F0) .* (W_K H_K) + W_M H_M](f, t)

The entire power spectrum represented by Equation 13 is likewise assumed to follow a complex Gaussian probability distribution, as expressed in Equation 14.

[Equation 14]

W_F0 is predetermined based on the human vocal range. The remaining parameters to be estimated are θ = {H_F0, W_K, H_K, W_M, H_M}. These parameters are determined so as to satisfy Equation 15.

[Equation 15]
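Since W_F0 is fixed in advance from the human vocal range, it can be precomputed as a harmonic spectral dictionary with one column per candidate fundamental frequency. The Gaussian peak width, the 1/h harmonic weighting, and the column normalization below are illustrative assumptions; the patent only states that W_F0 is predetermined.

```python
import numpy as np

def harmonic_dictionary(f0s, n_freq, sr, n_harm=20, width=10.0):
    """Build a W_F0-style matrix (n_freq x len(f0s)): column j holds
    Gaussian-blurred harmonic peaks at multiples of f0s[j], with 1/h
    amplitude decay, normalized to sum to one.  All numeric choices are
    illustrative."""
    freqs = np.arange(n_freq) * sr / (2 * (n_freq - 1))  # bin center frequencies
    W = np.zeros((n_freq, len(f0s)))
    for j, f0 in enumerate(f0s):
        for h in range(1, n_harm + 1):
            fh = h * f0
            if fh > freqs[-1]:
                break
            W[:, j] += np.exp(-0.5 * ((freqs - fh) / width) ** 2) / h
        W[:, j] /= W[:, j].sum()
    return W
```

With such a fixed dictionary, the activation matrix H_F0 estimated later directly indexes candidate fundamental frequencies.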

Estimating the parameters so as to maximize the log-likelihood function is equivalent to minimizing Equation 16.

[Equation 16]

C(θ) = Σ_{f,t} [ |S(f, t)|² / Σ̂(f, t) − log( |S(f, t)|² / Σ̂(f, t) ) − 1 ],

where S(f, t) is the observed spectrum and Σ̂(f, t) denotes the entire power spectrum of Equation 13.

This expression is equivalent to the Itakura-Saito cost function. Non-negative matrix decomposition is a useful method for computing the parameters that minimize this cost function. The update formulas of the non-negative matrix decomposition algorithm take a multiplicative rather than an additive form, which is known to guarantee fast convergence while preserving the non-negativity condition; they are derived as Equation 17.

[Equation 17]

Here .* and ./ denote element-wise multiplication and element-wise division, respectively. The non-negative matrices other than W_F0 are initialized to non-negative values using Gaussian random variables; that is, a Gaussian random variable x is drawn and its absolute value |x| is taken. Starting from these initial values, the updates are iterated until the cost function of Equation 16 converges, updating only one parameter at a time. Normally, convergence is reached after each parameter has been updated about 50 times.
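The style of update referred to around Equations 16 and 17 can be illustrated with a plain Itakura-Saito NMF on a single factorization V ≈ W H. The patent's Equation 17 applies the same kind of multiplicative update to the full excitation-filter model; this simplified single-matrix version is only a sketch showing the update style (one factor at a time, non-negativity preserved by the multiplicative form, |x| initialization from Gaussian draws).

```python
import numpy as np

def is_divergence(V, Vh):
    """Itakura-Saito divergence between a power spectrogram V and its model Vh."""
    R = V / Vh
    return float(np.sum(R - np.log(R) - 1.0))

def is_nmf(V, K, n_iter=50, seed=0):
    """Itakura-Saito NMF V ≈ W H with multiplicative updates,
    updating one factor at a time, as the text prescribes."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    # Initialization: absolute value of Gaussian draws, as described above.
    W = np.abs(rng.standard_normal((F, K)))
    H = np.abs(rng.standard_normal((K, T)))
    eps = 1e-12
    for _ in range(n_iter):
        Vh = W @ H + eps
        W *= ((V / Vh ** 2) @ H.T) / ((1.0 / Vh) @ H.T + eps)
        Vh = W @ H + eps
        H *= (W.T @ (V / Vh ** 2)) / (W.T @ (1.0 / Vh) + eps)
    return W, H
```

Because each factor is multiplied by a non-negative ratio, it can never become negative, and the cost decreases monotonically in practice.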

Estimation of the fundamental frequency of the main melody

The parameter H_F0 estimated in Equation 17 represents the change in magnitude of the harmonic groups over time, and the component with the largest magnitude corresponds to the main melody of the music. The fundamental frequency of the main melody is estimated by applying the Viterbi algorithm, which tracks the largest component of the calculated non-negative matrix H_F0 over time (S204). The Viterbi algorithm maximizes the cost function of Equation 18.

[Equation 18]

Here, the probability of transitioning from fundamental frequency f_0^i to f_0^j is expressed as a function of the interval between the two frequencies, with a parameter β that is set to an appropriate value through experimentation; in general, β = 10 is appropriate. In addition, the normalized value of H_F0(k) is used, as expressed in Equation 19.

[Equation 19]

Here, the setting of Equation 20 can reduce the octave error, i.e., the error of estimating twice the actual fundamental frequency, which is likely to occur in fundamental frequency estimation. The Viterbi algorithm starts by assuming that all fundamental frequencies f_0^i initially have the same probability.

[Equation 20]
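The Viterbi tracking of the largest H_F0 component over time can be sketched as follows. The exact functional forms of Equations 18 to 20 appear only as images in the source, so the emission term (normalized salience) and the transition term (penalty growing with the pitch jump, scaled by β) are illustrative reconstructions.

```python
import numpy as np

def track_melody(H, beta=10.0):
    """Viterbi path through H (n_f0 x T): emission = log of the
    per-frame normalized salience, transition penalty proportional to
    beta times the pitch-index jump.  Illustrative sketch only."""
    n_f0, T = H.shape
    em = np.log(H / H.sum(axis=0, keepdims=True) + 1e-12)
    idx = np.arange(n_f0)
    # Log-transition matrix: trans[i, j] = penalty for moving i -> j.
    trans = -beta * np.abs(idx[:, None] - idx[None, :]) / n_f0
    delta = em[:, 0].copy()
    back = np.zeros((n_f0, T), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + trans      # (from-state, to-state)
        back[:, t] = scores.argmax(axis=0)   # best predecessor per state
        delta = scores.max(axis=0) + em[:, t]
    path = np.empty(T, dtype=int)
    path[-1] = int(delta.argmax())
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[path[t], t]
    return path
```

Initializing delta from the first frame's emission alone corresponds to the assumption that all fundamental frequencies are initially equally probable.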

Vocal judgment

Characteristic factors used in vocal determination include Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coefficients (LPC), the fundamental frequency, the zero-crossing rate, energy, and the panning index. To determine whether a frame signal includes a vocal, a deterministic method such as a support vector machine (SVM) or a statistical method such as a hidden Markov model (HMM) or a likelihood function may be applied; these schemes provide nearly identical decision performance.

In one embodiment of the present invention, the MFCC and the panning index were used. In addition, to determine whether a frame signal is voiced or unvoiced, if the zero-crossing rate calculated in step S202 exceeds a predetermined threshold during the vocal determination process, the frame is determined to be unvoiced vocal. This is expressed as Equation 21.

[Equation 21]
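The frame-type decision described above (the rule of Equation 21) reduces to a threshold test on the zero-crossing rate for frames already judged vocal. The threshold value and label names below are illustrative; the patent only states that a predetermined threshold is used.

```python
def classify_frame(is_vocal, zcr, zcr_threshold=0.3):
    """Label one frame as 'voiced', 'unvoiced', or 'silent' (vocal-wise).
    A frame already flagged as vocal is 'unvoiced' when its
    zero-crossing rate exceeds the threshold, 'voiced' otherwise;
    non-vocal frames are 'silent' with respect to the vocal."""
    if not is_vocal:
        return "silent"
    return "unvoiced" if zcr > zcr_threshold else "voiced"
```

The resulting per-frame labels feed both the hangover calculation (S207) and the unvoiced handling in the contribution estimation step.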

Calculation of the Vocal Component Contribution Using Nonnegative Matrix Decomposition

If the vocal determination result and the estimated fundamental frequency information are applied to the non-negative matrix decomposition technique, the contribution of the vocal signal contained in the music mixtures X_L(f, t) and X_R(f, t) can be estimated directly. First, a DC component corresponding to the fundamental frequency 0 is added to the non-negative matrix W_F0 to accommodate unvoiced vocal sounds. The initial value of the non-negative matrix H_F0 is the fundamental frequency estimated in the fundamental frequency estimation step (S204).

In the MIDI note domain, the initial value is restricted as follows.

[Equation 22]

The non-negative matrix W_K uses the value estimated in step S203 as its initial value. The remaining non-negative matrices H_K, W_M, and H_M are initialized to non-negative values using Gaussian random variables, in the same manner as in step S204. Using the initialized non-negative matrices, Equation 17 is calculated repeatedly until convergence.

Vocal Signal Separation Using the Wiener Filter

When the algorithm of Equation 17 has converged, the contribution of the main melody component to the music mixture is calculated as in Equation 23.

[Equation 23]

G(f, t) = [(W_F0 H_F0) .* (W_K H_K)](f, t) / [(W_F0 H_F0) .* (W_K H_K) + W_M H_M](f, t)

This contribution is the gain of the Wiener filter. In order to suppress sections in which the main melody is not a vocal, a hangover that decreases exponentially, as shown in Equation 24, is calculated from the time point t_0 at which the vocal ends, using the result of the vocal determination step S206 (S207).

[Equation 24]

h(t) = exp(−λ_h (t − t_0)),  t ≥ t_0

Here, the parameter λ_h is set appropriately in the range 0 < λ_h ≤ 1/5. The vocal signal is then calculated from the stereo music mixture according to Equation 25, using the Wiener filter gain of Equation 23 and the hangover of Equation 24 (S208).

[Equation 25]

V̂_L(f, t) = G(f, t) h(t) X_L(f, t),  V̂_R(f, t) = G(f, t) h(t) X_R(f, t)

The vocal signal is converted into the time-domain signals v_L(n) and v_R(n) using the inverse STFT.
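Steps S207 and S208 can be sketched as follows: a hangover that decays exponentially after the last vocal frame scales the Wiener gain, which is then applied to each channel spectrogram. The decay parameter and helper names are illustrative, and the exact functional form of the hangover appears only as an image in the source.

```python
import numpy as np

def hangover(vocal_active, lam=0.2):
    """Equation-24-style hangover: h(t) = 1 on vocal frames, decaying as
    exp(-lam * (t - t0)) after the last vocal frame t0.  lam plays the
    role of the parameter set in 0 < lambda_h <= 1/5."""
    T = len(vocal_active)
    h = np.ones(T)
    t0 = -1e9  # no vocal frame seen yet -> hangover already fully decayed
    for t in range(T):
        if vocal_active[t]:
            t0 = t
        else:
            h[t] = np.exp(-lam * (t - t0))
    return h

def separate_vocal(X, vocal_power, total_power, vocal_active, lam=0.2):
    """Wiener-style separation in the spirit of Equations 23-25:
    gain = vocal contribution over total model power, scaled by the
    hangover, applied to one channel spectrogram X(f, t)."""
    G = vocal_power / (total_power + 1e-12)
    return X * G * hangover(vocal_active, lam)[None, :]
```

Applying the same gain to both channel spectrograms and inverting the STFT yields the time-domain vocal estimates v_L(n) and v_R(n).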

FIG. 7 compares vocal signal separation results according to an embodiment of the present invention: (a) the vocal signal, (b) the music signal mixed with accompaniment, (c) the result of vocal signal separation using only center channel separation, (d) the result using only non-negative matrix decomposition, and (e) the result of the present embodiment, which combines center channel separation and signal determination with non-negative matrix decomposition. As shown in FIG. 7, when only center channel separation or only non-negative matrix decomposition is applied to the music mixture, the accompaniment signal still remains; the present embodiment can separate the vocal signal with improved sound quality.

Next, a preferred form of the voice signal separation method performed by the voice signal separation apparatus of FIG. 1 is described. FIG. 8 is a flowchart schematically illustrating a voice signal separation method according to a preferred embodiment of the present invention. The following description refers to FIG. 8.

First, features of the voice signal included in the input music signal are estimated (voice feature estimation step S800). The voice feature estimation step S800 may include a channel separation step, a type determination step, and a frequency estimation step.

In the channel separation step, the panning reflected in the music signal is analyzed to separate the channel signal related to the voice signal from the music signal. The channel separation step may include a channel component calculation step, a panning calculation step, and a component extraction step. In the channel component calculating step, the magnitude ratio and the phase difference between the channel signals included in the music signal are calculated. In the panning calculation step, panning is calculated using the calculated magnitude ratio and the phase difference. In the component extraction step, a component signal related to panning is extracted from the music signal as a channel signal using the calculated panning.

In the type determination step, the separated channel signal is classified in units of frames to determine the signal type of each frame. More specifically, in the type determination step, it is determined for each frame, using feature factors associated with the channel signal, whether a vocal component is detected; when a vocal component is detected, it is further determined whether that component corresponds to a voiced sound or an unvoiced sound.

In the frequency estimation step, the frequency component of the main melody is estimated using the separated channel signal. The frequency estimation step may consist of a first non-negative matrix decomposition step and a component tracking step. In the first non-negative matrix decomposition step, the spectrogram of the separated channel signal is decomposed into non-negative matrices. In the component tracking step, the feature of the speech signal is estimated by temporally tracking the largest component of the decomposed non-negative matrix.

After the speech feature estimation step S800, the contribution of the speech signal to the music signal is calculated using the estimated features of the speech signal (contribution calculation step S810).

After the contribution calculation step S810, the speech signal is separated from the music signal by the filter gain for the speech signal based on the calculated contribution (voice signal separation step S820). The speech signal separation step S820 may include a hangover calculation step, a filter gain calculation step, and a separation step. In the hangover calculation step, a hangover is calculated by using the result according to the signal type determination. In the filter gain calculation step, the filter gain for the speech signal is calculated based on the calculated contribution and the calculated hangover. In the separating step, the speech signal is separated from the music signal based on the calculated gain.

The speech signal separation method may further include at least one of a first signal conversion step to a third signal conversion step. In the first signal conversion step, the music signal is converted into a frequency-domain signal to enable speech feature estimation. In the second signal conversion step, the form of the music signal is determined, and the music signal is converted into a digital signal when it is an analog signal. In the third signal conversion step, the separated speech signal is converted into a time-domain signal.

The first signal conversion step and the second signal conversion step are performed before the speech feature estimation step. On the other hand, the third signal conversion step is performed after the speech signal separation step.

It will be apparent to those skilled in the art that various modifications, substitutions, and alterations are possible without departing from the scope and spirit of the invention as disclosed in the accompanying claims. Accordingly, the embodiments disclosed herein and the accompanying drawings are intended to describe, not to limit, the technical spirit of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments and drawings. The scope of protection of the present invention should be construed according to the following claims, and all technical ideas within the scope of their equivalents should be construed as falling within the scope of the present invention.

The present invention relates to an apparatus and method for separating a vocal signal from a stereo music signal, and may be applied to a music content service, a realistic / 3D broadcast service, and the like.

100: speech signal separation unit 110: speech feature estimation unit
111: channel separation unit 112: type determination unit
113: frequency estimation unit 120: contribution calculation unit
130: voice signal separation unit 131: hangover calculation unit
132: filter gain calculation unit 133: separation unit
161 to 163: first signal converter to third signal converter
171: channel component calculator 172: panning calculator
173: component extraction unit 181: first non-negative matrix decomposition unit
182: component tracking unit

Claims (1)

A voice signal separation apparatus comprising:
a contribution calculator for calculating a contribution of a voice signal to a music signal by using features of the voice signal included in an input music signal; and
a voice signal separator for separating the voice signal from the music signal with a filter gain for the voice signal based on the calculated contribution.
KR1020110048969A 2011-05-24 2011-05-24 Apparatus for separating vocal signal KR20120130908A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020110048969A KR20120130908A (en) 2011-05-24 2011-05-24 Apparatus for separating vocal signal


Publications (1)

Publication Number Publication Date
KR20120130908A true KR20120130908A (en) 2012-12-04

Family

ID=47514880

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020110048969A KR20120130908A (en) 2011-05-24 2011-05-24 Apparatus for separating vocal signal

Country Status (1)

Country Link
KR (1) KR20120130908A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180079975A (en) * 2017-01-03 2018-07-11 한국전자통신연구원 Sound source separation method using spatial position of the sound source and non-negative matrix factorization and apparatus performing the method
CN111968669A (en) * 2020-07-28 2020-11-20 安徽大学 Multi-element mixed sound signal separation method and device
CN111968669B (en) * 2020-07-28 2024-02-20 安徽大学 Multi-element mixed sound signal separation method and device
CN113393857A (en) * 2021-06-10 2021-09-14 腾讯音乐娱乐科技(深圳)有限公司 Method, device and medium for eliminating human voice of music signal

Similar Documents

Publication Publication Date Title
Duan et al. Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions
Ueda et al. HMM-based approach for automatic chord detection using refined acoustic features
Virtanen et al. Combining pitch-based inference and non-negative spectrogram factorization in separating vocals from polyphonic music.
Durrieu et al. A musically motivated mid-level representation for pitch estimation and musical audio source separation
Huang et al. Singing-voice separation from monaural recordings using robust principal component analysis
Hsu et al. A tandem algorithm for singing pitch extraction and voice separation from music accompaniment
Rao et al. Vocal melody extraction in the presence of pitched accompaniment in polyphonic music
Duan et al. Multi-pitch streaming of harmonic sound mixtures
Duan et al. Unsupervised single-channel music source separation by average harmonic structure modeling
Tachibana et al. Singing voice enhancement in monaural music signals based on two-stage harmonic/percussive sound separation on multiple resolution spectrograms
Pertusa et al. Multiple fundamental frequency estimation using Gaussian smoothness
Lagrange et al. Normalized cuts for predominant melodic source separation
Mayer et al. Impact of phase estimation on single-channel speech separation based on time-frequency masking
Cogliati et al. Context-dependent piano music transcription with convolutional sparse coding
CN110516102B (en) Lyric time stamp generation method based on spectrogram recognition
Arora et al. On-line melody extraction from polyphonic audio using harmonic cluster tracking
Demir et al. Single-channel speech-music separation for robust ASR with mixture models
Jeong et al. Vocal separation from monaural music using temporal/spectral continuity and sparsity constraints
Durrieu et al. An iterative approach to monaural musical mixture de-soloing
CN103915093A (en) Method and device for realizing voice singing
Benetos Automatic transcription of polyphonic music exploiting temporal evolution
Yong et al. Singing expression transfer from one voice to another for a given song
Nakano et al. Nonnegative matrix factorization with Markov-chained bases for modeling time-varying patterns in music spectrograms
Marxer et al. Low-latency instrument separation in polyphonic audio using timbre models
Kameoka et al. Separation of harmonic structures based on tied Gaussian mixture model and information criterion for concurrent sounds

Legal Events

Date Code Title Description
WITN Withdrawal due to no request for examination