CN104616663A - Music separation method of MFCC (Mel Frequency Cepstrum Coefficient)-multi-repetition model in combination with HPSS (Harmonic/Percussive Sound Separation)


Info

Publication number: CN104616663A
Application number: CN201510023609.5A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 张天骐, 徐昕, 张刚, 高超, 阳锐, 李灿
Original assignee: 重庆邮电大学
Priority applications: CN2014106908405, CN201410690840 (priority date 2014-11-25)
Application filed by 重庆邮电大学 on 2015-01-16 as CN201510023609.5A
Publication of CN104616663A: 2015-05-13

Classifications

    • G10H 1/00: Details of electrophonic musical instruments (G: Physics; G10: Musical instruments, acoustics; G10H: Electrophonic musical instruments)
    • G10L 25/81: Detection of presence or absence of voice signals for discriminating voice from music (under G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00; G10L 25/78: Detection of presence or absence of voice signals)
    • G10H 2210/041: Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal, based on MFCC [mel-frequency cepstral coefficients]
    • G10H 2210/056: Musical analysis for extraction or identification of individual instrumental parts, e.g. melody, chords, bass; identification or separation of instrumental parts by their characteristic voices or timbres
    • G10H 2250/235: Fourier transform; Discrete Fourier Transform [DFT]; Fast Fourier Transform [FFT] (under G10H 2250/00: aspects of algorithms or signal processing methods without intrinsic musical character; G10H 2250/131: mathematical functions; G10H 2250/215: transforms)

Abstract

The invention discloses a music separation method combining an MFCC (Mel-frequency cepstral coefficient) multi-repetition model with HPSS (harmonic/percussive sound separation), and relates to the technical field of signal processing. Considering that gentle sound sources are easily missed and that music is time-varying, the source types are first analyzed by the harmonic/percussive sound separation (HPSS) method to separate out the harmonic sources; the MFCC feature parameters of the remaining sources are then extracted, a similarity computation is performed on them to build a similarity matrix, and a multi-repetition structural model of the sources, suited to melodic variation, is established to obtain a masking matrix; finally, the time-domain waveforms of the singing voice and the background music are recovered through the ideal binary mask (IBM) and the inverse Fourier transform. The method can effectively separate different types of source signals, improving separation accuracy; at the same time it has low complexity, high processing speed, and good stability, and has broad application prospects in fields such as singer retrieval, song retrieval, melody extraction, and speech recognition against an instrumental background.

Description

A music separation method combining an MFCC multi-repetition model with HPSS

Technical field

The present invention relates to audio signal processing, and in particular to the problem of separating the singing voice from the background music in a music signal.

Background art

Extracting the background music and the singing voice from a piece of music has broad application prospects in the audio signal processing field, for example in singer retrieval, song retrieval, melody extraction, and speech recognition against an instrumental background; audio separation is, however, a challenging task. On the polyphonic music separation problem, the human auditory system has a remarkable ability: people can easily distinguish the singing voice from the background music in a piece of music, and can even tell which instruments a tune contains. These are trivial matters for a human listener, but they pose numerous difficulties for a computer.

Current music separation techniques fall mainly into statistical techniques and psychoacoustics-based techniques. Statistical approaches to music separation mainly use non-negative matrix factorization (NMF) and sparse coding. Psychoacoustics-based research uses computational auditory scene analysis (CASA), which starts from human auditory perception and focuses on perceiving the various features of the music signal. In recent years, music separation methods based on statistical techniques and on CASA have achieved considerable success.

Figure 1 shows the basic flow of CASA processing.

A CASA system has two main processing modes. One is bottom-up information processing, which mirrors the human ear's ability to decompose sounds and recombine them: features such as the periodicity, similarity, and continuity of the audio data are analyzed, sound components are decomposed and grouped into different auditory streams, and components of the same class are finally recombined, achieving the separation of sounds. The other is top-down information processing, which mirrors the auditory system's ability to learn and memorize unknown sound sources: prior information about the sound signal is used for machine learning and model building, and the model is then used to recognize and separate unknown sounds. Although CASA has made significant progress over years of research, it still faces challenges: robust pitch estimation, fusion of acoustic cues, speech separation in high-frequency regions, and separation of unvoiced sounds all remain difficult. Moreover, real music is ever-changing, and in most cases the available prior information about the music is insufficient, so blind separation techniques for music still require further exploration and study. In recent years researchers have also continued to explore further features of music; the repeating (repetition) feature studied in the literature (Z. Rafii, B. Pardo, "REpeating Pattern Extraction Technique (REPET): A simple method for music/voice separation," IEEE Transactions on Audio, Speech, and Language Processing) is among the most representative. The music theorist Schenker regarded repetition as the smallest structural element of a musical work (H. Schenker, Harmony, University of Chicago Press, 1980), but repetition, despite being an important attribute of music, has only been applied to separation methods in recent years.

Figure 2 is a schematic diagram of the repeating-pattern extraction technique (REPET).

Methods based on repeating structure aim to separate the background music, which has a repeating structure, from the singing voice, which does not. Their central idea is to identify the repeating segments of the background music and finally extract the background music through the ideal binary mask (IBM).

Compared with local features of music (such as pitch and overtones), repetition is a global property: the time-varying characteristics of music affect it far less than other features, it is less sensitive to interference, and repeating-structure extraction is a simple approach to music separation. However, separation methods based on the repetition feature have only recently begun to develop; their adaptivity to different source types is still poor, and more exploration and research are needed.

Summary of the invention

Addressing the problems of existing repetition-based separation methods, the present invention exploits the inherent characteristics of the source type and the global property of Mel-frequency cepstral coefficients (MFCC), and proposes a music separation method combining an MFCC multi-repetition structure with the harmonic/percussive source separation method HPSS. The method can effectively analyze different sound sources before separating the singing voice from the background music; it not only separates well, but also runs fast and has good stability.

The technical scheme of the present invention is as follows. Exploiting the inherent characteristics of the source type and the global property of the Mel-frequency cepstral coefficients (MFCC), the source types to be separated are first analyzed with the harmonic/percussive source separation method HPSS, which effectively distinguishes the different types of source information, especially source information with a gentle rhythm. In selecting the features that characterize the repeating structure, the MFCC features of the audio signal are added to the energy information; the robustness of MFCC more effectively guarantees the accuracy of the repeating-structure extraction. To accommodate the time-varying characteristics of music, a similarity computation is used to adaptively build multiple repeating-structure models of the background music for music separation.

The MFCC multi-repetition-structure music separation method based on HPSS is characterized by comprising the following steps:

Perform harmonic separation under the short-time Fourier transform (STFT), separating out the harmonic sources in the background music. Extract the MFCC feature parameters of the music information remaining after harmonic separation, and perform a similarity computation on the MFCC feature parameters to obtain the similarity matrix $S_{MFCC}$; find similar segments according to $S_{MFCC}$. Build the repeating-structure model $S(i,j)$ of each frame from the similar segments, and invoke the repeating-structure model to median-filter and compute the background music at the corresponding repeating structures. Obtain the magnitude spectrum of the background music according to the formula $W(i,j) = \min\{S(i,j),\, V(i,j)\}$, and from this magnitude spectrum build the masking matrix $M(i,j) = W(i,j)/V(i,j)$. Apply an ideal binary mask to the masking matrix $M(i,j)$, and recover the time-domain waveforms of the singing voice and the background music through the inverse Fourier transform, where $V(i,j)$ is the magnitude spectrum matrix of the signal, $j$ is the frame index, and $i$ is the frequency bin.
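The steps above can be read as a short signal-processing pipeline. The following minimal Python sketch outlines it; hpss, mfcc_similarity, segments_from_similarity, and separate_by_repetition are the illustrative helper sketches given later in this description, and the mono mix-down and the scipy/librosa tooling are assumptions, not part of the claimed method.

```python
# Illustrative outline of the pipeline; hpss(), mfcc_similarity(),
# segments_from_similarity() and separate_by_repetition() refer to the
# sketches given later in this description.
import numpy as np
from scipy.io import wavfile
from scipy.signal import istft

def mfcc_multi_repetition_hpss(path):
    fs, x = wavfile.read(path)
    x = x.astype(np.float64)
    if x.ndim > 1:                                # mix down to mono for this sketch
        x = x.mean(axis=1)

    # Step 1: HPSS under the STFT; remove the harmonic accompaniment first.
    X_H, X_P = hpss(x, fs)
    _, residual = istft(X_P, fs, nperseg=1024)    # voice + non-harmonic music

    # Steps 2-3: MFCC features, similarity matrix, similar segments.
    S_mfcc, similar = mfcc_similarity(residual, fs)
    segments = segments_from_similarity(similar)  # hypothetical grouping helper

    # Steps 4-6: repeating model, min rule, mask, IBM, inverse STFT.
    voice, music = separate_by_repetition(residual, fs, segments)
    _, harmonic = istft(X_H, fs, nperseg=1024)
    n = min(len(music), len(harmonic))
    # Background = repeating music + separated harmonic accompaniment (assumption).
    return voice, music[:n] + harmonic[:n]
```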

Separating the harmonic sources in the background music specifically comprises: obtaining the masking matrix $M_{H_{h,i}} = H_{h,i}^2/(H_{h,i}^2 + P_{h,i}^2)$ from the Fourier transform $F_{h,i}$ of the music signal, and computing $X_H = F_{h,i} \otimes M_{H_{h,i}}$ to separate the harmonic sources, obtaining the spectrum $X_H$ of the harmonic sources after separation, where $H_{h,i}$ and $P_{h,i}$ are the short-time Fourier transforms of the harmonic and percussive sources, respectively.

A similarity computation is performed on the extracted MFCC parameter matrix, giving the similarity matrix between MFCC coefficients at different spectral lines:

$$S_{MFCC}(j_a, j_b) = \frac{\sum_{i=1}^{n} C(i,j_a)\, C(i,j_b)}{\sqrt{\sum_{i=1}^{n} C(i,j_a)^2}\; \sqrt{\sum_{i=1}^{n} C(i,j_b)^2}}$$

where $i$ denotes the frequency bin, $n$ is the dimension, and $C(i,j_a)$, $C(i,j_b)$ are the MFCC coefficient matrices of the $i$-th frame at spectral lines $j_a$ and $j_b$.

The repeating-structure model $S(i,j)$ of each frame is built, and the repeating-structure model is invoked to median-filter the corresponding segments on the magnitude spectrum of the input signal, $S(i,j) = \mathrm{median}_{l \in [1,x]}\{V(i, J_j(l))\}$, where median denotes the median filter, $x$ is the number of segments, $i$ the frequency bin, $j$ the frame index, and $V(i, J_j(l))$ the signal magnitude spectrum of the $l$-th repeated segment at frequency bin $i$.

Finding similar segments further comprises: limiting the maximum length and minimum length of the repeating structure according to the length of the music piece; determining a threshold according to the duration of the repeating segments and the music duration; and determining the similarity between segments from the similarity matrix, two segments whose similarity is greater than the threshold being similar segments.

The time-domain waveforms of the singing voice and the background music are recovered by the formulas $\hat{x}_m = F^{-1}\{M \otimes F(x)\}$ and $\hat{x}_v = x - \hat{x}_m$, where $x$ is the original input music signal, $\hat{x}_m$ is the time-domain waveform of the background music, $\hat{x}_v$ that of the singing voice, and $F$, $F^{-1}$ denote the Fourier transform and its inverse.

The MFCC feature parameter $C_{MFCC}$ is determined according to the formula $C_{MFCC}(i) = \sqrt{\frac{2}{L}}\sum_{l=1}^{L-1}\log m(l)\cos\!\left(\left(l-\frac{1}{2}\right)\frac{i\pi}{L}\right)$, and the similarity matrix of the MFCC feature parameters determines the similarity between MFCC coefficients at different spectral lines, where $m(l)$ is the energy of the signal through the filter, $L$ is the number of filters in the bank, $i$ denotes the frequency bin, $j$ the frame index, $C(i,j_a)$ and $C(i,j_b)$ are the MFCC coefficient matrices of the $i$-th frame at spectral lines $j_a$ and $j_b$, and $n$ is the dimension of the MFCC.

Before the music separation, the present invention uses HPSS to separate the source types, remedying the problem of existing repeating-structure separation methods that source information with a gentle rhythm is omitted; it also adopts a simple iterative method that uses an auxiliary function for the iterative solution, greatly reducing the complexity of the original harmonic-source method. The audio after this separation is then processed further: the MFCC parameters are added to the single energy information, and this global feature, which summarizes the music signal as a whole, is more robust. The repeating structure is characterized with MFCC, and multiple repeating-structure models are built from the similarity to extract the background music, better matching real music files, so that the method is significantly improved both in separation performance and in processing speed.

Brief description of the drawings

Fig. 1: basic flow chart of CASA processing;

Fig. 2: schematic diagram of the REPET method;

Fig. 3: HPSS method results: (a) spectrum of the mixed signal; (b) recorder spectrum after separation; (c) piano spectrum after separation; (d) original recorder spectrum; (e) original piano spectrum;

Fig. 4: schematic diagram of the MFCC multi-repeating-structure music separation system;

Fig. 5: block diagram of MFCC feature parameter extraction;

Fig. 6: overall flow chart of the method of the invention;

Fig. 7: sample separation results of the improved method: (a) time-domain waveform of the mixed signal; (b) spectrum of the mixed signal; (c) time-domain waveform of the song after separation; (d) spectrum of the song after separation; (e) time-domain waveform of the background music after separation; (f) spectrum of the background music after separation;

Fig. 8: comparison between the pitch values of the separated song and the reference values at different signal-to-noise ratios: (a) comparison of the three pitch tracks at -5 dB; (b) comparison at 0 dB; (c) comparison at +5 dB.

Embodiment

As mentioned above, traditional repeating-structure separation methods easily miss music information with a gentler rhythm, because repeating-structure extraction uses the "beat spectrum" method: it easily captures source information with obvious tempo changes, but easily misses the soothing sources produced by orchestral instruments (e.g. flute, violin, piano). To address this, harmonic-source separation is added before the main separation: the harmonic/percussive source separation method HPSS is adopted here, the sources are first analyzed and then separated, which effectively improves the separation performance.

Generally speaking, the spectrogram of a music signal presents two typical distributions: one is smooth and continuous along the time axis, the other is smooth and continuous along the frequency axis. The sources with these two distributions are referred to as harmonic sources and percussive sources, respectively. The harmonic source $H_{h,i}$ and the percussive source $P_{h,i}$ are obtained by minimizing the cost function $J(H,P)$,

and generally satisfy formulas (1) and (2):

$$J(H,P) = \frac{1}{2\sigma_H^2}\sum_{h,i}\left(H_{h,i-1}-H_{h,i}\right)^2 + \frac{1}{2\sigma_P^2}\sum_{h,i}\left(P_{h-1,i}-P_{h,i}\right)^2 \qquad (1)$$

$$H_{h,i} + P_{h,i} = W_{h,i} \qquad (2)$$

where $H_{h,i}$ and $P_{h,i}$ are the short-time Fourier transforms of the harmonic and percussive sources, respectively; $W_{h,i}$ is the energy spectrum of the input signal; $\sigma_H$ and $\sigma_P$ are parameter factors for the smoothness in the horizontal (harmonic) and vertical (percussive) directions; $h$ is the frequency bin and $i$ is the frame index; $H_{h,i-1}$ and $P_{h-1,i}$ denote the harmonic term of the previous frame and the percussive term of the previous frequency bin; and $H$, $P$ denote the sets of all $H_{h,i}$ and $P_{h,i}$.

Formula (1) is a convex function with a global minimum, which in principle can be obtained by differentiation. However, $H_{h,i}$ and $P_{h,i}$ produce simultaneous equations of large dimension; to avoid this problem, an auxiliary function is adopted for the iterative solution.

In general, assuming that the spectral gradients $(H_{h,i-1} - H_{h,i})$ and $(P_{h-1,i} - P_{h,i})$ obey independent Gaussian distributions, we have

$$\left(H_{h,i-1}-H_{h,i}\right)^2 \le 2\left(H_{h,i-1}-U_{h,i}\right)^2 + 2\left(H_{h,i}-U_{h,i}\right)^2 \qquad (3)$$

$$\left(P_{h-1,i}-P_{h,i}\right)^2 \le 2\left(P_{h-1,i}-V_{h,i}\right)^2 + 2\left(P_{h,i}-V_{h,i}\right)^2 \qquad (4)$$

where $U_{h,i} = (H_{h,i-1} + H_{h,i})/2$ and $V_{h,i} = (P_{h-1,i} + P_{h,i})/2$. The auxiliary function $Q(H,P,U,V)$ is given by formula (5) and satisfies formulas (6) and (7):

$$Q(H,P,U,V) = \frac{1}{\sigma_H^2}\sum_{h,i}\left\{\left(H_{h,i-1}-U_{h,i}\right)^2 + \left(H_{h,i}-U_{h,i}\right)^2\right\} + \frac{1}{\sigma_P^2}\sum_{h,i}\left\{\left(P_{h-1,i}-V_{h,i}\right)^2 + \left(P_{h,i}-V_{h,i}\right)^2\right\} \qquad (5)$$

$$J(H,P) \le Q(H,P,U,V) \qquad (6)$$

$$J(H,P) = \min_{U,V} Q(H,P,U,V) \qquad (7)$$

Combining formulas (5), (6), and (7) yields the iterative solution shown in formulas (8) and (9):

$$\left\{U^{(k+1)}, V^{(k+1)}\right\} = \arg\min_{U,V} Q\!\left\{H^{(k)}, P^{(k)}, U, V\right\} \qquad (8)$$

$$\left\{H^{(k+1)}, P^{(k+1)}\right\} = \arg\min_{H,P} Q\!\left\{H, P, U^{(k+1)}, V^{(k+1)}\right\} \qquad (9)$$

Through the formulas above, the minimum of the cost function is obtained iteratively, and the harmonic and percussive sources are thereby obtained, where $k$ is the current iteration index, $J(H,P)$ is the cost function of the harmonic and percussive sources, $U_{h,i}$ and $V_{h,i}$ are auxiliary parameters, $i$ is the frame index, and $h$ is the frequency bin. Formulas (8) and (9) guarantee that the cost function $J$ decreases monotonically.

Research shows that singing-voice information presents different forms on the spectrogram depending on the frequency resolution of the short-time Fourier transform: at lower frequency resolution the voice concentrates in the harmonic sources, while at higher frequency resolution it concentrates in the percussive sources. At low frequency resolution, most of the voice energy gathers on a few single frequencies; at high frequency resolution the pseudo-harmonics in the voice can no longer be ignored: unlike the energy of each overtone, the energy of the pseudo-harmonics is dispersed over the entire frequency range. High frequency resolution also means low time resolution, so non-vocal parts may be identified as vocal within the same frame, which in turn affects the vocal energy at the surrounding frequencies and makes the voice behave like a percussive source. Whatever the frequency resolution, the harmonic energy always concentrates in a narrow bandwidth around the harmonic frequencies; the distribution of the vocal information does not change, only the way of observing it changes. Under a high-resolution short-time Fourier transform (STFT), the singing voice appears in the form of a percussive source, and the harmonic sources in the background music can be separated out.

The harmonic sources can be isolated as follows: the frequency-domain expressions of the harmonic and percussive sources are derived by iterating the formulas below, and after a certain number of iterations the separation reaches its best, i.e. the harmonic sources are maximally isolated.

In practice, the iterative formulas (8) and (9) are derived as follows.

Substituting the constraint equation $H_{h,i} + P_{h,i} = W_{h,i}$ into the $H^{(k+1)}, P^{(k+1)}$ step of formula (9) gives the constrained auxiliary function

$$\tilde{Q}(H,P) = Q\!\left(H,P,U^{(k+1)},V^{(k+1)}\right) + \sum_{h,i}\lambda_{h,i}\left(H_{h,i}+P_{h,i}-W_{h,i}\right) \qquad (10)$$

where $\lambda_{h,i}$ is the Lagrange multiplier. Differentiating the above with respect to $H_{h,i}$, $P_{h,i}$, and $\lambda_{h,i}$ and then solving yields formulas (11) and (12):

$$H_{h,i}^{(k+1)} = \frac{\alpha}{2}\left(U_{h,i+1}^{(k+1)} + U_{h,i}^{(k+1)}\right) + \frac{1-\alpha}{2}\left(2W_{h,i} - V_{h+1,i}^{(k+1)} - V_{h,i}^{(k+1)}\right) \qquad (11)$$

$$P_{h,i}^{(k+1)} = \frac{\alpha}{2}\left(V_{h+1,i}^{(k+1)} + V_{h,i}^{(k+1)}\right) + \frac{1-\alpha}{2}\left(2W_{h,i} - U_{h,i+1}^{(k+1)} - U_{h,i}^{(k+1)}\right) \qquad (12)$$

where the auxiliary parameters $U^{(k+1)}$ and $V^{(k+1)}$ again satisfy formulas (3) and (4), so formula (13) is obtained:

$$U_{h,i}^{(k+1)} = \frac{H_{h,i-1}^{(k)} + H_{h,i}^{(k)}}{2}, \qquad V_{h,i}^{(k+1)} = \frac{P_{h-1,i}^{(k)} + P_{h,i}^{(k)}}{2} \qquad (13)$$

Eliminating the auxiliary parameters $U_{h,i}$ and $V_{h,i}$ yields the new iterative formulas, shown in (14) and (15):

$$H_{h,i}^{(k+1)} = H_{h,i}^{(k)} + \Delta^{(k)} \qquad (14)$$

$$P_{h,i}^{(k+1)} = P_{h,i}^{(k)} - \Delta^{(k)} \qquad (15)$$

where $\Delta^{(k)} = \alpha\left(\frac{H_{h,i-1}^{(k)} - 2H_{h,i}^{(k)} + H_{h,i+1}^{(k)}}{4}\right) - (1-\alpha)\left(\frac{P_{h-1,i}^{(k)} - 2P_{h,i}^{(k)} + P_{h+1,i}^{(k)}}{4}\right)$ and $\alpha$ is a weight factor. In general, $H_{h,i}$ and $P_{h,i}$ are initialized in the experiments to $0.5\,W_{h,i}$, and a total of 15 to 20 iterations achieves a good separation; the number can of course be set according to actual requirements. Here $k$ is the iteration index, $h$ the frequency bin, and $i$ the frame index.

Finally, the harmonic sources are isolated through a masking operation; the result is shown in Figure 3. Specifically, the harmonic sources can be separated according to the following formulas:

$$M_{H_{h,i}} = \frac{H_{h,i}^2}{H_{h,i}^2 + P_{h,i}^2} \qquad (16)$$

$$X_H = F_{h,i} \otimes M_{H_{h,i}} \qquad (17)$$

where $M_{H_{h,i}}$ is the masking matrix, $F_{h,i}$ is the Fourier transform of the music signal, and $X_H$ is the spectrum of the harmonic sources after separation.
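As a concrete illustration, the iteration of formulas (14) and (15) and the masking of formulas (16) and (17) can be sketched as follows; scipy's STFT, the clipping of H to [0, W], and the small constant in the mask denominator are illustrative numerical choices, not part of the claimed method.

```python
# Illustrative HPSS iteration per formulas (14)-(17); h indexes frequency bins
# (rows) and i indexes frames (columns), as in the text above.
import numpy as np
from scipy.signal import stft

def hpss(x, fs, n_fft=1024, alpha=0.5, n_iter=20):
    _, _, F = stft(x, fs, nperseg=n_fft)   # complex STFT F[h, i]
    W = np.abs(F) ** 2                     # energy spectrum W[h, i]
    H = 0.5 * W                            # initialization per the text
    P = 0.5 * W
    for _ in range(n_iter):                # 15-20 iterations per the text
        dH = np.zeros_like(H)              # second difference along time (harmonic)
        dH[:, 1:-1] = (H[:, :-2] - 2.0 * H[:, 1:-1] + H[:, 2:]) / 4.0
        dP = np.zeros_like(P)              # second difference along frequency (percussive)
        dP[1:-1, :] = (P[:-2, :] - 2.0 * P[1:-1, :] + P[2:, :]) / 4.0
        delta = alpha * dH - (1.0 - alpha) * dP            # Delta^(k)
        H = np.clip(H + delta, 0.0, W)     # formula (14); clipping keeps H feasible (assumption)
        P = W - H                          # formula (15) via the constraint H + P = W
    mask_h = H ** 2 / (H ** 2 + P ** 2 + 1e-12)            # formula (16)
    X_H = F * mask_h                       # formula (17): harmonic spectrum
    X_P = F * (1.0 - mask_h)               # complementary percussive spectrum
    return X_H, X_P
```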

After the harmonic sources are separated, short-time stationarity analysis is performed again on the remaining music information; after windowing, framing, and the STFT, the signal is separated.

Here the MFCC multi-repeating-structure separation method can be adopted; its overall flow chart is shown in Figure 4.

First, MFCC parameter extraction and similarity computation are performed on the input music signal, the repeating-structure model is built, the repeating structure of the background music is extracted, the ideal binary mask is obtained, and the singing-voice signal is separated out.

Figure 5 is a block diagram of MFCC feature parameter extraction. The input music signal is pre-processed and transformed by FFT; the line energies and the Mel filter-bank energies are obtained; the logarithm and the DCT are taken; and the MFCC parameter matrix of the $i$-th frame is obtained according to formula (23):

$$C_{MFCC}(i) = \sqrt{\frac{2}{L}} \sum_{l=1}^{L-1} \log m(l)\, \cos\!\left(\left(l-\frac{1}{2}\right)\frac{i\pi}{L}\right) \qquad (23)$$

where $C_{MFCC}$ is the MFCC parameter matrix, $l$ is the current filter index, $L$ is the total number of filters in the bank, and $m(l)$ is the energy of the signal through the $l$-th Mel filter. A similarity computation is performed on the extracted MFCC parameter matrix, and the similarity matrix between MFCC coefficients at different spectral lines is obtained according to formula (24):

$$S_{MFCC}(j_a, j_b) = \frac{\sum_{i=1}^{n} C(i,j_a)\, C(i,j_b)}{\sqrt{\sum_{i=1}^{n} C(i,j_a)^2}\; \sqrt{\sum_{i=1}^{n} C(i,j_b)^2}} \qquad (24)$$

where $i$ denotes the frequency bin, $n$ the dimension, $j$ the frame index, and $a$, $b$ the spectral-line indices after the DCT, so that $C(i,j_a)$ and $C(i,j_b)$ are the MFCC coefficient matrices of the $i$-th frame at spectral lines $j_a$ and $j_b$. The similarity matrix is then analyzed and similar segments are found from it; to respect the time-varying characteristics of music, some constraints are imposed on the similar segments (a code sketch follows the constraint list below). The constraints comprise:

1) the maximum length of the repeating structure is limited according to the length of the music piece;

2) the minimum length of the repeating structure is limited according to the length of the music piece, because the duration of the repeating segments is not correlated with the music duration and adjacent frames would otherwise show spuriously high similarity;

3) the similarity between segments is judged against a threshold.

Optimally, the threshold can be 0.6: two segments whose similarity is greater than 0.6 can be regarded as repetitions and are merged together as one segment.
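As a concrete illustration of formulas (23) and (24) and the 0.6 threshold, the following minimal Python sketch can be used; librosa's MFCC routine is an illustrative stand-in for the filter-bank chain of Figure 5, and segments_from_similarity is a hypothetical grouping helper (a real system would also enforce the length limits above).

```python
# Illustrative MFCC similarity per formulas (23)-(24) and the 0.6 threshold.
import numpy as np
import librosa

def mfcc_similarity(x, fs, n_mfcc=13, thresh=0.6):
    # Pre-processing, FFT, Mel filter-bank energies, log, DCT (Figure 5);
    # librosa.feature.mfcc is an illustrative stand-in for that chain.
    C = librosa.feature.mfcc(y=x, sr=fs, n_mfcc=n_mfcc)   # C[i, j]: coeff i, frame j
    Cn = C / (np.linalg.norm(C, axis=0, keepdims=True) + 1e-12)
    S = Cn.T @ Cn               # cosine similarity S_MFCC(j_a, j_b), formula (24)
    return S, S > thresh        # frames with similarity > 0.6 count as repetitions

def segments_from_similarity(similar):
    # Hypothetical grouping helper: for each frame j, the list of frames it
    # repeats (including itself); length constraints are omitted in this sketch.
    return [np.flatnonzero(similar[j]) for j in range(similar.shape[0])]
```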

Exploiting the sparsity of the singing-voice energy distribution, segments with consistent similarity are combined under the above constraints, the repeating-structure model $S(i,j)$ of each frame is built, and the corresponding segments on the magnitude spectrum of the original input signal are median-filtered according to formula (25):

$$S(i,j) = \underset{l \in [1,x]}{\mathrm{median}} \left\{ V\!\left(i, J_j(l)\right) \right\} \qquad (25)$$

where $x$ is the number of segments, $i$ the frequency bin, $j$ the frame index, median denotes the median filter, $J_j(l)$ is the $l$-th repeated segment, and $V(i, J_j(l))$ is the signal magnitude spectrum of the $l$-th repeated segment at frequency bin $i$.

Assuming that the magnitude spectrum of the non-repeating part and that of the repeating part are mutually independent, we have $V = W + W'$ (where $V>0$, $W>0$, $W'>0$) and hence $W \le V$, where $W$ is the magnitude spectrum of the background music and $W'$ is the magnitude spectrum of the singing voice. The magnitude spectrum of the background-music signal at frequency bin $i$ and frame $j$ is obtained according to formula (26):

$$W(i,j) = \min\{S(i,j),\, V(i,j)\} \qquad (26)$$

The masking matrix $M(i,j)$ is obtained from the magnitude spectrum of the background music, as shown in formula (27):

$$M(i,j) = \frac{W(i,j)}{V(i,j)} \qquad (27)$$

With the ideal binary mask (IBM), the time-domain waveforms of the singing voice and the background music can then be recovered through the inverse Fourier transform, as shown in formulas (28) and (29):

$$\hat{x}_m = F^{-1}\{M \otimes F(x)\} \qquad (28)$$

$$\hat{x}_v = x - \hat{x}_m \qquad (29)$$

where $x$ is the original input music signal, $\hat{x}_m$ is the time-domain waveform of the background music, $\hat{x}_v$ is the time-domain waveform of the singing voice, and $F$ and $F^{-1}$ denote the Fourier transform and its inverse, respectively. The flow chart is shown in Figure 6.
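A minimal sketch of formulas (25) through (29) follows, assuming that segments maps each frame to the frames it repeats (for example as produced by the grouping helper above); the 0.5 binarization point for the ideal binary mask is an assumption, since the text only states that the soft mask M is binarized.

```python
# Illustrative implementation of formulas (25)-(29).
import numpy as np
from scipy.signal import stft, istft

def separate_by_repetition(x, fs, segments, n_fft=1024):
    _, _, X = stft(x, fs, nperseg=n_fft)
    V = np.abs(X)                            # magnitude spectrum V(i, j)
    S = np.empty_like(V)
    for j, reps in enumerate(segments):      # formula (25): median over repetitions
        S[:, j] = np.median(V[:, reps], axis=1)
    W = np.minimum(S, V)                     # formula (26): repeating part cannot exceed V
    M = W / (V + 1e-12)                      # formula (27): soft mask
    ibm = (M > 0.5).astype(float)            # ideal binary mask; 0.5 split is an assumption
    _, music = istft(ibm * X, fs, nperseg=n_fft)          # formula (28): background music
    _, voice = istft((1.0 - ibm) * X, fs, nperseg=n_fft)  # formula (29): voice = x - music
    return voice, music
```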

Mixed music signals were separated with the method proposed by the present invention; Figure 7 shows the frequency-domain results when one music clip is separated. From the spectrogram of the separated song in Figure 7(d), the fundamental and the higher harmonics of the voice are separated out well, while some vocal fundamental and harmonic components still remain in the background music. The figure shows that the separation method of the present invention can effectively separate the singing voice from the background music.

The present invention is compared with the REPET method of Rafii and the recent NMF method of Ozerov. To quantitatively analyze the separation performance of the three methods, music clips from the MIR-1K data set were mixed at voice-to-music energy ratios of -5 dB, 0 dB, and 5 dB, and a 1024-point STFT with overlapping frames was used. A 100 Hz high-pass filter was also applied in the experiments, since singing-voice energy rarely appears at or below 100 Hz. On this basis the three methods were used for separation. The objective global normalized source-to-distortion ratio (GNSDR) is adopted to assess the overall separation performance on the 1000 song clips of the MIR-1K data set; GNSDR is computed as follows.

$$NSDR(\hat{s}, s, x) = SDR(\hat{s}, s) - SDR(x, s) \qquad (30)$$

$$GNSDR = \frac{\sum_k w_k\, NSDR(\hat{s}_k, s_k, x_k)}{\sum_k w_k} \qquad (31)$$

where $\hat{s}$ is the separated target source, $s$ is the original target source, and $x$ is the mixture; $w_k$ is the length of the $k$-th clip. A larger GNSDR indicates better separation performance.
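For reference, a minimal sketch of formulas (30) and (31); the projection-based SDR below follows the usual BSS_EVAL definition, an assumption insofar as the text does not spell out its SDR variant.

```python
# Illustrative GNSDR per formulas (30)-(31).
import numpy as np

def sdr(est, ref):
    # Project the estimate onto the reference; the residual counts as distortion.
    target = (est @ ref) / (ref @ ref) * ref
    resid = est - target
    return 10.0 * np.log10((target @ target) / (resid @ resid + 1e-12))

def gnsdr(estimates, references, mixtures):
    nsdr = np.array([sdr(e, s) - sdr(x, s)                     # formula (30)
                     for e, s, x in zip(estimates, references, mixtures)])
    w = np.array([len(s) for s in references], dtype=float)    # w_k: clip lengths
    return float(np.sum(w * nsdr) / np.sum(w))                 # formula (31)
```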

Table 1: comparison of the global normalized source-to-distortion ratio (GNSDR) of the song and the background music among the three methods

Table 1 lists the GNSDR scores for the 1000 songs and background-music tracks. From the data, at -5 dB and 0 dB both repeating-structure methods outperform the NMF method on background music, because the NMF method lacks prior information. The MFCC multi-repetition separation method improves on the REPET method in performance, and the extraction of the repeating background-music features improves accordingly. At 5 dB, however, the results appear worse: at 5 dB the song energy exceeds that of the background music, the assumption that the song is sparse over the whole spectrogram no longer holds, and extracting the background music by median filtering therefore degrades the results. The improved method mainly targets the extraction of the background music; the song part changes little in all cases.

In addition, the pitch values of the song are introduced to compare separation performance: pitch detection is applied, with the same method, to the songs in the original MIR-1K database and to the songs separated under different signal-to-noise ratios, and the results are compared with the annotated reference values, as shown in Figure 8. Here '-' is the reference pitch, 'o' is the pitch of the song in the original database, and '*' is the pitch of the separated song; Figures 8(a), (b), and (c) are the comparisons at -5 dB, 0 dB, and +5 dB, respectively. The closeness of 'o' to '-' shows the validity and reliability of the pitch detector; the closer '*' is to '-', the better the separation. As Figure 8 shows, the reference pitch is almost entirely covered, and the pitch detected from the separated song approaches the reference value more and more closely, which shows that the separation method of the present invention separates the song very thoroughly and works well.

Besides the above global normalized source-to-distortion ratio and the pitch values, an objective evaluation standard, the blind source separation toolbox (Blind Source Separation Evaluation, BSS_EVAL), is also adopted to judge the performance of the proposed method. The toolkit quantifies the performance of an estimated signal relative to its actual target source; its principle is to decompose the estimated signal obtained by separation into the four parts shown in formula (32).

$$\hat{s}(t) = s_{target}(t) + e_{interf}(t) + e_{noise}(t) + e_{artif}(t) \qquad (32)$$

where $s_{target}(t)$ is the part of the estimated signal that belongs to the source signal, called the positive contribution; $e_{interf}(t)$ is the part of the estimated signal that does not fit the source signal but belongs to the mixed signal, i.e. the estimation error caused by the other sources; $e_{noise}(t)$ is the observation-noise error; and $e_{artif}(t)$ is the artifact error produced by the method itself.

In most cases, and in the research of this patent, the influence of noise can be neglected, so the $e_{noise}(t)$ part is removed; from the remaining three parts, three performance measures are obtained: the source-to-distortion ratio (SDR), the source-to-interference ratio (SIR), and the source-to-artifacts ratio (SAR), as shown in formulas (33), (34), and (35). Higher values of all three indicate better separation.

$$SDR = 10\log_{10}\frac{\left\| s_{target}(t) \right\|^2}{\left\| e_{interf}(t) + e_{artif}(t) \right\|^2} \qquad (33)$$

$$SIR = 10\log_{10}\frac{\left\| s_{target}(t) \right\|^2}{\left\| e_{interf}(t) \right\|^2} \qquad (34)$$

$$SAR = 10\log_{10}\frac{\left\| s_{target}(t) + e_{interf}(t) \right\|^2}{\left\| e_{artif}(t) \right\|^2} \qquad (35)$$
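A minimal sketch of formulas (33) through (35) follows, assuming the decomposition of formula (32) (with the noise term dropped) is already available; in practice the BSS_EVAL toolbox computes these components by orthogonal projections.

```python
# Illustrative SDR / SIR / SAR per formulas (33)-(35).
import numpy as np

def bss_eval_metrics(s_target, e_interf, e_artif):
    energy = lambda v: float(v @ v)                      # squared l2 norm ||.||^2
    sdr = 10.0 * np.log10(energy(s_target) / (energy(e_interf + e_artif) + 1e-12))
    sir = 10.0 * np.log10(energy(s_target) / (energy(e_interf) + 1e-12))
    sar = 10.0 * np.log10(energy(s_target + e_interf) / (energy(e_artif) + 1e-12))
    return sdr, sir, sar
```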

Since the music clips in the MIR-1K data set all last no more than about 10 s, the repeating structure of their melodies changes little, and the improved multi-repetition method targeting the time-varying nature of background music cannot show its full strength there; real music is therefore tested next. The experimental data are 5 real music clips (3 with strongly changing melodies, 2 with gentler melodies) and 1 complete song, taken from the music data used in the SiSEC evaluation campaign: professionally produced recordings, two-channel 16-bit WAV audio sampled at 44.1 kHz, all regarded as convolutive mixtures. The remaining settings are the same as before. SDR, SAR, and SIR are computed for these clips under different signal-to-noise ratios; the results are as follows.

Table 2: dev1_bearlin-roads_snip_85_99_mix.wav separation performance comparison

Table 3: dev1_tamy-que_pena_tanto_faz_snip_6_19_mix.wav separation performance comparison

Table 4: dev2_fort_minor-remember_the_name_snip_54_78_mix.wav separation performance comparison

Table 5: dev2_another_dreamer-the_ones_we_love_snip_69_94_mix.wav separation performance comparison

Table 6: dev2_ultimate_nz_tour_snip_43_61_mix.wav separation performance comparison

Table 7: Put Your Head On My Shoulder.wav complete-song separation performance comparison

Tables 2, 3, and 4 give the separation performance for the 3 clips with strongly changing melodies. Because their melodies change substantially, they all contain multiple repeating structures. As the tables show, for music with a complex and changeable repeating structure, the proposed method improves the separation performance (SDR, SIR) while also being more robust (SAR). The gain is especially clear for the singing voice: the repeating-structure model characterized by MFCC better matches the melody itself, so the melody is separated more thoroughly and the vocal performance improves markedly. Tables 5 and 6 give the performance for the 2 clips with gentle melodies; because their melodies are gentle, their repeating structure can be assumed to be unique. As the tables show, although the proposed method does not improve the SDR index for such music, the SIR index of the background music still improves well, which shows that the method helps separate the song from the background music, pulling the background music out of the song more thoroughly even for gentler melody types. From the overall data of the 5 tables, the background-music separation of the proposed method is generally much better than the NMF method of Ozerov. In the SAR data of the background music, Ozerov's method outperforms the improved multi-repetition method, indicating that an NMF method introducing prior information can greatly reduce the interference caused by method error; in the song part, however, since no vocal prior is introduced, its separated-song performance falls well below that of the repetition methods. Table 7 gives the separation performance of the 1 complete song under different signal-to-noise ratios: from the comparison data, the background music and the song separated by the proposed method gain nearly 2 dB and 3 dB in the SDR index, respectively, showing that the proposed method markedly improves the quality of the separated audio. The improvements in the SIR and SAR indices likewise show that the method distinguishes the background music and the song more clearly and is itself more robust.

Claims (7)

1. An MFCC multi-repetition-structure music separation method based on HPSS, characterized by comprising the following steps:
performing harmonic separation under the short-time Fourier transform (STFT) to separate out the harmonic sources in the background music; extracting the MFCC feature parameters of the music information remaining after harmonic separation, and performing a similarity computation on the MFCC feature parameters to obtain the similarity matrix $S_{MFCC}$; finding similar segments according to the similarity matrix $S_{MFCC}$; building the repeating-structure model $S(i,j)$ of each frame from the similar segments, and invoking the repeating-structure model to median-filter and compute the background music at the corresponding repeating structures; obtaining the magnitude spectrum $W(i,j)$ of the background music according to the formula $W(i,j)=\min\{S(i,j),\,V(i,j)\}$, and building the masking matrix according to the formula $M(i,j)=W(i,j)/V(i,j)$; applying an ideal binary mask to the masking matrix $M(i,j)$, and recovering the time-domain waveforms of the singing voice and the background music through the inverse Fourier transform, where $V(i,j)$ is the magnitude spectrum matrix of the signal, $j$ is the frame index, and $i$ is the frequency bin.
2. The method according to claim 1, characterized in that separating the harmonic sources in the background music specifically comprises: obtaining the masking matrix $M_{H_{h,i}} = H_{h,i}^2/(H_{h,i}^2 + P_{h,i}^2)$ from the Fourier transform $F_{h,i}$ of the music signal, and computing $X_H = F_{h,i} \otimes M_{H_{h,i}}$ to separate the harmonic sources, obtaining the spectrum $X_H$ of the harmonic sources after separation, where $H_{h,i}$ and $P_{h,i}$ are the short-time Fourier transforms of the harmonic and percussive sources, respectively.
3. The method according to claim 1, characterized in that a similarity computation is performed on the extracted MFCC parameter matrix, and the similarity matrix between MFCC coefficients at different spectral lines is obtained as $S_{MFCC}(j_a,j_b) = \frac{\sum_{i=1}^{n} C(i,j_a)\, C(i,j_b)}{\sqrt{\sum_{i=1}^{n} C(i,j_a)^2}\,\sqrt{\sum_{i=1}^{n} C(i,j_b)^2}}$, where $i$ denotes the frequency bin, $n$ is the dimension, and $C(i,j_a)$, $C(i,j_b)$ are the MFCC coefficient matrices of the $i$-th frame at spectral lines $j_a$ and $j_b$.
4. The method according to claim 1, characterized in that, according to the formula $S(i,j) = \mathrm{median}_{l \in [1,x]}\{V(i,J_j(l))\}$, the repeating-structure model is invoked and the corresponding segments on the magnitude spectrum of the input signal are median-filtered, where median denotes the median filter, $x$ is the number of segments, $i$ the frequency bin, $j$ the frame index, and $V(i,J_j(l))$ the signal magnitude spectrum of the $l$-th repeated segment at frequency bin $i$.
5. The method according to any one of claims 1-3, characterized in that finding similar segments further comprises: limiting the maximum length and minimum length of the repeating structure according to the length of the music piece; determining a threshold according to the duration of the repeating segments and the music duration; and determining the similarity between segments from the similarity matrix, two segments whose similarity is greater than the threshold being similar segments.
6. The method according to any one of claims 1-3, characterized in that the time-domain waveforms of the singing voice and the background music are recovered by the formulas $\hat{x}_m = F^{-1}\{M \otimes F(x)\}$ and $\hat{x}_v = x - \hat{x}_m$, where $x$ is the original input music signal, $\hat{x}_m$ is the time-domain waveform of the background music, $\hat{x}_v$ is the time-domain waveform of the singing voice, and $F$ and $F^{-1}$ denote the Fourier transform and its inverse, respectively.
7. The method according to any one of claims 1-3, characterized in that the MFCC feature parameter $C_{MFCC}$ is determined according to the formula $C_{MFCC}(i) = \sqrt{\frac{2}{L}}\sum_{l=1}^{L-1}\log m(l)\cos\!\left(\left(l-\frac{1}{2}\right)\frac{i\pi}{L}\right)$, and the similarity matrix of the MFCC feature parameters, $S_{MFCC}(j_a,j_b) = \frac{\sum_{i=1}^{n} C(i,j_a)\, C(i,j_b)}{\sqrt{\sum_{i=1}^{n} C(i,j_a)^2}\,\sqrt{\sum_{i=1}^{n} C(i,j_b)^2}}$, is built to determine the similarity between MFCC coefficients at different spectral lines, where $m(l)$ is the energy of the signal through the filter, $L$ is the number of filters in the bank, $i$ denotes the frequency bin, $j$ the frame index, $C(i,j_a)$ is the MFCC coefficient matrix of the $i$-th frame at spectral line $j_a$, $C(i,j_b)$ that at spectral line $j_b$, and $n$ is the dimension of the MFCC.
CN201510023609.5A 2014-11-25 2015-01-16 Music separation method of MFCC (Mel Frequency Cepstrum Coefficient)-multi-repetition model in combination with HPSS (Harmonic/Percussive Sound Separation) CN104616663A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN2014106908405 2014-11-25
CN201410690840 2014-11-25
CN201510023609.5A CN104616663A (en) 2014-11-25 2015-01-16 Music separation method of MFCC (Mel Frequency Cepstrum Coefficient)-multi-repetition model in combination with HPSS (Harmonic/Percussive Sound Separation)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510023609.5A CN104616663A (en) 2014-11-25 2015-01-16 Music separation method of MFCC (Mel Frequency Cepstrum Coefficient)-multi-repetition model in combination with HPSS (Harmonic/Percussive Sound Separation)

Publications (1)

Publication Number Publication Date
CN104616663A true CN104616663A (en) 2015-05-13

Family

ID=53151084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510023609.5A CN104616663A (en) 2014-11-25 2015-01-16 Music separation method of MFCC (Mel Frequency Cepstrum Coefficient)-multi-repetition model in combination with HPSS (Harmonic/Percussive Sound Separation)

Country Status (1)

Country Link
CN (1) CN104616663A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012181475A (en) * 2011-03-03 2012-09-20 Univ Of Tokyo Method for extracting feature of acoustic signal and method for processing acoustic signal using the feature
KR20130037041A (en) * 2011-10-05 2013-04-15 목포대학교산학협력단 Mask-based multi-channel sound for marine sound recognition, acoustic model, which combines the separation study compensation
CN103680517A (en) * 2013-11-20 2014-03-26 华为技术有限公司 Method, device and equipment for processing audio signals

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
秦翔宇: "乐曲与歌声分离算法研究" (Research on separation algorithms for music and singing voice), Wanfang Data knowledge service platform, HTTP://D.WANFANGDATA.COM.CN/THESIS/D498158 *
龚君才 et al.: "一种基于隐马尔科夫模型的波形文件主旋律基频提取算法" (A hidden-Markov-model-based algorithm for extracting the fundamental frequency of the main melody from waveform files), 《软件》 (Software) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105096987A (en) * 2015-06-01 2015-11-25 努比亚技术有限公司 Audio data processing method and terminal
CN105070301A (en) * 2015-07-14 2015-11-18 福州大学 Multiple specific musical instrument strengthening separation method in single-channel music human voice separation
CN105070301B (en) * 2015-07-14 2018-11-27 福州大学 A variety of particular instrument idetified separation methods in the separation of single channel music voice
US9852745B1 (en) 2016-06-24 2017-12-26 Microsoft Technology Licensing, Llc Analyzing changes in vocal power within music content using frequency spectrums
US10043538B2 (en) 2016-06-24 2018-08-07 Microsoft Technology Licensing, Llc Analyzing changes in vocal power within music content using frequency spectrums
CN106024005A (en) * 2016-07-01 2016-10-12 腾讯科技(深圳)有限公司 Processing method and apparatus for audio data
WO2018001039A1 (en) * 2016-07-01 2018-01-04 腾讯科技(深圳)有限公司 Audio data processing method and apparatus
CN106024005B (en) * 2016-07-01 2018-09-25 腾讯科技(深圳)有限公司 A kind of processing method and processing device of audio data
CN106940996A (en) * 2017-04-24 2017-07-11 维沃移动通信有限公司 The recognition methods of background music and mobile terminal in a kind of video
CN107017005A (en) * 2017-04-27 2017-08-04 同济大学 A kind of binary channels language separation method based on DFT
TWI658458B (en) * 2018-05-17 2019-05-01 張智星 Method for improving the performance of singing voice separation, non-transitory computer readable medium and computer program product thereof

Similar Documents

Publication Publication Date Title
Zhao et al. CASA-based robust speaker identification
Kim et al. Auditory processing of speech signals for robust speech recognition in real-world noisy environments
Ryynänen et al. Automatic transcription of melody, bass line, and chords in polyphonic music
Klapuri Multipitch analysis of polyphonic music and speech signals using an auditory model
US9313593B2 (en) Ranking representative segments in media data
US7035742B2 (en) Apparatus and method for characterizing an information signal
US20050060153A1 (en) Method and appratus for speech characterization
Tsai et al. Automatic singer recognition of popular music recordings via estimation and modeling of solo vocal signals
Gillet et al. Transcription and separation of drum signals from polyphonic music
RU2418321C2 (en) Neural network based classfier for separating audio sources from monophonic audio signal
US20040074378A1 (en) Method and device for characterising a signal and method and device for producing an indexed signal
Shrawankar et al. Techniques for feature extraction in speech recognition system: A comparative study
Deshmukh et al. Use of temporal information: Detection of periodicity, aperiodicity, and pitch in speech
Zhang Automatic singer identification
Durrieu et al. A musically motivated mid-level representation for pitch estimation and musical audio source separation
Virtanen Speech recognition using factorial hidden Markov models for separation in the feature space
Mesaros et al. Singer identification in polyphonic music using vocal separation and pattern recognition methods.
Tiwari MFCC and its applications in speaker recognition
CN101599271B (en) Recognition method of digital music emotion
Cosi et al. Auditory modelling and self‐organizing neural networks for timbre classification
Regnier et al. Singing voice detection in music tracks using direct voice vibrato detection
CN103189913B (en) Method, apparatus for decomposing a multichannel audio signal
Shao et al. Model-based sequential organization in cochannel speech
Dhingra et al. Isolated speech recognition using MFCC and DTW
WO2003009273A1 (en) Method and device for characterising a signal and for producing an indexed signal

Legal Events

Date Code Title Description
PB01 Publication
C06 Publication
SE01 Entry into force of request for substantive examination
C10 Entry into substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20150513)