CN106919662A - Music recognition method and system - Google Patents

Music recognition method and system

Info

Publication number
CN106919662A
CN106919662A (application CN201710077359.2A)
Authority
CN
China
Prior art keywords
music
eigenmatrix
identified
snatch
similarity
Prior art date
Legal status
Granted
Application number
CN201710077359.2A
Other languages
Chinese (zh)
Other versions
CN106919662B (en)
Inventor
李伟
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201710077359.2A priority Critical patent/CN106919662B/en
Publication of CN106919662A publication Critical patent/CN106919662A/en
Application granted granted Critical
Publication of CN106919662B publication Critical patent/CN106919662B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Abstract

The invention discloses a music recognition method and system. The method includes: obtaining a snatch of music to be identified; extracting the mel cepstrum coefficients, mel cepstrum coefficient first-order differences, linear prediction cepstrum coefficients and perceptual linear prediction coefficients of each frame of audio in the snatch of music to be identified; forming the feature vector of each frame of audio from the above coefficients; combining the feature vectors of all frames to obtain the eigenmatrix of the snatch of music to be identified; comparing the eigenmatrix of the snatch of music to be identified with the sample musical feature matrices in a music library to obtain the maximum-similarity eigenmatrix, the maximum-similarity eigenmatrix being the sample musical feature matrix with the greatest similarity to the snatch of music to be identified; obtaining the music information of the sample musical feature matrix with the greatest similarity; and outputting the music information. The music recognition method and system provided by the present invention have better noise immunity and a higher recognition rate, and the recognition effect is better.

Description

Music recognition method and system
Technical field
The present invention relates to the field of music recognition, and more particularly to a music recognition method and system.
Background technology
With the flourishing of science and technology, and especially the emergence and rapid development of computer technology, Sound and Music Computing (SMC) has become an emerging interdisciplinary field, which mainly uses computational methods to understand, simulate and produce sound and music.
With the maturation and development of various technologies, people can obtain music through multiple channels, of which the network is the most convenient and fastest. This has led to explosive growth of the music available on the network, and people have become accustomed to retrieving and downloading music through the network. However, music spread through the Internet is often not marked with complete music information, which on the one hand makes management difficult and on the other hand causes confusion for those who obtain it. Meanwhile, in daily life, music is often simply played back, and the relevant information such as the song title and the singer does not always appear together with the music. For a music lover who has heard a piece of music, is interested in it, and wants to learn about and obtain it, this is a very troublesome problem. Accordingly, the retrieval of music has become a very important problem.
There are mainly two ways of retrieving music: one is text-based music retrieval, and the other is content-based music retrieval. In a text-based music retrieval system, the user submits keyword information such as the song title, the singer's name or a fragment of the lyrics, and the keywords are then used to retrieve and match the music information in a database. This technique is very mature and widely used, but its limitations are also fairly obvious: an unknown piece of audio for which no clear keyword information can be provided cannot be retrieved or recognized, and the keyword information provided by the user is also very prone to error, causing retrieval errors and failures.
Music recognition, namely content-based music retrieval, differs from text-based music retrieval in that it does not rely on the text information of the music but identifies the music directly from the content of the music sample fragment itself, so as to achieve the purpose of retrieval. Although music itself has complicated physical characteristics, every song has relatively stable features, and a piece of music can be characterized by these stable features; music recognition completes the identification according to these music features. Many features can be extracted from music, such as the number of beats per minute and the start and end time points, and different theories and methods yield different audio features. Different audio features have different characteristics, and appropriate features can be chosen for identification under different conditions. Music recognition overcomes the limitations of text-based music retrieval and focuses on the music itself; it has better adaptability and practicality and will increasingly become the main trend in music retrieval.
Music recognition, as a comparatively basic and practical problem in sound and music computing, has received attention both at home and abroad. The music recognition service of the Shazam company realizes music recognition by extracting a music fingerprint (Music Fingerprinting, MFP) and then matching it; it proposes a fingerprint extraction method based on feature points, i.e. feature points are found in the spectrum, and the sequence formed by combining the feature points into peak pairs (Peak-Pairs) is the fingerprint of the fragment. There is also a classical automatic music recognition system based on hidden Markov models, which clusters the extracted chroma features (Chroma Features) and then performs recognition using a hidden Markov model.
There is also a mass music retrieval system optimized on the basis of the Shazam work, which optimizes the fingerprint by adding three links to the fingerprint extraction process: spectrum optimization, peak filtering, and setting feature-point confidence. A linear alignment matching method has also been used to realize approximate melody matching and to build a query-by-singing retrieval system on this basis; this method is mainly based on the geometric similarity of the pitch contours of similar melodies, and is a new method formed by considering pitch and rhythm features together.
However, the above music recognition methods have a low recognition rate for music, and their recognition effect is unsatisfactory and needs further improvement.
Summary of the invention
It is an object of the present invention to provide a music recognition method and system that have better noise immunity, a higher recognition rate and a better recognition effect.
To achieve the above object, the present invention provides the following scheme:
A music recognition method, the method including:
Obtaining a snatch of music to be identified;
Extracting the mel cepstrum coefficients, mel cepstrum coefficient first-order differences, linear prediction cepstrum coefficients and perceptual linear prediction coefficients of each frame of audio in the snatch of music to be identified;
Using the mel cepstrum coefficients, mel cepstrum coefficient first-order differences, linear prediction cepstrum coefficients and perceptual linear prediction coefficients of the audio to form the feature vector of the audio;
Combining the feature vectors of each frame of the audio to obtain the eigenmatrix of the snatch of music to be identified;
Comparing the eigenmatrix of the snatch of music to be identified with the sample musical feature matrices in a music library to obtain the maximum-similarity eigenmatrix, the maximum-similarity eigenmatrix being the sample musical feature matrix with the greatest similarity to the snatch of music to be identified;
Obtaining the music information of the sample musical feature matrix with the greatest similarity;
Outputting the music information.
Optionally, after the obtaining of the snatch of music to be identified and before the extracting of the mel cepstrum coefficients, mel cepstrum coefficient first-order differences, linear prediction cepstrum coefficients and perceptual linear prediction coefficients of each frame of audio in the music to be identified, the method further includes:
Pretreating the snatch of music to be identified, the pretreatment including preemphasis, framing and windowing.
Optionally, the comparing of the eigenmatrix of the snatch of music to be identified with the sample musical feature matrices in the music library to obtain the sample musical feature matrix with the greatest similarity specifically includes:
Intercepting, from the sample musical feature matrix B = [tz(1); tz(2); …; tz(M)], matrices B_k = [tz(k); tz(k+1); …; tz(k+N-1)] having the same number of rows as the eigenmatrix of the snatch of music to be identified, there being a plurality of such matrices, wherein k = 1, 2, …, M-N+1, Δk = 1, tz(1), tz(2), …, tz(M) are the feature vectors of each frame of audio in the sample musical feature matrix B, M is the number of feature vectors in the musical feature matrix B, and N is the number of rows of the eigenmatrix of the snatch of music to be identified, and labelling each matrix B_k as an eigenmatrix to be compared;
Calculating the similarity between each eigenmatrix to be compared and the eigenmatrix of the snatch of music to be identified, and obtaining the eigenmatrix to be compared with the greatest similarity to the eigenmatrix of the snatch of music to be identified, the musical feature matrix to which that eigenmatrix to be compared belongs being the maximum-similarity musical feature matrix.
Optionally, before the intercepting, from the musical feature matrix B = [tz(1); tz(2); …; tz(M)], of the matrices B_k = [tz(k); tz(k+1); …; tz(k+N-1)] having the same number of rows as the eigenmatrix of the snatch of music to be identified, wherein Δk = 1, k = 1, 2, …, M-N+1, tz(1), tz(2), …, tz(M) are the feature vectors of each frame of audio in the musical feature matrix B, M is the number of feature vectors in the musical feature matrix B, and N is the number of rows of the eigenmatrix of the snatch of music to be identified, the method further includes:
Judging whether the calculation needs to be completed within a preset time and whether the accuracy requirement is below a preset threshold;
If so, setting Δk to an integer greater than 1.
Optionally, the calculating of the similarity between the eigenmatrix to be compared and the eigenmatrix of the snatch of music to be identified specifically includes:
Calculating the similarity between the eigenmatrix to be compared and the eigenmatrix of the snatch of music to be identified using the matrix absolute value distance method, the cosine law method of the vector space model, or a method combining the cosine law of the vector space model with the Euclidean distance.
The present invention also provides a music recognition system, the system including:
a snatch-of-music acquisition module, configured to obtain a snatch of music to be identified;
a parameter extraction module, configured to extract the mel cepstrum coefficients, mel cepstrum coefficient first-order differences, linear prediction cepstrum coefficients and perceptual linear prediction coefficients of each frame of audio in the snatch of music to be identified;
a feature vector determining module, configured to use the mel cepstrum coefficients, mel cepstrum coefficient first-order differences, linear prediction cepstrum coefficients and perceptual linear prediction coefficients of the audio to form the feature vector of the audio;
an eigenmatrix determining module, configured to combine the feature vectors of each frame of the audio to obtain the eigenmatrix of the snatch of music to be identified;
a matching module, configured to compare the eigenmatrix of the snatch of music to be identified with the sample musical feature matrices in a music library to obtain the maximum-similarity eigenmatrix, the maximum-similarity eigenmatrix being the sample musical feature matrix with the greatest similarity to the snatch of music to be identified;
a music information acquisition module, configured to obtain the music information of the sample musical feature matrix with the greatest similarity;
a music information output module, configured to output the music information.
Optionally, the system also includes:
a pretreatment module, configured to pretreat the snatch of music to be identified, the pretreatment including preemphasis, framing and windowing.
Optionally, the matching module specifically includes:
a matrix interception unit, configured to intercept, from the sample musical feature matrix B = [tz(1); tz(2); …; tz(M)], matrices B_k = [tz(k); tz(k+1); …; tz(k+N-1)] having the same number of rows as the eigenmatrix of the snatch of music to be identified, there being a plurality of such matrices, wherein k = 1, 2, …, M-N+1, Δk = 1, tz(1), tz(2), …, tz(M) are the feature vectors of each frame of audio in the sample musical feature matrix B, M is the number of feature vectors in the musical feature matrix B, and N is the number of rows of the eigenmatrix of the snatch of music to be identified, and to label each matrix B_k as an eigenmatrix to be compared;
a similarity calculation unit, configured to calculate the similarity between each eigenmatrix to be compared and the eigenmatrix of the snatch of music to be identified, and to obtain the eigenmatrix to be compared with the greatest similarity to the eigenmatrix of the snatch of music to be identified, the musical feature matrix to which that eigenmatrix to be compared belongs being the maximum-similarity musical feature matrix.
Optionally, the matching module further includes:
a judging unit, configured to judge whether the calculation needs to be completed within a preset time and whether the accuracy requirement is below a preset threshold;
a setting unit, configured to set Δk to an integer greater than 1 when the calculation needs to be completed within the preset time and the accuracy requirement is below the preset threshold.
Optionally, the similarity calculation unit specifically includes:
a similarity calculation subunit, configured to calculate the similarity between the eigenmatrix to be compared and the eigenmatrix of the snatch of music to be identified using the matrix absolute value distance method, the cosine law method of the vector space model, or a method combining the cosine law of the vector space model with the Euclidean distance.
According to the specific embodiments provided by the present invention, the invention discloses the following technical effects: the present invention uses the three basic features of mel cepstrum coefficients, linear prediction cepstrum coefficients and perceptual linear prediction to carry out a comprehensive identification of music; the three classes of audio features have different characteristics and advantages, and a better recognition result can be obtained after combining them. Because perceptual linear prediction simulates the masking effect of the human ear, it has better noise immunity and a higher recognition rate, and the recognition effect obtained is better.
Brief description of the drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the accompanying drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention, and for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flow chart of the music recognition method of an embodiment of the present invention;
Fig. 2 is a structural schematic diagram of the music recognition system of an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.
It is an object of the present invention to provide a music recognition method and system whose accuracy and noise immunity are significantly improved.
To make the above objects, features and advantages of the present invention more apparent and easier to understand, the present invention is further described in detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a flow chart of the steps of the music recognition method of an embodiment of the present invention. As shown in Fig. 1, the steps of the music recognition method are as follows:
Step 101: obtain a snatch of music to be identified;
Step 102: extract the mel cepstrum coefficients, mel cepstrum coefficient first-order differences, linear prediction cepstrum coefficients and perceptual linear prediction coefficients of each frame of audio in the snatch of music to be identified;
Step 103: use the mel cepstrum coefficients, mel cepstrum coefficient first-order differences, linear prediction cepstrum coefficients and perceptual linear prediction coefficients of the audio to form the feature vector of the audio;
Step 104: combine the feature vectors of each frame of the audio to obtain the eigenmatrix of the snatch of music to be identified;
Step 105: compare the eigenmatrix of the snatch of music to be identified with the sample musical feature matrices in the music library to obtain the maximum-similarity eigenmatrix, the maximum-similarity eigenmatrix being the sample musical feature matrix with the greatest similarity to the snatch of music to be identified;
Step 106: obtain the music information of the sample musical feature matrix with the greatest similarity;
Step 107: output the music information.
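The overall flow of steps 101–107 can be summarised by the following Python-style sketch. It is only a structural outline under the assumption that per-frame feature extraction and matrix similarity are supplied separately; the helper names extract_frame_features and similarity are illustrative and not part of the patent.

```python
import numpy as np

def build_eigenmatrix(frames, extract_frame_features):
    """Steps 102-104: one 52-dim feature vector per frame, stacked into an N x 52 matrix."""
    return np.vstack([extract_frame_features(f) for f in frames])

def identify(frames, library, extract_frame_features, similarity):
    """Steps 101-107: compare the clip's eigenmatrix with every sample matrix in the library."""
    A = build_eigenmatrix(frames, extract_frame_features)   # eigenmatrix of the clip
    best_info, best_score = None, -np.inf
    for info, B in library:                                  # (music info, M x 52 matrix)
        score = similarity(A, B)                             # step 105
        if score > best_score:
            best_info, best_score = info, score
    return best_info                                         # steps 106-107
```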
Wherein, after step 101 and before step 102, the method further includes:
pretreating the snatch of music to be identified, the pretreatment including preemphasis, framing and windowing.
Step 105 specifically includes:
judging whether the calculation needs to be completed within a preset time and whether the accuracy requirement is below a preset threshold;
if so, setting Δk to an integer greater than 1;
intercepting, from the sample musical feature matrix B = [tz(1); tz(2); …; tz(M)], matrices B_k = [tz(k); tz(k+1); …; tz(k+N-1)] having the same number of rows as the eigenmatrix of the snatch of music to be identified, there being a plurality of such matrices, wherein k = 1, 2, …, M-N+1, Δk = 1 by default, tz(1), tz(2), …, tz(M) are the feature vectors of each frame of audio in the sample musical feature matrix B, M is the number of feature vectors in the musical feature matrix B, and N is the number of rows of the eigenmatrix of the snatch of music to be identified, and labelling each matrix B_k as an eigenmatrix to be compared;
calculating the similarity between each eigenmatrix to be compared and the eigenmatrix of the snatch of music to be identified, and obtaining the eigenmatrix to be compared with the greatest similarity to the eigenmatrix of the snatch of music to be identified, the musical feature matrix to which that eigenmatrix to be compared belongs being the maximum-similarity musical feature matrix. When calculating the similarity between the eigenmatrix to be compared and the eigenmatrix of the snatch of music to be identified, the matrix absolute value distance method, the cosine law method of the vector space model, or a method combining the cosine law of the vector space model with the Euclidean distance is used.
As a preferred embodiment, the pretreatment includes three operations: preemphasis, framing and windowing.
A) preemphasis treatment
Preemphasis passes the audio signal through a high-pass filter whose transfer function is
H(z) = 1 - μz^(-1),
where the parameter μ usually lies between 0.9 and 1.0 and is generally taken as 0.97. The purpose of preemphasis is to boost the high-frequency components of the audio signal, make its spectrum flatter and strengthen the vocal tract characteristics, in preparation for spectrum analysis or vocal tract parameter analysis.
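A minimal numpy sketch of the preemphasis filter y(n) = x(n) − μ·x(n−1) described above, assuming the signal is already a 1-D float array (μ = 0.97 as in the text):

```python
import numpy as np

def preemphasis(x, mu=0.97):
    """Apply the high-pass filter H(z) = 1 - mu*z^-1 to a 1-D audio signal."""
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - mu * x[:-1]
    return y
```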
B) framing
As a whole, an audio signal is a non-stationary, stochastic process, but over short intervals (usually 10 ms–30 ms) the audio signal shows a certain stability. Previous research has therefore paid more attention to short-time analysis of the quasi-stationary characteristics of audio. The continuous audio signal is divided into short segments containing an equal number of sampling points, i.e. frames, and the three audio features to be extracted are analysed on a frame basis. Framing preserves the short-time stationarity of the audio signal and lays the foundation for short-time analysis. Meanwhile, to keep the continuity and dynamics of the audio signal, adjacent frames are required to overlap, and the overlap ratio between frames is typically about 50%–80%. The system uses 44100 Hz audio, takes 512 sampling points as one frame, corresponding to a frame length of (512 ÷ 44100) × 1000 = 11.61 ms, and uses a frame overlap of 50%, i.e. 256 sampling points.
C) Windowing
Framing a continuous audio signal truncates what is ideally an infinite signal, which causes spectral energy leakage. To reduce the spectral energy leakage and avoid possible signal discontinuities at the two ends of each frame, each frame must be windowed. Three window functions are commonly used:
(1) Rectangular window (Rectangular Window)
(2) Hamming window (Hamming Window)
(3) Hanning window (Hann Window)
In terms of window functions, the Hamming window and the Hanning window are both generalized raised-cosine windows. From the point of view of reducing spectral energy leakage, the Hanning window is better than the rectangular window but has lower frequency resolution. Compared with the rectangular window, the Hamming window is better at reducing spectral energy leakage, and its frequency resolution is better than that of the Hanning window. Therefore, the system uses the Hamming window for windowing. The current standard procedure for extracting mel cepstrum coefficients also uses the Hamming window.
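A sketch of the framing and Hamming windowing described above, using the parameters given in the text (512-sample frames, 50% overlap); numpy is assumed:

```python
import numpy as np

def frame_and_window(x, frame_len=512, overlap=0.5):
    """Split a 1-D signal into overlapping frames and apply a Hamming window to each."""
    hop = int(frame_len * (1.0 - overlap))           # 256 samples at 50% overlap
    n_frames = 1 + (len(x) - frame_len) // hop
    win = np.hamming(frame_len)
    frames = np.stack([x[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    return frames                                     # shape: (n_frames, frame_len)
```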
As a preferred embodiment of the present invention, the mel cepstrum coefficients, mel cepstrum coefficient first-order differences, linear prediction cepstrum coefficients and perceptual linear prediction coefficients of each frame of audio in the snatch of music to be identified are extracted.
1. Mel cepstrum coefficients (MFCC) and their first-order difference
Mel cepstrum coefficients are also known as mel-frequency cepstral coefficients. The frequency bands of the mel-frequency cepstrum are equally spaced on the Mel scale; compared with the linearly spaced bands used in the ordinary cepstrum, they are closer to the human auditory system and can represent sound better from the point of view of human hearing. A commonly used formula for converting a frequency f to a mel frequency m is
m = 2595 · log10(1 + f / 700).
The procedure for extracting the mel cepstrum coefficients is as follows:
a) The Fast Fourier Transform (FFT) is a fast algorithm for the Discrete Fourier Transform (DFT). For an N-point sequence {x[n]}, 0 ≤ n < N, its discrete Fourier transform is
X[k] = Σ_{n=0}^{N-1} x[n] e^(-j2πnk/N), 0 ≤ k < N.
Accordingly, the discrete Fourier transform of the audio signal is written as
X(k) = Σ_{n=0}^{N-1} x(n) e^(-j2πnk/N),
where x(n) is the input audio signal and N is the number of points of the discrete Fourier transform.
Because acoustic characteristics are difficult to observe in the time domain, the signal is usually converted to the frequency domain by the discrete Fourier transform in order to obtain its energy distribution over the spectrum, which can then be processed further to obtain the characteristics of the audio. In practice, the spectrum of each pretreated frame is obtained by the Fast Fourier Transform, the squared magnitude of the spectrum is then taken, and finally the power spectrum of each frame signal is obtained.
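The per-frame power spectrum computation described above (FFT, then squared magnitude) can be sketched as follows; a real-input FFT is used since the windowed frames are real-valued:

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """FFT each windowed frame and return |X(k)|^2 (one row per frame)."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=-1)   # shape: (n_frames, n_fft//2 + 1)
    return np.abs(spectrum) ** 2
```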
b) The Mel filter bank is a group of M triangular band-pass filters distributed uniformly on the Mel scale, where M usually takes 24; the centre frequencies are f(m) (m = 1, 2, …, M), and the spacing between the f(m) widens as m increases. The frequency response of the m-th triangular band-pass filter is
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)) for f(m-1) ≤ k ≤ f(m),
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)) for f(m) ≤ k ≤ f(m+1),
and H_m(k) = 0 elsewhere, where f(m-1) and f(m+1) are the centre frequencies of the adjacent filters.
Because the range covered by each triangular band-pass filter in the filter bank approximates a critical bandwidth of the human ear, the filter bank is used to filter the power spectrum of each frame signal in order to simulate the masking effect of the human ear.
c) The following operation is applied to the output of the Mel filter bank to obtain the logarithmic energy of each filter:
s(m) = ln( Σ_k |X(k)|² H_m(k) ), m = 1, 2, …, M.
d) A discrete cosine transform is then applied to the logarithmic energies obtained above:
C(n) = Σ_{m=1}^{M} s(m) cos( πn(m - 0.5) / M ), n = 1, 2, …, L,
where L is the order of the mel cepstrum coefficients, usually 12–16. Because the 0th-order mel cepstrum coefficient reflects the spectral energy, it is usually not used, and the following 12–16 coefficients are taken as the mel cepstrum coefficients. The mel cepstrum coefficients extracted by the system are of order 12.
e) Because the mel cepstrum coefficients only reflect the static characteristics of the audio, the first-order difference of the mel cepstrum coefficients is calculated in order to describe its dynamic characteristics. The first-order difference is computed as
d_t = ( Σ_{k=1}^{K} k (c_{t+k} - c_{t-k}) ) / ( 2 Σ_{k=1}^{K} k² ),
where d_t is the t-th first-order difference, c_t is the t-th mel cepstrum coefficient, Q is the order of the mel cepstrum coefficients, and K is the time span of the first derivative, which can take the value 1 or 2.
f) The mel cepstrum coefficients and their first-order differences are merged into a 24-dimensional vector, which is taken as the first audio feature.
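As an illustration, the 12 mel cepstrum coefficients and their first-order differences described in steps a)–f) can be obtained with an off-the-shelf library such as librosa. This is one possible implementation and not the patent's own code; the parameter choices follow the text (44100 Hz, 512-sample frames, 256-sample hop, 0th coefficient dropped):

```python
import numpy as np
import librosa

def mfcc_and_delta(y, sr=44100, n_mfcc=13, n_fft=512, hop_length=256):
    """Return an (n_frames x 24) matrix: 12 MFCCs (0th order dropped) plus their deltas."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)  # (13, n_frames)
    mfcc = mfcc[1:, :]                       # drop the 0th coefficient (spectral energy)
    delta = librosa.feature.delta(mfcc)      # first-order difference
    return np.vstack([mfcc, delta]).T        # (n_frames, 24)
```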
2. Linear prediction cepstrum coefficients (LPCC)
The linear prediction cepstrum coefficients are obtained from the linear prediction coefficients (LPC). The basic idea of linear prediction is to use the correlation between audio samples, so that several past sample values predict the present or a future sample value. The procedure for extracting the linear prediction cepstrum coefficients is as follows:
A) LPC parameters are solved
The vocal tract characteristics are modelled with an all-pole model whose transfer function is
H(z) = G / (1 + Σ_{k=1}^{p} a_k z^(-k)),
where p is the order of the linear prediction, a_k (k = 1, 2, …, p) are the linear prediction coefficients, and G is the gain of the vocal tract filter.
If a frame of the audio signal is x(n), the linear prediction of this audio is
x̂(n) = -Σ_{k=1}^{p} a_k x(n - k).
The error between the frame of the audio signal and the linear prediction result is
e(n) = x(n) - x̂(n).
The linear prediction coefficients are obtained by minimizing the mean square error between the actual audio samples and the linear prediction result, i.e. the following expression is minimized:
E = Σ_n e²(n).
To minimize the above expression, the partial derivative with respect to each a_k is taken and set to zero; after simplification, the following normal equations need to be solved:
Σ_{k=1}^{p} a_k R(|i - k|) = -R(i), i = 1, 2, …, p,
where R(k) is the autocorrelation coefficient of the sequence x(n). The above equations can be rewritten in the following matrix form:
R_p · a_p = -r_p
where R_p is the p × p Toeplitz matrix whose (i, j) element is R(|i - j|),
a_p = [a_p1, a_p2, …, a_pp]^T,
r_p = [R(1), R(2), …, R(p)]^T.
The Levinson-Durbin algorithm, i.e. the recursive solution of the equations based on the autocorrelation, is then used to solve for the linear prediction coefficients.
It can be observed that R_p is a Toeplitz matrix, in which all the elements on each diagonal are equal. Levinson-Durbin exploits this property to compute iteratively: the low-order result is computed first and is then used to compute the higher-order result. For an m-th order linear prediction, the linear prediction coefficients can be obtained by the following recursion:
a_mk = a_(m-1)k + k_m · a_(m-1)(m-k) = a_(m-1)k + a_mm · a_(m-1)(m-k), k = 1, 2, …, m-1,
where a_(m-1) denotes the (m-1)-th order prediction coefficients, k_m = a_mm is the m-th reflection coefficient, r_(m-1) = [R(1), R(2), …, R(m-1)]^T, and the superscript b denotes the arrangement of the elements of r_(m-1) in reverse order.
According to this algorithm, the linear prediction coefficients can be obtained.
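A sketch of solving the LPC coefficients of one frame from its autocorrelation, using scipy's Toeplitz solver (which implements the Levinson-Durbin recursion) rather than an explicit recursion; the sign convention matches the normal equations R_p · a_p = -r_p above, and numpy/scipy are assumed to be available:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_coefficients(frame, order=12):
    """Solve the normal equations R_p a_p = -r_p for one windowed audio frame."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # autocorrelation R(0), R(1), ...
    a = solve_toeplitz(r[:order], -r[1:order + 1])                # Levinson-Durbin-based solver
    return a                                                      # a_1 ... a_p
```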
B) LPCC parameters are solved
The cepstrum generally refers to the inverse Fourier transform of the logarithm of the power spectrum; the mathematical expression of the cepstrum is
c(q) = IDFT( log( |S(f)|² ) ),
where S(f) is the Fourier transform of the signal s(t), log(·) denotes taking the logarithm, and IDFT(·) is the inverse Fourier transform.
Given the definitions of the linear prediction coefficients and of the cepstrum, the linear prediction cepstrum coefficients can be computed from the linear prediction coefficients by a recursion. The specific recursion is of the form
c(1) = a_1,
c(n) = a_n + Σ_{k=1}^{n-1} (k/n) c(k) a_(n-k), 1 < n ≤ p,
c(n) = Σ_{k=n-p}^{n-1} (k/n) c(k) a_(n-k), n > p,
where a_n is the n-th order linear prediction coefficient.
The linear prediction cepstrum coefficient vector c is obtained from this recursion, i.e.
c = [c(1), c(2), …, c(q)], 10 ≤ q ≤ 16,
where q is the order of the linear prediction cepstrum coefficients and is taken as 12 in the present invention.
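The LPC-to-LPCC recursion sketched above can be written as follows. The sign convention of the a-coefficients is an assumption and may need to be flipped depending on how the LPC step defines them:

```python
import numpy as np

def lpc_to_lpcc(a, q=12):
    """Convert LPC coefficients a[0..p-1] (i.e. a_1..a_p) into q cepstral coefficients."""
    p = len(a)
    c = np.zeros(q)
    for n in range(1, q + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c
```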
3. Perceptual linear prediction (PLP)
Perceptual linear prediction is a characteristic parameter based on an auditory model. It also uses the all-pole model analysis of linear prediction, but, unlike linear prediction, it applies conclusions about the human auditory model to the spectrum analysis: through approximate calculations, the audible spectrum is made to match the perception mechanism of the human auditory system, which improves the robustness of the audio feature.
Perceptual linear prediction imitates the auditory perception mechanism of the human ear at three levels: critical-band analysis, equal-loudness preemphasis, and intensity-to-loudness conversion. The procedure for extracting the perceptual linear prediction feature is as follows:
A) discrete Fourier transform DFT
Similar to the extraction of the mel cepstrum coefficients (MFCC), the Fast Fourier Transform is used to obtain the spectrum of each frame of the audio signal, and the squared magnitude of the spectrum gives the power spectrum of each frame. After this process, the power spectrum P(f) of the audio signal is obtained.
B) critical band analysis
The masking effect of the human auditory system is an important characteristic: when two sounds of unequal loudness reach the ear, the louder sound makes the quieter sound hard to perceive. In research, critical-band analysis is an embodiment of the masking effect. The critical band refers to the following: when a pure tone is masked by continuous noise centred on its frequency and having a certain bandwidth, if the power at which the pure tone can just be heard equals the noise power within that band, then that bandwidth is called the critical bandwidth. The unit of the critical band is the Bark.
In order for the PLP feature to approximately imitate the perception mechanism of the human auditory system, critical-band analysis is required: the frequency axis f of the power spectrum is mapped to the Bark domain, a commonly used mapping being
z(f) = 6 ln( f/600 + sqrt( (f/600)² + 1 ) ).
The audio sampling rate used by the system is 44100 Hz, and 30 frequency bands are obtained after substitution. The mapped power spectrum is then convolved with the simulated critical-band curve Ψ(z) to obtain the critical-band power spectrum θ(k):
θ(k) = Σ_z P(z) Ψ(z - z_0(k)),
where the simulated critical-band curve Ψ(z) is an asymmetric piecewise-exponential curve that approximates the auditory masking curve, and z_0(k) denotes the centre frequency of the k-th band of the critical-band power spectrum.
c) Equal-loudness preemphasis
Loudness describes the subjective perception of sound by the human ear; the outer ear and middle ear boost sounds in the 1–5 kHz range by about 10–20 dB. In order to imitate this characteristic of the human ear, equal-loudness preemphasis must be performed. A simulated 40 dB equal-loudness contour is used for the preemphasis, because it reflects the human perception of noise loudness well regardless of the noise intensity and is widely used as an evaluation criterion for noise.
The simulated equal-loudness contour E[f] is used to preemphasize the critical-band power spectrum:
τ(k) = E[f_0(k)] · θ(k), k = 1, 2, …, 30,
where f_0(k) denotes the frequency corresponding to the centre frequency of the k-th band of the critical-band power spectrum.
D) Intensity-to-loudness conversion
Because the relation between the intensity of a sound and the loudness perceived by the ear is nonlinear, a conversion between intensity and loudness is needed in order to simulate this relation; a cube-root compression is applied:
δ(k) = τ(k)^(1/3).
E) Inverse discrete Fourier transform (IDFT)
The inverse discrete Fourier transform is the inverse of the discrete Fourier transform. Here the inverse discrete Fourier transform of δ(k) is taken to obtain the short-time autocorrelation function of the audio signal, in preparation for the subsequent linear prediction analysis.
For an N-point sequence {x[n]}, 0 ≤ n < N, the inverse discrete Fourier transform is
x[n] = (1/N) Σ_{k=0}^{N-1} X[k] e^(j2πnk/N).
F) all-pole modeling
Using the all-pole model, linear prediction analysis is performed on the result of the inverse discrete Fourier transform of δ(k), in the same way as when solving the linear prediction coefficients. The 12th-order linear prediction coefficients are solved with the Levinson-Durbin algorithm, and then, using the same method as for solving the linear prediction cepstrum coefficients, 16th-order cepstrum coefficients are solved; the result is the perceptual linear prediction coefficients.
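The PLP chain in steps a)–f) can be sketched, in a greatly simplified form, as follows. The Bark-band grouping here uses rectangular bands instead of the smoothed critical-band curve and the equal-loudness weighting is omitted, so this is only an assumption-laden illustration of the order of operations, not the patent's exact computation; it reuses the lpc_to_lpcc recursion from the LPCC sketch above.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def hz_to_bark(f):
    return 6.0 * np.arcsinh(f / 600.0)               # Bark mapping commonly used in PLP analysis

def plp_cepstrum(power_spec, sr=44100, n_bands=30, lpc_order=12, n_ceps=16):
    """Simplified PLP: Bark grouping -> cube-root loudness -> IDFT -> all-pole model -> cepstrum."""
    n_bins = len(power_spec)
    freqs = np.linspace(0, sr / 2, n_bins)
    edges = np.linspace(0, hz_to_bark(sr / 2), n_bands + 1)
    band_idx = np.digitize(hz_to_bark(freqs), edges) - 1
    theta = np.array([power_spec[band_idx == b].sum() for b in range(n_bands)])
    loud = np.cbrt(theta)                             # intensity-to-loudness (cube root)
    # short-time autocorrelation via inverse FFT of the auditory spectrum
    r = np.fft.irfft(loud, n=2 * (n_bands - 1))[:lpc_order + 1]
    a = solve_toeplitz(r[:lpc_order], -r[1:lpc_order + 1])
    return lpc_to_lpcc(a, q=n_ceps)                   # cepstral conversion, as for LPCC
```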
As a preferred embodiment of the present invention, the mel cepstrum coefficients, mel cepstrum coefficient first-order differences, linear prediction cepstrum coefficients and perceptual linear prediction coefficients of the audio are used to form the feature vector of the audio, and the feature vectors of each frame of the audio are combined to obtain the eigenmatrix of the snatch of music to be identified.
For each frame of audio, the 12-dimensional mel cepstrum coefficients m, the 12-dimensional mel cepstrum coefficient first-order differences Δm, the 12-dimensional linear prediction cepstrum coefficients l and the 16-dimensional perceptual linear prediction coefficients p can be extracted:
m = [m_1 m_2 … m_12]
Δm = [Δm_1 Δm_2 … Δm_12]
l = [l_1 l_2 … l_12]
p = [p_1 p_2 … p_16]
The above four features are merged into a 52-dimensional overall audio feature vector tz:
tz = [m Δm l p].
For a segment of audio, an eigenmatrix A of size N × 52 can therefore be extracted, A = [tz(1); tz(2); …; tz(N)], where N denotes the total number of frames and is determined by the audio duration, sampling rate, frame length and frame shift.
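Assembling the four per-frame features into the 52-dimensional vector tz and the N × 52 eigenmatrix A can be sketched as follows; the four extractor outputs are placeholders for the MFCC/ΔMFCC, LPCC and PLP computations described above:

```python
import numpy as np

def frame_feature_vector(mfcc12, dmfcc12, lpcc12, plp16):
    """tz = [m, delta m, l, p]: 12 + 12 + 12 + 16 = 52 dimensions."""
    return np.concatenate([mfcc12, dmfcc12, lpcc12, plp16])

def clip_eigenmatrix(per_frame_features):
    """Stack the per-frame 52-dim vectors into the N x 52 eigenmatrix A."""
    return np.vstack([frame_feature_vector(*f) for f in per_frame_features])
```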
The combined audio feature based on the mel cepstrum coefficients, linear prediction cepstrum coefficients and perceptual linear prediction is proposed as the music recognition criterion because of the different characteristics these features have. The mel cepstrum coefficients are based on an auditory model: according to the mechanism of the human ear, the result is obtained through the mel frequency, which has a nonlinear relation with the frequency in hertz, and they therefore have good noise immunity. The linear prediction cepstrum coefficients are based on a vocal tract model: according to the principle of sound production, the result is obtained through the linear approximation of an all-pole model, and they reflect the vocal tract characteristics and formant character of the audio signal. Perceptual linear prediction is based on an auditory model: according to the mechanism of the human ear, the result is obtained by simulating the characteristics of the human ear in combination with linear prediction; it reflects the masking characteristics of the human ear and has good noise immunity.
The four chosen features each have their own advantages and are complementary, so the overall audio feature formed by combining the mel cepstrum coefficients and their first-order differences, the linear prediction cepstrum coefficients and the perceptual linear prediction coefficients is used as the recognition criterion; a more comprehensive and more accurate retrieval and matching can be carried out, and the music recognition rate and noise immunity are improved.
As a preferred embodiment, the eigenmatrix of the snatch of music to be identified is compared with the sample musical feature matrices in the music library to obtain the maximum-similarity eigenmatrix, the maximum-similarity eigenmatrix being the sample musical feature matrix with the greatest similarity to the snatch of music to be identified.
Matching means comparing the eigenmatrix of the snatch of music with the eigenmatrix of every sample music item in the audio library, finding the closest eigenmatrix and then obtaining the information of the sample music to which that matrix belongs, namely:
- the eigenmatrix A of the unknown audio sample fragment;
- the audio library provides a series of eigenmatrices B_k(n) used for matching and comparison;
- find the n corresponding to the matrix most similar to the eigenmatrix A of the unknown audio sample fragment.
For example, the eigenmatrix of the snatch of music is A, a matrix of size N × 52, and the eigenmatrix of any piece of music in the audio library is B, a matrix of size M × 52, with M ≥ N:
Submatrices B_k having the same number of rows as the matrix A are intercepted from the matrix B in turn, B_k = [tz(k); tz(k+1); …; tz(k+N-1)],
where k denotes the position of the first row of the submatrix B_k in the eigenmatrix B, and the step Δk of the values of k defaults to 1, i.e. the matrices B_k and B_(k+1) have N-1 rows in common. Completing the identification requires comparing M-N+1 matrices B_k with the matrix A in turn and calculating the similarity ψ(k).
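The interception of the submatrices B_k described above amounts to sliding an N-row window down the library matrix B; a numpy sketch, with the step Δk exposed as a parameter and defaulting to 1 as in the text:

```python
import numpy as np

def sliding_submatrices(B, N, delta_k=1):
    """Yield (k, B_k) where B_k = B[k-1 : k-1+N] for k = 1, 1+delta_k, ..., up to M-N+1."""
    M = B.shape[0]
    for k in range(1, M - N + 2, delta_k):
        yield k, B[k - 1 : k - 1 + N]
```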
There are various methods for calculating the similarity of two matrices; three methods are outlined below:
A) matrix absolute value distance
There are various methods for calculating the degree of similarity of two matrices; the simplest is to compute the absolute value distance between the two matrices.
The matrices A and B_k, both of size N × 52, are subtracted to obtain the matrix difference, and the absolute value of the difference gives the matrix C:
C=| A-Bk|
Then all the elements of the matrix C are summed to obtain the degree of difference ψ(k) of the two matrices:
ψ(k) = C(1,1) + C(1,2) + C(1,3) + … + C(i,j) + … + C(N,52),
1 ≤ i ≤ N, 1 ≤ j ≤ 52.
Although this method is relatively easy to implement and intuitive to understand, it is easily affected by zero values: once a long pause occurs in the music, the matching effect will be affected.
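Method a) as a numpy sketch (note that here ψ(k) is a difference, so smaller values mean more similar matrices):

```python
import numpy as np

def abs_distance(A, Bk):
    """Sum of element-wise absolute differences between two N x 52 matrices."""
    return float(np.abs(A - Bk).sum())
```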
B) cosine law of vector space model
It can be considered that the similarity of two matrices is, in essence, a comparison of the degree of difference between the values at corresponding positions of the two matrices; as long as the correspondence of the values does not change, the form of the matrices may be changed. The two matrices are therefore reshaped into two high-dimensional vectors, and then, by analogy with the way text similarity is computed in natural language processing, the audio similarity is calculated using the cosine law of the Vector Space Model.
According to the cosine law of the vector space model, the similarity of two audio segments can be expressed by the relative position of their audio features in an N-dimensional space, and the relative position of the audio features is expressed by the cosine of the angle between the two vectors: the smaller the angle between them, the greater the similarity.
The matrices A and B_k, both of size N × 52, are converted into two matrices of size 1 × 52N, i.e. two 52N-dimensional vectors a1 and a2:
a1((i-1) * 52+j)=A (i, j), 1≤i≤N, 1≤j≤52
a2((i-1) * 52+j)=Bk(i, j), 1≤i≤N, 1≤j≤52
According to the cosine law of space vectors, the similarity ψ(k) of the matrix A and the matrix B_k is calculated as
ψ(k) = (a1 · a2) / (‖a1‖ · ‖a2‖),
where the vector a2 changes as the matrix B_k changes.
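Method b) as a numpy sketch: both matrices are flattened into 52N-dimensional vectors and compared by the cosine of the angle between them.

```python
import numpy as np

def cosine_similarity(A, Bk):
    """Cosine of the angle between the flattened feature matrices."""
    a1, a2 = A.ravel(), Bk.ravel()
    return float(np.dot(a1, a2) / (np.linalg.norm(a1) * np.linalg.norm(a2)))
```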
C) cosine law & Euclidean distances of vector space model
Similar to the previous method, the cosine law of the vector space model is used to obtain the similarity of the matrices; the difference is that the eigenmatrix is first divided into three submatrices according to the dimensions of the three classes of audio features, the similarities between the audio-sample submatrices and the audio-library submatrices are computed, and the similarity of the eigenmatrices is then obtained through the Euclidean distance.
The matrices A and B_k, both of size N × 52, are first divided into three submatrices each according to the dimensions of the three classes of features: A_1 and B_k1 of size N × 24, A_2 and B_k2 of size N × 12, and A_3 and B_k3 of size N × 16. The six submatrices obtained are converted into six vectors of different sizes:
a1((i-1) * 24+j)=A1(i, j), 1≤i≤N, 1≤j≤24
a2((i-1) * 12+j)=A2(i, j), 1≤i≤N, 1≤j≤12
a3((i-1) * 16+j)=A3(i, j), 1≤i≤N, 1≤j≤16
b1((i-1) * 24+j)=Bk1(i, j), 1≤i≤N, 1≤j≤24
b2((i-1) * 12+j)=Bk2(i, j), 1≤i≤N, 1≤j≤12
b3((i-1) * 16+j)=Bk3(i, j), 1≤i≤N, 1≤j≤16
According to the cosine law of space vectors, the similarities ψ_1(k), ψ_2(k) and ψ_3(k) of submatrix A_1 with submatrix B_k1, submatrix A_2 with submatrix B_k2, and submatrix A_3 with submatrix B_k3 are calculated as
ψ_i(k) = (a_i · b_i) / (‖a_i‖ · ‖b_i‖), i = 1, 2, 3.
The three similarities are then combined through the Euclidean distance to obtain the similarity ψ(k) of the matrix A and the matrix B_k.
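Method c) splits the 52 columns into the three feature groups (24 MFCC+Δ, 12 LPCC, 16 PLP) and computes one cosine similarity per group. The patent's exact Euclidean-distance combination of ψ1, ψ2, ψ3 is not reproduced above, so the final line of the sketch below, which combines them by their normalised Euclidean norm, is only an assumption:

```python
import numpy as np

def _cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def blockwise_similarity(A, Bk):
    """Cosine similarity per feature block, then a Euclidean-style combination (assumed form)."""
    blocks = [(0, 24), (24, 36), (36, 52)]          # MFCC+delta, LPCC, PLP columns
    psis = [_cos(A[:, s:e].ravel(), Bk[:, s:e].ravel()) for s, e in blocks]
    return float(np.linalg.norm(psis) / np.sqrt(len(psis)))   # assumed combination
```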
The third method is selected for the similarity calculation when the system is implemented. An (M-N+1)-dimensional similarity vector is obtained; then, by comparison, the similarity ψ_max closest to 1 among the ψ(k) and its corresponding k are found, and the corresponding time position t in the audio is then computed from k, the frame length and the sampling rate,
where L denotes the frame length and sr denotes the sampling rate; L is 512, sr is 44100, and the unit of t is the second.
The eigenmatrix A of the snatch of music is compared with the eigenmatrices B_k(n) of all the audio items in the audio library (n being the number of audio items in the library), and the similarity calculation is carried out to obtain ψ_max and t of the audio sample against every piece of audio; the ψ_max values of all the audio items in the library form an n-dimensional vector ψ_max(n), and the t values of all the audio items form an n-dimensional vector t(n).
By comparison, the element of the n-dimensional vector ψ_max(n) closest to 1 is found; using the corresponding n, the title of the audio and the time position are obtained, and the identification is thereby completed.
3. Rapid matching
The matching process described above is the more comprehensive and accurate method, and correspondingly its matching speed is slower. For situations with higher real-time requirements and lower accuracy requirements, the value of Δk can be set to be greater than 1, so as to speed up the movement of the submatrix B_k. Because Δk is increased to an integer greater than 1, the number of identical rows between the successively compared matrices B_k and B_(k+Δk) is reduced to N-Δk, and the number of matrix similarity calculations needed to complete the identification is reduced from M-N+1 to approximately (M-N+1)/Δk; the larger Δk is, the greater the reduction in the number of calculations and the more significant the increase in matching speed.
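Using the sliding-window sketch from earlier with Δk > 1 gives the speed-up described here directly: the number of compared submatrices drops from M−N+1 to roughly (M−N+1)/Δk. The helpers sliding_submatrices and blockwise_similarity are the ones sketched above.

```python
def best_offset(A, B, delta_k=4):
    """Return (psi_max, k) over submatrices taken every delta_k rows of B."""
    best_psi, best_k = -1.0, None
    for k, Bk in sliding_submatrices(B, A.shape[0], delta_k=delta_k):
        psi = blockwise_similarity(A, Bk)
        if psi > best_psi:
            best_psi, best_k = psi, k
    return best_psi, best_k
```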
The music recognition method provided by the present invention uses the three basic features of mel cepstrum coefficients, linear prediction cepstrum coefficients and perceptual linear prediction to carry out a comprehensive identification of music. The three classes of audio features have different characteristics and advantages, and a better recognition result can be obtained after combining them; because perceptual linear prediction simulates the masking effect of the human ear, it has better noise immunity and a higher recognition rate, and the recognition effect obtained is better.
The present invention also provides a music recognition system. Fig. 2 is a structural schematic diagram of the music recognition system of an embodiment of the present invention. As shown in Fig. 2, the system includes:
a snatch-of-music acquisition module 201, configured to obtain a snatch of music to be identified;
a parameter extraction module 202, configured to extract the mel cepstrum coefficients, mel cepstrum coefficient first-order differences, linear prediction cepstrum coefficients and perceptual linear prediction coefficients of each frame of audio in the snatch of music to be identified;
a feature vector determining module 203, configured to use the mel cepstrum coefficients, mel cepstrum coefficient first-order differences, linear prediction cepstrum coefficients and perceptual linear prediction coefficients of the audio to form the feature vector of the audio;
an eigenmatrix determining module 204, configured to combine the feature vectors of each frame of the audio to obtain the eigenmatrix of the snatch of music to be identified;
a matching module 205, configured to compare the eigenmatrix of the snatch of music to be identified with the sample musical feature matrices in the music library to obtain the maximum-similarity eigenmatrix, the maximum-similarity eigenmatrix being the sample musical feature matrix with the greatest similarity to the snatch of music to be identified;
a music information acquisition module 206, configured to obtain the music information of the sample musical feature matrix with the greatest similarity;
a music information output module 207, configured to output the music information.
The system also includes:
a pretreatment module, configured to pretreat the snatch of music to be identified, the pretreatment including preemphasis, framing and windowing.
The matching module 205 specifically includes:
a matrix interception unit, configured to intercept, from the sample musical feature matrix B = [tz(1); tz(2); …; tz(M)], matrices B_k = [tz(k); tz(k+1); …; tz(k+N-1)] having the same number of rows as the eigenmatrix of the snatch of music to be identified, there being a plurality of such matrices, wherein k = 1, 2, …, M-N+1, Δk = 1, tz(1), tz(2), …, tz(M) are the feature vectors of each frame of audio in the sample musical feature matrix B, M is the number of feature vectors in the musical feature matrix B, and N is the number of rows of the eigenmatrix of the snatch of music to be identified, and to label each matrix B_k as an eigenmatrix to be compared;
a similarity calculation unit, configured to calculate the similarity between each eigenmatrix to be compared and the eigenmatrix of the snatch of music to be identified, and to obtain the eigenmatrix to be compared with the greatest similarity to the eigenmatrix of the snatch of music to be identified, the musical feature matrix to which that eigenmatrix to be compared belongs being the maximum-similarity musical feature matrix;
a judging unit, configured to judge whether the calculation needs to be completed within a preset time and whether the accuracy requirement is below a preset threshold;
a setting unit, configured to set Δk to an integer greater than 1 when the calculation needs to be completed within the preset time and the accuracy requirement is below the preset threshold.
The similarity calculation unit specifically includes:
a similarity calculation subunit, configured to calculate the similarity between the eigenmatrix to be compared and the eigenmatrix of the snatch of music to be identified using the matrix absolute value distance method, the cosine law method of the vector space model, or a method combining the cosine law of the vector space model with the Euclidean distance.
The music recognition system provided by the present invention uses the three basic features of mel cepstrum coefficients, linear prediction cepstrum coefficients and perceptual linear prediction to carry out a comprehensive identification of music. The three classes of audio features have different characteristics and advantages, and a better recognition result can be obtained after combining them; because perceptual linear prediction simulates the masking effect of the human ear, it has better noise immunity and a higher recognition rate, and the recognition effect obtained is better.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the identical or similar parts of the embodiments can be referred to one another. For the system disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, its description is relatively simple, and the relevant parts can be found in the description of the method.
Specific examples are used herein to explain the principle and implementation of the present invention; the above embodiments are only intended to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and the scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as a limitation of the present invention.

Claims (10)

1. A music recognition method, characterized in that the method includes:
obtaining a snatch of music to be identified;
extracting the mel cepstrum coefficients, mel cepstrum coefficient first-order differences, linear prediction cepstrum coefficients and perceptual linear prediction coefficients of each frame of audio in the snatch of music to be identified;
using the mel cepstrum coefficients, mel cepstrum coefficient first-order differences, linear prediction cepstrum coefficients and perceptual linear prediction coefficients of the audio to form the feature vector of the audio;
combining the feature vectors of each frame of the audio to obtain the eigenmatrix of the snatch of music to be identified;
comparing the eigenmatrix of the snatch of music to be identified with the sample musical feature matrices in a music library to obtain the maximum-similarity eigenmatrix, the maximum-similarity eigenmatrix being the sample musical feature matrix with the greatest similarity to the snatch of music to be identified;
obtaining the music information of the sample musical feature matrix with the greatest similarity;
outputting the music information.
2. The recognition method according to claim 1, characterized in that, after the obtaining of the snatch of music to be identified and before the extracting of the mel cepstrum coefficients, mel cepstrum coefficient first-order differences, linear prediction cepstrum coefficients and perceptual linear prediction coefficients of each frame of audio in the music to be identified, the method further includes:
pretreating the snatch of music to be identified, the pretreatment including preemphasis, framing and windowing.
3. The recognition method according to claim 1, characterized in that the comparing of the eigenmatrix of the snatch of music to be identified with the sample musical feature matrices in the music library to obtain the sample musical feature matrix with the greatest similarity specifically includes:
intercepting, from the sample musical feature matrix B = [tz(1); tz(2); …; tz(M)], matrices B_k = [tz(k); tz(k+1); …; tz(k+N-1)] having the same number of rows as the eigenmatrix of the snatch of music to be identified, there being a plurality of such matrices, wherein k = 1, 2, …, M-N+1, Δk = 1, tz(1), tz(2), …, tz(M) are the feature vectors of each frame of audio in the sample musical feature matrix B, M is the number of feature vectors in the musical feature matrix B, and N is the number of rows of the eigenmatrix of the snatch of music to be identified, and labelling each matrix B_k as an eigenmatrix to be compared;
calculating the similarity between each eigenmatrix to be compared and the eigenmatrix of the snatch of music to be identified, and obtaining the eigenmatrix to be compared with the greatest similarity to the eigenmatrix of the snatch of music to be identified, the musical feature matrix to which that eigenmatrix to be compared belongs being the maximum-similarity musical feature matrix.
4. The recognition method according to claim 3, characterized in that, before the intercepting, from the musical feature matrix B = [tz(1); tz(2); …; tz(M)], of the matrices B_k = [tz(k); tz(k+1); …; tz(k+N-1)] having the same number of rows as the eigenmatrix of the snatch of music to be identified, wherein Δk = 1, k = 1, 2, …, M-N+1, tz(1), tz(2), …, tz(M) are the feature vectors of each frame of audio in the musical feature matrix B, M is the number of feature vectors in the musical feature matrix B, and N is the number of rows of the eigenmatrix of the snatch of music to be identified, the method further includes:
judging whether the calculation needs to be completed within a preset time and whether the accuracy requirement is below a preset threshold;
if so, setting Δk to an integer greater than 1.
5. The recognition method according to claim 3, characterized in that the calculating of the similarity between the eigenmatrix to be compared and the eigenmatrix of the snatch of music to be identified specifically includes:
calculating the similarity between the eigenmatrix to be compared and the eigenmatrix of the snatch of music to be identified using the matrix absolute value distance method, the cosine law method of the vector space model, or a method combining the cosine law of the vector space model with the Euclidean distance.
6. A music recognition system, characterised in that the system comprises:
a to-be-identified snatch of music acquisition module, configured to obtain the snatch of music to be identified;
a parameter extraction module, configured to extract the mel cepstrum coefficients, the first-order difference of the mel cepstrum coefficients, the linear prediction cepstrum coefficients and the perceptual linear prediction coefficients of each frame of audio in the snatch of music to be identified;
a characteristic vector determining module, configured to form the characteristic vector of the audio from the mel cepstrum coefficients, the first-order difference of the mel cepstrum coefficients, the linear prediction cepstrum coefficients and the perceptual linear prediction coefficients of the audio;
an eigenmatrix determining module, configured to combine the characteristic vectors of each frame of the audio to obtain the eigenmatrix of the snatch of music to be identified;
a matching module, configured to compare the eigenmatrix of the snatch of music to be identified with the sample musical features matrices in the music library to obtain the maximum similarity eigenmatrix, the maximum similarity eigenmatrix being the sample musical features matrix having the greatest similarity to the snatch of music to be identified;
a music information acquisition module, configured to obtain the music information of the sample musical features matrix having the greatest similarity;
a music information output module, configured to output the music information.
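As an informal illustration of the parameter extraction, characteristic vector determining and eigenmatrix determining modules of claim 6, the sketch below uses librosa to compute the mel cepstrum coefficients and their first-order difference and stacks them into one feature vector per frame. The linear prediction cepstrum coefficients and perceptual linear prediction coefficients named in the claim would be concatenated as further columns in the same way; their computation is omitted to keep the sketch short, and n_mfcc = 13 is an assumed value.

```python
# Partial feature-matrix construction (MFCC + delta-MFCC only), assuming librosa.
import librosa
import numpy as np

def feature_matrix(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)                     # snatch of music to be identified
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    d_mfcc = librosa.feature.delta(mfcc, order=1)           # first-order difference of the MFCCs
    # One characteristic vector per frame (row) -> eigenmatrix of the snatch of music.
    return np.concatenate([mfcc, d_mfcc], axis=0).T         # shape (n_frames, 2 * n_mfcc)
```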
7. The identification system according to claim 6, characterised in that the system further comprises:
a pre-processing module, configured to pre-process the snatch of music to be identified, the pre-processing comprising pre-emphasis, framing and windowing.
8. The identification system according to claim 6, characterised in that the matching module specifically comprises:
a matrix interception unit, configured to intercept, from the sample musical features matrix B whose rows are the characteristic vectors tz(1), tz(2), …, tz(M) of each frame of audio, the sub-matrices Bk = [tz(k); tz(k+1); …; tz(k+N−1)] having the same number of rows as the eigenmatrix of the snatch of music to be identified, there being a plurality of sample musical features matrices, wherein k = 1, 2, …, M−N+1, Δk = 1 is the step by which k is incremented, M is the number of characteristic vectors in the sample musical features matrix B, and N is the number of rows of the eigenmatrix of the snatch of music to be identified, and to label each matrix Bk as an eigenmatrix to be compared;
a similarity calculation unit, configured to calculate the similarity between each eigenmatrix to be compared and the eigenmatrix of the snatch of music to be identified, and to obtain the eigenmatrix to be compared having the greatest similarity to the eigenmatrix of the snatch of music to be identified, the musical features matrix to which that eigenmatrix to be compared belongs being the maximum similarity musical features matrix.
9. The identification system according to claim 8, characterised in that the matching module further comprises:
a judging unit, configured to judge whether the calculation needs to be completed within a preset time and the accuracy requirement is below a preset threshold;
a setting unit, configured to set Δk to an integer greater than 1 when the calculation needs to be completed within the preset time and the accuracy requirement is below the preset threshold.
10. The identification system according to claim 8, characterised in that the similarity calculation unit specifically comprises:
a similarity calculation subunit, configured to calculate the similarity between the eigenmatrix to be compared and the eigenmatrix of the snatch of music to be identified using the matrix absolute-value distance method, the cosine-law method of the vector space model, or a method combining the cosine law of the vector space model with the Euclidean distance.
CN201710077359.2A 2017-02-14 2017-02-14 Music identification method and system Active CN106919662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710077359.2A CN106919662B (en) 2017-02-14 2017-02-14 Music identification method and system

Publications (2)

Publication Number Publication Date
CN106919662A true CN106919662A (en) 2017-07-04
CN106919662B CN106919662B (en) 2021-08-31

Family

ID=59454524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710077359.2A Active CN106919662B (en) 2017-02-14 2017-02-14 Music identification method and system

Country Status (1)

Country Link
CN (1) CN106919662B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005049859A (en) * 2003-07-28 2005-02-24 Sony Corp Method and device for automatically recognizing audio data
CN104462537A (en) * 2014-12-24 2015-03-25 北京奇艺世纪科技有限公司 Method and device for classifying voice data
CN105810213A (en) * 2014-12-30 2016-07-27 浙江大华技术股份有限公司 Typical abnormal sound detection method and device
CN105893389A (en) * 2015-01-26 2016-08-24 阿里巴巴集团控股有限公司 Voice message search method, device and server
WO2016156554A1 (en) * 2015-04-01 2016-10-06 Spotify Ab System and method for generating dynamic playlists utilising device co-presence proximity
CN104882146A (en) * 2015-05-12 2015-09-02 百度在线网络技术(北京)有限公司 Method and device for processing audio popularization information
CN105976812A (en) * 2016-04-28 2016-09-28 腾讯科技(深圳)有限公司 Voice identification method and equipment thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
华斌, 尹文慧, 张奕林: "A humming-based music retrieval application system", Computer Engineering and Applications *
胡政权: "Research on speech parameter extraction methods in speaker recognition", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107731233A (en) * 2017-11-03 2018-02-23 王华锋 A kind of method for recognizing sound-groove based on RNN
WO2019184523A1 (en) * 2018-03-29 2019-10-03 北京字节跳动网络技术有限公司 Media feature comparison method and device
CN110569373A (en) * 2018-03-29 2019-12-13 北京字节跳动网络技术有限公司 Media feature comparison method and device
US11593582B2 (en) 2018-03-29 2023-02-28 Beijing Bytedance Network Technology Co., Ltd. Method and device for comparing media features
CN108735230B (en) * 2018-05-10 2020-12-04 上海麦克风文化传媒有限公司 Background music identification method, device and equipment based on mixed audio
CN108735230A (en) * 2018-05-10 2018-11-02 佛山市博知盾识科技有限公司 Background music recognition methods, device and equipment based on mixed audio
CN108665903A (en) * 2018-05-11 2018-10-16 复旦大学 A kind of automatic testing method and its system of audio signal similarity degree
CN108665903B (en) * 2018-05-11 2021-04-30 复旦大学 Automatic detection method and system for audio signal similarity
CN110717062B (en) * 2018-07-11 2024-03-22 斑马智行网络(香港)有限公司 Music search and vehicle-mounted music playing method, device, equipment and storage medium
CN110717062A (en) * 2018-07-11 2020-01-21 阿里巴巴集团控股有限公司 Music searching and vehicle-mounted music playing method, device, equipment and storage medium
CN109308913A (en) * 2018-08-02 2019-02-05 平安科技(深圳)有限公司 Sound quality evaluation method, device, computer equipment and storage medium
US11410706B2 (en) 2018-09-11 2022-08-09 Beijing Boe Technology Development Co., Ltd. Content pushing method for display device, pushing device and display device
WO2020052324A1 (en) * 2018-09-11 2020-03-19 京东方科技集团股份有限公司 Content pushing method used for display apparatus, pushing apparatus, and display device
CN109802987A (en) * 2018-09-11 2019-05-24 北京京东方技术开发有限公司 For the content delivery method of display device, driving means and display equipment
WO2021051681A1 (en) * 2019-09-19 2021-03-25 腾讯音乐娱乐科技(深圳)有限公司 Song recognition method and apparatus, storage medium and electronic device
CN111429891B (en) * 2020-03-30 2022-03-04 腾讯科技(深圳)有限公司 Audio data processing method, device and equipment and readable storage medium
CN111429891A (en) * 2020-03-30 2020-07-17 腾讯科技(深圳)有限公司 Audio data processing method, device and equipment and readable storage medium
CN112102846A (en) * 2020-09-04 2020-12-18 腾讯科技(深圳)有限公司 Audio processing method and device, electronic equipment and storage medium
WO2022148163A1 (en) * 2021-01-05 2022-07-14 北京字跳网络技术有限公司 Method and apparatus for positioning music clip, and device and storage medium
CN113345443A (en) * 2021-04-22 2021-09-03 西北工业大学 Marine mammal vocalization detection and identification method based on mel-frequency cepstrum coefficient
CN113432856A (en) * 2021-06-28 2021-09-24 西门子电机(中国)有限公司 Motor testing method, device, electronic equipment and storage medium
CN114036341A (en) * 2022-01-10 2022-02-11 腾讯科技(深圳)有限公司 Music tag prediction method and related equipment
CN114783152A (en) * 2022-03-30 2022-07-22 郑州熙禾智能科技有限公司 Energy storage power station fire alarm method and system based on gas-sound information fusion
CN116546264A (en) * 2023-04-10 2023-08-04 北京度友信息技术有限公司 Video processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN106919662B (en) 2021-08-31

Similar Documents

Publication Publication Date Title
CN106919662A (en) A kind of music recognition methods and system
EP2659482B1 (en) Ranking representative segments in media data
Singh et al. Multimedia utilization of non-computerized disguised voice and acoustic similarity measurement
Mitrović et al. Features for content-based audio retrieval
CN108962279A (en) New Method for Instrument Recognition and device, electronic equipment, the storage medium of audio data
CN101599271B (en) Recognition method of digital music emotion
JP5295433B2 (en) Perceptual tempo estimation with scalable complexity
Cartwright et al. Social-EQ: Crowdsourcing an Equalization Descriptor Map.
CN103440873B (en) A kind of music recommend method based on similarity
Dressler Pitch estimation by the pair-wise evaluation of spectral peaks
CN107851444A (en) For acoustic signal to be decomposed into the method and system, target voice and its use of target voice
CN106997765B (en) Quantitative characterization method for human voice timbre
Mehrabi et al. Similarity measures for vocal-based drum sample retrieval using deep convolutional auto-encoders
Yu et al. Sparse cepstral codes and power scale for instrument identification
CN110534091A (en) A kind of people-car interaction method identified based on microserver and intelligent sound
Zhang Application of audio visual tuning detection software in piano tuning teaching
Meng Research on timbre classification based on BP neural network and MFCC
Kreković et al. An algorithm for controlling arbitrary sound synthesizers using adjectives
Hinrichs et al. Convolutional neural networks for the classification of guitar effects and extraction of the parameter settings of single and multi-guitar effects from instrument mixes
Dorochowicz et al. Classification of Music Genres by Means of Listening Tests and Decision Algorithms
Orio A model for human-computer interaction based on the recognition of musical gestures
Rajan et al. Multi-channel CNN-Based Rāga Recognition in Carnatic Music Using Sequential Aggregation Strategy
Mo Music timbre extracted from audio signal features
Ezers et al. Musical Instruments Recognition App
Lekshmi et al. Predominant Instrument Recognition in Polyphonic Music Using Convolutional Recurrent Neural Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant