Method for measuring similarity between audio fragments
Technical field
The invention belongs to the technical field of audio retrieval, and specifically relates to a method for measuring the similarity between audio fragments.
Background technology
With the continuous growth of multimedia documents and applications, audio analysis and retrieval techniques are becoming increasingly important. Audio fragment retrieval is an important form of such technology: given a query audio fragment from the user, the task is to automatically retrieve similar audio fragments from an audio repository and rank them from high to low by similarity. Existing audio retrieval techniques generally extract audio features from the fragments, use those features to measure similarity, and then retrieve according to the measured results. Because such methods do not consider differences in the specific content within an audio fragment, and instead use audio features to represent the whole fragment, they cannot effectively measure the similarity of audio content.
The paper "Dominant Feature Vectors Based Audio Similarity Measure", published at the Pacific-Rim Conference on Multimedia in 2004 (authors J. Gu, L. Lu, R. Cai, H.J. Zhang and J. Yang, pages 890-897), proposed an audio feature based on the eigenvectors and eigenvalues of an audio feature matrix: dominant feature vectors. That work arranges the frame features extracted from an audio fragment into a feature frame matrix, computes the autocorrelation matrix of this matrix, and finally uses the eigenvectors and eigenvalues of the autocorrelation matrix as the fragment feature. Because this method is based on statistics over the whole audio fragment, it cannot describe how the content changes within the fragment, which limits the accuracy of audio retrieval.
Summary of the invention
To address the deficiencies of the prior art, the present invention proposes a method for measuring the similarity between different audio fragments.
To achieve the above purpose, the technical solution adopted by the present invention is a method for measuring similarity between audio fragments, comprising the following steps:
(1) dividing each of the audio fragments to be measured into a plurality of audio units with similar acoustic quality;
(2) calculating the similarity between every pair of audio units taken from the two audio fragments;
(3) measuring the similarity between the two audio fragments according to the results of step (2).
Further, the Bayesian Information Criterion (BIC) is used to divide the audio fragments to be measured into a plurality of audio units with similar acoustic quality.
Further, the following formula is used to calculate the similarity of two audio units:

Sim(s_i, s_j) = exp(-Distance(s_i, s_j) / 2)

where s_i and s_j denote two audio units, and Distance(s_i, s_j) denotes the Euclidean distance between the audio feature vectors of s_i and s_j.
Further, the feature vector of an audio unit is the mean of the audio feature vectors of all frames in that unit.
Further, the feature vector of an audio frame is a 13-dimensional vector composed of the logarithmic energy and the Mel-frequency cepstral coefficients.
Further, the specific steps for measuring the similarity between the two audio fragments are:
a: modeling the similarity measurement of the two audio fragments as a weighted bipartite graph;
b: measuring the similarity between the two audio fragments by optimal matching;
c: calculating the similarity between the two audio fragments with the following formula:

Sim(X, Y) = Σω_ij / max(p, q)

where Σω_ij denotes the maximum total similarity obtained by the optimal matching of the two audio fragments, and p and q denote the number of audio units in the two audio fragments X and Y, respectively.
In addition, the present invention proposes a method of audio fragment retrieval. This method can retrieve audio fragments similar to the query fragment more effectively, and rank them from high to low by similarity, so that audio retrieval technology can play its full role in information retrieval.
To achieve the above purpose, the technical solution adopted is a method of audio fragment retrieval for retrieving, from an audio repository, the audio fragments similar to a query audio fragment, comprising the following steps:
(1) dividing the query audio fragment and the audio fragments in the repository into a plurality of audio units with similar acoustic quality;
(2) calculating the similarity between the audio units of the query fragment and the audio units of each fragment in the repository;
(3) measuring the similarity between the query fragment and each fragment in the repository;
(4) retrieving the audio fragments similar to the query fragment, ranked by similarity from high to low.
Further, the Bayesian Information Criterion (BIC) is used to divide the query audio fragment and the audio fragments in the repository into a plurality of audio units with similar acoustic quality.
Further, the following formula is used to calculate the similarity of two audio units:

Sim(s_i, s_j) = exp(-Distance(s_i, s_j) / 2)

where s_i and s_j denote two audio units, and Distance(s_i, s_j) denotes the Euclidean distance between the audio feature vectors of s_i and s_j. The feature vector of an audio unit is the mean of the audio feature vectors of all frames in that unit, and the feature vector of an audio frame is a 13-dimensional vector composed of the logarithmic energy and the Mel-frequency cepstral coefficients.
Further, the specific steps for measuring the similarity between the query fragment and a fragment in the repository are:
a: modeling the similarity measurement of the two audio fragments as a weighted bipartite graph;
b: measuring the similarity between the two audio fragments by optimal matching;
c: calculating the similarity between the two audio fragments with the following formula:

Sim(X, Y) = Σω_ij / max(p, q)

where Σω_ij denotes the maximum total similarity obtained by the optimal matching of the two audio fragments, and p and q denote the number of audio units in the two audio fragments X and Y, respectively.
The effect of the present invention is that, compared with existing methods, it achieves higher retrieval accuracy, allowing audio retrieval technology to play its full role in information retrieval.
The reason the present invention achieves the above effect is as follows. Addressing the problems of the prior art, the present invention divides audio fragment retrieval into two levels: the audio unit level and the audio fragment level. At the audio unit level, the present invention defines an audio unit as a series of audio frames with similar acoustic quality; each audio fragment is first divided into audio units, and the similarity between the audio units of the two fragments is then measured. At the audio fragment level, based on the unit-level results, the similarity measurement of two fragments is modeled as a weighted bipartite graph, and the similarity of the two fragments is finally measured by optimal matching.
Description of drawings
Fig. 1 is a schematic flowchart of the present invention;
Fig. 2 is a comparison of the recall of the present invention and 3 existing methods;
Fig. 3 is a comparison of the precision of the present invention and 3 existing methods.
Embodiment
The present invention is described in further detail below with reference to the drawings and a specific embodiment.
As shown in Fig. 1, the method of the present invention comprises the following steps:
(1) dividing the query audio fragment and the audio fragments in the repository into audio units with similar acoustic quality;
First, the Bayesian Information Criterion (BIC) is used to divide each audio fragment into audio units with similar acoustic quality. For a detailed description of the Bayesian Information Criterion, see "Efficient Audio Segmentation Algorithms based on the BIC" [M. Cettolo and M. Vescovi, IEEE International Conference on Acoustics, Speech and Signal Processing, 2003].
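As a minimal illustration of BIC-based change detection, the sketch below tests each candidate boundary of a 1-dimensional frame-feature sequence: a positive delta-BIC means two Gaussians (one per side) fit better than a single Gaussian, even after the model-complexity penalty. This is a simplification for illustration only; the full method in the cited Cettolo and Vescovi paper uses multivariate features, full covariances, and sliding windows, and the penalty weight `lam` is a tunable assumption.

```python
import math

def delta_bic(x, t, lam=1.0):
    """Delta-BIC for splitting the 1-D sequence x at index t.
    Positive values favour a change point at t."""
    n = len(x)
    def var(seg):
        m = sum(seg) / len(seg)
        return max(sum((v - m) ** 2 for v in seg) / len(seg), 1e-12)
    r = (n * math.log(var(x))
         - t * math.log(var(x[:t]))
         - (n - t) * math.log(var(x[t:])))
    penalty = lam * 0.5 * (1 + 1) * math.log(n)  # 0.5*(d + d(d+1)/2), d = 1
    return 0.5 * r - penalty

def best_boundary(x, min_len=5):
    """Return the index with the highest positive delta-BIC, or None."""
    cands = [(delta_bic(x, t), t) for t in range(min_len, len(x) - min_len)]
    score, t = max(cands)
    return t if score > 0 else None

# two clearly different pseudo-stationary regions
frames = [0.0, 0.1, -0.1, 0.05, 0.0, -0.05, 0.1, 0.0,
          5.0, 5.1, 4.9, 5.05, 5.0, 4.95, 5.1, 5.0]
print(best_boundary(frames))  # -> 8
```

Applied recursively to the segments it produces, this kind of test divides a fragment into the acoustically homogeneous audio units the method requires.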
(2) calculating the similarity between the audio units of the query fragment and the audio units of each fragment in the repository;
The feature vector of an audio frame is a 13-dimensional vector composed of the logarithmic energy and the Mel-frequency cepstral coefficients, and the feature vector of an audio unit is the mean of the audio feature vectors of all frames in that unit. The following formula is then used to calculate the similarity of two audio units:

Sim(s_i, s_j) = exp(-Distance(s_i, s_j) / 2)

where s_i and s_j denote two audio units, and Distance(s_i, s_j) denotes the Euclidean distance between the audio feature vectors of s_i and s_j.
(3) measuring the similarity between the query fragment and each fragment in the repository;
a: the similarity measurement of the two audio fragments is modeled as a weighted bipartite graph, in which the vertices are the audio units of the two fragments and the edge weights are the unit similarities;
b: the similarity between the two audio fragments is measured by optimal matching on this graph;
c: the similarity between the two audio fragments is calculated with the following formula:

Sim(X, Y) = Σω_ij / max(p, q)

where Σω_ij denotes the maximum total similarity obtained by the optimal matching, and p and q denote the number of audio units in the two audio fragments X and Y, respectively.
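Steps a to c can be sketched with a brute-force search over one-to-one assignments, which is feasible for the handful of units a fragment typically contains; a practical system would use the Kuhn-Munkres (Hungarian) algorithm instead. Note two assumptions in this sketch: the normalization by max(p, q), which is one plausible reading since the text names only Σω_ij, p and q, and the toy similarity function used in the demo:

```python
import itertools

def fragment_similarity(units_x, units_y, unit_sim):
    """Weighted-bipartite-graph similarity of two fragments.

    Edge weight w_ij = unit_sim(x_i, y_j); optimal matching picks the
    one-to-one assignment maximising the total weight (brute force here).
    """
    p, q = len(units_x), len(units_y)
    if p > q:  # match the smaller side into the larger one
        units_x, units_y, p, q = units_y, units_x, q, p
    best = max(
        sum(unit_sim(units_x[i], units_y[j]) for i, j in enumerate(perm))
        for perm in itertools.permutations(range(q), p)
    )
    return best / max(p, q)  # assumed normalisation by unit counts

# toy one-number "units" with similarity 1 when equal, 0.5 otherwise
sim = lambda a, b: 1.0 if a == b else 0.5
print(round(fragment_similarity([1, 2], [2, 1, 3], sim), 3))  # (1+1)/3 -> 0.667
```

The one-to-one constraint of optimal matching is what prevents one highly similar unit pair from dominating the fragment-level score.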
(4) retrieving the audio fragments similar to the query fragment, ranked by similarity from high to low.
The following experimental results show that, compared with existing methods, the present invention achieves higher retrieval accuracy, allowing audio retrieval technology to play its full role in information retrieval.
In this embodiment, a database of 1000 audio fragments was built, containing many types of sound, for example animal sounds, speech, vehicle sounds, machine sounds, music, gunshots, and so on. Among these 1000 audio fragments, 500 have one or more similar fragments, while the other 500 occur only once. The 500 audio fragments that have one or more similar fragments are therefore used as the query fragments, in order to verify the correctness of similar-fragment retrieval.
To demonstrate the validity of the present invention, the following 4 methods were tested and compared in the experiments:
1. the present invention;
2. existing method 1: the paper "Dominant Feature Vectors Based Audio Similarity Measure" published at the Pacific-Rim Conference on Multimedia in 2004 (authors J. Gu, L. Lu, R. Cai, H.J. Zhang and J. Yang, pages 890-897);
3. existing method 2: the L2 distance;
4. existing method 3: the paper "Content-based Indexing and Retrieval-by-Example in Audio" published at the IEEE International Conference on Multimedia and Expo in 2000 (authors Z. Liu and Q. Huang).
In all 4 of the above methods, the audio frame feature is the 13-dimensional vector composed of the logarithmic energy and the Mel-frequency cepstral coefficients, so the experimental results below can demonstrate the superiority of the present invention on an equal footing. The key differences between the 4 methods are shown in Table 1:
Table 1: Key differences between the present invention and the existing methods

|                         | The present invention               | Existing method 1        | Existing method 2    | Existing method 3    |
| ----------------------- | ----------------------------------- | ------------------------ | -------------------- | -------------------- |
| Fragment representation | Audio unit features                 | Dominant features        | Audio frame features | Audio frame features |
| Similarity measurement  | Audio unit level and fragment level | Fragment level           | Fragment level       | Fragment level       |
| Measure                 | Optimal matching                    | Dominant feature vectors | K-L distance         | L2 distance          |
The experiments adopted two evaluation indexes from the MPEG-7 standardization activity: the Average Normalized Modified Retrieval Rank (ANMRR) and the Average Recall (AR). AR is similar to the traditional recall measure, while ANMRR, compared with traditional precision, reflects not only the proportion of correct retrieval results but also their rank positions. The smaller the ANMRR value, the higher the correct fragments are ranked; the larger the AR value, the larger the proportion of similar fragments among the top K retrieval results (K being the cutoff of the result list). Therefore, a larger AR indicates better recall of fragment retrieval, and a smaller ANMRR indicates higher retrieval accuracy. Table 2 compares the AR and ANMRR of the above 4 methods over the 500 query fragments.
Table 2: Experimental comparison of the present invention and the existing methods

|       | The present invention | Existing method 1 | Existing method 2 | Existing method 3 |
| ----- | --------------------- | ----------------- | ----------------- | ----------------- |
| AR    | 0.72                  | 0.66              | 0.67              | 0.66              |
| ANMRR | 0.26                  | 0.33              | 0.32              | 0.33              |
As can be seen from Table 2, the present invention achieves better results than the existing methods on both AR and ANMRR. This is mainly because: (1) the present invention builds the similarity of audio fragments on the similarity of audio units, and an audio unit is a series of audio frames with similar acoustic quality, which guarantees the validity of the fragment similarity measurement; (2) the present invention measures the similarity of audio fragments by optimal matching, whose one-to-one matching mechanism guarantees the validity of the fragment-level measure.
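For reference, the NMRR and recall-at-K computations behind these indexes can be sketched as follows. The constants (K = min(4·NG, 2·GTM), penalty rank 1.25·K) follow one common statement of the MPEG-7 definition and should be checked against the standard before reuse:

```python
def nmrr(ranks_found, ng, gtm):
    """Normalised Modified Retrieval Rank for one query (MPEG-7 style).

    ranks_found: 1-based ranks of the ground-truth items that appeared
    in the result list (missing items are penalised); ng: ground-truth
    set size; gtm: max ground-truth size over all queries.
    0.0 = perfect retrieval, 1.0 = nothing relevant found in time.
    """
    k = min(4 * ng, 2 * gtm)
    penalty = 1.25 * k
    ranks = [r if r <= k else penalty for r in ranks_found]
    ranks += [penalty] * (ng - len(ranks_found))  # missed items
    avr = sum(ranks) / ng
    return (avr - 0.5 * (1 + ng)) / (penalty - 0.5 * (1 + ng))

def average_recall(ranks_found, ng, k):
    """Fraction of the ng similar fragments found in the top-k results."""
    return sum(1 for r in ranks_found if r <= k) / ng

print(nmrr([1, 2], ng=2, gtm=2))  # perfect retrieval -> 0.0
print(nmrr([], ng=2, gtm=2))      # nothing found -> 1.0
```

ANMRR and AR are then the means of nmrr and average_recall over all 500 queries.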
To further confirm the validity of the present invention, besides AR and ANMRR, another pair of evaluation indexes was adopted: recall and precision, defined as follows:
recall = number of relevant fragments retrieved / number of all relevant fragments
precision = number of relevant fragments retrieved / number of all fragments retrieved
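These two definitions translate directly into code; the sketch below operates on sets of fragment identifiers (the identifiers shown are made up for illustration):

```python
def recall_precision(retrieved, relevant):
    """recall = |retrieved & relevant| / |relevant|
    precision = |retrieved & relevant| / |retrieved|"""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(relevant), hits / len(retrieved)

r, p = recall_precision(retrieved=["a", "b", "c", "x"],
                        relevant=["a", "b", "c", "d"])
print(r, p)  # 0.75 0.75
```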
As shown in Figs. 2 and 3, the present invention achieves better results than the existing methods on both recall and precision. Therefore, both groups of evaluation indexes, AR and ANMRR as well as recall and precision, fully prove the outstanding effect of the present invention in audio fragment retrieval.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to encompass them as well.
Note: this work was supported by a grant of the National Natural Science Foundation of China (project no. 60503062).