Summary of the invention
The object of the present invention is to provide a fast audio retrieval method that further improves the speed of audio retrieval.
The technical solution adopted by the present invention to solve the above technical problem is a fast audio retrieval method in which a pretreatment stage is followed by a retrieval stage.
The pretreatment stage proceeds as follows:
Step A1, extract the MFCC parameters of the audio repository: perform feature extraction on the audio signal of the repository to be searched, obtaining characteristic information; the characteristic information obtained is represented as an a × m matrix, where a is an integer greater than 6 and less than or equal to 30; m and n are integers, m >> n, n ≥ 4; during feature extraction, the frame length is 2~3 s and the frame shift is 0.2~0.3 s;
According to this frame length and frame shift, the a × m matrix is divided into several a × n matrices, each matrix being one frame; n is an integer, n ≥ 4;
Step A2, extract the MFCC parameters of the query audio segment (the search target): using the frame length and frame shift of step A1, perform feature extraction on the audio signal of the query audio segment, obtaining characteristic information; the characteristic information obtained is one frame, represented as a 12 × n matrix, where n is an integer, n ≥ 4;
Step A3, classification: the frames obtained in step A1 are divided into 32 classes, classified by the elements of the first X columns of each frame, where X may be 3 or 4;
Step A4, classify the query audio segment according to the classes established in step A3;
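The division of the a × m feature matrix into a × n frames in step A1 can be sketched as follows. This is a minimal illustration only; the function name, the NumPy representation, and the example sizes are assumptions, not part of the claimed method:

```python
import numpy as np

def split_into_frames(features, n, hop):
    """Divide an a x m feature matrix into a list of a x n frames.

    features: a x m matrix of MFCC parameters (one column per analysis window)
    n:        columns per frame (n >= 4)
    hop:      columns to advance between successive frames (the frame shift)
    """
    a, m = features.shape
    frames = []
    for start in range(0, m - n + 1, hop):
        frames.append(features[:, start:start + n])
    return frames

# Example: a = 12 rows, m = 20 columns, frames of n = 4 columns, hop of 4
feats = np.arange(12 * 20).reshape(12, 20)
frames = split_into_frames(feats, n=4, hop=4)
print(len(frames))      # 5 non-overlapping 12 x 4 frames
print(frames[0].shape)  # (12, 4)
```

With a hop smaller than n the same function produces overlapping frames, matching the case where the frame shift is shorter than the frame length.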
The retrieval stage consists of checking the class of the query audio segment, coarse retrieval, and fine retrieval.
Checking the class of the query audio segment:
Step B1, begin retrieval and take a new frame; the frame is a frame from the audio repository to be searched;
Step B2, judge whether the frame of step B1 has reached the end of the repository; if so, end this round of checking the class of the query audio segment; otherwise, go to step B3;
Step B3, judge whether the classes match, that is, whether the class of the frame corresponding to the query audio segment matches the class of the current frame of the repository;
if so, go to step B4; otherwise, go to step B1;
Coarse retrieval:
Step B4, take a new element of the characteristic parameters: take a new element from the matching frame in the repository;
Step B5, compute the distance between this element of the matching frame and the corresponding element of the query audio segment;
Step B6, judge whether the distance obtained in step B5 exceeds the threshold; if so, go to step B1; otherwise, go to step B7;
Step B7, compute the distance for the next characteristic-parameter element of the matching frame;
Step B8, judge whether the element just taken is the last one; the last element means the last element of the first Y columns of the frame, where Y may be 2 to 8; if so, go to step B9; otherwise, go to step B4;
Step B9, record the position of this frame and store it in the position array;
Fine retrieval:
Step B10, take a new frame from the position array;
Step B11, judge whether the frame of step B10 has reached the end of the array; if so, end this retrieval; otherwise, go to step B12;
Step B12, compute the distance to the feature matrix of the query audio segment; the distance is the sum of the squared differences of the corresponding elements of the query feature matrix and the matrix being compared;
Step B13, judge whether the distance exceeds the threshold; if so, go to step B10; otherwise, go to step B14;
Step B14, a target has been found; then return to step B10 until this retrieval is finished.
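The class check, coarse retrieval, and fine retrieval of steps B1 to B14 can be sketched as follows. This is a minimal illustration under stated assumptions: repository frames are held in a flat list with precomputed class labels, and the names and threshold values are chosen for the example, not taken from the claims:

```python
import numpy as np

def retrieve(repo_frames, repo_classes, query, query_class,
             coarse_cols=4, coarse_thresh=1.0, fine_thresh=10.0):
    """Return indices of repository frames judged to match the query frame.

    repo_frames:  list of a x n matrices (frames of the audio repository)
    repo_classes: class label of each repository frame (steps A3/A4)
    query:        a x n matrix of the query audio segment (step A2)
    """
    positions = []                               # step B9: the position array
    for i, frame in enumerate(repo_frames):      # step B1: take a new frame
        if repo_classes[i] != query_class:       # step B3: class check
            continue
        # Coarse retrieval (steps B4-B8): compare the first coarse_cols
        # columns element by element, abandoning the frame as soon as one
        # element-wise distance exceeds the threshold (step B6).
        survived = True
        for col in range(coarse_cols):
            for row in range(frame.shape[0]):
                if abs(frame[row, col] - query[row, col]) > coarse_thresh:
                    survived = False
                    break
            if not survived:
                break
        if survived:
            positions.append(i)
    # Fine retrieval (steps B10-B14): sum of squared differences over all
    # corresponding elements (step B12), kept if below the fine threshold.
    return [i for i in positions
            if np.sum((repo_frames[i] - query) ** 2) <= fine_thresh]

# Example: the first repository frame equals the query; the second differs
query = np.ones((12, 4))
repo = [np.ones((12, 4)), np.full((12, 4), 5.0)]
matches = retrieve(repo, [0, 0], query, 0)
print(matches)  # [0]
```

The early abandonment inside the coarse loop is what allows most non-matching frames to be rejected after only a few element comparisons instead of a full matrix match.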
The beneficial effects of the invention are as follows. The invention combines coarse retrieval with fine retrieval, improving both retrieval speed and accuracy, where accuracy = 100% - (false-alarm rate + miss rate)/2. As shown in Figure 6, the method searches 8.5 hours of audio at a speed of 4 ms, which after conversion is about 3 times faster than the classical method (Pentium M 1.6 GHz). Because coarse and fine retrieval are combined, and in particular because of the introduction of the coarse-retrieval step, the method can use a larger frame length and frame shift while still guaranteeing accuracy: the frame length is 2~3 s and the frame shift 0.2~0.3 s, whereas the commonly used values are a 25 ms frame length and a 10 ms frame shift, which is why retrieval speed is improved.
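The accuracy formula above can be checked with a short calculation; the rates used in the example are illustrative numbers only, not measured results:

```python
def accuracy(false_alarm_rate, miss_rate):
    """accuracy = 100% - (false-alarm rate + miss rate) / 2, all in percent."""
    return 100.0 - (false_alarm_rate + miss_rate) / 2.0

print(accuracy(2.0, 4.0))  # 97.0
print(accuracy(0.0, 0.0))  # 100.0
```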
Embodiment
Embodiment one: as shown in Figs. 1~5, the fast audio retrieval method of this embodiment passes through a pretreatment stage and then enters a retrieval stage.
The pretreatment stage proceeds as follows:
Step A1, extract the MFCC parameters of the audio repository: perform feature extraction on the audio signal of the repository to be searched, obtaining characteristic information; the characteristic information obtained is represented as an a × m matrix, where a is an integer greater than 6 and less than or equal to 30; m and n are integers, m >> n, n ≥ 4; during feature extraction, the frame length is 2~3 s and the frame shift is 0.2~0.3 s;
According to this frame length and frame shift, the a × m matrix is divided into several a × n matrices, each matrix being one frame; n is an integer, n ≥ 4;
Step A2, extract the MFCC parameters of the query audio segment (the search target): using the frame length and frame shift of step A1, perform feature extraction on the audio signal of the query audio segment, obtaining characteristic information; the characteristic information obtained is one frame, represented as a 12 × n matrix, where n is an integer, n ≥ 4;
Step A3, classification: the frames obtained in step A1 are divided into 32 classes, classified by the elements of the first X columns of each frame, where X may be 3 or 4;
Step A4, classify the query audio segment according to the classes established in step A3;
The retrieval stage consists of checking the class of the query audio segment, coarse retrieval, and fine retrieval.
Checking the class of the query audio segment:
Step B1, begin retrieval and take a new frame; the frame is a frame from the audio repository to be searched;
Step B2, judge whether the frame of step B1 has reached the end of the repository; if so, end this round of checking the class of the query audio segment; otherwise, go to step B3;
Step B3, judge whether the classes match, that is, whether the class of the frame corresponding to the query audio segment matches the class of the current frame of the repository;
if so, go to step B4; otherwise, go to step B1;
Coarse retrieval:
Step B4, take a new element of the characteristic parameters: take a new element from the matching frame in the repository;
Step B5, compute the distance between this element of the matching frame and the corresponding element of the query audio segment;
Step B6, judge whether the distance obtained in step B5 exceeds the threshold; if so, go to step B1; otherwise, go to step B7;
Step B7, compute the distance for the next characteristic-parameter element of the matching frame;
Step B8, judge whether the element just taken is the last one; the last element means the last element of the first Y columns of the frame, where Y may be 2 to 8; if so, go to step B9; otherwise, go to step B4;
Step B9, record the position of this frame and store it in the position array;
Fine retrieval:
Step B10, take a new frame from the position array;
Step B11, judge whether the frame of step B10 has reached the end of the array; if so, end this retrieval; otherwise, go to step B12;
Step B12, compute the distance to the feature matrix of the query audio segment; the distance is the sum of the squared differences of the corresponding elements of the query feature matrix and the matrix being compared;
Step B13, judge whether the distance exceeds the threshold; if so, go to step B10; otherwise, go to step B14;
Step B14, a target has been found; then return to step B10 until this retrieval is finished.
Extracting the characteristic parameters: as shown in Figure 5, the features of an audio signal are generally extracted frame by frame; a signal segment of a certain length is taken as one frame and its characteristic parameters are computed, then the window is shifted by a certain distance (called the frame shift) and the next frame is taken, and so on.
This scheme uses the MFCC (Mel Frequency Cepstral Coefficient) parameters, which are highly robust, as the feature. Because of this robustness, the extraction can jump forward in large steps; that is, the frame shift can be made large, which reduces the amount of computation required during later retrieval. In this experiment, 12th-order MFCCs are obtained using 24 Mel filter banks and a DCT. The frame length may be 2.5 s and the frame shift 0.25 s.
The resulting MFCCs are classified by VQ (Vector Quantization): the previously extracted MFCC parameters are vector-quantized in groups of 36 elements (taking the first three columns of the 12th-order MFCC gives 36 elements), again to improve robustness to interference; the VQ codebook size is 32 (the 32 classes referred to in step A3).
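The class assignment described here (the first three columns of each 12th-order MFCC frame form a 36-element vector assigned to one of 32 codewords) can be sketched as a nearest-codeword search. The codebook below is random for illustration only; in practice it would be trained on the repository (for example by the LBG algorithm), which this sketch omits:

```python
import numpy as np

def classify_frames(frames, codebook):
    """Assign each frame to the nearest codeword (steps A3-A4).

    frames:   list of 12 x n MFCC matrices
    codebook: 32 x 36 array, one row per class
    The first 3 columns of each frame (36 elements) are compared against the
    codebook by squared Euclidean distance; the nearest row is the class.
    """
    labels = []
    for frame in frames:
        vec = frame[:, :3].reshape(-1)            # 12 x 3 -> 36 elements
        dists = np.sum((codebook - vec) ** 2, axis=1)
        labels.append(int(np.argmin(dists)))
    return labels

rng = np.random.default_rng(0)
codebook = rng.standard_normal((32, 36))          # untrained, for illustration
frames = [rng.standard_normal((12, 4)) for _ in range(5)]
labels = classify_frames(frames, codebook)
print(labels)  # five class indices in the range 0..31
```

A frame whose first three columns coincide with a codeword is assigned exactly that codeword's class, which is how the query segment is placed into one of the 32 classes in step A4.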
The advantage of this scheme lies in the retrieval stage, whose speed it greatly improves. First, according to the class of the query audio segment, only the features of the corresponding class in the feature database are searched; other classes are not examined at all. Then, within that class, the elements of each frame's characteristic parameters are compared one by one (no previous method did this; earlier methods matched the characteristic parameters as a whole), and as soon as a difference exceeds a preset threshold the frame is discarded and the next frame is examined. Finally, for the positions of the frames that remain, the distance to the query parameters is computed one by one (the smaller the distance, the greater the similarity); when the distance is below a given threshold, the frame is judged to be a target.
About the MFCC parameters:
The Mel-frequency cepstral coefficients (MFCC) take the auditory properties of the human ear into account: the spectrum is converted to a nonlinear spectrum based on the Mel scale and then transformed to the cepstral domain. Because the human auditory properties are fully considered and no prior assumptions are required, the MFCC parameters have good recognition performance and noise robustness.
MFCCs are computed with a filter bank whose filters have equal bandwidth on the Mel frequency scale. This is because human perception of frequencies above about 1000 Hz does not follow a linear relationship, but is approximately linear on a logarithmic frequency scale. According to acoustic measurements, the Mel frequency mapping can be given as mel = 2595·lg(1 + f/700),
where f is the ordinary frequency in Hz and mel is the Mel-scale frequency.
The computation proceeds as shown in Figure 7:
1. The original speech signal is windowed and divided into frames, giving the time-domain signal of each speech frame;
2. The time-domain signal is zero-padded to a sequence of length N, and the linear spectrum is obtained through the DFT; the conversion formula is X(k) = Σ_{n=0}^{N−1} x(n)·e^(−j2πnk/N), k = 0, 1, …, N − 1.
In practice this is usually computed with the FFT, where N is generally called the DFT (or FFT) window width.
3. The linear spectrum is passed through the Mel filter bank to obtain the Mel spectrum, and the logarithm of the energy is taken to obtain the log spectrum;
4. The log spectrum is transformed to the cepstral domain by the discrete cosine transform (DCT), giving the Mel-frequency cepstral coefficients (the MFCC parameters); in general about 12 to 16 coefficients are taken, and the MFCC coefficients are c(n) = Σ_{m=1}^{M} S(m)·cos(nπ(m − 0.5)/M), n = 1, 2, …, where S(m) is the log energy output of the m-th filter and M is the number of filters.
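Steps 1 to 4 above can be sketched end to end for a single analysis window. This is a simplified illustration: the triangular filter bank is built directly from the Mel mapping mel = 2595·lg(1 + f/700), and the window length, FFT size, and sampling rate are example values, not those of the scheme:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(x, sr, n_fft=512, n_filters=24, n_coeffs=12):
    """Compute one frame of MFCCs following the four steps above."""
    # Steps 1-2: window, zero-pad to n_fft, DFT (via FFT) -> power spectrum
    x = x * np.hamming(len(x))
    spectrum = np.abs(np.fft.rfft(x, n=n_fft)) ** 2
    # Step 3: triangular Mel filter bank, then logarithm -> log spectrum
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fbank = np.zeros((n_filters, len(bin_freqs)))
    hz = mel_to_hz(mel_points)
    for i in range(n_filters):
        lo, mid, hi = hz[i], hz[i + 1], hz[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        fbank[i] = np.maximum(0.0, np.minimum(rising, falling))
    log_spec = np.log(fbank @ spectrum + 1e-10)
    # Step 4: DCT of the log spectrum -> first n_coeffs cepstral coefficients
    m = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), m + 0.5) / n_filters)
    return dct @ log_spec

sr = 8000
t = np.arange(0, 0.025, 1.0 / sr)    # one 25 ms analysis window
x = np.sin(2 * np.pi * 440 * t)      # a 440 Hz test tone
coeffs = mfcc_frame(x, sr)
print(coeffs.shape)  # (12,)
```

Running this over successive windows and stacking the results column by column produces the a × m matrix of step A1 (with a = 12 here).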
The MFCC coefficients obtained directly in this way constitute the static features; taking the first- and second-order differences of the static features gives the corresponding dynamic features.
The Mel cepstral coefficients are affected by factors such as the number, shape, and distribution of the filters in the filter bank, and by the energy spectrum.
For robustness, this scheme uses only the static MFCC features, not the dynamic features.
Embodiment two: as shown in Figs. 1~5, in this embodiment the frame length during feature extraction in step A1 is 2.5 s and the frame shift is 0.25 s, which improves retrieval speed while guaranteeing retrieval precision. The other steps are identical to embodiment one.
Embodiment three: as shown in Figs. 1~5, in this embodiment a in step A1 is taken as 12; the 12th-order MFCCs are obtained using 24 Mel filter banks and a DCT. The other steps are identical to embodiment one.
Embodiment four: as shown in Figs. 1~5, in this embodiment the last element in step B8 means the last element of the first 4 columns of the frame. Coarse retrieval examines the first 4 columns because too many columns may reduce retrieval speed and increase the miss probability, while too few columns cannot screen the frames effectively. The other steps are identical to embodiment one.