Summary of the invention
The object of the present invention is to provide a fast audio retrieval method that further improves the speed of audio retrieval.
The technical solution adopted by the present invention to solve the above technical problem is a fast audio retrieval method in which a pretreatment stage is followed by a retrieval stage.
The pretreatment stage proceeds as follows:
Step A1, extract the MFCC parameters of the audio repository: perform feature extraction on the audio signal of the repository to be searched, obtaining characteristic information; the characteristic information obtained is represented as an a × m matrix, where a is an integer greater than 6 and less than or equal to 30; m and n are integers, m >> n, n ≥ 4; during feature extraction, the frame length is 2~3 s and the frame shift is 0.2~0.3 s;
According to this frame length and frame shift, the a × m matrix is divided into several a × n matrices, each matrix being one frame; n is an integer, n ≥ 4;
Step A2, extract the MFCC parameters of the query audio segment (the search target): using the frame length and frame shift of step A1, perform feature extraction on the audio signal of the query audio segment, obtaining characteristic information; the characteristic information obtained is one frame, represented as a 12 × n matrix, where n is an integer, n ≥ 4;
Step A3, classification: the frames obtained in step A1 are divided into 32 classes, classified by the elements of the first X columns of each frame, where X may be 3 or 4;
Step A4, classify the query audio segment according to the classes established in step A3;
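The division of the a × m feature matrix into a × n frames in step A1 can be sketched as follows. This is a minimal illustration only; the function name, the NumPy representation, and the example sizes are assumptions, not part of the claimed method:

```python
import numpy as np

def split_into_frames(features, n, hop):
    """Divide an a x m feature matrix into a list of a x n frames.

    features: a x m matrix of MFCC parameters (one column per analysis window)
    n:        columns per frame (n >= 4)
    hop:      columns to advance between successive frames (the frame shift)
    """
    a, m = features.shape
    frames = []
    for start in range(0, m - n + 1, hop):
        frames.append(features[:, start:start + n])
    return frames

# Example: a = 12 rows, m = 20 columns, frames of n = 4 columns, hop of 4
feats = np.arange(12 * 20).reshape(12, 20)
frames = split_into_frames(feats, n=4, hop=4)
print(len(frames))      # 5 non-overlapping 12 x 4 frames
print(frames[0].shape)  # (12, 4)
```

With a hop smaller than n the same function produces overlapping frames, matching the case where the frame shift is shorter than the frame length.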
The retrieval stage consists of checking the class of the query audio segment, coarse retrieval, and fine retrieval.
Checking the class of the query audio segment:
Step B1, begin retrieval and take a new frame; the frame is a frame from the audio repository to be searched;
Step B2, judge whether the frame of step B1 has reached the end of the repository; if so, end this round of checking the class of the query audio segment; otherwise, go to step B3;
Step B3, judge whether the classes match, that is, whether the class of the frame corresponding to the query audio segment matches the class of the current frame of the repository;
if so, go to step B4; otherwise, go to step B1;
Coarse retrieval:
Step B4, take a new element of the characteristic parameters: take a new element from the matching frame in the repository;
Step B5, compute the distance between this element of the matching frame and the corresponding element of the query audio segment;
Step B6, judge whether the distance obtained in step B5 exceeds the threshold; if so, go to step B1; otherwise, go to step B7;
Step B7, compute the distance for the next characteristic-parameter element of the matching frame;
Step B8, judge whether the element just taken is the last one; the last element means the last element of the first Y columns of the frame, where Y may be 2 to 8; if so, go to step B9; otherwise, go to step B4;
Step B9, record the position of this frame and store it in the position array;
Fine retrieval:
Step B10, take a new frame from the position array;
Step B11, judge whether the frame of step B10 has reached the end of the array; if so, end this retrieval; otherwise, go to step B12;
Step B12, compute the distance to the feature matrix of the query audio segment; the distance is the sum of the squared differences of the corresponding elements of the query feature matrix and the matrix being compared;
Step B13, judge whether the distance exceeds the threshold; if so, go to step B10; otherwise, go to step B14;
Step B14, a target has been found; then return to step B10 until this retrieval is finished.
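The class check, coarse retrieval, and fine retrieval of steps B1 to B14 can be sketched as follows. This is a minimal illustration under stated assumptions: repository frames are held in a flat list with precomputed class labels, and the names and threshold values are chosen for the example, not taken from the claims:

```python
import numpy as np

def retrieve(repo_frames, repo_classes, query, query_class,
             coarse_cols=4, coarse_thresh=1.0, fine_thresh=10.0):
    """Return indices of repository frames judged to match the query frame.

    repo_frames:  list of a x n matrices (frames of the audio repository)
    repo_classes: class label of each repository frame (steps A3/A4)
    query:        a x n matrix of the query audio segment (step A2)
    """
    positions = []                               # step B9: the position array
    for i, frame in enumerate(repo_frames):      # step B1: take a new frame
        if repo_classes[i] != query_class:       # step B3: class check
            continue
        # Coarse retrieval (steps B4-B8): compare the first coarse_cols
        # columns element by element, abandoning the frame as soon as one
        # element-wise distance exceeds the threshold (step B6).
        survived = True
        for col in range(coarse_cols):
            for row in range(frame.shape[0]):
                if abs(frame[row, col] - query[row, col]) > coarse_thresh:
                    survived = False
                    break
            if not survived:
                break
        if survived:
            positions.append(i)
    # Fine retrieval (steps B10-B14): sum of squared differences over all
    # corresponding elements (step B12), kept if below the fine threshold.
    return [i for i in positions
            if np.sum((repo_frames[i] - query) ** 2) <= fine_thresh]

# Example: the first repository frame equals the query; the second differs
query = np.ones((12, 4))
repo = [np.ones((12, 4)), np.full((12, 4), 5.0)]
matches = retrieve(repo, [0, 0], query, 0)
print(matches)  # [0]
```

The early abandonment inside the coarse loop is what allows most non-matching frames to be rejected after only a few element comparisons instead of a full matrix match.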
The beneficial effects of the invention are as follows. The invention combines coarse retrieval with fine retrieval, improving both retrieval speed and accuracy, where accuracy = 100% - (false-alarm rate + miss rate)/2. As shown in Figure 6, the method searches 8.5 hours of audio at a speed of 4 ms, which after conversion is about 3 times faster than the classical method (Pentium M 1.6 GHz). Because coarse and fine retrieval are combined, and in particular because of the introduction of the coarse-retrieval step, the method can use a larger frame length and frame shift while still guaranteeing accuracy: the frame length is 2~3 s and the frame shift 0.2~0.3 s, whereas the commonly used values are a 25 ms frame length and a 10 ms frame shift, which is why retrieval speed is improved.
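The accuracy formula above can be checked with a short calculation; the rates used in the example are illustrative numbers only, not measured results:

```python
def accuracy(false_alarm_rate, miss_rate):
    """accuracy = 100% - (false-alarm rate + miss rate) / 2, all in percent."""
    return 100.0 - (false_alarm_rate + miss_rate) / 2.0

print(accuracy(2.0, 4.0))  # 97.0
print(accuracy(0.0, 0.0))  # 100.0
```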
Embodiment
Embodiment one: as shown in Figs. 1~5, the fast audio retrieval method of this embodiment passes through a pretreatment stage and then enters a retrieval stage.
The pretreatment stage proceeds as follows:
Step A1, extract the MFCC parameters of the audio repository: perform feature extraction on the audio signal of the repository to be searched, obtaining characteristic information; the characteristic information obtained is represented as an a × m matrix, where a is an integer greater than 6 and less than or equal to 30; m and n are integers, m >> n, n ≥ 4; during feature extraction, the frame length is 2~3 s and the frame shift is 0.2~0.3 s;
According to this frame length and frame shift, the a × m matrix is divided into several a × n matrices, each matrix being one frame; n is an integer, n ≥ 4;
Step A2, extract the MFCC parameters of the query audio segment (the search target): using the frame length and frame shift of step A1, perform feature extraction on the audio signal of the query audio segment, obtaining characteristic information; the characteristic information obtained is one frame, represented as a 12 × n matrix, where n is an integer, n ≥ 4;
Step A3, classification: the frames obtained in step A1 are divided into 32 classes, classified by the elements of the first X columns of each frame, where X may be 3 or 4;
Step A4, classify the query audio segment according to the classes established in step A3;
The retrieval stage consists of checking the class of the query audio segment, coarse retrieval, and fine retrieval.
Checking the class of the query audio segment:
Step B1, begin retrieval and take a new frame; the frame is a frame from the audio repository to be searched;
Step B2, judge whether the frame of step B1 has reached the end of the repository; if so, end this round of checking the class of the query audio segment; otherwise, go to step B3;
Step B3, judge whether the classes match, that is, whether the class of the frame corresponding to the query audio segment matches the class of the current frame of the repository;
if so, go to step B4; otherwise, go to step B1;
Coarse retrieval:
Step B4, take a new element of the characteristic parameters: take a new element from the matching frame in the repository;
Step B5, compute the distance between this element of the matching frame and the corresponding element of the query audio segment;
Step B6, judge whether the distance obtained in step B5 exceeds the threshold; if so, go to step B1; otherwise, go to step B7;
Step B7, compute the distance for the next characteristic-parameter element of the matching frame;
Step B8, judge whether the element just taken is the last one; the last element means the last element of the first Y columns of the frame, where Y may be 2 to 8; if so, go to step B9; otherwise, go to step B4;
Step B9, record the position of this frame and store it in the position array;
Fine retrieval:
Step B10, take a new frame from the position array;
Step B11, judge whether the frame of step B10 has reached the end of the array; if so, end this retrieval; otherwise, go to step B12;
Step B12, compute the distance to the feature matrix of the query audio segment; the distance is the sum of the squared differences of the corresponding elements of the query feature matrix and the matrix being compared;
Step B13, judge whether the distance exceeds the threshold; if so, go to step B10; otherwise, go to step B14;
Step B14, a target has been found; then return to step B10 until this retrieval is finished.
Extracting the characteristic parameters: as shown in Figure 5, the features of an audio signal are generally extracted frame by frame; a signal segment of a certain length is taken as one frame and its characteristic parameters are computed, then the window is shifted by a certain distance (called the frame shift) and the next frame is taken, and so on.
This scheme uses the MFCC (Mel Frequency Cepstral Coefficient) parameters, which are highly robust, as the feature. Because of this robustness, the extraction can jump forward in large steps; that is, the frame shift can be made large, which reduces the amount of computation required during later retrieval. In this experiment, 12th-order MFCCs are obtained using 24 Mel filter banks and a DCT. The frame length may be 2.5 s and the frame shift 0.25 s.
The resulting MFCCs are classified by VQ (Vector Quantization): the previously extracted MFCC parameters are vector-quantized in groups of 36 elements (taking the first three columns of the 12th-order MFCC gives 36 elements), again to improve robustness to interference; the VQ codebook size is 32 (the 32 classes referred to in step A3).
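The class assignment described here (the first three columns of each 12th-order MFCC frame form a 36-element vector assigned to one of 32 codewords) can be sketched as a nearest-codeword search. The codebook below is random for illustration only; in practice it would be trained on the repository (for example by the LBG algorithm), which this sketch omits:

```python
import numpy as np

def classify_frames(frames, codebook):
    """Assign each frame to the nearest codeword (steps A3-A4).

    frames:   list of 12 x n MFCC matrices
    codebook: 32 x 36 array, one row per class
    The first 3 columns of each frame (36 elements) are compared against the
    codebook by squared Euclidean distance; the nearest row is the class.
    """
    labels = []
    for frame in frames:
        vec = frame[:, :3].reshape(-1)            # 12 x 3 -> 36 elements
        dists = np.sum((codebook - vec) ** 2, axis=1)
        labels.append(int(np.argmin(dists)))
    return labels

rng = np.random.default_rng(0)
codebook = rng.standard_normal((32, 36))          # untrained, for illustration
frames = [rng.standard_normal((12, 4)) for _ in range(5)]
labels = classify_frames(frames, codebook)
print(labels)  # five class indices in the range 0..31
```

A frame whose first three columns coincide with a codeword is assigned exactly that codeword's class, which is how the query segment is placed into one of the 32 classes in step A4.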
The advantage of this scheme lies in the retrieval stage, whose speed it greatly improves. First, according to the class of the query audio segment, only the features of the corresponding class in the feature database are searched; other classes are not examined at all. Then, within that class, the elements of each frame's characteristic parameters are compared one by one (no previous method did this; earlier methods matched the characteristic parameters as a whole), and as soon as a difference exceeds a preset threshold the frame is discarded and the next frame is examined. Finally, for the positions of the frames that remain, the distance to the query parameters is computed one by one (the smaller the distance, the greater the similarity); when the distance is below a given threshold, the frame is judged to be a target.
About the MFCC parameters:
The Mel-frequency cepstral coefficients (MFCC) take the auditory properties of the human ear into account: the spectrum is converted to a nonlinear spectrum based on the Mel scale and then transformed to the cepstral domain. Because the human auditory properties are fully considered and no prior assumptions are required, the MFCC parameters have good recognition performance and noise robustness.
MFCCs are computed with a filter bank whose filters have equal bandwidth on the Mel frequency scale. This is because human perception of frequencies above about 1000 Hz does not follow a linear relationship, but is approximately linear on a logarithmic frequency scale. According to acoustic measurements, the Mel frequency mapping can be given as mel = 2595·lg(1 + f/700),
where f is the ordinary frequency in Hz and mel is the Mel-scale frequency.
The computation proceeds as shown in Figure 7:
1. The original speech signal is windowed and divided into frames, giving the time-domain signal of each speech frame;
2. The time-domain signal is zero-padded to a sequence of length N, and the linear spectrum is obtained through the DFT; the conversion formula is X(k) = Σ_{n=0}^{N−1} x(n)·e^(−j2πnk/N), k = 0, 1, …, N − 1.
In practice this is usually computed with the FFT, where N is generally called the DFT (or FFT) window width.
3. The linear spectrum is passed through the Mel filter bank to obtain the Mel spectrum, and the logarithm of the energy is taken to obtain the log spectrum;
4. The log spectrum is transformed to the cepstral domain by the discrete cosine transform (DCT), giving the Mel-frequency cepstral coefficients (the MFCC parameters); in general about 12 to 16 coefficients are taken, and the MFCC coefficients are c(n) = Σ_{m=1}^{M} S(m)·cos(nπ(m − 0.5)/M), n = 1, 2, …, where S(m) is the log energy output of the m-th filter and M is the number of filters.
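Steps 1 to 4 above can be sketched end to end for a single analysis window. This is a simplified illustration: the triangular filter bank is built directly from the Mel mapping mel = 2595·lg(1 + f/700), and the window length, FFT size, and sampling rate are example values, not those of the scheme:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(x, sr, n_fft=512, n_filters=24, n_coeffs=12):
    """Compute one frame of MFCCs following the four steps above."""
    # Steps 1-2: window, zero-pad to n_fft, DFT (via FFT) -> power spectrum
    x = x * np.hamming(len(x))
    spectrum = np.abs(np.fft.rfft(x, n=n_fft)) ** 2
    # Step 3: triangular Mel filter bank, then logarithm -> log spectrum
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fbank = np.zeros((n_filters, len(bin_freqs)))
    hz = mel_to_hz(mel_points)
    for i in range(n_filters):
        lo, mid, hi = hz[i], hz[i + 1], hz[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        fbank[i] = np.maximum(0.0, np.minimum(rising, falling))
    log_spec = np.log(fbank @ spectrum + 1e-10)
    # Step 4: DCT of the log spectrum -> first n_coeffs cepstral coefficients
    m = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), m + 0.5) / n_filters)
    return dct @ log_spec

sr = 8000
t = np.arange(0, 0.025, 1.0 / sr)    # one 25 ms analysis window
x = np.sin(2 * np.pi * 440 * t)      # a 440 Hz test tone
coeffs = mfcc_frame(x, sr)
print(coeffs.shape)  # (12,)
```

Running this over successive windows and stacking the results column by column produces the a × m matrix of step A1 (with a = 12 here).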
The MFCC coefficients obtained directly in this way constitute the static features; taking the first- and second-order differences of the static features gives the corresponding dynamic features.
The Mel cepstral coefficients are affected by factors such as the number, shape, and distribution of the filters in the filter bank, and by the energy spectrum.
For robustness, this scheme uses only the static MFCC features, not the dynamic features.
Embodiment two: as shown in Figs. 1~5, in this embodiment the frame length during feature extraction in step A1 is 2.5 s and the frame shift is 0.25 s, which improves retrieval speed while guaranteeing retrieval precision. The other steps are identical to embodiment one.
Embodiment three: as shown in Figs. 1~5, in this embodiment a in step A1 is taken as 12; the 12th-order MFCCs are obtained using 24 Mel filter banks and a DCT. The other steps are identical to embodiment one.
Embodiment four: as shown in Figs. 1~5, in this embodiment the last element in step B8 means the last element of the first 4 columns of the frame. Coarse retrieval examines the first 4 columns because too many columns may reduce retrieval speed and increase the miss probability, while too few columns cannot screen the frames effectively. The other steps are identical to embodiment one.