CN1924850A

CN1924850A - Audio fast search method

Info

Publication number: CN1924850A
Application number: CN 200510086315
Authority: CN
Inventors: 梁伟; 张树武; 徐波
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2005-08-31
Filing date: 2005-08-31
Publication date: 2007-03-07
Anticipated expiration: 2025-08-31
Also published as: CN100424692C

Abstract

This invention provides a fast audio search method based on time-frequency-domain processing, with the following characteristics: it uses the sub-band energy ratio of the audio signal as the basic feature and histograms as the modeling method to detect the positions at which the target audio appears; it selects suitable sub-bands so that the signal in those bands is, in a statistical sense, most robust to noise and distortion; it adjusts the VQ quantization boundaries according to the spectral distribution of the target audio; it uses the widely used histogram matching algorithm; and it proposes performance evaluation criteria for audio search algorithms and designs objective evaluation parameters.

Description

Audio fast search method

Technical field

The present invention relates to the technical field of multimedia audio search systems and, more precisely, to a fast audio search method.

Background technology

At present, the information industry is developing at an unprecedented pace. Information media of all kinds, such as television, radio broadcasting, the Internet, and wireless communication, have also developed rapidly, and every day these media are flooded with large amounts of information. How to manage and monitor this information effectively so as to safeguard national information security is gradually receiving national attention. A sensitive-audio monitoring system based on time-frequency-domain audio processing is intended to meet the need for monitoring sensitive audio in the information security field.

Summary of the invention

The present invention proposes a robust fast audio search method that is highly robust to distortions such as noise. Its most basic characteristic is time-frequency-domain processing of the spectrum. Normalizing the spectrum makes the feature vectors both highly robust and highly discriminative. Based on the processed spectrum, sub-band energy-ratio histograms are built, and a histogram-overlap matching method is used to rapidly estimate the candidate positions of the target audio.

The fast audio search method is based on a time-frequency-domain description of the spectrum. It uses the sub-band energy ratio of the audio signal as the basic feature and histograms as the modeling method, and it detects the positions at which the target audio appears by skipping, which gives it a very high search speed. The method has four essential characteristics. First, suitable sub-bands are selected so that the signal in those bands is, in a statistical sense, most robust to noise and distortion. Second, the VQ quantization boundaries are adjusted adaptively according to the spectral distribution of the target audio. Third, the histogram matching algorithm widely used in image recognition is adopted; normalizing the sub-band energy signal avoids the false detections and missed detections that distortions such as background noise cause in conventional methods, and the computational cost is very small. Fourth, performance evaluation criteria for audio search algorithms are proposed, and objective evaluation parameters for the retrieval results are designed and analyzed. Experiments show that the proposed algorithm achieves good retrieval precision and search speed under stationary background noise and is also robust to non-stationary noise.

The fast audio search method can quickly locate a target audio segment of interest in a massive unknown audio stream. Its flow chart is shown in Fig. 1, and its steps are:

1) First, features are extracted from the target audio segment and from the audio stream. Feature extraction first filters the audio with bandpass filters and then computes the sub-band energy from the signal in each passband. Sub-band energy is computed with a frame length of 256 samples and a frame shift of 128 samples. The frequency sub-bands are evenly distributed on a logarithmic frequency axis;

2) From the sub-band energies computed in 1), the sub-band energy ratios of the target audio segment and of the audio stream are computed, and the sub-band energy ratio is taken as the basic feature vector;

3) To improve the robustness of the features to noise, the feature vectors computed in 2) are quantized. The quantization boundaries of each dimension are chosen so that each bin contains an equal number of the target audio's features in that dimension. A histogram model is built from the quantized feature vectors, and the quantization boundaries of each dimension are recorded. The feature vectors of the test audio stream are quantized with the quantization boundaries of the target audio;

4) The histogram of the target audio slides along the audio-stream features, a histogram is built at the current position of the audio stream, and the histogram of the target audio is matched against the histogram of the test audio stream to obtain a similarity. If the similarity exceeds a given threshold, the target audio is considered found at that position; otherwise the search skips to the next possible position, estimated from the current similarity, for the next match.
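
Step 1) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the bandpass filter bank is replaced by a per-frame FFT whose bins are grouped into log-spaced bands, and the sample rate, band count, and frequency range (`sr`, `n_bands`, `f_lo`, `f_hi`) are hypothetical values that the text does not give. Only the frame length of 256 samples and the frame shift of 128 samples come from the description.

```python
import numpy as np

def subband_energies(x, sr=8000, n_bands=8, f_lo=100.0, f_hi=3500.0,
                     frame_len=256, hop=128):
    """Return per-frame band energies with shape (n_frames, n_bands).

    Assumption: an FFT with grouped bins stands in for the bandpass
    filter bank; sr, n_bands, f_lo and f_hi are illustrative values.
    """
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Band edges evenly spaced on a logarithmic frequency axis.
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    window = np.hanning(frame_len)
    energies = np.zeros((n_frames, n_bands))
    for n in range(n_frames):
        frame = x[n * hop : n * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        for i in range(n_bands):
            in_band = (freqs >= edges[i]) & (freqs < edges[i + 1])
            energies[n, i] = power[in_band].sum()
    return energies
```

With a frame shift of half the frame length, consecutive frames overlap by 50%, so a signal of 1024 samples yields 7 frames.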

The present invention mainly comprises three modules: (1) feature extraction, (2) histogram construction, and (3) similarity measurement. Each is described in detail below.

Feature extraction. The method uses the sub-band energy ratio as the basic feature. The sub-band energy ratio describes, at each time instant, the distribution trend of the energy across the sub-bands. To improve the robustness of the feature, the sub-band energy ratio is vector-quantized; the quantization boundaries are chosen so that each bin contains an equal number of the target audio's features in each dimension. The quantization boundaries and the quantized feature vectors are stored in a file.

The feature vector can be expressed as:

Feature(n) = (f(n), g(n))    (5)

f(n) = (f_1(n), f_2(n), f_3(n), \ldots, f_M(n))    (6)

g(n) = (g_1(n), g_2(n), g_3(n), \ldots, g_M(n))    (7)

where n denotes time and M denotes the number of frequency bands in the feature vector.

f_i(n) = \alpha(n) \times E_i(n)    (8)

g_i(n) = \beta(n) \times ECR_i(n)    (9)

ECR_i(n) = \frac{E_i(n) - E_i(n-1)}{E_i(n-1)}    (10)

where E_i(n) denotes the frame energy output by the i-th bandpass filter for the n-th frame. Because short-time energy is overly sensitive to high signal levels, the short-time average magnitude is used instead to measure changes in the amplitude of the audio signal. It is defined as:

E_i(n) = \sum_{t=nN}^{(n+1)N} |g_i(t)|    (11)

α(n) is used to normalize each feature vector so as to eliminate the influence of volume, and β(n) plays the same role for the ECR components. They are defined as:

\alpha(n) = \frac{1}{\max_i (E_i(n))}    (12)

\beta(n) = \frac{1}{\max_i (ECR_i(n))}    (13)

where max denotes taking the maximum value.
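
Equations (8) to (13) can be illustrated with a short sketch. It assumes the sub-band values E_i(n) are already available as an array with one row per frame, and it drops the first frame because Eq. (10) needs a preceding frame. Two details are assumptions rather than part of the text: a small `eps` guards against division by zero, and β(n) is computed from the maximum of |ECR_i(n)| instead of the maximum of ECR_i(n) as printed, so that a frame whose largest ECR is zero or negative does not flip signs or divide by zero.

```python
import numpy as np

def normalized_features(E, eps=1e-12):
    """E: (n_frames, M) sub-band values. Returns f, g of shape (n_frames - 1, M)."""
    E = np.asarray(E, dtype=float)
    prev, cur = E[:-1], E[1:]
    # Eq. (10): relative change of each band between consecutive frames.
    ecr = (cur - prev) / (prev + eps)
    # Eqs. (8) and (12): scale each frame by its largest band value.
    f = cur / (cur.max(axis=1, keepdims=True) + eps)
    # Eqs. (9) and (13), using max |ECR| (assumption, see the note above).
    g = ecr / (np.abs(ecr).max(axis=1, keepdims=True) + eps)
    return f, g
```

Scaling the whole input by a constant gain leaves both `f` and `g` essentially unchanged, which is the volume invariance that the normalization is meant to provide.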

To improve the robustness of the feature, the sub-band energy ratio is vector-quantized. The vector quantization boundaries are determined from the distribution of the sub-band energy ratio of the target audio: they are chosen so that each bin contains an equal number of the target audio's features in each dimension.
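
One plausible reading of this equal-count criterion is to place the boundaries of each dimension at the quantiles of the target audio's feature values. The sketch below follows that reading; the number of levels per dimension (`n_levels`) is a hypothetical parameter that the text does not specify.

```python
import numpy as np

def fit_boundaries(target_feats, n_levels=4):
    """Interior boundaries per dimension, shape (M, n_levels - 1).

    Boundaries are quantiles of the target features, so each level
    receives roughly the same number of target values (assumed reading).
    """
    qs = np.linspace(0, 1, n_levels + 1)[1:-1]
    return np.quantile(np.asarray(target_feats, dtype=float), qs, axis=0).T

def quantize(feats, boundaries):
    """Map every feature value to a level index in [0, n_levels - 1]."""
    feats = np.asarray(feats, dtype=float)
    codes = np.empty(feats.shape, dtype=int)
    for d in range(feats.shape[1]):
        codes[:, d] = np.searchsorted(boundaries[d], feats[:, d], side="right")
    return codes
```

The boundaries are fitted once on the target audio and then reused unchanged for the test stream, as the method requires: `quantize(stream_feats, fit_boundaries(target_feats))`.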

Histogram construction and similarity measurement. After feature extraction, a model must be built for each audio segment. There are many ways to build such a model; histogram matching is adopted here because its computational cost is small and it is comparatively robust to noise.

In addition, to improve the temporal discriminability of the template, a target audio of duration t is divided equally into n sub-windows, and a histogram is built for each sub-window, denoted h_i^R.

The distance metric is histogram overlap. For example, the similarity between the target audio histogram and the histogram of the test audio stream at time n can be expressed as:

S(h^R, h^T(n)) = \frac{1}{L} \sum_{i=1}^{L} \min(h_i^R, h_i^T(n))    (1)

where h^R is the histogram of the reference audio, h^T(n) is the histogram of the test audio at time n, and L is the number of bins in the histogram.
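
The sub-window histograms and Eq. (1) can be sketched as follows. The text does not say how the per-dimension quantization levels are mapped to histogram bins; the sketch assumes each frame's M level indices are combined into a single joint bin index, giving L = n_levels^M bins. The function names and that mapping are illustrative assumptions. Eq. (1) is implemented as printed, with the 1/L factor; how the histograms themselves are scaled is not stated in the text.

```python
import numpy as np

def code_histogram(codes, n_levels):
    """Histogram of joint codes; codes has shape (n_frames, M), values in [0, n_levels)."""
    codes = np.asarray(codes)
    # Mixed-radix index: one bin per combination of levels (assumed mapping).
    weights = n_levels ** np.arange(codes.shape[1])
    joint = codes @ weights
    return np.bincount(joint, minlength=n_levels ** codes.shape[1])

def subwindow_histograms(codes, n_levels, n_sub):
    """Split the frames into n_sub equal sub-windows, one histogram each."""
    return [code_histogram(part, n_levels)
            for part in np.array_split(np.asarray(codes), n_sub)]

def intersection(h_ref, h_test):
    """Eq. (1) as printed: (1/L) * sum_i min(h_ref[i], h_test[i])."""
    h_ref = np.asarray(h_ref, dtype=float)
    h_test = np.asarray(h_test, dtype=float)
    return np.minimum(h_ref, h_test).sum() / h_ref.size
```

Because the histogram discards the order of frames inside a window, splitting the target into sub-windows is what restores some sensitivity to temporal order.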

Because the similarity between histograms is correlated with the sliding position of the histogram, the similarity at time n_1 can be used to estimate an upper bound on the similarity at time n_2. If the estimate is below the specified threshold, the matching computation at that position can be skipped, which reduces the amount of computation. The prediction formula is as follows:

S_{up}(h_i^R, h_i^T(n_2)) = S(h_i^R, h_i^T(n_1)) + \frac{n_2 - n_1}{P_i}    (2)

The skip step of each sub-window can therefore be expressed as:

w_i = \begin{cases} \mathrm{floor}(P_i(\theta - S_i)) + 1 & \text{if } S_i < \theta, \\ 1 & \text{otherwise,} \end{cases}    (3)

where w_i denotes the skip step and floor(x) denotes the largest integer not greater than x. The final skip step w is given by:

w = \max_i (w_i)    (4)
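
Equations (2) to (4) lead to a simple skipping search loop, sketched below. By Eq. (2), any position n_2 with n_2 - n_1 ≤ P_i(θ - S_i) has an upper bound that does not exceed θ, so the first position that still needs to be checked lies floor(P_i(θ - S_i)) + 1 steps ahead. The text does not define P_i or say how the sub-window similarities are combined into a decision; the sketch takes `P` as given and assumes that a hit requires every sub-window similarity to exceed θ. The callback `sim_at` is a hypothetical stand-in for the histogram matching at one position.

```python
import math

def skip_width(similarities, P, theta):
    """Eqs. (3) and (4): per-sub-window skip w_i, then w = max_i w_i."""
    widths = []
    for s_i, p_i in zip(similarities, P):
        if s_i < theta:
            widths.append(math.floor(p_i * (theta - s_i)) + 1)
        else:
            widths.append(1)
    return max(widths)

def search(sim_at, n_positions, P, theta):
    """Slide over positions 0..n_positions-1, skipping where no match is possible.

    sim_at(n) returns the list of sub-window similarities at position n.
    Assumption: a hit requires all sub-window similarities to exceed theta.
    """
    hits, n = [], 0
    while n < n_positions:
        sims = sim_at(n)
        if min(sims) > theta:
            hits.append(n)
        n += skip_width(sims, P, theta)
    return hits
```

For example, `skip_width([0.5, 0.25], [100, 40], 0.75)` gives per-sub-window steps of 26 and 21, so the search advances by 26 positions.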

Algorithm performance evaluation. The performance of the algorithm was verified by detecting the occurrences of advertisements in television programs. If the detected position of a target advertisement differs from its actual broadcast position by no more than 1 second, the advertisement is considered correctly detected. Search performance is measured by accuracy ξ and recall δ, which are combined into an overall accuracy τ as follows:

τ = \frac{2 \times ξ \times δ}{ξ + δ}
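
The evaluation can be sketched as below. The text does not define ξ and δ explicitly; the sketch assumes the usual definitions, ξ as the fraction of detections that are correct and δ as the fraction of actual occurrences that are detected, with the 1-second tolerance stated above. The function name and the one-to-one matching of detections to occurrences are illustrative assumptions.

```python
def evaluate(detected, actual, tol=1.0):
    """detected, actual: lists of positions in seconds. Returns (xi, delta, tau)."""
    unmatched = list(actual)
    correct = 0
    for d in detected:
        # Match each detection to at most one actual occurrence within tol.
        match = next((a for a in unmatched if abs(d - a) <= tol), None)
        if match is not None:
            unmatched.remove(match)
            correct += 1
    xi = correct / len(detected) if detected else 0.0
    delta = correct / len(actual) if actual else 0.0
    tau = 2 * xi * delta / (xi + delta) if xi + delta > 0 else 0.0
    return xi, delta, tau
```

As written, τ is the harmonic mean of ξ and δ, so it is pulled toward the lower of the two values.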

Description of drawings

Fig. 1 is a flow chart of the fast audio search of the present invention.

Fig. 2 shows the short-time energy waveforms of an audio segment after comb filtering.

Fig. 3 shows the energy waveform of each frequency band after low-pass filtering.

Fig. 4 shows the energy waveform of each frequency band after normalization.

Embodiment

Fig. 1 shows the fast audio search flow. The flow first applies a comb filter bank to the test audio and the reference audio and processes the output to obtain feature vectors; it then builds histograms for the reference audio; finally, it searches the test audio using the reference audio histograms. Each skip is closely related to the current matching similarity of the search window.

Fig. 2 shows the short-time energy waveforms of an audio segment after comb filtering, that is, the sub-band short-time energy waveforms obtained after the audio segment passes through the comb filter bank. Different colors represent the energy waveforms of different frequency bands.

Fig. 3 shows the energy waveform of each frequency band after low-pass filtering, that is, the short-time energy curves obtained after the sub-band short-time energy waveforms pass through a low-pass smoothing filter.

Fig. 4 shows the normalized short-time energy curves obtained by normalizing, along the frequency axis, the short-time energy curves output by the low-pass smoothing filter.

Table 1: Comparison of retrieval results

Search method	Advertisement duration	Accuracy	Recall	Correctness	Search time
Correlation coefficient matching method	≤5 seconds	64.8%	78.2%	71.5%	21 hours 56 minutes
	6-10 seconds	91.1%	88.5%	89.8%	
	11-20 seconds	97.1%	85.8%	91.4%	
	>20 seconds	100%	89.7%	94.8%	
Histogram method	≤5 seconds	93.0%	94.2%	93.6%	30 minutes 14 seconds
	6-10 seconds	95.3%	96.0%	95.7%	
	11-20 seconds	99.2%	97.5%	98.4%	
	>20 seconds	100%	98.2%	99.1%	

Claims

1. A fast audio search method, which uses the sub-band energy ratio of the audio signal as the basic feature and histograms as the modeling method, and detects the positions at which the target audio appears by skipping, characterized in that: first, suitable sub-bands are selected so that the signal in those bands is, in a statistical sense, most robust to noise and distortion; second, the VQ quantization boundaries are adjusted adaptively according to the spectral distribution of the target audio; third, the histogram matching algorithm widely used in image recognition is adopted, and normalizing the sub-band energy signal avoids the false detections and missed detections that distortion from background noise interference causes in conventional methods, with a very small computational cost; fourth, performance evaluation criteria for audio search algorithms are established, and objective evaluation parameters for the retrieval results are designed and analyzed.

2. The fast audio search method according to claim 1, characterized in that the method can quickly locate a target audio segment of interest in a massive unknown audio stream, the steps of which are:

3. The fast audio search method according to claim 1 or 2, characterized by comprising feature extraction, histogram construction, and similarity calculation,

1) Feature extraction

The method uses the sub-band energy ratio as the basic feature. The sub-band energy ratio describes, at each time instant, the distribution trend of the energy across the sub-bands. To improve the robustness of the feature, the sub-band energy ratio is vector-quantized; the quantization boundaries are chosen so that each bin contains an equal number of the target audio's features in each dimension. The quantization boundaries and the quantized feature vectors are stored in a file,

2) Histogram construction and similarity measurement

After feature extraction, a model must be built for each audio segment. There are many ways to build such a model; histogram matching is adopted because its computational cost is small and it is comparatively robust to noise,

In addition, to improve the temporal discriminability of the template, a target audio of duration t is divided equally into 4 sub-windows, and a histogram is built for each sub-window, denoted h_i^R,

S(h^R, h^T(n)) = \frac{1}{L} \sum_{i=1}^{L} \min(h_i^R, h_i^T(n))    (1)

where h^R is the histogram of the reference audio, h^T(n) is the histogram of the test audio at time n, and L is the number of bins in the histogram,

Because the similarity between histograms is correlated with the sliding position of the histogram, the similarity at time n_1 can be used to estimate an upper bound on the similarity at time n_2. If the estimate is below the specified threshold, the matching computation at that position can be skipped, which reduces the amount of computation. The prediction formula is as follows:

S_{up}(h_i^R, h_i^T(n_2)) = S(h_i^R, h_i^T(n_1)) + \frac{n_2 - n_1}{P_i}    (2)

w_i = \begin{cases} \mathrm{floor}(P_i(\theta - S_i)) + 1 & \text{if } S_i < \theta, \\ 1 & \text{otherwise,} \end{cases}    (3)

where w_i denotes the skip step and floor(x) denotes the largest integer not greater than x; the final skip step w is given by:

w = \max_i (w_i).    (4)