CN112214635B - Fast audio retrieval method based on cepstrum analysis - Google Patents

Fast audio retrieval method based on cepstrum analysis

Info

Publication number
CN112214635B
CN112214635B (application CN202011145738.9A)
Authority
CN
China
Prior art keywords
audio
retrieval
features
sample
section
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011145738.9A
Other languages
Chinese (zh)
Other versions
CN112214635A (en)
Inventor
邵玉斌
杨贵安
龙华
杜庆治
刘晶
唐维康
陈亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202011145738.9A priority Critical patent/CN112214635B/en
Publication of CN112214635A publication Critical patent/CN112214635A/en
Application granted granted Critical
Publication of CN112214635B publication Critical patent/CN112214635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/61Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a fast audio retrieval method based on cepstrum analysis, belonging to the technical field of audio retrieval. The method comprises the following steps: first, a retrieval audio feature library is constructed by cyclically extracting frequency-domain features, according to the signal energy ratio, from each piece of audio in the retrieval audio library; second, sample audio fingerprints are extracted, i.e. frequency-domain features are extracted, according to the signal energy ratio, from the sample audio input by the user to form the sample audio features; third, the optimal mixing point is determined according to the sample length, and the sample audio features and the retrieval audio features are mixed at the optimal mixing point so that the cepstrum analysis result of the mixed features is more accurate; fourth, for sample audio retrieval, the retrieval audio feature with the highest similarity to the sample audio features is found in the retrieval audio feature library by cepstrum analysis, and the corresponding retrieval audio information is the sample audio retrieval result. The audio features extracted by the method are highly representative and occupy little space; during retrieval, cepstrum analysis is performed directly on the mixture of the two audio features, and since it applies only Fourier-related transforms to the mixed features, the computation is small and fast. Therefore, addressing the low retrieval efficiency of prior-art audio retrieval applications, the invention greatly improves retrieval efficiency while guaranteeing retrieval accuracy.

Description

Fast audio retrieval method based on cepstrum analysis
Technical Field
The invention relates to a fast audio retrieval method based on cepstrum analysis, and belongs to the technical field of audio retrieval.
Background
With the advent of the big data age, the amount of multimedia information on the Internet has grown explosively. Traditional text-label-based audio retrieval builds different label libraries for different fields; it lacks generality and cannot meet people's multimedia retrieval needs. A method was therefore proposed that constructs an audio fingerprint database and performs audio retrieval through hash indexing. Most subsequent audio retrieval algorithms are improvements on this idea, but the key problem remains that retrieval accuracy and retrieval efficiency cannot be balanced.
Disclosure of Invention
The invention aims to provide a fast audio retrieval method based on cepstrum analysis, which can greatly improve the retrieval efficiency on the premise of ensuring the retrieval accuracy.
In order to achieve the purpose, the invention adopts the following technical scheme:
s1, constructing a retrieval audio feature library: frequency-domain features are cyclically extracted, according to the signal energy ratio, from each piece of audio in the retrieval audio library to build the retrieval audio feature library used for retrieval;
s2, extracting sample audio fingerprints: frequency-domain features are extracted, according to the signal energy ratio, from the sample audio input by the user to form the sample audio features;
s3, determining the optimal mixing point according to the sample length: the sample audio features and the retrieval audio features are mixed at the optimal mixing point, so that the cepstrum analysis result of the mixed features is more accurate;
s4, sample audio retrieval: the retrieval audio feature with the highest similarity to the sample audio features is found in the retrieval audio feature library by cepstrum analysis, and the corresponding retrieval audio information is the sample audio retrieval result;
preferably, before steps S1 and S2 are performed, the method of extracting audio features according to the signal energy ratio in the frequency domain is described as follows:
in the frequency domain, each frame of the signal is processed from the first data point downward in order: the energy value of the current data point is divided by the total energy of all data points in the frame, and the results form a column of energy ratios, computed as in formula (1):

Er(k_t, k_f) = E(k_t, k_f) / Σ_{q=1}^{Q} E(k_t, q)    (1)

wherein Er denotes the energy ratio, E denotes the energy, k_t denotes the corresponding time point, k_f denotes the corresponding frequency point, and Q denotes the upper frequency limit;
finding the frequency position corresponding to the maximum of that column of energy ratios; all frames of a piece of audio are processed in this way, and the results represent the audio's features as a one-dimensional array whose dimension equals the number of audio frames;
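A minimal sketch of this extraction step (the helper name, frame length, hop size, and window choice are assumptions not fixed by the text):

```python
import numpy as np

def extract_features(signal, frame_len=1024, hop=512):
    """Per-frame feature: the frequency bin where the energy ratio peaks.

    Each bin's energy is divided by the frame's total energy, as in
    formula (1); the argmax over bins is the extracted feature value.
    """
    window = np.hamming(frame_len)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        energy = np.abs(np.fft.rfft(frame)) ** 2   # E(k_t, k_f)
        er = energy / energy.sum()                 # energy ratios of the frame
        features.append(int(np.argmax(er)))        # peak frequency position
    return np.array(features)                      # one value per frame
```

Since the denominator is constant within a frame, the argmax of the energy ratio coincides with that of the raw energy; the ratio merely normalizes each frame's values to sum to 1.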
preferably, the step S1 includes:
s1.1, performing Fourier transform on each frame of signals obtained after framing and windowing the retrieved audio signals;
s1.2, extracting the frequency position corresponding to the maximum point of each frame's energy ratio in the frequency domain as the feature; the extraction result represents one piece of retrieval audio features by a one-dimensional array, and the one-dimensional retrieval audio feature F_T of length N is represented as follows:
F_T = (f_t1  f_t2  f_t3  …  f_tN)    (2)
s1.3, traversing all audios in the search audio library in the modes of S1.1 and S1.2, and naming each section of search audio features by respective search audio information so as to construct a search audio feature library;
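The traversal in s1.3 amounts to keying each feature array by its audio's information; a minimal sketch (the function name is hypothetical, and the toy extractor below merely stands in for the energy-ratio feature described above):

```python
import numpy as np

def build_feature_library(audio_library, extract):
    """s1.3 sketch: traverse every retrieval audio and store its 1-D
    feature array under the audio's own name/information."""
    return {name: extract(signal) for name, signal in audio_library.items()}

# Toy usage with two fake "audio" arrays and a trivial stand-in extractor.
library = {"song_a": np.ones(100), "song_b": np.zeros(100)}
features = build_feature_library(library, lambda s: np.array([int(s.sum())]))
```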
preferably, the step S2 includes:
s2.1, the user inputs an audio clip of any retrieval audio as the sample audio signal; the clip lasts R seconds and may contain white noise at a certain signal-to-noise ratio;
s2.2, performing Fourier transform on each frame of signals obtained after framing and windowing the sample audio signals;
s2.3, extracting the frequency position corresponding to the maximum point of each frame's energy ratio in the frequency domain as the feature; the extraction result represents one piece of sample audio features by a one-dimensional array, and the one-dimensional sample audio feature F_S of length M is represented as follows:
F_S = (f_s1  f_s2  f_s3  …  f_sM)    (3)
preferably, the step S3 includes:
s3.1, taking the first piece of retrieval audio in the retrieval audio library as the original audio, and intercepting an R-second audio clip from the original audio as the audio to be detected;
s3.2, extracting the features of the audio to be detected, of length L1;
s3.3, sliding-mixing the features of the audio to be detected with the original audio features of length L2, starting from the first point and ending at point L2 − L1;
s3.4, performing Fourier transform, taking the modulus, taking the logarithm and performing the inverse Fourier transform on each mixing result of s3.3 to obtain cepstral-domain data; after eliminating the autocorrelation peak of the cepstral data, finding the peak in the first half of the data and calculating from it the similarity between the features of the audio to be detected and the original audio features; recording each similarity result with its mixing point information, and returning the mixing point with the highest recorded similarity, which is the optimal mixing point τ for sample audio features of this length;
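The sliding search of s3.3–s3.4 can be sketched as follows (function and parameter names are hypothetical; the cepstral peak height stands in for the similarity value, whose exact formula the text does not give):

```python
import numpy as np

def cepstral_peak(mixed, skip=400):
    """FFT -> modulus -> log -> inverse FFT; zero the first `skip`
    points to suppress the autocorrelation peak, then return the
    largest value in the first half of the cepstrum."""
    cep = np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(mixed)) + 1e-12)))
    half = cep[:len(cep) // 2].copy()
    half[:skip] = 0.0
    return float(half.max())

def best_mixing_point(sample_feat, original_feat, skip=400):
    """Slide the short feature over the long one (additive mixing at each
    offset j) and return the offset with the highest cepstral peak."""
    L1, L2 = len(sample_feat), len(original_feat)
    best_j, best_sim = 0, -np.inf
    for j in range(L2 - L1 + 1):
        mixed = original_feat.astype(float).copy()
        mixed[j:j + L1] += sample_feat
        sim = cepstral_peak(mixed, skip=skip)
        if sim > best_sim:
            best_j, best_sim = j, sim
    return best_j, best_sim
```

The search is run once per sample length; the returned offset is then reused for every library entry in step S4.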
preferably, the step S4 includes:
s4.1, mixing the sample audio features with the retrieval audio features at the optimal mixing point calculated in step S3 to obtain the mixed features;
s4.2, performing Fourier transform, taking the modulus, taking the logarithm and performing the inverse Fourier transform on the mixed features obtained in s4.1 to obtain cepstral-domain data; after eliminating the autocorrelation peak of the cepstral data, finding the peak in the first half of the data and calculating from it the similarity between the sample audio features and the retrieval audio features;
Both the retrieval audio features and the sample audio features are one-dimensional arrays, so they can be regarded as two waveform signals. The principle of calculating the similarity between two waveform signals by cepstrum analysis of their mixture is as follows:
suppose that the retrieved audio feature signal is x 1 (t) the sample audio feature signal is x 2 (t):
Figure BDA0002739673050000031
Where τ (τ > 0) is the optimum mixing point, i.e. the audio feature signal x is retrieved 1 (t) and sample audio feature signal x 2 Time delay between (t), a 1 And a 2 Is an attenuation factor of the signal, and a 1 ∈(0,1),a 2 ∈(0,1);
The mixed signal is constructed as:
y(t) = x_1(t) + x_2(t) = x(t) * (a_1 · δ(t) + a_2 · δ(t − τ))    (5)
wherein * denotes convolution;
according to the definition of the power cepstrum, the cepstrum analysis result of the mixed signal is:

ŷ(t) = x̂(t) + log(a_1²) · δ(t) + Σ_{n=1}^{∞} ((−1)^{n+1}/n) · (a_2/a_1)^n · [δ(t − nτ) + δ(t + nτ)]    (6)

wherein x̂(t) is the power cepstrum of x(t);
As can be seen from formula (6), the power cepstrum of the mixed signal contains impulse peaks at the optimal mixing point position and at its integer multiples. After eliminating the autocorrelation-peak interference in the cepstrum, the impulse peak is found in the first half of the power cepstrum, and the similarity is calculated from it;
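The impulse predicted by formula (6) can be checked numerically; the sketch below (delay and attenuation factors are arbitrary example values) mixes a signal with its attenuated, delayed copy and locates the power-cepstrum peak:

```python
import numpy as np

# y(t) = a1*x(t) + a2*x(t - tau): a signal mixed with a delayed echo.
rng = np.random.default_rng(1)
x = rng.standard_normal(2048)
a1, a2, tau = 1.0, 0.6, 100
y = a1 * x.copy()
y[tau:] += a2 * x[:-tau]

# Power cepstrum: inverse FFT of the log power spectrum.
cep = np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(y)) ** 2 + 1e-12)))

half = cep[:len(cep) // 2].copy()
half[:20] = 0.0              # suppress the low-quefrency / autocorrelation region
peak = int(np.argmax(half))  # formula (6) predicts an impulse at tau
```

With these values the dominant peak sits at (or within a point or two of) quefrency 100, with smaller peaks near integer multiples of the delay.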
s4.3, cyclically performing s4.1 and s4.2 on the sample features and each piece of retrieval audio features in the retrieval audio feature library, recording the similarity results and the corresponding retrieval audio information, and returning the retrieval audio information with the highest recorded similarity, which is the sample audio retrieval result;
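Steps s4.1–s4.3 can then be sketched as one scoring loop (names are hypothetical; again the cepstral peak height stands in for the similarity value):

```python
import numpy as np

def cepstral_peak(mixed, skip=400):
    """s4.2 pipeline: FFT, modulus, log, inverse FFT; zero the first
    `skip` points, then take the peak of the first half."""
    cep = np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(mixed)) + 1e-12)))
    half = cep[:len(cep) // 2].copy()
    half[:skip] = 0.0
    return float(half.max())

def retrieve(sample_feat, feature_library, mix_point, skip=400):
    """s4.3 sketch: mix the sample into every library feature at the
    precomputed optimal mixing point and keep the best-scoring name."""
    best_name, best_sim = None, -np.inf
    for name, feat in feature_library.items():
        end = mix_point + len(sample_feat)
        if end > len(feat):
            continue                      # sample would overrun this feature
        mixed = feat.astype(float).copy()
        mixed[mix_point:end] += sample_feat
        sim = cepstral_peak(mixed, skip=skip)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name, best_sim
```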
compared with the traditional method that the index value of each data point pair in the two audio fingerprints needs to be matched in the Hash search, the search method provided by the invention directly performs cepstrum analysis on the mixed result of the two audio features to obtain the similarity, and the cepstrum analysis only performs Fourier correlation transformation on the mixed result, so that the calculation amount is small and the calculation speed is high.
Drawings
FIG. 1 is a flowchart illustrating an exemplary audio retrieval method according to the present invention
FIG. 2 is a schematic diagram of feature extraction according to the present invention
FIG. 3 is a flow chart of the present invention for constructing a search audio feature library
FIG. 4 is a flow chart of extracting sample audio features according to the present invention
FIG. 5 is a waveform of the audio characteristics of the present invention
FIG. 6 is a flow chart of the present invention for determining the optimal mixing point according to the sample length
FIG. 7 is a flow chart of mixed feature cepstrum analysis in accordance with the present invention
FIG. 8 shows the cepstrum analysis result of the mixed features of the present invention
Detailed Description
The invention will be further described by means of embodiments in conjunction with the accompanying drawings.
To solve the prior art's inability to balance retrieval accuracy and retrieval efficiency, an embodiment of the invention provides a fast audio retrieval method based on cepstrum analysis, as shown in fig. 1, comprising the following operations:
s1, constructing a retrieval audio feature library: frequency-domain features are cyclically extracted, according to the signal energy ratio, from each piece of audio in the retrieval audio library to build the retrieval audio feature library used for retrieval;
s2, extracting sample audio fingerprints: frequency-domain features are extracted, according to the signal energy ratio, from the sample audio input by the user to form the sample audio features;
s3, determining the optimal mixing point according to the sample length: the sample audio features and the retrieval audio features are mixed at the optimal mixing point, so that the cepstrum analysis result of the mixed features is more accurate;
s4, sample audio retrieval: the retrieval audio feature with the highest similarity to the sample audio features is found in the retrieval audio feature library by cepstrum analysis, and the corresponding retrieval audio information is the sample audio retrieval result;
before proceeding to steps S1 and S2, it is necessary to explain a method of extracting audio features according to the signal energy ratio in the frequency domain in the embodiment:
as shown in fig. 2, in the frequency domain each frame of the signal is processed from the first data point downward in order: the energy value of the current data point is divided by the total energy of the frame's data points, yielding a column of energy ratios; the frequency position corresponding to the maximum of this column is found; all frames of a piece of audio are processed in this way, and the results represent the audio's features as a one-dimensional array whose dimension equals the number of audio frames;
on the basis of the above embodiment, step S1 is described with reference to fig. 3:
s1.1, performing Fourier transform on each frame of signals obtained after framing and windowing the retrieved audio signals;
the retrieval audio library contains 19 pieces of audio (I = 19), each 1 min long; the 1st piece (i = 1) is divided into 29999 frames (N = 29999), and a Fourier transform is applied to the 1st frame (n = 1); when n = N, every frame has been transformed;
s1.2, calculating the signal energy ratio of each frame of the retrieval audio in the frequency domain according to formula (1), for all 29999 frames;
extracting the frequency position corresponding to the maximum point of each frame's energy ratio as the feature; the one-dimensional retrieval audio feature F_T, of length 29999 frames, is represented as follows:
F_T = (5  1  3  …  8)
s1.3, traversing all audio in the retrieval audio library in the manner of s1.1 and s1.2, naming each piece of retrieval audio features by its retrieval audio information; when i = I, the whole retrieval audio library has been traversed, and the construction of the retrieval audio feature library is complete;
on the basis of the above embodiment, step S2 is described with reference to fig. 4:
s2.1, the user inputs an audio clip of any retrieval audio as the sample audio signal; the clip lasts 20 seconds (R = 20) and contains white noise with a signal-to-noise ratio of 10 dB;
s2.2, performing a Fourier transform on each frame obtained after framing and windowing the sample audio signal;
the sample audio input by the user is divided into 9999 frames (M = 9999), and a Fourier transform is applied to the 1st frame (m = 1); when m = M, every frame has been transformed;
s2.3, calculating the signal energy ratio of each frame of the sample audio in the frequency domain according to formula (1), for all 9999 frames;
extracting the frequency position corresponding to the maximum point of each frame's energy ratio as the feature; the one-dimensional sample audio feature F_S, of length 9999 frames, is represented as follows:
F_S = (1  1  3  …  6)
FIG. 5 shows the retrieval audio features and the sample audio features;
on the basis of the above embodiment, step S3 is described with reference to fig. 6:
s3.1, taking the first piece of retrieval audio in the retrieval audio library as the original audio, and intercepting from the original audio an audio clip whose duration matches the sample audio, i.e. 20 seconds (R = 20), as the audio to be detected;
s3.2, extracting the features of the audio to be detected; their length is 4999 frames (L1 = 4999);
s3.3, sliding-mixing the features of the audio to be detected with the original audio features of length 29999 frames (L2 = 29999), starting from point j = 1 up to j = L2 − L1 = 25000, which covers all mixing positions of the two features;
s3.4, performing Fourier transform, taking the modulus, taking the logarithm and performing the inverse Fourier transform on each mixing result of s3.3 to obtain cepstral-domain data; setting the first 400 cepstral points to 0 to eliminate autocorrelation-peak interference; finding the peak in the first half of the cepstral data and calculating from it the similarity between the features of the audio to be detected and the original audio features; recording each similarity with its mixing point, and returning the mixing point j with the highest recorded similarity, which is the optimal mixing point for sample audio features of this length;
best mixing point reference values in the examples of Table 1
On the basis of the above embodiment, step S4 is described with reference to fig. 7:
s4.1, mixing the sample audio features with the retrieval audio features of the i-th piece (i = 1) of the retrieval audio library at the optimal mixing point, frame 5400 (j = 5400), to obtain the mixed features;
s4.2, performing Fourier transform, taking the modulus, taking the logarithm and performing the inverse Fourier transform on the mixed features obtained in s4.1 to obtain cepstral-domain data; setting the first 400 cepstral points to 0 to eliminate autocorrelation-peak interference; finding the peak in the first half of the cepstral data and calculating from it the similarity between the sample audio features and the retrieval audio features;
FIG. 8 shows the cepstrum analysis result of mixing the retrieval audio features and the sample audio features of FIG. 5, with the first 400 cepstral-domain points set to 0;
s4.3, recording the similarity result and the corresponding retrieval audio information, incrementing i and returning to s4.1; when i = I, the retrieval audio information with the highest recorded similarity is returned as the sample audio retrieval result;
table 2 sample audio retrieval results in the examples
It can be seen from the above table that the highest similarity is 89%, corresponding to the retrieved audio information being "min.

Claims (2)

1. A fast audio retrieval method based on cepstrum analysis is characterized in that:
s1, establishing a retrieval audio feature library, and extracting frequency domain features from each section of audio in the retrieval audio library according to the signal energy ratio in a circulating manner to establish the retrieval audio feature library for retrieval;
constructing the audio feature library: traversing each piece of audio in the retrieval audio library, extracting the frequency position corresponding to the maximum point of each frame's energy ratio in the audio's frequency domain as the feature, representing each piece of retrieval audio features by a one-dimensional array, and naming each piece of retrieval audio features by its retrieval audio information, thereby constructing the retrieval audio feature library;
s2, extracting sample audio fingerprints, and extracting frequency domain features from the sample audio input by the user according to the signal energy ratio to form sample audio features;
extracting sample audio fingerprints, wherein the sample audio features are used for matching with the features of the retrieval audio feature library; extracting a frequency position corresponding to a maximum point of the energy ratio of each frame of signal in a sample audio frequency domain as a feature, wherein the extraction result represents a section of sample audio feature by a group of one-dimensional arrays;
extracting audio features according to a signal energy ratio in a frequency domain, wherein each frame of signal in the frequency domain starts from a first data point and sequentially goes down, the energy value of the current data point is divided by the sum of the energy of the whole frame of data points, the obtained result is a row of energy ratios, the frequency position corresponding to the maximum value in the row of energy ratios is found out, all frames of a section of audio are calculated according to the method, and the calculation result represents a section of audio features by a group of one-dimensional arrays, wherein the dimension is the number of audio frames;
s3, determining an optimal mixing point according to the sample length, wherein the sample audio features and the retrieval audio features are mixed at the optimal mixing point, so that the cepstrum analysis result of the mixing features is more accurate;
determining an optimal mixing point according to the sample length; taking a first section of retrieval audio in a retrieval audio library as original audio, intercepting an audio segment with the duration consistent with that of sample audio in the original audio as to-be-detected audio, performing sliding mixing on the characteristics of the to-be-detected audio and the characteristics of the original audio in a window mode, and performing cepstrum analysis on the mixed characteristics to obtain a mixing point corresponding to the highest similarity between the two characteristics as an optimal mixing point;
and S4, sample audio retrieval, namely searching for the retrieval audio features with the highest similarity to the sample audio features in the retrieval audio feature library by using a cepstrum analysis method, wherein the corresponding retrieval audio information is the sample audio retrieval result.
2. The fast audio retrieval method based on cepstral analysis according to claim 1, wherein: step S4, sample audio retrieval; mixing the sample audio feature cycle with each section of retrieval audio features in the audio feature library by using the optimal mixing point to obtain mixed features, performing cepstrum analysis on the mixed features to calculate the similarity between the two audio features, recording the similarity result and corresponding retrieval audio information, and returning the retrieval audio information corresponding to the highest similarity in the record, namely the audio retrieval result.
CN202011145738.9A 2020-10-23 2020-10-23 Fast audio retrieval method based on cepstrum analysis Active CN112214635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011145738.9A CN112214635B (en) 2020-10-23 2020-10-23 Fast audio retrieval method based on cepstrum analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011145738.9A CN112214635B (en) 2020-10-23 2020-10-23 Fast audio retrieval method based on cepstrum analysis

Publications (2)

Publication Number Publication Date
CN112214635A CN112214635A (en) 2021-01-12
CN112214635B true CN112214635B (en) 2022-09-13

Family

ID=74054994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011145738.9A Active CN112214635B (en) 2020-10-23 2020-10-23 Fast audio retrieval method based on cepstrum analysis

Country Status (1)

Country Link
CN (1) CN112214635B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784097B (en) * 2021-01-21 2024-03-26 百果园技术(新加坡)有限公司 Audio feature generation method and device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7177795B1 (en) * 1999-11-10 2007-02-13 International Business Machines Corporation Methods and apparatus for semantic unit based automatic indexing and searching in data archive systems
CN101465122A (en) * 2007-12-20 2009-06-24 株式会社东芝 Method and system for detecting phonetic frequency spectrum wave crest and phonetic identification
CN105788603A (en) * 2016-02-25 2016-07-20 深圳创维数字技术有限公司 Audio identification method and system based on empirical mode decomposition

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4599420B2 (en) * 2008-02-29 2010-12-15 株式会社東芝 Feature extraction device
CN103065627B (en) * 2012-12-17 2015-07-29 中南大学 Special purpose vehicle based on DTW and HMM evidence fusion is blown a whistle sound recognition methods
CN107357875B (en) * 2017-07-04 2021-09-10 北京奇艺世纪科技有限公司 Voice search method and device and electronic equipment
CN108369813B (en) * 2017-07-31 2022-10-25 深圳和而泰智能家居科技有限公司 Specific voice recognition method, apparatus and storage medium
CN107731220B (en) * 2017-10-18 2019-01-22 北京达佳互联信息技术有限公司 Audio identification methods, device and server
EP3701528B1 (en) * 2017-11-02 2023-03-15 Huawei Technologies Co., Ltd. Segmentation-based feature extraction for acoustic scene classification
CN108735230B (en) * 2018-05-10 2020-12-04 上海麦克风文化传媒有限公司 Background music identification method, device and equipment based on mixed audio
CN110880329B (en) * 2018-09-06 2022-11-04 腾讯科技(深圳)有限公司 Audio identification method and equipment and storage medium
CN109117622B (en) * 2018-09-19 2020-09-01 北京容联易通信息技术有限公司 Identity authentication method based on audio fingerprints
CN110363120B (en) * 2019-07-01 2020-07-10 上海交通大学 Intelligent terminal touch authentication method and system based on vibration signal
CN110310661B (en) * 2019-07-03 2021-06-11 云南康木信科技有限责任公司 Method for calculating two-path real-time broadcast audio time delay and similarity
CN110767248B (en) * 2019-09-04 2022-03-22 太原理工大学 Anti-modulation interference audio fingerprint extraction method
CN111710348A (en) * 2020-05-28 2020-09-25 厦门快商通科技股份有限公司 Pronunciation evaluation method and terminal based on audio fingerprints

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7177795B1 (en) * 1999-11-10 2007-02-13 International Business Machines Corporation Methods and apparatus for semantic unit based automatic indexing and searching in data archive systems
CN101465122A (en) * 2007-12-20 2009-06-24 株式会社东芝 Method and system for detecting phonetic frequency spectrum wave crest and phonetic identification
CN105788603A (en) * 2016-02-25 2016-07-20 深圳创维数字技术有限公司 Audio identification method and system based on empirical mode decomposition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Features for Content-Based Audio Retrieval; Christian Breiteneder et al.; Advances in Computers; 2010-12-31; Vol. 78; pp. 71–150 *

Also Published As

Publication number Publication date
CN112214635A (en) 2021-01-12

Similar Documents

Publication Publication Date Title
KR100820385B1 (en) Robust and Invariant Audio Pattern Matching
CN102959624B (en) System and method for audio media recognition
US20130275421A1 (en) Repetition Detection in Media Data
CN110335625A (en) The prompt and recognition methods of background music, device, equipment and medium
JP2004534274A (en) Method and system for displaying music information on a digital display for use in content-based multimedia information retrieval
CN103971689A (en) Audio identification method and device
CN110599987A (en) Piano note recognition algorithm based on convolutional neural network
US8718803B2 (en) Method for calculating measures of similarity between time signals
US9122753B2 (en) Method and apparatus for retrieving a song by hummed query
CN112035696B (en) Voice retrieval method and system based on audio fingerprint
CN112214635B (en) Fast audio retrieval method based on cepstrum analysis
Zhang et al. System and method for automatic singer identification
CN112732972B (en) Audio fingerprint generation system and method
Patil et al. Content-based audio classification and retrieval: A novel approach
Wang et al. Audio fingerprint based on spectral flux for audio retrieval
KR101661666B1 (en) Hybrid audio fingerprinting apparatus and method
Serrano et al. A new fingerprint definition for effective song recognition
Ferreira et al. Time complexity evaluation of cover song identification algorithms
Cai et al. Cross-similarity measurement of music sections: A framework for large-scale cover song identification
Qian et al. A novel algorithm for audio information retrieval based on audio fingerprint
CN114125368B (en) Conference audio participant association method and device and electronic equipment
Li et al. Query by humming based on music phrase segmentation and matching
CN117877525A (en) Audio retrieval method and device based on variable granularity characteristics
Kamesh et al. Audio fingerprinting with higher matching depth at reduced computational complexity
Yunjing Similarity matching method for music melody retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant