CN105448290A - Variable frame rate audio feature extraction method - Google Patents


Publication number
CN105448290A
CN105448290A
Authority
CN
China
Legal status: Granted
Application number
CN201510782814.XA
Other languages
Chinese (zh)
Other versions
CN105448290B (en)
Inventor
张晖
刘宝
Current Assignee
Nanjing Post and Telecommunication University
Original Assignee
Nanjing Post and Telecommunication University
Priority date
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201510782814.XA
Publication of CN105448290A
Application granted
Publication of CN105448290B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a variable frame rate audio feature extraction method. The feature vector of a reference frame is obtained by weighted fusion of the feature vectors of several selected frames; the Euclidean distances between the reference-frame feature vector and the feature vectors of several candidate frames are then computed, and the audio frame that best represents the candidate frames is selected from them for audio retrieval. The method avoids the masking effect of the selected frames on the candidate frames, facilitates the extraction of useful audio information, and improves the accuracy of audio retrieval.

Description

Audio feature extraction method with variable frame rate
Technical Field
The invention relates to the technical field of digital audio processing, in particular to a variable frame rate audio feature extraction method.
Background
With the explosive growth of the internet, the volume of audio media online increases day by day, and traditional audio search based on text labels can no longer meet growing user needs. In recent years, with the rise of cloud computing and big data, content-based audio retrieval has become a focus of attention for researchers at home and abroad.
Audio feature extraction is key to content-based audio retrieval, and its general processing flow first requires framing the audio data. Framing usually uses overlapping frames to maintain continuity between frames: the length of each divided frame is called the frame length, and the non-overlapping part between adjacent frames is called the frame offset. There are currently two main ways to extract audio features: fixed frame rate and variable frame rate. With a fixed frame rate, the frame length and frame offset are kept unchanged during framing, so the method cannot adapt well to changes in the audio spectrum. With a variable frame rate, a dynamic frame offset is used during framing, which effectively overcomes the inability of the fixed frame rate to reflect spectral change.
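As an illustration of the overlapping framing described above, here is a minimal sketch; the function name `frame_signal` and the sample-count units are our own, not from the patent:

```python
def frame_signal(samples, frame_len, frame_offset):
    """Split a 1-D sample sequence into overlapping frames.

    frame_len and frame_offset are counts of samples; choosing
    frame_offset < frame_len makes consecutive frames overlap,
    which maintains continuity between frames.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_offset
    return frames

# At a 16 kHz sampling rate, a 30 ms frame is 480 samples and a
# 2 ms offset is 32 samples, so adjacent frames overlap heavily.
```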
In the existing variable frame rate audio feature extraction methods, framing is first performed with a fixed frame rate and a small frame offset, the similarity between each candidate frame and the selected frame is then computed, and the variable frame rate effect is achieved by dropping frames. This approach easily causes the selected frame to mask the candidate frames, so that the audio frame most representative of the audio characteristics is discarded during framing, audio information is lost, and retrieval accuracy is low.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a variable frame rate audio feature extraction method. The method selects, from the candidate frames, the audio frame that best represents the audio change, thereby avoiding the masking effect of the selected frame on the candidate frames, improving the effectiveness of audio feature extraction, and improving the accuracy of audio retrieval.
To solve the above technical problem, the invention adopts the following technical scheme:
The invention provides a variable frame rate audio feature extraction method, whose specific process is as follows:
step a, framing the audio data according to a preset frame length and frame offset to obtain audio frames, and computing the feature vector of each audio frame;
step b, sequentially selecting n audio frames from the audio frames as the initial selected frames, and taking the subsequent h audio frames as candidate frames, n and h being preset natural numbers;
step c, computing the reference-frame feature vector from the n selected-frame feature vectors:

$$\tilde{v}_m = \sum_{i=m-n+1}^{m} w_i v_i$$

where $v_i$ denotes the feature vector of the i-th selected frame and $w_i$ the weight of the i-th selected frame; i is the index of the selected frame and is a positive integer; m denotes that there are m selected frames in total, and $1 < n \le m$;
Each audio-frame feature vector is represented by Q components; the components of the i-th selected-frame feature vector $v_i$ are written $v_i^k$, where k is an integer and $1 \le k \le Q$. The k-th component $\tilde{v}_m^k$ of the reference-frame feature vector $\tilde{v}_m$ is computed as

$$\tilde{v}_m^k = \sum_{i=m-n+1}^{m} w_i^k v_i^k$$

where $v_i^k$ denotes the k-th component of the i-th selected-frame feature vector and $w_i^k$ the weight of the k-th component of the i-th selected-frame feature vector, computed as

$$w_i^k = \frac{1/(\sigma_i^k)^2}{\sum_{j=m-n+1}^{m} 1/(\sigma_j^k)^2}$$

which satisfies $\sum_{i=m-n+1}^{m} w_i^k = 1$, where $\sigma_i^k$ is computed as

$$\sigma_i^k = \left| v_i^k - \frac{1}{n} \sum_{j=m-n+1}^{m} v_j^k \right|$$
step d, using the reference-frame feature vector $\tilde{v}_m$, computing the Euclidean distance $d_l$ between each of the h candidate-frame feature vectors and the reference-frame feature vector:

$$d_l = \sqrt{\sum_{k=1}^{Q} \left( \tilde{v}_m^k - v_l^k \right)^2}$$

where l is the index of the candidate frame and is a positive integer, with $1 \le l \le h$;
step e, selecting the qualifying candidate frame according to the Euclidean distances $d_l$ between the h candidate-frame feature vectors and the reference-frame feature vector, and adding it to the selected frames, as follows:
step e-1, selecting the maximum Euclidean distance $d_{max}$ among the $d_l$;
step e-2, comparing the maximum Euclidean distance $d_{max}$ with a threshold D: if $d_{max}$ exceeds the threshold, the corresponding candidate frame is added to the selected frames and the h audio frames following it are taken as the new candidate frames; otherwise all h candidate frames are discarded and the subsequent h audio frames are taken as candidate frames for selection;
Steps d and e are then repeated: h audio frames at a time are taken from the remaining audio frames as candidate frames, and the qualifying candidate frame among them is added to the selected frames. If fewer than h audio frames remain, the actually remaining audio frames are taken as candidate frames and processed in the same way, until no audio frames remain. The feature vectors of all selected frames are then taken as the feature vector of the audio data.
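The weighted fusion of step c can be sketched as follows. This is our own Python rendering of the formulas above; the fallback to uniform weights is an added assumption, since the stated weight formula is undefined whenever some deviation $\sigma_i^k$ is zero:

```python
def reference_frame(selected, n):
    """Fuse the last n selected-frame feature vectors into a
    reference vector, one dimension at a time (step c)."""
    recent = selected[-n:]
    q = len(recent[0])
    ref = []
    for k in range(q):
        comps = [v[k] for v in recent]
        mean_k = sum(comps) / n
        sig2 = [(c - mean_k) ** 2 for c in comps]  # (sigma_i^k)^2
        if any(s == 0.0 for s in sig2):
            # A component equal to the mean gives sigma = 0 and an
            # undefined weight; fall back to uniform weights here
            # (an assumption, not spelled out in the patent).
            w = [1.0 / n] * n
        else:
            inv = [1.0 / s for s in sig2]
            total = sum(inv)
            w = [x / total for x in inv]  # weights sum to 1
        ref.append(sum(wi * c for wi, c in zip(w, comps)))
    return ref
```

Components that lie close to the per-dimension mean receive large weights, so the reference vector is dominated by the stable parts of the recent selected frames.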
Advantageous effects: the variable frame rate audio feature extraction method provided by the invention obtains the reference-frame feature vector by weighted fusion of the selected-frame feature vectors, computes the Euclidean distance between the reference-frame feature vector and each candidate-frame feature vector, and, according to these distances, selects from the candidate frames the audio frame that best represents them for audio retrieval. This avoids the masking effect of the selected frames on the candidate frames, helps extract more useful audio information, and improves the accuracy of audio retrieval.
Drawings
FIG. 1 is a schematic diagram of computing the reference-frame feature vector by weighted fusion;
FIG. 2 is a schematic flow diagram of selecting the qualifying candidate frame and adding it to the selected frames according to the reference frame;
FIG. 3 is a schematic flow diagram of an implementation of the variable frame rate audio feature extraction method.
Detailed Description
To describe the proposed variable frame rate audio feature extraction method in more detail, embodiments are illustrated below with reference to the accompanying drawings.
The invention mainly comprises three parts: first, weighted fusion of several selected-frame feature vectors to obtain the reference-frame feature vector; second, computation of the Euclidean distances between the reference-frame feature vector and several candidate-frame feature vectors; third, selection of the qualifying candidate frame according to these Euclidean distances and its addition to the selected frames.
1. The reference frame is calculated using a weighted fusion method:
FIG. 1 illustrates the computation of the reference-frame feature vector by weighted fusion. All m selected-frame feature vectors are denoted $v_1, v_2, v_3, \ldots, v_m$. Weighted fusion is performed over the last 4 selected frames, whose feature vectors are denoted $v_{m-3}, v_{m-2}, v_{m-1}, v_m$. Each audio-frame feature vector is represented by Q components, the components of the i-th selected-frame feature vector $v_i$ being written $v_i^k$, where k is an integer and $1 \le k \le Q$. The features extracted from each audio frame include, but are not limited to, time-domain features, frequency-domain features, or combinations thereof, such as autocorrelation coefficients and Mel-frequency cepstral coefficients.
Each component of the reference-frame feature vector is obtained by weighted fusion of the corresponding components of the 4 selected-frame feature vectors:

$$\tilde{v}_m^k = \sum_{i=m-3}^{m} w_i^k v_i^k \qquad (1)$$

where $v_i^k$ denotes the k-th component of the i-th selected-frame feature vector, $\tilde{v}_m^k$ the k-th component of the reference-frame feature vector, and $w_i^k$ the weight of the k-th component of the i-th selected-frame feature vector; i and k are positive integers indexing, respectively, the selected frames and the components of each audio-frame feature vector; m denotes that there are m selected frames in total.

The weights $w_i^k$ directly affect the quality of the reference frame, and thus the selection of candidate frames. In this scheme $w_i^k$ is computed as

$$w_i^k = \frac{1/(\sigma_i^k)^2}{\sum_{j=m-3}^{m} 1/(\sigma_j^k)^2} \qquad (2)$$

and the weights satisfy $\sum_{i=m-3}^{m} w_i^k = 1$, where $\sigma_i^k$ denotes the absolute difference between the k-th component of the i-th selected-frame feature vector and the mean of the k-th components of the n fused selected-frame feature vectors:

$$\sigma_i^k = \left| v_i^k - \frac{1}{n} \sum_{j=m-3}^{m} v_j^k \right| \qquad (3)$$

where j is the index of a selected frame and is a positive integer.
Note that in this embodiment the reference-frame feature vector is computed by weighted fusion of 4 selected-frame feature vectors; another number of selected-frame feature vectors may be used instead, as the situation requires.
2. Computing the Euclidean distances between the reference-frame feature vector and several candidate-frame feature vectors:

Supposing selection is performed among the subsequent 3 candidate frames, the Euclidean distance $d_l$ between each candidate-frame feature vector and the obtained reference-frame feature vector is computed as

$$d_l = \sqrt{\sum_{k=1}^{Q} \left( \tilde{v}_m^k - v_l^k \right)^2} \qquad (4)$$

This distance measures the degree of similarity between the reference frame and each candidate frame, where l is the index of the candidate frame, a positive integer with $1 \le l \le 3$.
3. Selecting the qualifying candidate frame according to the Euclidean distances $d_l$ and adding it to the selected frames:

FIG. 2 shows the flow of selecting the qualifying candidate frame from the candidates according to $d_l$. Among the 3 candidate frames, the one with the largest Euclidean distance is the least similar to the reference frame and therefore best reflects the audio change. The specific process is as follows:

Step 3-1: select the maximum Euclidean distance $d_{max}$ among the $d_l$.

Step 3-2: compare $d_{max}$ with the threshold D. If it exceeds the threshold, add the candidate frame corresponding to $d_{max}$ to the selected frames and take the 3 audio frames following it as the new candidate frames; otherwise discard all 3 candidate frames and take the subsequent 3 audio frames as candidate frames for selection.
Because the audio frame that best represents the candidate frames is selected and compared against the preset threshold D, more frames are kept where the audio spectrum changes rapidly and fewer frames where it changes slowly, which avoids the masking effect of the selected frames on the candidate frames.
Note that in this embodiment the qualifying candidate frame is selected from the subsequent 3 candidate frames by way of example; it may equally be selected from another number of candidate frames, as the situation requires.
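Steps 3-1 and 3-2 can be sketched as below. This is a hypothetical helper (`select_from_candidates` and its `None` return for a rejected group are our own naming):

```python
import math

def select_from_candidates(ref, candidates, threshold):
    """Return the index of the candidate farthest (in Euclidean
    distance) from the reference vector if that distance exceeds
    the threshold, else None, meaning the whole group is dropped."""
    dists = [math.sqrt(sum((r - c) ** 2 for r, c in zip(ref, cand)))
             for cand in candidates]
    l_max = max(range(len(dists)), key=dists.__getitem__)
    return l_max if dists[l_max] > threshold else None
```

The farthest candidate is the least similar to the reference frame, i.e. the one that best reflects the audio change; the threshold decides whether it is worth keeping at all.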
FIG. 3 shows the execution flow of the variable frame rate audio feature extraction method provided by the invention. The specific flow is as follows:

Step 1: frame the audio data according to a preset frame length and frame offset, for example a frame length of 30 ms and a frame offset of 2 ms, to obtain the audio frames, and compute the feature vector of each audio frame.

Step 2: according to the preset n and h, sequentially select n audio frames as the initial selected frames and take the subsequent h audio frames as candidate frames.

Step 3: compute the weight of each component of the n selected-frame feature vectors by the weighted fusion method, and fuse them to obtain the reference-frame feature vector.

Step 4: compute the Euclidean distances between the reference-frame feature vector and the h candidate-frame feature vectors.

Step 5: select the maximum among these Euclidean distances and compare it with the threshold. If it exceeds the threshold, add the corresponding candidate frame to the selected frames; otherwise discard this group of candidate frames and select from the next group.

Steps 3, 4 and 5 are repeated: h audio frames at a time are taken from the remaining audio frames as candidate frames, and the qualifying candidate frame is added to the selected frames. If fewer than h audio frames remain, the actually remaining audio frames are taken as candidate frames and processed in the same way, until no audio frames remain. The feature vectors of all selected frames are then taken as the feature vector of the audio data.
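The overall loop of steps 2 through 5 might be sketched as follows. For brevity this sketch uses a plain unweighted mean of the last n selected frames as the reference vector (the patent uses the deviation-based weighted fusion of equations (1) to (3)); `select_frames` is our own naming, and restarting the candidate window just after a newly accepted frame follows our reading of step 5:

```python
import math

def select_frames(features, n, h, threshold):
    """Greedy variable-frame-rate selection over a list of
    Q-dimensional feature vectors; returns indices of selected frames.

    Simplification: the reference vector is the unweighted mean of
    the last n selected frames, not the patent's weighted fusion."""
    selected = list(range(min(n, len(features))))
    pos = len(selected)
    while pos < len(features):
        group = list(range(pos, min(pos + h, len(features))))
        recent = [features[i] for i in selected[-n:]]
        q = len(recent[0])
        ref = [sum(v[k] for v in recent) / len(recent) for k in range(q)]
        dists = [math.sqrt(sum((ref[k] - features[g][k]) ** 2
                               for k in range(q))) for g in group]
        best = max(range(len(dists)), key=dists.__getitem__)
        if dists[best] > threshold:
            selected.append(group[best])
            pos = group[best] + 1   # next candidates follow the new frame
        else:
            pos = group[-1] + 1     # drop the whole candidate group
    return selected
```

On a toy sequence with one spectral jump, the frame at the jump is kept while flat stretches contribute few frames, illustrating how more frames are retained where the signal changes.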

Claims (1)

1. A variable frame rate audio feature extraction method, characterized in that the specific process is as follows:
step a, framing the audio data according to a preset frame length and frame offset to obtain audio frames, and computing the feature vector of each audio frame;
step b, sequentially selecting n audio frames from the audio frames as the initial selected frames, and taking the subsequent h audio frames as candidate frames, n and h being preset natural numbers;
step c, computing the reference-frame feature vector of the selected frames from the n selected-frame feature vectors:

$$\tilde{v}_m = \sum_{i=m-n+1}^{m} w_i v_i$$

where $v_i$ denotes the feature vector of the i-th selected frame and $w_i$ the weight of the i-th selected frame; i is the index of the selected frame and is a positive integer; m denotes that there are m selected frames in total, and $1 < n \le m$;
Each audio-frame feature vector is represented by Q components; the components of the i-th selected-frame feature vector $v_i$ are written $v_i^k$, where k is an integer and $1 \le k \le Q$. The k-th component $\tilde{v}_m^k$ of the reference-frame feature vector $\tilde{v}_m$ is computed as

$$\tilde{v}_m^k = \sum_{i=m-n+1}^{m} w_i^k v_i^k$$

where $v_i^k$ denotes the k-th component of the i-th selected-frame feature vector and $w_i^k$ the weight of the k-th component of the i-th selected-frame feature vector, computed as

$$w_i^k = \frac{1/(\sigma_i^k)^2}{\sum_{j=m-n+1}^{m} 1/(\sigma_j^k)^2}$$

which satisfies $\sum_{i=m-n+1}^{m} w_i^k = 1$, where $\sigma_i^k$ is computed as

$$\sigma_i^k = \left| v_i^k - \frac{1}{n} \sum_{j=m-n+1}^{m} v_j^k \right|$$
step d, using the reference-frame feature vector $\tilde{v}_m$, computing the Euclidean distance $d_l$ between each of the h candidate-frame feature vectors and the reference-frame feature vector:

$$d_l = \sqrt{\sum_{k=1}^{Q} \left( \tilde{v}_m^k - v_l^k \right)^2}$$

where l is the index of the candidate frame and is a positive integer, with $1 \le l \le h$;
step e, selecting the qualifying candidate frame according to the Euclidean distances $d_l$ between the h candidate-frame feature vectors and the reference-frame feature vector, and adding it to the selected frames, as follows:
step e-1, selecting the maximum Euclidean distance $d_{max}$ among the $d_l$;
step e-2, comparing the maximum Euclidean distance $d_{max}$ with a threshold D: if $d_{max}$ exceeds the threshold, the corresponding candidate frame is added to the selected frames and the h audio frames following it are taken as the new candidate frames; otherwise all h candidate frames are discarded and the subsequent h audio frames are taken as candidate frames for selection;
Steps d and e are then repeated: h audio frames at a time are taken from the remaining audio frames as candidate frames, and the qualifying candidate frame among them is added to the selected frames. If fewer than h audio frames remain, the actually remaining audio frames are taken as candidate frames and processed in the same way, until no audio frames remain. The feature vectors of all selected frames are then taken as the feature vector of the audio data.
CN201510782814.XA 2015-11-16 2015-11-16 Variable frame rate audio feature extraction method Active CN105448290B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510782814.XA CN105448290B (en) 2015-11-16 2015-11-16 Variable frame rate audio feature extraction method

Publications (2)

Publication Number Publication Date
CN105448290A true CN105448290A (en) 2016-03-30
CN105448290B CN105448290B (en) 2019-03-01

Family

ID=55558397

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510782814.XA Active CN105448290B (en) Variable frame rate audio feature extraction method

Country Status (1)

Country Link
CN (1) CN105448290B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1505572A1 (en) * 2002-05-06 2005-02-09 Prous Science S.A. Voice recognition method
CN102332262A (en) * 2011-09-23 2012-01-25 哈尔滨工业大学深圳研究生院 Method for intelligently identifying songs based on audio features
US20130166294A1 (en) * 1999-12-10 2013-06-27 At&T Intellectual Property Ii, L.P. Frame Erasure Concealment Technique for a Bitstream-Based Feature Extractor
WO2015092711A1 (en) * 2013-12-18 2015-06-25 Isis Innovation Ltd. Method and apparatus for automatic speech recognition
CN105046699A (en) * 2015-07-09 2015-11-11 硅革科技(北京)有限公司 Motion video superposition contrast method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
严焰 et al.: "Research on gesture recognition based on HMM", Journal of Central China Normal University (Natural Science Edition) *
杨涛 et al.: "Keyframe extraction from motion capture data based on quantum-behaved particle swarm optimization", Application Research of Computers *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109643555A (en) * 2016-07-04 2019-04-16 哈曼贝克自动系统股份有限公司 Automatically correct the loudness level in the audio signal comprising voice signal
CN109643555B (en) * 2016-07-04 2024-01-30 哈曼贝克自动系统股份有限公司 Automatic correction of loudness level in an audio signal containing a speech signal
CN108922556A (en) * 2018-07-16 2018-11-30 百度在线网络技术(北京)有限公司 Sound processing method, device and equipment
CN108922556B (en) * 2018-07-16 2019-08-27 百度在线网络技术(北京)有限公司 Sound processing method, device and equipment
CN114567780A (en) * 2022-02-25 2022-05-31 杭州当虹科技股份有限公司 Video frame alignment method
CN114567780B (en) * 2022-02-25 2024-08-23 杭州当虹科技股份有限公司 Video frame alignment method
CN117097909A (en) * 2023-10-20 2023-11-21 深圳市星易美科技有限公司 Distributed household audio and video processing method and system
CN117097909B (en) * 2023-10-20 2024-02-02 深圳市星易美科技有限公司 Distributed household audio and video processing method and system

Also Published As

Publication number Publication date
CN105448290B (en) 2019-03-01


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant