CN105448290B - A variable frame rate audio feature extraction method - Google Patents

A variable frame rate audio feature extraction method

Info

Publication number
CN105448290B
CN105448290B
Authority
CN
China
Prior art keywords
frame
feature vector
audio
candidate
Euclidean distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510782814.XA
Other languages
Chinese (zh)
Other versions
CN105448290A (en)
Inventor
Zhang Hui (张晖)
Liu Bao (刘宝)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201510782814.XA
Publication of CN105448290A
Application granted
Publication of CN105448290B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search

Abstract

The present invention proposes a variable frame rate audio feature extraction method. Multiple already-selected frame feature vectors are fused by weighting to obtain a reference frame feature vector, and the Euclidean distance between the reference frame feature vector and each of several candidate frame feature vectors is computed separately. Based on these distances, the audio frame that best represents the candidate frames is selected from them for audio retrieval. The method avoids the shielding effect of selected frames on candidate frames and extracts more useful audio information, thereby improving the accuracy of audio retrieval.

Description

A variable frame rate audio feature extraction method
Technical field
The present invention relates to the field of digital audio processing technology, and in particular to a variable frame rate audio feature extraction method.
Background technique
With the rapid growth of the Internet, the volume of audio media data on the network increases day by day, and traditional audio search based on text annotation can no longer satisfy people's growing needs. In recent years, with the rise of cloud computing and big data, content-based audio retrieval has increasingly become a focus of attention for scholars at home and abroad.
Audio feature extraction is the key to content-based audio retrieval. The general processing flow first divides the audio data into frames. To preserve continuity between adjacent frames, framing usually uses overlapping windows: the length of each frame is called the frame length, and the non-overlapping part between consecutive frames is called the frame shift. There are currently two main modes of audio feature extraction: fixed frame rate and variable frame rate. With a fixed frame rate, the frame length and frame shift remain unchanged throughout framing; this cannot adapt well to the time-varying characteristics of the audio spectrum. With a variable frame rate, a dynamic frame shift is used during framing, which effectively compensates for the inability of a fixed frame rate to reflect spectral variation.
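The overlapping framing described above can be sketched in plain Python. The sample rate and the frame length / frame shift values below (30 ms and 2 ms, the values the embodiment later uses as an example) are illustrative assumptions, not mandated by the method:

```python
def frame_signal(samples, frame_len, frame_shift):
    """Split a sample sequence into overlapping frames.

    frame_len: number of samples per frame (the "frame length").
    frame_shift: number of samples between frame starts (the
    non-overlapping part between consecutive frames).
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames

# With an assumed 8 kHz sample rate: 30 ms -> 240 samples, 2 ms -> 16 samples.
signal = list(range(1000))
frames = frame_signal(signal, frame_len=240, frame_shift=16)
```

Each frame overlaps its neighbor by `frame_len - frame_shift` samples, which is what preserves continuity between adjacent frames.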
Existing variable frame rate audio feature extraction methods all perform framing with a fixed frame rate and a small frame shift, then compute the similarity between candidate frames and already-selected frames and achieve a variable frame rate by dropping frames. Such methods easily lead to a shielding effect of the selected frames on the candidate frames, so that the audio frames that best exhibit the acoustic characteristics are discarded during framing; this loses audio information and keeps retrieval accuracy low.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention proposes a variable frame rate audio feature extraction method. By selecting, from multiple candidate frames, the audio frame that best represents the audio variation, the method avoids the shielding effect of selected frames on candidate frames, improves the effectiveness of audio feature extraction, and thereby improves audio retrieval accuracy.
To solve the above technical problem, the technical solution adopted by the present invention is as follows.
The variable frame rate audio feature extraction method proposed by the present invention proceeds as follows:
Step a: divide the audio data into frames according to a preset frame length and frame shift to obtain audio frames, and compute a feature vector for each audio frame;
Step b: select the first n audio frames in order as the initial selected frames, and take the following h audio frames as candidate frames; n and h are preset natural numbers;
Step c: from the n selected frame feature vectors, compute the reference frame feature vector $\bar{v}$:
$$\bar{v} = \sum_{i=m-n+1}^{m} w_i v_i$$
where $v_i$ denotes the feature vector of the i-th selected frame and $w_i$ the weight of the i-th selected frame; i is the index of a selected frame and a positive integer; m is the number of currently selected frames, and 1 < n ≤ m;
Each audio frame feature vector is represented by Q components, the components of the i-th selected frame feature vector $v_i$ being written $v_i = (v_i^{(1)}, v_i^{(2)}, \ldots, v_i^{(Q)})$, where k is an integer and 1 ≤ k ≤ Q; the k-th component $\bar{v}^{(k)}$ of the reference frame feature vector $\bar{v}$ is therefore computed as:
$$\bar{v}^{(k)} = \sum_{i=m-n+1}^{m} w_i^{(k)} v_i^{(k)}$$
where $v_i^{(k)}$ denotes the k-th component of the i-th selected frame feature vector and $w_i^{(k)}$ the weight of the k-th component of the i-th selected frame feature vector, computed as:
$$w_i^{(k)} = \frac{\Delta_i^{(k)}}{\sum_{j=m-n+1}^{m} \Delta_j^{(k)}}$$
subject to $\sum_{i=m-n+1}^{m} w_i^{(k)} = 1$, where $\Delta_i^{(k)}$ is the absolute difference between the k-th component of the i-th selected frame feature vector and the mean of the k-th components of the n selected frame feature vectors:
$$\Delta_i^{(k)} = \left| v_i^{(k)} - \frac{1}{n} \sum_{j=m-n+1}^{m} v_j^{(k)} \right|$$
Step d: using the reference frame feature vector $\bar{v}$, compute the Euclidean distance $d_l$ between each of the h candidate frame feature vectors and the reference frame feature vector;
The Euclidean distance $d_l$ is computed as:
$$d_l = \sqrt{\sum_{k=1}^{Q} \left( u_l^{(k)} - \bar{v}^{(k)} \right)^2}$$
where $u_l^{(k)}$ denotes the k-th component of the l-th candidate frame feature vector, and l is the index of a candidate frame, a positive integer with 1 ≤ l ≤ h;
Step e: according to the Euclidean distances $d_l$ between the h candidate frame feature vectors and the reference frame feature vector, select the qualifying candidate frame and add it to the selected frames, as follows:
Step e-1: from the Euclidean distances $d_l$, select the maximum Euclidean distance $d_{max}$;
Step e-2: compare the maximum Euclidean distance $d_{max}$ with a threshold D. If it exceeds the threshold, add the candidate frame corresponding to $d_{max}$ to the selected frames and take the following h audio frames as the new candidate frames; otherwise discard all h candidate frames and continue the selection with the next h audio frames as candidate frames;
Repeat step c, step d and step e: from the remaining audio frames, select h audio frames in order as candidate frames, and from these h candidate frames add the qualifying candidate frame to the selected frames. If fewer than h audio frames remain, take the actually remaining audio frames as candidate frames and select the qualifying candidate frame among them by the same method, until no audio frames remain. The feature vectors of all selected frames are then taken as the feature vectors of the audio data.
Beneficial effects: the variable frame rate audio feature extraction method proposed by the present invention fuses multiple selected frame feature vectors by weighting to obtain a reference frame feature vector, computes the Euclidean distance between the reference frame feature vector and each of several candidate frame feature vectors, and, based on these distances, selects from the candidate frames the audio frame that best represents them for audio retrieval. This avoids the shielding effect of selected frames on candidate frames and extracts more useful audio information, thereby improving the accuracy of audio retrieval.
Brief description of the drawings
Fig. 1 is a schematic diagram of computing the reference frame feature vector by weighted fusion;
Fig. 2 is a flow diagram of selecting a qualifying candidate frame from the candidate frames according to the reference frame and adding it to the selected frames;
Fig. 3 is a flow diagram of the execution of the variable frame rate audio feature extraction method.
Specific embodiment
To describe the proposed variable frame rate audio feature extraction method in more detail, it is explained below with reference to the drawings.
The invention mainly comprises three parts: first, weighted fusion of multiple selected frame feature vectors to obtain the reference frame feature vector; second, computation of the Euclidean distances between the reference frame feature vector and multiple candidate frame feature vectors; third, selection, according to these Euclidean distances, of the qualifying candidate frame and its addition to the selected frames.
1. Computing the reference frame by weighted fusion:
Fig. 1 shows the computation of the reference frame feature vector by weighted fusion. The feature vectors of all m currently selected frames are denoted $v_1, v_2, v_3, \ldots, v_m$. Suppose the fusion uses the last 4 selected frames, whose feature vectors are $v_{m-3}, v_{m-2}, v_{m-1}, v_m$. Each audio frame feature vector has Q components; the components of the i-th selected frame feature vector $v_i$ are written $v_i = (v_i^{(1)}, \ldots, v_i^{(Q)})$, where k is an integer and 1 ≤ k ≤ Q. The feature types extracted from each audio frame include, but are not limited to, time-domain features of the audio, frequency-domain features, or combinations thereof, such as autocorrelation coefficients and mel-frequency cepstral coefficients.
Each component of the reference frame feature vector is obtained by weighted fusion of the corresponding components of these 4 selected frame feature vectors:
$$\bar{v}^{(k)} = \sum_{i=m-3}^{m} w_i^{(k)} v_i^{(k)}$$
where $v_i^{(k)}$ denotes the k-th component of the i-th selected frame feature vector, $\bar{v}^{(k)}$ the k-th component of the reference frame feature vector, and $w_i^{(k)}$ the weight of the k-th component of the i-th selected frame feature vector; i and k are the indices of the selected frames and of the feature vector components, respectively, and are positive integers; m is the number of currently selected frames.
The weights $w_i^{(k)}$ directly affect the quality of the reference frame and hence the selection of candidate frames. In this scheme $w_i^{(k)}$ is computed as:
$$w_i^{(k)} = \frac{\Delta_i^{(k)}}{\sum_{j=m-3}^{m} \Delta_j^{(k)}}$$
and the weight coefficients satisfy $\sum_{i=m-3}^{m} w_i^{(k)} = 1$, where $\Delta_i^{(k)}$ denotes the absolute difference between the k-th component of the i-th selected frame feature vector and the mean of the k-th components of the n selected frame feature vectors:
$$\Delta_i^{(k)} = \left| v_i^{(k)} - \frac{1}{n} \sum_{j=m-n+1}^{m} v_j^{(k)} \right|$$
Here j is the index of a selected frame and is a positive integer.
It should be noted that this embodiment uses 4 selected frame feature vectors for the weighted-fusion computation of the reference frame feature vector as an example; in practice, the reference frame feature vector can be computed from another number of selected frame feature vectors as appropriate.
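The weighted fusion of this section can be sketched as follows. Since the published formula images are not reproduced here, the weight used below (each frame's absolute deviation from the component mean, normalized so the weights sum to 1) is an assumption consistent with the stated definitions and the normalization constraint:

```python
def reference_frame(selected, n=4):
    """Weighted fusion of the last n selected frame feature vectors.

    selected: list of feature vectors, each a list of Q floats.
    The per-component weight (deviation from the component mean,
    normalized to sum to 1) is an assumption; only the normalization
    constraint and the deviation term are given in the text.
    """
    last = selected[-n:]
    q = len(last[0])
    ref = []
    for k in range(q):
        comps = [v[k] for v in last]
        mean = sum(comps) / n
        deltas = [abs(c - mean) for c in comps]   # Delta_i^(k)
        total = sum(deltas)
        if total == 0:   # all components equal: fall back to uniform weights
            weights = [1.0 / n] * n
        else:
            weights = [d / total for d in deltas]  # w_i^(k), sums to 1
        ref.append(sum(w * c for w, c in zip(weights, comps)))
    return ref

selected = [[1.0, 2.0], [2.0, 2.0], [3.0, 2.0], [6.0, 2.0]]
ref = reference_frame(selected, n=4)
```

For the first component the weights are proportional to the deviations |1-3|, |2-3|, |3-3|, |6-3|; the second component is constant across frames and falls back to uniform weighting.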
2. Computing the Euclidean distances between the reference frame feature vector and multiple candidate frame feature vectors:
Suppose the selection is made from the 3 following candidate frames. From the reference frame feature vector obtained above, the Euclidean distance $d_l$ between each of the 3 candidate frame feature vectors and the reference frame feature vector is computed as:
$$d_l = \sqrt{\sum_{k=1}^{Q} \left( u_l^{(k)} - \bar{v}^{(k)} \right)^2}$$
where $u_l^{(k)}$ denotes the k-th component of the l-th candidate frame feature vector. The Euclidean distance between the reference frame feature vector and a candidate frame feature vector measures the degree of similarity between the reference frame and that candidate frame; l is the index of a candidate frame, a positive integer with 1 ≤ l ≤ 3.
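The distance computation follows directly from its definition; a minimal sketch:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two Q-dimensional feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Distances d_l between a reference frame and 3 candidate frames
# (vectors here are illustrative two-component examples).
ref = [0.0, 0.0]
candidates = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
distances = [euclidean(c, ref) for c in candidates]
```

The larger the distance, the less similar the candidate frame is to the reference frame.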
3. Selecting the qualifying candidate frame according to the Euclidean distances $d_l$ and adding it to the selected frames:
Fig. 2 shows the flow of selecting the qualifying candidate frame from the candidate frames according to the Euclidean distances $d_l$. Among the 3 candidate frames, the candidate frame with the largest Euclidean distance is the least similar to the reference frame, and the candidate frame least similar to the reference frame best reflects the audio variation. The selection proceeds as follows:
Step 3-1: from the Euclidean distances $d_l$, select the maximum Euclidean distance $d_{max}$.
Step 3-2: compare the maximum Euclidean distance $d_{max}$ with a threshold D. If it exceeds the threshold, add the candidate frame corresponding to $d_{max}$ to the selected frames and take the following 3 audio frames as the new candidate frames; otherwise discard all 3 candidate frames and continue the selection with the next 3 audio frames as candidate frames.
By selecting from multiple candidate frames the audio frame that best represents them, combined with the preset threshold D, the feature extraction process selects more frames where the audio spectrum changes rapidly and fewer frames where it changes gently, while avoiding the shielding effect of selected frames on candidate frames.
It should be noted that this embodiment selects the qualifying candidate frame from 3 subsequent candidate frames as an example; in practice, the qualifying candidate frame can be selected from another number of candidate frames as appropriate.
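Steps 3-1 and 3-2 reduce to a single selection step; the threshold value used in the example call is an illustrative assumption:

```python
def select_candidate(distances, threshold):
    """Return the index of the candidate frame to keep, or None.

    distances: Euclidean distances d_l of the candidate frames to
    the reference frame. The candidate with the maximum distance is
    the least similar to the reference frame; it is kept only if
    that distance exceeds the threshold D, otherwise the whole group
    of candidates is discarded.
    """
    d_max = max(distances)
    if d_max > threshold:
        return distances.index(d_max)
    return None

kept = select_candidate([0.4, 1.7, 0.9], threshold=1.0)       # second frame kept
dropped = select_candidate([0.4, 0.7, 0.9], threshold=1.0)    # group discarded
```

This is how the method adapts the effective frame rate: groups whose most dissimilar frame still resembles the reference contribute nothing.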
Fig. 3 shows the execution flow of the variable frame rate audio feature extraction method provided by the present invention, which proceeds as follows:
Step 1: divide the audio into frames according to a preset frame length and frame shift, for example a frame length of 30 ms and a frame shift of 2 ms, to obtain audio frames, and compute a feature vector for each audio frame;
Step 2: according to preset n and h, select the first n audio frames in order as the initial selected frames, and take the following h audio frames as candidate frames;
Step 3: compute the weights of the components of the n selected frame feature vectors by the weighted fusion method, and fuse them to obtain the reference frame feature vector;
Step 4: compute the Euclidean distances between the reference frame feature vector and the h candidate frame feature vectors;
Step 5: from the Euclidean distances between the reference frame feature vector and the h candidate frame feature vectors, select the maximum and compare it with the threshold; if it exceeds the threshold, add the candidate frame corresponding to the maximum Euclidean distance to the selected frames, otherwise discard this group of candidate frames and continue the selection from the next group of candidate frames.
Repeat step 3, step 4 and step 5: from the remaining audio frames, select h audio frames in order as candidate frames and add the qualifying candidate frame among them to the selected frames. If fewer than h audio frames remain, take the actually remaining audio frames as candidate frames and select the qualifying candidate frame among them by the same method, until no audio frames remain. The feature vectors of all selected frames are then taken as the feature vectors of the audio data.
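Putting steps 1–5 together, the whole extraction loop can be sketched as below. Working directly on per-frame feature vectors, with the deviation-proportional weight formula assumed earlier and an illustrative threshold, this is a sketch of the flow rather than a definitive implementation:

```python
import math

def extract_features(frames, n=4, h=3, threshold=1.0):
    """Variable-frame-rate selection over per-frame feature vectors.

    frames: list of feature vectors (one per audio frame). The first
    n frames seed the selected set; each following group of h frames
    contributes at most one frame -- the one farthest from the
    reference frame, kept only if its distance exceeds the threshold.
    """
    def reference(sel):
        last = sel[-n:]
        q, ln = len(last[0]), len(last)
        ref = []
        for k in range(q):
            comps = [v[k] for v in last]
            mean = sum(comps) / ln
            deltas = [abs(c - mean) for c in comps]
            tot = sum(deltas)
            ws = [d / tot for d in deltas] if tot else [1.0 / ln] * ln
            ref.append(sum(w * c for w, c in zip(ws, comps)))
        return ref

    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    selected = list(frames[:n])
    pos = n
    while pos < len(frames):
        group = frames[pos:pos + h]   # fewer than h frames may remain
        pos += len(group)
        ref = reference(selected)
        ds = [dist(c, ref) for c in group]
        d_max = max(ds)
        if d_max > threshold:         # keep the least similar candidate
            selected.append(group[ds.index(d_max)])
    return selected

# One-component toy features: two "events" (5 and 9) amid silence.
frames = [[float(x)] for x in [0, 0, 0, 0, 5, 0, 0, 0, 0, 9]]
feats = extract_features(frames, n=4, h=3, threshold=1.0)
```

On this toy input both event frames survive while the flat groups are discarded, illustrating how more frames are kept where the signal changes rapidly.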

Claims (1)

1. A variable frame rate audio feature extraction method, characterized in that it proceeds as follows:
Step a: divide the audio data into frames according to a preset frame length and frame shift to obtain audio frames, and compute a feature vector for each audio frame;
Step b: select the first n audio frames in order as the initial selected frames, and take the following h audio frames as candidate frames; n and h are preset natural numbers;
Step c: from the n selected frame feature vectors, compute the reference frame feature vector $\bar{v}$:
$$\bar{v} = \sum_{i=m-n+1}^{m} w_i v_i$$
where $v_i$ denotes the feature vector of the i-th selected frame and $w_i$ the weight of the i-th selected frame; i is the index of a selected frame and a positive integer; m is the number of currently selected frames, and 1 < n ≤ m;
each audio frame feature vector is represented by Q components, the components of the i-th selected frame feature vector $v_i$ being written $v_i = (v_i^{(1)}, v_i^{(2)}, \ldots, v_i^{(Q)})$, where k is an integer and 1 ≤ k ≤ Q; the k-th component $\bar{v}^{(k)}$ of the reference frame feature vector $\bar{v}$ is therefore computed as:
$$\bar{v}^{(k)} = \sum_{i=m-n+1}^{m} w_i^{(k)} v_i^{(k)}$$
where $v_i^{(k)}$ denotes the k-th component of the i-th selected frame feature vector and $w_i^{(k)}$ the weight of the k-th component of the i-th selected frame feature vector, computed as:
$$w_i^{(k)} = \frac{\Delta_i^{(k)}}{\sum_{j=m-n+1}^{m} \Delta_j^{(k)}}$$
subject to $\sum_{i=m-n+1}^{m} w_i^{(k)} = 1$, where j is the index of a selected frame and a positive integer, and $\Delta_i^{(k)}$ is computed as:
$$\Delta_i^{(k)} = \left| v_i^{(k)} - \frac{1}{n} \sum_{j=m-n+1}^{m} v_j^{(k)} \right|$$
Step d: using the reference frame feature vector $\bar{v}$, compute the Euclidean distance $d_l$ between each of the h candidate frame feature vectors and the reference frame feature vector;
the Euclidean distance $d_l$ is computed as:
$$d_l = \sqrt{\sum_{k=1}^{Q} \left( u_l^{(k)} - \bar{v}^{(k)} \right)^2}$$
where $u_l^{(k)}$ denotes the k-th component of the l-th candidate frame feature vector, and l is the index of a candidate frame, a positive integer with 1 ≤ l ≤ h;
Step e: according to the Euclidean distances $d_l$ between the h candidate frame feature vectors and the reference frame feature vector, select the qualifying candidate frame and add it to the selected frames, as follows:
Step e-1: from the Euclidean distances $d_l$, select the maximum Euclidean distance $d_{max}$;
Step e-2: compare the maximum Euclidean distance $d_{max}$ with a threshold D; if it exceeds the threshold, add the candidate frame corresponding to $d_{max}$ to the selected frames and take the following h audio frames as the new candidate frames, otherwise discard all h candidate frames and continue the selection with the next h audio frames as candidate frames;
repeat step c, step d and step e: from the remaining audio frames, select h audio frames in order as candidate frames, and from these h candidate frames add the qualifying candidate frame to the selected frames; if fewer than h audio frames remain, take the actually remaining audio frames as candidate frames and select the qualifying candidate frame among them by the same method, until no audio frames remain; the feature vectors of all selected frames are then taken as the feature vectors of the audio data.
CN201510782814.XA 2015-11-16 2015-11-16 A variable frame rate audio feature extraction method Active CN105448290B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510782814.XA CN105448290B (en) 2015-11-16 2015-11-16 A variable frame rate audio feature extraction method

Publications (2)

Publication Number Publication Date
CN105448290A CN105448290A (en) 2016-03-30
CN105448290B true CN105448290B (en) 2019-03-01

Family

ID=55558397

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510782814.XA Active CN105448290B (en) 2015-11-16 2015-11-16 A variable frame rate audio feature extraction method

Country Status (1)

Country Link
CN (1) CN105448290B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3479378B1 (en) * 2016-07-04 2023-05-24 Harman Becker Automotive Systems GmbH Automatic correction of loudness level in audio signals containing speech signals
CN108922556B (en) * 2018-07-16 2019-08-27 百度在线网络技术(北京)有限公司 Sound processing method, device and equipment
CN117097909B (en) * 2023-10-20 2024-02-02 深圳市星易美科技有限公司 Distributed household audio and video processing method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1505572A1 (en) * 2002-05-06 2005-02-09 Prous Science S.A. Voice recognition method
CN102332262A (en) * 2011-09-23 2012-01-25 哈尔滨工业大学深圳研究生院 Method for intelligently identifying songs based on audio features
WO2015092711A1 (en) * 2013-12-18 2015-06-25 Isis Innovation Ltd. Method and apparatus for automatic speech recognition
CN105046699A (en) * 2015-07-09 2015-11-11 硅革科技(北京)有限公司 Motion video superposition contrast method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7110947B2 (en) * 1999-12-10 2006-09-19 At&T Corp. Frame erasure concealment technique for a bitstream-based feature extractor

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Gesture recognition research based on HMM; Yan Yan et al.; Journal of Central China Normal University (Natural Science Edition); October 2012; full text *
Keyframe extraction of motion capture data based on quantum-behaved particle swarm optimization; Yang Tao et al.; Application Research of Computers; August 2014; full text *

Also Published As

Publication number Publication date
CN105448290A (en) 2016-03-30

Similar Documents

Publication Publication Date Title
CN104464726B (en) A kind of determination method and device of similar audio
CN105513583A (en) Display method and system for song rhythm
CN103514883B (en) A kind of self-adaptation realizes men and women&#39;s sound changing method
CN105448290B (en) A kind of audio feature extraction methods becoming frame per second
CN106653056B (en) Fundamental frequency extraction model and training method based on LSTM recurrent neural network
CN106067989B (en) Portrait voice video synchronous calibration device and method
CN101819638B (en) Establishment method of pornographic detection model and pornographic detection method
US20160330512A1 (en) Multimedia processing method and multimedia apparatus
CN104134444B (en) A kind of song based on MMSE removes method and apparatus of accompanying
CN109147763A (en) A kind of audio-video keyword recognition method and device based on neural network and inverse entropy weighting
CN108206027A (en) A kind of audio quality evaluation method and system
CN110148425A (en) A kind of camouflage speech detection method based on complete local binary pattern
CN109035246A (en) A kind of image-selecting method and device of face
CN109616105A (en) A kind of noisy speech recognition methods based on transfer learning
CN110458247A (en) The training method and device of image recognition model, image-recognizing method and device
CN109346056A (en) Phoneme synthesizing method and device based on depth measure network
CN105895080A (en) Voice recognition model training method, speaker type recognition method and device
CN111128211B (en) Voice separation method and device
CN104199838B (en) A kind of user model constructing method based on label disambiguation
CN109272044A (en) A kind of image similarity determines method, apparatus, equipment and storage medium
CN107134277A (en) A kind of voice-activation detecting method based on GMM model
Callier Social meaning in prosodic variability
CN104217731A (en) Quick solo music score recognizing method
CN103327359A (en) Video significance region searching method applied to video quality evaluation
CN204883593U (en) Real border system of increase that combines speech recognition and pronunciation test and appraisal technique

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant