CN105448290B - A variable-frame-rate audio feature extraction method - Google Patents
- Publication number: CN105448290B
- Application number: CN201510782814.XA
- Authority: CN (China)
- Prior art keywords: frame, feature vector, audio, candidate, Euclidean distance
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/08—Speech classification or search
Abstract
The present invention proposes a variable-frame-rate audio feature extraction method. Multiple selected frame feature vectors are weighted and fused to obtain a reference frame feature vector; the Euclidean distance between the reference frame feature vector and each of several candidate frame feature vectors is then computed, and, according to these distances, the audio frame that best represents the candidates is chosen from among them for audio retrieval. The method avoids the shielding effect of selected frames on candidate frames and extracts more useful audio information, thereby improving the accuracy of audio retrieval.
Description
Technical field
The present invention relates to the field of digital audio processing technology, and in particular to a variable-frame-rate audio feature extraction method.
Background art
With the rapid development of the Internet, the volume of audio media data on the network grows day by day, and traditional audio search based on text annotation can no longer meet users' growing needs. In recent years, with the rise of cloud computing and big data, content-based audio retrieval has increasingly become a focus of scholars at home and abroad.
Audio feature extraction is the key to content-based audio retrieval, and the general processing flow first divides the audio data into frames. Framing usually uses overlapping frames so as to preserve continuity between adjacent frames: the length of each frame is called the frame length, and the non-overlapping part between adjacent frames is called the frame shift. There are currently two main modes of audio feature extraction: fixed frame rate and variable frame rate. With a fixed frame rate, the frame length and frame shift remain constant throughout framing; this cannot adapt well to the changing characteristics of the audio spectrum. A variable frame rate uses a dynamic frame shift during framing, which effectively compensates for the fixed frame rate's inability to reflect spectral change.
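The overlapping framing described above can be sketched in a few lines (a minimal illustration; the function and parameter names are ours, not the patent's):

```python
def frame_signal(samples, frame_len, hop):
    """Split a sample sequence into overlapping frames.

    With frame length frame_len and frame shift hop < frame_len,
    consecutive frames overlap by frame_len - hop samples, which
    preserves continuity between adjacent frames.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames

# Ten samples, frame length 4, frame shift 2 -> four overlapping frames.
frames = frame_signal(list(range(10)), frame_len=4, hop=2)
```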
Existing variable-frame-rate audio feature extraction methods all frame the signal at a fixed frame rate with a small frame shift, then compute the similarity between candidate frames and the already-selected frames and achieve a variable frame rate by dropping frames. Such methods easily let the selected frames shield the candidate frames, so the audio frames that best reflect the acoustic characteristics are discarded during framing; audio information is lost and retrieval accuracy suffers.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention proposes a variable-frame-rate audio feature extraction method. By selecting, from multiple candidate frames, the audio frame that best represents the audio's variation, the method avoids the shielding effect of selected frames on candidate frames, improves the effectiveness of audio feature extraction, and thus improves audio retrieval accuracy.
To solve the above technical problem, the technical solution adopted by the present invention is as follows. The proposed variable-frame-rate audio feature extraction method proceeds as follows:
Step a: frame the audio data according to a preset frame length and frame shift to obtain audio frames, and compute a feature vector for each audio frame;

Step b: sequentially select n audio frames from the audio frames as the initial selected frames, and take the following h audio frames as candidate frames; n and h are preset natural numbers;

Step c: from the n selected frame feature vectors, compute the reference frame feature vector $\bar{v}$. Here $v_i$ denotes the $i$-th selected frame feature vector and $w_i$ the weight of the $i$-th selected frame; $i$ is the label of a selected frame, a positive integer; $m$ is the current number of selected frames, with $1 < n \le m$.

Each audio frame feature vector has $Q$ components, so the components of the $i$-th selected frame feature vector $v_i$ are written $v_i^{(1)}, v_i^{(2)}, \ldots, v_i^{(Q)}$, where $k$ is an integer with $1 \le k \le Q$. The $k$-th component $\bar{v}^{(k)}$ of the reference frame feature vector $\bar{v}$ is therefore

$$\bar{v}^{(k)} = \sum_{i=m-n+1}^{m} w_i^{(k)}\, v_i^{(k)},$$

where $v_i^{(k)}$ denotes the $k$-th component of the $i$-th selected frame feature vector and $w_i^{(k)}$ its weight,

$$w_i^{(k)} = \frac{\delta_i^{(k)}}{\sum_{j=m-n+1}^{m} \delta_j^{(k)}}, \qquad \text{subject to} \quad \sum_{i=m-n+1}^{m} w_i^{(k)} = 1,$$

where $\delta_i^{(k)}$, the absolute difference between the $k$-th component of the $i$-th selected frame feature vector and the mean of the $k$-th components of the $n$ selected frames, is

$$\delta_i^{(k)} = \left|\, v_i^{(k)} - \frac{1}{n} \sum_{j=m-n+1}^{m} v_j^{(k)} \right|;$$

Step d: using the reference frame feature vector $\bar{v}$, compute the Euclidean distance $d_l$ between each of the h candidate frame feature vectors and the reference frame feature vector:

$$d_l = \sqrt{\sum_{k=1}^{Q} \left( v_l^{(k)} - \bar{v}^{(k)} \right)^2},$$

where $l$ is the label of a candidate frame, a positive integer with $1 \le l \le h$;

Step e: according to the Euclidean distances $d_l$ of the h candidate frame feature vectors from the reference frame feature vector, select the qualifying candidate frame and add it to the selected frames, as follows:

Step e-1: among the Euclidean distances $d_l$, select the maximum $d_{\max}$;

Step e-2: compare $d_{\max}$ with a threshold D. If it exceeds the threshold, add the candidate frame corresponding to $d_{\max}$ to the selected frames and take the following h audio frames as the new candidate frames; otherwise discard all h candidate frames and continue the selection with the following h audio frames as candidates.

Repeat steps c, d, and e: sequentially take h audio frames from the remaining audio frames as candidate frames, and add the qualifying candidate frame among them to the selected frames. If fewer than h audio frames remain, use the actually remaining audio frames as candidate frames and select the qualifying candidate by the same method, until no audio frames remain. All selected frame feature vectors together form the feature vectors of the audio data.
Advantageous effects: the proposed variable-frame-rate audio feature extraction method weights and fuses multiple selected frame feature vectors to obtain a reference frame feature vector, computes the Euclidean distance between the reference frame feature vector and each of several candidate frame feature vectors, and, according to these distances, selects from the candidates the audio frame that best represents them for audio retrieval. This avoids the shielding effect of selected frames on candidate frames and extracts more useful audio information, thereby improving the accuracy of audio retrieval.
Brief description of the drawings
Fig. 1 is a schematic diagram of computing the reference frame feature vector by weighted fusion;
Fig. 2 is a flow diagram of selecting a qualifying candidate frame, according to the reference frame, and adding it to the selected frames;
Fig. 3 is an execution flow diagram of the variable-frame-rate audio feature extraction method.
Specific embodiment
To describe the proposed variable-frame-rate audio feature extraction method in more detail, it is explained below with reference to the drawings:
The invention mainly comprises three parts: first, weighting and fusing multiple selected frame feature vectors to obtain a reference frame feature vector; second, computing the Euclidean distances between the reference frame feature vector and multiple candidate frame feature vectors; third, selecting, according to those distances, a qualifying candidate frame and adding it to the selected frames.
1. Computing the reference frame by weighted fusion:
Fig. 1 shows the weighted-fusion computation of the reference frame feature vector. The $m$ current selected frame feature vectors are denoted $v_1, v_2, v_3, \ldots, v_m$; suppose the last 4 selected frames are to be fused, with feature vectors $v_{m-3}, v_{m-2}, v_{m-1}, v_m$. Each audio frame feature vector has $Q$ components, and the components of the $i$-th selected frame feature vector $v_i$ are written $v_i^{(1)}, v_i^{(2)}, \ldots, v_i^{(Q)}$, where $k$ is an integer with $1 \le k \le Q$. The feature types extracted from each audio frame include, but are not limited to, time-domain features of the audio, frequency-domain features, or combinations thereof, such as autocorrelation coefficients and mel-frequency cepstral coefficients.
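As one example of such a time-domain feature, a few normalised autocorrelation coefficients can be computed per frame. This is an illustrative sketch, not the patent's prescribed feature; the function name and lag convention are ours:

```python
def autocorr_feature(frame, max_lag):
    """First max_lag autocorrelation coefficients of a frame,
    normalised by the frame's zero-lag energy."""
    energy = sum(x * x for x in frame) or 1.0  # guard against an all-zero frame
    return [
        sum(frame[t] * frame[t + lag] for t in range(len(frame) - lag)) / energy
        for lag in range(1, max_lag + 1)
    ]

# A constant frame has slowly decaying autocorrelation: [0.75, 0.5] for lags 1, 2.
coeffs = autocorr_feature([1, 1, 1, 1], max_lag=2)
```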
Each component of the reference frame feature vector is obtained by weighted fusion of the corresponding components of these 4 selected frame feature vectors:

$$\bar{v}^{(k)} = \sum_{i=m-3}^{m} w_i^{(k)}\, v_i^{(k)},$$

where $v_i^{(k)}$ denotes the $k$-th component of the $i$-th selected frame feature vector, $\bar{v}^{(k)}$ the $k$-th component of the reference frame feature vector, and $w_i^{(k)}$ the weight of the $k$-th component of the $i$-th selected frame feature vector; $i$ and $k$ are positive integers labelling the selected frames and the components of each audio frame feature vector, respectively; $m$ is the current number of selected frames.
The weights $w_i^{(k)}$ directly affect the quality of the reference frame and hence the selection of candidate frames. In this scheme,

$$w_i^{(k)} = \frac{\delta_i^{(k)}}{\sum_{j=m-3}^{m} \delta_j^{(k)}}, \qquad \text{subject to} \quad \sum_{i=m-3}^{m} w_i^{(k)} = 1,$$

where $\delta_i^{(k)}$ denotes the absolute difference between the $k$-th component of the $i$-th selected frame feature vector and the mean of the $k$-th components of the $n$ fused frames:

$$\delta_i^{(k)} = \left|\, v_i^{(k)} - \frac{1}{4} \sum_{j=m-3}^{m} v_j^{(k)} \right|,$$

where $j$ is the label of a selected frame, a positive integer.
Note that although this embodiment fuses 4 selected frame feature vectors to compute the reference frame feature vector, other numbers of selected frame feature vectors may be used, according to the actual situation, to compute it.
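The weighted fusion above can be sketched in code. The original formula images are not reproduced in this text, so the sketch assumes the natural reading of the definitions given: each component's weight is its absolute deviation from the per-component mean, normalised so the weights sum to 1. All names are illustrative:

```python
def reference_frame(selected, n):
    """Weighted fusion of the last n selected frame feature vectors.

    Assumed weight rule: per component k, weight each frame by its
    absolute deviation from the mean of that component, normalised
    to sum to 1 (uniform weights when all deviations are zero).
    """
    last = selected[-n:]
    count = len(last)
    Q = len(last[0])
    ref = []
    for k in range(Q):
        comps = [v[k] for v in last]
        mean = sum(comps) / count
        dev = [abs(c - mean) for c in comps]        # delta_i^(k)
        total = sum(dev)
        if total == 0:                              # all components equal
            weights = [1.0 / count] * count
        else:                                       # normalised deviations
            weights = [d / total for d in dev]
        ref.append(sum(w * c for w, c in zip(weights, comps)))
    return ref

# Three 2-D selected frames; the fused reference frame lies between them.
ref = reference_frame([[1, 0], [3, 0], [5, 0]], n=3)
```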
2. Computing the Euclidean distances between the reference frame feature vector and multiple candidate frame feature vectors:
Suppose the selection is made from the following 3 candidate frames. Using the reference frame feature vector obtained above, compute the Euclidean distance $d_l$ of each of the 3 candidate frame feature vectors from the reference frame feature vector:

$$d_l = \sqrt{\sum_{k=1}^{Q} \left( v_l^{(k)} - \bar{v}^{(k)} \right)^2}.$$

The Euclidean distance between the reference frame feature vector and each candidate frame feature vector measures the similarity between the reference frame and that candidate frame; $l$ is the label of a candidate frame, a positive integer with $1 \le l \le 3$.
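A minimal sketch of the distance computation (the function name is ours):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two Q-dimensional feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Classic 3-4-5 triangle: distance between [0, 3] and [4, 0] is 5.
d = euclidean([0, 3], [4, 0])
```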
3. Selecting, according to the Euclidean distances $d_l$ between the reference frame feature vector and the candidate frame feature vectors, a qualifying candidate frame and adding it to the selected frames:
Fig. 2 shows the flow of selecting a qualifying candidate frame from the candidates according to $d_l$. Among the 3 candidate frames, the one with the largest Euclidean distance is the least similar to the reference frame, and the candidate least similar to the reference frame best reflects the audio's variation. The detailed process of selecting a qualifying candidate frame and adding it to the selected frames according to the Euclidean distances is:
Step 3-1: from the Euclidean distances $d_l$, select the maximum $d_{\max}$.
Step 3-2: compare $d_{\max}$ with the threshold D. If it exceeds the threshold, add the candidate frame corresponding to $d_{\max}$ to the selected frames and take the following 3 audio frames as the new candidate frames; otherwise discard all 3 candidate frames and continue the selection with the following 3 audio frames as candidates.
Selecting from multiple candidate frames the audio frame that best represents them, combined with the preset threshold D, makes the feature extraction process select more frames where the audio spectrum changes rapidly and fewer frames where it changes gently, while avoiding the shielding effect of selected frames on candidate frames.
Note that although this embodiment selects a qualifying candidate frame from the following 3 candidate frames, other numbers of candidate frames may be used, according to the actual situation, when adding a qualifying candidate to the selected frames.
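The selection step above can be sketched as follows (illustrative names; returning `None` stands for "discard the whole candidate group"):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def select_candidate(ref, candidates, threshold):
    """Index of the candidate farthest from the reference frame,
    or None if even the farthest one is within the threshold."""
    dists = [euclidean(ref, c) for c in candidates]
    d_max = max(dists)
    return dists.index(d_max) if d_max > threshold else None

# The second candidate is farthest (distance 5) and exceeds threshold 2.
chosen = select_candidate([0, 0], [[0, 0], [3, 4], [1, 1]], threshold=2.0)
```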
Fig. 3 shows the execution flow of the proposed variable-frame-rate audio feature extraction method. The specific execution process is:
Step 1: frame the audio according to a preset frame length and frame shift (for example, a frame length of 30 ms and a frame shift of 2 ms), obtain the audio frames, and compute each audio frame's feature vector.
Step 2: according to the preset n and h, sequentially select n audio frames from the audio frames as the initial selected frames, and take the following h audio frames as candidate frames.
Step 3: compute the per-component weights of the n selected frame feature vectors by the weighted-fusion method, and fuse them to obtain the reference frame feature vector.
Step 4: compute the Euclidean distance between the reference frame feature vector and each of the h candidate frame feature vectors.
Step 5: from the Euclidean distances of the h candidate frame feature vectors, select the maximum and compare it with the threshold; if it exceeds the threshold, add the corresponding candidate frame to the selected frames, otherwise discard this group of candidates and select from the next group.
Repeat steps 3, 4, and 5: sequentially take h audio frames from the remaining audio frames as candidate frames, and add the qualifying candidate frame among them to the selected frames. If fewer than h audio frames remain, use the actually remaining audio frames as candidate frames and select the qualifying candidate by the same method, until no audio frames remain. All selected frame feature vectors together form the feature vectors of the audio data.
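The execution flow above can be sketched end to end. This is a simplified illustration, not the patent's implementation: the per-frame feature here is just the mean absolute amplitude (the patent permits any time- or frequency-domain feature), and the fusion weights follow the deviation-from-mean reading assumed earlier:

```python
import math

def extract_features(samples, frame_len, hop, n, h, threshold):
    """Variable-frame-rate feature extraction sketch (1-D toy feature)."""
    # Step 1: overlapping framing + per-frame feature (mean |amplitude|).
    frames, start = [], 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    feats = [[sum(abs(x) for x in f) / frame_len] for f in frames]

    # Step 2: the first n frames are the initial selected frames.
    selected, pos = feats[:n], n
    while pos < len(feats):
        cand = feats[pos:pos + h]          # may be shorter at the end
        pos += len(cand)

        # Step 3: reference frame = component-wise weighted fusion
        # (deviation-from-mean weights, a reconstruction of the formula).
        last = selected[-n:]
        ref = []
        for k in range(len(last[0])):
            comps = [v[k] for v in last]
            mean = sum(comps) / len(comps)
            dev = [abs(c - mean) for c in comps]
            tot = sum(dev)
            w = [d / tot for d in dev] if tot else [1 / len(comps)] * len(comps)
            ref.append(sum(wi * ci for wi, ci in zip(w, comps)))

        # Steps 4-5: keep the farthest candidate if it exceeds the threshold.
        dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(ref, c)))
                 for c in cand]
        if max(dists) > threshold:
            selected.append(cand[dists.index(max(dists))])
    return selected

# Silence followed by a loud segment: frames are kept only at the change.
feats = extract_features([0] * 20 + [10] * 20,
                         frame_len=4, hop=4, n=2, h=2, threshold=1.0)
```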
Claims (1)
1. A variable-frame-rate audio feature extraction method, characterized in that the specific process is as follows:

Step a: frame the audio data according to a preset frame length and frame shift to obtain audio frames, and compute a feature vector for each audio frame;

Step b: sequentially select n audio frames from the audio frames as the initial selected frames, and take the following h audio frames as candidate frames; n and h are preset natural numbers;

Step c: from the n selected frame feature vectors, compute the reference frame feature vector $\bar{v}$. Here $v_i$ denotes the $i$-th selected frame feature vector and $w_i$ the weight of the $i$-th selected frame; $i$ is the label of a selected frame, a positive integer; $m$ is the current number of selected frames, with $1 < n \le m$.

Each audio frame feature vector has $Q$ components, so the components of the $i$-th selected frame feature vector $v_i$ are written $v_i^{(1)}, v_i^{(2)}, \ldots, v_i^{(Q)}$, where $k$ is an integer with $1 \le k \le Q$; the $k$-th component $\bar{v}^{(k)}$ of the reference frame feature vector $\bar{v}$ is therefore

$$\bar{v}^{(k)} = \sum_{i=m-n+1}^{m} w_i^{(k)}\, v_i^{(k)},$$

where $v_i^{(k)}$ denotes the $k$-th component of the $i$-th selected frame feature vector and $w_i^{(k)}$ its weight,

$$w_i^{(k)} = \frac{\delta_i^{(k)}}{\sum_{j=m-n+1}^{m} \delta_j^{(k)}}, \qquad \text{subject to} \quad \sum_{i=m-n+1}^{m} w_i^{(k)} = 1,$$

where $j$ is the label of a selected frame, a positive integer, and

$$\delta_i^{(k)} = \left|\, v_i^{(k)} - \frac{1}{n} \sum_{j=m-n+1}^{m} v_j^{(k)} \right|;$$

Step d: using the reference frame feature vector $\bar{v}$, compute the Euclidean distance $d_l$ between each of the h candidate frame feature vectors and the reference frame feature vector:

$$d_l = \sqrt{\sum_{k=1}^{Q} \left( v_l^{(k)} - \bar{v}^{(k)} \right)^2},$$

where $l$ is the label of a candidate frame, a positive integer with $1 \le l \le h$;

Step e: according to the Euclidean distances $d_l$ of the h candidate frame feature vectors from the reference frame feature vector, select the qualifying candidate frame and add it to the selected frames, as follows:

Step e-1: among the Euclidean distances $d_l$, select the maximum $d_{\max}$;

Step e-2: compare $d_{\max}$ with a threshold D. If it exceeds the threshold, add the candidate frame corresponding to $d_{\max}$ to the selected frames and take the following h audio frames as the new candidate frames; otherwise discard all h candidate frames and continue the selection with the following h audio frames as candidates.

Repeat steps c, d, and e: sequentially take h audio frames from the remaining audio frames as candidate frames, and add the qualifying candidate frame among them to the selected frames; if fewer than h audio frames remain, use the actually remaining audio frames as candidate frames and select the qualifying candidate by the same method, until no audio frames remain. All selected frame feature vectors together form the feature vectors of the audio data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510782814.XA CN105448290B (en) | 2015-11-16 | 2015-11-16 | A kind of audio feature extraction methods becoming frame per second |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105448290A CN105448290A (en) | 2016-03-30 |
CN105448290B true CN105448290B (en) | 2019-03-01 |
Family
ID=55558397
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510782814.XA Active CN105448290B (en) | 2015-11-16 | 2015-11-16 | A kind of audio feature extraction methods becoming frame per second |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105448290B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3479378B1 (en) * | 2016-07-04 | 2023-05-24 | Harman Becker Automotive Systems GmbH | Automatic correction of loudness level in audio signals containing speech signals |
CN108922556B (en) * | 2018-07-16 | 2019-08-27 | 百度在线网络技术(北京)有限公司 | Sound processing method, device and equipment |
CN117097909B (en) * | 2023-10-20 | 2024-02-02 | 深圳市星易美科技有限公司 | Distributed household audio and video processing method and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1505572A1 (en) * | 2002-05-06 | 2005-02-09 | Prous Science S.A. | Voice recognition method |
CN102332262A (en) * | 2011-09-23 | 2012-01-25 | 哈尔滨工业大学深圳研究生院 | Method for intelligently identifying songs based on audio features |
WO2015092711A1 (en) * | 2013-12-18 | 2015-06-25 | Isis Innovation Ltd. | Method and apparatus for automatic speech recognition |
CN105046699A (en) * | 2015-07-09 | 2015-11-11 | 硅革科技(北京)有限公司 | Motion video superposition contrast method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7110947B2 (en) * | 1999-12-10 | 2006-09-19 | At&T Corp. | Frame erasure concealment technique for a bitstream-based feature extractor |
Non-Patent Citations (2)
Title |
---|
Research on gesture recognition based on HMM; Yan Yan et al.; Journal of Central China Normal University (Natural Sciences); 2012-10-31; full text *
Key-frame extraction from motion capture data based on a quantum particle swarm optimization algorithm; Yang Tao et al.; Application Research of Computers; 2014-08-31; full text *
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant