CN105448290B - A variable frame rate audio feature extraction method - Google Patents

A variable frame rate audio feature extraction method

Info

Publication number
CN105448290B
CN105448290B
Authority
CN
China
Prior art keywords
frame
feature vector
audio
candidate
Euclidean distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510782814.XA
Other languages
Chinese (zh)
Other versions
CN105448290A (en)
Inventor
Zhang Hui (张晖)
Liu Bao (刘宝)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN201510782814.XA
Publication of CN105448290A
Application granted
Publication of CN105448290B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search

Abstract

The present invention proposes a variable frame rate audio feature extraction method. Multiple already-selected frame feature vectors are fused by weighting to obtain a reference frame feature vector, and the Euclidean distance between the reference frame feature vector and each of several candidate frame feature vectors is computed separately. Based on these distances, the audio frame that best represents the candidate frames is selected from them for audio retrieval. The method avoids the shielding effect of selected frames on candidate frames and extracts more useful audio information, thereby improving the accuracy of audio retrieval.

Description

A variable frame rate audio feature extraction method
Technical field
The present invention relates to the field of digital audio processing technology, and in particular to a variable frame rate audio feature extraction method.
Background technique
With the rapid growth of the Internet, the volume of audio media data on the network increases day by day, and traditional audio search based on text annotation can no longer satisfy people's growing needs. In recent years, with the rise of cloud computing and big data, content-based audio retrieval has increasingly become a focus of attention for scholars at home and abroad.
Audio feature extraction is the key to content-based audio retrieval. The general processing flow first divides the audio data into frames. To preserve continuity between adjacent frames, framing usually uses overlapping windows: the length of each frame is called the frame length, and the non-overlapping part between consecutive frames is called the frame shift. There are currently two main modes of audio feature extraction: fixed frame rate and variable frame rate. With a fixed frame rate, the frame length and frame shift remain unchanged throughout framing; this cannot adapt well to the time-varying characteristics of the audio spectrum. With a variable frame rate, a dynamic frame shift is used during framing, which effectively compensates for the inability of a fixed frame rate to reflect spectral variation.
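The overlapping framing described above can be sketched in plain Python. The sample rate and the frame length / frame shift values below (30 ms and 2 ms, the values the embodiment later uses as an example) are illustrative assumptions, not mandated by the method:

```python
def frame_signal(samples, frame_len, frame_shift):
    """Split a sample sequence into overlapping frames.

    frame_len: number of samples per frame (the "frame length").
    frame_shift: number of samples between frame starts (the
    non-overlapping part between consecutive frames).
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames

# With an assumed 8 kHz sample rate: 30 ms -> 240 samples, 2 ms -> 16 samples.
signal = list(range(1000))
frames = frame_signal(signal, frame_len=240, frame_shift=16)
```

Each frame overlaps its neighbor by `frame_len - frame_shift` samples, which is what preserves continuity between adjacent frames.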
Existing variable frame rate audio feature extraction methods all perform framing with a fixed frame rate and a small frame shift, then compute the similarity between candidate frames and already-selected frames and achieve a variable frame rate by dropping frames. Such methods easily lead to a shielding effect of the selected frames on the candidate frames, so that the audio frames that best exhibit the acoustic characteristics are discarded during framing; this loses audio information and keeps retrieval accuracy low.
Summary of the invention
To overcome the above deficiencies of the prior art, the present invention proposes a variable frame rate audio feature extraction method. By selecting, from multiple candidate frames, the audio frame that best represents the audio variation, the method avoids the shielding effect of selected frames on candidate frames, improves the effectiveness of audio feature extraction, and thereby improves audio retrieval accuracy.
To solve the above technical problem, the technical solution adopted by the present invention is as follows.
The variable frame rate audio feature extraction method proposed by the present invention proceeds as follows:
Step a: divide the audio data into frames according to a preset frame length and frame shift to obtain audio frames, and compute a feature vector for each audio frame;
Step b: select the first n audio frames in order as the initial selected frames, and take the following h audio frames as candidate frames; n and h are preset natural numbers;
Step c: from the n selected frame feature vectors, compute the reference frame feature vector $\bar{v}$:
$$\bar{v} = \sum_{i=m-n+1}^{m} w_i v_i$$
where $v_i$ denotes the feature vector of the i-th selected frame and $w_i$ the weight of the i-th selected frame; i is the index of a selected frame and a positive integer; m is the number of currently selected frames, and 1 < n ≤ m;
Each audio frame feature vector is represented by Q components, the components of the i-th selected frame feature vector $v_i$ being written $v_i = (v_i^{(1)}, v_i^{(2)}, \ldots, v_i^{(Q)})$, where k is an integer and 1 ≤ k ≤ Q; the k-th component $\bar{v}^{(k)}$ of the reference frame feature vector $\bar{v}$ is therefore computed as:
$$\bar{v}^{(k)} = \sum_{i=m-n+1}^{m} w_i^{(k)} v_i^{(k)}$$
where $v_i^{(k)}$ denotes the k-th component of the i-th selected frame feature vector and $w_i^{(k)}$ the weight of the k-th component of the i-th selected frame feature vector, computed as:
$$w_i^{(k)} = \frac{\Delta_i^{(k)}}{\sum_{j=m-n+1}^{m} \Delta_j^{(k)}}$$
subject to $\sum_{i=m-n+1}^{m} w_i^{(k)} = 1$, where $\Delta_i^{(k)}$ is the absolute difference between the k-th component of the i-th selected frame feature vector and the mean of the k-th components of the n selected frame feature vectors:
$$\Delta_i^{(k)} = \left| v_i^{(k)} - \frac{1}{n} \sum_{j=m-n+1}^{m} v_j^{(k)} \right|$$
Step d: using the reference frame feature vector $\bar{v}$, compute the Euclidean distance $d_l$ between each of the h candidate frame feature vectors and the reference frame feature vector;
The Euclidean distance $d_l$ is computed as:
$$d_l = \sqrt{\sum_{k=1}^{Q} \left( u_l^{(k)} - \bar{v}^{(k)} \right)^2}$$
where $u_l^{(k)}$ denotes the k-th component of the l-th candidate frame feature vector, and l is the index of a candidate frame, a positive integer with 1 ≤ l ≤ h;
Step e: according to the Euclidean distances $d_l$ between the h candidate frame feature vectors and the reference frame feature vector, select the qualifying candidate frame and add it to the selected frames, as follows:
Step e-1: from the Euclidean distances $d_l$, select the maximum Euclidean distance $d_{max}$;
Step e-2: compare the maximum Euclidean distance $d_{max}$ with a threshold D. If it exceeds the threshold, add the candidate frame corresponding to $d_{max}$ to the selected frames and take the following h audio frames as the new candidate frames; otherwise discard all h candidate frames and continue the selection with the next h audio frames as candidate frames;
Repeat step c, step d and step e: from the remaining audio frames, select h audio frames in order as candidate frames, and from these h candidate frames add the qualifying candidate frame to the selected frames. If fewer than h audio frames remain, take the actually remaining audio frames as candidate frames and select the qualifying candidate frame among them by the same method, until no audio frames remain. The feature vectors of all selected frames are then taken as the feature vectors of the audio data.
Beneficial effects: the variable frame rate audio feature extraction method proposed by the present invention fuses multiple selected frame feature vectors by weighting to obtain a reference frame feature vector, computes the Euclidean distance between the reference frame feature vector and each of several candidate frame feature vectors, and, based on these distances, selects from the candidate frames the audio frame that best represents them for audio retrieval. This avoids the shielding effect of selected frames on candidate frames and extracts more useful audio information, thereby improving the accuracy of audio retrieval.
Brief description of the drawings
Fig. 1 is a schematic diagram of computing the reference frame feature vector by weighted fusion;
Fig. 2 is a flow diagram of selecting a qualifying candidate frame from the candidate frames according to the reference frame and adding it to the selected frames;
Fig. 3 is a flow diagram of the execution of the variable frame rate audio feature extraction method.
Specific embodiment
To describe the proposed variable frame rate audio feature extraction method in more detail, it is explained below with reference to the drawings.
The invention mainly comprises three parts: first, weighted fusion of multiple selected frame feature vectors to obtain the reference frame feature vector; second, computation of the Euclidean distances between the reference frame feature vector and multiple candidate frame feature vectors; third, selection, according to these Euclidean distances, of the qualifying candidate frame and its addition to the selected frames.
1. Computing the reference frame by weighted fusion:
Fig. 1 shows the computation of the reference frame feature vector by weighted fusion. The feature vectors of all m currently selected frames are denoted $v_1, v_2, v_3, \ldots, v_m$. Suppose the fusion uses the last 4 selected frames, whose feature vectors are $v_{m-3}, v_{m-2}, v_{m-1}, v_m$. Each audio frame feature vector has Q components; the components of the i-th selected frame feature vector $v_i$ are written $v_i = (v_i^{(1)}, \ldots, v_i^{(Q)})$, where k is an integer and 1 ≤ k ≤ Q. The feature types extracted from each audio frame include, but are not limited to, time-domain features of the audio, frequency-domain features, or combinations thereof, such as autocorrelation coefficients and mel-frequency cepstral coefficients.
Each component of the reference frame feature vector is obtained by weighted fusion of the corresponding components of these 4 selected frame feature vectors:
$$\bar{v}^{(k)} = \sum_{i=m-3}^{m} w_i^{(k)} v_i^{(k)}$$
where $v_i^{(k)}$ denotes the k-th component of the i-th selected frame feature vector, $\bar{v}^{(k)}$ the k-th component of the reference frame feature vector, and $w_i^{(k)}$ the weight of the k-th component of the i-th selected frame feature vector; i and k are the indices of the selected frames and of the feature vector components, respectively, and are positive integers; m is the number of currently selected frames.
The weights $w_i^{(k)}$ directly affect the quality of the reference frame and hence the selection of candidate frames. In this scheme $w_i^{(k)}$ is computed as:
$$w_i^{(k)} = \frac{\Delta_i^{(k)}}{\sum_{j=m-3}^{m} \Delta_j^{(k)}}$$
and the weight coefficients satisfy $\sum_{i=m-3}^{m} w_i^{(k)} = 1$, where $\Delta_i^{(k)}$ denotes the absolute difference between the k-th component of the i-th selected frame feature vector and the mean of the k-th components of the n selected frame feature vectors:
$$\Delta_i^{(k)} = \left| v_i^{(k)} - \frac{1}{n} \sum_{j=m-n+1}^{m} v_j^{(k)} \right|$$
Here j is the index of a selected frame and is a positive integer.
It should be noted that this embodiment uses 4 selected frame feature vectors for the weighted-fusion computation of the reference frame feature vector as an example; in practice, the reference frame feature vector can be computed from another number of selected frame feature vectors as appropriate.
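The weighted fusion of this section can be sketched as follows. Since the published formula images are not reproduced here, the weight used below (each frame's absolute deviation from the component mean, normalized so the weights sum to 1) is an assumption consistent with the stated definitions and the normalization constraint:

```python
def reference_frame(selected, n=4):
    """Weighted fusion of the last n selected frame feature vectors.

    selected: list of feature vectors, each a list of Q floats.
    The per-component weight (deviation from the component mean,
    normalized to sum to 1) is an assumption; only the normalization
    constraint and the deviation term are given in the text.
    """
    last = selected[-n:]
    q = len(last[0])
    ref = []
    for k in range(q):
        comps = [v[k] for v in last]
        mean = sum(comps) / n
        deltas = [abs(c - mean) for c in comps]   # Delta_i^(k)
        total = sum(deltas)
        if total == 0:   # all components equal: fall back to uniform weights
            weights = [1.0 / n] * n
        else:
            weights = [d / total for d in deltas]  # w_i^(k), sums to 1
        ref.append(sum(w * c for w, c in zip(weights, comps)))
    return ref

selected = [[1.0, 2.0], [2.0, 2.0], [3.0, 2.0], [6.0, 2.0]]
ref = reference_frame(selected, n=4)
```

For the first component the weights are proportional to the deviations |1-3|, |2-3|, |3-3|, |6-3|; the second component is constant across frames and falls back to uniform weighting.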
2. Computing the Euclidean distances between the reference frame feature vector and multiple candidate frame feature vectors:
Suppose the selection is made from the 3 following candidate frames. From the reference frame feature vector obtained above, the Euclidean distance $d_l$ between each of the 3 candidate frame feature vectors and the reference frame feature vector is computed as:
$$d_l = \sqrt{\sum_{k=1}^{Q} \left( u_l^{(k)} - \bar{v}^{(k)} \right)^2}$$
where $u_l^{(k)}$ denotes the k-th component of the l-th candidate frame feature vector. The Euclidean distance between the reference frame feature vector and a candidate frame feature vector measures the degree of similarity between the reference frame and that candidate frame; l is the index of a candidate frame, a positive integer with 1 ≤ l ≤ 3.
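The distance computation follows directly from its definition; a minimal sketch:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two Q-dimensional feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Distances d_l between a reference frame and 3 candidate frames
# (vectors here are illustrative two-component examples).
ref = [0.0, 0.0]
candidates = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
distances = [euclidean(c, ref) for c in candidates]
```

The larger the distance, the less similar the candidate frame is to the reference frame.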
3. Selecting the qualifying candidate frame according to the Euclidean distances $d_l$ and adding it to the selected frames:
Fig. 2 shows the flow of selecting the qualifying candidate frame from the candidate frames according to the Euclidean distances $d_l$. Among the 3 candidate frames, the candidate frame with the largest Euclidean distance is the least similar to the reference frame, and the candidate frame least similar to the reference frame best reflects the audio variation. The selection proceeds as follows:
Step 3-1: from the Euclidean distances $d_l$, select the maximum Euclidean distance $d_{max}$.
Step 3-2: compare the maximum Euclidean distance $d_{max}$ with a threshold D. If it exceeds the threshold, add the candidate frame corresponding to $d_{max}$ to the selected frames and take the following 3 audio frames as the new candidate frames; otherwise discard all 3 candidate frames and continue the selection with the next 3 audio frames as candidate frames.
By selecting from multiple candidate frames the audio frame that best represents them, combined with the preset threshold D, the feature extraction process selects more frames where the audio spectrum changes rapidly and fewer frames where it changes gently, while avoiding the shielding effect of selected frames on candidate frames.
It should be noted that this embodiment selects the qualifying candidate frame from 3 subsequent candidate frames as an example; in practice, the qualifying candidate frame can be selected from another number of candidate frames as appropriate.
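Steps 3-1 and 3-2 reduce to a single selection step; the threshold value used in the example call is an illustrative assumption:

```python
def select_candidate(distances, threshold):
    """Return the index of the candidate frame to keep, or None.

    distances: Euclidean distances d_l of the candidate frames to
    the reference frame. The candidate with the maximum distance is
    the least similar to the reference frame; it is kept only if
    that distance exceeds the threshold D, otherwise the whole group
    of candidates is discarded.
    """
    d_max = max(distances)
    if d_max > threshold:
        return distances.index(d_max)
    return None

kept = select_candidate([0.4, 1.7, 0.9], threshold=1.0)       # second frame kept
dropped = select_candidate([0.4, 0.7, 0.9], threshold=1.0)    # group discarded
```

This is how the method adapts the effective frame rate: groups whose most dissimilar frame still resembles the reference contribute nothing.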
Fig. 3 shows the execution flow of the variable frame rate audio feature extraction method provided by the present invention, which proceeds as follows:
Step 1: divide the audio into frames according to a preset frame length and frame shift, for example a frame length of 30 ms and a frame shift of 2 ms, to obtain audio frames, and compute a feature vector for each audio frame;
Step 2: according to preset n and h, select the first n audio frames in order as the initial selected frames, and take the following h audio frames as candidate frames;
Step 3: compute the weights of the components of the n selected frame feature vectors by the weighted fusion method, and fuse them to obtain the reference frame feature vector;
Step 4: compute the Euclidean distances between the reference frame feature vector and the h candidate frame feature vectors;
Step 5: from the Euclidean distances between the reference frame feature vector and the h candidate frame feature vectors, select the maximum and compare it with the threshold; if it exceeds the threshold, add the candidate frame corresponding to the maximum Euclidean distance to the selected frames, otherwise discard this group of candidate frames and continue the selection from the next group of candidate frames.
Repeat step 3, step 4 and step 5: from the remaining audio frames, select h audio frames in order as candidate frames and add the qualifying candidate frame among them to the selected frames. If fewer than h audio frames remain, take the actually remaining audio frames as candidate frames and select the qualifying candidate frame among them by the same method, until no audio frames remain. The feature vectors of all selected frames are then taken as the feature vectors of the audio data.
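Putting steps 1–5 together, the whole extraction loop can be sketched as below. Working directly on per-frame feature vectors, with the deviation-proportional weight formula assumed earlier and an illustrative threshold, this is a sketch of the flow rather than a definitive implementation:

```python
import math

def extract_features(frames, n=4, h=3, threshold=1.0):
    """Variable-frame-rate selection over per-frame feature vectors.

    frames: list of feature vectors (one per audio frame). The first
    n frames seed the selected set; each following group of h frames
    contributes at most one frame -- the one farthest from the
    reference frame, kept only if its distance exceeds the threshold.
    """
    def reference(sel):
        last = sel[-n:]
        q, ln = len(last[0]), len(last)
        ref = []
        for k in range(q):
            comps = [v[k] for v in last]
            mean = sum(comps) / ln
            deltas = [abs(c - mean) for c in comps]
            tot = sum(deltas)
            ws = [d / tot for d in deltas] if tot else [1.0 / ln] * ln
            ref.append(sum(w * c for w, c in zip(ws, comps)))
        return ref

    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    selected = list(frames[:n])
    pos = n
    while pos < len(frames):
        group = frames[pos:pos + h]   # fewer than h frames may remain
        pos += len(group)
        ref = reference(selected)
        ds = [dist(c, ref) for c in group]
        d_max = max(ds)
        if d_max > threshold:         # keep the least similar candidate
            selected.append(group[ds.index(d_max)])
    return selected

# One-component toy features: two "events" (5 and 9) amid silence.
frames = [[float(x)] for x in [0, 0, 0, 0, 5, 0, 0, 0, 0, 9]]
feats = extract_features(frames, n=4, h=3, threshold=1.0)
```

On this toy input both event frames survive while the flat groups are discarded, illustrating how more frames are kept where the signal changes rapidly.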

Claims (1)

1. A variable frame rate audio feature extraction method, characterized in that it proceeds as follows:
Step a: divide the audio data into frames according to a preset frame length and frame shift to obtain audio frames, and compute a feature vector for each audio frame;
Step b: select the first n audio frames in order as the initial selected frames, and take the following h audio frames as candidate frames; n and h are preset natural numbers;
Step c: from the n selected frame feature vectors, compute the reference frame feature vector $\bar{v}$:
$$\bar{v} = \sum_{i=m-n+1}^{m} w_i v_i$$
where $v_i$ denotes the feature vector of the i-th selected frame and $w_i$ the weight of the i-th selected frame; i is the index of a selected frame and a positive integer; m is the number of currently selected frames, and 1 < n ≤ m;
each audio frame feature vector is represented by Q components, the components of the i-th selected frame feature vector $v_i$ being written $v_i = (v_i^{(1)}, v_i^{(2)}, \ldots, v_i^{(Q)})$, where k is an integer and 1 ≤ k ≤ Q; the k-th component $\bar{v}^{(k)}$ of the reference frame feature vector $\bar{v}$ is therefore computed as:
$$\bar{v}^{(k)} = \sum_{i=m-n+1}^{m} w_i^{(k)} v_i^{(k)}$$
where $v_i^{(k)}$ denotes the k-th component of the i-th selected frame feature vector and $w_i^{(k)}$ the weight of the k-th component of the i-th selected frame feature vector, computed as:
$$w_i^{(k)} = \frac{\Delta_i^{(k)}}{\sum_{j=m-n+1}^{m} \Delta_j^{(k)}}$$
subject to $\sum_{i=m-n+1}^{m} w_i^{(k)} = 1$, where j is the index of a selected frame and a positive integer, and $\Delta_i^{(k)}$ is computed as:
$$\Delta_i^{(k)} = \left| v_i^{(k)} - \frac{1}{n} \sum_{j=m-n+1}^{m} v_j^{(k)} \right|$$
Step d: using the reference frame feature vector $\bar{v}$, compute the Euclidean distance $d_l$ between each of the h candidate frame feature vectors and the reference frame feature vector;
the Euclidean distance $d_l$ is computed as:
$$d_l = \sqrt{\sum_{k=1}^{Q} \left( u_l^{(k)} - \bar{v}^{(k)} \right)^2}$$
where $u_l^{(k)}$ denotes the k-th component of the l-th candidate frame feature vector, and l is the index of a candidate frame, a positive integer with 1 ≤ l ≤ h;
Step e: according to the Euclidean distances $d_l$ between the h candidate frame feature vectors and the reference frame feature vector, select the qualifying candidate frame and add it to the selected frames, as follows:
Step e-1: from the Euclidean distances $d_l$, select the maximum Euclidean distance $d_{max}$;
Step e-2: compare the maximum Euclidean distance $d_{max}$ with a threshold D; if it exceeds the threshold, add the candidate frame corresponding to $d_{max}$ to the selected frames and take the following h audio frames as the new candidate frames, otherwise discard all h candidate frames and continue the selection with the next h audio frames as candidate frames;
repeat step c, step d and step e: from the remaining audio frames, select h audio frames in order as candidate frames, and from these h candidate frames add the qualifying candidate frame to the selected frames; if fewer than h audio frames remain, take the actually remaining audio frames as candidate frames and select the qualifying candidate frame among them by the same method, until no audio frames remain; the feature vectors of all selected frames are then taken as the feature vectors of the audio data.
CN201510782814.XA 2015-11-16 2015-11-16 A variable frame rate audio feature extraction method Active CN105448290B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510782814.XA CN105448290B (en) 2015-11-16 2015-11-16 A variable frame rate audio feature extraction method

Publications (2)

Publication Number Publication Date
CN105448290A CN105448290A (en) 2016-03-30
CN105448290B true CN105448290B (en) 2019-03-01

Family

ID=55558397

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510782814.XA Active CN105448290B (en) 2015-11-16 2015-11-16 A variable frame rate audio feature extraction method

Country Status (1)

Country Link
CN (1) CN105448290B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3479378B1 (en) * 2016-07-04 2023-05-24 Harman Becker Automotive Systems GmbH Automatic correction of loudness level in audio signals containing speech signals
CN108922556B (en) * 2018-07-16 2019-08-27 百度在线网络技术(北京)有限公司 Sound processing method, device and equipment
CN117097909B (en) * 2023-10-20 2024-02-02 深圳市星易美科技有限公司 Distributed household audio and video processing method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1505572A1 (en) * 2002-05-06 2005-02-09 Prous Science S.A. Voice recognition method
CN102332262A (en) * 2011-09-23 2012-01-25 哈尔滨工业大学深圳研究生院 Method for intelligently identifying songs based on audio features
WO2015092711A1 (en) * 2013-12-18 2015-06-25 Isis Innovation Ltd. Method and apparatus for automatic speech recognition
CN105046699A (en) * 2015-07-09 2015-11-11 硅革科技(北京)有限公司 Motion video superposition contrast method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7110947B2 (en) * 1999-12-10 2006-09-19 At&T Corp. Frame erasure concealment technique for a bitstream-based feature extractor

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Gesture recognition research based on HMM; Yan Yan et al.; Journal of Central China Normal University (Natural Science Edition); October 2012; full text *
Keyframe extraction of motion capture data based on quantum-behaved particle swarm optimization; Yang Tao et al.; Application Research of Computers; August 2014; full text *

Also Published As

Publication number Publication date
CN105448290A (en) 2016-03-30

Similar Documents

Publication Publication Date Title
CN104464726B (en) A kind of determination method and device of similar audio
CN105513583A (en) Display method and system for song rhythm
CN103514883B (en) A kind of self-adaptation realizes men and women&#39;s sound changing method
CN105448290B (en) A kind of audio feature extraction methods becoming frame per second
CN106653056B (en) Fundamental frequency extraction model and training method based on LSTM recurrent neural network
CN106067989B (en) Portrait voice video synchronous calibration device and method
CN101819638B (en) Establishment method of pornographic detection model and pornographic detection method
US20160330512A1 (en) Multimedia processing method and multimedia apparatus
CN104134444B (en) A kind of song based on MMSE removes method and apparatus of accompanying
CN109147763A (en) A kind of audio-video keyword recognition method and device based on neural network and inverse entropy weighting
CN108206027A (en) A kind of audio quality evaluation method and system
CN110148425A (en) A kind of camouflage speech detection method based on complete local binary pattern
CN109035246A (en) A kind of image-selecting method and device of face
CN109616105A (en) A kind of noisy speech recognition methods based on transfer learning
CN110458247A (en) The training method and device of image recognition model, image-recognizing method and device
CN109346056A (en) Phoneme synthesizing method and device based on depth measure network
CN105895080A (en) Voice recognition model training method, speaker type recognition method and device
CN111128211B (en) Voice separation method and device
CN104199838B (en) A kind of user model constructing method based on label disambiguation
CN109272044A (en) A kind of image similarity determines method, apparatus, equipment and storage medium
CN107134277A (en) A kind of voice-activation detecting method based on GMM model
Callier Social meaning in prosodic variability
CN104217731A (en) Quick solo music score recognizing method
CN103327359A (en) Video significance region searching method applied to video quality evaluation
CN204883593U (en) Real border system of increase that combines speech recognition and pronunciation test and appraisal technique

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant