CN107622773A - Audio feature extraction method and device, and electronic device - Google Patents
Abstract
The invention discloses an audio feature extraction method and device, and an electronic device. The method comprises the following steps: step 1, obtaining the audio data to be processed; step 2, determining the original candidate extreme points of the audio data according to its spectral energy amplitude; step 3, screening the original candidate extreme points of each frame of audio data based on influence coefficients between candidate extreme points, and/or on the density of candidate extreme points, and/or on the results of difference calculations between candidate extreme points, to obtain the extreme point list of the audio data; step 4, extracting the fingerprint features of the audio data according to its extreme point list. The invention improves the noise immunity of the extracted audio features, so that they describe the audio data more accurately.
Description
Technical field
The present invention relates to the technical fields of audio signal processing and information retrieval, and in particular to an audio feature extraction method and device, and an electronic device.
Background art
With the explosion of information technology and the big-data industry, massive amounts of audio and video are stored in digital form, and analyzing and processing this massive audio data is a very important aspect of the current field of artificial intelligence — for example, performing audio retrieval or original-soundtrack retrieval after analyzing audio data, or extracting the effective speech from audio data and then performing speech recognition. In audio analysis, how accurately the extracted features describe the audio data directly determines how well that data can be applied.

Existing audio feature extraction methods typically just detect extreme points from the energy of the audio data and then extract audio features at those extreme points, such as spectral features or fundamental-frequency features; alternatively, they directly extract spectral or fundamental-frequency features to describe the audio data. However, whether features are extracted after first locating extreme points or extracted directly, these methods have poor noise immunity: when the audio data contains noise, it is difficult to extract features that accurately describe it, which seriously degrades the results of subsequent audio processing.
Summary of the invention
To overcome the above shortcomings of the prior art, the purpose of the present invention is to provide an audio feature extraction method and device, and an electronic device, which extract audio features accurately, improve the noise immunity of the extracted features, and enable the extracted features to describe the audio data more accurately.

To achieve the above purpose, the technical scheme provided by the invention is as follows:

An audio feature extraction method, comprising the following steps:

Step 1: obtain the audio data to be processed;

Step 2: determine the original candidate extreme points according to the spectral energy amplitude of the audio data;

Step 3: screen the original candidate extreme points of the audio data based on influence coefficients between candidate extreme points, and/or on the density of candidate extreme points, and/or on the results of difference calculations between candidate extreme points, to obtain the extreme point list of the audio data;

Step 4: extract the fingerprint features of the audio data according to its extreme point list.
Optionally, the step of screening based on the influence coefficients between candidate extreme points further comprises:

selecting each original candidate extreme point of the current frame of audio data in turn as a candidate center extreme point, constructing a region centered on it, and obtaining all candidate extreme points in that region;

calculating the influence coefficient between the candidate center extreme point and each of the other candidate extreme points;

determining whether to retain the candidate center extreme point according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points.

Optionally, the step of determining whether to retain the candidate center extreme point is specifically: if the frequency-domain amplitude of the candidate center extreme point is greater than or equal to the product of the frequency-domain amplitude of each non-center candidate extreme point in the region and the corresponding influence coefficient, retain the candidate center extreme point.
Optionally, the step of screening based on the density of candidate extreme points further comprises:

selecting in turn each extreme point — from the original candidate extreme points of each frame of audio data and/or from the candidate extreme points remaining after the influence-coefficient screening — as the current candidate extreme point, and calculating its density;

if the density of the current candidate extreme point exceeds a preset threshold, deleting it; otherwise retaining it.
Optionally, the step of screening based on the results of difference calculations between candidate extreme points further comprises:

performing a difference calculation on each candidate extreme point — from the original candidate extreme points of each frame of audio data, and/or the candidate extreme points remaining after the influence-coefficient screening, and/or those remaining after the density screening — to obtain the difference spectrum value of each candidate extreme point;

determining the extreme point list of each frame of audio data according to the difference spectrum values of the candidate extreme points.
Optionally, the difference calculation is specifically: performing a difference calculation on each candidate extreme point of the current frame of audio data using the spectrum values of the candidate extreme points of one or more frames before the current frame and of one or more frames after it, to obtain the differenced spectrum value of each candidate extreme point of the current frame.
Optionally, step 4 further comprises:

constructing a candidate region based on each extreme point in the extreme point list, and determining the extreme point pairs of each extreme point;

extracting the fingerprint features of each frame of audio data according to the extreme point pairs in the extreme point list;

merging the fingerprint features of every frame of audio data to obtain the audio fingerprint features of each segment of audio data.

Optionally, the step of constructing a candidate region based on each extreme point in the extreme point list and determining the extreme point pairs of each extreme point specifically includes:

selecting each extreme point in the extreme point list in turn as a fixed extreme point;

constructing the candidate region based on the fixed extreme point, and selecting extreme points in the candidate region to form extreme point pairs with the fixed extreme point.
To achieve the above purpose, the present invention also provides an audio feature extraction device, comprising:

an audio data acquisition unit, for obtaining the audio data to be processed;

a candidate extreme point determining unit, for determining the original candidate extreme points of the audio data according to its spectral energy amplitude;

a candidate extreme point screening unit, for screening the original candidate extreme points based on influence coefficients between candidate extreme points, and/or on the density of candidate extreme points, and/or on the results of difference calculations between candidate extreme points, to obtain the extreme point list of the audio data;

an audio feature extraction unit, for extracting the fingerprint features of the audio data according to its extreme point list.
Optionally, a first screening unit of the candidate extreme point screening unit screens the original candidate extreme points of each frame of audio data based on the influence coefficients between candidate extreme points, and is specifically used for:

selecting each original candidate extreme point of the current frame in turn as a candidate center extreme point, constructing a region centered on it, and obtaining all candidate extreme points in that region;

calculating the influence coefficient between the candidate center extreme point and each of the other candidate extreme points;

determining whether to retain the candidate center extreme point according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points.
Optionally, a second screening unit of the candidate extreme point screening unit screens candidate extreme points based on their density, and is specifically used for:

selecting in turn each extreme point — from the original candidate extreme points of each frame, or from the candidate extreme points remaining after the influence-coefficient screening — as the current candidate extreme point, and calculating its density;

if the density of the current candidate extreme point exceeds a preset threshold, deleting it; otherwise retaining it.
Optionally, a third screening unit of the candidate extreme point screening unit screens candidate extreme points based on the results of difference calculations, and is specifically used for:

performing a difference calculation on each candidate extreme point — from the original candidate extreme points, or the candidate extreme points remaining after the influence-coefficient screening, or those remaining after the density screening — to obtain the difference spectrum value of each candidate extreme point;

determining the extreme point list of each frame of audio data according to the difference spectrum values of the candidate extreme points.
Optionally, the audio feature extraction unit further comprises:

an extreme point pair determining unit, for constructing a candidate region based on each extreme point in the extreme point list and determining the extreme point pairs of each extreme point;

a fingerprint feature extraction unit, for extracting the fingerprint features of each frame of audio data according to the extreme point pairs in the extreme point list;

a merging unit, for merging the fingerprint features of every frame of audio data to obtain the audio fingerprint features of each segment of audio data.
The present invention also provides an electronic device, comprising:

a storage medium storing a plurality of instructions, the instructions being loaded by a processor to perform the steps of the above method; and

a processor, for executing the instructions in the storage medium.
Compared with the prior art, the audio feature extraction method, device, and electronic device of the present invention have the following beneficial effects:

they receive the audio data to be processed, determine its candidate extreme points according to its spectral energy amplitude, and then screen those candidate extreme points based on the auditory masking effect, the density of candidate extreme points, and the difference values of candidate extreme points, to obtain the extreme point list of the audio data and extract its fingerprint features from that list. By using the auditory masking effect, the candidate extreme point density, and the difference values of the candidate extreme points, the invention effectively improves the noise immunity of the extracted audio features, so that the extracted features describe the audio data more accurately.
Brief description of the drawings
Fig. 1 is a flow chart of one embodiment of the audio feature extraction method of the present invention;

Fig. 2 is a schematic diagram of the rectangular region around a candidate center extreme point in a specific embodiment of the invention;

Fig. 3 is a detailed flow chart of step 104 in a specific embodiment of the invention;

Fig. 4 is a schematic diagram of the construction of extreme point pairs for the fixed extreme point in Fig. 2;

Fig. 5 is a structural diagram of one embodiment of the audio feature extraction device of the present invention;

Fig. 6 is a detailed structural diagram of the audio feature extraction unit in a specific embodiment of the invention;

Fig. 7 is a structural diagram of an electronic device used for the audio feature extraction method of the present invention.
Detailed description of the embodiments
To illustrate the embodiments of the present invention and the technical schemes of the prior art more clearly, embodiments of the invention are described below with reference to the drawings. Evidently, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings, and other embodiments, from them without creative effort.

For simplicity, each figure only schematically shows the parts relevant to the present invention; the figures do not represent the actual structure of a product. In addition, where several parts in a figure share the same structure or function, only one of them is drawn or labeled. Herein, "one" does not only mean "exactly one"; it can also mean "more than one".
In one embodiment of the invention, as shown in Fig. 1, the audio feature extraction method of the present invention includes the following steps:

Step 101: obtain the audio data to be processed.

The audio data to be processed may be speech data containing effective speech, pure-music audio data, or song data. It may be collected by a voice acquisition device of a smart device, such as a microphone — the smart device may be a mobile phone, PC, tablet computer, and so on — or it may be pre-stored audio data or audio data transmitted by an external device; the present invention places no specific limitation on this.
Step 102: determine the original candidate extreme points of the audio data according to its spectral energy amplitude.

Specifically, step 102 further comprises:

step a) transforming the audio data into the frequency domain to obtain its spectral energy amplitude — the specific transform used by the present invention is the same as in the prior art and is not described here;

step b) selecting, for each frame of audio data, the points whose spectral energy amplitude exceeds a preset threshold as the original candidate extreme points of that frame.
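Steps a) and b) can be sketched in Python as follows. The frame length, hop size, window, and the mean-plus-two-standard-deviations threshold are all illustrative assumptions — the patent fixes none of these values:

```python
import numpy as np

def candidate_extreme_points(audio, frame_len=1024, hop=512, threshold=None):
    """Step a): frame the signal and take the magnitude spectrum of each
    frame. Step b): keep the time-frequency points whose spectral energy
    amplitude exceeds a threshold as the original candidate extreme points."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    frames = np.stack([audio[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1))  # spectral energy amplitude
    if threshold is None:
        # assumed heuristic threshold; the patent only says "preset threshold"
        threshold = spectrum.mean() + 2 * spectrum.std()
    t_idx, f_idx = np.nonzero(spectrum > threshold)
    # each candidate is (frame index, frequency bin, amplitude)
    return [(t, f, spectrum[t, f]) for t, f in zip(t_idx, f_idx)]

# toy usage: a pure 1 kHz tone at 8 kHz sampling should yield candidates
# concentrated around FFT bin 128 (= 1000 * 1024 / 8000)
sr = 8000
tone = np.sin(2 * np.pi * 1000 * np.arange(sr) / sr)
cands = candidate_extreme_points(tone)
```

With a Hann window and an exact-bin tone, the surviving candidates sit in the main bin and its two immediate neighbors, which is what the threshold rule is meant to isolate.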
Step 103: screen the original candidate extreme points of the audio data based on the influence coefficients between candidate extreme points, and/or on the density of candidate extreme points, and/or on the results of difference calculations between candidate extreme points, to obtain the extreme point list of the audio data. That is, in step 103, the original candidate extreme points can be screened by one or more of: the influence coefficients between candidate extreme points, the density of candidate extreme points, and the difference calculations between candidate extreme points.

As one example, in step 103 the candidate extreme points of each frame of audio data can first be screened based on the influence coefficients between candidate extreme points, yielding a first candidate extreme point list for each frame as the extreme point list of the audio data.
In a specific embodiment of the invention, G(i, j) denotes the influence coefficient between the i-th candidate extreme point and the j-th candidate extreme point in the time and frequency dimensions. The influence coefficient is determined based on the auditory masking effect, which refers to the fact that in human sound perception, spectral peak frequency points interact with one another: one frequency component may mask similar frequency components.

The present invention uses this influence coefficient for the first screening of the candidate extreme points. Specifically, the first screening step is as follows: select each original candidate extreme point of the current frame in turn as a candidate center extreme point, construct a region centered on it, and obtain all candidate extreme points in that region. For example, first select a candidate extreme point from the original candidate extreme points of the current frame as the candidate center extreme point, construct a rectangular region centered on it on the spectrogram of the audio data, and find the candidate extreme points of each frame inside that rectangle — the horizontal axis of the spectrogram is time, the vertical axis is frequency, and the shade of each candidate extreme point in the figure represents its amplitude; Fig. 2 is a schematic diagram of the rectangular region of a candidate center extreme point. Then calculate the influence coefficient G(i, j) between the candidate center extreme point and each other candidate extreme point in the rectangle, as shown in formula (1):
In formula (1), i_t and j_t denote the time values of the i-th and j-th candidate extreme points respectively, i_f and j_f denote their frequency values, and l and w denote the length and width of the rectangular region around the center extreme point;
Whether to retain the candidate center extreme point is then determined from the influence coefficients and the frequency-domain amplitudes of the candidate extreme points. Specifically, if, for every non-center candidate extreme point in the rectangular region of the candidate center extreme point, the product of that point's frequency-domain amplitude and the corresponding influence coefficient is less than the frequency-domain amplitude of the center extreme point, the candidate center extreme point is retained, as shown in formula (2):

P(i) ≥ P(j) × G(i, j)    (2)

where P(i) is the frequency-domain amplitude of the center extreme point and P(j) is the frequency-domain amplitude of any other, non-center extreme point in the rectangular region. Note that if there are no other candidate extreme points in the rectangular region, the candidate center extreme point is retained directly.
For example, in the rectangular region of the current candidate center extreme point in Fig. 2, there are 8 other candidate extreme points besides the center point. Formula (2) must be evaluated between the candidate center extreme point and each of the 8 candidate extreme points; only when all of them satisfy the condition of formula (2) is the candidate center extreme point retained, and otherwise it is deleted.
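This first, masking-based screening can be sketched as follows. Note that the body of formula (1) does not survive in this copy of the text, so the influence coefficient below is an ASSUMED stand-in (a linear decay over the l-by-w rectangle), not the patent's definition; only the retention rule of formula (2) is taken from the text:

```python
def _influence(i, j, l, w):
    """ASSUMED influence coefficient G(i, j): decays linearly with
    normalized time/frequency distance inside the l-by-w rectangle.
    The patent defines G(i, j) in formula (1), not reproduced here."""
    dt = abs(i[0] - j[0]) / l
    df = abs(i[1] - j[1]) / w
    return max(0.0, 1.0 - dt) * max(0.0, 1.0 - df)

def masking_screen(candidates, l=20, w=10):
    """Keep candidate i only if P(i) >= P(j) * G(i, j) for every other
    candidate j inside the rectangle of length l (time) and width w
    (frequency) centered on i -- the rule of formula (2). A point with
    no neighbors in its rectangle is kept directly.
    Candidates are (time, frequency, amplitude) tuples."""
    kept = []
    for c in candidates:
        region = [o for o in candidates
                  if o != c
                  and abs(o[0] - c[0]) <= l / 2
                  and abs(o[1] - c[1]) <= w / 2]
        # all() over an empty region is True: keep the point directly
        if all(c[2] >= o[2] * _influence(c, o, l, w) for o in region):
            kept.append(c)
    return kept

# toy usage: a strong point masks a weak neighbor; an isolated point survives
cands = [(0, 0, 10.0), (1, 1, 1.0), (30, 30, 5.0)]
kept = masking_screen(cands, l=20, w=10)
```

Here the weak point at (1, 1) fails formula (2) against its strong neighbor and is deleted, while the isolated point at (30, 30) has an empty rectangle and is retained directly.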
As one example, after the candidate extreme points of each frame have been screened based on the influence coefficients between candidate extreme points, the first candidate extreme point list can be screened again based on the density of candidate extreme points, to filter out hum (electrical "current" noise), yielding a second candidate extreme point list for each frame as the extreme point list of the audio data.

In some audio there exist, on certain frequency bands, extreme points that are continuous in time and have both very high energy and very high density — that is, hum. Hum can produce very high audio-matching scores over short spans and thus mislead the matching results. Therefore, to prevent densely packed high-energy spectrum points from appearing in certain frequency ranges, the density-based screening of the present invention specifically includes:
selecting each extreme point in the first candidate extreme point list in turn as the current candidate extreme point, moving forward or backward in time from it by a fixed duration (for example 5 s), and counting the total number of candidate extreme points within that span as the density of the current candidate extreme point;

if the density of the current candidate extreme point exceeds a preset threshold, deleting it; otherwise retaining it.

By screening each extreme point in the first candidate extreme point list in turn in this way, the second candidate extreme point list is obtained.
Of course, the present invention can also apply the density-based screening directly to the original candidate extreme points of each frame of audio data; the specific screening steps are the same as above and are not repeated here.
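A minimal sketch of the density screening, with candidates carrying a frame index for time; the window of 50 frames and the threshold of 30 points are illustrative assumptions (the patent's example window is about 5 s):

```python
def density_screen(candidates, window_frames=50, max_density=30):
    """For each candidate extreme point, count the candidates whose frame
    index lies within a fixed window before or after it (the point's
    density); delete the point if its density exceeds a preset threshold.
    Candidates are (frame index, frequency bin, amplitude) tuples."""
    times = [t for t, _f, _p in candidates]
    kept = []
    for t, f, p in candidates:
        density = sum(1 for u in times if abs(u - t) <= window_frames)
        if density <= max_density:   # over-dense points (hum) are dropped
            kept.append((t, f, p))
    return kept

# toy usage: a dense 40-point burst (hum-like) plus one isolated point
cluster = [(t, 5, 1.0) for t in range(40)]
isolated = [(1000, 5, 1.0)]
kept = density_screen(cluster + isolated)
```

Every point in the burst sees 40 neighbors within the window and is deleted, while the isolated point survives — exactly the behavior intended for filtering hum.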
As one example, to improve the noise immunity and adaptivity of the extreme points, the present invention can also perform a difference calculation, in turn, on the extreme points in the second candidate extreme point list obtained from the density screening, so that the audio can still be matched after its energy has been scaled as a whole.

In the specific difference calculation, the spectrum values of the candidate extreme points of one or more frames before the current frame and of one or more frames after it are used to compute a difference for each candidate extreme point of the current frame, yielding the differenced spectrum value of each candidate extreme point of the current frame. The specific difference calculation is shown in formula (3):

ΔP(i) = |P(i) + P(i(t+1)) − P(i(t−1)) − P(i(t−2))|    (3)

where ΔP(i) is the value of candidate extreme point i of the current frame after the difference calculation, P(i(t+1)) is the spectrum value of the candidate extreme point in the next frame at the same frequency band as candidate extreme point i, and P(i(t−1)) and P(i(t−2)) are the spectrum values of the candidate extreme points in the previous frame and the frame before that, respectively, at the same frequency band as candidate extreme point i;
after the difference calculation has been completed for all extreme points in the second candidate extreme point list, the difference spectrum value of each candidate extreme point in the list is obtained. For each frame of audio data, the candidate extreme points whose difference spectrum value exceeds a preset threshold are selected as the extreme points of that frame; alternatively, the difference spectrum values of the candidate extreme points in the second candidate extreme point list can be sorted, and the top N candidate extreme points by difference spectrum value selected as the extreme points of each frame, thereby obtaining the extreme point list of each frame of audio data.

Of course, the present invention can also apply the difference calculation to the original candidate extreme points, or to the first candidate extreme point list obtained from the influence-coefficient screening, to obtain the extreme point list of each frame; the specific difference calculation steps are the same as above and are not repeated here.
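Formula (3) translates directly into code. The toy values below also illustrate the scaling remark above: scaling the whole spectrum by a factor scales every difference value by the same factor, so the top-N ranking of candidates is unchanged by an overall energy scaling:

```python
import numpy as np

def difference_spectrum(P, t, f):
    """Formula (3): dP(i) = |P(i) + P(i(t+1)) - P(i(t-1)) - P(i(t-2))|.
    P is a 2-D array of spectrum values indexed (frame, frequency bin);
    candidate extreme point i sits at frame t and bin f, and the
    neighboring frames are read at the same frequency bin."""
    return abs(P[t, f] + P[t + 1, f] - P[t - 1, f] - P[t - 2, f])

# toy spectrum: one frequency bin across four frames (t-2, t-1, t, t+1)
P = np.array([[1.0], [2.0], [4.0], [8.0]])
dp = difference_spectrum(P, t=2, f=0)          # |4 + 8 - 2 - 1| = 9
dp_scaled = difference_spectrum(3 * P, t=2, f=0)  # scales with the spectrum
```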
Step 104: extract the fingerprint features of the audio data according to the extreme point list of the audio data.
Specifically, as shown in Fig. 3, step 104 further comprises:

Step S31: construct a candidate region based on each extreme point in the extreme point list, and determine the extreme point pairs of each extreme point. Specifically, in step S31, each extreme point in the extreme point list is selected in turn as the current fixed extreme point; then, based on the fixed extreme point, a candidate region is constructed within a fixed frequency band and time range, and the g extreme points whose spectral energy exceeds a preset threshold are selected to form point pairs with the fixed extreme point. For example, in Fig. 2 the candidate region contains 8 candidate extreme points, of which only the 5 with the larger spectral energy are selected to construct point pairs with the fixed extreme point; Fig. 4 is a schematic diagram of the point-pair construction for the fixed extreme point in Fig. 2. Applying this process to each extreme point in the extreme point list yields the extreme point pairs of every extreme point in the extreme point list of each frame of audio data;
Step S32: extract the fingerprint features of each frame of audio data according to the extreme point pairs in the extreme point list. In the specific extraction, each extreme point in the extreme point list is selected in turn as the current extreme point, and the fingerprint feature F of the current frame is extracted from the current extreme point and the extreme point pairs it forms. Specifically, F consists of: the time information t of the frame containing the current extreme point; the frequency-domain spectrum value f of the current extreme point; the differences Δt between the time information of the current extreme point's frame and the frames of each paired extreme point; the differences Δf between the frequency-domain spectrum values of the current extreme point and each paired extreme point; and the unique identifier audioID of the current frame of audio data. The fingerprint feature of the current frame is thus F = {t, f, Δt, Δf, audioID};
Step S33: combine the audio fingerprint features of all the frames contained in each segment of audio data to obtain the audio fingerprint features of that segment.
In one embodiment of the invention, as shown in Fig. 5, the audio feature extraction device of the present invention includes: an audio data acquisition unit 51, a candidate extreme point determining unit 52, an extreme point list determining unit 53, and an audio feature extraction unit 54.
The audio data acquisition unit 51 is used for obtaining the audio data to be processed. The audio data may be speech data containing effective speech, pure-music audio data, or song data.

The candidate extreme point determining unit 52 is used for determining the original candidate extreme points of the audio data according to its spectral energy amplitude.

The extreme point list determining unit 53 screens the original candidate extreme points of each frame based on the influence coefficients between candidate extreme points, and/or on the density of candidate extreme points, and/or on the results of difference calculations between candidate extreme points, to obtain the extreme point list of the audio data.
Specifically, the extreme point list determining unit 53 further comprises:

a first screening unit, for screening the original candidate extreme points of each frame based on the influence coefficients between candidate extreme points; and/or

a second screening unit, for screening, based on the density of candidate extreme points, either the original candidate extreme points of each frame or the candidate extreme points remaining after the first screening unit's screening; and/or

a third screening unit, for screening the original candidate extreme points, or the candidate extreme points remaining after the first screening unit's screening, or those remaining after the second screening unit's screening.
The first screening unit is specifically configured to:
select, in turn, each original candidate extreme point of the current frame of audio data as a candidate center extreme point, construct a region centered on that candidate center extreme point, and obtain all candidate extreme points within the region. Specifically, an original candidate extreme point of the current frame of audio data is selected as the candidate center extreme point, a rectangular region centered on that extreme point is constructed on the spectrogram, and the candidate extreme points of each frame of audio data within the rectangular region are found;
calculate the influence coefficients between the candidate center extreme point and each of the other candidate extreme points;
determine whether to retain the candidate center extreme point according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points. Specifically, if the product of the frequency-domain amplitude of every non-center candidate extreme point in the rectangular region and its influence coefficient is smaller than the frequency-domain amplitude of the candidate center extreme point, the candidate center extreme point is retained.
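As a concrete illustration, the retention rule of the first screening unit can be sketched as follows. The patent does not give a formula for the influence coefficient, so the Gaussian-style time/frequency decay in `influence()` below, the region size, and the `(frame, freq_bin, amplitude)` peak layout are all illustrative assumptions, not the claimed method.

```python
import math

def influence(center, other, sigma_t=2.0, sigma_f=10.0):
    """Illustrative influence coefficient: decays with the time and
    frequency distance between two candidate peaks (the patent leaves
    the actual formula unspecified)."""
    dt = center[0] - other[0]          # frame-index difference
    df = center[1] - other[1]          # frequency-bin difference
    return math.exp(-(dt * dt) / (2 * sigma_t ** 2)
                    - (df * df) / (2 * sigma_f ** 2))

def screen_by_influence(peaks, dt=5, df=20):
    """Keep a candidate center peak only if, inside its rectangular
    region, amplitude(other) * influence < amplitude(center) for every
    other candidate. Each peak is a (frame, freq_bin, amplitude) tuple;
    dt and df set the half-width of the rectangular region."""
    kept = []
    for i, c in enumerate(peaks):
        region = [p for j, p in enumerate(peaks) if j != i
                  and abs(p[0] - c[0]) <= dt and abs(p[1] - c[1]) <= df]
        if all(p[2] * influence(c, p) < c[2] for p in region):
            kept.append(c)
    return kept
```

Note that the weaker of two nearby peaks is suppressed while an isolated peak always survives, which matches the qualitative behaviour described above.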
In the present invention, the second screening unit screens, based on the density of candidate extreme points, the original candidate extreme points of each frame of audio data or the candidate extreme points retained after screening by the first screening unit, so as to filter out electrical current noise. The second screening unit is specifically configured to:
select, in turn, each of the original candidate extreme points of each frame of audio data, or each extreme point among the candidate extreme points retained after screening based on the influence coefficients between candidate extreme points, as the current candidate extreme point; move forward or backward in time from the current candidate extreme point by a fixed duration (for example, 5 s), and count the total number of candidate extreme points within that interval as the density of the current candidate extreme point;
if the density of the current candidate extreme point is greater than a preset threshold, delete the candidate extreme point; otherwise, retain the current candidate extreme point.
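A minimal sketch of the density screen described above. The 5 s window follows the example in the text, while the frame rate (`frames_per_s`) and the density threshold are placeholder parameters, not values from the patent.

```python
def screen_by_density(peaks, window_s=5.0, frames_per_s=100,
                      max_density=50):
    """Retain a candidate extreme point only if the number of
    candidates within +/- window_s seconds of it (its density) does
    not exceed max_density. Peaks are (frame, freq_bin, amplitude)
    tuples; frames_per_s converts the window from seconds to frames."""
    win = int(window_s * frames_per_s)      # window half-width in frames
    times = [p[0] for p in peaks]
    kept = []
    for p in peaks:
        density = sum(1 for t in times if abs(t - p[0]) <= win)
        if density <= max_density:
            kept.append(p)
    return kept
```

Dense clusters of peaks, such as those produced by steady hum, are removed wholesale, while temporally sparse peaks pass through unchanged.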
In the present invention, the third screening unit performs difference calculation on the original candidate extreme points, the candidate extreme points retained after screening by the first screening unit, or the candidate extreme points retained after screening by the second screening unit, and then determines the extreme point list of each frame of audio data. The third screening unit is specifically configured to:
perform difference calculation on each candidate extreme point among the original candidate extreme points, the candidate extreme points retained after screening based on the influence coefficients between candidate extreme points, or the candidate extreme points retained after screening based on the density of candidate extreme points, to obtain the difference spectrum value of each candidate extreme point. Specifically, the difference calculation obtains the difference spectrum value of each candidate extreme point of the current frame of audio data by differencing the candidate extreme points of the current frame against the spectrum values of the candidate extreme points of one or more frames of audio data before and after the current frame;
for each frame of audio data, select the candidate extreme points whose difference spectrum value exceeds a threshold as the extreme points of that frame; or sort the difference spectrum values of the candidate extreme points and, according to their magnitude, select the top N candidate extreme points as the extreme points of each frame of audio data, thereby obtaining the extreme point list of each frame of audio data.
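The difference screening with top-N selection can be sketched as follows. The text leaves the exact difference formula open, so differencing each candidate against the average of the same frequency bin in the adjacent frames is only one plausible reading; the spectrogram layout and parameter values are likewise assumptions.

```python
def difference_screen(spec, frame_peaks, top_n=3):
    """spec: 2-D list spec[frame][bin] of spectrum values.
    frame_peaks: {frame: [bin, ...]} candidate extreme points per frame.
    The difference spectrum value of each candidate is taken here as
    its current value minus the average of the same bin in the
    previous and next frames (an assumed formula); the top_n
    candidates per frame by that value become the frame's extreme
    points."""
    result = {}
    for frame, bins in frame_peaks.items():
        scored = []
        for b in bins:
            prev = spec[frame - 1][b] if frame > 0 else 0.0
            nxt = spec[frame + 1][b] if frame + 1 < len(spec) else 0.0
            diff = spec[frame][b] - 0.5 * (prev + nxt)
            scored.append((diff, b))
        scored.sort(reverse=True)               # largest difference first
        result[frame] = [b for _, b in scored[:top_n]]
    return result
```

Peaks that stand out from their temporal neighbourhood score highest, so sustained tones contribute little to the final extreme point list.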
Audio feature extraction unit 54 extracts the fingerprint features of the audio data according to the extreme point list of the audio data to be processed.
Specifically, as shown in Fig. 6, the audio feature extraction unit 54 further comprises:
an extreme point pair determining unit 541, for constructing a candidate region based on each extreme point in the extreme point list and determining the extreme point pair of each extreme point. Specifically, the extreme point pair determining unit 541 selects each extreme point in the extreme point list in turn as the current fixed extreme point, constructs a candidate region within a fixed frequency band and time range based on the current fixed extreme point, and selects the g extreme points whose spectrum energy exceeds a preset threshold to form point pairs with the fixed extreme point, respectively;
a fingerprint feature extraction unit 542, for extracting the fingerprint feature of each frame of audio data according to each extreme point pair in the extreme point list;
a merging unit 543, for merging the fingerprint features of every frame of audio data to obtain the audio fingerprint feature of each segment of audio data.
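The point-pair construction of unit 541 resembles constellation-style pairing and can be sketched as below. The candidate-region bounds, the pair encoding, and the omission of the spectrum-energy threshold are illustrative simplifications, not the patented layout.

```python
def build_fingerprints(extreme_points, g=3, max_dt=50, max_df=30):
    """extreme_points: list of (frame, freq_bin) tuples.
    Each point acts in turn as the fixed (anchor) point; up to g later
    points inside a candidate region of max_dt frames and max_df bins
    are paired with it. Each pair is encoded as the illustrative
    fingerprint ((anchor_freq, other_freq, time_delta), anchor_time)."""
    prints = []
    pts = sorted(extreme_points)
    for i, (t0, f0) in enumerate(pts):
        paired = 0
        for t1, f1 in pts[i + 1:]:
            if t1 - t0 > max_dt:          # beyond the time range
                break
            if abs(f1 - f0) <= max_df:    # inside the frequency band
                prints.append(((f0, f1, t1 - t0), t0))
                paired += 1
                if paired == g:           # at most g pairs per anchor
                    break
    return prints
```

Pair hashes of this kind are robust to uniform time shifts, since only the delta between the two points enters the key while the anchor time is stored alongside it.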
Referring to Fig. 7, a structural schematic diagram of an electronic device 300 for carrying out the audio feature extraction method of the present invention is shown. The electronic device 300 includes a processing component 301, which further includes one or more processors, and storage device resources represented by a storage medium 302 for storing instructions, such as application programs, executable by the processing component 301. The application programs stored in the storage medium 302 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 301 is configured to execute the instructions so as to perform the steps of the audio feature extraction method described above.
The electronic device 300 may also include a power supply component 303 configured to perform power management of the electronic device 300, a wired or wireless network interface 304 configured to connect the electronic device 300 to a network, and an input/output (I/O) interface 305. The electronic device 300 may operate based on an operating system stored in the storage medium 302, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In summary, the audio feature extraction method and device and the electronic device of the present invention receive audio data to be processed, determine the candidate extreme points of the audio data according to its spectrum energy amplitudes, and then screen those candidate extreme points based respectively on the influence coefficients between candidate extreme points, the density of the candidate extreme points, and the difference values of the candidate extreme points, to obtain the extreme point list of the audio data to be processed, thereby achieving the purpose of extracting the fingerprint features of the audio data according to the extreme point list. By exploiting the influence coefficients between candidate extreme points, the density of the candidate extreme points, and their difference values, the present invention can effectively improve the noise immunity of the extracted audio features, so that the extracted audio features describe the audio data more accurately.
It should be noted that the above embodiments can be freely combined as required. The above are merely preferred embodiments of the present invention. It should be pointed out that those of ordinary skill in the art can make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (14)
1. An audio feature extraction method, comprising the following steps:
step 1: obtaining audio data to be processed;
step 2: determining original candidate extreme points according to the spectrum energy amplitudes of the audio data to be processed;
step 3: screening the original candidate extreme points of the audio data to be processed based on the influence coefficients between candidate extreme points and/or the density of the candidate extreme points and/or the difference calculation results between candidate extreme points, to obtain the extreme point list of the audio data to be processed;
step 4: extracting the fingerprint features of the audio data according to the extreme point list of the audio data to be processed.
2. The audio feature extraction method according to claim 1, characterised in that the step of screening based on the influence coefficients between candidate extreme points further comprises:
selecting, in turn, each original candidate extreme point of the current frame of audio data as a candidate center extreme point, constructing a region centered on the candidate center extreme point, and obtaining all candidate extreme points within the region;
calculating the influence coefficients between the candidate center extreme point and each of the other candidate extreme points;
determining whether to retain the candidate center extreme point according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points.
3. The audio feature extraction method according to claim 2, characterised in that the step of determining whether to retain the candidate center extreme point according to the influence coefficients is specifically: if the frequency-domain amplitude of the candidate center extreme point is greater than or equal to the product of the frequency-domain amplitude of each non-center candidate extreme point in the region and the corresponding influence coefficient, retaining the candidate center extreme point.
4. The audio feature extraction method according to claim 1, characterised in that the step of screening based on the density of candidate extreme points further comprises:
selecting, in turn, each of the original candidate extreme points of each frame of audio data and/or each extreme point among the candidate extreme points retained after screening based on the influence coefficients between candidate extreme points, as the current candidate extreme point, and calculating the density of the current candidate extreme point;
if the density of the current candidate extreme point is greater than a preset threshold, deleting the current candidate extreme point; otherwise, retaining the current candidate extreme point.
5. The audio feature extraction method according to claim 4, characterised in that the step of screening based on the difference calculation results between candidate extreme points further comprises:
performing difference calculation on each candidate extreme point among the original candidate extreme points of each frame of audio data and/or the candidate extreme points retained after screening based on the influence coefficients between candidate extreme points and/or the candidate extreme points retained after screening based on the density of candidate extreme points, to obtain the difference spectrum value of each candidate extreme point;
determining the extreme point list of each frame of audio data according to the difference spectrum values of the candidate extreme points.
6. The audio feature extraction method according to claim 5, characterised in that the step of difference calculation is specifically: performing difference calculation on the candidate extreme points of the current frame of audio data according to the spectrum values of the candidate extreme points of one or more frames of audio data before the current frame and of one or more frames of audio data after the current frame, to obtain the difference spectrum value of each candidate extreme point of the current frame of audio data.
7. The audio feature extraction method according to claim 1, characterised in that step 4 further comprises:
constructing a candidate region based on each extreme point in the extreme point list, and determining the extreme point pair of each extreme point;
extracting the fingerprint feature of each frame of audio data according to each extreme point pair in the extreme point list;
merging the fingerprint features of every frame of audio data to obtain the audio fingerprint feature of each segment of audio data.
8. The audio feature extraction method according to claim 7, characterised in that the step of constructing a candidate region based on each extreme point in the extreme point list and determining the extreme point pair of each extreme point specifically includes:
selecting each extreme point in the extreme point list in turn as a fixed extreme point;
constructing the candidate region based on the fixed extreme point, and selecting extreme points in the candidate region to form extreme point pairs with the fixed extreme point.
9. An audio feature extraction device, comprising:
an audio data acquiring unit, for obtaining audio data to be processed;
a candidate extreme point determining unit, for determining the original candidate extreme points of the audio data to be processed according to its spectrum energy amplitudes;
a candidate extreme point screening unit, for screening the original candidate extreme points of the audio data based on the influence coefficients between candidate extreme points and/or the density of the candidate extreme points and/or the difference calculation results between candidate extreme points, to obtain the extreme point list of the audio data to be processed;
an audio feature extraction unit, for extracting the fingerprint features of the audio data according to the extreme point list of the audio data to be processed.
10. The audio feature extraction device according to claim 9, characterised in that the first screening unit of the candidate extreme point screening unit screens the original candidate extreme points of each frame of audio data based on the influence coefficients between candidate extreme points, and is specifically configured to:
select, in turn, each original candidate extreme point of the current frame of audio data as a candidate center extreme point, construct a region centered on the candidate center extreme point, and obtain all candidate extreme points within the region;
calculate the influence coefficients between the candidate center extreme point and each of the other candidate extreme points;
determine whether to retain the candidate center extreme point according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points.
11. The audio feature extraction device according to claim 9, characterised in that the second screening unit of the candidate extreme point screening unit screens the original candidate extreme points of the audio data based on the density of candidate extreme points, and is specifically configured to:
select, in turn, each of the original candidate extreme points of each frame of audio data, or each extreme point among the candidate extreme points retained after screening based on the influence coefficients between candidate extreme points, as the current candidate extreme point, and calculate the density of the current candidate extreme point;
if the density of the current candidate extreme point is greater than a preset threshold, delete the candidate extreme point; otherwise, retain the current candidate extreme point.
12. The audio feature extraction device according to claim 11, characterised in that the third screening unit of the candidate extreme point screening unit screens the original candidate extreme points of the audio data based on the difference calculation results between candidate extreme points, and is specifically configured to:
perform difference calculation on each candidate extreme point among the original candidate extreme points, the candidate extreme points retained after screening based on the influence coefficients between candidate extreme points, or the candidate extreme points retained after screening based on the density of candidate extreme points, to obtain the difference spectrum value of each candidate extreme point;
determine the extreme point list of each frame of audio data according to the difference spectrum values of the candidate extreme points.
13. The audio feature extraction device according to claim 9, characterised in that the audio feature extraction unit further comprises:
an extreme point pair determining unit, for constructing a candidate region based on each extreme point in the extreme point list and determining the extreme point pair of each extreme point;
a fingerprint feature extraction unit, for extracting the fingerprint feature of each frame of audio data according to each extreme point pair in the extreme point list;
a merging unit, for merging the fingerprint features of every frame of audio data to obtain the audio fingerprint feature of each segment of audio data.
14. An electronic device, characterised in that the electronic device includes:
a storage medium storing a plurality of instructions, the instructions being loaded by a processor to perform the steps of the method according to any one of claims 1 to 8; and
a processor, for executing the instructions in the storage medium.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710803397.1A CN107622773B (en) | 2017-09-08 | 2017-09-08 | Audio feature extraction method and device and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107622773A true CN107622773A (en) | 2018-01-23 |
CN107622773B CN107622773B (en) | 2021-04-06 |
Family
ID=61088507
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710803397.1A Active CN107622773B (en) | 2017-09-08 | 2017-09-08 | Audio feature extraction method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107622773B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1819197A2 (en) * | 2006-02-14 | 2007-08-15 | STMicroelectronics Asia Pacific Pte Ltd. | Digital audio signal processing method and system for generating and controlling digital reverberations for audio signals |
US20080091366A1 (en) * | 2004-06-24 | 2008-04-17 | Avery Wang | Method of Characterizing the Overlap of Two Media Segments |
CN102214218A (en) * | 2011-06-07 | 2011-10-12 | 盛乐信息技术(上海)有限公司 | System and method for retrieving contents of audio/video |
US20160335347A1 * | 2015-05-11 | 2016-11-17 | Alibaba Group Holding Limited | Audio information retrieval method and device |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019184518A1 (en) * | 2018-03-29 | 2019-10-03 | 北京字节跳动网络技术有限公司 | Audio retrieval and identification method and device |
WO2019184517A1 (en) * | 2018-03-29 | 2019-10-03 | 北京字节跳动网络技术有限公司 | Audio fingerprint extraction method and device |
US10950255B2 (en) | 2018-03-29 | 2021-03-16 | Beijing Bytedance Network Technology Co., Ltd. | Audio fingerprint extraction method and device |
US11182426B2 (en) | 2018-03-29 | 2021-11-23 | Beijing Bytedance Network Technology Co., Ltd. | Audio retrieval and identification method and device |
CN109658939A (en) * | 2019-01-26 | 2019-04-19 | 北京灵伴即时智能科技有限公司 | A kind of telephonograph access failure reason recognition methods |
CN109658939B (en) * | 2019-01-26 | 2020-12-01 | 北京灵伴即时智能科技有限公司 | Method for identifying reason of call record non-connection |
CN111522991A (en) * | 2020-04-15 | 2020-08-11 | 厦门快商通科技股份有限公司 | Audio fingerprint extraction method, device and equipment |
CN112037815A (en) * | 2020-08-28 | 2020-12-04 | 中移(杭州)信息技术有限公司 | Audio fingerprint extraction method, server and storage medium |
CN114640926A (en) * | 2022-03-31 | 2022-06-17 | 歌尔股份有限公司 | Current sound detection method, device, equipment and computer readable storage medium |
CN114640926B (en) * | 2022-03-31 | 2023-11-17 | 歌尔股份有限公司 | Current sound detection method, device, equipment and computer readable storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107622773B (en) | 2021-04-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107622773A (en) | A kind of audio feature extraction methods and device, electronic equipment | |
CN111210021B (en) | Audio signal processing method, model training method and related device | |
CN103503060B (en) | Speech syllable/vowel/phone boundary detection using auditory attention cues | |
WO2020181824A1 (en) | Voiceprint recognition method, apparatus and device, and computer-readable storage medium | |
WO2021073116A1 (en) | Method and apparatus for generating legal document, device and storage medium | |
CN112863547A (en) | Virtual resource transfer processing method, device, storage medium and computer equipment | |
CN106503184B (en) | Determine the method and device of the affiliated class of service of target text | |
CN109065051B (en) | Voice recognition processing method and device | |
CN105006230A (en) | Voice sensitive information detecting and filtering method based on unspecified people | |
CN1013525B (en) | Real-time phonetic recognition method and device with or without function of identifying a person | |
CN111462758A (en) | Method, device and equipment for intelligent conference role classification and storage medium | |
CN110444190A (en) | Method of speech processing, device, terminal device and storage medium | |
CN108021635A (en) | The definite method, apparatus and storage medium of a kind of audio similarity | |
Sun et al. | Dynamic time warping for speech recognition with training part to reduce the computation | |
CN108628813A (en) | Treating method and apparatus, the device for processing | |
CN107564526A (en) | Processing method, device and machine readable media | |
CN111710332B (en) | Voice processing method, device, electronic equipment and storage medium | |
CN110473563A (en) | Breathing detection method, system, equipment and medium based on time-frequency characteristics | |
CN110931019B (en) | Public security voice data acquisition method, device, equipment and computer storage medium | |
CN109841221A (en) | Parameter adjusting method, device and body-building equipment based on speech recognition | |
CN110728993A (en) | Voice change identification method and electronic equipment | |
CN112466328B (en) | Breath sound detection method and device and electronic equipment | |
CN106297795B (en) | Audio recognition method and device | |
CN111968651A (en) | WT (WT) -based voiceprint recognition method and system | |
CN106340310B (en) | Speech detection method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||