CN107622773A - Audio feature extraction method and device, and electronic equipment - Google Patents

Audio feature extraction method and device, and electronic equipment

Info

Publication number
CN107622773A
Authority
CN
China
Prior art keywords
candidate extreme point
extreme point
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710803397.1A
Other languages
Chinese (zh)
Other versions
CN107622773B (en)
Inventor
李永超
方昕
刘俊华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201710803397.1A (granted as CN107622773B)
Publication of CN107622773A
Application granted
Publication of CN107622773B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an audio feature extraction method and device, and electronic equipment. The method comprises the following steps: step 1, obtaining audio data to be processed; step 2, determining the original candidate extreme points of the audio data to be processed according to its spectral energy amplitudes; step 3, screening the original candidate extreme points of each frame of audio data based on influence coefficients between candidate extreme points and/or the density of candidate extreme points and/or difference calculation results between candidate extreme points, to obtain an extreme point list of the audio data to be processed; step 4, extracting fingerprint features of the audio data according to the extreme point list. The invention improves the noise robustness of the extracted audio features, so that the extracted features describe the audio data more accurately.

Description

Audio feature extraction method and device, and electronic equipment
Technical field
The present invention relates to the technical fields of speech signal processing and information retrieval, and in particular to an audio feature extraction method and device, and electronic equipment.
Background technology
With the explosion of information technology and the big data industry, massive amounts of audio and video are stored in digital form, and analyzing and processing this massive audio data is an important aspect of current artificial intelligence — for example, performing audio retrieval or original-soundtrack music retrieval after the audio data has been analyzed, or extracting the effective speech from the audio data and then performing speech recognition. In audio analysis, how accurately the extracted features describe the audio data directly determines the effectiveness of applications built on that data.
Existing audio feature extraction methods typically perform extreme point detection simply according to the energy of the audio data to obtain the extreme points, and then extract audio features at those extreme points, such as spectral features or fundamental frequency features; alternatively, they directly extract spectral or fundamental frequency features to describe the audio data. However, whether the features are extracted after determining the extreme points or extracted directly, the noise robustness of these methods is poor: when the audio data contains noise, it is difficult to extract features that describe the audio data accurately, which seriously affects the results of subsequent audio processing.
Summary of the invention
To overcome the above shortcomings of the prior art, the object of the present invention is to provide an audio feature extraction method and device, and electronic equipment, which extract audio features accurately and improve the noise robustness of the extracted features, so that the extracted audio features describe the audio data more accurately.
To this end, the technical scheme provided by the invention is as follows:
An audio feature extraction method, comprising the following steps:
step 1, obtaining audio data to be processed;
step 2, determining original candidate extreme points according to the spectral energy amplitudes of the audio data to be processed;
step 3, screening the original candidate extreme points of the audio data to be processed based on influence coefficients between candidate extreme points and/or the density of candidate extreme points and/or difference calculation results between candidate extreme points, to obtain an extreme point list of the audio data to be processed;
step 4, extracting fingerprint features of the audio data according to the extreme point list of the audio data to be processed.
Optionally, the step of screening based on influence coefficients between candidate extreme points further comprises:
selecting, in turn, each original candidate extreme point of the current frame of audio data as a candidate center extreme point, constructing a region centered on that candidate center extreme point, and obtaining all candidate extreme points within the region;
calculating the influence coefficient between the candidate center extreme point and each of the other candidate extreme points;
determining whether to retain the candidate center extreme point according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points.
Optionally, the step of determining whether to retain the candidate center extreme point according to the influence coefficients is specifically: if the frequency-domain amplitude of the candidate center extreme point is greater than or equal to the product of the frequency-domain amplitude of each non-center candidate extreme point in the region and the corresponding influence coefficient, retaining the candidate center extreme point.
Optionally, the step of screening based on the density of candidate extreme points further comprises:
selecting, in turn, each extreme point among the original candidate extreme points of each frame of audio data and/or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, as the current candidate extreme point, and calculating the density of the current candidate extreme point;
if the density of the current candidate extreme point exceeds a preset threshold, deleting the current candidate extreme point; otherwise retaining it.
Optionally, the step of screening based on difference calculation results between candidate extreme points further comprises:
performing a difference calculation on each candidate extreme point among the original candidate extreme points of each frame of audio data, and/or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, and/or among the candidate extreme points remaining after density-based screening, to obtain a difference spectrum value for each candidate extreme point;
determining the extreme point list of each frame of audio data according to the difference spectrum values of the candidate extreme points.
Optionally, the difference calculation step is specifically: performing a difference calculation on the candidate extreme points of the current frame of audio data according to the spectrum values of the candidate extreme points of one or more frames preceding the current frame and of one or more frames following the current frame, to obtain the difference spectrum value of each candidate extreme point of the current frame.
Optionally, step 4 further comprises:
constructing a candidate region based on each extreme point in the extreme point list, and determining the extreme point pairs of each extreme point;
extracting the fingerprint features of each frame of audio data according to the extreme point pairs in the extreme point list;
merging the fingerprint features of every frame of audio data to obtain the audio fingerprint features of each segment of audio data.
Optionally, the step of constructing a candidate region based on each extreme point in the extreme point list and determining the extreme point pairs of each extreme point specifically comprises:
selecting, in turn, each extreme point in the extreme point list as a fixed extreme point;
constructing the candidate region based on the fixed extreme point, and selecting extreme points within the candidate region to form extreme point pairs with the fixed extreme point.
To achieve the above purpose, the invention also provides an audio feature extraction device, comprising:
an audio data acquiring unit, configured to obtain audio data to be processed;
a candidate extreme point determining unit, configured to determine the original candidate extreme points of the audio data to be processed according to its spectral energy amplitudes;
a candidate extreme point screening unit, configured to screen the original candidate extreme points of the audio data based on influence coefficients between candidate extreme points and/or the density of candidate extreme points and/or difference calculation results between candidate extreme points, to obtain the extreme point list of the audio data to be processed;
an audio feature extraction unit, configured to extract the fingerprint features of the audio data according to the extreme point list of the audio data to be processed.
Optionally, a first screening unit of the candidate extreme point screening unit screens the original candidate extreme points of each frame of audio data based on the influence coefficients between candidate extreme points, and is specifically configured to:
select, in turn, each original candidate extreme point of the current frame of audio data as a candidate center extreme point, construct a region centered on that candidate center extreme point, and obtain all candidate extreme points within the region;
calculate the influence coefficient between the candidate center extreme point and each of the other candidate extreme points;
determine whether to retain the candidate center extreme point according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points.
Optionally, a second screening unit of the candidate extreme point screening unit screens the original candidate extreme points of the audio data based on the density of candidate extreme points, and is specifically configured to:
select, in turn, each extreme point among the original candidate extreme points of each frame of audio data or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, as the current candidate extreme point, and calculate the density of the current candidate extreme point;
if the density of the current candidate extreme point exceeds a preset threshold, delete that candidate extreme point; otherwise retain it.
Optionally, a third screening unit of the candidate extreme point screening unit screens the original candidate extreme points of the audio data based on the difference calculation results between candidate extreme points, and is specifically configured to:
perform a difference calculation on each candidate extreme point among the original candidate extreme points, or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, or among the candidate extreme points remaining after density-based screening, to obtain the difference spectrum value of each candidate extreme point;
determine the extreme point list of each frame of audio data according to the difference spectrum values of the candidate extreme points.
Optionally, the audio feature extraction unit further comprises:
an extreme point pair determining unit, which constructs a candidate region based on each extreme point in the extreme point list and determines the extreme point pairs of each extreme point;
a fingerprint feature extraction unit, configured to extract the fingerprint features of each frame of audio data according to the extreme point pairs in the extreme point list;
a merging unit, configured to merge the fingerprint features of every frame of audio data to obtain the audio fingerprint features of each segment of audio data.
The invention also provides electronic equipment, comprising:
a storage medium storing a plurality of instructions, the instructions being loaded by a processor to perform the steps of the above method; and
a processor, configured to execute the instructions in the storage medium.
Compared with the prior art, the beneficial effects of the audio feature extraction method and device and the electronic equipment of the invention are as follows:
The method receives audio data to be processed, determines candidate extreme points of the audio data according to its spectral energy amplitudes, and then screens the candidate extreme points based on the auditory masking effect, the density of the candidate extreme points, and the difference values of the candidate extreme points, obtaining the extreme point list of the audio data to be processed, so that the fingerprint features of the audio data can be extracted according to the extreme point list. By exploiting the auditory masking effect, the candidate extreme point density, and the difference values of the candidate extreme points, the invention effectively improves the noise robustness of the extracted audio features, enabling them to describe the audio data more accurately.
Brief description of the drawings
Fig. 1 is a flowchart of one embodiment of the audio feature extraction method of the invention;
Fig. 2 is a schematic diagram of the rectangular region of a candidate center extreme point in an embodiment of the invention;
Fig. 3 is a detailed flowchart of step 104 in an embodiment of the invention;
Fig. 4 is a schematic diagram of the extreme point pairs constructed for the fixed extreme point in Fig. 2;
Fig. 5 is a structural diagram of one embodiment of the audio feature extraction device of the invention;
Fig. 6 is a detailed structural diagram of the audio feature extraction unit in an embodiment of the invention;
Fig. 7 is a structural diagram of the electronic equipment used for the audio feature extraction method of the invention.
Detailed description of the embodiments
To illustrate the embodiments of the present invention and the technical schemes of the prior art more clearly, the embodiments of the invention are described below with reference to the accompanying drawings. Evidently, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings, and other embodiments, from them without creative effort.
For simplicity of presentation, only the parts relevant to the invention are shown schematically in the figures; they do not represent the actual structure of a product. In addition, to keep the figures simple and easy to understand, where several parts in a figure have the same structure or function, only one of them is drawn or labeled. Herein, "a" or "one" does not only mean "exactly one"; it can also mean "more than one".
In one embodiment of the invention, as shown in Fig. 1, the audio feature extraction method of the invention comprises the following steps:
Step 101: obtaining audio data to be processed.
The audio data to be processed may be speech data containing effective speech, pure-music audio data, or song data. It may be collected by a sound acquisition device of a smart device, such as a microphone; the smart device may be a mobile phone, a personal computer, a tablet computer, or the like. Of course, the audio data to be processed may also be pre-stored audio data or audio data transmitted by an external device; the invention is not limited in this respect.
Step 102: determining the original candidate extreme points of the audio data to be processed according to its spectral energy amplitudes.
Specifically, step 102 further comprises:
step a), transforming the audio data to be processed into the frequency domain to obtain the spectral energy amplitudes of the audio data; the specific time-to-frequency transform used by the invention is the same as in the prior art and is not described here;
step b), selecting, according to the spectral energy amplitudes of each frame of audio data, the points whose spectral energy amplitude exceeds a preset threshold as the original candidate extreme points of that frame.
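By way of illustration only — this is a minimal sketch, not the patented implementation — the following Python code realizes steps a) and b): it computes an STFT magnitude spectrogram and marks the time-frequency points whose spectral energy amplitude exceeds a threshold as original candidate extreme points. The frame length, hop size, window, and threshold are assumed values.

```python
import numpy as np

def original_candidate_extreme_points(audio, frame_len=1024, hop=512, threshold=0.1):
    """Return the spectrogram and the (frame, bin) candidate extreme points.

    A minimal sketch of steps a) and b); window, hop and threshold are illustrative.
    """
    window = np.hanning(frame_len)
    n_frames = max(0, (len(audio) - frame_len) // hop + 1)
    spectrum = np.empty((n_frames, frame_len // 2 + 1))
    candidates = []
    for t in range(n_frames):
        frame = audio[t * hop : t * hop + frame_len] * window
        spectrum[t] = np.abs(np.fft.rfft(frame))       # spectral energy amplitude
        for f in np.nonzero(spectrum[t] > threshold)[0]:
            candidates.append((t, int(f)))              # original candidate extreme point
    return spectrum, candidates
```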
Step 103: screening the original candidate extreme points of the audio data to be processed based on influence coefficients between candidate extreme points and/or the density of candidate extreme points and/or difference calculation results between candidate extreme points, to obtain the extreme point list of the audio data to be processed. That is, in step 103 the original candidate extreme points may be screened based on any one or more of: the influence coefficients between candidate extreme points, the density of candidate extreme points, and the difference calculation results between candidate extreme points.
As one example, in step 103 the candidate extreme points of each frame of audio data may first be screened based on the influence coefficients between candidate extreme points, and the resulting first candidate extreme point list of each frame taken as the extreme point list of the audio data to be processed.
In this embodiment, G(i, j) denotes the influence coefficient between the i-th candidate extreme point and the j-th candidate extreme point along the time and frequency dimensions. The influence coefficient is determined based on the auditory masking effect: in human perception of sound, spectral peak frequency points interact, and one frequency component may mask frequency components close to it.
The invention uses the influence coefficient to perform the first screening of the candidate extreme points. Specifically, the first screening step is as follows: each original candidate extreme point of the current frame of audio data is selected in turn as a candidate center extreme point, a region centered on that point is constructed, and all candidate extreme points within the region are obtained. For example, a candidate extreme point among the original candidate extreme points of the current frame is first selected as the candidate center extreme point, a rectangular region centered on that point is constructed on the spectrogram, and the candidate extreme points of every frame within the rectangular region are found. The horizontal axis of the spectrogram is time and the vertical axis is frequency; the shade of each candidate extreme point in the figure represents its amplitude. Fig. 2 is a schematic diagram of the rectangular region of a candidate center extreme point. The influence coefficient G(i, j) between the candidate center extreme point and each other candidate extreme point in the rectangular region is then calculated, as shown in formula (1):
In formula (1), i_t and j_t denote the time values of the i-th and j-th candidate extreme points respectively, i_f and j_f denote their frequency values, and l and w denote the length and width of the rectangular region of the center extreme point;
Whether to retain the candidate center extreme point is determined according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points. Specifically, if the product of the frequency-domain amplitude of each non-center candidate extreme point in the rectangular region and its influence coefficient is less than the frequency-domain amplitude of the center extreme point, the candidate center extreme point is retained, as shown in formula (2):
P(i) ≥ P(j) × G(i, j)    (2)
where P(i) is the frequency-domain amplitude of the center extreme point and P(j) is the frequency-domain amplitude of another, non-center extreme point in the rectangular region. It should be noted that if there are no other candidate extreme points in the rectangular region, the candidate center extreme point is retained directly.
For the current candidate center extreme point in the rectangular region of Fig. 2, for example, there are 8 other candidate extreme points besides the center point; formula (2) must be evaluated between the candidate center extreme point and each of these 8 candidate extreme points, and only if the condition of formula (2) holds in every case is the candidate center extreme point retained; otherwise it is deleted.
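Formula (1) appears in the original only as an image and is not reproduced here, so the sketch below substitutes an assumed distance-decaying coefficient, normalized by the region length l and width w, purely as a placeholder for G(i, j), and then applies the retention test of formula (2) to each candidate center extreme point. `amplitude` is the spectrogram array from the previous sketch.

```python
def influence_coefficient(i, j, l, w):
    """Placeholder for formula (1): an assumed distance-decaying masking coefficient.

    i and j are (time, frequency) coordinates; the exact formula in the patent
    is an image, so this decay shape is an assumption for illustration only.
    """
    it, if_ = i
    jt, jf = j
    return max(0.0, 1.0 - abs(it - jt) / l) * max(0.0, 1.0 - abs(if_ - jf) / w)

def masking_screen(candidates, amplitude, l=10, w=20):
    """Keep a candidate center extreme point only if formula (2) holds against
    every other candidate in its l-by-w rectangular region (l, w illustrative)."""
    kept = []
    for i in candidates:
        it, if_ = i
        neighbors = [j for j in candidates
                     if j != i and abs(j[0] - it) <= l / 2 and abs(j[1] - if_) <= w / 2]
        # An empty region means the center point is retained directly.
        if all(amplitude[i] >= amplitude[j] * influence_coefficient(i, j, l, w)
               for j in neighbors):
            kept.append(i)
    return kept
```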
As one example, after the candidate extreme points of each frame of audio data have been screened based on the influence coefficients between candidate extreme points, the first candidate extreme point list obtained from that screening can be screened again based on the density of the candidate extreme points, in order to filter out current noise (electrical hum), yielding a second candidate extreme point list for each frame as the extreme point list of the audio data to be processed.
Some audio contains, in certain frequency bands, extreme points that are continuous in time and consistently high in both energy and density, i.e., current noise. Current noise distorts audio matching, producing very high matching scores over short windows and misleading the matching result. Therefore, to prevent dense clusters of high spectral-energy points in certain frequency ranges of the audio, the density-based screening of candidate extreme points of the invention specifically comprises:
selecting, in turn, each extreme point in the first candidate extreme point list as the current candidate extreme point; moving forward and backward in time from the current extreme point by a fixed interval (for example, 5 s), and counting the total number of candidate extreme points within that window as the density of the current candidate extreme point;
if the density of the current candidate extreme point exceeds a preset threshold, deleting that candidate extreme point; otherwise retaining it.
By screening each extreme point in the first candidate extreme point list in turn in this way, the second candidate extreme point list is obtained.
Of course, the invention can also apply the density-based screening directly to the original candidate extreme points of each frame of audio data; the specific screening steps are the same as above and will not be repeated here.
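A sketch of this density screen, continuing the same assumptions: the fixed time window is expressed in frame indices (`window_frames` standing in for, e.g., 5 s at the chosen hop size), and the density threshold is illustrative. The text does not say whether the count is restricted to a frequency band, so this sketch counts all candidates in the time window.

```python
def density_screen(candidates, window_frames=250, max_density=30):
    """Delete candidate extreme points whose count of candidates within the
    fixed time window exceeds a preset threshold; both values are illustrative."""
    times = [t for t, _ in candidates]
    kept = []
    for (t, f) in candidates:
        density = sum(1 for s in times if abs(s - t) <= window_frames)
        if density <= max_density:
            kept.append((t, f))   # retain: not part of a dense (hum-like) cluster
    return kept
```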
As one example, in order to improve the noise robustness and adaptivity of the extreme points, the invention may further perform a difference calculation, in turn, on the extreme points in the second candidate extreme point list obtained from the density-based screening, so that the audio can still be matched after its overall energy has been scaled.
In the specific difference calculation, the candidate extreme points of the current frame of audio data are differenced against the spectrum values of the candidate extreme points of one or more frames before the current frame and of one or more frames after it, yielding the differenced spectrum value of each candidate extreme point of the current frame. The specific difference calculation is shown in formula (3):
ΔP(i) = |P(i) + P(i(t+1)) - P(i(t-1)) - P(i(t-2))|    (3)
where ΔP(i) is the value of candidate extreme point i of the current frame after the difference calculation, P(i(t+1)) is the spectrum value of the candidate extreme point in the same frequency band as candidate extreme point i in the following frame, and P(i(t-1)) and P(i(t-2)) are the spectrum values of the candidate extreme points in the same frequency band in the previous frame and the frame before that, respectively.
After the difference calculation on the extreme points in the second candidate extreme point list has finished, the difference spectrum value of each candidate extreme point in the list is obtained. For each frame of audio data, the candidate extreme points whose difference spectrum value exceeds a preset threshold are selected as the extreme points of that frame; alternatively, the difference spectrum values of the candidate extreme points in the second candidate extreme point list can be sorted, and the top N candidate extreme points by difference spectrum value selected as the extreme points of each frame, thereby obtaining the extreme point list of each frame of audio data.
Of course, the invention can also perform the difference calculation on the original candidate extreme points, or on the first candidate extreme point list obtained from the screening based on the influence coefficients between candidate extreme points, to obtain the extreme point list of each frame of audio data; the specific difference calculation steps are the same as above and will not be repeated here.
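The following sketch transcribes formula (3) and the top-N selection variant described above into the running example; N (`top_n`) is an illustrative value, and `spectrum` is again the spectrogram array from the first sketch.

```python
def difference_screen(candidates, spectrum, top_n=5):
    """Apply formula (3) per candidate extreme point, then keep the top-N
    candidates per frame by difference spectrum value (N is illustrative)."""
    n_frames = spectrum.shape[0]
    by_frame = {}
    for (t, f) in candidates:
        if 2 <= t <= n_frames - 2:                       # need frames t-2 .. t+1
            dp = abs(spectrum[t, f] + spectrum[t + 1, f]
                     - spectrum[t - 1, f] - spectrum[t - 2, f])
            by_frame.setdefault(t, []).append(((t, f), dp))
    kept = []
    for t, scored in by_frame.items():
        scored.sort(key=lambda x: x[1], reverse=True)    # largest difference first
        kept.extend(point for point, _ in scored[:top_n])
    return kept
```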
Step 104: extracting the fingerprint features of the audio data according to the extreme point list of the audio data to be processed.
Specifically, as shown in Fig. 3, step 104 further comprises:
Step S31: constructing a candidate region based on each extreme point in the extreme point list and determining the extreme point pairs of each extreme point. Specifically, in step S31, each extreme point in the extreme point list is selected in turn as the current fixed extreme point; based on this fixed extreme point, a candidate region is constructed within a fixed frequency band and time range, and the g extreme points whose spectral energy exceeds a preset threshold are each paired with the fixed extreme point. For example, the candidate region in Fig. 2 contains 8 candidate extreme points, of which only the 5 points with the larger spectral energy are paired with the fixed extreme point; Fig. 4 is a schematic diagram of the pairs constructed for the fixed extreme point of Fig. 2. Applying this process to each extreme point in the extreme point list yields the extreme point pairs formed by each extreme point in the extreme point list of every frame of audio data;
Step S32: extracting the fingerprint features of each frame of audio data according to the extreme point pairs in the extreme point list. In the specific extraction, each extreme point in the extreme point list is selected in turn as the current extreme point, and the fingerprint feature F of the current frame of audio data is extracted from the current extreme point and the extreme point pairs it forms, namely: the time information t of the frame containing the current extreme point; the frequency-domain spectrum value f of the current extreme point; the differences Δt between the time information of the frame containing the current extreme point and of the frames containing each paired extreme point; the differences Δf between the frequency-domain spectrum values of the current extreme point and of each paired extreme point; and the unique identifier audioID of the current frame of audio data. The fingerprint feature of the current frame of audio data is thus expressed as F = {t, f, Δt, Δf, audioID};
Step S33: combining the audio fingerprint features of every frame contained in each segment of audio data to obtain the audio fingerprint features of that segment.
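Finally, a sketch of steps S31 and S32 under the same assumptions: the extent of the candidate region in time (`t_span`) and frequency (`f_span`) and the pair count g are illustrative parameters, and each pair is emitted as the tuple F = (t, f, Δt, Δf, audioID) described above. Concatenating the per-frame results over a segment corresponds to step S33.

```python
def extract_fingerprints(extreme_points, spectrum, audio_id,
                         g=5, t_span=40, f_span=60):
    """Pair each fixed extreme point with up to g high-energy extreme points in a
    fixed time/frequency window, emitting (t, f, dt, df, audioID) fingerprints."""
    fingerprints = []
    for (t, f) in extreme_points:
        # Candidate region: a fixed band and time range ahead of the fixed point.
        region = [(s, k) for (s, k) in extreme_points
                  if 0 < s - t <= t_span and abs(k - f) <= f_span]
        region.sort(key=lambda p: spectrum[p], reverse=True)   # highest energy first
        for (s, k) in region[:g]:                              # keep the g strongest
            fingerprints.append((t, f, s - t, k - f, audio_id))
    return fingerprints
```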
In one embodiment of the invention, as shown in Fig. 5, the audio feature extraction device of the invention comprises: an audio data acquiring unit 51, a candidate extreme point determining unit 52, an extreme point list determining unit 53, and an audio feature extraction unit 54.
The audio data acquiring unit 51 is configured to obtain the audio data to be processed. The audio data to be processed may be speech data containing effective speech, pure-music audio data, or song data.
The candidate extreme point determining unit 52 is configured to determine the original candidate extreme points of the audio data to be processed according to its spectral energy amplitudes.
The extreme point list determining unit 53 screens the original candidate extreme points of each frame of audio data based on the influence coefficients between candidate extreme points and/or the density of candidate extreme points and/or the difference calculation results between candidate extreme points, to obtain the extreme point list of the audio data to be processed.
Specifically, the extreme point list determining unit 53 further comprises:
a first screening unit, configured to screen the original candidate extreme points of each frame of audio data based on the influence coefficients between candidate extreme points; and/or
a second screening unit, configured to screen, based on the density of candidate extreme points, the original candidate extreme points of each frame of audio data or the candidate extreme points remaining after screening by the first screening unit; and/or
a third screening unit, configured to screen the original candidate extreme points, or the candidate extreme points remaining after screening by the first screening unit, or the candidate extreme points remaining after screening by the second screening unit.
The first screening unit is specifically configured to:
select, in turn, each original candidate extreme point of the current frame of audio data as a candidate center extreme point, construct a region centered on that candidate center extreme point, and obtain all candidate extreme points within the region; specifically, a candidate extreme point among the original candidate extreme points of the current frame is selected as the candidate center extreme point, a rectangular region centered on that point is constructed on the spectrogram, and the candidate extreme points of every frame within the rectangular region are found;
calculate the influence coefficient between the candidate center extreme point and each of the other candidate extreme points;
determine whether to retain the candidate center extreme point according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points; specifically, if the product of the frequency-domain amplitude of each non-center candidate extreme point in the rectangular region and its influence coefficient is less than the frequency-domain amplitude of the center extreme point, the candidate center extreme point is retained.
In the invention, the second screening unit screens, based on the density of candidate extreme points, the original candidate extreme points of each frame of audio data or the candidate extreme points remaining after screening by the first screening unit, in order to filter out current noise. The second screening unit is specifically configured to:
select, in turn, each extreme point among the original candidate extreme points of each frame of audio data or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, as the current candidate extreme point; move forward and backward in time from the current extreme point by a fixed interval (for example, 5 s), and count the total number of candidate extreme points within that window as the density of the current candidate extreme point;
if the density of the current candidate extreme point exceeds a preset threshold, delete that candidate extreme point; otherwise retain it.
In the invention, the third screening unit is configured to determine the extreme point list of each frame of audio data after performing the difference calculation on the original candidate extreme points, or on the candidate extreme points remaining after screening by the first screening unit, or on the candidate extreme points remaining after screening by the second screening unit. The third screening unit is specifically configured to:
perform a difference calculation on each candidate extreme point among the original candidate extreme points, or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, or among the candidate extreme points remaining after density-based screening, to obtain the difference spectrum value of each candidate extreme point; the specific difference calculation differences the candidate extreme points of the current frame of audio data against the spectrum values of the candidate extreme points of one or more frames before the current frame and of one or more frames after it, yielding the differenced spectrum value of each candidate extreme point of the current frame;
for each frame of audio data, select the candidate extreme points whose difference spectrum value exceeds a threshold as the extreme points of that frame, or sort the difference spectrum values of the candidate extreme points and select the top N candidate extreme points by difference spectrum value as the extreme points of each frame, thereby obtaining the extreme point list of each frame of audio data.
The audio feature extraction unit 54 is configured to extract the fingerprint features of the audio data according to the extreme point list of the audio data to be processed.
Specifically, as shown in Fig. 6, the audio feature extraction unit 54 further comprises:
an extreme point pair determining unit 541, which constructs a candidate region based on each extreme point in the extreme point list and determines the extreme point pairs of each extreme point; specifically, the extreme point pair determining unit 541 selects, in turn, each extreme point in the extreme point list as the current fixed extreme point, constructs a candidate region within a fixed frequency band and time range based on the current fixed extreme point, and pairs each of the g extreme points whose spectral energy exceeds a preset threshold with the fixed extreme point;
a fingerprint feature extraction unit 542, configured to extract the fingerprint features of each frame of audio data according to the extreme point pairs in the extreme point list;
a merging unit 543, configured to merge the fingerprint features of every frame of audio data to obtain the audio fingerprint features of each segment of audio data.
Referring to Fig. 7, a structural diagram of the electronic equipment 300 used for the audio feature extraction method of the invention is shown. The electronic equipment 300 comprises a processing component 301, which further comprises one or more processors, and storage resources represented by a storage medium 302, for storing instructions executable by the processing component 301, such as application programs. The application programs stored in the storage medium 302 may comprise one or more modules, each corresponding to a group of instructions. In addition, the processing component 301 is configured to execute the instructions so as to perform the steps of the above audio feature extraction method.
The electronic equipment 300 may also comprise a power supply component 303, configured to perform power management of the electronic equipment 300; a wired or wireless network interface 304, configured to connect the electronic equipment 300 to a network; and an input/output (I/O) interface 305. The electronic equipment 300 may operate based on an operating system stored in the storage medium 302, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or similar.
In summary, the audio feature extraction method and device and the electronic equipment of the invention receive audio data to be processed, determine candidate extreme points of the audio data according to its spectral energy amplitudes, and then screen the candidate extreme points of the audio data based on the influence coefficients between candidate extreme points (derived from the auditory masking effect), the density of the candidate extreme points, and the difference values of the candidate extreme points, obtaining the extreme point list of the audio data to be processed, so that the fingerprint features of the audio data can be extracted according to the extreme point list. By using the influence coefficients between candidate extreme points, the candidate extreme point density, and the difference values of the candidate extreme points, the invention effectively improves the noise robustness of the extracted audio features, enabling them to describe the audio data more accurately.
It should be noted that the above embodiments can be freely combined as needed. The above are only preferred embodiments of the invention; it should be pointed out that, for those of ordinary skill in the art, improvements and modifications can be made without departing from the principles of the invention, and such improvements and modifications shall also fall within the protection scope of the invention.

Claims (14)

1. An audio feature extraction method, comprising the following steps:
step 1, obtaining audio data to be processed;
step 2, determining original candidate extreme points according to the spectral energy amplitudes of the audio data to be processed;
step 3, screening the original candidate extreme points of the audio data to be processed based on influence coefficients between candidate extreme points and/or the density of candidate extreme points and/or difference calculation results between candidate extreme points, to obtain an extreme point list of the audio data to be processed;
step 4, extracting fingerprint features of the audio data according to the extreme point list of the audio data to be processed.
2. The audio feature extraction method of claim 1, wherein the step of screening based on influence coefficients between candidate extreme points further comprises:
selecting, in turn, each original candidate extreme point of the current frame of audio data as a candidate center extreme point, constructing a region centered on that candidate center extreme point, and obtaining all candidate extreme points within the region;
calculating the influence coefficient between the candidate center extreme point and each of the other candidate extreme points;
determining whether to retain the candidate center extreme point according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points.
3. The audio feature extraction method of claim 2, wherein the step of determining whether to retain the candidate center extreme point according to the influence coefficients is specifically: if the frequency-domain amplitude of the candidate center extreme point is greater than or equal to the product of the frequency-domain amplitude of each non-center candidate extreme point in the region and the corresponding influence coefficient, retaining the candidate center extreme point.
4. The audio feature extraction method of claim 1, wherein the step of screening based on the density of candidate extreme points further comprises:
selecting, in turn, each extreme point among the original candidate extreme points of each frame of audio data and/or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, as the current candidate extreme point, and calculating the density of the current candidate extreme point;
if the density of the current candidate extreme point exceeds a preset threshold, deleting the current candidate extreme point; otherwise retaining it.
5. The audio feature extraction method of claim 4, wherein the step of screening based on difference calculation results between candidate extreme points further comprises:
performing a difference calculation on each candidate extreme point among the original candidate extreme points of each frame of audio data, and/or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, and/or among the candidate extreme points remaining after density-based screening, to obtain a difference spectrum value for each candidate extreme point;
determining the extreme point list of each frame of audio data according to the difference spectrum values of the candidate extreme points.
6. The audio feature extraction method of claim 5, wherein the difference calculation step is specifically: performing a difference calculation on the candidate extreme points of the current frame of audio data according to the spectrum values of the candidate extreme points of one or more frames preceding the current frame and of one or more frames following the current frame, to obtain the difference spectrum value of each candidate extreme point of the current frame.
7. The audio feature extraction method of claim 1, wherein step 4 further comprises:
constructing a candidate region based on each extreme point in the extreme point list, and determining the extreme point pairs of each extreme point;
extracting the fingerprint features of each frame of audio data according to the extreme point pairs in the extreme point list;
merging the fingerprint features of every frame of audio data to obtain the audio fingerprint features of each segment of audio data.
8. The audio feature extraction method of claim 7, wherein the step of constructing a candidate region based on each extreme point in the extreme point list and determining the extreme point pairs of each extreme point specifically comprises:
selecting, in turn, each extreme point in the extreme point list as a fixed extreme point;
constructing the candidate region based on the fixed extreme point, and selecting extreme points within the candidate region to form extreme point pairs with the fixed extreme point.
9. An audio feature extraction device, comprising:
an audio data acquiring unit, configured to obtain audio data to be processed;
a candidate extreme point determining unit, configured to determine the original candidate extreme points of the audio data to be processed according to its spectral energy amplitudes;
a candidate extreme point screening unit, configured to screen the original candidate extreme points of the audio data based on influence coefficients between candidate extreme points and/or the density of candidate extreme points and/or difference calculation results between candidate extreme points, to obtain the extreme point list of the audio data to be processed;
an audio feature extraction unit, configured to extract the fingerprint features of the audio data according to the extreme point list of the audio data to be processed.
10. The audio feature extraction device of claim 9, wherein a first screening unit of the candidate extreme point screening unit screens the original candidate extreme points of each frame of audio data based on the influence coefficients between candidate extreme points, and is specifically configured to:
select, in turn, each original candidate extreme point of the current frame of audio data as a candidate center extreme point, construct a region centered on that candidate center extreme point, and obtain all candidate extreme points within the region;
calculate the influence coefficient between the candidate center extreme point and each of the other candidate extreme points;
determine whether to retain the candidate center extreme point according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points.
11. The audio feature extraction device of claim 9, wherein a second screening unit of the candidate extreme point screening unit screens the original candidate extreme points of the audio data based on the density of candidate extreme points, and is specifically configured to:
select, in turn, each extreme point among the original candidate extreme points of each frame of audio data or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, as the current candidate extreme point, and calculate the density of the current candidate extreme point;
if the density of the current candidate extreme point exceeds a preset threshold, delete that candidate extreme point; otherwise retain it.
12. The audio feature extraction device of claim 11, wherein a third screening unit of the candidate extreme point screening unit screens the original candidate extreme points of the audio data based on the difference calculation results between candidate extreme points, and is specifically configured to:
perform a difference calculation on each candidate extreme point among the original candidate extreme points, or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, or among the candidate extreme points remaining after density-based screening, to obtain the difference spectrum value of each candidate extreme point;
determine the extreme point list of each frame of audio data according to the difference spectrum values of the candidate extreme points.
13. The audio feature extraction device of claim 9, wherein the audio feature extraction unit further comprises:
an extreme point pair determining unit, which constructs a candidate region based on each extreme point in the extreme point list and determines the extreme point pairs of each extreme point;
a fingerprint feature extraction unit, configured to extract the fingerprint features of each frame of audio data according to the extreme point pairs in the extreme point list;
a merging unit, configured to merge the fingerprint features of every frame of audio data to obtain the audio fingerprint features of each segment of audio data.
14. Electronic equipment, comprising:
a storage medium storing a plurality of instructions, the instructions being loaded by a processor to perform the steps of the method of any one of claims 1 to 8; and
a processor, configured to execute the instructions in the storage medium.
CN201710803397.1A 2017-09-08 2017-09-08 Audio feature extraction method and device and electronic equipment Active CN107622773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710803397.1A CN107622773B (en) 2017-09-08 2017-09-08 Audio feature extraction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710803397.1A CN107622773B (en) 2017-09-08 2017-09-08 Audio feature extraction method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN107622773A (en) 2018-01-23
CN107622773B CN107622773B (en) 2021-04-06

Family

ID=61088507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710803397.1A Active CN107622773B (en) 2017-09-08 2017-09-08 Audio feature extraction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN107622773B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080091366A1 (en) * 2004-06-24 2008-04-17 Avery Wang Method of Characterizing the Overlap of Two Media Segments
EP1819197A2 (en) * 2006-02-14 2007-08-15 STMicroelectronics Asia Pacific Pte Ltd. Digital audio signal processing method and system for generating and controlling digital reverberations for audio signals
CN102214218A (en) * 2011-06-07 2011-10-12 盛乐信息技术(上海)有限公司 System and method for retrieving contents of audio/video
US20160335347 * 2015-05-11 2016-11-17 Alibaba Group Holding Limited Audio information retrieval method and device

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019184518A1 (en) * 2018-03-29 2019-10-03 北京字节跳动网络技术有限公司 Audio retrieval and identification method and device
WO2019184517A1 (en) * 2018-03-29 2019-10-03 北京字节跳动网络技术有限公司 Audio fingerprint extraction method and device
US10950255B2 (en) 2018-03-29 2021-03-16 Beijing Bytedance Network Technology Co., Ltd. Audio fingerprint extraction method and device
US11182426B2 (en) 2018-03-29 2021-11-23 Beijing Bytedance Network Technology Co., Ltd. Audio retrieval and identification method and device
CN109658939A * 2019-01-26 2019-04-19 北京灵伴即时智能科技有限公司 Method for identifying reason of call record non-connection
CN109658939B (en) * 2019-01-26 2020-12-01 北京灵伴即时智能科技有限公司 Method for identifying reason of call record non-connection
CN111522991A (en) * 2020-04-15 2020-08-11 厦门快商通科技股份有限公司 Audio fingerprint extraction method, device and equipment
CN112037815A (en) * 2020-08-28 2020-12-04 中移(杭州)信息技术有限公司 Audio fingerprint extraction method, server and storage medium
CN114640926A (en) * 2022-03-31 2022-06-17 歌尔股份有限公司 Current sound detection method, device, equipment and computer readable storage medium
CN114640926B (en) * 2022-03-31 2023-11-17 歌尔股份有限公司 Current sound detection method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN107622773B (en) 2021-04-06

Similar Documents

Publication Publication Date Title
CN107622773A (en) A kind of audio feature extraction methods and device, electronic equipment
CN111210021B (en) Audio signal processing method, model training method and related device
CN103503060B (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
WO2021073116A1 (en) Method and apparatus for generating legal document, device and storage medium
CN112863547A (en) Virtual resource transfer processing method, device, storage medium and computer equipment
CN106503184B (en) Determine the method and device of the affiliated class of service of target text
CN109065051B (en) Voice recognition processing method and device
CN105006230A (en) Voice sensitive information detecting and filtering method based on unspecified people
CN1013525B (en) Real-time phonetic recognition method and device with or without function of identifying a person
CN111462758A (en) Method, device and equipment for intelligent conference role classification and storage medium
CN110444190A (en) Method of speech processing, device, terminal device and storage medium
CN108021635A (en) The definite method, apparatus and storage medium of a kind of audio similarity
Sun et al. Dynamic time warping for speech recognition with training part to reduce the computation
CN108628813A Processing method and apparatus, and device for processing
CN107564526A (en) Processing method, device and machine readable media
CN111710332B (en) Voice processing method, device, electronic equipment and storage medium
CN110473563A (en) Breathing detection method, system, equipment and medium based on time-frequency characteristics
CN110931019B (en) Public security voice data acquisition method, device, equipment and computer storage medium
CN109841221A (en) Parameter adjusting method, device and body-building equipment based on speech recognition
CN110728993A (en) Voice change identification method and electronic equipment
CN112466328B (en) Breath sound detection method and device and electronic equipment
CN106297795B (en) Audio recognition method and device
CN111968651A (en) WT (WT) -based voiceprint recognition method and system
CN106340310B (en) Speech detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant