CN107622773A - Audio feature extraction method and device, and electronic equipment - Google Patents

Audio feature extraction method and device, and electronic equipment

Info

Publication number
CN107622773A
Authority
CN
China
Prior art keywords
candidate extreme point
extreme point
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710803397.1A
Other languages
Chinese (zh)
Other versions
CN107622773B (en)
Inventor
李永超
方昕
刘俊华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201710803397.1A (granted as CN107622773B)
Publication of CN107622773A
Application granted
Publication of CN107622773B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an audio feature extraction method and device, and electronic equipment. The method comprises the following steps: step 1, obtaining audio data to be processed; step 2, determining the original candidate extreme points of the audio data to be processed according to its spectral energy amplitudes; step 3, screening the original candidate extreme points of each frame of audio data based on influence coefficients between candidate extreme points and/or the density of candidate extreme points and/or difference calculation results between candidate extreme points, to obtain an extreme point list of the audio data to be processed; step 4, extracting fingerprint features of the audio data according to the extreme point list. The invention improves the noise robustness of the extracted audio features, so that the extracted features describe the audio data more accurately.

Description

Audio feature extraction method and device, and electronic equipment
Technical field
The present invention relates to the technical fields of speech signal processing and information retrieval, and in particular to an audio feature extraction method and device, and electronic equipment.
Background technology
With the explosion of information technology and the big data industry, massive amounts of audio and video are stored in digital form, and analyzing and processing this massive audio data is an important aspect of current artificial intelligence — for example, performing audio retrieval or original-soundtrack music retrieval after the audio data has been analyzed, or extracting the effective speech from the audio data and then performing speech recognition. In audio analysis, how accurately the extracted features describe the audio data directly determines the effectiveness of applications built on that data.
Existing audio feature extraction methods typically perform extreme point detection simply according to the energy of the audio data to obtain the extreme points, and then extract audio features at those extreme points, such as spectral features or fundamental frequency features; alternatively, they directly extract spectral or fundamental frequency features to describe the audio data. However, whether the features are extracted after determining the extreme points or extracted directly, the noise robustness of these methods is poor: when the audio data contains noise, it is difficult to extract features that describe the audio data accurately, which seriously affects the results of subsequent audio processing.
Summary of the invention
To overcome the above shortcomings of the prior art, the object of the present invention is to provide an audio feature extraction method and device, and electronic equipment, which extract audio features accurately and improve the noise robustness of the extracted features, so that the extracted audio features describe the audio data more accurately.
To this end, the technical scheme provided by the invention is as follows:
An audio feature extraction method, comprising the following steps:
step 1, obtaining audio data to be processed;
step 2, determining original candidate extreme points according to the spectral energy amplitudes of the audio data to be processed;
step 3, screening the original candidate extreme points of the audio data to be processed based on influence coefficients between candidate extreme points and/or the density of candidate extreme points and/or difference calculation results between candidate extreme points, to obtain an extreme point list of the audio data to be processed;
step 4, extracting fingerprint features of the audio data according to the extreme point list of the audio data to be processed.
Optionally, the step of screening based on influence coefficients between candidate extreme points further comprises:
selecting, in turn, each original candidate extreme point of the current frame of audio data as a candidate center extreme point, constructing a region centered on that candidate center extreme point, and obtaining all candidate extreme points within the region;
calculating the influence coefficient between the candidate center extreme point and each of the other candidate extreme points;
determining whether to retain the candidate center extreme point according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points.
Optionally, the step of determining whether to retain the candidate center extreme point according to the influence coefficients is specifically: if the frequency-domain amplitude of the candidate center extreme point is greater than or equal to the product of the frequency-domain amplitude of each non-center candidate extreme point in the region and the corresponding influence coefficient, retaining the candidate center extreme point.
Optionally, the step of screening based on the density of candidate extreme points further comprises:
selecting, in turn, each extreme point among the original candidate extreme points of each frame of audio data and/or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, as the current candidate extreme point, and calculating the density of the current candidate extreme point;
if the density of the current candidate extreme point exceeds a preset threshold, deleting the current candidate extreme point; otherwise retaining it.
Optionally, the step of screening based on difference calculation results between candidate extreme points further comprises:
performing a difference calculation on each candidate extreme point among the original candidate extreme points of each frame of audio data, and/or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, and/or among the candidate extreme points remaining after density-based screening, to obtain a difference spectrum value for each candidate extreme point;
determining the extreme point list of each frame of audio data according to the difference spectrum values of the candidate extreme points.
Optionally, the difference calculation step is specifically: performing a difference calculation on the candidate extreme points of the current frame of audio data according to the spectrum values of the candidate extreme points of one or more frames preceding the current frame and of one or more frames following the current frame, to obtain the difference spectrum value of each candidate extreme point of the current frame.
Optionally, step 4 further comprises:
constructing a candidate region based on each extreme point in the extreme point list, and determining the extreme point pairs of each extreme point;
extracting the fingerprint features of each frame of audio data according to the extreme point pairs in the extreme point list;
merging the fingerprint features of every frame of audio data to obtain the audio fingerprint features of each segment of audio data.
Optionally, the step of constructing a candidate region based on each extreme point in the extreme point list and determining the extreme point pairs of each extreme point specifically comprises:
selecting, in turn, each extreme point in the extreme point list as a fixed extreme point;
constructing the candidate region based on the fixed extreme point, and selecting extreme points within the candidate region to form extreme point pairs with the fixed extreme point.
To achieve the above purpose, the invention also provides an audio feature extraction device, comprising:
an audio data acquiring unit, configured to obtain audio data to be processed;
a candidate extreme point determining unit, configured to determine the original candidate extreme points of the audio data to be processed according to its spectral energy amplitudes;
a candidate extreme point screening unit, configured to screen the original candidate extreme points of the audio data based on influence coefficients between candidate extreme points and/or the density of candidate extreme points and/or difference calculation results between candidate extreme points, to obtain the extreme point list of the audio data to be processed;
an audio feature extraction unit, configured to extract the fingerprint features of the audio data according to the extreme point list of the audio data to be processed.
Optionally, a first screening unit of the candidate extreme point screening unit screens the original candidate extreme points of each frame of audio data based on the influence coefficients between candidate extreme points, and is specifically configured to:
select, in turn, each original candidate extreme point of the current frame of audio data as a candidate center extreme point, construct a region centered on that candidate center extreme point, and obtain all candidate extreme points within the region;
calculate the influence coefficient between the candidate center extreme point and each of the other candidate extreme points;
determine whether to retain the candidate center extreme point according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points.
Optionally, a second screening unit of the candidate extreme point screening unit screens the original candidate extreme points of the audio data based on the density of candidate extreme points, and is specifically configured to:
select, in turn, each extreme point among the original candidate extreme points of each frame of audio data or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, as the current candidate extreme point, and calculate the density of the current candidate extreme point;
if the density of the current candidate extreme point exceeds a preset threshold, delete that candidate extreme point; otherwise retain it.
Optionally, a third screening unit of the candidate extreme point screening unit screens the original candidate extreme points of the audio data based on the difference calculation results between candidate extreme points, and is specifically configured to:
perform a difference calculation on each candidate extreme point among the original candidate extreme points, or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, or among the candidate extreme points remaining after density-based screening, to obtain the difference spectrum value of each candidate extreme point;
determine the extreme point list of each frame of audio data according to the difference spectrum values of the candidate extreme points.
Optionally, the audio feature extraction unit further comprises:
an extreme point pair determining unit, which constructs a candidate region based on each extreme point in the extreme point list and determines the extreme point pairs of each extreme point;
a fingerprint feature extraction unit, configured to extract the fingerprint features of each frame of audio data according to the extreme point pairs in the extreme point list;
a merging unit, configured to merge the fingerprint features of every frame of audio data to obtain the audio fingerprint features of each segment of audio data.
The invention also provides electronic equipment, comprising:
a storage medium storing a plurality of instructions, the instructions being loaded by a processor to perform the steps of the above method; and
a processor, configured to execute the instructions in the storage medium.
Compared with the prior art, the beneficial effects of the audio feature extraction method and device and the electronic equipment of the invention are as follows:
The method receives audio data to be processed, determines candidate extreme points of the audio data according to its spectral energy amplitudes, and then screens the candidate extreme points based on the auditory masking effect, the density of the candidate extreme points, and the difference values of the candidate extreme points, obtaining the extreme point list of the audio data to be processed, so that the fingerprint features of the audio data can be extracted according to the extreme point list. By exploiting the auditory masking effect, the candidate extreme point density, and the difference values of the candidate extreme points, the invention effectively improves the noise robustness of the extracted audio features, enabling them to describe the audio data more accurately.
Brief description of the drawings
Fig. 1 is a flowchart of one embodiment of the audio feature extraction method of the invention;
Fig. 2 is a schematic diagram of the rectangular region of a candidate center extreme point in an embodiment of the invention;
Fig. 3 is a detailed flowchart of step 104 in an embodiment of the invention;
Fig. 4 is a schematic diagram of the extreme point pairs constructed for the fixed extreme point in Fig. 2;
Fig. 5 is a structural diagram of one embodiment of the audio feature extraction device of the invention;
Fig. 6 is a detailed structural diagram of the audio feature extraction unit in an embodiment of the invention;
Fig. 7 is a structural diagram of the electronic equipment used for the audio feature extraction method of the invention.
Detailed description of the embodiments
To illustrate the embodiments of the present invention and the technical schemes of the prior art more clearly, the embodiments of the invention are described below with reference to the accompanying drawings. Evidently, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can derive other drawings, and other embodiments, from them without creative effort.
For simplicity of presentation, only the parts relevant to the invention are shown schematically in the figures; they do not represent the actual structure of a product. In addition, to keep the figures simple and easy to understand, where several parts in a figure have the same structure or function, only one of them is drawn or labeled. Herein, "a" or "one" does not only mean "exactly one"; it can also mean "more than one".
In one embodiment of the invention, as shown in Fig. 1, the audio feature extraction method of the invention comprises the following steps:
Step 101: obtaining audio data to be processed.
The audio data to be processed may be speech data containing effective speech, pure-music audio data, or song data. It may be collected by a sound acquisition device of a smart device, such as a microphone; the smart device may be a mobile phone, a personal computer, a tablet computer, or the like. Of course, the audio data to be processed may also be pre-stored audio data or audio data transmitted by an external device; the invention is not limited in this respect.
Step 102: determining the original candidate extreme points of the audio data to be processed according to its spectral energy amplitudes.
Specifically, step 102 further comprises:
step a), transforming the audio data to be processed into the frequency domain to obtain the spectral energy amplitudes of the audio data; the specific time-to-frequency transform used by the invention is the same as in the prior art and is not described here;
step b), selecting, according to the spectral energy amplitudes of each frame of audio data, the points whose spectral energy amplitude exceeds a preset threshold as the original candidate extreme points of that frame.
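By way of illustration only — this is a minimal sketch, not the patented implementation — the following Python code realizes steps a) and b): it computes an STFT magnitude spectrogram and marks the time-frequency points whose spectral energy amplitude exceeds a threshold as original candidate extreme points. The frame length, hop size, window, and threshold are assumed values.

```python
import numpy as np

def original_candidate_extreme_points(audio, frame_len=1024, hop=512, threshold=0.1):
    """Return the spectrogram and the (frame, bin) candidate extreme points.

    A minimal sketch of steps a) and b); window, hop and threshold are illustrative.
    """
    window = np.hanning(frame_len)
    n_frames = max(0, (len(audio) - frame_len) // hop + 1)
    spectrum = np.empty((n_frames, frame_len // 2 + 1))
    candidates = []
    for t in range(n_frames):
        frame = audio[t * hop : t * hop + frame_len] * window
        spectrum[t] = np.abs(np.fft.rfft(frame))       # spectral energy amplitude
        for f in np.nonzero(spectrum[t] > threshold)[0]:
            candidates.append((t, int(f)))              # original candidate extreme point
    return spectrum, candidates
```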
Step 103: screening the original candidate extreme points of the audio data to be processed based on influence coefficients between candidate extreme points and/or the density of candidate extreme points and/or difference calculation results between candidate extreme points, to obtain the extreme point list of the audio data to be processed. That is, in step 103 the original candidate extreme points may be screened based on any one or more of: the influence coefficients between candidate extreme points, the density of candidate extreme points, and the difference calculation results between candidate extreme points.
As one example, in step 103 the candidate extreme points of each frame of audio data may first be screened based on the influence coefficients between candidate extreme points, and the resulting first candidate extreme point list of each frame taken as the extreme point list of the audio data to be processed.
In this embodiment, G(i, j) denotes the influence coefficient between the i-th candidate extreme point and the j-th candidate extreme point along the time and frequency dimensions. The influence coefficient is determined based on the auditory masking effect: in human perception of sound, spectral peak frequency points interact, and one frequency component may mask frequency components close to it.
The invention uses the influence coefficient to perform the first screening of the candidate extreme points. Specifically, the first screening step is as follows: each original candidate extreme point of the current frame of audio data is selected in turn as a candidate center extreme point, a region centered on that point is constructed, and all candidate extreme points within the region are obtained. For example, a candidate extreme point among the original candidate extreme points of the current frame is first selected as the candidate center extreme point, a rectangular region centered on that point is constructed on the spectrogram, and the candidate extreme points of every frame within the rectangular region are found. The horizontal axis of the spectrogram is time and the vertical axis is frequency; the shade of each candidate extreme point in the figure represents its amplitude. Fig. 2 is a schematic diagram of the rectangular region of a candidate center extreme point. The influence coefficient G(i, j) between the candidate center extreme point and each other candidate extreme point in the rectangular region is then calculated, as shown in formula (1):
In formula (1), i_t and j_t denote the time values of the i-th and j-th candidate extreme points respectively, i_f and j_f denote their frequency values, and l and w denote the length and width of the rectangular region of the center extreme point;
Whether to retain the candidate center extreme point is determined according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points. Specifically, if the product of the frequency-domain amplitude of each non-center candidate extreme point in the rectangular region and its influence coefficient is less than the frequency-domain amplitude of the center extreme point, the candidate center extreme point is retained, as shown in formula (2):
P(i) ≥ P(j) × G(i, j)    (2)
where P(i) is the frequency-domain amplitude of the center extreme point and P(j) is the frequency-domain amplitude of another, non-center extreme point in the rectangular region. It should be noted that if there are no other candidate extreme points in the rectangular region, the candidate center extreme point is retained directly.
For the current candidate center extreme point in the rectangular region of Fig. 2, for example, there are 8 other candidate extreme points besides the center point; formula (2) must be evaluated between the candidate center extreme point and each of these 8 candidate extreme points, and only if the condition of formula (2) holds in every case is the candidate center extreme point retained; otherwise it is deleted.
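Formula (1) appears in the original only as an image and is not reproduced here, so the sketch below substitutes an assumed distance-decaying coefficient, normalized by the region length l and width w, purely as a placeholder for G(i, j), and then applies the retention test of formula (2) to each candidate center extreme point. `amplitude` is the spectrogram array from the previous sketch.

```python
def influence_coefficient(i, j, l, w):
    """Placeholder for formula (1): an assumed distance-decaying masking coefficient.

    i and j are (time, frequency) coordinates; the exact formula in the patent
    is an image, so this decay shape is an assumption for illustration only.
    """
    it, if_ = i
    jt, jf = j
    return max(0.0, 1.0 - abs(it - jt) / l) * max(0.0, 1.0 - abs(if_ - jf) / w)

def masking_screen(candidates, amplitude, l=10, w=20):
    """Keep a candidate center extreme point only if formula (2) holds against
    every other candidate in its l-by-w rectangular region (l, w illustrative)."""
    kept = []
    for i in candidates:
        it, if_ = i
        neighbors = [j for j in candidates
                     if j != i and abs(j[0] - it) <= l / 2 and abs(j[1] - if_) <= w / 2]
        # An empty region means the center point is retained directly.
        if all(amplitude[i] >= amplitude[j] * influence_coefficient(i, j, l, w)
               for j in neighbors):
            kept.append(i)
    return kept
```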
As one example, after the candidate extreme points of each frame of audio data have been screened based on the influence coefficients between candidate extreme points, the first candidate extreme point list obtained from that screening can be screened again based on the density of the candidate extreme points, in order to filter out current noise (electrical hum), yielding a second candidate extreme point list for each frame as the extreme point list of the audio data to be processed.
Some audio contains, in certain frequency bands, extreme points that are continuous in time and consistently high in both energy and density, i.e., current noise. Current noise distorts audio matching, producing very high matching scores over short windows and misleading the matching result. Therefore, to prevent dense clusters of high spectral-energy points in certain frequency ranges of the audio, the density-based screening of candidate extreme points of the invention specifically comprises:
selecting, in turn, each extreme point in the first candidate extreme point list as the current candidate extreme point; moving forward and backward in time from the current extreme point by a fixed interval (for example, 5 s), and counting the total number of candidate extreme points within that window as the density of the current candidate extreme point;
if the density of the current candidate extreme point exceeds a preset threshold, deleting that candidate extreme point; otherwise retaining it.
By screening each extreme point in the first candidate extreme point list in turn in this way, the second candidate extreme point list is obtained.
Of course, the invention can also apply the density-based screening directly to the original candidate extreme points of each frame of audio data; the specific screening steps are the same as above and will not be repeated here.
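A sketch of this density screen, continuing the same assumptions: the fixed time window is expressed in frame indices (`window_frames` standing in for, e.g., 5 s at the chosen hop size), and the density threshold is illustrative. The text does not say whether the count is restricted to a frequency band, so this sketch counts all candidates in the time window.

```python
def density_screen(candidates, window_frames=250, max_density=30):
    """Delete candidate extreme points whose count of candidates within the
    fixed time window exceeds a preset threshold; both values are illustrative."""
    times = [t for t, _ in candidates]
    kept = []
    for (t, f) in candidates:
        density = sum(1 for s in times if abs(s - t) <= window_frames)
        if density <= max_density:
            kept.append((t, f))   # retain: not part of a dense (hum-like) cluster
    return kept
```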
As one example, in order to improve the noise robustness and adaptivity of the extreme points, the invention may further perform a difference calculation, in turn, on the extreme points in the second candidate extreme point list obtained from the density-based screening, so that the audio can still be matched after its overall energy has been scaled.
In the specific difference calculation, the candidate extreme points of the current frame of audio data are differenced against the spectrum values of the candidate extreme points of one or more frames before the current frame and of one or more frames after it, yielding the differenced spectrum value of each candidate extreme point of the current frame. The specific difference calculation is shown in formula (3):
ΔP(i) = |P(i) + P(i(t+1)) - P(i(t-1)) - P(i(t-2))|    (3)
where ΔP(i) is the value of candidate extreme point i of the current frame after the difference calculation, P(i(t+1)) is the spectrum value of the candidate extreme point in the same frequency band as candidate extreme point i in the following frame, and P(i(t-1)) and P(i(t-2)) are the spectrum values of the candidate extreme points in the same frequency band in the previous frame and the frame before that, respectively.
After the difference calculation on the extreme points in the second candidate extreme point list has finished, the difference spectrum value of each candidate extreme point in the list is obtained. For each frame of audio data, the candidate extreme points whose difference spectrum value exceeds a preset threshold are selected as the extreme points of that frame; alternatively, the difference spectrum values of the candidate extreme points in the second candidate extreme point list can be sorted, and the top N candidate extreme points by difference spectrum value selected as the extreme points of each frame, thereby obtaining the extreme point list of each frame of audio data.
Of course, the invention can also perform the difference calculation on the original candidate extreme points, or on the first candidate extreme point list obtained from the screening based on the influence coefficients between candidate extreme points, to obtain the extreme point list of each frame of audio data; the specific difference calculation steps are the same as above and will not be repeated here.
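The following sketch transcribes formula (3) and the top-N selection variant described above into the running example; N (`top_n`) is an illustrative value, and `spectrum` is again the spectrogram array from the first sketch.

```python
def difference_screen(candidates, spectrum, top_n=5):
    """Apply formula (3) per candidate extreme point, then keep the top-N
    candidates per frame by difference spectrum value (N is illustrative)."""
    n_frames = spectrum.shape[0]
    by_frame = {}
    for (t, f) in candidates:
        if 2 <= t <= n_frames - 2:                       # need frames t-2 .. t+1
            dp = abs(spectrum[t, f] + spectrum[t + 1, f]
                     - spectrum[t - 1, f] - spectrum[t - 2, f])
            by_frame.setdefault(t, []).append(((t, f), dp))
    kept = []
    for t, scored in by_frame.items():
        scored.sort(key=lambda x: x[1], reverse=True)    # largest difference first
        kept.extend(point for point, _ in scored[:top_n])
    return kept
```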
Step 104: extracting the fingerprint features of the audio data according to the extreme point list of the audio data to be processed.
Specifically, as shown in Fig. 3, step 104 further comprises:
Step S31: constructing a candidate region based on each extreme point in the extreme point list and determining the extreme point pairs of each extreme point. Specifically, in step S31, each extreme point in the extreme point list is selected in turn as the current fixed extreme point; based on this fixed extreme point, a candidate region is constructed within a fixed frequency band and time range, and the g extreme points whose spectral energy exceeds a preset threshold are each paired with the fixed extreme point. For example, the candidate region in Fig. 2 contains 8 candidate extreme points, of which only the 5 points with the larger spectral energy are paired with the fixed extreme point; Fig. 4 is a schematic diagram of the pairs constructed for the fixed extreme point of Fig. 2. Applying this process to each extreme point in the extreme point list yields the extreme point pairs formed by each extreme point in the extreme point list of every frame of audio data;
Step S32: extracting the fingerprint features of each frame of audio data according to the extreme point pairs in the extreme point list. In the specific extraction, each extreme point in the extreme point list is selected in turn as the current extreme point, and the fingerprint feature F of the current frame of audio data is extracted from the current extreme point and the extreme point pairs it forms, namely: the time information t of the frame containing the current extreme point; the frequency-domain spectrum value f of the current extreme point; the differences Δt between the time information of the frame containing the current extreme point and of the frames containing each paired extreme point; the differences Δf between the frequency-domain spectrum values of the current extreme point and of each paired extreme point; and the unique identifier audioID of the current frame of audio data. The fingerprint feature of the current frame of audio data is thus expressed as F = {t, f, Δt, Δf, audioID};
Step S33: combining the audio fingerprint features of every frame contained in each segment of audio data to obtain the audio fingerprint features of that segment.
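Finally, a sketch of steps S31 and S32 under the same assumptions: the extent of the candidate region in time (`t_span`) and frequency (`f_span`) and the pair count g are illustrative parameters, and each pair is emitted as the tuple F = (t, f, Δt, Δf, audioID) described above. Concatenating the per-frame results over a segment corresponds to step S33.

```python
def extract_fingerprints(extreme_points, spectrum, audio_id,
                         g=5, t_span=40, f_span=60):
    """Pair each fixed extreme point with up to g high-energy extreme points in a
    fixed time/frequency window, emitting (t, f, dt, df, audioID) fingerprints."""
    fingerprints = []
    for (t, f) in extreme_points:
        # Candidate region: a fixed band and time range ahead of the fixed point.
        region = [(s, k) for (s, k) in extreme_points
                  if 0 < s - t <= t_span and abs(k - f) <= f_span]
        region.sort(key=lambda p: spectrum[p], reverse=True)   # highest energy first
        for (s, k) in region[:g]:                              # keep the g strongest
            fingerprints.append((t, f, s - t, k - f, audio_id))
    return fingerprints
```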
In one embodiment of the invention, as shown in Fig. 5, the audio feature extraction device of the invention comprises: an audio data acquiring unit 51, a candidate extreme point determining unit 52, an extreme point list determining unit 53, and an audio feature extraction unit 54.
The audio data acquiring unit 51 is configured to obtain the audio data to be processed. The audio data to be processed may be speech data containing effective speech, pure-music audio data, or song data.
The candidate extreme point determining unit 52 is configured to determine the original candidate extreme points of the audio data to be processed according to its spectral energy amplitudes.
The extreme point list determining unit 53 screens the original candidate extreme points of each frame of audio data based on the influence coefficients between candidate extreme points and/or the density of candidate extreme points and/or the difference calculation results between candidate extreme points, to obtain the extreme point list of the audio data to be processed.
Specifically, the extreme point list determining unit 53 further comprises:
a first screening unit, configured to screen the original candidate extreme points of each frame of audio data based on the influence coefficients between candidate extreme points; and/or
a second screening unit, configured to screen, based on the density of candidate extreme points, the original candidate extreme points of each frame of audio data or the candidate extreme points remaining after screening by the first screening unit; and/or
a third screening unit, configured to screen the original candidate extreme points, or the candidate extreme points remaining after screening by the first screening unit, or the candidate extreme points remaining after screening by the second screening unit.
The first screening unit is specifically configured to:
select, in turn, each original candidate extreme point of the current frame of audio data as a candidate center extreme point, construct a region centered on that candidate center extreme point, and obtain all candidate extreme points within the region; specifically, a candidate extreme point among the original candidate extreme points of the current frame is selected as the candidate center extreme point, a rectangular region centered on that point is constructed on the spectrogram, and the candidate extreme points of every frame within the rectangular region are found;
calculate the influence coefficient between the candidate center extreme point and each of the other candidate extreme points;
determine whether to retain the candidate center extreme point according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points; specifically, if the product of the frequency-domain amplitude of each non-center candidate extreme point in the rectangular region and its influence coefficient is less than the frequency-domain amplitude of the center extreme point, the candidate center extreme point is retained.
In the invention, the second screening unit screens, based on the density of candidate extreme points, the original candidate extreme points of each frame of audio data or the candidate extreme points remaining after screening by the first screening unit, in order to filter out current noise. The second screening unit is specifically configured to:
select, in turn, each extreme point among the original candidate extreme points of each frame of audio data or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, as the current candidate extreme point; move forward and backward in time from the current extreme point by a fixed interval (for example, 5 s), and count the total number of candidate extreme points within that window as the density of the current candidate extreme point;
if the density of the current candidate extreme point exceeds a preset threshold, delete that candidate extreme point; otherwise retain it.
In the invention, the third screening unit is configured to determine the extreme point list of each frame of audio data after performing the difference calculation on the original candidate extreme points, or on the candidate extreme points remaining after screening by the first screening unit, or on the candidate extreme points remaining after screening by the second screening unit. The third screening unit is specifically configured to:
perform a difference calculation on each candidate extreme point among the original candidate extreme points, or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, or among the candidate extreme points remaining after density-based screening, to obtain the difference spectrum value of each candidate extreme point; the specific difference calculation differences the candidate extreme points of the current frame of audio data against the spectrum values of the candidate extreme points of one or more frames before the current frame and of one or more frames after it, yielding the differenced spectrum value of each candidate extreme point of the current frame;
for each frame of audio data, select the candidate extreme points whose difference spectrum value exceeds a threshold as the extreme points of that frame, or sort the difference spectrum values of the candidate extreme points and select the top N candidate extreme points by difference spectrum value as the extreme points of each frame, thereby obtaining the extreme point list of each frame of audio data.
The audio feature extraction unit 54 is configured to extract the fingerprint features of the audio data according to the extreme point list of the audio data to be processed.
Specifically, as shown in Fig. 6, the audio feature extraction unit 54 further comprises:
an extreme point pair determining unit 541, which constructs a candidate region based on each extreme point in the extreme point list and determines the extreme point pairs of each extreme point; specifically, the extreme point pair determining unit 541 selects, in turn, each extreme point in the extreme point list as the current fixed extreme point, constructs a candidate region within a fixed frequency band and time range based on the current fixed extreme point, and pairs each of the g extreme points whose spectral energy exceeds a preset threshold with the fixed extreme point;
a fingerprint feature extraction unit 542, configured to extract the fingerprint features of each frame of audio data according to the extreme point pairs in the extreme point list;
a merging unit 543, configured to merge the fingerprint features of every frame of audio data to obtain the audio fingerprint features of each segment of audio data.
Referring to Fig. 7, a structural diagram of the electronic equipment 300 used for the audio feature extraction method of the invention is shown. The electronic equipment 300 comprises a processing component 301, which further comprises one or more processors, and storage resources represented by a storage medium 302, for storing instructions executable by the processing component 301, such as application programs. The application programs stored in the storage medium 302 may comprise one or more modules, each corresponding to a group of instructions. In addition, the processing component 301 is configured to execute the instructions so as to perform the steps of the above audio feature extraction method.
The electronic equipment 300 may also comprise a power supply component 303, configured to perform power management of the electronic equipment 300; a wired or wireless network interface 304, configured to connect the electronic equipment 300 to a network; and an input/output (I/O) interface 305. The electronic equipment 300 may operate based on an operating system stored in the storage medium 302, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or similar.
In summary, the audio feature extraction method and device and the electronic equipment of the invention receive audio data to be processed, determine candidate extreme points of the audio data according to its spectral energy amplitudes, and then screen the candidate extreme points of the audio data based on the influence coefficients between candidate extreme points (derived from the auditory masking effect), the density of the candidate extreme points, and the difference values of the candidate extreme points, obtaining the extreme point list of the audio data to be processed, so that the fingerprint features of the audio data can be extracted according to the extreme point list. By using the influence coefficients between candidate extreme points, the candidate extreme point density, and the difference values of the candidate extreme points, the invention effectively improves the noise robustness of the extracted audio features, enabling them to describe the audio data more accurately.
It should be noted that the above embodiments can be freely combined as needed. The above are only preferred embodiments of the invention; it should be pointed out that, for those of ordinary skill in the art, improvements and modifications can be made without departing from the principles of the invention, and such improvements and modifications shall also fall within the protection scope of the invention.

Claims (14)

1. An audio feature extraction method, comprising the following steps:
step 1, obtaining audio data to be processed;
step 2, determining original candidate extreme points according to the spectral energy amplitudes of the audio data to be processed;
step 3, screening the original candidate extreme points of the audio data to be processed based on influence coefficients between candidate extreme points and/or the density of candidate extreme points and/or difference calculation results between candidate extreme points, to obtain an extreme point list of the audio data to be processed;
step 4, extracting fingerprint features of the audio data according to the extreme point list of the audio data to be processed.
2. The audio feature extraction method of claim 1, wherein the step of screening based on influence coefficients between candidate extreme points further comprises:
selecting, in turn, each original candidate extreme point of the current frame of audio data as a candidate center extreme point, constructing a region centered on that candidate center extreme point, and obtaining all candidate extreme points within the region;
calculating the influence coefficient between the candidate center extreme point and each of the other candidate extreme points;
determining whether to retain the candidate center extreme point according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points.
3. The audio feature extraction method of claim 2, wherein the step of determining whether to retain the candidate center extreme point according to the influence coefficients is specifically: if the frequency-domain amplitude of the candidate center extreme point is greater than or equal to the product of the frequency-domain amplitude of each non-center candidate extreme point in the region and the corresponding influence coefficient, retaining the candidate center extreme point.
4. The audio feature extraction method of claim 1, wherein the step of screening based on the density of candidate extreme points further comprises:
selecting, in turn, each extreme point among the original candidate extreme points of each frame of audio data and/or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, as the current candidate extreme point, and calculating the density of the current candidate extreme point;
if the density of the current candidate extreme point exceeds a preset threshold, deleting the current candidate extreme point; otherwise retaining it.
5. The audio feature extraction method of claim 4, wherein the step of screening based on difference calculation results between candidate extreme points further comprises:
performing a difference calculation on each candidate extreme point among the original candidate extreme points of each frame of audio data, and/or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, and/or among the candidate extreme points remaining after density-based screening, to obtain a difference spectrum value for each candidate extreme point;
determining the extreme point list of each frame of audio data according to the difference spectrum values of the candidate extreme points.
6. The audio feature extraction method of claim 5, wherein the difference calculation step is specifically: performing a difference calculation on the candidate extreme points of the current frame of audio data according to the spectrum values of the candidate extreme points of one or more frames preceding the current frame and of one or more frames following the current frame, to obtain the difference spectrum value of each candidate extreme point of the current frame.
7. The audio feature extraction method of claim 1, wherein step 4 further comprises:
constructing a candidate region based on each extreme point in the extreme point list, and determining the extreme point pairs of each extreme point;
extracting the fingerprint features of each frame of audio data according to the extreme point pairs in the extreme point list;
merging the fingerprint features of every frame of audio data to obtain the audio fingerprint features of each segment of audio data.
8. The audio feature extraction method of claim 7, wherein the step of constructing a candidate region based on each extreme point in the extreme point list and determining the extreme point pairs of each extreme point specifically comprises:
selecting, in turn, each extreme point in the extreme point list as a fixed extreme point;
constructing the candidate region based on the fixed extreme point, and selecting extreme points within the candidate region to form extreme point pairs with the fixed extreme point.
9. An audio feature extraction device, comprising:
an audio data acquiring unit, configured to obtain audio data to be processed;
a candidate extreme point determining unit, configured to determine the original candidate extreme points of the audio data to be processed according to its spectral energy amplitudes;
a candidate extreme point screening unit, configured to screen the original candidate extreme points of the audio data based on influence coefficients between candidate extreme points and/or the density of candidate extreme points and/or difference calculation results between candidate extreme points, to obtain the extreme point list of the audio data to be processed;
an audio feature extraction unit, configured to extract the fingerprint features of the audio data according to the extreme point list of the audio data to be processed.
10. The audio feature extraction device of claim 9, wherein a first screening unit of the candidate extreme point screening unit screens the original candidate extreme points of each frame of audio data based on the influence coefficients between candidate extreme points, and is specifically configured to:
select, in turn, each original candidate extreme point of the current frame of audio data as a candidate center extreme point, construct a region centered on that candidate center extreme point, and obtain all candidate extreme points within the region;
calculate the influence coefficient between the candidate center extreme point and each of the other candidate extreme points;
determine whether to retain the candidate center extreme point according to the influence coefficients and the frequency-domain amplitudes of the candidate extreme points.
11. The audio feature extraction device of claim 9, wherein a second screening unit of the candidate extreme point screening unit screens the original candidate extreme points of the audio data based on the density of candidate extreme points, and is specifically configured to:
select, in turn, each extreme point among the original candidate extreme points of each frame of audio data or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, as the current candidate extreme point, and calculate the density of the current candidate extreme point;
if the density of the current candidate extreme point exceeds a preset threshold, delete that candidate extreme point; otherwise retain it.
12. The audio feature extraction device of claim 11, wherein a third screening unit of the candidate extreme point screening unit screens the original candidate extreme points of the audio data based on the difference calculation results between candidate extreme points, and is specifically configured to:
perform a difference calculation on each candidate extreme point among the original candidate extreme points, or among the candidate extreme points remaining after screening based on the influence coefficients between candidate extreme points, or among the candidate extreme points remaining after density-based screening, to obtain the difference spectrum value of each candidate extreme point;
determine the extreme point list of each frame of audio data according to the difference spectrum values of the candidate extreme points.
13. The audio feature extraction device of claim 9, wherein the audio feature extraction unit further comprises:
an extreme point pair determining unit, which constructs a candidate region based on each extreme point in the extreme point list and determines the extreme point pairs of each extreme point;
a fingerprint feature extraction unit, configured to extract the fingerprint features of each frame of audio data according to the extreme point pairs in the extreme point list;
a merging unit, configured to merge the fingerprint features of every frame of audio data to obtain the audio fingerprint features of each segment of audio data.
14. Electronic equipment, comprising:
a storage medium storing a plurality of instructions, the instructions being loaded by a processor to perform the steps of the method of any one of claims 1 to 8; and
a processor, configured to execute the instructions in the storage medium.
CN201710803397.1A 2017-09-08 2017-09-08 Audio feature extraction method and device and electronic equipment Active CN107622773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710803397.1A CN107622773B (en) 2017-09-08 2017-09-08 Audio feature extraction method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710803397.1A CN107622773B (en) 2017-09-08 2017-09-08 Audio feature extraction method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN107622773A (en) 2018-01-23
CN107622773B CN107622773B (en) 2021-04-06

Family

ID=61088507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710803397.1A Active CN107622773B (en) 2017-09-08 2017-09-08 Audio feature extraction method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN107622773B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080091366A1 (en) * 2004-06-24 2008-04-17 Avery Wang Method of Characterizing the Overlap of Two Media Segments
EP1819197A2 (en) * 2006-02-14 2007-08-15 STMicroelectronics Asia Pacific Pte Ltd. Digital audio signal processing method and system for generating and controlling digital reverberations for audio signals
CN102214218A (en) * 2011-06-07 2011-10-12 盛乐信息技术(上海)有限公司 System and method for retrieving contents of audio/video
US20160335347 * 2015-05-11 2016-11-17 Alibaba Group Holding Limited Audio information retrieval method and device

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019184518A1 (en) * 2018-03-29 2019-10-03 北京字节跳动网络技术有限公司 Audio retrieval and identification method and device
WO2019184517A1 (en) * 2018-03-29 2019-10-03 北京字节跳动网络技术有限公司 Audio fingerprint extraction method and device
US10950255B2 (en) 2018-03-29 2021-03-16 Beijing Bytedance Network Technology Co., Ltd. Audio fingerprint extraction method and device
US11182426B2 (en) 2018-03-29 2021-11-23 Beijing Bytedance Network Technology Co., Ltd. Audio retrieval and identification method and device
CN109658939A * 2019-01-26 2019-04-19 北京灵伴即时智能科技有限公司 Method for identifying reason of call record non-connection
CN109658939B (en) * 2019-01-26 2020-12-01 北京灵伴即时智能科技有限公司 Method for identifying reason of call record non-connection
CN111522991A (en) * 2020-04-15 2020-08-11 厦门快商通科技股份有限公司 Audio fingerprint extraction method, device and equipment
CN112037815A (en) * 2020-08-28 2020-12-04 中移(杭州)信息技术有限公司 Audio fingerprint extraction method, server and storage medium
CN114640926A (en) * 2022-03-31 2022-06-17 歌尔股份有限公司 Current sound detection method, device, equipment and computer readable storage medium
CN114640926B (en) * 2022-03-31 2023-11-17 歌尔股份有限公司 Current sound detection method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN107622773B (en) 2021-04-06

Similar Documents

Publication Publication Date Title
CN107622773A (en) A kind of audio feature extraction methods and device, electronic equipment
CN111210021B (en) Audio signal processing method, model training method and related device
CN103503060B (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
WO2021073116A1 (en) Method and apparatus for generating legal document, device and storage medium
CN112863547A (en) Virtual resource transfer processing method, device, storage medium and computer equipment
CN106503184B (en) Determine the method and device of the affiliated class of service of target text
CN109065051B (en) Voice recognition processing method and device
CN105006230A (en) Voice sensitive information detecting and filtering method based on unspecified people
CN1013525B (en) Real-time phonetic recognition method and device with or without function of identifying a person
CN111462758A (en) Method, device and equipment for intelligent conference role classification and storage medium
CN110444190A (en) Method of speech processing, device, terminal device and storage medium
CN108021635A (en) The definite method, apparatus and storage medium of a kind of audio similarity
Sun et al. Dynamic time warping for speech recognition with training part to reduce the computation
CN108628813A Processing method and apparatus, and device for processing
CN107564526A (en) Processing method, device and machine readable media
CN111710332B (en) Voice processing method, device, electronic equipment and storage medium
CN110473563A (en) Breathing detection method, system, equipment and medium based on time-frequency characteristics
CN110931019B (en) Public security voice data acquisition method, device, equipment and computer storage medium
CN109841221A (en) Parameter adjusting method, device and body-building equipment based on speech recognition
CN110728993A (en) Voice change identification method and electronic equipment
CN112466328B (en) Breath sound detection method and device and electronic equipment
CN106297795B (en) Audio recognition method and device
CN111968651A (en) WT (WT) -based voiceprint recognition method and system
CN106340310B (en) Speech detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant