CN104240719A - Feature extraction method and classification method for audios and related devices

Info

Publication number
CN104240719A
Authority
CN
China
Prior art keywords
audio
feature
frame
cluster centre
section
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310255746.2A
Other languages
Chinese (zh)
Other versions
CN104240719B (en)
Inventor
谢志明
潘晖
潘石柱
张兴明
傅利泉
朱江明
吴军
吴坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN201310255746.2A
Publication of CN104240719A
Application granted
Publication of CN104240719B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses an audio feature extraction method, an audio classification method, and related devices, aiming to solve the prior-art problem that features of equal length cannot be extracted from audios of different durations. The method comprises: obtaining audios, and performing the following operations on each obtained audio: dividing the audio into multiple audio frames according to a preset framing rule; performing feature extraction on each audio frame according to a preset feature extraction rule, to obtain the feature of each frame; determining, from the obtained frame features and a set of cluster centres used to distinguish audio-frame categories, the cluster centre corresponding to each frame; determining the number of frames corresponding to each cluster centre; and determining the feature of the audio from the numbers so determined.

Description

Audio feature extraction method, audio classification method, and related apparatus
Technical field
The present invention relates to the field of pattern recognition, and in particular to an audio feature extraction method, an audio classification method, and related apparatus.
Background
Audio classification is widely applicable to audio retrieval and abnormal-event detection. In audio retrieval, for example, classifying an audio clip as speech or music allows retrieval to be restricted to the database corresponding to that class: if the clip can be determined in advance to be music, the search can go directly to the music database, and if it can further be determined to belong to a particular musical style, the search scope narrows again. In abnormal-event detection, classifying a clip as a scream, the sound of breaking glass, a gunshot, or a normal sound (such as a person speaking at a normal pace) determines whether the event that produced the clip is abnormal or normal: if the clip's feature resembles that of abnormal audio such as a scream, gunshot, or breaking glass, the clip is assigned to the abnormal-audio class and the corresponding event is judged abnormal; if its feature resembles that of normal sound, the clip is assigned to the normal-audio class and the event is judged normal.
In the prior art, audio samples of known class whose duration equals a fixed length (for example, 1 second) are generally divided into short frames, features such as Mel-frequency cepstral coefficients (MFCC) and linear predictive cepstral coefficients (LPCC) are extracted from each frame, and the per-frame features are combined as the feature of the sample; the features extracted from all samples are then clustered, or used in classification training, to obtain the common feature of each audio class. When audio of unknown class is to be classified, the same framing is applied to a fixed-length clip, the corresponding features are extracted, and these are compared against the cluster centres obtained by clustering, or fed into the classifier obtained by training, to produce the classification result.
The defect of this approach is that the audio samples of known class and the audios of unknown class to be classified must all have the same fixed duration: if the durations differ, the features extracted as above have unequal lengths, so clustering or classification training cannot be carried out, and audio of unknown class cannot be classified.
Summary of the invention
Embodiments of the present invention provide an audio feature extraction method, an audio classification method, and related apparatus, in order to solve the prior-art problem that features of equal length cannot be extracted from audios of different durations.
Embodiments of the present invention adopt the following technical solutions:
An audio feature extraction method, comprising:
obtaining audios, and performing the following operations on each obtained audio:
dividing the audio according to a preset framing rule, to obtain multiple audio frames;
performing feature extraction on each of the audio frames according to a preset feature extraction rule, to obtain the feature of each audio frame;
determining, from the obtained frame features and the cluster centres used to distinguish audio-frame categories, the cluster centre corresponding to each audio frame; wherein the cluster centre corresponding to an audio frame is the one whose feature, among the features of all cluster centres, has the greatest similarity to the feature of that frame; and wherein the cluster centres are obtained by dividing each audio sample into multiple audio sample frames according to the framing rule, extracting the feature of each sample frame according to the feature extraction rule, and clustering the extracted sample-frame features;
determining the number of audio frames corresponding to each cluster centre, and determining the feature of the audio from the numbers so determined.
An audio feature extraction device, comprising:
an obtaining unit, configured to obtain audios;
a framing unit, configured to perform, for each audio obtained by the obtaining unit: dividing the audio according to a preset framing rule, to obtain multiple audio frames;
a feature extraction unit, configured to perform feature extraction on each of the audio frames obtained by the framing unit according to a preset feature extraction rule, to obtain the feature of each audio frame;
a cluster-centre determining unit, configured to determine, from the frame features obtained by the feature extraction unit and the cluster centres used to distinguish audio-frame categories, the cluster centre corresponding to each audio frame; wherein the cluster centre corresponding to an audio frame is the one whose feature, among the features of all cluster centres, has the greatest similarity to the feature of that frame; and wherein the cluster centres are obtained by dividing each audio sample into multiple audio sample frames according to the framing rule, extracting the feature of each sample frame according to the feature extraction rule, and clustering the extracted sample-frame features;
a feature determining unit, configured to determine the number of audio frames corresponding to each cluster centre, and to determine the feature of the audio from the numbers so determined.
An audio classification method, comprising:
Step 1: dividing the audio to be classified according to a preset framing rule, to obtain multiple audio frames;
Step 2: performing feature extraction on each of the audio frames according to a preset feature extraction rule, to obtain the feature of each audio frame;
performing Step 3, Step 4, and Step 5 in sequence at least twice:
Step 3: determining a predetermined number of audio sections from the obtained audio frames and a preset second inter-section overlap percentage; and determining, from the audio sections so determined and the cluster centres used to distinguish audio-frame categories, the cluster centre corresponding to each audio frame contained in each section; wherein the cluster centre corresponding to an audio frame is the one whose feature, among the features of all cluster centres, has the greatest similarity to the feature of that frame; wherein the cluster centres are obtained by dividing each audio sample into multiple audio sample frames according to the framing rule, extracting the feature of each sample frame according to the feature extraction rule, and clustering the extracted sample-frame features; and wherein, each time Step 3 is performed, a different second inter-section overlap percentage is used;
Step 4: determining the number of audio frames corresponding to each cluster centre, and determining the feature of the audio from the numbers so determined;
Step 5: determining a classification result from the determined feature of the audio and a classifier used to distinguish audio categories; wherein the classifier is obtained by classification training on the features of the audio samples, and the feature of each audio sample is obtained from the features of its audio frames and the cluster centres;
Step 6: determining the class of the audio from the classification results so determined.
An audio classification device, comprising:
a framing unit, configured to divide the audio to be classified according to a preset framing rule, to obtain multiple audio frames;
a feature extraction unit, configured to perform feature extraction on each of the audio frames obtained by the framing unit according to a preset feature extraction rule, to obtain the feature of each audio frame;
a classification-result determining unit, configured to perform the following steps in sequence at least twice:
Step 1: determining a predetermined number of audio sections from the obtained audio frames and a preset second inter-section overlap percentage; and determining, from the audio sections so determined and the cluster centres used to distinguish audio-frame categories, the cluster centre corresponding to each audio frame contained in each section; wherein the cluster centre corresponding to an audio frame is the one whose feature, among the features of all cluster centres, has the greatest similarity to the feature of that frame; wherein the cluster centres are obtained by dividing each audio sample into multiple audio sample frames according to the framing rule, extracting the feature of each sample frame according to the feature extraction rule, and clustering the extracted sample-frame features; and wherein, each time Step 1 is performed, a different second inter-section overlap percentage is used;
Step 2: determining the number of audio frames corresponding to each cluster centre, and determining the feature of the audio from the numbers so determined;
Step 3: determining a classification result from the determined feature of the audio and a classifier used to distinguish audio categories; wherein the classifier is obtained by classification training on the features of the audio samples, and the feature of each audio sample is obtained from the features of its audio frames and the cluster centres; and
a class determining unit, configured to determine the class of the audio from the classification results determined by the classification-result determining unit.
An audio classification method, comprising:
dividing the audio to be classified according to a preset framing rule, to obtain multiple audio frames;
performing feature extraction on each of the audio frames according to a preset feature extraction rule, to obtain the feature of each audio frame;
determining, from the obtained frame features and the cluster centres used to distinguish audio-frame categories, the cluster centre corresponding to each audio frame; wherein the cluster centre corresponding to an audio frame is the one whose feature, among the features of all cluster centres, has the greatest similarity to the feature of that frame; and wherein the cluster centres are obtained by dividing each audio sample into multiple audio sample frames according to the framing rule, extracting the feature of each sample frame according to the feature extraction rule, and clustering the extracted sample-frame features;
determining the number of audio frames corresponding to each cluster centre, and determining the feature of the audio from the numbers so determined;
determining the class of the audio from the determined feature of the audio and a classifier used to distinguish audio categories; wherein the classifier is obtained by classification training on the features of the audio samples, and the feature of each audio sample is obtained from the features of its audio frames and the cluster centres.
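The patent does not name a specific classifier, only that one is trained on fixed-length sample features. A minimal stand-in (a nearest-class-mean classifier, purely illustrative, not the patent's method) shows how the fixed-length histogram features make training and classification straightforward:

```python
import numpy as np

def train_nearest_mean(features, labels):
    """Toy classifier: store the mean feature vector of each class."""
    return {c: np.mean([f for f, l in zip(features, labels) if l == c], axis=0)
            for c in set(labels)}

def classify(feature, class_means):
    """Assign the class whose mean feature is closest to the given feature."""
    return min(class_means,
               key=lambda c: np.linalg.norm(np.asarray(feature) - class_means[c]))

# Hypothetical fixed-length audio features (histograms) with known classes
feats = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = ["normal", "scream", "scream", "scream"][:1] + ["normal", "scream", "scream"][:0] or None
labels = ["normal", "normal", "scream", "scream"]
model = train_nearest_mean(feats, labels)
print(classify([0.85, 0.15], model))  # normal
```

Because every audio yields a feature of the same length here, any off-the-shelf classifier (SVM, decision tree, etc.) could replace this toy one without changing the surrounding pipeline.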
An audio classification device, comprising:
a framing unit, configured to divide the audio to be classified according to a preset framing rule, to obtain multiple audio frames;
a feature extraction unit, configured to perform feature extraction on each of the audio frames obtained by the framing unit according to a preset feature extraction rule, to obtain the feature of each audio frame;
a cluster-centre determining unit, configured to determine, from the frame features obtained by the feature extraction unit and the cluster centres used to distinguish audio-frame categories, the cluster centre corresponding to each audio frame; wherein the cluster centre corresponding to an audio frame is the one whose feature, among the features of all cluster centres, has the greatest similarity to the feature of that frame; and wherein the cluster centres are obtained by dividing each audio sample into multiple audio sample frames according to the framing rule, extracting the feature of each sample frame according to the feature extraction rule, and clustering the extracted sample-frame features;
a feature determining unit, configured to determine the number of audio frames corresponding to each cluster centre, and to determine the feature of the audio from the numbers so determined;
a class determining unit, configured to determine the class of the audio from the determined feature of the audio and a classifier used to distinguish audio categories; wherein the classifier is obtained by classification training on the features of the audio samples, and the feature of each audio sample is obtained from the features of its audio frames and the cluster centres.
The beneficial effects of embodiments of the present invention are as follows:
Embodiments of the present invention frame an audio and extract per-frame features, compare the feature of each frame with the cluster centres previously obtained by clustering the frame features of the audio samples, determine the cluster centre corresponding to each frame, count the frames corresponding to each cluster centre, and finally obtain the feature of the audio. Since the number of cluster centres obtained in advance is fixed, the length of the resulting feature is constant regardless of the duration of the audio, which solves the prior-art problem that features of equal length cannot be extracted from audios of different durations.
Brief description of the drawings
Fig. 1 is a flowchart of the audio feature extraction method provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the audio feature extraction method provided by Embodiment 2 of the present invention;
Fig. 3 is a flowchart of the audio classification method provided by Embodiment 3 of the present invention;
Fig. 4 is a flowchart of the audio classification method provided by Embodiment 4 of the present invention;
Fig. 5 is a detailed flowchart, in a practical application, of the audio classification method provided by Embodiment 5 of the present invention;
Fig. 6 is a schematic structural diagram of the audio feature extraction device provided by Embodiment 6 of the present invention;
Fig. 7 is a schematic structural diagram of the audio classification device provided by Embodiment 7 of the present invention;
Fig. 8 is a schematic structural diagram of the audio classification device provided by Embodiment 8 of the present invention.
Detailed description
In order to solve the prior-art problem that features of equal length cannot be extracted from audios of different durations, embodiments of the present invention provide an audio feature extraction method, an audio classification method, and related apparatus. The scheme frames an audio and extracts per-frame features, compares the feature of each frame with the cluster centres previously obtained by clustering the frame features of the audio samples, determines the cluster centre corresponding to each frame, counts the frames corresponding to each cluster centre, and finally determines the feature of the audio. Since the number of cluster centres obtained in advance is fixed, the length of the resulting feature is constant regardless of the duration of the audio, which solves the problem stated above.
Embodiments of the invention are described below with reference to the accompanying drawings. It should be understood that the embodiments described here serve only to illustrate and explain the invention and do not limit it; where no conflict arises, the embodiments in this description and the features of the embodiments may be combined with one another.
Embodiment 1:
First, Embodiment 1 of the present invention provides an audio feature extraction method; the flowchart of the method, shown in Fig. 1, mainly comprises the following steps.
Step 11: obtain audios.
The same operations below are performed for each obtained audio, so the following steps are described for a single audio.
Step 12: divide the audio according to the preset framing rule, to obtain multiple audio frames.
Although most audio signals are random, non-stationary signals, they can be regarded as stationary over short intervals, so the obtained audio is first framed according to the preset framing rule, which may specify a frame duration and an inter-frame overlap percentage. For example, when the audio is divided with a frame duration of 25 ms and an inter-frame overlap of 50%, each resulting audio frame is 25 ms long and adjacent frames overlap by 50%.
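As a rough sketch of this framing step under the 25 ms / 50% example above (the sample rate and array layout are illustrative assumptions, not part of the patent):

```python
import numpy as np

def frame_signal(x, sample_rate, frame_ms=25, overlap=0.5):
    """Split a 1-D signal into fixed-length frames with fractional overlap."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    hop = int(frame_len * (1 - overlap))             # step between frame starts
    n_frames = 1 + (len(x) - frame_len) // hop       # drop the trailing remainder
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

# 1 second of audio at an assumed 8 kHz: 25 ms frames (200 samples), 50% overlap
frames = frame_signal(np.zeros(8000), 8000)
print(frames.shape)  # (79, 200)
```

Note that the frame count varies with the audio's duration; it is the later histogram step, not the framing, that fixes the feature length.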
Step 13: perform feature extraction on each of the audio frames according to the preset feature extraction rule, to obtain the feature of each audio frame.
In this step, each obtained audio frame may first be preprocessed with one or a combination of zero-mean normalization, pre-emphasis, and windowing. Time-domain or frequency-domain features are then extracted from the preprocessed frames according to the preset feature extraction rule; the extracted feature may be one or a combination of linear predictive coding coefficients (LPC), linear predictive cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), and linear-prediction Mel-frequency cepstral coefficients (LPCMFCC).
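The preprocessing chain named above can be sketched as follows. The 0.97 pre-emphasis coefficient and the Hamming window are conventional defaults assumed here, and the log power spectrum stands in for the MFCC/LPCC computation, which the patent leaves to the chosen extraction rule:

```python
import numpy as np

def preprocess_frame(frame, alpha=0.97):
    f = frame - frame.mean()                               # zero-mean
    f = np.concatenate(([f[0]], f[1:] - alpha * f[:-1]))   # pre-emphasis
    return f * np.hamming(len(f))                          # windowing

def frame_feature(frame):
    """Stand-in frequency-domain feature: log power spectrum of the frame."""
    spec = np.abs(np.fft.rfft(preprocess_frame(frame))) ** 2
    return np.log(spec + 1e-10)

feat = frame_feature(np.random.default_rng(0).standard_normal(200))
print(feat.shape)  # (101,) — one feature vector per frame
```

Whatever feature is chosen, the essential property for the rest of the pipeline is that every frame yields a vector of the same dimensionality.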
Note that the framing rule and feature extraction rule mentioned in subsequent embodiments are identical to those of this embodiment and are not described again below.
Step 14: determine, from the obtained frame features and the cluster centres used to distinguish audio-frame categories, the cluster centre corresponding to each audio frame.
The cluster centre corresponding to an audio frame is the one whose feature, among the features of all cluster centres, has the greatest similarity to the feature of that frame. In general, the features of audio frames and cluster centres are vectors, i.e. feature vectors, so the similarity between a frame's feature and a centre's feature can be expressed as the distance between the two feature vectors: the smaller the distance, the more similar the vectors, i.e. the greater the similarity; conversely, the larger the distance, the more the two vectors differ, i.e. the smaller the similarity.
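Taking Euclidean distance as the inverse similarity measure described above (the patent does not fix a particular distance), assigning each frame to its most similar cluster centre can be sketched as:

```python
import numpy as np

def assign_centres(frame_feats, centres):
    """Return, for each frame feature, the index of the closest cluster centre.
    Smaller Euclidean distance = greater similarity."""
    d = np.linalg.norm(frame_feats[:, None, :] - centres[None, :, :], axis=2)
    return d.argmin(axis=1)

frames = np.array([[0.0, 0.1], [9.8, 10.2], [10.0, 9.0]])
centres = np.array([[0.0, 0.0], [10.0, 10.0]])
print(assign_centres(frames, centres))  # [0 1 1]
```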
In this step, the cluster centres used to distinguish audio-frame categories are obtained by dividing each audio sample into multiple audio sample frames according to the framing rule of Step 12, extracting the feature of each sample frame according to the feature extraction rule of Step 13, and clustering the extracted sample-frame features. Many mature clustering methods exist in the prior art and are not detailed here.
If the features of all sample frames in each audio sample are clustered together, i.e. each sample is clustered without segmentation, features of frames occurring later in the sample may be clustered with features of frames occurring earlier. This destroys the temporal ordering of the frames within the sample, and the cluster centres obtained are of relatively poor quality. To avoid this, each sample can be divided into audio sample sections and clustered section by section, which yields cluster centres of better quality. This approach is described in detail in Embodiment 2 below and is not repeated here.
Note that the number of cluster centres can be a constant set freely according to user requirements.
Step 15: determine the number of audio frames corresponding to each cluster centre, and determine the feature of the audio from the numbers so determined.
Specifically, the numbers of frames corresponding to the cluster centres may first be normalized, and the normalized numbers, in the form of a feature-vector statistical histogram, taken as the feature of the audio.
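The normalized-count histogram described here has a length fixed by the number of cluster centres, and is therefore independent of how many frames the audio produced. A minimal sketch:

```python
import numpy as np

def audio_feature(assignments, n_centres):
    """Normalized histogram of frame-to-centre assignments: a fixed-length
    feature vector regardless of how many frames the audio yielded."""
    counts = np.bincount(assignments, minlength=n_centres).astype(float)
    return counts / counts.sum()

# 4 frames, 4 cluster centres -> a length-4 feature summing to 1
print(audio_feature(np.array([0, 1, 1, 2]), 4))  # [0.25 0.5  0.25 0.  ]
```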
When multiple audios of known class are obtained, classification training can be performed on their features according to the known classes, yielding a classifier for distinguishing audio categories.
As the above method provided by Embodiment 1 shows, because the number of cluster centres obtained in advance for distinguishing audio-frame categories is fixed, the length of the final audio feature is constant regardless of the duration of the obtained audio, which solves the prior-art problem that features of equal length cannot be extracted from audios of different durations.
Embodiment 2:
Building on the method of Embodiment 1, Embodiment 2 of the present invention provides an audio feature extraction method; the flowchart of the method, shown in Fig. 2, mainly comprises the following steps.
Step 21: obtain audios.
The same operations below are performed for each obtained audio, so the following steps are described for a single audio.
Step 22: divide the audio according to the preset framing rule, to obtain multiple audio frames.
Step 23: perform feature extraction on each of the audio frames according to the preset feature extraction rule, to obtain the feature of each audio frame.
Step 24: determine a predetermined number of audio sections from the obtained audio frames and a preset second inter-section overlap percentage, and record the temporal ordering of the audio sections so determined.
Each audio section contains the same number of consecutive audio frames.
Step 25: for each audio section, determine the cluster centre corresponding to each audio frame that the section contains.
The cluster centre corresponding to an audio frame is the one whose feature, among the features of the relevant cluster centres, has the greatest similarity to the feature of that frame.
Specifically, this step may comprise: determining, from the recorded temporal ordering of the audio sections, the position of this section in the ordering; then, according to the ordering recorded for the audio samples, selecting from the cluster centres obtained for each position those corresponding to this section's position; and finally comparing the feature of each frame contained in the section with the cluster centres for that position, to determine the cluster centre corresponding to each frame the section contains.
The cluster centres corresponding to each position can be obtained as follows.
First, for each audio sample: divide the sample according to the framing rule, obtaining multiple audio sample frames; extract the feature of each sample frame according to the feature extraction rule; determine a predetermined number of audio sample sections from the obtained sample frames and a preset first inter-section overlap percentage; and record the temporal ordering of the sample sections so obtained. Each sample section contains the same number of consecutive sample frames.
For example, with the predetermined number set to 2, the first inter-section overlap percentage set to 30%, and a sample divided into N frames, two sample sections are obtained: the first section runs from frame 1 to frame 10N/17, and the second section from frame 7N/17 + 1 to frame N. If the first inter-section overlap percentage is 0%, the first section runs from frame 1 to frame N/2 and the second from frame N/2 + 1 to frame N.
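The section boundaries in this example follow from solving k·L − overlap·L·(k−1) = N for the section length L (here k = 2, overlap = 0.3, giving L = 10N/17). A sketch, using 0-based half-open frame ranges where the patent counts frames from 1:

```python
def section_bounds(n_frames, n_sections, overlap):
    """Half-open (start, end) frame ranges for n_sections equal-length
    sections covering n_frames frames with the given fractional overlap."""
    # n_sections * seg_len - overlap * seg_len * (n_sections - 1) = n_frames
    seg_len = n_frames / (n_sections - overlap * (n_sections - 1))
    hop = seg_len * (1 - overlap)
    return [(int(round(i * hop)), min(int(round(i * hop + seg_len)), n_frames))
            for i in range(n_sections)]

# N = 17 frames, 2 sections, 30% overlap: the first section covers 10 frames
# (10N/17) and the second starts at frame index 7 (= 7N/17), as in the text.
print(section_bounds(17, 2, 0.3))  # [(0, 10), (7, 17)]
```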
Then, according to the ordering recorded for each audio sample, the features of all the audio sample frames contained in the audio sample sections occupying the same arrangement position in the ordering are clustered together, yielding the cluster centres corresponding to each arrangement position.
Taking the case where each audio sample is divided into two audio sample sections as an example: the features of the audio sample frames contained in the first sections of all the audio samples are clustered to obtain the cluster centres corresponding to the first sections, and the features of the audio sample frames contained in the second sections of all the audio samples are clustered to obtain the cluster centres corresponding to the second sections.
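A per-position clustering step of this kind might look as follows. The patent does not fix a particular clustering algorithm, so a plain k-means is assumed for illustration; `pools[pos]` is assumed to already stack the frame features contributed at section position `pos` by all training samples, and all names are mine.

```python
import numpy as np

def kmeans(feats, k, iters=20, seed=0):
    """Plain Lloyd-style k-means on an (n, d) feature matrix."""
    rng = np.random.default_rng(seed)
    centres = feats[rng.choice(len(feats), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign every feature vector to its nearest centre
        dists = np.linalg.norm(feats[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = feats[labels == j].mean(axis=0)
    return centres

def centres_per_position(pools, ks):
    """Cluster each position's pooled frame features separately,
    giving K1, K2, ..., Km centres per arrangement position."""
    return [kmeans(pool, k) for pool, k in zip(pools, ks)]

# Two toy "positions", each pool containing two obvious clusters
pools = [np.vstack([np.zeros((10, 2)), np.full((10, 2), 10.0)]),
         np.vstack([np.ones((10, 2)), np.full((10, 2), -5.0)])]
first, second = centres_per_position(pools, ks=[2, 2])
print(sorted(first[:, 0].tolist()))   # -> [0.0, 10.0]
print(sorted(second[:, 0].tolist()))  # -> [-5.0, 1.0]
```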
It should be noted that the number of cluster centres obtained can be a constant set according to user requirements.
Step 26: for each audio section, determine the number of audio frames corresponding to each cluster centre associated with the arrangement position of that audio section, and determine the feature of that audio section from the determined numbers;
Step 27: combine the features of the audio sections according to their arrangement positions to obtain the feature of the audio.
Again taking the case where each audio sample is divided into two audio sample sections as an example: after the features of the first section and the second section are determined, they can be combined, according to preset weights, in order from the first section to the second section, to obtain the feature of the audio.
As can be seen from the above method provided by embodiment two of the present invention, because the number of cluster centres obtained in advance for distinguishing audio frame classes is fixed, the length of the resulting audio feature is constant regardless of the duration of the audio, thereby solving the prior-art problem that features of equal length cannot be extracted from audios of different durations.
Further, because clustering in this embodiment is performed per audio sample section, the temporal order of the audio frames within each audio sample is preserved, so the resulting cluster centres are more effective.
Embodiment three:
Embodiment three of the present invention further provides an audio classification method. The flow of the method, shown in Figure 3, mainly comprises the following steps:
Step 31: divide the audio to be classified according to a preset framing rule, obtaining multiple audio frames;
Step 32: perform feature extraction on the multiple audio frames according to a preset feature extraction rule, obtaining the feature of each audio frame;
Step 33: determine the cluster centre corresponding to each audio frame, according to the obtained features of the audio frames and the cluster centres used for distinguishing audio frame classes;
Wherein, the cluster centre corresponding to an audio frame is the cluster centre whose feature has the greatest similarity with the feature of that audio frame, among the similarities between the feature of that audio frame and the features of all the cluster centres.
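As a sketch of this nearest-centre assignment, with Euclidean distance standing in for the unspecified similarity measure (smaller distance means greater similarity; all names are illustrative):

```python
import numpy as np

def assign_frames(frame_feats, centres):
    """Return, for each frame feature, the index of the cluster centre
    whose feature is most similar (smallest Euclidean distance)."""
    dists = np.linalg.norm(frame_feats[:, None, :] - centres[None, :, :], axis=2)
    return dists.argmin(axis=1)

frames = np.array([[0.0, 0.1], [9.0, 9.5], [0.2, 0.0]])
centres = np.array([[0.0, 0.0], [10.0, 10.0]])
print(assign_frames(frames, centres).tolist())  # -> [0, 1, 0]
```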
Optionally, when the cluster centres used for distinguishing audio frame classes are the cluster centres corresponding to the arrangement positions, this step determines the cluster centre corresponding to each audio frame contained in each audio section.
Step 34: determine the number of audio frames corresponding to each cluster centre, and determine the feature of the audio from the determined numbers;
Wherein, for the specific implementation of steps 31 to 34, refer to the related description in embodiment one or embodiment two; it is not repeated here.
Step 35: determine the class of the audio according to the determined feature of the audio and a classifier for distinguishing audio classes;
Wherein, the classifier is obtained by classification training on the features of the audio samples, and the feature of each audio sample is obtained from the features of its audio frames and the above cluster centres. Many mature classification-training methods exist in the prior art and are not described in detail here.
In an embodiment of the present invention, the determined feature of the audio can be fed into the trained classifier, which compares the feature of the audio with the trained feature of each class. Specifically, the classifier compares the feature of the audio with the feature of each class and selects the class whose feature has the greatest similarity with the feature of the audio as the class of the audio.
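A minimal sketch of such a nearest-class decision. Cosine similarity is assumed here since the patent leaves the similarity measure open, and all names are illustrative:

```python
import numpy as np

def classify(audio_feat, class_feats):
    """Pick the class whose trained feature is most similar to the
    audio feature; cosine similarity is assumed for illustration."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(class_feats, key=lambda c: cos(audio_feat, class_feats[c]))

feat = np.array([0.9, 0.1, 0.0])
classes = {"speech": np.array([1.0, 0.0, 0.1]),
           "music":  np.array([0.0, 1.0, 0.2])}
print(classify(feat, classes))  # -> speech
```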
As can be seen from the above method provided by embodiment three of the present invention, because the number of cluster centres obtained in advance for distinguishing audio frame classes is fixed, the length of the resulting audio feature is constant regardless of the duration of the audio. Audio of any duration can therefore be classified, and it is no longer necessary to require the audio to be classified to have a specified duration.
Embodiment four:
Further analysis of embodiment three reveals a defect in this audio classification method: when classifying audio data of different tempos, the classification effect is poor. If fast audio samples predominate during training and the audio to be classified is slow, the classification effect is very poor; likewise, if slow audio samples predominate during training and the audio to be classified is fast, the classification effect is also very poor.
To provide a classification embodiment that can adapt to audios of different tempos, embodiment four of the present invention provides an audio classification method. The flow of the method, shown in Figure 4, comprises the following steps:
Step 41: divide the audio to be classified according to a preset framing rule, obtaining multiple audio frames;
Step 42: perform feature extraction on the multiple audio frames according to a preset feature extraction rule, obtaining the feature of each audio frame;
Steps 43, 44 and 45 below are performed in sequence at least twice, and a different second inter-section overlap percentage is used each time step 43 is performed.
Step 43: determine a predetermined number of audio sections according to the obtained audio frames and a preset second inter-section overlap percentage, and determine the cluster centre corresponding to each audio frame contained in each audio section, according to the determined audio sections and the cluster centres used for distinguishing audio frame classes;
Wherein, the cluster centre corresponding to an audio frame is the cluster centre whose feature has the greatest similarity with the feature of that audio frame, among all the cluster centres.
Optionally, the cluster centres can be obtained as follows:
For each audio sample: divide the audio sample according to the framing rule to obtain multiple audio sample frames; extract the feature of each audio sample frame according to the feature extraction rule; obtain a predetermined number of audio sample sections according to the obtained audio sample frames and a preset first inter-section overlap percentage; and record the temporal ordering of the obtained audio sample sections, wherein each audio sample section contains the same number of consecutive audio sample frames;
Then, according to the ordering recorded for each audio sample, cluster the features of all the audio sample frames contained in the audio sample sections occupying the same arrangement position in the ordering, obtaining the cluster centres corresponding to each arrangement position.
Step 44: determine the number of audio frames corresponding to each cluster centre, and determine the feature of the audio from the determined numbers;
Wherein, for the specific implementation of steps 41 to 44, refer to the related description in embodiment one or embodiment two; it is not repeated here.
Step 45: determine a classification result according to the determined feature of the audio and a classifier for distinguishing audio classes;
Wherein, the classifier is obtained by classification training on the features of the audio samples, and the feature of each audio sample is obtained from the features of its audio frames and the above cluster centres.
In an embodiment of the present invention, the determined feature of the audio can be fed into the trained classifier, which compares the feature of the audio with the trained feature of each class and selects the class whose feature has the greatest similarity with the feature of the audio as the class of the audio.
Because steps 43, 44 and 45 are performed in sequence at least twice, multiple features of the audio are obtained in step 44; in step 45 these features are fed into the trained classifier, and for each feature the class with the greatest similarity is selected, yielding multiple classification results for the audio.
Step 46: determine the class of the audio according to the determined classification results.
Specifically, the implementation of step 46 can comprise: determining the class of the audio from the obtained classification results by voting.
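A majority vote over the per-run classification results can be sketched as (illustrative names):

```python
from collections import Counter

def vote(results):
    """Majority vote over the per-run classification results."""
    return Counter(results).most_common(1)[0][0]

print(vote(["music", "speech", "music"]))  # -> music
```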
As can be seen from the above method provided by embodiment four of the present invention, the method can classify audio of any duration; moreover, because it divides the same audio to be classified into audio sections at least twice and obtains a feature for each inter-section overlap percentage used, its adaptability to the audio to be classified is improved: it suits both the case where fast audio predominates in the classifier's training samples and the case where slow audio predominates. The method is therefore more widely applicable and is robust to audios of different tempos.
Embodiment five:
A specific practical application flow of the method provided by embodiment four of the present invention is introduced below. As shown in Figure 5, this application flow comprises the following steps:
Step 51: collect audio samples of each known class, and extract the cluster centres and the classifier from the collected audio samples of each known class;
Wherein, for the method of extracting the cluster centres and the classifier, refer to the description in embodiment two; it is not repeated here.
In this embodiment, with the predetermined number M, each audio sample is divided into M audio sample sections; K1 cluster centres are then obtained for the first sections of all the audio samples, K2 cluster centres for the second sections, and so on, up to Km cluster centres for the M-th sections. In general, a value of M between 3 and 10 works best.
Step 52: for an audio to be classified, divide it according to a preset framing rule, obtaining multiple audio frames;
Step 53: perform feature extraction on the multiple audio frames according to a preset feature extraction rule, obtaining the feature of each audio frame;
Step 54: determine M audio sections according to the obtained audio frames and a preset second inter-section overlap percentage;
Wherein, each audio section contains the same number of consecutive audio frames.
Step 55: determine the feature of each audio section;
Specifically, for each audio section:
First, determine the cluster centres corresponding to the audio section; for example, the first audio section corresponds to K1 cluster centres, the second audio section to K2 cluster centres, and the M-th audio section to Km cluster centres;
Then, compare the feature of each audio frame contained in the audio section with the determined cluster centres, determine the cluster centre corresponding to each audio frame, and count the number of audio frames corresponding to each of the cluster centres corresponding to the audio section;
Finally, determine the feature of the audio section from the counted numbers of audio frames corresponding to the cluster centres.
Take the first audio section as an example. Suppose the first audio section contains 100 audio frames and corresponds to 10 cluster centres (K1=10). Ten counters are set up, each initialised to 0 and each associated with one cluster centre. First, the distance between the feature of each audio frame and the feature of each of the 10 cluster centres is computed, and the counter of the nearest cluster centre is incremented by 1; for example, if the feature of the first audio frame is nearest to the feature of the third cluster centre, the third counter is incremented by 1. Proceeding in this way, the counts on the 10 counters finally sum to 100. Then each of the 10 counts is divided by the total number of audio frames, 100, giving 10 decimals between 0 and 1. These 10 decimals constitute the feature of the audio section, which can be represented in the form of a histogram feature vector of length 10.
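The counter procedure above amounts to a normalised bag-of-frames histogram; a minimal sketch, again assuming Euclidean distance (names are mine):

```python
import numpy as np

def section_feature(frame_feats, centres):
    """Count how many frames fall to each cluster centre, then divide
    by the total frame count; yields a histogram summing to 1 whose
    length equals the number of centres."""
    dists = np.linalg.norm(frame_feats[:, None, :] - centres[None, :, :], axis=2)
    counts = np.bincount(dists.argmin(axis=1), minlength=len(centres))
    return counts / len(frame_feats)

centres = np.array([[0.0, 0.0], [10.0, 10.0]])
frames = np.array([[0.1, 0.0], [0.0, 0.2], [0.3, 0.1], [9.8, 10.1]])
print(section_feature(frames, centres).tolist())  # -> [0.75, 0.25]
```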
Step 56: determine the feature of the audio;
Specifically, the features of the obtained audio sections are combined in section order and weighted, yielding the feature of the audio, whose total length is K1+K2+...+Km.
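The combination step can be sketched as follows; the weight values are illustrative placeholders, since the text only says the weights are preset:

```python
import numpy as np

def audio_feature(section_feats, weights):
    """Concatenate the per-section histograms in section order, each
    scaled by its weight; the result has length K1+K2+...+Km."""
    return np.concatenate([w * np.asarray(f)
                           for w, f in zip(weights, section_feats)])

sec1 = [0.2, 0.8]          # K1 = 2
sec2 = [0.5, 0.25, 0.25]   # K2 = 3
feat = audio_feature([sec1, sec2], weights=[1.0, 1.0])
print(len(feat))  # -> 5, i.e. K1 + K2
```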
Step 57: determine a classification result according to the determined feature of the audio and a classifier for distinguishing audio classes;
Steps 53 to 57 above are performed in sequence at least twice, and a different second inter-section overlap percentage is used in step 54 each time.
Step 58: determine the class of the audio by voting on the determined classification results.
As can be seen from the above method provided by embodiment five of the present invention, the method can classify audio of any duration; moreover, because it divides the same audio to be classified into audio sections at least twice and obtains a feature for each inter-section overlap percentage used, its adaptability to the audio to be classified is improved: it suits both the case where fast audio predominates in the classifier's training samples and the case where slow audio predominates. The method is therefore more widely applicable and is robust to audios of different tempos.
Embodiment six:
An embodiment of the present invention also provides an audio feature extraction device. The structure of the device, shown in Figure 6, mainly comprises the following functional units:
an obtaining unit 61, configured to obtain audio;
a framing unit 62, configured to perform, for each audio obtained by the obtaining unit 61: dividing the audio according to a preset framing rule, obtaining multiple audio frames;
a feature extraction unit 63, configured to perform feature extraction on the multiple audio frames obtained by the framing unit 62 according to a preset feature extraction rule, obtaining the feature of each audio frame;
a cluster centre determining unit 64, configured to determine the cluster centre corresponding to each audio frame, according to the features of the audio frames obtained by the feature extraction unit 63 and the cluster centres used for distinguishing audio frame classes;
wherein, the cluster centre corresponding to an audio frame is the cluster centre whose feature has the greatest similarity with the feature of that audio frame, among all the cluster centres;
and the cluster centres are obtained by dividing each audio sample into multiple audio sample frames according to the above framing rule, extracting the feature of each audio sample frame according to the above feature extraction rule, and clustering the extracted features of the audio sample frames;
a feature determining unit 65, configured to determine the number of audio frames corresponding to each cluster centre, and to determine the feature of the audio from the determined numbers.
Optionally, the cluster centres can be obtained as follows:
First, for each audio sample: divide the audio sample according to the above framing rule to obtain multiple audio sample frames; extract the feature of each audio sample frame according to the above feature extraction rule; obtain a predetermined number of audio sample sections according to the obtained audio sample frames and a preset first inter-section overlap percentage; and record the temporal ordering of the obtained audio sample sections, wherein each audio sample section contains the same number of consecutive audio sample frames;
Then, according to the ordering recorded for each audio sample, cluster the features of all the audio sample frames contained in the audio sample sections occupying the same arrangement position in the ordering, obtaining the cluster centres corresponding to each arrangement position.
When the cluster centres are obtained in the above way, the cluster centre determining unit 64 can specifically comprise:
an audio section determination module 641, configured to determine a predetermined number of audio sections according to the audio frames obtained by the framing unit 62 and a preset second inter-section overlap percentage, and to record the temporal ordering of the determined audio sections, wherein each audio section contains the same number of consecutive audio frames;
an arrangement position determination module 642, configured to perform, for each audio section determined by the audio section determination module: determining the arrangement position of the audio section according to the recorded temporal ordering of the audio sections;
a cluster centre first determination module 643, configured to determine, according to the ordering recorded for the audio samples, the cluster centres corresponding to the arrangement position of the audio section, from among the cluster centres obtained for each arrangement position;
a cluster centre second determination module 644, configured to compare the feature of each audio frame contained in the audio section with each of the cluster centres determined by the cluster centre first determination module 643, determining the cluster centre corresponding to each audio frame contained in the audio section.
The feature determining unit 65 can specifically comprise:
a section feature determination module 651, configured to perform, for each audio section: determining the number of audio frames corresponding to each cluster centre associated with the arrangement position of the audio section, and determining the feature of the audio section from the determined numbers;
an audio feature determination module 652, configured to combine the features of the audio sections determined by the section feature determination module 651, according to the arrangement positions of the audio sections, into the feature of the audio.
Embodiment seven:
An embodiment of the present invention also provides an audio classification device. The structure of the device, shown in Figure 7, mainly comprises the following functional units:
a framing unit 71, configured to divide the audio to be classified according to a preset framing rule, obtaining multiple audio frames;
a feature extraction unit 72, configured to perform feature extraction on the multiple audio frames obtained by the framing unit 71 according to a preset feature extraction rule, obtaining the feature of each audio frame;
a classification result determining unit 73, configured to perform the following steps in sequence at least twice:
step one: determining a predetermined number of audio sections according to the obtained audio frames and a preset second inter-section overlap percentage, and determining the cluster centre corresponding to each audio frame contained in each audio section, according to the determined audio sections and the cluster centres used for distinguishing audio frame classes;
wherein, the cluster centre corresponding to an audio frame is the cluster centre whose feature has the greatest similarity with the feature of that audio frame, among all the cluster centres;
the cluster centres are obtained by dividing each audio sample into multiple audio sample frames according to the above framing rule, extracting the feature of each audio sample frame according to the above feature extraction rule, and clustering the extracted features of the audio sample frames;
and when step one is performed at least twice, a different second inter-section overlap percentage is used each time;
step two: determining the number of audio frames corresponding to each cluster centre, and determining the feature of the audio from the determined numbers;
step three: determining a classification result according to the determined feature of the audio and a classifier for distinguishing audio classes;
wherein, the classifier is obtained by classification training on the features of the audio samples, and the feature of each audio sample is obtained from the features of its audio frames and the above cluster centres;
a classification determining unit 74, configured to determine the class of the audio according to the classification results determined by the classification result determining unit 73.
Optionally, the cluster centres can be obtained as follows:
First, for each audio sample: divide the audio sample according to the above framing rule to obtain multiple audio sample frames; extract the feature of each audio sample frame according to the above feature extraction rule; obtain a predetermined number of audio sample sections according to the obtained audio sample frames and a preset first inter-section overlap percentage; and record the temporal ordering of the obtained audio sample sections, wherein each audio sample section contains the same number of consecutive audio sample frames;
Then, according to the ordering recorded for each audio sample, cluster the features of all the audio sample frames contained in the audio sample sections occupying the same arrangement position in the ordering, obtaining the cluster centres corresponding to each arrangement position.
When the cluster centres are obtained in the above way, step one in the classification result determining unit 73 comprises:
determining a predetermined number of audio sections according to the obtained audio frames and a preset second inter-section overlap percentage, and recording the temporal ordering of the determined audio sections, wherein each audio section contains the same number of consecutive audio frames;
and performing, for each audio section:
determining the arrangement position of the audio section according to the recorded temporal ordering of the audio sections;
determining, according to the ordering recorded for the audio samples, the cluster centres corresponding to the arrangement position of the audio section, from among the cluster centres obtained for each arrangement position;
and comparing the feature of each audio frame contained in the audio section with each of the cluster centres corresponding to the arrangement position of the audio section, determining the cluster centre corresponding to each audio frame contained in the audio section.
Step two in the classification result determining unit 73 comprises:
performing, for each audio section:
determining the number of audio frames corresponding to each cluster centre associated with the arrangement position of the audio section, and determining the feature of the audio section from the determined numbers;
and combining the features of the determined audio sections, according to their arrangement positions, into the feature of the audio.
Optionally, the classification determining unit 74 can be specifically configured to:
determine the class of the audio by voting, according to the classification results obtained by the classification result determining unit 73.
Embodiment eight:
An embodiment of the present invention also provides an audio classification device. The structure of the device, shown in Figure 8, mainly comprises the following functional units:
a framing unit 81, configured to divide the audio to be classified according to a preset framing rule, obtaining multiple audio frames;
a feature extraction unit 82, configured to perform feature extraction on the multiple audio frames obtained by the framing unit 81 according to a preset feature extraction rule, obtaining the feature of each audio frame;
a cluster centre determining unit 83, configured to determine the cluster centre corresponding to each audio frame, according to the features of the audio frames obtained by the feature extraction unit 82 and the cluster centres used for distinguishing audio frame classes;
wherein, the cluster centre corresponding to an audio frame is the cluster centre whose feature has the greatest similarity with the feature of that audio frame, among all the cluster centres;
the cluster centres are obtained by dividing each audio sample into multiple audio sample frames according to the above framing rule, extracting the feature of each audio sample frame according to the above feature extraction rule, and clustering the extracted features of the audio sample frames;
a feature determining unit 84, configured to determine the number of audio frames corresponding to each cluster centre, and to determine the feature of the audio from the determined numbers;
a classification determining unit 85, configured to determine the class of the audio according to the feature of the audio determined by the feature determining unit 84 and a classifier for distinguishing audio classes;
wherein, the classifier is obtained by classification training on the features of the audio samples, and the feature of each audio sample is obtained from the features of its audio frames and the above cluster centres.
Optionally, the above cluster centres can be obtained as follows:
First, for each audio sample: divide the audio sample according to the above framing rule to obtain multiple audio sample frames; extract the feature of each audio sample frame according to the above feature extraction rule; obtain a predetermined number of audio sample sections according to the obtained audio sample frames and a preset first inter-section overlap percentage; and record the temporal ordering of the obtained audio sample sections, wherein each audio sample section contains the same number of consecutive audio sample frames;
Then, according to the ordering recorded for each audio sample, cluster the features of all the audio sample frames contained in the audio sample sections occupying the same arrangement position in the ordering, obtaining the cluster centres corresponding to each arrangement position.
When adopting aforesaid way to obtain each cluster centre, cluster centre determining unit 83 can specifically comprise:
Audio section determination module 831, for each audio frame obtained according to point frame unit 81 and intersegmental overlapping second number percent pre-set, determines the audio section of predetermined number; And record each audio section ordering in time determined; Wherein, the continuous print audio frame of equal number is all comprised in each audio section;
Arrangement position determination module 832, performs respectively for each audio section determined for audio section determination module 831: according to each audio section ordering in time of record, determine the arrangement position of this audio section;
Cluster centre first determination module 833, for according to the ordering recorded for each audio sample, from corresponding respectively to each cluster centre of each arrangement position of obtaining, determines and each cluster centre corresponding to the arrangement position of this audio section;
Cluster centre second determination module 834, each cluster centre that the feature for each audio frame comprised by this audio section is determined with cluster centre first determination module 833 respectively compares, and determines the cluster centre that each audio frame that this audio section comprises is corresponding respectively.
The feature determining unit 84 may specifically comprise:
A section feature determination module 841, configured to perform, for each audio section: determining the number of audio frames corresponding to each cluster centre corresponding to the arrangement position of the audio section, and determining the feature of the audio section according to the determined numbers;
An audio feature determination module 842, configured to combine, according to the arrangement positions of the audio sections, the features of the audio sections determined by the section feature determination module 841 into the feature of the audio.
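Taken together, modules 841 and 842 turn a variable-length audio into a fixed-length descriptor: one histogram per audio section over that section's position-specific cluster centres, concatenated in arrangement-position order. The sketch below uses the same illustrative assumptions as before (Euclidean nearest-centre assignment, equal-length overlapping sections, a hypothetical `centres_per_position` input).

```python
import numpy as np

def audio_feature(frame_features, centres_per_position, overlap=0.5):
    """Modules 841/842 in miniature: for each audio section, count how many
    of its frames fall to each cluster centre of that section's arrangement
    position (a per-section histogram), then concatenate the histograms in
    arrangement-position order. The result has a fixed length of
    num_sections * k regardless of the audio's duration."""
    num_sections = len(centres_per_position)
    n = len(frame_features)
    sec_len = int(np.ceil(n / (1 + (num_sections - 1) * (1 - overlap))))
    stride = max(1, int(sec_len * (1 - overlap)))
    parts = []
    for pos in range(num_sections):
        sec = frame_features[pos * stride : pos * stride + sec_len]
        centres = centres_per_position[pos]
        labels = np.argmin(
            ((sec[:, None, :] - centres[None, :, :]) ** 2).sum(-1), axis=1)
        # Histogram over the k centres of this position = the section feature.
        parts.append(np.bincount(labels, minlength=len(centres)))
    return np.concatenate(parts)
```

The output length is num_sections × k however many frames the audio contains, which is exactly the fixed-length property that audios of differing duration fail to have under frame-level features alone.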
Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art, once apprised of the basic inventive concept, can make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to encompass these changes and variations.

Claims (26)

1. A feature extraction method for audio, characterized by comprising:
obtaining audio, and performing the following operations for each obtained audio:
dividing the audio according to a preset framing rule, to obtain multiple audio frames;
performing feature extraction on the multiple audio frames according to a preset feature extraction rule, to obtain the feature of each audio frame;
determining, according to the obtained feature of each audio frame and cluster centres for distinguishing audio frame categories, the cluster centre respectively corresponding to each audio frame; wherein the cluster centre corresponding to each audio frame satisfies: among the similarities between the feature of the audio frame and the features of the cluster centres, the similarity between the feature of the audio frame and the feature of its corresponding cluster centre is the greatest; and the cluster centres are obtained by dividing each audio sample into multiple audio sample frames according to the framing rule, extracting the feature of each audio sample frame according to the feature extraction rule, and then clustering the extracted features of the audio sample frames;
determining the number of audio frames respectively corresponding to each cluster centre, and determining the feature of the audio according to the determined numbers.
2. The method according to claim 1, characterized in that the manner of obtaining the cluster centres specifically comprises:
performing, for each audio sample: dividing the audio sample according to the framing rule, to obtain multiple audio sample frames; extracting the feature of each audio sample frame according to the feature extraction rule; obtaining a predetermined number of audio sample sections according to the obtained audio sample frames and a preset first inter-section overlap percentage; and recording the time ordering of the obtained audio sample sections; wherein each audio sample section contains the same number of consecutive audio sample frames;
clustering, according to the ordering recorded for each audio sample, the features of the audio sample frames contained in all audio sample sections at the same arrangement position in the ordering, to obtain cluster centres respectively corresponding to each arrangement position.
3. The method according to claim 2, characterized in that determining, according to the obtained feature of each audio frame and the cluster centres, the cluster centre respectively corresponding to each audio frame specifically comprises:
determining a predetermined number of audio sections according to the obtained audio frames and a preset second inter-section overlap percentage; and recording the time ordering of the determined audio sections; wherein each audio section contains the same number of consecutive audio frames;
performing, for each audio section:
determining the arrangement position of the audio section according to the recorded time ordering of the audio sections;
determining, according to the ordering recorded for each audio sample and from the obtained cluster centres respectively corresponding to each arrangement position, the cluster centres corresponding to the arrangement position of the audio section;
comparing the feature of each audio frame contained in the audio section with each cluster centre corresponding to the arrangement position of the audio section, to determine the cluster centre respectively corresponding to each audio frame contained in the audio section.
4. The method according to claim 3, characterized in that determining the number of audio frames respectively corresponding to each cluster centre, and determining the feature of the audio according to the determined numbers, specifically comprises:
performing, for each audio section:
determining the number of audio frames corresponding to each cluster centre corresponding to the arrangement position of the audio section, and determining the feature of the audio section according to the determined numbers;
combining, according to the arrangement positions of the audio sections, the determined features of the audio sections into the feature of the audio.
5. A feature extraction device for audio, characterized by comprising:
an obtaining unit, configured to obtain audio;
a framing unit, configured to perform, for each audio obtained by the obtaining unit: dividing the audio according to a preset framing rule, to obtain multiple audio frames;
a feature extraction unit, configured to perform feature extraction on the multiple audio frames obtained by the framing unit according to a preset feature extraction rule, to obtain the feature of each audio frame;
a cluster centre determining unit, configured to determine, according to the feature of each audio frame obtained by the feature extraction unit and cluster centres for distinguishing audio frame categories, the cluster centre respectively corresponding to each audio frame; wherein the cluster centre corresponding to each audio frame satisfies: among the similarities between the feature of the audio frame and the features of the cluster centres, the similarity between the feature of the audio frame and the feature of its corresponding cluster centre is the greatest; and the cluster centres are obtained by dividing each audio sample into multiple audio sample frames according to the framing rule, extracting the feature of each audio sample frame according to the feature extraction rule, and then clustering the extracted features of the audio sample frames;
a feature determining unit, configured to determine the number of audio frames respectively corresponding to each cluster centre, and to determine the feature of the audio according to the determined numbers.
6. The device according to claim 5, characterized in that the manner of obtaining the cluster centres specifically comprises:
performing, for each audio sample: dividing the audio sample according to the framing rule, to obtain multiple audio sample frames; extracting the feature of each audio sample frame according to the feature extraction rule; obtaining a predetermined number of audio sample sections according to the obtained audio sample frames and a preset first inter-section overlap percentage; and recording the time ordering of the obtained audio sample sections; wherein each audio sample section contains the same number of consecutive audio sample frames;
clustering, according to the ordering recorded for each audio sample, the features of the audio sample frames contained in all audio sample sections at the same arrangement position in the ordering, to obtain cluster centres respectively corresponding to each arrangement position.
7. The device according to claim 6, characterized in that the cluster centre determining unit specifically comprises:
an audio section determination module, configured to determine a predetermined number of audio sections according to the audio frames obtained by the framing unit and a preset second inter-section overlap percentage, and to record the time ordering of the determined audio sections; wherein each audio section contains the same number of consecutive audio frames;
an arrangement position determination module, configured to perform, for each audio section determined by the audio section determination module: determining the arrangement position of the audio section according to the recorded time ordering of the audio sections;
a cluster centre first determination module, configured to determine, according to the ordering recorded for each audio sample and from the obtained cluster centres respectively corresponding to each arrangement position, the cluster centres corresponding to the arrangement position of the audio section;
a cluster centre second determination module, configured to compare the feature of each audio frame contained in the audio section with each cluster centre determined by the cluster centre first determination module, and to determine the cluster centre respectively corresponding to each audio frame contained in the audio section.
8. The device according to claim 7, characterized in that the feature determining unit specifically comprises:
a section feature determination module, configured to perform, for each audio section: determining the number of audio frames corresponding to each cluster centre corresponding to the arrangement position of the audio section, and determining the feature of the audio section according to the determined numbers;
an audio feature determination module, configured to combine, according to the arrangement positions of the audio sections, the features of the audio sections determined by the section feature determination module into the feature of the audio.
9. A classification method for audio, characterized by comprising:
step 1: dividing audio to be classified according to a preset framing rule, to obtain multiple audio frames;
step 2: performing feature extraction on the multiple audio frames according to a preset feature extraction rule, to obtain the feature of each audio frame;
performing step 3, step 4, and step 5 in sequence at least twice:
step 3: determining a predetermined number of audio sections according to the obtained audio frames and a preset second inter-section overlap percentage; and determining, according to the determined predetermined number of audio sections and cluster centres for distinguishing audio frame categories, the cluster centre respectively corresponding to each audio frame contained in each audio section; wherein the cluster centre corresponding to each audio frame satisfies: among the similarities between the feature of the audio frame and the features of the cluster centres, the similarity between the feature of the audio frame and the feature of its corresponding cluster centre is the greatest; the cluster centres are obtained by dividing each audio sample into multiple audio sample frames according to the framing rule, extracting the feature of each audio sample frame according to the feature extraction rule, and then clustering the extracted features of the audio sample frames; and wherein, when step 3 is performed at least twice, the second inter-section overlap percentage used differs each time;
step 4: determining the number of audio frames respectively corresponding to each cluster centre, and determining the feature of the audio according to the determined numbers;
step 5: determining a classification result according to the determined feature of the audio and a classifier for distinguishing audio categories; wherein the classifier is obtained by performing classification training on the features of the audio samples; and the feature of each audio sample is obtained according to the features of its audio frames and the cluster centres;
step 6: determining the category of the audio according to the determined classification results.
10. The method according to claim 9, characterized in that the manner of obtaining the cluster centres specifically comprises:
performing, for each audio sample: dividing the audio sample according to the framing rule, to obtain multiple audio sample frames; extracting the feature of each audio sample frame according to the feature extraction rule; obtaining a predetermined number of audio sample sections according to the obtained audio sample frames and a preset first inter-section overlap percentage; and recording the time ordering of the obtained audio sample sections; wherein each audio sample section contains the same number of consecutive audio sample frames;
clustering, according to the ordering recorded for each audio sample, the features of the audio sample frames contained in all audio sample sections at the same arrangement position in the ordering, to obtain cluster centres respectively corresponding to each arrangement position.
11. The method according to claim 10, characterized in that step 3 specifically comprises:
determining a predetermined number of audio sections according to the obtained audio frames and a preset second inter-section overlap percentage; and recording the time ordering of the determined audio sections; wherein each audio section contains the same number of consecutive audio frames;
performing, for each audio section:
determining the arrangement position of the audio section according to the recorded time ordering of the audio sections;
determining, according to the ordering recorded for each audio sample and from the obtained cluster centres respectively corresponding to each arrangement position, the cluster centres corresponding to the arrangement position of the audio section;
comparing the feature of each audio frame contained in the audio section with each cluster centre corresponding to the arrangement position of the audio section, to determine the cluster centre respectively corresponding to each audio frame contained in the audio section.
12. The method according to claim 11, characterized in that step 4 specifically comprises:
performing, for each audio section:
determining the number of audio frames corresponding to each cluster centre corresponding to the arrangement position of the audio section, and determining the feature of the audio section according to the determined numbers;
combining, according to the arrangement positions of the audio sections, the determined features of the audio sections into the feature of the audio.
13. The method according to claim 9, characterized in that step 6 specifically comprises:
determining the category of the audio by voting according to the obtained classification results.
14. A classification device for audio, characterized by comprising:
a framing unit, configured to divide audio to be classified according to a preset framing rule, to obtain multiple audio frames;
a feature extraction unit, configured to perform feature extraction on the multiple audio frames obtained by the framing unit according to a preset feature extraction rule, to obtain the feature of each audio frame;
a classification result determining unit, configured to perform the following steps in sequence at least twice:
step 1: determining a predetermined number of audio sections according to the obtained audio frames and a preset second inter-section overlap percentage; and determining, according to the determined predetermined number of audio sections and cluster centres for distinguishing audio frame categories, the cluster centre respectively corresponding to each audio frame contained in each audio section; wherein the cluster centre corresponding to each audio frame satisfies: among the similarities between the feature of the audio frame and the features of the cluster centres, the similarity between the feature of the audio frame and the feature of its corresponding cluster centre is the greatest; the cluster centres are obtained by dividing each audio sample into multiple audio sample frames according to the framing rule, extracting the feature of each audio sample frame according to the feature extraction rule, and then clustering the extracted features of the audio sample frames; and wherein, when step 1 is performed at least twice, the second inter-section overlap percentage used differs each time;
step 2: determining the number of audio frames respectively corresponding to each cluster centre, and determining the feature of the audio according to the determined numbers;
step 3: determining a classification result according to the determined feature of the audio and a classifier for distinguishing audio categories; wherein the classifier is obtained by performing classification training on the features of the audio samples; and the feature of each audio sample is obtained according to the features of its audio frames and the cluster centres;
a classification determination unit, configured to determine the category of the audio according to the classification results determined by the classification result determining unit.
15. The device according to claim 14, characterized in that the manner of obtaining the cluster centres specifically comprises:
performing, for each audio sample: dividing the audio sample according to the framing rule, to obtain multiple audio sample frames; extracting the feature of each audio sample frame according to the feature extraction rule; obtaining a predetermined number of audio sample sections according to the obtained audio sample frames and a preset first inter-section overlap percentage; and recording the time ordering of the obtained audio sample sections; wherein each audio sample section contains the same number of consecutive audio sample frames;
clustering, according to the ordering recorded for each audio sample, the features of the audio sample frames contained in all audio sample sections at the same arrangement position in the ordering, to obtain cluster centres respectively corresponding to each arrangement position.
16. The device according to claim 15, characterized in that the classification result determining unit is specifically configured to:
determine a predetermined number of audio sections according to the obtained audio frames and a preset second inter-section overlap percentage, and record the time ordering of the determined audio sections; wherein each audio section contains the same number of consecutive audio frames;
and to perform, for each audio section:
determining the arrangement position of the audio section according to the recorded time ordering of the audio sections;
determining, according to the ordering recorded for each audio sample and from the obtained cluster centres respectively corresponding to each arrangement position, the cluster centres corresponding to the arrangement position of the audio section;
comparing the feature of each audio frame contained in the audio section with each cluster centre corresponding to the arrangement position of the audio section, to determine the cluster centre respectively corresponding to each audio frame contained in the audio section.
17. The device according to claim 16, characterized in that the classification result determining unit is further configured to:
perform, for each audio section:
determining the number of audio frames corresponding to each cluster centre corresponding to the arrangement position of the audio section, and determining the feature of the audio section according to the determined numbers;
and to combine, according to the arrangement positions of the audio sections, the determined features of the audio sections into the feature of the audio.
18. The device according to claim 14, characterized in that the classification determination unit is specifically configured to:
determine the category of the audio by voting according to the classification results obtained by the classification result determining unit.
19. An audio classification method, characterized by comprising:
dividing audio to be classified according to a preset framing rule, to obtain multiple audio frames;
performing feature extraction on the multiple audio frames according to a preset feature extraction rule, to obtain the feature of each audio frame;
determining, according to the obtained feature of each audio frame and cluster centres for distinguishing audio frame categories, the cluster centre respectively corresponding to each audio frame; wherein the cluster centre corresponding to each audio frame satisfies: among the similarities between the feature of the audio frame and the features of the cluster centres, the similarity between the feature of the audio frame and the feature of its corresponding cluster centre is the greatest; and the cluster centres are obtained by dividing each audio sample into multiple audio sample frames according to the framing rule, extracting the feature of each audio sample frame according to the feature extraction rule, and then clustering the extracted features of the audio sample frames;
determining the number of audio frames respectively corresponding to each cluster centre, and determining the feature of the audio according to the determined numbers;
determining the category of the audio according to the determined feature of the audio and a classifier for distinguishing audio categories; wherein the classifier is obtained by performing classification training on the features of the audio samples; and the feature of each audio sample is obtained according to the features of its audio frames and the cluster centres.
20. The method according to claim 19, characterized in that the manner of obtaining the cluster centres specifically comprises:
performing, for each audio sample: dividing the audio sample according to the framing rule, to obtain multiple audio sample frames; extracting the feature of each audio sample frame according to the feature extraction rule; obtaining a predetermined number of audio sample sections according to the obtained audio sample frames and a preset first inter-section overlap percentage; and recording the time ordering of the obtained audio sample sections; wherein each audio sample section contains the same number of consecutive audio sample frames;
clustering, according to the ordering recorded for each audio sample, the features of the audio sample frames contained in all audio sample sections at the same arrangement position in the ordering, to obtain cluster centres respectively corresponding to each arrangement position.
21. The method according to claim 20, characterized in that determining, according to the obtained feature of each audio frame and the cluster centres, the cluster centre respectively corresponding to each audio frame specifically comprises:
determining a predetermined number of audio sections according to the obtained audio frames and a preset second inter-section overlap percentage; and recording the time ordering of the determined audio sections; wherein each audio section contains the same number of consecutive audio frames;
performing, for each audio section:
determining the arrangement position of the audio section according to the recorded time ordering of the audio sections;
determining, according to the ordering recorded for each audio sample and from the obtained cluster centres respectively corresponding to each arrangement position, the cluster centres corresponding to the arrangement position of the audio section;
comparing the feature of each audio frame contained in the audio section with each cluster centre corresponding to the arrangement position of the audio section, to determine the cluster centre respectively corresponding to each audio frame contained in the audio section.
22. The method according to claim 21, characterized in that determining the number of audio frames respectively corresponding to each cluster centre, and determining the feature of the audio according to the determined numbers, specifically comprises:
performing, for each audio section:
determining the number of audio frames corresponding to each cluster centre corresponding to the arrangement position of the audio section, and determining the feature of the audio section according to the determined numbers;
combining, according to the arrangement positions of the audio sections, the determined features of the audio sections into the feature of the audio.
23. An audio classification device, characterized by comprising:
a framing unit, configured to divide audio to be classified according to a preset framing rule, to obtain multiple audio frames;
a feature extraction unit, configured to perform feature extraction on the multiple audio frames obtained by the framing unit according to a preset feature extraction rule, to obtain the feature of each audio frame;
a cluster centre determining unit, configured to determine, according to the feature of each audio frame obtained by the feature extraction unit and cluster centres for distinguishing audio frame categories, the cluster centre respectively corresponding to each audio frame; wherein the cluster centre corresponding to each audio frame satisfies: among the similarities between the feature of the audio frame and the features of the cluster centres, the similarity between the feature of the audio frame and the feature of its corresponding cluster centre is the greatest; and the cluster centres are obtained by dividing each audio sample into multiple audio sample frames according to the framing rule, extracting the feature of each audio sample frame according to the feature extraction rule, and then clustering the extracted features of the audio sample frames;
a feature determining unit, configured to determine the number of audio frames respectively corresponding to each cluster centre, and to determine the feature of the audio according to the determined numbers;
a classification determination unit, configured to determine the category of the audio according to the feature of the audio determined by the feature determining unit and a classifier for distinguishing audio categories; wherein the classifier is obtained by performing classification training on the features of the audio samples; and the feature of each audio sample is obtained according to the features of its audio frames and the cluster centres.
24. The device as claimed in claim 23, wherein the manner of obtaining each cluster centre specifically comprises:
performing, for each audio sample: dividing the audio sample according to the framing rule to obtain a plurality of audio sample frames; extracting the feature of each audio sample frame according to the feature extraction rule; obtaining a predetermined number of audio sample sections according to the obtained audio sample frames and a preset first inter-section overlap percentage; and recording the ordering in time of the obtained audio sample sections; wherein each audio sample section comprises an equal number of consecutive audio sample frames; and
performing clustering, according to the ordering recorded for each audio sample, on the features of the audio sample frames comprised in all the audio sample sections at the same arrangement position in the ordering, to obtain the cluster centres corresponding respectively to each arrangement position.
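The sectioning and per-position clustering described above can be sketched as follows. This is a hedged reading, not the patented implementation: the helper names are illustrative, the sketch derives the section count from the audio length rather than fixing it in advance, and `cluster` is any clustering routine (e.g. k-means) returning a list of centres:

```python
def split_into_segments(frames, frames_per_seg, overlap_pct):
    """Group consecutive frames into equal-length, partially overlapping
    sections; the list index of each section is its arrangement position
    in time."""
    step = max(1, int(frames_per_seg * (1 - overlap_pct / 100.0)))
    segments = []
    for start in range(0, len(frames) - frames_per_seg + 1, step):
        segments.append(frames[start:start + frames_per_seg])
    return segments

def centres_per_position(samples_segments, cluster):
    """samples_segments holds one section list per audio sample; frames of
    all sections sharing an arrangement position are pooled across samples
    and clustered, yielding one centre set per position."""
    n_positions = min(len(s) for s in samples_segments)
    centres = []
    for pos in range(n_positions):
        frames_at_pos = [f for segs in samples_segments for f in segs[pos]]
        centres.append(cluster(frames_at_pos))
    return centres
```

With 4 frames per section and a 50% overlap percentage, sections start every 2 frames, so a 10-frame sample yields 4 sections at positions 0 to 3.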
25. The device as claimed in claim 24, wherein the cluster centre determining unit specifically comprises:
an audio section determination module, configured to determine a predetermined number of audio sections according to the audio frames obtained by the framing unit and a preset second inter-section overlap percentage, and to record the ordering in time of the determined audio sections; wherein each audio section comprises an equal number of consecutive audio frames;
an arrangement position determination module, configured to perform, for each audio section determined by the audio section determination module: determining the arrangement position of the audio section according to the recorded ordering in time of the audio sections;
a cluster centre first determination module, configured to determine, according to the ordering recorded for each audio sample, the cluster centres corresponding to the arrangement position of the audio section from the obtained cluster centres corresponding respectively to each arrangement position; and
a cluster centre second determination module, configured to compare the feature of each audio frame comprised in the audio section with each cluster centre determined by the cluster centre first determination module, and to determine the cluster centre corresponding to each audio frame comprised in the audio section.
26. The device as claimed in claim 25, wherein the characteristics determining unit specifically comprises:
a section feature determination module, configured to perform, for each audio section: determining the number of audio frames corresponding to each cluster centre corresponding to the arrangement position of the audio section, and determining the feature of the audio section according to the determined numbers; and
an audio feature determination module, configured to combine, according to the arrangement position of each audio section, the features of the audio sections determined by the section feature determination module into the feature of the audio.
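Claim 26 can be read as computing one per-centre histogram for each audio section against that position's centres, then concatenating the histograms in arrangement-position order. A minimal sketch under assumptions (names are illustrative; squared Euclidean distance stands in for the unspecified similarity measure, with minimum distance playing the role of maximum similarity):

```python
def segment_histogram(segment_frames, centres):
    # Count, for each centre at this arrangement position, how many of
    # the section's frames are nearest to it.
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    counts = [0] * len(centres)
    for f in segment_frames:
        best = min(range(len(centres)), key=lambda i: dist2(f, centres[i]))
        counts[best] += 1
    return counts

def audio_feature_from_segments(segments, centres_by_position):
    """Concatenate per-section histograms in arrangement-position order
    to form the feature of the whole audio."""
    feature = []
    for pos, seg in enumerate(segments):
        feature.extend(segment_histogram(seg, centres_by_position[pos]))
    return feature
```

Because every position contributes a histogram of fixed size, the combined feature length depends only on the number of sections and centres, not on the audio duration.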
CN201310255746.2A 2013-06-24 2013-06-24 Feature extraction method for audio, classification method for audio, and related devices Active CN104240719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310255746.2A CN104240719B (en) 2013-06-24 2013-06-24 Feature extraction method for audio, classification method for audio, and related devices

Publications (2)

Publication Number Publication Date
CN104240719A true CN104240719A (en) 2014-12-24
CN104240719B CN104240719B (en) 2018-01-12

Family

ID=52228671

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310255746.2A Active CN104240719B (en) Feature extraction method for audio, classification method for audio, and related devices

Country Status (1)

Country Link
CN (1) CN104240719B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102637433A (en) * 2011-02-09 2012-08-15 富士通株式会社 Method and system for identifying affective state loaded in voice signal
JP2012181280A (en) * 2011-02-28 2012-09-20 Sogo Keibi Hosho Co Ltd Sound processing device and sound processing method
CN102723078A (en) * 2012-07-03 2012-10-10 武汉科技大学 Emotion speech recognition method based on natural language comprehension
CN103871424A (en) * 2012-12-13 2014-06-18 上海八方视界网络科技有限公司 Online speaking people cluster analysis method based on bayesian information criterion

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105895110A (en) * 2016-06-30 2016-08-24 北京奇艺世纪科技有限公司 Method and device for classifying audio files
CN107967912A (en) * 2017-11-28 2018-04-27 广州势必可赢网络科技有限公司 Human voice segmentation method and device
CN108320756A (en) * 2018-02-07 2018-07-24 广州酷狗计算机科技有限公司 Method and apparatus for detecting whether audio is pure music audio
CN110718235A (en) * 2019-09-20 2020-01-21 精锐视觉智能科技(深圳)有限公司 Abnormal sound detection method, electronic device and storage medium
CN111295017A (en) * 2020-02-21 2020-06-16 成都世纪光合作用科技有限公司 Light control method, control system and equipment
CN111295017B (en) * 2020-02-21 2022-03-08 成都世纪光合作用科技有限公司 Light control method, control system and equipment
CN111696580A (en) * 2020-04-22 2020-09-22 广州多益网络股份有限公司 Voice detection method and device, electronic equipment and storage medium
CN113160797A (en) * 2021-04-25 2021-07-23 北京华捷艾米科技有限公司 Audio feature processing method and device, storage medium and electronic equipment
CN113301468A (en) * 2021-05-12 2021-08-24 深圳市美恩微电子有限公司 TWS Bluetooth earphone capable of realizing communication noise reduction
CN113301468B (en) * 2021-05-12 2024-05-31 广东喜中喜科技有限公司 TWS Bluetooth headset capable of realizing call noise reduction

Also Published As

Publication number Publication date
CN104240719B (en) 2018-01-12

Similar Documents

Publication Publication Date Title
CN104240719A (en) Feature extraction method and classification method for audios and related devices
CN111243602B (en) Voiceprint recognition method based on gender, nationality and emotion information
US11386916B2 (en) Segmentation-based feature extraction for acoustic scene classification
CN110322897B (en) Audio retrieval identification method and device
CN103853749B (en) Mode-based audio retrieval method and system
CN110070859B (en) Voice recognition method and device
CN103534755B (en) Sound processing apparatus, sound processing method, program and integrated circuit
CN103985381A (en) Voice frequency indexing method based on parameter fusion optimized decision
CN107480152A (en) A kind of audio analysis and search method and system
CN103229233A (en) Modeling device and method for speaker recognition, and speaker recognition system
CN102436806A (en) Audio frequency copy detection method based on similarity
Castán et al. Albayzín-2014 evaluation: audio segmentation and classification in broadcast news domains
CN116524939A (en) ECAPA-TDNN-based automatic identification method for bird song species
Wang et al. Exploring audio semantic concepts for event-based video retrieval
Anguera Information retrieval-based dynamic time warping.
CN109949798A (en) Commercial detection method and device based on audio
CN112885330A (en) Language identification method and system based on low-resource audio
CN107885845B (en) Audio classification method and device, computer equipment and storage medium
CN113112992B (en) Voice recognition method and device, storage medium and server
CN104239372B (en) A kind of audio data classification method and device
Wang et al. A histogram density modeling approach to music emotion recognition
Vrysis et al. Mobile audio intelligence: From real time segmentation to crowd sourced semantics
Song et al. A compact and discriminative feature based on auditory summary statistics for acoustic scene classification
Lee et al. An efficient audio fingerprint search algorithm for music retrieval
CN111445924A (en) Method for detecting and positioning smooth processing in voice segment based on autoregressive model coefficient

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant