CN109766929A - An SVM-based audio classification method and system - Google Patents

An SVM-based audio classification method and system

Info

Publication number
CN109766929A
CN109766929A
Authority
CN
China
Prior art keywords
audio
classification
frequency
svm
window
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811581291.2A
Other languages
Chinese (zh)
Inventor
韦鹏程
姜娇
周震
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Education
Original Assignee
Chongqing University of Education
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Education filed Critical Chongqing University of Education
Priority to CN201811581291.2A priority Critical patent/CN109766929A/en
Publication of CN109766929A publication Critical patent/CN109766929A/en
Pending legal-status Critical Current

Abstract

The invention belongs to the technical field of audio data analysis and discloses an SVM-based audio classification method and system. Automatic audio classification and segmentation are important means of extracting structural information and semantic content from audio, and form the basis for understanding, analyzing, and retrieving audio content. In essence, the classification of audio data is a pattern recognition problem comprising two basic aspects: feature extraction and selection, and classification. How to extract the information that best represents the characteristics of the audio signal is vital for audio classification. Audio feature extraction can be based on the analysis and extraction of features from audio frames or from audio segments. In these feature extraction methods, the characteristics of the audio are extracted using time-domain and frequency-domain properties respectively. The SVM-based audio classification algorithm of the present invention achieves good classification performance, and the smoothed audio segmentation results are more accurate.

Description

An SVM-based audio classification method and system
Technical field
The invention belongs to the technical field of audio data analysis, and more particularly relates to an SVM-based audio classification method and system.
Background art
At present, the prior art commonly used in the industry is as follows:
Human society has entered the digital age. With the continuous development of computer, network, and communication technology, multimedia information such as images, video, and audio has increasingly become the main form of information media in the field of information processing. Among these, audio occupies a very important position: it is an important component of multimedia. Compared with images and video, audio not only has unique characteristics, but its data volume is small and its processing speed is fast, which has attracted widespread attention. Audio takes many forms of expression and meets people's needs in life, work, and study, and audio data resources on the Internet continue to grow at an unprecedented rate. Quickly and effectively obtaining the required information from the vast amount of audio data on the Internet requires good methods for analyzing, classifying, and retrieving the data. How to effectively organize and manage these audio resources, so that people can more easily find the audio fragments they need, has become an urgent need.
Today, research on audio classification is not limited to music and speech. The categories of classification change with people's needs and improve their work and life. In general, the most basic objects of audio classification are speech, music, and silence; these are further divided into five classes: pure speech, music, environmental sound, background sound, and silence. Audio classification is the basis of deep audio information processing, the core technology of audio structuring, and an important means of extracting audio structure and content semantics. It divides audio data into different categories according to the characteristics of perception or the content of expression, and plays an important role in speech retrieval, content-based audio segmentation, and audio monitoring. On the one hand, it can serve as an initialization step for continuous speech recognition, preventing non-speech streams in the audio stream from entering the speech recognizer, improving the accuracy of speech recognition and shortening recognition time. On the other hand, it is also the first step of music genre classification. A given piece of audio can be classified and segmented through audio classification; after this judgment, different types of audio data receive different processing to obtain the result. Applying different processing methods to different types of audio data not only shortens the time and space consumption of the process, but also improves processing accuracy. At present, research in this field concentrates mainly on three aspects: audio feature analysis and extraction, classifier design and implementation, and audio segmentation methods.
Audio classification can be described as a pattern recognition process. Its research emphasis generally includes two basic aspects: audio feature analysis and extraction, and the design and implementation of the classifier. The essence of audio classification is a pattern recognition process, mainly realized as follows: (1) Preprocessing. Before an audio file is processed, it needs to be preprocessed, i.e., the audio stream is divided into smaller units; the audio file is classified by classifying these shorter audio units. Preprocessing of the audio signal includes pre-emphasis, framing, and windowing. (2) Extraction of acoustic features for classification. The selection and extraction of features is the most important part of a pattern recognition system, and naturally also of audio classification. (3) Feature screening. Multi-class audio classification is a multi-level hierarchical classification; a feature selection approach is used to select, for each level of the hierarchy, the feature set best suited to distinguish the two kinds of audio data at that level. (4) Selection of the classifier. Using machine learning to classify audio signals automatically not only reduces manpower but also saves time and improves efficiency. Common audio classifier implementations fall broadly into two classes: threshold-based models and statistical models.
In the field of audio classification, early classifier implementations were based on thresholds. This classification method requires a large amount of training data, and since the thresholds selected differ between applications, it is not general. Moreover, thresholding can only achieve coarse-grained audio classification (e.g., music, silence, sound) and cannot achieve fine-grained classification of audio data (e.g., recognizing applause, shouting, explosions, etc.). To overcome these shortcomings, audio classification based on statistical models has been proposed. This classification method involves no thresholds; it is a classification model obtained by training on data, based on statistical theory. It can recognize not only coarse-grained audio data but also fine-grained audio data.
Among statistical models, a distinction is made between supervised and unsupervised models. Early on, supervised data analysis and classification methods such as the SVM (support vector machine) were commonly used. The SVM is a machine learning method based on statistical learning theory; it is well suited to classification and to a great extent reflects the differences between categories. The effectiveness of the SVM method has been fully demonstrated in many applications. However, the effectiveness of the SVM method depends strongly on the quality and quantity of the training data. A good classifier determines a high classification accuracy, and the classifier is adjusted according to the target categories of the audio data to be classified in order to improve accuracy. This statistical model has a good ability to model the distribution of the acoustic feature space and good robustness. Therefore, in recent years the support vector machine (SVM) has been widely used in audio classification.
Audio segmentation, also referred to as jump detection, refers, as its name suggests, to finding jump points in the tested audio sequence by certain means. So what kind of point is called a jump point? Generally speaking, when the human ear receives a continuous audio signal, different signals produce different sensations. From the perspective of perception, the point at which the human ear senses a change in the signal is referred to as a jump point, also called a boundary point. From the perspective of the signal, this change can be described as a change in auditory features, i.e., certain features of the corresponding signal must change along with it. The process of partitioning audio into segments of different lengths is known as audio segmentation.
…….
The difficulties and significance of solving the above technical problems:
(1) These audio resources can be effectively organized and managed, making it easier for people to find the audio fragments they need;
(2) Audio data is divided into different categories, which plays an important role in speech retrieval, content-based audio segmentation, and audio monitoring; it can serve as an initialization step for continuous speech recognition, preventing non-speech streams in the audio stream from entering the speech recognizer, improving the accuracy of speech recognition and shortening recognition time;
(3) Applying different processing methods to different types of audio data not only shortens the time and space consumption of the process, but also improves processing accuracy;
(4) The classification model, trained on data based on statistical theory, can recognize not only coarse-grained audio data but also fine-grained audio data;
(5) The effectiveness of the SVM method depends strongly on the quality and quantity of the training data. The classifier is adjusted according to the target categories of the audio data to be classified in order to improve classification accuracy. This statistical model has a good ability to model the distribution of the acoustic feature space and good robustness.
Summary of the invention
In view of the problems in the prior art, the present invention provides an SVM-based audio classification method and system. The audio classification and segmentation technique of the invention can solve the problems of the prior art well, and provides a solid foundation for the structuring and deep analysis of audio and for the utilization of audio information.
The invention is realized as follows: an SVM-based audio classification method, the SVM-based audio classification method comprising:
in feature extraction, the characteristics of the audio are extracted using time-domain and frequency-domain properties respectively;
in audio classification, a classification method based on the support vector machine is used;
in audio segmentation, an audio segmentation method based on the Bayesian information criterion BIC is used to perform segmentation and confirmation; here, audio segmentation extracts the different audio categories from the classified audio stream, i.e., divides the audio stream into categories along the time axis.
Further, before feature extraction, the following needs to be carried out:
Audio signal preprocessing: first, the original audio signal is preprocessed, the audio signal is segmented, and each audio segment is windowed and framed; second, features are extracted from the audio frames and audio segments, and the extracted features are merged.
Further, extracting the characteristics of the audio using time-domain and frequency-domain properties respectively specifically includes:
1) Audio time-domain characteristic analysis and extraction: the audio time-domain characteristics represent time-domain properties, and the audio signal is analyzed through the frames of the time-domain waveform; specifically:
Zero-crossing rate ZCR: the ratio, over the discrete points of the audio signal, of the number of neighbouring sample pairs whose signal values change sign to the total number of samples; the zero-crossing rate reflects the frequency with which the signal crosses zero:
Z_n = (1/2N) · Σ_{m=n}^{n+N-1} | sgn[x(m)] - sgn[x(m-1)] |
where x(m) is the processed discrete audio signal;
Short-time average energy: the short-time average energy is an audio feature parameter that reflects the variation of the audio energy, and it is directly related to the choice of the window length N; if N is too long, the variation of the energy becomes too smooth and differences are not reflected, while a window that is too narrow yields an energy function that is not smooth; a Hamming window is selected to keep a good balance between the two; the short-time average energy is calculated with the following formula:
E_m = Σ_{n=0}^{N-1} [ x(n) · w(n) ]^2
where x(n) denotes the n-th signal value in the m-th frame of the audio signal and w(n) is the window function; a threshold is set on the short-time energy, below which the frame is judged to be silence;
2) Audio frequency-domain characteristic analysis and extraction: the characteristic value of each frame is calculated, and then the characteristic value at the segment level is calculated;
3) Feature analysis and extraction based on audio segments, carried out as follows:
a threshold is set on the silence ratio of the frequency-domain energy;
the sub-band energy ratio feature is calculated as the average of the sub-band energy ratio parameters;
the bandwidth average is the average bandwidth of the frames in the audio segment;
the high zero-crossing rate ratio is calculated;
the ratio of low-energy frames in the audio segment is calculated;
the spectral flux describes the average variation of the spectra of adjacent audio frames within the audio segment;
for the fundamental-frequency standard deviation of the audio segment, the pitch frequency of each frame is calculated, and the standard deviation of these pitch frequency parameters is then computed;
the composed feature vector set is partitioned into the 24-dimensional MFCC vector and the 11-dimensional feature vector extracted from the audio segment.
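The segment-level bandwidth average above is built from per-frame spectral statistics. The following sketch computes a per-frame spectral centroid and bandwidth under the common magnitude-weighted definitions; the patent does not give its exact formulas, so these definitions, and the 1 kHz test tone used below, are assumptions for illustration only:

```python
import numpy as np

def spectral_centroid_bandwidth(frame, sr):
    """Magnitude-weighted mean frequency (centroid) and RMS spread (bandwidth)
    of one audio frame, computed from the one-sided FFT magnitude spectrum."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    centroid = np.sum(freqs * mag) / np.sum(mag)
    bandwidth = np.sqrt(np.sum((freqs - centroid) ** 2 * mag) / np.sum(mag))
    return centroid, bandwidth
```

For a pure tone the centroid sits at the tone frequency and the bandwidth is near zero; broadband frames yield a large bandwidth, which is what makes the segment-level bandwidth average discriminative.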
Further, the audio classification method includes:
1) silence and noise use a rule-based classification method;
2) classification of each audio category: an SVM-based classifier is used to classify pure speech/background sound and music/environmental sound.
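As a sketch of the SVM-based two-class step (e.g., pure speech vs. background sound), the following uses scikit-learn's SVC on synthetic 11-dimensional segment feature vectors. The feature data, the RBF kernel, and the C and gamma settings are all illustrative assumptions, not values taken from the patent:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic 11-dimensional segment feature vectors standing in for the
# patent's segment features; the two Gaussian classes ("pure speech" vs.
# "background sound") are invented purely for illustration.
X = np.vstack([rng.normal(0.0, 1.0, (200, 11)),
               rng.normal(2.0, 1.0, (200, 11))])
y = np.repeat([0, 1], 200)

# RBF-kernel SVM with feature standardization; C and gamma are assumed values.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
```

Standardizing features before the SVM matters here because the segment features (energies, ratios, bandwidths) live on very different scales.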
Further, the audio classification method includes an improved ΔBIC segmentation method: for each detected BIC window, if a split point is detected, the window slides a specific length to the next window; if no split point is detected, the window length is increased; when the window length has increased to a certain extent and still no split point is found, the window keeps its current length and slides forward until a split point is found, whereupon the original window length is restored; when a split point is detected, the window is immediately moved behind it.
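The split test behind the ΔBIC window scheme can be illustrated with the standard full-covariance Gaussian formulation: model the window once as a single Gaussian and once as two Gaussians split at a candidate frame, and compare penalized log-likelihoods. The patent does not state its exact expression, so the penalty form and the weight λ below are assumptions based on the usual BIC criterion:

```python
import numpy as np

def delta_bic(window, i, lam=1.0):
    """ΔBIC for splitting `window` (frames x dims) at frame i: positive values
    favour the two-Gaussian model over one, i.e. indicate a change point."""
    n, d = window.shape

    def logdet(seg):
        # Regularized covariance log-determinant of one segment.
        cov = np.cov(seg, rowvar=False) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    # Standard BIC model-complexity penalty for a d-dimensional Gaussian.
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(window)
                  - i * logdet(window[:i])
                  - (n - i) * logdet(window[i:])) - lam * penalty
```

A sliding-window segmenter would evaluate this at every candidate frame inside the current window and declare a split point where ΔBIC peaks above zero, which is the decision the window-growing scheme above repeats along the stream.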
Another object of the present invention is to provide a computer program implementing the described SVM-based audio classification method.
Another object of the present invention is to provide a terminal at least carrying a processor implementing the SVM-based audio classification method of any one of claims 1 to 5.
Another object of the present invention is to provide a computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the SVM-based audio classification method.
Another object of the present invention is to provide an SVM-based audio classification control system implementing the described SVM-based audio classification method.
Another object of the present invention is to provide a multimedia information processing device carrying the described SVM-based audio classification control system.
In conclusion advantages of the present invention and good effect are as follows:
In the present invention, automatic Audio Classification and segmentation are the important hands that structured message and semantic content are extracted in audio Section is the basis of understanding, analysis and retrieval audio content.In essence, the classification of audio data is that a pattern-recognition is asked Topic, it includes two basic sides: feature extraction selection and classification.Audio signal can most be represented by how extracting in audio signal The information of feature is vital for audio classification.Audio feature extraction signature analysis based on audio frame and can mention Method is taken, and signature analysis and extracting method based on audio.It is special using time domain respectively in the method for extracting these characteristics Property and frequency domain characteristic extract the characteristic of audio.On the basis of existing algorithm is sufficiently studied and tested, sound is realized The technical process of frequency division class and segmentation.This mainly includes two contents of audio classification and audio segmentation.It is adopted in classification method With the classification method for being based on support vector machines (SVM).Support vector machines be machine learning in recent years research it is main at Fruit.As a kind of new machine learning method, SVM can solve the practical problems such as small sample, non-linear and high dimension, from forming For a new research hotspot of neural network research.In dividing method, using the audio point of bayesian information criterion (BIC) Segmentation method is split a confirmation.Audio segmentation be different audio categories are extracted from the audio stream of audio classification, that is, It says, the category division of audio stream temporally axis.It is demonstrated experimentally that the audio classification algorithms based on SVM have good classifying quality, Smooth audio segmentation result is more accurate.
The present invention is further described through the following experimental analysis:
1) Silence and noise use a rule-based classification method. The experiment is designed as follows: silence and noise threshold judgments are performed on all samples, the number of correct classifications is recorded, the number of misclassifications (e.g., clips that are not silence but are judged to be silence) is counted, and the classification accuracy is calculated. The experimental results are as follows:
Table 1. Noise/silence classification results
Class     Correctly classified   Misclassified   Accuracy
Noise     541                    127             85.87%
Silence   709                    23              93.28%
For the other categories, the audio differs clearly in magnitude, so the recognition accuracy is very high. Misclassification arises mainly when a clip contains both silence and other audio categories, so that its average energy may be relatively small; this is addressed by lowering the energy threshold. The recognition accuracy for noise is 85.87%. Analysis shows that the noise sources occurring in different audio categories differ, so the time-frequency characteristics of the noise also differ, and a single threshold lacks generality for this judgment. Therefore, the accuracy of the noise judgment in the test is not high and the false positive rate is high: environmental sounds with minor changes in the energy spectrum are easily misjudged as noise.
2) Classification of each audio category
An SVM-based classifier is used to classify pure speech/background sound and music/environmental sound. Each classification is tested three times.
Table 2. Pure speech/background sound classification results
Table 3. Music/environmental sound classification results
The experiments clearly show that the classification accuracy of the support vector machine classifier is very high: the average classification accuracy for pure speech and background sound is 91.28%, and the average classification accuracy for music and environmental sound is 90.77%. The experimental data show that the proposed support vector machine classifier achieves good classification performance and accuracy in the audio classification task.
3) Traditional ΔBIC segmentation method and improved ΔBIC segmentation method
The experiment uses the traditional segmentation method and the improved segmentation method respectively. By comparing the precision of the improved segmentation method with that of the traditional segmentation method, the reasonableness of the traditional method and the effectiveness of the improved method are verified. The classification results are segmented using both the traditional and the improved segmentation methods.
Table 4. Segmentation test results
Method               Detected segments   Correct/Precision   Missed   False alarms
Traditional method   165                 127/82.5%           27       38
Improved method      148                 135/87.6%           9        13
Compared with the improved ΔBIC segmentation method, the traditional ΔBIC segmentation method detects a far larger number of segmentation results. Analysis shows that the traditional method only smooths the classification results and then directly combines audio of the same category to obtain the segmentation result; it does not consider the interaction between adjacent segments and ignores the global optimization of the segmentation. This is equivalent to relaxing the constraints on audio shot segmentation: while it raises the hit rate, it inevitably causes more misclassifications and therefore more detected audio shots. The improved method converts the segmentation problem into the solution of an optimization problem. It is a dynamic method that fully considers the interaction between segments and the global optimization of the segmentation, thus greatly reducing the number of false alarms while also improving the precision to some extent. The results show that the actual efficiency of the optimization method is clearly higher than that of the conventional method.
Brief description of the drawings
Fig. 1 is a flowchart of the SVM-based audio classification method provided by an embodiment of the present invention.
Fig. 2 is a schematic diagram of the pre-emphasis filter provided by an embodiment of the present invention.
Fig. 3 is a diagram of the Mel-scale filter bank provided by an embodiment of the present invention.
Fig. 4 is a schematic diagram of the support vector machine provided by an embodiment of the present invention.
Fig. 5 is a diagram of singular points provided by an embodiment of the present invention.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to embodiments. It should be understood that the specific embodiments described here serve only to illustrate the present invention and are not intended to limit it.
In present-day multimedia information processing, audio occupies a very important position, but due to the characteristics of the media source itself and the constraints of the prior art, the further analysis and utilization of audio information is limited.
As shown in Fig. 1, the SVM-based audio classification method provided by an embodiment of the present invention comprises:
In the method for these characteristics of audio extraction, the spy of audio is extracted using time domain specification and frequency domain characteristic respectively Property;
In audio classification, classified using the classification method based on support vector machines;
In audio frequency splitting method, a confirmation is split using the audio frequency splitting method of bayesian information criterion BIC;Its In, audio segmentation extracts different audio categories from the audio stream of audio classification, and the classification of audio stream temporally axis is divided.
The application of the present invention is further described below with reference to a concrete analysis.
1. Audio signal preprocessing
Audio signal preprocessing is divided into two steps. First, the original audio signal is preprocessed; the main purpose is to unify the audio format. The audio signal is segmented, and each audio segment is windowed and framed. Second, features are extracted from audio frames and audio segments, and the extracted features are merged; the main purpose is to obtain the final required audio feature vector. Preprocessing of the original audio data includes pre-emphasis, segmentation, and windowing.
(1) preemphasis is handled
In connection with the mechanism of human hearing, the audio frequency range the human ear can hear is 60 Hz to 20 kHz. In audio signal processing, the audio signal is pre-emphasized in order to eliminate low-frequency interference, especially the 50 Hz or 60 Hz power-frequency interference. Pre-emphasis is usually applied to the digitized audio signal with a digital filter, generally a first-order high-pass digital filter:
H(z) = 1 - μz^(-1)    (1)
In the time domain, if the signal passed through the filter is y(n), then y(n) can be expressed as:
y(n) = x(n) - μ·x(n-1)    (2)
where x(n) denotes the original signal sequence and y(n) the pre-emphasized sequence.
The first-order high-pass digital filter is shown in the pre-emphasis filter schematic of Fig. 2.
Through pre-emphasis processing, the influence of sharp noise can be reduced and the high-frequency part of the signal boosted, making the spectrum of the signal flatter; the pre-emphasis coefficient μ is usually around 0.97 or 0.98. The pre-emphasized signal is then typically normalized.
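Equation (2) amounts to a one-line filter. A minimal sketch (the handling of the first sample, kept unfiltered here, is an assumption):

```python
import numpy as np

def pre_emphasis(x, mu=0.97):
    """First-order high-pass per Eq. (2): y(n) = x(n) - mu*x(n-1), y(0) = x(0)."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - mu * x[:-1])
```

A constant (DC) input is attenuated to 1 - μ = 0.03 of its level while a sample-by-sample alternating input is boosted to 1 + μ = 1.97, matching the high-pass behaviour described above.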
(2) Windowing and framing
After the pre-emphasis digital filtering, windowing and framing are performed next. Over a short time the audio signal changes very slowly, so during this slow quasi-stationary process the extracted acoustic characteristics remain stable. Therefore, when processing an audio signal, the discrete audio signal is first divided into units of a certain length, that is, the discrete audio samples are divided into audio frames. This is a "short-time" signal processing method. In general, the duration of one "short-time" audio frame is about ten to several tens of milliseconds. According to the length of the partitioned unit, audio units can be divided into: audio frames, audio clips, audio shots, and high-level semantic audio units. Although framing can use contiguous segments, the overlapping-segment method shown in the figure is generally used, so that the transition between frames is smooth and continuity is maintained. The overlap between one frame and the next is known as the frame shift, and the frame shift is typically taken as half the frame length. Framing is realized by weighting with a finite-length window: a specific window function w(n) is multiplied by y(n) to form a windowed audio signal y_w(n) = w(n)·y(n). Multiplication of signals in the time domain is equivalent to convolution in the frequency domain, so the windowing can also be expressed as:
Y_w(ω) = (1/2π) · Y(ω) * W(ω)    (3)
where Y and W denote the respective spectra and * denotes convolution.
It can be seen that the window function w(n) affects not only the waveform of the original time-domain signal but also the waveform of its spectrum. The two most common window functions are the rectangular window and the Hamming window.
Rectangular window:
w(n) = 1, 0 ≤ n ≤ N-1;  w(n) = 0 otherwise    (4)
Hamming window:
w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1;  w(n) = 0 otherwise    (5)
The choice of the shape and length of the window function w(n) has a great influence on the short-time analysis parameters, so a suitable window should be selected to make the short-time parameters better reflect the changing features of the speech signal. The rectangular window has good spectral smoothness, but loses high-frequency components and waveform detail and leads to leakage; the Hamming window effectively overcomes the leakage (Gibbs) phenomenon and has the widest range of application. If the window length N is large, the window is equivalent to a very narrow low-pass filter: as the audio signal passes through, the high-frequency parts reflecting the waveform details are blocked and the short-time energy changes little over time, which cannot truly reflect the amplitude variations of the signal. Conversely, if N is too small, the passband of the filter broadens and the short-time energy changes drastically with time, so that a smooth energy function cannot be obtained. Therefore, the window length should be chosen appropriately, usually 15-30 ms. After this processing, the audio signal is divided into windowed short-time frames, each short-time audio frame is regarded as a stationary random signal, and digital signal processing techniques are used to extract the audio feature parameters.
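The leakage comparison above can be checked numerically: the highest spectral sidelobe of a rectangular window sits near -13 dB below the mainlobe peak, while the Hamming window pushes it below -40 dB. A sketch (the window length and the mainlobe-skipping heuristic are illustrative choices, not from the patent):

```python
import numpy as np

N = 256
n = np.arange(N)
rect = np.ones(N)                                        # rectangular window
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # Hamming window

def peak_sidelobe_db(w, pad=4096):
    """Highest spectral sidelobe relative to the mainlobe peak, in dB."""
    spec = np.abs(np.fft.fft(w, pad))[: pad // 2]
    spec = spec / spec.max()
    i = 1
    while i < len(spec) - 1 and spec[i + 1] < spec[i]:
        i += 1  # walk down the mainlobe to its first minimum
    return 20 * np.log10(spec[i:].max())
```

This quantifies the trade-off stated in the text: the Hamming window buys roughly 30 dB of sidelobe suppression at the cost of a wider mainlobe.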
2. Analysis and extraction of acoustic characteristics
An audio signal contains a large amount of information as well as many interference signals and redundancies. How to extract the most representative information from the audio signal is the key to audio classification. Acoustic characteristics are the basis of audio classification, and the extracted characteristics should reflect the significant properties of the audio as much as possible. At the same time, good robustness is required: the influence of the environment and the signal characteristics that cause recognition ambiguity [3] should be eliminated. The extracted feature parameters serve, in vector form, as the input of the classification method. Therefore, the independence between the vector parameters should be considered and, while guaranteeing the accuracy of the result, the computational complexity should be reduced as far as possible: the features should contain as much information as possible, but the data volume should be as small as possible. Audio feature extraction can be based on the analysis and extraction of features from audio frames or from audio segments: the characteristics of the audio frames are analyzed frame by frame, and feature analysis and extraction for the audio segment are then carried out according to the characteristic parameters of its frames. The characteristics of audio cover three aspects: time-domain features, frequency-domain features, and perceptual features.
(1) Time-domain features. For audio frames, the main indicators used are the short-time energy and the zero-crossing rate. For audio segments, three indicators are mainly used: the silence ratio, the low-frequency energy ratio, and the high zero-crossing rate ratio.
(2) Frequency-domain features. Frequency-domain features are obtained after a Fourier transform. The indicators used for audio frames are the frequency-domain energy, the subband energy distribution, the frequency centroid, the bandwidth, the fundamental frequency, and the MFCC coefficients (Mel-frequency cepstral coefficients). For audio segments, we used indicators such as the mean subband energy ratio, the mean spectral centroid, the mean bandwidth, the spectral flux, and the mean MFCC coefficients.
(3) Perceptual features. The main perceptual feature of an audio frame is the pitch, and the main perceptual feature of an audio segment is the standard deviation of the fundamental frequency. In our experiments these perceptual features did not reflect the class characteristics of the audio well, so we do not use them in the present invention.
2.1 Time-domain feature analysis and extraction
The time-domain features of audio are vector parameters that represent time-domain characteristics; they are obtained by analyzing the audio signal frame by frame from the time-domain waveform.
Zero-crossing rate (ZCR): the ratio of the number of neighbouring sample pairs whose signal values have opposite signs to the total number of samples on the discrete points of the audio signal. The zero-crossing rate indicates how often the signal crosses zero and is a commonly used audio feature.
Here x(m) is the discrete audio signal being processed.
Short-time energy: the short-time energy is one of the commonly used audio feature parameters. It is a relatively straightforward characteristic that reflects the variation of the audio energy, and it is directly related to the choice of the window length N. If N is too large, the energy varies smoothly and differences are not reflected; if the window is too narrow, a smooth energy function cannot be obtained. The choice of window is therefore important; in the present invention the Hamming window is selected to keep a good balance between the two. The short-time energy can be calculated with formula (7):
Here x(n) denotes the n-th signal value in the m-th frame of the audio signal, and w(n) is the window function described above. A threshold can be set on the short-time energy: below the threshold the frame can be judged to be silent, so the short-time energy is mainly used to decide whether the audio signal is silent. The short-time energy can also be used to judge whether an audio signal belongs to the speech, music, or noise class.
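The two frame-level measures just described can be sketched in plain Python, operating on a single (already windowed) frame; the silence threshold is an illustrative value that would be tuned per corpus:

```python
def short_time_energy(frame):
    """Short-time energy of one (windowed) frame: sum of squared samples."""
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    """Fraction of neighbouring sample pairs whose values have opposite signs."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / len(frame)

silence_threshold = 1e-4  # illustrative; tuned per corpus in practice
frame = [0.5, -0.4, 0.3, -0.2, 0.1, -0.05, 0.02, -0.01]
is_silent = short_time_energy(frame) < silence_threshold
```

A frame whose energy falls below the threshold would be judged silent, as described above.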
2.2 Frequency-domain feature analysis and extraction
The frame is the smallest unit of the audio signal that we process: the feature value of each frame is calculated first, and the segment-level feature values are then computed from them. At the frame level there are several typical acoustic features.
(1) MFCC coefficients. The Mel-frequency cepstral coefficients are acoustic features derived from the mechanism of human hearing. Human perception of frequency is approximately linear below 1000 Hz; above 1000 Hz it no longer follows a linear relationship but is approximately linear on a logarithmic scale. The Mel scale describes this nonlinear characteristic of the human ear's perception of frequency, and the MFCC is a cepstral parameter extracted in the Mel-scale frequency domain. This feature has high discriminability and good noise robustness.
The MFCC derives from two findings about the auditory system. First, human perception of a single tone is approximately proportional to the logarithm of its frequency. On the so-called Mel frequency scale, whose values correspond to a logarithmic distribution of the actual frequency, the perception of pitch is linear. The relationship between the Mel frequency and the actual frequency can be approximated by the following formula:
Second, when two tones of similar frequency sound simultaneously, only one tone is heard. The critical bandwidth is the bandwidth boundary at which the subjective sensation changes: when the frequency difference between two tones is less than the critical bandwidth, the two tones are heard as a whole; this is referred to as the masking effect. The critical bandwidth is calculated as follows:
where fc denotes the centre frequency.
Therefore, a bank of critical-band filters can be constructed to simulate the perceptual characteristics of the human ear, and the Mel-frequency cepstral coefficients (MFCC) are computed by applying this filter bank to the spectrum. The spectrum is passed through a sequence of triangular filters that have identical bandwidths in the Mel-frequency coordinate system, as shown in the Mel-scale filter bank of Fig. 3.
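Since the Mel-scale formula is not reproduced in the text, the sketch below uses the commonly cited approximation mel = 2595·log10(1 + f/700) to place triangular-filter centre frequencies uniformly on the Mel scale; the 0-8000 Hz range is illustrative, and the 24 filters match the 24-dimensional MFCC vector mentioned later:

```python
import math

def hz_to_mel(f):
    """Commonly used Mel approximation: mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centres(f_low, f_high, n_filters):
    """Centre frequencies of triangular filters spaced uniformly on the Mel scale."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_filters + 1)
    return [mel_to_hz(m_low + step * (i + 1)) for i in range(n_filters)]

centres = mel_filter_centres(0.0, 8000.0, 24)  # 24 filters, as in the 24-D MFCC vector
```

The centres crowd together at low frequencies and spread out at high frequencies, mirroring the ear's nonlinear resolution.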
(2) Frequency-domain energy. The frequency-domain energy is given by the following formula:
where F(ω) is the FFT coefficient of the frame and ω0 is half the sampling frequency. The frequency-domain energy E is used to detect silent frames: if the frequency-domain energy of a frame is below a threshold, the frame is marked as a silent frame; otherwise it is a non-silent frame.
(3) Subband energy ratio. The frequency domain is divided into four subbands, and the energy distribution of each subband is then calculated, as shown in formula (11):
where Lj and Hj are the lower and upper bound frequencies of subband j. Different types of audio have different energy distributions across the subbands: for music, the frequency-domain energy is distributed relatively uniformly over the subbands, whereas for speech the energy is concentrated mainly in the 0th subband, roughly 80% or more.
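A minimal sketch of the subband energy ratio; the text does not reproduce the band edges Lj, Hj, so the octave-style split at 1/8, 1/4, and 1/2 of the half sampling rate used below is an assumption:

```python
def subband_energy_ratios(power_spectrum):
    """Energy ratio of four subbands of a frame's power spectrum.

    The band edges at 1/8, 1/4 and 1/2 of the spectrum length are an assumed
    octave-style split; the text does not reproduce the actual Lj, Hj values.
    """
    n = len(power_spectrum)
    edges = [0, n // 8, n // 4, n // 2, n]
    total = sum(power_spectrum)
    return [sum(power_spectrum[lo:hi]) / total for lo, hi in zip(edges, edges[1:])]

# Speech-like toy spectrum: energy concentrated in the lowest band
spectrum = [10.0] * 16 + [0.5] * 112
ratios = subband_energy_ratios(spectrum)
```

For the speech-like toy spectrum the 0th subband dominates, consistent with the "80% or more" observation above.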
(4) Zero-crossing rate. For a discrete-time signal, a pair of adjacent samples with different algebraic signs constitutes a zero crossing. The zero-crossing rate describes how fast the signal crosses zero and is a simple way of measuring the signal frequency. The formula is given by equation (12):
where x(m) denotes the discrete audio signal. The ZCR is one of the more commonly used audio features.
(5) Frequency centroid. The brightness of a frame is measured by its frequency centroid, calculated as in equation (13):
(6) Bandwidth. The bandwidth is an indicator of the frequency range of the audio and is calculated by formula (14):
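The frequency centroid and bandwidth formulas (13) and (14) are not reproduced in the text; the sketch below uses the standard definitions (energy-weighted mean frequency, and energy-weighted spread around it):

```python
import math

def spectral_centroid(freqs, power):
    """Energy-weighted mean frequency (a standard definition of brightness)."""
    total = sum(power)
    return sum(f * p for f, p in zip(freqs, power)) / total

def spectral_bandwidth(freqs, power):
    """Energy-weighted spread of the spectrum around its centroid."""
    c = spectral_centroid(freqs, power)
    total = sum(power)
    return math.sqrt(sum((f - c) ** 2 * p for f, p in zip(freqs, power)) / total)

freqs = [100.0, 200.0, 300.0, 400.0]
power = [1.0, 4.0, 4.0, 1.0]
c = spectral_centroid(freqs, power)  # symmetric spectrum, so the centroid is 250 Hz
```

A brighter frame shifts the centroid upward; a wider spectrum increases the bandwidth.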
(7) Pitch frequency. The pitch frequency measures the pitch of a sound. Pitch-period detection methods can be roughly divided into three classes: time-domain methods, frequency-domain methods, and methods that combine the time-domain and frequency-domain characteristics of the signal. Here the pitch period is estimated with a relatively simple peak-picking algorithm based on the short-time autocorrelation of a centre-clipped signal. The principle of the autocorrelation method is that the short-time autocorrelation function has large peaks at integer multiples of the pitch period, so the pitch period can be estimated by locating the position of the largest peak. The pitch period is calculated in the following steps:
(a) Pre-processing. The centre-clipping function (15) is applied to the audio to reduce the effect of the formants. The clipping level L is determined by the peak amplitude of the speech signal and is generally taken as 60%-70% of the maximum signal amplitude.
(b) Calculating the correlation of y(n) and y'(n). To avoid the large amount of computation of the short-time autocorrelation, the autocorrelation of the centre-clipped signal y(n) of equation (15) is replaced by the cross-correlation of two signals: one signal is y(n) itself, and the other is the result y'(n) of three-level quantization of y(n), that is:
The cross-correlation is calculated using the following formula:
(c) Finding the pitch period. The maximum value of R(k) is selected and denoted Rmax. If Rmax < c·R(0) (where c is a threshold), the frame is considered unvoiced and its pitch period is set to 0; otherwise the pitch period is the lag k at which R(k) = Rmax, that is:
(d) Post-processing. Owing to factors such as acoustic interference, some scattered pitch-period estimates deviate from the pitch-period track. For accuracy and convenience of post-processing, median filtering is commonly used to smooth the raw curve. Median filtering is a nonlinear process: a sliding window selects a block of data from the sequence and replaces the central sample with the median of the data in the window. As the window slides along the data sequence, the successive medians form the filtered result.
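Steps (a)-(c) above can be sketched as follows; the clipping ratio, the threshold c, and the lag range are illustrative values, not ones fixed by the text:

```python
import math

def centre_clip(x, clip_ratio=0.65):
    """Centre clipping (formula (15) style): samples inside +/-L are zeroed,
    the rest are shifted toward zero by L."""
    L = clip_ratio * max(abs(s) for s in x)
    return [s - L if s > L else (s + L if s < -L else 0.0) for s in x]

def three_level(y):
    """Three-level quantization of the clipped signal: -1, 0 or +1."""
    return [1.0 if s > 0 else (-1.0 if s < 0 else 0.0) for s in y]

def pitch_period(x, min_lag, max_lag, c=0.3):
    """Largest cross-correlation peak of y and y' over the candidate lag range;
    returns 0 (unvoiced) when the peak is below c * R(0)."""
    y = centre_clip(x)
    yq = three_level(y)
    n = len(x)
    def R(k):
        return sum(y[m] * yq[m + k] for m in range(n - k))
    r0 = R(0)
    best = max(range(min_lag, max_lag + 1), key=R)
    return best if R(best) >= c * r0 else 0

# 100 Hz pulse-like waveform at 8 kHz: the expected pitch period is 80 samples
fs = 8000
x = [max(0.0, math.sin(2 * math.pi * 100 * n / fs)) ** 3 for n in range(800)]
period = pitch_period(x, min_lag=40, max_lag=200)
```

Step (d), median filtering of the resulting pitch track, would then smooth out scattered outliers.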
2.3 Feature analysis and extraction based on audio segments
An audio segment is a larger unit than an audio frame; one audio segment generally contains several audio frames. Its features are statistics computed over the audio frames it contains: the general calculation method is to take the mean, variance, and standard deviation of the frame-level features within the segment. The main segment-level features used in this chapter are:
(1) Silence ratio. A threshold is set on the frequency-domain energy. When the energy of a sample frame is below this threshold, the frame is called a silent frame; otherwise it is a non-silent frame. The ratio of silent frames within an audio segment is the silence ratio, which can be expressed with the following formula (19):
where the parameter M denotes the number of silent frames in the audio segment and the parameter N denotes the total number of audio frames contained in the segment.
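A direct sketch of formula (19), with an illustrative energy threshold:

```python
def silence_ratio(frame_energies, threshold):
    """Fraction M/N of frames in a segment whose energy is below the threshold."""
    m = sum(1 for e in frame_energies if e < threshold)
    return m / len(frame_energies)

energies = [0.001, 0.8, 0.002, 0.9, 0.7]
ratio = silence_ratio(energies, threshold=0.01)  # 2 of the 5 frames are silent
```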
(2) Mean subband energy ratio[9]. This feature is calculated from the subband energy ratio parameter, i.e. it is the average of the subband energy ratios of the frames in the audio segment. This feature is widely used in signal research.
(3) Mean bandwidth and mean spectral centroid. The mean bandwidth is the average bandwidth of the frames in the audio segment, and the mean spectral centroid is the average audio brightness of the frames in the segment.
(4) High zero-crossing rate ratio. The zero-crossing rate of speech is higher than that of music. Given a threshold, the proportion of audio frames in the segment whose zero-crossing rate exceeds the threshold can be calculated; this proportion is called the high zero-crossing rate ratio (high ZCR ratio). The threshold is usually 1.5 times the average zero-crossing rate in the segment. The feature value is calculated as in the following formula (20):
where the parameter N denotes the total number of audio frames in the segment and ZCR(n) denotes the zero-crossing rate of the n-th frame in the segment.
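A sketch of the high ZCR ratio of formula (20), with the threshold set to 1.5 times the segment mean as described:

```python
def high_zcr_ratio(zcrs, factor=1.5):
    """Fraction of frames whose ZCR exceeds factor * (segment mean ZCR)."""
    threshold = factor * sum(zcrs) / len(zcrs)
    return sum(1 for z in zcrs if z > threshold) / len(zcrs)

speech_like = [0.05, 0.06, 0.40, 0.05, 0.45, 0.04]  # bursts of high ZCR
ratio = high_zcr_ratio(speech_like)
```

Speech, with its alternating voiced/unvoiced bursts, yields a higher value than music.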
(5) Low-frequency energy ratio. An energy threshold is set within the audio segment; a frame whose energy is below this threshold is called a low-energy frame. The proportion of low-energy frames in the segment can then be calculated. This proportion is called the low frequency energy ratio, abbreviated LFER[10], and is obtained by formula (21):
where the parameter N is the total number of audio frames in the segment and E(n) is the frequency-domain energy of the n-th frame. The threshold in this formula is 0.5 times the average frame energy in the segment.
(6) Spectral flux. The spectral flux is the mean of the spectral differences between adjacent audio frames in the segment; it describes the variation of the spectrum. It is calculated as shown in (22):
(7) Standard deviation of the fundamental frequency over the segment. The pitch frequency of each frame is calculated first, and the standard deviation of these pitch frequencies is then computed; it describes the range of the pitch frequency.
(8) Composition of the feature vector set. The feature vector is divided into two parts: a 24-dimensional MFCC vector and an 11-dimensional feature vector extracted from the audio segment. Because the differences between the feature components are relatively large, they need to be normalized. However, the experimental results were not improved after normalizing the MFCC vector set; therefore, only the segment-level features are normalized, as shown in formula (23):
xi' = (xi - μi) / βi (23)
where the parameter xi is the input feature component to be normalized, μi is its mean, βi is its variance, and xi' is the feature value obtained after normalization.
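A sketch of the normalization of formula (23); the text calls βi the variance, but the usual scale factor is the standard deviation, so the exact choice below is an assumption:

```python
def normalize(features):
    """Per-dimension normalization x' = (x - mu) / beta of a list of vectors.

    beta is taken here as the standard deviation; the text calls beta the
    variance, so the exact scale factor is an assumption.
    """
    dims = len(features[0])
    mus = [sum(v[i] for v in features) / len(features) for i in range(dims)]
    betas = [
        (sum((v[i] - mus[i]) ** 2 for v in features) / len(features)) ** 0.5
        for i in range(dims)
    ]
    return [[(v[i] - mus[i]) / betas[i] for i in range(dims)] for v in features]

vectors = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
normed = normalize(vectors)
```

After normalization each dimension has zero mean, so no single feature dominates the classifier's distance computations.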
2.4 Audio classification method
Audio classification is essentially a pattern-recognition technique. Statistical learning methods have a solid theoretical foundation and a simple implementation mechanism, and they are used by most current audio classification systems. A statistical learning method requires a batch of training samples with category labels to be provided in advance; a classifier is generated by supervised training, and a test sample set is then classified to measure the classification performance. Typical audio classification methods include the minimum distance method, support vector machines, neural networks, hidden Markov models, and decision trees.
2.4.1 Support vector machine classification algorithm
The support vector machine (SVM) is a machine learning method based on the VC-dimension theory and the structural risk minimization principle, proposed by Cortes and Vapnik in 1995, and its performance is very good. It can solve small-sample and nonlinear problems, and it shows unique advantages in high-dimensional pattern recognition and other problems. Briefly, the purpose of the support vector machine method is to find an optimal separating hyperplane that completely separates the data of two classes with the maximum margin. The SVM achieves a good learning effect for both two-class and multi-class problems, but the SVM method is primarily used to solve two-class problems. The basic principle of two-class classification is explained in detail below.
Let the training sample set be X = {x1,...,xn}, X ∈ Rd, with corresponding class labels {y1,...,yn}, yi ∈ {1,-1}. The dimension of the training sample feature vectors is d and the number of samples is n. A schematic diagram of the support vector machine is shown in Fig. 4.
(1) Linear support vector machine
For a linearly separable two-class problem, a separating hyperplane can be constructed that completely separates the positive and negative samples, as shown in Fig. 4. The solid sample points on the left represent positive samples and the hollow sample points on the right represent negative samples. Between H1 and H2 there are many separating planes that can completely separate the positive and negative samples. If one of these separating planes not only separates the positive and negative samples completely but also maximizes the geometric margin, it is called the optimal separating hyperplane. The so-called geometric margin is the distance between H1 and H2: H is the separating plane, H1 and H2 are the planes parallel to H passing through the samples of the two classes closest to H, and the sample points on H1 and H2 are the support vectors in question. It is exactly these support vectors that jointly determine the optimal separating hyperplane. Suppose the linear discriminant function is g(x) = w·x + b, normalized so that the samples {x1,...,xn} satisfy |g(x)| ≥ 1; the class margin is then 2/||w||, and
yi[w·xi + b] - 1 ≥ 0, i = 1,...,n (24)
When formula (24) holds, the classifier labels all samples correctly. Obviously, maximizing the class margin amounts to minimizing ||w||. The optimal separating hyperplane should therefore satisfy constraint (24) while minimizing ||w||; the support vectors are the samples for which equality holds in (24). In short, solving for the optimal separating hyperplane is equivalent to the following constrained optimization problem:
In this way, the solution of the SVM is converted into solving a quadratic programming problem; in theory, the solution of the SVM is the globally unique optimal solution. First, the Lagrangian function is constructed:
where ai is the Lagrange multiplier. Differentiating the Lagrangian with respect to w and b and setting the derivatives to 0 converts the original optimization problem into its dual problem:
Solving the above formula gives the value of ai corresponding to each sample; the solution obtained is the optimal solution of the optimization problem. Only the samples whose corresponding ai is non-zero are support vectors, and usually only a small fraction of the samples have non-zero ai values. The final classification discriminant function is as follows:
where the bias b* is calculated from the formula above, and xs denotes any pair of support vectors, one from each of the two classes, whose ai* is non-zero.
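A minimal sketch of evaluating the resulting discriminant f(x) = sgn(Σ ai* yi (xi·x) + b*); the support vectors, multipliers, and bias below are hand-constructed for illustration, not learned:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def svm_decision(x, support_vectors, labels, alphas, b):
    """Sign of sum_i alpha_i * y_i * <x_i, x> + b over the support vectors."""
    s = sum(a * y * dot(sv, x)
            for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if s >= 0 else -1

# Hand-constructed toy solution separating the plane at x1 = 0:
# w = (1, 0), b = 0, realized by two support vectors with alpha = 0.5 each
svs = [[1.0, 0.0], [-1.0, 0.0]]
ys = [1, -1]
alphas = [0.5, 0.5]
label = svm_decision([2.0, 3.0], svs, ys, alphas, b=0.0)
```

Note that only the support vectors enter the sum; all other training samples have ai = 0 and drop out, which is what makes the classifier sparse.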
In practice, due to the influence of noise, the training samples may not be linearly separable, so a perfect separating hyperplane cannot be obtained. Consider the black point at the far right of Fig. 5 as such noise: it is clearly a sample of the negative class, and this single unusual sample makes a linearly separable problem inseparable. This situation is usually referred to as "approximate linear separability". A common way to handle such problems is to regard the mislabeled or outlying sample points as interference or noise that should be ignored. However, their presence really does make the problem unsolvable as stated, so the solution adopted in this case is to allow a small number of sample points to violate the original requirement on their distance to the separating hyperplane. That is, where we originally required every sample point to have a functional margin of at least 1 from the separating hyperplane, we now add a slack variable to the hard constraint, which allows some sample points to fall inside the margin; the constraint becomes the following form:
The slack variables are non-negative, which means that the final result allows the sample margin to be less than 1. When the margin of a sample is less than 1, the classifier gives up classifying that singular point precisely. Although this in itself causes some loss to the classifier, it also keeps the separating hyperplane from being dragged toward these sample points, so that a larger geometric margin is obtained; there is thus a trade-off between the two.
It is known that ||w||² is the objective function and its value is expected to be as small as possible, while admitting losses makes ||w||² larger. There are usually two ways of measuring the loss. The first is the second-order soft-margin classifier:
The other is the first-order soft-margin classifier:
Adding a loss term to the objective function requires a penalty factor, so the original optimization problem can be written as follows:
(2) Nonlinear support vector machine
The basic principle of the support vector machine has been described above for linearly separable problems and "approximately linearly separable" problems. In the real world, however, the samples are often highly inseparable in the original low-dimensional sample space: no matter how the separating hyperplane is chosen, there are many singular points and the result is unsatisfactory. In this case, the linearly inseparable sample data in the low-dimensional space must be mapped into a higher-dimensional space. Although the mapped data may not be perfectly linearly separable, they are at least "approximately linearly separable"; the few remaining singular points can then be handled with slack variables, and good results can be obtained. Mapping a sample from the low-dimensional space into the high-dimensional space is realized through a kernel function:
K(xi, xj) = Φ(xi)·Φ(xj) (34)
A kernel function must satisfy Mercer's condition. Its basic role is to take two input vectors in the low-dimensional space and compute the inner product of their images in the transformed high-dimensional space. The original problem can thus be converted into the following form:
and the discriminant function becomes:
(3) Introduction to kernel functions
When handling nonlinearly separable problems, kernel functions make the support vector machine work well. Different kernel functions yield different nonlinear classifiers. When handling practical problems, there is at present no guideline for selecting the kernel function; the best kernel function is mostly selected by experimental verification. The commonly used kernel functions are listed below:
(a) linear kernel function:
K(x,xi)=(xi·x) (38)
(b) Polynomial kernel function[17]:
K(x, xi) = [p(xi·x) + s]^q (39)
(c) Sigmoid kernel function[18]:
K(x,xi)=tanh (μ (xi·x)+c) (40)
(d) Radial basis kernel function:
K(x, xi) = exp(-γ||x - xi||²) (41)
Among the above kernel functions, the radial basis function is the most widely used: it has a wide convergence domain and is suitable for various situations, such as low-dimensional, high-dimensional, small-sample, and large-sample cases. The radial basis kernel function is also selected for the audio classification here, with the value of γ set to 8.
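The four kernels (38)-(41) can be sketched directly; the parameter defaults are illustrative, except γ = 8, which is the value stated above:

```python
import math

def linear_kernel(x, xi):
    return sum(a * b for a, b in zip(x, xi))

def polynomial_kernel(x, xi, p=1.0, s=1.0, q=2):
    return (p * linear_kernel(x, xi) + s) ** q

def sigmoid_kernel(x, xi, mu=0.5, c=0.0):
    return math.tanh(mu * linear_kernel(x, xi) + c)

def rbf_kernel(x, xi, gamma=8.0):
    """Radial basis kernel with gamma = 8, the value used in the text."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-gamma * sq_dist)

k = rbf_kernel([1.0, 2.0], [1.0, 2.0])  # identical inputs give a kernel value of 1
```

Swapping the kernel changes the implicit feature space and hence the shape of the resulting nonlinear classifier.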
2.4.2 Multi-class classification methods for support vector machines
In recent years, the SVM multi-class classification methods proposed by researchers at home and abroad can be roughly divided into two classes. One is to extend the basic two-class SVM into a multi-class SVM and solve the optimization problem directly; this uses many variables, so it is impractical because the computational complexity is too high. The other is to gradually convert the multi-class classification problem into two-class classification problems, i.e. to form a multi-class classifier from multiple two-class SVM classifiers. At present this method is widely applied, and there are two common classification strategies: the one-against-one[20] strategy and the one-against-all strategy.
(1) One-against-one strategy. This strategy was proposed by Knerr et al. in 1990. Its main idea is to construct a separating hyperplane for every pair of classes when classifying, so as to separate the N classes. To classify N classes, this strategy requires N*(N-1)/2 two-class SVM classifiers in total: for each pair of classes, a two-class classifier is trained on the samples of those two classes. In the recognition process, each test sample is input into the N*(N-1)/2 two-class classifiers, the classification result of each classifier counts as a vote, and the class receiving the most votes is the final classification result of the sample; this strategy is therefore known as the "voting method".
(2) One-against-all strategy. This method was proposed by Bottou et al. in 1994. Its main idea is: for a multi-class problem with N training-sample classes, a separating hyperplane is constructed between the i-th class and the other N-1 classes. The algorithm therefore constructs N two-class SVM classifiers. When the i-th classifier is trained, the samples of the i-th class are labeled +1 and the sample points of the other classes are labeled -1, and a two-class training problem is executed. In the recognition process, each sample to be recognized is input into the N trained classifiers, and the output values of the classifiers are compared to obtain the classification result. The one-against-all strategy requires each classifier to output the probability that the sample belongs to the class the classifier discriminates; all the output probability values are then compared, and the class of the classifier with the maximum probability is taken as the class of the sample. A standard support vector machine outputs a specific class rather than a probability value; therefore, when applying the one-against-all strategy, we do not use the discrete class decision of the SVM but its probability output. Through this calculation, each sample has a probability value for each class, indicating the probability that the sample belongs to that class; finally, the classifier with the maximum output probability is selected, and its positive class is taken as the final classification result of the sample. The one-against-all strategy is simple and effective and has a very short training time; it is more suitable than the one-against-one strategy for large-scale data classification.
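A sketch of the one-against-one voting rule described above; the three pairwise "classifiers" here are simple threshold stand-ins, not trained SVMs:

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, pairwise_classifiers, classes):
    """Vote over all N*(N-1)/2 pairwise classifiers; the most-voted class wins.

    pairwise_classifiers maps a class pair (a, b) to a function f(x)
    returning either a or b.
    """
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[pairwise_classifiers[pair](x)] += 1
    return votes.most_common(1)[0][0]

# Toy 1-D problem with three classes split by thresholds (stand-in classifiers)
classes = ["speech", "music", "noise"]
classifiers = {
    ("speech", "music"): lambda x: "speech" if x < 5 else "music",
    ("speech", "noise"): lambda x: "speech" if x < 8 else "noise",
    ("music", "noise"): lambda x: "music" if x < 8 else "noise",
}
predicted = one_vs_one_predict(3.0, classifiers, classes)
```

With N = 3 classes this uses 3 = N*(N-1)/2 pairwise classifiers, matching the count given above.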
2.5 Audio segmentation technology
The purpose of audio segmentation is to use a computer program to intelligently segment a data stream into consistent pieces of different lengths and attributes, so as to save the time, labour, and capital cost of manual segmentation. The so-called consistency means that the feature parameters of an audio segment are the same or similar in the time domain or the frequency domain.
2.5.1 Audio segmentation algorithm based on the BIC criterion
Audio segmentation based on the Bayesian information criterion (BIC) is a widely used method. The BIC criterion checks whether a model fits the data by trading off the maximum likelihood of the sample against the complexity of the model, where the complexity of the model usually refers to its number of parameters. In recent years, owing to its excellent performance, it has been introduced into audio segmentation and clustering problems. Suppose X = {xi: i = 1,2,...,N} is the audio sequence to be tested, N is the signal length, M = {mi: i = 1,2,...,K} are the candidate model parameters, L(X,M) is the maximum likelihood of the sample data X under the model M, and m is the number of parameters of the model M. The BIC criterion is defined in equation (42):
where λ is a penalty factor, usually taken to be 1.
Assume the signal X follows a multivariate Gaussian distribution, and let Y = {y1,y2,...,yn} be a window of the signal, where n is the window length. To detect whether there is a change point in Y, every point i (0 < i < n) in Y must be examined. Suppose Y is divided into two parts by the point i: Y1 = {y1,y2,...,yi} and Y2 = {yi+1,yi+2,...,yn}, and let H0 and H1 be the hypotheses that Y contains no change point or exactly one change point, respectively; the mathematical description is given in formula (43):
The corresponding maximum likelihood ratio can be described by equation (44):
R(i) = n·ln|Σ| - n1·ln|Σ1| - n2·ln|Σ2| (44)
where μ, μ1, μ2 are the means of Y, Y1, Y2, respectively, Σ, Σ1, Σ2 are the corresponding covariance matrices, and n, n1, n2 are the corresponding signal lengths.
Comparing the H0 and H1 models, the difference between their BIC values is defined by equation (45):
ΔBIC = BIC(H1) - BIC(H0) = R(i) - λP (45)
where P = (1/2)(d + d(d+1)/2)·ln(n) and d is the dimension of the sample space. If the maximum of ΔBIC over all candidate change points in the sequence is greater than 0, there is a change point in Y and hypothesis H1 holds. The condition is described in equation (46):
{ max Δ BIC (i) } > 0 (46)
When formula (46) is satisfied, there is a change point in Y, and its position is given by equation (47):
If formula (46) is not satisfied, hypothesis H0 holds, that is, there is no change point in Y, and a new window Y is formed by enlarging n to continue the BIC detection. For single change points and multiple change points, Chen et al. proposed their own solutions[24], which are a better choice for short recordings with many transitions. However, if the sequence to be tested is too long and no change point is detected for a long time, the amount of computation will undoubtedly increase. In addition, this method is prone to accumulating errors: if a wrong change point is detected earlier, the error is carried forward and is not corrected later.
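A 1-D sketch of the ΔBIC test above (d = 1 and λ = 1, so the covariances reduce to variances), following formulas (44)-(47) as written; the toy change point and the 2-sample margin at the window edges are assumptions:

```python
import math
import random

def log_var(seg):
    mu = sum(seg) / len(seg)
    var = sum((s - mu) ** 2 for s in seg) / len(seg)
    return math.log(var)

def delta_bic(y, i, lam=1.0):
    """Delta-BIC at split point i for a 1-D window y (d = 1), following
    R(i) = n*ln|S| - n1*ln|S1| - n2*ln|S2| and P = (1/2)(d + d(d+1)/2)*ln(n)."""
    n, n1, n2 = len(y), i, len(y) - i
    r = n * log_var(y) - n1 * log_var(y[:i]) - n2 * log_var(y[i:])
    p = 0.5 * (1 + 1) * math.log(n)  # d = 1
    return r - lam * p

def best_split(y):
    """Most likely change point, or None if max Delta-BIC <= 0 (formula (46))."""
    margin = 2  # keep at least 2 samples per side so the variances are defined
    scores = {i: delta_bic(y, i) for i in range(margin, len(y) - margin)}
    i_best = max(scores, key=scores.get)
    return i_best if scores[i_best] > 0 else None

# Two clearly different Gaussian-like regimes joined at index 50
random.seed(0)
y = ([random.gauss(0.0, 1.0) for _ in range(50)]
     + [random.gauss(5.0, 1.0) for _ in range(50)])
split = best_split(y)
```

With a clear mean shift the maximum of ΔBIC lands at (or very near) the true change point.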
2.5.2 Improved BIC audio segmentation algorithm
Although audio segmentation based on BIC has various defects, its advantages cannot be ignored. To guarantee the robustness of the algorithm, the various shortcomings only need slight modification. Below are descriptions of some improvements with higher recognition accuracy made by later researchers to address these defects.
Because the traditional BIC method accumulates errors and its computation grows greatly with the window length, later researchers proposed a more intuitive improved method based on sliding a window of fixed initial length. For each BIC detection window, the initial window length is constant. If a change point is detected, the window slides forward by a specific length to the next position. If no change point is detected, the window length increases; but once the window length has grown to a certain extent without finding a change point, the window keeps its current length and slides forward until a change point is found, after which the initial window length is restored. Once a change point is detected, the window length is not increased further, and the window immediately moves on past the detected point.
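The sliding-window control loop described above can be sketched as follows; the window lengths, the growth step, and the stub change-point detector are all assumptions standing in for the ΔBIC test:

```python
def segment_stream(y, detect, init_len=100, grow=50, max_len=300):
    """Fixed-initial-length sliding-window segmentation loop: grow the window
    while nothing is found, cap its length, and restart just after each
    detected change point. detect(window) returns an in-window index or None."""
    change_points, start, win = [], 0, init_len
    while start + win <= len(y):
        hit = detect(y[start:start + win])
        if hit is not None:
            change_points.append(start + hit)
            start, win = start + hit, init_len  # restart after the change point
        elif win + grow <= max_len:
            win += grow                          # no hit: enlarge the window
        else:
            start += grow                        # cap reached: slide forward
    return change_points

# Stub detector (an assumption, standing in for the Delta-BIC test): the true
# change points of this toy stream are known in advance
true_points = {250, 620}
def detect(window_values):
    # the toy stream's "samples" are their own indices, so offsets are easy
    hits = [v - window_values[0] for v in window_values if v in true_points]
    return hits[0] if hits and hits[0] > 0 else None

stream = list(range(1000))
found = segment_stream(stream, detect)
```

Restarting the window just after each detected point is what keeps an early error from propagating through the rest of the stream.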
The application of the invention is further described below with reference to experiments.
All experiments were completed in the Matlab R2014b environment, on a Windows 7 64-bit operating system with an Intel Core CPU at a clock frequency of 3.40 GHz and 8 GB of memory.
The test audio data were sorted manually into silence/noise, pure speech, speech with background sound, music, ambient sound, and so on, and were used as training samples and test samples.
There are many audio formats, such as wav, mp3, and midi; the channel layouts include mono, two-channel, and multi-channel; the sampling rates include 44.1 kHz, 32 kHz, 16 kHz, and 8 kHz, with precisions of 32, 16, and 8 bits. Before the experiments the audio was standardized: the sampling frequency is 44.1 kHz, the quantization precision is 16 bits, the audio files are unified to wav, and single-channel data are analyzed. After manual classification the audio is divided into 3600 clips: 760 silent clips, 630 noise clips, 570 music clips, 530 pure speech clips, 560 speech clips with background sound, and 550 ambient sound clips.
The application of the invention is further described below with reference to silence and noise classification.
1) Silence and noise use a rule-based classification method. The experiment is designed as follows: the silence and noise threshold judgement is applied to all samples, the number of correct classifications is recorded, the number of misclassifications is calculated (for silence, the number of clips that are not silent but are judged to be silent), and the classification precision is computed. The experimental results are as follows:
Table 1 Noise/silence classification results

          Correctly classified   Misclassified   Classification accuracy
Noise     541                    127             85.87%
Silence   709                    23              93.28%
Compared with the other classes, the energy of silence is obviously different, so its recognition precision is very high. The misclassifications arise mainly because a clip may contain both silence and other audio categories, so its average energy may be relatively small; this can be alleviated by lowering the energy threshold. The recognition precision for noise is 85.87%. The analysis shows that the noise sources occurring in different audio categories differ, so the time-frequency characteristics of the noise also differ, and a single threshold lacks generality for this judgement. Therefore the accuracy of the noise judgement in the test is not high and the false positive rate is high: ambient sounds with little variation in the energy spectrum are easily misjudged as noise.
2) Classification of the remaining audio categories
An SVM-based classifier is used to classify pure speech versus speech with background sound, and music versus ambient sound. Each classification experiment was run three times.
Table 2 Classification results for pure speech / speech with background sound
Table 3 Classification results for music / ambient sound
The experiments show that the classification precision of the support vector machine classifier is high: the average precision for pure speech versus speech with background sound is 91.28%, and the average precision for music versus ambient sound is 90.77%. The experimental data indicate that the proposed SVM classifier achieves good classification performance and accuracy on the audio classification task.
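As a sketch of the SVM decision stage, the toy example below trains a minimal linear SVM (hinge loss, subgradient descent) on two hypothetical, standardized clip-level features. The patent's actual feature set and kernel are not reproduced here, so the data and hyperparameters are purely illustrative:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam*||w||^2/2 + mean hinge loss; y must be in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    n = len(y)
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1                      # margin violators
        gw = lam * w - (X[viol] * y[viol][:, None]).sum(axis=0) / n
        gb = -y[viol].sum() / n
        w -= lr * gw
        b -= lr * gb
    return w, b

rng = np.random.default_rng(0)
# Hypothetical standardized features for two classes, e.g. "pure speech"
# (+1) vs "speech with background sound" (-1).
pos = rng.normal([+1.5, -1.0], 0.3, size=(50, 2))
neg = rng.normal([-1.5, +1.0], 0.3, size=(50, 2))
X = np.vstack([pos, neg])
y = np.array([+1] * 50 + [-1] * 50)

w, b = train_linear_svm(X, y)
train_acc = float((np.sign(X @ w + b) == y).mean())
```

In practice an off-the-shelf kernel SVM would replace this hand-rolled linear trainer; the sketch only shows the margin-maximising decision rule that underlies the classifier.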
3) Traditional ΔBIC segmentation versus the improved ΔBIC segmentation
Both the traditional segmentation method and the improved segmentation method were tested. To compare the precision of the two, the classification results were segmented with each method, verifying the soundness of the traditional method and the effectiveness of the improved one.
Table 4 Segmentation test results

Segmentation method   Detected segments   Correct/precision   Missed   False alarms
Traditional method    165                 127/82.5%           27       38
Improved method       148                 135/87.6%           9        13
Compared with the improved ΔBIC segmentation method, the traditional ΔBIC method detects far more candidate segmentation points. The reason is that the traditional method only smooths the classification results and then directly merges adjacent audio of the same category to obtain the segmentation; it does not consider the interaction between adjacent segments and ignores the global optimization of the segmentation. This amounts to relaxing the constraints on audio shot segmentation: while the hit rate improves, misclassifications inevitably increase, so more audio shots are detected. The improved method instead casts segmentation as the solution of an optimization problem. It is a dynamic method that fully accounts for the interaction between segments and the global optimization of the segmentation, which greatly reduces the number of false alarms and also improves the precision. The results show that the practical efficiency of the optimization method is clearly higher than that of the traditional method.
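The ΔBIC statistic underlying both segmentation methods can be sketched as follows: for a candidate boundary t inside a window of frame-level feature vectors, it compares modelling the window with one full-covariance Gaussian against two, so a positive value favours a split. The penalty weight λ = 1 and the small diagonal loading are conventional choices, not values from the patent:

```python
import numpy as np

def delta_bic(X, t, lam=1.0):
    """ΔBIC for splitting the (n_frames x dim) feature sequence X at
    frame t: 0.5*(n*log|S| - t*log|S1| - (n-t)*log|S2|) minus the BIC
    model-complexity penalty. Positive values indicate a change point."""
    n, d = X.shape

    def logdet_cov(Z):
        # Diagonal loading keeps the covariance estimate well conditioned.
        return np.linalg.slogdet(np.cov(Z, rowvar=False) + 1e-6 * np.eye(d))[1]

    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet_cov(X)
                  - t * logdet_cov(X[:t])
                  - (n - t) * logdet_cov(X[t:])) - penalty

# Two segments drawn from clearly different Gaussians: ΔBIC at the true
# boundary is large and positive, while inside a homogeneous segment the
# penalty dominates and ΔBIC goes negative.
rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=(200, 4))
b = rng.normal(5.0, 2.0, size=(200, 4))
X = np.vstack([a, b])
score_at_boundary = delta_bic(X, 200)
score_inside = delta_bic(a, 100)
```

Both the traditional and the improved methods threshold this score; they differ in how the windows are placed and how the resulting split points are globally reconciled.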
The effect of the invention is further described below.
Audio classification is the basis of deep audio-information processing, the core technology of audio structuring, and an important means of extracting audio structure and semantic content. It divides audio data into different categories according to perceived characteristics or expressed content, and plays an important role in content-based video segmentation, speech retrieval and audio monitoring. It is also one of the key technologies of audio-information processing, audio retrieval and data management. Although the history of audio classification research is not long, researchers have studied the field in considerable detail, which has both turned the knowledge of the field into a complete system and, to some extent, promoted the development of audio-information processing technology. In essence, audio classification can be regarded as a pattern-recognition process whose two key aspects are audio feature analysis and extraction, and the design and implementation of the classifier. The SVM-based audio classification algorithm divides audio into six classes: silence, noise, music, ambient sound, pure speech, and speech with background sound. On the basis of the classification, a smoothing criterion is proposed and the classification results are smoothed; finally the audio stream is segmented according to the classification. The experimental results show that the SVM-based classification algorithm achieves good classification performance and high precision; smoothing further improves the precision, reduces the misclassification rate, and makes the segmentation results more accurate.
The above is only a preferred embodiment of the present invention and is not intended to limit the invention; any modification, equivalent replacement or improvement made within the spirit and principle of the invention shall fall within its protection scope.

Claims (10)

1. An SVM-based audio classification method, characterized in that the SVM-based audio classification method comprises:
in feature extraction, extracting the characteristics of the audio using time-domain features and frequency-domain features respectively;
in audio classification, classifying with a classification method based on a support vector machine (SVM);
in audio segmentation, confirming the segmentation with an audio segmentation method based on the Bayesian information criterion (BIC), wherein audio segmentation extracts the different audio categories from the classified audio stream and divides the categories of the audio stream along the time axis.
2. The SVM-based audio classification method of claim 1, characterized in that before feature extraction the following is performed:
audio signal preprocessing: first, the original audio signal is preprocessed and split into segments, and each audio segment is windowed and framed; second, audio-frame and audio-segment features are extracted, and the extracted features are fused.
3. The SVM-based audio classification method of claim 1, characterized in that extracting the characteristics of the audio using time-domain features and frequency-domain features respectively specifically comprises:
1) audio time-domain feature analysis and extraction: time-domain features represent the time-domain characteristics of the signal, which is analysed frame by frame from the time-domain waveform; specifically:
zero-crossing rate (ZCR): the ratio between the number of sign changes of adjacent sample values and the total number of samples of the discrete audio signal; the zero-crossing rate reflects how frequently the signal crosses zero:
ZCR = (1/(2N)) * Σ_{m=1..N-1} |sgn(x(m)) − sgn(x(m−1))|
where x(m) is the processed discrete audio signal;
short-time average energy: an audio feature parameter reflecting the variation of the audio energy; it depends directly on the choice of the window length N: if N is too large, the energy variation is smoothed out and differences are not reflected, while a window that is too narrow yields an energy function that is not smooth; a Hamming window is chosen to keep a good balance between the two; the short-time average energy is computed as:
E_m = Σ_{n=0..N-1} [x(n) · w(n)]^2
where x(n) is the n-th signal value in the m-th frame of the audio signal and w(n) is the window function; the short-time energy is compared against a threshold, below which the frame is judged to be silence;
2) audio frequency-domain feature analysis and extraction: the feature value of each frame is computed, and then the segment-level feature values are computed;
3) feature analysis and extraction based on audio segments, comprising:
silence ratio: a threshold is set on the frequency-domain energy;
sub-band energy ratio: the feature is computed as the average of the sub-band energy-ratio parameters;
bandwidth mean: the average bandwidth of the frames in the audio segment;
high zero-crossing-rate ratio: the high zero-crossing-rate computation is performed;
low-frequency-energy frame ratio: the ratio of low-frequency-energy frames in the audio segment is computed;
spectral flux: describes the mean of the spectral differences between adjacent audio frames in the audio segment;
fundamental-frequency standard deviation of the audio segment: the pitch frequency of each frame is computed, and the standard deviation of these pitch-frequency parameters is then calculated;
the feature-vector set is composed of a 24-dimensional MFCC vector and an 11-dimensional feature vector extracted from the audio segment.
4. The SVM-based audio classification method of claim 1, characterized in that the audio classification method comprises:
1) silence and noise are classified with a rule-based method;
2) classification of the remaining audio categories: the SVM-based classifier classifies pure speech versus speech with background sound, and music versus ambient sound.
5. The SVM-based audio classification method of claim 1, characterized in that the audio classification method comprises an improved ΔBIC segmentation method: for each detected BIC window, if a split point is detected, the window slides forward by a specified length; if no split point is detected, the window length is increased; when the window length has grown to a certain limit and still no split point has been found, the window keeps its current length and slides forward until a split point is found, after which the initial window length is restored; whenever a split point is detected, the window immediately moves past it.
6. A computer program implementing the SVM-based audio classification method of any one of claims 1 to 5.
7. A terminal, characterized in that the terminal carries at least a processor implementing the SVM-based audio classification method of any one of claims 1 to 5.
8. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the SVM-based audio classification method of any one of claims 1 to 5.
9. An SVM-based audio classification control system implementing the SVM-based audio classification method of claim 1.
10. A multimedia signal-processing device carrying the SVM-based audio classification control system of claim 9.
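The variable-window search recited in claim 5 can be sketched as below. The change test here is a crude mean-shift surrogate standing in for the ΔBIC statistic, and the window lengths, growth step and threshold are all illustrative assumptions:

```python
import numpy as np

def window_has_change(seg, thresh=3.0):
    """Surrogate change test: flag a change when the two halves of the
    window differ strongly in mean relative to their pooled std."""
    h = len(seg) // 2
    a, b = seg[:h], seg[h:]
    pooled = np.sqrt((a.var() + b.var()) / 2) + 1e-12
    return abs(a.mean() - b.mean()) / pooled > thresh

def segment(signal, w0=50, grow=25, wmax=400):
    """Claim-5 style search: grow the window while no split point is
    found; at the maximum length keep it and slide forward; once a
    split point is detected, move past it and restore the initial
    window length."""
    cuts, start, w = [], 0, w0
    while start + w <= len(signal):
        seg = signal[start:start + w]
        if window_has_change(seg):
            cut = start + len(seg) // 2       # crude split-point estimate
            cuts.append(cut)
            start, w = cut, w0                # slide past it, reset length
        elif w < wmax:
            w += grow                         # no split point: enlarge
        else:
            start += grow                     # keep max length, slide on
    return cuts

# A signal with one clear level change at sample 300.
rng = np.random.default_rng(2)
x = np.concatenate([np.zeros(300), np.ones(300)]) + 0.01 * rng.normal(size=600)
cuts = segment(x)
```

In a full implementation, `window_has_change` would be replaced by a ΔBIC test on frame-level feature vectors, and the detected split point would be the frame maximising ΔBIC inside the window rather than its midpoint.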
CN201811581291.2A 2018-12-24 2018-12-24 A kind of audio frequency classification method and system based on SVM Pending CN109766929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811581291.2A CN109766929A (en) 2018-12-24 2018-12-24 A kind of audio frequency classification method and system based on SVM

Publications (1)

Publication Number Publication Date
CN109766929A true CN109766929A (en) 2019-05-17

Family

ID=66451345

Country Status (1)

Country Link
CN (1) CN109766929A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1716380A (en) * 2005-07-26 2006-01-04 浙江大学 Audio frequency splitting method for changing detection based on decision tree and speaking person
CN1897109A (en) * 2006-06-01 2007-01-17 电子科技大学 Single audio-frequency signal discrimination based on MFCC
CN101685446A (en) * 2008-09-25 2010-03-31 索尼(中国)有限公司 Device and method for analyzing audio data
CN102426835A (en) * 2011-08-30 2012-04-25 华南理工大学 Method for identifying local discharge signals of switchboard based on support vector machine model
CN103854646A (en) * 2014-03-27 2014-06-11 成都康赛信息技术有限公司 Method for classifying digital audio automatically
CN109036458A (en) * 2018-08-22 2018-12-18 昆明理工大学 A kind of multilingual scene analysis method based on audio frequency characteristics parameter

Non-Patent Citations (2)

Title
Leng Jiaojiao: "Research on Audio Segmentation Algorithms for Piano Tones", China Excellent Master's Theses Full-text Database, Information Science and Technology Series (monthly) *
Bai Liang: "Research on Audio Classification and Segmentation Technology", China Excellent Doctoral and Master's Theses Full-text Database (Master), Information Science and Technology Series (monthly) *

Cited By (15)

Publication number Priority date Publication date Assignee Title
CN110634497B (en) * 2019-10-28 2022-02-18 普联技术有限公司 Noise reduction method and device, terminal equipment and storage medium
CN110634497A (en) * 2019-10-28 2019-12-31 普联技术有限公司 Noise reduction method and device, terminal equipment and storage medium
CN111090758A (en) * 2019-12-10 2020-05-01 腾讯科技(深圳)有限公司 Media data processing method, device and storage medium
CN111090758B (en) * 2019-12-10 2023-08-18 腾讯科技(深圳)有限公司 Media data processing method, device and storage medium
CN110931044A (en) * 2019-12-12 2020-03-27 上海立可芯半导体科技有限公司 Radio frequency searching method, channel classification method and electronic equipment
CN110955789A (en) * 2019-12-31 2020-04-03 腾讯科技(深圳)有限公司 Multimedia data processing method and equipment
CN110955789B (en) * 2019-12-31 2024-04-12 腾讯科技(深圳)有限公司 Multimedia data processing method and equipment
CN113628403A (en) * 2020-07-28 2021-11-09 威海北洋光电信息技术股份公司 Optical fiber vibration sensing perimeter security intrusion behavior recognition algorithm based on multi-core support vector machine
CN114141244A (en) * 2020-09-04 2022-03-04 四川大学 Voice recognition technology based on audio media analysis
CN112241776B (en) * 2020-09-04 2022-06-10 浙江大学 Groove type ultra-wideband depolarized chipless RFID (radio frequency identification) tag
US11822993B2 (en) 2020-09-04 2023-11-21 Zhejiang University Slot-type ultra-wideband depolarized chipless RFID tag
CN112241776A (en) * 2020-09-04 2021-01-19 浙江大学 Groove type ultra-wideband depolarized chipless RFID (radio frequency identification) tag
CN113257276A (en) * 2021-05-07 2021-08-13 普联国际有限公司 Audio scene detection method, device, equipment and storage medium
CN113257276B (en) * 2021-05-07 2024-03-29 普联国际有限公司 Audio scene detection method, device, equipment and storage medium
CN115376518A (en) * 2022-10-26 2022-11-22 广州声博士声学技术有限公司 Voiceprint recognition method, system, device and medium for real-time noise big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination