CN101546556A

CN101546556A - Classification system for identifying audio content

Info

Publication number: CN101546556A
Application number: CN200810035351A
Authority: CN
Inventors: 黄鹤云; 林福辉
Original assignee: Spreadtrum Communications Shanghai Co Ltd
Current assignee: Spreadtrum Communications Shanghai Co Ltd
Priority date: 2008-03-28
Filing date: 2008-03-28
Publication date: 2009-09-30
Anticipated expiration: 2028-03-28
Also published as: CN101546556B

Abstract

The invention provides an audio content classification system comprising a training end and a test end. The training end extracts features of audio test samples through an audio feature extraction module and trains classifier parameters through a classifier training module. The test end comprises the audio feature extraction module shared with the training end, a classifier decision module, a transient feature extraction module, a transient feature smoothing module and an incremental learning module. The audio feature extraction module extracts audio features of the input signal; the classifier decision module takes the audio features output by the audio feature extraction module as input and classifies the first frame using the classifier parameters obtained by the training part; at the same time, the transient feature extraction module extracts the transient feature of the input signal and outputs it to the transient feature smoothing module; the transient feature smoothing module corrects the output of the classifier decision module and outputs the result; and the incremental learning module uses the class information and feature information of the classified audio frames as a group of incremental learning samples to update the classifier parameters.

Description

Classification system for audio content identification

Technical field

The present invention relates to pattern recognition and signal processing technology, and in particular to a classification system for audio content identification.

Background technology

Audio is an important medium in multimedia, and audio information retrieval is an important part of multimedia information retrieval technology. For the corresponding prior art, reference may be made to Chinese patents No. 1391211, 1223739 and 1270361 and U.S. patents No. 5,613,037, 6,292,776 and 5,440,662. In audio retrieval applications, audio data must be classified in order to determine which class the input audio signal belongs to; common audio classes include speech, background noise, pop music, classical music and so on. Audio content classification is widely used: in audio retrieval it plays a decisive role, and in the extraction of some multimedia summaries it also plays an important role as an auxiliary means of video content retrieval. More broadly, many speech and audio standards, for example AMR-WB and AMR-WB+ of 3GPP, use a speech/noise classifier and a speech/music classifier to tell the encoder what kind of audio signal the input is, so that a different coding scheme can be applied to each kind of signal. Designing a good audio content classification method is therefore important.

A typical classification method uses two indispensable modules. The first is the audio feature extraction module, whose function is to extract, from the input audio samples, information that reflects the kind of audio content. The second is the classifier, which uses this information to decide the class. Many features of audio content have proved effective for classification: time-domain features (for example zero-crossing rate, curvature, linear prediction coefficients and so on), frequency-domain features (Mel cepstrum coefficients, Fourier transform coefficients, wavelet transform coefficients and so on) and some nonlinear features (fractal and chaos parameters and so on). Many classifiers have also been widely used in the field of audio content classification. Among them, the decision tree and the k-nearest-neighbor method (K Nearest Neighbor) are two classifiers that are relatively easy to implement and understand, and they have achieved good results in classifying audio content into the three classes of speech, environmental noise and music. In addition, the speech/music classifier in the AMR-WB+ standard also uses a decision tree. The support vector machine classifier (Support Vector Machine Classifier), which has been adopted in many areas of machine learning and pattern recognition in recent years, has also proved to be a very effective method. Several other classical classifiers, for example the back-propagation neural network (Back-Propagation Neural Network), the artificial neural network (Artificial Neural Network) and clustering (Clustering) methods, have also proved effective for audio content classification.

In existing classification systems, however, the classifier parameters are fixed and cannot be updated in time, and the acoustic characteristics of sudden events cannot be handled effectively. Such systems therefore cannot meet the requirements of specific environments such as security monitoring.

Summary of the invention

The technical problem to be solved by the present invention is to provide an audio content classification system that overcomes two defects of existing classifiers: their parameters cannot be updated, and they cannot effectively handle the acoustic characteristics of sudden events.

To solve the above problem, an audio content classification system according to the present invention comprises a training end and a test end. The training end comprises an audio feature extraction module and a classifier training module: the audio feature extraction module extracts the features of the audio test samples, and the classifier training module trains the classifier parameters from the audio features collected by the audio feature extraction module and the class information of the audio signal. The test end comprises the audio feature extraction module shared with the training end, a classifier decision module, a transient feature extraction module, a transient feature smoothing module and an incremental learning module. The audio feature extraction module extracts the audio features of the input signal. The classifier decision module takes the audio features output by the audio feature extraction module as input and classifies the first frame using the classifier parameters obtained by the training part. At the same time, the transient feature extraction module extracts the transient feature of the input signal and outputs it to the transient feature smoothing module, which corrects the output of the classifier decision module and outputs the result. The incremental learning module uses the class information and feature information of the classified audio frames as a group of incremental learning samples to update the classifier parameters.

According to the above main feature, the transient feature extraction module extracts the transient feature of the current frame and makes a decision, and the transient feature smoothing module applies a different smoothing method depending on the transient feature: when the current frame is judged to be a transient frame, the second smoothing method is used; otherwise the first smoothing method is used. The first smoothing method is a smoothing method that does not depend on the transient feature, and the second smoothing method is a smoothing method that depends on the transient feature.

According to the above main feature, transient feature extraction divides the input audio frame into M segments B_{l}, l = 1, 2, ..., 32, where:

B_{l} = \{x_{N_{l} + 1}, x_{N_{l} + 2}, \ldots, x_{N_{l} + 32}\}, \quad N_{l} = \frac{lN}{64}, \quad l = 1, 2, \ldots, 64;

The amplitude sum of each segment, i.e. the sum of the absolute values of its samples, is then computed:

M_{i} = \frac{1}{32} \sum_{n \in B_{i}} | x_{n} |, \quad i = 1, 2, \ldots, 64;

Next, for each segment, the energy ratio and the amplitude-to-energy ratio relative to the preceding segments are computed:

r_{l}^{1} = \frac{E_{l}}{\min (E_{l - 1}, E_{l - 2})},

r_{l}^{2} = \frac{\max_{x_{i} \in B_{l}} x_{i}^{2}}{E_{l - 1}}, \quad l \in S,

where

E_{l} = \sum_{n \in B_{l}} x_{n}^{2}

The maximum energy ratio and the maximum amplitude-to-energy ratio are then taken on a logarithmic scale:

F_{i} = \max_{l} (\log r_{l}^{i}), \quad i = 1, 2,

The transient feature can therefore be computed as:

F = 0.45 F_{1} + 0.55 F_{2};

Once the transient feature has been obtained, F is compared with a first threshold. If F is greater than the first threshold, the frame is a transient frame and the second smoothing method is used; otherwise the first smoothing method is used.

According to the above main feature, the first smoothing method first analyzes the previous three frames; if their classification results are "non-sudden-event frame, sudden-event frame, non-sudden-event frame", all three frames are smoothed to non-sudden-event frames. In one embodiment of the second smoothing method, when the feature F is greater than a second threshold, the three frames before this frame and the three frames after it are all set to sudden-event frames.

According to the above main feature, the second threshold is larger than the first threshold.

According to the above main feature, the classifier parameters are updated by combining the previously stored training data with the incremental learning samples into a larger training set and retraining the classifier.

According to the above main feature, the classifier further comprises a feature fusion module or a feature dimensionality reduction module.

According to the above main feature, principal component analysis (PCA) is used to reduce the feature dimensionality after the features have been extracted and before the classification decision.

According to the above main feature, the transient feature extraction method is perceptual entropy.

According to the above main feature, the classifier uses a decision tree method.

According to the above main feature, the classifier uses a neural network method.

According to the above main feature, the classifier uses a support vector machine method.

According to the above main feature, the classifier uses a clustering method.

According to the above main feature, the classifier uses a Bayesian method.

Compared with the prior art, the present invention adopts incremental learning and transient feature smoothing, which improves classification accuracy.

Description of drawings

Fig. 1 is an architecture diagram of the training end of an embodiment of the invention.

Fig. 2 is an architecture diagram of the test end of an embodiment of the invention.

Embodiment

Specific embodiments of the invention are described below with reference to the accompanying drawings.

Audio is an important medium in multimedia, and audio information retrieval is an important part of multimedia information retrieval technology. In audio retrieval applications, audio data must be classified in order to determine which class the input audio signal belongs to; common audio classes include speech, background noise, pop music, classical music and so on. Audio content classification is widely used: in audio retrieval it plays a decisive role, and in the extraction of some multimedia summaries it also plays an important role as an auxiliary means of video content retrieval. More broadly, many speech and audio standards, for example AMR-WB and AMR-WB+ of 3GPP, use a speech/noise classifier and a speech/music classifier to tell the encoder what kind of audio signal the input is, so that a different coding scheme can be applied to each kind of signal. Designing a good audio content classification method is therefore important.

A typical classification method uses two indispensable modules. The first is the audio feature extraction module, whose function is to extract, from the input audio samples, information that reflects the kind of audio content. The second is the classifier, which uses this information to decide the class. Many features of audio content have proved effective for classification: time-domain features (for example zero-crossing rate, curvature, linear prediction coefficients and so on), frequency-domain features (Mel cepstrum coefficients, Fourier transform coefficients, wavelet transform coefficients and so on) and some nonlinear features (fractal and chaos parameters and so on). Many classifiers have also been widely used in the field of audio content classification. Among them, the decision tree and the k-nearest-neighbor method (K Nearest Neighbor) are two classifiers that are relatively easy to implement and understand, and they have achieved good results in classifying audio content into the three classes of speech, environmental noise and music. In addition, the speech/music classifier in the AMR-WB+ standard also uses a decision tree. The support vector machine classifier (Support Vector Machine Classifier), which has been adopted in many areas of machine learning and pattern recognition in recent years, has also proved to be a very effective method. Several other classical classifiers, for example the back-propagation neural network (Back-Propagation Neural Network), the artificial neural network (Artificial Neural Network) and clustering (Clustering) methods, have also proved effective for audio content classification.

In existing classification systems, however, the classifier parameters are fixed and cannot be updated in time, and the acoustic characteristics of sudden events cannot be handled effectively, so such systems cannot meet the requirements of specific environments such as security monitoring. The invention therefore provides an audio content classification system that overcomes these two defects of existing classifiers: parameters that cannot be updated, and the inability to handle the acoustic characteristics of sudden events effectively.

Fig. 1 shows the architecture of the training end of an embodiment of the invention. The training end comprises two modules: an audio feature extraction module and a classifier training module. In the present invention, all audio signal processing is performed frame by frame. Suppose each frame of the audio signal that is read in is denoted x_{1}, x_{2}, \ldots, x_{N}; after processing by the feature extraction module, an M-dimensional feature vector (F_{1}, F_{2}, \ldots, F_{M}) is obtained, that is:

x_{1}, x_{2}, \ldots, x_{N} \xrightarrow{\text{Feature Extraction}} F_{1}, F_{2}, \ldots, F_{M}

In this embodiment, the zero-crossing rate (Zero-Crossing Rate) of the signal is used as a feature and is computed as follows:

F_{1} = ZCR = \sum_{i = 1}^{N - 1} \operatorname{sgn} (x_{i} x_{i + 1})

where sgn(x) is the sign function: it equals 1 if x is greater than zero, -1 if x is less than zero, and 0 if x equals zero.

The total energy of the signal can also be used as a feature; it is computed according to the following formula:

F_{2} = TE = \sum_{i = 1}^{N} x_{i}^{2}
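The two features above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy and a frame given as a 1-D array; the function names `zcr_feature`, `total_energy` and `extract_features` are hypothetical and not part of the patent. The first function implements the formula exactly as written, so each sign change between adjacent samples contributes -1 and each pair with the same sign contributes +1.

```python
import numpy as np

def zcr_feature(frame):
    """F1 as written in the text: sum over i of sgn(x_i * x_{i+1})."""
    x = np.asarray(frame, dtype=float)
    return float(np.sum(np.sign(x[:-1] * x[1:])))

def total_energy(frame):
    """F2: sum of squared sample values of the frame."""
    x = np.asarray(frame, dtype=float)
    return float(np.sum(x ** 2))

def extract_features(frame):
    """Map one frame x_1..x_N to the feature vector (F1, F2)."""
    return np.array([zcr_feature(frame), total_energy(frame)])
```

Under this formula a frame with many zero crossings yields a strongly negative F1, while a slowly varying frame yields a value close to N - 1.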

Once the features have been obtained, audio feature extraction is complete and classification proceeds on the basis of the features, i.e. the classifier training module is entered. The role of the classifier training module is to train the classifier parameters from the features (F_{1}, F_{2}, \ldots, F_{M}) and the class information of the audio frame, for use by the test end. Common classifier embodiments include the decision tree method, the neural network method, the support vector machine method, the clustering method, the Bayesian method and so on.

Fig. 2 shows the architecture of the test end of an embodiment of the invention. The test end comprises the audio feature extraction module shared with the training end, a classifier decision module, a transient feature extraction module, a transient feature smoothing module and an incremental learning module. The classifier decision module takes the audio features output by the audio feature extraction module as input. It classifies the first frame using the classifier obtained by the training part, and classifies all frames from the second frame onward using the classifier updated by incremental learning (described later). Embodiments may include the decision tree method, the neural network method, the support vector machine method, the clustering method, the Bayesian method and so on. While the audio feature extraction module extracts audio features from the input audio frame, the transient feature extraction module extracts the transient feature of the same frame and outputs it to the transient feature smoothing module, which corrects the output of the classifier decision module. The transient feature is defined by whether the energy of the time-domain samples rises significantly. A different smoothing method is applied depending on the transient feature: when the current frame is judged to be a transient frame, the second smoothing method is used; otherwise the first smoothing method is used. The first smoothing method is a smoothing method that does not depend on the transient feature, and the second smoothing method is a smoothing method that depends on the transient feature.

In one embodiment of transient feature extraction, the input audio frame is divided into M segments B_{l}, l = 1, 2, ..., 32, where:

B_{l} = \{x_{N_{l} + 1}, x_{N_{l} + 2}, \ldots, x_{N_{l} + 32}\}, \quad N_{l} = \frac{lN}{64}, \quad l = 1, 2, \ldots, 64;

Adjacent segments therefore overlap by half. The amplitude sum of each segment, i.e. the sum of the absolute values of its samples, is then computed:

M_{i} = \frac{1}{32} \sum_{n \in B_{i}} | x_{n} |, \quad i = 1, 2, \ldots, 64;

Next, for each segment, the energy ratio and the amplitude-to-energy ratio relative to the preceding segments are computed:

r_{l}^{1} = \frac{E_{l}}{\min (E_{l - 1}, E_{l - 2})},

r_{l}^{2} = \frac{\max_{x_{i} \in B_{l}} x_{i}^{2}}{E_{l - 1}}, \quad l \in S,

where

E_{l} = \sum_{n \in B_{l}} x_{n}^{2}

The maximum energy ratio and the maximum amplitude-to-energy ratio are then taken on a logarithmic scale:

F_{i} = \max_{l} (\log r_{l}^{i}), \quad i = 1, 2,

The transient feature can therefore be computed as:

F = 0.45 F_{1} + 0.55 F_{2}.
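The computation of F can be sketched as follows. This is only an illustration under several assumptions that the text leaves open: the hop between segments is taken as N/64 samples and the segment length as twice the hop (32 samples for N = 1024, matching the half overlap), segments that would run past the end of the frame are dropped, the index set S is assumed to be all segments that have two predecessors, and a small constant `eps` (hypothetical) guards against division by zero and log of zero. The amplitude sum M_i is not computed because it does not enter F. The function name `transient_feature` is illustrative.

```python
import numpy as np

def transient_feature(frame, n_hops=64, eps=1e-12):
    """Sketch of F = 0.45*F1 + 0.55*F2 for one frame (assumptions in lead-in)."""
    x = np.asarray(frame, dtype=float)
    hop = len(x) // n_hops
    if hop < 1:
        raise ValueError("frame must contain at least n_hops samples")
    seg_len = 2 * hop  # half-overlapping segments
    starts = range(0, len(x) - seg_len + 1, hop)
    segments = np.stack([x[s:s + seg_len] for s in starts])

    energy = np.sum(segments ** 2, axis=1) + eps   # E_l (eps avoids 0/0)
    peak = np.max(segments ** 2, axis=1)           # max of x_i^2 within B_l

    # Assumed S: every segment l that has predecessors l-1 and l-2.
    r1 = energy[2:] / np.minimum(energy[1:-1], energy[:-2])
    r2 = peak[2:] / energy[1:-1]

    f1 = np.max(np.log(r1))
    f2 = np.max(np.log(r2 + eps))
    return float(0.45 * f1 + 0.55 * f2)
```

A frame containing a short burst on top of low-level noise gives a much larger F than the noise alone, which is the behavior the threshold test relies on.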

Once the transient feature has been obtained, it is used to decide which smoothing method to start. The transient feature may be one-dimensional or higher-dimensional, and the decision output has at least two values, indicating whether the frame is a transient frame or a non-transient frame. In one embodiment, F is compared with a first threshold: if F is greater than the first threshold, the frame is a transient frame and the second smoothing method is applied to the classification results; otherwise the first smoothing method is applied. In one embodiment of the first smoothing method (i.e. when the current frame is a non-transient frame), the previous three frames are analyzed first; if their classification results are "non-sudden-event frame, sudden-event frame, non-sudden-event frame", all three frames are smoothed to non-sudden-event frames. In one embodiment of the second smoothing method, when the feature F is greater than a second threshold (usually larger than the first threshold), the three frames before this frame and the three frames after it are all set to sudden-event frames.
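The two smoothing rules can be sketched as below. This is a hedged illustration operating offline on a whole sequence of frames: labels are assumed to be 1 for a sudden-event frame and 0 otherwise, `t1` and `t2` stand for the first and second thresholds (their values are not given in the text), the "previous three frames" are taken to be the three frames immediately before the current one, and marking the transient frame itself as a sudden event is an assumption. The function name `smooth_labels` is hypothetical.

```python
def smooth_labels(labels, transient, t1, t2):
    """Correct per-frame labels using the per-frame transient feature F."""
    out = list(labels)
    n = len(out)
    for k in range(n):
        if transient[k] > t1:
            # Transient frame: second smoothing method.
            if transient[k] > t2:
                # Mark three frames before and after (and, by assumption,
                # the frame itself) as sudden-event frames.
                for j in range(max(0, k - 3), min(n, k + 4)):
                    out[j] = 1
        else:
            # Non-transient frame: first smoothing method.
            # An isolated event among the previous three frames is removed.
            if k >= 3 and out[k - 3:k] == [0, 1, 0]:
                out[k - 2] = 0
    return out
```

For example, an isolated sudden-event label surrounded by non-transient frames is removed, while a single strongly transient frame expands into a run of seven sudden-event labels.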

The incremental learning module uses the class information and feature information of the classified audio frames as a group of incremental learning samples to update the classifier parameters. In one embodiment, the previously stored training data and the incremental learning samples are combined into a larger training set and the classifier is retrained, thereby updating the classifier parameters.
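The retraining embodiment can be sketched as follows. A nearest-centroid classifier is swapped in purely for brevity; the patent names decision trees, neural networks, support vector machines, clustering and Bayesian methods instead. The class names `NearestCentroid` and `IncrementalRetrainer` and their methods are hypothetical, and using the raw predicted label (rather than the smoothed one) as the sample's class is an assumption.

```python
import numpy as np

class NearestCentroid:
    """Stand-in classifier: one mean feature vector per class."""
    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        X = np.atleast_2d(np.asarray(X, dtype=float))
        dist = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[np.argmin(dist, axis=1)]

class IncrementalRetrainer:
    """Keeps the stored training data and retrains after every classified frame."""
    def __init__(self, X_train, y_train):
        self.X = np.asarray(X_train, dtype=float)
        self.y = np.asarray(y_train)
        self.model = NearestCentroid().fit(self.X, self.y)

    def update(self, feature, label):
        # Merge the new sample with the stored data, then retrain from scratch.
        self.X = np.vstack([self.X, feature])
        self.y = np.append(self.y, label)
        self.model = NearestCentroid().fit(self.X, self.y)

    def classify_and_update(self, feature):
        label = self.model.predict(feature)[0]
        self.update(feature, label)
        return label
```

Retraining on the full merged set is simple but its cost grows with every frame; a caller that applies smoothing could instead pass the corrected label to `update` directly.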

It should be noted that the above description covers only some preferred embodiments. In all of the classifiers described above, any one feature extraction algorithm or several feature extraction algorithms may be used, and a feature fusion module or a feature dimensionality reduction module may be added to any of them; a preferred approach is to use principal component analysis (PCA) to reduce the feature dimensionality after the features have been extracted and before the classification decision. Any classification method may be used in the classifier; variants include the support vector machine classifier and the neural network classifier. In addition, the transient feature extraction method may be any method, one variant being perceptual entropy, and it may extract a one-dimensional feature or a higher-dimensional feature. The output of the transient frame decision may be a two-valued result or a result with more values, and the transient frame decision may use any method, one variant being the support vector machine method. The classification result smoothing algorithm may likewise be any method.
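The PCA step mentioned above can be sketched with a singular value decomposition. This is a minimal illustration assuming NumPy; the function names `fit_pca` and `apply_pca` and the choice of the number of components are hypothetical, since the text does not specify them.

```python
import numpy as np

def fit_pca(X, n_components):
    """Learn the mean and the top principal directions from training features."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(x, mean, components):
    """Project one feature vector (or a matrix of them) onto the components."""
    return (np.asarray(x, dtype=float) - mean) @ components.T
```

The mean and components would be fitted once on the training features and then applied to every frame at the test end before the classifier decision.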

In addition, in all of the classifiers described above, the incremental learning module may use any incremental learning method.

It will be understood that those of ordinary skill in the art may make equivalent substitutions or changes according to the technical solution of the present invention and its inventive concept, and all such changes or substitutions shall fall within the scope of protection of the appended claims.

Claims

1. An audio content classification system comprising a training end and a test end, characterized in that the training end comprises:

an audio feature extraction module for extracting the features of audio test samples;

a classifier training module, which trains the classifier parameters from the audio features collected by the audio feature extraction module and the class information of the audio signal;

and the test end comprises:

the audio feature extraction module shared with the training end;

a classifier decision module, which takes the audio features output by the audio feature extraction module as input and classifies the first frame using the classifier parameters obtained by the training part;

a transient feature extraction module, which extracts the transient feature of the input signal and outputs it to a transient feature smoothing module;

the transient feature smoothing module, which corrects the output of the classifier decision module and outputs the result;

an incremental learning module, which uses the class information and feature information of the classified audio frames as a group of incremental learning samples to update the classifier parameters.

2. The audio content classification system according to claim 1, characterized in that: the transient feature extraction module extracts the transient feature of the current frame and makes a decision, and the transient feature smoothing module applies a different smoothing method depending on the transient feature, wherein when the current frame is judged to be a transient frame the second smoothing method is used, and otherwise the first smoothing method is used, the first smoothing method being a smoothing method that does not depend on the transient feature and the second smoothing method being a smoothing method that depends on the transient feature.

3. The audio content classification system according to claim 2, characterized in that: transient feature extraction divides the input audio frame into M segments B_{l}, l = 1, 2, ..., 32, where:

B_{l} = \{x_{N_{l} + 1}, x_{N_{l} + 2}, \ldots, x_{N_{l} + 32}\}, \quad N_{l} = \frac{lN}{64}, \quad l = 1, 2, \ldots, 64;

M_{i} = \frac{1}{32} \sum_{n \in B_{i}} | x_{n} |, \quad i = 1, 2, \ldots, 64;

r_{l}^{1} = \frac{E_{l}}{\min (E_{l - 1}, E_{l - 2})}, \quad r_{l}^{2} = \frac{\max_{x_{i} \in B_{l}} x_{i}^{2}}{E_{l - 1}}, \quad l \in S,

where

E_{l} = \sum_{n \in B_{l}} x_{n}^{2}

the maximum energy ratio and the maximum amplitude-to-energy ratio are then taken on a logarithmic scale:

F_{i} = \max_{l} (\log r_{l}^{i}), \quad i = 1, 2,

so that the transient feature can be computed as:

F = 0.45 F_{1} + 0.55 F_{2};

4. The audio content classification system according to claim 3, characterized in that: the first smoothing method first analyzes the previous three frames, and if their classification results are "non-sudden-event frame, sudden-event frame, non-sudden-event frame", all three frames are smoothed to non-sudden-event frames; and in one embodiment of the second smoothing method, when the feature F is greater than a second threshold, the three frames before this frame and the three frames after it are all set to sudden-event frames.

5. The audio content classification system according to claim 4, characterized in that: the second threshold is larger than the first threshold.

6. The audio content classification system according to claim 1, characterized in that: the classifier parameters are updated by combining the previously stored training data with the incremental learning samples into a larger training set and retraining the classifier.

7. The audio content classification system according to claim 1, characterized in that: the classifier further comprises a feature fusion module or a feature dimensionality reduction module.

8. The audio content classification system according to claim 7, characterized in that: principal component analysis (PCA) is used to reduce the feature dimensionality after the features have been extracted and before the classification decision.

9. The audio content classification system according to claim 1, characterized in that: the transient feature extraction method is perceptual entropy.

10. The audio content classification system according to any one of claims 1 to 9, characterized in that: the classifier uses a decision tree method.

11. The audio content classification system according to any one of claims 1 to 9, characterized in that: the classifier uses a neural network method.

12. The audio content classification system according to any one of claims 1 to 9, characterized in that: the classifier uses a support vector machine method.

13. The audio content classification system according to any one of claims 1 to 9, characterized in that: the classifier uses a clustering method.

14. The audio content classification system according to any one of claims 1 to 9, characterized in that: the classifier uses a Bayesian method.