CN113807408A - Audio classification method, system and medium based on data-driven supervised dictionary learning

Info

Publication number: CN113807408A
Application number: CN202110988214.4A
Authority: CN
Inventors: 陈真; 邱小群; 向友君; 张淘珊
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2021-08-26
Filing date: 2021-08-26
Publication date: 2021-12-17
Anticipated expiration: 2041-08-26
Also published as: CN113807408B

Abstract

The invention discloses an audio classification method, system and medium based on data-driven supervised dictionary learning. The method comprises the following steps: determining the number of classes in the sample set; training class-specific dictionaries using the input samples and their corresponding class labels; obtaining sparse codes of the input samples using the trained dictionaries, and training an SVM classifier with the sparse codes as features; and classifying input samples using the trained dictionaries and the trained SVM classifier, and outputting the predicted labels. By learning one dictionary per class, the invention minimizes intra-class uniformity, maximizes class separability, promotes sparsity to control the complexity of the signal decomposition over the dictionary, minimizes class-based reconstruction errors, and promotes the pairwise orthogonality of the dictionaries. The invention can be applied to a variety of scenarios, such as computational auditory scene recognition and musical chord recognition; its results on the test data sets are relatively stable, and its generalization ability is excellent.

Description

Audio classification method, system and medium based on data-driven supervised dictionary learning

Technical Field

The invention belongs to the technical field of sparse representation and supervised dictionary learning, and in particular relates to an audio classification method, system and medium based on data-driven supervised dictionary learning.

Background

Conventional dictionary learning formulations minimize the reconstruction error between a given signal and its sparse representation over a learned dictionary. Although this formulation works well for signal denoising, it may be unsuitable for classification, where the final goal is to obtain a discriminative decomposition of the training signals over the learned dictionary. Because of these limitations of conventional dictionary learning in classification, supervised dictionary learning has been widely adopted.

Ramirez et al. suggest making the learned dictionaries as different as possible by enhancing their mutual orthogonality, with one dictionary per class, so that different information is captured by each; Fulkerson et al. propose first learning a very large dictionary and then merging its atoms according to predefined criteria, including the agglomerative information bottleneck (AIB), to obtain a compressed dictionary; Mairal et al. propose jointly learning the dictionary and the classification task; post-tensioning and young et al propose embedding class labels into the learning of the dictionaries and sparse codes to minimize intra-class differences and maximize inter-class differences.

Disclosure of Invention

The main purpose of the invention is to overcome the shortcomings of conventional dictionary learning methods on audio recognition tasks, and to provide an audio classification method, system and medium based on data-driven supervised dictionary learning.

To achieve this purpose, the invention adopts the following technical solution:

In one aspect of the present invention, an audio classification method based on data-driven supervised dictionary learning is provided, comprising the following steps:

S1, determining the number of classes C of the sample set, and training C class-specific dictionaries D_c, c ∈ [1, C], using the input samples x_n and their corresponding class labels y_n;

S2, using the trained dictionaries D_c, c ∈ [1, C], to obtain the sparse codes a_n of the input samples x_n, and training an SVM classifier with the sparse codes as features;

S3, classifying the input samples x_n using the trained dictionaries D_c, c ∈ [1, C], and the trained SVM classifier, and outputting the predicted labels ỹ_n.

As a preferred technical solution, the C class-specific dictionaries D_c, c ∈ [1, C], are trained as follows:

S11, initializing the dictionary D_c^0, the learning rate η₀, the learning rate update rate α and the number of iterations T;

S12, determining the loss function J;

S13, starting an iterative solution process of T iterations; at iteration t, fixing the dictionary D^{t-1} and computing the sparse code set A^t;

S14, fixing the sparse code set A^t and updating the dictionary D_c^t;

S15, setting t = t + 1 and entering the next iteration, until t = T.
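The alternating scheme of steps S11 to S15 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: `sparse_code` and `update_dictionary` are hypothetical callables standing in for steps S13 and S14, and the decay rule η_t = η₀ / (1 + α(t - 1)) is an assumption, since the text names the update rate α without giving the schedule.

```python
from typing import Callable, List

Matrix = List[List[float]]  # row-major matrix: m[i][j]


def train_dictionary(
    d0: Matrix,
    sparse_code: Callable[[Matrix], Matrix],
    update_dictionary: Callable[[Matrix, Matrix, float], Matrix],
    eta0: float,
    alpha: float,
    num_iters: int,
) -> Matrix:
    """Alternate between sparse coding (S13) and dictionary update (S14)."""
    d = d0  # S11: initial dictionary D^0
    for t in range(1, num_iters + 1):  # S13 to S15: t = 1, ..., T
        # Assumed learning-rate decay; the source does not specify the rule.
        eta = eta0 / (1.0 + alpha * (t - 1))
        a = sparse_code(d)  # S13: A^t computed with D^{t-1} fixed
        d = update_dictionary(d, a, eta)  # S14: D^t computed with A^t fixed
    return d
```

Passing the two steps as callables keeps the loop independent of the particular loss J, so the same skeleton applies whether the sparse coding step uses Lasso or another solver.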

As a preferred technical solution, the loss function J has the specific form:

J(A, D) = J₁(D, A) + μJ₂(D, A) + λJ₃(A) + γ₁J₄(A) + γ₂J₅(D);

where μ is the sample constraint parameter, λ is the classifier constraint parameter, γ₁ is the sparse coding constraint parameter, and γ₂ is the dictionary learning constraint parameter.

As a preferred technical solution, in the iterative solution process of T iterations, at iteration t the dictionary D^{t-1} is fixed and the sparse code set A^t is computed; specifically, A^t is obtained by minimizing the loss function J(D^{t-1}, A^t) with the Lasso algorithm.
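To illustrate the sparse coding step, the following sketch solves a plain Lasso problem, min over a of 0.5‖x - Da‖² + λ‖a‖₁, for one sample with the dictionary fixed. It uses the iterative soft-thresholding algorithm (ISTA), which is only one of several Lasso solvers and is chosen here as an assumption; the full loss J of the invention contains further terms that this sketch does not model. The function names and the `step` parameter are hypothetical.

```python
from typing import List


def soft_threshold(v: float, thresh: float) -> float:
    """Proximal operator of thresh * |v| (shrinks v towards zero)."""
    if v > thresh:
        return v - thresh
    if v < -thresh:
        return v + thresh
    return 0.0


def lasso_ista(
    x: List[float],
    d: List[List[float]],
    lam: float,
    step: float,
    num_iters: int = 200,
) -> List[float]:
    """Approximately minimise 0.5 * ||x - D a||^2 + lam * ||a||_1 over a.

    x    : signal of length m
    d    : dictionary, d[i][j] is entry i of atom j (m rows, k atoms)
    step : gradient step size; should not exceed 1 / L, where L is the
           largest eigenvalue of D^T D
    """
    m, k = len(d), len(d[0])
    a = [0.0] * k
    for _ in range(num_iters):
        # Residual x - D a with the current code.
        residual = [x[i] - sum(d[i][j] * a[j] for j in range(k)) for i in range(m)]
        # Gradient of the smooth part: -D^T (x - D a).
        grad = [-sum(d[i][j] * residual[i] for i in range(m)) for j in range(k)]
        # Gradient step followed by soft-thresholding.
        a = [soft_threshold(a[j] - step * grad[j], step * lam) for j in range(k)]
    return a
```

With the identity dictionary `[[1.0, 0.0], [0.0, 1.0]]`, the signal `[3.0, 0.2]`, `lam=0.5` and `step=1.0`, the call returns `[2.5, 0.0]`: the large coefficient is shrunk by λ and the small one is set to zero, which is the sparsity effect the Lasso step is meant to provide.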

As a preferred technical solution, fixing the sparse code set A^t and updating the dictionary D_c^t comprises the following specific steps:

S141, computing the gradient G^t of the loss function J with respect to the dictionary D;

S142, preliminary update: D_c^{t/2} = D_c^{t-1} - ηG^t;

S143, constraining the preliminarily updated dictionary through the proximal operator Prox;

S144, ending the dictionary update once J(D_c^t, A^t) < J(D_c^{t-1}, A^{t-1}).
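The dictionary update of steps S141 to S144 can be sketched as a projected gradient step. Several choices below are assumptions made only for illustration: the loss is reduced to the reconstruction term 0.5‖X - DA‖²_F, the proximal operator is taken to be the projection of each atom onto the unit ℓ2 ball, and the stopping test of S144 is realised by halving η until the loss decreases with the codes held fixed. None of these details is confirmed by the text, and all function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # row-major matrix: m[i][j]


def objective(x: Matrix, d: Matrix, a: Matrix) -> float:
    """0.5 * ||X - D A||_F^2, an assumed stand-in for the full loss J."""
    m, k, n = len(d), len(d[0]), len(x[0])
    return 0.5 * sum(
        (x[i][s] - sum(d[i][j] * a[j][s] for j in range(k))) ** 2
        for i in range(m)
        for s in range(n)
    )


def gradient(x: Matrix, d: Matrix, a: Matrix) -> Matrix:
    """Gradient of the objective with respect to D: -(X - D A) A^T."""
    m, k, n = len(d), len(d[0]), len(x[0])
    r = [[x[i][s] - sum(d[i][j] * a[j][s] for j in range(k)) for s in range(n)]
         for i in range(m)]
    return [[-sum(r[i][s] * a[j][s] for s in range(n)) for j in range(k)]
            for i in range(m)]


def prox_unit_ball(d: Matrix) -> Matrix:
    """Assumed Prox: rescale every atom (column) whose l2 norm exceeds 1."""
    m, k = len(d), len(d[0])
    norms = [math.sqrt(sum(d[i][j] ** 2 for i in range(m))) for j in range(k)]
    return [[d[i][j] / max(1.0, norms[j]) for j in range(k)] for i in range(m)]


def update_dictionary(x: Matrix, d_prev: Matrix, a: Matrix,
                      eta: float, max_tries: int = 20) -> Matrix:
    """One dictionary update with the sparse codes A held fixed."""
    m, k = len(d_prev), len(d_prev[0])
    j_prev = objective(x, d_prev, a)
    g = gradient(x, d_prev, a)  # S141
    for _ in range(max_tries):
        d_half = [[d_prev[i][j] - eta * g[i][j] for j in range(k)]
                  for i in range(m)]  # S142
        d_new = prox_unit_ball(d_half)  # S143
        if objective(x, d_new, a) < j_prev:  # S144: loss decreased
            return d_new
        eta *= 0.5  # assumed backtracking rule
    return d_prev  # no decrease found; keep the previous dictionary
```

Keeping the atoms inside the unit ball prevents the dictionary from growing without bound while the codes shrink, a common constraint in dictionary learning, though the source does not state which constraint its Prox enforces.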

As a preferred technical solution, training the SVM classifier specifically comprises: training to obtain a hyperplane that separates samples of different classes; in the testing stage, the class of a sample is determined by which side of the hyperplane it lies on.
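The hyperplane training and the side-of-hyperplane test can be illustrated with a small linear SVM. The sketch below is a two-class example trained by full-batch subgradient descent on the regularised hinge loss. The solver, the hyperparameters `reg`, `lr` and `epochs`, and the function names are all assumptions; the source does not say which SVM variant or kernel is used, nor how more than two classes are handled (a one-vs-rest arrangement would be one option).

```python
from typing import List, Tuple


def train_linear_svm(
    features: List[List[float]],
    labels: List[int],
    reg: float = 0.01,
    lr: float = 0.1,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Minimise 0.5 * reg * ||w||^2 + mean hinge loss; labels are +1 or -1."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [reg * wj for wj in w]
        grad_b = 0.0
        for a, y in zip(features, labels):
            margin = y * (sum(wj * aj for wj, aj in zip(w, a)) + b)
            if margin < 1.0:  # sample violates the margin
                for j in range(dim):
                    grad_w[j] -= y * a[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w: List[float], b: float, a: List[float]) -> int:
    """Return +1 or -1 according to the side of the hyperplane w.a + b = 0."""
    score = sum(wj * aj for wj, aj in zip(w, a)) + b
    return 1 if score >= 0.0 else -1
```

In the method described here, the feature vectors passed to `train_linear_svm` would be the sparse codes a_n rather than raw audio samples.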

In another aspect of the present invention, an audio classification system based on data-driven supervised dictionary learning is further provided, which is applied to the above audio classification method based on data-driven supervised dictionary learning and includes a dictionary training module, an SVM classifier training module and a prediction output module;

the dictionary training module is used for determining the number of classes C of the sample set, and training C class-specific dictionaries D_c, c ∈ [1, C], using the input samples x_n and their corresponding class labels y_n;

the SVM classifier training module is used for obtaining the sparse codes a_n of the input samples x_n using the trained dictionaries D_c, c ∈ [1, C], and training an SVM classifier with the sparse codes as features;

the prediction output module is used for classifying the input samples x_n using the trained dictionaries D_c, c ∈ [1, C], and the trained SVM classifier, and outputting the predicted labels ỹ_n.

In another aspect of the present invention, a storage medium is provided, which stores a program, and when the program is executed by a processor, the program implements the above-mentioned method for audio classification based on data-driven supervised dictionary learning.

Compared with the prior art, the invention has the following advantages and beneficial effects:

(1) the audio recognition method based on data-driven supervised dictionary learning disclosed by the invention learns one dictionary per class, thereby minimizing intra-class uniformity, maximizing class separability, promoting sparsity to control the complexity of the signal decomposition over the dictionary, minimizing class-based reconstruction errors, and promoting the pairwise orthogonality of the dictionaries;

(2) the method provided by the invention can be applied to a variety of scenarios, such as computational auditory scene recognition and musical chord recognition; its results on the test data sets are relatively stable, and its generalization ability is excellent;

(3) the method provided by the invention improves audio recognition accuracy and performs well in security computing applications such as voice authentication and audio identification.

Drawings

FIG. 1 is a flowchart of implementation steps of a method for audio classification based on data-driven supervised dictionary learning according to an embodiment of the present invention;

FIG. 2 is a flowchart of the learning steps of the class-specific dictionaries D_c according to an embodiment of the present invention;

FIG. 3 is a flowchart of the training steps of an SVM classifier according to an embodiment of the present invention;

FIG. 4 is a flowchart illustrating a process of classifying and outputting prediction tags during a testing phase according to an embodiment of the present invention;

FIG. 5 is a pairwise similarity map of the class-specific dictionaries learned on the Rouen dataset according to an embodiment of the present invention;

FIG. 6 is a pairwise similarity map of the class-specific dictionaries learned on the musical chord dataset according to an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of an audio classification system based on data-driven supervised dictionary learning according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a storage medium according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Examples

As shown in FIG. 1, the present embodiment provides an audio classification method based on data-driven supervised dictionary learning, comprising the following steps:

S1, determining the number of classes C of the sample set, and training C class-specific dictionaries D_c, c ∈ [1, C], using the input samples x_n and their corresponding class labels y_n; as shown in FIG. 2, this specifically includes the following steps:

S12, determining the loss function J;

Further, the loss function J is specifically:

J(A, D) = J₁(D, A) + μJ₂(D, A) + λJ₃(A) + γ₁J₄(A) + γ₂J₅(D);

S13, starting an iterative solution process of T iterations; at iteration t, fixing the dictionary D^{t-1} and computing the sparse codes A^t;

Further, the sparse codes A^t are obtained by minimizing the loss function J(D^{t-1}, A^t) with the Lasso algorithm.

S14, fixing the sparse codes A^t and updating the dictionary D_c^t, comprising the following steps:

S141, computing the gradient G^t of the loss function J with respect to the dictionary D; specifically, the loss function is:

wherein:

the gradient is:

wherein:

S142, preliminary update: D_c^{t/2} = D_c^{t-1} - ηG^t;

S15, setting t = t + 1 and entering the next iteration, until t = T.

S2, using the trained dictionaries D_c, c ∈ [1, C], to obtain the sparse codes a_n of the input samples x_n, and training an SVM classifier with the sparse codes as features, as shown in FIG. 3;

training the SVM classifier specifically comprises: training to obtain a hyperplane that separates samples of different classes; in the testing stage, the class of a sample is determined by which side of the hyperplane it lies on.

S3, in the testing stage, classifying the input samples x_n using the trained dictionaries D_c, c ∈ [1, C], and the trained SVM classifier, and outputting the predicted labels ỹ_n, as shown in FIG. 4.

In this embodiment, two different audio signal classification problems were tested, namely computational auditory scene recognition and musical chord recognition:

(1) for computational auditory scene recognition, experiments were performed on the East Anglia and LITIS Rouen datasets. Table 1 lists the results of the method of the present invention compared with other methods;

TABLE 1 Comparison of the method of the present invention with other methods on computational auditory scene recognition

As can be seen from Table 1, the method of the present invention outperforms some of the compared methods, its results on the two data sets are relatively stable, and its generalization ability is excellent, which indicates that the method deserves further exploration. FIG. 5 shows the pairwise similarities of the different dictionaries. For computational auditory scene recognition, the dictionaries of different classes still show considerable similarity; that is, the features extracted for different classes may be similar, which is unfavorable for classification, and a growing number of classes makes it harder to enforce dissimilarity between the dictionaries.

(2) for musical chord recognition, a data set of 2156 musical chord samples covering 14 different classes was created, each sample having a duration of 2 s and a sampling frequency of 44100 Hz. The method of the present invention was compared with several conventional features, giving the results shown in Table 2;

Features	Music chord
Chroma	0.19±0.01
Interpolated PSD	0.15±0.02
Spectrogram pooling	0.14±0.01
Dictionary learning	0.66±0.01

TABLE 2 Comparison of the method of the present invention with conventional features on musical chord recognition

As is apparent from Table 2, the method of the present invention is superior to the conventional features. FIG. 6 shows the pairwise similarities of the different dictionaries: the maximum similarity values lie on the diagonal from top left to bottom right, which shows that the method achieves the desired effect on the musical chord recognition data set, namely that the dictionaries of different classes extract different information. This explains why the method of the present invention outperforms the conventional features.
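A pairwise similarity map of the kind discussed for FIG. 5 and FIG. 6 could be computed as sketched below. The similarity measure used here, the squared Frobenius norm ‖D_iᵀD_j‖²_F, is an assumption: it is a common way to quantify how far two dictionaries are from being mutually orthogonal, but the text does not state which measure the figures use. The function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # d[r][p] is entry r of atom p


def frobenius_similarity(d_i: Matrix, d_j: Matrix) -> float:
    """Squared Frobenius norm of D_i^T D_j (0 when all atoms are orthogonal)."""
    m = len(d_i)
    total = 0.0
    for p in range(len(d_i[0])):
        for q in range(len(d_j[0])):
            # Inner product between atom p of D_i and atom q of D_j.
            inner = sum(d_i[r][p] * d_j[r][q] for r in range(m))
            total += inner * inner
    return total


def similarity_matrix(dictionaries: List[Matrix]) -> List[List[float]]:
    """C x C matrix whose (i, j) entry compares dictionaries i and j."""
    return [[frobenius_similarity(a, b) for b in dictionaries]
            for a in dictionaries]
```

For two dictionaries with mutually orthogonal atoms, for example `[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]` and `[[0.0], [0.0], [1.0]]`, the off-diagonal entries are 0 and the largest values sit on the diagonal, which is the pattern described for the musical chord data set.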

In another embodiment of the present application, as shown in fig. 7, there is provided a data-driven supervised dictionary learning based audio classification system, which includes a dictionary training module, an SVM classifier training module, and a prediction output module;

It should be noted that the system provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical applications, the above function allocation may be completed by different functional modules according to needs, that is, the internal structure is divided into different functional modules to complete all or part of the above described functions.

As shown in FIG. 8, in another embodiment of the present application, a storage medium is further provided, which stores a program that, when executed by a processor, implements the audio classification method based on data-driven supervised dictionary learning, specifically:

S2, using the trained dictionaries D_c, c ∈ [1, C], to obtain the sparse codes a_n of the input samples x_n, and training an SVM classifier with the sparse codes as features;

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. An audio classification method based on data-driven supervised dictionary learning, characterized by comprising the following steps:

determining the number of classes C of the sample set, and training C class-specific dictionaries D_c, c ∈ [1, C], using the input samples x_n and their corresponding class labels y_n;

using the trained dictionaries D_c, c ∈ [1, C], to obtain the sparse codes a_n of the input samples x_n, and training an SVM classifier with the sparse codes as features;

classifying the input samples x_n using the trained dictionaries D_c, c ∈ [1, C], and the trained SVM classifier, and outputting the predicted labels ỹ_n.

2. The audio classification method based on data-driven supervised dictionary learning of claim 1, wherein the C class-specific dictionaries D_c, c ∈ [1, C], are trained as follows:

initializing the dictionary D_c^0, the learning rate η₀, the learning rate update rate α and the number of iterations T;

determining the loss function J;

starting an iterative solution process of T iterations; at iteration t, fixing the dictionary D^{t-1} and computing the sparse code set A^t;

fixing the sparse code set A^t and updating the dictionary D_c^t;

setting t = t + 1 and entering the next iteration, until t = T.

3. The audio classification method based on data-driven supervised dictionary learning according to claim 2, wherein the loss function J has the specific form:

J(A, D) = J₁(D, A) + μJ₂(D, A) + λJ₃(A) + γ₁J₄(A) + γ₂J₅(D);

4. The audio classification method based on data-driven supervised dictionary learning of claim 2, wherein, in the iterative solution process of T iterations, at iteration t the dictionary D^{t-1} is fixed and the sparse code set A^t is computed; specifically, A^t is obtained by minimizing the loss function J(D^{t-1}, A^t) with the Lasso algorithm.

5. The audio classification method based on data-driven supervised dictionary learning according to claim 2, wherein fixing the sparse code set A^t and updating the dictionary D_c^t comprises the following specific steps:

computing the gradient G^t of the loss function J with respect to the dictionary D;

preliminary update: D_c^{t/2} = D_c^{t-1} - ηG^t;

constraining the preliminarily updated dictionary through the proximal operator Prox;

ending the dictionary update once J(D_c^t, A^t) < J(D_c^{t-1}, A^{t-1}).

6. The audio classification method based on data-driven supervised dictionary learning of claim 1, wherein training the SVM classifier specifically comprises: training to obtain a hyperplane that separates samples of different classes; in the testing stage, the class of a sample is determined by which side of the hyperplane it lies on.

7. The audio classification system based on data-driven supervised dictionary learning is characterized by being applied to the audio classification method based on data-driven supervised dictionary learning of any one of claims 1 to 6, and comprising a dictionary training module, an SVM classifier training module and a prediction output module;

8. A storage medium storing a program, characterized in that: the program, when executed by a processor, implements the data-driven supervised dictionary learning based audio classification method of any one of claims 1-6.