CN112447188B - Acoustic scene classification method based on improved softmax function - Google Patents

Acoustic scene classification method based on improved softmax function

Info

Publication number
CN112447188B
CN112447188B
Authority
CN
China
Prior art keywords
acoustic
cosine similarity
scene classification
sine
improved
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011296395.6A
Other languages
Chinese (zh)
Other versions
CN112447188A (en)
Inventor
杨吉斌
张强
张雄伟
曹铁勇
张睿
白玮
赵斐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN202011296395.6A priority Critical patent/CN112447188B/en
Publication of CN112447188A publication Critical patent/CN112447188A/en
Application granted granted Critical
Publication of CN112447188B publication Critical patent/CN112447188B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L25/51: Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G06N3/045: Neural network architectures; combinations of networks
    • G06N3/047: Neural network architectures; probabilistic or stochastic networks
    • G06N3/08: Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Complex Calculations (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an acoustic scene classification method based on an improved softmax function. During network training, the shape of the network's classification decision surface can be adjusted by setting the loss-function parameters corresponding to the different classes of signal samples, so that the decision surface adapts to the feature clustering characteristics of each class of sound signal, the separation between the decision surfaces of different signal classes is increased, and the classification performance for acoustic scenes is improved.

Description

Acoustic scene classification method based on improved softmax function
Technical Field
The application relates to the technical field of pattern recognition and classification, in particular to an acoustic scene classification method based on an improved softmax function.
Background
Acoustic scene classification uses a sound signal to judge the category of the scene in which it was recorded. The technology belongs to the field of pattern recognition and plays an important role in applications such as intelligent perception for robots and unmanned systems. Classification techniques based on deep learning perform well on acoustic scene classification, but the basic softmax cross-entropy loss function adopted by typical deep network classifiers yields learned features with weak discrimination. Because some acoustic scenes are very similar to one another, the classification performance obtained with the basic softmax function is not ideal.
To better enhance the discrimination of the learned sample representations, many effective improvements and optimization schemes have been proposed, such as L-softmax, A-softmax and GA-softmax. The L-softmax loss function uses the fully connected layer weights in the classification module as the weights of each class's classifier; to improve the discrimination of the learned features, it introduces a multiplicative angular margin into the original softmax function and increases the separation between classes by making the examples harder to learn. The A-softmax loss function further normalizes the weights of the fully connected layer in the classification module. The GA-softmax loss function converts the multiplicative angular margin of the A-softmax loss into an additive angular margin, generalizing the A-softmax loss; it also introduces a scale factor and feature normalization, which make the classifier's decision surface adjustable and allow more flexible control of the discriminability of the learned features. However, in losses built on the softmax cross-entropy framework, such as AM-softmax, the decision margin between any two classes appears as a parallel narrow band in the two-dimensional plane: when such a loss is used to construct a classifier, the shape of each class's decision boundary cannot be changed, the boundary cannot be adjusted flexibly according to the sample feature distribution in the feature space, and the classification recognition rate still needs improvement. Adopting these loss functions for acoustic scene classification therefore limits further improvement of classification performance.
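For reference, the additive-margin construction discussed above can be sketched in a few lines of PyTorch; the function name and the values of the scale s and margin m below are illustrative assumptions rather than values taken from the cited works. The target-class cosine is reduced by a fixed margin before the cross entropy is computed, which is what produces the fixed, parallel decision margin described.

    import torch
    import torch.nn.functional as F

    def additive_margin_softmax_loss(features, weights, labels, s=30.0, m=0.35):
        # features: (N, D) embeddings; weights: (C, D) class weights; labels: (N,) class indices.
        # Cosine similarity between L2-normalized embeddings and class weights.
        cos = F.linear(F.normalize(features), F.normalize(weights))          # (N, C)
        # Subtract a fixed additive margin from the target-class cosine only.
        one_hot = F.one_hot(labels, num_classes=weights.size(0)).float()
        logits = s * (cos - m * one_hot)
        return F.cross_entropy(logits, labels)

    # Example: 4 samples, 8-dimensional embeddings, 3 classes.
    loss = additive_margin_softmax_loss(torch.randn(4, 8), torch.randn(3, 8),
                                        torch.tensor([0, 2, 1, 0]))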
Disclosure of Invention
The application addresses the technical problem that current acoustic scene classification models which adopt a loss function based on the softmax cross-entropy framework cannot flexibly adjust the decision boundary according to the sample feature distribution in the feature space, so that the classification recognition rate remains to be improved.
The application adopts the following technical scheme. The application provides an acoustic scene classification method based on an improved softmax function, which comprises: obtaining the time-frequency features of an acoustic signal sample; using the time-frequency features as the input of a pre-trained acoustic scene classification model, and discriminating the category of the time-frequency features with the acoustic scene classification model to obtain the acoustic scene classification result; wherein the acoustic scene classification model is trained using the improved softmax function.
Further, the acoustic scene classification model comprises a deep convolutional neural network and a fully connected layer: the deep convolutional neural network extracts acoustic representation features and outputs them to the fully connected layer, and the fully connected layer discriminates the type of the acoustic representation features and outputs the acoustic scene classification result.
Further, the training method of the acoustic scene classification model is as follows:
inputting time-frequency characteristics of an acoustic signal training sample;
extracting acoustic representation features by using a deep convolutional neural network;
classifying the acoustic scenes with the fully connected layer according to the acoustic representation features; calculating the positive cosine similarity measure and the negative cosine similarity measure between the acoustic representation features and the weights corresponding to each output node of the fully connected layer; and calculating the cross-entropy loss of the improved softmax function based on the positive and negative cosine similarity measures;
training the acoustic scene classification model with the computed cross-entropy loss, thereby obtaining the network parameters of each layer of the deep convolutional neural network and the weight parameters of the output nodes of the fully connected layer.
Still further, the positive cosine similarity measure and the negative cosine similarity measure between the acoustic representation features and the weights corresponding to each output node of the fully connected layer are calculated as follows:
The cosine similarity between an i-th class acoustic representation feature and the weight of the i-th class output node of the fully connected layer is computed to obtain the positive cosine similarity value s_ip, and the cosine similarity between the acoustic representation feature and the weight of the j-th class output node (j ≠ i) is computed to obtain the negative cosine similarity value s_jn. The improved positive cosine similarity measure ŝ_ip and the improved negative cosine similarity measure ŝ_jn are then obtained with formula (1):
ŝ_ip = λ_p · α_p · (s_ip − Δp),  ŝ_jn = λ_n · α_n · (s_jn − Δn)    (1)
where λ_p is the scale factor corresponding to the positive similarity and λ_n is the scale factor corresponding to the negative similarity; α_p is the weight update factor corresponding to the positive similarity and α_n is the weight update factor corresponding to the negative similarity; Δp is the margin factor corresponding to the positive similarity and Δn is the margin factor corresponding to the negative similarity. To simplify the hyperparameter settings, we let λ_n = a·λ_p.
Still further, the improved softmax function is expressed as follows:
L_CS = −(1/N) Σ_{k=1}^{N} log[ exp(ŝ_ip) / ( exp(ŝ_ip) + Σ_{j=1, j≠i}^{C} exp(ŝ_jn) ) ],
with ŝ_ip = λ_p·(1 + m − s_ip)·(s_ip − 1 + m) and ŝ_jn = a·λ_p·(s_jn + m)·(s_jn − m),
where N is the number of samples, m is the first adjustment parameter, a is the second adjustment parameter, λ_p is the scale factor corresponding to the positive similarity, and C is the number of acoustic scene categories.
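A minimal PyTorch sketch of a loss of this form is given below; it is written from the parameter definitions stated in this description, so the exact expression should be read as an interpretation rather than a reproduction of the original formula, and the class name and the default values of m, a and λ_p are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CSSoftmaxLoss(nn.Module):
        # Softmax cross entropy over re-weighted positive/negative cosine similarities.
        # m: first adjustment parameter; a: second adjustment parameter (lambda_n = a * lambda_p);
        # lambda_p: scale factor of the positive similarity. Default values are illustrative.
        def __init__(self, m=0.25, a=1.0, lambda_p=32.0):
            super().__init__()
            self.m, self.a, self.lambda_p = m, a, lambda_p

        def forward(self, cos, labels):
            # cos: (N, C) cosine similarities between the normalized representation features
            # and the normalized weight of each output node; labels: (N,) class indices.
            pos_mask = F.one_hot(labels, num_classes=cos.size(1)).bool()
            s_p = cos[pos_mask]                              # s_ip, shape (N,)
            s_n = cos[~pos_mask].view(cos.size(0), -1)       # s_jn, shape (N, C-1)

            alpha_p = 1.0 + self.m - s_p                     # O_p - s_ip with O_p = 1 + m
            alpha_n = s_n + self.m                           # s_jn - O_n with O_n = -m
            logit_p = self.lambda_p * alpha_p * (s_p - (1.0 - self.m))    # margin Delta_p = 1 - m
            logit_n = self.a * self.lambda_p * alpha_n * (s_n - self.m)   # margin Delta_n = m

            # -log( exp(logit_p) / (exp(logit_p) + sum_j exp(logit_n)) ), averaged over the batch.
            return F.softplus(torch.logsumexp(logit_n, dim=1) - logit_p).mean()

The softplus/logsumexp form used in the last line is numerically equivalent to the negative log-softmax expression but more stable for large scale factors.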
Still further, the decision boundary of the acoustic scene classification is changed by adjusting the first adjustment parameter m and the second adjustment parameter a.
Further preferably, the obtained acoustic representation features are normalized and output to the feature classification module.
The beneficial technical effects obtained by the application are as follows. In the acoustic scene classification model based on the improved softmax function, i.e., in acoustic scene classification based on a deep convolutional neural network (Convolutional Neural Network, CNN), a positive and negative cosine similarity softmax function is designed and the training loss computed with this function is used to learn the network parameters. By adjusting the parameters of the loss function, the shape of the classification decision boundary between categories is controlled and made to approximate the distribution of each class's representation features. The clustering of same-class acoustic samples is thereby improved while the decision margin between different-class acoustic samples is enlarged, which reduces the misjudgment rate, improves the discrimination of the acoustic sample representation features, significantly improves the classification accuracy, and thus improves the performance of the acoustic scene classification system.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
fig. 1 is a schematic diagram of an acoustic scene classification model structure based on the positive and negative cosine similarity softmax function according to an embodiment of the present application;
fig. 2 is a schematic flow chart of implementing acoustic scene classification with an acoustic scene classification model based on the positive and negative cosine similarity softmax function according to an embodiment of the present application;
FIG. 3 is a flowchart of a training method for a deep image classification model based on the positive and negative cosine similarity softmax function according to an embodiment of the present application;
FIG. 4 is a schematic diagram of decision boundary adjustment results according to an embodiment of the present application; FIGS. 4(a), 4(b), 4(c) and 4(d) show the decision boundaries corresponding to m = 0.4, 0.3, 0.2 and 0.1 respectively, with a = 3;
fig. 5 is a schematic diagram of decision boundary adjustment results according to another embodiment of the present application; FIGS. 5(a), 5(b), 5(c) and 5(d) show the decision boundaries corresponding to a = 3, 2, 1/2 and 1/3 respectively, with m = 0.4.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "comprising" and "having" and any variations thereof in the description and claims of the application and in the foregoing drawings are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
An acoustic scene classification method based on an improved softmax function comprises the steps of constructing an acoustic scene classification model; the acoustic scene classification model is used for judging the category of the time-frequency characteristics of the input acoustic signal sample to obtain an acoustic scene classification result; the structural schematic diagram of the classification model is shown in fig. 1, the acoustic scene classification model comprises a feature extraction module and a feature classification module, the feature extraction module is used for extracting acoustic representation features by using a deep convolutional neural network, and the obtained acoustic representation features are output to the feature classification module;
the feature classification module comprises a full-connection layer, and the number of output nodes of the full-connection layer is the same as the number of categories of the acoustic scene; the full-connection layer is used for carrying out type discrimination on the acoustic representation features extracted by the feature extraction module and outputting an acoustic scene classification result.
As shown in fig. 1, in a specific embodiment, the method may include inputting acoustic signal samples and calculating the corresponding basic time-frequency features. The basic time-frequency features of the acoustic signal samples are obtained with existing techniques, which are not described in the present application. The network structure may take the form of a convolutional neural network such as a standard two-dimensional CNN or a ResNet, including, but not limited to, several convolutional layers, pooling layers, batch normalization layers, and the like.
In this embodiment, a feature classification module is built on a fully connected (FC) layer. The number of FC output nodes corresponds to the number of sample classes, the two-norm of each class weight vector in W = [w_1, w_2, ..., w_C] corresponding to the output nodes of the FC layer is 1, and the bias of each output node is 0.
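A possible PyTorch sketch of such a feature classification layer is shown below; the class name and initialization are illustrative assumptions. Both the input representation and every class weight vector are normalized to unit norm and no bias is used, so the layer's output is the cosine similarity to each class weight.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CosineClassifier(nn.Module):
        # Fully connected layer with unit-norm weights and zero bias: one weight vector per class.
        def __init__(self, feat_dim, num_classes):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))

        def forward(self, x):
            # Normalize both the representation features and the class weights so that the
            # inner product equals the cosine similarity to each output node.
            return F.linear(F.normalize(x), F.normalize(self.weight))    # (N, C)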
Based on the first and second embodiments, this embodiment provides a training method for the acoustic scene classification model of the first embodiment, as shown in fig. 2; the training method includes the following steps:
s101, extracting time-frequency characteristics corresponding to the sound scene signals.
It should be noted that the time-frequency feature samples obtained by processing the sound scene signals with the above system are not limited to a particular signal length or specific form; for example, the 10 s audio files in the DCASE2019 dataset may be converted into a logarithmic Mel energy spectrum or a constant-Q transform spectrum as sample data. Optionally, the time-frequency features may be divided into training samples, validation samples and test samples according to a preset split ratio.
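As one possible way of producing such features, the sketch below computes a logarithmic Mel energy spectrum with librosa; the sampling rate, window, hop and Mel-band settings are illustrative assumptions rather than values prescribed here.

    import numpy as np
    import librosa

    def log_mel_feature(path, sr=44100, n_fft=2048, hop_length=1024, n_mels=64):
        # Log-Mel energy spectrum of an audio file, e.g. a 10 s DCASE2019 clip.
        y, sr = librosa.load(path, sr=sr, mono=True)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length, n_mels=n_mels)
        return librosa.power_to_db(mel).astype(np.float32)    # shape (n_mels, frames)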
S102, constructing a characteristic extraction module of an acoustic signal based on a deep convolutional neural network, inputting time-frequency characteristic data of the acoustic signal, and calculating a low-dimensional representation characteristic.
In a specific implementation, time-frequency features such as the logarithmic Mel energy spectrum or the constant-Q transform spectrum can be input into the deep convolutional network for training. The deep convolutional neural network adopted by the feature extraction module can take the form of a standard two-dimensional CNN, a ResNet or a similar convolutional neural network, comprising several convolutional layers, pooling layers, batch normalization layers, and the like, and is not limited to a specific form.
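A minimal sketch of such a feature extraction module is given below; the depth, channel widths and output dimension are illustrative assumptions, and any comparable CNN or ResNet-style backbone could be substituted.

    import torch.nn as nn

    class CNNFeatureExtractor(nn.Module):
        # Small 2-D CNN over time-frequency inputs; layer sizes are illustrative only.
        def __init__(self, feat_dim=128):
            super().__init__()
            def block(c_in, c_out):
                return nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                                     nn.BatchNorm2d(c_out),
                                     nn.ReLU(inplace=True),
                                     nn.MaxPool2d(2))
            self.body = nn.Sequential(block(1, 32), block(32, 64), block(64, 128))
            self.pool = nn.AdaptiveAvgPool2d(1)     # global average pooling over time and frequency
            self.proj = nn.Linear(128, feat_dim)    # low-dimensional representation feature

        def forward(self, x):                       # x: (N, 1, n_mels, frames)
            return self.proj(self.pool(self.body(x)).flatten(1))    # (N, feat_dim)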
S103, the low-dimensional representation features learned by the deep network are normalized and then input into the feature classification module. The feature classification module takes the form of a fully connected layer and calculates the positive cosine similarity measure and the negative cosine similarity measure between the representation features and the classifier weights (namely, the weights corresponding to each output node of the fully connected layer).
S104, converting the positive and negative cosine similarity, and calculating softmax cross entropy loss of the network based on the improved positive and negative cosine similarity measurement.
In a specific implementation, the loss function of the model has three parameters, m, a and λ_p; m and a together control the shape of the classification decision boundary. Tasks of different difficulty correspond to different optimal parameter combinations.
S105, training a feature extraction module and a feature classification module based on the loss function.
After learning, obtaining network parameters of each layer of the feature extraction module and weight parameters W of the output nodes of the feature classification module.
S106, classifying the test sample data by utilizing the trained network.
In this embodiment, the input of the fully connected layer is the acoustic representation of each sample learned by the CNN, and the output is the inner product of the acoustic representation features and the weight vector W = [w_1, w_2, ..., w_C] of the output nodes of the classifier. Because the number of output nodes corresponds to the number of categories, the weight vector of each output node can be regarded as the representative representation feature of its category.
The FC layer requires the input representation features to have norm 1, so the acoustic representation features need to be normalized before the FC layer. The two-norm of each weight vector in W is 1 and the bias of each FC node is 0; thus, the network output can be regarded as the cosine similarity between the representation feature of an input sample and the representative feature of each class.
In this embodiment, the specific method for calculating the positive and negative cosine similarity measure of the representative feature and the classifier weight is as follows:
In a C-class scene classification task, if a sample belongs to class i, the cosine similarity between its representation feature and the class-i classifier weight is defined as the positive cosine similarity value s_ip, and the cosine similarity between the feature and the class-j classifier weight (j ≠ i) is the negative cosine similarity value s_jn. The positive and negative similarities are converted with different mapping forms to obtain the improved positive and negative cosine similarity measures ŝ_ip and ŝ_jn, so that the different roles of positive and negative examples can be distinguished, as shown in formula (1):
ŝ_ip = λ_p · α_p · (s_ip − Δp),  ŝ_jn = λ_n · α_n · (s_jn − Δn)    (1)
where λ_p is the scale factor corresponding to the positive similarity and λ_n is the scale factor corresponding to the negative similarity; α_p is the weight update factor corresponding to the positive similarity and α_n is the weight update factor corresponding to the negative similarity; Δp is the margin factor corresponding to the positive similarity and Δn is the margin factor corresponding to the negative similarity.
Further, the difference between each similarity and its optimization target O_p or O_n is used as the weight update factor: α_p = O_p − s_ip and α_n = s_jn − O_n. Since the optimization target of the positive similarity s_ip is 1 and the optimization target of the negative similarity s_jn is 0, the distance between the positive and negative similarity targets is 1. Under this constraint, the parameter relationship in formula (2) is obtained.
further, let Δn=m (0.ltoreq.m.ltoreq.1), a=λ np Δp=1-m, O n =-m,O p =1+m, converted sine and cosine similarity measureAnd a negative cosine similarity measure ++>Respectively is
The loss of the convolutional classification network model is calculated within the original softmax cross-entropy framework, giving the softmax function L_CS based on the positive and negative cosine similarity measures, as shown in formula (4):
L_CS = −(1/N) Σ_{k=1}^{N} log[ exp(ŝ_ip) / ( exp(ŝ_ip) + Σ_{j≠i} exp(ŝ_jn) ) ]    (4)
where N is the number of samples, m is the first adjustment parameter, a is the second adjustment parameter, and λ_p is the scale factor corresponding to the positive similarity.
The network is trained with the computed cross-entropy loss L_CS to obtain the network parameters of the CNN and FC layers, namely the network parameters of each layer of the deep convolutional neural network and the weight parameters of the output nodes of the fully connected layer.
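One possible training routine is sketched below; it assumes modules of the kind sketched earlier in this description (a CNN feature extractor, a unit-norm fully connected classifier and a CS-softmax-style criterion), which are passed in as arguments, and the optimizer, learning rate and epoch count are illustrative assumptions.

    import torch

    def train_asc_model(extractor, classifier, criterion, loader,
                        epochs=60, lr=1e-3, device="cpu"):
        # extractor: time-frequency features -> representation features
        # classifier: representation features -> cosine similarity to each class weight
        # criterion: loss taking (cosine similarities, labels), e.g. a CS-softmax-style loss
        extractor.to(device)
        classifier.to(device)
        params = list(extractor.parameters()) + list(classifier.parameters())
        optimizer = torch.optim.Adam(params, lr=lr)
        for _ in range(epochs):
            for feats, labels in loader:               # feats: (N, 1, n_mels, frames)
                feats, labels = feats.to(device), labels.to(device)
                cos = classifier(extractor(feats))     # (N, C) cosine similarities
                loss = criterion(cos, labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        # After training, the per-layer CNN parameters are in extractor.state_dict() and
        # the class weight vectors W are the parameters of classifier.
        return extractor, classifier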
The acoustic sample data to be classified are then processed with the trained network, and the classifier performs the discrimination to obtain the classification result.
In a specific embodiment, the classification output layer based on the positive and negative cosine similarity softmax function can realize flexible adjustment of the decision boundary shape by adjusting parameters m and a. The principle is as follows:
for sample feature x of class i, the decision boundary for judging that the sample belongs to class i and does not belong to class j can be determined byThe derived decision boundary is:
wherein, the liquid crystal display device comprises a liquid crystal display device,
when the parameters satisfy the formula (2), the formula (5) can be expressed as the formula (6)
Sample feature x of class j is the same, and decision boundary that it belongs to class j and does not belong to class i is:
schematic diagrams of the two classification decision boundaries corresponding to the sine and cosine similarity softmax loss function are given in fig. 4 and 5. There are two decision boundaries between any two classes, and the shape of the decision boundaries is commonly controlled by m and a. As can be seen from comparing fig. 4 and 5, m mainly controls the size of the decision area, and a mainly controls the shape of the decision boundary.
As shown in fig. 4, when m is changed from 0.4 to 0.1, the decision area is continuously contracted in the horizontal and vertical directions, the decision margin is continuously increased, and the discrimination of the learned feature is continuously enhanced as m is reduced.
As shown in fig. 5, when a is changed from 3 to 1/3, the decision area becomes deeper in the vertical direction and narrower in the horizontal direction as a is reduced.
Therefore, in practical applications, the parameter settings can be adjusted in a targeted manner according to the feature distribution at network convergence, so that the decision boundary progressively approximates the feature distribution.
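The behaviour of the decision region under different (m, a) settings can also be checked numerically; the short sketch below shades the class-i region implied by the boundary form given in formula (6) above, for a few parameter pairs. The plotting layout and the parameter values are illustrative assumptions, not figures reproduced from the patent.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_decision_regions(settings=((0.4, 3.0), (0.1, 3.0), (0.4, 1.0 / 3.0))):
        # Shade the class-i decision region (s_ip - 1)^2 + a * s_jn^2 <= (1 + a) * m^2
        # in the plane of cosine similarities to the class-i and class-j weights.
        grid = np.linspace(-1.0, 1.0, 401)
        sp, sn = np.meshgrid(grid, grid)                    # sp: similarity to w_i, sn: to w_j
        fig, axes = plt.subplots(1, len(settings), figsize=(4 * len(settings), 4))
        for ax, (m, a) in zip(axes, settings):
            inside = ((sp - 1.0) ** 2 + a * sn ** 2 <= (1.0 + a) * m ** 2).astype(float)
            ax.contourf(sp, sn, inside, levels=[0.5, 1.5])  # shaded area = class-i decision region
            ax.set_xlabel("s_ip")
            ax.set_ylabel("s_jn")
            ax.set_title(f"m={m:.2f}, a={a:.2f}")
        plt.tight_layout()
        plt.show()

    plot_decision_regions()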
Further, the method further comprises the following steps:
and dividing the acoustic sample data into a training sample, a verification sample and a test sample according to a preset dividing proportion.
The application extracts the representation features of the acoustic scene with a convolutional neural network (CNN) and calculates the positive and negative cosine similarity measures between each acoustic signal sample and the representative features, so that the feature classification module can classify the acoustic scene according to this similarity. During network training, the shape of the network's classification decision surface can be adjusted by setting the loss-function parameters corresponding to the different classes of signal samples, which improves the classification accuracy and thus the performance of acoustic scene classification.
Table 1 shows the classification accuracy obtained with the CS-softmax loss function of the present application on the development dataset of DCASE2019 ASC using the CNN9avg model (see the paper Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems, Qiuqiang Kong, Yin Cao, Turab Iqbal, Yong Xu, Wenwu Wang, Mark D. Plumbley, arXiv preprint arXiv:1904.03476, 2019). The classification accuracy obtained with the original softmax loss was 70.3%. As can be seen from the results in the table, the classification accuracy changes as the three control parameters m, a and λ_p are varied, and in most cases it is better than the classification accuracy obtained with the original softmax loss.
Table 1. Classification accuracy of the CS-softmax loss function in DCASE2019 acoustic scene classification (unit: %)
In the embodiment of the application, the embedded representation used for classification is learned by processing the time-frequency feature samples corresponding to the sound scene data with a deep network. The deep network is trained with the softmax function based on the positive and negative cosine similarity, which improves the separability of the deep embedded representation, and the learned deep network parameters effectively improve the performance of the acoustic scene classification task.
In the following, the procedure of a training method for a CIFAR10 deep image classification model based on the positive and negative cosine similarity softmax function is described in connection with a specific implementation of an embodiment of the present application, as shown in FIG. 3. In acoustic scene classification, the time-frequency features of the data can be regarded as an image of the data on the time-frequency plane, so network training and testing based on the positive and negative cosine similarity softmax function can also be implemented in an image embodiment. This embodiment uses CIFAR10 as the training and test dataset. The method may comprise the following steps:
s201, dividing the image data into a training sample and a test sample.
S202, a feature extraction module of image data is constructed based on a deep convolutional neural network, training data samples are input, and low-dimensional representation features are calculated.
S203, inputting the representation features of the image samples into the feature classification module, which takes the form of a fully connected layer, and calculating the positive and negative cosine similarity measures between the representation features and the weights of all output nodes of the fully connected layer.
S204, converting the positive and negative similarity measures to obtain the improved positive and negative cosine similarity measures.
S205, calculating to obtain the loss of the network based on the softmax cross entropy framework, training the network, and obtaining the optimal network parameters after the loss converges.
S206, classifying the image test data by using the trained network.
In this embodiment, the specific calculation method in the steps is provided in the above embodiment, and the description of this embodiment is not repeated.
In the embodiment of the application, representative features for classification are learned by processing samples such as audio and images with a deep convolutional network, and the classification decision is made with the fully connected layer. The positive and negative cosine similarity softmax loss function with the embedded adjustable parameters m and a allows the shape of the classification decision boundary between different categories to be controlled flexibly, so that the learned representation features are better gathered within the decision boundary and the clustering characteristic of the samples is improved. The representation features of the test samples are extracted with the learned deep network parameters and classified directly by the classification layer, which effectively improves the performance of multi-category classification tasks.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are all within the protection of the present application.

Claims (7)

1. An acoustic scene classification method based on an improved softmax function, comprising: acquiring time-frequency characteristics of an acoustic signal sample; using the time-frequency characteristics as the input of an acoustic scene classification model which is trained in advance, and judging the categories of the time-frequency characteristics by using the acoustic scene classification model to obtain an acoustic scene classification result; wherein the acoustic scene classification model is trained using an improved softmax function;
the acoustic scene classification model comprises a deep convolutional neural network and a full-connection layer; extracting acoustic representation features by using a deep convolutional neural network, and outputting the obtained acoustic representation features to the full connection layer; the full connection layer is used for carrying out type discrimination on the acoustic representation characteristics and outputting an acoustic scene classification result;
the modified softmax function is expressed as follows:
wherein N is the number of samples, m is a first adjustment parameter, a is a second adjustment parameter, lambda p For the third adjustment parameter, C is the number of categories of the acoustic scene, s ip Is sine and cosine similarity value, s jn Is the value of the similarity of the negative cosine,for an improved sine-cosine similarity measure,for improved negative cosine similarityMetric lambda p Is the scale factor corresponding to positive similarity; i is the i-th class of acoustic representation feature and j is the j-th class of acoustic representation feature.
2. The improved softmax function-based acoustic scene classification method of claim 1, wherein the training method of the acoustic scene classification model is as follows:
inputting time-frequency characteristics of an acoustic signal training sample;
extracting acoustic representation features by using a deep convolutional neural network;
classifying the acoustic scenes with the fully connected layer according to the acoustic representation features; calculating the positive cosine similarity measure and the negative cosine similarity measure between the acoustic representation features and the weights corresponding to each output node of the fully connected layer; calculating the cross-entropy loss of the improved softmax function based on the positive and negative cosine similarity measures;
training the acoustic scene classification model by using the cross entropy loss obtained by calculation to respectively obtain network parameters of each layer of the deep convolutional neural network and weight parameters of the output nodes of the full-connection layer.
3. The acoustic scene classification method based on the improved softmax function according to claim 2, wherein the positive cosine similarity measure and the negative cosine similarity measure between the acoustic representation features and the weights corresponding to each output node of the fully connected layer are calculated as follows:
calculating the cosine similarity between the i-th class acoustic representation feature and the weight of the i-th output node of the fully connected layer to obtain the positive cosine similarity value s_ip, and calculating the cosine similarity between the i-th class acoustic representation feature and the weight of the j-th output node of the fully connected layer (j ≠ i) to obtain the negative cosine similarity value s_jn; and, based on the obtained positive cosine similarity value s_ip and negative cosine similarity value s_jn, obtaining the improved positive cosine similarity measure ŝ_ip and the improved negative cosine similarity measure ŝ_jn.
4. The improved softmax function-based acoustic scene classification method of claim 3, wherein the improved positive cosine similarity measure ŝ_ip and the improved negative cosine similarity measure ŝ_jn are obtained with formula (1):
ŝ_ip = λ_p · α_p · (s_ip − Δp),  ŝ_jn = λ_n · α_n · (s_jn − Δn)    (1)
wherein λ_p is the scale factor corresponding to the positive similarity and λ_n is the scale factor corresponding to the negative similarity;
α_p is the weight update factor corresponding to the positive similarity, α_n is the weight update factor corresponding to the negative similarity, Δp is the margin factor corresponding to the positive similarity, and Δn is the margin factor corresponding to the negative similarity.
5. The acoustic scene classification method based on the improved softmax function according to claim 1, wherein the decision boundary of the acoustic scene classification is changed by adjusting the first adjustment parameter m and the second adjustment parameter a.
6. The improved softmax function-based acoustic scene classification method of claim 1, wherein the obtained acoustic representation features are normalized and output to a fully connected layer.
7. A computer readable storage medium storing a computer program, which when executed by a processor performs the steps of the method according to any one of claims 1 to 6.
CN202011296395.6A 2020-11-18 2020-11-18 Acoustic scene classification method based on improved softmax function Active CN112447188B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011296395.6A CN112447188B (en) 2020-11-18 2020-11-18 Acoustic scene classification method based on improved softmax function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011296395.6A CN112447188B (en) 2020-11-18 2020-11-18 Acoustic scene classification method based on improved softmax function

Publications (2)

Publication Number Publication Date
CN112447188A CN112447188A (en) 2021-03-05
CN112447188B true CN112447188B (en) 2023-10-20

Family

ID=74737165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011296395.6A Active CN112447188B (en) 2020-11-18 2020-11-18 Acoustic scene classification method based on improved softmax function

Country Status (1)

Country Link
CN (1) CN112447188B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3078530A1 (en) * 2017-10-26 2019-05-02 Magic Leap, Inc. Gradient normalization systems and methods for adaptive loss balancing in deep multitask networks
US11636328B2 (en) * 2018-03-28 2023-04-25 University Of Maryland, College Park L2 constrained softmax loss for discriminative face verification

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5390285A (en) * 1990-11-08 1995-02-14 British Telecommunications Public Limited Company Method and apparatus for training a neural network depending on average mismatch
US9824692B1 (en) * 2016-09-12 2017-11-21 Pindrop Security, Inc. End-to-end speaker recognition using deep neural network
WO2020023585A1 (en) * 2018-07-26 2020-01-30 Med-El Elektromedizinische Geraete Gmbh Neural network audio scene classifier for hearing implants
CN109829377A (en) * 2018-12-28 2019-05-31 河海大学 A kind of pedestrian's recognition methods again based on depth cosine metric learning
CN110659378A (en) * 2019-09-07 2020-01-07 吉林大学 Fine-grained image retrieval method based on contrast similarity loss function
CN111462755A (en) * 2020-03-03 2020-07-28 深圳壹账通智能科技有限公司 Information prompting method and device, electronic equipment and medium
CN111723675A (en) * 2020-05-26 2020-09-29 河海大学 Remote sensing image scene classification method based on multiple similarity measurement deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Acoustic scene classification based on additive margin softmax; Kun Yao et al.; IEEE; full text *
Second-language speaker pronunciation error verification based on acoustic phoneme vectors and Siamese networks; Wang Zhenyu et al.; Journal of Chinese Information Processing; Vol. 33, No. 04; full text *
Collaborative filtering recommendation algorithm based on improved cosine similarity; Li Yiye; Deng Haojiang; Computer and Modernization, No. 01; full text *

Also Published As

Publication number Publication date
CN112447188A (en) 2021-03-05

Similar Documents

Publication Publication Date Title
JP6798614B2 (en) Image recognition device, image recognition method and image recognition program
CN105976809A (en) Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion
WO2023125654A1 (en) Training method and apparatus for face recognition model, electronic device and storage medium
CN109117817B (en) Face recognition method and device
CN112101430A (en) Anchor frame generation method for image target detection processing and lightweight target detection method
CN104795064A (en) Recognition method for sound event under scene of low signal to noise ratio
CN108627798B (en) WLAN indoor positioning algorithm based on linear discriminant analysis and gradient lifting tree
CN107301376B (en) Pedestrian detection method based on deep learning multi-layer stimulation
CN104778230B (en) A kind of training of video data segmentation model, video data cutting method and device
CN109559758A (en) A method of texture image is converted by haptic signal based on deep learning
CN107688790A (en) Human bodys' response method, apparatus, storage medium and electronic equipment
CN102122353A (en) Method for segmenting images by using increment dictionary learning and sparse representation
CN114841257A (en) Small sample target detection method based on self-supervision contrast constraint
CN108615532A (en) A kind of sorting technique and device applied to sound field scape
Koluguri et al. Spectrogram enhancement using multiple window Savitzky-Golay (MWSG) filter for robust bird sound detection
CN102521402B (en) Text filtering system and method
Kohl et al. Learning similarity metrics for numerical simulations
CN112447188B (en) Acoustic scene classification method based on improved softmax function
CN111881965B (en) Hyperspectral pattern classification and identification method, device and equipment for medicinal material production place grade
CN113496260A (en) Grain depot worker non-standard operation detection method based on improved YOLOv3 algorithm
CN117115197A (en) Intelligent processing method and system for design data of LED lamp bead circuit board
CN109614929A (en) Method for detecting human face and system based on more granularity cost-sensitive convolutional neural networks
Zhao et al. Learning saliency features for face detection and recognition using multi-task network
CN116665390A (en) Fire detection system based on edge calculation and optimized YOLOv5
CN114998731A (en) Intelligent terminal navigation scene perception identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant