CN112447188A - Acoustic scene classification method based on improved softmax function - Google Patents

Acoustic scene classification method based on improved softmax function

Info

Publication number
CN112447188A
Authority
CN
China
Prior art keywords
acoustic
cosine similarity
scene classification
acoustic scene
softmax function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011296395.6A
Other languages
Chinese (zh)
Other versions
CN112447188B (en)
Inventor
杨吉斌
张强
张雄伟
曹铁勇
张睿
白玮
赵斐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN202011296395.6A priority Critical patent/CN112447188B/en
Publication of CN112447188A publication Critical patent/CN112447188A/en
Application granted granted Critical
Publication of CN112447188B publication Critical patent/CN112447188B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides an acoustic scene classification method based on an improved softmax function. Representation features of the acoustic scene are extracted with a convolutional neural network (CNN), the positive cosine similarity measure and the negative cosine similarity measure between the representation feature of each acoustic signal sample and the representative features of each class are calculated, and classification is carried out according to these similarities. During network training, the shape of the classification decision surface of the network can be adjusted by setting the parameters of the loss functions corresponding to the various classes of signal samples, so as to adapt to the feature clustering characteristics of each class of sound signals, increase the distance between the classification decision surfaces of different classes of signals, and improve the performance of acoustic scene classification.

Description

Acoustic scene classification method based on improved softmax function
Technical Field
The invention relates to the technical field of pattern recognition and classification, in particular to an acoustic scene classification method based on an improved softmax function.
Background
Acoustic scene classification uses sound signals to determine the type of scene in which they were recorded. It belongs to the technical field of pattern recognition and plays an important role in applications such as intelligent perception for robots and unmanned systems. Classification techniques based on deep learning have achieved good results in acoustic scene classification, but the basic softmax cross-entropy loss function adopted by typical deep network classifiers suffers from poor discriminability of the learned features. Since some acoustic scenes are strongly similar to one another, the classification performance obtained with the basic softmax function is not ideal.
To improve the discriminability of the learned sample representations, many effective improvement and optimization schemes exist, such as L-softmax, A-softmax and GA-softmax. The L-softmax loss function uses the fully connected layer weights in the classification module as the weight of each class classifier. To improve the discriminability of the learned features, L-softmax introduces a multiplicative angular margin into the original softmax function and increases the distance between class clusters by making the training examples harder to fit. The A-softmax loss function further normalizes the weights of the fully connected layer in the classification module. The GA-softmax loss function converts the multiplicative angular margin of A-softmax into an additive angular margin, generalizing A-softmax; it also introduces a scale factor and feature normalization, which make the classification decision surface of the classifier adjustable and thus give more flexible control over the discriminability of the learned features. In loss functions based on the softmax cross-entropy framework, such as AM-softmax, the decision margin between any two classes appears as a parallel narrow band in a two-dimensional plane; when a classifier is built with such a loss function, the shape of the classification decision boundary of each class is fixed, the decision boundary cannot be flexibly adjusted according to the sample feature distribution in the feature space, and the classification recognition rate still needs to be improved. Using these loss functions for acoustic scene classification therefore limits further improvement of classification performance.
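For reference, the additive-margin idea mentioned above can be sketched as follows. This is a minimal illustration of an AM-softmax-style loss, not part of the invention; the function name, parameter names and the values of s and m are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def am_softmax_loss(features, weights, labels, s=30.0, m=0.35):
    """AM-softmax-style loss sketch: cosine logits with an additive margin m
    subtracted from the target-class logit, then scaled by s.
    features: (N, d) raw sample features; weights: (C, d) class weights; labels: (N,)."""
    cos = F.normalize(features, dim=1) @ F.normalize(weights, dim=1).t()  # (N, C) cosine similarities
    margin = torch.zeros_like(cos)
    margin.scatter_(1, labels.unsqueeze(1), m)          # apply the margin only to the target class
    return F.cross_entropy(s * (cos - margin), labels)
```

Because the margin is a fixed additive offset on the target cosine, the resulting decision margin between any two classes is the parallel narrow band described above, which is exactly the rigidity the invention seeks to remove.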
Disclosure of Invention
The technical problem to be solved by the invention is that, in current acoustic scene classification models that introduce a loss function based on the softmax cross-entropy framework, the decision boundary cannot be flexibly adjusted according to the sample feature distribution in the feature space, and the classification recognition rate needs to be improved.
The invention adopts the following technical scheme. The invention provides an acoustic scene classification method based on an improved softmax function, which comprises the steps of obtaining time-frequency characteristics of acoustic signal samples; the time-frequency characteristics are used as the input of an acoustic scene classification model which is trained in advance, and the acoustic scene classification model is used for carrying out classification judgment on the time-frequency characteristics to obtain an acoustic scene classification result; wherein the acoustic scene classification model is trained by adopting an improved softmax function.
Further, the acoustic scene classification model comprises a deep convolutional neural network and a full connection layer; extracting acoustic representation features by adopting a deep convolutional neural network, and outputting the obtained acoustic representation features to the full-connection layer; and the full connection layer is used for judging the type of the acoustic representation characteristics and outputting an acoustic scene classification result.
Further, the training method of the acoustic scene classification model is as follows:
inputting time-frequency characteristics of an acoustic signal training sample;
extracting acoustic representation features by utilizing a deep convolutional neural network;
classifying the acoustic scenes by utilizing a fully connected layer according to the acoustic representation features; calculating the positive cosine similarity measure and the negative cosine similarity measure between the acoustic representation feature and the weight corresponding to each output node of the fully connected layer; calculating the cross-entropy loss of the improved softmax function based on the positive cosine similarity measure and the negative cosine similarity measure;
and training the acoustic scene classification model by using the cross entropy loss obtained by calculation to respectively obtain each layer of network parameters of the deep convolutional neural network and the weight parameters of the output nodes of the full connection layer.
Still further, the specific method for calculating the positive cosine similarity measure and the negative cosine similarity measure between the acoustic representation feature and the weight corresponding to each output node of the fully connected layer is as follows:
the cosine similarity between the acoustic representation feature of the i-th class and the weight of the i-th output node of the fully connected layer is calculated to obtain the positive cosine similarity value $s_{ip}$; the cosine similarity between the acoustic representation feature and the weight of the j-th class output node of the fully connected layer is calculated to obtain the negative cosine similarity value $s_{jn}$, i ≠ j. Further, the improved positive cosine similarity measure $\tilde{s}_{ip}$ and the improved negative cosine similarity measure $\tilde{s}_{jn}$ are obtained with formula (1):

$$\tilde{s}_{ip} = \lambda_p\,\alpha_p\,(s_{ip} - \Delta_p), \qquad \tilde{s}_{jn} = \lambda_n\,\alpha_n\,(s_{jn} - \Delta_n) \qquad (1)$$

where $\lambda_p$ is the scale factor corresponding to positive similarity and $\lambda_n$ is the scale factor corresponding to negative similarity; $\alpha_p$ is the weight update factor corresponding to positive similarity, $\alpha_n$ is the weight update factor for negative similarity, $\Delta_p$ is the margin factor for positive similarity, and $\Delta_n$ is the margin factor for negative similarity. To simplify the hyper-parameter setting, let $\lambda_n = a\cdot\lambda_p$.
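A minimal numeric sketch of formula (1) follows; the function name and argument names are ours, and the coupling $\lambda_n = a\cdot\lambda_p$ is applied inside:

```python
def improved_similarities(s_ip, s_jn, lambda_p, a, alpha_p, alpha_n, delta_p, delta_n):
    """Formula (1): transformed positive / negative cosine similarity measures,
    with the negative scale factor tied to the positive one via lambda_n = a * lambda_p."""
    lambda_n = a * lambda_p
    s_ip_tilde = lambda_p * alpha_p * (s_ip - delta_p)
    s_jn_tilde = lambda_n * alpha_n * (s_jn - delta_n)
    return s_ip_tilde, s_jn_tilde
```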
Still further, the improved softmax function is expressed as follows:

$$L_{CS} = -\frac{1}{N}\sum_{k=1}^{N}\log\frac{e^{\lambda_p\,(1+m-s_{ip})(s_{ip}-1+m)}}{e^{\lambda_p\,(1+m-s_{ip})(s_{ip}-1+m)} + \sum_{j=1,\,j\neq i}^{C} e^{a\,\lambda_p\,(s_{jn}+m)(s_{jn}-m)}}$$

where N is the number of samples, m is the first adjustment parameter, a is the second adjustment parameter, $\lambda_p$ is the scale factor corresponding to positive similarity, and C is the number of acoustic scene categories.
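A hedged PyTorch sketch of this loss is given below; the function name, default parameter values and tensor layout are our assumptions, and only the formula itself comes from the description above:

```python
import torch
import torch.nn.functional as F

def cs_softmax_loss(cosine, labels, m=0.3, a=2.0, lambda_p=16.0):
    """Improved (CS-) softmax loss sketch.
    cosine: (N, C) cosine similarities between normalized features and class weights.
    labels: (N,) ground-truth class indices."""
    n, c = cosine.shape
    target = F.one_hot(labels, c).bool()
    s_p = cosine[target]                                  # (N,) positive similarities s_ip
    s_n = cosine[~target].view(n, c - 1)                  # (N, C-1) negative similarities s_jn
    logit_p = lambda_p * (1 + m - s_p) * (s_p - 1 + m)    # transformed positive measure
    logit_n = a * lambda_p * (s_n + m) * (s_n - m)        # transformed negative measures
    logits = torch.cat([logit_p.unsqueeze(1), logit_n], dim=1)
    # cross-entropy against index 0 reproduces the softmax ratio of the formula above
    return F.cross_entropy(logits, torch.zeros(n, dtype=torch.long, device=cosine.device))
```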
Still further, the decision boundary of the acoustic scene classification is changed by adjusting the first adjustment parameter m and the second adjustment parameter a.
Further preferably, the obtained acoustic representation features are output to the feature classification module after being normalized.
The invention has the following beneficial technical effects: the acoustic scene classification model based on the improved softmax function designs, on top of a deep convolutional neural network (CNN), a softmax function that uses positive and negative cosine similarities for acoustic scene classification, and uses this function to compute the training loss and learn the network parameters. By adjusting the parameters of the loss function, the shape of the classification decision boundary between classes is controlled and approximated to the distribution of the representation features of each class of samples. This improves the clustering of samples of the same class while enlarging the decision boundary interval between samples of different classes, reduces the misclassification rate, improves the discriminability of the acoustic sample representation features, significantly improves the classification accuracy, and thus improves the performance of the acoustic scene classification system.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a schematic structural diagram of an acoustic scene classification model based on the positive-negative cosine similarity softmax function according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of acoustic scene classification with the acoustic scene classification model based on the positive-negative cosine similarity softmax function according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a method for training a deep image classification model based on the positive-negative cosine similarity softmax function according to an embodiment of the present invention;
fig. 4 is a schematic diagram of decision boundary adjustment according to an embodiment of the present invention, comprising figs. 4(a), 4(b), 4(c) and 4(d), which show the decision boundaries for m = 0.4, 0.3, 0.2 and 0.1, respectively, with a = 3;
fig. 5 is a schematic diagram of decision boundary adjustment according to another embodiment of the present invention, comprising figs. 5(a), 5(b), 5(c) and 5(d), which show the decision boundaries for a = 3, 2, 1/2 and 1/3, respectively, with m = 0.4.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "including" and "having," and any variations thereof, in the description and claims of this invention and the above-described drawings are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
The acoustic scene classification method based on the modified softmax function comprises the steps of constructing an acoustic scene classification model; the acoustic scene classification model is used for carrying out classification judgment on the time-frequency characteristics of the input acoustic signal samples to obtain an acoustic scene classification result; the structural schematic diagram of the classification model is shown in fig. 1, the acoustic scene classification model comprises a feature extraction module and a feature classification module, the feature extraction module is used for extracting acoustic representation features by adopting a deep convolutional neural network, and outputting the obtained acoustic representation features to the feature classification module;
the feature classification module comprises a full connection layer, and the number of output nodes of the full connection layer is the same as the number of acoustic scene categories; and the full connection layer is used for judging the type of the acoustic representation features extracted by the feature extraction module and outputting an acoustic scene classification result.
As shown in fig. 1, a specific embodiment may include inputting acoustic signal samples and calculating corresponding basic time-frequency characteristics. The basic time-frequency characteristics of the acoustic signal samples are obtained by calculation in the prior art, which is not described in the present application. The network structure may be a convolutional neural network similar to a standard two-dimensional CNN, ResNet, etc., including several convolutional layers, pooling layers, batch normalization layers, etc., but is not limited to a specific form.
In this embodiment, the feature classification module is constructed from a fully connected (FC) layer, the number of FC output nodes corresponds to the number of sample classes, the two-norm of the weight vector $W = [w_1, w_2, \ldots, w_C]$ of each class corresponding to the output nodes of the FC layer is 1, and the bias of each output node is 0.
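A minimal sketch of such a layer (the class name and dimensions are illustrative) is shown below; because both the input feature and each class weight are normalized and the bias is zero, the output of each node is exactly the cosine similarity between the sample feature and that class's weight vector:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineFC(nn.Module):
    """Fully connected output layer with unit-norm class weights and no bias."""
    def __init__(self, embed_dim, num_classes):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, features):
        # normalizing both operands turns the inner product into a cosine similarity
        return F.normalize(features, dim=1) @ F.normalize(self.weight, dim=1).t()
```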
In a second embodiment, on the basis of the first embodiment, the present embodiment provides a training method for an acoustic scene classification model in the first embodiment, as shown in fig. 2, the training method includes the following steps:
s101, extracting time-frequency characteristics corresponding to the sound scene signals.
It should be noted that the time-frequency feature samples obtained by processing the acoustic scene signals are not limited to a particular signal length or form; for example, the 10 s audio files in the DCASE2019 data set may be converted into log-Mel energy spectra or constant-Q transform spectra as sample data. Optionally, the time-frequency features may be divided into training samples, verification samples and test samples according to a preset division ratio.
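As an illustration only, a log-Mel energy spectrum of the kind mentioned above could be computed with librosa; the sampling rate, FFT size, hop length and number of Mel bands below are assumed values, not those of the invention:

```python
import librosa

def logmel_features(wav_path, sr=48000, n_fft=2048, hop_length=1024, n_mels=64):
    """Convert an audio file (e.g. a 10 s DCASE2019 clip) into a log-Mel energy spectrogram."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)   # (n_mels, frames) log-Mel energies
```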
S102, constructing a feature extraction module of the acoustic signal based on the deep convolutional neural network, inputting time-frequency feature data of the acoustic signal, and calculating low-dimensional representation features.
In specific implementation, the system can input time-frequency characteristics such as a logarithm Mel energy spectrum and a constant Q transform spectrum into a deep convolution network for training. The deep convolutional neural network structure adopted by the feature extraction module can adopt convolutional neural networks in the forms similar to standard two-dimensional CNN, ResNet and the like, including a plurality of convolutional layers, pooling layers, batch normalization layers and the like, but is not limited to a certain specific form.
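A minimal sketch of such a feature extraction module is given below; the layer sizes, depth and embedding dimension are illustrative assumptions and not the specific network of the invention:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Illustrative deep CNN feature extraction module (convolution + batch
    normalization + pooling); the method is not limited to this configuration."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, x):                            # x: (N, 1, n_mels, frames)
        return self.proj(self.body(x).flatten(1))    # (N, embed_dim) representation feature
```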
And S103, normalizing the low-dimensional representation features learned by the deep network and inputting them into the feature classification module. The feature classification module takes the form of a fully connected layer and calculates the positive cosine similarity measure and the negative cosine similarity measure between the representation features and the classifier weights (the classifier weights are the weights corresponding to each output node of the fully connected layer).
And S104, converting the positive cosine similarity and the negative cosine similarity, and calculating the softmax cross entropy loss of the network based on the improved positive cosine similarity and the negative cosine similarity.
In a specific implementation, the loss function of the model has three parameters m, a and $\lambda_p$, where m and a together control the shape of the classification decision boundary. Tasks of different difficulty levels correspond to different optimal parameter combinations.
And S105, training the feature extraction module and the feature classification module based on the loss function.
After learning, the network parameters of each layer of the feature extraction module and the weight parameters W of the output nodes of the feature classification module are obtained.
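Combining the sketches above, one training epoch could look as follows; this is an assumed wiring (the module names, optimizer usage and hyper-parameter values are ours) built on the FeatureExtractor, CosineFC and cs_softmax_loss sketched earlier:

```python
import torch

def train_epoch(extractor, classifier, loader, optimizer, m=0.3, a=2.0, lambda_p=16.0):
    """One epoch: CNN representation features -> cosine FC -> CS-softmax loss -> update.
    `loader` yields (time-frequency feature batch of shape (N, 1, n_mels, frames), labels)."""
    extractor.train(); classifier.train()
    for x, labels in loader:
        cosine = classifier(extractor(x))                       # (N, C) cosine similarities
        loss = cs_softmax_loss(cosine, labels, m, a, lambda_p)  # CS-softmax loss sketched earlier
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```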
And S106, classifying the test sample data by using the network obtained by training.
In this embodiment, the input to the fully connected layer is the acoustic representation feature of each sample learned through the CNN. The output is the inner product of the acoustic representation feature with the weight vector $W = [w_1, w_2, \ldots, w_C]$ of the output nodes of the classifier. Since the number of output nodes corresponds to the number of classes, the weight vector of each output node can be regarded as the representative representation feature of that class.
The FC layer requires the norm of the input representation feature to be 1, so the acoustic representation features need to be normalized before entering the FC layer. The two-norm of each weight vector of the network is 1 and the bias of each FC node is 0. Thus, the network output can be viewed as the cosine similarity between the representation feature of the input sample and the representative representation feature of each class.
In this embodiment, the specific method for calculating the positive and negative cosine similarity measures between the representation feature and the classifier weights is as follows:
in a classification task over C scene classes, if a sample belongs to class i, the cosine similarity between its representation feature and the weight of the i-th class classifier is defined as the positive cosine similarity value $s_{ip}$, and the cosine similarity between the representation feature and the weight of the j-th class classifier (j ≠ i) is the negative cosine similarity value $s_{jn}$. The positive and negative similarities are transformed with different mappings to obtain the improved positive and negative cosine similarity measures $\tilde{s}_{ip}$ and $\tilde{s}_{jn}$, which distinguish the different effects of the positive and negative examples, as shown in formula (1):

$$\tilde{s}_{ip} = \lambda_p\,\alpha_p\,(s_{ip} - \Delta_p), \qquad \tilde{s}_{jn} = \lambda_n\,\alpha_n\,(s_{jn} - \Delta_n) \qquad (1)$$

where $\lambda_p$ is the scale factor corresponding to positive similarity and $\lambda_n$ is the scale factor corresponding to negative similarity; $\alpha_p$ is the weight update factor corresponding to positive similarity, $\alpha_n$ is the weight update factor for negative similarity, $\Delta_p$ is the margin factor for positive similarity, and $\Delta_n$ is the margin factor for negative similarity.
Further, the distances of the positive and negative similarities from their respective optimization targets $O_p$ and $O_n$ are used as the weight update factors: $\alpha_p = O_p - s_{ip}$, $\alpha_n = s_{jn} - O_n$. Since the optimization target of the positive similarity $s_{ip}$ is 1, the optimization target of the negative similarity $s_{jn}$ is 0, and the margin between the positive and negative similarities is 1, under these constraints the parameter relations of formula (2) are obtained:

$$O_p = 1 + \Delta_n, \qquad O_n = -\Delta_n, \qquad \Delta_p = 1 - \Delta_n \qquad (2)$$
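A quick numeric check of these relations (all values here are illustrative):

```python
m = 0.3                                        # illustrative margin parameter
delta_n = m
O_p, O_n, delta_p = 1 + delta_n, -delta_n, 1 - delta_n
s_ip, s_jn = 0.8, 0.2                          # example cosine similarities
alpha_p, alpha_n = O_p - s_ip, s_jn - O_n      # weight update factors
print(O_p, O_n, delta_p, alpha_p, alpha_n)     # 1.3 -0.3 0.7 0.5 0.5
```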
further, let Δ n ═ m (0. ltoreq. m.ltoreq.1), and a ═ λnpWhen Δ p is 1-m, On=-m,O p1+ m, transformed sine and cosine similarity measure
Figure BDA0002785462370000094
And negative cosine similarity measure
Figure BDA0002785462370000095
Are respectively as
Figure BDA0002785462370000096
The loss of the convolutional classification network model is calculated with the original softmax cross-entropy framework, giving the softmax function $L_{CS}$ based on the positive and negative cosine similarity measures of the samples, as shown in formula (4):

$$L_{CS} = -\frac{1}{N}\sum_{k=1}^{N}\log\frac{e^{\tilde{s}_{ip}}}{e^{\tilde{s}_{ip}} + \sum_{j=1,\,j\neq i}^{C} e^{\tilde{s}_{jn}}} \qquad (4)$$

where N is the number of samples, m is the first adjustment parameter, a is the second adjustment parameter, and $\lambda_p$ is the scale factor corresponding to positive similarity.
The calculated cross-entropy loss $L_{CS}$ is used to train the network, yielding the network parameters of the CNN layers and of the FC layer, i.e. the network parameters of each layer of the deep convolutional neural network and the weight parameters of the output nodes of the fully connected layer.
The trained network is then used to process the acoustic sample data to be classified, and the classifier makes the judgment to obtain the classification result.
In a specific embodiment, the classification output layer based on the positive cosine similarity and negative cosine similarity softmax function can realize flexible adjustment of the shape of the decision boundary by adjusting parameters m and a. The principle is as follows:
for the sample feature x of class i, the decision boundary for determining that the sample belongs to class i and does not belong to class j can be determined by
Figure BDA0002785462370000101
It is derived that the decision boundary is determined to be:
Figure BDA0002785462370000102
wherein the content of the first and second substances,
Figure BDA0002785462370000103
when the parameters satisfy formula (2), formula (5) can be expressed as formula (6)
Figure BDA0002785462370000104
Similarly, for a sample feature x of class j, the decision boundary for judging that the sample belongs to class j and does not belong to class i is:

$$(s_{jp} - 1)^2 + a\,s_{in}^2 = (1 + a)\,m^2 \qquad (7)$$

where $s_{jp}$ is the cosine similarity of the feature to the class-j weight and $s_{in}$ is its cosine similarity to the class-i weight.
two classification decision boundary diagrams corresponding to the positive cosine similarity softmax loss function are given in fig. 4 and 5. There are two decision boundaries between any two classes, and the shape of the decision boundary is controlled by both m and a. Comparing fig. 4 and 5, it can be seen that m mainly controls the size of the decision area and a mainly controls the shape of the decision boundary.
As shown in fig. 4, when m is changed from 0.4 to 0.1, the decision area is continuously contracted in both horizontal and vertical directions with the decrease of m, the decision margin is continuously increased, and the discriminative ability of the learned features is continuously enhanced.
As shown in fig. 5, when a is changed from 3 to 1/3, the decision region becomes deeper in the vertical direction and narrower in the horizontal direction as a decreases.
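The behaviour described for figs. 4 and 5 can be reproduced numerically from formula (6); the sketch below simply evaluates the boundary curve for several parameter settings (the specific values follow the figures, the code itself is ours):

```python
import numpy as np

def boundary_sn(s_p, m, a):
    """Solve formula (6), (s_ip - 1)^2 + a * s_jn^2 = (1 + a) * m^2, for s_jn >= 0."""
    val = ((1 + a) * m ** 2 - (s_p - 1) ** 2) / a
    return np.sqrt(np.clip(val, 0.0, None))

s_p = np.linspace(0.0, 1.0, 201)
for m in (0.4, 0.3, 0.2, 0.1):                 # cf. fig. 4: the boundary shrinks as m decreases
    print(f"m={m}, a=3 -> max s_jn on boundary: {boundary_sn(s_p, m, 3.0).max():.3f}")
for a in (3.0, 2.0, 0.5, 1.0 / 3.0):           # cf. fig. 5: a reshapes the boundary
    print(f"m=0.4, a={a:.2f} -> max s_jn on boundary: {boundary_sn(s_p, 0.4, a).max():.3f}")
```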
Therefore, in practical application, parameter setting can be adjusted in a targeted manner according to the characteristic distribution during network convergence, so that the decision boundary is continuously close to the characteristic distribution.
Further, the method further comprises:
the method comprises the steps of dividing acoustic sample data into training samples, verification samples and test samples according to a preset dividing proportion.
The method extracts representation features of the acoustic scene based on a convolutional neural network (CNN), calculates the positive and negative cosine similarity measures between the representation feature of each acoustic signal sample and the representative class features, and the feature classification module classifies according to the magnitude of these similarities. During network training, the shape of the classification decision surface of the network can be adjusted by setting the parameters of the loss functions corresponding to the various classes of signal samples, which improves the classification accuracy and thus the performance of acoustic scene classification.
Table 1 shows the classification accuracy obtained with the CS-softmax loss function of the present invention on the development data set of the DCASE2019 ASC task, using the CNN9avg model (see Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems, Qiuqiang Kong, Yin Cao, Turab Iqbal, Yong Xu, Wenwu Wang, Mark D. Plumbley). The classification accuracy obtained with the original softmax loss is 70.3%. As can be seen from the results in the table, the classification accuracy varies as the three control parameters m, a and $\lambda_p$ are changed, and in most cases it is better than the accuracy obtained with the original softmax loss.
Table 1. Classification accuracy of the CS-softmax loss function in DCASE2019 acoustic scene classification (unit: %)
(The numerical results of Table 1 are provided as an image in the original publication.)
In the embodiment of the invention, the embedded representation used for classification is learned by deep network processing of the time-frequency feature samples corresponding to the acoustic scene data. The deep network is trained with the softmax function based on positive and negative cosine similarity, which improves the separability of the deep embedded representation. Using the learned deep network parameters, the performance of the acoustic scene classification task is effectively improved.
In the following, the flow of a CIFAR10 deep image classification model training method based on the positive-negative cosine similarity softmax function is introduced with reference to a specific implementation of the embodiment of the present invention, as shown in fig. 3. The time-frequency features of the data in acoustic scene classification can be regarded as images of the data on the time-frequency plane; therefore, network training and testing based on the positive-negative cosine similarity softmax function can also be carried out in an image embodiment. The present embodiment adopts CIFAR10 as the training and test data set. The method may include the following steps:
s201, dividing the image data into training samples and testing samples.
S202, constructing a feature extraction module of image data based on the deep convolutional neural network, inputting training data samples, and calculating low-dimensional representation features.
And S203, inputting the representation features of the image samples into the feature classification module, which takes the form of a fully connected layer, and calculating the positive and negative cosine similarity measures between the representation features and the weight of each output node of the fully connected layer.
S204, transforming the positive and negative similarity measures to obtain the improved positive and negative cosine similarity measures.
And S205, calculating to obtain the loss of the network based on the softmax cross entropy framework, training the network, and obtaining the optimal network parameter after the loss is converged.
And S206, classifying the image test data by using the network obtained by training.
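An illustrative data pipeline for this embodiment is sketched below with torchvision; the transform values and batch size are assumptions, and the feature extractor, cosine FC layer and CS-softmax loss sketched earlier are reused with 3 input channels and 10 classes:

```python
import torch
import torchvision
import torchvision.transforms as T

transform = T.Compose([T.ToTensor(),
                       T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)
```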
It should be noted that the specific calculation methods for these steps are the same as in the embodiment above and are not repeated here.
In the embodiment of the invention, the representation features used for classification are learned by deep convolutional network processing of samples such as audio and images, and classification is carried out with a fully connected layer. By using the positive-negative cosine similarity softmax loss function with the embedded adjustable parameters m and a, the shape of the classification decision boundary between different classes can be flexibly controlled, so that the learned representation features are better gathered within the boundary surfaces and the clustering property of the samples is improved. The representation features of the test samples are extracted with the learned deep network parameters and can be classified directly by the classification layer, which effectively improves the performance of multi-class classification tasks.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (9)

1. The acoustic scene classification method based on the improved softmax function is characterized by comprising the following steps: acquiring time-frequency characteristics of an acoustic signal sample; the time-frequency characteristics are used as the input of an acoustic scene classification model which is trained in advance, and the acoustic scene classification model is used for carrying out classification judgment on the time-frequency characteristics to obtain an acoustic scene classification result; wherein the acoustic scene classification model is trained by adopting an improved softmax function.
2. The acoustic scene classification method based on the modified softmax function of claim 1, wherein the acoustic scene classification model comprises a deep convolutional neural network and a fully connected layer; extracting acoustic representation features by adopting a deep convolutional neural network, and outputting the obtained acoustic representation features to the full-connection layer; and the full connection layer is used for judging the type of the acoustic representation characteristics and outputting an acoustic scene classification result.
3. The acoustic scene classification method based on the modified softmax function according to claim 1 or 2, wherein the training method of the acoustic scene classification model is as follows:
inputting time-frequency characteristics of an acoustic signal training sample;
extracting acoustic representation features by utilizing a deep convolutional neural network;
classifying the acoustic scenes by utilizing a fully connected layer according to the acoustic representation features; calculating the positive cosine similarity measure and the negative cosine similarity measure between the acoustic representation feature and the weight corresponding to each output node of the fully connected layer; calculating the cross-entropy loss of the improved softmax function based on the positive cosine similarity measure and the negative cosine similarity measure;
and training the acoustic scene classification model by using the cross entropy loss obtained by calculation to respectively obtain each layer of network parameters of the deep convolutional neural network and the weight parameters of the output nodes of the full connection layer.
4. The acoustic scene classification method based on the improved softmax function according to claim 3, wherein the specific method for calculating the positive cosine similarity measure and the negative cosine similarity measure between the acoustic representation feature and the weight corresponding to each output node of the fully connected layer is as follows:
the cosine similarity between the acoustic representation feature of the i-th class and the weight of the i-th output node of the fully connected layer is calculated to obtain the positive cosine similarity value $s_{ip}$; the cosine similarity between the acoustic representation feature of the i-th class and the weight of the j-th output node of the fully connected layer is calculated to obtain the negative cosine similarity value $s_{jn}$, i ≠ j; based on the obtained positive cosine similarity value $s_{ip}$ and negative cosine similarity value $s_{jn}$, the improved positive cosine similarity measure $\tilde{s}_{ip}$ and the improved negative cosine similarity measure $\tilde{s}_{jn}$ are obtained.
5. The acoustic scene classification method based on the improved softmax function according to claim 4, wherein the improved positive cosine similarity measure $\tilde{s}_{ip}$ and the improved negative cosine similarity measure $\tilde{s}_{jn}$ are obtained with formula (1):

$$\tilde{s}_{ip} = \lambda_p\,\alpha_p\,(s_{ip} - \Delta_p), \qquad \tilde{s}_{jn} = \lambda_n\,\alpha_n\,(s_{jn} - \Delta_n) \qquad (1)$$

wherein $\lambda_p$ is the scale factor corresponding to positive similarity and $\lambda_n$ is the scale factor corresponding to negative similarity; $\alpha_p$ is the weight update factor corresponding to positive similarity, $\alpha_n$ is the weight update factor for negative similarity, $\Delta_p$ is the margin factor for positive similarity, and $\Delta_n$ is the margin factor for negative similarity.
6. The acoustic scene classification method based on the improved softmax function according to claim 1, wherein the improved softmax function is expressed as follows:

$$L_{CS} = -\frac{1}{N}\sum_{k=1}^{N}\log\frac{e^{\tilde{s}_{ip}}}{e^{\tilde{s}_{ip}} + \sum_{j=1,\,j\neq i}^{C} e^{\tilde{s}_{jn}}}, \qquad \tilde{s}_{ip} = \lambda_p\,(1+m-s_{ip})(s_{ip}-1+m), \quad \tilde{s}_{jn} = a\,\lambda_p\,(s_{jn}+m)(s_{jn}-m)$$

wherein N is the number of samples, m is the first adjustment parameter, a is the second adjustment parameter, $\lambda_p$ is the third adjustment parameter, C is the number of acoustic scene categories, $s_{ip}$ is the positive cosine similarity value, $s_{jn}$ is the negative cosine similarity value, $\tilde{s}_{ip}$ is the improved positive cosine similarity measure, $\tilde{s}_{jn}$ is the improved negative cosine similarity measure, and $\lambda_p$ is the scale factor corresponding to positive similarity.
7. The method for classifying an acoustic scene based on the modified softmax function according to claim 6, wherein the decision boundary of the acoustic scene classification is changed by adjusting the first adjustment parameter m and the second adjustment parameter a.
8. The acoustic scene classification method based on the improved softmax function according to claim 1, wherein the obtained acoustic representation features are normalized and then output to the fully connected layer.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202011296395.6A 2020-11-18 2020-11-18 Acoustic scene classification method based on improved softmax function Active CN112447188B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011296395.6A CN112447188B (en) 2020-11-18 2020-11-18 Acoustic scene classification method based on improved softmax function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011296395.6A CN112447188B (en) 2020-11-18 2020-11-18 Acoustic scene classification method based on improved softmax function

Publications (2)

Publication Number Publication Date
CN112447188A true CN112447188A (en) 2021-03-05
CN112447188B CN112447188B (en) 2023-10-20

Family

ID=74737165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011296395.6A Active CN112447188B (en) 2020-11-18 2020-11-18 Acoustic scene classification method based on improved softmax function

Country Status (1)

Country Link
CN (1) CN112447188B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5390285A (en) * 1990-11-08 1995-02-14 British Telecommunications Public Limited Company Method and apparatus for training a neural network depending on average mismatch
US9824692B1 (en) * 2016-09-12 2017-11-21 Pindrop Security, Inc. End-to-end speaker recognition using deep neural network
US20190130275A1 (en) * 2017-10-26 2019-05-02 Magic Leap, Inc. Gradient normalization systems and methods for adaptive loss balancing in deep multitask networks
CN109829377A (en) * 2018-12-28 2019-05-31 河海大学 A kind of pedestrian's recognition methods again based on depth cosine metric learning
US20190303754A1 (en) * 2018-03-28 2019-10-03 University Of Maryland, College Park L2 constrained softmax loss for discriminative face verification
CN110659378A (en) * 2019-09-07 2020-01-07 吉林大学 Fine-grained image retrieval method based on contrast similarity loss function
WO2020023585A1 (en) * 2018-07-26 2020-01-30 Med-El Elektromedizinische Geraete Gmbh Neural network audio scene classifier for hearing implants
CN111462755A (en) * 2020-03-03 2020-07-28 深圳壹账通智能科技有限公司 Information prompting method and device, electronic equipment and medium
CN111723675A (en) * 2020-05-26 2020-09-29 河海大学 Remote sensing image scene classification method based on multiple similarity measurement deep learning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5390285A (en) * 1990-11-08 1995-02-14 British Telecommunications Public Limited Company Method and apparatus for training a neural network depending on average mismatch
US9824692B1 (en) * 2016-09-12 2017-11-21 Pindrop Security, Inc. End-to-end speaker recognition using deep neural network
US20190130275A1 (en) * 2017-10-26 2019-05-02 Magic Leap, Inc. Gradient normalization systems and methods for adaptive loss balancing in deep multitask networks
US20190303754A1 (en) * 2018-03-28 2019-10-03 University Of Maryland, College Park L2 constrained softmax loss for discriminative face verification
WO2020023585A1 (en) * 2018-07-26 2020-01-30 Med-El Elektromedizinische Geraete Gmbh Neural network audio scene classifier for hearing implants
CN109829377A (en) * 2018-12-28 2019-05-31 河海大学 A kind of pedestrian's recognition methods again based on depth cosine metric learning
CN110659378A (en) * 2019-09-07 2020-01-07 吉林大学 Fine-grained image retrieval method based on contrast similarity loss function
CN111462755A (en) * 2020-03-03 2020-07-28 深圳壹账通智能科技有限公司 Information prompting method and device, electronic equipment and medium
CN111723675A (en) * 2020-05-26 2020-09-29 河海大学 Remote sensing image scene classification method based on multiple similarity measurement deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KUN YAO et al.: "Acoustic scene classification based on additive margin softmax", IEEE *
李一野; 邓浩江: "Collaborative filtering recommendation algorithm based on improved cosine similarity" (基于改进余弦相似度的协同过滤推荐算法), 计算机与现代化 (Computer and Modernization), no. 01 *
王振宇等: "Mispronunciation detection for second-language speakers based on acoustic phoneme embeddings and Siamese networks" (基于声学音素向量和孪生网络的二语者发音偏误确认), 中文信息学报 (Journal of Chinese Information Processing), vol. 33, no. 04 *

Also Published As

Publication number Publication date
CN112447188B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN109949317A (en) Based on the semi-supervised image instance dividing method for gradually fighting study
JP6798614B2 (en) Image recognition device, image recognition method and image recognition program
KR101780676B1 (en) Method for learning rejector by forming classification tree in use of training image and detecting object in test image by using the rejector
CN105678231A (en) Pedestrian image detection method based on sparse coding and neural network
CN104795064A (en) Recognition method for sound event under scene of low signal to noise ratio
CN107644032A (en) Outlier detection method and apparatus
CN110738132B (en) Target detection quality blind evaluation method with discriminant perception capability
CN109559758A (en) A method of texture image is converted by haptic signal based on deep learning
Koluguri et al. Spectrogram enhancement using multiple window Savitzky-Golay (MWSG) filter for robust bird sound detection
CN113762049B (en) Content identification method, content identification device, storage medium and terminal equipment
CN112348360B (en) Chinese medicine production process parameter analysis system based on big data technology
CN109902692A (en) A kind of image classification method based on regional area depth characteristic coding
CN112447188A (en) Acoustic scene classification method based on improved softmax function
CN109753922A (en) Anthropomorphic robot expression recognition method based on dense convolutional neural networks
Zhao et al. Learning saliency features for face detection and recognition using multi-task network
CN112215112A (en) Method and system for generating neural network model for hand motion recognition
Chen et al. An intelligent nocturnal animal vocalization recognition system
CN115481685A (en) Radiation source individual open set identification method based on prototype network
CN109767545B (en) Method and system for classifying defects of valuable bills
CN109409381A (en) The classification method and system of furniture top view based on artificial intelligence
JP7341962B2 (en) Learning data collection device, learning device, learning data collection method and program
CN112949385B (en) Water surface target detection and identification method based on optical vision
CN109344881B (en) Extended classifier based on space-time continuity
CN106327494A (en) Pavement crack image automatic detection method
KR101094433B1 (en) Method for identifying image face and system thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant