CN112447188B - Acoustic scene classification method based on improved softmax function - Google Patents

Acoustic scene classification method based on improved softmax function

Info

Publication number
CN112447188B
CN112447188B
Authority
CN
China
Prior art keywords
acoustic
cosine similarity
scene classification
sine
improved
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011296395.6A
Other languages
Chinese (zh)
Other versions
CN112447188A (en)
Inventor
杨吉斌
张强
张雄伟
曹铁勇
张睿
白玮
赵斐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN202011296395.6A priority Critical patent/CN112447188B/en
Publication of CN112447188A publication Critical patent/CN112447188A/en
Application granted granted Critical
Publication of CN112447188B publication Critical patent/CN112447188B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L25/51: Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G06N3/045: Neural network architectures; combinations of networks
    • G06N3/047: Neural network architectures; probabilistic or stochastic networks
    • G06N3/08: Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Complex Calculations (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an acoustic scene classification method based on an improved softmax function. During network training, the shape of the network's classification decision surface can be adjusted by setting the loss-function parameters corresponding to the different classes of signal samples, so that the decision surface adapts to the feature clustering characteristics of each class of sound signal, the separation between the decision surfaces of different signal classes is increased, and the classification performance for acoustic scenes is improved.

Description

Acoustic scene classification method based on improved softmax function
Technical Field
The application relates to the technical field of pattern recognition and classification, in particular to an acoustic scene classification method based on an improved softmax function.
Background
Acoustic scene classification uses a sound signal to judge the category of the scene in which it was recorded. The technology belongs to the field of pattern recognition and plays an important role in applications such as intelligent perception for robots and unmanned systems. Classification techniques based on deep learning perform well on acoustic scene classification, but the basic softmax cross-entropy loss function adopted by typical deep network classifiers yields learned features with weak discrimination. Because some acoustic scenes are very similar to one another, the classification performance obtained with the basic softmax function is not ideal.
To better enhance the discrimination of the learned sample representations, many effective improvements and optimization schemes have been proposed, such as L-softmax, A-softmax and GA-softmax. The L-softmax loss function uses the fully connected layer weights in the classification module as the weights of each class's classifier; to improve the discrimination of the learned features, it introduces a multiplicative angular margin into the original softmax function and increases the separation between classes by making the examples harder to learn. The A-softmax loss function further normalizes the weights of the fully connected layer in the classification module. The GA-softmax loss function converts the multiplicative angular margin of the A-softmax loss into an additive angular margin, generalizing the A-softmax loss; it also introduces a scale factor and feature normalization, which make the classifier's decision surface adjustable and allow more flexible control of the discriminability of the learned features. However, in losses built on the softmax cross-entropy framework, such as AM-softmax, the decision margin between any two classes appears as a parallel narrow band in the two-dimensional plane: when such a loss is used to construct a classifier, the shape of each class's decision boundary cannot be changed, the boundary cannot be adjusted flexibly according to the sample feature distribution in the feature space, and the classification recognition rate still needs improvement. Adopting these loss functions for acoustic scene classification therefore limits further improvement of classification performance.
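For reference, the additive-margin construction discussed above can be sketched in a few lines of PyTorch; the function name and the values of the scale s and margin m below are illustrative assumptions rather than values taken from the cited works. The target-class cosine is reduced by a fixed margin before the cross entropy is computed, which is what produces the fixed, parallel decision margin described.

    import torch
    import torch.nn.functional as F

    def additive_margin_softmax_loss(features, weights, labels, s=30.0, m=0.35):
        # features: (N, D) embeddings; weights: (C, D) class weights; labels: (N,) class indices.
        # Cosine similarity between L2-normalized embeddings and class weights.
        cos = F.linear(F.normalize(features), F.normalize(weights))          # (N, C)
        # Subtract a fixed additive margin from the target-class cosine only.
        one_hot = F.one_hot(labels, num_classes=weights.size(0)).float()
        logits = s * (cos - m * one_hot)
        return F.cross_entropy(logits, labels)

    # Example: 4 samples, 8-dimensional embeddings, 3 classes.
    loss = additive_margin_softmax_loss(torch.randn(4, 8), torch.randn(3, 8),
                                        torch.tensor([0, 2, 1, 0]))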
Disclosure of Invention
The application addresses the technical problem that current acoustic scene classification models which adopt a loss function based on the softmax cross-entropy framework cannot flexibly adjust the decision boundary according to the sample feature distribution in the feature space, so that the classification recognition rate remains to be improved.
The application adopts the following technical scheme. The application provides an acoustic scene classification method based on an improved softmax function, which comprises: obtaining the time-frequency features of an acoustic signal sample; using the time-frequency features as the input of a pre-trained acoustic scene classification model, and discriminating the category of the time-frequency features with the acoustic scene classification model to obtain the acoustic scene classification result; wherein the acoustic scene classification model is trained using the improved softmax function.
Further, the acoustic scene classification model comprises a deep convolutional neural network and a fully connected layer: the deep convolutional neural network extracts acoustic representation features and outputs them to the fully connected layer, and the fully connected layer discriminates the type of the acoustic representation features and outputs the acoustic scene classification result.
Further, the training method of the acoustic scene classification model is as follows:
inputting time-frequency characteristics of an acoustic signal training sample;
extracting acoustic representation features by using a deep convolutional neural network;
classifying the acoustic scenes with the fully connected layer according to the acoustic representation features; calculating the positive cosine similarity measure and the negative cosine similarity measure between the acoustic representation features and the weights corresponding to each output node of the fully connected layer; and calculating the cross-entropy loss of the improved softmax function based on the positive and negative cosine similarity measures;
training the acoustic scene classification model with the computed cross-entropy loss, thereby obtaining the network parameters of each layer of the deep convolutional neural network and the weight parameters of the output nodes of the fully connected layer.
Still further, the positive cosine similarity measure and the negative cosine similarity measure between the acoustic representation features and the weights corresponding to each output node of the fully connected layer are calculated as follows:
The cosine similarity between an i-th class acoustic representation feature and the weight of the i-th class output node of the fully connected layer is computed to obtain the positive cosine similarity value s_ip, and the cosine similarity between the acoustic representation feature and the weight of the j-th class output node (j ≠ i) is computed to obtain the negative cosine similarity value s_jn. The improved positive cosine similarity measure ŝ_ip and the improved negative cosine similarity measure ŝ_jn are then obtained with formula (1):
ŝ_ip = λ_p · α_p · (s_ip − Δp),  ŝ_jn = λ_n · α_n · (s_jn − Δn)    (1)
where λ_p is the scale factor corresponding to the positive similarity and λ_n is the scale factor corresponding to the negative similarity; α_p is the weight update factor corresponding to the positive similarity and α_n is the weight update factor corresponding to the negative similarity; Δp is the margin factor corresponding to the positive similarity and Δn is the margin factor corresponding to the negative similarity. To simplify the hyperparameter settings, we let λ_n = a·λ_p.
Still further, the improved softmax function is expressed as follows:
L_CS = −(1/N) Σ_{k=1}^{N} log[ exp(ŝ_ip) / ( exp(ŝ_ip) + Σ_{j=1, j≠i}^{C} exp(ŝ_jn) ) ],
with ŝ_ip = λ_p·(1 + m − s_ip)·(s_ip − 1 + m) and ŝ_jn = a·λ_p·(s_jn + m)·(s_jn − m),
where N is the number of samples, m is the first adjustment parameter, a is the second adjustment parameter, λ_p is the scale factor corresponding to the positive similarity, and C is the number of acoustic scene categories.
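A minimal PyTorch sketch of a loss of this form is given below; it is written from the parameter definitions stated in this description, so the exact expression should be read as an interpretation rather than a reproduction of the original formula, and the class name and the default values of m, a and λ_p are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CSSoftmaxLoss(nn.Module):
        # Softmax cross entropy over re-weighted positive/negative cosine similarities.
        # m: first adjustment parameter; a: second adjustment parameter (lambda_n = a * lambda_p);
        # lambda_p: scale factor of the positive similarity. Default values are illustrative.
        def __init__(self, m=0.25, a=1.0, lambda_p=32.0):
            super().__init__()
            self.m, self.a, self.lambda_p = m, a, lambda_p

        def forward(self, cos, labels):
            # cos: (N, C) cosine similarities between the normalized representation features
            # and the normalized weight of each output node; labels: (N,) class indices.
            pos_mask = F.one_hot(labels, num_classes=cos.size(1)).bool()
            s_p = cos[pos_mask]                              # s_ip, shape (N,)
            s_n = cos[~pos_mask].view(cos.size(0), -1)       # s_jn, shape (N, C-1)

            alpha_p = 1.0 + self.m - s_p                     # O_p - s_ip with O_p = 1 + m
            alpha_n = s_n + self.m                           # s_jn - O_n with O_n = -m
            logit_p = self.lambda_p * alpha_p * (s_p - (1.0 - self.m))    # margin Delta_p = 1 - m
            logit_n = self.a * self.lambda_p * alpha_n * (s_n - self.m)   # margin Delta_n = m

            # -log( exp(logit_p) / (exp(logit_p) + sum_j exp(logit_n)) ), averaged over the batch.
            return F.softplus(torch.logsumexp(logit_n, dim=1) - logit_p).mean()

The softplus/logsumexp form used in the last line is numerically equivalent to the negative log-softmax expression but more stable for large scale factors.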
Still further, the decision boundary of the acoustic scene classification is changed by adjusting the first adjustment parameter m and the second adjustment parameter a.
Further preferably, the obtained acoustic representation features are normalized and output to the feature classification module.
The beneficial technical effects obtained by the application are as follows. In the acoustic scene classification model based on the improved softmax function, i.e., in acoustic scene classification based on a deep convolutional neural network (Convolutional Neural Network, CNN), a positive and negative cosine similarity softmax function is designed and the training loss computed with this function is used to learn the network parameters. By adjusting the parameters of the loss function, the shape of the classification decision boundary between categories is controlled and made to approximate the distribution of each class's representation features. The clustering of same-class acoustic samples is thereby improved while the decision margin between different-class acoustic samples is enlarged, which reduces the misjudgment rate, improves the discrimination of the acoustic sample representation features, significantly improves the classification accuracy, and thus improves the performance of the acoustic scene classification system.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
fig. 1 is a schematic diagram of an acoustic scene classification model structure based on the positive and negative cosine similarity softmax function according to an embodiment of the present application;
fig. 2 is a schematic flow chart of implementing acoustic scene classification with an acoustic scene classification model based on the positive and negative cosine similarity softmax function according to an embodiment of the present application;
FIG. 3 is a flowchart of a training method for a deep image classification model based on the positive and negative cosine similarity softmax function according to an embodiment of the present application;
FIG. 4 is a schematic diagram of decision boundary adjustment results according to an embodiment of the present application; FIGS. 4(a), 4(b), 4(c) and 4(d) show the decision boundaries corresponding to m = 0.4, 0.3, 0.2 and 0.1 respectively, with a = 3;
fig. 5 is a schematic diagram of decision boundary adjustment results according to another embodiment of the present application; FIGS. 5(a), 5(b), 5(c) and 5(d) show the decision boundaries corresponding to a = 3, 2, 1/2 and 1/3 respectively, with m = 0.4.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "comprising" and "having" and any variations thereof in the description and claims of the application and in the foregoing drawings are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
An acoustic scene classification method based on an improved softmax function comprises the steps of constructing an acoustic scene classification model; the acoustic scene classification model is used for judging the category of the time-frequency characteristics of the input acoustic signal sample to obtain an acoustic scene classification result; the structural schematic diagram of the classification model is shown in fig. 1, the acoustic scene classification model comprises a feature extraction module and a feature classification module, the feature extraction module is used for extracting acoustic representation features by using a deep convolutional neural network, and the obtained acoustic representation features are output to the feature classification module;
the feature classification module comprises a full-connection layer, and the number of output nodes of the full-connection layer is the same as the number of categories of the acoustic scene; the full-connection layer is used for carrying out type discrimination on the acoustic representation features extracted by the feature extraction module and outputting an acoustic scene classification result.
As shown in fig. 1, in a specific embodiment, the method may include inputting acoustic signal samples and calculating the corresponding basic time-frequency features. The basic time-frequency features of the acoustic signal samples are obtained with existing techniques, which are not described in the present application. The network structure may take the form of a convolutional neural network such as a standard two-dimensional CNN or a ResNet, including, but not limited to, several convolutional layers, pooling layers, batch normalization layers, and the like.
In this embodiment, a feature classification module is built on a fully connected (FC) layer. The number of FC output nodes corresponds to the number of sample classes, the two-norm of each class weight vector in W = [w_1, w_2, ..., w_C] corresponding to the output nodes of the FC layer is 1, and the bias of each output node is 0.
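A possible PyTorch sketch of such a feature classification layer is shown below; the class name and initialization are illustrative assumptions. Both the input representation and every class weight vector are normalized to unit norm and no bias is used, so the layer's output is the cosine similarity to each class weight.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CosineClassifier(nn.Module):
        # Fully connected layer with unit-norm weights and zero bias: one weight vector per class.
        def __init__(self, feat_dim, num_classes):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))

        def forward(self, x):
            # Normalize both the representation features and the class weights so that the
            # inner product equals the cosine similarity to each output node.
            return F.linear(F.normalize(x), F.normalize(self.weight))    # (N, C)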
Based on the first and second embodiments, this embodiment provides a training method for the acoustic scene classification model of the first embodiment, as shown in fig. 2; the training method includes the following steps:
s101, extracting time-frequency characteristics corresponding to the sound scene signals.
It should be noted that the time-frequency feature samples obtained by processing the sound scene signals with the above system are not limited to a particular signal length or specific form; for example, the 10 s audio files in the DCASE2019 dataset may be converted into a logarithmic Mel energy spectrum or a constant-Q transform spectrum as sample data. Optionally, the time-frequency features may be divided into training samples, validation samples and test samples according to a preset split ratio.
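As one possible way of producing such features, the sketch below computes a logarithmic Mel energy spectrum with librosa; the sampling rate, window, hop and Mel-band settings are illustrative assumptions rather than values prescribed here.

    import numpy as np
    import librosa

    def log_mel_feature(path, sr=44100, n_fft=2048, hop_length=1024, n_mels=64):
        # Log-Mel energy spectrum of an audio file, e.g. a 10 s DCASE2019 clip.
        y, sr = librosa.load(path, sr=sr, mono=True)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop_length, n_mels=n_mels)
        return librosa.power_to_db(mel).astype(np.float32)    # shape (n_mels, frames)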
S102, constructing a characteristic extraction module of an acoustic signal based on a deep convolutional neural network, inputting time-frequency characteristic data of the acoustic signal, and calculating a low-dimensional representation characteristic.
In a specific implementation, time-frequency features such as the logarithmic Mel energy spectrum or the constant-Q transform spectrum can be input into the deep convolutional network for training. The deep convolutional neural network adopted by the feature extraction module can take the form of a standard two-dimensional CNN, a ResNet or a similar convolutional neural network, comprising several convolutional layers, pooling layers, batch normalization layers, and the like, and is not limited to a specific form.
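A minimal sketch of such a feature extraction module is given below; the depth, channel widths and output dimension are illustrative assumptions, and any comparable CNN or ResNet-style backbone could be substituted.

    import torch.nn as nn

    class CNNFeatureExtractor(nn.Module):
        # Small 2-D CNN over time-frequency inputs; layer sizes are illustrative only.
        def __init__(self, feat_dim=128):
            super().__init__()
            def block(c_in, c_out):
                return nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                                     nn.BatchNorm2d(c_out),
                                     nn.ReLU(inplace=True),
                                     nn.MaxPool2d(2))
            self.body = nn.Sequential(block(1, 32), block(32, 64), block(64, 128))
            self.pool = nn.AdaptiveAvgPool2d(1)     # global average pooling over time and frequency
            self.proj = nn.Linear(128, feat_dim)    # low-dimensional representation feature

        def forward(self, x):                       # x: (N, 1, n_mels, frames)
            return self.proj(self.pool(self.body(x)).flatten(1))    # (N, feat_dim)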
S103, the low-dimensional representation features learned by the deep network are normalized and then input into the feature classification module. The feature classification module takes the form of a fully connected layer and calculates the positive cosine similarity measure and the negative cosine similarity measure between the representation features and the classifier weights (namely, the weights corresponding to each output node of the fully connected layer).
S104, converting the positive and negative cosine similarity, and calculating softmax cross entropy loss of the network based on the improved positive and negative cosine similarity measurement.
In a specific implementation, the loss function of the model has three parameters, m, a and λ_p; m and a together control the shape of the classification decision boundary. Tasks of different difficulty correspond to different optimal parameter combinations.
S105, training a feature extraction module and a feature classification module based on the loss function.
After learning, obtaining network parameters of each layer of the feature extraction module and weight parameters W of the output nodes of the feature classification module.
S106, classifying the test sample data by utilizing the trained network.
In this embodiment, the input of the fully connected layer is the acoustic representation of each sample learned by the CNN, and the output is the inner product of the acoustic representation features and the weight vector W = [w_1, w_2, ..., w_C] of the output nodes of the classifier. Because the number of output nodes corresponds to the number of categories, the weight vector of each output node can be regarded as the representative representation feature of its category.
The FC layer requires the input representation features to have norm 1, so the acoustic representation features need to be normalized before the FC layer. The two-norm of each weight vector in W is 1 and the bias of each FC node is 0; thus, the network output can be regarded as the cosine similarity between the representation feature of an input sample and the representative feature of each class.
In this embodiment, the specific method for calculating the positive and negative cosine similarity measure of the representative feature and the classifier weight is as follows:
In a C-class scene classification task, if a sample belongs to class i, the cosine similarity between its representation feature and the class-i classifier weight is defined as the positive cosine similarity value s_ip, and the cosine similarity between the feature and the class-j classifier weight (j ≠ i) is the negative cosine similarity value s_jn. The positive and negative similarities are converted with different mapping forms to obtain the improved positive and negative cosine similarity measures ŝ_ip and ŝ_jn, so that the different roles of positive and negative examples can be distinguished, as shown in formula (1):
ŝ_ip = λ_p · α_p · (s_ip − Δp),  ŝ_jn = λ_n · α_n · (s_jn − Δn)    (1)
where λ_p is the scale factor corresponding to the positive similarity and λ_n is the scale factor corresponding to the negative similarity; α_p is the weight update factor corresponding to the positive similarity and α_n is the weight update factor corresponding to the negative similarity; Δp is the margin factor corresponding to the positive similarity and Δn is the margin factor corresponding to the negative similarity.
Further, the difference between each similarity and its optimization target O_p or O_n is used as the weight update factor: α_p = O_p − s_ip and α_n = s_jn − O_n. Since the optimization target of the positive similarity s_ip is 1 and the optimization target of the negative similarity s_jn is 0, the distance between the positive and negative similarity targets is 1. Under this constraint, the parameter relationship in formula (2) is obtained.
further, let Δn=m (0.ltoreq.m.ltoreq.1), a=λ np Δp=1-m, O n =-m,O p =1+m, converted sine and cosine similarity measureAnd a negative cosine similarity measure ++>Respectively is
The loss of the convolutional classification network model is calculated within the original softmax cross-entropy framework, giving the softmax function L_CS based on the positive and negative cosine similarity measures, as shown in formula (4):
L_CS = −(1/N) Σ_{k=1}^{N} log[ exp(ŝ_ip) / ( exp(ŝ_ip) + Σ_{j≠i} exp(ŝ_jn) ) ]    (4)
where N is the number of samples, m is the first adjustment parameter, a is the second adjustment parameter, and λ_p is the scale factor corresponding to the positive similarity.
The network is trained with the computed cross-entropy loss L_CS to obtain the network parameters of the CNN and FC layers, namely the network parameters of each layer of the deep convolutional neural network and the weight parameters of the output nodes of the fully connected layer.
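One possible training routine is sketched below; it assumes modules of the kind sketched earlier in this description (a CNN feature extractor, a unit-norm fully connected classifier and a CS-softmax-style criterion), which are passed in as arguments, and the optimizer, learning rate and epoch count are illustrative assumptions.

    import torch

    def train_asc_model(extractor, classifier, criterion, loader,
                        epochs=60, lr=1e-3, device="cpu"):
        # extractor: time-frequency features -> representation features
        # classifier: representation features -> cosine similarity to each class weight
        # criterion: loss taking (cosine similarities, labels), e.g. a CS-softmax-style loss
        extractor.to(device)
        classifier.to(device)
        params = list(extractor.parameters()) + list(classifier.parameters())
        optimizer = torch.optim.Adam(params, lr=lr)
        for _ in range(epochs):
            for feats, labels in loader:               # feats: (N, 1, n_mels, frames)
                feats, labels = feats.to(device), labels.to(device)
                cos = classifier(extractor(feats))     # (N, C) cosine similarities
                loss = criterion(cos, labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        # After training, the per-layer CNN parameters are in extractor.state_dict() and
        # the class weight vectors W are the parameters of classifier.
        return extractor, classifier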
The acoustic sample data to be classified are then processed with the trained network, and the classifier performs the discrimination to obtain the classification result.
In a specific embodiment, the classification output layer based on the positive and negative cosine similarity softmax function can realize flexible adjustment of the decision boundary shape by adjusting parameters m and a. The principle is as follows:
for sample feature x of class i, the decision boundary for judging that the sample belongs to class i and does not belong to class j can be determined byThe derived decision boundary is:
wherein, the liquid crystal display device comprises a liquid crystal display device,
when the parameters satisfy the formula (2), the formula (5) can be expressed as the formula (6)
Sample feature x of class j is the same, and decision boundary that it belongs to class j and does not belong to class i is:
schematic diagrams of the two classification decision boundaries corresponding to the sine and cosine similarity softmax loss function are given in fig. 4 and 5. There are two decision boundaries between any two classes, and the shape of the decision boundaries is commonly controlled by m and a. As can be seen from comparing fig. 4 and 5, m mainly controls the size of the decision area, and a mainly controls the shape of the decision boundary.
As shown in fig. 4, when m is changed from 0.4 to 0.1, the decision area is continuously contracted in the horizontal and vertical directions, the decision margin is continuously increased, and the discrimination of the learned feature is continuously enhanced as m is reduced.
As shown in fig. 5, when a is changed from 3 to 1/3, the decision area becomes deeper in the vertical direction and narrower in the horizontal direction as a is reduced.
Therefore, in practical applications, the parameter settings can be adjusted in a targeted manner according to the feature distribution at network convergence, so that the decision boundary progressively approximates the feature distribution.
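The behaviour of the decision region under different (m, a) settings can also be checked numerically; the short sketch below shades the class-i region implied by the boundary form given in formula (6) above, for a few parameter pairs. The plotting layout and the parameter values are illustrative assumptions, not figures reproduced from the patent.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_decision_regions(settings=((0.4, 3.0), (0.1, 3.0), (0.4, 1.0 / 3.0))):
        # Shade the class-i decision region (s_ip - 1)^2 + a * s_jn^2 <= (1 + a) * m^2
        # in the plane of cosine similarities to the class-i and class-j weights.
        grid = np.linspace(-1.0, 1.0, 401)
        sp, sn = np.meshgrid(grid, grid)                    # sp: similarity to w_i, sn: to w_j
        fig, axes = plt.subplots(1, len(settings), figsize=(4 * len(settings), 4))
        for ax, (m, a) in zip(axes, settings):
            inside = ((sp - 1.0) ** 2 + a * sn ** 2 <= (1.0 + a) * m ** 2).astype(float)
            ax.contourf(sp, sn, inside, levels=[0.5, 1.5])  # shaded area = class-i decision region
            ax.set_xlabel("s_ip")
            ax.set_ylabel("s_jn")
            ax.set_title(f"m={m:.2f}, a={a:.2f}")
        plt.tight_layout()
        plt.show()

    plot_decision_regions()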
Further, the method further comprises the following steps:
and dividing the acoustic sample data into a training sample, a verification sample and a test sample according to a preset dividing proportion.
The application extracts the representation features of the acoustic scene with a convolutional neural network (CNN) and calculates the positive and negative cosine similarity measures between each acoustic signal sample and the representative features, so that the feature classification module can classify the acoustic scene according to this similarity. During network training, the shape of the network's classification decision surface can be adjusted by setting the loss-function parameters corresponding to the different classes of signal samples, which improves the classification accuracy and thus the performance of acoustic scene classification.
Table 1 shows the classification accuracy obtained with the CS-softmax loss function of the present application on the development dataset of DCASE2019 ASC using the CNN9avg model (see the paper Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems, Qiuqiang Kong, Yin Cao, Turab Iqbal, Yong Xu, Wenwu Wang, Mark D. Plumbley, arXiv preprint arXiv:1904.03476, 2019). The classification accuracy obtained with the original softmax loss was 70.3%. As can be seen from the results in the table, the classification accuracy changes as the three control parameters m, a and λ_p are varied, and in most cases it is better than the classification accuracy obtained with the original softmax loss.
Table 1. Classification accuracy of the CS-softmax loss function in DCASE2019 acoustic scene classification (unit: %)
In the embodiment of the application, the embedded representation used for classification is learned by processing the time-frequency feature samples corresponding to the sound scene data with a deep network. The deep network is trained with the softmax function based on the positive and negative cosine similarity, which improves the separability of the deep embedded representation, and the learned deep network parameters effectively improve the performance of the acoustic scene classification task.
In the following, the procedure of a training method for a CIFAR10 deep image classification model based on the positive and negative cosine similarity softmax function is described in connection with a specific implementation of an embodiment of the present application, as shown in FIG. 3. In acoustic scene classification, the time-frequency features of the data can be regarded as an image of the data on the time-frequency plane, so network training and testing based on the positive and negative cosine similarity softmax function can also be implemented in an image embodiment. This embodiment uses CIFAR10 as the training and test dataset. The method may comprise the following steps:
s201, dividing the image data into a training sample and a test sample.
S202, a feature extraction module of image data is constructed based on a deep convolutional neural network, training data samples are input, and low-dimensional representation features are calculated.
S203, inputting the representation features of the image samples into the feature classification module, which takes the form of a fully connected layer, and calculating the positive and negative cosine similarity measures between the representation features and the weights of all output nodes of the fully connected layer.
S204, converting the positive and negative similarity measures to obtain the improved positive and negative cosine similarity measures.
S205, calculating to obtain the loss of the network based on the softmax cross entropy framework, training the network, and obtaining the optimal network parameters after the loss converges.
S206, classifying the image test data by using the trained network.
In this embodiment, the specific calculation method in the steps is provided in the above embodiment, and the description of this embodiment is not repeated.
In the embodiment of the application, representative features for classification are learned by processing samples such as audio and images with a deep convolutional network, and the classification decision is made with the fully connected layer. The positive and negative cosine similarity softmax loss function with the embedded adjustable parameters m and a allows the shape of the classification decision boundary between different categories to be controlled flexibly, so that the learned representation features are better gathered within the decision boundary and the clustering characteristic of the samples is improved. The representation features of the test samples are extracted with the learned deep network parameters and classified directly by the classification layer, which effectively improves the performance of multi-category classification tasks.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are all within the protection of the present application.

Claims (7)

1. An acoustic scene classification method based on an improved softmax function, comprising: acquiring time-frequency characteristics of an acoustic signal sample; using the time-frequency characteristics as the input of an acoustic scene classification model which is trained in advance, and judging the categories of the time-frequency characteristics by using the acoustic scene classification model to obtain an acoustic scene classification result; wherein the acoustic scene classification model is trained using an improved softmax function;
the acoustic scene classification model comprises a deep convolutional neural network and a full-connection layer; extracting acoustic representation features by using a deep convolutional neural network, and outputting the obtained acoustic representation features to the full connection layer; the full connection layer is used for carrying out type discrimination on the acoustic representation characteristics and outputting an acoustic scene classification result;
the modified softmax function is expressed as follows:
wherein N is the number of samples, m is a first adjustment parameter, a is a second adjustment parameter, lambda p For the third adjustment parameter, C is the number of categories of the acoustic scene, s ip Is sine and cosine similarity value, s jn Is the value of the similarity of the negative cosine,for an improved sine-cosine similarity measure,for improved negative cosine similarityMetric lambda p Is the scale factor corresponding to positive similarity; i is the i-th class of acoustic representation feature and j is the j-th class of acoustic representation feature.
2. The improved softmax function-based acoustic scene classification method of claim 1, wherein the training method of the acoustic scene classification model is as follows:
inputting time-frequency characteristics of an acoustic signal training sample;
extracting acoustic representation features by using a deep convolutional neural network;
classifying the acoustic scenes with the fully connected layer according to the acoustic representation features; calculating the positive cosine similarity measure and the negative cosine similarity measure between the acoustic representation features and the weights corresponding to each output node of the fully connected layer; calculating the cross-entropy loss of the improved softmax function based on the positive and negative cosine similarity measures;
training the acoustic scene classification model by using the cross entropy loss obtained by calculation to respectively obtain network parameters of each layer of the deep convolutional neural network and weight parameters of the output nodes of the full-connection layer.
3. The acoustic scene classification method based on the improved softmax function according to claim 2, wherein the positive cosine similarity measure and the negative cosine similarity measure between the acoustic representation features and the weights corresponding to each output node of the fully connected layer are calculated as follows:
calculating the cosine similarity between the i-th class acoustic representation feature and the weight of the i-th output node of the fully connected layer to obtain the positive cosine similarity value s_ip, and calculating the cosine similarity between the i-th class acoustic representation feature and the weight of the j-th output node of the fully connected layer (j ≠ i) to obtain the negative cosine similarity value s_jn; and, based on the obtained positive cosine similarity value s_ip and negative cosine similarity value s_jn, obtaining the improved positive cosine similarity measure ŝ_ip and the improved negative cosine similarity measure ŝ_jn.
4. The improved softmax function-based acoustic scene classification method of claim 3, wherein the improved positive cosine similarity measure ŝ_ip and the improved negative cosine similarity measure ŝ_jn are obtained with formula (1):
ŝ_ip = λ_p · α_p · (s_ip − Δp),  ŝ_jn = λ_n · α_n · (s_jn − Δn)    (1)
wherein λ_p is the scale factor corresponding to the positive similarity and λ_n is the scale factor corresponding to the negative similarity;
α_p is the weight update factor corresponding to the positive similarity, α_n is the weight update factor corresponding to the negative similarity, Δp is the margin factor corresponding to the positive similarity, and Δn is the margin factor corresponding to the negative similarity.
5. The acoustic scene classification method based on the improved softmax function according to claim 1, wherein the decision boundary of the acoustic scene classification is changed by adjusting the first adjustment parameter m and the second adjustment parameter a.
6. The improved softmax function-based acoustic scene classification method of claim 1, wherein the obtained acoustic representation features are normalized and output to a fully connected layer.
7. A computer readable storage medium storing a computer program, which when executed by a processor performs the steps of the method according to any one of claims 1 to 6.
CN202011296395.6A 2020-11-18 2020-11-18 Acoustic scene classification method based on improved softmax function Active CN112447188B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011296395.6A CN112447188B (en) 2020-11-18 2020-11-18 Acoustic scene classification method based on improved softmax function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011296395.6A CN112447188B (en) 2020-11-18 2020-11-18 Acoustic scene classification method based on improved softmax function

Publications (2)

Publication Number Publication Date
CN112447188A CN112447188A (en) 2021-03-05
CN112447188B true CN112447188B (en) 2023-10-20

Family

ID=74737165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011296395.6A Active CN112447188B (en) 2020-11-18 2020-11-18 Acoustic scene classification method based on improved softmax function

Country Status (1)

Country Link
CN (1) CN112447188B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3078530A1 (en) * 2017-10-26 2019-05-02 Magic Leap, Inc. Gradient normalization systems and methods for adaptive loss balancing in deep multitask networks
US11636328B2 (en) * 2018-03-28 2023-04-25 University Of Maryland, College Park L2 constrained softmax loss for discriminative face verification

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5390285A (en) * 1990-11-08 1995-02-14 British Telecommunications Public Limited Company Method and apparatus for training a neural network depending on average mismatch
US9824692B1 (en) * 2016-09-12 2017-11-21 Pindrop Security, Inc. End-to-end speaker recognition using deep neural network
WO2020023585A1 (en) * 2018-07-26 2020-01-30 Med-El Elektromedizinische Geraete Gmbh Neural network audio scene classifier for hearing implants
CN109829377A (en) * 2018-12-28 2019-05-31 河海大学 A kind of pedestrian's recognition methods again based on depth cosine metric learning
CN110659378A (en) * 2019-09-07 2020-01-07 吉林大学 Fine-grained image retrieval method based on contrast similarity loss function
CN111462755A (en) * 2020-03-03 2020-07-28 深圳壹账通智能科技有限公司 Information prompting method and device, electronic equipment and medium
CN111723675A (en) * 2020-05-26 2020-09-29 河海大学 Remote sensing image scene classification method based on multiple similarity measurement deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Acoustic scene classification based on additive margin softmax; Kun Yao et al.; IEEE; full text *
Second-language speaker pronunciation error verification based on acoustic phoneme vectors and Siamese networks; Wang Zhenyu et al.; Journal of Chinese Information Processing; Vol. 33, No. 04; full text *
Collaborative filtering recommendation algorithm based on improved cosine similarity; Li Yiye; Deng Haojiang; Computer and Modernization, No. 01; full text *

Also Published As

Publication number Publication date
CN112447188A (en) 2021-03-05

Similar Documents

Publication Publication Date Title
JP6798614B2 (en) Image recognition device, image recognition method and image recognition program
CN105976809A (en) Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion
WO2023125654A1 (en) Training method and apparatus for face recognition model, electronic device and storage medium
CN109117817B (en) Face recognition method and device
CN112101430A (en) Anchor frame generation method for image target detection processing and lightweight target detection method
CN104795064A (en) Recognition method for sound event under scene of low signal to noise ratio
CN108627798B (en) WLAN indoor positioning algorithm based on linear discriminant analysis and gradient lifting tree
CN107301376B (en) Pedestrian detection method based on deep learning multi-layer stimulation
CN104778230B (en) A kind of training of video data segmentation model, video data cutting method and device
CN109559758A (en) A method of texture image is converted by haptic signal based on deep learning
CN107688790A (en) Human bodys' response method, apparatus, storage medium and electronic equipment
CN102122353A (en) Method for segmenting images by using increment dictionary learning and sparse representation
CN114841257A (en) Small sample target detection method based on self-supervision contrast constraint
CN108615532A (en) A kind of sorting technique and device applied to sound field scape
Koluguri et al. Spectrogram enhancement using multiple window Savitzky-Golay (MWSG) filter for robust bird sound detection
CN102521402B (en) Text filtering system and method
Kohl et al. Learning similarity metrics for numerical simulations
CN112447188B (en) Acoustic scene classification method based on improved softmax function
CN111881965B (en) Hyperspectral pattern classification and identification method, device and equipment for medicinal material production place grade
CN113496260A (en) Grain depot worker non-standard operation detection method based on improved YOLOv3 algorithm
CN117115197A (en) Intelligent processing method and system for design data of LED lamp bead circuit board
CN109614929A (en) Method for detecting human face and system based on more granularity cost-sensitive convolutional neural networks
Zhao et al. Learning saliency features for face detection and recognition using multi-task network
CN116665390A (en) Fire detection system based on edge calculation and optimized YOLOv5
CN114998731A (en) Intelligent terminal navigation scene perception identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant