CN112447188A - Acoustic scene classification method based on improved softmax function - Google Patents

Acoustic scene classification method based on improved softmax function

Info

Publication number
CN112447188A
Authority
CN
China
Prior art keywords
acoustic
cosine similarity
scene classification
acoustic scene
softmax function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011296395.6A
Other languages
Chinese (zh)
Other versions
CN112447188B (en)
Inventor
杨吉斌
张强
张雄伟
曹铁勇
张睿
白玮
赵斐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN202011296395.6A priority Critical patent/CN112447188B/en
Publication of CN112447188A publication Critical patent/CN112447188A/en
Application granted granted Critical
Publication of CN112447188B publication Critical patent/CN112447188B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides an acoustic scene classification method based on an improved softmax function. Representation features of the acoustic scene are extracted with a convolutional neural network (CNN), the positive cosine similarity measure and the negative cosine similarity measure between the representation feature of each acoustic signal sample and the representative features of each class are calculated, and classification is carried out according to these similarities. During network training, the shape of the classification decision surface of the network can be adjusted by setting the parameters of the loss functions corresponding to the various classes of signal samples, so as to adapt to the feature clustering characteristics of each class of sound signals, increase the distance between the classification decision surfaces of different classes of signals, and improve the performance of acoustic scene classification.

Description

Acoustic scene classification method based on improved softmax function
Technical Field
The invention relates to the technical field of pattern recognition and classification, in particular to an acoustic scene classification method based on an improved softmax function.
Background
Acoustic scene classification uses sound signals to determine the type of scene in which they were recorded. It belongs to the technical field of pattern recognition and plays an important role in applications such as intelligent perception for robots and unmanned systems. Classification techniques based on deep learning have achieved good results in acoustic scene classification, but the basic softmax cross-entropy loss function adopted by typical deep network classifiers suffers from poor discriminability of the learned features. Since some acoustic scenes are strongly similar to one another, the classification performance obtained with the basic softmax function is not ideal.
To improve the discriminability of the learned sample representations, many effective improvement and optimization schemes exist, such as L-softmax, A-softmax and GA-softmax. The L-softmax loss function uses the fully connected layer weights in the classification module as the weight of each class classifier. To improve the discriminability of the learned features, L-softmax introduces a multiplicative angular margin into the original softmax function and increases the distance between class clusters by making the training examples harder to fit. The A-softmax loss function further normalizes the weights of the fully connected layer in the classification module. The GA-softmax loss function converts the multiplicative angular margin of A-softmax into an additive angular margin, generalizing A-softmax; it also introduces a scale factor and feature normalization, which make the classification decision surface of the classifier adjustable and thus give more flexible control over the discriminability of the learned features. In loss functions based on the softmax cross-entropy framework, such as AM-softmax, the decision margin between any two classes appears as a parallel narrow band in a two-dimensional plane; when a classifier is built with such a loss function, the shape of the classification decision boundary of each class is fixed, the decision boundary cannot be flexibly adjusted according to the sample feature distribution in the feature space, and the classification recognition rate still needs to be improved. Using these loss functions for acoustic scene classification therefore limits further improvement of classification performance.
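For reference, the additive-margin idea mentioned above can be sketched as follows. This is a minimal illustration of an AM-softmax-style loss, not part of the invention; the function name, parameter names and the values of s and m are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def am_softmax_loss(features, weights, labels, s=30.0, m=0.35):
    """AM-softmax-style loss sketch: cosine logits with an additive margin m
    subtracted from the target-class logit, then scaled by s.
    features: (N, d) raw sample features; weights: (C, d) class weights; labels: (N,)."""
    cos = F.normalize(features, dim=1) @ F.normalize(weights, dim=1).t()  # (N, C) cosine similarities
    margin = torch.zeros_like(cos)
    margin.scatter_(1, labels.unsqueeze(1), m)          # apply the margin only to the target class
    return F.cross_entropy(s * (cos - margin), labels)
```

Because the margin is a fixed additive offset on the target cosine, the resulting decision margin between any two classes is the parallel narrow band described above, which is exactly the rigidity the invention seeks to remove.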
Disclosure of Invention
The technical problem to be solved by the invention is that, in current acoustic scene classification models that introduce a loss function based on the softmax cross-entropy framework, the decision boundary cannot be flexibly adjusted according to the sample feature distribution in the feature space, and the classification recognition rate needs to be improved.
The invention adopts the following technical scheme. The invention provides an acoustic scene classification method based on an improved softmax function, which comprises the steps of obtaining time-frequency characteristics of acoustic signal samples; the time-frequency characteristics are used as the input of an acoustic scene classification model which is trained in advance, and the acoustic scene classification model is used for carrying out classification judgment on the time-frequency characteristics to obtain an acoustic scene classification result; wherein the acoustic scene classification model is trained by adopting an improved softmax function.
Further, the acoustic scene classification model comprises a deep convolutional neural network and a full connection layer; extracting acoustic representation features by adopting a deep convolutional neural network, and outputting the obtained acoustic representation features to the full-connection layer; and the full connection layer is used for judging the type of the acoustic representation characteristics and outputting an acoustic scene classification result.
Further, the training method of the acoustic scene classification model is as follows:
inputting time-frequency characteristics of an acoustic signal training sample;
extracting acoustic representation features by utilizing a deep convolutional neural network;
classifying the acoustic scenes by utilizing a fully connected layer according to the acoustic representation features; calculating the positive cosine similarity measure and the negative cosine similarity measure between the acoustic representation feature and the weight corresponding to each output node of the fully connected layer; calculating the cross-entropy loss of the improved softmax function based on the positive cosine similarity measure and the negative cosine similarity measure;
and training the acoustic scene classification model by using the cross entropy loss obtained by calculation to respectively obtain each layer of network parameters of the deep convolutional neural network and the weight parameters of the output nodes of the full connection layer.
Still further, the specific method for calculating the positive cosine similarity measure and the negative cosine similarity measure between the acoustic representation feature and the weight corresponding to each output node of the fully connected layer is as follows:
the cosine similarity between the acoustic representation feature of the i-th class and the weight of the i-th output node of the fully connected layer is calculated to obtain the positive cosine similarity value $s_{ip}$; the cosine similarity between the acoustic representation feature and the weight of the j-th class output node of the fully connected layer is calculated to obtain the negative cosine similarity value $s_{jn}$, i ≠ j. Further, the improved positive cosine similarity measure $\tilde{s}_{ip}$ and the improved negative cosine similarity measure $\tilde{s}_{jn}$ are obtained with formula (1):

$$\tilde{s}_{ip} = \lambda_p\,\alpha_p\,(s_{ip} - \Delta_p), \qquad \tilde{s}_{jn} = \lambda_n\,\alpha_n\,(s_{jn} - \Delta_n) \qquad (1)$$

where $\lambda_p$ is the scale factor corresponding to positive similarity and $\lambda_n$ is the scale factor corresponding to negative similarity; $\alpha_p$ is the weight update factor corresponding to positive similarity, $\alpha_n$ is the weight update factor for negative similarity, $\Delta_p$ is the margin factor for positive similarity, and $\Delta_n$ is the margin factor for negative similarity. To simplify the hyper-parameter setting, let $\lambda_n = a\cdot\lambda_p$.
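A minimal numeric sketch of formula (1) follows; the function name and argument names are ours, and the coupling $\lambda_n = a\cdot\lambda_p$ is applied inside:

```python
def improved_similarities(s_ip, s_jn, lambda_p, a, alpha_p, alpha_n, delta_p, delta_n):
    """Formula (1): transformed positive / negative cosine similarity measures,
    with the negative scale factor tied to the positive one via lambda_n = a * lambda_p."""
    lambda_n = a * lambda_p
    s_ip_tilde = lambda_p * alpha_p * (s_ip - delta_p)
    s_jn_tilde = lambda_n * alpha_n * (s_jn - delta_n)
    return s_ip_tilde, s_jn_tilde
```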
Still further, the improved softmax function is expressed as follows:

$$L_{CS} = -\frac{1}{N}\sum_{k=1}^{N}\log\frac{e^{\lambda_p\,(1+m-s_{ip})(s_{ip}-1+m)}}{e^{\lambda_p\,(1+m-s_{ip})(s_{ip}-1+m)} + \sum_{j=1,\,j\neq i}^{C} e^{a\,\lambda_p\,(s_{jn}+m)(s_{jn}-m)}}$$

where N is the number of samples, m is the first adjustment parameter, a is the second adjustment parameter, $\lambda_p$ is the scale factor corresponding to positive similarity, and C is the number of acoustic scene categories.
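A hedged PyTorch sketch of this loss is given below; the function name, default parameter values and tensor layout are our assumptions, and only the formula itself comes from the description above:

```python
import torch
import torch.nn.functional as F

def cs_softmax_loss(cosine, labels, m=0.3, a=2.0, lambda_p=16.0):
    """Improved (CS-) softmax loss sketch.
    cosine: (N, C) cosine similarities between normalized features and class weights.
    labels: (N,) ground-truth class indices."""
    n, c = cosine.shape
    target = F.one_hot(labels, c).bool()
    s_p = cosine[target]                                  # (N,) positive similarities s_ip
    s_n = cosine[~target].view(n, c - 1)                  # (N, C-1) negative similarities s_jn
    logit_p = lambda_p * (1 + m - s_p) * (s_p - 1 + m)    # transformed positive measure
    logit_n = a * lambda_p * (s_n + m) * (s_n - m)        # transformed negative measures
    logits = torch.cat([logit_p.unsqueeze(1), logit_n], dim=1)
    # cross-entropy against index 0 reproduces the softmax ratio of the formula above
    return F.cross_entropy(logits, torch.zeros(n, dtype=torch.long, device=cosine.device))
```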
Still further, the decision boundary of the acoustic scene classification is changed by adjusting the first adjustment parameter m and the second adjustment parameter a.
Further preferably, the obtained acoustic representation features are output to the feature classification module after being normalized.
The invention has the following beneficial technical effects: the acoustic scene classification model based on the improved softmax function designs, on top of a deep convolutional neural network (CNN), a softmax function that uses positive and negative cosine similarities for acoustic scene classification, and uses this function to compute the training loss and learn the network parameters. By adjusting the parameters of the loss function, the shape of the classification decision boundary between classes is controlled and approximated to the distribution of the representation features of each class of samples. This improves the clustering of samples of the same class while enlarging the decision boundary interval between samples of different classes, reduces the misclassification rate, improves the discriminability of the acoustic sample representation features, significantly improves the classification accuracy, and thus improves the performance of the acoustic scene classification system.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a schematic structural diagram of an acoustic scene classification model based on the positive-negative cosine similarity softmax function according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of acoustic scene classification with the acoustic scene classification model based on the positive-negative cosine similarity softmax function according to an embodiment of the present invention;
fig. 3 is a schematic flowchart of a method for training a deep image classification model based on the positive-negative cosine similarity softmax function according to an embodiment of the present invention;
fig. 4 is a schematic diagram of decision boundary adjustment according to an embodiment of the present invention, comprising figs. 4(a), 4(b), 4(c) and 4(d), which show the decision boundaries for m = 0.4, 0.3, 0.2 and 0.1, respectively, with a = 3;
fig. 5 is a schematic diagram of decision boundary adjustment according to another embodiment of the present invention, comprising figs. 5(a), 5(b), 5(c) and 5(d), which show the decision boundaries for a = 3, 2, 1/2 and 1/3, respectively, with m = 0.4.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "including" and "having," and any variations thereof, in the description and claims of this invention and the above-described drawings are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
The acoustic scene classification method based on the modified softmax function comprises the steps of constructing an acoustic scene classification model; the acoustic scene classification model is used for carrying out classification judgment on the time-frequency characteristics of the input acoustic signal samples to obtain an acoustic scene classification result; the structural schematic diagram of the classification model is shown in fig. 1, the acoustic scene classification model comprises a feature extraction module and a feature classification module, the feature extraction module is used for extracting acoustic representation features by adopting a deep convolutional neural network, and outputting the obtained acoustic representation features to the feature classification module;
the feature classification module comprises a full connection layer, and the number of output nodes of the full connection layer is the same as the number of acoustic scene categories; and the full connection layer is used for judging the type of the acoustic representation features extracted by the feature extraction module and outputting an acoustic scene classification result.
As shown in fig. 1, a specific embodiment may include inputting acoustic signal samples and calculating corresponding basic time-frequency characteristics. The basic time-frequency characteristics of the acoustic signal samples are obtained by calculation in the prior art, which is not described in the present application. The network structure may be a convolutional neural network similar to a standard two-dimensional CNN, ResNet, etc., including several convolutional layers, pooling layers, batch normalization layers, etc., but is not limited to a specific form.
In this embodiment, the feature classification module is constructed from a fully connected (FC) layer, the number of FC output nodes corresponds to the number of sample classes, the two-norm of the weight vector $W = [w_1, w_2, \ldots, w_C]$ of each class corresponding to the output nodes of the FC layer is 1, and the bias of each output node is 0.
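A minimal sketch of such a layer (the class name and dimensions are illustrative) is shown below; because both the input feature and each class weight are normalized and the bias is zero, the output of each node is exactly the cosine similarity between the sample feature and that class's weight vector:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineFC(nn.Module):
    """Fully connected output layer with unit-norm class weights and no bias."""
    def __init__(self, embed_dim, num_classes):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, features):
        # normalizing both operands turns the inner product into a cosine similarity
        return F.normalize(features, dim=1) @ F.normalize(self.weight, dim=1).t()
```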
In a second embodiment, on the basis of the first embodiment, the present embodiment provides a training method for an acoustic scene classification model in the first embodiment, as shown in fig. 2, the training method includes the following steps:
s101, extracting time-frequency characteristics corresponding to the sound scene signals.
It should be noted that the time-frequency feature samples obtained by processing the acoustic scene signals are not limited to a particular signal length or form; for example, the 10 s audio files in the DCASE2019 data set may be converted into log-Mel energy spectra or constant-Q transform spectra as sample data. Optionally, the time-frequency features may be divided into training samples, verification samples and test samples according to a preset division ratio.
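As an illustration only, a log-Mel energy spectrum of the kind mentioned above could be computed with librosa; the sampling rate, FFT size, hop length and number of Mel bands below are assumed values, not those of the invention:

```python
import librosa

def logmel_features(wav_path, sr=48000, n_fft=2048, hop_length=1024, n_mels=64):
    """Convert an audio file (e.g. a 10 s DCASE2019 clip) into a log-Mel energy spectrogram."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)   # (n_mels, frames) log-Mel energies
```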
S102, constructing a feature extraction module of the acoustic signal based on the deep convolutional neural network, inputting time-frequency feature data of the acoustic signal, and calculating low-dimensional representation features.
In specific implementation, the system can input time-frequency characteristics such as a logarithm Mel energy spectrum and a constant Q transform spectrum into a deep convolution network for training. The deep convolutional neural network structure adopted by the feature extraction module can adopt convolutional neural networks in the forms similar to standard two-dimensional CNN, ResNet and the like, including a plurality of convolutional layers, pooling layers, batch normalization layers and the like, but is not limited to a certain specific form.
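A minimal sketch of such a feature extraction module is given below; the layer sizes, depth and embedding dimension are illustrative assumptions and not the specific network of the invention:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Illustrative deep CNN feature extraction module (convolution + batch
    normalization + pooling); the method is not limited to this configuration."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, x):                            # x: (N, 1, n_mels, frames)
        return self.proj(self.body(x).flatten(1))    # (N, embed_dim) representation feature
```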
And S103, normalizing the low-dimensional representation features learned by the deep network and inputting them into the feature classification module. The feature classification module takes the form of a fully connected layer and calculates the positive cosine similarity measure and the negative cosine similarity measure between the representation features and the classifier weights (the classifier weights are the weights corresponding to each output node of the fully connected layer).
And S104, converting the positive cosine similarity and the negative cosine similarity, and calculating the softmax cross entropy loss of the network based on the improved positive cosine similarity and the negative cosine similarity.
In a specific implementation, the loss function of the model has three parameters m, a and $\lambda_p$, where m and a together control the shape of the classification decision boundary. Tasks of different difficulty levels correspond to different optimal parameter combinations.
And S105, training the feature extraction module and the feature classification module based on the loss function.
After learning, the network parameters of each layer of the feature extraction module and the weight parameters W of the output nodes of the feature classification module are obtained.
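Combining the sketches above, one training epoch could look as follows; this is an assumed wiring (the module names, optimizer usage and hyper-parameter values are ours) built on the FeatureExtractor, CosineFC and cs_softmax_loss sketched earlier:

```python
import torch

def train_epoch(extractor, classifier, loader, optimizer, m=0.3, a=2.0, lambda_p=16.0):
    """One epoch: CNN representation features -> cosine FC -> CS-softmax loss -> update.
    `loader` yields (time-frequency feature batch of shape (N, 1, n_mels, frames), labels)."""
    extractor.train(); classifier.train()
    for x, labels in loader:
        cosine = classifier(extractor(x))                       # (N, C) cosine similarities
        loss = cs_softmax_loss(cosine, labels, m, a, lambda_p)  # CS-softmax loss sketched earlier
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```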
And S106, classifying the test sample data by using the network obtained by training.
In this embodiment, the input to the fully connected layer is the acoustic representation feature of each sample learned through the CNN. The output is the inner product of the acoustic representation feature with the weight vector $W = [w_1, w_2, \ldots, w_C]$ of the output nodes of the classifier. Since the number of output nodes corresponds to the number of classes, the weight vector of each output node can be regarded as the representative representation feature of that class.
The FC layer requires the norm of the input representation feature to be 1, so the acoustic representation features need to be normalized before entering the FC layer. The two-norm of each weight vector of the network is 1 and the bias of each FC node is 0. Thus, the network output can be viewed as the cosine similarity between the representation feature of the input sample and the representative representation feature of each class.
In this embodiment, the specific method for calculating the positive and negative cosine similarity measures between the representation feature and the classifier weights is as follows:
in a classification task over C scene classes, if a sample belongs to class i, the cosine similarity between its representation feature and the weight of the i-th class classifier is defined as the positive cosine similarity value $s_{ip}$, and the cosine similarity between the representation feature and the weight of the j-th class classifier (j ≠ i) is the negative cosine similarity value $s_{jn}$. The positive and negative similarities are transformed with different mappings to obtain the improved positive and negative cosine similarity measures $\tilde{s}_{ip}$ and $\tilde{s}_{jn}$, which distinguish the different effects of the positive and negative examples, as shown in formula (1):

$$\tilde{s}_{ip} = \lambda_p\,\alpha_p\,(s_{ip} - \Delta_p), \qquad \tilde{s}_{jn} = \lambda_n\,\alpha_n\,(s_{jn} - \Delta_n) \qquad (1)$$

where $\lambda_p$ is the scale factor corresponding to positive similarity and $\lambda_n$ is the scale factor corresponding to negative similarity; $\alpha_p$ is the weight update factor corresponding to positive similarity, $\alpha_n$ is the weight update factor for negative similarity, $\Delta_p$ is the margin factor for positive similarity, and $\Delta_n$ is the margin factor for negative similarity.
Further, the distances of the positive and negative similarities from their respective optimization targets $O_p$ and $O_n$ are used as the weight update factors: $\alpha_p = O_p - s_{ip}$, $\alpha_n = s_{jn} - O_n$. Since the optimization target of the positive similarity $s_{ip}$ is 1, the optimization target of the negative similarity $s_{jn}$ is 0, and the margin between the positive and negative similarities is 1, under these constraints the parameter relations of formula (2) are obtained:

$$O_p = 1 + \Delta_n, \qquad O_n = -\Delta_n, \qquad \Delta_p = 1 - \Delta_n \qquad (2)$$
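A quick numeric check of these relations (all values here are illustrative):

```python
m = 0.3                                        # illustrative margin parameter
delta_n = m
O_p, O_n, delta_p = 1 + delta_n, -delta_n, 1 - delta_n
s_ip, s_jn = 0.8, 0.2                          # example cosine similarities
alpha_p, alpha_n = O_p - s_ip, s_jn - O_n      # weight update factors
print(O_p, O_n, delta_p, alpha_p, alpha_n)     # 1.3 -0.3 0.7 0.5 0.5
```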
further, let Δ n ═ m (0. ltoreq. m.ltoreq.1), and a ═ λnpWhen Δ p is 1-m, On=-m,O p1+ m, transformed sine and cosine similarity measure
Figure BDA0002785462370000094
And negative cosine similarity measure
Figure BDA0002785462370000095
Are respectively as
Figure BDA0002785462370000096
The loss of the convolutional classification network model is calculated with the original softmax cross-entropy framework, giving the softmax function $L_{CS}$ based on the positive and negative cosine similarity measures of the samples, as shown in formula (4):

$$L_{CS} = -\frac{1}{N}\sum_{k=1}^{N}\log\frac{e^{\tilde{s}_{ip}}}{e^{\tilde{s}_{ip}} + \sum_{j=1,\,j\neq i}^{C} e^{\tilde{s}_{jn}}} \qquad (4)$$

where N is the number of samples, m is the first adjustment parameter, a is the second adjustment parameter, and $\lambda_p$ is the scale factor corresponding to positive similarity.
The calculated cross-entropy loss $L_{CS}$ is used to train the network, yielding the network parameters of the CNN layers and of the FC layer, i.e. the network parameters of each layer of the deep convolutional neural network and the weight parameters of the output nodes of the fully connected layer.
The trained network is then used to process the acoustic sample data to be classified, and the classifier makes the judgment to obtain the classification result.
In a specific embodiment, the classification output layer based on the positive cosine similarity and negative cosine similarity softmax function can realize flexible adjustment of the shape of the decision boundary by adjusting parameters m and a. The principle is as follows:
for the sample feature x of class i, the decision boundary for determining that the sample belongs to class i and does not belong to class j can be determined by
Figure BDA0002785462370000101
It is derived that the decision boundary is determined to be:
Figure BDA0002785462370000102
wherein the content of the first and second substances,
Figure BDA0002785462370000103
when the parameters satisfy formula (2), formula (5) can be expressed as formula (6)
Figure BDA0002785462370000104
Similarly, for a sample feature x of class j, the decision boundary for judging that the sample belongs to class j and does not belong to class i is:

$$(s_{jp} - 1)^2 + a\,s_{in}^2 = (1 + a)\,m^2 \qquad (7)$$

where $s_{jp}$ is the cosine similarity of the feature to the class-j weight and $s_{in}$ is its cosine similarity to the class-i weight.
two classification decision boundary diagrams corresponding to the positive cosine similarity softmax loss function are given in fig. 4 and 5. There are two decision boundaries between any two classes, and the shape of the decision boundary is controlled by both m and a. Comparing fig. 4 and 5, it can be seen that m mainly controls the size of the decision area and a mainly controls the shape of the decision boundary.
As shown in fig. 4, when m is changed from 0.4 to 0.1, the decision area is continuously contracted in both horizontal and vertical directions with the decrease of m, the decision margin is continuously increased, and the discriminative ability of the learned features is continuously enhanced.
As shown in fig. 5, when a is changed from 3 to 1/3, the decision region becomes deeper in the vertical direction and narrower in the horizontal direction as a decreases.
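The behaviour described for figs. 4 and 5 can be reproduced numerically from formula (6); the sketch below simply evaluates the boundary curve for several parameter settings (the specific values follow the figures, the code itself is ours):

```python
import numpy as np

def boundary_sn(s_p, m, a):
    """Solve formula (6), (s_ip - 1)^2 + a * s_jn^2 = (1 + a) * m^2, for s_jn >= 0."""
    val = ((1 + a) * m ** 2 - (s_p - 1) ** 2) / a
    return np.sqrt(np.clip(val, 0.0, None))

s_p = np.linspace(0.0, 1.0, 201)
for m in (0.4, 0.3, 0.2, 0.1):                 # cf. fig. 4: the boundary shrinks as m decreases
    print(f"m={m}, a=3 -> max s_jn on boundary: {boundary_sn(s_p, m, 3.0).max():.3f}")
for a in (3.0, 2.0, 0.5, 1.0 / 3.0):           # cf. fig. 5: a reshapes the boundary
    print(f"m=0.4, a={a:.2f} -> max s_jn on boundary: {boundary_sn(s_p, 0.4, a).max():.3f}")
```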
Therefore, in practical application, parameter setting can be adjusted in a targeted manner according to the characteristic distribution during network convergence, so that the decision boundary is continuously close to the characteristic distribution.
Further, the method further comprises:
the method comprises the steps of dividing acoustic sample data into training samples, verification samples and test samples according to a preset dividing proportion.
The method extracts representation features of the acoustic scene based on a convolutional neural network (CNN), calculates the positive and negative cosine similarity measures between the representation feature of each acoustic signal sample and the representative class features, and the feature classification module classifies according to the magnitude of these similarities. During network training, the shape of the classification decision surface of the network can be adjusted by setting the parameters of the loss functions corresponding to the various classes of signal samples, which improves the classification accuracy and thus the performance of acoustic scene classification.
Table 1 shows the classification accuracy obtained with the CS-softmax loss function of the present invention on the development data set of the DCASE2019 ASC task, using the CNN9avg model (see Cross-task learning for audio tagging, sound event detection and spatial localization: DCASE 2019 baseline systems, Qiuqiang Kong, Yin Cao, Turab Iqbal, Yong Xu, Wenwu Wang, Mark D. Plumbley). The classification accuracy obtained with the original softmax loss is 70.3%. As can be seen from the results in the table, the classification accuracy varies as the three control parameters m, a and $\lambda_p$ are changed, and in most cases it is better than the accuracy obtained with the original softmax loss.
Table 1. Classification accuracy of the CS-softmax loss function in DCASE2019 acoustic scene classification (unit: %)
(The numerical results of Table 1 are provided as an image in the original publication.)
In the embodiment of the invention, the embedded representation used for classification is learned by deep network processing of the time-frequency feature samples corresponding to the acoustic scene data. The deep network is trained with the softmax function based on positive and negative cosine similarity, which improves the separability of the deep embedded representation. Using the learned deep network parameters, the performance of the acoustic scene classification task is effectively improved.
In the following, the flow of a CIFAR10 deep image classification model training method based on the positive-negative cosine similarity softmax function is introduced with reference to a specific implementation of the embodiment of the present invention, as shown in fig. 3. The time-frequency features of the data in acoustic scene classification can be regarded as images of the data on the time-frequency plane; therefore, network training and testing based on the positive-negative cosine similarity softmax function can also be carried out in an image embodiment. The present embodiment adopts CIFAR10 as the training and test data set. The method may include the following steps:
s201, dividing the image data into training samples and testing samples.
S202, constructing a feature extraction module of image data based on the deep convolutional neural network, inputting training data samples, and calculating low-dimensional representation features.
And S203, inputting the representation features of the image samples into the feature classification module, which takes the form of a fully connected layer, and calculating the positive and negative cosine similarity measures between the representation features and the weight of each output node of the fully connected layer.
S204, transforming the positive and negative similarity measures to obtain the improved positive and negative cosine similarity measures.
And S205, calculating to obtain the loss of the network based on the softmax cross entropy framework, training the network, and obtaining the optimal network parameter after the loss is converged.
And S206, classifying the image test data by using the network obtained by training.
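An illustrative data pipeline for this embodiment is sketched below with torchvision; the transform values and batch size are assumptions, and the feature extractor, cosine FC layer and CS-softmax loss sketched earlier are reused with 3 input channels and 10 classes:

```python
import torch
import torchvision
import torchvision.transforms as T

transform = T.Compose([T.ToTensor(),
                       T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)
```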
It should be noted that the specific calculation methods for these steps are the same as in the embodiment above and are not repeated here.
In the embodiment of the invention, the representation features used for classification are learned by deep convolutional network processing of samples such as audio and images, and classification is carried out with a fully connected layer. By using the positive-negative cosine similarity softmax loss function with the embedded adjustable parameters m and a, the shape of the classification decision boundary between different classes can be flexibly controlled, so that the learned representation features are better gathered within the boundary surfaces and the clustering property of the samples is improved. The representation features of the test samples are extracted with the learned deep network parameters and can be classified directly by the classification layer, which effectively improves the performance of multi-class classification tasks.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (9)

1. The acoustic scene classification method based on the improved softmax function is characterized by comprising the following steps: acquiring time-frequency characteristics of an acoustic signal sample; the time-frequency characteristics are used as the input of an acoustic scene classification model which is trained in advance, and the acoustic scene classification model is used for carrying out classification judgment on the time-frequency characteristics to obtain an acoustic scene classification result; wherein the acoustic scene classification model is trained by adopting an improved softmax function.
2. The acoustic scene classification method based on the modified softmax function of claim 1, wherein the acoustic scene classification model comprises a deep convolutional neural network and a fully connected layer; extracting acoustic representation features by adopting a deep convolutional neural network, and outputting the obtained acoustic representation features to the full-connection layer; and the full connection layer is used for judging the type of the acoustic representation characteristics and outputting an acoustic scene classification result.
3. The acoustic scene classification method based on the modified softmax function according to claim 1 or 2, wherein the training method of the acoustic scene classification model is as follows:
inputting time-frequency characteristics of an acoustic signal training sample;
extracting acoustic representation features by utilizing a deep convolutional neural network;
classifying the acoustic scenes by utilizing a fully connected layer according to the acoustic representation features; calculating the positive cosine similarity measure and the negative cosine similarity measure between the acoustic representation feature and the weight corresponding to each output node of the fully connected layer; calculating the cross-entropy loss of the improved softmax function based on the positive cosine similarity measure and the negative cosine similarity measure;
and training the acoustic scene classification model by using the cross entropy loss obtained by calculation to respectively obtain each layer of network parameters of the deep convolutional neural network and the weight parameters of the output nodes of the full connection layer.
4. The acoustic scene classification method based on the improved softmax function according to claim 3, wherein the specific method for calculating the positive cosine similarity measure and the negative cosine similarity measure between the acoustic representation feature and the weight corresponding to each output node of the fully connected layer is as follows:
the cosine similarity between the acoustic representation feature of the i-th class and the weight of the i-th output node of the fully connected layer is calculated to obtain the positive cosine similarity value $s_{ip}$; the cosine similarity between the acoustic representation feature of the i-th class and the weight of the j-th output node of the fully connected layer is calculated to obtain the negative cosine similarity value $s_{jn}$, i ≠ j; based on the obtained positive cosine similarity value $s_{ip}$ and negative cosine similarity value $s_{jn}$, the improved positive cosine similarity measure $\tilde{s}_{ip}$ and the improved negative cosine similarity measure $\tilde{s}_{jn}$ are obtained.
5. The acoustic scene classification method based on the improved softmax function according to claim 4, wherein the improved positive cosine similarity measure $\tilde{s}_{ip}$ and the improved negative cosine similarity measure $\tilde{s}_{jn}$ are obtained with formula (1):

$$\tilde{s}_{ip} = \lambda_p\,\alpha_p\,(s_{ip} - \Delta_p), \qquad \tilde{s}_{jn} = \lambda_n\,\alpha_n\,(s_{jn} - \Delta_n) \qquad (1)$$

wherein $\lambda_p$ is the scale factor corresponding to positive similarity and $\lambda_n$ is the scale factor corresponding to negative similarity; $\alpha_p$ is the weight update factor corresponding to positive similarity, $\alpha_n$ is the weight update factor for negative similarity, $\Delta_p$ is the margin factor for positive similarity, and $\Delta_n$ is the margin factor for negative similarity.
6. The acoustic scene classification method based on the improved softmax function according to claim 1, wherein the improved softmax function is expressed as follows:

$$L_{CS} = -\frac{1}{N}\sum_{k=1}^{N}\log\frac{e^{\tilde{s}_{ip}}}{e^{\tilde{s}_{ip}} + \sum_{j=1,\,j\neq i}^{C} e^{\tilde{s}_{jn}}}, \qquad \tilde{s}_{ip} = \lambda_p\,(1+m-s_{ip})(s_{ip}-1+m), \quad \tilde{s}_{jn} = a\,\lambda_p\,(s_{jn}+m)(s_{jn}-m)$$

wherein N is the number of samples, m is the first adjustment parameter, a is the second adjustment parameter, $\lambda_p$ is the third adjustment parameter, C is the number of acoustic scene categories, $s_{ip}$ is the positive cosine similarity value, $s_{jn}$ is the negative cosine similarity value, $\tilde{s}_{ip}$ is the improved positive cosine similarity measure, $\tilde{s}_{jn}$ is the improved negative cosine similarity measure, and $\lambda_p$ is the scale factor corresponding to positive similarity.
7. The method for classifying an acoustic scene based on the modified softmax function according to claim 6, wherein the decision boundary of the acoustic scene classification is changed by adjusting the first adjustment parameter m and the second adjustment parameter a.
8. The acoustic scene classification method based on the improved softmax function according to claim 1, wherein the obtained acoustic representation features are normalized and then output to the fully connected layer.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202011296395.6A 2020-11-18 2020-11-18 Acoustic scene classification method based on improved softmax function Active CN112447188B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011296395.6A CN112447188B (en) 2020-11-18 2020-11-18 Acoustic scene classification method based on improved softmax function

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011296395.6A CN112447188B (en) 2020-11-18 2020-11-18 Acoustic scene classification method based on improved softmax function

Publications (2)

Publication Number Publication Date
CN112447188A true CN112447188A (en) 2021-03-05
CN112447188B CN112447188B (en) 2023-10-20

Family

ID=74737165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011296395.6A Active CN112447188B (en) 2020-11-18 2020-11-18 Acoustic scene classification method based on improved softmax function

Country Status (1)

Country Link
CN (1) CN112447188B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5390285A (en) * 1990-11-08 1995-02-14 British Telecommunications Public Limited Company Method and apparatus for training a neural network depending on average mismatch
US9824692B1 (en) * 2016-09-12 2017-11-21 Pindrop Security, Inc. End-to-end speaker recognition using deep neural network
US20190130275A1 (en) * 2017-10-26 2019-05-02 Magic Leap, Inc. Gradient normalization systems and methods for adaptive loss balancing in deep multitask networks
CN109829377A (en) * 2018-12-28 2019-05-31 河海大学 A kind of pedestrian's recognition methods again based on depth cosine metric learning
US20190303754A1 (en) * 2018-03-28 2019-10-03 University Of Maryland, College Park L2 constrained softmax loss for discriminative face verification
CN110659378A (en) * 2019-09-07 2020-01-07 吉林大学 Fine-grained image retrieval method based on contrast similarity loss function
WO2020023585A1 (en) * 2018-07-26 2020-01-30 Med-El Elektromedizinische Geraete Gmbh Neural network audio scene classifier for hearing implants
CN111462755A (en) * 2020-03-03 2020-07-28 深圳壹账通智能科技有限公司 Information prompting method and device, electronic equipment and medium
CN111723675A (en) * 2020-05-26 2020-09-29 河海大学 Remote sensing image scene classification method based on multiple similarity measurement deep learning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5390285A (en) * 1990-11-08 1995-02-14 British Telecommunications Public Limited Company Method and apparatus for training a neural network depending on average mismatch
US9824692B1 (en) * 2016-09-12 2017-11-21 Pindrop Security, Inc. End-to-end speaker recognition using deep neural network
US20190130275A1 (en) * 2017-10-26 2019-05-02 Magic Leap, Inc. Gradient normalization systems and methods for adaptive loss balancing in deep multitask networks
US20190303754A1 (en) * 2018-03-28 2019-10-03 University Of Maryland, College Park L2 constrained softmax loss for discriminative face verification
WO2020023585A1 (en) * 2018-07-26 2020-01-30 Med-El Elektromedizinische Geraete Gmbh Neural network audio scene classifier for hearing implants
CN109829377A (en) * 2018-12-28 2019-05-31 河海大学 A kind of pedestrian's recognition methods again based on depth cosine metric learning
CN110659378A (en) * 2019-09-07 2020-01-07 吉林大学 Fine-grained image retrieval method based on contrast similarity loss function
CN111462755A (en) * 2020-03-03 2020-07-28 深圳壹账通智能科技有限公司 Information prompting method and device, electronic equipment and medium
CN111723675A (en) * 2020-05-26 2020-09-29 河海大学 Remote sensing image scene classification method based on multiple similarity measurement deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KUN YAO et al.: "Acoustic scene classification based on additive margin softmax", IEEE *
李一野; 邓浩江: "Collaborative filtering recommendation algorithm based on improved cosine similarity" (基于改进余弦相似度的协同过滤推荐算法), 计算机与现代化 (Computer and Modernization), no. 01 *
王振宇等: "Mispronunciation detection for second-language speakers based on acoustic phoneme embeddings and Siamese networks" (基于声学音素向量和孪生网络的二语者发音偏误确认), 中文信息学报 (Journal of Chinese Information Processing), vol. 33, no. 04 *

Also Published As

Publication number Publication date
CN112447188B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN109949317A (en) Based on the semi-supervised image instance dividing method for gradually fighting study
JP6798614B2 (en) Image recognition device, image recognition method and image recognition program
KR101780676B1 (en) Method for learning rejector by forming classification tree in use of training image and detecting object in test image by using the rejector
CN105678231A (en) Pedestrian image detection method based on sparse coding and neural network
CN104795064A (en) Recognition method for sound event under scene of low signal to noise ratio
CN107644032A (en) Outlier detection method and apparatus
CN110738132B (en) Target detection quality blind evaluation method with discriminant perception capability
CN109559758A (en) A method of texture image is converted by haptic signal based on deep learning
Koluguri et al. Spectrogram enhancement using multiple window Savitzky-Golay (MWSG) filter for robust bird sound detection
CN113762049B (en) Content identification method, content identification device, storage medium and terminal equipment
CN112348360B (en) Chinese medicine production process parameter analysis system based on big data technology
CN109902692A (en) A kind of image classification method based on regional area depth characteristic coding
CN112447188A (en) Acoustic scene classification method based on improved softmax function
CN109753922A (en) Anthropomorphic robot expression recognition method based on dense convolutional neural networks
Zhao et al. Learning saliency features for face detection and recognition using multi-task network
CN112215112A (en) Method and system for generating neural network model for hand motion recognition
Chen et al. An intelligent nocturnal animal vocalization recognition system
CN115481685A (en) Radiation source individual open set identification method based on prototype network
CN109767545B (en) Method and system for classifying defects of valuable bills
CN109409381A (en) The classification method and system of furniture top view based on artificial intelligence
JP7341962B2 (en) Learning data collection device, learning device, learning data collection method and program
CN112949385B (en) Water surface target detection and identification method based on optical vision
CN109344881B (en) Extended classifier based on space-time continuity
CN106327494A (en) Pavement crack image automatic detection method
KR101094433B1 (en) Method for identifying image face and system thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant