CN111723874B - Sound field scene classification method based on width and depth neural network - Google Patents

Sound field scene classification method based on width and depth neural network Download PDF

Info

Publication number
CN111723874B
CN111723874B
Authority
CN
China
Prior art keywords
neural network
layer
network
output
joint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010624687.1A
Other languages
Chinese (zh)
Other versions
CN111723874A (en)
Inventor
黄张金
李艳雄
张文浩
林子珩
陈奕纯
谭煜枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010624687.1A priority Critical patent/CN111723874B/en
Publication of CN111723874A publication Critical patent/CN111723874A/en
Application granted granted Critical
Publication of CN111723874B publication Critical patent/CN111723874B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

The invention discloses a sound scene classification method based on width and depth neural networks, which comprises the following steps: firstly, extracting logarithmic Mel spectrum features from the sound scene audio samples and dividing them into a training set and a test set; designing a width neural network and a deep joint probability network; pre-training the two networks with the logarithmic Mel spectrum features of each training audio sample as input; constructing a joint discrimination classification tree model according to the pre-training results, then training and optimizing it; and finally, inputting the logarithmic Mel spectrum features of each test audio sample into the joint discrimination classification tree model to identify the sound scene corresponding to each audio sample. The joint discrimination classification tree model constructed by the invention compensates for the poor generalization ability and weak stability of a single network, and improves sound scene classification by exploiting the complementary strengths of the width neural network and the deep neural network.

Description

Sound field scene classification method based on width and depth neural network
Technical Field
The invention belongs to the technical field of machine hearing, relates to width learning and deep learning techniques, and in particular relates to a sound scene classification method based on width and depth neural networks.
Background
Daily activities involve a variety of different sound events, and combinations of these events constitute a variety of different sound scenes. Sound scene classification technology has a wide range of applications, such as audio surveillance, multimedia retrieval, driving assistance and smart homes.
When a width neural network alone is applied to sound scene classification, its accuracy is hard to improve beyond a certain level and struggles to meet practical requirements. Most existing sound scene classification methods are based on deep neural networks, whose drawback is an overly long training time. In fact, the width neural network can reach high accuracy for some sound scene categories but performs poorly on others, so its overall accuracy stops improving after a certain point.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a sound scene classification method based on width and depth neural networks. It introduces the width network into sound scene classification to reduce the training time of the deep neural network, and hence of the whole classification network, and combines the width neural network and the deep joint probability network in the form of a classification tree to improve classification accuracy. The invention thus improves training efficiency while preserving the accuracy of the sound scene classification network.
The aim of the invention can be achieved by adopting the following technical scheme:
a sound scene classification method based on a width and depth neural network comprises the following steps:
S1, establishing an audio data set: extracting logarithmic Mel spectrum features from the sound scene audio samples, and dividing them into a training set and a test set according to a proportion;
S2, constructing a width neural network: establishing a feature mapping layer and an enhancement layer, wherein the feature mapping layer and the enhancement layer perform feature mapping on an input sample, the mapped features are combined in parallel to form an input layer, and the input layer is connected with an output layer through a weight matrix;
S3, constructing a deep joint probability network: establishing a one-dimensional convolutional neural network and a long short-term memory network respectively, and then combining them into a deep joint probability network by taking a weighted average of their output probabilities;
S4, constructing a joint discrimination classification tree model: constructing a joint discrimination classification tree model according to the preliminary training results of the width neural network and the deep joint probability network, training and adjusting parameters of the joint discrimination classification tree model until the model converges, and obtaining a trained joint discrimination classification tree model;
S5, sound scene identification: inputting the logarithmic Mel spectrum features of the test audio samples into the trained joint discrimination classification tree model to obtain the sound scene categories of the test audio samples.
Further, the step S1 is as follows:
S1.1, acquiring audio data of sound scenes by using recording equipment or Internet resources, converting the sampling rate and quantization precision of the audio samples into a uniform format, and labeling the sound scene category to which each audio sample belongs;
S1.2, extracting logarithmic Mel spectrum features from the audio samples and carrying out mean normalization;
S1.3, randomly dividing the experimental data into mutually disjoint training and test sets, with the training set accounting for about 70% and the test set for about 30%.
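Steps S1.2–S1.3 can be sketched as follows (a minimal illustration assuming librosa is available; the 20 Mel bands match the embodiment below, while the frame parameters are library defaults rather than values taken from the disclosure):

```python
import numpy as np
import librosa

def log_mel_feature(wav_path, sr=16000, n_mels=20):
    """Extract a mean-normalized logarithmic Mel spectrum from one audio sample."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                     # logarithmic Mel spectrum
    return log_mel - log_mel.mean(axis=1, keepdims=True)   # mean normalization

def split_dataset(features, labels, train_ratio=0.7, seed=0):
    """Random, mutually disjoint ~70% / ~30% train/test split of NumPy arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(features))
    cut = int(train_ratio * len(features))
    return (features[idx[:cut]], labels[idx[:cut]],
            features[idx[cut:]], labels[idx[cut:]])
```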
Further, the step S2 is as follows:
S2.1, establishing a feature mapping layer: the feature mapping layer consists of N_1 feature windows, each containing N_2 feature nodes; N_1 and N_2 are chosen according to the number of features actually input to the width neural network, such that N_1×N_2 is approximately half the number of input features;
S2.2, establishing an enhancement layer with N_3 enhancement nodes, where (N_1×N_2) > N_3;
S2.3, the feature mapping layer performs feature mapping on the input samples. Let the sample set input to the width neural network be D_1, where the number of samples is c and the number of features per sample is f; a feature value equal to 1 is appended to each sample to obtain the augmented sample set D_2, so the number of features per sample becomes f+1. For each feature window a random weight matrix W_e is generated; W_e is an (f+1)×N_2 matrix whose values follow a Gaussian distribution with mean 0 and variance 1. A new feature matrix A_1 = D_2 × W_e is generated, with dimension c×N_2. A_1 is normalized and given a sparse representation, and the sparse matrix W = D_2^{-1} × A_1 is then solved, where D_2^{-1} denotes the (pseudo-)inverse of D_2. The feature nodes of one window are finally generated as T_1 = normal(D_2 × W), where normal() denotes normalization; the resulting T_1 has dimension c×N_2. Feature nodes are generated for all N_1 feature windows, finally giving the feature map y_b of the feature mapping layer, with dimension c×N_1×N_2;
S2.4, performing feature mapping for the enhancement layer: an orthogonal normalized weight matrix W_h of dimension (N_1×N_2)×N_3 is randomly generated, and the feature map of the enhancement layer is obtained from y_b as T_2 = tansig(y_b × W_h), where tansig() is the activation function of the neural network; the resulting feature map T_2 has dimension c×N_3;
S2.5, combining the mapped features in parallel to form the input layer: the feature maps y_b and T_2 are combined in parallel to obtain the input layer X = [y_b, T_2], so the feature dimension of each sample is N_2×N_1+N_3;
S2.6, connecting the input layer and the output layer through a weight matrix: the output Y of the output layer is the one-hot vector of the sound scene category labels, with dimension c×n_B, where n_B is the number of classification categories of the samples input to the width neural network; Y = X×W_B, where W_B is the weight matrix obtained by training the width neural network, with dimension (N_2×N_1+N_3)×n_B.
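A compact NumPy sketch of the forward construction in S2.3–S2.6 is given below; it is an illustration, not the disclosed training procedure: the sparse solution and the output weights W_B are both approximated here by a pseudo-inverse, and the normal() normalization of the feature nodes is simplified.

```python
import numpy as np

def tansig(x):                          # tan-sigmoid activation used in S2.4
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def build_width_network(D1, Y, N1=50, N2=80, N3=1500):
    """D1: c x f samples, Y: c x n_B one-hot labels; requires N1*N2 > N3."""
    c = D1.shape[0]
    D2 = np.hstack([D1, np.ones((c, 1))])            # append a feature equal to 1
    windows = []
    for _ in range(N1):                              # N1 feature windows
        We = np.random.randn(D2.shape[1], N2)        # Gaussian(0, 1) random weights
        A1 = D2 @ We                                 # c x N2 feature matrix
        A1 = (A1 - A1.mean(0)) / (A1.std(0) + 1e-8)  # normalize A1
        W = np.linalg.pinv(D2) @ A1                  # stand-in for the sparse solution
        windows.append(D2 @ W)                       # feature nodes of one window
    yb = np.hstack(windows)                          # feature mapping layer, c x (N1*N2)
    Wh, _ = np.linalg.qr(np.random.randn(N1 * N2, N3))   # orthogonal normalized weights
    T2 = tansig(yb @ Wh)                             # enhancement layer, c x N3
    X = np.hstack([yb, T2])                          # input layer [y_b, T_2]
    WB = np.linalg.pinv(X) @ Y                       # output weights fitted to labels
    return WB
```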
Further, the step S3 is as follows:
S3.1, constructing a one-dimensional convolutional neural network: the network consists of two or more one-dimensional convolutional layers, a fully-connected layer and a Softmax classification output layer in cascade; each one-dimensional convolutional layer is activated by a nonlinear activation function and then max-pooled before output. Let d^(l-1) and d^(l) denote the input and output of the l-th convolutional layer respectively; the input of the l-th convolutional layer is the output of the (l-1)-th convolutional layer. Since the l-th convolutional layer has several feature maps, consider one of them, d_i^(l); the output of the convolutional layer is expressed as:
d_i^(l) = fun(w_i^(l) * d^(l-1))
where * denotes the convolution operation, w_i^(l) denotes the kernel weight of the l-th layer, and fun() is a nonlinear activation function; the final output of the l-th layer is expressed as:
d^(l) = maxpooling(d_i^(l))
where maxpooling() denotes max pooling, whose result is used as the input of the next layer. The output of the last convolutional layer is passed through the fully-connected layer and finally through the Softmax classification output layer, which outputs the probability matrix y_c of the audio samples belonging to the different sound scene categories;
S3.2, constructing a long short-term memory network: the network consists of two long short-term memory layers and a Softmax classification output layer in cascade; a Dropout layer may optionally be added after each long short-term memory layer. For each long short-term memory layer, given an input sequence x = (x_1, ..., x_T), the layer uses the hidden vector sequence h = (h_1, ..., h_T) to generate the output y = (y_1, ..., y_T) from iteration t = 1 to T:
o_t = σ(W_o[h_{t-1}, x_t] + b_o)
h_t = o_t × tanh(C_t)
y_t = W_hy h_t + b_y
where W_o and b_o respectively denote the weight matrix and bias vector from the input layer to the hidden layer of the long short-term memory, σ() denotes the Sigmoid activation function, o_t and C_t respectively denote the output gate and the cell activation vector, h_t is the intermediate hidden-layer variable of the long short-term memory network, and W_hy and b_y respectively denote the weight matrix and bias vector from the hidden layer to the output layer of the long short-term memory network. Finally, through the Softmax classification output layer, the long short-term memory network outputs the probability matrix y_l of each audio sample belonging to the different sound scene categories;
S3.3, the one-dimensional convolutional neural network and the long short-term memory network are combined into a deep joint probability network by taking a weighted average of their output probabilities, expressed as:
y_a = w_c·y_c + w_l·y_l
where w_c and w_l respectively denote the weights of the one-dimensional convolutional neural network and the long short-term memory network, y_c denotes the probability matrix output by the one-dimensional convolutional neural network for the audio samples, y_l denotes the probability matrix output by the long short-term memory network, and y_a denotes the probability matrix output by the deep joint probability network. The classification result y_result_k finally output by the deep joint probability network takes the sound scene category corresponding to the node with the maximum output probability:
y_result_k = argmax(y_a), 1 ≤ k ≤ K
where argmax() takes the index corresponding to the maximum probability value and K is the number of samples input to the deep joint probability network.
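The fusion in S3.3 reduces to a few lines, assuming y_c and y_l are K×n probability matrices already produced by the two sub-networks (the default weights follow the values used later in the embodiment):

```python
import numpy as np

def deep_joint_probability(y_c, y_l, w_c=0.7, w_l=0.3):
    """Weighted average of the two output probability matrices (S3.3)."""
    y_a = w_c * y_c + w_l * y_l          # joint probability matrix, K x n
    y_result = np.argmax(y_a, axis=1)    # class with maximum probability per sample
    return y_a, y_result
```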
Further, the step S4 is as follows:
S4.1, performing preliminary training on the width neural network constructed in step S2 and the deep joint probability network constructed in step S3 with the training set divided in step S1; the per-class classification accuracies of the width neural network are obtained and the classes are ranked from high to low (the higher the accuracy, the higher the rank), while the deep joint probability network obtains its pre-training weights;
S4.2, taking the width neural network and the deep joint probability network as the nodes of a classification tree, and setting the number BN of width neural network nodes and the number DN of deep joint probability network nodes, where BN is obtained from n and a by a rounding-down (floor) operation and DN = 1; here n denotes the number of classes of the entire audio data set and a is a hyper-parameter denoting the number of sub-classes the width neural network can split off, taking integer values within a range determined by n;
S4.3, constructing the joint discrimination classification tree model: the model consists of BN width neural network nodes and DN deep joint probability network nodes; all input samples first pass through the width neural network nodes, which separate out the BN×(a-1) classes with the highest accuracy, and the input samples of the remaining n-BN×(a-1) classes are then classified by the deep joint probability network node. The process is as follows:
S4.3.1, the joint discrimination classification tree model takes a width neural network as a branch node and extends downwards; after passing through the branch node, the data to be classified is output as width-sensitive class 1, width-sensitive class 2, ..., width-sensitive class a-1, or the width-insensitive class, where a is as defined in S4.2, i.e. the outputs of a branch node are formed by these classification results; if the classification result is a sensitive class, the result is output directly, and if the classification result is the width-insensitive class, the data to be classified is fed to the next node;
S4.3.2, if the number of width neural network nodes in the joint discrimination classification tree model has not reached BN, the model continues to extend downwards by repeating step S4.3.1; once the number of width neural network nodes reaches BN, step S4.3.3 is performed;
S4.3.3, the joint discrimination classification tree model takes the deep joint probability network as the last node of the classification tree; it receives as input the width-insensitive classes finally output by the BN width neural network nodes and produces the classification output for these width-insensitive classes, so that the terminal branches of the tree yield all the classes;
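The routing of S4.3.1–S4.3.3 can be sketched as follows; the node interface is an assumption, with each width node paired with the set of sensitive classes it was trained to separate:

```python
def classify_with_tree(x, width_nodes, depth_node):
    """Route one sample through the joint discrimination classification tree.

    width_nodes: list of (predict_fn, sensitive_classes) for the BN width nodes,
                 in tree order; predict_fn returns a class label.
    depth_node:  predict_fn of the deep joint probability network (last node).
    """
    for predict, sensitive_classes in width_nodes:   # BN width nodes in order
        label = predict(x)
        if label in sensitive_classes:               # width-sensitive class: output directly
            return label
        # otherwise width-insensitive: pass the sample on to the next node
    return depth_node(x)                             # last node decides the remaining classes
```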
S4.4, during training, the two types of classification tree nodes are added gradually until the set number of nodes is reached, with the average overall accuracy ACC and the loss function L_deep of the deep joint probability network used as supervisory signals:
ACC = (1/n) × Σ_{i=1}^{n} acc_i
where acc_i denotes the accuracy of the i-th class of the audio data set after passing through the whole joint discrimination classification tree model, and n denotes the number of classes of the entire audio data set;
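The ACC supervisory signal can be computed directly from per-class results, for example (a small sketch; the per-class correct counts and totals are assumed to be gathered over the whole tree):

```python
import numpy as np

def average_overall_accuracy(correct_per_class, total_per_class):
    """ACC = (1/n) * sum_i acc_i over the n sound scene classes."""
    acc_i = np.asarray(correct_per_class) / np.asarray(total_per_class)
    return float(acc_i.mean())

# e.g. for 9 classes with hypothetical per-class results:
# average_overall_accuracy([90, 80, 70, 95, 60, 88, 77, 66, 99], [100] * 9)
```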
S4.5, finally training and optimizing the joint discrimination classification tree model: a grid search is used to find suitable values of a, w_c and w_l; each time a value of a is selected, the structure of the joint discrimination classification tree model changes accordingly, and the width neural network and the deep joint probability network in the model are retrained on the input and output data corresponding to the changed structure; the whole joint discrimination classification tree model is trained and optimized so that ACC is maximized and the loss of the deep joint probability network, measured by the cross-entropy function L_deep, is minimized; the trained joint discrimination classification tree model is obtained after optimization.
Further, the loss of the deep joint probability network uses a cross-entropy loss function L_deep defined as:
L_deep = -(1/K) × Σ_{k=1}^{K} y_k log(ŷ_k)
where K denotes the number of input samples, y_k denotes the true label of the k-th sample, and ŷ_k denotes the predicted label of the k-th sample.
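The grid search of S4.5 can be sketched as follows; build_tree, train_tree and evaluate_acc are hypothetical placeholders for the construction, training and ACC evaluation steps described above, and tying w_l to 1 − w_c is a simplifying assumption:

```python
import itertools

def grid_search(train_set, valid_set, a_values, wc_values):
    best, best_acc = None, -1.0
    for a, w_c in itertools.product(a_values, wc_values):
        w_l = 1.0 - w_c                       # assumed complementary sub-network weights
        tree = build_tree(a, w_c, w_l)        # structure changes with each choice of a
        train_tree(tree, train_set)           # retrain width and deep networks
        acc = evaluate_acc(tree, valid_set)   # average overall accuracy ACC
        if acc > best_acc:
            best, best_acc = (a, w_c, w_l), acc
    return best, best_acc
```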
Compared with the prior art, the invention has the following advantages and effects:
1) The invention uses two types of neural networks as base classifiers, with different base classifiers distinguishing different categories of sound scenes, so the number and structure of the networks can be adaptively adjusted according to the number of sound scene categories, making optimal use of computing resources.
2) The invention uses a width neural network, which has the advantage of incremental learning: the network model can be updated dynamically, the network structure can be adjusted quickly when the training data change, and training can be completed quickly.
3) The invention combines the two deep networks by a probability-weighted-average method built on the training results of the width neural network, further improving classification accuracy; at the same time, the deep neural network pre-adjusts its parameters with the help of the width neural network's weights, which speeds up convergence of the deep neural network model and shortens training time.
4) The overall classification model can be flexibly adjusted and optimized according to the training results of the two types of networks, and can satisfy the classification requirements of a wide variety of sound scenes to the greatest extent.
5) The classifier structure can serve as a general classification framework and be applied to other classification scenarios, accelerating training and improving classification accuracy.
Drawings
FIG. 1 is a flow diagram of a method for classifying sound scenes based on a width and depth neural network disclosed in an embodiment of the invention;
FIG. 2 is a schematic diagram of a wide neural network architecture in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a joint discriminant classification tree model in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a deep joint probability network structure in an embodiment of the invention;
FIG. 5 is a schematic diagram of another joint discriminant classification tree model in accordance with an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
The embodiment discloses a sound scene classification method based on a width and depth neural network, and a flow chart of the sound scene classification method based on the width and depth neural network is shown in fig. 1, and the method comprises the following specific steps:
S1, establishing an audio data set:
S1.1, the DCASE 2018 Task5 public data set is adopted as the audio samples of the audio data set; it consists of sound events continuously recorded for one week in a home environment, with 9 sound scene classes and 72984 audio samples; each sample is 10 s long, the sampling rate is 16 kHz, and the quantization precision is uniform;
S1.2, 20-dimensional logarithmic Mel spectrum features are extracted for each audio sample in the audio data set, i.e. each audio sample corresponds to a feature map of 20×399 pixels; each pixel of the feature map is mean-normalized, the map is converted into a one-dimensional feature vector of length 20×399 = 7980, and the sound scene category label corresponding to each feature map is given;
S1.3, the established audio data set thus contains 9 sound scene classes, i.e. 72984 feature maps in total; 51000 feature maps are randomly selected from the data set as the training set and 21984 as the test set.
S2, constructing a width neural network:
S2.1, establishing a feature mapping layer:
The number of feature windows of the feature mapping layer is set to N_1 = 50, and the number of feature nodes in each feature window is N_2 = 80;
S2.2, establishing an enhancement layer:
The number of enhancement nodes of the enhancement layer is set to N_3 = 1500, satisfying the condition (N_1×N_2) > N_3;
S2.3, calculating the mapping from the input sample to the feature layer:
The audio sample data set input to the width neural network is set as D_1, where the number of samples is c and the number of features per sample is f = 7980; a feature value equal to 1 is appended to each sample to obtain the augmented sample set D_2, so the number of features per sample becomes 7981. For each of the 50 feature windows a random weight matrix W_e is generated; W_e is a 7981×N_2 = 7981×80 two-dimensional matrix whose values follow a Gaussian distribution with mean 0 and variance 1. A new feature matrix A_1 = D_2 × W_e is generated, with dimension c×80. A_1 is normalized and given a sparse representation, and the sparse matrix W = D_2^{-1} × A_1 is then solved, where D_2^{-1} denotes the (pseudo-)inverse of D_2. The feature nodes of one feature window are finally generated as T_1 = normal(D_2 × W), with dimension c×80; feature nodes are generated for the 50 feature windows, finally giving the feature map y_b with dimension c×50×80;
S2.4, calculating the mapping from the input samples to the enhancement layer:
An orthogonal normalized weight matrix W_h of dimension (50×80)×1500 is randomly generated, and a new feature map T_2 = tansig(y_b × W_h) is obtained from the y_b of S2.3, where c is the same as in S2.3 and tansig() is a commonly used activation function in neural networks; the resulting T_2 has dimension c×1500;
S2.5, combining the mapped features in parallel to form an input layer:
The feature map y_b obtained in S2.3 and the feature map T_2 obtained in S2.4 are combined in parallel to obtain the input layer X = [y_b, T_2], so the feature dimension of each sample is 80×50+1500;
S2.6, the input layer is connected to the output layer through a weight matrix; the output Y of the output layer is the one-hot vector of the sound scene category labels, with dimension c×n_B, where n_B is the number of classification categories of the samples input to the width neural network; Y = X×W_B, where W_B is the weight matrix to be trained by the width neural network, with dimension (80×50+1500)×n_B.
S3, constructing a deep joint probability network:
S3.1, constructing a one-dimensional convolutional neural network: the network adopts a sequentially connected 5-layer structure consisting of 3 one-dimensional convolutional layers, a fully-connected layer and a Softmax classification output layer in cascade, where each convolutional layer is activated by a nonlinear activation function and max-pooled before output. Let d^(l-1) and d^(l) denote the input and output of the l-th convolutional layer respectively; the input of the l-th convolutional layer is the output of the (l-1)-th convolutional layer. Since the l-th convolutional layer has several feature maps, consider one of them, d_i^(l); the output of the convolutional layer is expressed as:
d_i^(l) = fun(w_i^(l) * d^(l-1))
where * denotes the convolution operation, w_i^(l) denotes the kernel weight of the l-th layer, and fun() is a nonlinear activation function; the final output of the l-th layer is expressed as:
d^(l) = maxpooling(d_i^(l))
where maxpooling() denotes max pooling, whose result is used as the input of the next layer. The output of the last convolutional layer is passed through the fully-connected layer and finally through the Softmax classification output layer, which outputs the probability matrix y_c of the audio samples belonging to the different sound scene categories. In this embodiment, the first one-dimensional convolutional layer has 128 convolution kernels, a kernel window length of 100, a convolution stride of 1, a ReLU activation function and a max-pooling window size of 2; the second one-dimensional convolutional layer has 128 convolution kernels, a kernel window length of 30, a convolution stride of 1, a ReLU activation function and a max-pooling window size of 2; the third one-dimensional convolutional layer has 128 convolution kernels, a kernel window length of 15, a convolution stride of 1, a ReLU activation function and a max-pooling window size of 2; the fourth layer is a fully-connected layer serving as the transition from the convolutional layers to the Softmax classification output layer, with an output dimension of 128 and a dropout ratio of 50%; the fifth layer is the Softmax classification output layer, which outputs the probability matrix y_c of the audio samples belonging to the different sound scene categories.
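Under the hyper-parameters listed above, the one-dimensional convolutional branch could be sketched in Keras roughly as follows; the input shape of (7980, 1) and the ReLU on the fully-connected layer are assumptions based on this embodiment, not values stated in the disclosure:

```python
from tensorflow.keras import layers, models, optimizers

def build_cnn(n_classes=9, input_len=7980):
    m = models.Sequential([
        layers.Input(shape=(input_len, 1)),
        layers.Conv1D(128, 100, strides=1, activation='relu'),  # 1st conv layer
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 30, strides=1, activation='relu'),   # 2nd conv layer
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 15, strides=1, activation='relu'),   # 3rd conv layer
        layers.MaxPooling1D(2),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),                   # fully-connected transition
        layers.Dropout(0.5),                                     # dropout ratio 50%
        layers.Dense(n_classes, activation='softmax'),           # probability matrix y_c
    ])
    m.compile(optimizer=optimizers.Adam(learning_rate=2e-4),
              loss='categorical_crossentropy', metrics=['accuracy'])
    return m
```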
S3.2, constructing a long short-term memory network: the network adopts a sequentially connected 4-layer structure consisting of two long short-term memory layers, a Dropout layer and a Softmax classification output layer in cascade. For each CuDNN long short-term memory layer, given an input sequence x = (x_1, ..., x_T), the layer uses the hidden vector sequence h = (h_1, ..., h_T) to generate the output y = (y_1, ..., y_T) from iteration t = 1 to T:
o_t = σ(W_o[h_{t-1}, x_t] + b_o)
h_t = o_t × tanh(C_t)
y_t = W_hy h_t + b_y
where W_o and b_o respectively denote the weight matrix and bias vector from the input layer to the hidden layer of the long short-term memory, σ() denotes the Sigmoid activation function, o_t and C_t respectively denote the output gate and the cell activation vector, h_t is the intermediate hidden-layer variable of the long short-term memory network, and W_hy and b_y respectively denote the weight matrix and bias vector from the hidden layer to the output layer of the long short-term memory network. Finally, through the Softmax classification output layer, the long short-term memory network outputs the probability matrix y_l of each audio sample belonging to the different sound scene categories.
In this embodiment, CuDNN is used to accelerate the training process. The first layer is a CuDNN long short-term memory layer with an output dimension of 64 that returns all output sequences; the second is a CuDNN long short-term memory layer with an output dimension of 64; a Dropout layer is added as the third layer after the second long short-term memory layer, with the probability of randomly disconnecting input neurons set to 50% to prevent overfitting; the fourth layer is the Softmax classification output layer, which outputs the probability matrix y_l of each audio sample belonging to the different sound scene categories.
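The long short-term memory branch of this embodiment might look roughly like this in Keras (the sequence input shape of 399 frames × 20 Mel bands is an assumption; recent Keras LSTM layers use the cuDNN kernel automatically when it is available):

```python
from tensorflow.keras import layers, models, optimizers

def build_lstm(n_classes=9, time_steps=399, n_mels=20):
    m = models.Sequential([
        layers.Input(shape=(time_steps, n_mels)),
        layers.LSTM(64, return_sequences=True),  # 1st LSTM layer, returns all sequences
        layers.LSTM(64),                         # 2nd LSTM layer
        layers.Dropout(0.5),                     # randomly drop 50% to prevent overfitting
        layers.Dense(n_classes, activation='softmax'),   # probability matrix y_l
    ])
    m.compile(optimizer=optimizers.Adam(learning_rate=1e-5),
              loss='categorical_crossentropy', metrics=['accuracy'])
    return m
```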
S3.3, the one-dimensional convolutional neural network and the long short-term memory network are combined into a deep joint probability network by taking a weighted average of their output probabilities, expressed as:
y_a = w_c·y_c + w_l·y_l
where w_c and w_l respectively denote the weights of the one-dimensional convolutional neural network and the long short-term memory network, y_c denotes the probability matrix output by the one-dimensional convolutional neural network in S3.1, y_l denotes the probability matrix output by the long short-term memory network in S3.2, and y_a denotes the probability matrix output by the deep joint probability network. The classification result y_result_k finally output by the deep joint probability network takes the sound scene category corresponding to the node with the maximum output probability:
y_result_k = argmax(y_a), 1 ≤ k ≤ K
where argmax() takes the index corresponding to the maximum probability value and K is the number of samples input to the deep joint probability network.
In this embodiment, the weight of the one-dimensional convolutional neural network is w_c = 0.7 and the weight of the long short-term memory network is w_l = 0.3; the one-dimensional convolutional neural network uses an Adam optimizer with a learning rate of 0.0002, a batch size of 256 and 1000 training epochs; the long short-term memory network uses an Adam optimizer with a learning rate of 0.00001, a batch size of 50 and 500 iterations.
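Putting the two branches together with the training settings of this embodiment gives a usage sketch like the following, where X_train_cnn, X_train_seq, Y_train and the test arrays are hypothetical prepared data and build_cnn, build_lstm and deep_joint_probability refer to the sketches above:

```python
# one-dimensional CNN branch: Adam, learning rate 0.0002, batch size 256, 1000 epochs
cnn = build_cnn()
cnn.fit(X_train_cnn, Y_train, batch_size=256, epochs=1000)

# LSTM branch: Adam, learning rate 0.00001, batch size 50, 500 iterations
lstm = build_lstm()
lstm.fit(X_train_seq, Y_train, batch_size=50, epochs=500)

# fuse the two probability matrices with weights 0.7 and 0.3
y_a, y_pred = deep_joint_probability(cnn.predict(X_test_cnn),
                                     lstm.predict(X_test_seq),
                                     w_c=0.7, w_l=0.3)
```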
S4, constructing a joint discrimination classification tree model:
S4.1, the width neural network constructed in S2 and the deep joint probability network constructed in S3 are preliminarily trained with the training set divided in S1; the per-class classification accuracies of the width neural network are obtained and the classes are ranked from high to low, giving the order class 6, 7, 1, 2, 3, 0, 4, 8, 5 (the higher the accuracy, the higher the rank), and the deep joint probability network obtains its pre-training weights;
S4.2, the width neural network and the deep joint probability network are taken as the nodes of a classification tree, and the number BN of width neural network nodes and the number DN of deep joint probability network nodes are set, where BN is obtained from n and a by a rounding-down (floor) operation and DN = 1; in this embodiment the number of classes is n = 9, and a, the hyper-parameter denoting the number of sub-classes the width neural network can split off, takes the integer values 2, 3 and 4 within its range;
S4.3, constructing the joint discrimination classification tree model: the joint discrimination classification tree consists of BN width neural network nodes and DN deep joint probability network nodes; all input samples first pass through the width neural network nodes of the tree, which separate out the BN×(a-1) classes with the highest accuracy, and the input samples of the remaining n-BN×(a-1) classes are then sent to the deep joint probability network node of the tree for classification. The process is as follows:
S4.3.1, the joint discrimination classification tree model takes a width neural network as a branch node and extends downwards; after passing through the branch node, the data to be classified is output as width-sensitive class 1, width-sensitive class 2, ..., width-sensitive class a-1, or the width-insensitive class, i.e. the outputs of a branch node are formed by these classification results; if the classification result is a sensitive class, the result is output directly, and if it is the width-insensitive class, the data to be classified is fed to the next node;
S4.3.2, if the number of width neural network nodes in the joint discrimination classification tree model has not reached BN, the model continues to extend downwards by repeating step S4.3.1; once the number of width neural network nodes reaches BN, step S4.3.3 is performed.
In this embodiment, taking a = 3 gives BN = 2, so there are two width neural network nodes in total. The outputs of the first width neural network node are class 6, class 7 and the width-insensitive class, where for this node the width-insensitive class comprises all classes in its input other than classes 6 and 7. The second width neural network node receives the output of the first node as input; its outputs are class 1, class 2 and the width-insensitive class, where for this node the width-insensitive class comprises all classes in its input other than classes 1 and 2. After the two width neural network nodes are reached, step S4.3.3 is performed;
S4.3.3, the joint discrimination classification tree model takes the deep joint probability network as the last node of the classification tree; it receives as input the width-insensitive classes finally output by the BN width neural network nodes and produces the classification output for these classes, so that the terminal branches of the tree yield all the classes.
In this embodiment, with a = 3 and BN = 2, the width-insensitive classes output after the two width neural network nodes are class 0, class 3, class 4, class 5 and class 8; the deep joint probability network takes these 5 classes as input and decides among them. The two width neural network nodes output 4 classes and the deep joint probability network outputs 5 classes, so all 9 classes are obtained at the tips of the tree in the joint discrimination classification tree model;
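With a = 3 and BN = 2, the routing of this embodiment can be written as a small configuration for the classify_with_tree sketch given earlier; width_net_1, width_net_2 and deep_joint_net stand for hypothetical trained models exposing a predict function:

```python
# node 1 separates classes 6 and 7, node 2 separates classes 1 and 2;
# everything else falls through to the deep joint probability network,
# which decides among classes 0, 3, 4, 5 and 8
width_nodes = [
    (width_net_1.predict, {6, 7}),
    (width_net_2.predict, {1, 2}),
]
scene = classify_with_tree(sample, width_nodes, deep_joint_net.predict)
```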
S4.4, during training, the average overall accuracy ACC and the loss function L_deep of the deep joint probability network are used as supervisory signals, where the average overall accuracy ACC is defined as:
ACC = (1/n) × Σ_{i=1}^{n} acc_i
where acc_i denotes the accuracy of the i-th class of the audio data set after passing through the whole joint discrimination classification tree model; in this embodiment there are n = 9 classes. L_deep is defined as:
L_deep = -(1/K) × Σ_{k=1}^{K} y_k log(ŷ_k)
where K denotes the number of input samples, y_k denotes the true label of the k-th sample, and ŷ_k denotes the predicted label of the k-th sample;
S4.5, the joint discrimination classification tree model is finally trained and optimized: a grid search is used to find suitable values of a, w_c and w_l; each time a value of a is selected, the structure of the joint discrimination classification tree model changes accordingly, and the width neural network and the deep joint probability network in the model are retrained on the input and output data corresponding to the changed structure; the whole joint discrimination classification tree model is trained and optimized so that ACC is maximized and L_deep is minimized, and the trained joint discrimination classification tree model is obtained after optimization.
In this embodiment, the grid search takes a from the values 2, 3 and 4, with w_c = 0.7 and w_l = 0.3; the best classification is finally achieved at a = 3, which maximizes ACC and minimizes L_deep. The resulting joint discrimination classification tree model is shown in Fig. 5: it consists of two width neural network nodes and one deep joint probability network node; the first node is split by the width neural network into three branches, namely class 6, class 7 and the second-node branch; the second node is split into class 1, class 2 and the third-node branch; and the third node is split by the deep neural network into class 0, class 3, class 4, class 5 and class 8. Each tip completes the classification of one of the 9 sound scene classes, and the model is obtained through the above construction, training and optimization.
S5, sound field Jing Bianshi:
and inputting the logarithmic mel spectrum characteristics of the test audio samples in the audio data set into a trained joint discrimination classification tree model to obtain the sound scene category to which each test sample belongs.
In summary, this embodiment combines the width neural network and the deep neural network in the form of a joint discrimination classification tree and provides a sound scene classification method based on width and depth neural networks. A width neural network is first adopted to distinguish certain sound scene categories, and a deep joint probability network is then adopted to distinguish the remaining sound scene categories. In addition, construction methods for the width neural network, the deep joint probability network and the joint discrimination classification tree model are provided. The invention can improve the training efficiency of the sound scene classification model, shorten the training time, guarantee training accuracy, and compensate for the poor generalization ability and stability of a single network.
For the overall network structure, the number of classes the deep neural network must distinguish is reduced by exploiting the fast training of width learning and its high accuracy on certain sound scene classes. Because the deep neural network handles fewer classes, the demands placed on it are reduced, the amount of training data is reduced, its training efficiency is improved, and the classification accuracy for certain sound scene categories can be improved. The joint discrimination classification tree model outperforms the deep joint probability network in both accuracy and training time.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above examples, and any other changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principle of the present invention should be made in the equivalent manner, and the embodiments are included in the protection scope of the present invention.

Claims (5)

1. The sound scene classification method based on the width and depth neural network is characterized by comprising the following steps of:
S1, establishing an audio data set: extracting logarithmic Mel spectrum features from the sound scene audio samples, and dividing them into a training set and a test set according to a proportion;
S2, constructing a width neural network: establishing a feature mapping layer and an enhancement layer, wherein the feature mapping layer and the enhancement layer perform feature mapping on an input sample, the mapped features are combined in parallel to form an input layer, and the input layer is connected with an output layer through a weight matrix;
S3, constructing a deep joint probability network: establishing a one-dimensional convolutional neural network and a long short-term memory network respectively, and then combining them into a deep joint probability network by taking a weighted average of their output probabilities;
S4, constructing a joint discrimination classification tree model: constructing a joint discrimination classification tree model according to the preliminary training results of the width neural network and the deep joint probability network, training and adjusting parameters of the joint discrimination classification tree model until the model converges, and obtaining a trained joint discrimination classification tree model;
the process of the step S4 is as follows:
S4.1, performing preliminary training on the width neural network constructed in step S2 and the deep joint probability network constructed in step S3 with the training set divided in step S1; the per-class classification accuracies of the width neural network are obtained and the classes are ranked from high to low (the higher the accuracy, the higher the rank), while the deep joint probability network obtains its pre-training weights;
S4.2, taking the width neural network and the deep joint probability network as the nodes of a classification tree, and setting the number BN of width neural network nodes and the number DN of deep joint probability network nodes, where BN is obtained from n and a by a rounding-down (floor) operation and DN = 1; here n denotes the number of classes of the entire audio data set and a is a hyper-parameter denoting the number of sub-classes the width neural network can split off, taking integer values within a range determined by n;
S4.3, constructing the joint discrimination classification tree model: the model consists of BN width neural network nodes and DN deep joint probability network nodes; all input samples first pass through the width neural network nodes, which separate out the BN×(a-1) classes with the highest accuracy, and the input samples of the remaining n-BN×(a-1) classes are then classified by the deep joint probability network node. The process is as follows:
S4.3.1, the joint discrimination classification tree model takes a width neural network as a branch node and extends downwards; after passing through the branch node, the data to be classified is output as width-sensitive class 1, width-sensitive class 2, ..., width-sensitive class a-1, or the width-insensitive class, where a is as defined in S4.2, i.e. the outputs of a branch node are formed by these classification results; if the classification result is a sensitive class, the result is output directly, and if the classification result is the width-insensitive class, the data to be classified is fed to the next node;
S4.3.2, if the number of width neural network nodes in the joint discrimination classification tree model has not reached BN, the model continues to extend downwards by repeating step S4.3.1; once the number of width neural network nodes reaches BN, step S4.3.3 is performed;
S4.3.3, the joint discrimination classification tree model takes the deep joint probability network as the last node of the classification tree; it receives as input the width-insensitive classes finally output by the BN width neural network nodes and produces the classification output for these width-insensitive classes, so that the terminal branches of the tree yield all the classes;
S4.4, during training, the two types of classification tree nodes are added gradually until the set number of nodes is reached, with the average overall accuracy ACC and the loss function L_deep of the deep joint probability network used as supervisory signals:
ACC = (1/n) × Σ_{i=1}^{n} acc_i
where acc_i denotes the accuracy of the i-th class of the audio data set after passing through the whole joint discrimination classification tree model, and n denotes the number of classes of the entire audio data set;
S4.5, finally training and optimizing the joint discrimination classification tree model: a grid search is used to find suitable values of a, w_c and w_l, where w_c and w_l respectively denote the weights of the one-dimensional convolutional neural network and the long short-term memory network; each time a value of a is selected, the structure of the joint discrimination classification tree model changes accordingly, and the width neural network and the deep joint probability network in the model are retrained on the input and output data corresponding to the changed structure; the whole joint discrimination classification tree model is trained and optimized so that ACC is maximized and the loss of the deep joint probability network, measured by the cross-entropy function L_deep, is minimized; the trained joint discrimination classification tree model is obtained after optimization;
S5, sound scene identification: inputting the logarithmic Mel spectrum features of the test audio samples into the trained joint discrimination classification tree model to obtain the sound scene categories of the test audio samples.
2. The method for classifying sound scenes based on the width and depth neural network according to claim 1, wherein the step S1 is as follows:
S1.1, acquiring audio data of sound scenes by using recording equipment or Internet resources, converting the sampling rate and quantization precision of the audio samples into a uniform format, and labeling the sound scene category to which each audio sample belongs;
S1.2, extracting logarithmic Mel spectrum features from the audio samples and carrying out mean normalization;
S1.3, randomly dividing the experimental data into mutually disjoint training and test sets, with the training set accounting for 70% and the test set for 30%.
3. The method for classifying sound scenes based on the width and depth neural network according to claim 1, wherein the step S2 is as follows:
S2.1, establishing a feature mapping layer: the feature mapping layer consists of N_1 feature windows, each containing N_2 feature nodes; N_1 and N_2 are chosen according to the number of features actually input to the width neural network, such that N_1×N_2 is approximately half the number of input features;
S2.2, establishing an enhancement layer with N_3 enhancement nodes, where (N_1×N_2) > N_3;
S2.3, the feature mapping layer performs feature mapping on the input samples. Let the sample set input to the width neural network be D_1, where the number of samples is c and the number of features per sample is f; a feature value equal to 1 is appended to each sample to obtain the augmented sample set D_2, so the number of features per sample becomes f+1. For each feature window a random weight matrix W_e is generated; W_e is an (f+1)×N_2 matrix whose values follow a Gaussian distribution with mean 0 and variance 1. A new feature matrix A_1 = D_2 × W_e is generated, with dimension c×N_2. A_1 is normalized and given a sparse representation, and the sparse matrix W = D_2^{-1} × A_1 is then solved, where D_2^{-1} denotes the (pseudo-)inverse of D_2. The feature nodes of one window are finally generated as T_1 = normal(D_2 × W), where normal() denotes normalization; the resulting T_1 has dimension c×N_2. Feature nodes are generated for all N_1 feature windows, finally giving the feature map y_b of the feature mapping layer, with dimension c×N_1×N_2;
S2.4, performing feature mapping for the enhancement layer: an orthogonal normalized weight matrix W_h of dimension (N_1×N_2)×N_3 is randomly generated, and the feature map of the enhancement layer is obtained from y_b as T_2 = tansig(y_b × W_h), where tansig() is the activation function of the neural network; the resulting feature map T_2 has dimension c×N_3;
S2.5, combining the mapped features in parallel to form the input layer: the feature maps y_b and T_2 are combined in parallel to obtain the input layer X = [y_b, T_2], so the feature dimension of each sample is N_2×N_1+N_3;
S2.6, connecting the input layer and the output layer through a weight matrix: the output Y of the output layer is the one-hot vector of the sound scene category labels, with dimension c×n_B, where n_B is the number of classification categories of the samples input to the width neural network; Y = X×W_B, where W_B is the weight matrix obtained by training the width neural network, with dimension (N_2×N_1+N_3)×n_B.
4. The method for classifying sound scenes based on the width and depth neural network according to claim 1, wherein the step S3 is as follows:
S3.1, constructing a one-dimensional convolutional neural network: the network consists of two or more one-dimensional convolutional layers, a fully-connected layer and a Softmax classification output layer in cascade; each one-dimensional convolutional layer is activated by a nonlinear activation function and then max-pooled before output. Let d^(l-1) and d^(l) denote the input and output of the l-th convolutional layer respectively; the input of the l-th convolutional layer is the output of the (l-1)-th convolutional layer. Since the l-th convolutional layer has several feature maps, consider one of them, d_i^(l); the output of the convolutional layer is expressed as:
d_i^(l) = fun(w_i^(l) * d^(l-1))
where * denotes the convolution operation, w_i^(l) denotes the kernel weight of the l-th layer, and fun() is a nonlinear activation function; the final output of the l-th layer is expressed as:
d^(l) = maxpooling(d_i^(l))
where maxpooling() denotes max pooling, whose result is used as the input of the next layer. The output of the last convolutional layer is passed through the fully-connected layer and finally through the Softmax classification output layer, which outputs the probability matrix y_c of the audio samples belonging to the different sound scene categories;
S3.2, constructing a long short-term memory network: the long short-term memory network consists of two long short-term memory layers and a Softmax classification output layer in cascade, with a Dropout layer optionally added after each long short-term memory layer; given an input sequence x = (x_1, …, x_T), each long short-term memory layer uses the hidden vector sequence h = (h_1, …, h_T) and iterates from t = 1 to t = T to generate the output y = (y_1, …, y_T):

o_t = σ(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t × tanh(C_t)
y_t = W_hy h_t + b_y

where W_o and b_o respectively represent the weight matrix and bias vector from the input layer to the hidden layer of the long short-term memory network, σ() represents the Sigmoid activation function, o_t and C_t respectively represent the output gate and the cell activation vector, h_t is the intermediate hidden-layer variable of the long short-term memory network, and W_hy and b_y respectively represent the weight matrix and bias vector from the hidden layer to the output layer of the long short-term memory network; finally, the long short-term memory network passes its output through the Softmax classification output layer to obtain the probability matrix y_l of each audio sample belonging to the different sound scene categories;
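A minimal PyTorch sketch of the S3.2 structure (two stacked long short-term memory layers with Dropout, followed by a Softmax classification output) might look like the following; the claim spells out only the output-gate equations, so relying on nn.LSTM for the full cell, and the feature and hidden sizes, are assumptions.

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Sketch of S3.2: two stacked LSTM layers (with Dropout between them)
    followed by a Softmax classification output layer."""
    def __init__(self, n_features=40, hidden=64, n_classes=10):
        super().__init__()
        # two long short-term memory layers; dropout is applied between the stacked layers
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            batch_first=True, dropout=0.3)
        self.out = nn.Linear(hidden, n_classes)   # plays the role of W_hy, b_y

    def forward(self, x):
        # x: (batch, T, n_features); the hidden state of the last step feeds the output layer
        h_seq, _ = self.lstm(x)
        logits = self.out(h_seq[:, -1, :])
        return torch.softmax(logits, dim=-1)       # y_l

# example usage: y_l = LSTMClassifier()(torch.randn(4, 100, 40))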
S3.3, combining the one-dimensional convolutional neural network and the long short-term memory network into a deep joint probability network through a weighted average of their output probabilities, expressed as:

y_a = w_c y_c + w_l y_l

where w_c and w_l respectively represent the weights of the one-dimensional convolutional neural network and the long short-term memory network, y_c represents the probability matrix of the audio samples output by the one-dimensional convolutional neural network, y_l represents the probability matrix output by the long short-term memory network, and y_a represents the probability matrix output by the deep joint probability network; the classification result y_result_k finally output by the deep joint probability network takes the sound scene category corresponding to the node with the maximum output probability:

y_result_k = argmax(y_a), 1 ≤ k ≤ K

where argmax() denotes taking the subscript corresponding to the maximum probability value, and K is the number of samples input to the deep joint probability network.
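The fusion and decision rule of S3.3 reduces to a few lines of NumPy; the equal weights below are an illustrative assumption, since the claim leaves w_c and w_l unspecified.

import numpy as np

def fuse_and_decide(y_c, y_l, w_c=0.5, w_l=0.5):
    """S3.3 sketch: weighted average of the two probability matrices, then the
    class with the largest fused probability is taken for each sample.

    y_c, y_l: K x n_classes probability matrices from the 1-D CNN and the LSTM.
    """
    y_a = w_c * y_c + w_l * y_l          # deep joint probability network output
    return np.argmax(y_a, axis=1)        # y_result_k for k = 1..K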
5. The method for classifying sound scenes based on the width and depth neural network according to claim 1, wherein the loss of the deep joint probability network uses a cross-entropy loss function L_deep, defined as:

L_deep = − Σ_{k=1}^{K} y_k log(ŷ_k)

where K represents the number of input samples, y_k represents the true label of the k-th sample, and ŷ_k represents the predicted label of the k-th sample.
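Assuming one-hot true labels and predicted probability vectors, the cross-entropy loss above can be sketched as follows; the epsilon term is added only for numerical safety and is not part of the claimed definition.

import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Cross-entropy L_deep over K samples (sketch).

    y_true: K x n_classes one-hot true labels
    y_pred: K x n_classes predicted probabilities (rows sum to 1)
    """
    return -np.sum(y_true * np.log(y_pred + eps))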
CN202010624687.1A 2020-07-02 2020-07-02 Sound field scene classification method based on width and depth neural network Active CN111723874B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010624687.1A CN111723874B (en) 2020-07-02 2020-07-02 Sound field scene classification method based on width and depth neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010624687.1A CN111723874B (en) 2020-07-02 2020-07-02 Sound field scene classification method based on width and depth neural network

Publications (2)

Publication Number Publication Date
CN111723874A CN111723874A (en) 2020-09-29
CN111723874B true CN111723874B (en) 2023-05-26

Family

ID=72571132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010624687.1A Active CN111723874B (en) 2020-07-02 2020-07-02 Sound field scene classification method based on width and depth neural network

Country Status (1)

Country Link
CN (1) CN111723874B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112309405A (en) * 2020-10-29 2021-02-02 平安科技(深圳)有限公司 Method and device for detecting multiple sound events, computer equipment and storage medium
CN113539283A (en) * 2020-12-03 2021-10-22 腾讯科技(深圳)有限公司 Audio processing method and device based on artificial intelligence, electronic equipment and storage medium
CN113411205B (en) * 2021-05-18 2023-02-28 郑州埃文计算机科技有限公司 Decision tree-based IP application scene division method
CN113689673A (en) * 2021-08-18 2021-11-23 广东电网有限责任公司 Cable monitoring protection method, device, system and medium
CN115249133B (en) * 2022-09-22 2023-02-14 华南理工大学 Building construction process risk classification method based on width learning network
CN115861302B (en) * 2023-02-16 2023-05-05 华东交通大学 Pipe joint surface defect detection method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108776820A (en) * 2018-06-07 2018-11-09 中国矿业大学 It is a kind of to utilize the improved random forest integrated approach of width neural network
CN109409516A (en) * 2017-08-11 2019-03-01 微软技术许可有限责任公司 Machine learning model for the tool depth and width that position is recommended
CN109859771A (en) * 2019-01-15 2019-06-07 华南理工大学 A kind of sound field scape clustering method of combined optimization deep layer transform characteristics and cluster process
CN109978034A (en) * 2019-03-18 2019-07-05 华南理工大学 A kind of sound scenery identification method based on data enhancing
CN111340771A (en) * 2020-02-23 2020-06-26 北京工业大学 Fine particle real-time monitoring method integrating visual information richness and wide-depth combined learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9754607B2 (en) * 2015-08-26 2017-09-05 Apple Inc. Acoustic scene interpretation systems and related methods

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409516A (en) * 2017-08-11 2019-03-01 微软技术许可有限责任公司 Machine learning model for the tool depth and width that position is recommended
CN108776820A (en) * 2018-06-07 2018-11-09 中国矿业大学 It is a kind of to utilize the improved random forest integrated approach of width neural network
CN109859771A (en) * 2019-01-15 2019-06-07 华南理工大学 A kind of sound field scape clustering method of combined optimization deep layer transform characteristics and cluster process
CN109978034A (en) * 2019-03-18 2019-07-05 华南理工大学 A kind of sound scenery identification method based on data enhancing
CN111340771A (en) * 2020-02-23 2020-06-26 北京工业大学 Fine particle real-time monitoring method integrating visual information richness and wide-depth combined learning

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
Anastasios Vafeiadis et al. Acoustic Scene Classification: From a Hybrid Classifier to Deep Learning. Detection and Classification of Acoustic Scenes and Events 2017, 2017, full text. *
Annamaria Mesaros et al. Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, vol. 26 (no. 26), full text. *
Yanxiong Li et al. Acoustic Scene Classification Using Deep Audio Feature and BLSTM Network. ICALIP 2018, 2018, full text. *
Yanxiong Li et al. Acoustic Scene Clustering Using Joint Optimization of Deep Embedding Learning and Clustering Iteration. IEEE Transactions on Multimedia, 2020, vol. 22 (no. 22), full text. *
Yanxiong Li et al. Anomalous Sound Detection Using Deep Audio Representation and a BLSTM Network for Audio Surveillance of Roads. IEEE Access, 2018, full text. *
Yanxiong Li et al. Using multi-stream hierarchical deep neural network to extract deep audio feature for acoustic event detection. Multimedia Tools and Applications, 2017, full text. *
Zhao Zhihui et al. Wind power prediction method based on a width-depth neural network. Journal of China Academy of Electronics and Information Technology (中国电子科学研究院学报), 2019, vol. 14 (no. 14), full text. *

Also Published As

Publication number Publication date
CN111723874A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN111723874B (en) Sound field scene classification method based on width and depth neural network
CN108805188B (en) Image classification method for generating countermeasure network based on feature recalibration
CN108664632B (en) Text emotion classification algorithm based on convolutional neural network and attention mechanism
CN108122562A (en) A kind of audio frequency classification method based on convolutional neural networks and random forest
CN108847223B (en) Voice recognition method based on deep residual error neural network
CN112216271B (en) Audio-visual dual-mode speech recognition method based on convolution block attention mechanism
CN111723239B (en) Video annotation method based on multiple modes
WO2021042857A1 (en) Processing method and processing apparatus for image segmentation model
CN112784929B (en) Small sample image classification method and device based on double-element group expansion
CN112784031B (en) Method and system for classifying customer service conversation texts based on small sample learning
CN109858972B (en) Method and device for predicting advertisement click rate
CN112529638B (en) Service demand dynamic prediction method and system based on user classification and deep learning
CN111353583A (en) Deep learning network based on group convolution characteristic topological space and training method thereof
CN113109782B (en) Classification method directly applied to radar radiation source amplitude sequence
CN111783688B (en) Remote sensing image scene classification method based on convolutional neural network
CN116593980B (en) Radar target recognition model training method, radar target recognition method and device
CN112529637B (en) Service demand dynamic prediction method and system based on context awareness
CN112818982B (en) Agricultural pest image detection method based on depth feature autocorrelation activation
CN114511747A (en) Unbalanced load data type identification method based on VAE preprocessing and RP-2DCNN
CN114678030A (en) Voiceprint identification method and device based on depth residual error network and attention mechanism
CN113901820A (en) Chinese triplet extraction method based on BERT model
CN113361631A (en) Insulator aging spectrum classification method based on transfer learning
CN113823292B (en) Small sample speaker recognition method based on channel attention depth separable convolution network
CN111832588A (en) Riot and terrorist image labeling method based on integrated classification
CN113723456B (en) Automatic astronomical image classification method and system based on unsupervised machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant