CN111723874B - Sound field scene classification method based on width and depth neural network - Google Patents

Sound field scene classification method based on width and depth neural network Download PDF

Info

Publication number
CN111723874B
CN111723874B
Authority
CN
China
Prior art keywords
neural network
layer
network
output
joint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010624687.1A
Other languages
Chinese (zh)
Other versions
CN111723874A (en)
Inventor
黄张金
李艳雄
张文浩
林子珩
陈奕纯
谭煜枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010624687.1A priority Critical patent/CN111723874B/en
Publication of CN111723874A publication Critical patent/CN111723874A/en
Application granted granted Critical
Publication of CN111723874B publication Critical patent/CN111723874B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

The invention discloses a sound scene classification method based on width and depth neural networks, which comprises the following steps: firstly, extracting logarithmic Mel spectrum features from the sound scene audio samples and dividing them into a training set and a test set; designing a width neural network and a deep joint probability network; pre-training the two networks with the logarithmic Mel spectrum features of each training audio sample as input; constructing a joint discrimination classification tree model according to the pre-training results, then training and optimizing it; and finally, inputting the logarithmic Mel spectrum features of each test audio sample into the joint discrimination classification tree model to identify the sound scene corresponding to each audio sample. The joint discrimination classification tree model constructed by the invention compensates for the poor generalization ability and weak stability of a single network, and improves sound scene classification by exploiting the complementary strengths of the width neural network and the deep neural network.

Description

Sound field scene classification method based on width and depth neural network
Technical Field
The invention belongs to the technical field of machine hearing, relates to width learning and deep learning techniques, and in particular relates to a sound scene classification method based on width and depth neural networks.
Background
Daily activities involve a variety of different sound events, and combinations of these events constitute a variety of different sound scenes. Sound scene classification technology has a wide range of applications, such as audio surveillance, multimedia retrieval, driving assistance and smart homes.
When a width neural network alone is applied to sound scene classification, its accuracy is hard to improve beyond a certain level and struggles to meet practical requirements. Most existing sound scene classification methods are based on deep neural networks, whose drawback is an overly long training time. In fact, the width neural network can reach high accuracy for some sound scene categories but performs poorly on others, so its overall accuracy stops improving after a certain point.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a sound scene classification method based on width and depth neural networks. It introduces the width network into sound scene classification to reduce the training time of the deep neural network, and hence of the whole classification network, and combines the width neural network and the deep joint probability network in the form of a classification tree to improve classification accuracy. The invention thus improves training efficiency while preserving the accuracy of the sound scene classification network.
The aim of the invention can be achieved by adopting the following technical scheme:
a sound scene classification method based on a width and depth neural network comprises the following steps:
S1, establishing an audio data set: extracting logarithmic Mel spectrum features from the sound scene audio samples, and dividing them into a training set and a test set according to a proportion;
S2, constructing a width neural network: establishing a feature mapping layer and an enhancement layer, wherein the feature mapping layer and the enhancement layer perform feature mapping on an input sample, the mapped features are combined in parallel to form an input layer, and the input layer is connected with an output layer through a weight matrix;
S3, constructing a deep joint probability network: establishing a one-dimensional convolutional neural network and a long short-term memory network respectively, and then combining them into a deep joint probability network by taking a weighted average of their output probabilities;
S4, constructing a joint discrimination classification tree model: constructing a joint discrimination classification tree model according to the preliminary training results of the width neural network and the deep joint probability network, training and adjusting parameters of the joint discrimination classification tree model until the model converges, and obtaining a trained joint discrimination classification tree model;
S5, sound scene identification: inputting the logarithmic Mel spectrum features of the test audio samples into the trained joint discrimination classification tree model to obtain the sound scene categories of the test audio samples.
Further, the step S1 is as follows:
S1.1, acquiring audio data of sound scenes by using recording equipment or Internet resources, converting the sampling rate and quantization precision of the audio samples into a uniform format, and labeling the sound scene category to which each audio sample belongs;
S1.2, extracting logarithmic Mel spectrum features from the audio samples and carrying out mean normalization;
S1.3, randomly dividing the experimental data into mutually disjoint training and test sets, with the training set accounting for about 70% and the test set for about 30%.
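Steps S1.2–S1.3 can be sketched as follows (a minimal illustration assuming librosa is available; the 20 Mel bands match the embodiment below, while the frame parameters are library defaults rather than values taken from the disclosure):

```python
import numpy as np
import librosa

def log_mel_feature(wav_path, sr=16000, n_mels=20):
    """Extract a mean-normalized logarithmic Mel spectrum from one audio sample."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                     # logarithmic Mel spectrum
    return log_mel - log_mel.mean(axis=1, keepdims=True)   # mean normalization

def split_dataset(features, labels, train_ratio=0.7, seed=0):
    """Random, mutually disjoint ~70% / ~30% train/test split of NumPy arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(features))
    cut = int(train_ratio * len(features))
    return (features[idx[:cut]], labels[idx[:cut]],
            features[idx[cut:]], labels[idx[cut:]])
```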
Further, the step S2 is as follows:
S2.1, establishing a feature mapping layer: the feature mapping layer consists of N_1 feature windows, each containing N_2 feature nodes; N_1 and N_2 are chosen according to the number of features actually input to the width neural network, such that N_1×N_2 is approximately half the number of input features;
S2.2, establishing an enhancement layer with N_3 enhancement nodes, where (N_1×N_2) > N_3;
S2.3, the feature mapping layer performs feature mapping on the input samples. Let the sample set input to the width neural network be D_1, where the number of samples is c and the number of features per sample is f; a feature value equal to 1 is appended to each sample to obtain the augmented sample set D_2, so the number of features per sample becomes f+1. For each feature window a random weight matrix W_e is generated; W_e is an (f+1)×N_2 matrix whose values follow a Gaussian distribution with mean 0 and variance 1. A new feature matrix A_1 = D_2 × W_e is generated, with dimension c×N_2. A_1 is normalized and given a sparse representation, and the sparse matrix W = D_2^{-1} × A_1 is then solved, where D_2^{-1} denotes the (pseudo-)inverse of D_2. The feature nodes of one window are finally generated as T_1 = normal(D_2 × W), where normal() denotes normalization; the resulting T_1 has dimension c×N_2. Feature nodes are generated for all N_1 feature windows, finally giving the feature map y_b of the feature mapping layer, with dimension c×N_1×N_2;
S2.4, performing feature mapping for the enhancement layer: an orthogonal normalized weight matrix W_h of dimension (N_1×N_2)×N_3 is randomly generated, and the feature map of the enhancement layer is obtained from y_b as T_2 = tansig(y_b × W_h), where tansig() is the activation function of the neural network; the resulting feature map T_2 has dimension c×N_3;
S2.5, combining the mapped features in parallel to form the input layer: the feature maps y_b and T_2 are combined in parallel to obtain the input layer X = [y_b, T_2], so the feature dimension of each sample is N_2×N_1+N_3;
S2.6, connecting the input layer and the output layer through a weight matrix: the output Y of the output layer is the one-hot vector of the sound scene category labels, with dimension c×n_B, where n_B is the number of classification categories of the samples input to the width neural network; Y = X×W_B, where W_B is the weight matrix obtained by training the width neural network, with dimension (N_2×N_1+N_3)×n_B.
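A compact NumPy sketch of the forward construction in S2.3–S2.6 is given below; it is an illustration, not the disclosed training procedure: the sparse solution and the output weights W_B are both approximated here by a pseudo-inverse, and the normal() normalization of the feature nodes is simplified.

```python
import numpy as np

def tansig(x):                          # tan-sigmoid activation used in S2.4
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def build_width_network(D1, Y, N1=50, N2=80, N3=1500):
    """D1: c x f samples, Y: c x n_B one-hot labels; requires N1*N2 > N3."""
    c = D1.shape[0]
    D2 = np.hstack([D1, np.ones((c, 1))])            # append a feature equal to 1
    windows = []
    for _ in range(N1):                              # N1 feature windows
        We = np.random.randn(D2.shape[1], N2)        # Gaussian(0, 1) random weights
        A1 = D2 @ We                                 # c x N2 feature matrix
        A1 = (A1 - A1.mean(0)) / (A1.std(0) + 1e-8)  # normalize A1
        W = np.linalg.pinv(D2) @ A1                  # stand-in for the sparse solution
        windows.append(D2 @ W)                       # feature nodes of one window
    yb = np.hstack(windows)                          # feature mapping layer, c x (N1*N2)
    Wh, _ = np.linalg.qr(np.random.randn(N1 * N2, N3))   # orthogonal normalized weights
    T2 = tansig(yb @ Wh)                             # enhancement layer, c x N3
    X = np.hstack([yb, T2])                          # input layer [y_b, T_2]
    WB = np.linalg.pinv(X) @ Y                       # output weights fitted to labels
    return WB
```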
Further, the step S3 is as follows:
S3.1, constructing a one-dimensional convolutional neural network: the network consists of two or more one-dimensional convolutional layers, a fully-connected layer and a Softmax classification output layer in cascade; each one-dimensional convolutional layer is activated by a nonlinear activation function and then max-pooled before output. Let d^(l-1) and d^(l) denote the input and output of the l-th convolutional layer respectively; the input of the l-th convolutional layer is the output of the (l-1)-th convolutional layer. Since the l-th convolutional layer has several feature maps, consider one of them, d_i^(l); the output of the convolutional layer is expressed as:
d_i^(l) = fun(w_i^(l) * d^(l-1))
where * denotes the convolution operation, w_i^(l) denotes the kernel weight of the l-th layer, and fun() is a nonlinear activation function; the final output of the l-th layer is expressed as:
d^(l) = maxpooling(d_i^(l))
where maxpooling() denotes max pooling, whose result is used as the input of the next layer. The output of the last convolutional layer is passed through the fully-connected layer and finally through the Softmax classification output layer, which outputs the probability matrix y_c of the audio samples belonging to the different sound scene categories;
S3.2, constructing a long short-term memory network: the network consists of two long short-term memory layers and a Softmax classification output layer in cascade; a Dropout layer may optionally be added after each long short-term memory layer. For each long short-term memory layer, given an input sequence x = (x_1, ..., x_T), the layer uses the hidden vector sequence h = (h_1, ..., h_T) to generate the output y = (y_1, ..., y_T) from iteration t = 1 to T:
o_t = σ(W_o[h_{t-1}, x_t] + b_o)
h_t = o_t × tanh(C_t)
y_t = W_hy h_t + b_y
where W_o and b_o respectively denote the weight matrix and bias vector from the input layer to the hidden layer of the long short-term memory, σ() denotes the Sigmoid activation function, o_t and C_t respectively denote the output gate and the cell activation vector, h_t is the intermediate hidden-layer variable of the long short-term memory network, and W_hy and b_y respectively denote the weight matrix and bias vector from the hidden layer to the output layer of the long short-term memory network. Finally, through the Softmax classification output layer, the long short-term memory network outputs the probability matrix y_l of each audio sample belonging to the different sound scene categories;
S3.3, the one-dimensional convolutional neural network and the long short-term memory network are combined into a deep joint probability network by taking a weighted average of their output probabilities, expressed as:
y_a = w_c·y_c + w_l·y_l
where w_c and w_l respectively denote the weights of the one-dimensional convolutional neural network and the long short-term memory network, y_c denotes the probability matrix output by the one-dimensional convolutional neural network for the audio samples, y_l denotes the probability matrix output by the long short-term memory network, and y_a denotes the probability matrix output by the deep joint probability network. The classification result y_result_k finally output by the deep joint probability network takes the sound scene category corresponding to the node with the maximum output probability:
y_result_k = argmax(y_a), 1 ≤ k ≤ K
where argmax() takes the index corresponding to the maximum probability value and K is the number of samples input to the deep joint probability network.
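The fusion in S3.3 reduces to a few lines, assuming y_c and y_l are K×n probability matrices already produced by the two sub-networks (the default weights follow the values used later in the embodiment):

```python
import numpy as np

def deep_joint_probability(y_c, y_l, w_c=0.7, w_l=0.3):
    """Weighted average of the two output probability matrices (S3.3)."""
    y_a = w_c * y_c + w_l * y_l          # joint probability matrix, K x n
    y_result = np.argmax(y_a, axis=1)    # class with maximum probability per sample
    return y_a, y_result
```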
Further, the step S4 is as follows:
S4.1, performing preliminary training on the width neural network constructed in step S2 and the deep joint probability network constructed in step S3 with the training set divided in step S1; the per-class classification accuracies of the width neural network are obtained and the classes are ranked from high to low (the higher the accuracy, the higher the rank), while the deep joint probability network obtains its pre-training weights;
S4.2, taking the width neural network and the deep joint probability network as the nodes of a classification tree, and setting the number BN of width neural network nodes and the number DN of deep joint probability network nodes, where BN is obtained from n and a by a rounding-down (floor) operation and DN = 1; here n denotes the number of classes of the entire audio data set and a is a hyper-parameter denoting the number of sub-classes the width neural network can split off, taking integer values within a range determined by n;
S4.3, constructing the joint discrimination classification tree model: the model consists of BN width neural network nodes and DN deep joint probability network nodes; all input samples first pass through the width neural network nodes, which separate out the BN×(a-1) classes with the highest accuracy, and the input samples of the remaining n-BN×(a-1) classes are then classified by the deep joint probability network node. The process is as follows:
S4.3.1, the joint discrimination classification tree model takes a width neural network as a branch node and extends downwards; after passing through the branch node, the data to be classified is output as width-sensitive class 1, width-sensitive class 2, ..., width-sensitive class a-1, or the width-insensitive class, where a is as defined in S4.2, i.e. the outputs of a branch node are formed by these classification results; if the classification result is a sensitive class, the result is output directly, and if the classification result is the width-insensitive class, the data to be classified is fed to the next node;
S4.3.2, if the number of width neural network nodes in the joint discrimination classification tree model has not reached BN, the model continues to extend downwards by repeating step S4.3.1; once the number of width neural network nodes reaches BN, step S4.3.3 is performed;
S4.3.3, the joint discrimination classification tree model takes the deep joint probability network as the last node of the classification tree; it receives as input the width-insensitive classes finally output by the BN width neural network nodes and produces the classification output for these width-insensitive classes, so that the terminal branches of the tree yield all the classes;
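The routing of S4.3.1–S4.3.3 can be sketched as follows; the node interface is an assumption, with each width node paired with the set of sensitive classes it was trained to separate:

```python
def classify_with_tree(x, width_nodes, depth_node):
    """Route one sample through the joint discrimination classification tree.

    width_nodes: list of (predict_fn, sensitive_classes) for the BN width nodes,
                 in tree order; predict_fn returns a class label.
    depth_node:  predict_fn of the deep joint probability network (last node).
    """
    for predict, sensitive_classes in width_nodes:   # BN width nodes in order
        label = predict(x)
        if label in sensitive_classes:               # width-sensitive class: output directly
            return label
        # otherwise width-insensitive: pass the sample on to the next node
    return depth_node(x)                             # last node decides the remaining classes
```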
S4.4, during training, the two types of classification tree nodes are added gradually until the set number of nodes is reached, with the average overall accuracy ACC and the loss function L_deep of the deep joint probability network used as supervisory signals:
ACC = (1/n) × Σ_{i=1}^{n} acc_i
where acc_i denotes the accuracy of the i-th class of the audio data set after passing through the whole joint discrimination classification tree model, and n denotes the number of classes of the entire audio data set;
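The ACC supervisory signal can be computed directly from per-class results, for example (a small sketch; the per-class correct counts and totals are assumed to be gathered over the whole tree):

```python
import numpy as np

def average_overall_accuracy(correct_per_class, total_per_class):
    """ACC = (1/n) * sum_i acc_i over the n sound scene classes."""
    acc_i = np.asarray(correct_per_class) / np.asarray(total_per_class)
    return float(acc_i.mean())

# e.g. for 9 classes with hypothetical per-class results:
# average_overall_accuracy([90, 80, 70, 95, 60, 88, 77, 66, 99], [100] * 9)
```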
S4.5, finally training and optimizing the joint discrimination classification tree model: a grid search is used to find suitable values of a, w_c and w_l; each time a value of a is selected, the structure of the joint discrimination classification tree model changes accordingly, and the width neural network and the deep joint probability network in the model are retrained on the input and output data corresponding to the changed structure; the whole joint discrimination classification tree model is trained and optimized so that ACC is maximized and the loss of the deep joint probability network, measured by the cross-entropy function L_deep, is minimized; the trained joint discrimination classification tree model is obtained after optimization.
Further, the loss of the deep joint probability network uses a cross-entropy loss function L_deep defined as:
L_deep = -(1/K) × Σ_{k=1}^{K} y_k log(ŷ_k)
where K denotes the number of input samples, y_k denotes the true label of the k-th sample, and ŷ_k denotes the predicted label of the k-th sample.
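The grid search of S4.5 can be sketched as follows; build_tree, train_tree and evaluate_acc are hypothetical placeholders for the construction, training and ACC evaluation steps described above, and tying w_l to 1 − w_c is a simplifying assumption:

```python
import itertools

def grid_search(train_set, valid_set, a_values, wc_values):
    best, best_acc = None, -1.0
    for a, w_c in itertools.product(a_values, wc_values):
        w_l = 1.0 - w_c                       # assumed complementary sub-network weights
        tree = build_tree(a, w_c, w_l)        # structure changes with each choice of a
        train_tree(tree, train_set)           # retrain width and deep networks
        acc = evaluate_acc(tree, valid_set)   # average overall accuracy ACC
        if acc > best_acc:
            best, best_acc = (a, w_c, w_l), acc
    return best, best_acc
```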
Compared with the prior art, the invention has the following advantages and effects:
1) The invention uses two types of neural networks as base classifiers, with different base classifiers distinguishing different categories of sound scenes, so the number and structure of the networks can be adaptively adjusted according to the number of sound scene categories, making optimal use of computing resources.
2) The invention uses a width neural network, which has the advantage of incremental learning: the network model can be updated dynamically, the network structure can be adjusted quickly when the training data change, and training can be completed quickly.
3) The invention combines the two deep networks by a probability-weighted-average method built on the training results of the width neural network, further improving classification accuracy; at the same time, the deep neural network pre-adjusts its parameters with the help of the width neural network's weights, which speeds up convergence of the deep neural network model and shortens training time.
4) The overall classification model can be flexibly adjusted and optimized according to the training results of the two types of networks, and can satisfy the classification requirements of a wide variety of sound scenes to the greatest extent.
5) The classifier structure can serve as a general classification framework and be applied to other classification scenarios, accelerating training and improving classification accuracy.
Drawings
FIG. 1 is a flow diagram of a method for classifying sound scenes based on a width and depth neural network disclosed in an embodiment of the invention;
FIG. 2 is a schematic diagram of a wide neural network architecture in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a joint discriminant classification tree model in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a deep joint probability network structure in an embodiment of the invention;
FIG. 5 is a schematic diagram of another joint discriminant classification tree model in accordance with an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
The embodiment discloses a sound scene classification method based on a width and depth neural network, and a flow chart of the sound scene classification method based on the width and depth neural network is shown in fig. 1, and the method comprises the following specific steps:
S1, establishing an audio data set:
S1.1, the DCASE 2018 Task5 public data set is adopted as the audio samples of the audio data set; it consists of sound events continuously recorded for one week in a home environment, with 9 sound scene classes and 72984 audio samples; each sample is 10 s long, the sampling rate is 16 kHz, and the quantization precision is uniform;
S1.2, 20-dimensional logarithmic Mel spectrum features are extracted for each audio sample in the audio data set, i.e. each audio sample corresponds to a feature map of 20×399 pixels; each pixel of the feature map is mean-normalized, the map is converted into a one-dimensional feature vector of length 20×399 = 7980, and the sound scene category label corresponding to each feature map is given;
S1.3, the established audio data set thus contains 9 sound scene classes, i.e. 72984 feature maps in total; 51000 feature maps are randomly selected from the data set as the training set and 21984 as the test set.
S2, constructing a width neural network:
S2.1, establishing a feature mapping layer:
The number of feature windows of the feature mapping layer is set to N_1 = 50, and the number of feature nodes in each feature window is N_2 = 80;
S2.2, establishing an enhancement layer:
The number of enhancement nodes of the enhancement layer is set to N_3 = 1500, satisfying the condition (N_1×N_2) > N_3;
S2.3, calculating the mapping from the input sample to the feature layer:
The audio sample data set input to the width neural network is set as D_1, where the number of samples is c and the number of features per sample is f = 7980; a feature value equal to 1 is appended to each sample to obtain the augmented sample set D_2, so the number of features per sample becomes 7981. For each of the 50 feature windows a random weight matrix W_e is generated; W_e is a 7981×N_2 = 7981×80 two-dimensional matrix whose values follow a Gaussian distribution with mean 0 and variance 1. A new feature matrix A_1 = D_2 × W_e is generated, with dimension c×80. A_1 is normalized and given a sparse representation, and the sparse matrix W = D_2^{-1} × A_1 is then solved, where D_2^{-1} denotes the (pseudo-)inverse of D_2. The feature nodes of one feature window are finally generated as T_1 = normal(D_2 × W), with dimension c×80; feature nodes are generated for the 50 feature windows, finally giving the feature map y_b with dimension c×50×80;
S2.4, calculating the mapping from the input samples to the enhancement layer:
An orthogonal normalized weight matrix W_h of dimension (50×80)×1500 is randomly generated, and a new feature map T_2 = tansig(y_b × W_h) is obtained from the y_b of S2.3, where c is the same as in S2.3 and tansig() is a commonly used activation function in neural networks; the resulting T_2 has dimension c×1500;
S2.5, combining the mapped features in parallel to form an input layer:
The feature map y_b obtained in S2.3 and the feature map T_2 obtained in S2.4 are combined in parallel to obtain the input layer X = [y_b, T_2], so the feature dimension of each sample is 80×50+1500;
S2.6, the input layer is connected to the output layer through a weight matrix; the output Y of the output layer is the one-hot vector of the sound scene category labels, with dimension c×n_B, where n_B is the number of classification categories of the samples input to the width neural network; Y = X×W_B, where W_B is the weight matrix to be trained by the width neural network, with dimension (80×50+1500)×n_B.
S3, constructing a deep joint probability network:
S3.1, constructing a one-dimensional convolutional neural network: the network adopts a sequentially connected 5-layer structure consisting of 3 one-dimensional convolutional layers, a fully-connected layer and a Softmax classification output layer in cascade, where each convolutional layer is activated by a nonlinear activation function and max-pooled before output. Let d^(l-1) and d^(l) denote the input and output of the l-th convolutional layer respectively; the input of the l-th convolutional layer is the output of the (l-1)-th convolutional layer. Since the l-th convolutional layer has several feature maps, consider one of them, d_i^(l); the output of the convolutional layer is expressed as:
d_i^(l) = fun(w_i^(l) * d^(l-1))
where * denotes the convolution operation, w_i^(l) denotes the kernel weight of the l-th layer, and fun() is a nonlinear activation function; the final output of the l-th layer is expressed as:
d^(l) = maxpooling(d_i^(l))
where maxpooling() denotes max pooling, whose result is used as the input of the next layer. The output of the last convolutional layer is passed through the fully-connected layer and finally through the Softmax classification output layer, which outputs the probability matrix y_c of the audio samples belonging to the different sound scene categories. In this embodiment, the first one-dimensional convolutional layer has 128 convolution kernels, a kernel window length of 100, a convolution stride of 1, a ReLU activation function and a max-pooling window size of 2; the second one-dimensional convolutional layer has 128 convolution kernels, a kernel window length of 30, a convolution stride of 1, a ReLU activation function and a max-pooling window size of 2; the third one-dimensional convolutional layer has 128 convolution kernels, a kernel window length of 15, a convolution stride of 1, a ReLU activation function and a max-pooling window size of 2; the fourth layer is a fully-connected layer serving as the transition from the convolutional layers to the Softmax classification output layer, with an output dimension of 128 and a dropout ratio of 50%; the fifth layer is the Softmax classification output layer, which outputs the probability matrix y_c of the audio samples belonging to the different sound scene categories.
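Under the hyper-parameters listed above, the one-dimensional convolutional branch could be sketched in Keras roughly as follows; the input shape of (7980, 1) and the ReLU on the fully-connected layer are assumptions based on this embodiment, not values stated in the disclosure:

```python
from tensorflow.keras import layers, models, optimizers

def build_cnn(n_classes=9, input_len=7980):
    m = models.Sequential([
        layers.Input(shape=(input_len, 1)),
        layers.Conv1D(128, 100, strides=1, activation='relu'),  # 1st conv layer
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 30, strides=1, activation='relu'),   # 2nd conv layer
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 15, strides=1, activation='relu'),   # 3rd conv layer
        layers.MaxPooling1D(2),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),                   # fully-connected transition
        layers.Dropout(0.5),                                     # dropout ratio 50%
        layers.Dense(n_classes, activation='softmax'),           # probability matrix y_c
    ])
    m.compile(optimizer=optimizers.Adam(learning_rate=2e-4),
              loss='categorical_crossentropy', metrics=['accuracy'])
    return m
```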
S3.2, constructing a long short-term memory network: the network adopts a sequentially connected 4-layer structure consisting of two long short-term memory layers, a Dropout layer and a Softmax classification output layer in cascade. For each CuDNN long short-term memory layer, given an input sequence x = (x_1, ..., x_T), the layer uses the hidden vector sequence h = (h_1, ..., h_T) to generate the output y = (y_1, ..., y_T) from iteration t = 1 to T:
o_t = σ(W_o[h_{t-1}, x_t] + b_o)
h_t = o_t × tanh(C_t)
y_t = W_hy h_t + b_y
where W_o and b_o respectively denote the weight matrix and bias vector from the input layer to the hidden layer of the long short-term memory, σ() denotes the Sigmoid activation function, o_t and C_t respectively denote the output gate and the cell activation vector, h_t is the intermediate hidden-layer variable of the long short-term memory network, and W_hy and b_y respectively denote the weight matrix and bias vector from the hidden layer to the output layer of the long short-term memory network. Finally, through the Softmax classification output layer, the long short-term memory network outputs the probability matrix y_l of each audio sample belonging to the different sound scene categories.
In this embodiment, CuDNN is used to accelerate the training process. The first layer is a CuDNN long short-term memory layer with an output dimension of 64 that returns all output sequences; the second is a CuDNN long short-term memory layer with an output dimension of 64; a Dropout layer is added as the third layer after the second long short-term memory layer, with the probability of randomly disconnecting input neurons set to 50% to prevent overfitting; the fourth layer is the Softmax classification output layer, which outputs the probability matrix y_l of each audio sample belonging to the different sound scene categories.
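The long short-term memory branch of this embodiment might look roughly like this in Keras (the sequence input shape of 399 frames × 20 Mel bands is an assumption; recent Keras LSTM layers use the cuDNN kernel automatically when it is available):

```python
from tensorflow.keras import layers, models, optimizers

def build_lstm(n_classes=9, time_steps=399, n_mels=20):
    m = models.Sequential([
        layers.Input(shape=(time_steps, n_mels)),
        layers.LSTM(64, return_sequences=True),  # 1st LSTM layer, returns all sequences
        layers.LSTM(64),                         # 2nd LSTM layer
        layers.Dropout(0.5),                     # randomly drop 50% to prevent overfitting
        layers.Dense(n_classes, activation='softmax'),   # probability matrix y_l
    ])
    m.compile(optimizer=optimizers.Adam(learning_rate=1e-5),
              loss='categorical_crossentropy', metrics=['accuracy'])
    return m
```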
S3.3, the one-dimensional convolutional neural network and the long short-term memory network are combined into a deep joint probability network by taking a weighted average of their output probabilities, expressed as:
y_a = w_c·y_c + w_l·y_l
where w_c and w_l respectively denote the weights of the one-dimensional convolutional neural network and the long short-term memory network, y_c denotes the probability matrix output by the one-dimensional convolutional neural network in S3.1, y_l denotes the probability matrix output by the long short-term memory network in S3.2, and y_a denotes the probability matrix output by the deep joint probability network. The classification result y_result_k finally output by the deep joint probability network takes the sound scene category corresponding to the node with the maximum output probability:
y_result_k = argmax(y_a), 1 ≤ k ≤ K
where argmax() takes the index corresponding to the maximum probability value and K is the number of samples input to the deep joint probability network.
In this embodiment, the weight of the one-dimensional convolutional neural network is w_c = 0.7 and the weight of the long short-term memory network is w_l = 0.3; the one-dimensional convolutional neural network uses an Adam optimizer with a learning rate of 0.0002, a batch size of 256 and 1000 training epochs; the long short-term memory network uses an Adam optimizer with a learning rate of 0.00001, a batch size of 50 and 500 iterations.
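Putting the two branches together with the training settings of this embodiment gives a usage sketch like the following, where X_train_cnn, X_train_seq, Y_train and the test arrays are hypothetical prepared data and build_cnn, build_lstm and deep_joint_probability refer to the sketches above:

```python
# one-dimensional CNN branch: Adam, learning rate 0.0002, batch size 256, 1000 epochs
cnn = build_cnn()
cnn.fit(X_train_cnn, Y_train, batch_size=256, epochs=1000)

# LSTM branch: Adam, learning rate 0.00001, batch size 50, 500 iterations
lstm = build_lstm()
lstm.fit(X_train_seq, Y_train, batch_size=50, epochs=500)

# fuse the two probability matrices with weights 0.7 and 0.3
y_a, y_pred = deep_joint_probability(cnn.predict(X_test_cnn),
                                     lstm.predict(X_test_seq),
                                     w_c=0.7, w_l=0.3)
```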
S4, constructing a joint discrimination classification tree model:
S4.1, the width neural network constructed in S2 and the deep joint probability network constructed in S3 are preliminarily trained with the training set divided in S1; the per-class classification accuracies of the width neural network are obtained and the classes are ranked from high to low, giving the order class 6, 7, 1, 2, 3, 0, 4, 8, 5 (the higher the accuracy, the higher the rank), and the deep joint probability network obtains its pre-training weights;
S4.2, the width neural network and the deep joint probability network are taken as the nodes of a classification tree, and the number BN of width neural network nodes and the number DN of deep joint probability network nodes are set, where BN is obtained from n and a by a rounding-down (floor) operation and DN = 1; in this embodiment the number of classes is n = 9, and a, the hyper-parameter denoting the number of sub-classes the width neural network can split off, takes the integer values 2, 3 and 4 within its range;
S4.3, constructing the joint discrimination classification tree model: the joint discrimination classification tree consists of BN width neural network nodes and DN deep joint probability network nodes; all input samples first pass through the width neural network nodes of the tree, which separate out the BN×(a-1) classes with the highest accuracy, and the input samples of the remaining n-BN×(a-1) classes are then sent to the deep joint probability network node of the tree for classification. The process is as follows:
S4.3.1, the joint discrimination classification tree model takes a width neural network as a branch node and extends downwards; after passing through the branch node, the data to be classified is output as width-sensitive class 1, width-sensitive class 2, ..., width-sensitive class a-1, or the width-insensitive class, i.e. the outputs of a branch node are formed by these classification results; if the classification result is a sensitive class, the result is output directly, and if it is the width-insensitive class, the data to be classified is fed to the next node;
S4.3.2, if the number of width neural network nodes in the joint discrimination classification tree model has not reached BN, the model continues to extend downwards by repeating step S4.3.1; once the number of width neural network nodes reaches BN, step S4.3.3 is performed.
In this embodiment, taking a = 3 gives BN = 2, so there are two width neural network nodes in total. The outputs of the first width neural network node are class 6, class 7 and the width-insensitive class, where for this node the width-insensitive class comprises all classes in its input other than classes 6 and 7. The second width neural network node receives the output of the first node as input; its outputs are class 1, class 2 and the width-insensitive class, where for this node the width-insensitive class comprises all classes in its input other than classes 1 and 2. After the two width neural network nodes are reached, step S4.3.3 is performed;
S4.3.3, the joint discrimination classification tree model takes the deep joint probability network as the last node of the classification tree; it receives as input the width-insensitive classes finally output by the BN width neural network nodes and produces the classification output for these classes, so that the terminal branches of the tree yield all the classes.
In this embodiment, with a = 3 and BN = 2, the width-insensitive classes output after the two width neural network nodes are class 0, class 3, class 4, class 5 and class 8; the deep joint probability network takes these 5 classes as input and decides among them. The two width neural network nodes output 4 classes and the deep joint probability network outputs 5 classes, so all 9 classes are obtained at the tips of the tree in the joint discrimination classification tree model;
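With a = 3 and BN = 2, the routing of this embodiment can be written as a small configuration for the classify_with_tree sketch given earlier; width_net_1, width_net_2 and deep_joint_net stand for hypothetical trained models exposing a predict function:

```python
# node 1 separates classes 6 and 7, node 2 separates classes 1 and 2;
# everything else falls through to the deep joint probability network,
# which decides among classes 0, 3, 4, 5 and 8
width_nodes = [
    (width_net_1.predict, {6, 7}),
    (width_net_2.predict, {1, 2}),
]
scene = classify_with_tree(sample, width_nodes, deep_joint_net.predict)
```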
S4.4, during training, the average overall accuracy ACC and the loss function L_deep of the deep joint probability network are used as supervisory signals, where the average overall accuracy ACC is defined as:
ACC = (1/n) × Σ_{i=1}^{n} acc_i
where acc_i denotes the accuracy of the i-th class of the audio data set after passing through the whole joint discrimination classification tree model; in this embodiment there are n = 9 classes. L_deep is defined as:
L_deep = -(1/K) × Σ_{k=1}^{K} y_k log(ŷ_k)
where K denotes the number of input samples, y_k denotes the true label of the k-th sample, and ŷ_k denotes the predicted label of the k-th sample;
S4.5, the joint discrimination classification tree model is finally trained and optimized: a grid search is used to find suitable values of a, w_c and w_l; each time a value of a is selected, the structure of the joint discrimination classification tree model changes accordingly, and the width neural network and the deep joint probability network in the model are retrained on the input and output data corresponding to the changed structure; the whole joint discrimination classification tree model is trained and optimized so that ACC is maximized and L_deep is minimized, and the trained joint discrimination classification tree model is obtained after optimization.
In this embodiment, the grid search takes a from the values 2, 3 and 4, with w_c = 0.7 and w_l = 0.3; the best classification is finally achieved at a = 3, which maximizes ACC and minimizes L_deep. The resulting joint discrimination classification tree model is shown in Fig. 5: it consists of two width neural network nodes and one deep joint probability network node; the first node is split by the width neural network into three branches, namely class 6, class 7 and the second-node branch; the second node is split into class 1, class 2 and the third-node branch; and the third node is split by the deep neural network into class 0, class 3, class 4, class 5 and class 8. Each tip completes the classification of one of the 9 sound scene classes, and the model is obtained through the above construction, training and optimization.
S5, sound field Jing Bianshi:
and inputting the logarithmic mel spectrum characteristics of the test audio samples in the audio data set into a trained joint discrimination classification tree model to obtain the sound scene category to which each test sample belongs.
In summary, this embodiment combines the width neural network and the deep neural network in the form of a joint discrimination classification tree and provides a sound scene classification method based on width and depth neural networks. A width neural network is first adopted to distinguish certain sound scene categories, and a deep joint probability network is then adopted to distinguish the remaining sound scene categories. In addition, construction methods for the width neural network, the deep joint probability network and the joint discrimination classification tree model are provided. The invention can improve the training efficiency of the sound scene classification model, shorten the training time, guarantee training accuracy, and compensate for the poor generalization ability and stability of a single network.
For the overall network structure, the number of classes the deep neural network must distinguish is reduced by exploiting the fast training of width learning and its high accuracy on certain sound scene classes. Because the deep neural network handles fewer classes, the demands placed on it are reduced, the amount of training data is reduced, its training efficiency is improved, and the classification accuracy for certain sound scene categories can be improved. The joint discrimination classification tree model outperforms the deep joint probability network in both accuracy and training time.
The above examples are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above examples, and any other changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principle of the present invention should be made in the equivalent manner, and the embodiments are included in the protection scope of the present invention.

Claims (5)

1. The sound scene classification method based on the width and depth neural network is characterized by comprising the following steps of:
S1, establishing an audio data set: extracting logarithmic Mel spectrum features from the sound scene audio samples, and dividing them into a training set and a test set according to a proportion;
S2, constructing a width neural network: establishing a feature mapping layer and an enhancement layer, wherein the feature mapping layer and the enhancement layer perform feature mapping on an input sample, the mapped features are combined in parallel to form an input layer, and the input layer is connected with an output layer through a weight matrix;
S3, constructing a deep joint probability network: establishing a one-dimensional convolutional neural network and a long short-term memory network respectively, and then combining them into a deep joint probability network by taking a weighted average of their output probabilities;
S4, constructing a joint discrimination classification tree model: constructing a joint discrimination classification tree model according to the preliminary training results of the width neural network and the deep joint probability network, training and adjusting parameters of the joint discrimination classification tree model until the model converges, and obtaining a trained joint discrimination classification tree model;
the process of the step S4 is as follows:
S4.1, performing preliminary training on the width neural network constructed in step S2 and the deep joint probability network constructed in step S3 with the training set divided in step S1; the per-class classification accuracies of the width neural network are obtained and the classes are ranked from high to low (the higher the accuracy, the higher the rank), while the deep joint probability network obtains its pre-training weights;
S4.2, taking the width neural network and the deep joint probability network as the nodes of a classification tree, and setting the number BN of width neural network nodes and the number DN of deep joint probability network nodes, where BN is obtained from n and a by a rounding-down (floor) operation and DN = 1; here n denotes the number of classes of the entire audio data set and a is a hyper-parameter denoting the number of sub-classes the width neural network can split off, taking integer values within a range determined by n;
S4.3, constructing the joint discrimination classification tree model: the model consists of BN width neural network nodes and DN deep joint probability network nodes; all input samples first pass through the width neural network nodes, which separate out the BN×(a-1) classes with the highest accuracy, and the input samples of the remaining n-BN×(a-1) classes are then classified by the deep joint probability network node. The process is as follows:
S4.3.1, the joint discrimination classification tree model takes a width neural network as a branch node and extends downwards; after passing through the branch node, the data to be classified is output as width-sensitive class 1, width-sensitive class 2, ..., width-sensitive class a-1, or the width-insensitive class, where a is as defined in S4.2, i.e. the outputs of a branch node are formed by these classification results; if the classification result is a sensitive class, the result is output directly, and if the classification result is the width-insensitive class, the data to be classified is fed to the next node;
S4.3.2, if the number of width neural network nodes in the joint discrimination classification tree model has not reached BN, the model continues to extend downwards by repeating step S4.3.1; once the number of width neural network nodes reaches BN, step S4.3.3 is performed;
S4.3.3, the joint discrimination classification tree model takes the deep joint probability network as the last node of the classification tree; it receives as input the width-insensitive classes finally output by the BN width neural network nodes and produces the classification output for these width-insensitive classes, so that the terminal branches of the tree yield all the classes;
S4.4, during training, the two types of classification tree nodes are added gradually until the set number of nodes is reached, with the average overall accuracy ACC and the loss function L_deep of the deep joint probability network used as supervisory signals:
ACC = (1/n) × Σ_{i=1}^{n} acc_i
where acc_i denotes the accuracy of the i-th class of the audio data set after passing through the whole joint discrimination classification tree model, and n denotes the number of classes of the entire audio data set;
S4.5, finally training and optimizing the joint discrimination classification tree model: a grid search is used to find suitable values of a, w_c and w_l, where w_c and w_l respectively denote the weights of the one-dimensional convolutional neural network and the long short-term memory network; each time a value of a is selected, the structure of the joint discrimination classification tree model changes accordingly, and the width neural network and the deep joint probability network in the model are retrained on the input and output data corresponding to the changed structure; the whole joint discrimination classification tree model is trained and optimized so that ACC is maximized and the loss of the deep joint probability network, measured by the cross-entropy function L_deep, is minimized; the trained joint discrimination classification tree model is obtained after optimization;
S5, sound scene identification: inputting the logarithmic Mel spectrum features of the test audio samples into the trained joint discrimination classification tree model to obtain the sound scene categories of the test audio samples.
2. The method for classifying sound scenes based on the width and depth neural network according to claim 1, wherein the step S1 is as follows:
S1.1, acquiring audio data of sound scenes by using recording equipment or Internet resources, converting the sampling rate and quantization precision of the audio samples into a uniform format, and labeling the sound scene category to which each audio sample belongs;
S1.2, extracting logarithmic Mel spectrum features from the audio samples and carrying out mean normalization;
S1.3, randomly dividing the experimental data into mutually disjoint training and test sets, with the training set accounting for 70% and the test set for 30%.
3. The method for classifying sound scenes based on the width and depth neural network according to claim 1, wherein the step S2 is as follows:
S2.1, establishing a feature mapping layer: the feature mapping layer consists of N_1 feature windows, each containing N_2 feature nodes; N_1 and N_2 are chosen according to the number of features actually input to the width neural network, such that N_1×N_2 is approximately half the number of input features;
S2.2, establishing an enhancement layer with N_3 enhancement nodes, where (N_1×N_2) > N_3;
S2.3, the feature mapping layer performs feature mapping on the input samples. Let the sample set input to the width neural network be D_1, where the number of samples is c and the number of features per sample is f; a feature value equal to 1 is appended to each sample to obtain the augmented sample set D_2, so the number of features per sample becomes f+1. For each feature window a random weight matrix W_e is generated; W_e is an (f+1)×N_2 matrix whose values follow a Gaussian distribution with mean 0 and variance 1. A new feature matrix A_1 = D_2 × W_e is generated, with dimension c×N_2. A_1 is normalized and given a sparse representation, and the sparse matrix W = D_2^{-1} × A_1 is then solved, where D_2^{-1} denotes the (pseudo-)inverse of D_2. The feature nodes of one window are finally generated as T_1 = normal(D_2 × W), where normal() denotes normalization; the resulting T_1 has dimension c×N_2. Feature nodes are generated for all N_1 feature windows, finally giving the feature map y_b of the feature mapping layer, with dimension c×N_1×N_2;
S2.4, performing feature mapping for the enhancement layer: an orthogonal normalized weight matrix W_h of dimension (N_1×N_2)×N_3 is randomly generated, and the feature map of the enhancement layer is obtained from y_b as T_2 = tansig(y_b × W_h), where tansig() is the activation function of the neural network; the resulting feature map T_2 has dimension c×N_3;
S2.5, combining the mapped features in parallel to form the input layer: the feature maps y_b and T_2 are combined in parallel to obtain the input layer X = [y_b, T_2], so the feature dimension of each sample is N_2×N_1+N_3;
S2.6, connecting the input layer and the output layer through a weight matrix: the output Y of the output layer is the one-hot vector of the sound scene category labels, with dimension c×n_B, where n_B is the number of classification categories of the samples input to the width neural network; Y = X×W_B, where W_B is the weight matrix obtained by training the width neural network, with dimension (N_2×N_1+N_3)×n_B.
4. The method for classifying sound scenes based on the width and depth neural network according to claim 1, wherein the step S3 is as follows:
S3.1, constructing a one-dimensional convolutional neural network: the network consists of two or more one-dimensional convolutional layers, a fully-connected layer and a Softmax classification output layer in cascade; each one-dimensional convolutional layer is activated by a nonlinear activation function and then max-pooled before output. Let d^(l-1) and d^(l) denote the input and output of the l-th convolutional layer respectively; the input of the l-th convolutional layer is the output of the (l-1)-th convolutional layer. Since the l-th convolutional layer has several feature maps, consider one of them, d_i^(l); the output of the convolutional layer is expressed as:
d_i^(l) = fun(w_i^(l) * d^(l-1))
where * denotes the convolution operation, w_i^(l) denotes the kernel weight of the l-th layer, and fun() is a nonlinear activation function; the final output of the l-th layer is expressed as:
d^(l) = maxpooling(d_i^(l))
where maxpooling() denotes max pooling, whose result is used as the input of the next layer. The output of the last convolutional layer is passed through the fully-connected layer and finally through the Softmax classification output layer, which outputs the probability matrix y_c of the audio samples belonging to the different sound scene categories;
S3.2, constructing a long short-term memory network: the long short-term memory network consists of two long short-term memory layers and a Softmax classification output layer in cascade, with a Dropout layer optionally added after each long short-term memory layer; given an input sequence x = (x_1, …, x_T), each long short-term memory layer uses the hidden vector sequence h = (h_1, …, h_T) and iterates from t = 1 to t = T to generate the output y = (y_1, …, y_T):

o_t = σ(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t × tanh(C_t)
y_t = W_hy h_t + b_y

where W_o and b_o respectively represent the weight matrix and bias vector from the input layer to the hidden layer of the long short-term memory network, σ() represents the Sigmoid activation function, o_t and C_t respectively represent the output gate and the cell activation vector, h_t is the intermediate hidden-layer variable of the long short-term memory network, and W_hy and b_y respectively represent the weight matrix and bias vector from the hidden layer to the output layer of the long short-term memory network; finally, the long short-term memory network passes its output through the Softmax classification output layer to obtain the probability matrix y_l of each audio sample belonging to the different sound scene categories;
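A minimal PyTorch sketch of the S3.2 structure (two stacked long short-term memory layers with Dropout, followed by a Softmax classification output) might look like the following; the claim spells out only the output-gate equations, so relying on nn.LSTM for the full cell, and the feature and hidden sizes, are assumptions.

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Sketch of S3.2: two stacked LSTM layers (with Dropout between them)
    followed by a Softmax classification output layer."""
    def __init__(self, n_features=40, hidden=64, n_classes=10):
        super().__init__()
        # two long short-term memory layers; dropout is applied between the stacked layers
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            batch_first=True, dropout=0.3)
        self.out = nn.Linear(hidden, n_classes)   # plays the role of W_hy, b_y

    def forward(self, x):
        # x: (batch, T, n_features); the hidden state of the last step feeds the output layer
        h_seq, _ = self.lstm(x)
        logits = self.out(h_seq[:, -1, :])
        return torch.softmax(logits, dim=-1)       # y_l

# example usage: y_l = LSTMClassifier()(torch.randn(4, 100, 40))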
S3.3, combining the one-dimensional convolutional neural network and the long short-term memory network into a deep joint probability network through a weighted average of their output probabilities, expressed as:

y_a = w_c y_c + w_l y_l

where w_c and w_l respectively represent the weights of the one-dimensional convolutional neural network and the long short-term memory network, y_c represents the probability matrix of the audio samples output by the one-dimensional convolutional neural network, y_l represents the probability matrix output by the long short-term memory network, and y_a represents the probability matrix output by the deep joint probability network; the classification result y_result_k finally output by the deep joint probability network takes the sound scene category corresponding to the node with the maximum output probability:

y_result_k = argmax(y_a), 1 ≤ k ≤ K

where argmax() denotes taking the subscript corresponding to the maximum probability value, and K is the number of samples input to the deep joint probability network.
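The fusion and decision rule of S3.3 reduces to a few lines of NumPy; the equal weights below are an illustrative assumption, since the claim leaves w_c and w_l unspecified.

import numpy as np

def fuse_and_decide(y_c, y_l, w_c=0.5, w_l=0.5):
    """S3.3 sketch: weighted average of the two probability matrices, then the
    class with the largest fused probability is taken for each sample.

    y_c, y_l: K x n_classes probability matrices from the 1-D CNN and the LSTM.
    """
    y_a = w_c * y_c + w_l * y_l          # deep joint probability network output
    return np.argmax(y_a, axis=1)        # y_result_k for k = 1..K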
5. The method for classifying sound scenes based on the width and depth neural network according to claim 1, wherein the loss of the deep joint probability network uses a cross-entropy loss function L_deep, defined as:

L_deep = − Σ_{k=1}^{K} y_k log(ŷ_k)

where K represents the number of input samples, y_k represents the true label of the k-th sample, and ŷ_k represents the predicted label of the k-th sample.
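Assuming one-hot true labels and predicted probability vectors, the cross-entropy loss above can be sketched as follows; the epsilon term is added only for numerical safety and is not part of the claimed definition.

import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Cross-entropy L_deep over K samples (sketch).

    y_true: K x n_classes one-hot true labels
    y_pred: K x n_classes predicted probabilities (rows sum to 1)
    """
    return -np.sum(y_true * np.log(y_pred + eps))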
CN202010624687.1A 2020-07-02 2020-07-02 Sound field scene classification method based on width and depth neural network Active CN111723874B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010624687.1A CN111723874B (en) 2020-07-02 2020-07-02 Sound field scene classification method based on width and depth neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010624687.1A CN111723874B (en) 2020-07-02 2020-07-02 Sound field scene classification method based on width and depth neural network

Publications (2)

Publication Number Publication Date
CN111723874A CN111723874A (en) 2020-09-29
CN111723874B true CN111723874B (en) 2023-05-26

Family

ID=72571132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010624687.1A Active CN111723874B (en) 2020-07-02 2020-07-02 Sound field scene classification method based on width and depth neural network

Country Status (1)

Country Link
CN (1) CN111723874B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112309405A (en) * 2020-10-29 2021-02-02 平安科技(深圳)有限公司 Method and device for detecting multiple sound events, computer equipment and storage medium
CN113539283A (en) * 2020-12-03 2021-10-22 腾讯科技(深圳)有限公司 Audio processing method and device based on artificial intelligence, electronic equipment and storage medium
CN113411205B (en) * 2021-05-18 2023-02-28 郑州埃文计算机科技有限公司 Decision tree-based IP application scene division method
CN113689673A (en) * 2021-08-18 2021-11-23 广东电网有限责任公司 Cable monitoring protection method, device, system and medium
CN115249133B (en) * 2022-09-22 2023-02-14 华南理工大学 Building construction process risk classification method based on width learning network
CN115861302B (en) * 2023-02-16 2023-05-05 华东交通大学 Pipe joint surface defect detection method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108776820A (en) * 2018-06-07 2018-11-09 中国矿业大学 It is a kind of to utilize the improved random forest integrated approach of width neural network
CN109409516A (en) * 2017-08-11 2019-03-01 微软技术许可有限责任公司 Machine learning model for the tool depth and width that position is recommended
CN109859771A (en) * 2019-01-15 2019-06-07 华南理工大学 A kind of sound field scape clustering method of combined optimization deep layer transform characteristics and cluster process
CN109978034A (en) * 2019-03-18 2019-07-05 华南理工大学 A kind of sound scenery identification method based on data enhancing
CN111340771A (en) * 2020-02-23 2020-06-26 北京工业大学 Fine particle real-time monitoring method integrating visual information richness and wide-depth combined learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9754607B2 (en) * 2015-08-26 2017-09-05 Apple Inc. Acoustic scene interpretation systems and related methods

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409516A (en) * 2017-08-11 2019-03-01 微软技术许可有限责任公司 Machine learning model for the tool depth and width that position is recommended
CN108776820A (en) * 2018-06-07 2018-11-09 中国矿业大学 It is a kind of to utilize the improved random forest integrated approach of width neural network
CN109859771A (en) * 2019-01-15 2019-06-07 华南理工大学 A kind of sound field scape clustering method of combined optimization deep layer transform characteristics and cluster process
CN109978034A (en) * 2019-03-18 2019-07-05 华南理工大学 A kind of sound scenery identification method based on data enhancing
CN111340771A (en) * 2020-02-23 2020-06-26 北京工业大学 Fine particle real-time monitoring method integrating visual information richness and wide-depth combined learning

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
Anastasios Vafeiadis et al. Acoustic Scene Classification: From a Hybrid Classifier to Deep Learning. Detection and Classification of Acoustic Scenes and Events 2017, 2017, full text. *
Annamaria Mesaros et al. Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018, vol. 26 (no. 26), full text. *
Yanxiong Li et al. Acoustic Scene Classification Using Deep Audio Feature and BLSTM Network. ICALIP 2018, 2018, full text. *
Yanxiong Li et al. Acoustic Scene Clustering Using Joint Optimization of Deep Embedding Learning and Clustering Iteration. IEEE Transactions on Multimedia, 2020, vol. 22 (no. 22), full text. *
Yanxiong Li et al. Anomalous Sound Detection Using Deep Audio Representation and a BLSTM Network for Audio Surveillance of Roads. IEEE Access, 2018, full text. *
Yanxiong Li et al. Using multi-stream hierarchical deep neural network to extract deep audio feature for acoustic event detection. Multimedia Tools and Applications, 2017, full text. *
Zhao Zhihui et al. Wind power prediction method based on a width-depth neural network. Journal of China Academy of Electronics and Information Technology (中国电子科学研究院学报), 2019, vol. 14 (no. 14), full text. *

Also Published As

Publication number Publication date
CN111723874A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN111723874B (en) Sound field scene classification method based on width and depth neural network
CN108805188B (en) Image classification method for generating countermeasure network based on feature recalibration
CN108664632B (en) Text emotion classification algorithm based on convolutional neural network and attention mechanism
CN108122562A (en) A kind of audio frequency classification method based on convolutional neural networks and random forest
CN108847223B (en) Voice recognition method based on deep residual error neural network
CN112216271B (en) Audio-visual dual-mode speech recognition method based on convolution block attention mechanism
CN111723239B (en) Video annotation method based on multiple modes
WO2021042857A1 (en) Processing method and processing apparatus for image segmentation model
CN112784929B (en) Small sample image classification method and device based on double-element group expansion
CN112784031B (en) Method and system for classifying customer service conversation texts based on small sample learning
CN109858972B (en) Method and device for predicting advertisement click rate
CN112529638B (en) Service demand dynamic prediction method and system based on user classification and deep learning
CN111353583A (en) Deep learning network based on group convolution characteristic topological space and training method thereof
CN113109782B (en) Classification method directly applied to radar radiation source amplitude sequence
CN111783688B (en) Remote sensing image scene classification method based on convolutional neural network
CN116593980B (en) Radar target recognition model training method, radar target recognition method and device
CN112529637B (en) Service demand dynamic prediction method and system based on context awareness
CN112818982B (en) Agricultural pest image detection method based on depth feature autocorrelation activation
CN114511747A (en) Unbalanced load data type identification method based on VAE preprocessing and RP-2DCNN
CN114678030A (en) Voiceprint identification method and device based on depth residual error network and attention mechanism
CN113901820A (en) Chinese triplet extraction method based on BERT model
CN113361631A (en) Insulator aging spectrum classification method based on transfer learning
CN113823292B (en) Small sample speaker recognition method based on channel attention depth separable convolution network
CN111832588A (en) Riot and terrorist image labeling method based on integrated classification
CN113723456B (en) Automatic astronomical image classification method and system based on unsupervised machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant