CN110751044B - Urban noise identification method based on deep network migration characteristics and augmented self-coding


Info

Publication number
CN110751044B
Authority
CN
China
Prior art keywords
output
self
spectrogram
augmented
training
Prior art date
Legal status
Active
Application number
CN201910886926.8A
Other languages
Chinese (zh)
Other versions
CN110751044A (en)
Inventor
曹九稳 (Jiuwen Cao)
崔小南 (Xiaonan Cui)
王天磊 (Tianlei Wang)
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201910886926.8A
Publication of CN110751044A
Application granted
Publication of CN110751044B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks


Abstract

The invention discloses an urban noise identification method based on deep network migration characteristics and augmented self-coding. The invention comprises the following steps: 1. preprocessing each type of collected urban noise signal, including denoising, framing and windowing; 2. converting the processed noise signals into spectrograms; 3. performing feature extraction on the spectrograms obtained in step 2 with several pre-trained deep convolutional neural networks; 4. fusing the obtained feature vectors x with an augmented auto-encoder; 5. constructing a multi-layer one-class classification model on the basis of the fused features of step 4; 6. calculating the output weight and decision threshold of the ML-OCRLS; 7. carrying out classification prediction on unknown signals. The hidden-layer neurons of the proposed augmented auto-encoder can optimize all the features; the ML-OCRLS built on the augmented auto-encoder extracts the main information, reduces feature redundancy, and at the same time effectively fuses the various transfer-learning features, improving the classification accuracy of the classifier.

Description

Urban noise identification method based on deep network migration characteristics and augmented self-coding
Technical Field
The invention belongs to the field of sound signal identification and relates to an urban noise identification method based on deep network migration characteristics and augmented self-coding.
Background
With the advance of urbanization, noise pollution has become increasingly serious and greatly affects people's quality of life and health. Identifying the typical kinds of urban environmental noise and treating them accordingly is vital for monitoring and controlling urban noise pollution. Most existing methods perform urban noise recognition with traditional speech features combined with classifier algorithms. These methods have the following problems: 1) for the many and relatively complex categories of urban noise signals, traditional speech features cannot represent the signals effectively; 2) multi-feature urban noise identification methods relieve the problem that a single feature cannot represent all types of urban noise, but their fusion still stays at simple splicing, addition or multiplication; 3) traditional shallow multi-class classifier algorithms have limited generalization capability and complex model-updating procedures, and are ill-suited to modeling complex and changeable urban environmental noise.
Disclosure of Invention
To overcome these problems in urban noise identification, the invention provides an urban noise identification method based on deep network migration characteristics and augmented self-encoding. Each of the existing problems is addressed in turn. 1) Because traditional speech features cannot effectively express urban noise, several deep convolution networks pre-trained on ImageNet are adopted as feature extractors to obtain multiple convolution features from the urban-noise spectrogram; through convolution, pooling and similar operations, the deep convolution networks progressively extract deep image features and can learn the rich nonlinear information in the spectrogram, so the extracted convolution features have stronger clustering characteristics and generalization capability, and using several convolution features compensates for the inability of a single convolution feature to represent all noise signals effectively. 2) To fuse the multiple convolution features effectively, an augmented auto-encoder (AAE) is proposed, yielding a higher-level representation of the urban noise signals. 3) A multi-layer one-class classification model is proposed: a one-class model learns a data description of the target class from data containing only that class, and judges whether an unknown sample belongs to the target class against a threshold on a similarity measure instead of assigning it to a predefined class; one-class classification can therefore accurately identify new types of noise signal, and when a new class of noise is added only the corresponding one-class classifier needs to be trained, avoiding repeated training on known classes and shortening the training time of the classification model.
The technical scheme of the invention mainly comprises the following steps:
Step 1, preprocessing each type of collected urban noise signal, including denoising, framing and windowing, wherein the frame length is L and the frame shift is L/2.
Step 2, converting the processed noise signals into spectrograms.
2-1, performing a fast Fourier transform on the preprocessed sound-signal frames, taking the frequency bands after the transform as the vertical coordinate and the successive frames as the horizontal coordinate, so as to construct a two-dimensional image matrix in which each pixel is the energy of the current frame in the corresponding band;
2-2, calculating the spectral energy density of each pixel and expressing it by color intensity, obtaining an image carrying three dimensions of information, namely the spectrogram.
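By way of illustration, the following minimal Python/numpy sketch follows steps 2-1 and 2-2, assuming `frames` is the matrix of preprocessed, windowed frames produced by step 1 (one frame per row); the function name and the log-scaling of the energy density are our own choices, not taken from the patent.

```python
import numpy as np

def spectrogram(frames):
    """frames: (n_frames, L) matrix of windowed frames from step 1."""
    L = frames.shape[1]
    spec = np.fft.rfft(frames, axis=1)         # fast Fourier transform per frame
    power = np.abs(spec) ** 2 / L              # spectral energy density per band
    return 10.0 * np.log10(power + 1e-10).T    # rows: frequency bands, cols: frames

frames = np.random.randn(61, 512)              # stand-in for step-1 output
img = spectrogram(frames)
print(img.shape)                               # (257, 61): frequency x time image
```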
Step 3, performing feature extraction on the spectrogram obtained in step 2 with several pre-trained deep convolutional neural networks. Suppose M deep convolution networks are used: first the spectrogram is cropped and scaled according to each adopted network, then the adjusted spectrogram is input into the networks and the output of each network's last fully-connected layer is taken as its extracted features, giving D_1-, ..., D_M-dimensional feature vectors for the M networks respectively; the M features are spliced to obtain the spliced feature vector x.
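As a hedged illustration of step 3, the sketch below uses two ImageNet-pretrained torchvision networks (inception_v3 and resnet152) as fixed feature extractors; inception_resnet_v2 is omitted because it is not bundled with torchvision. Replacing each network's final fully-connected layer with Identity, so the forward pass returns the 2048-dimensional penultimate activations, is one common reading of "extract the output of the last fully-connected layer"; the input normalization expected by the pretrained weights is left out for brevity, and the weights download on first use.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

def make_extractor(arch):
    net = arch(weights="DEFAULT")   # ImageNet-pretrained weights
    net.fc = nn.Identity()          # return penultimate activations instead of logits
    return net.eval()

# input sizes follow each network's expected resolution
extractors = [(299, make_extractor(models.inception_v3)),
              (224, make_extractor(models.resnet152))]

def extract(img: Image.Image) -> torch.Tensor:
    feats = []
    with torch.no_grad():
        for size, net in extractors:           # crop/scale per network, then forward
            t = transforms.Compose([transforms.Resize((size, size)),
                                    transforms.ToTensor()])(img).unsqueeze(0)
            feats.append(net(t).squeeze(0))
    return torch.cat(feats)                    # spliced feature vector x

x = extract(Image.new("RGB", (300, 300)))      # stand-in spectrogram image
print(x.shape)                                 # torch.Size([4096]) for these two nets
```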
Step 4, fusing the obtained feature vectors x with the augmented auto-encoder.
4-1, step 3 yields the training data set X = [x_1, ..., x_N], where x_n (n = 1, ..., N) is the feature vector formed by splicing the M features of the nth spectrogram and N is the number of samples. For each sample in X, set every convolution feature except the mth (m = 1, ..., M) to 0, obtaining M new data sets X_m. Then keep two different convolution features of X, set the others to 0 and take the feature pairs in turn, obtaining M(M-1)/2 new data sets; continuing in this way over all feature subsets gives a total of 2^M - 1 new data sets, which are merged to obtain X_x as the input of the augmented auto-encoder, while X_y = {X, ..., X} (the full spliced data set repeated 2^M - 1 times) is constructed as the output of the auto-encoder.
For example, when two deep convolution networks are used to extract features from the spectrogram, the feature vectors obtained for the nth sample are x_n^(1) and x_n^(2), which are spliced into the feature vector x_n = [x_n^(1) x_n^(2)]^T, and the training data set is X = [X^(1) X^(2)]^T. Construct X_1 = [X^(1) 0]^T, X_2 = [0 X^(2)]^T and X_12 = [X^(1) X^(2)]^T; merge the three new data sets into X_x = {X_1, X_2, X_12} as the input of the augmented auto-encoder, and construct X_y = {X, X, X} as its output.
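Read this way, X_x enumerates every non-empty subset of the M feature blocks, with the excluded blocks zeroed, and X_y repeats the full spliced data set once per subset. A small numpy sketch under that reading (function and variable names are our own):

```python
import numpy as np
from itertools import combinations

def augment(blocks):
    """blocks: list of M arrays, each (D_m, N). Returns AAE input X_x and target X_y."""
    M = len(blocks)
    X = np.concatenate(blocks, axis=0)               # full spliced data set, (sum D_m, N)
    starts = np.cumsum([0] + [b.shape[0] for b in blocks])
    inputs, targets = [], []
    for r in range(1, M + 1):
        for keep in combinations(range(M), r):       # every non-empty feature subset
            Xm = np.zeros_like(X)
            for m in keep:                           # copy kept blocks, leave rest at 0
                Xm[starts[m]:starts[m + 1]] = blocks[m]
            inputs.append(Xm)
            targets.append(X)                        # the target is always the full X
    return np.hstack(inputs), np.hstack(targets)

X_x, X_y = augment([np.random.randn(4, 5), np.random.randn(3, 5)])
print(X_x.shape, X_y.shape)   # (7, 15) (7, 15): 2^2 - 1 = 3 masked copies of 5 samples
```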
4-2, randomly initialize the coding-layer weight and bias W_01 and b_01 and the decoding-layer weight and bias W_02 and b_02. The hidden-layer output of the AAE is H = g(W_01 X_x + b_01) and the output of the output layer is Φ = σ(W_02 H + b_02). Construct the following loss function:

J = (1/2)||Φ - X_y||^2 + (λ/2)(||W_01||^2 + ||W_02||^2) + β Σ_i KL(ρ || ρ̂_i)

where λ is the weight-decay parameter; ρ is the sparsity parameter, typically ρ > 0 and close to 0; ρ̂_i is the average activation value of the ith hidden-layer neuron of the self-encoder; β is a penalty factor; and the penalty term is

KL(ρ || ρ̂_i) = ρ log(ρ/ρ̂_i) + (1 - ρ) log((1 - ρ)/(1 - ρ̂_i)).
4-3, train the augmented self-encoder by stochastic gradient descent, and extract the trained coding-layer weight W_01 and bias b_01.
4-4, finally, encode the training data set X with the trained AAE; the coding output H_0 = g(W_01 X + b_01) is the fused feature.
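A hedged PyTorch sketch of steps 4-2 to 4-4: a single-hidden-layer auto-encoder is trained on (X_x, X_y) by stochastic gradient descent, with the optimizer's weight decay standing in for the λ term and a KL divergence implementing the sparsity penalty. The sigmoid activations, layer sizes, learning rate and epoch count are illustrative assumptions.

```python
import torch
import torch.nn as nn

d, h = 7, 16                                        # input and hidden sizes (assumed)
enc, dec = nn.Linear(d, h), nn.Linear(h, d)
opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()),
                      lr=0.1, weight_decay=1e-4)    # weight decay plays the role of lambda
rho, beta = 0.05, 3.0                               # sparsity target and penalty factor

Xx = torch.rand(100, d)                             # stand-in for the augmented input X_x
Xy = torch.rand(100, d)                             # stand-in for the target X_y

for epoch in range(200):
    H = torch.sigmoid(enc(Xx))                      # hidden output H = g(W01 X_x + b01)
    out = torch.sigmoid(dec(H))                     # reconstruction Phi
    rho_hat = H.mean(dim=0).clamp(1e-6, 1 - 1e-6)   # mean activation of each hidden neuron
    kl = (rho * torch.log(rho / rho_hat)
          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
    loss = 0.5 * ((out - Xy) ** 2).sum(dim=1).mean() + beta * kl
    opt.zero_grad(); loss.backward(); opt.step()

H0 = torch.sigmoid(enc(Xy))     # step 4-4: encode the original training set (Xy stands in)
```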
Step 5, constructing a multi-layer one-class classification model (ML-OCRLS) on the basis of the fused features of step 4.
5-1, extract the weight W_01 and bias b_01 of the AAE coding layer as the parameters between the input layer and the first hidden layer of the ML-OCRLS;
5-2, so that structural patterns in the input data can be discovered, a conventional sparse self-encoder is trained with H_0 as both input and output; the sparse self-encoder adds a sparsity constraint on hidden-neuron activations, with the following loss function:

J_1 = (1/2)||Φ_1 - H_0||^2 + (λ/2)(||W_11||^2 + ||W_12||^2) + β Σ_i KL(ρ || ρ̂_i)

where Φ_1 is the actual output of the sparse self-encoder.
5-3, train the self-encoder by stochastic gradient descent, then extract the trained coding-layer weight W_11 and bias b_11 as the parameters between the first and second hidden layers of the ML-OCRLS; the hidden-layer output H_1 = g(W_11 H_0 + b_11) serves as input and output for training the next self-encoder.
5-4, train several self-encoders in sequence; after training, the ML-OCRLS parameters W_01, ..., W_k1 and b_01, ..., b_k1 are obtained, where k is the number of self-encoders. Compute the hidden-layer output of the kth self-encoder, H_k = g(W_k1 H_{k-1} + b_k1).
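Steps 5-1 to 5-4 amount to greedy layer-wise training: each sparse self-encoder is trained with the previous hidden output as both input and target, and only its encoder half is kept as an ML-OCRLS layer. A self-contained sketch under that reading (hyperparameter values are assumptions):

```python
import torch
import torch.nn as nn

def train_sparse_ae(H, h, rho=0.05, beta=3.0, epochs=200, lr=0.1, lam=1e-4):
    """Minimal sparse auto-encoder trained with H as both input and target (step 5-2)."""
    enc, dec = nn.Linear(H.shape[1], h), nn.Linear(h, H.shape[1])
    opt = torch.optim.SGD([*enc.parameters(), *dec.parameters()],
                          lr=lr, weight_decay=lam)
    for _ in range(epochs):
        Hh = torch.sigmoid(enc(H))
        rho_hat = Hh.mean(0).clamp(1e-6, 1 - 1e-6)
        kl = (rho * torch.log(rho / rho_hat)
              + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
        loss = 0.5 * ((torch.sigmoid(dec(Hh)) - H) ** 2).sum(1).mean() + beta * kl
        opt.zero_grad(); loss.backward(); opt.step()
    return enc.weight.detach(), enc.bias.detach()   # keep only the encoder half

def stack_layers(H0, hidden_sizes):
    layers, H = [], H0
    for h in hidden_sizes:                          # one sparse AE per hidden layer
        W, b = train_sparse_ae(H, h)
        layers.append((W, b))
        H = torch.sigmoid(H @ W.T + b)              # H_k = g(W_k1 H_{k-1} + b_k1)
    return layers, H                                # H is the top hidden output H_k

layers, Hk = stack_layers(torch.rand(100, 16), [12, 8])
print(Hk.shape)                                     # torch.Size([100, 8])
```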
Step 6, calculating the output weight and the decision threshold of the ML-OCRLS.
6-1, set the expected output of the training samples T = [t_1, ..., t_N]^T = [1, ..., 1]^T and solve the minimization problem:

min_β (1/2)||β||^2 + (C/2)||T - H_k β||^2

where C is the trade-off coefficient between the two terms and β is the output-weight matrix of the ML-OCRLS. Solving the above problem gives:

β = (H_k^T H_k + I/C)^{-1} H_k^T T

obtaining the actual output of the training samples, O = H_k β;
6-2, calculate the distance of each sample from the target class by the formula:

d(x_i) = |o_i - t_i| = |ε_i|

Sort the distances in descending order to obtain d = [d_1, ..., d_N], and set the classification decision threshold θ = d_floor(μ·N), where μ is a threshold parameter.
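Because the solution of step 6-1 is in closed form, the output weight and the threshold can be computed directly. A numpy sketch, assuming H_k is the N x h top hidden-layer output and θ is taken as the floor(μ·N)-th largest distance (1-based):

```python
import numpy as np

def fit_output(Hk, C=1.0, mu=0.1):
    """Closed-form output weight and decision threshold of step 6."""
    N, h = Hk.shape
    T = np.ones((N, 1))                                   # expected output t_i = 1
    beta = np.linalg.solve(Hk.T @ Hk + np.eye(h) / C,     # (Hk'Hk + I/C)^-1 Hk'T
                           Hk.T @ T)
    d = np.sort(np.abs(Hk @ beta - T).ravel())[::-1]      # distances, descending
    theta = d[max(int(np.floor(mu * N)), 1) - 1]          # the floor(mu*N)-th largest
    return beta, theta

beta, theta = fit_output(np.random.rand(100, 8))
print(beta.shape, float(theta))
```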
Step 7, carrying out classification prediction on unknown signals.
For an unknown sound signal z: after preprocessing, convert z into a spectrogram; extract features with the same deep convolution networks to obtain the M features; splice them and input the result to the ML-OCRLS for fusion and classification, whose output is o(z) = h_k(z) β, where h_k(z) is the kth hidden-layer output for z. Then compute the distance of signal z from the target class:

d(z) = |o(z) - 1|

and judge the class of the unknown sample according to the decision function:

f(z) = target class if d(z) ≤ θ; otherwise non-target class.
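A small numpy sketch of the prediction path of step 7, assuming the fused feature vector of z has already been computed and that `layers`, `beta` and `theta` are numpy counterparts of the quantities fitted in the sketches above:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict(x, layers, beta, theta):
    """x: fused feature vector of z; layers: [(W, b), ...] from the stacked encoders."""
    H = x[None, :]
    for W, b in layers:                  # propagate: H_k = g(W_k1 H_{k-1} + b_k1)
        H = sigmoid(H @ W.T + b)
    d = abs(float(H @ beta) - 1.0)       # distance of z from the target class
    return d <= theta                    # True: target class, False: non-target/new class
```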
the invention has the following beneficial effects:
the invention trains corresponding ML-OCRLS aiming at different types of urban noise signals, can accurately identify the new type of urban noise, does not need to retrain the known category when the urban noise increases the new category, and can obviously reduce the model training time. Compared with the traditional voice feature extraction method and the method using a single deep convolution network as a feature extractor, the method uses a plurality of deep convolution networks to extract features of the spectrogram, the extracted features can effectively represent all noise signals, and richer and more detailed information of the spectrogram can be obtained. Compared with a common feature fusion method, the hidden layer neuron of the augmented self-encoder provided by the invention can optimize all features, main information can be extracted based on the ML-OCRLS of the augmented self-encoder, feature redundancy is reduced, and meanwhile, various transfer learning features are effectively fused, so that the classification precision of a classifier is improved.
Drawings
FIG. 1 is a flow chart of the proposed urban noise identification method based on deep network migration features and augmented self-coding;
FIG. 2(a) shows the structure of the inception_v3 model;
FIG. 2(b) shows the structure of the resnet152 model;
FIG. 2(c) shows the structure of the inception_resnet_v2 model;
FIG. 3 is a diagram of the ML-OCRLS network architecture.
Detailed Description
The invention is further illustrated by the following figures and examples.
Taking 11 kinds of urban noise signal as an example, the invention is further explained using three deep convolutional neural networks pre-trained on ImageNet, namely inception_v3, resnet152 and inception_resnet_v2, as feature extractors. The following description is exemplary and explanatory only and does not restrict the invention in any way.
As shown in FIG. 1, the urban noise identification method based on deep network migration features and augmented self-coding is specifically implemented as follows:
Step 1, preprocessing each type of collected urban noise signal, including denoising, framing and windowing, wherein the frame length is L and the frame shift is L/2.
1-1 Normalization and pre-emphasis
First, normalize the amplitude of the collected urban noise signal to [-1, 1] to reduce the influence of amplitude differences on the recognition result; then pre-emphasize the signal with a first-order high-pass filter, whose transfer function is H(z) = 1 - u·z^(-1) with u in the range [0.9, 1];
1-2 Framing and windowing
Frame the preprocessed signal to obtain quasi-stationary short-time signals, and apply a window function to each frame to reduce the spectral leakage of the framed sound signal: each frame of the sound signal x(n) is multiplied by a window function w(n) of the same length, giving the windowed frame x_i(n) = w(n)·x(n). A Hanning window is used here:

w(n) = 0.5 (1 - cos(2πn/(L - 1))), 0 ≤ n ≤ L - 1
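A numpy sketch combining steps 1-1 and 1-2: amplitude normalization, first-order pre-emphasis H(z) = 1 - u·z^(-1), framing and Hanning windowing. The values u = 0.97 (inside the stated [0.9, 1] range), L = 512 and the 50% frame shift are illustrative assumptions; denoising is omitted.

```python
import numpy as np

def preprocess(x, L=512, u=0.97):
    x = x / (np.abs(x).max() + 1e-12)                 # normalize amplitude to [-1, 1]
    x = np.append(x[0], x[1:] - u * x[:-1])           # pre-emphasis: H(z) = 1 - u z^-1
    shift = L // 2                                    # assumed 50% frame shift
    n = 1 + (len(x) - L) // shift
    w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(L) / (L - 1)))   # Hanning window
    return np.stack([x[i*shift : i*shift + L] * w for i in range(n)])

frames = preprocess(np.random.randn(16000))           # stand-in noise recording
print(frames.shape)                                   # (61, 512) windowed frames
```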
and 2, converting the processed sound signal into a spectrogram. Firstly, performing fast Fourier transform on preprocessed sound signal data, taking each frequency band after the transform as a vertical coordinate, and simultaneously using multi-frame signals as a horizontal coordinate to construct a two-dimensional image matrix, wherein each pixel point is the energy of the frame signal in the corresponding frequency band, then calculating the spectral energy density of each point, and expressing the energy density by the shade of the tone to obtain a three-dimensional sound spectrogram.
Step 3, performing feature extraction on the spectrogram obtained in step 2 with three pre-trained deep convolutional neural networks, namely inception_v3, resnet152 and inception_resnet_v2.
First crop and scale the spectrogram for each network: the input image sizes of inception_v3, resnet152 and inception_resnet_v2 are 299 x 299 x 3, 224 x 224 x 3 and 299 x 299 x 3, respectively. The cropped and scaled spectrograms are input into the networks to extract deep features;
The inception_v3 network introduces the idea of factorizing into small convolutions: a larger two-dimensional convolution is decomposed into two smaller one-dimensional convolutions, which saves a large number of parameters, speeds up computation and alleviates overfitting, while the added layer of nonlinearity expands the model's expressive power, allowing more and richer spatial features to be processed and increasing feature diversity. As shown in the inception_v3 structure diagram of FIG. 2(a), the 2048-dimensional features at the network's last fully-connected linear (logits) layer are extracted as the inception_v3 features;
The resnet network introduces the residual learning unit to alleviate the degradation of deep neural networks; resnet152 has a very deep structure, so its features are more abstract and semantically richer. As shown in the resnet152 structure diagram of FIG. 2(b), the 2048-dimensional features of conv5_x, the network's last stage, are extracted as the resnet152 features;
The inception_resnet_v2 network adds resnet-style shortcut connections inside the inception modules, which avoids the degradation problem of very deep structures and also reduces training time. As shown in the inception_resnet_v2 structure diagram of FIG. 2(c), the 1536-dimensional features of the network's last fully-connected layer are extracted as the inception_resnet_v2 features;
The three features are then spliced into the 5632-dimensional feature vector x = [x^(1) x^(2) x^(3)]^T.
Step 4, fusing the obtained feature vectors with the augmented auto-encoder. Step 3 yields the training data set X = [x_1, ..., x_N], where x_n is the feature vector formed by splicing the three features of the nth spectrogram and N is the number of samples. For each sample in X, set every convolution feature except the mth (m = 1, 2, 3) to 0, obtaining three new data sets X_1 = [X^(1) 0 0]^T, X_2 = [0 X^(2) 0]^T and X_3 = [0 0 X^(3)]^T. Then keep two different convolution features of X, set the other to 0, and take the feature pairs in turn, obtaining three new data sets X_12 = [X^(1) X^(2) 0]^T, X_13 = [X^(1) 0 X^(3)]^T and X_23 = [0 X^(2) X^(3)]^T. Finally keep all three features in X, obtaining the new data set X_123 = [X^(1) X^(2) X^(3)]^T. Merge them to obtain X_x = {X_1, X_2, X_3, X_12, X_13, X_23, X_123} as the input of the augmented auto-encoder, and construct X_y = {X, X, X, X, X, X, X} as its output.
Then randomly initialize the weights and biases of the coding and decoding layers, W_01, b_01, W_02 and b_02; the hidden-layer output of the AAE is H = g(W_01 X_x + b_01) and the actual output is Φ = σ(W_02 H + b_02). The loss function is constructed as in step 4-2:

J = (1/2)||Φ - X_y||^2 + (λ/2)(||W_01||^2 + ||W_02||^2) + β Σ_i KL(ρ || ρ̂_i)

where the weight-decay parameter λ and the penalty factor β are determined by grid search, and the sparsity parameter ρ is empirically set to 0.05.
Train the augmented self-encoder by stochastic gradient descent and extract the trained coding-layer weight W_01 and bias b_01.
Finally, encode the training data set X with the trained AAE; the coding output H_0 = g(W_01 X + b_01) is the fused feature.
Step 5, constructing the multi-layer one-class classification model (ML-OCRLS) on the basis of the fused features of step 4. First, the extracted AAE coding-layer weight W_01 and bias b_01 serve as the parameters between the input layer and the first hidden layer of the ML-OCRLS. Then a conventional sparse self-encoder is trained with H_0 as both input and output; so that structural patterns in the input data can be discovered, the sparse self-encoder adds a sparsity constraint on hidden-neuron activations, with the loss function

J_1 = (1/2)||Φ_1 - H_0||^2 + (λ/2)(||W_11||^2 + ||W_12||^2) + β Σ_i KL(ρ || ρ̂_i)

where Φ_1 is the actual output of the self-encoder. Train the self-encoder by stochastic gradient descent, then extract the trained coding-layer weight W_11 and bias b_11 as the parameters between the first and second hidden layers of the ML-OCRLS; the hidden-layer output H_1 = g(W_11 H_0 + b_11) serves as input and output for training the next self-encoder.
Train several self-encoders in sequence; after training, the ML-OCRLS parameters W_01, ..., W_k1 and b_01, ..., b_k1 are obtained, where k is the number of self-encoders. As shown in the ML-OCRLS network structure diagram of FIG. 3, compute the hidden-layer output of the kth self-encoder, H_k = g(W_k1 H_{k-1} + b_k1).
Step 6, calculating the output weight and the decision threshold of the ML-OCRLS. Set the expected output of the training samples T = [t_1, ..., t_N]^T = [1, ..., 1]^T and solve the minimization problem

min_β (1/2)||β||^2 + (C/2)||T - H_k β||^2

where C is the trade-off coefficient between the two terms and β is the output-weight matrix of the ML-OCRLS. Solving the above problem gives

β = (H_k^T H_k + I/C)^{-1} H_k^T T

and the actual output of the training samples, O = H_k β. Then calculate the distance of each sample from the target class:

d(x_i) = |o_i - t_i| = |ε_i|

Sort the distances in descending order, d = [d_1, ..., d_N] with d_1 ≥ ... ≥ d_N, and set the classification decision threshold θ = d_floor(μ·N), where μ is a threshold parameter, empirically set here to 0.1.
Step 7, making a classification decision on unknown signals. For an unknown sound signal z: after preprocessing, convert the signal into a spectrogram; extract features with the same deep convolution networks to obtain the three features; splice them and input the result to the ML-OCRLS for fusion and classification, whose output is o(z) = h_k(z) β. Then compute the distance of z from the target class,

d(z) = |o(z) - 1|

and judge the class of the unknown sample according to the decision function:

f(z) = target class if d(z) ≤ θ; otherwise non-target class.

Claims (1)

1. The urban noise identification method based on the deep network migration characteristic and the augmented self-coding is characterized by comprising the following steps of:
step 1, preprocessing each type of collected urban noise signal, including denoising, framing and windowing, wherein the frame length is L and the frame shift is L/2;
step 2, converting the processed noise signals into spectrograms;
step 3, extracting features from the spectrogram obtained in step 2 by using a plurality of pre-trained deep convolutional neural networks: supposing that M deep convolution networks are used, firstly cropping and scaling the spectrogram according to the adopted networks, then inputting the adjusted spectrogram into the networks and extracting the output of each network's last fully-connected layer as its features, obtaining D_1-, ..., D_M-dimensional feature vectors for the M networks respectively, and splicing the M features to obtain the spliced feature vector x;
step 4, fusing the obtained feature vector x by using the augmented self-encoder;
step 5, constructing the multi-layer one-class classification model ML-OCRLS on the basis of the fused features of step 4;
step 6, calculating the output weight and the decision threshold of the ML-OCRLS;
step 7, carrying out classification prediction on unknown signals;
the step 2 is realized as follows:
2-1, performing a fast Fourier transform on the preprocessed sound-signal frames, taking the frequency bands after the transform as the vertical coordinate and the successive frames as the horizontal coordinate, so as to construct a two-dimensional image matrix in which each pixel is the energy of the current frame in the corresponding band;
2-2, calculating the spectral energy density of each pixel and expressing it by color intensity, obtaining an image carrying three dimensions of information, namely the spectrogram;
and 4, fusing the obtained feature vector x by using the augmented self-encoder, and specifically realizing the following steps:
4-1, obtaining through step 3 the training data set X = [x_1, ..., x_N], where x_n (n = 1, ..., N) is the feature vector formed by splicing the M features of the nth spectrogram and N is the number of samples; setting every convolution feature of each sample in X except the mth (m = 1, ..., M) to 0, obtaining M new data sets X_m; then keeping two different convolution features of X, setting the others to 0 and taking the feature pairs in turn, obtaining M(M-1)/2 new data sets; and so on, for a total of 2^M - 1 new data sets, which are merged to obtain X_x' as the input of the augmented self-encoder, and constructing X_y = {X, ..., X} (the full spliced data set repeated 2^M - 1 times) as the output of the augmented self-encoder;
4-2, randomly initializing the coding-layer weight and bias W_01 and b_01 and the decoding-layer weight and bias W_02 and b_02; the hidden-layer output of the AAE is H = g(W_01 X_x' + b_01) and the output of the output layer is Φ = σ(W_02 H + b_02); constructing the following loss function:

J = (1/2)||Φ - X_y||^2 + (λ/2)(||W_01||^2 + ||W_02||^2) + β Σ_i KL(ρ || ρ̂_i)

where λ is the weight-decay parameter; ρ is the sparsity parameter, usually ρ > 0 and close to 0; ρ̂_i is the average activation value of the ith hidden-layer neuron of the self-encoder; β is a penalty factor; and the penalty term is

KL(ρ || ρ̂_i) = ρ log(ρ/ρ̂_i) + (1 - ρ) log((1 - ρ)/(1 - ρ̂_i));
4-3, training the augmented self-encoder by stochastic gradient descent and extracting the trained coding-layer weight W_01 and bias b_01;
4-4, finally, encoding the training data set X with the trained AAE, the coding output being H_0 = g(W_01 X + b_01), where H_0 is the fused feature;
the step 4-1 is specifically realized as follows:
using two deep convolution networks to extract features from the spectrogram, the feature vectors obtained for the nth sample are x_n^(1) and x_n^(2), which are spliced into the feature vector x_n = [x_n^(1) x_n^(2)]^T; the training data set is X = [X^(1) X^(2)]^T; respectively constructing X_1 = [X^(1) 0]^T, X_2 = [0 X^(2)]^T and X_12 = [X^(1) X^(2)]^T, merging the three new data sets to obtain X_x' = {X_1, X_2, X_12} as the input of the augmented self-encoder, and constructing X_y = {X, X, X} as the output of the augmented self-encoder;
step 5, on the basis of the fusion characteristics in step 4, constructing a multi-layer one-class classification model, which is specifically realized as follows:
5-1, extracting the weight W_01 and bias b_01 of the AAE coding layer as the parameters between the input layer and the first hidden layer of the ML-OCRLS;
5-2, using a sparse self-encoder trained with H_0 as both input and output; the sparse self-encoder adds a sparsity constraint on hidden-neuron activations, with the following loss function:

J_1 = (1/2)||Φ_1 - H_0||^2 + (λ/2)(||W_11||^2 + ||W_12||^2) + β Σ_i KL(ρ || ρ̂_i)

where Φ_1 is the actual output of the sparse self-encoder;
5-3, training the self-encoder by stochastic gradient descent, then extracting the trained coding-layer weight W_11 and bias b_11 as the parameters between the first and second hidden layers of the ML-OCRLS, and obtaining the hidden-layer output H_1 = g(W_11 H_0 + b_11), which serves as input and output for training the next self-encoder;
5-4, training several self-encoders in sequence, the ML-OCRLS parameters W_01, ..., W_k1 and b_01, ..., b_k1 being obtained after training, where k is the number of self-encoders; computing the hidden-layer output of the kth self-encoder, H_k = g(W_k1 H_{k-1} + b_k1);
The calculation of the output weight and the decision threshold of the ML-OCRLS in the step 6 is specifically realized as follows:
6-1, setting the expected output of the training samples T = [t_1, ..., t_N]^T = [1, ..., 1]^T and solving the minimization problem:

min_β' (1/2)||β'||^2 + (C/2)||T - H_k β'||^2

where C is the trade-off coefficient between the two terms and β' is the output-weight matrix of the ML-OCRLS; solving the above problem gives:

β' = (H_k^T H_k + I/C)^{-1} H_k^T T

obtaining the actual output of the training samples, O = H_k β';
6-2, calculating the distance of each sample from the target class by the formula:

d(x_i) = |o_i - t_i| = |ε_i|

sorting the distances in descending order to obtain d = [d_1, ..., d_N], and setting the classification decision threshold θ = d_floor(μ·N), where μ is a threshold parameter;
the classification prediction of the unknown signal in the step 7 is specifically realized as follows:
for an unknown sound signal z, converting the preprocessed z into a spectrogram, extracting features with the same deep convolution networks to obtain the M features, splicing them and inputting the result into the ML-OCRLS for fusion and classification, the output being o(z) = h_k(z) β'; then computing the distance of signal z from the target class,

d(z) = |o(z) - 1|

and judging the class of the unknown sample according to the decision function:

f(z) = target class if d(z) ≤ θ; otherwise non-target class.
CN201910886926.8A 2019-09-19 2019-09-19 Urban noise identification method based on deep network migration characteristics and augmented self-coding Active CN110751044B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910886926.8A CN110751044B (en) 2019-09-19 2019-09-19 Urban noise identification method based on deep network migration characteristics and augmented self-coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910886926.8A CN110751044B (en) 2019-09-19 2019-09-19 Urban noise identification method based on deep network migration characteristics and augmented self-coding

Publications (2)

Publication Number Publication Date
CN110751044A CN110751044A (en) 2020-02-04
CN110751044B 2022-07-29

Family

ID=69276686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910886926.8A Active CN110751044B (en) 2019-09-19 2019-09-19 Urban noise identification method based on deep network migration characteristics and augmented self-coding

Country Status (1)

Country Link
CN (1) CN110751044B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401236A (en) * 2020-03-16 2020-07-10 西北工业大学 Underwater sound signal denoising method based on self-coding neural network
CN111653290B (en) * 2020-05-29 2023-05-02 北京百度网讯科技有限公司 Audio scene classification model generation method, device, equipment and storage medium
CN111833653A (en) * 2020-07-13 2020-10-27 江苏理工学院 Driving assistance system, method, device, and storage medium using ambient noise
CN111985533B (en) * 2020-07-14 2023-02-03 中国电子科技集团公司第三十六研究所 Incremental underwater sound signal identification method based on multi-scale information fusion
CN112086100B (en) * 2020-08-17 2022-12-02 杭州电子科技大学 Quantization error entropy based urban noise identification method of multilayer random neural network
CN111912521B (en) * 2020-08-17 2021-08-06 湖南五凌电力科技有限公司 Frequency detection method of non-stationary signal and storage medium
CN112614298A (en) * 2020-12-09 2021-04-06 杭州拓深科技有限公司 Composite smoke sensation monitoring method based on intra-class interaction constraint layering single classification
CN113065454B (en) * 2021-03-30 2023-01-17 青岛海信智慧生活科技股份有限公司 High-altitude parabolic target identification and comparison method and device
CN113112003A (en) * 2021-04-15 2021-07-13 东南大学 Data amplification and deep learning channel estimation performance improvement method based on self-encoder
CN113837154B (en) * 2021-11-25 2022-03-25 之江实验室 Open set filtering system and method based on multitask assistance
CN114724549B (en) * 2022-06-09 2022-09-06 广州声博士声学技术有限公司 Intelligent identification method, device, equipment and storage medium for environmental noise


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11163269B2 (en) * 2017-09-11 2021-11-02 International Business Machines Corporation Adaptive control of negative learning for limited reconstruction capability auto encoder

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107293301A (en) * 2017-05-27 2017-10-24 深圳大学 Recognition methods and system based on dental articulation sound
CN107610692A (en) * 2017-09-22 2018-01-19 杭州电子科技大学 The sound identification method of self-encoding encoder multiple features fusion is stacked based on neutral net
CN108922560A (en) * 2018-05-02 2018-11-30 杭州电子科技大学 A kind of city noise recognition methods based on interacting depth neural network model
CN109829352A (en) * 2018-11-20 2019-05-31 中国人民解放军陆军工程大学 Communication fingerprint identification method integrating multilayer sparse learning and multi-view learning
CN109902393A (en) * 2019-03-01 2019-06-18 哈尔滨理工大学 Fault Diagnosis of Roller Bearings under a kind of variable working condition based on further feature and transfer learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on sound recognition algorithms for excavation equipment based on feature fusion; 曹九稳 (Jiuwen Cao); China Excellent Master's and Doctoral Theses Full-text Database (Master), Engineering Science and Technology II; 2019-01-15; full text *

Also Published As

Publication number Publication date
CN110751044A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
CN110751044B (en) Urban noise identification method based on deep network migration characteristics and augmented self-coding
CN112364779B (en) Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion
CN110245608B (en) Underwater target identification method based on half tensor product neural network
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN112216271B (en) Audio-visual dual-mode speech recognition method based on convolution block attention mechanism
CN111627419B (en) Sound generation method based on underwater target and environmental information characteristics
CN106782511A (en) Amendment linear depth autoencoder network audio recognition method
CN106847309A (en) A kind of speech-emotion recognition method
CN111429947B (en) Speech emotion recognition method based on multi-stage residual convolutional neural network
CN113191178B (en) Underwater sound target identification method based on auditory perception feature deep learning
CN113488060B (en) Voiceprint recognition method and system based on variation information bottleneck
CN112183582A (en) Multi-feature fusion underwater target identification method
CN113111786B (en) Underwater target identification method based on small sample training diagram convolutional network
CN111276187A (en) Gene expression profile feature learning method based on self-encoder
CN113611293A (en) Mongolian data set expansion method
CN114694255B (en) Sentence-level lip language recognition method based on channel attention and time convolution network
Mu et al. Voice activity detection optimized by adaptive attention span transformer
CN113673323A (en) Underwater target identification method based on multi-depth learning model joint decision system
CN113851148A (en) Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
CN117310668A (en) Underwater sound target identification method integrating attention mechanism and depth residual error shrinkage network
CN109741733B (en) Voice phoneme recognition method based on consistency routing network
CN113643722B (en) Urban noise identification method based on multilayer matrix random neural network
CN116417011A (en) Underwater sound target identification method based on feature fusion and residual CNN
CN114818789A (en) Ship radiation noise identification method based on data enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant