CN117238320B - Noise classification method based on multi-feature fusion convolutional neural network


Info

Publication number: CN117238320B
Application number: CN202311524974.5A
Authority: CN (China)
Other versions: CN117238320A
Prior art keywords: feature extraction, module, feature, noise, matrix
Legal status: Active (granted)
Inventors: 张涛, 胡柏洋, 刘炜杰, 耿彦章, 王恒
Assignee: Tianjin University
Application filed by Tianjin University; published as CN117238320A and granted as CN117238320B

Landscapes

  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention provides a noise classification method based on a multi-feature fusion convolutional neural network, which comprises the steps of acquiring a noise data set and extracting a spectrogram, a Mel spectrogram, a first-order differential Mel spectrogram and a second-order differential Mel spectrogram of the noise audio signal; extracting feature matrices X1, X2, X3 and X4 from the spectrogram, the Mel spectrogram, the first-order differential Mel spectrogram and the second-order differential Mel spectrogram respectively; forming a feature matrix X5 from the four noise evaluation indexes of noise kurtosis, equivalent A sound level, accumulated noise exposure and sound entropy; concatenating the feature matrices X1, X2, X3, X4 and X5 in series to obtain the feature matrix X, and then performing dimension reduction on X; and constructing a classification model MCCNN and inputting the dimension-reduced feature matrix X into the model to obtain the noise classification result. The method of the invention can improve the accuracy of noise classification.

Description

Noise classification method based on multi-feature fusion convolutional neural network
Technical Field
The invention belongs to the technical field of noise classification, and particularly relates to a noise classification method based on a multi-feature fusion convolutional neural network.
Background
Noise has a wide and profound negative impact on people's production and life. Noise generated by different sources is diverse and differs significantly in its effects on human health and the environment. Accurate and systematic noise classification, together with corresponding control methods, therefore has important scientific and practical value. The Law of the People's Republic of China on the Prevention and Control of Noise Pollution, based on analysis of a large amount of noise data, classifies and controls noise from four angles: industrial noise, construction noise, traffic noise and social life noise. However, current noise classification work still has problems and limitations. First, the diversity and complexity of noise sources increase the difficulty of classification, especially for industrial noise and construction noise, where accuracy needs further improvement. Second, existing noise classification work is mainly aimed at industrial noise and traffic noise, giving little consideration to construction noise and social life noise; some methods are too fine-grained and cumbersome, and a comprehensive classification method applicable to a wider range of scenes is lacking.
Disclosure of Invention
In view of this, the present invention aims to overcome the shortcomings of the prior art and provides a noise classification method based on a multi-feature fusion convolutional neural network. Referring to the Law of the People's Republic of China on the Prevention and Control of Noise Pollution, the invention constructs a corresponding data set covering the classification ranges of industrial noise, construction noise, traffic noise and social life noise, and performs high-accuracy classification of these four types of noise data.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
a noise classification method based on a multi-feature fusion convolutional neural network comprises the following steps:
step 1: acquiring a noise data set, and extracting a spectrogram, a Mel spectrogram, a first-order differential Mel spectrogram and a second-order differential Mel spectrogram of the noise data;
step 2: extracting a 1x1024 feature matrix X1 from the spectrogram, a 1x1024 feature matrix X2 from the Mel spectrogram, a 1x1024 feature matrix X3 from the first-order differential Mel spectrogram and a 1x1024 feature matrix X4 from the second-order differential Mel spectrogram; forming a feature matrix X5 from the four noise evaluation indexes of noise kurtosis, equivalent A sound level, accumulated noise exposure and sound entropy; concatenating the feature matrices X1, X2, X3, X4 and X5 in series to obtain the feature matrix X, and then performing dimension reduction on the feature matrix X;
step 3: constructing a classification model MCCNN, and inputting the feature matrix X subjected to dimension reduction treatment into the classification model MCCNN to obtain a noise classification result;
the classification model MCCNN comprises a first feature extraction module, a second feature extraction module, a third feature extraction module and a fourth feature extraction module which are sequentially connected, and a first jump residual error connection module, a second jump residual error connection module and a third jump residual error connection module which are respectively connected with the second feature extraction module, the third feature extraction module and the fourth feature extraction module, wherein the first feature extraction module, the second feature extraction module, the third feature extraction module and the fourth feature extraction module comprise a convolution module, a normalization module, a ReLu module, a pooling operation module and a concat layer, and the first jump residual error connection module, the second jump residual error connection module and the third jump residual error connection module comprise a residual error jump convolution module and a residual error jump ReLu module;
The first feature extraction module extracts features of a feature matrix input by an input end through convolution-normalization-ReLu-pooling operation, and then the feature matrix is combined with jump connection of the input end through a concat layer to obtain a first feature extraction result; the input end of the second feature extraction module extracts features from the input first feature extraction result through convolution-normalization-ReLu-pooling operation, simultaneously, the features are extracted through the first jump residual error connection module, the results of the first feature extraction module and the second feature extraction module are combined, and then the jump connection of the input end of the second feature extraction module and the jump connection of the input end of the second feature extraction module are combined through a concat layer, so that a second feature extraction result is obtained; the input end of the third feature extraction module extracts features through convolution-normalization-ReLu-pooling operation on the input second feature extraction result, meanwhile, features are extracted through the second jump residual error connection module, the results of the convolution-normalization-ReLu-pooling operation and the second jump residual error connection module are combined, and then the jump residual error connection module is combined with the input end of the third feature extraction module through a concat layer to obtain a third feature extraction result; the input end of the fourth feature extraction module extracts features through convolution-normalization-ReLu-pooling operation on the input third feature extraction result, meanwhile, features are extracted through the third jump residual error connection module, the results of the convolution-normalization-ReLu-pooling operation and the third jump residual error connection module are combined with the jump connection of the input end of the third feature 
extraction module through a concat layer to obtain a fourth feature extraction result, the first feature extraction result, the second feature extraction result, the third feature extraction result and the fourth feature extraction result are combined in a fifth concat layer, and then the final classification result is obtained through a full connection layer, a softmax and a classout layer.
Further, in the step 2, extracting a 1x1024 feature matrix X1 from the spectrogram comprises the following steps:
selecting a frequency bandwidth range of interest from the spectrogram;
performing dimension reduction on the data matrix within the selected frequency bandwidth range;
truncating the dimension-reduced data to obtain the required feature matrix X1.
Further, in the step 2, extracting a 1x1024 feature matrix X2 from the Mel spectrogram comprises the following steps:
carrying out logarithmic compression on the Mel spectrogram to obtain a logarithmic Mel spectrogram;
dividing the logarithmic mel spectrogram into a plurality of segments, wherein each segment has a plurality of frequency bandwidths, and averaging the data in each frequency bandwidth to obtain an average value of the data;
normalizing the average value of each segment;
the data is truncated to obtain the required 1x1024 feature matrix.
Further, the method for dimension reduction processing specifically includes:
carrying out averaging treatment on the input data matrix;
establishing a correlation coefficient matrix R based on the data matrix after the averaging treatment;
calculating the eigenvalues and eigenvectors, and selecting the leading components whose cumulative contribution rate reaches 95% to obtain the dimension-reduced feature matrix.
The invention also provides a noise classification device based on the multi-feature fusion convolutional neural network, which comprises:
The data extraction module is used for acquiring a noise data set and extracting a spectrogram, a Mel spectrogram, a first-order differential Mel spectrogram and a second-order differential Mel spectrogram of the noise data;
a feature extraction module for extracting a 1x1024 feature matrix X1 from the spectrogram, a 1x1024 feature matrix X2 from the Mel spectrogram, a 1x1024 feature matrix X3 from the first-order differential Mel spectrogram and a 1x1024 feature matrix X4 from the second-order differential Mel spectrogram, forming a feature matrix X5 from the four noise evaluation indexes of noise kurtosis, equivalent A sound level, accumulated noise exposure and sound entropy, concatenating the feature matrices X1, X2, X3, X4 and X5 in series to obtain the feature matrix X, and then performing dimension reduction on the feature matrix X;
the classification model building module is used for building a classification model MCCNN, and inputting the feature matrix X subjected to the dimension reduction treatment into the classification model MCCNN to obtain a noise classification result;
the classification model MCCNN comprises a first feature extraction module, a second feature extraction module, a third feature extraction module and a fourth feature extraction module which are sequentially connected, and a first jump residual error connection module, a second jump residual error connection module and a third jump residual error connection module which are respectively connected with the second feature extraction module, the third feature extraction module and the fourth feature extraction module, wherein the first feature extraction module, the second feature extraction module, the third feature extraction module and the fourth feature extraction module comprise a convolution module, a normalization module, a ReLu module, a pooling operation module and a concat layer, and the first jump residual error connection module, the second jump residual error connection module and the third jump residual error connection module comprise a residual error jump convolution module and a residual error jump ReLu module;
The first feature extraction module extracts features of a feature matrix input by an input end through convolution-normalization-ReLu-pooling operation, and then the feature matrix is combined with jump connection of the input end through a concat layer to obtain a first feature extraction result; the input end of the second feature extraction module extracts features from the input first feature extraction result through convolution-normalization-ReLu-pooling operation, simultaneously, the features are extracted through the first jump residual error connection module, the results of the first feature extraction module and the second feature extraction module are combined, and then the jump connection of the input end of the second feature extraction module and the jump connection of the input end of the second feature extraction module are combined through a concat layer, so that a second feature extraction result is obtained; the input end of the third feature extraction module extracts features through convolution-normalization-ReLu-pooling operation on the input second feature extraction result, meanwhile, features are extracted through the second jump residual error connection module, the results of the convolution-normalization-ReLu-pooling operation and the second jump residual error connection module are combined, and then the jump residual error connection module is combined with the input end of the third feature extraction module through a concat layer to obtain a third feature extraction result; the input end of the fourth feature extraction module extracts features through convolution-normalization-ReLu-pooling operation on the input third feature extraction result, meanwhile, features are extracted through the third jump residual error connection module, the results of the convolution-normalization-ReLu-pooling operation and the third jump residual error connection module are combined with the jump connection of the input end of the third feature 
extraction module through a concat layer to obtain a fourth feature extraction result, the first feature extraction result, the second feature extraction result, the third feature extraction result and the fourth feature extraction result are combined in a fifth concat layer, and then the final classification result is obtained through a full connection layer, a softmax and a classout layer.
The invention also provides a nonvolatile storage medium which is used for storing a program, wherein the program is used for controlling equipment where the nonvolatile storage medium is located to execute the noise classification method based on the multi-feature fusion convolutional neural network when running.
The invention also provides an electronic device comprising a processor and a memory, wherein the memory stores computer readable instructions and the processor runs them; when executed, the computer readable instructions perform the above noise classification method based on the multi-feature fusion convolutional neural network.
Compared with the prior art, the noise classification method based on the multi-feature fusion convolutional neural network has the following advantages:
According to the invention, combining the jump residual connection modules with the convolutional neural network improves the performance and classification accuracy of the model; the model fusion helps alleviate overfitting and, especially when the data volume is limited or interference is strong, enhances the robustness of the model so that it performs better in the face of noise or uncertainty;
The method can classify four types of noise data with high accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a schematic diagram of a network model according to the present invention;
fig. 2 is a schematic diagram of a residual block according to the present invention;
FIG. 3 shows the four types of noise patterns according to the present invention;
FIG. 4 is a schematic diagram of a first-order differential Mel spectrum and a second-order differential Mel spectrum of the industrial noise of the present invention;
FIG. 5 is a schematic diagram of a first-order differential Mel spectrum and a second-order differential Mel spectrum of the building noise of the present invention;
FIG. 6 is a schematic diagram of a first-order differential Mel spectrum and a second-order differential Mel spectrum of the traffic noise of the present invention;
FIG. 7 is a schematic diagram of a Mel spectrum, a first-order differential Mel spectrum, and a second-order differential Mel spectrum of social life noise according to the present invention;
fig. 8 is a schematic diagram of an accuracy curve and a non-normalized loss function curve of MCCNN network operation according to the present invention;
fig. 9 is a schematic diagram of a confusion matrix according to the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", etc. may explicitly or implicitly include one or more such feature. In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art in a specific case.
The invention will be described in detail below with reference to the drawings in connection with embodiments.
As shown in fig. 1-2, the invention provides a noise classification method based on a multi-feature fusion convolutional neural network, which comprises the following steps:
step 1: acquiring a noise data set, and extracting a spectrogram, a Mel spectrogram, a first-order differential Mel spectrogram and a second-order differential Mel spectrogram of the noise data;
from the spectrogram we can observe various features of the sound, such as pitch, volume, duration of the sound and harmonic structure of the sound. This makes the spectrogram an important tool for analyzing sound, detecting noise, audio signal processing and speech recognition.
MFCC (Mel-Frequency Cepstral Coefficients) is a widely used technique in audio signal processing and audio feature extraction. It is mainly used in applications such as speech recognition, audio classification and speaker recognition to extract feature vectors of sound and help machine learning algorithms recognize and classify it.
MFCC characteristics are static in nature and have good time invariance, but lack time series information and do not reflect the dynamic nature of noise signals. The dynamic characteristics are obtained by differentiating or superposing adjacent time window characteristics on the basis of the static characteristics. The dynamic characteristics have the advantage of reflecting the dynamic change of the noise signals, so that the model can better process the continuity and the time sequence of the noise signals. The accuracy and the robustness of the model can be improved by comprehensively using the static features and the dynamic features, so that the model is more suitable for processing the classification and identification tasks of noise signals.
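To make step 1 concrete, the following is a minimal NumPy sketch of computing a magnitude spectrogram and its first- and second-order differences (the dynamic features discussed above). The window and hop sizes and the simple two-point delta are illustrative choices, not values specified by the patent; in practice a library such as librosa would also supply the Mel filterbank.

```python
import numpy as np

def stft_spectrogram(x, n_fft=1024, hop=512):
    """Magnitude spectrogram via framed FFT with a Hann window."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # (freq_bins, n_frames)

def delta(feat):
    """First-order difference along the time axis (simple two-point delta)."""
    return np.diff(feat, axis=1, prepend=feat[:, :1])

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)      # 1 s of noise-like audio at 16 kHz
spec = stft_spectrogram(x)          # static feature
d1 = delta(spec)                    # stands in for the first-order differential
d2 = delta(d1)                      # second-order differential
print(spec.shape)                   # (513, 30)
```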
Step 2: extracting a 1x1024 feature matrix X1 from the spectrogram, a 1x1024 feature matrix X2 from the Mel spectrogram, a 1x1024 feature matrix X3 from the first-order differential Mel spectrogram and a 1x1024 feature matrix X4 from the second-order differential Mel spectrogram; forming a feature matrix X5 from the four noise evaluation indexes of noise kurtosis, equivalent A sound level, accumulated noise exposure and sound entropy; concatenating the feature matrices X1, X2, X3, X4 and X5 in series to obtain the feature matrix X, and then performing dimension reduction on the feature matrix X;
specifically, in order to extract a feature matrix from a spectrogram, the invention adopts the following method: first, an audio signal is converted into a spectrogram by a signal processing technique, and the image reflects the distribution of the audio signal in time and frequency. The frequency bandwidth (0-5 kHz) range of interest for the present invention is selected so as to focus on frequency regions containing important information. Then, the data in the selected frequency bandwidth range is subjected to dimension reduction processing, and the dimension reduction method for improving PCA is selected to ensure that the dimension is matched with the size of the characteristic matrix of the target 1x 1024. And then intercepting the data subjected to dimension reduction according to the research requirement to obtain a required 1x1024 feature matrix. The method can extract the most representative speech feature matrix X 1
Specifically, the improved PCA dimension reduction method comprises the following steps:
1. carrying out averaging treatment on the input data matrix;
The original information consists of two parts: the correlation coefficients, which reflect how strongly the indexes influence one another, and the variance, which reflects the degree of variation of each index. The traditional mean-standard-deviation standardization sets each index mean to 0 and variance to 1; this eliminates the differences between indexes and retains only the correlation information, so a principal component analysis model built on this traditional standardization cannot accurately reflect the original information. To solve this problem, the invention processes the raw input data by mean-division, with the formula

$$x'_{ij} = \frac{x_{ij}}{\bar{x}_j}, \qquad \bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$$

where $x_{ij}$ is the original data and $x'_{ij}$ is the result of the averaging. Processing the input data matrix in this way not only eliminates differences in dimension and magnitude, but also leaves the correlation coefficient matrix unchanged, preserving both the variance and the correlation information of the original data.
2. Establishing a correlation coefficient matrix R based on the data matrix after the averaging treatment;
3. calculating the eigenvalues and eigenvectors, and selecting the leading components whose cumulative contribution rate reaches 95% to obtain the dimension-reduced feature matrix.
In the invention, the feature matrix X is also reduced in dimension using this improved PCA method.
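A sketch of the improved-PCA pipeline under the description above (mean-division instead of z-scoring, then eigendecomposition of the correlation matrix, keeping components up to 95% cumulative contribution). The function name and data are illustrative, not from the patent:

```python
import numpy as np

def improved_pca(X, contrib=0.95):
    """Mean-division normalization, then PCA on the correlation matrix,
    keeping the leading components whose cumulative contribution rate
    first reaches `contrib`."""
    Xm = X / X.mean(axis=0)                    # averaging (mean-division) step
    R = np.corrcoef(Xm, rowvar=False)          # correlation coefficient matrix
    w, V = np.linalg.eigh(R)                   # eigenvalues in ascending order
    order = np.argsort(w)[::-1]                # sort descending
    w, V = w[order], V[:, order]
    k = int(np.searchsorted(np.cumsum(w) / w.sum(), contrib)) + 1
    return Xm @ V[:, :k]                       # projected, dimension-reduced data

rng = np.random.default_rng(1)
X = rng.random((200, 10)) + 1.0                # positive data, so column means != 0
Y = improved_pca(X)
print(Y.shape[0])                              # 200 samples retained
```

Note that dividing by the column mean rescales each index without changing the correlation coefficient matrix, which is exactly the property the averaging step is claimed to preserve.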
Specifically, the invention extracts a 1x1024 feature matrix X2 from the Mel spectrogram as follows: logarithmic compression is applied to the Mel spectrogram to reduce the data range and improve the expression of low-energy components; the logarithmic Mel spectrogram is then divided into segments, each containing a certain number of frequency bandwidths, and the data within each bandwidth is averaged, which reduces the dimensionality while retaining key frequency information and yields a more compact feature representation for the subsequent noise classification task; the average of each segment is then normalized so that the data has a consistent scale across different audio clips; finally, the data is truncated to obtain the required 1x1024 feature matrix, which can be used for various audio processing and pattern recognition tasks.
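The exact segmentation scheme is not spelled out in the text; the following sketch implements one plausible reading, pooling the log-Mel matrix in fixed-size chunks, normalizing, and truncating or zero-padding to 1x1024. All names and sizes other than the 1x1024 target are assumptions.

```python
import numpy as np

def mel_to_vector(log_mel, out_len=1024):
    """Pool a (bands x frames) log-Mel spectrogram into a 1x1024 row:
    average fixed-size chunks, min-max normalize, truncate or zero-pad."""
    flat = log_mel.flatten()
    seg = max(1, flat.size // out_len)               # chunk size
    n = (flat.size // seg) * seg
    pooled = flat[:n].reshape(-1, seg).mean(axis=1)  # per-chunk average
    span = pooled.max() - pooled.min()
    pooled = (pooled - pooled.min()) / (span + 1e-9) # normalize to [0, 1]
    vec = np.zeros(out_len)
    m = min(out_len, pooled.size)
    vec[:m] = pooled[:m]                             # truncate / pad
    return vec.reshape(1, -1)

rng = np.random.default_rng(2)
log_mel = np.log(np.abs(rng.standard_normal((64, 400))) + 1e-6)
v = mel_to_vector(log_mel)
print(v.shape)                                       # (1, 1024)
```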
Specifically, the feature matrices of the first-order and second-order differential Mel spectrograms are extracted in a similar way. The first-order difference represents the rate of change of each frequency band of the Mel spectrogram, and the second-order difference represents the rate of change of the first-order difference, capturing further audio features. Applying the same method as for the Mel spectrogram feature matrix yields two 1x1024 feature matrices X3 and X4.
Specifically, the number of features extracted by the present invention equals the number of MFCC frames, which has the following advantages.
Dimension reduction: reducing the number of columns of the MFCC feature matrix to the frame length significantly reduces the feature dimension, which lowers the complexity of data processing and storage, especially for high-dimensional data.
Information extraction: selecting a number of features equal to the frame length preserves more of the original information, since each feature corresponds to a time window in the original audio signal, helping retain the time-domain information of the signal.
Calculation efficiency: computing a one-dimensional feature vector is typically faster, because there is no need to compute correlations between multiple features or perform a further dimension reduction operation.
Specifically, the spectrogram, the Mel spectrogram and its first- and second-order differences play a key role in sound signal analysis. These features not only reflect the characteristics of sound in the frequency and time domains, but also provide detailed information about the sound spectrum, time-domain dynamics and energy. In particular, the spectrogram presents the distribution of sound over time and frequency, emphasizing its time- and frequency-domain characteristics and helping to analyze its dynamic properties. The Mel spectrogram highlights the frequency range perceived by the human ear and is well suited to recognition tasks, while the first- and second-order differences capture the rate of change of the features and provide more detailed information. In addition, the invention adds four 1x1 features, namely noise kurtosis, equivalent A sound level, accumulated noise exposure and sound entropy, to form a 1x4 feature matrix X5; these respectively reflect the statistical distribution of the sound, its perception by the human ear, its accumulated influence, and its degree of disorder and uncertainty. These four characteristics cannot be expressed by the spectrogram, the Mel spectrogram or their differences, and allow the sound to be characterized more comprehensively.
The five partial feature matrices are concatenated in series to obtain the initial feature matrix X0 of a single piece of audio data, of size 1x4100: the feature matrices X1, X2, X3 and X4 are each of size 1x1024 and X5 is of size 1x4, so connecting the five matrices in series yields a 1x(1024+1024+1024+1024+4) = 1x4100 feature matrix X0 for each piece of data.
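The series concatenation is straightforward; a sketch with placeholder rows (the sizes are those stated in the text):

```python
import numpy as np

# Per-clip feature rows with the sizes given in the text (contents are placeholders).
X1 = np.zeros((1, 1024))   # spectrogram features
X2 = np.zeros((1, 1024))   # Mel spectrogram features
X3 = np.zeros((1, 1024))   # first-order differential Mel features
X4 = np.zeros((1, 1024))   # second-order differential Mel features
X5 = np.zeros((1, 4))      # kurtosis, equivalent A level, cumulative exposure, entropy

X0 = np.hstack([X1, X2, X3, X4, X5])   # series concatenation
print(X0.shape)                        # (1, 4100)
```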
Noise kurtosis, equivalent A sound level, accumulated noise exposure and sound entropy are noise evaluation indexes. Noise kurtosis, also called the kurtosis coefficient, characterizes the peakedness of the probability density curve at the mean and reflects the sharpness of the peak; the sample kurtosis is a statistic compared against the normal distribution, and a kurtosis greater than 3 indicates a peak steeper than that of the normal distribution. In statistics, kurtosis measures the peakedness of the probability distribution of a real random variable; high kurtosis indicates that the variance is driven by infrequent extreme deviations rather than frequent modest ones. The equivalent A sound level uses the equal-energy principle to represent, at a selected position in the sound field, several intermittent noise exposures with different A sound levels by a single average quantity. Accumulated noise exposure is a newer index that combines noise intensity and exposure time according to the equal-energy principle: researchers found that hearing loss is related not only to noise intensity but also strongly to exposure time, and that evaluating noise by intensity alone underestimates its harm, so exposure time was added to the evaluation index; accumulated noise exposure thus reflects the effect of sustained noise on people. From the viewpoint of information theory, entropy measures uncertainty or information content, and sound entropy measures the randomness or uncertainty of a sound: the higher the entropy, the more information the sound contains and the harder the audio signal is to predict or compress.
The entropy of sound has important applications in fields such as audio processing, speech recognition and information coding, where it is used to analyze the information content and characteristics of sound signals.
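As a rough illustration, the four evaluation indexes can be computed as follows. The amplitude-histogram entropy, the equal-energy Leq formula and the simplified cumulative-exposure formula (Leq plus 10·lg of total exposure time) are assumptions made for this sketch, since the patent does not give explicit formulas:

```python
import numpy as np

def noise_kurtosis(x):
    # Fourth standardized moment; a value above 3 means a peak
    # steeper than that of the normal distribution.
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std()
    return np.mean((x - m) ** 4) / s ** 4

def sound_entropy(x, bins=64):
    # Shannon entropy (bits) of the amplitude histogram, as a
    # measure of the randomness or uncertainty of the sound.
    counts, _ = np.histogram(np.asarray(x, dtype=float), bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def equivalent_a_level(spl_db, durations):
    # Equal-energy average (Leq) of several A-weighted levels,
    # each held for the given duration.
    spl_db = np.asarray(spl_db, dtype=float)
    t = np.asarray(durations, dtype=float)
    return 10 * np.log10(np.sum(t * 10 ** (spl_db / 10)) / t.sum())

def cumulative_noise_exposure(spl_db, hours):
    # Assumed simplified form: equal-energy level plus a term for
    # total exposure time, reflecting that harm grows with duration.
    return equivalent_a_level(spl_db, hours) + 10 * np.log10(np.sum(hours))
```

These four scalars would form the 1x4 feature matrix X 5 for one noise sample.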
The data set used in the invention contains 10000 pieces of noise data. A feature matrix X 0 is extracted from each piece of noise data, and the 10000 matrices are stacked row-wise to obtain a feature matrix X of size 10000x4100. In X, each row represents one noise data item and each column represents one feature; the final feature matrix X (10000x4100) is thus obtained from the 10000 pieces of noise data and the 1x4100 single-data feature matrix described above.
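The series concatenation and row-wise stacking described above can be sketched as follows; the zero matrices are placeholders standing in for real extracted features, and the small sample count is illustrative (the invention uses 10000 samples):

```python
import numpy as np

# Hypothetical per-sample feature matrices with the sizes stated above.
x1 = np.zeros((1, 1024))  # spectrogram features
x2 = np.zeros((1, 1024))  # mel spectrogram features
x3 = np.zeros((1, 1024))  # first-order differential mel features
x4 = np.zeros((1, 1024))  # second-order differential mel features
x5 = np.zeros((1, 4))     # four noise evaluation indexes

# Series concatenation gives the single-sample feature matrix X0 (1 x 4100).
x0 = np.concatenate([x1, x2, x3, x4, x5], axis=1)

# Row-wise stacking of every sample's X0 yields the full matrix X;
# with 10000 samples this becomes a 10000 x 4100 matrix.
n_samples = 100  # small illustrative count
X = np.vstack([x0] * n_samples)
```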
Step 3: constructing a classification model MCCNN, and inputting the feature matrix X subjected to dimension reduction treatment into the classification model MCCNN to obtain a noise classification result;
the classification model MCCNN comprises a first feature extraction module, a second feature extraction module, a third feature extraction module and a fourth feature extraction module which are sequentially connected, and a first jump residual error connection module, a second jump residual error connection module and a third jump residual error connection module which are respectively connected with the second feature extraction module, the third feature extraction module and the fourth feature extraction module, wherein the first feature extraction module, the second feature extraction module, the third feature extraction module and the fourth feature extraction module comprise a convolution module, a normalization module, a ReLu module, a pooling operation module and a concat layer, and the first jump residual error connection module, the second jump residual error connection module and the third jump residual error connection module comprise a residual error jump convolution module and a residual error jump ReLu module;
The first feature extraction module extracts features from the feature matrix supplied at its input through a convolution-normalization-ReLu-pooling operation, and the result is then combined, through a concat layer, with a jump connection from the input to obtain a first feature extraction result; the second feature extraction module extracts features from the input first feature extraction result through a convolution-normalization-ReLu-pooling operation while the first jump residual error connection module also extracts features from it, the two results are combined and then merged, through a concat layer, with a jump connection from the input of the second feature extraction module to obtain a second feature extraction result; the third feature extraction module extracts features from the input second feature extraction result through a convolution-normalization-ReLu-pooling operation while the second jump residual error connection module also extracts features from it, the two results are combined and then merged, through a concat layer, with a jump connection from the input of the third feature extraction module to obtain a third feature extraction result; the fourth feature extraction module extracts features from the input third feature extraction result through a convolution-normalization-ReLu-pooling operation while the third jump residual error connection module also extracts features from it, the two results are combined and then merged, through a concat layer, with a jump connection from the input of the fourth feature extraction module to obtain a fourth feature extraction result. The first, second, third and fourth feature extraction results are combined in a fifth concat layer, and the final classification result is then obtained through a full connection layer, a softmax layer and a classout (classification output) layer.
Specifically, in order to better extract high-frequency features, the invention uses a convolution layer to extract image features in the jump residual error connection module, while the residual block shown in fig. 2 extracts residual features. The residual block consists of a convolution layer, an activation function and a residual connection; this simple residual structure allows earlier network information to be fully used later in the network. The result of each group of residual blocks is combined with the convolution result of the same layer and passed to the next layer, and the result of the residual blocks is also transmitted directly to the end of the network via a long jump connection.
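The block structure just described, convolution plus activation with a residual connection back to the input, merged with the same-layer convolution path, can be sketched in minimal NumPy form. The 1-D convolution, the kernels and the use of concatenation as the merge are illustrative stand-ins for this sketch, not the invention's actual layers:

```python
import numpy as np

def conv1d_same(x, kernel):
    # 1-D convolution with 'same' output length, standing in for a conv layer.
    return np.convolve(x, kernel, mode="same")

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, kernel):
    # Convolution + activation, plus a residual connection back to the
    # input, so earlier network information is carried forward unchanged.
    return relu(conv1d_same(x, kernel)) + x

def extraction_stage(x, conv_kernel, res_kernel):
    # Same-layer convolution path and residual-block path, merged here
    # by concatenation in the spirit of the concat layer.
    main = relu(conv1d_same(x, conv_kernel))
    res = residual_block(x, res_kernel)
    return np.concatenate([main, res])
```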
The effectiveness of the scheme of the invention is illustrated by experimental data.
I: Experimental data.
The noise data used in the present invention come from the ZENODO database and the Aigei audio network. To conform to the classification of noise in the Law of the People's Republic of China on Noise Pollution Prevention and Control, namely industrial noise, building noise, traffic noise and social life noise, the invention manually assigns corresponding labels to the noise. The label contents and the specific noises under each label are shown in Table 1.
For the classification of industrial noise, the invention refers to the gist of the activity-based costing method in management and classifies industrial noise according to the specific operation; building noise is likewise classified according to the specific building operation, also following the activity-based costing method; the classification of traffic noise is relatively simple and is mainly performed according to the vehicle producing the noise; social life noise is mainly divided into residential area, commercial area and public place noise according to the living area.
TABLE 1
And II: audio data spectrograms and mel spectrograms.
Typical spectrograms of the four types of noise (industrial noise, building noise, traffic noise, social life noise) are shown in fig. 3.
Comparing the spectrograms corresponding to the four different noises: along the time axis, the industrial noise and social life noise vary smoothly with time, while the traffic noise shows the largest variation with time, with obvious bursts; along the frequency axis, as the frequency decreases the grey scale deepens, reflecting an increase in the amplitude of the audio, and the energy of the industrial noise is relatively large compared with the other three types of noise. In summary, the four types of noise differ considerably in their spectrograms, so using the spectrogram as a feature can be expected to achieve good results.
Typical mel spectrograms of the four types of noise (industrial noise, building noise, traffic noise, social life noise) and their first-order and second-order differential mel spectrograms are shown in figs. 4, 5, 6 and 7.
Comparing the mel spectrograms corresponding to the four different noises, the energy of the industrial noise is obviously larger than that of the other three noises, and the traffic noise shows obvious burstiness. The differences among the four types of noise are fully reflected in the first-order and second-order differential mel spectrograms, where they differ obviously in energy distribution, frequency peaks and resonance structure. Thus, using mel spectrograms and their differentials as features can be expected to achieve good results.
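For illustration, the first-order and second-order differential mel spectrograms can be obtained by differencing along the time axis. The simple padded difference used here is an assumed stand-in for the regression-based delta computation common in audio front ends:

```python
import numpy as np

def delta(spec):
    # First-order difference along the time axis (columns = frames),
    # padded with the first difference column so the output keeps the
    # input shape.
    d = np.diff(spec, axis=1)
    return np.concatenate([d[:, :1], d], axis=1)

# mel: hypothetical log-mel spectrogram, shape (n_mels, n_frames)
mel = np.array([[1.0, 3.0, 6.0],
                [2.0, 2.0, 2.0]])
delta1 = delta(mel)     # first-order differential mel spectrogram
delta2 = delta(delta1)  # second-order differential mel spectrogram
```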
Feature matrices are extracted from the four types of spectrograms by the method described above, and dimension reduction is performed with the improved PCA method; after dimension reduction, an improved feature matrix of size 10000x100 is obtained and used as the input of the classification model.
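The dimension reduction step can be sketched as follows. The correlation-matrix-based PCA below follows the steps recited in claim 4 (mean centering, correlation coefficient matrix R, eigendecomposition, 95% cumulative contribution rate), but the exact "improved PCA" of the invention is not specified, so this is only an assumed baseline:

```python
import numpy as np

def pca_reduce(X, contribution=0.95):
    # Mean-center the data, build the correlation coefficient matrix R,
    # then keep the leading eigenvectors whose cumulative contribution
    # rate (explained variance ratio) reaches the given threshold.
    Xc = X - X.mean(axis=0)
    R = np.corrcoef(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(R)
    order = np.argsort(vals)[::-1]       # eigenvalues, descending
    vals, vecs = vals[order], vecs[:, order]
    ratio = np.cumsum(vals) / vals.sum()
    k = int(np.searchsorted(ratio, contribution)) + 1
    return Xc @ vecs[:, :k]              # reduced feature matrix
```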
III: Classification accuracy.
The invention takes the noise feature matrix as the input of the classification model, selecting about the first 2/3 (6664 pieces) of the 10000 pieces of noise data as the training set and the remaining about 1/3 (3336 pieces) as the test set. Industrial noise, building noise, traffic noise and social life noise are assigned the codes 1, 2, 3 and 4 respectively, the number of iterations is set to 2000, and the classification result on the test set is taken as the final result. The accuracy curve and the non-normalized loss function curve of the MCCNN network are shown in fig. 8, and the confusion matrix is shown in fig. 9.
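The data split can be sketched as follows; the placeholder arrays and the contiguous first-2/3 split are illustrative assumptions:

```python
import numpy as np

n_total, n_train = 10000, 6664  # about the first 2/3 for training

X = np.zeros((n_total, 100))               # reduced feature matrix (placeholder)
y = np.random.randint(1, 5, size=n_total)  # codes 1-4 for the four noise types

X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
```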
Meanwhile, the traditional CNN network model, an urban sound classification model and a CNN-based environmental noise classification model are adopted as comparisons against the MCCNN network model; the experimental accuracy results are shown in Table 2, where the average value is the weighted average accuracy over the four types of noise. In the comparison experiments, drilling and jackhammer in the compared classification model documents are treated as industrial noise; car horn, engine idling and siren as traffic noise; and air conditioner, children playing, dog bark, gun shot and street noise as social life noise. The accuracies of the sounds classified as industrial, traffic and social life noise are averaged to obtain the accuracy of the three types of noise. The urban sound classification model is the scheme in Chen Bo, Yu Ying, Research on Urban Sound Classification Model Based on Deep Neural Networks [J]. Journal of Zhejiang University of Technology, 2019, 47(02): 199-203, and the CNN-based environmental noise classification model is the scheme in Zhang Ke, Su Yu, Wang Jingyu, et al., Research on Environmental Sound Classification System Based on Fusion Features and Convolutional Neural Networks [J]. Journal of Northwest Polytechnical University, 2020, 38(01): 162-169.
TABLE 2
Overall, all models achieve a high classification accuracy for industrial noise. For building noise, compared with the traditional CNN model, the classification accuracy of the method of the invention improves by 24.56% ((92.30-74.10)/74.10; the same relative calculation applies below). For traffic noise, the classification accuracy of the invention reaches 100%; owing to its burstiness and impulsive nature, traffic noise is easier to distinguish than the other types, which also indirectly confirms the higher classification accuracy. For social life noise, the classification accuracy of the invention is 93.10%, an improvement of 9.14%, 12.99% and 13.18% respectively over the traditional CNN model, the urban sound classification model and the environmental noise classification model. In terms of average accuracy, the MCCNN model of the invention improves the average accuracy over the four types of noise by 17.59%, 10.62% and 8.15% respectively compared with the traditional CNN model, the urban sound classification model and the environmental noise classification model. Therefore, taken together, the improved MCCNN network model has better accuracy and a better classification effect on the four types of noise (industrial noise, building noise, traffic noise and social life noise).
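The relative improvement figures quoted above all follow one formula, sketched here for clarity:

```python
def improvement_pct(new_acc, old_acc):
    # Relative improvement in percent: (new - old) / old * 100.
    return (new_acc - old_acc) / old_acc * 100

# Building noise, MCCNN vs. traditional CNN: (92.30 - 74.10) / 74.10
print(round(improvement_pct(92.30, 74.10), 2))  # → 24.56
```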
IV: Ablation experiments.
The invention carries out an ablation experiment on the selected characteristics to verify the effectiveness of the characteristic selection. According to the invention, five types of input feature matrices, namely a no-spectrogram feature matrix, a no-mel spectrogram feature matrix, a no-first-order differential mel spectrogram feature matrix, a no-second-order differential mel spectrogram feature matrix and a no-four evaluation index feature matrix, are selected to serve as the input of a classification model, and an ablation experiment is carried out on the classification model. The results of the ablation experiments are shown in table 3.
TABLE 3
As can be seen from Table 3, in terms of feature selection, compared with the input without the spectrogram feature matrix, the average accuracy of the MCCNN network of the invention improves by 4.18% ((94.80-91.00)/91.00; the same relative calculation applies below); compared with the input without the mel spectrogram feature matrix, the average accuracy improves by 4.64%; compared with the input without the first-order differential mel spectrogram feature matrix, by 4.41%; compared with the input without the second-order differential mel spectrogram feature matrix, by 2.49%; and compared with the input without the four evaluation index feature matrix, by 2.71%. Among the five features, the mel spectrogram feature contributes most to the improvement in classification accuracy, followed by the first-order differential mel spectrogram feature, the spectrogram feature and the four evaluation index features, while the second-order differential mel spectrogram feature contributes relatively least. In summary, with the five selected features, the average classification accuracy of the proposed MCCNN model improves over the models under the other conditions. The results show that the five features (the spectrogram feature, the mel spectrogram feature, the first-order differential mel spectrogram feature, the second-order differential mel spectrogram feature and the four evaluation index features) can effectively improve the performance of the algorithm and the precision of sound classification.
The invention also provides a noise classification device based on the multi-feature fusion convolutional neural network, which comprises a data extraction module, a feature extraction module and a classification model building module, wherein the data extraction module is used for acquiring a noise data set and extracting a spectrogram, a Mel spectrogram, a first-order differential Mel spectrogram and a second-order differential Mel spectrogram of the noise data;
a feature extraction module for extracting a 1x1024 feature matrix X 1 from the spectrogram, a 1x1024 feature matrix X 2 from the Mel spectrogram, a 1x1024 feature matrix X 3 from the first-order differential Mel spectrogram and a 1x1024 feature matrix X 4 from the second-order differential Mel spectrogram, forming a feature matrix X 5 from the four noise evaluation indexes of noise kurtosis, equivalent A sound level, cumulative noise exposure and sound entropy, concatenating the feature matrices X 1, X 2, X 3, X 4 and X 5 in series to obtain the feature matrix X, and then performing dimension reduction on the feature matrix X;
the classification model building module is used for building a classification model MCCNN, and inputting the feature matrix X subjected to the dimension reduction treatment into the classification model MCCNN to obtain a noise classification result;
the classification model MCCNN comprises a first feature extraction module, a second feature extraction module, a third feature extraction module and a fourth feature extraction module which are sequentially connected, and a first jump residual error connection module, a second jump residual error connection module and a third jump residual error connection module which are respectively connected with the second feature extraction module, the third feature extraction module and the fourth feature extraction module, wherein the first feature extraction module, the second feature extraction module, the third feature extraction module and the fourth feature extraction module comprise a convolution module, a normalization module, a ReLu module, a pooling operation module and a concat layer, and the first jump residual error connection module, the second jump residual error connection module and the third jump residual error connection module comprise a residual error jump convolution module and a residual error jump ReLu module;
the first feature extraction module extracts features from the feature matrix supplied at its input through a convolution-normalization-ReLu-pooling operation, and the result is then combined, through a concat layer, with a jump connection from the input to obtain a first feature extraction result; the second feature extraction module extracts features from the input first feature extraction result through a convolution-normalization-ReLu-pooling operation while the first jump residual error connection module also extracts features from it, the two results are combined and then merged, through a concat layer, with a jump connection from the input of the second feature extraction module to obtain a second feature extraction result; the third feature extraction module extracts features from the input second feature extraction result through a convolution-normalization-ReLu-pooling operation while the second jump residual error connection module also extracts features from it, the two results are combined and then merged, through a concat layer, with a jump connection from the input of the third feature extraction module to obtain a third feature extraction result; the fourth feature extraction module extracts features from the input third feature extraction result through a convolution-normalization-ReLu-pooling operation while the third jump residual error connection module also extracts features from it, the two results are combined and then merged, through a concat layer, with a jump connection from the input of the fourth feature extraction module to obtain a fourth feature extraction result. The first, second, third and fourth feature extraction results are combined in a fifth concat layer, and the final classification result is then obtained through a full connection layer, a softmax layer and a classout layer.
The invention also provides a nonvolatile storage medium which is used for storing a program, wherein the program is used for controlling equipment where the nonvolatile storage medium is located to execute the noise classification method based on the multi-feature fusion convolutional neural network when running.
The invention also provides an electronic device, which comprises a processor and a memory, wherein the memory stores computer readable instructions, and the processor is used for running the computer readable instructions, wherein the computer readable instructions execute the noise classification method based on the multi-feature fusion convolutional neural network when running.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (7)

1. A noise classification method based on a multi-feature fusion convolutional neural network, characterized by comprising the following steps:
Step 1: acquiring a noise data set, and extracting a spectrogram, a Mel spectrogram, a first-order differential Mel spectrogram and a second-order differential Mel spectrogram of the noise data;
Step 2: extracting a 1x1024 feature matrix X 1 from the spectrogram, a 1x1024 feature matrix X 2 from the Mel spectrogram, a 1x1024 feature matrix X 3 from the first-order differential Mel spectrogram and a 1x1024 feature matrix X 4 from the second-order differential Mel spectrogram; forming a feature matrix X 5 from the four noise evaluation indexes of noise kurtosis, equivalent A sound level, cumulative noise exposure and sound entropy; concatenating the feature matrices X 1, X 2, X 3, X 4 and X 5 in series to obtain the feature matrix X, and then performing dimension reduction on the feature matrix X;
step 3: constructing a classification model MCCNN, and inputting the feature matrix X subjected to dimension reduction treatment into the classification model MCCNN to obtain a noise classification result;
the classification model MCCNN comprises a first feature extraction module, a second feature extraction module, a third feature extraction module and a fourth feature extraction module which are sequentially connected, and a first jump residual error connection module, a second jump residual error connection module and a third jump residual error connection module which are respectively connected with the second feature extraction module, the third feature extraction module and the fourth feature extraction module, wherein the first feature extraction module, the second feature extraction module, the third feature extraction module and the fourth feature extraction module comprise a convolution module, a normalization module, a ReLu module, a pooling operation module and a concat layer, and the first jump residual error connection module, the second jump residual error connection module and the third jump residual error connection module comprise a residual error jump convolution module and a residual error jump ReLu module;
the first feature extraction module extracts features from the feature matrix supplied at its input through a convolution-normalization-ReLu-pooling operation, and the result is then combined, through a concat layer, with a jump connection from the input to obtain a first feature extraction result; the second feature extraction module extracts features from the input first feature extraction result through a convolution-normalization-ReLu-pooling operation while the first jump residual error connection module also extracts features from it, the two results are combined and then merged, through a concat layer, with a jump connection from the input of the second feature extraction module to obtain a second feature extraction result; the third feature extraction module extracts features from the input second feature extraction result through a convolution-normalization-ReLu-pooling operation while the second jump residual error connection module also extracts features from it, the two results are combined and then merged, through a concat layer, with a jump connection from the input of the third feature extraction module to obtain a third feature extraction result; the fourth feature extraction module extracts features from the input third feature extraction result through a convolution-normalization-ReLu-pooling operation while the third jump residual error connection module also extracts features from it, the two results are combined and then merged, through a concat layer, with a jump connection from the input of the fourth feature extraction module to obtain a fourth feature extraction result; the first, second, third and fourth feature extraction results are combined in a fifth concat layer, and the final classification result is then obtained through a full connection layer, a softmax layer and a classout layer.
2. The noise classification method based on the multi-feature fusion convolutional neural network according to claim 1, wherein the noise classification method is characterized in that: in the step 2, a feature matrix X of 1X1024 is extracted from the spectrogram 1 Comprising the following steps:
selecting a frequency bandwidth range of interest from the spectrogram;
performing dimension reduction processing on the data matrix within the selected frequency bandwidth range;
intercepting the data after dimension reduction to obtain the required feature matrix X 1.
3. The noise classification method based on the multi-feature fusion convolutional neural network according to claim 1, wherein the noise classification method is characterized in that: in the step 2, a 1X1024 feature matrix X is extracted from the Mel spectrogram 2 Comprising the following steps:
carrying out logarithmic compression on the Mel spectrogram to obtain a logarithmic Mel spectrogram;
dividing the logarithmic mel spectrogram into a plurality of segments, wherein each segment has a plurality of frequency bandwidths, and averaging the data in each frequency bandwidth to obtain an average value of the data;
normalizing the average value of each segment;
the data is truncated to obtain the required 1x1024 feature matrix.
4. The noise classification method based on the multi-feature fusion convolutional neural network according to claim 2, wherein the noise classification method is characterized in that: the dimension reduction processing method specifically comprises the following steps:
Carrying out averaging treatment on the input data matrix;
establishing a correlation coefficient matrix R based on the data matrix after the averaging treatment;
calculating the feature value and the feature vector, and selecting all dimension reduction features with the accumulated contribution rate of 95% to obtain a dimension-reduced feature matrix.
5. A noise classification device based on a multi-feature fusion convolutional neural network is characterized in that: comprising
The data extraction module is used for acquiring a noise data set and extracting a spectrogram, a Mel spectrogram, a first-order differential Mel spectrogram and a second-order differential Mel spectrogram of the noise data;
a feature extraction module for extracting a 1x1024 feature matrix X 1 from the spectrogram, a 1x1024 feature matrix X 2 from the Mel spectrogram, a 1x1024 feature matrix X 3 from the first-order differential Mel spectrogram and a 1x1024 feature matrix X 4 from the second-order differential Mel spectrogram, forming a feature matrix X 5 from the four noise evaluation indexes of noise kurtosis, equivalent A sound level, cumulative noise exposure and sound entropy, concatenating the feature matrices X 1, X 2, X 3, X 4 and X 5 in series to obtain the feature matrix X, and then performing dimension reduction on the feature matrix X;
The classification model building module is used for building a classification model MCCNN, and inputting the feature matrix X subjected to the dimension reduction treatment into the classification model MCCNN to obtain a noise classification result;
the classification model MCCNN comprises a first feature extraction module, a second feature extraction module, a third feature extraction module and a fourth feature extraction module which are sequentially connected, and a first jump residual error connection module, a second jump residual error connection module and a third jump residual error connection module which are respectively connected with the second feature extraction module, the third feature extraction module and the fourth feature extraction module, wherein the first feature extraction module, the second feature extraction module, the third feature extraction module and the fourth feature extraction module comprise a convolution module, a normalization module, a ReLu module, a pooling operation module and a concat layer, and the first jump residual error connection module, the second jump residual error connection module and the third jump residual error connection module comprise a residual error jump convolution module and a residual error jump ReLu module;
the first feature extraction module extracts features from the feature matrix supplied at its input through a convolution-normalization-ReLu-pooling operation, and the result is then combined, through a concat layer, with a jump connection from the input to obtain a first feature extraction result; the second feature extraction module extracts features from the input first feature extraction result through a convolution-normalization-ReLu-pooling operation while the first jump residual error connection module also extracts features from it, the two results are combined and then merged, through a concat layer, with a jump connection from the input of the second feature extraction module to obtain a second feature extraction result; the third feature extraction module extracts features from the input second feature extraction result through a convolution-normalization-ReLu-pooling operation while the second jump residual error connection module also extracts features from it, the two results are combined and then merged, through a concat layer, with a jump connection from the input of the third feature extraction module to obtain a third feature extraction result; the fourth feature extraction module extracts features from the input third feature extraction result through a convolution-normalization-ReLu-pooling operation while the third jump residual error connection module also extracts features from it, the two results are combined and then merged, through a concat layer, with a jump connection from the input of the fourth feature extraction module to obtain a fourth feature extraction result; the first, second, third and fourth feature extraction results are combined in a fifth concat layer, and the final classification result is then obtained through a full connection layer, a softmax layer and a classout layer.
6. A non-volatile storage medium, characterized in that: the non-volatile storage medium stores a program which, when run, controls a device in which the non-volatile storage medium is located to perform the noise classification method based on a multi-feature fusion convolutional neural network according to any one of claims 1-4.
7. An electronic device, characterized in that: the electronic device comprises a processor and a memory, the memory storing computer-readable instructions and the processor being configured to execute the computer-readable instructions, wherein the computer-readable instructions, when executed, perform the noise classification method based on a multi-feature fusion convolutional neural network according to any one of claims 1-4.
CN202311524974.5A 2023-11-16 2023-11-16 Noise classification method based on multi-feature fusion convolutional neural network Active CN117238320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311524974.5A CN117238320B (en) 2023-11-16 2023-11-16 Noise classification method based on multi-feature fusion convolutional neural network

Publications (2)

Publication Number Publication Date
CN117238320A (en) 2023-12-15
CN117238320B (en) 2024-01-09

Family

ID=89093439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311524974.5A Active CN117238320B (en) 2023-11-16 2023-11-16 Noise classification method based on multi-feature fusion convolutional neural network

Country Status (1)

Country Link
CN (1) CN117238320B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104616664A (en) * 2015-02-02 2015-05-13 合肥工业大学 Method for recognizing audio based on spectrogram significance test
CN109767785A (en) * 2019-03-06 2019-05-17 河北工业大学 Ambient noise method for identifying and classifying based on convolutional neural networks
CN110164472A (en) * 2019-04-19 2019-08-23 天津大学 Noise classification method based on convolutional neural networks
CN114373476A (en) * 2022-01-11 2022-04-19 江西师范大学 Sound scene classification method based on multi-scale residual attention network
CN114636975A (en) * 2022-03-10 2022-06-17 杭州电子科技大学 LPI radar signal identification method based on spectrogram fusion and attention mechanism
CN114898772A (en) * 2022-06-22 2022-08-12 辽宁工程技术大学 Method for classifying acoustic scenes based on feature layering and improved ECAPA-TDNN
CN115329893A (en) * 2022-09-01 2022-11-11 无锡致同知新科技有限公司 Acoustic scene classification method based on pairing feature fusion
CN115762533A (en) * 2022-10-31 2023-03-07 南京信息工程大学 Bird song classification and identification method and device
CN116486834A (en) * 2023-04-26 2023-07-25 北京科技大学 Rolling sound classification method based on feature fusion and improved convolutional neural network
CN116543795A (en) * 2023-06-29 2023-08-04 天津大学 Sound scene classification method based on multi-mode feature fusion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108447490B (en) * 2018-02-12 2020-08-18 阿里巴巴集团控股有限公司 Voiceprint recognition method and device based on memorability bottleneck characteristics
Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A machine learning-based underwater noise classification method; Guoli Song et al.; Applied Acoustics; pp. 1-11 *
Research on an Urban Noise Classification and Detection System; Wu Jiasai; China Master's Theses Full-text Database, Engineering Science and Technology II (Issue 02); pp. 1-80 *

Also Published As

Publication number Publication date
CN117238320A (en) 2023-12-15

Similar Documents

Publication Publication Date Title
Hossan et al. A novel approach for MFCC feature extraction
CN110120218B (en) Method for identifying highway large-scale vehicles based on GMM-HMM
CN109949823B (en) DWPT-MFCC and GMM-based in-vehicle abnormal sound identification method
CN106653032A (en) Animal sound detecting method based on multiband energy distribution in low signal-to-noise-ratio environment
CN109378014A (en) A kind of mobile device source discrimination and system based on convolutional neural networks
CN1216380A (en) Feature extraction apparatus and method and pattern recognition apparatus and method
CN105469784A (en) Generation method for probabilistic linear discriminant analysis (PLDA) model and speaker clustering method and system
CN107293306B (en) A kind of appraisal procedure of the Objective speech quality based on output
CN113823293B (en) Speaker recognition method and system based on voice enhancement
CN106558308A (en) A kind of internet audio quality of data auto-scoring system and method
CN108198561A (en) A kind of pirate recordings speech detection method based on convolutional neural networks
EP2329399A1 (en) Method of analysing an audio signal
CN116741148A (en) Voice recognition system based on digital twinning
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
US6076058A (en) Linear trajectory models incorporating preprocessing parameters for speech recognition
CN113252323B (en) Breaker mechanical fault identification method and system based on human ear hearing characteristics
CN117238320B (en) Noise classification method based on multi-feature fusion convolutional neural network
CN116153337B (en) Synthetic voice tracing evidence obtaining method and device, electronic equipment and storage medium
CN116705063A (en) Manifold measurement-based multi-model fusion voice fake identification method
CN107993666B (en) Speech recognition method, speech recognition device, computer equipment and readable storage medium
CN110808067A (en) Low signal-to-noise ratio sound event detection method based on binary multiband energy distribution
CN113782051B (en) Broadcast effect classification method and system, electronic equipment and storage medium
CN115064175A (en) Speaker recognition method
Shanmugapriya et al. Deep neural network based speaker verification system using features from glottal activity regions
CN113488069A (en) Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant