CN116959477B - Convolutional neural network-based noise source classification method and device - Google Patents

Convolutional neural network-based noise source classification method and device Download PDF

Info

Publication number
CN116959477B
CN116959477B CN202311208076.9A CN202311208076A CN116959477B CN 116959477 B CN116959477 B CN 116959477B CN 202311208076 A CN202311208076 A CN 202311208076A CN 116959477 B CN116959477 B CN 116959477B
Authority
CN
China
Prior art keywords
audio
noise
neural network
convolutional neural
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311208076.9A
Other languages
Chinese (zh)
Other versions
CN116959477A (en
Inventor
纪盟盟
张静
毛志德
李兆行
张凯帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Aihua Instruments Co ltd
Original Assignee
Hangzhou Aihua Instruments Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Aihua Instruments Co ltd filed Critical Hangzhou Aihua Instruments Co ltd
Priority to CN202311208076.9A priority Critical patent/CN116959477B/en
Publication of CN116959477A publication Critical patent/CN116959477A/en
Application granted granted Critical
Publication of CN116959477B publication Critical patent/CN116959477B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

The application relates to the technical field of environmental noise identification, solves the problem that a neural network algorithm-based method in the prior art is generally limited by less training samples, so that model accuracy is poor, and discloses a convolutional neural network-based noise source classification method and device, wherein the method comprises the following steps: acquiring noise sample audio, expanding the noise sample audio by adopting methods of splicing, re-cutting, audio tone changing, audio speed changing, noise adding and random cutting, and constructing a convolutional neural network model; the method comprises the steps of performing data enhancement processing on samples which are difficult to collect, performing multiple expansion on training samples after data enhancement to improve accuracy and generalization of the training model, canceling filling of time domain dimensions in the network, performing zero filling only in the frequency domain dimensions, and reducing calculated amount during training and actual use.

Description

Convolutional neural network-based noise source classification method and device
Technical Field
The application relates to the technical field of environmental noise identification, in particular to a method and a device for classifying noise sources based on a convolutional neural network.
Background
In recent years, with rapid development of industrial technology and increasing promotion of living standard of people, noise source types in life are increasing, including living noise, traffic noise, industrial noise, and the like. The contradiction and dispute caused by noise pollution are more and more, and along with the improvement of life quality of people, the influence of people on environmental noise is more and more important. Therefore, in the context of new noise law promulgation, the resolution of noise source categories is also an important issue faced by many regulatory authorities.
The noise source classification refers to the classification of the noise sound source, and two implementation modes based on a traditional algorithm and a neural network algorithm exist at present. The traditional noise source classification algorithm is used for manually extracting the audio characteristics and classifying according to the differences among the characteristics, so that the problems that the classification accuracy is difficult to improve and the classification category of the noise source is single are solved. The method based on the neural network algorithm at the present stage is generally limited by few training samples, so that the model precision is poor, and the number of parameters and the calculated amount of the model in actual use are too large.
In general, in the prior art, because of many noise types, some noises have the problem of difficult acquisition, such as thunder and rain, and specific weather environments are needed for acquisition, so that training samples of a neural network model are few, the classification precision of the trained model is low, and the generalization is poor; in addition, most convolutional neural networks used for classification at the present stage are a resnet51 network, the network is huge in parameter quantity and calculation quantity, the power consumption and calculation power requirements are high in use, and the real-time classification requirements are difficult to finish.
Disclosure of Invention
The method and the device aim to solve the problem that in the prior art, a method based on a neural network algorithm is generally limited by less training samples and the model precision is poor.
In a first aspect, a method for classifying noise sources based on a convolutional neural network is provided, including:
acquiring noise sample audio, and expanding the noise sample audio by adopting methods of splicing, re-cutting, audio tone changing, audio speed changing, noise adding and random cutting;
constructing a convolutional neural network model;
inputting the noise sample audio and the audio obtained by expansion into a convolutional neural network model for model training so as to obtain a noise classification model;
the collected noise audio is subjected to frequency spectrum conversion to obtain a log_mel spectrum characteristic vector of the noise audio;
inputting log_mel spectral feature vectors of noise audio into the noise classification model to output respective noise categories and corresponding probabilities;
carrying out noise category statistics in a period of time;
and calculating a noise class corresponding to the maximum probability in the class statistics as a classification result of the noise source.
Further, the splicing and re-cutting includes: all noise sample audios are spliced into a long audio, and the long audio is sheared into a plurality of audios with first preset duration in an overlapping mode.
Further, the audio tone variation includes: each noise sample audio is individually subjected to a random pitch variation process to obtain an equal number of audio.
Further, the audio shifting includes: performing random speed change processing on each piece of noise sample audio independently, comparing the audio time after speed change with a first preset time, and if the audio time after speed change does not reach the first preset time, performing self-splicing by using the audio after speed change to enable the audio time to be equal to the first preset time; if the audio time after the speed change exceeds the first preset time, cutting the audio after the speed change to enable the audio time to be equal to the first preset time.
Further, the adding noise includes: random snr ambient noise is added to each noise sample audio to get an equal amount of audio.
Further, the random clipping includes: and randomly cutting each noise sample audio according to the audio feature points for a second preset time period, and splicing the randomly cut audio to the first preset time period by using the audio.
Further, the convolutional neural network model sequentially includes: the device comprises a two-dimensional conv layer, a feature extraction module, a two-dimensional DepthwiseConv layer, a mean pooling layer, a two-dimensional conv layer, a pooling layer, a Reshape layer, a two-dimensional conv layer and a Softmax layer, wherein the feature extraction module comprises 4 Transmit Block blocks and 12 normalBlock blocks.
Furthermore, the convolutional neural network model adopts a calculation method of multiplexing convolutional results, the filling of time domain dimension is canceled in the network, zero filling is only carried out on the frequency domain dimension, and the corresponding reduction of feature map dimension is carried out through a pooling layer, so that the effect of residual mapping addition is achieved.
In a second aspect, an apparatus for noise source classification based on convolutional neural network is provided, comprising:
the industrial personal computer comprises a processor, a memory and a program or an instruction stored on the memory and capable of running on the processor, wherein the program or the instruction realizes the method according to any implementation manner of the first aspect when being executed by the processor;
the microphone is electrically connected with the processor;
and the display screen is electrically connected with the processor.
In a third aspect, a computer readable storage medium is provided, the computer readable medium storing program code for execution by a device, the program code comprising steps for performing the method as in any one of the implementations of the first aspect.
The application has the following beneficial effects:
according to the method, data enhancement processing is performed on the samples which are difficult to collect, and after data enhancement, multiple expansion is performed on the training samples so as to improve the accuracy and generalization of the training model;
according to the method, the parameters of the model are greatly reduced through the separation convolution and the network layer pruning technology, the convolution result multiplexing technology is carried out during identification, the convolution redundancy calculation during frame-by-frame calculation is greatly reduced, the calculated amount is reduced, and the classification efficiency is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application, illustrate and explain the application and are not to be construed as limiting the application.
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method of convolutional neural network based noise source classification of embodiment 1 of the present application;
FIG. 2 is a flow chart of sample expansion and model training in the method of convolutional neural network-based noise source classification of embodiment 1 of the present application;
FIG. 3 is a block diagram of a convolutional neural network model in the convolutional neural network-based noise source classification method of embodiment 1 of the present application;
FIG. 4 is a construction diagram of a transition_block module in the method for classifying noise sources based on convolutional neural network according to embodiment 1 of the present application;
FIG. 5 is a diagram of the normal_block module in the method for convolutional neural network based noise source classification of embodiment 1 of the present application;
FIG. 6 is a schematic diagram of a normal convolution calculation;
FIG. 7 is a schematic diagram of convolutional multiplexing computation in the method of convolutional neural network-based noise source classification of embodiment 1 of the present application;
fig. 8 is a block diagram of the structure of the device for noise source classification based on convolutional neural network according to embodiment 2 of the present application.
Reference numerals:
100. an industrial personal computer; 101. a processor; 102. a memory; 200. a microphone; 300. and a display screen.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
The method for classifying noise sources based on convolutional neural network according to embodiment 1 of the present application includes: acquiring noise sample audio, and expanding the noise sample audio by adopting methods of splicing, re-cutting, audio tone changing, audio speed changing, noise adding and random cutting; constructing a convolutional neural network model; inputting the noise sample audio and the audio obtained by expansion into a convolutional neural network model for model training so as to obtain a noise classification model; the collected noise audio is subjected to frequency spectrum conversion to obtain a log_mel spectrum characteristic vector of the noise audio; inputting log_mel spectral feature vectors of noise audio into the noise classification model to output respective noise categories and corresponding probabilities; carrying out noise category statistics in a period of time; the noise class corresponding to the maximum probability in the class statistics is calculated to serve as a classification result of the noise source, data enhancement processing can be carried out on samples which are difficult to collect by the method, and after data enhancement, multiple expansion is carried out on training samples so as to improve the accuracy and generalization of the training model.
Specifically, fig. 1 shows a flowchart (mainly divided into a training phase and an reasoning phase) of a method for classifying noise sources based on a convolutional neural network in application example 1, including:
s100, obtaining noise sample audio, and expanding the noise sample audio by adopting methods of splicing, re-cutting, audio tone changing, audio speed changing, noise adding and random cutting;
it is known that neural network-based noise source classification models want to achieve high recognition rates, requiring rich training samples. If all the training samples are manually collected, huge manpower and material resources are needed, and certain samples are required to be subjected to special weather (lightning strike, heavy rain, hail and the like), the invention provides a sample amplification scheme aiming at the problem of small number of manually collected samples.
In this embodiment, the audio with the duration of 10s is used as noise sample audio, and the sample amplification scheme is shown in fig. 2, and the process of the sample amplification scheme is that the manually collected 10s noise sample audio sample is spliced, sheared, audio tonal variation, audio speed variation, noise addition and random clipping are performed on the manually collected 10s noise sample audio sample, the sample is subjected to five times and more expansion, the sample after expansion and the manually collected sample are jointly used as training samples of a network, and a noise classification model is obtained through training, as shown in fig. 2, so that the model has very high accuracy and generalization.
The data expansion scheme comprises the following specific steps:
splicing and then cutting: all manually collected 10s samples are spliced into a whole long audio, then overlapped sheared is added into training samples, the number of the samples obtained by the expansion scheme is more than that of the original samples, and more than one time of expansion effect is achieved (the specific data amount depends on the overlap amount in shearing);
audio tone variation: carrying out random tone modification treatment on each manually collected 10s sample independently, wherein the sample quantity obtained by the expansion scheme is equal to the sample quantity of a sample;
audio frequency speed change: each manually collected 10s sample is independently subjected to random speed change treatment, the audio after speed change is less than 10s (namely, the first preset duration is 10s to ensure that the duration of the audio after expansion is equal to the duration of the original noise sample audio), the audio after speed change is spliced to 10s by the audio after speed change, the audio with the length exceeding 10s is randomly shortened to 10s, and the sample quantity obtained by the expansion scheme is equal to the sample quantity of the original sample;
adding noise: adding random snr environmental noise to each manually collected 10s sample, wherein the sample quantity obtained by the expansion scheme is equal to the sample quantity of the original sample;
randomly cutting: for each manually collected 10s sample, according to the audio feature points, randomly cutting the audio with the audio feature points to be 9s (namely, the second preset time length is 1s, the original 10s sample is randomly cut for 1s and then becomes the audio with the audio frequency of 9 s), then splicing the sample to the audio with the audio feature points by itself to be 10s (namely, the audio feature points are lower than the preset time length by 10 s), each sample can be cut into n pieces of audio, and the sample size obtained by the expansion scheme is n times of the sample size of the original sample.
S200, constructing a convolutional neural network model;
as shown in fig. 3, the convolutional neural network model sequentially includes: the two-dimensional conv layer, the feature extraction module, the two-dimensional DepthwiseConv layer, the mean pooling layer, the two-dimensional conv layer, the pooling layer, the Reshape layer, the two-dimensional conv layer and the Softmax layer, wherein the feature extraction module comprises 4 Transmit Block blocks and 12 normalBlock blocks, and the flow of the convolutional neural network model is as follows: the method comprises the steps of extracting log_mel characteristics (MxN) of original audio (P x 1) as original characteristics of a network, inputting the log_mel characteristics through a two-dimensional Conv layer, inputting the log_mel characteristics into a characteristic extraction module, wherein the characteristic extraction module consists of 4 TransmitionBlock blocks and 12 normalBlock blocks, extracting a characteristic diagram, inputting the extracted characteristic diagram through a two-dimensional DepthwiseConv layer after passing through a Mean layer, inputting the extracted characteristic diagram into the two-dimensional Conv layer after passing through a pooling layer, carrying out dimension adjustment through a Reshape layer after passing through the two-dimensional Conv layer, and finally obtaining corresponding category scores through a Softmax layer. The specific parameter settings are shown in table 1:
table 1: convolutional neural network model parameter setting table
Network layer Structure of the Output dimension Quantity of parameters
1 padding [400, 132, 1] 0
2 conv2d [396, 64, 16] 416
3 transition_block [394, 64, 16] 800
4 normal_block [392, 64, 16] 480
5 normal_block [390, 64, 16] 480
6 padding [390, 64, 16] 0
7 transition_block [386, 32, 32] 2112
8 normal_block [382, 32, 32] 1472
9 normal_block [378, 32, 32] 1472
10 padding [378, 32, 32] 0
11 transition_block [370, 16, 64] 7296
12 normal_block [362, 16, 64] 4992
13 normal_block [354, 16, 64] 4992
14 normal_block [346, 16, 64] 4992
15 normal_block [338, 16, 64] 4992
16 padding [338, 16, 64] 0
17 transition_block [322, 16, 128] 26880
18 normal_block [306, 16, 128] 18176
19 normal_block [290, 16, 128] 18176
20 normal_block [274, 16, 128] 18176
21 normal_block [258, 16, 128] 18176
22 padding [258, 16, 128] 0
23 padding [258, 20, 128] 0
24 conv2d [254, 16, 128] 3328
25 mean [254, 1, 128] 0
26 conv2d [254, 1, 32] 4096
27 padding [1, 1, 32] 0
28 conv2d [1, 1, 9] 288
The specific construction of the transform_block module is shown in fig. 4, and the specific parameter settings of the transform_block module are shown in table 2:
table 2: specific parameter setting table of transition_block module
Network layer Structure of the
1 conv2d
2 batch_normalization
3 padding
4 depthwise_conv2d
5 batch_normalization
6 pooling
7 depthwise_conv2d
8 batch_normalization
9 conv2d
The specific construction of the normal_block module is shown in fig. 5, and the specific parameter settings of the normal_block module are shown in table 3:
table 3: specific parameter setting table of normal_block module
Network layer Structure of the
1 pooling
2 padding
3 depthwise_conv2d
4 batch_normalization
5 pooling
6 depthwise_conv2d
7 batch_normalization
8 conv2d
Aiming at the problem of large calculation amount of the convolutional neural network model, the calculation method of convolutional result multiplexing is used in the embodiment, the filling of time domain dimension is canceled in the network, and zero filling is only carried out on the frequency domain dimension, so that the dimension of the feature map is ensured, and the calculation amount in training and practical use is reduced on the premise of not losing the model precision.
If the input of the convolutional neural network model is the audio length of 10 seconds, 400 frames are obtained after framing as the time domain length of the input features, the audio is completely convolved every 10 seconds during each forward reasoning, and when the next 10 seconds of audio is convolved again after the time of one frame, the two sections of audio which are only 10 seconds of one frame apart have great convolution repetition calculation. In the embodiment, the convolution result of each frame is stored, when the network infers the audio frequency of the next frame for 10 seconds, only the convolution result of the tail frame is needed to be calculated, the convolution calculation can be greatly reduced, the inference speed is increased, and real-time inference is realized.
It should be noted that, the corresponding relationship between Chinese and English with respect to terms in the specification and the drawings is as follows:
conv: a convolution layer;
depthwise Conv: a deep convolution layer;
pooling: pooling layers;
transmission Block: a transition block;
normal Block: a regular block;
mean: an average layer;
reshape: dimensional remodeling;
softmax: a Softmax layer;
cls_prob: category scores;
batch normalization: a batch normalization layer;
ReLU: a ReLU activation function;
pad: a zero-fill layer;
swish: swish activates the function.
As shown in fig. 6 and 7, if the convolutional neural network model is input into 6 frames of length at the time of reasoning to obtain a final result, the convolution kernel is 3, and in the case of normal convolution calculation, separate convolution calculation is performed for every 6 frames of data, the results of 4 convolutions of two adjacent 6 frames of length are the same, which results in 4 repeated convolution calculations. In the embodiment, the convolution calculation results are multiplexed, each convolution result is stored, and the average of the sum of the corresponding convolution results is obtained when the final result of forward reasoning is obtained, so that the repeated calculation of the convolution is reduced, and the calculation amount is greatly reduced.
Meanwhile, in the embodiment, because the residual mapping is applied to the network, when the time domain convolution is performed, if zero filling is performed, the result of each convolution is different, and convolution result administration cannot be performed, so that the zero filling of the time domain dimension is cancelled, and a pooling layer is used for performing corresponding reduction of the feature map dimension, so that the effect of adding the residual mapping is achieved, and the structures of a Transmit Block module and a normalBlock module are specifically seen.
S300, inputting noise sample audio and audio obtained by expansion into a convolutional neural network model for model training so as to obtain a noise classification model;
it should be noted that, the model training stage belongs to the preparation stage of the model, and does not belong to the practical stage, in this embodiment, the manually collected noise audio sample is subjected to a sample expansion method to obtain a training sample of the model, and the convolutional neural network model is trained by using the training sample to obtain an optimal inference weight, so as to supply the noise classification model of the inference stage for use.
S400, carrying out frequency spectrum conversion on the collected noise audio to obtain a log_mel spectrum feature vector of the noise audio;
s500, inputting log_mel spectral feature vectors of noise audio into the noise classification model to output each noise category and corresponding probability;
s600, carrying out noise category statistics in a period of time;
and S700, calculating a noise category corresponding to the maximum probability in the category statistics as a classification result of the noise source.
It should be noted that, steps S400-S700 belong to an inference stage, the inference stage is an actual practical stage of the noise classification model, the microphone receives the environmental noise, and through spectrum conversion, a log_mel spectrum feature vector of the audio is obtained, and is used for inputting the noise classification model (the weight obtained in the loaded training stage) to obtain the probability corresponding to each noise class, then the class statistics is performed for a period of time, and finally the class corresponding to the maximum probability is output.
The use of a 1x1 two-dimensional convolution and a 1x3 two-dimensional convolution in this embodiment reduces the number of parameters of the model. The method uses a calculation method of frame convolution, the filling of time domain dimension is canceled in a network, and zero filling is only carried out on the frequency domain dimension, so that the size of the dimension of the feature map is ensured, and the calculated amount in training and actual use is reduced on the premise of not losing the model precision.
Aiming at the problem of large calculation amount of a convolutional neural network, the calculation method of convolutional result multiplexing is used in the scheme, filling of time domain dimension is canceled in the network, zero filling is only carried out on the frequency domain dimension, the size of the dimension of the feature map is guaranteed, and the calculation amount in training and practical use is reduced on the premise that model accuracy is not lost.
Example 2
As shown in fig. 8, an apparatus for classifying noise sources based on convolutional neural network according to embodiment 2 of the present application includes:
100 industrial personal computers, the industrial personal computer 100 comprising a processor 101, a memory 102 and a program or instruction stored on the memory 102 and executable on the processor 101, the program or instruction implementing the method according to any one of the embodiments 1 when executed by the processor 101;
a microphone 200, wherein the microphone 200 is electrically connected with the processor 101;
the display screen 300, the display screen 300 is electrically connected with the processor 101.
It should be noted that, for avoiding redundancy, reference may be made to other specific embodiments of the device for classifying noise sources based on convolutional neural networks in the embodiment of the present invention, and in order to avoid redundancy, details are not described here, when in use, the microphone 200 collects audio information and transmits the audio information to the industrial personal computer 100, the industrial personal computer 100 carries a program or an instruction that can run on the processor 101, and when the program or the instruction is executed by the processor 101, the method described in any one of embodiments 1 is implemented, the audio information is processed by the industrial personal computer 100 to obtain the category to which the audio belongs, and the category information is transmitted to the display screen 300 for display.
Embodiment 3
A computer-readable storage medium according to embodiment 3 of the present application stores program code for execution by a device, the program code comprising steps for performing the method in any one of the implementations of embodiment 1 of the present application;
wherein the computer-readable storage medium may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM); the computer-readable storage medium may store program code which, when executed by a processor, performs the steps of the method in any one of the implementations of embodiment 1 of the present application.
The foregoing is merely a preferred embodiment of the present application and does not limit the scope of protection of the present application. Any equivalent substitution or modification of the technical solution and its inventive concept made by a person skilled in the art within the technical scope disclosed herein shall fall within the scope of protection of the present application.

Claims (8)

1. A method for noise source classification based on convolutional neural network, comprising:
acquiring noise sample audio, and augmenting the noise sample audio by splicing and re-cutting, audio pitch variation, audio speed variation, noise addition, and random cropping;
constructing a convolutional neural network model, wherein the convolutional neural network model sequentially comprises: a two-dimensional Conv layer, a feature extraction module, a two-dimensional DepthwiseConv layer, a mean pooling layer, a two-dimensional Conv layer, a pooling layer, a Reshape layer, a two-dimensional Conv layer, and a Softmax layer; the feature extraction module comprises 4 Transmit Block blocks and 12 normalBlock blocks; the convolutional neural network model adopts a computation method of convolution-result multiplexing, in which padding of the time-domain dimension is removed from the network and zero padding is applied only in the frequency-domain dimension, and the feature-map dimensions are correspondingly reduced through the pooling layer so that the residual mapping can still be added;
inputting the noise sample audio and the audio obtained by expansion into a convolutional neural network model for model training so as to obtain a noise classification model;
performing spectrum conversion on collected noise audio to obtain a log_mel spectral feature vector of the noise audio;
inputting the log_mel spectral feature vector of the noise audio into the noise classification model to output each noise category and its corresponding probability;
carrying out noise category statistics over a period of time;
and taking the noise category with the maximum probability in the category statistics as the classification result of the noise source.
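The final two steps of claim 1 — accumulating per-clip predictions over a period and reporting the class whose share of the statistics is largest — amount to a majority vote. A minimal sketch follows; the function name and the (label, probability) input format are assumptions, since the patent does not fix the exact data structure.

```python
from collections import Counter

def classify_period(clip_predictions):
    """clip_predictions: list of (label, probability) pairs, one per clip
    classified during the observation period. Returns the label that
    occurs most often, together with its share of the period."""
    counts = Counter(label for label, _ in clip_predictions)
    label, count = counts.most_common(1)[0]
    return label, count / len(clip_predictions)
```

For example, three clips classified as ("traffic", 0.9), ("traffic", 0.8), ("construction", 0.7) yield "traffic" with a share of 2/3.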
2. The convolutional neural network-based noise source classification method of claim 1, wherein the splicing and re-cutting comprises: splicing all noise sample audios into one long audio, and cutting the long audio in an overlapping manner into a plurality of audio clips of a first preset duration.
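The splicing and re-cutting of claim 2 can be sketched with NumPy: concatenate every sample into one long signal, then slide a fixed-length window with overlap. The function name and the overlap parameter (in samples) are assumptions for illustration.

```python
import numpy as np

def splice_and_recut(clips, clip_len, overlap):
    """Concatenate all sample clips into one long signal, then cut it
    back into fixed-length clips of clip_len samples, with consecutive
    clips overlapping by `overlap` samples."""
    long_audio = np.concatenate(clips)
    hop = clip_len - overlap  # step between successive windows
    starts = range(0, len(long_audio) - clip_len + 1, hop)
    return [long_audio[s:s + clip_len] for s in starts]
```

Two 5-sample clips re-cut with clip_len=4 and overlap=2 give a 10-sample long audio and four overlapping 4-sample clips.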
3. The convolutional neural network-based noise source classification method of claim 1, wherein the audio pitch variation comprises: individually applying random pitch-shift processing to each noise sample audio to obtain an equal number of audios.
4. The convolutional neural network-based noise source classification method of claim 1, wherein the audio speed variation comprises: individually applying random speed-change processing to each noise sample audio, and comparing the duration of the speed-changed audio with the first preset duration; if the duration does not reach the first preset duration, self-splicing the speed-changed audio so that its duration equals the first preset duration; if the duration exceeds the first preset duration, cutting the speed-changed audio so that its duration equals the first preset duration.
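A sketch of claim 4's duration handling, assuming the speed change itself is done by crude linear-interpolation resampling (a stand-in for a proper pitch-preserving time stretch, which the patent does not specify): results that come out shorter than the target are self-spliced up to it, longer ones are truncated. All names are hypothetical.

```python
import numpy as np

def speed_change(audio, rate):
    """Crude speed change by resampling: rate > 1 speeds up (shorter output)."""
    n_out = int(round(len(audio) / rate))
    idx = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(idx, np.arange(len(audio)), audio)

def fit_to_duration(audio, target_len):
    """Self-splice if too short, truncate if too long (claim 4)."""
    if len(audio) < target_len:
        reps = -(-target_len // len(audio))  # ceiling division
        audio = np.tile(audio, reps)
    return audio[:target_len]
```

A 100-sample clip sped up at rate 2.0 becomes 50 samples and is self-spliced back to 100; slowed at rate 0.5 it becomes 200 samples and is cut back to 100.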
5. The convolutional neural network-based noise source classification method of claim 1, wherein the noise addition comprises: adding ambient noise at a random signal-to-noise ratio to each noise sample audio to obtain an equal number of audios.
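The noise addition of claim 5 can be sketched as mixing ambient noise scaled to a target signal-to-noise ratio; how the random SNR is drawn is left open by the claim, so the function below takes it as a parameter and its name is an assumption.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so that 10*log10(P_signal / P_noise) equals snr_db,
    then mix it into the clean sample."""
    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Measuring the SNR of the returned mixture against the clean input recovers the requested value exactly, since the scaling is derived from the power ratio.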
6. The convolutional neural network-based noise source classification method of claim 1, wherein the random cropping comprises: randomly cropping each noise sample audio to a second preset duration according to audio feature points, and splicing the randomly cropped audio with itself up to the first preset duration.
7. An apparatus for noise source classification based on convolutional neural network, comprising:
an industrial personal computer comprising a processor, a memory, and a program or instruction stored in the memory and executable on the processor, wherein the program or instruction, when executed by the processor, implements the method of any one of claims 1-6;
the microphone is electrically connected with the processor;
and the display screen is electrically connected with the processor.
8. A computer readable storage medium storing program code for execution by a device, the program code comprising steps for performing the method of any one of claims 1-6.
CN202311208076.9A 2023-09-19 2023-09-19 Convolutional neural network-based noise source classification method and device Active CN116959477B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311208076.9A CN116959477B (en) 2023-09-19 2023-09-19 Convolutional neural network-based noise source classification method and device


Publications (2)

Publication Number Publication Date
CN116959477A CN116959477A (en) 2023-10-27
CN116959477B true CN116959477B (en) 2023-12-19

Family

ID=88454923


Country Status (1)

Country Link
CN (1) CN116959477B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690451B (en) * 2024-01-29 2024-04-16 杭州爱华仪器有限公司 Neural network noise source classification method and device based on ensemble learning

Citations (10)

Publication number Priority date Publication date Assignee Title
CN105868785A (en) * 2016-03-30 2016-08-17 乐视控股(北京)有限公司 Image identification method based on convolutional neural network and image identification system thereof
WO2020037960A1 (en) * 2018-08-21 2020-02-27 深圳大学 Sar target recognition method and apparatus, computer device, and storage medium
CN111915506A (en) * 2020-06-19 2020-11-10 西安电子科技大学 Method for eliminating stripe noise of sequence image
CN111951823A (en) * 2020-08-07 2020-11-17 腾讯科技(深圳)有限公司 Audio processing method, device, equipment and medium
KR20210034429A (en) * 2019-09-20 2021-03-30 아주대학교산학협력단 Apparatus and method for classificating point cloud using neighbor connectivity convolutional neural network
CN113963719A (en) * 2020-07-20 2022-01-21 东声(苏州)智能科技有限公司 Deep learning-based sound classification method and apparatus, storage medium, and computer
CN115019760A (en) * 2022-05-19 2022-09-06 上海理工大学 Data amplification method for audio and real-time sound event detection system and method
CN115050356A (en) * 2022-06-07 2022-09-13 中山大学 Noise identification method and device and computer readable storage medium
CN115358718A (en) * 2022-08-24 2022-11-18 广东旭诚科技有限公司 Noise pollution classification and real-time supervision method based on intelligent monitoring front end
CN115456088A (en) * 2022-09-19 2022-12-09 浙江科技学院 Motor fault classification method based on self-encoder feature lifting dimension and multi-sensing fusion


Non-Patent Citations (1)

Title
Diagonal State Space Augmented Transformers for Speech Recognition; George Saon et al.; ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant