CN110796027B - Sound scene recognition method based on neural network model of tight convolution - Google Patents

Sound scene recognition method based on neural network model of tight convolution

Info

Publication number
CN110796027B
Authority
CN
China
Prior art keywords
convolution
tight
network model
neural network
feature
Prior art date
Legal status
Active
Application number
CN201910960583.5A
Other languages
Chinese (zh)
Other versions
CN110796027A (en)
Inventor
张涛
冯国庆
梁晋华
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201910960583.5A priority Critical patent/CN110796027B/en
Publication of CN110796027A publication Critical patent/CN110796027A/en
Application granted granted Critical
Publication of CN110796027B publication Critical patent/CN110796027B/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/12Classification; Matching
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Complex Calculations (AREA)

Abstract

A sound scene recognition method based on a tightly-convolved neural network model comprises the following steps: establishing a tightly-convolved neural network model for sound scene classification; inputting a training set containing audio files of different scene categories, together with their corresponding scene labels, into the model and training it; reading an audio file and preprocessing it to obtain audio signal segments; extracting a log-mel spectrogram from each audio signal segment; and inputting the log-mel spectrogram into the trained model to obtain the final sound scene category. The invention ensures that effective features are fully exploited so that accuracy is preserved, while simplifying the network model to reduce memory consumption; it is therefore more efficient for sound scene recognition and better meets the performance constraints of sound scene recognition devices.

Description

Sound scene recognition method based on neural network model of tight convolution
Technical Field
The invention relates to sound scene recognition methods, and in particular to a sound scene recognition method based on a tightly-convolved neural network model.
Background
Sound scene recognition is a technology that processes collected sound signals and then judges the scene they come from; it is widely applied in smart homes, security monitoring, audio retrieval, and similar areas. In recent years, with the advent of various deep neural network frameworks, convolutional neural networks have been used increasingly in sound scene recognition. Most applications of sound scene recognition require the recognition function to run on mobile devices, but neural network models are typically very large and place heavy demands on the hardware, and the computing resources of mobile devices can hardly satisfy a large-scale neural network model. How to reduce model size and increase inference speed while maintaining good performance is therefore a research focus. Reducing unnecessary parameter computation is of great significance for designing a lightweight network that can be applied to scene recognition on mobile devices and can meet the real-time requirements of scene recognition in production and daily life.
According to how information is distributed over space and channels, different methods for reducing information redundancy have been proposed. Spatial pooling, commonly in the form of max pooling or average pooling, effectively reduces spatial information redundancy: it removes unnecessary feature information and enlarges the receptive field. A common method for reducing redundancy across channels is channel pruning, in which some channels are discarded at random without considering the information carried by their feature maps. This is simple to implement, but the randomly discarded channels inevitably carry some important information, so the accuracy of the neural network model decreases.
Limited computing resources have always been a major obstacle to the development of neural networks. The AlexNet framework proposed in 2012 introduced the concept of grouped convolution to work around GPU memory limits: the network was split into two parts that ran simultaneously on two graphics cards. The ShuffleNet paper published by Face++ (Megvii) in 2017 built on grouped convolution and proposed shuffled grouped convolution, in which the sub-groups are randomly shuffled and regrouped after grouping; grouped convolution reduces the parameter count, and the shuffling increases the flow of information between channels. The lightweight network framework MobileNet v1, published by Google in 2017, applies depthwise separable convolution, which splits a conventional convolution into a depthwise convolution followed by a 1×1 (pointwise) convolution, greatly reducing the parameter count.
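To make the savings concrete, consider a 3×3 convolution mapping 64 input channels to 128 output channels: a standard convolution needs 64×128×3×3 = 73,728 weights, while the separable version needs only 64×3×3 + 64×128 = 8,768, roughly an 8× reduction. A minimal Python check of this arithmetic (an illustration, not part of the patent):

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def separable_params(c_in, c_out, k):
    """Depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    return c_in * k * k + c_in * c_out

print(conv_params(64, 128, 3))       # 73728
print(separable_params(64, 128, 3))  # 8768 -> about 1/8.4 of the standard conv
```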
Disclosure of Invention
The technical problem the invention aims to solve is to provide a sound scene recognition method based on a tightly-convolved neural network model that reduces model complexity while maintaining good performance.
The technical scheme adopted by the invention is as follows. A sound scene recognition method based on a tightly-convolved neural network model comprises the following steps:
1) Establishing a tightly-convolved neural network model for sound scene classification;
2) Inputting a training set containing audio files of different scene categories, together with the corresponding scene labels, into the tightly-convolved neural network model for sound scene classification, and training the model;
3) Reading an audio file and preprocessing it to obtain audio signal segments;
4) Extracting a log-mel spectrogram from said audio signal segments;
5) Inputting the log-mel spectrogram into the trained tightly-convolved neural network model for sound scene classification to obtain the final sound scene category.
The tightly-convolved neural network model in step 1) comprises the following components connected in series:
the first feature extraction module, which performs one pass of feature extraction on the received log-mel spectrogram with different convolution kernels and applies a nonlinear transformation, yielding first nonlinear feature maps under the different convolution kernels;
the second feature extraction module, which performs a second feature extraction on the first nonlinear feature maps with different convolution kernels and applies a second nonlinear transformation, yielding second nonlinear feature maps under the different convolution kernels;
the tight convolution module, formed by connecting n tight convolution units in series, which successively extracts depth features from the second nonlinear feature maps with different convolution kernels;
and the Softmax layer, which performs a weighted decision over the finally extracted depth feature maps under the different convolution kernels and outputs the sound scene recognition result.
The first feature extraction module and the second feature extraction module have the same structure, each comprising a convolution layer that extracts features from the received input with different convolution kernels and a ReLU activation function layer that applies a nonlinear transformation to the extracted feature maps under the different convolution kernels.
The n tight convolution units have the same structure, each comprising a tight convolution layer that extracts depth features from the received feature maps with different convolution kernels and a ReLU activation function layer that applies a nonlinear transformation to the feature maps extracted by the tight convolution layer under the different convolution kernels.
The tight convolution layer comprises the following components connected in series: a depthwise convolution layer that extracts feature maps with different convolution kernels, a channel compression layer that reduces the number of feature maps extracted by the depthwise convolution layer under the different convolution kernels, and a 1×1 convolution layer that performs a 1×1 convolution on the reduced feature maps.
The channel compression layer operates as follows:
(1) Group the received feature maps extracted by the depthwise convolution layer under different convolution kernels, with two or more feature maps in each group;
(2) For each group, compare or average the parameters at the same position across all feature maps in the group, and take the maximum value from the comparison (or the computed average) as the parameter at that position of a new feature map, thereby obtaining the new feature map of the group;
(3) Output each new feature map.
The preprocessing in step 3) cuts the input signal into segments with a fixed duration of 10 s.
Step 4) comprises:
(1) Framing and windowing the input audio signal segment;
(2) Passing the resulting audio frames through a mel filter bank, computing the energy passing through each mel filter within each time step of the audio frames, forming the energies of all mel filters within each time step into an energy vector, and combining the energy vectors over all time steps to obtain the two-dimensional mel spectrogram of the corresponding audio frames;
(3) Taking the logarithm of the two-dimensional mel spectrogram to obtain the log-mel spectrogram.
Compared with a conventional convolutional neural network model, the sound scene recognition method based on a tightly-convolved neural network model greatly reduces the parameter count and the number of floating-point operations while keeping the same accuracy. The resulting lightweight sound scene recognition network has low computation, memory, and power requirements and lowers the demand on computing resources, so it is more practical and better suited for deployment on mobile devices. The invention effectively reduces redundant feature information and avoids processing unnecessary features in the neural network; it ensures that effective features are fully exploited so that accuracy is unchanged, while simplifying the network model to reduce memory consumption, making sound scene recognition more efficient and better matching the performance constraints of sound scene recognition devices.
Drawings
FIG. 1 is a schematic diagram of the structure of the tightly-convolved neural network model in the sound scene recognition method of the present invention;
FIG. 2 is a schematic diagram of the structure of a tight convolution unit in the tightly-convolved neural network model;
FIG. 3 is a schematic diagram of the composition of the channel compression layer in a tight convolution unit.
Detailed Description
The sound scene recognition method based on a tightly-convolved neural network model of the present invention is described in detail below with reference to embodiments and the accompanying drawings.
The invention discloses a sound scene recognition method based on a tightly-convolved neural network model, comprising the following steps:
1) Establishing a tightly-convolved neural network model for sound scene classification; as shown in fig. 1, the tightly-convolved neural network model comprises the following components connected in series:
the first feature extraction module 1, which performs one pass of feature extraction on the received log-mel spectrogram with different convolution kernels and applies a nonlinear transformation, yielding first nonlinear feature maps under the different convolution kernels;
the second feature extraction module 2, which performs a second feature extraction on the first nonlinear feature maps with different convolution kernels and applies a second nonlinear transformation, yielding second nonlinear feature maps under the different convolution kernels;
the tight convolution module 3, formed by connecting n tight convolution units 3.1, 3.2, …, 3.n in series, which successively extracts depth features from the second nonlinear feature maps with different convolution kernels;
and the Softmax layer 4, which performs a weighted decision over the finally extracted depth feature maps under the different convolution kernels and outputs the sound scene recognition result. Wherein:
the first feature extraction module 1 and the second feature extraction module 2 have the same structure and both comprise: and the ReLU activation function layer is used for carrying out characteristic extraction on the received logarithmic Meier diagram by adopting different convolution kernels and carrying out nonlinear transformation on the extracted characteristic diagram under the different convolution kernels.
As shown in fig. 2, the n tight convolution units 3.1, 3.2, …, 3.n have the same structure; each comprises a tight convolution layer that extracts depth features from the received feature maps with different convolution kernels, and a ReLU activation function layer that applies a nonlinear transformation to the feature maps extracted by the tight convolution layer under the different convolution kernels. The tight convolution layer comprises the following components connected in series: a depthwise convolution layer 3.10 that extracts feature maps with different convolution kernels, a channel compression layer 3.11 that reduces the number of feature maps extracted by the depthwise convolution layer 3.10 under the different convolution kernels, and a 1×1 convolution layer 3.12 that performs a 1×1 convolution on the reduced feature maps.
The channel compression layer 3.11 reduces information redundancy between channels and the amount of parameter computation. It fuses different feature maps across channels, keeping the parameters that carry the most information among two or more feature maps and discarding parameters with relatively little information; this reduces information redundancy and the parameter count of the next convolution while preserving the integrity of the features. To further reduce computation, separable convolution replaces ordinary convolution; combined with channel compression, this reduces model complexity well without lowering model accuracy. As shown in fig. 3, the channel compression layer 3.11 operates as follows (a code sketch follows this list):
(1) Group the received feature maps extracted by the depthwise convolution layer 3.10 under different convolution kernels, with two or more feature maps in each group;
(2) For each group, compare or average the parameters at the same position across all feature maps in the group, and take the maximum value from the comparison (or the computed average) as the parameter at that position of a new feature map, thereby obtaining the new feature map of the group;
(3) Output each new feature map.
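The following PyTorch sketch renders the tight convolution unit of fig. 2 under one possible reading of the above description (depthwise convolution → channel compression by group-wise max or mean → 1×1 convolution → ReLU). It is an illustrative sketch, not the patented implementation: the names `ChannelCompression`, `TightConvUnit`, `tight_coeff`, and `mode` are ours, and the 3×3 depthwise kernel size is an assumption.

```python
import torch
import torch.nn as nn

class ChannelCompression(nn.Module):
    """Group channels and keep one map per group by taking the element-wise
    max or mean over each group (hypothetical reading of the patent)."""

    def __init__(self, tight_coeff=2, mode="max"):
        super().__init__()
        self.g = tight_coeff   # feature maps per group ("tight coefficient")
        self.mode = mode       # "max" or "mean"

    def forward(self, x):
        b, c, h, w = x.shape
        assert c % self.g == 0, "channel count must divide into groups"
        x = x.view(b, c // self.g, self.g, h, w)
        if self.mode == "max":
            return x.max(dim=2).values
        return x.mean(dim=2)

class TightConvUnit(nn.Module):
    """Depthwise conv -> channel compression -> 1x1 conv -> ReLU."""

    def __init__(self, c_in, c_out, tight_coeff=2, mode="max"):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)
        self.compress = ChannelCompression(tight_coeff, mode)
        self.pointwise = nn.Conv2d(c_in // tight_coeff, c_out, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.compress(self.depthwise(x))))

x = torch.randn(1, 64, 128, 431)   # (batch, channels, mel bins, time steps)
y = TightConvUnit(64, 128)(x)
print(y.shape)                     # torch.Size([1, 128, 128, 431])
```

With a tight coefficient of g, the compression step cuts the channel count, and hence the parameters of the following 1×1 convolution, by a factor of g.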
2) Inputting a training set containing audio files of different scene categories, together with the corresponding scene labels, into the tightly-convolved neural network model for sound scene classification, and training the model;
3) Reading an audio file and preprocessing it to obtain audio signal segments; the preprocessing cuts the input signal into segments with a fixed duration of 10 s.
4) Extracting a log-mel spectrogram from said audio signal segments, which comprises the following steps (see the sketch after this list):
(1) A speech signal is a typical non-stationary signal, but because the motion of the vocal organs is very slow compared with the speed of sound-wave vibration, a speech signal is generally considered stationary over a span of 10 ms to 30 ms; the input audio signal segments are therefore framed and windowed;
(2) Pass the resulting audio frames through a mel filter bank, compute the energy passing through each mel filter within each time step of the audio frames, form the energies of all mel filters within each time step into an energy vector, and combine the energy vectors over all time steps to obtain the two-dimensional mel spectrogram of the corresponding audio frames;
the energy passing through each Mel filter in each time step range in the audio frame is calculated by adopting the following formula:
where M is the number of Mel filters, H (k) is the transfer function of the Mel filters, and X (k) is the magnitude value of the corresponding FFT.
(3) Take the logarithm of the two-dimensional mel spectrogram to obtain the log-mel spectrogram.
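For concreteness, steps (1)-(3) can be sketched with librosa as below. This is an assumption-laden illustration: the patent names no library, the 48 kHz sampling rate is a guess, and the filter count (134), window length (1704), and overlap (852) are taken from the embodiment described further below.

```python
import numpy as np
import librosa

def log_mel_spectrogram(path, sr=48000, n_fft=2048, win_length=1704,
                        hop_length=852, n_mels=134, duration=10.0):
    """Log-mel spectrogram of a fixed-duration audio segment.
    sr is an assumption; the patent does not state the sampling rate."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, win_length=win_length,
        hop_length=hop_length, window="hamming", n_mels=n_mels, power=2.0)
    return np.log(mel + 1e-10)   # small offset avoids log(0)

logmel = log_mel_spectrogram("scene.wav")
print(logmel.shape)   # (134, number of time steps)
```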
5) Inputting the log-mel spectrogram into the trained tightly-convolved neural network model for sound scene classification to obtain the final sound scene category.
To verify the effectiveness of the sound scene recognition method based on a tightly-convolved neural network model, this section compares the MobileNet v1 and CNN8 network frameworks with and without tight convolution. The DCASE2019 data set is used; training uses the Adam optimizer with adaptive learning-rate decay and a batch size of 32. The prediction process involves only forward propagation, and the specific steps are as follows (a code sketch follows the list):
1. Reading the audio signal and truncating it into speech segments with a fixed duration of 10 s;
2. Framing and windowing each fixed-duration speech signal, with 2048 sampling points per frame and a Hamming window of length 2048 applied to each frame;
3. Extracting features from the framed signal through a mel filter bank and taking the logarithm, with 134 filters, a filter window length of 1704 points, and an overlap of 852 points between frames;
4. Inputting the mel spectrogram into the tightly-convolved neural network model and performing forward propagation to obtain the final sound scene category.
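Putting the sketches together, a minimal forward-propagation (prediction) pass might look as follows. `TightConvNet` is a hypothetical stand-in for the full network of fig. 1, whose exact layer widths and unit count the patent does not enumerate; the 10-class output matches the DCASE2019 acoustic scene task.

```python
import torch

# log_mel_spectrogram and TightConvUnit are the sketches defined above.
class TightConvNet(torch.nn.Module):
    """Two ordinary convolution blocks, n tight convolution units,
    and a softmax classifier (a hypothetical rendering of fig. 1)."""

    def __init__(self, n_classes=10, n_units=4):
        super().__init__()
        self.features = torch.nn.Sequential(
            torch.nn.Conv2d(1, 32, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(32, 64, 3, padding=1), torch.nn.ReLU(),
            *[TightConvUnit(64 * 2**i, 64 * 2**(i + 1)) for i in range(n_units)])
        self.classifier = torch.nn.Linear(64 * 2**n_units, n_classes)

    def forward(self, x):
        h = self.features(x).mean(dim=(2, 3))   # global average pooling
        return torch.softmax(self.classifier(h), dim=1)

model = TightConvNet()
model.eval()
x = torch.from_numpy(log_mel_spectrogram("scene.wav")).float()
with torch.no_grad():
    probs = model(x[None, None])   # add batch and channel dimensions
print(int(probs.argmax(dim=1)))    # index of the predicted scene category
```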
For the MobileNet v1 network, every convolution except the first-layer convolution (which keeps its original structure) is replaced by tight convolution, with the tight coefficient set to 4 and the averaging method used in the channel compression operation. The fully connected layer of the original MobileNet v1 is also removed, further reducing the parameter count.
The CNN8 network is essentially a simplified VGG16. To retain more of the original features, the first and second layers still use ordinary convolution, while the remaining convolution layers replace ordinary convolution with tight convolution, with the tight coefficient set to 2 and the maximum-value method used in the channel compression operation.
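In terms of the sketch above, the two experimental settings would correspond to instantiations like the following (hypothetical, since the patent describes these configurations only in prose):

```python
# MobileNet v1 variant: tight coefficient 4, channel compression by averaging.
mobilenet_unit = TightConvUnit(64, 128, tight_coeff=4, mode="mean")

# CNN8 variant: tight coefficient 2, channel compression by maximum.
cnn8_unit = TightConvUnit(64, 128, tight_coeff=2, mode="max")
```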
Table 1 compares six neural network models. With the width scaling factor set to 1.0 in the MobileNet v1 framework, the parameter count and number of floating-point operations of the tight-convolution MobileNet v1 are about 1/4 of those of the ordinary-convolution network, and the accuracy is improved. Although it has somewhat more parameters and floating-point operations than MobileNet v1 with a width scaling factor of 0.5, the accuracy of that smaller MobileNet v1 is two percentage points lower. This shows that tight convolution handles the feature information more appropriately during channel compression, discarding redundant information while retaining more valuable information, which makes the model more efficient. Comparing the tight-convolution CNN8 with the ordinary-convolution CNN8, the parameter count and floating-point operations drop to about 1/14 while the accuracy stays essentially unchanged, so the model compression effect is even more pronounced.
Table 1 Comparison of various sound scene recognition algorithms

Claims (1)

1. A sound scene recognition method based on a tightly-convolved neural network model, characterized by comprising the following steps:
1) Establishing a tightly-convolved neural network model for sound scene classification; the tightly-convolved neural network model comprises the following components connected in series:
the first feature extraction module (1), which performs one pass of feature extraction on the received log-mel spectrogram with different convolution kernels and applies a nonlinear transformation, yielding first nonlinear feature maps under the different convolution kernels;
the second feature extraction module (2), which performs a second feature extraction on the first nonlinear feature maps with different convolution kernels and applies a second nonlinear transformation, yielding second nonlinear feature maps under the different convolution kernels;
the tight convolution module (3), formed by connecting n tight convolution units (3.1, 3.2, …, 3.n) in series, which successively extracts depth features from the second nonlinear feature maps with different convolution kernels;
the Softmax layer (4), which performs a weighted decision over the finally extracted depth feature maps under the different convolution kernels and outputs the sound scene recognition result;
the first feature extraction module (1) and the second feature extraction module (2) have the same structure and both comprise: the ReLU activation function layer is used for carrying out nonlinear transformation on the feature graphs under the extracted different convolution kernels;
the n tight convolution units (3.1, 3.2, 3. N) have the same structure and comprise a tight convolution layer for extracting depth features of the received feature images by adopting different convolution kernels and a ReLU activation function layer for performing nonlinear transformation on the feature images extracted by the tight convolution layer under the different convolution kernels;
the tight convolution layer comprises the following components in series: a depth convolution layer (3.10) for extracting feature images using different convolution kernels, a channel compression layer (3.11) for reducing feature images under the different convolution kernels extracted by the depth convolution layer (3.10), and a 1 x 1 convolution layer (3.12) for performing 1 x 1 convolution on the reduced feature images;
the channel compression layer (3.11) comprises:
(1) Grouping received characteristic diagrams under different convolution kernels extracted by a depth convolution layer (3.10), wherein each group has more than 2 characteristic diagrams;
(2) Comparing or averaging the parameters of the same position of all the feature images of each group, and taking the maximum value or the obtained average value in the comparison result as the parameter of the same position of the new feature image, thereby obtaining the new feature image of the group;
(3) Outputting each group of new feature graphs;
2) Inputting a training set containing audio files of different scene categories, together with the corresponding scene labels, into the tightly-convolved neural network model for sound scene classification, and training the model;
3) Reading an audio file and preprocessing it to obtain audio signal segments;
the preprocessing cuts the input signal into segments with a fixed duration of 10 s;
4) Extracting a log-mel spectrogram from said audio signal segments, comprising:
(1) Framing and windowing the input audio signal segment;
(2) Passing the resulting audio frames through a mel filter bank, computing the energy passing through each mel filter within each time step of the audio frames, forming the energies of all mel filters within each time step into an energy vector, and combining the energy vectors over all time steps to obtain the two-dimensional mel spectrogram of the corresponding audio frames;
(3) Taking the logarithm of the two-dimensional mel spectrogram to obtain the log-mel spectrogram;
5) Inputting the log-mel spectrogram into the trained tightly-convolved neural network model for sound scene classification to obtain the final sound scene category.
CN201910960583.5A 2019-10-10 2019-10-10 Sound scene recognition method based on neural network model of tight convolution Active CN110796027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910960583.5A CN110796027B (en) 2019-10-10 2019-10-10 Sound scene recognition method based on neural network model of tight convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910960583.5A CN110796027B (en) 2019-10-10 2019-10-10 Sound scene recognition method based on neural network model of tight convolution

Publications (2)

Publication Number Publication Date
CN110796027A CN110796027A (en) 2020-02-14
CN110796027B (en) 2023-10-17

Family

ID=69438941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910960583.5A Active CN110796027B (en) 2019-10-10 2019-10-10 Sound scene recognition method based on neural network model of tight convolution

Country Status (1)

Country Link
CN (1) CN110796027B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461172B (en) * 2020-03-04 2023-05-30 哈尔滨工业大学 Lightweight characteristic fusion method of hyperspectral remote sensing data based on two-dimensional point group convolution
CN112016639B (en) * 2020-11-02 2021-01-26 四川大学 Flexible separable convolution framework and feature extraction method and application thereof in VGG and ResNet
CN113539283A (en) * 2020-12-03 2021-10-22 腾讯科技(深圳)有限公司 Audio processing method and device based on artificial intelligence, electronic equipment and storage medium
CN112634928B (en) * 2020-12-08 2023-09-29 北京有竹居网络技术有限公司 Sound signal processing method and device and electronic equipment
CN112786057B (en) * 2021-02-23 2023-06-02 厦门熵基科技有限公司 Voiceprint recognition method and device, electronic equipment and storage medium
CN113281660A (en) * 2021-05-21 2021-08-20 张家港清研检测技术有限公司 Method for detecting unqualified battery cell in retired power battery pack
CN113793622B (en) * 2021-09-10 2023-08-29 中国科学院声学研究所 Audio scene recognition method, system and device


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190066657A1 (en) * 2017-08-31 2019-02-28 National Institute Of Information And Communications Technology Audio data learning method, audio data inference method and recording medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107895192A (en) * 2017-12-06 2018-04-10 广州华多网络科技有限公司 Depth convolutional network compression method, storage medium and terminal
CN108231067A (en) * 2018-01-13 2018-06-29 福州大学 Sound scenery recognition methods based on convolutional neural networks and random forest classification
CN108520757A (en) * 2018-03-31 2018-09-11 华南理工大学 Music based on auditory properties is applicable in scene automatic classification method
CN109448702A (en) * 2018-10-30 2019-03-08 上海力声特医学科技有限公司 Artificial cochlea's auditory scene recognition methods
CN109978137A (en) * 2019-03-20 2019-07-05 厦门美图之家科技有限公司 A kind of processing method of convolutional neural networks
CN110085218A (en) * 2019-03-26 2019-08-02 天津大学 A kind of audio scene recognition method based on feature pyramid network
CN110188863A (en) * 2019-04-30 2019-08-30 杭州电子科技大学 A kind of convolution kernel and its compression algorithm of convolutional neural networks
CN110223715A (en) * 2019-05-07 2019-09-10 华南理工大学 It is a kind of based on sound event detection old solitary people man in activity estimation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on convolutional neural networks for abnormal sound recognition; Hu Tao et al.; Signal Processing (《信号处理》); 357-367 *

Also Published As

Publication number Publication date
CN110796027A (en) 2020-02-14

Similar Documents

Publication Publication Date Title
CN110796027B (en) Sound scene recognition method based on neural network model of tight convolution
CN110782878B (en) Attention mechanism-based multi-scale audio scene recognition method
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN109841226B (en) Single-channel real-time noise reduction method based on convolution recurrent neural network
CN111933188B (en) Sound event detection method based on convolutional neural network
CN110808033B (en) Audio classification method based on dual data enhancement strategy
CN110033756B (en) Language identification method and device, electronic equipment and storage medium
CN109448719A (en) Establishment of Neural Model method and voice awakening method, device, medium and equipment
CN111508524B (en) Method and system for identifying voice source equipment
CN109559755A (en) A kind of sound enhancement method based on DNN noise classification
CN111599376A (en) Sound event detection method based on cavity convolution cyclic neural network
CN112183582A (en) Multi-feature fusion underwater target identification method
CN112735466B (en) Audio detection method and device
CN111862978A (en) Voice awakening method and system based on improved MFCC (Mel frequency cepstrum coefficient)
Yu Research on music emotion classification based on CNN-LSTM network
CN115035887A (en) Voice signal processing method, device, equipment and medium
CN113948107A (en) Engine fault diagnosis method based on end-to-end CNN fault diagnosis model
CN113488069A (en) Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network
CN117524252B (en) Light-weight acoustic scene perception method based on drunken model
Zhang et al. Filamentary Convolution for Spoken Language Identification: A Brain-Inspired Approach
CN113505266B (en) Two-stage anchor-based dynamic video abstraction method
Cai et al. A Contrastive Semi-Supervised Learning Framework For Anomaly Sound Detection.
CN114464201A (en) Single-channel speech enhancement method based on attention mechanism and convolutional neural network
Cao et al. LightCAM: A Fast and Light Implementation of Context-Aware Masking based D-Tdnn for Speaker Verification
CN114997210A (en) Machine abnormal sound identification and detection method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant