CN110176250B - Robust acoustic scene recognition method based on local learning - Google Patents
- Publication number: CN110176250B (application number CN201910464699.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- sample
- acoustic scene
- samples
- scene recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
Abstract
The invention provides a robust acoustic scene recognition method based on local learning, belonging to the technical field of sound signal processing. First, sound signals of different acoustic scenes are collected and frequency-domain features are extracted; the extracted feature data are then preprocessed. Next, mean shift is applied to the normalized data, and the data are expanded with the mixup method. A convolutional neural network model is then established according to the local-learning idea, and the expanded training sample set is input into the model for training to obtain a trained model. Finally, a sample to be recognized undergoes frequency-domain feature extraction and data preprocessing in turn and is input into the trained model for recognition, yielding the acoustic scene recognition result. The method solves the problem of low acoustic scene recognition accuracy under audio channel mismatch and imbalanced numbers of samples per channel, and can be applied to acoustic scene recognition with multiple channels and imbalanced numbers of samples per channel.
Description
Technical Field
The invention relates to an acoustic scene recognition method, and belongs to the technical field of sound signal processing.
Background
Acoustic scene recognition can be widely applied in fields such as robotics and driverless vehicles, which need to perceive the surrounding sound environment effectively. In the real world, however, there is often more than one sound acquisition device, and because different devices have different channel characteristics, the acquired signals are usually not identical. How to automatically and accurately classify the scenes of sounds input through different channels, and thus achieve robust acoustic scene recognition, has become an urgent and challenging research topic.
To achieve robust acoustic scene recognition, prior knowledge of the data must be fully exploited. At present, most methods perform acoustic scene recognition on clean audio or within a single channel, for example recognition based on convolutional neural networks, hidden Markov models, or recurrent neural networks. These techniques neither compensate for the channel mismatch of the audio data nor adjust for imbalanced amounts of data per device type, so when applied in a real environment with multiple channels and imbalanced numbers of samples per channel, their acoustic scene recognition accuracy is low and cannot meet the requirements of practical tasks.
Disclosure of Invention
The invention provides a robust acoustic scene recognition method based on local learning, which aims to solve the problem of low acoustic scene recognition accuracy under audio channel mismatch and imbalanced numbers of samples per channel.
The invention relates to a robust acoustic scene recognition method based on local learning, which is realized by the following technical scheme:
Step one, collecting sound signals of different acoustic scenes and extracting frequency-domain features, namely 40-dimensional FBank features of the sound signals, to establish a training sample set;
step two, preprocessing the characteristic data extracted in the step one:
calculating the mean value and the standard deviation of the features extracted in the step one on each dimension, and normalizing all the features by using the obtained mean value and standard deviation;
step three, channel adaptation and data expansion:
carrying out mean shift on the normalized data; then, performing data expansion by using a mixup method;
Step four, establishing a convolutional neural network model according to the local learning idea, and constructing a loss function so that the nearest distance from any sample point to a homogeneous sample point is smaller than the nearest distance from that same sample point to a heterogeneous sample point; inputting the training sample set after data expansion into the convolutional neural network model for training to obtain a trained model; the homogeneous sample points are sample points belonging to the same audio scene as the given sample point; the heterogeneous sample points are sample points belonging to a different audio scene from the given sample point;
Step five, performing frequency-domain feature extraction and data preprocessing on the sample to be recognized in turn, then inputting it into the trained model for recognition to obtain the acoustic scene recognition result.
The most prominent characteristics and remarkable beneficial effects of the invention are as follows:
the invention relates to a robust acoustic scene recognition method based on local learning, which is characterized by collecting sound signals of different acoustic scenes, extracting FBank characteristics of the sound signals to establish a training sample set, then carrying out mean shift on the training sample set to increase the robustness of a system, and generating a new sample by using a mixup method to carry out data expansion so as to solve the problem of unbalanced equipment category number. The method has the characteristics of easy realization and good reliability, and can effectively identify the acoustic scene under the conditions of audio channel mismatching and unbalanced number of samples of different channels, thereby being suitable for popularization and use; compared with the traditional deep learning method, the method has better identification effect and faster calculation speed, and in a simulation experiment, the method obtains an average accuracy rate of 55% on a small amount of equipment, and the accuracy rate is 9.4% higher than that of a general deep learning method by 45.6%.
Drawings
FIG. 1 is a schematic diagram of the calculation of the mean and standard deviation in step two of the present invention;
FIG. 2 is a schematic diagram of a convolutional neural network model established based on the idea of local learning in the present invention; in fig. 2, a filled circle represents an anchor point, an open circle represents a sample point that is the same kind as the anchor point, and an open triangle represents a sample point that is different kind from the anchor point.
Detailed Description
The first embodiment is as follows: the robust acoustic scene recognition method based on local learning provided by the embodiment specifically comprises the following steps:
Step one, collecting sound signals of different acoustic scenes at a sampling frequency of 44.1 kHz and extracting frequency-domain features: the collected audio is segmented into a sequence of frames with a frame length of 40 ms, and 40-dimensional FBank (filter bank) features are extracted from each frame of data to establish a training sample set;
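The framing and FBank extraction of step one can be sketched in numpy as follows. The 44.1 kHz sampling rate, 40 ms frame length, and 40 mel bands come from the description; the hop size, window function, and FFT length are illustrative assumptions, since the patent does not specify them.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filterbank matrix of shape (n_mels, n_fft//2 + 1)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):           # rising edge of the triangle
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling edge of the triangle
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def fbank_features(signal, sr=44100, frame_ms=40, n_mels=40, n_fft=2048):
    """Segment a signal into 40 ms frames and extract 40-dim log-mel (FBank) features."""
    frame_len = int(sr * frame_ms / 1000)       # 40 ms -> 1764 samples at 44.1 kHz
    hop = frame_len // 2                        # 50% overlap (assumed)
    fb = mel_filterbank(n_mels, n_fft, sr)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        feats.append(np.log(fb @ power + 1e-10))  # log filterbank energies
    return np.array(feats)                      # shape: (n_frames, n_mels)
```

One second of audio then yields a matrix with one 40-dimensional FBank vector per frame.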
step two, preprocessing the characteristic data extracted in the step one:
calculating the mean value and the standard deviation of the features extracted in the step one in each dimension, as shown in fig. 1, calculating the mean value mu of all samples along the time axis direction, and calculating the standard deviation sigma by the same method; normalizing all the characteristics by using the obtained mean value and standard deviation;
step three, channel adaptation and data expansion:
After the processing of step two, the feature data of all the different channels have been normalized with the same mean and standard deviation, which reduces the differences between channels. Mean shift is therefore performed on the normalized data (the difference between the mean of the sample data acquired by the main device and the mean of the sample data acquired by the other devices is computed, and this difference is added to the training sample data with a certain probability; this is called mean shift) to increase the robustness of the system. The mixup method (a data augmentation method) is then used to generate new samples for data expansion, alleviating the imbalance in the number of samples per device type.
Step four, establishing a convolutional neural network model according to the local learning idea, and constructing a loss function so that the nearest distance from any sample point to a homogeneous sample point is smaller than the nearest distance from that same sample point to a heterogeneous sample point; inputting the training sample set after data expansion into the convolutional neural network model for training to obtain a trained model; the homogeneous sample points here are sample points belonging to the same audio scene as the given sample point (anchor point); the heterogeneous sample points are sample points belonging to a different audio scene from the given sample point (anchor point);
Step five, performing frequency-domain feature extraction and data preprocessing on the sample to be recognized in turn, then inputting it into the trained model for recognition to obtain the acoustic scene recognition result.
The second embodiment is as follows: the difference between this embodiment and the first embodiment is that the normalization of all the features by using the obtained mean and standard deviation in the second step specifically includes:
The obtained mean and standard deviation are used to normalize the feature data according to the following formula:

x_norm = (x - μ) / σ  (1)

where x_norm represents the normalized data, μ is the mean, σ is the standard deviation, and x represents the feature data.
Other steps and parameters are the same as those in the first embodiment.
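The per-dimension normalization of embodiment two can be sketched as follows; the function names are illustrative, not from the patent.

```python
import numpy as np

def fit_normalizer(features):
    """Compute the per-dimension mean and standard deviation over all frames.

    `features` has shape (n_frames, n_dims); statistics are taken along
    the time axis, one value per feature dimension.
    """
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-8   # small floor guards against zero variance
    return mu, sigma

def normalize(features, mu, sigma):
    """Apply formula (1): x_norm = (x - mu) / sigma, dimension-wise."""
    return (features - mu) / sigma
```

The same (mu, sigma) fitted on the training set would also be applied to samples at recognition time, which is what makes the channel differences visible to the mean-shift step that follows.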
The third concrete implementation mode: the difference between this embodiment and the second embodiment is that the mean shift in step three is specifically:
A difference ε is added to the normalized data with probability p:

ε = μ_most - (1/N) · Σ_{i=1}^{N} μ_i  (2)

where μ_most denotes the data mean vector of the device that collected the largest number of samples; N denotes the number of devices other than that device; and μ_i denotes the data mean vector of the i-th of those other devices, i = 1, …, N. To increase the robustness of the system, the difference is not added to all the data but only with probability p, p ∈ [0, 1].
Other steps and parameters are the same as those in the second embodiment.
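The mean shift of embodiment three can be sketched as follows, assuming ε is the difference between the main device's mean vector and the average of the other devices' mean vectors, as the description states; the function name and numpy framing are illustrative.

```python
import numpy as np

def mean_shift(x, mu_most, other_mus, p=0.5, rng=None):
    """Add the device-mean difference epsilon to a sample with probability p.

    epsilon = mu_most - (1/N) * sum_i mu_i, where mu_most is the mean vector
    of the device with the most samples and mu_i are the mean vectors of the
    N other devices.
    """
    rng = rng or np.random.default_rng()
    eps = mu_most - np.mean(other_mus, axis=0)
    if rng.random() < p:          # shift only with probability p, not always
        return x + eps
    return x
```

Applied during training, this randomly pushes main-device samples toward the statistics of the minority devices, which is the source of the robustness claim.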
The fourth concrete implementation mode: the difference between this embodiment and the first, second or third embodiment is that the data expansion using the mixup method in step three is specifically:
The mixup method generates a new sample by combining two known samples. One sample (x_j, y_j) is randomly selected from the data collected by the device that collected the largest number of samples, and another sample (x_i, y_i) is randomly selected from the data collected by the other devices; the two samples are combined to generate a new sample (x̃, ỹ), whose feature data x̃ and corresponding label ỹ are computed as follows:

x̃ = λ·x_i + (1 - λ)·x_j
ỹ = λ·y_i + (1 - λ)·y_j  (3)

where λ denotes the mixing coefficient, λ ∈ [0, 1]; x_i and y_i denote the feature data and label of sample (x_i, y_i), and x_j and y_j denote the feature data and label of sample (x_j, y_j).
Other steps and parameters are the same as those in the first, second or third embodiment.
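The mixup combination of embodiment four can be sketched as follows, assuming one-hot label vectors so that labels can be mixed linearly:

```python
import numpy as np

def mixup(xi, yi, xj, yj, lam=0.1):
    """Generate a new sample from two known samples.

    Implements formula (3): x_new = lam*xi + (1-lam)*xj, and the same
    convex combination for the (one-hot) labels.
    """
    x_new = lam * xi + (1 - lam) * xj
    y_new = lam * yi + (1 - lam) * yj
    return x_new, y_new
```

Here xi would come from one of the minority devices and xj from the device with the most samples, so the synthetic samples enlarge the under-represented side of the data.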
The fifth concrete implementation mode: the present embodiment is described with reference to fig. 2, and is different from the fourth embodiment in that the loss function in step four is specifically:
L=max(0,dap-dan+α) (4)
As shown in fig. 2, for any anchor point (filled circle) the condition d_ap + α < d_an needs to be satisfied, which gives the loss function L above;

where the anchor point is a given sample point; d_ap denotes the nearest Euclidean distance from the sample point (anchor point) to a homogeneous sample point, d_an denotes the nearest Euclidean distance from the sample point (anchor point) to a heterogeneous sample point, and α denotes the minimum margin between d_ap and d_an.
Other steps and parameters are the same as those in the fourth embodiment.
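The local-learning loss of embodiment five can be sketched for a single anchor within a batch of embeddings. The batch-based formulation and function name are assumptions for illustration; the patent itself defines only the per-anchor loss L = max(0, d_ap − d_an + α).

```python
import numpy as np

def local_loss(embeddings, labels, anchor_idx, alpha=1.5):
    """L = max(0, d_ap - d_an + alpha) for one anchor point.

    d_ap: nearest Euclidean distance to a homogeneous (same-scene) point,
    d_an: nearest Euclidean distance to a heterogeneous point.
    Assumes the batch contains at least one other same-class point and
    at least one different-class point.
    """
    anchor = embeddings[anchor_idx]
    dists = np.linalg.norm(embeddings - anchor, axis=1)
    same = labels == labels[anchor_idx]
    diff = ~same
    same[anchor_idx] = False           # the anchor is not its own neighbour
    d_ap = dists[same].min()
    d_an = dists[diff].min()
    return max(0.0, d_ap - d_an + alpha)
```

When the nearest homogeneous point is already more than α closer than the nearest heterogeneous point, the loss is zero, matching the geometry of fig. 2.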
Examples
The following examples were used to demonstrate the beneficial effects of the present invention:
The method is compared with a general deep learning method on the acoustic scene recognition data set of the international public DCASE2018 Task1-Subtask B; the procedure is as follows:
step one, segmenting audio in an international public data set DCASE2018 Task1-Subtask B acoustic scene recognition data set into a frame sequence, wherein the frame length is 40ms, extracting 40-dimensional FBank features from each frame of data, and establishing a training sample set by using the extracted FBank features;
Step two, calculating the mean μ and the standard deviation σ on each dimension of the features extracted in step one, and normalizing all the features with the obtained mean and standard deviation; the normalization expression is:

x_norm = (x - μ) / σ  (1)
thirdly, mean shift is carried out on the normalized data:
The difference ε is added to the normalized data with probability p = 0.5:

ε = μ_most - (1/N) · Σ_{i=1}^{N} μ_i  (2)
New samples (x̃, ỹ) are then generated with the mixup method to perform data expansion; the feature data x̃ and corresponding label ỹ of a new sample are computed as follows:

x̃ = λ·x_i + (1 - λ)·x_j
ỹ = λ·y_i + (1 - λ)·y_j  (3)

where the mixing coefficient λ is set to 0.1;
step four, establishing a convolutional neural network model according to the local learning idea, and constructing a loss function:
L=max(0,dap-dan+α) (4)
α is set to 1.5;
inputting the training sample set subjected to data expansion into the convolutional neural network model for training to obtain a trained model;
Step five, performing frequency-domain feature extraction and data preprocessing on the sample to be recognized in turn, then inputting it into the trained model for recognition to obtain the acoustic scene recognition result.
Compared with the recognition results obtained by a general deep learning method, the method of the invention achieves an average accuracy of 55% on the verification set of the minority devices (all devices except the one that collected the largest number of samples), 9.4 percentage points higher than the 45.6% of a general deep learning method. The method can therefore effectively recognize acoustic scenes under audio channel mismatch and imbalanced numbers of samples per channel.
The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.
Claims (5)
1. A robust acoustic scene recognition method based on local learning is characterized by specifically comprising the following steps:
Step one, collecting sound signals of different acoustic scenes and extracting frequency-domain features, namely 40-dimensional FBank features of the sound signals, to establish a training sample set;
step two, preprocessing the characteristic data extracted in the step one:
calculating the mean value and the standard deviation of the features extracted in the step one on each dimension, and normalizing all the features by using the obtained mean value and standard deviation;
step three, channel adaptation and data expansion:
carrying out mean shift on the normalized data; then, performing data expansion by using a mixup method;
Step four, establishing a convolutional neural network model according to the local learning idea, and constructing a loss function so that the nearest distance from any sample point to a homogeneous sample point is smaller than the nearest distance from that same sample point to a heterogeneous sample point; inputting the training sample set after data expansion into the convolutional neural network model for training to obtain a trained model; the homogeneous sample points are sample points belonging to the same audio scene as the given sample point; the heterogeneous sample points are sample points belonging to a different audio scene from the given sample point;
Step five, performing frequency-domain feature extraction and data preprocessing on the sample to be recognized in turn, then inputting it into the trained model for recognition to obtain the acoustic scene recognition result.
2. The robust acoustic scene recognition method based on local learning according to claim 1, wherein the normalization of all features by using the obtained mean and standard deviation in the second step is specifically:
the feature data is normalized as follows:

x_norm = (x - μ) / σ  (1)

where x_norm represents the normalized data, μ is the mean, σ is the standard deviation, and x represents the feature data.
3. The robust acoustic scene recognition method based on local learning according to claim 2, wherein the mean shift in step three specifically includes:
a difference ε is added to the normalized data with probability p:

ε = μ_most - (1/N) · Σ_{i=1}^{N} μ_i  (2)

where μ_most denotes the data mean vector of the device that collected the largest number of samples; N denotes the number of devices other than that device; and μ_i denotes the data mean vector of the i-th of those other devices, i = 1, …, N.
4. The robust acoustic scene recognition method based on local learning according to claim 1, 2 or 3, wherein the data expansion using the mixup method in step three specifically comprises:
one sample (x_j, y_j) is randomly selected from the data collected by the device that collected the largest number of samples, and another sample (x_i, y_i) is randomly selected from the data collected by the other devices; the two samples are combined to generate a new sample (x̃, ỹ), whose feature data x̃ and corresponding label ỹ are computed as follows:

x̃ = λ·x_i + (1 - λ)·x_j
ỹ = λ·y_i + (1 - λ)·y_j  (3)

where λ denotes the mixing coefficient, λ ∈ [0, 1]; x_i and y_i denote the feature data and label of sample (x_i, y_i), and x_j and y_j denote the feature data and label of sample (x_j, y_j).
5. The robust acoustic scene recognition method based on local learning according to claim 4, wherein the loss function in step four is specifically:
L=max(0,dap-dan+α) (4)
where d_ap denotes the nearest Euclidean distance from a sample point to a homogeneous sample point, d_an denotes the nearest Euclidean distance from the sample point to a heterogeneous sample point, and α denotes the minimum margin between d_ap and d_an.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910464699.XA CN110176250B (en) | 2019-05-30 | 2019-05-30 | Robust acoustic scene recognition method based on local learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110176250A CN110176250A (en) | 2019-08-27 |
CN110176250B true CN110176250B (en) | 2021-05-07 |
Family
ID=67696792
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910464699.XA Active CN110176250B (en) | 2019-05-30 | 2019-05-30 | Robust acoustic scene recognition method based on local learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110176250B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110751183A (en) * | 2019-09-24 | 2020-02-04 | 东软集团股份有限公司 | Image data classification model generation method, image data classification method and device |
CN110852200B (en) * | 2019-10-28 | 2023-05-12 | 华中科技大学 | Non-contact human body action detection method |
CN112489678B (en) * | 2020-11-13 | 2023-12-05 | 深圳市云网万店科技有限公司 | Scene recognition method and device based on channel characteristics |
CN112990443B (en) * | 2021-05-06 | 2021-08-27 | 北京芯盾时代科技有限公司 | Neural network evaluation method and device, electronic device, and storage medium |
CN113793624B (en) * | 2021-06-11 | 2023-11-17 | 上海师范大学 | Acoustic scene classification method |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106952644A (en) * | 2017-02-24 | 2017-07-14 | 华南理工大学 | A kind of complex audio segmentation clustering method based on bottleneck characteristic |
CN107203777A (en) * | 2017-04-19 | 2017-09-26 | 北京协同创新研究院 | audio scene classification method and device |
CN108615532A (en) * | 2018-05-03 | 2018-10-02 | 张晓雷 | A kind of sorting technique and device applied to sound field scape |
US20180336889A1 (en) * | 2017-05-19 | 2018-11-22 | Baidu Online Network Technology (Beijing) Co., Ltd . | Method and Apparatus of Building Acoustic Feature Extracting Model, and Acoustic Feature Extracting Method and Apparatus |
CN109002529A (en) * | 2018-07-17 | 2018-12-14 | 厦门美图之家科技有限公司 | Audio search method and device |
CN109061558A (en) * | 2018-06-21 | 2018-12-21 | 桂林电子科技大学 | A kind of sound collision detection and sound localization method based on deep learning |
CN109448703A (en) * | 2018-11-14 | 2019-03-08 | 山东师范大学 | In conjunction with the audio scene recognition method and system of deep neural network and topic model |
CN109558512A (en) * | 2019-01-24 | 2019-04-02 | 广州荔支网络技术有限公司 | A kind of personalized recommendation method based on audio, device and mobile terminal |
Non-Patent Citations (7)
Title |
---|
Acoustic scene classification: an overview of DCASE 2017 challenge entries;Mesaros A 等;《2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC)》;20181105;全文 * |
Facenet: A unified embedding for face;Schroff F 等;《Proceedings of the IEEE conference on computer》;20151015;全文 * |
Feature enhancement for robust acoustic scene classification with device mismatch;Song H 等;《Tech. Rep., DCASE2019 Challenge》;20191026;全文 * |
mixup: Beyond empirical risk;Zhang H 等;《arxiv.org/abs/1710.09412》;20180427;全文 * |
Semi-supervised triplet loss based learning of ambient audio embeddings;Turpault N 等;《ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)》;20190417;全文 * |
基于数据扩充和三元组损失的不匹配声学场景的鲁棒识别方法;杨皓;《中国优秀硕士学位论文全文数据库 基础科学辑》;20200215;全文 * |
复杂场景下的音频自动标注方法;张立赛;《中国优秀硕士学位论文全文数据库 信息科技辑》;20190115;全文 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110176250B (en) | Robust acoustic scene recognition method based on local learning | |
Chen et al. | Deep attractor network for single-microphone speaker separation | |
Wang et al. | Deep extractor network for target speaker recovery from single channel speech mixtures | |
CN110600018B (en) | Voice recognition method and device and neural network training method and device | |
CN107393526B (en) | Voice silence detection method, device, computer equipment and storage medium | |
CN106297776B (en) | A kind of voice keyword retrieval method based on audio template | |
CN101980336B (en) | Hidden Markov model-based vehicle sound identification method | |
KR100745976B1 (en) | Method and apparatus for classifying voice and non-voice using sound model | |
CN108922513A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
CN107393527A (en) | The determination methods of speaker's number | |
CN109346084A (en) | Method for distinguishing speek person based on depth storehouse autoencoder network | |
CN113111786B (en) | Underwater target identification method based on small sample training diagram convolutional network | |
CN113763965A (en) | Speaker identification method with multiple attention characteristics fused | |
CN110544482A (en) | single-channel voice separation system | |
CN114234061B (en) | Intelligent discrimination method for water leakage sound of pressurized operation water supply pipeline based on neural network | |
CN108573711A (en) | A kind of single microphone speech separating method based on NMF algorithms | |
CN117789699B (en) | Speech recognition method, device, electronic equipment and computer readable storage medium | |
CN113516987B (en) | Speaker recognition method, speaker recognition device, storage medium and equipment | |
US11776532B2 (en) | Audio processing apparatus and method for audio scene classification | |
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
CN107564546A (en) | A kind of sound end detecting method based on positional information | |
CN110060699A (en) | A kind of single channel speech separating method based on the sparse expansion of depth | |
CN116383719A (en) | MGF radio frequency fingerprint identification method for LFM radar | |
CN110807370A (en) | Multimode-based conference speaker identity noninductive confirmation method | |
Zhang et al. | End-to-end overlapped speech detection and speaker counting with raw waveform |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |