CN113361592A

CN113361592A - Acoustic event identification method based on public subspace representation learning

Info

Publication number: CN113361592A
Application number: CN202110620415.9A
Authority: CN
Inventors: 韩纪庆; 史秋莹; 郑贵滨; 郑铁然
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2021-06-03
Filing date: 2021-06-03
Publication date: 2021-09-07
Anticipated expiration: 2041-06-03
Also published as: CN113361592B

Abstract

An acoustic event identification method based on common subspace representation learning relates to an acoustic event identification method. The method aims to solve the problem of low accuracy of an acoustic event recognition task caused by inconsistency of subspaces among different semantic features. Firstly, sampling, quantizing, frame level feature extraction, segment level feature extraction and expansion are carried out on each original acoustic event signal; then, semantic feature representation of the common subspace is obtained through learning; and calculating a kernel matrix of the training set, and training a classifier to obtain a classification model. During testing, sampling, quantizing, frame level feature extraction, segment level feature extraction and expansion are carried out on each original acoustic event signal; obtaining semantic feature representation of the common subspace under the guidance of the learned common subspace; and finally, calculating a kernel matrix of the test set, and performing model matching under the guidance of the classification model to obtain a prediction result. The method is mainly used for the identification of acoustic events.

Description

Acoustic event identification method based on public subspace representation learning

Technical Field

The invention relates to an acoustic event identification method.

Background

Acoustic events are sound signals with well-defined semantics and are also important media for human perception of the surrounding environment. With the development of information technology, it is possible for machines to have the ability to recognize and understand acoustic events by simulating the human auditory mechanisms. The acoustic event recognition technology is also widely applied to various practical fields such as environmental monitoring and smart home, and is receiving attention of more and more researchers.

In a plurality of acoustic event identification technologies, the semantic feature extraction method based on subspace learning can effectively improve the identification performance due to the fact that the content information and the time sequence relation of acoustic events are considered. The method fully describes the semantic features of a single event by independently learning respective subspaces of different acoustic events. However, since it ignores the consistency of the subspace used to characterize semantic features of different acoustic events, the differences between these features come not only from the semantic features implied by the events, but also from the inconsistency of the subspace, thereby affecting the recognition accuracy of the acoustic events.

Disclosure of Invention

The method aims to solve the problem of low accuracy of an acoustic event recognition task caused by inconsistency of subspaces among different semantic features.

An acoustic event recognition method based on common subspace representation learning, comprising the following steps:

extracting logarithmic Mel spectrum characteristics of the audio frame corresponding to the acoustic event sample to be identified to obtain a frame level characteristic matrix; further abstracting the frame-level features into segment-level features by using a convolutional neural network;

utilizing common subspace basis matrices U^*Solving the optimal semantic feature matrix of each acoustic event sample

Then, calculating a kernel matrix of the acoustic event sample to be identified:

wherein the content of the first and second substances,

to utilize a common subspace basis matrix U^*Aiming at an optimal semantic feature matrix obtained by a training set, wherein the training set is a set formed by acoustic event samples used for training; n is a radical of_teNumber of samples, N, of acoustic event samples to be identified_trThe total number of training samples in the training set; k (·, ·) is a grassmann kernel function; r represents a real number space;

finally in the classification parameter alpha^*Under the guidance of (1), model matching is performed according to the following formula:

P＝(α^*)^TK_te

wherein the content of the first and second substances,

for the prediction result, each column of the prediction result is a probability score of the acoustic event sample to be identified on each category, and the identification result of the acoustic event sample to be identified is determined by taking the maximum value of each column;

the classification parameter alpha^*The classification parameters of the support vector machine classifier are obtained by training with a training set.

Further, the classification parameter α^*The acquisition process comprises the following steps:

calculating a kernel matrix according to the optimal semantic feature matrix of the training set:

wherein, K_trA kernel matrix which is a training set;

respectively according to the optimal semantic features of the training set;

by K_trTraining a support vector machine classifier to obtain a classification model, wherein classification parameters of the classification model are expressed as

Wherein c is the total number of classes.

Further, the common subspace base matrix U^*The determination process of (a) includes the steps of:

step 5.1: randomly initializing a base matrix of the public subspace;

step 5.2: extracting frame level characteristics corresponding to each training sample in the training set, obtaining segment level characteristics according to the frame level characteristic matrix, and recording the segment level characteristics of the ith sample in the segment level characteristic set as the segment level characteristics

Wherein N is_iD is the nth segment to form the number of segments required for the sample

A characteristic dimension of (d);

randomly selecting part of acoustic event samples in the training set of the segment-level features, and respectively solving semantic feature matrixes of the acoustic event samples under the condition of fixing a common subspace basis matrix;

step 5.3: updating the base matrix of the public subspace by using the plurality of semantic feature matrixes obtained in the step 5.2;

step 5.4: repeating the steps 5.2 to 5.3 until convergence to obtain the optimal representation of the common subspace base matrix, namely the common subspace base matrix U^*。

Further, the segment level feature X of step 5.2_iA smoothing process and a length normalization process are required.

Further, the process of randomly selecting some acoustic event samples and solving their semantic feature matrices respectively under the condition of fixing the common subspace basis matrix as described in step 5.2 includes the following steps:

in the training set of the segment-level features, selecting l acoustic event samples at random, fixing U, and respectively obtaining a semantic feature matrix according to the following formula:

wherein, U is a standard orthogonal basis matrix of a random initialization public subspace; v_iFor the ith sample X in the l training samples_iCorresponding semantic feature matrix, constraining the different columns in the matrix to be mutually orthogonal, and dividing V_iIs shown as

The above elements; q. q.s_iIs N_iThe total combination number of two segments with the occurrence sequence in each segment, lambda is used for reflecting a hyper-parameter of the influence of the characterization time sequence relation on semantic feature quality, and eta is a hyper-parameter reflecting the obvious degree of the characterization time sequence relation; 1_p＝[1]∈R^p×1Is a p-dimensional whole-column vector,

is p × q_iAll one matrix of (a);

respectively represent X_iNth, mth segment level features;

solving about

To the optimization problem of (2).

Further, said N_iTotal combined number q of two segments with successive occurrence order in each segment_i＝0.5×N_i×(N_i-1)。

Further, in the process of updating the base matrix of the common subspace by using the plurality of semantic feature matrices obtained in step 5.2, the base matrix of the common subspace is updated according to the following formula:

wherein | · | purple_FRepresenting the Frobenius norm.

Furthermore, the acoustic event sample to be identified is obtained after sampling and quantizing the acoustic event signal to be identified.

Further, in the process of further abstracting the frame-level features into the segment-level features by using the convolutional neural network, a plurality of adjacent frame-level features are further abstracted into the segment-level features.

Further, after the frame-level features are further abstracted into segment-level features by using the convolutional neural network, the segment-level features are subjected to smoothing processing and length normalization processing.

Has the advantages that:

the method can well solve the influence of the inconsistency of the subspace among different semantic features on the accuracy rate of the acoustic event recognition task. Meanwhile, the invention can describe the semantic features of the acoustic event from multiple angles by learning the multidimensional public subspace, thereby improving the recognition performance together, and the accuracy of the acoustic event recognition task is as high as 84.1%.

Drawings

Fig. 1 is a schematic diagram of an acoustic event recognition method based on common subspace representation learning.

FIG. 2 is a convolutional neural network architecture diagram for extracting segment-level features.

Fig. 3 is a histogram of the accuracy of acoustic event recognition methods and correlation methods on ESC-50 data sets based on common subspace representation learning.

Detailed Description

The first embodiment is as follows: the present embodiment is described with reference to fig. 1, and fig. 1 is a schematic diagram of an acoustic event recognition method based on common subspace representation learning. In the training stage, firstly, sampling, quantizing, frame level feature extraction, segment level feature extraction and expansion are respectively carried out on original signals from a training set; then, obtaining semantic feature representation of the public subspace by learning the public subspace; and finally, calculating a kernel matrix of the training set, and training a classifier to obtain a classification model. In the testing stage, firstly, sampling, quantizing, frame level feature extracting, segment level feature extracting and expanding are carried out on each original acoustic event signal in a testing set; then, obtaining semantic feature representation under the guidance of the learned public subspace; and finally, calculating a kernel matrix of the test set, and performing model matching under the guidance of the classification model to obtain a prediction result.

The acoustic event identification method based on common subspace representation learning in the embodiment comprises the following steps:

step 1: and respectively sampling and quantizing the original acoustic event signals in the training set and the testing set to obtain processed acoustic event samples. In this embodiment, the sampling rate may be 44100 Hz, and the number of quantization bits may be 16.

Step 2: and (2) dividing each acoustic event sample obtained in the step (1) into a plurality of audio frames, dividing the audio frames into a plurality of audio frames according to a pre-specified frame length and an inter-frame overlapping proportion, and respectively extracting classical logarithmic Mel spectral features in an acoustic event recognition task from the audio frames according to a pre-specified Mel frequency band number to obtain a frame level feature matrix. In the present embodiment, the frame length, the inter-frame overlap, and the number of mel-frequency bands may be set to 23 msec, 50%, and 128, respectively.

And step 3: considering that the audio frame is often too short in duration, there is a limitation that the audio frame contains insufficient semantic information. For this purpose, for each frame-level feature matrix obtained in step 2, according to a pre-specified segment length and an inter-segment overlapping proportion, a plurality of adjacent frame-level features are input into a pre-trained convolutional neural network, and under the condition of fixed network parameters, segment-level features are obtained without supervision, so that a segment-level feature matrix is obtained. FIG. 2 is a convolutional neural network architecture diagram for extracting segment-level features. The network is composed of 13 convolutional layers, 6 maximum pooling layers and an average pooling layer, logarithmic Mel spectrum characteristics of samples in an AudioSet data set are used as input, supervised pre-training is carried out, and then optimal network parameters are obtained.

In order to fully utilize and utilize the capability of the convolutional neural network and effectively control the total time of samples, in this embodiment, each segment-level feature may be set as a further abstraction result of 84 audio frames, the inter-segment overlap may be 20%, and the penultimate layer (last convolutional layer) of the above network may be output as a segment-level feature, the node number of which is 1024.

And 4, step 4: in order to enhance the time sequence correlation among the segment-level features, smoothing is performed on each segment-level feature matrix obtained in the step 3 by using a classical time domain transform averaging method, and then length normalization is performed on the smoothed segment-level feature matrix to obtain the segment-level feature matrix expanded by the operation. For the convenience of the following description, the segment level of the i-th sample after being expanded is characterized as

The characteristic dimension of (c). Since the above expansion operation does not change the segment-level feature dimension obtained after step 3, d in this step can also be set to 1024.

And 5: in order to depict the overall semantic features of the acoustic event samples, a public subspace is constructed by using a strategy based on public subspace learning, and the method comprises the following steps:

step 5.1: defining a subspace with a dimension p, randomly initializing the orthonormal basis matrix of the subspace, and recording as

Wherein the content of the first and second substances,

refers to the Stiefel flow pattern, which isA set consisting of orthonormal matrices of size d × p. Wherein p can range from [1,2,3,4,5 ]]Selecting.

Step 5.2: in the training set processed in step 4, randomly selecting l acoustic event samples, fixing U (steps 5.1-5.3 are three main steps in an iterative algorithm, which can be described as alternately updating U and Vi, and here, fixing U updates the other), and obtaining their semantic feature matrices according to the following formula:

wherein the content of the first and second substances,

is about V_iFunction of V_iFor the ith sample X in the l training samples_iCorresponding semantic feature matrix, in order to avoid redundancy in semantic features, the invention restrains the mutual orthogonality between different columns in the matrix, and combines V_iIs shown as

The above elements; q. q.s_iIs N_iThe total combination number of two segments with the successive occurrence sequence in each segment can be determined by q_i＝0.5×N_i×(N_i-1) performing a calculation; λ is a hyper-parameter of the invention used for reflecting the influence of the characterization time sequence relation on the semantic feature quality, and η is a hyper-parameter reflecting the significance degree of the characterization time sequence relation; 1_p＝[1]∈R^p×1Is a p-dimensional whole-column vector,

is p × q_iAll of the matrices of (a).

Containing a sample X_iThe difference between any two of the segments having a tandem occurrence relationship;

respectively represent X_iNth, mth segment level features.

To effectively solve about

The invention utilizes a classical Riemann gradient descent algorithm which is an iterative solution strategy, and obtains a semantic feature representation V after the preassigned maximum iteration times_i. According to the invention, through learning the multidimensional public subspace, the semantic features of the acoustic events can be described from multiple angles, so that the recognition performance is improved together.

Specifically, the algorithm depends on the iteration of each step

About V_iMay be represented as:

where l may be set to 64, λ may be set to 0.1, η may be set to 0.001, and the maximum number of iterations may be set to 20.

Step 5.3: updating the base matrix of the public subspace according to the following formula by using the plurality of semantic feature matrixes obtained in the step 5.2:

is a function related to U, and in order to effectively solve the formula, a classical Riemannian gradient descent algorithm is adopted for updating once to obtain updated U. Specifically, the above algorithms all depend on the update process

With respect to the gradient of U, it can be expressed as:

step 5.4: repeating steps 5.2 to 5.3 until convergence to obtain an optimal representation U of the common subspace base matrix^*。

Step 6: at U^*Respectively solving the optimal semantic feature matrix of each acoustic event sample in the training set and the test set after the step 4 under the guidance of (1), and recording the optimal semantic feature matrix set of the training set as

Wherein N is_trThe total training sample number is obtained; the optimal semantic feature matrix set of the test set is

Wherein N is_teIs the total number of test samples.

And 7: calculating a kernel matrix of the training set by using the optimal semantic feature matrix of the training set according to the following formula:

wherein, K_trTo train the kernel matrix of the set, k (·,) is a pre-specified grassmann kernel function. Further, with K_trTraining a support vector machine classifier to obtain a classification model, wherein classification parameters of the model can be expressed as

Wherein c is the total number of classes.

And 8: and calculating a kernel matrix of the test set according to the following formula by using the optimal semantic feature matrix of the training set and the test set:

wherein, K_teIs the above-mentioned kernel matrix. Further, at α^*Under the guidance of (1), model matching is performed according to the following formula:

P＝(α^*)^TK_te

wherein the content of the first and second substances,

for predicting the result, each column can be regarded as a probability score of the corresponding test sample on each category, and the recognition result of the test sample can be determined by taking the maximum value of each column.

The invention is tested on the internationally published data set, and the test result shows that the technology can effectively improve the identification performance.

Fig. 3 is a histogram of the accuracy of acoustic event recognition methods and correlation methods on ESC-50 data sets based on common subspace representation learning. By comparing the accuracy of the method provided by the invention and the acoustic event identification method based on subspace representation learning, the necessity of introducing a common subspace can be verified. Meanwhile, compared with the accuracy achieved by the pre-training convolutional neural network and the human ear, the effectiveness of the method provided by the invention can be verified.

The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.

Claims

1. An acoustic event recognition method based on common subspace representation learning, characterized by comprising the following steps:

wherein the content of the first and second substances,

P＝(α^*)^TK_te

wherein the content of the first and second substances,

for predicting the result, each column thereof corresponds to a sample of the acoustic event to be recognizedScoring the probability of each category, and determining the recognition result of the acoustic event sample to be recognized by taking the maximum value of each column;

2. The method of claim 1, wherein the classification parameter α is a^*The acquisition process comprises the following steps:

wherein, K_trA kernel matrix which is a training set;

respectively according to the optimal semantic features of the training set;

Wherein c is the total number of classes.

3. The method for recognizing acoustic events based on common subspace representation learning according to claim 1 or 2, wherein the common subspace base matrix U^*The determination process of (a) includes the steps of:

step 5.1: randomly initializing a base matrix of the public subspace;

A characteristic dimension of (d);

4. The method of claim 3, wherein the segment-level feature X of step 5.2 is the same as the segment-level feature X of the acoustic event recognition method based on common subspace representation learning_iA smoothing process and a length normalization process are required.

5. The method for recognizing acoustic events based on common subspace representation learning according to claim 4, wherein the process of randomly selecting partial acoustic event samples and respectively solving the semantic feature matrices thereof under the condition of fixing the common subspace base matrix in step 5.2 comprises the following steps:

is p × q_iAll one matrix of (a);

respectively represent X_iNth, mth segment level features;

solving about

To the optimization problem of (2).

6. The method of claim 5, wherein N is the number of acoustic events identified based on the learning of the common subspace representation_iTotal combined number q of two segments with successive occurrence order in each segment_i＝0.5×N_i×(N_i-1)。

7. The method for recognizing acoustic events based on common subspace representation learning according to claim 6, wherein the updating of the base matrix of the common subspace by using the plurality of semantic feature matrices obtained in step 5.2 is performed by updating the base matrix of the common subspace according to the following formula:

wherein | · | purple_FRepresenting the Frobenius norm.

8. The method as claimed in claim 7, wherein the acoustic event samples to be identified are obtained by sampling and quantizing the acoustic event signals to be identified.

9. The method of claim 8, wherein the further abstraction of the frame-level features into the segment-level features by the convolutional neural network is to further abstract a plurality of adjacent frame-level features into the segment-level features.

10. The method of claim 9, wherein after the frame-level features are further abstracted into segment-level features by using a convolutional neural network, the segment-level features are further subjected to smoothing and length normalization.