CN110223712B - Music emotion recognition method based on bidirectional convolution cyclic sparse network - Google Patents

Music emotion recognition method based on bidirectional convolution cyclic sparse network

Info

Publication number
CN110223712B
CN110223712B CN201910485792.9A
Authority
CN
China
Prior art keywords
time
convolution
bcrfms
model
neuron
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910485792.9A
Other languages
Chinese (zh)
Other versions
CN110223712A (en)
Inventor
杨新宇
董怡卓
罗晶
张亦弛
魏洁
崔宇涵
夏小景
吉姝蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201910485792.9A priority Critical patent/CN110223712B/en
Publication of CN110223712A publication Critical patent/CN110223712A/en
Application granted granted Critical
Publication of CN110223712B publication Critical patent/CN110223712B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 - Details of electrophonic musical instruments
    • G10H1/0008 - Associated control or indicating means
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 - Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 - Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a music emotion recognition method based on a bidirectional convolutional recurrent sparse network. The method combines a convolutional neural network with a recurrent neural network to adaptively learn emotion salience features containing time-sequence information from the two-dimensional time-frequency representation (i.e., the time-frequency graph) of the original audio signal. Furthermore, the invention adopts a weighted mixed binary representation that converts the regression prediction problem into a weighted combination of several binary classification problems, thereby reducing the computational complexity of handling numerical real-valued data. Experimental results show that the emotion salience features containing time-sequence information extracted by the bidirectional convolutional recurrent sparse network achieve better prediction performance than the best features proposed in MediaEval 2015; compared with the commonly used music emotion recognition network structures and the current best methods, the proposed model reduces training time and improves prediction accuracy. The method therefore effectively addresses both the accuracy and the efficiency of music emotion recognition and is superior to existing recognition methods.

Description

Music emotion recognition method based on bidirectional convolution cyclic sparse network
Technical Field
The invention belongs to the fields of machine learning and affective computing, and in particular relates to a music emotion recognition method based on a bidirectional convolutional recurrent sparse network.
Background
With the development of multimedia technology, the explosive growth of digital music from different media has led to increasing interest in fast and efficient approaches to music query and retrieval. Because music conveys emotion-related information, and emotion-based music information retrieval offers high generality and user satisfaction, retrieving music by recognizing the emotion of the music audio signal has become an important research direction; its core difficulty is how to further improve the accuracy and efficiency of music emotion recognition.
The goal of music emotion recognition is to learn the perceived emotional state of music by extracting and analyzing features such as tempo, timbre, and intensity. A large number of recognition studies based on Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) have shown clear advantages: a CNN can adaptively learn high-level invariant features from the original audio data, eliminating the dependence of feature extraction on human subjectivity or experience, while an RNN can handle the time-sequence dependence of musical information. The music emotion recognition method based on the bidirectional convolutional recurrent sparse network adopted here combines the CNN's adaptive learning of high-level invariant features with the RNN's ability to learn temporal relationships among features, and is used to predict Arousal and Valence emotion values, thereby improving the accuracy of music emotion recognition.
Disclosure of Invention
The invention aims to improve the accuracy and efficiency of music emotion recognition, and provides a music emotion recognition method based on a bidirectional convolutional recurrent sparse network.
To achieve this purpose, the invention adopts the following technical scheme:
a music emotion recognition method based on a bidirectional convolution cyclic sparse network comprises the steps of firstly converting an audio signal into a time-frequency graph; secondly, establishing an audio time sequence model by adopting a mode of internal fusion of a convolutional neural network and a cyclic neural network to learn the emotional significance characteristics containing time sequence information, namely SII-ASF, and simultaneously converting a regression problem into a plurality of binary classification problems by combining a weighted mixed binary representation method to reduce the calculation complexity; and finally, carrying out continuous emotion recognition on the music.
The invention is further improved in that the method specifically comprises the following steps:
1) time-frequency graph conversion of the audio signal: comprising time-frequency graph conversion of the audio file and dimensionality reduction of the time-frequency graph, specifically as follows,
1-1) time-frequency graph conversion of the audio file: dividing each time-domain audio file into non-overlapping segments of fixed duration, and converting each segment into a time-frequency graph using a sliding window of fixed frame length and step size;
1-2) dimensionality reduction of the time-frequency graph: using PCA whitening with 99% of the data variance retained to reduce the dimensionality of the frequency axis of the time-frequency graph;
2) establishing an audio time-sequence model to learn the emotion salience features containing time-sequence information: combining the CNN's adaptive feature learning with the RNN's ability to process time-series data to construct a bidirectional convolutional recurrent sparse network, BCRSN for short; the connections between the model's input layer and hidden layer follow the CNN scheme of local connectivity and weight sharing, and several convolution kernels are used to obtain the bidirectional convolutional recurrent feature map group, namely the BCRFMs; the long-term dependence among features within the BCRFMs is captured by replacing each neuron in the BCRFMs with a long short-term memory (LSTM) module;
3) converting the regression problem into binary classification problems: comprising the representation of binary values and sparse processing, with the following steps,
3-1) representation of binary values: converting the regression problem into a weighted combination of several binary classification problems by representing numerical real-valued data with a weighted mixed binary representation, so as to reduce the computational complexity of the model;
3-2) sparse processing: using the concordance correlation coefficient (CCC) as the loss function and adding a penalty term to the CCC to form the model's objective function, making the BCRFMs as sparse as possible and obtaining the SII-ASF;
4) continuous emotion recognition of music: based on the binary classification results, first performing emotion recognition on the audio content of a single segment, and then performing continuous emotion recognition over the multiple audio segments of the complete music file.
A further improvement of the invention is that step 1-1) operates as follows: each time-domain audio file is divided into non-overlapping segments of 500 ms duration, and each audio segment is converted into a time-frequency graph using a sliding window with a 60 ms frame length and a 10 ms step size.
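By way of illustration, the following sketch performs this conversion for one audio file: 500 ms non-overlapping segments, a 60 ms analysis frame and a 10 ms step, which yields (500 - 60)/10 + 1 = 45 frames per segment. The use of librosa, the 44.1 kHz sample rate and the magnitude STFT are assumptions made for the example; the patent does not prescribe a particular library or transform.

```python
# Sketch of step 1-1): split an audio file into 500 ms segments and convert each
# segment into a time-frequency graph with a 60 ms window and a 10 ms hop.
# librosa and the 44.1 kHz sample rate are illustrative assumptions.
import numpy as np
import librosa

def audio_to_spectrograms(path, sr=44100, seg_ms=500, frame_ms=60, hop_ms=10):
    y, sr = librosa.load(path, sr=sr, mono=True)
    seg_len = int(sr * seg_ms / 1000)       # samples per 500 ms segment
    n_fft = int(sr * frame_ms / 1000)       # 60 ms analysis window
    hop = int(sr * hop_ms / 1000)           # 10 ms step between frames
    specs = []
    for start in range(0, len(y) - seg_len + 1, seg_len):   # non-overlapping segments
        seg = y[start:start + seg_len]
        # center=False gives exactly (500 - 60) / 10 + 1 = 45 frames per segment
        S = np.abs(librosa.stft(seg, n_fft=n_fft, hop_length=hop, center=False))
        specs.append(S)                     # shape: (n_fft // 2 + 1 freq bins, 45 frames)
    return specs
```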
A further improvement of the invention is that step 1-2) operates as follows: PCA whitening is performed with 99% of the data variance retained, reducing the frequency axis of the time-frequency graph to 45 dimensions and yielding a 45 × 45 time-frequency graph as the input to the BCRSN model.
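A minimal sketch of the whitening step follows, assuming scikit-learn and assuming the projection is fitted on frames pooled from the training set; the patent itself only specifies 99% variance retention and a 45-dimensional frequency axis.

```python
# Sketch of step 1-2): PCA-whiten the frequency axis so that each 45-frame segment
# becomes a 45 x 45 input for the BCRSN model. scikit-learn is an illustrative choice.
import numpy as np
from sklearn.decomposition import PCA

def fit_frequency_pca(train_spectrograms, n_components=45):
    # rows = individual time frames from all training segments, columns = frequency bins
    frames = np.concatenate([S.T for S in train_spectrograms], axis=0)
    return PCA(n_components=n_components, whiten=True).fit(frames)

def reduce_segment(pca, S):
    # S: (freq_bins, 45 frames)  ->  (45 frames, 45 whitened components)
    return pca.transform(S.T)
```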
A further improvement of the invention is that step 2) operates as follows: the time-frequency graph is convolved over the time domain with 64 convolution kernels of size 3 × 1 and stride 2 to obtain the BCRFMs; the neurons in the BCRFMs are connected by bidirectional recurrence following the time order of the audio frames, and the input of a neuron at a given frame is the weighted sum of the corresponding convolution result and the neuron outputs of the previous/next frame; meanwhile, each neuron in the BCRFMs is replaced by an LSTM module, whose input, output and forget gates allow information of segments of arbitrary duration to be memorized; finally, a 3 × 1 downsampling operation reduces the size of the feature maps and enhances the robustness of the model.
A further improvement of the invention is that the learning of BCRFMs in step 2) comprises the steps of:
(i) the connections between the input layer of the BCRSN model and the forward/backward convolutional recurrent layers are mediated by convolution kernels; the forward and backward convolutional recurrent layers have the same number and arrangement of neurons as a CNN convolutional layer, so that the model retains the ability to adaptively learn invariant features, and the convolution result of each neuron is calculated by formula (1):

C_{nt,k} = W_k ⊗ X_{n,t}    (1)

where C_{nt,k} is the convolution result of the neuron at position (n, t) of the k-th feature map, n = 1, 2, ..., (N-1)/2, t = 1, 2, ..., T; X_{n,t} is the two-dimensional feature matrix at the corresponding position (n, t) of the input layer, and W_k is the weight parameter of the k-th convolution kernel;
(ii) the neurons in the BCRFMs are connected by bidirectional recurrence following the time order of the audio frames, and the input of a neuron at a given frame is the weighted sum of the corresponding convolution result and the neuron outputs of the previous/next frame;
for the feature maps of the forward convolutional recurrent layer, the input of each neuron is given by formula (2):

FI_{nt,k} = C_{nt,k} + U_k^F · FO_{(t-1),k}    (2)

and the output by formula (3):

FO_{nt,k} = σ(FI_{nt,k} + b_{nt,k})    (3)

for the feature maps of the backward convolutional recurrent layer, the input of each neuron is given by formula (4):

BI_{nt,k} = C_{nt,k} + U_k^B · BO_{(t+1),k}    (4)

and the output by formula (5):

BO_{nt,k} = σ(BI_{nt,k} + b_{nt,k})    (5)

where FO_{(t-1),k} / BO_{(t+1),k} denote the outputs of all neurons of the previous/next frame t-1 / t+1 of the k-th feature map; U_k^F and U_k^B denote the connection matrices of the neurons in the forward and backward propagation processes, respectively, with weights shared among all audio frames; and b_{nt,k} is the network bias;
(iii) each neuron in the BCRFMs is replaced by an LSTM module, whose input, output and forget gates allow information of segments of arbitrary duration to be memorized; between the forward/backward convolutional recurrent layers and the forward/backward pooling layers, a downsampling operation is applied along the frequency axis, representing each 3 × 1 downsampling region by the maximum feature within it and thereby reducing the size of the feature maps.
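The sketch below illustrates steps (i) and (ii): per-frame 3 × 1 convolutions (formula (1)) feed a forward and a backward recurrence over the audio frames (formulas (2) to (5)). For brevity it uses plain sigmoid units instead of the LSTM modules of step (iii) and omits the 3 × 1 max-pooling; the PyTorch framework, the tensor shapes and the default sizes are illustrative assumptions rather than the patent's exact implementation.

```python
# Sketch of the bidirectional convolutional recurrent feature maps (BCRFMs):
# C (eq. 1) -> forward recurrence FO (eqs. 2-3) and backward recurrence BO (eqs. 4-5).
# Plain sigmoid units stand in for the LSTM modules used in the patent.
import torch

class BidirConvRecurrentMaps(torch.nn.Module):
    def __init__(self, n_kernels=64, freq_out=22):
        super().__init__()
        # 64 kernels of size 3 x 1, stride 2 along the frequency axis (eq. 1);
        # a 45-bin frequency axis gives (45 - 3) // 2 + 1 = 22 output bins
        self.conv = torch.nn.Conv2d(1, n_kernels, kernel_size=(3, 1), stride=(2, 1))
        # frame-to-frame connection matrices U^F / U^B, shared over all frames (eqs. 2, 4)
        self.U_f = torch.nn.Linear(freq_out, freq_out, bias=False)
        self.U_b = torch.nn.Linear(freq_out, freq_out, bias=False)
        self.bias = torch.nn.Parameter(torch.zeros(freq_out))

    def forward(self, x):                    # x: (batch, 1, freq=45, frames=45)
        C = self.conv(x)                     # (batch, 64, 22, 45)
        B, K, N, T = C.shape
        fo, prev = [], torch.zeros(B, K, N)
        for t in range(T):                   # forward recurrence over frames (eqs. 2-3)
            prev = torch.sigmoid(C[..., t] + self.U_f(prev) + self.bias)
            fo.append(prev)
        bo, nxt = [], torch.zeros(B, K, N)
        for t in reversed(range(T)):         # backward recurrence over frames (eqs. 4-5)
            nxt = torch.sigmoid(C[..., t] + self.U_b(nxt) + self.bias)
            bo.append(nxt)
        FO = torch.stack(fo, dim=-1)
        BO = torch.stack(list(reversed(bo)), dim=-1)
        return FO, BO                        # the bidirectional feature map group
```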
A further improvement of the invention is that step 3-1) operates as follows: the output layer of the BCRSN model is given L+1 neurons, and the resulting prediction sequence is denoted by O; O_1 predicts the sign of the true value, and O_2 ~ O_{L+1} predict the absolute value of the true value, whose range is (0, 1); each neuron acts as a binary classifier, reducing the computational complexity of the loss function to O((L+1) × 1²) = O(L+1) and making the model converge faster.
The invention is further improved in that step 3-1) adopts a weighted mixed binary representation, which comprises the following steps:
(i) the new weighted mixed binary representation converts the numerical real-valued data g into a mixed binary vector O* to reduce computational complexity; each bit O*_i of the vector is calculated by formula (6):

O*_1 = 1 (g_1 ≥ 0) or 0 (g_1 < 0);   O*_{i+1} = the i-th binary-fraction bit of |g_1|, i = 1, 2, ..., L    (6)

where g_1 = g and the first bit O*_1 is determined by the sign of g_1: O*_1 = 1 when g_1 ≥ 0, and O*_1 = 0 when g_1 < 0; the remaining bits give the binary-fraction expansion of the absolute value over (0, 1);
(ii) each output-layer neuron O_i is assigned a contribution weight λ_i to the model loss function, which controls the convergence direction of the loss and improves the prediction accuracy; the segment loss is calculated by the following formula:

δ(O, O*) = Σ_{i=1}^{L+1} λ_i · δ(O_i, O*_i)

where δ(·) denotes the loss-function calculation and λ_i denotes the contribution of O_i to the segment loss function.
A further improvement of the invention is that step 3-2) operates as follows: the CCC is used as the loss function, and a Lasso penalty term on the BCRFMs weights is added to the CCC to form the model's objective function, making the BCRFMs as sparse as possible and obtaining the SII-ASF.
The invention is further improved in that step 3-2) uses the CCC as the loss function so that the network is trained more discriminatively; specifically, each song is divided into segments of fixed duration and the ground-truth value of each segment is converted into a mixed binary vector O*; the loss function is computed in the following steps:
(i) the CCC between the predicted sequence O and the true sequence O* of each segment is calculated; the CCC between the predicted sequence f_s of a sequence sample s and its target sequence y_s is defined as:

CCC_s = 2 Q_s / (S_s^f + S_s^y + (μ_s^f - μ_s^y)²)    (7)

where S_s^f and S_s^y denote the variances of the predicted and target sequences, Q_s denotes their covariance, μ_s^f and μ_s^y their means, t denotes the time index of each label value, and N_s denotes the length of the sequence s; on this basis, the number of digits L+1 of the mixed binary vector is taken as the sequence length of each segment, the contribution weight of each digit to the model loss function is taken into account, and formula (7) is rewritten to obtain the CCC between the predicted sequence O and the true sequence O* of each segment:
CCC_s = 2 Q_s^λ / (S_s^O + S_s^{O*} + (μ_s^O - μ_s^{O*})²)    (8)

where O* and O denote the true and predicted mixed binary vectors of the segment, respectively, λ = (λ_1, λ_2, ..., λ_{L+1}) denotes the set of contribution parameters of O to the segment loss function, and the means, variances and covariance in formula (8) are computed over the L+1 digits with each digit weighted by its λ_i; the CCC solution of the regression prediction problem is thus translated into a weighted sum of multiple binary-classification accuracies.
(ii) the average CCC of each song is calculated from the CCC of each of its segments and the number of segments:

CCC_avg = (1 / N_s) · Σ_{s=1}^{N_s} CCC_s

where N_s here denotes the length of each song, i.e., the number of segments;
setting coefficients of some neurons to be 0 by using Lasso regression to delete repeated related variables and a plurality of noise features, and selecting the SII-ASF with stronger emotional significance; in particular, in the loss function
Figure BDA0002085343230000065
On the basis, adding a Lasso penalty term of the BCRFMs weight as a final objective function:
Figure BDA0002085343230000066
in the formula, betaFA set of parameters representing the BCRFMs,
Figure BDA0002085343230000067
in a similar manner, the first and second substrates are,
Figure BDA0002085343230000068
αFand alphaBThe method is a hyper-parameter used for controlling the sparsity of a characteristic diagram, and the larger the alpha value is, the higher the sparsity is; and minimizing L to remove noise features, selecting the emotion significance features, and improving the prediction accuracy.
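A sketch of the resulting objective under the reading above: one minus the λ-weighted CCC between the predicted and true bit vectors of a segment, plus L1 (Lasso) penalties on the forward and backward BCRFMs parameters. The function names, the way λ is folded into the CCC statistics, and the default α values are assumptions made for the example.

```python
# Sketch of the step 3-2) objective: maximise the lambda-weighted CCC (minimise 1 - CCC)
# while applying L1 penalties to the forward/backward feature-map parameters.
import torch

def weighted_ccc(o_pred, o_true, lam, eps=1e-8):
    # o_pred, o_true, lam: 1-D tensors of length L+1; lam is normalised to sum to 1
    w = lam / lam.sum()
    mu_p, mu_t = (w * o_pred).sum(), (w * o_true).sum()
    var_p = (w * (o_pred - mu_p) ** 2).sum()
    var_t = (w * (o_true - mu_t) ** 2).sum()
    cov = (w * (o_pred - mu_p) * (o_true - mu_t)).sum()
    return 2.0 * cov / (var_p + var_t + (mu_p - mu_t) ** 2 + eps)

def objective(o_pred, o_true, lam, beta_f, beta_b, alpha_f=1e-4, alpha_b=1e-4):
    loss = 1.0 - weighted_ccc(o_pred, o_true, lam)      # CCC-based loss for one segment
    loss = loss + alpha_f * beta_f.abs().sum()          # Lasso penalty on forward BCRFMs
    loss = loss + alpha_b * beta_b.abs().sum()          # Lasso penalty on backward BCRFMs
    return loss
```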
The invention has the following beneficial technical effects:
the music emotion recognition method based on the bidirectional convolution cyclic sparse network comprises the steps of firstly converting audio signals into a time-frequency diagram, secondly establishing an audio time sequence model by adopting a mode of internal fusion of CNN and RNN to learn SII-ASF, meanwhile, converting a regression problem into a plurality of binary classification problems by combining a weighted mixed binary representation method to reduce the calculation complexity, and finally carrying out continuous emotion recognition on music. Compared with the current common music emotion recognition network structure and the optimal method, the BCRSN model can obviously reduce training time and improve prediction precision, and the extracted SII-ASF features show better prediction performance compared with the optimal features proposed by the participants in MediaEval 2015.
Drawings
FIG. 1 is a flow chart of the BCRSN system of the present invention;
FIG. 2 is a diagram illustrating the conversion process from numeric real data to hybrid binary vectors in the present invention;
FIG. 3 is a comparison graph of prediction performance and training time of the BCRSN model and CNN-based, BLSTM-based and stacked CNN-BLSTM-based models on DEAM and MTurk music emotion recognition data sets in the present invention.
Detailed Description
The present invention is described in further detail below with reference to the attached drawings.
Referring to fig. 1, in the music emotion recognition method based on the bidirectional convolutional recurrent sparse network provided by the invention, the audio signal is first converted into a time-frequency graph; second, an audio time-sequence model is established through internal fusion of a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) to learn the emotion salience features containing time-sequence information (SII-ASF for short), while the regression problem is converted into several binary classification problems via a weighted mixed binary representation to reduce computational complexity; finally, continuous emotion recognition is performed on the music, specifically in the following steps:
1) time-frequency graph conversion of the audio signal: comprising time-frequency graph conversion of the audio file and dimensionality reduction of the time-frequency graph, specifically as follows,
Step 1, time-frequency graph conversion of the audio file: dividing each time-domain audio file into non-overlapping segments of fixed duration, and converting each segment into a time-frequency graph using a sliding window of fixed frame length and step size;
Step 2, dimensionality reduction of the time-frequency graph: using PCA whitening with a set degree of data variance retention to reduce the dimensionality of the frequency axis of the time-frequency graph.
2) Establishing an audio time-sequence model to learn the emotion salience features containing time-sequence information: the CNN's adaptive feature learning is combined with the RNN's ability to process time-series data to construct the bidirectional convolutional recurrent sparse network (BCRSN for short). Referring to fig. 1, for the input two-dimensional time-frequency graph, the connections between the input layer and the forward/backward convolutional recurrent layers within each frame t_i follow the CNN scheme of local connectivity and weight sharing, and bidirectional recurrent connections U^F and U^B between audio frames transmit the time-sequence information needed to learn the BCRFMs; meanwhile, LSTM modules replace each neuron in the BCRFMs, capturing the long-term dependence among features within the BCRFMs.
3) Converting the regression problem into binary classification problems: comprising the weighted binary-value representation and sparse processing; referring to fig. 1 and 2, the detailed steps are as follows,
Step 1, weighted binary-value representation: the regression problem is converted into a weighted combination of several binary classification problems by representing numerical real-valued data with the weighted mixed binary representation, so as to reduce computational complexity;
Step 2, sparse processing: the CCC is used as the loss function, and a Lasso penalty term (L1 regularization) on the BCRFMs weights is added to the CCC to form the model's objective function, making the BCRFMs as sparse as possible and obtaining the SII-ASF.
4) Continuous emotion recognition of music: the audio time-frequency graphs are input into the BCRSN model; emotion recognition is first performed on the audio content of a single segment from its multiple binary classification results, and continuous emotion recognition is then performed over the multiple audio segments of the complete music file.
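A short sketch of this step, under the same assumptions as the earlier examples: the trained model is applied segment by segment, each segment's bit vector is decoded back to a real value, and the per-segment values form the continuous emotion curve of the song. Here model and decode are illustrative placeholders rather than the patent's exact interfaces.

```python
# Sketch of step 4): per-segment prediction, then concatenation into a continuous curve.
def continuous_recognition(model, segments, decode):
    curve = []
    for seg in segments:               # each seg: one 45 x 45 time-frequency input
        o_pred = model(seg)            # L+1 binary-classifier outputs for this segment
        curve.append(decode(o_pred))   # back to a real arousal/valence value
    return curve                       # one value per 500 ms segment of the song
```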
Referring to fig. 3, on the DEAM and MTurk data sets, the BCRSN model of the invention achieves the best continuous emotion prediction performance in both the Valence and Arousal dimensions compared with the CNN-based, BLSTM-based and stacked CNN-BLSTM-based models.
Referring to Table 1, compared with the best algorithms of MediaEval 2015, the BCRSN model of the invention adaptively learns effective features for the prediction target from the original audio signal with minimal prior knowledge, and outperforms the three best-performing methods in MediaEval 2015 (BLSTM-RNN, BLSTM-ELM and deep LSTM-RNN).
Table 1: comparison of the BCRSN model of the invention, taking the original audio signal as input, with the three best-performing methods in MediaEval 2015 (BLSTM-RNN, BLSTM-ELM and deep LSTM-RNN).
Note: n.s. (not significant) indicates that the performance of the method is not significantly different from that of the BCRSN model; otherwise the difference is significant.
Referring to Table 2, the SII-ASF and SII-NASF features obtained by the BCRSN model with and without the Lasso penalty both show good prediction performance compared with the feature sets proposed by the participants in MediaEval 2015 (JUNLP, PKUAIPL, HKPolyU, THU-HCSIL and IRIT-SAMOVA).
Table 2: performance comparison of the SII-ASF and SII-NASF features extracted in the invention with the features proposed by the participants in MediaEval 2015 (JUNLP, PKUAIPL, HKPolyU, THU-HCSIL and IRIT-SAMOVA).
Note: n.s. (not significant) indicates that the performance of the feature is not significantly different from that of SII-ASF; otherwise the difference is significant.

Claims (1)

1. A music emotion recognition method based on a bidirectional convolutional recurrent sparse network, characterized in that an audio signal is first converted into a time-frequency graph; secondly, an audio time-sequence model is established through internal fusion of a convolutional neural network and a recurrent neural network to learn the emotion salience features containing time-sequence information, namely the SII-ASF, while the regression problem is converted into several binary classification problems by means of a weighted mixed binary representation to reduce computational complexity; finally, continuous emotion recognition is performed on the music; the method specifically comprises the following steps:
1) time-frequency graph conversion of the audio signal: comprising time-frequency graph conversion of the audio file and dimensionality reduction of the time-frequency graph, specifically as follows,
1-1) time-frequency graph conversion of the audio file: dividing each time-domain audio file into non-overlapping segments of fixed duration, and converting each segment into a time-frequency graph using a sliding window of fixed frame length and step size; the specific operation is as follows: each time-domain audio file is divided into non-overlapping segments of 500 ms duration, and each audio segment is converted into a time-frequency graph using a sliding window with a 60 ms frame length and a 10 ms step size;
1-2) dimensionality reduction of the time-frequency graph: using PCA whitening with 99% of the data variance retained to reduce the dimensionality of the frequency axis of the time-frequency graph; the specific operation is as follows: PCA whitening is performed with 99% of the data variance retained, reducing the frequency axis of the time-frequency graph to 45 dimensions and yielding a 45 × 45 time-frequency graph as the input to the BCRSN model;
2) establishing an audio time-sequence model to learn the emotion salience features containing time-sequence information: combining the CNN's adaptive feature learning with the RNN's ability to process time-series data to construct a bidirectional convolutional recurrent sparse network, BCRSN for short; the connections between the model's input layer and hidden layer follow the CNN scheme of local connectivity and weight sharing, and several convolution kernels are used to obtain the bidirectional convolutional recurrent feature map group, namely the BCRFMs; the long-term dependence among features within the BCRFMs is captured by replacing each neuron in the BCRFMs with a long short-term memory (LSTM) module; the specific operation is as follows: the time-frequency graph is convolved over the time domain with 64 convolution kernels of size 3 × 1 and stride 2 to obtain the BCRFMs; the neurons in the BCRFMs are connected by bidirectional recurrence following the time order of the audio frames, and the input of a neuron at a given frame is the weighted sum of the corresponding convolution result and the neuron outputs of the previous/next frame; meanwhile, each neuron in the BCRFMs is replaced by an LSTM module, whose input, output and forget gates allow information of segments of arbitrary duration to be memorized; finally, a 3 × 1 downsampling operation reduces the size of the feature maps and enhances the robustness of the model;
the learning of the BCRFMs comprises the following steps:
(i) the connections between the input layer of the BCRSN model and the forward/backward convolutional recurrent layers are mediated by convolution kernels; the forward and backward convolutional recurrent layers have the same number and arrangement of neurons as a CNN convolutional layer, so that the model retains the ability to adaptively learn invariant features, and the convolution result of each neuron is calculated by formula (1):

C_{nt,k} = W_k ⊗ X_{n,t}    (1)

where C_{nt,k} is the convolution result of the neuron at position (n, t) of the k-th feature map, n = 1, 2, ..., (N-1)/2, t = 1, 2, ..., T; X_{n,t} is the two-dimensional feature matrix at the corresponding position (n, t) of the input layer, and W_k is the weight parameter of the k-th convolution kernel;
(ii) the neurons in the BCRFMs are connected by bidirectional recurrence following the time order of the audio frames, and the input of a neuron at a given frame is the weighted sum of the corresponding convolution result and the neuron outputs of the previous/next frame;
for the feature maps of the forward convolutional recurrent layer, the input of each neuron is given by formula (2):

FI_{nt,k} = C_{nt,k} + U_k^F · FO_{(t-1),k}    (2)

and the output by formula (3):

FO_{nt,k} = σ(FI_{nt,k} + b_{nt,k})    (3)

for the feature maps of the backward convolutional recurrent layer, the input of each neuron is given by formula (4):

BI_{nt,k} = C_{nt,k} + U_k^B · BO_{(t+1),k}    (4)

and the output by formula (5):

BO_{nt,k} = σ(BI_{nt,k} + b_{nt,k})    (5)

where FO_{(t-1),k} / BO_{(t+1),k} denote the outputs of all neurons of the previous/next frame t-1 / t+1 of the k-th feature map; U_k^F and U_k^B denote the connection matrices of the neurons in the forward and backward propagation processes, respectively, with weights shared among all audio frames; and b_{nt,k} is the network bias;
(iii) each neuron in the BCRFMs is replaced by an LSTM module, whose input, output and forget gates allow information of segments of arbitrary duration to be memorized; between the forward/backward convolutional recurrent layers and the forward/backward pooling layers, a downsampling operation is applied along the frequency axis, representing each 3 × 1 downsampling region by the maximum feature within it and thereby reducing the size of the feature maps;
3) converting the regression problem into binary classification problems: comprising the representation of binary values and sparse processing, with the following steps,
3-1) representation of binary values: converting the regression problem into a weighted combination of several binary classification problems by representing numerical real-valued data with a weighted mixed binary representation, so as to reduce the computational complexity of the model; the specific operation is as follows: the output layer of the BCRSN model is given L+1 neurons, and the resulting prediction sequence is denoted by O; O_1 predicts the sign of the true value, and O_2 ~ O_{L+1} predict the absolute value of the true value, whose range is (0, 1); each neuron acts as a binary classifier, reducing the computational complexity of the loss function to O((L+1) × 1²) = O(L+1) and making the model converge faster;
the weighted mixed binary representation comprises the following steps:
(i) the new weighted mixed binary representation converts the numerical real-valued data g into a mixed binary vector O* to reduce computational complexity; each bit O*_i of the vector is calculated by formula (6):

O*_1 = 1 (g_1 ≥ 0) or 0 (g_1 < 0);   O*_{i+1} = the i-th binary-fraction bit of |g_1|, i = 1, 2, ..., L    (6)

where g_1 = g and the first bit O*_1 is determined by the sign of g_1: O*_1 = 1 when g_1 ≥ 0, and O*_1 = 0 when g_1 < 0; the remaining bits give the binary-fraction expansion of the absolute value over (0, 1);
(ii) each output-layer neuron O_i is assigned a contribution weight λ_i to the model loss function, which controls the convergence direction of the loss and improves the prediction accuracy; the segment loss is calculated by the following formula:

δ(O, O*) = Σ_{i=1}^{L+1} λ_i · δ(O_i, O*_i)

where δ(·) denotes the loss-function calculation and λ_i denotes the contribution of O_i to the loss function;
3-2) sparse processing: using the concordance correlation coefficient (CCC) as the loss function and adding a penalty term to the CCC to form the model's objective function, making the BCRFMs as sparse as possible and obtaining the SII-ASF; the specific operation is as follows: the CCC is used as the loss function, and a Lasso penalty term on the BCRFMs weights is added to the CCC to form the model's objective function, making the BCRFMs as sparse as possible and obtaining the SII-ASF;
the CCC is used as the loss function so that the network is trained more discriminatively; specifically, each song is divided into segments of fixed duration and the ground-truth value of each segment is converted into a mixed binary vector O*; the loss function is computed in the following steps:
(i) the CCC between the predicted sequence O and the true sequence O* of each segment is calculated; the CCC between the predicted sequence f_s of a sequence sample s and its target sequence y_s is defined as:

CCC_s = 2 Q_s / (S_s^f + S_s^y + (μ_s^f - μ_s^y)²)    (7)

where S_s^f and S_s^y denote the variances of the predicted and target sequences, Q_s denotes their covariance, μ_s^f and μ_s^y their means, t denotes the time index of each label value, and N_s denotes the length of the sequence s; on this basis, the number of digits L+1 of the mixed binary vector is taken as the sequence length of each segment, the contribution weight of each digit to the model loss function is taken into account, and formula (7) is rewritten to obtain the CCC between the predicted sequence O and the true sequence O* of each segment:
CCC_s = 2 Q_s^λ / (S_s^O + S_s^{O*} + (μ_s^O - μ_s^{O*})²)    (8)

where O* and O denote the true and predicted mixed binary vectors of the segment, respectively, λ = (λ_1, λ_2, ..., λ_{L+1}) denotes the set of contribution parameters of O to the segment loss function, and the means, variances and covariance in formula (8) are computed over the L+1 digits with each digit weighted by its λ_i; the CCC solution of the regression prediction problem is thus translated into a weighted sum of multiple binary-classification accuracies;
(ii) the average CCC of each song is calculated from the CCC of each of its segments and the number of segments:

CCC_avg = (1 / N_s) · Σ_{s=1}^{N_s} CCC_s

where N_s here denotes the length of each song, i.e., the number of segments;
Lasso regression is used to set the coefficients of some neurons to 0, deleting repeated, correlated variables and many noise features and selecting the SII-ASF with stronger emotional salience; specifically, a Lasso penalty term on the BCRFMs weights is added to the CCC-based loss function to form the final objective function:

L = L_CCC + α_F Σ |β_F| + α_B Σ |β_B|

where L_CCC denotes the CCC-based loss, β_F denotes the parameter set of the forward BCRFMs and, in a similar manner, β_B that of the backward BCRFMs; α_F and α_B are hyper-parameters controlling the sparsity of the feature maps, with larger α values giving higher sparsity; minimizing L removes the noise features, selects the emotion salience features, and improves the prediction accuracy;
4) continuous emotion recognition of music: based on the binary classification results, emotion recognition is first performed on the audio content of a single segment, and continuous emotion recognition is then performed over the multiple audio segments of the complete music file.
CN201910485792.9A 2019-06-05 2019-06-05 Music emotion recognition method based on bidirectional convolution cyclic sparse network Active CN110223712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910485792.9A CN110223712B (en) 2019-06-05 2019-06-05 Music emotion recognition method based on bidirectional convolution cyclic sparse network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910485792.9A CN110223712B (en) 2019-06-05 2019-06-05 Music emotion recognition method based on bidirectional convolution cyclic sparse network

Publications (2)

Publication Number Publication Date
CN110223712A CN110223712A (en) 2019-09-10
CN110223712B true CN110223712B (en) 2021-04-20

Family

ID=67819412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910485792.9A Active CN110223712B (en) 2019-06-05 2019-06-05 Music emotion recognition method based on bidirectional convolution cyclic sparse network

Country Status (1)

Country Link
CN (1) CN110223712B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689902B (en) * 2019-12-11 2020-07-14 北京影谱科技股份有限公司 Audio signal time sequence processing method, device and system based on neural network and computer readable storage medium
CN111326164B (en) * 2020-01-21 2023-03-21 大连海事大学 Semi-supervised music theme extraction method
CN113268628B (en) * 2021-04-14 2023-05-23 上海大学 Music emotion recognition method based on modularized weighted fusion neural network
CN115294644A (en) * 2022-06-24 2022-11-04 北京昭衍新药研究中心股份有限公司 Rapid monkey behavior identification method based on 3D convolution parameter reconstruction

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105469065A (en) * 2015-12-07 2016-04-06 中国科学院自动化研究所 Recurrent neural network-based discrete emotion recognition method
CN106128479A (en) * 2016-06-30 2016-11-16 福建星网视易信息系统有限公司 A kind of performance emotion identification method and device
CN106228977A (en) * 2016-08-02 2016-12-14 合肥工业大学 The song emotion identification method of multi-modal fusion based on degree of depth study
US9570091B2 (en) * 2012-12-13 2017-02-14 National Chiao Tung University Music playing system and music playing method based on speech emotion recognition
WO2017122798A1 (en) * 2016-01-14 2017-07-20 国立研究開発法人産業技術総合研究所 Target value estimation system, target value estimation method, and target value estimation program
CN107169409A (en) * 2017-03-31 2017-09-15 北京奇艺世纪科技有限公司 A kind of emotion identification method and device
CN107506722A (en) * 2017-08-18 2017-12-22 中国地质大学(武汉) One kind is based on depth sparse convolution neutral net face emotion identification method
US20180075343A1 (en) * 2016-09-06 2018-03-15 Google Inc. Processing sequences using convolutional neural networks
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN108805089A (en) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 Based on multi-modal Emotion identification method
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
CN109147826A (en) * 2018-08-22 2019-01-04 平安科技(深圳)有限公司 Music emotion recognition method, device, computer equipment and computer storage medium
CN109508375A (en) * 2018-11-19 2019-03-22 重庆邮电大学 A kind of social affective classification method based on multi-modal fusion
CN109599128A (en) * 2018-12-24 2019-04-09 北京达佳互联信息技术有限公司 Speech-emotion recognition method, device, electronic equipment and readable medium

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9570091B2 (en) * 2012-12-13 2017-02-14 National Chiao Tung University Music playing system and music playing method based on speech emotion recognition
CN105469065B (en) * 2015-12-07 2019-04-23 中国科学院自动化研究所 A kind of discrete emotion identification method based on recurrent neural network
CN105469065A (en) * 2015-12-07 2016-04-06 中国科学院自动化研究所 Recurrent neural network-based discrete emotion recognition method
WO2017122798A1 (en) * 2016-01-14 2017-07-20 国立研究開発法人産業技術総合研究所 Target value estimation system, target value estimation method, and target value estimation program
CN106128479A (en) * 2016-06-30 2016-11-16 福建星网视易信息系统有限公司 A kind of performance emotion identification method and device
CN106228977A (en) * 2016-08-02 2016-12-14 合肥工业大学 The song emotion identification method of multi-modal fusion based on degree of depth study
US20180075343A1 (en) * 2016-09-06 2018-03-15 Google Inc. Processing sequences using convolutional neural networks
CN107169409A (en) * 2017-03-31 2017-09-15 北京奇艺世纪科技有限公司 A kind of emotion identification method and device
CN107506722A (en) * 2017-08-18 2017-12-22 中国地质大学(武汉) One kind is based on depth sparse convolution neutral net face emotion identification method
CN108805089A (en) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 Based on multi-modal Emotion identification method
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN109147826A (en) * 2018-08-22 2019-01-04 平安科技(深圳)有限公司 Music emotion recognition method, device, computer equipment and computer storage medium
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
CN109508375A (en) * 2018-11-19 2019-03-22 重庆邮电大学 A kind of social affective classification method based on multi-modal fusion
CN109599128A (en) * 2018-12-24 2019-04-09 北京达佳互联信息技术有限公司 Speech-emotion recognition method, device, electronic equipment and readable medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"LSTM for dynamic emotion and group emotion recognition in the wild";B Sun;《the 18th ACM International conference 》;20161231;全文 *
"review of data features-based music Emotion Recognition method";yang Xinyu;《multimedia system》;20180630;第24卷(第4期);全文 *
"stacked convolutional recurrent neural networks for music emotion recognition";M Malik;《arXiv:1706.02292v1》;20170607;全文 *
"基于深度学习的音乐情感识别";唐霞;《电脑知识与技术》;20190430;第15卷(第11期);全文 *
"跨库语音情感识别若干关键技术研究";张昕然;《中国博士学位论文全文数据库信息科技辑》;20171115;全文 *

Also Published As

Publication number Publication date
CN110223712A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
CN110223712B (en) Music emotion recognition method based on bidirectional convolution cyclic sparse network
CN111667884B (en) Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism
Choi et al. Convolutional recurrent neural networks for music classification
AU2020100710A4 (en) A method for sentiment analysis of film reviews based on deep learning and natural language processing
Sirat et al. Neural trees: a new tool for classification
CN110442705B (en) Abstract automatic generation method based on concept pointer network
CN110600047A (en) Perceptual STARGAN-based many-to-many speaker conversion method
CN111816156A (en) Many-to-many voice conversion method and system based on speaker style feature modeling
CN110060657B (en) SN-based many-to-many speaker conversion method
WO2020095321A2 (en) Dynamic structure neural machine for solving prediction problems with uses in machine learning
CN108876044B (en) Online content popularity prediction method based on knowledge-enhanced neural network
CN111461322A (en) Deep neural network model compression method
CN111276187B (en) Gene expression profile feature learning method based on self-encoder
Tsouvalas et al. Privacy-preserving speech emotion recognition through semi-supervised federated learning
CN113643724B (en) Kiwi emotion recognition method and system based on time-frequency double-branch characteristics
CN112766360A (en) Time sequence classification method and system based on time sequence bidimensionalization and width learning
CN117059103A (en) Acceleration method of voice recognition fine tuning task based on low-rank matrix approximation
CN116469561A (en) Breast cancer survival prediction method based on deep learning
CN110600046A (en) Many-to-many speaker conversion method based on improved STARGAN and x vectors
CN115810351A (en) Controller voice recognition method and device based on audio-visual fusion
CN114743569A (en) Speech emotion recognition method based on double-layer fusion deep network
Lin et al. Learning semantically meaningful embeddings using linear constraints
Jie et al. Regularized flexible activation function combination for deep neural networks
Pandey et al. Generative Restricted Kernel Machines.
Vadiraja et al. A Survey on Knowledge integration techniques with Artificial Neural Networks for seq-2-seq/time series models

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant