CN110223712B - Music emotion recognition method based on bidirectional convolution cyclic sparse network - Google Patents
- Publication number
- CN110223712B (application CN201910485792.9A)
- Authority
- CN
- China
- Prior art keywords
- time
- convolution
- bcrfms
- model
- neuron
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/0008—Associated control or indicating means
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/031—Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Human Computer Interaction (AREA)
- Psychiatry (AREA)
- Hospice & Palliative Care (AREA)
- General Health & Medical Sciences (AREA)
- Child & Adolescent Psychology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a music emotion recognition method based on a bidirectional convolutional recurrent sparse network. The method combines a convolutional neural network with a recurrent neural network to adaptively learn emotionally salient features containing temporal information from a two-dimensional time-frequency representation (i.e., a time-frequency graph) of the raw audio signal. Furthermore, the invention adopts a weighted mixed binary representation that converts the regression prediction problem into a weighted combination of several binary classification problems, reducing the computational complexity of handling numerical ground-truth data. Experimental results show that the emotionally salient features with temporal information extracted by the bidirectional convolutional recurrent sparse network outperform the best features proposed in MediaEval 2015, and that, compared with commonly used network structures for music emotion recognition and the state-of-the-art methods, the proposed model reduces training time while improving prediction accuracy. The method therefore effectively addresses both the accuracy and the efficiency of music emotion recognition and is superior to existing recognition methods.
Description
Technical Field
The invention belongs to the field of machine learning and emotion calculation, and particularly relates to a music emotion recognition method based on a bidirectional convolution cyclic sparse network.
Background
With the development of multimedia technology, the explosive growth of digital music from different media has drawn increasing attention to fast and efficient music query and retrieval. Because music conveys emotion-related information, and emotion-based music retrieval offers high generality and user satisfaction, retrieving music by recognizing the emotion of the audio signal has become an important research direction. Its core difficulty is how to further improve the accuracy and efficiency of music emotion recognition.
The goal of music emotion recognition is to infer the perceived emotional state of a piece of music by extracting and analyzing features such as tempo, timbre, and intensity. Numerous studies of music emotion recognition based on convolutional neural networks (CNN) and recurrent neural networks (RNN) have shown clear advantages: a CNN can adaptively learn high-level invariant features from raw audio data, removing the dependence of feature extraction on human subjectivity and experience, while an RNN can model the temporal dependencies of musical information. The proposed method, based on a bidirectional convolutional recurrent sparse network, combines the CNN's adaptive learning of high-level invariant features with the RNN's ability to learn temporal relationships among features, and predicts Arousal and Valence emotion values, thereby improving the accuracy of music emotion recognition.
Disclosure of Invention
The invention aims to improve the accuracy and efficiency of music emotion recognition, and provides a music emotion recognition method based on a bidirectional convolutional recurrent sparse network.
To achieve this purpose, the invention adopts the following technical scheme:
A music emotion recognition method based on a bidirectional convolutional recurrent sparse network first converts the audio signal into a time-frequency graph; second, it builds an audio temporal model by internally fusing a convolutional neural network with a recurrent neural network to learn emotionally salient features containing temporal information, namely the SII-ASF, while a weighted mixed binary representation converts the regression problem into several binary classification problems to reduce the computational complexity; finally, it performs continuous emotion recognition on the music.
The invention is further improved in that the method specifically comprises the following steps:
1) time-frequency graph conversion of the audio signal: comprising the time-frequency graph conversion of the audio file and the dimensionality reduction of the time-frequency graph, specifically the following steps,
1-1) time-frequency graph conversion of the audio file: divide each time-domain audio file into non-overlapping segments of fixed duration, apply a sliding window of fixed frame length and step size to each segment, and convert the segment into a time-frequency graph;
1-2) dimensionality reduction of the time-frequency graph: use PCA whitening, retaining 99% of the data variance, to reduce the frequency dimension of the time-frequency graph;
2) build an audio temporal model to learn the emotionally salient features containing temporal information: combine the CNN's adaptive feature learning with the RNN's ability to process time-series data to construct a bidirectional convolutional recurrent sparse network (BCRSN); change the connections between the model's input layer and hidden layer using CNN-style local connectivity and weight sharing, and use multiple convolution kernels to obtain the bidirectional convolutional recurrent feature maps (BCRFMs); replace each neuron in the BCRFMs with a long short-term memory (LSTM) module to capture the long-term dependencies within the BCRFMs;
3) converting the regression problem into binary classification problems: comprising the representation of binary values and sparsification, with the following steps,
3-1) representation of binary values: based on a representation method for numerical ground-truth data, a weighted mixed binary representation converts the regression problem into a weighted combination of several binary classification problems so as to reduce the computational complexity of the model;
3-2) sparsification: use the concordance correlation coefficient (CCC) as the loss function and add a penalty term to the CCC as the model's objective function, making the BCRFMs as sparse as possible and yielding the SII-ASF;
4) continuous emotion recognition of music: based on the binary classification results, first recognize the emotion of the audio content of a single segment, and then perform continuous emotion recognition over the multiple audio segments of the complete music file.
A further improvement of the invention is that step 1-1) operates as follows: divide each time-domain audio file into non-overlapping segments of 500 ms duration, and convert each segment into a time-frequency graph using a sliding window with a 60 ms frame length and a 10 ms step size.
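As a concrete illustration of step 1-1), the segmentation and sliding-window conversion can be sketched with plain NumPy. The 16 kHz sample rate and the Hann window are illustrative assumptions; the patent does not specify them.

```python
import numpy as np

def segment_signal(y, sr, seg_ms=500):
    """Split a time-domain signal into non-overlapping fixed-duration segments."""
    seg_len = int(sr * seg_ms / 1000)
    n_segs = len(y) // seg_len
    return [y[i * seg_len:(i + 1) * seg_len] for i in range(n_segs)]

def spectrogram(seg, sr, frame_ms=60, hop_ms=10):
    """Magnitude time-frequency graph: a 60 ms Hann window sliding in 10 ms steps."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    window = np.hanning(frame)
    frames = [seg[s:s + frame] * window
              for s in range(0, len(seg) - frame + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # shape (freq, time)
```

Note that at a 16 kHz sample rate each 500 ms segment yields exactly 45 windows, so once PCA reduces the frequency axis to 45 dimensions the model input is 45 × 45.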
A further improvement of the invention is that step 1-2) operates as follows: perform PCA whitening retaining 99% of the data variance, reduce the frequency dimension of the time-frequency graph to 45, and obtain a 45 × 45 time-frequency graph as the input to the BCRSN model.
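A minimal NumPy sketch of the PCA-whitening step in 1-2), based on an eigendecomposition of the covariance matrix (the small `eps` regularizer is an implementation detail added here, not part of the patent):

```python
import numpy as np

def pca_whiten(X, var_keep=0.99, eps=1e-8):
    """PCA-whiten the rows of X (time frames x frequency bins), keeping just
    enough principal components to retain `var_keep` of the data variance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / max(len(Xc) - 1, 1)
    evals, evecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    ratio = np.cumsum(evals) / evals.sum()
    k = int(np.searchsorted(ratio, var_keep) + 1)  # components for 99% variance
    W = evecs[:, :k] / np.sqrt(evals[:k] + eps)    # whitening transform
    return Xc @ W                                  # shape (frames, k)
```

After whitening, the retained components are decorrelated with (approximately) unit variance, which is what makes the reduced 45-dimensional frequency axis usable as a fixed-size model input.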
A further improvement of the invention is that step 2) operates as follows: convolve the time-frequency graph along the time axis with 64 convolution kernels of size 3 × 1 and stride 2 to obtain the BCRFMs; the neurons in the BCRFMs are connected by a bidirectional recurrence following the temporal order of the audio frames, and the input of a neuron at a given frame is the weighted sum of its convolution result and the neuron outputs of the previous/next frame; meanwhile, each neuron in the BCRFMs is replaced by an LSTM module whose input, output, and forget gates memorize information over segments of arbitrary duration; finally, a 3 × 1 downsampling operation reduces the size of the feature maps and strengthens the robustness of the model.
A further improvement of the invention is that the learning of BCRFMs in step 2) comprises the steps of:
(i) the connections between the BCRSN model input layer and the forward/backward convolutional recurrent layers use convolution kernels as the medium; the forward and backward convolutional recurrent layers have the same number and arrangement of neurons as a CNN convolutional layer, so that the model has the ability to adaptively learn invariant features; the convolution result of each neuron is computed by formula (1):

$$C_{nt,k} = W_k \cdot X_{n,t} \tag{1}$$

where $C_{nt,k}$ is the convolution result of the neuron at position $(n,t)$ of the $k$-th feature map, $n = 1, 2, \ldots, (N-1)/2$, $t = 1, 2, \ldots, T$; $X_{n,t}$ is the two-dimensional feature matrix at the corresponding position $(n,t)$ of the input layer; and $W_k$ is the weight parameter of the $k$-th convolution kernel;
(ii) the neurons in the BCRFMs are connected by a bidirectional recurrence following the temporal order of the audio frames, and the input of a neuron at a given frame is the weighted sum of its convolution result and the neuron outputs of the previous/next frame;

for the feature maps of the forward convolutional recurrent layer, the input of each neuron is given by formula (2):

$$FI_{nt,k} = C_{nt,k} + U_k^{F} \, FO_{n(t-1),k} \tag{2}$$

and the output by formula (3):

$$FO_{nt,k} = \sigma(FI_{nt,k} + b_{nt,k}) \tag{3}$$

for the feature maps of the backward convolutional recurrent layer, the input of each neuron is given by formula (4):

$$BI_{nt,k} = C_{nt,k} + U_k^{B} \, BO_{n(t+1),k} \tag{4}$$

and the output by formula (5):

$$BO_{nt,k} = \sigma(BI_{nt,k} + b_{nt,k}) \tag{5}$$

where $FO_{n(t-1),k}$ and $BO_{n(t+1),k}$ denote the outputs of the neurons of the previous/next frame $t-1$/$t+1$ of the $k$-th feature map; $U_k^{F}$ and $U_k^{B}$ denote the connection matrices of the neurons in the forward and backward propagation respectively, with weights shared across all audio frames; and $b_{nt,k}$ is the network bias;
(iii) replace each neuron in the BCRFMs with an LSTM module whose input, output, and forget gates memorize information over segments of arbitrary duration; between the forward/backward convolutional layers and the forward/backward pooling layers, perform a downsampling operation along the frequency axis, representing each 3 × 1 downsampling region by its maximum feature, thereby reducing the size of the feature maps.
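The forward/backward recurrences of formulas (2)-(5) can be sketched as follows, with a plain sigmoid neuron standing in for the LSTM module the patent substitutes at each position (a simplification for illustration; the equations' reconstruction above is itself a best-effort reading of the garbled source):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bidirectional_conv_recurrence(C, U_f, U_b, b):
    """Forward pass of one bidirectional convolutional recurrent feature map.
    C[n, t] is the convolution result at frequency position n and frame t
    (formula (1)); the forward/backward passes follow formulas (2)-(5),
    with weights U_f / U_b shared across all audio frames."""
    N, T = C.shape
    FO = np.zeros((N, T))
    BO = np.zeros((N, T))
    for t in range(T):                      # forward in frame order
        prev = FO[:, t - 1] if t > 0 else np.zeros(N)
        FO[:, t] = sigmoid(C[:, t] + U_f @ prev + b)
    for t in range(T - 1, -1, -1):          # backward in frame order
        nxt = BO[:, t + 1] if t < T - 1 else np.zeros(N)
        BO[:, t] = sigmoid(C[:, t] + U_b @ nxt + b)
    return FO, BO
```

Each output thus mixes the local convolution result with context propagated from earlier (forward pass) or later (backward pass) audio frames.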
A further improvement of the invention is that step 3-1) operates as follows: set L + 1 neurons in the output layer of the BCRSN model and denote the resulting prediction sequence by O, where O_1 predicts the sign of the true value and O_2 ~ O_{L+1} predict its absolute value, which lies in (0, 1); each neuron acts as a binary classifier, reducing the computational complexity of the loss function to O(L + 1) and making the model converge faster.
The invention is further improved in that, in the step 3-1), a weighted mixed binary number representation method is adopted, and the method comprises the following steps:
(i) the new weighted mixed binary representation converts the numerical ground-truth value $g$ into a mixed binary vector $O^*$ to reduce the computational complexity; each bit $O_i^*$ of the vector is computed by formula (6):

$$O_1^* = \begin{cases} 1, & g_1 \ge 0 \\ 0, & g_1 < 0 \end{cases}, \qquad O_{i+1}^* = \lfloor 2^{i} \, |g_1| \rfloor \bmod 2, \quad i = 1, \ldots, L \tag{6}$$

where $g_1 = g$, and $O_1^*$ is determined by the sign of $g_1$;
(ii) set the contribution weight of each output-layer neuron $O_i$ to the model loss function to control the convergence direction of the loss and improve the prediction accuracy; the segment loss is computed by:

$$\mathcal{L}_{seg} = \sum_{i=1}^{L+1} \lambda_i \, \delta(O_i, O_i^*)$$

where $\delta(\cdot)$ denotes the loss computation and $\lambda_i$ denotes the contribution of $O_i$ to the segment loss function.
A further improvement of the invention is that step 3-2) operates as follows: use the CCC as the loss function and add a Lasso penalty term on the BCRFMs' weights to the CCC as the model's objective function, making the BCRFMs as sparse as possible and yielding the SII-ASF.
The invention is further improved in that, in step 3-2), the CCC is used as the loss function to train the network more discriminatively; specifically, each song is divided into segments of fixed duration and the ground-truth value of each segment is converted into a mixed binary vector O*; solving the loss function comprises the following steps:
(i) compute the CCC between the predicted sequence $O$ and the true sequence $O^*$ of each segment; the CCC between the predicted sequence $f_s$ and the target sequence $y_s$ of a sequence sample $s$ is defined as:

$$CCC_s = \frac{2 Q_s}{S_s^{f} + S_s^{y} + (\bar{f}_s - \bar{y}_s)^2} \tag{7}$$

where $S_s$ denotes the variance, $Q_s$ denotes the covariance, $t$ denotes the time index of each label value, and $N_s$ denotes the length of the sequence $s$; on this basis, the number of bits $L+1$ of the mixed binary vector is taken as the sequence length of each segment and the contribution weight of each bit to the model loss function is taken into account, so that formula (7) is rewritten to give the CCC between the predicted sequence $O$ and the true sequence $O^*$ of each segment:

$$CCC_{seg} = \frac{2 Q_\lambda(O, O^*)}{S_\lambda^{O} + S_\lambda^{O^*} + (\bar{O}_\lambda - \bar{O}_\lambda^*)^2}$$

where the means, variances, and covariance are computed over the bits $i = 1, \ldots, L+1$ with weights $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_{L+1})$, the set of contribution parameters of $O$ to the segment loss function; the CCC solution of the regression prediction problem is thereby converted into a weighted sum of multiple binary classification accuracies;
(ii) the average CCC of each song is computed from the CCC of each of its segments and the number of segments:

$$CCC_{song} = \frac{1}{N_s} \sum_{s=1}^{N_s} CCC_s$$

where $N_s$ here denotes the length of the song, i.e., the number of its segments;
setting coefficients of some neurons to be 0 by using Lasso regression to delete repeated related variables and a plurality of noise features, and selecting the SII-ASF with stronger emotional significance; in particular, in the loss functionOn the basis, adding a Lasso penalty term of the BCRFMs weight as a final objective function:
in the formula, betaFA set of parameters representing the BCRFMs,in a similar manner, the first and second substrates are,αFand alphaBThe method is a hyper-parameter used for controlling the sparsity of a characteristic diagram, and the larger the alpha value is, the higher the sparsity is; and minimizing L to remove noise features, selecting the emotion significance features, and improving the prediction accuracy.
The invention has the following beneficial technical effects:
the music emotion recognition method based on the bidirectional convolution cyclic sparse network comprises the steps of firstly converting audio signals into a time-frequency diagram, secondly establishing an audio time sequence model by adopting a mode of internal fusion of CNN and RNN to learn SII-ASF, meanwhile, converting a regression problem into a plurality of binary classification problems by combining a weighted mixed binary representation method to reduce the calculation complexity, and finally carrying out continuous emotion recognition on music. Compared with the current common music emotion recognition network structure and the optimal method, the BCRSN model can obviously reduce training time and improve prediction precision, and the extracted SII-ASF features show better prediction performance compared with the optimal features proposed by the participants in MediaEval 2015.
Drawings
FIG. 1 is a flow chart of the BCRSN system of the present invention;
FIG. 2 is a diagram illustrating the conversion process from numeric real data to hybrid binary vectors in the present invention;
FIG. 3 is a comparison graph of prediction performance and training time of the BCRSN model and CNN-based, BLSTM-based and stacked CNN-BLSTM-based models on DEAM and MTurk music emotion recognition data sets in the present invention.
Detailed Description
The present invention is described in further detail below with reference to the attached drawings.
Referring to fig. 1, the music emotion recognition method based on the bidirectional convolutional recurrent sparse network first converts the audio signal into a time-frequency graph; second, it builds an audio temporal model by internally fusing a convolutional neural network (CNN) with a recurrent neural network (RNN) to learn emotionally salient features containing temporal information (SII-ASF), while a weighted mixed binary representation converts the regression problem into several binary classification problems to reduce the computational complexity; finally, it performs continuous emotion recognition on the music. The method specifically comprises the following steps:
1) time-frequency graph conversion of the audio signal: comprising the time-frequency graph conversion of the audio file and the dimensionality reduction of the time-frequency graph, specifically the following steps,
step1 converting the time-frequency diagram of the audio file: dividing each time domain audio file into non-overlapping segments with fixed duration, setting a sliding window with fixed frame length and step length for each segment, and converting the sliding window into a time-frequency graph;
and (4) dimension reduction processing of the Step2 time-frequency graph: and (3) setting a certain data difference retention degree to reduce the dimension of the frequency domain of the time-frequency graph by adopting a PCA whitening method.
2) Build an audio temporal model to learn the emotionally salient features containing temporal information: combine the CNN's adaptive feature learning with the RNN's ability to process time-series data to construct the bidirectional convolutional recurrent sparse network (BCRSN). Referring to fig. 1, within each frame t_i, the input two-dimensional time-frequency graph is connected to the forward/backward convolutional recurrent layers using CNN-style local connectivity and weight sharing, and a bidirectional recurrence between audio frames propagates temporal information to learn the BCRFMs; meanwhile, LSTM modules replace each neuron in the BCRFMs, capturing the long-term dependencies among features within the BCRFMs.
3) Converting the regression problem into binary classification problems: comprising the weighted representation of binary values and sparsification; referring to fig. 1 and 2, the detailed steps are as follows,
Step 1, weighted representation of binary values: based on a representation method for numerical ground-truth data, a weighted mixed binary representation converts the regression problem into a weighted combination of several binary classification problems to reduce the computational complexity;
Step 2, sparsification: use the CCC as the loss function and add a Lasso penalty term (L1 regularization) on the BCRFMs' weights to the CCC as the model's objective function, making the BCRFMs as sparse as possible and obtaining the SII-ASF.
4) Continuous emotion recognition of music: input the audio time-frequency graph into the BCRSN model, first recognize the emotion of the audio content of a single segment according to the multiple binary classification results, and then perform continuous emotion recognition over the multiple audio segments of the complete music file.
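Step 4's decoding — turning each segment's binary output-layer predictions back into a real emotion value and concatenating them over the song — can be sketched as below (this assumes the sign-bit-plus-binary-expansion reading of the mixed binary vector; names are illustrative):

```python
def decode_bits(bits):
    """Decode a predicted mixed binary vector (sign bit followed by L
    magnitude bits) back into a real value in (-1, 1)."""
    sign = 1.0 if bits[0] >= 0.5 else -1.0
    mag = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits[1:]))
    return sign * mag

def continuous_emotion(segment_predictions):
    """Turn the per-segment binary predictions of a full music file into a
    continuous emotion time series (one value per 500 ms segment)."""
    return [decode_bits(bits) for bits in segment_predictions]
```

Applying this to each of the Arousal and Valence output heads yields the two continuous emotion curves for the song.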
Referring to fig. 3, on the data sets of DEAM and MTurk, the BCRSN model in the present invention achieves the best performance for continuous emotion prediction in both Valence and Arousal dimensions compared with models based on CNN, BLSTM, and stacked CNN-BLSTM.
Referring to table 1, compared with the best algorithms of MediaEval 2015, the BCRSN model of the present invention can adaptively learn features that are effective for the prediction target from the raw audio signal with minimal prior knowledge, outperforming the three best-performing methods in MediaEval 2015 (BLSTM-RNN, BLSTM-ELM, deep LSTM-RNN).
Table 1: in the invention, the BCRSN model is compared with the first three methods (BLSTM-RNN, BLSTM-ELM and deep LSTM-RNN) with optimal performance in the MediaEval2015 by taking an original audio signal as input.
Note: N.S. (not significant) indicates that the performance of the method is not significantly different from that of the BCRSN model; otherwise the difference is significant.
Referring to Table 2, compared with the feature sets proposed by the participants in MediaEval 2015 (JUNLP, PKUAIPL, HKPolyU, THU-HCSIL and IRIT-SAMOVA), the SII-ASF and SII-NASF features obtained by the BCRSN model with and without the Lasso penalty both show good prediction performance.
Table 2: the SII-ASF and SII-NASF signatures extracted in the present invention were compared with the performance of the signatures proposed by the competitors in MediaEval2015 (JUNLP, PKUAIPL, HKPolyU, THU-HCSIL and IRIT-SAMOVA).
Note: N.S. (not significant) indicates that the performance of the feature is not significantly different from that of the SII-ASF; otherwise the difference is significant.
Claims (1)
1. A music emotion recognition method based on a bidirectional convolutional recurrent sparse network, characterized in that the audio signal is first converted into a time-frequency graph; second, an audio temporal model is built by internally fusing a convolutional neural network with a recurrent neural network to learn the emotionally salient features containing temporal information, namely the SII-ASF, while a weighted mixed binary representation converts the regression problem into several binary classification problems to reduce the computational complexity; finally, continuous emotion recognition is performed on the music; the method specifically comprises the following steps:
1) time-frequency graph conversion of the audio signal: comprising the time-frequency graph conversion of the audio file and the dimensionality reduction of the time-frequency graph, specifically the following steps,
1-1) time-frequency graph conversion of the audio file: divide each time-domain audio file into non-overlapping segments of fixed duration, apply a sliding window of fixed frame length and step size to each segment, and convert the segment into a time-frequency graph; the specific operation is as follows: divide each time-domain audio file into non-overlapping segments of 500 ms duration, and convert each segment into a time-frequency graph using a sliding window with a 60 ms frame length and a 10 ms step size;
1-2) dimensionality reduction of the time-frequency graph: use PCA whitening, retaining 99% of the data variance, to reduce the frequency dimension of the time-frequency graph; the specific operation is as follows: perform PCA whitening retaining 99% of the data variance, reduce the frequency dimension of the time-frequency graph to 45, and obtain a 45 × 45 time-frequency graph as the input to the BCRSN model;
2) build an audio temporal model to learn the emotionally salient features containing temporal information: combine the CNN's adaptive feature learning with the RNN's ability to process time-series data to construct a bidirectional convolutional recurrent sparse network (BCRSN); change the connections between the model's input layer and hidden layer using CNN-style local connectivity and weight sharing, and use multiple convolution kernels to obtain the bidirectional convolutional recurrent feature maps (BCRFMs); replace each neuron in the BCRFMs with a long short-term memory (LSTM) module to capture the long-term dependencies among the BCRFMs; the specific operation is as follows: convolve the time-frequency graph along the time axis with 64 convolution kernels of size 3 × 1 and stride 2 to obtain the BCRFMs; the neurons in the BCRFMs are connected by a bidirectional recurrence following the temporal order of the audio frames, and the input of a neuron at a given frame is the weighted sum of its convolution result and the neuron outputs of the previous/next frame; meanwhile, each neuron in the BCRFMs is replaced by an LSTM module whose input, output, and forget gates memorize information over segments of arbitrary duration; finally, a 3 × 1 downsampling operation reduces the size of the feature maps and strengthens the robustness of the model;
the learning of the BCRFMs comprises the following steps:
(i) the connections between the BCRSN model input layer and the forward/backward convolutional recurrent layers use convolution kernels as the medium; the forward and backward convolutional recurrent layers have the same number and arrangement of neurons as a CNN convolutional layer, so that the model has the ability to adaptively learn invariant features; the convolution result of each neuron is computed by formula (1):

$$C_{nt,k} = W_k \cdot X_{n,t} \tag{1}$$

where $C_{nt,k}$ is the convolution result of the neuron at position $(n,t)$ of the $k$-th feature map, $n = 1, 2, \ldots, (N-1)/2$, $t = 1, 2, \ldots, T$; $X_{n,t}$ is the two-dimensional feature matrix at the corresponding position $(n,t)$ of the input layer; and $W_k$ is the weight parameter of the $k$-th convolution kernel;
(ii) the neurons in the BCRFMs are connected by a bidirectional recurrence following the temporal order of the audio frames, and the input of a neuron at a given frame is the weighted sum of its convolution result and the neuron outputs of the previous/next frame;

for the feature maps of the forward convolutional recurrent layer, the input of each neuron is given by formula (2):

$$FI_{nt,k} = C_{nt,k} + U_k^{F} \, FO_{n(t-1),k} \tag{2}$$

and the output by formula (3):

$$FO_{nt,k} = \sigma(FI_{nt,k} + b_{nt,k}) \tag{3}$$

for the feature maps of the backward convolutional recurrent layer, the input of each neuron is given by formula (4):

$$BI_{nt,k} = C_{nt,k} + U_k^{B} \, BO_{n(t+1),k} \tag{4}$$

and the output by formula (5):

$$BO_{nt,k} = \sigma(BI_{nt,k} + b_{nt,k}) \tag{5}$$

where $FO_{n(t-1),k}$ and $BO_{n(t+1),k}$ denote the outputs of the neurons of the previous/next frame $t-1$/$t+1$ of the $k$-th feature map; $U_k^{F}$ and $U_k^{B}$ denote the connection matrices of the neurons in the forward and backward propagation respectively, with weights shared across all audio frames; and $b_{nt,k}$ is the network bias;
(iii) replace each neuron in the BCRFMs with an LSTM module whose input, output, and forget gates memorize information over segments of arbitrary duration; between the forward/backward convolutional layers and the forward/backward pooling layers, perform a downsampling operation along the frequency axis, representing each 3 × 1 downsampling region by its maximum feature, thereby reducing the size of the feature maps;
3) converting the regression problem into binary classification problems: comprising the representation of binary values and sparsification, with the following steps,
3-1) Binary-value representation: based on the numeric real-data representation and a weighted hybrid binary representation, the regression problem is converted into a weighted combination of several binary classification problems to reduce the computational complexity of the model. Specifically, L+1 neurons are set in the output layer of the BCRSN model, and the obtained prediction sequence is denoted O, where O_1 predicts the sign of the true value and O_2 to O_{L+1} predict the absolute value of the true value, whose range is (0,1). Each neuron acts as a binary classifier, reducing the computational complexity of the loss function to O((L+1)×1²) = O(L+1), so the model converges faster.
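One plausible reading of this representation is sketched below. The exact bit rule of equation (6) is not reproduced in this text, so a plain fixed-point binary expansion of |g| and the convention "sign bit = 1 for g ≥ 0" are assumptions for illustration.

```python
def to_hybrid_binary(g, L):
    """Convert a real value g in (-1, 1) into a sign bit plus L
    magnitude bits. ASSUMPTION: a standard fixed-point binary
    expansion stands in for the bit rule of equation (6)."""
    bits = [1 if g >= 0 else 0]  # O_1: sign of the true value (assumed convention)
    frac = abs(g)
    for _ in range(L):           # O_2..O_{L+1}: binary expansion of |g|
        frac *= 2
        bit = int(frac >= 1.0)
        bits.append(bit)
        frac -= bit
    return bits
```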
The weighted hybrid binary representation of a value proceeds as follows:
(i) The new weighted hybrid binary representation converts numeric real data g into a hybrid binary vector O* to reduce computational complexity; each bit O*_i of the vector is calculated by equation (6), where g_1 = g and O*_1 is determined by the sign of g_1, taking one binary value when g_1 ≥ 0 and the other when g_1 < 0.
(ii) The contribution weight of each output-layer neuron O_i to the model loss function is set so as to control the convergence direction of the loss function and improve prediction accuracy, where δ(·) denotes the loss-function calculation and λ_i represents the contribution of O_i to the loss function.
3-2) Sparsification: the concordance correlation coefficient (CCC) is used as the loss function, and a Lasso penalty term on the BCRFMs weights is added to the CCC to form the objective function of the model, making the BCRFMs as sparse as possible and yielding the SII-ASF.
Using CCC as the loss function makes network training more discriminative. Specifically, each song is divided into segments of fixed duration and the ground-truth data of each segment are converted into a hybrid binary vector O*. The loss function is solved in the following steps:
(i) Compute the CCC between the predicted sequence O and the true sequence O* of each segment. The CCC between the predicted sequence f_s of a sequence sample s and the target sequence y_s is defined as:

CCC_s = 2Q_s / (S_s^f + S_s^y + (μ_s^f − μ_s^y)²) (7)

where S_s^f and S_s^y denote the mean squared deviations of the predicted and target sequences, Q_s denotes their covariance, t denotes the time index of each label value, and N_s denotes the length of the sequence s. On this basis, the number of bits L+1 of the hybrid binary vector is taken as the sequence length of each segment and, considering the contribution weight of each bit to the model loss function, equation (7) is rewritten to obtain the CCC between the predicted sequence O and the true sequence O* of each segment:
where O* and O denote the hybrid binary vectors of the segment ground truth and prediction, respectively, and λ = (λ_1, λ_2, ..., λ_{L+1}) denotes the set of contribution parameters of O to the segment loss function. The CCC solution of the regression prediction problem is thus converted into a weighted sum of multiple binary-classification accuracies.
(ii) The average CCC of each song is calculated from the CCC of each of its segments and the number of segments:

CCC_song = (1/N_s) Σ_s CCC_s

where N_s represents the length of each song, i.e., its number of segments.
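The CCC of step (i) and the per-song average of step (ii) can be sketched directly from their definitions; the function names are illustrative.

```python
import numpy as np

def ccc(f, y):
    """Concordance correlation coefficient between a predicted
    sequence f and a target sequence y, per equation (7):
    2*cov / (var_f + var_y + (mean_f - mean_y)^2)."""
    f, y = np.asarray(f, float), np.asarray(y, float)
    mu_f, mu_y = f.mean(), y.mean()
    var_f, var_y = f.var(), y.var()            # mean squared deviations
    cov = ((f - mu_f) * (y - mu_y)).mean()     # covariance Q_s
    return 2 * cov / (var_f + var_y + (mu_f - mu_y) ** 2)

def song_ccc(segment_cccs):
    """Average CCC of a song over its N_s segments."""
    return sum(segment_cccs) / len(segment_cccs)
```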
By using Lasso regression to set the coefficients of some neurons to 0, repeated variables and many noise features are eliminated and the more salient SII-ASF are selected. Specifically, a Lasso penalty term on the BCRFMs weights is added to the CCC-based loss function to form the final objective function:

L = −CCC_song + α_F·||β_F||_1 + α_B·||β_B||_1

where β_F denotes the parameter set of the forward BCRFMs and β_B, similarly, that of the backward BCRFMs; α_F and α_B are hyper-parameters controlling the sparsity of the feature maps, a larger α giving higher sparsity. Minimizing L removes noise features, selects the emotion-salient features, and improves prediction accuracy.
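The final objective can be sketched as below; the sign convention (minimizing −CCC rather than, e.g., 1−CCC) and the default α values are assumptions for illustration.

```python
import numpy as np

def objective(avg_ccc, beta_F, beta_B, alpha_F=0.01, alpha_B=0.01):
    """Final objective L: maximize the song-level CCC while shrinking
    the BCRFMs weights, written here as minimizing -CCC plus L1
    (Lasso) penalties on the forward/backward parameter sets.
    ASSUMPTION: the exact sign convention is not given in the text."""
    l1_F = np.abs(beta_F).sum()  # ||beta_F||_1
    l1_B = np.abs(beta_B).sum()  # ||beta_B||_1
    return -avg_ccc + alpha_F * l1_F + alpha_B * l1_B
```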
4) Continuous music emotion recognition: according to the binary classification results, emotion recognition is first performed on the audio content of one segment, and continuous emotion recognition is then performed over the audio segments of the complete music file.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910485792.9A CN110223712B (en) | 2019-06-05 | 2019-06-05 | Music emotion recognition method based on bidirectional convolution cyclic sparse network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110223712A CN110223712A (en) | 2019-09-10 |
CN110223712B true CN110223712B (en) | 2021-04-20 |
Family
ID=67819412
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910485792.9A Active CN110223712B (en) | 2019-06-05 | 2019-06-05 | Music emotion recognition method based on bidirectional convolution cyclic sparse network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110223712B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110689902B (en) * | 2019-12-11 | 2020-07-14 | 北京影谱科技股份有限公司 | Audio signal time sequence processing method, device and system based on neural network and computer readable storage medium |
CN111326164B (en) * | 2020-01-21 | 2023-03-21 | 大连海事大学 | Semi-supervised music theme extraction method |
CN113268628B (en) * | 2021-04-14 | 2023-05-23 | 上海大学 | Music emotion recognition method based on modularized weighted fusion neural network |
CN115294644A (en) * | 2022-06-24 | 2022-11-04 | 北京昭衍新药研究中心股份有限公司 | Rapid monkey behavior identification method based on 3D convolution parameter reconstruction |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105469065A (en) * | 2015-12-07 | 2016-04-06 | 中国科学院自动化研究所 | Recurrent neural network-based discrete emotion recognition method |
CN106128479A (en) * | 2016-06-30 | 2016-11-16 | 福建星网视易信息系统有限公司 | A kind of performance emotion identification method and device |
CN106228977A (en) * | 2016-08-02 | 2016-12-14 | 合肥工业大学 | The song emotion identification method of multi-modal fusion based on degree of depth study |
US9570091B2 (en) * | 2012-12-13 | 2017-02-14 | National Chiao Tung University | Music playing system and music playing method based on speech emotion recognition |
WO2017122798A1 (en) * | 2016-01-14 | 2017-07-20 | 国立研究開発法人産業技術総合研究所 | Target value estimation system, target value estimation method, and target value estimation program |
CN107169409A (en) * | 2017-03-31 | 2017-09-15 | 北京奇艺世纪科技有限公司 | A kind of emotion identification method and device |
CN107506722A (en) * | 2017-08-18 | 2017-12-22 | 中国地质大学(武汉) | One kind is based on depth sparse convolution neutral net face emotion identification method |
US20180075343A1 (en) * | 2016-09-06 | 2018-03-15 | Google Inc. | Processing sequences using convolutional neural networks |
CN108717856A (en) * | 2018-06-16 | 2018-10-30 | 台州学院 | A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network |
CN108805089A (en) * | 2018-06-14 | 2018-11-13 | 南京云思创智信息科技有限公司 | Based on multi-modal Emotion identification method |
CN109146066A (en) * | 2018-11-01 | 2019-01-04 | 重庆邮电大学 | A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition |
CN109147826A (en) * | 2018-08-22 | 2019-01-04 | 平安科技(深圳)有限公司 | Music emotion recognition method, device, computer equipment and computer storage medium |
CN109508375A (en) * | 2018-11-19 | 2019-03-22 | 重庆邮电大学 | A kind of social affective classification method based on multi-modal fusion |
CN109599128A (en) * | 2018-12-24 | 2019-04-09 | 北京达佳互联信息技术有限公司 | Speech-emotion recognition method, device, electronic equipment and readable medium |
Non-Patent Citations (5)
Title |
---|
"LSTM for dynamic emotion and group emotion recognition in the wild";B Sun;《the 18th ACM International conference 》;20161231;全文 * |
"review of data features-based music Emotion Recognition method";yang Xinyu;《multimedia system》;20180630;第24卷(第4期);全文 * |
"stacked convolutional recurrent neural networks for music emotion recognition";M Malik;《arXiv:1706.02292v1》;20170607;全文 * |
"基于深度学习的音乐情感识别";唐霞;《电脑知识与技术》;20190430;第15卷(第11期);全文 * |
"跨库语音情感识别若干关键技术研究";张昕然;《中国博士学位论文全文数据库信息科技辑》;20171115;全文 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110223712B (en) | Music emotion recognition method based on bidirectional convolution cyclic sparse network | |
CN111667884B (en) | Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism | |
Choi et al. | Convolutional recurrent neural networks for music classification | |
AU2020100710A4 (en) | A method for sentiment analysis of film reviews based on deep learning and natural language processing | |
Sirat et al. | Neural trees: a new tool for classification | |
CN110442705B (en) | Abstract automatic generation method based on concept pointer network | |
CN110600047A (en) | Perceptual STARGAN-based many-to-many speaker conversion method | |
CN111816156A (en) | Many-to-many voice conversion method and system based on speaker style feature modeling | |
CN110060657B (en) | SN-based many-to-many speaker conversion method | |
WO2020095321A2 (en) | Dynamic structure neural machine for solving prediction problems with uses in machine learning | |
CN108876044B (en) | Online content popularity prediction method based on knowledge-enhanced neural network | |
CN111461322A (en) | Deep neural network model compression method | |
CN111276187B (en) | Gene expression profile feature learning method based on self-encoder | |
Tsouvalas et al. | Privacy-preserving speech emotion recognition through semi-supervised federated learning | |
CN113643724B (en) | Kiwi emotion recognition method and system based on time-frequency double-branch characteristics | |
CN112766360A (en) | Time sequence classification method and system based on time sequence bidimensionalization and width learning | |
CN117059103A (en) | Acceleration method of voice recognition fine tuning task based on low-rank matrix approximation | |
CN116469561A (en) | Breast cancer survival prediction method based on deep learning | |
CN110600046A (en) | Many-to-many speaker conversion method based on improved STARGAN and x vectors | |
CN115810351A (en) | Controller voice recognition method and device based on audio-visual fusion | |
CN114743569A (en) | Speech emotion recognition method based on double-layer fusion deep network | |
Lin et al. | Learning semantically meaningful embeddings using linear constraints | |
Jie et al. | Regularized flexible activation function combination for deep neural networks | |
Pandey et al. | Generative Restricted Kernel Machines. | |
Vadiraja et al. | A Survey on Knowledge integration techniques with Artificial Neural Networks for seq-2-seq/time series models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||