CN110223712B - Music emotion recognition method based on bidirectional convolution cyclic sparse network - Google Patents

Music emotion recognition method based on bidirectional convolution cyclic sparse network

Info

Publication number
CN110223712B
CN110223712B CN201910485792.9A
Authority
CN
China
Prior art keywords
time
convolution
bcrfms
model
neuron
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910485792.9A
Other languages
Chinese (zh)
Other versions
CN110223712A (en)
Inventor
杨新宇
董怡卓
罗晶
张亦弛
魏洁
崔宇涵
夏小景
吉姝蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201910485792.9A priority Critical patent/CN110223712B/en
Publication of CN110223712A publication Critical patent/CN110223712A/en
Application granted granted Critical
Publication of CN110223712B publication Critical patent/CN110223712B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 - Details of electrophonic musical instruments
    • G10H1/0008 - Associated control or indicating means
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 - Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 - Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a music emotion recognition method based on a bidirectional convolutional recurrent sparse network. The method combines a convolutional neural network with a recurrent neural network to adaptively learn emotion salience features containing time-sequence information from the two-dimensional time-frequency representation (i.e., the time-frequency graph) of the original audio signal. Furthermore, the invention adopts a weighted mixed binary representation that converts the regression prediction problem into a weighted combination of several binary classification problems, thereby reducing the computational complexity of handling numerical real-valued data. Experimental results show that the emotion salience features containing time-sequence information extracted by the bidirectional convolutional recurrent sparse network achieve better prediction performance than the best features proposed in MediaEval 2015; compared with the commonly used music emotion recognition network structures and the current best methods, the proposed model reduces training time and improves prediction accuracy. The method therefore effectively addresses both the accuracy and the efficiency of music emotion recognition and is superior to existing recognition methods.

Description

Music emotion recognition method based on bidirectional convolution cyclic sparse network
Technical Field
The invention belongs to the fields of machine learning and affective computing, and in particular relates to a music emotion recognition method based on a bidirectional convolutional recurrent sparse network.
Background
With the development of multimedia technology, the explosive growth of digital music from different media has led to increasing interest in fast and efficient approaches to music query and retrieval. Because music conveys emotion-related information, and emotion-based music information retrieval offers high generality and user satisfaction, retrieving music by recognizing the emotion of the music audio signal has become an important research direction; its core difficulty is how to further improve the accuracy and efficiency of music emotion recognition.
The goal of music emotion recognition is to learn the perceived emotional state of music by extracting and analyzing features such as tempo, timbre, and intensity. A large number of recognition studies based on Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) have shown clear advantages: a CNN can adaptively learn high-level invariant features from the original audio data, eliminating the dependence of feature extraction on human subjectivity or experience, while an RNN can handle the time-sequence dependence of musical information. The music emotion recognition method based on the bidirectional convolutional recurrent sparse network adopted here combines the CNN's adaptive learning of high-level invariant features with the RNN's ability to learn temporal relationships among features, and is used to predict Arousal and Valence emotion values, thereby improving the accuracy of music emotion recognition.
Disclosure of Invention
The invention aims to improve the accuracy and efficiency of music emotion recognition, and provides a music emotion recognition method based on a bidirectional convolutional recurrent sparse network.
To achieve this purpose, the invention adopts the following technical scheme:
a music emotion recognition method based on a bidirectional convolution cyclic sparse network comprises the steps of firstly converting an audio signal into a time-frequency graph; secondly, establishing an audio time sequence model by adopting a mode of internal fusion of a convolutional neural network and a cyclic neural network to learn the emotional significance characteristics containing time sequence information, namely SII-ASF, and simultaneously converting a regression problem into a plurality of binary classification problems by combining a weighted mixed binary representation method to reduce the calculation complexity; and finally, carrying out continuous emotion recognition on the music.
The invention is further improved in that the method specifically comprises the following steps:
1) time-frequency graph conversion of the audio signal: comprising time-frequency graph conversion of the audio file and dimensionality reduction of the time-frequency graph, specifically as follows,
1-1) time-frequency graph conversion of the audio file: dividing each time-domain audio file into non-overlapping segments of fixed duration, and converting each segment into a time-frequency graph using a sliding window of fixed frame length and step size;
1-2) dimensionality reduction of the time-frequency graph: using PCA whitening with 99% of the data variance retained to reduce the dimensionality of the frequency axis of the time-frequency graph;
2) establishing an audio time-sequence model to learn the emotion salience features containing time-sequence information: combining the CNN's adaptive feature learning with the RNN's ability to process time-series data to construct a bidirectional convolutional recurrent sparse network, BCRSN for short; the connections between the model's input layer and hidden layer follow the CNN scheme of local connectivity and weight sharing, and several convolution kernels are used to obtain the bidirectional convolutional recurrent feature map group, namely the BCRFMs; the long-term dependence among features within the BCRFMs is captured by replacing each neuron in the BCRFMs with a long short-term memory (LSTM) module;
3) converting the regression problem into binary classification problems: comprising the representation of binary values and sparse processing, with the following steps,
3-1) representation of binary values: converting the regression problem into a weighted combination of several binary classification problems by representing numerical real-valued data with a weighted mixed binary representation, so as to reduce the computational complexity of the model;
3-2) sparse processing: using the concordance correlation coefficient (CCC) as the loss function and adding a penalty term to the CCC to form the model's objective function, making the BCRFMs as sparse as possible and obtaining the SII-ASF;
4) continuous emotion recognition of music: based on the binary classification results, first performing emotion recognition on the audio content of a single segment, and then performing continuous emotion recognition over the multiple audio segments of the complete music file.
A further improvement of the invention is that step 1-1) operates as follows: each time-domain audio file is divided into non-overlapping segments of 500 ms duration, and each audio segment is converted into a time-frequency graph using a sliding window with a 60 ms frame length and a 10 ms step size.
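By way of illustration, the following sketch performs this conversion for one audio file: 500 ms non-overlapping segments, a 60 ms analysis frame and a 10 ms step, which yields (500 - 60)/10 + 1 = 45 frames per segment. The use of librosa, the 44.1 kHz sample rate and the magnitude STFT are assumptions made for the example; the patent does not prescribe a particular library or transform.

```python
# Sketch of step 1-1): split an audio file into 500 ms segments and convert each
# segment into a time-frequency graph with a 60 ms window and a 10 ms hop.
# librosa and the 44.1 kHz sample rate are illustrative assumptions.
import numpy as np
import librosa

def audio_to_spectrograms(path, sr=44100, seg_ms=500, frame_ms=60, hop_ms=10):
    y, sr = librosa.load(path, sr=sr, mono=True)
    seg_len = int(sr * seg_ms / 1000)       # samples per 500 ms segment
    n_fft = int(sr * frame_ms / 1000)       # 60 ms analysis window
    hop = int(sr * hop_ms / 1000)           # 10 ms step between frames
    specs = []
    for start in range(0, len(y) - seg_len + 1, seg_len):   # non-overlapping segments
        seg = y[start:start + seg_len]
        # center=False gives exactly (500 - 60) / 10 + 1 = 45 frames per segment
        S = np.abs(librosa.stft(seg, n_fft=n_fft, hop_length=hop, center=False))
        specs.append(S)                     # shape: (n_fft // 2 + 1 freq bins, 45 frames)
    return specs
```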
A further improvement of the invention is that step 1-2) operates as follows: PCA whitening is performed with 99% of the data variance retained, reducing the frequency axis of the time-frequency graph to 45 dimensions and yielding a 45 × 45 time-frequency graph as the input to the BCRSN model.
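A minimal sketch of the whitening step follows, assuming scikit-learn and assuming the projection is fitted on frames pooled from the training set; the patent itself only specifies 99% variance retention and a 45-dimensional frequency axis.

```python
# Sketch of step 1-2): PCA-whiten the frequency axis so that each 45-frame segment
# becomes a 45 x 45 input for the BCRSN model. scikit-learn is an illustrative choice.
import numpy as np
from sklearn.decomposition import PCA

def fit_frequency_pca(train_spectrograms, n_components=45):
    # rows = individual time frames from all training segments, columns = frequency bins
    frames = np.concatenate([S.T for S in train_spectrograms], axis=0)
    return PCA(n_components=n_components, whiten=True).fit(frames)

def reduce_segment(pca, S):
    # S: (freq_bins, 45 frames)  ->  (45 frames, 45 whitened components)
    return pca.transform(S.T)
```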
A further improvement of the invention is that step 2) operates as follows: the time-frequency graph is convolved over the time domain with 64 convolution kernels of size 3 × 1 and stride 2 to obtain the BCRFMs; the neurons in the BCRFMs are connected by bidirectional recurrence following the time order of the audio frames, and the input of a neuron at a given frame is the weighted sum of the corresponding convolution result and the neuron outputs of the previous/next frame; meanwhile, each neuron in the BCRFMs is replaced by an LSTM module, whose input, output and forget gates allow information of segments of arbitrary duration to be memorized; finally, a 3 × 1 downsampling operation reduces the size of the feature maps and enhances the robustness of the model.
A further improvement of the invention is that the learning of BCRFMs in step 2) comprises the steps of:
(i) the connections between the input layer of the BCRSN model and the forward/backward convolutional recurrent layers are mediated by convolution kernels; the forward and backward convolutional recurrent layers have the same number and arrangement of neurons as a CNN convolutional layer, so that the model retains the ability to adaptively learn invariant features, and the convolution result of each neuron is calculated by formula (1):

C_{nt,k} = W_k ⊗ X_{n,t}    (1)

where C_{nt,k} is the convolution result of the neuron at position (n, t) of the k-th feature map, n = 1, 2, ..., (N-1)/2, t = 1, 2, ..., T; X_{n,t} is the two-dimensional feature matrix at the corresponding position (n, t) of the input layer, and W_k is the weight parameter of the k-th convolution kernel;
(ii) the neurons in the BCRFMs are connected by bidirectional recurrence following the time order of the audio frames, and the input of a neuron at a given frame is the weighted sum of the corresponding convolution result and the neuron outputs of the previous/next frame;
for the feature maps of the forward convolutional recurrent layer, the input of each neuron is given by formula (2):

FI_{nt,k} = C_{nt,k} + U_k^F · FO_{(t-1),k}    (2)

and the output by formula (3):

FO_{nt,k} = σ(FI_{nt,k} + b_{nt,k})    (3)

for the feature maps of the backward convolutional recurrent layer, the input of each neuron is given by formula (4):

BI_{nt,k} = C_{nt,k} + U_k^B · BO_{(t+1),k}    (4)

and the output by formula (5):

BO_{nt,k} = σ(BI_{nt,k} + b_{nt,k})    (5)

where FO_{(t-1),k} / BO_{(t+1),k} denote the outputs of all neurons of the previous/next frame t-1 / t+1 of the k-th feature map; U_k^F and U_k^B denote the connection matrices of the neurons in the forward and backward propagation processes, respectively, with weights shared among all audio frames; and b_{nt,k} is the network bias;
(iii) each neuron in the BCRFMs is replaced by an LSTM module, whose input, output and forget gates allow information of segments of arbitrary duration to be memorized; between the forward/backward convolutional recurrent layers and the forward/backward pooling layers, a downsampling operation is applied along the frequency axis, representing each 3 × 1 downsampling region by the maximum feature within it and thereby reducing the size of the feature maps.
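The sketch below illustrates steps (i) and (ii): per-frame 3 × 1 convolutions (formula (1)) feed a forward and a backward recurrence over the audio frames (formulas (2) to (5)). For brevity it uses plain sigmoid units instead of the LSTM modules of step (iii) and omits the 3 × 1 max-pooling; the PyTorch framework, the tensor shapes and the default sizes are illustrative assumptions rather than the patent's exact implementation.

```python
# Sketch of the bidirectional convolutional recurrent feature maps (BCRFMs):
# C (eq. 1) -> forward recurrence FO (eqs. 2-3) and backward recurrence BO (eqs. 4-5).
# Plain sigmoid units stand in for the LSTM modules used in the patent.
import torch

class BidirConvRecurrentMaps(torch.nn.Module):
    def __init__(self, n_kernels=64, freq_out=22):
        super().__init__()
        # 64 kernels of size 3 x 1, stride 2 along the frequency axis (eq. 1);
        # a 45-bin frequency axis gives (45 - 3) // 2 + 1 = 22 output bins
        self.conv = torch.nn.Conv2d(1, n_kernels, kernel_size=(3, 1), stride=(2, 1))
        # frame-to-frame connection matrices U^F / U^B, shared over all frames (eqs. 2, 4)
        self.U_f = torch.nn.Linear(freq_out, freq_out, bias=False)
        self.U_b = torch.nn.Linear(freq_out, freq_out, bias=False)
        self.bias = torch.nn.Parameter(torch.zeros(freq_out))

    def forward(self, x):                    # x: (batch, 1, freq=45, frames=45)
        C = self.conv(x)                     # (batch, 64, 22, 45)
        B, K, N, T = C.shape
        fo, prev = [], torch.zeros(B, K, N)
        for t in range(T):                   # forward recurrence over frames (eqs. 2-3)
            prev = torch.sigmoid(C[..., t] + self.U_f(prev) + self.bias)
            fo.append(prev)
        bo, nxt = [], torch.zeros(B, K, N)
        for t in reversed(range(T)):         # backward recurrence over frames (eqs. 4-5)
            nxt = torch.sigmoid(C[..., t] + self.U_b(nxt) + self.bias)
            bo.append(nxt)
        FO = torch.stack(fo, dim=-1)
        BO = torch.stack(list(reversed(bo)), dim=-1)
        return FO, BO                        # the bidirectional feature map group
```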
A further improvement of the invention is that step 3-1) operates as follows: the output layer of the BCRSN model is given L+1 neurons, and the resulting prediction sequence is denoted by O; O_1 predicts the sign of the true value, and O_2 ~ O_{L+1} predict the absolute value of the true value, whose range is (0, 1); each neuron acts as a binary classifier, reducing the computational complexity of the loss function to O((L+1) × 1²) = O(L+1) and making the model converge faster.
The invention is further improved in that step 3-1) adopts a weighted mixed binary representation, which comprises the following steps:
(i) the new weighted mixed binary representation converts the numerical real-valued data g into a mixed binary vector O* to reduce computational complexity; each bit O*_i of the vector is calculated by formula (6):

O*_1 = 1 (g_1 ≥ 0) or 0 (g_1 < 0);   O*_{i+1} = the i-th binary-fraction bit of |g_1|, i = 1, 2, ..., L    (6)

where g_1 = g and the first bit O*_1 is determined by the sign of g_1: O*_1 = 1 when g_1 ≥ 0, and O*_1 = 0 when g_1 < 0; the remaining bits give the binary-fraction expansion of the absolute value over (0, 1);
(ii) each output-layer neuron O_i is assigned a contribution weight λ_i to the model loss function, which controls the convergence direction of the loss and improves the prediction accuracy; the segment loss is calculated by the following formula:

δ(O, O*) = Σ_{i=1}^{L+1} λ_i · δ(O_i, O*_i)

where δ(·) denotes the loss-function calculation and λ_i denotes the contribution of O_i to the segment loss function.
A further improvement of the invention is that step 3-2) operates as follows: the CCC is used as the loss function, and a Lasso penalty term on the BCRFMs weights is added to the CCC to form the model's objective function, making the BCRFMs as sparse as possible and obtaining the SII-ASF.
The invention is further improved in that step 3-2) uses the CCC as the loss function so that the network is trained more discriminatively; specifically, each song is divided into segments of fixed duration and the ground-truth value of each segment is converted into a mixed binary vector O*; the loss function is computed in the following steps:
(i) the CCC between the predicted sequence O and the true sequence O* of each segment is calculated; the CCC between the predicted sequence f_s of a sequence sample s and its target sequence y_s is defined as:

CCC_s = 2 Q_s / (S_s^f + S_s^y + (μ_s^f - μ_s^y)²)    (7)

where S_s^f and S_s^y denote the variances of the predicted and target sequences, Q_s denotes their covariance, μ_s^f and μ_s^y their means, t denotes the time index of each label value, and N_s denotes the length of the sequence s; on this basis, the number of digits L+1 of the mixed binary vector is taken as the sequence length of each segment, the contribution weight of each digit to the model loss function is taken into account, and formula (7) is rewritten to obtain the CCC between the predicted sequence O and the true sequence O* of each segment:
CCC_s = 2 Q_s^λ / (S_s^O + S_s^{O*} + (μ_s^O - μ_s^{O*})²)    (8)

where O* and O denote the true and predicted mixed binary vectors of the segment, respectively, λ = (λ_1, λ_2, ..., λ_{L+1}) denotes the set of contribution parameters of O to the segment loss function, and the means, variances and covariance in formula (8) are computed over the L+1 digits with each digit weighted by its λ_i; the CCC solution of the regression prediction problem is thus translated into a weighted sum of multiple binary-classification accuracies.
(ii) the average CCC of each song is calculated from the CCC of each of its segments and the number of segments:

CCC_avg = (1 / N_s) · Σ_{s=1}^{N_s} CCC_s

where N_s here denotes the length of each song, i.e., the number of segments;
setting coefficients of some neurons to be 0 by using Lasso regression to delete repeated related variables and a plurality of noise features, and selecting the SII-ASF with stronger emotional significance; in particular, in the loss function
Figure BDA0002085343230000065
On the basis, adding a Lasso penalty term of the BCRFMs weight as a final objective function:
Figure BDA0002085343230000066
in the formula, betaFA set of parameters representing the BCRFMs,
Figure BDA0002085343230000067
in a similar manner, the first and second substrates are,
Figure BDA0002085343230000068
αFand alphaBThe method is a hyper-parameter used for controlling the sparsity of a characteristic diagram, and the larger the alpha value is, the higher the sparsity is; and minimizing L to remove noise features, selecting the emotion significance features, and improving the prediction accuracy.
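A sketch of the resulting objective under the reading above: one minus the λ-weighted CCC between the predicted and true bit vectors of a segment, plus L1 (Lasso) penalties on the forward and backward BCRFMs parameters. The function names, the way λ is folded into the CCC statistics, and the default α values are assumptions made for the example.

```python
# Sketch of the step 3-2) objective: maximise the lambda-weighted CCC (minimise 1 - CCC)
# while applying L1 penalties to the forward/backward feature-map parameters.
import torch

def weighted_ccc(o_pred, o_true, lam, eps=1e-8):
    # o_pred, o_true, lam: 1-D tensors of length L+1; lam is normalised to sum to 1
    w = lam / lam.sum()
    mu_p, mu_t = (w * o_pred).sum(), (w * o_true).sum()
    var_p = (w * (o_pred - mu_p) ** 2).sum()
    var_t = (w * (o_true - mu_t) ** 2).sum()
    cov = (w * (o_pred - mu_p) * (o_true - mu_t)).sum()
    return 2.0 * cov / (var_p + var_t + (mu_p - mu_t) ** 2 + eps)

def objective(o_pred, o_true, lam, beta_f, beta_b, alpha_f=1e-4, alpha_b=1e-4):
    loss = 1.0 - weighted_ccc(o_pred, o_true, lam)      # CCC-based loss for one segment
    loss = loss + alpha_f * beta_f.abs().sum()          # Lasso penalty on forward BCRFMs
    loss = loss + alpha_b * beta_b.abs().sum()          # Lasso penalty on backward BCRFMs
    return loss
```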
The invention has the following beneficial technical effects:
the music emotion recognition method based on the bidirectional convolution cyclic sparse network comprises the steps of firstly converting audio signals into a time-frequency diagram, secondly establishing an audio time sequence model by adopting a mode of internal fusion of CNN and RNN to learn SII-ASF, meanwhile, converting a regression problem into a plurality of binary classification problems by combining a weighted mixed binary representation method to reduce the calculation complexity, and finally carrying out continuous emotion recognition on music. Compared with the current common music emotion recognition network structure and the optimal method, the BCRSN model can obviously reduce training time and improve prediction precision, and the extracted SII-ASF features show better prediction performance compared with the optimal features proposed by the participants in MediaEval 2015.
Drawings
FIG. 1 is a flow chart of the BCRSN system of the present invention;
FIG. 2 is a diagram illustrating the conversion process from numeric real data to hybrid binary vectors in the present invention;
FIG. 3 is a comparison graph of prediction performance and training time of the BCRSN model and CNN-based, BLSTM-based and stacked CNN-BLSTM-based models on DEAM and MTurk music emotion recognition data sets in the present invention.
Detailed Description
The present invention is described in further detail below with reference to the attached drawings.
Referring to fig. 1, in the music emotion recognition method based on the bidirectional convolutional recurrent sparse network provided by the invention, the audio signal is first converted into a time-frequency graph; second, an audio time-sequence model is established through internal fusion of a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) to learn the emotion salience features containing time-sequence information (SII-ASF for short), while the regression problem is converted into several binary classification problems via a weighted mixed binary representation to reduce computational complexity; finally, continuous emotion recognition is performed on the music, specifically in the following steps:
1) time-frequency graph conversion of the audio signal: comprising time-frequency graph conversion of the audio file and dimensionality reduction of the time-frequency graph, specifically as follows,
Step 1, time-frequency graph conversion of the audio file: dividing each time-domain audio file into non-overlapping segments of fixed duration, and converting each segment into a time-frequency graph using a sliding window of fixed frame length and step size;
Step 2, dimensionality reduction of the time-frequency graph: using PCA whitening with a set degree of data variance retention to reduce the dimensionality of the frequency axis of the time-frequency graph.
2) Establishing an audio time-sequence model to learn the emotion salience features containing time-sequence information: the CNN's adaptive feature learning is combined with the RNN's ability to process time-series data to construct the bidirectional convolutional recurrent sparse network (BCRSN for short). Referring to fig. 1, for the input two-dimensional time-frequency graph, the connections between the input layer and the forward/backward convolutional recurrent layers within each frame t_i follow the CNN scheme of local connectivity and weight sharing, and bidirectional recurrent connections U^F and U^B between audio frames transmit the time-sequence information needed to learn the BCRFMs; meanwhile, LSTM modules replace each neuron in the BCRFMs, capturing the long-term dependence among features within the BCRFMs.
3) Converting the regression problem into binary classification problems: comprising the weighted binary-value representation and sparse processing; referring to fig. 1 and 2, the detailed steps are as follows,
Step 1, weighted binary-value representation: the regression problem is converted into a weighted combination of several binary classification problems by representing numerical real-valued data with the weighted mixed binary representation, so as to reduce computational complexity;
Step 2, sparse processing: the CCC is used as the loss function, and a Lasso penalty term (L1 regularization) on the BCRFMs weights is added to the CCC to form the model's objective function, making the BCRFMs as sparse as possible and obtaining the SII-ASF.
4) Continuous emotion recognition of music: the audio time-frequency graphs are input into the BCRSN model; emotion recognition is first performed on the audio content of a single segment from its multiple binary classification results, and continuous emotion recognition is then performed over the multiple audio segments of the complete music file.
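A short sketch of this step, under the same assumptions as the earlier examples: the trained model is applied segment by segment, each segment's bit vector is decoded back to a real value, and the per-segment values form the continuous emotion curve of the song. Here model and decode are illustrative placeholders rather than the patent's exact interfaces.

```python
# Sketch of step 4): per-segment prediction, then concatenation into a continuous curve.
def continuous_recognition(model, segments, decode):
    curve = []
    for seg in segments:               # each seg: one 45 x 45 time-frequency input
        o_pred = model(seg)            # L+1 binary-classifier outputs for this segment
        curve.append(decode(o_pred))   # back to a real arousal/valence value
    return curve                       # one value per 500 ms segment of the song
```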
Referring to fig. 3, on the DEAM and MTurk data sets, the BCRSN model of the invention achieves the best continuous emotion prediction performance in both the Valence and Arousal dimensions compared with the CNN-based, BLSTM-based and stacked CNN-BLSTM-based models.
Referring to Table 1, compared with the best algorithms of MediaEval 2015, the BCRSN model of the invention adaptively learns effective features for the prediction target from the original audio signal with minimal prior knowledge, and outperforms the three best-performing methods in MediaEval 2015 (BLSTM-RNN, BLSTM-ELM and deep LSTM-RNN).
Table 1: comparison of the BCRSN model of the invention, taking the original audio signal as input, with the three best-performing methods in MediaEval 2015 (BLSTM-RNN, BLSTM-ELM and deep LSTM-RNN).
Note: n.s. (not significant) indicates that the performance of the method is not significantly different from that of the BCRSN model; otherwise the difference is significant.
Referring to Table 2, the SII-ASF and SII-NASF features obtained by the BCRSN model with and without the Lasso penalty both show good prediction performance compared with the feature sets proposed by the participants in MediaEval 2015 (JUNLP, PKUAIPL, HKPolyU, THU-HCSIL and IRIT-SAMOVA).
Table 2: performance comparison of the SII-ASF and SII-NASF features extracted in the invention with the features proposed by the participants in MediaEval 2015 (JUNLP, PKUAIPL, HKPolyU, THU-HCSIL and IRIT-SAMOVA).
Note: n.s. (not significant) indicates that the performance of the feature is not significantly different from that of SII-ASF; otherwise the difference is significant.

Claims (1)

1. A music emotion recognition method based on a bidirectional convolutional recurrent sparse network, characterized in that an audio signal is first converted into a time-frequency graph; secondly, an audio time-sequence model is established through internal fusion of a convolutional neural network and a recurrent neural network to learn the emotion salience features containing time-sequence information, namely the SII-ASF, while the regression problem is converted into several binary classification problems by means of a weighted mixed binary representation to reduce computational complexity; finally, continuous emotion recognition is performed on the music; the method specifically comprises the following steps:
1) time-frequency graph conversion of the audio signal: comprising time-frequency graph conversion of the audio file and dimensionality reduction of the time-frequency graph, specifically as follows,
1-1) time-frequency graph conversion of the audio file: dividing each time-domain audio file into non-overlapping segments of fixed duration, and converting each segment into a time-frequency graph using a sliding window of fixed frame length and step size; the specific operation is as follows: each time-domain audio file is divided into non-overlapping segments of 500 ms duration, and each audio segment is converted into a time-frequency graph using a sliding window with a 60 ms frame length and a 10 ms step size;
1-2) dimensionality reduction of the time-frequency graph: using PCA whitening with 99% of the data variance retained to reduce the dimensionality of the frequency axis of the time-frequency graph; the specific operation is as follows: PCA whitening is performed with 99% of the data variance retained, reducing the frequency axis of the time-frequency graph to 45 dimensions and yielding a 45 × 45 time-frequency graph as the input to the BCRSN model;
2) establishing an audio time-sequence model to learn the emotion salience features containing time-sequence information: combining the CNN's adaptive feature learning with the RNN's ability to process time-series data to construct a bidirectional convolutional recurrent sparse network, BCRSN for short; the connections between the model's input layer and hidden layer follow the CNN scheme of local connectivity and weight sharing, and several convolution kernels are used to obtain the bidirectional convolutional recurrent feature map group, namely the BCRFMs; the long-term dependence among features within the BCRFMs is captured by replacing each neuron in the BCRFMs with a long short-term memory (LSTM) module; the specific operation is as follows: the time-frequency graph is convolved over the time domain with 64 convolution kernels of size 3 × 1 and stride 2 to obtain the BCRFMs; the neurons in the BCRFMs are connected by bidirectional recurrence following the time order of the audio frames, and the input of a neuron at a given frame is the weighted sum of the corresponding convolution result and the neuron outputs of the previous/next frame; meanwhile, each neuron in the BCRFMs is replaced by an LSTM module, whose input, output and forget gates allow information of segments of arbitrary duration to be memorized; finally, a 3 × 1 downsampling operation reduces the size of the feature maps and enhances the robustness of the model;
the learning of the BCRFMs comprises the following steps:
(i) the connections between the input layer of the BCRSN model and the forward/backward convolutional recurrent layers are mediated by convolution kernels; the forward and backward convolutional recurrent layers have the same number and arrangement of neurons as a CNN convolutional layer, so that the model retains the ability to adaptively learn invariant features, and the convolution result of each neuron is calculated by formula (1):

C_{nt,k} = W_k ⊗ X_{n,t}    (1)

where C_{nt,k} is the convolution result of the neuron at position (n, t) of the k-th feature map, n = 1, 2, ..., (N-1)/2, t = 1, 2, ..., T; X_{n,t} is the two-dimensional feature matrix at the corresponding position (n, t) of the input layer, and W_k is the weight parameter of the k-th convolution kernel;
(ii) the neurons in the BCRFMs are connected by bidirectional recurrence following the time order of the audio frames, and the input of a neuron at a given frame is the weighted sum of the corresponding convolution result and the neuron outputs of the previous/next frame;
for the feature maps of the forward convolutional recurrent layer, the input of each neuron is given by formula (2):

FI_{nt,k} = C_{nt,k} + U_k^F · FO_{(t-1),k}    (2)

and the output by formula (3):

FO_{nt,k} = σ(FI_{nt,k} + b_{nt,k})    (3)

for the feature maps of the backward convolutional recurrent layer, the input of each neuron is given by formula (4):

BI_{nt,k} = C_{nt,k} + U_k^B · BO_{(t+1),k}    (4)

and the output by formula (5):

BO_{nt,k} = σ(BI_{nt,k} + b_{nt,k})    (5)

where FO_{(t-1),k} / BO_{(t+1),k} denote the outputs of all neurons of the previous/next frame t-1 / t+1 of the k-th feature map; U_k^F and U_k^B denote the connection matrices of the neurons in the forward and backward propagation processes, respectively, with weights shared among all audio frames; and b_{nt,k} is the network bias;
(iii) each neuron in the BCRFMs is replaced by an LSTM module, whose input, output and forget gates allow information of segments of arbitrary duration to be memorized; between the forward/backward convolutional recurrent layers and the forward/backward pooling layers, a downsampling operation is applied along the frequency axis, representing each 3 × 1 downsampling region by the maximum feature within it and thereby reducing the size of the feature maps;
3) converting the regression problem into binary classification problems: comprising the representation of binary values and sparse processing, with the following steps,
3-1) representation of binary values: converting the regression problem into a weighted combination of several binary classification problems by representing numerical real-valued data with a weighted mixed binary representation, so as to reduce the computational complexity of the model; the specific operation is as follows: the output layer of the BCRSN model is given L+1 neurons, and the resulting prediction sequence is denoted by O; O_1 predicts the sign of the true value, and O_2 ~ O_{L+1} predict the absolute value of the true value, whose range is (0, 1); each neuron acts as a binary classifier, reducing the computational complexity of the loss function to O((L+1) × 1²) = O(L+1) and making the model converge faster;
the weighted mixed binary representation comprises the following steps:
(i) the new weighted mixed binary representation converts the numerical real-valued data g into a mixed binary vector O* to reduce computational complexity; each bit O*_i of the vector is calculated by formula (6):

O*_1 = 1 (g_1 ≥ 0) or 0 (g_1 < 0);   O*_{i+1} = the i-th binary-fraction bit of |g_1|, i = 1, 2, ..., L    (6)

where g_1 = g and the first bit O*_1 is determined by the sign of g_1: O*_1 = 1 when g_1 ≥ 0, and O*_1 = 0 when g_1 < 0; the remaining bits give the binary-fraction expansion of the absolute value over (0, 1);
(ii) each output-layer neuron O_i is assigned a contribution weight λ_i to the model loss function, which controls the convergence direction of the loss and improves the prediction accuracy; the segment loss is calculated by the following formula:

δ(O, O*) = Σ_{i=1}^{L+1} λ_i · δ(O_i, O*_i)

where δ(·) denotes the loss-function calculation and λ_i denotes the contribution of O_i to the loss function;
3-2) sparse processing: using the concordance correlation coefficient (CCC) as the loss function and adding a penalty term to the CCC to form the model's objective function, making the BCRFMs as sparse as possible and obtaining the SII-ASF; the specific operation is as follows: the CCC is used as the loss function, and a Lasso penalty term on the BCRFMs weights is added to the CCC to form the model's objective function, making the BCRFMs as sparse as possible and obtaining the SII-ASF;
the CCC is used as the loss function so that the network is trained more discriminatively; specifically, each song is divided into segments of fixed duration and the ground-truth value of each segment is converted into a mixed binary vector O*; the loss function is computed in the following steps:
(i) the CCC between the predicted sequence O and the true sequence O* of each segment is calculated; the CCC between the predicted sequence f_s of a sequence sample s and its target sequence y_s is defined as:

CCC_s = 2 Q_s / (S_s^f + S_s^y + (μ_s^f - μ_s^y)²)    (7)

where S_s^f and S_s^y denote the variances of the predicted and target sequences, Q_s denotes their covariance, μ_s^f and μ_s^y their means, t denotes the time index of each label value, and N_s denotes the length of the sequence s; on this basis, the number of digits L+1 of the mixed binary vector is taken as the sequence length of each segment, the contribution weight of each digit to the model loss function is taken into account, and formula (7) is rewritten to obtain the CCC between the predicted sequence O and the true sequence O* of each segment:
CCC_s = 2 Q_s^λ / (S_s^O + S_s^{O*} + (μ_s^O - μ_s^{O*})²)    (8)

where O* and O denote the true and predicted mixed binary vectors of the segment, respectively, λ = (λ_1, λ_2, ..., λ_{L+1}) denotes the set of contribution parameters of O to the segment loss function, and the means, variances and covariance in formula (8) are computed over the L+1 digits with each digit weighted by its λ_i; the CCC solution of the regression prediction problem is thus translated into a weighted sum of multiple binary-classification accuracies;
(ii) the average CCC of each song is calculated from the CCC of each of its segments and the number of segments:

CCC_avg = (1 / N_s) · Σ_{s=1}^{N_s} CCC_s

where N_s here denotes the length of each song, i.e., the number of segments;
Lasso regression is used to set the coefficients of some neurons to 0, deleting repeated, correlated variables and many noise features and selecting the SII-ASF with stronger emotional salience; specifically, a Lasso penalty term on the BCRFMs weights is added to the CCC-based loss function to form the final objective function:

L = L_CCC + α_F Σ |β_F| + α_B Σ |β_B|

where L_CCC denotes the CCC-based loss, β_F denotes the parameter set of the forward BCRFMs and, in a similar manner, β_B that of the backward BCRFMs; α_F and α_B are hyper-parameters controlling the sparsity of the feature maps, with larger α values giving higher sparsity; minimizing L removes the noise features, selects the emotion salience features, and improves the prediction accuracy;
4) continuous emotion recognition of music: based on the binary classification results, emotion recognition is first performed on the audio content of a single segment, and continuous emotion recognition is then performed over the multiple audio segments of the complete music file.
CN201910485792.9A 2019-06-05 2019-06-05 Music emotion recognition method based on bidirectional convolution cyclic sparse network Active CN110223712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910485792.9A CN110223712B (en) 2019-06-05 2019-06-05 Music emotion recognition method based on bidirectional convolution cyclic sparse network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910485792.9A CN110223712B (en) 2019-06-05 2019-06-05 Music emotion recognition method based on bidirectional convolution cyclic sparse network

Publications (2)

Publication Number Publication Date
CN110223712A CN110223712A (en) 2019-09-10
CN110223712B true CN110223712B (en) 2021-04-20

Family

ID=67819412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910485792.9A Active CN110223712B (en) 2019-06-05 2019-06-05 Music emotion recognition method based on bidirectional convolution cyclic sparse network

Country Status (1)

Country Link
CN (1) CN110223712B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689902B (en) * 2019-12-11 2020-07-14 北京影谱科技股份有限公司 Audio signal time sequence processing method, device and system based on neural network and computer readable storage medium
CN111326164B (en) * 2020-01-21 2023-03-21 大连海事大学 Semi-supervised music theme extraction method
CN113268628B (en) * 2021-04-14 2023-05-23 上海大学 Music emotion recognition method based on modularized weighted fusion neural network
CN115294644A (en) * 2022-06-24 2022-11-04 北京昭衍新药研究中心股份有限公司 Rapid monkey behavior identification method based on 3D convolution parameter reconstruction

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105469065A (en) * 2015-12-07 2016-04-06 中国科学院自动化研究所 Recurrent neural network-based discrete emotion recognition method
CN106128479A (en) * 2016-06-30 2016-11-16 福建星网视易信息系统有限公司 A kind of performance emotion identification method and device
CN106228977A (en) * 2016-08-02 2016-12-14 合肥工业大学 The song emotion identification method of multi-modal fusion based on degree of depth study
US9570091B2 (en) * 2012-12-13 2017-02-14 National Chiao Tung University Music playing system and music playing method based on speech emotion recognition
WO2017122798A1 (en) * 2016-01-14 2017-07-20 国立研究開発法人産業技術総合研究所 Target value estimation system, target value estimation method, and target value estimation program
CN107169409A (en) * 2017-03-31 2017-09-15 北京奇艺世纪科技有限公司 A kind of emotion identification method and device
CN107506722A (en) * 2017-08-18 2017-12-22 中国地质大学(武汉) One kind is based on depth sparse convolution neutral net face emotion identification method
US20180075343A1 (en) * 2016-09-06 2018-03-15 Google Inc. Processing sequences using convolutional neural networks
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN108805089A (en) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 Based on multi-modal Emotion identification method
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
CN109147826A (en) * 2018-08-22 2019-01-04 平安科技(深圳)有限公司 Music emotion recognition method, device, computer equipment and computer storage medium
CN109508375A (en) * 2018-11-19 2019-03-22 重庆邮电大学 A kind of social affective classification method based on multi-modal fusion
CN109599128A (en) * 2018-12-24 2019-04-09 北京达佳互联信息技术有限公司 Speech-emotion recognition method, device, electronic equipment and readable medium

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9570091B2 (en) * 2012-12-13 2017-02-14 National Chiao Tung University Music playing system and music playing method based on speech emotion recognition
CN105469065B (en) * 2015-12-07 2019-04-23 中国科学院自动化研究所 A kind of discrete emotion identification method based on recurrent neural network
CN105469065A (en) * 2015-12-07 2016-04-06 中国科学院自动化研究所 Recurrent neural network-based discrete emotion recognition method
WO2017122798A1 (en) * 2016-01-14 2017-07-20 国立研究開発法人産業技術総合研究所 Target value estimation system, target value estimation method, and target value estimation program
CN106128479A (en) * 2016-06-30 2016-11-16 福建星网视易信息系统有限公司 A kind of performance emotion identification method and device
CN106228977A (en) * 2016-08-02 2016-12-14 合肥工业大学 The song emotion identification method of multi-modal fusion based on degree of depth study
US20180075343A1 (en) * 2016-09-06 2018-03-15 Google Inc. Processing sequences using convolutional neural networks
CN107169409A (en) * 2017-03-31 2017-09-15 北京奇艺世纪科技有限公司 A kind of emotion identification method and device
CN107506722A (en) * 2017-08-18 2017-12-22 中国地质大学(武汉) One kind is based on depth sparse convolution neutral net face emotion identification method
CN108805089A (en) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 Based on multi-modal Emotion identification method
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN109147826A (en) * 2018-08-22 2019-01-04 平安科技(深圳)有限公司 Music emotion recognition method, device, computer equipment and computer storage medium
CN109146066A (en) * 2018-11-01 2019-01-04 重庆邮电大学 A kind of collaborative virtual learning environment natural interactive method based on speech emotion recognition
CN109508375A (en) * 2018-11-19 2019-03-22 重庆邮电大学 A kind of social affective classification method based on multi-modal fusion
CN109599128A (en) * 2018-12-24 2019-04-09 北京达佳互联信息技术有限公司 Speech-emotion recognition method, device, electronic equipment and readable medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"LSTM for dynamic emotion and group emotion recognition in the wild";B Sun;《the 18th ACM International conference 》;20161231;全文 *
"review of data features-based music Emotion Recognition method";yang Xinyu;《multimedia system》;20180630;第24卷(第4期);全文 *
"stacked convolutional recurrent neural networks for music emotion recognition";M Malik;《arXiv:1706.02292v1》;20170607;全文 *
"基于深度学习的音乐情感识别";唐霞;《电脑知识与技术》;20190430;第15卷(第11期);全文 *
"跨库语音情感识别若干关键技术研究";张昕然;《中国博士学位论文全文数据库信息科技辑》;20171115;全文 *

Also Published As

Publication number Publication date
CN110223712A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
CN110223712B (en) Music emotion recognition method based on bidirectional convolution cyclic sparse network
CN111667884B (en) Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism
Choi et al. Convolutional recurrent neural networks for music classification
AU2020100710A4 (en) A method for sentiment analysis of film reviews based on deep learning and natural language processing
Sirat et al. Neural trees: a new tool for classification
CN110442705B (en) Abstract automatic generation method based on concept pointer network
CN110600047A (en) Perceptual STARGAN-based many-to-many speaker conversion method
CN111816156A (en) Many-to-many voice conversion method and system based on speaker style feature modeling
CN110060657B (en) SN-based many-to-many speaker conversion method
WO2020095321A2 (en) Dynamic structure neural machine for solving prediction problems with uses in machine learning
CN108876044B (en) Online content popularity prediction method based on knowledge-enhanced neural network
CN111461322A (en) Deep neural network model compression method
CN111276187B (en) Gene expression profile feature learning method based on self-encoder
Tsouvalas et al. Privacy-preserving speech emotion recognition through semi-supervised federated learning
CN113643724B (en) Kiwi emotion recognition method and system based on time-frequency double-branch characteristics
CN112766360A (en) Time sequence classification method and system based on time sequence bidimensionalization and width learning
CN117059103A (en) Acceleration method of voice recognition fine tuning task based on low-rank matrix approximation
CN116469561A (en) Breast cancer survival prediction method based on deep learning
CN110600046A (en) Many-to-many speaker conversion method based on improved STARGAN and x vectors
CN115810351A (en) Controller voice recognition method and device based on audio-visual fusion
CN114743569A (en) Speech emotion recognition method based on double-layer fusion deep network
Lin et al. Learning semantically meaningful embeddings using linear constraints
Jie et al. Regularized flexible activation function combination for deep neural networks
Pandey et al. Generative Restricted Kernel Machines.
Vadiraja et al. A Survey on Knowledge integration techniques with Artificial Neural Networks for seq-2-seq/time series models

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant