CN115064181B - Music multi-mode data emotion recognition method based on deep learning - Google Patents


Info

Publication number
CN115064181B
CN115064181B (application CN202210654145.8A)
Authority
CN
China
Prior art keywords
music
emotion
data
midi
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210654145.8A
Other languages
Chinese (zh)
Other versions
CN115064181A (en)
Inventor
Han Donghong (韩东红)
Kong Yanru (孔彦茹)
Li Jiahao (李嘉豪)
Han Jiayi (韩嘉懿)
Liu Ying (刘莹)
Original Assignee
Northeastern University (东北大学)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University
Priority to CN202210654145.8A
Publication of CN115064181A
Application granted
Publication of CN115064181B
Current legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/27: characterised by the analysis technique
    • G10L25/48: specially adapted for particular use
    • G10L25/51: for comparison or discrimination
    • G10L25/63: for estimating an emotional state
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of music multi-modal data emotion recognition, in particular to a deep-learning-based method for emotion recognition on multi-modal music data. The method mainly addresses the problems that single-modality music emotion recognition leaves limited room for improvement and that feature vectors in music data sets cannot be mined deeply, and provides the following technical scheme: S1: preprocessing music data; S2: extracting features of MIDI data; S3: extracting features of text data; S4: multi-modal fusion. By performing multi-modal fusion with the idea of decision-level fusion, the method obtains a better emotion classification effect than feature-level fusion, supports deep emotion learning on music text, promotes the application of deep learning in music emotion recognition, improves the analysis of music, reduces the workload of manual emotion labelling, and improves accuracy. The method is mainly applied to deep-learning-based emotion recognition of multi-modal music data.

Description

Music multi-mode data emotion recognition method based on deep learning
Technical Field
The invention relates to the technical field of deep learning of music emotion, in particular to a deep learning-based music multi-mode data emotion recognition method.
Background
With the continuing spread of mobile terminal devices, the online music market has developed rapidly, and people can access vast music resources through many channels. To help listeners find musical works, the major music platforms use tags such as emotion and genre to sort and organize their catalogues. Since music is a carrier of emotion, managing musical works by emotion is particularly important. However, manual emotion labelling of musical works is time-consuming, labour-intensive and error-prone, so research on automatically recognising musical emotion with artificial intelligence has practical significance.
In the field of intelligent search and recommendation, on the other hand, emotion makes it easier for users to search for music, and combined with their historical data it enables better recommendations and a better user experience. Several problems remain: low-level audio features have no direct link to the semantics and emotion of music; the scarcity of data sets limits the room researchers have for feature extraction and model design, so emotion recognition based on lyrics performs poorly; and music emotion recognition based on single-modality data such as audio or lyrics has hit a ceiling, leaving limited room for improvement. We therefore propose a music multi-modal data emotion recognition method based on deep learning.
Disclosure of Invention
The invention aims to solve the problems described in the background: single-modality music emotion recognition leaves limited room for improvement, feature vectors in music data sets cannot be mined deeply, and recognition performance is poor. To this end it provides a music multi-modal data emotion recognition method based on deep learning.
The technical scheme of the invention is as follows: the deep-learning-based music multi-modal data emotion recognition method comprises the following steps:
S1: preprocessing music data; audio data cleaning: the data set is traversed with the Python MIDI processing toolkit pretty_midi, files that raise parsing errors are judged to be invalid audio, and the invalid audio data are deleted from the data set; main track extraction: in the first step, tracks that do not meet the conditions are deleted and the remaining tracks are the candidate tracks; in the second step, six feature quantities are computed for each candidate track and summed to give the track's score, and the track with the highest score is taken as the main track; the six feature quantities used for main track extraction are: the note count feature F_nc, the sounding time feature F_nd, the average pitch feature F_p, the average intensity feature F_v, the sounding area feature F_pd and the loudness area feature F_vd; the note count refers to the number of notes in a track, denoted note-count, and is extracted from the MIDI file; pitch refers to the frequency of a note, denoted pitch, and can be extracted directly from the MIDI file; sounding time refers to the duration of a note, denoted duration, and can be extracted directly from the MIDI file; intensity refers to the strength of a note, denoted intensity, and can be extracted directly from the MIDI file; the sounding area of a note is the product of its pitch and its sounding time, and the loudness area of a note is the product of its intensity and its sounding time;
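A minimal sketch of this preprocessing step is given below, using the pretty_midi package named in S1. The min-max normalisation applied before summing the six features, the reading of the loudness area as intensity multiplied by sounding time, and the handling of the channel-10 condition through pretty_midi's is_drum flag are assumptions; the candidate-track conditions follow the preferred embodiment described later.

```python
import numpy as np
import pretty_midi

def load_valid_midi(path):
    """Data cleaning: a MIDI file that pretty_midi cannot parse is treated as invalid audio."""
    try:
        return pretty_midi.PrettyMIDI(path)
    except Exception:              # parsing error -> invalid audio, drop it from the data set
        return None

def track_features(track):
    """The six per-track feature quantities of S1: F_nc, F_nd, F_p, F_v, F_pd, F_vd."""
    pitch = np.array([n.pitch for n in track.notes], dtype=float)
    vel = np.array([n.velocity for n in track.notes], dtype=float)
    dur = np.array([n.end - n.start for n in track.notes], dtype=float)
    return np.array([
        len(track.notes),        # F_nc: note count
        dur.sum(),               # F_nd: total sounding time
        pitch.mean(),            # F_p : average pitch
        vel.mean(),              # F_v : average intensity (velocity)
        (pitch * dur).sum(),     # F_pd: sounding area = pitch x sounding time
        (vel * dur).sum(),       # F_vd: loudness area, assumed intensity x sounding time
    ])

def extract_main_track(midi):
    """Filter candidate tracks (non-drum, enough notes and sounding time), score each by the
    sum of its six features (min-max normalised here, an assumption) and return the best one."""
    tracks = [t for t in midi.instruments if t.notes and not t.is_drum]   # drop channel-10 drums
    if not tracks:
        return None
    avg_nc = np.mean([len(t.notes) for t in tracks])
    avg_nd = np.mean([sum(n.end - n.start for n in t.notes) for t in tracks])
    cands = [t for t in tracks
             if len(t.notes) >= avg_nc / 2
             and sum(n.end - n.start for n in t.notes) >= avg_nd / 2] or tracks
    feats = np.array([track_features(t) for t in cands])
    span = np.ptp(feats, axis=0)
    span[span == 0] = 1.0
    scores = ((feats - feats.min(axis=0)) / span).sum(axis=1)
    return cands[int(scores.argmax())]
```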
S2: extracting features of MIDI data; the feature extraction module of the ERMSLM model consists of three parts: melody feature extraction, tonal feature extraction and manual feature extraction;
Melody feature extraction: extracts the melody feature vector m_i from the note pitch set P_i of the i-th piece of music; tonal feature extraction: extracts the tonal feature k_i from the tonal data of the music; manual feature extraction: extracts four kinds of information from the main track of the music, namely pitch, dynamics, duration and speed, to construct the manual feature hcf_i;
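For concreteness, a sketch of the three inputs is shown below using pretty_midi: the pitch sequence P_i that feeds melody feature learning, a simple key-based tonal value, and a manual feature hcf_i built from pitch, dynamics, duration and tempo statistics. The mean/std statistics and the key encoding are assumptions, since this passage only names the information types.

```python
import numpy as np
import pretty_midi

def pitch_sequence(main_track):
    """Note pitch set P_i of the i-th piece, ordered by onset time; this is the input
    from which the melody feature vector m_i is learned."""
    return np.array([n.pitch for n in sorted(main_track.notes, key=lambda n: n.start)])

def tonal_feature(midi: pretty_midi.PrettyMIDI) -> int:
    """Tonal feature k_i: here simply the first key signature's key number
    (0-23 in pretty_midi), or -1 if the file carries no key data."""
    keys = midi.key_signature_changes
    return keys[0].key_number if keys else -1

def manual_features(midi: pretty_midi.PrettyMIDI, main_track) -> np.ndarray:
    """Manual feature hcf_i from the four information types named in S2:
    pitch, dynamics (velocity), duration and speed (tempo) of the main track."""
    pitch = np.array([n.pitch for n in main_track.notes], dtype=float)
    vel = np.array([n.velocity for n in main_track.notes], dtype=float)
    dur = np.array([n.end - n.start for n in main_track.notes], dtype=float)
    _, tempi = midi.get_tempo_changes()          # speed information of the piece
    tempo = float(np.mean(tempi)) if len(tempi) else 120.0
    return np.array([pitch.mean(), pitch.std(),
                     vel.mean(), vel.std(),
                     dur.mean(), dur.std(),
                     tempo])
```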
S3: extracting features of text data; the text-data features comprise lyric features and social label features, and lyric feature extraction consists of three parts: the first part uses a pre-trained BERT model to obtain a word vector for each word in the lyrics, computes the similarity between each word vector and the word vectors of the four category labels, and constructs the BERT emotion feature from the computed similarities; the second part builds an emotion dictionary covering the four emotions from the ANEW list and constructs the dictionary emotion feature with it; the third part computes the TF-IDF value of each lyric word for the four emotion categories and accumulates the TF-IDF values of all words for a given category as that category's value, the four category values forming the TF-IDF feature of the lyrics;
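The first (BERT) part can be sketched as follows. The model name, the four label words and the mean aggregation of per-word similarities are placeholders, since the passage only specifies a pre-trained BERT model and four category labels without naming them.

```python
import numpy as np
import torch
from transformers import BertModel, BertTokenizer

# "bert-base-uncased" and the four label words below are assumptions for illustration.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def word_vectors(text: str) -> torch.Tensor:
    """Token vectors (last hidden state) of `text` from the pre-trained BERT model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return bert(**enc).last_hidden_state[0]          # (num_tokens, 768)

def bert_emotion_feature(lyrics: str,
                         labels=("happy", "angry", "sad", "relaxed")) -> np.ndarray:
    """Similarity of every lyric word vector to the four category-label word vectors;
    averaging the per-label similarities over the lyric is an assumed aggregation."""
    lyric_vecs = torch.nn.functional.normalize(word_vectors(lyrics), dim=-1)
    label_vecs = torch.nn.functional.normalize(
        torch.stack([word_vectors(w).mean(dim=0) for w in labels]), dim=-1)
    sims = lyric_vecs @ label_vecs.T                     # cosine similarities, (num_tokens, 4)
    return sims.mean(dim=0).numpy()                      # BERT emotion feature, one value per class
```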
The social label feature extraction comprises three steps: first, data set preprocessing, which organises the raw data so that it can be fed to the social label distribution analysis algorithm to obtain the social label distribution table; second, the social label distribution analysis algorithm, which is designed on the social label set Tag and the label summary set T obtained by preprocessing, analyses the connection between social labels and musical emotion, and produces the social label distribution table; third, social label feature extraction, which extracts the social label feature fta_i = [t_1 ... t_c] of the i-th piece of music, where c is the number of emotion categories and each dimension of the feature corresponds to one emotion category;
s4: multimodal fusion; multimodal fusion is used for fusing MIDI and text data of music to carry out multimodal music emotion recognition; the multi-mode fusion comprises a feature level fusion model and a decision level fusion model;
The dimension t_j of the social label feature fta_i in S3 is calculated with a formula (given in the specification) in which n represents the number of social labels attached to the i-th piece of music and ts_k is the score of the social label ta_k in the social label distribution table; Stdt_j contains the social labels whose importance for emotion e_j ranks in the top α;
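A sketch of this feature is given below. Because the formula for t_j is only available as an image in the original text, the aggregation used here, summing the scores ts_k of the song's labels that fall inside Stdt_j and dividing by n, is an assumption consistent with the surrounding definitions.

```python
def social_label_feature(song_tags, tag_scores, stdt, num_classes=4):
    """Social label feature fta_i = [t_1 ... t_c] of one piece of music.

    song_tags  -- the n social labels ta_1..ta_n attached to the i-th piece
    tag_scores -- dict mapping a social label ta_k to its score ts_k in the
                  social label distribution table
    stdt       -- list of label sets, stdt[j] = Stdt_j (top-ranked labels for emotion e_j)
    """
    n = max(len(song_tags), 1)
    # Assumed form of t_j: sum of the scores of the song's labels inside Stdt_j, divided by n.
    return [sum(tag_scores.get(tag, 0.0) for tag in song_tags if tag in stdt[j]) / n
            for j in range(num_classes)]
```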
The feature level fusion model connects the MIDI features obtained in S2 with the text features obtained in S3 to form the fused feature fu_i, which is finally fed into an MLP and a softmax layer to obtain the emotion result y'_i; in the calculation formula of y'_i (given in the specification), h_ef is the hidden-layer output of each layer, and w_ef and b_ef are the parameters of each layer.
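A minimal PyTorch sketch of this feature-level fusion model follows; the hidden width, depth and activation are illustrative assumptions, while the concatenation, MLP and softmax structure follow the description above.

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Feature-level fusion: concatenate the MIDI feature f_i and the text feature fte_i
    into the fused feature fu_i, then pass it through an MLP and a softmax layer to
    obtain the emotion result y'_i."""
    def __init__(self, midi_dim: int, text_dim: int, hidden: int = 128, num_classes: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(                 # each layer: h_ef = ReLU(w_ef . h + b_ef)
            nn.Linear(midi_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, f_i: torch.Tensor, fte_i: torch.Tensor) -> torch.Tensor:
        fu_i = torch.cat([f_i, fte_i], dim=-1)    # fused feature fu_i
        return torch.softmax(self.mlp(fu_i), dim=-1)   # emotion result y'_i
```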
Preferably, the conditions to be satisfied by a candidate track in S1 are the following: (1) its channel number is not 10; (2) its note count is not less than half of the average note count over all tracks; (3) its total note sounding duration is not less than half of the average note sounding duration over all tracks.
Preferably, the dimension t_j of the social label feature fta_i in S3 is calculated with the formula (given in the specification) in which n represents the number of social labels attached to the i-th piece of music and ts_k is the score of the social label ta_k in the social label distribution table; Stdt_j contains the social labels whose importance for emotion e_j ranks in the top α.
Preferably, the feature level fusion model connects the MIDI features obtained in S2 with the text features obtained in S3 to form the fused feature fu_i, which is finally fed into the MLP and softmax layers to obtain the emotion result y'_i; in the calculation formula of y'_i (given in the specification), h_ef is the hidden-layer output of each layer, and w_ef and b_ef are the parameters of each layer.
Preferably, the decision-level fusion model extracts the features of the different modalities, each modality's features are used for prediction by a separate classifier, and the results of the modality classifiers are then fused to obtain the final classification result.
Preferably, the result fusion adopts linear weighted summation, and the decision-level fusion model comprises the following processing steps: step a: extract features from the MIDI and text data in the data set using the MIDI and text feature extraction methods, obtaining the MIDI feature f_i and the text feature fte_i; step b: feed the two features obtained in step a into their respective MLP and softmax layers for emotion classification training, the prediction results being y_m and y_t, where the elements y_j^m and y_j^t denote the probabilities predicted for the j-th emotion by the MIDI modality and the text modality respectively; step c: perform a weighted summation of y_m and y_t to obtain the fusion result rf_i = [rf_1, rf_2, rf_3, rf_4]; step d: pass rf_i through a softmax layer to obtain the multi-modal fusion result y'_i.
Preferably, in the calculation formulas of y_m, y_t and rf_j (given in the specification), h_lft is the hidden-layer output of each layer, w_lft and b_lft are the parameters of each layer, and a weight parameter represents the proportion of the MIDI-modality classification result in the fusion.
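A sketch of this decision-level fusion rule is shown below, assuming the text modality receives the complementary weight (1 minus the MIDI-modality weight); the default value 0.3 follows the weight experiment in the embodiment below.

```python
import torch

def decision_level_fusion(y_m: torch.Tensor, y_t: torch.Tensor,
                          midi_weight: float = 0.3) -> torch.Tensor:
    """Decision-level fusion by linear weighted summation.

    y_m, y_t    -- probability predictions of the MIDI and text classifiers (4 values each)
    midi_weight -- proportion of the MIDI-modality result; the text modality is assumed
                   to receive the complementary weight 1 - midi_weight.
    Returns the multi-modal fusion result y'_i = softmax(rf_i).
    """
    rf_i = midi_weight * y_m + (1.0 - midi_weight) * y_t   # rf_i = [rf_1, rf_2, rf_3, rf_4]
    return torch.softmax(rf_i, dim=-1)                     # y'_i
```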
Compared with the prior art, the invention has the following beneficial technical effects:
1. By performing emotion classification on the MIDI data of music with a deep learning model, the invention addresses the problem that commonly used low-level audio features were designed for other audio tasks and lack a direct connection to musical emotion, which limits the emotion recognition effect and its interpretability; this improves the accuracy of music emotion recognition.
2. The invention builds an emotion recognition model, ERMSLM, from text data and uses the text data of music to recognise musical emotion; a social label distribution analysis algorithm is proposed to analyse the relationship between social labels and emotion categories, and the analysis result is used to construct the social label features. An emotion dictionary is built from the ANEW list and used to identify the emotional tendency of the lyrics. Feature extraction is divided into lyric feature extraction and social label feature extraction.
3. The invention studies music emotion recognition by fusing multi-modal data, mainly through a feature-level fusion method and a decision-level fusion method; feature-level fusion is easy to implement, needs only one classifier training process, shortens training time and can take the correlations between modality features into account, while decision-level fusion does not need to deal with multi-modal synchronisation and is more extensible.
4. In summary, performing multi-modal fusion with the idea of decision-level fusion yields a better emotion classification effect than feature-level fusion, enables deep emotion learning on music text, promotes the application of deep learning in music emotion recognition, improves the analysis of music, reduces the workload of manual emotion labelling and improves accuracy.
Drawings
FIG. 1 is a diagram of a multi-modal MER study framework in this scenario;
FIG. 2 is a schematic diagram of a feature level fusion emotion recognition model;
FIG. 3 is a schematic diagram of a decision-level fusion emotion recognition model;
Fig. 4 is a bar graph of the comparison results of the four models in this scenario.
Detailed Description
The technical scheme of the invention is further described below with reference to the attached drawings and specific embodiments.
Examples
As shown in figs. 1-4, the deep-learning-based music multi-modal data emotion recognition method provided by the invention comprises the following steps: S1: preprocessing music data; audio data cleaning: the data set is traversed with the Python MIDI processing toolkit pretty_midi, files that raise parsing errors are judged to be invalid audio, and the invalid audio data are deleted from the data set; main track extraction: in the first step, tracks that do not meet the conditions are deleted and the remaining tracks are the candidate tracks; in the second step, six feature quantities are computed for each candidate track and summed to give the track's score, and the track with the highest score is taken as the main track;
The conditions to be met by a candidate track are the following: (1) its channel number is not 10; (2) its note count is not less than half of the average note count over all tracks; (3) its total note sounding duration is not less than half of the average note sounding duration over all tracks. The six feature quantities of the second step are: the note count feature F_nc, the sounding time feature F_nd, the average pitch feature F_p, the average intensity feature F_v, the sounding area feature F_pd and the loudness area feature F_vd; the note count refers to the number of notes in a track, denoted note-count, and is extracted from the MIDI file; pitch refers to the frequency of a note, denoted pitch, and can be extracted directly from the MIDI file; sounding time refers to the duration of a note, denoted duration, and can be extracted directly from the MIDI file; intensity refers to the strength of a note, denoted intensity, and can be extracted directly from the MIDI file; the sounding area of a note is the product of its pitch and its sounding time, and the loudness area of a note is the product of its intensity and its sounding time;
S2: extracting features of MIDI data; the feature extraction module of the ERMSLM model consists of three parts: melody feature extraction, tonal feature extraction and manual feature extraction;
The melody feature extraction extracts the melody feature vector m_i from the note pitch set P_i of the i-th piece of music; manual feature extraction: four kinds of information about the notes, namely pitch, intensity, duration and speed, are extracted from the main track of the music to construct the manual feature hcf_i.
S3: extracting features of text data; the text-data features comprise lyric features and social label features, and lyric feature extraction consists of three parts: the first part uses a pre-trained BERT model to obtain a word vector for each word in the lyrics, computes the similarity between each word vector and the word vectors of the four category labels, and constructs the BERT emotion feature from the computed similarities; the second part builds an emotion dictionary covering the four emotions from the ANEW list and constructs the dictionary emotion feature with it; the third part computes the TF-IDF value of each lyric word for the four emotion categories and accumulates the TF-IDF values of all words for a given category as that category's value, the four category values forming the TF-IDF feature of the lyrics;
The social label feature extraction comprises three steps: first, data set preprocessing, which organises the raw data so that it can be fed to the social label distribution analysis algorithm to obtain the social label distribution table; second, the social label distribution analysis algorithm, which is designed on the social label set Tag and the label summary set T obtained by preprocessing, analyses the connection between social labels and musical emotion, and produces the social label distribution table; third, social label feature extraction, which extracts the social label feature fta_i = [t_1 ... t_c] of the i-th piece of music, where c is the number of emotion categories and each dimension of the feature corresponds to one emotion category; the dimension t_j of the social label feature fta_i is calculated with a formula (given in the specification) in which n represents the number of social labels attached to the i-th piece of music and ts_k is the score of the social label ta_k in the social label distribution table; Stdt_j contains the social labels whose importance for emotion e_j ranks in the top α;
S4: multimodal fusion: multimodal fusion is used for fusing MIDI and text data of music to carry out multimodal music emotion recognition; the multi-modal fusion comprises a feature level fusion model and a decision level fusion model.
The feature level fusion model connects the MIDI features obtained in S2 with the text features obtained in S3 to form the fused feature fu_i, which is finally fed into the MLP and softmax layers to obtain the emotion result y'_i; in the calculation formula of y'_i (given in the specification), h_ef is the hidden-layer output of each layer, and w_ef and b_ef are the parameters of each layer.
The decision-level fusion model extracts the features of the different modalities, each modality's features are used for prediction by a separate classifier, and the results of the modality classifiers are then fused to obtain the final classification result.
The result fusion adopts linear weighted summation, and the decision-level fusion model comprises the following processing steps: step a: extract features from the MIDI and text data in the data set using the MIDI and text feature extraction methods, obtaining the MIDI feature f_i and the text feature fte_i; step b: feed the two features obtained in step a into their respective MLP and softmax layers for emotion classification training, the prediction results being y_m and y_t, where the elements y_j^m and y_j^t denote the probabilities predicted for the j-th emotion by the MIDI modality and the text modality respectively; step c: perform a weighted summation of y_m and y_t to obtain the fusion result rf_i = [rf_1, rf_2, rf_3, rf_4]; step d: pass rf_i through a softmax layer to obtain the multi-modal fusion result y'_i.
In the calculation formulas of y_m, y_t and rf_j (given in the specification), h_lft is the hidden-layer output of each layer, w_lft and b_lft are the parameters of each layer, and a weight parameter represents the proportion of the MIDI-modality classification result in the fusion.
Because the parameter with the largest influence on the result is the proportion (weight) assigned to the MIDI-modality classification result, experiments were carried out with different values of this weight; the results are shown in the following table:
MIDI modality weight    Accuracy
0.1                     0.7299
0.2                     0.7336
0.3                     0.7372
0.4                     0.7372
0.5                     0.6788
0.6                     0.5766
0.7                     0.5693
0.8                     0.5730
0.9                     0.5730
As the above table shows, the accuracy increases as the weight of the MIDI-modality classification result grows, and it drops once the weight exceeds 0.4; the weight is therefore set to 0.3. The experimental results also show that the text modality contributes more to distinguishing emotions.
In this embodiment, four models are compared in terms of overall accuracy and per-category accuracy: the MIDI-modality-only model (ERMSLM), the text-modality-only model (ERMBT), the feature-level fusion model (FF-ERM) and the decision-level fusion model (DF-ERM); the MIDI-modality weight is set to 0.3.
the comparison of the four models is shown below:
The data table is converted into a histogram as shown in fig. 4:
from the above data table and histogram, it can be known that:
(1) When emotion recognition is carried out with either of the two single-modality data types, the text modality obtains the better result: its accuracy over the four categories is 15.69% higher than that of the MIDI modality, and its classification accuracy is higher than the MIDI modality's in every emotion category;
(2) When multi-modal data are used for emotion recognition, decision-level fusion works better than feature-level fusion: its accuracy over the four categories is 2.92% higher than that of feature-level fusion, while feature-level fusion is 1.82% lower than using the text modality alone;
(3) All four classification models obtain their highest emotion recognition accuracy on the v-v- emotion category and their lowest on the v+v- emotion category.
From this it can be concluded that: when two single-mode data are used for music emotion recognition, a text mode can obtain a better effect; when the multi-modal data is used for music emotion recognition, the multi-modal emotion recognition model fused in a decision level can obtain better emotion recognition effect than the feature level fusion and the single-mode data.
The above-described embodiment is only one preferred embodiment of the present invention, and many alternative modifications and combinations of the above-described embodiments can be made by those skilled in the art based on the technical solutions of the present invention and the related teachings of the above-described embodiments.

Claims (4)

1. The music multi-modal data emotion recognition method based on deep learning is characterized by comprising the following steps:
S1: preprocessing music data; audio data cleaning: the data set is traversed with the Python MIDI processing toolkit pretty_midi, files that raise parsing errors are judged to be invalid audio, and the invalid audio data are deleted from the data set; main track extraction: in the first step, tracks that do not meet the conditions are deleted and the remaining tracks are the candidate tracks; in the second step, six feature quantities are computed for each candidate track and summed to give the track's score, and the track with the highest score is taken as the main track; the six feature quantities include: the note count feature F_nc, the sounding time feature F_nd, the average pitch feature F_p, the average intensity feature F_v, the sounding area feature F_pd and the loudness area feature F_vd; the note count refers to the number of notes in a track, denoted note-count, and is extracted from the MIDI file; pitch refers to the frequency of a note, denoted pitch, and is extracted from the MIDI file; sounding time refers to the duration of a note, denoted duration, and is extracted from the MIDI file; intensity refers to the strength of a note and is extracted from the MIDI file; the sounding area of a note is the product of its pitch and its sounding time, and the loudness area of a note is the product of its intensity and its sounding time;
S2: extracting features of MIDI data; the feature extraction module of the ERMSLM model consists of three parts: melody feature extraction, tonal feature extraction and manual feature extraction;
Melody feature extraction: extracts the melody feature vector m_i from the note pitch set P_i of the i-th piece of music; tonal feature extraction: extracts the tonal feature k_i from the tonal data of the music; manual feature extraction: extracts four kinds of information from the main track of the music, namely pitch, dynamics, duration and speed, to construct the manual feature hcf_i;
S3: extracting features of text data; the text-data features comprise lyric features and social label features, and lyric feature extraction consists of three parts: the first part uses a pre-trained BERT model to obtain a word vector for each word in the lyrics, computes the similarity between each word vector and the word vectors of the four category labels, and constructs the BERT emotion feature from the computed similarities; the second part builds an emotion dictionary covering the four emotions from the ANEW list and constructs the dictionary emotion feature with it; the third part computes the TF-IDF value of each lyric word for the four emotion categories and accumulates the TF-IDF values of all words for a given category as that category's value, the four category values forming the TF-IDF feature of the lyrics;
The social label feature extraction comprises three steps: first, data set preprocessing, which organises the raw data so that it can be fed to the social label distribution analysis algorithm to obtain the social label distribution table; second, the social label distribution analysis algorithm, which is designed on the social label set Tag and the label summary set T obtained by preprocessing, analyses the connection between social labels and musical emotion, and produces the social label distribution table; third, social label feature extraction, which extracts the social label feature fta_i = [t_1 ... t_c] of the i-th piece of music, where c is the number of emotion categories and each dimension of the feature corresponds to one emotion category;
s4: multimodal fusion; multimodal fusion is used for fusing MIDI and text data of music to carry out multimodal music emotion recognition; the multi-mode fusion comprises a feature level fusion model and a decision level fusion model;
The dimension t_j of the social label feature fta_i in S3 is calculated with a formula (given in the specification) in which n represents the number of social labels attached to the i-th piece of music and ts_k is the score of the social label ta_k in the social label distribution table; Stdt_j contains the social labels whose importance for emotion e_j ranks in the top α;
The feature level fusion model connects the MIDI features obtained in S2 with the text features obtained in S3 to form the fused feature fu_i, which is finally fed into an MLP and a softmax layer to obtain the emotion result y'_i; in the calculation formula of y'_i (given in the specification), h_ef is the hidden-layer output of each layer, and w_ef and b_ef are the parameters of each layer;
The decision-level fusion model extracts the features of the different modalities, each modality's features are used for prediction by a separate classifier, and the results of the modality classifiers are then fused to obtain the final classification result.
2. The deep learning-based music multi-modal data emotion recognition method of claim 1, wherein the conditions to be satisfied by a candidate track in S1 are the following: (1) its channel number is not 10; (2) its note count is not less than half of the average note count over all tracks; (3) its total note sounding duration is not less than half of the average note sounding duration over all tracks.
3. The deep learning-based music multi-modal data emotion recognition method according to claim 1, wherein the result fusion adopts linear weighted summation and the decision-level fusion model comprises the following processing steps: step a: extract features from the MIDI and text data in the data set using the MIDI and text feature extraction methods, obtaining the MIDI feature f_i and the text feature fte_i; step b: feed the two features obtained in step a into their respective MLP and softmax layers for emotion classification training, the prediction results being y_m and y_t, where the elements y_j^m and y_j^t denote the probabilities predicted for the j-th emotion by the MIDI modality and the text modality respectively; step c: perform a weighted summation of y_m and y_t to obtain the fusion result rf_i = [rf_1, rf_2, rf_3, rf_4]; step d: pass rf_i through a softmax layer to obtain the multi-modal fusion result y'_i.
4. The deep learning-based music multi-modal data emotion recognition method of claim 3, wherein, in the calculation formulas of y_m, y_t and rf_j (given in the specification), h_lft is the hidden-layer output of each layer, w_lft and b_lft are the parameters of each layer, and a weight parameter represents the proportion of the MIDI-modality classification result in the fusion.
CN202210654145.8A 2022-06-10 2022-06-10 Music multi-mode data emotion recognition method based on deep learning Active CN115064181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210654145.8A CN115064181B (en) 2022-06-10 2022-06-10 Music multi-mode data emotion recognition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210654145.8A CN115064181B (en) 2022-06-10 2022-06-10 Music multi-mode data emotion recognition method based on deep learning

Publications (2)

Publication Number Publication Date
CN115064181A CN115064181A (en) 2022-09-16
CN115064181B true CN115064181B (en) 2024-04-19

Family

ID=83199924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210654145.8A Active CN115064181B (en) 2022-06-10 2022-06-10 Music multi-mode data emotion recognition method based on deep learning

Country Status (1)

Country Link
CN (1) CN115064181B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674339A (en) * 2019-09-18 2020-01-10 北京工业大学 Chinese song emotion classification method based on multi-mode fusion
CN111460213A (en) * 2020-03-20 2020-07-28 河海大学 Music emotion classification method based on multi-mode learning
CN113192471A (en) * 2021-04-16 2021-07-30 南京航空航天大学 Music main melody track identification method based on neural network
CN114186140A (en) * 2021-11-11 2022-03-15 北京奇艺世纪科技有限公司 Social interaction information processing method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10510328B2 (en) * 2017-08-31 2019-12-17 Spotify Ab Lyrics analyzer

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674339A (en) * 2019-09-18 2020-01-10 北京工业大学 Chinese song emotion classification method based on multi-mode fusion
CN111460213A (en) * 2020-03-20 2020-07-28 河海大学 Music emotion classification method based on multi-mode learning
CN113192471A (en) * 2021-04-16 2021-07-30 南京航空航天大学 Music main melody track identification method based on neural network
CN114186140A (en) * 2021-11-11 2022-03-15 北京奇艺世纪科技有限公司 Social interaction information processing method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Cyril Laurier et al., "Music Mood Representations from Social Tags", ISMIR 2009, 2009-01-31, pp. 381-386 *
Yi-Hsuan Yang et al., "Toward Multi-modal Music Emotion Classification", PCM 2008, 2008-12-09, pp. 70-79 *
Han Donghong et al., "Sentiment Community Detection Algorithm for Sina Weibo", Journal of Northeastern University (Natural Science), 2021-01, vol. 42, no. 1, pp. 21-30 *

Also Published As

Publication number Publication date
CN115064181A (en) 2022-09-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant