CN115064181B - Music multi-mode data emotion recognition method based on deep learning - Google Patents


Info

Publication number
CN115064181B
CN115064181B (application CN202210654145.8A)
Authority
CN
China
Prior art keywords
music
emotion
data
midi
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210654145.8A
Other languages
Chinese (zh)
Other versions
CN115064181A (en)
Inventor
Han Donghong (韩东红)
Kong Yanru (孔彦茹)
Li Jiahao (李嘉豪)
Han Jiayi (韩嘉懿)
Liu Ying (刘莹)
Original Assignee
Northeastern University (东北大学)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University
Priority to CN202210654145.8A
Publication of CN115064181A
Application granted
Publication of CN115064181B
Current legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/27: characterised by the analysis technique
    • G10L25/48: specially adapted for particular use
    • G10L25/51: for comparison or discrimination
    • G10L25/63: for estimating an emotional state
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of music multi-modal data emotion recognition, in particular to a deep-learning-based method for emotion recognition on multi-modal music data. The method mainly addresses the problems that single-modality music emotion recognition leaves limited room for improvement and that feature vectors in music data sets cannot be mined deeply, and provides the following technical scheme: S1: preprocessing music data; S2: extracting features of MIDI data; S3: extracting features of text data; S4: multi-modal fusion. By performing multi-modal fusion with the idea of decision-level fusion, the method obtains a better emotion classification effect than feature-level fusion, supports deep emotion learning on music text, promotes the application of deep learning in music emotion recognition, improves the analysis of music, reduces the workload of manual emotion labelling, and improves accuracy. The method is mainly applied to deep-learning-based emotion recognition of multi-modal music data.

Description

Music multi-mode data emotion recognition method based on deep learning
Technical Field
The invention relates to the technical field of deep learning of music emotion, in particular to a deep learning-based music multi-mode data emotion recognition method.
Background
With the continuing spread of mobile terminal devices, the online music market has developed rapidly, and people can access vast music resources through many channels. To help listeners find musical works, the major music platforms use tags such as emotion and genre to sort and organize their catalogues. Since music is a carrier of emotion, managing musical works by emotion is particularly important. However, manual emotion labelling of musical works is time-consuming, labour-intensive and error-prone, so research on automatically recognising musical emotion with artificial intelligence has practical significance.
In the field of intelligent search and recommendation, on the other hand, emotion makes it easier for users to search for music, and combined with their historical data it enables better recommendations and a better user experience. Several problems remain: low-level audio features have no direct link to the semantics and emotion of music; the scarcity of data sets limits the room researchers have for feature extraction and model design, so emotion recognition based on lyrics performs poorly; and music emotion recognition based on single-modality data such as audio or lyrics has hit a ceiling, leaving limited room for improvement. We therefore propose a music multi-modal data emotion recognition method based on deep learning.
Disclosure of Invention
The invention aims to solve the problems described in the background: single-modality music emotion recognition leaves limited room for improvement, feature vectors in music data sets cannot be mined deeply, and recognition performance is poor. To this end it provides a music multi-modal data emotion recognition method based on deep learning.
The technical scheme of the invention is as follows: the deep-learning-based music multi-modal data emotion recognition method comprises the following steps:
S1: preprocessing music data; audio data cleaning: the data set is traversed with the Python MIDI processing toolkit pretty_midi, files that raise parsing errors are judged to be invalid audio, and the invalid audio data are deleted from the data set; main track extraction: in the first step, tracks that do not meet the conditions are deleted and the remaining tracks are the candidate tracks; in the second step, six feature quantities are computed for each candidate track and summed to give the track's score, and the track with the highest score is taken as the main track; the six feature quantities used for main track extraction are: the note count feature F_nc, the sounding time feature F_nd, the average pitch feature F_p, the average intensity feature F_v, the sounding area feature F_pd and the loudness area feature F_vd; the note count refers to the number of notes in a track, denoted note-count, and is extracted from the MIDI file; pitch refers to the frequency of a note, denoted pitch, and can be extracted directly from the MIDI file; sounding time refers to the duration of a note, denoted duration, and can be extracted directly from the MIDI file; intensity refers to the strength of a note, denoted intensity, and can be extracted directly from the MIDI file; the sounding area of a note is the product of its pitch and its sounding time, and the loudness area of a note is the product of its intensity and its sounding time;
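A minimal sketch of this preprocessing step is given below, using the pretty_midi package named in S1. The min-max normalisation applied before summing the six features, the reading of the loudness area as intensity multiplied by sounding time, and the handling of the channel-10 condition through pretty_midi's is_drum flag are assumptions; the candidate-track conditions follow the preferred embodiment described later.

```python
import numpy as np
import pretty_midi

def load_valid_midi(path):
    """Data cleaning: a MIDI file that pretty_midi cannot parse is treated as invalid audio."""
    try:
        return pretty_midi.PrettyMIDI(path)
    except Exception:              # parsing error -> invalid audio, drop it from the data set
        return None

def track_features(track):
    """The six per-track feature quantities of S1: F_nc, F_nd, F_p, F_v, F_pd, F_vd."""
    pitch = np.array([n.pitch for n in track.notes], dtype=float)
    vel = np.array([n.velocity for n in track.notes], dtype=float)
    dur = np.array([n.end - n.start for n in track.notes], dtype=float)
    return np.array([
        len(track.notes),        # F_nc: note count
        dur.sum(),               # F_nd: total sounding time
        pitch.mean(),            # F_p : average pitch
        vel.mean(),              # F_v : average intensity (velocity)
        (pitch * dur).sum(),     # F_pd: sounding area = pitch x sounding time
        (vel * dur).sum(),       # F_vd: loudness area, assumed intensity x sounding time
    ])

def extract_main_track(midi):
    """Filter candidate tracks (non-drum, enough notes and sounding time), score each by the
    sum of its six features (min-max normalised here, an assumption) and return the best one."""
    tracks = [t for t in midi.instruments if t.notes and not t.is_drum]   # drop channel-10 drums
    if not tracks:
        return None
    avg_nc = np.mean([len(t.notes) for t in tracks])
    avg_nd = np.mean([sum(n.end - n.start for n in t.notes) for t in tracks])
    cands = [t for t in tracks
             if len(t.notes) >= avg_nc / 2
             and sum(n.end - n.start for n in t.notes) >= avg_nd / 2] or tracks
    feats = np.array([track_features(t) for t in cands])
    span = np.ptp(feats, axis=0)
    span[span == 0] = 1.0
    scores = ((feats - feats.min(axis=0)) / span).sum(axis=1)
    return cands[int(scores.argmax())]
```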
S2: extracting features of MIDI data; the feature extraction module of the ERMSLM model consists of three parts: melody feature extraction, tonal feature extraction and manual feature extraction;
Melody feature extraction: extracts the melody feature vector m_i from the note pitch set P_i of the i-th piece of music; tonal feature extraction: extracts the tonal feature k_i from the tonal data of the music; manual feature extraction: extracts four kinds of information from the main track of the music, namely pitch, dynamics, duration and speed, to construct the manual feature hcf_i;
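For concreteness, a sketch of the three inputs is shown below using pretty_midi: the pitch sequence P_i that feeds melody feature learning, a simple key-based tonal value, and a manual feature hcf_i built from pitch, dynamics, duration and tempo statistics. The mean/std statistics and the key encoding are assumptions, since this passage only names the information types.

```python
import numpy as np
import pretty_midi

def pitch_sequence(main_track):
    """Note pitch set P_i of the i-th piece, ordered by onset time; this is the input
    from which the melody feature vector m_i is learned."""
    return np.array([n.pitch for n in sorted(main_track.notes, key=lambda n: n.start)])

def tonal_feature(midi: pretty_midi.PrettyMIDI) -> int:
    """Tonal feature k_i: here simply the first key signature's key number
    (0-23 in pretty_midi), or -1 if the file carries no key data."""
    keys = midi.key_signature_changes
    return keys[0].key_number if keys else -1

def manual_features(midi: pretty_midi.PrettyMIDI, main_track) -> np.ndarray:
    """Manual feature hcf_i from the four information types named in S2:
    pitch, dynamics (velocity), duration and speed (tempo) of the main track."""
    pitch = np.array([n.pitch for n in main_track.notes], dtype=float)
    vel = np.array([n.velocity for n in main_track.notes], dtype=float)
    dur = np.array([n.end - n.start for n in main_track.notes], dtype=float)
    _, tempi = midi.get_tempo_changes()          # speed information of the piece
    tempo = float(np.mean(tempi)) if len(tempi) else 120.0
    return np.array([pitch.mean(), pitch.std(),
                     vel.mean(), vel.std(),
                     dur.mean(), dur.std(),
                     tempo])
```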
S3: extracting features of text data; the text-data features comprise lyric features and social label features, and lyric feature extraction consists of three parts: the first part uses a pre-trained BERT model to obtain a word vector for each word in the lyrics, computes the similarity between each word vector and the word vectors of the four category labels, and constructs the BERT emotion feature from the computed similarities; the second part builds an emotion dictionary covering the four emotions from the ANEW list and constructs the dictionary emotion feature with it; the third part computes the TF-IDF value of each lyric word for the four emotion categories and accumulates the TF-IDF values of all words for a given category as that category's value, the four category values forming the TF-IDF feature of the lyrics;
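The first (BERT) part can be sketched as follows. The model name, the four label words and the mean aggregation of per-word similarities are placeholders, since the passage only specifies a pre-trained BERT model and four category labels without naming them.

```python
import numpy as np
import torch
from transformers import BertModel, BertTokenizer

# "bert-base-uncased" and the four label words below are assumptions for illustration.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def word_vectors(text: str) -> torch.Tensor:
    """Token vectors (last hidden state) of `text` from the pre-trained BERT model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return bert(**enc).last_hidden_state[0]          # (num_tokens, 768)

def bert_emotion_feature(lyrics: str,
                         labels=("happy", "angry", "sad", "relaxed")) -> np.ndarray:
    """Similarity of every lyric word vector to the four category-label word vectors;
    averaging the per-label similarities over the lyric is an assumed aggregation."""
    lyric_vecs = torch.nn.functional.normalize(word_vectors(lyrics), dim=-1)
    label_vecs = torch.nn.functional.normalize(
        torch.stack([word_vectors(w).mean(dim=0) for w in labels]), dim=-1)
    sims = lyric_vecs @ label_vecs.T                     # cosine similarities, (num_tokens, 4)
    return sims.mean(dim=0).numpy()                      # BERT emotion feature, one value per class
```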
The social label feature extraction comprises three steps: first, data set preprocessing, which organises the raw data so that it can be fed to the social label distribution analysis algorithm to obtain the social label distribution table; second, the social label distribution analysis algorithm, which is designed on the social label set Tag and the label summary set T obtained by preprocessing, analyses the connection between social labels and musical emotion, and produces the social label distribution table; third, social label feature extraction, which extracts the social label feature fta_i = [t_1 ... t_c] of the i-th piece of music, where c is the number of emotion categories and each dimension of the feature corresponds to one emotion category;
s4: multimodal fusion; multimodal fusion is used for fusing MIDI and text data of music to carry out multimodal music emotion recognition; the multi-mode fusion comprises a feature level fusion model and a decision level fusion model;
The dimension t_j of the social label feature fta_i in S3 is calculated with a formula (given in the specification) in which n represents the number of social labels attached to the i-th piece of music and ts_k is the score of the social label ta_k in the social label distribution table; Stdt_j contains the social labels whose importance for emotion e_j ranks in the top α;
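A sketch of this feature is given below. Because the formula for t_j is only available as an image in the original text, the aggregation used here, summing the scores ts_k of the song's labels that fall inside Stdt_j and dividing by n, is an assumption consistent with the surrounding definitions.

```python
def social_label_feature(song_tags, tag_scores, stdt, num_classes=4):
    """Social label feature fta_i = [t_1 ... t_c] of one piece of music.

    song_tags  -- the n social labels ta_1..ta_n attached to the i-th piece
    tag_scores -- dict mapping a social label ta_k to its score ts_k in the
                  social label distribution table
    stdt       -- list of label sets, stdt[j] = Stdt_j (top-ranked labels for emotion e_j)
    """
    n = max(len(song_tags), 1)
    # Assumed form of t_j: sum of the scores of the song's labels inside Stdt_j, divided by n.
    return [sum(tag_scores.get(tag, 0.0) for tag in song_tags if tag in stdt[j]) / n
            for j in range(num_classes)]
```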
The feature level fusion model connects the MIDI features obtained in S2 with the text features obtained in S3 to form the fused feature fu_i, which is finally fed into an MLP and a softmax layer to obtain the emotion result y'_i; in the calculation formula of y'_i (given in the specification), h_ef is the hidden-layer output of each layer, and w_ef and b_ef are the parameters of each layer.
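A minimal PyTorch sketch of this feature-level fusion model follows; the hidden width, depth and activation are illustrative assumptions, while the concatenation, MLP and softmax structure follow the description above.

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Feature-level fusion: concatenate the MIDI feature f_i and the text feature fte_i
    into the fused feature fu_i, then pass it through an MLP and a softmax layer to
    obtain the emotion result y'_i."""
    def __init__(self, midi_dim: int, text_dim: int, hidden: int = 128, num_classes: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(                 # each layer: h_ef = ReLU(w_ef . h + b_ef)
            nn.Linear(midi_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, f_i: torch.Tensor, fte_i: torch.Tensor) -> torch.Tensor:
        fu_i = torch.cat([f_i, fte_i], dim=-1)    # fused feature fu_i
        return torch.softmax(self.mlp(fu_i), dim=-1)   # emotion result y'_i
```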
Preferably, the conditions to be satisfied by a candidate track in S1 are the following: (1) its channel number is not 10; (2) its note count is not less than half of the average note count over all tracks; (3) its total note sounding duration is not less than half of the average note sounding duration over all tracks.
Preferably, the dimension t_j of the social label feature fta_i in S3 is calculated with the formula (given in the specification) in which n represents the number of social labels attached to the i-th piece of music and ts_k is the score of the social label ta_k in the social label distribution table; Stdt_j contains the social labels whose importance for emotion e_j ranks in the top α.
Preferably, the feature level fusion model connects the MIDI features obtained in S2 with the text features obtained in S3 to form the fused feature fu_i, which is finally fed into the MLP and softmax layers to obtain the emotion result y'_i; in the calculation formula of y'_i (given in the specification), h_ef is the hidden-layer output of each layer, and w_ef and b_ef are the parameters of each layer.
Preferably, the decision-level fusion model extracts the features of the different modalities, each modality's features are used for prediction by a separate classifier, and the results of the modality classifiers are then fused to obtain the final classification result.
Preferably, the result fusion adopts linear weighted summation, and the decision-level fusion model comprises the following processing steps: step a: extract features from the MIDI and text data in the data set using the MIDI and text feature extraction methods, obtaining the MIDI feature f_i and the text feature fte_i; step b: feed the two features obtained in step a into their respective MLP and softmax layers for emotion classification training, the prediction results being y_m and y_t, where the elements y_j^m and y_j^t denote the probabilities predicted for the j-th emotion by the MIDI modality and the text modality respectively; step c: perform a weighted summation of y_m and y_t to obtain the fusion result rf_i = [rf_1, rf_2, rf_3, rf_4]; step d: pass rf_i through a softmax layer to obtain the multi-modal fusion result y'_i.
Preferably, in the calculation formulas of y_m, y_t and rf_j (given in the specification), h_lft is the hidden-layer output of each layer, w_lft and b_lft are the parameters of each layer, and a weight parameter represents the proportion of the MIDI-modality classification result in the fusion.
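A sketch of this decision-level fusion rule is shown below, assuming the text modality receives the complementary weight (1 minus the MIDI-modality weight); the default value 0.3 follows the weight experiment in the embodiment below.

```python
import torch

def decision_level_fusion(y_m: torch.Tensor, y_t: torch.Tensor,
                          midi_weight: float = 0.3) -> torch.Tensor:
    """Decision-level fusion by linear weighted summation.

    y_m, y_t    -- probability predictions of the MIDI and text classifiers (4 values each)
    midi_weight -- proportion of the MIDI-modality result; the text modality is assumed
                   to receive the complementary weight 1 - midi_weight.
    Returns the multi-modal fusion result y'_i = softmax(rf_i).
    """
    rf_i = midi_weight * y_m + (1.0 - midi_weight) * y_t   # rf_i = [rf_1, rf_2, rf_3, rf_4]
    return torch.softmax(rf_i, dim=-1)                     # y'_i
```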
Compared with the prior art, the invention has the following beneficial technical effects:
1. By performing emotion classification on the MIDI data of music with a deep learning model, the invention addresses the problem that commonly used low-level audio features were designed for other audio tasks and lack a direct connection to musical emotion, which limits the emotion recognition effect and its interpretability; this improves the accuracy of music emotion recognition.
2. The invention builds an emotion recognition model, ERMSLM, from text data and uses the text data of music to recognise musical emotion; a social label distribution analysis algorithm is proposed to analyse the relationship between social labels and emotion categories, and the analysis result is used to construct the social label features. An emotion dictionary is built from the ANEW list and used to identify the emotional tendency of the lyrics. Feature extraction is divided into lyric feature extraction and social label feature extraction.
3. The invention studies music emotion recognition by fusing multi-modal data, mainly through a feature-level fusion method and a decision-level fusion method; feature-level fusion is easy to implement, needs only one classifier training process, shortens training time and can take the correlations between modality features into account, while decision-level fusion does not need to deal with multi-modal synchronisation and is more extensible.
4. In summary, performing multi-modal fusion with the idea of decision-level fusion yields a better emotion classification effect than feature-level fusion, enables deep emotion learning on music text, promotes the application of deep learning in music emotion recognition, improves the analysis of music, reduces the workload of manual emotion labelling and improves accuracy.
Drawings
FIG. 1 is a diagram of a multi-modal MER study framework in this scenario;
FIG. 2 is a schematic diagram of a feature level fusion emotion recognition model;
FIG. 3 is a schematic diagram of a decision-level fusion emotion recognition model;
Fig. 4 is a bar graph of the comparison results of the four models in this scenario.
Detailed Description
The technical scheme of the invention is further described below with reference to the attached drawings and specific embodiments.
Examples
As shown in figs. 1-4, the deep-learning-based music multi-modal data emotion recognition method provided by the invention comprises the following steps: S1: preprocessing music data; audio data cleaning: the data set is traversed with the Python MIDI processing toolkit pretty_midi, files that raise parsing errors are judged to be invalid audio, and the invalid audio data are deleted from the data set; main track extraction: in the first step, tracks that do not meet the conditions are deleted and the remaining tracks are the candidate tracks; in the second step, six feature quantities are computed for each candidate track and summed to give the track's score, and the track with the highest score is taken as the main track;
The conditions to be met by a candidate track are the following: (1) its channel number is not 10; (2) its note count is not less than half of the average note count over all tracks; (3) its total note sounding duration is not less than half of the average note sounding duration over all tracks. The six feature quantities of the second step are: the note count feature F_nc, the sounding time feature F_nd, the average pitch feature F_p, the average intensity feature F_v, the sounding area feature F_pd and the loudness area feature F_vd; the note count refers to the number of notes in a track, denoted note-count, and is extracted from the MIDI file; pitch refers to the frequency of a note, denoted pitch, and can be extracted directly from the MIDI file; sounding time refers to the duration of a note, denoted duration, and can be extracted directly from the MIDI file; intensity refers to the strength of a note, denoted intensity, and can be extracted directly from the MIDI file; the sounding area of a note is the product of its pitch and its sounding time, and the loudness area of a note is the product of its intensity and its sounding time;
S2: extracting features of MIDI data; the feature extraction module of the ERMSLM model consists of three parts: melody feature extraction, tonal feature extraction and manual feature extraction;
The melody feature extraction extracts the melody feature vector m_i from the note pitch set P_i of the i-th piece of music; manual feature extraction: four kinds of information about the notes, namely pitch, intensity, duration and speed, are extracted from the main track of the music to construct the manual feature hcf_i.
S3: extracting features of text data; the text-data features comprise lyric features and social label features, and lyric feature extraction consists of three parts: the first part uses a pre-trained BERT model to obtain a word vector for each word in the lyrics, computes the similarity between each word vector and the word vectors of the four category labels, and constructs the BERT emotion feature from the computed similarities; the second part builds an emotion dictionary covering the four emotions from the ANEW list and constructs the dictionary emotion feature with it; the third part computes the TF-IDF value of each lyric word for the four emotion categories and accumulates the TF-IDF values of all words for a given category as that category's value, the four category values forming the TF-IDF feature of the lyrics;
The social label feature extraction comprises three steps: first, data set preprocessing, which organises the raw data so that it can be fed to the social label distribution analysis algorithm to obtain the social label distribution table; second, the social label distribution analysis algorithm, which is designed on the social label set Tag and the label summary set T obtained by preprocessing, analyses the connection between social labels and musical emotion, and produces the social label distribution table; third, social label feature extraction, which extracts the social label feature fta_i = [t_1 ... t_c] of the i-th piece of music, where c is the number of emotion categories and each dimension of the feature corresponds to one emotion category; the dimension t_j of the social label feature fta_i is calculated with a formula (given in the specification) in which n represents the number of social labels attached to the i-th piece of music and ts_k is the score of the social label ta_k in the social label distribution table; Stdt_j contains the social labels whose importance for emotion e_j ranks in the top α;
S4: multimodal fusion: multimodal fusion is used for fusing MIDI and text data of music to carry out multimodal music emotion recognition; the multi-modal fusion comprises a feature level fusion model and a decision level fusion model.
The feature level fusion model connects the MIDI features obtained in S2 with the text features obtained in S3 to form the fused feature fu_i, which is finally fed into the MLP and softmax layers to obtain the emotion result y'_i; in the calculation formula of y'_i (given in the specification), h_ef is the hidden-layer output of each layer, and w_ef and b_ef are the parameters of each layer.
The decision-level fusion model extracts the features of the different modalities, each modality's features are used for prediction by a separate classifier, and the results of the modality classifiers are then fused to obtain the final classification result.
The result fusion adopts linear weighted summation, and the decision-level fusion model comprises the following processing steps: step a: extract features from the MIDI and text data in the data set using the MIDI and text feature extraction methods, obtaining the MIDI feature f_i and the text feature fte_i; step b: feed the two features obtained in step a into their respective MLP and softmax layers for emotion classification training, the prediction results being y_m and y_t, where the elements y_j^m and y_j^t denote the probabilities predicted for the j-th emotion by the MIDI modality and the text modality respectively; step c: perform a weighted summation of y_m and y_t to obtain the fusion result rf_i = [rf_1, rf_2, rf_3, rf_4]; step d: pass rf_i through a softmax layer to obtain the multi-modal fusion result y'_i.
In the calculation formulas of y_m, y_t and rf_j (given in the specification), h_lft is the hidden-layer output of each layer, w_lft and b_lft are the parameters of each layer, and a weight parameter represents the proportion of the MIDI-modality classification result in the fusion.
Because the parameter with the largest influence on the result is the proportion (weight) assigned to the MIDI-modality classification result, experiments were carried out with different values of this weight; the results are shown in the following table:
MIDI modality weight    Accuracy
0.1                     0.7299
0.2                     0.7336
0.3                     0.7372
0.4                     0.7372
0.5                     0.6788
0.6                     0.5766
0.7                     0.5693
0.8                     0.5730
0.9                     0.5730
As the above table shows, the accuracy increases as the weight of the MIDI-modality classification result grows, and it drops once the weight exceeds 0.4; the weight is therefore set to 0.3. The experimental results also show that the text modality contributes more to distinguishing emotions.
In this embodiment, four models are compared in terms of overall accuracy and per-category accuracy: the MIDI-modality-only model (ERMSLM), the text-modality-only model (ERMBT), the feature-level fusion model (FF-ERM) and the decision-level fusion model (DF-ERM); the MIDI-modality weight is set to 0.3.
the comparison of the four models is shown below:
The data table is converted into a histogram as shown in fig. 4:
from the above data table and histogram, it can be known that:
(1) When emotion recognition is carried out with either of the two single-modality data types, the text modality obtains the better result: its accuracy over the four categories is 15.69% higher than that of the MIDI modality, and its classification accuracy is higher than the MIDI modality's in every emotion category;
(2) When multi-modal data are used for emotion recognition, decision-level fusion works better than feature-level fusion: its accuracy over the four categories is 2.92% higher than that of feature-level fusion, while feature-level fusion is 1.82% lower than using the text modality alone;
(3) All four classification models obtain their highest emotion recognition accuracy on the v-v- emotion category and their lowest on the v+v- emotion category.
From this it can be concluded that: when two single-mode data are used for music emotion recognition, a text mode can obtain a better effect; when the multi-modal data is used for music emotion recognition, the multi-modal emotion recognition model fused in a decision level can obtain better emotion recognition effect than the feature level fusion and the single-mode data.
The above-described embodiment is only one preferred embodiment of the present invention, and many alternative modifications and combinations of the above-described embodiments can be made by those skilled in the art based on the technical solutions of the present invention and the related teachings of the above-described embodiments.

Claims (4)

1. The music multi-modal data emotion recognition method based on deep learning is characterized by comprising the following steps:
S1: preprocessing music data; audio data cleaning: the data set is traversed with the Python MIDI processing toolkit pretty_midi, files that raise parsing errors are judged to be invalid audio, and the invalid audio data are deleted from the data set; main track extraction: in the first step, tracks that do not meet the conditions are deleted and the remaining tracks are the candidate tracks; in the second step, six feature quantities are computed for each candidate track and summed to give the track's score, and the track with the highest score is taken as the main track; the six feature quantities include: the note count feature F_nc, the sounding time feature F_nd, the average pitch feature F_p, the average intensity feature F_v, the sounding area feature F_pd and the loudness area feature F_vd; the note count refers to the number of notes in a track, denoted note-count, and is extracted from the MIDI file; pitch refers to the frequency of a note, denoted pitch, and is extracted from the MIDI file; sounding time refers to the duration of a note, denoted duration, and is extracted from the MIDI file; intensity refers to the strength of a note and is extracted from the MIDI file; the sounding area of a note is the product of its pitch and its sounding time, and the loudness area of a note is the product of its intensity and its sounding time;
S2: extracting features of MIDI data; the feature extraction module of the ERMSLM model consists of three parts: melody feature extraction, tonal feature extraction and manual feature extraction;
Melody feature extraction: extracts the melody feature vector m_i from the note pitch set P_i of the i-th piece of music; tonal feature extraction: extracts the tonal feature k_i from the tonal data of the music; manual feature extraction: extracts four kinds of information from the main track of the music, namely pitch, dynamics, duration and speed, to construct the manual feature hcf_i;
S3: extracting features of text data; the text-data features comprise lyric features and social label features, and lyric feature extraction consists of three parts: the first part uses a pre-trained BERT model to obtain a word vector for each word in the lyrics, computes the similarity between each word vector and the word vectors of the four category labels, and constructs the BERT emotion feature from the computed similarities; the second part builds an emotion dictionary covering the four emotions from the ANEW list and constructs the dictionary emotion feature with it; the third part computes the TF-IDF value of each lyric word for the four emotion categories and accumulates the TF-IDF values of all words for a given category as that category's value, the four category values forming the TF-IDF feature of the lyrics;
The social label feature extraction comprises three steps: first, data set preprocessing, which organises the raw data so that it can be fed to the social label distribution analysis algorithm to obtain the social label distribution table; second, the social label distribution analysis algorithm, which is designed on the social label set Tag and the label summary set T obtained by preprocessing, analyses the connection between social labels and musical emotion, and produces the social label distribution table; third, social label feature extraction, which extracts the social label feature fta_i = [t_1 ... t_c] of the i-th piece of music, where c is the number of emotion categories and each dimension of the feature corresponds to one emotion category;
s4: multimodal fusion; multimodal fusion is used for fusing MIDI and text data of music to carry out multimodal music emotion recognition; the multi-mode fusion comprises a feature level fusion model and a decision level fusion model;
The dimension t_j of the social label feature fta_i in S3 is calculated with a formula (given in the specification) in which n represents the number of social labels attached to the i-th piece of music and ts_k is the score of the social label ta_k in the social label distribution table; Stdt_j contains the social labels whose importance for emotion e_j ranks in the top α;
The feature level fusion model connects the MIDI features obtained in S2 with the text features obtained in S3 to form the fused feature fu_i, which is finally fed into an MLP and a softmax layer to obtain the emotion result y'_i; in the calculation formula of y'_i (given in the specification), h_ef is the hidden-layer output of each layer, and w_ef and b_ef are the parameters of each layer;
The decision-level fusion model extracts the features of the different modalities, each modality's features are used for prediction by a separate classifier, and the results of the modality classifiers are then fused to obtain the final classification result.
2. The deep learning-based music multi-modal data emotion recognition method of claim 1, wherein the conditions to be satisfied by a candidate track in S1 are the following: (1) its channel number is not 10; (2) its note count is not less than half of the average note count over all tracks; (3) its total note sounding duration is not less than half of the average note sounding duration over all tracks.
3. The deep learning-based music multi-modal data emotion recognition method according to claim 1, wherein the result fusion adopts linear weighted summation and the decision-level fusion model comprises the following processing steps: step a: extract features from the MIDI and text data in the data set using the MIDI and text feature extraction methods, obtaining the MIDI feature f_i and the text feature fte_i; step b: feed the two features obtained in step a into their respective MLP and softmax layers for emotion classification training, the prediction results being y_m and y_t, where the elements y_j^m and y_j^t denote the probabilities predicted for the j-th emotion by the MIDI modality and the text modality respectively; step c: perform a weighted summation of y_m and y_t to obtain the fusion result rf_i = [rf_1, rf_2, rf_3, rf_4]; step d: pass rf_i through a softmax layer to obtain the multi-modal fusion result y'_i.
4. The deep learning-based music multi-modal data emotion recognition method of claim 3, wherein, in the calculation formulas of y_m, y_t and rf_j (given in the specification), h_lft is the hidden-layer output of each layer, w_lft and b_lft are the parameters of each layer, and a weight parameter represents the proportion of the MIDI-modality classification result in the fusion.
CN202210654145.8A 2022-06-10 2022-06-10 Music multi-mode data emotion recognition method based on deep learning Active CN115064181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210654145.8A CN115064181B (en) 2022-06-10 2022-06-10 Music multi-mode data emotion recognition method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210654145.8A CN115064181B (en) 2022-06-10 2022-06-10 Music multi-mode data emotion recognition method based on deep learning

Publications (2)

Publication Number Publication Date
CN115064181A CN115064181A (en) 2022-09-16
CN115064181B true CN115064181B (en) 2024-04-19

Family

ID=83199924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210654145.8A Active CN115064181B (en) 2022-06-10 2022-06-10 Music multi-mode data emotion recognition method based on deep learning

Country Status (1)

Country Link
CN (1) CN115064181B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674339A (en) * 2019-09-18 2020-01-10 北京工业大学 Chinese song emotion classification method based on multi-mode fusion
CN111460213A (en) * 2020-03-20 2020-07-28 河海大学 Music emotion classification method based on multi-mode learning
CN113192471A (en) * 2021-04-16 2021-07-30 南京航空航天大学 Music main melody track identification method based on neural network
CN114186140A (en) * 2021-11-11 2022-03-15 北京奇艺世纪科技有限公司 Social interaction information processing method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10510328B2 (en) * 2017-08-31 2019-12-17 Spotify Ab Lyrics analyzer

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674339A (en) * 2019-09-18 2020-01-10 北京工业大学 Chinese song emotion classification method based on multi-mode fusion
CN111460213A (en) * 2020-03-20 2020-07-28 河海大学 Music emotion classification method based on multi-mode learning
CN113192471A (en) * 2021-04-16 2021-07-30 南京航空航天大学 Music main melody track identification method based on neural network
CN114186140A (en) * 2021-11-11 2022-03-15 北京奇艺世纪科技有限公司 Social interaction information processing method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Cyril Laurier et al., "Music Mood Representations from Social Tags", ISMIR 2009, 2009-01-31, pp. 381-386 *
Yi-Hsuan Yang et al., "Toward Multi-modal Music Emotion Classification", PCM 2008, 2008-12-09, pp. 70-79 *
Han Donghong et al., "Sentiment Community Detection Algorithm for Sina Weibo", Journal of Northeastern University (Natural Science), 2021-01, vol. 42, no. 1, pp. 21-30 *

Also Published As

Publication number Publication date
CN115064181A (en) 2022-09-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant