CN114925742B - Symbol music emotion classification system and method based on auxiliary task - Google Patents

Symbolic music emotion classification system and method based on auxiliary tasks

Info

Publication number
CN114925742B
CN114925742B (application CN202210296315.XA)
Authority
CN
China
Prior art keywords
music
emotion
representation
event
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210296315.XA
Other languages
Chinese (zh)
Other versions
CN114925742A (en)
Inventor
陈俊龙
邱际宝
张通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210296315.XA priority Critical patent/CN114925742B/en
Publication of CN114925742A publication Critical patent/CN114925742A/en
Application granted granted Critical
Publication of CN114925742B publication Critical patent/CN114925742B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Auxiliary Devices For Music (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a symbolic music emotion classification system and method based on auxiliary tasks. The system comprises a symbolic music encoding module, an embedding layer, a feature extractor, a convergence layer, an emotion classifier, a tonality classifier and a dynamics classifier. The method comprises a pre-training stage and a fine-tuning stage; the auxiliary tasks may be used only in the fine-tuning stage, or in both the pre-training stage and the fine-tuning stage. By using emotion-related auxiliary tasks, the method improves the accuracy of emotion recognition: through multi-task learning the model learns emotion-related representations better, thereby improving its accuracy on the emotion recognition task.

Description

Symbolic music emotion classification system and method based on auxiliary tasks
Technical Field
The invention belongs to the field of emotion classification, and particularly relates to a symbolic music emotion classification method and system based on auxiliary tasks.
Background
Music has been closely linked to emotion since its origin, and Liszt and others held that music itself arose for the purpose of expressing emotion. Emotion recognition of music is therefore an important direction in music psychology research. Music exists in a variety of forms, for example audio-modality music (presented as audio and stored in formats such as mp3 and wav) and symbolic-modality music (presented as score-like symbolic representations and stored in formats such as MIDI and MusicXML). Symbolic music generally encodes the beat, rhythm, pitch, duration and dynamics of each note. Existing research shows that symbolic-modality music is better suited to automatic emotion classification by machine learning or deep learning models.
According to Russell's dimensional theory of emotion, the emotion of music can be described along two dimensions: Valence and Arousal. Valence indicates whether the emotion is positive or negative, and arousal indicates the intensity of the emotion. Further, according to this valence-arousal model, the emotion of music can be divided into four categories: happy (high valence, high arousal), angry or fearful (low valence, high arousal), sad (low valence, low arousal) and calm (high valence, low arousal).
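As an illustration only (not part of the patented system), a minimal Python sketch of this quadrant mapping might look as follows; the assumption that both axes are centred at zero is ours:

```python
def quadrant_emotion(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) pair to one of the four quadrant emotions.

    Assumes both axes are centred at 0 (>= 0 means high valence / high arousal).
    """
    if valence >= 0 and arousal >= 0:
        return "happy"          # high valence, high arousal
    if valence < 0 and arousal >= 0:
        return "angry/fearful"  # low valence, high arousal
    if valence < 0 and arousal < 0:
        return "sad"            # low valence, low arousal
    return "calm"               # high valence, low arousal
```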
The existing symbolic music emotion classification algorithms fall mainly into two categories:
a) The first is machine learning based on statistical features. Existing music-analysis tools are used to extract statistical features of symbolic music, such as the distribution of pitches and the intervals of the melody. The extracted features are then fed into machine learning models, such as support vector machines and classification tree algorithms, for emotion classification of the music. However, machine learning based on statistical features requires manual feature extraction, and its accuracy in recognizing the emotion of symbolic music is low. At present, the second approach, deep learning based on symbolic music representations, is mainly used for emotion recognition of symbolic music.
B) The second is deep learning based on symbolic music representations. This approach encodes symbolic music into an event sequence; the sounding of a note is expressed as the pitch of the note, the duration of the note, and the dynamics of the note. The event sequence is then fed into a neural network model capable of processing time-series data (e.g., a long short-term memory network) for emotion classification. In this process, events are regarded as expressions similar to words in natural language, so this approach typically uses models from the field of natural language processing (NLP) for classification. The usual training procedure is to perform unsupervised pre-training on a large dataset without emotion labels and save the pre-trained weights; after pre-training, the model is considered to have learned some music-related domain knowledge. The pre-trained weights are then loaded and fine-tuned for emotion classification on a small dataset with emotion labels. The backbone network of the fine-tuning process is the same as that of the pre-training process; it is regarded as a feature extractor, and several fully connected layers are appended to it as the emotion classifier. The learning rate during fine-tuning is generally smaller than during pre-training, so that the general knowledge learned by the pre-trained model on the large dataset does not change too much. The main pre-training method used is the masked language model: in the pre-training stage, randomly chosen events (or the event following the one currently being processed) are masked, and the neural network is trained to predict the masked events. This method first appeared in natural language processing and was later transferred to symbolic music emotion classification. However, directly pre-training and then fine-tuning in this way does not make full use of the structural information of music, even though structural information such as the mode (major or minor) and the loudness of the notes has been shown, in both music theory and psychology, to be closely related to the emotion of music. Music theory and psychology suggest that the key of music has a great influence on emotional expression: music in a major key is generally considered happy and positive, while music in a minor key is generally considered sad and negative. In addition, the loudness of music is considered to be related to its emotion: loud music is generally considered happy and positive, quiet music sad and negative. Because the deep learning approach based on symbolic music representations does not make full use of this structural information, its emotion classification accuracy is low. For example, "MidiBERT-Piano: Large-scale Pre-training for Symbolic Music Understanding" by Yi-Hui Chou et al. is a multi-task study, but it designs different models for different tasks, does not consider the links between tasks, and does not perform tonality classification and emotion classification jointly.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a symbolic music emotion classification method based on auxiliary tasks, which exploits emotion-related structural information and improves the accuracy of the model for emotion classification.
In order to achieve the purpose of the invention, the auxiliary-task-based symbolic music emotion classification system provided by the invention comprises a symbolic music encoding module, an embedding layer, a feature extractor, a convergence layer, an emotion classifier, a tonality classifier and a dynamics classifier, wherein:

the symbolic music encoding module is used for encoding symbolic music into event groups or an event sequence;

the embedding layer is used for embedding the event groups or the event sequence to obtain an event embedding, adding position information to obtain a position embedding, and adding the event embedding and the position embedding to obtain the final embedded representation l_1;

the feature extractor is used for extracting features from the embedded representation and outputting a representation sequence l_2 with context information;

the convergence layer is used for aggregating the sequence information, wherein for the emotion classification main task and the tonality classification auxiliary task, the convergence layer multiplies the attention weights with the representation sequence l_2 to obtain an aggregated vector l_3, and for the note dynamics classification auxiliary task, the convergence layer uses identity mapping to obtain the aggregated vector corresponding to each note;

the emotion classifier and the tonality classifier are used for predicting the emotion category and the key from the aggregated vector l_3, and the dynamics classifier is used for predicting the dynamics from the aggregated vector corresponding to each note.
Further, the symbolic music encoding method adopted by the symbolic music encoding module comprises either the CP representation or the Ferreira representation, wherein the CP representation gathers the events of the symbolic music encoding into groups, each group encoding the pitch and duration of a note, the bar in which the note starts to sound, and the sub-beat within the bar, and the Ferreira representation encodes symbolic music as {v_i, d_i, p_i}_{i=1}^{n}, where v_i, d_i and p_i respectively denote the dynamics, duration and pitch of the i-th note, and n denotes that the event sequence contains n event groups in total.
Further, in the embedding layer, the process of obtaining the final embedded representation l_1 is:

when the symbolic music encoding module uses the CP representation to encode symbolic music into event groups, the encoded sequence is x = {x_1, ..., x_n}, where n is the sequence length and each position x_i ∈ R^4 contains four dimensions: bar, sub-beat, pitch and duration; each dimension is embedded separately, and the embeddings of the four dimensions are concatenated to obtain the embedded representation e_i of position x_i:

e_i = Concat(x_i^1 W^1, x_i^2 W^2, x_i^3 W^3, x_i^4 W^4)   (1)

where x_i^k ∈ R^{1×N_k} is the one-hot representation of the k-th attribute in the i-th event group, W^k ∈ R^{N_k×H_k} is the embedding matrix of the k-th attribute within the group, N_k is the number of events of the k-th attribute, H_k is the attribute embedding dimension, Concat denotes the tensor concatenation operation, and x_i^k and W^k are both two-dimensional matrices;

after each dimension is embedded, the concatenated embedding vector is mapped by a linear layer to the dimension H required by the feature extractor:

g_i = e_i W   (2)

where W ∈ R^{(H_1+H_2+H_3+H_4)×H} is the linear layer weight parameter and g_i ∈ R^H is the embedded representation after the linear layer;

when the symbolic music encoding module uses the Ferreira representation to encode symbolic music into an event sequence, each event is embedded directly to obtain g_i:

g_i = x_i W_E   (3)

where x_i ∈ R^V is the one-hot representation of the i-th event, V is the event vocabulary size, and W_E ∈ R^{V×H} is the embedding matrix;

position information is then added to obtain a position embedding; the position information p_i ∈ R^H of the i-th position is computed as:

p_i = Z_i W_P   (4)

where Z_i ∈ R^{1×n} is the one-hot encoding of position i and W_P ∈ R^{n×H} is the position embedding matrix;

adding the event embedding to the position embedding yields the final input embedded representation I_i ∈ R^H for the i-th event:

I_i = g_i + p_i   (5).
Further, in the convergence layer, the step of obtaining the aggregated vector comprises:

the input of the convergence layer is l_2 ∈ R^{n×H}, where n is the sequence length and H is the hidden dimension output by the feature extractor; for the sequence-level emotion and tonality tasks, the convergence layer computes the attention weights a according to an attention mechanism:

a = softmax(w_{a2} tanh(W_{a1} l_2^T))   (6)

where W_{a1} ∈ R^{d_a×H}, tanh is an activation function, w_{a2} ∈ R^{1×d_a}, d_a is a tunable hyperparameter, and T denotes the transpose of a matrix or vector;

after obtaining the attention weights a ∈ R^n, they are multiplied with the context-aware representation sequence l_2 to obtain the attention-weighted aggregated vector l_3 ∈ R^H:

l_3 = a l_2   (7)

and for the note-level dynamics classification task, the convergence layer obtains the aggregated vectors by identity mapping.
Further, in the emotion classifier and the tonality classifier, emotion and tonality labels are predicted as:

P_t(c_t | l_3) = softmax(φ_t(l_3))   (8)

where P_t(c_t | l_3) is the predicted label distribution of task t, φ_t denotes the classifier of task t, and c_t is the label type of task t.

Further, in the dynamics classifier, the dynamics label is predicted as:

P_i(c | H_i) = softmax(φ_v(H_i))   (9)

where P_i(c | H_i) is the predicted dynamics distribution of the i-th note, φ_v is the classifier of the dynamics classification task, H_i is the context representation of the i-th note, and c is the dynamics label type.
Further, the loss function in system learning is:

when the symbolic music encoding module uses the CP representation to encode symbolic music into event groups, let L_1, L_2 and L_3 be the loss functions of emotion, tonality and dynamics classification respectively; the adaptive multi-task loss function of the system is:

L = Σ_t 1/(2σ_t^2) · L_t + Σ_t log σ_t   (10)

where L_t is the loss of the t-th task and σ_t is a learnable parameter that balances the losses of the different tasks;

when the symbolic music encoding module uses the Ferreira representation to encode symbolic music into an event sequence, the system only has emotion classification and tonality classification losses, and L_t in formula (10) only denotes the emotion and tonality classification losses.
The invention also provides a symbolic music emotion classification method based on auxiliary tasks using the above system, comprising a pre-training stage and a fine-tuning stage, wherein the pre-training stage uses a language model or a masked language model to directly reconstruct the original input and may additionally perform sequence-level prediction of the tonality and note-level prediction of the dynamics, and the fine-tuning stage performs sequence-level prediction of the tonality and note-level prediction of the dynamics.
Further, the pre-training stage comprises the steps of:

encoding the pre-training dataset as symbolic music, and obtaining the dynamics label of each note and the tonality label of each work;

inputting the encoded symbolic music representation into the embedding layer to obtain the embedded representation that is input to the feature extractor;

inputting the embedded representation into the feature extractor to learn the context representation of the sequence;

if only a language model or a masked language model is used, directly reconstructing the original input; optionally, performing sequence-level prediction of the tonality and note-level prediction of the dynamics;

calculating the loss function;

propagating the gradients and updating the parameters;

calculating the metric on the validation set of the pre-training dataset, iterating several times, and saving the model with the best metric.
Further, the fine-tuning stage comprises the steps of:

encoding the fine-tuning dataset as symbolic music data to obtain emotion, tonality and dynamics labels;

loading the model parameters saved in the pre-training stage as the initial parameters of the fine-tuning stage;

inputting the encoded symbolic music representation into the embedding layer of the model to obtain the embedded representation that is input to the feature extractor;

inputting the embedded representation into the feature extractor to learn the context representation of the sequence;

inputting the context representation into the convergence layer and the classifiers for emotion, tonality and dynamics classification;

calculating the loss function and learning the parameters by back propagation;

calculating the metric and saving the model with the best metric;

loading the model parameters with the best fine-tuning-stage metric and performing emotion prediction on symbolic music data with unknown labels.
Compared with the prior art, the invention has the following beneficial effects:

The method and system provided by the invention learn three tasks jointly; the tasks constrain one another directly, so features beneficial to all tasks, i.e. "general knowledge" of the music domain, can be learned. Using emotion-related auxiliary tasks improves emotion recognition accuracy: through multi-task learning the model learns emotion-related representations better, thereby improving its accuracy on the emotion recognition task.
Drawings
Fig. 1 is a schematic diagram of a symbolic music emotion classification system based on auxiliary tasks according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a method for coding (representing) symbolic music according to an embodiment of the present invention.
Fig. 3 is a flowchart of a pre-training stage of a symbolic music emotion classification method based on auxiliary tasks according to an embodiment of the present invention.
Fig. 4 is a flowchart of a fine tuning stage of a symbolic music emotion classification method based on auxiliary tasks according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The idea of the invention is as follows: the structure of music is closely connected with emotion. The key of music has a great influence on emotional expression: music in a major key is generally considered happy and positive, while music in a minor key is generally considered sad and negative. The loudness of music is considered to be related to its emotion: loud music is generally considered happy and positive, quiet music sad and negative. However, loudness is a concept of the audio domain, and how to compute loudness in the symbolic domain is still an unsolved problem. Recent studies have shown that there is a linear relationship between the dynamics of notes in the symbolic domain and loudness in the audio domain, which implies that dynamics also has an important effect on emotion. In view of this association between musical structure information and musical emotion, the invention proposes a multi-task framework in which the emotion classification main task and emotion-related auxiliary tasks are learned jointly.
As shown in FIG. 1, the auxiliary-task-based music emotion classification system provided by the embodiment of the invention comprises a symbolic music encoding module, an embedding layer, a feature extractor, a convergence layer, an emotion classifier, a tonality classifier and a dynamics classifier. The symbolic music encoding module is not part of the neural network; it is an encoding tool that encodes symbolic music stored in files such as MIDI or MusicXML into a series of events. The embedding layer, the feature extractor, the convergence layer, the emotion classifier, the tonality classifier and the dynamics classifier are components of the neural network and are parameter modules learned by gradient descent. The three tasks of emotion classification, tonality classification and dynamics classification share the same symbolic music encoding module, embedding layer and feature extractor. The function and alternatives of the individual components are described in more detail below.
A. Symbol music coding
The symbolic music representation method encodes symbolic music into a text-like event sequence. After symbolic music has been encoded into an event sequence, the resulting sequence can be treated like text in natural language processing and processed with natural language processing methods. As shown in fig. 2, the invention uses the existing CP representation (Compound word representation) and the representation proposed by Ferreira as examples to describe the encoding schemes that can be used for symbolic music encoding, and illustrates the feasible structural designs of the proposed system through these two different encoding schemes.
In the present invention, the CP representation aggregates the events of the symbolic music encoding into groups; each group encodes the Pitch and Duration of a note, the Measure in which the note starts to sound (Onset), and the Sub-beat within the measure. In some embodiments of the invention, the musical fragment shown in fig. 2(a) is encoded by the CP representation into the event sequence shown in fig. 2(b). Each column is a group, and time advances from left to right. The representation divides one beat into 4 sub-beats to mark the position of each note within a bar. The smallest duration unit is a 32nd note, so a quarter note corresponds to 8 duration units. Whenever a new bar begins, it is marked as Bar (new); otherwise it is marked as Bar (continuation), meaning the current bar continues.
In the invention, the Ferreira representation encodes symbolic music as {v_i, d_i, p_i}_{i=1}^{n}, where v_i, d_i and p_i respectively denote the dynamics (velocity), duration and pitch of the i-th note, and n is the number of event groups in the sequence. In some embodiments of the invention, as shown in fig. 2(c), the event "v_76" indicates that the velocity of the note is 76, "d_8" indicates that the note lasts 8 32nd-note units, i.e. the duration of a quarter note, "n_64" indicates that the pitch of the note is 64, and "a" indicates that time advances by one quarter note.
The CP and Ferreira representations differ mainly in two respects: first, the CP representation aggregates events into groups, one group per note; second, the Ferreira representation encodes the dynamics information, whereas the CP representation used in symbolic music emotion classification ignores it. Because of these two differences, the framework differs in the design of the embedding layer, the classifiers and the loss function.
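To make the two encodings concrete, the following Python sketch converts a hypothetical list of notes into Ferreira-style event tokens and CP-style groups. The Note structure, the 32nd-note time grid, the 4/4 metre and the 4 sub-beats per beat are assumptions for illustration; they are not the exact vocabularies of the cited representations.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Note:
    pitch: int      # MIDI pitch number
    velocity: int   # MIDI velocity (dynamics), 0-127
    start: int      # onset time in 32nd-note steps
    duration: int   # duration in 32nd-note steps

def to_ferreira_events(notes: List[Note]) -> List[str]:
    """Ferreira-style token stream: velocity, duration and pitch tokens per note,
    with 'a' tokens advancing time by one quarter note (8 32nd-note steps)."""
    events, clock = [], 0
    for note in sorted(notes, key=lambda n: n.start):
        while clock + 8 <= note.start:          # advance the running clock
            events.append("a")
            clock += 8
        events += [f"v_{note.velocity}", f"d_{note.duration}", f"n_{note.pitch}"]
    return events

def to_cp_groups(notes: List[Note]) -> List[Tuple[int, int, int, int]]:
    """CP-style groups (bar, sub-beat, pitch, duration), one group per note.
    Assumes 4/4 time and 4 sub-beats per beat, i.e. 16 sub-beats of 2 steps per bar."""
    groups = []
    for note in sorted(notes, key=lambda n: n.start):
        bar, sub_beat = divmod(note.start // 2, 16)   # 2 32nd-note steps per sub-beat
        groups.append((bar, sub_beat, note.pitch, note.duration))
    return groups
```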
B. Embedding layer
After step A, the symbolic music has been encoded with either the CP representation or the Ferreira representation, yielding a sequence of event groups or events.
For the CP representation, the encoded sequence is x = {x_1, ..., x_n}, where n is the sequence length. Each position x_i ∈ R^4 contains four dimensions (bar, sub-beat, pitch and duration), and the four attributes are embedded separately; after the embedded representation of each dimension is obtained, they are concatenated to obtain the embedded representation e_i of position x_i:

e_i = Concat(x_i^1 W^1, x_i^2 W^2, x_i^3 W^3, x_i^4 W^4)   (1)

where x_i^k ∈ R^{1×N_k} is the one-hot representation of the k-th attribute in the i-th event group, W^k ∈ R^{N_k×H_k} is the embedding matrix of the k-th attribute within the group, N_k is the number of events of the k-th attribute (i.e., the vocabulary size of that attribute), H_k is the attribute embedding dimension, Concat denotes the tensor concatenation operation, and x_i^k and W^k are both two-dimensional matrices.

After each attribute is embedded, the concatenated embedding vector is mapped by a linear layer to the dimension H required by the feature extractor:

g_i = e_i W   (2)

where W ∈ R^{(H_1+H_2+H_3+H_4)×H} is the linear layer weight parameter and g_i ∈ R^H is the embedded representation after the linear layer.

If the Ferreira representation is used, an event sequence is obtained, and each event is embedded directly to obtain g_i:

g_i = x_i W_E   (3)

where x_i ∈ R^V is the one-hot representation of the i-th event, V is the event vocabulary size, and W_E ∈ R^{V×H} is the embedding matrix.

Since the feature extractor is based on a self-attention mechanism, which does not itself carry position information, shuffling the events in the sequence would have no effect on the attention mechanism. However, the event sequence obtained by encoding symbolic music is obviously time-dependent, so position information must be added to the embedded representation that is fed into the feature extractor. The position embedding is learnable during training, and the position information p_i ∈ R^H of the i-th position is computed as:

p_i = Z_i W_P   (4)

where Z_i ∈ R^{1×n} is the one-hot encoding of position i and W_P ∈ R^{n×H} is the position embedding matrix.

Through position encoding, each position obtains a position vector corresponding to that position. Adding the event embedding to the position embedding yields the final input embedded representation I_i ∈ R^H for the i-th event:

I_i = g_i + p_i   (5)

The input representation l_1 of the entire event sequence (the input to the Transformer model) is obtained through the above steps and can then be fed into the feature extractor for learning.
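A minimal PyTorch sketch of this CP-style embedding layer (formulas (1), (2), (4) and (5)) is given below; the vocabulary sizes, attribute embedding dimensions, hidden size and maximum length are placeholder assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class CPEmbedding(nn.Module):
    """Per-attribute embeddings -> concatenation -> linear projection -> + position embedding."""
    def __init__(self, vocab_sizes=(16, 16, 128, 64), attr_dims=(32, 32, 128, 64),
                 hidden=256, max_len=512):
        super().__init__()
        # one embedding table per CP attribute: bar, sub-beat, pitch, duration (formula 1)
        self.attr_emb = nn.ModuleList(
            [nn.Embedding(v, d) for v, d in zip(vocab_sizes, attr_dims)])
        self.proj = nn.Linear(sum(attr_dims), hidden)   # formula (2)
        self.pos_emb = nn.Embedding(max_len, hidden)    # learnable positions, formula (4)

    def forward(self, x):            # x: (batch, seq_len, 4) integer attribute indices
        e = torch.cat([emb(x[..., k]) for k, emb in enumerate(self.attr_emb)], dim=-1)
        g = self.proj(e)                                # (batch, seq_len, hidden)
        pos = torch.arange(x.size(1), device=x.device)
        return g + self.pos_emb(pos)                    # formula (5): I_i = g_i + p_i
```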
C. Feature extractor
The feature extractor of the invention may be either a Transformer-based feature extractor or a long short-term memory network (LSTM).

In some embodiments of the present application, a Transformer-based feature extractor is employed, i.e. the Transformer model acts as the feature extractor of the system. The Transformer comprises two modules: an encoder and a decoder. For the symbolic music emotion classification task, however, the two modules are not used at the same time: existing approaches use either only the encoder structure or only the decoder structure as the feature extractor, and the framework proposed by the application is applicable to both. Given the input sequence l_1 = {I_1, ..., I_n}, where I_i ∈ R^H, the output after the feature extractor is the representation sequence with context information l_2 = {H_1, ..., H_n}, where H_i ∈ R^H is the context representation of the i-th event.
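As a sketch, an encoder-only feature extractor of this kind can be assembled from standard PyTorch modules as follows; the number of layers, heads and the feed-forward size are illustrative assumptions.

```python
import torch.nn as nn

def build_feature_extractor(hidden=256, heads=4, layers=4, ff=1024, dropout=0.1):
    """Encoder-only Transformer mapping l1 (batch, n, hidden) to l2 (batch, n, hidden)."""
    layer = nn.TransformerEncoderLayer(
        d_model=hidden, nhead=heads, dim_feedforward=ff,
        dropout=dropout, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

# usage (l1 comes from the embedding layer sketch above):
# l2 = build_feature_extractor()(l1)
```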
D. Convergence layer and classifier
The emotion classification system provided by the invention comprises one main task and two auxiliary tasks. The main task is emotion recognition of symbolic music; the first auxiliary task is classifying the tonality of the music; and the second auxiliary task is classifying the dynamics of each note. These three tasks are described in detail below, and the design of the convergence layer is given.

The emotion classification of music is either the task of classifying valence and arousal based on the valence-arousal model, or the task of classifying emotion into four categories: happy (high valence, high arousal), angry or fearful (low valence, high arousal), sad (low valence, low arousal) and calm (high valence, low arousal).

The tonality of music is the combination of the tonic and the mode. The tonic is the most important tone of a key, and any of the twelve pitch classes can serve as the tonic. The mode is the arrangement of the scale and is divided into major and minor, which differ in the arrangement of their scale degrees. Combining tonic and mode gives a total of 12 × 2 = 24 keys.

The dynamics classification of notes predicts the dynamics of each note; the dynamics can be divided into six categories from soft to loud: pp, p, mp, mf, f, ff.

Since both the emotion classification and the tonality classification of music are performed on a whole sequence, while the dynamics classification of notes is performed on single notes, the sequence information must be aggregated before each task is classified. For the sequence-level tasks (emotion classification and tonality classification), the aggregated result should be a single vector per sequence that integrates the information of the whole sequence. For the note-level task (dynamics classification of notes), each note has its own corresponding aggregated vector.
Referring to fig. 1, let l_2 ∈ R^{n×H} be the input of the convergence layer, where n is the sequence length and H is the hidden dimension output by the feature extractor. For the sequence-level tasks, the convergence layer computes the attention weights a according to an attention mechanism:

a = softmax(w_{a2} tanh(W_{a1} l_2^T))   (6)

where W_{a1} ∈ R^{d_a×H}, tanh is an activation function, w_{a2} ∈ R^{1×d_a}, d_a is a tunable hyperparameter, and T denotes the transpose of a matrix or vector.

After obtaining the attention weights a ∈ R^n, they are multiplied with the context-aware representation sequence l_2 to obtain the attention-weighted aggregated vector l_3 ∈ R^H:

l_3 = a l_2   (7)

For the note-level classification task, the convergence layer uses identity mapping.
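A minimal PyTorch sketch of this attention-based convergence for the sequence-level tasks, assuming the structured self-attention form of formula (6) reconstructed above, could be:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Aggregates l2 (batch, n, H) into l3 (batch, H) using learned attention weights a."""
    def __init__(self, hidden=256, d_a=128):
        super().__init__()
        self.w1 = nn.Linear(hidden, d_a, bias=False)   # W_a1 in formula (6)
        self.w2 = nn.Linear(d_a, 1, bias=False)        # w_a2 in formula (6)

    def forward(self, l2):
        a = torch.softmax(self.w2(torch.tanh(self.w1(l2))).squeeze(-1), dim=-1)  # (batch, n)
        return torch.einsum("bn,bnh->bh", a, l2)       # formula (7): l3 = a * l2
```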
The classifiers of the three tasks are designed as combinations of several fully connected layers and activation functions. Given the representation l_3 of a symbolic music sequence, the sequence-level tasks (emotion and tonality) predict their labels by:

P_t(c_t | l_3) = softmax(φ_t(l_3))   (8)

where φ_t denotes the classifier of task t, P_t(c_t | l_3) denotes the predicted label distribution of task t, and c_t is the label type of task t (e.g., the 24 keys of tonality classification: 12 major and 12 minor).

The note-level classification task is designed only for the system using the CP representation. Given the sequence representation l_2 = {H_1, ..., H_n}, the dynamics of the i-th note is predicted by:

P_i(c | H_i) = softmax(φ_v(H_i))   (9)

where φ_v is the classifier of the dynamics classification task, P_i(c | H_i) is the predicted dynamics distribution of the i-th note, and c is the dynamics label type (6 categories from soft to loud).
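The heads of formulas (8) and (9) can then be sketched as small fully connected stacks; the class counts follow the text (4 emotion categories, 24 keys, 6 dynamics levels), while the hidden sizes and the two-layer structure are assumptions.

```python
import torch.nn as nn

def make_head(hidden=256, num_classes=4):
    """Fully connected classifier head; the softmax is applied inside the loss function."""
    return nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, num_classes))

emotion_head  = make_head(num_classes=4)    # applied to l3, formula (8)
tonality_head = make_head(num_classes=24)   # applied to l3, formula (8)
dynamics_head = make_head(num_classes=6)    # applied to each H_i, formula (9)
```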
E. Loss function
The framework provided by the invention performs multi-task joint learning, in which emotion classification is the main task and tonality classification and dynamics classification are auxiliary tasks. If the Ferreira representation is used, the encoding already contains the dynamics information, so dynamics classification is not performed in order to avoid label leakage; other encodings that do not leak the dynamics information can use both the tonality and the dynamics classification. All classification tasks use the cross-entropy loss function.

If the CP representation is used, let L_1, L_2 and L_3 be the loss functions of emotion, tonality and dynamics classification respectively; the total loss function of the framework is:

L = Σ_t 1/(2σ_t^2) · L_t + Σ_t log σ_t   (10)

where L_t is the loss of the t-th task and σ_t is a learnable parameter that balances the losses of the different tasks.

Formula (10) is an adaptive multi-task loss function, whose second term acts as a regularization term.

If the Ferreira representation is used, the system only has the emotion classification and tonality classification losses, and L_t in formula (10) only denotes the emotion and tonality classification losses.
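Assuming the uncertainty-weighting form of formula (10) reconstructed above, a minimal PyTorch sketch of the adaptive multi-task loss is:

```python
import torch
import torch.nn as nn

class AdaptiveMultiTaskLoss(nn.Module):
    """L = sum_t L_t / (2 * sigma_t^2) + sum_t log(sigma_t), with learnable sigma_t."""
    def __init__(self, num_tasks=3):
        super().__init__()
        # learn log(sigma_t) for numerical stability; sigma_t = exp(log_sigma_t)
        self.log_sigma = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):      # task_losses: sequence of scalar task losses
        losses = torch.stack(list(task_losses))
        return (losses / (2.0 * torch.exp(2.0 * self.log_sigma)) + self.log_sigma).sum()
```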
The invention also provides an emotion classification method by adopting the system.
The classification method using this system comprises two stages: a pre-training stage and a fine-tuning stage. In order to be compatible with models that have already been pre-trained, the proposed system may use the auxiliary tasks only in the fine-tuning stage; using the auxiliary tasks in both the pre-training and fine-tuning stages gives better results.
The main flow of the two-stage system is described next.
A. Pretraining stage
The system is typically pre-trained on a larger dataset without emotion labels, and the pre-training task may be a Language Model (LM) or a Masked Language Model (MLM). The tonality classification and dynamics classification proposed by the system are equally applicable to the pre-training stage. As shown in fig. 3, the steps of this stage are:
First, the pre-training dataset is encoded with the symbolic music encoding; at the same time, the dynamics label of each note is obtained from the dynamics information encoded in the symbolic music file (e.g., MIDI or MusicXML), and the key of each work is computed as its tonality label using a key-analysis method (a hedged sketch of this label extraction is given after this list of steps). In some embodiments of the invention, the key of the music is obtained with the Krumhansl-Kessler algorithm.
Second, the encoded symbolic music representation is fed into the embedding layer of the model to obtain the embedded representation that is input to the feature extractor;

Third, the embedded representation is input into the feature extractor to learn the context representation of the sequence;

Fourth, if only a language model or a masked language model is used, the original input is reconstructed directly; if the auxiliary tasks proposed by the system are used, sequence-level prediction is performed for the tonality and note-level prediction for the dynamics. The auxiliary tasks proposed by the application must be used in the fine-tuning stage; in the pre-training stage they are optional, and using them gives better results. Since many models pre-trained with only a language model or masked language model have been published, re-pre-training with the present framework would be too costly, so the proposed auxiliary tasks can be added only in the fine-tuning stage. Because the embedding layer and feature extractor of the proposed system can flexibly adopt a pre-trained model, in the fine-tuning stage it suffices to load the pre-trained embedding layer and feature extractor and attach the classifiers of the proposed auxiliary tasks to the back end of the model.
Fifth, the loss function is calculated; if the auxiliary tasks proposed by the system are also learned in the pre-training stage, the total loss function is calculated with formula (10), with the reconstruction loss replacing the emotion classification loss.

In some embodiments of the invention, for the Ferreira representation the reconstruction loss is the cross entropy between the distribution predicted by the pre-training model for the masked events and the real events; for the CP representation it is the mean of the cross entropies between the predicted distribution of each attribute in the masked event group and the real attributes.

Sixth, the gradients are propagated and the parameters are updated, including the embedding layer, the feature extractor, the convergence layer, the classifiers and σ_t in formula (10).

Seventh, a metric (e.g., validation loss or reconstruction accuracy) is calculated on the validation set of the pre-training dataset. If no better metric is obtained after a specified number of iterations, learning is stopped and the model with the best metric is saved. When pre-training with a language model or masked language model, part of the input sequence is masked and must be predicted; the reconstruction accuracy is the accuracy of predicting the masked part.
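For the first step, a hedged sketch of extracting per-note dynamics labels and a work-level key label with the third-party libraries pretty_midi and music21 is shown below; music21's analyze('key') performs Krumhansl-style key finding, and the velocity-to-dynamics bin edges are an assumption, not values from the patent.

```python
import pretty_midi
from music21 import converter

DYNAMICS_BINS = [(31, "pp"), (47, "p"), (63, "mp"), (79, "mf"), (95, "f"), (127, "ff")]

def velocity_to_dynamics(velocity: int) -> str:
    """Map a MIDI velocity (0-127) to one of six dynamics classes (assumed bin edges)."""
    for upper, label in DYNAMICS_BINS:
        if velocity <= upper:
            return label
    return "ff"

def extract_labels(midi_path: str):
    """Per-note dynamics labels plus a work-level key label for the pre-training stage."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    note_dynamics = [velocity_to_dynamics(n.velocity)
                     for inst in pm.instruments for n in inst.notes]
    key = converter.parse(midi_path).analyze("key")   # Krumhansl-style key analysis
    return note_dynamics, f"{key.tonic.name} {key.mode}"
```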
B. Fine tuning stage
The steps of the fine-tuning stage are similar to those of the pre-training stage, except that the main task of the fine-tuning stage is emotion classification and the learning rate is smaller than in the pre-training stage. As shown in fig. 4, this stage comprises the following steps:
First, the fine-tuning dataset is encoded as symbolic music data, and the emotion, tonality and dynamics labels are obtained;

Second, the pre-trained model parameters are loaded as the initial parameters of the fine-tuning stage (the model may be an existing pre-trained model rather than one trained with the auxiliary tasks proposed by the system);

Third, the encoded symbolic music representation is fed into the embedding layer of the model to obtain the embedded representation that is input to the feature extractor;

Fourth, the embedded representation is input into the feature extractor to learn the context representation of the sequence;

Fifth, the context representation is fed into the convergence layer and the classifiers for emotion, tonality and dynamics classification;

Sixth, the loss function is calculated with formula (10) and the parameters are learned by back propagation (a minimal sketch of one such training step is given after this list of steps);

Seventh, the metric (emotion classification accuracy) is calculated on the validation set of the fine-tuning dataset, and the model with the best metric is saved.

Eighth, the model parameters with the best fine-tuning-stage metric are loaded, and emotion prediction is performed on symbolic music data with unknown labels.
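A minimal sketch of one fine-tuning step under the CP representation, combining the component sketches introduced above (the module and head names are the assumptions used in those sketches, not names from the patent):

```python
import torch.nn.functional as F

def finetune_step(batch, embed, extractor, pool, heads, mtl_loss, optimizer):
    """One multi-task update: emotion (main task), tonality and dynamics (auxiliary tasks)."""
    x, y_emotion, y_key, y_dyn = batch          # y_dyn: (batch, n) per-note dynamics labels
    l2 = extractor(embed(x))                    # context representations, (batch, n, H)
    l3 = pool(l2)                               # sequence-level aggregated vector, (batch, H)

    loss_emotion = F.cross_entropy(heads["emotion"](l3), y_emotion)
    loss_key     = F.cross_entropy(heads["tonality"](l3), y_key)
    loss_dyn     = F.cross_entropy(heads["dynamics"](l2).transpose(1, 2), y_dyn)

    loss = mtl_loss([loss_emotion, loss_key, loss_dyn])   # adaptive loss, formula (10)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```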
The existing symbolic music emotion recognition methods mainly adopt single-task learning and ignore the close connection between musical structure and emotion. The tonality and dynamics classification auxiliary tasks proposed by the invention improve the model's learning of emotion-related musical structure, thereby improving its accuracy on the emotion recognition task.
Multi-task joint learning is performed with the tonality and dynamics auxiliary tasks and the emotion classification main task, and the tasks assist and constrain one another. During multi-task joint learning with the two auxiliary tasks, the model learns as much knowledge beneficial to all tasks as possible, so overfitting can be effectively reduced even when the amount of data is small. In addition, among the auxiliary tasks of the invention, the tonality information can be extracted with existing mature methods and the dynamics information can be read directly from the symbolic music file; the system is therefore applicable to any symbolic music dataset with emotion labels.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A symbolic music emotion classification system based on auxiliary tasks, characterized by comprising a symbolic music encoding module, an embedding layer, a feature extractor, a convergence layer, an emotion classifier, a tonality classifier and a dynamics classifier, wherein:

the symbolic music encoding module is used for encoding symbolic music into event groups or an event sequence;

the embedding layer is used for embedding the event groups or the event sequence to obtain an event embedding, adding position information to obtain a position embedding, and adding the event embedding and the position embedding to obtain the final embedded representation l_1;

the feature extractor is used for extracting features from the embedded representation and outputting a representation sequence l_2 with context information;

the convergence layer is used for aggregating the sequence information, wherein for the emotion classification main task and the tonality classification auxiliary task, the convergence layer multiplies the attention weights with the representation sequence l_2 to obtain an aggregated vector l_3, and for the note dynamics classification auxiliary task, the convergence layer uses identity mapping to obtain the aggregated vector corresponding to each note;

the emotion classifier and the tonality classifier are used for predicting the emotion category and the key from the aggregated vector l_3, and the dynamics classifier is used for predicting the dynamics from the aggregated vector corresponding to each note;

wherein, in the embedding layer, the process of obtaining the final embedded representation l_1 is:

when the symbolic music encoding module uses the CP representation to encode symbolic music into event groups, the encoded sequence is x = {x_1, ..., x_n}, where n is the sequence length and each position x_i ∈ R^4 contains four dimensions: bar, sub-beat, pitch and duration; each dimension is embedded separately, and the embeddings of the dimensions are concatenated to obtain the embedded representation e_i of position x_i:

e_i = Concat(x_i^1 W^1, x_i^2 W^2, x_i^3 W^3, x_i^4 W^4)   (1)

where x_i^k ∈ R^{1×N_k} is the one-hot representation of the k-th attribute in the i-th event group, W^k ∈ R^{N_k×H_k} is the embedding matrix of the k-th attribute within the group, N_k is the number of events of the k-th attribute, H_k is the attribute embedding dimension, Concat denotes the tensor concatenation operation, and x_i^k and W^k are both two-dimensional matrices;

after each dimension is embedded, the concatenated embedding vector is mapped to the dimension H required by the feature extractor:

g_i = e_i W   (2)

where W ∈ R^{(H_1+H_2+H_3+H_4)×H} is the linear layer weight parameter and g_i ∈ R^H is the embedded representation after the linear layer;

when the symbolic music encoding module uses the Ferreira representation to encode symbolic music into an event sequence, each event is embedded directly to obtain g_i:

g_i = x_i W_E   (3)

where x_i ∈ R^V is the one-hot representation of the i-th event, V is the event vocabulary size, and W_E ∈ R^{V×H} is the embedding matrix;

position information is added to obtain a position embedding; the position information p_i of the i-th position is computed as:

p_i = Z_i W_P   (4)

where Z_i ∈ R^{1×n} is the one-hot encoding of position i and W_P ∈ R^{n×H} is the position embedding matrix;

adding the event embedding to the position embedding yields the final input embedded representation I_i for the i-th event:

I_i = g_i + p_i   (5).
2. The auxiliary-task-based symbolic music emotion classification system of claim 1, wherein the symbolic music encoding method adopted by the symbolic music encoding module comprises either the CP representation or the Ferreira representation, wherein the CP representation gathers the events of the symbolic music encoding into groups, each group encoding the pitch and duration of a note, the bar in which the note starts to sound, and the sub-beat within the bar, and the Ferreira representation encodes symbolic music as {v_i, d_i, p_i}_{i=1}^{n}, where v_i, d_i and p_i respectively denote the dynamics, duration and pitch of the i-th note, and n denotes that the event sequence contains n event groups in total.
3. The auxiliary-task-based symbolic music emotion classification system of claim 1, wherein, in the convergence layer, the step of obtaining the aggregated vector comprises:

the input of the convergence layer is l_2 ∈ R^{n×H}, where n is the sequence length and H is the hidden dimension output by the feature extractor; for the sequence-level emotion and tonality tasks, the convergence layer computes the attention weights a according to an attention mechanism:

a = softmax(w_{a2} tanh(W_{a1} l_2^T))   (6)

where W_{a1} ∈ R^{d_a×H}, tanh is an activation function, w_{a2} ∈ R^{1×d_a}, d_a is a tunable hyperparameter, and T denotes the transpose of a matrix or vector;

after obtaining the attention weights a ∈ R^n, they are multiplied with the context-aware representation sequence l_2 to obtain the attention-weighted aggregated vector l_3:

l_3 = a l_2   (7)

and for the note-level dynamics classification task, the convergence layer obtains the aggregated vectors by identity mapping.
4. The auxiliary-task-based symbolic music emotion classification system of claim 1, wherein the emotion classifier and the tonality classifier predict emotion and tonality labels as:

P_t(c_t | l_3) = softmax(φ_t(l_3))   (8)

where P_t(c_t | l_3) is the predicted label distribution of task t, φ_t denotes the classifier of task t, and c_t is the label type of task t.
5. The auxiliary-task-based symbolic music emotion classification system of claim 1, wherein the dynamics classifier predicts the dynamics label as:

P_i(c | H_i) = softmax(φ_v(H_i))   (9)

where P_i(c | H_i) is the predicted dynamics distribution of the i-th note, φ_v is the classifier of the dynamics classification task, and c is the dynamics label type.
6. The auxiliary-task-based symbolic music emotion classification system of any one of claims 1-5, wherein the loss function in system learning is:

when the symbolic music encoding module uses the CP representation to encode symbolic music into event groups, let L_1, L_2 and L_3 be the loss functions of emotion, tonality and dynamics classification respectively; the adaptive multi-task loss function of the system is:

L = Σ_t 1/(2σ_t^2) · L_t + Σ_t log σ_t   (10)

where L_t is the loss of the t-th task and σ_t is a learnable parameter that balances the losses of the different tasks;

when the symbolic music encoding module uses the Ferreira representation to encode symbolic music into an event sequence, the system only has emotion classification and tonality classification losses, and L_t in formula (10) only denotes the emotion and tonality classification losses.
7. A symbolic music emotion classification method based on auxiliary tasks using the system of any one of claims 1-6, characterized by comprising a pre-training stage and a fine-tuning stage, wherein the pre-training stage uses a language model or a masked language model to directly reconstruct the original input and may additionally perform sequence-level prediction of the tonality and note-level prediction of the dynamics, and the fine-tuning stage performs sequence-level prediction of the tonality and note-level prediction of the dynamics.
8. The auxiliary-task-based symbolic music emotion classification method of claim 7, wherein the pre-training stage comprises the steps of:

encoding the pre-training dataset as symbolic music, and obtaining the dynamics label of each note and the tonality label of each work;

inputting the encoded symbolic music representation into the embedding layer to obtain the embedded representation that is input to the feature extractor;

inputting the embedded representation into the feature extractor to learn the context representation of the sequence;

if only a language model or a masked language model is used, directly reconstructing the original input; optionally, performing sequence-level prediction of the tonality and note-level prediction of the dynamics;

calculating the loss function;

propagating the gradients and updating the parameters;

calculating the metric on the validation set of the pre-training dataset, iterating several times, and saving the model with the best metric.
9. The auxiliary-task-based symbolic music emotion classification method of claim 7 or 8, wherein the fine-tuning stage comprises the steps of:

encoding the fine-tuning dataset as symbolic music data to obtain emotion, tonality and dynamics labels;

loading the model parameters saved in the pre-training stage as the initial parameters of the fine-tuning stage;

inputting the encoded symbolic music representation into the embedding layer of the model to obtain the embedded representation that is input to the feature extractor;

inputting the embedded representation into the feature extractor to learn the context representation of the sequence;

inputting the context representation into the convergence layer and the classifiers for emotion, tonality and dynamics classification;

calculating the loss function and learning the parameters by back propagation;

calculating the metric and saving the model with the best metric;

loading the model parameters with the best fine-tuning-stage metric and performing emotion prediction on symbolic music data with unknown labels.
CN202210296315.XA 2022-03-24 2022-03-24 Symbol music emotion classification system and method based on auxiliary task Active CN114925742B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210296315.XA CN114925742B (en) 2022-03-24 2022-03-24 Symbol music emotion classification system and method based on auxiliary task

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210296315.XA CN114925742B (en) 2022-03-24 2022-03-24 Symbol music emotion classification system and method based on auxiliary task

Publications (2)

Publication Number Publication Date
CN114925742A CN114925742A (en) 2022-08-19
CN114925742B true CN114925742B (en) 2024-05-14

Family

ID=82804425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210296315.XA Active CN114925742B (en) 2022-03-24 2022-03-24 Symbol music emotion classification system and method based on auxiliary task

Country Status (1)

Country Link
CN (1) CN114925742B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019205383A1 (en) * 2018-04-28 2019-10-31 平安科技(深圳)有限公司 Electronic device, deep learning-based music performance style identification method, and storage medium
WO2020201746A1 (en) * 2019-04-03 2020-10-08 Mashtraxx Limited Method of training a neural network to reflect emotional perception, related system and method for categorizing and finding associated content and related digital media file embedded with a multi-dimensional property vector
WO2021051503A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Semantic representation model-based text classification method and apparatus, and computer device
CN113420807A (en) * 2021-06-22 2021-09-21 哈尔滨理工大学 Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111862977B (en) * 2020-07-27 2021-08-10 北京嘀嘀无限科技发展有限公司 Voice conversation processing method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019205383A1 (en) * 2018-04-28 2019-10-31 平安科技(深圳)有限公司 Electronic device, deep learning-based music performance style identification method, and storage medium
WO2020201746A1 (en) * 2019-04-03 2020-10-08 Mashtraxx Limited Method of training a neural network to reflect emotional perception, related system and method for categorizing and finding associated content and related digital media file embedded with a multi-dimensional property vector
WO2021051503A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Semantic representation model-based text classification method and apparatus, and computer device
CN113420807A (en) * 2021-06-22 2021-09-21 哈尔滨理工大学 Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Novel Multi-Task Learning Method for Symbolic Music Emotion Recognition; Jibao Qiu et al.; arXiv:2201.05782v1; 2022-01-17; pp. 1-7 *
Multi-feature fusion music classification algorithm based on deep belief network; Zheng Dan; Electronic Design Engineering; 2020-02-20 (04); pp. 138-142 *

Also Published As

Publication number Publication date
CN114925742A (en) 2022-08-19

Similar Documents

Publication Publication Date Title
Sigtia et al. An end-to-end neural network for polyphonic piano music transcription
Liu et al. Recent progress in the cuhk dysarthric speech recognition system
CN110782870A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN110334354A (en) A kind of Chinese Relation abstracting method
CN110737764A (en) personalized dialogue content generating method
Nazir et al. Mispronunciation detection using deep convolutional neural network features and transfer learning-based model for Arabic phonemes
Deng et al. Foundations and Trends in Signal Processing: DEEP LEARNING–Methods and Applications
CN115641543B (en) Multi-modal depression emotion recognition method and device
Vogl et al. Drum transcription from polyphonic music with recurrent neural networks
Yi et al. Singing voice synthesis using deep autoregressive neural networks for acoustic modeling
CN110223714A (en) A kind of voice-based Emotion identification method
Falahzadeh et al. Deep convolutional neural network and gray wolf optimization algorithm for speech emotion recognition
JP2020038343A (en) Method and device for training language identification model, and computer program for it
Zhao et al. Applications of deep learning to audio generation
CN112183106B (en) Semantic understanding method and device based on phoneme association and deep learning
Yakar et al. Bilevel Sparse Models for Polyphonic Music Transcription.
Palo et al. Comparative analysis of neural networks for speech emotion recognition
Ahmed et al. Acoustic modeling using deep belief network for Bangla speech recognition
Poncelet et al. Low resource end-to-end spoken language understanding with capsule networks
Kaur et al. Impact of feature extraction and feature selection algorithms on Punjabi speech emotion recognition using convolutional neural network
Wang et al. Enhance the word vector with prosodic information for the recurrent neural network based TTS system
Bakhshi et al. Recognition of emotion from speech using evolutionary cepstral coefficients
Agarla et al. Semi-supervised cross-lingual speech emotion recognition
CN114925742B (en) Symbol music emotion classification system and method based on auxiliary task
CN111754962A (en) Folk song intelligent auxiliary composition system and method based on up-down sampling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant