CN114925742A - Symbolic music emotion classification system and method based on auxiliary task - Google Patents

Symbolic music emotion classification system and method based on auxiliary tasks

Info

Publication number
CN114925742A
CN114925742A (application CN202210296315.XA / CN202210296315A)
Authority
CN
China
Prior art keywords
music
emotion
representation
task
embedding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210296315.XA
Other languages
Chinese (zh)
Inventor
陈俊龙
邱际宝
张通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210296315.XA priority Critical patent/CN114925742A/en
Publication of CN114925742A publication Critical patent/CN114925742A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention provides a symbolic music emotion classification system and method based on auxiliary tasks. The system comprises a symbolic music encoding module, an embedding layer, a feature extractor, a convergence layer, an emotion classifier, a tonality classifier and a dynamics classifier. The method comprises a pre-training stage and a fine-tuning stage; the auxiliary tasks may be adopted only in the fine-tuning stage, or in both the pre-training and fine-tuning stages. By using auxiliary tasks related to emotion, and by letting the model learn emotion-related representations through multi-task learning, the method improves the accuracy of the model on the emotion recognition task.

Description

Symbolic music emotion classification system and method based on auxiliary tasks
Technical Field
The invention belongs to the field of emotion classification, and particularly relates to a symbolic music emotion classification method and system based on auxiliary tasks.
Background
Music has been closely linked to emotion since its origin, and some scholars hold that music itself arose to express emotion. Emotion recognition in music is therefore an important direction in the psychological study of music. Music exists in many forms, such as audio-modality music (presented in audio form and stored as mp3, wav and similar files) and symbolic-modality music (presented as score-like symbolic representations and stored as MIDI, MusicXML and similar files). Symbolic music generally encodes the beat, rhythm, pitch, duration and velocity (dynamics) of the music. Existing research shows that music in the symbolic modality is better suited to automatic emotion classification with machine learning or deep learning models.
According to Russell's dimensional theory of emotion, the emotion of music can be described along two dimensions: Valence and Arousal. Valence indicates whether the emotion is positive or negative, and arousal indicates the intensity of the emotion. Based on this valence-arousal model, the emotion of music can further be divided into four categories: happy (high valence, high arousal), angry or fearful (low valence, high arousal), sad (low valence, low arousal) and calm (high valence, low arousal).
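As an illustration of the valence-arousal quadrants described above, the following minimal sketch maps a (valence, arousal) pair to one of the four emotion classes. The zero thresholds and the class names are illustrative assumptions, not values fixed by the invention.

def va_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) pair to one of Russell's four quadrants.

    Assumes valence and arousal are centred at 0 (positive = high);
    the 0 threshold is an illustrative choice, not part of the patent.
    """
    if valence >= 0 and arousal >= 0:
        return "happy"          # high valence, high arousal
    if valence < 0 and arousal >= 0:
        return "angry/fearful"  # low valence, high arousal
    if valence < 0 and arousal < 0:
        return "sad"            # low valence, low arousal
    return "calm"               # high valence, low arousal


print(va_quadrant(0.7, 0.8))    # -> "happy"
print(va_quadrant(-0.4, -0.6))  # -> "sad"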
Existing symbolic music emotion classification algorithms mainly fall into two types:
a) The first is machine learning methods based on statistical features. Such methods use existing music analysis tools to extract statistical features of symbolic music, such as pitch distribution and melodic intervals, and then feed these features into a machine learning model, such as a support vector machine or a classification tree, for music emotion classification. However, machine learning methods based on statistical features require manual feature extraction, and their accuracy on symbolic music emotion recognition is low. At present, the second type, deep learning methods based on symbolic music representations, is mainly used to recognize the emotion of symbolic music.
b) The second is a deep learning approach based on symbolized musical expressions. The method encodes the symbolic music into a sequence of events. Such as expressing the pronunciation of a note as: pitch of a note, duration of the note, and velocity of the note. This event sequence data is then input into a neural network model (e.g., long and short term neural network) that can process the time series data for emotion classification. In this process, an event is considered to be an expression similar to a word in Natural Language, so the method is generally classified by using some models in the Natural Language Processing (NLP) field. The common training method is to perform unsupervised pre-training on a large data set without emotion labels, and then store the pre-trained weights. After pre-training, the model is considered to have learned some domain knowledge about the music. And finally, loading pre-trained weights on a small data set with emotion labels to perform fine adjustment of emotion classification. The backbone networks in the fine tuning process and the pre-training process are the same, the backbone networks are regarded as feature extractors, and a plurality of fully-connected neural network layers are added behind the feature extractors to be used as classifiers for emotion classification. The learning rate of the neural network in the fine tuning process is generally smaller than that in the pre-training process, so that the universal knowledge obtained by the pre-training model on a large data set is not changed greatly. The pre-training method mainly adopted by the method is a mask language model, namely, random (or next to a currently processed event) events are masked in a pre-training stage, and then the masked events are predicted by using a neural network. The method first appears in the natural language processing domain and then is migrated to the symbolic music emotion classification domain. However, this method of pre-training and then directly fine-tuning does not fully utilize the structural information of the music. And structural information of music such as mode (major and minor), loudness of note, has been shown to be closely related to music emotion in both musical theory and psychological research. Music and psychology consider that the style of music has a large impact on emotional expression, e.g., music with major key is generally considered happy and positive, and music with minor key is generally considered sad and negative. In addition to this, the loudness of music is considered to be related to the emotion of music, music with high loudness is generally considered happy and positive, and music with low loudness is generally considered sad and negative. The deep learning method based on symbolized music expression does not fully utilize the structural information of music, and the obtained emotion classification accuracy is low. Such as the scheme disclosed by Yi-Hui Chou et al in Midibert-Piano Large-scale Pre-training for Symbolic Music exploration, which is a multi-task learning scheme that designs different models for different tasks, does not take into account the connections between tasks, and does not perform tonality and emotion classification.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a symbolic music emotion classification method based on auxiliary tasks, which utilizes emotion-related structural information and improves the accuracy of the model on emotion classification.
In order to achieve the purpose of the invention, the auxiliary-task-based symbolic music emotion classification system provided by the invention comprises a symbolic music encoding module, an embedding layer, a feature extractor, a convergence layer, an emotion classifier, a tonality classifier and a dynamics classifier, wherein:
the symbolic music encoding module is used for encoding the symbolic music into event groups or an event sequence;
the embedding layer is used for embedding the event groups or event sequence to obtain an event embedding representation, adding position information to obtain a position embedding representation, and adding the event embedding representation and the position embedding representation to obtain the final embedding representation l_1;
the feature extractor is used for extracting features from the embedding representation and outputting a token sequence l_2 with context information;
the convergence layer is used for aggregating the information of the sequence: for the emotion classification main task and the tonality classification auxiliary task, the convergence layer multiplies the attention weights with the token sequence l_2 to obtain a convergence vector l_3; for the note dynamics classification auxiliary task, the convergence layer uses an identity mapping to obtain a convergence vector corresponding to each note;
the emotion classifier and the tonality classifier are used for predicting the emotion category and the tonality, respectively, from the convergence vector l_3, and the dynamics classifier is used for predicting the dynamics from the convergence vector corresponding to each note.
Further, the symbolic music encoding method adopted by the symbolic music encoding module is either the CP representation or the Ferreira representation. The CP representation gathers the events encoding the symbolic music into groups, each group encoding the pitch and duration of one note, the bar in which the note onset occurs, and the sub-beat within that bar. The Ferreira representation encodes the symbolic music into an event sequence {(v_i, d_i, n_i)}, i = 1, ..., n, where v_i, d_i and n_i respectively denote the velocity (dynamics), duration and pitch of the i-th note, and n denotes the number of event groups in the sequence.
Further, in the embedding layer, the process of obtaining the final embedding representation l_1 is as follows:
When the symbolic music encoding module uses the CP representation to encode the symbolic music into event groups, the encoded sequence is x = {x_1, ..., x_n}, where n is the sequence length and each position x_i ∈ R^4 contains four dimensions: bar, sub-beat, pitch and duration. Each dimension is embedded separately, and the embeddings of all dimensions are concatenated to obtain the embedding representation e_i of position x_i, expressed as:
e_i = Concat(x_i^1 W^1, x_i^2 W^2, x_i^3 W^3, x_i^4 W^4)   (1)
where x_i^k ∈ R^{N_k} is the one-hot representation of the k-th attribute in the i-th event group, W^k ∈ R^{N_k × H_k} is the embedding matrix of the k-th attribute in the group, N_k is the number of events (vocabulary size) of the k-th attribute, H_k is the embedding dimension of that attribute, Concat denotes the tensor concatenation operation, and all W^k are two-dimensional matrices;
after each dimension has been embedded, the concatenated embedding vector is converted to the dimension H required by the feature extractor through a linear layer:
g_i = e_i W   (2)
where W ∈ R^{(H_1+H_2+H_3+H_4) × H} is the linear layer weight parameter and g_i ∈ R^H is the embedding representation after the linear layer;
When the symbolic music encoding module uses the Ferreira representation to encode the symbolic music as an event sequence, the events are embedded directly to obtain g_i:
g_i = x_i W_P   (3)
where x_i ∈ R^V is the one-hot representation of the i-th event, V is the event vocabulary size, and W_P ∈ R^{V × H} is the embedding matrix;
Position information is then added to obtain the position embedding representation; the position information p_i ∈ R^H of the i-th position is computed as:
p_i = Z_i W_P   (4)
where Z_i ∈ R^{1×n} is the one-hot encoding of position i and W_P ∈ R^{n×H} is the position embedding matrix;
Adding the event embedding representation and the position embedding representation gives the final input embedding representation I_i ∈ R^H of the i-th event:
I_i = g_i + p_i   (5).
Further, in the convergence layer, the step of obtaining a convergence vector is as follows:
The input to the convergence layer is l_2 ∈ R^{n×H}, where n is the sequence length and H is the hidden dimension output by the feature extractor. For the sequence-level emotion and tonality tasks, the convergence layer computes the attention weights a according to an attention mechanism:
a = softmax(w_2 tanh(W_1 l_2^T))   (6)
where W_1 ∈ R^{d_a × H} and w_2 ∈ R^{1 × d_a} are learnable parameters, tanh is the activation function, d_a is an adjustable hyperparameter, and T denotes the transpose of a matrix or vector;
After obtaining the attention weights a ∈ R^n, they are multiplied with the context token sequence l_2 to obtain the convergence vector l_3 ∈ R^H weighted by the attention mechanism:
l_3 = a l_2   (7)
For the note-level dynamics classification task, the convergence layer obtains the convergence vector of each note through an identity mapping.
Further, in the emotion classifier and the tonality classifier, emotion and tonality labels are predicted as:
P_t(c_t | l_3) = softmax(φ_t(l_3))   (8)
where P_t(c_t | l_3) is the predicted label distribution of task t, φ_t is the classifier of task t, and c_t is the label type of task t.
Further, in the dynamics classifier, the dynamics label is predicted as:
P_i(c | H_i) = softmax(φ_d(H_i))   (9)
where P_i(c | H_i) is the predicted dynamics distribution of the i-th note, φ_d is the classifier of the dynamics classification task, and c is the dynamics label type.
Further, the loss function in system learning is:
When the symbolic music encoding module uses the CP representation to encode the symbolic music into event groups, let L_1, L_2 and L_3 be the loss functions of emotion, tonality and dynamics classification, respectively; the total loss of the system is the following adaptive multi-task loss function:
L_total = Σ_t (1 / (2σ_t^2)) L_t + Σ_t log(1 + σ_t^2)   (10)
where L_t is the loss of the t-th task and σ_t are parameters that are learned to balance the losses of the different tasks;
When the symbolic music encoding module uses the Ferreira representation to encode the symbolic music as an event sequence, the system only has emotion classification and tonality classification losses, and L_t in equation (10) represents only the emotion and tonality classification losses.
The invention also provides a symbolic music emotion classification method based on auxiliary tasks, which comprises a pre-training stage and a fine-tuning stage. The pre-training stage uses a language model or a masked language model and directly reconstructs the original input; it may additionally perform sequence-level prediction of the tonality and note-level prediction of the dynamics. The fine-tuning stage requires sequence-level prediction of the tonality and note-level prediction of the dynamics.
Further, the pre-training phase comprises the steps of:
performing symbolic music encoding on the pre-training dataset, and obtaining a dynamics label for each note and a tonality label for each piece;
inputting the encoded symbolic music representation into the embedding layer to obtain the embedding representation fed to the feature extractor;
inputting the embedding representation into the feature extractor to learn the context tokens of the sequence;
if only a language model or a masked language model is used, directly reconstructing the original input; sequence-level prediction of the tonality and note-level prediction of the dynamics may also be performed;
computing the loss function;
propagating gradients and updating the parameters;
computing metrics on the validation set of the pre-training dataset, iterating multiple times, and saving the model with the best metrics.
Further, the fine tuning phase comprises the steps of:
performing symbolic music data encoding on the fine-tuning dataset to obtain emotion, tonality and dynamics labels;
loading the model parameters saved in the pre-training stage as the initial parameters of the fine-tuning stage;
inputting the encoded symbolic music representation into the embedding layer of the model to obtain the embedding representation fed to the feature extractor;
inputting the embedding representation into the feature extractor to learn the context tokens of the sequence;
inputting the context tokens into the convergence layer and the classifiers, and classifying emotion, tonality and dynamics;
computing the loss function and learning the parameters by back-propagation;
computing metrics and saving the model with the best metric;
loading the model parameters with the best metric from the fine-tuning stage, and performing emotion prediction on symbolic music data with unknown labels.
Compared with the prior art, the invention can achieve the following beneficial effects:
The method and system provided by the invention learn the three tasks jointly; the tasks constrain one another directly, so the model can learn features that benefit all tasks, i.e., "general knowledge" of the music domain. Using emotion-related auxiliary tasks improves emotion recognition accuracy: through multi-task learning the model learns emotion-related representations better, thereby improving its accuracy on the emotion recognition task.
Drawings
FIG. 1 is a schematic diagram of a symbolic music emotion classification system based on auxiliary tasks according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a symbolic music coding (representing) method according to an embodiment of the present invention.
FIG. 3 is a flowchart of a pre-training phase of a symbolic music emotion classification method based on auxiliary tasks according to an embodiment of the present invention.
Fig. 4 is a flowchart of a fine-tuning phase of a symbolic music emotion classification method based on an auxiliary task according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The idea of the invention is as follows: the structure of music is closely linked to its emotion. The mode of music has a great influence on emotional expression; for example, music in a major key is generally considered happy and positive, and music in a minor key is generally considered sad and negative. The loudness of music is also considered related to its emotion: music with high loudness is generally considered happy and positive, and music with low loudness is generally considered sad and negative. Loudness, however, is a concept from the audio domain, and how to compute loudness in the symbolic domain remains an open problem. Recent studies show that there is a linear relationship between note velocity in the symbolic domain and loudness in the audio domain, which means that velocity also has an important influence on emotion. In view of this connection between music structure information and music emotion, the invention proposes a multi-task framework in which the emotion classification main task and emotion-related auxiliary tasks are learned together.
As shown in fig. 1, the music emotion classification system based on auxiliary tasks provided in the embodiment of the present invention includes a symbolic music encoding module, an embedding layer, a feature extractor, a convergence layer, an emotion classifier, a tonality classifier and a dynamics classifier. The symbolic music encoding module is not part of the neural network; it is an encoding tool that encodes symbolic music stored in files such as MIDI or MusicXML into a series of events. The embedding layer, feature extractor, convergence layer, emotion classifier, tonality classifier and dynamics classifier are components of the neural network and are parameter modules learned by gradient descent. The emotion classifier, tonality classifier and dynamics classifier share the same symbolic music encoding module, embedding layer and feature extractor. The function of, and alternatives for, each component are described in detail below.
A. Symbolic music coding
The symbolic music representation method encodes symbolic music into an event sequence similar to text. After the symbolic music has been encoded into an event sequence, the resulting sequence can be regarded as analogous to a text in natural language processing and can then be processed with natural language processing methods. As shown in fig. 2, the present invention takes the existing CP representation (Compound Word representation) and the representation proposed by Ferreira as representatives, introduces the encoding schemes that symbolic music encoding can adopt, and uses these two different encoding schemes to illustrate feasible structural designs of the proposed system.
In the invention, the CP representation gathers the events encoding the symbolic music into groups; each group encodes the Pitch and Duration of a note, the Measure in which the note onset occurs, and the Sub-beat within that measure. In some embodiments of the invention, a piece of music as shown in fig. 2(a) is encoded by the CP representation into the event sequence shown in fig. 2(b). Each column is a group, and time advances from left to right. The representation divides one beat into 4 sub-beats to mark the position of each note within a bar. The minimum duration unit is the duration of a thirty-second note, so a quarter note corresponds to 8 duration units. Whenever a new bar begins it is marked as Bar (new); otherwise it is Bar (continued), indicating that the bar continues.
In the invention, the Ferreira representation encodes the symbolic music into an event sequence {(v_i, d_i, n_i)}, i = 1, ..., n, where v_i, d_i and n_i respectively denote the velocity (dynamics), duration and pitch of the i-th note, and n denotes the number of event groups in the sequence. In some embodiments of the invention, as shown in fig. 2(c), the event "v_76" indicates that the note has a velocity of 76, "d_8" indicates that the note lasts eight thirty-second notes, i.e., one quarter note, "n_64" indicates that the pitch of the note is 64, and the "." event indicates that time advances by one quarter note.
The CP and Ferreira representations differ mainly in two points: first, the CP representation aggregates events into groups, one group per note; second, the Ferreira representation encodes velocity information, whereas the CP representation used in symbolic music emotion classification ignores velocity information. Because of these two differences, the framework differs in the design of the embedding layer, the classifiers and the loss function. A toy example of the two encodings is sketched below.
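To make the two encodings above concrete, the following sketch builds toy data structures for a single C4 quarter note (MIDI pitch 60, velocity 76); the field names, token spellings and vocabulary indices are illustrative assumptions, not the exact token tables used by the invention.

# A toy illustration of the two symbolic-music encodings described above.
# Field names and token spellings are assumptions for illustration only.

# CP-style representation: one group (dict) per note, four attributes per group.
cp_group = {
    "bar": "Bar(new)",   # a new bar starts at this note
    "sub_beat": 1,       # position within the bar (4 sub-beats per beat here)
    "pitch": 60,         # MIDI pitch of C4
    "duration": 8,       # eight 32nd-note units = one quarter note
}

# Ferreira-style representation: a flat event sequence; velocity is encoded,
# and "." advances time by one step.
ferreira_events = ["v_76", "d_8", "n_60", "."]

print(cp_group)
print(ferreira_events)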
B. Embedded layer
After the symbolic music has been encoded in step A into a CP-style or Ferreira-style representation, a sequence of event groups or events is obtained.
For the CP representation, the encoded sequence is x = {x_1, ..., x_n}, where n is the sequence length. Each position x_i ∈ R^4 contains four dimensions (bar, sub-beat, pitch and duration). Each of the four attributes is embedded separately, and the embeddings of all dimensions are concatenated to obtain the embedding representation e_i of position x_i:
e_i = Concat(x_i^1 W^1, x_i^2 W^2, x_i^3 W^3, x_i^4 W^4)   (1)
where x_i^k ∈ R^{N_k} is the one-hot representation of the k-th attribute in the i-th event group, W^k ∈ R^{N_k × H_k} is the embedding matrix of the k-th attribute in the group, N_k is the number of events of the k-th attribute (i.e., the vocabulary size of that attribute), H_k is the embedding dimension of the attribute, Concat denotes the tensor concatenation operation, and all W^k are two-dimensional matrices.
After each attribute has been embedded, a linear layer is needed to convert the concatenated embedding vector to the dimension H required by the feature extractor:
g_i = e_i W   (2)
where W ∈ R^{(H_1+H_2+H_3+H_4) × H} is the linear layer weight parameter and g_i ∈ R^H is the embedding representation after the linear layer.
If the Ferreira representation is adopted, an event sequence is obtained, and g_i can be obtained by embedding the events directly:
g_i = x_i W_P   (3)
where x_i ∈ R^V is the one-hot representation of the i-th event, V is the event vocabulary size, and W_P ∈ R^{V × H} is the embedding matrix.
Since the feature extractor is based on a self-attention mechanism, which by itself carries no position information, shuffling the events in the sequence would have no effect on the attention mechanism. The event sequence obtained from symbolic music encoding is obviously time-dependent, however, so position information must be added to the embedding representation fed into the feature extractor. The position information is learnable during neural network training; the position information p_i ∈ R^H of the i-th position is computed as:
p_i = Z_i W_P   (4)
where Z_i ∈ R^{1×n} is the one-hot encoding of position i and W_P ∈ R^{n×H} is the position embedding matrix.
With position encoding, each position obtains a corresponding position vector. Adding the event embedding representation and the position embedding representation gives the final input embedding representation I_i of the i-th event:
I_i = g_i + p_i   (5)
where I_i ∈ R^H.
Through the above steps the input representation l_1 of the whole event sequence (the input to the Transformer model) is obtained, and this representation can then be fed into the feature extractor for learning.
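A minimal PyTorch sketch of the CP-style embedding layer of equations (1)-(5) is given below; the vocabulary sizes, attribute embedding dimensions and maximum sequence length are illustrative assumptions rather than values specified by the invention.

import torch
import torch.nn as nn


class CPEmbedding(nn.Module):
    """Embedding layer for CP-style event groups, following Eqs. (1)-(5).

    vocab_sizes: number of tokens N_k for each of the 4 attributes
                 (bar, sub-beat, pitch, duration); the values here are assumptions.
    """

    def __init__(self, vocab_sizes=(3, 17, 128, 64), attr_dims=(16, 32, 64, 64),
                 hidden=256, max_len=512):
        super().__init__()
        # One embedding table W^k per attribute (Eq. 1); nn.Embedding avoids
        # building explicit one-hot vectors.
        self.attr_emb = nn.ModuleList(
            [nn.Embedding(n_k, h_k) for n_k, h_k in zip(vocab_sizes, attr_dims)]
        )
        # Linear layer W mapping the concatenated embedding to dimension H (Eq. 2).
        self.proj = nn.Linear(sum(attr_dims), hidden)
        # Learnable position embedding matrix W_P (Eq. 4).
        self.pos_emb = nn.Embedding(max_len, hidden)

    def forward(self, x):
        # x: (batch, seq_len, 4) integer attribute indices.
        e = torch.cat([emb(x[..., k]) for k, emb in enumerate(self.attr_emb)], dim=-1)  # Eq. (1)
        g = self.proj(e)                                                                # Eq. (2)
        pos = torch.arange(x.size(1), device=x.device)
        p = self.pos_emb(pos).unsqueeze(0)                                              # Eq. (4)
        return g + p                                                                    # Eq. (5)


# Usage sketch: a batch of 2 sequences, 8 event groups each, 4 attributes per group.
layer = CPEmbedding()
dummy = torch.randint(0, 3, (2, 8, 4))  # indices kept below the smallest vocabulary for the demo
print(layer(dummy).shape)  # torch.Size([2, 8, 256])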
C. Feature extractor
The feature extractor of the present invention may employ either a Transformer-based feature extractor or a Long Short-Term Memory (LSTM) network.
In some embodiments of the invention, a Transformer-based feature extractor is used; the Transformer model serves as the feature extractor of the system. The Transformer contains two modules: an encoder and a decoder. For the symbolic music emotion classification task, however, these two modules are not used at the same time: existing methods use either only the encoder structure or only the decoder structure as the feature extractor. The framework proposed in this application fits both types of model. Given an input sequence l_1 = {I_1, ..., I_n}, where I_i ∈ R^H, the output of the feature extractor is a token sequence with context information l_2 = {H_1, ..., H_n}, where H_i ∈ R^H is the context token corresponding to the i-th event.
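A minimal sketch of an encoder-style feature extractor built from PyTorch's standard Transformer encoder is shown below; the layer count, head count and feed-forward size are assumptions for illustration, not parameters fixed by the invention.

import torch
import torch.nn as nn


class TransformerFeatureExtractor(nn.Module):
    """Encoder-only feature extractor: maps l_1 (n x H) to context tokens l_2 (n x H)."""

    def __init__(self, hidden=256, n_heads=4, n_layers=4, ff_dim=1024, dropout=0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, dim_feedforward=ff_dim,
            dropout=dropout, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, l1, padding_mask=None):
        # l1: (batch, seq_len, hidden); padding_mask: (batch, seq_len), True = padded.
        return self.encoder(l1, src_key_padding_mask=padding_mask)


extractor = TransformerFeatureExtractor()
l1 = torch.randn(2, 8, 256)
l2 = extractor(l1)
print(l2.shape)  # torch.Size([2, 8, 256])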
D. Convergence layer and classifier
The emotion classification system provided by the invention comprises one main task and two auxiliary tasks. The main task is to recognize the emotion of symbolic music, the first auxiliary task is to classify the tonality of the music, and the second auxiliary task is to classify the dynamics of each note. These three tasks are described in detail below, and the design of the convergence layer is given.
The emotion classification of music is based on the valence-arousal model and is carried out either as two binary classification tasks (valence and arousal) or as a four-class task with the classes happy (high valence, high arousal), angry or fearful (low valence, high arousal), sad (low valence, low arousal) and calm (high valence, low arousal).
The tonality of music is the general term for the tonic (key) and the mode. The tonic is the most important tone in a tonality, and any of the twelve pitch classes can serve as the tonic. The mode is a particular arrangement of the scale and is divided into major and minor, each with its own arrangement of scale degrees. Combining tonic and mode gives 12 × 2 = 24 tonalities in total.
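For extracting such a 24-class tonality label from a symbolic file, one practical option is the key-profile analysis in the music21 toolkit (a Krumhansl-type algorithm, in the spirit of the Krumhansl-Kessler analysis mentioned in the pre-training steps later). The sketch below and the 0-23 label layout it uses are assumptions for illustration, not the exact labeling procedure of the invention.

from music21 import converter


def tonality_label(midi_path: str) -> int:
    """Return a tonality index in [0, 23]: 12 major keys followed by 12 minor keys.

    Uses music21's key analysis (a Krumhansl-style key-profile method);
    the 0-23 index layout is an illustrative convention.
    """
    score = converter.parse(midi_path)
    k = score.analyze("key")          # e.g. <music21.key.Key of c minor>
    pc = k.tonic.pitchClass           # 0..11
    return pc if k.mode == "major" else 12 + pc


# Usage (the path is a placeholder): print(tonality_label("piece.mid"))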
The dynamics classification of notes predicts the dynamics of each note; from weak to strong, the dynamics are divided into six categories: pp, p, mp, mf, f and ff.
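The patent does not specify how MIDI velocities (0-127) are mapped to the six dynamics classes; the sketch below extracts note velocities with the pretty_midi library and bins them with evenly spaced, purely illustrative thresholds.

import pretty_midi

DYNAMICS = ["pp", "p", "mp", "mf", "f", "ff"]


def velocity_to_dynamics(velocity: int) -> str:
    """Map a MIDI velocity (0-127) to one of six dynamics classes.

    The equal-width bins used here are an assumption for illustration;
    the invention only states that dynamics are split into six classes.
    """
    idx = min(velocity * len(DYNAMICS) // 128, len(DYNAMICS) - 1)
    return DYNAMICS[idx]


def note_dynamics_labels(midi_path: str):
    """Collect a dynamics label for every note in a MIDI file."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    labels = []
    for instrument in pm.instruments:
        for note in instrument.notes:
            labels.append((note.pitch, velocity_to_dynamics(note.velocity)))
    return labels


# Usage (the path is a placeholder): print(note_dynamics_labels("piece.mid")[:5])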
Since the emotion classification and tonality classification of music are defined over a whole sequence while the dynamics classification of notes is defined per note, the sequence information must be aggregated before each task is classified. For the sequence-level tasks (emotion classification and tonality classification), the aggregated vector should be a single vector that summarizes the information of the whole sequence. For the note-level task (dynamics classification of notes), each note has its own corresponding convergence vector.
Referring to fig. 1, let the input of the convergence layer be l_2 ∈ R^{n×H}, where n is the sequence length and H is the hidden dimension output by the feature extractor. For the sequence-level tasks, the convergence layer computes the attention weights a according to an attention mechanism:
a = softmax(w_2 tanh(W_1 l_2^T))   (6)
where W_1 ∈ R^{d_a × H} and w_2 ∈ R^{1 × d_a} are learnable parameters, tanh is the activation function, d_a is an adjustable hyperparameter, and T denotes the transpose of a matrix or vector.
After obtaining the attention weights a ∈ R^n, they are multiplied with the context token sequence l_2 to obtain the convergence vector l_3 ∈ R^H weighted by the attention mechanism:
l_3 = a l_2   (7)
For the note-level classification task, the convergence layer uses an identity mapping.
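A minimal PyTorch sketch of the attention-based convergence layer of equations (6)-(7), together with the identity mapping for the note-level task, is given below; d_a and the hidden size are illustrative assumptions.

import torch
import torch.nn as nn


class ConvergenceLayer(nn.Module):
    """Attention pooling for sequence-level tasks (Eqs. 6-7); identity for note-level."""

    def __init__(self, hidden=256, d_a=64):
        super().__init__()
        self.W1 = nn.Linear(hidden, d_a, bias=False)   # W_1 in Eq. (6)
        self.w2 = nn.Linear(d_a, 1, bias=False)        # w_2 in Eq. (6)

    def forward(self, l2):
        # l2: (batch, seq_len, hidden) context tokens from the feature extractor.
        scores = self.w2(torch.tanh(self.W1(l2))).squeeze(-1)  # (batch, seq_len)
        a = torch.softmax(scores, dim=-1)                      # Eq. (6)
        l3 = torch.bmm(a.unsqueeze(1), l2).squeeze(1)          # Eq. (7): (batch, hidden)
        note_level = l2                                        # identity mapping per note
        return l3, note_level


pool = ConvergenceLayer()
l3, notes = pool(torch.randn(2, 8, 256))
print(l3.shape, notes.shape)  # torch.Size([2, 256]) torch.Size([2, 8, 256])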
The three task classifiers are designed as combinations of several fully connected layers and activation functions. Given the convergence vector l_3 of a symbolic music sequence, for the sequence-level (emotion and tonality) tasks the label is predicted by:
P_t(c_t | l_3) = softmax(φ_t(l_3))   (8)
where φ_t is the classifier of task t, P_t(c_t | l_3) is the predicted label distribution of task t, and c_t is the label type of task t (e.g., the 24 tonalities formed by 12 major and 12 minor keys for the tonality classification task).
For the CP-representation system, a note-level classification task is also designed. Given the token sequence l_2 = {H_1, ..., H_n}, the dynamics of the i-th note is predicted by:
P_i(c | H_i) = softmax(φ_d(H_i))   (9)
where φ_d is the classifier of the dynamics classification task, P_i(c | H_i) is the predicted dynamics distribution of the i-th note, and c is the dynamics label type (6 classes from weak to strong).
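The sketch below implements the classification heads of equations (8)-(9) as small fully connected networks returning logits; the class counts (4 emotions, 24 tonalities, 6 dynamics) follow the description above, while the hidden sizes are assumptions.

import torch
import torch.nn as nn


def mlp_head(hidden: int, n_classes: int) -> nn.Sequential:
    """A small fully connected classifier phi_t, returning unnormalized logits."""
    return nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))


class TaskHeads(nn.Module):
    """Emotion and tonality heads act on l_3 (Eq. 8); dynamics head acts per note (Eq. 9)."""

    def __init__(self, hidden=256, n_emotion=4, n_tonality=24, n_dynamics=6):
        super().__init__()
        self.emotion = mlp_head(hidden, n_emotion)
        self.tonality = mlp_head(hidden, n_tonality)
        self.dynamics = mlp_head(hidden, n_dynamics)

    def forward(self, l3, note_tokens):
        # l3: (batch, hidden); note_tokens: (batch, seq_len, hidden)
        return self.emotion(l3), self.tonality(l3), self.dynamics(note_tokens)


heads = TaskHeads()
out = heads(torch.randn(2, 256), torch.randn(2, 8, 256))
print([o.shape for o in out])  # shapes (2, 4), (2, 24), (2, 8, 6)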
E. Loss function
The framework provided by the invention uses multi-task joint learning, in which emotion classification is the main task and tonality classification and dynamics classification are auxiliary tasks. If the Ferreira representation is adopted, the encoding process already contains the velocity information, so dynamics classification is not performed in order to avoid label leakage; other encodings that do not leak velocity information can use the tonality and dynamics classifications at the same time. All classification tasks use the cross-entropy loss function.
If the CP representation is adopted, let L_1, L_2 and L_3 be the loss functions of emotion, tonality and dynamics classification, respectively. The total loss function of the framework is:
L_total = Σ_t (1 / (2σ_t^2)) L_t + Σ_t log(1 + σ_t^2)   (10)
where L_t is the loss of the t-th task and σ_t are parameters that are learned to balance the losses of the different tasks.
Equation (10) is an adaptive multi-task loss function, and the second term is a regularization term.
If the Ferreira representation is adopted, the system only has emotion classification and tonality classification losses, and L_t in equation (10) represents only the emotion and tonality classification losses.
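A PyTorch sketch of an adaptive multi-task loss in the spirit of equation (10) follows; the exact regularizer shown, log(1 + σ_t²), is one common uncertainty-weighting variant and should be read as an assumption rather than the definitive form used by the invention.

import torch
import torch.nn as nn


class AdaptiveMultiTaskLoss(nn.Module):
    """Uncertainty-weighted multi-task loss in the spirit of Eq. (10).

    Each task loss L_t is scaled by 1 / (2 * sigma_t^2), and log(1 + sigma_t^2)
    regularizes the learned balancing parameters (one assumed variant).
    """

    def __init__(self, n_tasks: int = 3):
        super().__init__()
        # Learn log(sigma_t) for numerical stability; initialised to 0 (sigma_t = 1).
        self.log_sigma = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, task_losses):
        sigma_sq = torch.exp(2.0 * self.log_sigma)          # sigma_t^2
        total = 0.0
        for t, loss in enumerate(task_losses):
            total = total + loss / (2.0 * sigma_sq[t]) + torch.log(1.0 + sigma_sq[t])
        return total


criterion = AdaptiveMultiTaskLoss(n_tasks=3)
dummy_losses = [torch.tensor(1.2), torch.tensor(0.7), torch.tensor(0.4)]
print(criterion(dummy_losses))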
The invention also provides a method that uses the above system for emotion classification.
The classification method that uses the system comprises two stages: a pre-training stage and a fine-tuning stage. To remain compatible with models that have already been pre-trained, the proposed system may employ the auxiliary tasks only in the fine-tuning stage; adopting the auxiliary tasks in both the pre-training and fine-tuning stages gives better results.
The main flow of the two-stage system is described next.
A. Pre-training phase
The system is typically pre-trained on a large dataset without emotion labels, and the pre-training task may use a Language Model (LM) or a Masked Language Model (MLM). The tonality classification and dynamics classification provided by the system are also applicable in the pre-training stage. As shown in fig. 3, the steps of this stage are:
Firstly, symbolic music encoding is performed on the pre-training dataset; at the same time, the dynamics label of each note is obtained from the velocity information encoded in the symbolic music file (e.g., MIDI), and the tonality of each piece is computed with a tonality analysis method and used as its label. In some embodiments of the invention, the tonality of the music is obtained with the Krumhansl-Kessler algorithm.
Secondly, the encoded symbolic music representation is input into the embedding layer of the model to obtain the embedding representation fed to the feature extractor;
Thirdly, the embedding representation is input into the feature extractor to learn the context tokens of the sequence;
Fourthly, if only a language model or a masked language model is used, the original input is reconstructed directly; if the auxiliary tasks provided by the system are also adopted, sequence-level prediction is performed for the tonality and note-level prediction is performed for the dynamics. The auxiliary tasks proposed in this application must be used in the fine-tuning stage, while in the pre-training stage they may or may not be used (using them gives better results). Many models pre-trained only with a language model or masked language model have already been released on the internet, and re-pre-training them with this framework would be too costly, so the proposed auxiliary tasks may be added only during the fine-tuning stage. Because the embedding layer and feature extractor in the system can flexibly adopt such pre-trained models, it suffices to load the pre-trained embedding layer and feature extractor in the fine-tuning stage and attach the auxiliary-task classifiers proposed in this application to the back end of the model.
Fifthly, the loss function is computed; if the auxiliary tasks proposed by the system are learned simultaneously in the pre-training stage, the total loss function is computed with equation (10), with the emotion classification loss replaced by the reconstruction loss. In some embodiments of the invention, for the Ferreira representation the reconstruction loss is the cross entropy between the distribution predicted by the pre-training model for the masked events and the true events, and for the CP representation it is the mean of the cross entropies between the predicted distribution of each attribute in the masked event group and the true attribute.
Sixthly, gradients are propagated, and the parameters of the embedding layer, feature extractor, convergence layer, classifiers and σ_t in equation (10) are updated.
Seventhly, an evaluation metric (e.g., validation loss or reconstruction accuracy) is computed on the validation set of the pre-training dataset. If no better metric is obtained after a specified number of iterations, learning is stopped and the model with the best metric is saved. When pre-training with a language model or masked language model, part of the input sequence is masked and must be predicted; the reconstruction accuracy is the accuracy of predicting the masked part.
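A compact, self-contained sketch of one pre-training step with masked-event reconstruction and a tonality auxiliary head (Ferreira-style single-vocabulary events) is shown below. The vocabulary size, masking ratio, mean pooling and dummy data are illustrative assumptions; in the invention the attention-based convergence layer and the adaptive loss of equation (10) would be used instead of the plain sum.

import torch
import torch.nn as nn

# Toy sizes and data used only to make the sketch runnable.
VOCAB, HIDDEN, N_TONALITY, MASK_ID = 200, 128, 24, 0

embed = nn.Embedding(VOCAB, HIDDEN)
pos = nn.Embedding(512, HIDDEN)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4, batch_first=True), num_layers=2
)
recon_head = nn.Linear(HIDDEN, VOCAB)        # predicts the masked events
tonality_head = nn.Linear(HIDDEN, N_TONALITY)
params = list(embed.parameters()) + list(pos.parameters()) + list(encoder.parameters()) \
    + list(recon_head.parameters()) + list(tonality_head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)
ce = nn.CrossEntropyLoss()

# Dummy batch: 4 event sequences of length 32, plus a tonality label per sequence.
events = torch.randint(1, VOCAB, (4, 32))
tonality = torch.randint(0, N_TONALITY, (4,))

# Mask 15% of the events (an assumed ratio) and train to reconstruct them.
mask = torch.rand(events.shape) < 0.15
inputs = events.masked_fill(mask, MASK_ID)

h = encoder(embed(inputs) + pos(torch.arange(events.size(1))).unsqueeze(0))
recon_loss = ce(recon_head(h)[mask], events[mask])
tonality_loss = ce(tonality_head(h.mean(dim=1)), tonality)  # mean-pooled here for brevity

loss = recon_loss + tonality_loss  # the Eq. (10) weighting would replace this plain sum
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))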
B. Fine tuning phase
The steps of the fine-tuning stage are similar to those of the pre-training stage; the differences are that the main task of the fine-tuning stage is emotion classification and that the learning rate is smaller than in the pre-training stage. As shown in fig. 4, the steps of this stage are:
Firstly, symbolic music data encoding is performed on the fine-tuning dataset, and the emotion, tonality and dynamics labels are obtained;
Secondly, the pre-trained model parameters are loaded as the initial parameters of the fine-tuning stage (the model may be an existing pre-trained model rather than one trained with the auxiliary tasks provided by the system);
Thirdly, the encoded symbolic music representation is input into the embedding layer of the model to obtain the embedding representation fed to the feature extractor;
Fourthly, the embedding representation is input into the feature extractor to learn the context tokens of the sequence;
Fifthly, the context tokens are input into the convergence layer and the classifiers, and emotion, tonality and dynamics are classified;
Sixthly, the loss function is computed with equation (10), and the parameters are learned by back-propagation;
Seventhly, metrics (emotion classification accuracy) are computed on the validation set of the fine-tuning dataset, and the model with the best metric is saved;
Eighthly, the model parameters with the best metric from the fine-tuning stage are loaded, and emotion prediction is performed on symbolic music data with unknown labels.
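A simplified sketch of one fine-tuning step follows: pre-trained backbone weights are loaded, an emotion head is attached, and the emotion loss is combined with the tonality auxiliary loss. The checkpoint path, label tensors and the plain summed loss are illustrative assumptions; the invention uses the adaptive loss of equation (10) and a smaller learning rate than in pre-training.

import torch
import torch.nn as nn

HIDDEN, VOCAB, N_EMOTION, N_TONALITY = 128, 200, 4, 24

# Backbone identical in structure to the pre-training sketch above.
embed = nn.Embedding(VOCAB, HIDDEN)
pos = nn.Embedding(512, HIDDEN)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4, batch_first=True), num_layers=2
)
backbone = nn.ModuleDict({"embed": embed, "pos": pos, "encoder": encoder})

# Loading the saved pre-training weights (the checkpoint path is a placeholder):
# backbone.load_state_dict(torch.load("pretrained_backbone.pt"))

emotion_head = nn.Linear(HIDDEN, N_EMOTION)
tonality_head = nn.Linear(HIDDEN, N_TONALITY)
optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(emotion_head.parameters()) + list(tonality_head.parameters()),
    lr=1e-5,  # smaller than the pre-training learning rate, as described above
)
ce = nn.CrossEntropyLoss()

# Dummy labelled batch: 4 sequences with an emotion and a tonality label each.
events = torch.randint(1, VOCAB, (4, 32))
emotion = torch.randint(0, N_EMOTION, (4,))
tonality = torch.randint(0, N_TONALITY, (4,))

h = encoder(embed(events) + pos(torch.arange(events.size(1))).unsqueeze(0))
pooled = h.mean(dim=1)  # the invention uses the attention-based convergence layer here
loss = ce(emotion_head(pooled), emotion) + ce(tonality_head(pooled), tonality)
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))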
Conventional symbolic music emotion recognition methods mainly adopt single-task learning and neglect the close relationship between music structure and emotion; the present invention adopts multi-task learning to improve the accuracy of the emotion recognition task. The tonality and dynamics classification auxiliary tasks provided by the invention improve the model's learning of the emotion-related structure of music, thereby improving its accuracy on the emotion recognition task.
The tonality and dynamics auxiliary tasks and the emotion classification main task are learned jointly in a multi-task manner, with the tasks assisting and constraining one another. During multi-task joint learning with these two auxiliary tasks, the model learns as much knowledge as possible that benefits all tasks, so overfitting can be reduced effectively even when the amount of data is small. In addition, of the auxiliary tasks in the invention, the tonality information can be extracted with existing mature methods and the dynamics (velocity) information can be read directly from the symbolic music file, so the system is applicable to all symbolic music datasets with emotion labels.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An auxiliary-task-based symbolic music emotion classification system, characterized by comprising a symbolic music encoding module, an embedding layer, a feature extractor, a convergence layer, an emotion classifier, a tonality classifier and a dynamics classifier, wherein:
the symbolic music encoding module is used for encoding the symbolic music into event groups or an event sequence;
the embedding layer is used for embedding the event groups or event sequence to obtain an event embedding representation, adding position information to obtain a position embedding representation, and adding the event embedding representation and the position embedding representation to obtain the final embedding representation l_1;
the feature extractor is used for extracting features from the embedding representation and outputting a token sequence l_2 with context information;
the convergence layer is used for aggregating the information of the sequence, wherein for the emotion classification main task and the tonality classification auxiliary task the convergence layer multiplies the attention weights with the token sequence l_2 to obtain a convergence vector l_3, and for the note dynamics classification auxiliary task the convergence layer uses an identity mapping to obtain a convergence vector corresponding to each note;
the emotion classifier and the tonality classifier are used for predicting the emotion category and the tonality, respectively, from the convergence vector l_3, and the dynamics classifier is used for predicting the dynamics from the convergence vector corresponding to each note.
2. The auxiliary-task-based symbolic music emotion classification system of claim 1, wherein the symbolic music encoding method adopted by the symbolic music encoding module is either a CP representation or a Ferreira representation, the CP representation gathering the events encoding the symbolic music into groups, each group encoding the pitch and duration of one note, the bar in which the note onset occurs and the sub-beat within that bar, and the Ferreira representation encoding the symbolic music into an event sequence {(v_i, d_i, n_i)}, i = 1, ..., n, wherein v_i, d_i and n_i respectively denote the velocity (dynamics), duration and pitch of the i-th note, and n denotes the number of event groups in the sequence.
3. The auxiliary-task-based symbolic music emotion classification system of claim 1, wherein, in the embedding layer, the process of obtaining the final embedding representation l_1 is as follows:
when the symbolic music encoding module uses the CP representation to encode the symbolic music into event groups, the encoded sequence is x = {x_1, ..., x_n}, where n is the sequence length and each position x_i ∈ R^4 contains four dimensions: bar, sub-beat, pitch and duration; each dimension is embedded separately, and the embeddings of all dimensions are concatenated to obtain the embedding representation e_i of position x_i, expressed as:
e_i = Concat(x_i^1 W^1, x_i^2 W^2, x_i^3 W^3, x_i^4 W^4)   (1)
where x_i^k ∈ R^{N_k} is the one-hot representation of the k-th attribute in the i-th event group, W^k ∈ R^{N_k × H_k} is the embedding matrix of the k-th attribute in the group, N_k is the number of events of the k-th attribute, H_k is the embedding dimension of that attribute, Concat denotes the tensor concatenation operation, and all W^k are two-dimensional matrices;
after each dimension has been embedded, the concatenated embedding vector is converted to the dimension H required by the feature extractor through a linear layer:
g_i = e_i W   (2)
where W ∈ R^{(H_1+H_2+H_3+H_4) × H} is the linear layer weight parameter and g_i ∈ R^H is the embedding representation after the linear layer;
when the symbolic music encoding module uses the Ferreira representation to encode the symbolic music as an event sequence, the events are embedded directly to obtain g_i:
g_i = x_i W_P   (3)
where x_i ∈ R^V is the one-hot representation of the i-th event, V is the event vocabulary size, and W_P ∈ R^{V × H} is the embedding matrix;
position information is added to obtain the position embedding representation, and the position information p_i ∈ R^H of the i-th position is computed as:
p_i = Z_i W_P   (4)
where Z_i ∈ R^{1×n} is the one-hot encoding of position i and W_P ∈ R^{n×H} is the position embedding matrix;
adding the event embedding representation and the position embedding representation gives the final input embedding representation I_i ∈ R^H of the i-th event:
I_i = g_i + p_i   (5)
4. The auxiliary-task-based symbolic music emotion classification system of claim 1, wherein, in the convergence layer, the step of obtaining a convergence vector comprises:
the input to the convergence layer is l_2 ∈ R^{n×H}, where n is the sequence length and H is the hidden dimension output by the feature extractor; for the sequence-level emotion and tonality tasks, the convergence layer computes the attention weights a according to an attention mechanism:
a = softmax(w_2 tanh(W_1 l_2^T))   (6)
where W_1 ∈ R^{d_a × H} and w_2 ∈ R^{1 × d_a} are learnable parameters, tanh is the activation function, d_a is an adjustable hyperparameter, and T denotes the transpose of a matrix or vector;
after obtaining the attention weights a ∈ R^n, they are multiplied with the context token sequence l_2 to obtain the convergence vector l_3 ∈ R^H weighted by the attention mechanism:
l_3 = a l_2   (7)
for the note-level dynamics classification task, the convergence layer obtains the convergence vector through an identity mapping.
5. The auxiliary-task-based symbolic music emotion classification system of claim 1, wherein, in the emotion classifier and the tonality classifier, emotion and tonality labels are predicted as:
P_t(c_t | l_3) = softmax(φ_t(l_3))   (8)
where P_t(c_t | l_3) is the predicted label distribution of task t, φ_t is the classifier of task t, and c_t is the label type of task t.
6. The auxiliary-task-based symbolic music emotion classification system of claim 1, wherein, in the dynamics classifier, the dynamics label is predicted as:
P_i(c | H_i) = softmax(φ_d(H_i))   (9)
where P_i(c | H_i) is the predicted dynamics distribution of the i-th note, φ_d is the classifier of the dynamics classification task, and c is the dynamics label type.
7. The auxiliary-task-based symbolic music emotion classification system of any of claims 1 to 6, wherein the loss function in system learning is:
when the symbolic music encoding module uses the CP representation to encode the symbolic music into event groups, let L_1, L_2 and L_3 be the loss functions of emotion, tonality and dynamics classification, respectively; the total loss of the system is the following adaptive multi-task loss function:
L_total = Σ_t (1 / (2σ_t^2)) L_t + Σ_t log(1 + σ_t^2)   (10)
where L_t is the loss of the t-th task and σ_t are parameters that are learned to balance the losses of the different tasks;
when the symbolic music encoding module uses the Ferreira representation to encode the symbolic music as an event sequence, the system only has emotion classification and tonality classification losses, and L_t in equation (10) represents only the emotion and tonality classification losses.
8. A symbolic music emotion classification method based on auxiliary tasks and using the system of any of claims 1 to 7, comprising a pre-training stage and a fine-tuning stage, wherein the pre-training stage uses a language model or a masked language model to directly reconstruct the original input, or additionally performs sequence-level prediction of the tonality and note-level prediction of the dynamics, and the fine-tuning stage requires sequence-level prediction of the tonality and note-level prediction of the dynamics.
9. The auxiliary-task-based symbolic music emotion classification method of claim 8, wherein the pre-training stage comprises the steps of:
performing symbolic music encoding on the pre-training dataset, and obtaining a dynamics label for each note and a tonality label for each piece;
inputting the encoded symbolic music representation into the embedding layer to obtain the embedding representation fed to the feature extractor;
inputting the embedding representation into the feature extractor to learn the context tokens of the sequence;
if only a language model or a masked language model is used, directly reconstructing the original input; sequence-level prediction of the tonality and note-level prediction of the dynamics may also be performed;
computing the loss function;
propagating gradients and updating the parameters;
computing metrics on the validation set of the pre-training dataset, iterating multiple times, and saving the model with the best metrics.
10. The auxiliary-task-based symbolic music emotion classification method of any of claims 8 to 9, wherein the fine-tuning stage comprises the steps of:
performing symbolic music data encoding on the fine-tuning dataset to obtain emotion, tonality and dynamics labels;
loading the model parameters saved in the pre-training stage as the initial parameters of the fine-tuning stage;
inputting the encoded symbolic music representation into the embedding layer of the model to obtain the embedding representation fed to the feature extractor;
inputting the embedding representation into the feature extractor to learn the context tokens of the sequence;
inputting the context tokens into the convergence layer and the classifiers, and classifying emotion, tonality and dynamics;
computing the loss function and learning the parameters by back-propagation;
computing metrics and saving the model with the best metric;
loading the model parameters with the best metric from the fine-tuning stage, and performing emotion prediction on symbolic music data with unknown labels.
CN202210296315.XA 2022-03-24 2022-03-24 Symbolic music emotion classification system and method based on auxiliary task Pending CN114925742A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210296315.XA CN114925742A (en) 2022-03-24 2022-03-24 Symbolic music emotion classification system and method based on auxiliary task

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210296315.XA CN114925742A (en) 2022-03-24 2022-03-24 Symbolic music emotion classification system and method based on auxiliary task

Publications (1)

Publication Number Publication Date
CN114925742A true CN114925742A (en) 2022-08-19

Family

ID=82804425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210296315.XA Pending CN114925742A (en) 2022-03-24 2022-03-24 Symbolic music emotion classification system and method based on auxiliary task

Country Status (1)

Country Link
CN (1) CN114925742A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019205383A1 (en) * 2018-04-28 2019-10-31 平安科技(深圳)有限公司 Electronic device, deep learning-based music performance style identification method, and storage medium
WO2020201746A1 (en) * 2019-04-03 2020-10-08 Mashtraxx Limited Method of training a neural network to reflect emotional perception, related system and method for categorizing and finding associated content and related digital media file embedded with a multi-dimensional property vector
WO2021051503A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Semantic representation model-based text classification method and apparatus, and computer device
US20220028371A1 (en) * 2020-07-27 2022-01-27 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for processing speech dialogues
CN113420807A (en) * 2021-06-22 2021-09-21 哈尔滨理工大学 Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIBAO QIU ET AL.: "A Novel Multi-Task Learning Method for Symbolic Music Emotion Recognition", arXiv:2201.05782v1, 17 January 2022 (2022-01-17), pages 1-7 *
郑旦 (ZHENG DAN): "Multi-feature fusion music classification algorithm based on deep belief network", Electronic Design Engineering (电子设计工程), no. 04, 20 February 2020 (2020-02-20), pages 138-142 *

Similar Documents

Publication Publication Date Title
Sigtia et al. An end-to-end neural network for polyphonic piano music transcription
Saul et al. Mixed memory markov models: Decomposing complex stochastic processes as mixtures of simpler ones
CN110210032B (en) Text processing method and device
CN107680582A (en) Acoustic training model method, audio recognition method, device, equipment and medium
CN110600047A (en) Perceptual STARGAN-based many-to-many speaker conversion method
CN108364639A (en) Speech processing system and method
CN111462768A (en) Multi-scale StarGAN voice conversion method based on shared training
CN115662435B (en) Virtual teacher simulation voice generation method and terminal
Falahzadeh et al. Deep convolutional neural network and gray wolf optimization algorithm for speech emotion recognition
Zhao et al. Applications of deep learning to audio generation
Yakar et al. Bilevel Sparse Models for Polyphonic Music Transcription.
Alsayadi et al. Data augmentation for Arabic speech recognition based on end-to-end deep learning
Micchi et al. A deep learning method for enforcing coherence in Automatic Chord Recognition.
CN111754962B (en) Intelligent auxiliary music composing system and method based on lifting sampling
Gupta et al. Speech recognition using artificial neural network
CN114662659B (en) Multi-stage transfer learning strategy synthesis-based crowdsourcing text integration method
CN114925742A (en) Symbolic music emotion classification system and method based on auxiliary task
CN114333762A (en) Expressive force-based speech synthesis method, expressive force-based speech synthesis system, electronic device and storage medium
CN113744759A (en) Tone template customizing method and device, equipment, medium and product thereof
TW200935399A (en) Chinese-speech phonologic transformation system and method thereof
Yu et al. Automated English speech recognition using dimensionality reduction with deep learning approach
Stoller Deep Learning for Music Information Retrieval in Limited Data Scenarios.
Zhou et al. Short-spoken language intent classification with conditional sequence generative adversarial network
CN116205217B (en) Small sample relation extraction method, system, electronic equipment and storage medium
Kamonsantiroj et al. Improving pitch class profile for musical chords recognition combining major chord filters and convolution neural networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination