CN111858943A - Music emotion recognition method and device, storage medium and electronic equipment - Google Patents

Music emotion recognition method and device, storage medium and electronic equipment

Info

Publication number
CN111858943A
Authority
CN
China
Prior art keywords
text
audio
layer
matrix
network
Prior art date
Legal status
Pending
Application number
CN202010750419.4A
Other languages
Chinese (zh)
Inventor
赵剑
刘华平
梁晓晶
段振宇
Current Assignee
Hangzhou Netease Cloud Music Technology Co Ltd
Original Assignee
Hangzhou Netease Cloud Music Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Netease Cloud Music Technology Co Ltd filed Critical Hangzhou Netease Cloud Music Technology Co Ltd
Priority to CN202010750419.4A
Publication of CN111858943A


Classifications

    • G06F16/35: Information retrieval of unstructured textual data; Clustering; Classification
    • G06F40/279: Natural language analysis; Recognition of textual entities
    • G06F40/30: Handling natural language data; Semantic analysis
    • G10L15/26: Speech recognition; Speech to text systems
    • G10L25/03: Speech or voice analysis characterised by the type of extracted parameters
    • G10L25/63: Speech or voice analysis specially adapted for estimating an emotional state

Abstract

Embodiments of the invention relate to the technical field of computers, and in particular to a music emotion recognition method and apparatus, a storage medium, and an electronic device. The method comprises: obtaining a frequency spectrum matrix and a text vector matrix corresponding to a music file to be recognized, and inputting the frequency spectrum matrix and the text vector matrix into a multi-modal network model, wherein the multi-modal network model comprises a parallel audio processing network, a text processing network, and a classification layer; performing feature extraction on the frequency spectrum matrix through the audio processing network to obtain audio modal features, and performing feature extraction on the text vector matrix through the text processing network to obtain text modal features; and mapping the audio modal features and the text modal features to preset emotion category labels through the classification layer to obtain an emotion classification result corresponding to the music file to be recognized. The method and apparatus can improve the accuracy of music emotion recognition.

Description

Music emotion recognition method and device, storage medium and electronic equipment
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a music emotion recognition method and device, a storage medium and an electronic device.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims and the description herein is not admitted to be prior art by inclusion in this section.
Music is closely related to emotion; emotional information is conveyed by the melody alone or by the melody together with the lyrics. Music Emotion Recognition (MER) technology uses a computer to analyze and process music features, studies the mapping relationship between the music feature space and the emotion space, and realizes the cognitive process of recognizing the emotion expressed by music.
Disclosure of Invention
In some techniques, a music emotion recognition method based on single-modal deep learning is used, in which a music emotion recognition model is trained with either audio information or lyric information. Such a scheme uses only the audio or only the lyric information of the music and ignores the influence of the other modality on emotional expression, so emotion recognition is not accurate enough. Multi-modal music emotion recognition methods, in turn, require a large number of labeled samples for model training. However, manually labeling music emotion data sets is costly and time-consuming; with only a small amount of labeled data, the accuracy of the model in recognizing music emotion cannot be guaranteed.
Therefore, an improved music emotion recognition method and apparatus, storage medium and electronic device are needed, which can use a small number of labeled samples for training and improve the accuracy of music emotion recognition.
In this context, embodiments of the present invention are intended to provide a music emotion recognition method and apparatus, a storage medium, and an electronic device.
According to one aspect of the present disclosure, there is provided a music emotion recognition method, including:
obtaining a frequency spectrum matrix and a text vector matrix corresponding to a music file to be recognized, and inputting the frequency spectrum matrix and the text vector matrix into a multi-modal network model; wherein the multi-modal network model comprises a parallel audio processing network, a text processing network, and a classification layer;
performing feature extraction on the frequency spectrum feature matrix through the audio processing network to obtain audio modal features, and performing feature extraction on the text vector matrix through the text processing network to obtain text modal features;
and mapping the audio modal characteristics and the text modal characteristics to a preset emotion category label through the classification layer so as to obtain an emotion classification result corresponding to the music file to be identified.
In an exemplary embodiment of the present disclosure, the obtaining a spectrum matrix and a text vector matrix corresponding to a music file to be identified includes:
the method comprises the steps of obtaining audio data corresponding to a music file to be identified and corresponding text data, and respectively preprocessing the audio data and the text data to obtain a corresponding frequency spectrum matrix and a corresponding text vector matrix.
In an exemplary embodiment of the present disclosure, the spectrum matrix is a mel-frequency spectrum matrix; preprocessing the audio data to obtain a corresponding spectral matrix, comprising:
performing voice endpoint detection on the audio data to screen non-silence framed audio data;
and constructing the Mel frequency spectrum matrix according to the screened non-silent frame audio data.
In an exemplary embodiment of the present disclosure, preprocessing the text data to obtain the text vector matrix includes:
and performing word segmentation processing on the text data, and constructing a text vector matrix with a target size according to word segmentation results.
In an exemplary embodiment of the present disclosure, the performing, by the audio processing network, feature extraction on the spectral feature matrix to obtain audio modal features includes:
performing convolution on the spectrum characteristic matrix by using a first convolution layer to obtain a first-dimension spectrum characteristic;
performing dimensionality reduction on the first-dimension spectral feature by using a first maximum pooling layer;
sequentially extracting the first-dimension spectral features subjected to dimensionality reduction by using a plurality of convolution layers which are continuously arranged to obtain target-dimension spectral features;
and performing dimensionality reduction processing on the target dimensionality spectrum feature by utilizing a second maximum pooling layer to obtain the audio modal feature.
In an exemplary embodiment of the disclosure, when the dimension reduction processing is performed on the first-dimension spectral feature by using the first max-pooling layer, the method further includes:
utilizing an anti-overfitting processing layer to perform regularization processing on the first-dimension spectral features after the dimension reduction processing; or
utilizing an anti-overfitting processing layer to perform regularization processing on the first-dimension spectral features, and inputting the regularized first-dimension spectral features into the first maximum pooling layer for dimension reduction processing;
wherein the anti-overfitting treatment layer comprises two Dropout layers arranged in series.
In an exemplary embodiment of the present disclosure, the performing, by the text processing network, feature extraction on the text vector matrix to obtain text modal features includes:
performing feature extraction on the text vector matrix by using a one-dimensional convolutional layer to obtain a first text feature;
performing dimension reduction processing on the first text feature through a maximum pooling layer;
and processing the first text feature after the dimension reduction processing by using a long-short term memory network layer to obtain the text modal feature.
In an exemplary embodiment of the present disclosure, the classification layer includes: a shared fully connected network layer and a Softmax layer;
the step of mapping the audio modal characteristics and the text modal characteristics to a preset emotion category label through the classification layer to obtain an emotion classification result corresponding to the music file to be identified includes:
inputting the audio modal characteristics and the text modal characteristics into a shared full-connection network layer, so as to map characteristic data to a preset number of emotion category labels and output characteristic values of a target size; the target size is the same as the number of the preset emotion category labels;
processing the characteristic value by utilizing a Softmax layer to acquire probability distribution of each emotion category label;
and carrying out normalization processing on the probability distribution of each emotion category label to obtain the emotion classification result of the audio data to be recognized and the corresponding text data.
In an exemplary embodiment of the disclosure, the shared fully-connected network layer includes two Linear layers arranged in series.
In an exemplary embodiment of the present disclosure, the method further comprises:
the music file to be identified is segmented to obtain a frequency spectrum matrix and a text vector matrix corresponding to a plurality of file segments;
respectively carrying out music emotion recognition on each file segment to obtain a plurality of segment emotion classification results;
and calculating the average emotion label of the emotion classification results of the segments to obtain the emotion classification result of the music file to be identified.
In an exemplary embodiment of the present disclosure, the method further comprises: training the multi-modal network model based on transfer learning, comprising:
a teacher network and a student network are constructed, and the teacher network and the student network are initialized with network parameters of a music genre recognition model;
acquiring sample data, wherein the sample data comprises an audio sample, a text sample and a corresponding emotion category label, which correspond to a sample file;
inputting the sample data into a teacher network and a student network respectively to obtain a first output result of the student network and a second output result output by the teacher network;
and constructing a knowledge distillation loss function according to the first output result and the second output result, and performing back propagation on the student network by using the knowledge distillation loss function so as to iteratively train the student network to be convergent and obtain the multi-mode network model.
In an exemplary embodiment of the disclosure, the knowledge distillation loss function comprises:
Loss=CE+λKL
wherein KL is the relative entropy (KL divergence), CE is the classification cross entropy, and λ is a weight coefficient.
In an exemplary embodiment of the present disclosure, the teacher network and the student network each include: a parallel audio processing network, a text processing network, and a classification layer; wherein:
the audio processing network comprises a first convolution layer, a maximum pooling layer and a Dropout layer which are arranged in sequence, followed by a second convolution block comprising four consecutively arranged convolution layers and a maximum pooling layer;
the text processing network comprises a one-dimensional convolution layer, a maximum pooling layer, an LSTM network layer and a Dropout layer which are sequentially arranged;
the classification layer comprises a sharing full-connection network layer and a Softmax layer which are sequentially arranged; the shared fully-connected network layer comprises two continuous Linear layers.
According to an aspect of the present disclosure, there is provided a music emotion recognition apparatus including:
the data acquisition module is used for acquiring a frequency spectrum matrix and a text vector matrix corresponding to the music file to be identified and inputting the frequency spectrum matrix and the text vector matrix into the multi-mode network model; wherein the multi-modal network model comprises a parallel audio processing network, a text processing network, and a classification layer;
the characteristic extraction module is used for extracting the characteristics of the frequency spectrum characteristic matrix through the audio processing network so as to obtain audio modal characteristics, and extracting the characteristics of the text vector matrix through the text processing network so as to obtain text modal characteristics;
and the classification result output module is used for mapping the audio modal characteristics and the text modal characteristics to a preset emotion category label through the classification layer so as to obtain an emotion classification result corresponding to the music file to be identified.
In an exemplary embodiment of the present disclosure, the data acquisition module includes:
and the preprocessing unit is used for acquiring audio data corresponding to the music file to be identified and corresponding text data, and respectively preprocessing the audio data and the text data to acquire a corresponding frequency spectrum matrix and a corresponding text vector matrix.
In an exemplary embodiment of the present disclosure, the spectrum matrix is a mel-frequency spectrum matrix; the preprocessing unit includes:
the audio processing unit is used for carrying out voice endpoint detection on the audio data so as to screen non-silent framing audio data; and constructing the Mel frequency spectrum matrix according to the screened non-silent framing audio data.
In an exemplary embodiment of the present disclosure, the preprocessing unit includes:
and the text processing unit is used for performing word segmentation processing on the text data and constructing the text vector matrix with a target size according to word segmentation results.
In an exemplary embodiment of the present disclosure, the feature extraction module includes:
the audio feature extraction unit is used for convolving the spectrum feature matrix by using a first convolution layer to acquire a first-dimension spectrum feature; performing dimensionality reduction on the first-dimension spectral feature by using a first maximum pooling layer; sequentially extracting the first-dimension spectral features subjected to dimensionality reduction by using a plurality of convolution layers which are continuously arranged to obtain target-dimension spectral features; and performing dimensionality reduction processing on the target dimensionality spectrum feature by utilizing a second maximum pooling layer to obtain the audio modal feature.
In an exemplary embodiment of the present disclosure, the audio feature extraction unit may be further configured to, when performing dimension reduction processing on the first-dimension spectral feature by using a first maximum pooling layer, perform regularization processing on the dimension-reduced first-dimension spectral feature by using an over-fitting prevention processing layer; or utilizing an anti-overfitting processing layer to carry out regularization processing on the first-dimension spectral features, and then inputting the first-dimension spectral features after regularization processing into the first maximum pooling layer to carry out dimension reduction processing; wherein the anti-overfitting treatment layer comprises two Dropout layers arranged in series.
In an exemplary embodiment of the present disclosure, the feature extraction module includes:
the text feature extraction unit is used for extracting features of the text vector matrix by using the one-dimensional convolution layer to obtain first text features; performing dimension reduction processing on the first text feature through a maximum pooling layer; and processing the first text feature after the dimension reduction processing by using a long-short term memory network layer to obtain the text modal feature.
In an exemplary embodiment of the present disclosure, the classification layer includes: a shared fully connected network layer and a Softmax layer; the classification result output module comprises:
the full-connection processing unit is used for inputting the audio modal characteristics and the text modal characteristics into a shared full-connection network layer so as to map the characteristic data to a preset number of emotion category labels and output a characteristic value of a target size; the target size is the same as the number of the preset emotion category labels; processing the characteristic value by utilizing a Softmax layer to acquire probability distribution of each emotion category label;
and the classification processing unit is used for carrying out normalization processing on the probability distribution of each emotion category label so as to obtain the emotion classification result of the audio data to be identified and the corresponding text data.
In an exemplary embodiment of the disclosure, the shared fully-connected network layer includes two Linear layers arranged in series.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
the segmentation processing module is used for segmenting the music file to be identified so as to obtain a frequency spectrum matrix and a text vector matrix corresponding to a plurality of file segments; respectively carrying out music emotion recognition on each file segment to obtain a plurality of segment emotion classification results; and calculating the average emotion label of the emotion classification results of the segments to obtain the emotion classification result of the music file to be identified.
In an exemplary embodiment of the present disclosure, the apparatus further includes:
the model training module is used for constructing a teacher network and a student network, and initializing the teacher network and the student network with network parameters of a music genre recognition model; acquiring sample data, wherein the sample data comprises an audio sample, a text sample and a corresponding emotion category label corresponding to a sample file; inputting the sample data into the teacher network and the student network respectively to obtain a first output result from the student network and a second output result from the teacher network; and constructing a knowledge distillation loss function according to the first output result and the second output result, and performing back propagation on the student network by using the knowledge distillation loss function, so as to iteratively train the student network to convergence and obtain the multi-modal network model.
In an exemplary embodiment of the disclosure, the knowledge distillation loss function comprises:
Loss=CE+λKL
wherein KL is the relative entropy (KL divergence), CE is the classification cross entropy, and λ is a weight coefficient.
In an exemplary embodiment of the present disclosure, the teacher network and the student network each include: a parallel audio processing network, a text processing network, and a classification layer; wherein:
the audio processing network comprises a first convolution layer, a maximum pooling layer and an anti-overfitting processing layer which are arranged in sequence, followed by a second convolution block and a maximum pooling layer;
the text processing network comprises a one-dimensional convolution layer, a maximum pooling layer, a long-term and short-term memory network layer and an anti-overfitting processing layer which are sequentially arranged;
the classification layer comprises a sharing full-connection network layer and a Softmax layer which are sequentially arranged; the shared fully-connected network layer comprises two continuous Linear layers.
According to an aspect of the present disclosure, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, performs the above music emotion recognition method.
According to an aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform any of the music emotion recognition methods described above via execution of the executable instructions.
According to the music emotion recognition method, the audio part and the lyric text part of a music file to be recognized are respectively processed to obtain a corresponding frequency spectrum matrix and a corresponding text vector matrix, and the frequency spectrum matrix and the text vector matrix are simultaneously processed by utilizing a multi-mode network model to obtain audio modal characteristics and text modal characteristics; and simultaneously processing the audio modal characteristics and the text modal characteristics by utilizing the classification layer of the model and mapping the audio modal characteristics and the text modal characteristics to a preset emotion category label. The method can be used for uniformly considering the influence of the music melody and the lyric text meaning on the emotion category when performing emotion classification on the music, and the accuracy rate of music emotion recognition is improved.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically shows a flow diagram of a music emotion recognition method according to an embodiment of the present invention;
FIG. 2 schematically illustrates a flow diagram of a method of feature extraction by an audio processing network according to an embodiment of the present invention;
FIG. 3 schematically illustrates a flow diagram of a method for feature extraction by a text processing network, according to an embodiment of the invention;
FIG. 4 schematically shows a flow diagram of a method of sentiment classification according to an embodiment of the present invention;
FIG. 5 is a flow chart schematically illustrating a method for music emotion recognition based on a plurality of file segments, according to an embodiment of the present invention;
FIG. 6 schematically illustrates a flow chart of a method of training a multimodal network model according to an embodiment of the invention;
FIG. 7 schematically shows a block diagram of a music emotion recognition apparatus according to an embodiment of the present invention;
FIG. 8 shows a schematic diagram of a storage medium according to an embodiment of the invention;
FIG. 9 schematically shows a block diagram of an electronic device according to an embodiment of the invention; and
FIG. 10 schematically shows a network architecture diagram of a multi-modal network model according to an embodiment of the invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to an embodiment of the invention, a music emotion recognition method, a music emotion recognition device, a storage medium and an electronic device are provided.
In this document, any number of elements in the drawings is by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
Summary of The Invention
The inventors have found that, in the prior art, music emotion recognition methods based on single-modal deep learning use only a single source of information, either the music audio or information such as the lyrics. When the audio information is used alone, the emotion contained in the lyrics and similar text is ignored; when information such as the lyrics is used alone, the influence of the rhythm and melody contained in the audio on the emotional expression of the lyrics is ignored. Music emotion recognition methods based on multi-modal deep learning require large, accurately labeled data sets. However, most existing music data sets carry labels for attributes such as genre and instrumentation, while labels for music emotion are absent, cover only a small number of emotion categories, or are severely imbalanced across categories. In addition, annotating music emotion data sets is costly and time-consuming.
In view of the above, the basic idea of the present invention is as follows. According to the music emotion recognition method provided by the embodiments of the invention, when emotion recognition is performed on a music file, the influence of both the music melody and the meaning of the lyric text on the music emotion can be considered at the same time, improving the accuracy of music emotion recognition. Meanwhile, a shared classification layer is configured for the model, which processes the audio modal features corresponding to the music audio data and the text modal features corresponding to the text data simultaneously, so that the audio modal features and the text modal features influence each other, further improving the accuracy of music emotion recognition. Moreover, only a small number of labeled samples are needed to train the model.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Exemplary method
The music emotion recognition method according to an exemplary embodiment of the present invention is described below with reference to the drawings.
Referring to fig. 1, the music emotion recognition method may include the steps of:
s1, acquiring a frequency spectrum matrix and a text vector matrix corresponding to the music file to be recognized, and inputting the frequency spectrum matrix and the text vector matrix into a multi-mode network model; wherein the multi-modal network model comprises a parallel audio processing network, a text processing network, and a classification layer;
s2, performing feature extraction on the frequency spectrum feature matrix through the audio processing network to obtain audio modal features, and performing feature extraction on the text vector matrix through the text processing network to obtain text modal features;
s3, mapping the audio modal characteristics and the text modal characteristics to a preset emotion category label through the classification layer so as to obtain an emotion classification result corresponding to the music file to be recognized.
In the music emotion recognition method, a corresponding frequency spectrum matrix and a corresponding text vector matrix are obtained by respectively processing an audio part and a lyric text part of a music file to be recognized, and the frequency spectrum matrix and the text vector matrix are simultaneously processed by utilizing a multi-mode network model to obtain audio modal characteristics and text modal characteristics; and simultaneously processing the audio modal characteristics and the text modal characteristics by utilizing the classification layer of the model and mapping the audio modal characteristics and the text modal characteristics to a preset emotion category label. The method can be used for uniformly considering the influence of the music melody and the lyric text meaning on the emotion when performing emotion classification on the music, and the accuracy rate of music emotion recognition is improved.
In step S1, a spectrum matrix and a text vector matrix corresponding to the music file to be recognized are obtained, and the spectrum matrix and the text vector matrix are input into the multimodal network model; wherein the multi-modal network model comprises a parallel audio processing network, a text processing network, and a classification layer.
In an exemplary embodiment of the disclosure, the music emotion recognition method may be applied to smart mobile terminal devices such as mobile phones and tablet computers, or to terminals such as desktop and notebook computers. The method may run on a terminal as a standalone application, or it may run as a plug-in within an existing music playing application. The music file to be recognized may be a music file stored in the terminal's local media library, or a music file in a music list stored in the cloud under the user's personal account. Specifically, acquisition of one or more specified music files to be recognized may be triggered in response to a user operation on the terminal's interactive interface; for example, the user selects one or more music files in the interactive interface as the music files to be recognized. Alternatively, acquisition of the music file to be recognized may be triggered automatically by the system according to a preset rule; for example, the local or cloud media library is read periodically, and when a newly added music file is found, its data is extracted and used as the music file to be recognized. Of course, in other exemplary embodiments of the present disclosure, the music file to be recognized may be acquired in other manners; the present disclosure does not specifically limit how the music file to be recognized is acquired.
In an exemplary embodiment of the present disclosure, for the acquired music file to be recognized, audio data as well as text data may be included. The audio data may be an audio part of the music file to be identified, and may include audio of pure music and audio including human voice and music. The text data may be lyrics, song name, and album name of the music file to be recognized, and the like.
Alternatively, in other exemplary embodiments of the present disclosure, the text data of the music file to be recognized may also be comment data on the music file. For example, for song 1 on a social networking platform or in a music application, comments whose word count falls within a preset range, or the most-liked comments whose word count falls within a certain range, may be used as the text data.
In an exemplary embodiment of the disclosure, audio data and text data corresponding to a music file to be recognized are obtained, and the audio data and the text data are respectively preprocessed to obtain a corresponding frequency spectrum matrix and a corresponding text vector matrix.
In particular, the spectral matrix may be a mel-frequency spectral matrix. Preprocessing the audio data to obtain a corresponding spectral matrix may include:
step S111, performing voice endpoint detection on the audio data to screen non-silence framed audio data;
step S112, the Mel frequency spectrum matrix is constructed according to the screened non-silent framing audio data.
For example, the audio data is subjected to voice endpoint detection, and the audio data signal may be firstly subjected to framing processing; extracting features from each frame of data; and classifying the characteristics of the audio of each frame by utilizing a trained classifier, and judging whether the audio data of each frame belongs to a speech signal (non-silent information) or a silent signal. Wherein the classifier is trained on a set of data frames for which speech and silence signal regions are known. After the non-silent framed audio data is screened, a corresponding Mel frequency spectrum matrix is constructed.
The Mel frequency spectrum is a common audio signal representation; compared with other representations, it more completely retains the characteristics of the music signal and better matches human auditory perception, so this method selects the Mel frequency spectrum as the input data for audio analysis of the music. The Mel frequency spectrum is a two-dimensional matrix of shape (time length, feature length). For example, the present disclosure uses a time length (feature_len) of 1024 and a feature length (mel_dim) of 128, so the input spectrum matrix samples are 1024 × 128.
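As an illustrative, non-limiting sketch of this preprocessing step, the following Python code uses librosa, with a simple energy-based silence filter standing in for the trained voice-endpoint-detection classifier described above; the STFT settings (n_fft, hop_length) and the silence threshold are assumptions, while feature_len and mel_dim follow the sizes quoted above.

```python
import librosa
import numpy as np

def build_mel_matrix(audio_path, feature_len=1024, mel_dim=128,
                     n_fft=2048, hop_length=512, top_db=40):
    """Build a (feature_len, mel_dim) Mel-spectrogram matrix from non-silent audio."""
    y, sr = librosa.load(audio_path, sr=None, mono=True)

    # Energy-based silence removal as a stand-in for the trained
    # voice-endpoint-detection classifier described above.
    intervals = librosa.effects.split(y, top_db=top_db)
    if len(intervals):
        y = np.concatenate([y[s:e] for s, e in intervals])

    # Mel spectrogram: (mel_dim, time) -> transpose to (time, mel_dim).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=mel_dim,
                                         n_fft=n_fft, hop_length=hop_length)
    mel = librosa.power_to_db(mel, ref=np.max).T

    # Pad or truncate the time axis to feature_len, yielding a 1024 x 128 input.
    if mel.shape[0] < feature_len:
        mel = np.pad(mel, ((0, feature_len - mel.shape[0]), (0, 0)))
    return mel[:feature_len]
```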
In this exemplary embodiment, specifically, the preprocessing the text data to obtain the text vector matrix may include:
and performing word segmentation processing on the text data, and constructing a text vector matrix with a target size according to word segmentation results.
For example, word segmentation may first be performed on the text data, and each segmented word may then be represented as a word vector. For example, the lyric word vector dimension (word_dim) may be configured as 128, the maximum lyric length (max_length) as 200, and the lyric text word vector matrix (embedding matrix) as [200, 128]. The word vector dimension of the text vector matrix is set equal to the feature length of the Mel frequency spectrum matrix, which facilitates subsequent synchronous processing of the feature matrices.
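A corresponding sketch of the lyric preprocessing, assuming jieba for word segmentation and a dictionary-like embedding_lookup (for example, a pretrained word-vector table) that returns 128-dimensional vectors; both names are hypothetical.

```python
import numpy as np
import jieba

def build_text_matrix(lyrics, embedding_lookup, word_dim=128, max_length=200):
    """Build a (max_length, word_dim) lyric word-vector matrix."""
    tokens = list(jieba.cut(lyrics))   # word segmentation
    matrix = np.zeros((max_length, word_dim), dtype=np.float32)
    for i, token in enumerate(tokens[:max_length]):
        # embedding_lookup maps a token to a 128-dimensional vector;
        # unknown words stay zero-padded.
        vec = embedding_lookup.get(token)
        if vec is not None:
            matrix[i] = vec
    return matrix
```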
In the exemplary embodiment of the present disclosure, the pre-processing of the audio data and the text data may be performed simultaneously by configuring different processes, thereby improving data processing efficiency. After the Mel frequency spectrum matrix and the text vector matrix of the music file to be recognized are obtained, the Mel frequency spectrum matrix and the text vector matrix can be input into a trained multi-mode network model, and music emotion recognition is carried out on the multi-mode network model by utilizing the multi-mode network model. The multimodal network model includes a parallel audio processing network, a text processing network, and a classification layer. The audio processing network is used for processing the Mel frequency spectrum matrix to obtain corresponding audio modal characteristics; the text processing network is used for synchronously processing the text vector matrix to obtain corresponding text modal characteristics; and the classification layer is used as a sharing layer, and simultaneously performs feature fusion and processing on the audio modal features and the text modal features to obtain the emotion classification result of the music file to be identified.
In step S2, performing feature extraction on the spectral feature matrix through the audio processing network to obtain audio modal features, and performing feature extraction on the text vector matrix through the text processing network to obtain text modal features.
In an exemplary embodiment of the present disclosure, in one aspect, the mel-frequency spectrum feature matrix may be processed by an audio processing network of a multi-modal network model. Specifically, as shown in fig. 2, the method may include:
step S211, performing convolution on the spectrum characteristic matrix by using a first convolution layer to obtain a first-dimension spectrum characteristic;
step S212, performing dimensionality reduction on the first-dimension spectral feature by using a first maximum pooling layer;
step S213, sequentially extracting the features of the first-dimension spectral features subjected to the dimensionality reduction by using a plurality of convolution layers which are continuously arranged so as to obtain target-dimension spectral features;
step S214, performing dimensionality reduction processing on the target dimensionality spectrum feature by using a second maximum pooling layer to obtain the audio modal feature.
Preferably, when the first-dimension spectral feature is subjected to dimension reduction processing by using the first maximum pooling layer, the method further includes:
utilizing an anti-overfitting processing layer to conduct regularization processing on the first-dimension frequency spectrum characteristics after the dimension reduction processing; or utilizing an anti-overfitting processing layer to carry out regularization processing on the first-dimension spectral features, and then inputting the first-dimension spectral features after regularization processing into the first maximum pooling layer to carry out dimension reduction processing; wherein the anti-overfitting treatment layer comprises two Dropout layers arranged in series.
In an exemplary embodiment of the present disclosure, as shown with reference to fig. 2 and fig. 10, the audio processing network may include a first convolution layer 1011, a maximum pooling layer and Dropout layer 1012 arranged in sequence, and a second convolution block with a maximum pooling layer 1017, the second convolution block comprising four consecutively arranged convolution layers (1013, 1014, 1015, 1016). The first convolution layer takes the Mel frequency spectrum matrix 1010 as its input; the number of input channels may be configured as 1, the stride as 3 × 3 with padding used during the sliding-window process, the number of output channels as 16, and the convolution kernel as 3 × 3. The window of the first maximum pooling layer is 2 × 2. The Dropout layer may be configured with a Dropout coefficient of 0.25, meaning that 25 percent of the network parameters are masked during training to prevent overfitting. The second convolution block comprises four consecutively arranged convolution layers whose input channel numbers are 16, 32, 64 and 128 and whose output channel numbers are configured as 32, 64, 128 and 64, respectively; all convolution kernels are configured as 3 × 3 with a stride of 3 × 3. Convolution kernels of different sizes are used to extract spectral feature information of different dimensions from the bottom layer to the top layer. The window of the second maximum pooling layer, arranged after the second convolution block, is configured as 4 × 4. Configuring two maximum pooling layers effectively reduces the size of the model, speeds up computation, and improves the robustness of the extracted features. In addition, a Dropout layer may be placed after the second maximum pooling layer to prevent overfitting.
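The following PyTorch sketch of the audio branch keeps the channel widths (1-16, then 16-32-64-128-64), pooling windows (2 × 2 and 4 × 4) and Dropout coefficient quoted above; the stride and padding of the four stacked convolution layers are assumptions, since a literal 3 × 3 stride in every layer would shrink the 1024 × 128 input below the second pooling window, and the flattened output size therefore depends on those choices.

```python
import torch
import torch.nn as nn

class AudioBranch(nn.Module):
    """Sketch of the audio processing network: conv -> max-pool -> dropout,
    then four stacked conv layers -> max-pool -> dropout."""

    def __init__(self, dropout=0.25):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=3, padding=1),  # first convolution layer
            nn.ReLU(),
            nn.MaxPool2d(2),          # first max-pooling layer, 2 x 2 window
            nn.Dropout(dropout),      # Dropout coefficient 0.25
        )
        # Second convolution block: channels 16 -> 32 -> 64 -> 128 -> 64.
        # Stride 1 and padding 1 are assumptions so the tensor stays large
        # enough for the 4 x 4 pooling window below.
        self.block2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool2d(4),          # second max-pooling layer, 4 x 4 window
            nn.Dropout(dropout),
        )

    def forward(self, mel):           # mel: (batch, 1, 1024, 128)
        x = self.block2(self.block1(mel))
        # The flattened size depends on the stride/padding assumptions above.
        return torch.flatten(x, start_dim=1)
```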
In an exemplary embodiment of the present disclosure, on the other hand, while processing the mel-frequency spectrum feature matrix with the audio processing network, feature extraction may also be performed on the text vector matrix through the text processing network of the multi-modal network model to obtain text modal features, and in particular, as shown with reference to fig. 3, the method may include:
step S221, extracting the feature of the text vector matrix by using a one-dimensional convolution layer to obtain a first text feature;
step S222, performing dimension reduction processing on the first text feature through a maximum pooling layer;
step S223, processing the first text feature after the dimension reduction processing by using a long and short term memory network layer to obtain the text modal feature.
Specifically, referring to fig. 10, the text processing network may include a one-dimensional convolutional layer 1021, a max-pooling layer 1022, an LSTM network layer 1023, and a Dropout layer 1024, which are arranged in sequence. The one-dimensional convolution layer takes the text vector matrix 1020 as an input parameter, the output channel can be configured to be 16, the convolution kernel size is 2 x 2, and the step size is 2 x 2. The long and short term memory network (LSTM) layer may be configured with a state unit state size of 40 and an output unit output size of 2000. In addition, a Dropout layer can be arranged after the long-short term memory network layer, and the Dropout coefficient is configured to be 0.25 so as to prevent the text processing network from being over-fitted.
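A matching sketch of the text branch, reading the 2 × 2 kernel and stride quoted above as a kernel size of 2 and a stride of 2 for the one-dimensional convolution and assuming a pooling window of 2; under those assumptions the flattened LSTM output is 50 × 40 = 2000, matching the output size quoted above.

```python
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    """Sketch of the text processing network: Conv1d -> max-pool -> LSTM -> dropout."""

    def __init__(self, word_dim=128, conv_channels=16, lstm_hidden=40, dropout=0.25):
        super().__init__()
        self.conv = nn.Conv1d(word_dim, conv_channels, kernel_size=2, stride=2)
        self.pool = nn.MaxPool1d(2)   # pooling window of 2 is an assumption
        self.lstm = nn.LSTM(conv_channels, lstm_hidden, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):                     # text: (batch, 200, 128)
        x = text.transpose(1, 2)                 # (batch, 128, 200) for Conv1d
        x = self.pool(torch.relu(self.conv(x)))  # (batch, 16, 50)
        x = x.transpose(1, 2)                    # (batch, 50, 16) for the LSTM
        out, _ = self.lstm(x)                    # (batch, 50, 40)
        return self.dropout(torch.flatten(out, start_dim=1))  # 50 * 40 = 2000
```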
By setting an audio processing network and a text processing network based on a convolutional neural network, original feature data are mapped into a hidden feature space by utilizing operations such as convolution, pooling and the like for subsequent emotion classification.
In step S3, the audio modal features and the text modal features are mapped to a preset emotion category label through the classification layer, so as to obtain an emotion classification result corresponding to the music file to be identified.
In an exemplary embodiment of the present disclosure, as shown with reference to fig. 4, may include:
step S31, inputting the audio modal characteristics and the text modal characteristics into a shared full-connection network layer, so as to map characteristic data to a preset number of emotion category labels and output characteristic values of a target size; the target size is the same as the number of the preset emotion category labels;
step S32, processing the characteristic value by utilizing a Softmax layer to acquire the probability distribution of each emotion category label;
step S33, the probability distribution of each emotion category label is normalized to obtain the emotion classification result of the audio data to be identified and the corresponding text data.
Specifically, referring to fig. 10, the classification layer may include: a shared fully connected network layer and a Softmax layer, wherein the shared fully connected network layer comprises two Linear processing layers (Linear layers) arranged in series. The input parameters of the first Linear layer 1031 of the shared fully connected network layer are the output parameters of the maximum pooling layer of the audio processing network and the output parameters of the maximum pooling layer of the text processing network; the output parameters of the first Linear layer may be used as the input of the second Linear layer 1032. In addition, the input unit size of the first Linear layer may be configured as 960+2000, corresponding to the orders of the feature matrices output by the audio processing network and the text processing network, and its output unit size is 128; the input unit size of the second Linear layer is configured as 128, corresponding to the output unit size of the first Linear layer, and its output unit size is the preset number of emotion label categories. For example, when the preset emotion category labels are the 3 items happy, sad and lyrical, the output unit size of the second Linear layer is 3; if the number of preset emotion category labels is 7, the output unit size of the second Linear layer is 7. The number of emotion category labels and their specific categories are not limited in this disclosure. For example, in other exemplary embodiments of the present disclosure, the emotion category labels may also include cheerful, sad, depressed, and the like. The input unit size of the Softmax layer 1033 may be configured to be the same as the number of preset emotion category labels.
The shared fully connected layer acts as the emotion classifier of the whole network, mapping the distributed feature representation to the emotion label space. Taking the preset emotion category labels happy, sad and lyrical as an example, the input unit size of the Softmax layer is 3 and the size of its normalized output is 3; after the Softmax operation, the emotion scores are converted into probabilities between 0 and 1. When labeling music samples, one-hot vector encoding may be adopted: if the emotion categories are divided into happy, sad and lyrical, the one-hot codes are (1, 0, 0), (0, 1, 0) and (0, 0, 1) respectively, the position corresponding to each category is labeled 1, and the probability values sum to 1. That is, the Softmax layer normalizes the probabilities to be consistent with the annotation labels, and the Softmax output unit size of 3 is consistent with the preset total number of emotion categories. When the total number of emotion categories changes, the Softmax output unit size changes accordingly.
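A sketch of the shared classification layer with the 960 + 2000 input size, 128-unit hidden layer and 3 emotion categories quoted above; audio_dim and text_dim are placeholders that must equal the flattened output sizes of the audio and text branches actually used.

```python
import torch
import torch.nn as nn

class SharedClassifier(nn.Module):
    """Shared fully connected layers plus Softmax over the fused modal features."""

    def __init__(self, audio_dim=960, text_dim=2000, num_classes=3):
        super().__init__()
        self.fc1 = nn.Linear(audio_dim + text_dim, 128)  # first Linear layer
        self.fc2 = nn.Linear(128, num_classes)           # second Linear layer
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, audio_feat, text_feat):
        fused = torch.cat([audio_feat, text_feat], dim=-1)  # feature fusion
        logits = self.fc2(torch.relu(self.fc1(fused)))
        return self.softmax(logits)  # probability over the emotion category labels
```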
At the Softmax layer, after the probability distribution of the music to be recognized over the preset emotion categories is obtained, the specific emotion category output is obtained according to the following formula:
y = argmax_i ( p_i / Σ_j p_j )

wherein Σ_j p_j represents the sum of the probabilities over all dimensions of the vector; p_i represents the probability at vector dimension i; and y represents the dimension position with the maximum probability value, whose index is the corresponding emotion category.
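In code, this rule is a renormalization followed by an argmax; a small illustrative helper (the tensor layout is an assumption) might look like this:

```python
import torch

def predict_emotion(probs: torch.Tensor) -> torch.Tensor:
    """probs: (batch, num_classes) Softmax output; returns the emotion index y."""
    probs = probs / probs.sum(dim=-1, keepdim=True)  # p_i / sum_j p_j
    return torch.argmax(probs, dim=-1)               # dimension with the maximum probability
```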
Based on the above, in other exemplary embodiments of the present disclosure, as shown with reference to fig. 5, the above method may further include:
step S41, the music file to be identified is segmented to obtain corresponding frequency spectrum matrixes and text vector matrixes of a plurality of file segments;
step S42, respectively carrying out music emotion recognition on each file segment to obtain a plurality of segment emotion classification results;
step S43, calculating the average emotion label of the emotion classification results of the segments to obtain the emotion classification result of the music file to be recognized.
For example, for a music file to be identified, the music file to be identified can be divided into a plurality of file segments with the same length; or, dividing the music file to be recognized into a plurality of file segments according to whether the audio part contains human voice.
For each file segment, the method described above can be used to process the file segment to obtain the emotion classification result corresponding to each file segment. If the file segment only contains audio data or only contains text data, an audio processing network or a text processing network is used for the file segment alone. After obtaining the emotion classification result of each file segment, the average emotion label can be calculated by using the formula, so that the final emotion classification result of the music file to be identified can be obtained according to the emotion classification results of the file segments in a balanced evaluation mode, and the accuracy of music emotion classification is further improved.
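A hypothetical helper illustrating the segment-level averaging described above, where segment_probs holds one Softmax probability vector per file segment:

```python
import numpy as np

def classify_full_song(segment_probs):
    """segment_probs: list of per-segment emotion probability vectors."""
    mean_probs = np.mean(np.stack(segment_probs), axis=0)  # average over segments
    return int(np.argmax(mean_probs)), mean_probs          # final label and its distribution
```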
The following describes a training method of a multi-modal network model for music emotion recognition according to an exemplary embodiment of the present invention with reference to the accompanying drawings. Referring to fig. 6, the training method of the model may include:
step S51, a teacher network and a student network are constructed, and the teacher network and the student network are initialized with network parameters of a music genre recognition model;
step S52, sample data is obtained, wherein the sample data comprises an audio sample, a text sample and a corresponding emotion category label corresponding to a sample file;
step S53, the sample data is respectively input into a teacher network and a student network to obtain a first output result of the student network and a second output result output by the teacher network;
and step S54, constructing a knowledge distillation loss function according to the first output result and the second output result, and performing back propagation on the student network by using the knowledge distillation loss function so as to iteratively train the student network to be convergent and obtain the multi-mode network model.
In an exemplary embodiment of the present disclosure, training is performed using a teacher-student model, the teacher network and the student network employing the same neural network architecture. Referring to fig. 10, the teacher network and the student network each include: a parallel audio processing network, a text processing network, and a classification layer; the audio processing network comprises a first convolution layer, a maximum pooling layer and a Dropout layer which are sequentially arranged, and a second convolution layer and a maximum pooling layer which comprise four convolution layers which are continuously arranged; the text processing network comprises a one-dimensional convolution layer, a maximum pooling layer, an LSTM network layer and a Dropout layer which are sequentially arranged; the classification layer comprises a sharing full-connection network layer and a Softmax layer which are sequentially arranged; the shared fully-connected network layer comprises two continuous Linear layers.
For the sample data, each sample file may include an audio sample, a text sample corresponding to the audio sample, and a real emotion category label value (Ground Truth). Taking the 3 emotion category labels happy, sad and lyrical as an example, the Ground Truth labels are label = (1, 0, 0), (0, 1, 0) and (0, 0, 1) respectively, representing the emotion categories happy, sad and lyrical. First, the network model is initialized with the network parameters of a music genre recognition model, and transfer learning is performed on the teacher-student network. In the transfer learning process, the samples are input into the teacher network and the student network respectively and processed by the method described above to obtain a first output result from the student network and a second output result from the teacher network. The first output result and the second output result each comprise a probability distribution over the emotion categories.
In the model training process, the teacher network does not participate in the back propagation of the neural network. The parameters W of the teacher network model are obtained as an exponential moving average of the parameters ω of the student network model; at the t-th training iteration, the teacher parameters are:

Wt = α * Wt-1 + (1 - α) * ωt

wherein α represents the decay smoothing coefficient and ωt represents the parameters of the student network at the t-th training iteration.
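A minimal PyTorch sketch of this exponential-moving-average update; alpha = 0.99 is an assumed decay value, and the no-grad context reflects the statement that the teacher does not take part in back propagation:

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, alpha=0.99):
    """Wt = alpha * Wt-1 + (1 - alpha) * wt; the teacher receives no gradients."""
    for w_t, w_s in zip(teacher.parameters(), student.parameters()):
        w_t.mul_(alpha).add_(w_s, alpha=1 - alpha)
```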
The loss function Loss of the teacher-student model consists of two parts, the KL relative entropy and the CE classification cross-entropy loss, with λ controlling the relative importance of the KL term. In particular, the loss function may be constructed from the difference between the probability distributions in the first output result and the second output result. Specifically, the formula may include:
Loss=CE+λKL
and (4) performing knowledge distillation through the loss function, and performing iterative training, thereby training the multi-modal network model based on knowledge distillation. The training based on knowledge distillation is carried out by using the same data set, and the information learned by the complex teacher network is transmitted to the student network with a smaller scale, so that the student network obtained by training has a performance effect similar to that of the complex model, and a better training result can be realized by using a small amount of labeled samples. The influence of the audio and the text on the music emotion is considered during emotion classification, so that the problem of low emotion recognition accuracy caused by unbalanced sample emotion types is effectively solved.
In summary, the method provided by the present disclosure trains a multi-modal network model based on knowledge distillation. When music emotion recognition is performed, the audio part and the text part of the music file to be recognized can be processed in parallel to obtain the corresponding audio modal features and text modal features, so that the implicit emotion information of the audio data and of the text data is mapped into a feature space. By providing a fully-connected network layer shared by the audio processing network and the text processing network, the audio modal features and the text modal features can be fused, and the emotion information contained in the audio data and in the text is mapped jointly into an emotion label space; when emotion labels are assigned, the implicit emotion information of the audio data and of the text data can be weighed in a balanced manner, so that the accuracy of music emotion classification is effectively improved. In addition, because the multi-modal network model is trained on the basis of knowledge distillation, training does not require massive labeled sample data, which alleviates the problem of low music emotion recognition accuracy caused by too little labeled sample data or unbalanced sample emotion categories. The generalization capability of music emotion recognition can be effectively improved, and the model is prevented from falling into a local optimum.
Exemplary devices
Having introduced the music emotion recognition method according to the exemplary embodiment of the present invention, next, a music emotion recognition apparatus according to an exemplary embodiment of the present invention will be described with reference to fig. 7.
Referring to fig. 7, the music emotion recognition apparatus 10 according to the exemplary embodiment of the present invention may include: a data acquisition module 101, a feature extraction module 102 and a classification result output module 103, wherein:
the data acquisition module 101 may be configured to acquire a frequency spectrum matrix and a text vector matrix corresponding to a music file to be recognized, and input the frequency spectrum matrix and the text vector matrix into a multimodal network model; wherein the multi-modal network model comprises a parallel audio processing network, a text processing network, and a classification layer.
The feature extraction module 102 may be configured to perform feature extraction on the spectral feature matrix through the audio processing network to obtain audio modal features, and perform feature extraction on the text vector matrix through the text processing network to obtain text modal features.
The classification result output module 103 may be configured to map the audio modal features and the text modal features to a preset emotion category label through the classification layer, so as to obtain an emotion classification result corresponding to the music file to be identified.
According to an exemplary embodiment of the present disclosure, the data acquisition module includes: a pre-processing unit (not shown in the figure).
The preprocessing unit can be used for acquiring audio data corresponding to the music file to be identified and corresponding text data, and respectively preprocessing the audio data and the text data to acquire a corresponding frequency spectrum matrix and a corresponding text vector matrix.
According to an exemplary embodiment of the present disclosure, the spectrum matrix is a mel-frequency spectrum matrix; the preprocessing unit includes: an audio processing unit (not shown in the figure).
The audio processing unit may be configured to perform voice endpoint detection on the audio data to screen out non-silent framed audio data, and to construct the Mel spectrum matrix from the screened non-silent framed audio data.
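A possible preprocessing sketch using librosa is shown below; librosa.effects.split is used here as an energy-based stand-in for voice endpoint detection, and the sample rate, silence threshold and number of Mel bands are illustrative assumptions.

import numpy as np
import librosa

def mel_spectrum_matrix(path, sr=22050, n_mels=128, top_db=30):
    y, sr = librosa.load(path, sr=sr)
    intervals = librosa.effects.split(y, top_db=top_db)         # non-silent intervals
    voiced = np.concatenate([y[s:e] for s, e in intervals]) if len(intervals) else y
    mel = librosa.feature.melspectrogram(y=voiced, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)                 # Mel spectrum matrix (n_mels x frames)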
According to an exemplary embodiment of the present disclosure, the preprocessing unit further includes: a text processing unit (not shown in the figure).
The text processing unit may be configured to perform word segmentation processing on the text data, and construct the text vector matrix of a target size according to a word segmentation result.
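One possible sketch of this preprocessing is shown below, using jieba for Chinese word segmentation and padding/truncating to a target size; the word-vector lookup table, the maximum length and the embedding dimension are illustrative assumptions.

import numpy as np
import jieba

def text_vector_matrix(text, word_vectors, max_len=200, embed_dim=300):
    tokens = jieba.lcut(text)                          # word segmentation result
    matrix = np.zeros((max_len, embed_dim), dtype=np.float32)
    for i, token in enumerate(tokens[:max_len]):       # truncate to the target size
        vector = word_vectors.get(token)               # pre-trained word vector, if available
        if vector is not None:
            matrix[i] = vector
    return matrix                                      # text vector matrix of the target size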
According to an exemplary embodiment of the present disclosure, the feature extraction module includes: an audio feature extraction unit (not shown in the figure).
The audio feature extraction unit may be configured to convolve the spectral feature matrix with a first convolution layer to obtain a first-dimension spectral feature; performing dimensionality reduction on the first-dimension spectral feature by using a first maximum pooling layer; sequentially extracting the first-dimension spectral features subjected to dimensionality reduction by using a plurality of convolution layers which are continuously arranged to obtain target-dimension spectral features; and performing dimensionality reduction processing on the target dimensionality spectrum feature by utilizing a second maximum pooling layer to obtain the audio modal feature.
According to an exemplary embodiment of the present disclosure, the audio feature extraction unit may be further configured to: after performing the dimension reduction processing on the first-dimension spectral feature by using the first maximum pooling layer, perform regularization processing on the dimension-reduced first-dimension spectral feature by using an anti-overfitting processing layer; or perform regularization processing on the first-dimension spectral feature by using an anti-overfitting processing layer, and then input the regularized first-dimension spectral feature into the first maximum pooling layer for dimension reduction processing; wherein the anti-overfitting processing layer comprises two Dropout layers arranged in series.
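The two orderings described above can be sketched as follows; the channel count, kernel size and dropout rates are illustrative assumptions, and the anti-overfitting processing layer is modelled as two consecutive Dropout layers.

import torch.nn as nn

pool_then_regularize = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),   # first convolution layer
    nn.MaxPool2d(2),                              # dimension reduction first
    nn.Dropout(0.3), nn.Dropout(0.3),             # then regularization (two Dropout layers in series)
)

regularize_then_pool = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),
    nn.Dropout(0.3), nn.Dropout(0.3),             # regularization first
    nn.MaxPool2d(2),                              # then dimension reduction
)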
According to an exemplary embodiment of the present disclosure, the feature extraction module includes: a text feature extraction unit (not shown in the figure).
The text feature extraction unit may be configured to perform feature extraction on the text vector matrix by using a one-dimensional convolution layer to obtain a first text feature; performing dimension reduction processing on the first text feature through a maximum pooling layer; and processing the first text feature after the dimension reduction processing by using a long-short term memory network layer to obtain the text modal feature.
According to an exemplary embodiment of the present disclosure, the classification layer includes: a shared fully-connected network layer and a Softmax layer; the classification result output module includes: a full connection processing unit and a classification processing unit (not shown in the figure). Wherein:
the full-connection processing unit may be configured to input the audio modal feature and the text modal feature into a shared full-connection network layer, so as to map feature data to a preset number of emotion category labels, and output a feature value of a target size; the target size is the same as the number of the preset emotion category labels; and processing the characteristic values by utilizing a Softmax layer to acquire the probability distribution of each emotion category label.
The classification processing unit may be configured to perform normalization processing on the probability distribution of each emotion category label to obtain emotion classification results of the audio data to be identified and the corresponding text data.
According to an exemplary embodiment of the present disclosure, the shared fully-connected network layer includes two Linear layers disposed in series.
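A small usage sketch of the classification path, reusing the MultiModalEmotionNet class sketched earlier, is given below; the input sizes are illustrative assumptions.

import torch

model = MultiModalEmotionNet(num_classes=3)    # e.g. happy / sad / lyric
spectrum = torch.randn(1, 1, 128, 431)         # (batch, channel, mel_bins, frames)
text = torch.randn(1, 200, 300)                # (batch, max_len, embed_dim)
probs = model(spectrum, text)                  # probability distribution over the 3 labels
print(probs.shape, float(probs.sum()))         # torch.Size([1, 3]), sums to approximately 1.0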
According to an exemplary embodiment of the present disclosure, the apparatus further comprises: a segmentation processing module (not shown in the figure).
The segmentation processing module may be configured to segment the music file to be identified so as to obtain a frequency spectrum matrix and a text vector matrix corresponding to each of a plurality of file segments; to perform music emotion recognition on each file segment respectively to obtain a plurality of segment emotion classification results; and to calculate the average emotion label of the segment emotion classification results to obtain the emotion classification result of the music file to be identified.
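A minimal sketch of this segment-level averaging is shown below, assuming segments is a list of (spectrum, text) tensor pairs obtained from the file segments and model is the trained multi-modal network.

import torch

def classify_song(model, segments):
    with torch.no_grad():
        segment_probs = [model(spectrum, text) for spectrum, text in segments]
    average = torch.stack(segment_probs).mean(dim=0)   # average emotion label distribution
    return int(average.argmax(dim=1))                  # index of the emotion classification result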
According to an exemplary embodiment of the present disclosure, the apparatus further comprises: a model training module (not shown).
The model training module may be configured to construct a teacher network and a student network and initialize the teacher network and the student network with network parameters of music genre recognition; acquire sample data, where the sample data comprises an audio sample and a text sample corresponding to a sample file and a corresponding emotion category label; input the sample data into the teacher network and the student network respectively to obtain a first output result output by the student network and a second output result output by the teacher network; and construct a knowledge distillation loss function according to the first output result and the second output result, and perform back propagation on the student network by using the knowledge distillation loss function, so as to iteratively train the student network to convergence and obtain the multi-modal network model.
According to an exemplary embodiment of the disclosure, the knowledge distillation loss function comprises:
Loss = CE + λ * KL
wherein KL is the relative entropy, CE is the classification cross entropy, and λ is a weight coefficient.
According to an exemplary embodiment of the present disclosure, the teacher network and the student network each include: a parallel audio processing network, a text processing network, and a classification layer; wherein:
the audio processing network comprises a first convolution layer, a maximum pooling layer and an anti-overfitting processing layer which are sequentially arranged, followed by a second convolution layer and a further maximum pooling layer;
the text processing network comprises a one-dimensional convolution layer, a maximum pooling layer, a long short-term memory network layer and an anti-overfitting processing layer which are sequentially arranged;
the classification layer comprises a shared fully-connected network layer and a Softmax layer which are sequentially arranged; the shared fully-connected network layer comprises two consecutive Linear layers.
Since each functional module of the music emotion recognition apparatus in the embodiment of the present invention is the same as that in the embodiment of the music emotion recognition method in the embodiment of the present invention, further description is omitted here.
Exemplary storage medium
Having described the music emotion recognition method and apparatus according to the exemplary embodiments of the present invention, a storage medium according to an exemplary embodiment of the present invention will now be described with reference to fig. 8.
Referring to fig. 8, a program product 800 for implementing the above-described method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Exemplary electronic device
Having described the storage medium of an exemplary embodiment of the present invention, next, an electronic device of an exemplary embodiment of the present invention will be described with reference to fig. 9.
The electronic device 900 shown in fig. 9 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present invention.
As shown in fig. 9, the electronic device 900 is embodied in the form of a general purpose computing device. Components of electronic device 900 may include, but are not limited to: the at least one processing unit 910, the at least one storage unit 920, a bus 930 connecting different system components (including the storage unit 920 and the processing unit 910), and a display unit 940.
Wherein the storage unit stores program code that is executable by the processing unit 910 to cause the processing unit 910 to perform steps according to various exemplary embodiments of the present invention described in the above section "exemplary methods" of the present specification. For example, the processing unit 910 may perform steps S1 through S3 as shown in fig. 1.
The storage unit 920 may include volatile storage units such as a random access storage unit (RAM)9201 and/or a cache storage unit 9202, and may further include a read only storage unit (ROM) 9203.
Storage unit 920 may also include a program/utility 9204 having a set (at least one) of program modules 9205, such program modules 9205 including but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The bus 930 may include a data bus, an address bus, and a control bus.
The electronic device 900 may also communicate with one or more external devices 100 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.); such communication may take place through an input/output (I/O) interface 950. The electronic device 900 further comprises a display unit 940 connected to the input/output (I/O) interface 950 for display. Also, the electronic device 900 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) via the network adapter 960. As shown, the network adapter 960 communicates with the other modules of the electronic device 900 over the bus 930. It should be appreciated that, although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
It should be noted that although several modules or sub-modules of the music emotion recognition apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided among a plurality of units/modules.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, and that the division into aspects does not imply that features in these aspects cannot be combined to advantage; that division is made only for convenience of presentation. The invention is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A music emotion recognition method is characterized by comprising the following steps:
acquiring a frequency spectrum matrix and a text vector matrix corresponding to a music file to be identified, and inputting the frequency spectrum matrix and the text vector matrix into a multi-mode network model; wherein the multi-modal network model comprises a parallel audio processing network, a text processing network, and a classification layer;
performing feature extraction on the frequency spectrum feature matrix through the audio processing network to obtain audio modal features, and performing feature extraction on the text vector matrix through the text processing network to obtain text modal features;
and mapping the audio modal characteristics and the text modal characteristics to a preset emotion category label through the classification layer so as to obtain an emotion classification result corresponding to the music file to be identified.
2. The method according to claim 1, wherein the obtaining of the spectrum matrix and the text vector matrix corresponding to the music file to be recognized comprises:
the method comprises the steps of obtaining audio data corresponding to a music file to be identified and corresponding text data, and respectively preprocessing the audio data and the text data to obtain a corresponding frequency spectrum matrix and a corresponding text vector matrix.
3. The method of claim 2, wherein the spectrum matrix is a mel-frequency spectrum matrix; preprocessing the audio data to obtain a corresponding spectral matrix, comprising:
performing voice endpoint detection on the audio data to screen non-silence framed audio data;
and constructing the Mel frequency spectrum matrix according to the screened non-silent frame audio data.
4. The method of claim 2, wherein preprocessing the text data to obtain the text vector matrix comprises:
and performing word segmentation processing on the text data, and constructing a text vector matrix with a target size according to word segmentation results.
5. The method according to claim 1, wherein the performing feature extraction on the spectral feature matrix through the audio processing network to obtain audio modal features comprises:
performing convolution on the spectrum characteristic matrix by using a first convolution layer to obtain a first-dimension spectrum characteristic;
performing dimensionality reduction on the first-dimension spectral feature by using a first maximum pooling layer;
sequentially extracting the first-dimension spectral features subjected to dimensionality reduction by using a plurality of convolution layers which are continuously arranged to obtain target-dimension spectral features;
and performing dimensionality reduction processing on the target dimensionality spectrum feature by utilizing a second maximum pooling layer to obtain the audio modal feature.
6. The method of claim 5, wherein the dimension reduction processing is performed on the first-dimension spectral feature using the first max-pooling layer, and the method further comprises:
utilizing an anti-overfitting processing layer to conduct regularization processing on the first-dimension frequency spectrum characteristics after the dimension reduction processing; or
Utilizing an anti-overfitting processing layer to conduct regularization processing on the first-dimension spectral features, and inputting the first-dimension spectral features after regularization processing into the first maximum pooling layer to conduct dimension reduction processing;
wherein the anti-overfitting treatment layer comprises two Dropout layers arranged in series.
7. The method according to claim 1, wherein the performing feature extraction on the text vector matrix through the text processing network to obtain text modal features comprises:
performing feature extraction on the text vector matrix by using a one-dimensional convolutional layer to obtain a first text feature;
performing dimension reduction processing on the first text feature through a maximum pooling layer;
and processing the first text feature after the dimension reduction processing by using a long-short term memory network layer to obtain the text modal feature.
8. A musical emotion recognition apparatus, comprising:
the data acquisition module is used for acquiring a frequency spectrum matrix and a text vector matrix corresponding to the music file to be identified and inputting the frequency spectrum matrix and the text vector matrix into the multi-mode network model; wherein the multi-modal network model comprises a parallel audio processing network, a text processing network, and a classification layer;
the characteristic extraction module is used for extracting the characteristics of the frequency spectrum characteristic matrix through the audio processing network so as to obtain audio modal characteristics, and extracting the characteristics of the text vector matrix through the text processing network so as to obtain text modal characteristics;
and the classification result output module is used for mapping the audio modal characteristics and the text modal characteristics to a preset emotion category label through the classification layer so as to obtain an emotion classification result corresponding to the music file to be identified.
9. A storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the music emotion recognition method of any of claims 1 to 7.
10. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the music emotion recognition method of any of claims 1 to 7 via execution of the executable instructions.
CN202010750419.4A 2020-07-30 2020-07-30 Music emotion recognition method and device, storage medium and electronic equipment Pending CN111858943A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010750419.4A CN111858943A (en) 2020-07-30 2020-07-30 Music emotion recognition method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN111858943A true CN111858943A (en) 2020-10-30

Family

ID=72945049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010750419.4A Pending CN111858943A (en) 2020-07-30 2020-07-30 Music emotion recognition method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111858943A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843381A (en) * 2016-03-18 2016-08-10 北京光年无限科技有限公司 Data processing method for realizing multi-modal interaction and multi-modal interaction system
CN106228977A (en) * 2016-08-02 2016-12-14 合肥工业大学 The song emotion identification method of multi-modal fusion based on degree of depth study
CN106503805A (en) * 2016-11-14 2017-03-15 合肥工业大学 A kind of bimodal based on machine learning everybody talk with sentiment analysis system and method
CN108305641A (en) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 The determination method and apparatus of emotion information
CN110516595A (en) * 2019-08-27 2019-11-29 中国民航大学 Finger multi-modal fusion recognition methods based on convolutional neural networks
CN110674339A (en) * 2019-09-18 2020-01-10 北京工业大学 Chinese song emotion classification method based on multi-mode fusion
CN111145786A (en) * 2019-12-17 2020-05-12 深圳追一科技有限公司 Speech emotion recognition method and device, server and computer readable storage medium
CN111259804A (en) * 2020-01-16 2020-06-09 合肥工业大学 Multi-mode fusion sign language recognition system and method based on graph convolution
CN111339302A (en) * 2020-03-06 2020-06-26 支付宝(杭州)信息技术有限公司 Method and device for training element classification model
CN111460213A (en) * 2020-03-20 2020-07-28 河海大学 Music emotion classification method based on multi-mode learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DANG Jianwu et al.: "Python Deep Learning for Beginners" (零基础入门Python深度学习), Shanghai Jiao Tong University Press, pages 134-135 *
ZHANG Mohan: "CNN-LSTM Short Text Classification Based on Mixed Character and Word Vectors" (基于字词混合向量的CNN-LSTM短文本分类), Information Technology and Informatization (《信息技术与信息化》), no. 1, pages 77-80 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022100045A1 (en) * 2020-11-13 2022-05-19 北京百度网讯科技有限公司 Training method for classification model, sample classification method and apparatus, and device
CN112948623A (en) * 2021-02-25 2021-06-11 杭州网易云音乐科技有限公司 Music heat prediction method, device, computing equipment and medium
CN112948623B (en) * 2021-02-25 2022-08-16 杭州网易云音乐科技有限公司 Music heat prediction method, device, computing equipment and medium
CN112905835A (en) * 2021-02-26 2021-06-04 成都潜在人工智能科技有限公司 Multi-mode music title generation method and device and storage medium
CN113129871A (en) * 2021-03-26 2021-07-16 广东工业大学 Music emotion recognition method and system based on audio signal and lyrics
CN112988975A (en) * 2021-04-09 2021-06-18 北京语言大学 Viewpoint mining method based on ALBERT and knowledge distillation
CN114943868A (en) * 2021-05-31 2022-08-26 阿里巴巴新加坡控股有限公司 Image processing method, image processing device, storage medium and processor
CN114943868B (en) * 2021-05-31 2023-11-14 阿里巴巴新加坡控股有限公司 Image processing method, device, storage medium and processor
CN113643720A (en) * 2021-08-06 2021-11-12 腾讯音乐娱乐科技(深圳)有限公司 Song feature extraction model training method, song identification method and related equipment
CN113360701A (en) * 2021-08-09 2021-09-07 成都考拉悠然科技有限公司 Sketch processing method and system based on knowledge distillation
CN113360701B (en) * 2021-08-09 2021-11-02 成都考拉悠然科技有限公司 Sketch processing method and system based on knowledge distillation

Similar Documents

Publication Publication Date Title
CN111858943A (en) Music emotion recognition method and device, storage medium and electronic equipment
CN111090987B (en) Method and apparatus for outputting information
Noroozi et al. Vocal-based emotion recognition using random forests and decision tree
US11210470B2 (en) Automatic text segmentation based on relevant context
CN107180628A (en) Set up the method, the method for extracting acoustic feature, device of acoustic feature extraction model
CN109271493A (en) A kind of language text processing method, device and storage medium
CN113094578B (en) Deep learning-based content recommendation method, device, equipment and storage medium
Rathor et al. A robust model for domain recognition of acoustic communication using Bidirectional LSTM and deep neural network.
CN112818861A (en) Emotion classification method and system based on multi-mode context semantic features
CN110245232B (en) Text classification method, device, medium and computing equipment
CN111950279B (en) Entity relationship processing method, device, equipment and computer readable storage medium
Mukherjee et al. Deep learning for spoken language identification: Can we visualize speech signal patterns?
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
CN111177186A (en) Question retrieval-based single sentence intention identification method, device and system
CN112000778A (en) Natural language processing method, device and system based on semantic recognition
US20200349323A1 (en) Technique for generating and utilizing virtual fingerprint representing text data
CN112131345A (en) Text quality identification method, device, equipment and storage medium
CN112735479B (en) Speech emotion recognition method and device, computer equipment and storage medium
CN112100360B (en) Dialogue response method, device and system based on vector retrieval
CN113723077A (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
US11854537B2 (en) Systems and methods for parsing and correlating solicitation video content
CN114118068B (en) Method and device for amplifying training text data and electronic equipment
CN113312907B (en) Remote supervision relation extraction method and device based on hybrid neural network
CN112801960B (en) Image processing method and device, storage medium and electronic equipment
Akinpelu et al. Speech emotion classification using attention based network and regularized feature selection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination