CN113177136A - Attention-based multi-modal music style classification method using audio and lyrics

Attention-based multi-modal music style classification method using audio and lyrics

Info

Publication number
CN113177136A
CN113177136A (application CN202110460027.9A; granted as CN113177136B)
Authority
CN
China
Prior art keywords
attention
audio
lyric
word
lyrics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110460027.9A
Other languages
Chinese (zh)
Other versions
CN113177136B (en)
Inventor
李优
张志海
常亮
林煜明
周娅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202110460027.9A
Publication of CN113177136A
Application granted
Publication of CN113177136B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 16/65 - Information retrieval of audio data: clustering; classification
    • G06F 16/683 - Retrieval of audio data characterised by using metadata automatically derived from the content
    • G06F 16/685 - Retrieval of audio data using an automatically derived transcript of the audio, e.g. lyrics
    • G06F 18/2415 - Classification techniques based on parametric or probabilistic models, e.g. likelihood ratio or false acceptance rate versus false rejection rate
    • G06F 40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/216 - Parsing using statistical methods
    • G06F 40/30 - Semantic analysis
    • G06N 3/045 - Neural network architectures: combinations of networks
    • G06N 3/047 - Probabilistic or stochastic networks
    • G06N 3/08 - Learning methods

Abstract

The invention relates to the technical field of music style classification, and in particular to an attention-based multi-modal music style classification method using audio and lyrics, which comprises the following steps: first, data acquisition; second, audio preprocessing: mel-spectrogram features are extracted from the audio data and audio features are then obtained through a CNN; third, lyric preprocessing: BERT pre-training yields word vectors, and lyric feature vectors are then obtained through an HAN network; fourth, inter-modal attention fusion: the acquired audio and lyric features are interactively fused through attention-based inter-modal fusion to obtain a fused attention vector, which is concatenated with the audio and lyric feature vectors so that the resulting music style features contain both the individual modal features of audio and lyrics and the inter-modal fusion features; fifth, classification through a softmax layer. The invention enables the data to be learned and classified more effectively.

Description

Attention-based multi-modal music style classification method using audio and lyrics
Technical Field
The invention relates to the technical field of music style classification, and in particular to an attention-based multi-modal music style classification method using audio and lyrics.
Background
Music style classification is an important component of music information retrieval and music recommendation. With the rapid development of internet technology and digital multimedia, the demands on the efficiency of music style classification keep increasing. However, existing music style classification techniques that use only audio features cannot fully represent a music style; the semantic information contained in the lyrics also partially characterizes the style of a song, so the lyric content needs to be taken into account. Few techniques fuse audio and lyric information, mainly because lyrics are difficult to process. Common lyric processing methods such as BOW (bag of words) and Word2Vec (word vectors) cannot capture the contextual semantic information of a sentence, yet for lyrics the contextual semantics matter greatly for the music style. As for audio-lyric fusion, current approaches rely on simple operations such as feature-level concatenation and decision-level weighting; these operations amount to a simple ensemble of sub-classifiers, do not truly fuse the audio and lyric information, and improve the results only slightly. Multi-modal music style classification based on audio and lyrics therefore still needs to be improved.
Disclosure of Invention
It is an object of the present invention to provide an attention-based multi-modal music style classification method using audio and lyrics that overcomes some or all of the deficiencies of the prior art.
According to the invention, the attention-based multi-modal music style classification method using audio and lyrics comprises the following steps:
first, data acquisition: acquiring a data set in which audio and lyrics correspond;
second, audio preprocessing: mel-spectrogram features are first extracted from the audio data, and audio features are then further obtained through a CNN;
third, lyric preprocessing: BERT pre-training is performed to obtain the word vectors of the words, and lyric feature vectors are then obtained through an HAN network;
fourth, inter-modal attention fusion: the acquired audio and lyric features are interactively fused through attention-based inter-modal fusion to obtain a fused attention vector, which is concatenated with the audio and lyric feature vectors so that the music style features contain both the individual modal features of audio and lyrics and the inter-modal fusion features;
fifth, classification through a softmax layer.
Preferably, in the first step, the corresponding songs are retrieved and downloaded according to the singer and song title information in the MetroLyrics data set, enabling batch downloading; the corresponding lyric data are then combined with the song's audio data according to the serial number of the song to obtain a data set in which audio and lyrics correspond.
Preferably, in the second step, the parameters of the mel-spectrogram feature extraction are as follows:
the sampling rate of the audio data is 22.5 kHz, and the audio signal is downsampled to 12 kHz;
the frame length is 21 ms;
short-time Fourier transform window N_FFT = 512;
frame shift HOP_LEN = 256;
mel scale mel_scale = 125;
number of mel bands N_MELS = 96;
the audio feature size of one song is (96, 1366).
Preferably, in step two, the CNN network structure is: five convolution layers with 3x3 kernels and four pooling layers, followed by a fully connected layer; the output tensor dimension is 64, so the features of each song become a 64-dimensional vector.
Preferably, in step three, the parameters of the BERT pre-training model are as follows:
uncased BERT-base (12-layer, 768-hidden, 12-heads),
i.e. the case-insensitive BERT-base model (12 Transformer encoder layers, hidden size 768, 12 attention heads);
the BERT pre-training model represents each word of the input lyric text as a 768-dimensional vector, converting character symbols into numbers that a computer can process while implicitly encoding the contextual semantic information of each word.
Preferably, in step three, the HAN network parameters are:
BERT word vector dimension 768, maximum number of clauses 66, maximum number of words per clause 300, Bi_GRU hidden units 100, attention output dimension 200;
each lyric text is divided into sentences, and every word of each sentence is embedded using the BERT pre-trained word vectors; each clause is then passed through a Bi_GRU layer for word-level encoding (word encoder), and the word attention feature of each clause is computed; after the word-level encoding of every clause of each lyric yields the word attention features, the clauses are passed through a Bi_GRU layer for sentence-level encoding (sentence encoder), and the sentence attention feature of each lyric is computed.
Preferably, in the fourth step, the inter-modal attention is calculated as follows:
let u denote the number of songs, V the audio data, T the lyric data, and d = 64 the dimension of the last fully connected layer, so that V is the u x d matrix of 64-dimensional audio feature vectors and T is the u x d matrix of 64-dimensional lyric feature vectors;
(1) compute the matrices $M_1$ and $M_2$:
$M_1 = V T^{\top}$ and $M_2 = T V^{\top}$;
(2) compute the probability distributions $N_1$ and $N_2$ of $M_1$ and $M_2$ using softmax:
$N_1(i,j) = \dfrac{e^{M_1(i,j)}}{\sum_{k=1}^{u} e^{M_1(i,k)}}$,  $N_2(i,j) = \dfrac{e^{M_2(i,j)}}{\sum_{k=1}^{u} e^{M_2(i,k)}}$,
where i indexes the i-th of the u audio samples, j indexes the j-th of the u lyric samples, and k is a summation variable running from 1 to u;
(3) compute the modality representation matrices $O_1$ and $O_2$:
$O_1 = N_1 T$ and $O_2 = N_2 V$;
(4) compute the attention matrices $A_1$ and $A_2$:
$A_1 = O_1 \odot V$ and $A_2 = O_2 \odot T$;
(5) compute the inter-modal attention matrix $B_{AVT}$:
$B_{AVT} = \mathrm{concat}[A_1, A_2]$.
The inter-modal attention computation first obtains the matrices $M_1$ and $M_2$ by matrix multiplication; softmax then yields the probability distributions $N_1$ and $N_2$, which serve as weight coefficients for the modal features; cross-modal multiplication produces the new modality representation matrices $O_1$ and $O_2$; element-wise multiplication with the original features gives the attention matrices $A_1$ and $A_2$.
V, T, $A_1$ and $A_2$ are concatenated to form the final feature vector.
The method processes the audio and the lyrics separately from the acquired data set: Mel_spectrum extraction is used to obtain features from the audio data, and a BERT (Bidirectional Encoder Representations from Transformers) pre-training model is used to obtain the word-vector representation of the lyrics; a CNN (convolutional neural network) and an HAN (hierarchical attention network) are then used to obtain the feature vectors of the two modalities, which are fused at the feature level with inter-modal attention; finally the fused feature vectors are classified. On top of the traditional audio-based approach, lyric information is added and the two kinds of feature information are fused. Lyrics use text as a carrier and express the emotion, theme and style of a song. Audio and lyrics are data of two different modalities, and information fusion can better extract the interaction relationship between the modalities. The method not only extracts the important information of each modality but also fuses the relational information between the modalities, so the fused feature vector represents a song more effectively and the data can be learned and classified more easily.
Drawings
FIG. 1 is a flowchart of the attention-based multi-modal music style classification method using audio and lyrics in embodiment 1;
FIG. 2 is a diagram of the mel-spectrogram audio feature in embodiment 1;
FIG. 3 is a schematic diagram of the CNN network structure for processing the spectrogram in embodiment 1;
FIG. 4 is a diagram of the BERT pre-trained word vectors in embodiment 1;
FIG. 5 is a schematic diagram of the HAN (hierarchical attention) network in embodiment 1;
FIG. 6 is a flowchart of the inter-modal attention calculation in embodiment 1;
FIG. 7 is a block diagram of the overall attention-fusion network.
Detailed Description
For a further understanding of the invention, reference should be made to the following detailed description taken in conjunction with the accompanying drawings and examples. It is to be understood that the examples are illustrative of the invention and not limiting.
Embodiment 1
As shown in FIG. 1, this embodiment provides an attention-based multi-modal music style classification method using audio and lyrics, which comprises the following steps:
First, data acquisition: acquiring a data set in which audio and lyrics correspond;
one of the difficult problems with multimodal music styles is obtaining a usable data set with fewer data sets sorted about the music style. The data source is MetroLyrics (https:// www.metrolyrics.com /) lyrics for over 380000 songs obtained on a lyrics collection website, but the corresponding audio needs to be obtained. And retrieving and downloading the corresponding songs according to the singer and the song name information in the metroletics data set, thereby realizing batch downloading. And then the corresponding audio data and the lyric data are combined together according to the serial number of the song, and a data set corresponding to the audio and the lyric is obtained. The data set is 5 categories, each category being 1000. The 5 categories are Hip-Hop, Metal, Country, Folk, Jazz, respectively.
Second, audio preprocessing: mel-spectrogram features are first extracted from the audio data, and audio features are then further obtained through a CNN.
The parameters of the mel (Mel) spectrogram feature extraction are as follows (a minimal extraction sketch follows the list):
the sampling rate of the audio data is 22.5 kHz, and the audio signal is downsampled to 12 kHz;
the frame length is 21 ms;
short-time Fourier transform window N_FFT = 512;
frame shift HOP_LEN = 256;
mel scale mel_scale = 125;
number of mel bands N_MELS = 96;
the audio feature size of one song is (96, 1366).
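For reference, the extraction can be reproduced roughly as below. This is a minimal sketch assuming the librosa library (the patent does not name an implementation library); the fixed-length padding/cropping to 1366 frames is an illustrative assumption, and the mel_scale parameter above has no direct librosa equivalent and is omitted here.

```python
import numpy as np
import librosa


def extract_mel_features(path, sr=12000, n_fft=512, hop_len=256, n_mels=96, n_frames=1366):
    """Load a song, downsample it and compute a log-scaled mel spectrogram of
    fixed size (96, 1366), mirroring the parameters listed above."""
    # Load and resample the audio (librosa resamples to the requested rate)
    y, _ = librosa.load(path, sr=sr, mono=True)

    # Mel power spectrogram: (n_mels, variable number of frames)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_len, n_mels=n_mels)

    # Convert power to decibels (the energy axis of the mel-spectrogram figure)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # Pad or crop along the time axis to the fixed (96, 1366) feature size
    if mel_db.shape[1] < n_frames:
        mel_db = np.pad(mel_db, ((0, 0), (0, n_frames - mel_db.shape[1])))
    return mel_db[:, :n_frames]
```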
After feature extraction, the mel-spectrogram audio feature shown in FIG. 2 is obtained; the horizontal axis of the mel-spectrogram represents the time axis (30 seconds), the vertical axis represents the frequency axis in Hz (hertz), and the scale on the right represents the energy value in dB (decibels).
The CNN network structure for processing the spectrogram is shown in FIG. 3; a two-dimensional CNN model further extracts features from the input representation of the audio signal. The network has five convolution layers with 3x3 kernels and four pooling layers, followed by a fully connected layer; the output tensor dimension is 64, so the audio features of each song become a 64-dimensional vector (a code sketch follows the figure description below). In FIG. 3:
input: the input data, a two-dimensional spectrogram;
output: a one-dimensional feature vector;
frequency: the frequency axis;
time: the time axis;
n: the number of network layers.
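Below is a minimal PyTorch sketch of such a CNN. The patent fixes only the 3x3 kernels, the counts of convolution and pooling layers, and the 64-dimensional output; the channel widths, pooling sizes and global average pooling used here are assumptions.

```python
import torch
import torch.nn as nn


class AudioCNN(nn.Module):
    """Sketch of the audio branch: five 3x3 convolution layers, four pooling
    layers, and a fully connected layer producing a 64-dimensional vector."""

    def __init__(self, out_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 4)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 4)),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 4)),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((3, 5)),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(64, out_dim)  # 64 channels remain after global pooling

    def forward(self, x):            # x: (batch, 1, 96, 1366) mel spectrogram
        h = self.features(x)
        h = h.mean(dim=(2, 3))       # global average pooling over frequency and time
        return self.fc(h)            # (batch, 64) audio feature vector
```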
Third, lyric preprocessing: BERT (Bidirectional Encoder Representations from Transformers) pre-training is performed to obtain the word vectors of the words, and lyric feature vectors are then obtained through an HAN (hierarchical attention) network.
Compared with the traditional BOW (bag of words) approach, the BERT pre-training model can acquire the contextual semantic information of the text and learn sentence-level information, so the extracted features are more representative. It also avoids the pitfall of purely statistical methods, which capture only word-frequency information and ignore the semantics of the text. The HAN (hierarchical attention) network is used to better capture the information among the words and sentences of the lyrics, acquiring the implicit contextual information of preceding and following sentences through the two levels of words and sentences.
BERT pre-training model parameters:
uncased BERT-base(12-layer,768-hidden,12-heads);
i.e. the case-insensitive BERT-base model (12 Transformer encoder layers, hidden size 768, 12 attention heads);
uncased BERT-base is the name of the released BERT pre-training model used here.
The BERT pre-training model represents each word of the input lyric text as a 768-dimensional vector, converting character symbols into numbers that a computer can process while implicitly encoding the contextual semantic information of each word; the words in the lyric text are then replaced by the word vectors obtained from BERT pre-training and assembled into a data matrix following the content of the lyric text, giving a numerical representation of the lyric text. A series of feature extraction operations can then be performed on this lyric data (a minimal extraction sketch follows the figure description below).
FIG. 4 is a diagram of the BERT pre-trained word vectors, in which: sentence 1 and sentence 2 are the two input sentences; w1, w2, w3, w4 and w5 denote words; [CLS] is the special BERT token used for classification tasks; [SEP] is the BERT separator token that separates the two sentences of the input corpus.
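A minimal sketch of obtaining the 768-dimensional contextual word vectors is shown below; it assumes the Hugging Face transformers implementation of the bert-base-uncased checkpoint, which the patent itself does not prescribe.

```python
import torch
from transformers import BertTokenizer, BertModel

# Assumed implementation: the public bert-base-uncased checkpoint from Hugging Face
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()


def lyric_word_vectors(clause: str) -> torch.Tensor:
    """Return one 768-dimensional contextual vector per token of a lyric clause."""
    inputs = tokenizer(clause, return_tensors="pt", truncation=True, max_length=300)
    with torch.no_grad():
        outputs = bert(**inputs)
    # last_hidden_state: (1, num_tokens, 768), including the [CLS] and [SEP] tokens
    return outputs.last_hidden_state.squeeze(0)


vectors = lyric_word_vectors("hello darkness my old friend")
print(vectors.shape)  # torch.Size([num_tokens, 768])
```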
The acquired lyric text data undergoes two-level (word and sentence) information extraction through the HAN; the attention mechanism can pick out the parts of a sentence that are closely related to the music style and give their features a higher weight so that they stand out.
Each lyric text is divided into sentences, and every word of each sentence is embedded using the BERT pre-trained word vectors; then, following the network structure of FIG. 5, each clause is passed through a Bi_GRU (bidirectional gated recurrent unit) layer for word-level encoding (word encoder), and the word attention feature of each clause is computed; after the word-level encoding of every clause of each lyric yields the word attention features, the clauses are passed through a Bi_GRU layer for sentence-level encoding (sentence encoder), and the sentence attention feature of each lyric is computed. The features of the lyric text are thus obtained hierarchically, extracting the positional information of the words and the contextual semantic information of the sentences (a code sketch of this network follows the figure description below).
The HAN (hierarchical attention network) parameters are:
BERT word vector dimension 768, maximum number of clauses 66, maximum number of words per clause 300, Bi_GRU hidden units 100, attention output dimension 200.
In FIG. 5:
word encoder: the word-level encoder;
word attention: the word-level attention layer;
sentence encoder: the sentence-level encoder;
sentence attention: the sentence-level attention layer;
softmax: the normalized exponential function used as the classification-layer activation in the multi-class task;
w denotes words, s denotes sentences, h denotes the hidden states of the recurrent network, and v is the feature vector of the lyric text;
h_1 to h_L: hidden states of the sentence-level Bi_GRU, where L is the number of clauses;
h_21 to h_2T: hidden states of the word-level Bi_GRU, where T is the number of words;
the left and right arrows indicate the forward and backward directions of the bidirectional GRU;
u_s: the sentence attention context vector; alpha_1 to alpha_L: the sentence attention weights;
u_w: the word attention context vector; alpha_21 to alpha_2T: the word attention weights;
s_1 to s_L: the clauses into which the lyric text is divided;
w_21 to w_2T: the word vectors of one clause.
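A minimal PyTorch sketch of this hierarchical attention network is given below. The additive attention pooling (tanh projection plus context vector) and the exact wiring are assumptions consistent with the structure described above; in the overall network of FIG. 7 a further dense layer would map the 200-dimensional lyric feature down to the 64 dimensions used in the fusion step.

```python
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Additive attention pooling used at both the word and sentence levels."""

    def __init__(self, hidden_dim, attn_dim=200):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, attn_dim)
        self.context = nn.Linear(attn_dim, 1, bias=False)  # u_w / u_s context vector

    def forward(self, h):                                  # h: (batch, seq_len, hidden_dim)
        scores = self.context(torch.tanh(self.proj(h)))    # (batch, seq_len, 1)
        alpha = torch.softmax(scores, dim=1)               # attention weights
        return (alpha * h).sum(dim=1)                      # weighted sum over the sequence


class HAN(nn.Module):
    """Sketch of the hierarchical attention network: Bi-GRU word encoder with
    word attention, then Bi-GRU sentence encoder with sentence attention."""

    def __init__(self, word_dim=768, gru_units=100):
        super().__init__()
        self.word_gru = nn.GRU(word_dim, gru_units, bidirectional=True, batch_first=True)
        self.word_attn = AttentionPool(2 * gru_units)
        self.sent_gru = nn.GRU(2 * gru_units, gru_units, bidirectional=True, batch_first=True)
        self.sent_attn = AttentionPool(2 * gru_units)

    def forward(self, lyrics):   # lyrics: (batch, n_sentences, n_words, 768) BERT vectors
        b, s, w, d = lyrics.shape
        words, _ = self.word_gru(lyrics.view(b * s, w, d))  # word-level encoding
        sent_vecs = self.word_attn(words).view(b, s, -1)    # one vector per clause
        sents, _ = self.sent_gru(sent_vecs)                 # sentence-level encoding
        return self.sent_attn(sents)                        # (batch, 200) lyric feature
```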
Fourth, inter-modal attention fusion: the acquired audio and lyric features are interactively fused through attention-based inter-modal fusion to obtain a fused attention vector, which is concatenated with the audio and lyric feature vectors so that the music style features contain both the individual modal features of audio and lyrics and the inter-modal fusion features.
the audio and the lyrics are two different information modes, and the data of the two modes contain the style information of the music, so that the prediction of the music style by fusing the two information can be better represented, and only one information is not enough. The current fusion scheme is usually to simply splice the acquired audio and lyric features directly, and this method only acquires one splicing vector. Interaction fusion of information between the modalities is not carried out, the audio and the lyrics have a time corresponding relation, and interaction information between the two modalities is considered besides the acquisition of information of the respective modalities. Through the integration of the Attention model, the acquired audio frequency and lyric characteristics are interactively integrated to acquire an integrated Attention vector, and then the integrated Attention vector is spliced with the audio frequency and lyric characteristic vectors to acquire the music style characteristics including respective modal characteristics of the audio frequency and the lyric and the integrated characteristics among the modalities.
The inter-modal attention is calculated as follows:
let u denote the number of songs, V the audio data, T the lyric data, and d = 64 the dimension of the last fully connected layer, so that V is the u x d matrix of 64-dimensional audio feature vectors and T is the u x d matrix of 64-dimensional lyric feature vectors;
(1) compute the matrices $M_1$ and $M_2$:
$M_1 = V T^{\top}$ and $M_2 = T V^{\top}$;
(2) compute the probability distributions $N_1$ and $N_2$ of $M_1$ and $M_2$ using softmax:
$N_1(i,j) = \dfrac{e^{M_1(i,j)}}{\sum_{k=1}^{u} e^{M_1(i,k)}}$,  $N_2(i,j) = \dfrac{e^{M_2(i,j)}}{\sum_{k=1}^{u} e^{M_2(i,k)}}$,
where i indexes the i-th of the u audio samples, j indexes the j-th of the u lyric samples, and k is a summation variable running from 1 to u;
(3) compute the modality representation matrices $O_1$ and $O_2$:
$O_1 = N_1 T$ and $O_2 = N_2 V$;
(4) compute the attention matrices $A_1$ and $A_2$:
$A_1 = O_1 \odot V$ and $A_2 = O_2 \odot T$;
(5) compute the inter-modal attention matrix $B_{AVT}$:
$B_{AVT} = \mathrm{concat}[A_1, A_2]$.
The flowchart of the inter-modal attention calculation is shown in FIG. 6: the matrices $M_1$ and $M_2$ are first obtained by matrix multiplication; softmax then yields the probability distributions $N_1$ and $N_2$, which serve as weight coefficients for the modal features; cross-modal multiplication produces the new modality representation matrices $O_1$ and $O_2$; element-wise multiplication with the original features gives the attention matrices $A_1$ and $A_2$; finally V, T, $A_1$ and $A_2$ are concatenated into the final feature vector (a code sketch of these steps follows the figure description). In FIG. 6:
row softmax: softmax computed row by row to obtain the probability distributions;
matmul: matrix multiplication; elemwise mul: element-wise multiplication.
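The five steps can be sketched in a few lines of PyTorch; this is a minimal illustration of the computation described above, with u songs per batch and d = 64.

```python
import torch


def intermodal_attention_fusion(V: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    """Sketch of inter-modal attention steps (1)-(5): V is the (u, d) audio
    feature matrix and T the (u, d) lyric feature matrix."""
    M1 = V @ T.t()                       # (1) cross-modal similarity matrices, (u, u)
    M2 = T @ V.t()

    N1 = torch.softmax(M1, dim=1)        # (2) row-wise softmax: weight coefficients
    N2 = torch.softmax(M2, dim=1)

    O1 = N1 @ T                          # (3) cross-weighted modality representations
    O2 = N2 @ V

    A1 = O1 * V                          # (4) element-wise product with the original features
    A2 = O2 * T

    B_AVT = torch.cat([A1, A2], dim=1)   # (5) inter-modal attention matrix, (u, 2d)

    # Final fused feature: V, T, A1 and A2 concatenated, fed to the softmax layer
    return torch.cat([V, T, B_AVT], dim=1)
```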
Fifth, classification through a softmax layer.
FIG. 7 is a block diagram of the overall attention-fusion network, in which:
V denotes the audio data and T denotes the lyric text; the lyric text is pre-trained with BERT and the lyric features are extracted with the hierarchical attention network; the audio and lyric features are fused by the attention-based inter-modal fusion method.
Mel-Spect: the mel spectrogram; CNN: convolutional neural network; Dense: a fully connected one-dimensional hidden layer; BERT: the pre-trained word-vector model; HAN: hierarchical attention network; Attention concat: the attention fusion and concatenation layer; softmax: the classification layer.
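Putting the pieces together, a sketch of the overall network of FIG. 7 might look like the following; it reuses the AudioCNN, HAN and intermodal_attention_fusion sketches defined earlier, and the 64-dimensional dense projection of the lyric feature and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn


class MusicStyleClassifier(nn.Module):
    """End-to-end sketch: audio branch (CNN), lyric branch (BERT vectors -> HAN -> dense),
    inter-modal attention fusion, and a softmax classification layer for 5 genres."""

    def __init__(self, n_classes=5, d=64):
        super().__init__()
        self.audio_net = AudioCNN(out_dim=d)           # from the earlier sketch
        self.lyric_net = HAN()                         # from the earlier sketch
        self.lyric_proj = nn.Linear(200, d)            # Dense layer mapping 200 -> 64
        self.classifier = nn.Linear(4 * d, n_classes)  # softmax layer over the fused vector

    def forward(self, mel, lyric_word_vectors):
        V = self.audio_net(mel)                                   # (u, 64) audio features
        T = self.lyric_proj(self.lyric_net(lyric_word_vectors))   # (u, 64) lyric features
        fused = intermodal_attention_fusion(V, T)                 # (u, 256) fused features
        # For training with nn.CrossEntropyLoss one would return the raw logits instead
        return torch.softmax(self.classifier(fused), dim=1)       # genre probabilities
```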
In this embodiment, the audio and the lyrics are processed separately from the acquired data set: Mel_spectrum extraction is used to obtain features from the audio data, and a BERT (Bidirectional Encoder Representations from Transformers) pre-training model is used to obtain the word-vector representation of the lyrics; a CNN (convolutional neural network) and an HAN (hierarchical attention network) are then used to obtain the feature vectors of the two modalities, which are fused at the feature level with inter-modal attention; finally the fused feature vectors are classified. On top of the traditional audio-based approach, lyric information is added and the two kinds of feature information are fused. Lyrics use text as a carrier and express the emotion, theme and style of a song. Audio and lyrics are data of two different modalities, and information fusion can better extract the interaction relationship between the modalities. The method not only extracts the important information of each modality but also fuses the relational information between the modalities, so the fused feature vector represents a song more effectively and the data can be learned and classified more easily.
Experiment one:
and comparing classification results of three modes of fusion among single audio frequency, single lyric and Attention mode.
The song data comprises 5000 pieces of data of 5 styles, 1000 songs respectively; the 5 styles are Hip-Hop, Metal, Country, Folk, Jazz, respectively;
comparison of the results of each classification F1:
table 1: four ways to classify the F1 results
Figure BDA0003042078160000101
As the table shows, audio achieves a higher classification result (78%) than lyrics (70%) because audio carries more information, but the lyric information is nevertheless an important part of what characterizes a music style. Fusing audio and lyrics gives better results than either single modality, showing that both the splicing approach and the attention fusion approach benefit the music style classification task. In terms of the classification F1 score, the attention fusion model reaches 84%, 2% higher than the simple splicing operation, which indicates that the attention fusion model accurately extracts the interaction information between the modalities; this fusion is more beneficial to the music style classification task and effectively enhances the practicality of the classification model.
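For reference, the F1 values can be computed as below with scikit-learn; macro-averaging over the five genre classes is an assumption, since the patent does not state the averaging mode.

```python
from sklearn.metrics import f1_score

# Genre indices: 0 Hip-Hop, 1 Metal, 2 Country, 3 Folk, 4 Jazz (toy example)
y_true = [0, 1, 2, 3, 4, 0, 1, 2]   # ground-truth labels
y_pred = [0, 1, 2, 3, 3, 0, 1, 2]   # model predictions

print(f1_score(y_true, y_pred, average="macro"))  # macro-averaged F1 over the 5 classes
```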
Experiment two:
the experimental scheme is as follows:
1: single-modality classification experiments are carried out separately on audio and on lyrics to test the single-modality classification effect;
2: an attention fusion experiment is carried out to test the multi-modal fusion classification effect;
3: the lyric text is compared experimentally under BOW (bag of words) pre-processing and BERT pre-training;
4: direct splicing and attention fusion are compared experimentally;
5: comparison with existing schemes.
table 2: comparison of F1 value results for each experiment
Lyric of a song Audio frequency Splicing Attention fusion
BOW_Mel
64% 78% 79% 81.5%
BERT_Mel 70% 78% 82% 84.4%
BOW _ Mel: indicating that the lyrics are processed by BOW, and the audio frequency adopts Mel frequency spectrum;
BERT _ Mel: the lyrics are processed by BERT, and the audio frequency adopts Mel frequency spectrum;
As the table shows, processing the lyrics with BERT reaches 70%, 6% higher than the BOW method; all attention-fusion results are higher than those of direct feature splicing, verifying the advancement of the scheme.
The present invention and its embodiments have been described above schematically and without limitation; what is shown in the drawings is only one embodiment of the present invention, and the actual structure is not limited thereto. Therefore, if a person skilled in the art, enlightened by this disclosure and without departing from the spirit of the invention, devises structural modes and embodiments similar to this technical solution without inventive effort, they shall fall within the protection scope of the invention.

Claims (7)

1. An attention-based multi-modal music style classification method using audio and lyrics, characterized in that the method comprises the following steps:
first, data acquisition: acquiring a data set in which audio and lyrics correspond;
second, audio preprocessing: mel-spectrogram features are first extracted from the audio data, and audio features are then further obtained through a CNN;
third, lyric preprocessing: BERT pre-training is performed to obtain the word vectors of the words, and lyric feature vectors are then obtained through an HAN network;
fourth, inter-modal attention fusion: the acquired audio and lyric features are interactively fused through attention-based inter-modal fusion to obtain a fused attention vector, which is concatenated with the audio and lyric feature vectors so that the music style features contain both the individual modal features of audio and lyrics and the inter-modal fusion features;
fifth, classification through a softmax layer.
2. The attention-based multi-modal music style classification method using audio and lyrics of claim 1, characterized in that in the first step the corresponding songs are retrieved and downloaded according to the singer and song title information in the MetroLyrics data set, enabling batch downloading; the corresponding lyric data are combined with the song's audio data according to the serial number of the song to obtain a data set in which audio and lyrics correspond.
3. The attention-based multi-modal music style classification method using audio and lyrics of claim 1, characterized in that in the second step the parameters of the mel-spectrogram feature extraction are as follows:
the sampling rate of the audio data is 22.5 kHz, and the audio signal is downsampled to 12 kHz;
the frame length is 21 ms;
short-time Fourier transform window N_FFT = 512;
frame shift HOP_LEN = 256;
mel scale mel_scale = 125;
number of mel bands N_MELS = 96;
the audio feature size of one song is (96, 1366).
4. The attention-based multi-modal music style classification method using audio and lyrics of claim 1, characterized in that in the second step the CNN network structure is: five convolution layers with 3x3 kernels and four pooling layers, followed by a fully connected layer; the output tensor dimension is 64, so the features of each song become a 64-dimensional vector.
5. The attention-based multi-modal music style classification method using audio and lyrics of claim 1, characterized in that in the third step the parameters of the BERT pre-training model are as follows:
uncased BERT-base (12-layer, 768-hidden, 12-heads),
i.e. the case-insensitive BERT-base model (12 Transformer encoder layers, hidden size 768, 12 attention heads);
the BERT pre-training model represents each word of the input lyric text as a 768-dimensional vector, converting character symbols into numbers that a computer can process while implicitly encoding the contextual semantic information of each word.
6. The attention-based multi-modal music style classification method using audio and lyrics of claim 1, characterized in that in the third step the HAN network parameters are:
BERT word vector dimension 768, maximum number of clauses 66, maximum number of words per clause 300, Bi_GRU hidden units 100, attention output dimension 200;
each lyric text is divided into sentences, and every word of each sentence is embedded using the BERT pre-trained word vectors; each clause is then passed through a Bi_GRU layer for word-level encoding (word encoder), and the word attention feature of each clause is computed; after the word-level encoding of every clause of each lyric yields the word attention features, the clauses are passed through a Bi_GRU layer for sentence-level encoding (sentence encoder), and the sentence attention feature of each lyric is computed.
7. The attention-based multi-modal music style classification method using audio and lyrics of claim 1, characterized in that in the fourth step the inter-modal attention is calculated as follows:
let u denote the number of songs, V the audio data, T the lyric data, and d = 64 the dimension of the last fully connected layer, so that V is the u x d matrix of 64-dimensional audio feature vectors and T is the u x d matrix of 64-dimensional lyric feature vectors;
(1) compute the matrices $M_1$ and $M_2$:
$M_1 = V T^{\top}$ and $M_2 = T V^{\top}$;
(2) compute the probability distributions $N_1$ and $N_2$ of $M_1$ and $M_2$ using softmax:
$N_1(i,j) = \dfrac{e^{M_1(i,j)}}{\sum_{k=1}^{u} e^{M_1(i,k)}}$,  $N_2(i,j) = \dfrac{e^{M_2(i,j)}}{\sum_{k=1}^{u} e^{M_2(i,k)}}$,
where i indexes the i-th of the u audio samples, j indexes the j-th of the u lyric samples, and k is a summation variable running from 1 to u;
(3) compute the modality representation matrices $O_1$ and $O_2$:
$O_1 = N_1 T$ and $O_2 = N_2 V$;
(4) compute the attention matrices $A_1$ and $A_2$:
$A_1 = O_1 \odot V$ and $A_2 = O_2 \odot T$;
(5) compute the inter-modal attention matrix $B_{AVT}$:
$B_{AVT} = \mathrm{concat}[A_1, A_2]$.
The inter-modal attention computation first obtains the matrices $M_1$ and $M_2$ by matrix multiplication; softmax then yields the probability distributions $N_1$ and $N_2$, which serve as weight coefficients for the modal features; cross-modal multiplication produces the new modality representation matrices $O_1$ and $O_2$; element-wise multiplication with the original features gives the attention matrices $A_1$ and $A_2$.
V, T, $A_1$ and $A_2$ are concatenated to form the final feature vector.
CN202110460027.9A (filed 2021-04-27, priority 2021-04-27): Attention-based multi-modal music style classification method using audio and lyrics; status: Active; granted as CN113177136B (en)

Priority Applications (1)

Application Number: CN202110460027.9A; Priority Date: 2021-04-27; Filing Date: 2021-04-27; Title: Attention-based multi-modal music style classification method using audio and lyrics; Granted as: CN113177136B (en)

Applications Claiming Priority (1)

Application Number: CN202110460027.9A; Priority Date: 2021-04-27; Filing Date: 2021-04-27; Title: Attention-based multi-modal music style classification method using audio and lyrics; Granted as: CN113177136B (en)

Publications (2)

Publication Number Publication Date
CN113177136A: 2021-07-27
CN113177136B (en): 2022-04-22

Family

ID=76926677

Family Applications (1)

Application Number: CN202110460027.9A; Status: Active; Priority Date: 2021-04-27; Filing Date: 2021-04-27; Title: Attention-based multi-modal music style classification method using audio and lyrics; Granted as: CN113177136B (en)

Country Status (1)

Country Link
CN (1) CN113177136B (en)



Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140040312A1 (en) * 2009-04-23 2014-02-06 Glace Holding Llc Systems and methods for storage of declarative knowledge accessible by natural language in a computer capable of appropriately responding
US20140258856A1 (en) * 2013-03-06 2014-09-11 Nuance Communications, Inc, Task assistant including navigation control
CN107111642A (en) * 2014-12-31 2017-08-29 Pcms控股公司 For creating the system and method for listening to daily record and music libraries
WO2019222759A1 (en) * 2018-05-18 2019-11-21 Synaptics Incorporated Recurrent multimodal attention system based on expert gated networks
CN111222009A (en) * 2019-10-25 2020-06-02 汕头大学 Processing method of multi-modal personalized emotion based on long-time memory mechanism
CN111428044A (en) * 2020-03-06 2020-07-17 中国平安人寿保险股份有限公司 Method, device, equipment and storage medium for obtaining supervision identification result in multiple modes
CN111414513A (en) * 2020-03-16 2020-07-14 腾讯音乐娱乐科技(深圳)有限公司 Music genre classification method and device and storage medium
CN111428074A (en) * 2020-03-20 2020-07-17 腾讯科技(深圳)有限公司 Audio sample generation method and device, computer equipment and storage medium
CN111753549A (en) * 2020-05-22 2020-10-09 江苏大学 Multi-mode emotion feature learning and recognition method based on attention mechanism
CN112203122A (en) * 2020-10-10 2021-01-08 腾讯科技(深圳)有限公司 Artificial intelligence-based similar video processing method and device and electronic equipment
CN112487237A (en) * 2020-12-14 2021-03-12 重庆邮电大学 Music classification method based on self-adaptive CNN and semi-supervised self-training model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
倪璐 (Ni Lu): "Design of a music emotion classification method based on the dual modalities of audio and lyrics", 《自动化技术与应用》 (Techniques of Automation and Applications) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627377A (en) * 2021-08-18 2021-11-09 福州大学 Cognitive radio frequency spectrum sensing method and system Based on Attention-Based CNN
CN113780811A (en) * 2021-09-10 2021-12-10 平安科技(深圳)有限公司 Musical instrument performance evaluation method, device, equipment and storage medium
CN113780811B (en) * 2021-09-10 2023-12-26 平安科技(深圳)有限公司 Musical instrument performance evaluation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113177136B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN110674339B (en) Chinese song emotion classification method based on multi-mode fusion
CN109766544B (en) Document keyword extraction method and device based on LDA and word vector
CN111460213B (en) Music emotion classification method based on multi-modal learning
CN109933686B (en) Song label prediction method, device, server and storage medium
CN112199956B (en) Entity emotion analysis method based on deep representation learning
CN107315737A (en) A kind of semantic logic processing method and system
CN107357837A (en) The electric business excavated based on order-preserving submatrix and Frequent episodes comments on sensibility classification method
CN113177136B (en) Multi-mode music style classification method based on attention audio frequency and lyrics
CN110263325A (en) Chinese automatic word-cut
CN111291188B (en) Intelligent information extraction method and system
CN110750635B (en) French recommendation method based on joint deep learning model
CN112487237B (en) Music classification method based on self-adaptive CNN and semi-supervised self-training model
CN112256861B (en) Rumor detection method based on search engine return result and electronic device
CN111190997A (en) Question-answering system implementation method using neural network and machine learning sequencing algorithm
CN112487189B (en) Implicit discourse text relation classification method for graph-volume network enhancement
CN112434164B (en) Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration
CN110472245B (en) Multi-label emotion intensity prediction method based on hierarchical convolutional neural network
CN111666376B (en) Answer generation method and device based on paragraph boundary scan prediction and word shift distance cluster matching
CN112905739A (en) False comment detection model training method, detection method and electronic equipment
CN111737414A (en) Song recommendation method and device, server and storage medium
CN115422947A (en) Ancient poetry assignment method and system based on deep learning
CN115952292A (en) Multi-label classification method, device and computer readable medium
Vayadande et al. Mood Detection and Emoji Classification using Tokenization and Convolutional Neural Network
CN116933782A (en) E-commerce text keyword extraction processing method and system
CN115906824A (en) Text fine-grained emotion analysis method, system, medium and computing equipment

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
EE01: Entry into force of recordation of patent licensing contract

Application publication date: 20210727

Assignee: Guilin Zhongchen Information Technology Co.,Ltd.

Assignor: GUILIN University OF ELECTRONIC TECHNOLOGY

Contract record no.: X2022450000215

Denomination of invention: A Multi-Modal Music Style Classification Method Based on Attention, Audio and Lyrics

Granted publication date: 20220422

License type: Common License

Record date: 20221206