CN113177136A - Attention-based multi-modal music style classification method using audio and lyrics

Attention-based multi-modal music style classification method using audio and lyrics

Info

Publication number
CN113177136A
CN113177136A (application CN202110460027.9A; granted as CN113177136B)
Authority
CN
China
Prior art keywords
attention
audio
lyric
word
lyrics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110460027.9A
Other languages
Chinese (zh)
Other versions
CN113177136B (en)
Inventor
李优
张志海
常亮
林煜明
周娅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology filed Critical Guilin University of Electronic Technology
Priority to CN202110460027.9A
Publication of CN113177136A
Application granted
Publication of CN113177136B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 16/65 - Information retrieval of audio data: clustering; classification
    • G06F 16/683 - Retrieval of audio data characterised by using metadata automatically derived from the content
    • G06F 16/685 - Retrieval of audio data using an automatically derived transcript of the audio, e.g. lyrics
    • G06F 18/2415 - Classification techniques based on parametric or probabilistic models, e.g. likelihood ratio or false acceptance rate versus false rejection rate
    • G06F 40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/216 - Parsing using statistical methods
    • G06F 40/30 - Semantic analysis
    • G06N 3/045 - Neural network architectures: combinations of networks
    • G06N 3/047 - Probabilistic or stochastic networks
    • G06N 3/08 - Learning methods

Abstract

The invention relates to the technical field of music style classification, and in particular to an attention-based multi-modal music style classification method using audio and lyrics, which comprises the following steps: first, data acquisition; second, audio preprocessing: mel-spectrogram features are extracted from the audio data and audio features are then obtained through a CNN; third, lyric preprocessing: BERT pre-training yields word vectors, and lyric feature vectors are then obtained through an HAN network; fourth, inter-modal attention fusion: the acquired audio and lyric features are interactively fused through attention-based inter-modal fusion to obtain a fused attention vector, which is concatenated with the audio and lyric feature vectors so that the resulting music style features contain both the individual modal features of audio and lyrics and the inter-modal fusion features; fifth, classification through a softmax layer. The invention enables the data to be learned and classified more effectively.

Description

Attention-based multi-modal music style classification method using audio and lyrics
Technical Field
The invention relates to the technical field of music style classification, and in particular to an attention-based multi-modal music style classification method using audio and lyrics.
Background
Music style classification is an important component of music information retrieval and music recommendation. With the rapid development of internet technology and digital multimedia, the demands on the efficiency of music style classification keep increasing. However, existing music style classification techniques that use only audio features cannot fully represent a music style; the semantic information contained in the lyrics also partially characterizes the style of a song, so the lyric content needs to be taken into account. Few techniques fuse audio and lyric information, mainly because lyrics are difficult to process. Common lyric processing methods such as BOW (bag of words) and Word2Vec (word vectors) cannot capture the contextual semantic information of a sentence, yet for lyrics the contextual semantics matter greatly for the music style. As for audio-lyric fusion, current approaches rely on simple operations such as feature-level concatenation and decision-level weighting; these operations amount to a simple ensemble of sub-classifiers, do not truly fuse the audio and lyric information, and improve the results only slightly. Multi-modal music style classification based on audio and lyrics therefore still needs to be improved.
Disclosure of Invention
It is an object of the present invention to provide an attention-based multi-modal music style classification method using audio and lyrics that overcomes some or all of the deficiencies of the prior art.
According to the invention, the attention-based multi-modal music style classification method using audio and lyrics comprises the following steps:
first, data acquisition: acquiring a data set in which audio and lyrics correspond;
second, audio preprocessing: mel-spectrogram features are first extracted from the audio data, and audio features are then further obtained through a CNN;
third, lyric preprocessing: BERT pre-training is performed to obtain the word vectors of the words, and lyric feature vectors are then obtained through an HAN network;
fourth, inter-modal attention fusion: the acquired audio and lyric features are interactively fused through attention-based inter-modal fusion to obtain a fused attention vector, which is concatenated with the audio and lyric feature vectors so that the music style features contain both the individual modal features of audio and lyrics and the inter-modal fusion features;
fifth, classification through a softmax layer.
Preferably, in the first step, the corresponding songs are retrieved and downloaded according to the singer and song title information in the MetroLyrics data set, enabling batch downloading; the corresponding lyric data are then combined with the song's audio data according to the serial number of the song to obtain a data set in which audio and lyrics correspond.
Preferably, in the second step, the parameters of the mel-spectrogram feature extraction are as follows:
the sampling rate of the audio data is 22.5 kHz, and the audio signal is downsampled to 12 kHz;
the frame length is 21 ms;
short-time Fourier transform window N_FFT = 512;
frame shift HOP_LEN = 256;
mel scale mel_scale = 125;
number of mel bands N_MELS = 96;
the audio feature size of one song is (96, 1366).
Preferably, in step two, the CNN network structure is: five convolution layers with 3x3 kernels and four pooling layers, followed by a fully connected layer; the output tensor dimension is 64, so the features of each song become a 64-dimensional vector.
Preferably, in step three, the parameters of the BERT pre-training model are as follows:
uncased BERT-base (12-layer, 768-hidden, 12-heads),
i.e. the case-insensitive BERT-base model (12 Transformer encoder layers, hidden size 768, 12 attention heads);
the BERT pre-training model represents each word of the input lyric text as a 768-dimensional vector, converting character symbols into numbers that a computer can process while implicitly encoding the contextual semantic information of each word.
Preferably, in step three, the HAN network parameters are:
BERT word vector dimension 768, maximum number of clauses 66, maximum number of words per clause 300, Bi_GRU hidden units 100, attention output dimension 200;
each lyric text is divided into sentences, and every word of each sentence is embedded using the BERT pre-trained word vectors; each clause is then passed through a Bi_GRU layer for word-level encoding (word encoder), and the word attention feature of each clause is computed; after the word-level encoding of every clause of each lyric yields the word attention features, the clauses are passed through a Bi_GRU layer for sentence-level encoding (sentence encoder), and the sentence attention feature of each lyric is computed.
Preferably, in the fourth step, the inter-modal attention is calculated as follows:
let u denote the number of songs, V the audio data, T the lyric data, and d = 64 the dimension of the last fully connected layer, so that V is the u x d matrix of 64-dimensional audio feature vectors and T is the u x d matrix of 64-dimensional lyric feature vectors;
(1) compute the matrices $M_1$ and $M_2$:
$M_1 = V T^{\top}$ and $M_2 = T V^{\top}$;
(2) compute the probability distributions $N_1$ and $N_2$ of $M_1$ and $M_2$ using softmax:
$N_1(i,j) = \dfrac{e^{M_1(i,j)}}{\sum_{k=1}^{u} e^{M_1(i,k)}}$,  $N_2(i,j) = \dfrac{e^{M_2(i,j)}}{\sum_{k=1}^{u} e^{M_2(i,k)}}$,
where i indexes the i-th of the u audio samples, j indexes the j-th of the u lyric samples, and k is a summation variable running from 1 to u;
(3) compute the modality representation matrices $O_1$ and $O_2$:
$O_1 = N_1 T$ and $O_2 = N_2 V$;
(4) compute the attention matrices $A_1$ and $A_2$:
$A_1 = O_1 \odot V$ and $A_2 = O_2 \odot T$;
(5) compute the inter-modal attention matrix $B_{AVT}$:
$B_{AVT} = \mathrm{concat}[A_1, A_2]$.
The inter-modal attention computation first obtains the matrices $M_1$ and $M_2$ by matrix multiplication; softmax then yields the probability distributions $N_1$ and $N_2$, which serve as weight coefficients for the modal features; cross-modal multiplication produces the new modality representation matrices $O_1$ and $O_2$; element-wise multiplication with the original features gives the attention matrices $A_1$ and $A_2$.
V, T, $A_1$ and $A_2$ are concatenated to form the final feature vector.
The method processes the audio and the lyrics separately from the acquired data set: Mel_spectrum extraction is used to obtain features from the audio data, and a BERT (Bidirectional Encoder Representations from Transformers) pre-training model is used to obtain the word-vector representation of the lyrics; a CNN (convolutional neural network) and an HAN (hierarchical attention network) are then used to obtain the feature vectors of the two modalities, which are fused at the feature level with inter-modal attention; finally the fused feature vectors are classified. On top of the traditional audio-based approach, lyric information is added and the two kinds of feature information are fused. Lyrics use text as a carrier and express the emotion, theme and style of a song. Audio and lyrics are data of two different modalities, and information fusion can better extract the interaction relationship between the modalities. The method not only extracts the important information of each modality but also fuses the relational information between the modalities, so the fused feature vector represents a song more effectively and the data can be learned and classified more easily.
Drawings
FIG. 1 is a flowchart of the attention-based multi-modal music style classification method using audio and lyrics in embodiment 1;
FIG. 2 is a diagram of the mel-spectrogram audio feature in embodiment 1;
FIG. 3 is a schematic diagram of the CNN network structure for processing the spectrogram in embodiment 1;
FIG. 4 is a diagram of the BERT pre-trained word vectors in embodiment 1;
FIG. 5 is a schematic diagram of the HAN (hierarchical attention) network in embodiment 1;
FIG. 6 is a flowchart of the inter-modal attention calculation in embodiment 1;
FIG. 7 is a block diagram of the overall attention-fusion network.
Detailed Description
For a further understanding of the invention, reference should be made to the following detailed description taken in conjunction with the accompanying drawings and examples. It is to be understood that the examples are illustrative of the invention and not limiting.
Embodiment 1
As shown in FIG. 1, this embodiment provides an attention-based multi-modal music style classification method using audio and lyrics, which comprises the following steps:
First, data acquisition: acquiring a data set in which audio and lyrics correspond;
one of the difficult problems with multimodal music styles is obtaining a usable data set with fewer data sets sorted about the music style. The data source is MetroLyrics (https:// www.metrolyrics.com /) lyrics for over 380000 songs obtained on a lyrics collection website, but the corresponding audio needs to be obtained. And retrieving and downloading the corresponding songs according to the singer and the song name information in the metroletics data set, thereby realizing batch downloading. And then the corresponding audio data and the lyric data are combined together according to the serial number of the song, and a data set corresponding to the audio and the lyric is obtained. The data set is 5 categories, each category being 1000. The 5 categories are Hip-Hop, Metal, Country, Folk, Jazz, respectively.
Second, audio preprocessing: mel-spectrogram features are first extracted from the audio data, and audio features are then further obtained through a CNN.
The parameters of the mel (Mel) spectrogram feature extraction are as follows (a minimal extraction sketch follows the list):
the sampling rate of the audio data is 22.5 kHz, and the audio signal is downsampled to 12 kHz;
the frame length is 21 ms;
short-time Fourier transform window N_FFT = 512;
frame shift HOP_LEN = 256;
mel scale mel_scale = 125;
number of mel bands N_MELS = 96;
the audio feature size of one song is (96, 1366).
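For reference, the extraction can be reproduced roughly as below. This is a minimal sketch assuming the librosa library (the patent does not name an implementation library); the fixed-length padding/cropping to 1366 frames is an illustrative assumption, and the mel_scale parameter above has no direct librosa equivalent and is omitted here.

```python
import numpy as np
import librosa


def extract_mel_features(path, sr=12000, n_fft=512, hop_len=256, n_mels=96, n_frames=1366):
    """Load a song, downsample it and compute a log-scaled mel spectrogram of
    fixed size (96, 1366), mirroring the parameters listed above."""
    # Load and resample the audio (librosa resamples to the requested rate)
    y, _ = librosa.load(path, sr=sr, mono=True)

    # Mel power spectrogram: (n_mels, variable number of frames)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_len, n_mels=n_mels)

    # Convert power to decibels (the energy axis of the mel-spectrogram figure)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # Pad or crop along the time axis to the fixed (96, 1366) feature size
    if mel_db.shape[1] < n_frames:
        mel_db = np.pad(mel_db, ((0, 0), (0, n_frames - mel_db.shape[1])))
    return mel_db[:, :n_frames]
```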
After feature extraction, the mel-spectrogram audio feature shown in FIG. 2 is obtained; the horizontal axis of the mel-spectrogram represents the time axis (30 seconds), the vertical axis represents the frequency axis in Hz (hertz), and the scale on the right represents the energy value in dB (decibels).
The CNN network structure for processing the spectrogram is shown in FIG. 3; a two-dimensional CNN model further extracts features from the input representation of the audio signal. The network has five convolution layers with 3x3 kernels and four pooling layers, followed by a fully connected layer; the output tensor dimension is 64, so the audio features of each song become a 64-dimensional vector (a code sketch follows the figure description below). In FIG. 3:
input: the input data, a two-dimensional spectrogram;
output: a one-dimensional feature vector;
frequency: the frequency axis;
time: the time axis;
n: the number of network layers.
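Below is a minimal PyTorch sketch of such a CNN. The patent fixes only the 3x3 kernels, the counts of convolution and pooling layers, and the 64-dimensional output; the channel widths, pooling sizes and global average pooling used here are assumptions.

```python
import torch
import torch.nn as nn


class AudioCNN(nn.Module):
    """Sketch of the audio branch: five 3x3 convolution layers, four pooling
    layers, and a fully connected layer producing a 64-dimensional vector."""

    def __init__(self, out_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 4)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 4)),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 4)),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((3, 5)),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(64, out_dim)  # 64 channels remain after global pooling

    def forward(self, x):            # x: (batch, 1, 96, 1366) mel spectrogram
        h = self.features(x)
        h = h.mean(dim=(2, 3))       # global average pooling over frequency and time
        return self.fc(h)            # (batch, 64) audio feature vector
```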
Third, lyric preprocessing: BERT (Bidirectional Encoder Representations from Transformers) pre-training is performed to obtain the word vectors of the words, and lyric feature vectors are then obtained through an HAN (hierarchical attention) network.
Compared with the traditional BOW (bag of words) approach, the BERT pre-training model can acquire the contextual semantic information of the text and learn sentence-level information, so the extracted features are more representative. It also avoids the pitfall of purely statistical methods, which capture only word-frequency information and ignore the semantics of the text. The HAN (hierarchical attention) network is used to better capture the information among the words and sentences of the lyrics, acquiring the implicit contextual information of preceding and following sentences through the two levels of words and sentences.
BERT pre-training model parameters:
uncased BERT-base(12-layer,768-hidden,12-heads);
i.e. the case-insensitive BERT-base model (12 Transformer encoder layers, hidden size 768, 12 attention heads);
uncased BERT-base is the name of the released BERT pre-training model used here.
The BERT pre-training model represents each word of the input lyric text as a 768-dimensional vector, converting character symbols into numbers that a computer can process while implicitly encoding the contextual semantic information of each word; the words in the lyric text are then replaced by the word vectors obtained from BERT pre-training and assembled into a data matrix following the content of the lyric text, giving a numerical representation of the lyric text. A series of feature extraction operations can then be performed on this lyric data (a minimal extraction sketch follows the figure description below).
FIG. 4 is a diagram of the BERT pre-trained word vectors, in which: sentence 1 and sentence 2 are the two input sentences; w1, w2, w3, w4 and w5 denote words; [CLS] is the special BERT token used for classification tasks; [SEP] is the BERT separator token that separates the two sentences of the input corpus.
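A minimal sketch of obtaining the 768-dimensional contextual word vectors is shown below; it assumes the Hugging Face transformers implementation of the bert-base-uncased checkpoint, which the patent itself does not prescribe.

```python
import torch
from transformers import BertTokenizer, BertModel

# Assumed implementation: the public bert-base-uncased checkpoint from Hugging Face
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()


def lyric_word_vectors(clause: str) -> torch.Tensor:
    """Return one 768-dimensional contextual vector per token of a lyric clause."""
    inputs = tokenizer(clause, return_tensors="pt", truncation=True, max_length=300)
    with torch.no_grad():
        outputs = bert(**inputs)
    # last_hidden_state: (1, num_tokens, 768), including the [CLS] and [SEP] tokens
    return outputs.last_hidden_state.squeeze(0)


vectors = lyric_word_vectors("hello darkness my old friend")
print(vectors.shape)  # torch.Size([num_tokens, 768])
```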
The acquired lyric text data undergoes two-level (word and sentence) information extraction through the HAN; the attention mechanism can pick out the parts of a sentence that are closely related to the music style and give their features a higher weight so that they stand out.
Each lyric text is divided into sentences, and every word of each sentence is embedded using the BERT pre-trained word vectors; then, following the network structure of FIG. 5, each clause is passed through a Bi_GRU (bidirectional gated recurrent unit) layer for word-level encoding (word encoder), and the word attention feature of each clause is computed; after the word-level encoding of every clause of each lyric yields the word attention features, the clauses are passed through a Bi_GRU layer for sentence-level encoding (sentence encoder), and the sentence attention feature of each lyric is computed. The features of the lyric text are thus obtained hierarchically, extracting the positional information of the words and the contextual semantic information of the sentences (a code sketch of this network follows the figure description below).
The HAN (hierarchical attention network) parameters are:
BERT word vector dimension 768, maximum number of clauses 66, maximum number of words per clause 300, Bi_GRU hidden units 100, attention output dimension 200.
In FIG. 5:
word encoder: the word-level encoder;
word attention: the word-level attention layer;
sentence encoder: the sentence-level encoder;
sentence attention: the sentence-level attention layer;
softmax: the normalized exponential function used as the classification-layer activation in the multi-class task;
w denotes words, s denotes sentences, h denotes the hidden states of the recurrent network, and v is the feature vector of the lyric text;
h_1 to h_L: hidden states of the sentence-level Bi_GRU, where L is the number of clauses;
h_21 to h_2T: hidden states of the word-level Bi_GRU, where T is the number of words;
the left and right arrows indicate the forward and backward directions of the bidirectional GRU;
u_s: the sentence attention context vector; alpha_1 to alpha_L: the sentence attention weights;
u_w: the word attention context vector; alpha_21 to alpha_2T: the word attention weights;
s_1 to s_L: the clauses into which the lyric text is divided;
w_21 to w_2T: the word vectors of one clause.
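A minimal PyTorch sketch of this hierarchical attention network is given below. The additive attention pooling (tanh projection plus context vector) and the exact wiring are assumptions consistent with the structure described above; in the overall network of FIG. 7 a further dense layer would map the 200-dimensional lyric feature down to the 64 dimensions used in the fusion step.

```python
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Additive attention pooling used at both the word and sentence levels."""

    def __init__(self, hidden_dim, attn_dim=200):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, attn_dim)
        self.context = nn.Linear(attn_dim, 1, bias=False)  # u_w / u_s context vector

    def forward(self, h):                                  # h: (batch, seq_len, hidden_dim)
        scores = self.context(torch.tanh(self.proj(h)))    # (batch, seq_len, 1)
        alpha = torch.softmax(scores, dim=1)               # attention weights
        return (alpha * h).sum(dim=1)                      # weighted sum over the sequence


class HAN(nn.Module):
    """Sketch of the hierarchical attention network: Bi-GRU word encoder with
    word attention, then Bi-GRU sentence encoder with sentence attention."""

    def __init__(self, word_dim=768, gru_units=100):
        super().__init__()
        self.word_gru = nn.GRU(word_dim, gru_units, bidirectional=True, batch_first=True)
        self.word_attn = AttentionPool(2 * gru_units)
        self.sent_gru = nn.GRU(2 * gru_units, gru_units, bidirectional=True, batch_first=True)
        self.sent_attn = AttentionPool(2 * gru_units)

    def forward(self, lyrics):   # lyrics: (batch, n_sentences, n_words, 768) BERT vectors
        b, s, w, d = lyrics.shape
        words, _ = self.word_gru(lyrics.view(b * s, w, d))  # word-level encoding
        sent_vecs = self.word_attn(words).view(b, s, -1)    # one vector per clause
        sents, _ = self.sent_gru(sent_vecs)                 # sentence-level encoding
        return self.sent_attn(sents)                        # (batch, 200) lyric feature
```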
Fourth, inter-modal attention fusion: the acquired audio and lyric features are interactively fused through attention-based inter-modal fusion to obtain a fused attention vector, which is concatenated with the audio and lyric feature vectors so that the music style features contain both the individual modal features of audio and lyrics and the inter-modal fusion features.
the audio and the lyrics are two different information modes, and the data of the two modes contain the style information of the music, so that the prediction of the music style by fusing the two information can be better represented, and only one information is not enough. The current fusion scheme is usually to simply splice the acquired audio and lyric features directly, and this method only acquires one splicing vector. Interaction fusion of information between the modalities is not carried out, the audio and the lyrics have a time corresponding relation, and interaction information between the two modalities is considered besides the acquisition of information of the respective modalities. Through the integration of the Attention model, the acquired audio frequency and lyric characteristics are interactively integrated to acquire an integrated Attention vector, and then the integrated Attention vector is spliced with the audio frequency and lyric characteristic vectors to acquire the music style characteristics including respective modal characteristics of the audio frequency and the lyric and the integrated characteristics among the modalities.
The inter-modal attention is calculated as follows:
let u denote the number of songs, V the audio data, T the lyric data, and d = 64 the dimension of the last fully connected layer, so that V is the u x d matrix of 64-dimensional audio feature vectors and T is the u x d matrix of 64-dimensional lyric feature vectors;
(1) compute the matrices $M_1$ and $M_2$:
$M_1 = V T^{\top}$ and $M_2 = T V^{\top}$;
(2) compute the probability distributions $N_1$ and $N_2$ of $M_1$ and $M_2$ using softmax:
$N_1(i,j) = \dfrac{e^{M_1(i,j)}}{\sum_{k=1}^{u} e^{M_1(i,k)}}$,  $N_2(i,j) = \dfrac{e^{M_2(i,j)}}{\sum_{k=1}^{u} e^{M_2(i,k)}}$,
where i indexes the i-th of the u audio samples, j indexes the j-th of the u lyric samples, and k is a summation variable running from 1 to u;
(3) compute the modality representation matrices $O_1$ and $O_2$:
$O_1 = N_1 T$ and $O_2 = N_2 V$;
(4) compute the attention matrices $A_1$ and $A_2$:
$A_1 = O_1 \odot V$ and $A_2 = O_2 \odot T$;
(5) compute the inter-modal attention matrix $B_{AVT}$:
$B_{AVT} = \mathrm{concat}[A_1, A_2]$.
The flowchart of the inter-modal attention calculation is shown in FIG. 6: the matrices $M_1$ and $M_2$ are first obtained by matrix multiplication; softmax then yields the probability distributions $N_1$ and $N_2$, which serve as weight coefficients for the modal features; cross-modal multiplication produces the new modality representation matrices $O_1$ and $O_2$; element-wise multiplication with the original features gives the attention matrices $A_1$ and $A_2$; finally V, T, $A_1$ and $A_2$ are concatenated into the final feature vector (a code sketch of these steps follows the figure description). In FIG. 6:
row softmax: softmax computed row by row to obtain the probability distributions;
matmul: matrix multiplication; elemwise mul: element-wise multiplication.
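The five steps can be sketched in a few lines of PyTorch; this is a minimal illustration of the computation described above, with u songs per batch and d = 64.

```python
import torch


def intermodal_attention_fusion(V: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    """Sketch of inter-modal attention steps (1)-(5): V is the (u, d) audio
    feature matrix and T the (u, d) lyric feature matrix."""
    M1 = V @ T.t()                       # (1) cross-modal similarity matrices, (u, u)
    M2 = T @ V.t()

    N1 = torch.softmax(M1, dim=1)        # (2) row-wise softmax: weight coefficients
    N2 = torch.softmax(M2, dim=1)

    O1 = N1 @ T                          # (3) cross-weighted modality representations
    O2 = N2 @ V

    A1 = O1 * V                          # (4) element-wise product with the original features
    A2 = O2 * T

    B_AVT = torch.cat([A1, A2], dim=1)   # (5) inter-modal attention matrix, (u, 2d)

    # Final fused feature: V, T, A1 and A2 concatenated, fed to the softmax layer
    return torch.cat([V, T, B_AVT], dim=1)
```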
Fifth, classification through a softmax layer.
FIG. 7 is a block diagram of the overall attention-fusion network, in which:
V denotes the audio data and T denotes the lyric text; the lyric text is pre-trained with BERT and the lyric features are extracted with the hierarchical attention network; the audio and lyric features are fused by the attention-based inter-modal fusion method.
Mel-Spect: the mel spectrogram; CNN: convolutional neural network; Dense: a fully connected one-dimensional hidden layer; BERT: the pre-trained word-vector model; HAN: hierarchical attention network; Attention concat: the attention fusion and concatenation layer; softmax: the classification layer.
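Putting the pieces together, a sketch of the overall network of FIG. 7 might look like the following; it reuses the AudioCNN, HAN and intermodal_attention_fusion sketches defined earlier, and the 64-dimensional dense projection of the lyric feature and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn


class MusicStyleClassifier(nn.Module):
    """End-to-end sketch: audio branch (CNN), lyric branch (BERT vectors -> HAN -> dense),
    inter-modal attention fusion, and a softmax classification layer for 5 genres."""

    def __init__(self, n_classes=5, d=64):
        super().__init__()
        self.audio_net = AudioCNN(out_dim=d)           # from the earlier sketch
        self.lyric_net = HAN()                         # from the earlier sketch
        self.lyric_proj = nn.Linear(200, d)            # Dense layer mapping 200 -> 64
        self.classifier = nn.Linear(4 * d, n_classes)  # softmax layer over the fused vector

    def forward(self, mel, lyric_word_vectors):
        V = self.audio_net(mel)                                   # (u, 64) audio features
        T = self.lyric_proj(self.lyric_net(lyric_word_vectors))   # (u, 64) lyric features
        fused = intermodal_attention_fusion(V, T)                 # (u, 256) fused features
        # For training with nn.CrossEntropyLoss one would return the raw logits instead
        return torch.softmax(self.classifier(fused), dim=1)       # genre probabilities
```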
In this embodiment, the audio and the lyrics are processed separately from the acquired data set: Mel_spectrum extraction is used to obtain features from the audio data, and a BERT (Bidirectional Encoder Representations from Transformers) pre-training model is used to obtain the word-vector representation of the lyrics; a CNN (convolutional neural network) and an HAN (hierarchical attention network) are then used to obtain the feature vectors of the two modalities, which are fused at the feature level with inter-modal attention; finally the fused feature vectors are classified. On top of the traditional audio-based approach, lyric information is added and the two kinds of feature information are fused. Lyrics use text as a carrier and express the emotion, theme and style of a song. Audio and lyrics are data of two different modalities, and information fusion can better extract the interaction relationship between the modalities. The method not only extracts the important information of each modality but also fuses the relational information between the modalities, so the fused feature vector represents a song more effectively and the data can be learned and classified more easily.
Experiment one:
and comparing classification results of three modes of fusion among single audio frequency, single lyric and Attention mode.
The song data comprises 5000 pieces of data of 5 styles, 1000 songs respectively; the 5 styles are Hip-Hop, Metal, Country, Folk, Jazz, respectively;
comparison of the results of each classification F1:
table 1: four ways to classify the F1 results
Figure BDA0003042078160000101
As the table shows, audio achieves a higher classification result (78%) than lyrics (70%) because audio carries more information, but the lyric information is nevertheless an important part of what characterizes a music style. Fusing audio and lyrics gives better results than either single modality, showing that both the splicing approach and the attention fusion approach benefit the music style classification task. In terms of the classification F1 score, the attention fusion model reaches 84%, 2% higher than the simple splicing operation, which indicates that the attention fusion model accurately extracts the interaction information between the modalities; this fusion is more beneficial to the music style classification task and effectively enhances the practicality of the classification model.
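For reference, the F1 values can be computed as below with scikit-learn; macro-averaging over the five genre classes is an assumption, since the patent does not state the averaging mode.

```python
from sklearn.metrics import f1_score

# Genre indices: 0 Hip-Hop, 1 Metal, 2 Country, 3 Folk, 4 Jazz (toy example)
y_true = [0, 1, 2, 3, 4, 0, 1, 2]   # ground-truth labels
y_pred = [0, 1, 2, 3, 3, 0, 1, 2]   # model predictions

print(f1_score(y_true, y_pred, average="macro"))  # macro-averaged F1 over the 5 classes
```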
Experiment two:
the experimental scheme is as follows:
1: single-modality classification experiments are carried out separately on audio and on lyrics to test the single-modality classification effect;
2: an attention fusion experiment is carried out to test the multi-modal fusion classification effect;
3: the lyric text is compared experimentally under BOW (bag of words) pre-processing and BERT pre-training;
4: direct splicing and attention fusion are compared experimentally;
5: comparison with existing schemes.
table 2: comparison of F1 value results for each experiment
Lyric of a song Audio frequency Splicing Attention fusion
BOW_Mel
64% 78% 79% 81.5%
BERT_Mel 70% 78% 82% 84.4%
BOW _ Mel: indicating that the lyrics are processed by BOW, and the audio frequency adopts Mel frequency spectrum;
BERT _ Mel: the lyrics are processed by BERT, and the audio frequency adopts Mel frequency spectrum;
As the table shows, processing the lyrics with BERT reaches 70%, 6% higher than the BOW method; all attention-fusion results are higher than those of direct feature splicing, verifying the advancement of the scheme.
The present invention and its embodiments have been described above schematically and without limitation; what is shown in the drawings is only one embodiment of the present invention, and the actual structure is not limited thereto. Therefore, if a person skilled in the art, enlightened by this disclosure and without departing from the spirit of the invention, devises structural modes and embodiments similar to this technical solution without inventive effort, they shall fall within the protection scope of the invention.

Claims (7)

1. An attention-based multi-modal music style classification method using audio and lyrics, characterized in that the method comprises the following steps:
first, data acquisition: acquiring a data set in which audio and lyrics correspond;
second, audio preprocessing: mel-spectrogram features are first extracted from the audio data, and audio features are then further obtained through a CNN;
third, lyric preprocessing: BERT pre-training is performed to obtain the word vectors of the words, and lyric feature vectors are then obtained through an HAN network;
fourth, inter-modal attention fusion: the acquired audio and lyric features are interactively fused through attention-based inter-modal fusion to obtain a fused attention vector, which is concatenated with the audio and lyric feature vectors so that the music style features contain both the individual modal features of audio and lyrics and the inter-modal fusion features;
fifth, classification through a softmax layer.
2. The attention-based multi-modal music style classification method using audio and lyrics of claim 1, characterized in that in the first step the corresponding songs are retrieved and downloaded according to the singer and song title information in the MetroLyrics data set, enabling batch downloading; the corresponding lyric data are combined with the song's audio data according to the serial number of the song to obtain a data set in which audio and lyrics correspond.
3. The attention-based multi-modal music style classification method using audio and lyrics of claim 1, characterized in that in the second step the parameters of the mel-spectrogram feature extraction are as follows:
the sampling rate of the audio data is 22.5 kHz, and the audio signal is downsampled to 12 kHz;
the frame length is 21 ms;
short-time Fourier transform window N_FFT = 512;
frame shift HOP_LEN = 256;
mel scale mel_scale = 125;
number of mel bands N_MELS = 96;
the audio feature size of one song is (96, 1366).
4. The attention-based multi-modal music style classification method using audio and lyrics of claim 1, characterized in that in the second step the CNN network structure is: five convolution layers with 3x3 kernels and four pooling layers, followed by a fully connected layer; the output tensor dimension is 64, so the features of each song become a 64-dimensional vector.
5. The attention-based multi-modal music style classification method using audio and lyrics of claim 1, characterized in that in the third step the parameters of the BERT pre-training model are as follows:
uncased BERT-base (12-layer, 768-hidden, 12-heads),
i.e. the case-insensitive BERT-base model (12 Transformer encoder layers, hidden size 768, 12 attention heads);
the BERT pre-training model represents each word of the input lyric text as a 768-dimensional vector, converting character symbols into numbers that a computer can process while implicitly encoding the contextual semantic information of each word.
6. The attention-based multi-modal music style classification method using audio and lyrics of claim 1, characterized in that in the third step the HAN network parameters are:
BERT word vector dimension 768, maximum number of clauses 66, maximum number of words per clause 300, Bi_GRU hidden units 100, attention output dimension 200;
each lyric text is divided into sentences, and every word of each sentence is embedded using the BERT pre-trained word vectors; each clause is then passed through a Bi_GRU layer for word-level encoding (word encoder), and the word attention feature of each clause is computed; after the word-level encoding of every clause of each lyric yields the word attention features, the clauses are passed through a Bi_GRU layer for sentence-level encoding (sentence encoder), and the sentence attention feature of each lyric is computed.
7. The attention-based multi-modal music style classification method using audio and lyrics of claim 1, characterized in that in the fourth step the inter-modal attention is calculated as follows:
let u denote the number of songs, V the audio data, T the lyric data, and d = 64 the dimension of the last fully connected layer, so that V is the u x d matrix of 64-dimensional audio feature vectors and T is the u x d matrix of 64-dimensional lyric feature vectors;
(1) compute the matrices $M_1$ and $M_2$:
$M_1 = V T^{\top}$ and $M_2 = T V^{\top}$;
(2) compute the probability distributions $N_1$ and $N_2$ of $M_1$ and $M_2$ using softmax:
$N_1(i,j) = \dfrac{e^{M_1(i,j)}}{\sum_{k=1}^{u} e^{M_1(i,k)}}$,  $N_2(i,j) = \dfrac{e^{M_2(i,j)}}{\sum_{k=1}^{u} e^{M_2(i,k)}}$,
where i indexes the i-th of the u audio samples, j indexes the j-th of the u lyric samples, and k is a summation variable running from 1 to u;
(3) compute the modality representation matrices $O_1$ and $O_2$:
$O_1 = N_1 T$ and $O_2 = N_2 V$;
(4) compute the attention matrices $A_1$ and $A_2$:
$A_1 = O_1 \odot V$ and $A_2 = O_2 \odot T$;
(5) compute the inter-modal attention matrix $B_{AVT}$:
$B_{AVT} = \mathrm{concat}[A_1, A_2]$.
The inter-modal attention computation first obtains the matrices $M_1$ and $M_2$ by matrix multiplication; softmax then yields the probability distributions $N_1$ and $N_2$, which serve as weight coefficients for the modal features; cross-modal multiplication produces the new modality representation matrices $O_1$ and $O_2$; element-wise multiplication with the original features gives the attention matrices $A_1$ and $A_2$.
V, T, $A_1$ and $A_2$ are concatenated to form the final feature vector.
CN202110460027.9A (filed 2021-04-27, priority 2021-04-27): Attention-based multi-modal music style classification method using audio and lyrics; status: Active; granted as CN113177136B (en)

Priority Applications (1)

Application Number: CN202110460027.9A; Priority Date: 2021-04-27; Filing Date: 2021-04-27; Title: Attention-based multi-modal music style classification method using audio and lyrics; Granted as: CN113177136B (en)

Applications Claiming Priority (1)

Application Number: CN202110460027.9A; Priority Date: 2021-04-27; Filing Date: 2021-04-27; Title: Attention-based multi-modal music style classification method using audio and lyrics; Granted as: CN113177136B (en)

Publications (2)

Publication Number Publication Date
CN113177136A: 2021-07-27
CN113177136B (en): 2022-04-22

Family

ID=76926677

Family Applications (1)

Application Number: CN202110460027.9A; Status: Active; Priority Date: 2021-04-27; Filing Date: 2021-04-27; Title: Attention-based multi-modal music style classification method using audio and lyrics; Granted as: CN113177136B (en)

Country Status (1)

Country Link
CN (1) CN113177136B (en)



Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140040312A1 (en) * 2009-04-23 2014-02-06 Glace Holding Llc Systems and methods for storage of declarative knowledge accessible by natural language in a computer capable of appropriately responding
US20140258856A1 (en) * 2013-03-06 2014-09-11 Nuance Communications, Inc, Task assistant including navigation control
CN107111642A (en) * 2014-12-31 2017-08-29 Pcms控股公司 For creating the system and method for listening to daily record and music libraries
WO2019222759A1 (en) * 2018-05-18 2019-11-21 Synaptics Incorporated Recurrent multimodal attention system based on expert gated networks
CN111222009A (en) * 2019-10-25 2020-06-02 汕头大学 Processing method of multi-modal personalized emotion based on long-time memory mechanism
CN111428044A (en) * 2020-03-06 2020-07-17 中国平安人寿保险股份有限公司 Method, device, equipment and storage medium for obtaining supervision identification result in multiple modes
CN111414513A (en) * 2020-03-16 2020-07-14 腾讯音乐娱乐科技(深圳)有限公司 Music genre classification method and device and storage medium
CN111428074A (en) * 2020-03-20 2020-07-17 腾讯科技(深圳)有限公司 Audio sample generation method and device, computer equipment and storage medium
CN111753549A (en) * 2020-05-22 2020-10-09 江苏大学 Multi-mode emotion feature learning and recognition method based on attention mechanism
CN112203122A (en) * 2020-10-10 2021-01-08 腾讯科技(深圳)有限公司 Artificial intelligence-based similar video processing method and device and electronic equipment
CN112487237A (en) * 2020-12-14 2021-03-12 重庆邮电大学 Music classification method based on self-adaptive CNN and semi-supervised self-training model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
倪璐 (Ni Lu): "Design of a music emotion classification method based on the dual modalities of audio and lyrics", 《自动化技术与应用》 (Techniques of Automation and Applications) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627377A (en) * 2021-08-18 2021-11-09 福州大学 Cognitive radio frequency spectrum sensing method and system Based on Attention-Based CNN
CN113780811A (en) * 2021-09-10 2021-12-10 平安科技(深圳)有限公司 Musical instrument performance evaluation method, device, equipment and storage medium
CN113780811B (en) * 2021-09-10 2023-12-26 平安科技(深圳)有限公司 Musical instrument performance evaluation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113177136B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
CN110674339B (en) Chinese song emotion classification method based on multi-mode fusion
CN109766544B (en) Document keyword extraction method and device based on LDA and word vector
CN111460213B (en) Music emotion classification method based on multi-modal learning
CN109933686B (en) Song label prediction method, device, server and storage medium
CN112199956B (en) Entity emotion analysis method based on deep representation learning
CN107315737A (en) A kind of semantic logic processing method and system
CN107357837A (en) The electric business excavated based on order-preserving submatrix and Frequent episodes comments on sensibility classification method
CN113177136B (en) Multi-mode music style classification method based on attention audio frequency and lyrics
CN110263325A (en) Chinese automatic word-cut
CN111291188B (en) Intelligent information extraction method and system
CN110750635B (en) French recommendation method based on joint deep learning model
CN112487237B (en) Music classification method based on self-adaptive CNN and semi-supervised self-training model
CN112256861B (en) Rumor detection method based on search engine return result and electronic device
CN111190997A (en) Question-answering system implementation method using neural network and machine learning sequencing algorithm
CN112487189B (en) Implicit discourse text relation classification method for graph-volume network enhancement
CN112434164B (en) Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration
CN110472245B (en) Multi-label emotion intensity prediction method based on hierarchical convolutional neural network
CN111666376B (en) Answer generation method and device based on paragraph boundary scan prediction and word shift distance cluster matching
CN112905739A (en) False comment detection model training method, detection method and electronic equipment
CN111737414A (en) Song recommendation method and device, server and storage medium
CN115422947A (en) Ancient poetry assignment method and system based on deep learning
CN115952292A (en) Multi-label classification method, device and computer readable medium
Vayadande et al. Mood Detection and Emoji Classification using Tokenization and Convolutional Neural Network
CN116933782A (en) E-commerce text keyword extraction processing method and system
CN115906824A (en) Text fine-grained emotion analysis method, system, medium and computing equipment

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
EE01: Entry into force of recordation of patent licensing contract

Application publication date: 20210727

Assignee: Guilin Zhongchen Information Technology Co.,Ltd.

Assignor: GUILIN University OF ELECTRONIC TECHNOLOGY

Contract record no.: X2022450000215

Denomination of invention: A Multi-Modal Music Style Classification Method Based on Attention, Audio and Lyrics

Granted publication date: 20220422

License type: Common License

Record date: 20221206