CN115881160A - Music genre classification method and system based on knowledge graph fusion - Google Patents

Music genre classification method and system based on knowledge graph fusion

Info

Publication number
CN115881160A
Authority
CN
China
Prior art keywords
genre
knowledge
audio
music
representation
Prior art date
Legal status
Pending
Application number
CN202211505311.4A
Other languages
Chinese (zh)
Inventor
丁菡
宋文静
赵衰
王鸽
赵鲲
惠维
赵季中
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202211505311.4A priority Critical patent/CN115881160A/en
Publication of CN115881160A publication Critical patent/CN115881160A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a music genre classification method and system that fuse a knowledge graph. The invention is the first to propose using a knowledge graph to guide audio representation learning and to apply the knowledge graph to genre classification. The invention constructs the knowledge graph from the metadata of a public music dataset and learns audio features fused with knowledge of the relationships among genres, achieving better genre classification performance and offering broad application prospects.

Description

Music genre classification method and system based on knowledge graph fusion
Technical Field
The invention belongs to the technical field of music signal analysis and processing, and particularly relates to a music genre classification method and system fusing a knowledge graph.
Background
Music genre classification can be used in many real-world applications: a music streaming platform can create more appropriate recommended playlists for a particular user, users can find other music similar to their favorite styles, and so on. However, the boundaries between different music genres remain fuzzy, which makes automatic Music Genre Recognition (MGR) from audio samples an important task.
Experts in the field have proposed methods that attempt to solve this problem. Early approaches explored different inputs (i.e., waveforms or spectrograms) or different classifiers for music classification, e.g., using an unsupervised learning model that reconstructs a variety of audio features (e.g., MFCC, chroma, tempogram) to improve classification performance. Recent research has proposed using related tasks (e.g., artist labeling) to obtain multi-level, multi-scale music representations and using transfer learning to enhance genre classifiers. The above solutions all use only audio samples as input. Still other methods use additional information (e.g., lyrics, comments) for genre classification, e.g., supervising audio representation learning with natural language descriptions of music content, or extracting fused features from lyrics and audio for genre classification. Notably, however, many open-source datasets and real-world tasks do not provide such detailed information, and these methods require the assistance of music APIs or search engines to obtain accurate lyrics or descriptions for each piece of audio, which is a labor-intensive and time-consuming process.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a music genre classification method and system fusing a knowledge graph, which aims to guide audio representation learning with a knowledge graph without acquiring corresponding additional information for each piece of audio, thereby effectively improving music genre classification performance and solving the technical problem of automatic music genre classification with a side-information-assisted neural network.
The invention adopts the following technical scheme:
a music genre classification method fusing knowledge graphs comprises the following steps:
s1, converting audio data into a Mel spectrogram, inputting the Mel spectrogram into an audio feature extraction network to learn audio representation, and simultaneously adding a linear layer at the last of the audio feature extraction network to obtain a prediction score of each genre;
s2, constructing a knowledge graph related to genres;
s3, initializing the genre nodes in the knowledge graph constructed in the step S2 by using the prediction scores of each genre obtained in the step S1, then learning the feature vectors of each genre node by using a graph neural network, and connecting all the feature vectors in series to obtain final knowledge representation;
and S4, distributing different attention weights to the audio representation obtained in the step S1 and the knowledge representation obtained in the step S3 by using an SE module, splicing the weighted representations to obtain an enhanced audio representation, and inputting the enhanced audio representation into a full connection layer to form a music genre classification model so as to realize music genre classification.
Specifically, step S1 comprises:
S101, cutting the audio χ into a plurality of non-overlapping segments of 1-second duration, converting the cut audio segments into 128-dimensional Mel spectrograms using the librosa library, and obtaining the time-frequency representation S;
S102, inputting the time-frequency representation S obtained in step S101 into a backbone network f(.) to learn the audio representation Z_a;
S103, adding a linear layer g(.) after the backbone network f(.) of step S102 and pre-training the network g∘f to obtain a C-dimensional vector Z_s representing the network's prediction score for each genre, where C is the number of genres.
Further, in step S102, the backbone network f(.) uses the Inception-ResNet-V2 architecture.
Specifically, in step S2, the knowledge graph $\mathcal{G} = (V, E)$ includes an entity set and an edge set. The entity set V contains G + A + I elements, where G is the number of music genres, A is the number of artists, and I is the number of instruments. The edge set E of the knowledge graph $\mathcal{G}$ is the set of edges connecting the entities.
Further, the edge set E of the knowledge graph $\mathcal{G}$ is:

$$E = \begin{bmatrix} 0_{G\times G} & P_{G\times A} & P_{G\times I} \\ P_{A\times G} & 0_{A\times A} & 0_{A\times I} \\ P_{I\times G} & 0_{I\times A} & 0_{I\times I} \end{bmatrix}$$

where 0_{G×G} is a zero matrix of size G×G, P_{G×A} is the genre-artist correlation probability matrix of size G×A, P_{G×I} is the genre-instrument correlation probability matrix of size G×I, P_{A×G} is the artist-genre correlation probability matrix of size A×G, 0_{A×A} is a zero matrix of size A×A, 0_{A×I} is a zero matrix of size A×I, P_{I×G} is the instrument-genre correlation probability matrix of size I×G, 0_{I×A} is a zero matrix of size I×A, and 0_{I×I} is a zero matrix of size I×I.
Specifically, step S3 comprises:
S301, initializing the artist nodes A and instrument nodes I with zero vectors, and initializing the corresponding genre nodes G with the Z_s obtained in step S1, yielding the input feature x_v of each node;
S302, at each iteration t, the hidden state h_i of node i is determined by its state at the previous step and the messages propagated from its neighbors at the previous step; after T iterations, messages have propagated through the whole graph $\mathcal{G}$ and the final hidden states of all nodes are obtained; a final linear layer outputs the final feature of each node, and these features are concatenated to obtain the characterization Z_KG of the whole knowledge graph $\mathcal{G}$.
Further, in step S302, at iteration t, the hidden state h_i of node i is determined by its previous state and the messages propagated from its neighbors, specifically:

$$h_i^{(0)} = \left[\,x_i^\top,\ \mathbf{0}\,\right]^\top$$

$$h_i^{(t)} = \mathrm{GRU}\!\left(h_i^{(t-1)},\ E_i\left[\,h_1^{(t-1)\top}, \ldots, h_K^{(t-1)\top}\,\right]^\top\right)$$

where h_i^{(0)} is the initial hidden state of node i, x_i is the input feature of node i, h_i^{(t)} is the hidden state of node i at the t-th iteration, h_k^{(t-1)} is the hidden state of node k at the (t-1)-th iteration, K is the total number of nodes in the knowledge graph, and E_i is the matrix representing the connections between node i and its neighbors.
Specifically, step S4 comprises:
S401, inputting the knowledge characterization Z_KG and the initial audio characterization Z_a into the SE module, which assigns different attention weights to Z_KG and Z_a and adaptively determines the features; then concatenating the weighted features to obtain the audio characterization F;
and S402, inputting the audio characterization F obtained in step S401 into the music genre classification model for genre classification.
Further, in step S402, the music genre classification model is trained using cross entropy loss, where the cross entropy loss L is:
$$L = -\frac{1}{N}\sum_{n=1}^{N} y_n \log \hat{y}_n$$

where ŷ_n is the predicted genre label, y_n is the true label, and N is the total number of input audio samples.
In a second aspect, an embodiment of the present invention provides a music genre classification system fusing a knowledge graph, comprising:
the learning module is used for converting the audio data into a Mel spectrogram, inputting the Mel spectrogram into an audio feature extraction network to learn the audio representation, and adding a linear layer at the end of the audio feature extraction network to obtain a prediction score for each genre;
the construction module is used for constructing a knowledge graph related to the genres;
the representation module is used for initializing the genre nodes in the knowledge graph constructed by the construction module with the prediction scores of each genre obtained by the learning module, then learning the feature vector of each genre node with a graph neural network, and concatenating all the feature vectors to obtain the final knowledge representation;
the classification module is used for assigning, with the SE module, different attention weights to the audio representation obtained by the learning module and the knowledge representation obtained by the representation module, then concatenating the weighted representations to obtain an enhanced audio representation, and inputting the enhanced audio representation into the fully connected layer to form the music genre classification model and realize music genre classification.
Compared with the prior art, the invention has at least the following beneficial effects:
the invention relates to a music genre classification method fusing knowledge graphs, which constructs a knowledge graph related to music genres by using metadata (namely genres, artists and musical instruments) provided in an FMA-medium data set, learns the correlation among different genres from the graph by using GGNN, and fuses the learned knowledge and audio representation to enhance audio representation.
Further, in order to obtain an initial audio representation, the audio representation learning network needs to be pre-trained; and in order for the knowledge graph to guide the learning of the audio representation in a targeted manner, a linear layer is added after the audio representation learning network to obtain prediction scores of each piece of audio for each genre, with which the knowledge graph is initialized.
Further, the backbone network f(.) adopts the relatively mature Inception-ResNet-V2 classification architecture, which combines Inception blocks with residual connections (ResNet); its computational cost is low, its training speed is high, and it can effectively improve music genre classification performance.
Further, in order for the graph to better enhance the audio representation, a knowledge graph $\mathcal{G}$ containing music genres and related information needs to be constructed. Its entity set V contains G + A + I elements related to identifying music genres, where G represents the number of music genres, A represents the number of artists, and I represents the number of instruments; its edge set E is the set of edges connecting the entities and represents the correlations between them.

Further, in order to input the knowledge graph to the GGNN more conveniently, the edge set E of the knowledge graph $\mathcal{G}$ is expressed in matrix form: the relationship between entities that are not directly connected is represented by a zero matrix, and the value at the matrix position corresponding to an edge connecting two entities is the correlation probability between them.
Further, in order for the knowledge graph to guide the learning of the audio representation in a targeted manner, the genre nodes G are initialized with the prediction score of each audio for each genre, so that each node has an input feature x_v; the remaining artist nodes A and instrument nodes I are initialized with zero vectors.
Further, the GGNN learns the characterization of the knowledge graph by iteratively updating the node features.
Further, the SE module assigns different attention weights to the initial audio characterization and the knowledge characterization, adaptively determining which features benefit the whole model; the fused audio characterization is input into a fully connected layer to classify music genres, improving the accuracy of genre classification.
Further, in order to better control training, the music genre classification model is trained using the cross-entropy loss L, which also helps avoid gradient vanishing during training.
It is to be understood that, for the beneficial effects of the second to third aspects, reference may be made to the related description of the first aspect, which is not repeated here.
In summary, the invention constructs a knowledge graph related to music genres, inputs the information in the knowledge graph into the GGNN in structured form to learn the correlations among different genres, and initializes the knowledge graph nodes with the genre prediction scores obtained for each audio by pre-training, so as to enhance the audio representation in a targeted manner; the enhanced audio representation finally yields higher music genre classification performance.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a system framework diagram of the present invention;
FIG. 2 is a schematic view of the knowledge graph of the present invention;
FIG. 3 is a comparison with other genre classification methods;
FIG. 4 is a comparison of the effect of the knowledge graph.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In the description of the present invention, it should be understood that the terms "comprises" and/or "comprising" indicate the presence of the stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and including such combinations, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
It should be understood that although the terms first, second, third, etc. may be used to describe preset ranges, etc. in embodiments of the present invention, these preset ranges should not be limited to these terms. These terms are only used to distinguish preset ranges from each other. For example, the first preset range may also be referred to as a second preset range, and similarly, the second preset range may also be referred to as the first preset range, without departing from the scope of the embodiments of the present invention.
The word "if," as used herein, may be interpreted as "at \8230; \8230when" or "when 8230; \823030when" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
Various structural schematics according to the disclosed embodiments of the invention are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers and their relative sizes and positional relationships shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, according to actual needs.
The invention provides a music genre classification method fusing a knowledge graph. Audio data are first converted into Mel spectrograms, which are input into an audio feature extraction network to learn an audio representation; meanwhile, a linear layer is added at the end of the audio feature extraction network to obtain a prediction score for each genre, used to initialize the genre nodes in the knowledge graph. To capture the relationships among genres, the invention constructs a genre-related music knowledge graph from metadata in a public audio dataset and learns the knowledge characterization of the graph with a graph neural network. The initial audio representation is then enhanced with this knowledge characterization and used for the genre classification task. The invention makes full use of the correlations among different genres learned from the knowledge graph to enhance the audio representation, thereby improving the accuracy of music genre classification, and has broad application prospects.
Referring to fig. 1, the music genre classification method with knowledge graph fusion according to the present invention includes the following steps:
s1, acquiring audio data of music from a data set, cutting the audio and converting the audio into a Mel spectrogram; then inputting the obtained Mel spectrogram into an audio feature extraction network learning audio representation; meanwhile, a linear layer is added at the last of the audio characteristic extraction network to obtain a prediction score of each genre, and the prediction score is used for initializing a genre node of the knowledge graph;
referring to the lower left portion of fig. 1, the audio characterization module specifically includes:
s101, selecting a subset FMA-medium of an open source music data set FMA by audio data, wherein the subset comprises 25,000 tracks with the duration of 30 seconds, and the tracks belong to 16 unbalanced genres respectively; cutting each section of input audio χ into 30 non-overlapping segments with the duration of 1 second, converting the cut audio sample into a 128-dimensional Mel spectrogram by using a librosa library, and obtaining a time-frequency representation S of the Mel spectrogram;
s102, inputting the preprocessed audio data into an inclusion-ResNet-V2 network f (.) with the number of layers reduced to learn an audio representation Z a Multi-scale features of the audio are obtained under the condition of reducing the computational complexity;
s103, adding a linear layer g () after the inclusion-ResNet-V2, and pre-training the network g o f to obtain a vector Z s Which represents the predicted score, Z, of the network for each genre s Is equal to the number of genres, i.e., 16 dimensions; the model was trained using an Adam optimizer with a learning rate set to 0.001.
S2, building a knowledge graph related to the genres by using metadata in the data set, wherein the knowledge graph is used for expressing the relationship among the genres;
referring to FIG. 2, a visual representation of a portion of a knowledge-graph is shownAnd (5) converting the result. Building music genre knowledge graph
Figure BDA0003967948800000081
To represent the relationship between genres, including entity sets and edge sets.
The entity set V contains G + a + I elements, where G represents the number of genres of music, a represents the number of artists, and I represents the number of instruments. Edge set E in knowledge graph
Figure BDA0003967948800000082
Refers to the collection of edges connecting between various entities, of which there are two types in the knowledge graph constructed by the present invention.
One type of edge connects an artist and a music genre, indicating the likelihood that the artist has songs of a particular genre; it can be computed statistically and is denoted P_{G×A}. The calculation formula is:

$$P_{ij} = \frac{N_i^{\,j}}{N_i}$$

where i is an artist, j is a genre, N_i is the number of all songs owned by artist i, and N_i^j is the number of those songs belonging to genre j.
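A sketch of how P_{G×A} could be tallied from track metadata (the input format, a list of (artist, genre) pairs, is an assumption):

```python
import numpy as np
from collections import defaultdict

def genre_artist_matrix(tracks, genres, artists):
    """Build P_{G×A}: entry (j, i) = (songs of artist i in genre j) / N_i."""
    counts = defaultdict(int)    # (genre, artist) -> song count N_i^j
    totals = defaultdict(int)    # artist -> total song count N_i
    for artist, genre in tracks:
        counts[(genre, artist)] += 1
        totals[artist] += 1
    P = np.zeros((len(genres), len(artists)))
    for j, g in enumerate(genres):
        for i, a in enumerate(artists):
            if totals[a] > 0:
                P[j, i] = counts[(g, a)] / totals[a]
    return P
```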
The other type of edge connects an instrument and a music genre, representing the probability that a song played with a certain instrument belongs to a certain genre; this probability can be obtained from the OpenMIC-2018 dataset and is denoted P_{G×I}.

The edge set E of the knowledge graph $\mathcal{G}$ is represented as:

$$E = \begin{bmatrix} 0_{G\times G} & P_{G\times A} & P_{G\times I} \\ P_{A\times G} & 0_{A\times A} & 0_{A\times I} \\ P_{I\times G} & 0_{I\times A} & 0_{I\times I} \end{bmatrix}$$

where, when there is no connection between two nodes, the edge between them is represented by a zero matrix, e.g., 0_{G×G} or 0_{A×I}.
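Assembling E as a dense block matrix is straightforward in NumPy; the sketch below takes P_{A×G} and P_{I×G} as the transposes of P_{G×A} and P_{G×I}, which the text does not state explicitly and is therefore an assumption:

```python
import numpy as np

def build_edge_matrix(P_GA, P_GI):
    """Assemble the (G+A+I) x (G+A+I) edge matrix E from the two probability blocks."""
    G, A = P_GA.shape
    _, I = P_GI.shape
    E = np.zeros((G + A + I, G + A + I))
    E[:G, G:G + A] = P_GA        # genre-artist edges P_{G×A}
    E[:G, G + A:] = P_GI         # genre-instrument edges P_{G×I}
    E[G:G + A, :G] = P_GA.T      # artist-genre edges (transpose assumed)
    E[G + A:, :G] = P_GI.T       # instrument-genre edges (transpose assumed)
    return E
```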
S3, initializing the genre nodes in the knowledge graph constructed in step S2 with the prediction scores of the genres obtained in step S1, learning the feature vector of each node with a graph neural network, and concatenating all the feature vectors to obtain the final knowledge representation;
First, the prediction score of each genre obtained by the pre-trained network is used to initialize the genre nodes in the knowledge graph $\mathcal{G}$; then the GGNN learns the feature vector of each node, and all feature vectors are concatenated to obtain the final knowledge characterization Z_KG, whose output dimension is 1536. Referring to the lower-right portion of fig. 1, the knowledge characterization module specifically includes:

S301, the artist nodes A and instrument nodes I are initialized with zero vectors, and the corresponding genre nodes G are initialized with the prediction score Z_s. The score Z_s is obtained through the pre-trained network g∘f and can be expressed as:

$$Z_s = \{\,s_1, s_2, \ldots, s_G\,\}$$

Further, the input feature of each node after initialization can be expressed as:

$$x_v = \begin{cases} \left[\,s_v,\ 0_{G-1}\,\right], & v \text{ is a genre node} \\ 0_{A+I}, & v \text{ is an artist or instrument node} \end{cases}$$

where 0_{G-1} and 0_{A+I} denote zero vectors of dimension G−1 and A+I, respectively;
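A sketch of this initialization (the per-node layout of s_v and the zero padding is reconstructed from the symbols above and should be read as an assumption):

```python
import numpy as np

def init_node_features(z_s, num_artists, num_instruments):
    """Initialize node input features x_v: each genre node carries its score s_v
    zero-padded to dimension G; artist and instrument nodes start at zero."""
    G = len(z_s)
    X = np.zeros((G + num_artists + num_instruments, G))
    for v in range(G):
        X[v, 0] = z_s[v]    # s_v followed by the 0_{G-1} zero pad (assumed layout)
    return X                # remaining rows stay zero (artist/instrument nodes)
```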
s302, learning knowledge graph by using GGNN (generalized Gaussian neural network)
Figure BDA0003967948800000101
Characterization of (2) Z KG
The GGNN is a recurrent neural network structure that learns the characteristics of arbitrary graph structure data by iteratively updating node characteristics.
Further, at one iteration t, the hidden state h of the node i i Determined by its last state and the messages propagated from its neighbors, expressed as:
Figure BDA0003967948800000102
Figure BDA0003967948800000103
wherein E is i Is a matrix representing the connection relationship between nodes i and their adjacent nodes.
After T iterations, where the number of iterations T is set to 5, the message will be in the whole graph
Figure BDA0003967948800000104
The method comprises the steps of (1) carrying out intermediate propagation to obtain the final hidden states of all nodes; the final linear layer will output the final features of each node, and the features are concatenated to obtain the entire knowledge-map->
Figure BDA0003967948800000105
Characterization of (2) KG
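A compact PyTorch sketch of this propagation, with T = 5 as above; using the full edge matrix E for message aggregation is one reasonable reading of E_i, and the hidden dimension is an assumption:

```python
import torch
import torch.nn as nn

class SimpleGGNN(nn.Module):
    """Iteratively update node hidden states with a shared GRU cell."""
    def __init__(self, hidden_dim, num_steps=5):
        super().__init__()
        self.cell = nn.GRUCell(hidden_dim, hidden_dim)
        self.num_steps = num_steps

    def forward(self, x, E):
        # x: (K, hidden_dim) initial node features; E: (K, K) edge matrix
        h = x
        for _ in range(self.num_steps):
            msg = E @ h              # messages aggregated from neighbors
            h = self.cell(msg, h)    # h^(t) = GRU(message, h^(t-1))
        return h                     # final hidden states of all K nodes
```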
S4, the SE (Squeeze-and-Excitation) module assigns different attention weights to the audio characterization obtained in step S1 and the knowledge characterization obtained in step S3, and the weighted characterizations are concatenated to obtain an enhanced audio characterization; the enhanced audio characterization is input into a fully connected layer for genre classification.
Referring to the upper part of fig. 1, the genre classification fusion module specifically includes:
s401, representing the knowledge Z KG With initial audio characterisation Z a Are input together to SE (Squeeze and exposure)n) in the module, an SE module allocates different attention weights to the two representations, and adaptively determines which feature is favorable for the whole model; the weighted features are then concatenated to obtain the final enhanced audio representation F, which is expressed as:
Figure BDA0003967948800000106
wherein, W a And W KG Representing SE blocks separately for an initial audio representation Z a And knowledge characterization Z KG (ii) assigned attention weight;
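A squeeze-and-excitation style fusion sketch follows; here the two characterizations are concatenated, channel weights are produced by the usual squeeze-excite bottleneck, and the reweighted halves form F. The exact wiring and the reduction ratio are assumptions:

```python
import torch
import torch.nn as nn

class SEFusion(nn.Module):
    """Assign attention weights W_a, W_KG to Z_a and Z_KG, then concatenate."""
    def __init__(self, dim, reduction=16):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(2 * dim, 2 * dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(2 * dim // reduction, 2 * dim),
            nn.Sigmoid(),
        )

    def forward(self, z_a, z_kg):
        z = torch.cat([z_a, z_kg], dim=-1)   # (batch, 2*dim)
        w = self.excite(z)                   # channel-wise attention weights
        return z * w                         # enhanced audio characterization F
```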
s402, inputting the enhanced audio representation F into a Full Connection (FC) layer for carrying out genre classification. The model is trained using cross entropy loss.
The loss function is defined as:
Figure BDA0003967948800000111
wherein,
Figure BDA0003967948800000112
is a predicted genre label, is->
Figure BDA0003967948800000113
Is a genuine label.
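A minimal training-step sketch with cross-entropy (the model/FC split and all names are illustrative assumptions):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()   # averages -log p(true genre) over the batch

def train_step(model, fc, mel_batch, label_batch, optimizer):
    """One optimization step: enhanced characterization F -> FC layer -> loss."""
    optimizer.zero_grad()
    F = model(mel_batch)             # enhanced audio characterization F
    logits = fc(F)                   # fully connected genre classifier
    loss = criterion(logits, label_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```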
In another embodiment of the present invention, a music genre classification system fusing a knowledge graph is provided, which can be used to implement the above music genre classification method fusing a knowledge graph.
The learning module converts the audio data into a Mel spectrogram, inputs the Mel spectrogram into an audio feature extraction network to learn the audio representation, and adds a linear layer at the end of the audio feature extraction network to obtain a prediction score for each genre;
the construction module is used for constructing a knowledge graph related to the genres;
the representation module is used for initializing the genre nodes in the knowledge graph constructed by the construction module with the prediction scores of each genre obtained by the learning module, then learning the feature vector of each genre node with a graph neural network, and concatenating all the feature vectors to obtain the final knowledge representation;
the classification module is used for assigning, with the SE module, different attention weights to the audio representation obtained by the learning module and the knowledge representation obtained by the representation module, then concatenating the weighted representations to obtain an enhanced audio representation, and inputting the enhanced audio representation into the fully connected layer to form the music genre classification model and realize music genre classification.
In yet another embodiment of the present invention, a terminal device is provided that includes a processor and a memory, the memory storing a computer program comprising program instructions, and the processor being configured to execute the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.; it is the computing and control core of the terminal and is adapted to load and execute one or more instructions to implement the corresponding method flow or function. The processor of the embodiment of the invention can be used to run the knowledge-graph-fused music genre classification method, comprising the following steps:
converting the audio data into a Mel spectrogram, inputting the Mel spectrogram into an audio feature extraction network to learn an audio representation, and adding a linear layer at the end of the audio feature extraction network to obtain a prediction score for each genre; constructing a knowledge graph related to the genres; initializing the genre nodes in the knowledge graph with the prediction scores of each genre, then learning the feature vectors of the genre nodes with a graph neural network, and concatenating all the feature vectors to obtain the final knowledge characterization; assigning different attention weights to the audio characterization and the knowledge characterization with the SE module, and concatenating the weighted characterizations to obtain an enhanced audio characterization; and inputting the enhanced audio characterization into the fully connected layer to form the music genre classification model and realize music genre classification.
In still another embodiment of the present invention, the present invention further provides a storage medium, specifically a computer-readable storage medium (Memory), which is a Memory device in the terminal device and is used for storing programs and data. It is understood that the computer readable storage medium herein may include a built-in storage medium in the terminal device, and may also include an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also, the memory space stores one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. It should be noted that the computer-readable storage medium may be a high-speed RAM Memory, or may be a Non-Volatile Memory (Non-Volatile Memory), such as at least one disk Memory.
One or more instructions stored in a computer-readable storage medium may be loaded and executed by a processor to implement the corresponding steps of the music genre classification method with respect to the fusion knowledge base in the above embodiments; one or more instructions in the computer-readable storage medium are loaded by the processor and perform the steps of:
converting the audio data into a Mel spectrogram, inputting the Mel spectrogram into an audio feature extraction network to learn an audio representation, and adding a linear layer at the end of the audio feature extraction network to obtain a prediction score for each genre; constructing a knowledge graph related to the genres; initializing the genre nodes in the knowledge graph with the prediction scores of each genre, then learning the feature vectors of the genre nodes with a graph neural network, and concatenating all the feature vectors to obtain the final knowledge characterization; assigning different attention weights to the audio characterization and the knowledge characterization with the SE module, and concatenating the weighted characterizations to obtain an enhanced audio characterization; and inputting the enhanced audio characterization into the fully connected layer to form the music genre classification model and realize music genre classification.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method follows the FMA-medium dataset split, using 80% of the data for model training, 10% for validation, and the remaining 10% for testing. To obtain a better prediction, the invention averages the softmax results of the 30 one-second clips of each audio track during testing to obtain the final prediction.
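Clip-level averaging at test time might look like the following sketch (names are illustrative):

```python
import torch

def predict_track(model, fc, clips):
    """Average softmax outputs over a track's 30 one-second clips."""
    with torch.no_grad():
        probs = torch.softmax(fc(model(clips)), dim=-1)   # (30, num_genres)
        return probs.mean(dim=0).argmax().item()          # final genre prediction
```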
Referring to fig. 3, to demonstrate that the invention improves the accuracy of genre classification, it is compared with other related work: its accuracy reaches 68.07%, higher than all previous state-of-the-art methods; its ROC-AUC and PR-AUC are 0.883 and 0.471, respectively, 0.5% and 14.6% higher than the CLMR system. These results demonstrate the effectiveness of the invention compared with prior methods.
The present invention also verifies the contribution of the knowledge-graph to audio feature representation learning:
referring to fig. 4, the accuracy of genre classification obtained by learning audio features guided by a knowledge graph in the framework of the present invention is improved by 3%, and the accuracy of genre classification obtained by enhancing audio representation in two other baseline networks by using the knowledge graph in the present invention is improved by 2.3% and 9.34%, respectively.
The above results show that the framework of the present invention supports the use of knowledge-graph guided audio representation learning, which can facilitate fine-grained genre classification.
In conclusion, the music genre classification method and system fusing a knowledge graph according to the invention use the knowledge graph to guide audio representation learning without acquiring corresponding additional information for each piece of audio, saving substantial labor and time in data acquisition and processing; the invention classifies music genres with the enhanced audio representation, making full use of the correlations among different genres learned from the knowledge graph to enhance the audio representation, thereby improving the accuracy of music genre classification, and has broad application prospects.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal and method may be implemented in other ways. For example, the above-described apparatus/terminal embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated module/unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods of the above embodiments may also be implemented by a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, etc. The content of the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. A music genre classification method fusing a knowledge graph, characterized by comprising the following steps:
S1, converting audio data into a Mel spectrogram, inputting the Mel spectrogram into an audio feature extraction network to learn an audio representation, and adding a linear layer at the end of the audio feature extraction network to obtain a prediction score for each genre;
S2, establishing a knowledge graph related to the genres;
S3, initializing the genre nodes in the knowledge graph constructed in step S2 with the prediction scores of each genre obtained in step S1, then learning the feature vector of each genre node with a graph neural network, and concatenating all the feature vectors to obtain the final knowledge representation;
and S4, using an SE module to assign different attention weights to the audio representation obtained in step S1 and the knowledge representation obtained in step S3, concatenating the weighted representations to obtain an enhanced audio representation, and inputting the enhanced audio representation into a fully connected layer to form a music genre classification model and realize music genre classification.
2. The method for classifying music genres by fusing knowledge graphs according to claim 1, wherein the step S1 specifically comprises:
S101, cutting the audio χ into a plurality of non-overlapping segments of 1-second duration, converting the cut audio segments into 128-dimensional Mel spectrograms using the librosa library, and obtaining the time-frequency representation S;
S102, inputting the time-frequency representation S obtained in step S101 into a backbone network f(.) to learn the audio representation Z_a;
S103, adding a linear layer g(.) after the backbone network f(.) of step S102 and pre-training the network g∘f to obtain a C-dimensional vector Z_s representing the network's prediction score for each genre, where C is the number of genres.
3. The method for classifying music genres according to claim 2, wherein in step S102, the backbone network f(.) uses the Inception-ResNet-V2 architecture.
4. The method for classifying music genres by fusing knowledge graphs according to claim 1, wherein in step S2, the knowledge graph $\mathcal{G} = (V, E)$ includes an entity set and an edge set; the entity set V contains G + A + I elements, G being the number of music genres, A the number of artists, and I the number of instruments; the edge set E of the knowledge graph $\mathcal{G}$ is the set of edges connecting the entities.
5. The method of claim 4, wherein the edge set E of the knowledge graph $\mathcal{G}$ is:

$$E = \begin{bmatrix} 0_{G\times G} & P_{G\times A} & P_{G\times I} \\ P_{A\times G} & 0_{A\times A} & 0_{A\times I} \\ P_{I\times G} & 0_{I\times A} & 0_{I\times I} \end{bmatrix}$$

where 0_{G×G} is a zero matrix of size G×G, P_{G×A} is the genre-artist correlation probability matrix of size G×A, P_{G×I} is the genre-instrument correlation probability matrix of size G×I, P_{A×G} is the artist-genre correlation probability matrix of size A×G, 0_{A×A} is a zero matrix of size A×A, 0_{A×I} is a zero matrix of size A×I, P_{I×G} is the instrument-genre correlation probability matrix of size I×G, 0_{I×A} is a zero matrix of size I×A, and 0_{I×I} is a zero matrix of size I×I.
6. The method for classifying music genres by fusing knowledge graphs according to claim 1, wherein step S3 specifically comprises:
S301, initializing the artist nodes A and instrument nodes I with zero vectors, and initializing the corresponding genre nodes G with the Z_s obtained in step S1, yielding the input feature x_v of each node;
S302, at each iteration t, the hidden state h_i of node i is determined by its state at the previous step and the messages propagated from its neighbors at the previous step; after T iterations, messages have propagated through the whole graph $\mathcal{G}$ and the final hidden states of all nodes are obtained; a final linear layer outputs the final feature of each node, and these features are concatenated to obtain the characterization Z_KG of the whole knowledge graph $\mathcal{G}$.
7. The method for classifying music genres according to claim 6, wherein in step S302, at iteration t, the hidden state h_i of node i is determined by its previous state and the messages propagated from its neighbors, specifically:

$$h_i^{(0)} = \left[\,x_i^\top,\ \mathbf{0}\,\right]^\top$$

$$h_i^{(t)} = \mathrm{GRU}\!\left(h_i^{(t-1)},\ E_i\left[\,h_1^{(t-1)\top}, \ldots, h_K^{(t-1)\top}\,\right]^\top\right)$$

where h_i^{(0)} is the initial hidden state of node i, x_i is the input feature of node i, h_i^{(t)} is the hidden state of node i at the t-th iteration, h_k^{(t-1)} is the hidden state of node k at the (t-1)-th iteration, K is the total number of nodes in the knowledge graph, and E_i is the matrix representing the connections between node i and its neighbors.
8. The method for classifying music genres by fusing knowledge graphs according to claim 1, wherein step S4 specifically comprises:
S401, inputting the knowledge characterization Z_KG and the initial audio characterization Z_a into the SE module, which assigns different attention weights to Z_KG and Z_a and adaptively determines the features; then concatenating the weighted features to obtain the audio characterization F;
and S402, inputting the audio characterization F obtained in step S401 into the music genre classification model for genre classification.
9. The method for classifying music genres according to claim 8, wherein in step S402, the music genre classification model is trained using cross-entropy loss, the cross-entropy loss L being:

$$L = -\frac{1}{N}\sum_{n=1}^{N} y_n \log \hat{y}_n$$

where ŷ_n is the predicted genre label, y_n is the true label, and N is the total number of input audio samples.
10. A system for knowledge-graph-fused music genre classification, comprising:
the learning module is used for converting the audio data into a Mel spectrogram, inputting the Mel spectrogram into an audio feature extraction network to learn the audio representation, and adding a linear layer at the end of the audio feature extraction network to obtain a prediction score for each genre;
the construction module is used for constructing a knowledge graph related to the genres;
the representation module is used for initializing the genre nodes in the knowledge graph constructed by the construction module with the prediction scores of each genre obtained by the learning module, then learning the feature vector of each genre node with a graph neural network, and concatenating all the feature vectors to obtain the final knowledge representation;
the classification module is used for assigning, with the SE module, different attention weights to the audio representation obtained by the learning module and the knowledge representation obtained by the representation module, then concatenating the weighted representations to obtain an enhanced audio representation, and inputting the enhanced audio representation into the fully connected layer to form the music genre classification model and realize music genre classification.
CN202211505311.4A 2022-11-28 2022-11-28 Music genre classification method and system based on knowledge graph fusion Pending CN115881160A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211505311.4A CN115881160A (en) 2022-11-28 2022-11-28 Music genre classification method and system based on knowledge graph fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211505311.4A CN115881160A (en) 2022-11-28 2022-11-28 Music genre classification method and system based on knowledge graph fusion

Publications (1)

Publication Number Publication Date
CN115881160A true CN115881160A (en) 2023-03-31

Family

ID=85764430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211505311.4A Pending CN115881160A (en) 2022-11-28 2022-11-28 Music genre classification method and system based on knowledge graph fusion

Country Status (1)

Country Link
CN (1) CN115881160A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117234455A (en) * 2023-11-14 2023-12-15 深圳市齐奥通信技术有限公司 Intelligent control method and system for audio device based on environment perception
CN117234455B (en) * 2023-11-14 2024-04-19 深圳市齐奥通信技术有限公司 Intelligent control method and system for audio device based on environment perception

Similar Documents

Publication Publication Date Title
Korzeniowski et al. A fully convolutional deep auditory model for musical chord recognition
Park et al. Towards unsupervised pattern discovery in speech
Prabhakar et al. Holistic approaches to music genre classification using efficient transfer and deep learning techniques
WO2021174760A1 (en) Voiceprint data generation method and device, computer device, and storage medium
CN111400540B (en) Singing voice detection method based on extrusion and excitation residual error network
Ruvolo et al. A learning approach to hierarchical feature selection and aggregation for audio classification
CN112084435A (en) Search ranking model training method and device and search ranking method and device
CN103761965A (en) Method for classifying musical instrument signals
Albornoz et al. Automatic classification of Furnariidae species from the Paranaense Littoral region using speech-related features and machine learning
CN113813609A (en) Game music style classification method and device, readable medium and electronic equipment
Mishra et al. Reliable local explanations for machine listening
CN113870863B (en) Voiceprint recognition method and device, storage medium and electronic equipment
Zhong et al. MusicCNNs: a new benchmark on content-based music recommendation
Yasmin et al. A rough set theory and deep learning-based predictive system for gender recognition using audio speech
CN115881160A (en) Music genre classification method and system based on knowledge graph fusion
CN110347825A (en) The short English film review classification method of one kind and device
You et al. Open set classification of sound event
Ding et al. Audio embeddings as teachers for music classification
CN114022192A (en) Data modeling method and system based on intelligent marketing scene
CN116646001B (en) Method for predicting drug target binding based on combined cross-domain attention model
CN112489689A (en) Cross-database voice emotion recognition method and device based on multi-scale difference confrontation
US20220382806A1 (en) Music analysis and recommendation engine
Wang et al. Weakly Supervised Chinese short text classification algorithm based on ConWea model
CN114999566A (en) Drug repositioning method and system based on word vector characterization and attention mechanism
Mahardhika et al. Method to Profiling the Characteristics of Indonesian Dangdut Songs, Using K-Means Clustering and Features Fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination