CN115910099A - Musical instrument automatic identification method based on depth probability map neural network - Google Patents

Musical instrument automatic identification method based on depth probability map neural network

Info

Publication number
CN115910099A
CN115910099A (application CN202211391028.3A)
Authority
CN
China
Prior art keywords
formula
label
musical instrument
frequency spectrum
crbm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211391028.3A
Other languages
Chinese (zh)
Other versions
CN115910099B (en)
Inventor
张健
侯海薇
杜威
丁世飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to CN202211391028.3A
Publication of CN115910099A
Application granted
Publication of CN115910099B
Legal status: Active
Anticipated expiration

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

A musical instrument automatic identification method based on a depth probability map neural network divides audio data into time slices: a piece of audio is split into N fixed-length time slices, and the label corresponding to each time slice is recorded. The audio data of each time slice is converted into a Mel frequency spectrum image, which is then regularized. A convolutional neural network extracts features from the regularized Mel frequency spectrum image; the extracted features are mapped from two dimensions into a one-dimensional form and combined with the labels to form time-slice Mel frequency spectrum image feature-label pairs. An improved conditional restricted Boltzmann machine (CRBM) model is constructed and its conditional probability distributions are obtained; the improved CRBM model is trained using Gibbs sampling. An objective function is constructed and used to train the improved CRBM model, yielding an automatic musical instrument recognition model that outputs the predicted instrument labels. The method can effectively solve the problem that instruments in polyphonic music are difficult to identify accurately with the prior art.

Description

Musical instrument automatic identification method based on depth probability map neural network
Technical Field
The invention belongs to the technical field of sound recognition, and particularly relates to an automatic musical instrument recognition method based on a depth probability map neural network.
Background
With the development of artificial intelligence, intelligent music analysis methods based on machine learning have gradually become a core technology and research direction in fields such as melody recognition and music style detection, and the automatic recognition of musical instruments in polyphonic music is a key step toward intelligent music analysis. In current research and applications, the mainstream approach combines machine learning methods to realize intelligent music analysis from the perspective of signal processing. However, for the task of identifying the instruments in polyphonic music, the harmonics of the instruments cause complex signal superposition in both the time domain and the frequency domain, which degrades the accuracy of instrument identification; this phenomenon has long been a difficult point in polyphonic instrument recognition.
From the point of view of signal processing, instrument recognition can be regarded as a branch of audio data processing. Unlike other audio data, however, vocal and instrumental music have distinctive harmonic energy distributions, and some researchers extract features from polyphonic music from an acoustic (and psychoacoustic) point of view to obtain a feature representation, for example attack time, spectral centroid, energy envelope, and Mel-frequency cepstral coefficients, and then design a corresponding traditional machine learning or deep learning method to carry out the instrument identification task. Although such methods extract acoustic (psychoacoustic) features manually, their recognition of polyphonic music is still unsatisfactory, and it is especially difficult to distinguish different instruments within the same instrument family. The reason is that harmonic distributions differ between instrument families but are similar among instruments within the same family, and the harmonic nature of instruments means that polyphonic music contains complex signal superposition in the time and frequency domains, so that the signal at a given frequency may simultaneously be the fundamental frequency of one instrument and a fundamental or harmonic component of other instruments. Therefore, although the result of instrument identification is unrelated to the pitch (fundamental frequency) being played, machine-learning-based instrument identification is strongly affected by the fundamental frequency and its harmonics, and the harmonic distributions of different instruments are difficult to distinguish effectively. With the development of deep learning, researchers have exploited the excellent performance of deep neural networks on image processing tasks: polyphonic music in the time domain is represented in the frequency domain by constructing a spectrogram or Mel spectrogram, and a deep neural network then completes the instrument identification task through supervised learning.
Deep neural networks based on frequency-domain images have brought great progress to the instrument recognition task, but some problems remain unsolved. First, such algorithms extract the harmonic features of an instrument from single-instrument audio and then apply them to a multi-instrument polyphonic test set for multi-label classification, so the classification result depends on whether the model can fully learn the instrument features on the training set. Furthermore, because the test data are polyphonic music, the superposition of multiple instruments on the frequency spectrum can greatly interfere with the identification result, while neural-network-based instrument identification methods generally analyze the problem only from the perspective of spectral image features or of label-specific (generic) features, thereby converting instrument identification into image identification; they rarely consider the similarity between instrument categories as a key feature for distinguishing instrument harmonic distributions, which leads to low identification accuracy.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an automatic musical instrument identification method based on a depth probability map neural network, which has simple steps and high identification precision and can effectively solve the difficulty of accurately identifying instruments in polyphonic music in the prior art.
In order to achieve the above object, the present invention provides a musical instrument automatic identification method based on a depth probability map neural network, comprising the following steps:
step one: preprocessing data;
s11, time slice division is carried out on the audio data, a section of audio is divided into N time slices with fixed length, and meanwhile, a label corresponding to each time slice is recorded;
s12, converting the obtained audio data of each time slice into a Mel frequency spectrum image, then carrying out regularization processing on the image, and regularizing the value range of pixel points to obtain a regularized Mel frequency spectrum image;
step two: extracting the characteristics of the data;
performing feature extraction on the regularized Mel frequency spectrum image by using a convolutional neural network to obtain the image features of the Mel frequency spectrum, mapping the extracted features from two dimensions into a one-dimensional form, and then combining them with the labels to form time-slice Mel frequency spectrum image feature-label pairs;
step three: modeling the tag correlation characteristics by using an improved CRBM model;
s31, constructing an improved CRBM model according to an energy function provided by a formula (1);
[Formula (1): energy function of the improved CRBM; rendered as an image in the original and not reproduced here]
where x represents the input spectral feature data, y represents the label corresponding to x (which is also the expected output in the prediction phase), h represents the desired feature representation, s is an introduced additional variable, and W_y, β, μ, b are training parameters;
s32, obtaining conditional joint probability distribution according to the formula (1), wherein the conditional joint probability distribution is shown in a formula (2);
[Formula (2): conditional joint probability distribution derived from formula (1), with partition function Z; rendered as an image in the original and not reproduced here]
wherein Z represents a partition function;
s33, obtaining formulas (3) and (4) based on the formula (2), and obtaining the conditional probability P (h | x, y) of h according to the formula (3); obtaining the activation probability of each component of h according to formula (4); obtaining a conditional probability formula (5) of y based on x and h according to the formulas (2) and (3);
P(h | x, y) = ∏_i P(h_i | x, y)    (3);
[Formula (4): activation probability of each component of h; rendered as an image in the original and not reproduced here]
[Formula (5): conditional probability of y given x and h; rendered as an image in the original and not reproduced here]
s34, obtaining formulas (6) and (7) based on the formulas (2) and (4), and obtaining the activation probability of each component of S according to the formula (6); obtaining an activation probability of each component of y according to formula (7);
[Formula (6): activation probability of each component of s; rendered as an image in the original and not reproduced here]
[Formula (7): activation probability of each component of y; rendered as an image in the original and not reproduced here]
wherein N represents a Gaussian distribution;
step four: training the improved CRBM model, with classification as the objective, by utilizing the correlation features, and outputting the predicted musical instrument labels;
s41, constructing an objective function according to the formula (8), and training an improved CRBM model by using the objective function;
Loss = log(p(y|x)) + Rank-Loss(y|x) + σ||y||_l1    (8);
in the formula, log(p(y|x)) represents the likelihood function, Rank-Loss(y|x) represents the ranking loss function, ||y||_l1 represents the l1 regularization term, and σ is a hyperparameter;
s42, based on the formulas (3), (4), (6) and (7), using Gibbs sampling to obtain a gradient formula (9) of a likelihood function in the formula (8), calculating the gradient of the formula (8) according to the formula (9), training an improved CRBM model according to the gradient of the formula (8), and then obtaining a feature expression h containing label correlation and a conditional probability of a label through training;
[Formula (9): gradient of the likelihood function; rendered as an image in the original and not reproduced here]
where E represents the mathematical expectation, θ represents the set of parameters, and both mathematical expectations are obtained using Gibbs sampling according to equations (4), (6), and (7);
s43, after the improved CRBM training is completed, in the face of input of musical instrument categories needing to be predicted, calculating a label y which enables log (p (y | x) to be maximum according to a formula (10), and accordingly outputting a predicted musical instrument label according to the musical instrument automatic recognition model based on the improved CRBM to obtain a musical instrument automatic recognition model;
[Formula (10): selection of the label y that maximizes log(p(y|x)); rendered as an image in the original and not reproduced here]
preferably, in S12 of step one, the audio data is converted into a mel-frequency spectrum image using an open source tool.
Further, in step two, a neural network ResNet101 pre-trained on the ImageNet dataset is used to extract features of the Mel frequency spectrum image. In this way, the Mel frequency spectrum mapping combined with ResNet101 image feature extraction effectively extracts features usable for the instrument identification task, and, together with the polyphonic instrument identification method based on label correlation features, further improves identification precision.
In the data processing part, the polyphonic music audio is cut into time slices and each slice is converted into a Mel frequency spectrum. In the model construction part, a convolutional neural network extracts the image features of the Mel frequency spectrum of the polyphonic music; an improved conditional restricted Boltzmann machine (CRBM) model then models the correlation between these image features and the instrument labels corresponding to the time-slice Mel frequency spectrum; finally, the improved CRBM model is trained, and the predicted instrument labels are output based on the obtained correlation features. Since a piece of polyphonic music usually has multiple instruments playing simultaneously, polyphonic music identification is naturally a multi-label identification problem. Meanwhile, existing neural-network-based instrument identification methods have not considered the correlation between instrument categories as a key feature for distinguishing instrument harmonic distributions when identifying polyphonic music. To this end, the invention first learns, from the two perspectives of spectral feature extraction and instrument label-specific (generic) features, the correlation between the harmonic features of an instrument and the different instrument labels and generic features in polyphonic music (where the generic, i.e. label-specific, features are intended to extract, from the perspective of the label, the features in the data directly related to that label). The generic features are expressed in the improved CRBM in the form of the conditional probability P(h | x, y); the variable pair (x, y) introduced in the improved CRBM model is used to model the correlation between the image features and the labels, as well as the correlation between the labels; and P(y | x) and the activation probability P(y_i | x, h, s) model the association of harmonic features with different instrument labels, so as to fully extract features usable for the instrument recognition task. The method therefore draws on the idea of multi-label learning and models the correlation among the multiple instrument labels in polyphonic music from the two angles of image features and generic features, so that the instruments in polyphonic music can be identified according to the label correlation and the correlation between labels and spectral image features: the label correlation models which instruments are likely to produce harmonic overlap, and the conditional probability between the spectral image features and the corresponding labels associates this possible overlap with the corresponding spectral images, so that the instruments in polyphonic music can be effectively distinguished and identified. The method overcomes the defect that prior identification methods analyze the problem only from the aspect of spectral image features or only from the aspect of generic features, thereby greatly improving identification precision and solving the problem of instrument identification in polyphonic music.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is an improved CRBM model constructed in the present invention;
fig. 3 is a structural diagram of an automatic recognition model of musical instruments in the present invention.
Detailed Description
The present invention is further described below.
As shown in fig. 1 and 3, the present invention provides an automatic musical instrument identification method based on a depth probability map neural network, comprising the following steps:
step one: preprocessing data;
s11, time slice division is carried out on the audio data, a section of audio is divided into N time slices with fixed length, and meanwhile, a label corresponding to each time slice is recorded;
s12, converting the obtained audio data of each time slice into a Mel frequency spectrum image, then carrying out regularization processing on the image, and regularizing the value range of pixel points to obtain a regularized Mel frequency spectrum image;
step two: extracting the characteristics of the data;
performing feature extraction on the regularized Mel frequency spectrum image by using a convolutional neural network to obtain the image features of the Mel frequency spectrum, mapping the extracted features from two dimensions into a one-dimensional form, and then combining them with the labels to form time-slice Mel frequency spectrum image feature-label pairs;
step three: modeling the tag correlation features using an improved CRBM model, as shown in FIG. 2;
s31, constructing an improved CRBM model according to an energy function provided by a formula (1);
[Formula (1): energy function of the improved CRBM; rendered as an image in the original and not reproduced here]
where x represents the input spectral feature data, y represents the label corresponding to x (which is also the expected output in the prediction phase), h represents the desired feature representation, s is an introduced additional variable, and W_y, β, μ, b are training parameters;
s32, obtaining conditional joint probability distribution according to the formula (1), wherein the conditional joint probability distribution is shown in the formula (2);
[Formula (2): conditional joint probability distribution derived from formula (1), with partition function Z; rendered as an image in the original and not reproduced here]
wherein Z represents a partition function;
s33, obtaining formulas (3) and (4) based on the formula (2), and obtaining the conditional probability P (h | x, y) of h according to the formula (3); obtaining the activation probability of each component of h according to formula (4); obtaining a conditional probability formula (5) of y based on x and h according to the formulas (2) and (3);
P(h | x, y) = ∏_i P(h_i | x, y)    (3);
[Formula (4): activation probability of each component of h; rendered as an image in the original and not reproduced here]
[Formula (5): conditional probability of y given x and h; rendered as an image in the original and not reproduced here]
However, the covariance matrix of formula (5) is non-diagonal; although formula (5) can represent the correlation between the components of the label y through this covariance matrix, it is difficult to sample from directly and is therefore not suitable for training the improved CRBM model. In order to train the model, the invention further decomposes the conditional probability of y in S34.
S34, obtaining formulas (6) and (7) based on formulas (2) and (4), and obtaining the activation probability of each component of s according to formula (6); obtaining the activation probability of each component of y according to formula (7);
[Formula (6): activation probability of each component of s; rendered as an image in the original and not reproduced here]
[Formula (7): activation probability of each component of y; rendered as an image in the original and not reproduced here]
wherein N represents a Gaussian distribution; thus, given s, the components of y are conditionally independent, and the computation can be carried out through Gibbs sampling.
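Because the equation images for formulas (4), (6) and (7) are not reproduced above, the sketch below only illustrates the block Gibbs scheme the text describes: h, s and y are sampled alternately from their component-wise conditionals, which become independent given s. The specific distributional forms (Bernoulli h, Gaussian s and y) and the parameter names U and V are illustrative assumptions, not the patent's actual formulas.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_sweep(x, y, h, s, params):
    """One block-Gibbs sweep over (h, s, y) given the spectral features x.
    The conditionals used here are illustrative stand-ins for formulas (4),
    (6) and (7); only the sampling structure mirrors the text."""
    W, U, V, beta, mu, b = params          # U, V are hypothetical extra weight matrices

    # h | x, y: component-wise activation probabilities (role of formula (4))
    p_h = sigmoid(x @ W + y @ U + b)
    h = (rng.random(p_h.shape) < p_h).astype(float)

    # s | h: Gaussian additional variable (role of formula (6));
    # s is given the same dimensionality as y in this sketch
    s = rng.normal(loc=h @ V + mu)

    # y | x, h, s: components are conditionally independent given s (role of formula (7))
    y = rng.normal(loc=x @ beta + s)
    return h, s, y
```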
Step four: training the improved CRBM model, with classification as the objective, by utilizing the correlation features, and outputting the predicted musical instrument labels;
s41, constructing an objective function according to the formula (8), and training an improved CRBM model by using the objective function;
Loss = log(p(y|x)) + Rank-Loss(y|x) + σ||y||_l1    (8);
in the formula, log(p(y|x)) represents the likelihood function, Rank-Loss(y|x) represents the ranking loss function, ||y||_l1 represents the l1 regularization term, and σ is a hyperparameter;
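The likelihood term of formula (8) depends on the improved CRBM itself, but the other two terms can be written down directly. The sketch below (PyTorch assumed) shows one common pairwise form of a multi-label ranking loss together with the σ||y||_l1 term; the patent does not spell out its exact ranking-loss variant, and the likelihood term is passed in as a precomputed value.

```python
import torch

def rank_loss(scores, targets, margin=1.0):
    """Pairwise multi-label ranking loss: penalizes every (negative, positive) label
    pair in which the negative label is not scored below the positive one by at
    least the margin. One standard form; the patent's exact variant is not given."""
    pos, neg = targets > 0.5, targets <= 0.5
    if pos.sum() == 0 or neg.sum() == 0:
        return scores.new_zeros(())
    diff = scores[neg].unsqueeze(1) - scores[pos].unsqueeze(0)
    return torch.clamp(margin + diff, min=0.0).mean()

def objective(log_p_y_given_x, scores, targets, sigma=0.01):
    """Formula (8) as written: likelihood term + ranking loss + sigma * l1 term,
    with the l1 penalty applied to the predicted label scores."""
    return log_p_y_given_x + rank_loss(scores, targets) + sigma * scores.abs().sum()
```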
s42, based on the formulas (3), (4), (6) and (7), obtaining a gradient formula (9) of the likelihood function in the formula (8) by using Gibbs sampling,
[Formula (9): gradient of the likelihood function; rendered as an image in the original and not reproduced here]
where E represents the mathematical expectation and θ represents the parameter set; both mathematical expectations can be obtained using Gibbs sampling according to formulas (4), (6) and (7). Thus, the gradient of formula (8) can be calculated according to formula (9), and the improved CRBM model can be trained according to the gradient of formula (8); training yields a feature representation h containing the label correlation, together with the conditional probability of the labels;
s43, after the improved CRBM training is completed, in the face of input of musical instrument categories needing to be predicted, calculating a label y which enables log (p (y | x) to be maximum according to a formula (10);
[Formula (10): selection of the label y that maximizes log(p(y|x)); rendered as an image in the original and not reproduced here]
Calculating the label y that maximizes log(p(y|x)) is achieved by computing the gradient of formula (10) with respect to y; the trained improved CRBM thereby serves as the automatic musical instrument recognition model, which outputs the predicted instrument labels.
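A schematic of this prediction step is shown below, assuming the trained model exposes a differentiable surrogate log_p_y_given_x for log(p(y|x)) (not defined here) and that the continuous y is thresholded at 0.5 to obtain the final multi-label output; both assumptions are illustrative.

```python
import torch

def predict_labels(log_p_y_given_x, x, n_labels, steps=100, lr=0.1):
    """Gradient-based search for the label vector y that maximizes log p(y|x),
    following S43; log_p_y_given_x is a placeholder for the trained improved
    CRBM's log-conditional."""
    y = torch.zeros(n_labels, requires_grad=True)
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -log_p_y_given_x(y, x)   # ascend log p(y|x) by descending its negative
        loss.backward()
        opt.step()
    return (y.detach() > 0.5).int()     # thresholded multi-label prediction (assumed cutoff)
```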
Preferably, in S12 of step one, the audio data is converted into a mel-frequency spectrum image using an open source tool.
In order to more fully extract features usable for the musical instrument recognition task, in step two the neural network ResNet101 pre-trained on the ImageNet dataset is used to extract the features of the Mel spectrum image. In this way, the Mel frequency spectrum mapping combined with ResNet101 image feature extraction effectively extracts features usable for the instrument identification task, and, together with the polyphonic instrument identification method based on label correlation features, further improves identification precision.
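A minimal sketch of this feature-extraction step, assuming torchvision's ImageNet-pretrained ResNet101 with the final classification layer removed; replicating the single-channel spectrogram to three channels and the batch layout are illustrative choices.

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained ResNet101 with the final fully connected layer replaced by
# an identity, so the network outputs a 2048-dimensional feature vector per image.
backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()
backbone.eval()

def feature_label_pairs(mel_batch, labels):
    """mel_batch: tensor of shape (N, 1, H, W) holding regularized Mel spectrograms.
    Returns time-slice (feature, label) pairs with features already in one-dimensional form."""
    x = mel_batch.repeat(1, 3, 1, 1)      # replicate to the 3 channels expected by ResNet
    with torch.no_grad():
        feats = backbone(x)               # (N, 2048) one-dimensional feature vectors
    return list(zip(feats, labels))
```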
In the data processing part, the polyphonic music audio is cut into time slices and each slice is converted into a Mel frequency spectrum. In the model construction part, a convolutional neural network extracts the image features of the Mel frequency spectrum of the polyphonic music; an improved conditional restricted Boltzmann machine (CRBM) model then models the correlation between these image features and the instrument labels corresponding to the time-slice Mel frequency spectrum; finally, the improved CRBM model is trained, and the predicted instrument labels are output based on the obtained correlation features. Since a piece of polyphonic music usually has multiple instruments playing simultaneously, polyphonic music identification is naturally a multi-label identification problem. Meanwhile, existing neural-network-based instrument identification methods do not consider the correlation between instrument categories as a key feature for distinguishing instrument harmonic distributions when identifying polyphonic music. To this end, the invention first learns, from the two perspectives of spectral feature extraction and instrument label-specific (generic) features, the correlation between the harmonic features of an instrument and the different instrument labels and generic features in polyphonic music (where the generic, i.e. label-specific, features are intended to extract, from the perspective of the label, the features in the data directly related to that label). The generic features are expressed in the improved CRBM in the form of the conditional probability P(h | x, y); the variable pair (x, y) introduced in the improved CRBM model is used to model the correlation between the image features and the labels, as well as the correlation between the labels; and P(y | x) and the activation probability P(y_i | x, h, s) model the association of harmonic features with different instrument labels, so as to fully extract features usable for the instrument recognition task. The method therefore not only draws on the idea of multi-label learning but also models the correlation among the multiple instrument labels in polyphonic music from the two aspects of image features and generic features, so that the instruments in polyphonic music can be identified according to the label correlation and the correlation between labels and spectral image features: the label correlation models which instruments in polyphonic music are likely to produce harmonic overlap, and the conditional probability between the spectral image features and the corresponding labels associates this possible overlap with the corresponding spectral images, so that the instruments in polyphonic music can be effectively distinguished and identified. The method overcomes the defect that prior identification methods analyze the problem only from the aspect of spectral image features or only from the aspect of generic features, thereby greatly improving identification precision and solving the problem of instrument identification in polyphonic music.

Claims (3)

1. An automatic musical instrument identification method based on a depth probability map neural network is characterized by comprising the following steps:
step one: preprocessing data;
s11, time slice division is carried out on the audio data, a section of audio is divided into N time slices with fixed length, and meanwhile, a label corresponding to each time slice is recorded;
s12, converting the obtained audio data of each time slice into a Mel frequency spectrum image, then carrying out regularization processing on the image, and regularizing the value range of pixel points to obtain a regularized Mel frequency spectrum image;
step two: extracting the characteristics of the data;
performing feature extraction on the regularized Mel frequency spectrum image by using a convolutional neural network to obtain the image features of the Mel frequency spectrum, mapping the extracted features from two dimensions into a one-dimensional form, and then combining them with the labels to form time-slice Mel frequency spectrum image feature-label pairs;
step three: modeling the tag correlation characteristics by using an improved CRBM model;
s31, constructing an improved CRBM model according to an energy function provided by the formula (1);
[Formula (1): energy function of the improved CRBM; rendered as an image in the original and not reproduced here]
where x represents the input spectral feature data, y represents the label corresponding to x (which is also the desired output in the prediction phase), h represents the desired feature representation, s is an introduced additional variable, and W_y, β, μ, b are training parameters;
s32, obtaining conditional joint probability distribution according to the formula (1), wherein the conditional joint probability distribution is shown in a formula (2);
[Formula (2): conditional joint probability distribution derived from formula (1), with partition function Z; rendered as an image in the original and not reproduced here]
wherein Z represents a partition function;
s33, obtaining formulas (3) and (4) based on the formula (2), and obtaining the conditional probability P (h | x, y) of h according to the formula (3); obtaining the activation probability of each component of h according to formula (4); obtaining a conditional probability formula (5) of y based on x and h according to the formulas (2) and (3);
P(h | x, y) = ∏_i P(h_i | x, y)    (3);
[Formula (4): activation probability of each component of h; rendered as an image in the original and not reproduced here]
[Formula (5): conditional probability of y given x and h; rendered as an image in the original and not reproduced here]
s34, obtaining formulas (6) and (7) based on the formulas (2) and (4), and obtaining the activation probability of each component of S according to the formula (6); obtaining an activation probability of each component of y according to formula (7);
[Formula (6): activation probability of each component of s; rendered as an image in the original and not reproduced here]
[Formula (7): activation probability of each component of y; rendered as an image in the original and not reproduced here]
wherein N represents a Gaussian distribution;
step four: training the improved CRBM model, with classification as the objective, by utilizing the correlation features, and outputting the predicted musical instrument labels;
s41, constructing an objective function according to the formula (8), and training an improved CRBM model by using the objective function;
Loss = log(p(y|x)) + Rank-Loss(y|x) + σ||y||_l1    (8);
in the formula, log(p(y|x)) represents the likelihood function, Rank-Loss(y|x) represents the ranking loss function, ||y||_l1 represents the l1 regularization term, and σ is a hyperparameter;
s42, based on the formulas (3), (4), (6) and (7), a gradient formula (9) of a likelihood function in the formula (8) is obtained by using Gibbs sampling, the gradient of the formula (8) is calculated according to the formula (9), an improved CRBM model is trained according to the gradient of the formula (8), and then the feature expression h containing the label correlation and the conditional probability of the label are obtained through training;
[Formula (9): gradient of the likelihood function; rendered as an image in the original and not reproduced here]
where E represents the mathematical expectation, θ represents the set of parameters, and both mathematical expectations are obtained using Gibbs sampling according to equations (4), (6), and (7);
s43, after the improved CRBM training is completed, in the face of the input of musical instrument categories needing to be predicted, calculating a label y which enables log (p (y | x) to be maximum according to a formula (10), and therefore outputting a predicted musical instrument label according to an automatic musical instrument recognition model based on the improved CRBM to obtain an automatic musical instrument recognition model;
[Formula (10): selection of the label y that maximizes log(p(y|x)); rendered as an image in the original and not reproduced here]
2. the method for automatically identifying musical instruments based on the deep probability map neural network as claimed in claim 1, wherein in step one, S12, an open source tool is used to convert the audio data into mel-frequency spectrum images.
3. The method for automatically identifying musical instruments based on the deep probability map neural network as claimed in claim 1 or 2, wherein in the second step, a neural network ResNet101 pre-trained on ImageNet data set is used to extract the features of the Mel frequency spectrum image.
CN202211391028.3A 2022-11-08 2022-11-08 Automatic musical instrument identification method based on depth probability map neural network Active CN115910099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211391028.3A CN115910099B (en) 2022-11-08 2022-11-08 Automatic musical instrument identification method based on depth probability map neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211391028.3A CN115910099B (en) 2022-11-08 2022-11-08 Automatic musical instrument identification method based on depth probability map neural network

Publications (2)

Publication Number Publication Date
CN115910099A true CN115910099A (en) 2023-04-04
CN115910099B CN115910099B (en) 2023-08-04

Family

ID=86492715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211391028.3A Active CN115910099B (en) 2022-11-08 2022-11-08 Automatic musical instrument identification method based on depth probability map neural network

Country Status (1)

Country Link
CN (1) CN115910099B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140079297A1 (en) * 2012-09-17 2014-03-20 Saied Tadayon Application of Z-Webs and Z-factors to Analytics, Search Engine, Learning, Recognition, Natural Language, and Other Utilities
CN106328121A (en) * 2016-08-30 2017-01-11 南京理工大学 Chinese traditional musical instrument classification method based on depth confidence network
US20180150728A1 (en) * 2016-11-28 2018-05-31 D-Wave Systems Inc. Machine learning systems and methods for training with noisy labels
CN106920544A (en) * 2017-03-17 2017-07-04 深圳市唯特视科技有限公司 A kind of audio recognition method based on deep neural network features training
CN109918535A (en) * 2019-01-18 2019-06-21 华南理工大学 Music automatic marking method based on label depth analysis
CN110909820A (en) * 2019-12-02 2020-03-24 齐鲁工业大学 Image classification method and system based on self-supervision learning

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
GRAHAM W. TAYLOR, GEOFFREY HINTON: "Factored Conditional Restricted Boltzmann Machines for Modeling Motion Style", PROCEEDINGS OF THE 26TH ANNUAL INTERNATIONAL CONFERENCE ON MACHINE LEARNING, vol. 26, no. 06 *
XIN LI: "Conditional Restricted Boltzmann Machines for Multi-label Learning with incomplete Labels", PROCEEDINGS OF MACHINE LEARNING RESEARCH, vol. 38 *
ZHOU CHANG: "Research on Musical Instrument Classification Methods Based on Deep Learning", China Master's Theses Full-text Database, vol. 2019, no. 01 *
ZHANG JIAN: "Research on Deep Neural Network Algorithms Based on RBM", China Doctoral Dissertations Full-text Database (Electronic Journal), Information Science and Technology, vol. 2021, no. 01 *
ZHANG JIAN: "Research on Deep Generative Networks Based on Real-valued RBM", Journal of Software, vol. 32, no. 12 *
WANG FANG: "Research on Recognition and Classification of Music Genres and Traditional Chinese Instruments Based on Deep Learning", China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology, vol. 2017, no. 7 *
HU ZHEN: "Composer Classification Based on Deep Learning", Journal of Computer Research and Development, vol. 2014, no. 51 *

Also Published As

Publication number Publication date
CN115910099B (en) 2023-08-04

Similar Documents

Publication Publication Date Title
CN105023573B (en) It is detected using speech syllable/vowel/phone boundary of auditory attention clue
CN107680582A (en) Acoustic training model method, audio recognition method, device, equipment and medium
Lindgren et al. Speech recognition using reconstructed phase space features
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN112259105A (en) Training method of voiceprint recognition model, storage medium and computer equipment
CN112750442B (en) Crested mill population ecological system monitoring system with wavelet transformation and method thereof
CN111128236B (en) Main musical instrument identification method based on auxiliary classification deep neural network
CN112289326B (en) Noise removal method using bird identification integrated management system with noise removal function
Han et al. Sparse feature learning for instrument identification: Effects of sampling and pooling methods
CN113111786A (en) Underwater target identification method based on small sample training image convolutional network
CN110415730B (en) Music analysis data set construction method and pitch and duration extraction method based on music analysis data set construction method
Mousavi et al. Persian classical music instrument recognition (PCMIR) using a novel Persian music database
Permana et al. Implementation of constant-Q transform (CQT) and mel spectrogram to converting bird’s sound
CN116312484B (en) Cross-language domain invariant acoustic feature extraction method and system
Pikrakis et al. Unsupervised singing voice detection using dictionary learning
CN112052880A (en) Underwater sound target identification method based on weight updating support vector machine
JP4219539B2 (en) Acoustic classification device
CN115910099B (en) Automatic musical instrument identification method based on depth probability map neural network
CN112735442B (en) Wetland ecology monitoring system with audio separation voiceprint recognition function and audio separation method thereof
Pawar et al. Automatic tonic (shruti) identification system for indian classical music
CN112687280B (en) Biodiversity monitoring system with frequency spectrum-time space interface
CN114678039A (en) Singing evaluation method based on deep learning
Kohlsdorf et al. Feature Learning and Automatic Segmentation for Dolphin Communication Analysis.
Guerrero-Turrubiates et al. Guitar chords classification using uncertainty measurements of frequency bins
Dodia et al. Identification of raga by machine learning with chromagram

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant