CN115910099B - Automatic musical instrument identification method based on depth probability map neural network - Google Patents

Automatic musical instrument identification method based on depth probability map neural network

Info

Publication number
CN115910099B
CN115910099B (application CN202211391028.3A)
Authority
CN
China
Prior art keywords
formula
instrument
improved
model
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211391028.3A
Other languages
Chinese (zh)
Other versions
CN115910099A (en)
Inventor
Zhang Jian (张健)
Hou Haiwei (侯海薇)
Du Wei (杜威)
Ding Shifei (丁世飞)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to CN202211391028.3A
Publication of CN115910099A
Application granted
Publication of CN115910099B
Legal status: Active


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The automatic musical instrument identification method based on the depth probability map neural network comprises the steps of: dividing the audio data into time slices, splitting a piece of audio into N fixed-length time slices and recording the labels corresponding to each time slice; converting the audio data of each time slice into a mel frequency spectrum image and then regularizing the image; extracting features from the regularized mel spectrum images with a convolutional neural network, mapping the extracted features from two dimensions to one dimension, and combining them with the labels to form time-slice mel-spectrum image feature-label pairs; constructing an improved conditional restricted Boltzmann machine (CRBM) model and obtaining its conditional probability distribution; training the improved CRBM model using Gibbs sampling; constructing an objective function and training the improved CRBM model with it to obtain the automatic instrument identification model; and outputting the predicted instrument labels. The method effectively addresses the difficulty, in the prior art, of accurately identifying the instruments in polyphonic music.

Description

Automatic musical instrument identification method based on depth probability map neural network
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to an automatic musical instrument recognition method based on a depth probability map neural network.
Background
With the development of artificial intelligence, machine-learning-based intelligent music analysis has gradually become a core technology and research direction in fields such as main-melody recognition and music-style detection, and the automatic recognition of the instruments in polyphonic music is a key step toward intelligent music analysis. In current research and applications, the mainstream approach combines signal processing with machine learning. For instrument recognition in polyphonic music, however, the harmonic nature of musical instruments causes complex signal superposition in both the time domain and the frequency domain, which degrades recognition accuracy; this phenomenon has long been a difficulty to be solved in instrument recognition for polyphonic music.
From a signal-processing perspective, instrument recognition can be viewed as a branch of audio data processing; unlike other audio data, however, vocal and instrumental music have distinctive harmonic energy distributions. Some scholars therefore extract features from polyphonic music from an acoustic (and psychoacoustic) perspective, obtaining feature representations such as playing time, spectral centroid, energy envelope and mel-frequency cepstral coefficients, and then design corresponding traditional machine learning or deep learning methods to carry out the instrument recognition task. Although such methods manually extract acoustic (psychoacoustic) features, they still do not recognize polyphonic music satisfactorily, especially when distinguishing different instruments within the same instrument family. The reason is that the harmonic distributions of instruments differ between instrument families but are similar within the same family, and the harmonic nature of instruments causes complex signals to be superimposed on each other in both the time domain and the frequency domain: the signal at a certain frequency may be the fundamental frequency of one instrument while the fundamental frequencies and harmonics of other instruments are superimposed at the same frequency. Therefore, although the result of instrument identification is independent of the pitch (fundamental frequency) being played, machine-learning-based instrument identification is strongly affected by the fundamental frequency and its harmonics, and it is difficult to distinguish the harmonic distributions of different instruments effectively. With the development of deep learning, some scholars have drawn on the excellent performance of deep neural networks in image processing tasks: the time-domain polyphonic music is expressed in the frequency domain, a spectrogram or mel spectrogram is constructed, and a deep neural network is then used to complete the instrument recognition task through supervised learning. Deep neural networks based on frequency-domain images have brought great progress to instrument recognition, but some problems remain unresolved. First, such algorithms extract the harmonic characteristics of instruments from single-instrument audio and then apply them to multi-instrument polyphonic test data for multi-label classification, so the classification result depends on whether the model has fully learned the characteristics of the instruments on the training set. In addition, since the test data is polyphonic music, the superposition of multiple instruments in the spectrum strongly interferes with the recognition result. Moreover, neural-network-based instrument recognition methods generally analyze the problem from the perspective of spectral image features or of label-specific features alone, converting instrument recognition into image recognition, and rarely consider the similarity between instrument categories as a key feature for distinguishing the harmonic distributions of instruments at the same time, which leads to low recognition accuracy.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an automatic musical instrument identification method based on a depth probability map neural network, which has simple steps and high identification accuracy and effectively solves the problem, in the prior art, that the instruments in polyphonic music are difficult to identify accurately.
In order to achieve the above object, the present invention provides an automatic instrument recognition method based on a depth probability map neural network, comprising the steps of:
step one: preprocessing data;
s11, dividing the audio data into time slices, dividing a section of audio into N time slices with fixed lengths, and simultaneously recording labels corresponding to each time slice;
s12, converting the obtained audio data of each time slice into a Mel frequency spectrum image, then regularizing the image, and regularizing the value range of the pixel points to obtain a regularized Mel frequency spectrum image;
step two: extracting the characteristics of the data;
performing feature extraction on the obtained regularized Mel spectrum image by using a convolutional neural network to obtain image features of Mel spectrum, mapping the extracted features from two dimensions to one dimension, and combining the tags to form a time slice Mel spectrum image feature tag pair;
step three: modeling tag correlation features using an improved CRBM model;
s31, constructing an improved CRBM model according to the energy function proposed by the formula (1);
where x represents the input spectral feature data, y represents the label corresponding to x, which is also the expected output in the prediction stage, h represents the expected feature expression, s is an additional introduced variable, and W, W_y, β, μ, b are training parameters;
s32, obtaining a conditional joint probability distribution according to a formula (1), as shown in a formula (2);
wherein Z represents the partition function;
s33, obtaining formulas (3) and (4) based on the formula (2), and obtaining the conditional probability P (h|x, y) of h according to the formula (3); obtaining the activation probability of each component of h according to formula (4); obtaining a conditional probability formula (5) of y based on x and h according to formulas (2) and (3);
P(h|x,y) = Π_i P(h_i|x,y)   (3);
s34, obtaining formulas (6) and (7) based on formulas (2) and (4), and obtaining the activation probability of each component of S according to formula (6); obtaining the activation probability of each component of y according to formula (7);
wherein N represents a Gaussian distribution;
step four: training an improved CRBM model classified as a target by utilizing the correlation characteristics, and outputting a predicted instrument label;
s41, constructing an objective function according to a formula (8), and training an improved CRBM model by using the objective function;
Loss = log(p(y|x)) + Rank-Loss(y|x) + σ||y||_l1   (8);
where log(p(y|x)) represents the likelihood function, Rank-Loss(y|x) represents the ranking loss function, ||y||_l1 represents l1 regularization, and σ is a hyper-parameter;
s42, based on formulas (3), (4), (6) and (7), using Gibbs sampling to obtain a gradient formula (9) of a likelihood function in a formula (8), calculating the gradient of the formula (8) according to the formula (9), training an improved CRBM model according to the gradient of the formula (8), and obtaining a feature expression h containing label correlation and a conditional probability of the label through training;
where E represents a mathematical expectation, θ represents a parameter set, and two mathematical expectations are obtained using Gibbs sampling according to formulas (4), (6) and (7);
s43, after the improved CRBM training is completed, the label y which maximizes log (p (y|x) is calculated according to a formula (10) in the face of the input of the instrument category to be predicted, and therefore a predicted instrument label is output according to the instrument automatic identification model based on the improved CRBM, and an instrument automatic identification model is obtained;
preferably, in step S12 of the first step, the audio data is converted into a mel-frequency spectrum image using an open source tool.
Further, in step two, the features of the mel spectrum image are extracted using a ResNet101 neural network that has been pre-trained on the ImageNet dataset. In this way, mel-spectrum mapping together with ResNet101 image feature extraction effectively extracts features usable for the instrument recognition task, and the polyphonic-instrument recognition method based on label correlation features further improves recognition accuracy.
In the data-processing part of the method, the music audio is cut into time slices and silent segments are removed, and each time slice is then converted into a mel frequency spectrum. In the model-construction part, a convolutional neural network extracts image features from the mel frequency spectra of the polyphonic music, an improved conditional restricted Boltzmann machine (Conditional Restricted Boltzmann Machine, CRBM) model is used to model the correlations between these image features and the instrument labels corresponding to each time slice's mel frequency spectrum, and the improved CRBM model is finally trained; based on the obtained correlation features, the predicted instrument labels are output. Since a piece of polyphonic music is typically played by several instruments at the same time, instrument recognition in polyphonic music is naturally a multi-label recognition problem. Meanwhile, existing neural-network-based instrument recognition methods do not treat the correlation between instrument categories as a key feature for distinguishing the harmonic distributions of instruments when recognizing polyphonic music. To this end, the invention learns, from the two angles of spectral feature extraction and instrument label-specific features, the correlations between the harmonic features of the instruments in polyphonic music and the different instrument labels, together with the label-specific features (label-specific features aim to extract, from the label's perspective, those features in the data that are directly related to the label). In the improved CRBM model the label-specific features are expressed in the form of the conditional probability P(h|x,y); the variable pair (x,y) is introduced to model the correlation between image features and labels, the activation probability of y models the correlation between labels, and P(y|x) and P(y_i|x,h,s) model the association between harmonic features and the different instrument labels, so that features usable for the instrument recognition task are fully extracted. In this way the method not only draws on the idea of multi-label learning but also models the correlations among the multiple instrument labels in polyphonic music from the two angles of image features and label-specific features, so that the instruments in polyphonic music can be identified from the label correlations and from the correlations between labels and spectral image features: which instruments are likely to have overlapping harmonics can be modeled through the label correlations, and the possible overlaps are associated with the corresponding spectral images through the conditional probabilities between spectral image features and their labels, so that the instruments in polyphonic music can be effectively distinguished and identified. The method overcomes the shortcoming of existing recognition methods, which analyze only spectral image features or only label-specific features, thereby greatly improving recognition accuracy and solving the instrument recognition problem for polyphonic music.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a modified CRBM model constructed in accordance with the present invention;
fig. 3 is a block diagram of an automatic recognition model of musical instruments in the present invention.
Detailed Description
The present invention will be further described below.
As shown in fig. 1 and 3, the present invention provides an automatic musical instrument recognition method based on a depth probability map neural network, comprising the steps of:
step one: preprocessing data;
s11, dividing the audio data into time slices, dividing a section of audio into N time slices with fixed lengths, and simultaneously recording labels corresponding to each time slice;
s12, converting the obtained audio data of each time slice into a Mel frequency spectrum image, then regularizing the image, and regularizing the value range of the pixel points to obtain a regularized Mel frequency spectrum image;
step two: extracting the characteristics of the data;
performing feature extraction on the obtained regularized Mel spectrum image by using a convolutional neural network to obtain image features of Mel spectrum, mapping the extracted features from two dimensions to one dimension, and combining the tags to form a time slice Mel spectrum image feature tag pair;
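The feature-extraction step could look roughly as follows, assuming a torchvision ResNet101 backbone with its classification head removed, a 224x224 resize, and channel replication for the single-channel spectrogram; these implementation details are assumptions, since the patent only specifies an ImageNet-pretrained ResNet101.

import torch
import torch.nn as nn
from torchvision import models
from torchvision.transforms import functional as TF

def extract_feature_label_pairs(mel_images, labels):
    # ImageNet-pretrained ResNet101 with the classifier removed: the globally
    # pooled output is used as the one-dimensional image feature.
    backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
    backbone.fc = nn.Identity()
    backbone.eval()
    pairs = []
    with torch.no_grad():
        for img, label in zip(mel_images, labels):
            x = torch.from_numpy(img).float().unsqueeze(0)      # (1, H, W)
            x = TF.resize(x, [224, 224], antialias=True)        # assumed input size
            x = x.repeat(3, 1, 1).unsqueeze(0)                  # replicate to 3 channels
            feat = backbone(x).squeeze(0)                       # 2-D map -> 1-D vector
            pairs.append((feat, torch.as_tensor(label, dtype=torch.float32)))
    return pairs  # time-slice mel-spectrum image feature-label pairs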
step three: modeling tag correlation features using an improved CRBM model, as shown in fig. 2;
s31, constructing an improved CRBM model according to the energy function proposed by the formula (1);
where x represents the input spectral feature data, y represents the label corresponding to x, which is also the expected output in the prediction stage, h represents the expected feature expression, s is an additional introduced variable, and W, W_y, β, μ, b are training parameters;
s32, obtaining a conditional joint probability distribution according to a formula (1), as shown in a formula (2);
wherein Z represents the partition function;
s33, obtaining formulas (3) and (4) based on the formula (2), and obtaining the conditional probability P (h|x, y) of h according to the formula (3); obtaining the activation probability of each component of h according to formula (4); the conditional probability formula (5) of y based on x and h can be obtained according to the formulas (2) and (3);
P(h|x,y) = Π_i P(h_i|x,y)   (3);
however, the covariance matrix of equation (5) is a non-diagonal matrix, and although equation (5) can represent the correlation between the components of the tag y through the covariance matrix, equation (5) is difficult to directly sample and thus is not suitable for training the improved CRBM model, and the present invention further decomposes the conditional probability of y according to S34 in order to train the model.
S34, obtaining formulas (6) and (7) based on formulas (2) and (4), and obtaining the activation probability of each component of s according to formula (6); obtaining the activation probability of each component of y according to formula (7);
wherein N represents a Gaussian distribution; thus, conditioned on s, the components of y become conditionally independent, and the computation can be carried out through Gibbs sampling.
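The Gibbs sweep implied by S33 and S34 can be sketched structurally as below; since formulas (4), (6) and (7) are only available as equation images, the helpers h_logits, s_mean, s_std and y_logits are hypothetical placeholders standing in for the actual conditional parameters.

import torch

def gibbs_sweep(x, y, h, s, model, n_steps=1):
    # One sweep of the block Gibbs sampler over (h, s, y) with the input x clamped.
    for _ in range(n_steps):
        # Formula (4): activation probability of each component of h given (x, y).
        h = torch.bernoulli(torch.sigmoid(model.h_logits(x, y)))
        # Formula (6): each component of s is Gaussian given (x, y, h).
        s = model.s_mean(x, y, h) + model.s_std * torch.randn_like(s)
        # Formula (7): given (x, h, s) the components of y are conditionally independent.
        y = torch.bernoulli(torch.sigmoid(model.y_logits(x, h, s)))
    return y, h, s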
Step four: training an improved CRBM model classified as a target by utilizing the correlation characteristics, and outputting a predicted instrument label;
s41, constructing an objective function according to a formula (8), and training an improved CRBM model by using the objective function;
Loss = log(p(y|x)) + Rank-Loss(y|x) + σ||y||_l1   (8);
where log(p(y|x)) represents the likelihood function, Rank-Loss(y|x) represents the ranking loss function, ||y||_l1 represents l1 regularization, and σ is a hyper-parameter;
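A sketch of the objective in formula (8) follows; the pairwise hinge form of Rank-Loss and the default value of σ are assumptions, since the patent names the three terms without spelling out the ranking loss.

import torch

def objective(log_p_y_given_x, y_scores, y_true, sigma=0.01):
    # Pairwise hinge ranking loss: every relevant label should be scored above
    # every irrelevant one (assumed form of Rank-Loss(y|x)).
    pos = y_scores[y_true > 0.5]
    neg = y_scores[y_true <= 0.5]
    if pos.numel() > 0 and neg.numel() > 0:
        rank_loss = torch.clamp(1.0 - (pos.unsqueeze(1) - neg.unsqueeze(0)), min=0.0).mean()
    else:
        rank_loss = torch.zeros((), device=y_scores.device)
    l1_term = sigma * y_scores.abs().sum()              # sigma * ||y||_l1
    return log_p_y_given_x + rank_loss + l1_term        # formula (8)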
s42, based on the formulas (3), (4), (6) and (7), using a gradient formula (9) of likelihood function in Gibbs sampling acquisition formula (8),
where E represents a mathematical expectation and θ represents the parameter set; both mathematical expectations can be obtained using Gibbs sampling according to formulas (4), (6) and (7). The gradient of formula (8) can therefore be calculated according to formula (9), and the improved CRBM model can be trained according to the gradient of formula (8); training yields a feature expression h that contains the label correlations, together with the conditional probability of the labels;
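One possible training step following S42 is sketched below, reusing the gibbs_sweep sketch above; the contrastive-divergence-style use of an energy difference and the sample_h_s and energy helpers are assumptions standing in for formulas (1) and (9), which are only available as equation images.

def training_step(model, x, y, optimizer, k=1):
    # Positive phase: clamp (x, y) from a feature-label pair and sample h, s.
    h_pos, s_pos = model.sample_h_s(x, y)
    # Negative phase: k Gibbs sweeps approximate the model expectation in formula (9).
    y_neg, h_neg, s_neg = y.clone(), h_pos.clone(), s_pos.clone()
    for _ in range(k):
        y_neg, h_neg, s_neg = gibbs_sweep(x, y_neg, h_neg, s_neg, model)
    # The difference of the two sampled expectations drives the parameter update.
    loss = model.energy(x, y, h_pos, s_pos) - model.energy(x, y_neg, h_neg, s_neg)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()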
s43, after the improved CRBM training is completed, calculating a label y maximizing log (p (y|x) according to a formula (10) in face of input of a musical instrument class to be predicted;
Calculating the label y that maximizes log(p(y|x)) is accomplished by computing the gradient of formula (10) with respect to y; the improved-CRBM-based model thereby outputs the predicted instrument labels, and the automatic instrument identification model is obtained.
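The prediction step S43 could be realized roughly as follows, performing gradient ascent on log p(y|x) over a relaxed continuous y; the log_p_y_given_x helper, the step size, the iteration count and the 0.5 threshold are all assumptions, since formula (10) is only available as an equation image.

import torch

def predict_labels(model, x, n_labels, steps=50, lr=0.1):
    # Relaxed label vector, updated by gradient ascent on log p(y|x).
    y = torch.full((n_labels,), 0.5, requires_grad=True)
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-model.log_p_y_given_x(x, y)).backward()   # minimize the negative, i.e. ascend
        opt.step()
        with torch.no_grad():
            y.clamp_(0.0, 1.0)
    return (y.detach() > 0.5).int()                 # predicted multi-label instrument tags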
Preferably, in step S12 of the first step, the audio data is converted into a mel-frequency spectrum image using an open source tool.
In order to more fully extract features that can be used for the instrument recognition task, in step two the features of the mel spectrum image are extracted using a ResNet101 neural network pre-trained on the ImageNet dataset. In this way, mel-spectrum mapping together with ResNet101 image feature extraction effectively extracts features usable for the instrument recognition task, and the polyphonic-instrument recognition method based on label correlation features further improves recognition accuracy.
In the data-processing part of the method, the music audio is cut into time slices and silent segments are removed, and each time slice is then converted into a mel frequency spectrum. In the model-construction part, a convolutional neural network extracts image features from the mel frequency spectra of the polyphonic music, an improved conditional restricted Boltzmann machine (Conditional Restricted Boltzmann Machine, CRBM) model is used to model the correlations between these image features and the instrument labels corresponding to each time slice's mel frequency spectrum, and the improved CRBM model is finally trained; based on the obtained correlation features, the predicted instrument labels are output. Since a piece of polyphonic music is typically played by several instruments at the same time, instrument recognition in polyphonic music is naturally a multi-label recognition problem. Meanwhile, existing neural-network-based instrument recognition methods do not treat the correlation between instrument categories as a key feature for distinguishing the harmonic distributions of instruments when recognizing polyphonic music. To this end, the invention learns, from the two angles of spectral feature extraction and instrument label-specific features, the correlations between the harmonic features of the instruments in polyphonic music and the different instrument labels, together with the label-specific features (label-specific features aim to extract, from the label's perspective, those features in the data that are directly related to the label). In the improved CRBM model the label-specific features are expressed in the form of the conditional probability P(h|x,y); the variable pair (x,y) is introduced to model the correlation between image features and labels, the activation probability of y models the correlation between labels, and P(y|x) and P(y_i|x,h,s) model the association between harmonic features and the different instrument labels, so that features usable for the instrument recognition task are fully extracted. In this way the method not only draws on the idea of multi-label learning but also models the correlations among the multiple instrument labels in polyphonic music from the two angles of image features and label-specific features, so that the instruments in polyphonic music can be identified from the label correlations and from the correlations between labels and spectral image features: which instruments are likely to have overlapping harmonics can be modeled through the label correlations, and the possible overlaps are associated with the corresponding spectral images through the conditional probabilities between spectral image features and their labels, so that the instruments in polyphonic music can be effectively distinguished and identified. The method overcomes the shortcoming of existing recognition methods, which analyze only spectral image features or only label-specific features, thereby greatly improving recognition accuracy and solving the instrument recognition problem for polyphonic music.
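Putting the pieces together, a hypothetical end-to-end flow over the sketches above might read as follows; the epoch count and the orchestration details are assumptions, not values from the patent.

def run_pipeline(train_files, train_labels, model, optimizer, test_file, n_labels):
    # Preprocess and featurize the training audio using the sketches above.
    pairs = []
    for path, labels in zip(train_files, train_labels):
        images = slice_and_melspec(path)
        pairs += extract_feature_label_pairs(images, [labels] * len(images))
    # Train the improved CRBM on the time-slice feature-label pairs.
    for epoch in range(10):                      # epoch count is an assumption
        for x, y in pairs:
            training_step(model, x, y, optimizer)
    # Predict instrument labels for each time slice of a test recording.
    test_images = slice_and_melspec(test_file)
    dummy = [[0] * n_labels for _ in test_images]
    test_pairs = extract_feature_label_pairs(test_images, dummy)
    return [predict_labels(model, x, n_labels) for x, _ in test_pairs]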

Claims (3)

1. An automatic musical instrument identification method based on a depth probability map neural network is characterized by comprising the following steps:
step one: preprocessing data;
s11, dividing the audio data into time slices, dividing a section of audio into N time slices with fixed lengths, and simultaneously recording labels corresponding to each time slice;
s12, converting the obtained audio data of each time slice into a Mel frequency spectrum image, then regularizing the image, and regularizing the value range of the pixel points to obtain a regularized Mel frequency spectrum image;
step two: extracting the characteristics of the data;
performing feature extraction on the obtained regularized Mel spectrum image by using a convolutional neural network to obtain image features of Mel spectrum, mapping the extracted features from two dimensions to one dimension, and combining the tags to form a time slice Mel spectrum image feature tag pair;
step three: modeling tag correlation features using an improved CRBM model, the CRBM model being a conditional restricted boltzmann machine;
s31, constructing an improved CRBM model according to the energy function proposed by the formula (1);
(1);
where x represents the input spectral feature data, y represents the label corresponding to x, which is also the expected output in the prediction stage, h represents the expected feature expression, s is an additional introduced variable, and W, W_y, β, μ, b are training parameters;
s32, obtaining a conditional joint probability distribution according to a formula (1), as shown in a formula (2);
(2);
where Z represents the partition function;
s33, obtaining formulas (3) and (4) based on formula (2), and obtaining according to formula (3)hConditional probability of (2)The method comprises the steps of carrying out a first treatment on the surface of the Obtained according to formula (4)hActivation probability of each component of (a); obtained according to formulas (2), (3)yBased onxAndhconditional probability formula (5) of (2);
P(h|x,y) = Π_i P(h_i|x,y)   (3);
(4);
(5);
s34, obtaining formulas (6) and (7) based on formulas (2) and (4), and obtaining according to formula (6)sActivation probability of each component of (a); obtained according to formula (7)yActivation probability of each component of (a);
(6);
(7);
where N represents a Gaussian distribution;
step four: training an improved CRBM model classified as a target by utilizing the correlation characteristics, and outputting a predicted instrument label;
s41, constructing an objective function according to a formula (8), and training an improved CRBM model by using the objective function;
Loss = log(p(y|x)) + Rank-Loss(y|x) + σ||y||_l1   (8);
where log(p(y|x)) represents the likelihood function, Rank-Loss(y|x) represents the ranking loss function, ||y||_l1 represents l1 regularization, and σ is a hyper-parameter;
s42, based on formulas (3), (4), (6) and (7), using Gibbs sampling to obtain a gradient formula (9) of likelihood function in formula (8), calculating the gradient of formula (8) according to formula (9), training an improved CRBM model according to the gradient of formula (8), and training to obtain a feature expression containing label correlationhAnd conditional probability of the tag;
(9);
where E represents a mathematical expectation, θ represents the parameter set, and the two mathematical expectations are obtained using Gibbs sampling according to formulas (4), (6) and (7);
s43, after finishing the improved CRBM training, calculating log according to the formula (10) to obtain the input of the type of musical instrument to be predictedp(y|x) Maximum tagyWhereby a predicted instrument tag is output from the improved CRBM based instrument automatic identification model to obtain an instrument automatic identification model;
(10).
2. The automatic instrument recognition method based on the depth probability map neural network according to claim 1, wherein in step S12, the audio data is converted into mel-frequency spectrum images using an open source tool.
3. The automatic instrument recognition method based on the depth probability map neural network according to claim 1 or 2, wherein in step two, the features of the mel spectrum images are extracted using a ResNet101 neural network pre-trained on the ImageNet dataset.
CN202211391028.3A 2022-11-08 2022-11-08 Automatic musical instrument identification method based on depth probability map neural network Active CN115910099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211391028.3A CN115910099B (en) 2022-11-08 2022-11-08 Automatic musical instrument identification method based on depth probability map neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211391028.3A CN115910099B (en) 2022-11-08 2022-11-08 Automatic musical instrument identification method based on depth probability map neural network

Publications (2)

Publication Number Publication Date
CN115910099A CN115910099A (en) 2023-04-04
CN115910099B true CN115910099B (en) 2023-08-04

Family

ID=86492715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211391028.3A Active CN115910099B (en) 2022-11-08 2022-11-08 Automatic musical instrument identification method based on depth probability map neural network

Country Status (1)

Country Link
CN (1) CN115910099B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106328121A (en) * 2016-08-30 2017-01-11 南京理工大学 Chinese traditional musical instrument classification method based on depth confidence network
CN106920544A (en) * 2017-03-17 2017-07-04 深圳市唯特视科技有限公司 A kind of audio recognition method based on deep neural network features training
CN109918535A (en) * 2019-01-18 2019-06-21 华南理工大学 Music automatic marking method based on label depth analysis
CN110909820A (en) * 2019-12-02 2020-03-24 齐鲁工业大学 Image classification method and system based on self-supervision learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8873813B2 (en) * 2012-09-17 2014-10-28 Z Advanced Computing, Inc. Application of Z-webs and Z-factors to analytics, search engine, learning, recognition, natural language, and other utilities
US11531852B2 (en) * 2016-11-28 2022-12-20 D-Wave Systems Inc. Machine learning systems and methods for training with noisy labels

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106328121A (en) * 2016-08-30 2017-01-11 南京理工大学 Chinese traditional musical instrument classification method based on depth confidence network
CN106920544A (en) * 2017-03-17 2017-07-04 深圳市唯特视科技有限公司 A kind of audio recognition method based on deep neural network features training
CN109918535A (en) * 2019-01-18 2019-06-21 华南理工大学 Music automatic marking method based on label depth analysis
CN110909820A (en) * 2019-12-02 2020-03-24 齐鲁工业大学 Image classification method and system based on self-supervision learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Conditional Restricted Boltzmann Machines for Multi-label Learning with Incomplete Labels; Xin Li; Proceedings of Machine Learning Research; Vol. 38; full text *

Also Published As

Publication number Publication date
CN115910099A (en) 2023-04-04

Similar Documents

Publication Publication Date Title
CN105023573B (en) It is detected using speech syllable/vowel/phone boundary of auditory attention clue
CN107680582A (en) Acoustic training model method, audio recognition method, device, equipment and medium
WO2020248388A1 (en) Method and device for training singing voice synthesis model, computer apparatus, and storage medium
CN111724770B (en) Audio keyword identification method for generating confrontation network based on deep convolution
CN105719661A (en) Automatic discrimination method for playing timbre of string instrument
CN112259105A (en) Training method of voiceprint recognition model, storage medium and computer equipment
CN111128236B (en) Main musical instrument identification method based on auxiliary classification deep neural network
CN112259104A (en) Training device of voiceprint recognition model
CN115565540B (en) Invasive brain-computer interface Chinese pronunciation decoding method
CN112559797A (en) Deep learning-based audio multi-label classification method
Haque et al. High-fidelity audio generation and representation learning with guided adversarial autoencoder
CN116665669A (en) Voice interaction method and system based on artificial intelligence
Diment et al. Semi-supervised learning for musical instrument recognition
Mousavi et al. Persian classical music instrument recognition (PCMIR) using a novel Persian music database
CN111666996A (en) High-precision equipment source identification method based on attention mechanism
Azarloo et al. Automatic musical instrument recognition using K-NN and MLP neural networks
CN115910099B (en) Automatic musical instrument identification method based on depth probability map neural network
CN112052880A (en) Underwater sound target identification method based on weight updating support vector machine
CN114495990A (en) Speech emotion recognition method based on feature fusion
Kohlsdorf et al. Feature Learning and Automatic Segmentation for Dolphin Communication Analysis.
Guerrero-Turrubiates et al. Guitar chords classification using uncertainty measurements of frequency bins
Bhaskar et al. Analysis of language identification performance based on gender and hierarchial grouping approaches
CN111681674A (en) Method and system for identifying musical instrument types based on naive Bayes model
CN112735477A (en) Voice emotion analysis method and device
Dodia et al. Identification of raga by machine learning with chromagram

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant